U-Net and its variants have achieved impressive results in medical image segmentation. However, the downsampling operations in such U-shaped networks cause the feature maps to lose a certain degree of spatial information, and most existing methods apply convolution and transformer modules sequentially, which makes it difficult to extract a comprehensive feature representation of the image. In this paper, we propose a novel U-shaped segmentation network named Multi-scale Axial Attention Network (MSAANet) to address these problems. Specifically, we propose a cross-scale interactive attention mechanism, multi-scale axial attention (MSAA), which applies direction-aware attention across interacting scales so that the downsampled deep features and the shallow features maintain spatial and contextual consistency. In addition, we propose a Convolution-Transformer (CT) block, in which the transformer and convolution branches complement each other to enhance the comprehensive feature representation. We evaluate the proposed method on the public Synapse and ACDC datasets, and the experimental results demonstrate that MSAANet effectively improves segmentation accuracy.
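To make the "direction-aware attention" idea concrete, the sketch below shows generic axial attention applied separately along the height and width axes of a feature map. This is only an illustrative assumption based on standard axial attention; it does not reproduce the paper's MSAA module (in particular, the cross-scale interaction between deep and shallow features is omitted), and the class and parameter names are hypothetical.

```python
# Minimal sketch of axial attention along the width and height axes,
# built on torch.nn.MultiheadAttention. Illustrative only; not the
# paper's MSAA module, and all names here are assumptions.
import torch
import torch.nn as nn


class AxialAttention2d(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map
        b, c, h, w = x.shape

        # Attention along the width axis: each row is an independent sequence.
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)   # (B*H, W, C)
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(b, h, w, c).permute(0, 3, 1, 2)

        # Attention along the height axis: each column is an independent sequence.
        cols = x.permute(0, 3, 2, 1).reshape(b * w, h, c)   # (B*W, H, C)
        cols, _ = self.col_attn(cols, cols, cols)
        x = cols.reshape(b, w, h, c).permute(0, 3, 2, 1)
        return x


if __name__ == "__main__":
    feat = torch.randn(2, 64, 32, 32)
    print(AxialAttention2d(dim=64)(feat).shape)  # torch.Size([2, 64, 32, 32])
```

Factorizing attention along the two axes keeps the directional context of rows and columns while reducing the cost from quadratic in H*W to roughly linear in H + W per position, which is why axial attention is attractive at the higher-resolution stages of a U-shaped network.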