Keywords
Computer science, Segmentation, Artificial intelligence, Transformer, Pattern recognition (psychology), Locality, Pixel, Computer vision, Voltage, Linguistics, Quantum mechanics, Physics, Philosophy
Authors
Ziheng Wang,Xiongkuo Min,Fangyu Shi,Ruinian Jin,Saida S. Nawrin,Ichen Yu,Ryoichi Nagatomi
Identifier
DOI:10.1007/978-3-031-16443-9_50
Abstract
Vision transformers (ViTs) have recently become the favored paradigm in medical image segmentation, surpassing their traditional CNN counterparts on quantitative metrics. The key advantage of ViTs is their use of attention layers to model global relations between tokens. However, this increased representation capacity comes with corresponding shortcomings: ViTs lack CNNs' inductive biases, namely locality, translation invariance, and a hierarchical structure of visual information. Consequently, well-trained ViTs require more data than CNNs, while high-quality data in medical imaging is always limited. We therefore propose SMESwin UNet. First, building on the Channel-wise Cross fusion Transformer (CCT), we fuse multi-scale semantic features and attention maps through a compound CNN-ViT structure (named MCCT). Second, we introduce superpixels, dividing pixel-level features into region-level ones to avoid interference from meaningless parts of the image. Finally, we use External Attention to capture correlations among all data samples, which may further mitigate the limitations of small datasets. In our experiments, the proposed superpixel- and MCCT-based Swin UNet (SMESwin UNet) outperforms CNNs and other Transformer-based architectures on three medical image segmentation datasets (nuclei, cells, and glands).
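The abstract names two generic mechanisms that can be illustrated independently of the paper's exact architecture. First, the superpixel step: pooling pixel-level features into region-level descriptors so that later stages operate on coherent regions rather than raw pixels. The sketch below uses scikit-image's SLIC; the function name `superpixel_pool`, the segment count, and mean pooling are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_pool(image, features, n_segments=200):
    """Hypothetical sketch: average pixel-level features within each
    SLIC superpixel, yielding region-level descriptors.
    image:    (H, W, 3) RGB array used to compute superpixels
    features: (H, W, C) pixel-level feature map to pool
    """
    labels = slic(image, n_segments=n_segments, start_label=0)  # (H, W)
    pooled = np.stack([features[labels == s].mean(axis=0)
                       for s in range(labels.max() + 1)])       # (n_regions, C)
    # Broadcast each region descriptor back onto its pixels.
    return pooled, pooled[labels]
```

Second, External Attention (Guo et al., 2021) replaces self-attention's sample-specific keys and values with two small learnable memories shared across the whole dataset, which is how it can capture correlations among all training samples. A minimal PyTorch sketch of that published mechanism, not necessarily the exact SMESwin UNet layer, might look like:

```python
import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    def __init__(self, dim, mem_size=64):
        super().__init__()
        # Dataset-shared memories standing in for per-sample keys/values.
        self.mk = nn.Linear(dim, mem_size, bias=False)  # key memory M_k
        self.mv = nn.Linear(mem_size, dim, bias=False)  # value memory M_v

    def forward(self, x):
        # x: (batch, tokens, dim)
        attn = self.mk(x)                                # (B, N, S)
        attn = torch.softmax(attn, dim=1)                # normalize over tokens
        attn = attn / (attn.sum(dim=2, keepdim=True) + 1e-9)  # double normalization
        return self.mv(attn)                             # (B, N, dim)

# e.g. ExternalAttention(dim=96)(torch.randn(2, 196, 96)) -> (2, 196, 96)
```

Because the memories are fixed in size, this layer scales linearly with the number of tokens, making it cheap compared with quadratic self-attention.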