Abstract
Polyp segmentation plays an important role in preventing colorectal cancer. Although Vision Transformers have been widely introduced in medical image segmentation to compensate for the limitations of traditional CNNs in modeling global context, their weakness in learning fine-grained features and their heavy computational cost hinder their application to polyp segmentation, a task made challenging by the varied shapes and sizes of polyps, the low intensity contrast between polyps and surrounding tissue, and an inherent real-time requirement. In this paper, we propose a multi-scale efficient transformer attention (META) mechanism for fast and accurate polyp segmentation, in which efficient transformer blocks generate multi-scale element-wise attention maps for adaptive feature fusion in the well-known U-shaped encoder-decoder architecture. Specifically, the META mechanism has two branches that capture multi-scale long-range dependencies, implemented as two efficient transformer blocks operating at different resolutions. The local branch captures transformer attention of relatively small size at a relatively low resolution, while the global branch captures high-resolution transformer attention. The final polyp segmentation results are progressively integrated by the META mechanism at each layer of the decoder. Extensive experiments on four polyp segmentation datasets (CVC-ClinicDB, Endoscenestill, Kvasir-SEG and ETIS-Larib) demonstrate its advantages: it consistently outperforms competing methods. With ResNet34 as the backbone, it achieves 85.78% IoU and 92.03% Dice on CVC-ClinicDB, 88.99% IoU and 93.85% Dice on Endoscenestill, and 86.42% IoU and 91.86% Dice on Kvasir-SEG, at a speed of 98 FPS for an input size of $3 \times 512 \times 512$ on an NVIDIA GeForce RTX 3090 card. The code is available at https://github.com/szuzzb/META-Unet.
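For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how IoU and Dice are commonly computed for a pair of binary masks. It is not taken from the META-Unet repository: the function name `dice_and_iou`, the nested-list mask format, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration, and the evaluation protocol actually used (thresholding, per-image versus per-dataset averaging) is not stated in this abstract.

```python
def dice_and_iou(pred, target):
    """Return (dice, iou) for two binary masks given as nested lists of 0/1.

    Hypothetical helper for illustration only; not the authors' evaluation code.
    """
    # Flatten both masks and coerce every pixel to 0 or 1.
    flat_pred = [int(bool(v)) for row in pred for v in row]
    flat_target = [int(bool(v)) for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")

    intersection = sum(p & t for p, t in zip(flat_pred, flat_target))
    pred_area = sum(flat_pred)
    target_area = sum(flat_target)
    union = pred_area + target_area - intersection

    # Assumed convention: two empty masks count as perfect agreement.
    if union == 0:
        return 1.0, 1.0

    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy 3x3 example: the prediction misses one of four foreground pixels.
pred = [[0, 1, 1],
        [0, 1, 0],
        [0, 0, 0]]
target = [[0, 1, 1],
          [0, 1, 1],
          [0, 0, 0]]
dice, iou = dice_and_iou(pred, target)
print(f"Dice = {dice:.4f}, IoU = {iou:.4f}")
```

For a single image the two metrics are related by Dice = 2 IoU / (1 + IoU). Scores averaged over a dataset, such as those reported above, need not satisfy this identity exactly, because the mean of per-image Dice values is generally not the same function of the mean IoU.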
Note to Practitioners: Automatic polyp segmentation is a crucial step in polyp recognition and colonoscopy-based diagnosis, which usually requires both high accuracy and real-time performance. This article proposes a novel polyp segmentation method, META-Unet, which models multi-scale attention maps effectively and efficiently using a novel multi-scale efficient transformer attention (META) mechanism, for faster and more accurate polyp segmentation. We evaluate META-Unet on four public polyp image segmentation datasets (CVC-ClinicDB, Endoscenestill, Kvasir-SEG and ETIS-Larib). Comprehensive experimental results validate its strong performance, with a better balance between accuracy and inference speed. The META mechanism could potentially be embedded in various deep learning frameworks and facilitate more computer-aided applications in clinical practice.