Journal: IEEE Geoscience and Remote Sensing Letters [Institute of Electrical and Electronics Engineers] · Date: 2023-11-23 · Volume/Issue: 21: 1-5 · Citations: 6
Identifiers
DOI:10.1109/lgrs.2023.3336061
Abstract
Remote-sensing image semantic segmentation is usually based on convolutional neural networks (CNNs). CNNs demonstrate powerful local feature extraction capabilities through stacked convolution and pooling. However, the locality of the convolution operation limits the ability of CNNs to directly extract global information. Relying on the multihead self-attention (MHSA) mechanism, the transformer shows great advantages in modeling global information. In this letter, we propose a CNN-transformer fusion network (CTFNet) for remote-sensing image semantic segmentation. CTFNet adopts a U-shaped encoder-decoder structure to achieve the extraction and adaptive fusion of local features and global context information. Specifically, a lightweight W/P transformer block is proposed as the decoder to obtain global context information with low complexity; it is connected to the encoder through skip connections. Finally, the channel and spatial attention fusion module (AFM) is exploited to adaptively fuse deep semantic features and shallow detail features. On the Vaihingen and Potsdam datasets of the International Society for Photogrammetry and Remote Sensing (ISPRS), the effectiveness of each module is demonstrated by ablation experiments. Compared with several classical networks, the proposed CTFNet achieves superior performance.
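To illustrate the adaptive fusion step described in the abstract, the sketch below shows one plausible way to combine deep semantic features and shallow detail features with channel and spatial attention. This is a minimal, hypothetical PyTorch sketch: the class name `AttentionFusionSketch`, the SE-style channel gate, and the CBAM-style spatial gate are assumptions made for illustration, not the paper's actual AFM design.

```python
import torch
import torch.nn as nn

class AttentionFusionSketch(nn.Module):
    """Hypothetical channel + spatial attention fusion of deep and shallow features.

    Illustrative only; the actual AFM in CTFNet may differ. Assumes both inputs
    already share the same channel count and spatial resolution.
    """
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel attention: squeeze spatial dims, then reweight channels (SE-style).
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: 7x7 conv over pooled channel statistics (CBAM-style).
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, deep: torch.Tensor, shallow: torch.Tensor) -> torch.Tensor:
        fused = deep + shallow                       # simple additive fusion as a starting point
        fused = fused * self.channel_gate(fused)     # reweight channels
        avg_map = fused.mean(dim=1, keepdim=True)    # per-pixel channel average
        max_map, _ = fused.max(dim=1, keepdim=True)  # per-pixel channel maximum
        attn = self.spatial_gate(torch.cat([avg_map, max_map], dim=1))
        return fused * attn                          # reweight spatial locations


if __name__ == "__main__":
    afm = AttentionFusionSketch(channels=64)
    deep = torch.randn(1, 64, 32, 32)     # upsampled deep semantic features
    shallow = torch.randn(1, 64, 32, 32)  # shallow detail features from a skip connection
    print(afm(deep, shallow).shape)       # torch.Size([1, 64, 32, 32])
```

The gating order here (channel attention first, then spatial attention) mirrors common practice in attention-based fusion modules; the paper may weight or combine the two branches differently.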