Journal: IEEE Transactions on Geoscience and Remote Sensing [Institute of Electrical and Electronics Engineers] | Date: 2023-01-01 | Volume 61, pp. 1-11 | Citations: 5
Identifier
DOI: 10.1109/tgrs.2023.3329152
Abstract
Remote sensing semantic segmentation plays a significant role in applications such as environmental monitoring, land use planning, and disaster response. CNNs have long dominated remote sensing semantic segmentation; however, because convolution operations are inherently local, CNNs cannot effectively model global context. The success of Transformers in the NLP domain offers a new approach to global context modeling. Inspired by the Swin Transformer, we propose a novel remote sensing semantic segmentation model called CSTUNet. The model employs a dual-encoder structure consisting of a CNN-based main encoder and a Swin Transformer-based auxiliary encoder. We first utilize a detail-structure preservation module (DPM) to mitigate the loss of detail and structural information caused by Swin Transformer downsampling. We then introduce a spatial feature enhancement module (SFE) to collect contextual information across different spatial dimensions. Finally, we construct a position-aware attention fusion module (PAFM) to fuse contextual and local information. The proposed model achieves 70.75% mIoU on the ISPRS-Vaihingen dataset and 77.27% mIoU on the ISPRS-Potsdam dataset.
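To make the dual-encoder idea in the abstract concrete, below is a minimal structural sketch in PyTorch. It is not the paper's implementation: the block widths, the strided-convolution stand-in for the Swin Transformer auxiliary encoder, and the simple channel-attention gate used in place of DPM/SFE/PAFM are all illustrative assumptions; only the overall layout (a CNN main branch, a downsampling global-context branch, and an attention-based fusion before the segmentation head) follows the abstract.

```python
# Structural sketch of a dual-encoder segmentation network, assuming a simple
# CNN main branch and a placeholder auxiliary branch (the paper uses a Swin
# Transformer there). Module internals are illustrative, not the paper's code.
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Two 3x3 convolutions with BN + ReLU (assumed building block)."""
    def __init__(self, cin, cout):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
            nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)


class DualEncoderSegNet(nn.Module):
    """CNN main encoder + auxiliary global-context encoder, fused by a
    channel-attention gate (a stand-in for the paper's PAFM)."""
    def __init__(self, in_ch=3, num_classes=6, width=64):
        super().__init__()
        # Main encoder: local detail features at 1/4 resolution.
        self.main = nn.Sequential(
            ConvBlock(in_ch, width), nn.MaxPool2d(2),
            ConvBlock(width, width * 2), nn.MaxPool2d(2),
        )
        # Auxiliary encoder: patch-embedding-like downsampling as a placeholder
        # for the Swin Transformer branch.
        self.aux = nn.Sequential(
            nn.Conv2d(in_ch, width * 2, kernel_size=4, stride=4),
            nn.GELU(),
            ConvBlock(width * 2, width * 2),
        )
        # Fusion: channel-attention weights over the concatenated branches.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(width * 4, width * 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(width * 2, width * 4, 1), nn.Sigmoid(),
        )
        self.head = nn.Sequential(
            ConvBlock(width * 4, width * 2),
            nn.Conv2d(width * 2, num_classes, 1),
        )

    def forward(self, x):
        local_feat = self.main(x)         # B x 2w x H/4 x W/4, local detail
        global_feat = self.aux(x)         # B x 2w x H/4 x W/4, context branch
        fused = torch.cat([local_feat, global_feat], dim=1)
        fused = fused * self.gate(fused)  # attention-weighted fusion
        logits = self.head(fused)
        # Upsample back to the input resolution for per-pixel class scores.
        return nn.functional.interpolate(
            logits, size=x.shape[-2:], mode="bilinear", align_corners=False
        )


if __name__ == "__main__":
    # The ISPRS Vaihingen/Potsdam benchmarks are commonly evaluated on 6 classes.
    model = DualEncoderSegNet(num_classes=6)
    out = model(torch.randn(1, 3, 256, 256))
    print(out.shape)  # torch.Size([1, 6, 256, 256])
```

The point of the sketch is the data flow: both branches see the full image, the CNN branch keeps fine spatial detail while the auxiliary branch supplies coarser context, and the two are combined by an attention mechanism before the per-pixel classification head.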