Chenyang Liu, Rui Zhao, Hao Chen, Zhengxia Zou, Zhenwei Shi
Source
Journal: IEEE Transactions on Geoscience and Remote Sensing [Institute of Electrical and Electronics Engineers] · Date: 2022-01-01 · Volume: 60, pp. 1-20 · Citations: 61
Identifier
DOI: 10.1109/tgrs.2022.3218921
Abstract
Analyzing land cover changes with multi-temporal remote sensing (RS) images is crucial for environmental protection and land planning. In this paper, we explore Remote Sensing Image Change Captioning (RSICC), a new task that aims to generate human-like language descriptions of the land cover changes in multi-temporal RS images. We propose a novel Transformer-based RSICC model (RSICCformer). It consists of three main components: 1) a CNN-based feature extractor that generates high-level features of RS image pairs, 2) a dual-branch Transformer encoder that improves the feature discrimination capacity for changes, and 3) a caption decoder that generates sentences describing the differences. The dual-branch Transformer encoder consists of a hierarchy of processing stages to capture and recognize multiple changes of interest. Concretely, in the dual-branch Transformer encoder, we use the bi-temporal feature differences as keys to enhance the image features (queries) from each temporal image. To explore the RSICC task, we build a large-scale dataset named LEVIR-CC, which contains 10,077 pairs of bi-temporal RS images and 50,385 sentences describing the differences between images. We benchmark existing state-of-the-art change captioning methods (developed for synthetic images) on the LEVIR-CC dataset, and our RSICCformer outperforms them by a significant margin (+4.98% on BLEU-4 and +9.86% on CIDEr-D). The attention visualization results also suggest that our model focuses on changes of interest and ignores irrelevant changes.
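The core mechanism the abstract describes — using bi-temporal feature differences as attention keys to enhance the per-image features (queries) — can be sketched as a cross-attention layer. The following is a minimal illustrative sketch, not the authors' implementation: the module name `DifferenceGuidedAttention`, the use of the difference as values, and the residual/normalization layout are all assumptions made for clarity.

```python
import torch
import torch.nn as nn

class DifferenceGuidedAttention(nn.Module):
    """Illustrative sketch of the attention described in the abstract:
    tokens from one temporal image act as queries, while the bi-temporal
    feature difference supplies the keys (and, as an assumption here,
    the values). This is not the published RSICCformer code."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat_t: torch.Tensor, diff: torch.Tensor) -> torch.Tensor:
        # feat_t: (B, N, C) tokens from one temporal image (queries)
        # diff:   (B, N, C) bi-temporal feature difference (keys/values)
        enhanced, _ = self.attn(query=feat_t, key=diff, value=diff)
        return self.norm(feat_t + enhanced)  # residual connection (assumed)

# Toy usage: two temporal feature maps flattened into token sequences.
B, N, C = 2, 196, 256
f1, f2 = torch.randn(B, N, C), torch.randn(B, N, C)
diff = f2 - f1                      # bi-temporal feature difference
branch = DifferenceGuidedAttention(C)
f1_enhanced = branch(f1, diff)      # one branch of the dual-branch encoder
f2_enhanced = branch(f2, diff)      # (weights shared here only for brevity)
print(f1_enhanced.shape)            # torch.Size([2, 196, 256])
```

In this reading, attending from each image's features to the difference map biases each branch toward changed regions, which is consistent with the attention visualizations the abstract reports; in the actual model each branch would sit inside the hierarchy of processing stages described above.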