Journal: IEEE Transactions on Geoscience and Remote Sensing [Institute of Electrical and Electronics Engineers]
Date: 2023-01-01
Volume/Issue: 61: 1-12
Citations: 7
Identifier
DOI:10.1109/tgrs.2023.3312479
Abstract
Visual question answering (VQA) aims to build an interactive system that infers the answer to a text-based question about an input image. Recently, VQA for remote sensing has attracted considerable attention, since it offers a convenient way to monitor global resources and query object attributes. In practice, question-related semantic information, which is often contained in remote sensing images or in the complex questions themselves, is helpful for reasoning and understanding. To capture this valuable information and extend the applications of remote-sensing VQA, we propose an end-to-end multiple-step question-driven VQA (MQVQA) system for remote sensing. In MQVQA, we employ a multiple-step attention mechanism that reasons iteratively and, at each step, marks the image region most related to the question. To understand the semantic information in complex questions, we build a question-driven module that classifies question types and keywords; this classification then guides the combination of image feature maps at different scales. To benchmark the model, we construct a new complex remote sensing VQA dataset (CRSVQA), in which questions take complex forms and cover various remote sensing scenes. Evaluation results on the CRSVQA, RSVQA, and RSIVQA datasets indicate that the proposed MQVQA model surpasses other remote sensing VQA models, and visualization results demonstrate that MQVQA can robustly reason about and understand the content of images and complex questions. Our code and dataset are publicly available at: https://github.com/MeimeiZhang-data/MQVQA.
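Based solely on this abstract, the following is a minimal PyTorch sketch of how a multiple-step, question-driven attention of this kind could be wired together. All class names, dimensions, the number of steps and scales, and the assumption that every scale is pooled to the same number of regions are illustrative guesses, not the authors' released implementation (see the GitHub repository linked above for that).

```python
# Hypothetical sketch of MQVQA-style multiple-step, question-driven attention.
# Names and shapes are assumptions based on the abstract, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionStep(nn.Module):
    """One reasoning step: attend over image regions given the question state."""
    def __init__(self, dim):
        super().__init__()
        self.proj_q = nn.Linear(dim, dim)
        self.proj_v = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, q, v):
        # q: (B, dim) current question state; v: (B, N, dim) region features.
        h = torch.tanh(self.proj_v(v) + self.proj_q(q).unsqueeze(1))
        attn = F.softmax(self.score(h).squeeze(-1), dim=-1)  # (B, N)
        attended = (attn.unsqueeze(-1) * v).sum(dim=1)       # (B, dim)
        # Refine the question state; keep the map to "mark" the attended region.
        return q + attended, attn

class MultiStepQuestionDrivenVQA(nn.Module):
    def __init__(self, dim=512, num_steps=3, num_scales=3,
                 num_qtypes=4, num_answers=100):
        super().__init__()
        self.steps = nn.ModuleList([AttentionStep(dim) for _ in range(num_steps)])
        # Question-driven gate: per-scale mixing weights predicted from the question.
        self.scale_gate = nn.Linear(dim, num_scales)
        self.qtype_head = nn.Linear(dim, num_qtypes)   # question-type classifier
        self.classifier = nn.Linear(dim, num_answers)  # answer classifier

    def forward(self, q_feat, scale_feats):
        # q_feat: (B, dim) encoded question.
        # scale_feats: list of (B, N, dim) feature maps, one per scale,
        # assumed already pooled to the same number of regions N.
        w = F.softmax(self.scale_gate(q_feat), dim=-1)       # (B, num_scales)
        v = sum(w[:, i].view(-1, 1, 1) * f for i, f in enumerate(scale_feats))
        state, maps = q_feat, []
        for step in self.steps:
            state, attn = step(state, v)
            maps.append(attn)
        return self.classifier(state), self.qtype_head(q_feat), maps

# Usage with random stand-in features (batch of 2, 7x7 = 49 regions per scale):
model = MultiStepQuestionDrivenVQA()
q = torch.randn(2, 512)
feats = [torch.randn(2, 49, 512) for _ in range(3)]
answer_logits, qtype_logits, attn_maps = model(q, feats)
```

In this reading, the question-driven gate lets the question decide how much each feature-map scale contributes, while the per-step attention maps are the kind of region markings the abstract's visualization results would display.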