Topics: closed captioning; computer science; feature (linguistics); modal verb; artificial intelligence; image (mathematics); remote sensing; feature extraction; pattern recognition (psychology); computer vision; geology; philosophy; linguistics; chemistry; polymer chemistry
Authors
Zhigang Yang,Qiang Li,Yuan Yuan,Qi Wang
Source
Journal: IEEE Transactions on Geoscience and Remote Sensing [Institute of Electrical and Electronics Engineers]
Date: 2024-01-01
Volume/Pages: 62: 1-11
Cited by: 2
Identifier
DOI:10.1109/tgrs.2024.3401576
Abstract
Remote sensing image captioning aims to describe the crucial objects in remote sensing images in natural language. Two main factors limit caption quality: inefficient use of object texture and semantic features in images, and ineffective cross-modal alignment between image and text features. To address these issues, this paper presents HCNet, a remote sensing image captioning network built on hierarchical feature aggregation and cross-modal feature alignment. Specifically, a hierarchical feature aggregation module produces a comprehensive representation of visual features, which benefits accurate description generation. To account for the disparities between modalities, a cross-modal feature interaction module in the decoder facilitates feature alignment and fully exploits cross-modal features to localize critical objects. In addition, a cross-modal feature alignment loss explicitly aligns image and text features. Extensive experiments show that HCNet achieves satisfactory performance; in particular, it improves the CIDEr score by 14.15% on the NWPU dataset compared to existing approaches. The source code is publicly available at https://github.com/CVer-Yang/HCNet.
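The abstract mentions a cross-modal feature alignment loss between image and text features. The paper's exact formulation is not given here, but a common way to realize such an alignment is a symmetric contrastive (InfoNCE-style) loss over paired image and text embeddings; the sketch below is a minimal NumPy illustration under that assumption, with `cross_modal_align_loss` and its `temperature` parameter being hypothetical names, not HCNet's actual API.

```python
import numpy as np

def cross_modal_align_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE-style alignment loss between paired image and
    text embeddings (a common formulation; HCNet's exact loss may differ).

    img_feats, txt_feats: arrays of shape (B, D), row i of each is a pair.
    """
    # L2-normalize so the dot products below are cosine similarities
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (B, B): matched pairs on diagonal
    labels = np.arange(len(img))             # target index for each row

    def ce(l):
        # row-wise softmax cross-entropy against the diagonal targets
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image->text and text->image directions
    return 0.5 * (ce(logits) + ce(logits.T))
```

When the paired embeddings coincide (perfect alignment), the diagonal dominates each softmax and the loss approaches zero; mismatched pairs push the loss up, which is what drives the two modalities toward a shared space.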