TypeFormer: Multiscale Transformer With Type Controller for Remote Sensing Image Caption
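Closed captioning
Computer science
Transformer
Sentence
Artificial intelligence
Computer vision
Image (mathematics)
Engineering
Electrical engineering
Voltage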
Authors
Zihang Chen,Junjue Wang,Ailong Ma,Yanfei Zhong
Source
Journal: IEEE Geoscience and Remote Sensing Letters [Institute of Electrical and Electronics Engineers] Date: 2022-01-01 Volume: 19: 1-5 Citations: 7
Identifier
DOI:10.1109/lgrs.2022.3192062
Abstract
Image captioning in remote sensing can help us understand the inner attributes of objects and the outer relations between different objects. However, existing image captioning algorithms lack the ability of global representation and cannot capture object relations over long distances. In addition, these algorithms generate captions randomly, without considering specific demands. To this end, we propose a pure transformer architecture with a caption type controller for remote sensing image captioning. Specifically, a multi-scale vision transformer is adopted for the image representation, where the global and detailed content can be captured with multi-head self-attention layers. A transformer decoder is then introduced to successively translate the image features into comprehensive sentences. An optional block called the caption type controller is designed to account for the type of caption required: it uses caption type matrix sets to embed the learnable sentence feature with the required type. Comparison and ablation experiments conducted on the Remote Sensing Image Captioning Dataset (RSICD) demonstrate that the proposed framework outperforms the current state-of-the-art image captioning methods. Experiments conducted on the FloodNet caption dataset further illustrate that the proposed method can effectively generate specific types of captions.
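The abstract does not give implementation details, so the following is only a rough NumPy sketch of the general idea, not the authors' method. It assumes that the caption type controller can be approximated by adding a learnable per-type vector to the decoder's token embeddings, and that multi-scale image features are simply concatenated token sets attended to by single-head cross-attention. All names (`condition_on_type`, `cross_attention`, `type_matrices`) and all sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def condition_on_type(token_emb, type_matrices, type_id=None):
    # Assumed stand-in for the caption type controller: add the vector of the
    # requested caption type to every token embedding (broadcast over tokens).
    # type_id=None skips the step, mirroring the block being optional.
    if type_id is None:
        return token_emb
    return token_emb + type_matrices[type_id]


def cross_attention(queries, image_feats):
    # Single-head scaled dot-product attention from caption tokens to image
    # tokens (no learned projections, to keep the sketch short).
    d = queries.shape[-1]
    scores = queries @ image_feats.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ image_feats, weights


rng = np.random.default_rng(0)
d_model, num_types, seq_len = 8, 3, 4  # hypothetical sizes

# One vector per caption type; random here, learnable in a real model.
type_matrices = rng.normal(size=(num_types, d_model))
token_emb = rng.normal(size=(seq_len, d_model))

# Assumed multi-scale representation: 16 coarse tokens plus 64 fine tokens.
image_feats = np.concatenate(
    [rng.normal(size=(16, d_model)), rng.normal(size=(64, d_model))]
)

conditioned = condition_on_type(token_emb, type_matrices, type_id=1)
context, weights = cross_attention(conditioned, image_feats)
print(context.shape, weights.shape)
```

In this sketch, changing `type_id` shifts every decoder query by a different vector, which changes which image tokens receive attention. That is one plausible way a requested caption type could steer generation; the paper's actual mechanism may differ.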