Vision-Enhanced and Consensus-Aware Transformer for Image Captioning

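Computer science; Closed captioning; Artificial intelligence; Encoder; Feature learning; Natural language processing; Commonsense; Convolutional neural network; Graph; Computer vision; Knowledge representation and reasoning; Image (mathematics); Theoretical computer science; Operating system

Authors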

Shan Cao, Gaoyun An, Zhenxing Zheng, Zhiyong Wang

Source

Journal: IEEE Transactions on Circuits and Systems for Video Technology [Institute of Electrical and Electronics Engineers]
Date: 2022-05-30 Volume (issue): 32 (10): 7005-7018 Citations: 25

Identifier

DOI: 10.1109/tcsvt.2022.3178844

Abstract

Image captioning generates descriptions in a natural language for a given image. Due to its great potential for a wide range of applications, many deep learning-based methods have been proposed. The co-occurrence of words, such as mouse and keyboard, constitutes commonsense knowledge, which is referred to as consensus. However, it is challenging to consider commonsense knowledge in producing captions that have rich, natural, and meaningful semantics. In this paper, a Vision-enhanced and Consensus-aware Transformer (VCT) is proposed to exploit both visual information and consensus knowledge for image captioning with three key components: a vision-enhanced encoder, a consensus-aware knowledge representation generator, and a consensus-aware decoder. The vision-enhanced encoder extends the vanilla self-attention module with a memory-based attention module and a visual perception module for learning a better visual representation of an image. Specifically, the relationships between regions in an image and the image's global context are leveraged with scene memory in the memory-based attention module. The visual perception module further enhances the correlation among neighboring tokens in both the spatial and channel-wise dimensions. To learn consensus-aware representations, a word correlation graph is constructed by computing the statistical co-occurrence between semantic concepts. Consensus knowledge is then acquired using a graph convolutional network in the consensus-aware knowledge representation generator. Finally, this consensus knowledge is integrated into the consensus-aware decoder through consensus memory and a knowledge-based control module to produce a caption. Experimental results on two popular benchmark datasets (MSCOCO and Flickr30k) demonstrate that the proposed model achieves state-of-the-art performance. Extensive ablation studies also validate the effectiveness of each component.
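
The word correlation graph and the graph convolution step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which co-occurrence statistic or which normalization VCT uses, so the conditional-probability edge weights and the row-normalized propagation rule below are assumptions, and the names cooccurrence_graph, gcn_layer, and matmul are hypothetical helpers made up for this example.

```python
from itertools import combinations


def cooccurrence_graph(captions, vocab):
    """Build a word graph where adj[i][j] estimates P(word_j | word_i).

    Assumption: edge weight = (captions containing both words) /
    (captions containing word_i). The paper may use a different statistic.
    """
    index = {word: i for i, word in enumerate(vocab)}
    n = len(vocab)
    count = [0] * n
    pair = [[0] * n for _ in range(n)]
    for caption in captions:
        # Each word counts at most once per caption.
        present = sorted({index[w] for w in caption.lower().split() if w in index})
        for i in present:
            count[i] += 1
        for i, j in combinations(present, 2):
            pair[i][j] += 1
            pair[j][i] += 1
    adj = [[pair[i][j] / count[i] if count[i] else 0.0 for j in range(n)]
           for i in range(n)]
    return adj, index


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1 (A + I) X W).

    Assumption: self-loops plus row normalization, a common simple variant.
    """
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # Every row sum is at least 1 because of the self-loop.
    norm = [[v / sum(row) for v in row] for row in a_hat]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


# Toy data, invented for illustration only.
captions = [
    "a mouse next to a keyboard",
    "a keyboard and a mouse on a desk",
    "a cat on a desk",
]
vocab = ["mouse", "keyboard", "desk", "cat"]
adj, index = cooccurrence_graph(captions, vocab)

# One-hot node features and an identity weight, so the layer output is just
# the normalized neighborhood mixture for each word.
identity = [[1.0 if i == j else 0.0 for j in range(len(vocab))]
            for i in range(len(vocab))]
consensus = gcn_layer(adj, identity, identity)
```

In this toy corpus "mouse" and "keyboard" appear together in every caption that contains either, so the edge between them gets the highest weight, and after one propagation step each word's representation mixes in its frequent neighbors. In a real model the features would be word embeddings and the weight matrix would be learned.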
