Closed captioning
Computer science
Channel (broadcasting)
Artificial intelligence
Representation (politics)
Encoder
Block (permutation group theory)
Image (mathematics)
Focus (optics)
Object (grammar)
Dual (grammatical number)
Computer vision
Natural language processing
Pattern recognition (psychology)
Literature
Law
Art
Geometry
Physics
Optics
Operating system
Politics
Mathematics
Computer network
Political science
Authors
Boyang Wan, Wenhui Jiang, Yuming Fang, Wenying Wen, Hantao Liu
Identifier
DOI:10.1109/vcip56404.2022.10008904
Abstract
Self-attention based encoder-decoder models achieve dominant performance in image captioning. However, most existing image captioning models (ICMs) focus only on modeling the relations between spatial tokens, while channel-wise attention is neglected when building the visual representation. Since different channels of a visual representation typically correspond to different visual objects, this neglect can degrade the object and attribute words in the captions generated by ICMs. In this paper, we propose a novel dual-stream self-attention module (DSM) to alleviate this issue. Specifically, we propose a parallel self-attention based module that simultaneously encodes visual information along the spatial and channel dimensions. In addition, to obtain channel-wise visual features effectively and efficiently, we introduce a group self-attention block with linear computational complexity. To validate the effectiveness of our model, we conduct extensive experiments on the standard image captioning (IC) benchmarks MSCOCO and Flickr30k. Without bells and whistles, the proposed model sets new state-of-the-art results, reaching a CIDEr score of 135.4 on MSCOCO and 70.8 on Flickr30k.
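The abstract does not give implementation details, but the idea it describes can be sketched concretely. Below is a minimal, hypothetical PyTorch reconstruction of the two streams: an ordinary spatial self-attention over tokens, and a channel-wise self-attention computed within channel groups, whose cost is linear in the number of spatial positions N (a (d/g) x (d/g) attention map per group instead of an N x N map). All module names, the group count, and the residual-sum fusion are assumptions for illustration, not the authors' actual DSM.

```python
import torch
import torch.nn as nn

class ChannelGroupSelfAttention(nn.Module):
    """Self-attention over channels within groups (hypothetical sketch).

    Attention is taken along the channel axis, so the cost is O(g * (d/g)^2 * N),
    linear in the number of spatial tokens N, unlike the O(N^2 * d) spatial stream.
    """
    def __init__(self, dim: int, groups: int = 4):
        super().__init__()
        assert dim % groups == 0, "dim must be divisible by groups"
        self.groups = groups
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, d) visual tokens
        B, N, d = x.shape
        g, c = self.groups, d // self.groups
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (B, g, c, N): channels become the attention axis.
        q = q.transpose(1, 2).reshape(B, g, c, N)
        k = k.transpose(1, 2).reshape(B, g, c, N)
        v = v.transpose(1, 2).reshape(B, g, c, N)
        attn = (q @ k.transpose(-2, -1)) / N ** 0.5  # (B, g, c, c) per-group map
        attn = attn.softmax(dim=-1)
        out = attn @ v                               # (B, g, c, N)
        out = out.reshape(B, d, N).transpose(1, 2)   # back to (B, N, d)
        return self.proj(out)

class DualStreamSelfAttention(nn.Module):
    """Parallel spatial + channel-wise streams, fused by a residual sum
    (one simple fusion choice; the paper may fuse differently)."""
    def __init__(self, dim: int, heads: int = 8, groups: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.channel = ChannelGroupSelfAttention(dim, groups)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s, _ = self.spatial(x, x, x)  # token-to-token relations
        c = self.channel(x)           # channel-to-channel relations
        return x + s + c
```

A quick smoke test: `DualStreamSelfAttention(512)(torch.randn(2, 49, 512))` returns a (2, 49, 512) tensor, i.e., the module is shape-preserving and can drop into a standard encoder layer in place of plain self-attention.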