Closed captioning
Computer science
Image (mathematics)
Artificial intelligence
Computer vision
Speech recognition
Natural language processing
Authors
Heng Song, Junwu Zhu, Yi Jiang
Identifier
DOI:10.1016/j.compeleceng.2020.106630
Abstract
Recently, researchers have extensively studied techniques for automatically generating descriptions of images. Among the many image captioning approaches proposed, the attention-based encoder-decoder framework has achieved great success. Two different types of attention models have been proposed to generate image captions: visual-attention-based models, which excel at describing details, and text-attention-based models, which excel at comprehensive understanding. To integrate and make full use of both visual and textual information when generating more accurate image captions, in this paper we first introduce a visual attention model to produce the visual information and a text attention model to produce the textual information, and then propose an adaptive visual-text merging network (avtmNet). This merging network effectively merges the visual and textual information and automatically determines the proportion of each used to generate the next caption word. Extensive experiments on the COCO2014 and Flickr30K datasets demonstrate the effectiveness and superiority of the proposed approach.
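The abstract describes a network that merges a visual context and a text context with an automatically determined mixing proportion. A minimal sketch of such an adaptive merge is a learned scalar gate that weighs the two contexts at each decoding step. The class name `AdaptiveMergeGate`, the gate parameterization, and all weights below are illustrative assumptions, not the actual avtmNet architecture from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class AdaptiveMergeGate:
    """Hypothetical sketch of an adaptive visual-text merge (NOT the
    paper's avtmNet): a learned scalar gate beta in (0, 1) decides,
    per decoding step, the proportion of visual vs. text context
    passed on to predict the next caption word."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Gate weights project [visual_ctx; text_ctx; hidden] to a scalar.
        self.w_g = rng.standard_normal(3 * dim) * 0.01

    def merge(self, visual_ctx, text_ctx, hidden):
        # Condition the gate on both contexts and the decoder hidden state.
        features = np.concatenate([visual_ctx, text_ctx, hidden])
        beta = sigmoid(self.w_g @ features)  # proportion of visual info
        # Convex combination: beta of the visual context,
        # (1 - beta) of the text context.
        merged = beta * visual_ctx + (1.0 - beta) * text_ctx
        return merged, beta

# Example: merge two 4-dimensional context vectors.
gate = AdaptiveMergeGate(dim=4)
visual = np.ones(4)
text = np.zeros(4)
hidden = np.ones(4)
merged, beta = gate.merge(visual, text, hidden)
```

In a trained model `w_g` would be learned jointly with the attention modules, so the gate can favor visual detail for concrete words and textual context for function words.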