Closed captioning
Computer science
Leverage (statistics)
Artificial intelligence
Exploit
Benchmark (surveying)
Visualization
Filter (signal processing)
Image (mathematics)
Context (archaeology)
Computer vision
Pattern recognition (psychology)
Machine learning
Paleontology
Biology
Computer security
Geography
Geodesy
Authors
Meng-Hao Guo,Qiaohong Chen,Xian Fang,Jia Bao,Shenxiang Xiang
Identifier
DOI:10.1007/978-3-031-44210-0_30
Abstract
Image captioning is a challenging multimodal task that requires generating textual descriptions for complex input images. Existing methods usually rely on object detectors to extract visual features from images, which are then fed to a text generator. However, the features extracted in this way lack focus and tend to ignore both the relationships between objects and the background information. To address these problems, we exploit both region features and grid features of the image to make full use of the information it encapsulates. Specifically, we first propose an Object Filter Module (OFM) to extract the primary visual objects. We then introduce a Global Injection Cross Attention (GICA) to inject the global context of the image into the filtered primary objects. Extensive experiments on the widely used COCO benchmark demonstrate the effectiveness and potential of our model: it outperforms previous methods on the image captioning task, achieving a CIDEr score of 136.1.
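The abstract describes two components: a filter that keeps only the primary object features, and a cross-attention step whose queries are the filtered region features and whose keys/values are grid features carrying global context. The paper's actual modules are learned networks; the sketch below is only an illustrative, pure-Python approximation of those two ideas (the function names `top_k_filter` and `cross_attention`, the use of top-k selection for the filter, and plain scaled dot-product attention are all assumptions, not the authors' implementation).

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def top_k_filter(features, scores, k):
    """OFM-style filtering sketch: keep the k highest-scoring object
    features. (In the paper the relevance scores are presumably learned;
    here they are simply given.)"""
    order = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
    return [features[i] for i in order[:k]]

def cross_attention(queries, keys, values):
    """GICA-style cross attention sketch: queries are filtered region
    (object) features; keys/values are grid features covering the whole
    image, so each object vector is enriched with global context."""
    d = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product scores of this object against every grid cell.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Convex combination of grid (value) features.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

For example, filtering three detected objects down to the two most salient and then attending over three grid cells:

```python
primary = top_k_filter([[1.0, 0.0], [0.0, 1.0], [0.3, 0.3]],
                       scores=[0.9, 0.8, 0.1], k=2)
grid = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
enriched = cross_attention(primary, grid, grid)
```

Each output row is a weighted average of the grid features, i.e. the object representation now mixes in global image context.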