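Closed captioning
Computer science
Feature (linguistics)
Representation (politics)
Graph
Artificial intelligence
Feature
Visualization
Image (mathematics)
Pattern recognition (psychology)
Theoretical computer science
Philosophy
Linguistics
Politics
Political science
Law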
Authors
Changzhi Wang,Xiaodong Gu
Identifier
DOI:10.1007/978-981-99-1645-0_38
Abstract
Existing attention-based image captioning approaches treat the local and global features of an image separately, neglecting the intrinsic interaction between them, which provides important guidance for caption generation. To alleviate this issue, we propose a novel Local-Global Visual Interaction Network (LGVIN) that explores the interactions between local and global features. Specifically, we devise a new visual interaction graph network that mainly consists of a visual interaction encoding module and a visual interaction fusion module. The former implicitly encodes the visual relationships between local and global features to obtain an enhanced visual representation containing rich local-global feature relationships. The latter fuses the multiple relationship features obtained previously to further enrich relationship attribute information at different levels. In addition, we introduce a new relationship-attention-based LSTM module that guides word generation by dynamically focusing on the previously output fused relationship information. Extensive experimental results demonstrate the superiority of our LGVIN approach: our model clearly outperforms current comparable relationship-based image captioning methods.
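The abstract gives no equations, so the following is only a minimal illustrative sketch of the three stages it names (interaction encoding, fusion, relationship attention), not the authors' implementation. Every function name, the dot-product attention, the residual update, the mean-based fusion, and the random stand-in for the LSTM hidden state are assumptions introduced here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode_interaction(local_feats, global_feat):
    # Assumed encoding step: treat the N local region features and the one
    # global feature as graph nodes and run one round of dot-product
    # attention between all nodes, with a residual connection.
    nodes = np.vstack([local_feats, global_feat[None, :]])   # (N + 1, d)
    d = nodes.shape[1]
    weights = softmax(nodes @ nodes.T / np.sqrt(d), axis=-1)  # (N + 1, N + 1)
    return nodes + weights @ nodes

def fuse(relation_feats):
    # Assumed fusion step: average several relationship feature sets.
    return np.mean(np.stack(relation_feats), axis=0)

def relation_attention(fused, hidden):
    # Assumed attention step: weight the fused nodes by their similarity
    # to the decoder hidden state and return the context vector.
    alpha = softmax(fused @ hidden / np.sqrt(hidden.shape[0]))
    return alpha @ fused, alpha

rng = np.random.default_rng(0)
local_feats = rng.normal(size=(5, 8))      # 5 hypothetical region features, d = 8
global_feat = local_feats.mean(axis=0)     # hypothetical global feature

enc1 = encode_interaction(local_feats, global_feat)
enc2 = encode_interaction(enc1[:-1], enc1[-1])   # a second, deeper encoding
fused = fuse([enc1, enc2])

hidden = rng.normal(size=8)                # stand-in for an LSTM hidden state
context, alpha = relation_attention(fused, hidden)
print(fused.shape, context.shape)
```

In a real captioning model the context vector would be fed to the LSTM at each decoding step and the hidden state would change per word; here a single step is shown with a fixed random vector.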