Keywords
Computer science
Modality
Graph
Convolutional neural network
Artificial intelligence
Information retrieval
Pattern recognition (psychology)
Natural language processing
Theoretical computer science
Chemistry
Polymer chemistry
Authors
Xueyang Qin, Lishuang Li, Guangyao Pang, Fei Hao
Identifier
DOI: 10.1016/j.eswa.2024.123842
Abstract
Exploring the semantic correspondence between image-text pairs is significant because it bridges vision and language. Most prior works focus on global or local semantic alignment, developing fine-grained neural networks that facilitate the corresponding alignment but neglect the semantic and relative-position information among image regions or text words, which leads to non-meaningful alignments. To this end, this paper proposes a Heterogeneous Graph Fusion Network (HGFN) that explores vision-language correlation scores to improve the accuracy of cross-modal image-text retrieval. Specifically, we first construct, for each image, an undirected fully-connected graph based on semantic or relative-position information, as well as a textual graph that captures the neighborhood information of the text. We then present a graph fusion module that integrates the features of the heterogeneous graphs into a unified hybrid representation, in which a graph convolutional network gathers neighborhood information to alleviate potentially non-meaningful alignment. In addition, we propose a novel "Dynamic top-K negative" strategy for selecting negative examples during training. Experimental results demonstrate that HGFN achieves performance comparable to state-of-the-art approaches on the Flickr30K and MSCOCO datasets.
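The abstract names two concrete mechanisms without spelling out their form: graph-convolutional neighborhood aggregation over the per-modality graphs, and the "Dynamic top-K negative" selection of training negatives. The sketch below shows one common way such components are realized. It is a minimal illustration, not the authors' implementation: the PyTorch setting, the row-normalized adjacency, the hinge (triplet) loss form, and the margin value are all assumptions made for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCNLayer(nn.Module):
    """One graph-convolution step: aggregate neighbor features through a
    normalized adjacency matrix, then apply a learned projection."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (n_nodes, dim) node features (image regions or text words)
        # adj: (n_nodes, n_nodes) edge weights, assumed precomputed from
        #      semantic similarity or relative-position cues
        adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-8)  # row-normalize
        return F.relu(self.proj(adj @ x))


def dynamic_topk_negative_loss(sim: torch.Tensor, k: int,
                               margin: float = 0.2) -> torch.Tensor:
    """Hinge loss over a batch similarity matrix `sim` whose diagonal holds
    the matched image-text pairs; for each pair, the K highest-scoring
    (hardest) negatives are penalized in both retrieval directions."""
    n = sim.size(0)
    pos = sim.diag().unsqueeze(1)                       # (n, 1) positive scores
    eye = torch.eye(n, dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(eye, float("-inf"))           # mask out positives

    hard_i2t = neg.topk(k, dim=1).values                # image -> text negatives
    hard_t2i = neg.t().topk(k, dim=1).values            # text -> image negatives

    loss_i2t = (margin + hard_i2t - pos).clamp(min=0).mean()
    loss_t2i = (margin + hard_t2i - pos).clamp(min=0).mean()
    return loss_i2t + loss_t2i
```

Here K must be at most the batch size minus one; "dynamic" presumably refers to adjusting K over the course of training, which the abstract does not detail, so this sketch leaves K as a caller-supplied parameter.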