Topics: Computer science, Artificial intelligence, Generative grammar, Bridging (networking), Semantic gap, Natural language processing, Matching (statistics), Fuse (electrical), Image (mathematics), Focus (optics), Word (group theory), Pattern recognition (psychology), Information retrieval, Image retrieval, Linguistics, Statistics, Mathematics, Computer networks, Physics, Philosophy, Optics, Electrical engineering, Engineering
Authors
Guoshuai Zhao, Chaofeng Zhang, Heng Shang, Yaxiong Wang, Zhu Li, Xueming Qian
Identifier
DOI:10.1016/j.knosys.2023.110280
Abstract
Although there is a long line of research on bidirectional image–text matching, the problem remains challenging due to the well-known semantic gap between the visual and textual modalities. Popular solutions usually first detect objects and then find associations between the visual objects and textual words to estimate relevance. However, these methods focus only on visual object features while ignoring the semantic attributes of the detected regions, which are an important clue for bridging the semantic gap. To remedy this issue, we propose a generative multi-attribute tag fusion method that further incorporates region attributes to alleviate the semantic gap. Our method comprises three steps: image feature extraction, text feature extraction, and attention-based matching of image and text. First, we divide the image into blocks to obtain region image features and region attribute labels, and fuse them to reduce the semantic gap between the image and text features. Second, BERT and a bi-GRU are used to extract text features. Third, an attention mechanism matches each region in the image with the word in the text that shares its meaning. Quantitative and qualitative results on the public Flickr30K and MS-COCO datasets demonstrate the effectiveness of our method. The source code is released on GitHub: https://github.com/smileslabsh/Generative-Label-Fused-Network.
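The third step above, attention-based region–word matching, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the feature shapes, the temperature value, and the cosine-based relevance aggregation are assumptions in the style of stacked cross-attention matching, with each word attending over the set of detected region features.

```python
import numpy as np

def l2norm(x, eps=1e-8):
    """L2-normalize vectors along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cross_attention_score(regions, words, temperature=9.0):
    """Toy word-to-region cross-attention matching score.

    regions: (k, d) array of region image features.
    words:   (n, d) array of word text features.
    Returns a scalar relevance in [-1, 1]; higher means better match.
    The softmax temperature is an assumed hyperparameter.
    """
    v = l2norm(regions)                      # (k, d)
    w = l2norm(words)                        # (n, d)
    sim = w @ v.T                            # (n, k) word-region cosines
    attn = np.exp(temperature * sim)
    attn /= attn.sum(axis=1, keepdims=True)  # softmax over regions per word
    attended = attn @ v                      # (n, d) region context per word
    # Relevance: cosine between each word and its attended region context,
    # averaged over words to score the whole image-text pair.
    rel = np.sum(l2norm(attended) * w, axis=1)
    return float(rel.mean())

rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 8))
matched = cross_attention_score(regions, regions[:3])   # words align with regions
random_ = cross_attention_score(regions, rng.normal(size=(3, 8)))
```

A pair whose word features coincide with some region features attends almost one-hot onto the matching regions, so its score approaches 1 and exceeds that of a randomly paired caption.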