Computer science
Artificial intelligence
Feature (linguistics)
Pattern recognition (psychology)
Sentence
Semantics (computer science)
Feature extraction
Semantic feature
Semantic gap
Pattern
Graph
Image (mathematics)
Computer vision
Image retrieval
Theoretical computer science
Social science
Philosophy
Linguistics
Sociology
Programming language
Authors
Xinfeng Dong, Huaxiang Zhang, Lei Zhu, Liqiang Nie, Li Liu
Source
Journal: IEEE Transactions on Circuits and Systems for Video Technology
[Institute of Electrical and Electronics Engineers]
Date: 2022-04-01
Volume (issue): 32 (9): 6437-6447
Citations: 30
Identifier
DOI:10.1109/tcsvt.2022.3164230
Abstract
In order to carry out more accurate retrieval across image-text modalities, some scholars use fine-grained features to align images and text. Most of them directly use an attention mechanism to align image regions with words in the sentence, ignoring the fact that the semantics related to an object are abstract and cannot be accurately expressed by object information alone. To overcome this weakness, we propose a hierarchical feature aggregation algorithm based on graph convolutional networks (GCN) that promotes object semantic integrity by hierarchically integrating the attributes of an object and the relations between objects in both the image and text modalities. To eliminate the semantic gap between modalities, we propose a transformer-based cross-modal feature fusion method that generates modality-specific feature representations by integrating the object features with the global feature of the other modality, and we then map the fused features into a common space. Experimental results on the widely used MSCOCO and Flickr30K datasets show the effectiveness of the proposed model compared with the latest methods.
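The abstract describes two building blocks: GCN-based hierarchical aggregation of object, attribute, and relation features within each modality, and transformer-based cross-attention that fuses each modality's object features with the other modality's global feature before projection into a common embedding space. The following is a minimal PyTorch sketch of those two ideas only; the module names, tensor dimensions, mean-pooled "global feature", toy adjacency matrices, and the weights shared across modalities are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of GCN-style aggregation and transformer cross-modal fusion.
# Dimensions, pooling choices, and shared weights are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAggregation(nn.Module):
    """One GCN-style layer: each object node aggregates features from its
    attribute/relation neighbours through a row-normalized adjacency matrix."""

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, nodes, adj):
        # nodes: (batch, num_nodes, dim); adj: (batch, num_nodes, num_nodes)
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)     # node degrees
        agg = torch.bmm(adj / deg, nodes)                      # mean over neighbours
        return F.relu(self.linear(agg)) + nodes                # residual update


class CrossModalFusion(nn.Module):
    """Transformer-style cross-attention: object features of one modality
    attend to the global feature of the other modality, and the fused result
    is projected into the common space."""

    def __init__(self, dim, common_dim, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, common_dim)

    def forward(self, objects, other_global):
        # objects: (batch, num_objects, dim); other_global: (batch, dim)
        context = other_global.unsqueeze(1)                    # (batch, 1, dim)
        fused, _ = self.cross_attn(objects, context, context)  # query objects with the other modality
        fused = fused + objects                                # residual connection
        pooled = fused.mean(dim=1)                             # pool objects into one vector
        return F.normalize(self.proj(pooled), dim=-1)          # embed into the common space


if __name__ == "__main__":
    batch, num_regions, num_words, dim, common_dim = 2, 36, 20, 256, 128
    img_nodes = torch.randn(batch, num_regions, dim)           # region features (assumed pre-extracted)
    txt_nodes = torch.randn(batch, num_words, dim)             # word features (assumed pre-extracted)
    img_adj = torch.ones(batch, num_regions, num_regions)      # toy fully-connected relation graph
    txt_adj = torch.ones(batch, num_words, num_words)

    # In the paper each modality would have its own parameters; shared here for brevity.
    gcn = GraphAggregation(dim)
    img_nodes, txt_nodes = gcn(img_nodes, img_adj), gcn(txt_nodes, txt_adj)

    fusion = CrossModalFusion(dim, common_dim)
    img_emb = fusion(img_nodes, txt_nodes.mean(dim=1))         # image fused with global text feature
    txt_emb = fusion(txt_nodes, img_nodes.mean(dim=1))         # text fused with global image feature

    similarity = (img_emb * txt_emb).sum(dim=-1)               # cosine similarity for retrieval
    print(similarity.shape)                                    # torch.Size([2])
```

In this sketch the image-text similarity is the cosine of the two common-space embeddings, which is the usual scoring function for cross-modal retrieval; the paper's exact matching objective is not specified in the abstract.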