Computer science
Leverage (statistics)
Information retrieval
Exploitation
Question answering
Knowledge graph
Pipeline (software)
Graph
Artificial intelligence
Natural language processing
Theoretical computer science
Computer security
Programming language
Authors
Wei Wang,Junyu Gao,Xiaoshan Yang,Changsheng Xu
Identifier
DOI:10.1109/tmm.2022.3149716
Abstract
The problem of video-text retrieval, which searches videos via natural language descriptions or vice versa, has attracted growing attention due to the explosive scale of videos produced every day. The dominant approaches to this problem follow a pipeline that first learns compact feature representations of videos and texts, and then jointly embeds them into a common feature space where matched video-text pairs are close and unmatched pairs are far away. However, most of them neither consider the structural similarities among cross-modal samples in a global view, nor leverage useful information from other relevant retrieval processes. We argue that both types of information have great potential for video-text retrieval. In this paper, we treat the relevant retrieval processes as auxiliary tasks and extract useful knowledge from them by exploiting structural similarities via Graph Neural Networks (GNNs). We then progressively transfer the knowledge from auxiliary tasks in a general-to-specific manner to assist the main task of the current retrieval process. Specifically, for the retrieval of the given query, we first construct a sequence of query-graphs whose central queries are chosen from distant to close to the given query. Then we conduct knowledge-guided message passing in each query-graph to exploit regional structural similarities and gather knowledge of different levels from the updated query-graphs with a knowledge-based attention mechanism. Finally, we transfer the extracted useful knowledge from general to specific to assist the current retrieval process. Extensive experimental results show that our model outperforms state-of-the-art methods on four benchmarks.
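To make the described pipeline more concrete, below is a minimal PyTorch sketch of the general idea: message passing over a similarity-based query-graph, followed by attention that fuses knowledge gathered from a sequence of query-graphs ordered from general to specific. The class names (`QueryGraphMessagePassing`, `KnowledgeAttention`), the cosine-similarity graph construction, and the mean-pooled graph summaries are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of query-graph message passing and knowledge attention,
# loosely following the abstract; not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryGraphMessagePassing(nn.Module):
    """One round of message passing over a similarity-based query-graph."""

    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, node_feats):
        # node_feats: (N, D) embeddings of the queries/samples in one query-graph.
        sim = F.cosine_similarity(
            node_feats.unsqueeze(1), node_feats.unsqueeze(0), dim=-1
        )                                        # (N, N) pairwise similarities
        adj = torch.softmax(sim, dim=-1)         # soft adjacency / edge weights
        messages = adj @ self.msg(node_feats)    # aggregate neighbor messages
        return self.update(messages, node_feats)  # GRU-style node update


class KnowledgeAttention(nn.Module):
    """Attention over knowledge gathered from a sequence of query-graphs."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, graph_summaries, query):
        # graph_summaries: (G, D), one summary vector per query-graph
        # query: (D,) embedding of the current query
        logits = self.score(graph_summaries * query)    # (G, 1) relevance scores
        weights = torch.softmax(logits, dim=0)          # attention weights
        return (weights * graph_summaries).sum(dim=0)   # fused knowledge vector


if __name__ == "__main__":
    dim, num_graphs, nodes_per_graph = 256, 4, 8
    mp = QueryGraphMessagePassing(dim)
    attn = KnowledgeAttention(dim)

    query = torch.randn(dim)
    summaries = []
    for _ in range(num_graphs):  # query-graphs ordered from general to specific
        nodes = torch.randn(nodes_per_graph, dim)
        updated = mp(nodes)
        summaries.append(updated.mean(dim=0))  # pool nodes into a graph summary
    knowledge = attn(torch.stack(summaries), query)
    print(knowledge.shape)  # torch.Size([256])
```

In such a setup, the fused knowledge vector would be combined with the current query's embedding before computing video-text similarities; how that combination is done is left open here, since the abstract does not specify it.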