Computer science
Dependency (UML)
Artificial intelligence
Hypergraph
Modal verb
Representation (politics)
Natural language processing
Semantics (computer science)
Key (lock)
Component (thermodynamics)
Tree (set theory)
Word (group theory)
Information retrieval
Linguistics
Mathematical analysis
Chemistry
Physics
Philosophy
Mathematics
Computer security
Discrete mathematics
Politics
Political science
Polymer chemistry
Law
Thermodynamics
Programming language
Authors
Zenan Xu, Wanjun Zhong, Qinliang Su, Fuwei Zhang
Identifier
DOI: 10.1109/icme55011.2023.00073
Abstract
A key challenge in video question answering (VideoQA) is how to accurately align textual concepts with the relevant visual regions across modalities. Existing methods mostly rely on the alignment between individual words and relevant video regions, but an individual word is generally unable to capture the complete information of a textual concept, which is often expressed by a composition of several words. To address this issue, we propose to build a syntactic dependency tree for each question with an off-the-shelf tool and use it to extract meaningful word compositions (i.e., textual concepts). By viewing the words and compositions as nodes and hyperedges, respectively, a hypergraph convolutional network (HCN) is built to learn the representations of textual concepts. Then, to enable cross-modal interaction between relevant concepts from different modalities, an optimal transport (OT) based alignment method is developed to establish the connection between textual concepts and their relevant visual regions. Experimental results on three benchmarks show that our method outperforms all competing baselines. Further analyses demonstrate the effectiveness of each component, and show that our model is good at modeling different levels of semantic compositions and filtering out irrelevant information.
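To make the pipeline described in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of the three steps: forming word compositions from a dependency parse and treating them as hyperedges, running one HGNN-style hypergraph convolution to obtain concept representations, and softly aligning those concepts to visual region features with entropy-regularised (Sinkhorn) optimal transport. Everything here is an illustrative assumption, not the paper's implementation: the example sentence and parse, the composition rule (a head word plus its direct dependents), the toy dimensions, and the uniform-marginal Sinkhorn variant are all placeholders.

```python
import torch
import torch.nn.functional as F

# --- Hypothetical inputs (not from the paper) ------------------------------
# Question: "what is the man holding in his hand"
# The dependency parse is summarised by each token's head index
# (the root points to itself); in practice this comes from an
# off-the-shelf parser, as the abstract describes.
tokens = ["what", "is", "the", "man", "holding", "in", "his", "hand"]
heads  = [4, 4, 3, 4, 4, 4, 7, 5]      # syntactic head of each token

n = len(tokens)
word_dim, region_dim, d = 32, 64, 32   # toy dimensions

# --- Step 1: word compositions as hyperedges --------------------------------
# Assumed composition rule: each head word together with its direct
# dependents forms one composition (one hyperedge).
children = {i: [j for j, h in enumerate(heads) if h == i and j != i] for i in range(n)}
hyperedges = [[i] + children[i] for i in range(n) if children[i]]

# Incidence matrix H: H[v, e] = 1 if word v belongs to hyperedge e.
H = torch.zeros(n, len(hyperedges))
for e, members in enumerate(hyperedges):
    H[members, e] = 1.0

# --- Step 2: one HGNN-style hypergraph convolution ---------------------------
# Node update: X' = D_v^{-1/2} H W_e D_e^{-1} H^T D_v^{-1/2} X Theta
X = torch.randn(n, word_dim)                      # word embeddings (random here)
W_e = torch.eye(len(hyperedges))                  # uniform hyperedge weights
D_v = torch.diag(H.sum(1).clamp(min=1) ** -0.5)   # node degree normalisation
D_e = torch.diag(H.sum(0).clamp(min=1) ** -1)     # hyperedge degree normalisation
Theta = torch.randn(word_dim, d)
X_new = torch.relu(D_v @ H @ W_e @ D_e @ H.T @ D_v @ X @ Theta)
concepts = D_e @ H.T @ X_new                      # one representation per composition

# --- Step 3: OT-based alignment to visual regions (Sinkhorn) ----------------
regions = torch.randn(16, region_dim)             # 16 visual region features
proj = torch.randn(region_dim, d)
R = regions @ proj                                # project into the shared space

cost = 1 - F.normalize(concepts, dim=1) @ F.normalize(R, dim=1).T  # cosine cost

def sinkhorn(C, eps=0.05, iters=50):
    """Entropy-regularised OT with uniform marginals (standard Sinkhorn)."""
    K = torch.exp(-C / eps)
    a = torch.full((C.size(0),), 1.0 / C.size(0))
    b = torch.full((C.size(1),), 1.0 / C.size(1))
    u = torch.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return torch.diag(u) @ K @ torch.diag(v)       # transport plan

plan = sinkhorn(cost)                              # soft concept-to-region alignment
aligned = plan @ R * plan.size(0)                  # region info gathered per concept
print(plan.shape, aligned.shape)
```

In this sketch the transport plan plays the role of a soft alignment matrix: each row distributes a textual concept's attention over visual regions, and low-cost (semantically similar) pairs receive most of the mass, which is the intuition behind using OT rather than per-word attention.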