Authors
Meng Liu, Fenglei Zhang, Xin Luo, Lijuan Fan, Yinwei Wei, Liqiang Nie
Identifier
DOI:10.1145/3581783.3612239
Abstract
Video question answering (VideoQA) is an increasingly vital research field, spurred by the rapid proliferation of online video content and the urgent need for intelligent systems that can comprehend and interact with it. Existing methodologies often focus on video understanding and cross-modal interaction modeling but tend to overlook the crucial aspect of comprehensive question understanding. To address this gap, we introduce the multi-modal and multi-layer question enhancement network, a framework emphasizing nuanced question understanding. Our approach begins by extracting object, appearance, and motion features from videos. We then harness multi-layer outputs from a pre-trained language model to ensure a thorough grasp of the question. The integration of object features into appearance features is guided by the global question and frame representations, facilitating the adaptive acquisition of appearance- and motion-enhanced question representations. By amalgamating these multi-modal question cues, our method determines the answer to each question. Experimental results on three benchmarks demonstrate the superiority of our tailored approach, underscoring the importance of advanced question comprehension in VideoQA.