Computer science
Pairwise comparison
Question answering
Granularity
Baseline
Representation
Artificial intelligence
Information retrieval
Focus
Discriminative model
Modality
Natural language processing
Authors
Linjun Li, Tao Jin, Lin Wang, Hao Jiang, Wenwen Pan, Jian Wang, Shuwen Xiao, Yan Xia, Weihao Jiang, Zhou Zhao
Identifier
DOI: 10.1109/tcsvt.2023.3264524
Abstract
Recent methods for video question answering (VideoQA), which aim to generate answers from a given question and the video content, have made significant progress in cross-modal interaction. From the perspective of video understanding, these existing frameworks concentrate on various levels of visual content, partially assisted by subtitles. However, audio information is also instrumental in reaching correct answers, especially in videos of real-life scenarios. Indeed, in some cases audio and visual content are both required and complement each other to answer a question; this task is defined as audio-visual question answering (AVQA). In this paper, we focus on importing raw audio for AVQA and contribute in three ways. First, since no existing dataset annotates QA pairs for raw audio, we introduce E-AVQA, a manually annotated, large-scale, multi-modal dataset. E-AVQA consists of 34,033 QA pairs over 33,340 clips from 18,786 videos in e-commerce scenarios. Second, we propose MGN, a multi-granularity relational attention method with contrastive constraints between audio and visual features after the interaction; it captures local sequential representations with a pairwise potential attention mechanism and obtains a global multi-modal representation with a novel ternary potential attention mechanism. Third, the proposed MGN outperforms the baseline on E-AVQA, achieving 20.73% on WUPS@0.0 and 19.81% on BLEU@1, with at least a 1.02-point improvement on WUPS@0.0 and about a 10% gain in time complexity over the baseline.
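The abstract names two ingredients of MGN without giving details: a pairwise attention mechanism through which the audio and visual sequences attend to each other, and a contrastive constraint applied to the two modalities after the interaction. The sketch below illustrates one plausible reading of that combination in PyTorch. The module structure, dimensions, mean pooling, and InfoNCE-style loss are all illustrative assumptions, not the paper's actual implementation; the ternary potential attention over three inputs is omitted entirely.

# A minimal, hypothetical sketch of (1) pairwise cross-modal attention between
# audio and visual sequences and (2) a contrastive constraint applied after the
# interaction. Names, dimensions, and the loss form are assumptions for
# illustration only; they are not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseCrossModalAttention(nn.Module):
    """Audio queries attend over visual keys/values, and vice versa."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio: (B, Ta, D), visual: (B, Tv, D)
        audio_ctx, _ = self.a2v(audio, visual, visual)   # audio attends to visual
        visual_ctx, _ = self.v2a(visual, audio, audio)   # visual attends to audio
        return audio + audio_ctx, visual + visual_ctx    # residual fusion

def contrastive_constraint(audio: torch.Tensor, visual: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: the audio/visual pair from the same clip is the
    positive, all other pairs in the batch are negatives."""
    a = F.normalize(audio.mean(dim=1), dim=-1)   # (B, D) pooled audio
    v = F.normalize(visual.mean(dim=1), dim=-1)  # (B, D) pooled visual
    logits = a @ v.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    B, Ta, Tv, D = 2, 10, 16, 256
    audio, visual = torch.randn(B, Ta, D), torch.randn(B, Tv, D)
    block = PairwiseCrossModalAttention(dim=D)
    audio_out, visual_out = block(audio, visual)
    loss = contrastive_constraint(audio_out, visual_out)
    print(audio_out.shape, visual_out.shape, loss.item())

Under these assumptions, the attention block performs the local pairwise interaction between the two modalities, and the contrastive term pulls matched audio-visual clip representations together while pushing mismatched ones apart, which is one common way to realize "contrastive constraints between audio and visual features after the interaction."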