Topics
Computer science, Question answering, Redundancy (engineering), Task (project management), Artificial intelligence, Duration (time), Information retrieval, Natural language processing, Machine learning, Physics, Management, Quantum mechanics, Economics, Operating system
Authors
Tianwen Qian, Ran Cui, Jingjing Chen, Pai Peng, Xiaowei Guo, Yu-Gang Jiang
Identifier
DOI: 10.1109/TMM.2023.3323878
Abstract
Video question answering (VideoQA) is an essential task in vision-language understanding and has recently attracted considerable research attention. Nevertheless, existing works mostly achieve promising performance on short videos, typically under 15 seconds in duration. On minute-level long-term videos, these methods are likely to fail because they lack the ability to handle the noise and redundancy caused by scene changes and multiple actions. Since the content relevant to a question is usually concentrated in a short temporal range, we propose to first localize the question to a segment of the video and then infer the answer from that segment alone. Under this scheme, we propose "Locate before Answering" (LocAns), a novel approach that integrates a question localization module and an answer prediction module into an end-to-end model. During training, the available answer label not only serves as the supervision signal for the answer prediction module, but is also used to generate pseudo temporal labels for the question localization module. Moreover, we design a decoupled alternating training strategy that updates the two modules separately. In experiments, LocAns achieves state-of-the-art performance on three modern long-term VideoQA datasets, NExT-QA, ActivityNet-QA, and AGQA, and its qualitative examples demonstrate reliable question localization.
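To make the locate-then-answer scheme and the alternating update concrete, below is a minimal, hypothetical PyTorch sketch. It assumes toy frame and question embeddings, treats the top-k highest-scoring frames as the "located segment", and fabricates a uniform pseudo temporal label for illustration; all module names, dimensions, and the training loop are assumptions for exposition, not the authors' implementation.

```python
# Hypothetical sketch of "Locate before Answering": a localization module
# scores frames for relevance to the question, and an answering module
# predicts the answer from the located segment only. Illustrative, not
# the paper's code.
import torch
import torch.nn as nn

class QuestionLocalizer(nn.Module):
    """Scores each video frame for relevance to the question."""
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, frames, question):
        # frames: (T, dim); question: (dim,) pooled question embedding
        q = question.expand(frames.size(0), -1)
        return self.score(torch.cat([frames, q], dim=-1)).squeeze(-1)  # (T,)

class AnswerPredictor(nn.Module):
    """Classifies the answer from the located segment only."""
    def __init__(self, dim=256, num_answers=1000):
        super().__init__()
        self.cls = nn.Linear(2 * dim, num_answers)

    def forward(self, segment, question):
        pooled = segment.mean(dim=0)  # pool frames inside the segment
        return self.cls(torch.cat([pooled, question], dim=-1))

# Decoupled alternating training: update one module per step while the
# other is left untouched, echoing the strategy the abstract describes.
T, dim, num_answers = 64, 256, 1000
loc, ans = QuestionLocalizer(dim), AnswerPredictor(dim, num_answers)
opt_loc = torch.optim.Adam(loc.parameters(), lr=1e-4)
opt_ans = torch.optim.Adam(ans.parameters(), lr=1e-4)

frames = torch.randn(T, dim)      # stand-in frame features
question = torch.randn(dim)       # stand-in question embedding
answer_label = torch.tensor(3)    # stand-in ground-truth answer index

for step in range(2):
    # 1) Localize: take the top-k frames as the predicted segment.
    scores = loc(frames, question)
    topk = scores.topk(k=8).indices.sort().values
    segment = frames[topk]

    if step % 2 == 0:
        # Answering step: train the predictor on the located segment.
        logits = ans(segment, question).unsqueeze(0)
        loss = nn.functional.cross_entropy(logits, answer_label.unsqueeze(0))
        opt_ans.zero_grad(); loss.backward(); opt_ans.step()
    else:
        # Localization step: in the paper the answer label is used to
        # derive pseudo temporal labels; here we fake a binary frame
        # label purely for illustration.
        pseudo_label = torch.zeros(T)
        pseudo_label[topk] = 1.0
        loss = nn.functional.binary_cross_entropy_with_logits(scores, pseudo_label)
        opt_loc.zero_grad(); loss.backward(); opt_loc.step()
```

The decoupled updates matter because the two modules depend on each other: the predictor needs a segment to answer from, while the localizer needs answer feedback to learn where to look; alternating their updates avoids optimizing both against a moving target at once.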