Video Question Answering With Semantic Disentanglement and Reasoning
Computer science
Question answering
Information retrieval
Natural language processing
Semantics (computer science)
Artificial intelligence
Programming language
Authors
Jin Liu, Guoxiang Wang, Jialong Xie, Fengyu Zhou, Huijuan Xu
Source
Journal: IEEE Transactions on Circuits and Systems for Video Technology [Institute of Electrical and Electronics Engineers] · Date: 2023-09-20 · Volume/Issue: 34(5): 3663-3673 · Cited by: 2
Identifiers
DOI:10.1109/tcsvt.2023.3317447
Abstract
Video question answering (VideoQA) aims to provide correct answers to questions about complex videos, posing high demands on comprehension ability in both video and language processing. Existing works frame this task as a multi-modal fusion process that aligns the video context with the whole question, ignoring the rich semantic details that nouns and verbs separately contribute to the multi-modal reasoning that derives the final answer. To fill this gap, in addition to semantic alignment of the whole sentence, we propose to disentangle the semantic understanding of language and reason over the corresponding frame-level and motion-level video features. We design a unified multi-granularity language module with a residual structure that adapts semantic understanding at different granularities, e.g., word level and sentence level, with context exchange. To enhance holistic question understanding for answer prediction, we also design a contrastive sampling approach that selects irrelevant questions as negative samples to break the intrinsic correlations between questions and answers within the dataset. Notably, our model is competent for both multiple-choice and open-ended video question answering. We further employ a pre-trained language model to retrieve relevant knowledge as candidate answer context to facilitate open-ended VideoQA. Extensive quantitative and qualitative experiments on four public datasets (NExT-QA, MSVD, MSRVTT, and TGIF-QA-R) demonstrate the effective and superior performance of our proposed model. Our code will be released upon the paper's acceptance.
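The contrastive sampling idea in the abstract (pulling a video toward its own question while pushing it away from sampled irrelevant questions) can be sketched as an InfoNCE-style loss. This is a minimal illustrative sketch, not the paper's implementation: the function name, embedding shapes, number of negatives, and temperature value are all assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_question_loss(video_emb, pos_q_emb, neg_q_embs, temperature=0.1):
    """InfoNCE-style contrastive loss (illustrative sketch).

    video_emb:  (B, D) video representations
    pos_q_emb:  (B, D) embeddings of each video's own (relevant) question
    neg_q_embs: (B, K, D) embeddings of K sampled irrelevant questions
    """
    # Cosine similarity via L2 normalization.
    video_emb = F.normalize(video_emb, dim=-1)
    pos_q_emb = F.normalize(pos_q_emb, dim=-1)
    neg_q_embs = F.normalize(neg_q_embs, dim=-1)

    pos_sim = (video_emb * pos_q_emb).sum(-1, keepdim=True)      # (B, 1)
    neg_sim = torch.einsum('bd,bkd->bk', video_emb, neg_q_embs)  # (B, K)

    # The matching question sits at index 0 of each row of logits.
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # (B, 1+K)
    labels = torch.zeros(video_emb.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```

In practice the negatives would be questions drawn from other videos in the batch or dataset, which is what breaks the question–answer shortcut correlations the abstract mentions.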