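Quaternion
Computer science
Hypergraph
Construct (Python library)
Artificial intelligence
Question answering
Theoretical computer science
Mathematics
Programming language
Geometry
Discrete mathematics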
Authors
Zhicheng Guo, Jiaxuan Zhao, Licheng Jiao, Xu Liu, Fang Liu
Identifier
DOI:10.1109/tmm.2021.3120544
Abstract
Fusion and interaction of multimodal features are essential for video question answering. Structural information composed of the relationships between different objects in videos is very complex, which restricts understanding and reasoning. In this paper, we propose a quaternion hypergraph network (QHGN) for multimodal video question answering, to simultaneously involve multimodal features and structural information. Since quaternion operations are suitable for multimodal interactions, four components of the quaternion vectors are applied to represent the multimodal features. Furthermore, we construct a hypergraph based on the visual objects detected in the video. Most importantly, the quaternion hypergraph convolution operator is theoretically derived to realize multimodal and relational reasoning. Question and candidate answers are embedded in quaternion space, and a Q&A reasoning module is creatively designed for selecting the answer accurately. Moreover, the unified framework can be extended to other video-text tasks with different quaternion decoders. Experimental evaluations on the TVQA dataset and DramaQA dataset show that our method achieves state-of-the-art performance.
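The abstract names two ingredients without giving formulas: quaternion operations over four-component feature vectors, and convolution over a hypergraph of detected objects. The sketch below illustrates both in a generic form only. The Hamilton product is the standard quaternion multiplication; the aggregation step (average node features into each hyperedge, then average hyperedges back into each node) is a common simplified hypergraph convolution used here as a stand-in, not the operator derived in the paper. All function names, the single shared quaternion weight, and the toy data are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Quaternion = Tuple[float, float, float, float]  # (real, i, j, k)


def hamilton_product(p: Quaternion, q: Quaternion) -> Quaternion:
    """Standard quaternion multiplication p * q (non-commutative)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def quat_mean(items: Sequence[Quaternion]) -> Quaternion:
    """Component-wise mean of a non-empty sequence of quaternions."""
    n = len(items)
    return tuple(sum(q[c] for q in items) / n for c in range(4))


def hypergraph_quaternion_layer(
    features: List[Quaternion],
    hyperedges: List[List[int]],
    weight: Quaternion,
) -> List[Quaternion]:
    """Hypothetical layer: node -> hyperedge -> node averaging, then a
    Hamilton product with one shared quaternion weight (an assumption)."""
    # Step 1: each hyperedge feature is the mean of its member nodes.
    edge_feats = [quat_mean([features[v] for v in edge]) for edge in hyperedges]

    out = []
    for v in range(len(features)):
        # Step 2: each node averages the hyperedges it belongs to.
        incident = [edge_feats[e] for e, edge in enumerate(hyperedges) if v in edge]
        agg = quat_mean(incident) if incident else features[v]
        # Step 3: mix the four components via quaternion multiplication.
        out.append(hamilton_product(weight, agg))
    return out


# Toy example: three nodes, two overlapping hyperedges, identity weight.
nodes = [(1.0, 0.0, 0.0, 0.0), (3.0, 0.0, 0.0, 0.0), (5.0, 0.0, 0.0, 0.0)]
edges = [[0, 1], [1, 2]]
result = hypergraph_quaternion_layer(nodes, edges, (1.0, 0.0, 0.0, 0.0))
```

Because the Hamilton product mixes all four components in every output component, a single quaternion weight already couples the four feature channels, which is the general reason quaternion layers are often used for multi-channel fusion. How the paper assigns modalities to components is not stated in the abstract and is not assumed here.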