Keywords
Computer science
Transformer
Matching (statistics)
Artificial intelligence
Modality (human-computer interaction)
Natural language processing
Multimedia
Information retrieval
Human-computer interaction
Statistics
Mathematics
Authors
Beibei Zhang,Yaqun Fang,Tongwei Ren,Gangshan Wu
Identifier
DOI: 10.1145/3503161.3551600
Abstract
The Deep Video Understanding Challenge (DVUC) aims to build a high-level understanding of video from multiple modalities, involving tasks such as relationship recognition and interaction detection. In this paper, we use a joint learning framework to predict multiple tasks simultaneously from visual, text, audio, and pose features. In addition, to answer the DVUC queries, we design multiple answering strategies and use a video-language transformer that learns cross-modal information for matching videos with text choices. The final DVUC results show that our method ranks first for group one of the movie-level queries, and ranks third for both group one and group two of the scene-level queries.
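The core matching step the abstract describes — a video-language model scoring candidate text choices against a video so that the best-scoring choice answers a query — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embeddings stand in for transformer outputs, and all function names are hypothetical.

```python
# Hypothetical sketch of cross-modal video/text matching: rank candidate
# text choices by cosine similarity to a video embedding and pick the best.
# The random vectors below are placeholders for real transformer features.
import numpy as np


def l2_normalize(x):
    """Scale a vector to unit length so dot products become cosine similarity."""
    return x / np.linalg.norm(x)


def match_choices(video_embedding, choice_embeddings):
    """Return the index of the best-matching text choice and all scores."""
    v = l2_normalize(video_embedding)
    scores = [float(np.dot(v, l2_normalize(c))) for c in choice_embeddings]
    return int(np.argmax(scores)), scores


# Toy embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
video = rng.standard_normal(8)
choices = [rng.standard_normal(8) for _ in range(4)]
best, scores = match_choices(video, choices)
```

In a real system the video embedding would fuse the visual, text, audio, and pose features mentioned in the abstract, and the choice embeddings would come from the text encoder of the video-language transformer.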