Multi-Granularity Relational Attention Network for Audio-Visual Question Answering

Authors
Linjun Li,Tao Jin,Lin Wang,Hao Jiang,Wenwen Pan,Jian Wang,Shuwen Xiao,Yan Xia,Weihao Jiang,Zhou Zhao
Source
Journal: IEEE Transactions on Circuits and Systems for Video Technology [Institute of Electrical and Electronics Engineers]
Volume (issue): 34 (8): 7080-7094 Citations: 10
Identifier
DOI:10.1109/tcsvt.2023.3264524
Abstract

Recent methods for video question answering (VideoQA), which aim to generate answers from given questions and video content, have made significant progress in cross-modal interaction. From the perspective of video understanding, these existing frameworks concentrate on various levels of visual content, partially assisted by subtitles. However, audio information is also instrumental in obtaining correct answers, especially in videos of real-life scenarios. Indeed, in some cases, audio and visual content are both required and complement each other to answer questions; this task is defined as audio-visual question answering (AVQA). In this paper, we focus on incorporating raw audio for AVQA and contribute in three ways. Firstly, because no existing dataset annotates QA pairs for raw audio, we introduce E-AVQA, a manually annotated, large-scale dataset involving multiple modalities. E-AVQA consists of 34,033 QA pairs on 33,340 clips of 18,786 videos from e-commerce scenarios. Secondly, we propose a multi-granularity relational attention method, named MGN, with contrastive constraints between audio and visual features after the interaction. MGN captures local sequential representations through a pairwise potential attention mechanism and obtains a global multi-modal representation through a novel ternary potential attention mechanism. Thirdly, our proposed MGN outperforms the baseline on the E-AVQA dataset, achieving 20.73% on WUPS@0.0 and 19.81% on BLEU@1, demonstrating its superiority with an improvement of at least 1.02 on WUPS@0.0 and about 10% on timing complexity over the baseline.