Keywords
Computer science, Generalizability, Artificial intelligence, Representation, Modality (human-computer interaction), Task, Key, Machine learning, Mathematics, Computer security, Statistics
Authors
Chengjie Zheng, Wei Ding, Shiqian Shen, Ping Chen
Identifier
DOI: 10.1007/978-3-031-36819-6_22
Abstract
Video classification is a complex task that involves analyzing audio and video signals with deep neural models. To classify these signals reliably, researchers have developed multimodal fusion techniques that combine audio and video data into compact representations that can be processed quickly. However, previous approaches to multimodal fusion have relied heavily on manually designed attention mechanisms. To address this limitation, we propose the Multimodal Auto Attention Fusion (MAF) model, which uses Neural Architecture Search (NAS) to automatically identify effective attention representations for a wide range of tasks. Our approach includes a custom-designed search space from which attention representations are generated automatically. By automating the design of the Key, Query, and Value representations, the MAF model improves its self-attention mechanism and yields highly effective attention designs. Compared with other multimodal fusion methods, our approach achieves competitive performance in capturing modality interactions. Experiments on three large datasets (UCF101, ActivityNet, and YouTube-8M) confirm the effectiveness of our approach and demonstrate its superior performance over other popular models. Furthermore, our approach generalizes robustly across diverse datasets.
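The abstract gives no implementation details, but a minimal sketch can illustrate the core idea: cross-modal attention in which the Query, Key, and Value projections are not fixed by hand but selected from a candidate search space with learnable architecture weights. The DARTS-style softmax mixture used here is an assumption; the paper's actual NAS procedure and search space may differ, and all class, function, and parameter names below (MixedProjection, MAFBlock, etc.) are illustrative, not the authors' released code.

```python
# Hedged sketch of NAS-searched Q/K/V projections for multimodal attention fusion.
# Assumption: a DARTS-style continuous relaxation over a small candidate op set.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixedProjection(nn.Module):
    """Softmax-weighted mixture over candidate projection ops.

    The architecture weights `alpha` are trained jointly with the model
    weights; after search, the highest-weight op would be retained
    (standard DARTS practice).
    """

    def __init__(self, dim):
        super().__init__()
        # Illustrative candidate ways to build a Q/K/V representation.
        self.ops = nn.ModuleList([
            nn.Linear(dim, dim),                            # plain linear map
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()),  # nonlinear map
            nn.Identity(),                                  # skip (no projection)
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))


class MAFBlock(nn.Module):
    """Cross-modal attention: audio queries attend to video keys/values."""

    def __init__(self, dim):
        super().__init__()
        self.q = MixedProjection(dim)  # searched Query design
        self.k = MixedProjection(dim)  # searched Key design
        self.v = MixedProjection(dim)  # searched Value design
        self.scale = dim ** -0.5

    def forward(self, audio, video):
        # audio: (B, Ta, dim), video: (B, Tv, dim)
        q, k, v = self.q(audio), self.k(video), self.v(video)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v  # fused representation, shape (B, Ta, dim)


if __name__ == "__main__":
    block = MAFBlock(dim=64)
    audio = torch.randn(2, 10, 64)  # toy audio features
    video = torch.randn(2, 20, 64)  # toy video features
    print(block(audio, video).shape)  # torch.Size([2, 10, 64])
```

In a DARTS-style search, the `alpha` parameters would typically be updated on validation data while the projection weights are updated on training data (bilevel optimization); a single joint optimizer is a common simpler variant.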