Computer science
Task (project management)
Event (particle physics)
Modal verb
Representation (politics)
Block (permutation group theory)
Artificial intelligence
Human–computer interaction
Perception
Feature learning
Multi-task learning
Multimodal learning
Semantics (computer science)
Machine learning
Deep learning
Physics
Management
Quantum mechanics
Economics
Programming language
Chemistry
Geometry
Mathematics
Neuroscience
Politics
Political science
Polymer chemistry
Law
Biology
Authors
Han Liang, Jincai Chen, Fazlullah Khan, Gautam Srivastava, Jiangfeng Zeng
Abstract
Human perception heavily relies on two primary senses, vision and hearing, which are closely interconnected and capable of complementing each other. Consequently, various multimodal learning tasks have emerged, with audio-visual event localization (AVEL) being a prominent example. AVEL is a popular task within the realm of multimodal learning, whose primary objective is to identify the presence of events within each video segment and predict their respective categories. This task holds significant utility in domains such as healthcare monitoring and surveillance, among others. Generally speaking, audio-visual co-learning offers a more comprehensive information landscape than single-modal learning, as it allows for a more holistic perception of ambient information, aligning with real-world applications. Nevertheless, the inherent heterogeneity of audio and visual data can introduce challenges related to event-semantics inconsistency, potentially leading to incorrect predictions. To tackle these challenges, we propose a multi-task hybrid attention network (MHAN) to acquire high-quality representations for multimodal data. Specifically, our network incorporates hybrid attention of uni- and parallel cross-modal (HAUC) modules, which consist of a uni-modal attention block and a parallel cross-modal attention block, leveraging multimodal complementary and hidden information for better representation. Furthermore, we advocate the use of a uni-modal visual task as auxiliary supervision to enhance the performance of the multimodal task via a multi-task learning strategy. Extensive experiments on the AVE dataset show that our proposed model outperforms state-of-the-art results.
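To make the architecture described in the abstract more concrete, below is a minimal PyTorch sketch of how a uni-modal attention block and a parallel cross-modal attention block could be combined, together with an auxiliary visual-only prediction head for multi-task supervision. All module names, feature dimensions, the fusion step, and the class count are illustrative assumptions on my part, not the authors' released implementation.

```python
# Minimal sketch (assumptions only) of the hybrid attention idea from the abstract:
# a uni-modal self-attention block per modality plus a parallel cross-modal
# attention block in which audio attends to video and video attends to audio.
import torch
import torch.nn as nn


class UniModalAttention(nn.Module):
    """Self-attention over the temporal segments of a single modality."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, segments, dim)
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)  # residual connection


class ParallelCrossModalAttention(nn.Module):
    """Two cross-attention streams run in parallel: audio->video and video->audio."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.a_attends_v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v_attends_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        # Audio queries attend to visual keys/values, and vice versa.
        a_out, _ = self.a_attends_v(audio, video, video)
        v_out, _ = self.v_attends_a(video, audio, audio)
        return self.norm_a(audio + a_out), self.norm_v(video + v_out)


class HybridAttentionAVEL(nn.Module):
    """Toy AVEL head: per-segment event scores from fused audio-visual features,
    plus an auxiliary visual-only prediction for multi-task supervision."""

    def __init__(self, dim: int = 256, num_classes: int = 29):
        super().__init__()
        self.uni_audio = UniModalAttention(dim)
        self.uni_video = UniModalAttention(dim)
        self.cross = ParallelCrossModalAttention(dim)
        self.multimodal_head = nn.Linear(2 * dim, num_classes)
        self.visual_head = nn.Linear(dim, num_classes)  # auxiliary uni-modal task

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        audio = self.uni_audio(audio)
        video = self.uni_video(video)
        audio, video = self.cross(audio, video)
        fused = torch.cat([audio, video], dim=-1)
        return self.multimodal_head(fused), self.visual_head(video)


if __name__ == "__main__":
    model = HybridAttentionAVEL()
    a = torch.randn(2, 10, 256)  # 10 one-second audio segments (hypothetical features)
    v = torch.randn(2, 10, 256)  # 10 corresponding visual segments
    main_logits, aux_logits = model(a, v)
    print(main_logits.shape, aux_logits.shape)  # (2, 10, 29) for both heads
```

Under the multi-task strategy the abstract describes, training would presumably combine a classification loss on the audio-visual head with a weighted auxiliary loss on the visual-only head; the specific loss terms and weighting here are assumptions, not details given in the abstract.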