Computer science
Task (project management)
Event (particle physics)
Modal verb
Representation (politics)
Block (permutation group theory)
Artificial intelligence
Human–computer interaction
Perception
Feature learning
Multi-task learning
Multimodal learning
Semantics (computer science)
Machine learning
Deep learning
Physics
Management
Quantum mechanics
Economics
Programming language
Chemistry
Geometry
Mathematics
Neuroscience
Politics
Political science
Polymer chemistry
Law
Biology
Authors
Han Liang, Jincai Chen, Fazlullah Khan, Gautam Srivastava, Jiangfeng Zeng
Abstract
Human perception heavily relies on two primary senses, vision and hearing, which are closely interconnected and capable of complementing each other. Consequently, various multimodal learning tasks have emerged, with audio-visual event localization (AVEL) being a prominent example. AVEL is a popular task within the realm of multimodal learning, whose primary objective is to identify the presence of events within each video segment and predict their respective categories. This task holds significant utility in domains such as healthcare monitoring and surveillance, among others. Generally speaking, audio-visual co-learning offers a more comprehensive information landscape than single-modal learning, as it allows for a more holistic perception of ambient information, aligning with real-world applications. Nevertheless, the inherent heterogeneity of audio and visual data can introduce challenges related to event-semantics inconsistency, potentially leading to incorrect predictions. To tackle these challenges, we propose a multi-task hybrid attention network (MHAN) to acquire high-quality representations for multimodal data. Specifically, our network incorporates hybrid attention of uni- and parallel cross-modal (HAUC) modules, which consist of a uni-modal attention block and a parallel cross-modal attention block, leveraging multimodal complementary and hidden information for better representation. Furthermore, we advocate the use of a uni-modal visual task as auxiliary supervision to enhance the performance of the multimodal task via a multi-task learning strategy. Extensive experiments on the AVE dataset show that our proposed model outperforms state-of-the-art results.
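To make the architecture described in the abstract more concrete, below is a minimal PyTorch sketch of how a uni-modal attention block and a parallel cross-modal attention block could be combined, together with an auxiliary visual-only prediction head for multi-task supervision. All module names, feature dimensions, the fusion step, and the class count are illustrative assumptions on my part, not the authors' released implementation.

```python
# Minimal sketch (assumptions only) of the hybrid attention idea from the abstract:
# a uni-modal self-attention block per modality plus a parallel cross-modal
# attention block in which audio attends to video and video attends to audio.
import torch
import torch.nn as nn


class UniModalAttention(nn.Module):
    """Self-attention over the temporal segments of a single modality."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, segments, dim)
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)  # residual connection


class ParallelCrossModalAttention(nn.Module):
    """Two cross-attention streams run in parallel: audio->video and video->audio."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.a_attends_v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v_attends_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        # Audio queries attend to visual keys/values, and vice versa.
        a_out, _ = self.a_attends_v(audio, video, video)
        v_out, _ = self.v_attends_a(video, audio, audio)
        return self.norm_a(audio + a_out), self.norm_v(video + v_out)


class HybridAttentionAVEL(nn.Module):
    """Toy AVEL head: per-segment event scores from fused audio-visual features,
    plus an auxiliary visual-only prediction for multi-task supervision."""

    def __init__(self, dim: int = 256, num_classes: int = 29):
        super().__init__()
        self.uni_audio = UniModalAttention(dim)
        self.uni_video = UniModalAttention(dim)
        self.cross = ParallelCrossModalAttention(dim)
        self.multimodal_head = nn.Linear(2 * dim, num_classes)
        self.visual_head = nn.Linear(dim, num_classes)  # auxiliary uni-modal task

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        audio = self.uni_audio(audio)
        video = self.uni_video(video)
        audio, video = self.cross(audio, video)
        fused = torch.cat([audio, video], dim=-1)
        return self.multimodal_head(fused), self.visual_head(video)


if __name__ == "__main__":
    model = HybridAttentionAVEL()
    a = torch.randn(2, 10, 256)  # 10 one-second audio segments (hypothetical features)
    v = torch.randn(2, 10, 256)  # 10 corresponding visual segments
    main_logits, aux_logits = model(a, v)
    print(main_logits.shape, aux_logits.shape)  # (2, 10, 29) for both heads
```

Under the multi-task strategy the abstract describes, training would presumably combine a classification loss on the audio-visual head with a weighted auxiliary loss on the visual-only head; the specific loss terms and weighting here are assumptions, not details given in the abstract.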