Topics
Question answering
Modality (human–computer interaction)
Computer science
Audiovisual
Inference
Artificial intelligence
Natural language processing
Speech recognition
Pattern recognition
Multimedia
Authors
Kyu Ri Park, Youngmin Oh, Jung Uk Kim
Identifier
DOI:10.1109/icassp48485.2024.10446292
Abstract
We present a novel method for Audio-Visual Question Answering (AVQA) in real-world scenarios where one modality (audio or visual) may be missing. Inspired by human cognitive processes, we introduce a Trans-Modal Associative (TMA) memory that recalls missing modal information (i.e., a pseudo modal feature) by establishing associations between the available modal features and textual cues. During the training phase, we employ a Trans-Modal Recalling (TMR) loss to guide the TMA memory in generating a pseudo modal feature that closely matches the real modal feature. This allows our method to answer questions robustly even when one modality is missing during inference. We believe that our approach, which effectively copes with missing modalities, can be broadly applied to a variety of multimodal applications.
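The idea described in the abstract — addressing an associative memory with the available modality plus text to recall a pseudo feature for the missing modality, trained so the pseudo feature matches the real one — can be sketched minimally as follows. This is an illustrative NumPy sketch, not the paper's implementation: the slot-based key/value memory, the softmax addressing, the MSE form of the recall loss, and all names and dimensions are assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical associative memory: M learnable key/value slots of dim D.
# In training these would be optimized; here they are random placeholders.
D, M = 8, 16
keys = rng.standard_normal((M, D))
values = rng.standard_normal((M, D))

def recall_pseudo_feature(query):
    """Address the memory with a query built from the available modal
    feature and textual cues; return a pseudo feature for the missing
    modality as an attention-weighted sum of value slots."""
    attn = softmax(keys @ query)   # similarity-based slot addressing
    return attn @ values

def tmr_loss(pseudo, real):
    """Recall-style training loss pulling the pseudo feature toward the
    real modal feature (MSE chosen here purely for illustration)."""
    return float(np.mean((pseudo - real) ** 2))

# Example: visual modality missing at inference time.
query = rng.standard_normal(D)          # stands in for audio + text cues
real_visual = rng.standard_normal(D)    # available only during training
pseudo_visual = recall_pseudo_feature(query)
loss = tmr_loss(pseudo_visual, real_visual)
```

At inference the loss is unused: the recalled `pseudo_visual` simply substitutes for the missing modality in the downstream answering head.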