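Modality (human-computer interaction)
Computer science
Audio-visual
Feature (linguistics)
Artificial intelligence
Pattern
Speech recognition
Audio signal processing
Modal verb
Audio analyzer
Audio signal
Multimedia
Speech coding
Social science
Philosophy
Linguistics
Chemistry
Sociology
Polymer chemistry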
Authors
Jung Uk Kim, Seong Tae Kim
Identifier
DOI:10.1109/icassp49357.2023.10096773
Abstract
Although the audio modality has the potential to handle conditions that are visually challenging for the visual modality, there are few studies on audio-based detection. This is because the audio modality itself contains less accurate spatial information. To alleviate this issue, existing audio-based methods adopt the visual modality in the training phase to transfer more precise spatial knowledge to the audio modality. However, they do not consider the case where the visual modality is less informative. In this paper, we present a new audio-based vehicle detector that can transfer multimodal knowledge of vehicles to the audio modality during training. To this end, we combine the audio-visual modal knowledge according to the importance of each modality to generate an integrated audio-visual feature. We also introduce an audio-visual distillation (AVD) loss that guides the representation of the audio modal feature to resemble that of the integrated audio-visual feature. As a result, our audio-based detector can perform robust vehicle detection as if it were utilizing both modalities, even though it receives only the audio modality as input at inference. Comprehensive experimental results demonstrate that our method exhibits consistent improvements over the existing methods.
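The abstract does not give the exact form of the importance weighting or of the AVD loss, so the following is only a minimal sketch of the general idea under stated assumptions: the importance of each modality is taken to be a softmax over two scalar scores, the integrated feature is a weighted sum of the two modal features, and the distillation loss is a mean squared error between the audio feature and the integrated feature. The function names (`importance_weights`, `fuse`, `avd_loss`), the scores, and the toy feature vectors are all hypothetical and are not taken from the paper.

```python
import math


def importance_weights(audio_score, visual_score):
    """Softmax over two scalar importance scores (assumed form, not the paper's)."""
    m = max(audio_score, visual_score)  # subtract the max for numerical stability
    ea = math.exp(audio_score - m)
    ev = math.exp(visual_score - m)
    total = ea + ev
    return ea / total, ev / total


def fuse(audio_feat, visual_feat, w_audio, w_visual):
    """Element-wise weighted sum of two equally sized feature vectors."""
    return [w_audio * a + w_visual * v for a, v in zip(audio_feat, visual_feat)]


def avd_loss(audio_feat, fused_feat):
    """Mean squared error pulling the audio feature toward the fused feature."""
    return sum((a - f) ** 2 for a, f in zip(audio_feat, fused_feat)) / len(audio_feat)


# Toy features standing in for the outputs of audio and visual encoders.
audio_feat = [0.2, -0.5, 1.0, 0.3]
visual_feat = [0.6, -0.1, 0.8, -0.4]

# Hypothetical importance scores: here the visual modality is judged more informative.
w_audio, w_visual = importance_weights(audio_score=0.2, visual_score=1.0)
fused_feat = fuse(audio_feat, visual_feat, w_audio, w_visual)
loss = avd_loss(audio_feat, fused_feat)

print("weights:", w_audio, w_visual)
print("AVD-style loss:", loss)
```

In a training setup of this kind, the loss would be minimized with respect to the audio branch only, so that at inference the audio feature alone approximates the integrated feature. When the visual score is low (for example, in poor lighting), the integrated target stays close to the audio feature and the visual modality contributes little.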