Keywords
Audio-visual
Computer science
Artificial intelligence
Feature (linguistics)
Speech recognition
Audio signal
Visualization
Pattern recognition (psychology)
Object (grammar)
Key (lock)
Computer vision
Speech coding
Multimedia
Philosophy
Linguistics
Computer security
Authors
Ehsan Yaghoubi,André Peter Kelm,Timo Gerkmann,Simone Frintrop
Identifier
DOI:10.1145/3577190.3614144
Abstract
This paper introduces an unsupervised model for audio-visual localization, which aims to identify regions in the visual data that produce sounds. Our key technical contribution is to demonstrate that using distilled prior knowledge of both sounds and objects in an unsupervised learning phase can improve performance significantly. We propose an Audio-Visual Correspondence (AVC) model consisting of an audio and a vision student, which are respectively supervised by an audio teacher (audio recognition model) and a vision teacher (object detection model). Leveraging a contrastive learning approach, the AVC student model extracts features from sounds and images and computes a localization map, discovering the regions of the visual data that correspond to the sound signal. Simultaneously, the teacher models provide feature-based hints from their last layers to supervise the AVC model in the training phase. In the test phase, the teachers are removed. Our extensive experiments show that the proposed model outperforms the state-of-the-art audio-visual localization models on 10k and 144k subsets of the Flickr and VGGS datasets, including cross-dataset validation.
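The abstract describes a contrastive audio-visual correspondence objective in which student encoders also receive feature-based "hints" from frozen teacher models during training. Below is a minimal PyTorch sketch of such a training step; the module sizes, the InfoNCE-style contrastive formulation, the MSE hint losses, and the random tensors standing in for data and teacher features are all illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionStudent(nn.Module):
    """Image -> spatial feature map (B, D, H, W)."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1),
        )

    def forward(self, img):
        return self.net(img)

class AudioStudent(nn.Module):
    """Log-mel spectrogram -> one embedding per clip (B, D)."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, spec):
        return self.net(spec).flatten(1)

def localization_map(v_feat, a_emb):
    """Cosine similarity between the audio embedding and every spatial
    location of the visual feature map -> (B, H, W) sound-source heatmap."""
    v = F.normalize(v_feat, dim=1)
    a = F.normalize(a_emb, dim=1)
    return torch.einsum('bdhw,bd->bhw', v, a)

def avc_loss(v_feat, a_emb, v_teacher, a_teacher, tau=0.07):
    """Contrastive loss over matched/mismatched pairs plus feature-based
    hints from the frozen teachers; teachers are dropped at test time."""
    v = F.normalize(v_feat, dim=1)
    a = F.normalize(a_emb, dim=1)
    # Image-audio similarity: score of the best-matching spatial location.
    sim = torch.einsum('bdhw,cd->bchw', v, a).amax(dim=(2, 3))  # (B, B)
    labels = torch.arange(a_emb.size(0), device=sim.device)
    contrastive = F.cross_entropy(sim / tau, labels)
    # Hint losses; real teacher features would likely need a projection
    # layer to match shapes -- assumed identical here for brevity.
    hints = F.mse_loss(v_feat, v_teacher) + F.mse_loss(a_emb, a_teacher)
    return contrastive + hints

# Toy forward/backward pass with random tensors standing in for images,
# spectrograms, and the teachers' last-layer features.
v_net, a_net = VisionStudent(), AudioStudent()
img, spec = torch.randn(4, 3, 224, 224), torch.randn(4, 1, 64, 64)
v_feat, a_emb = v_net(img), a_net(spec)
loss = avc_loss(v_feat, a_emb, torch.randn_like(v_feat), torch.randn_like(a_emb))
loss.backward()
heat = localization_map(v_feat, a_emb)  # (4, 56, 56) localization map
```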