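Computer science
Task (project management)
Field (mathematics)
Fuse (electrical)
Code (set theory)
Projectile
Extension (predicate logic)
Image (mathematics)
Artificial intelligence
Domain (mathematical analysis)
Speech recognition
Audio analyzer
Pattern recognition (psychology)
Audio signal processing
Audio signal
Speech coding
Engineering
Mathematical analysis
Set (abstract data type)
Organic chemistry
Chemistry
Programming language
Systems engineering
Pure mathematics
Electrical engineering
Mathematics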
Authors
Andrey Guzhov, Federico Raue, J.J. van Hees, Andreas Dengel
Identifier
DOI:10.1109/icassp43922.2022.9747631
Abstract
The rapidly evolving field of sound classification has greatly benefited from the methods of other domains. Today, the trend is to fuse domain-specific tasks and approaches together, which provides the community with new outstanding models. We present AudioCLIP, an extension of the CLIP model that handles audio in addition to text and images. Utilizing the AudioSet dataset, our proposed model incorporates the ESResNeXt audio model into the CLIP framework, thus enabling it to perform multimodal classification while keeping CLIP's zero-shot capabilities. AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task and outperforms other approaches, reaching accuracies of 97.15% on ESC-50 and 90.07% on UrbanSound8K. Further, it sets new baselines in the zero-shot ESC task on the same datasets (69.40% and 68.78%, respectively). We also assess the influence of different training setups on the final performance of the proposed model. For the sake of reproducibility, our code is published.