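Computer science
Short-time Fourier transform
Initialization
Robustness (evolution)
Artificial intelligence
Speech recognition
Audio signal
Frequency domain
Transformation (genetics)
Wavelet
Pattern recognition (psychology)
Fourier transform
Speech coding
Fourier analysis
Computer vision
Mathematics
Biochemistry
Gene
Mathematical analysis
Chemistry
Programming language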
Authors
Andrey Guzhov, Federico Raue, J.J. van Hees, Andreas Dengel
Identifier
DOI:10.1109/ijcnn52387.2021.9533654
Abstract
Environmental Sound Classification (ESC) is a rapidly evolving field that has recently demonstrated the advantages of applying visual-domain techniques to audio-related tasks. Previous studies indicate that domain-specific modification of cross-domain approaches shows promise in pushing the whole area of ESC forward. In this paper, we present a new time-frequency transformation layer that is based on complex frequency B-spline (fbsp) wavelets. When used with a high-performance audio classification model, the proposed fbsp-layer provides an accuracy improvement over the previously used Short-Time Fourier Transform (STFT) on standard datasets. We also investigate the influence of different pre-training strategies, including the joint use of two large-scale datasets for weight initialization: ImageNet and AudioSet. Our proposed model outperforms other approaches by achieving accuracies of 95.20% on the ESC-50 and 89.14% on the UrbanSound8K datasets. Additionally, we assess the increase in model robustness, introduced by the proposed layer, against additive white Gaussian noise and reduction of the effective sample rate, and demonstrate that the fbsp-layer improves the model's ability to withstand signal perturbations in comparison to STFT-based training. For the sake of reproducibility, our code is made available.
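To make the two central ingredients of the abstract concrete, the following is a minimal NumPy sketch, not the authors' released code. It evaluates the textbook form of a complex frequency B-spline wavelet, psi(t) = sqrt(fb) * sinc(fb * t / m)^m * exp(2*pi*i*fc*t), and adds white Gaussian noise at a chosen signal-to-noise ratio. The function names `fbsp_wavelet` and `add_awgn`, the parameter names `m`, `fb`, `fc`, and the SNR-based noise scaling are assumptions made for illustration; the abstract does not specify how the layer parameterizes or learns these quantities, nor how noise levels were set.

```python
import numpy as np


def fbsp_wavelet(t, m=2, fb=1.0, fc=1.0):
    """Textbook complex frequency B-spline wavelet (illustrative only).

    m  : positive integer order (assumed name)
    fb : bandwidth parameter (assumed name)
    fc : center frequency (assumed name)
    """
    t = np.asarray(t, dtype=float)
    # np.sinc is the normalized sinc: sin(pi * x) / (pi * x).
    envelope = np.sqrt(fb) * np.sinc(fb * t / m) ** m
    # Complex exponential carrier at the center frequency.
    return envelope * np.exp(2j * np.pi * fc * t)


def add_awgn(signal, snr_db, rng=None):
    """Add white Gaussian noise so the result has roughly the given SNR in dB.

    The SNR-based scaling is an assumption for this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


# Usage: sample the wavelet on a time grid and perturb a test tone.
time_grid = np.linspace(-4.0, 4.0, 801)
psi = fbsp_wavelet(time_grid, m=2, fb=1.0, fc=1.5)
tone = np.sin(2 * np.pi * 5.0 * np.linspace(0.0, 1.0, 8000))
noisy_tone = add_awgn(tone, snr_db=10.0, rng=np.random.default_rng(0))
```

In a trainable layer one would presumably treat quantities such as `fb` and `fc` as learnable per-filter parameters and convolve the sampled wavelets with the waveform, but that design is a guess based on the abstract rather than a description of the published model.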