Multi-space channel representation learning for mono-to-binaural conversion based audio deepfake detection

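Keywords: Stereo recording; Computer science; Encoder; Channel (broadcasting); Speech recognition; Invariant (physics); Perception; Artificial intelligence; Mathematics; Telecommunications; Psychology; Neuroscience; Mathematical physics; Operating system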

Authors

Rui Liu, Jinhua Zhang, Guanglai Gao

Source

Journal: Information Fusion [Elsevier BV]
Date: 2024-01-21 Volume/Issue: 105: 102257-102257 Citations: 2

Identifiers

DOI: 10.1016/j.inffus.2024.102257

Abstract

Audio deepfake detection (ADD), an emerging topic, aims to detect fake audio generated by text-to-speech (TTS), voice conversion (VC), and similar techniques. Traditional approaches read the mono signal and analyze its artifacts directly. Recently, mono-to-binaural conversion based ADD has attracted increasing attention, since binaural audio signals provide a unique and comprehensive perspective on speech perception. Such methods first convert the mono audio into binaural audio, then process the left and right channels separately to discover authenticity cues. However, the acoustic information from the two channels exhibits both differences and similarities, which have not been thoroughly explored in previous research. To address this issue, we propose a new mono-to-binaural conversion based ADD framework that considers multi-space channel representation learning, termed "MSCR-ADD". Specifically, (1) the feature representations of the respective channels are learned by the channel-specific encoder and stored in the channel-specific space; (2) the feature representations capturing the difference between the two channels are learned by the channel-differential encoder and stored in the channel-differential space; (3) the channel-invariant encoder then learns the channel commonality representations in the channel-invariant space. We propose orthogonal and mutual information maximization losses to constrain the channel-specific and channel-invariant encoders. Finally, the three representations from these spaces are combined to make the final deepfake detection decision. The feature representations in the channel-differential and channel-invariant spaces unveil the differences and similarities between the two channels in binaural audio, enabling us to effectively detect artifacts in fake audio. Experimental results on four benchmark datasets demonstrate that MSCR-ADD is superior to existing state-of-the-art approaches.
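
To make the orthogonal constraint and the final combination step more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the loss formula or the fusion operator. The sketch assumes a common form of orthogonality penalty (the squared Frobenius norm of the cross-product between two feature matrices) and assumes that the representations are combined by concatenation. The names orthogonality_loss and fuse are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # rows = batch items, columns = feature dimensions


def orthogonality_loss(specific: Matrix, invariant: Matrix) -> float:
    """Assumed penalty: squared Frobenius norm of specific^T @ invariant.

    It is zero when every feature column of `specific` is orthogonal,
    over the batch, to every feature column of `invariant`.
    """
    if len(specific) != len(invariant):
        raise ValueError("both matrices need the same number of rows")
    n_rows = len(specific)
    total = 0.0
    for a in range(len(specific[0])):
        for b in range(len(invariant[0])):
            dot = sum(specific[n][a] * invariant[n][b] for n in range(n_rows))
            total += dot * dot
    return total


def fuse(specific_left: Matrix, specific_right: Matrix,
         differential: Matrix, invariant: Matrix) -> Matrix:
    """Assumed fusion: concatenate each item's features from all spaces."""
    return [
        sl + sr + d + i
        for sl, sr, d, i in zip(specific_left, specific_right,
                                differential, invariant)
    ]
```

In a trainable model the same penalty would be computed on encoder outputs with an automatic differentiation framework and added to the classification loss. The mutual information maximization loss is omitted here because the abstract does not describe which estimator is used.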
