Computer science
Pattern
Artificial intelligence
Face (sociological concept)
Exploit
Machine learning
Speech recognition
Social science
Computer security
Sociology
Authors
Yizhe Zhu,Jialin Gao,Xi Zhou
Identifier
DOI:10.1145/3591106.3592218
Abstract
Existing cross-dataset deepfake detection approaches exploit mouth-related mismatches between the auditory and visual modalities in fake videos to enhance generalisation to unseen forgeries. However, such methods inevitably suffer performance degradation when mouth motions are limited or unaltered; we argue that face forgery detection consistently benefits from high-level cues drawn from the whole face region. In this paper, we propose a two-phase audio-driven multi-modal transformer-based framework, termed AVForensics, which performs deepfake video content detection from an audio-visual matching view over the full face. In the first, pre-training phase, we apply a novel uniform masking strategy to model global facial features and learn temporally dense video representations in a self-supervised cross-modal manner, capturing the natural correspondence between the visual and auditory modalities without requiring large-scale labelled data or heavy memory usage. In the second phase, we fine-tune these learned representations for the downstream deepfake detection task, which encourages the model to offer accurate predictions based on the captured global facial movement features. Extensive experiments and visualisations on various public datasets demonstrate the superiority of our self-supervised pre-trained method in achieving generalisable and robust deepfake video detection.
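The abstract outlines a two-phase pipeline: masked, self-supervised audio-visual matching for pre-training, then supervised fine-tuning for real/fake classification. The PyTorch sketch below illustrates one plausible reading of that pipeline; every module name and dimension, the random token subset standing in for the paper's uniform masking strategy, and the InfoNCE-style matching loss are illustrative assumptions, not the actual AVForensics implementation.

```python
# Minimal sketch of the two-phase idea from the abstract (all details assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVEncoder(nn.Module):
    """Toy transformer encoder over a sequence of visual or audio tokens."""
    def __init__(self, dim=256, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens):                       # tokens: (B, T, dim)
        return self.encoder(tokens)

def uniform_mask(tokens, ratio=0.5):
    """Keep a uniform random subset of tokens per clip (assumption: a simple
    stand-in for the paper's uniform masking strategy)."""
    B, T, D = tokens.shape
    keep = max(1, int(T * (1 - ratio)))
    idx = torch.rand(B, T).argsort(dim=1)[:, :keep]  # random token indices
    idx = idx.sort(dim=1).values                     # preserve temporal order
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))

def pretrain_step(video_tokens, audio_tokens, v_enc, a_enc, temperature=0.07):
    """Phase 1: self-supervised audio-visual matching on masked video tokens.
    An InfoNCE contrastive loss is one plausible way to capture the natural
    correspondence between the two modalities."""
    v = v_enc(uniform_mask(video_tokens)).mean(dim=1)   # (B, D) clip embedding
    a = a_enc(audio_tokens).mean(dim=1)
    v, a = F.normalize(v, dim=-1), F.normalize(a, dim=-1)
    logits = v @ a.t() / temperature                    # (B, B) similarities
    targets = torch.arange(v.size(0))                   # matched pairs on diagonal
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

class DeepfakeHead(nn.Module):
    """Phase 2: binary real/fake classifier on top of the pre-trained encoder."""
    def __init__(self, encoder, dim=256):
        super().__init__()
        self.encoder, self.fc = encoder, nn.Linear(dim, 2)

    def forward(self, video_tokens):
        return self.fc(self.encoder(video_tokens).mean(dim=1))

# Smoke test with random features standing in for face and audio tokens.
v_enc, a_enc = AVEncoder(), AVEncoder()
video, audio = torch.randn(4, 16, 256), torch.randn(4, 16, 256)
loss = pretrain_step(video, audio, v_enc, a_enc)        # phase-1 objective
logits = DeepfakeHead(v_enc)(video)                     # phase-2 forward pass
print(loss.item(), logits.shape)                        # scalar loss, (4, 2)
```

Reusing `v_enc` inside `DeepfakeHead` mirrors the abstract's point that the fine-tuning phase inherits the representations learned during cross-modal pre-training rather than training a detector from scratch.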