Keywords
Computer science, Embedding, Artificial intelligence, Audio-visual, Focus (optics), Encoder, Exploit, Pattern recognition (psychology), Autoencoder, Modality, Feature extraction, Deep learning, Process (computing), Computer vision, Computer security, Multimedia, Physics, Chemistry, Polymer chemistry, Optics, Operating system
Authors
Miao Liu,Jing Wang,Xinyuan Qian,Haizhou Li
Identifier
DOI:10.1109/tcsvt.2023.3326694
Abstract
Audio-visual deepfake detection is the task of identifying deepfakes generated from both audio and visual content with AI algorithms. Most existing methods focus on overall authenticity while neglecting the temporal position of forgeries. This is particularly problematic because even a small alteration in a clip can significantly change its meaning. Such attacks are dangerous, and how to counter them remains an open question. In this paper, we present a novel neural network-based model for the temporal forgery detection (TFD) problem. It consists of new audio and visual encoders with cross-modal attention for embedding extraction, and an embedding-level fusion mechanism with self-attention for forgery localization. In addition, we propose a multi-dimensional contrastive loss that helps the model not only capture audio-visual inconsistency for deepfake detection but also exploit temporal inconsistency by coherently constraining the extracted embeddings. Extensive experiments on the LAV-DF dataset show that the presented method outperforms several state-of-the-art temporal forgery localization methods by up to 23.4% on AP@0.5 and 13.8% on AR@100. We also demonstrate the effectiveness of the proposed model on deepfake detection.
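The abstract outlines a pipeline of per-modality encoders that exchange information through cross-modal attention, an embedding-level fusion module with self-attention that yields per-frame forgery predictions, and a contrastive loss tying the audio and visual embeddings together. The paper's exact configuration is not given here, so the following is a minimal PyTorch sketch of that general pattern; all module names, dimensions, and the loss formulation (a standard margin-based pairwise loss standing in for the multi-dimensional contrastive loss) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalEncoder(nn.Module):
    """Sketch of an audio/visual encoder with cross-modal attention.

    Queries come from one modality, keys/values from the other, so each
    stream attends to temporally aligned cues in its counterpart.
    Layer sizes are illustrative, not the paper's configuration.
    """
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, other):
        # x, other: (batch, frames, dim) frame-level embeddings of two modalities
        attended, _ = self.cross_attn(query=x, key=other, value=other)
        return self.norm(x + attended)

class FusionLocalizer(nn.Module):
    """Embedding-level fusion with self-attention, producing per-frame logits."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)  # per-frame real/fake logit

    def forward(self, a_emb, v_emb):
        fused = self.fuse(torch.cat([a_emb, v_emb], dim=-1))
        fused, _ = self.self_attn(fused, fused, fused)
        return self.head(fused).squeeze(-1)  # (batch, frames)

def contrastive_av_loss(a_emb, v_emb, frame_labels, margin=0.5):
    """Illustrative stand-in for the multi-dimensional contrastive loss:
    pull audio/visual embeddings of real frames together and push those
    of forged frames at least `margin` apart.
    """
    dist = F.pairwise_distance(a_emb.flatten(0, 1), v_emb.flatten(0, 1))
    y = frame_labels.flatten().float()  # 1 = forged frame, 0 = real frame
    return ((1 - y) * dist.pow(2) + y * F.relu(margin - dist).pow(2)).mean()

# Usage sketch: a_feats / v_feats are (batch, frames, 256) frame features.
#   a_emb = CrossModalEncoder()(a_feats, v_feats)
#   v_emb = CrossModalEncoder()(v_feats, a_feats)
#   logits = FusionLocalizer()(a_emb, v_emb)  # thresholded for localization
```

Attending from each modality to the other is one plausible way to surface the audio-visual inconsistencies the abstract describes: frames where the two streams disagree should yield embeddings that the contrastive term pushes apart, making them easier for the localization head to flag.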