Keywords
Computer science
Interpretability
Robustness (evolution)
Temporal database
Discriminative model
Artificial intelligence
Exploit
Spatial analysis
Visualization
Machine learning
Data mining
Pattern recognition (psychology)
Remote sensing
Biochemistry
Chemistry
Computer security
Gene
Geology
Authors
Cairong Zhao,Chutian Wang,Guosheng Hu,Haonan Chen,Chun Liu,Jinhui Tang
Identifier
DOI: 10.1109/tifs.2023.3239223
Abstract
With the rapid development of Deepfake synthesis technology, our information security and personal privacy have been severely threatened in recent years. To achieve robust Deepfake detection, researchers attempt to exploit the joint spatial-temporal information in videos, for example by using recurrent networks and 3D convolutional networks. However, these spatial-temporal models still have room for improvement. Another general challenge for spatial-temporal models is that people do not clearly understand what these models actually learn. To address these two challenges, in this paper we propose an Interpretable Spatial-Temporal Video Transformer (ISTVT), which consists of a novel decomposed spatial-temporal self-attention and a self-subtract mechanism to capture spatial artifacts and temporal inconsistency for robust Deepfake detection. Thanks to this decomposition, we propose to interpret ISTVT by visualizing the discriminative regions along both spatial and temporal dimensions via a relevance (pixel-wise importance on the input) propagation algorithm. We conduct extensive experiments on large-scale datasets, including the FaceForensics++, FaceShifter, DeeperForensics, Celeb-DF, and DFDC datasets. Our strong performance on intra-dataset and cross-dataset Deepfake detection demonstrates the effectiveness and robustness of our method, and our visualization-based interpretability offers insight into our model.
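The two core ideas in the abstract, decomposed spatial-temporal self-attention (attending over spatial patches within each frame, then over time at each spatial location) and a self-subtract mechanism (differencing adjacent frames to expose temporal inconsistency), can be illustrated with a minimal numpy sketch. This is an assumption-laden simplification, not the paper's implementation: it uses a single head with identity query/key/value projections, and all shapes and function names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (..., n, d). Single-head scaled dot-product attention;
    # identity projections stand in for learned Q/K/V weights.
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores) @ x

def self_subtract(tokens):
    # tokens: (T, N, D) — T frames, N spatial patches, D channels.
    # Subtracting the previous frame's features highlights
    # frame-to-frame inconsistency; the first frame is kept as-is.
    out = tokens.copy()
    out[1:] = tokens[1:] - tokens[:-1]
    return out

def decomposed_st_attention(tokens):
    # Spatial attention: attend over the N patches inside each frame.
    spatial_out = self_attention(tokens)            # (T, N, D)
    # Temporal attention: attend over the T frames at each location.
    t_major = np.swapaxes(spatial_out, 0, 1)        # (N, T, D)
    temporal_out = self_attention(t_major)          # (N, T, D)
    return np.swapaxes(temporal_out, 0, 1)          # (T, N, D)

# Example: 4 frames, 9 patches per frame, 16 channels.
rng = np.random.default_rng(0)
clip = rng.standard_normal((4, 9, 16))
features = decomposed_st_attention(self_subtract(clip))
```

Decomposing full spatial-temporal attention this way drops the cost from O((TN)^2) to O(TN^2 + NT^2) per layer, which is the usual motivation for factorized video transformers.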