Combining EfficientNet and Vision Transformers for Video Deepfake Detection

07/06/2021
by Davide Coccomini, et al.

Deepfakes are the result of digital manipulation aimed at producing credible videos that deceive the viewer. They are created with deep learning techniques based on autoencoders or GANs, which have become more accessible and accurate year after year, resulting in fake videos that are very difficult to distinguish from real ones. Traditionally, convolutional neural networks (CNNs) have been used for deepfake detection, with the best results obtained by methods based on EfficientNet B7. In this study, we combine various types of Vision Transformers with a convolutional EfficientNet B0 used as a feature extractor, obtaining results comparable to those of some very recent methods that use Vision Transformers. Unlike the state-of-the-art approaches, we use neither distillation nor ensemble methods. The best model achieved an AUC of 0.951 and an F1 score of 88.0%, close to the state of the art on the DeepFake Detection Challenge (DFDC).
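
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how AUC and F1 can be computed for a binary real/fake classifier. It is not the authors' evaluation code: the function names, the convention that label 1 means "fake", the 0.5 decision threshold, and the toy scores are all assumptions made for illustration, and the numbers are unrelated to the DFDC results.

```python
from typing import Sequence


def f1_score(labels: Sequence[int], preds: Sequence[int]) -> float:
    """F1 for the positive class (assumed here: 1 = fake, 0 = real)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random fake outscores a random real.

    Ties between a fake and a real score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: per-video "fake" probabilities from some detector.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

# Assumed decision threshold of 0.5 to turn scores into hard predictions.
preds = [1 if s >= 0.5 else 0 for s in scores]

auc = roc_auc(labels, scores)
f1 = f1_score(labels, preds)
print(f"AUC = {auc:.3f}, F1 = {f1:.3f}")
```

AUC depends only on how the scores rank fake videos relative to real ones, whereas F1 depends on the chosen threshold, which is why the two are usually reported together.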
