Detecting Adversarial Attacks On Audio-Visual Speech Recognition
Adversarial attacks pose a threat to deep learning models. However, research on adversarial detection methods, especially in the multi-modal domain, remains very limited. In this work, we propose an efficient and straightforward detection method based on the temporal correlation between audio and video streams. The main idea is that the correlation between audio and video will be lower in adversarial examples than in benign examples because of the added adversarial noise. We use the synchronisation confidence score as a proxy for audio-visual correlation and detect adversarial attacks based on this score. To the best of our knowledge, this is the first work on detecting adversarial attacks on audio-visual speech recognition models. We apply recent adversarial attacks to two audio-visual speech recognition models trained on the GRID and LRW datasets. The experimental results demonstrate that the proposed approach is effective at detecting such attacks.
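The detection idea can be illustrated with a minimal sketch. The abstract does not specify how the synchronisation confidence score is computed or how the decision is made, so everything below is an assumption made for illustration: the score is taken as the median audio-video embedding distance over candidate temporal offsets minus the minimum distance, and an input is flagged when the score falls below a fixed threshold. The names `sync_confidence`, `is_adversarial`, and `threshold`, as well as the distance values, are hypothetical.

```python
from statistics import median
from typing import Sequence


def sync_confidence(offset_distances: Sequence[float]) -> float:
    """Illustrative confidence score (an assumption, not the paper's definition).

    offset_distances[i] is the audio-video embedding distance at the i-th
    candidate temporal offset. A sharp minimum (one offset far closer than
    the rest) gives a high score; a flat profile gives a low score.
    """
    if not offset_distances:
        raise ValueError("need at least one offset distance")
    return median(offset_distances) - min(offset_distances)


def is_adversarial(offset_distances: Sequence[float], threshold: float) -> bool:
    """Flag the input when its confidence falls below the (hypothetical) threshold."""
    return sync_confidence(offset_distances) < threshold


# Made-up distance profiles, for illustration only.
benign_profile = [9.1, 8.7, 3.2, 8.9, 9.4]   # clear minimum at one offset
suspect_profile = [8.8, 8.5, 8.1, 8.9, 9.0]  # nearly flat, weak correlation

print(is_adversarial(benign_profile, threshold=2.0))
print(is_adversarial(suspect_profile, threshold=2.0))
```

In practice the threshold would presumably be chosen on held-out benign and adversarial examples, trading false alarms against missed attacks; the value 2.0 here is arbitrary.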