Less is More: Understanding Word-level Textual Adversarial Attack via n-gram Frequency Descend
Word-level textual adversarial attacks have achieved striking performance in fooling natural language processing models. However, two fundamental questions remain poorly understood: why these attacks are effective, and what intrinsic properties the resulting adversarial examples (AEs) share. This work interprets textual attacks through the lens of n-gram frequency. Specifically, we show that existing word-level attacks exhibit a strong tendency to generate examples with n-gram frequency descend (n-FD), that is, examples whose n-grams are less frequent than those of the original text. This finding suggests a natural way to improve model robustness: train the model on n-FD examples. To verify this idea, we devise a model-agnostic and gradient-free AE generation approach that relies solely on n-gram frequency information, and integrate it into the recently proposed convex hull framework for adversarial training. Surprisingly, the resulting method yields model robustness comparable to that of the original gradient-based method. These findings provide a human-understandable perspective for interpreting word-level textual adversarial attacks, and a new direction for improving model robustness.
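To make the notion of n-FD concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes that frequency is measured as raw n-gram counts over a tokenized reference corpus, and that a substitution counts as n-FD when the summed counts of the n-grams covering the replaced position decrease; the exact definition used in the paper may differ. All function names (`count_ngrams`, `local_frequency`, `is_frequency_descend`, `pick_lowest_frequency`) are hypothetical and chosen for illustration only.

```python
from collections import Counter


def count_ngrams(corpus, n):
    """Count every contiguous n-gram (as a tuple of tokens) in a tokenized corpus."""
    counts = Counter()
    for tokens in corpus:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def local_frequency(tokens, pos, n, counts):
    """Sum the corpus counts of all n-grams in `tokens` that cover index `pos`.

    Assumption: this local sum stands in for the "n-gram frequency" of a
    sentence around the substituted word.
    """
    start = max(0, pos - n + 1)        # first n-gram that still covers pos
    end = min(pos, len(tokens) - n)    # last n-gram that fits in the sentence
    return sum(counts[tuple(tokens[i:i + n])] for i in range(start, end + 1))


def is_frequency_descend(tokens, pos, substitute, n, counts):
    """Return True if replacing tokens[pos] with `substitute` lowers the local frequency."""
    perturbed = tokens[:pos] + [substitute] + tokens[pos + 1:]
    before = local_frequency(tokens, pos, n, counts)
    after = local_frequency(perturbed, pos, n, counts)
    return after < before


def pick_lowest_frequency(tokens, pos, candidates, n, counts):
    """Gradient-free, model-agnostic choice: the candidate with the lowest local frequency."""
    def score(word):
        perturbed = tokens[:pos] + [word] + tokens[pos + 1:]
        return local_frequency(perturbed, pos, n, counts)
    return min(candidates, key=score)
```

Because the selection rule consults only corpus counts, it needs neither model gradients nor model queries, which is the sense in which such a generator is model-agnostic. In practice the candidate list would come from a synonym set, which this sketch leaves to the caller.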