Adversarial Self-Attention for Language Understanding
An ideal language system should generalize well and remain robust when adapted to diverse scenarios. Unfortunately, recent pre-trained language models (PrLMs), for all their promise, rarely go beyond stacking excessive parameters onto the already over-parameterized Transformer architecture to reach higher performance. This paper therefore proposes the Adversarial Self-Attention mechanism (ASA), which adversarially reconstructs the Transformer attention and trains the model against these contaminated model structures, coupled with a fast and simple implementation for better PrLM building. We conduct a comprehensive evaluation across a wide range of tasks in both the pre-training and fine-tuning stages. For pre-training, ASA yields remarkable performance gains over regular training when run for longer periods. For fine-tuning, ASA-empowered models consistently outperform naive models by a large margin in terms of both generalization and robustness.
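The abstract does not spell out the mechanism, but as a rough illustration, below is a minimal sketch of one way to perturb self-attention adversarially: an additive, gradient-based bias on the attention scores (inner maximization) followed by training on the perturbed attention (outer minimization). The function name adversarial_attention_step, the epsilon step size, and the toy classifier are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: an FGSM-style adversarial bias on the self-attention
# scores, then a training loss computed under that biased attention.
# Names and hyperparameters here are hypothetical, not the paper's.
import torch
import torch.nn.functional as F

def adversarial_attention_step(q, k, v, labels, classifier, epsilon=0.1):
    """Return the training loss computed under adversarially biased attention."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5      # (B, T, T) raw attention scores

    # Inner maximization: find a score bias that increases the task loss.
    delta = torch.zeros_like(scores, requires_grad=True)
    attn = F.softmax(scores + delta, dim=-1)
    loss = F.cross_entropy(classifier((attn @ v).mean(dim=1)), labels)
    grad, = torch.autograd.grad(loss, delta, retain_graph=True)
    delta_adv = epsilon * grad.sign()                # one FGSM-style ascent step

    # Outer minimization: the loss the model is actually trained on.
    attn_adv = F.softmax(scores + delta_adv, dim=-1)
    return F.cross_entropy(classifier((attn_adv @ v).mean(dim=1)), labels)

# Toy usage with random tensors, just to show the interface.
torch.manual_seed(0)
B, T, D, C = 2, 5, 16, 3
q, k, v = (torch.randn(B, T, D, requires_grad=True) for _ in range(3))
labels = torch.randint(0, C, (B,))
classifier = torch.nn.Linear(D, C)
adversarial_attention_step(q, k, v, labels, classifier).backward()
```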