Catch the "Tails" of BERT
Contextualized word embeddings have recently outperformed static word embeddings on many NLP tasks. However, we still know little about the mechanisms behind the internal representations produced by BERT. Do they share any common patterns? What is the relation between word sense and context? We find that nearly all contextualized word vectors of BERT and RoBERTa share some common patterns. For BERT, the 557th element is always the smallest. For RoBERTa, the 588th element is always the largest and the 77th element is always the smallest. We call these elements the "tails" of the models. We find that these "tails" are the major cause of the anisotropy of the vector space. After "cutting the tails", different vectors of the same word become more similar to each other, and the internal representations perform better on the word-in-context (WiC) task. These results suggest that "cutting the tails" reduces the influence of context and better represents word sense.
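The sketch below illustrates one plausible reading of "cutting the tails": zeroing out the reported outlier dimension of BERT's contextualized vectors before comparing two occurrences of the same word. The abstract does not specify the checkpoint, the layer, or whether the tail dimension is zeroed or dropped entirely, so the `bert-base-uncased` model, the use of the last hidden layer, and the `embed` helper are illustrative assumptions, not the paper's exact procedure.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

# The 557th element (0-indexed: 556) is the "tail" reported for BERT.
# Zeroing it is an assumption; the paper may remove the dimension instead.
TAIL_DIMS = [556]

def embed(sentence: str, word: str, cut_tails: bool = False) -> torch.Tensor:
    """Return the contextualized vector of `word` in `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, 768)
    # Locate the token position of `word` (simplified single-token lookup).
    word_id = tokenizer.convert_tokens_to_ids(word)
    pos = (enc["input_ids"][0] == word_id).nonzero()[0].item()
    vec = hidden[pos].clone()
    if cut_tails:
        vec[TAIL_DIMS] = 0.0                         # "cut the tails"
    return vec

# Same word "bank" in two different contexts: after cutting the tail
# dimension, the two vectors should have a higher cosine similarity.
v1 = embed("She sat on the bank of the river.", "bank", cut_tails=True)
v2 = embed("He walked along the bank after fishing.", "bank", cut_tails=True)
print(torch.cosine_similarity(v1, v2, dim=0).item())

Running the comparison once with cut_tails=False and once with cut_tails=True gives a simple way to reproduce the claimed effect of the tail dimension on same-word similarity.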