Automatic Creation of Named Entity Recognition Datasets by Querying Phrase Representations
Most weakly supervised named entity recognition (NER) models rely on domain-specific dictionaries provided by experts. This approach is infeasible in the many domains where such dictionaries do not exist. A recent study used a phrase retrieval model to automatically construct pseudo-dictionaries from entities retrieved from Wikipedia, but these dictionaries often have limited coverage because the retriever tends to return popular entities rather than rare ones. In this study, we present a phrase embedding search that efficiently creates high-coverage dictionaries. Specifically, reformulating natural language queries into phrase representations allows the retriever to search a space densely populated with diverse entities. In addition, we present a novel framework, HighGEN, that generates NER datasets from the high-coverage dictionaries obtained with the phrase embedding search. To reduce the noise in these dictionaries, HighGEN generates weak labels based on the distance between the embedding of a candidate phrase and that of the target entity type. We compare HighGEN with current weakly supervised NER models on six NER benchmarks and demonstrate the superiority of our models.
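The embedding-distance labeling described above can be illustrated with a minimal sketch. It is not the HighGEN implementation: the abstract does not say how an entity type is embedded or which distance is used, so the sketch assumes that a type embedding is the centroid of its dictionary entries' embeddings, that closeness is measured by cosine similarity, and that a fixed threshold decides whether a candidate receives a label. The function names, the threshold of 0.8, and the three-dimensional toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def weak_label(phrase_vec, type_vecs, threshold=0.8):
    """Return the most similar entity type, or "O" if none reaches the threshold.

    The threshold value is an assumption made for this illustration.
    """
    best_type, best_sim = "O", threshold
    for type_name, type_vec in type_vecs.items():
        sim = cosine(phrase_vec, type_vec)
        if sim >= best_sim:
            best_type, best_sim = type_name, sim
    return best_type


# Toy dictionary: each entity type maps to embeddings of its entries.
# Real phrase embeddings would come from a phrase retrieval model.
dictionary = {
    "DISEASE": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "CHEMICAL": [[0.1, 0.9, 0.0], [0.0, 0.8, 0.2]],
}

# Assumption: a type is represented by the centroid of its entries.
type_vecs = {name: centroid(vecs) for name, vecs in dictionary.items()}

# Toy candidate phrases with hand-made embeddings.
candidates = {
    "influenza": [0.85, 0.15, 0.05],
    "aspirin": [0.05, 0.9, 0.1],
    "yesterday": [0.3, 0.3, 0.9],
}

labels = {phrase: weak_label(vec, type_vecs) for phrase, vec in candidates.items()}
print(labels)
```

With these toy vectors, a candidate that lies far from every type centroid is left unlabeled, which is one simple way that distance-based labeling could filter noisy dictionary matches.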