Published April 12, 2021 | Version v1
Conference paper Open

Simple ways to improve NER in every language using markup

  • 1. University of La Rochelle, L3i, F-17000, La Rochelle, France
  • 2. Universite Paul Sabatier, IRIT, Toulouse

Description

We explore three methods for improving BERT-based Named Entity Recognition (NER) systems, each addressing one of three potential issues: the processing of uppercase tokens, the detection of entity boundaries, and low generalization. First, we mark uppercase tokens to provide extra casing information. Second, we randomly mask tokens, as in a masked language model, and predict them alongside the NER task to improve generalization. Third, we predict entity boundaries to improve named entity detection. The experiments cover five languages: Slovene, Croatian and Finnish, which are low-resourced, plus English and Spanish. Results show that predicting masked tokens can be beneficial for most languages, while marking uppercase tokens can be a simple method for dealing with uppercase sentences in NER. Furthermore, our methods improved the state of the art for Croatian and Finnish.
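The three preprocessing ideas named in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation. The marker tokens `<up>` and `</up>`, the 0.15 masking rate, the binary boundary scheme, and the function names `mark_uppercase`, `mask_tokens` and `boundary_labels` are all assumptions made for illustration; the paper may use different markup, rates and label sets.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical marker tokens; this record does not state which markup is used.
UP_OPEN, UP_CLOSE = "<up>", "</up>"
MASK = "[MASK]"  # mask token used by standard BERT vocabularies


def mark_uppercase(tokens: List[str]) -> List[str]:
    """Wrap each fully uppercase token longer than one character in markers."""
    marked: List[str] = []
    for tok in tokens:
        if len(tok) > 1 and tok.isupper():
            marked.extend([UP_OPEN, tok, UP_CLOSE])
        else:
            marked.append(tok)
    return marked


def mask_tokens(
    tokens: List[str], rate: float = 0.15, seed: int = 0
) -> Tuple[List[str], List[Optional[str]]]:
    """Randomly replace tokens with MASK; targets hold the original token
    at masked positions and None elsewhere (assumed rate, not the paper's)."""
    rng = random.Random(seed)
    masked: List[str] = []
    targets: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < rate:
            masked.append(MASK)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


def boundary_labels(bio_tags: List[str]) -> List[int]:
    """Derive an auxiliary binary target from BIO tags: 1 if the token is
    the first or last token of an entity, 0 otherwise (assumed scheme)."""
    n = len(bio_tags)
    labels: List[int] = []
    for i, tag in enumerate(bio_tags):
        starts = tag.startswith("B-")
        ends = tag != "O" and (i + 1 == n or not bio_tags[i + 1].startswith("I-"))
        labels.append(1 if starts or ends else 0)
    return labels
```

For example, `mark_uppercase(["BREXIT", "talks", "in", "Brussels"])` wraps only `BREXIT` in the markers and leaves the other tokens unchanged. In a real pipeline the markers would also need to be registered with the tokenizer and aligned with the NER labels, which this sketch leaves out.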

Files

paper2.pdf (593.8 kB)
md5:2caf34fe2dcdef4c3fdfd43af17a7223

Additional details

Funding

NewsEye: A Digital Investigator for Historical Newspapers (grant 770299), European Commission
EMBEDDIA: Cross-Lingual Embeddings for Less-Represented Languages in European News Media (grant 825153), European Commission