Simple ways to improve NER in every language using markup
- 1. University of La Rochelle, L3i, F-17000, La Rochelle, France
- 2. Universite Paul Sabatier, IRIT, Toulouse
Description
We explore three methods for improving Named Entity Recognition (NER) systems based on BERT, each addressing one of three potential issues: the processing of uppercase tokens, the detection of entity boundaries, and low generalization. First, we mark uppercase tokens to provide extra casing information. Second, we randomly mask tokens, as in a masked language model, and predict them alongside the NER task to improve generalization. Finally, we predict entity boundaries to improve named entity detection. The experiments cover five languages: Slovene, Croatian and Finnish, which are low-resourced, plus English and Spanish. Results show that predicting masked tokens can be beneficial for most languages, while marking uppercase tokens is a simple way of handling fully uppercase sentences in NER. Furthermore, our methods improve on the state of the art for Croatian and Finnish.
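As an illustration only, the three ideas in the description can be sketched at the data-preparation level. This is a minimal sketch under stated assumptions, not the authors' implementation: the marker strings `[UP]` and `[/UP]`, the function names, the masking probability, and the start/end flag encoding of boundaries are all hypothetical choices, and the input is assumed to be pre-tokenized text with well-formed BIO tags.

```python
import random

# Hypothetical marker and mask strings; the paper's actual markup may differ.
UP_OPEN, UP_CLOSE, MASK = "[UP]", "[/UP]", "[MASK]"


def mark_uppercase(tokens):
    """Wrap every fully uppercase token in marker tokens."""
    marked = []
    for tok in tokens:
        if tok.isupper():
            marked.extend([UP_OPEN, tok, UP_CLOSE])
        else:
            marked.append(tok)
    return marked


def mask_tokens(tokens, prob=0.15, rng=None):
    """Replace each token with MASK with probability `prob`.

    Returns the masked sequence and, per position, the original token
    to predict (None where nothing was masked).
    """
    rng = rng or random.Random()
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < prob:
            masked.append(MASK)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


def boundary_flags(bio_tags):
    """Derive (is_start, is_end) flags per token from well-formed BIO tags."""
    flags = []
    for i, tag in enumerate(bio_tags):
        is_start = tag.startswith("B-")
        has_next_inside = i + 1 < len(bio_tags) and bio_tags[i + 1].startswith("I-")
        is_end = tag != "O" and not has_next_inside
        flags.append((is_start, is_end))
    return flags


if __name__ == "__main__":
    sentence = ["PARIS", "is", "in", "France"]
    print(mark_uppercase(sentence))
    print(mask_tokens(sentence, prob=0.5, rng=random.Random(0)))
    print(boundary_flags(["B-LOC", "O", "O", "B-LOC"]))
```

In a multi-task setup of the kind the description suggests, the masked-token targets and the boundary flags would serve as auxiliary training labels next to the usual NER tags; how the losses are combined is left open here.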
Files
Name | Size |
---|---|
paper2.pdf (md5:2caf34fe2dcdef4c3fdfd43af17a7223) | 593.8 kB |