Simple ways to improve NER in every language using markup

doi:10.5281/zenodo.4680998

Published April 12, 2021 | Version v1

Conference paper | Open access

1. University of La Rochelle, L3i, F-17000, La Rochelle, France
2. Universite Paul Sabatier, IRIT, Toulouse

We explore three methods for improving BERT-based Named Entity Recognition (NER) systems, each addressing one of three potential issues: the processing of uppercase tokens, the detection of entity boundaries, and low generalization. First, we mark uppercase tokens to provide extra casing information. Second, we randomly mask tokens, as in a masked language model, and predict them alongside the NER task to improve generalization. Third, we predict entity boundaries to improve named entity detection. The experiments cover five languages: Slovene, Croatian and Finnish, which are low-resourced, plus English and Spanish. Results show that predicting masked tokens can be beneficial for most languages, while marking uppercase tokens can be a simple method for dealing with uppercase sentences in NER. Furthermore, our methods improved the state of the art for Croatian and Finnish.
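The three ideas in the abstract can be illustrated at the data-preparation level. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the marker strings `<up>` and `</up>`, the function names, the 15% masking rate, and the B/I/E/S/O boundary scheme are all hypothetical choices made here, since the abstract does not specify them.

```python
import random

# Hypothetical marker strings; the abstract does not say which markup is used.
UP_OPEN, UP_CLOSE = "<up>", "</up>"
MASK = "[MASK]"  # mask token used by BERT-style models


def mark_uppercase(tokens):
    """Wrap fully uppercase tokens in marker tokens (assumed markup)."""
    out = []
    for tok in tokens:
        # Assumption: single-letter tokens such as "I" are left unmarked.
        if tok.isupper() and len(tok) > 1:
            out.extend([UP_OPEN, tok, UP_CLOSE])
        else:
            out.append(tok)
    return out


def mask_tokens(tokens, prob=0.15, seed=None):
    """Randomly replace tokens with MASK.

    Returns the masked sequence and the prediction targets
    (the original token where masked, None elsewhere).
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < prob:
            masked.append(MASK)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


def boundary_labels(bio_tags):
    """Derive type-free boundary labels from BIO tags (assumed scheme).

    B = entity start, E = entity end, S = single-token entity,
    I = inside an entity, O = outside any entity.
    """
    labels = []
    for i, tag in enumerate(bio_tags):
        if tag == "O":
            labels.append("O")
            continue
        starts = tag.startswith("B-")
        nxt = bio_tags[i + 1] if i + 1 < len(bio_tags) else "O"
        ends = not nxt.startswith("I-")
        if starts and ends:
            labels.append("S")
        elif starts:
            labels.append("B")
        elif ends:
            labels.append("E")
        else:
            labels.append("I")
    return labels


tokens = ["NASA", "hired", "John", "Smith", "in", "Houston"]
tags = ["B-ORG", "O", "B-PER", "I-PER", "O", "B-LOC"]

marked = mark_uppercase(tokens)
masked, targets = mask_tokens(tokens, prob=0.15, seed=0)
boundaries = boundary_labels(tags)
```

In a multitask setup along these lines, the masked-token targets and the boundary labels would act as auxiliary prediction targets next to the NER tags. Note that inserting marker tokens lengthens the sequence, so the NER labels would need to be realigned before training.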

Files

Name	Size	Checksum
paper2.pdf	593.8 kB	md5:2caf34fe2dcdef4c3fdfd43af17a7223

Additional details

Funding

European Commission: NewsEye, A Digital Investigator for Historical Newspapers (grant 770299)
European Commission: EMBEDDIA, Cross-Lingual Embeddings for Less-Represented Languages in European News Media (grant 825153)
