Published January 15, 2022 | Version v1
Conference paper (Open Access)

Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages

  • University of Ljubljana, Ljubljana, Slovenia

Description

Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. While studies have shown that monolingual models produce better results than multilingual ones, the training datasets must be sufficiently large. We trained a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian. We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy. To analyze the importance of focusing on a single language and of a large training set, we compare the created models with existing monolingual and multilingual BERT models for Estonian, Latvian, and Lithuanian. The results show that the newly created LitLat BERT and Est-RoBERTa models improve on the existing models across all tested tasks in most situations.
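For readers who want to experiment with models of this kind, the snippet below is a minimal sketch of querying a BERT-style masked language model with the HuggingFace transformers library. The model identifier "EMBEDDIA/litlat-bert" and the example sentence are illustrative assumptions made for this sketch, not details taken from the paper; substitute whatever identifier the released model is actually published under.

# Minimal sketch: top-5 predictions for a masked token.
# The Hub identifier below is an assumption, not confirmed by the paper.
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_name = "EMBEDDIA/litlat-bert"  # assumed HuggingFace Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Build an input with a single masked position.
text = f"Vilnius is the capital of {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask and read off the five most probable fillers.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos].topk(5, dim=-1).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))

The same loading pattern applies to the downstream evaluations mentioned above (named entity recognition, POS tagging, dependency parsing), where the masked-LM head is replaced by a task-specific head during fine-tuning.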

Files

Ulčar_conf.pdf (297.4 kB)
md5:c4a07820b3ef1966cb2bcdc1fd7c3182

Additional details

Funding

EMBEDDIA – Cross-Lingual Embeddings for Less-Represented Languages in European News Media (grant agreement No. 825153), European Commission