Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages
Description
Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. While studies have shown that monolingual models produce better results than multilingual ones, the training datasets must be sufficiently large. We trained a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian. We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy. To analyze the importance of focusing on a single language and of a large training set, we compare the created models with existing monolingual and multilingual BERT models for Estonian, Latvian, and Lithuanian. The results show that the newly created LitLat BERT and Est-RoBERTa models improve on existing models in all tested tasks in most situations.
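The models described above can be exercised with a standard masked-language-modeling setup. The sketch below, using the Hugging Face `transformers` library, shows how one might load such a model and run fill-mask predictions; the Hub identifiers `EMBEDDIA/litlat-bert` and `EMBEDDIA/est-roberta` are assumptions and should be checked against the actual release.

```python
# Minimal sketch: masked-token prediction with one of the released models.
# The model identifiers below are assumed, not confirmed by this record.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_name = "EMBEDDIA/litlat-bert"  # assumed ID; e.g. "EMBEDDIA/est-roberta" for Estonian

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Fill-mask pipeline: predict the masked token in a Lithuanian sentence.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for prediction in fill_mask(f"Vilnius yra Lietuvos {tokenizer.mask_token}."):
    print(prediction["token_str"], round(prediction["score"], 3))
```

Using `tokenizer.mask_token` rather than a hard-coded `[MASK]` string keeps the example independent of whether the model uses BERT-style or RoBERTa-style special tokens.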
Files

Name | Size | MD5 |
---|---|---|
Ulčar_conf.pdf | 297.4 kB | c4a07820b3ef1966cb2bcdc1fd7c3182 |