Published November 4, 2025 | Version v1
Dataset Open

NTEU Multilingual Evaluation Dataset

  • 1. Barcelona Supercomputing Center

Description

Dataset Card for NTEU Multilingual Evaluation Dataset

Dataset Description

Dataset Summary

This machine translation evaluation dataset was created by the NTEU (Neural Translation for the EU) project. It includes around 1,000 parallel sentences in the 24 official languages of the European Union. The original NTEU dataset has been cleaned and filtered by removing empty lines and near-duplicates, and it has been augmented with Catalan. The Catalan version was produced manually by a native Catalan translator from the original English and Spanish versions, and was sponsored by the Aina project.

Supported Tasks and Leaderboards

This dataset can be used to evaluate bilingual and multilingual machine translation systems in the legal domain, for any combination of the 24 official EU languages and Catalan.

Languages

The languages included in the dataset are the following:

CODE  LANGUAGE    SCRIPT
bg    Bulgarian   Cyrillic
ca    Catalan     Latin
cs    Czech       Latin
da    Danish      Latin
de    German      Latin
el    Greek       Greek
en    English     Latin
es    Spanish     Latin
et    Estonian    Latin
fi    Finnish     Latin
fr    French      Latin
ga    Irish       Latin
hr    Croatian    Latin
hu    Hungarian   Latin
it    Italian     Latin
lt    Lithuanian  Latin
lv    Latvian     Latin
mt    Maltese     Latin
nl    Dutch       Latin
pl    Polish      Latin
pt    Portuguese  Latin
ro    Romanian    Latin
sk    Slovak      Latin
sl    Slovenian   Latin
sv    Swedish     Latin
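
To make the scope of "any combination" concrete, the following Python sketch enumerates the translation directions that the 25 language codes listed above allow. The function name `translation_directions` is hypothetical and used only for illustration; the codes are copied from the table.

```python
from itertools import permutations

# Language codes copied from the table above (24 official EU languages + Catalan).
LANG_CODES = [
    "bg", "ca", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "ga", "hr",
    "hu", "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
]


def translation_directions(codes):
    """Return every ordered (source, target) pair with source != target."""
    return list(permutations(codes, 2))


directions = translation_directions(LANG_CODES)
# 25 languages give 25 * 24 = 600 ordered source/target pairs.
print(len(LANG_CODES), "languages,", len(directions), "translation directions")
```

Ordered pairs are used because a system is usually evaluated separately in each direction (for example, `ca` to `en` and `en` to `ca`).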

Dataset Structure

Data Instances

A separate plain-text file is provided for each language, with sentences aligned in the same order across all files. Each file uses the two-letter language code of its language as the file extension.
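
As a minimal sketch of how the aligned files might be read, the following Python snippet pairs the sentences of two languages by position. It assumes one sentence per line, UTF-8 encoding, and a shared base name (the hypothetical `test`) with the language code as the extension. None of these details are confirmed by this card, so adjust them to the actual file names.

```python
from pathlib import Path


def read_sentences(path):
    """Read a text file and return its lines without trailing newlines.

    Assumption: each line holds exactly one sentence, encoded as UTF-8.
    """
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_parallel(data_dir, basename, src, tgt):
    """Pair source and target sentences by line position.

    `basename` is a hypothetical shared file stem; the files are assumed
    to be named `<basename>.<language code>`.
    """
    src_sents = read_sentences(Path(data_dir) / f"{basename}.{src}")
    tgt_sents = read_sentences(Path(data_dir) / f"{basename}.{tgt}")
    if len(src_sents) != len(tgt_sents):
        # A length mismatch means the two files cannot be aligned line by line.
        raise ValueError(
            f"{src} has {len(src_sents)} lines but {tgt} has {len(tgt_sents)}"
        )
    return list(zip(src_sents, tgt_sents))


# Example call (paths are placeholders):
# pairs = load_parallel("nteu_eval", "test", "en", "ca")
```

Failing loudly on a length mismatch is a deliberate choice: silently truncating to the shorter file would hide a misalignment and distort any evaluation score.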

Data Fields

[N/A]

Data Splits

The dataset contains a single split: Test.

Dataset Creation

Curation Rationale

The aim of this dataset is to promote the evaluation of machine translation systems for the official languages of the European Union, plus Catalan.

Source Data

Initial Data Collection and Normalization

The data was originally extracted from EUR-Lex, the official online database of European Union law and other public documents of the European Union (EU), published in the 24 official languages of the EU. The Official Journal (OJ) of the European Union is also published on EUR-Lex.

Who are the source language producers?

EUR-Lex

Annotations

Annotation process

The dataset does not contain any annotations.

Who are the annotators?

[N/A]

Personal and Sensitive Information

No specific anonymisation process has been applied, so personal and sensitive information may be present in the data. This should be taken into account when using the data to train models.

Considerations for Using the Data

Social Impact of Dataset

By providing this resource, we intend to promote the evaluation of machine translation systems that cover all the official EU languages and Catalan, thereby improving the accessibility and visibility of the Catalan language in Europe.

Discussion of Biases

No specific bias mitigation strategies were applied to this dataset. Inherent biases may exist within the data.

Other Known Limitations

The dataset covers the legal and administrative domain, so it is of limited use for evaluating systems in other domains.

Additional Information

Dataset Curators

Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es).

Funding

This work has been promoted and financed by the Government of Catalonia through the Aina project.

Licensing Information

This work is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence.

Citation Information

For more information about the NTEU Project, please refer to the following paper:

@inproceedings{bie-etal-2020-neural,
    title = "Neural Translation for the {E}uropean {U}nion ({NTEU}) Project",
    author = "Bi{\'e}, Laurent  and
      Cerd{\`a}-i-Cuc{\'o}, Aleix  and
      Degroote, Hans  and
      Estela, Amando  and
      Garc{\'i}a-Mart{\'i}nez, Mercedes  and
      Herranz, Manuel  and
      Kohan, Alejandro  and
      Melero, Maite  and
      O{'}Dowd, Tony  and
      O{'}Gorman, Sin{\'e}ad  and
      Pinnis, M{\={a}}rcis  and
      Rozis, Roberts  and
      Superbo, Riccardo  and
      Vasi{\c{l}}evskis, Art{\={u}}rs",
    editor = "Martins, Andr{\'e}  and
      Moniz, Helena  and
      Fumega, Sara  and
      Martins, Bruno  and
      Batista, Fernando  and
      Coheur, Luisa  and
      Parra, Carla  and
      Trancoso, Isabel  and
      Turchi, Marco  and
      Bisazza, Arianna  and
      Moorkens, Joss  and
      Guerberof, Ana  and
      Nurminen, Mary  and
      Marg, Lena  and
      Forcada, Mikel L.",
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.60/",
    pages = "477--478",
    abstract = "The Neural Translation for the European Union (NTEU) project aims to build a neural engine farm with all European official language combinations for eTranslation, without the necessity to use a high-resourced language as a pivot. NTEU started in September 2019 and will run until August 2021."
}

Contributions

[N/A]

Files

Files (5.7 MB)

MD5 checksum                          Size
md5:3426298e3362b04b057a36112c8efbc5  382.6 kB
md5:ef182cca7e2f4e0699f7bbedd24a9230  221.7 kB
md5:2ef9d769ab04517fa3bb831fa624fc33  210.5 kB
md5:cb65266dd3350179cb118ffedcdd402a  203.8 kB
md5:3f3593425467f412a387c5eea25507ff  229.9 kB
md5:08cf83d25db02226f632bc4db3a01491  409.6 kB
md5:e9d634ba2353552c06fbb3b3ad58eb3e  197.9 kB
md5:110d76576eaf9ac859fb075dfaeb604e  224.7 kB
md5:cdfaa674d977ea984b89bb11c82a4747  189.7 kB
md5:68b984a991bade396a3149af18c16221  213.9 kB
md5:86e083343c5d2f1c6b460568d2aa21e4  234.8 kB
md5:122f098c02097e4ab1566f6ae815ead9  233.8 kB
md5:aea11cc367e0bd2b77418b571c4e2d8a  193.2 kB
md5:99216fbd9a6a2343d52d9fccfa8d5fa4  237.2 kB
md5:fb12f7062ec0438ad1a2868836f2d9a8  217.8 kB
md5:1407b74875a468c7cc0acad91fada293  203.2 kB
md5:d711d319f6f71de4624900a8f4b74481  205.5 kB
md5:34301d4979676f9d9055cd4f2ad73aa5  224.7 kB
md5:d1de1e90d8ef8f76880326e44b996543  219.0 kB
md5:02888300d9a214afcc7f7676db833aee  218.3 kB
md5:4b08cf089702802b5220d7efed2ba206  220.1 kB
md5:420d38ddedf22b6365b77ab9df5032e8  230.8 kB
md5:7887a7e0bfa0d42f593d94271c06c1c1  210.7 kB
md5:5cc67a148b41d4191ef073c66eac21fe  188.4 kB
md5:684dd6379931a107c6ac84afa56985a5  202.7 kB
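
The md5 values listed above can be used to check that a downloaded file is intact. The following Python sketch, using only the standard library, shows one way to do this; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the file path in the example call is a placeholder because the listing does not show file names.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum, with or without an 'md5:' prefix."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Example call (the path is a placeholder for a downloaded file):
# ok = matches_checksum("downloaded_file", "md5:3426298e3362b04b057a36112c8efbc5")
```

MD5 is suitable here only as an integrity check against accidental corruption during download; it is not a safeguard against deliberate tampering.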