Dataset Open Access
Cartoni, Bruno;
Meyer, Thomas;
Koehn, Philipp
EUROPARL CORPUS - DIRECTIONAL SUB-CORPORA
Description
Version 6 of the Europarl corpus, as released by P. Koehn (http://www.statmt.org/europarl/), contains language tags that indicate the original language in which a speaker made a statement in the European Parliament, e.g.
<SPEAKER ID=6 LANGUAGE="IT" NAME="Segni">
Madam President, coinciding with this year's first part-session of the European Parliament, a date has been set, unfortunately for next Thursday, in Texas in America, for the execution of a young 34 year-old man who has been sentenced to death. We shall call him Mr Hicks.
In the corpus as distributed, however, such language tags are sparse. For segments without a tag, one cannot know what the original language was, or whether the segment was translated directly into the target language or via a pivot language.
Our directional extractions therefore account for two corrections:
Such directional corpora are a valuable resource for linguistic studies that want to account for translation variability and universals such as the explicitation hypothesis (see e.g. Cartoni et al., 2011). The datasets are also worthwhile for Machine Translation. Ozdowska (2009), for example, has shown that SMT systems trained on directional corpora might outperform ones trained on the original, parallel-only corpus.
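To make the tag format shown in the example above concrete, the following is a minimal Python sketch that reads the LANGUAGE attribute from SPEAKER tags and counts how many speaker turns carry one. The function names (`speaker_language`, `count_languages`) and the "UNTAGGED" label are illustrative assumptions, not part of the Europarl distribution or of the extraction procedure used for this dataset.

```python
import re
from collections import Counter

# Matches speaker tags such as <SPEAKER ID=6 LANGUAGE="IT" NAME="Segni">
SPEAKER_RE = re.compile(r'<SPEAKER\b[^>]*>')
LANGUAGE_RE = re.compile(r'\bLANGUAGE="([^"]*)"')


def speaker_language(line):
    """Return the LANGUAGE value of a SPEAKER tag.

    Returns "" if the tag has no LANGUAGE attribute, and None if the
    line is not a SPEAKER tag at all.
    """
    match = SPEAKER_RE.match(line.strip())
    if match is None:
        return None
    lang = LANGUAGE_RE.search(match.group(0))
    return lang.group(1) if lang else ""


def count_languages(lines):
    """Count speaker turns per original language; untagged turns are grouped."""
    counts = Counter()
    for line in lines:
        lang = speaker_language(line)
        if lang is not None:
            # "UNTAGGED" is a hypothetical label for turns without a language tag.
            counts[lang or "UNTAGGED"] += 1
    return counts
```

Running `count_languages` over the lines of one daily file gives a rough picture of how sparse the language tags are in that file.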
Citation
If you use Europarl together with our directional extractions in your research, please cite the following two papers:
@InProceedings{Koehn-Europarl-2005,
Author = {Koehn, Philipp},
Title = {Europarl: A Parallel Corpus for Statistical Machine Translation},
BookTitle = {Proceedings of MT Summit X},
address = {Phuket, Thailand},
Pages = {79--86},
year = {2005}
}
@InProceedings{Cartoni-Directional-2012,
Author = {Cartoni, Bruno and Meyer, Thomas},
Title = {Extracting Directional and Comparable Corpora from a Multilingual Corpus for Translation Studies},
BookTitle = {Proceedings 8th International Conference on Language Resources and Evaluation (LREC)},
address = {Istanbul, Turkey},
year = {2012}
}
Structure
In this archive, we provide the following directional corpus extractions based on Europarl Version 6.
EN_to_FR
FR_to_EN
EN_to_IT
IT_to_EN
FR_to_IT
IT_to_FR
EN_to_ES
ES_to_EN
FR_to_ES
ES_to_FR
EN_to_DE
DE_to_EN
EN_to_CZ
CZ_to_EN
Each language direction has its own directory, containing two sub-directories:
XX_source
YY_target
Each sub-directory contains the daily text files, with the names preserved from the original Europarl distribution. The file ending was changed from .txt to .ctags.out (corrected tags). The files, like the corpora overall, are smaller than the originals because they contain only the directly corresponding, translated statements.
For statistics, please see the LREC paper cited above.
Usage
To use the directional corpora for training Statistical Machine Translation systems, for example, you may want to:
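One possible first step when working with a direction directory is to pair each source file with the target file of the same name. The Python sketch below assumes that each direction directory (e.g. EN_to_FR) contains exactly one sub-directory ending in `_source` and one ending in `_target`, and that corresponding daily files share the same `.ctags.out` file name. The function names are illustrative, and whether the paired files are aligned line by line is not checked here.

```python
from pathlib import Path


def _find_subdir(direction_dir, suffix):
    """Return the first sub-directory whose name ends with the given suffix."""
    for path in sorted(direction_dir.iterdir()):
        if path.is_dir() and path.name.endswith(suffix):
            return path
    raise FileNotFoundError(f"no *{suffix} directory in {direction_dir}")


def paired_files(direction_dir):
    """Yield (source_file, target_file) pairs that share a file name."""
    direction_dir = Path(direction_dir)
    # Assumption: sub-directories follow the XX_source / YY_target naming.
    source_dir = _find_subdir(direction_dir, "_source")
    target_dir = _find_subdir(direction_dir, "_target")
    for source_file in sorted(source_dir.glob("*.ctags.out")):
        target_file = target_dir / source_file.name
        # Skip daily files that have no counterpart on the target side.
        if target_file.exists():
            yield source_file, target_file
```

For example, `list(paired_files("EN_to_FR"))` would return the daily file pairs for the English-to-French direction, assuming the archive has been unpacked into the current directory.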
Name | MD5 | Size |
---|---|---|
europarl-direct.tar.gz | 97141a9bae7b346ffd142e5fd9bd7d1b | 314.1 MB |
MD5SUM.TXT | f08c6c31f90fa78e2e4c69329043b4c9 | 57 Bytes |
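The MD5 checksum listed for europarl-direct.tar.gz can be used to check that a download is complete. The following Python sketch computes the checksum of a local file and compares it with the listed value. The function names and the assumption that the archive sits at a local path of your choosing are illustrative.

```python
import hashlib

# Checksum listed for europarl-direct.tar.gz in the file table.
EXPECTED_MD5 = "97141a9bae7b346ffd142e5fd9bd7d1b"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

Calling `archive_is_intact("europarl-direct.tar.gz")` after downloading should return `True` if the file was transferred without corruption.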