Dataset Open Access

Automatic translation and multilingual cultural heritage retrieval: a case study with transcriptions in Europeana (dataset)

Mónica Marrero; Antoine Isaac; Nuno Freire

The dataset contains all the data required to reproduce the experiments reported in the paper "Automatic translation and multilingual cultural heritage retrieval: a case study with transcriptions in Europeana", published at the 25th International Conference on Theory and Practice of Digital Libraries (TPDL'21). In that work we ran an experiment using the Europeana cultural heritage (CH) digital library as a use case, and evaluated the effectiveness of a multilingual information retrieval strategy that uses machine translation to English as the pivot language. We used the CEF translation service, eTranslation (https://ec.europa.eu/cefdigital/wiki/display/CEFDIGITAL/eTranslation), to translate queries and content to English.

The dataset is also available at https://rnd-2.eanadev.org/share/crosslingual-search/, and it is organized in four main folders:

  • queries: sample of 68 queries and their translations to English. The queries were issued in languages other than English from the Europeana Portal, within Europeana's 1914-1918 thematic collection, between January and August 2019.
  • transcriptions: sample of 18,257 transcriptions of handwritten documents and their translations to English. The transcriptions belong to the Europeana 1914-1918 thematic collection and were obtained from the Transcribathon crowdsourcing platform (https://europeana.transcribathon.eu/).
  • solr_configuration: Apache Solr search engine configuration used in the experiments (which replicates the one used in Europeana).
  • results: manual evaluation of the query translations, and automatic evaluation of the multilingual retrieval.
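
The folder layout described above can be checked after downloading the archive. The following is a minimal sketch, not part of the dataset, that counts the files under each folder of a zip file using Python's standard zipfile module. The function name count_files_by_folder and its depth parameter are hypothetical, and the internal layout of crosslingual-search.zip is an assumption: if the archive wraps the four folders in a single root directory, depth=2 may be needed instead of the default.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_files_by_folder(zip_path, depth=1):
    """Count the files under each folder of a zip archive.

    Hypothetical helper: `depth` is how many leading path components
    form the grouping key (1 = top-level folder).
    """
    counts = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip explicit directory entries
            folders = PurePosixPath(info.filename).parent.parts[:depth]
            # Files stored at the archive root are grouped under ".".
            key = "/".join(folders) if folders else "."
            counts[key] += 1
    return dict(counts)
```

Calling count_files_by_folder("crosslingual-search.zip") returns a dictionary that maps folder names to file counts. If the four folders sit at the archive root, the keys would be expected to include queries, transcriptions, solr_configuration and results.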

 

Files (34.2 MB)
Name: crosslingual-search.zip
Size: 34.2 MB
md5: bae6701105224fddc33da9907accd27e
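
The md5 value listed above can be used to confirm that a download is complete and uncorrupted. The following is a minimal sketch using Python's standard hashlib module. The function names md5_of_file and matches_published_checksum are hypothetical, and the local file path passed to them is an assumption about where the archive was saved.

```python
import hashlib

# Checksum published on this record for crosslingual-search.zip.
EXPECTED_MD5 = "bae6701105224fddc33da9907accd27e"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so the whole file is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, matches_published_checksum("crosslingual-search.zip") should return True for an intact copy of the archive saved in the current directory. MD5 is suitable here as an integrity check against transfer errors, not as protection against deliberate tampering.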
