Published April 30, 2021 | Version v1
Conference paper Open

Aligning Estonian and Russian news industry keywords with the help of subtitle translations and an environmental thesaurus

  • 1. Jožef Stefan Institute
  • 2. Ekspress Meedia, Estonia

Description

This paper applies the bilingual term alignment approach developed by Repar et al. (2019) to a dataset of unaligned Estonian and Russian keywords that journalists manually assigned to describe article topics. We first separated the dataset into Estonian and Russian tags according to whether they are written in Latin or Cyrillic script, and then selected the available language-specific resources that the alignment system requires. Although the domains of these resources (subtitles and environment) do not match the domain of the dataset (news articles), we achieved respectable results: manual evaluation indicates that almost three quarters of the aligned keyword pairs are at least partial matches.
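The script-based separation step described above can be illustrated with a minimal sketch. It is not the authors' code: the function names `script_of` and `split_by_script`, the sample keywords, and the handling of mixed or non-alphabetic tags are all assumptions made for illustration. A Latin-script tag is also not guaranteed to be Estonian (names and acronyms, for instance), so the split is only a first approximation.

```python
import unicodedata


def script_of(tag: str) -> str:
    """Classify a keyword as 'latin', 'cyrillic', 'mixed', or 'other'.

    Hypothetical helper: only alphabetic characters are inspected, and the
    script is read from the Unicode character name.
    """
    scripts = set()
    for ch in tag:
        if not ch.isalpha():
            continue  # ignore digits, spaces, and punctuation
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            scripts.add("cyrillic")
        elif name.startswith("LATIN"):
            scripts.add("latin")
        else:
            scripts.add("other")
    if len(scripts) == 1:
        return scripts.pop()
    # No letters at all counts as 'other'; several scripts count as 'mixed'.
    return "mixed" if scripts else "other"


def split_by_script(tags):
    """Split tags into (Latin-script, Cyrillic-script, unresolved) lists."""
    estonian, russian, unresolved = [], [], []
    for tag in tags:
        script = script_of(tag)
        if script == "latin":
            estonian.append(tag)
        elif script == "cyrillic":
            russian.append(tag)
        else:
            unresolved.append(tag)  # assumption: set aside for manual review
    return estonian, russian, unresolved


# Illustrative keywords only, not taken from the dataset.
est, rus, rest = split_by_script(["keskkond", "окружающая среда", "2021"])
```

Tags with no letters or with letters from both scripts end up in the third list rather than being forced into either language.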

Files

2021.hackashop-1.10.pdf (147.1 kB)
md5:74e8df3bb45e237e33cfafd9fe2677cd

Additional details

Funding

EMBEDDIA – Cross-Lingual Embeddings for Less-Represented Languages in European News Media (grant no. 825153)
European Commission