Published October 4, 2021 | Version v1
Conference paper | Open Access

Word-embedding based bilingual terminology alignment

  • 1. Jožef Stefan Institute
  • 2. University of Ljubljana, Ljubljana, Slovenia

Description

The ability to accurately align concepts between languages can provide significant benefits in many practical applications. In this paper, we extend a machine learning approach using dictionary and cognate-based features with novel cross-lingual embedding features derived from pretrained fastText embeddings. We use the VecMap tool to align the Slovenian and English embedding spaces and then, for every word, retrieve the three nearest words in the other language by cosine distance. These alignments are then used as features for the machine learning algorithm. With one configuration of the input parameters, we improved the overall F-score compared to previous work, while another configuration yielded higher precision (96%) at the cost of lower recall. Using embedding-based features in place of dictionary-based features provides a significant benefit: generating the Giza++ word alignment lists requires a large bilingual parallel corpus, whereas the embedding-based features need only two unrelated monolingual corpora and a small bilingual dictionary from which the embedding alignments are calculated.
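The nearest-neighbor step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the words and vectors are random placeholders standing in for fastText embeddings already mapped into a shared space with VecMap, and the function name `top_k_neighbors` is our own.

```python
import numpy as np

# Toy "aligned" embeddings. In practice these would be fastText vectors
# for Slovenian and English mapped into one space with VecMap; here they
# are random placeholders purely to demonstrate the lookup mechanics.
sl_words = ["miza", "stol", "okno"]
en_words = ["table", "chair", "window", "door"]
rng = np.random.default_rng(0)
sl_vecs = rng.normal(size=(len(sl_words), 8))
en_vecs = rng.normal(size=(len(en_words), 8))

def top_k_neighbors(src_vecs, tgt_vecs, k=3):
    """For each source-language vector, return the indices of the k
    nearest target-language vectors by cosine similarity (equivalently,
    smallest cosine distance)."""
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = src @ tgt.T                       # cosine similarity matrix
    return np.argsort(-sims, axis=1)[:, :k]  # top k per source word

for i, idx in enumerate(top_k_neighbors(sl_vecs, en_vecs, k=3)):
    print(sl_words[i], "->", [en_words[j] for j in idx])
```

The resulting index lists would then be turned into candidate-pair features (e.g. "target term appears among the source term's three nearest neighbors") for the classifier.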

Files

Repar_elex2021.pdf (421.6 kB)
md5:76c8f031012588059352d3796ea139de

Additional details

Funding

EMBEDDIA – Cross-Lingual Embeddings for Less-Represented Languages in European News Media (grant no. 825153), European Commission