Tigrinya Analogy Test for evaluating Word Embeddings

Gaim, Fitsum; Park, Jong C.

doi:10.5281/zenodo.7089244

Published May 24, 2022 | Version 1.0.0

Dataset Open

1. KAIST

This is a Tigrinya version of the Google Analogy Test set, which is used to evaluate English word-embedding models. The analogy test is a well-established strategy to empirically evaluate the quality of word-embedding models. More information about the English task can be found at the ACL Wiki.

The data was first machine translated and then manually verified by a native speaker to reduce errors.

Some aspects of the original analogy test are specific to English, such as those related to grammar or morphology, and may not transfer well to other languages. We therefore discarded examples that became irrelevant in Tigrinya when adapting the task. The resulting Tigrinya Analogy Test set has 18465 entries, compared with 19544 entries in the source English data (1079 entries dropped).

An entry is dropped if its translation leads to one of the following conditions:

The source word pair maps to a single Tigrinya word. For example, lucky and luckiest both correspond to ዕድለኛ.
A source word translates to a multi-word expression. For example, grandson (ወዲ ጓል / ወዲ ወዲ) and granddaughter (ጓል ጓል / ጓል ወዲ). This is because typical word-embedding approaches such as word2vec are not designed to predict multi-word phrases.
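The two drop conditions can be sketched as a small filter. This is only an illustration, not the authors' actual pipeline: the function name `keep_entry` is hypothetical, and it assumes each translated entry is available as four strings and that multi-word expressions are separated by whitespace.

```python
def keep_entry(translated):
    """Return True if a translated 4-word analogy entry survives both drop conditions.

    `translated` is assumed to be a sequence of four Tigrinya strings
    (a, b, c, d) for the analogy "a : b as c : d".
    """
    a, b, c, d = (w.strip() for w in translated)

    # Condition 1: a source word pair collapsed into a single Tigrinya word.
    if a == b or c == d:
        return False

    # Condition 2: a translation is a multi-word expression
    # (assumed here to mean it contains whitespace).
    if any(len(w.split()) != 1 for w in (a, b, c, d)):
        return False

    return True


# Illustrative calls using words that appear in this description.
print(keep_entry(["ሰብኣይ", "ሰበይቲ", "ወዲ", "ጓል"]))      # single words, distinct pairs
print(keep_entry(["ዕድለኛ", "ዕድለኛ", "ወዲ", "ጓል"]))      # pair collapsed to one word
print(keep_entry(["ሰብኣይ", "ሰበይቲ", "ወዲ ወዲ", "ጓል ጓል"]))  # multi-word translations
```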

Test Sections

The test includes a series of semantic and syntactic analogies divided up into subsections including world capitals, currencies, family, tense, and plurality. The test contains the following sections:

capital-world
currency
city-in-state
family
gram1-adjective-to-adverb
gram2-opposite
gram3-comparative
gram4-superlative
gram5-present-participle
gram6-nationality-adjective
gram7-past-tense
gram8-plural
gram9-plural-verbs
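As a rough sketch of how the sections might be read, the snippet below parses lines in the layout used by the English Google Analogy Test file, where a line such as `: family` opens a section and each following line holds four whitespace-separated words. Whether the Tigrinya file uses exactly this layout is an assumption, and `parse_analogy_lines` is a hypothetical helper name.

```python
from collections import defaultdict


def parse_analogy_lines(lines):
    """Group 4-word analogy questions by section.

    Assumes the Google Analogy Test layout: a line starting with ':'
    names a section, and each other non-empty line has four words.
    """
    sections = defaultdict(list)
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            continue
        words = line.split()
        if current is None or len(words) != 4:
            continue  # skip lines that do not fit the assumed layout
        sections[current].append(tuple(words))
    return dict(sections)


# Tiny in-memory sample built from the examples in this description.
sample = [
    ": capital-world",
    "ኣስመራ ኤርትራ ፓሪስ ፈረንሳ",
    ": family",
    "ሰብኣይ ሰበይቲ ወዲ ጓል",
]
for name, questions in parse_analogy_lines(sample).items():
    print(name, len(questions))
```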

Examples:

Semantic section on world capitals: “ኣስመራ: ኤርትራ as ፓሪስ: ?”. A model that answers correctly returns “ፈረንሳ”.
Semantic section on family: “ሰብኣይ: ሰበይቲ as ወዲ: ጓል”.
Syntactic section on tense: a sample analogy, shown here in English, is “Walk: Walked as Run: Ran”.
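Analogy questions of this kind are commonly answered with vector arithmetic: for "a : b as c : ?", the model returns the word whose vector is closest (by cosine similarity) to b - a + c, excluding the three question words. The sketch below illustrates that common approach under stated assumptions; the three-dimensional vectors are made up for illustration and do not come from any trained Tigrinya model, and `solve_analogy` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def solve_analogy(emb, a, b, c):
    """Answer "a : b as c : ?" with the vector-offset method."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    # The question words themselves are excluded from the candidates.
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))


# Toy embeddings (invented values, for illustration only).
toy_embeddings = {
    "ኣስመራ": [1.0, 0.0, 1.0],
    "ኤርትራ": [1.0, 0.0, 0.0],
    "ፓሪስ": [0.0, 1.0, 1.0],
    "ፈረንሳ": [0.0, 1.0, 0.0],
    "ወዲ": [0.7, 0.7, 0.2],
    "ጓል": [0.6, 0.8, 0.2],
}

print(solve_analogy(toy_embeddings, "ኣስመራ", "ኤርትራ", "ፓሪስ"))  # prints ፈረንሳ
```

With real embeddings the candidate set is the model's whole vocabulary, so a question containing an out-of-vocabulary word needs separate handling.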

Evaluation

The final accuracy of a model is the proportion of the questions that the model answers correctly.
Generally, a better-quality model would answer more questions correctly than a model of lower quality.
Note, however, that a model with low performance on this analogy test might still contain useful information, even if it is not robust or good enough for more complex tasks.
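The accuracy described above is the number of correctly answered questions divided by the total number of questions. A minimal sketch, assuming each question is a 4-tuple (a, b, c, expected) and that `predict` is any callable returning the model's answer for (a, b, c); both names are hypothetical:

```python
def analogy_accuracy(questions, predict):
    """Proportion of questions for which predict(a, b, c) equals the expected word."""
    if not questions:
        return 0.0
    correct = sum(
        1 for a, b, c, expected in questions if predict(a, b, c) == expected
    )
    return correct / len(questions)


# Stand-in "model": a lookup table that answers only one question.
answers = {("ኣስመራ", "ኤርትራ", "ፓሪስ"): "ፈረንሳ"}
questions = [
    ("ኣስመራ", "ኤርትራ", "ፓሪስ", "ፈረንሳ"),
    ("ሰብኣይ", "ሰበይቲ", "ወዲ", "ጓል"),
]
print(analogy_accuracy(questions, lambda a, b, c: answers.get((a, b, c))))  # prints 0.5
```

The same function can be applied to one section at a time to report per-section accuracy alongside the overall score.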

Limitations

The analogy test can be a good indicator of the quality of word embeddings, but it should be used with caution when comparing models trained on data from different domains. It should not be expected to generalize equally to all domains.
The final score can be affected by the size, vocabulary, and domain of the text on which the models are trained. For example, this may not be a good benchmark for comparing a model trained on news text with one trained on social media posts.
Even though a manual sanity check was performed, the semi-automatic construction of the Tigrinya test set might have introduced errors. If you discover any, you are welcome to contribute back by opening an issue at the GitHub repo, https://github.com/fgaim/tigrinya-analogy-test.

Citation

If you use this resource in your research, please cite it accordingly.

Files

Files (82.5 kB)

Name	Size	MD5
TigrinyaAnalogyTest.zip	82.5 kB	cafaf293227bf6d419738d03a6e9a462
