Published November 12, 2021 | Version 0.1
Dataset Open

Embedding Evaluation Data for South African Languages

  • 1. University of Pretoria, MasakhaneNLP
  • 2. Sol Plaatje University
  • 3. University of Pretoria
  • 4. CSIR

Description

WordSim and Simlex Data for South African Languages

  • Setswana
  • Sepedi

Dataset Information

The datasets (Simlex and WordSim) contain pairs of Setswana and Sepedi words to which human annotators have assigned similarity ratings as a measure of semantic relatedness. The word pairs in both datasets were manually translated from English into Setswana and Sepedi. The evaluation task measures the degree of correlation between the scores produced by a model and the human ratings. The model's score for each word pair is the cosine similarity of the two words' embedding vectors.
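The evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation script. The description does not name the correlation measure, so Spearman rank correlation is used here as an assumption (it is a common choice for word-similarity benchmarks). The embeddings, the word labels and the ratings below are hypothetical toy values, not entries from the dataset, and skipping pairs with out-of-vocabulary words is likewise an assumed policy. Loading the actual files is not shown because their column layout is not described here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(pairs, embeddings):
    """Correlate model cosine similarities with human ratings.

    Pairs containing a word without an embedding are skipped
    (an assumed policy, not one stated in the dataset description).
    Returns the correlation and the number of pairs actually used.
    """
    human, model = [], []
    for word1, word2, rating in pairs:
        if word1 in embeddings and word2 in embeddings:
            human.append(rating)
            model.append(cosine(embeddings[word1], embeddings[word2]))
    return spearman(human, model), len(human)


# Hypothetical toy data: placeholder labels, not real dataset entries.
toy_embeddings = {
    "word_a": [1.0, 0.0],
    "word_b": [0.9, 0.1],
    "word_c": [0.0, 1.0],
    "word_d": [0.5, 0.5],
}
toy_pairs = [
    ("word_a", "word_b", 9.0),
    ("word_a", "word_d", 5.0),
    ("word_a", "word_c", 1.0),
    ("word_a", "word_missing", 3.0),  # skipped: no embedding available
]

rho, n_used = evaluate(toy_pairs, toy_embeddings)
print(f"Spearman correlation over {n_used} pairs: {rho:.3f}")
```

Reporting the number of pairs actually used alongside the correlation is useful for low-resource languages, where a sizeable share of the word pairs may be missing from a model's vocabulary.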

Online Repository link

Authors

  • Vukosi Marivate - @vukosi
  • Valencia Wagner
  • Mack Makgatho
  • Tshephisho Sefara

See also the list of contributors who participated in this project.

Citing the dataset

To appear in conference proceedings

@article{Makgatho_Marivate_Sefara_Wagner_2022,
  title={Training Cross-Lingual embeddings for Setswana and Sepedi},
  volume={3},
  url={https://upjournals.up.ac.za/index.php/dhasa/article/view/3822},
  DOI={10.55492/dhasa.v3i03.3822},
  number={03},
  journal={Journal of the Digital Humanities Association of Southern Africa},
  author={Makgatho, Mack and Marivate, Vukosi and Sefara, Tshephisho and Wagner, Valencia},
  year={2022},
  month={Feb.}
}

Files

Files (243.8 kB)

Checksum                              Size
md5:af12e09dd0d3283b0d096223f12eeaf1  89.4 kB
md5:7c37269de1355cb108a35a2e57fc2e00  89.4 kB
md5:1ed2c48ecdcdeb5096ae115669db85cf  34.0 kB
md5:ffd0633f4fa2cdaa27c4e1b48e8df3cc  31.0 kB

Additional details

Related works

Is supplement to
Journal article: 10.55492/dhasa.v3i03.3822 (DOI)