Published March 19, 2022 | Version 1.0.0
Dataset Open

HornMT – Machine Translation Benchmark Dataset for Languages in the Horn of Africa

  • 1. Lesan
  • 2. Mermru

Description

The HornMT repository contains data and the associated metadata for the project Machine Translation Benchmark Dataset for Languages in the Horn of Africa. It is a multi-way parallel corpus that will serve as a benchmark to accelerate progress in machine translation research and production systems for languages in the Horn of Africa.

Supported Languages

Language    ISO 639-3 code
Afar        aar
Amharic     amh
English     eng
Oromo       orm
Somali      som
Tigrinya    tir

data/ contains one text file per language. Every file holds the same news snippets in the same order, so entries at the same position are translations of one another.

data
├── aar.txt
├── amh.txt
├── eng.txt
├── orm.txt
├── som.txt
└── tir.txt
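
Because the files share the same snippet order, they can be read side by side into aligned records. The following is a minimal sketch, not part of the dataset tooling. It assumes the files are UTF-8 encoded and store one snippet per line, which the description does not state explicitly; `load_parallel` and `LANGS` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# File stems as listed in the data/ directory tree.
LANGS = ["aar", "amh", "eng", "orm", "som", "tir"]


def load_parallel(data_dir, langs=LANGS):
    """Return a list of dicts, one per snippet, keyed by language code."""
    columns = {}
    for lang in langs:
        path = Path(data_dir) / f"{lang}.txt"
        # Assumption: UTF-8 text with one news snippet per line.
        with open(path, encoding="utf-8") as f:
            columns[lang] = [line.rstrip("\n") for line in f]

    # Alignment only makes sense if every file has the same number of lines.
    lengths = {lang: len(lines) for lang, lines in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"files are not aligned: {lengths}")

    count = next(iter(lengths.values()), 0)
    return [{lang: columns[lang][i] for lang in langs} for i in range(count)]
```

Calling `load_parallel("data")` would then give records such as `rows[0]["eng"]` and `rows[0]["som"]` for the same snippet. The length check is a deliberate choice: a silent mismatch would pair unrelated sentences.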

metadata.tsv contains tab-separated data describing each news snippet, with the following fields.

  • Scope - describes whether the news is global or local. It takes two values: Global news and Local news.

  • Category - News category, one of the following 12 topics:

    • Art and Culture

    • Business and Economy

    • Conflicts and Attacks

    • Disaster and Accidents

    • Entertainment

    • Environment

    • Health

    • International Relations

    • Law and Crime

    • Politics

    • Science and Technology

    • Sport

  • Source - List of one or more URLs from which the news content is extracted or on which it is based.

  • Domain - Domain name of the URL(s) in Source (for example, www.independent.co.uk).

  • Date - The publication date of the source article. The format is yyyy-mm-dd.
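
The metadata fields can be read with Python's standard csv module. This is a rough sketch under two assumptions that the description does not confirm: that metadata.tsv starts with a header row, and that the column names match the field names listed above (for example `Category`). The helper names `read_metadata`, `category_counts`, and `parse_date` are hypothetical.

```python
import csv
from collections import Counter
from datetime import date


def read_metadata(path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quote characters inside fields as literal text.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


def category_counts(rows, column="Category"):
    """Count snippets per category; the column name is an assumption."""
    return Counter(row[column] for row in rows)


def parse_date(text):
    """Parse a yyyy-mm-dd string into a datetime.date."""
    return date.fromisoformat(text)
```

If the real header uses different names, only the `column` argument needs to change. Rows come back in file order, so they can be matched by position with the snippets in data/.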

Other formats

The data and associated metadata are also available together in a single file, in the following formats.

HornMT.xlsx - data and associated metadata in xlsx format.

HornMT.json - data and associated metadata in json format.

Below is an example row.

{
  "data": {
    "eng": "The World Meteorological Organisation reports that the ozone layer is damaged to its worst extent ever in the Arctic.",
    "aaf": "Baad Metrolojih Eglali Areketekeh Addal Ozonih qelu faxe waktik lafetle calat biyakisem xayose.",
    "amh": "የአለም የአየር ንብረት ድርጅት በአርክቲክ አካባቢ ያለው የኦዞን ምንጣፍ ከፍተኛ ጉዳት እንደደረሰበት አስታወቀ፡፡",
    "orm": "Dhaabbanni Meetiroolojii Addunyaa baqqaanni oozonii Arkiitik keessatti gara sadarkaa isa hamaa haga ammaatti akka miidhame gabaase.",
    "som": "Ururka Saadaasha Hawada Adduunka ayaa ku warramaya in lakabka ozoneka ee Ka koreeya dhulka baraflayda uu waxyeelladii abid ugu darnaa soo gaadhay.",
    "tir": "ውድብ ሜትሮሎጂ ዓለም ኣብ ኣርክቲክ ዝርከብ ናሕሲ ኦዞን ኣዝዩ ብዝኸፍአ ደረጃ ከምዝተጎድአ ሓቢሩ፡፡"
  },
  "metadata": {
    "scope": "Global",
    "category": "Science and Technology",
    "source": "https://www.independent.co.uk/environment/climate-change/ozone-layer-damaged-by-unusually-harsh-winter-2263653.html",
    "domain": "www.independent.co.uk",
    "date": "2011-04-05"
  }
}
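
Since each row carries the same snippet in every language, a single row can be expanded into training or evaluation pairs for all translation directions. The sketch below assumes only the row shape shown in the example (a `data` object mapping language codes to text). It does not assume how rows are arranged at the top level of HornMT.json, and `translation_pairs` is a hypothetical helper.

```python
from itertools import permutations


def translation_pairs(record):
    """Yield (src_lang, tgt_lang, src_text, tgt_text) for every ordered pair.

    The language codes are taken from the record itself, so whatever keys
    appear under "data" are used as-is.
    """
    texts = record["data"]
    for src, tgt in permutations(texts, 2):
        yield src, tgt, texts[src], texts[tgt]
```

A row with six languages yields 6 × 5 = 30 ordered pairs. A row loaded with `json.loads` can be passed in directly, for example `list(translation_pairs(row))`.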

Team

Afar

  • Mohammed Deresa

  • Yasin Nur

Amharic

  • Tigist Taye

  • Selamawit Hailemariam

  • Wako Tilahun

Oromo

  • Gemechis Melkamu

  • Galata Girmaye

Somali

  • Abdiselam Mohamed

  • Beshir Abdi

Tigrinya

  • Berhanu Abadi Weldegiorgis

  • Michael Minassie

  • Nureddin Mohammedshiek

Project Leaders

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Notes

This work was carried out with support from Lacuna Fund, an initiative co‐founded by The Rockefeller Foundation, Google.org, and Canada's International Development Research Centre.

Files

data.zip

Files (5.2 MB)

MD5 checksum                        Size
c7965604c0a4a0536c70fb89b277a9cd    805.0 kB
b41b2f994219fa4f264ff181c674527f    2.9 MB
f419516a9a32abd5bfccb5783edce81b    1.3 MB
49f51161a120ce7237f11cab7b07c9ee    218.3 kB
e5d0c371092fafd69453a867dfc91095    4.0 kB
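
The MD5 checksums listed for the files can be used to confirm that a download is intact. The following is a small sketch using Python's hashlib. The listing above does not show which checksum belongs to which file, so the pairing has to be taken from the record page; `md5_of_file` and `matches_listed_md5` are hypothetical helper names.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a listed checksum, with or without an 'md5:' prefix."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use flat regardless of file size. MD5 is adequate here for detecting corrupted downloads, not for security purposes.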