Published July 27, 2021 | Version 1.0.0
Dataset Open

TLMD: Tigrinya Language Modeling Dataset

  • 1. KAIST

Description

A monolingual dataset built for Tigrinya language modeling. To the best of our knowledge, this is the largest dataset for Tigrinya of its kind. The data was collected from various sources across the web including news, blogs, and books. The largest portion of the data, ~75%, comes from over 2150 issues of the Haddas Ertra newspaper and other magazines published by www.shabait.com.

Data Statistics:

  • Total size: ~0.5GB
  • Around 40 million tokens
  • Over 2 million lines
  • 367 unique characters
  • Train split: 98%, 1.97 million lines
  • Validation split: 2%, 43k lines
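
Figures like those above can be recomputed on a local copy of the data. The sketch below is one way to do so, not the authors' counting method: it assumes tokens are whitespace-separated and that every character other than the trailing newline counts toward the unique-character total, so its numbers may differ from the ones reported. The function name `corpus_stats` is illustrative.

```python
from collections import Counter
from typing import Dict, Iterable


def corpus_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Count lines, whitespace-separated tokens, and distinct characters.

    Assumption: "token" means a whitespace-delimited chunk; the dataset
    description does not say how its token count was obtained.
    """
    n_lines = 0
    n_tokens = 0
    chars = Counter()
    for line in lines:
        line = line.rstrip("\n")       # ignore the line terminator
        n_lines += 1
        n_tokens += len(line.split())  # split on any run of whitespace
        chars.update(line)             # tally every remaining character
    return {"lines": n_lines, "tokens": n_tokens, "unique_chars": len(chars)}
```

Because the function accepts any iterable of strings, an open text file (read with `encoding="utf-8"`) can be passed to it directly, which streams the corpus without loading it into memory. The file names inside the archive are not listed here, so the path has to be taken from the extracted contents.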

We applied a lightweight cleanup to the data:
 - Removal of Tigrinya text with legacy and non-standard encoding systems
 - Normalization of punctuation and special characters
 - Removal of redundant white spaces and empty lines
 - Rejoining or fixing broken sentences when possible
 - Removal of foreign words
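
For readers who want to apply comparable steps to their own Tigrinya text, two of the operations listed above (punctuation normalization, and removal of redundant white space and empty lines) can be sketched as follows. This is not the authors' pipeline: the punctuation map is a hypothetical example, since the actual normalization rules are not given here, and the names `PUNCT_MAP` and `clean_lines` are illustrative.

```python
import re
from typing import Iterable, Iterator

# Hypothetical mapping (curly quotes to straight quotes). The rules actually
# used to build the dataset are not published in this description.
PUNCT_MAP = str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'"})

_WHITESPACE = re.compile(r"\s+")


def clean_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield normalized, non-empty lines."""
    for line in lines:
        line = line.translate(PUNCT_MAP)            # normalize punctuation
        line = _WHITESPACE.sub(" ", line).strip()   # collapse runs of whitespace
        if line:                                    # drop empty lines
            yield line
```

The generator processes one line at a time, so it can be chained over a large file without holding the whole corpus in memory.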

We deliberately avoid tokenization, extensive cleanup, and other preprocessing so as not to remove potentially useful information. Those decisions are left to the researchers and developers working on each use case.

This dataset is shared solely to advance research on natural language processing for Tigrinya. While the dataset authors do not claim any copyright on the content, some of the original sources may. To use the content for commercial purposes or to redistribute the data in other forms, permission must be obtained from the original owners, mainly shabait.com.

Files

tlmd_v1.0.0.zip (135.4 MB)
md5:689870ea077926df8fcd38cc8197f7cf
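
After downloading, the archive can be checked against the MD5 digest listed above. The following is a minimal sketch using Python's standard `hashlib` module; the function name `file_md5` and the chunk size are illustrative choices.

```python
import hashlib

# Digest as listed for tlmd_v1.0.0.zip on this page.
EXPECTED_MD5 = "689870ea077926df8fcd38cc8197f7cf"


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a large archive is never fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `file_md5("tlmd_v1.0.0.zip")` with `EXPECTED_MD5` (assuming the archive was saved under its original name in the working directory) confirms that the download is intact. MD5 detects accidental corruption only and is not a safeguard against deliberate tampering.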