TLMD: Tigrinya Language Modeling Dataset

Gaim, Fitsum; Yang, Wonsuk; Park, Jong C.

doi:10.5281/zenodo.5139094

Published July 27, 2021 | Version 1.0.0

Dataset Open

1. KAIST

A monolingual dataset built for Tigrinya language modeling. To the best of our knowledge, this is the largest dataset of its kind for Tigrinya. The data was collected from various sources across the web, including news, blogs, and books. The largest portion of the data, ~75%, comes from over 2,150 issues of the Haddas Ertra newspaper and other magazines published by www.shabait.com.

Data Statistics:

- Total size: ~0.5 GB
- Around 40 million tokens
- Over 2 million lines
- 367 unique characters
- Train split: 98%, 1.97 million lines
- Validation split: 2%, 43k lines
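
The figures above can be recomputed on the extracted text with a few lines of Python. This is only a minimal sketch: the record does not say how tokens or characters were counted, so the whitespace-based token count and the decision to count every character (spaces included) are assumptions, and the function name `corpus_stats` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def corpus_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Count lines, whitespace-separated tokens, and distinct characters.

    Assumption: a token is any run of non-whitespace characters, and the
    trailing newline of each line is not counted as a character.
    """
    n_lines = 0
    n_tokens = 0
    chars: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        n_lines += 1
        n_tokens += len(line.split())
        chars.update(line)
    return {"lines": n_lines, "tokens": n_tokens, "unique_chars": len(chars)}


# Small in-memory demo; for the real data, pass an open text file instead,
# e.g. open(path, encoding="utf-8"), where path points to an extracted file.
print(corpus_stats(["ሰላም ዓለም።\n", "ሰላም\n"]))
```

Because the function accepts any iterable of strings, it streams through a file handle line by line and does not need to load the whole ~0.5 GB corpus into memory.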

We have done a lightweight cleanup of the data:
- Removal of Tigrinya text with legacy and non-standard encoding systems
- Normalization of punctuation and special characters
- Removal of redundant white spaces and empty lines
- Rejoining or fixing broken sentences when possible
- Removal of foreign words
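
To give a feel for what a cleanup pass of this kind might look like, here is a hedged sketch covering only two of the steps listed above: punctuation normalization and removal of redundant white space and empty lines. It is not the authors' pipeline. The character mapping in `PUNCT_MAP` and the function name `clean_lines` are illustrative assumptions; the actual normalization rules used for TLMD are not given in this record.

```python
import re
from typing import Iterable, Iterator

# Hypothetical mapping for illustration only: typographic quotes to ASCII.
# The real normalization table used for the dataset is not documented here.
PUNCT_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})

_WHITESPACE = re.compile(r"\s+")


def clean_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield cleaned lines, skipping those that end up empty."""
    for line in lines:
        line = line.translate(PUNCT_MAP)            # normalize punctuation
        line = _WHITESPACE.sub(" ", line).strip()   # collapse white space
        if line:                                    # drop empty lines
            yield line


raw = ["  ሰላም   ዓለም  \n", "\n", "\u201cabc\u201d\t x\n"]
print(list(clean_lines(raw)))
```

The other steps (filtering legacy encodings, rejoining broken sentences, removing foreign words) depend on source-specific heuristics and are deliberately left out of this sketch.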

We avoid applying any form of tokenization, extensive cleanup, or preprocessing, so as not to remove potentially useful information. Those decisions are left to the researchers and developers, depending on their use case.

This dataset is shared solely to advance research on natural language processing for Tigrinya. While the dataset authors do not claim any copyright on the content, some of the original sources may. To use the content for commercial purposes or to redistribute the data in other forms, permission must be obtained from the original owners, mainly shabait.com.

Files

Name	Size	MD5
tlmd_v1.0.0.zip	135.4 MB	689870ea077926df8fcd38cc8197f7cf
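
After downloading, the archive can be checked against the MD5 value listed above. The sketch below uses only the Python standard library; the helper names `md5_of_file` and `verify_archive` are hypothetical, and the local path you pass in is whatever location you saved the file to.

```python
import hashlib

# MD5 checksum of tlmd_v1.0.0.zip as given in the file listing.
EXPECTED_MD5 = "689870ea077926df8fcd38cc8197f7cf"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_archive("tlmd_v1.0.0.zip")` returns True when a file of that name in the current directory matches the published checksum. MD5 is adequate here for detecting a corrupted or truncated download, not as a security guarantee.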
