TLMD: Tigrinya Language Modeling Dataset
Description
A monolingual dataset built for Tigrinya language modeling. To the best of our knowledge, this is the largest Tigrinya dataset of its kind. The data was collected from various sources across the web, including news, blogs, and books. The largest portion of the data, ~75%, comes from over 2,150 issues of the Haddas Ertra newspaper and other magazines published by www.shabait.com.
Data Statistics:
- Total size: ~0.5GB
- Around 40 million tokens
- Over 2 million lines
- 367 unique characters
- Train split: 98%, 1.97 million lines
- Validation split: 2%, 43k lines
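For readers who want to reproduce figures like these on their own copy of the data, here is a minimal sketch. It assumes tokens are whitespace-separated and that every character except the trailing newline counts toward the unique-character total; the source does not state how the published numbers were computed, so results may differ. The function name `corpus_stats` is hypothetical.

```python
def corpus_stats(lines):
    """Count lines, whitespace-separated tokens, and distinct characters.

    Assumption: tokens are split on whitespace and the trailing newline
    is excluded from the character set. The dataset authors' exact
    counting method is not documented.
    """
    n_lines = 0
    n_tokens = 0
    chars = set()
    for line in lines:
        line = line.rstrip("\n")
        n_lines += 1
        n_tokens += len(line.split())
        chars.update(line)  # adds each character of the line to the set
    return {"lines": n_lines, "tokens": n_tokens, "unique_chars": len(chars)}
```

Because the function accepts any iterable of strings, an open text file (opened with `encoding="utf-8"`) can be passed directly and is processed one line at a time without loading the whole corpus into memory.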
We have applied a lightweight cleanup to the data:
- Removal of Tigrinya text with legacy and non-standard encoding systems
- Normalization of punctuation and special characters
- Removal of redundant white spaces and empty lines
- Rejoining or fixing broken sentences when possible
- Removal of foreign words
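As an illustration of one of the steps above, the removal of redundant white spaces and empty lines could look roughly like the following sketch. This is not the authors' actual cleanup code, which is not published here; the function name `clean_lines` is hypothetical, and the other steps (encoding filtering, punctuation normalization, sentence repair, foreign-word removal) are deliberately left out because their rules are not specified.

```python
import re

# Matches any run of whitespace (spaces, tabs, newlines).
_WHITESPACE_RUN = re.compile(r"\s+")

def clean_lines(lines):
    """Yield lines with whitespace runs collapsed and empty lines dropped.

    Assumption: a single ASCII space is an acceptable replacement for
    every whitespace run; the original cleanup rules are not documented.
    """
    for line in lines:
        collapsed = _WHITESPACE_RUN.sub(" ", line).strip()
        if collapsed:  # skip lines that were empty or whitespace-only
            yield collapsed
```

Writing it as a generator keeps memory use flat, which matters for a corpus with over 2 million lines.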
We avoid applying any form of tokenization, extensive cleanup, or preprocessing so as not to remove potentially useful information. Those decisions are left to the researchers or developers, depending on their use case.
This dataset is shared solely to advance research on natural language processing for Tigrinya. While the dataset authors do not claim any copyright on the content, some of the original sources may. To use the content for commercial purposes or to redistribute the data in other forms, permission must be obtained from the original owners, mainly shabait.com.
Files
- tlmd_v1.0.0.zip (135.4 MB), md5:689870ea077926df8fcd38cc8197f7cf
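After downloading, the archive can be checked against the MD5 checksum listed above. The following is a minimal sketch using Python's standard `hashlib` module; the helper name `md5_of_file` and the chunk size are illustrative choices, not part of the dataset's tooling.

```python
import hashlib

# Checksum published alongside tlmd_v1.0.0.zip.
EXPECTED_MD5 = "689870ea077926df8fcd38cc8197f7cf"

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Assuming the archive was saved in the current directory under its original name, `md5_of_file("tlmd_v1.0.0.zip") == EXPECTED_MD5` should evaluate to `True` for an intact download.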