Language modeling data for Swahili

Shivachi Casper Shikali; Mokhosi Refuoe

doi:10.5281/zenodo.3553423

Published November 26, 2019 | Version 1

Dataset Open

1. University of Electronic Science and Technology of China

This Swahili dataset was developed specifically for the language modeling task. It contains 28,000 unique words, with 6.84M, 970k, and 2M words in the train, valid, and test partitions respectively, a split given as 80:10:10. The entire dataset is lowercased and has no punctuation marks, and start-of-sentence and end-of-sentence markers have been added to facilitate tokenization during language modeling. The train partition is the largest in order to support unsupervised learning of word representations; hyper-parameters are tuned on the valid partition before the language model is evaluated on the test partition.
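The word and vocabulary counts described above can be checked with a short script. The following is a minimal sketch, not the authors' tooling: it assumes each partition is a plain-text, whitespace-tokenized file, and it assumes the sentence markers are `<s>` and `</s>`, which the description does not confirm. The function names and any file path passed to them are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

# Assumed marker strings; the dataset description does not say which symbols are used.
SENTENCE_MARKERS = {"<s>", "</s>"}


def corpus_stats(lines: Iterable[str]) -> Tuple[int, Counter]:
    """Count whitespace-separated tokens, skipping the assumed sentence markers."""
    counts = Counter()
    for line in lines:
        counts.update(tok for tok in line.split() if tok not in SENTENCE_MARKERS)
    return sum(counts.values()), counts


def stats_for_file(path: str) -> Tuple[int, int]:
    """Return (total words, unique words) for one partition file (path is hypothetical)."""
    with open(path, encoding="utf-8") as handle:
        total, counts = corpus_stats(handle)
    return total, len(counts)
```

Calling `stats_for_file` on each extracted partition gives the total and unique word counts to compare against the figures quoted in the description. If the markers in the files differ from the assumed ones, adjust `SENTENCE_MARKERS` accordingly.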

Files

Files (2.8 MB)

Name	Size	MD5
Swahili data.zip	2.8 MB	410e7afa3f997ac9ac0c9887bf201e9f
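The MD5 value in the file listing can be used to confirm that a downloaded copy of the archive is intact. The sketch below uses only Python's standard `hashlib` module; the function names are illustrative, and the local path to the archive is whatever the file was saved as.

```python
import hashlib

# Checksum copied from the file listing for "Swahili data.zip".
EXPECTED_MD5 = "410e7afa3f997ac9ac0c9887bf201e9f"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True when the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `is_intact("Swahili data.zip")` returns `True` only if the local file matches the published checksum.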

	All versions	This version
Views	1,784	1,781
Downloads	903	894
Data volume	2.8 GB	2.8 GB

Resource type

Dataset

Publisher

Zenodo

Languages

English

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: November 26, 2019
Modified: January 24, 2020
