Dataset Open Access

WiLI-2018 - Wikipedia Language Identification database

Thoma, Martin

WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs in 235 languages. The dataset is balanced (1,000 paragraphs per language), and a train-test split is provided.
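
As a rough illustration of what "balanced" means here, the following sketch checks that every label in a list occurs equally often and works out the per-language count from the totals stated above. The helper names `is_balanced` and `paragraphs_per_language` are hypothetical, and the sketch assumes the labels have already been loaded into a Python list; it does not assume anything about the file layout inside the archive.

```python
from collections import Counter


def is_balanced(labels):
    """Return True if every distinct label occurs the same number of times."""
    counts = Counter(labels)
    # A balanced (or empty) label list has at most one distinct count value.
    return len(set(counts.values())) <= 1


def paragraphs_per_language(total_paragraphs, num_languages):
    """Return the per-language count, or raise if the total does not divide evenly."""
    quotient, remainder = divmod(total_paragraphs, num_languages)
    if remainder != 0:
        raise ValueError("total is not evenly divisible by the number of languages")
    return quotient


if __name__ == "__main__":
    # Totals taken from the dataset description: 235,000 paragraphs, 235 languages.
    print(paragraphs_per_language(235000, 235))
```

Running `is_balanced` over the labels of each split separately would show whether the provided train-test split is balanced as well.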

See the paper "The WiLI benchmark dataset for written language identification" (soon on arXiv) for more information.

Files (62.4 MB)
Name: wili-2018.zip
Size: 62.4 MB
md5: 3dc5bd41587811ad6b0d04ae2f235f84
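
The md5 checksum listed for wili-2018.zip can be used to confirm that a download is complete and uncorrupted. A minimal sketch follows, assuming the archive has been saved locally; the helper names `md5_of_file` and `verify_download` and the local path are illustrative, not part of the dataset.

```python
import hashlib

# Checksum as listed for wili-2018.zip in the Files section.
EXPECTED_MD5 = "3dc5bd41587811ad6b0d04ae2f235f84"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's md5 matches the expected hex digest."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_download("wili-2018.zip")` returns True only if the local file matches the listed checksum. md5 is suitable here for detecting accidental corruption, not for defending against deliberate tampering.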
Statistics         All versions   This version
Views              5,186          5,184
Downloads          3,384          3,384
Data volume        211.2 GB       211.2 GB
Unique views       4,552          4,551
Unique downloads   2,200          2,200
