Dataset Open Access

WiLI-2018 - Wikipedia Language Identification database

Thoma, Martin

WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs in 235 languages. The dataset is balanced, and a train-test split is provided.

See the paper "The WiLI benchmark dataset for written language identification" (soon on arXiv) for more information.

Files (62.4 MB)
Name: wili-2018.zip
Size: 62.4 MB
md5: 3dc5bd41587811ad6b0d04ae2f235f84
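The MD5 checksum listed for wili-2018.zip can be used to confirm that a download arrived intact. The following is a minimal sketch, not an official tool for this dataset: the local path `wili-2018.zip` and the helper names `md5_of_file` and `verify_download` are assumptions chosen for illustration.

```python
import hashlib
from pathlib import Path

# Checksum as published in the file listing for wili-2018.zip.
EXPECTED_MD5 = "3dc5bd41587811ad6b0d04ae2f235f84"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so the whole archive is never held in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches `expected` (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Hypothetical location: assumes the archive was saved to the working directory.
    archive = Path("wili-2018.zip")
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the archive again is the simplest remedy.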
                    All versions    This version
Views               1,185           1,185
Downloads           468             468
Data volume         29.2 GB         29.2 GB
Unique views        1,063           1,063
Unique downloads    399             399
