Dataset Open Access

WiLI-2018 - Wikipedia Language Identification database

Thoma, Martin

WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs in 235 languages. The dataset is balanced (1,000 paragraphs per language), and a train-test split is provided.

See the paper "The WiLI benchmark dataset for written language identification" (soon on arXiv) for more information.

Files (62.4 MB)
Name: wili-2018.zip
Size: 62.4 MB
md5: 3dc5bd41587811ad6b0d04ae2f235f84
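
The md5 checksum listed above can be used to confirm that a downloaded copy of the archive is intact. The following is a minimal sketch, not part of the dataset itself: the function names `md5_of_file` and `verify_download` are hypothetical, and it assumes the archive was saved locally under its original name, wili-2018.zip.

```python
import hashlib

# Checksum published on the dataset page for wili-2018.zip.
EXPECTED_MD5 = "3dc5bd41587811ad6b0d04ae2f235f84"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so the whole archive is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_download("wili-2018.zip")` returns `True` only when the local file matches the published checksum; a `False` result suggests an incomplete or corrupted download.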
                   All versions   This version
Views              892            892
Downloads          366            366
Data volume        22.8 GB        22.8 GB
Unique views       806            806
Unique downloads   310            310

Cite as