Dataset Open Access
Thoma, Martin
WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs in 235 languages. The dataset is balanced (1,000 paragraphs per language), and a train-test split is provided.
See the paper "The WiLI benchmark dataset for written language identification" (soon on arXiv) for more information.
Name | Size | |
---|---|---|
wili-2018.zip (md5:3dc5bd41587811ad6b0d04ae2f235f84) | 62.4 MB | Download |
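
The MD5 checksum listed with the archive can be used to confirm that a download is intact. The following is a minimal sketch, assuming the archive was saved as `wili-2018.zip` in the current working directory; the helper name `md5_of_file` and the local path are illustrative choices, not part of the dataset.

```python
import hashlib
from pathlib import Path

# Checksum as published in the file listing for wili-2018.zip.
EXPECTED_MD5 = "3dc5bd41587811ad6b0d04ae2f235f84"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    archive = Path("wili-2018.zip")  # assumed local download location
    if archive.exists():
        actual = md5_of_file(archive)
        print("OK" if actual == EXPECTED_MD5 else f"MISMATCH: {actual}")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the archive again is the simplest remedy.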
Metric | All versions | This version |
---|---|---|
Views | 6,601 | 6,599 |
Downloads | 3,972 | 3,972 |
Data volume | 247.9 GB | 247.9 GB |
Unique views | 5,786 | 5,785 |
Unique downloads | 2,705 | 2,705 |