Published January 7, 2018 | Version 1.0.0
Dataset
Open
WiLI-2018 - Wikipedia Language Identification database
Creators
Description
WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs in 235 languages. The dataset is balanced (1,000 paragraphs per language), and a train-test split is provided.
See the paper "The WiLI benchmark dataset for written language identification" (soon on arXiv) for more information.
Files
Name | Size | Checksum |
---|---|---|
wili-2018.zip | 62.4 MB | md5:3dc5bd41587811ad6b0d04ae2f235f84 |
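Since the listing gives an MD5 checksum for wili-2018.zip, a downloaded copy can be checked against it before use. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `matches_expected` and the local path `wili-2018.zip` are assumptions made for illustration and are not part of the dataset.

```python
import hashlib
from pathlib import Path

# Checksum copied from the file listing for wili-2018.zip.
EXPECTED_MD5 = "3dc5bd41587811ad6b0d04ae2f235f84"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Compare the file's digest with `expected`, ignoring letter case."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Hypothetical location of the downloaded archive; adjust as needed.
    archive = Path("wili-2018.zip")
    if archive.exists():
        print("checksum OK" if matches_expected(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the archive again is the simplest remedy. MD5 is suitable here for detecting accidental corruption, not for guarding against deliberate tampering.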