Published January 7, 2018 | Version 1.0.0
Dataset Open

WiLI-2018 - Wikipedia Language Identification database

Description

WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs in 235 languages. The dataset is balanced, and a train-test split is provided.

See the paper "The WiLI benchmark dataset for written language identification" (soon on arXiv) for more information.
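
Since the dataset is described as balanced, 235,000 paragraphs over 235 languages works out to 1,000 paragraphs per language. The following is a minimal sketch of how that property could be checked once the labels are loaded. The helper names `per_class_count` and `is_balanced` are hypothetical, and how the labels are stored inside the archive is not stated here, so loading them is left to the reader.

```python
from collections import Counter


def per_class_count(total_items, n_classes):
    """Return the number of items per class in a perfectly balanced dataset."""
    if n_classes <= 0 or total_items % n_classes != 0:
        raise ValueError("total_items must divide evenly among n_classes")
    return total_items // n_classes


def is_balanced(labels):
    """Return True if every label occurs the same number of times."""
    counts = Counter(labels)
    return len(set(counts.values())) <= 1


# Figures taken from the description: 235,000 paragraphs, 235 languages.
expected_per_language = per_class_count(235_000, 235)
```

Calling `is_balanced` on the full label list (training and test labels combined, or each split separately) would confirm the balance claim for whichever portion was loaded.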

Files

wili-2018.zip (62.4 MB)
md5:3dc5bd41587811ad6b0d04ae2f235f84
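
The MD5 checksum listed for wili-2018.zip can be used to check that a download is intact. A minimal sketch using Python's standard `hashlib` module follows. The function names `md5_of_file` and `verify_download` are hypothetical, and the local file path is assumed to be wherever the archive was saved.

```python
import hashlib

# Checksum as listed for wili-2018.zip on this page.
EXPECTED_MD5 = "3dc5bd41587811ad6b0d04ae2f235f84"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_download("wili-2018.zip")` should return `True` for an uncorrupted copy saved under that name in the current directory.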