WiLI-2018 - Wikipedia Language Identification database

Thoma, Martin

doi:10.5281/zenodo.841984

Published January 7, 2018 | Version 1.0.0

Dataset Open

WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs in 235 languages. The dataset is balanced, with 1,000 paragraphs per language, and a train-test split is provided.

See the paper "The WiLI benchmark dataset for written language identification" (soon on arXiv) for more information.
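
The balance claim can be checked once the archive is unpacked. The following is a minimal sketch, assuming the labels are stored as a plain-text file with one language label per line; the file name `y_train.txt` used in the usage note is an assumption about the archive layout, not something stated on this page.

```python
from collections import Counter


def read_labels(path):
    """Read one label per line from a UTF-8 text file (assumed layout)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()


def label_counts(labels):
    """Count how often each non-empty label occurs."""
    return Counter(label.strip() for label in labels if label.strip())


def is_balanced(labels):
    """Return True when every label occurs the same number of times."""
    counts = label_counts(labels)
    return len(set(counts.values())) <= 1
```

For example, `is_balanced(read_labels("y_train.txt"))` should return `True` if the split is balanced as described, and `len(label_counts(...))` should equal 235 if every language is present in that split.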

Files

Name	Size	MD5
wili-2018.zip	62.4 MB	3dc5bd41587811ad6b0d04ae2f235f84
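
The MD5 checksum listed above can be used to confirm that a download is intact. A minimal sketch, assuming the archive was saved locally as `wili-2018.zip` (the local path and the helper names are illustrative, not part of the record):

```python
import hashlib

# Checksum as published in the file listing on this page.
EXPECTED_MD5 = "3dc5bd41587811ad6b0d04ae2f235f84"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path="wili-2018.zip", expected=EXPECTED_MD5):
    """Return True when the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected
```

Calling `verify_archive()` after the download returns `False` if the file is truncated or corrupted. MD5 is suitable here only as an integrity check, not as protection against deliberate tampering.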

	All versions	This version
Views	9,657	9,642
Downloads	4,157	4,146
Data volume	354.3 GB	353.5 GB

Resource type: Dataset
Publisher: Zenodo

Technical metadata

Created: January 7, 2018
Modified: January 24, 2020
