
Published May 28, 2024 | Version v1
Dataset Open

WikiReaD (Wikipedia Readability Dataset)

Description

Dataset Description:

The dataset contains pairs of encyclopedic articles in 14 languages. Each pair consists of the same article at two levels of readability (easy/hard). The pairs are obtained by matching Wikipedia articles (hard) with the corresponding articles from simplified or children's encyclopedias (easy).

 

Dataset Details:

  • Number of Languages: 14
  • Number of files: 19
  • Use Case: Training and evaluating readability scoring models for articles within and outside Wikipedia.
  • Processing details: Text pairs are created by matching each Wikipedia article with the corresponding article in the simplified/children's encyclopedia, either via the Wikidata item ID or via the page title. The text of each article is extracted directly from its parsed HTML version.
  • Files: The dataset consists of independent files for each type of children/simplified encyclopedia and each language (e.g., `<wiki>-<language_code>_sentences.bz2`). Also, the dataset contains train-test split files for simplewiki-en (trainsplit_simplewiki-en_sentences.bz2, testsplit_simplewiki-en_sentences.bz2) needed to reproduce the results of the corresponding paper.
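
The matching step described under "Processing details" can be illustrated with a minimal sketch. It is not the authors' collection code (listed as TBD on this record). The record layout (`title`, `text`, `wikidata_id`) and the function name `match_articles` are hypothetical, and trying the Wikidata ID first and then falling back to the page title is an assumption about how the two matching keys might be combined.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record layout: each article is a dict with "title", "text",
# and an optional "wikidata_id". The real files may use different fields.
Article = Dict[str, Optional[str]]


def match_articles(
    hard: List[Article], easy: List[Article]
) -> List[Tuple[Article, Article]]:
    """Pair hard articles with easy ones by Wikidata ID, then by page title."""
    # Index the easy articles by both candidate keys, skipping missing values.
    by_qid = {a["wikidata_id"]: a for a in easy if a.get("wikidata_id")}
    by_title = {a["title"].casefold(): a for a in easy if a.get("title")}

    pairs = []
    for article in hard:
        # Assumption: prefer the Wikidata ID, fall back to the title.
        partner = by_qid.get(article.get("wikidata_id"))
        if partner is None and article.get("title"):
            partner = by_title.get(article["title"].casefold())
        if partner is not None:
            pairs.append((article, partner))
    return pairs
```

Hard articles without a counterpart are dropped, so the result contains only complete (hard, easy) pairs. The title comparison here ignores case, which is a design choice of this sketch and may be stricter or looser than the matching actually used.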
     

Attribution:

The dataset was compiled from the following sources. The text of the original articles comes from the corresponding language version of Wikipedia. The text of the simplified articles comes from one of the following encyclopedias: Simple English Wikipedia, Vikidia, Klexikon, Txikipedia, or Wikikids.

Below we provide information about the license of the original content as well as the template to generate the link to the original source for a given page (<page_title>) and language (<language_code>). For example, https://en.wikipedia.org/wiki/Spain links to the page “Spain” in English Wikipedia.
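
As a rough illustration of the link template, the sketch below builds a Wikipedia URL from a page title and a language code, following the pattern of the example above. The function name `wikipedia_url` is hypothetical, and only the Wikipedia pattern is covered; the templates for the other encyclopedias are not reproduced here, so none are assumed.

```python
from urllib.parse import quote


def wikipedia_url(page_title: str, language_code: str) -> str:
    """Build a link to a Wikipedia page from its title and language code."""
    # MediaWiki page URLs use underscores in place of spaces; the remaining
    # characters are percent-encoded unless listed as safe.
    path = quote(page_title.replace(" ", "_"), safe="_:()/,")
    return f"https://{language_code}.wikipedia.org/wiki/{path}"


print(wikipedia_url("Spain", "en"))  # https://en.wikipedia.org/wiki/Spain
```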

 

Code for data collection: TBD

Related paper citation: TBD

Files

Files (127.5 MB)

Checksum                                Size
md5:f659ca06e6cb89dc6d45fdeedde012f9    1.7 MB
md5:47334ef7e5f332570a7e7dad634a754d    54.1 MB
md5:6dd781882ffb9ac57b51e1b273decff6    10.9 MB
md5:a3de1a04eee5d618ecc5f0af69bd65de    43.2 MB
md5:baa3f74f8fbef5b809853a9848b8a800    611.9 kB
md5:005f9560d5bfa736cf4700cd368e6c05    134.6 kB
md5:77ab90de49c09ce8f4562d5f1f0e0dc8    150.0 kB
md5:36e035174f8766b99cfed7635df704ba    31.0 kB
md5:7d0c4a9fcd3e1820b805ceef23ead15d    1.5 MB
md5:fcc7fd718337e841dd58696a2243792d    1.1 MB
md5:97f0b208b34f6ac4a2474541a04f207b    304.6 kB
md5:2e65673a28855322c396768c0bce1d93    6.5 MB
md5:6ba1554f66ffd51eb255689297118140    476.9 kB
md5:34012e8337bfb2ad51fdbd2aafb68791    671.6 kB
md5:c566f2713285485486034ca6716a4146    5.7 kB
md5:5b2c1b8acf2bafde9508be41071c2a5c    578.4 kB
md5:7ad3c91dd59ef94d01aa2b538f1861d0    86.6 kB
md5:b28f69e9eee2ebedd41b669a1c1c13ba    2.9 kB
md5:a5968e6f3cd90164c89be3ea506098fb    5.5 MB
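
After downloading a file, its integrity can be checked against the MD5 value listed for it, and its contents can be inspected before committing to a parser. The sketch below uses only the Python standard library. The helper names `md5_of_file` and `peek_lines` are hypothetical, and treating the decompressed content as UTF-8 text with one record per line is an assumption, since the internal layout of the `.bz2` files is not described here.

```python
import bz2
import hashlib
from itertools import islice
from typing import List


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def peek_lines(path: str, count: int = 3) -> List[str]:
    """Return the first `count` decompressed lines of a bz2 file.

    Assumption: the decompressed content is UTF-8 text, one record per line.
    """
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, count)]
```

To verify a download, compare the result of `md5_of_file(path)` with the hexadecimal string that follows the `md5:` prefix in the listing for that file. Calling `peek_lines(path)` then shows the first few records without decompressing the whole archive.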