WikiReaD (Wikipedia Readability Dataset)
Description
Dataset Description:
The dataset contains pairs of encyclopedic articles in 14 languages. Each pair includes the same article in two levels of readability (easy/hard). The pairs are obtained by matching Wikipedia articles (hard) with the corresponding versions from different simplified or children's encyclopedias (easy).
Dataset Details:
- Number of Languages: 14
- Number of files: 19
- Use Case: Training and evaluating readability scoring models for articles within and outside Wikipedia.
- Processing details: Text pairs are created by matching each Wikipedia article with the corresponding article in the simplified/children's encyclopedia, either via the Wikidata item ID or via the page title. The text of each article is extracted directly from its parsed HTML version.
- Files: The dataset consists of independent files for each type of children's/simplified encyclopedia and each language (named `<wiki>-<language_code>_sentences.bz2`). The dataset also contains train-test split files for simplewiki-en (`trainsplit_simplewiki-en_sentences.bz2`, `testsplit_simplewiki-en_sentences.bz2`), which are needed to reproduce the results of the corresponding paper.
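The file naming scheme described above can be unpacked programmatically. The following is a minimal sketch, assuming the names follow the stated pattern exactly; the helper names (`parse_filename`, `peek`, `FileInfo`) are hypothetical, and because the record layout inside each file is not documented here, `peek` only returns raw decoded lines for inspection.

```python
import bz2
import re
from typing import List, NamedTuple, Optional

# Pattern from the description: <wiki>-<language_code>_sentences.bz2,
# optionally prefixed with "trainsplit_" or "testsplit_".
FILENAME_RE = re.compile(
    r"^(?:(?P<split>train|test)split_)?"
    r"(?P<wiki>[^-_]+)-(?P<lang>[^_]+)_sentences\.bz2$"
)


class FileInfo(NamedTuple):
    wiki: str                # e.g. "simplewiki"
    language_code: str       # e.g. "en"
    split: Optional[str]     # "train", "test", or None for a full file


def parse_filename(name: str) -> FileInfo:
    """Split a dataset file name into wiki, language code, and split."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return FileInfo(match["wiki"], match["lang"], match["split"])


def peek(path: str, n: int = 3) -> List[str]:
    """Return the first n lines of a bz2 file, decoded as UTF-8 (assumed)."""
    lines = []
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            if index >= n:
                break
            lines.append(line.rstrip("\n"))
    return lines
```

Calling `parse_filename("trainsplit_simplewiki-en_sentences.bz2")` yields the wiki `simplewiki`, the language code `en`, and the split `train`.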
Attribution:
The dataset was compiled from the following sources. The text of the original articles comes from the corresponding language version of Wikipedia. The text of the simplified articles comes from one of the following encyclopedias: Simple English Wikipedia, Vikidia, Klexikon, Txikipedia, or Wikikids.
Below we provide information about the license of the original content as well as the template to generate the link to the original source for a given page (<page_title>) and language (<language_code>). For example, https://en.wikipedia.org/wiki/Spain links to the page “Spain” in English Wikipedia.
- Wikipedia
  - Source: https://<language_code>.wikipedia.org/wiki/<page_title>
  - License: CC BY-SA 4.0, GFDL
- Simple English Wikipedia
  - Source: https://simple.wikipedia.org/wiki/<page_title>
  - License: CC BY-SA 4.0, GFDL
- Vikidia
  - Source: https://<language_code>.vikidia.org/wiki/<page_title>
  - License: CC BY-SA 3.0, GFDL
- Klexikon
  - Source: https://klexikon.zum.de/wiki/<page_title>
  - License: CC BY-SA 4.0
- Txikipedia
  - Source: https://eu.wikipedia.org/wiki/Txikipedia:<page_title>
  - License: CC BY-SA 4.0, GFDL
- Wikikids
  - Source: https://wikikids.nl/<page_title>
  - License: CC BY-SA 3.0
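The link templates above can be filled in mechanically when attributing a page. The sketch below is illustrative only: the dictionary keys and the function name `source_url` are hypothetical labels, and replacing spaces with underscores before percent-encoding is an assumption based on the usual MediaWiki URL convention, which may not hold for every source.

```python
from urllib.parse import quote

# URL templates copied from the attribution list; the keys are
# hypothetical short labels chosen for this example.
SOURCE_URL_TEMPLATES = {
    "wikipedia": "https://{language_code}.wikipedia.org/wiki/{page_title}",
    "simplewiki": "https://simple.wikipedia.org/wiki/{page_title}",
    "vikidia": "https://{language_code}.vikidia.org/wiki/{page_title}",
    "klexikon": "https://klexikon.zum.de/wiki/{page_title}",
    "txikipedia": "https://eu.wikipedia.org/wiki/Txikipedia:{page_title}",
    "wikikids": "https://wikikids.nl/{page_title}",
}


def source_url(source: str, page_title: str, language_code: str = "") -> str:
    """Build the link to the original page for attribution."""
    template = SOURCE_URL_TEMPLATES[source]
    if "{language_code}" in template and not language_code:
        raise ValueError(f"{source} requires a language code")
    # Assumption: spaces become underscores, then the title is percent-encoded.
    title = quote(page_title.replace(" ", "_"))
    return template.format(language_code=language_code, page_title=title)


print(source_url("wikipedia", "Spain", "en"))
# prints https://en.wikipedia.org/wiki/Spain
```

Sources whose template has no language placeholder (for example Klexikon) ignore the `language_code` argument.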
Code for data collection: TBD
Related paper citation: TBD
Files
(127.5 MB)
Checksum | Size |
---|---|
md5:f659ca06e6cb89dc6d45fdeedde012f9 | 1.7 MB |
md5:47334ef7e5f332570a7e7dad634a754d | 54.1 MB |
md5:6dd781882ffb9ac57b51e1b273decff6 | 10.9 MB |
md5:a3de1a04eee5d618ecc5f0af69bd65de | 43.2 MB |
md5:baa3f74f8fbef5b809853a9848b8a800 | 611.9 kB |
md5:005f9560d5bfa736cf4700cd368e6c05 | 134.6 kB |
md5:77ab90de49c09ce8f4562d5f1f0e0dc8 | 150.0 kB |
md5:36e035174f8766b99cfed7635df704ba | 31.0 kB |
md5:7d0c4a9fcd3e1820b805ceef23ead15d | 1.5 MB |
md5:fcc7fd718337e841dd58696a2243792d | 1.1 MB |
md5:97f0b208b34f6ac4a2474541a04f207b | 304.6 kB |
md5:2e65673a28855322c396768c0bce1d93 | 6.5 MB |
md5:6ba1554f66ffd51eb255689297118140 | 476.9 kB |
md5:34012e8337bfb2ad51fdbd2aafb68791 | 671.6 kB |
md5:c566f2713285485486034ca6716a4146 | 5.7 kB |
md5:5b2c1b8acf2bafde9508be41071c2a5c | 578.4 kB |
md5:7ad3c91dd59ef94d01aa2b538f1861d0 | 86.6 kB |
md5:b28f69e9eee2ebedd41b669a1c1c13ba | 2.9 kB |
md5:a5968e6f3cd90164c89be3ea506098fb | 5.5 MB |
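The md5 values listed with the files can be used to check that a download arrived intact. A minimal sketch, assuming the file has already been saved locally; the function names and the example path are hypothetical.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path: str, listed: str) -> bool:
    """Compare a local file against a listed value such as 'md5:f659...'."""
    expected = listed.strip().split(":", 1)[-1].lower()
    return md5_of_file(path) == expected


# Usage (the local path is a placeholder for whichever file was downloaded):
# matches_listed_checksum("path/to/downloaded_file.bz2",
#                         "md5:f659ca06e6cb89dc6d45fdeedde012f9")
```

Reading in chunks keeps memory use flat even for the larger files in the listing.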