Published March 4, 2022 | Version v1
Dataset Open

Vikidia En/Fr bilingual dataset for Automatic Readability Assessment

  • 1. University of Toronto, Canada
  • 2. National Research Council, Canada

Description

Vikidia.org is a children's encyclopedia available in several European languages, with content aimed at children aged 8 to 13. Our dataset contains 24,660 texts covering 6,165 articles (slugs) in English and French, each at two reading levels. Every article therefore has four versions: en, en-simple, fr, and fr-simple (6,165 × 4 = 24,660 texts). What makes this dataset unique is that the four versions are parallel texts aligned at the document level. We did not create paragraph- or sentence-level alignments, but we hope the dataset will be useful for future English and French research on Automatic Readability Assessment (ARA) and Automatic Text Simplification. This is the first such dataset in ARA, and perhaps the first readily available French readability dataset.
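The four-versions-per-slug structure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the on-disk layout `<root>/<version>/<slug>.txt` is hypothetical (this page does not document the archive's directory structure), and the helper names `index_by_slug` and `complete_slugs` are made up for the example. Only the four version names and the counts come from the description.

```python
from pathlib import Path

# Version names as given in the dataset description.
VERSIONS = ("en", "en-simple", "fr", "fr-simple")


def index_by_slug(root):
    """Map each slug to {version: path}.

    ASSUMPTION: files live at <root>/<version>/<slug>.txt. This layout is
    hypothetical; adjust the glob pattern to the actual archive contents.
    """
    index = {}
    for version in VERSIONS:
        for path in sorted(Path(root, version).glob("*.txt")):
            index.setdefault(path.stem, {})[version] = path
    return index


def complete_slugs(index):
    """Return the slugs that have a text in all four versions."""
    return sorted(
        slug for slug, texts in index.items() if len(texts) == len(VERSIONS)
    )
```

If the extracted data does follow such a layout, `len(complete_slugs(index_by_slug(root)))` would be expected to equal 6,165, matching the 24,660 texts reported above.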

This dataset is used in the paper "A neural pairwise ranking model for automatic readability assessment" by Justin Lee and Sowmya Vajjala, to appear in Findings of ACL 2022. 

Files (175.8 MB)

md5:8baf740205560fef9952213b4c03a053
Size: 175.8 MB
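
Since the listing publishes an MD5 checksum, a download can be checked against it before use. The following is a minimal sketch using Python's standard `hashlib` module. The function names are illustrative, and the file path is left to the caller because the file name is not shown on this page.

```python
import hashlib

# Checksum published in the file listing on this page.
EXPECTED_MD5 = "8baf740205560fef9952213b4c03a053"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path):
    """True if the file at `path` has the MD5 listed on this page."""
    return md5_of_file(path) == EXPECTED_MD5
```

Reading in chunks keeps memory use flat for a 175.8 MB file. A mismatch usually means an incomplete or corrupted download.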