Published 2024 | Version v2
Dataset Open

ARTS Datasets - ARTS94, ARTS300, ARTS3000, ARTS160

  • 1. TH Köln - University of Applied Sciences
  • 2. Technische Hochschule Mittelhessen
  • 3. Herder Institute
  • 4. Technische Hochschule Köln

Description

Datasets for readability and text simplicity evaluation in four sizes: 94, 300, 3000 and 160 disjoint data entries. Each data entry contains the following information:

  • Text_original: Text from a parallel corpus for text simplification
  • Text_formatted: Text_original where formatting issues have been resolved either manually (ARTS94) or automatically (ARTS300, ARTS3000, ARTS160) 
  • Dataset: Parallel corpus for text simplification, from which the original text has been extracted 
  • Label: Whether the text comes from the simplified (simp) or source (src) part of the corpus
  • ID: Unique ID
  • Score: Simplicity/readability score of the formatted text, between 0 and 1; the higher the score, the more complex and less readable the text
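
The fields listed above could be read and sanity-checked along the following lines. This is a minimal sketch, not the authors' tooling: it assumes the CSV column names match the field names exactly and that the files are comma-separated, neither of which is confirmed here. The two sample rows and the corpus name `example_corpus` are made up for illustration and are not taken from the datasets.

```python
import csv
import io
from statistics import mean

# Hypothetical two-row sample that mimics the fields listed above; the real
# files may use different column names, ordering, or delimiters.
SAMPLE = """Text_original,Text_formatted,Dataset,Label,ID,Score
"A  long ,complicated sentence.","A long, complicated sentence.",example_corpus,src,1,0.8
"A short sentence.","A short sentence.",example_corpus,simp,2,0.2
"""

def load_entries(handle):
    """Read entries, convert Score to float, and check the documented 0..1 range."""
    entries = []
    for row in csv.DictReader(handle):
        row["Score"] = float(row["Score"])
        if not 0.0 <= row["Score"] <= 1.0:
            raise ValueError(f"Score out of range for ID {row['ID']}")
        entries.append(row)
    return entries

def mean_score_by_label(entries):
    """Average Score per Label value (simp or src)."""
    groups = {}
    for entry in entries:
        groups.setdefault(entry["Label"], []).append(entry["Score"])
    return {label: mean(scores) for label, scores in groups.items()}

entries = load_entries(io.StringIO(SAMPLE))
averages = mean_score_by_label(entries)
```

To use it on a downloaded file, pass an opened file object instead of the in-memory sample, for example `load_entries(open("ARTS160_Scores.csv", newline="", encoding="utf-8"))`; the encoding is likewise an assumption.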

The licenses of the source corpora apply to the respective texts.

Files

ARTS160_Scores.csv

Files (1.2 MB)

Checksum Size
md5:66fc336d63d34611fa9432725544b169 47.6 kB
md5:ad4ebb3d13de5bf2f3a81d5d247fbace 999.8 kB
md5:a75105b9a961edae1a09cf02b698f956 99.0 kB
md5:fce76b540d1f76c4c0475f4a4671b8b9 29.0 kB
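
A downloaded file can be checked against one of the MD5 checksums listed above. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and since the listing does not show which checksum belongs to which file, the pairing has to be taken from the record page itself. The demonstration at the end runs on a temporary file, not on the actual data.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path, listed):
    """Compare a file against a listed value such as 'md5:66fc...'."""
    expected = listed.removeprefix("md5:").lower()  # requires Python 3.9+
    return md5_of_file(path) == expected

# Demonstration on a throwaway file containing the bytes b"hello".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_path = tmp.name
demo_ok = matches_checksum(demo_path, "md5:5d41402abc4b2a76b9719d911017c592")
os.remove(demo_path)
```

MD5 is adequate here for detecting corrupted or incomplete downloads; it is not a safeguard against deliberate tampering.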