There is a newer version of the record available.

Published May 28, 2024 | Version v1
Dataset Open

ARTS Datasets - ARTS94, ARTS300, ARTS3000

  • 1. TH Köln - University of Applied Sciences
  • 2. Technische Hochschule Mittelhessen
  • 3. Herder Institute
  • 4. Technische Hochschule Köln

Description

Datasets for evaluating readability and text simplicity, in three sizes: 94, 300, and 3000 disjoint data entries. Each data entry contains the following information:

  • Text_original: Text from a parallel corpus for text simplification
  • Text_formatted: Text_original where formatting issues have been resolved either manually (ARTS94) or automatically (ARTS300 and ARTS3000) 
  • Dataset: Parallel corpus for text simplification, from which the original text has been extracted 
  • Label: Whether the text comes from the simplified (simp) or source (src) part of the corpus
  • ID: Unique ID
  • Score: Simplicity/readability score of the formatted text, between 0 and 1; the higher the score, the more complex/less readable the text

The licenses of the source datasets apply to the respective texts.
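
The entry layout described above can be checked with a short script. The following is a minimal sketch, not the authors' tooling: it assumes the CSV header uses exactly the field names listed in the description and that the file is comma-separated, neither of which this record confirms. The two sample rows and the corpus name `example_corpus` are hypothetical and exist only to make the example runnable.

```python
import csv
import io

# Field names as listed in the record description. That the CSV header
# uses exactly these spellings is an assumption.
EXPECTED_FIELDS = ["Text_original", "Text_formatted", "Dataset", "Label", "ID", "Score"]
VALID_LABELS = {"simp", "src"}


def read_entries(handle):
    """Parse ARTS-style rows from an open text handle and validate each row."""
    reader = csv.DictReader(handle)
    missing = [f for f in EXPECTED_FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    entries = []
    for row in reader:
        score = float(row["Score"])
        # The description states that scores lie between 0 and 1.
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range for ID {row['ID']}: {score}")
        if row["Label"] not in VALID_LABELS:
            raise ValueError(f"unexpected label for ID {row['ID']}: {row['Label']}")
        entries.append({**row, "Score": score})
    return entries


# Hypothetical two-row sample, invented for illustration only.
SAMPLE = """Text_original,Text_formatted,Dataset,Label,ID,Score
"The cat sat .","The cat sat.",example_corpus,simp,1,0.12
"Notwithstanding the foregoing , he left .","Notwithstanding the foregoing, he left.",example_corpus,src,2,0.87
"""

entries = read_entries(io.StringIO(SAMPLE))
# A higher score means a more complex / less readable text.
most_complex = max(entries, key=lambda e: e["Score"])
print(most_complex["ID"], most_complex["Score"])
```

To run the same check on a downloaded file, pass an open handle instead of the sample, for example `open("arts3000_scores.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is likewise an assumption.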

Files

arts3000_scores.csv

Files (1.1 MB)

Name Size
md5:ad4ebb3d13de5bf2f3a81d5d247fbace
999.8 kB
md5:a75105b9a961edae1a09cf02b698f956
99.0 kB
md5:fce76b540d1f76c4c0475f4a4671b8b9
29.0 kB