Published May 7, 2021 | Version 1.1
Dataset Open

Data set of the paper "Publishing an OCR ground truth data set for reuse in an unclear copyright setting"

  • 1. TU Berlin
  • 2. Staatsbibliothek zu Berlin - Preußischer Kulturbesitz
  • 3. Le Mans Université


The data set consists of one METS file for each PDF that was used for transcription, and a directory, data/page_xml, that contains the ground truth transcriptions in PAGE-XML format. A data paper with a detailed description of the data set will be published in parallel with the data set; we will link to it as soon as it is available. The corresponding source code can be found here
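
As a rough illustration of how the transcriptions in data/page_xml might be read, the following Python sketch collects the text of every TextLine element in each PAGE-XML file. It is not part of the data set or its source code. The helper names (local_name, text_lines, read_page_xml_dir), the assumption that the files carry an .xml extension, and the assumption that each line's text sits in a TextEquiv/Unicode child are ours; the namespace is ignored on purpose, because the PAGE schema version used by the files is not stated here.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def text_lines(root):
    """Return the transcribed text of every TextLine element under root."""
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Look only at direct TextEquiv children, so word-level
        # transcriptions nested deeper are not counted a second time.
        for equiv in elem:
            if local_name(equiv.tag) != "TextEquiv":
                continue
            for uni in equiv:
                if local_name(uni.tag) == "Unicode":
                    lines.append(uni.text or "")
    return lines


def read_page_xml_dir(directory="data/page_xml"):
    """Map each *.xml file name in directory to its list of text lines.

    The default path and the .xml extension are assumptions.
    """
    result = {}
    for path in sorted(Path(directory).glob("*.xml")):
        result[path.name] = text_lines(ET.parse(path).getroot())
    return result
```

Calling read_page_xml_dir() from the root of the unpacked data set would return a dictionary keyed by file name. If a line carries several TextEquiv alternatives, this sketch returns all of them in document order.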


Files (300.0 kB)
