Published May 7, 2021 | Version 1.1
Dataset Open

Data set of the paper "Publishing an OCR ground truth data set for reuse in an unclear copyright setting"

  • 1. TU Berlin
  • 2. Staatsbibliothek zu Berlin - Preußischer Kulturbesitz
  • 3. Le Mans Université

Description

The data set consists of a METS file for each PDF used for transcription and a directory, data/page_xml, that contains the ground truth transcriptions in PAGE-XML format. A data paper with a detailed description of the data set will be published in parallel; we will link to it as soon as it is available. The corresponding source code is available at: https://github.com/millawell/ocr-data/tree/1.1
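As a rough illustration of how the transcriptions in data/page_xml might be read, the following sketch collects the line-level text from PAGE-XML documents. It is not taken from the linked repository. The helper names (local_name, text_lines, lines_from_directory), the *.xml file pattern, and the SAMPLE document are assumptions made for this example. The sketch relies only on the general PAGE-XML convention that a TextLine element carries its text in TextEquiv/Unicode, and it matches elements by local name so that it does not depend on a particular PAGE schema version.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def text_lines(root):
    """Return the line-level transcription of every TextLine under root."""
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Only look at direct children, so word-level TextEquiv is skipped.
        for child in elem:
            if local_name(child.tag) == "TextEquiv":
                for uni in child:
                    if local_name(uni.tag) == "Unicode":
                        lines.append(uni.text or "")
                break  # use the first line-level TextEquiv only
    return lines


def lines_from_directory(directory="data/page_xml"):
    """Map each PAGE-XML file name in the directory to its text lines.

    The '*.xml' pattern is an assumption about how the files are named.
    """
    return {
        path.name: text_lines(ET.parse(path).getroot())
        for path in sorted(Path(directory).glob("*.xml"))
    }


# Made-up miniature document for demonstration; it is not part of the data set
# and omits elements (such as Coords) that a schema-valid file would contain.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1"><TextEquiv><Unicode>Erste</Unicode></TextEquiv></Word>
        <TextEquiv><Unicode>Erste Zeile</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>Zweite Zeile</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    print(text_lines(ET.fromstring(SAMPLE)))
```

After unpacking the archive, calling lines_from_directory("data/page_xml") would return one list of lines per file, provided the files follow the layout assumed here.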

Files

Files (300.0 kB)

md5:99a25e5a8cc8942e571cd908dfc61927
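The MD5 checksum listed above can be used to check that a download is intact. The following is a minimal sketch using Python's standard hashlib module. The function names and the chunk size are choices made for this example, and the path of the downloaded file has to be supplied by the user, since the file name is not shown here.

```python
import hashlib

# Checksum as listed for the file in this record.
EXPECTED_MD5 = "99a25e5a8cc8942e571cd908dfc61927"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

A result of False from matches_expected for the downloaded file would suggest an incomplete or altered download.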