Published May 7, 2021 | Version 1.1
Dataset
Open
Data set of the paper "Publishing an OCR ground truth data set for reuse in an unclear copyright setting"
- 1. TU Berlin
- 2. Staatsbibliothek zu Berlin - Preußischer Kulturbesitz
- 3. Le Mans Université
Description
The data set contains one METS file for each PDF used for transcription, plus a directory data/page_xml that holds the ground-truth transcriptions in PAGE-XML format. A data paper with a detailed description of the data set will be published alongside it; we will link to it once it is available. The corresponding source code is available at https://github.com/millawell/ocr-data/tree/1.1
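As a rough illustration of how the transcriptions in data/page_xml might be read, the following Python sketch collects the line-level text of each PAGE-XML file. It is not taken from the linked repository. The function names are hypothetical, and the sketch assumes that the archive has been extracted, that the files sit directly in data/page_xml with an .xml extension, and that the text is stored in the usual TextLine / TextEquiv / Unicode nesting. Because the PAGE schema version is not stated here, elements are matched by local name rather than by a fixed namespace.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def line_text(text_line: ET.Element) -> "str | None":
    """Return the Unicode text of the first TextEquiv directly under a TextLine."""
    for child in text_line:
        if local_name(child.tag) != "TextEquiv":
            continue
        for grandchild in child:
            if local_name(grandchild.tag) == "Unicode":
                return grandchild.text or ""
    return None


def read_page_xml_lines(source) -> "list[str]":
    """Return the text of every TextLine in one PAGE-XML document."""
    root = ET.parse(source).getroot()
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Only direct children of the line are inspected, so word-level and
        # region-level TextEquiv entries are not counted a second time.
        text = line_text(element)
        if text is not None:
            lines.append(text)
    return lines


def read_ground_truth(page_xml_dir) -> "dict[str, list[str]]":
    """Map each PAGE-XML file name in the directory to its list of text lines.

    Assumption: files use the '.xml' extension and are not nested in subfolders.
    """
    return {
        path.name: read_page_xml_lines(path)
        for path in sorted(Path(page_xml_dir).glob("*.xml"))
    }
```

Calling read_ground_truth("data/page_xml") would then return a dictionary keyed by file name. If the actual layout or file extension differs, the glob pattern is the only part that should need adjusting.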
Files
Size | Checksum |
---|---|
300.0 kB | md5:99a25e5a8cc8942e571cd908dfc61927 |