Dataset Open Access

Data set of the paper "Publishing an OCR ground truth data set for reuse in an unclear copyright setting"

David Lassner; Julius Coburger; Clemens Neudecker; Anne Baillot

The data set consists of one METS file for each PDF that was used for transcription, and a directory, data/page_xml, that contains the ground truth transcriptions in PAGE-XML format. Alongside the data set, a data paper containing a detailed description of it will be published; we will link to it as soon as it is available. The corresponding source code is available at https://github.com/millawell/ocr-data/tree/1.1
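
As an illustration of how the transcriptions in data/page_xml might be read, the following is a minimal sketch and not part of the published source code. It assumes the files follow the usual PAGE-XML layout, in which each TextLine element carries its text in a TextEquiv/Unicode child, and that the files end in .xml. The function names line_texts and read_directory are hypothetical. Tags are matched by local name so that the sketch does not depend on a particular PAGE schema version.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def line_texts(root):
    """Return the line-level text of every TextLine below a PAGE-XML root."""
    texts = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Look only at direct TextEquiv children of the line, so that
        # word-level or region-level text is not collected a second time.
        for child in elem:
            if local_name(child.tag) != "TextEquiv":
                continue
            for sub in child:
                if local_name(sub.tag) == "Unicode":
                    texts.append(sub.text or "")
    return texts


def read_directory(directory="data/page_xml"):
    """Map each *.xml file name (assumed extension) to its list of line texts."""
    return {
        path.name: line_texts(ET.parse(path).getroot())
        for path in sorted(Path(directory).glob("*.xml"))
    }
```

Calling read_directory() from the root of the unpacked archive would then yield one list of transcribed lines per page file, provided the assumptions above hold for the data.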

Files (300.0 kB)
Name Size
2021-05-7_v1.1_ocr-data.tgz 300.0 kB
md5:99a25e5a8cc8942e571cd908dfc61927
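
The md5 checksum listed above can be used to confirm that a downloaded copy of the archive is intact. The following is a minimal sketch using Python's standard hashlib module; the helper names md5_of_file and verify are hypothetical, and the sketch assumes the archive has been saved locally under the file name shown in the listing.

```python
import hashlib

# Checksum as listed on this record for 2021-05-7_v1.1_ocr-data.tgz.
EXPECTED_MD5 = "99a25e5a8cc8942e571cd908dfc61927"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's md5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, verify("2021-05-7_v1.1_ocr-data.tgz") should return True for an unmodified download and False if the file was truncated or altered.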
Statistics (all versions / this version)
Views: 36 / 36
Downloads: 10 / 10
Data volume: 3.0 MB / 3.0 MB
Unique views: 29 / 29
Unique downloads: 7 / 7