Dataset Open Access

Data set of the paper "Publishing an OCR ground truth data set for reuse in an unclear copyright setting"

David Lassner; Julius Coburger; Clemens Neudecker; Anne Baillot

The data set consists of one METS file for each PDF that was used for transcription and a directory, data/page_xml, that contains the ground truth transcriptions in PAGE-XML format. A data paper containing a detailed description of the data set will be published in parallel; we will link to it as soon as it is available. The corresponding source code is available at https://github.com/millawell/ocr-data/tree/1.1
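To illustrate how the transcriptions in data/page_xml might be read, here is a minimal Python sketch that collects the line-level text from a PAGE-XML document. It relies only on the general PAGE-XML structure (TextLine elements carrying a TextEquiv/Unicode child) and matches elements by local name, since the exact schema version used in this data set is not stated here. The function name `line_texts`, the `*.xml` file pattern, and the assumption that the archive unpacks into the current directory are illustrative choices, not part of the data set.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(tag):
    """Strip the XML namespace from a tag such as '{ns}TextLine'."""
    return tag.rsplit("}", 1)[-1]


def line_texts(page_xml):
    """Return the text of every TextLine in a PAGE-XML document.

    `page_xml` is the document content as str or bytes. Only the
    TextEquiv that is a direct child of a TextLine is read, so
    word-level transcriptions nested inside the line are ignored.
    """
    root = ET.fromstring(page_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        for equiv in elem:
            if local_name(equiv.tag) != "TextEquiv":
                continue
            for uni in equiv:
                if local_name(uni.tag) == "Unicode":
                    lines.append(uni.text or "")
    return lines


if __name__ == "__main__":
    # Assumption: the archive was unpacked here and the PAGE-XML
    # files carry an .xml extension somewhere below data/page_xml.
    for path in sorted(Path("data/page_xml").rglob("*.xml")):
        texts = line_texts(path.read_bytes())
        print(f"{path}: {len(texts)} lines")
```

Matching by local name keeps the sketch independent of the PAGE schema version, at the cost of ignoring namespaces entirely.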

Files (300.0 kB)
Name: 2021-05-7_v1.1_ocr-data.tgz
Size: 300.0 kB
md5:99a25e5a8cc8942e571cd908dfc61927
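The md5 checksum listed for the archive can be used to check a download for corruption. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `verify` are hypothetical, and the archive is assumed to sit in the current working directory under its listed file name.

```python
import hashlib

# Checksum as listed for 2021-05-7_v1.1_ocr-data.tgz.
EXPECTED_MD5 = "99a25e5a8cc8942e571cd908dfc61927"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's md5 digest matches `expected`."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumption: the archive was saved under its original name.
    ok = verify("2021-05-7_v1.1_ocr-data.tgz")
    print("checksum ok" if ok else "checksum mismatch")
```

A matching md5 digest only indicates that the file was transferred intact; it is not a guarantee of authenticity.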
Statistics (all versions / this version)
Views: 90 / 90
Downloads: 18 / 18
Data volume: 5.4 MB / 5.4 MB
Unique views: 83 / 83
Unique downloads: 15 / 15
