Data set of the paper "Publishing an OCR ground truth data set for reuse in an unclear copyright setting"

David Lassner; Julius Coburger; Clemens Neudecker; Anne Baillot

doi:10.5281/zenodo.4742068

Published May 7, 2021 | Version 1.1

Dataset Open

1. TU Berlin
2. Staatsbibliothek zu Berlin - Preußischer Kulturbesitz
3. Le Mans Université

The data set consists of one METS file for each PDF that was used for transcription, plus a directory data/page_xml that contains the ground truth transcriptions in PAGE-XML format. A data paper with a detailed description of the data set will be published alongside it; we will link to it as soon as it is available. The corresponding source code is available at https://github.com/millawell/ocr-data/tree/1.1
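As a rough illustration of how the transcriptions in data/page_xml might be read, the following sketch collects the line-level text from PAGE-XML documents using only the Python standard library. It is not taken from the linked repository. The function names (`local_name`, `extract_line_texts`, `read_ground_truth`) are hypothetical, the `.xml` file extension and the flat directory layout are assumptions, and the PAGE namespace is ignored on purpose because the schema version used in the data set is not stated here.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_line_texts(page_xml):
    """Return the text of every TextLine in a PAGE-XML document (str or bytes)."""
    root = ET.fromstring(page_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Only use TextEquiv elements that are direct children of the line,
        # so word-level or region-level TextEquiv entries are not counted.
        for child in element:
            if local_name(child.tag) != "TextEquiv":
                continue
            for grandchild in child:
                if local_name(grandchild.tag) == "Unicode":
                    lines.append(grandchild.text or "")
    return lines


def read_ground_truth(directory="data/page_xml"):
    """Map each file name in `directory` to its list of line texts.

    Assumption: the files end in '.xml' and sit directly in the directory.
    """
    return {
        path.name: extract_line_texts(path.read_bytes())
        for path in sorted(Path(directory).glob("*.xml"))
    }
```

After unpacking the archive, calling `read_ground_truth()` from the directory that contains `data/` would return a dictionary of file names and their transcribed lines, provided the assumptions above hold.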

Files (300.0 kB)

Name	Size	MD5
2021-05-7_v1.1_ocr-data.tgz	300.0 kB	99a25e5a8cc8942e571cd908dfc61927
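The MD5 checksum listed with the archive can be used to check that a download is complete. The sketch below is one way to do this with Python's standard `hashlib` module. The file name and checksum are copied from the listing, while the function names `md5_of_file` and `archive_is_intact` are hypothetical, and the archive is assumed to be in the current working directory.

```python
import hashlib
from pathlib import Path

# Values copied from the file listing on this page.
ARCHIVE_NAME = "2021-05-7_v1.1_ocr-data.tgz"
EXPECTED_MD5 = "99a25e5a8cc8942e571cd908dfc61927"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path=ARCHIVE_NAME, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumption: the archive was downloaded into the current directory.
    if Path(ARCHIVE_NAME).exists():
        print("checksum OK" if archive_is_intact() else "checksum MISMATCH")
    else:
        print(f"{ARCHIVE_NAME} not found in the current directory")
```

MD5 is used here only because it is the checksum the record provides; it detects accidental corruption but is not a safeguard against deliberate tampering.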

	All versions	This version
Views	541	540
Downloads	95	95
Data volume	30.3 MB	30.3 MB

Resource type: Dataset

Publisher: Zenodo

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: May 7, 2021
Modified: May 12, 2021
