1344132
doi
10.5281/zenodo.1344132
oai:zenodo.org:1344132
Reul, Christian
Universität Würzburg
Dipper, Stefanie
Ruhr-Universität Bochum
Baiter, Johannes
Bayerische Staatsbibliothek München
GT4HistOCR: Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin
Springmann, Uwe
LMU
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
OCR, historical documents, digital humanities, Fraktur, Early Modern Latin, Early New High German
<p><strong>GT4HistOCR</strong> contains ground truth for research in Optical Character Recognition (OCR) technology applied to historical printings in German Fraktur and Early Modern Latin.</p>
<p>The ground truth comes in pairs: an image of a single printed line as it appears on the book page (*.png) and its corresponding diplomatic transcription (*.gt.txt), a UTF-8 string that preserves the character forms (glyphs) as far as the Unicode standard allows. These pairs of line images and transcriptions can be used directly to train recognition models with, e.g., the open source OCR engines <em>OCRopy</em> or <em>Tesseract</em>. A total of 313,173 ground truth lines are provided.</p>
<p><strong>Please note that the subcorpora making up this collection used different transcription guidelines, so it is a bad idea to train a recognition model on the total collection! Instead, train an individual model for each subcorpus.</strong> For further information about the subcorpora, please see the README file and the accompanying publication.</p>
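<p>As a rough illustration of keeping the subcorpora apart, the following Python sketch collects line image and transcription pairs separately for each subcorpus. It rests on assumptions that this record does not confirm: that each subcorpus is a top-level directory of the unpacked archive, and that an image and its transcription lie in the same directory and share the part of the filename before the first dot. The function name <code>collect_pairs</code> is hypothetical; check the README for the actual layout before relying on it.</p>

```python
from collections import defaultdict
from pathlib import Path


def collect_pairs(root):
    """Group (image path, transcription) pairs by subcorpus.

    Assumptions (not confirmed by the record): every subcorpus is a
    top-level directory under `root`, and an image and its transcription
    share a directory and the filename part before the first dot.
    """
    root = Path(root)

    # Index every line image by (directory, name before the first dot).
    images = {}
    for png in sorted(root.rglob("*.png")):
        key = (png.parent, png.name.split(".", 1)[0])
        images.setdefault(key, png)  # keep the first image if several match

    pairs = defaultdict(list)
    for gt in sorted(root.rglob("*.gt.txt")):
        key = (gt.parent, gt.name.split(".", 1)[0])
        image = images.get(key)
        if image is None:
            continue  # transcription without a line image: skip it
        subcorpus = gt.relative_to(root).parts[0]
        text = gt.read_text(encoding="utf-8").rstrip("\n")
        pairs[subcorpus].append((image, text))
    return dict(pairs)
```

<p>Each key of the returned dictionary can then be handed to a separate training run, so that lines transcribed under different guidelines never end up in the same model.</p>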
<p>If these data are useful for you, please cite the accompanying publication:</p>
<pre>@article{<a href="http://springmann.net/publications.html#springmann2018gt4hist">springmann2018gt4hist</a>,
author = {Uwe Springmann and Christian Reul and Stefanie Dipper and
Johannes Baiter},
title = {{Ground Truth for training {OCR} engines on historical
documents in German Fraktur and Early Modern Latin}},
journal = {J. Lang. Technol. Comput. Linguistics},
volume = {33},
number = {1},
pages = {97--114},
year = {2018},
url = {https://jlcl.org/content/2-allissues/1-heft1-2018/jlcl_2018-1_5.pdf}
}</pre>
Zenodo
2018-08-12
info:eu-repo/semantics/other
1344131
1.0
1613908811.342214
4025354240
md5:3c382e707042ed5f548caf180fec40f8
https://zenodo.org/records/1344132/files/GT4HistOCR.tar
2559
md5:91061dbdcd8b0da4abbffbdefab006e2
https://zenodo.org/records/1344132/files/README
public
10.5281/zenodo.1344131
isVersionOf
doi