DESCRIPTION

GT4HistOCR contains ground truth for research in Optical Character
Recognition (OCR) technology applied to historical printings in German
Fraktur and Early Modern Latin. The ground truth comes in pairs of images
of single printed lines as they appear in book pages (*.png) and their
corresponding diplomatic transcriptions (*.gt.txt), which are UTF-8
strings preserving the character forms (glyphs) as much as possible
within the Unicode standard. These pairs of line images and their
transcriptions can be used directly to train recognition models with,
e.g., the open source OCR engines OCRopy or Tesseract.

The subcorpora contain a certain number of shuffled lines from books
printed in a variety of historical fonts.

subcorpus               description                            # lines
----------------------------------------------------------------------
dta19                   first editions of 19th c. German books 243,942
Early Modern Latin      Latin printings 1471-1686               10,288
KALLIMACHOS             Narrenschiff editions 1488-1509         20,929
Reference Corpus ENHG   German printings 1476-1499              24,766
RIDGES                  Herbals printed 1487-1870               13,248
                                                        --------------
                                                          sum: 313,173

In addition to the ground truth, there is a Perl script under tools/
which may be adapted to harmonize existing transcriptions of historical
books by different transcribers.

Pretrained OCRopy recognition models for the Early Modern Latin corpus
(latin1 and latin2), the RIDGES corpus (ridges1 and ridges2), and the
incunabula printings of the Reference Corpus ENHG are found under
models/. For more information on these models see our paper cited below,
which also contains references to the literature.

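
The pairing of line images (*.png) and transcriptions (*.gt.txt)
described above can be sketched in Python as follows. This is only an
illustration, not part of the distributed tools: the function name
collect_pairs is hypothetical, and the sketch assumes that an image and
its transcription sit in the same directory and share the file name up
to the first dot. Check this assumption against the actual file names
before relying on it.

```python
from pathlib import Path


def collect_pairs(root):
    """Return (image_path, transcription) tuples found below root.

    Assumption (not stated in this README): a line image and its
    transcription share the file name up to the first dot and live in
    the same directory.
    """
    pairs = []
    for gt_path in sorted(Path(root).rglob("*.gt.txt")):
        key = gt_path.name.split(".", 1)[0]
        # Look for PNG files in the same directory with the same key.
        images = sorted(
            p for p in gt_path.parent.iterdir()
            if p.suffix == ".png" and p.name.split(".", 1)[0] == key
        )
        if not images:
            continue  # transcription without a matching image: skip it
        # Transcriptions are UTF-8 strings; drop the trailing newline.
        text = gt_path.read_text(encoding="utf-8").rstrip("\n")
        pairs.append((images[0], text))
    return pairs
```

Calling collect_pairs on the directory of one subcorpus would yield a
list that can be fed into the data loading step of a training setup.
Transcriptions without a matching image are skipped silently here; a
stricter variant might report them instead.
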
AUTHORS

The data have been generated in the following projects:

* dta19: http://www.deutschestextarchiv.de/
* Early Modern Latin: http://www.cis.lmu.de/ocrworkshop/
* KALLIMACHOS: http://kallimachos.de
* Reference Corpus Early New High German: https://www.linguistics.ruhr-uni-bochum.de/ref/
* RIDGES: https://korpling.org/ridges

They were collected and put in the current form by Uwe Springmann,
Christian Reul, Stefanie Dipper, and Johannes Baiter.

COPYRIGHT

This data collection has a CC-BY-SA 4.0 license.

SEE ALSO

If you use these data, please cite the following paper:

Uwe Springmann, Christian Reul, Stefanie Dipper, Johannes Baiter (2018).
Ground Truth for training OCR engines on historical documents in German
Fraktur and Early Modern Latin.

LAST CHANGE

August 2018, uvius@gmx.de