Dataset Open Access

GT4HistOCR: Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin

Springmann, Uwe; Reul, Christian; Dipper, Stefanie; Baiter, Johannes


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nmm##2200000uu#4500</leader>
  <datafield tag="041" ind1=" " ind2=" ">
    <subfield code="a">eng</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">OCR, historical documents, digital humanities, Fraktur, Early Modern Latin, Early New High German</subfield>
  </datafield>
  <controlfield tag="005">20180907150620.0</controlfield>
  <controlfield tag="001">1344132</controlfield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Universität Würzburg</subfield>
    <subfield code="a">Reul, Christian</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Ruhr-Universität Bochum</subfield>
    <subfield code="a">Dipper, Stefanie</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Bayerische Staatsbibliothek München</subfield>
    <subfield code="a">Baiter, Johannes</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">4025354240</subfield>
    <subfield code="z">md5:3c382e707042ed5f548caf180fec40f8</subfield>
    <subfield code="u">https://zenodo.org/record/1344132/files/GT4HistOCR.tar</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">2559</subfield>
    <subfield code="z">md5:91061dbdcd8b0da4abbffbdefab006e2</subfield>
    <subfield code="u">https://zenodo.org/record/1344132/files/README</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2018-08-12</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire_data</subfield>
    <subfield code="o">oai:zenodo.org:1344132</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">LMU</subfield>
    <subfield code="a">Springmann, Uwe</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">GT4HistOCR: Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u">http://creativecommons.org/licenses/by/4.0/legalcode</subfield>
    <subfield code="a">Creative Commons Attribution 4.0 International</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;&lt;strong&gt;GT4HistOCR&lt;/strong&gt; contains ground truth for research in Optical Character Recognition (OCR) technology applied to historical printings in German Fraktur and Early Modern Latin.&lt;/p&gt;

&lt;p&gt;The ground truth consists of pairs: an image of a single printed line as it appears on the book page (*.png) and its diplomatic transcription (*.gt.txt), a UTF-8 string that preserves the character forms (glyphs) as far as the Unicode standard allows. These pairs of line images and transcriptions can be used directly to train recognition models with, e.g., the open-source OCR engines &lt;em&gt;OCRopy&lt;/em&gt; or &lt;em&gt;Tesseract&lt;/em&gt;. A total of 313,173 ground truth lines are provided.&lt;/p&gt;</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.1344131</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.1344132</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">dataset</subfield>
  </datafield>
</record>
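
The record description says that each ground truth item is a line image (*.png) paired with a UTF-8 transcription (*.gt.txt). The following is a minimal sketch of how such pairs might be collected after unpacking the archive. It is not taken from the dataset's README: the function name `iter_line_pairs` is hypothetical, and the pairing rule (an image and a transcription belong together when they sit in the same directory and share the file name up to the first dot, so that both `0001.png` and `0001.bin.png` would match `0001.gt.txt`) is an assumption that should be checked against the actual directory layout.

```python
from pathlib import Path


def iter_line_pairs(root):
    """Yield (image_path, transcription) for each line image with ground truth.

    Assumption (not confirmed by the record): an image and its transcription
    share a directory and the part of the file name before the first dot.
    Images without a matching *.gt.txt file are skipped.
    """
    root = Path(root)
    for image_path in sorted(root.rglob("*.png")):
        # Everything before the first dot is treated as the line identifier.
        line_id = image_path.name.split(".", 1)[0]
        gt_path = image_path.with_name(line_id + ".gt.txt")
        if gt_path.is_file():
            # Transcriptions are described as UTF-8 strings; drop only the
            # trailing newline so that historical glyphs stay untouched.
            text = gt_path.read_text(encoding="utf-8").rstrip("\n")
            yield image_path, text
```

Calling `list(iter_line_pairs("path/to/unpacked/archive"))` would give a list of image paths and transcription strings that can then be handed to whatever training pipeline is in use. If the pairing assumption holds, the number of pairs found should match the 313,173 lines stated in the description.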