Dataset Open Access

Data set of the paper "Publishing an OCR ground truth data set for reuse in an unclear copyright setting"

David Lassner; Julius Coburger; Clemens Neudecker; Anne Baillot


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nmm##2200000uu#4500</leader>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">OCR ground-truth</subfield>
  </datafield>
  <controlfield tag="005">20210512104546.0</controlfield>
  <controlfield tag="001">4742068</controlfield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">TU Berlin</subfield>
    <subfield code="a">Julius Coburger</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Staatsbibliothek zu Berlin - Preußischer Kulturbesitz</subfield>
    <subfield code="0">(orcid)0000-0001-5293-8322</subfield>
    <subfield code="a">Clemens Neudecker</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Le Mans Université</subfield>
    <subfield code="0">(orcid)0000-0002-4593-059X</subfield>
    <subfield code="a">Anne Baillot</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">300004</subfield>
    <subfield code="z">md5:99a25e5a8cc8942e571cd908dfc61927</subfield>
    <subfield code="u">https://zenodo.org/record/4742068/files/2021-05-7_v1.1_ocr-data.tgz</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2021-05-07</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire_data</subfield>
    <subfield code="o">oai:zenodo.org:4742068</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">TU Berlin</subfield>
    <subfield code="0">(orcid)0000-0001-9013-0834</subfield>
    <subfield code="a">David Lassner</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Data set of the paper "Publishing an OCR ground truth data set for reuse in an unclear copyright setting"</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u">https://creativecommons.org/licenses/by/4.0/legalcode</subfield>
    <subfield code="a">Creative Commons Attribution 4.0 International</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;The data set consists of a METS file for each of the PDFs that were used for transcription and a directory data/page_xml that contains the transcriptions of the ground truth in PAGE-XML format. In parallel to the data set publication, a data paper will be published that contains a detailed description of the data set. As soon as it is published, we will link to it. The corresponding source code can be found here&amp;nbsp;https://github.com/millawell/ocr-data/tree/1.1&lt;/p&gt;</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.4742067</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.4742068</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">dataset</subfield>
  </datafield>
</record>
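
As a minimal sketch of how the export above could be consumed, the following Python snippet reads a trimmed copy of the record with the standard library and pulls out the DOI, the authors, and the download URL with its MD5 checksum. The helper names `subfield_values` and `md5_matches` are hypothetical and introduced only for illustration; the field tags and values are copied from the record. It assumes the archive has already been downloaded before the checksum is verified.

```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace declared on the <record> element of the export.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Trimmed copy of the record above, reduced to the fields used in this sketch.
RECORD_XML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="a">David Lassner</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="a">Julius Coburger</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="a">Clemens Neudecker</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="a">Anne Baillot</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">300004</subfield>
    <subfield code="z">md5:99a25e5a8cc8942e571cd908dfc61927</subfield>
    <subfield code="u">https://zenodo.org/record/4742068/files/2021-05-7_v1.1_ocr-data.tgz</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.4742068</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
</record>"""


def subfield_values(record, tag, code):
    """Return the text of every subfield `code` inside datafields with `tag`."""
    xpath = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [el.text for el in record.findall(xpath, NS)]


def md5_matches(path, expected_hex):
    """Compare the MD5 digest of a local file with the expected hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large archives do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex


record = ET.fromstring(RECORD_XML)

doi = subfield_values(record, "024", "a")[0]
# Main author (tag 100) first, then the additional authors (tag 700).
authors = subfield_values(record, "100", "a") + subfield_values(record, "700", "a")
file_url = subfield_values(record, "856", "u")[0]
# The checksum subfield carries an "md5:" prefix that is dropped here.
checksum = subfield_values(record, "856", "z")[0].split(":", 1)[1]

print(doi, authors, file_url, checksum, sep="\n")
```

Once the archive named in `file_url` is stored locally, calling `md5_matches("2021-05-7_v1.1_ocr-data.tgz", checksum)` would indicate whether the download is intact; the local file name is an assumption.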