Dataset Open Access

OCR fulltexts of the Digital Collections of the Berlin State Library (DC-SBB)

Labusch, Kai; Zellhöfer, David


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nmm##2200000uu#4500</leader>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">OCR fulltext</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">historic texts</subfield>
  </datafield>
  <controlfield tag="005">20200124192625.0</controlfield>
  <controlfield tag="001">3257041</controlfield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Berlin State Library</subfield>
    <subfield code="0">(orcid)0000-0002-0403-457X</subfield>
    <subfield code="a">Zellhöfer, David</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">175051440</subfield>
    <subfield code="z">md5:683fe1c3d5c1b275c002248bddbb88e1</subfield>
    <subfield code="u">https://zenodo.org/record/3257041/files/corpus-entropy.pkl</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">198737314</subfield>
    <subfield code="z">md5:014cabcad9e974174f2a590f702c63fc</subfield>
    <subfield code="u">https://zenodo.org/record/3257041/files/corpus-language.pkl</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">4173625252</subfield>
    <subfield code="z">md5:11b23cddbf82cd6e0595ad367189db18</subfield>
    <subfield code="u">https://zenodo.org/record/3257041/files/corpus.zip</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">2173534503</subfield>
    <subfield code="z">md5:7c9c9922dece2068252533e5b7be8536</subfield>
    <subfield code="u">https://zenodo.org/record/3257041/files/de_corpus.zip</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">143317470</subfield>
    <subfield code="z">md5:7afad26c0ab83601c6467dfd74039b97</subfield>
    <subfield code="u">https://zenodo.org/record/3257041/files/selection_de.pkl</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">33116656376</subfield>
    <subfield code="z">md5:0cf20919da1df6f67d634304e12c3a1a</subfield>
    <subfield code="u">https://zenodo.org/record/3257041/files/xml2csv_alto.csv</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2019-06-26</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire_data</subfield>
    <subfield code="p">user-stabi</subfield>
    <subfield code="o">oai:zenodo.org:3257041</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">Berlin State Library</subfield>
    <subfield code="a">Labusch, Kai</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">OCR fulltexts of the Digital Collections of the Berlin State Library (DC-SBB)</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">user-stabi</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u">https://creativecommons.org/licenses/by/4.0/legalcode</subfield>
    <subfield code="a">Creative Commons Attribution 4.0 International</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;The digital collections of the SBB contain 153,942 digitized works from the time period of 1470 to 1945.&lt;/p&gt;

&lt;p&gt;At the time of publication, 28,909 works had been OCR-processed, resulting in 4,988,099 full-text pages.&lt;br&gt;
For each page with OCR text, the language was determined by &lt;em&gt;langid&lt;/em&gt; (Lui/Baldwin 2012).&lt;/p&gt;

&lt;p&gt;corpus-entropy.pkl: entropy rate per document page&lt;/p&gt;

&lt;p&gt;corpus-language.pkl: language per document page&lt;/p&gt;

&lt;p&gt;corpus.zip: fulltext corpus (extracts to .txt format)&lt;/p&gt;

&lt;p&gt;de_corpus.zip: German sub-corpus (extracts to .txt format)&lt;/p&gt;

&lt;p&gt;selection_de.pkl: selection list of German documents&lt;/p&gt;

&lt;p&gt;xml2csv_alto.csv: fulltext corpus per document page (incl. OCR word confidences)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sources&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, ACL &amp;rsquo;12, pages 25&amp;ndash;30, Stroudsburg, PA, USA. Association for Computational Linguistics.&lt;/p&gt;</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.3257040</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.3257041</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">dataset</subfield>
  </datafield>
</record>
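
Each 856 field in the export above pairs a download URL (subfield u) with an MD5 checksum (subfield z) and what appears to be the file size in bytes (subfield s). The following is a minimal sketch of reading those fields with Python's standard library. The function name list_files and the SAMPLE string (an excerpt of the record reduced to a single 856 field) are illustrative assumptions, not part of the export.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <record> element of the MARC21 export.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Excerpt of the record above, reduced to one 856 field for illustration.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">175051440</subfield>
    <subfield code="z">md5:683fe1c3d5c1b275c002248bddbb88e1</subfield>
    <subfield code="u">https://zenodo.org/record/3257041/files/corpus-entropy.pkl</subfield>
  </datafield>
</record>"""


def list_files(marc_xml):
    """Return one dict (url, size, md5) per 856 datafield in a MARC21 record."""
    root = ET.fromstring(marc_xml)
    files = []
    for field in root.findall("marc:datafield[@tag='856']", NS):
        # Map subfield code -> text, e.g. {"s": "175051440", "z": "md5:...", "u": "https://..."}
        sub = {
            s.get("code"): (s.text or "").strip()
            for s in field.findall("marc:subfield", NS)
        }
        files.append({
            "url": sub.get("u", ""),
            # Assumption: subfield s holds the size in bytes.
            "size": int(sub["s"]) if sub.get("s") else None,
            # Drop the "md5:" prefix so the value can be compared to a hex digest.
            "md5": sub.get("z", "").split(":", 1)[-1],
        })
    return files


files = list_files(SAMPLE)
for f in files:
    print(f["url"], f["size"], f["md5"])
```

The extracted md5 value can then be compared against the hex digest of a downloaded file, for example one computed with hashlib.md5, before working with the larger archives.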