Published June 26, 2019 | Version 1.0
Dataset Open

OCR fulltexts of the Digital Collections of the Berlin State Library (DC-SBB)

  • 1. Berlin State Library

Description

The digital collections of the SBB contain 153,942 digitized works from the period 1470 to 1945.

At the time of publication, 28,909 works had been OCR-processed, resulting in 4,988,099 full-text pages.
For each page with OCR text, the language was determined with langid.py (Lui and Baldwin 2012).

corpus-entropy.pkl    entropy rate per document page

corpus-language.pkl   language per document page

corpus.zip            full-text corpus (extracts to .txt format)

de_corpus.zip         German sub-corpus (extracts to .txt format)

selection_de.pkl      selection list of German documents

xml2csv_alto.csv      full-text corpus per document page (incl. OCR word confidences)
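
Because the archives are several gigabytes in size, it can be convenient to read pages directly from the ZIP file instead of extracting everything first. The following is a minimal sketch under stated assumptions: the function name iter_text_members is hypothetical, the internal folder layout of the archives is not described here, and UTF-8 is assumed as the text encoding.

```python
import zipfile


def iter_text_members(archive, encoding="utf-8"):
    """Yield (member_name, text) for each .txt member of a ZIP archive.

    `archive` may be a path or a binary file-like object.
    The UTF-8 default is an assumption, not a documented property of the corpus.
    """
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            # Skip directories and anything that is not a .txt file.
            if info.is_dir() or not info.filename.lower().endswith(".txt"):
                continue
            with zf.open(info) as member:
                # Undecodable bytes are replaced rather than raising an error.
                yield info.filename, member.read().decode(encoding, errors="replace")
```

Calling iter_text_members("de_corpus.zip") on a downloaded copy would then yield one (name, text) pair at a time, so only a single member is held in memory.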

 

Sources

Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, ACL ’12, pages 25–30, Stroudsburg, PA, USA. Association for Computational Linguistics.

Files


Files (40.0 GB)

Checksum                               Size
md5:683fe1c3d5c1b275c002248bddbb88e1   175.1 MB
md5:014cabcad9e974174f2a590f702c63fc   198.7 MB
md5:11b23cddbf82cd6e0595ad367189db18   4.2 GB
md5:7c9c9922dece2068252533e5b7be8536   2.2 GB
md5:7afad26c0ab83601c6467dfd74039b97   143.3 MB
md5:0cf20919da1df6f67d634304e12c3a1a   33.1 GB
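
After downloading, a file can be checked against its MD5 checksum from the listing above. This is a minimal sketch using only the Python standard library; the helper names md5_of_file and matches_checksum are hypothetical, and which checksum belongs to which file must be taken from the record itself, since this listing does not pair them with file names.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain '<hex>'."""
    expected_hex = expected.strip().lower()
    if expected_hex.startswith("md5:"):
        expected_hex = expected_hex[len("md5:"):]
    return md5_of_file(path) == expected_hex
```

Reading in chunks keeps memory use constant, which matters for the multi-gigabyte files in this record.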