Dataset Open Access

OCR fulltexts of the Digital Collections of the Berlin State Library (DC-SBB)

Labusch, Kai; Zellhöfer, David

The digital collections of the SBB contain 153,942 digitized works from the time period of 1470 to 1945.

At the time of publication, 28,909 works had been OCR-processed, resulting in 4,988,099 full-text pages.
For each page with OCR text, the language has been determined by langid (Lui/Baldwin 2012).

corpus-entropy.pkl     entropy rate per document page
corpus-language.pkl    language per document page
corpus.zip             full-text corpus (extracts to .txt format)
de_corpus.zip          German sub-corpus (extracts to .txt format)
selection_de.pkl       selection list of German documents
xml2csv_alto.csv       full-text corpus per document page (incl. OCR word confidences)
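
Since corpus.zip and de_corpus.zip extract to .txt files, the texts can also be read directly from the archive without unpacking several gigabytes to disk. The following is a minimal sketch under stated assumptions: the helper name iter_text_members is hypothetical, and both the internal directory layout of the archives and the UTF-8 encoding of the text files are assumptions that this record does not document.

```python
import zipfile


def iter_text_members(archive, suffix=".txt", encoding="utf-8"):
    """Yield (member_name, text) for every text file in a zip archive.

    `archive` may be a path or a binary file-like object.
    Assumption: members are UTF-8 encoded; undecodable bytes are replaced
    rather than raising, because the real encoding is not documented here.
    """
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            # Skip directory entries and anything that is not a text file.
            if info.is_dir() or not info.filename.endswith(suffix):
                continue
            with zf.open(info) as member:
                yield info.filename, member.read().decode(encoding, errors="replace")
```

For example, `for name, text in iter_text_members("de_corpus.zip"): ...` would stream the German sub-corpus one file at a time, so memory use is bounded by the largest single text file.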

 

Sources

Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, ACL ’12, pages 25–30, Stroudsburg, PA, USA. Association for Computational Linguistics.

Files (40.0 GB)

Name                   Size       MD5
corpus-entropy.pkl     175.1 MB   683fe1c3d5c1b275c002248bddbb88e1
corpus-language.pkl    198.7 MB   014cabcad9e974174f2a590f702c63fc
corpus.zip             4.2 GB     11b23cddbf82cd6e0595ad367189db18
de_corpus.zip          2.2 GB     7c9c9922dece2068252533e5b7be8536
selection_de.pkl       143.3 MB   7afad26c0ab83601c6467dfd74039b97
xml2csv_alto.csv       33.1 GB    0cf20919da1df6f67d634304e12c3a1a
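
Given the size of the files (33.1 GB for xml2csv_alto.csv alone), it may be worth checking each download against the MD5 checksums listed above before working with it. The following is a minimal sketch, not part of the dataset: the function names md5_of_file and verify_downloads are hypothetical, and the files are assumed to sit in a single local directory under their original names.

```python
import hashlib
from pathlib import Path

# MD5 checksums as listed in this record.
EXPECTED_MD5 = {
    "corpus-entropy.pkl": "683fe1c3d5c1b275c002248bddbb88e1",
    "corpus-language.pkl": "014cabcad9e974174f2a590f702c63fc",
    "corpus.zip": "11b23cddbf82cd6e0595ad367189db18",
    "de_corpus.zip": "7c9c9922dece2068252533e5b7be8536",
    "selection_de.pkl": "7afad26c0ab83601c6467dfd74039b97",
    "xml2csv_alto.csv": "0cf20919da1df6f67d634304e12c3a1a",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to True if it exists and its checksum matches."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == checksum
    return results
```

Calling `verify_downloads("downloads/")` (the directory name is a placeholder) returns a dictionary in which missing or corrupted files map to False.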