OCR fulltexts of the Digital Collections of the Berlin State Library (DC-SBB)

Labusch, Kai; Zellhöfer, David

doi:10.5281/zenodo.3257041

Published June 26, 2019 | Version 1.0

Dataset Open

1. Berlin State Library

The digital collections of the Berlin State Library (SBB) contain 153,942 digitized works from the period 1470 to 1945.

At the time of publication, 28,909 works had been OCR-processed, resulting in 4,988,099 full-text pages. For each page with OCR text, the language was determined with langid.py (Lui/Baldwin 2012).

corpus-entropy.pkl: entropy rate per document page
corpus-language.pkl: language per document page
corpus.zip: full-text corpus (extracts to .txt format)
de_corpus.zip: German sub-corpus (extracts to .txt format)
selection_de.pkl: selection list of German documents
xml2csv_alto.csv: full-text corpus per document page (incl. OCR word confidences)
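
Because corpus.zip and de_corpus.zip are several gigabytes each, it can be convenient to read individual text files straight from the archive instead of extracting everything first. The following is a minimal sketch using only the Python standard library. The function name `iter_texts`, the UTF-8 encoding, and the `.txt` suffix filter are assumptions made for illustration; the record does not document the internal layout of the archives.

```python
import zipfile
from typing import Iterator, Tuple


def iter_texts(zip_path: str, encoding: str = "utf-8") -> Iterator[Tuple[str, str]]:
    """Yield (member_name, text) for every .txt member of a ZIP archive.

    Assumption: members are plain-text files in the given encoding.
    Bytes that cannot be decoded are replaced rather than raising an error.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            # Skip directories and anything that is not a .txt file.
            if info.is_dir() or not info.filename.lower().endswith(".txt"):
                continue
            with archive.open(info) as handle:
                yield info.filename, handle.read().decode(encoding, errors="replace")


if __name__ == "__main__":
    import os

    # Hypothetical local path to a downloaded archive.
    archive_path = "corpus.zip"
    if os.path.exists(archive_path):
        for name, text in iter_texts(archive_path):
            print(name, len(text))
            break  # show only the first text file
```

Reading lazily with a generator keeps memory use bounded by the size of a single text file, which matters when the archive holds millions of pages.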

Sources

Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, ACL '12, pages 25–30, Stroudsburg, PA, USA. Association for Computational Linguistics.

Files (40.0 GB)

Name	MD5	Size
corpus-entropy.pkl	683fe1c3d5c1b275c002248bddbb88e1	175.1 MB
corpus-language.pkl	014cabcad9e974174f2a590f702c63fc	198.7 MB
corpus.zip	11b23cddbf82cd6e0595ad367189db18	4.2 GB
de_corpus.zip	7c9c9922dece2068252533e5b7be8536	2.2 GB
selection_de.pkl	7afad26c0ab83601c6467dfd74039b97	143.3 MB
xml2csv_alto.csv	0cf20919da1df6f67d634304e12c3a1a	33.1 GB
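
With files of this size, a download can be truncated or corrupted without any visible error, so it may be worth checking each file against the MD5 checksum listed in the table above. The sketch below uses Python's standard `hashlib` module and reads in chunks so that even the 33.1 GB CSV never has to fit in memory. The helper names `md5_of_file` and `verify` are illustrative, and the files are assumed to be stored locally under their original names.

```python
import hashlib
import os
from typing import Optional

# MD5 checksums as listed in the file table of this record.
EXPECTED_MD5 = {
    "corpus-entropy.pkl": "683fe1c3d5c1b275c002248bddbb88e1",
    "corpus-language.pkl": "014cabcad9e974174f2a590f702c63fc",
    "corpus.zip": "11b23cddbf82cd6e0595ad367189db18",
    "de_corpus.zip": "7c9c9922dece2068252533e5b7be8536",
    "selection_de.pkl": "7afad26c0ab83601c6467dfd74039b97",
    "xml2csv_alto.csv": "0cf20919da1df6f67d634304e12c3a1a",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks of chunk_size bytes."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, name: Optional[str] = None) -> bool:
    """Compare a local file against the checksum listed for `name`.

    If `name` is omitted, the file's base name is used. A KeyError is
    raised for names that do not appear in EXPECTED_MD5.
    """
    key = name if name is not None else os.path.basename(path)
    return md5_of_file(path) == EXPECTED_MD5[key]


if __name__ == "__main__":
    for filename in EXPECTED_MD5:
        if os.path.exists(filename):
            status = "OK" if verify(filename) else "MISMATCH"
            print(f"{filename}: {status}")
```

A mismatch usually indicates an incomplete download; fetching the file again is the simplest remedy.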

	All versions	This version
Views	1,329	1,319
Downloads	439	433
Data volume	6.0 TB	5.9 TB
