Published March 3, 2019 | Version 1.0
Dataset Open

Metadata, Title Pages, and Network Graph of the Digitized Content of the Berlin State Library (146,000 items)


The data set was downloaded via the OAI-PMH endpoint of the Berlin State Library/Staatsbibliothek zu Berlin’s Digitized Collections on March 1st, 2019 and converted into common tabular formats on the basis of the provided Dublin Core metadata. It contains 146,000 records.
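The conversion from Dublin Core to table rows could look roughly like the following. This is a minimal sketch, not the code used to build the data set: it assumes a standard OAI-PMH ListRecords response in the unqualified oai_dc format, the function name dc_records_to_rows is hypothetical, the sample XML is hand-written from values shown in this description, and joining repeated elements with "; " is an assumption.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs of OAI-PMH and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# The Dublin Core elements that appear as columns in the table.
DC_FIELDS = ["title", "creator", "subject", "type", "format", "identifier",
             "language", "date", "relation", "coverage", "publisher", "rights"]

def dc_records_to_rows(xml_text):
    """Turn every <record> of a ListRecords response into one dict per row."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iterfind(".//oai:record", NS):
        row = {}
        for field in DC_FIELDS:
            values = [el.text.strip()
                      for el in record.iterfind(f".//dc:{field}", NS)
                      if el.text]
            # Assumption: repeated elements are joined; missing ones become None.
            row[field] = "; ".join(values) if values else None
        rows.append(row)
    return rows

# Hand-written sample response (illustrative only).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Habitus Primorum Principiorum</dc:title>
          <dc:creator>Gutke, Georg</dc:creator>
          <dc:identifier>PPN57346250X</dc:identifier>
          <dc:coverage>Berolini</dc:coverage>
          <dc:date>1625</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

rows = dc_records_to_rows(SAMPLE)
```

Each resulting dict can then be written as one line of a CSV file, with the keys serving as column headers.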

In addition to the bibliographic metadata, representative images of the works have been downloaded, resized to thumbnails with a maximum size of 512 pixels, and saved in JPEG format. The image data is split into title pages and first pages. Title pages have been derived from structural metadata created by scan operators and librarians. If this information was not available, the first page of the medium has been downloaded instead. For multi-volume media, title pages are not available.
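The effect of the 512 pixel limit on image dimensions can be illustrated with a small calculation. This is a sketch under assumptions the description does not confirm: the limit is taken to apply to the longer side, the aspect ratio is preserved, and smaller images are not enlarged. The function name thumbnail_size is hypothetical.

```python
def thumbnail_size(width, height, max_side=512):
    """Return (width, height) scaled so that the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        # Assumption: images that already fit are left unchanged.
        return width, height
    scale = max_side / longest
    # Keep at least one pixel per side for extreme aspect ratios.
    return max(1, round(width * scale)), max(1, round(height * scale))

print(thumbnail_size(2048, 3072))  # portrait scan
print(thumbnail_size(400, 300))    # already small enough
```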

In total, 141,206 title/first page images are available.


Furthermore, the tabular data has been cleaned and extended with geospatial coordinates provided by the OpenStreetMap project. The data processing steps are summarized in the next section. For the sake of transparency and reproducibility, the original data taken from the OAI-PMH endpoint is still present in the table.


Finally, various graphs in GML file format are available that can be loaded directly into graph analysis tools such as Gephi.


The implementation of the data processing steps (incl. graph creation) is available as a Jupyter notebook.


Tabular Metadata


The metadata is available in Excel (cleanedData.xlsx) and CSV (cleanedData.csv) file formats with equal content.
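Reading the CSV variant needs nothing beyond the Python standard library. The sketch below assumes a comma-separated file with a header row. The helper names load_records and count_by are hypothetical, and the in-memory sample uses made-up rows with column names taken from this description.

```python
import csv
import io
from collections import Counter

def load_records(csv_file):
    """Read all rows of an open CSV file into a list of dicts keyed by column name."""
    return list(csv.DictReader(csv_file))

def count_by(records, column):
    """Count how often each non-empty value occurs in the given column."""
    return Counter(row[column] for row in records if row.get(column))

# Tiny in-memory stand-in for cleanedData.csv (made-up rows, real column names).
SAMPLE_CSV = io.StringIO(
    "PPN,spatialClusterName,century\n"
    "PPN0000000001,Berlin,17\n"
    "PPN0000000002,Berlin,18\n"
    "PPN0000000003,Leipzig,17\n"
)

records = load_records(SAMPLE_CSV)
places = count_by(records, "spatialClusterName")
print(places.most_common(1))  # [('Berlin', 2)]
```

For the real file, pass a handle opened with open("cleanedData.csv", newline="", encoding="utf-8") instead of the sample. The encoding is an assumption and may need adjusting.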

The table contains the following columns. Columns set in italics have not been processed.

·      title                             The title of the medium

·      creator                        Its creator (family name, first name)

·      subject            A collection’s name as provided by the library

·      type                            The type of medium

·      format                         A MIME type for full metadata download

·      identifier                    An additional identifier (most often the PPN)

·      language                    A 3-letter language code of the medium

·      date                            The date of creation/publication or a time span

·      relation                       The project or collection for which the medium has been digitized.

·      coverage                     The location of publication or origin (ranging from cities to continents)

·      publisher                    The publisher of the medium.

·      rights                          Copyright information.

·      PPN                             The unique identifier that can be used to find more information about the current medium in all information systems of Berlin State Library/Staatsbibliothek zu Berlin.

·      spatialClean               In case of multiple entries in coverage, only the first place of origin has been extracted. Additionally, characters such as question marks and brackets have been removed. Whitespace and spelling variants have been normalized with the help of regular expressions.

·      dateClean                   As the original date may use various formats to indicate unclear creation dates (e.g., time spans or question marks), this field maps the date to a single point in time.

·      spatialCluster             The cluster ID determined with the help of the Jaro-Winkler distance on the spatialClean string. This step is needed because the spatialClean fields still contain a large number of orthographic variants and latinizations of geographic names.

·      spatialClusterName   A human-readable cluster name (checked manually).

·      latitude                       The latitude provided by OpenStreetMap of the spatialClusterName if the location could be found.

·      longitude                    The longitude provided by OpenStreetMap of the spatialClusterName if the location could be found.

·      century                       A century derived from the date.

·      textCluster                  A text cluster ID obtained by k-means clustering (k=5,000) of the title field, represented with a tf*idf model and a vocabulary size of 125,000.

·      creatorCluster             A text cluster ID based on the creator field with k=20,000.

·      titleImage                  The path to the first/title page relative to the img/ subdirectory or None in case of a multi-volume work.
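The cleaning steps behind spatialClean, dateClean, and century can be sketched as follows. This is an illustration only, not the notebook's implementation. The separator between multiple coverage entries, the rule of taking the first four-digit year, and the century convention (1601 to 1700 counts as the 17th century) are all assumptions, and the function names are hypothetical.

```python
import re

def clean_spatial(coverage, separator=";"):
    """Keep the first place of origin and strip brackets and question marks."""
    if not coverage:
        return None
    first = coverage.split(separator)[0]        # assumption: ';' separates entries
    first = re.sub(r"[\[\](){}?]", "", first)   # drop brackets and question marks
    first = re.sub(r"\s+", " ", first).strip()  # normalize whitespace
    return first or None

def clean_date(date):
    """Map a free-form date to a single year (assumption: first 4-digit number)."""
    if not date:
        return None
    match = re.search(r"\d{4}", date)
    return int(match.group()) if match else None

def century_of(year):
    """Derive the century from a year, e.g. 1625 falls into the 17th century."""
    return None if year is None else (year - 1) // 100 + 1

print(clean_spatial("[Berolini?] ; Lipsiae"))
print(clean_date("[ca. 1625-1630?]"))
print(century_of(clean_date("1625")))
```

Place names cleaned this way still differ in spelling ("Berolini" versus "Berlin"), which is why the table adds the separate clustering step recorded in spatialCluster.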

Other Data


Various pre-computed graphs.


First and title pages in JPEG format.


JSON files for each record in the following format:


{
  "ppn": "PPN57346250X",
  "dateClean": "1625",
  "title": "M. Georgii Gutkii, Gymnasii Berlinensis Rectoris Habitus Primorum Principiorum, Seu Intelligentia; Annexae Sunt Appendicis loco Disputationes super eodem habitu tum in Academia Wittebergensi, tum in Gymnasio Berlinensi ventilatae",
  "creator": "Gutke, Georg",
  "spatialClusterName": "Berlin",
  "spatialClean": "Berolini",
  "spatialRaw": "Berolini",
  "mediatype": "monograph",
  "subject": "Historische Drucke",
  "publisher": "Kallius",
  "lat": "52.5170365",
  "lng": "13.3888599",
  "textCluster": "45",
  "creatorCluster": "5040",
  "titleImage": "titlepages/PPN57346250X.jpg"
}
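All values in the example record are strings, including coordinates and cluster IDs, so numeric fields need converting before use. The following sketch assumes each JSON file holds one flat object with the keys shown above. The function name parse_record is hypothetical, and the sample is shortened from the example record.

```python
import json

FLOAT_KEYS = ("lat", "lng")
INT_KEYS = ("dateClean", "textCluster", "creatorCluster")

def parse_record(text):
    """Parse one JSON record and convert its numeric string fields."""
    record = json.loads(text)
    for key, convert in [(k, float) for k in FLOAT_KEYS] + [(k, int) for k in INT_KEYS]:
        value = record.get(key)
        try:
            record[key] = convert(value)
        except (TypeError, ValueError):
            # Assumption: missing or non-numeric values are treated as unknown.
            record[key] = None
    return record

# Shortened copy of the example record from this description.
SAMPLE = """{
  "ppn": "PPN57346250X",
  "dateClean": "1625",
  "creator": "Gutke, Georg",
  "spatialClusterName": "Berlin",
  "lat": "52.5170365",
  "lng": "13.3888599",
  "textCluster": "45",
  "creatorCluster": "5040",
  "titleImage": "titlepages/PPN57346250X.jpg"
}"""

record = parse_record(SAMPLE)
print(record["spatialClusterName"], record["lat"], record["lng"])
```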


