Published March 3, 2019 | Version 1.0
Dataset Open

Metadata, Title Pages, and Network Graph of the Digitized Content of the Berlin State Library (146,000 items)

Description

The data set was downloaded via the OAI-PMH endpoint of the Berlin State Library/Staatsbibliothek zu Berlin’s Digitized Collections (https://digital.staatsbibliothek-berlin.de/oai) on March 1, 2019, and converted into common tabular formats on the basis of the provided Dublin Core metadata. It contains 146,000 records.
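As an illustration of the harvesting step, the following minimal sketch builds an OAI-PMH ListRecords request and flattens the Dublin Core fields of a response into dictionaries. It is not the code used to create this data set (that lives in the notebook referenced further down); the function names list_records_url and parse_records are hypothetical, and only the standard OAI-PMH verb, the oai_dc metadata prefix, and the standard namespaces are assumed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_ENDPOINT = "https://digital.staatsbibliothek-berlin.de/oai"
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(resumption_token=None):
    """Build a ListRecords URL (hypothetical helper).

    OAI-PMH treats resumptionToken as an exclusive argument, so it
    replaces metadataPrefix on follow-up requests.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return OAI_ENDPOINT + "?" + urlencode(params)


def parse_records(xml_text):
    """Return one dict per record: Dublin Core element name -> list of values."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(OAI_NS + "record"):
        row = {}
        for element in record.iter():
            # Keep only elements from the Dublin Core namespace.
            if element.tag.startswith(DC_NS):
                name = element.tag[len(DC_NS):]
                row.setdefault(name, []).append((element.text or "").strip())
        rows.append(row)
    return rows
```

A response body could be obtained with urllib.request.urlopen(list_records_url()) and passed to parse_records as text. Values are kept as lists because Dublin Core elements may repeat within a record.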

In addition to the bibliographic metadata, representative images of the works have been downloaded, resized to thumbnails of at most 512 pixels, and saved in JPEG format. The image data is split into title pages and first pages. Title pages have been derived from structural metadata created by scan operators and librarians. If this information was not available, the first pages of the media have been downloaded. For multi-volume media, title pages are not available.
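The resizing arithmetic can be sketched as follows. This assumes that "512 pixel maximum" refers to the longer image side and that smaller images are not enlarged; both are assumptions, and thumbnail_size is a hypothetical helper that only computes the target dimensions.

```python
def thumbnail_size(width, height, max_side=512):
    """Scale (width, height) so the longer side is at most max_side.

    Assumption: the aspect ratio is preserved and images that already
    fit are left unchanged.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = max_side / max(width, height)
    if scale >= 1:
        return width, height
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 1024 x 512 scan would be reduced to 512 x 256 under these assumptions.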

In total, 141,206 title/first page images are available.

 

Furthermore, the tabular data has been cleaned and extended with geo-spatial coordinates provided by the OpenStreetMap project (https://www.openstreetmap.org). The actual data processing steps are summarized in the next section. For the sake of transparency and reproducibility, the original data taken from the OAI-PMH endpoint is still present in the table.

 

Finally, various graphs in GML file format are available that can be loaded directly into graph analysis tools such as Gephi (https://gephi.org/).

 

The implementation of the data processing steps (incl. graph creation) is available as a Jupyter notebook provided at https://github.com/elektrobohemian/SBBrowse2018/blob/master/DataProcessing.ipynb.

 

Tabular Metadata

 

The metadata is available in Excel (cleanedData.xlsx) and CSV (cleanedData.csv) file formats with equal content.
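A minimal sketch for reading the CSV variant with the Python standard library is given here. It assumes a comma delimiter and a header row using the column names documented for this table; rows_with_coordinates is a hypothetical helper that keeps only records with usable latitude and longitude values.

```python
import csv
import math


def rows_with_coordinates(csv_file, delimiter=","):
    """Return (PPN, latitude, longitude) for rows that have coordinates.

    Assumption: the file has a header row with the columns PPN,
    latitude, and longitude; empty or non-numeric cells are skipped.
    """
    located = []
    for row in csv.DictReader(csv_file, delimiter=delimiter):
        try:
            lat = float(row["latitude"])
            lon = float(row["longitude"])
        except (KeyError, TypeError, ValueError):
            continue
        if math.isnan(lat) or math.isnan(lon):
            continue
        located.append((row.get("PPN"), lat, lon))
    return located
```

The function takes an open file object, for instance from open("cleanedData.csv", newline="", encoding="utf-8"); the UTF-8 encoding is likewise an assumption.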

The table contains the following columns. Columns in italics have not been processed.

·      title                             The title of the medium

·      creator                        Its creator (family name, first name)

·      subject            A collection’s name as provided by the library

·      type                            The type of medium

·      format                         A MIME type for full metadata download

·      identifier                    An additional identifier (most often the PPN)

·      language                    A 3-letter language code of the medium

·      date                            The date of creation/publication or a time span

·      relation                       A relation to a project or collection a medium has been digitized for.

·      coverage                     The location of publication or origin (ranging from cities to continents)

·      publisher                    The publisher of the medium.

·      rights                          Copyright information.

·      PPN                             The unique identifier that can be used to find more information about the current medium in all information systems of Berlin State Library/Staatsbibliothek zu Berlin.

·      spatialClean               In case of multiple entries in coverage, only the first place of origin has been extracted. Additionally, characters such as question marks and brackets have been removed. Whitespace and spelling variants have been normalized with the help of regular expressions.

·      dateClean                   As the original date may contain various format variants to indicate unclear creation dates (e.g., time spans or question marks), this field maps the date to a single point in time.

·      spatialCluster             The cluster ID determined with the help of the Jaro-Winkler distance on the spatialClean string. This step is needed because the spatialClean fields still contain a large number of orthographic variants and Latinized forms of geographic names.

·      spatialClusterName   A verbal cluster name (controlled manually).

·      latitude                       The latitude provided by OpenStreetMap of the spatialClusterName if the location could be found.

·      longitude                    The longitude provided by OpenStreetMap of the spatialClusterName if the location could be found.

·      century                       A century derived from the date.

·      textCluster                  A text cluster ID on the basis of a k-means clustering relying on the title field with a vocabulary size of 125,000 using the tf*idf model and k=5,000.

·      creatorCluster             A text cluster ID based on the creator field with k=20,000.

·      titleImage                  The path to the first/title page relative to the img/ subdirectory or None in case of a multi-volume work.
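The derived columns spatialClean, dateClean, century, and spatialCluster rest on a few small string operations, which the following sketch illustrates. It is not the notebook's implementation: the separator for multiple places, the set of removed characters, the "first four-digit year" rule, the century convention, and the Jaro-Winkler prefix weight of 0.1 are all assumptions, and every function name is hypothetical. Note that the code computes the Jaro-Winkler similarity; the corresponding distance is 1 minus that value.

```python
import re


def clean_spatial(coverage, separator=";"):
    """Keep the first place name and strip uncertainty markers (assumed rules)."""
    first = coverage.split(separator)[0]
    first = re.sub(r"[?\[\](){}]", "", first)
    return re.sub(r"\s+", " ", first).strip()


def clean_date(date):
    """Map a raw date to its first four-digit year, or None if there is none."""
    match = re.search(r"\d{4}", date)
    return int(match.group()) if match else None


def century(year):
    """Ordinal century of a year: 1601 to 1700 gives 17 (assumed convention)."""
    return (year - 1) // 100 + 1


def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, char in enumerate(s1):
        # Look for an unmatched equal character within the match window.
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)
```

Under these assumptions, clean_spatial("[Berolini?];  Lipsiae") yields "Berolini", and clean_date("[ca. 1625-1630?]") yields 1625. Place names whose pairwise similarity exceeds a chosen threshold could then be grouped into one cluster; the threshold and the grouping strategy used for this data set are not stated here.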

Other Data

 

graphs.zip

 

Various pre-computed graphs.

 

img.zip

 

First and title pages in JPEG format.

 

json.zip

 

JSON files for each record in the following format:

 

ppn                         "PPN57346250X"

dateClean               "1625"

title                         "M. Georgii Gutkii, Gymnasii Berlinensis Rectoris Habitus Primorum Principiorum, Seu Intelligentia; Annexae Sunt Appendicis loco Disputationes super eodem habitu tum in Academia Wittebergensi, tum in Gymnasio Berlinensi ventilatae"

creator                    "Gutke, Georg"

spatialClusterName   "Berlin"

spatialClean           "Berolini"

spatialRaw             "Berolini"

mediatype              "monograph"

subject                    "Historische Drucke"

publisher                "Kallius"

lat                           "52.5170365"

lng                          "13.3888599"

textCluster              "45"

creatorCluster        "5040"

titleImage              "titlepages/PPN57346250X.jpg"
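All values in the example record are quoted, so numeric fields arrive as strings. The sketch below parses one record and converts those fields; it assumes each file holds a single JSON object with the keys shown above, and load_record is a hypothetical helper.

```python
import json


def load_record(json_text):
    """Parse one record and convert numeric fields stored as strings.

    Assumption: lat/lng hold decimal degrees and the cluster fields hold
    integer IDs; missing or malformed values become None.
    """
    record = json.loads(json_text)
    for key, convert in (("lat", float), ("lng", float),
                         ("textCluster", int), ("creatorCluster", int)):
        try:
            record[key] = convert(record.get(key))
        except (TypeError, ValueError):
            record[key] = None
    return record
```

Text fields such as title and spatialClusterName are left untouched.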

Files (4.2 GB)

Checksum                                   Size
md5:7a2b28852a6da645911046e8e4aa7a9a       58.8 MB
md5:b68d77559703be8c9c4725db50fc82a3       23.4 MB
md5:ffd9da8e794e9f5e9082b8ba482d9e12       209.8 kB
md5:2d0cb2a3828cda17a89a2e7957565089       25.1 MB
md5:fd144eb8cc9d209c5f499a4d03d7071f       4.1 GB
md5:799f8e3f8728c94fcaa058796a77fb88       59.4 MB
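A downloaded file can be checked against its listed md5 value with a short script such as the following sketch. The helper names are hypothetical, and the listing here does not state which checksum belongs to which file, so the pairing has to be taken from the download page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed value such as 'md5:7a2b...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use low, which matters for the 4.1 GB archive.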