Metadata, Title Pages, and Network Graph of the Digitized Content of the Berlin State Library (146,000 items)
Description
The data set was downloaded via the OAI-PMH endpoint of the Berlin State Library/Staatsbibliothek zu Berlin’s Digitized Collections (https://digital.staatsbibliothek-berlin.de/oai) on March 1, 2019, and converted into common tabular formats based on the provided Dublin Core metadata. It contains 146,000 records.
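As a rough illustration of this harvesting step, the sketch below builds an OAI-PMH ListRecords request for Dublin Core metadata (the standard `oai_dc` prefix) and collects the Dublin Core elements of one record. The function names and the small sample record are assumptions made for illustration only (the sample is hand-written, not a real response); the linked notebook may proceed differently.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "https://digital.staatsbibliothek-berlin.de/oai"
DC_NS = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core element namespace


def list_records_url(resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return OAI_ENDPOINT + "?" + urlencode(params)


def dc_fields(record_xml):
    """Collect the Dublin Core elements of one record into a dict of lists."""
    fields = {}
    for element in ET.fromstring(record_xml).iter():
        if element.tag.startswith(DC_NS):
            name = element.tag[len(DC_NS):]
            fields.setdefault(name, []).append((element.text or "").strip())
    return fields


# Hand-written sample for illustration, not an actual server response.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:creator>Gutke, Georg</dc:creator>
  <dc:coverage>Berolini</dc:coverage>
</oai_dc:dc>"""

print(list_records_url())
print(dc_fields(SAMPLE))
```

Each Dublin Core element may occur several times per record, which is why the values are kept as lists before being flattened into table columns.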
In addition to the bibliographic metadata, representative images of the works have been downloaded, resized to thumbnails with a maximum size of 512 pixels, and saved in JPEG format. The image data is split into title pages and first pages. Title pages have been derived from structural metadata created by scan operators and librarians. If this information was not available, the first pages of the media have been downloaded. In case of multi-volume media, title pages are not available.
In total, 141,206 title/first page images are available.
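The resizing rule can be sketched as a small size computation. This is a minimal sketch that assumes the 512 pixel limit applies to the longer side, that the aspect ratio is preserved, and that smaller images are left unscaled; none of these details are confirmed by the description, and `thumbnail_size` is a hypothetical helper.

```python
def thumbnail_size(width, height, max_side=512):
    """Return (width, height) scaled so that the longer side is at most max_side."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    longest = max(width, height)
    if longest <= max_side:
        # Assumption: images that are already small enough are not upscaled.
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(thumbnail_size(2048, 3072))
print(thumbnail_size(400, 300))
```

An imaging library would then resample the page scan to the computed size before it is written as JPEG.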
Furthermore, the tabular data has been cleaned and extended with geo-spatial coordinates provided by the OpenStreetMap project (https://www.openstreetmap.org). The actual data processing steps are summarized in the next section. For the sake of transparency and reproducibility, the original data taken from the OAI-PMH endpoint is still present in the table.
Finally, various graphs in GML file format are available that can be loaded directly into graph analysis tools such as Gephi (https://gephi.org/).
The implementation of the data processing steps (incl. graph creation) is available as a Jupyter notebook provided at https://github.com/elektrobohemian/SBBrowse2018/blob/master/DataProcessing.ipynb.
Tabular Metadata
The metadata is available in Excel (cleanedData.xlsx) and CSV (cleanedData.csv) file formats with identical content.
The table contains the following columns. Columns in italics have not been processed.
· title The title of the medium
· creator Its creator (family name, first name)
· subject A collection’s name as provided by the library
· type The type of medium
· format A MIME type for full metadata download
· identifier An additional identifier (most often the PPN)
· language A 3-letter language code of the medium
· date The date of creation/publication or a time span
· relation A relation to a project or collection a medium has been digitized for.
· coverage The location of publication or origin (ranging from cities to continents)
· publisher The publisher of the medium.
· rights Copyright information.
· PPN The unique identifier that can be used to find more information about the current medium in all information systems of Berlin State Library/Staatsbibliothek zu Berlin.
· spatialClean In case of multiple entries in coverage, only the first place of origin has been extracted. Additionally, characters such as question marks and brackets have been removed. The entries have been normalized with regard to whitespace and spelling variants with the help of regular expressions.
· dateClean As the original date may use various formats to indicate uncertain creation dates (e.g., time spans or question marks), this field maps the date to a single point in time.
· spatialCluster The cluster ID determined with the help of the Jaro-Winkler distance on the spatialClean string. This step is needed because the spatialClean fields still contain a large number of orthographic variants and Latinizations of geographic names.
· spatialClusterName A verbal cluster name (controlled manually).
· latitude The latitude provided by OpenStreetMap of the spatialClusterName if the location could be found.
· longitude The longitude provided by OpenStreetMap of the spatialClusterName if the location could be found.
· century A century derived from the date.
· textCluster A text cluster ID on the basis of a k-means clustering relying on the title field with a vocabulary size of 125,000 using the tf*idf model and k=5,000.
· creatorCluster A text cluster ID based on the creator field with k=20,000.
· titleImage The path to the first/title page relative to the img/ subdirectory or None in case of a multi-volume work.
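To make the spatialCluster column more tangible, the following sketch computes the Jaro-Winkler similarity (the distance is 1 minus this value) and groups place names greedily around a representative. The greedy strategy, the threshold of 0.9, and the function names are assumptions for illustration; the notebook linked above may cluster differently.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only match if they are at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def cluster_names(names, threshold=0.9):
    """Greedy clustering: join the first cluster whose representative is similar enough."""
    representatives = []
    assignment = {}
    for name in names:
        for cluster_id, rep in enumerate(representatives):
            if jaro_winkler(name.lower(), rep.lower()) >= threshold:
                assignment[name] = cluster_id
                break
        else:
            # No sufficiently similar cluster exists, so open a new one.
            representatives.append(name)
            assignment[name] = len(representatives) - 1
    return assignment


print(cluster_names(["Berlin", "Berolini", "Leipzig"]))
```

With these settings, the Latinized form "Berolini" ends up in the same cluster as "Berlin", while "Leipzig" opens a cluster of its own. The result of such a greedy pass depends on the input order and the threshold, which is one reason why manually controlled cluster names (spatialClusterName) are useful.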
Other Data
graphs.zip
Various pre-computed graphs.
img.zip
First and title pages in JPEG format.
json.zip
JSON files for each record in the following format:
ppn "PPN57346250X"
dateClean "1625"
title "M. Georgii Gutkii, Gymnasii Berlinensis Rectoris Habitus Primorum Principiorum, Seu Intelligentia; Annexae Sunt Appendicis loco Disputationes super eodem habitu tum in Academia Wittebergensi, tum in Gymnasio Berlinensi ventilatae"
creator "Gutke, Georg"
spatialClusterName "Berlin"
spatialClean "Berolini"
spatialRaw "Berolini"
mediatype "monograph"
subject "Historische Drucke"
publisher "Kallius"
lat "52.5170365"
lng "13.3888599"
textCluster "45"
creatorCluster "5040"
titleImage "titlepages/PPN57346250X.jpg"
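A record of this kind could be read as sketched below. The sketch assumes that each file holds one flat JSON object whose values are all strings, as the listing above suggests; the exact layout of the files in json.zip is not confirmed here, and `load_record` is a hypothetical helper.

```python
import json


def load_record(text):
    """Parse one record and convert its coordinates to floats (None if missing)."""
    record = json.loads(text)
    for key in ("lat", "lng"):
        try:
            record[key] = float(record.get(key))
        except (TypeError, ValueError):
            # Records without a geocoded location keep an explicit None.
            record[key] = None
    return record


# Abbreviated sample built from the values shown above.
sample = json.dumps({
    "ppn": "PPN57346250X",
    "dateClean": "1625",
    "creator": "Gutke, Georg",
    "spatialClusterName": "Berlin",
    "lat": "52.5170365",
    "lng": "13.3888599",
    "titleImage": "titlepages/PPN57346250X.jpg",
})

record = load_record(sample)
print(record["ppn"], record["spatialClusterName"], record["lat"], record["lng"])
```

Converting the coordinates once at load time avoids repeated string parsing when the records are plotted on a map.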
Files
(4.2 GB)
Checksum | Size |
---|---|
md5:7a2b28852a6da645911046e8e4aa7a9a | 58.8 MB |
md5:b68d77559703be8c9c4725db50fc82a3 | 23.4 MB |
md5:ffd9da8e794e9f5e9082b8ba482d9e12 | 209.8 kB |
md5:2d0cb2a3828cda17a89a2e7957565089 | 25.1 MB |
md5:fd144eb8cc9d209c5f499a4d03d7071f | 4.1 GB |
md5:799f8e3f8728c94fcaa058796a77fb88 | 59.4 MB |
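The MD5 checksums listed for the files can be used to verify a download. The following is a minimal sketch using Python's standard hashlib module; the helper names are hypothetical, and the checksum has to be paired with the file it belongs to.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file without loading it into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed value such as 'md5:7a2b...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory usage low even for the 4.1 GB archive. A call would look like `matches_checksum("path/to/downloaded_file", "md5:...")`, where the path is a placeholder for a local download.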