Extracted Illustrations of the Berlin State Library's Digitized Collections (part 1 of 4)

Zellhöfer, David

doi:10.5281/zenodo.2602431

Published March 22, 2019 | Version 2.0

Dataset Open

Zellhöfer, David¹

1. Berlin State Library

The dataset consists of illustrations extracted from 26,233 historical books and other media offered in the Berlin State Library's Digitized Collections. All media objects date from before 1920.

Version 1.0 contains 594,890 extracted illustrations in total.

The extraction of illustrations is driven by the coordinates given by the ABBYY FineReader OCR engine (in ALTO XML). The extracted illustrations have not been resized, but they have been compressed and saved in JPEG format.
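As a rough illustration of coordinate-driven extraction, the following sketch reads the bounding boxes of `Illustration` blocks from an ALTO document using only the Python standard library. It is not the dataset's actual extraction script. The sample XML is a hypothetical, heavily shortened snippet, the function name `illustration_boxes` is made up for this example, and the coordinates are assumed to be given in pixels.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened ALTO snippet for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="T1" HPOS="100" VPOS="100" WIDTH="1800" HEIGHT="400"/>
        <Illustration ID="I1" HPOS="250" VPOS="600" WIDTH="1500" HEIGHT="1200"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def illustration_boxes(alto_xml):
    """Return (left, upper, right, lower) boxes for all Illustration blocks.

    Assumption: HPOS, VPOS, WIDTH and HEIGHT are given in pixels.
    """
    root = ET.fromstring(alto_xml)
    boxes = []
    for element in root.iter():
        # Compare local names only, so that any ALTO namespace version matches.
        if element.tag.rsplit("}", 1)[-1] != "Illustration":
            continue
        left = int(float(element.get("HPOS")))
        upper = int(float(element.get("VPOS")))
        width = int(float(element.get("WIDTH")))
        height = int(float(element.get("HEIGHT")))
        boxes.append((left, upper, left + width, upper + height))
    return boxes


print(illustration_boxes(SAMPLE_ALTO))
```

The tuples use the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so each box could be passed to `crop` on the corresponding page image and the result saved as JPEG.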

Pre-trained models for separating color scales, handwritten signatures, library stamps, and the like from interesting content are available at: https://github.com/elektrobohemian/imi-unicorns.

The extracts for each media object are stored in separate sub-folders and tar files named after the PPN (a unique ID used in the library) to facilitate further processing. Additional metadata can be obtained with the help of the PPN as described here: https://github.com/elektrobohemian/StabiHacks/blob/master/ppn-howto.md .
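Because the PPN doubles as the file name, the extracts can be indexed per media object with a few lines of Python. The sketch below is only a starting point under stated assumptions: it assumes each tar file is named `<PPN>.tar` and contains JPEG files, and the function name `list_extracts` is hypothetical. The exact folder layout should be checked against the decompressed data.

```python
import tarfile
from pathlib import Path


def list_extracts(root_dir):
    """Map each PPN (taken from the tar file name) to its JPEG members."""
    extracts = {}
    for tar_path in sorted(Path(root_dir).rglob("*.tar")):
        ppn = tar_path.stem  # assumption: the file is named "<PPN>.tar"
        with tarfile.open(tar_path) as archive:
            extracts[ppn] = [
                member.name
                for member in archive.getmembers()
                if member.isfile()
                and member.name.lower().endswith((".jpg", ".jpeg"))
            ]
    return extracts
```

Calling `list_extracts` on the decompressed dataset folder returns a dictionary keyed by PPN, which can then be used to look up additional metadata for each media object.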

The dataset is published as a set of ZIP files, each fitting on a Blu-ray disc. After decompression, the contents will consume approximately 166 GB.

Change Log for Version

1. original dataset
2. added color histograms (RGB, separated by channel) in JSON and Python pickle format, as extracted with the Pillow package (see https://github.com/elektrobohemian/StabiHacks/tree/master/image-tools)
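For orientation, Pillow's `Image.histogram()` returns one flat list for an RGB image: 256 counts for the red band, followed by 256 for green and 256 for blue. The sketch below shows how such a list could be separated by channel and serialized as JSON. The function name `split_rgb_histogram` and the keys `r`, `g`, `b` are assumptions made for this example; the layout of the files in `color_histograms.zip` may differ.

```python
import json


def split_rgb_histogram(flat_counts):
    """Split a flat 768-value RGB histogram into one list per channel."""
    if len(flat_counts) != 768:
        raise ValueError("expected 3 x 256 histogram values for an RGB image")
    return {
        "r": list(flat_counts[0:256]),
        "g": list(flat_counts[256:512]),
        "b": list(flat_counts[512:768]),
    }


# With Pillow installed, the flat list could be obtained like this:
#   from PIL import Image
#   flat = Image.open("some_illustration.jpg").convert("RGB").histogram()
# Here a synthetic list stands in for real image data.
flat = [1] * 256 + [2] * 256 + [3] * 256
channels = split_rgb_histogram(flat)
as_json = json.dumps(channels)
```

Keeping the channels separate makes it easy to compare, for example, only the red distributions of two illustrations without re-reading the images.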

This is part 1 of 4. The following datasets contain the other ZIP files (8 files in total):

Files (50.0 GB)

Name	MD5	Size
_samples.pdf	f891104c3c5ef49a4a0d54e6467ee716	8.3 MB
color_histograms.zip	906115d30daa56d20ea83a0da2528269	1.6 GB
extracted_images.zip.001	d8bbbbd4211da19e3a1db5b0e0ecf3a9	24.2 GB
extracted_images.zip.002	2532b2978f32221a60eb235637dc47d6	24.2 GB
OCR-PPN-Liste.txt	052a22da8c321e790bfc147a65c4981a	2.0 MB
ppn_log.log	dacf09da0c95989e7b7f796c5e8ea909	1.3 MB
sbbget.py	996e9f0349b7c6e88a569496b7d59f6a	17.6 kB
sbbget_error.log	bccc975b04caa971ee74c8240c3bbaa6	49.6 kB
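
Given the size of the archives, it can be worth checking a download against the MD5 checksums listed above. The following is a minimal sketch using Python's standard `hashlib` module. The function names are hypothetical, and the file is read in chunks so that the 24.2 GB parts do not have to fit into memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, expected_md5):
    """Return True if the file's MD5 digest equals the listed checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, assuming `sbbget.py` has been downloaded into the current directory, `matches_listing("sbbget.py", "996e9f0349b7c6e88a569496b7d59f6a")` should return `True` for an intact copy.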

	All versions	This version
Views	883	622
Downloads	757	612
Data volume	5.3 TB	3.6 TB
