Dataset Open Access

Dataset of Pages from Early Printed Books with Multiple Font Groups

Seuret, Mathias; Limbach, Saskia; Weichselbaumer, Nikolaus; Maier, Andreas; Christlein, Vincent

Data collector(s)
Bittmann, Janina; Duntze, Oliver; Hinrichsen, Lena; Hoppe, Leonie; Hosfeld, Maria; Lieneke, Lukas; Meier, Annette; Menz, Lennart; Schmidt, Christian; Stumm, Magdalena; Wiechmann, Eileen; Limbach, Saskia; Weichselbaumer, Nikolaus
Data manager(s)
Limbach, Saskia; Weichselbaumer, Nikolaus; Seuret, Mathias

This dataset consists of photos, at various resolutions, of 35,623 pages of printed books dating from the 15th to the 18th century. Experts assigned each page between one and five labels corresponding to the font groups used in the text. The font groups are Antiqua, Bastarda, Fraktur, Gotico Antiqua, Greek, Hebrew, Italic, Rotunda, Schwabacher, and Textura; two extra classes cover non-textual content and fonts not present in this list.

To make downloading easier over slow or unreliable Internet connections, the dataset has been split into several zip files. All zip files must be extracted into the same folder. The CSV files containing the labels should ideally be placed in the parent of that folder.
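
The extraction step can be sketched in Python as follows. This is a minimal sketch under assumptions: the function name `extract_all` and the folder arguments are hypothetical, and the internal layout of the archives is not described on this page, so the resulting paths may need adjusting. It extracts every image archive into one folder and the labels archive into that folder's parent.

```python
import zipfile
from pathlib import Path


def extract_all(archive_dir, target_dir):
    """Extract all fontgroupsdataset-*.zip archives found in archive_dir.

    Image archives go into target_dir; the labels archive goes into the
    parent of target_dir (assumption: the CSV files sit at the top level
    of the labels archive).
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(archive_dir).glob("fontgroupsdataset-*.zip")):
        # The labels archive is the only one whose name ends in "-labels".
        destination = target.parent if archive.stem.endswith("-labels") else target
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(destination)
        extracted.append(archive.name)
    return extracted
```

For example, `extract_all("downloads", "fontgroups/images")` would place the pages in `fontgroups/images` and the CSV files in `fontgroups`.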

The labels are provided in two CSV files: one for training and tuning font group recognition methods, and one for evaluation. Where several pages come from the same book, special care has been taken to keep all of them in the same subset.
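
The same book-level principle applies when carving a tuning subset out of the training file. The sketch below is illustrative only and is not the procedure used to build the published split. It assumes each page can be paired with a book identifier; this page does not say how books are identified, so `book_id` and the function name `split_by_book` are hypothetical.

```python
import random
from collections import defaultdict


def split_by_book(pages, holdout_fraction=0.2, seed=0):
    """Split pages so that all pages of a book land in the same subset.

    pages: iterable of (page_id, book_id) pairs (book_id is an assumed field).
    Returns (train_page_ids, holdout_page_ids).
    """
    by_book = defaultdict(list)
    for page_id, book_id in pages:
        by_book[book_id].append(page_id)

    # Shuffle whole books, not individual pages, with a fixed seed.
    books = sorted(by_book)
    random.Random(seed).shuffle(books)

    # Hold out at least one book when there is more than one book.
    n_holdout = max(1, round(len(books) * holdout_fraction)) if len(books) > 1 else 0
    holdout_books = set(books[:n_holdout])

    train = [p for b in books if b not in holdout_books for p in by_book[b]]
    holdout = [p for b in books if b in holdout_books for p in by_book[b]]
    return train, holdout
```

Splitting at the book level avoids evaluating on pages whose typeface and print quality the model has already seen during training.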

The paper presenting this dataset in detail is "Dataset of Pages from Early Printed Books with Multiple Font Groups", accepted at the 5th International Workshop on Historical Document Imaging and Processing, Sydney, Australia.

We would like to thank the British Library (London), Bayerische Staatsbibliothek München, Staatsbibliothek zu Berlin, Universitätsbibliothek Erlangen, Universitätsbibliothek Heidelberg, Staats- und Universitätsbibliothek Göttingen, Stadt- und Universitätsbibliothek Köln, Württembergische Landesbibliothek Stuttgart and Herzog August Bibliothek Wolfenbüttel for the data they sent us and kindly allowed us to use in this public dataset.

Files (44.2 GB)
Name                            Size        Checksum
fontgroupsdataset-a.zip         6.3 GB      md5:2fcd1cf7f4e766625ab5aaae6f10eb3e
fontgroupsdataset-b.zip         6.2 GB      md5:fe597b957dbbc29e5940bce21ac08f8c
fontgroupsdataset-c.zip         6.3 GB      md5:98d7afd41328b30afb9125708cd8bb0f
fontgroupsdataset-d.zip         6.4 GB      md5:3c86b0ae51fb458ad1daea3d243668f5
fontgroupsdataset-e.zip         6.3 GB      md5:203b8555c92b8bd2cdc4e596b2a76ef1
fontgroupsdataset-f.zip         6.5 GB      md5:697eb0182aca2acc839abe1008e999ba
fontgroupsdataset-g.zip         6.3 GB      md5:0dc8dbd31c9942d85289fce41776eb56
fontgroupsdataset-labels.zip    292.0 kB    md5:bf98a2d56bdcb7e5c09de4c92727cc18
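
Since the archives are large, it may be worth checking each download against the MD5 checksums listed for the files before extracting. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of` and `verify_downloads` are hypothetical, and the checksum values are copied from the file list.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file list of this record.
EXPECTED_MD5 = {
    "fontgroupsdataset-a.zip": "2fcd1cf7f4e766625ab5aaae6f10eb3e",
    "fontgroupsdataset-b.zip": "fe597b957dbbc29e5940bce21ac08f8c",
    "fontgroupsdataset-c.zip": "98d7afd41328b30afb9125708cd8bb0f",
    "fontgroupsdataset-d.zip": "3c86b0ae51fb458ad1daea3d243668f5",
    "fontgroupsdataset-e.zip": "203b8555c92b8bd2cdc4e596b2a76ef1",
    "fontgroupsdataset-f.zip": "697eb0182aca2acc839abe1008e999ba",
    "fontgroupsdataset-g.zip": "0dc8dbd31c9942d85289fce41776eb56",
    "fontgroupsdataset-labels.zip": "bf98a2d56bdcb7e5c09de4c92727cc18",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(folder, expected=EXPECTED_MD5):
    """Return the names of archives that are missing or fail the checksum."""
    failed = []
    for name, checksum in expected.items():
        path = Path(folder) / name
        if not path.is_file() or md5_of(path) != checksum:
            failed.append(name)
    return failed
```

An empty list from `verify_downloads("downloads")` would indicate that all eight archives are present and intact; any name returned should be downloaded again.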