Published August 6, 2021 | Version 1.0
Dataset Open

BiblIA - an Open Annotated Dataset

  • 1. EPHE, PSL
  • 2. Oxford University
  • 3. University of Maryland

Description

Version 1.0, 6.8.2021, Daniel Stökl Ben Ezra (daniel.stoekl@ephe.psl.eu)

This Handwritten Text Recognition dataset provides layout segmentation (regions, toplines, and line polygons) and Unicode transcriptions in ALTO 4.2 XML for 202 images of medieval Hebrew manuscripts from the Bibliothèque nationale de France (BnF, National Library of France) and the Biblioteca Apostolica Vaticana (BAV, Vatican Library). It accompanies the article "BiblIA - a General Model for Medieval Hebrew Manuscripts and an Open Annotated Dataset" by Daniel Stökl Ben Ezra, Bronson Brown-DeVost, Pawel Jablonski, Benjamin Kiessling, Elena Lolli, and Hayim Lapin, published in HIP@ICDAR 2021, held in Lausanne, September 2021.
The catalogue.txt file lists, for each image, the manuscript shelfmark, script style (Ashkenazy, Italian, Sephardi), genre (Biblical, Rabbinic), date (where known), image type (color or grayscale), IIIF manifest, and a download link for images that could not be included here for copyright reasons. The 132 images from the BnF are included here. For the 70 images from the BAV, the IIIF links are given in catalogue.txt.
The schema for the ALTO files is http://www.loc.gov/standards/alto/v4/alto-4-2.xsd.
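As an illustration of how the transcriptions could be read back out of an ALTO 4 file, the following minimal sketch uses only the Python standard library. It assumes the usual ALTO 4 layout, in which each TextLine element carries a BASELINE attribute and one or more String children with a CONTENT attribute; the embedded sample document, the function name read_alto_lines, and the choice to join multiple String elements with a space are all hypothetical and not taken from the dataset itself.

```python
import xml.etree.ElementTree as ET

# Namespace used by ALTO version 4 documents.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

# Hypothetical, hand-written sample; real files in the dataset may differ.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock ID="region_1">
          <TextLine ID="line_1" BASELINE="10 20 110 22">
            <Shape><Polygon POINTS="10 5 110 7 110 30 10 28"/></Shape>
            <String CONTENT="בראשית ברא"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def read_alto_lines(xml_text):
    """Return one (line_id, baseline, text) tuple per TextLine element."""
    root = ET.fromstring(xml_text)
    lines = []
    for line in root.iterfind(".//alto:TextLine", ALTO_NS):
        # Collect the transcription from every String child of this line.
        parts = [s.get("CONTENT", "") for s in line.iterfind("alto:String", ALTO_NS)]
        # Assumption: multiple String elements are joined with a single space.
        lines.append((line.get("ID"), line.get("BASELINE"), " ".join(parts)))
    return lines


if __name__ == "__main__":
    for line_id, baseline, text in read_alto_lines(SAMPLE):
        print(line_id, baseline, text)
```

To process a file from the archive, the same function can be given the file contents, for example read_alto_lines(open(path, encoding="utf-8").read()), where path is whichever ALTO file you have extracted.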
To create these files, we used the open-source kraken OCR/HTR engine (now at version 3.05) and the eScriptorium virtual research environment (https://escripta.hypotheses.org) created by the digital and computational task force of the research laboratory Archéologie & Philologie d'Orient et d'Occident (UMR 8546, CNRS - PSL (EPHE, ENS)).
This work was funded by the projects Scripta-PSL, Biblissima+ (Équipex ANR-11-EQPX-0007), Resilience (Horizon 2020 Grant Agreement 871127), and in particular Sofer Mahir.

Licence CC-BY-NC-SA 4.0: https://creativecommons.org/licenses/by-nc-sa/4.0/

 

Files

BiblIA_dataset.zip (546.2 MB)
md5:8cef38c1e501afd628d245a72f49bf05
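The MD5 checksum listed for the archive can be used to confirm that a download is complete. The sketch below is one possible way to do this with Python's hashlib; the function name md5_of_file and the assumption that the archive sits in the current directory under its published name are illustrative only.

```python
import hashlib
import os

# Checksum published on this page for BiblIA_dataset.zip.
EXPECTED_MD5 = "8cef38c1e501afd628d245a72f49bf05"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so the whole archive is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    archive = "BiblIA_dataset.zip"  # assumed local path of the download
    if os.path.exists(archive):
        actual = md5_of_file(archive)
        print("checksum OK" if actual == EXPECTED_MD5 else "checksum MISMATCH: " + actual)
    else:
        print("archive not found:", archive)
```

Reading in chunks keeps memory use low for an archive of this size.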

Additional details

Funding

RESILIENCE – REligious Studies Infrastructure: tooLs, Experts, conNections and CEnters 871127
European Commission