Published November 27, 2025 | Version v1
Dataset Open

MiDRASH Automatic Transcriptions of the Cairo Geniza Fragments

  • 1. École Pratique des Hautes Études
  • 2. Université Paris Sciences et Lettres
  • 3. Archéologie et Philologie d'Orient et d'Occident
  • 4. National Institute for Research in Computer and Control Sciences
  • 5. ALMANACH: Modélisation et analyse linguistique automatique et humanités computationnelles
  • 6. University of Maryland
  • 7. Princeton University
  • 8. Tel Aviv University
  • 9. Bar-Ilan University
  • 10. University of Haifa
  • 11. National Library of Israel
  • 12. University of Oxford
  • 13. École Pratique des Hautes Études Section des Sciences historiques et philologiques
  • 14. Savoirs et Pratiques du Moyen Âge au XIXe siècle

Description

This is the first automatic transcription of the entire collection of digital images of the Geniza at the National Library of Israel as of this date. It was created using kraken version 5.3.1.dev56. 

To find a fragment, enter its 99 ID number into KTIV.

We are aware that this is a very preliminary and imperfect result, which we are releasing now because of its high value for scholarship even in its current form. We are aware of the following shortcomings: There are segmentation and text recognition mistakes. Some texts have the wrong reading order, with the left region preceding the right. Vertical text has mostly been ignored. For many images with 3 or 4 parallel text regions, only the outer regions were captured. Arabic-script recognition is less accurate than Hebrew-script recognition.

The process comprised three steps:

a) An image classifier to choose the best layout segmentation and recognition models (see https://edizionicafoscari.it//it/edizioni/riviste/magazen/2024/2/netlay-layout-classification-dataset-for-enhancing/#d670e63)

b) Region and line segmentation with kraken

c) Text recognition with kraken
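
The three steps above can be pictured as a dispatch: the classifier's label selects a pair of models, which are then applied in sequence. The following is a minimal sketch under stated assumptions, not the project's actual code. The names `ModelPair`, `transcribe`, the layout label `single_column`, and the toy stand-in functions are all hypothetical, and no kraken calls are made.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# All names below are hypothetical placeholders, not the project's code.
Image = bytes   # stand-in for a decoded page image
Line = bytes    # stand-in for a cropped text-line image


@dataclass(frozen=True)
class ModelPair:
    """Segmentation and recognition models chosen for one layout class."""
    segment: Callable[[Image], List[Line]]   # step (b): regions and lines
    recognize: Callable[[Line], str]         # step (c): text recognition


def transcribe(image: Image,
               classify: Callable[[Image], str],
               models: Dict[str, ModelPair]) -> List[str]:
    """Run steps (a) to (c) on one image; return one string per line."""
    layout = classify(image)    # step (a): predict the layout class
    pair = models[layout]       # pick the model pair for that class
    lines = pair.segment(image)
    return [pair.recognize(line) for line in lines]


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without kraken or trained models.
    toy_models = {
        "single_column": ModelPair(
            segment=lambda img: img.split(b"\n"),
            recognize=lambda line: line.decode("utf-8"),
        ),
    }
    result = transcribe("שלום\nעולם".encode("utf-8"),
                        classify=lambda img: "single_column",
                        models=toy_models)
    print(result)
```

In a real run, the `segment` and `recognize` callables would presumably wrap kraken's segmentation and recognition models. The point of the sketch is only that the layout label decides which pair is used.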

 

Funded by the European Union (ERC, MiDRASH, Project No. 101071829).
Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.

Files

MiDRASH_Geniza_Transcriptions_0.8.txt.zip (444.6 MB)
md5:30fd630604a8d6654d4feedcb11a9192
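
Because the archive is large, it may be worth checking the download against the MD5 checksum listed above before unpacking it. The following is a minimal sketch using only the Python standard library. The local path and the helper names `md5_of_file` and `verify_download` are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# Checksum as listed on this record.
EXPECTED_MD5 = "30fd630604a8d6654d4feedcb11a9192"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed location: the archive saved in the current working directory.
    archive = Path("MiDRASH_Geniza_Transcriptions_0.8.txt.zip")
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum mismatch")
    else:
        print(f"{archive} not found")
```

A mismatch usually points to an incomplete or corrupted download, in which case downloading the file again is the simplest remedy.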