MiDRASH Automatic Transcriptions of the Cairo Geniza Fragments
Creators
- Stoekl Ben Ezra, Daniel (Project leader) 1, 2, 3
- Bambaci, Luigi (Project member) 1, 2, 3
- Kiessling, Benjamin 1, 2, 3, 4, 5
- Lapin, Hayim (Project member) 6
- Ezer, Nurit (Annotator)
- Lolli, Elena (Annotator) 2, 3, 1
- Rustow, Marina (Project leader) 7
- Dershowitz, Nachum (Project leader) 8
- Kurar Barakat, Berat (Project member) 8
- Gogawale, Sharva (Project member) 8
- Shmidman, Avi (Project leader) 9
- Lavee, Moshe (Project member) 10
- Siew, Tsafra (Project member) 11
- Raziel Kretzmer, Vered (Project member) 9
- Vasyutinsky Shapira, Daria (Project member) 8
- Olszowy-Schlanger, Judith (Project leader) 12, 13, 14
- Gila, Yitzchak (Project member) 11
- 1. École Pratique des Hautes Études
- 2. Université Paris Sciences et Lettres
- 3. Archéologie et Philologie d'Orient et d'Occident
- 4. National Institute for Research in Computer and Control Sciences
- 5. ALMANACH: Modélisation et analyse linguistique automatique et humanités computationnelles
- 6. University of Maryland
- 7. Princeton University
- 8. Tel Aviv University
- 9. Bar-Ilan University
- 10. University of Haifa
- 11. National Library of Israel
- 12. University of Oxford
- 13. École Pratique des Hautes Études Section des Sciences historiques et philologiques
- 14. Savoirs et Pratiques du Moyen Âge au XIXe siècle
Description
This is the first automatic transcription of the entire collection of digital images of the Geniza available at the National Library of Israel as of the release date. It was created using kraken version 5.3.1.dev56.
To find a fragment, enter its 99 ID number into KTIV.
We are aware that this is a very preliminary and imperfect result, which we are releasing now because of its high value for scholarship even in its current form. We are aware of the following shortcomings: There are segmentation and text recognition mistakes. Some texts have the wrong reading order, with the left region preceding the right one. Vertical text has mostly been ignored. For many images with 3 or 4 parallel text regions, only the outer regions were captured. Arabic script recognition is less accurate than Hebrew script recognition.
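The reading-order problem can be illustrated with a small sketch. Hebrew and Arabic scripts run right to left, so for side-by-side columns the right region should normally be read first. The `Region` class and the sorting heuristic below are assumptions made for illustration only; they are not part of this release or of kraken, and they only cover simple layouts with columns placed next to each other.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    """Hypothetical text region: a label plus a bounding box in pixels."""
    name: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def rtl_reading_order(regions: List[Region]) -> List[Region]:
    """Order regions right to left; ties are broken top to bottom.

    Sorting on the negated right edge puts the rightmost region first,
    which matches the expected order for right-to-left scripts.
    """
    return sorted(regions, key=lambda r: (-r.x_max, r.y_min))


# Two side-by-side columns, listed in the (wrong) left-first order.
regions = [
    Region("left column", 50, 100, 450, 900),
    Region("right column", 500, 100, 900, 900),
]
ordered = [r.name for r in rtl_reading_order(regions)]
print(ordered)
```

A user who needs correct column order for a particular fragment would have to re-segment the image, since the released text files may not carry the region coordinates this sketch assumes.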
The pipeline comprised three steps:
a) An image classifier to choose the best layout segmentation and recognition models (see https://edizionicafoscari.it//it/edizioni/riviste/magazen/2024/2/netlay-layout-classification-dataset-for-enhancing/#d670e63).
b) Region and line segmentation with kraken.
c) Text recognition with kraken.
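The three steps above can be sketched as a simple dispatch: the classifier's label selects a pair of models, which are then used for segmentation and recognition. This is a minimal sketch under stated assumptions, not the project's actual code. The layout class names, the model file names, and the three callables (`classify`, `segment`, `recognize`) are all hypothetical; in a real pipeline the latter two would wrap kraken, which is not called here.

```python
from typing import Callable, Dict, List, NamedTuple


class ModelPair(NamedTuple):
    """Segmentation and recognition model chosen for one layout class."""
    segmentation: str
    recognition: str


# Hypothetical layout classes and model file names (not from the release).
MODELS: Dict[str, ModelPair] = {
    "single_column": ModelPair("seg_single.mlmodel", "rec_hebrew.mlmodel"),
    "multi_column": ModelPair("seg_multi.mlmodel", "rec_hebrew.mlmodel"),
}
DEFAULT_CLASS = "single_column"


def transcribe(
    image_path: str,
    classify: Callable[[str], str],
    segment: Callable[[str, str], List[str]],
    recognize: Callable[[str, str], str],
) -> List[str]:
    """Run the three steps: classify layout, segment lines, recognize text."""
    layout = classify(image_path)                        # step a)
    models = MODELS.get(layout, MODELS[DEFAULT_CLASS])   # fall back if unknown
    lines = segment(image_path, models.segmentation)     # step b)
    return [recognize(line, models.recognition) for line in lines]  # step c)


# Stub callables standing in for the real classifier and kraken wrappers.
def fake_classify(image_path: str) -> str:
    return "multi_column"


def fake_segment(image_path: str, model: str) -> List[str]:
    return [f"{image_path}#line{i}" for i in range(2)]


def fake_recognize(line: str, model: str) -> str:
    return f"text of {line} via {model}"


result = transcribe("fragment.jpg", fake_classify, fake_segment, fake_recognize)
print(result)
```

Passing the steps in as callables keeps the model-selection logic testable without loading any model.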
Funded by the European Union (ERC, MiDRASH, Project No. 101071829).
Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.
Files
| Name | Size | Checksum |
|---|---|---|
| MiDRASH_Geniza_Transcriptions_0.8.txt.zip | 444.6 MB | md5:30fd630604a8d6654d4feedcb11a9192 |
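Because the archive is large, it can be worth checking the download against the published MD5 checksum. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the local file path is an assumption (it presumes the archive was saved under its original name in the current directory).

```python
import hashlib
import os

# Checksum published on this record for the archive.
EXPECTED_MD5 = "30fd630604a8d6654d4feedcb11a9192"


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so the whole archive is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the archive was saved.
    archive = "MiDRASH_Geniza_Transcriptions_0.8.txt.zip"
    if os.path.exists(archive):
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
```

MD5 is sufficient here for detecting a corrupted or truncated download; it is not a safeguard against deliberate tampering.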