Published January 5, 2026 | Version v1
Dataset Open

HTR Winter School 2025 - Syriac, MS Jerusalem, Saint Mark's Monastery 36

Description

Ground truth of 133 bifolio images of MS Jerusalem, Saint Mark's Monastery 36. This ground truth was produced by participants of the Vienna 2025 HTR Winter School, who used Transkribus to manually correct a preliminary automatic transcription that had been generated using a Kraken model (doi.org/10.5281/zenodo.17406773).

Description

  • Jerusalem, Saint Mark's Monastery, MS 36
  • Syriac, primarily Estrangelo but with Serto and Eastern features
  • Codex, approximately 12th to 14th century
  • Scribe uncertain, perhaps the otherwise unknown Elias or Giwargis

Origin of the data

We thank the St Mark's Syrian Orthodox Monastery - Jerusalem for providing the digital images of MS 36 and for allowing us to use and share these images to support research on Syriac handwritten text recognition.

An online digitization of the manuscript may also be viewed in the virtual reading room of the Hill Museum & Manuscript Library under the shelfmark SMMJ 00036.

Segmentation and Transcription guidelines

The segmentation of the folios followed the SegmOnto vocabulary for annotation of regions:

  • MainZone: the main column of text.
  • MarginTextZone: any marginal words or phrases, including catchwords. Also used for interlinear glosses.
  • NumberingZone: any page or folio numbers.
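As a rough illustration of how these zone labels might be tallied, the sketch below counts region types in a PAGE XML string. It assumes the zone name is stored in the `custom` attribute of each `TextRegion` as `structure {type:...;}`, which is how Transkribus commonly serialises structure types; the files in page.zip may encode the zones differently. The sample XML and the function name `count_zones` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, hand-written PAGE XML fragment; real files in page.zip may differ.
SAMPLE_PAGE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1" custom="readingOrder {index:0;} structure {type:MainZone;}"/>
    <TextRegion id="r2" custom="readingOrder {index:1;} structure {type:MarginTextZone;}"/>
    <TextRegion id="r3" custom="structure {type:NumberingZone;}"/>
    <TextRegion id="r4" custom="readingOrder {index:2;}"/>
  </Page>
</PcGts>"""

# Assumed serialisation of the region type inside the `custom` attribute.
STRUCTURE_TYPE = re.compile(r"structure\s*\{[^}]*?type:([^;}]+)")


def count_zones(page_xml):
    """Count TextRegion elements per structure type in a PAGE XML string."""
    root = ET.fromstring(page_xml)
    counts = Counter()
    for elem in root.iter():
        # Compare the local tag name so the PAGE schema version does not matter.
        if elem.tag.rsplit("}", 1)[-1] != "TextRegion":
            continue
        match = STRUCTURE_TYPE.search(elem.get("custom", ""))
        counts[match.group(1).strip() if match else "(untyped)"] += 1
    return counts


if __name__ == "__main__":
    for zone, n in sorted(count_zones(SAMPLE_PAGE).items()):
        print(zone, n)
```

Regions without a recognisable type are counted under `(untyped)` so that unlabelled zones are easy to spot.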

The transcription guidelines cover spaces, the Syriac letters, a limited set of diacritics, and punctuation. Vowel dots and other vowel markings were not transcribed.

  • Allowed diacritics:
    • Syome
    • Dots over feminine suffix heh
    • Dots in pronouns: above for demonstrative, below for personal
    • Dots in verbs: to distinguish participles and perfects
    • Dots to distinguish homographs
  • Excluded diacritics:
    • Vowel dots
    • Dots of hardening and softening (qushoyo and rukokho)
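One way to check a transcription line against the excluded categories is sketched below. It assumes that excluded marks, if they slipped in, would be encoded with the dedicated Syriac-block code points (U+0730 to U+073F for vowel marks, U+0741 and U+0742 for qushshaya and rukkakha). The dataset does not document its encoding of dots, so this is only an approximation, and generic combining dots cannot be classified this way. The function name `find_excluded` is hypothetical.

```python
# Assumption: excluded marks would appear as these Syriac-block code points.
VOWEL_MARKS = {chr(cp) for cp in range(0x0730, 0x0740)}  # U+0730..U+073F
HARDENING_SOFTENING = {"\u0741", "\u0742"}  # qushshaya, rukkakha
EXCLUDED = VOWEL_MARKS | HARDENING_SOFTENING


def find_excluded(line):
    """Return (index, code point) pairs for excluded marks found in a line."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(line)
        if ch in EXCLUDED
    ]


if __name__ == "__main__":
    clean = "\u0721\u0720\u071F\u0710"          # four Syriac letters, no marks
    with_vowel = "\u0721\u0730\u0720"           # a letter followed by U+0730
    print(find_excluded(clean))
    print(find_excluded(with_vowel))
```

Allowed diacritics such as syome are deliberately not listed in `EXCLUDED`, so they pass through unflagged.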

Punctuation marks were not normalized, but rather transcribed as they appear in the manuscript (. ܆ ܇ : ܀).
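Because the marks are kept as written, it can be useful to see how often each one occurs before training or evaluating a model. The sketch below counts the five marks listed above in a sequence of transcription lines. It assumes the Syriac marks are encoded as U+0706, U+0707 and U+0700; the function name `punctuation_counts` is hypothetical.

```python
from collections import Counter

# The five marks from the description: full stop, U+0706, U+0707, colon, U+0700.
PUNCTUATION = {".", "\u0706", "\u0707", ":", "\u0700"}


def punctuation_counts(lines):
    """Count occurrences of each listed punctuation mark across lines."""
    return Counter(ch for line in lines for ch in line if ch in PUNCTUATION)


if __name__ == "__main__":
    sample = ["\u0721\u0720\u071F\u0710.", "\u0710\u0712\u0706 \u0710\u0712\u0700"]
    for mark, n in punctuation_counts(sample).items():
        print(f"U+{ord(mark):04X}", n)
```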

Transkribus's unclear tag was used when readings were uncertain or the text was damaged or unclear. There is additionally some use of the sic and variant tags in the corpus, but these were not applied consistently.
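To locate the tagged passages, the following sketch extracts tag spans from the `custom` attribute that Transkribus typically writes on each `TextLine` in its PAGE XML export, in the form `unclear {offset:4; length:3;}`. This serialisation is an assumption about the files in page.zip, and the helper name `tagged_spans` is hypothetical.

```python
import re

# Assumed shape of one entry in the `custom` attribute: name {key:value; ...}
TAG = re.compile(r"(\w+)\s*\{([^}]*)\}")


def tagged_spans(custom, text, tag="unclear"):
    """Return (offset, length, substring) for each occurrence of `tag`."""
    spans = []
    for name, body in TAG.findall(custom):
        if name != tag:
            continue
        props = {}
        for part in body.split(";"):
            if ":" in part:
                key, value = part.split(":", 1)
                props[key.strip()] = value.strip()
        # Skip entries that lack the positional properties we rely on.
        if "offset" not in props or "length" not in props:
            continue
        offset, length = int(props["offset"]), int(props["length"])
        spans.append((offset, length, text[offset:offset + length]))
    return spans


if __name__ == "__main__":
    custom = "readingOrder {index:0;} unclear {offset:4; length:3;}"
    print(tagged_spans(custom, "abc defg hi"))
```

Passing `tag="sic"` or `tag="variant"` would list those spans as well, bearing in mind that they were not applied consistently.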

Data organisation

  • CITATION.cff
  • htr-united.yml
  • alto.zip: the ground truth in ALTO XML format, exported from eScriptorium
  • page.zip: the ground truth in PAGE XML format, exported from Transkribus
  • images.zip: the corresponding image files
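A minimal sketch for checking that every transcription file has a matching image is given below. It assumes that XML files and images share a base file name, which is not stated in the description; the function `pair_by_stem` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def pair_by_stem(xml_names, image_names):
    """Pair XML members with image members that share a base file name.

    Returns (pairs, missing): a dict mapping XML name to image name, and a
    list of XML names for which no image was found.
    """
    images = {
        PurePosixPath(name).stem: name
        for name in image_names
        if PurePosixPath(name).suffix.lower() in IMAGE_SUFFIXES
    }
    pairs, missing = {}, []
    for name in xml_names:
        path = PurePosixPath(name)
        if path.suffix.lower() != ".xml":
            continue  # ignore metadata or other non-XML members
        if path.stem in images:
            pairs[name] = images[path.stem]
        else:
            missing.append(name)
    return pairs, missing


if __name__ == "__main__":
    pairs, missing = pair_by_stem(
        ["page/0001.xml", "page/0002.xml"],
        ["images/0001.jpg"],
    )
    print(pairs, missing)
```

The member names can be obtained from the downloaded archives with `zipfile.ZipFile("page.zip").namelist()` and the equivalent call on images.zip.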

Copyright and licence

This dataset was created as part of the Winter School of Handwritten Text Recognition of Medieval Manuscripts 2025, Vienna, at the Österreichische Akademie der Wissenschaften, Institut für Mittelalterforschung. All transcriptions are licensed under the Creative Commons 4 licence. Images were provided by the St Mark's Syrian Orthodox Monastery - Jerusalem and are licensed under the Creative Commons 4 licence.

Files (975.3 MB)

  • md5:1d0e3430e8fd96f921c603e1d728c10f, 2.6 MB
  • md5:e2964aeb9cc87a771022fd97f545b1b6, 1.7 kB
  • md5:496b874120e4ecf01ec7a6ae528eed2e, 3.1 kB
  • md5:eee0c6c1ba6d8a887c2ee22eac17f5a9, 970.2 MB
  • md5:b88c33b0dd0117db0f88b24e5e26cb6e, 2.5 MB
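The MD5 checksums can be used to verify a download. The sketch below hashes a local file in chunks, so that the large image archive does not need to fit in memory, and compares the result with an expected value. The listing here does not say which file each checksum belongs to, so the pairing has to be taken from the repository page; the function names `md5_of` and `matches` are hypothetical.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected):
    """Compare a file's MD5 with an expected value such as 'md5:1d0e...'."""
    return md5_of(path) == expected.removeprefix("md5:").lower()


if __name__ == "__main__":
    import sys

    # Usage: python verify.py <downloaded file> <md5:... from the listing>
    if len(sys.argv) == 3:
        print("OK" if matches(sys.argv[1], sys.argv[2]) else "MISMATCH")
```

`str.removeprefix` requires Python 3.9 or later.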