Graphemic HTR ground truth dataset for an Early New High German transcription model (15th century)

Juszczak, Adam; Skidzun, Frederik

doi:10.5281/zenodo.18441031

Published February 6, 2026 | Version v1

Dataset Open

1. Berlin-Brandenburg Academy of Sciences and Humanities

This repository contains training data for ATR models (Kraken): 50 pages of ground truth as image files (JPG) and transcription files (PAGE XML).

The ground truth comprises 2,177 lines with 18,626 word tokens and 113,491 characters.
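
The line, token and character totals can be checked against the transcription files. The following is a minimal sketch, assuming each transcribed line is stored as a `TextLine` element with a direct `TextEquiv/Unicode` child, as is usual in PAGE XML. The function name `count_page_xml` is hypothetical, and the counting convention (whitespace-separated tokens, characters including spaces) is an assumption that may differ from the one used for the figures above.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace so the code works across PAGE schema versions."""
    return tag.rsplit("}", 1)[-1]


def count_page_xml(xml_data):
    """Return (lines, word_tokens, characters) for one PAGE XML document.

    Assumption: tokens are split on whitespace and characters include spaces.
    """
    root = ET.fromstring(xml_data)
    lines = words = chars = 0
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        text = ""
        # Only look at the TextEquiv that is a direct child of the line,
        # so word-level TextEquiv elements are not counted twice.
        for child in element:
            if local_name(child.tag) == "TextEquiv":
                for sub in child:
                    if local_name(sub.tag) == "Unicode":
                        text = sub.text or ""
        lines += 1
        words += len(text.split())
        chars += len(text)
    return lines, words, chars
```

To use it on the dataset, read each XML file from the unpacked graphemic-gt.zip as bytes (for example with `pathlib.Path.read_bytes`) and sum the returned tuples. The internal folder layout of the archive is not described here, so the file paths have to be adapted.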

Please refer to the README.md file for further information.

A ground truth dataset following a diplomatic transcription of the same data contained in this repository is available as the Diplomatic HTR-Ground Truth dataset.

Authors

The data in this repository was prepared and curated by Adam Juszczak (ORCiD: 0009-0000-5330-6183) and Frederik Skidzun (ORCiD: 0009-0002-7712-4207) of the Regesta Imperii - Regesta of Emperor Frederick III.

License

This dataset is made available under the CC-BY 4.0 license.

Files

Files (153.2 MB)

Name	MD5	Size
graphemic-gt.zip	09010d99143fd59179980204f924cfa3	649.0 kB
images.zip	4de34aedf731af66788a7584d2491668	152.6 MB
LICENSE.md	022586c6f84dd4805a5b99205fe933c3	19.1 kB
README.md	6eec5fef0980fa681e31264d79fd187d	3.7 kB
transcription-rules.md	107a404a4aae75b0d77ef1acb4f68aea	4.7 kB
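
The MD5 checksums listed for the files can be used to confirm that a download is complete and unmodified. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of` and `verify_downloads` are hypothetical, and the sketch assumes the files were saved under their original names in a single directory.

```python
import hashlib
from pathlib import Path

# Checksums as listed for this record.
EXPECTED_MD5 = {
    "graphemic-gt.zip": "09010d99143fd59179980204f924cfa3",
    "images.zip": "4de34aedf731af66788a7584d2491668",
    "LICENSE.md": "022586c6f84dd4805a5b99205fe933c3",
    "README.md": "6eec5fef0980fa681e31264d79fd187d",
    "transcription-rules.md": "107a404a4aae75b0d77ef1acb4f68aea",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results
```

Calling `verify_downloads("downloads")` returns a dictionary such as `{"images.zip": True, ...}`. A value of `False` means the file is missing from that directory or its checksum differs from the listed one.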

Additional details

Is version of: Dataset: 10.5281/zenodo.18377766 (DOI)

	All versions	This version
Views	30	30
Downloads	2	2
Data volume	23.8 kB	23.8 kB
