Published February 6, 2026 | Version v1
Dataset Open

Graphemic HTR ground truth dataset for an Early New High German transcription model (15th century)

  • 1. Berlin-Brandenburg Academy of Sciences and Humanities

Description

 
This repository contains training data for automatic text recognition (ATR) models (Kraken): 50 pages of ground truth as image files (JPG) and transcription files (PAGE XML).
 
The ground truth comprises 50 pages with 2,177 lines, 18,626 word tokens, and 113,491 characters.
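
The figures above could be recomputed from the PAGE XML files roughly as follows. This is a minimal sketch, not the authors' method: it assumes each line's transcription sits in the TextEquiv/Unicode child of a TextLine element, that word tokens are whitespace-separated, and that characters include spaces. The function names and the folder layout are hypothetical, so the totals may differ from the published ones if the counting conventions differ.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def local(tag):
    """Strip the XML namespace so any PAGE schema version matches."""
    return tag.rsplit("}", 1)[-1]


def line_texts(xml_source):
    """Yield the line-level transcription of every TextLine in one PAGE file."""
    root = ET.parse(xml_source).getroot()
    for elem in root.iter():
        if local(elem.tag) != "TextLine":
            continue
        # Only look at direct children, so word-level TextEquiv is ignored.
        for child in elem:
            if local(child.tag) == "TextEquiv":
                for uni in child:
                    if local(uni.tag) == "Unicode":
                        yield uni.text or ""
                break  # use the first line-level TextEquiv only


def corpus_stats(texts):
    """Count lines, whitespace-separated tokens, and characters (assumed conventions)."""
    lines = words = chars = 0
    for text in texts:
        lines += 1
        words += len(text.split())
        chars += len(text)
    return {"lines": lines, "words": words, "chars": chars}


def dataset_stats(folder):
    """Aggregate statistics over all *.xml files below a folder (hypothetical layout)."""
    def all_lines():
        for path in sorted(Path(folder).rglob("*.xml")):
            yield from line_texts(path)
    return corpus_stats(all_lines())
```

Calling dataset_stats on the folder where the archive was unpacked returns a dictionary with the three counts.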
 
Please refer to the README.md file for further information.

A ground truth dataset following a diplomatic transcription of the same data contained in this repository may be found here: Diplomatic HTR-Ground Truth dataset.

Authors

The data in this repository was prepared and curated by Adam Juszczak (ORCiD: 0009-0000-5330-6183) and Frederik Skidzun (ORCiD: 0009-0002-7712-4207) of the Regesta Imperii - Regesta of Emperor Frederik III.

License

This dataset is made available under the CC-BY 4.0 license.

Files

graphemic-gt.zip

Files (153.2 MB)

Checksum                              Size
md5:09010d99143fd59179980204f924cfa3  649.0 kB
md5:4de34aedf731af66788a7584d2491668  152.6 MB
md5:022586c6f84dd4805a5b99205fe933c3  19.1 kB
md5:6eec5fef0980fa681e31264d79fd187d  3.7 kB
md5:107a404a4aae75b0d77ef1acb4f68aea  4.7 kB
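
A downloaded file can be checked against the MD5 checksums listed above. The following is a minimal sketch using Python's standard hashlib module; the function names are illustrative, and which checksum belongs to which file is not stated in this listing, so any pairing of a file with a checksum is an assumption to confirm on the record page.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against an expected checksum, with or without the 'md5:' prefix."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of(path) == expected
```

For example, matches_checksum("graphemic-gt.zip", "md5:...") returns True only if the local file is byte-identical to the published one, where the checksum string is copied from the entry that corresponds to that file.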

Additional details

Related works

Is version of
Dataset: 10.5281/zenodo.18377766 (DOI)