Published August 29, 2025 | Version v2
Dataset Open

Corpus Litterarum

  • 1. University of Zurich

Description

Corpus Litterarum is a line-based annotated dataset of Latin manuscript characters sampled from the Codices Sangallenses CSG 11 and CSG 70, provided by e-codices. Each line image has been annotated at the character level (73 classes) using Roboflow, with a semi-automatic workflow that combines manual annotation and model-assisted labelling. The dataset contains 2,152 line images and 44,407 annotations, distributed across predefined train/validation/test splits. Characters include standard Latin letters, abbreviations, and scribal signs, with suspensions left unresolved. The dataset supports research in palaeography, handwritten text recognition, and character segmentation.
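The figures above work out to roughly 20.6 character annotations per line image on average (44,407 / 2,152). To inspect how the annotations are distributed across the 73 classes, a tally per class is a natural first step. The following is a minimal sketch only: it assumes the annotations are available as a COCO-style JSON structure (one of several export formats Roboflow offers), which this record does not confirm. The sample data and class names are hypothetical and do not come from the dataset.

```python
from collections import Counter


def count_annotations_per_class(coco: dict) -> Counter:
    """Tally annotations per category name in a COCO-style dictionary.

    Assumes the usual COCO keys: "categories" (with "id" and "name")
    and "annotations" (with "category_id"). This layout is an assumption,
    not something stated by the dataset record.
    """
    id_to_name = {cat["id"]: cat["name"] for cat in coco.get("categories", [])}
    return Counter(
        id_to_name.get(ann["category_id"], "unknown")
        for ann in coco.get("annotations", [])
    )


# Hypothetical miniature example, not taken from Corpus Litterarum.
sample = {
    "categories": [{"id": 0, "name": "a"}, {"id": 1, "name": "e"}],
    "annotations": [
        {"image_id": 1, "category_id": 0, "bbox": [2, 3, 10, 12]},
        {"image_id": 1, "category_id": 1, "bbox": [14, 3, 9, 12]},
        {"image_id": 2, "category_id": 0, "bbox": [5, 4, 10, 11]},
    ],
}

counts = count_annotations_per_class(sample)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```

With a real export, the dictionary would be loaded with `json.load` from the annotation file of each split, and the per-split tallies compared to check class balance.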

Files

README.md

Files (109.7 MB)

Checksum                                Size
md5:73ff3a388fbe1adc6b38a97f892caa39    711 Bytes
md5:215d4c156f530665d75c0beeeafe67d4    3.8 kB
md5:129ecf4cc73086065635b6cca6fa57fd    901 Bytes
md5:1a998247f849403110e102b9c21f062f    10.8 MB
md5:d4aad61a3dc095d539172f5b1d61b309    192.5 kB
md5:16ffb2b62644c52f190894d6c83f3020    78.3 MB
md5:63325d3f2801c88771e75c5c9347e135    1.4 MB
md5:46534797d4877322465a574d89f58981    18.7 MB
md5:79e9b789128743d0cc816014adf7618c    328.9 kB
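
The MD5 checksums listed above can be used to confirm that a downloaded file arrived intact. The sketch below uses only Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_record` are illustrative, and because the listing does not show file names, the path in the usage note is a placeholder.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected: str) -> bool:
    """Compare a file's MD5 with a checksum such as 'md5:73ff3a38...'.

    The optional 'md5:' prefix used in the file listing is stripped
    before the comparison.
    """
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_record("downloaded_file", "md5:73ff3a388fbe1adc6b38a97f892caa39")` would return `True` only if the local file (placeholder name) is byte-identical to the 711-byte file in the record.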

Additional details

Additional titles

Subtitle
A Ground Truth for 8th Century Character Recognition

Software

Development Status
Active