Towards a general open dataset and model for late medieval Castilian text recognition (HTR/OCR). Datasets and scripts

Matthias Gille Levenson

doi:10.5281/zenodo.8406222

Published October 5, 2023 | Version v2

Dataset Open

Matthias Gille Levenson¹

1. École Normale Supérieure de Lyon

This repository contains the dataset of the article "Towards a general open dataset and models for late medieval Castilian writing (HTR/OCR)" submitted to the Journal of Data Mining and Digital Humanities (JDMDH). Please refer to the paper (https://doi.org/10.5281/zenodo.7387376) for a description of the corpus and the models.

This is version 2 (V2) of the dataset: it contains both the allographetic and the graphematic transcriptions (the latter in the `*.normalized.xml` files), together with the models.

Caveat: only the allographetic transcriptions and models are described in the data paper mentioned above. The graphematic transcriptions are produced using a Chocomuffin conversion table (see `corpus/conversion_table.csv`) that reduces each allograph to its corresponding grapheme. Abbreviations are not expanded.
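The reduction from allographs to graphemes can be illustrated with a minimal sketch. It is not the script used to build the dataset: the column names `char` and `replacement`, the helper names, and the toy table are assumptions made for illustration, and the header of `corpus/conversion_table.csv` should be checked before reusing it.

```python
import csv
import io


def load_conversion_table(handle, source_col="char", target_col="replacement"):
    """Read a CSV that maps each allograph to its grapheme.

    The column names are assumptions; adapt them to the real header.
    """
    reader = csv.DictReader(handle)
    return {row[source_col]: row[target_col] for row in reader if row[source_col]}


def to_graphematic(text, table):
    """Replace every allograph found in `text` by its grapheme.

    Longer keys are applied first so that multi-character entries
    are not broken up by single-character ones.
    """
    for allograph in sorted(table, key=len, reverse=True):
        text = text.replace(allograph, table[allograph])
    return text


# Toy table for illustration only (not taken from the dataset).
sample = io.StringIO("char,replacement\nſ,s\nꝛ,r\n")
table = load_conversion_table(sample)
print(to_graphematic("ſaboꝛ", table))  # prints "sabor"
```

To work on the real table, pass an open file handle instead of the in-memory sample, for example `open("corpus/conversion_table.csv", encoding="utf-8", newline="")`. Abbreviation marks are left untouched unless the table maps them explicitly, which matches the statement that abbreviations are not expanded.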

Please cite the following paper if you use this dataset or the models:

@article{gille_levenson_2023_towards,
author = {Gille Levenson, Matthias},
date = {2023},
journaltitle = {Journal of Data Mining and Digital Humanities},
doi = {10.46298/jdmdh.10416},
editor = {Pinche, Ariane and Stokes, Peter},
issuetitle = {Special Issue: Historical documents and automatic text recognition},
title = {Towards a general open dataset and models for late medieval Castilian text recognition (HTR/OCR)},
}

GILLE LEVENSON, Matthias, « Towards a general open dataset and models for late medieval Castilian text recognition (HTR/OCR) », Journal of Data Mining and Digital Humanities (2023): Special Issue: Historical documents and automatic text recognition, eds. Ariane PINCHE and Peter STOKES, DOI: 10.46298/jdmdh.10416.

The images of manuscript M (Esc_M) have not yet been uploaded, pending permission from the library that holds the manuscript.

All images are stored in a directory named after the place where the manuscript is held and, for the in-domain dataset, the siglum of the witness.

The global licence for the dataset (except for images) is CC-BY-NC-SA.

All manuscript reproductions are published with the authorization of the libraries.

©Biblioteca General Histórica de Salamanca

Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2709 (L)

Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2097 (J)

Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2673

Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2011

Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2654

Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2086

©Museo Lázaro Galdiano. Madrid

Inv. 15304, Fundación Lázaro Galdiano (A)

©Universidad de Valladolid

Ms. 251, Biblioteca Santa Cruz (S)

©Real Biblioteca del Escorial

Ms. K.I.5, Biblioteca del Real Monasterio del Escorial (Q)

Ms. h.I.8, Biblioteca del Real Monasterio del Escorial (M): to be published

Ms. Z-I-12

Ms. Z-III-9

Ms. X-III-4

Ms. h-III-9

Ms. b-IV-15

Ms. b-II-11

Ms. a-II-17

Ms. T-III-5

©Rosenbach Foundation

Ms. 482/2 (U)

© Gallica.bnf.fr

Espagnol 12

Espagnol 36

Espagnol 218

Ms. Span. d. 1

Ms. Span. d. 2/1

Ms. II/215 (G)

Mss/4183

Inc/901 (Z)

Ms. 332/131 (R)

Edit: add result files

Notes

V2 version: contains the original data described in the paper, plus the normalized transcriptions and models.

Files

Name	Size
data_v2.zip md5:95fd13aa0a1dfb1d7fe2cf19b7cb801d	2.1 GB
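
The MD5 checksum listed above can be used to check that the 2.1 GB archive was downloaded without corruption. The following is a minimal sketch using only the Python standard library; the function names and the local path `data_v2.zip` are assumptions chosen for illustration.

```python
import hashlib

# Checksum published on the record page for data_v2.zip.
EXPECTED_MD5 = "95fd13aa0a1dfb1d7fe2cf19b7cb801d"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that a large archive is never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

After downloading, `is_intact("data_v2.zip")` should return `True`; a `False` result suggests an incomplete or corrupted download.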

