
Published October 5, 2023 | Version v2
Dataset Open

Towards a general open dataset and model for late medieval Castilian text recognition (HTR/OCR). Datasets and scripts

  • 1. École Normale Supérieure de Lyon

Description

This repository contains the dataset for the article "Towards a general open dataset and models for late medieval Castilian writing (HTR/OCR)", submitted to the Journal of Data Mining and Digital Humanities (JDMDH). See the paper (https://doi.org/10.5281/zenodo.7387376) for a description of the corpus and the models.

This is version V2 of the dataset: it contains both the allographetic and the graphematic transcriptions (the latter in the `*.normalized.xml` files), together with the models.

Caveat: only the allographetic transcriptions and models are described in the data paper mentioned above. The graphematic transcriptions are produced using a Chocomuffin conversion table (see `corpus/conversion_table.csv`) that reduces each allograph to its corresponding grapheme. Abbreviations are not expanded.
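The table-driven reduction described above can be illustrated with a minimal Python sketch. It assumes a CSV with two columns, hypothetically named `char` and `replacement`; the real header names and matching rules in `corpus/conversion_table.csv` may differ, and the sample rows below are invented for the demonstration only.

```python
import csv
import io
import re

# Hypothetical sample table: the real corpus/conversion_table.csv may use
# other column names and holds the project's own mappings.
SAMPLE_TABLE = """char,replacement
ſ,s
ꝛ,r
"""


def load_table(csv_text):
    """Read allograph -> grapheme mappings from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["char"]: row["replacement"] for row in reader}


def to_graphematic(line, table):
    """Replace every allograph found in `line` in a single pass."""
    if not table:
        return line
    # Longest keys first, so multi-character allographs win over their prefixes.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: table[match.group(0)], line)


table = load_table(SAMPLE_TABLE)
result = to_graphematic("ſeñoꝛ", table)
```

Substituting in a single regular-expression pass means a replacement is never itself rewritten by a later rule. Characters absent from the table, including abbreviation marks, are left untouched.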

Please cite the following paper if you use this dataset or the models:

@article{gille_levenson_2023_towards,
 author = {Gille Levenson, Matthias},
 date = {2023},
 journaltitle = {Journal of Data Mining and Digital Humanities},
 doi = {10.46298/jdmdh.10416},
 editor = {Pinche, Ariane and Stokes, Peter},
 issuetitle = {Special Issue: Historical documents and automatic text recognition},
 title = {Towards a general open dataset and models for late medieval Castilian text recognition (HTR/OCR)},
}

GILLE LEVENSON, Matthias, « Towards a general open dataset and models for late medieval Castilian text recognition (HTR/OCR) », Journal of Data Mining and Digital Humanities (2023): Special Issue: Historical documents and automatic text recognition, eds. Ariane PINCHE and Peter STOKES, DOI: 10.46298/jdmdh.10416.

The image of the manuscript M (Esc_M) has not yet been uploaded, pending permission from the library that keeps the manuscript.

All images are stored in a directory named after the place where the manuscript is held and, for the in-domain dataset, the siglum of the witness.
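To get an overview of the archive once it is unpacked, the files can be grouped by directory. The sketch below is only an illustration: it assumes that the XML transcriptions sit in the same per-manuscript directories as the images, which this description does not state, and the `root` argument is a hypothetical path to the extracted data.

```python
from collections import defaultdict
from pathlib import Path


def group_transcriptions(root):
    """Map each directory name to its allographetic and graphematic XML files."""
    groups = defaultdict(lambda: {"allographetic": [], "graphematic": []})
    for path in sorted(Path(root).rglob("*.xml")):
        # Files ending in .normalized.xml hold the graphematic transcription.
        if path.name.endswith(".normalized.xml"):
            kind = "graphematic"
        else:
            kind = "allographetic"
        groups[path.parent.name][kind].append(path.name)
    return dict(groups)
```

Calling `group_transcriptions` on the extracted archive returns one entry per directory, which makes it easy to spot a page that lacks its normalized counterpart.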

 

The global licence for the dataset (except for images) is CC-BY-NC-SA.

All manuscript reproductions are published with the authorization of the libraries.

©Biblioteca General Histórica de Salamanca

Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2709 (L)

Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2097 (J)

Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2673

Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2011

Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2654

Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2086

©Museo Lázaro Galdiano. Madrid

Inv. 15304, Fundación Lázaro Galdiano (A)

©Universidad de Valladolid

Ms. 251, Biblioteca Santa Cruz (S)

©Real Biblioteca del Escorial

Ms. K.I.5, Biblioteca del Real Monasterio del Escorial (Q)

Ms. h.I.8, Biblioteca del Real Monasterio del Escorial (M): to be published

Ms. Z-I-12

Ms. Z-III-9

Ms. X-III-4

Ms. h-III-9

Ms. b-IV-15

Ms. b-II-11

Ms. a-II-17

Ms. T-III-5

©Rosenbach Foundation

Ms. 482/2 (U)

© Gallica.bnf.fr

Espagnol 12

Espagnol 36

Espagnol 218

© Bodleian Library

Ms. Span. d. 1

Ms. Span. d. 2/1

© Biblioteca Real, Madrid

Ms. II/215 (G)

© Biblioteca Nacional de España

Mss/4183

Inc/901 (Z)

© Biblioteca Universitaria, Sevilla

Ms. 332/131 (R)

 

Edit: add result files

Notes

V2 version: contains the original data described in the paper, plus the normalized transcriptions and models.

Files

data_v2.zip (2.1 GB)
md5:95fd13aa0a1dfb1d7fe2cf19b7cb801d
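Given the size of the archive, it may be worth checking the download against the MD5 value listed above. The following is a minimal sketch using Python's standard `hashlib`; the function names and the local file path are assumptions made for the example.

```python
import hashlib

# MD5 value listed on this page for data_v2.zip.
EXPECTED_MD5 = "95fd13aa0a1dfb1d7fe2cf19b7cb801d"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path):
    """Return True if the file at `path` has the MD5 listed for data_v2.zip."""
    return md5_of_file(path) == EXPECTED_MD5
```

For instance, `matches_published_checksum("data_v2.zip")` should return `True` for an intact download saved under that name in the current directory. Reading in chunks keeps memory use low for a 2.1 GB file.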