Towards a general open dataset and model for late medieval Castilian text recognition (HTR/OCR). Datasets and scripts
Description
This repository contains the dataset of the article "Towards a general open dataset and models for late medieval Castilian writing (HTR/OCR)" submitted to the Journal of Data Mining and Digital Humanities (JDMDH). See the paper (https://doi.org/10.5281/zenodo.7387376) for a description of the corpus and the models.
The dataset is in version V2: it contains the allographetic AND graphematic transcriptions (files `*.normalized.xml`) and models.
Caveat: only the allographetic transcriptions and models are described in the data paper mentioned above. The graphematic transcriptions are produced using a Chocomuffin conversion table (see `corpus/conversion_table.csv`) that reduces each allograph to its corresponding grapheme. Abbreviations are not expanded.
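As an illustration of the allograph-to-grapheme reduction described above, here is a minimal Python sketch. It is not the Chocomuffin implementation. It assumes a hypothetical two-column CSV layout (allograph, grapheme) without a header row, which may not match the actual layout of `corpus/conversion_table.csv`. The sample rows (long s and r rotunda) and the function names `load_table` and `to_graphematic` are invented for the example.

```python
import csv
import io


def load_table(handle):
    """Read (allograph, grapheme) pairs from a two-column CSV handle.

    Assumption: one pair per row, no header; the real table may differ.
    """
    table = {}
    for row in csv.reader(handle):
        if len(row) >= 2 and row[0]:
            table[row[0]] = row[1]
    return table


def to_graphematic(text, table):
    """Replace each allograph by its grapheme, longest keys first."""
    for allograph in sorted(table, key=len, reverse=True):
        text = text.replace(allograph, table[allograph])
    return text


# Hypothetical sample rows standing in for the real conversion table.
sample = io.StringIO("ſ,s\nꝛ,r\n")
table = load_table(sample)
print(to_graphematic("caſa poꝛ", table))
```

To try it against the real table, one could pass an open file handle for `corpus/conversion_table.csv` to `load_table`, after checking that the column layout matches the assumption above. Abbreviation marks are left untouched, in line with the note that abbreviations are not expanded.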
Please cite the following paper if you use this dataset or the models:
@article{gille_levenson_2023_towards,
author = {Gille Levenson, Matthias},
date = {2023},
journaltitle = {Journal of Data Mining and Digital Humanities},
doi = {10.46298/jdmdh.10416},
editor = {Pinche, Ariane and Stokes, Peter},
issuetitle = {Special Issue: Historical documents and automatic text recognition},
title = {Towards a general open dataset and models for late medieval Castilian text recognition (HTR/OCR)},
}
GILLE LEVENSON, Matthias, « Towards a general open dataset and models for late medieval Castilian text recognition (HTR/OCR) », Journal of Data Mining and Digital Humanities (2023): Special Issue: Historical documents and automatic text recognition, eds. Ariane PINCHE and Peter STOKES, DOI: 10.46298/jdmdh.10416.
The images of manuscript M (Esc_M) have not yet been uploaded, pending permission from the library that keeps the manuscript.
All images are stored in directories named after the place where the manuscript is kept and, for the in-domain dataset, the siglum of the witness.
The global licence for the dataset (except for images) is CC-BY-NC-SA.
All manuscript reproductions are published with the authorization of the libraries.
©Biblioteca General Histórica de Salamanca
Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2709 (L)
Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2097 (J)
Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2673
Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2011
Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2654
Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2086
©Museo Lázaro Galdiano. Madrid
Inv. 15304, Fundación Lázaro Galdiano (A)
©Universidad de Valladolid
Ms. 251, Biblioteca Santa Cruz (S)
©Real Biblioteca del Escorial
Ms. K.I.5, Biblioteca del Real Monasterio del Escorial (Q)
Ms. h.I.8, Biblioteca del Real Monasterio del Escorial (M): to be published
Ms. Z-I-12
Ms. Z-III-9
Ms. X-III-4
Ms. h-III-9
Ms. b-IV-15
Ms. b-II-11
Ms. a-II-17
Ms. T-III-5
©Rosenbach Foundation
Ms. 482/2 (U)
© Gallica.bnf.fr
Espagnol 12
Espagnol 36
Espagnol 218
© Bodleian Library
Ms. Span. d. 1
Ms. Span. d. 2/1
© Biblioteca Real, Madrid
Ms. II/215 (G)
© Biblioteca Nacional de España
Mss/4183
Inc/901 (Z)
© Biblioteca Universitaria, Sevilla
Ms. 332/131 (R)
Edit: add result files
Files
Name | Size | Checksum |
---|---|---|
data_v2.zip | 2.1 GB | md5:95fd13aa0a1dfb1d7fe2cf19b7cb801d |
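The md5 checksum listed for `data_v2.zip` can be used to check that a download is complete. The following is a minimal sketch using only the Python standard library. The helper names `md5_of` and `verify` are illustrative, and the local path passed to them is an assumption about where the archive was saved.

```python
import hashlib

# Checksum as given in the file listing for data_v2.zip.
EXPECTED_MD5 = "95fd13aa0a1dfb1d7fe2cf19b7cb801d"


def md5_of(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected md5 digest."""
    return md5_of(path) == expected
```

For example, `verify("data_v2.zip")` should return `True` for an intact copy saved in the current directory. A `False` result suggests a truncated or corrupted download.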