Published September 29, 2021 | Version v1
Dataset Open

HIMANIS Guérin

  • 1. Institut de Recherche et d'Histoire des Textes (CNRS)
  • 2. Leopold-Franzens-Universität Innsbruck

Description

The HIMANIS Guérin dataset provides ground truth for training Handwritten Text Recognition (HTR) models. It covers 1217 images or parts of images and 30015 lines (933 images and 22093 lines in Guérin 1; 284 images and 7922 lines in Guérin 2). It was established as part of the HIMANIS research project in collaboration with the READ consortium (Recognition and Enrichment of Archival Documents).
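
As a quick consistency check, the per-subset counts quoted above can be summed and compared with the stated totals. This is only a minimal sketch: the figures are copied from the description, and the dictionary layout and variable names are illustrative assumptions, not part of the dataset.

```python
# Counts as quoted in the description (layout and names are illustrative).
subsets = {
    "Guérin 1": {"images": 933, "lines": 22093},
    "Guérin 2": {"images": 284, "lines": 7922},
}

# Sum each count across the two subsets.
total_images = sum(s["images"] for s in subsets.values())
total_lines = sum(s["lines"] for s in subsets.values())

print(total_images, total_lines)  # 1217 30015
```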

The base text is Paul Guérin's edition, Recueil des documents concernant le Poitou contenus dans les registres de la Chancellerie de France, published between 1881 and 1919. The edition was digitized and OCR-processed by the Bibliothèque nationale de France, encoded by the École nationale des chartes (http://corpus.enc.sorbonne.fr/actesroyauxdupoitou/), and then corrected and enhanced in HIMANIS, especially for abbreviations and links to digital images (https://github.com/oriflamms/himanis/blob/master/Editions/Guerin_tome1-tome12.xml).

The READ consortium aligned the text line by line in Transkribus for the acts whose coordinates were recorded in the HIMANIS project, mainly for volumes Paris, Archives nationales, JJ 35 to JJ 91, supplemented by information for volume 12 of Guérin's edition.

This dataset comprises two Transkribus exports, enriched with links to images accessible via the IIIF protocol; the links are stored in the @corresp attribute of <graphic/> elements.
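
The image links could be collected from an export file along the following lines. This is a minimal sketch under stated assumptions: it assumes the exports are XML files in which <graphic/> elements may sit in a namespace (so it matches on the local element name), and the sample document, its URLs, and the function names are hypothetical and only serve to make the example self-contained.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def iiif_links(xml_text):
    """Return the @corresp values of all <graphic/> elements, in document order."""
    root = ET.fromstring(xml_text)
    links = []
    for elem in root.iter():
        if local_name(elem.tag) == "graphic":
            corresp = elem.get("corresp")
            if corresp:  # skip <graphic/> elements without a link
                links.append(corresp)
    return links


# Hypothetical sample; the real exports may be structured differently.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <facsimile>
    <surface>
      <graphic url="page1.jpg"
               corresp="https://example.org/iiif/page1/full/full/0/default.jpg"/>
    </surface>
    <surface>
      <graphic url="page2.jpg"/>
    </surface>
  </facsimile>
</TEI>"""

print(iiif_links(SAMPLE))
```

To process a real export, the file content would be read from disk and passed to iiif_links; matching on the local name avoids hard-coding a namespace that has not been verified against the actual files.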

The historical corpus is described in Stutzmann, Dominique, Jean-François Moufflet, and Sébastien Hamel. « La recherche en plein texte dans les sources manuscrites médiévales : enjeux et perspectives du projet HIMANIS pour l’édition électronique ». Médiévales : Langue, textes, histoire 73 (2017): 67‑96. https://doi.org/10.4000/medievales.8198.

The present dataset is the training data for the "HIMANIS Chancery M1+" model; cf. https://readcoop.eu/model/french-and-latin-chancery-documents/

Files

Guerin(1).zip

Files (4.9 GB)

Size      Checksum
4.0 GB    md5:c3f9ed153f96e0ec7c1e1e52aaa9b413
871.4 MB  md5:5e9d71408ab45055e616052fee0ee270
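
After downloading, an archive can be checked against the MD5 checksums listed above. The following is a minimal sketch: the function name and the local file path in the usage comment are assumptions, and the file is read in chunks because the archives are several gigabytes in size.

```python
import hashlib


def md5_of(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksums as published on this record (without the "md5:" prefix).
PUBLISHED_MD5 = {
    "c3f9ed153f96e0ec7c1e1e52aaa9b413",
    "5e9d71408ab45055e616052fee0ee270",
}

# Usage (the path is hypothetical):
# print(md5_of("downloads/archive.zip") in PUBLISHED_MD5)
```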

Additional details

Related works

Is documented by
10.4000/medievales.8198 (DOI)

Funding

HIMANIS – Indexation de manuscrits historiques pour une recherche contrôlée par l'utilisateur ANR-15-EPAT-0003
Agence Nationale de la Recherche
READ – Recognition and Enrichment of Archival Documents 674943
European Commission

References

  • Stutzmann, Dominique, Jean-François Moufflet, and Sébastien Hamel. « La recherche en plein texte dans les sources manuscrites médiévales : enjeux et perspectives du projet HIMANIS pour l'édition électronique ». Médiévales : Langue, textes, histoire 73 (2017): 67‑96. https://doi.org/10.4000/medievales.8198