Published February 2, 2023 | Version 1.0
Dataset Open

The e-NDP project : collaborative digital edition of the Chapter registers of Notre-Dame of Paris (1326-1504). Ground-truth for handwriting text recognition (HTR) on late medieval manuscripts.

  • 1. LaMOP, Université de Paris 1 Panthéon-Sorbonne
  • 2. Université du Luxembourg
  • 3. Archives nationales de France
  • 4. École nationale des chartes, Paris Sciences et Lettres
  • 5. Université de Paris 1 Panthéon-Sorbonne
  • 6. Université de Franche-Comté
  • 7. Université de Limoges
  • 1. LaMOP, Université de Paris 1 Panthéon-Sorbonne
  • 2. École nationale des chartes, Paris Sciences et Lettres
  • 3. Archives nationales de France
  • 4. Université de Paris 1 Panthéon-Sorbonne
  • 5. Université de Franche-Comté
  • 6. Université de Limoges
  • 7. University of Luxembourg

Description

The e-NDP project, funded by the ANR, is led by the LaMOP (Julie Claustre and Darwin Smith).

The project's partners are the Archives nationales, the Bibliothèque nationale de France (Department of Manuscripts, Bibliothèque de l'Arsenal), the École nationale des chartes and the Bibliothèque Mazarine.

The e-NDP project aims to renew our knowledge of Notre-Dame de Paris cathedral through the creation of a collaborative digital edition of the registers of its Chapter (1326-1504, AN LL 105-128), the community of 51 canons who met three times a week on set days to take all administrative, financial and practical decisions pertaining to the cathedral, its estate and the society living in its cloister. This corpus has never been the object of a comprehensive study of the workings and history of this urban enclave and powerful community. The collaborative digital edition is based on a process of handwriting text recognition (HTR), tested and supervised by scholars, researchers and engineers combining expertise in medieval history, paleography, philology and digital humanities. The edition will allow better insight into the Chapter’s administration, its economic and political power within Paris, and the relationships it maintained with other institutions in the city.

 

Section 1. The e-NDP ground-truth dataset for handwriting text recognition.

The full e-NDP corpus is kept today in the French National Archives; it was entirely digitized and described in the Archives' catalog in 2022.

The first major goal of the e-NDP project is to produce a first automatic transcription of the 14k pages that make up the 26 chapter registers. To this end, representative samples from each volume were selected and transcribed in order to train a specialized HTR model capable of high-quality automatic transcription. The ground truth released in this repository currently comprises 512 pages from the 26 registers of the cathedral chapter preserved in the National Archives (LL105 - LL128, 1326-1504). The transcriptions were completed manually in two rounds by a group of 12 contributors, historians and paleographers, over the course of 2021-2022, using eScriptorium as the annotation environment.

 

Ground-truth features :


Number of hands : according to our estimates no fewer than 18 main hands were involved in the writing of the registers during the medieval period. 

Language : More than 98% of the content of the registers was written in Latin, the rest in French. The exact percentage is hard to estimate because the vernacular language is often used in formulae, notes and comments. It is rare to find entire pages or blocks written in French. 

Script family : The registers were written in a cursive script (ca. late 13th - 16th century).

Document typology : The volumes containing the chapter conclusions were conceived to serve as memorial records, but above all as documents for regular use and consultation in the daily practice of administration and management. In diplomatics, the notion of "documentary manuscripts" is used to describe this kind of source, as opposed to books and literary or normative manuscripts.

Ground truth statistics

Text units                          Count
Pages                               512
Annotated regions (see section 2)   2448
Lines of text                       34231
Tokens                              205083
Characters                          3320407

 

Rules of transcription :

  • Abbreviations have been resolved, both those by suspension (facimꝰ --> facimus) and those by contraction (dñi --> domini). Abbreviations using conventional signs, such as those for et and pro, have been resolved in the same way.
  • Named entities (names of persons, places and institutions) have been capitalized. The beginning of a block of text and the original capitals used by the notary are also capitalized.
  • Consonantal i and u have been transcribed as j and v in both French and Latin.
  • The punctuation marks used in the text, . and /, have been transcribed as they appear; the transcription has not been standardized to modern punctuation.
  • Corrections and words that appear cancelled in the manuscript have been transcribed enclosed between $ signs, one at the beginning and one at the end.
  • More specific transcription rules can be found in the file transcription_guidelines.pdf
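The $ convention for cancelled text can be handled mechanically when preparing the transcriptions for other uses. The following is a minimal sketch, assuming that the markers always come in pairs and never nest; the function name and the sample line are invented for illustration and do not come from the registers or the project's tooling.

```python
import re

# Assumption: cancelled text is wrapped as $...$ and the markers do not nest.
CANCELLED = re.compile(r"\$([^$]*)\$")


def split_cancelled(line: str) -> tuple[str, list[str]]:
    """Return the line without cancelled passages, plus the cancelled passages."""
    cancelled = [passage.strip() for passage in CANCELLED.findall(line)]
    kept = CANCELLED.sub(" ", line)
    # Collapse the extra spaces left behind by the removed passages.
    kept = re.sub(r"\s+", " ", kept).strip()
    return kept, cancelled


# Invented sample line, used only to show the behaviour.
sample = "Item domini $concesserunt$ ordinaverunt quod"
print(split_cancelled(sample))
```

Keeping the cancelled passages in a separate list, rather than discarding them, lets a reader decide later whether scribal corrections matter for a given analysis.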

 

Section 2. e-NDP Layout Segmentation.

Layout segmentation is a compulsory step before HTR, used to distinguish sections and regions inside a document. The process is intended to separate interdependent page zones so that recognition follows a section-by-section order rather than a line-by-line order, which would mix textual and peritextual content.

The regions of 364 pages (see GT-layout_list) of the e-NDP corpus were annotated using a vocabulary of five region types (see endp_layout_regions) in order to describe the page layout across all 26 volumes :

  1. Block : All the central text blocks, which normally correspond to the main content, called "conclusions" in the registers.
  2. Liste : List of the names of the canons who were present during the meeting. Normally located before the conclusions.
  3. Entrée : Marginal notes or entries that summarize the content of the conclusions.
  4. Date : Paragraph containing the date. Normally at the head of a conclusion, but separate from the main body.
  5. Numérotation : Page numbers in roman or arabic numerals. They usually appear in the top corners of the pages.

Layout GT statistics

Region         Count
block          833
liste          431
date           448
entrée         205
numérotation   531
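
As a quick consistency check, the per-region counts can be summed and compared with the 2448 annotated regions reported in the ground-truth statistics of Section 1. The sketch below only restates the figures given in this record; the percentages it prints are derived values, not figures published by the project.

```python
# Region counts copied from the layout GT statistics in this record.
region_counts = {
    "block": 833,
    "liste": 431,
    "date": 448,
    "entrée": 205,
    "numérotation": 531,
}

total = sum(region_counts.values())
# Share of each region type, as a percentage rounded to one decimal.
shares = {name: round(100 * count / total, 1) for name, count in region_counts.items()}

print(total)  # 2448, matching the "Annotated regions" figure in Section 1
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```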

 

Section 3. The e-NDP HTR modeling.

The e-NDP project has progressively trained several HTR models adapted to late medieval cursive in order to accelerate the production of ground truth. Currently the best model delivers an average CER (character error rate) of 9.7% on the 26 registers (see endp_learning_curve) and can serve as a generalist model for other manuscripts of the same period and a similar script family. These models and their training implementation details can be found in the project's GitHub repository.
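
For readers unfamiliar with the metric, CER is commonly defined as the character-level edit distance between the model output and the reference transcription, divided by the length of the reference. The sketch below illustrates that standard definition. This record does not state how the project's evaluation normalizes text or averages over pages, so this is not necessarily the exact procedure behind the 9.7% figure, and the sample strings are invented.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of a hypothesis against a non-empty reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Invented example: one wrong character in a 19-character reference.
reference = "domini capitulantes"
hypothesis = "domini capitulantos"
print(f"{cer(reference, hypothesis):.3f}")
```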

Additionally, the automatic HTR transcriptions of the 26 registers (14k pages, 4.5M tokens), enriched with lexical and semantic information, have been the subject of a first online publication using NoSketch Engine, which allows advanced data mining based on the combination of data, metadata and NLP features.

 

Section 4. Dataset content.

This zip dataset contains :

- HTR_ground_truth : Two folders containing the jpg / jpeg images and their curated transcriptions in PAGE XML format.

- images_docs : 4 files illustrating the different phases of the project (list of GT pages for layout segmentation, layout ontology, transcription guidelines and HTR evaluation curves)
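
The PAGE XML transcriptions can be read with a standard XML parser. The following is a minimal sketch, assuming the usual PAGE structure in which each TextRegion contains TextLine elements whose text sits in TextEquiv/Unicode. The sample document and the function names are invented for illustration; the actual files in the archive may carry additional elements and attributes that this sketch ignores.

```python
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip the XML namespace so the code works across PAGE schema versions."""
    return tag.rsplit("}", 1)[-1]


def read_page_lines(root: ET.Element) -> list[tuple[str, str]]:
    """Return (region_id, line_text) pairs in document order."""
    lines = []
    for region in root.iter():
        if local(region.tag) != "TextRegion":
            continue
        for line in region:
            if local(line.tag) != "TextLine":
                continue
            for equiv in line:
                if local(equiv.tag) != "TextEquiv":
                    continue
                for uni in equiv:
                    if local(uni.tag) == "Unicode":
                        lines.append((region.get("id", ""), uni.text or ""))
    return lines


# Invented sample, not taken from the dataset.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.jpg">
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>Die lune</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>Item domini</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

print(read_page_lines(ET.fromstring(SAMPLE)))
```

For a file on disk, the root element can be obtained with `ET.parse(path).getroot()` and passed to the same function.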

Files (913.8 MB)

e-NDP_dataset.zip (913.8 MB)
md5:3cadd82f6fb2d1f2874efcc7e590be53
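
After downloading, the archive can be checked against the md5 checksum listed with the file. The sketch below is one way to do this with the Python standard library; the function names are illustrative, and the local path passed to them is whatever location the archive was saved to.

```python
import hashlib

# Checksum as listed for e-NDP_dataset.zip in this record.
EXPECTED_MD5 = "3cadd82f6fb2d1f2874efcc7e590be53"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True when the file's md5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify("e-NDP_dataset.zip")` in the download directory should return True for an intact copy. Reading in chunks keeps memory use low for an archive of this size.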

Additional details

Funding

E-NDP – Notre-Dame de Paris and its cloister: places, people, life ANR-20-CE27-0012
Agence Nationale de la Recherche