M-POPP datasets: Datasets for full page text recognition and information extraction from French handwritten and printed marriage records

CONSTUM, Thomas; PREEL, Lucas; LARCHER, Théo; TRANOUEZ, Pierrick; PAQUET, Thierry; BREE, Sandra

doi:10.5281/zenodo.11296970

Published May 25, 2024 | Version v2

Dataset Open


1. Laboratoire d'Informatique, du Traitement de l'Information et des Systèmes
2. Université de Rouen Normandie
3. Laboratoire de Recherche Historique Rhône-Alpes
4. Centre National de la Recherche Scientifique

M-POPP datasets

This repository contains two datasets created within the EXO-POPP project (Optical EXtraction of handwritten named entities for marriage records of the POPulation of Paris) for text recognition and information extraction. These datasets were published in End-to-end information extraction in handwritten documents: Understanding Paris marriage records from 1880 to 1940 [1] at ICDAR 2024.

This version contains the labels for Handwritten Text Recognition and for Handwritten Text Recognition + Information Extraction, as used in [1].

The performance of the model described in [1] is detailed in the Leaderboard section.

General information

The EXO-POPP project aims to build a comprehensive database of 300,000 marriage records from Paris and its suburbs, spanning the years 1880 to 1940 and preserved in over 130,000 scans of double pages. Each marriage record may contain up to 118 distinct types of information that must be extracted from plain text. The M-POPP corpus (Marriage records of the POPulation of Paris) is the corpus on which the EXO-POPP project focuses. It was built by gathering the marriage records of Paris and its suburban departments (Hauts-de-Seine, Seine-Saint-Denis, Val-de-Marne).

The M-POPP datasets are a subset of the M-POPP corpus, annotated for full-page text recognition and named entity recognition/information extraction from both handwritten and printed documents. The first dataset comprises handwritten marriage records, while the second consists of typewritten marriage records. Note that even typewritten marriage records contain some handwritten information, especially the names of the spouses and the notes in the margin.
The dataset contains single-page images obtained from the original scans of double pages via page segmentation.

The structure of the files is the following:

handwritten: the handwritten dataset
- images: images of the dataset divided following the split used in [1]
  - train
  - valid
  - test
- labels: labels for joint handwritten text recognition and information extraction for each encoding tested in [1]
printed: the printed dataset
- images: images of the dataset divided following the split used in [1]
  - train
  - valid
  - test
- labels: labels for joint handwritten text recognition and information extraction for each encoding tested in [1]
encoding-2-to-encoding-5.json: a JSON file giving the correspondence between the symbols of encoding 2 and encoding 5.
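
As a quick sanity check after extracting the archive, the folder layout above can be walked to count the images in each split. The sketch below is only illustrative: the function name `count_images` and the extraction root passed to it are assumptions, and because the image file extension is not stated here, every regular file in a split folder is counted.

```python
from pathlib import Path

SPLITS = ("train", "valid", "test")


def count_images(root, dataset):
    """Count the files in each split folder of one dataset ('handwritten' or 'printed')."""
    images_dir = Path(root) / dataset / "images"
    counts = {}
    for split in SPLITS:
        split_dir = images_dir / split
        if split_dir.is_dir():
            # The image file extension is not stated above, so every regular file is counted.
            counts[split] = sum(1 for p in split_dir.iterdir() if p.is_file())
        else:
            counts[split] = 0
    return counts
```

For example, `count_images("m-popp_datasets-v2", "handwritten")` (the root folder name is hypothetical) returns one count per split. Since the datasets contain single-page images, these counts can be compared with the page counts given for each split.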

Table 1: Details on the split of the handwritten dataset.

	Train	Validation	Test
Pages	250	32	32
Acts	344	51	53
Named entities	16727	2223	2517

Table 2: Details on the split of the printed dataset.

	Train	Validation	Test
Pages	116	14	13
Acts	363	43	30
Named entities	22036	2559	2405

Table 3: Average annotation statistics per act for the two M-POPP datasets.

Dataset	# of characters	# of words	# of named entities
Handwritten	1519	231	48
Printed	1328	200	60
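
The split sizes in Tables 1 and 2 can be turned into a few derived figures, such as the share of pages used for training or the number of named entities per act. The following is a small sketch using only the numbers from those two tables; the helper names are illustrative.

```python
# Figures copied from Tables 1 and 2, in the order (train, validation, test).
SPLIT_STATS = {
    "handwritten": {
        "pages": (250, 32, 32),
        "acts": (344, 51, 53),
        "entities": (16727, 2223, 2517),
    },
    "printed": {
        "pages": (116, 14, 13),
        "acts": (363, 43, 30),
        "entities": (22036, 2559, 2405),
    },
}


def entities_per_act(dataset):
    """Total named entities divided by total acts, over all three splits."""
    stats = SPLIT_STATS[dataset]
    return sum(stats["entities"]) / sum(stats["acts"])


def train_fraction(dataset, key="pages"):
    """Share of the given quantity that falls in the training split."""
    values = SPLIT_STATS[dataset][key]
    return values[0] / sum(values)


for name in SPLIT_STATS:
    print(f"{name}: {entities_per_act(name):.1f} entities per act, "
          f"{100 * train_fraction(name):.1f}% of pages in train")
```

The entities-per-act values can be compared with Table 3. Small differences are possible, since the exact averaging procedure behind Table 3 is not described here.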

Document structure Annotation

We employ the procedure applied in [2], which involves adding opening and closing tags to the character set for each text block we want to recognize.
In total, we define four types of text blocks.

Block A is located in the margin and contains the last names of the married couple, possibly with their first names and the date of the marriage.
Block B is the body of the text. Block B is the one that contains most of the information to be extracted.
Block C is optional and corresponds to marginal notes used in various cases, such as the mention of a divorce or a correction made to the act.
Block D corresponds to a set containing a block A and a block B, optionally with one or more blocks C.

Information Extraction annotation

The dataset contains 118 information categories. As explained in the paper, we broke down the named entities into sub-elements pertaining to 4 hierarchical levels, which reduces the total number of categories to 23 instead of 118. Note that the categories of levels 1, 2, and 3 do not encode named entities themselves but rather the relations that may hold between lower-level categories. For example, (day, birth, husband) encodes the fact that the annotated piece of text is the day of the husband's date of birth.

For these datasets, we chose to represent these hierarchical elements with emojis. For instance, the category "first name" is represented by the emoji 💬.
The meaning of each emoji can be found in Table 4. To determine the best way to encode named entities in the ground truth, we compared five types of encoding in [1]. To illustrate these encodings, consider Louis Alexandre MOUDEL, whom we define as the father of the bride, where Louis and Alexandre are his two first names and MOUDEL is his last name.

1) Single separate tags before each word: In this approach, each level of information is indicated by a dedicated tag, and the tags are placed before the word they encode information for. With this encoding, the ground truth for the example would be:

💬👴👰Louis 💬👴👰Alexandre 🗨️👴👰MOUDEL

2) Single separate tags after each word: Similar to the previous approach, except here the tags are placed after the word. With this encoding the previous example becomes:

Louis👰👴💬 Alexandre👰👴💬 MOUDEL👰👴🗨️

3) Open & close separate tags: Here, each word carrying information to be extracted is surrounded by one or more opening and closing tags, where each tag encodes a level of information. The example becomes:

<👰> <👴> <💬> Louis <\💬> <\👴> <\👰>
<👰> <👴> <💬> Alexandre <\💬> <\👴> <\👰>
<👰> <👴> <🗨️> MOUDEL <\🗨️> <\👴> <\👰>

4) Nested open & close separate tags: Similar to the previous approach, but this time a tag is closed only when the encoded information is no longer the same for that level of information. We can see in the example below that the tags for wife and father are only used twice.

<👰> <👴> <💬> Louis Alexandre <\💬> <🗨️> MOUDEL <\🗨️>

5) Single combined tags after each word: In the last approach, one tag encodes all the hierarchical levels constituting information. The tags are located after the word they encode information for.

Louis<wife_father_first_name> Alexandre<wife_father_first_name> MOUDEL<wife_father_family_name>

NB: In the labels file of encoding 5, the information is still encoded with emojis, but the chosen emojis have no semantic meaning because of the number of information categories to be represented. The correspondence between the symbols of encoding 2 and encoding 5 can be found in the file encoding-2-to-encoding-5.json.
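
To make the relation between encodings 2 and 5 concrete, the sketch below parses a line written in encoding 2 and joins the per-level tags of each word into a single combined label, in the spirit of the encoding 5 example. It is an illustration only: the function `parse_encoding_2` and the readable label names are assumptions, only the four emojis needed for the example are listed, and the actual label files of encoding 5 use the symbols given in encoding-2-to-encoding-5.json rather than readable names.

```python
VS16 = "\ufe0f"  # emoji variation selector, stripped so that both spellings of a tag match

# Subset of Table 4 needed for the example; extend with the remaining tags as required.
TAG_NAMES = {
    "👰": "wife",
    "👴": "father",
    "💬": "first_name",
    "🗨️": "family_name",
}
TAG_NAMES = {emoji.replace(VS16, ""): name for emoji, name in TAG_NAMES.items()}


def parse_encoding_2(text):
    """Return (word, combined_label) pairs for every tagged word in `text`."""
    entities = []
    for token in text.replace(VS16, "").split():
        labels = []
        # Tags follow the word, so peel them off from the right.
        while token and token[-1] in TAG_NAMES:
            labels.append(TAG_NAMES[token[-1]])
            token = token[:-1]
        if labels:
            # Peeling reverses the order; restore it so the highest level comes first.
            entities.append((token, "_".join(reversed(labels))))
    return entities


example = "Louis👰👴💬 Alexandre👰👴💬 MOUDEL👰👴🗨️"
for word, label in parse_encoding_2(example):
    print(word, label)
```

Words without any trailing tag are skipped, so only the annotated words are returned.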

Table 4: Details of the hierarchical breakdown of named entities. Each tag is placed in the corresponding hierarchical level and associated with the emoji representing it.

Level	Tags
1	Administrative 📖	Husband 👨	Wife 👰	Witness 🥸
2	Father 👴	Mother 👵	Ex-husband 💔
3	Birth 🏥	Residence 🏠
4	First name 💬	Family name 🗨️	Age ⌛	Occupation 🔧
5	Street number 🔟	Street type 🛣	Street name 🔠	City 🌆
	Department 🗺	Country 🗺	Day 🌞	Month 📅
	Year 🗓	Hour ⏰	Minute ⏱

Leaderboard

Results on M-POPP handwritten

HTR

The following table contains the current leaderboard of M-POPP v2 for HTR on the handwritten dataset.

In this configuration, layout block C is not considered.

These results for DAN NER are given using the named entity encoding format 5 described above.

HTR stands for Handwritten Text Recognition and HTR+IE for combined Handwritten Text Recognition and Information Extraction.

Metrics are expressed in percentages.

Method	CER	WER
DAN - HTR [1]	7.42	16.29
DAN NER - HTR + IE [1]	6.57	15.93
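
For readers unfamiliar with the metrics, CER and WER are commonly computed as an edit distance between the predicted and reference text, normalized by the reference length, at character and word level respectively. The sketch below follows that common definition. It is not the evaluation code used in [1]: the aggregation over pages and any text normalization applied there are not specified here, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insertion, deletion, substitution cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="char"):
    """Summed edit distance divided by total reference length, in percent.

    unit="char" gives a CER-style score, unit="word" a WER-style score
    (words are obtained by whitespace splitting, which is an assumption).
    """
    total_dist = 0
    total_len = 0
    for ref, hyp in zip(refs, hyps):
        if unit == "word":
            ref, hyp = ref.split(), hyp.split()
        total_dist += edit_distance(ref, hyp)
        total_len += len(ref)
    return 100.0 * total_dist / total_len
```

Called with lists of reference and predicted page transcriptions, `error_rate(refs, hyps)` gives a character-level score and `error_rate(refs, hyps, unit="word")` a word-level one.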

NER

The following table contains the current leaderboard of M-POPP v2 for NER on the handwritten dataset.

In this configuration, layout block C is not considered.

These results are given using the named entity encoding format 5 described above.

Metrics are expressed in percentages.

Method	F1
DAN NER [1]	73.51

Results on M-POPP printed

TR

The following table contains the current leaderboard of this version for TR on the printed dataset.

In this configuration, only layout block B is considered.

These results for DAN NER are given using the named entity encoding format 5 described above.

TR stands for Text Recognition and TR+IE for combined Text Recognition and Information Extraction.

Metrics are expressed in percentages.

Method	CER	WER
DAN - TR [1]	0.88	3.17
DAN NER - TR + IE [1]	1.54	3.55

NER

The following table contains the current leaderboard of this version for NER on the printed dataset.

In this configuration, only layout block B is considered.

These results are given using the named entity encoding format 5 described above.

Metrics are expressed in percentages.

Method	F1
DAN NER [1]	93.04

Citation Request

If you publish material based on these datasets, please include a reference to the paper: T. Constum, L. Preel, T. Paquet, P. Tranouez, S. Brée, End-to-end information extraction in handwritten documents: Understanding Paris marriage records from 1880 to 1940, International Conference on Document Analysis and Recognition (ICDAR), Athens, Greece, 2024.

Bibliography

1: T. Constum, L. Preel, T. Paquet, P. Tranouez, S. Brée, End-to-end information extraction in handwritten documents: Understanding Paris marriage records from 1880 to 1940, International Conference on Document Analysis and Recognition (ICDAR), Athens, Greece, 2024.

2: D. Coquenet, C. Chatelain, T. Paquet, DAN: a Segmentation-free Document Attention Network for Handwritten Document Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–17, 2023.

Files

Name	Size
m-popp_datasets-v2.zip (md5:612af8adbc95d866d7bd624b288302d9)	1.0 GB
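
The MD5 checksum listed for the archive can be used to verify a download. The following is a minimal sketch using Python's standard `hashlib` module; the helper names and the local path passed to them are assumptions.

```python
import hashlib

# Checksum listed for m-popp_datasets-v2.zip.
EXPECTED_MD5 = "612af8adbc95d866d7bd624b288302d9"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_archive_intact(path):
    """Return True if the file at `path` matches the listed checksum."""
    return md5_of_file(path) == EXPECTED_MD5
```

For example, `is_archive_intact("m-popp_datasets-v2.zip")` returns True when the downloaded file matches the listed checksum.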

Additional details

Is published in: Conference paper: https://arxiv.org/abs/2404.19329 (URL)

