M-POPP datasets: Datasets for full page text recognition and information extraction from French handwritten and printed marriage records
Description
M-POPP datasets
This repository contains two datasets created within the EXO-POPP project (Optical EXtraction of handwritten named entities for marriage records of the POPulation of Paris) for the task of text recognition and information extraction. These datasets were published in End-to-end information extraction in handwritten documents: Understanding Paris marriage records from 1880 to 1940 [1] at ICDAR 2024.
The EXO-POPP project aims to establish a comprehensive database comprising 300,000 marriage records from Paris and its suburbs, spanning the years 1880 to 1940, which are preserved in over 130,000 scans of double pages. Each marriage record may contain up to 118 distinct types of information that must be extracted from plain text. The M-POPP corpus (which stands for Marriage records of the POPulation of Paris) is the corpus on which the EXO-POPP project focuses. This corpus was built by gathering the marriage records of Paris and its suburbs (Hauts-de-Seine, Seine-Saint-Denis, Val-de-Marne).
The M-POPP datasets are a subset of the M-POPP corpus with annotations for full-page text recognition and named entity recognition/information extraction from both handwritten and printed documents. The first dataset comprises handwritten marriage records, while the second dataset consists of typewritten marriage records. Note that even typewritten marriage records contain some handwritten information, especially the names of the spouses and the notes in the margin.
Both datasets contain single-page images obtained from the original scans of double pages via page segmentation.
The structure of the files is the following:
- handwritten: the handwritten dataset
  - images: images of the dataset divided following the split used in [1]
    - train
    - valid
    - test
  - labels: labels for joint handwritten text recognition and information extraction for each encoding tested in [1]
- printed: the printed dataset
  - images: images of the dataset divided following the split used in [1]
    - train
    - valid
    - test
  - labels: labels for joint handwritten text recognition and information extraction for each encoding tested in [1]
- encoding-2-to-encoding-5.json: a JSON file giving the correspondence between the symbols of encoding 2 and encoding 5.
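The layout above can be walked with a few lines of Python. This is only a minimal sketch: the extraction directory name `m-popp_datasets` and the helper `list_split_images` are assumptions made for illustration, and no assumption is made about image file extensions or about the names of the files inside `labels`.

```python
from pathlib import Path

# Split folder names as listed in the file structure above.
SPLITS = ("train", "valid", "test")


def list_split_images(root, subset):
    """Return {split: sorted files} for subset 'handwritten' or 'printed'."""
    images_dir = Path(root) / subset / "images"
    return {
        split: sorted(p for p in (images_dir / split).iterdir() if p.is_file())
        for split in SPLITS
    }


if __name__ == "__main__":
    # Hypothetical location of the extracted archive; adjust as needed.
    root = Path("m-popp_datasets")
    if root.is_dir():
        for subset in ("handwritten", "printed"):
            for split, files in list_split_images(root, subset).items():
                print(subset, split, len(files))
```

If the archive matches the description, the printed counts should agree with the page counts of Tables 1 and 2.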
Table 1: Details on the split of the handwritten dataset.
Split | Train | Validation | Test |
Pages | 250 | 32 | 32 |
Acts | 344 | 51 | 53 |
Named entities | 16727 | 2223 | 2517 |
Table 2: Details on the split of the printed dataset.
Split | Train | Validation | Test |
Pages | 116 | 14 | 13 |
Acts | 363 | 43 | 30 |
Named entities | 22036 | 2559 | 2405 |
Table 3: Average annotation statistics per act for the two M-POPP datasets.
Dataset | # of characters | # of words | # of named entities |
Handwritten | 1519 | 231 | 48 |
Printed | 1328 | 200 | 60 |
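A few derived figures help when comparing the two datasets. The sketch below only sums the page and act counts copied from Tables 1 and 2; the function name `summarize` and the choice of derived quantities are ours, not part of the dataset.

```python
# Counts copied from Tables 1 and 2, in the order (train, validation, test).
COUNTS = {
    "handwritten": {"pages": (250, 32, 32), "acts": (344, 51, 53)},
    "printed": {"pages": (116, 14, 13), "acts": (363, 43, 30)},
}


def summarize(counts):
    """Total pages and acts, acts per page, and share of acts in train."""
    pages = sum(counts["pages"])
    acts = sum(counts["acts"])
    return {
        "pages": pages,
        "acts": acts,
        "acts_per_page": acts / pages,
        "train_share_of_acts": counts["acts"][0] / acts,
    }


for name, counts in COUNTS.items():
    s = summarize(counts)
    print(
        f"{name}: {s['pages']} pages, {s['acts']} acts, "
        f"{s['acts_per_page']:.2f} acts/page, "
        f"{s['train_share_of_acts']:.0%} of acts in train"
    )
```

The totals are 314 pages and 448 acts for the handwritten dataset and 143 pages and 436 acts for the printed one, so a printed page holds about three acts on average against about 1.4 for a handwritten page.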
Document structure Annotation
We employ the procedure applied in [2], which involves adding opening and closing tags to the character set for each text block we want to recognize.
In total, we define four types of text blocks.
- Block A is located in the margin and contains the last names of the married couple, possibly with their first names and the date of the marriage.
- Block B is the body of the text. Block B is the one that contains most of the information to be extracted.
- Block C is optional and corresponds to marginal notes used in various cases, such as the mention of a divorce or a correction made to the act.
- Block D corresponds to a set containing a block A and a block B, optionally with one or more blocks C.
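The relation between the four block types can be pictured as a small data structure. This is an illustrative sketch only: the class `MarriageAct` and its field names are hypothetical, the actual tag symbols used in the label files are not reproduced here, and the order returned by `blocks` is an arbitrary choice, not the reading order of the ground truth.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MarriageAct:
    """One block D: a margin block A, a body block B, and optional notes C."""

    margin_names: str  # block A: names of the couple, possibly the date
    body: str  # block B: main text of the act
    marginal_notes: List[str] = field(default_factory=list)  # blocks C

    def blocks(self) -> List[Tuple[str, str]]:
        """Return (block type, text) pairs: A, B, then any C blocks."""
        pairs = [("A", self.margin_names), ("B", self.body)]
        pairs += [("C", note) for note in self.marginal_notes]
        return pairs


# Placeholder strings, not real dataset content.
act = MarriageAct("margin text", "body text", ["note about a correction"])
print([block_type for block_type, _ in act.blocks()])
```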
Information Extraction annotation
The dataset contains 118 information categories. As explained in the paper, we broke down the named entities into sub-elements pertaining to 4 hierarchical levels, which reduces the total number of categories from 118 to 23. Note that the categories of levels 1, 2, and 3 do not encode named entities themselves but rather the relations that qualify lower-level categories. For example, (day, birth, husband) encodes the fact that the annotated piece of text is the date of birth of the husband.
For these datasets, we chose to represent these hierarchical elements with emojis. For instance, the information first name is represented by the emoji 💬.
The meaning of each emoji can be found in Table 4. To determine the best way to encode named entities in the ground truth, we compared five types of encoding in [1]. To illustrate these encodings, consider Louis Alexandre MOUDEL, defined as the father of the bride, where Louis and Alexandre are his two first names and MOUDEL is his last name.
1) Single separate tags before each word: In this approach, each level of information is indicated by a dedicated tag, and the tags are placed before the word they encode information for. With this encoding, the ground truth for the example would be:
💬👴👰Louis 💬👴👰Alexandre 🗨️👴👰MOUDEL
2) Single separate tags after each word: Similar to the previous approach, except here the tags are placed after the word. With this encoding the previous example becomes:
Louis👰👴💬 Alexandre👰👴💬 MOUDEL👰👴🗨️
3) Open & close separate tags: Here, each word carrying information to be extracted is surrounded by one or more opening and closing tags, where each tag encodes a level of information. The example becomes:
<👰> <👴> <💬> Louis <\💬> <\👴> <\👰>
<👰> <👴> <💬> Alexandre <\💬> <\👴> <\👰>
<👰> <👴> <🗨️> MOUDEL <\🗨️> <\👴> <\👰>
4) Nested open & close separate tags: Similar to the previous approach, but this time a tag is closed only when the encoded information is no longer the same for that level of information. We can see in the example below that the tags for wife and father are only used twice.
<👰> <👴> <💬> Louis Alexandre <\💬> <🗨️> MOUDEL <\🗨️>
5) Single combined tags after each word: In the last approach, one tag encodes all the hierarchical levels constituting information. The tags are located after the word they encode information for.
Louis<wife_father_first_name> Alexandre<wife_father_first_name> MOUDEL<wife_father_family_name>
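The encodings shown above can be reproduced from a structured annotation with a short function. This is a sketch for encodings 1, 2, 3 and 5 only (encoding 4 needs state across words and is left out); the function `encode`, the `NAMES` mapping and the choice of joining words with a space are assumptions for illustration, not the code used in [1].

```python
# Emojis of the example, written as escapes so the code is encoding-safe.
WIFE, FATHER = "\U0001F470", "\U0001F474"  # 👰 👴
FIRST_NAME = "\U0001F4AC"  # 💬
FAMILY_NAME = "\U0001F5E8\uFE0F"  # 🗨️ (base character + variation selector)

# Readable names for the combined tags of encoding 5, as in the example.
NAMES = {
    WIFE: "wife",
    FATHER: "father",
    FIRST_NAME: "first_name",
    FAMILY_NAME: "family_name",
}


def encode(word, tags, encoding):
    """Encode one word; `tags` is ordered from highest to lowest level."""
    if encoding == 1:  # separate tags before the word, lowest level first
        return "".join(reversed(tags)) + word
    if encoding == 2:  # separate tags after the word, highest level first
        return word + "".join(tags)
    if encoding == 3:  # opening tags, word, closing tags in reverse order
        opening = " ".join(f"<{t}>" for t in tags)
        closing = " ".join(f"<\\{t}>" for t in reversed(tags))
        return f"{opening} {word} {closing}"
    if encoding == 5:  # one combined tag after the word
        return f"{word}<{'_'.join(NAMES[t] for t in tags)}>"
    raise ValueError(f"unsupported encoding: {encoding}")


example = [
    ("Louis", [WIFE, FATHER, FIRST_NAME]),
    ("Alexandre", [WIFE, FATHER, FIRST_NAME]),
    ("MOUDEL", [WIFE, FATHER, FAMILY_NAME]),
]
for enc in (1, 2, 3, 5):
    print(enc, " ".join(encode(word, tags, enc) for word, tags in example))
```

Keeping the tags as an ordered list from highest to lowest level makes each encoding a pure formatting choice over the same annotation.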
NB: In the labels file of encoding 5, the information is still encoded with emojis, but the chosen emojis carry no semantic meaning because of the number of information categories to be represented. The correspondence between the symbols of encoding 2 and encoding 5 can be found in the file encoding-2-to-encoding-5.json.
Table 4: Details of the hierarchical breakdown of named entities. Each tag is placed in the corresponding hierarchical level and associated with the emoji representing it.
Level | Tags |
1 | Administrative 📖 | Husband | Wife 👰 | Witness 🥸 |
2 | Father 👴 | Mother 👵 | Ex-husband 💔 |
3 | Birth 🏥 | Residence 🏠 |
4 | First name 💬 | Family name 🗨️ | Age ⌛ | Occupation 🔧 |
5 | Street number 🔟 | Street type 🛣 | Street name 🔠 | City 🌆 | Department 🗺 | Country 🗺 | Day 🌞 | Month 📅 | Year 🗓 | Hour ⏰ | Minute ⏱ |
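Table 4 can serve as a lookup table to turn a line written in encoding 2 back into readable labels. The sketch below covers only part of the table: Husband is omitted because its emoji does not appear in the table as shown here, and the level 5 row is left out for brevity. It assumes that words are separated by whitespace and that tags directly follow the word; the names `TABLE_4`, `split_word_and_tags` and `decode_encoding_2` are ours.

```python
# Partial copy of Table 4: emoji -> (level, tag name).
TABLE_4 = {
    "\U0001F4D6": (1, "administrative"),  # 📖
    "\U0001F470": (1, "wife"),  # 👰
    "\U0001F978": (1, "witness"),  # 🥸
    "\U0001F474": (2, "father"),  # 👴
    "\U0001F475": (2, "mother"),  # 👵
    "\U0001F494": (2, "ex-husband"),  # 💔
    "\U0001F3E5": (3, "birth"),  # 🏥
    "\U0001F3E0": (3, "residence"),  # 🏠
    "\U0001F4AC": (4, "first name"),  # 💬
    "\U0001F5E8": (4, "family name"),  # 🗨️
    "\u231B": (4, "age"),  # ⌛
    "\U0001F527": (4, "occupation"),  # 🔧
}
# Some emojis are followed by this invisible character; it carries no meaning.
VARIATION_SELECTOR = "\uFE0F"


def split_word_and_tags(token):
    """Split one token of encoding 2 into (word, [(level, name), ...])."""
    chars = [c for c in token if c != VARIATION_SELECTOR]
    cut = len(chars)
    # Walk back from the end while the characters are known tags.
    while cut > 0 and chars[cut - 1] in TABLE_4:
        cut -= 1
    return "".join(chars[:cut]), [TABLE_4[c] for c in chars[cut:]]


def decode_encoding_2(text):
    """Decode a whitespace-separated line written in encoding 2."""
    return [split_word_and_tags(token) for token in text.split()]


line = "Louis\U0001F470\U0001F474\U0001F4AC MOUDEL\U0001F470\U0001F474\U0001F5E8\uFE0F"
for word, tags in decode_encoding_2(line):
    print(word, tags)
```

Words without any trailing tag come back with an empty tag list, so the same function can be applied to untagged text.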
Citation Request
If you publish material based on this database, we request you to include a reference to the paper T. Constum, L. Preel, T. Paquet, P. Tranouez, S. Brée, End-to-end information extraction in handwritten documents: Understanding Paris marriage records from 1880 to 1940, International Conference on Document Analysis and Recognition (ICDAR), Athens, Greece, 2024.
Bibliography
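1: T. Constum, L. Preel, T. Paquet, P. Tranouez, S. Brée: End-to-end information extraction in handwritten documents: Understanding Paris marriage records from 1880 to 1940. International Conference on Document Analysis and Recognition (ICDAR), Athens, Greece (2024).
2: D. Coquenet, C. Chatelain, T. Paquet: DAN: a Segmentation-free Document Attention Network for Handwritten Document Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence pp. 1–17 (2023).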
Files
m-popp_datasets.zip (1.0 GB)
md5:4705fb825027ef28d56914895d808284
Additional details
Related works
- Is published in: conference paper, https://arxiv.org/abs/2404.19329