RIMES, complete

Grosicki, Emmanuèle; Carré, Matthieu; Geoffrois, Edouard; Augustin, Emmanuel; Preteux, Françoise; Messina, Ronaldo

doi:10.5281/zenodo.10812725

Published March 13, 2024 | Version v1

Dataset Open

1. DGA (at the time of the project)
2. INT/ARTEMIS (at the time of the project)
3. A2iA SA (at the time of the project)

Introduction

The RIMES-database (Reconnaissance et Indexation de données Manuscrites et de fac similÉS / Recognition and Indexing of handwritten documents and faxes) comprises handwritten correspondence letters, in French, “sent” by individuals to companies or administrations; all correspondence is fictitious and there is no PII in the records.

The database was collected by asking volunteers to write handwritten letters in exchange for gift vouchers. Volunteers were given a fictitious identity (same sex as the real one) and up to 5 scenarios. Each scenario was chosen from among 9 realistic topics: change of personal data (address, bank account), request for information, opening and closing (customer account), change of contract or order, complaint (e.g. poor quality of service), payment difficulties (request for delay, tax exemption, etc.), reminder, claim with other circumstances and a target (administrations or service providers such as telephone, electricity, bank, insurance companies). The volunteers wrote a letter with this information in their own words. The layout was free and the only request was to use white paper and to write legibly in black ink.

The Communications part is where the full document images and their annotations are stored. We give some detail of what the annotations comprise and briefly describe the other subsets; NB there was a subset comprising images from the cropped logos, which are not distributed here, due to some issues with the annotations.

Communications -- Images_Courriers.zip

There are in total 5605 communications, each containing from 2 to 3 pages:

One correspondence (mandatory)
One questionnaire (mandatory)
One fax (optional)

Filenames are constituted as follows:

Communication number [1, 5605]
Underscore “_”
One letter [F, L, Q]
- L for correspondence/Letter (Lettre, in French)
- Q for Questionnaire
- F for Fax
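
The naming scheme above can be decoded mechanically. The following is a minimal sketch, assuming the filename stem is exactly `<number>_<letter>`; the file extension is not stated in this description, so it is simply stripped. The function name `parse_communication_name`, the `CommunicationPage` type and the English page-kind labels are illustrative and not part of the dataset.

```python
import re
from pathlib import PurePath
from typing import NamedTuple

# Page kinds keyed by the single-letter code used in the filenames.
PAGE_KINDS = {"L": "letter", "Q": "questionnaire", "F": "fax"}

# Assumption: the stem is "<communication number>_<letter>" with nothing else.
_STEM_RE = re.compile(r"^(\d+)_([FLQ])$")


class CommunicationPage(NamedTuple):
    number: int  # communication number, 1..5605
    kind: str    # "letter", "questionnaire" or "fax"


def parse_communication_name(filename: str) -> CommunicationPage:
    """Split a page filename into its communication number and page kind."""
    stem = PurePath(filename).stem  # drops one extension, if any
    match = _STEM_RE.match(stem)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    number = int(match.group(1))
    if not 1 <= number <= 5605:
        raise ValueError(f"communication number out of range: {number}")
    return CommunicationPage(number, PAGE_KINDS[match.group(2)])
```

The same parser can be applied to the annotation files if they share the stem of the image they describe, which is an assumption this description does not confirm.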

The images and the corresponding annotations are split into 3 folders:

DVD1: images from 1 to 1799
DVD2: images from 1800 to 3699
DVD3: images from 3700 to 5605

There are in total 12610 images.
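
The folder split above is a simple range lookup on the communication number. A minimal sketch follows, assuming the folders are named exactly `DVD1`, `DVD2` and `DVD3` inside the archive; the helper name `dvd_folder` is hypothetical.

```python
# Upper bound (inclusive) of the communication numbers stored in each folder,
# as listed in the description.
_DVD_UPPER_BOUNDS = (("DVD1", 1799), ("DVD2", 3699), ("DVD3", 5605))


def dvd_folder(number: int) -> str:
    """Return the folder expected to hold a given communication number."""
    if number < 1:
        raise ValueError(f"communication number out of range: {number}")
    for folder, upper in _DVD_UPPER_BOUNDS:
        if number <= upper:
            return folder
    raise ValueError(f"communication number out of range: {number}")
```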

The annotation files are in xml format and support different tasks:

Document Structure Identification
Handwritten text recognition
Writer recognition
Information Extraction

Document Structure Identification

For the document structure, there are 8 "types" to be identified in the Letters for the different blocks of text in the image:

Sender address
Recipient address
Date/Place (each is annotated with its own tag)
Subject
Introduction (Ouverture in French)
Text body
Signature
PS / annex

Types 1 and 2 carry further detail: whether the sender or recipient is a person or a corporate entity. The address can also contain a telephone/fax number.

For example:

<type>Coordonnées Expéditeur</type> (Sender Address)

<text>Maxime Granier\n13 Grand rue\n57370 Dames et Quatre Vents</text>

</box>

In the case of Faxes, the types can be further complemented by the type of text:

Dactylographié, which stands for Printed;
Manuscrit, for Handwritten text.

For instance:

<type>Expéditeur_autre / Dactylographié</type>

</box>

<type>Expéditeur_personne / Manuscrit</type>

<text>Lucie FOURES</text>

</box>
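
The compound fax types shown above can be separated into the block type and the kind of text. This is a minimal sketch, assuming the two parts are always joined by " / " as in the examples; the function name `split_fax_type` and the English labels are illustrative.

```python
from typing import Optional, Tuple

# French suffixes used in fax annotations and their English meaning.
TEXT_MODES = {"Dactylographié": "printed", "Manuscrit": "handwritten"}


def split_fax_type(raw: str) -> Tuple[str, Optional[str]]:
    """Split '<type> / <text mode>' into (type, mode); mode is None if absent."""
    base, sep, suffix = raw.rpartition(" / ")
    suffix = suffix.strip()
    # Only treat the suffix as a text mode when it is one of the known values,
    # so that other labels containing " / " are left untouched.
    if sep and suffix in TEXT_MODES:
        return base.strip(), TEXT_MODES[suffix]
    return raw.strip(), None
```

For instance, `split_fax_type("Expéditeur_personne / Manuscrit")` yields the pair `("Expéditeur_personne", "handwritten")`, while a plain letter type such as "Coordonnées Expéditeur" is returned unchanged with `None` as the mode.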

Questionnaires have several other types, but no documentation about them has survived.

Handwritten text recognition

There are annotations for the paragraphs; line breaks are indicated with "\n". The transcriptions are verbatim and contain the same spelling and grammar errors that could be seen in the pages. When there could be more than one possible spelling (j’essaie/j’essaye, événement/évènement, ultrason/ultra son, and writing errors), the options are in the ground truth following a special construction:

¤{alternative_1/alternative_2}¤
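
When scoring a recognizer, it can help to expand this construction into every accepted reading of a transcription. The sketch below assumes that alternatives are separated by "/" and never contain "/" or "}" themselves; the function name `expand_alternatives` is hypothetical, and the code accepts any number of alternatives per group even though the description only shows two.

```python
import itertools
import re
from typing import List

# One capture group: the text between "¤{" and "}¤".
_ALT_RE = re.compile(r"¤\{(.*?)\}¤")


def expand_alternatives(text: str) -> List[str]:
    """Return every variant of a transcription with its alternatives resolved."""
    # re.split with a capture group alternates literal text (even indices)
    # and the captured alternative lists (odd indices).
    parts = _ALT_RE.split(text)
    choices = [
        part.split("/") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]
```

A prediction can then be counted as correct if it matches any element of the returned list. The number of variants grows multiplicatively with the number of groups, so for long paragraphs it may be preferable to compare segment by segment.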

Writer recognition

There is an identity for each writer in the database, so the usual tasks of identification and verification can be performed. The writer is identified in the usual French form, with the family name first, in all caps, followed by the given name, for instance:

<writer>GRANIER Maxime</writer>
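
The writer field can be split into family and given name using the capitalization convention described above. This is a heuristic sketch only: it assumes the leading all-caps tokens form the family name and everything after them is the given name, which may not hold for every record. The function name `split_writer` is hypothetical.

```python
from typing import Tuple


def split_writer(writer: str) -> Tuple[str, str]:
    """Split 'FAMILY Given' into (family, given) using the all-caps convention."""
    tokens = writer.split()
    family_len = 0
    for token in tokens:
        # A family-name token contains letters and has no lowercase characters.
        if token == token.upper() and any(ch.isalpha() for ch in token):
            family_len += 1
        else:
            break
    family = " ".join(tokens[:family_len])
    given = " ".join(tokens[family_len:])
    return family, given
```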

Information Extraction

Nine scenarios are annotated for the different types of communication. We provide some free translations into English.

Scenario	Free translation
Changement de données personnelles	Change of personal data
Demande d'information	Request for information
Difficulté de paiement	Payment Difficulties
Fermeture de compte	Account Closure
Gestion de sinistre	Claims Handling
Modification de contrat / Commande	Contract Changes / Order
Ouverture de compte	Account Opening
Réclamation	Complaint
Relance de courrier sans réponse	Correspondence Reminder

Paragraphs -- images_blocs_de_texte.zip

The main body of the letters was cropped from the full page images and stored as grayscale JPEG images and split into 3 folders:

DVD1: images from 1 to 1799 (1796 images)
DVD2: images from 1800 to 3699 (1899 images)
DVD3: images from 3700 to 5605 (1905 images)

The transcriptions can be obtained from the corresponding Communications transcriptions; the numbers in the filenames correspond to the communications.

Cursive words -- imagettes_mots_cursif.zip

The paragraphs were split into lines and each line was further split into words.

The images were split into 57 blocks (lot in French) organized in folders named:

lot_N_rimes_version_definitive, where N is the block number [1, 57]

Each folder has data from 100 letters, further organized into sub-folders following the convention:

<Communication number>_L

Then each sub-folder has one image per word, with naming:

<Communication number>_L_<Line Number>_<Word position>.tiff

Where Line Number and Word position start from 0. The transcription should be inferred from the corresponding Communications transcriptions.
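
Because line and word indices are plain integers without padding in the pattern above, a lexicographic sort of the filenames would place word 10 before word 2. The sketch below parses the names and sorts them in reading order; it assumes the filenames follow the stated pattern exactly, and the names `WordImage`, `parse_word_image_name` and `reading_order` are illustrative.

```python
import re
from typing import Iterable, List, NamedTuple

_WORD_RE = re.compile(r"^(\d+)_L_(\d+)_(\d+)\.tiff$")


class WordImage(NamedTuple):
    communication: int  # communication number
    line: int           # 0-based line number
    word: int           # 0-based word position within the line


def parse_word_image_name(filename: str) -> WordImage:
    """Decode '<communication>_L_<line>_<word>.tiff'."""
    match = _WORD_RE.match(filename)
    if match is None:
        raise ValueError(f"not a word image name: {filename!r}")
    return WordImage(*(int(group) for group in match.groups()))


def reading_order(filenames: Iterable[str]) -> List[str]:
    """Sort word image names by communication, then line, then word position."""
    return sorted(filenames, key=parse_word_image_name)
```

Once the images of a letter are in reading order, they can be aligned with the tokens of the corresponding paragraph transcription, subject to how the original segmentation treated punctuation, which this description does not specify.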

Character snippets -- imagettes_caracteres.zip

Words were split into characters (A to Z) and digits (0 to 9), totalling 95269 images. They are distributed in 3 folders:

characters_rimes_DVD1
characters_rimes_DVD2
characters_rimes_DVD3

Each is further divided into 4 blocks (lot in French):

lot_1
lot_2
lot_3
lot_4

The image files are named as follows:

<Class>_<Correspondence number>_<character position in the image>.png

Where Class is in [A-Z0-9].
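
Since the class label is encoded in the filename, a label distribution can be computed without opening any image. This is a minimal sketch, assuming the position field is a non-negative integer and the names follow the pattern above exactly; `CharImage`, `parse_char_image_name` and `label_histogram` are hypothetical names.

```python
import re
from collections import Counter
from typing import Iterable, NamedTuple

_CHAR_RE = re.compile(r"^([A-Z0-9])_(\d+)_(\d+)\.png$")


class CharImage(NamedTuple):
    label: str          # one of A-Z or 0-9
    communication: int  # number of the letter the character came from
    position: int       # character position, as given in the filename


def parse_char_image_name(filename: str) -> CharImage:
    """Decode '<Class>_<number>_<position>.png'."""
    match = _CHAR_RE.match(filename)
    if match is None:
        raise ValueError(f"not a character image name: {filename!r}")
    label, communication, position = match.groups()
    return CharImage(label, int(communication), int(position))


def label_histogram(filenames: Iterable[str]) -> Counter:
    """Count how many snippets exist per class label."""
    return Counter(parse_char_image_name(name).label for name in filenames)
```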

Acknowledgments

This dataset was originally collected and prepared in 2007 by the following partners: DGA/CTA/DT/GIP - CEP Arcueil; TSP – ARTEMIS Télécom SudParis; and A2iA SA, as part of the Techno-Vision project. This project was funded by the French ministries for Research and Defense (Ministère de la Recherche and Ministère de la Défense).

After the acquisition of A2iA SA in September 2018, Mitek Systems, Inc. became a legal owner of the dataset, and decided to release it publicly – which was one of the objectives of the project after its conclusion – under a permissive license in 2024, to encourage open science.

Files

Files (14.8 GB)

Name	MD5	Size
images_blocs_de_texte.zip	f94e09cbadad5556d973c8e7dac54874	730.7 MB
Images_Courriers.zip	f79d0acd788b97cfb6689d5750042ec2	12.3 GB
imagettes_caracteres.zip	1027c1c0e422175fa0c6da1e3af69efb	146.9 MB
imagettes_mots_cursif.zip	95f9298c43cd9df893226783fd3377e8	1.6 GB
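
Given the size of the archives, it is worth checking each download against the MD5 checksums listed above. The sketch below streams a file in chunks so that the 12.3 GB archive does not need to fit in memory; the function names `md5_of` and `verify_archive` are illustrative, and the archives are assumed to keep their original filenames.

```python
import hashlib
from pathlib import Path

# Checksums as published with the files listed above.
EXPECTED_MD5 = {
    "images_blocs_de_texte.zip": "f94e09cbadad5556d973c8e7dac54874",
    "Images_Courriers.zip": "f79d0acd788b97cfb6689d5750042ec2",
    "imagettes_caracteres.zip": "1027c1c0e422175fa0c6da1e3af69efb",
    "imagettes_mots_cursif.zip": "95f9298c43cd9df893226783fd3377e8",
}


def md5_of(path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path) -> bool:
    """Return True if the file's MD5 matches the published checksum."""
    expected = EXPECTED_MD5[Path(path).name]  # KeyError for unknown archives
    return md5_of(path) == expected
```

A typical use is to call `verify_archive("Images_Courriers.zip")` after the download completes and to download the file again if it returns `False`.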

Additional details

Available: 2024-03-13
