Dataset Open Access

A Dataset of French Trade Directories from the 19th Century (FTD)

Nathalie Abadie; Stéphane Baciocchi; Edwin Carlinet; Joseph Chazalon; Pascal Cristofoli; Bertrand Duménieu; Julien Perret

This dataset is composed of pages and entries extracted from French directories published between 1798 and 1861.

The purpose of this dataset is to evaluate the performance of Optical Character Recognition (OCR) and Named Entity Recognition (NER) on 19th century French documents.


This dataset is divided into two parts:

  1. A labeled dataset, which contains 8765 manually corrected entries from 78 pages (18 different directories), and which is designed for supervised training.
  2. An unlabeled dataset, containing 1058196 raw entries from 6887 pages (13 different directories), and which is designed for self-supervised pre-training.

For the labeled dataset, we provide:

  • Original pages and cropped images
  • Human-corrected positions, transcriptions and entity tagging for each entry
  • OCR predictions from three systems (Tesseract v4, PERO OCR v2020 and Kraken)
  • Projected NER reference from clean text to OCR predictions, making it suitable to evaluate the performance of NER systems on real, noisy OCR predictions
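
To illustrate what projecting an NER reference onto OCR output involves, here is a minimal sketch based on character-level alignment with Python's standard difflib module. It is not necessarily the method used to build this dataset. The function name project_spans, the span format (start, end, label), the entity labels and the sample entry are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def project_spans(clean, noisy, spans):
    """Map (start, end, label) character spans from `clean` onto `noisy`.

    Hypothetical helper: characters are aligned with difflib, then each
    span is stretched over the aligned positions found in the noisy text.
    """
    matcher = SequenceMatcher(None, clean, noisy, autojunk=False)

    # char_map[i] is the index in `noisy` aligned with clean[i], or None
    # when that character has no exact counterpart in the OCR output.
    char_map = [None] * len(clean)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            char_map[block.a + offset] = block.b + offset

    projected = []
    for start, end, label in spans:
        hits = [char_map[i] for i in range(start, end) if char_map[i] is not None]
        if hits:  # spans with no aligned character are dropped
            projected.append((min(hits), max(hits) + 1, label))
    return projected


if __name__ == "__main__":
    # Invented entry, not taken from the dataset; labels are placeholders.
    clean = "Dupont, boulanger, rue Saint-Denis, 12"
    noisy = "Dupont, boulanqer, rue Saint-Denis, 12"
    reference = [(0, 6, "PER"), (8, 17, "ACT")]
    for start, end, label in project_spans(clean, noisy, reference):
        print(label, repr(noisy[start:end]))
```

A span survives as long as at least one of its characters is aligned, so entities that the OCR damaged only partially keep their label, while entities that were lost entirely are dropped from the projected reference.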

For the unlabeled dataset, we provide:

  • Automatically detected positions for each entry (these are very noisy)
  • OCR predictions for each entry (PERO OCR engine)

 

How to cite this dataset
Please cite this dataset as:

N. Abadie, S. Baciocchi, E. Carlinet, J. Chazalon, P. Cristofoli, B. Duménieu and J. Perret, A Dataset of French Trade Directories from the 19th Century (FTD), version 1.0.0, May 2022, online at https://doi.org/10.5281/zenodo.6394464.

@dataset{abadie_dataset_22,
author = {Abadie, Nathalie and
Baciocchi, St{\'e}phane and
Carlinet, Edwin and
Chazalon, Joseph and
Cristofoli, Pascal and
Dum{\'e}nieu, Bertrand and
Perret, Julien},
title = {{A} {D}ataset of {F}rench {T}rade {D}irectories from the 19th {C}entury ({FTD})},
month = mar,
year = 2022,
publisher = {Zenodo},
version = {v1.0.0},
doi = {10.5281/zenodo.6394464},
url = {https://doi.org/10.5281/zenodo.6394464}
}


You may also be interested in our paper presented at DAS 2022 (15th IAPR International Workshop on Document Analysis Systems), which compares the performance of OCR and NER systems on this dataset:

N. Abadie, E. Carlinet, J. Chazalon and B. Duménieu, A Benchmark of Named Entity Recognition Approaches in Historical Documents — Application to 19th Century French Directories, May 2022, La Rochelle, France, Springer.

@inproceedings{abadie_das_22,
author = {Abadie, Nathalie and
Carlinet, Edwin and
Chazalon, Joseph and
Dum{\'e}nieu, Bertrand},
title = {{A} {B}enchmark of {N}amed {E}ntity {R}ecognition {A}pproaches in {H}istorical {D}ocuments — {A}pplication to 19th {C}entury {F}rench {D}irectories},
month = may,
year = 2022,
publisher = {Springer},
address = {La Rochelle, France}
}


Copyright and License
The images were extracted from the original source https://gallica.bnf.fr, owned by the Bibliothèque nationale de France (French national library).
Original contents from the Bibliothèque nationale de France can be reused non-commercially, provided the mention "Source gallica.bnf.fr / Bibliothèque nationale de France" is kept.  
Researchers do not have to pay any fee for reusing the original contents in research publications or academic works.  
The original copyright notice was retrieved from https://gallica.bnf.fr/edit/und/conditions-dutilisation-des-contenus-de-gallica on March 29, 2022.

The original contents were significantly transformed before being included in this dataset.
All derived content is licensed under the permissive Creative Commons Attribution 4.0 International license.

Files (152.6 MB)

  • french_trade_directories_19th_century_v1.0.0.zip (152.6 MB), md5:38889a89a4a84e3de60c15d3704a8c5c
  • README.md (29.8 kB), md5:130d946f45cb99db0cd3a10d04b8e4b6
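
The MD5 checksums listed above can be used to check that a download is intact. The following is a minimal sketch using only Python's standard library. It assumes the two files were saved under their original names in a single directory; the helper names md5_of and verify are made up for this example.

```python
import hashlib
from pathlib import Path

# Checksums as listed for the files of this record.
EXPECTED_MD5 = {
    "french_trade_directories_19th_century_v1.0.0.zip": "38889a89a4a84e3de60c15d3704a8c5c",
    "README.md": "130d946f45cb99db0cd3a10d04b8e4b6",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Return {filename: True, False or None}; None means the file is absent."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = md5_of(path) == expected if path.exists() else None
    return results


if __name__ == "__main__":
    for name, status in verify().items():
        label = {True: "OK", False: "MISMATCH", None: "not found"}[status]
        print(f"{name}: {label}")
```

Run as a script from the download directory, this prints one status line per file. A mismatch usually means the download was truncated or corrupted and should be repeated.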