Published April 25, 2023 | Version v2
Dataset Open

A Dataset of French Trade Directories from the 19th Century for Nested NER task

  • 1. LASTIG, Univ. Gustave Eiffel, IGN-ENSG, Saint-Mandé, France
  • 2. EPITA Research Laboratory (LRE), Le Kremlin-Bicêtre, France
  • 3. CRH-EHESS, Paris, France

Description

This dataset is composed of pages and entries extracted from French directories published between 1798 and 1861.

The purpose of this dataset is to evaluate the performance of Nested Named Entity Recognition approaches on 19th century French documents, on both clean texts and noisy texts produced by the OCR engine.

Source dataset

This dataset has been built from the following source dataset:

N. Abadie, S. Baciocchi, E. Carlinet, J. Chazalon, P. Cristofoli, B. Duménieu and J. Perret, A Dataset of French Trade Directories from the 19th Century (FTD), version 1.0.0, May 2022, online at https://doi.org/10.5281/zenodo.6394464.

Our experiments // Paper

Details about our experiments on nested NER approaches are given in our paper (the pre-print version is available here).

Tual, S., Abadie, N., Chazalon, J., Duménieu, B., & Carlinet, E. (2023). A Benchmark of Nested NER Approaches in Historical Structured Documents. Proceedings of the 17th International Conference on Document Analysis and Recognition, San José, California, USA. 2023. Springer. https://hal.science/hal-03994759v2

Our code is available on GitHub.

Dataset overview

The following list describes the keys of the .JSON file, which contains the complete materials of our experiments.

- id : Entry unique ID in a given page

- box : Bounding box of the entry in the scanned directory page

- book : Source directory of the entry (*see more information below*)

- page : Page ID in a given directory

- valid_box : Is the bounding box of the entry valid? (*all bounding boxes are valid here*)

- text_ocr_ref : OCR extracted and manually corrected text of the entry

- nested_ner_xml_ref : text_ocr_ref annotated with nested NER entities

- text_ocr_pero : OCR extracted text of the entry with the PERO-OCR engine (the best engine according to the experiments of Abadie et al.)

- has_valid_ner_xml_pero : Is the mapping of the nested NER entities, annotated by hand on the reference text, onto the PERO-OCR text correct? (in our experiments, we only use entries with a True value)

- nested_ner_xml_pero : Annotated noisy entries produced with PERO OCR

- text_ocr_tess : OCR extracted text of the entry with the Tesseract engine (*not used in our experiments*)

- nested_ner_xml_tess : Annotated noisy entries produced with Tesseract (not used in our experiments)

- has_valid_ner_xml_tess : Is the mapping of the nested NER entities, annotated by hand on the reference text, onto the Tesseract text correct? (not used in our experiments)
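As an illustration of how these keys might be used, here is a minimal Python sketch that loads the file and keeps only the entries whose PERO-OCR annotations are flagged as valid, as described for has_valid_ner_xml_pero. The top-level layout of the JSON (a list of entries, or a dict whose values are entries) and the encoding of the flag as a JSON boolean are assumptions, not something stated in this description. The function names are illustrative.

```python
import json


def load_entries(path):
    """Load the dataset file and return its entries as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Assumption: the top level is either a list of entries
    # or a dict whose values are the entries.
    return list(data.values()) if isinstance(data, dict) else list(data)


def valid_pero_entries(entries):
    """Keep entries whose PERO-OCR entity mapping is flagged as valid."""
    # Assumption: the flag is stored as a JSON boolean (true/false).
    return [e for e in entries if e.get("has_valid_ner_xml_pero") is True]
```

Calling `valid_pero_entries(load_entries("dataset_SODUCO_nested_ner.json"))` would then give the subset that pairs text_ocr_pero with a usable nested_ner_xml_pero annotation.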

Nested entities are annotated using XML tags. Our entities follow a two-level *Part Of* hierarchy: bottom-level entities are contained in a top-level entity.
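The two-level structure can be read back with a standard XML parser. The sketch below is only an example: the sample string and its tag names (PER, SPAT, LOC, CARDINAL) are invented for illustration and are not taken from the dataset, and it assumes the annotated strings are well-formed XML fragments once wrapped in a synthetic root element.

```python
import xml.etree.ElementTree as ET


def nested_entities(annotated_text):
    """Return one (label, text, children) tuple per top-level entity,
    where children is a list of (label, text) for bottom-level entities."""
    # An entry holds several sibling entities, so wrap it in a synthetic root.
    root = ET.fromstring(f"<entry>{annotated_text}</entry>")
    result = []
    for top in root:
        children = [(child.tag, "".join(child.itertext())) for child in top]
        result.append((top.tag, "".join(top.itertext()), children))
    return result


# Hypothetical annotated entry (not an actual record from the dataset).
sample = "<PER>Dupont</PER>, <SPAT><LOC>rue de la Paix</LOC>, <CARDINAL>12</CARDINAL></SPAT>."
for label, text, children in nested_entities(sample):
    print(label, repr(text), children)
```

Text outside any tag (here the commas and the final period) is ignored, and a top-level entity without nested entities yields an empty children list.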

 

Source documents // Copyright and licence

This section has been copied from the original dataset description.

The images were extracted from the original source https://gallica.bnf.fr, owned by the *Bibliothèque nationale de France* (French national library).

Original contents from the Bibliothèque nationale de France can be reused non-commercially, provided the mention "Source gallica.bnf.fr / Bibliothèque nationale de France" is kept.  

=> Researchers do not have to pay any fee for reusing the original contents in research publications or academic works.

Original copyright mentions extracted from https://gallica.bnf.fr/edit/und/conditions-dutilisation-des-contenus-de-gallica on March 29, 2022.

The original contents were significantly transformed before being included in this dataset.

All derived content is licensed under the permissive *Creative Commons Attribution 4.0 International* license.

Links to original contents are given in the window below:

Files

dataset_SODUCO_nested_ner.json

Files (8.2 MB)

md5:cf364455627f6be744317e7a5273eccf 8.1 MB
md5:23717e46456999331cf44fc49db0504a 16.8 kB
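A downloaded file can be checked against the MD5 values listed above. This is a minimal sketch using Python's standard hashlib module; the function names are illustrative. The listing does not say which file each checksum belongs to, so pairing the 8.1 MB checksum with dataset_SODUCO_nested_ner.json is an assumption.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum("dataset_SODUCO_nested_ner.json", "md5:cf364455627f6be744317e7a5273eccf")` should return True if that checksum is indeed the one for the JSON file and the download is intact.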

Additional details

Funding

SoDUCo – Social Dynamics in Urban Context: open tools, models, and data -- Paris and its suburbs, 1789-1950 ANR-18-CE38-0013
Agence Nationale de la Recherche