Published March 20, 2024 | Version 1.0.7
Dataset Open

GeoEDdA: A Gold Standard Dataset for Named Entity Recognition and Span Categorization Annotations of Diderot & d'Alembert's Encyclopédie

  • 1. Laboratoire d'Informatique en Images et Systèmes d'Information
  • 2. Institut National des Sciences Appliquées de Lyon
  • 3. Interactions, Corpus, Apprentissages, Représentations
  • 4. Université Lumière Lyon 2
  • 5. Lancaster University
  • 6. The Alan Turing Institute

Description

This repository contains a gold standard dataset for named entity recognition and span categorization annotations from Diderot & d’Alembert’s Encyclopédie entries.

The dataset is available in the following formats:

  • JSONL format provided by Prodigy
  • binary spaCy format (ready to use with the spaCy train pipeline)
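
As a rough illustration of the JSONL variant, the sketch below reads span annotations from Prodigy-style records. It assumes each line is a JSON object with a `text` field and a `spans` list whose items carry `start`, `end` (character offsets) and `label`, which is Prodigy's usual span layout; the files in this repository may contain additional fields. The sample record is invented for illustration and is not taken from the dataset.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_spans(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (label, surface text) pairs from Prodigy-style JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        text = record["text"]
        # Records without annotations are assumed to have no "spans" key.
        for span in record.get("spans", []):
            yield span["label"], text[span["start"]:span["end"]]


# Invented sample record (assumed field names, not real dataset content).
sample = [
    json.dumps(
        {
            "text": "ville de France",
            "spans": [
                {"start": 0, "end": 5, "label": "NC-Spatial"},
                {"start": 9, "end": 15, "label": "NP-Spatial"},
                {"start": 0, "end": 15, "label": "ENE-Spatial"},
            ],
        }
    )
]

spans = list(iter_spans(sample))
```

With a real file, the same function can be fed an open file handle, e.g. `iter_spans(open(path, encoding="utf-8"))`, where `path` points to one of the extracted JSONL files.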

The gold standard dataset consists of 2,200 paragraphs drawn from 2,001 randomly selected Encyclopédie entries. All paragraphs are written in 18th-century French.

The spans/entities were labeled by the project team, with pre-labelling by early machine learning models used to speed up the process. The data is divided into train, validation, and test sets. The validation and test sets contain 200 paragraphs each: 100 classified under 'Géographie' and 100 from other knowledge domains. The datasets have the following breakdown of tokens and spans/entities.

Tagset

  • NC-Spatial: a common noun that identifies a spatial entity (nominal spatial entity), including natural features, e.g. ville, la rivière, royaume.
  • NP-Spatial: a proper noun identifying the name of a place (spatial named entities), e.g. France, Paris, la Chine.
  • ENE-Spatial: nested spatial entity, e.g. ville de France, royaume de Naples, la mer Baltique.
  • Relation: spatial relation, e.g. dans, sur, à 10 lieues de.
  • Latlong: geographic coordinates, e.g. Long. 19. 49. lat. 43. 55. 44.
  • NC-Person: a common noun that identifies a person (nominal person entity), e.g. roi, l'empereur, les auteurs.
  • NP-Person: a proper noun identifying the name of a person (person named entities), e.g. Louis XIV, Pline.
  • ENE-Person: nested people entity, e.g. le czar Pierre, roi de Macédoine.
  • NP-Misc: a proper noun identifying entities not classified as spatial or person, e.g. l'Eglise, 1702, Pélasgique.
  • ENE-Misc: nested named entity not classified as spatial or person, e.g. l'ordre de S. Jacques, la déclaration du 21 Mars 1671.
  • Head: entry name
  • Domain-Mark: words indicating the knowledge domain (usually after the head and in parentheses), e.g. Géographie, Geog., en Anatomie.
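
When processing the annotations, it can help to check span labels against the tagset above before counting them. The following is a minimal sketch: the label strings are copied from the list, while the helper name `count_labels` and the input sequence are hypothetical and do not reflect real dataset statistics.

```python
from collections import Counter

# Labels copied from the tagset above.
TAGSET = {
    "NC-Spatial", "NP-Spatial", "ENE-Spatial", "Relation", "Latlong",
    "NC-Person", "NP-Person", "ENE-Person", "NP-Misc", "ENE-Misc",
    "Head", "Domain-Mark",
}


def count_labels(labels):
    """Count labels, raising ValueError on any label outside the tagset."""
    labels = list(labels)
    unknown = sorted(set(labels) - TAGSET)
    if unknown:
        raise ValueError(f"labels not in tagset: {unknown}")
    return Counter(labels)


# Illustrative label sequence, not real dataset statistics.
counts = count_labels(["Head", "Domain-Mark", "NP-Spatial", "Relation", "NP-Spatial"])
```

Failing early on an unexpected label makes it easier to spot a mismatch between the files and the documented tagset.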

HuggingFace

The GeoEDdA dataset is available on the HuggingFace Hub: https://huggingface.co/datasets/GEODE/GeoEDdA
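
A possible way to fetch the dataset programmatically is sketched below, assuming the Hugging Face `datasets` library is installed and that the repository loads with its default configuration. The helper name `load_geoedda` is hypothetical; the split and column names are not stated here, so the sketch does not rely on them.

```python
REPO_ID = "GEODE/GeoEDdA"  # repository name taken from the URL above


def load_geoedda(repo_id: str = REPO_ID):
    """Download the dataset from the Hub (requires network access)."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the repository works with the default configuration.
    return load_dataset(repo_id)
```

Calling `load_geoedda()` and printing the result shows which splits and columns are actually published.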

spaCy Custom Spancat trained on Diderot & d’Alembert’s Encyclopédie entries

This dataset was used to train and evaluate a custom spancat model for French using spaCy. The model is available on HuggingFace's model hub: https://huggingface.co/GEODE/fr_spacy_custom_spancat_edda.
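
The sketch below shows how predicted spans might be read from such a model. It assumes the model has been installed as a Python package named `fr_spacy_custom_spancat_edda` and that it stores its predictions under spaCy's default span key `sc`; both are assumptions that should be checked against the model card. The helper names are hypothetical.

```python
MODEL_NAME = "fr_spacy_custom_spancat_edda"  # assumed installed package name
SPANS_KEY = "sc"  # spaCy's default spancat key; the model may use another


def load_model(name=MODEL_NAME):
    """Load the installed spaCy pipeline (requires spaCy and the model package)."""
    import spacy  # imported lazily so the helper below works without spaCy

    return spacy.load(name)


def categorize_spans(nlp, text, spans_key=SPANS_KEY):
    """Return (label, text, start_char, end_char) for each predicted span."""
    doc = nlp(text)
    # An absent key is treated as "no spans" rather than an error.
    group = doc.spans[spans_key] if spans_key in doc.spans else []
    return [(s.label_, s.text, s.start_char, s.end_char) for s in group]
```

A typical call would be `categorize_spans(load_model(), paragraph)`, where `paragraph` is the text of an Encyclopédie entry.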

Acknowledgement

The authors are grateful to the ASLAN project (ANR-10-LABX-0081) of the Université de Lyon for its financial support within the French program "Investments for the Future", operated by the National Research Agency (ANR). Data courtesy of the ARTFL Encyclopédie Project, University of Chicago.

Files

spancat-dataset_edda_jsonl.zip

Files (3.8 MB)

Checksum Size
md5:75e7563162dc60b541aed7e1d9fc850e 2.5 MB
md5:591d9a178355c621f67068c8936af0a2 1.3 MB
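
The MD5 checksums listed above can be used to verify a download. The following is a minimal sketch using Python's standard `hashlib`; the helper names are hypothetical, and the pairing of each checksum with a specific file name is not stated here, so the expected value must be chosen by the user.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum(downloaded_path, "md5:75e7563162dc60b541aed7e1d9fc850e")` returns `True` only if the file at `downloaded_path` is byte-identical to the published archive with that checksum.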

Additional details