Published March 24, 2023 | Version v1
Dataset Open

French Entity-Linking dataset between annotated tweets collected during major crises in France and French Wikipedia corpus

Creators

  • 1. BRGM National Geological Survey: Bureau de Recherches Geologiques et Minieres
  • 2. LASTIG, Univ Gustave Eiffel, IGN-ENSG, F-94160 Saint-Mandé, France
  • 3. LASTIG, Univ Gustave Eiffel, IGN-ENSG, F-77420 Champs-sur-Marne, France
  • 4. IMT Mines Albi, Institut Mines-Télécom, Université de Toulouse
  • 5. University of Orléans

Description

Most of the available datasets are not well suited to our target application: geolocating natural disasters from social network posts. First, social media posts are largely underrepresented in these datasets, and the only Twitter dataset lacks Entity-Linking annotations. Second, none of the datasets focuses on a crisis or natural disaster event.

To mitigate these issues, we extracted a collection of French tweets written during earthquakes and major floods that have occurred in France in recent years. We set up Label-Studio in order to annotate these tweets. A total of 4617 tweets were annotated, including 1678 tweets posted during earthquakes and 2939 during floods. For each annotated tweet, mentions were annotated using the set of labels described earlier in the paper as well as, when possible, the target Wikipedia title.

Named “RéSoCIO” in reference to the research project in which it was carried out, the dataset resulting from this work contains a total of 12,828 annotated mentions and 1,513 distinct Wikipedia entities. 85% of mentions were linked to a Wikipedia page; this rises to 94% if we ignore the RISKNAT and DAMAGES labels, which are often difficult to map to an existing entity.

Labels #Mentions #Linked #Entities
PERSON 315 263 136
ORG 863 790 281
GEOLOC 4375 4234 701
TRANSPORT 250 203 101
EVENT 35 21 16
FACILITY 129 94 49
RISKNAT 5502 4994 128
DAMAGES 1136 121 56
OTHER 223 200 46
Total 12828 10920 1513

Overview of the mentions annotated in the Twitter dataset. #Mentions shows the total number of mentions per label, #Linked the number of mentions linked to an entity and #Entities the number of distinct entities per label present in the dataset.
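The linking rates quoted in the description can be recomputed from the per-label rows of the table above. The following is a minimal sketch, not part of the dataset itself: the `COUNTS` dictionary and the `link_rate` helper are illustrative names, and the figures are copied by hand from the #Mentions and #Linked columns.

```python
# Per-label counts copied from the Twitter dataset table: (mentions, linked).
COUNTS = {
    "PERSON": (315, 263),
    "ORG": (863, 790),
    "GEOLOC": (4375, 4234),
    "TRANSPORT": (250, 203),
    "EVENT": (35, 21),
    "FACILITY": (129, 94),
    "RISKNAT": (5502, 4994),
    "DAMAGES": (1136, 121),
    "OTHER": (223, 200),
}


def link_rate(counts, exclude=()):
    """Return the share of mentions linked to an entity, skipping excluded labels."""
    kept = [pair for label, pair in counts.items() if label not in exclude]
    mentions = sum(m for m, _ in kept)
    linked = sum(l for _, l in kept)
    return linked / mentions


overall = link_rate(COUNTS)
without_hard_labels = link_rate(COUNTS, exclude={"RISKNAT", "DAMAGES"})

print(f"All labels: {overall:.1%}")
print(f"Without RISKNAT and DAMAGES: {without_hard_labels:.1%}")
```

Rounded to whole percentages, the two values agree with the 85% and 94% figures given in the description.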

Labels #Mentions #Linked #Entities
PERSON 1100102 1098406 557697
ORG 750925 749504 130394
GEOLOC 2729702 2728296 215924
TRANSPORT 161539 160487 53405
EVENT 798433 798251 86471
FACILITY 258835 258513 109867
RISKNAT 5502 4994 127
DAMAGES 1136 121 56
OTHER 4340621 4339658 682458
Total 10146795 10138230 1836399

Overview of the mentions annotated in the full dataset. #Mentions shows the total number of mentions per label, #Linked the number of mentions linked to an entity and #Entities the number of distinct entities per label present in the dataset.

Files

Readme.txt

Files (839.1 kB)

Name Size
md5:7384c10daedce56d4f3d68126d16aad1 203 Bytes
md5:964347f93f4e953ae80e9e7450f55b4c 838.8 kB
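The md5 values listed for the files can be used to check a download for corruption. The sketch below assumes nothing about the archive layout: `md5_of_file` and `matches_checksum` are hypothetical helper names, and the path passed to them is whatever local file was downloaded from this record.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum(local_path, "md5:964347f93f4e953ae80e9e7450f55b4c")` should return `True` for an intact copy of the 838.8 kB file, where `local_path` is a placeholder for the location of the downloaded file.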

Additional details

Funding

ReSoCIO – Social Networks for Natural Disaster: Operational Interpretation ANR-20-CE39-0014
Agence Nationale de la Recherche

References

  • Caillaut, G., Gracianne, C., Abadie, N., Touya, G., & Auclair, S. (2022, May). Automated construction of a French Entity Linking dataset to geolocate social network posts in the context of natural disasters. In ISCRAM. https://hal.science/hal-03631387/
  • Caillaut, G., Gracianne, C., Auclair, S., Abadie, N., & Touya, G. (2022, June). Annotation sémantique pour la géolocalisation d'entités spatiales dans des tweets. In PFIA Résilience et IA. https://hal.science/hal-03682484/