French Entity-Linking dataset between annotated tweets collected during major crises in France and French Wikipedia corpus
Contributors
Project manager:
Researchers:
- 1. BRGM National Geological Survey: Bureau de Recherches Géologiques et Minières
- 2. LASTIG, Univ Gustave Eiffel, IGN-ENSG, F-94160 Saint-Mandé, France
- 3. LASTIG, Univ Gustave Eiffel, IGN-ENSG, F-77420 Champs-sur-Marne, France
- 4. IMT Mines Albi, Institut Mines-Télécom, Université de Toulouse
- 5. University of Orléans
Description
Most available Entity-Linking datasets are poorly suited to our target application: geolocating natural disasters from social network posts. First, social media posts are largely underrepresented in these datasets, and the only Twitter dataset lacks Entity-Linking annotations. Second, none of the datasets focuses on a crisis or natural disaster event.
To mitigate these issues, we collected French tweets written during earthquakes and major floods that occurred in France in recent years, and annotated them with Label-Studio. A total of 4617 tweets were annotated: 1678 posted during earthquakes and 2939 during floods. In each tweet, mentions were annotated with the set of labels described earlier in the paper and, when possible, with the title of the target Wikipedia page.
Named “RéSoCIO” after the research project in which it was produced, the resulting dataset contains 12,828 annotated mentions and 1,513 distinct Wikipedia entities. 85% of mentions are linked to a Wikipedia page, rising to 94% if we ignore the RISKNAT and DAMAGES labels, which are often difficult to map to an existing entity.
Labels | #Mentions | #Linked | #Entities |
---|---|---|---|
PERSON | 315 | 263 | 136 |
ORG | 863 | 790 | 281 |
GEOLOC | 4375 | 4234 | 701 |
TRANSPORT | 250 | 203 | 101 |
EVENT | 35 | 21 | 16 |
FACILITY | 129 | 94 | 49 |
RISKNAT | 5502 | 4994 | 127 |
DAMAGES | 1136 | 121 | 56 |
OTHER | 223 | 200 | 46 |
Total | 12828 | 10920 | 1513 |
Overview of the mentions annotated in the Twitter dataset. #Mentions shows the total number of mentions per label, #Linked the number of mentions linked to an entity and #Entities the number of distinct entities per label present in the dataset.
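The linking rates quoted in the description (85% overall, 94% without RISKNAT and DAMAGES) can be rechecked from the per-label counts in the Twitter table. The following is a minimal sketch: the `COUNTS` dictionary and the `linked_ratio` helper are illustrative names that are not part of the dataset, and the numbers are copied by hand from the #Mentions and #Linked columns.

```python
# Per-label (mentions, linked) counts copied from the Twitter dataset table.
COUNTS = {
    "PERSON": (315, 263),
    "ORG": (863, 790),
    "GEOLOC": (4375, 4234),
    "TRANSPORT": (250, 203),
    "EVENT": (35, 21),
    "FACILITY": (129, 94),
    "RISKNAT": (5502, 4994),
    "DAMAGES": (1136, 121),
    "OTHER": (223, 200),
}


def linked_ratio(counts, exclude=()):
    """Share of mentions linked to an entity, skipping the excluded labels."""
    kept = [pair for label, pair in counts.items() if label not in exclude]
    mentions = sum(m for m, _ in kept)
    linked = sum(l for _, l in kept)
    return linked / mentions


overall = linked_ratio(COUNTS)
without_hard_labels = linked_ratio(COUNTS, exclude=("RISKNAT", "DAMAGES"))

print(f"{overall:.0%} of all mentions are linked")                      # 85%
print(f"{without_hard_labels:.0%} without RISKNAT and DAMAGES")         # 94%
```

Summing the columns this way gives 10920 linked mentions out of 12828, and 5805 out of 6190 once the two hard-to-link labels are set aside.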
Labels | #Mentions | #Linked | #Entities |
---|---|---|---|
PERSON | 1100102 | 1098406 | 557697 |
ORG | 750925 | 749504 | 130394 |
GEOLOC | 2729702 | 2728296 | 215924 |
TRANSPORT | 161539 | 160487 | 53405 |
EVENT | 798433 | 798251 | 86471 |
FACILITY | 258835 | 258513 | 109867 |
RISKNAT | 5502 | 4994 | 127 |
DAMAGES | 1136 | 121 | 56 |
OTHER | 4340621 | 4339658 | 682458 |
Total | 10146795 | 10138230 | 1836399 |
Overview of the mentions annotated in the full dataset. #Mentions shows the total number of mentions per label, #Linked the number of mentions linked to an entity and #Entities the number of distinct entities per label present in the dataset.
Files
Readme.txt
Total size: 839.1 kB
Checksum | Size |
---|---|
md5:7384c10daedce56d4f3d68126d16aad1 | 203 Bytes |
md5:964347f93f4e953ae80e9e7450f55b4c | 838.8 kB |
Additional details
Funding
- ReSoCIO – Social Networks for Natural Disaster: Operational Interpretation ANR-20-CE39-0014
- Agence Nationale de la Recherche
References
- Caillaut, G., Gracianne, C., Abadie, N., Touya, G., & Auclair, S. (2022, May). Automated construction of a French Entity Linking dataset to geolocate social network posts in the context of natural disasters. In ISCRAM. https://hal.science/hal-03631387/
- Caillaut, G., Gracianne, C., Auclair, S., Abadie, N., & Touya, G. (2022, June). Annotation sémantique pour la géolocalisation d'entités spatiales dans des tweets. In PFIA Résilience et IA. https://hal.science/hal-03682484/