Published March 10, 2026 | Version v7
Dataset | Open

LLM-GeoDis

Description

This repository contains the code and datasets used to produce the LLM-GeoDis dataset, a global database of subnationally geocoded disaster events from the EM-DAT International Disaster Database. The workflow uses a large language model (GPT-4o) to extract and standardize textual disaster location descriptions and match them to administrative units using GADM, OpenStreetMap, and Wikidata.

The resulting dataset provides subnational geocoding for global disaster events recorded in EM-DAT (2000–2024). Each record has been automatically processed to extract location entities and link them to administrative units. The dataset contains 14,215 disaster events across 17,948 unique locations, each associated with GADM administrative levels 1–2. It includes point geometries from Wikidata and OSM as well as harmonized GADM geometries to ensure consistent spatial coverage. Due to its size (~30 GB), the full LLM-GeoDis database is distributed across five compressed files.

This Zenodo record contains the following files:

  • LLMGeoDis_part1–5.zip – compressed parts of the main LLM-GeoDis dataset containing the geoparsed and geocoded disaster locations.

  • geoemdat_gaul.gpkg – GeoPackage containing EM-DAT events mapped to FAO GAUL administrative boundaries.

  • pend-gdis-1960-2018-disasterlocations.csv – intermediate dataset of disaster location strings extracted from EM-DAT during preprocessing.

  • reliability_db.csv – annotations used to evaluate geoparsing reliability and agreement between sources.

  • input_emdat.csv – EM-DAT input dataset used for the geoparsing workflow.

  • 241204_emdat_archive.xlsx – archived version of EM-DAT used for validation and reproducibility.

  • gdis_disnos.csv – mapping between EM-DAT event identifiers and events in the GDIS disaster dataset used for benchmarking.

  • synthetic_EMDAT_locations.csv – synthetic location examples used during development and testing.

  • emdat_geocoding-zenodo.zip – archive of the full code repository used to reproduce the geoparsing, geocoding, and validation workflows.

  • Instructions_LLM-GeoDis.pdf – instructions to run the full code and reproduce the analysis in the manuscript.

External datasets required to fully reproduce the workflow include GADM 4.1 administrative boundaries and the full GDIS dataset, which must be downloaded separately from their respective sources.
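Because the main dataset is split across five archives, unpacking them into one directory is a natural first step. The following is a minimal sketch, not part of the published workflow. It assumes the parts are named LLMGeoDis_part1.zip through LLMGeoDis_part5.zip and that each is an independent ZIP archive rather than a segment of a split archive; the function name extract_parts and its parameters are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_parts(download_dir, out_dir, n_parts=5):
    """Extract LLMGeoDis_part1.zip .. LLMGeoDis_part<n_parts>.zip into out_dir.

    Assumption: the file naming follows the "LLMGeoDis_part1-5.zip" pattern
    given in the file list, and each part is a standalone ZIP archive.
    Returns the names of the archives that were extracted.
    """
    download_dir = Path(download_dir)
    out_dir = Path(out_dir)

    # Check that every part is present before extracting anything,
    # so a partial download does not yield a silently incomplete dataset.
    parts = [download_dir / f"LLMGeoDis_part{i}.zip" for i in range(1, n_parts + 1)]
    missing = [p.name for p in parts if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"missing archives: {missing}")

    out_dir.mkdir(parents=True, exist_ok=True)
    for part in parts:
        with zipfile.ZipFile(part) as zf:
            zf.extractall(out_dir)
    return [p.name for p in parts]
```

For example, extract_parts("downloads", "LLMGeoDis") would unpack all five parts into a folder named LLMGeoDis. Instructions_LLM-GeoDis.pdf remains the authoritative guide to the expected directory layout.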

Files

Files (14.9 GB)

Checksum                                Size
md5:e78b986f8234a48de376b8026017d1dc    6.7 MB
md5:6d177d89cc6ceacf790068213974b5b7    29.2 MB
md5:41e6a666c7e26323ae6853ef753f960d    135.3 kB
md5:79b9e864059c19e6bf76b084a4aba9ee    6.1 GB
md5:97384f5817103569d2b1b18de5be4ea4    2.8 MB
md5:2e8cb9f813877cd10dcbaa326b9ed01f    275.2 kB
md5:e0b90fa7ffff0ca9a4ef102118db8056    1.9 GB
md5:e8137895a92159c6d9b19f21a4559709    1.9 GB
md5:817e75737a4834b782568e6a079e3dc5    1.8 GB
md5:667f69347acd41261bf397cadd8f11a3    1.5 GB
md5:be427a1290f6eea5eb88a7bf87b2141d    1.6 GB
md5:c6a669466c815a0882deb5b4cc648acb    4.9 MB
md5:fbb290425afbb64a3de62d8373fb839d    3.5 MB
md5:391085a505a50c948b95ffdb257ab058    104.3 kB
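
Several of the files are multiple gigabytes, so it is worth checking each download against its MD5 checksum before use. The sketch below uses only the Python standard library; the helper names md5_of_file and matches_checksum are hypothetical. The file names that belong to each checksum are not shown in this listing, so the pairing of file and checksum has to be taken from the record itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum string such as 'md5:e78b98...'.

    The optional 'md5:' prefix is stripped and the comparison is
    case-insensitive.
    """
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call such as matches_checksum(local_path, "md5:...") returns True only if the local file is byte-identical to the published one; a False result usually indicates an interrupted or corrupted download.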