HazMiner dataset
Description
The HazMiner dataset contains the location, timing and impact of geo-hydrological hazard events (flood, landslide and flash flood) at the global scale. The data is extracted using a paragraph-based text mining method called HazMiner, which uses large language models to extract information from online news articles. The HazMiner method is specifically designed to improve the documentation of geo-hydrological hazards in the Global South.
The current version covers events from 2017 through 2024: 21,411 flood, 7,659 landslide and 3,606 flash flood events, extracted from 6,366,905 news articles in 58 languages.
More information on HazMiner: <future publication>
More information on the code
General information
The dataset contains information on the articles, paragraphs and events.
- Articles (Level 1): the articles used to extract the geo-hydrological events
- Paragraphs (Level 2): the paragraphs of the corresponding articles with their extracted information on the location, timing and impact
- Events (Level 3): geo-hydrological hazard events, each representing a cluster of paragraphs that are close together in time and space
The datasets are linked to each other by ids: each article, paragraph and event has its own id. All articles were extracted from the GDELT Global Knowledge Graph (GDELT, 2025).
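The linkage between the three levels can be sketched as follows. This is a minimal illustration using only the id columns documented below (`ArticleID`, `ParagraphID`, `EventID`); the records, id values and titles are invented for the example, and the helper `titles_for_event` is hypothetical rather than part of HazMiner.

```python
# Toy records: the values are invented, only the id column names come from the dataset.
articles = [
    {"ArticleID": "A1", "title": "Floods hit the valley"},
    {"ArticleID": "A2", "title": "Landslide blocks road"},
]
paragraphs = [
    {"ParagraphID": "P1", "ArticleID": "A1", "EventID": "E1"},
    {"ParagraphID": "P2", "ArticleID": "A1", "EventID": "E1"},
    {"ParagraphID": "P3", "ArticleID": "A2", "EventID": "E2"},
]

# Index articles by id so each paragraph can look up its source article (Level 2 -> Level 1).
articles_by_id = {a["ArticleID"]: a for a in articles}

# Group paragraphs by event id (Level 2 -> Level 3).
paragraphs_by_event = {}
for p in paragraphs:
    paragraphs_by_event.setdefault(p["EventID"], []).append(p)


def titles_for_event(event_id):
    """Return the distinct article titles behind one event, in first-seen order."""
    seen = []
    for p in paragraphs_by_event.get(event_id, []):
        title = articles_by_id[p["ArticleID"]]["title"]
        if title not in seen:
            seen.append(title)
    return seen
```

Starting from an event, the paragraph table acts as the bridge: it carries both the `EventID` and the `ArticleID`, so no direct event-to-article lookup is needed.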
How to get started
More to follow soon.
Structure
Articles
| Column | Description |
|---|---|
| ArticleID | The id of the article |
| title | The title of the article |
| url | The url of the article |
| domain | The domain of the article |
| sourcecountry | The source country of the article |
| Publication_time | The publication time (YYYY-MM-DD HH:MM:SS) |
| Location | The place names extracted from the article by a NER large language model |
| NER score | The score assigned to the output of the large language model; each location has its own score (output in the same order as 'Location') |
| language_iso | The language of the article (ISO 639-1) |
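Because 'Location' and 'NER score' are given in the same order, the two columns can be paired element by element. The sketch below assumes both columns have already been parsed into Python lists (how they are serialized in the files is not specified here); the function name `confident_locations` and the default threshold of 0.9 are arbitrary choices for illustration.

```python
def confident_locations(locations, scores, threshold=0.9):
    """Keep place names whose NER score is at least `threshold`.

    `locations` and `scores` are assumed to be parallel lists in the same order.
    """
    if len(locations) != len(scores):
        raise ValueError("Location and NER score must have the same length")
    return [name for name, score in zip(locations, scores) if score >= threshold]
```

For example, `confident_locations(["Kigali", "Rwanda", "Goma"], [0.98, 0.95, 0.41])` keeps the first two names and drops the third.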
Paragraphs
| Column | Description |
|---|---|
| ParagraphID | The id of the paragraph |
| Hazard type | The hazard type (flood, landslide or flash flood), identified by a zero-shot classification model |
| Hazard type score | The score given by the zero-shot classification model |
| Time | Time of the hazard, extracted from a time reference in the text. When there is no time reference, it equals the publication time of the article. (YYYY-MM-DD HH:MM:SS) |
| Publication_time | The publication time of the article (YYYY-MM-DD HH:MM:SS) |
| Location | The place names extracted from the article by a NER large language model |
| NER score | The score assigned to the output of the large language model; each location has its own score (output in the same order as 'Location') |
| lat | The latitude of the paragraph, a weighted average of all locations mentioned in the paragraph (°) |
| lon | The longitude of the paragraph, a weighted average of all locations mentioned in the paragraph (°) |
| minLat | The southern boundary of the paragraph location (°) |
| maxLat | The northern boundary of the paragraph location (°) |
| minLon | The western boundary of the paragraph location (°) |
| maxLon | The eastern boundary of the paragraph location (°) |
| number_death | The number of deaths extracted by a large language model (Q&A) |
| score_death | The score on the answer for the number of deaths returned by the large language model |
| answer_death | The answer for the number of deaths returned by the large language model |
| number_homeless | The number of homeless extracted by a large language model (Q&A) |
| score_homeless | The score on the answer for the number of homeless returned by the large language model |
| answer_homeless | The answer for the number of homeless returned by the large language model |
| number_injured | The number of injured extracted by a large language model (Q&A) |
| score_injured | The score on the answer for the number of injured returned by the large language model |
| answer_injured | The answer for the number of injured returned by the large language model |
| number_affected | The number of affected extracted by a large language model (Q&A) |
| score_affected | The score on the answer for the number of affected returned by the large language model |
| answer_affected | The answer for the number of affected returned by the large language model |
| number_missing | The number of missing extracted by a large language model (Q&A) |
| score_missing | The score on the answer for the number of missing returned by the large language model |
| answer_missing | The answer for the number of missing returned by the large language model |
| number_evacuated | The number of evacuated extracted by a large language model (Q&A) |
| score_evacuated | The score on the answer for the number of evacuated returned by the large language model |
| answer_evacuated | The answer for the number of evacuated returned by the large language model |
| ArticleID | The id of the corresponding article |
| title | The title of the corresponding article |
| domain | The domain of the corresponding article |
| sourcecountry | The source country of the corresponding article |
| language_iso | The language of the article (ISO 639-1) |
| EventID | The id of the corresponding event |
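The `lat`/`lon` and bounding-box columns above can be illustrated with a small sketch. The table only states that the position is a weighted average of the mentioned locations; the choice of weights is not specified, so the `weight` values here are placeholders (the NER score would be one possible choice, but that is an assumption). The bounding box is taken as the extent of the input points, which is likewise an assumption, and plain averaging of longitudes does not handle locations on both sides of the antimeridian.

```python
def paragraph_position(points):
    """Weighted mean position and bounding box for one paragraph.

    points: list of (lat, lon, weight) tuples, one per place mentioned.
    The weights are hypothetical; the dataset description does not define them.
    """
    total = sum(w for _, _, w in points)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    lats = [lat for lat, _, _ in points]
    lons = [lon for _, lon, _ in points]
    return {
        "lat": sum(lat * w for lat, _, w in points) / total,
        "lon": sum(lon * w for _, lon, w in points) / total,
        "minLat": min(lats),
        "maxLat": max(lats),
        "minLon": min(lons),
        "maxLon": max(lons),
    }
```

With two places at (10°, 20°) and (20°, 40°) weighted 1 and 3, the result is pulled towards the second place: `lat` = 17.5 and `lon` = 35.0.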
Events
| Column | Description |
|---|---|
| EventID | The id of the event |
| Hazard type | The hazard type of the event |
| hazard_score | The score given by the zero-shot classification model (average of paragraphs) |
| lat | The latitude of the event (medoid of paragraphs) (°) |
| lon | The longitude of the event (medoid of paragraphs) (°) |
| min_lat | The southern boundary of the event (most southern paragraph) (°) |
| max_lat | The northern boundary of the event (most northern paragraph) (°) |
| min_lon | The western boundary of the event (most western paragraph) (°) |
| max_lon | The eastern boundary of the event (most eastern paragraph) (°) |
| Start | The start of the event (timing of the first paragraph) (YYYY-MM-DD) |
| End | The end of the event (timing of the last paragraph) (YYYY-MM-DD) |
| Time | The timing of the event (median time of all paragraphs) (YYYY-MM-DD) |
| Duration | The duration of the event (days) |
| Paragraphs | The paragraph ids of all paragraphs of the event |
| Articles | The article ids of all articles of the event |
| n_paragraphs | The number of paragraphs |
| n_articles | The number of articles |
| n_language | The number of languages |
| n_sourcecountry | The number of source countries |
| n_domain | The number of domains |
| mostfreq_death | The most frequently reported number of deaths |
| n_mostfreq_death | The number of times the most frequently reported number of deaths is reported |
| median_death | The median number of deaths (median of all paragraphs) |
| mostfreq_homeless | The most frequently reported number of homeless |
| n_mostfreq_homeless | The number of times the most frequently reported number of homeless is reported |
| median_homeless | The median number of homeless (median of all paragraphs) |
| mostfreq_injured | The most frequently reported number of injured |
| n_mostfreq_injured | The number of times the most frequently reported number of injured is reported |
| median_injured | The median number of injured (median of all paragraphs) |
| mostfreq_affected | The most frequently reported number of affected |
| n_mostfreq_affected | The number of times the most frequently reported number of affected is reported |
| median_affected | The median number of affected (median of all paragraphs) |
| mostfreq_missing | The most frequently reported number of missing |
| n_mostfreq_missing | The number of times the most frequently reported number of missing is reported |
| median_missing | The median number of missing (median of all paragraphs) |
| mostfreq_evacuated | The most frequently reported number of evacuated |
| n_mostfreq_evacuated | The number of times the most frequently reported number of evacuated is reported |
| median_evacuated | The median number of evacuated (median of all paragraphs) |
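The event-level columns summarize the paragraphs of an event. The sketch below illustrates two of the aggregations named in the table: the medoid of paragraph positions and the `mostfreq_*` / `n_mostfreq_*` / `median_*` statistics. It is an illustration under stated assumptions, not the HazMiner implementation: the distance used for the medoid (great-circle distance here), the treatment of paragraphs without a reported number (ignored here) and the function names are all assumptions.

```python
from collections import Counter
from math import asin, cos, radians, sin, sqrt
from statistics import median


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def medoid(points):
    """Return the input point with the smallest summed distance to all points."""
    return min(points, key=lambda p: sum(haversine_km(p, q) for q in points))


def summarise_impact(values):
    """Summarize one impact type over the paragraphs of an event.

    values: reported numbers, with None where a paragraph reports nothing
    (dropping None is an assumption made for this sketch).
    """
    reported = [v for v in values if v is not None]
    if not reported:
        return {"mostfreq": None, "n_mostfreq": 0, "median": None}
    value, count = Counter(reported).most_common(1)[0]
    return {"mostfreq": value, "n_mostfreq": count, "median": median(reported)}
```

Unlike a mean position, the medoid is always one of the actual paragraph positions, so a single badly geocoded paragraph cannot drag the event location to a place no article mentions.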
Disclaimer
The dataset is part of a preprint; once the paper is published, the data will be available in open access.
The HazMiner database was created through lawful text and data mining (TDM) in accordance with Article 3 of Directive (EU) 2019/790 on copyright and related rights in the Digital Single Market. All data contained in this database are the result of automated extraction and synthesis from lawfully accessible sources, including publicly available news articles indexed in the GDELT database. The dataset contains only factual information (e.g., time, location, type of event, reported impacts) and does not reproduce any protected expression or copyrighted content from the original sources.
Additional details
Funding
- Belgian Federal Science Policy Office
- FED-tWIN Programme Prf-2019-066_GuiDANCE
References
- The GDELT Project: https://www.gdeltproject.org/, last access: 24 July 2025.