Published February 4, 2026 | Version v1
Dataset Restricted

HazMiner dataset

  • 1. Vrije Universiteit Brussel
  • 2. Royal Museum for Central Africa

Description

The HazMiner dataset contains the location, timing and impact of geo-hydrological hazard events (flood, landslide and flash flood) at the global scale. The data is extracted using a paragraph-based text mining method called HazMiner, which uses large language models to extract information from online news articles. The HazMiner method is specifically designed to improve the documentation of geo-hydrological hazards in the Global South.

The current version covers events from 2017 through 2024: 21,411 flood, 7,659 landslide and 3,606 flash flood events, extracted from 6,366,905 news articles in 58 languages.

More information on HazMiner: <future publication>

More information on the code

General information

The dataset contains information on the articles, paragraphs and events. 

  • Articles (Level 1): the articles used to extract the geo-hydrological events
  • Paragraphs (Level 2): the paragraphs of the corresponding articles with their extracted information on the location, timing and impact 
  • Events (Level 3): geo-hydrological hazard events, each representing a cluster of paragraphs that refer to roughly the same place and time

The datasets are linked to each other by ids: each article, paragraph and event has its own id. All articles were extracted from the GDELT Global Knowledge Graph (GDELT, 2025).
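
As a rough illustration of how the three levels connect through their ids, the sketch below walks from an event to its paragraphs and on to the source articles. The records are invented toy values, the helper names (`paragraphs_of_event`, `articles_of_event`) are hypothetical, and the real files may be organised differently; only the column names `ArticleID`, `ParagraphID` and `EventID` come from the tables in this record.

```python
# Toy records standing in for the Articles and Paragraphs levels.
# All values are invented for illustration.
articles = [
    {"ArticleID": "a1", "title": "River bursts its banks", "language_iso": "en"},
    {"ArticleID": "a2", "title": "Hillside collapses after rain", "language_iso": "fr"},
]
paragraphs = [
    {"ParagraphID": "p1", "ArticleID": "a1", "EventID": "e1"},
    {"ParagraphID": "p2", "ArticleID": "a1", "EventID": "e1"},
    {"ParagraphID": "p3", "ArticleID": "a2", "EventID": "e2"},
]


def paragraphs_of_event(event_id, paragraphs):
    """Return all paragraph records assigned to one event (hypothetical helper)."""
    return [p for p in paragraphs if p["EventID"] == event_id]


def articles_of_event(event_id, paragraphs, articles):
    """Return the distinct article records behind one event, in first-seen order."""
    by_id = {a["ArticleID"]: a for a in articles}
    seen = []
    for p in paragraphs_of_event(event_id, paragraphs):
        if p["ArticleID"] not in seen:
            seen.append(p["ArticleID"])
    return [by_id[article_id] for article_id in seen]


event_paragraphs = paragraphs_of_event("e1", paragraphs)
event_articles = articles_of_event("e1", paragraphs, articles)
print(len(event_paragraphs), [a["title"] for a in event_articles])
```

Two paragraphs from the same article count as one article for the event, which matches the distinction between `n_paragraphs` and `n_articles` in the Events table.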

How to get started

More to follow soon.

Structure

Articles

Column Description
ArticleID The id of the article
title The title of the article
url The url of the article
domain The corresponding domain of the articles
sourcecountry The source country of the articles
Publication_time The publication time (YYYY-MM-DD HH:MM:SS)
Location The place names extracted from the article by a NER large language model 
NER score The score assigned to the output of the large language model; each location has its own score (listed in the same order as 'Location')
language_iso The language of the article (ISO 639-1)
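
Because 'Location' and 'NER score' are given in the same order, the two can be paired element by element. The sketch below assumes both columns have already been parsed into Python lists (how they are serialised in the files is not stated here); the function name `pair_locations` and the `min_score` threshold are hypothetical and not part of the dataset.

```python
def pair_locations(locations, scores, min_score=0.0):
    """Pair each place name with its NER score, keeping pairs at or above min_score.

    Assumes `locations` and `scores` are already parsed lists of equal length;
    `min_score` is an illustrative filter, not a value from the dataset.
    """
    if len(locations) != len(scores):
        raise ValueError("Location and NER score must have the same length")
    return [(name, score) for name, score in zip(locations, scores) if score >= min_score]


# Invented example values.
pairs = pair_locations(["Goma", "Bukavu"], [0.98, 0.41], min_score=0.5)
print(pairs)
```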

Paragraphs

Column Description
ParagraphID The id of the paragraph
Hazard type The hazard type (flood, landslide or flash flood), identified by a zero-shot classification model
Hazard type score The score given by the zero-shot classification model
Time Time of the hazard, extracted from a time reference in the text. When there is no time reference, it equals the publication time of the article. (YYYY-MM-DD HH:MM:SS)
Publication_time The publication time of the article (YYYY-MM-DD HH:MM:SS)
Location The place names extracted from the article by a NER large language model 
NER score The score assigned to the output of the large language model; each location has its own score (listed in the same order as 'Location')
lat The latitude of the paragraph, a weighted average of all locations mentioned in the paragraph (°)
lon The longitude of the paragraph, a weighted average of all locations mentioned in the paragraph (°)
minLat The southern boundary of the paragraph location (°)
maxLat The northern boundary of the paragraph location (°)
minLon The western boundary of the paragraph location (°)
maxLon The eastern boundary of the paragraph location (°)
number_death The number of deaths extracted by a large language model (Q&A)
score_death The score on the answer for the number of deaths returned by the large language model
answer_death The answer for the number of deaths returned by the large language model
number_homeless The number of homeless extracted by a large language model (Q&A)
score_homeless The score on the answer for the number of homeless returned by the large language model
answer_homeless The answer for the number of homeless returned by the large language model
number_injured The number of injured extracted by a large language model (Q&A)
score_injured The score on the answer for the number of injured returned by the large language model
answer_injured The answer for the number of injured returned by the large language model
number_affected The number of affected extracted by a large language model (Q&A)
score_affected The score on the answer for the number of affected returned by the large language model
answer_affected The answer for the number of affected returned by the large language model
number_missing The number of missing extracted by a large language model (Q&A)
score_missing The score on the answer for the number of missing returned by the large language model
answer_missing The answer for the number of missing returned by the large language model
number_evacuated The number of evacuated extracted by a large language model (Q&A)
score_evacuated The score on the answer for the number of evacuated returned by the large language model
answer_evacuated The answer for the number of evacuated returned by the large language model
ArticleID The id of the corresponding article
title The title of the corresponding article
domain The domain of the corresponding article
sourcecountry The source country of the corresponding article
language_iso The language of the article (ISO 639-1)
EventID The id of the corresponding event
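
The paragraph coordinates are described as a weighted average of the locations mentioned in the paragraph. The sketch below shows one way such an average and a bounding box could be computed. The weighting scheme is not specified in this record, so the weights are simply passed in (the NER scores would be one possible choice, which is an assumption). Deriving the box from the point coordinates is also an assumption, and plain averaging of longitudes does not handle locations on both sides of the antimeridian.

```python
def weighted_centroid(points):
    """Weighted mean of (lat, lon, weight) tuples, in degrees.

    The weights are an assumption for illustration; the dataset description
    does not state which weights HazMiner uses.
    """
    total = sum(weight for _, _, weight in points)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    lat = sum(lat * weight for lat, _, weight in points) / total
    lon = sum(lon * weight for _, lon, weight in points) / total
    return lat, lon


def bounding_box(points):
    """Southern, northern, western and eastern limits of the given points."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    return {"minLat": min(lats), "maxLat": max(lats),
            "minLon": min(lons), "maxLon": max(lons)}


# Invented example: two locations, the second weighted three times as heavily.
points = [(0.0, 10.0, 1.0), (2.0, 20.0, 3.0)]
print(weighted_centroid(points), bounding_box(points))
```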

Events

Column Description
EventID The id of the event 
Hazard type The hazard type of the event
hazard_score The score given by the zero-shot classification model (average of paragraphs)
lat The latitude of the event (medoid of paragraphs) (°)
lon The longitude of the event (medoid of paragraphs) (°)
min_lat The southern boundary of the event (most southern paragraph) (°)
max_lat The northern boundary of the event (most northern paragraph) (°)
min_lon The western boundary of the event (most western paragraph) (°)
max_lon The eastern boundary of the event (most eastern paragraph) (°)
Start The start of the event (timing of the first paragraph) (YYYY-MM-DD)
End The end of the event (timing of the last paragraph) (YYYY-MM-DD)
Time The timing of the event (median time of all paragraphs) (YYYY-MM-DD)
Duration The duration of the event (days)
Paragraphs The paragraph ids of all paragraphs of the event
Articles The article ids of all articles of the event
n_paragraphs The number of paragraphs
n_articles The number of articles
n_language The number of languages
n_sourcecountry The number of source countries
n_domain The number of domains
mostfreq_death The most frequently reported number of deaths
n_mostfreq_death The number of times the most frequently reported number of deaths is reported
median_death The median number of deaths (median of all paragraphs)
mostfreq_homeless The most frequently reported number of homeless
n_mostfreq_homeless The number of times the most frequently reported number of homeless is reported
median_homeless The median number of homeless (median of all paragraphs)
mostfreq_injured The most frequently reported number of injured
n_mostfreq_injured The number of times the most frequently reported number of injured is reported
median_injured The median number of injured (median of all paragraphs)
mostfreq_affected The most frequently reported number of affected
n_mostfreq_affected The number of times the most frequently reported number of affected is reported
median_affected The median number of affected (median of all paragraphs)
mostfreq_missing The most frequently reported number of missing
n_mostfreq_missing The number of times the most frequently reported number of missing is reported
median_missing The median number of missing (median of all paragraphs)
mostfreq_evacuated The most frequently reported number of evacuated
n_mostfreq_evacuated The number of times the most frequently reported number of evacuated is reported
median_evacuated The median number of evacuated (median of all paragraphs)
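
The event-level columns summarise the paragraphs of a cluster. The sketch below shows how summaries of this kind could be derived from paragraph values; it is an illustration under stated assumptions, not the HazMiner implementation. The function names are hypothetical. Missing impact values are assumed to be `None`, the duration is assumed to be the difference between end and start dates, the median time uses `median_low` so that it always returns an observed timestamp, and the medoid uses the great-circle (haversine) distance, which is one possible choice of metric.

```python
from collections import Counter
from datetime import datetime
from math import asin, cos, radians, sin, sqrt
from statistics import median, median_low


def summarise_impact(values):
    """Most frequent value, its count, and the median of the reported values."""
    reported = [v for v in values if v is not None]  # assumption: None = not reported
    if not reported:
        return {"mostfreq": None, "n_mostfreq": 0, "median": None}
    value, count = Counter(reported).most_common(1)[0]
    return {"mostfreq": value, "n_mostfreq": count, "median": median(reported)}


def summarise_timing(times):
    """Start, end, median time and duration from 'YYYY-MM-DD HH:MM:SS' strings."""
    stamps = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in times)
    start, end = stamps[0].date(), stamps[-1].date()
    return {"Start": start.isoformat(), "End": end.isoformat(),
            "Time": median_low(stamps).date().isoformat(),
            "Duration": (end - start).days}  # assumption: end date minus start date


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def medoid(points):
    """The point with the smallest total distance to all other points."""
    return min(points, key=lambda p: sum(haversine_km(p, q) for q in points))


# Invented paragraph values for a single event.
print(summarise_impact([12, 12, 15, None, 12, 20]))
print(summarise_timing(["2021-07-14 08:00:00", "2021-07-16 19:30:00", "2021-07-15 12:00:00"]))
print(medoid([(0.0, 0.0), (0.0, 1.0), (0.0, 5.0)]))
```

Unlike a mean, a medoid is always one of the paragraph locations, so a single distant outlier cannot pull the event coordinate to a place that no paragraph refers to.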

Disclaimer

The dataset is part of a preprint; once it is published, the data will be available in open access.

The HazMiner database was created through lawful text and data mining (TDM) in accordance with Article 3 of Directive (EU) 2019/790 on copyright and related rights in the Digital Single Market. All data contained in this database are the result of automated extraction and synthesis from lawfully accessible sources, including publicly available news articles indexed in the GDELT database. The dataset contains only factual information (e.g., time, location, type of event, reported impacts) and does not reproduce any protected expression or copyrighted content from the original sources.

 

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details

Funding

Belgian Federal Science Policy Office
FED-tWIN Programme Prf-2019-066_GuiDANCE

References