HazMiner dataset

Valkenborg, Bram; Dewitte, Olivier; Smets, Benoît

doi:10.5281/zenodo.18483419

Published February 4, 2026 | Version v1

Dataset Restricted


1. Vrije Universiteit Brussel
2. Royal Museum for Central Africa

The HazMiner dataset contains the location, timing and impact of geo-hydrological hazard events (floods, landslides and flash floods) at the global scale. The data are extracted with a paragraph-based text mining method called HazMiner, which uses large language models to extract information from online news articles. The HazMiner method is specifically designed to improve the documentation of geo-hydrological hazards in the Global South.

The current version covers events from 2017 through 2024: 21,411 flood, 7,659 landslide and 3,606 flash flood events, extracted from 6,366,905 news articles in 58 languages.

More information on HazMiner: <future publication>

More information on the code

General information

The dataset contains information on the articles, paragraphs and events.

Articles (Level 1): the articles used to extract the geo-hydrological events
Paragraphs (Level 2): the paragraphs of the corresponding articles with their extracted information on the location, timing and impact
Events (Level 3): geo-hydrological hazard events, each representing a cluster of paragraphs that are close together in time and space

The datasets are linked to each other by ids: each article, paragraph and event has its own id. All articles were extracted from the GDELT Global Knowledge Graph (GDELT, 2025).
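
The id links between the three levels can be followed with plain dictionary lookups. The sketch below is only an illustration: the column names ArticleID, ParagraphID, EventID and language_iso come from the tables in this record, while the id values, titles and the helper name languages_of_event are invented, and the real id format may differ.

```python
from collections import defaultdict

# Hypothetical miniature records; ids and values are invented for illustration.
articles = [
    {"ArticleID": "A1", "title": "River bursts its banks", "language_iso": "en"},
    {"ArticleID": "A2", "title": "Inondations dans la vallée", "language_iso": "fr"},
]
paragraphs = [
    {"ParagraphID": "P1", "ArticleID": "A1", "EventID": "E1"},
    {"ParagraphID": "P2", "ArticleID": "A2", "EventID": "E1"},
    {"ParagraphID": "P3", "ArticleID": "A2", "EventID": "E2"},
]

# Index articles by id so each paragraph can look up its source article.
article_by_id = {a["ArticleID"]: a for a in articles}

# Group paragraphs by event (Level 2 -> Level 3).
paragraphs_by_event = defaultdict(list)
for p in paragraphs:
    paragraphs_by_event[p["EventID"]].append(p)


def languages_of_event(event_id):
    """Return the sorted article languages behind one event."""
    return sorted(
        {article_by_id[p["ArticleID"]]["language_iso"]
         for p in paragraphs_by_event[event_id]}
    )


print(languages_of_event("E1"))
```

The same lookups work in the other direction, for example to list every paragraph that a given article contributed.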

How to get started

More to follow soon.

Structure

Articles

Column	Description
ArticleID	The id of the article
title	The title of the article
url	The url of the article
domain	The domain of the article
sourcecountry	The source country of the article
Publication_time	The publication time (YYYY-MM-DD HH:MM:SS)
Location	The place names extracted from the article by an NER large language model
NER score	The score assigned to the output of the large language model; each location has its own score (output in the same order as 'Location')
language_iso	The language of the article (ISO 639-1)
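
Because the 'NER score' values follow the same order as 'Location', the two columns can be paired position by position. The following is a minimal sketch, assuming both cells have already been parsed into Python lists (how they are serialized in the files is not described here); the function name pair_locations and the min_score threshold are hypothetical.

```python
def pair_locations(locations, scores, min_score=0.0):
    """Pair each place name with its NER score and drop low-scoring names.

    Assumes `locations` and `scores` are already parsed lists in matching order.
    """
    if len(locations) != len(scores):
        raise ValueError("Location and NER score must have the same length")
    return [(name, score) for name, score in zip(locations, scores)
            if score >= min_score]


# Invented example values.
print(pair_locations(["Goma", "Bukavu"], [0.98, 0.61], min_score=0.9))
```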

Paragraphs

Column	Description
ParagraphID	The id of the paragraph
Hazard type	The hazard type (flood, landslide or flash flood), identified by a zero-shot classification model
Hazard type score	The score given by the zero-shot classification model
Time	The time of the hazard, extracted from a time reference in the text. When there is no time reference, it equals the publication time of the article. (YYYY-MM-DD HH:MM:SS)
Publication_time	The publication time of the article (YYYY-MM-DD HH:MM:SS)
Location	The place names extracted from the article by an NER large language model
NER score	The score assigned to the output of the large language model; each location has its own score (output in the same order as 'Location')
lat	The latitude of the paragraph, a weighted average of all locations mentioned in the paragraph (°)
lon	The longitude of the paragraph, a weighted average of all locations mentioned in the paragraph (°)
minLat	The southern boundary of the paragraph location (°)
maxLat	The northern boundary of the paragraph location (°)
minLon	The western boundary of the paragraph location (°)
maxLon	The eastern boundary of the paragraph location (°)
number_death	The number of deaths extracted by a large language model (Q&A)
score_death	The score on the answer for the number of deaths returned by the large language model
answer_death	The answer for the number of deaths returned by the large language model
number_homeless	The number of homeless extracted by a large language model (Q&A)
score_homeless	The score on the answer for the number of homeless returned by the large language model
answer_homeless	The answer for the number of homeless returned by the large language model
number_injured	The number of injured extracted by a large language model (Q&A)
score_injured	The score on the answer for the number of injured returned by the large language model
answer_injured	The answer for the number of injured returned by the large language model
number_affected	The number of affected extracted by a large language model (Q&A)
score_affected	The score on the answer for the number of affected returned by the large language model
answer_affected	The answer for the number of affected returned by the large language model
number_missing	The number of missing extracted by a large language model (Q&A)
score_missing	The score on the answer for the number of missing returned by the large language model
answer_missing	The answer for the number of missing returned by the large language model
number_evacuated	The number of evacuated extracted by a large language model (Q&A)
score_evacuated	The score on the answer for the number of evacuated returned by the large language model
answer_evacuated	The answer for the number of evacuated returned by the large language model
ArticleID	The id of the corresponding article
title	The title of the corresponding article
domain	The domain of the corresponding article
sourcecountry	The source country of the corresponding article
language_iso	The language of the article (ISO 639-1)
EventID	The id of the corresponding event
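
The lat and lon columns are described as a weighted average of the locations mentioned in a paragraph. The sketch below shows one way such an average and a bounding box could be computed. It is not the HazMiner implementation: the weights are not specified in this record (the example simply passes a weight per location, which could be the NER score), the mean is taken directly on degrees and ignores the antimeridian, and the function names are hypothetical.

```python
def weighted_centroid(points):
    """Weighted mean of (lat, lon, weight) triples, in degrees."""
    total = sum(w for _, _, w in points)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    lat = sum(la * w for la, _, w in points) / total
    lon = sum(lo * w for _, lo, w in points) / total
    return lat, lon


def bounding_box(points):
    """Return (minLat, maxLat, minLon, maxLon) of the same points."""
    lats = [la for la, _, _ in points]
    lons = [lo for _, lo, _ in points]
    return min(lats), max(lats), min(lons), max(lons)


# Invented coordinates and weights.
pts = [(0.0, 10.0, 1.0), (2.0, 14.0, 3.0)]
print(weighted_centroid(pts))
print(bounding_box(pts))
```

The location with the larger weight pulls the centroid toward itself, so a confidently recognised place name dominates an uncertain one.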

Events

Column	Description
EventID	The id of the event
Hazard type	The hazard type of the event
hazard_score	The score given by the zero-shot classification model (average of paragraphs)
lat	The latitude of the event (medoid of paragraphs) (°)
lon	The longitude of the event (medoid of paragraphs) (°)
min_lat	The southern boundary of the event (southernmost paragraph) (°)
max_lat	The northern boundary of the event (northernmost paragraph) (°)
min_lon	The western boundary of the event (westernmost paragraph) (°)
max_lon	The eastern boundary of the event (easternmost paragraph) (°)
Start	The start of the event (timing of the first paragraph) (YYYY-MM-DD)
End	The end of the event (timing of the last paragraph) (YYYY-MM-DD)
Time	The timing of the event (median time of all paragraphs) (YYYY-MM-DD)
Duration	The duration of the event (days)
Paragraphs	The paragraph ids of all paragraphs of the event
Articles	The article ids of all articles of the event
n_paragraphs	The number of paragraphs
n_articles	The number of articles
n_language	The number of languages
n_sourcecountry	The number of source countries
n_domain	The number of domains
mostfreq_death	The most frequently reported number of deaths
n_mostfreq_death	The number of times the most frequently reported number of deaths is reported
median_death	The median number of deaths (median of all paragraphs)
mostfreq_homeless	The most frequently reported number of homeless
n_mostfreq_homeless	The number of times the most frequently reported number of homeless is reported
median_homeless	The median number of homeless (median of all paragraphs)
mostfreq_injured	The most frequently reported number of injured
n_mostfreq_injured	The number of times the most frequently reported number of injured is reported
median_injured	The median number of injured (median of all paragraphs)
mostfreq_affected	The most frequently reported number of affected
n_mostfreq_affected	The number of times the most frequently reported number of affected is reported
median_affected	The median number of affected (median of all paragraphs)
mostfreq_missing	The most frequently reported number of missing
n_mostfreq_missing	The number of times the most frequently reported number of missing is reported
median_missing	The median number of missing (median of all paragraphs)
mostfreq_evacuated	The most frequently reported number of evacuated
n_mostfreq_evacuated	The number of times the most frequently reported number of evacuated is reported
median_evacuated	The median number of evacuated (median of all paragraphs)
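
The event columns above summarize the paragraphs of a cluster. The sketch below illustrates those summaries on invented data. Several choices are assumptions rather than the documented method: the medoid uses haversine distance, the median date takes the lower middle value so that it is a real date, Duration is the plain difference End minus Start, and paragraph times are passed as date objects. The function names are hypothetical.

```python
import math
from collections import Counter
from datetime import date
from statistics import median, median_low


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(h)))


def medoid(points):
    """Return the point with the smallest summed distance to all points."""
    return min(points, key=lambda p: sum(haversine_km(p, q) for q in points))


def summarize_event(paragraphs):
    """Aggregate paragraph-level values into event-level columns."""
    lat, lon = medoid([(p["lat"], p["lon"]) for p in paragraphs])
    days = sorted(p["Time"] for p in paragraphs)
    deaths = [p["number_death"] for p in paragraphs
              if p["number_death"] is not None]
    mostfreq, n_mostfreq = Counter(deaths).most_common(1)[0] if deaths else (None, 0)
    return {
        "lat": lat,
        "lon": lon,
        "Start": days[0],
        "End": days[-1],
        # Lower median of the ordinals, so the result is an observed date.
        "Time": date.fromordinal(median_low([d.toordinal() for d in days])),
        "Duration": (days[-1] - days[0]).days,
        "mostfreq_death": mostfreq,
        "n_mostfreq_death": n_mostfreq,
        "median_death": median(deaths) if deaths else None,
    }


# Invented paragraphs belonging to one event.
event = summarize_event([
    {"lat": 0.0, "lon": 0.0, "Time": date(2024, 5, 1), "number_death": 12},
    {"lat": 0.1, "lon": 0.1, "Time": date(2024, 5, 2), "number_death": 12},
    {"lat": 1.0, "lon": 1.0, "Time": date(2024, 5, 4), "number_death": 15},
])
print(event)
```

Unlike a mean, a medoid is always one of the observed paragraph locations, so a single distant outlier does not drag the event position into empty space.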

Disclaimer

The dataset is part of a preprint; once it is published, the data will be available in open access.

The HazMiner database was created through lawful text and data mining (TDM) in accordance with Article 3 of Directive (EU) 2019/790 on copyright and related rights in the Digital Single Market. All data contained in this database are the result of automated extraction and synthesis from lawfully accessible sources, including publicly available news articles indexed in the GDELT database. The dataset contains only factual information (e.g., time, location, type of event, reported impacts) and does not reproduce any protected expression or copyrighted content from the original sources.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details

Funding

Belgian Federal Science Policy Office
FED-tWIN Programme Prf-2019-066_GuiDANCE

References

GDELT, 2025. The GDELT Project: https://www.gdeltproject.org/, last access: 24 July 2025.
