Published December 16, 2025 | Version v1
Dataset Open

CREXWET: CREdata-Weather-Emergency-Twitter

  • Barcelona Supercomputing Center

Description

Dataset Summary

The CREXWET dataset is a multilingual corpus of tweets annotated for their relevance to a flash flood or wildfire, for weather event detection. The dataset contains 627k unique sentences (266k German, 223k English, 91k Spanish, 46k Catalan) from various past flood and wildfire incidents, plus some from unrelated time periods.

Supported Tasks and Leaderboards

text-classification: This dataset can be used to train models for flood and wildfire detection.
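To make the task concrete, here is a toy rule-based baseline over the three labels used in the dataset (`fire`, `flood`, `none`). The keyword lists are assumptions for illustration only; they are not part of the dataset or its tooling, and a real system would train a classifier on the corpus instead.

```python
# Toy keyword baseline for the fire/flood/none classification task.
# The keyword sets below are illustrative assumptions, not from the dataset.
FIRE_WORDS = {"wildfire", "fire", "smoke", "burn"}
FLOOD_WORDS = {"flood", "inundation", "surge", "overflow"}

def classify(text: str) -> str:
    tokens = set(text.lower().split())
    if tokens & FIRE_WORDS:
        return "fire"
    if tokens & FLOOD_WORDS:
        return "flood"
    return "none"

print(classify("Evacuation orders issued due to rapidly spreading wildfire"))  # fire
```

A trained model would replace the keyword lookup, but the input/output contract (text in, one of three labels out) is the same.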

Languages

The languages included in the dataset are:
- English (`en`)
- German (`de`)
- Spanish (`es`)
- Catalan (`ca-ES`)

Dataset Structure


Data Instances

An example instance from the dataset:

{
  "id": "0",
  "tweet_id": "276989327442079745",
  "language": "GERMAN",
  "label": "none",
  "label_quality": "0.9916292981992856"
}

Data Fields

Each instance in the dataset contains the following fields:

- `id`: index id specific to this dataset.
- `tweet_id`: id provided by Twitter/X.
- `language`: language of the sentence. Possible values: `ENGLISH`, `GERMAN`, `SPANISH`, `CATALAN`.
- `label`: event label. Possible values: `fire`, `flood`, `none`.
- `label_quality`: label quality score provided by Cleanlab.
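Assuming instances are serialized as JSON records like the one shown above (the exact file layout inside CREXWET.zip is not specified here), a record can be parsed as follows. Note that `label_quality` is stored as a string and needs conversion for numeric filtering:

```python
import json

record = json.loads("""
{
  "id": "0",
  "tweet_id": "276989327442079745",
  "language": "GERMAN",
  "label": "none",
  "label_quality": "0.9916292981992856"
}
""")

# label_quality is serialized as a string; convert before comparing thresholds
quality = float(record["label_quality"])
print(record["label"], round(quality, 3))  # none 0.992
```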

Data Splits

The dataset is split into `train` and `test` sets. The `train` set may be further split to create a `dev` set, but `test` should only be used for evaluation. The annotations of `train` sentences are machine generated, while those of `test` are machine generated and then human reviewed. Crisis incidents in `train` do not appear in `test`. The statistics of the splits are:
 
| Split | Instances |
|-------|-----------|
| Train | 592,704   |
| Test  | 34,722    |
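The split sizes are consistent with the ~627k total quoted in the summary; a quick sanity check:

```python
train, test = 592_704, 34_722
total = train + test
print(total)                          # 627426, i.e. the ~627k from the summary
print(round(100 * test / total, 1))  # test share in percent: 5.5
```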

 

Dataset Creation


Curation Rationale

This dataset was created to train a weather emergency detection model within the CREXDATA Project (Grant Agreement No. 101092749).

Source Data

Initial Data Collection and Normalization

The data was collected from Twitter, focusing on the days before, during, and after past flood and wildfire events in Europe and the USA. For German we focused on Germany and Austria; for English, on the USA; for Spanish and Catalan, on Spain. We also sourced data from public weather crisis datasets including CrisisLexT6, CrisisLexT26, CrisisMMD, CrisisNLP, and HumAID.

Who are the source language producers?

The source language producers are users of Twitter.

Annotations

Annotation process

The `train` dataset was annotated using Phi-3.5-mini, a large language model from Microsoft. The LLM was given the prompt below, with the text to be annotated substituted for `<TEXT>`:
 
Analyze the given body of text and determine whether it contains information relevant to an ongoing fire disaster, flood disaster, or neither.
The classification should be based on the presence of specific, actionable details or key indicators of relevance to such disasters.
Examine the text for mentions of fire-related or flood-related events, such as locations, impacts, safety advisories, or emergency responses.
Classify as "fire" if the text contains information about wildfires, house fires, smoke hazards, evacuation orders due to fire, or related topics.
Classify as "flood" if the text discusses water inundation, flash floods, storm surges, flood warnings, or evacuation due to flooding.
Classify as "none" if the text does not provide information relevant to either disaster or lacks actionable disaster-related content.
Texts that express sympathy, offer prayers, or request donations should also be classified as "none".
Only assign "flood" or "fire" if the relevance is clear; otherwise, default to "none".
The response should be formatted as follows:
{"classification": "[flood/fire/none]"}
Example
Input: "Evacuation orders have been issued for Riverside due to rapidly spreading wildfire near the canyon."
Output: {"classification": "fire"}
Classify:
Input: "<TEXT>"
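The annotation loop implied by the prompt can be sketched as follows. The prompt text is the one given above (abbreviated here); how the model's reply was actually consumed is not stated, so the JSON parsing below is an assumption based on the requested output format:

```python
import json

def build_prompt(text: str) -> str:
    # Only the final lines of the template are shown; in practice the full
    # instruction block above precedes them, with <TEXT> replaced by the tweet.
    return f'Classify:\nInput: "{text}"'

def parse_reply(reply: str) -> str:
    # The model is instructed to answer like {"classification": "fire"}
    label = json.loads(reply)["classification"]
    if label not in {"fire", "flood", "none"}:
        raise ValueError(f"unexpected label: {label}")
    return label

print(parse_reply('{"classification": "fire"}'))  # fire
```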
 
Once labelled, machine-assisted label cleaning was performed using Cleanlab, following its documentation. Labels flagged as errors and labels with a quality score below 0.3 were dropped; we further re-labelled `fire` and `flood` instances with a label quality below 0.7 as `none`. Our annotation and label-cleaning process can be found in the repository.
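A minimal sketch of the cleaning thresholds just described (the Cleanlab scoring itself is not reproduced; `label_quality` is assumed to be the Cleanlab score attached to each instance):

```python
def clean_label(label: str, quality: float):
    """Apply the thresholds from the text: drop labels below 0.3 quality,
    and re-label low-confidence fire/flood (below 0.7) as none."""
    if quality < 0.3:
        return None               # dropped from the dataset
    if label in ("fire", "flood") and quality < 0.7:
        return "none"             # re-labelled
    return label

print(clean_label("fire", 0.5))   # none
print(clean_label("flood", 0.9))  # flood
print(clean_label("none", 0.2))   # None (dropped)
```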

The `test` data was first machine annotated and cleaned, and then reviewed by 2 human annotators. In this case, labels flagged as errors were also reviewed rather than automatically discarded.

Who are the annotators?

The annotators of the `test` set have native or high-level proficiency in English, Catalan, and Spanish.

Personal and Sensitive Information

N/A

Considerations for Using the Data

Social Impact of Dataset

We hope this data can improve research into weather emergency detection in social media data.

Discussion of Biases

We are aware that, since the data comes from social media, it may contain biases, hate speech, and toxic content. We have not applied any steps to reduce their impact.

Other Known Limitations

The texts in the dataset must be downloaded from Twitter; some instances may therefore be lost where tweets have been deleted and are no longer available.

Additional Information

Dataset Curators

Language Technologies Unit (LangTech) at the Barcelona Supercomputing Center.

This work has been developed under the EU-funded CREXDATA Project (Grant Agreement No. 101092749).

Licensing Information

Files

CREXWET.zip (14.3 MB)
md5:146ae724e91214ef2a78a568a4d8d1db

Additional details

Funding

European Commission
CREXDATA - Critical Action Planning over Extreme-Scale Data 101092749