Published December 16, 2025 | Version v1
Dataset
Open
CREXWET-SYNTH: CREXdata-Weather-Emergency-Twitter-SYNTHetic
Authors/Creators
Description
Dataset Summary
The CREXWET-SYNTH dataset is a multilingual corpus of synthetically generated tweet-style texts, each labelled by its relevance to a flash flood or wildfire, intended for weather event detection. The dataset contains 37k unique sentences (9k German, 10k English, 9k Spanish, 8k Catalan).
Supported Tasks and Leaderboards
text-classification: This dataset can be used to train models for flood and wildfire detection.
Languages
The languages included in the dataset are:
- English (`en`)
- German (`de`)
- Spanish (`es`)
- Catalan (`ca-ES`)
Dataset Structure
Data Instances
Each instance in the dataset contains the following fields:
- `id`: index id specific to this dataset.
- `text`: generated text in the form of a tweet.
- `language`: language of the text. Possible values: ENGLISH, GERMAN, SPANISH, CATALAN.
- `label`: event label. Possible values: fire, flood, none.
- `label_quality`: label quality score provided by Cleanlab.
- `model`: LLM used to generate the text.
- `prompt_category`: type of prompt used to generate the text.
Data Fields
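A minimal sketch of how one record could be parsed into a typed structure, assuming each instance is available as a JSON object with the fields listed above (the on-disk format inside the archive is not stated in this card). The `Instance` class and `parse_instance` helper are hypothetical names, not part of the dataset's tooling. The example record in this card shows `id` and `label_quality` serialized as strings, so the sketch converts them explicitly.

```python
import json
from dataclasses import dataclass

# Allowed values as listed in the field descriptions of this card.
ALLOWED_LANGUAGES = {"ENGLISH", "GERMAN", "SPANISH", "CATALAN"}
ALLOWED_LABELS = {"fire", "flood", "none"}


@dataclass(frozen=True)
class Instance:  # hypothetical container, not part of the dataset tooling
    id: int
    text: str
    language: str
    label: str
    label_quality: float
    model: str
    prompt_category: str


def parse_instance(raw: str) -> Instance:
    """Parse one JSON-encoded record and validate its categorical fields."""
    record = json.loads(raw)
    if record["language"] not in ALLOWED_LANGUAGES:
        raise ValueError(f"unexpected language: {record['language']!r}")
    if record["label"] not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {record['label']!r}")
    return Instance(
        id=int(record["id"]),  # stored as a string in the example record
        text=record["text"],
        language=record["language"],
        label=record["label"],
        label_quality=float(record["label_quality"]),  # also a string
        model=record["model"],
        prompt_category=record["prompt_category"],
    )


# The example record from this card, re-serialized for the sketch.
sample = json.dumps({
    "id": "0",
    "text": "Ufff, quin partit del Barça ahir! 😍 Sembla que la Lliga ja és nostra! #ForçaBarça #ViscaElBarça",
    "language": "CATALAN",
    "label": "none",
    "label_quality": "0.995009975117508",
    "model": "gemma_3",
    "prompt_category": "unrelated_to_crisis_discussing_random_topics",
})
instance = parse_instance(sample)
print(instance.label, instance.label_quality)
```

Validating `language` and `label` on load is a design choice of this sketch; it surfaces unexpected values early rather than at training time.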
{ "id": "0", "text": "Ufff, quin partit del Barça ahir! 😍 Sembla que la Lliga ja és nostra! #ForçaBarça #ViscaElBarça", "language": "CATALAN", "label": "none", "label_quality": "0.995009975117508", "model": "gemma_3", "prompt_category": "unrelated_to_crisis_discussing_random_topics"}
Dataset Creation
Curation Rationale
This dataset was created to augment real data used to train a weather emergency detection model within the CREXDATA project (Grant Agreement No. 101092749).
Source Data
Initial Data Collection and Normalization
The data was generated using the 8-bit quantized versions of Google’s Gemma 3 27B and MistralAI’s Mistral Small 24B. The models were prompted to generate texts in the following categories:
- `from_affected_persons_with_keywords`: The LLM was instructed to generate social media posts written from the perspective of individuals affected by a wildfire or flood. The prompt included guidance on tone and content, along with a list of keywords to incorporate. These are annotated as 'fire' or 'flood' depending on the incident used in the prompt.
- `from_government_and_meteorological_agencies_with_warning_alerts`: The LLM was instructed to simulate posts issued by government or meteorological agencies, particularly those providing public warnings and alerts. This helped introduce an institutional perspective into the dataset. These are annotated as 'fire' or 'flood' depending on the incident used in the prompt.
- `unrelated_to_crisis_discussing_random_topics`: To simulate unrelated social media posts, the LLM was prompted to produce posts about various topics such as politics, celebrities, cancer, music, sports, food, lifestyle, memes, breaking news, personal updates, and tourism. These topics were chosen to represent general social media noise. These are annotated as 'none'.
- `related_to_crisis_but_lacking_useful_information`: The LLM was instructed to generate posts that mention the incident but provide no actionable content, such as posts requesting donations, expressing sympathy, criticizing authorities, or promoting conspiracy theories. These are annotated as 'none'.
The prompts for each category can be found at this repository.
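The mapping from prompt category to initial annotation described above can be sketched as follows. This is an illustration only: the `initial_label` function and the two constant names are hypothetical, and the authors' actual generation code lives in their repository.

```python
from typing import Optional

# Categories whose label follows the incident named in the prompt.
INCIDENT_CATEGORIES = {
    "from_affected_persons_with_keywords",
    "from_government_and_meteorological_agencies_with_warning_alerts",
}
# Categories that are always annotated as 'none'.
NONE_CATEGORIES = {
    "unrelated_to_crisis_discussing_random_topics",
    "related_to_crisis_but_lacking_useful_information",
}


def initial_label(prompt_category: str, incident: Optional[str] = None) -> str:
    """Return the annotation implied by the prompt category (hypothetical helper)."""
    if prompt_category in NONE_CATEGORIES:
        # Posts may mention an incident, but they carry no useful information.
        return "none"
    if prompt_category in INCIDENT_CATEGORIES:
        if incident not in ("fire", "flood"):
            raise ValueError(f"incident must be 'fire' or 'flood', got {incident!r}")
        return incident
    raise ValueError(f"unknown prompt category: {prompt_category!r}")
```

For example, `initial_label("from_affected_persons_with_keywords", "flood")` yields `"flood"`, while a post from `related_to_crisis_but_lacking_useful_information` is labelled `"none"` even when the prompt named a fire.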
Who are the source language producers?
The source language was produced by the LLMs used for generation.
Annotations
Annotation process
The annotations were produced by the LLMs and then cleaned using Cleanlab, following its documentation. Instances flagged as label errors and instances with a label quality below 0.3 were dropped. Instances labelled `fire` or `flood` with a label quality below 0.7 were re-labelled as `none`. Our annotation and label cleaning process can be found at the repository.
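The two thresholds can be illustrated with a short sketch. It assumes each record is a dictionary carrying `label` and `label_quality`, and it does not reproduce Cleanlab's own detection of label errors; `clean_labels`, the constant names, and the toy records are hypothetical and are not taken from the dataset.

```python
DROP_BELOW = 0.3      # records with lower label quality are dropped
RELABEL_BELOW = 0.7   # fire/flood records with lower quality become 'none'


def clean_labels(records):
    """Apply the drop and re-label thresholds to a list of record dicts."""
    cleaned = []
    for record in records:
        quality = float(record["label_quality"])
        if quality < DROP_BELOW:
            continue  # too unreliable to keep
        label = record["label"]
        if label in ("fire", "flood") and quality < RELABEL_BELOW:
            label = "none"  # event label not trusted enough
        cleaned.append({**record, "label": label})
    return cleaned


# Toy records for illustration only (not from the dataset).
toy_records = [
    {"id": "a", "label": "flood", "label_quality": "0.95"},
    {"id": "b", "label": "fire", "label_quality": "0.55"},
    {"id": "c", "label": "none", "label_quality": "0.12"},
    {"id": "d", "label": "none", "label_quality": "0.45"},
]
cleaned_records = clean_labels(toy_records)
```

In this toy run, record `c` is dropped, record `b` is re-labelled from `fire` to `none`, and records `a` and `d` keep their labels.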
Who are the annotators?
The annotators were the LLMs used for generation.
Personal and Sensitive Information
N/A
Considerations for Using the Data
Social Impact of Dataset
We hope this data can improve research into weather emergency detection in social media data.
Discussion of Biases
We are aware that, since the data comes from LLMs, it can contain biases, hate speech and toxic content. We have not applied any steps to reduce their impact.
Other Known Limitations
The dataset is fully generated by LLMs and should be used solely for augmenting real data.
Additional Information
Dataset Curators
Language Technologies Unit (LangTech) at the Barcelona Supercomputing Center.
This work has been developed under the EU-funded CREXDATA Project (Grant Agreement No. 101092749).
Since part of this data was generated using Google's Gemma 3 model, its usage should follow the model's Terms of Use and Prohibited Use Policy.
Licensing Information
Files
| Name | Size | Checksum |
|---|---|---|
| CREXWET-SYNTH.zip | 1.6 MB | md5:3759c13ec2bf097fcf38340537e947cc |