Published December 16, 2025 | Version v1
Dataset Open

CREXWET-SYNTH: CREXdata-Weather-Emergency-Twitter-SYNTHetic

  • 1. Barcelona Supercomputing Center

Description

Dataset Summary


The CREXWET-SYNTH dataset is a multilingual corpus of synthetically generated, tweet-style texts, labelled by their relevance to a flash flood or a wildfire and intended for weather event detection. The dataset contains 37k unique sentences (9k German, 10k English, 9k Spanish, 8k Catalan).

Supported Tasks and Leaderboards

text-classification: This dataset can be used to train models for flood and wildfire detection.

Languages

The languages included in the dataset are:
- English (`en`)
- German (`de`)
- Spanish (`es`)
- Catalan (`ca-ES`)

Dataset Structure

Data Instances
 
Each instance in the dataset contains the following fields:

- id: index id specific to this dataset.
- text: generated text in the form of a tweet.
- language: language of the text. Possible values: ENGLISH, GERMAN, SPANISH, CATALAN.
- label: event label. Possible values: fire, flood, none.
- label_quality: label quality score provided by Cleanlab.
- model: LLM used to generate the text.
- prompt_category: type of prompt used to generate the text.


Data Fields

{
  "id": "0",
  "text": "Ufff, quin partit del Barça ahir! 😍 Sembla que la Lliga ja és nostra! #ForçaBarça #ViscaElBarça",
  "language": "CATALAN",
  "label": "none",
  "label_quality": "0.995009975117508",
  "model": "gemma_3",
  "prompt_category": "unrelated_to_crisis_discussing_random_topics"
}
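
As a minimal sketch of how such a record might be read, the snippet below parses the example above, checks `label` and `language` against the documented values, and converts `label_quality` from a string to a float. The function `parse_record`, the added `language_code` key, and the pairing of `language` values with the codes listed under Languages are illustrative assumptions; the layout of the files inside the archive is not described here, so only a single JSON object is handled.

```python
import json

# One record, copied from the example above. How records are laid out inside
# CREXWET-SYNTH.zip is not described here, so only a single object is parsed.
RAW_RECORD = '''
{
  "id": "0",
  "text": "Ufff, quin partit del Barça ahir! 😍 Sembla que la Lliga ja és nostra! #ForçaBarça #ViscaElBarça",
  "language": "CATALAN",
  "label": "none",
  "label_quality": "0.995009975117508",
  "model": "gemma_3",
  "prompt_category": "unrelated_to_crisis_discussing_random_topics"
}
'''

# Assumed pairing of `language` values with the codes listed under "Languages".
LANGUAGE_CODES = {"ENGLISH": "en", "GERMAN": "de", "SPANISH": "es", "CATALAN": "ca-ES"}
ALLOWED_LABELS = {"fire", "flood", "none"}


def parse_record(raw: str) -> dict:
    """Parse one JSON record, validate its categorical fields, convert types."""
    rec = json.loads(raw)
    if rec["label"] not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {rec['label']!r}")
    if rec["language"] not in LANGUAGE_CODES:
        raise ValueError(f"unexpected language: {rec['language']!r}")
    # The example stores the score as a string, so convert it to a float.
    rec["label_quality"] = float(rec["label_quality"])
    # Hypothetical convenience key; it is not a field of the dataset.
    rec["language_code"] = LANGUAGE_CODES[rec["language"]]
    return rec


record = parse_record(RAW_RECORD)
```

Converting `label_quality` up front makes later numeric filtering straightforward, since the example shows the score stored as a quoted string.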


Dataset Creation


Curation Rationale

This dataset was created to augment real data used to train a weather emergency detection model within the CREXDATA project (Grant Agreement No. 101092749).

Source Data

Initial Data Collection and Normalization

The data was generated using the 8-bit quantized versions of Google’s Gemma 3 27B and MistralAI’s Mistral Small 24B. The models were prompted to generate texts in the following categories:

- `from_affected_persons_with_keywords`: The LLM was instructed to generate social media posts written from the perspective of individuals affected by a wildfire or flood. The prompt included guidance on tone and content, along with a list of keywords to incorporate. These are annotated as 'fire' or 'flood' depending on the incident used in the prompt.
- `from_government_and_meteorological_agencies_with_warning_alerts`: The LLM was instructed to simulate posts issued by government or meteorological agencies, particularly those providing public warnings and alerts. This helped introduce an institutional perspective into the dataset. These are annotated as 'fire' or 'flood' depending on the incident used in the prompt.
- `unrelated_to_crisis_discussing_random_topics`: To simulate unrelated social media posts, the LLM was prompted to produce posts about various topics such as politics, celebrities, cancer, music, sports, food, lifestyle, memes, breaking news, personal updates, and tourism. These topics were chosen to represent general social media noise. These are annotated as 'none'.
- `related_to_crisis_but_lacking_useful_information`: The LLM was instructed to generate posts that mention the incident but provide no actionable content, such as posts requesting donations, expressing sympathy, criticizing authorities, or promoting conspiracy theories. These are annotated as 'none'.

The prompts for each category can be found at this repository.
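
The relation between prompt category and initial annotation described above can be sketched as a small lookup. This is only an illustration: the names `INFORMATIVE_CATEGORIES`, `NON_INFORMATIVE_CATEGORIES`, and `initial_label` are hypothetical, and the mapping covers the label assigned at generation time, before any later label cleaning.

```python
from typing import Optional

# Categories whose posts take the label of the incident used in the prompt.
INFORMATIVE_CATEGORIES = {
    "from_affected_persons_with_keywords",
    "from_government_and_meteorological_agencies_with_warning_alerts",
}

# Categories whose posts are annotated as 'none' regardless of the incident.
NON_INFORMATIVE_CATEGORIES = {
    "unrelated_to_crisis_discussing_random_topics",
    "related_to_crisis_but_lacking_useful_information",
}


def initial_label(prompt_category: str, incident: Optional[str] = None) -> str:
    """Return the label implied by a prompt category (hypothetical helper).

    `incident` is the event named in the prompt, given as 'fire' or 'flood'.
    """
    if prompt_category in INFORMATIVE_CATEGORIES:
        if incident not in {"fire", "flood"}:
            raise ValueError("informative categories need incident 'fire' or 'flood'")
        return incident
    if prompt_category in NON_INFORMATIVE_CATEGORIES:
        return "none"
    raise ValueError(f"unknown prompt category: {prompt_category!r}")
```

For example, `initial_label("from_affected_persons_with_keywords", "flood")` yields `"flood"`, while `initial_label("related_to_crisis_but_lacking_useful_information", "flood")` yields `"none"`.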

Who are the source language producers?

The source language was produced by the LLMs used for generation.

Annotations


Annotation process

The annotations were produced by the LLMs. They were further cleaned using Cleanlab, following its documentation. Instances flagged as label errors and instances with a label quality below 0.3 were dropped. We further re-labelled `fire` and `flood` instances with a label quality below 0.7 as `none`. Our annotation and label cleaning process can be found at the repository.
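
The two thresholds can be illustrated with a short sketch. It is not the project's cleaning code, which relies on Cleanlab and lives in the repository; it only applies the stated rules to records that already carry a `label_quality` score. The function name `apply_thresholds` and the boolean key `is_label_issue` (standing in for instances flagged as label errors) are assumptions and are not fields of the released dataset.

```python
def apply_thresholds(records, drop_below=0.3, relabel_below=0.7):
    """Apply the drop and re-label rules to an iterable of record dicts."""
    cleaned = []
    for rec in records:
        quality = float(rec["label_quality"])
        # Drop flagged label errors (hypothetical key) and very low scores.
        if rec.get("is_label_issue", False) or quality < drop_below:
            continue
        out = dict(rec)  # copy so the input records are left untouched
        # Event labels with a middling score are re-labelled as 'none'.
        if out["label"] in {"fire", "flood"} and quality < relabel_below:
            out["label"] = "none"
        cleaned.append(out)
    return cleaned


# Made-up records for illustration only; they are not taken from the dataset.
sample = [
    {"id": "a", "label": "fire", "label_quality": "0.2"},
    {"id": "b", "label": "fire", "label_quality": "0.5"},
    {"id": "c", "label": "flood", "label_quality": "0.9"},
    {"id": "d", "label": "none", "label_quality": "0.5"},
    {"id": "e", "label": "flood", "label_quality": "0.95", "is_label_issue": True},
]
result = apply_thresholds(sample)
```

With these made-up inputs, records `a` and `e` are dropped, `b` is re-labelled as `none`, and `c` and `d` keep their labels.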

Who are the annotators?

The annotators were the LLMs used for generation.

Personal and Sensitive Information

N/A

Considerations for Using the Data


Social Impact of Dataset

We hope this data can improve research into weather emergency detection in social media data.

Discussion of Biases

We are aware that, since the data comes from LLMs, it can contain biases, hate speech and toxic content. We have not applied any steps to reduce their impact.

Other Known Limitations

The dataset is fully generated by LLMs and should be used solely for augmenting real data.

Additional Information


Dataset Curators

Language Technologies Unit (LangTech) at the Barcelona Supercomputing Center.

This work has been developed under the EU-funded CREXDATA Project (Grant Agreement No. 101092749).

Since part of this data was generated using Google's Gemma 3 model, its usage should follow the model's Terms of Use and Prohibited Use Policy.

Licensing Information


Files

CREXWET-SYNTH.zip (1.6 MB)
md5:3759c13ec2bf097fcf38340537e947cc

Additional details

Funding

European Commission
CREXDATA - Critical Action Planning over Extreme-Scale Data 101092749