# Spanish Fake News Dataset

This dataset contains a structured and annotated collection of false
news items in Spanish (Castilian), gathered and processed for academic
research on misinformation.

## Dataset Scope

The dataset covers most of the **false news messages and their
variations** recorded by the sources listed below up to **01.02.2021**.

## Content Description

The dataset includes samples of false information in various formats:

-   News articles and headlines
-   Tweets and Facebook/Instagram/Telegram posts
-   YouTube video captions
-   WhatsApp text and voice message transcripts
-   Transcribed video/audio fragments with false claims
-   Fake government documents
-   Captions from photos and memes
-   Text extracted from images using OCR

Only **Spanish (Castilian)** texts were used. Texts in Spain's other
languages (e.g., Catalan, Basque, Galician) were excluded for
consistency.

### Sources

The data was collected from the following verified fact-checking
initiatives:

-   [Maldito Bulo](https://maldita.es/malditobulo/)
-   [Newtral](https://www.newtral.es/zona-verificacion/fakes/)
-   [AFP Factual](https://factual.afp.com/)

Fact-checkers from these organizations provide detailed articles
identifying and explaining falsehoods, often including:

-   General context of the event
-   Quotes or links to false claims
-   Analysis and explanation of why the claims are false
-   Verified information or corrections

## Collection Method

The dataset was built using both **manual extraction** (e.g.,
identifying and quoting false statements) and **automated parsing**:

-   **MyNews** service: an archive of Spanish mass media
-   **Custom scripts**: for parsing and extracting structured data
-   **OCR tools**: for extracting text from images (e.g., memes and
    screenshots)

------------------------------------------------------------------------

## Fields Description

  ----------------------------------------------------------------------------------
  Column Name        Description
  ------------------ ---------------------------------------------------------------
  `Topic`            The thematic category of the news item (e.g., *Politics*,
                     *Health*, *COVID-19*, *Crime*). Normalized and translated to
                     English.

  `Link source`      URL to the original news piece, fact-check report, or source of
                     the claim. Invalid links were removed.

  `Media`            The platform or outlet where the false claim appeared (e.g.,
                     Facebook, YouTube, WhatsApp). Normalized for consistent
                     spelling and language.

  `Date`             Publication or verification date of the news item, in
                     `YYYY-MM-DD` format.

  `Author`           (Optional) Author of the news or platform source, if available.
                     May be empty.

  `Headlines`        Title or summary of the news item or article containing the
                     false information.

  `Fake statement`   Quoted false claim or misinformation as cited in the
                     verification article.
  ----------------------------------------------------------------------------------
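
The field layout above can be checked when loading the data. The following is a minimal sketch, assuming the dataset is distributed as a CSV file whose header matches the column names in the table; the file format, the `load_rows` helper, and the sample row are illustrative assumptions, not part of the dataset.

```python
import csv
import io
from datetime import date

# Column names as listed in the Fields Description table.
EXPECTED_COLUMNS = [
    "Topic", "Link source", "Media", "Date",
    "Author", "Headlines", "Fake statement",
]


def load_rows(handle):
    """Read rows from a CSV file object (hypothetical helper).

    Checks that every documented column is present, parses `Date`,
    and maps an empty `Author` cell to None.
    """
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        # Date is documented as YYYY-MM-DD, which fromisoformat accepts.
        row["Date"] = date.fromisoformat(row["Date"])
        # Author is optional and may be empty.
        row["Author"] = row["Author"] or None
        rows.append(row)
    return rows


# Made-up placeholder row for illustration; not taken from the dataset.
SAMPLE_CSV = (
    "Topic,Link source,Media,Date,Author,Headlines,Fake statement\n"
    "Health,https://example.org/check,WhatsApp,2020-04-15,,"
    "Placeholder headline,Placeholder claim\n"
)

rows = load_rows(io.StringIO(SAMPLE_CSV))
print(rows[0]["Date"].year, rows[0]["Author"])
```

To read a real file, pass an open file object instead of the `io.StringIO` wrapper, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the dataset is stored locally.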

------------------------------------------------------------------------

## ⚠️ Notes

-   The dataset was preprocessed to remove duplicates, invalid links,
    and non-textual clutter.
-   Field values (e.g., `Topic` and `Media`) were normalized for
    consistent spelling and language to support multilingual and
    cross-platform analysis.
-   Only Castilian Spanish was retained for consistency and clarity.
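
The preprocessing described in these notes could look roughly like the sketch below. It is not the authors' actual pipeline: the helper names, the rule for what counts as a valid link, the choice to drop a whole record when its link is invalid, and the whitespace/case normalization used to detect duplicates are all assumptions made for illustration.

```python
from urllib.parse import urlparse


def is_valid_link(url):
    """Assumed rule: a link is valid if it has an http(s) scheme and a host."""
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def normalize_text(text):
    """Collapse whitespace and ignore case so near-identical claims compare equal."""
    return " ".join(text.split()).casefold()


def clean(records):
    """Drop records with invalid links and repeated fake statements.

    The first occurrence of each statement is kept.
    """
    seen = set()
    kept = []
    for rec in records:
        if not is_valid_link(rec["Link source"]):
            continue
        key = normalize_text(rec["Fake statement"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept


# Made-up placeholder records for illustration; not taken from the dataset.
records = [
    {"Link source": "https://example.org/a", "Fake statement": "Placeholder claim"},
    {"Link source": "https://example.org/b", "Fake statement": "placeholder   CLAIM"},
    {"Link source": "not a url", "Fake statement": "Another placeholder claim"},
]

cleaned = clean(records)
print(len(cleaned))
```

Here the second record is treated as a duplicate of the first after normalization, and the third is dropped because its link does not parse as an http(s) URL.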

## 📚 License & Use

This dataset is intended for **non-commercial academic and research
purposes**. Please cite the original fact-checking organizations and
this dataset if used in publications or analysis.