# Spanish Fake News Dataset

This dataset contains a structured and annotated collection of false news items in Spanish (Castilian), gathered and processed for academic research on misinformation.

## Dataset Scope

The dataset represents most of the recorded **false news messages and their variations** up to **01.02.2021**.

## Content Description

The dataset includes samples of false information in various formats:

- News articles and headlines
- Tweets and Facebook/Instagram/Telegram posts
- YouTube video captions
- WhatsApp text and voice message transcripts
- Transcribed video/audio fragments with false claims
- Fake government documents
- Captions from photos and memes
- Text extracted from images using OCR

Only **Spanish (Castilian)** texts were used. Texts in Spain's other languages (e.g., Catalan, Basque, Galician) were excluded for consistency.

### Sources

The data was collected from the following verified fact-checking initiatives:

- [Maldito Bulo](https://maldita.es/malditobulo/)
- [Newtral](https://www.newtral.es/zona-verificacion/fakes/)
- [AFP Factual](https://factual.afp.com/)

Fact-checkers from these organizations publish detailed articles identifying and explaining falsehoods, often including:

- General context of the event
- Quotes or links to false claims
- Analysis and explanation of why the claims are false
- Verified information or corrections

## Collection Method

The dataset was built using both **manual extraction** (e.g., identifying and quoting false statements) and **automated parsing**:

- **MyNews** service: an archive of Spanish mass media
- **Custom scripts**: for parsing and extracting structured data
- **OCR tools**: for extracting text from images (e.g., memes and screenshots)

------------------------------------------------------------------------

## Fields Description

| Column Name | Description |
|-------------|-------------|
| `Topic` | The thematic category of the news item (e.g., *Politics*, *Health*, *COVID-19*, *Crime*). Normalized and translated to English. |
| `Link source` | URL to the original news piece, fact-check report, or source of the claim. Invalid links were removed. |
| `Media` | The platform or outlet where the false claim appeared (e.g., Facebook, YouTube, WhatsApp). Normalized for consistent spelling and language. |
| `Date` | Publication or verification date of the news item, in `YYYY-MM-DD` format. |
| `Author` | (Optional) Author of the news or platform source, if available. May be empty. |
| `Headlines` | Title or summary of the news item or article containing the false information. |
| `Fake statement` | Quoted false claim or misinformation as cited in the verification article. |

------------------------------------------------------------------------

## ⚠️ Notes

- The dataset was preprocessed to remove duplicates, invalid links, and non-textual clutter.
- Field values were normalized to support multilingual and cross-platform analysis.
- Only Castilian Spanish was retained for consistency and clarity.

## 📚 License & Use

This dataset is intended for **non-commercial academic and research purposes**. Please cite the original fact-checking organizations and this dataset if used in publications or analysis.
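
## Usage Sketch

The field descriptions in this README can be turned into a simple sanity check when loading the data. The following is a minimal sketch, not the authors' tooling. It assumes the dataset is distributed as a CSV file whose header uses the column names listed under Fields Description; the file format is not stated in this README. The rows in `SAMPLE_CSV` are placeholders written for illustration and are not taken from the dataset, and the helper names `load_rows` and `has_valid_date` are hypothetical.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Column names as listed in the Fields Description table.
EXPECTED_COLUMNS = [
    "Topic", "Link source", "Media", "Date",
    "Author", "Headlines", "Fake statement",
]

# Placeholder rows for illustration only; they are NOT taken from the dataset.
# The third row deliberately uses a date that is not in YYYY-MM-DD format.
SAMPLE_CSV = """Topic,Link source,Media,Date,Author,Headlines,Fake statement
Health,https://example.org/check-1,WhatsApp,2020-03-15,,Placeholder headline 1,Placeholder claim 1
Politics,https://example.org/check-2,Facebook,2020-11-02,Placeholder author,Placeholder headline 2,Placeholder claim 2
Health,https://example.org/check-3,YouTube,15/03/2020,,Placeholder headline 3,Placeholder claim 3
"""


def load_rows(handle):
    """Read rows from a CSV handle and check that all expected columns exist."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def has_valid_date(row):
    """Return True if the Date field is a zero-padded YYYY-MM-DD date."""
    value = row["Date"]
    if len(value) != 10:
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


# Swap io.StringIO(SAMPLE_CSV) for open(path, newline="", encoding="utf-8")
# to read a real file (the path and encoding are assumptions).
rows = load_rows(io.StringIO(SAMPLE_CSV))
valid_rows = [row for row in rows if has_valid_date(row)]
topic_counts = Counter(row["Topic"] for row in valid_rows)

print(f"{len(valid_rows)} of {len(rows)} rows have a well-formed Date")
print(dict(topic_counts))
```

Because `Author` is optional, the sketch treats an empty string in that column as valid and only checks that the column itself is present.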