Published November 29, 2025 | Version v1
Dataset Open

Dominique: An AI-Powered Fact-Checking Chatbot for Democratizing Access to Reliable Information

Description

This repository contains two datasets developed for research in Brazilian Portuguese fake news detection:

  1. Golden Dataset: The primary dataset, comprising 22,044 unique news articles (11,145 fake, 10,899 true) in Brazilian Portuguese. It was created by merging and deduplicating three established corpora (Fake.Br, FakeTrueBR, and FakeRecogna) to form a larger, more robust, and nearly balanced resource. It includes extensive metadata such as source, publication date, and author, along with linguistic features, to support the development of advanced machine learning models.

  2. Gemini Validation Dataset: A synthetic, health-focused dataset of 1,000 news instances (labeled as true or fake) generated using Google's Gemini LLM. It was created for external validation, to test how well trained models generalize to unseen, out-of-distribution topics, simulating a real-world fact-checking scenario.
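
The merge-and-deduplicate step described for the Golden Dataset can be sketched roughly as follows. This is an illustrative sketch only, not the authors' actual pipeline: the `text` and `label` field names, the normalization rule, and the helper names `normalize`, `merge_and_deduplicate`, and `label_counts` are all assumptions made for the example.

```python
import hashlib
import unicodedata
from collections import Counter


def normalize(text: str) -> str:
    """Normalize text so trivially different copies compare equal (assumed rule)."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(folded.split())  # collapse runs of whitespace


def merge_and_deduplicate(*corpora):
    """Concatenate corpora, keeping the first occurrence of each normalized text.

    Each corpus is an iterable of dicts with hypothetical keys "text" and "label".
    """
    seen = set()
    merged = []
    for corpus in corpora:
        for row in corpus:
            key = hashlib.sha1(normalize(row["text"]).encode("utf-8")).hexdigest()
            if key in seen:
                continue  # duplicate article already kept from an earlier corpus
            seen.add(key)
            merged.append(row)
    return merged


def label_counts(rows):
    """Count rows per label, e.g. to check the class balance after merging."""
    return Counter(row["label"] for row in rows)


if __name__ == "__main__":
    corpus_a = [{"text": "Vacina causa  gripe", "label": "fake"}]
    corpus_b = [
        {"text": "vacina causa gripe", "label": "fake"},  # duplicate of corpus_a
        {"text": "Outra notícia", "label": "true"},
    ]
    merged = merge_and_deduplicate(corpus_a, corpus_b)
    print(len(merged), dict(label_counts(merged)))
```

Exact-match deduplication on normalized text is the simplest possible choice; it does not catch near-duplicates such as lightly edited reposts, which would need a fuzzy-matching step.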

Files

gemini_dataset.csv

Files (201.4 MB)

md5:7f243e09f5215c0365b21957cc58a646  2.7 MB
md5:78fc3508594e6aeb9bf0515c8831d014  198.8 MB
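
After downloading, a file can be checked against the MD5 checksums listed above. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_md5` are illustrative, and the path passed in is whatever local filename the download was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Compare a file's MD5 with an expected value, accepting an optional 'md5:' prefix."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `verify_md5("gemini_dataset.csv", "md5:7f243e09f5215c0365b21957cc58a646")` returns `True` only if that local file matches that checksum. Which checksum belongs to which file is not stated in the listing, so a mismatch against one value may simply mean the other checksum applies.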

Additional details

Software

Repository URL: https://github.com/autoaihub/Dominique
Programming language: Python
Development Status: Active