Published March 25, 2025 | Version 1.0.0
Dataset Open

FactSpan: Multilingual Fact-Checking Dataset

Description

The FactSpan dataset is an extension of the X-Fact dataset, designed to support multilingual fact-checking research. It extends the coverage of X-Fact by incorporating more recent claims from the ClaimReview Markup for Data Commons Feed, and it adds detailed per-claim annotations.

Key Features:

  • Data Source: Claims are sourced from both the X-Fact dataset (up to 2020) and the Data Commons Feed (post-2020).
  • Validity: Only claims from organizations recognized by the International Fact-Checking Network (IFCN) and the Duke Reporters’ Lab are included, so that every verdict comes from an established fact-checking organization.
  • Standardized Labels: Verdict labels are standardized into five categories: False, Mostly False, Partly False/Misleading, Mostly True, and True.
  • Annotations (Annotated Dataset Only): The FactSpan_annotated.csv dataset includes the following fields, with annotations generated using GPT-3.5:
    • label: The standardized verdict label.
    • claim: The fact-checked claim.
    • claimDate: The date of the claim.
    • claim_year: The year of the claim.
    • language: The language of the claim.
    • Position Statements: Indicates the presence of position statements.
    • Entity/Event Properties: Indicates the presence of entity or event properties.
    • Quote: Indicates the presence of quotes.
    • Numerical Data: Indicates the presence of numerical data.
    • claim type: Categorizes the claim as factual or opinion.
    • topics: Categorizes the claim into one of five predefined topics (Health and Pandemics, Politics and Governance, Society and Culture, Economy and Environment, Conflict and Security).
    • mapped_label: An additional mapped label, for edge cases or further label mappings.
  • Unannotated Dataset: The FactSpan.csv dataset includes:
    • label: The standardized verdict label.
    • claim: The fact-checked claim.
    • claimDate: The date of the claim.
    • language: The language of the claim.
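
As an illustration of how the fields listed above might be used, the following is a minimal sketch that reads the unannotated file with Python's standard csv module, counts verdict labels per language, and flags any label outside the five standardized categories. It assumes that the CSV header uses the field names exactly as listed (label, claim, claimDate, language), that the file is UTF-8 encoded, and that labels are spelled as in the description. None of these assumptions is confirmed by this record, and the helper names are hypothetical.

```python
import csv
from collections import Counter

# The five standardized verdict labels named in the description.
# Assumption: the CSV stores them with exactly this spelling.
VERDICT_LABELS = [
    "False",
    "Mostly False",
    "Partly False/Misleading",
    "Mostly True",
    "True",
]


def load_claims(csv_file):
    """Read all rows from an open CSV file object into a list of dicts."""
    return list(csv.DictReader(csv_file))


def label_counts(rows, language=None):
    """Count verdict labels, optionally restricted to one language value."""
    selected = [r for r in rows if language is None or r["language"] == language]
    return Counter(r["label"] for r in selected)


def unexpected_labels(rows):
    """Return label values that fall outside the five documented categories."""
    return sorted({r["label"] for r in rows} - set(VERDICT_LABELS))


# Example usage (assumes FactSpan.csv has been downloaded to the working directory):
#
#     with open("FactSpan.csv", newline="", encoding="utf-8") as f:
#         rows = load_claims(f)
#     print(label_counts(rows))
#     print(unexpected_labels(rows))
```

How languages are encoded in the language column (for example, codes or full names) is not stated here, so it is worth inspecting the distinct values before filtering on one.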

Purpose:

This dataset aims to facilitate research in multilingual fact-checking, providing a comprehensive and up-to-date resource for developing and evaluating fact-checking models.

Repository:

The dataset is maintained in the GitHub repository (https://github.com/lorraine-dev/FactSpan), which also contains scripts for expanding and updating the dataset.

This work was supported by the German Research Foundation (DFG, project no. 504226141).

Files

Files (25.1 MB)

Checksum Size
md5:069bd2039175db0d1af701348a818d8a 11.0 MB
md5:254724da69ecee21209ecefbcaf79b4a 14.1 MB
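
The MD5 checksums listed above can be used to check that a download completed intact. The following is a minimal sketch using Python's hashlib. The helper names are hypothetical, and because this listing does not show which checksum belongs to which file, the expected value is passed in by the caller rather than hard-coded per file name.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed value such as 'md5:069bd203...'."""
    # Drop the optional 'md5:' prefix and normalize case before comparing.
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Example usage (the local file name is an assumption):
#
#     ok = matches_checksum("FactSpan.csv", "md5:069bd2039175db0d1af701348a818d8a")
#     print("checksum matches" if ok else "checksum differs")
```

A mismatch may only mean that the other listed checksum applies to that file, so it is reasonable to compare against both values before re-downloading.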

Additional details

Software

Repository URL
https://github.com/lorraine-dev/FactSpan
Programming language
Python
Development Status
Active