Published April 23, 2025 | Version 1.0
Dataset Restricted

AFP-Sum: A Generative-AI-Driven Claim Retrieval System Capable of Detecting and Retrieving Claims from Social Media Platforms in Multiple Languages

  • 1. Kempelen Institute of Intelligent Technologies

Description

A Generative-AI-Driven Claim Retrieval System Capable of Detecting and Retrieving Claims from Social Media Platforms in Multiple Languages

Abstract: Online disinformation poses a global challenge, placing significant demands on fact-checkers who must verify claims efficiently to prevent the spread of false information. A major issue in this process is the redundant verification of already fact-checked claims, which increases workload and delays responses to newly emerging claims. This research introduces an approach that retrieves previously fact-checked claims, evaluates their relevance to a given input, and provides supplementary information to support fact-checkers. Our method employs large language models (LLMs) to filter irrelevant fact-checks and generate concise summaries and explanations, enabling fact-checkers to assess more quickly whether a claim has been verified before. In addition, we evaluate our approach through both automatic and human assessments, where humans interact with the developed tool to review its effectiveness. Our results demonstrate that LLMs are able to filter out many irrelevant fact-checks and, therefore, reduce effort and streamline the fact-checking process.

Paper: https://arxiv.org/abs/2504.20668

GitHub Repository: https://github.com/kinit-sk/claim-retrieval

The data are available upon request for research purposes only.

References

If you use this dataset in any publication, project, tool or in any other form, please cite the following paper:

@misc{vykopal2025generativeaidrivenclaimretrievalcapable,
      title={A Generative-AI-Driven Claim Retrieval System Capable of Detecting and Retrieving Claims from Social Media Platforms in Multiple Languages}, 
      author={Ivan Vykopal and Martin Hyben and Robert Moro and Michal Gregor and Jakub Simko},
      year={2025},
      eprint={2504.20668},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.20668}, 
}

Content

  1. afp-sum.csv - AFP-Sum dataset consisting of around 19K fact-checks across 23 languages

    • id - Article ID

    • url - A URL of a fact-checking article

    • text - A text extracted from the fact-checking article

    • summary - A summary extracted from the fact-checking article

    • processed_text - Text of the fact-checking article without the summary

    • language - Language of the fact-checking article

  2. sample2.csv - Sample of 2 fact-checking articles per language from the AFP-Sum dataset

    • id - Article ID

    • url - A URL of a fact-checking article

    • text - A text extracted from the fact-checking article

    • summary - A summary extracted from the fact-checking article

    • processed_text - Text of the fact-checking article without the summary

    • language - Language of the fact-checking article

  3. sample100.csv - Sample of 100 fact-checking articles per language from the AFP-Sum dataset

    • id - Article ID

    • url - A URL of a fact-checking article

    • text - A text extracted from the fact-checking article

    • summary - A summary extracted from the fact-checking article

    • processed_text - Text of the fact-checking article without the summary

    • language - Language of the fact-checking article

  4. fact_checks_metadata.csv - Metadata for the fact-checking articles in the MultiClaim dataset

    • fact_check_id - ID of the fact-check in the original MultiClaim dataset

    • url - A URL of the fact-checking article

    • rating_category - Rating extracted from the fact-checks metadata

    • language - Language of the fact-checking article

    • published_at - Publication date of the fact-checking article
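The column layout listed above for afp-sum.csv and its two samples can be checked at load time. The following is a minimal sketch using only the Python standard library. The helper names (`read_afp_sum`, `count_by_language`), the demo rows, and the language codes `en` and `sk` are assumptions made for illustration; they are not taken from the dataset or from the authors' repository.

```python
import csv
import io
from collections import Counter

# Column names as listed in the Content section for afp-sum.csv,
# sample2.csv and sample100.csv.
AFP_SUM_COLUMNS = ["id", "url", "text", "summary", "processed_text", "language"]


def read_afp_sum(handle):
    """Read rows from an open text handle and verify the expected header."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in AFP_SUM_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return list(reader)


def count_by_language(rows):
    """Count fact-checking articles per value of the language column."""
    return Counter(row["language"] for row in rows)


# Made-up stand-in rows (NOT real data); the language codes are assumed.
_DEMO = io.StringIO(
    "id,url,text,summary,processed_text,language\n"
    "1,https://example.org/a,Summary A. Body A.,Summary A.,Body A.,en\n"
    "2,https://example.org/b,Summary B. Body B.,Summary B.,Body B.,en\n"
    "3,https://example.org/c,Summary C. Body C.,Summary C.,Body C.,sk\n"
)
demo_rows = read_afp_sum(_DEMO)
demo_counts = count_by_language(demo_rows)
```

With access to the files, the same helper could be applied to a real file, for example `with open("sample2.csv", newline="", encoding="utf-8") as f: rows = read_afp_sum(f)`. The UTF-8 encoding is an assumption, and very long article texts may require raising the limit with `csv.field_size_limit`.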

Acknowledgments

This project is funded by the European Media and Information Fund (grant number 291191). The sole responsibility for any content supported by the European Media and Information Fund lies with the author(s) and it may not necessarily reflect the positions of the EMIF and the Fund Partners, the Calouste Gulbenkian Foundation and the European University Institute.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

If you would like to request access to these files, please fill out the form below.

You need to satisfy these conditions in order for this request to be accepted:

In order to share the dataset with you, please agree to the following terms:

  1. You will use the dataset strictly for research purposes. The access request must be sent from an official, existing e-mail address of the relevant university, faculty, or other scientific or research institution (for verification purposes).
  2. You will not re-share the dataset (or any of its parts) with anyone else not included in this request. 
  3. For the AFP-Sum dataset, you will follow the AFP organization's terms of use, which do not allow use of the data for commercial purposes.
  4. You will appropriately cite the papers mentioned in the dataset description in any publication, project, or tool that uses this dataset.
  5. You acknowledge that you are fully responsible for the use of the dataset (data) and for any infringement of rights of third parties (in particular copyright) that may arise from its use beyond the intended purposes. Neither the authors nor Kempelen Institute of Intelligent Technologies (KInIT) are responsible for your actions.
