AFP-Sum: A Generative-AI-Driven Claim Retrieval System Capable of Detecting and Retrieving Claims from Social Media Platforms in Multiple Languages

Vykopal, Ivan; Hyben, Martin; Móro, Róbert; Gregor, Michal; Simko, Jakub

doi:10.5281/zenodo.15267292

Published April 23, 2025 | Version 1.0

Dataset Restricted

1. Kempelen Institute of Intelligent Technologies

Abstract: Online disinformation poses a global challenge, placing significant demands on fact-checkers who must verify claims efficiently to prevent the spread of false information. A major issue in this process is the redundant verification of already fact-checked claims, which increases workload and delays responses to newly emerging claims. This research introduces an approach that retrieves previously fact-checked claims, evaluates their relevance to a given input, and provides supplementary information to support fact-checkers. Our method employs large language models (LLMs) to filter irrelevant fact-checks and generate concise summaries and explanations, enabling fact-checkers to assess more quickly whether a claim has been verified before. In addition, we evaluate our approach through both automatic and human assessments, in which humans interact with the developed tool to review its effectiveness. Our results demonstrate that LLMs are able to filter out many irrelevant fact-checks and, therefore, reduce effort and streamline the fact-checking process.

Paper: https://arxiv.org/abs/2504.20668

GitHub Repository: https://github.com/kinit-sk/claim-retrieval

The data are available upon request for research purposes only.

References

If you use this dataset in any publication, project, tool or in any other form, please cite the following paper:

@misc{vykopal2025generativeaidrivenclaimretrievalcapable,
title={A Generative-AI-Driven Claim Retrieval System Capable of Detecting and Retrieving Claims from Social Media Platforms in Multiple Languages},
author={Ivan Vykopal and Martin Hyben and Robert Moro and Michal Gregor and Jakub Simko},
year={2025},
eprint={2504.20668},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.20668},
}

Content

afp-sum.csv - AFP-Sum dataset consisting of around 19K fact-checks across 23 languages

id - Article ID
url - A URL of a fact-checking article
text - A text extracted from the fact-checking article
summary - A summary extracted from the fact-checking article
processed_text - Text of the fact-checking article without the summary
language - Language of the fact-checking article

sample2.csv - Sample of 2 fact-checking articles per language from the AFP-Sum dataset

id - Article ID
url - A URL of a fact-checking article
text - A text extracted from the fact-checking article
summary - A summary extracted from the fact-checking article
processed_text - Text of the fact-checking article without the summary
language - Language of the fact-checking article

sample100.csv - Sample of 100 fact-checking articles per language from the AFP-Sum dataset

id - Article ID
url - A URL of a fact-checking article
text - A text extracted from the fact-checking article
summary - A summary extracted from the fact-checking article
processed_text - Text of the fact-checking article without the summary
language - Language of the fact-checking article
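
Because afp-sum.csv, sample2.csv and sample100.csv share the same six columns, a single loader can check the header and count articles per language. The following is a minimal sketch using only the Python standard library. The in-memory rows, the example.org URLs and the language codes "eng" and "spa" are made up for illustration, since the real files are access-restricted and the description does not state how languages are encoded.

```python
import csv
import io
from collections import Counter

# Column names as listed in the dataset description above.
EXPECTED_COLUMNS = ["id", "url", "text", "summary", "processed_text", "language"]


def load_fact_checks(handle):
    """Read fact-check rows from an open CSV file and verify the header."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def articles_per_language(rows):
    """Count how many articles each language contributes."""
    return Counter(row["language"] for row in rows)


# Made-up stand-in for one of the CSV files (hypothetical values only).
demo_csv = io.StringIO(
    "id,url,text,summary,processed_text,language\n"
    "1,https://example.org/a,Summary A. Body A.,Summary A.,Body A.,eng\n"
    "2,https://example.org/b,Summary B. Body B.,Summary B.,Body B.,eng\n"
    "3,https://example.org/c,Resumen C. Cuerpo C.,Resumen C.,Cuerpo C.,spa\n"
)

rows = load_fact_checks(demo_csv)
counts = articles_per_language(rows)
print(counts)
```

With the real files, the same functions could be called on a handle such as open("sample2.csv", newline="", encoding="utf-8"); the UTF-8 encoding is an assumption, not something the description confirms.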

fact_checks_metadata.csv - Metadata for the MultiClaim dataset, in particular for its fact-checking articles

fact_check_id - ID of the fact-check in the original MultiClaim dataset
url - A URL of the fact-checking article
rating_category - Rating extracted from the fact-checks metadata
language - Language of the fact-checking article
published_at - Publication date of the fact-checking article
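
The article files and fact_checks_metadata.csv both carry a url column, so one plausible way to attach ratings and publication dates to articles is to match rows on that column. The sketch below assumes that URLs are written identically in both files; the description does not confirm this, nor does it say whether id and fact_check_id share an ID space. The helper names and all row values are hypothetical.

```python
def index_by_url(rows):
    """Map each URL to its row; a later duplicate URL overwrites an earlier one."""
    return {row["url"]: row for row in rows}


def attach_metadata(articles, metadata):
    """Return copies of the article rows with rating_category and published_at added.

    Articles without a matching URL in the metadata get None for both fields.
    """
    meta_by_url = index_by_url(metadata)
    merged = []
    for article in articles:
        meta = meta_by_url.get(article["url"])
        merged.append({
            **article,
            "rating_category": meta["rating_category"] if meta else None,
            "published_at": meta["published_at"] if meta else None,
        })
    return merged


# Made-up rows standing in for afp-sum.csv and fact_checks_metadata.csv.
articles = [
    {"id": "1", "url": "https://example.org/a", "language": "eng"},
    {"id": "2", "url": "https://example.org/b", "language": "spa"},
]
metadata = [
    {"fact_check_id": "900", "url": "https://example.org/a",
     "rating_category": "False", "language": "eng", "published_at": "2021-05-01"},
]

merged = attach_metadata(articles, metadata)
for row in merged:
    print(row["id"], row["rating_category"], row["published_at"])
```

Matching on exact URL strings is the simplest choice; if the real files differ in trailing slashes or URL scheme, the URLs would need to be normalized before indexing.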

Acknowledgments

This project is funded by the European Media and Information Fund (grant number 291191). The sole responsibility for any content supported by the European Media and Information Fund lies with the author(s) and it may not necessarily reflect the positions of the EMIF and the Fund Partners, the Calouste Gulbenkian Foundation and the European University Institute.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

If you would like to request access to these files, please fill out the form below.

In order to share the dataset with you, please agree to the following terms:

You will use the dataset strictly for research purposes. The request for access to the dataset must be sent from an official, existing e-mail address of the relevant university, faculty or other scientific or research institution (for verification purposes).
You will not re-share the dataset (or any of its parts) with anyone else not included in this request.
For the AFP-Sum dataset, you will follow the AFP organization's terms of use, which do not allow use of the data for commercial purposes.
You will appropriately cite the papers mentioned in the dataset description in any publication, project, or tool that uses this dataset.
You acknowledge that you are fully responsible for the use of the dataset (data) and for any infringement of rights of third parties (in particular copyright) that may arise from its use beyond the intended purposes. Neither the authors nor Kempelen Institute of Intelligent Technologies (KInIT) are responsible for your actions.

Additional details

Repository URL: https://github.com/kinit-sk/claim-retrieval
