MultiClaim Dataset v2
Description
The MultiClaim v2 dataset is an extension of the original MultiClaim. It consists of 435k claims fact-checked by professional fact-checkers and 89k social media posts containing these claims, all published before April 2025. There are 105k pairs of fact-checked claims and social media posts in total; each social media post has at least one claim assigned.
The dataset is available for research purposes only. It is intended for the task of previously fact-checked claim retrieval (sometimes also called claim matching), i.e., for developing information retrieval models that assign the appropriate fact-checked claims to the posts.
The original paper: https://aclanthology.org/2023.emnlp-main.1027/
GitHub repository: https://github.com/kinit-sk/multiclaim
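To illustrate the retrieval task, the sketch below ranks fact-checked claims for each post by the cosine similarity of multilingual sentence embeddings. This is only a minimal baseline under assumed settings (the encoder model name is just an example; the file and column names are those listed in the Dataset Description below), not the method evaluated in the paper.
# Minimal retrieval baseline (illustrative sketch, not the paper's method).
# Assumes the CSV files and columns described in the Dataset Description below;
# the encoder model name is only an example of a multilingual sentence encoder.
import pandas as pd
from sentence_transformers import SentenceTransformer, util

fact_checks = pd.read_csv("fact_checks.csv")
posts = pd.read_csv("posts.csv")

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
claim_emb = model.encode(fact_checks["claim"].fillna("").tolist(), convert_to_tensor=True)
post_emb = model.encode(posts["post_body"].fillna("").tolist(), convert_to_tensor=True)

# For every post, retrieve the 10 most similar fact-checked claims.
scores = util.cos_sim(post_emb, claim_emb)            # shape: (num_posts, num_fact_checks)
top_k = scores.topk(k=10, dim=1).indices

for i in range(3):                                    # print a few example retrievals
    retrieved = fact_checks.iloc[top_k[i].tolist()]["fact_check_id"].tolist()
    print(posts.iloc[i]["post_id"], retrieved)
The retrieved claims can then be evaluated against the gold pairs in fact_check_post_mapping.csv, for example with a success-at-k metric.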
References
If you use this dataset in any publication, project, tool or in any other form, please cite the Zenodo MultiClaim v2 dataset together with the following paper:
@inproceedings{pikuliak-etal-2023-multilingual,
title = "Multilingual Previously Fact-Checked Claim Retrieval",
author = "Pikuliak, Mat{\'u}{\v{s}} and Srba, Ivan and Moro, Robert and Hromadka, Timo and Smole{\v{n}}, Timotej and Meli{\v{s}}ek, Martin and Vykopal, Ivan and Simko, Jakub and Podrou{\v{z}}ek, Juraj and Bielikova, Maria",
editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.emnlp-main.1027",
doi = "10.18653/v1/2023.emnlp-main.1027",
pages = "16477--16500",
}
Dataset Description
fact_check_post_mapping.csv - Mapping between fact checks and social media posts:
fact_check_id
post_id
relationship - either claimreview_schema, backlink, or similarity:identical (for identical claims)
fact_checks.csv - Data about fact-checks:
fact_check_id
claim - fact-checked claim
title - title of the fact-checking article containing the fact-checked claim
claim_en - English translation of the claim using Google Translate
title_en - English translation of the title using Google Translate
claim_detected_language - detected language of the claim using Google Translate in BCP 47 format
title_detected_language - detected language of the title using Google Translate in BCP 47 format
claim_detected_language_iso - detected language of the claim using Google Translate in ISO 639 format
title_detected_language_iso - detected language of the title using Google Translate in ISO 639 format
instances - instances of the fact-check – a list of unix timestamps and URLs following the MultiClaim v1 format
claim_v1 - claim in the MultiClaim v1 format
title_v1 - title in the MultiClaim v1 format
ratings - veracity ratings of the claim provided by the fact-checkers (aggregated across all instances)
posts.csv - Data about social media posts:
post_id
post_body - anonymized post’s body (text)
post_body_en - English translation of the anonymized post’s body (text) using Google Translate
post_detected_language - detected language of the post using Google Translate in BCP 47 format
post_detected_language_iso - detected language of the post using Google Translate in ISO 639 format
instances - instances of the post – a list of unix timestamps and the social media platforms, following the MultiClaim v1 format
ocr - a list of OCR transcripts based on the images attached to the post (if present). It follows the MultiClaim v1 format, i.e., it is a list of tuples, where each tuple consists of the anonymized OCR transcript (using Google Vision), its English translation (using Google Translate), and a list of detected languages (as returned by Google Vision).
verdicts - a list of verdicts attached by Meta (e.g., False information), aggregated across all instances
text_v1 - post’s body in MultiClaim v1 format
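A minimal sketch of loading and joining the three CSV files described above is given below. It assumes that list-valued columns (instances, ratings, ocr, verdicts) are serialized as Python literals, as in the MultiClaim v1 release; if the v2 files use a different serialization, the parsing step needs to be adapted.
# Minimal loading/joining sketch. File and column names follow the description above;
# Python-literal serialization of list-valued columns is an assumption carried over from v1.
import ast
import pandas as pd

def parse_literal(value):
    # Parse a stringified Python list/tuple if possible, otherwise return the value unchanged.
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        return value

fact_checks = pd.read_csv("fact_checks.csv")
posts = pd.read_csv("posts.csv")
mapping = pd.read_csv("fact_check_post_mapping.csv")

for column in ("instances", "ratings"):
    fact_checks[column] = fact_checks[column].apply(parse_literal)
for column in ("instances", "ocr", "verdicts"):
    posts[column] = posts[column].apply(parse_literal)

# Join posts with their fact-checked claims through the mapping file.
pairs = mapping.merge(posts, on="post_id", how="left").merge(fact_checks, on="fact_check_id", how="left")
print(len(pairs), "post-claim pairs")                 # ~105k per the description above
print(pairs["relationship"].value_counts())           # claimreview_schema / backlink / similarity:identical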
Ethical considerations
Most of the ethical, legal, and societal issues tied to the MultiClaim dataset were already described in the Ethical Considerations section of the original paper accompanying MultiClaim v1. The most severe risks were related to Terms of Service (ToS) violations, various types of privacy intrusion, the possibility of third-party misuse, and the erosion of privacy rights such as the right to erasure.
For MultiClaim v2, we have reassessed the risk of a potential violation of the ToS of the social media platforms in light of the new EU digital regulations. Exploratory research on very large online platforms is now legally permitted by Article 40(12) of the EU Digital Services Act (DSA) if the research concerns systemic risks. As the spread of disinformation is clearly a systemic risk, as foreseen by Recital 83 of the DSA, we see this as an argument in favor of the further use of the MultiClaim v2 dataset. The MultiClaim v2 dataset also contains a small number of posts from Telegram, which is not considered a very large online platform. However, we include only posts from public groups/channels that were identified and linked by fact-checkers. As with all other posts, we anonymize the content to remove any personal information and do not publish links to the posts.
We include fact-checks from fact-checking organizations using Google Fact Check Explorer and a limited number of custom scrapers. To be indexed in Google Fact Check Explorer, fact-checkers need to provide metadata using the ClaimReview schema or upload the content themselves to increase its visibility. When handling potentially copyrighted content, we apply the research exemption under EU Directive 2019/790. Nevertheless, the full texts of the fact-checks are not published to avoid possible copyright violations. We publish only the claims, titles, and fact-checkers’ ratings. We always attribute the claims by providing the original URL of the fact-checking article.
To minimize the risk of third-party misuse, the dataset is available only to researchers for non-commercial research purposes.
Although we have not deliberately included sensitive topics, due to the nature of disinformation the dataset may touch on sensitive societal topics, such as the LGBTQIA+ community, war and humanitarian crises, child abuse, terrorism, and other political or theological topics.
Finally, to support users' rights to rectification and erasure in case of the publication of incorrect or sensitive information, we provide a procedure for them to request the removal of their posts from the dataset or to flag the inaccuracies in the data. To do this, users can contact the authors using the contact form provided for accessing the dataset.
Acknowledgements
This work was supported by the European Media and Information Fund (grant number 291191). The sole responsibility for any content supported by the European Media and Information Fund lies with the author(s) and it may not necessarily reflect the positions of the EMIF and the Fund Partners, the Calouste Gulbenkian Foundation and the European University Institute. It was also partially supported by AI-CODE, a project funded by the European Union under Horizon Europe, GA No. 101135437.