Published May 19, 2025 | Version v2
Dataset Restricted

MultiClaim Dataset v2

Description

The MultiClaim v2 dataset is an extension of the original MultiClaim dataset. It consists of 435k claims fact-checked by professional fact-checkers and 89k social media posts containing these claims, all of which were published before April 2025. There are 105k pairs of fact-checked claims and social media posts in total; each social media post has at least one claim assigned.

The dataset is available for research purposes only. It is intended to be used for the task of previously fact-checked claim retrieval (sometimes also called claim matching), i.e., to develop information retrieval models that will assign appropriate claims to all the posts.
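
As a rough illustration of the retrieval task, the sketch below ranks a few made-up claims for one post by simple token overlap and checks whether a correct claim lands in the top k results. The scoring function, the toy data, and the names `rank_claims` and `success_at_k` are illustrative assumptions, not the method of the original paper; a real system would use a much stronger retriever.

```python
def tokenize(text):
    """Lowercase whitespace tokenization (deliberately naive)."""
    return set(text.lower().split())


def rank_claims(post, claims):
    """Return claim ids sorted by Jaccard overlap with the post, best first."""
    post_tokens = tokenize(post)
    scores = {}
    for claim_id, claim in claims.items():
        claim_tokens = tokenize(claim)
        union = post_tokens | claim_tokens
        scores[claim_id] = len(post_tokens & claim_tokens) / len(union) if union else 0.0
    # Sort by descending score; ties are broken by claim id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def success_at_k(ranked_ids, relevant_ids, k):
    """True if at least one relevant claim appears among the top-k results."""
    return any(cid in relevant_ids for cid in ranked_ids[:k])


# Toy data, invented for this example.
claims = {
    1: "Drinking hot water cures the flu",
    2: "The moon landing was staged in a studio",
    3: "A new law bans all cash payments",
}
post = "they say the moon landing was staged lol"
ranked = rank_claims(post, claims)
print(ranked, success_at_k(ranked, {2}, k=1))
```

In practice, the set of relevant claim ids for each post would come from the post-to-claim pairs shipped with the dataset.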

The original paper: https://aclanthology.org/2023.emnlp-main.1027/

GitHub repository: https://github.com/kinit-sk/multiclaim

 

References

If you use this dataset in any publication, project, tool, or in any other form, please cite the Zenodo MultiClaim v2 dataset together with the following paper:

@inproceedings{pikuliak-etal-2023-multilingual,
    title = "Multilingual Previously Fact-Checked Claim Retrieval",
    author = "Pikuliak, Mat{\'u}{\v{s}} and Srba, Ivan and Moro, Robert and Hromadka, Timo and Smole{\v{n}}, Timotej and Meli{\v{s}}ek, Martin and Vykopal, Ivan and Simko, Jakub and Podrou{\v{z}}ek, Juraj and Bielikova, Maria",
    editor = "Bouamor, Houda  and Pino, Juan  and Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.1027",
    doi = "10.18653/v1/2023.emnlp-main.1027",
    pages = "16477--16500",
}

 

Dataset Description

fact_check_post_mapping.csv - Mapping between fact checks and social media posts:

fact_check_id

post_id

relationship - either claimreview_schema, backlink, or similarity:identical (for identical claims)
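
A minimal sketch of how the mapping file might be read with the Python standard library. It uses an in-memory stand-in for `fact_check_post_mapping.csv`; the header row, the comma delimiter, and integer ids are assumptions about the file layout, and the sample rows are invented.

```python
import csv
import io
from collections import Counter

# Toy stand-in for fact_check_post_mapping.csv; header row and comma
# delimiter are assumptions about the actual file layout.
sample = io.StringIO(
    "fact_check_id,post_id,relationship\n"
    "10,1,backlink\n"
    "11,1,claimreview_schema\n"
    "10,2,similarity:identical\n"
)

rows = list(csv.DictReader(sample))

# How many pairs come from each relationship type.
relationship_counts = Counter(row["relationship"] for row in rows)

# Group fact-check ids by post, since one post may have several claims.
claims_per_post = {}
for row in rows:
    post_id = int(row["post_id"])
    claims_per_post.setdefault(post_id, []).append(int(row["fact_check_id"]))

print(relationship_counts)
print(claims_per_post)
```

To run this against the real file, replace the `io.StringIO` object with an opened file handle, for example `open("fact_check_post_mapping.csv", newline="", encoding="utf-8")`.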

 

fact_checks.csv - Data about fact-checks:

fact_check_id

claim - fact-checked claim

title - title of the fact-checking article containing the fact-checked claim

claim_en - English translation of the claim using Google Translate

title_en - English translation of the title using Google Translate

claim_detected_language - detected language of the claim using Google Translate in BCP 47 format

title_detected_language - detected language of the title using Google Translate in BCP 47 format

claim_detected_language_iso - detected language of the claim using Google Translate in ISO 639 format

title_detected_language_iso - detected language of the title using Google Translate in ISO 639 format

instances - instances of the fact-check – a list of Unix timestamps and URLs, following the MultiClaim v1 format

claim_v1 - claim in the MultiClaim v1 format

title_v1 - title in the MultiClaim v1 format

ratings - veracity ratings of the claim provided by the fact-checkers (aggregated across all instances)
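
The `instances` column is described above as a list of Unix timestamps and URLs. The sketch below assumes, as an unverified guess about the serialization, that each cell holds a Python-style literal such as `[(1650000000.0, 'https://example.org/a')]`. If the files use another encoding (for example JSON), the parsing step would need to change. The helper name `parse_instances` and the sample cell are hypothetical.

```python
import ast
from datetime import datetime, timezone


def parse_instances(cell):
    """Parse a serialized list of (timestamp, url) pairs into (datetime, url) pairs."""
    parsed = []
    for timestamp, url in ast.literal_eval(cell):
        # A missing timestamp (None) is kept as None rather than guessed.
        if timestamp is None:
            when = None
        else:
            when = datetime.fromtimestamp(timestamp, tz=timezone.utc)
        parsed.append((when, url))
    return parsed


# Made-up cell value for illustration only.
cell = "[(1650000000.0, 'https://example.org/fact-check'), (None, 'https://example.org/other')]"
instances = parse_instances(cell)
print(instances)
```

`ast.literal_eval` is used rather than `eval` because it only accepts literals and therefore cannot execute arbitrary code embedded in a cell.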

 

posts.csv - Data about social media posts:

post_id

post_body - anonymized post’s body (text)

post_body_en - English translation of the anonymized post’s body (text) using Google Translate

post_detected_language - detected language of the post using Google Translate in BCP 47 format

post_detected_language_iso - detected language of the post using Google Translate in ISO 639 format

instances - instances of the post – a list of Unix timestamps and the social media platforms on which the post appeared, following the MultiClaim v1 format

ocr - a list of the OCR transcripts based on the images attached to the post (if present). It follows the MultiClaim v1 format, i.e., it is a list of tuples, where each tuple consists of the anonymized OCR transcript (using Google Vision), its English translation using Google Translate, and a list of detected languages (provided in the response by Google Vision).

verdicts - a list of verdicts attached by Meta (e.g., False information), aggregated across all instances

text_v1 - post’s body in MultiClaim v1 format
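
Since a post's content may be split between its body and the text in attached images, one possible way to build a single English text per post is sketched below. It assumes that the `ocr` cell is serialized as a Python-style literal list of `(transcript, english_translation, detected_languages)` tuples; that encoding, the shape of the language list, the helper name `post_text_en`, and the sample values are assumptions made for illustration.

```python
import ast


def post_text_en(post_body_en, ocr_cell):
    """Concatenate the translated post body with translated OCR transcripts."""
    parts = [post_body_en] if post_body_en else []
    # An empty cell is treated as "no images attached".
    for _original, english, _languages in ast.literal_eval(ocr_cell or "[]"):
        if english:
            parts.append(english)
    return " ".join(parts)


# Made-up values for illustration only.
ocr_cell = "[('Ahoj svet', 'Hello world', ['sk'])]"
text = post_text_en("Look at this picture", ocr_cell)
print(text)
```

Whether to merge OCR text into the post text or keep it as a separate field is a modelling choice; the dataset description does not prescribe either.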

 

Ethical considerations

Most of the ethical, legal and societal issues tied to the MultiClaim dataset were already described in the Ethical Considerations section of the original paper accompanying MultiClaim v1. The most severe risks were tied to a Terms of Service (ToS) violation, various types of privacy intrusions, the possibility of third-party misuse, or the erosion of some privacy rights such as the right to erasure.

For MultiClaim v2, we have reassessed the risk of a potential violation of the ToS of the social media platforms in light of the new EU digital regulations. Exploratory research on very large online platforms is now legally permitted by Article 40(12) of the EU Digital Services Act (DSA) if the research concerns systemic risks. As the spread of disinformation is clearly a systemic risk as foreseen by Recital 83 of the DSA, we see this as an argument in favor of the further use of the MultiClaim v2 dataset. The MultiClaim v2 dataset also contains a small number of posts from Telegram, which is not considered a very large online platform. However, we include only posts from public groups/channels that were identified and linked by fact-checkers. As with all other posts, we anonymize any personal information in the content and do not publish links to the posts.

We include fact-checks from fact-checking organizations, collected using Google Fact Check Explorer and a limited number of custom scrapers. To be indexed in Google Fact Check Explorer, fact-checkers need to provide metadata using the ClaimReview schema or upload the content themselves to increase its visibility. When handling potentially copyrighted content, we apply the research exemption under EU Directive 2019/790. Nevertheless, the full texts of the fact-checks are not published to avoid possible copyright violations. We publish only the claims, titles, and fact-checkers’ ratings. We always attribute the claims by providing the original URL of the fact-checking article.

To minimize the risk of third-party misuse, the dataset is available only to researchers for non-commercial research purposes.

Although we have not deliberately included sensitive topics, due to the nature of disinformation, the dataset may contain some sensitive societal topics regarding the LGBTQIA+ community, war, humanitarian crises, child abuse, terrorism, or other political and theological topics.

Finally, to support users' rights to rectification and erasure in case of the publication of incorrect or sensitive information, we provide a procedure for them to request the removal of their posts from the dataset or to flag the inaccuracies in the data. To do this, users can contact the authors using the contact form provided for accessing the dataset.

Acknowledgements

This work was supported by the European Media and Information Fund (grant number 291191). The sole responsibility for any content supported by the European Media and Information Fund lies with the author(s) and it may not necessarily reflect the positions of the EMIF and the Fund Partners, the Calouste Gulbenkian Foundation and the European University Institute. It was also partially supported by AI-CODE, a project funded by the European Union under the Horizon Europe programme, GA No. 101135437.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

If you would like to request access to these files, please fill out the form below.

You need to satisfy these conditions in order for this request to be accepted:

In order to share the dataset with you, please agree to the following terms:

  1. You will use the dataset strictly for research purposes. The request for access to the dataset must be sent from an official, existing e-mail address of the relevant university, faculty, or other scientific or research institution (for verification purposes).
  2. You will not attempt to identify, deanonymize or contact the authors of the social media posts included in this dataset.
  3. You will not re-share the dataset (or any of its parts) with anyone else not included in this request.  
  4. You will appropriately cite the papers mentioned in the dataset description in any publication, project, or tool using this dataset.
  5. You understand how the dataset was created and that the manual or automatically predicted annotations may not be 100% correct. 
  6. You acknowledge that you are fully responsible for the use of the dataset (data) and for any infringement of rights of third parties (in particular copyright) that may arise from its use beyond the intended purposes. Neither the authors nor Kempelen Institute of Intelligent Technologies (KInIT) are responsible for your actions.


Additional details

Related works

Is described by
Publication: 10.18653/v1/2023.emnlp-main.1027 (DOI)

Funding

European Commission
AI-CODE - AI services for COntinuous trust in emerging Digital Environments (grant 101135437)