MultiClaim: Multilingual Previously Fact-Checked Claim Retrieval
Creators
- 1. Kempelen Institute of Intelligent Technologies
Description
MultiClaim (Multilingual Previously Fact-Checked Claim Retrieval) is a dataset for training and testing models used to combat disinformation. It consists of 206k claims fact-checked by professional fact-checkers and 28k social media posts gathered from the wild. Each social media post has at least one claim assigned. The main idea is to develop information retrieval models that assign the appropriate claims to each post.
Paper: https://aclanthology.org/2023.emnlp-main.1027/
Preprint: https://arxiv.org/abs/2305.07991
GitHub repository: https://github.com/kinit-sk/multiclaim
References
If you use this dataset in any publication, project, tool, or any other form, please cite the following paper:
@inproceedings{pikuliak-etal-2023-multilingual,
title = "Multilingual Previously Fact-Checked Claim Retrieval",
author = "Pikuliak, Mat{\'u}{\v{s}} and Srba, Ivan and Moro, Robert and Hromadka, Timo and Smole{\v{n}}, Timotej and Meli{\v{s}}ek, Martin and Vykopal, Ivan and Simko, Jakub and Podrou{\v{z}}ek, Juraj and Bielikova, Maria",
editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.emnlp-main.1027",
doi = "10.18653/v1/2023.emnlp-main.1027",
pages = "16477--16500",
}
Contents
fact_check_post_mapping.csv - Mapping between fact checks and social media posts:
fact_check_id
post_id
fact_checks.csv - Data about fact-checks:
fact_check_id
claim - This is the translated text (see below) of the fact-check claim
instances - Instances of the fact-check – a list of timestamps and URLs.
title - This is the translated text (see below) of the fact-check title
posts.csv - Data about social media posts:
post_id
instances - Instances of the post – a list of timestamps and the social media platforms on which the post appeared.
ocr - This is a list of translated texts (see below) of the OCR transcripts based on the images attached to the post.
verdicts - This is a list of verdicts attached by Meta (e.g., False information)
text - This is the translated text (see below) of the text written by the user.
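As a rough illustration of how the three files relate, the sketch below groups the rows of fact_check_post_mapping.csv by post, so that each post_id points to the fact_check_id values assigned to it. It assumes the CSV header uses the column names listed above and that the identifiers are integers; the sample rows are made up for the example and are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict


def load_mapping(handle):
    """Group fact_check_id values by post_id from a mapping CSV.

    Assumes a header row with the columns `fact_check_id` and `post_id`
    and integer identifiers (an assumption, not confirmed by the source).
    """
    post_to_fact_checks = defaultdict(list)
    for row in csv.DictReader(handle):
        post_to_fact_checks[int(row["post_id"])].append(int(row["fact_check_id"]))
    return dict(post_to_fact_checks)


# Made-up sample standing in for fact_check_post_mapping.csv.
sample = io.StringIO("fact_check_id,post_id\n10,1\n11,1\n12,2\n")
mapping = load_mapping(sample)
```

With the real file, the same function could be called on `open("fact_check_post_mapping.csv", newline="", encoding="utf-8")`; the resulting dictionary can then serve as the ground truth when evaluating a retrieval model.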
What is a translated text?
A translated text is a tuple of the original text, its English translation, and the detected languages. The sample below shows an original Croatian text, its translation to English, and the predicted language composition (hbs = Serbo-Croatian):
( '"...bolnice su pune ? tišina, muk...upravo sada, bolnica Rebro..tragično smešno', '"...hospitals are full? silence, silence... right now, Rebro hospital... tragically funny', [('hbs', 1.0)] )
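A minimal sketch of reading such a value, assuming each translated-text field is stored in the CSV as the string form of a Python tuple, as the sample above suggests. The helper name `parse_translated_text` and the short example string are hypothetical and only serve the illustration.

```python
import ast


def parse_translated_text(raw):
    """Parse one translated-text field into a dictionary.

    Assumes `raw` is the string form of a Python tuple
    (original, english, [(language_code, share), ...]).
    Empty fields are returned as None.
    """
    if not raw:
        return None
    original, english, languages = ast.literal_eval(raw)
    return {
        "original": original,
        "english": english,
        "languages": dict(languages),
    }


# Made-up example value in the same shape as the sample above.
raw = "('Dobar dan', 'Good day', [('hbs', 1.0)])"
parsed = parse_translated_text(raw)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for data read from a file. For the ocr column, which holds a list of such tuples, the same call would return a list that can be unpacked item by item.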