Published May 12, 2023 | Version 1.0
Dataset Restricted

MultiClaim: Multilingual Previously Fact-Checked Claim Retrieval

Description

MultiClaim (Multilingual Previously Fact-Checked Claim Retrieval) is a dataset for training and testing models that combat disinformation. The dataset consists of 206k claims fact-checked by professional fact-checkers and 28k social media posts gathered from the wild. Each social media post has at least one claim assigned. The main idea is to develop information retrieval models that assign the appropriate claims to each post.
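The retrieval task can be pictured with a toy baseline. The sketch below is not the method from the paper: it ranks a handful of made-up claims against a made-up post by token overlap (Jaccard similarity). The function names `tokenize`, `jaccard` and `rank_claims` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_claims(post_text, claims, k=3):
    """Return the ids of the k claims most similar to the post.

    `claims` maps a claim id to its text. Ties are broken by claim id
    so that the ranking is deterministic.
    """
    post_tokens = tokenize(post_text)
    scored = [
        (jaccard(post_tokens, tokenize(text)), claim_id)
        for claim_id, text in claims.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [claim_id for _, claim_id in scored[:k]]


# Made-up toy data, not taken from the dataset.
toy_claims = {
    1: "Hospitals are empty during the pandemic",
    2: "Drinking hot water cures the flu",
    3: "A new law bans cash payments",
}
toy_post = "They say the hospitals are full, but the hospitals are empty!"

top = rank_claims(toy_post, toy_claims, k=2)
```

A real system would likely replace the token-overlap score with a multilingual text encoder or a lexical ranker. The overall shape, scoring every claim against a post and keeping the top k, stays the same.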

Paper: https://aclanthology.org/2023.emnlp-main.1027/

Preprint: https://arxiv.org/abs/2305.07991

GitHub repository: https://github.com/kinit-sk/multiclaim

 

References

If you use this dataset in any publication, project, tool or in any other form, please cite the following paper:

@inproceedings{pikuliak-etal-2023-multilingual,
    title = "Multilingual Previously Fact-Checked Claim Retrieval",
    author = "Pikuliak, Mat{\'u}{\v{s}} and Srba, Ivan and Moro, Robert and Hromadka, Timo and Smole{\v{n}}, Timotej and Meli{\v{s}}ek, Martin and Vykopal, Ivan and Simko, Jakub and Podrou{\v{z}}ek, Juraj and Bielikova, Maria",
    editor = "Bouamor, Houda  and Pino, Juan  and Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.1027",
    doi = "10.18653/v1/2023.emnlp-main.1027",
    pages = "16477--16500",
}

 

Contents

fact_check_post_mapping.csv - Mapping between fact checks and social media posts:

fact_check_id

post_id

fact_checks.csv - Data about fact-checks:

fact_check_id

claim - This is the translated text (see below) of the fact-check claim

instances - Instances of the fact-check – a list of timestamps and URLs.

title - This is the translated text (see below) of the fact-check title

posts.csv - Data about social media posts:

post_id

instances - Instances of the post – a list of timestamps and the social media platforms on which the post appeared.

ocr - This is a list of translated texts (see below) of the OCR transcripts based on the images attached to the post.

verdicts - This is a list of verdicts attached by Meta (e.g., False information)

text - This is the translated text (see below) of the text written by the user.
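A minimal sketch of how the mapping file ties posts to fact-checks, using only the column names listed above. The inline CSV is an invented stand-in for the restricted file, integer ids are an assumption, and `build_post_to_fact_checks` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict


def build_post_to_fact_checks(mapping_file):
    """Map each post_id to the list of fact_check_ids assigned to it."""
    post_to_fact_checks = defaultdict(list)
    for row in csv.DictReader(mapping_file):
        # Assumption: both id columns hold integers.
        post_to_fact_checks[int(row["post_id"])].append(int(row["fact_check_id"]))
    return dict(post_to_fact_checks)


# Invented stand-in for fact_check_post_mapping.csv (the real file is restricted).
sample_mapping = io.StringIO(
    "fact_check_id,post_id\n"
    "10,1\n"
    "11,1\n"
    "12,2\n"
)

pairs = build_post_to_fact_checks(sample_mapping)
```

With access to the real data, the same function could be called on `open("fact_check_post_mapping.csv", newline="", encoding="utf-8")`, assuming the file has a header row with these two column names.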

 

What is a translated text?

A tuple of text, its translation to English and detected languages, e.g., in the sample below we have an original Croatian text, its translation to English and finally the predicted language composition (hbs = Serbo-Croatian):

( '"...bolnice su pune? tišina, muk...upravo sada, bolnica Rebro..tragično smešno', '"...hospitals are full? silence, silence... right now, Rebro hospital... tragically funny', [('hbs', 1.0)] )
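If the CSV cells store this tuple as a Python literal (an assumption based on the sample above, not something the description states), it could be parsed with `ast.literal_eval` from the standard library. `TranslatedText` and `parse_translated_text` are hypothetical names.

```python
import ast
from typing import List, NamedTuple, Tuple


class TranslatedText(NamedTuple):
    original: str
    english: str
    languages: List[Tuple[str, float]]


def parse_translated_text(cell: str) -> TranslatedText:
    """Parse one cell holding (original, translation, [(language, share), ...])."""
    original, english, languages = ast.literal_eval(cell)
    return TranslatedText(original, english, list(languages))


# Invented cell in the same shape as the sample above.
cell = "('Bolnice su pune', 'Hospitals are full', [('hbs', 1.0)])"
parsed = parse_translated_text(cell)
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for untrusted files. Empty cells (for example, posts without text) would need to be skipped before calling the parser.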

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

If you would like to request access to these files, please fill out the form below.

For the request to be accepted, you must agree to the following terms:

  1. You will use the dataset strictly for research purposes. The request for access to the dataset must be sent from an official, existing e-mail address of the relevant university, faculty or other scientific or research institution (for verification purposes).
  2. You will not attempt to identify, deanonymize or contact the authors of the social media posts included in this dataset.
  3. You will not re-share the dataset with anyone else not included in this request.
  4. You will appropriately cite the papers mentioned in the dataset description in any publication, project or tool that uses this dataset.
  5. You understand how the dataset was created and that the manual or automatically predicted annotations may not be 100% correct. 
  6. You acknowledge that you are fully responsible for the use of the dataset (data) and for any infringement of rights of third parties (in particular copyright) that may arise from its use beyond the intended purposes. Neither the authors nor Kempelen Institute of Intelligent Technologies (KInIT) are responsible for your actions.


Additional details

Funding

DisAI – Improving scientific excellence and creativity in combating disinformation with artificial intelligence and language technologies 101079164
European Commission