Published September 1, 2025 | Version v2
Dataset Restricted

MultiCheckWorthy (MultiCW) dataset

  • Kempelen Institute of Intelligent Technologies

Description

The MultiCheckWorthy (MultiCW) dataset is a balanced multilingual benchmark dataset for check-worthy claim detection, covering 16 languages, 6 topical domains, and 2 writing styles. It consists of 123,722 samples, evenly distributed between noisy and structured texts, with balanced representation of the check-worthy and non-check-worthy classes across all languages. Each claim is accompanied by its English translation, detected topic, writing style, language code, check-worthiness label, and the list of detected named entities. The dataset was compiled from existing datasets and balanced by translating samples from those datasets and by adding samples collected from Wikipedia.
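The per-sample fields listed above can be pictured as a simple record type. The following is a minimal sketch only: the file format and column names of the released dataset are not stated here, so the class name, the field names, and the label encoding (1 for check-worthy, 0 otherwise) are all hypothetical and would need to be adapted to the actual files.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MultiCWSample:
    """One claim record. All field names are hypothetical placeholders."""
    text: str       # claim in its original language
    text_en: str    # English translation of the claim
    topic: str      # detected topical domain
    style: str      # writing style: "noisy" or "structured"
    lang: str       # language code
    label: int      # assumed encoding: 1 = check-worthy, 0 = not check-worthy
    entities: list[str] = field(default_factory=list)  # detected named entities


def label_balance(samples):
    """Count how many samples of each label occur per language code."""
    counts = {}
    for sample in samples:
        counts.setdefault(sample.lang, Counter())[sample.label] += 1
    return counts
```

A helper such as `label_balance` could be used to confirm the stated class balance per language once the records are loaded.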

The dataset is partitioned into training, validation, and test sets. In addition, we construct a separate out-of-distribution (OOD) set consisting of 4 other languages (it, mk, nl and my) to evaluate model generalization beyond the in-distribution data. Below is the number of samples included in each set:

Set                          Samples
Train                         86,691
Validation                    18,491
Test                          18,540
Out-of-distribution (OOD)     27,761
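As a quick sanity check on these counts, the three in-distribution sets add up to the 123,722 samples stated in the description, which corresponds to roughly a 70/15/15 split; the OOD set is counted separately. The short sketch below reproduces that arithmetic; the variable names are illustrative only.

```python
# Sample counts taken from the table above.
splits = {"train": 86_691, "validation": 18_491, "test": 18_540}
ood = 27_761  # out-of-distribution set, not part of the in-distribution total

# In-distribution total and the share of each split.
total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

for name, share in shares.items():
    print(f"{name}: {splits[name]:,} samples ({share:.1%})")
print(f"in-distribution total: {total:,}")
print(f"OOD (separate): {ood:,}")
```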

 

Notes

If you would like to request access to these files, please fill out the form below.

For the request to be accepted, you must agree to the following terms:

  1. You will use the dataset strictly for research purposes. The request for access to the dataset must be sent from an official, existing e-mail address of the relevant university, faculty, or other scientific or research institution (for verification purposes).
  2. You will not attempt to identify, deanonymize or contact the authors of the social media posts included in this dataset.
  3. You will not re-share the dataset (or any of its parts) with anyone else not included in this request.  
  4. You will appropriately cite the papers mentioned in the dataset description in any publication, project, or tool using this dataset.
  5. You understand how the dataset was created and that the manual or automatically predicted annotations may not be 100% correct. 
  6. You acknowledge that you are fully responsible for the use of the dataset (data) and for any infringement of rights of third parties (in particular copyright) that may arise from its use beyond the intended purposes. Neither the authors nor Kempelen Institute of Intelligent Technologies (KInIT) are responsible for your actions.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details

Funding

European Commission
vera.ai - vera.ai: VERification Assisted by Artificial Intelligence 101070093
European Commission
EU NextGenerationEU through the Recovery and Resilience Plan for Slovakia 09I01-03-V04-00006