MultiCheckWorthy (MultiCW) dataset
Contributors
Contact persons:
Description
The MultiCheckWorthy (MultiCW) dataset is a balanced multilingual benchmarking dataset for check-worthy claim detection, covering 16 languages, 6 topical domains, and 2 writing styles. The dataset consists of 123,722 samples, evenly distributed between noisy and structured texts, with balanced representation of the check-worthy and non-check-worthy classes across all languages. Each claim is accompanied by its English translation, detected topic, writing style, language code, check-worthiness label, and a list of detected named entities. The dataset was composed from existing datasets and balanced by translating their samples as well as by adding samples collected from Wikipedia.
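To make the per-claim fields concrete, a single record might look like the following sketch. The field names and values here are illustrative assumptions, not the dataset's actual column names:

```python
# Hypothetical example record illustrating the fields described above.
# All key names are assumptions; consult the released files for the
# actual schema.
sample = {
    "text": "Example claim text in the source language.",
    "translation_en": "Example claim text translated to English.",
    "topic": "health",        # one of the 6 topical domains
    "style": "noisy",         # writing style: noisy vs. structured
    "language": "es",         # ISO language code (one of 16)
    "label": 1,               # check-worthiness: 1 = check-worthy, 0 = not
    "entities": ["WHO"],      # detected named entities
}
```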
The dataset is partitioned into training, validation, and test sets. In addition, we construct a separate out-of-distribution (OOD) set consisting of 4 other languages (it, mk, nl, and my) to evaluate model generalization beyond the in-distribution data. Below is the number of samples included in each set:
| Set | Samples |
| --- | --- |
| Train | 86,691 |
| Validation | 18,491 |
| Test | 18,540 |
| Out-of-distribution (OOD) | 27,761 |