MULTITuDEv3

Macko, Dominik; Kopal, Jakub; Moro, Robert; Srba, Ivan

doi:10.5281/zenodo.15519413

Published 2025 | Version v3

Dataset Restricted


MULTITuDEv3 is a benchmark dataset for multilingual machine-generated text detection, originally described in the EMNLP 2023 conference paper. The original version (see MULTITuDEv1) consisted of 7992 human-written news texts in 11 languages subsampled from MassiveSumm, accompanied by 66089 texts generated by 8 large language models from the headlines of the news articles. The creation process and scripts for replication or extension are located in a GitHub repository. MULTITuDEv2 extended the dataset with texts obfuscated by 10 authorship obfuscation methods, as described in the EMNLP 2024 Findings conference paper. This version covers 21 languages (instead of the original 11) with mostly equal coverage in the training set, and was introduced in the ACL 2025 conference paper for out-of-domain evaluation of detectors trained on social-media texts.

If you use this dataset in any publication, project, tool, or in any other form, please cite the paper.

Fields

The dataset has the following fields:

'text' - a text sample,
'label' - 0 for human-written text, 1 for machine-generated text,
'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,
'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,
'language' - the ISO 639-1 language code identifying the language of the given text,
'length' - word count of the given text,
'source' - a string identifying the source dataset / news medium of the given text
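
As a rough illustration of how these fields might be consumed, the sketch below parses records into a typed structure and selects one split. It assumes, hypothetically, that the data is available as a CSV file whose header matches the field names above; the actual file format is not stated in this record. The `Sample` class, the `read_samples` helper, and the two demo rows are invented for illustration only.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Sample:
    text: str
    label: int        # 0 = human-written, 1 = machine-generated
    multi_label: str  # generator name, or "human"
    split: str        # "train" or "test"
    language: str     # ISO 639-1 code
    length: int       # word count
    source: str       # source dataset / news medium


def read_samples(handle):
    """Yield Sample records from a CSV handle with the documented columns."""
    for row in csv.DictReader(handle):
        yield Sample(
            text=row["text"],
            label=int(row["label"]),
            multi_label=row["multi_label"],
            split=row["split"],
            language=row["language"],
            length=int(row["length"]),
            source=row["source"],
        )


# Invented rows standing in for the real (restricted) file.
DEMO_CSV = (
    "text,label,multi_label,split,language,length,source\n"
    "An example human text,0,human,train,en,4,example-medium\n"
    "An example generated text,1,vicuna-13b,test,en,4,example-medium\n"
)

samples = list(read_samples(io.StringIO(DEMO_CSV)))
test_samples = [s for s in samples if s.split == "test"]
```

With a real file, `io.StringIO(DEMO_CSV)` would be replaced by an opened file handle; the numeric fields are converted explicitly because `csv.DictReader` returns every value as a string.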

Statistics (the number of samples)

Splits:

train - 156240
test - 50090
train+test - 206330

Binary labels:

0 - 25945
1 - 180385
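
The binary labels are clearly imbalanced, with far more machine-generated than human-written samples. The sketch below computes each class's share and the common "balanced" class weights (total divided by the number of classes times the class count). Weighting is only one possible way to handle the imbalance and is shown here as an assumption, not as the procedure used by the dataset authors.

```python
# Counts copied from the binary-label statistics above.
counts = {0: 25945, 1: 180385}
total = sum(counts.values())

# Share of each class in the full dataset.
share = {label: n / total for label, n in counts.items()}

# "Balanced" weights: total / (number_of_classes * class_count).
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(f"label {label}: share={share[label]:.3f}, weight={weights[label]:.3f}")
```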

Multiclass labels:

human - 25945
aya-101 - 25948
Mistral-7B-Instruct-v0.2 - 25937
gpt-3.5-turbo-0125 - 25935
v5-Eagle-7B-HF - 25892
vicuna-13b - 25876
opt-iml-max-30b - 25568
Llama-2-70b-chat-hf - 25229
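
The multiclass counts can be cross-checked against the binary ones. The sketch below assumes, based on the field descriptions, that 'label' is 0 exactly when 'multi_label' is "human"; the `to_binary` helper is a hypothetical name introduced for illustration.

```python
# Counts copied from the multiclass statistics above.
multi_counts = {
    "human": 25945,
    "aya-101": 25948,
    "Mistral-7B-Instruct-v0.2": 25937,
    "gpt-3.5-turbo-0125": 25935,
    "v5-Eagle-7B-HF": 25892,
    "vicuna-13b": 25876,
    "opt-iml-max-30b": 25568,
    "Llama-2-70b-chat-hf": 25229,
}


def to_binary(multi_label: str) -> int:
    """Map a multiclass label to the binary label (assumed relationship)."""
    return 0 if multi_label == "human" else 1


# Aggregate the per-generator counts into the two binary classes.
binary_counts = {0: 0, 1: 0}
for name, n in multi_counts.items():
    binary_counts[to_binary(name)] += n
```

Under this assumption the aggregated counts reproduce the binary-label statistics, which suggests the two label fields are consistent with each other.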

Languages:

Language	train	test
Arabic	7975	2392
Bulgarian	7954	2386
Catalan	2894	2389
Chinese	7926	2383
Croatian	7951	2384
Czech	7962	2389
Dutch	7958	2386
English	7954	2384
German	7951	2388
Greek	7944	2384
Hungarian	7964	2385
Irish	2333	2381
Polish	7946	2383
Portuguese	7956	2388
Romanian	7949	2386
Russian	7945	2382
Scottish Gaelic	7899	2377
Slovak	7946	2385
Slovenian	7947	2386
Spanish	7947	2387
Ukrainian	7939	2385
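
The per-language table can be checked against the split totals in the same way, and it also shows which languages have reduced training coverage. In the sketch below, the threshold of half the median training size is an arbitrary choice made for illustration.

```python
from statistics import median

# (train, test) sample counts per language, copied from the table above.
LANGUAGE_COUNTS = {
    "Arabic": (7975, 2392),
    "Bulgarian": (7954, 2386),
    "Catalan": (2894, 2389),
    "Chinese": (7926, 2383),
    "Croatian": (7951, 2384),
    "Czech": (7962, 2389),
    "Dutch": (7958, 2386),
    "English": (7954, 2384),
    "German": (7951, 2388),
    "Greek": (7944, 2384),
    "Hungarian": (7964, 2385),
    "Irish": (2333, 2381),
    "Polish": (7946, 2383),
    "Portuguese": (7956, 2388),
    "Romanian": (7949, 2386),
    "Russian": (7945, 2382),
    "Scottish Gaelic": (7899, 2377),
    "Slovak": (7946, 2385),
    "Slovenian": (7947, 2386),
    "Spanish": (7947, 2387),
    "Ukrainian": (7939, 2385),
}

train_total = sum(train for train, _ in LANGUAGE_COUNTS.values())
test_total = sum(test for _, test in LANGUAGE_COUNTS.values())

# Flag languages whose training size is below half the median training size.
typical_train = median(train for train, _ in LANGUAGE_COUNTS.values())
low_coverage = [
    name
    for name, (train, _) in LANGUAGE_COUNTS.items()
    if train < 0.5 * typical_train
]
```

The totals match the train and test split sizes listed above, and the flagged languages are the exceptions to the "mostly equal coverage" of the training set.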

Files

Restricted

The record is publicly accessible, but files are restricted.

Request access

If you would like to request access to these files, please fill out the form below.

In order to share the dataset with you, please agree to the following terms:

You will use the dataset strictly for research purposes. The request for access to the dataset must be sent from an official, existing e-mail address of the relevant university, faculty, or other scientific or research institution (for verification purposes).
You will not re-share the dataset with anyone else not included in this request.
You will appropriately cite the paper mentioned in the dataset description in any publication, project, or tool using this dataset.
You understand how the dataset was created and that the "human" label may not be 100% correct.
You acknowledge that you are fully responsible for the use of the dataset (data) and for any infringement of rights of third parties (in particular copyright) that may arise from its use beyond the intended purposes. The authors are not responsible for your actions.


Additional details

Funding

European Commission
VIGILANT - Vital IntelliGence to Investigate ILlegAl DisiNformaTion (grant 101073921)

