MULTITuDE

Macko, Dominik; Moro, Robert; Uchendu, Adaku; Lucas, Jason Samuel; Yamashita, Michiharu; Pikuliak, Matúš; Srba, Ivan; Le, Thai; Lee, Dongwon; Simko, Jakub; Bielikova, Maria

doi:10.5281/zenodo.10013755

Published October 17, 2023 | Version v1

Dataset Restricted

MULTITuDE is a benchmark dataset for multilingual machine-generated text detection, described in the EMNLP 2023 conference paper. It consists of 7992 human-written news texts in 11 languages, subsampled from MassiveSumm, and 66089 texts generated by 8 large language models from the headlines of the news articles. The creation process and the scripts for replication or extension are available in a GitHub repository.

If you use this dataset in any publication, project, tool, or in any other form, please cite the paper.

Fields

The dataset has the following fields:

'text' - a text sample,
'label' - 0 for human-written text, 1 for machine-generated text,
'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,
'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,
'language' - the ISO 639-1 language code identifying the language of the given text,
'length' - word count of the given text,
'source' - a string identifying the source dataset / news medium of the given text.
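
The field list above can be mirrored in a small record type. The following is only a sketch: the class name `Sample`, the helpers `select` and `is_consistent`, and the example rows are hypothetical and are not taken from the dataset files, whose name and on-disk format are not stated in this record.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Sample:
    """One record, with the seven fields listed in the description."""
    text: str
    label: int        # 0 = human-written, 1 = machine-generated
    multi_label: str  # generator name, or "human"
    split: str        # "train" or "test"
    language: str     # ISO 639-1 code, e.g. "en"
    length: int       # word count
    source: str       # source dataset / news medium


def is_consistent(sample: Sample) -> bool:
    """label is 0 exactly when multi_label is "human"."""
    return (sample.label == 0) == (sample.multi_label == "human")


def select(samples: List[Sample], split: str,
           language: Optional[str] = None) -> List[Sample]:
    """Keep samples of one split, optionally restricted to one language."""
    return [
        s for s in samples
        if s.split == split and (language is None or s.language == language)
    ]


# Made-up rows for illustration only; they are not real dataset content.
rows = [
    Sample("Example human article.", 0, "human", "train", "en", 3, "example-source"),
    Sample("Example generated article.", 1, "gpt-4", "train", "en", 3, "example-source"),
    Sample("Ejemplo de texto generado.", 1, "vicuna-13b", "test", "es", 4, "example-source"),
]

train_en = select(rows, "train", "en")
test_all = select(rows, "test")
```

Filtering on 'split' first and 'language' second matches how the fields are described: 'split' separates training from evaluation data, and 'language' allows per-language evaluation.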

Statistics (the number of samples)

Splits:

train - 44786
test - 29295

Binary labels:

0 - 7992
1 - 66089
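
The binary labels are imbalanced: machine-generated texts outnumber human-written ones by roughly eight to one. As an illustration, and assuming inverse-frequency weighting (one common option, not something this record prescribes), per-class weights can be derived from the counts above:

```python
# Binary label counts as listed in the statistics above.
counts = {0: 7992, 1: 66089}

total = sum(counts.values())
n_classes = len(counts)

# Inverse-frequency weights: each class contributes equally in total.
# This weighting scheme is an assumption made for illustration.
weights = {label: total / (n_classes * n) for label, n in counts.items()}

human_share = counts[0] / total

print(f"total samples: {total}")
print(f"human share:   {human_share:.3f}")
for label, weight in weights.items():
    print(f"weight for label {label}: {weight:.3f}")
```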

Multiclass labels:

gpt-3.5-turbo - 8300
gpt-4 - 8300
text-davinci-003 - 8297
alpaca-lora-30b - 8290
vicuna-13b - 8287
opt-66b - 8229
llama-65b - 8229
opt-iml-max-1.3b - 8157
human - 7992

Languages:

English (en) - 29460 (train + test)
Spanish (es) - 11586 (train + test)
Russian (ru) - 11578 (train + test)
Dutch (nl) - 2695 (test)
Catalan (ca) - 2691 (test)
Czech (cs) - 2689 (test)
German (de) - 2685 (test)
Chinese (zh) - 2683 (test)
Portuguese (pt) - 2673 (test)
Arabic (ar) - 2673 (test)
Ukrainian (uk) - 2668 (test)
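
The four breakdowns above describe the same set of samples, so their totals should agree. A minimal sketch that checks this, using only the numbers listed in this section, and that derives which languages appear in the test split only (the variable names are illustrative):

```python
splits = {"train": 44786, "test": 29295}
binary = {0: 7992, 1: 66089}
multiclass = {
    "gpt-3.5-turbo": 8300, "gpt-4": 8300, "text-davinci-003": 8297,
    "alpaca-lora-30b": 8290, "vicuna-13b": 8287, "opt-66b": 8229,
    "llama-65b": 8229, "opt-iml-max-1.3b": 8157, "human": 7992,
}
# language code -> (sample count, splits in which the language appears)
languages = {
    "en": (29460, {"train", "test"}), "es": (11586, {"train", "test"}),
    "ru": (11578, {"train", "test"}), "nl": (2695, {"test"}),
    "ca": (2691, {"test"}), "cs": (2689, {"test"}),
    "de": (2685, {"test"}), "zh": (2683, {"test"}),
    "pt": (2673, {"test"}), "ar": (2673, {"test"}),
    "uk": (2668, {"test"}),
}

total = sum(splits.values())
machine_total = sum(n for name, n in multiclass.items() if name != "human")
language_total = sum(n for n, _ in languages.values())

# Every breakdown must describe the same number of samples.
assert total == sum(binary.values()) == sum(multiclass.values()) == language_total
# Machine-generated texts across the 8 generators match binary label 1.
assert machine_total == binary[1]

# Languages that a detector never sees during training.
test_only = sorted(code for code, (_, where) in languages.items()
                   if where == {"test"})
print(total, test_only)
```

The eight test-only languages make it possible to evaluate how a detector trained on English, Spanish, and Russian generalizes to languages absent from training.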

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

If you would like to request access to these files, please fill out the form below.

In order to share the dataset with you, please agree to the following terms:

You will use the dataset strictly for research purposes. The request for access to the dataset must be sent from an official, existing e-mail address of the relevant university, faculty, or other scientific or research institution (for verification purposes).
You will not re-share the dataset with anyone else not included in this request.
You will appropriately cite the paper mentioned in the dataset description in any publication, project, or tool that uses this dataset.
You understand how the dataset was created and that the "human" label may not be 100% correct.
You acknowledge that you are fully responsible for the use of the dataset (data) and for any infringement of rights of third parties (in particular copyright) that may arise from its use beyond the intended purposes. The authors are not responsible for your actions.

Additional details

European Commission
VIGILANT – Vital IntelliGence to Investigate ILlegAl DisiNformaTion 101073921
European Commission
vera.ai – vera.ai: VERification Assisted by Artificial Intelligence 101070093

