MULTITuDE
Description
MULTITuDE is a benchmark dataset for multilingual machine-generated text detection, described in an EMNLP 2023 conference paper. It consists of 7992 human-written news texts in 11 languages, subsampled from MassiveSumm, accompanied by 66089 texts generated by 8 large language models (prompted with the headlines of the news articles). The creation process and the scripts for replication/extension are located in a GitHub repository.
If you use this dataset in any publication, project, tool, or in any other form, please cite the paper.
Fields
The dataset has the following fields:
- 'text' - a text sample,
- 'label' - 0 for human-written text, 1 for machine-generated text,
- 'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,
- 'split' - a string identifying the train or test split of the dataset, used for training and evaluation respectively,
- 'language' - the ISO 639-1 language code identifying the language of the given text,
- 'length' - word count of the given text,
- 'source' - a string identifying the source dataset / news medium of the given text.
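The fields above can be combined to select subsets, for example one split restricted to certain languages. The following is a minimal sketch in plain Python, not the authors' loading code: the three records are made-up toy rows that only mimic the documented field names, and the helper `select` and the source value "example-medium" are hypothetical.

```python
from collections import Counter

# Hypothetical toy records that mimic the documented fields; the real
# dataset must be loaded from its actual distribution files.
records = [
    {"text": "Sample human article.", "label": 0, "multi_label": "human",
     "split": "train", "language": "en", "length": 3, "source": "example-medium"},
    {"text": "Sample generated article.", "label": 1, "multi_label": "gpt-4",
     "split": "train", "language": "en", "length": 3, "source": "example-medium"},
    {"text": "Ein generierter Beispieltext.", "label": 1, "multi_label": "vicuna-13b",
     "split": "test", "language": "de", "length": 3, "source": "example-medium"},
]


def select(rows, split, languages=None):
    """Return rows of the given split, optionally restricted to some languages."""
    return [
        r for r in rows
        if r["split"] == split and (languages is None or r["language"] in languages)
    ]


train = select(records, "train")
test_de = select(records, "test", languages={"de"})

# Per the field descriptions, 'label' is 0 exactly when 'multi_label' is "human".
consistent = all((r["label"] == 0) == (r["multi_label"] == "human") for r in records)

# Binary label distribution within the selected training rows.
label_counts = Counter(r["label"] for r in train)
```

Checking that 'label' and 'multi_label' agree is a cheap sanity test worth running once after loading the real data, since the two fields encode overlapping information.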
Statistics (the number of samples)
Splits:
- train - 44786
- test - 29295
Binary labels:
- 0 - 7992
- 1 - 66089
Multiclass labels:
- gpt-3.5-turbo - 8300
- gpt-4 - 8300
- text-davinci-003 - 8297
- alpaca-lora-30b - 8290
- vicuna-13b - 8287
- opt-66b - 8229
- llama-65b - 8229
- opt-iml-max-1.3b - 8157
- human - 7992
Languages:
- English (en) - 29460 (train + test)
- Spanish (es) - 11586 (train + test)
- Russian (ru) - 11578 (train + test)
- Dutch (nl) - 2695 (test)
- Catalan (ca) - 2691 (test)
- Czech (cs) - 2689 (test)
- German (de) - 2685 (test)
- Chinese (zh) - 2683 (test)
- Portuguese (pt) - 2673 (test)
- Arabic (ar) - 2673 (test)
- Ukrainian (uk) - 2668 (test)
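The four breakdowns above describe the same set of samples, so each should add up to the same total. The sketch below copies the counts from this section and cross-checks them; the variable names are illustrative only. It also derives two figures that the listing leaves implicit: the share of human-written texts, and how many test samples belong to the three languages that also appear in the train split.

```python
# Counts copied from the statistics listed above.
splits = {"train": 44786, "test": 29295}
binary = {0: 7992, 1: 66089}
multiclass = {
    "gpt-3.5-turbo": 8300, "gpt-4": 8300, "text-davinci-003": 8297,
    "alpaca-lora-30b": 8290, "vicuna-13b": 8287, "opt-66b": 8229,
    "llama-65b": 8229, "opt-iml-max-1.3b": 8157, "human": 7992,
}
train_and_test_langs = {"en": 29460, "es": 11586, "ru": 11578}
test_only_langs = {
    "nl": 2695, "ca": 2691, "cs": 2689, "de": 2685,
    "zh": 2683, "pt": 2673, "ar": 2673, "uk": 2668,
}

total = sum(splits.values())

# Every breakdown should account for the same number of samples.
assert sum(binary.values()) == total
assert sum(multiclass.values()) == total
assert sum(train_and_test_langs.values()) + sum(test_only_langs.values()) == total

# Machine-generated count equals the sum over the 8 generator models.
machine_total = sum(n for name, n in multiclass.items() if name != "human")
assert machine_total == binary[1]

# Share of human-written texts (class imbalance of the binary task).
human_share = binary[0] / total

# Test samples in en/es/ru: the test split minus the test-only languages.
test_in_train_langs = splits["test"] - sum(test_only_langs.values())

print(f"total samples: {total}")
print(f"human share: {human_share:.1%}")
print(f"test samples in en/es/ru: {test_in_train_langs}")
```

The human share of roughly one in nine samples means the binary task is strongly imbalanced towards machine-generated text, which is worth keeping in mind when choosing evaluation metrics.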