MULTITuDEv2
Description
MULTITuDEv2 is a benchmark dataset for multilingual machine-generated text detection, described in the EMNLP 2023 conference paper. It consists of 7992 human-written news texts in 11 languages, subsampled from MassiveSumm, accompanied by 66089 texts generated by 8 large language models (using the headlines of the news articles). The creation process and the scripts for replication/extension are located in a GitHub repository. In v2, the dataset has been further extended with texts obfuscated by 10 authorship obfuscation methods, described in the EMNLP 2024 Findings conference paper.
If you use this dataset in any publication, project, tool, or in any other form, please cite the paper.
Files
The v2 of the dataset consists of multiple files. 'multitude.csv' contains the original v1 of the dataset (i.e., without the 'generated' field). The other files also contain the 'generated' field (described below) and are compressed with GZIP. In the file 'multitude_obfuscated_original.csv.gz', the 'generated' field is a copy of the 'text' field, which makes the file compatible with the files containing obfuscated texts (it was used as such in the experiments).
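As an illustration of reading both kinds of files, the following is a minimal sketch using only the Python standard library. It assumes the files are UTF-8 encoded, comma-separated, and start with a header row; none of this is stated above, so adjust as needed. The helper name read_rows is hypothetical and not part of the dataset's tooling.

```python
import csv
import gzip


def read_rows(path):
    """Yield each row of a dataset file as a dict keyed by field name.

    Assumptions (not confirmed by the dataset description): UTF-8 encoding,
    comma delimiter, and a header row naming the fields.
    """
    # Files ending in .gz are GZIP-compressed; 'multitude.csv' is plain CSV.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)
```

For example, list(read_rows('multitude_obfuscated_original.csv.gz')) would load that file into memory as a list of dicts. All values are read as strings, so fields such as 'label' and 'length' need an explicit int() conversion.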
Fields
The dataset has the following fields:
- 'text' - an original (unobfuscated) text sample,
- 'label' - 0 for human-written text, 1 for machine-generated text,
- 'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,
- 'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,
- 'language' - the ISO 639-1 language code identifying the language of the given text,
- 'length' - word count of the given text,
- 'source' - a string identifying the source dataset / news medium of the given text,
- 'generated' - an obfuscated text sample (i.e., transformed from the original text by the obfuscator indicated by the corresponding filename).
Note: some obfuscated texts in the 'generated' field are identical to the corresponding 'text' field, indicating that the obfuscator failed to modify the text. Obfuscated versions of human-written texts are also included; however, the labels of their originals might no longer be relevant for them (i.e., a human-written text obfuscated by a machine could be considered machine-generated as well), so take this into account in your research.
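One way to act on this note is to separate the samples that the obfuscator actually changed from those it left untouched. The sketch below assumes each row is a dict with 'text' and 'generated' keys (for example, as produced by Python's csv.DictReader) and uses exact string equality as the failure criterion; both function names are hypothetical.

```python
def obfuscation_failed(row):
    """Return True when the obfuscated text is identical to the original."""
    return row["generated"] == row["text"]


def split_by_obfuscation(rows):
    """Partition rows into (changed, unchanged) lists.

    'unchanged' collects rows whose 'generated' field equals the 'text' field,
    i.e. cases where the obfuscator did not modify the text.
    """
    changed, unchanged = [], []
    for row in rows:
        (unchanged if obfuscation_failed(row) else changed).append(row)
    return changed, unchanged
```

Exact equality is a deliberately strict choice: texts that differ only in whitespace count as changed here. Whether to exclude unchanged rows, or to relabel obfuscated human-written texts, depends on the research question.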
Statistics (the number of samples)
Splits:
- train - 44786
- test - 29295
Binary labels:
- 0 - 7992
- 1 - 66089
Multiclass labels:
- gpt-3.5-turbo - 8300
- gpt-4 - 8300
- text-davinci-003 - 8297
- alpaca-lora-30b - 8290
- vicuna-13b - 8287
- opt-66b - 8229
- llama-65b - 8229
- opt-iml-max-1.3b - 8157
- human - 7992
Languages:
- English (en) - 29460 (train + test)
- Spanish (es) - 11586 (train + test)
- Russian (ru) - 11578 (train + test)
- Dutch (nl) - 2695 (test)
- Catalan (ca) - 2691 (test)
- Czech (cs) - 2689 (test)
- German (de) - 2685 (test)
- Chinese (zh) - 2683 (test)
- Portuguese (pt) - 2673 (test)
- Arabic (ar) - 2673 (test)
- Ukrainian (uk) - 2668 (test)
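The counts above can be recomputed from a loaded file as a sanity check. This is a minimal sketch that assumes rows are dicts keyed by the field names listed in the Fields section, with all values as strings; the helper name count_by is hypothetical.

```python
from collections import Counter


def count_by(rows, field):
    """Count how many rows carry each distinct value of the given field."""
    return Counter(row[field] for row in rows)
```

For instance, count_by(rows, 'split'), count_by(rows, 'label'), count_by(rows, 'multi_label'), and count_by(rows, 'language') correspond to the four groups of statistics listed here. Each group sums to the same total of 74081 samples (44786 + 29295 = 7992 + 66089).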