MULTITuDEv3
Description
MULTITuDEv3 is a benchmark dataset for multilingual machine-generated text detection, originally described in the EMNLP 2023 conference paper. The original version (see MULTITuDEv1) consisted of 7992 human-written news texts in 11 languages subsampled from MassiveSumm, accompanied by 66089 texts that 8 large language models generated from the headlines of the news articles. The creation process and scripts for replication/extension are located in a GitHub repository. MULTITuDEv2 extended the dataset with obfuscated texts produced by 10 authorship obfuscation methods, as described in the EMNLP 2024 Findings conference paper. This version covers 21 languages (instead of the original 11) with mostly equal coverage in the training set, and was introduced in the ACL 2025 conference paper for out-of-domain evaluation of detectors trained on social-media texts.
If you use this dataset in any publication, project, tool, or in any other form, please cite the paper.
Fields
The dataset has the following fields:
- 'text' - a text sample,
- 'label' - 0 for human-written text, 1 for machine-generated text,
- 'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,
- 'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,
- 'language' - the ISO 639-1 language code identifying the language of the given text,
- 'length' - word count of the given text,
- 'source' - a string identifying the source dataset / news medium of the given text.
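As a minimal sketch of how these fields might be used, the following parses records and selects them by split and language. The file name and format of the released data are not stated here, so the example reads a hypothetical two-row CSV sample from memory; the sample texts, the 'example-medium' source value, and the helper names `load_rows` and `select` are illustrative assumptions only.

```python
import csv
import io

# Hypothetical two-row sample that uses the documented field names.
# Real data would come from the dataset file, whose name is not given here.
SAMPLE_CSV = """text,label,multi_label,split,language,length,source
"An example human-written news text.",0,human,train,en,5,example-medium
"An example generated news text.",1,gpt-3.5-turbo-0125,test,en,5,example-medium
"""


def load_rows(handle):
    """Parse rows and convert the numeric fields from strings to ints."""
    rows = []
    for row in csv.DictReader(handle):
        row["label"] = int(row["label"])
        row["length"] = int(row["length"])
        rows.append(row)
    return rows


def select(rows, split, language=None):
    """Keep rows of one split, optionally restricted to one ISO 639-1 code."""
    return [
        r for r in rows
        if r["split"] == split and (language is None or r["language"] == language)
    ]


rows = load_rows(io.StringIO(SAMPLE_CSV))
train_en = select(rows, "train", "en")  # English training samples
test_rows = select(rows, "test")        # all test samples
```

Replacing `io.StringIO(SAMPLE_CSV)` with an open file handle would apply the same filtering to real data, provided the file is a CSV with these column names.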
Statistics (the number of samples)
Splits:
- train - 156240
- test - 50090
- train+test - 206330
Binary labels:
- 0 - 25945
- 1 - 180385
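The binary labels are strongly imbalanced: human-written texts make up roughly one in eight samples. The sketch below derives the human share and a set of inverse-frequency class weights from the counts above. The weighting formula, total / (number of classes * class count), is a common heuristic chosen here for illustration, not something prescribed by the dataset.

```python
# Binary label counts copied from the statistics above.
counts = {0: 25945, 1: 180385}

total = sum(counts.values())
human_share = counts[0] / total            # fraction of human-written texts
machine_per_human = counts[1] / counts[0]  # machine texts per human text

# Inverse-frequency weights (an assumed heuristic): rarer classes weigh more.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"human share: {human_share:.1%}")
print(f"machine texts per human text: {machine_per_human:.2f}")
print(f"class weights: {weights}")
```

Such weights could be passed to a loss function or a sampler when training a binary detector. Whether reweighting helps depends on the detector and the evaluation metric.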
Multiclass labels:
- human - 25945
- aya-101 - 25948
- Mistral-7B-Instruct-v0.2 - 25937
- gpt-3.5-turbo-0125 - 25935
- v5-Eagle-7B-HF - 25892
- vicuna-13b - 25876
- opt-iml-max-30b - 25568
- Llama-2-70b-chat-hf - 25229
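The split, binary, and multiclass counts above should describe the same set of samples. A short check, using only the numbers listed in this section, might look like this:

```python
# Counts copied from the statistics above.
splits = {"train": 156240, "test": 50090}
binary = {0: 25945, 1: 180385}
multiclass = {
    "human": 25945,
    "aya-101": 25948,
    "Mistral-7B-Instruct-v0.2": 25937,
    "gpt-3.5-turbo-0125": 25935,
    "v5-Eagle-7B-HF": 25892,
    "vicuna-13b": 25876,
    "opt-iml-max-30b": 25568,
    "Llama-2-70b-chat-hf": 25229,
}

total = sum(splits.values())
# Every multiclass label except "human" corresponds to binary label 1.
machine_total = sum(n for name, n in multiclass.items() if name != "human")

checks = {
    "splits match binary total": total == sum(binary.values()),
    "splits match multiclass total": total == sum(multiclass.values()),
    "human count matches label 0": multiclass["human"] == binary[0],
    "generator counts match label 1": machine_total == binary[1],
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

All four checks hold for the numbers listed here. Each of the seven generators contributes roughly the same number of texts as the human class.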
Languages:
Language         train  test
Arabic            7975  2392
Bulgarian         7954  2386
Catalan           2894  2389
Chinese           7926  2383
Croatian          7951  2384
Czech             7962  2389
Dutch             7958  2386
English           7954  2384
German            7951  2388
Greek             7944  2384
Hungarian         7964  2385
Irish             2333  2381
Polish            7946  2383
Portuguese        7956  2388
Romanian          7949  2386
Russian           7945  2382
Scottish Gaelic   7899  2377
Slovak            7946  2385
Slovenian         7947  2386
Spanish           7947  2387
Ukrainian         7939  2385
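The per-language counts above can be cross-checked against the split totals and scanned for languages with noticeably smaller training sets. In the sketch below, the cutoff of half the median training count is an arbitrary assumption made for illustration.

```python
from statistics import median

# (train, test) counts per language, copied from the statistics above.
per_language = {
    "Arabic": (7975, 2392), "Bulgarian": (7954, 2386), "Catalan": (2894, 2389),
    "Chinese": (7926, 2383), "Croatian": (7951, 2384), "Czech": (7962, 2389),
    "Dutch": (7958, 2386), "English": (7954, 2384), "German": (7951, 2388),
    "Greek": (7944, 2384), "Hungarian": (7964, 2385), "Irish": (2333, 2381),
    "Polish": (7946, 2383), "Portuguese": (7956, 2388), "Romanian": (7949, 2386),
    "Russian": (7945, 2382), "Scottish Gaelic": (7899, 2377),
    "Slovak": (7946, 2385), "Slovenian": (7947, 2386), "Spanish": (7947, 2387),
    "Ukrainian": (7939, 2385),
}

train_total = sum(train for train, _ in per_language.values())
test_total = sum(test for _, test in per_language.values())

# Flag languages whose training count is below half the median (assumed cutoff).
cutoff = 0.5 * median(train for train, _ in per_language.values())
low_coverage = sorted(
    name for name, (train, _) in per_language.items() if train < cutoff
)

print(train_total, test_total)
print(low_coverage)
```

The totals agree with the train and test split sizes. Under this cutoff, only Catalan and Irish are flagged as having a smaller training set, while test coverage is close to uniform across all 21 languages.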