Published 2025 | Version v3
Dataset Restricted

MULTITuDEv3

Description

MULTITuDEv3 is a benchmark dataset for multilingual machine-generated text detection, originally described in the EMNLP 2023 conference paper. The original version (see MULTITuDEv1) consisted of 7992 human-written news texts in 11 languages subsampled from MassiveSumm, accompanied by 66089 texts generated by 8 large language models from the headlines of the news articles. The creation process and scripts for replication/extension are located in a GitHub repository. MULTITuDEv2 extended the dataset with texts obfuscated using 10 authorship obfuscation methods, described in the EMNLP 2024 Findings conference paper. This version covers 21 languages (instead of the original 11) with mostly equal coverage in the training set and was introduced in the ACL 2025 conference paper for out-of-domain evaluation of detectors trained on social-media texts.

If you use this dataset in any publication, project, tool, or in any other form, please cite the paper.

Fields

The dataset has the following fields:

  • 'text' - a text sample,
  • 'label' - 0 for human-written text, 1 for machine-generated text,
  • 'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,
  • 'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,
  • 'language' - the ISO 639-1 language code identifying the language of the given text,
  • 'length' - word count of the given text,
  • 'source' - a string identifying the source dataset / news medium of the given text
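
As a rough illustration of how these fields fit together, the sketch below checks that the binary and multiclass labels agree and groups records by split. The sample records and the helper names (`is_consistent`, `by_split`) are invented for illustration; the real files are restricted and their on-disk format is not described here.

```python
from collections import defaultdict

# Toy records that follow the field list above; all values are invented.
records = [
    {"text": "Example human-written article.", "label": 0, "multi_label": "human",
     "split": "train", "language": "en", "length": 3, "source": "example-source"},
    {"text": "Example generated article.", "label": 1, "multi_label": "vicuna-13b",
     "split": "test", "language": "en", "length": 3, "source": "example-source"},
]


def is_consistent(record):
    """Return True if 'label' is 0 exactly when 'multi_label' is "human"."""
    return (record["label"] == 0) == (record["multi_label"] == "human")


def by_split(rows):
    """Group records by their 'split' field ("train" or "test")."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["split"]].append(row)
    return dict(groups)


splits = by_split(records)
print({name: len(rows) for name, rows in splits.items()})
print(all(is_consistent(r) for r in records))
```

The same two helpers could be applied to the real records once they are loaded into a list of dictionaries.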

Statistics (the number of samples)

Splits:

  • train - 156240
  • test - 50090
  • train+test - 206330

Binary labels:

  • 0 - 25945
  • 1 - 180385

Multiclass labels:

  • human - 25945
  • aya-101 - 25948
  • Mistral-7B-Instruct-v0.2 - 25937
  • gpt-3.5-turbo-0125 - 25935
  • v5-Eagle-7B-HF - 25892
  • vicuna-13b - 25876
  • opt-iml-max-30b - 25568
  • Llama-2-70b-chat-hf - 25229
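
The counts above can be cross-checked against the split and binary-label totals with a few lines of arithmetic. This is only a sanity check on the published numbers, with the counts copied by hand into a dictionary.

```python
# Multiclass counts copied from the list above.
multiclass = {
    "human": 25945,
    "aya-101": 25948,
    "Mistral-7B-Instruct-v0.2": 25937,
    "gpt-3.5-turbo-0125": 25935,
    "v5-Eagle-7B-HF": 25892,
    "vicuna-13b": 25876,
    "opt-iml-max-30b": 25568,
    "Llama-2-70b-chat-hf": 25229,
}

total = sum(multiclass.values())         # should equal train+test
machine = total - multiclass["human"]    # should equal binary label 1
human_share = multiclass["human"] / total

print(total, machine)
print(f"human share: {human_share:.1%}")
```

The totals come out to 206330 samples overall and 180385 machine-generated ones, matching the figures listed under Splits and Binary labels; human-written texts make up roughly one eighth of the data.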

Languages:

    Language          train   test
    Arabic             7975   2392
    Bulgarian          7954   2386
    Catalan            2894   2389
    Chinese            7926   2383
    Croatian           7951   2384
    Czech              7962   2389
    Dutch              7958   2386
    English            7954   2384
    German             7951   2388
    Greek              7944   2384
    Hungarian          7964   2385
    Irish              2333   2381
    Polish             7946   2383
    Portuguese         7956   2388
    Romanian           7949   2386
    Russian            7945   2382
    Scottish Gaelic    7899   2377
    Slovak             7946   2385
    Slovenian          7947   2386
    Spanish            7947   2387
    Ukrainian          7939   2385
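
To make "mostly equal coverage" concrete, the sketch below sums the per-language counts and flags languages whose training set is much smaller than the rest. The threshold (half the median training count) is an arbitrary choice for illustration, not something defined by the dataset authors.

```python
from statistics import median

# (train, test) counts per language, copied from the table above.
counts = {
    "Arabic": (7975, 2392), "Bulgarian": (7954, 2386), "Catalan": (2894, 2389),
    "Chinese": (7926, 2383), "Croatian": (7951, 2384), "Czech": (7962, 2389),
    "Dutch": (7958, 2386), "English": (7954, 2384), "German": (7951, 2388),
    "Greek": (7944, 2384), "Hungarian": (7964, 2385), "Irish": (2333, 2381),
    "Polish": (7946, 2383), "Portuguese": (7956, 2388), "Romanian": (7949, 2386),
    "Russian": (7945, 2382), "Scottish Gaelic": (7899, 2377),
    "Slovak": (7946, 2385), "Slovenian": (7947, 2386), "Spanish": (7947, 2387),
    "Ukrainian": (7939, 2385),
}

train_total = sum(train for train, _ in counts.values())
test_total = sum(test for _, test in counts.values())

# Assumed threshold: flag languages with under half the median train count.
typical_train = median(train for train, _ in counts.values())
low_coverage = [lang for lang, (train, _) in counts.items()
                if train < typical_train / 2]

print(train_total, test_total)
print(low_coverage)
```

The per-language sums reproduce the split sizes (156240 train, 50090 test), and under this threshold only Catalan and Irish stand out as having a noticeably smaller training portion.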

Files

Restricted

The record is publicly accessible, but files are restricted.

Request access

To request access to these files, fill out the access request form on the record page.

Your request will be accepted only if you agree to the following terms:
  1. You will use the dataset strictly for research purposes. The request for access must be sent from an official, existing e-mail address of the relevant university, faculty, or other scientific or research institution (for verification purposes).
  2. You will not re-share the dataset with anyone else not included in this request.
  3. You will appropriately cite the paper mentioned in the dataset description in any publication, project, or tool using this dataset.
  4. You understand how the dataset was created and that the "human" label may not be 100% correct.
  5. You acknowledge that you are fully responsible for the use of the dataset (data) and for any infringement of rights of third parties (in particular copyright) that may arise from its use beyond the intended purposes. The authors are not responsible for your actions.

Additional details

Funding

European Commission
VIGILANT - Vital IntelliGence to Investigate ILlegAl DisiNformaTion 101073921