Published September 27, 2024 | Version v2
Dataset Restricted

MULTITuDEv2

Description

MULTITuDEv2 is a benchmark dataset for multilingual machine-generated text detection, described in the EMNLP 2023 conference paper. It consists of 7992 human-written news texts in 11 languages subsampled from MassiveSumm, accompanied by 66089 texts generated by 8 large language models (generated from the headlines of the news articles). The creation process and the scripts for replication/extension are available in a GitHub repository. In v2, the dataset has been extended with obfuscated texts produced by 10 authorship obfuscation methods, described in the EMNLP 2024 Findings conference paper.

If you use this dataset in any publication, project, tool, or in any other form, please cite the paper.

Files

The v2 of the dataset consists of multiple files. 'multitude.csv' contains the original v1 of the dataset (i.e., without the 'generated' field). The other files also contain the 'generated' field (as described below) and are compressed with GZIP. In the file 'multitude_obfuscated_original.csv.gz', the 'generated' field holds a copy of the 'text' field, which keeps the file compatible with the files containing obfuscated texts (it was used as such in the experiments).
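
As a minimal sketch of reading these files, the helper below opens either the plain CSV or a GZIP-compressed one using only the Python standard library. It assumes UTF-8 encoding, comma-separated values, and a header row naming the fields; none of these are stated in this record, and the function name read_rows is hypothetical.

```python
import csv
import gzip


def read_rows(path):
    """Read a CSV file (plain or GZIP-compressed) into a list of dicts.

    Assumptions (not confirmed by the dataset record): UTF-8 encoding,
    comma delimiter, and a header row with the field names.
    """
    # Pick the opener from the file extension: '.gz' files are decompressed on the fly.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, mode="rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))
```

For example, read_rows("multitude.csv") would return rows without the 'generated' key, while read_rows("multitude_obfuscated_original.csv.gz") would return rows that include it. All values are returned as strings.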

Fields

The dataset has the following fields:

  • 'text' - an original (unobfuscated) text sample,
  • 'label' - 0 for human-written text, 1 for machine-generated text,
  • 'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,
  • 'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,
  • 'language' - the ISO 639-1 language code identifying the language of the given text,
  • 'length' - word count of the given text,
  • 'source' - a string identifying the source dataset / news medium of the given text,
  • 'generated' - an obfuscated text sample (i.e., produced from the original text by the obfuscator indicated by the corresponding filename).
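
The field list above can be mirrored in a small typed record, which helps because CSV readers return every value as a string. This is only a sketch: the class name Sample and the function parse_sample are hypothetical, and it assumes that 'label' and 'length' are stored as integer strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    text: str
    label: int            # 0 = human-written, 1 = machine-generated
    multi_label: str      # generator name or "human"
    split: str            # "train" or "test"
    language: str         # ISO 639-1 code
    length: int           # word count
    source: str
    generated: Optional[str] = None  # absent in the v1 file


def parse_sample(row):
    """Convert one CSV row (a dict of strings) into a typed Sample."""
    return Sample(
        text=row["text"],
        label=int(row["label"]),
        multi_label=row["multi_label"],
        split=row["split"],
        language=row["language"],
        length=int(row["length"]),
        source=row["source"],
        generated=row.get("generated"),
    )
```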

Note: some obfuscated texts in the 'generated' field are identical to the 'text' field, indicating that the obfuscator failed to modify the text. Obfuscated versions of human-written texts are also included; however, the labels of their originals may no longer be relevant to them (i.e., a human-written text obfuscated by a machine could also be considered machine-generated), so take this into account in your research.
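
One way to act on this note is sketched below: separate rows where the obfuscator left the text unchanged, and list the human-written rows that were actually altered so their labels can be reviewed. The function names are hypothetical, and exact string equality is an assumption made here; a text differing only in whitespace would count as changed.

```python
def split_by_obfuscation_outcome(rows):
    """Return (changed, unchanged) rows based on exact equality of 'generated' and 'text'."""
    changed, unchanged = [], []
    for row in rows:
        if row["generated"] == row["text"]:
            unchanged.append(row)   # obfuscator failed to modify the text
        else:
            changed.append(row)
    return changed, unchanged


def altered_human_rows(rows):
    """Rows labeled human-written (label 0) whose text was modified by an obfuscator.

    Their original label may no longer apply, so they are returned for review
    rather than relabeled automatically.
    """
    changed, _ = split_by_obfuscation_outcome(rows)
    # str() lets this work whether 'label' was read as "0" or already parsed to 0.
    return [row for row in changed if str(row["label"]) == "0"]
```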

Statistics (the number of samples)

Splits:

  • train - 44786
  • test - 29295

Binary labels:

  • 0 - 7992
  • 1 - 66089

Multiclass labels:

  • gpt-3.5-turbo - 8300
  • gpt-4 - 8300
  • text-davinci-003 - 8297
  • alpaca-lora-30b - 8290
  • vicuna-13b - 8287
  • opt-66b - 8229
  • llama-65b - 8229
  • opt-iml-max-1.3b - 8157
  • human - 7992

Languages:

  • English (en) - 29460 (train + test)
  • Spanish (es) - 11586 (train + test)
  • Russian (ru) - 11578 (train + test)
  • Dutch (nl) - 2695 (test)
  • Catalan (ca) - 2691 (test)
  • Czech (cs) - 2689 (test)
  • German (de) - 2685 (test)
  • Chinese (zh) - 2683 (test)
  • Portuguese (pt) - 2673 (test)
  • Arabic (ar) - 2673 (test)
  • Ukrainian (uk) - 2668 (test)
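
The counts above can be cross-checked against each other, and the same breakdowns can be recomputed from loaded rows. The sketch below copies the published numbers from this record and confirms that every breakdown sums to the same total; the helper count_by is a hypothetical name, and rows are assumed to be dicts keyed by field name.

```python
from collections import Counter


def count_by(rows, field):
    """Count samples per distinct value of the given field."""
    return Counter(row[field] for row in rows)


# Numbers copied from the statistics listed in this record.
PUBLISHED = {
    "split": {"train": 44786, "test": 29295},
    "label": {"0": 7992, "1": 66089},
    "multi_label": {
        "gpt-3.5-turbo": 8300, "gpt-4": 8300, "text-davinci-003": 8297,
        "alpaca-lora-30b": 8290, "vicuna-13b": 8287, "opt-66b": 8229,
        "llama-65b": 8229, "opt-iml-max-1.3b": 8157, "human": 7992,
    },
    "language": {
        "en": 29460, "es": 11586, "ru": 11578, "nl": 2695, "ca": 2691,
        "cs": 2689, "de": 2685, "zh": 2683, "pt": 2673, "ar": 2673, "uk": 2668,
    },
}

totals = {field: sum(counts.values()) for field, counts in PUBLISHED.items()}
print(totals)
# {'split': 74081, 'label': 74081, 'multi_label': 74081, 'language': 74081}
```

After loading a file, comparing count_by(rows, "language") with PUBLISHED["language"] is a quick way to confirm that nothing was dropped during download or parsing.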

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

If you would like to request access to these files, please fill out the form below.

You need to satisfy these conditions in order for this request to be accepted:

In order to share the dataset with you, please agree to the following terms:
  1. You will use the dataset strictly for research purposes. The request for access to the dataset must be sent from an official, existing e-mail address of the relevant university, faculty, or other scientific or research institution (for verification purposes).
  2. You will not re-share the dataset with anyone else not included in this request.
  3. You will appropriately cite the paper mentioned in the dataset description in any publication, project, or tool using this dataset.
  4. You understand how the dataset was created and that the "human" label may not be 100% correct.
  5. You acknowledge that you are fully responsible for the use of the dataset (data) and for any infringement of rights of third parties (in particular copyright) that may arise from its use beyond the intended purposes. The authors are not responsible for your actions.


Additional details

Funding

European Commission
VIGILANT – Vital IntelliGence to Investigate ILlegAl DisiNformaTion 101073921
European Commission
AI-CODE - AI services for COntinuous trust in emerging Digital Environments 101135437