There is a newer version of the record available.

Published October 17, 2023 | Version v1
Dataset Restricted

MULTITuDE

Description

MULTITuDE is a benchmark dataset for multilingual machine-generated text detection, described in the EMNLP 2023 conference paper. It consists of 7992 human-written news texts in 11 languages, subsampled from MassiveSumm, and 66089 texts generated by 8 large language models from the headlines of the news articles. The creation process and the scripts for replication or extension are available in a GitHub repository.

If you use this dataset in any publication, project, tool, or any other form, please cite the paper.

Fields

The dataset has the following fields:

  • 'text' - a text sample,
  • 'label' - 0 for human-written text, 1 for machine-generated text,
  • 'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,
  • 'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,
  • 'language' - the ISO 639-1 language code identifying the language of the given text,
  • 'length' - word count of the given text,
  • 'source' - a string identifying the source dataset / news medium of the given text.
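
The field list above can be mirrored in code. The following is a minimal sketch, assuming the records have already been loaded into Python dictionaries keyed by the field names. The `Record` class, the `parse_record` and `select` helpers, and the three sample rows are hypothetical illustrations and are not part of the dataset or its repository. The check that `label` 0 coincides with `multi_label` "human" is inferred from the field descriptions rather than stated by the authors.

```python
from dataclasses import dataclass


# Hypothetical container mirroring the documented fields.
@dataclass(frozen=True)
class Record:
    text: str         # the text sample
    label: int        # 0 = human-written, 1 = machine-generated
    multi_label: str  # generator name, or "human"
    split: str        # "train" or "test"
    language: str     # ISO 639-1 code, e.g. "en"
    length: int       # word count of the text, as provided in the data
    source: str       # source dataset / news medium


def parse_record(row: dict) -> Record:
    """Build a Record and check invariants implied by the field descriptions."""
    rec = Record(
        text=row["text"],
        label=int(row["label"]),
        multi_label=row["multi_label"],
        split=row["split"],
        language=row["language"],
        length=int(row["length"]),
        source=row["source"],
    )
    if rec.label not in (0, 1):
        raise ValueError(f"unexpected label: {rec.label}")
    if rec.split not in ("train", "test"):
        raise ValueError(f"unexpected split: {rec.split}")
    # Assumption: label 0 and multi_label "human" always occur together.
    if (rec.label == 0) != (rec.multi_label == "human"):
        raise ValueError("label and multi_label disagree")
    return rec


def select(records, split, language=None):
    """Return the records of one split, optionally restricted to one language."""
    return [
        r for r in records
        if r.split == split and (language is None or r.language == language)
    ]


# Toy rows invented for illustration only; they are not taken from the dataset.
rows = [
    {"text": "Example human text.", "label": 0, "multi_label": "human",
     "split": "train", "language": "en", "length": 3, "source": "placeholder"},
    {"text": "Texto de ejemplo generado.", "label": 1, "multi_label": "gpt-4",
     "split": "test", "language": "es", "length": 4, "source": "placeholder"},
    {"text": "Voorbeeldtekst.", "label": 1, "multi_label": "vicuna-13b",
     "split": "test", "language": "nl", "length": 1, "source": "placeholder"},
]

records = [parse_record(r) for r in rows]
test_es = select(records, "test", "es")
```

Validating at parse time keeps later filtering code simple, because every `Record` that reaches `select` is already known to carry a consistent `label` and `multi_label` pair.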

Statistics (the number of samples)

Splits:

  • train - 44786
  • test - 29295

Binary labels:

  • 0 - 7992
  • 1 - 66089

Multiclass labels:

  • gpt-3.5-turbo - 8300
  • gpt-4 - 8300
  • text-davinci-003 - 8297
  • alpaca-lora-30b - 8290
  • vicuna-13b - 8287
  • opt-66b - 8229
  • llama-65b - 8229
  • opt-iml-max-1.3b - 8157
  • human - 7992

Languages:

  • English (en) - 29460 (train + test)
  • Spanish (es) - 11586 (train + test)
  • Russian (ru) - 11578 (train + test)
  • Dutch (nl) - 2695 (test)
  • Catalan (ca) - 2691 (test)
  • Czech (cs) - 2689 (test)
  • German (de) - 2685 (test)
  • Chinese (zh) - 2683 (test)
  • Portuguese (pt) - 2673 (test)
  • Arabic (ar) - 2673 (test)
  • Ukrainian (uk) - 2668 (test)
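
The four breakdowns above describe the same set of samples, so their totals should agree. The following is a small sketch that copies the published counts into dictionaries and checks this; the variable names and the `check_totals` helper are illustrative assumptions, and the code does not read the dataset files.

```python
# Counts copied from the statistics above.
splits = {"train": 44786, "test": 29295}
binary = {0: 7992, 1: 66089}
multiclass = {
    "gpt-3.5-turbo": 8300, "gpt-4": 8300, "text-davinci-003": 8297,
    "alpaca-lora-30b": 8290, "vicuna-13b": 8287, "opt-66b": 8229,
    "llama-65b": 8229, "opt-iml-max-1.3b": 8157, "human": 7992,
}
languages = {
    "en": 29460, "es": 11586, "ru": 11578, "nl": 2695, "ca": 2691,
    "cs": 2689, "de": 2685, "zh": 2683, "pt": 2673, "ar": 2673, "uk": 2668,
}


def check_totals() -> int:
    """Return the overall sample count, raising if the breakdowns disagree."""
    totals = {
        "splits": sum(splits.values()),
        "binary": sum(binary.values()),
        "multiclass": sum(multiclass.values()),
        "languages": sum(languages.values()),
    }
    if len(set(totals.values())) != 1:
        raise ValueError(f"inconsistent totals: {totals}")
    # The machine-generated count must equal the sum over the eight generators.
    generated = sum(n for name, n in multiclass.items() if name != "human")
    if generated != binary[1] or multiclass["human"] != binary[0]:
        raise ValueError("binary and multiclass labels disagree")
    return totals["splits"]


total = check_totals()               # 74081
machine_share = binary[1] / total    # fraction of machine-generated samples
```

Since roughly nine out of ten samples are machine-generated, plain accuracy can be misleading on this data, and a detector that always predicts label 1 already scores about `machine_share`.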

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

If you would like to request access to these files, please fill out the form below.

You need to satisfy these conditions in order for this request to be accepted:

In order to share the dataset with you, please agree to the following terms:
  1. You will use the dataset strictly for research purposes. The request for access to the dataset must be sent from the official and existing e-mail address of the relevant university, faculty or other scientific or research institution (for verification purposes).
  2. You will not re-share the dataset with anyone else not included in this request.
  3. You will appropriately cite the paper mentioned in the dataset description in any publication, project, or tool that uses this dataset.
  4. You understand how the dataset was created and that the "human" label may not be 100% correct.
  5. You acknowledge that you are fully responsible for the use of the dataset (data) and for any infringement of rights of third parties (in particular copyright) that may arise from its use beyond the intended purposes. The authors are not responsible for your actions.


Additional details

Funding

VIGILANT – Vital IntelliGence to Investigate ILlegAl DisiNformaTion 101073921
European Commission
vera.ai – vera.ai: VERification Assisted by Artificial Intelligence 101070093
European Commission