Published May 1, 2026 | Version v1
Dataset Restricted

CEAID Adversarial Subset

Authors/Creators

Description

CEAID is a benchmark dataset (described in a paper) for machine-generated text detection in 7 Central European languages (Croatian, Czech, German, Hungarian, Polish, Slovak, and Slovenian) and two domains (news and social media). It contains 188,098 texts, of which about 23k are human-written and about 165k are generated by 8 multilingual large language models.

This dataset is an extension of CEAID for evaluating the adversarial robustness of machine-generated text detection methods. It contains a balanced, pseudorandomly selected subset of 100 texts per domain and language for each of the machine-generated and human-written classes. It also contains adversarially modified counterparts of the machine-generated samples, one for each of the two attacks used (homoglyph attack and paraphrasing). In total, it contains 1,400 human-written and 4,200 machine-generated samples. The dataset has been anonymized to minimize the amount of sensitive data by hiding email addresses, usernames, and phone numbers.
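The totals above follow directly from the sampling scheme. A minimal sketch of the arithmetic, using only the numbers stated in the description (the variable names are illustrative and not part of the dataset):

```python
# Numbers taken from the description; names are illustrative only.
LANGUAGES = ["cs", "de", "hr", "hu", "pl", "sk", "sl"]  # 7 languages
DOMAINS = ["news", "social_media"]                      # 2 domains
TEXTS_PER_CELL = 100       # texts per language, domain, and class
ATTACKS = ["homoglyph", "paraphrased"]                  # 2 attacks

# One balanced sample per class (human-written, machine-generated).
human = len(LANGUAGES) * len(DOMAINS) * TEXTS_PER_CELL
machine_original = len(LANGUAGES) * len(DOMAINS) * TEXTS_PER_CELL

# Each attack adds one modified counterpart per machine-generated sample.
machine_attacked = machine_original * len(ATTACKS)
machine_total = machine_original + machine_attacked

print(human, machine_total, human + machine_total)  # 1400 4200 5600
```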

If you use this dataset in any publication, project, tool, or any other form, please cite the paper.

Disclaimer

Due to its data sources (the original CEAID is a subset of a combination of news articles from MULTITuDEv3 and social-media texts from MultiSocial), the dataset may contain harmful or offensive content, or disinformation. The MultiSocial dataset description states that, based on a multilingual toxicity detector, about 8% of the text samples are probably toxic (from 5% in WhatsApp to 10% in Twitter). Although we have used older data sources (which are less likely to include machine-generated texts), the labeling of human-written text might not be 100% accurate. The anonymization procedure might not have successfully hidden all sensitive or personal content; thus, use the data cautiously (if you feel affected by such content, report the issues found to dpo[at]kinit.sk). The intended use is for non-commercial research purposes only.

Data

The dataset has the following fields:

  • 'text' - a text sample,

  • 'label' - 0 for human-written text, 1 for machine-generated text,

  • 'multi_label' - a string naming the large language model that generated the text, or the string "human" for a human-written text, followed by a "_" character and a string identifying the original/homoglyph/paraphrased subset,

  • 'language' - the ISO 639-1 language code identifying the detected language of the given text,

  • 'length' - word count of the given text,

  • 'source' - a string identifying the source dataset of the given text (whether it originated in CEAID, MULTITuDE, or MultiSocial),

  • 'domain' - "news" for news articles, "social_media" for social-media texts.
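As an illustration of how the 'multi_label' field could be split into its generator and subset parts, here is a minimal sketch. It assumes the subset suffix follows the last "_" character (generator names may themselves contain underscores). The exact suffix strings, the generator name "example-llm", and the sample records are hypothetical and are not taken from the dataset files.

```python
from typing import NamedTuple


class ParsedLabel(NamedTuple):
    generator: str  # model name, or "human"
    subset: str     # assumed suffix, e.g. "original", "homoglyph", "paraphrased"


def parse_multi_label(multi_label: str) -> ParsedLabel:
    """Split a 'multi_label' value at its last underscore."""
    generator, sep, subset = multi_label.rpartition("_")
    if not sep:
        raise ValueError(f"no '_' separator in multi_label: {multi_label!r}")
    return ParsedLabel(generator, subset)


# Hypothetical records using the field names listed above.
records = [
    {"text": "...", "label": 0, "multi_label": "human_original",
     "language": "sk", "length": 12, "source": "MultiSocial",
     "domain": "social_media"},
    {"text": "...", "label": 1, "multi_label": "example-llm_homoglyph",
     "language": "sk", "length": 15, "source": "CEAID",
     "domain": "news"},
]

# Select machine-generated samples from the (assumed) homoglyph subset.
homoglyph = [
    r for r in records
    if r["label"] == 1 and parse_multi_label(r["multi_label"]).subset == "homoglyph"
]
print(len(homoglyph))
```

Splitting at the last underscore rather than the first keeps a generator name such as "some_model" intact.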

Basic statistics:

language  original human  original machine  homoglyph machine  paraphrased machine
cs        200             200               200                200
de        200             200               200                200
hr        200             200               200                200
hu        200             200               200                200
pl        200             200               200                200
sk        200             200               200                200
sl        200             200               200                200
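A table like the one above could be recomputed from loaded records with a simple tally. The sketch below is an assumption-laden illustration: it runs on a few made-up records, assumes the subset name is the part of 'multi_label' after the last "_", and uses a hypothetical generator name ("example-llm"). On the full dataset, each (language, column) cell would be expected to equal 200.

```python
from collections import Counter


def tally(records):
    """Count records per (language, '<subset> <class>') cell."""
    counts = Counter()
    for r in records:
        # Assumed convention: subset name follows the last underscore.
        subset = r["multi_label"].rpartition("_")[2]
        cls = "human" if r["label"] == 0 else "machine"
        counts[(r["language"], f"{subset} {cls}")] += 1
    return counts


# Made-up records for illustration; not real dataset rows.
toy = [
    {"language": "cs", "label": 0, "multi_label": "human_original"},
    {"language": "cs", "label": 1, "multi_label": "example-llm_original"},
    {"language": "cs", "label": 1, "multi_label": "example-llm_homoglyph"},
    {"language": "sk", "label": 1, "multi_label": "example-llm_paraphrased"},
]

counts = tally(toy)
for (language, column), n in sorted(counts.items()):
    print(language, column, n)
```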

Files

Restricted

The record is publicly accessible, but files are restricted.

Request access

If you would like to request access to these files, please fill out the form below.

You need to satisfy these conditions in order for this request to be accepted:

In order to share the dataset with you, please agree to the following terms:

  1. You will use the dataset strictly for non-commercial research purposes. The request for access to the dataset must be sent from an official, existing e-mail address of the relevant university, faculty, or other scientific or research institution (for verification purposes).
  2. You will not re-share the dataset with anyone else not included in this request.
  3. You will appropriately cite the paper mentioned in the dataset description in any publication, project, or tool using this dataset.
  4. You understand how the dataset was created and that the "human" label may not be 100% correct.
  5. You acknowledge that you are fully responsible for the use of the dataset (data) and for any infringement of rights of third parties (in particular copyright) that may arise from its use beyond the intended purposes. The authors are not responsible for your actions.


Additional details

Funding

Government of Slovakia
RobIndAI 09I01-03-V04-00059

Software