CEAID Adversarial Subset
Description
CEAID is a benchmark dataset (described in a paper) for machine-generated text detection in 7 Central European languages (Croatian, Czech, German, Hungarian, Polish, Slovak, and Slovenian) across two domains (news and social media). It contains 188,098 texts, of which about 23k are human-written and about 165k were generated by 8 multilingual large language models.
This dataset extends CEAID for evaluating the adversarial robustness of machine-generated text detection methods. It contains a carefully balanced, pseudorandomly selected subset of 100 texts per domain and language for each of the machine-generated and human-written classes. For each machine-generated sample, it also contains adversarially modified counterparts produced by each of the two attacks used (homoglyph attack and paraphrasing). In total, it contains 1,400 human-written and 4,200 machine-generated samples. The dataset has been anonymized to minimize the amount of sensitive data by hiding email addresses, usernames, and phone numbers.
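The totals follow directly from the sampling design. The short sketch below only re-derives them from the numbers stated in the description; the variable names and the variant labels are illustrative and are not taken from the dataset files.

```python
# Design parameters as stated in the description.
LANGUAGES = ["hr", "cs", "de", "hu", "pl", "sk", "sl"]
DOMAINS = ["news", "social_media"]
TEXTS_PER_CELL = 100  # texts per language, domain, and class

# Illustrative names for the three machine-generated variants
# (unmodified texts plus the two attacks).
VARIANTS = ["original", "homoglyph", "paraphrased"]

# Human-written texts exist only in their original form.
human = len(LANGUAGES) * len(DOMAINS) * TEXTS_PER_CELL

# Each selected machine-generated text appears once per variant.
machine_original = len(LANGUAGES) * len(DOMAINS) * TEXTS_PER_CELL
machine_total = machine_original * len(VARIANTS)

print(human, machine_total, human + machine_total)  # prints: 1400 4200 5600
```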
If you use this dataset in any publication, project, tool, or in any other form, please cite the paper.
Disclaimer
Due to the data sources (the original CEAID is a subset of a combination of news articles from MULTITuDEv3 and social-media texts from MultiSocial), the dataset may contain harmful, disinformation, or offensive content. The MultiSocial dataset description states that, based on a multilingual toxicity detector, about 8% of the text samples are probably toxic (from 5% in WhatsApp to 10% in Twitter). Although we have used older data sources (which lowers the probability of including machine-generated texts), the labeling of human-written text might not be 100% accurate. The anonymization procedure might not have successfully hidden all sensitive or personal content; thus, use the data cautiously (if you feel affected by such content, report the issues found to dpo[at]kinit.sk). The intended use is for non-commercial research purposes only.
Data
The dataset has the following fields:
- 'text' - a text sample,
- 'label' - 0 for human-written text, 1 for machine-generated text,
- 'multi_label' - a string identifying the large language model that generated the text, or the string "human" for a human-written text, followed by a "_" character and a string identifying the subset (original, homoglyph, or paraphrased),
- 'language' - the ISO 639-1 language code identifying the detected language of the given text,
- 'length' - word count of the given text,
- 'source' - a string identifying the source dataset of the given text (whether it originated in CEAID, MULTITuDE, or MultiSocial),
- 'domain' - "news" for news articles, "social_media" for social-media texts.
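Because 'multi_label' packs two pieces of information into one string, it is often convenient to split it. The following is a minimal sketch, assuming the subset name is whatever follows the last "_" character; the exact suffix strings and the model name used in the example are hypothetical and should be checked against the actual data.

```python
def split_multi_label(multi_label: str) -> tuple[str, str]:
    """Split a 'multi_label' value into (generator, subset).

    Splits on the LAST underscore so that generator names which
    themselves contain underscores stay intact (an assumption about
    the data, not something the description guarantees).
    """
    generator, sep, subset = multi_label.rpartition("_")
    if not sep:
        # No underscore present: treat the whole value as the generator.
        return multi_label, ""
    return generator, subset


# Hypothetical example values, for illustration only.
print(split_multi_label("human_original"))
print(split_multi_label("some_model_homoglyph"))
```

Splitting from the right is a deliberate choice: it keeps the sketch robust if a model identifier contains "_" characters.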
Basic statistics:
| language | original human | original machine | homoglyph machine | paraphrased machine |
|---|---|---|---|---|
| cs | 200 | 200 | 200 | 200 |
| de | 200 | 200 | 200 | 200 |
| hr | 200 | 200 | 200 | 200 |
| hu | 200 | 200 | 200 | 200 |
| pl | 200 | 200 | 200 | 200 |
| sk | 200 | 200 | 200 | 200 |
| sl | 200 | 200 | 200 | 200 |
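A table like the one above can be recomputed from the records themselves. The sketch below is one possible way to do so with the standard library only; it assumes each record is a dictionary with the documented 'language', 'label', and 'multi_label' fields, and the toy records (including the model name) are made up for illustration.

```python
from collections import Counter


def tally(records):
    """Count records per (language, subset, class)."""
    counts = Counter()
    for row in records:
        # Subset name is assumed to follow the last underscore.
        _, _, subset = row["multi_label"].rpartition("_")
        cls = "human" if row["label"] == 0 else "machine"
        counts[(row["language"], subset, cls)] += 1
    return counts


# Toy records with hypothetical values, not real dataset rows.
toy_records = [
    {"language": "sk", "label": 0, "multi_label": "human_original"},
    {"language": "sk", "label": 1, "multi_label": "example-model_original"},
    {"language": "sk", "label": 1, "multi_label": "example-model_homoglyph"},
    {"language": "cs", "label": 1, "multi_label": "example-model_paraphrased"},
]

counts = tally(toy_records)
for key in sorted(counts):
    print(key, counts[key])
```

On the full dataset, every language should yield 200 for each of the four columns, since each cell combines 100 news and 100 social-media texts.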
Software
- Repository URL: https://github.com/kinit-sk/CEAID