Published February 17, 2025 | Version v2
Dataset Open

PAN'25 Generative AI Detection (Task 2): Human-AI Collaborative Text Classification

  • 1. Mohamed bin Zayed University of Artificial Intelligence, UAE
  • 2. Nebius AI, Netherlands
  • 3. New York University Abu Dhabi, UAE
  • 4. Toloka AI, Netherlands
  • 5. Cornell University, USA
  • 6. Technical University of Darmstadt, Germany

Description

Dataset for the Generative AI Detection Task (Subtask 2) at PAN 2025.

As large language models (LLMs) such as GPT-4o, Claude 3.5, and Gemini 1.5 Pro become increasingly accessible, machine-generated content is proliferating across diverse domains, including news, social media, education, and academia. These models produce highly fluent and coherent text, making them valuable for automating various writing tasks. However, their widespread use also raises concerns about misinformation, academic integrity, and content authenticity. Identifying the degree of human and machine involvement in text creation is crucial for addressing these challenges.

In this shared task, we focus on Human-AI Collaborative Text Classification, where the goal is to categorize documents that have been co-authored by humans and LLMs. Specifically, we aim to classify texts into six distinct categories based on the nature of human and machine contributions:

  • Fully human-written: The document is entirely authored by a human without any AI assistance.
  • Human-initiated, then machine-continued: A human starts writing, and an AI model completes the text.
  • Human-written, then machine-polished: The text is initially written by a human but later refined or edited by an AI model.
  • Machine-written, then machine-humanized (obfuscated): An AI generates the text, which is later modified to obscure its machine origin.
  • Machine-written, then human-edited: The content is generated by an AI but subsequently edited or refined by a human.
  • Deeply-mixed text: The document contains interwoven sections written by both humans and AI, without a clear separation.

Label Distribution:

Label Category                               Train     Dev
Machine-written, then machine-humanized     91,232  10,137
Human-written, then machine-polished        95,398  12,289
Fully human-written                         75,270  12,330
Human-initiated, then machine-continued     10,740  37,170
Deeply-mixed text (human + machine parts)   14,910     225
Machine-written, then human-edited           1,368     510
Total                                      288,918  72,661
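
The counts above are strongly imbalanced, and the two splits are distributed differently: "Machine-written, then human-edited" makes up roughly 0.5% of the training split, while "Human-initiated, then machine-continued" is about 3.7% of train but about 51% of dev. The following is a minimal sketch, not part of the dataset or the task baseline, that recomputes per-class shares from the table and derives inverse-frequency class weights. The helper names (`shares`, `balanced_weights`) and the choice of weighting heuristic are assumptions made for illustration.

```python
# Counts copied from the label distribution table; nothing here reads the
# actual dataset files.
TRAIN = {
    "Machine-written, then machine-humanized": 91_232,
    "Human-written, then machine-polished": 95_398,
    "Fully human-written": 75_270,
    "Human-initiated, then machine-continued": 10_740,
    "Deeply-mixed text (human + machine parts)": 14_910,
    "Machine-written, then human-edited": 1_368,
}
DEV = {
    "Machine-written, then machine-humanized": 10_137,
    "Human-written, then machine-polished": 12_289,
    "Fully human-written": 12_330,
    "Human-initiated, then machine-continued": 37_170,
    "Deeply-mixed text (human + machine parts)": 225,
    "Machine-written, then human-edited": 510,
}


def shares(counts):
    """Return each label's fraction of the split (hypothetical helper)."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def balanced_weights(counts):
    """Inverse-frequency weights: total / (n_classes * count).

    This is one common heuristic for imbalanced classification, assumed
    here for illustration; the task organizers do not prescribe it.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


if __name__ == "__main__":
    train_share, dev_share = shares(TRAIN), shares(DEV)
    weights = balanced_weights(TRAIN)
    for label in TRAIN:
        print(
            f"{label:<42} train={train_share[label]:6.2%} "
            f"dev={dev_share[label]:6.2%} weight={weights[label]:7.2f}"
        )
```

Because the dev distribution differs so much from train, weights derived from the training counts may not be appropriate for every model; macro-averaged metrics are one way to keep the rare classes visible during evaluation.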

 

Files

pan25-generative-ai-detection-task2-train.zip
Size: 282.3 MB
md5: 969665003148bdaae956f9e1cc9165ae
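
The md5 checksum listed for the archive can be used to confirm that a download is complete and uncorrupted. Below is a minimal sketch using only the Python standard library. The local path in `ARCHIVE` is an assumption (it presumes the zip was saved under its original name in the current directory), and the helper names `md5_of` and `verify` are hypothetical.

```python
import hashlib
from pathlib import Path

# Checksum as listed on this page for the training archive.
EXPECTED_MD5 = "969665003148bdaae956f9e1cc9165ae"

# Assumed local download location; adjust to wherever the file was saved.
ARCHIVE = Path("pan25-generative-ai-detection-task2-train.zip")


def md5_of(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's md5 matches the expected digest."""
    return md5_of(path) == expected.lower()


if __name__ == "__main__":
    if not ARCHIVE.exists():
        print(f"{ARCHIVE} not found; download it first.")
    elif verify(ARCHIVE):
        print("Checksum matches.")
    else:
        print("Checksum mismatch; the download may be incomplete or corrupted.")
```

Reading in chunks keeps memory use flat regardless of archive size, which matters little at 282.3 MB but costs nothing.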