Published July 4, 2025 | Version v1
Dataset Open

On Speakers' Identities, Autism Self-Disclosures and LLM-Powered Robots

  • 1. University of Luxembourg
  • 2. University of Augsburg

Description

Introduction

This repository contains the data created and used for the paper "On Speakers' Identities, Autism Self-Disclosures, and LLM-Powered Robots."

The dataset includes LLM-generated text across three task types: Retrieval-Augmented Generation Question Answering, Story Generation, and Dialogue Generation. The responses were conditioned on varying self-descriptions from autistic individuals.

These self-descriptions are divided into two primary groups:

  • Group A: Descriptions with an explicit self-disclosure of neurodivergence.
  • Group B: Descriptions without any self-disclosure of neurodivergence.

To allow for controlled comparisons, counterpart descriptions were manually created:

  • Group A⁻: Group A descriptions with the self-disclosures removed.
  • Group B⁺: Group B descriptions with added (potential) self-disclosures.

As a result, the dataset includes paired versions for both groups, labeled as *_with_SD and *_without_SD.
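The `*_with_SD` / `*_without_SD` naming convention makes it straightforward to pick out paired fields programmatically. The following is a minimal sketch, assuming each entry is available as a Python dictionary after loading the JSON; the function name `paired_fields` and the example record (including the `model` key) are hypothetical and do not reflect the actual schema of the files.

```python
WITH, WITHOUT = "_with_SD", "_without_SD"

def paired_fields(record):
    """Return {base_name: (with_sd_value, without_sd_value)} for each
    field that appears in both a *_with_SD and a *_without_SD variant."""
    pairs = {}
    for key, value in record.items():
        if key.endswith(WITH):
            base = key[: -len(WITH)]
            partner = base + WITHOUT
            if partner in record:
                pairs[base] = (value, record[partner])
    return pairs

# Hypothetical record for illustration only; the real entries may carry
# additional or differently named fields.
example = {
    "model": "placeholder",
    "text_with_SD": "generated text A",
    "text_without_SD": "generated text B",
}
pairs = paired_fields(example)
```

Pairing by suffix rather than by a fixed key list keeps the sketch usable even if a file contains more than one paired field.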

 

Further Details

Models

The dataset was generated using four different LLMs:

  • Gemma 2 9B
  • Llama 3.1 8B
  • Mistral Small (22B)
  • Mistral NeMo (12B)

Tasks

Story Generation (stories.json)

Each LLM was tasked with generating a short fictional story about the person described in the prompt.

Retrieval-Augmented Generation Question Answering (raq-qa.json)

Using a RAG approach, 100 distinct health topic questions were asked, with an open-access book serving as the source material for the LLMs. Each prompt was personalized with a self-description of a persona.

Dialogues (dialogues.json)

LLMs were tasked with generating potential dialogues between a specified user persona (based on the provided self-description) and a service provider (either a baker or a doctor). Two dialogue formats were created: (1) instruction-based, where the model generated the entire dialogue from scratch, and (2) script-based, where the model was given only the service provider's lines and generated the user persona's responses.
The dataset includes the complete dialogues in both variants (with and without self-disclosure: text_with_SD and text_without_SD), along with the separated turns of the service provider (Speaker 1) and the user persona (Speaker 2).
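To illustrate how a complete dialogue relates to its separated turns, here is a minimal sketch that splits a dialogue string by speaker. It assumes each turn starts on its own line with a `Speaker 1:` or `Speaker 2:` prefix; that layout, the function name `split_turns`, and the demo text are assumptions made for illustration, and the dataset already ships the separated turns.

```python
import re

# Assumed turn format: "Speaker 1: ..." or "Speaker 2: ..." at line start.
TURN = re.compile(r"^(Speaker [12]):\s*(.*)$")

def split_turns(dialogue):
    """Group the lines of a dialogue by speaker label."""
    turns = {"Speaker 1": [], "Speaker 2": []}
    current = None
    for raw in dialogue.splitlines():
        line = raw.strip()
        match = TURN.match(line)
        if match:
            current = match.group(1)
            turns[current].append(match.group(2))
        elif current and line:
            # A line without a label continues the previous turn.
            turns[current][-1] += " " + line
    return turns

# Made-up demo text, not taken from the dataset.
demo = (
    "Speaker 1: Hello, what can I get you?\n"
    "Speaker 2: A loaf of rye, please.\n"
    "Speaker 1: Anything else?"
)
turns = split_turns(demo)
```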

 

Further details about the dataset's construction and intended use can be found in the original paper.

Files

Files (397.3 MB)

Checksum Size
md5:bab889fd160e25730d27940f6e22e034 60.8 MB
md5:bbf68a4d0132b1d18090a9ebed7baee9 315.3 MB
md5:822fb0f4bf5dd29f6740dfa80bdd115c 21.2 MB
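The MD5 checksums listed above can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of` and `matches_listed` are hypothetical, and the path passed in is whatever local location the file was saved to.

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks so
    that large files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_listed(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of(path) == expected
```

For example, `matches_listed(local_path, "md5:bab889fd160e25730d27940f6e22e034")` should return `True` for an uncorrupted copy of the 60.8 MB file, where `local_path` is a placeholder for the downloaded file's location.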

Additional details

Software

Programming language
JSON