On Speakers' Identities, Autism Self-Disclosures and LLM-Powered Robots

Höhn, Sviatlana; Philippy, Fred; André, Elisabeth

doi:10.5281/zenodo.15807538

Published July 4, 2025 | Version v1

Dataset Open


1. University of Luxembourg
2. University of Augsburg

Introduction

This repository contains the data created and used for the paper "On Speakers' Identities, Autism Self-Disclosures, and LLM-Powered Robots."

The dataset includes LLM-generated text across three task types: Retrieval-Augmented Generation Question Answering, Story Generation, and Dialogue Generation. The responses were conditioned on varying self-descriptions from autistic individuals.

These self-descriptions are divided into two primary groups:

Group A: Descriptions with an explicit self-disclosure of neurodivergence.
Group B: Descriptions without any self-disclosure of neurodivergence.

To allow for controlled comparisons, counterpart descriptions were manually created:

Group A⁻: Group A descriptions with the self-disclosures removed.
Group B⁺: Group B descriptions with added (potential) self-disclosures.

As a result, the dataset includes paired versions for both groups, labeled as *_with_SD and *_without_SD.
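The four description variants can be summarised in a small lookup table. This is a minimal sketch of one reading of the text above, not part of the dataset: the names `Variant`, `VARIANTS`, and `counterpart` are hypothetical, `"A-"` and `"B+"` stand for A⁻ and B⁺, and the mapping of each variant to a `with_SD` or `without_SD` label is inferred from the group definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    group: str              # "A", "A-", "B" or "B+" (ASCII stand-ins for A⁻ / B⁺)
    label: str              # assumed suffix: "with_SD" or "without_SD"
    manually_created: bool  # True for the counterparts written by hand


# Assumed mapping, inferred from the group definitions above.
VARIANTS = {
    "A": Variant("A", "with_SD", False),
    "A-": Variant("A-", "without_SD", True),
    "B": Variant("B", "without_SD", False),
    "B+": Variant("B+", "with_SD", True),
}

# Each original group is paired with its manually created counterpart.
_PAIRS = {"A": "A-", "A-": "A", "B": "B+", "B+": "B"}


def counterpart(group: str) -> Variant:
    """Return the other member of the controlled pair for a group."""
    return VARIANTS[_PAIRS[group]]
```

For example, `counterpart("A")` returns the A⁻ variant, which under this reading carries the `without_SD` label.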

Further Details

Models

The dataset was generated using four LLMs:

Gemma 2 9B
Llama 3.1 8B
Mistral Small (22B)
Mistral NeMo (12B)

Tasks

Story Generation (`stories.json`)

Each LLM was tasked with generating a short fictional story about the person described in the prompt.

Retrieval-Augmented Generation Question Answering (`rag-qa.json`)

Using a RAG approach, 100 distinct questions on health topics were asked, with an open-access book serving as the source material for the LLMs. Each prompt was personalized with a self-description of a persona.

Dialogues (`dialogues.json`)

LLMs were tasked with generating potential dialogues between a specified user persona (based on the provided self-description) and a service provider (either a baker or a doctor). Two dialogue formats were created: (1) Instruction-based, where the model generated the entire dialogue from scratch, and (2) Script-based, where the model was given only the service provider’s lines and generated the user persona's responses.
The dataset includes complete dialogues (both with and without self-disclosures: `text_with_SD` and `text_without_SD`), along with separated turns for the service provider (Speaker 1) and the user persona (Speaker 2).
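The paired fields can be collected without knowing the exact layout of the file. The sketch below is an illustration only: the nesting of `dialogues.json` is not documented here, so it walks the parsed JSON recursively and yields every object that carries both `text_with_SD` and `text_without_SD`. The function names `iter_sd_pairs` and `load_sd_pairs` are hypothetical.

```python
import json


def iter_sd_pairs(node, with_key="text_with_SD", without_key="text_without_SD"):
    """Yield (with_SD, without_SD) tuples from any nested JSON structure.

    The file layout is an unknown here, so every dict and list is visited.
    """
    if isinstance(node, dict):
        if with_key in node and without_key in node:
            yield node[with_key], node[without_key]
        for value in node.values():
            yield from iter_sd_pairs(value, with_key, without_key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_sd_pairs(item, with_key, without_key)


def load_sd_pairs(path):
    """Parse a JSON file and return all paired texts found in it."""
    with open(path, encoding="utf-8") as handle:
        return list(iter_sd_pairs(json.load(handle)))
```

Calling `load_sd_pairs("dialogues.json")` would then return a list of text pairs, provided both keys appear on the same object; if the file stores them differently, the walker simply returns an empty list.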

Further details about the dataset's construction and intended use can be found in the original paper.

Files

Files (397.3 MB)

Name	MD5	Size
dialogues.json	bab889fd160e25730d27940f6e22e034	60.8 MB
rag-qa.json	bbf68a4d0132b1d18090a9ebed7baee9	315.3 MB
stories.json	822fb0f4bf5dd29f6740dfa80bdd115c	21.2 MB
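The MD5 checksums listed for the three files can be used to confirm that a download is complete. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify_downloads` are hypothetical, and it assumes the files were saved under their listed names in a single directory.

```python
import hashlib
from pathlib import Path

# Checksums as listed for this record.
EXPECTED_MD5 = {
    "dialogues.json": "bab889fd160e25730d27940f6e22e034",
    "rag-qa.json": "bbf68a4d0132b1d18090a9ebed7baee9",
    "stories.json": "822fb0f4bf5dd29f6740dfa80bdd115c",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True if present with a matching MD5."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results
```

Reading in chunks keeps memory use flat, which matters mostly for the 315.3 MB `rag-qa.json`. A `False` entry in the result means the file is missing or its checksum differs.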

Additional details

Programming language: JSON

