On Speakers' Identities, Autism Self-Disclosures and LLM-Powered Robots
Authors/Creators
Description
Introduction
This repository contains the data created and used for the paper "On Speakers' Identities, Autism Self-Disclosures, and LLM-Powered Robots."
The dataset includes LLM-generated text across three task types: Retrieval-Augmented Generation Question Answering, Story Generation, and Dialogue Generation. The responses were conditioned on varying self-descriptions from autistic individuals.
These self-descriptions are divided into two primary groups:
- Group A: Descriptions with an explicit self-disclosure of neurodivergence.
- Group B: Descriptions without any self-disclosure of neurodivergence.
To allow for controlled comparisons, counterpart descriptions were manually created:
- Group A⁻: Group A descriptions with the self-disclosures removed.
- Group B⁺: Group B descriptions with added (potential) self-disclosures.
As a result, the dataset includes paired versions for both groups, labeled as *_with_SD and *_without_SD.
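The pairing convention above can be sketched in Python. This is a minimal illustration, not the authors' tooling: it assumes each record is a JSON object (a Python dict) in which a field ending in `_with_SD` has a sibling ending in `_without_SD`. The helper name `paired_fields` and the example record, including its `model` key and both texts, are hypothetical.

```python
from typing import Any

WITH_SUFFIX = "_with_SD"
WITHOUT_SUFFIX = "_without_SD"


def paired_fields(record: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Map each field stem to its (with_SD, without_SD) pair of values.

    Fields that lack a counterpart are skipped rather than guessed.
    """
    pairs: dict[str, tuple[Any, Any]] = {}
    for key, value in record.items():
        if key.endswith(WITH_SUFFIX):
            stem = key[: -len(WITH_SUFFIX)]
            partner = stem + WITHOUT_SUFFIX
            if partner in record:
                pairs[stem] = (value, record[partner])
    return pairs


# Hypothetical record: the real files may nest or name fields differently.
example_record = {
    "model": "example-model",
    "text_with_SD": "Output for the description that includes a self-disclosure.",
    "text_without_SD": "Output for the description without a self-disclosure.",
}

for stem, (with_sd, without_sd) in paired_fields(example_record).items():
    print(stem, "|", with_sd, "|", without_sd)
```

Matching on the suffix rather than on fixed field names keeps the sketch usable if a file carries more than one paired field.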
Further Details
Models
The dataset was generated using four LLMs:
- Gemma 2 9B
- Llama 3.1 8B
- Mistral Small (22B)
- Mistral NeMo (12B)
Tasks
Story Generation (stories.json)
Each LLM was tasked with generating a short fictional story about the person described in the prompt.
Retrieval-Augmented Generation Question Answering (raq-qa.json)
Using a RAG approach, each LLM was asked 100 distinct questions on health topics, with an open-access book serving as the source material. Each prompt was personalized with a self-description of a persona.
Dialogues (dialogues.json)
LLMs were tasked with generating potential dialogues between a specified user persona (based on the provided self-description) and a service provider (either a baker or a doctor). Two dialogue formats were created: (1) Instruction-based, where the model generated the entire dialogue from scratch, and (2) Script-based, where the model was given only the service provider’s lines and generated the user persona's responses.
The dataset includes complete dialogues in both versions, with and without self-disclosures (text_with_SD and text_without_SD), along with separated turns for the service provider (Speaker 1) and the user persona (Speaker 2).
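As a starting point for working with the paired dialogue texts, the following Python sketch loads a file and compares the two versions of one record. It rests on assumptions that should be checked against the actual file: that dialogues.json holds a top-level JSON array of objects, and that each object carries text_with_SD and text_without_SD as plain strings. The function names `load_records` and `word_count_gap` are hypothetical, and the word-count difference is only an example of a paired comparison, not a measure taken from the paper.

```python
import json
from pathlib import Path


def load_records(path: str) -> list[dict]:
    """Load a JSON file that is assumed to contain a top-level array of objects."""
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        # Assumption violated: inspect the file and adapt the traversal.
        raise ValueError("expected a top-level JSON array")
    return data


def word_count_gap(record: dict) -> int:
    """Whitespace-token count of text_with_SD minus that of text_without_SD."""
    with_sd = record["text_with_SD"]
    without_sd = record["text_without_SD"]
    return len(with_sd.split()) - len(without_sd.split())


if __name__ == "__main__":
    path = "dialogues.json"  # file name as listed in this record
    if Path(path).exists():
        records = load_records(path)
        gaps = [word_count_gap(r) for r in records]
        print(f"{len(gaps)} paired dialogues loaded")
```

If the real file groups records by model or dialogue format, only `load_records` needs to change; `word_count_gap` works on any single record exposing the two text fields.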
Further details about the dataset's construction and intended use can be found in the original paper.
Files
dialogues.json
Additional details
Software
- Programming language: JSON