Emotionally Incongruent Synthetic Speech Dataset (EMIS)
Description
This dataset contains 1248 speech samples synthetically generated by Text-to-Speech (TTS) systems. The samples are emotionally incongruent: the emotion conveyed by the voice tone differs from the emotion conveyed by the transcription. To generate the speech, we leverage emotion-rich sentences divided into four distinct emotions: angry, happy, neutral, and sad. For each sentence, we employ three different TTS systems to generate speech in each of the four emotions, so each TTS system yields three emotionally incongruent speech samples per sentence. Unlike the standard emotional speech corpora used to train and test emotion recognition systems, this dataset provides a mismatch between the sentiment present in the tone of the voice and the sentiment present in the transcription of the sample.
1. Dataset Description
- The dataset contains synthetic speech samples generated by Text-to-Speech (TTS) systems.
- Each speech sample is an audio clip of approximately 5 seconds, uttered in one of four sentiments: angry, happy, neutral, and sad.
- Since the TTS systems require reference audio to extract the sentiment style, we employ samples from the Emotional Speech Dataset (ESD) spoken by English speakers.
- For each generation, we concatenate seven ESD clips to build the reference audio; a minimal sketch of this step is given below.
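As a rough illustration of the reference-audio construction, the sketch below concatenates seven same-emotion ESD clips into a single reference waveform. The clip paths, speaker ID, sampling rate, and output filename are hypothetical placeholders, not the exact files or settings used to build EMIS.

```python
import librosa
import numpy as np
import soundfile as sf

# Hypothetical ESD clip paths: seven clips from one speaker in one emotion.
esd_clips = [f"ESD/0011/Angry/0011_{i:06d}.wav" for i in range(1, 8)]

# Load each clip at 16 kHz mono and concatenate into one reference waveform,
# mirroring the reference-audio construction described above.
reference = np.concatenate([librosa.load(p, sr=16000)[0] for p in esd_clips])

sf.write("reference_angry_0011.wav", reference, 16000)
```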
2. File Format and Structure
- Speech samples are stored in .WAV format.
- Each sample name is formatted as {sentence-ID}_{text-emotion}_{audio-emotion}_{ESD-speaker-ID}_{TTS-used}.wav; a minimal filename parser is sketched after this list.
- Text-emotion is divided into explicit and implicit: explicit samples carry the emotion tag explicitly in the text, whereas implicit samples carry the sentiment in the context of the sentence.
- Each sentence-ID corresponds to a distinct sentence; in total, we have 104 sentences automatically generated by ChatGPT-4.5 (104 sentences × 4 voice emotions × 3 TTS systems = 1248 samples).
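For convenience, here is a minimal sketch of a parser for the naming convention above. The example filename is hypothetical and only illustrates the five fields.

```python
from pathlib import Path

def parse_emis_name(wav_path: str) -> dict:
    """Split an EMIS filename into the five fields of the naming convention."""
    sentence_id, text_emotion, audio_emotion, speaker_id, tts = (
        Path(wav_path).stem.split("_")
    )
    return {
        "sentence_id": sentence_id,
        "text_emotion": text_emotion,    # emotion carried by the transcript
        "audio_emotion": audio_emotion,  # emotion carried by the voice tone
        "esd_speaker_id": speaker_id,
        "tts_system": tts,
    }

# Hypothetical filename built from the convention above:
print(parse_emis_name("0001_happy_sad_0011_ttsA.wav"))
```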
3. Usage Instructions
- Data can be loaded directly with common Python audio libraries such as Librosa.
- Example:
```python
import librosa

# wav_path points to any .wav sample extracted from audios_EMIS.zip
audio, sample_rate = librosa.load(wav_path, sr=16000)
```
4. Applications
- Users can test emotion recognition systems to verify performance on audio whose voice-tone emotion differs from the emotion in the transcription; a rough evaluation sketch follows this list.
- Users can run perceptual studies on how humans perceive emotion when presented with this type of audio.
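As a rough sketch of the first application, the example below runs an off-the-shelf speech emotion recognition model over the extracted dataset and measures how often its prediction matches the voice-tone label from the filename. The extraction directory and the choice of model (superb/wav2vec2-base-superb-er, which predicts the labels ang/hap/neu/sad) are assumptions, not the dataset authors' setup; any four-emotion classifier can be substituted.

```python
from pathlib import Path
from transformers import pipeline

EMIS_DIR = Path("audios_EMIS")  # assumed extraction directory for audios_EMIS.zip

# One publicly available SER model (an assumption, not the dataset authors' choice).
classifier = pipeline("audio-classification", model="superb/wav2vec2-base-superb-er")

wav_files = sorted(EMIS_DIR.glob("*.wav"))
hits = 0
for wav in wav_files:
    # The audio (voice-tone) emotion is the third field of the filename.
    audio_emotion = wav.stem.split("_")[2]
    prediction = classifier(str(wav), top_k=1)[0]["label"]
    # Compare on the first three letters: "angry" -> "ang", "sad" -> "sad", etc.
    hits += prediction[:3] == audio_emotion[:3]

print(f"Agreement with the voice-tone label: {hits / len(wav_files):.1%}")
```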
Files (318.5 MB)

| Name | Size | MD5 checksum |
|---|---|---|
| audios_EMIS.zip | 318.5 MB | a78a40ce73eae28dabafc1d16ec703bc |
|  | 4.6 kB | 84a37ca7ae97859c809e50938b3b4b6f |
Additional details
Related works
- Is part of
- Conference paper: 10.48550/arXiv.2510.25054 (DOI)