Emotionally Incongruent Synthetic Speech Dataset (EMIS)
Description
This dataset contains 1248 speech samples synthetically generated by Text-to-Speech (TTS) systems. The samples are emotionally incongruent: the emotion conveyed by the voice tone differs from the emotion conveyed by the transcription. To generate the speech, we leverage emotion-rich sentences divided into four distinct emotions: angry, happy, neutral, and sad. For each sentence, we employ three different TTS systems to generate speech in each of the four emotions, so each TTS system yields three emotionally incongruent speech samples per sentence. Unlike the standard emotional speech corpora used to train and test emotion recognition systems, this dataset provides a mismatch between the sentiment present in the tone of the voice and the sentiment present in the transcription of the sample.
1. Dataset Description
- The dataset contains synthetic speech samples generated by Text-to-Speech (TTS) systems.
- Each speech sample is an audio clip of approximately 5 seconds, uttered in one of four sentiments: angry, happy, neutral, and sad.
- Since the TTS systems require reference audio to extract the sentiment style, we employ samples from the Emotional Speech Dataset (ESD) spoken by English speakers.
- For each generation, we concatenate seven ESD clips to build the reference audio; a minimal sketch of this step is given below.
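As a rough illustration of the reference-audio construction, the sketch below concatenates seven same-emotion ESD clips into a single reference waveform. The clip paths, speaker ID, sampling rate, and output filename are hypothetical placeholders, not the exact files or settings used to build EMIS.

```python
import librosa
import numpy as np
import soundfile as sf

# Hypothetical ESD clip paths: seven clips from one speaker in one emotion.
esd_clips = [f"ESD/0011/Angry/0011_{i:06d}.wav" for i in range(1, 8)]

# Load each clip at 16 kHz mono and concatenate into one reference waveform,
# mirroring the reference-audio construction described above.
reference = np.concatenate([librosa.load(p, sr=16000)[0] for p in esd_clips])

sf.write("reference_angry_0011.wav", reference, 16000)
```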
2. File Format and Structure
- Speech samples are stored in .WAV format.
- Each sample name is formatted as {sentence-ID}_{text-emotion}_{audio-emotion}_{ESD-speaker-ID}_{TTS-used}.wav; a minimal filename parser is sketched after this list.
- Text-emotion is divided into explicit and implicit: explicit samples carry the emotion tag explicitly in the text, whereas implicit samples carry the sentiment in the context of the sentence.
- Each sentence-ID corresponds to a distinct sentence; in total, we have 104 sentences automatically generated by ChatGPT-4.5 (104 sentences × 4 voice emotions × 3 TTS systems = 1248 samples).
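For convenience, here is a minimal sketch of a parser for the naming convention above. The example filename is hypothetical and only illustrates the five fields.

```python
from pathlib import Path

def parse_emis_name(wav_path: str) -> dict:
    """Split an EMIS filename into the five fields of the naming convention."""
    sentence_id, text_emotion, audio_emotion, speaker_id, tts = (
        Path(wav_path).stem.split("_")
    )
    return {
        "sentence_id": sentence_id,
        "text_emotion": text_emotion,    # emotion carried by the transcript
        "audio_emotion": audio_emotion,  # emotion carried by the voice tone
        "esd_speaker_id": speaker_id,
        "tts_system": tts,
    }

# Hypothetical filename built from the convention above:
print(parse_emis_name("0001_happy_sad_0011_ttsA.wav"))
```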
3. Usage Instructions
- Data can be loaded directly with common Python audio libraries such as Librosa.
- Example:
```python
import librosa

# wav_path points to any .wav sample extracted from audios_EMIS.zip
audio, sample_rate = librosa.load(wav_path, sr=16000)
```
4. Applications
- Users can test emotion recognition systems to verify performance on audio whose voice-tone emotion differs from the emotion in the transcription; a rough evaluation sketch follows this list.
- Users can run perceptual studies on how humans perceive emotion when presented with this type of audio.
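As a rough sketch of the first application, the example below runs an off-the-shelf speech emotion recognition model over the extracted dataset and measures how often its prediction matches the voice-tone label from the filename. The extraction directory and the choice of model (superb/wav2vec2-base-superb-er, which predicts the labels ang/hap/neu/sad) are assumptions, not the dataset authors' setup; any four-emotion classifier can be substituted.

```python
from pathlib import Path
from transformers import pipeline

EMIS_DIR = Path("audios_EMIS")  # assumed extraction directory for audios_EMIS.zip

# One publicly available SER model (an assumption, not the dataset authors' choice).
classifier = pipeline("audio-classification", model="superb/wav2vec2-base-superb-er")

wav_files = sorted(EMIS_DIR.glob("*.wav"))
hits = 0
for wav in wav_files:
    # The audio (voice-tone) emotion is the third field of the filename.
    audio_emotion = wav.stem.split("_")[2]
    prediction = classifier(str(wav), top_k=1)[0]["label"]
    # Compare on the first three letters: "angry" -> "ang", "sad" -> "sad", etc.
    hits += prediction[:3] == audio_emotion[:3]

print(f"Agreement with the voice-tone label: {hits / len(wav_files):.1%}")
```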
Files (318.5 MB)

| Name | Size | MD5 checksum |
|---|---|---|
| audios_EMIS.zip | 318.5 MB | a78a40ce73eae28dabafc1d16ec703bc |
|  | 4.6 kB | 84a37ca7ae97859c809e50938b3b4b6f |
Additional details
Related works
- Is part of
- Conference paper: 10.48550/arXiv.2510.25054 (DOI)