Published October 9, 2024 | Version v1

EmotionCaps: A Synthetic Emotion-Enriched Audio Captioning Dataset

  • New Jersey Institute of Technology

Description

Version 1.0, October 2024

Created by

Mithun Manivannan (1), Vignesh Nethrapalli (1), Mark Cartwright (1)

  1. Sound Interaction and Computing Lab, New Jersey Institute of Technology

Publication

If you use this data in academic work, please reference the DOI and version, and cite the following paper, which presents the data collection procedure and the first version of the dataset:

Manivannan, M., Nethrapalli, V., Cartwright, M. EmotionCaps: Enhancing Audio Captioning Through Emotion-Augmented Data Generation. arXiv preprint arXiv:2410.12028, 2024.

Description

EmotionCaps is a ChatGPT-assisted, weakly-labeled audio captioning dataset developed to bridge the gap between soundscape emotion recognition (SER) and automated audio captioning (AAC). Created through a three-stage pipeline, the dataset leverages ground-truth annotations from AudioSet SL, which are enriched by ChatGPT using tailored prompts and emotion labels assigned by a soundscape emotion recognition model trained on the Emo-Soundscapes dataset. It comprises four subsets of captions for 120,071 audio clips, each reflecting a different prompt variation: WavCaps-like, Scene-Focused, Emotion Addon, and Emotion Rewrite. The average word counts for these subsets are: WavCaps-like (12.61), Scene-Focused (14.04), Emotion Addon (18.35), and Emotion Rewrite (18.65). The higher word counts for the emotion prompts reflect the additional sentence length introduced by integrating emotion information into the captions.
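
As a sanity check, the reported average word counts can be recomputed in a few lines of pandas, assuming the four per-subset CSV files described under "Synthetic Captions" below:

```python
import pandas as pd

# Recompute the average caption word count for each subset.
# File names follow the per-subset CSVs described below.
for name in ["wavcaps-like", "scene-focused", "emotion-addon", "emotion-rewrite"]:
    df = pd.read_csv(f"{name}.csv")
    mean_words = df["caption"].str.split().str.len().mean()
    print(f"{name}: {mean_words:.2f} words on average")
```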

Audio Data

The audio data is from AudioSet SL, the strongly-labeled subset of 120,071 audio clips from the larger AudioSet dataset.

Synthetic Captions

The synthetic captions were generated using a three-stage pipeline, beginning with training a soundscape emotion recognition model. This model predicts the valence and arousal of each audio clip and maps the resulting vector to an emotion identifier. Next, we extracted the list of sound events from the ground-truth annotations of AudioSet SL. Using these sound events, we employed ChatGPT to create different variations of captions by applying distinct prompts.
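
The exact valence-arousal-to-emotion mapping is specified in the accompanying paper; purely as an illustration of the idea, a coarse quadrant-style lookup might look like the sketch below (the emotion labels and thresholds here are hypothetical, not the ones used to build the dataset):

```python
def emotion_from_valence_arousal(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) prediction in [-1, 1]^2 to a coarse emotion
    word. Hypothetical quadrant mapping for illustration only; the dataset
    uses the mapping described in the EmotionCaps paper."""
    if valence >= 0:
        return "exciting" if arousal >= 0 else "calm"
    return "chaotic" if arousal >= 0 else "monotonous"

print(emotion_from_valence_arousal(0.4, -0.2))  # prints "calm"
```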

We first used the WavCaps prompt for AudioSet SL as a base; we call its output WavCaps-like. Building on this, we created three new prompt variations: (1) Scene-Focused, a modified WavCaps prompt that describes the scene; (2) Emotion Addon, an extension of the Scene-Focused prompt in which an emotion is appended to the list of sound events to guide the caption generation; and (3) Emotion Rewrite, a two-step prompt in which ChatGPT first generates the Scene-Focused caption and is then instructed to rewrite it with a specific emotion in mind.

Using these four prompt styles (WavCaps-like, Scene-Focused, Emotion Addon, and Emotion Rewrite), along with the AudioSet SL sound events and predicted emotions, we employed ChatGPT (GPT-3.5 Turbo) to generate four corresponding caption variations for the dataset.
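
The exact prompt texts are given in the paper; the sketch below only illustrates the general shape of such a request with the openai Python client. The prompt wording, system message, and helper name are illustrative assumptions, not the prompts used to build the dataset:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_caption(sound_events: list[str], emotion: str | None = None) -> str:
    """Illustrative Emotion Addon-style request: the emotion (if given) is
    appended to the sound-event list that guides the caption generation."""
    events = ", ".join(sound_events)
    if emotion is not None:
        events += f", {emotion}"
    messages = [
        {"role": "system", "content": "You write one-sentence audio captions."},
        {"role": "user", "content": f"Describe a scene containing these sounds: {events}"},
    ]
    response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    return response.choices[0].message.content

print(generate_caption(["dog barking", "children playing"], emotion="exciting"))
```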

Each caption variation is organized into a separate CSV file for clarity and accessibility. All files correspond to the same set of audio clips from AudioSet SL; the key distinction is the caption variation associated with each clip. The subsets are designed to be used independently, as each fulfills a specific role in understanding the impact of emotion in audio captions.

  • wavcaps-like.csv: Contains captions generated using the WavCaps prompt, serving as the baseline before emotion is introduced.

  • scene-focused.csv: Provides captions focused on describing the scene or environment of the audio clip, without emotion integration.

  • emotion-addon.csv: Captions where emotion data is appended to the scene-focused base caption.

  • emotion-rewrite.csv: Captions that are completely rewritten based on the scene-focused base caption and the assigned emotion.

This structure allows users to explore how emotional content influences captioning models by comparing the variations both with and without emotional enrichment.
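
For example, two subsets can be joined on segment_id to place captions with and without emotional enrichment side by side (a minimal sketch, assuming the CSV layout described under "Columns in CSV files"):

```python
import pandas as pd

# Align two caption variations on segment_id to compare the same clip
# with and without emotional enrichment.
base = pd.read_csv("scene-focused.csv")
rewrite = pd.read_csv("emotion-rewrite.csv")
merged = base.merge(rewrite, on="segment_id", suffixes=("_scene", "_emotion"))
print(merged[["segment_id", "caption_scene", "caption_emotion"]].head())
```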

Columns in CSV files

segment_id : The ID of the audio recording in AudioSet SL. These are in the form <YouTube ID>_<start time in ms>_<end time in ms>

caption : The caption generated for each audio clip, corresponding to the specific subset (i.e., WavCaps-like, Scene-Focused, Emotion Addon, or Emotion Rewrite) as indicated by the file name.
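
Because YouTube IDs can themselves contain underscores, segment_id is most safely parsed from the right. A minimal sketch (the example ID and reconstructed URL are for illustration only):

```python
def parse_segment_id(segment_id: str) -> tuple[str, int, int]:
    """Split '<YouTube ID>_<start ms>_<end ms>' into its parts.
    YouTube IDs may contain underscores, so split from the right."""
    youtube_id, start_ms, end_ms = segment_id.rsplit("_", 2)
    return youtube_id, int(start_ms), int(end_ms)

yt, start_ms, end_ms = parse_segment_id("dQw4w9WgXcQ_30000_40000")  # example ID
print(f"https://www.youtube.com/watch?v={yt}", start_ms / 1000, end_ms / 1000)
```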

Conditions of use

Dataset created by Mithun Manivannan, Vignesh Nethrapalli, Mark Cartwright

The EmotionCaps dataset is offered free of charge under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license:

https://creativecommons.org/licenses/by/4.0/

The dataset and its contents are made available on an “as is” basis and without warranties of any kind, including without limitation satisfactory quality and conformity, merchantability, fitness for a particular purpose, accuracy or completeness, or absence of errors. Subject to any liability that may not be excluded or limited by law, New Jersey Institute of Technology is not liable for, and expressly excludes all liability for, loss or damage however and whenever caused to anyone by any use of the EmotionCaps dataset or any part of it.

Feedback

Please help us improve EmotionCaps by sending your feedback to:

In case of a problem, please include as many details as possible.

Acknowledgments

This work was partially supported by the New Jersey Institute of Technology Honors Summer Research Institute (HSRI).

Files

Total size: 56.1 MB

  • md5:bd20858aad17ba7b08c67cf36e955bc8 (16.2 MB)
  • md5:b2b25c5fa185a7fbfeb6b9c62680f8a9 (16.5 MB)
  • md5:1d93ca6d9b3364bddded0162369b123c (5.7 kB)
  • md5:28e3c0b18517c4f62845369845cfc1c3 (12.2 MB)
  • md5:4712917a3a7eeb07429178564f59b086 (11.2 MB)