Audio, Speech and Vision Processing Lab Emotional Sound database (ASVP-ESD)

Tientcheu Touko Landry Dejoli; Qianhua He; Wei Xie

doi:10.5281/zenodo.5573185

Published May 24, 2021 | Version 2

Journal article | Open Access


1. School of Electronic and Information Engineering, South China University of Technology, Guangzhou, 510640, China


Citing the ASVP-ESD

The ASVP-ESD emotional sound database is released by the Audio, Speech, and Vision Processing Lab (http://www.speech-led.com/main.htm) of the South China University of Technology, so please cite the ASVP-ESD if you use it in your work in any form. Personal works, such as machine learning projects or blog posts, should at least provide a URL to this Zenodo page as a reference.

Contact Information

If you would like further information about the ASVP-ESD, or if you have any issues downloading the files, please contact us at 201722800077@mail.scut.edu.cn or 1197581424@qq.com.

Data labeling process

The first version of the dataset (52 folders) was labeled by 5 different annotators using a tagging application designed specifically for audio; the folders added later were labeled by 3 other annotators. After listening to each audio clip, the annotator chose the label that best matched their personal perception. Once tagging was complete, a simple voting algorithm assigned each clip to the class that received the most votes.
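The voting step can be sketched as a plain majority vote over the annotators' labels for one clip. This is a minimal illustration, not the lab's actual tool; the function name `majority_vote` is hypothetical, and the tie-breaking rule (the first tied label seen wins) is an assumption, since the description above does not say how ties were resolved.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label that received the most annotator votes.

    Hypothetical helper. Ties are broken by whichever tied label appears
    first in `votes`; this tie rule is an assumption, not part of the
    documented ASVP-ESD procedure.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    best = max(counts.values())
    # Walk the votes in their original order so the first-seen tied label wins.
    for label in votes:
        if counts[label] == best:
            return label


# Hypothetical labels from 5 annotators for a single clip.
votes = ["happy", "excited", "happy", "neutral", "happy"]
print(majority_vote(votes))  # happy
```

With 5 annotators and 12 classes, ties are possible (for example 2-2-1), so any reimplementation needs some explicit tie rule.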

Construction and Validation

The ASVP-ESD contains 7812 audio files (plus 1204 additional files of babies' voices). It is an emotion-based database containing both speech and non-speech emotional sounds. The audio was recorded and collected from movies, TV shows, YouTube channels, and other emotional sound websites. Compared to other public emotional databases, ASVP-ESD is more realistic and unscripted, with no language restriction.

Description

The Audio, Speech, and Vision Processing Lab Emotional Sound database (ASVP-ESD) contains 7812 audio files grouped into 78 folders: odd-numbered folders contain female voices and even-numbered folders contain male voices (total size: 1.26 GB). Because it is a realistic dataset, some folders contain dialog or several people interacting in the same audio. The speech and non-speech emotional sounds cover 12 different emotions: boredom, neutral, happiness (laugh, gaggle), sadness (cry, sniff), angry, fear (scream, deep breath, panic), surprise (amazed), disgust, excitement (agitation), pleasure, pain, and disappointment. Two intensity levels are used (normal and high). Audio is provided as 16 kHz, single-channel (mono) .wav files; individual files range from 0.5 to 20 seconds, for a total of about 10 hours 51 minutes. Note that two additional folders (acteur_150, acteur_50) contain only babies' audio (laugh, cry). The audio-only files are grouped as follows:

Actor_00 contains mixed audio samples from movies and website sounds.
Folders actor_01 to actor_19 and actor_31 to acteur_38 contain different actors from 3 different movies.
Folders actor_21 to actor_29 and actor_39 to actor_68 contain only sounds randomly collected from online platforms.
Folders actor_100 to actor_102 come from the same website (actor_100 contains crowd or multi-speaker voices); likewise, actor_103 to actor_106 come from a single shared website.
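After extracting the archive, the stated format (16 kHz, mono) and total duration can be checked with Python's standard `wave` module. The sketch below is an assumption-laden helper, not part of the dataset release: the name `audit_wav_folder` is hypothetical, it only matches lowercase `.wav` extensions, and the extraction path is whatever you chose locally. As a rough cross-check, assuming 16-bit PCM, 16 kHz mono audio takes 32,000 bytes per second, so 10 h 51 min (39,060 s) corresponds to about 1.25 GB, in line with the stated 1.26 GB.

```python
import wave
from pathlib import Path


def audit_wav_folder(root):
    """Summarize every .wav file under `root` (searched recursively).

    Hypothetical helper. Returns (file_count, total_seconds, off_spec),
    where off_spec lists files that are not 16 kHz mono, the format
    stated in the ASVP-ESD description.
    """
    count, total_seconds, off_spec = 0, 0.0, []
    for path in sorted(Path(root).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            channels = wav.getnchannels()
            # Duration in seconds = number of frames / frames per second.
            total_seconds += wav.getnframes() / rate
        count += 1
        if rate != 16000 or channels != 1:
            off_spec.append(str(path))
    return count, total_seconds, off_spec
```

For example, calling `audit_wav_folder` on the extracted `ASVP-ESD_UPDATE` directory (a hypothetical local path) and dividing `total_seconds` by 3600 should give a figure close to the stated total of about 10.85 hours, if the archive matches the description.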

File naming convention

Each audio file has a unique filename consisting of a sequence of numerical identifiers (e.g., 02-01-06-01-02-105-02-01-02.wav for speech, 02-01-06-01-02-105-02-01-02-01-03.wav for non-speech). These identifiers define the stimulus characteristics:

Filename identifiers

Modality (03 = audio-only).
Vocal channel (01 = speech, 02 = non-speech).
Emotion (01 = boredom, 02 = neutral, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised, 09 = excited, 10 = pleasure, 11 = pain, 12 = disappointment).
Emotional intensity (01 = normal, 02 = high).
Statement (as the data are unscripted, this refers to the number of the sample selected within each actor folder).
Actor (even-numbered actors are male, odd-numbered actors are female).
Age (01 = above 65, 02 = between 20 and 64, 03 = under 20, 04 = newborn).
Source (01 = website, 02 = YouTube channel, 03 = movies).
Language (01 = Chinese, 02 = English, 04 = French, others).
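The identifier list above can be turned into a small decoder. This is a minimal sketch under stated assumptions: the function name `parse_asvp_filename` is hypothetical, it accepts only names with 9 fields or with the 2 extra non-speech fields (sub-category and sample index within a dialog, as in the worked example), and codes missing from the tables (such as language "others") are returned as the raw string rather than rejected.

```python
from pathlib import Path

# Code tables transcribed from the "Filename identifiers" list.
MODALITIES = {"03": "audio-only"}
VOCAL_CHANNELS = {"01": "speech", "02": "non-speech"}
EMOTIONS = {
    "01": "boredom", "02": "neutral", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
    "09": "excited", "10": "pleasure", "11": "pain", "12": "disappointment",
}
INTENSITIES = {"01": "normal", "02": "high"}
AGES = {"01": "above 65", "02": "20-64", "03": "under 20", "04": "newborn"}
SOURCES = {"01": "website", "02": "YouTube channel", "03": "movies"}
LANGUAGES = {"01": "Chinese", "02": "English", "04": "French"}


def parse_asvp_filename(filename):
    """Decode an ASVP-ESD filename into a dict of labelled fields.

    Hypothetical helper. Assumes 9 dash-separated fields, optionally
    followed by 2 non-speech fields. Unknown codes are kept as raw strings.
    """
    parts = Path(filename).stem.split("-")
    if len(parts) not in (9, 11):
        raise ValueError(f"expected 9 or 11 fields, got {len(parts)}: {filename}")
    (modality, channel, emotion, intensity, statement,
     actor, age, source, language) = parts[:9]
    info = {
        "modality": MODALITIES.get(modality, modality),
        "vocal_channel": VOCAL_CHANNELS.get(channel, channel),
        "emotion": EMOTIONS.get(emotion, emotion),
        "intensity": INTENSITIES.get(intensity, intensity),
        "statement": int(statement),
        "actor": int(actor),
        # Even-numbered actors are male, odd-numbered actors are female.
        "sex": "male" if int(actor) % 2 == 0 else "female",
        "age": AGES.get(age, age),
        "source": SOURCES.get(source, source),
        "language": LANGUAGES.get(language, language),
    }
    if len(parts) == 11:
        # Extra non-speech fields: sub-category code and sample index in a dialog.
        info["subcategory"] = parts[9]
        info["dialog_sample"] = int(parts[10])
    return info


info = parse_asvp_filename("03-01-06-01-02-12-02-01-01-16-04.wav")
print(info["emotion"], info["sex"], info["subcategory"], info["dialog_sample"])
# fearful male 16 4
```

Returning raw codes for unknown values, instead of raising, keeps the decoder usable on folders whose codes fall outside the documented tables.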

Filename example: 03-01-06-01-02-12-02-01-01-16-04.wav

1. Audio-only (03)

2. Speech (01)

3. Fearful (06)

4. Normal intensity (01)

5. Statement (02)

6. Actor 12 (12): folder 12, male since the number is even

7. Age between 20 and 64 (02)

8. Source: website (01)

9. Language: Chinese (01)

10. Screaming (16), non-speech only

11. The 4th sample from the same dialog (04)

All files whose names end with 77 were recorded in a high-noise environment.

For non-speech data, the sub-category codes are:

Happiness (laugh = 13, gaggle = 23, others = 33)

Sadness (cry = 14, sigh = 24, sniffle = 34, suffering = 44)

Fear (scream = 16, breath = 26, panic = 36)

Angry (rage = 15, frustration = 25, other = 35)

Surprise (surprised = 18, amazed = 28, astonishment = 38, others = 48)

Disgust (disgust = 17, rejection = 27)
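The sub-category codes and the high-noise marker can be looked up with a small table. In every listed code the last digit matches the emotion code from the identifier table (for example, 16 = scream ends in 6 = fearful), though that is an observation of these codes rather than a documented rule, so the sketch uses an explicit table. Assumptions, labeled here and in the code: the helper name `describe_non_speech` is hypothetical, the sub-category is taken as the 10th dash-separated field (as in the worked filename example), and a final field of `77` is read as the high-noise marker.

```python
# Non-speech sub-category codes, transcribed from the list above.
NON_SPEECH_SUBCATEGORIES = {
    "13": ("happiness", "laugh"), "23": ("happiness", "gaggle"),
    "33": ("happiness", "others"),
    "14": ("sadness", "cry"), "24": ("sadness", "sigh"),
    "34": ("sadness", "sniffle"), "44": ("sadness", "suffering"),
    "15": ("angry", "rage"), "25": ("angry", "frustration"),
    "35": ("angry", "other"),
    "16": ("fear", "scream"), "26": ("fear", "breath"), "36": ("fear", "panic"),
    "17": ("disgust", "disgust"), "27": ("disgust", "rejection"),
    "18": ("surprise", "surprised"), "28": ("surprise", "amazed"),
    "38": ("surprise", "astonishment"), "48": ("surprise", "others"),
}


def describe_non_speech(filename):
    """Return (emotion, sub_category, is_noisy) for a non-speech filename.

    Hypothetical helper. Assumes the sub-category is the 10th field and
    that a last field of "77" marks a high-noise recording.
    """
    stem = filename.rsplit(".", 1)[0]
    parts = stem.split("-")
    if len(parts) < 10:
        raise ValueError(f"no sub-category field in {filename}")
    code = parts[9]
    # Unlisted codes are reported as "unknown" with the raw code kept.
    emotion, sub = NON_SPEECH_SUBCATEGORIES.get(code, ("unknown", code))
    return emotion, sub, parts[-1] == "77"


print(describe_non_speech("03-01-06-01-02-12-02-01-01-16-04.wav"))
# ('fear', 'scream', False)
```

Because the position of the `77` marker relative to the dialog-sample field is not spelled out, the noise check should be validated against the actual files before relying on it.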

For any suggestions, please don't hesitate to send us an email.

Files

ASVP-ESD_UPDATE.zip (1.4 GB, md5: 34d5e629682ae715ed84e9d8269259cc)

