Audio, Speech and Vision Processing Lab Emotional Sound database (ASVP-ESD)

doi:10.5281/zenodo.3782416

Published May 2, 2020 | Version v1

Journal article Open

1. School of Electronic and Information Engineering, South China University of Technology, Guangzhou, 510640, China

Dejoli Tientcheu Touko Landry; Qianhua He; Wei Xie

Citing the ASVP-ESD

The ASVP-ESD emotional sound database is released by the Audio, Speech and Vision Processing Lab (http://www.speech-led.com/main.htm) at South China University of Technology. Please cite the ASVP-ESD if you use it in your work in any form. Personal works, such as machine learning projects or blog posts, should provide a URL to this Zenodo page, though a reference is also welcome.

Contact Information

For further information about the ASVP-ESD, or if you face any issues downloading files, please contact us at 201722800077@mail.scut.edu.cn or 1197581424@qq.com.

Data labeling process

The final data labeling was done by 5 different annotators using a tagging application designed specifically for audio tagging. After listening to each audio file, the annotator chose the label that best matched their personal impression. After tagging, a simple voting algorithm assigned each audio file to the class that received the most votes.
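The voting step can be sketched as a plain majority vote. This is a minimal illustration, not the lab's actual algorithm: the function name majority_label is hypothetical, and returning None on a tie is an assumption, since the description does not say how ties were resolved.

```python
from collections import Counter


def majority_label(votes):
    """Return the label picked by the most annotators, or None on a tie.

    `votes` is a list of label strings, one per annotator.
    Tie handling is an assumption; the source does not describe it.
    """
    ranked = Counter(votes).most_common()
    if not ranked:
        return None  # no votes at all
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # top two labels have the same count
    return ranked[0][0]


# Five hypothetical annotator votes for one audio file.
example_votes = ["sad", "sad", "fearful", "sad", "angry"]
winner = majority_label(example_votes)
```

With five annotators and seven emotion classes a tie is possible (for example 2-2-1), so a real pipeline would need an explicit tie rule.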

Construction and Validation

The ASVP-ESD contains 5146 audio files (plus an additional 1204 files of babies' voices). It is an emotion-based database containing speech and non-speech emotional sounds. The audio was recorded and collected from movies, TV shows, YouTube channels, and other emotional sound websites. Compared to other public emotional databases, the ASVP-ESD is more realistic and unscripted, with no language restriction.

Description

The Audio, Speech and Vision Processing Lab Emotional Sound database (ASVP-ESD) contains 5146 audio files grouped into 55 folders (total size: 1.05 GB). The data are organized as follows: odd folder numbers are female and even folder numbers are male. Speech and non-speech emotional sounds include neutral (others), happy (laugh, gaggle), sad (cry, sniff, pain), angry, fearful (scream, breath, panic), surprise (amazed), and disgust expressions. The database uses 2 levels of intensity (normal and high). Audio is available in 16 kHz, 1-channel .wav format, and files are typically between 0.5 and 13 seconds long. Note that two additional folders (acteur_150, acteur_50) contain only babies' audio (laugh, cry). Audio-only files are grouped as follows:
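To sanity-check a downloaded file against the stated audio format, the sketch below reads the WAV header with Python's standard wave module. It assumes the stated format means a 16 kHz sample rate with one channel; the helper names wav_info and matches_stated_format are illustrative only.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a WAV file."""
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        duration = wav_file.getnframes() / rate
    return rate, channels, duration


def matches_stated_format(path):
    """True if the file is 16 kHz mono (assumed meaning of the description)."""
    rate, channels, _ = wav_info(path)
    return rate == 16000 and channels == 1
```

The wave module only reads uncompressed PCM WAV files, so a file stored in another encoding would raise wave.Error rather than return False.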

acteur_00 is composed of mixed audio samples from movies and website sounds.
acteur_01 to acteur_19 and acteur_31 to acteur_38 are different actors from 3 different movies.
acteur_21 to acteur_29, acteur_39, and acteur_40 contain only sounds randomly collected from the internet.
acteur_100 to acteur_102 are from the same website (acteur_100 contains crowd or multi-speaker voices); likewise, acteur_103 to acteur_106 are from the same website.
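The folder grouping above can be expressed as a small lookup. This is a sketch under stated assumptions: the function name describe_folder and the group labels are made up for illustration, and the odd/even gender rule is applied mechanically even though it may not be meaningful for the mixed, crowd, and baby folders.

```python
import re


def describe_folder(name):
    """Map a folder name such as 'acteur_12' to (source_group, gender)."""
    match = re.fullmatch(r"acteur_(\d+)", name)
    if match is None:
        raise ValueError(f"unexpected folder name: {name!r}")
    number = int(match.group(1))

    if number in (50, 150):
        group = "babies"
    elif number == 0:
        group = "mixed movie and website samples"
    elif 1 <= number <= 19 or 31 <= number <= 38:
        group = "movie actors"
    elif 21 <= number <= 29 or number in (39, 40):
        group = "random internet sounds"
    elif 100 <= number <= 102:
        group = "website, folders 100-102"
    elif 103 <= number <= 106:
        group = "website, folders 103-106"
    else:
        group = "not listed in the description"

    # Stated rule: even folder numbers are male, odd are female.
    gender = "male" if number % 2 == 0 else "female"
    return group, gender
```

Folder numbers that the description does not mention (for example acteur_20 or acteur_30) fall through to the last branch instead of being guessed.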

File naming convention

Each of the 5146 audio files has a unique filename. The filename consists of a 9-part numerical identifier for speech (e.g., 02-01-06-01-02-105-02-01-02.wav) and 12 for non-speech (e.g., 02-01-06-01-02-105-02-01-02-01-03.wav). These identifiers define the stimulus characteristics:

Filename identifiers

Modality (03 = audio-only).
Vocal channel (01 = speech, 02 = non-speech).
Emotion (02 = others, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
Emotional intensity (01 = normal, 02 = high).
Statement (as the data is unscripted, this refers to the number of the sample selected per acteur folder).
Actor (even-numbered actors are male, odd-numbered actors are female).
Age (01 = above 65, 02 = between 20 and 64, 03 = under 20, 04 = newborn).
Source of downloading (01 = website, 02 = YouTube channel, 03 = movies).
Language (01 = Chinese, 02 = English, 03 = others).

Filename example: 03-01-06-01-02-12-02-01-01-16-04.wav:

1. Audio-only (03)
2. Speech (01)
3. Fearful (06)
4. Normal intensity (01)
5. Statement (02)
6. 12th actor (12): folder 12, male because the number is even
7. Age (02)
8. Source (01)
9. Language (01)
10. Screaming, only for non-speech (16)
11. The 4th sample from the same dialog (04)
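The naming convention can be decoded programmatically. The following is a minimal sketch, assuming the first nine hyphen-separated fields follow the identifier list above; the names FIELDS, LABELS, and parse_name are hypothetical, codes not listed in the description are returned unchanged, and any fields after the ninth are kept as raw strings.

```python
from pathlib import PurePath

# Order of the first nine filename fields, as listed in the description.
FIELDS = ["modality", "vocal_channel", "emotion", "intensity", "statement",
          "actor", "age", "source", "language"]

# Code tables copied from the identifier list; unknown codes stay raw.
LABELS = {
    "modality": {"03": "audio-only"},
    "vocal_channel": {"01": "speech", "02": "non-speech"},
    "emotion": {"02": "others", "03": "happy", "04": "sad", "05": "angry",
                "06": "fearful", "07": "disgust", "08": "surprised"},
    "intensity": {"01": "normal", "02": "high"},
    "age": {"01": "above 65", "02": "20 to 64", "03": "under 20",
            "04": "newborn"},
    "source": {"01": "website", "02": "YouTube channel", "03": "movies"},
    "language": {"01": "Chinese", "02": "English", "03": "others"},
}


def parse_name(filename):
    """Decode an ASVP-ESD style filename into a dict of readable fields."""
    parts = PurePath(filename).stem.split("-")
    if len(parts) < len(FIELDS):
        raise ValueError(f"expected at least 9 fields, got {len(parts)}")
    info = {}
    for field, code in zip(FIELDS, parts):
        info[field] = LABELS.get(field, {}).get(code, code)
    # Stated rule: even actor numbers are male, odd are female.
    info["actor_gender"] = "male" if int(parts[5]) % 2 == 0 else "female"
    # Fields beyond the ninth (non-speech subtype, sample index, ...) stay raw.
    info["extra"] = parts[len(FIELDS):]
    return info


example = parse_name("03-01-06-01-02-12-02-01-01-16-04.wav")
```

For the example filename, example["emotion"] is "fearful", example["actor_gender"] is "male", and example["extra"] is ["16", "04"].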

Any file with 77 at the end of its name has a high-noise environment.

For non-speech data:

Happy (laugh = 13, gaggle = 23, others = 33)
Sad (cry = 14, sigh = 24, sniffle = 34, suffering = 44)
Fear (scream = 16, breath = 26, panic = 36)
Angry (rage = 15, frustration = 25, other = 35)
Surprise (surprised = 18, amazed = 28, astonishment = 38, others = 48)
Disgust (disgust = 17, rejection = 27)
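The non-speech codes above can be kept in a single lookup table. This sketch is illustrative: SUBTYPES, describe_subtype, and is_noisy are hypothetical names, and is_noisy assumes that "77 at the end" means the last hyphen-separated field of the filename is 77.

```python
# Non-speech subtype code -> (emotion, sound type), copied from the list above.
SUBTYPES = {
    "13": ("happy", "laugh"), "23": ("happy", "gaggle"),
    "33": ("happy", "others"),
    "14": ("sad", "cry"), "24": ("sad", "sigh"),
    "34": ("sad", "sniffle"), "44": ("sad", "suffering"),
    "16": ("fearful", "scream"), "26": ("fearful", "breath"),
    "36": ("fearful", "panic"),
    "15": ("angry", "rage"), "25": ("angry", "frustration"),
    "35": ("angry", "other"),
    "18": ("surprised", "surprised"), "28": ("surprised", "amazed"),
    "38": ("surprised", "astonishment"), "48": ("surprised", "others"),
    "17": ("disgust", "disgust"), "27": ("disgust", "rejection"),
}


def describe_subtype(code):
    """Return (emotion, sound type) for a subtype code, or None if unknown."""
    return SUBTYPES.get(code)


def is_noisy(filename):
    """True if the last hyphen-separated field is 77 (assumed noise marker)."""
    stem = filename.rsplit(".", 1)[0]
    return stem.split("-")[-1] == "77"
```

In the codes as listed, the second digit of each subtype matches the last digit of its emotion code (for example, 16 belongs to 06 = fearful), which gives a cheap consistency check between the emotion field and the subtype field.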

Files

ASVP-ESD.rar (735.3 MB), md5:2818c06842cd9f1965fe26074461ffa4
