Published October 4, 2023 | Version 1.0
Dataset Open

Extreme Metal Vocals Dataset (EMVD)

  • 1. ENSLL / LS2N
  • 2. INRIT
  • 3. ENSLL
  • 4. LS2N

Description

 

Created by

Modan Tailleur (1,3), Julien Pinquier (2), Laurent Millot (1), Corsin Vogel (1), Mathieu Lagrange (3)

  1. ENS Louis-Lumière, Saint-Denis, France

  2. IRIT, Université de Toulouse, CNRS, UT3 Toulouse, France

  3. Nantes Université, Ecole Centrale Nantes, CNRS, LS2N, UMR 6004, Nantes, France

 

Publication

If using this data in an academic work, please reference the DOI and version, as well as cite the following paper, which presented the data collection procedure and the first version of the dataset:

@misc{tailleur2024emvddatasetdatasetextreme,
      title={EMVD dataset: a dataset of extreme vocal distortion techniques used in heavy metal}, 
      author={Modan Tailleur and Julien Pinquier and Laurent Millot and Corsin Vogel and Mathieu Lagrange},
      year={2024},
      eprint={2406.17732},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2406.17732}, 
}

 

Description

The Extreme Metal Vocals Dataset (EMVD) is a collection of recordings of extreme vocal techniques used in heavy metal music. The dataset consists of 760 audio excerpts ranging from 1 to 30 seconds in length, totaling about 100 minutes of audio material: roughly 60 minutes of distorted voice and 40 minutes of clear voice. The recordings come from 27 different singers and are provided without accompanying musical instruments or post-processing effects. The distortion taxonomy encompasses four distinct distortion techniques and three vocal effects, performed in different pitch ranges.

 

How to use

For an example of how to use this dataset in deep learning applications, see the companion repository: https://github.com/modantailleur/ExtremeMetalVocalsDataset

 

Label Taxonomy

The label taxonomy is as follows (see our paper for further details):

Techniques:

  • Clear Voice: high, mid, low
  • Black Shriek: high, mid
  • Death Growl: mid, low
  • Hardcore Scream: high, mid, low
  • Grind Inhale

Effects:

  • Pig Squeal
  • Deep Gutturals
  • Tunnel Throat
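
As a rough illustration, the taxonomy above can be written as a small Python structure from which a flat list of class labels is derived. This is only a sketch: the label string format ("Name (Range)") and the choice to treat each effect as its own class are assumptions made for this example, not conventions taken from the dataset files or the companion repository.

```python
# Techniques and the registers in which they were recorded, as listed above.
# An empty list means the technique was recorded in a single register.
TECHNIQUES = {
    "Clear Voice": ["High", "Mid", "Low"],
    "Black Shriek": ["High", "Mid"],
    "Death Growl": ["Mid", "Low"],
    "Hardcore Scream": ["High", "Mid", "Low"],
    "Grind Inhale": [],
}
EFFECTS = ["Pig Squeal", "Deep Gutturals", "Tunnel Throat"]


def class_labels():
    """Build a flat label list (the label format is hypothetical)."""
    labels = []
    for name, ranges in TECHNIQUES.items():
        if ranges:
            labels.extend(f"{name} ({r})" for r in ranges)
        else:
            labels.append(name)
    labels.extend(EFFECTS)
    return labels


labels = class_labels()
print(len(labels), labels[:3])
```

Under these assumptions the taxonomy yields 14 labels: 11 technique and register combinations plus 3 effects.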

 

Recording procedure

For the recording sessions, a mobile setup was selected to accommodate as many singers as possible. An SM58 microphone was used because it is commonly used by metal singers during live performances. Closed-back headphones served for music playback and let the singers monitor their own voice if they wished to hear it during recording. A Focusrite Scarlett 6i6 audio interface connected the laptop, microphone, and headphones.

In some cases, singers were recorded remotely using their own equipment (a stage microphone and an audio interface), which is documented in the database. These singers were provided with a video tutorial and explanatory documents to facilitate their participation in the project, with the main author guiding them remotely. Each singer was instructed to sustain three vowels ([a] as in "cat", [i] as in "ship", and [u] as in "book") for five seconds each. They were required to maintain a consistent pitch not only within each vowel but also across all vowels produced. After this, they were asked to perform for approximately 15 seconds using the same vocal technique, this time with lyrics of their choosing. The lyrics had to remain the same across all technique categories. Each vocal technique was recorded in several registers (high, mid, and low) depending on their relevance to the specific technique. The Grind Inhale technique, although producible in multiple registers, was recorded in only one register, as many singers deemed it potentially harmful to their voice. A musical loop was played in the singers' headphones during each recording.

 

Grading system

Each vocalist rated their own comfort level with each vocal technique in each vocal register on a scale from 0 to 5. A rank of 0 signifies that they never use the technique and are not comfortable enough to produce it, which results in missing data in the dataset. A rank of 3 indicates occasional use, and a rank of 5 signifies that they use it in every live performance.

The dataset also provides supplementary information on the singers' practices. This includes the typical microphone-to-mouth distance used by each vocalist during recording, as well as their professional status as a singer. The majority of the recordings were conducted onsite, at a location chosen by the vocalist (their home or a professional studio), using equipment provided by the authors. However, some recordings were made by the vocalists themselves, using their own microphones and audio interfaces. In such instances, the authors remotely guided the recording process to ensure consistency and quality. Detailed equipment specifications have been documented.

As the authors noticed that the singers' self-evaluation ranking was not very effective, the main author graded each individual audio file on a scale from 0 to 2. A grade of 2 indicates that the recording closely represents the intended vocal technique, 1 indicates that it moderately represents the technique, and 0 signifies that it does not adequately represent the technique. Audio files graded 0 should not be used for deep learning applications, but they are retained in the dataset in case future re-evaluation of the audio files is desired. Approximately 70% of the dataset's audio files received a grade of 2 or 1 from the authors and are thus suitable for use in diverse applications.

 

metadata_files.csv

file_name : the name of the audio file

singer_id : the id of each singer (from 1 to 27)

type : whether the distortion employed is a technique, an effect, or a distortion that doesn’t fit any specific category

name : the name of the technique or of the effect employed by the singer (‘-’ if it doesn’t fit in any category)

range : the range employed by the singer (‘High’, ‘Mid’, or ‘Low’)

vowel : the vowel employed by the singer: ‘a’ for vowel [a] as in "cat", ‘i’ for vowel [i] as in "ship", and ‘u’ for vowel [u] as in "book"

authors_rank : the rank given by the authors (2, 1 or 0)

duration(s) : duration (in seconds) of the audio file
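
A minimal sketch of how these fields might be used to keep only files graded 1 or 2 by the authors, as recommended in the grading section. The sample rows and their values are invented for illustration; only the column names come from the field list above. The sketch also assumes a comma-separated file in which `authors_rank` is always an integer.

```python
import csv
import io

# Hypothetical rows for illustration only; the real data is in metadata_files.csv.
SAMPLE = """file_name,singer_id,type,name,range,vowel,authors_rank,duration(s)
a.wav,1,Technique,DeathGrowl,Low,a,2,5.0
b.wav,1,Technique,DeathGrowl,Mid,i,0,4.8
c.wav,2,Effect,PigSqueal,-,u,1,5.1
"""


def usable_rows(csv_file, min_rank=1):
    """Return the rows whose authors_rank is at least min_rank."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if int(row["authors_rank"]) >= min_rank]


rows = usable_rows(io.StringIO(SAMPLE))
total_seconds = sum(float(row["duration(s)"]) for row in rows)
print([row["file_name"] for row in rows], total_seconds)
```

To run this on the real file, pass an open file handle instead of the in-memory sample, for example `open("metadata_files.csv", newline="")`.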

 

metadata_singers.csv

singer_id : the id of each singer (from 1 to 27)

gender : the gender of the singer (‘M’ if male, ‘F’ if female)

status : whether the singer is professional or non-professional (‘Professional’ or ‘Non-professional’)

recording : whether the recording was made onsite with the authors' equipment or guided remotely (‘Onsite’ or ‘Guided’)

distance_to_microphone(cm) : the distance chosen by the singer to the microphone (in centimeters)

microphone : model of microphone that was used for the recording

audio_interface : audio interface used for the recordings

DAW : Digital Audio Workstation (DAW) used for recording the singer (e.g., ProTools, Reaper)

ClearVoice_High, …, TunnelThroat : the singer’s rank (from 0 to 5) from their self-evaluation on each technique performed in each range.
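
The two metadata files share the `singer_id` field, so per-singer information can be attached to each audio file. The sketch below selects the files recorded onsite; the sample rows are invented, and the helper name `files_by_recording` is hypothetical.

```python
import csv
import io

# Hypothetical excerpts of the two metadata files (values invented).
FILES = """file_name,singer_id
a.wav,1
b.wav,2
c.wav,1
"""
SINGERS = """singer_id,gender,status,recording
1,M,Professional,Onsite
2,F,Non-professional,Guided
"""


def files_by_recording(files_csv, singers_csv, recording="Onsite"):
    """List file names whose singer has the given 'recording' value."""
    # Index the singer rows by singer_id for constant-time lookup.
    singers = {row["singer_id"]: row for row in csv.DictReader(singers_csv)}
    return [
        row["file_name"]
        for row in csv.DictReader(files_csv)
        if singers[row["singer_id"]]["recording"] == recording
    ]


onsite = files_by_recording(io.StringIO(FILES), io.StringIO(SINGERS))
print(onsite)
```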
 

split_kfolds.csv

For deep learning applications, a 4-fold cross-validation split was generated and stored in the ‘split_kfolds.csv’ file, with 20% of the training data reserved for validation.

file_name : the name of the audio file

split0, …, split3 : for each split, whether the file belongs to the train subset (‘train’), the evaluation subset (‘eval’), or the validation subset (‘valid’), or is not used for training (‘-’)
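
A minimal sketch of how the subsets for one fold could be read from this file. The sample rows are invented; the column names and subset labels follow the description above, and the helper name `files_for_fold` is hypothetical. A comma-separated file is assumed.

```python
import csv
import io

# Hypothetical rows for illustration only; the real data is in split_kfolds.csv.
SAMPLE = """file_name,split0,split1,split2,split3
a.wav,train,eval,train,valid
b.wav,valid,train,eval,train
c.wav,eval,train,train,train
d.wav,-,-,-,-
"""


def files_for_fold(csv_file, fold):
    """Group file names by subset for the given fold index (0 to 3)."""
    column = f"split{fold}"
    subsets = {"train": [], "valid": [], "eval": []}
    for row in csv.DictReader(csv_file):
        label = row[column]
        if label in subsets:  # rows marked '-' are skipped
            subsets[label].append(row["file_name"])
    return subsets


fold0 = files_for_fold(io.StringIO(SAMPLE), 0)
print(fold0)
```

Files marked ‘-’ are left out of every subset, which matches the description of that marker above.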

 

Feedback

Please help us improve EMVD by sending your feedback to:

In case of a problem, please include as many details as possible.

 

Acknowledgments

We want to thank Oriol Nieto, Geoffroy Peeters, Christophe d'Alessandro and Boris Doval for fruitful discussions. We particularly want to thank Joshua Smith for his guidance on the design of the taxonomy. We also want to thank the 27 singers for bringing this dataset to life.

Files

Files (1.0 GB)

Size      Checksum
1.0 GB    md5:179f9d3aca33d1f4fb6b3d3c47192e73
87.1 kB   md5:322685ad3df2a66eec581bc2bea8c1a0
2.6 kB    md5:cacc9a39929f96855779c5361529cc8a
54.7 kB   md5:b8f964968b5ff8d40a88f7d7e5f3a6c8