Extreme Metal Vocals Dataset (EMVD)
Version 1.1, December 2025
Publication
If you use this data in academic work, please reference the DOI and version, and cite the following paper, which presents the data collection procedure and the first version of the dataset:
@misc{tailleur2024emvddatasetdatasetextreme,
title={EMVD dataset: a dataset of extreme vocal distortion techniques used in heavy metal},
author={Modan Tailleur and Julien Pinquier and Laurent Millot and Corsin Vogel and Mathieu Lagrange},
year={2024},
eprint={2406.17732},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2406.17732},
}
Description
The Extreme Metal Vocals Dataset (EMVD) is a collection of recordings of extreme vocal techniques used in heavy metal music. The dataset consists of 760 audio excerpts, each 1 to 30 seconds long, totaling about 100 minutes of audio: roughly 60 minutes of distorted voice and 40 minutes of clear voice. The recordings come from 27 different singers and are provided without accompanying musical instruments or post-processing effects. The distortion taxonomy of this dataset covers four distinct distortion techniques and three vocal effects, all performed in different pitch ranges.
How to use
For an example of how to use this dataset in deep learning applications, see the companion repository: https://github.com/modantailleur/ExtremeMetalVocalsDataset
Label Taxonomy
The label taxonomy is as follows (see our paper for further details):
Techniques:
- Clear Voice: high, mid, low
- Black Shriek: high, mid
- Death Growl: mid, low
- Hardcore Scream: high, mid, low
- Grind Inhale
Effects:
- Pig Squeal
- Deep Gutturals
- Tunnel Throat
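For use in code, the taxonomy above can be written down as a small data structure. The following is a minimal sketch, not part of the official companion code: the variable names and the "Technique (register)" label format are assumptions made for illustration, and the label strings used in the actual CSV files may differ.

```python
# Taxonomy copied from the list above. Grind Inhale and the three effects
# have no register listed there, so they carry an empty register tuple.
TECHNIQUES = {
    "Clear Voice": ("high", "mid", "low"),
    "Black Shriek": ("high", "mid"),
    "Death Growl": ("mid", "low"),
    "Hardcore Scream": ("high", "mid", "low"),
    "Grind Inhale": (),
}
EFFECTS = ("Pig Squeal", "Deep Gutturals", "Tunnel Throat")


def technique_labels():
    """Return one label per listed technique/register combination.

    The 'Name (register)' format is a hypothetical convention, chosen
    only for this example.
    """
    labels = []
    for name, registers in TECHNIQUES.items():
        if registers:
            labels.extend(f"{name} ({register})" for register in registers)
        else:
            labels.append(name)
    return labels


labels = technique_labels()
print(len(labels))  # 11 technique/register combinations
print(labels[:3])   # ['Clear Voice (high)', 'Clear Voice (mid)', 'Clear Voice (low)']
```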
Recording procedure
For the recording sessions, a mobile setup was chosen to accommodate as many singers as possible. An SM58 microphone was used because it is commonly used by metal singers during live performances. Closed-back headphones were used for music playback and, if the singers wished, for monitoring their own voice during recording. A Focusrite Scarlett 6i6 audio interface connected the laptop, microphone, and headphones.
In some cases, singers were recorded remotely using their own equipment (a stage microphone and an audio interface), which is documented in the database. These singers were provided with a video tutorial and explanatory documents, and the main author guided them remotely.
Each singer was instructed to sustain three vowels ([a] as in "cat", [i] as in "ship", and [u] as in "book") for five seconds each. They were required to maintain a consistent pitch both within each vowel and across all three vowels. They were then asked to perform for approximately 15 seconds using the same vocal technique, this time with lyrics of their choosing. The lyrics had to remain the same across all technique categories.
Each vocal technique was recorded in several registers (high, mid, and low), depending on their relevance to the specific technique. The Grind Inhale technique, although producible in multiple registers, was recorded in only one register, as many singers considered it potentially harmful to their voice. A musical loop was played in the singers' headphones during each recording.
Grading system
Each vocalist rated their own comfort level with each vocal technique in each register on a scale from 0 to 5. A rank of 0 means that they never use the technique and are not comfortable enough to produce it, which results in missing data in the dataset. A rank of 3 indicates occasional use, and a rank of 5 means that they use the technique in every live performance.
The dataset also provides supplementary information on the singers' practices, including the typical microphone-to-mouth distance used by each vocalist during recording and their professional status as a singer. Most recordings were made onsite, at a location chosen by the vocalist (their home or a professional studio), using equipment provided by the authors. Some recordings were made by the vocalists themselves with their own microphones and audio interfaces; in these cases, the authors guided the recording process remotely to ensure consistency and quality. Detailed equipment specifications have been documented.
Because the authors found that the singers' self-evaluation ranking was not very effective, the main author graded each individual audio file on a scale from 0 to 2. A grade of 2 means that the recording closely represents the intended vocal technique, 1 that it moderately represents the technique, and 0 that it does not adequately represent the technique. Audio files graded 0 should not be used for deep learning applications, but they are kept in the dataset in case the files are re-evaluated in the future. Approximately 70% of the dataset's audio files received a grade of 2 or 1 from the authors and are thus suitable for use in a range of applications.
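The recommendation to discard files graded 0 can be applied with a simple filter on the authors_rank column of metadata_files.csv. This is a minimal sketch under two assumptions: that the CSV is comma-delimited, and that authors_rank is stored as an integer. The sample rows and file names are hypothetical and only stand in for the real file.

```python
import csv
import io

# Hypothetical rows for illustration only; real file names and values differ.
SAMPLE_METADATA = """file_name,singer_id,authors_rank
example_a.wav,1,2
example_b.wav,1,0
example_c.wav,2,1
"""


def usable_files(csv_file, min_rank=1):
    """Return the file names whose authors_rank is at least min_rank."""
    reader = csv.DictReader(csv_file)
    return [
        row["file_name"]
        for row in reader
        if int(row["authors_rank"]) >= min_rank
    ]


kept = usable_files(io.StringIO(SAMPLE_METADATA))
print(kept)  # ['example_a.wav', 'example_c.wav']
```

With the real dataset, the same function can be given an open file handle, for example `open("metadata_files.csv", newline="")`. Setting `min_rank=2` keeps only the recordings that closely represent the intended technique.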
Speech transcription
A subset of the EMVD audio files contains short lyrical phrases, performed by the singers during the 15-second "lyrics" segment recorded for each vocal technique and register. As the EMVD dataset was not originally intended for speech-recognition tasks, the lyrics used by the singers were not documented during recording.
To facilitate research on the robustness of speech recognition systems to extreme vocal distortion, a set of manual speech transcriptions has been produced. The transcriptions were written independently by two annotators (co-authors of the EMVD dataset), who listened to each audio file and transcribed the lyrics as they perceived them. Although the singers were instructed to use the same lyrics across all techniques and registers, the annotators observed that the lyrics often varied from one recording to another, usually by a few words. For this reason, transcriptions are provided per audio file rather than per singer. For each audio file, both annotators' transcriptions are provided. In most cases, the two transcriptions are very close or identical.
Audio transcription exceptions are as follows:
- Singer 21 is excluded because they did not perform clear-voice recordings and the strong variability in their lyrics made transcription unreliable.
- Singer 17 is included, but their recordings were particularly difficult to transcribe. These transcriptions should therefore be interpreted carefully.
Users of the transcription set should be aware that the content is not guaranteed to be perfectly accurate, as extreme metal vocal techniques often result in low intelligibility. Nevertheless, the use of two independent annotators, combined with access to clear-voice recordings for most transcriptions, results in fairly reliable transcripts that are a practical resource for research.
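Since two transcriptions are available for each file, one way to spot uncertain transcripts is to measure how closely the two annotators agree. The sketch below uses a word-level similarity ratio from Python's standard difflib module. This metric is an illustrative choice, not the measure used by the dataset authors, and the example lyrics are invented.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case the text, replace punctuation with spaces, split into words."""
    cleaned = "".join(
        ch.lower() if ch.isalnum() or ch.isspace() else " " for ch in text
    )
    return cleaned.split()


def agreement(transcript1, transcript2):
    """Word-level similarity in [0, 1] between two transcriptions."""
    words1, words2 = normalize(transcript1), normalize(transcript2)
    if not words1 and not words2:
        return 1.0  # two empty transcripts are treated as identical
    return SequenceMatcher(None, words1, words2).ratio()


# Invented lyrics, for illustration only.
print(agreement("We are the dead", "we are the dead!"))  # 1.0
print(agreement("we are the dead", "we are the dread"))  # 0.75
```

Files with a low ratio can then be reviewed by hand or excluded from a speech-recognition evaluation, depending on the application.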
metadata_files.csv
file_name : the name of the audio file
singer_id : the id of each singer (from 1 to 27)
type : whether the distortion employed is a technique, an effect, or a distortion that doesn’t fit any specific category
name : the name of the technique or of the effect employed by the singer (‘-’ if it doesn’t fit in any category)
range : the range employed by the singer (‘High’, ‘Mid’, or ‘Low’)
vowel : the vowel employed by the singer (‘a’ for [a] as in "cat", ‘i’ for [i] as in "ship", ‘u’ for [u] as in "book")
authors_rank : the rank given by the authors (2, 1 or 0)
duration(s) : duration (in seconds) of the audio file
transcript1 : transcription of the lyrics by annotator1
transcript2 : transcription of the lyrics by annotator2
lan : the language in which the lyrics are sung (‘en’ for English, ‘fr’ for French)
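As an example of working with these columns, the sketch below sums the duration(s) column per technique or effect name. It assumes a comma-delimited CSV with the header names listed above. The sample rows, file names, and the exact spelling of the name values are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows for illustration only; the real values may be spelled differently.
SAMPLE_FILES = """file_name,name,range,duration(s)
example_a.wav,BlackShriek,High,5.0
example_b.wav,BlackShriek,Mid,15.5
example_c.wav,ClearVoice,Low,4.5
"""


def duration_by_name(csv_file):
    """Return the total duration in seconds for each value of the name column."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals[row["name"]] += float(row["duration(s)"])
    return dict(totals)


totals = duration_by_name(io.StringIO(SAMPLE_FILES))
print(totals)  # {'BlackShriek': 20.5, 'ClearVoice': 4.5}
```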
metadata_singers.csv
singer_id : the id of each singer (from 1 to 27)
gender : the gender of the singer (‘M’ if male, ‘F’ if female)
status : whether the singer is professional or non-professional (‘Professional’ or ‘Non-professional’)
recording : whether the recording was made onsite with the authors' equipment or guided remotely (‘Onsite’ or ‘Guided’)
distance_to_microphone(cm) : the distance chosen by the singer to the microphone (in centimeters)
microphone : model of microphone that was used for the recording
audio_interface : audio interface used for the recordings
DAW : Digital Audio Workstation (DAW) used for recording the singer (e.g., ProTools, Reaper)
ClearVoice_High, …, TunnelThroat : the singer's rank (from 0 to 5) from their self-evaluation on each technique performed in each range.
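The two metadata files share the singer_id column, so file-level rows can be filtered by singer-level attributes. The following is a minimal sketch, assuming comma-delimited CSVs and the status and recording values described above. The function name, sample rows, and file names are hypothetical.

```python
import csv
import io

# Hypothetical rows for illustration only.
SAMPLE_FILES = """file_name,singer_id
example_x.wav,1
example_y.wav,2
example_z.wav,1
"""
SAMPLE_SINGERS = """singer_id,status,recording
1,Professional,Onsite
2,Non-professional,Guided
"""


def files_for_singers(files_csv, singers_csv, status=None, recording=None):
    """Return file names whose singer matches the given status and recording."""
    singers = {row["singer_id"]: row for row in csv.DictReader(singers_csv)}
    selected = []
    for row in csv.DictReader(files_csv):
        singer = singers.get(row["singer_id"])
        if singer is None:
            continue  # no singer metadata for this file
        if status is not None and singer["status"] != status:
            continue
        if recording is not None and singer["recording"] != recording:
            continue
        selected.append(row["file_name"])
    return selected


professional = files_for_singers(
    io.StringIO(SAMPLE_FILES), io.StringIO(SAMPLE_SINGERS), status="Professional"
)
print(professional)  # ['example_x.wav', 'example_z.wav']
```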
split_kfolds.csv
For deep learning applications, a 4-fold cross-validation split is provided in the ‘split_kfolds.csv’ file, with 20% of the training data reserved for validation.
file_name : the name of the audio file
split0, …, split3 : for each split, whether the file belongs to the train subset (‘train’), the evaluation subset (‘eval’), the validation subset (‘valid’), or is not used for training (‘-’)
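The split columns can be turned into per-subset file lists for a given fold. This is a minimal sketch, assuming a comma-delimited CSV with the column names and labels described above. The sample rows and file names are hypothetical, and the official loading code is in the companion repository.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows for illustration only.
SAMPLE_SPLITS = """file_name,split0,split1,split2,split3
example_a.wav,train,eval,train,valid
example_b.wav,eval,train,valid,train
example_c.wav,valid,train,train,eval
example_d.wav,-,-,-,-
"""


def subsets_for_fold(csv_file, fold):
    """Map each subset label ('train', 'eval', 'valid') to its file names."""
    column = f"split{fold}"
    subsets = defaultdict(list)
    for row in csv.DictReader(csv_file):
        label = row[column]
        if label != "-":  # '-' marks files that are not used in this split
            subsets[label].append(row["file_name"])
    return dict(subsets)


fold0 = subsets_for_fold(io.StringIO(SAMPLE_SPLITS), 0)
print(fold0["train"])  # ['example_a.wav']
print(fold0["eval"])   # ['example_b.wav']
print(fold0["valid"])  # ['example_c.wav']
```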
Feedback
Please help us improve EMVD by sending your feedback to:
- Modan Tailleur: modan.tailleur@gmail.com
In case of a problem, please include as many details as possible.
Acknowledgments
We want to thank Oriol Nieto, Geoffroy Peeters, Christophe d'Alessandro and Boris Doval for fruitful discussions. We particularly want to thank Joshua Smith for his guidance on the design of the taxonomy. We also want to thank the 27 singers for bringing this dataset to life.
Changelog
- 1.1 : Introduction of manual speech transcriptions for the lyrical recordings