AITW: The Annotated In-the-Wild Dataset for Filtering of In-the-Wild Speech Data
Description
The Annotated In-The-Wild (AITW) Dataset for Filtering of In-the-Wild Speech Data (v1.0 - Superseded)
! PLEASE USE THE MOST RECENT VERSION OF THIS DATASET FOUND HERE: https://zenodo.org/records/16937418
The Annotated In-The-Wild (AITW) dataset accompanies the paper “Whilter: A Whisper-based Data Filter for ‘in-the-wild’ Speech Corpora Using Utterance-level Multi-Task Classification”, accepted at Interspeech 2025 in Rotterdam. This dataset supports research into automated filtering of noisy or undesirable audio segments in large-scale, real-world speech corpora, particularly for training high-quality English TTS and ASR models.
AITW includes over 21,000 manually labeled audio samples (≈64 hours) drawn from two popular in-the-wild speech datasets, Emilia and YODAS. Each clip is annotated at the utterance level with binary or numerical labels for five key properties; a usage sketch follows the label list below.
Numerical labels:
- Speaker count
Binary labels:
- Non-English (foreign) language
- Background music
- Noisy or poor-quality speech
- Synthetic (spoofed) speech
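As a rough sketch of how these labels can drive filtering, the snippet below loads the label metadata and keeps only single-speaker clips with none of the four binary flags raised. The file name and column names (`speaker_count`, `foreign`, `music`, `noisy`, `spoof`) are illustrative assumptions, not the actual schema; check the CSV headers shipped with the release.

```python
import pandas as pd

# Load the AITW label metadata. File and column names are assumed
# for illustration; consult the CSV headers in the actual release.
labels = pd.read_csv("aitw_labels.csv")

# Keep clips suitable for TTS/ASR training: exactly one speaker and
# none of the four binary quality flags raised.
clean = labels[
    (labels["speaker_count"] == 1)
    & (labels["foreign"] == 0)
    & (labels["music"] == 0)
    & (labels["noisy"] == 0)
    & (labels["spoof"] == 0)
]
print(f"{len(clean)} of {len(labels)} clips pass the filter")
```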
Annotations were performed by expert annotators using a custom Label Studio interface, with consistent guidelines applied across all tasks. This dataset enables the benchmarking of multi-task classification models like Whilter and comparison with single-task baselines.
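One plausible way to benchmark a filter model against AITW is to score each task separately, for instance with per-task F1. The sketch below uses toy arrays in place of real annotations and model outputs; the task names mirror the binary labels above, and the numeric speaker-count task would be scored separately (e.g. as accuracy).

```python
from sklearn.metrics import f1_score

# Gold labels and model predictions per binary task (toy example;
# in practice these come from AITW annotations and a model's outputs).
tasks = {
    "foreign": ([0, 1, 0, 0], [0, 1, 1, 0]),
    "music":   ([1, 0, 0, 1], [1, 0, 0, 0]),
    "noisy":   ([0, 0, 1, 0], [0, 0, 1, 0]),
    "spoof":   ([0, 0, 0, 1], [0, 1, 0, 1]),
}

for name, (gold, pred) in tasks.items():
    print(f"{name}: F1 = {f1_score(gold, pred):.2f}")
```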
AITW is designed to foster further research in scalable speech data curation and low-resource dataset bootstrapping. We encourage contributions and improvements via the included Label Studio interface configuration.
Files include:
- Labeled audio metadata, along with file maps that link each sample back to its source data in YODAS and Emilia (.csv or .json); a resolution sketch follows this list
- Interface config for Label Studio (.xml)
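Below is a hypothetical sketch of resolving a labeled clip back to its source corpus via the file maps. The file names and keys (`clip_id`, `source`, `path`) are placeholders rather than the actual schema, which may also be distributed as CSV.

```python
import json

import pandas as pd

# Label metadata and the file map that links each clip back to its
# source corpus (names and keys here are illustrative assumptions).
labels = pd.read_csv("aitw_labels.csv")
with open("file_map.json") as f:
    # Assumed shape: {"clip_0001": {"source": "yodas", "path": "..."}}
    file_map = json.load(f)

# Resolve the first few labeled clips to their original locations.
for clip_id in labels["clip_id"].head(5):
    entry = file_map[clip_id]
    print(clip_id, "->", entry["source"], entry["path"])
```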
If you use this dataset, please cite:
W. Ravenscroft, G. Close, K. Bower-Morris, J. Stacey, D. Sityaev, K. Hong. “Whilter: A Whisper-based Data Filter for ‘in-the-wild’ Speech Corpora Using Utterance-level Multi-Task Classification,” Interspeech 2025.
License:
Creative Commons Attribution 4.0 International (CC BY 4.0)
Additional details
Dates
- Accepted: 2025-05-19 (Interspeech 2025)
Software
- Development Status: Active
References
- X. Li, S. Takamichi, T. Saeki, W. Chen, S. Shiota and S. Watanabe, "YODAS: Youtube-Oriented Dataset for Audio and Speech," 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, 2023, pp. 1-8, doi: 10.1109/ASRU57964.2023.10389689.
- H. He et al., "Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation," 2024 IEEE Spoken Language Technology Workshop (SLT), Macao, 2024, pp. 885-890, doi: 10.1109/SLT61566.2024.10832365.