AITW: The Annotated In-the-Wild Dataset for Filtering of In-the-Wild Speech Data
Authors/Creators
- ConnexAI
Description
The Annotated In-The-Wild (AITW) Dataset for Filtering of In-the-Wild Speech Data (v1.1)
Version 1.1: Some missing entries in the Emilia file map have been updated and a number of YODAS archive URLs have been corrected.
The Annotated In-The-Wild (AITW) dataset accompanies the paper “Whilter: A Whisper-based Data Filter for ‘in-the-wild’ Speech Corpora Using Utterance-level Multi-Task Classification”, accepted at Interspeech 2025 in Rotterdam. This dataset supports research into automated filtering of noisy or undesirable audio segments in large-scale, real-world speech corpora, particularly for training high-quality English TTS and ASR models.
AITW includes over 21,000 manually labeled audio samples (≈64 hours) from two popular in-the-wild speech datasets (Emilia and YODAS). Each audio clip is annotated at the utterance level with binary or numerical labels for five key properties.
Numerical labels:
- Speaker count
Binary labels:
- Non-English (foreign) language
- Background music
- Noisy or poor-quality speech
- Synthetic (spoofed) speech
Annotations were performed by expert annotators using a custom Label Studio interface, with consistent guidelines applied across all tasks. This dataset enables the benchmarking of multi-task classification models like Whilter and comparison with single-task baselines.
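As a minimal sketch of how the utterance-level labels might be used for filtering, the snippet below reads AITW-style metadata and keeps only single-speaker, English, clean, non-synthetic utterances. The column names (`speaker_count`, `foreign_language`, `background_music`, `noisy_speech`, `synthetic_speech`) are assumptions for illustration only; check the headers of the released CSV files for the actual schema.

```python
import csv
import io

# Toy stand-in for the AITW metadata CSV; the real files ship with the
# dataset. Column names here are assumed, not the dataset's actual schema.
sample_csv = """utterance_id,speaker_count,foreign_language,background_music,noisy_speech,synthetic_speech
utt_001,1,0,0,0,0
utt_002,2,0,1,0,0
utt_003,1,1,0,0,0
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

def is_clean(row):
    """Keep single-speaker utterances with no binary flag set."""
    return int(row["speaker_count"]) == 1 and all(
        row[k] == "0"
        for k in ("foreign_language", "background_music",
                  "noisy_speech", "synthetic_speech")
    )

clean = [r["utterance_id"] for r in rows if is_clean(r)]
print(clean)  # only utt_001 passes every filter
```

The same predicate can serve as a ground-truth reference when benchmarking a multi-task classifier such as Whilter against single-task baselines.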
AITW is designed to foster further research in scalable speech data curation and low-resource dataset bootstrapping. We encourage contributions and improvements through the included Label Studio GUI.
Files include:
- Labeled audio metadata, plus file maps linking each entry back to the source data in YODAS and Emilia (.csv or .json)
- Interface config for Label Studio (.xml)
If you use this dataset, please cite:
W. Ravenscroft, G. Close, K. Bower-Morris, J. Stacey, D. Sityaev, K. Hong. “Whilter: A Whisper-based Data Filter for ‘in-the-wild’ Speech Corpora Using Utterance-level Multi-Task Classification,” Interspeech 2025.
License:
Creative Commons Attribution 4.0 International (CC BY 4.0)
Files (4.4 MB total; includes emilia_file_map.csv)

| Name | MD5 | Size |
|---|---|---|
| | md5:9bc8c456b92f29c9412278fbd5cc081d | 1.2 MB |
| | md5:3be4b058df8a01e111126d35c4b79e1f | 1.3 MB |
| | md5:348936f0957eb978a863fc86aba1ec39 | 1.5 kB |
| | md5:7d5f855091f9a33726ed31361a63a62e | 83.8 kB |
| | md5:b03d03adba58c4f498744bef5c405d0e | 62.5 kB |
| | md5:0b3a98e0ec014ee2d264d42d18b42f29 | 1.8 MB |
Additional details
Dates
- Accepted: 2025-05-19 (Interspeech 2025)
Software
- Development Status
- Active
References
- X. Li, S. Takamichi, T. Saeki, W. Chen, S. Shiota and S. Watanabe, "YODAS: YouTube-Oriented Dataset for Audio and Speech," 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, 2023, pp. 1-8, doi: 10.1109/ASRU57964.2023.10389689.
- H. He et al., "Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation," 2024 IEEE Spoken Language Technology Workshop (SLT), Macao, 2024, pp. 885-890, doi: 10.1109/SLT61566.2024.10832365.