Published August 17, 2025 | Version 1.1

AITW: The Annotated In-the-Wild Dataset for Filtering of In-the-Wild Speech Data

Description


Version 1.1: Missing entries in the Emilia file map have been added, and several YODAS archive URLs have been corrected.

The Annotated In-The-Wild (AITW) dataset accompanies the paper “Whilter: A Whisper-based Data Filter for ‘in-the-wild’ Speech Corpora Using Utterance-level Multi-Task Classification”, accepted at Interspeech 2025 in Rotterdam. This dataset supports research into automated filtering of noisy or undesirable audio segments in large-scale, real-world speech corpora, particularly for training high-quality English TTS and ASR models.

AITW includes over 21,000 manually labeled audio samples (≈64 hours) from two popular in-the-wild speech datasets (Emilia and YODAS). Each audio clip is annotated at the utterance level with binary or numerical labels for five key properties.

Numerical labels:

  • Speaker count

Binary labels:

  • Non-English (foreign) language
  • Background music
  • Noisy or poor-quality speech
  • Synthetic (spoofed) speech

Annotations were performed by expert annotators using a custom Label Studio interface, with consistent guidelines applied across all tasks. This dataset enables the benchmarking of multi-task classification models like Whilter and comparison with single-task baselines.

AITW is designed to foster further research in scalable speech data curation and low-resource dataset bootstrapping. We encourage contributions and improvements through the included Label Studio GUI.

Files include:

  • Labeled audio metadata, along with file maps linking each entry back to the source data in YODAS and Emilia (.csv or .json)
  • Interface config for Label Studio (.xml)
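As a minimal sketch of how the utterance-level labels might be used for filtering, the snippet below reads metadata rows and keeps only single-speaker utterances with none of the four binary flags set. The column names (`utterance_id`, `speaker_count`, `foreign_language`, `background_music`, `noisy_speech`, `synthetic_speech`) are illustrative assumptions, not the actual AITW schema; consult the released CSV headers for the real field names.

```python
import csv
import io

# Hypothetical metadata rows; the real AITW CSVs may use different
# column names -- treat this sample as illustrative only.
SAMPLE_CSV = """utterance_id,speaker_count,foreign_language,background_music,noisy_speech,synthetic_speech
emilia_000001,1,0,0,0,0
emilia_000002,2,0,1,0,0
yodas_000003,1,1,0,0,0
"""

BINARY_FLAGS = [
    "foreign_language",
    "background_music",
    "noisy_speech",
    "synthetic_speech",
]

def clean_utterances(csv_text: str) -> list[str]:
    """Keep single-speaker utterances with no undesirable flags set."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["utterance_id"]
        for row in reader
        if int(row["speaker_count"]) == 1
        and all(int(row[flag]) == 0 for flag in BINARY_FLAGS)
    ]

print(clean_utterances(SAMPLE_CSV))  # only emilia_000001 passes the filter
```

The same row-level predicate could be applied to the real metadata files to assemble a cleaned subset for TTS or ASR training.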

If you use this dataset, please cite:

W. Ravenscroft, G. Close, K. Bower-Morris, J. Stacey, D. Sityaev, K. Hong. “Whilter: A Whisper-based Data Filter for ‘in-the-wild’ Speech Corpora Using Utterance-level Multi-Task Classification,” Interspeech 2025.

License:

Creative Commons Attribution 4.0 International (CC BY 4.0)


Files (4.4 MB)

emilia_file_map.csv

  • md5:9bc8c456b92f29c9412278fbd5cc081d (1.2 MB)
  • md5:3be4b058df8a01e111126d35c4b79e1f (1.3 MB)
  • md5:348936f0957eb978a863fc86aba1ec39 (1.5 kB)
  • md5:7d5f855091f9a33726ed31361a63a62e (83.8 kB)
  • md5:b03d03adba58c4f498744bef5c405d0e (62.5 kB)
  • md5:0b3a98e0ec014ee2d264d42d18b42f29 (1.8 MB)

Additional details

Dates

Accepted
2025-05-19
Interspeech 2025

Software

Development Status
Active

References

  • X. Li, S. Takamichi, T. Saeki, W. Chen, S. Shiota, and S. Watanabe, "YODAS: YouTube-Oriented Dataset for Audio and Speech," 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, 2023, pp. 1-8, doi: 10.1109/ASRU57964.2023.10389689.
  • H. He et al., "Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation," 2024 IEEE Spoken Language Technology Workshop (SLT), Macao, 2024, pp. 885-890, doi: 10.1109/SLT61566.2024.10832365.