
Published May 28, 2025 | Version 1.0
Dataset Restricted

AITW: The Annotated In-the-Wild Dataset for Filtering of In-the-Wild Speech Data

Description

The Annotated In-The-Wild (AITW) Dataset for Filtering of In-the-Wild Speech Data (v1.0 - Superseded)

! PLEASE USE THE MOST RECENT VERSION OF THIS DATASET FOUND HERE: https://zenodo.org/records/16937418

The Annotated In-The-Wild (AITW) dataset accompanies the paper “Whilter: A Whisper-based Data Filter for ‘in-the-wild’ Speech Corpora Using Utterance-level Multi-Task Classification”, accepted at Interspeech 2025 in Rotterdam. This dataset supports research into automated filtering of noisy or undesirable audio segments in large-scale, real-world speech corpora, particularly for training high-quality English TTS and ASR models.

AITW includes over 21,000 manually labeled audio samples (≈64 hours) from two popular in-the-wild speech datasets (Emilia and YODAS). Each audio clip is annotated at the utterance level with binary or numerical labels for five key properties.

Numerical labels:

  • Speaker count

Binary labels:

  • Non-English (foreign) language
  • Background music
  • Noisy or poor-quality speech
  • Synthetic (spoofed) speech

Annotations were performed by expert annotators using a custom Label Studio interface, with consistent guidelines applied across all tasks. This dataset enables the benchmarking of multi-task classification models like Whilter and comparison with single-task baselines.

AITW is designed to foster further research in scalable speech data curation and low-resource dataset bootstrapping. We encourage contributions and improvements through the included Label Studio GUI.

Files include:

  • Labeled audio metadata, together with file maps linking each sample back to its source in YODAS and Emilia (.csv or .json)
  • Interface config for Label Studio (.xml)
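To illustrate how the five utterance-level labels above might be used to select clean training data, the sketch below filters a metadata table down to single-speaker utterances with no flagged problems. The column names (`sample_id`, `speaker_count`, and the four binary flags) are hypothetical stand-ins; the actual headers in the released .csv files may differ.

```python
import csv
import io

# Hypothetical metadata rows; real column names in the released .csv
# files may differ. This only illustrates the label schema described
# above: one numerical label (speaker count) plus four binary labels.
SAMPLE_CSV = """\
sample_id,speaker_count,foreign_language,background_music,noisy_speech,synthetic_speech
yodas_000001,1,0,0,0,0
emilia_000042,2,0,1,0,0
yodas_000777,1,1,0,0,0
"""

def clean_single_speaker(rows):
    """Keep utterances with exactly one speaker and no binary flags set."""
    flags = ("foreign_language", "background_music",
             "noisy_speech", "synthetic_speech")
    return [r for r in rows
            if int(r["speaker_count"]) == 1
            and not any(int(r[f]) for f in flags)]

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
kept = clean_single_speaker(rows)
print([r["sample_id"] for r in kept])  # only the clean utterance remains
```

This is the kind of utterance-level filtering the dataset is designed to benchmark; a trained model such as Whilter would predict these labels automatically instead of reading them from manual annotations.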

If you use this dataset, please cite:

W. Ravenscroft, G. Close, K. Bower-Morris, J. Stacey, D. Sityaev, K. Hong. “Whilter: A Whisper-based Data Filter for ‘in-the-wild’ Speech Corpora Using Utterance-level Multi-Task Classification,” Interspeech 2025.

License:

Creative Commons Attribution 4.0 International (CC BY 4.0)


Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details

Dates

Accepted
2025-05-19
Interspeech 2025

Software

Development Status
Active

References

  • X. Li, S. Takamichi, T. Saeki, W. Chen, S. Shiota and S. Watanabe, "YODAS: Youtube-Oriented Dataset for Audio and Speech," 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, 2023, pp. 1-8, doi: 10.1109/ASRU57964.2023.10389689.
  • H. He et al., "Emilia: An Extensive, Multilingual, and Diverse Speech Dataset For Large-Scale Speech Generation," 2024 IEEE Spoken Language Technology Workshop (SLT), Macao, 2024, pp. 885-890, doi: 10.1109/SLT61566.2024.10832365.