
Published May 28, 2025 | Version 1.0
Dataset Restricted

AITW: The Annotated In-the-Wild Dataset for Filtering of In-the-Wild Speech Data

Description

The Annotated In-The-Wild (AITW) Dataset for Filtering of In-the-Wild Speech Data (v1.0 - Superseded)

! PLEASE USE THE MOST RECENT VERSION OF THIS DATASET FOUND HERE: https://zenodo.org/records/16937418

The Annotated In-The-Wild (AITW) dataset accompanies the paper “Whilter: A Whisper-based Data Filter for ‘in-the-wild’ Speech Corpora Using Utterance-level Multi-Task Classification”, accepted at Interspeech 2025 in Rotterdam. This dataset supports research into automated filtering of noisy or undesirable audio segments in large-scale, real-world speech corpora, particularly for training high-quality English TTS and ASR models.

AITW includes over 21,000 manually labeled audio samples (≈64 hours) from two popular in-the-wild speech datasets (Emilia and YODAS). Each audio clip is annotated at the utterance level with binary or numerical labels for five key properties.

Numerical labels:

  • Speaker count

Binary labels:

  • Non-English (foreign) language
  • Background music
  • Noisy or poor-quality speech
  • Synthetic (spoofed) speech

Annotations were performed by expert annotators using a custom Label Studio interface, with consistent guidelines applied across all tasks. This dataset enables the benchmarking of multi-task classification models like Whilter and comparison with single-task baselines.

AITW is designed to foster further research in scalable speech data curation and low-resource dataset bootstrapping. We encourage contributions and improvements through the included Label Studio GUI.

Files include:

  • Labeled audio metadata, together with file maps linking each sample back to its source in YODAS and Emilia (.csv or .json)
  • Interface config for Label Studio (.xml)
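To illustrate how the five utterance-level labels above might be used to select clean training data, the sketch below filters a metadata table down to single-speaker utterances with no flagged problems. The column names (`sample_id`, `speaker_count`, and the four binary flags) are hypothetical stand-ins; the actual headers in the released .csv files may differ.

```python
import csv
import io

# Hypothetical metadata rows; real column names in the released .csv
# files may differ. This only illustrates the label schema described
# above: one numerical label (speaker count) plus four binary labels.
SAMPLE_CSV = """\
sample_id,speaker_count,foreign_language,background_music,noisy_speech,synthetic_speech
yodas_000001,1,0,0,0,0
emilia_000042,2,0,1,0,0
yodas_000777,1,1,0,0,0
"""

def clean_single_speaker(rows):
    """Keep utterances with exactly one speaker and no binary flags set."""
    flags = ("foreign_language", "background_music",
             "noisy_speech", "synthetic_speech")
    return [r for r in rows
            if int(r["speaker_count"]) == 1
            and not any(int(r[f]) for f in flags)]

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
kept = clean_single_speaker(rows)
print([r["sample_id"] for r in kept])  # only the clean utterance remains
```

This is the kind of utterance-level filtering the dataset is designed to benchmark; a trained model such as Whilter would predict these labels automatically instead of reading them from manual annotations.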

If you use this dataset, please cite:

W. Ravenscroft, G. Close, K. Bower-Morris, J. Stacey, D. Sityaev, K. Hong. “Whilter: A Whisper-based Data Filter for ‘in-the-wild’ Speech Corpora Using Utterance-level Multi-Task Classification,” Interspeech 2025.

License:

Creative Commons Attribution 4.0 International (CC BY 4.0)


Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details

Dates

Accepted
2025-05-19
Interspeech 2025

Software

Development Status
Active

References

  • X. Li, S. Takamichi, T. Saeki, W. Chen, S. Shiota and S. Watanabe, "YODAS: Youtube-Oriented Dataset for Audio and Speech," 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, 2023, pp. 1-8, doi: 10.1109/ASRU57964.2023.10389689.
  • H. He et al., "Emilia: An Extensive, Multilingual, and Diverse Speech Dataset For Large-Scale Speech Generation," 2024 IEEE Spoken Language Technology Workshop (SLT), Macao, 2024, pp. 885-890, doi: 10.1109/SLT61566.2024.10832365.