AITW: The Annotated In-the-Wild Dataset for Filtering of In-the-Wild Speech Data
Authors/Creators
- ConnexAI
Description
The Annotated In-The-Wild (AITW) Dataset for Filtering of In-the-Wild Speech Data (v1.1)
Version 1.1: Some missing entries in the Emilia file map have been updated and a number of YODAS archive URLs have been corrected.
The Annotated In-The-Wild (AITW) dataset accompanies the paper “Whilter: A Whisper-based Data Filter for ‘in-the-wild’ Speech Corpora Using Utterance-level Multi-Task Classification”, accepted at Interspeech 2025 in Rotterdam. This dataset supports research into automated filtering of noisy or undesirable audio segments in large-scale, real-world speech corpora, particularly for training high-quality English TTS and ASR models.
AITW includes over 21,000 manually labeled audio samples (≈64 hours) from two popular in-the-wild speech datasets (Emilia and YODAS). Each audio clip is annotated at the utterance level with binary or numerical labels for five key properties.
Numerical labels:
- Speaker count
Binary labels:
- Non-English (foreign) language
- Background music
- Noisy or poor-quality speech
- Synthetic (spoofed) speech
Annotations were performed by expert annotators using a custom Label Studio interface, with consistent guidelines applied across all tasks. This dataset enables the benchmarking of multi-task classification models like Whilter and comparison with single-task baselines.
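As a minimal sketch of how the utterance-level labels might be used for filtering, the snippet below reads AITW-style metadata and keeps only single-speaker, English, clean, non-synthetic utterances. The column names (`speaker_count`, `foreign_language`, `background_music`, `noisy_speech`, `synthetic_speech`) are assumptions for illustration only; check the headers of the released CSV files for the actual schema.

```python
import csv
import io

# Toy stand-in for the AITW metadata CSV; the real files ship with the
# dataset. Column names here are assumed, not the dataset's actual schema.
sample_csv = """utterance_id,speaker_count,foreign_language,background_music,noisy_speech,synthetic_speech
utt_001,1,0,0,0,0
utt_002,2,0,1,0,0
utt_003,1,1,0,0,0
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

def is_clean(row):
    """Keep single-speaker utterances with no binary flag set."""
    return int(row["speaker_count"]) == 1 and all(
        row[k] == "0"
        for k in ("foreign_language", "background_music",
                  "noisy_speech", "synthetic_speech")
    )

clean = [r["utterance_id"] for r in rows if is_clean(r)]
print(clean)  # only utt_001 passes every filter
```

The same predicate can serve as a ground-truth reference when benchmarking a multi-task classifier such as Whilter against single-task baselines.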
AITW is designed to foster further research in scalable speech data curation and low-resource dataset bootstrapping. We encourage contributions and improvements through the included Label Studio GUI.
Files include:
- Labeled audio metadata, plus file maps linking each entry back to the source data in YODAS and Emilia (.csv or .json)
- Interface config for Label Studio (.xml)
If you use this dataset, please cite:
W. Ravenscroft, G. Close, K. Bower-Morris, J. Stacey, D. Sityaev, K. Hong. “Whilter: A Whisper-based Data Filter for ‘in-the-wild’ Speech Corpora Using Utterance-level Multi-Task Classification,” Interspeech 2025.
License:
Creative Commons Attribution 4.0 International (CC BY 4.0)
Files (4.4 MB total; includes emilia_file_map.csv)

| Name | MD5 | Size |
|---|---|---|
| | md5:9bc8c456b92f29c9412278fbd5cc081d | 1.2 MB |
| | md5:3be4b058df8a01e111126d35c4b79e1f | 1.3 MB |
| | md5:348936f0957eb978a863fc86aba1ec39 | 1.5 kB |
| | md5:7d5f855091f9a33726ed31361a63a62e | 83.8 kB |
| | md5:b03d03adba58c4f498744bef5c405d0e | 62.5 kB |
| | md5:0b3a98e0ec014ee2d264d42d18b42f29 | 1.8 MB |
Additional details
Dates
- Accepted: 2025-05-19 (Interspeech 2025)
Software
- Development Status
- Active
References
- X. Li, S. Takamichi, T. Saeki, W. Chen, S. Shiota and S. Watanabe, "YODAS: YouTube-Oriented Dataset for Audio and Speech," 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, 2023, pp. 1-8, doi: 10.1109/ASRU57964.2023.10389689.
- H. He et al., "Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation," 2024 IEEE Spoken Language Technology Workshop (SLT), Macao, 2024, pp. 885-890, doi: 10.1109/SLT61566.2024.10832365.