
Published July 12, 2022 | Version 1.0.1
Dataset | Open Access

PodcastFillers

  • 1. University of Rochester
  • 2. Adobe Research

Description

OVERVIEW:
The PodcastFillers dataset consists of 199 full-length podcast episodes in English with manually annotated filler words and automatically generated transcripts. The podcast audio recordings, sourced from SoundCloud (www.soundcloud.com), are CC-licensed, gender-balanced, and total 145 hours of audio from over 350 speakers. The annotations are provided under a non-commercial license and consist of 85,803 manually annotated audio events including approximately 35,000 filler words (“uh” and “um”) and 50,000 non-filler events such as breaths, music, laughter, repeated words, and noise. The annotated events are also provided as pre-processed 1-second audio clips. The dataset also includes automatically generated speech transcripts from a speech-to-text system. A detailed description is provided below.

The PodcastFillers dataset homepage: PodcastFillers.github.io
The preprocessing utility functions and code repository for reproducing our experimental results: PodcastFillersUtils

 

LICENSE:

The PodcastFillers dataset has separate licenses for the audio data and for the metadata. The metadata includes all annotations, speech-to-text transcriptions, and model outputs including VAD activations and FillerNet classification predictions.

Note: PodcastFillers is provided for research purposes only. The metadata license prohibits commercial use, which in turn prohibits deploying technology developed using the PodcastFillers metadata (such as the CSV annotations or audio clips extracted based on these annotations) in commercial applications.

## License for PodcastFillers Dataset metadata

This license agreement (the “License”) between Adobe Inc., having a place of business at 345 Park Avenue, San Jose, California 95110-2704 (“Adobe”), and you, the individual or entity exercising rights under this License (“you” or “your”), sets forth the terms for your use of certain research materials that are owned by Adobe (the “Licensed Materials”). By exercising rights under this License, you accept and agree to be bound by its terms. If you are exercising rights under this License on behalf of an entity, then “you” means you and such entity, and you (personally) represent and warrant that you (personally) have all necessary authority to bind that entity to the terms of this License.

1.    GRANT OF LICENSE.
1.1    Adobe grants you a nonexclusive, worldwide, royalty-free, revocable, fully paid license to (A) reproduce, use, modify, and publicly display the Licensed Materials for noncommercial research purposes only; and (B) redistribute the Licensed Materials, and modifications or derivative works thereof, for noncommercial research purposes only, provided that you give recipients a copy of this License upon redistribution.
1.2    You may add your own copyright statement to your modifications and/or provide additional or different license terms for use, reproduction, modification, public display, and redistribution of your modifications and derivative works, provided that such license terms limit the use, reproduction, modification, public display, and redistribution of such modifications and derivative works to noncommercial research purposes only.
1.3    For purposes of this License, noncommercial research purposes include academic research and teaching only. Noncommercial research purposes do not include commercial licensing or distribution, development of commercial products, or any other activity that results in commercial gain.
2.    OWNERSHIP AND ATTRIBUTION. Adobe and its licensors own all right, title, and interest in the Licensed Materials. You must retain all copyright notices and/or disclaimers in the Licensed Materials.
3.    DISCLAIMER OF WARRANTIES. THE LICENSED MATERIALS ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND. THE ENTIRE RISK AS TO THE USE, RESULTS, AND PERFORMANCE OF THE LICENSED MATERIALS IS ASSUMED BY YOU. ADOBE DISCLAIMS ALL WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, WITH REGARD TO YOUR USE OF THE LICENSED MATERIALS, INCLUDING, BUT NOT LIMITED TO, NONINFRINGEMENT OF THIRD-PARTY RIGHTS.
4.    LIMITATION OF LIABILITY. IN NO EVENT WILL ADOBE BE LIABLE FOR ANY ACTUAL, INCIDENTAL, SPECIAL OR CONSEQUENTIAL DAMAGES, INCLUDING WITHOUT LIMITATION, LOSS OF PROFITS OR OTHER COMMERCIAL LOSS, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THE LICENSED MATERIALS, EVEN IF ADOBE HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
5.    TERM AND TERMINATION.  
5.1    The License is effective upon acceptance by you and will remain in effect unless terminated earlier in accordance with Section 5.2.
5.2    Any breach of any material provision of this License will automatically terminate the rights granted herein.
5.3    Sections 2 (Ownership and Attribution), 3 (Disclaimer of Warranties), 4 (Limitation of Liability) will survive termination of this License.

## License for PodcastFillers Dataset audio files

All of the podcast episode audio files come from SoundCloud. Please see podcast_episode_license.csv (included in the dataset) for detailed license information for each episode. The licenses include CC BY 3.0, CC BY-SA 3.0, and CC BY-ND 3.0.
 

ACKNOWLEDGEMENT:
Please cite the following paper in work that makes use of this dataset:

Filler Word Detection and Classification: A Dataset and Benchmark
Ge Zhu, Juan-Pablo Caceres and Justin Salamon
In 23rd Annual Conf. of the Int. Speech Communication Association (INTERSPEECH), Incheon, Korea, Sep. 2022.

Bibtex

@inproceedings{Zhu:FillerWords:INTERSPEECH:22,
  title = {Filler Word Detection and Classification: A Dataset and Benchmark},
  booktitle = {23rd Annual Conf.~of the Int.~Speech Communication Association (INTERSPEECH)},
  address = {Incheon, Korea}, 
  month = {Sep.},
  url = {https://arxiv.org/abs/2203.15135},
  author = {Zhu, Ge and Caceres, Juan-Pablo and Salamon, Justin},
  year = {2022},
}

 

ANNOTATIONS:
The annotations include 85,803 manually annotated audio events covering common English filler-word and non-filler-word events. We also provide automatically generated speech transcripts from a speech-to-text system; note that the transcripts do not contain the manually annotated events.
Full label vocabulary
Each of the 85,803 manually annotated events is labeled as one of 5 filler classes or 8 non-filler classes (label: number of events).

Fillers
- Uh: 17,907
- Um: 17,078
- You know: 668
- Other: 315
- Like: 157

Non-fillers
- Words: 12,709
- Repetitions: 9,024
- Breath: 8,288
- Laughter: 6,623
- Music: 5,060
- Agree (agreement sounds, e.g., “mm-hmm”, “ah-ha”): 3,755
- Noise: 2,735
- Overlap (overlapping speakers): 1,484

Total: 85,803 
Consolidated label vocabulary
76,689 of the audio events are also labeled with a smaller, consolidated vocabulary of 6 classes. The consolidated vocabulary was obtained by removing classes with fewer than 5,000 annotations (like, you know, other, agreement sounds, overlapping speakers, noise), and grouping “repetitions” and “words” into “words”.

- Words: 21,733
- Uh: 17,907
- Um: 17,078
- Breath: 8,288
- Laughter: 6,623
- Music: 5,060

Total: 76,689

The consolidated vocabulary was used to train FillerNet.
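The mapping from the full vocabulary to the consolidated vocabulary can be written out explicitly. The sketch below is a plain Python dictionary derived from the description above; it is illustrative only and is not part of the dataset or the PodcastFillersUtils code.

```python
# Full-vocabulary label -> consolidated-vocabulary label, as described above.
# Illustrative only; not an official artifact of the dataset.
FULL_TO_CONSOLIDATED = {
    "Uh": "Uh",
    "Um": "Um",
    "Breath": "Breath",
    "Laughter": "Laughter",
    "Music": "Music",
    "Words": "Words",
    "Repetitions": "Words",  # "Repetitions" is grouped into "Words"
    # Classes with fewer than 5,000 annotations are excluded ("None"):
    "You know": "None",
    "Like": "None",
    "Other": "None",
    "Agree": "None",
    "Overlap": "None",
    "Noise": "None",
}
```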

For a detailed description of how the dataset was created, please see our paper.
Data Split for Machine Learning:
To facilitate machine learning experiments, the audio data in this dataset (full-length recordings and preprocessed 1-sec clips) are pre-arranged into “train”, “validation”, and “test” folders. This split ensures that episodes from the same podcast show are always in the same subset (train, validation, or test), to prevent speaker leakage. We also ensured that each subset in this split remains gender-balanced, like the complete dataset.

We strongly recommend using this split in your experiments. It ensures that your results are not inflated due to overfitting and that they are comparable to the results published in the FillerNet paper.

 

AUDIO FILES:

1. Full-length podcast episodes (MP3)
199 audio files of the full-length podcast episode recordings in mp3 format, stereo, 44.1 kHz sample rate and 32-bit depth. Filename format: [show name]_[episode name].mp3

2. Pre-processed full-length podcast episodes (WAV)
199 audio files of the full-length podcast episode recordings in wav format, mono, 16 kHz sample rate and 32-bit depth. The files are split into train, validation, and test partitions (folders); see Data Split for Machine Learning above. Filename format: [show name]_[episode name].wav

3. Pre-processed WAV clips
Pre-processed 1-second audio clips of the annotated events, where each clip is centered on the center of the event. Annotated events longer than 1 second are truncated to 1 second around their center. The clips are in the same format as the pre-processed full-length podcast episodes: wav format, mono, 16 kHz sample rate and 32-bit depth.

The clips that have consolidated vocabulary labels (76,689) are split into “train”, “validation” and “test” partitions (folders), see Data Split for Machine Learning above. The remainder of the clips (9,114) are placed in an “extra” folder. 

Filename format: [pfID].wav where:

[pfID] = the PodcastFillers ID of the audio clip (see metadata below)
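
As an illustration of the clip geometry described above, the sketch below computes the expected 1-second window around an annotated event and loads one pre-processed clip. The event times, pfID, and folder path are hypothetical placeholders, and soundfile is just one possible audio loader.

```python
import soundfile as sf  # one possible loader; any 16 kHz WAV reader works


def clip_window(event_start, event_end, clip_dur=1.0):
    """Return the (start, end) times of a clip centered on the event center."""
    center = 0.5 * (event_start + event_end)
    return center - clip_dur / 2, center + clip_dur / 2


# Hypothetical event from 12.30 s to 12.75 s in an episode:
start, end = clip_window(12.30, 12.75)  # -> (12.025, 13.025)

# Hypothetical pfID and folder layout (train/validation/test/extra):
audio, sr = sf.read("train/some_pfID.wav")
assert sr == 16000 and audio.ndim == 1  # mono, 16 kHz, ~1-second clip
```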

 

METADATA:

1. Speech-to-text podcast transcripts
Speech transcript in JSON format for each podcast episode, generated using the SpeechMatics STT system. Filename format: [show name]_[episode name].json

Each word in the transcript is annotated as a dictionary:
{"confidence": (float), "duration": (int), "offset": (int), "text": (string)}
where "confidence" indicates the STT confidence in the prediction, "duration" (unit: microsecond, i.e., 1e-6 second) is the duration of the transcribed word, "offset" (unit: microsecond) is the start time of the transcribed word in the full-length recording, and "text" is the transcribed word itself.
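
For example, a transcript could be loaded and the microsecond offsets converted to seconds as in the minimal sketch below. The filename is a placeholder, and the top-level JSON structure is assumed here to be a list of the word dictionaries described above.

```python
import json

# Hypothetical filename; top-level structure assumed to be a list of
# word dictionaries of the form shown above.
with open("show-name_episode-name.json") as f:
    words = json.load(f)

for w in words:
    start_sec = w["offset"] / 1e6   # offsets are given in microseconds
    dur_sec = w["duration"] / 1e6
    print(f"{start_sec:8.2f}s  {dur_sec:5.2f}s  {w['text']}  (conf={w['confidence']:.2f})")
```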

2. PodcastFillers.csv
This is the dataset’s main annotation file, and contains the annotations of all 85,803 manually annotated events. Each annotated event also corresponds to one pre-processed audio clip. For each annotated event / audio clip, we provide:

clip_name: (str)
The filename of the audio clip containing this event: [pfID].wav

pfID: (str)
The PodcastFillers ID of the clip/event, a unique identifier.

fullvoc_label: (str)
The full-vocabulary label for this audio event, one of “Uh”, “Um”, “You know”, “Like”, “Other”, “Laughter”, “Breath”, “Agree”, “Words”, “Repetitions”, “Overlap”, “Music” and “Noise”. See Annotations above for details.

consolidated_label: (str)
Consolidated-vocabulary label used for training FillerNet, one of “Uh”, “Um”, “Laughter”, “Breath”, “Words”, “Music”, or “None”. “None” is assigned to events outside the consolidated vocabulary. See Annotations above for details.

episode_split_subset: (str) 
The subset (train/validation/test) folder that the episode containing this 1-second clip belongs to. See Data Split for Machine Learning above for details.

clip_split_subset: (str)
The subset (train/validation/test/extra) folder this 1-second clip belongs to. See Data Split for Machine Learning above for details. Note that while every episode belongs to one of train/validation/test, the clips are sorted into these subsets plus one additional folder “extra”, which contains clips that are excluded from the consolidated vocabulary. See Annotations above for details.

podcast_filename: (str)
The filename of the full-length podcast episode in which this event occurs, in the format [show name]_[episode name].

event_start_inepisode: (float)
The start time of the event in the episode, unit: second.

event_end_inepisode: (float) 
The end time of the event in the episode, unit: second.

event_start_inclip: (float) 
The start time of the event in the 1-second clip, unit: second.

event_end_inclip: (float) 
The end time of the event in the 1-second clip, unit: second.

clip_start_inepisode: (float) 
The start time of the 1-sec clip in the episode, unit: second.

clip_end_inepisode: (float)
The end time of the 1-sec clip in the episode, unit: second.

duration: (float)
Duration of the event in the episode, unit: second. Note that it can be larger than 1.0 because some events are longer than 1 second.

confidence: (float)
Confidence among crowd annotators, in the range 0-1.

agreement: (int)
Agreement among crowd annotators, indicating the number of annotators.
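
As an example of how these fields might be used together, the sketch below loads the main annotation file with pandas and selects the consolidated-vocabulary clips of the training split. The column names follow the descriptions above; the exact string values of the split column are assumed to match the folder names (train/validation/test/extra), and pandas is just one convenient way to read the file.

```python
import pandas as pd

df = pd.read_csv("PodcastFillers.csv")

# Training-split clips that carry a consolidated-vocabulary label
# (i.e. excluding events labeled "None").
train = df[(df["clip_split_subset"] == "train")
           & (df["consolidated_label"] != "None")]

# Per-class counts in the training split.
print(train["consolidated_label"].value_counts())
```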

3. Per-episode csv files

These files have the same format as PodcastFillers.csv, but only contain the audio events of a specific podcast episode. There is one file per episode. Filename format: [show name]_[episode name].csv

4. VAD activation csv files

Voice activity detection predictions from our pretrained robust VAD model. The first column contains time stamps (unit: second) and the second column contains the VAD activations. Filename format: `{show name}_{episode name}.csv`
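
A minimal way to read one of these files, assuming two columns (time stamp in seconds, VAD activation) and no header row; adjust the `header` argument if the files do include one. The filename is a placeholder.

```python
import pandas as pd

# Hypothetical filename; assumes two unnamed columns and no header row.
vad = pd.read_csv("show-name_episode-name.csv",
                  header=None, names=["time_sec", "activation"])
print(vad.head())
```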

5. Ground truth and prediction sed_eval txt files

Ground-truth and AVC-FillerNet-predicted events in the format supported by sed_eval:

{start_time} {end_time} {event_label}

Filename format:  `{show name}_{episode name}.txt`
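
For reference, lines in this format can be parsed with a few lines of standard Python (the filename below is a placeholder):

```python
def load_sed_eval_txt(path):
    """Parse lines of the form '{start_time} {end_time} {event_label}'."""
    events = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # skip any blank lines
            start, end, label = line.split(maxsplit=2)
            events.append((float(start), float(end), label.strip()))
    return events


events = load_sed_eval_txt("show-name_episode-name.txt")
```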

* To reassemble and unzip the split archive files:

zip -FF PodcastFillers.zip --out PodcastFillers-full.zip
unzip PodcastFillers-full.zip

 

Files (25.3 GB)

- PodcastFillers.csv (21.7 MB), md5:68d8614ad562263c621b90dc81f06abd
- 6.4 GB, md5:e0941abae2a9e5bbe0e7c28f6140ccc7
- 6.4 GB, md5:25a149c5dc76fa7c91d0edb6e77aedb6
- 6.4 GB, md5:399c44abb9ec35aa56d6fbf75043164b
- 5.9 GB, md5:20e9f6b93b1b5c2e1c9d8b774ca28c33