Podcast annotation dataset for paper "Identifying Introductions in Podcast Episodes from Automatically Generated Transcripts"
Description
Dataset for the paper "Identifying Introductions in Podcast Episodes from Automatically Generated Transcripts". Please refer to the paper for details. Compared to the dataset used in the paper, 20 of the 417 episodes have been removed due to copyright issues, leaving 397 episodes.
The data file contains the following fields:
- "episode_intro_start": the time stamp for episode introduction start (in milliseconds)
- "episode_intro_end": the time stamp for episode introduction end (in milliseconds)
- "program_intro_start": the time stamp for program introduction start (in milliseconds)
- "program_intro_end": the time stamp for program introduction end (in milliseconds)
- "program_name": name of the podcast program
- "episode_name": name of the podcast episode
- "transcription": JSON string containing the transcription, including the timestamps.
- "annotator": anonymized annotator ID.
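As a minimal sketch of how one record with these fields could be handled, the Python below converts the four millisecond timestamps to seconds and decodes the "transcription" JSON string. The container format of the data file is not stated here, so the example works on a single made-up dictionary. The helper names `parse_record` and `intro_duration_seconds`, the sample values, and the treatment of a missing annotation as `None` are assumptions for illustration, not part of the dataset.

```python
import json

# Timestamp fields listed above; all are given in milliseconds.
TIME_FIELDS = (
    "episode_intro_start",
    "episode_intro_end",
    "program_intro_start",
    "program_intro_end",
)


def parse_record(record):
    """Return a copy of the record with times in seconds and the transcription decoded."""
    parsed = dict(record)
    for field in TIME_FIELDS:
        value = record.get(field)
        # Assumption: a missing annotation is stored as None.
        # Adjust this check if the data file uses a different sentinel.
        parsed[field] = None if value is None else float(value) / 1000.0
    # "transcription" is a JSON string; its inner structure is not assumed here.
    parsed["transcription"] = json.loads(record["transcription"])
    return parsed


def intro_duration_seconds(parsed, kind):
    """Duration of the 'episode' or 'program' introduction, or None if unannotated."""
    start = parsed[f"{kind}_intro_start"]
    end = parsed[f"{kind}_intro_end"]
    if start is None or end is None:
        return None
    return end - start


# Made-up record for illustration only; none of these values come from the dataset.
example = {
    "episode_intro_start": 12000,
    "episode_intro_end": 45500,
    "program_intro_start": 0,
    "program_intro_end": 12000,
    "program_name": "Example Program",
    "episode_name": "Example Episode",
    "transcription": json.dumps({"placeholder": "structure not specified here"}),
    "annotator": "A1",
}

parsed = parse_record(example)
print(intro_duration_seconds(parsed, "episode"))
print(intro_duration_seconds(parsed, "program"))
```

The transcription is left as whatever `json.loads` returns, since the layout of its timestamps is described in the paper rather than in this field list.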
Files
LICENSE.txt
Total size: 115.4 MB
Size | Checksum |
---|---|
20.2 kB | md5:3a86ee579a68bc4a89fef4251b030734 |
115.4 MB | md5:344308ea2c2cb7204acbc53218b732ad |
Additional details
Related works
- Is supplement to journal article: https://arxiv.org/abs/2110.07096