Podcasts as Data
Description:
This dataset provides bag-of-words (BOW) representations of podcast transcripts derived from a large-scale corpus of English-language podcasts. The corpus spans all 19 Apple Podcasts genres and includes over 15,000 episodes from approximately 1,900 shows.
Transcripts were generated using Hugging Face's Distil-Whisper-Large-v3 automatic speech recognition model. Text is lowercased but otherwise unprocessed.
Files:
- bow_matrix.npz: Sparse bag-of-words matrix (SciPy sparse format, documents × terms)
- metadata.csv: Podcast title, episode title, and description for each document (row indices correspond to matrix rows)
- vocabulary.csv: Terms in the vocabulary (column indices correspond to matrix columns)
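Because documents and terms are matched to the companion files purely by position, it can be worth confirming that the shapes agree once the files are loaded. The following is a minimal sketch under stated assumptions: check_alignment is a hypothetical helper (not shipped with the dataset), and the toy matrix stands in for the real bow_matrix.npz.

```python
import numpy as np
from scipy import sparse


def check_alignment(bow_matrix, n_metadata_rows, n_terms):
    """Raise ValueError if the matrix shape disagrees with the companion files.

    Hypothetical helper for illustration; not part of the dataset.
    """
    n_docs, n_cols = bow_matrix.shape
    if n_docs != n_metadata_rows:
        raise ValueError(
            f"{n_docs} matrix rows but {n_metadata_rows} metadata rows"
        )
    if n_cols != n_terms:
        raise ValueError(
            f"{n_cols} matrix columns but {n_terms} vocabulary terms"
        )
    return n_docs, n_cols


# Toy stand-in for the real data: 2 documents, 3 terms.
toy_matrix = sparse.csr_matrix(np.array([[2, 0, 1], [0, 3, 0]]))
shape = check_alignment(toy_matrix, n_metadata_rows=2, n_terms=3)
```

With the real files, the same call would take len(metadata) and len(vocabulary) as the two counts.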
Usage:
    from scipy import sparse
    import pandas as pd

    bow_matrix = sparse.load_npz('bow_matrix.npz')
    metadata = pd.read_csv('metadata.csv')
    vocabulary = pd.read_csv('vocabulary.csv')['term'].tolist()
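As an illustration of how the matrix and vocabulary fit together, the sketch below lists the most frequent terms of a single document. It is not part of the dataset: top_terms is a hypothetical helper, and the toy matrix and three-word vocabulary are invented stand-ins for the loaded bow_matrix and vocabulary.

```python
import numpy as np
from scipy import sparse


def top_terms(bow_matrix, vocabulary, doc_index, k=10):
    """Return up to k (term, count) pairs for one document, most frequent first.

    Hypothetical helper for illustration; not part of the dataset.
    """
    # Convert to CSR so that row indexing works whatever format was loaded.
    csr = sparse.csr_matrix(bow_matrix)
    # Dense 1-D vector of counts for the chosen document (signed for sorting).
    row = csr[doc_index].toarray().ravel().astype(np.int64)
    # Stable sort on negated counts gives descending order, ties by column index.
    order = np.argsort(-row, kind="stable")[:k]
    # Skip terms that do not occur in the document.
    return [(vocabulary[i], int(row[i])) for i in order if row[i] > 0]


# Toy stand-ins: 2 documents, 3 terms.
toy_vocabulary = ["podcast", "episode", "music"]
toy_matrix = sparse.csr_matrix(np.array([[2, 0, 1], [0, 3, 0]]))

first_doc = top_terms(toy_matrix, toy_vocabulary, doc_index=0, k=2)
second_doc = top_terms(toy_matrix, toy_vocabulary, doc_index=1, k=2)
```

Since the text is lowercased but otherwise unprocessed, the top terms of real documents will likely be dominated by function words unless a stop list is applied first.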
Related publication:
For a detailed description of the data collection and transcription process, see:
Verreyen, L. (2025). Podcasts as Data: Building a Dataset for Large-Scale Audio Content Analysis. In T. Arnold, M. Fantoli, & R. Ros (Eds.), Computational Humanities Research 2025 (Vol. 3, pp. 231–248). Anthology of Computers and the Humanities. https://doi.org/10.63744/QgeF94c0fP7D
Additional details
Related works
- Is described by
- Conference paper: 10.63744/QgeF94c0fP7D (DOI)