Podcasts as Data

Verreyen, Loren

doi:10.5281/zenodo.17469222

Published October 28, 2025 | Version v1

Dataset Open

Verreyen, Loren^{1, 2, 3}

1. University of Antwerp
2. University of Amsterdam
3. Research Foundation - Flanders

Description:

This dataset provides bag-of-words (BOW) representations of podcast transcripts derived from a large-scale corpus of English-language podcasts. The corpus spans all 19 Apple Podcasts genres and includes over 15,000 episodes from approximately 1,900 shows.

Transcripts were generated using Hugging Face's Distil-Whisper-Large-v3 automatic speech recognition model. Text is lowercased but otherwise unprocessed.

Files:

bow_matrix.npz — Sparse bag-of-words matrix (scipy sparse format, documents × terms)
metadata.csv — Podcast title, episode title, and description for each document (row indices correspond to matrix rows)
vocabulary.csv — Terms in the vocabulary (column indices correspond to matrix columns)
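
The correspondence between the three files can be checked once they are loaded. The following is a minimal sketch, not part of the dataset's own documentation: the function name check_alignment, the toy matrix, and the "episode" column name are hypothetical stand-ins for the real files.

```python
import pandas as pd
from scipy import sparse


def check_alignment(bow_matrix, metadata, vocabulary):
    """Raise ValueError if the matrix shape does not match the two tables."""
    n_docs, n_terms = bow_matrix.shape
    if n_docs != len(metadata):
        raise ValueError(f"{n_docs} matrix rows but {len(metadata)} metadata rows")
    if n_terms != len(vocabulary):
        raise ValueError(f"{n_terms} matrix columns but {len(vocabulary)} vocabulary terms")
    return n_docs, n_terms


# Toy stand-ins for the real files: 2 documents, 3 terms.
toy_matrix = sparse.csr_matrix([[2, 0, 1], [0, 3, 0]])
toy_metadata = pd.DataFrame({"episode": ["episode a", "episode b"]})  # hypothetical column
toy_vocabulary = ["hello", "podcast", "world"]

toy_shape = check_alignment(toy_matrix, toy_metadata, toy_vocabulary)
print(toy_shape)
```

Running the same check on the loaded dataset files would catch a truncated download or a vocabulary that was parsed with missing rows.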

Usage:

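from scipy import sparse
import pandas as pd

# Sparse document-term matrix: rows are documents, columns are vocabulary terms
bow_matrix = sparse.load_npz('bow_matrix.npz')
# One row per document, in the same order as the matrix rows
metadata = pd.read_csv('metadata.csv')
# keep_default_na=False stops pandas from reading terms such as "null" or "nan" as missing values
vocabulary = pd.read_csv('vocabulary.csv', keep_default_na=False)['term'].tolist()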
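
Once the matrix and vocabulary are loaded, a common first step is to list the most frequent terms in a single document. The sketch below is one possible way to do this and is not taken from the dataset's documentation: the helper name top_terms is hypothetical, and a toy matrix and vocabulary are used so the example runs without the real files.

```python
import numpy as np
from scipy import sparse


def top_terms(bow_matrix, vocabulary, doc_index, k=10):
    """Return the k most frequent (term, count) pairs for one document."""
    csr = sparse.csr_matrix(bow_matrix)  # CSR layout gives cheap row access
    start, end = csr.indptr[doc_index], csr.indptr[doc_index + 1]
    cols = csr.indices[start:end]   # column indices of the nonzero terms
    counts = csr.data[start:end]    # their counts in this document
    # Sort by descending count; the int64 cast avoids wrap-around on unsigned dtypes.
    order = np.argsort(-counts.astype(np.int64), kind="stable")[:k]
    return [(vocabulary[c], int(n)) for c, n in zip(cols[order], counts[order])]


# Toy stand-ins: 2 documents, 4 terms.
toy_matrix = sparse.csr_matrix([[2, 0, 1, 5], [0, 3, 0, 0]])
toy_vocabulary = ["hello", "podcast", "world", "the"]

toy_top = top_terms(toy_matrix, toy_vocabulary, doc_index=0, k=2)
print(toy_top)  # [('the', 5), ('hello', 2)]
```

With the real data, the call would take the form top_terms(bow_matrix, vocabulary, doc_index=0), and the matching row of metadata identifies the episode. Because the text is lowercased but otherwise unprocessed, function words are likely to dominate the raw counts.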

Related publication:

For a detailed description of the data collection and transcription process, see:

Verreyen, L. (2025). Podcasts as Data: Building a Dataset for Large-Scale Audio Content Analysis. In T. Arnold, M. Fantoli, & R. Ros (Eds.), Computational Humanities Research 2025 (Vol. 3, pp. 231–248). Anthology of Computers and the Humanities. https://doi.org/10.63744/QgeF94c0fP7D

Files (44.5 MB)

Name	MD5	Size
bow_matrix.npz	e5ae7bb9afc3df05040c980473992bdb	40.1 MB
metadata.csv	4d13e439d0d8c383775b90d6d72cd7a5	1.3 MB
vocabulary.csv	75f21483f332b82d2e7bf93f2b67ee24	3.0 MB
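
The MD5 checksums listed for each file can be used to confirm that a download is complete. The sketch below uses only Python's standard library. The function names md5_of_file and verify_downloads are hypothetical, and the files are assumed to sit in a single local directory.

```python
import hashlib
from pathlib import Path

# MD5 checksums as listed in the record's file table.
EXPECTED_MD5 = {
    "bow_matrix.npz": "e5ae7bb9afc3df05040c980473992bdb",
    "metadata.csv": "4d13e439d0d8c383775b90d6d72cd7a5",
    "vocabulary.csv": "75f21483f332b82d2e7bf93f2b67ee24",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each file name to True (match), False (mismatch), or None (not found)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == expected) if path.exists() else None
    return results


print(verify_downloads("."))
```

A False value means the local copy differs from the published file and should be downloaded again.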

Additional details

Is described by: Conference paper: 10.63744/QgeF94c0fP7D (DOI)

