Published October 28, 2025 | Version v1
Dataset Open

Podcasts as Data

Authors/Creators

  • 1. University of Antwerp
  • 2. University of Amsterdam
  • 3. Research Foundation - Flanders

Description

This dataset provides bag-of-words (BOW) representations of podcast transcripts derived from a large-scale corpus of English-language podcasts. The corpus spans all 19 Apple Podcasts genres and includes over 15,000 episodes from approximately 1,900 shows.

Transcripts were generated using Hugging Face's Distil-Whisper-Large-v3 automatic speech recognition model. Text is lowercased but otherwise unprocessed.

 

Files:

  • bow_matrix.npz — Sparse bag-of-words matrix (scipy sparse format, documents × terms)
  • metadata.csv — Podcast title, episode title, and description for each document (row indices correspond to matrix rows)
  • vocabulary.csv — Terms in the vocabulary (column indices correspond to matrix columns)
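
The index correspondence described above can be checked before any analysis. The following is a minimal sketch, not part of the dataset: check_alignment is a hypothetical helper, and the toy matrix, the episode_title column name, and the three terms are made-up stand-ins for the real files.

```python
import pandas as pd
from scipy import sparse


def check_alignment(bow_matrix, metadata, vocabulary):
    """Return (n_docs, n_terms), raising ValueError if the files do not line up."""
    n_docs, n_terms = bow_matrix.shape
    if n_docs != len(metadata):
        raise ValueError(
            f"matrix has {n_docs} rows but metadata has {len(metadata)} rows"
        )
    if n_terms != len(vocabulary):
        raise ValueError(
            f"matrix has {n_terms} columns but vocabulary has {len(vocabulary)} terms"
        )
    return n_docs, n_terms


# Toy stand-ins for the three files (hypothetical values, not real data).
toy_bow = sparse.csr_matrix([[2, 0, 1], [0, 3, 0]])          # 2 documents x 3 terms
toy_metadata = pd.DataFrame({"episode_title": ["ep a", "ep b"]})  # assumed column name
toy_vocabulary = ["data", "podcast", "audio"]

shape = check_alignment(toy_bow, toy_metadata, toy_vocabulary)
print(shape)
```

With the real files, the same function would be called on the objects loaded in the Usage section.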

 

Usage:

 
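from scipy import sparse
import pandas as pd

# Sparse document-term matrix (rows: documents, columns: terms)
bow_matrix = sparse.load_npz('bow_matrix.npz')
# Row i of metadata describes row i of the matrix
metadata = pd.read_csv('metadata.csv')
# Entry j of the vocabulary list is the term for column j of the matrix
vocabulary = pd.read_csv('vocabulary.csv')['term'].tolist()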
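
Once the three objects are loaded, a common first step is to list the most frequent terms of a single document. The sketch below is one possible way to do this, shown on a toy matrix so that it runs without the dataset files; top_terms is a hypothetical helper and the toy values are invented for illustration.

```python
import numpy as np
from scipy import sparse


def top_terms(bow_matrix, vocabulary, row, k=10):
    """Return up to k (term, count) pairs for one document, most frequent first."""
    csr = sparse.csr_matrix(bow_matrix)       # ensure efficient row access
    counts = csr[row].toarray().ravel()       # dense 1-D count vector for this row
    order = np.argsort(counts, kind="stable")[::-1][:k]
    # Skip terms that do not occur in the document at all.
    return [(vocabulary[i], int(counts[i])) for i in order if counts[i] > 0]


# Toy stand-ins (hypothetical values, not real data).
toy_bow = sparse.csr_matrix([[2, 0, 1], [0, 3, 0]])
toy_vocabulary = ["data", "podcast", "audio"]

first_doc = top_terms(toy_bow, toy_vocabulary, row=0, k=2)
second_doc = top_terms(toy_bow, toy_vocabulary, row=1, k=2)
print(first_doc)
print(second_doc)
```

With the real files, top_terms(bow_matrix, vocabulary, row=0) would apply the same logic to the first document, and the matching row of metadata identifies which episode it is.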


Related publication:

For a detailed description of the data collection and transcription process, see:

Verreyen, L. (2025). Podcasts as Data: Building a Dataset for Large-Scale Audio Content Analysis. In T. Arnold, M. Fantoli, & R. Ros (Eds.), Computational Humanities Research 2025 (Vol. 3, pp. 231–248). Anthology of Computers and the Humanities. https://doi.org/10.63744/QgeF94c0fP7D

Files

Files (44.5 MB)

Checksum and size of each file:

  • md5:e5ae7bb9afc3df05040c980473992bdb (40.1 MB)
  • md5:4d13e439d0d8c383775b90d6d72cd7a5 (1.3 MB)
  • md5:75f21483f332b82d2e7bf93f2b67ee24 (3.0 MB)

Additional details

Related works

Is described by
Conference paper: 10.63744/QgeF94c0fP7D (DOI)