iDRAMA-rumble-2024: A Dataset of Podcasts from Rumble Spanning 2020 to 2022

Balci, Utkucan; Patel, Jay; Balci, Berkan; Blackburn, Jeremy

doi:10.5281/zenodo.10515991

Published June 4, 2024 | Version 1.0.0

Dataset Open

1. Binghamton University
2. Middle East Technical University

ABSTRACT

---------------
Rumble has emerged as a prominent platform hosting controversial figures facing restrictions on YouTube. Despite this, the academic community’s engagement with Rumble has been minimal. To help researchers address this gap, we introduce a comprehensive dataset of about 6.7K podcast videos from August 2020 to December 2022, amounting to over 5.6K hours of content. Besides covering metadata of these podcast videos, we provide speech-to-text transcriptions for future analysis. We also provide speaker diarization information, a collection of ~250K unique representative images from podcast videos, and face embeddings of ~400K extracted faces. With the rise of the influence of podcasts and populist figures, this dataset provides a rich resource for identifying challenges in cyber social threats in a relatively underexplored space.

Rumble platform: http://rumble.com/
Link to paper: https://workshop-proceedings.icwsm.org/abstract.php?id=2024_07
License: CC BY-NC-SA 4.0

Dataset Summary

iDRAMA-rumble-2024 is a large-scale dataset of 6,735 podcast videos from Rumble, an alternative YouTube-like platform. Using state-of-the-art models, we extract information across three modalities: 1) text, 2) audio, and 3) video. The paper details the methodology for extracting information from podcast videos. We release a first-of-its-kind dataset that includes data from the following modalities:

Metadata: Details about podcast videos, e.g., channel name, video name, video description, and more.
Text: Transcription (i.e., speech-to-text) of podcast videos.
Audio: Speaker diarization information providing speaker detection over time for each video.
Video: Sampled representative video frames from each video, totaling ~250K images (252,387). We also detect ~400K non-unique faces in these images and release their face embeddings.

Repository links

Zenodo: On Zenodo, we provide the dataset for all modalities in JSON format (`.ndjson` files), along with the representative images in compressed archives.
GitHub: The main repository of this dataset, where we provide code snippets for getting started.
- Link here: https://github.com/idramalab/iDRAMA-rumble-2024
Hugging Face: On Hugging Face, we provide the dataset in `parquet` format, accessible through the Hugging Face APIs.
- Link here: https://hf.co/datasets/iDRAMALab/iDRAMA-rumble-2024
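
As a starting point for working with the Zenodo files, the following is a minimal sketch of a streaming reader. It assumes that the `.ndjson` extension means newline-delimited JSON with one JSON object per line; the record fields are not documented on this page, so the sketch does not rely on any particular key. The helper names `iter_ndjson` and `iter_parts` are hypothetical and not part of the official repository.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def iter_ndjson(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of an NDJSON file.

    Assumption: every non-empty line holds a single JSON object.
    """
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines such as a trailing newline
                yield json.loads(line)


def iter_parts(paths: Iterable[Path]) -> Iterator[dict]:
    """Chain the records of several part files in the order given."""
    for path in paths:
        yield from iter_ndjson(path)
```

Reading line by line keeps memory use flat, which matters for the multi-gigabyte transcription files. For example, `iter_parts` could be given the three `iDRAMA-rumble-2024-transcription-lite_part_*.ndjson` paths after downloading them to a local directory.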

Dataset Info

The dataset is organized by modalities -- transcripts, representative images, speaker diarization, and face embeddings.

Config	Data-points
Podcast videos	6,735
Representative images	252,387
Face embeddings	399,333
Transcripts & Speaker diarization	6,735

Zenodo Dataset Files Info

Content	#Files	File names
Metadata	1	iDRAMA-rumble-2024-metadata.ndjson
Speaker diarization	1	iDRAMA-rumble-2024-speaker-dirization.zip
Face embeddings	1	iDRAMA-rumble-2024-face-embeddings.ndjson
Representative images	5	iDRAMA-rumble-2024-repr-images-set1.tar.gz iDRAMA-rumble-2024-repr-images-set2.tar.gz iDRAMA-rumble-2024-repr-images-set3.tar.gz iDRAMA-rumble-2024-repr-images-set4.tar.gz iDRAMA-rumble-2024-repr-images-set5.tar.gz
Transcription Lite (Minimal information)	3	iDRAMA-rumble-2024-transcription-lite_part_1.ndjson iDRAMA-rumble-2024-transcription-lite_part_2.ndjson iDRAMA-rumble-2024-transcription-lite_part_3.ndjson
Transcription	3	iDRAMA-rumble-2024-transcription_part_1.ndjson iDRAMA-rumble-2024-transcription_part_2.ndjson iDRAMA-rumble-2024-transcription_part_3.ndjson
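
The representative images ship as five `.tar.gz` archives of several gigabytes each. The sketch below shows one way to walk an archive without unpacking it to disk first. The internal layout of the archives (directory structure, image file extensions) is not described on this page, so the code only assumes that regular files inside the archive are the images; the function name `iter_archive_members` is hypothetical.

```python
import tarfile
from pathlib import Path
from typing import Iterator, Tuple


def iter_archive_members(archive: Path) -> Iterator[Tuple[str, bytes]]:
    """Yield (member name, raw bytes) for each regular file in a .tar.gz archive."""
    # "r:gz" opens a gzip-compressed tar archive for reading.
    with tarfile.open(archive, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            extracted = tar.extractfile(member)
            if extracted is None:
                continue
            with extracted:
                yield member.name, extracted.read()
```

Each yielded byte string can then be passed to an image library of your choice. Processing one member at a time avoids holding a whole archive in memory.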

Authorship

This dataset is published in the "Workshop Proceedings of the 18th International AAAI Conference on Web and Social Media" hosted in Buffalo, NY, USA.

Academic Organization: iDRAMA Lab
Authors: Utkucan Balci, Jay Patel, Berkan Balci, Jeremy Blackburn
Affiliation: Binghamton University, Middle East Technical University

Licensing

This dataset is free to use under the terms of the non-commercial license CC BY-NC-SA 4.0.

Citation

@article{balci2024idrama,
title = {iDRAMA-rumble-2024: A Dataset of Podcasts from Rumble Spanning 2020 to 2022},
author = {Balci, Utkucan and Patel, Jay and Balci, Berkan and Blackburn, Jeremy},
year = {2024},
journal = {Workshop Proceedings of the 18th International AAAI Conference on Web and Social Media}
}

Files (39.6 GB)

Name	MD5	Size
iDRAMA-rumble-2024-face-embeddings.ndjson	a655143be18c5ad6b48e336aae28ecf4	2.8 GB
iDRAMA-rumble-2024-metadata.ndjson	0d85f705e75966e4166a7b4ac7472e27	10.8 MB
iDRAMA-rumble-2024-repr-images-set1.tar.gz	d58a8060ac1229316c306121f7a2e1fa	4.9 GB
iDRAMA-rumble-2024-repr-images-set2.tar.gz	2902cecd47292d3e6704e7df792acb4a	5.1 GB
iDRAMA-rumble-2024-repr-images-set3.tar.gz	4c34e9aec95877ad3391a538680208c0	4.7 GB
iDRAMA-rumble-2024-repr-images-set4.tar.gz	cfb7d6d800a51ef74b9c118b31cd01cf	5.0 GB
iDRAMA-rumble-2024-repr-images-set5.tar.gz	f3d364c6984213c64d2cf34b104c493b	3.7 GB
iDRAMA-rumble-2024-speaker-dirization.zip	4fd3034ee02a7c69637145fe0325aa3c	176.2 MB
iDRAMA-rumble-2024-transcription-lite_part_1.ndjson	a24bdf5e9406e6180b2469c0e1457624	1.8 GB
iDRAMA-rumble-2024-transcription-lite_part_2.ndjson	a0e75c7da818b1de1fa2d5355c2c563c	2.3 GB
iDRAMA-rumble-2024-transcription-lite_part_3.ndjson	1f57524bdc042f1f6a438b72beabb3a3	1.7 GB
iDRAMA-rumble-2024-transcription_part_1.ndjson	c3d700bcc891bc7c70ef0fb1e2ab2e5c	2.4 GB
iDRAMA-rumble-2024-transcription_part_2.ndjson	5c4b764bfd6bd2d81319a8bdd775a48c	3.0 GB
iDRAMA-rumble-2024-transcription_part_3.ndjson	e2e721c853400ac7a6915c0b3bc31942	2.3 GB
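
Because the files are large, it can be worth checking a download against the MD5 values listed above before processing it. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `matches_checksum` and the 1 MiB chunk size are illustrative choices, not part of the dataset tooling.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Path, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with an expected hex string, ignoring case."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For instance, a downloaded `iDRAMA-rumble-2024-metadata.ndjson` should satisfy `matches_checksum(path, "0d85f705e75966e4166a7b4ac7472e27")` if the transfer completed without corruption.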

Additional details

DOI: 10.36190/2024.07

Repository URL: https://github.com/idramalab/iDRAMA-rumble-2024

