Published June 4, 2024 | Version 1.0.0
Dataset Open

iDRAMA-rumble-2024: A Dataset of Podcasts from Rumble Spanning 2020 to 2022

  • 1. Binghamton University
  • 2. Middle East Technical University

Description

ABSTRACT

---------------
Rumble has emerged as a prominent platform hosting controversial figures facing restrictions on YouTube. Despite this, the academic community’s engagement with Rumble has been minimal. To help researchers address this gap, we introduce a comprehensive dataset of about 6.7K podcast videos from August 2020 to December 2022, amounting to over 5.6K hours of content. Besides covering metadata of these podcast videos, we provide speech-to-text transcriptions for future analysis. We also provide speaker diarization information, a collection of ~250K unique representative images from podcast videos, and face embeddings of ~400K extracted faces. With the rise of the influence of podcasts and populist figures, this dataset provides a rich resource for identifying challenges in cyber social threats in a relatively underexplored space.

Dataset Summary

iDRAMA-rumble-2024 is a large-scale dataset of 6,735 podcast videos from Rumble, an alternative, YouTube-like platform. Using state-of-the-art models, we extract information across three modalities: 1) text, 2) audio, and 3) video. The paper details the methodology for extracting information from podcast videos. We release a first-of-its-kind dataset that includes data from different modalities:

  • Metadata: Details about podcast videos, e.g., channel name, video name, video description, and more.
  • Text: Transcription (i.e., speech-to-text) of podcast videos.
  • Audio: Speaker diarization information providing speaker detection over time for each video.
  • Video: Sampled representative video frames from each video, totaling ~250K images. We also detect ~400K non-unique faces in these images and release their face embeddings.

Dataset Info

The dataset is organized by modalities -- transcripts, representative images, speaker diarization, and face embeddings.

Config                               Data-points
Podcast videos                       6,735
Representative images                252,387
Face embeddings                      399,333
Transcripts & Speaker diarization    6,735

Zenodo Dataset Files Info

Category                                    #Files   File names
Metadata                                    1        iDRAMA-rumble-2024-metadata.ndjson
Speaker diarization                         1        iDRAMA-rumble-2024-speaker-dirization.zip
Face embeddings                             1        iDRAMA-rumble-2024-face-embeddings.ndjson
Representative images                       5        iDRAMA-rumble-2024-repr-images-set1.tar.gz
                                                     iDRAMA-rumble-2024-repr-images-set2.tar.gz
                                                     iDRAMA-rumble-2024-repr-images-set3.tar.gz
                                                     iDRAMA-rumble-2024-repr-images-set4.tar.gz
                                                     iDRAMA-rumble-2024-repr-images-set5.tar.gz
Transcription Lite (minimal information)    3        iDRAMA-rumble-2024-transcription-lite_part_1.ndjson
                                                     iDRAMA-rumble-2024-transcription-lite_part_2.ndjson
                                                     iDRAMA-rumble-2024-transcription-lite_part_3.ndjson
Transcription                               3        iDRAMA-rumble-2024-transcription_part_1.ndjson
                                                     iDRAMA-rumble-2024-transcription_part_2.ndjson
                                                     iDRAMA-rumble-2024-transcription_part_3.ndjson
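
Several of the files listed above are newline-delimited JSON (.ndjson), and the transcripts are split into three parts. The following is a minimal sketch of streaming such files one record at a time, so that a multi-gigabyte part never has to be held in memory. It assumes each non-empty line is a single JSON object; the helper names iter_ndjson and transcript_parts are illustrative and not part of the dataset, and the record fields are not documented on this page, so the example only prints the keys it finds.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_ndjson(paths: list[Path]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line, across all given files."""
    for path in paths:
        with path.open("r", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines between records
                    yield json.loads(line)


def transcript_parts(data_dir: Path, lite: bool = True) -> list[Path]:
    """Build the three part-file paths from the file names listed above."""
    stem = "transcription-lite" if lite else "transcription"
    return [
        data_dir / f"iDRAMA-rumble-2024-{stem}_part_{i}.ndjson"
        for i in (1, 2, 3)
    ]


if __name__ == "__main__":
    # Assumption: the downloaded files sit in the current directory.
    parts = [p for p in transcript_parts(Path(".")) if p.exists()]
    for record in iter_ndjson(parts):
        # Inspect the field names of the first record only.
        print(sorted(record))
        break
```

The same iter_ndjson helper should apply to the metadata and face-embeddings files, under the same one-object-per-line assumption.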

Authorship

This dataset is published in the "Workshop Proceedings of the 18th International AAAI Conference on Web and Social Media"; the conference was hosted in Buffalo, NY, USA.

  • Academic Organization: iDRAMA Lab
  • Authors: Utkucan Balci, Jay Patel, Berkan Balci, Jeremy Blackburn
  • Affiliation: Binghamton University, Middle East Technical University

Licensing

This dataset is free to use under the terms of the non-commercial license CC BY-NC-SA 4.0.

Citation

@article{balci2024idrama,
  title  = {iDRAMA-rumble-2024: A Dataset of Podcasts from Rumble Spanning 2020 to 2022},
  author = {Balci, Utkucan and Patel, Jay and Balci, Berkan and Blackburn, Jeremy},
  year   = {2024},
  journal = {Workshop Proceedings of the 18th International AAAI Conference on Web and Social Media}
}

Files

Files (39.6 GB)

Checksum                                Size
md5:a655143be18c5ad6b48e336aae28ecf4    2.8 GB
md5:0d85f705e75966e4166a7b4ac7472e27    10.8 MB
md5:d58a8060ac1229316c306121f7a2e1fa    4.9 GB
md5:2902cecd47292d3e6704e7df792acb4a    5.1 GB
md5:4c34e9aec95877ad3391a538680208c0    4.7 GB
md5:cfb7d6d800a51ef74b9c118b31cd01cf    5.0 GB
md5:f3d364c6984213c64d2cf34b104c493b    3.7 GB
md5:4fd3034ee02a7c69637145fe0325aa3c    176.2 MB
md5:a24bdf5e9406e6180b2469c0e1457624    1.8 GB
md5:a0e75c7da818b1de1fa2d5355c2c563c    2.3 GB
md5:1f57524bdc042f1f6a438b72beabb3a3    1.7 GB
md5:c3d700bcc891bc7c70ef0fb1e2ab2e5c    2.4 GB
md5:5c4b764bfd6bd2d81319a8bdd775a48c    3.0 GB
md5:e2e721c853400ac7a6915c0b3bc31942    2.3 GB
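
Because the archives are several gigabytes each, it may be worth checking a download against its md5 value before unpacking it. The sketch below hashes a file in fixed-size chunks and compares the result with an expected checksum. The function names md5_of_file and matches_checksum are illustrative, and str.removeprefix assumes Python 3.9 or later. The listing above does not pair checksums with file names, so the expected value for a given file has to be taken from the record page itself.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Path, expected: str) -> bool:
    """Compare a file's MD5 with a value written as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.strip().lower().removeprefix("md5:")
    return md5_of_file(path) == expected_hex
```

A call would look like matches_checksum(Path("some-downloaded-file"), "md5:<value from the record page>"), where both arguments are placeholders for the file and checksum being checked.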
