iDRAMA-rumble-2024: A Dataset of Podcasts from Rumble Spanning 2020 to 2022
Description
ABSTRACT
---------------
Rumble has emerged as a prominent platform hosting controversial figures facing restrictions on YouTube. Despite this, the academic community’s engagement with Rumble has been minimal. To help researchers address this gap, we introduce a comprehensive dataset of about 6.7K podcast videos from August 2020 to December 2022, amounting to over 5.6K hours of content. Besides covering metadata of these podcast videos, we provide speech-to-text transcriptions for future analysis. We also provide speaker diarization information, a collection of ~250K unique representative images from podcast videos, and face embeddings of ~400K extracted faces. With the rise of the influence of podcasts and populist figures, this dataset provides a rich resource for identifying challenges in cyber social threats in a relatively underexplored space.
- Rumble platform: http://rumble.com/
- Link to paper: https://workshop-proceedings.icwsm.org/abstract.php?id=2024_07
- License: CC BY-NC-SA 4.0
Dataset Summary
iDRAMA-rumble-2024 is a large-scale dataset of 6,735 podcast videos from Rumble, an alternative YouTube-like platform. Using state-of-the-art models, we extract information across three modalities: 1) text, 2) audio, and 3) video. We detail the methodology for extracting information from podcast videos in the paper and release a first-of-its-kind dataset that includes data from the following modalities:
- Metadata: Details about podcast videos, e.g., channel name, video name, video description, and more.
- Text: Transcription (i.e., speech-to-text) of podcast videos.
- Audio: Speaker diarization information providing speaker detection over time for each video.
- Video: Sampled representative video frames from each video, totaling ~250K images. We also detect ~400K non-unique faces in these images and release their face embeddings.
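One common way to use face embeddings is to compare two faces by the cosine similarity of their vectors. The following is a minimal sketch of that comparison, not the method used in the paper. The embedding model, the vector dimension, and any similarity threshold are not stated here, so the function works on plain lists of floats and the name `cosine_similarity` is purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Similarity is undefined for a zero vector; return 0.0 by convention.
        return 0.0
    return dot / (norm_a * norm_b)
```

Values close to 1 indicate vectors pointing in nearly the same direction. Which cut-off separates "same person" from "different person" depends on the embedding model and should be chosen empirically.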
Repository links
- Zenodo: On Zenodo, we provide the dataset in JSON format for all modalities, along with the representative images as compressed files.
- GitHub: The main repository of this dataset, where we provide code snippets for getting started with the data.
- Hugging Face: On Hugging Face, we provide the dataset in `parquet` format, accessible through the Hugging Face APIs.
Dataset Info
The dataset is organized by modalities -- transcripts, representative images, speaker diarization, and face embeddings.
Config | Data-points |
---|---|
Podcast videos | 6,735 |
Representative images | 252,387 |
Face embeddings | 399,333 |
Transcripts & Speaker diarization | 6,735 |
Zenodo Dataset Files Info
Data | #Files | File names |
---|---|---|
Metadata | 1 | iDRAMA-rumble-2024-metadata.ndjson |
Speaker diarization | 1 | iDRAMA-rumble-2024-speaker-dirization.zip |
Face embeddings | 1 | iDRAMA-rumble-2024-face-embeddings.ndjson |
Representative images | 5 | iDRAMA-rumble-2024-repr-images-set1.tar.gz iDRAMA-rumble-2024-repr-images-set2.tar.gz iDRAMA-rumble-2024-repr-images-set3.tar.gz |
Transcription Lite (minimal information) | 3 | iDRAMA-rumble-2024-transcription-lite_part_1.ndjson |
Transcription | 3 | iDRAMA-rumble-2024-transcription_part_1.ndjson |
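The `.ndjson` files listed above store one JSON object per line, so they can be streamed without loading a whole file into memory. The sketch below shows one way to do that with the Python standard library. It makes no assumption about field names, which are not documented here. The helpers `iter_ndjson` and `peek` are illustrative names, not part of the official repository code.

```python
import itertools
import json


def iter_ndjson(path):
    """Yield one parsed record per non-empty line of an NDJSON file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


def peek(path, n=3):
    """Return the first n records of an NDJSON file as a list."""
    return list(itertools.islice(iter_ndjson(path), n))
```

For example, calling `peek("iDRAMA-rumble-2024-metadata.ndjson", 1)` on a downloaded copy of the metadata file returns its first record, which is a quick way to inspect the available keys before processing the full file.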
Authorship
This dataset was published in the "Workshop Proceedings of the 18th International AAAI Conference on Web and Social Media", held in Buffalo, NY, USA.
- Academic Organization: iDRAMA Lab
- Authors: Utkucan Balci, Jay Patel, Berkan Balci, Jeremy Blackburn
- Affiliation: Binghamton University, Middle East Technical University
Licensing
This dataset is free to use under the terms of the non-commercial CC BY-NC-SA 4.0 license.
Citation
@article{balci2024idrama,
title = {iDRAMA-rumble-2024: A Dataset of Podcasts from Rumble Spanning 2020 to 2022},
author = {Balci, Utkucan and Patel, Jay and Balci, Berkan and Blackburn, Jeremy},
year = {2024},
journal = {Workshop Proceedings of the 18th International AAAI Conference on Web and Social Media}
}
Files
(39.6 GB)
Checksum | Size |
---|---|
md5:a655143be18c5ad6b48e336aae28ecf4 | 2.8 GB |
md5:0d85f705e75966e4166a7b4ac7472e27 | 10.8 MB |
md5:d58a8060ac1229316c306121f7a2e1fa | 4.9 GB |
md5:2902cecd47292d3e6704e7df792acb4a | 5.1 GB |
md5:4c34e9aec95877ad3391a538680208c0 | 4.7 GB |
md5:cfb7d6d800a51ef74b9c118b31cd01cf | 5.0 GB |
md5:f3d364c6984213c64d2cf34b104c493b | 3.7 GB |
md5:4fd3034ee02a7c69637145fe0325aa3c | 176.2 MB |
md5:a24bdf5e9406e6180b2469c0e1457624 | 1.8 GB |
md5:a0e75c7da818b1de1fa2d5355c2c563c | 2.3 GB |
md5:1f57524bdc042f1f6a438b72beabb3a3 | 1.7 GB |
md5:c3d700bcc891bc7c70ef0fb1e2ab2e5c | 2.4 GB |
md5:5c4b764bfd6bd2d81319a8bdd775a48c | 3.0 GB |
md5:e2e721c853400ac7a6915c0b3bc31942 | 2.3 GB |
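Because several of the files are multiple gigabytes, it can be worth checking a download against its published MD5 checksum before use. The sketch below is one way to do this with `hashlib`, reading the file in chunks so memory use stays small. The function names are illustrative, and the mapping from each checksum to a file name must be taken from the Zenodo record itself, since it is not reproduced in the listing above.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value such as 'md5:<hex digest>'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]  # drop the 'md5:' prefix used in the listing
    return md5_of_file(path) == expected
```

A `False` result from `matches_checksum` usually points to an incomplete or corrupted download, in which case the file should be fetched again.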
Additional details
Identifiers
- DOI: 10.36190/2024.07
Software
- Repository URL: https://github.com/idramalab/iDRAMA-rumble-2024