Dataset Restricted Access

Dataset for: "It is just a flu: Assessing the Effect of Watch History on YouTube's Pseudoscientific Video Recommendations"

Kostantinos Papadamou; Savvas Zannettou; Jeremy Blackburn; Emiliano De Cristofaro; Gianluca Stringhini; Michael Sirivianos


Abstract: 

The role played by YouTube’s recommendation algorithm in unwittingly promoting misinformation and conspiracy theories is not entirely understood. Yet, this can have dire real-world consequences, especially when pseudoscientific content is promoted to users at critical times, such as the COVID-19 pandemic. In this paper, we set out to characterize and detect pseudoscientific misinformation on YouTube. We collect 6.6K videos related to COVID-19, the Flat Earth theory, as well as the anti-vaccination and anti-mask movements. Using crowdsourcing, we annotate them as pseudoscience, legitimate science, or irrelevant and train a deep learning classifier to detect pseudoscientific videos with an accuracy of 0.79.

We quantify user exposure to this content on various parts of the platform and how this exposure changes based on the user’s watch history. We find that YouTube suggests more pseudoscientific content regarding traditional pseudoscientific topics (e.g., flat earth, anti-vaccination) than for emerging ones (like COVID-19). At the same time, these recommendations are more common on the search results page than on a user’s homepage or in the recommendation section when actively watching videos. Finally, we shed light on how a user’s watch history substantially affects the type of recommended videos.

 

Dataset Files

The dataset consists of three files: the metadata, comments, and captions of the ground-truth dataset videos collected and manually reviewed in this paper.

1. Video Metadata

  • "groundtruth_videos.json": Contains the metadata of our manually reviewed ground-truth dataset videos. The ground-truth dataset includes 1,197 science, 1,325 pseudoscience, and 3,212 irrelevant videos. More specifically, it includes the metadata of videos related to the following pseudoscientific topics:
    • COVID-19 (607 science, 368 pseudoscience, and 721 irrelevant videos)
    • Anti-vaccination (363 science, 394 pseudoscience, and 1,060 irrelevant videos)
    • Anti-mask (65 science, 188 pseudoscience, and 724 irrelevant videos)
    • Flat Earth (162 science, 375 pseudoscience, and 707 irrelevant videos)
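As a quick consistency check, the per-topic counts listed above can be summed and compared with the stated totals. This is a minimal sketch: the numbers are copied from the list, while the names COUNTS and totals are illustrative and not part of the dataset.

```python
# Per-topic counts copied from the list above, as
# (science, pseudoscience, irrelevant). Names here are illustrative only.
COUNTS = {
    "COVID-19": (607, 368, 721),
    "Anti-vaccination": (363, 394, 1060),
    "Anti-mask": (65, 188, 724),
    "Flat Earth": (162, 375, 707),
}


def totals(counts):
    """Sum each label column across all topics."""
    science = sum(c[0] for c in counts.values())
    pseudoscience = sum(c[1] for c in counts.values())
    irrelevant = sum(c[2] for c in counts.values())
    return science, pseudoscience, irrelevant


science, pseudoscience, irrelevant = totals(COUNTS)
print(science, pseudoscience, irrelevant, science + pseudoscience + irrelevant)
# prints: 1197 1325 3212 5734
```

The column sums match the totals stated for the ground-truth dataset (1,197 science, 1,325 pseudoscience, and 3,212 irrelevant videos), which gives 5,734 videos overall.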

Note that 600 of the videos in this dataset also include the "annotation.manual_review_label" attribute, which is the label assigned by the first author of this paper to evaluate the performance of the crowdsourced annotation process.
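One way to use the manually reviewed subset is to measure how often the crowdsourced label agrees with the manual review label. The sketch below is hedged: it assumes each video is a dictionary, and because this description does not say whether "annotation.label" is stored as a flat dotted key or as a nested "annotation" object, the hypothetical helper get_field tries both. The label strings in the toy records are also assumptions.

```python
def get_field(record, dotted_key):
    """Read 'a.b' as a literal flat key or as nested dicts; None if missing."""
    if dotted_key in record:
        return record[dotted_key]
    value = record
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def review_agreement(records):
    """Return (n_reviewed, n_agree) over records with a manual review label."""
    reviewed = agree = 0
    for record in records:
        manual = get_field(record, "annotation.manual_review_label")
        if manual is None:
            continue  # video was not part of the manually reviewed subset
        reviewed += 1
        if manual == get_field(record, "annotation.label"):
            agree += 1
    return reviewed, agree


# Toy records (not real data) covering both assumed layouts.
toy = [
    {"annotation": {"label": "science", "manual_review_label": "science"}},
    {"annotation.label": "irrelevant",
     "annotation.manual_review_label": "pseudoscience"},
    {"annotation": {"label": "irrelevant"}},
]
print(review_agreement(toy))  # prints: (2, 1)
```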

- Video Metadata Description:

  • "search_term": The search terms used to search YouTube and retrieve these videos during our data collection. It can be one of the following search terms: 'covid-19', 'coronavirus', 'anti-vaccination', 'anti-vaxx', 'anti-mask', or 'flat earth'.
  • "annotation.annotations": The list of the three annotations assigned to each video by our crowdsourced annotators.
  • "annotation.label": The annotation label assigned to the video based on the majority agreement of the crowdsourced annotators.
  • "annotation.manual_review_label": The label assigned by the first author of this paper to evaluate the performance of the crowdsourced annotation process.
  • "isSeed": 0 if the video is a seed video of our data collection, 1 if it is a recommended video of a seed video.
  • "relatedVideos": The recommended videos of the given video as returned by the YouTube Data API.
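The attributes described above could be read along the following lines. This is a sketch under stated assumptions, not the authors' code: the on-disk layout of "groundtruth_videos.json" is not specified here, so the hypothetical load_records accepts either a single JSON array or one JSON object per line, and majority_label assumes each entry of "annotation.annotations" is a plain label string. How a three-way disagreement between annotators was resolved is not stated, so the sketch returns None in that case.

```python
import json
from collections import Counter


def load_records(path):
    """Load video records from a JSON array file or a JSON Lines file."""
    with open(path, encoding="utf-8") as fh:
        text = fh.read()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Assumed fallback: one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return data if isinstance(data, list) else [data]


def majority_label(annotations):
    """Return the label chosen by a strict majority of annotators, else None."""
    if not annotations:
        return None
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes > len(annotations) / 2 else None


# Toy annotation list (label strings are assumed, not taken from the file).
print(majority_label(["pseudoscience", "pseudoscience", "science"]))
# prints: pseudoscience
```

Reading the whole file into memory keeps the sketch short; for a file of several hundred megabytes, a streaming parser may be preferable.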

2. Video Comments

  • "groundtruth_videos_comments_ids.json": Includes the identifiers of the comments of our ground-truth videos.

3. Video Transcripts

  • "groundtruth_videos_transcripts.json": Includes the captions of our ground-truth videos.

If you use this dataset in any publication, of any form or kind, please cite the following paper:

@article{papadamou2020just,
    title={'It is just a flu': Assessing the Effect of Watch History on YouTube's Pseudoscientific Video Recommendations},
    author={Papadamou, Kostantinos and Zannettou, Savvas and Blackburn, Jeremy and De Cristofaro, Emiliano and Stringhini, Gianluca and Sirivianos, Michael},
    journal={arXiv preprint arXiv:2010.11638},
    year={2020}
}
Acknowledgments: This project has received funding from the European Union's Horizon 2020 Research and Innovation program under the CONCORDIA project (Grant Agreement No. 830927), and from the Innovation and Networks Executive Agency (INEA) under the CYberSafety II project (Grant Agreement No. 1614254). This work reflects only the authors' views; the funding agencies are not responsible for any use that may be made of the information it contains.
Restricted Access

You may request access to the files in this upload, provided that you fulfil the conditions below. The decision whether to grant/deny access is solely under the responsibility of the record owner.


Before we can share the dataset with you, please agree to the following terms:

  1. You will not attempt to use this data to de-anonymize, in any way, any users in this or any other dataset.
  2. You will not re-share the dataset with anyone else not included in this request.
  3. You will appropriately cite the "It is just a flu: Assessing the Effect of Watch History on YouTube’s Pseudoscientific Video Recommendations" paper in any publication, of any form and kind, using this data:
@article{papadamou2020just,
    title={'It is just a flu': Assessing the Effect of Watch History on YouTube's Pseudoscientific Video Recommendations},
    author={Papadamou, Kostantinos and Zannettou, Savvas and Blackburn, Jeremy and De Cristofaro, Emiliano and Stringhini, Gianluca and Sirivianos, Michael},
    journal={arXiv preprint arXiv:2010.11638},
    year={2020}
}

 

 

 

 

 

