Dataset Restricted Access

Dataset for: "It is just a flu: Assessing the Effect of Watch History on YouTube's Pseudoscientific Video Recommendations"

Kostantinos Papadamou; Savvas Zannettou; Jeremy Blackburn; Emiliano De Cristofaro; Gianluca Stringhini; Michael Sirivianos


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nmm##2200000uu#4500</leader>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2021-02-24</subfield>
  </datafield>
  <controlfield tag="005">20210822130350.0</controlfield>
  <datafield tag="500" ind1=" " ind2=" ">
    <subfield code="a">Acknowledgments: This project has received funding from the European Union's Horizon 2020 Research and Innovation program under the CONCORDIA project (Grant Agreement No. 830927), and from the Innovation and Networks Executive Agency (INEA) under the CYberSafety II project (Grant Agreement No. 1614254). This work reflects only the authors' views; the funding agencies are not responsible for any use that may be made of the information it contains.</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isSupplementTo</subfield>
    <subfield code="a">10.5281/zenodo.4769863</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isSupplementedBy</subfield>
    <subfield code="a">10.5281/zenodo.4580999</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.4558468</subfield>
  </datafield>
  <controlfield tag="001">4769731</controlfield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire_data</subfield>
    <subfield code="o">oai:zenodo.org:4769731</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;&lt;strong&gt;Dataset for the paper: &amp;quot;It is just a flu: Assessing the Effect of Watch History on YouTube&amp;rsquo;s Pseudoscientific Video Recommendations&amp;quot;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&amp;nbsp;&lt;/p&gt;

&lt;p&gt;The role played by YouTube&amp;rsquo;s recommendation algorithm in unwittingly promoting misinformation and conspiracy theories is not entirely understood.&amp;nbsp;Yet, this can have dire real-world consequences, especially when pseudoscientific content is promoted to users at critical times, such as the COVID-19 pandemic.&amp;nbsp;In this paper, we set out to characterize and detect pseudoscientific misinformation on YouTube. We collect 6.6K videos related to COVID-19, the Flat Earth theory, as well as the anti-vaccination and anti-mask movements.&amp;nbsp;Using crowdsourcing, we annotate them as pseudoscience, legitimate science, or irrelevant and train a deep learning classifier to detect pseudoscientific videos with an accuracy of 0.79.&lt;/p&gt;

&lt;p&gt;We quantify user exposure to this content on various parts of the platform and how this exposure changes based on the user&amp;rsquo;s watch history. We find that YouTube suggests more pseudoscientific content regarding traditional pseudoscientific topics (e.g., flat earth, anti-vaccination) than for emerging ones (like COVID-19).&amp;nbsp;At the same time, these recommendations are more common on the search results page than on a user&amp;rsquo;s homepage or in the recommendation section when actively watching videos. Finally, we shed light on how a user&amp;rsquo;s watch history substantially affects the type of recommended videos.&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dataset Files&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The dataset consists of three files: the metadata, comments, and captions of the ground-truth dataset&amp;nbsp;videos collected and manually reviewed in this paper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Video Metadata&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;&lt;strong&gt;&amp;quot;groundtruth_videos.json&amp;quot;:&lt;/strong&gt;&amp;nbsp;Contains the metadata of our manually reviewed ground-truth dataset videos. The ground-truth dataset includes 1,197 science, 1,325 pseudoscience, and 3,212 irrelevant videos. More specifically, it includes the metadata of videos related to the following pseudoscientific topics:

	&lt;ul&gt;
		&lt;li&gt;COVID-19 (607 science, 368 pseudoscience, and 721 irrelevant videos)&lt;/li&gt;
		&lt;li&gt;Anti-vaccination (363 science, 394 pseudoscience, and 1,060 irrelevant videos)&lt;/li&gt;
		&lt;li&gt;Anti-mask (65 science, 188 pseudoscience, and 724 irrelevant videos)&lt;/li&gt;
		&lt;li&gt;Flat Earth (162 science, 375 pseudoscience, and 707 irrelevant videos)&lt;/li&gt;
	&lt;/ul&gt;
	&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that 600 of the videos in this dataset also include the &lt;em&gt;&lt;strong&gt;&amp;quot;annotation.manual_review_label&amp;quot;&lt;/strong&gt;&lt;/em&gt;&amp;nbsp;attribute, which is the label assigned by the first author of this paper to evaluate the performance of the crowdsourced annotation process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Video Metadata Description:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;&lt;em&gt;&amp;quot;search_term&amp;quot;&lt;/em&gt;: The search term used to query YouTube and retrieve this video during our data collection. It is one of the following: &amp;#39;covid-19&amp;#39;, &amp;#39;coronavirus&amp;#39;, &amp;#39;anti-vaccination&amp;#39;, &amp;#39;anti-vaxx&amp;#39;, &amp;#39;anti-mask&amp;#39;, or &amp;#39;flat earth&amp;#39;.&lt;/li&gt;
	&lt;li&gt;&lt;em&gt;&amp;quot;annotation.annotations&amp;quot;&lt;/em&gt;: The list of the three annotations assigned to each video by our crowdsourced annotators.&lt;/li&gt;
	&lt;li&gt;&lt;em&gt;&amp;quot;annotation.label&amp;quot;&lt;/em&gt;: The annotation label assigned to the video based on the majority agreement of the crowdsourced annotators.&lt;/li&gt;
	&lt;li&gt;&lt;em&gt;&amp;quot;annotation.manual_review_label&amp;quot;&lt;/em&gt;: The label assigned by the first author of this paper to evaluate the performance of the crowdsourced annotation process.&lt;/li&gt;
	&lt;li&gt;&lt;em&gt;&amp;quot;isSeed&amp;quot;&lt;/em&gt;: 0 if the video is a seed video of our data collection, 1 if it is a recommended video of a seed video.&lt;/li&gt;
	&lt;li&gt;&lt;em&gt;&amp;quot;relatedVideos&amp;quot;&lt;/em&gt;: The recommended videos of the given video as returned by the YouTube Data API.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Video Comments:&amp;nbsp;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;&lt;strong&gt;&amp;quot;groundtruth_videos_comments_ids.json&amp;quot;:&lt;/strong&gt;&amp;nbsp;Includes the identifiers of the comments of our ground-truth videos.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Video Transcripts:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;&lt;strong&gt;&amp;quot;groundtruth_videos_transcripts.json&amp;quot;:&lt;/strong&gt; Includes the captions of our ground-truth videos.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you use this dataset in a publication of any form or kind, please cite the paper using the following BibTeX entry:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;@article{papadamou2020just,
    title={'It is just a flu': Assessing the Effect of Watch History on YouTube's Pseudoscientific Video Recommendations},
    author={Papadamou, Kostantinos and Zannettou, Savvas and Blackburn, Jeremy and De Cristofaro, Emiliano and Stringhini, Gianluca and Sirivianos, Michael},
    journal={arXiv preprint arXiv:2010.11638},
    year={2020}
}&lt;/code&gt;&lt;/pre&gt;</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Max Planck Institute</subfield>
    <subfield code="a">Savvas Zannettou</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Binghamton University</subfield>
    <subfield code="a">Jeremy Blackburn</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">University College London</subfield>
    <subfield code="a">Emiliano De Cristofaro</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Boston University</subfield>
    <subfield code="a">Gianluca Stringhini</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Cyprus University of Technology</subfield>
    <subfield code="a">Michael Sirivianos</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">restricted</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">dataset</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">Cyprus University of Technology</subfield>
    <subfield code="a">Kostantinos Papadamou</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">YouTube</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">YouTube Videos</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">YouTube's Recommendation Algorithm</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Science</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Pseudoscience</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Pseudoscientific Misinformation</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Watch History</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">COVID-19</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Anti-vaccination</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Anti-mask</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Flat Earth</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.4769731</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Dataset for: "It is just a flu: Assessing the Effect of Watch History on YouTube's Pseudoscientific Video Recommendations"</subfield>
  </datafield>
  <datafield tag="536" ind1=" " ind2=" ">
    <subfield code="c">830927</subfield>
    <subfield code="a">Cyber security cOmpeteNce fOr Research anD Innovation</subfield>
  </datafield>
</record>