Auditing YouTube’s Recommendation Algorithm for Misinformation Filter Bubbles

In this article, we present the results of an auditing study performed on YouTube, aimed at investigating how fast a user can get into a misinformation filter bubble, but also what it takes to “burst the bubble,” i.e., revert the bubble enclosure. We employ a sock puppet audit methodology, in which pre-programmed agents (acting as YouTube users) delve into misinformation filter bubbles by watching misinformation-promoting content. Then they try to burst the bubbles and reach more balanced recommendations by watching misinformation-debunking content. We record search results, home page results, and recommendations for the watched videos. Overall, we recorded 17,405 unique videos, out of which we manually annotated 2,914 for the presence of misinformation. The labeled data was used to train a machine learning model classifying videos into three classes (promoting, debunking, neutral) with an accuracy of 0.82. We use the trained model to classify the remaining videos that would not be feasible to annotate manually. Using both the manually and automatically annotated data, we observe the misinformation bubble dynamics for a range of audited topics. Our key finding is that even though filter bubbles do not appear in some situations, when they do, it is possible to burst them by watching misinformation-debunking content (although this manifests differently from topic to topic). We also observe a sudden decrease in the misinformation filter bubble effect when misinformation-debunking videos are watched after misinformation-promoting videos, suggesting a strong contextuality of recommendations. Finally, when comparing our results with a previous similar study, we do not observe significant improvements in the overall quantity of recommended misinformation content.


INTRODUCTION
In this paper, we investigate the creation and bursting of misinformation filter bubbles on YouTube¹. The role of very large online platforms (especially social networking sites, such as Facebook, Twitter, or YouTube) in the dissemination and amplification of misinformation has been widely discussed and recognized in recent years by researchers, journalists, policymakers, and representatives of the platforms alike [11,18,19,22,30,41,48]. The platforms are blamed for promoting sensational, attention-grabbing, or polarizing content through the use of personalized recommendation algorithms (resulting from their mode of operation based on monetizing users' attention [53,60]). To tackle this issue, the platforms have (on the European level) committed to implement a range of measures stipulated in the Code of Practice on Disinformation [17]. However, monitoring the platforms' compliance and the progress made in this regard has proved difficult [16]. One of the problems is a lack of effective public oversight in the form of internal audits of the platforms' personalized algorithms that could directly quantify the impact of disinformation as well as the measures taken by the platforms.
This lack has been partially compensated by external black-box auditing studies performed by researchers, such as [1,4,27,38,41,48], which aimed to quantify the portion of misinformative content being recommended on social media platforms. With respect to YouTube, which is the subject of the audit presented in this paper, previous works investigated how a user can enter a filter bubble. Multiple studies demonstrated that watching a series of misinformative videos strengthens the further presence of such content in recommendations [1,27,38], or that following a path of the "up next" videos can bring the user to very dubious content [48]. However, no studies have covered whether, how, or with what "effort" the user can "burst" (lessen) the bubble. More specifically, they have not investigated what type of watching behavior (e.g., switching to credible news videos or conspiracy debunking videos) would be needed to lessen the amount of misinformative content recommended to the user. Such knowledge would be valuable not just for the sake of having a better understanding of the inner workings of YouTube's personalization, but also to improve the social, educational, or psychological strategies for building up resilience against misinformation.
Our work extends the prior works by researching this important aspect. To do so, we employ a sock puppet auditing methodology [5,43]. We simulate user behavior on the YouTube platform, record platform responses (search results, home page results, recommendations) and manually annotate a sample of them for the presence of misinformative content. Using the manual annotations, we train a machine learning model to predict labels for the remaining recommended videos that would be impractical to annotate manually due to their large volume. Then, we quantify the dynamics of misinformation filter bubble creation and also of bubble bursting, which is the novel aspect of the study.
The main contributions of this work are threefold. As the first contribution, this paper reports on the behavior of YouTube's personalization in a situation when a user with a misinformation promoting watch history (i.e., with a developed misinformation filter bubble) starts to watch content debunking the misinformation (in an attempt to burst that misinformation filter bubble).
The key finding is that watching misinformation debunking videos (such as credible news or scientific content) generally improves the situation (in terms of recommended items or search result personalization), albeit with varying effects and forms, mainly depending on the particular misinformation topic.
Complementing manual labels with automatically predicted ones (using our trained machine learning model) allowed us to inspect not only the differences at specific points in time (the state at the beginning of the study vs. the state after obtaining a watch history of misinformation promoting videos vs. the state after watching the misinformation debunking content), but also the dynamics of misinformation filter bubble creation and bursting throughout the whole duration of the study. Thus, as the second contribution, we provide a so-far unexplored deeper insight into misinformation filter bubble dynamics, since a continuous evaluation of the proportions of misinformation promoting and debunking videos has not, to the best of our knowledge, been covered by any existing auditing study. The key finding is that there is a sudden increase in the number of debunking videos after the first watched debunking video, suggesting a strong contextuality of YouTube's personalization algorithms. We observe this consistently for both the home page results and the recommendations for most examined misinformation topics.
Lastly, part of this work is a replication of the prior works, most notably the work of Hussein et al. [27], who also investigated the creation of misinformation filter bubbles using user simulation. We aligned our methodology with Hussein's study: we re-used Hussein's seed data (topics, queries, and also videos, except those which have been removed in the meantime), and used similar scenarios and the same data annotation scheme. As a result, we were able to directly compare the outcomes of both studies, Hussein's and ours, on the number of observed misinformative videos present in recommendations or search results. As the third contribution, we report changes in misinformation video occurrences on YouTube that have taken place since the study of Hussein et al. [27] (mid-2019). Due to YouTube's ongoing efforts to improve its recommender systems and policies (e.g., by removing misinformative content or preferring credible sources) [54,55], we expected to see less filter bubble creation behavior than Hussein et al. While we in general observe a low overall prevalence of misinformation in several topics, there is still room for improvement. More specifically, we observe a worse situation regarding the topics of vaccination and (partially) 9/11 conspiracies, and some improvements (less misinformation) for moon landing or chemtrails conspiracies. In addition, we replicated, to a lesser extent, the works of Hou et al. [26] and Papadamou et al. [38]. We reused their classification models, which we adapted and applied to our collected data. Finally, we used the better performing model of Papadamou et al. [38] to predict the labels for the videos that were not labeled manually.
To ensure future replicability of our study and reproducibility of the presented results, the implementation of the experimental infrastructure, the collected data, the manual annotations, as well as the analytical notebooks are all publicly available on GitHub². However, to mitigate the identified ethical risks associated with the possibility of mislabeling a video as promoting misinformation (cf. Section 4.7), we do not publish the labels predicted by our trained machine learning model, only their aggregated numbers necessary to reproduce the results. Nevertheless, since we train and use models published in prior works [26,38], it is also possible to replicate this part of our study. In addition, we do not publish copyrighted data (video title, description, etc.). However, this metadata can be downloaded through the YouTube API.
The rest of the paper is structured as follows. Section 2 looks at the definitions of a filter bubble, echo chambers, and misinformation and uses them to define a misinformation filter bubble, which is the focus of this work. In Section 3, we analyze different types of auditing studies and what the previous works examined with respect to social media (YouTube in particular) and misinformation. In Section 4, we describe in detail our research questions and study methodology, including manual as well as automatic annotation of collected data. We present the results of training a machine learning model for automatic annotation of videos and compare these results with the relevant related works. We also discuss identified ethical issues and how we addressed them in our research. The results of the audit itself are presented in Section 5. Finally, the implications of the results and the conclusions are discussed in Section 6.

BACKGROUND: FILTER BUBBLES AND MISINFORMATION
In online social media, a phenomenon denoted as virtual echo chambers refers to situations in which the same ideas are repeated, mutually confirmed and amplified in relatively closed homogeneous groups. Echo chambers may ultimately contribute to increased polarization and fragmentation of society [50,59]. The tendency of users to join and remain in echo chambers can be intentional (also denoted as self-selected personalization [7]) and explained by their selective exposure (focusing on information that is in accordance with one's worldview) or confirmation bias (reinforcing one's pre-existing beliefs) [6]. To some extent, such user behavior is a natural human defense against information overload [36]. However, the enclosure of users in echo chambers can also be caused and amplified by adaptive systems (even without users' intention), resulting in the so-called filter bubble effect (also denoted as pre-selected personalization [7]). This effect has serious ethical implications: users are often unaware of the existence of filter bubbles, as well as of the information that was filtered out.
Filter bubbles were first recognized by Pariser [39] in 2011 as a state of intellectual isolation caused by algorithms that personalize users' online experiences, hence exposing users to information and opinions that conform to and reinforce their own beliefs. Filter bubbles quickly became the target of theoretical as well as empirical research efforts from multiple perspectives, such as: 1) exploring characteristics of filter bubbles and identifying the circumstances of their creation [32]; 2) modelling/quantifying the filter bubble effect [2,28]; and 3) discovering strategies to prevent or "burst" filter bubbles [10].
The ambiguity and difficult operationalization of Pariser's original definition of filter bubbles led to different interpretations, inconsistent or even contrasting findings, and, finally, low generalizability across studies [35]. For these reasons, an operationalized, systematically and empirically verifiable definition of the filter bubble was recently proposed in [35] as follows: "A technological filter bubble is a decrease in the diversity of a user's recommendations over time, in any dimension of diversity, resulting from the choices made by different recommendation stakeholders." Based on this definition, the authors also stress the criteria for studies addressing the filter bubble effect: they must consider the diversity of recommendations and measure a decrease in diversity over time. In this work, we proceed from this definition and also meet the stated criteria.
We are interested specifically in filter bubbles that are determined by the presence of content spreading disinformation or misinformation. Disinformation is "false, inaccurate, or misleading information designed, presented and promoted to intentionally cause public harm or for profit" [19], while misinformation is false or inaccurate information that is spread regardless of an intention to deceive. In this work, we use the broader term misinformation since we are interested in any kind of false or inaccurate information regardless of the intention behind its creation and spreading (in contrast to most current reports and EU legislation, which opt for the term disinformation and thus emphasize the necessary presence of intention). Due to the significant negative consequences of misinformation for our society (especially during the ongoing COVID-19 pandemic), tackling misinformation has also attracted a plethora of research works (see [9,56,58] for recent surveys). The majority of such research focuses on various characterization studies [46] or detection methods [40,49].
ACM Transactions on Recommender Systems (ACM TORS), Vol. 0, No. 0, Article 0. Publication date: 2022.
We denote filter bubbles that are characterized by the increased prevalence of such misinformative content as misinformation filter bubbles. They are states of intellectual isolation in false beliefs or manipulated perceptions of reality. Following the adopted definition, filter bubbles in general are characterized by a decrease in any dimension of diversity. We can broadly distinguish three types of diversity [35]: structural, topical and viewpoint diversity. Misinformation filter bubbles can be considered a special case of a decrease in viewpoint diversity in which the viewpoints represented are provably false. Analogously to topical filter bubbles, misinformation filter bubbles can be characterized by a high homogeneity of recommendations/search results that share the same positive stance towards misinformation. In other words, the content adaptively presented to a user in a misinformation filter bubble supports one or several false claims/narratives. While topical filter bubbles are not necessarily undesirable (they may be intended and even positively perceived by users [8,30]), misinformation filter bubbles are by definition more problematic and cause indisputable negative effects [12,21,30].
Reflecting the adopted definition of the filter bubble [35], we do not consider a misinformation filter bubble as a binary state at a single moment (i.e., following the current recommended items, a user is/is not in the misinformation filter bubble), but as an interval measure reflecting how deep inside the bubble the user is. Such a measure is determined by the proportion of misinformative content and calculated at different points in time. This definition and operationalization of filter bubbles underlines our second contribution, a so-far unexplored deeper insight into the dynamics of filter bubbles, since we calculated the proportion of misinformative content not only at selected time points but continuously over the whole duration of the study.
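The interval measure described above can be illustrated with a minimal Python sketch. It assumes that each observation step (e.g., after each watched video) yields a list of string labels for the recommended items; the label names and the function names `promoting_share` and `bubble_depth_series` are hypothetical and not taken from the study's implementation.

```python
from typing import List, Sequence


def promoting_share(labels: Sequence[str]) -> float:
    """Proportion of items labeled 'promoting' in one recommendation list.

    Assumption: an empty list is treated as containing no misinformation.
    """
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == "promoting") / len(labels)


def bubble_depth_series(steps: Sequence[Sequence[str]]) -> List[float]:
    """One proportion per observation step, in chronological order."""
    return [promoting_share(step) for step in steps]


# Hypothetical labels observed at three consecutive points in time.
observations = [
    ["neutral", "neutral", "debunking", "neutral"],
    ["promoting", "neutral", "neutral", "neutral"],
    ["promoting", "promoting", "neutral", "debunking"],
]
series = bubble_depth_series(observations)
```

Reading the resulting series over time, rather than a single value, is what distinguishes the interval view from a binary "in the bubble / not in the bubble" decision.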
To prevent misinformation and misinformation filter bubbles, social media conduct various countermeasures. These are usually reactions to public outcry or are required by legislation, e.g., the EU's Code of Practice on Disinformation [17]. Currently, the effectiveness of such countermeasures is evaluated mainly by self-evaluation reports. This approach has, however, already been recognized as insufficient due to a lack of evidence and objectivity, since social media are reluctant to provide access to their data for independent research [16]. In addition, the commercial aims of social media may contradict pro-social interests, as also revealed by the recent whistleblowing case Facebook Files³ [25]. The verification of countermeasures is further complicated by the interference of psychological factors. For example, some researchers argue that users' intentional self-selected personalization is more influential than algorithms' pre-selected personalization when it comes to intellectual isolation [3,15].
An alternative path towards responsible and governed AI in social media, and towards eliminating its negative social impact, is the employment of independent audits. Such audits, which are carried out by an external auditor independent from the company developing the audited AI algorithm, are also envisaged in the proposal of upcoming EU legislation [20]. Nevertheless, auditing studies at the intersection of filter bubbles and misinformation (i.e., studies on misinformation filter bubbles), such as the one presented in this paper, are still relatively rare.

RELATED WORK: AUDITS OF ADAPTIVE SYSTEMS
In this context, an audit is a systematic statistical probing of an online platform, used to uncover socially problematic behavior underlying its algorithms [27,43]. Algorithmic audits can be conducted in two different settings. Internal (white-box) audits may utilize direct access to the recommender system (algorithm and data) and thus require cooperation between an independent auditor and the platform operating the recommender system. External (black-box) audits are performed without detailed information about, or access to, the internal workings of the recommender system, and thus are limited to publicly available data obtained via the user interface or an API. External audits may suffer from methodological problems, especially limited possibilities to evaluate causal hypotheses about the effects of different variants of recommender systems on users [30]. Internal audits may naturally lead to more precise results; nevertheless, their execution is not currently possible for independent researchers (we do not expect that the situation will change significantly in the future, even when the new EU legislation [20] is adopted [33]). Fernández et al. [21] attempted to overcome this limitation and performed simulations of the effect of some of the most popular recommendation algorithms on the spread of misinformation utilizing Twitter data. Such recommender systems are, however, very likely too distinct from the algorithms utilized by social media platforms. Therefore, the applicability of the findings to real-world social media recommender systems is questionable. Audits of filter bubbles caused by social media recommender systems are thus performed almost solely externally.
From another perspective, algorithmic auditing can address four categories of problematic machine behavior [5]: discrimination, distortion, exploitation, and misjudgement. Auditing filter bubbles caused by social media recommender systems falls into the distortion category (i.e., a behavior that distorts or obscures an underlying reality).
External audits examining the distortion category of problematic behavior come in multiple forms [5], as defined by the taxonomy introduced by Sandvig et al. [43]. Commonly used scraping audits, which collect data by generating requests to an API or web interface, allow only a very limited simulation of user behavior. Crowdsourcing audits and sockpuppeting audits, which can replicate user behavior more precisely and eliminate confounding factors better, are more suitable for investigating the effect of (misinformation) filter bubbles; nevertheless, they have been researched to a lesser extent.
Crowdsourcing audit studies are conducted using real user data. Hannak et al. [23] recruited Mechanical Turk users to run search queries and collected their personalized results. Silva et al. [45] developed a browser extension to collect personalized ads with real users on Facebook. Shen et al. [44] introduced the idea of everyday algorithm auditing, in which users detect, understand, and interrogate problematic machine behaviors via their day-to-day interactions with algorithmic systems. However, such an auditing methodology suffers from a lack of isolation (users may be influenced by additional factors, such as confirmation bias). Moreover, the uncontrolled environment makes comparisons difficult or unfeasible, it is difficult to keep users active, and such audits also raise several privacy issues.
Sockpuppeting audits solve these problems by employing non-human bots that impersonate the behavior of users in a predefined controlled way [43]. To achieve representative and meaningful results in sockpuppeting audits, researchers need to tackle several methodological challenges [27]. First is the selection of appropriate seed data, which are needed to set up user profiles and prescribe user actions executed during the audit (e.g., the initial activity of bots, search queries). Second, the experimental setup must measure the real influence of the investigated phenomena. At the same time, it must minimize confounding factors and noise (such as name, gender, or geolocation [23]).
Another challenge is how to appropriately label the presence of the audited phenomena (expert-based/crowdsourced [27,45] or automatic labeling [38] can be employed).
Audits can be further distinguished by the online platform they are applied to (social networking sites [4,24,27,38,45], search engines [31,34,42], e-commerce sites [29]), by the adaptive systems being investigated (recommendations [27,38,48], up-next recommendation [27], search results [27,31,34,38,42], autocomplete [42], advertisement system [4]) and by the phenomena being studied (misinformation [27,38], political/ideological bias [24,31,34], political ads [45]). In our study, we focus specifically on misinformation filter bubbles in the context of the online video platform YouTube and its recommender and search systems. As argued by Spinelli et al. [48], YouTube is an important case to study as a significant source of socially-generated content and because of its opaque recommendation policies. Some information about the inner workings of YouTube's adaptive systems is provided by research papers published at the RecSys conference [14,57] or blogs [54] published directly by the platform; nevertheless, the details remain unknown. Therefore, we feel a need to conduct independent auditing studies on undesired phenomena like the unintended creation of misinformation filter bubbles.
The existing studies confirmed the effects of filter bubbles in YouTube recommendations and search results. Spinelli et al. [48] found that chains of recommendations lead away from reliable sources and toward extreme and unscientific viewpoints. Similarly, Ribeiro et al. [41] concluded that YouTube's recommendation contributes to further radicalization of users and found paths from large media channels to extreme content through recommendation. Abul-Fottouh et al. [1] confirmed a homophily effect in which anti-vaccine videos were more likely to recommend other anti-vaccine videos than pro-vaccine ones and vice versa. In a recent work, Haroon et al. [24] utilized sock puppets to determine the presence of ideological bias (i.e., the alignment of recommendations with users' ideology), its magnitude (i.e., whether the users are recommended an increasing number of videos aligned with their ideology), and radicalization (i.e., whether the recommendations are progressively more extreme). Furthermore, a bottom-up intervention to minimize ideological bias in recommendations was designed and evaluated. Ballard et al. [4] analyzed how ads shown on conspiracy content differ from ads on mainstream videos and thus evaluated YouTube's algorithms for advertisement selection and demonetization. The obtained results revealed that conspiracy videos have fewer, but at the same time lower-quality and more predatory, ads than mainstream videos. The authors also conclude that the difference in advertising quality suggests that YouTube's advertising platform may be assisting predatory advertisers in identifying potential victims.
Recently, the first audits focused specifically on misinformation filter bubbles in YouTube's recommender systems have appeared. Hussein et al. [27] and Papadamou et al. [38] found that YouTube mitigates pseudoscientific content in some handpicked topics such as COVID-19. Hussein et al. [27] found that demographics and geolocation (within the US) affect personalization only after some watch history has been acquired. These studies provide evidence of the existence and properties of misinformation filter bubbles on YouTube.
The above-mentioned auditing studies took two different approaches to determine the presence of the investigated phenomenon in the recommended content. First, some studies rely on manual (expert) annotation [27]. This approach can potentially lead to more precise labels. The annotation process may, however, be influenced by the subjectivity of individual human experts and, therefore, each recommended item should be assigned a final label only in case of agreement of multiple experts. As a result, such an approach is very time-consuming and the amount of labeled data is rather small. Second, some studies use automatic annotations by means of various heuristics (e.g., analysis of user feedback [24]) or a machine learning model (e.g., a classification model [38]). Applying machine learning models allows annotating significantly larger amounts of data. Nevertheless, it still requires a labeled subset of data (labeled either by experts or by crowdsourcing) to create the necessary training set, and at the same time, machine learning predictions may introduce some level of noise due to their natural imperfection.
From the properties that remain uninvestigated, we specifically address two. First, the adaptive systems used by YouTube are in continuous development and improvement. Information on how YouTube proceeds in countering misinformation is needed. Second, while the existing studies focused on misinformation filter bubble creation, there is not the same level of insight into the inverse process: filter bubble bursting. The online survey performed in [8] revealed that the majority of respondents (63%) are aware that they are affected by the filter bubble, while fewer participants (31%) also deliberately take action against the filter bubble effect. Investigating bubble bursting strategies can help not only users who intentionally want to get a more diverse/balanced recommendation mix, but can also provide a better understanding of how the recommender system works internally (which can be valuable also for the social media platforms themselves). Such insight can ultimately help to improve the design of recommender systems (e.g., how exactly to prioritize credible sources).

STUDY DESIGN AND METHODOLOGY
To investigate the dynamics of bursting out of a misinformation filter bubble, we conducted an external agent-based sockpuppeting audit study. The study took place on YouTube, but its methodology and implementation can be generalized to any adaptive service where recommendations can be observed by the user.
In the study, we let a series of agents (bots) pose as YouTube users. The agents performed pre-defined sequences of video watches and query searches. They also recorded the items they saw: recommended videos, search results, and videos shown on the home page. The pre-defined actions were designed to first invoke the misinformation filter bubble effect by purposefully watching videos with (or leaning towards) misinformative content. Then, agents tried to mitigate the bubble effect by watching videos with trustworthy (misinformation debunking) content. Between their actions, the agents were idle for some time to prevent possible carry-over effects. How deep inside a bubble an agent was at a given moment was observed through the number and rank of misinformative videos offered to it.
The secondary outcome is the partial replication of a previous study done by Hussein et al. [27] (denoted hereafter as the reference study). This replication allowed us to draw direct comparisons between the quantities of misinformative content that agents encountered during our study (conducted in March 2021) and during the reference study (conducted in mid-2019).

Research Questions and Hypotheses
RQ1 (comparison to the reference study): Has YouTube's personalization behavior changed with regard to misinformative videos since the reference study? In particular, we seek to validate the following hypothesis:
• H1.1: We will observe a weaker misinformation filter bubble effect when comparing the state after constructing a promoting watch history with the results of the reference study, in both search and recommendations.
RQ2 (bubble bursting dynamics): How does the misinformation filter bubble effect change when debunking videos are watched? The "means of bubble bursting" would be implicit user feedback: watching misinformation debunking videos. In particular, we seek to validate the following hypotheses:
• H2.0: Watching videos that promote misinformation leads to their increased presence in search results, top-10 recommendations, and home page recommendations (this is a precondition hypothesis to all remaining hypotheses under RQ2).
• H2.1: Watching the sequence of misinformation debunking videos after the sequence of misinformation promoting videos will decrease the presence of misinformation in search results, top-10 recommendations, and home page recommendations in comparison to the end of the promoting sequence.
• H2.2: Watching the sequence of misinformation debunking videos after the sequence of misinformation promoting videos will decrease the presence of misinformation in search results, top-10 recommendations, and home page recommendations in comparison to the start of the experiment.
• H2.3: The presence of misinformation changes gradually (approximately linearly) with the number of watched videos that promote or debunk misinformation. That is, the misinformation filter bubble effect increases linearly with the increasing number of misinformation promoting videos and decreases linearly with the increasing number of misinformation debunking videos.
In H2.3, we purposefully refer to a gradual linear change as a kind of standard ("baseline") change that allows the most straightforward interpretation. This decision is also motivated by the fact that we are not aware of any existing studies on the dynamics of misinformation filter bubbles that could serve as a basis for assuming a more specific kind of change (such as exponential).

Metrics and Operationalization of Hypotheses
To evaluate the hypotheses, we use the following metrics:
Normalized score (NS). Drawn directly from the reference study [27]. It quantifies misinformation prevalence in a given list of items (videos), which are labeled as either promoting (value 1), debunking (value -1) or neutral (value 0). It is computed as an average of the individual annotations of the items present in a list without considering their order (rank). Its value lies in the ⟨−1, 1⟩ interval. Lists populated with mostly debunking content would receive values close to -1, with mostly promoting content close to 1, and with balanced or mostly neutral content close to 0. In other words, a score closer to -1 means a better score (less misinformation).
Search result page misinformation score (SERP-MS). Also drawn directly from the reference study. Unlike the NS metric, it takes into account the rank of the items in a list. It is thus better suited for longer ordered lists. It is computed as [27]:
$$\text{SERP-MS} = \frac{\sum_{r=1}^{n} x_r \cdot (n - r + 1)}{\frac{n (n + 1)}{2}}$$
where $x_r$ is the annotation value of the item at search result rank $r$, and $n$ is the number of search results in the list. Its value also lies in the ⟨−1, 1⟩ interval with the same interpretation as above.
Difference to linear (DIFF-TO-LINEAR). A metric that describes the slope of changes in the normalized score as videos are watched. It compares the actual change in the value of the normalized score between a given start and end video against an expected linear change.
We define the final value of the metric as the sum of these differences at each watched video:
$$\text{DIFF-TO-LINEAR} = \sum_{i=s}^{e} \left( NS_i - \left( NS_s + (i - s) \cdot \frac{NS_e - NS_s}{e - s} \right) \right)$$
where $s$ and $e$ are indices of the start and end videos, and $NS_i$ is the normalized score at the $i$-th watched video. If the overall change in the normalized score is positive (i.e., there is an increase in the normalized score on the given interval between the start and end videos) and the metric value is positive, the normalized score changes faster than expected. If the metric value is negative, it changes slower than expected. In case of a negative overall change in the normalized score (i.e., there is a decrease on the given interval between the start and end videos), the interpretation is reversed. That is, if the metric value is positive, the normalized score changes slower than expected, and if it is negative, it changes faster.
The Normalized score (NS) and Search result page misinformation score (SERP-MS) metrics are used to evaluate the following hypotheses: H1.1 (their decrease is expected); H2.0 (their increase is expected); H2.1 (their decrease is expected); and H2.2 (their decrease is expected). The NS metric is used to score shorter lists, which are, in our case, top-10 recommendations and top-10 home page results. The SERP-MS metric, on the other hand, is used for longer lists of search results. Finally, the DIFF-TO-LINEAR metric is used to evaluate the hypothesis H2.3 (values near 0 are expected).
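The sign convention of DIFF-TO-LINEAR is easiest to see in code. The sketch below assumes the metric is the sum, over all watched videos between the start and end index, of the observed normalized score minus the value on the straight line connecting the two endpoints; the function name and the example series are illustrative only.

```python
from typing import Sequence


def diff_to_linear(ns: Sequence[float], s: int, e: int) -> float:
    """Sum of (observed NS - linearly interpolated NS) for videos s..e inclusive."""
    if not 0 <= s < e < len(ns):
        raise ValueError("require 0 <= s < e < len(ns)")
    slope = (ns[e] - ns[s]) / (e - s)
    return sum(ns[i] - (ns[s] + (i - s) * slope) for i in range(s, e + 1))


# Illustrative series (not data from the study), all rising from 0 to 1:
linear = [0.0, 0.25, 0.5, 0.75, 1.0]   # exactly linear change
fast = [0.0, 0.8, 0.9, 1.0, 1.0]       # most of the change happens early
slow = [0.0, 0.0, 0.0, 0.0, 1.0]       # the change happens only at the end
```

For a rising score, the early-change series yields a positive value and the late-change series a negative one, matching the interpretation given above; an exactly linear series yields 0.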

Study scenarios
We let agents interact with YouTube following a scenario composed of four phases, as depicted in Algorithm 1 (see also the diagram of the agent scenario in Figure 1).
Phase 0: Agent initialization. At the start of a run, the agent fetches its desired configuration, including the YouTube user account and various controlled variables (the variable values are explained further below). The agent also fetches $t \in T$, a topic with which it will work (e.g., "9/11"). The agent fetches $V_{prom}$ and $V_{deb}$, which are lists of the $n_{prom} = 40$ and $n_{deb} = 40$ most popular videos promoting, respectively debunking, misinformation within topic $t$. Afterwards, it fetches $Q$, a set of $n_q = 5$ search queries related to the particular $t$ (e.g., "9/11 conspiracy"). The agent configures and opens a browser in incognito mode, visits YouTube, logs in using the given user account, and accepts cookies. Finally, the agent creates a neutral baseline by visiting the home page and saving videos, and by performing a search phase. In the search phase, the agent randomly iterates through the search queries in $Q$, executes each query on YouTube, and saves the search results. To prevent any carry-over effect between search queries, the agent waits for $t_{wait} = 20$ minutes after each query.
Algorithm 1 (fragment; steps not recoverable from the listing are marked [...]):
    [...]
        Log in using the given account credentials and accept cookies
        Visit home page and save videos listed there to create a neutral baseline
        Execute Search() to create a neutral search baseline
    end procedure
    procedure Search()
        for query from randomized Q do
            [...]
    [...]
        Clear user history
    end procedure
    Phase 0: Execute Agent initialization
    Phase 1: Execute Watch(V_prom)
    Phase 2: Execute Watch(V_deb)
    Phase 3: Execute Tear-down
Phase 1: Watch promoting videos (create the filter bubble). To create a filter bubble effect, the agent randomly iterates through $V_{prom}$ and "watches" each video for $t_{watch} = 30$ minutes (or less, if the video is shorter). Immediately after watching a video, the agent saves the video recommendations on that video's page and visits the YouTube home page, saving the video recommendations listed there as well. After every $f_q = 2$ videos, the agent performs another search phase.
Phase 2: Watch debunking videos (burst the filter bubble). The agent follows the same steps as in phase 1. The only difference is the use of $V_{deb}$ instead of $V_{prom}$.
Phase 3: Tear-down. In this phase, the agent clears the YouTube history (using Google's "my activity" section), making the user account ready for the next run.
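The order of actions in the four phases described above can be sketched as a plain Python generator. This is not the agents' actual implementation (which drives a browser): the function name, the event labels, and the `videos_per_search` parameter are hypothetical, and the waiting and watching times are omitted.

```python
import random
from typing import Iterator, Optional, Sequence, Tuple


def run_schedule(
    promoting: Sequence[str],
    debunking: Sequence[str],
    queries: Sequence[str],
    videos_per_search: int = 2,
    seed: Optional[int] = None,
) -> Iterator[Tuple[str, str]]:
    """Yield (action, target) pairs in the order one agent run would perform them."""
    rng = random.Random(seed)

    def search_phase() -> Iterator[Tuple[str, str]]:
        # Queries are executed in a fresh random order in every search phase.
        for query in rng.sample(list(queries), len(queries)):
            yield ("search", query)

    # Phase 0: neutral baseline (home page, then one search phase).
    yield ("home", "baseline")
    yield from search_phase()

    # Phase 1 (promoting videos) and phase 2 (debunking videos) share the same steps.
    for phase_videos in (promoting, debunking):
        order = rng.sample(list(phase_videos), len(phase_videos))
        for i, video in enumerate(order, start=1):
            yield ("watch", video)
            yield ("recommendations", video)
            yield ("home", video)
            if i % videos_per_search == 0:
                yield from search_phase()

    # Phase 3: tear-down.
    yield ("teardown", "clear history")
```

With 40 promoting videos, 40 debunking videos, and 5 queries, this schedule contains 41 search phases (one baseline plus one after every second video), i.e., 205 queries per run.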
For each selected topic, we ran the scenario 10 times (in parallel). This allowed us to deal with the recommendation noise present on the platform. To run our experiments multiple times, we used the reset (delete all history) button provided by Google instead of creating a new user profile for each run. Before deciding to use the reset button in our study, we performed a short verification study to see whether using this button really deletes the whole history and resets the personalization on YouTube. We randomly selected a few topics and manually watched a few videos from each (5 per topic). Then, we used the reset button and evaluated the difference between videos appearing on the YouTube home page, in recommendations, and in search. We found no carry-over effects.
4.3.1 Controlled variables and parameter setup pre-study.
In the study, we control several sources of potential confounds:
• Geolocation from which the agents access the YouTube service and other user characteristics of the agents (name, date of birth, gender).
• Time, i.e., all experiments are done in the shortest time frame possible to minimize the influence of content newly appearing on YouTube. Risks of changes in YouTube's recommendation algorithms during the audit are also minimized this way.
• Technical setup of agents, which always uses the same configuration (e.g., operating system, browser) as described below.
For geolocation, we use N. Virginia to allow for better comparison with the reference study. The date of birth for all accounts was arbitrarily set to 6.6.1990 to represent a person roughly 30 years old. The gender was set as "rather not say" to prevent any personalization based on gender. The names chosen for the accounts were composed randomly of the most common surnames and unisex given names used in the US.
There were also process parameters that we needed to keep constant. These include: (1) $n_{prom} = 40$ and $n_{deb} = 40$, representing the number of seed videos used in the promoting and the debunking phase respectively, (2) $t_{watch} = 30$, representing the maximum watching time in minutes for every video, (3) $n_q = 5$, representing the number of queries used, (4) $t_{wait} = 20$, representing the wait time in minutes between successive queries, and (5) $f_q = 2$, representing the number of videos to watch between search phases.
Values of the process parameters greatly influence the total running time and results of the study. Yet, determining them was not straightforward given the many unknown properties of the environment (first and foremost YouTube's algorithms). For example, prior to the study, it was unclear how often we would need to probe for changes in recommendations and search result personalization to answer our research questions.
Therefore, we ran a pre-study in which we determined the best parameter setup. We used two metrics to determine the type of change (i.e., change in videos, or change in order of videos) between lists of returned videos: 1) the overlap of lists of recommended videos, a simple metric that disregards the order of videos; and 2) the Levenshtein distance between ordered results, which takes a change in order, but not the exact position, into consideration. Measuring these metrics across watched videos and different runs, we decided to run 10 individual agents for each topic, as we observed instability between repeated runs (e.g., the same configuration yielded ∼70% of the same recommended videos). For the $n_{prom}$ and $n_{deb}$ parameters, we observed that in some cases, a filter bubble effect (a change in diversity of the returned videos) could be detected after 20 watched videos. In others, it appeared only after 30 or more. Due to this inconsistency, we opted to watch 40 videos per phase so that the full potential of the misinformation filter bubble effect could develop and be observed. To determine the optimal value of $t_{watch}$, we first calculated the average running time of our seed videos. Most of the videos (∼85%) had a running time of about 30 minutes or shorter, so 30 minutes became the baseline value. In addition, we compared the results obtained by watching only 30 minutes with results from watching the whole video regardless of its length, but found no apparent differences.
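The two pre-study metrics can be illustrated on lists of video IDs. The sketch below is an assumption-laden illustration rather than the pre-study code: the exact normalization of the overlap is not stated in the text, so it is taken here as the number of shared videos divided by the size of the larger list, and the Levenshtein distance is the standard edit distance applied to whole video IDs instead of characters.

```python
from typing import Hashable, Sequence


def overlap(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Share of videos common to both lists, ignoring order (assumed normalization)."""
    if not a and not b:
        return 1.0
    return len(set(a) & set(b)) / max(len(set(a)), len(set(b)))


def levenshtein(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute x by y (free if equal)
            ))
        prev = cur
    return prev[-1]
```

Two lists with the same videos in a different order have an overlap of 1.0 but a non-zero Levenshtein distance, which is how the two metrics separate a change in videos from a change in their order.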
To determine the number of queries $n_q$ and the periodicity of searches $f_q$, we ran the scenario with all seed queries introduced by the reference study and used them after every seed video. We observed that the difference in search results between successive seed videos was not significant. As the choice of search queries and the frequency of their use greatly prolonged the overall running time of the agents, we opted to run the search phase after every second video. In addition, we opted to use only 5 queries per topic.
The only parameter not set by a pre-study is $t_{wait}$, which we set to 20 minutes based on previous studies. These found that the carry-over effect (which we wanted to avoid) is visible for 11 minutes after the search [23,27].

Technical setup.
The implementation of the agents is available in the accompanying GitHub repository. They were implemented using the Selenium library in Python. We used Google Chrome browser version 88 with chromedriver version 88.0.4324.96. We used the incognito mode to remove any possibility of history or cookies being carried over across different sessions. In addition, we used uBlock Origin to prevent noise from watching ads in our experiments. Each agent was packaged using a Python 3.8.7 Dockerfile based on Debian 10.7. All the agents were placed on an AWS server running in N. Virginia.

Seed Data
We used 5 topics in our study (the same as the reference study): 1) 9/11 conspiracies claiming that authorities either knew about (or orchestrated) the attack, or that the fall of the twin towers was a result of a controlled demolition, 2) moon landing conspiracies claiming the landing was staged by NASA and in reality did not happen, 3) chemtrails conspiracy claiming that the trails behind aircraft are purposefully composed of dangerous chemicals, 4) flat earth conspiracy claiming that we are being lied to about the spherical nature of Earth, and 5) vaccines conspiracy claiming that vaccines are harmful, causing a range of diseases, such as autism. The narratives associated with the topics are popular (persistently discussed), while at the same time demonstrably false, as determined by the reference study [27].
For each topic, the experiment required two sets of seed videos: the promoting set, used to construct a misinformation filter bubble (its videos have a promoting stance towards the conspiratorial narrative or present misinformation), and the debunking set, aimed at bursting the bubble (it contains videos disproving the conspiratorial narratives).
As a basis for our seed data sets, we used data already published in the reference study, which the authors either used as seed data, or collected and annotated. To make sure we use adequate seed data, we re-annotated all of them.
The number of seed videos collected this way was insufficient for some topics (we required twice as many seed videos as the reference study). To collect more, we used an extended version of the seed video identification methodology of the reference study. The following is the list of approaches we used (in descending order of priority): YouTube search, other search engines (Google search, Bing video search, Yahoo video search), YouTube channel references, recommendations, YouTube home page, and known misinformation websites. To minimize any biases, we used a maximum of 3 videos from the same channel.
As for search queries, we required fewer of them than the reference study. We selected a subset based on their popularity on YouTube. Some examples of the used queries are: "9/11 conspiracy", "Chemtrails", "flat earth proof", "anti vaccination", "moon landing fake".
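The per-channel cap can be expressed as a single pass over the candidate videos in priority order. This is a hypothetical sketch: the function name and the `(video_id, channel_id)` input format are assumptions, and the actual selection in the study was done as part of a manual identification process.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def cap_per_channel(
    candidates: Iterable[Tuple[str, str]], max_per_channel: int = 3
) -> List[str]:
    """Keep candidates in priority order, but at most `max_per_channel` per channel.

    `candidates` is an iterable of (video_id, channel_id) pairs, highest priority first.
    """
    kept: List[str] = []
    per_channel: Counter = Counter()
    for video_id, channel_id in candidates:
        if per_channel[channel_id] < max_per_channel:
            kept.append(video_id)
            per_channel[channel_id] += 1
    return kept
```

Because the input is consumed in priority order, a video found through a higher-priority source is always preferred over a later video from the same channel.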
The full list of seed videos and used search queries is available in the accompanying GitHub repository.

Data collection and annotation
Agents collect videos from three main components on YouTube: 1) recommendations appearing next to the watched videos, 2) home page videos, and 3) search results. In the case of recommendations, we collect the first 20 videos that YouTube displays next to a currently watched video (in rare cases, fewer than 20 videos are recommended). Collecting only the top-ranked videos is in line with previous works and stems from the fact that users seldom visit recommendations further down the list. For home page videos and search results, we collect all videos appearing with the given resolution (2560x1440px), but no fewer than 20. When fewer than 20 videos appear, the agent scrolls further down the page to load more videos.
For each video encountered, the agent collects the following metadata: 1) YouTube video ID, 2) position of the video in the list, and 3) presence of a warning/clarification message that appears with problematic topics such as COVID-19. Other metadata, such as video title, channel, or description, are collected using the YouTube API.
To annotate the collected videos for the presence of misinformation, we used an extended version of the methodology proposed in the reference study. Each video was viewed and annotated by the authors of this study using a code ranging from -1 to 10 as follows:
• Code -1, i.e., debunking, when the narrative of a video provides arguments against the misinformation related to the particular topic (such as "The Side Effects of Vaccines -How High is the Risk?").
• Code 0, i.e., neutral, when the narrative discusses the related misinformation but does not present a stance towards it (such as "Flat Earthers vs Scientists: Can We Trust Science? | Middle Ground").
• Code 1, i.e., promoting, when the narrative promotes the related misinformation (such as "MIND BLOWING CONSPIRACY THEORIES").
• Codes 2, 3, and 4 have the same meaning as codes -1, 0, and 1, but are used for videos that discuss misinformation not related to the topic of the run (e.g., a video dealing with climate crisis misinformation encountered during a flat earth audit).
• Code 5 is applied to videos that do not contain any misinformation views (such as "Gordon's Guide To Bacon"). This includes completely unrelated videos (e.g., music or reality show videos), but also videos that are related to the general audit topic, but not to misinformation (e.g., original news coverage of 9/11 events).
• Code 6 is assigned in rare cases of videos that are not in English and do not provide English subtitles.
• Code 7 is assigned in equally rare cases when the narrative of the video cannot be determined with enough confidence.
• Code 8 is reserved for videos removed from YouTube (before they are annotated).
• Codes 9 and 10 present an extension of the approach used in the reference study. They are used to denote videos that specifically mention misinformation but, rather than debunk it, mock it (9 for related misinformation, 10 for unrelated misinformation, for example "The Most Deluded Flat Earther in Existence!"). Mocking videos are a distinct (and often popular) category, which we wanted to investigate separately (however, for the purposes of analysis, they are treated as debunking videos).
Similar to the reference study, we map all the codes assigned by the annotators to one of the three stance values: -1 (debunking), 1 (promoting), and 0 (neutral), which are used to compute the metrics and evaluate our hypotheses. Codes -1, 2, 9, and 10 are mapped to -1. Codes 1 and 4 are mapped to 1. Codes 0, 3, and 5 are mapped to 0. Videos coded with 6, 7, or 8 are discarded from the evaluation, since they cannot be reliably mapped to any of the stance values.
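The mapping from annotation codes to stance values is a small lookup table. The sketch below restates it in Python; the constant and function names are illustrative, and returning `None` for discarded codes is a design choice of this example.

```python
from typing import Optional

# Annotation code -> stance value (-1 debunking, 0 neutral, 1 promoting).
STANCE_BY_CODE = {
    -1: -1, 2: -1, 9: -1, 10: -1,  # debunking, including mocking videos (9, 10)
    1: 1, 4: 1,                    # promoting
    0: 0, 3: 0, 5: 0,              # neutral or no misinformation
}

# Non-English (6), undeterminable (7), and removed (8) videos are discarded.
DISCARDED_CODES = {6, 7, 8}


def to_stance(code: int) -> Optional[int]:
    """Return the stance value for an annotation code, or None if it is discarded."""
    if code in DISCARDED_CODES:
        return None
    return STANCE_BY_CODE[code]  # raises KeyError for codes outside -1..10
```

Together, the two collections cover every code from -1 to 10 exactly once.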
To determine how many annotators are needed per video, we first re-annotated the seed videos released by the reference study. Each seed video was annotated by at least two authors. To measure inter-rater reliability, we evaluated consistency between: 1) our annotators, who produced the re-annotated labels, achieving a Cohen's kappa value of 0.815; and 2) our annotators and the authors of the reference study, achieving a Cohen's kappa value of 0.688. We attribute the difference between these values to possible small divergences in how our annotators and the reference study annotators applied the annotation methodology. We identified characteristics of edge cases and discussed with all our annotators how to resolve them. Following the re-annotation and its findings, we assigned only one annotator per collected video when annotating our collected videos. Annotators were instructed to report edge cases, i.e., videos that were hard to label with enough confidence for some reason. Such videos were encountered approximately every 10-20 annotations and were always reviewed by another annotator (i.e., providing a "second opinion") and optionally discussed until a consensus was reached between the annotators.
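For readers unfamiliar with the agreement statistic reported above, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance from their individual label frequencies. A minimal self-contained sketch follows; the function name and the toy labels are illustrative and are not the study's annotations.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy example with stance labels (-1, 0, 1); not data from the study.
annotator_1 = [1, 1, 1, 0, 0, 0, -1, -1, -1, -1]
annotator_2 = [1, 1, 0, 0, 0, 0, -1, -1, -1, 1]
```

In the toy example, the annotators agree on 8 of 10 items, but because some of that agreement would occur by chance, kappa is lower than the raw agreement of 0.8.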
For the purpose of this study and to evaluate our hypotheses, we manually annotated the following subset of collected videos:
• All recorded search results.
• Videos recommended for the first 2 seed videos at the start of the run and the last 2 seed videos of both phases (resulting in 6 sets of annotated videos per topic). This selection was a compromise between representativeness, correspondence to the reference study, and our capacities.
• We did not annotate the home page videos for the purpose of this study. These videos were the most numerous, the most heterogeneous, and with little overlap across bots and seed videos.
For the remaining videos from top-10 recommendations and home page results, which we did not annotate manually, we employed a machine learning model trained on the manual annotations to predict their labels as discussed next.

Trained machine learning models for automated prediction of annotations
We opted for automatic annotation of the remaining videos by means of machine learning due to their large number (17,405 unique videos overall, cf. Section 5), which would not be feasible to annotate manually. We experimented with two state-of-the-art models for classification of YouTube videos that were used in similar misinformation detection tasks and presented in the related works: the models by Hou et al. [26] and Papadamou et al. [38].

Model by Hou et al. [26] (Hou's model).
The authors presented an SVM model trained to classify prostate cancer videos as misinformative or trustworthy based on a set of viewer engagement features (e.g., number of views, number of thumbs ups, number of comments), linguistic features (e.g., n-grams and syntax based features, readability and lexical richness features derived from the video transcripts), and raw acoustic features. We implemented this model using standard ML toolkits (nltk, sklearn) and trained it using our annotated dataset. We omitted acoustic features in our training since we did not collect them in our dataset. We also experimented with an XGBoost version of this model, which used the same set of input features. We published our implementation of the model (both SVM and XGBoost versions) in the GitHub repository accompanying this paper.

Model by Papadamou et al. [38] (Papadamou's model).
The authors proposed a deep learning model to classify YouTube videos related to common conspiracy theory topics as pseudoscientific or scientific.
The proposed classifier takes four feature types as input: snippet (video title and description), video tags (defined by the creator of the video), transcript (subtitles uploaded by the creator of the video or auto-generated by YouTube), and top-200 video comments. It then uses fastText (fine-tuned to the inputs) to generate vector representations (embeddings) for each of the textual inputs. The resulting features are flattened into a single vector and processed by a 4-layer, fully-connected neural network (comprising 256, 128, 64, and 32 units with ReLU activation). Regularization using dropout (set to 0.5) is applied at each fully-connected layer. Finally, the output is passed to a 2-unit layer with softmax activation. The "pseudoscientific" class is predicted only when its classification probability is 0.7 or higher. The classifier is implemented using Keras and TensorFlow. Due to class imbalance (Papadamou's dataset contained 1,325 pseudoscience and 4,409 other videos), oversampling is applied during training to produce the same number of training samples for both classes. We made use of the source code provided by the authors of the paper. However, we did not use video tags as input features, as we lacked them in our dataset. Similarly to the authors' original work, we also experimented with a BERT-based version of the model, which used a pre-trained BERT model instead of fastText to compute embeddings of the textual inputs, with the rest of the neural network's architecture remaining the same.

4.6.3 Classification tasks.
Both models were originally applied to binary classification tasks and classified videos as misinformative/trustworthy in the case of Hou's model and pseudoscientific/scientific in the case of Papadamou's model. Since our data was annotated with multiple labels that were normalized into three classes (promoting, debunking, neutral), we had to decide how to handle the "neutral" class not considered in the original models. We experimented with the following variations of classes in our cross-validation of the models:
(1) Binary without neutral: only promoting (class 1) and debunking (class 2), discarding the neutral videos.
(2) Binary with neutral: promoting (class 1) and debunking together with neutral (class 2).
(3) 3 classes: promoting (class 1), debunking (class 2), and neutral (class 3).

4.6.4 Performance of the models.
We trained the models using a subset of all manually labeled data (a combination of the seed data and the videos encountered during data collection) for which we could retrieve all necessary information from the YouTube API (such as transcript and other metadata). It consisted of 2,622 labeled videos (405 promoting, 758 debunking, and 1,459 neutral videos). We evaluated the models using 5-fold cross-validation in the case of Hou's model and 10-fold cross-validation in the case of Papadamou's model to reflect the evaluation in their respective papers. Table 1 shows classification metrics comparing the models' performance reported in the original papers with their performance on our data, evaluated on a binary setup without the neutral class and with it. Table 2 compares the models' performance on the 3 classes setup. It also includes an evaluation of an XGBoost version of Hou's model and a BERT-based version of Papadamou's model.
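Two training and inference details of Papadamou's model mentioned above, the 0.7 decision threshold and the class-balancing oversampling, can be illustrated without any deep learning library. The sketch below is an assumption-based illustration, not the authors' code: the function names are hypothetical, and the oversampling shown is simple random duplication with replacement, which may differ from the technique used in the original implementation.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def predict_with_threshold(p_pseudoscientific: float, threshold: float = 0.7) -> str:
    """Predict "pseudoscientific" only if its probability reaches the threshold."""
    return "pseudoscientific" if p_pseudoscientific >= threshold else "scientific"


def oversample(
    samples: Sequence[Hashable], labels: Sequence[Hashable], seed: int = 0
) -> List[Tuple[Hashable, Hashable]]:
    """Randomly duplicate minority-class samples until all classes are equally large."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    target = max(len(items) for items in by_label.values())
    balanced: List[Tuple[Hashable, Hashable]] = []
    for label, items in by_label.items():
        # Draw the missing number of samples from the class itself, with replacement.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((sample, label) for sample in items + extra)
    rng.shuffle(balanced)
    return balanced
```

In a cross-validation setting, resampling of this kind is normally applied to the training folds only, so that duplicated samples do not leak into the evaluation fold.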
Hou's model showed performance similar to that reported in the paper when applied to the binary classification task with only the promoting and debunking classes. On the other hand, the performance (on the promoting class) decreased when we incorporated neutral videos into a "debunking + neutral" class (cf. Table 1). The low precision (0.42) on the promoting class shows that the model lacks the predictive power to distinguish these classes. Applying the model to classification of all three classes showed relatively weak performance as well (cf. Table 2). When comparing the results of the SVM model by Hou et al. [26] to the results of the XGBoost model, we can see that their performance is similar, with SVM having slightly better performance on the promoting class.
Table 1. Comparison of the classification metrics of the evaluated models as reported in their original papers (training: "Paper") or cross-validated on our data (training: "Our data"). We experimented with two different class setups to train the models on (classes: "Binary without neutral" and "Binary with neutral"). Precision, recall, and F1-score are reported both on the promoting (prom.) class (misinformative in the paper by Hou et al. [26], not reported by Papadamou et al. [38]) and as their weighted (weigh.) average across classes.
Papadamou's model achieved better performance when applied to binary classification with promoting and debunking videos only and also outperformed the metrics reported in the original paper (cf. Table 1). We attribute this improvement to the quality of our data, which was annotated by experts instead of the crowd-sourced annotators employed by Papadamou et al. It also retained a good performance (0.71 F1-score on the promoting class) when neutral videos were added into the "debunking + neutral" class. Therefore, we decided to adapt this model for classification of all three classes: promoting, debunking, and neutral. In this task, the model achieved a slightly lower F1-score (0.66) caused by lower recall (0.65 compared to 0.76) on the promoting class (with 405 samples); cf. Table 2. On the debunking class (with 758 samples), the model achieved precision 0.79 and recall 0.74. On the neutral class (with 1,459 samples), precision was 0.86 and recall 0.91. The results of the BERT-based version of the model were worse than those of the original Papadamou's model trained on our data, and it was even outperformed by both the SVM and XGBoost versions of Hou's model. This is in line with the results reported in [38], where the BERT-based model performed worse than the fastText version and had mixed results when compared to SVM and Random Forest. Table 3 shows a confusion matrix for the best-performing Papadamou's model on all three classes.

4.6.5 Implications for the rest of the paper.
Seeing that Hou's model was struggling with the neutral class, we opted for Papadamou's model for use in this paper. We further decided to take advantage of the model trained for the 3-class classification task, as that enables deeper analyses and retains a satisfactory performance.
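As a consistency check on the per-class figures quoted above for the 3-class Papadamou's model, overall accuracy can be recovered from per-class recall and class sizes, because accuracy equals the support-weighted average of recall. The sketch below uses only the rounded numbers from the text, so the result is approximate; the variable and function names are illustrative.

```python
from typing import Dict

# Rounded per-class figures quoted in the text for the 3-class setup.
SUPPORT: Dict[str, int] = {"promoting": 405, "debunking": 758, "neutral": 1459}
RECALL: Dict[str, float] = {"promoting": 0.65, "debunking": 0.74, "neutral": 0.91}
PRECISION: Dict[str, float] = {"debunking": 0.79, "neutral": 0.86}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def accuracy_from_recall(recall: Dict[str, float], support: Dict[str, int]) -> float:
    """Accuracy = correctly classified items / all items = sum(recall_c * n_c) / N."""
    total = sum(support.values())
    return sum(recall[c] * support[c] for c in support) / total


accuracy = accuracy_from_recall(RECALL, SUPPORT)
f1_debunking = f1(PRECISION["debunking"], RECALL["debunking"])
f1_neutral = f1(PRECISION["neutral"], RECALL["neutral"])
```

With these rounded inputs, the recovered accuracy is roughly 0.82, in agreement with the accuracy stated in the abstract.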

Table 2. Comparison of the classification metrics of the evaluated models cross-validated on our data with the "3 classes" setup. Precision, recall, and F1-score are reported both on the promoting (prom.) class and as their weighted (weigh.) average across classes. We compare the original models proposed by Hou et al. [26] and Papadamou et al. [38] to their XGBoost and BERT-based variants respectively. For the data analysis presented further in this paper, we make use of the best-performing model, which is reported in the rightmost column of this table.

Ethics assessment
To consider the various ethical issues regarding the research of misinformative content, we participated in a series of ethics workshops facilitated by experts on AI and data ethics from among the co-authors of this paper who were not part of the technical team. The workshops were aimed at exploring questions related to data ethics [52] and AI ethics issues [13] within our audit and its impact on stakeholders in the light of the principles of Responsible Research and Innovation [37]. The most affected stakeholder groups were platform users, annotators, content creators, and researchers. We devised different engagement strategies and specific action steps for every stakeholder group. Our main task was to devise countermeasures to the most prominent risks that could emerge for these stakeholder groups. First, we were concerned about the risk of unjustified flagging of content as misinformation and of its creators as conspirators. To minimize this risk, we decided to report hesitations in the manual annotation process. These hesitations were subsequently back-checked by other annotators and independently validated until a consensus was reached. After the deployment of the automated machine learning pipeline, this concern was accompanied by the risk that we might over-rely on automated annotation based on previous results and lose caution in the human annotation process. Therefore, we proposed to put in place processes to check the performance of the model and to detect its potential deterioration, ensuring that human oversight is in place for video recommendations and home page results. These include the proposal for manual annotation of a new sample of data (in case we decide to use the trained model on newly collected data, e.g., collected later in time or for different topics, which was, however, not yet the case) and checking the properties of the new data, e.g., distributions compared to the training and validation data.
One of our main concerns was also not to harm or delude other users of the online platform. To avoid a disproportional boost of the misinformation content by our activity, we selected videos with at least 1000 views and warned the annotators not to watch videos online more than once, or in case of back-checks, twice. After each round, we reset the user account and deleted the watch history. To minimize the risk of misuse of data about users and their alleged inclination to possibly misinformative content, which can be present, e.g., in video comments, we decided not to publish automatically annotated data and to store this data on servers with appropriate safeguards in place, with access restricted to approved researchers.
Other concerns were connected to the deterioration of the well-being of human annotators. Specifically, their decision-making abilities could be negatively affected after a long annotation process. The ethics experts from among the co-authors of this paper, who did not participate in the annotation process, proposed daily routines for the annotation, including breaks during the process, and advised monitoring any changes in the annotators' beliefs. The annotators also underwent a survey on their tendency to believe in conspiracy theories, and none of them showed such a tendency at the end of the study.
We also wanted to avoid being too general in our conclusions regarding the dynamics of filter bubble creation and bubble bursting for various groups of users. Together with ethics experts, we identified possible limitations of our study that were tied not only to the number of topics that we investigated or the set of agent interactions with the platform, but also to the fact that the agent's user account was tied only to specific settings (geolocation, date of birth, gender, or name), even though we have no evidence of their impact on the filter bubble creation dynamics. Yet, we note that there still remains the risk that the conclusions drawn from the results of our study might not be fully representative of other groups of users. Validating our results on a more diverse set of users is a future challenge, together with the need for the deployment of tools to identify potential sources of systematic discrimination or other unwanted biases in the data or the model.
We were also aware that the risk of possible unjust biases and discrimination present in the model is tightly connected to the problem of transparency and explainability of the proposed model for automated annotation. This problem still poses an open question for future research. We have decided to publish the source code and datasets, excluding automatically annotated data, to support any future research in this area and maintain the privacy of users at the same time. In addition, to avoid any possible legal issues, we made public only the YouTube IDs of seed and encountered videos without any metadata, such as titles, descriptions, or transcripts, which might be protected by copyright. However, these can be downloaded using the YouTube API (if the video is still available on the platform).

A note on comparability with the reference study by Hussein et al.
In order to be able to draw comparisons, we kept the methodology of our study as compatible as possible with the previous study by Hussein et al. [27].We shared the general approach of prompting YouTube with implicit feedback: both studies used similar scenarios of watching a series of misinformation promoting videos and recording search results and recommended videos.We re-used the topics, a subset (for scaling reasons) of search queries, and all available seed videos (complementing the rest by using a similar approach as the reference study).Moreover, both studies used the same coding scheme, metrics, sleep times, and annotated a similar number of videos.
We should also note differences between the studies, which mainly source from different original motivations for our study.For instance, no significant effects of demographics and geolocation of the agents were found in the reference study, so we only controlled these.In Hussein's experiments, all videos were first "watched" and only then all search queries were fired.In our study, we fired all queries after watching every 2nd video (with the motivation to get data from the entire run, not just the start and end moment).The reference study created genuine 150 accounts on YouTube, while we used fewer accounts and took advantage of the browsing history reset option.In some aspects, our study had a larger scale: we executed 10 runs for each topic instead of one (to reduce possible noise) and used twice as many seed videos (to make sure that filter bubble effects develop).There were also technical differences between the setups, as we used our own implementation of agents (e.g., different browser, ad-blocking software).
Given the methodological alignment (and despite the differences), we are confident that we can directly compare some of the outcomes of both studies, namely the quantity of misinformative content appearing at the end of the promoting phases.
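The query schedule described in this section (search queries fired after every 2nd watched video rather than only at the end) can be sketched as a simple schedule generator. This is an illustrative sketch only, not the agent implementation used in the study: the function name `build_schedule` and the action labels are hypothetical, the initial search before any video is an assumption made so that a starting point (S1) exists for search results, and the counts of 40 promoting and 40 debunking videos are taken from the study design referred to later in the paper.

```python
from typing import List, Tuple


def build_schedule(n_promoting: int = 40,
                   n_debunking: int = 40,
                   search_every: int = 2) -> List[Tuple[str, int]]:
    """Return (action, videos_watched_so_far) steps for one hypothetical agent run."""
    # Assumed baseline search before any video is watched (start point S1).
    schedule: List[Tuple[str, int]] = [("search", 0)]
    total = n_promoting + n_debunking
    for watched in range(1, total + 1):
        # The promoting phase comes first, the debunking phase second.
        action = "watch_promoting" if watched <= n_promoting else "watch_debunking"
        schedule.append((action, watched))
        # Fire all search queries after every `search_every`-th watched video.
        if watched % search_every == 0:
            schedule.append(("search", watched))
    return schedule


schedule = build_schedule()
n_search_points = sum(1 for action, _ in schedule if action == "search")
print(n_search_points)
```

Under these assumptions, each run yields one search point before watching plus one after every second video, so search results are sampled across the whole run instead of only at its start and end.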

RESULTS AND FINDINGS
Following the study design, we executed the study between March 2nd and March 31st, 2021. In total, we executed 50 bot runs (10 for each topic). On average, runs for a single topic took 5 days (bots for a topic ran in parallel). The bots watched 3,951 videos, executed 10,075 queries, and visited the home page 3,990 times. For each of these, we recorded the recommendations and results provided by YouTube, as shown in Table 4. Overall, we recorded 17,405 unique videos originating from 6,342 channels.
Using the selection strategy and annotation scheme described in Section 4.5, five annotators annotated 2,914 unique videos (covering 255,844 appearances). The distribution of labels is shown in Table 5. Promoting videos constituted 8% of annotated search results and 7% of annotated top-10 recommendations. Debunking videos made up 27% of annotated search results and 19% of annotated top-10 recommendations. Using the trained machine learning model, we retrieved labels for an additional 13,801 videos from top-10 recommendations and home page results. Their distribution is also shown in Table 5. Promoting videos constituted 3% of predicted top-10 recommendations and 4% of predicted home page results. Debunking videos made up 56% of predicted top-10 recommendations and 44% of predicted home page results. Compared to the manually annotated data, we see lower percentages of promoting videos, but higher percentages of debunking videos.
Table 6 shows basic descriptive statistics of the collected data with respect to the length of videos and how many times they appeared during data collection. Both distributions are power-law distributions with very long tails. When looking at more frequent videos (appearing 10 or more times), we observe that even though they represent only around 17% of the collected data, they comprise 26% of all promoting videos, i.e., they are 1.7 times more likely to be promoting misinformation than the less frequent videos. We also collected the popularity of the videos (number of views, comments, likes, etc.), but due to many missing values (in about 40% of all encountered videos), we do not report statistics related to the popularity.
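The factor of 1.7 quoted above can be checked with a short calculation. The sketch below assumes the rounded shares given in the text (17% of the data, 26% of promoting videos); the function name `relative_likelihood` is introduced here only for illustration.

```python
def relative_likelihood(share_of_data: float, share_of_promoting: float) -> float:
    """How much more likely a frequent video is to be promoting than a less frequent one.

    Both rates are expressed relative to the overall promoting rate, which
    cancels out in the final ratio.
    """
    rate_frequent = share_of_promoting / share_of_data
    rate_less_frequent = (1 - share_of_promoting) / (1 - share_of_data)
    return rate_frequent / rate_less_frequent


ratio = relative_likelihood(share_of_data=0.17, share_of_promoting=0.26)
print(round(ratio, 1))  # 1.7
```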
Table 5. Distribution of manual and predicted labels. Promoting and debunking videos include those related as well as unrelated to respective topics. Debunking videos also include those manually annotated as mocking videos. In further analyses presented in the paper as well as in the classification model, we do not distinguish between neutral videos and videos labeled as not about misinformation (both are regarded as neutral). The other videos include videos that were manually labeled as unknown, non-English or removed.

Table 6. Descriptive statistics of the data with respect to the length of videos (#minutes) and how many times they appeared during data collection (#encounters). The statistics are computed for a subset of 15,837 videos (out of all 17,405 encountered videos), for which we were able to obtain metadata. We collected videos' metadata using the YouTube API some time after the data collection itself. This meant that we were not able to get metadata for all encountered videos, e.g., in cases when the videos were removed by the authors or by the platform.

We report the results according to research questions and hypotheses defined in Section 4.1. We use manually annotated data to answer all research questions and to test our hypotheses except for those related to home page results and the hypothesis H2.3 concerned with the slope of change in the proportion of misinformation; in those cases we also use the automatically labeled data to complement the manual labels. SERP-MS score metrics are reported for search results and mean normalized scores for recommendations and home page results. Since the metrics are not normally distributed and some samples are of unequal sizes, we make use of non-parametric statistical tests. Pairwise tests are performed using the two-sided Mann-Whitney U test. In cases where multiple comparisons by topics are performed, Bonferroni correction is applied to the significance level (in that case, α = 0.05 is divided by the number of topics, 5, resulting in α = 0.01).
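The testing procedure described above can be illustrated with a small, dependency-free sketch. It is not the analysis code of the study: the function `mann_whitney_u` is a plain re-implementation of the two-sided test using the normal approximation (with tie and continuity correction), and the two score samples are made-up values used only to show how a p-value is compared against the Bonferroni-corrected significance level.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns the U statistic of the first sample and an approximate p-value.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        t = j - i
        tie_term += t ** 3 - t  # used for the tie correction of the variance
        i = j

    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # all observations identical
        return u1, 1.0
    z = (abs(u1 - mean_u) - 0.5) / math.sqrt(var_u)  # continuity correction
    p_value = min(1.0, math.erfc(z / math.sqrt(2)))
    return u1, p_value


ALPHA = 0.05
N_TOPICS = 5
alpha_corrected = ALPHA / N_TOPICS  # Bonferroni correction: 0.05 / 5 = 0.01

# Hypothetical normalized scores for one topic at two points of a run.
scores_start = [-0.4, -0.3, -0.5, -0.2, -0.6, -0.35]
scores_end = [0.1, 0.0, 0.2, -0.1, 0.3, 0.05]

u_stat, p_value = mann_whitney_u(scores_start, scores_end)
significant = p_value < alpha_corrected
print(u_stat, p_value, significant)
```

For small samples an exact test is usually preferable to the normal approximation; statistical libraries typically select the method automatically.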

RQ1: Has YouTube's personalization behavior changed since the reference study?
Regarding H1.1, we overall see a small change in the mean SERP-MS score across the same search queries in our and reference data: mean SERP-MS worsened from -0.46 (std 0.42) in reference data to -0.42 (std 0.3) in our data. However, the distributions are not statistically significantly different (n.s.d.). There is a similar small change towards the promoting spectrum in up-next (first result in recommendation list) and top-5 recommendations (following 5 recommendations). We compared the up-next and top-5 recommendations together (as top-6 recommendations) using the last 10 watched promoting videos in reference watch experiments and the last two watched videos in our promoting phase. We see that the mean normalized score worsened from -0.07 (std 0.27) in reference data to -0.04 (std 0.31) in our data. These distributions are also not significantly different (U=45781.5, n.s.d.). More considerable shifts in the data can be observed when looking at individual topics. Table 7 shows a comparison of SERP-MS scores for top-10 search results between our and reference data. Improvement can be seen within certain queries for the chemtrails conspiracy that show a large decrease in the number of promoting videos. The reference study reported that this topic receives significantly more misinformative search results compared to all other topics. In our experiments, their proportion was lower than in the 9/11 conspiracy. On the other hand, search results for the flat earth conspiracy worsened. Queries such as "flat earth british" resulted in more promoting videos, likely due to new content on channels with similar names. Within the anti-vaccination topic, there is an increase in neutral videos (from 12% to 35%) and thus a drop in debunking videos (from 85% to 61%). This may relate to new content regarding COVID-19.
Table 8 shows a comparison of normalized scores for up-next and top-5 recommendations. Only the moon landing and anti-vaccination topics come from statistically significantly different distributions. Similar to search results, recommendations for the 9/11 and anti-vaccination conspiracy topics worsened. There were more promoting videos on the 9/11 topic (27% instead of 18%). In the anti-vaccination topic, we observed a drop in debunking videos (from 29% to 9%) and a subsequent increase in neutral (from 70% to 78%) and promoting videos (from 1% to 8%). The change within the anti-vaccination controversy is even more pronounced when looking at up-next recommendations separately. Within up-next, the proportion of debunking videos drops from 77% to 19%, neutral videos increase from 22% to 70%, and promoting videos increase from 1% to 11%. On the other hand, in the moon landing topic, we see many more debunking video recommendations: 40% instead of 23% in the reference data.
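To make the two metrics used in these comparisons concrete, the following sketch computes both for a ranked list of stance labels. It assumes the SERP-MS definition as we understand it from the reference study (each result is weighted by its inverted rank and the sum is divided by the maximum attainable weight) and takes the normalized score to be the plain mean of the stance values, which matches the stated range from -1 (all debunking) to +1 (all promoting). The names `STANCE_VALUE`, `serp_ms`, and `normalized_score` are illustrative.

```python
from typing import Dict, Sequence

# Stance values as used in the coding scheme: -1 debunking, 0 neutral, 1 promoting.
STANCE_VALUE: Dict[str, int] = {"debunking": -1, "neutral": 0, "promoting": 1}


def serp_ms(labels: Sequence[str]) -> float:
    """Rank-weighted misinformation score of a ranked result list (top result first)."""
    n = len(labels)
    weighted_sum = sum(
        STANCE_VALUE[label] * (n - rank + 1)  # higher-ranked results weigh more
        for rank, label in enumerate(labels, start=1)
    )
    return weighted_sum / (n * (n + 1) / 2)


def normalized_score(labels: Sequence[str]) -> float:
    """Unweighted mean stance of a list of recommendations."""
    return sum(STANCE_VALUE[label] for label in labels) / len(labels)


results = ["promoting", "neutral", "debunking"]
print(serp_ms(results), normalized_score(results))
```

In this example the promoting video is ranked first, so the rank-weighted score is positive even though the unweighted mean is zero; this is why the rank-sensitive score is used for search results, where order matters.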
These results bring up a need to distinguish between endogenous (e.g., changes in algorithms, policy decisions made by platforms to hide certain content) and exogenous factors (e.g., changes in content, external events, behavior of content creators), as discussed by Metaxa et al. [34]. Our observations show that search results and recommendations were in part influenced by exogenous changes in content on YouTube. Within the chemtrails conspiracy, we observed results related to a (then) new song by Lana del Rey that mentions "Chemtrails" in its name. Search results and recommendations in the anti-vaccination topic seem to be influenced by COVID-19. Flat earth conspiracy videos were influenced by an increased amount of activity within a single conspiratorial channel.

For the promoting and debunking phases of our experiment, we report four comparisons: (1) comparison of metrics between the start (S1) and the end (E1) of the promoting phase, answering H2.0; (2) comparison of metrics between the end of the promoting phase (E1) and the end of the debunking phase (E2), answering H2.1; (3) comparison of metrics between the start of the promoting phase (S1) and the end of the debunking phase (E2), answering H2.2; (4) comparison of the slope of metrics in the promoting phase and in the debunking phase, answering H2.3. As already noted, automatically generated annotations using the trained ML model were used in addition to the manually labeled data for evaluating the comparisons on home page results and, in the case of comparison (4), also on top-10 recommendations.

Comparison (1).
There are changes in search results, recommendations and home page results after watching promoting videos (E1) compared to the start of the experiment (S1). If a misinformation bubble effect had developed, we would expect the metrics to worsen due to watching promoting videos (H2.0). Regarding search results, the distribution of SERP-MS scores between S1 and E1 is indeed significantly different (MW U=34118.5, p-value=0.028). However, the score actually improves: the mean SERP-MS score changed from -0.39 (std 0.28) to -0.42 (std 0.3). Table 9 shows the change for individual topics. Only the flat earth conspiracy shows significant differences and improved the SERP-MS score due to a decrease in promoting and an increase in debunking videos.
Top-10 recommendations also change their distribution of normalized scores significantly at E1 compared to S1 (MW U=4085, p-value=0.0397). We observe that the mean normalized score worsens from -0.07 (std 0.24) to 0.01 (std 0.31). Looking at individual topics in Table 10, we can see that the change is significant in the 9/11 and anti-vaccination topics, which gain more promoting videos.
Table 11. Comparison of changes in average normalized scores for top-10 home page results in promoting and debunking phase of our experiment. Three points are compared: start of promoting phase (S1), end of promoting phase (E1), end of debunking phase (E2).

The overall change in home page results across all topics is statistically significantly different as well (MW U=3412.5, p-value=0.0002). It worsens from -0.17 (std 0.14) to -0.09 (std 0.17). When looking at the individual topics, we see statistically significant changes on the home page in certain topics: 9/11 and chemtrails both get worse (cf. Table 11). We see an increase in the proportion of promoting videos also in the flat earth and anti-vaccination topics, but these changes are not statistically significant. Interestingly, home page results in the moon landing topic see a higher proportion of debunking videos (although this change is also not significant).

Comparison (2).
When examining the changes in search results, recommendations and home page results between the end of the promoting phase (E1) and the end of the debunking phase (E2), we expect the metrics to improve due to watching debunking videos, i.e., that we would observe misinformation bubble bursting (lessening of a misinformation filter bubble effect; H2.1). However, SERP-MS scores in search results between E1 and E2 are not from statistically significantly different distributions, which is consistent with the fact that we did not observe misinformation bubble creation in search results in the first place. Table 9 shows minor improvements in SERP-MS scores across the topics, but they are not statistically significant. Top-10 recommendations show more considerable differences and their overall distribution is significantly different when comparing E1 and E2 (MW U=7179.5, p-value=1.8e−9). The mean normalized score improves from 0.01 (std 0.31) to -0.27 (std 0.27). Table 10 shows significantly different distributions for all topics except for the moon landing conspiracy. All topics show an improvement in normalized scores. The 9/11 topic shows a decrease in promoting videos, while the other topics show an increase in the number of debunking videos.
Home page results also show an overall significantly different distribution of the normalized scores between E1 and E2 (MW U=6876.5, p-value=0.0). There are statistically significant improvements in all topics except for 9/11. All topics except for moon landing show a decrease in the number of promoting videos and all topics except for 9/11 show a rise in debunking videos.

Comparison (3).
When examining the differences between the start (S1) and end of the experiment (E2), we expect the metrics to improve due to watching debunking videos despite watching promoting videos before that (H2.2). The distribution of SERP-MS scores in search results is statistically significantly different when comparing S1 and E2 (MW U=36515, p-value=0.0002). Overall, we see an improvement in the mean SERP-MS score from -0.39 (std 0.28) to -0.46 (std 0.29). Table 9 shows that only the flat earth and anti-vaccination topics significantly changed their distributions, but all topics show an improvement according to our expectations. The improvement is due to increases in debunking videos, decreases in promoting videos, or reordered search results in some search queries.
Similarly, top-10 recommendations at E2 come from a significantly different distribution than at S1 (MW U=6940.5, p-value=2.9e−7). The mean normalized score improves from -0.07 (std 0.24) to -0.27 (std 0.27). Table 10 shows a significant difference in distributions for all topics except for the 9/11 and moon landing conspiracies. Mean normalized scores improve compared to S1 in all topics except for 9/11, where the score is comparable to S1 (due to the numbers of promoting and neutral videos returning to S1 levels). Other topics show increases in the numbers of debunking videos.
Home page results at E2 also come from a statistically significantly different distribution compared to S1 (MW U=6340.0, p-value=0.0). All topics except for 9/11, which interestingly gets worse, show a statistically significant improvement in the metrics, most commonly due to an increase in the number of debunking videos.

Comparison (4).
Finally, we want to look deeper at the change in the metrics throughout the experiment. Our interest is in evaluating the slope of the misinformation normalized score, and we expect it to increase approximately linearly as the 40 promoting videos are watched and decrease approximately linearly as the 40 debunking videos are watched (H2.3). We use the DIFF-TO-LINEAR metric defined in Section 4.2 and evaluate it for top-10 recommendations and home page results within topics that showed statistically significant changes in the normalized scores (in S1 vs. E1 and in E1 vs. E2 comparisons). Table 12 shows the results. In most cases, we can see that the change is faster than linear. In the promoting phase, recommendations in the 9/11 topic show positive values, which indicates that they worsen faster than linearly. The remaining topics show negative values that are close to 0 (except for anti-vaccination, which worsens slower than linearly). The change is larger in the debunking phase: all topics show faster improvement (negative values) of top-10 recommendations and home page results. Figure 2 lets us look at these changes in normalized score in more detail. We can observe the change that happens right after the end of the promoting phase: there is a sudden decrease (improvement) in the score. This is visible for both top-10 recommendations and home page results in most topics. The main exception is the 9/11 topic, which shows more gradual changes compared to other topics in both the promoting and debunking phases. To look even deeper at how the proportions of promoting, debunking, and neutral videos change over the experiment, we can refer to Figure 3. Here we can see a sudden increase in the number of debunking videos, especially in recommendations, at the start of the debunking phase. The proportion of promoting videos increases gradually over the promoting phase and decreases over the debunking phase.

Table 12. Difference to expected linear trend (DIFF-TO-LINEAR metric) across top-10 recommendations ("Recomm.") and home page results ("Home") in the promoting phase (phase 1) and debunking phase (phase 2) for topics with statistically significant changes in the normalized score metrics in the respective phases (cells with "-" denote cases without statistically significant changes). Positive values in the promoting phase indicate that the normalized score worsens faster than linearly, and negative values in the debunking phase indicate that it improves faster than linearly. The promoting phase shows smaller differences to the expected linear trend compared to the debunking phase. On the other hand, the normalized score improves much faster than the linear trend in the debunking phase in most cases.
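The idea behind comparing an observed score trajectory with a linear trend can be sketched as follows. The exact DIFF-TO-LINEAR definition is given in Section 4.2 and is not reproduced here; this sketch assumes one plausible reading, namely the mean signed difference between the observed normalized scores and the straight line joining the first and last observation of a phase. The function name `diff_to_linear` and the example trajectories are hypothetical.

```python
from typing import Sequence


def diff_to_linear(scores: Sequence[float]) -> float:
    """Mean signed difference between observed scores and a linear start-to-end trend.

    Assumed reading of the metric: positive when the scores run above the
    straight line, negative when they run below it.
    """
    n = len(scores)
    if n < 3:
        return 0.0  # a line through one or two points fits them exactly
    start, end = scores[0], scores[-1]
    total = 0.0
    for i, observed in enumerate(scores):
        expected = start + (end - start) * i / (n - 1)  # point on the linear trend
        total += observed - expected
    return total / n


# Hypothetical debunking phase: the score drops right away and then stays flat.
sudden_drop = [0.0, -0.3, -0.3, -0.3, -0.3]
# Hypothetical phase in which the score changes exactly linearly.
linear_change = [0.0, 0.1, 0.2]

print(diff_to_linear(sudden_drop), diff_to_linear(linear_change))
```

Under this reading, a sudden drop at the start of the debunking phase gives a negative value (improvement faster than linear), and an exactly linear trajectory gives a value of zero, which is consistent with the sign convention stated in the caption of Table 12.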

DISCUSSION AND CONCLUSIONS
In the paper, we presented an audit of misinformation present in search results, recommendations, and home page results on the video-sharing platform YouTube. To support reproducibility 7 , we publish the collected data and source code for the experiment. We aimed at verifying a hypothesis that there is less misinformation present in both search results and recommendations as a result of the ongoing improvements in YouTube recommender systems and policies [54,55] (H1.1). The comparison was done against a study carried out in mid-2019 by Hussein et al. [27]. We were also interested in whether we could still observe the misinformation bubble effect after watching videos promoting conspiracy theories (H2.0). In addition to the previous studies, we also examined bubble bursting behavior. Namely, we aimed to verify whether misinformation bubbles could be burst if we watched videos debunking conspiracy theories (H2.1). We also hypothesized that watching debunking videos (even after a previous sequence of promoting videos) would still decrease the amount of misinformation compared to the initial state with no watch history at the start of the study (H2.2). Finally, we investigated the dynamics of change in the prevalence of misinformation-promoting videos and hypothesized that their number would increase linearly (using linear change as a baseline) as misinformation-promoting videos are watched, and decrease linearly as more and more misinformation-debunking videos are watched (H2.3).
Regarding hypothesis H1.1, we did not find a significantly different amount of misinformation in search results in comparison to the reference study. A single topic (anti-vaccination) showed a statistically significant difference. However, it did not agree with the hypothesis, as the metric worsened due to more neutral and fewer debunking videos. Recommendations showed significant differences across multiple topics but were not significantly different overall. A single topic (moon landing) improved normalized scores of recommendations in agreement with the hypothesis. Yet, the anti-vaccination topic worsened its scores. We suspect the changes in search results and recommendations were influenced mostly by changes in content, but currently it is not possible to reliably differentiate between endogenous (e.g., change in the recommendation algorithm or platform policy) and exogenous changes (e.g., change in content on the platform). Overall, our results did not show a significant improvement in the prevalence of misinformation in the audited topics, although it is worth noting that the absolute numbers of misinformation-promoting videos are relatively low in most of the audited topics.
We did not observe the development of a misinformation filter bubble effect in search results (H2.0) despite watching promoting videos. On the other hand, recommendations behaved according to our hypothesis, and their overall normalized scores worsened. By making use of predictions from a trained ML model, we evaluated home page results as well. They showed a statistically significant change in their overall distribution; the scores worsened in the 9/11 and chemtrails topics.
Since we did not observe a filter bubble effect in search results, we did not observe any bubble bursting effect there either. Results did not show a statistically significant difference between the end of the promoting phase and the end of the debunking phase. Recommendations as well as home page results showed more considerable differences that were statistically significant and confirmed the hypothesis H2.1.
We showed that watching debunking videos decreases the number of misinformation-promoting videos and increases the number of misinformation-debunking videos in search results, recommendations and home page results when compared to the initial state (start of the experiment), which confirms our hypothesis H2.2. We observed an improvement of SERP-MS scores for search results and normalized scores for recommendations and home page results in most topics.
Finally, we inspected the trend of change in metrics over the course of the experiment, as we expected them to change linearly as misinformation-promoting or debunking videos are watched (H2.3). We saw large deviations from a linear trend in the debunking phase. There was a sudden improvement of the misinformation score after the first watched debunking video, caused by an increase in the number of debunking videos. Although changes in the score continue as more videos are watched, there is a strong contextuality of recommendations and home page results with the most recently watched videos.
Based on our results, we can conclude that users, even with a watch history of promoting conspiracy theories, do not get enclosed in a misinformation filter bubble when they search on YouTube. However, we do observe this effect in video recommendations and home page results, to varying degrees depending on the topic. At the same time, watching debunking videos helps in practically all cases to decrease the amount of misinformation that the users see. Additionally, although we expected to see less misinformation than the previous studies reported, this was in general not the case. Worsening in the anti-vaccination topic was partially expected due to the COVID-19 pandemic. However, it is interesting that we also observed a worse situation in the 9/11 topic. In fact, this topic served as a sort of gateway to misinformation videos on other topics.
Besides the results, several limitations of our study need to be pointed out. First, we investigated only a limited number of topics; these did not include, for example, the recent QAnon conspiracy, disinformation and propaganda related to the Russia-Ukraine war, or COVID-19 related conspiracies (which were present only indirectly through anti-vaccination narratives). However, our topics were explicitly selected to allow comparison with the reference study. Additional audit studies like ours are needed to investigate misinformation prevalence in constantly emerging new topics. It would also be interesting to extend the audit to different geolocations (other than the USA) and languages other than English, and to select topics relevant for these new contexts. However, the substantial required effort (especially with respect to the data collection and the subsequent manual annotation) is currently a serious limiting factor in this regard. Another limitation is that we included only a limited set of agent interactions with the platform (search and video watching). Real users also like or dislike videos, subscribe to channels, leave comments, or click on the search results or recommendations. A more human-like bot simulation, with these interactions and possible inclusion of human biases, remains a subject of future work. Furthermore, as already stated, it is currently difficult (if possible at all) to distinguish between endogenous and exogenous factors that impact the observed results. The direct comparison between independent audit studies is also sometimes difficult due to many factors that might confound the results and that researchers must be aware of when interpreting the observed differences.
Despite the limitations, our audit showed that YouTube (similar to other platforms), despite its best efforts so far, can still promote misinformation-seeking behavior to some extent. Even a year after running the experiment, the majority of videos that we annotated as promoting (either those we used as seed videos, or the additional ones we encountered) still remain on the platform. Only 64 out of the 449 promoting videos (∼ 14%) are no longer available on the platform. However, we can see the focus on dealing with misinformation from the platform, as the number of removed promoting videos is higher than that of removed debunking (8 out of 760 seed and encountered) and neutral (41 out of 2,042 seed and encountered) videos, so the removal cannot be attributed solely to the evolution of videos on the platform (e.g., how many are created and removed regardless of topic). We can also see the focus on major misinformation topics, as 75% of the removed promoting seed videos come from the anti-vaccination topic, and of all promoting videos (seed and encountered) that were removed, ∼ 49% come from the anti-vaccination and ∼ 15% from the flat earth topic. The relatively low number of removed promoting videos can be a result of a difference between what we consider to be a promoting video and what is considered to be a misinformation video by the platform, although the number of potentially harmful videos present on the platform is still high.
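The removal counts quoted above are easier to compare as rates. The following short calculation uses only the counts stated in the text; the dictionary layout and variable names are illustrative.

```python
# (removed, total) counts of seed and encountered videos per stance, as reported above.
removed_counts = {
    "promoting": (64, 449),
    "debunking": (8, 760),
    "neutral": (41, 2042),
}

# Share of videos in each class that is no longer available on the platform.
removal_rates = {
    label: removed / total for label, (removed, total) in removed_counts.items()
}

for label, rate in removal_rates.items():
    print(f"{label}: {rate:.1%}")
```

The promoting class has by far the highest removal rate, which is the basis of the argument that the removals cannot be explained by ordinary video turnover alone.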
The results also motivate the need for independent continuous and automatic audits of YouTube and other social media platforms [47], since we observed that the amount of misinformation in a topic could change over time due to endogenous as well as exogenous factors. The partial use of automated annotation of recommended videos shown in this paper is a step towards this goal. However, it is crucial for any automated approach to be trustworthy, i.e., to be robust (e.g., against potential concept drifts), explainable, and to include humans in the loop (or better yet, in control) at all stages of the audit. This remains our future work.

Fig. 1. The diagram depicting the agent scenario executed during the audit.

Fig. 2. Changes in average normalized score for top-10 recommendations (top) and for home page results (bottom) over the duration of the experiment. The normalized score ranges from -1 for all debunking to +1 for all promoting recommendations. The X-axis shows the number of videos that the agents had watched before the recorded recommendations. Recall that the agents first watched 40 promoting and then 40 debunking videos. For some topics, one can observe a sudden drop in the normalized score after the 40th video, i.e., when agents started watching debunking videos. As some of the video labels are generated by the trained machine learning model, we also show the proportion of manually annotated videos out of all recommendations using the size of dots.

Fig. 3. Proportions of stance labels (promoting, debunking, neutral) of videos in top-10 recommendations (left) and home page results (right) over the duration of the experiment.

Algorithm 1. if Watched   videos since the last search then:

… (text in italics): model from Papadamou et al. [38] classifying videos into 3 classes (promoting, debunking, and neutral).

Table 3. Confusion matrix from cross-validation of the model by Papadamou et al. [38] trained on our data for classification into three classes. There is a significant class imbalance with the neutral class being the most prominent. Oversampling was used in training to address this problem.

Table 4. Overview of the data collected by bots during the study execution.

Table 7. Comparison of SERP-MS scores for top-10 search results with data from the reference study. The scores lie in the range ⟨−1, 1⟩, where -1 denotes a debunking and 1 a promoting stance towards the conspiracy. Only search results from queries that were executed both by the reference study and us are considered.

Table 8. Comparison of normalized scores for up-next and top-5 recommendations with data from the reference study. Normalized scores lie in the range ⟨−1, 1⟩, where -1 denotes a debunking and 1 a promoting stance towards the conspiracy. The last 10 out of 20 watched videos in reference data are considered. The last 2 out of 40 watched videos in our data are considered.

Table 9. Comparison of SERP-MS scores for top-10 search results in promoting and debunking phase of our experiment. Three points are compared: start of promoting phase (S1), end of promoting phase (E1), end of debunking phase (E2).

Table 10. Comparison of changes in average normalized scores for top-10 recommendations in promoting and debunking phase of our experiment. Three points are compared: start of promoting phase (S1), end of promoting phase (E1), end of debunking phase (E2).