Dataset for paper "Algorithmic Audit of Personalisation Drift in Polarising Topics on TikTok"

Pecher, Branislav; Bindas, Adrián; Jakubčík, Ján; Tuna, Matus; Tibensky, Matus; Liska, Simon; Sakalik, Peter; Šutý, Andrej; Mosnar, Matej; Hossner, Filip; Srba, Ivan

doi:10.5281/zenodo.19144520

Published March 21, 2026 | Version v1

Dataset Restricted

1. Kempelen Institute of Intelligent Technologies

This dataset accompanies the paper “Algorithmic Audit of Personalisation Drift in Polarising Topics on TikTok” and is designed for analysing video interactions and user engagement patterns on the TikTok website. It contains records of the interactions of social media auditing agents with the TikTok website over the timespan of the study.

The video excerpts included in this dataset are used solely as units of content for analytical purposes. They do not represent, reflect, or imply the personal views, intentions, or stance of the individuals who created them. Content should be interpreted as data artifacts, not as statements attributable to any person.

To minimize the risk of third-party misuse, the dataset is available only to researchers, for non-commercial research purposes, after verification of an email address associated with an academic organisation.

Paper: TBA

Preprint: TBA

GitHub repository: https://github.com/kinit-sk/ai-auditology-personalisation-drift-tiktok

References

If you use this dataset in any publication, project, tool, or in any other form, please cite the following paper:

TBA

Dataset Description

The dataset consists of 3 CSV files:

ai-auditology-personalisation-drift-tiktok_32_agents_polarizing_plus_neutral.csv: data for the first user group (neutral+polarising), consisting of 30 users from runs that were seeded with both a polarising and a neutral topic.
ai-auditology-personalisation-drift-tiktok_32_agents_polarizing_only.csv: data for the second user group (polarising only), consisting of an additional 32 users (4 per topic+stance combination) that are seeded only with a polarising topic (representing maximum polarity) but interact with a neutral topic during the interaction phase.
ai-auditology-personalisation-drift-tiktok_US_politics_4_agents_mixed_polarity.csv: data for the third user group (mixed polarity), seeded in an equal manner with only the US politics topic.

The CSV files contain 28 columns (29 for data contained in ai-auditology-personalisation-drift-tiktok_US_politics_4_agents_mixed_polarity.csv), capturing details such as session and video identifiers, timestamps, ad classifications, visual indicators, user demographics, and video metadata.
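The column counts stated above can be checked against a downloaded file before any analysis. The following is a minimal sketch, not part of the official tooling. It assumes the files are comma-delimited with a single header row, and it uses a placeholder header in place of the restricted data; the helper names `expected_column_count` and `header_matches` are hypothetical.

```python
import csv
import io

MIXED_POLARITY_FILE = (
    "ai-auditology-personalisation-drift-tiktok_US_politics_4_agents_mixed_polarity.csv"
)


def expected_column_count(filename: str) -> int:
    """Return the documented column count for one of the dataset's CSV files."""
    # Only the mixed-polarity file is documented as having a 29th column.
    return 29 if filename == MIXED_POLARITY_FILE else 28


def header_matches(filename: str, fileobj) -> bool:
    """Compare the header row of an open CSV file with the documented count."""
    # Assumption: comma-delimited file whose first row is the header.
    header = next(csv.reader(fileobj))
    return len(header) == expected_column_count(filename)


# Placeholder header standing in for a real file (the real data is restricted).
fake_header = ",".join(f"col{i}" for i in range(28)) + "\n"
ok = header_matches(
    "ai-auditology-personalisation-drift-tiktok_32_agents_polarizing_only.csv",
    io.StringIO(fake_header),
)
print(ok)
```

With a real file, the `io.StringIO` object would be replaced by `open(path, newline="", encoding="utf-8")`; the encoding is likewise an assumption.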

Column name	Data type	Description	Example
interaction_number	integer	Unique integer per interaction per agent	1,2,3…
video_url	string	URL of video the agent interacted with	https://www.tiktok.com/@author123
video_id	string	TikTok unique video ID	1234
video_author	string	TikTok author name	author123
video_description	string	Video description written by the video author, including hashtags	This video is about…
video_time_duration	float	Duration of video in seconds	67.9333
video_transcript	string	Speech transcript produced by an in-house Whisper model	Welcome to my video about…
video_transcript_language	string	Code for language detected in transcript	en, fr, …
video_action_skip	bool	Decision by user interaction predictor, TRUE if video is to be skipped	TRUE, FALSE
video_action_watch	bool	Decision by user interaction predictor, TRUE if video is to be watched	TRUE, FALSE
video_action_like	bool	Decision by user interaction predictor, TRUE if video is to be liked	TRUE, FALSE
video_action_bookmark	bool	Decision by user interaction predictor, TRUE if video is to be bookmarked	TRUE, FALSE
video_time_watch_loop_start	float	UNIX timestamp of time when agent started watching particular video	1765302470.8245792
video_time_watch_loop_end	float	UNIX timestamp of time when agent finished watching particular video	1765302470.8245792
video_time_skip	float	UNIX timestamp of time when agent skipped particular video	1765302470.8245792
video_time_like	float	UNIX timestamp of time when agent liked particular video	1765302470.8245792
video_time_bookmark	float	UNIX timestamp of time when agent bookmarked particular video	1765302470.8245792
video_time_predict_interaction	float	UNIX timestamp of time when user interaction predictor predicted how to interact with particular video	1765302470.8245792
agent_id	string	Unique ID of agent	agent_id
topic	string	Topic of interest of given agent	Vaccines, US Politics, Flatearth, Climate change, Cooking
stance	string	Stance towards the topic of interest of given agent	support, oppose
gender	string	Gender set for given agent in TikTok	male, female
country_code	string	Country of origin set for given agent	US
date_of_birth	string	Date of birth set for given agent in TikTok	1/2/2005
run_id	string	ID of given agent run	1759515058.941394_main
predicted_topic_match	bool	TRUE if predicted_topic == topic of interest	TRUE, FALSE
predicted_stance_match	bool	TRUE if predicted stance == stance of given agent	TRUE, FALSE
predicted_topic	string	Topic predicted by data annotator using these data fields: video_author, video_description, video_transcript	Vaccines, US Politics, Flatearth, Climate change, Cooking
predicted_stance	string	Predicted stance towards the topic of interest of given agent. Only in ai-auditology-personalisation-drift-tiktok_US_politics_4_agents_mixed_polarity.csv	support, oppose
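The table above suggests that raw CSV cells need some conversion before analysis: boolean columns appear as TRUE/FALSE literals and the timestamp examples carry fractional seconds. The sketch below shows one possible way to type a row and derive a watch duration. It rests on assumptions that are not confirmed by the dataset description: that booleans are serialised exactly as TRUE/FALSE, that an empty cell means the action did not occur, and that topic labels can be compared by exact string match. The sample row is invented for illustration, and all helper names are hypothetical.

```python
import csv
import io
from datetime import datetime, timezone

BOOL_COLUMNS = [
    "video_action_skip", "video_action_watch", "video_action_like",
    "video_action_bookmark", "predicted_topic_match", "predicted_stance_match",
]
TIME_COLUMNS = [
    "video_time_watch_loop_start", "video_time_watch_loop_end", "video_time_skip",
    "video_time_like", "video_time_bookmark", "video_time_predict_interaction",
]


def parse_bool(cell):
    # Assumption: booleans are stored as the literals TRUE / FALSE.
    # Anything else (including an empty cell) becomes None.
    return {"TRUE": True, "FALSE": False}.get(cell.strip().upper())


def parse_timestamp(cell):
    # Assumption: an empty cell means the action did not happen.
    if not cell.strip():
        return None
    return datetime.fromtimestamp(float(cell), tz=timezone.utc)


def parse_row(row):
    """Return a copy of a csv.DictReader row with typed booleans and timestamps."""
    typed = dict(row)
    for name in BOOL_COLUMNS:
        if name in typed:
            typed[name] = parse_bool(typed[name])
    for name in TIME_COLUMNS:
        if name in typed:
            typed[name] = parse_timestamp(typed[name])
    return typed


def watch_seconds(typed):
    """Seconds between watch loop start and end, or None if either is missing."""
    start = typed.get("video_time_watch_loop_start")
    end = typed.get("video_time_watch_loop_end")
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


def topic_match_recomputed(typed):
    # Mirrors the table's definition: predicted_topic == topic of interest.
    # Assumption: exact, case-sensitive string comparison.
    return typed["predicted_topic"] == typed["topic"]


# Invented sample with a subset of the documented columns.
sample = io.StringIO(
    "agent_id,topic,predicted_topic,predicted_topic_match,video_action_watch,"
    "video_time_watch_loop_start,video_time_watch_loop_end,video_time_like\n"
    "agent_a,Vaccines,Vaccines,TRUE,TRUE,1765302470.5,1765302480.5,\n"
)
rows = [parse_row(r) for r in csv.DictReader(sample)]
print(rows[0]["predicted_topic_match"], watch_seconds(rows[0]), rows[0]["video_time_like"])
```

Comparing `topic_match_recomputed` with the stored `predicted_topic_match` value is one way to spot rows where the two disagree, for instance because of differences in label spelling.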

Ethical considerations

Most of the ethical, legal and societal issues tied to this dataset were already described in the Ethical Considerations section of the associated paper. The most severe risks were tied to a Terms of Service (ToS) violation, various types of privacy intrusions, the possibility of third-party misuse, or the erosion of some privacy rights such as the right to erasure.

The research from which this dataset resulted was carried out as part of a research project that obtained approval from the organisational Ethics Committee (decision of December 17, 2024). To minimise any potential legal and ethical issues, we directly involved legal and ethics experts in this project. Researchers and research engineers conducting this auditing study also participated in four ethics assessment workshops together with ethics and legal experts, where relevant ethical and legal challenges were identified and appropriate mitigations proposed.

The execution of sockpuppeting audits requires creating automated bots and using them for data collection, which is a potential violation of the terms of service of social media platforms. However, this breach of ToS is permitted by Article 40(12) of the EU Digital Services Act (DSA) if the research concerns systemic risks. This work directly addresses such a systemic risk by assessing social media platforms' compliance with obligations imposed by legislation, specifically the prohibition of profiling-based advertising to minors stated in Article 28(2) of the DSA, as foreseen by Recital 83 of the DSA. In addition, the interaction of the bots with content on the platform may affect the platform and society (e.g., by increasing view or like counts); we therefore minimise the number of bots that we run. When it comes to data, we collect only publicly available metadata.

The user interaction model, which we use for annotation purposes to determine the topic of a video and its stance towards that topic, is based on a large language model, so its mistakes may lead to biased or incorrect findings. We address this problem through ad hoc as well as systematic manual annotation of a selected subset of the dataset. This human annotation is done solely by the authors of the study, following recommendations from ethics experts, in order to minimise possible negative consequences and ensure well-being.

Labels in the dataset that are derived from the predictions of the above-mentioned annotation system (namely predicted_topic, predicted_topic_match, predicted_stance_match), as well as the transcript of the speech in the video (video_transcript), are products of statistical machine learning systems. They might therefore be inaccurate and may differ from the video authors' opinions and stances towards the topics of interest.

Finally, to support users' rights to rectification and erasure in case of the publication of incorrect or sensitive information, we provide a procedure for them to request the removal of their posts from the dataset or to flag the inaccuracies in the data. To do this, users can contact the authors using the contact form provided for accessing the dataset.

Files

Restricted

The record is publicly accessible, but files are restricted.

Request access

If you would like to request access to these files, please fill out the form below.

In order to share the dataset with you, please agree to the following terms:

You will use the dataset strictly for research purposes. The request for access to the dataset must be sent from an official, existing e-mail address of the relevant university, faculty, or other scientific or research institution (for verification purposes).
You will not attempt to identify, deanonymize or contact the authors of the social media posts included in this dataset.
You will not re-share the dataset (or any of its parts) with anyone else not included in this request.
You will appropriately cite the papers mentioned in the dataset description in any publication, project, or tool that uses this dataset.
You understand how the dataset was created and that the manual or automatically predicted annotations may not be 100% correct.
You acknowledge that you are fully responsible for the use of the dataset (data) and for any infringement of rights of third parties (in particular copyright) that may arise from its use beyond the intended purposes. Neither the authors nor Kempelen Institute of Intelligent Technologies (KInIT) are responsible for your actions.

