Dataset for paper "Algorithmic Audit of Personalisation Drift in Polarising Topics on TikTok"
Authors/Creators
- Kempelen Institute of Intelligent Technologies
Description
This dataset accompanies the paper “Algorithmic Audit of Personalisation Drift in Polarising Topics on TikTok”. It is designed to analyze video interactions and user engagement patterns on the TikTok website, and it contains records of the interactions of social media auditing agents with the TikTok website over the timespan of the study.
The video excerpts included in this dataset are used solely as units of content for analytical purposes. They do not represent, reflect, or imply the personal views, intentions, or stance of the individuals who created them. Content should be interpreted as data artifacts, not as statements attributable to any person.
To minimize the risk of third-party misuse, the dataset is available only to researchers for non-commercial research purposes, upon verification of an email address associated with an academic organisation.
Paper: TBA
Preprint: TBA
GitHub repository: https://github.com/kinit-sk/ai-auditology-personalisation-drift-tiktok
References
If you use this dataset in any publication, project, tool, or in any other form, please cite the following paper:
TBA
Dataset Description
The dataset consists of 3 CSV files:
- ai-auditology-personalisation-drift-tiktok_32_agents_polarizing_plus_neutral.csv: data for the first user group (neutral+polarising), consisting of 30 users from runs that were seeded with both a polarising and a neutral topic.
- ai-auditology-personalisation-drift-tiktok_32_agents_polarizing_only.csv: data for the second user group (polarising only), consisting of an additional 32 users (4 per topic+stance combination) that are seeded only with a polarising topic (representing maximum polarity), but interact with a neutral topic during the interaction phase.
- ai-auditology-personalisation-drift-tiktok_US_politics_4_agents_mixed_polarity.csv: data for the third user group (mixed polarity), whose agents were seeded in an equal manner with only the US politics topic.
The CSV files contain 28 columns (29 in ai-auditology-personalisation-drift-tiktok_US_politics_4_agents_mixed_polarity.csv), capturing details such as session and video identifiers, timestamps, ad classifications, visual indicators, user demographics, and video metadata.
| Column name | Data type | Description | Example |
|---|---|---|---|
| interaction_number | integer | Unique integer per interaction per agent | 1, 2, 3, … |
| video_url | string | URL of video the agent interacted with | https://www.tiktok.com/@author123 |
| video_id | string | TikTok unique video ID | 1234 |
| video_author | string | TikTok author name | author123 |
| video_description | string | Video description generated by video author plus hashtags | This video is about… |
| video_time_duration | integer | Duration of video in seconds | 67.9333 |
| video_transcript | string | Speech transcript by inhouse Whisper model | Welcome to my video about… |
| video_transcript_language | string | Code for language detected in transcript | en, fr, … |
| video_action_skip | bool | Decision by user interaction predictor, TRUE if video is to be skipped | TRUE, FALSE |
| video_action_watch | bool | Decision by user interaction predictor, TRUE if video is to be watched | TRUE, FALSE |
| video_action_like | bool | Decision by user interaction predictor, TRUE if video is to be liked | TRUE, FALSE |
| video_action_bookmark | bool | Decision by user interaction predictor, TRUE if video is to be bookmarked | TRUE, FALSE |
| video_time_watch_loop_start | integer | UNIX timestamp of time when agent started watching particular video | 1765302470.8245792 |
| video_time_watch_loop_end | integer | UNIX timestamp of time when agent finished watching particular video | 1765302470.8245792 |
| video_time_skip | integer | UNIX timestamp of time when agent skipped particular video | 1765302470.8245792 |
| video_time_like | integer | UNIX timestamp of time when agent liked particular video | 1765302470.8245792 |
| video_time_bookmark | integer | UNIX timestamp of time when agent bookmarked particular video | 1765302470.8245792 |
| video_time_predict_interaction | integer | UNIX timestamp of time when user interaction predictor predicted how to interact with particular video | 1765302470.8245792 |
| agent_id | string | Unique ID of agent | agent_id |
| topic | string | Topic of interest of given agent | Vaccines, US Politics, Flatearth, Climate change, Cooking |
| stance | string | Stance towards the topic of interest of given agent | support, oppose |
| gender | string | Gender set for given agent in TikTok | male, female |
| country_code | string | Country of origin set for given agent | US |
| date_of_birth | string | Date of birth set for given agent in TikTok | 1/2/2005 |
| run_id | string | ID of given agent run | 1759515058.941394_main |
| predicted_topic_match | bool | TRUE if predicted_topic == topic of interest | TRUE, FALSE |
| predicted_stance_match | bool | TRUE if predicted stance == stance of given agent | TRUE, FALSE |
| predicted_topic | string | Topic predicted by data annotator using these data fields: video_author, video_description, video_transcript | Vaccines, US Politics, Flatearth, Climate change, Cooking |
| predicted_stance | string | Predicted stance towards the topic of interest of given agent. Only in ai-auditology-personalisation-drift-tiktok_US_politics_4_agents_mixed_polarity.csv | support, oppose |
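As a rough illustration of how the columns above might be read, the following Python sketch uses only the standard library to convert a few of them into typed values. It rests on assumptions that are not confirmed by the dataset itself: that booleans are serialised as TRUE/FALSE (as in the Example column), that timestamps of actions that did not happen are left empty, and that timestamps are parsed as floats because the examples show fractional seconds even though the listed type is integer. The helper names (`parse_row`, `watch_seconds`, `load_interactions`) and the sample values are hypothetical and made up for illustration; they are not records from the dataset.

```python
import csv
import io
from datetime import datetime, timezone

BOOL_COLUMNS = [
    "video_action_skip", "video_action_watch", "video_action_like",
    "video_action_bookmark", "predicted_topic_match", "predicted_stance_match",
]
TIMESTAMP_COLUMNS = [
    "video_time_watch_loop_start", "video_time_watch_loop_end",
    "video_time_skip", "video_time_like", "video_time_bookmark",
    "video_time_predict_interaction",
]


def parse_bool(value):
    """Map 'TRUE'/'FALSE' (any case) to bool; anything else becomes None."""
    text = (value or "").strip().upper()
    if text == "TRUE":
        return True
    if text == "FALSE":
        return False
    return None


def parse_timestamp(value):
    """Convert a UNIX timestamp string to a UTC datetime, or None if empty."""
    text = (value or "").strip()
    if not text:
        return None
    return datetime.fromtimestamp(float(text), tz=timezone.utc)


def parse_row(row):
    """Return a copy of a csv.DictReader row with typed values."""
    typed = dict(row)
    typed["interaction_number"] = int(row["interaction_number"])
    for name in BOOL_COLUMNS:
        if name in row:
            typed[name] = parse_bool(row[name])
    for name in TIMESTAMP_COLUMNS:
        if name in row:
            typed[name] = parse_timestamp(row[name])
    return typed


def watch_seconds(typed_row):
    """Seconds between watch loop start and end, or None if either is missing."""
    start = typed_row.get("video_time_watch_loop_start")
    end = typed_row.get("video_time_watch_loop_end")
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


def load_interactions(path):
    """Read one of the CSV files from disk into a list of typed rows."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [parse_row(row) for row in csv.DictReader(handle)]


# Made-up sample with a subset of the columns, for illustration only.
SAMPLE = """interaction_number,video_id,video_action_skip,video_action_watch,video_time_watch_loop_start,video_time_watch_loop_end,video_time_skip
1,1234,FALSE,TRUE,1765302470.5,1765302483.0,
2,5678,TRUE,FALSE,,,1765302490.25
"""
rows = [parse_row(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
for row in rows:
    print(row["interaction_number"], row["video_action_watch"], watch_seconds(row))
```

For the real files, `load_interactions` would be called with the local path of a downloaded CSV. Columns not listed in `BOOL_COLUMNS` or `TIMESTAMP_COLUMNS` are kept as strings, which avoids guessing at formats such as `date_of_birth`.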
Ethical considerations
Most of the ethical, legal and societal issues tied to this dataset were already described in the Ethical Considerations section of the associated paper. The most severe risks were tied to a Terms of Service (ToS) violation, various types of privacy intrusions, the possibility of third-party misuse, or the erosion of some privacy rights such as the right to erasure.
The research from which this dataset resulted was done as part of a research project that obtained approval from the organisational Ethics Committee (decision as of December 17, 2024). To minimise any potential legal and ethical issues, we directly involved legal and ethics experts as part of this project. Researchers and research engineers conducting this auditing study also participated in four ethics assessment workshops together with ethics and legal experts, where relevant ethical and legal challenges were identified and appropriate mitigations proposed.
The execution of sockpuppeting audits requires creating automated bots and using them for data collection, which is a potential violation of the terms of service of the social media platforms. However, this breach of ToS is permitted by Article 40(12) of the EU Act on Digital Services (DSA) if the research concerns systemic risks. This work directly addresses such a systemic risk by assessing social media platforms' compliance with obligations imposed by legislation, specifically the prohibition of profiling-based advertising to minors stated in Article 28(2) of the DSA, as foreseen by Recital 83 of the DSA. In addition, the interaction of the bots with the content on the platform may impact the platform and society (e.g., by increasing the view or like count). However, we minimise the number of bots that we run. When it comes to data, we collect only publicly available metadata.
The user interaction model, which we use for annotation purposes to determine the topic of a video and its stance towards that topic, is based on a large language model, so its mistakes may lead to biased or incorrect findings. We address this problem by ad-hoc as well as systematic manual annotation of a selected subset of the dataset. This human annotation is done solely by the authors of the study, following recommendations from ethics experts in order to minimise possible negative consequences and ensure well-being.
Labels in the dataset that are derived from the predictions of the above-mentioned annotation system (namely predicted_topic, predicted_topic_match, and predicted_stance_match), as well as the transcript of the speech in the video (video_transcript), are products of statistical machine learning systems. They might therefore be inaccurate and may differ from the video authors' opinions and stances towards the topics of interest.
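Because these labels are model outputs, it can be useful to sanity-check them before analysis. The sketch below recomputes the topic match from `predicted_topic` and `topic`, following the column description "TRUE if predicted_topic == topic of interest", and reports rows where the stored flag disagrees. It also computes the share of topic-matching videos per agent. The case-insensitive, whitespace-trimmed comparison is an assumption, as are the function names and the made-up sample rows; this is not the authors' annotation or evaluation procedure.

```python
def normalise(label):
    """Trim and case-fold a label (assumption: labels may differ in case)."""
    return (label or "").strip().casefold()


def recomputed_topic_match(row):
    """Recompute the match flag from predicted_topic and topic."""
    return normalise(row["predicted_topic"]) == normalise(row["topic"])


def stored_topic_match(row):
    """Read the stored predicted_topic_match flag as a bool."""
    return (row["predicted_topic_match"] or "").strip().upper() == "TRUE"


def topic_match_mismatches(rows):
    """Rows whose stored flag disagrees with the recomputed one."""
    return [
        row for row in rows
        if recomputed_topic_match(row) != stored_topic_match(row)
    ]


def topic_match_share(rows):
    """Per agent_id, the fraction of rows whose stored flag is TRUE."""
    totals, hits = {}, {}
    for row in rows:
        agent = row["agent_id"]
        totals[agent] = totals.get(agent, 0) + 1
        hits[agent] = hits.get(agent, 0) + int(stored_topic_match(row))
    return {agent: hits[agent] / totals[agent] for agent in totals}


# Made-up rows in the shape produced by csv.DictReader, for illustration only.
sample_rows = [
    {"agent_id": "agent_a", "topic": "Vaccines",
     "predicted_topic": "Vaccines", "predicted_topic_match": "TRUE"},
    {"agent_id": "agent_a", "topic": "Vaccines",
     "predicted_topic": "Cooking", "predicted_topic_match": "FALSE"},
    {"agent_id": "agent_b", "topic": "Cooking",
     "predicted_topic": "cooking", "predicted_topic_match": "FALSE"},
]

mismatches = topic_match_mismatches(sample_rows)
shares = topic_match_share(sample_rows)
print(len(mismatches), shares)
```

A mismatch found this way does not by itself show which value is wrong. It only flags rows worth a manual look.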
Finally, to support users' rights to rectification and erasure in case of the publication of incorrect or sensitive information, we provide a procedure for them to request the removal of their posts from the dataset or to flag the inaccuracies in the data. To do this, users can contact the authors using the contact form provided for accessing the dataset.