Published March 21, 2026 | Version v1
Dataset Restricted

Dataset for paper "Algorithmic Audit of Personalisation Drift in Polarising Topics on TikTok"

Description

This dataset accompanies the paper “Algorithmic Audit of Personalisation Drift in Polarising Topics on TikTok”. It is designed to analyze video interactions and user engagement patterns on the TikTok website, and it contains records of the interactions of social media auditing agents with the TikTok website over the timespan of the study.

The video excerpts included in this dataset are used solely as units of content for analytical purposes. They do not represent, reflect, or imply the personal views, intentions, or stance of the individuals who created them. Content should be interpreted as data artifacts, not as statements attributable to any person.

To minimize the risk of third-party misuse, the dataset is available only to researchers for non-commercial research purposes, upon verification of an email address associated with an academic organisation.

 

Paper: TBA

Preprint: TBA

GitHub repository: https://github.com/kinit-sk/ai-auditology-personalisation-drift-tiktok

 

References

If you use this dataset in any publication, project, tool, or in any other form, please cite the following paper:

TBA

 

Dataset Description

The dataset consists of 3 CSV files:

  • ai-auditology-personalisation-drift-tiktok_32_agents_polarizing_plus_neutral.csv: Data for the first user group (neutral+polarising), consisting of 30 users from runs that were seeded with both a polarising and a neutral topic.

  • ai-auditology-personalisation-drift-tiktok_32_agents_polarizing_only.csv: Data for the second user group (polarising only), consisting of an additional 32 users (4 per topic+stance combination) that were seeded only with a polarising topic (representing maximum polarity) but interacted with a neutral topic during the interaction phase.

  • ai-auditology-personalisation-drift-tiktok_US_politics_4_agents_mixed_polarity.csv: Data for the third user group (mixed polarity), seeded in an equal manner with only the US politics topic.

The CSV files contain 28 columns (29 for the data in ai-auditology-personalisation-drift-tiktok_US_politics_4_agents_mixed_polarity.csv), capturing details such as run and video identifiers, timestamps, predicted interaction decisions, agent demographics, video metadata, and predicted topic and stance labels.
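The column counts stated above can be checked against a file's header row before any analysis. The following is a minimal sketch, assuming the files are comma-delimited with a single header row (the description does not state this); the helper names `expected_column_count` and `header_matches` are hypothetical and not part of the dataset or its repository.

```python
import csv
import io

# File name taken from the dataset description. It is the only file
# documented as having 29 columns (the extra one is predicted_stance).
MIXED_POLARITY_FILE = (
    "ai-auditology-personalisation-drift-tiktok_US_politics_4_agents_mixed_polarity.csv"
)


def expected_column_count(file_name: str) -> int:
    """Return the column count the description states for a given file."""
    return 29 if file_name == MIXED_POLARITY_FILE else 28


def header_matches(file_name: str, csv_text: str) -> bool:
    """Check the header row of CSV text against the documented layout.

    Assumption: comma-delimited text whose first row holds column names.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    has_stance = "predicted_stance" in header
    return (
        len(header) == expected_column_count(file_name)
        and has_stance == (file_name == MIXED_POLARITY_FILE)
    )
```

With local copies of the files, one would read the first line of each file and pass it to `header_matches` together with the file name.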

| Column name | Data type | Description | Example |
| --- | --- | --- | --- |
| interaction_number | integer | Unique integer per interaction per agent | 1, 2, 3, … |
| video_url | string | URL of the video the agent interacted with | https://www.tiktok.com/@author123 |
| video_id | string | TikTok unique video ID | 1234 |
| video_author | string | TikTok author name | author123 |
| video_description | string | Video description written by the video author, plus hashtags | This video is about… |
| video_time_duration | float | Duration of the video in seconds | 67.9333 |
| video_transcript | string | Speech transcript produced by an in-house Whisper model | Welcome to my video about… |
| video_transcript_language | string | Code of the language detected in the transcript | en, fr, … |
| video_action_skip | bool | Decision by the user interaction predictor, TRUE if the video is to be skipped | TRUE, FALSE |
| video_action_watch | bool | Decision by the user interaction predictor, TRUE if the video is to be watched | TRUE, FALSE |
| video_action_like | bool | Decision by the user interaction predictor, TRUE if the video is to be liked | TRUE, FALSE |
| video_action_bookmark | bool | Decision by the user interaction predictor, TRUE if the video is to be bookmarked | TRUE, FALSE |
| video_time_watch_loop_start | float | UNIX timestamp of the time when the agent started watching the video | 1765302470.8245792 |
| video_time_watch_loop_end | float | UNIX timestamp of the time when the agent finished watching the video | 1765302470.8245792 |
| video_time_skip | float | UNIX timestamp of the time when the agent skipped the video | 1765302470.8245792 |
| video_time_like | float | UNIX timestamp of the time when the agent liked the video | 1765302470.8245792 |
| video_time_bookmark | float | UNIX timestamp of the time when the agent bookmarked the video | 1765302470.8245792 |
| video_time_predict_interaction | float | UNIX timestamp of the time when the user interaction predictor predicted how to interact with the video | 1765302470.8245792 |
| agent_id | string | Unique ID of the agent | agent_id |
| topic | string | Topic of interest of the given agent | Vaccines, US Politics, Flatearth, Climate change, Cooking |
| stance | string | Stance of the given agent towards its topic of interest | support, oppose |
| gender | string | Gender set for the given agent in TikTok | male, female |
| country_code | string | Country of origin set for the given agent | US |
| date_of_birth | string | Date of birth set for the given agent in TikTok | 1/2/2005 |
| run_id | string | ID of the given agent run | 1759515058.941394_main |
| predicted_topic_match | bool | TRUE if predicted_topic == topic of interest | TRUE, FALSE |
| predicted_stance_match | bool | TRUE if predicted stance == stance of the given agent | TRUE, FALSE |
| predicted_topic | string | Topic predicted by the data annotator using these data fields: video_author, video_description, video_transcript | Vaccines, US Politics, Flatearth, Climate change, Cooking |
| predicted_stance | string | Predicted stance towards the topic of interest of the given agent. Only in ai-auditology-personalisation-drift-tiktok_US_politics_4_agents_mixed_polarity.csv | support, oppose |
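When rows are read as plain strings (for example with `csv.DictReader`), the boolean and timestamp columns listed above need converting before use. The following is a minimal sketch under two assumptions that the description does not confirm: booleans are stored as the literal strings TRUE and FALSE shown in the examples, and a timestamp cell is empty when the corresponding action did not happen. The function names are hypothetical.

```python
from datetime import datetime, timezone

# Column names are taken from the table above.
BOOL_COLUMNS = (
    "video_action_skip", "video_action_watch", "video_action_like",
    "video_action_bookmark", "predicted_topic_match", "predicted_stance_match",
)
TIME_COLUMNS = (
    "video_time_watch_loop_start", "video_time_watch_loop_end",
    "video_time_skip", "video_time_like", "video_time_bookmark",
    "video_time_predict_interaction",
)


def parse_bool(value: str) -> bool:
    """Assumption: booleans are the literal strings TRUE / FALSE."""
    return value.strip().upper() == "TRUE"


def parse_timestamp(value: str):
    """Convert a UNIX timestamp string to a UTC datetime.

    Assumption: an empty cell means the action did not occur (returns None).
    """
    value = value.strip()
    if not value:
        return None
    return datetime.fromtimestamp(float(value), tz=timezone.utc)


def parse_row(row: dict) -> dict:
    """Return a copy of a string-valued row with typed bool/time columns."""
    parsed = dict(row)
    for column in BOOL_COLUMNS:
        if column in parsed:
            parsed[column] = parse_bool(parsed[column])
    for column in TIME_COLUMNS:
        if column in parsed:
            parsed[column] = parse_timestamp(parsed[column])
    return parsed


def watch_seconds(parsed: dict):
    """Seconds between watch loop start and end, or None if either is missing."""
    start = parsed.get("video_time_watch_loop_start")
    end = parsed.get("video_time_watch_loop_end")
    if start is None or end is None:
        return None
    return (end - start).total_seconds()
```

Comparing `watch_seconds` with `video_time_duration` is one way to see whether an agent watched a video to the end, although how the agents decide watch length is not specified in this description.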

 

Ethical considerations

Most of the ethical, legal and societal issues tied to this dataset were already described in the Ethical Considerations section of the associated paper. The most severe risks were tied to a Terms of Service (ToS) violation, various types of privacy intrusions, the possibility of third-party misuse, or the erosion of some privacy rights such as the right to erasure. 

The research from which this dataset resulted was carried out as part of a research project that obtained approval from the organisational Ethics Committee (decision of December 17, 2024). To minimise any potential legal and ethical issues, we directly involved legal and ethics experts as part of this project. Researchers and research engineers conducting this auditing study also participated in four ethics assessment workshops together with ethics and legal experts, where relevant ethical and legal challenges were identified and appropriate mitigations proposed.

The execution of sockpuppeting audits requires creating automated bots and using them for data collection, which is a potential violation of the terms of service of social media platforms. However, such a breach of the ToS is permitted by Article 40(12) of the EU Digital Services Act (DSA) if the research concerns systemic risks. This work directly addresses such a systemic risk by assessing social media platforms' compliance with obligations imposed by legislation, specifically the prohibition of profiling-based advertising to minors stated in Article 28(2) of the DSA, as foreseen by Recital 83 of the DSA. In addition, the interaction of the bots with content on the platform may impact the platform and society (e.g., by increasing view or like counts); to limit this, we minimise the number of bots that we run. When it comes to data, we collect only publicly available metadata.

The user interaction model, which we use for annotation purposes to determine the topic of a video and its stance towards that topic, is based on a large language model, so its mistakes may lead to biased or incorrect findings. We address this problem through ad hoc as well as systematic manual annotation of a selected subset of the dataset. This human annotation is done solely by the authors of the study, following recommendations from ethics experts in order to minimise possible negative consequences and ensure well-being.

Labels in the dataset that are derived from the predictions of the above-mentioned annotation system (namely predicted_topic, predicted_topic_match and predicted_stance_match), as well as the transcript of the speech in the video (video_transcript), are products of statistical machine learning systems. They might therefore be inaccurate and may differ from the video authors' opinions and stances towards the topics of interest.

Finally, to support users' rights to rectification and erasure in case of the publication of incorrect or sensitive information, we provide a procedure for them to request the removal of their posts from the dataset or to flag the inaccuracies in the data. To do this, users can contact the authors using the contact form provided for accessing the dataset.

Files

Restricted

The record is publicly accessible, but files are restricted.

Request access

If you would like to request access to these files, please fill out the form below.

You need to satisfy these conditions in order for this request to be accepted:

In order to share the dataset with you, please agree to the following terms:

  1. You will use the dataset strictly for research purposes. The request for access to the dataset must be sent from the official and existing e-mail address of the relevant university, faculty or other scientific or research institution (for verification purposes).
  2. You will not attempt to identify, deanonymize or contact the authors of the social media posts included in this dataset.
  3. You will not re-share the dataset (or any of its parts) with anyone else not included in this request.  
  4. You will appropriately cite the papers mentioned in the dataset description in any publication, project or tool using this dataset.
  5. You understand how the dataset was created and that the manual or automatically predicted annotations may not be 100% correct. 
  6. You acknowledge that you are fully responsible for the use of the dataset (data) and for any infringement of rights of third parties (in particular copyright) that may arise from its use beyond the intended purposes. Neither the authors nor Kempelen Institute of Intelligent Technologies (KInIT) are responsible for your actions.
