Dataset for paper "The DSA's Blind Spot: Algorithmic Audit of Advertising and Minor Profiling on TikTok"
Description
This dataset accompanies the paper “The DSA's Blind Spot: Algorithmic Audit of Advertising and Minor Profiling on TikTok” and is designed for analyzing video interactions, ad classifications, and user engagement patterns. It contains records of video interactions, including video metadata, user demographics, and ad classifications, allowing full replication of the results presented in the paper.
The video excerpts included in this dataset are used solely as units of content for analytical purposes. They do not represent, reflect, or imply the personal views, intentions, or stance of the individuals who created them. Content should be interpreted as data artifacts, not as statements attributable to any person.
To minimize the risk of third-party misuse, the dataset is available only to researchers for non-commercial research purposes, upon verification of an email address associated with an academic organisation.
Paper: TBA (currently under review)
Preprint: https://arxiv.org/abs/2603.05653
GitHub repository: https://github.com/kinit-sk/ai-auditology-advertising-and-minor-profiling-tiktok
References
If you use this dataset in any publication, project, tool, or in any other form, please cite the following paper:
@misc{solarova2026dsasblindspotalgorithmic,
title={The DSA's Blind Spot: Algorithmic Audit of Advertising and Minor Profiling on TikTok},
author={Sara Solarova and Matej Mosnar and Matus Tibensky and Jan Jakubcik and Adrian Bindas and Simon Liska and Filip Hossner and Matúš Mesarčík and Ivan Srba},
year={2026},
eprint={2603.05653},
archivePrefix={arXiv},
primaryClass={cs.CY},
url={https://arxiv.org/abs/2603.05653},
}
Dataset Description
The logs of videos presented to individual simulated users are provided in the ai-auditology-advertising-and-minor-profiling-tiktok_video_data.csv file. It is structured into 31 columns, capturing details such as session and video identifiers, timestamps, ad classifications, visual indicators, user demographics, and video metadata.
| Column Name | Data Type | Description | Example Value |
| --- | --- | --- | --- |
| session_id | string | Session identifier captured during browsing | 1765302414.743265 |
| video_id | string | Platform video identifier | [anonymized] |
| timestamp | datetime | Timestamp when the record was captured | 2025-12-09T17:47:56.296448 |
| is_ad | boolean | Whether the video was classified as an ad | false |
| ad_type | string (nullable) | Ad classification type when is_ad is true | other |
| ad_topic | string (nullable) | Detected topic for ad content | beauty |
| visual_indicators | array[string] | List of visual indicators used to classify ads | ["hashtag #clearskin"] |
| reasoning | string | Model reasoning for the ad classification | No disclosure label visible. |
| interaction_number | integer | Sequential interaction count within the session | 1 |
| search_term | string | Search term used to find the content | clear skin |
| video_action_skip | boolean | Whether the user skipped the video | False |
| video_action_watch | boolean | Whether the user watched the video | True |
| video_action_like | boolean | Whether the user liked the video | True |
| video_action_bookmark | boolean | Whether the user bookmarked the video | True |
| video_time_watch_loop_start | float (nullable) | Timestamp when watch loop started | 1765302470.8245792 |
| video_time_watch_loop_end | float (nullable) | Timestamp when watch loop ended | 1765302477.842666 |
| video_time_skip | float (nullable) | Timestamp when the video was skipped | nan |
| video_time_like | float (nullable) | Timestamp when the video was liked | 1765302471.8269806 |
| video_time_bookmark | float (nullable) | Timestamp when the video was bookmarked | 1765302477.3054323 |
| video_time_predict_interaction | float (nullable) | Timestamp for predicted interaction (if any) | nan |
| topic | string | User interest topic used for personalization | beauty |
| gender | string | User gender | female |
| country_code | string | User country code | DE |
| date_of_birth | date | User date of birth | 2009-11-29 |
| agent | string | Agent identifier added during processing | Beauty_minor |
| video_url | string | Full URL to the video | https://www.tiktok.com/[anonymized] |
| video_author | string | Account handle of the video author | [anonymized] |
| video_description | string | Video description text | little bonus - your waist? nonexistent #chiaseeds #guthealth |
| video_time_duration | float | Video duration in seconds | 25.866667 |
| video_transcript | string (nullable) | Auto-transcribed video text if available | nan |
| video_transcript_language | string (nullable) | Language of the transcript | nan |
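The column types listed above can be recovered when reading the CSV. The following is a minimal sketch using only the Python standard library, not the authors' loading code. It assumes that missing values are written as `nan`, that `visual_indicators` is stored as a JSON or Python-style list literal, and that the `video_time_*` interaction columns hold Unix epoch seconds. The helper names (`load_rows`, `age_in_years`, `watch_seconds`) are hypothetical.

```python
import ast
import csv
import json
from datetime import date, datetime

BOOL_COLUMNS = ("is_ad", "video_action_skip", "video_action_watch",
                "video_action_like", "video_action_bookmark")
EPOCH_COLUMNS = ("video_time_watch_loop_start", "video_time_watch_loop_end",
                 "video_time_skip", "video_time_like", "video_time_bookmark",
                 "video_time_predict_interaction")


def parse_bool(text):
    # Example values mix "false" and "True", so compare case-insensitively.
    return text.strip().lower() == "true"


def parse_optional_float(text):
    # Assumption: missing values appear as "nan" (or blank); map them to None.
    cleaned = text.strip()
    return None if cleaned.lower() in ("", "nan") else float(cleaned)


def parse_list(text):
    # Assumption: the list is serialized as JSON or as a Python literal.
    cleaned = text.strip()
    if cleaned.lower() in ("", "nan"):
        return []
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return ast.literal_eval(cleaned)


def parse_row(row):
    # Convert only the columns that are present; leave the rest as strings.
    parsed = dict(row)
    for name in BOOL_COLUMNS:
        if name in row:
            parsed[name] = parse_bool(row[name])
    for name in EPOCH_COLUMNS:
        if name in row:
            parsed[name] = parse_optional_float(row[name])
    if "timestamp" in row:
        parsed["timestamp"] = datetime.fromisoformat(row["timestamp"])
    if "date_of_birth" in row:
        parsed["date_of_birth"] = date.fromisoformat(row["date_of_birth"])
    if "visual_indicators" in row:
        parsed["visual_indicators"] = parse_list(row["visual_indicators"])
    return parsed


def load_rows(csv_lines):
    # Accepts an open text file or any iterable of CSV lines.
    return [parse_row(row) for row in csv.DictReader(csv_lines)]


def age_in_years(born, on_day):
    # Subtract one year if the birthday has not yet occurred in that year.
    before_birthday = (on_day.month, on_day.day) < (born.month, born.day)
    return on_day.year - born.year - before_birthday


def watch_seconds(row):
    # Length of the watch loop, or None when either endpoint is missing.
    start = row.get("video_time_watch_loop_start")
    end = row.get("video_time_watch_loop_end")
    if start is None or end is None:
        return None
    return end - start
```

With the real file, `load_rows` would typically be called on a handle opened with `open(path, newline="", encoding="utf-8")`. For the example row in the table, `age_in_years` applied to `date_of_birth` and the date of `timestamp` gives the simulated user's age at capture time, and `watch_seconds` gives roughly 7 seconds.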
Manual annotations of selected videos (used to assess the accuracy of the ad type and topic classification model) are provided in ai-auditology-advertising-and-minor-profiling-tiktok_annotator_1.csv and ai-auditology-advertising-and-minor-profiling-tiktok_annotator_2.csv, for the first and second human annotator, respectively.
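As an illustration of how such annotations might be compared, the sketch below computes plain accuracy of model labels against one annotator and Cohen's kappa between the two annotators. It is an assumption that these are suitable measures here; the paper may report different ones. The layout of the annotator files is not described above, so the functions take already aligned lists of labels (one entry per video, in the same order) rather than reading the CSVs, and the names `accuracy` and `cohen_kappa` are hypothetical.

```python
from collections import Counter


def accuracy(predicted, reference):
    # Share of items where the model label equals the human label.
    if not reference or len(predicted) != len(reference):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def cohen_kappa(labels_a, labels_b):
    # Chance-corrected agreement between two annotators.
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected if both annotators labelled independently
    # according to their own label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; kappa is undefined,
        # so report perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Both functions require the label lists to be aligned on the same videos first, for example by joining the files on a shared video identifier if one is present.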
Ethical considerations
Most of the ethical, legal and societal issues tied to this dataset were already described in the Ethical Considerations section of the associated paper. The most severe risks were tied to a Terms of Service (ToS) violation, various types of privacy intrusions, the possibility of third-party misuse, or the erosion of some privacy rights such as the right to erasure.
The research that produced this dataset was carried out as part of a research project that obtained approval from the organisational Ethics Committee (decision of December 17, 2024). To minimise any potential legal and ethical issues, we directly involved legal and ethics experts in this project. Researchers and research engineers conducting this auditing study also participated in four ethics assessment workshops together with ethics and legal experts, where relevant ethical and legal challenges were identified and appropriate mitigations proposed.
First, the execution of sockpuppeting audits requires creating automated bots and using them for data collection, which is a potential violation of the terms of service of the social media platforms. However, this breach of ToS is permitted by Article 40(12) of the EU Digital Services Act (DSA) if the research concerns systemic risks. This work directly addresses such a systemic risk by assessing social media platforms' compliance with obligations imposed by legislation, specifically the prohibition of profiling-based advertising to minors laid down in Article 28(2) of the DSA, as foreseen by Recital 83 of the DSA. Second, the interaction of the bots with the content on the platform may impact the platform and society (e.g., by increasing view or like counts). To limit this impact, we minimise the number of bots that we run. When it comes to data, we collect only publicly available metadata.
To mitigate potential biases and inaccuracies inherent in the Large Vision Model (LVM) used for advertisement classification, we implemented a multi-layered validation process. This included both ad-hoc and systematic manual audits of dataset subsets. Data failing to meet accuracy benchmarks were excluded, and we have reported the estimated error rates accordingly. To prioritize ethical standards and researcher well-being, all manual annotations were conducted solely by the study’s authors, following expert ethical guidelines.
Finally, to support users' rights to rectification and erasure in case incorrect or sensitive information is published, we provide a procedure for them to request the removal of their posts from the dataset or to flag inaccuracies in the data. To do this, users can contact the authors using the contact form provided for accessing the dataset.