Published March 5, 2026 | Version 1.0
Dataset Restricted

Dataset for paper "The DSA's Blind Spot: Algorithmic Audit of Advertising and Minor Profiling on TikTok"

  • 1. Kempelen Institute of Intelligent Technologies
  • 2. Comenius University Bratislava

Description

This dataset accompanies the paper “The DSA's Blind Spot: Algorithmic Audit of Advertising and Minor Profiling on TikTok” and is designed for analyzing video interactions, ad classifications, and user engagement patterns. It contains records of video interactions, including video metadata, user demographics, and ad classifications, allowing full replication of the results presented in the paper.

The video excerpts included in this dataset are used solely as units of content for analytical purposes. They do not represent, reflect, or imply the personal views, intentions, or stance of the individuals who created them. Content should be interpreted as data artifacts, not as statements attributable to any person.

To minimize the risk of third-party misuse, the dataset is available only to researchers for non-commercial research purposes, upon verification of an email address associated with an academic organisation.

 

Paper: TBA (currently under review)

Preprint: https://arxiv.org/abs/2603.05653

GitHub repository: https://github.com/kinit-sk/ai-auditology-advertising-and-minor-profiling-tiktok

 

References

If you use this dataset in any publication, project, tool, or in any other form, please cite the following paper:

@misc{solarova2026dsasblindspotalgorithmic,
      title={The DSA's Blind Spot: Algorithmic Audit of Advertising and Minor Profiling on TikTok}, 
      author={Sara Solarova and Matej Mosnar and Matus Tibensky and Jan Jakubcik and Adrian Bindas and Simon Liska and Filip Hossner and Matúš Mesarčík and Ivan Srba},
      year={2026},
      eprint={2603.05653},
      archivePrefix={arXiv},
      primaryClass={cs.CY},
      url={https://arxiv.org/abs/2603.05653}, 
}

 

Dataset Description

The logs of the videos presented to individual simulated users are provided in the ai-auditology-advertising-and-minor-profiling-tiktok_video_data.csv file. The file has 31 columns, capturing details such as session and video identifiers, timestamps, ad classifications, visual indicators, user demographics, and video metadata.

| Column Name | Data Type | Description | Example Value |
|---|---|---|---|
| session_id | string | Session identifier captured during browsing | 1765302414.743265 |
| video_id | string | Platform video identifier | [anonymized] |
| timestamp | datetime | Timestamp when the record was captured | 2025-12-09T17:47:56.296448 |
| is_ad | boolean | Whether the video was classified as an ad | false |
| ad_type | string (nullable) | Ad classification type when is_ad is true | other |
| ad_topic | string (nullable) | Detected topic for ad content | beauty |
| visual_indicators | array[string] | List of visual indicators used to classify ads | ["hashtag #clearskin"] |
| reasoning | string | Model reasoning for the ad classification | No disclosure label visible. |
| interaction_number | integer | Sequential interaction count within the session | 1 |
| search_term | string | Search term used to find the content | clear skin |
| video_action_skip | boolean | Whether the user skipped the video | False |
| video_action_watch | boolean | Whether the user watched the video | True |
| video_action_like | boolean | Whether the user liked the video | True |
| video_action_bookmark | boolean | Whether the user bookmarked the video | True |
| video_time_watch_loop_start | float (nullable) | Timestamp when watch loop started | 1765302470.8245792 |
| video_time_watch_loop_end | float (nullable) | Timestamp when watch loop ended | 1765302477.842666 |
| video_time_skip | float (nullable) | Timestamp when the video was skipped | nan |
| video_time_like | float (nullable) | Timestamp when the video was liked | 1765302471.8269806 |
| video_time_bookmark | float (nullable) | Timestamp when the video was bookmarked | 1765302477.3054323 |
| video_time_predict_interaction | float (nullable) | Timestamp for predicted interaction (if any) | nan |
| topic | string | User interest topic used for personalization | beauty |
| gender | string | User gender | female |
| country_code | string | User country code | DE |
| date_of_birth | date | User date of birth | 2009-11-29 |
| agent | string | Agent identifier added during processing | Beauty_minor |
| video_url | string | Full URL to the video | https://www.tiktok.com/[anonymized] |
| video_author | string | Account handle of the video author | [anonymized] |
| video_description | string | Video description text | little bonus - your waist? nonexistent #chiaseeds #guthealth |
| video_time_duration | float | Video duration in seconds | 25.866667 |
| video_transcript | string (nullable) | Auto-transcribed video text if available | nan |
| video_transcript_language | string (nullable) | Language of the transcript | nan |
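As a rough illustration of how the column types listed above might be handled when reading the file, the following Python sketch parses a single row assembled from the example values. It relies on several assumptions that the dataset description does not confirm: that `visual_indicators` is stored as a JSON array, that the `video_time_*` columns hold Unix epoch seconds, that the difference between the watch-loop end and start approximates watch time, and that missing values appear as `nan` or empty cells. The helper names (`parse_bool`, `parse_optional_float`, `age_on`, `parse_row`) are hypothetical and not part of the dataset or its repository.

```python
import csv
import io
import json
import math
from datetime import date, datetime

# One illustrative row built from the example values (a subset of the 31 columns).
SAMPLE_CSV = '''session_id,timestamp,is_ad,visual_indicators,video_action_like,video_time_watch_loop_start,video_time_watch_loop_end,video_time_skip,date_of_birth
1765302414.743265,2025-12-09T17:47:56.296448,false,"[""hashtag #clearskin""]",True,1765302470.8245792,1765302477.842666,nan,2009-11-29
'''


def parse_bool(value):
    """Accept both 'true'/'false' and 'True'/'False'; both spellings appear in the examples."""
    return value.strip().lower() == "true"


def parse_optional_float(value):
    """Return None for empty cells or 'nan' (assumed missing-value markers), else a float."""
    if value.strip() == "":
        return None
    number = float(value)
    return None if math.isnan(number) else number


def age_on(day, born):
    """Whole years elapsed between a birth date and a given day."""
    return day.year - born.year - ((day.month, day.day) < (born.month, born.day))


def parse_row(row):
    captured = datetime.fromisoformat(row["timestamp"])
    born = date.fromisoformat(row["date_of_birth"])
    start = parse_optional_float(row["video_time_watch_loop_start"])
    end = parse_optional_float(row["video_time_watch_loop_end"])
    return {
        "is_ad": parse_bool(row["is_ad"]),
        "liked": parse_bool(row["video_action_like"]),
        # Assumption: the array is serialized as JSON.
        "visual_indicators": json.loads(row["visual_indicators"]),
        "skipped_at": parse_optional_float(row["video_time_skip"]),
        # Assumption: loop end minus loop start approximates seconds watched.
        "watch_seconds": None if start is None or end is None else end - start,
        "age_at_capture": age_on(captured.date(), born),
    }


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(rows[0])
```

For the real file, the `io.StringIO(SAMPLE_CSV)` argument would be replaced by an opened file handle; the actual serialization of arrays and missing values should be checked against the data before relying on these helpers.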

 

Manual annotations of selected videos (used to assess the accuracy of the ad type and topic classification model) are provided in ai-auditology-advertising-and-minor-profiling-tiktok_annotator_1.csv and ai-auditology-advertising-and-minor-profiling-tiktok_annotator_2.csv, for the first and second human annotator respectively.
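One simple way to compare model output against the manual annotations is to compute the share of videos on which two label sets agree. The sketch below is only an illustration: the structure of the annotator files is not described here, so the labels are toy values keyed by made-up video identifiers, and the `accuracy` helper is a hypothetical name. The paper may use a different evaluation procedure.

```python
def accuracy(predicted, reference):
    """Share of shared video IDs on which two label mappings agree."""
    shared = predicted.keys() & reference.keys()
    if not shared:
        raise ValueError("no overlapping video IDs")
    return sum(predicted[v] == reference[v] for v in shared) / len(shared)


# Toy labels keyed by hypothetical video IDs; real values would be read from
# the video data file and the two annotator files.
model = {"v1": "beauty", "v2": "fitness", "v3": "beauty", "v4": "other"}
annotator_1 = {"v1": "beauty", "v2": "fitness", "v3": "food", "v4": "other"}
annotator_2 = {"v1": "beauty", "v2": "beauty", "v3": "food", "v4": "other"}

print("model vs annotator 1:", accuracy(model, annotator_1))
print("model vs annotator 2:", accuracy(model, annotator_2))
print("annotator 1 vs annotator 2:", accuracy(annotator_1, annotator_2))
```

Raw agreement does not correct for chance; a chance-corrected statistic may be more appropriate when the label distribution is skewed.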

 

Ethical considerations

Most of the ethical, legal and societal issues tied to this dataset were already described in the Ethical Considerations section of the associated paper. The most severe risks were tied to a Terms of Service (ToS) violation, various types of privacy intrusions, the possibility of third-party misuse, or the erosion of some privacy rights such as the right to erasure. 

The research from which this dataset resulted was conducted as part of a research project that obtained approval from the organisational Ethics Committee (decision of December 17, 2024). To minimise potential legal and ethical issues, we directly involved legal and ethics experts in the project. The researchers and research engineers conducting this auditing study also participated in four ethics assessment workshops together with ethics and legal experts, where relevant ethical and legal challenges were identified and appropriate mitigations proposed.

The execution of sockpuppeting audits requires creating automated bots and using them for data collection, which is a potential violation of the terms of service of social media platforms. However, such a breach of ToS is permitted by Article 40(12) of the EU Digital Services Act (DSA) if the research concerns systemic risks. This work directly addresses such a systemic risk by assessing social media platforms' compliance with obligations imposed by legislation, specifically the prohibition of profiling-based advertising to minors laid down in Article 28(2) of the DSA, as foreseen by Recital 83 of the DSA. In addition, the interaction of the bots with content on the platform may affect the platform and society (e.g., by increasing view or like counts); to limit this, we minimise the number of bots that we run. As for the data, we collect only publicly available metadata.

To mitigate potential biases and inaccuracies inherent in the Large Vision Model (LVM) used for advertisement classification, we implemented a multi-layered validation process. This included both ad-hoc and systematic manual audits of dataset subsets. Data failing to meet accuracy benchmarks were excluded, and we have reported the estimated error rates accordingly. To prioritize ethical standards and researcher well-being, all manual annotations were conducted solely by the study’s authors, following expert ethical guidelines.

Finally, to support users' rights to rectification and erasure in case incorrect or sensitive information is published, we provide a procedure for them to request the removal of their posts from the dataset or to flag inaccuracies in the data. To do this, users can contact the authors using the contact form provided for accessing the dataset.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

If you would like to request access to these files, please fill out the form below.

You need to satisfy these conditions in order for this request to be accepted:

In order to share the dataset with you, please agree to the following terms:

  1. You will use the dataset strictly for research purposes. The request for access to the dataset must be sent from an official, existing e-mail address of the relevant university, faculty, or other scientific or research institution (for verification purposes).
  2. You will not attempt to identify, deanonymize or contact the authors of the social media posts included in this dataset.
  3. You will not re-share the dataset (or any of its parts) with anyone else not included in this request.  
  4. You will appropriately cite the papers mentioned in the dataset description in any publication, project, or tool using this dataset.
  5. You understand how the dataset was created and that the manual or automatically predicted annotations may not be 100% correct. 
  6. You acknowledge that you are fully responsible for the use of the dataset (data) and for any infringement of rights of third parties (in particular copyright) that may arise from its use beyond the intended purposes. Neither the authors nor Kempelen Institute of Intelligent Technologies (KInIT) are responsible for your actions.
