Published October 2025
| Version v1
Dataset
Restricted
Dataset of Data Exports (Hidden in Plain Bytes)
Authors/Creators
Description
Overview
This repository contains 12 data exports obtained through "right of access" requests (e.g., under GDPR or CCPA) from 6 major online platforms (Apple, Discord, Facebook, Google, Instagram, and Snapchat), which were collected and analyzed for the following paper:
Julia Nonnenkamp, Naman Gupta, Abhimanyu Dev Gupta, and Rahul Chatterjee. 2025. Hidden in Plain Bytes: Investigating Interpersonal Account Compromise with Data Exports. In Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security (CCS '25). ACM, Taipei, Taiwan, 1–14 (October 13–17, 2025). DOI: 10.1145/3719027.3765147.
Contributors
This data was collected by Julia Nonnenkamp and Abhimanyu Dev Gupta (University of Wisconsin–Madison), and cleaned and analyzed with additional help from Naman Gupta and Rahul Chatterjee (University of Wisconsin–Madison). For questions, please contact Julia Nonnenkamp (nonnenkamp@wisc.edu).
Data sources
We simulated benign and malicious activity on researcher-controlled accounts (made under the pseudonym "Sam") on six platforms: Apple, Discord, Facebook, Google, Instagram, and Snapchat. We then requested and downloaded data exports from each platform twice: once in January 2025 and again 30 days later, in February 2025.
The file sam_january_cleaned.zip contains a subdirectory for each of the 6 data exports from January, and sam_february_cleaned.zip contains the same for February.
The data provided is partially pre-processed to remove personally identifying information (PII), as well as bulky media files and directories irrelevant to our analysis. In terms of the accompanying paper, the data provided has undergone the transformations described in Section 3.4 under "Pseudonymization" and "Filtering Files." Below, we describe each of these steps in more detail.
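Before turning to those steps, the archive layout described above can be checked with a short script. The following is a minimal sketch, assuming the archive has been downloaded locally and that the per-platform subdirectories sit at the top level of the zip (the internal layout is an assumption, and the helper name `top_level_dirs` is hypothetical).

```python
import zipfile
from pathlib import PurePosixPath


def top_level_dirs(zip_path):
    """Return the sorted top-level directory names inside a zip archive."""
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    tops = set()
    for name in names:
        parts = PurePosixPath(name).parts
        # Count an entry as a directory if it is an explicit directory
        # entry ("apple/") or if it contains at least one nested path.
        if parts and (len(parts) > 1 or name.endswith("/")):
            tops.add(parts[0])
    return sorted(tops)
```

Calling `top_level_dirs("sam_january_cleaned.zip")` would be expected to list one directory per platform if the assumption about the layout holds. If the archive wraps everything in a single outer folder, the function returns only that folder name instead.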
(A) Pseudonymization
To protect the privacy of the researchers and any other individuals whose data may be present in the exports, we pseudonymized all personally identifying information (PII) in the data. This includes IP addresses, phone numbers belonging to the researchers (needed to verify accounts), precise location coordinates, and state/city details outside of our lab building. We replaced these with syntactically similar values, e.g., IP addresses became 0.0.0.1, 0.0.0.2, etc., and states and cities were masked as State1, City2, etc.
(B) File filtering
We removed files that did not contain machine-readable text (images, videos) and files from platform features we did not use during simulation (IoT integrations, streaming, payments, educational tools). The retained files are primarily HTML, JSON, and CSV formats, with some TXT files.
See the accompanying paper (Section 3.4) for more detailed reasoning for file filtering. See removed_files.md for the complete list of excluded file paths.
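For readers who want to apply comparable transformations to their own exports, steps (A) and (B) can be approximated as follows. This is an illustrative sketch only, not the scripts used to produce this dataset: the regular expression, the suffix list, and the function names are all assumptions. It covers IPv4 addresses only, and the extension-based filter is a simplification, since the actual exclusions were made per file path (see removed_files.md).

```python
import ipaddress
import re
from pathlib import PurePosixPath

# Assumption: a simple IPv4 pattern. It can also match version-like strings
# such as "1.2.3.4", so results should be reviewed manually.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Assumption: suffixes inferred from the retained formats listed above.
KEPT_SUFFIXES = {".html", ".json", ".csv", ".txt"}


def make_ip_pseudonymizer():
    """Build a function that maps each distinct IP to 0.0.0.1, 0.0.0.2, ..."""
    mapping = {}

    def pseudonymize(text):
        def repl(match):
            ip = match.group(0)
            if ip not in mapping:
                # The n-th distinct address becomes the IPv4 address whose
                # integer value is n, so repeated IPs stay consistent.
                mapping[ip] = str(ipaddress.IPv4Address(len(mapping) + 1))
            return mapping[ip]

        return IPV4_RE.sub(repl, text)

    return pseudonymize, mapping


def keep_file(path):
    """Keep only files whose suffix suggests machine-readable text."""
    return PurePosixPath(path).suffix.lower() in KEPT_SUFFIXES
```

Sharing one mapping across all files keeps a given address tied to the same pseudonym everywhere, which preserves the ability to correlate events (for example, repeated logins from one IP) after pseudonymization.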
Files
Additional details
Funding
- U.S. National Science Foundation: CAREER: Account Security Against Interpersonal Attacks (award 2339679)
- University of Wisconsin–Madison: Baldwin Wisconsin Idea Endowment