Annotated Dataset for User Review Classification in the Privacy Domain for Social Media Applications
Description
This labelled dataset contains 16,000 user reviews collected from seven social media applications (Instagram, Facebook, WhatsApp, Snapchat, X (formerly Twitter),
Facebook Lite, and Line) on the Google Play Store using the Google-Play-Scraper library.
A total of 78,000 reviews were initially scraped, with 18,000 from Facebook and 10,000 from each of the remaining platforms. We retained reviews with at least five words and filtered potentially privacy-related content using an iteratively refined keyword list grouped under five privacy themes. This process yielded 12,170 candidate reviews. To ensure class diversity, we randomly sampled 3,830 additional privacy-irrelevant reviews, resulting in 16,000 raw reviews.
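The length and keyword filter described above can be sketched as follows. This is a minimal illustration only: the keyword set `PRIVACY_KEYWORDS` and the function names are hypothetical, and the actual keyword list and its five privacy themes are not reproduced in this description.

```python
import re

# Hypothetical keywords for illustration; the dataset's real, iteratively
# refined keyword list (grouped under five privacy themes) is not given here.
PRIVACY_KEYWORDS = {"privacy", "tracking", "permission", "personal data"}


def has_min_words(review: str, min_words: int = 5) -> bool:
    """Return True if the review has at least `min_words` whitespace-separated words."""
    return len(review.split()) >= min_words


def mentions_keyword(review: str, keywords=PRIVACY_KEYWORDS) -> bool:
    """Case-insensitive whole-word (or whole-phrase) match against the keyword list."""
    text = review.lower()
    return any(re.search(r"\b" + re.escape(k) + r"\b", text) for k in keywords)


def select_candidates(reviews):
    """Keep reviews that are long enough and mention at least one keyword."""
    return [r for r in reviews if has_min_words(r) and mentions_keyword(r)]
```

Under these assumptions, a short review such as "Bad privacy" is dropped by the length check even though it contains a keyword.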
Each of the 16,000 reviews was manually annotated by the first two authors into one of three categories: privacy-related feature request, privacy-related bug report, or privacy-irrelevant. The annotation process achieved high inter-rater reliability (Cohen’s Kappa = 0.87), with disagreements resolved through discussion. The final annotated dataset consists of 3,627 privacy-related feature requests, 4,221 privacy-related bug reports, and 8,152 privacy-irrelevant reviews.
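For readers unfamiliar with the agreement statistic, Cohen's Kappa compares the observed agreement between two annotators with the agreement expected by chance from their label frequencies. The sketch below is a generic implementation of that standard formula, not the authors' script; the function name and the toy labels in the usage comment are illustrative.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical label throughout.
    return (observed - expected) / (1 - expected)


# Toy usage with the three label codes (0, 1, 2):
# cohens_kappa([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 1])
```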
Pre-processing involved lowercasing, contraction expansion, and removal of HTML tags, numbers, usernames, duplicates, and empty entries. The dataset was then split into training (12,756), validation (1,594), and test (1,595) sets, for 15,945 reviews in total (an approximately 80/10/10 split). The training set was further augmented using multiple NLP techniques (e.g., synonym replacement, contextual insertion) via the nlpaug library, producing 126,602 reviews.
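The cleaning steps listed above can be approximated with the standard library alone. This is a rough sketch under stated assumptions: the `CONTRACTIONS` map is a tiny illustrative subset, the regular expressions are simple guesses at what counts as an HTML tag or username, and the order of operations is not taken from the authors' pipeline.

```python
import re

# Small illustrative contraction map (assumption); a real pipeline would
# need a much fuller list or a dedicated library.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",
    "i'm": "i am",
}


def clean_review(text: str) -> str:
    """Lowercase, strip HTML tags/usernames/numbers, and expand contractions."""
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tags
    text = re.sub(r"@\w+", " ", text)           # @usernames
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = re.sub(r"\d+", " ", text)            # numbers
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace


def preprocess(reviews):
    """Clean every review, then drop empty entries and exact duplicates."""
    seen = set()
    cleaned = []
    for review in reviews:
        text = clean_review(review)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

Deduplicating after cleaning, as done here, also removes reviews that differ only in case, markup, or digits; whether the original pipeline deduplicated before or after cleaning is not stated.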
Post-processing steps included stop-word removal, tokenization, and lemmatization, resulting in a final augmented training set of 121,374 samples.
The dataset contains three columns: Reviews (preprocessed), Processed (postprocessed), and Label (0: feature request, 1: bug report, 2: irrelevant).
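As a quick sanity check, the label distribution can be tallied with the standard `csv` module. The sketch assumes the CSV has a header row with exactly the column names listed above (`Reviews`, `Processed`, `Label`); the helper name `label_distribution` is illustrative.

```python
import csv
import io
from collections import Counter

# Label codes as documented in the description.
LABEL_NAMES = {0: "feature request", 1: "bug report", 2: "irrelevant"}


def label_distribution(csv_file):
    """Count rows per label name from an open CSV file (or file-like object)."""
    reader = csv.DictReader(csv_file)
    counts = Counter(LABEL_NAMES[int(row["Label"])] for row in reader)
    return dict(counts)


# Usage with the published file (assumes it is in the working directory):
# with open("final_data_unbalanced_121374.csv", newline="", encoding="utf-8") as f:
#     print(label_distribution(f))

# Usage with a small in-memory stand-in for the real file:
demo = io.StringIO(
    "Reviews,Processed,Label\n"
    "please add a private mode,add private mode,0\n"
    "app leaks my number,app leak number,1\n"
    "great app,great app,2\n"
)
demo_counts = label_distribution(demo)
```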
Files
final_data_unbalanced_121374.csv
Additional details
Related works
- Is part of
- Preprint: arXiv:2507.10640 (arXiv)