Published July 10, 2025 | Version v1
Dataset Open

Annotated Dataset for User Review Classification in the Privacy Domain for Social Media Applications

Description

This labelled dataset contains 16,000 user reviews collected from seven social media applications (Instagram, Facebook, WhatsApp, Snapchat, X (formerly Twitter), Facebook Lite, and Line) on the Google Play Store using the Google-Play-Scraper library.

A total of 78,000 reviews were initially scraped, with 18,000 from Facebook and 10,000 from each of the remaining platforms. We retained reviews with at least five words and filtered potentially privacy-related content using an iteratively refined keyword list grouped under five privacy themes. This process yielded 12,170 candidate reviews. To ensure class diversity, we randomly sampled 3,830 additional privacy-irrelevant reviews, resulting in 16,000 raw reviews.
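The filtering and sampling steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the keyword set `PRIVACY_KEYWORDS`, the helper names, and the choice to draw the privacy-irrelevant sample from reviews that match no keyword are all assumptions, since the actual keyword list and sampling code are not given here.

```python
import random
import re

# Hypothetical keywords for illustration only; the dataset's real list
# (grouped under five privacy themes) is not reproduced in this record.
PRIVACY_KEYWORDS = {"privacy", "permission", "tracking", "hacked", "password"}


def has_min_words(review, min_words=5):
    """Keep only reviews with at least `min_words` whitespace-separated words."""
    return len(review.split()) >= min_words


def mentions_privacy(review, keywords=PRIVACY_KEYWORDS):
    """Return True if any keyword appears as a whole word in the review."""
    tokens = set(re.findall(r"[a-z']+", review.lower()))
    return not tokens.isdisjoint(keywords)


def select_reviews(reviews, n_irrelevant, seed=0):
    """Split reviews into keyword-matched candidates and a random sample of the rest."""
    kept = [r for r in reviews if has_min_words(r)]
    candidates = [r for r in kept if mentions_privacy(r)]
    others = [r for r in kept if not mentions_privacy(r)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = rng.sample(others, min(n_irrelevant, len(others)))
    return candidates, sampled


# Made-up reviews, used only to exercise the functions.
sample = [
    "my account got hacked and support never replied",
    "love it",
    "the new update makes the app crash on startup",
    "please add an option to hide my online status for privacy",
    "great app for chatting with friends and family",
]
candidates, irrelevant = select_reviews(sample, n_irrelevant=1)
print(len(candidates), len(irrelevant))
```

In the dataset itself, the corresponding figures are 12,170 keyword-matched candidates and 3,830 sampled privacy-irrelevant reviews.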

Each of the 16,000 reviews was manually annotated by the first two authors into one of three categories: privacy-related feature request, privacy-related bug report, or privacy-irrelevant. The annotation process achieved high inter-rater reliability (Cohen’s Kappa = 0.87), with disagreements resolved through discussion. The final annotated dataset consists of 3,627 privacy-related feature requests, 4,221 privacy-related bug reports, and 8,152 privacy-irrelevant reviews.
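For readers unfamiliar with the agreement statistic reported above, Cohen's Kappa compares the observed agreement between two annotators with the agreement expected by chance. The sketch below is a plain-Python illustration on made-up labels; it does not reproduce the authors' annotation data, and the function name `cohens_kappa` is our own.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels (0: feature request, 1: bug report, 2: irrelevant).
annotator_1 = [0, 0, 1, 1, 2, 2, 2, 2, 1, 0]
annotator_2 = [0, 0, 1, 2, 2, 2, 2, 2, 1, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on 8 of 10 items (observed agreement 0.8) against a chance agreement of 0.35, so kappa is 0.45 / 0.65.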

Pre-processing involved lowercasing, contraction expansion, and removal of HTML tags, numbers, usernames, duplicates, and empty entries. The dataset was then split into training (12,756), validation (1,594), and test (1,595) sets, 15,945 reviews in total and roughly an 80/10/10 split. The training set was further augmented using multiple NLP techniques (e.g., synonym replacement, contextual insertion) via the nlpaug library, producing 126,602 reviews.
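The pre-processing steps listed above might look something like the following sketch, which uses only the Python standard library. The contraction map, the regular expressions, and the order of the steps are assumptions made for illustration; the authors' exact rules are not given in this description.

```python
import re

# Small illustrative contraction map (an assumption, not the authors' list).
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",
    "i'm": "i am",
}
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b"
)
TAG_RE = re.compile(r"<[^>]+>")          # HTML tags such as <br>
USER_RE = re.compile(r"@\w+")            # @username mentions
NUM_RE = re.compile(r"\d+(?:\.\d+)*")    # integers and dotted numbers like 2.0
SPACE_RE = re.compile(r"\s+")


def clean_review(text):
    """Lowercase, expand contractions, and strip tags, usernames, and numbers."""
    text = text.lower().replace("’", "'")  # normalise curly apostrophes
    text = TAG_RE.sub(" ", text)
    text = USER_RE.sub(" ", text)
    text = CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(1)], text)
    text = NUM_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def preprocess(reviews):
    """Clean every review, then drop empty entries and exact duplicates."""
    seen = set()
    cleaned = []
    for review in reviews:
        text = clean_review(review)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


# Made-up reviews, used only to exercise the functions.
raw = [
    "I can't log in since update 2.0 <br> @support please help",
    "i can't log in since update 3.1 @admin please help",
    "12345",
    "It's asking for my location all the time",
]
print(preprocess(raw))
```

In this example the first two reviews collapse to the same cleaned text, so one is dropped as a duplicate, and the numbers-only review becomes empty and is removed.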

Post-processing steps included stop-word removal, tokenization, and lemmatization, resulting in a final augmented training set of 121,374 samples.

The dataset contains three columns: Reviews (preprocessed), Processed (postprocessed), and Label (0: feature request, 1: bug report, 2: irrelevant).
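A minimal sketch of reading the three columns and mapping the numeric labels to names is shown below. It assumes the CSV header uses exactly the column names `Reviews`, `Processed`, and `Label`; the helper `load_reviews` and the in-memory sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter

# Label encoding as stated in the dataset description.
LABEL_NAMES = {0: "feature request", 1: "bug report", 2: "irrelevant"}


def load_reviews(handle):
    """Read rows from an open CSV handle and attach a readable label name."""
    rows = []
    for row in csv.DictReader(handle):
        label = int(row["Label"])
        rows.append(
            {
                "review": row["Reviews"],        # preprocessed text
                "processed": row["Processed"],   # postprocessed text
                "label": label,
                "label_name": LABEL_NAMES[label],
            }
        )
    return rows


# Tiny made-up sample standing in for the real file.
sample_csv = io.StringIO(
    "Reviews,Processed,Label\n"
    "please add an option to hide last seen,please add option hide last seen,0\n"
    "app shares my contacts without asking,app share contact without ask,1\n"
    "great app for video calls,great app video call,2\n"
)
rows = load_reviews(sample_csv)
label_counts = Counter(r["label_name"] for r in rows)
print(label_counts)
```

To use it on a downloaded file, something like `open("final_data_unbalanced_121374.csv", newline="", encoding="utf-8")` could be passed to `load_reviews`; the UTF-8 encoding is an assumption.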

 

Files

final_data_unbalanced_121374.csv

Files (45.5 MB)

Checksum Size
md5:e865b62e4d0a45502d04325c68d44b7b 40.3 MB
md5:ec37813501c743361a1d97fa872cbd70 4.2 MB
md5:cce8dad71016940aae92f761da115d33 516.1 kB
md5:7e7bbbf5ae2987cf53b8d5600b93fa8b 531.0 kB
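The MD5 checksums listed above can be used to confirm that a download is intact. The following sketch uses Python's standard `hashlib` module; the helper names and the local file path in the usage comment are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with a listed value such as 'md5:e865...'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Example usage (the local path is hypothetical):
# matches_checksum("downloads/final_data_unbalanced_121374.csv",
#                  "md5:e865b62e4d0a45502d04325c68d44b7b")
```

Which checksum belongs to which file is not recoverable from the listing here, so the pairing in the usage comment is only a guess based on file size.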

Additional details

Related works

Is part of
Preprint: arXiv:2507.10640 (arXiv)