Published February 17, 2024 | Version v1
Dataset Open

AdFlush: A Real-World Deployable Machine Learning Solution for Effective Advertisement and Web Tracker Prevention

Description

This record contains the dataset for the paper "AdFlush: A Real-World Deployable Machine Learning Solution for Effective Advertisement and Web Tracker Prevention", accepted to The Web Conference 2024 (Singapore).

Abstract:

Ad blocking and web tracking prevention tools are widely used, but traditional filter list-based methods struggle to cope with web content manipulation. Machine learning-based approaches have been proposed to address these limitations, but they have primarily focused on improving detection accuracy at the expense of practical considerations such as deployment overhead. In this paper, we present *AdFlush*, a lightweight machine learning model for ad blocking and web tracking prevention that is practically designed for the Chrome browser. To develop *AdFlush*, we first evaluated the effectiveness of 883 features, including 350 existing and 533 new features, and ultimately identified 27 key features that achieve optimal detection performance. We then evaluated *AdFlush* using a dataset of 10,000 real-world websites, achieving an F1 score of 0.98, which outperforms state-of-the-art models such as AdGraph (F1 score: 0.93), WebGraph (F1 score: 0.90), and WTAgraph (F1 score: 0.84). Importantly, *AdFlush* also exhibits a significantly reduced computational footprint, requiring 56% less CPU and 80% less memory than AdGraph. We also evaluated the robustness of *AdFlush* against adversarial manipulation, such as URL manipulation and JavaScript obfuscation. Our experimental results show that *AdFlush* exhibits superior robustness with F1 scores of 0.89–0.98, outperforming AdGraph and WebGraph, which achieved F1 scores of 0.81–0.87 against adversarial samples. To demonstrate the real-world applicability of *AdFlush*, we have implemented it as a Chrome browser extension and made it publicly available. We also conducted a six-month longitudinal study, which showed that *AdFlush* maintained a high F1 score above 0.97 without retraining, demonstrating its effectiveness. Additionally, *AdFlush* detected 642 URLs across 108 domains that were missed by commercial filter lists, which we reported to filter list providers.
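The results above are reported as F1 scores. As a reminder of what that metric measures, here is a minimal sketch of the standard F1 definition (the harmonic mean of precision and recall). The function name and the counts in the usage note are hypothetical illustrations, not values or code from the paper or the AdFlush repository.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall.

    tp, fp, fn are true-positive, false-positive and false-negative counts.
    This is the textbook definition, not code from the AdFlush repository.
    """
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined),
        # so F1 is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For instance, with hypothetical counts of 98 true positives, 2 false positives and 2 false negatives, precision and recall are both 0.98, so `f1_score(98, 2, 2)` is approximately 0.98.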

Files

AdFlush_test.csv

Files (5.3 GB)

Checksum Size
md5:fd806d6d16fe5affbea436dfeba499a5 35.6 MB
md5:57db841f85c4ccb5095b97c415c69812 142.6 MB
md5:feaf036ef797d08625be53ac5320ea08 1.0 GB
md5:b2636b7fc6c5c7ae4abc64873743497d 4.0 GB
md5:30b5208f15912245546170bbbdabb00d 12.4 MB
md5:29d89710efaad705f64ce80aadbe0f8a 7.9 MB
md5:08a1b317a2480ba300a63d61ab23de8a 10.5 MB
md5:e786d9f381a1e20706a55468cdd30983 53.8 MB
md5:710d6e78c4ebc05d5793cec2edbfc310 56.3 MB
md5:f3df2ee9d6f91e7a02c2db3db4e582ea 54.9 MB
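Because several of the files are large, it can be worth checking a download against the MD5 checksums listed above. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `is_listed_checksum` are hypothetical, and since the listing does not say which checksum belongs to which file, the sketch only checks whether a file's digest matches any of the listed values.

```python
import hashlib

# MD5 checksums copied from the file listing of this record.
LISTED_MD5 = {
    "fd806d6d16fe5affbea436dfeba499a5",
    "57db841f85c4ccb5095b97c415c69812",
    "feaf036ef797d08625be53ac5320ea08",
    "b2636b7fc6c5c7ae4abc64873743497d",
    "30b5208f15912245546170bbbdabb00d",
    "29d89710efaad705f64ce80aadbe0f8a",
    "08a1b317a2480ba300a63d61ab23de8a",
    "e786d9f381a1e20706a55468cdd30983",
    "710d6e78c4ebc05d5793cec2edbfc310",
    "f3df2ee9d6f91e7a02c2db3db4e582ea",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks.

    Chunked reading keeps memory use flat even for multi-gigabyte files.
    """
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_listed_checksum(path: str) -> bool:
    """True if the file's MD5 matches any checksum listed for this record."""
    return md5_of_file(path) in LISTED_MD5
```

For example, after downloading `AdFlush_test.csv` into the working directory, `is_listed_checksum("AdFlush_test.csv")` should return `True` for an intact copy. MD5 is used here only because it is the checksum the record publishes; it detects corrupted downloads, not deliberate tampering.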

Additional details

Software

Repository URL: https://github.com/SKKU-SecLab/AdFlush
Programming language: Python, JavaScript
Development Status: Active