Phishing and Benign Websites Dataset
Description
This dataset was compiled by Peya Mowar and Mini Jain. We are releasing this dataset for the research community.
Reference Paper:
P. Mowar and M. Jain, "Fishing out the Phishing Websites," 2021 International Conference on Cyber Situational Awareness, Data Analytics and Assessment (CyberSA), 2021, pp. 1-6, doi: 10.1109/CyberSA52016.2021.9478237.
Abstract:
Phishing is a cybercrime in which deceitful websites lure naive users and trick them into disclosing confidential information, such as social media passwords or financial data. Phishing websites are crafted such that they superficially appear similar to popular legitimate websites. This paper aims to detect such phishing websites by proposing a novel classifier that takes lexical-based, script-based, rule-based, and address-based features extracted from a website into account. A large-scale balanced dataset of 38,800 active phishing and legitimate websites is created, on which tree-based ensemble classifiers are trained, out of which the XGBoost (eXtreme Gradient Boosting) model performs the best with a testing accuracy of 99.6%. The classifier can detect zero-day phishing attacks without requiring any third-party features such as page rank. Several other benefits of using this model over the state-of-the-art techniques are discussed.
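To give a feel for what "lexical-based" features of a URL can look like, here is a minimal Python sketch. The feature names and the function `lexical_features` are hypothetical illustrations chosen for this example; they are not taken from the paper and do not necessarily match the columns in the released CSV.

```python
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features of a URL.

    Assumption: this feature set is made up for demonstration only and is
    not the feature set used by the dataset's authors.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""  # hostname is None for URLs without a netloc
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        "num_digits": sum(ch.isdigit() for ch in url),
    }


if __name__ == "__main__":
    # example.com is a reserved documentation domain, used here as a placeholder.
    print(lexical_features("http://secure-login.example.com/account?id=123"))
```

Features like these depend only on the URL string itself, which is one way a classifier can avoid third-party lookups such as page rank.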
Files
| Name | Size | MD5 |
|---|---|---|
| phishing_and_benign_websites.csv | 3.2 MB | 20c217346de248aab25b5b8def454525 |
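After downloading, the file can be checked against the MD5 checksum listed above. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify_download` are hypothetical, and the local path is assumed to be wherever you saved the CSV.

```python
import hashlib

# Checksum as published in the Files section of this record.
EXPECTED_MD5 = "20c217346de248aab25b5b8def454525"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_download("phishing_and_benign_websites.csv")` should return `True` for an intact copy, assuming the file sits in the current working directory. MD5 is adequate here for detecting corrupted or truncated downloads, not for guarding against deliberate tampering.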
Additional details
Related works
- Compiles: Conference paper, 10.1109/CyberSA52016.2021.9478237 (DOI)