Published December 28, 2021 | Version v1
Conference paper Open

Phishing and Benign Websites Dataset

  • 1. Delhi Technological University

Contributors

Contact person (2):

  • 1. Delhi Technological University

Description

This dataset was compiled by Peya Mowar and Mini Jain and is released for use by the research community.

Reference Paper:

P. Mowar and M. Jain, "Fishing out the Phishing Websites," 2021 International Conference on Cyber Situational Awareness, Data Analytics and Assessment (CyberSA), 2021, pp. 1-6, doi: 10.1109/CyberSA52016.2021.9478237.

Abstract:

Phishing is a cybercrime in which deceitful websites lure naive users and trick them into disclosing confidential information, such as social media passwords or financial data. Phishing websites are crafted such that they superficially appear similar to popular legitimate websites. This paper aims to detect such phishing websites by proposing a novel classifier that takes lexical-based, script-based, rule-based, and address-based features extracted from a website into account. A large-scale balanced dataset of 38,800 active phishing and legitimate websites is created, on which tree-based ensemble classifiers are trained, out of which the XGBoost (eXtreme Gradient Boosting) model performs the best with a testing accuracy of 99.6%. The classifier can detect zero-day phishing attacks without requiring any third-party features such as page rank. Several other benefits of using this model over the state-of-the-art techniques are discussed.
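As a rough illustration of what "lexical-based" and "address-based" features can look like, the sketch below derives a few simple signals from a URL string using only the Python standard library. The function name `lexical_features` and the specific features chosen here are assumptions made for illustration; they are not taken from the reference paper and do not describe the columns of this dataset.

```python
import ipaddress
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Return a few illustrative URL-level features (hypothetical set)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # Address-style signal: is the host a raw IP address rather than a name?
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    return {
        "url_length": len(url),               # total characters in the URL
        "num_dots_in_host": host.count("."),  # crude proxy for subdomain depth
        "has_at_symbol": "@" in url,          # '@' anywhere in the URL string
        "host_is_ip": host_is_ip,
        "uses_https": parsed.scheme == "https",
    }


if __name__ == "__main__":
    print(lexical_features("https://www.example.com/"))
```

Feature dictionaries of this kind could be stacked into a table and passed to a tree-based classifier; which features are actually informative would have to be established against the dataset itself.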

Files

phishing_and_benign_websites.csv

Files (3.2 MB)

Name: phishing_and_benign_websites.csv
Size: 3.2 MB
md5:20c217346de248aab25b5b8def454525
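
To check that a downloaded copy of the file is intact, its MD5 digest can be compared with the checksum listed for this record. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_published_md5` and the local file path are assumptions for illustration.

```python
import hashlib

# Checksum as listed in the Files section of this record.
EXPECTED_MD5 = "20c217346de248aab25b5b8def454525"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path: str) -> bool:
    """Return True if the file's digest equals the published checksum."""
    return md5_of_file(path) == EXPECTED_MD5


if __name__ == "__main__":
    # Assumes the CSV was saved under its original name in the working directory.
    print(matches_published_md5("phishing_and_benign_websites.csv"))
```

MD5 is adequate here for detecting accidental corruption during download; it is not a guarantee against deliberate tampering.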

Additional details

Related works

Compiles
Conference paper: 10.1109/CyberSA52016.2021.9478237 (DOI)