Published December 31, 2021 | Version v1
Conference paper Open

A Comparison of Natural Language Processing and Machine Learning Methods for Phishing Email Detection

  • 1. University of Piraeus

Description

Phishing is the most widely used type of attack in which attackers, commonly via email, impersonate trusted persons or entities to obtain private information from a victim. Although phishing email attacks have been a known cybercriminal strategy for decades, their use has expanded over the last couple of years due to the COVID-19 pandemic, during which attackers have exploited people’s consternation to lure victims. Further research is therefore needed in the field of phishing email detection. Recent detection solutions that extract representational text-based features from the email body have proved to be an appropriate strategy for tackling these threats. This paper compares the combined use of Natural Language Processing (TF-IDF, Word2Vec, and BERT) and Machine Learning (Random Forest, Decision Tree, Logistic Regression, Gradient Boosting Trees, and Naive Bayes) methods for phishing email detection. The evaluation was performed on two datasets, one balanced and one imbalanced, both composed of emails from the well-known Enron corpus and the most recent emails from the Nazario phishing corpus. On the balanced dataset, the best combination proved to be Word2Vec with the Random Forest algorithm; on the imbalanced dataset, it was Word2Vec with the Logistic Regression algorithm.
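To make the idea of pairing a text representation with a classifier concrete, the following is a minimal sketch of one such combination (TF-IDF features with a Naive Bayes classifier), written in plain Python. It is not the paper's implementation: the six training emails are invented toy examples rather than Enron or Nazario data, and the whitespace tokenizer, the smoothed IDF formula, and the names `tokenize`, `fit_idf`, `tfidf`, `NaiveBayes`, and `classify` are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real pipeline would do more.
    return text.lower().split()

def fit_idf(docs):
    """Smoothed inverse document frequency for every term seen in training."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(doc, idf):
    """Map a document to {term: tf * idf}; terms unseen in training are dropped."""
    tf = Counter(t for t in tokenize(doc) if t in idf)
    return {t: count * idf[t] for t, count in tf.items()}

class NaiveBayes:
    """Multinomial Naive Bayes over TF-IDF weights with Laplace smoothing."""

    def fit(self, vectors, labels, vocab, alpha=1.0):
        self.classes = sorted(set(labels))
        self.log_prior, self.log_like = {}, {}
        for c in self.classes:
            rows = [v for v, y in zip(vectors, labels) if y == c]
            self.log_prior[c] = math.log(len(rows) / len(labels))
            totals = defaultdict(float)
            for v in rows:
                for term, weight in v.items():
                    totals[term] += weight
            denom = sum(totals.values()) + alpha * len(vocab)
            self.log_like[c] = {
                t: math.log((totals[t] + alpha) / denom) for t in vocab
            }
        return self

    def predict(self, vector):
        def score(c):
            return self.log_prior[c] + sum(
                w * self.log_like[c][t] for t, w in vector.items()
            )
        return max(self.classes, key=score)

# Hypothetical toy emails, invented for this sketch (not from any real corpus).
train = [
    ("verify your account password now", "phishing"),
    ("urgent update your bank account details", "phishing"),
    ("click here to confirm your password", "phishing"),
    ("meeting agenda for monday attached", "legitimate"),
    ("please review the quarterly report", "legitimate"),
    ("lunch on friday with the team", "legitimate"),
]
texts = [text for text, _ in train]
labels = [label for _, label in train]

idf = fit_idf(texts)
model = NaiveBayes().fit([tfidf(t, idf) for t in texts], labels, vocab=set(idf))

def classify(text):
    """Vectorize an email body with the fitted IDF table and predict its class."""
    return model.predict(tfidf(text, idf))

print(classify("confirm your bank password"))
print(classify("agenda for the team meeting"))
```

Swapping the feature function (for example, averaged word embeddings instead of `tfidf`) or the classifier while keeping the same `classify` interface is one simple way to organize a comparison across several representation and algorithm pairs.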

Files

A_Comparison_of_Natural_Language_Processing_and_Machine_Learning_Methods_for_Phishing_Email_Detection__Zenodo.pdf

Additional details

Funding

SECONDO – a Security ECONomics service platform for smart security investments and cyber insurance pricing in the beyonD 2020 netwOrking era 823997
European Commission
CyberSec4Europe – Cyber Security Network of Competence Centres for Europe 830929
European Commission