A Comparison of Natural Language Processing and Machine Learning Methods for Phishing Email Detection

doi:10.1145/3465481.3469205

Published December 31, 2021 | Version v1

Conference paper Open

1. University of Piraeus

Phishing is the most widely used malicious attack in which attackers, commonly via email, impersonate trusted persons or entities to obtain private information from a victim. Although phishing email attacks have been a known cybercriminal strategy for decades, their use has expanded over the last couple of years due to the COVID-19 pandemic, during which attackers have exploited people’s consternation to lure victims. Therefore, further research is needed in the field of phishing email detection. Recent detection solutions that extract representational text-based features from the email body have proved to be an appropriate strategy for tackling these threats. This paper presents a comparison of the combined use of Natural Language Processing (TF-IDF, Word2Vec, and BERT) and Machine Learning (Random Forest, Decision Tree, Logistic Regression, Gradient Boosting Trees, and Naive Bayes) methods for phishing email detection. The evaluation was performed on two datasets, one balanced and one imbalanced, both composed of emails from the well-known Enron corpus and the most recent emails from the Nazario phishing corpus. The best combination on the balanced dataset proved to be Word2Vec with the Random Forest algorithm, while on the imbalanced dataset it was Word2Vec with the Logistic Regression algorithm.
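To make one of the feature/classifier pairings named in the abstract concrete, the following is a minimal sketch of TF-IDF features fed to a Naive Bayes classifier. It is not the authors' implementation: the training emails are invented toy examples (not drawn from the Enron or Nazario corpora), the helper names (`fit_idf`, `tfidf`, `fit_nb`, `predict`) are hypothetical, and the smoothed IDF formula and Laplace smoothing are assumptions chosen as common defaults.

```python
import math
from collections import Counter, defaultdict

# Toy, made-up emails for illustration only (not Enron or Nazario data).
TRAIN = [
    ("verify your account password now or it will be suspended", "phish"),
    ("urgent click this link to confirm your bank account", "phish"),
    ("your account has been locked confirm your password", "phish"),
    ("meeting moved to friday please review the attached agenda", "ham"),
    ("please send the quarterly report before the meeting", "ham"),
    ("lunch on friday to discuss the project schedule", "ham"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(doc, idf):
    """TF-IDF weights for one tokenized document; unseen terms are dropped."""
    tf = Counter(t for t in doc if t in idf)
    return {t: count * idf[t] for t, count in tf.items()}


def fit_nb(vectors, labels, vocab, alpha=1.0):
    """Multinomial Naive Bayes over TF-IDF weights with Laplace smoothing."""
    class_counts = Counter(labels)
    weight_sums = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for t, w in vec.items():
            weight_sums[label][t] += w
    model = {}
    for label, count in class_counts.items():
        total = sum(weight_sums[label].values()) + alpha * len(vocab)
        log_prior = math.log(count / len(labels))
        log_like = {
            t: math.log((weight_sums[label].get(t, 0.0) + alpha) / total)
            for t in vocab
        }
        model[label] = (log_prior, log_like)
    return model


def predict(text, idf, model):
    """Return the label with the highest log-posterior score."""
    vec = tfidf(tokenize(text), idf)
    scores = {
        label: log_prior + sum(w * log_like[t] for t, w in vec.items())
        for label, (log_prior, log_like) in model.items()
    }
    return max(scores, key=scores.get)


docs = [tokenize(text) for text, _ in TRAIN]
labels = [label for _, label in TRAIN]
idf = fit_idf(docs)
vectors = [tfidf(doc, idf) for doc in docs]
model = fit_nb(vectors, labels, vocab=set(idf))

print(predict("confirm your password to unlock your account", idf, model))  # phish
print(predict("agenda for the friday project meeting", idf, model))  # ham
```

With only six training emails the result says nothing about real-world accuracy; the sketch only illustrates how a text representation and a classifier are combined, which is the kind of pairing the paper evaluates across its two datasets.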

Files

Files (672.8 kB)

Name	Size
A_Comparison_of_Natural_Language_Processing_and_Machine_Learning_Methods_for_Phishing_Email_Detection__Zenodo.pdf (md5:aa91c8b37ccaa9916dbb84bb70ef82d2)	672.8 kB

Additional details

SECONDO – a Security ECONomics service platform for smart security investments and cyber insurance pricing in the beyonD 2020 netwOrking era 823997: European Commission
CyberSec4Europe – Cyber Security Network of Competence Centres for Europe 830929: European Commission

