Published November 2, 2025 | Version v1

Machine Learning-Based Email Spam Detection: Accuracy, Overfitting and Robustness Analysis

  • 1. Institute of Computer & Software Engineering, Khwaja Fareed University of Engineering and Information Technology, Pakistan
  • 2. Department of Computer Science, Khwaja Fareed University of Engineering and Information Technology, Pakistan
  • 3. Department of Computer Science, University of Wah, Wah Cantt, Pakistan
  • 4. Department of Computer Science, The Islamia University of Bahawalpur, Pakistan

Description

This study evaluates classic and modern machine-learning methods for email spam detection using term frequency-inverse document frequency (TF-IDF) features and a public dataset of ham and spam emails. Nineteen classifiers were trained and compared on accuracy, precision, recall, F1 score, and variance-based stability. While several models (e.g., Gradient Boosting, Ridge Classifier CV, Bernoulli Naive Bayes) achieved high test accuracy, the robustness analysis shows that Random Forest and Logistic Regression with cross-validation perform more consistently and overfit less. Standard deviations of the scores and train-test gaps expose variance issues in single decision trees and highlight the practical value of ensembles and regularized linear models. The work underscores that deployment choices should favor consistent, generalizable behavior over peak scores alone.
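The two ideas the description relies on, TF-IDF weighting and a variance-based stability check, can be illustrated with a minimal standard-library sketch. It is not the study's code. The function names `tfidf` and `stability_report`, the whitespace tokenizer, the smoothed IDF formula, and the fold scores are all assumptions made for illustration, and the numbers are invented rather than taken from the paper.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumption: whitespace tokenization and a smoothed IDF,
    log((1 + N) / (1 + df)) + 1; the study's exact settings are not stated.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def stability_report(train_scores, test_scores):
    """Summarize per-fold accuracies for one model."""
    return {
        "mean_test": mean(test_scores),
        # Spread of test scores across folds (population standard deviation).
        "std_test": pstdev(test_scores),
        # A large positive gap suggests the model overfits the training data.
        "train_test_gap": mean(train_scores) - mean(test_scores),
    }


if __name__ == "__main__":
    # Toy corpus, invented for illustration.
    emails = ["free prize now", "meeting at noon", "free lunch at noon"]
    for vector in tfidf(emails):
        print(vector)

    # Hypothetical fold accuracies for two unnamed models (not paper results).
    model_a = stability_report([1.00, 1.00, 1.00], [0.90, 0.96, 0.84])
    model_b = stability_report([0.99, 0.99, 0.99], [0.97, 0.96, 0.98])
    print(model_a)
    print(model_b)
```

In this sketch, a model with a slightly lower peak score but a smaller `std_test` and `train_test_gap` would be the steadier choice, which is the kind of trade-off the description argues should guide deployment.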

Files

377-Article Text-664-1-10-20251102.pdf

md5:85fc709cf46e8f54c30df2889ceeb0a5
858.3 kB