Published June 6, 2026 | Version v1

Phishing Email Clustering Using Quasi-Ground-Truth Machine Learning

  • 1. NASK National Research Institute

Description

Datasets referenced in a conference paper (submitted)

Zip files are password-protected; the password will be revealed after paper acceptance.

phish_public - phishing messages, a dataframe with columns:

  • ts_anon - timestamp
  • from_anon - sender
  • to_anon - recipient
  • subject_anon - subject
  • content_anon - message text
  • client_ip_anon - IP of the sender
  • matched_domains - matched known malicious domains
  • cluster_gt - spam campaign ID (ground truth)
  • emb - embeddings of URLs in message

phish_public_teaser - a preview of the phish_public file

test_public - test dataset, a dataframe with columns:

  • msg_id - message ID, as in the original dataset: AISec_Clustering
  • paper_cluster_id - ground-truth cluster ID
  • emb - embeddings of URLs in message
  • siamese_embeddings - embeddings of message text
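Because test_public carries a ground-truth label for each message, a clustering of its embeddings can be scored against paper_cluster_id. The following is a minimal sketch of one common agreement measure, the adjusted Rand index, written with the standard library only. It is not taken from the paper: the choice of metric, the function name adjusted_rand_index, and the toy label lists are assumptions made for illustration, and how the dataframe is stored inside the archive is not stated in this record.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items.

    Assumption: labels_true would come from the paper_cluster_id column and
    labels_pred from whatever clustering is being evaluated.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one item: the labelings agree trivially

    # Count co-occurrences of (true label, predicted label) and the marginals.
    joint = Counter(zip(labels_true, labels_pred))
    true_sizes = Counter(labels_true)
    pred_sizes = Counter(labels_pred)

    # Number of item pairs placed together in both, in truth, and in prediction.
    together_both = sum(comb(c, 2) for c in joint.values())
    together_true = sum(comb(c, 2) for c in true_sizes.values())
    together_pred = sum(comb(c, 2) for c in pred_sizes.values())

    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (together_both - expected) / (maximum - expected)


# Hypothetical toy labels, not taken from the dataset.
ground_truth = [0, 0, 1, 1]
renamed_clusters = [7, 7, 3, 3]   # same grouping under different IDs
crossed_clusters = [0, 1, 0, 1]   # every true cluster is split
print(adjusted_rand_index(ground_truth, renamed_clusters))
print(adjusted_rand_index(ground_truth, crossed_clusters))
```

The index depends only on which messages are grouped together, so the cluster IDs produced by a clustering algorithm do not have to match the numbering used in paper_cluster_id.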

 

Files

phish_public.zip

Files (132.3 kB)

  • md5:c7655120e0eebe152789c1d0b09ef2ca - 88.0 kB
  • md5:5f4e28c2b7329b9534e763fcdec49159 - 17.0 kB
  • md5:b37b4075e0db82be065f5e98a12ae47a - 27.3 kB
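The MD5 checksums listed above can be used to confirm that a download is intact before attempting to open it. Below is a minimal sketch using Python's hashlib. The helper names file_md5 and md5_matches are hypothetical, and this listing does not state which checksum belongs to which file, so the pairing of a local file with a checksum is left to the reader.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large files are not loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path, expected):
    """Compare a file's MD5 with a checksum written as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.strip().lower()
    if expected_hex.startswith("md5:"):
        expected_hex = expected_hex[len("md5:"):]
    return file_md5(path) == expected_hex
```

For example, md5_matches("phish_public.zip", "md5:<checksum from the list>") returns True only if the local copy hashes to the given value. MD5 is adequate for detecting a corrupted download, but it is not a safeguard against deliberate tampering.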