Published June 6, 2026 | Version v1
Dataset
Open
Phishing Email Clustering Using Quasi-Ground-Truth Machine Learning
Authors/Creators
Description
Datasets referenced in a conference paper (submitted)
Zip files are password-protected; the password will be revealed after paper acceptance.
phish_public - phishing messages, a dataframe with columns:
- ts_anon - timestamp
- from_anon - sender
- to_anon - recipient
- subject_anon - subject
- content_anon - message text
- client_ip_anon - IP of the sender
- matched_domains - matched known malicious domains
- cluster_gt - spam campaign ID (ground truth)
- emb - embeddings of URLs in message
phish_public_teaser - a preview of the phish_public file
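The column list above can be checked against loaded data before any analysis. The following is a minimal sketch only: the on-disk format inside the zip files is not stated here, so it assumes the rows have already been loaded as plain Python dictionaries keyed by column name. The helper names (missing_columns, campaign_sizes) and the toy records are hypothetical and are not part of the dataset.

```python
from collections import Counter

# Column names exactly as listed in the phish_public description.
PHISH_PUBLIC_COLUMNS = [
    "ts_anon", "from_anon", "to_anon", "subject_anon", "content_anon",
    "client_ip_anon", "matched_domains", "cluster_gt", "emb",
]


def missing_columns(record, expected=PHISH_PUBLIC_COLUMNS):
    """Return the expected column names that are absent from one record."""
    return [name for name in expected if name not in record]


def campaign_sizes(records):
    """Count messages per campaign ID, using the cluster_gt column."""
    return Counter(record["cluster_gt"] for record in records)


# Toy records (hypothetical values, all other fields left as None).
toy_records = [
    {**dict.fromkeys(PHISH_PUBLIC_COLUMNS), "cluster_gt": 0},
    {**dict.fromkeys(PHISH_PUBLIC_COLUMNS), "cluster_gt": 0},
    {**dict.fromkeys(PHISH_PUBLIC_COLUMNS), "cluster_gt": 1},
]

for record in toy_records:
    assert missing_columns(record) == []

print(campaign_sizes(toy_records))
```

If the data is loaded with a dataframe library instead, the same check reduces to comparing the dataframe's column names with PHISH_PUBLIC_COLUMNS.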
test_public - test dataset, a dataframe with columns:
- msg_id - message ID, as in the original dataset: AISec_Clustering
- paper_cluster_id - ground-truth cluster ID
- emb - embeddings of URLs in message
- siamese_embeddings - embeddings of message text
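Since test_public pairs each message with a ground-truth label in paper_cluster_id, a clustering of the test messages can be scored against that column. The sketch below uses cluster purity as an illustrative metric; this is an assumption chosen for simplicity, not necessarily the metric used in the paper. The function name purity and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def purity(predicted, ground_truth):
    """Fraction of messages that carry the majority ground-truth label
    of the predicted cluster they were assigned to."""
    if len(predicted) != len(ground_truth):
        raise ValueError("label lists must have the same length")
    if not predicted:
        raise ValueError("label lists must not be empty")

    # Group ground-truth labels by predicted cluster.
    members = defaultdict(list)
    for pred_label, true_label in zip(predicted, ground_truth):
        members[pred_label].append(true_label)

    # Sum the size of the majority ground-truth label in each cluster.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(predicted)


# Toy example: predicted cluster IDs vs. values standing in for paper_cluster_id.
predicted_ids = [0, 0, 1, 1, 1]
paper_cluster_ids = ["a", "a", "a", "b", "b"]
score = purity(predicted_ids, paper_cluster_ids)
```

Purity rewards clusters that contain a single campaign but does not penalize splitting one campaign across many clusters, so it is best read alongside a second measure.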