Dataset for paper Spotting the Hook: Leveraging Domain Data for Advanced Phishing Detection
Description
The dataset contains DNS records, IP-related features, WHOIS/RDAP information, information from TLS certificate fields, and GeoIP information for 432,572 benign domains from Cisco Umbrella and 68,353 phishing domains from the PhishTank and OpenPhish services. The ground truth for the phishing dataset was double-checked against the VirusTotal (VT) service, and domain names that VT did not consider phishing were removed. The data was collected between March and November 2023. The final assessment of the data was conducted in December 2023.
The dataset is useful for statistical analysis of domain data or feature extraction for training machine learning-based classifiers, e.g. for phishing detection.
Data Files
The data is located in two individual files:
- benign.json - data for 432,572 benign domains, and
- phishing.json - data for 68,353 phishing domains.
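Since the larger of the two files is over 16 GB, loading it with a single json.load call may not fit in memory. The following is a minimal sketch of one way to read such a file record by record using only the Python standard library. The function name iter_json_array and the chunk size are illustrative choices, not part of the dataset tooling, and the sketch assumes every array element is a JSON object.

```python
import json
from typing import Iterator, TextIO


def iter_json_array(fp: TextIO, chunk_size: int = 1 << 20) -> Iterator[dict]:
    """Yield the elements of a top-level JSON array one at a time.

    Assumption: every element is a JSON object, so a truncated element
    can never be mistaken for a complete one.
    """
    decoder = json.JSONDecoder()
    buf = ""
    started = False  # becomes True once the opening "[" has been consumed
    while True:
        chunk = fp.read(chunk_size)
        eof = chunk == ""
        buf += chunk
        while True:
            buf = buf.lstrip()
            if not buf:
                break  # buffer exhausted, read more input
            if not started:
                if buf[0] != "[":
                    raise ValueError("expected a top-level JSON array")
                buf = buf[1:]
                started = True
                continue
            if buf[0] == ",":
                buf = buf[1:]
                continue
            if buf[0] == "]":
                return  # end of the array
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                if eof:
                    raise  # the input really is malformed or truncated
                break  # element is incomplete, read more input
            yield obj
            buf = buf[end:]
        if eof:
            return


# Hypothetical usage, assuming phishing.json is in the working directory:
# with open("phishing.json", encoding="utf-8") as fp:
#     for record in iter_json_array(fp):
#         print(record["domain_name"])
```

Third-party streaming parsers would do the same job with less code. The standard-library approach is shown here only to keep the sketch free of extra dependencies.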
Data Structure
Both files contain a JSON array of records generated using mongoexport. The following table documents the structure of a record. Please note that:
- some fields may be missing (they should be interpreted as nulls),
- extra fields may be present (they should be ignored),
- due to a processing error, the common_name field of the certificate objects always contains the trailing symbols ‘>.
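The three notes above can be applied in a small normalization step before any analysis. The sketch below is one possible approach, not the authors' tooling. The names normalize_record, clean_common_name, and TRAILING_JUNK are hypothetical, and the exact set of characters to strip is an assumption that should be checked against the actual data.

```python
from typing import Any, Dict, Optional

# Top-level field names as documented for a record.
TOP_LEVEL_FIELDS = (
    "domain_name", "url", "evaluated_on", "source", "sourced_on",
    "dns", "rdap", "tls", "ip_data",
)

# Assumption: the trailing symbols are a quote character followed by ">".
# Several quote variants are listed because the exact character is uncertain.
TRAILING_JUNK = "'‘’>"


def clean_common_name(value: Optional[str]) -> Optional[str]:
    """Strip the trailing symbols left behind by the processing error."""
    if value is None:
        return None
    return value.rstrip(TRAILING_JUNK)


def normalize_record(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with missing fields set to None and extra fields dropped."""
    record = {name: raw.get(name) for name in TOP_LEVEL_FIELDS}
    tls = record["tls"]
    if isinstance(tls, dict):
        certificates = [
            dict(cert, common_name=clean_common_name(cert.get("common_name")))
            for cert in tls.get("certificates") or []
        ]
        record["tls"] = {**tls, "certificates": certificates}
    return record
```

The function builds new dictionaries instead of modifying the input, so the raw record stays available for comparison.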
Field name | Field type | Nullable | Description |
---|---|---|---|
domain_name | String | No | The evaluated domain name |
url | String | No | The source URL for the domain name |
evaluated_on | Date | No | Date of last collection attempt |
source | String | No | An identifier of the source |
sourced_on | Date | No | Date of ingestion of the domain name |
dns | Object | Yes | Data from DNS scan |
rdap | Object | Yes | Data from RDAP or WHOIS |
tls | Object | Yes | Data from TLS handshake |
ip_data | Array of Objects | Yes | Array of data objects capturing the IP addresses related to the domain name |
DNS data (dns field) | | | |
A | Array of Strings | No | Array of IPv4 addresses |
AAAA | Array of Strings | No | Array of IPv6 addresses |
TXT | Array of Strings | No | Array of raw TXT values |
CNAME | Object | No | The CNAME target and related IPs |
MX | Array of Objects | No | Array of objects with the MX target hostname, priority and related IPs |
NS | Array of Objects | No | Array of objects with the NS target hostname and related IPs |
SOA | Object | No | All the SOA fields, present if found at the target domain name |
zone_SOA | Object | No | The SOA fields of the target’s zone (closest point of delegation), present if found and not a record in the target domain directly |
dnssec | Object | No | Flags describing the DNSSEC validation result for each record type |
ttls | Object | No | The TTL values for each record type |
remarks | Object | No | The zone domain name and DNSSEC flags |
RDAP data (rdap field) | | | |
copyright_notice | String | No | RDAP/WHOIS data usage copyright notice |
dnssec | Bool | No | DNSSEC presence flag |
entitites | Object | No | An object with various arrays representing the found related entity types (e.g. abuse, admin, registrant). The arrays contain objects describing the individual entities. |
expiration_date | Date | Yes | The current date of expiration |
handle | String | No | RDAP handle |
last_changed_date | Date | Yes | The date when the domain was last changed |
name | String | No | The target domain name for which the data in this object are stored |
nameservers | Array of Strings | No | Nameserver hostnames provided by RDAP or WHOIS |
registration_date | Date | Yes | First registration date |
status | Array of Strings | No | The state of the registered object [TODO] |
terms_of_service_url | String | No | URL of the RDAP usage ToS |
url | String | No | URL of the RDAP entity |
whois_server | String | No | WHOIS server address |
TLS data (tls field) | | | |
cipher | String | No | TLS cipher suite description according to [TODO] |
protocol | String | No | One of “TLS”, “TLSv1.2”, “TLSv1.3” |
certificates | Array of Objects | No | Array of objects representing the certificate chain, the first element is the root certificate |
IP data (elements in the ip_data array) | | | |
ip | String | No | The IP address |
from_record | String | No | The type of the DNS record the address was captured from |
remarks | Object | No | Ping round-trip time, “is alive” flag and rdap/geo/asn evaluation dates |
rdap | Object | Yes | RDAP data, similar to DNS RDAP, see the JSON Schema for details |
geo | Object | Yes | Geolocation data from the GeoLite2 City database (e.g. latitude, longitude, city, country, etc.) |
asn | Object | Yes | Autonomous system data from the GeoLite2 ASN database (ASN, organization, network) |
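As an illustration of how the documented fields might feed a classifier, the sketch below derives a few simple numeric features from one record. It is not the feature set used in the paper. The function names and feature names are hypothetical, and the date handling assumes that Date values are either ISO 8601 strings or mongoexport-style wrappers of the form {"$date": "..."}, which should be verified against the actual files.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def parse_date(value: Any) -> Optional[datetime]:
    """Parse an ISO 8601 string or a {"$date": "<ISO string>"} wrapper."""
    if isinstance(value, dict):
        value = value.get("$date")
    if not isinstance(value, str):
        return None  # missing, null, or an unsupported encoding
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # Treat naive timestamps as UTC so that dates can be subtracted safely.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def extract_features(record: Dict[str, Any]) -> Dict[str, Any]:
    """Derive illustrative features; missing objects are treated as empty."""
    dns = record.get("dns") or {}
    rdap = record.get("rdap") or {}
    tls = record.get("tls") or {}
    ip_data = record.get("ip_data") or []

    registered = parse_date(rdap.get("registration_date"))
    evaluated = parse_date(record.get("evaluated_on"))
    age_days = (evaluated - registered).days if registered and evaluated else None

    return {
        "n_a": len(dns.get("A") or []),
        "n_aaaa": len(dns.get("AAAA") or []),
        "n_mx": len(dns.get("MX") or []),
        "n_ns": len(dns.get("NS") or []),
        "has_cname": dns.get("CNAME") is not None,
        "has_tls": bool(tls),
        "tls_chain_length": len(tls.get("certificates") or []),
        "n_ips": len(ip_data),
        "n_ips_with_geo": sum(1 for item in ip_data if item.get("geo")),
        "domain_age_days": age_days,
    }
```

Features that cannot be computed are returned as None instead of a placeholder number, so the choice of imputation stays with the downstream model.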
Acknowledgements
We would like to thank the OpenPhish Team for granting permission to use and publish their dataset. We also thank VirusTotal for providing us access to the API for research purposes. The research has been supported by the Flow-based Encrypted Traffic Analysis project, no. VJ02010024, granted by the Ministry of the Interior of the Czech Republic and the Smart Information Technology for a Resilient Society project, no. FIT-S-23-8209, granted by Brno University of Technology.
Files
(18.1 GB)
Checksum | Size |
---|---|
md5:2efea92fea918e8457b3e446e235dba1 | 16.4 GB |
md5:1c7d6d72cbbadbca46067b22d6215516 | 1.7 GB |
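After downloading, the MD5 checksums listed for the two files can be used to check that the transfer completed without corruption. The sketch below computes the digest in fixed-size blocks so that a multi-gigabyte file never has to fit in memory. The function name md5_of_file is illustrative, and which checksum belongs to which file should be confirmed on the download page.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Hypothetical usage; the expected value must be taken from the file listing:
# expected = "<md5 listed for the downloaded file>"
# assert md5_of_file("phishing.json") == expected
```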