Published June 25, 2024 | Version v1
Dataset Open

Dataset for paper Spotting the Hook: Leveraging Domain Data for Advanced Phishing Detection

Description

The dataset contains DNS records, IP-related features, WHOIS/RDAP information, information from TLS certificate fields, and GeoIP information for 432,572 benign domains from Cisco Umbrella and 68,353 phishing domains from the PhishTank and OpenPhish services. The ground truth for the phishing dataset was double-checked with the VirusTotal (VT) service. Domain names that VT did not consider phishing were removed. The data was collected between March and November 2023. The final assessment of the data was conducted in December 2023.

The dataset is useful for statistical analysis of domain data or feature extraction for training machine learning-based classifiers, e.g. for phishing detection. 

Data Files

The data is located in two individual files:

  • benign.json - data for 432,572 benign domains, and
  • phishing.json - data for 68,353 phishing domains.
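Because each file is a single large JSON array, loading it with one `json.load` call may need more memory than is available. The following is a minimal sketch of reading the records one at a time using only the standard library. The function name `iter_json_array` and the chunk size are illustrative choices, and the sketch assumes every array element is a JSON object, which matches the record layout described here.

```python
import io
import json
from typing import Any, Iterator, TextIO


def iter_json_array(stream: TextIO, chunk_size: int = 1 << 16) -> Iterator[Any]:
    """Yield the elements of a top-level JSON array without loading it whole.

    Assumes the elements are JSON objects, so a partially read element
    never parses successfully by accident.
    """
    decoder = json.JSONDecoder()
    buf = stream.read(chunk_size).lstrip()
    if not buf.startswith("["):
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]
    while True:
        # Skip whitespace and the comma separating two elements.
        buf = buf.lstrip(" \t\r\n,")
        if buf.startswith("]"):
            return
        try:
            record, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            # The buffer holds an incomplete element: read more and retry.
            chunk = stream.read(chunk_size)
            if not chunk:
                raise ValueError("unexpected end of input inside the array")
            buf += chunk
            continue
        yield record
        buf = buf[end:]


# Small in-memory demonstration; for the real files one would pass
# open("phishing.json", encoding="utf-8") instead of the StringIO object.
sample = io.StringIO('[{"domain_name": "a.example"}, {"domain_name": "b.example"}]')
names = [r["domain_name"] for r in iter_json_array(sample, chunk_size=16)]
```

A dedicated streaming JSON parser would likely be faster on the full files; the sketch only shows that the array layout does not force the whole file into memory.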

Data Structure

Both files contain a JSON array of records generated using mongoexport. The following table documents the structure of a record. Please note that:

  • some fields may be missing (they should be interpreted as nulls), 
  • extra fields may be present (they should be ignored), 
  • due to a processing error, the common_name field of the certificate objects always contains trailing symbols: ‘> .
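The first two notes can be applied when a record is loaded. The following is a minimal sketch, assuming the record has already been parsed into a Python dict. The names `TOP_LEVEL_FIELDS` and `normalize_record` are illustrative and not part of the dataset; the field names are the top-level ones documented in this description.

```python
from typing import Any, Dict, Optional

# Top-level field names as documented in this description.
TOP_LEVEL_FIELDS = (
    "domain_name", "url", "evaluated_on", "source", "sourced_on",
    "dns", "rdap", "tls", "ip_data",
)


def normalize_record(raw: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Keep only the documented fields; a missing field becomes None."""
    return {name: raw.get(name) for name in TOP_LEVEL_FIELDS}


# "_id" stands in for an undocumented extra field (hypothetical example data).
record = normalize_record({"domain_name": "example.com", "_id": {"$oid": "0"}})
```

The same pattern can be repeated for the nested objects if a fixed set of keys is needed downstream.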

| Field name | Field type | Nullable | Description |
|---|---|---|---|
| domain_name | String | No | The evaluated domain name |
| url | String | No | The source URL for the domain name |
| evaluated_on | Date | No | Date of last collection attempt |
| source | String | No | An identifier of the source |
| sourced_on | Date | No | Date of ingestion of the domain name |
| dns | Object | Yes | Data from DNS scan |
| rdap | Object | Yes | Data from RDAP or WHOIS |
| tls | Object | Yes | Data from TLS handshake |
| ip_data | Array of Objects | Yes | Array of data objects capturing the IP addresses related to the domain name |

DNS data (dns field)

| Field name | Field type | Nullable | Description |
|---|---|---|---|
| A | Array of Strings | No | Array of IPv4 addresses |
| AAAA | Array of Strings | No | Array of IPv6 addresses |
| TXT | Array of Strings | No | Array of raw TXT values |
| CNAME | Object | No | The CNAME target and related IPs |
| MX | Array of Objects | No | Array of objects with the MX target hostname, priority and related IPs |
| NS | Array of Objects | No | Array of objects with the NS target hostname and related IPs |
| SOA | Object | No | All the SOA fields, present if found at the target domain name |
| zone_SOA | Object | No | The SOA fields of the target’s zone (closest point of delegation), present if found and not a record in the target domain directly |
| dnssec | Object | No | Flags describing the DNSSEC validation result for each record type |
| ttls | Object | No | The TTL values for each record type |
| remarks | Object | No | The zone domain name and DNSSEC flags |
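As an illustration of feature extraction from the dns object, the sketch below counts the entries of the array-valued record types. The function name `dns_record_counts` and the choice of features are assumptions made for this example, not the feature set used in the paper.

```python
from typing import Any, Dict, Optional

# Record types documented above as arrays.
ARRAY_RECORD_TYPES = ("A", "AAAA", "TXT", "MX", "NS")


def dns_record_counts(dns: Optional[Dict[str, Any]]) -> Dict[str, int]:
    """Count entries per record type; a null or missing value counts as 0."""
    dns = dns or {}
    return {rtype: len(dns.get(rtype) or []) for rtype in ARRAY_RECORD_TYPES}


# Hypothetical dns object using documentation-only IP addresses.
counts = dns_record_counts({"A": ["192.0.2.1", "192.0.2.2"], "MX": None})
```

Treating a null dns object the same as an empty one keeps the feature vector the same length for every domain.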

RDAP data (rdap field)

| Field name | Field type | Nullable | Description |
|---|---|---|---|
| copyright_notice | String | No | RDAP/WHOIS data usage copyright notice |
| dnssec | Bool | No | DNSSEC presence flag |
| entitites | Object | No | An object with various arrays representing the found related entity types (e.g. abuse, admin, registrant). The arrays contain objects describing the individual entities. |
| expiration_date | Date | Yes | The current date of expiration |
| handle | String | No | RDAP handle |
| last_changed_date | Date | Yes | The date when the domain was last changed |
| name | String | No | The target domain name for which the data in this object are stored |
| nameservers | Array of Strings | No | Nameserver hostnames provided by RDAP or WHOIS |
| registration_date | Date | Yes | First registration date |
| status | Array of Strings | No | The state of the registered object [TODO] |
| terms_of_service_url | String | No | URL of the RDAP usage ToS |
| url | String | No | URL of the RDAP entity |
| whois_server | String | No | WHOIS server address |
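The Date fields need to be decoded before they can be compared. This description only states that the files were generated using mongoexport, so the sketch below assumes dates are encoded as MongoDB Extended JSON (`{"$date": ...}`, in either the relaxed or the canonical form) and also accepts a plain ISO 8601 string. The helper names `parse_mongo_date` and `registration_period_days` are illustrative.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def parse_mongo_date(value: Any) -> Optional[datetime]:
    """Decode a date value, assuming MongoDB Extended JSON encoding."""
    if value is None:
        return None
    if isinstance(value, dict) and "$date" in value:
        value = value["$date"]
    if isinstance(value, dict) and "$numberLong" in value:
        # Canonical form: milliseconds since the Unix epoch, as a string.
        millis = int(value["$numberLong"])
        return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    # Relaxed form: an ISO 8601 string, usually ending in "Z".
    parsed = datetime.fromisoformat(str(value).replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Treat timestamps without an offset as UTC (an assumption).
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def registration_period_days(rdap: Optional[Dict[str, Any]]) -> Optional[int]:
    """Days between registration and expiration, or None if either is null."""
    rdap = rdap or {}
    start = parse_mongo_date(rdap.get("registration_date"))
    end = parse_mongo_date(rdap.get("expiration_date"))
    if start is None or end is None:
        return None
    return (end - start).days


# Hypothetical rdap object for illustration.
period = registration_period_days({
    "registration_date": {"$date": "2023-01-01T00:00:00Z"},
    "expiration_date": {"$date": "2024-01-01T00:00:00.000Z"},
})
```

Both nullable dates are checked before subtracting, since the table marks them as nullable.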

TLS data (tls field)

| Field name | Field type | Nullable | Description |
|---|---|---|---|
| cipher | String | No | TLS cipher suite description according to [TODO] |
| protocol | String | No | One of “TLS”, “TLSv1.2”, “TLSv1.3” |
| certificates | Array of Objects | No | Array of objects representing the certificate chain; the first element is the root certificate |
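The note about the common_name field can be handled with a small cleanup step when reading certificates. In the sketch below, the exact set of trailing characters (`TRAILING_ARTIFACT_CHARS`) is an assumption and should be adjusted after inspecting the data. Taking the last chain element as the end-entity certificate is also an inference from the statement that the first element is the root. The helper names are illustrative.

```python
from typing import Any, Dict, Optional

# Characters assumed to form the trailing artifact (quotes, ">", whitespace).
TRAILING_ARTIFACT_CHARS = "'\u2018\u2019> \t"


def clean_common_name(common_name: Optional[str]) -> Optional[str]:
    """Strip the assumed trailing artifact characters from a common name."""
    if common_name is None:
        return None
    return common_name.rstrip(TRAILING_ARTIFACT_CHARS)


def leaf_common_name(tls: Optional[Dict[str, Any]]) -> Optional[str]:
    """Return the cleaned common name of the last certificate in the chain."""
    certs = (tls or {}).get("certificates") or []
    if not certs:
        return None
    # Root comes first per the description, so the last element is
    # assumed to be the certificate presented for the domain itself.
    return clean_common_name(certs[-1].get("common_name"))


# Hypothetical tls object for illustration.
leaf = leaf_common_name({
    "certificates": [
        {"common_name": "Example Root CA'>"},
        {"common_name": "example.com'> "},
    ]
})
```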

IP data (elements in the ip_data array)

| Field name | Field type | Nullable | Description |
|---|---|---|---|
| ip | String | No | The IP address |
| from_record | String | No | The type of the DNS record the address was captured from |
| remarks | Object | No | Ping round-trip time, “is alive” flag and rdap/geo/asn evaluation dates |
| rdap | Object | Yes | RDAP data for the IP address, similar to the domain-level rdap object; see the JSON Schema for details |
| geo | Object | Yes | Geolocation data from the GeoLite2 City database (e.g. latitude, longitude, city, country) |
| asn | Object | Yes | Autonomous system data from the GeoLite2 ASN database (ASN, organization, network) |
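The ip_data array can be summarized per domain, for example by counting how many addresses came from each DNS record type and which addresses carry geolocation data. This is a minimal sketch using only the ip, from_record and geo fields documented above; the function names and the "unknown" fallback label are illustrative.

```python
from collections import Counter
from typing import Any, Dict, List, Optional

IpData = Optional[List[Dict[str, Any]]]


def ips_by_record_type(ip_data: IpData) -> Counter:
    """Count IP entries per originating DNS record type."""
    return Counter(
        entry.get("from_record") or "unknown" for entry in ip_data or []
    )


def geolocated_ips(ip_data: IpData) -> List[str]:
    """List the addresses whose nullable geo object is present."""
    return [e["ip"] for e in ip_data or [] if e.get("geo") is not None]


# Hypothetical ip_data array using documentation-only addresses;
# the geo object is left empty because its keys are not listed here.
example_ip_data = [
    {"ip": "192.0.2.1", "from_record": "A", "geo": {}},
    {"ip": "192.0.2.2", "from_record": "A", "geo": None},
    {"ip": "2001:db8::1", "from_record": "AAAA"},
]
per_type = ips_by_record_type(example_ip_data)
with_geo = geolocated_ips(example_ip_data)
```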

Acknowledgements

We would like to thank the OpenPhish Team for granting permission to use and publish their dataset. We also thank VirusTotal for providing us access to the API for research purposes. The research has been supported by the Flow-based Encrypted Traffic Analysis project, no. VJ02010024, granted by the Ministry of the Interior of the Czech Republic and the Smart Information Technology for a Resilient Society project, no. FIT-S-23-8209, granted by Brno University of Technology.

Files

Files (18.1 GB)

  • benign.json - 16.4 GB, md5:2efea92fea918e8457b3e446e235dba1
  • phishing.json - 1.7 GB, md5:1c7d6d72cbbadbca46067b22d6215516