Phishing and Benign Domain Dataset (DNS, IP, WHOIS/RDAP, TLS, GeoIP)
Brno University of Technology
Description
The dataset contains DNS records, IP-related features, WHOIS/RDAP information, information from TLS certificate fields, and GeoIP information for 432,572 verified benign domains from Cisco Umbrella and 36,993 verified phishing domains from the PhishTank and OpenPhish services. The dataset is useful for statistical analysis of domain data or feature extraction for training machine learning-based classifiers, e.g. for phishing detection. The data was collected between March and July 2023. The final assessment of the data was conducted in July 2023, which is why the file names are suffixed with _2307.
The upload contains: a) data files, b) the description of the data structure, and c) the feature vector we used for ML-based phishing domain detection.
Data Files
The data is located in two individual files:
- benign_2307.json - data about 432,572 benign domains, and
- phishing_2307.json - data about 36,993 phishing domains.
Data Structure
Both files are in the JSON Array format. The structure is as follows:
[
  {
    "_id" : "A unique ID of the data record",
    "domain_name" : "Name of the domain (e.g., zenodo.com)",
    "dns" : { "//": "Data obtained from DNS records" },
    "evaluated_on" : "// ISO Timestamp of data collection ",
    "ip_data" : [ "// Data for each related IP address ",
      {
        "//": "IP-related data, including RTT from ICMP echo attempts (from Brno, Czechia)",
        "//": "WHOIS/RDAP data for the given IP address",
        "//": "GeoIP data for the given IP address",
        "//": "NERD system reputation score (if available)",
        "//": "ASN info",
        "//": "remarks: ISO timestamps of collection of the individual data pieces"
      }
    ],
    "label" : "benign_2307 for benign OR misp_2307 for phishing",
    "rdap" : { "//": "WHOIS/RDAP information for the domain name" },
    "remarks" : {
      "dns_evaluated_on" : "ISO Timestamp of DNS data collection",
      "rdap_evaluated_on" : "ISO Timestamp of WHOIS/RDAP data collection",
      "tls_evaluated_on" : "ISO Timestamp of TLS certificate information collection",
      "dns_had_no_ips" : "true if no IPs were found in DNS records"
    },
    "sourced_on" : "ISO Timestamp of the moment the domain was found",
    "tls" : {
      "cipher" : "Identifier of the TLS cipher suite",
      "count" : "Number of certificates in chain",
      "protocol" : "Version of the TLS protocol",
      "certificates" : [
        { "//": "Information from TLS certificate fields: issuer, extensions, etc." }
      ]
    },
    "category" : "Category of the record (could be ignored)",
    "source" : "Name of the file that we used to save the domain list"
  }
]
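Because each file is a single JSON array and benign_2307.json is many gigabytes in size, loading it with one json.load call may not fit in memory. The following is a minimal sketch of a streaming reader built only on the Python standard library. The function name iter_records and the chunk_size parameter are illustrative and not part of the dataset; the sketch assumes that every array element is a JSON object, as in the structure above.

```python
import json
from typing import Iterator


def iter_records(path: str, chunk_size: int = 1 << 20) -> Iterator[dict]:
    """Yield the elements of a top-level JSON array one at a time."""
    decoder = json.JSONDecoder()
    with open(path, "r", encoding="utf-8") as handle:
        buffer, pos, started, eof = "", 0, False, False
        while True:
            # Skip whitespace and element separators before the next token.
            while pos < len(buffer) and buffer[pos] in " \t\r\n,":
                pos += 1
            if pos < len(buffer):
                if not started:
                    if buffer[pos] != "[":
                        raise ValueError("expected a top-level JSON array")
                    started, pos = True, pos + 1
                    continue
                if buffer[pos] == "]":
                    return  # end of the array
                try:
                    record, pos = decoder.raw_decode(buffer, pos)
                except json.JSONDecodeError:
                    pass  # element is incomplete: fall through and read more
                else:
                    yield record
                    continue
            if eof:
                raise ValueError("unexpected end of file inside JSON array")
            chunk = handle.read(chunk_size)
            eof = chunk == ""
            # Drop the consumed prefix so memory use stays bounded.
            buffer, pos = buffer[pos:] + chunk, 0
```

With this helper, a loop such as `for record in iter_records("phishing_2307.json")` visits one record at a time, so fields like `record["domain_name"]` or `record["label"]` can be inspected without holding the whole file in memory.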
Feature Vector
This section describes the feature vector used in the paper "Unmasking the Phishermen: Phishing Domain Detection with Machine Learning and Multi-Source Intelligence", which was accepted to the IEEE NOMS 2024 conference.
Lexical Features
The following features were extracted from the domain name alone:
- lex_name_len - length of the domain name,
- lex_begins_with_digit - true if the domain name begins with a digit,
- lex_www_flag - true if the domain name begins with "www.",
- lex_phishing_keyword_count - occurrence count of 47 phishing-related keywords,
- lex_consecutive_chars - length of the longest consecutive character sequence,
- lex_tld_len - length of the top-level domain (TLD),
- lex_tld_hash - hash of the TLD,
- lex_sld_len - length of the second-level domain (SLD),
- lex_sld_norm_entropy - normalized entropy of the SLD,
- lex_stld_unique_char_count - number of unique characters in the TLD and the SLD,
- lex_sub_count - number of subdomains,
- lex_sub_digit_ratio - ratio of digits in subdomains,
- lex_sub_hex_ratio - ratio of hex symbols in subdomains,
- lex_sub_non_alpanum_ratio - ratio of non-alphanumeric symbols in subdomains,
- lex_sub_vowel_ratio - ratio of vowels in subdomains,
- lex_sub_consonant_ratio - ratio of consonants in subdomains,
- lex_sub_max_consonant_len - length of the longest consonant sequence in subdomains,
- lex_sub_norm_entropy - normalized entropy of a string made from all subdomains,
- lex_phishing_bigram_matches - occurrence count of the top 300 phishing domain bigrams,
- lex_phishing_trigram_matches - occurrence count of the top 2000 phishing domain trigrams,
- lex_phishing_tetragram_matches - occurrence count of the top 5000 phishing domain tetragrams,
- lex_phishing_pentagram_matches - occurrence count of the top 10000 phishing domain pentagrams.
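As an illustration of how some of the lexical features above could be computed, here is a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: entropy is normalized by dividing the Shannon entropy by log2 of the string length, lex_consecutive_chars is read as the longest run of one repeated character, and the TLD and SLD are taken as the last two dot-separated labels (no public suffix list is consulted). The helper names are hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(text: str) -> float:
    """Shannon entropy of the characters, divided by log2(len(text)) (assumed normalization)."""
    if len(text) < 2:
        return 0.0
    total = len(text)
    counts = Counter(text)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return entropy / math.log2(total)


def longest_run(text: str) -> int:
    """Length of the longest run of one repeated character (assumed reading)."""
    best = run = 0
    previous = None
    for char in text:
        run = run + 1 if char == previous else 1
        best = max(best, run)
        previous = char
    return best


def lexical_features(domain_name: str) -> dict:
    name = domain_name.lower().rstrip(".")
    labels = name.split(".")
    tld = labels[-1]
    sld = labels[-2] if len(labels) > 1 else ""
    subdomains = labels[:-2]  # naive split: everything left of the SLD
    sub_text = "".join(subdomains)
    digit_ratio = sum(c.isdigit() for c in sub_text) / len(sub_text) if sub_text else 0.0
    return {
        "lex_name_len": len(name),
        "lex_begins_with_digit": name[:1].isdigit(),
        "lex_www_flag": name.startswith("www."),
        "lex_consecutive_chars": longest_run(name),
        "lex_tld_len": len(tld),
        "lex_sld_len": len(sld),
        "lex_sld_norm_entropy": normalized_entropy(sld),
        "lex_stld_unique_char_count": len(set(sld + tld)),
        "lex_sub_count": len(subdomains),
        "lex_sub_digit_ratio": digit_ratio,
    }
```

The keyword and n-gram features are left out because they depend on the keyword list and n-gram tables used in the paper, which are not reproduced here.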
DNS-based Features
The following features were extracted from DNS responses when querying about the domain:
- dns_A_count - number of A records for the domain,
- dns_AAAA_count - number of AAAA records for the domain,
- dns_CNAME_count - number of CNAME records for the domain,
- dns_MX_count - number of MX records for the domain,
- dns_NS_count - number of nameserver (NS) records for the domain,
- dns_TXT_count - number of TXT records for the domain,
- dns_soa_primary_ns_len - number of characters in the primary NS's domain name,
- dns_soa_primary_ns_level - number of subdomains in the primary NS's domain name,
- dns_soa_primary_ns_digit_count - number of digits in the primary NS's domain name,
- dns_soa_primary_ns_entropy - normalized entropy of the primary NS's domain name,
- dns_soa_email_len - number of characters in the admin's email domain name part,
- dns_soa_email_level - number of subdomains in the admin's email domain name part,
- dns_soa_email_digit_count - number of digits in the admin's email domain name part,
- dns_soa_email_entropy - normalized entropy of the admin's email domain name part,
- dns_soa_refresh - SOA refresh parameter,
- dns_soa_retry - SOA retry parameter,
- dns_soa_expire - SOA expire parameter,
- dns_mx_avg_len - average number of characters of the domain names in MX records,
- dns_mx_avg_entropy - average normalized entropy of the domain names in MX records,
- dns_domain_name_in_mx - true if the domain name is contained in the MX record's domains,
- dns_txt_spf_exists - true if an SPF record is in the TXT RRs,
- dns_txt_avg_entropy - average normalized entropy of the TXT records,
- dns_ttl_low - number of RRsets with TTL in [0,100],
- dns_ttl_mid - number of RRsets with TTL in [101,500],
- dns_zone_entropy - normalized entropy of the zone's domain name.
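The TTL bucket features and the SPF flag can be sketched as follows. This is only an illustration: the input shapes (a mapping from record type to the TTL of its RRset, and a list of TXT strings) are assumptions made for the example and do not describe the layout of the "dns" field in the data files. The SPF check relies on the fact that SPF policies in TXT records begin with "v=spf1".

```python
def ttl_bucket_counts(rrset_ttls: dict) -> dict:
    """Count RRsets whose TTL falls into the [0,100] and [101,500] ranges."""
    ttls = list(rrset_ttls.values())
    return {
        "dns_ttl_low": sum(0 <= ttl <= 100 for ttl in ttls),
        "dns_ttl_mid": sum(101 <= ttl <= 500 for ttl in ttls),
    }


def has_spf(txt_records: list) -> bool:
    """True if any TXT record carries an SPF policy."""
    return any(txt.strip().lower().startswith("v=spf1") for txt in txt_records)
```

For example, `ttl_bucket_counts({"A": 60, "NS": 300, "MX": 3600})` counts one RRset in the low bucket and one in the mid bucket, while the MX RRset falls outside both ranges.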
IP-based Features
These features were derived from IP addresses and ICMP echo replies:
- ip_mean_average_rtt - average RTT of all ICMP echo attempts,
- ip_entropy - total entropy of all /16 (/64 for v6) IP prefixes,
- ip_count - total number of IP addresses for the domain,
- ip_v4_count - total number of IPv4 addresses for the domain,
- ip_v6_count - total number of IPv6 addresses for the domain.
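One possible reading of ip_entropy is the Shannon entropy of the distribution of network prefixes among the domain's addresses, using /16 for IPv4 and /64 for IPv6. The sketch below follows that reading with the standard ipaddress module; whether the original feature is computed in exactly this way (for instance, in bits and without normalization) is an assumption, and the function name is hypothetical.

```python
import ipaddress
import math
from collections import Counter


def prefix_entropy(addresses: list) -> float:
    """Shannon entropy (in bits) of the /16 (IPv4) or /64 (IPv6) prefixes."""
    prefixes = []
    for text in addresses:
        ip = ipaddress.ip_address(text)
        length = 16 if ip.version == 4 else 64
        # strict=False masks the host bits instead of raising an error.
        prefixes.append(ipaddress.ip_network(f"{ip}/{length}", strict=False))
    total = len(prefixes)
    if total == 0:
        return 0.0
    counts = Counter(prefixes)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Under this reading, a domain whose addresses all share one prefix has zero entropy, and the value grows as the addresses spread across more networks.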
TLS-based Features
The following features were extracted from TLS certificate chains and TLS handshakes:
- tls_chain_len - length of the TLS certificate chain,
- tls_broken_chain - true if there is a certificate that has never been valid,
- tls_expired_chain - true if there is an expired certificate in the chain,
- tls_total_extension_count - total extensions in all certificates in the chain,
- tls_critical_extensions - total extensions flagged as "critical" in all certificates,
- tls_with_policies_crt_count - number of certificates that include the "policies" extension,
- tls_percentage_crt_with_policies - percentage of certificates that include the "policies" extension,
- tls_x509_anypolicy_crt_count - number of certificates not enforcing any security policy,
- tls_iso_policy_crt_count - total discovered policies from the 1.* OID space,
- tls_joint_isoitu_policy_crt_count - total discovered policies from the 2.* OID space,
- tls_subject_count - number of subject alternative names (SANs) in the leaf certificate,
- tls_server_auth_crt_count - number of certificates with the "Web Server Authentication" extended key usage,
- tls_client_auth_crt_count - number of certificates with the "Web Client Authentication" extended key usage,
- tls_CA_certs_in_chain_ratio - ratio of CA certificates in the chain,
- tls_unique_SLD_count - number of unique second-level domains (SLD) in domain name SANs,
- tls_common_name_count - number of common names in the chains,
- tls_root_cert_validity_len - length of the validity period of the root certificate,
- tls_leaf_cert_validity_len - length of the validity period of the leaf certificate.
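A few of the chain-level features can be sketched as shown below. The certificate dictionaries and their keys (not_before, not_after, is_ca) are hypothetical and do not describe the layout of the "certificates" entries in the data files. The sketch also assumes that the chain is ordered leaf-first and that validity lengths are measured in seconds; neither is stated in the description above.

```python
from datetime import datetime, timezone


def chain_features(chain: list, now: datetime) -> dict:
    """Summarize a certificate chain ordered leaf-first (assumed ordering)."""
    ca_count = sum(1 for cert in chain if cert["is_ca"])
    leaf = chain[0] if chain else None
    leaf_validity = (
        (leaf["not_after"] - leaf["not_before"]).total_seconds() if leaf else 0.0
    )
    return {
        "tls_chain_len": len(chain),
        # Expired relative to the reference time passed in as `now`.
        "tls_expired_chain": any(cert["not_after"] < now for cert in chain),
        "tls_CA_certs_in_chain_ratio": ca_count / len(chain) if chain else 0.0,
        "tls_leaf_cert_validity_len": leaf_validity,
    }
```

Passing the collection time of the TLS data as `now`, rather than the current time, keeps the expiry flag consistent with the moment the certificates were observed.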
WHOIS/RDAP-based Features
These features are based on the information gathered from WHOIS/RDAP when asking about: a) the domain name and b) domain-related IP addresses:
- rdap_registration_period - difference between domain expiration and registration date,
- rdap_has_dnssec - true if DNSSEC is used for the domain,
- rdap_domain_age - days elapsed from the domain registration,
- rdap_time_from_last_change - days elapsed from the last change of records,
- rdap_domain_active_time - min(today, expiration) - registration date,
- rdap_registrar_name_hash - hash of the domain's registrar,
- rdap_ip_avg_admin_name_len - average length of the admin's name for IP addresses.
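The date-based features above reduce to simple date arithmetic. The sketch below expresses every value in whole days, which is stated for rdap_domain_age and assumed for the others; the function and parameter names are hypothetical, and the reference time is passed in explicitly as `now`.

```python
from datetime import datetime, timezone


def rdap_time_features(registered: datetime, expires: datetime,
                       last_changed: datetime, now: datetime) -> dict:
    """Date-difference features in whole days (assumed unit)."""
    return {
        "rdap_registration_period": (expires - registered).days,
        "rdap_domain_age": (now - registered).days,
        "rdap_time_from_last_change": (now - last_changed).days,
        # min(today, expiration) - registration date, as defined in the list above.
        "rdap_domain_active_time": (min(now, expires) - registered).days,
    }
```

For a domain registered on 2023-01-01 with expiration on 2024-01-01 and evaluated on 2023-07-01, the registration period is 365 days, while the domain age and the active time are both 181 days.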
Geolocation Features
This set of features is based on information gathered from the GeoIP service when asked about domain-related IP addresses:
- geo_countries_count - number of distinct countries where servers of domain-related IPs are located,
- geo_countries_hash - a unique hash for each combination of countries amongst domain-related IPs,
- geo_continent_hash - a unique hash for each combination of continents where the countries are situated.
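A hash that identifies a combination of countries or continents has to ignore order and duplicates. One way to obtain this, shown here only as a sketch, is to hash the sorted set of codes. The use of SHA-256 truncated to 64 bits is an assumption and not necessarily the hash function used for the published feature vector; the function names are hypothetical.

```python
import hashlib


def combination_hash(codes: list) -> int:
    """Order- and duplicate-insensitive hash of a set of codes (assumed scheme)."""
    key = ",".join(sorted(set(codes)))
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # keep the first 64 bits


def geo_features(country_codes: list, continent_codes: list) -> dict:
    return {
        "geo_countries_count": len(set(country_codes)),
        "geo_countries_hash": combination_hash(country_codes),
        "geo_continent_hash": combination_hash(continent_codes),
    }
```

Unlike Python's built-in hash() for strings, a hashlib digest is stable across interpreter runs, which matters when the value is stored as a feature.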
Files
Total size: 17.0 GB
Name | Size | MD5 |
---|---|---|
benign_2307.json | 16.4 GB | 2efea92fea918e8457b3e446e235dba1 |
phishing_2307.json | 621.9 MB | 9a0e038ac3fd1db8ada16801ba31e016 |