Helix: DGA Domain Embeddings for Tracking and Exploring Botnets

Botnets have been using domain generation algorithms (DGAs) for over a decade to covertly and robustly identify the domain name of their command and control servers (C&C). Recent advancements in DGA detection have motivated botnet owners to rapidly alter the C&C domain and use adversarial techniques to evade detection. As a result, it has become increasingly difficult to track botnets in DNS traffic. In this paper, we present Helix, a method for tracking and exploring botnets. Helix uses a spatio-temporal deep neural network autoencoder to convert domains into numerical vectors (embeddings) which capture the DGA and seed used to create the domain. This is made possible by leveraging both convolutional (spatial) and recurrent (temporal) layers, and by using techniques such as attention mechanisms and highway layers. Furthermore, by using an autoencoder architecture, the network can be trained in an unsupervised manner (no labeling of data), which makes the system practical for real-world deployments. In our evaluation, we found that Helix can track botnet campaigns, distinguish between DGA families and seeds, and identify domains generated using the latest adversarial machine learning techniques. Helix is currently being used to track botnets in one of the world's largest Internet Service Providers (ISPs), and we include some of the ISP's analysis work using our method.


INTRODUCTION
Botnets are groups of compromised Internet machines (bots) that are controlled by a single owner to carry out attacks such as distributed denial-of-service [9], data exfiltration [17], and cryptocurrency mining [8]. To maintain control over the botnet, the owner configures its bots to connect to and receive commands from a command and control server (C&C) on the Internet.
Hardcoding the C&C's IP address makes it easy for firewalls and defenders to blacklist botnet traffic and locate the C&C. Therefore, to make connectivity to the C&C robust, owners register an Internet domain name (e.g., s23kdk.com) and have the bots use the domain name system (DNS) to look up the current IP address, which can be changed if the C&C is taken down. Additionally, modern botnets prevent Internet Service Providers (ISPs) from blacklisting their domain names by using a domain generation algorithm (DGA) [19].
A DGA operates similarly to a random number generator, except that it is designed to generate valid domain names. The botnet author configures all of the bots to use the same seed so that they generate the same random domain, and connect to that domain as their C&C. Periodically, the bots generate the next domain in unison to evade detection by the ISP. Fig. 1 illustrates this process.
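To make this concrete, the following toy sketch (written purely for illustration; it is not taken from the paper or any real malware, and the hashing scheme, alphabet, and TLD are arbitrary choices) shows how a shared seed lets every bot independently compute the same pseudo-random domain for a given day:

```python
# Illustrative toy DGA: a seeded hash deterministically maps (seed, date)
# to a pseudo-random domain, so every bot sharing the seed computes the
# same C&C domain for the current period.
import hashlib
import string
from datetime import date

def toy_dga(seed: str, day: date, length: int = 12) -> str:
    alphabet = string.ascii_lowercase + string.digits
    digest = hashlib.sha256(f"{seed}:{day.isoformat()}".encode()).digest()
    name = "".join(alphabet[b % len(alphabet)] for b in digest[:length])
    return name + ".com"

# All bots configured with the seed "botnet-42" agree on today's C&C domain,
# and the domain changes automatically when the day rolls over.
print(toy_dga("botnet-42", date.today()))
```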
Recently, machine learning has been applied to the task of DGA domain detection in DNS traffic. To use machine learning, the DGA-generated domain names must be represented as feature vectors. In the past, researchers have captured the legitimacy of a domain name by extracting domain registration patterns and DNS resolve statuses [3]. However, ISPs can observe trillions of domains each day, so extracting these features is inefficient since it requires constant queries over big data clusters.
Instead, it is preferable to represent domain names by extracting features from the domain name string itself. An accepted approach has been to extract textual features such as character entropy, frequency, and counts of specific characters [3,14,21,25,30]. However, several recent studies have demonstrated camouflage attacks against these textual features, such as mixing benign domain names into the generated ones [2,12,18]. Therefore, there is a need for a stronger representation of domain names, one which can capture both the underlying DGA and the presence of camouflage attacks.
In this paper, we propose Helix: a method for representing generated domain names by converting them into numerical vectors (embeddings) which capture the essence of their underlying DGAs. To accomplish this, we use a deep neural network autoencoder which uses convolutional layers during the encoding and decoding process to capture local patterns (the textual features), and bi-directional LSTM layers at the center to capture the complex and subtle sequential patterns left by the DGA. The architecture also uses various state-of-the-art techniques for extracting the subtle nuances in sequential data, such as attention mechanisms, denoising, and highway layers. As a result, our embeddings can not only distinguish between different DGA families, but also differentiate between two botnets which use the same DGA with different seeds (i.e., code reuse in separate campaigns).
To the best of our knowledge, this is the only work which can capture generated domain names with this level of fidelity, and as such, it can be instrumental in tracking and detecting botnets via DNS traffic. Furthermore, we have found that Helix is robust to the camouflage evasion techniques [2,12,18] compared to a state-of-the-art DGA classifier. Lastly, Helix is practical because autoencoders are trained in an unsupervised manner, meaning that there is no need for manual labeling. As a result, an ISP can easily train the model on trillions of samples gathered from their own DNS traffic.
In our evaluation, we found that Helix (1) outperforms textual features in terms of cluster quality and fidelity, and (2) can form a more robust DGA classifier than the state-of-the-art approach in the presence of camouflage attacks. Lastly, using Helix, we were able to detect five previously unreported DGA families in one day of DNS data (8 billion records). We note that Helix is currently being used to monitor botnets in one of the world's largest ISPs, and we have published our code and DGA datasets online for reproducibility.¹

THE MOTIVATION FOR HELIX
Previously, the ISP's cyber emergency response team (CERT) tried to detect botnets by detecting large peaks of traffic on network ports. However, this approach was (1) too expensive due to the amount of traffic, (2) ineffective since botnets could only be detected when it was too late (under attack), (3) inaccurate since only large attacks performed in unison (e.g., DDoS) could be detected and not others (e.g., APTs, cryptomining, etc.), and (4) incomplete since the CERT's analysts had no way to determine the malware's source code, author, or variant.
As a result, the ISP turned to us to develop a tool for detecting, monitoring, and tracking botnets in their upper-DNS traffic (which is easy to obtain) in a manner that is practical and accurate. Helix addressed all of their challenges because it (1) is trained in an unsupervised manner, (2) performs significantly better than textual features, and (3) can track and detect novel botnets based on their DGAs, before any disruptive attacks are launched. In cooperation with the ISP, we integrated Helix into their CERT pipeline over a 6-month period and tuned its performance together with the CERT's analysts. The CERT currently uses the tool as part of a big data security analytics toolbox which examines the last 24 hours of DNS data, stored in a distributed database, to generate actionable threat intelligence.
Unfortunately, for privacy reasons, we are not privy to the number of botnets detected using the tool since the integration in January 2019. Therefore, in this paper, we only report the results based on the dataset provided to us by the ISP.

RELATED WORK
Previous works aimed at DGA classification and detection proposed three common ways for representing DGA-generated domain names as feature vectors: textual features, a mix of textual and DNS features, and the character string itself. Below we briefly describe each of these feature representations. Table 1 summarizes how they have been used for various DGA related tasks.

Textual Features These features capture aspects that can help differentiate between a DGA-generated domain and a natural language string. For example, the domain's (1) length [16], (2) number of consonants, number of vowels, and digit ratios [22,24], (3) ratio of meaningful words [7,32], and (4) domain character entropy (randomness) [13,32]. Since DGAs are similar to random string generators, character entropy provides a good indication of whether or not a domain is from a DGA. Let $d$ be a given sample domain name, and let $\{c\}$ denote the set of characters in $d$. The entropy feature is calculated as $H(d) = -\sum_{i} p_i \cdot \log p_i$, where $p_i$ is the probability of the $i$-th character in the set $\{c\}$ to appear in $d$.

DNS Traffic-Based Features These are features which capture a domain's metadata provided by the DNS server. For example, in [7], the authors examined the related IP addresses, country of origin, and the DNS time-to-live statistics. Although these features provide useful context for the machine learning algorithm, acquiring them may infringe on the users' privacy and incur a significant preprocessing overhead (e.g., IP resolutions and Whois lookups), especially given the massive number of DNS queries made each day.

Character String Instead of summarizing a string as a feature vector, some propose using the character string itself, for example with a recurrent neural network (RNN) which can handle sequential data. There, each character is represented with a one-hot encoding, e.g., 'a' = (1,0,0,...), 'b' = (0,1,0,...), and so on.
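For concreteness, the entropy feature above can be computed as in the following sketch (we use log base 2; the base is a free choice that only scales the feature):

```python
# Character entropy of a domain name: p_i is the empirical frequency of
# each distinct character in d, and H(d) = -sum_i p_i * log2(p_i).
import math
from collections import Counter

def char_entropy(d: str) -> float:
    n = len(d)
    return -sum((c / n) * math.log2(c / n) for c in Counter(d).values())

print(char_entropy("google"))      # ~1.92: repetitive, natural-language-like
print(char_entropy("s23kdk7qpa"))  # ~3.12: flatter distribution, DGA-like
```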

CREATING THE EMBEDDINGS (HELIX)
In our work, we propose using a neural network to find a more meaningful representation of DGA domain names. Specifically, to obtain the embedding of a domain name, we use a custom spatio-temporal autoencoder.
An autoencoder is an architecture which consists of two neural networks: an encoder $E(x) = z$ and a decoder $D(z) = x'$, where $x$ is the input and $z$ is the resulting embedding. The network is trained as a single model $D(E(x)) = x'$ which tries to reconstruct the input at the output. However, since $z$ has far fewer dimensions than $x$, the network essentially learns how to compress an input into a concise vector (i.e., perform non-linear dimensionality reduction) based on the distribution of the training set.
The spatial aspect of the characters is captured by the autoencoder using multiple convolutional filters of different sizes. These filters essentially search for common patterns and subsequences. These layers capture the benign domains very well and also capture some of the 'leakage' from the DGA (characters, groupings, etc.).
The temporal/sequential aspect of the characters is captured by the autoencoder using bi-directional long short-term memory (LSTM) units at the output of $E$ and the input of $D$. The feature maps (outputs from the convolutional filters) are passed through the LSTM instead of the raw character sequence. This is because (1) domain names can be very long and LSTMs do not perform well on long sequences [20], (2) bi-directional LSTMs have been shown to perform exceptionally well in natural language processing (NLP) and thus capture the benign domains very well, and (3) the LSTM can learn the cyclic patterns which some DGAs use to generate the characters. Therefore, the motivation for this architecture is to capture both local and global patterns in domain name strings which can reflect the originating DGA (see Section 6.2 for an ablation study).
The complete architecture of Helix can be found in Fig. 2. We will now detail each of the layers accordingly:

Input layer The input layer receives the first 63 characters of a second-level domain name (e.g., 'google' from google.com). A character can be one of the 37 symbols found in standard domain names (RFC 1035 [15]), including English letters, digits, and hyphens. Each character is replaced with an index corresponding to its symbol, and zero padding is added to the end.

Character embedding layer Next, the 63×1 vector of indexes is used to retrieve the corresponding character embeddings (compressed one-hot encodings with a dimensionality of 20). The embeddings are concatenated to produce a 63×20 matrix.

1D convolution layers Four character-aware neural (convolutional) network layers are used to extract informative patterns from the domain name (i.e., the 63×20 matrix). We use character-aware layers since they have been shown to be very successful at identifying character-level patterns [10]. In contrast to conventional CNNs, the four layers are not executed sequentially on each other's outputs. Rather, the four layers receive the same 63×20 matrix as their input, yet process the matrix using different kernel sizes: (kernel size=1, num. filters=30), (2,16), (3,8), (4,8). The feature maps outputted from each layer are pooled and concatenated together to form a collective feature map.

Highway layer A special gating layer that enables unimpeded information flow across layers of a neural network [23]. Here the highway enables information from the feature maps to be passed directly to the encoding without passing through the bi-LSTM.

Bidirectional LSTM This layer consists of a dual-layer LSTM which maps an input sequence to an output sequence, in a task known as sequence-to-sequence (seq2seq). The first LSTM captures the linguistic dependencies between the characters from left to right, and the latter captures the dependencies from right to left.

Attention We employ an attention mechanism to allow the encoder to capture more meaningful information regarding the linguistic patterns of domain names. We use the mechanism of [26] since it is designed for encoder-decoder networks used for language processing.

Character softmax layer This is a time-distributed softmax layer which predicts the character for each "timestep" (character index) based on a maximum likelihood estimation.
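To make the architecture concrete, below is a minimal sketch in Keras (an assumed framework; the paper does not name one). The input length (63), alphabet size (37), character embedding dimensionality (20), and the four kernel configurations are taken from the text above, while the LSTM widths, the pooling, and the simplified highway and attention blocks are illustrative stand-ins for the cited mechanisms [23,26] rather than the exact implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers

MAX_LEN, VOCAB, CHAR_DIM = 63, 37, 20  # sizes stated in the text

def highway(x):
    # Simplified highway block [23]: y = t*h + (1-t)*x, with transform gate t.
    dim = x.shape[-1]
    t = layers.Dense(dim, activation="sigmoid")(x)
    h = layers.Dense(dim, activation="relu")(x)
    carry = layers.Multiply()([layers.Lambda(lambda g: 1.0 - g)(t), x])
    return layers.Add()([layers.Multiply()([t, h]), carry])

inp = layers.Input(shape=(MAX_LEN,), dtype="int32")     # character indexes
emb = layers.Embedding(VOCAB + 1, CHAR_DIM)(inp)        # 63x20 matrix (+1 for padding)

# Four parallel character-aware convolutions over the same input matrix.
convs = []
for k, f in [(1, 30), (2, 16), (3, 8), (4, 8)]:
    c = layers.Conv1D(f, k, padding="same", activation="relu")(emb)
    convs.append(layers.MaxPooling1D(pool_size=2)(c))
feat = highway(layers.Concatenate()(convs))             # collective feature map

# Bi-directional LSTM encoder; the final state is the embedding z.
enc_seq = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(feat)
z = layers.Bidirectional(layers.LSTM(32))(enc_seq)

# Decoder: repeat z over time, biLSTM, attention over the encoder states,
# then a time-distributed softmax that reconstructs the characters.
dec = layers.RepeatVector(MAX_LEN)(z)
dec = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(dec)
dec = layers.Attention()([dec, enc_seq])                # simplified attention
out = layers.TimeDistributed(layers.Dense(VOCAB + 1, activation="softmax"))(dec)

autoencoder = tf.keras.Model(inp, out)
encoder = tf.keras.Model(inp, z)                        # kept for deployment
autoencoder.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```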
We train Helix end-to-end in an unsupervised manner using the categorical cross-entropy loss, defined as $\mathcal{L}(y, \hat{y}) = -\sum_i y_i \log(\hat{y}_i)$. Therefore, the training objective is to minimize $\mathcal{L}(\theta) = \mathcal{L}(x', x)$, where $\theta$ are the trainable network parameters of Helix.
The entire autoencoder network is trained as a denoising autoencoder (DAE). This means that for 20% of the training samples, 35% of the characters were randomly removed during training. This approach helps the network generalize better to unseen domains. It also helps the network become more robust to the malicious manipulation of characters which is common in adversarial attacks. After training, the decoder $D$ is discarded and the encoder $E$ is used in deployment to generate embeddings.
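A minimal sketch of this corruption scheme follows; the training pairs are (corrupted domain, clean domain), so the network must reconstruct the clean name from its perturbed version:

```python
# Denoising corruption: for 20% of training samples, randomly remove 35%
# of the characters; the reconstruction target remains the clean domain.
import random

def corrupt(domain: str, p_sample: float = 0.20, p_char: float = 0.35) -> str:
    if random.random() >= p_sample:
        return domain                       # most samples pass through unchanged
    kept = [c for c in domain if random.random() >= p_char]
    return "".join(kept) or domain[0]       # guard against deleting everything

pairs = [(corrupt(d), d) for d in ["s23kdk", "examplebotnet"]]
print(pairs)
```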

USING THE EMBEDDINGS
There are many different machine learning tasks which can be applied to the embeddings. For example, they can be used as feature vectors for classification models to distinguish regular domains from generated domains, or to classify domains according to their DGA family. Doing so may increase detection performance and provide additional robustness to camouflage attacks. Another use is to perform cluster analysis to detect emerging botnets, new seeds, and to obtain other threat intelligence.
Typically, an ISP has an internal Cyber Emergency Response Team (CERT) which is responsible for identifying attacks launched against and from within their network. In this section, we discuss how a CERT can deploy Helix for the purpose of botnet campaign tracking. Later, in Section 6, we will evaluate Helix on this task.

The Pipeline
In order to efficiently track campaigns within their network, the CERT can obtain hourly or daily lists of the domain names resolved by their DNS servers. These domain names can then be converted into embeddings by Helix and added to a clustering model for analysis. There are many different clustering models which can be used. In this paper, we use K-means because many big data clusters offer libraries which can run K-means efficiently on trillions of data points (e.g., MLlib in Spark over a Hadoop cluster). Other ISPs may choose a more real-time approach by implementing a stream clustering algorithm on a distributed data stream processing platform, such as CluStream over Apache Storm.
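As a minimal single-machine sketch of this clustering step (using scikit-learn as a stand-in for a distributed implementation such as MLlib's K-means):

```python
# Cluster Helix embeddings with K-means; the 64-dim embeddings here are
# random stand-ins for illustration.
import numpy as np
from sklearn.cluster import KMeans

def cluster_embeddings(embeddings: np.ndarray, k: int) -> KMeans:
    """Fit K-means on an (n_domains, embedding_dim) matrix of embeddings."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)

Z = np.random.rand(1000, 64)
model = cluster_embeddings(Z, k=50)
print(np.bincount(model.labels_))   # cluster sizes
```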
We also note that while the training of Helix requires extensive computation (e.g., GPU), the execution of the trained model can be performed on a commodity compute cluster with a standard CPU (Helix has only 226,973 parameters which is relatively small).
In Fig. 3 we present the complete pipeline, which is initialized by the CERT. Note that we train Helix on both benign and malicious generated domain names. One could train Helix on only malicious domains by using a DGA classifier (e.g., Invincea [21]) to curate a malicious dataset from the DNS. Although this would improve the clustering quality, we include the benign domains during training because we expect camouflage and other adversarial attacks aiming to fool DGA classifiers by evading detection.
Periodically (once every few hours or days), the CERT can update their models as follows: (1) Set the dataset $X_{t+1}$ to be $X_t$ in addition to all new domains obtained from the DNS and DGA lists. (2) Convert the new domains into embeddings using Helix. (3) Cluster $X_{t+1}$ to obtain $C_{t+1}$, tuned to fit (as done in the initialization phase). (4) Transfer all annotations from $C_t$ to $C_{t+1}$: for each cluster in $C_{t+1}$, if over 75% of the annotated domains have the same DGA family $f$, then annotate the cluster as belonging to family $f$ (a sketch of this rule follows below).
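A sketch of the annotation-transfer rule in step (4) follows; the data layout (cluster labels per domain and a partial mapping of previously annotated domains) is an illustrative assumption:

```python
# Transfer annotations: a new cluster inherits a DGA family label if over
# 75% of its previously annotated members agree on that family.
from collections import Counter

def transfer_annotations(new_labels, known_families, threshold=0.75):
    """new_labels: cluster id per domain index.
    known_families: dict of domain index -> family (annotated in C_t only)."""
    annotations = {}
    for c in set(new_labels):
        members = [i for i, l in enumerate(new_labels) if l == c]
        fams = [known_families[i] for i in members if i in known_families]
        if not fams:
            continue  # no annotated members in this cluster
        fam, count = Counter(fams).most_common(1)[0]
        if count / len(fams) > threshold:
            annotations[c] = fam
    return annotations

print(transfer_annotations([0, 0, 0, 1], {0: "necurs", 1: "necurs", 3: "gozi"}))
```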
During deployment, the CERT can use the models to perform botnet analysis (exploration and campaign tracking). One way to implement botnet analysis with Helix is as follows:

(2A) Deployment: Botnet Analysis
Exploration Because $C_t$ is annotated, similar DGA domains in $X_t$ will be clustered with the known DGA families, because their embeddings will be similar. Therefore, the CERT can identify entire clusters (botnets) in $X_t$ using the same DGAs within their network. The CERT can also obtain situational awareness by tracking the size, origins, and C&C locations of the hosts for each cluster, from $C_t$ to $C_{t+1}$ and so on. The analysts can also observe the evolution and distribution of DGA usage in their network. For example, a variant (new malware) may be detected when a new cluster is found adjacent/joined to a cluster of a well-known DGA family. Finally, the analysts can add their own annotations to clusters, such as manually identifying outliers and tagging novel DGA clusters.

Campaign Tracking When a new botnet is deployed, the bots are given a novel seed. Using Helix, the domains generated from each seed form a micro-cluster which is distinct from all other botnets. By monitoring these micro-clusters, an ISP can detect and monitor new botnets. This was previously not possible with textual features, since all botnets using the same DGA fall into the same cluster.
To accomplish this, the CERT can: (1) Identify DGA clusters using their annotations. (2) For the $i$-th DGA cluster, train a new clustering model $C_{t,i}$ on the embeddings of that cluster. (3) Finally, each of the sub-clusters in $C_{t,i}$ captures a new seed/campaign (sketched below). However, for improved accuracy, the CERT can select the clusters in $C_{t,i}$ where the majority of host IP addresses are unique with respect to the other micro-clusters in $C_{t,i}$, since it is unlikely that the same hosts belong to many other campaigns under the same DGA.
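A sketch of this sub-clustering step follows; the number of micro-clusters per DGA family is a tuning choice not fixed by the text:

```python
# Sub-cluster the embeddings of one annotated DGA cluster to expose
# per-seed micro-clusters (candidate campaigns).
import numpy as np
from sklearn.cluster import KMeans

def find_campaigns(dga_embeddings: np.ndarray, k_i: int = 5) -> np.ndarray:
    sub = KMeans(n_clusters=k_i, n_init=10, random_state=0).fit(dga_embeddings)
    # Each sub-cluster is a candidate seed/campaign; its members can then be
    # cross-checked for host IP uniqueness as described above.
    return sub.labels_

print(np.bincount(find_campaigns(np.random.rand(500, 64))))
```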
The CERT can also use the models to perform DGA detection: the tasks of (1) identifying DGA domains from benign domains, (2) classifying DGA domains by their family, and (3) detecting novel domains generated by a new DGA. In this case, any machine learning classifier can be fitted to the embeddings by assuming that domains from the DGA lists are the positive class and all others are the negative class. However, since we assume that the data is distributed over a big data cluster, we propose the following classifier based on the KNN algorithm. We note that KNN also enables us to perform task #3 (novelty detection), which a standard classifier cannot perform.

(2B) Deployment: DGA Detection

A new domain whose nearest annotated neighbors predominantly share a single DGA family is labeled with that family; in all other cases, it is given the label of the majority class ('benign' or the generic 'DGA'), where all of the DGA family labels are grouped together.
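The following is a hedged sketch of such a KNN detector. The exact decision rules and thresholds were not fully recoverable from this text, so the distance cutoff `tau` (for novelty) and the per-family majority rule are illustrative assumptions consistent with the description above:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

class KNNDGADetector:
    def __init__(self, embeddings, labels, k=5, tau=1.0):
        # labels: 'benign' or a DGA family name, per training embedding.
        self.nn = NearestNeighbors(n_neighbors=k).fit(embeddings)
        self.labels = np.asarray(labels)
        self.tau = tau

    def predict(self, z):
        dist, idx = self.nn.kneighbors([z])
        if dist.min() > self.tau:
            return "novel-DGA"              # task #3: no close known neighbor
        neigh = self.labels[idx[0]]
        fams, counts = np.unique(neigh, return_counts=True)
        top = fams[counts.argmax()]
        if counts.max() / len(neigh) > 0.5 and top != "benign":
            return top                      # task #2: family classification
        # In all other cases, fall back to the majority class, grouping all
        # DGA family labels together (task #1).
        n_dga = int((neigh != "benign").sum())
        return "DGA" if n_dga > len(neigh) / 2 else "benign"
```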

EVALUATION
To justify the architecture of Helix, we first provide results of an ablation study which shows that each component of the network is needed. Next, we compare the quality of our embeddings to the current best method for representing domain names (textual features) through cluster analysis. We then evaluate our suggested implementations for botnet exploration and campaign tracking (Section 5:2A). Finally, we analyze Helix's robustness using several state-of-the-art camouflage attacks on the suggested KNN DGA Detector (Section 5:2B). For reproducibility, we have made the source code and DGA datasets available online (see the footnote in Section 1).

Datasets
We evaluate Helix using four different datasets:

DGArchive A dataset of 50,000 domains from 50 DGA families taken from the DGA Archive [19]. The DGA families were selected at random to capture a wide variety of behaviors.

AmritaDGA A dataset from 2018 consisting of 100,000 registered benign domain names and 297,777 DGA-generated domain names that were evenly produced from 20 authentic DGAs used in the wild [27].

AdverDGA Our own custom dataset containing samples from three recently published DGAs which use adversarial camouflage techniques to evade detection: DeepDGA [2], CharBot [18], and MaskDGA [12] (available online).

ISP A dataset of domain names collected from the upper-level DNS servers of our partner, one of the world's largest ISPs. The unlabeled dataset contains approximately eight billion DNS records (22 million distinct domains, 732 GB) collected from a single day, of both benign and malicious nature. Fig. 4 shows the character distribution of ISP compared to the generated domains in DGArchive and AmritaDGA. Note that ISP naturally has DGA samples within it. Using the LSTM classifier from [2], we found that ∼4.6% of the distinct domains were from benign and malicious DGAs.
Following the process described in Section 5:1A, we collected the DGA samples from AmritaDGA as the DGA list, and used the benign samples of AmritaDGA and the entire ISP dataset as the benign set. We split the data for training and testing according to Table 2. Note that since Helix is trained on the ISP's data in an unsupervised manner, we expect there to be some DGA domains within it. To measure our system's performance, we used the 50 DGA family labels in DGArchive for the ablation, representation, and exploration studies (Sections 6.2, 6.3, and 6.4), and the AdverDGA dataset was used to evaluate Helix's robustness against camouflage attacks (Section 6.6).

Ablation Study
To validate the architecture of Helix, we performed an ablation study: we evaluated several versions of autoencoders using subsets of Helix's layers. For each version, we (1) generated embeddings for the DGArchive dataset, (2) trained a K-means model with k set to the number of DGA families (50), and then (3) evaluated the clustering performance through classification and clustering metrics. For the classification metrics we used the F1-score, precision, and recall measures. For the clustering metrics we used the Adjusted Rand Index (ARI) and the Normalized Mutual Information (NMI) measures, where ARI measures the accuracy of a clustering and NMI measures the entropy reduction for each class. Table 3 presents the results from this study. We found that the combination of layers used in Helix provides the best results. The reason is that the architecture enables the biLSTM layers to model domains as sequences of patterns (or building blocks) captured by the CharCNN layers. Moreover, the attention mechanism helps the network focus on the relevant components. As a result, the network can learn the distribution of sequences more efficiently than CNN or LSTM layers alone.
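For reference, the clustering metrics can be computed as in the following sketch (assuming scikit-learn's implementations of ARI and NMI):

```python
# Compare ground-truth DGA family labels against K-means cluster assignments.
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_scores(true_families, cluster_labels):
    return {
        "ARI": adjusted_rand_score(true_families, cluster_labels),
        "NMI": normalized_mutual_info_score(true_families, cluster_labels),
    }

# Both measures are invariant to label permutation:
print(clustering_scores([0, 0, 1, 1], [1, 1, 0, 0]))  # {'ARI': 1.0, 'NMI': 1.0}
```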

Helix as a DGA Domain Representation
To evaluate Helix as a feature representation for machine learning, we compare it to the current best representation: the textual features described in Section 3. We extracted two feature datasets from DGArchive: one using Helix and the other using the textual features. Similar to our ablation study, we clustered each dataset and then computed its performance using the classification and clustering metrics. Table 4 shows that Helix captures DGAs significantly better than the current best 'textual features'. This is understandable because the network is able to summarize relevant information (character patterns) based on its observations, whereas the textual features only summarize the given string, adding a significant amount of noise to the feature vector. Since both Helix and the linguistic features were applied to the same dataset, it is clear that the Helix model is the contributing factor and not the size of the dataset.
In Fig. 5, we present scatter plots of the Helix embeddings and the textual feature vectors using the ISP dataset. For simplicity, only samples which are classified as DGA domains are presented. The figure shows that not only are the textual features noisy, but Helix is able to naturally distinguish between the DGA families.

Exploring Botnets with Helix
Using the embeddings of the ISP dataset, we explored the clusters formed by the DGAs. The bottom two subplots of Fig. 5 show that the embeddings were able to capture a wide variety of DGA families active within the ISP's network.
In this exploration, we observed that Helix generates meaningful embeddings. After PCA reduction, we observe that the largest distinguishing aspect relates to the output length of the DGA (x-axis, pc0: left=short, right=long). The second aspect seems to relate to the algorithmic patterns of the source DGA (y-axis, pc1). The remaining aspects (other dimensions which are not visible due to the projection) capture entropy and linguistic content. For example, the visible clusters can be grouped into three distinct macro-clusters: dictionary-based DGAs (e.g., suppobox, matsnu, and gozi), random letter DGAs (e.g., necurs, cryptolocker, ramdo, and gameover-p2p), and random alpha-numeric DGAs (e.g., urlzone, qadars, and emotet). To detect novel DGAs in the clustering process, we (1) used a DGA classifier to remove benign domains, and (2) used DGArchive as a blacklist to remove known DGA samples. As a result, we were left with the DGA domains which are currently unknown (the bottom subplot) and were able to identify five new DGA families currently active in the ISP's network. The fifth group uses Internationalized Domain Names (IDNs), which support foreign languages, and can be tied to phishing attack websites.
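For clarity, this two-stage filtering reads as follows in a minimal sketch; `classifier` and `dgarchive_blacklist` are placeholders for the DGA classifier and blacklist named above:

```python
# Isolate unknown DGA domains: drop benign names via a DGA classifier,
# then drop known families via the DGArchive blacklist; the remainder
# feeds the novelty analysis.
def unknown_dga_domains(domains, classifier, dgarchive_blacklist):
    suspected = [d for d in domains if classifier.predict(d) == "dga"]
    return [d for d in suspected if d not in dgarchive_blacklist]
```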

Campaign Tracking with Helix
New botnets often reuse existing DGAs, but each campaign uses a different initial seed. Helix can distinguish between domains generated using different seeds. To demonstrate this, we used existing DGAs to generate a few thousand domain names using 3-4 different seeds. Fig. 6 plots the cluster distributions (pie charts) for each of the seeds. Helix captures dictionary-based DGAs very well (top: Suppobox) and random generators too (bottom: Necurs). Therefore, by clustering the macro-clusters into micro-clusters, we are able to identify new seeds because the distribution shifts.

Robustness to Camouflage Attacks
To evaluate how robust Helix is to camouflage attacks, we attacked Helix and a recent DGA detector (Invincea [21]) with the state-of-the-art camouflage attacks in our AdverDGA dataset: CharBot, DeepDGA, and MaskDGA. CharBot corrupts popular domain names, DeepDGA uses a GAN to generate names that fit the benign distribution, and MaskDGA performs a black-box adversarial machine learning attack against arbitrary DGA classifiers.
We then used Helix (Section 5:2B) and Invincea to detect DGA domains in AdverDGA. We found that Helix performed better than Invincea on average (F1-score of 0.72 vs. 0.67). We also found that Helix performed significantly better than Invincea on the gradient-based attack (MaskDGA), achieving an F1-score of 0.73 compared to 0.50. This is because the unique perturbations of MaskDGA are captured in the domain name, enabling Helix to distinguish it from the others (visualized in Fig. 7). In the figure, we examine data from AmritaDGA because MaskDGA was trained on it. This is in contrast to CharBot, which leaves very little evidence of the attack, resulting in Helix achieving an F1-score of 0.65 vs. Invincea's 0.66.

CONCLUSION AND FUTURE WORK
In this paper we presented Helix: a method for representing DGA domains as feature vectors (embeddings) for botnet exploration, tracking, and many other machine learning tasks. We also presented several ways Helix can be used by an ISP's CERT to explore their daily DNS traffic and automatically detect new botnets in their network over large distributed data clusters (via the KNN algorithm).
We found that not only does Helix outperform the textual features used today, but it is also robust to state-of-the-art adversarial attacks which aim to camouflage DGA domains. Finally, we found that Helix produces high-fidelity embeddings, capable of detecting when new DGA seeds (new campaigns) are used.
As future work, we plan to use Helix to improve current state-of-the-art DGA classifiers, and to refine the seed exploration by training Helix on domains from each DGA separately. Helix is currently deployed in the CERT of one of the world's largest ISPs, providing them with emerging threat intelligence, and has significantly improved their situational awareness. Therefore, we believe that this work can help other ISPs and ultimately make the web a safer place.