Evaluating ML-based DDoS Detection with Grid Search Hyperparameter Optimization

Distributed Denial of Service (DDoS) attacks disrupt global network services, mainly by overwhelming the victim host with requests originating from multiple traffic sources. DDoS attacks are currently on the rise due to the ease of execution and rental of distributed architectures, which could potentially result in substantial revenue losses. Therefore, the detection and prevention of DDoS attacks are currently topics of high interest. In this study, we utilize traffic flow information to determine if a specific flow is associated with a DDoS attack. We evaluate traditional Machine Learning (ML) methods in developing our DDoS detector and utilize an exhaustive hyperparameter search to optimize the detection capability of each ML model. Our evaluation shows that most algorithms provide satisfactory results, with Random Forests achieving as high as 99% detection accuracy, which is comparable to existing deep learning approaches.


I. INTRODUCTION
Denial of Service (DoS) is a well-known cyberattack that targets a victim host (e.g., network servers, resources, or nearby infrastructures), mainly by flooding it with network requests until it is overloaded and unable to execute its usual services [1]. A DoS attack becomes more difficult to detect when the source is distributed over the network, which is known as Distributed DoS (DDoS). Over the years, DDoS attacks have grown due to the increase of distributed architectures such as the Internet of Things (IoT), distributed service paradigms (e.g., [2]), and the ease of renting resources. For instance, the MIRAI Botnet [3], which has infected around 600,000 IoT devices worldwide, was recently used to deploy a large-scale DDoS attack.
Solutions for mitigating DDoS attacks have evolved from rule-based approaches (e.g., filtering [4]) to traditional Machine Learning (ML) schemes (e.g., [5]). The popularity of ML in anomaly detection has grown with the increasing availability of data and computing resources. Lately, Deep Learning (DL) approaches (e.g., [6], [7]) have been utilized to further increase the detection accuracy. However, DL methods are usually cumbersome in real-world deployments [8] and typically require much more input data and computational power than traditional ML methods, which becomes a concern in lightweight execution environments such as IoT. Thus, we aim to create a lightweight ML-based DDoS detector and evaluate traditional ML methods that utilize traffic flow information as input data streams.
The final publication will be available at IEEE Xplore.
In developing our lightweight DDoS detector, we used grid search to find the best hyperparameters and enhance the detector's accuracy, which is a novelty compared to existing DL and ML-based DDoS detection schemes. A similar approach has been used in [9] to improve accuracy, but for BGP anomalies. While current trends focus on DL methods (e.g., [7]), this study looks into pushing lightweight ML methods to their full potential using hyperparameter optimization. We evaluate binary classification techniques such as Logistic Regression, Decision Trees, Random Forests, K-Nearest Neighbors, Support Vector Machines, and Feed-forward Neural Networks (Multi-Layer Perceptron) to represent our ML Detector.
We trained and evaluated the ML methods using the Canadian Institute of Cybersecurity (CIC) datasets. These datasets are used widely for both research (e.g., [7], [10]) and industrial purposes. We utilized datasets from 2012, 2017, 2018, and 2019 that include DoS and DDoS attacks. This data includes, among others, the number of incoming and outgoing packets in a flow, packet inter-arrival times, and header flag counts, together with their simple statistical measures such as the minimum, maximum, average, and standard deviation.
Our evaluation results show that DDoS attacks can be detected using traditional ML algorithms with optimized hyperparameters, reaching an accuracy of over 98% by using Random Forests and Decision Trees across all the datasets. These results are similar to current DL approaches. Thus, tuning hyperparameters in traditional ML allows for performance similar to DL approaches with fewer resources needed.
This paper is organized as follows. Section II discusses related work on the current ML techniques for DDoS detection. Then, Section III presents the methodology and experimental setup. Section IV provides a detailed analysis of the data features used in this study. Section V presents the evaluation results. Finally, our concluding remarks are discussed in Section VI.

II. RELATED WORK
An early comparison of traditional ML methods for DDoS detection was conducted by Bhamare et al. [5]. Using the UNSW dataset, Logistic Regression performed best in terms of accuracy (i.e., 89%) with a 97% True Positive rate. Additionally, when these trained models were applied to another dataset (ISOT dataset), both J48 Decision Tree and Logistic Regression performed best with 95% overall accuracy, but with a significantly reduced True Positive rate.
He et al. [11] also studied DoS attacks but focused on the scenario in which the cloud environment is used for launching DoS attacks. They used hypervisors/virtual machine information to monitor DDoS attacks. They evaluated the most common binary classification techniques and reported that Support Vector Machines with a linear kernel performed best with an accuracy of 99.73%.
Given the increasing number of cyber attack types, Salman et al. [12] propose a two-step approach to identify the types of attack. First, they used ML to detect anomalies and then proceeded to a rule-based identification process to determine the attack type. Using Random Forest and Linear Regression, they achieved 99% in detecting anomalies and 93.6% accuracy in classifying the attack type.
With the increasing popularity of Neural Networks (NNs), Yuan et al. [10] adopted different deep learning methods such as Convolutional NN (CNN) and Recurrent NN (RNN) for DDoS attack detection in a system called DeepDefense. They evaluated their classifier using the ISCXIDS (2012) dataset [13], which showed a large reduction of error in comparison to traditional approaches. For instance, it reduced the error rate from 7.5% to 2.1% in comparison to Random Forests.
Furthermore, Yao et al. [14] developed a DL feature extractor, DeepGFL, to classify attack and traffic flows through graph representation. It aims to detect various network attack types. For DoS Hulk, it reached an F1-measure of 94.05% in the evaluation using CIC-IDS (2017) [15].
Min et al. [17] also developed a Text-CNN and Random Forest-based Intrusion Detection System (TR-IDS) for IoT DDoS detection. Word embedding and Text-CNN are used for  [6], which utilizes a 2-hidden layer feed-forward NN approach coupled with feature selection to reduce input data. The study used the most recent CIC datasets and achieved high detection performance.
The surveyed literature indicates that existing ML approaches do not apply hyperparameter optimization and that recent works adopt DL models instead, which has increased the detection performance. However, DL methods are usually difficult to deploy in real-world scenarios [8] and typically require much more input data and computing capability than the traditional ML methods. This becomes a bottleneck for lightweight execution environments (e.g., IoT) and in centralized security architectures (e.g., [2]) that require lightweight data collecting agents.
In this work, different from the DL trends, we focus on lightweight and traditional ML models that are coupled with a hyperparameter search to obtain the optimum parameters that yield higher detection performance. This work also utilizes recent CIC datasets and reports a comparison with current DL approaches in detecting DDoS attacks.

III. SYSTEM MODEL
In this section, the datasets and ML models in this study are discussed, followed by the training and evaluation procedures.

A. Datasets Description
The datasets used in this study are taken from the CIC, University of New Brunswick (UNB), which are publicly available and have been widely used in numerous studies (e.g., [6], [7]). We used the latest datasets that contain DoS, namely, ISCXIDS (2012) [13], CICIDS (2017) [15], CSE-CIC-IDS (2018) [15], and CICDDoS (2019) [18], which are reported in Table I. The same set of datasets has also been used in [6]. The attacks extracted are composed of a Botnet attack, DoS attacks such as Hulk, GoldenEye, slowloris, and slowHTTPtest, network stress testing tools such as Low Orbit Ion Cannon (LOIC), and DDoS attacks that utilize different protocols and applications (i.e., DNS, LDAP, MSSQL, NTP, NetBIOS, SNMP, SSDP, UDP, TCP (Syn), TFTP, UDP-lag, and WebDDoS). Table I shows that ISCXIDS (2012) includes an Internet Relay Chat (IRC) Botnet DDoS attack [13]. This dataset was generated using the IBM QRadar appliance. Moreover, the CICIDS (2017), CSE-CIC-IDS (2018), and CICDDoS (2019) are the latest datasets provided by UNB that include multiple cyberattacks. All these datasets have real data traces in PCAP format and also have flow features generated by the CICflowmeter tool [19] to extract flow statistics. Each traffic flow sample contains 76 features composed of traffic flow statistics. Among others, the features include the number of packets in both forward and backward directions, Inter-Arrival Times (IAT) of these packets, packet length information, header flag counts, header information, and bulk/segment information of packets, together with their simple statistical parameters such as minimum, maximum, average, and standard deviation. The complete description is provided in the documentation of the CICflowmeter tool [19].

B. Training Configuration
Each dataset is randomly split into 75% training and 25% test sets. Then, we preprocessed the training data by removing noise and scaling the features. For the feature scaling, we used z-score normalization [20], which converts each individual feature distribution to zero mean and unit variance. We used scikit-learn's standard scaler function for the z-score feature scaling.
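The split and scaling steps described above can be sketched as follows. This is a minimal illustration on synthetic data, not the pipeline used in the study: the feature matrix, the random seed, and the stratified split are assumptions, and the scaler is fitted on the training portion only so that test statistics do not leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for flow features (rows = flows, columns = features).
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 5))
y = rng.integers(0, 2, size=1000)  # 1 = DDoS (anomaly), 0 = benign

# 75% training / 25% test split; stratification is an assumption made here.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# z-score normalization: fit mean and standard deviation on the training set,
# then apply the same transformation to the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

After fitting, each column of `X_train_scaled` has zero mean and unit variance, while the test columns are only approximately standardized because they reuse the training statistics.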
We compared multiple binary classification techniques, namely, Logistic Regression (LR), Naive Bayes (NB) classifier, Decision Trees (DT), Random Forests (RF), K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Multi-layer Perceptron (MLP). The output binary classes are anomaly and benign, which stand for the DDoS attack and regular traffic flows, respectively.
Each of the ML algorithms has its own set of hyperparameters that needs to be fixed before training, except for the NB classifier, which does not have a tuning parameter and will be used as a benchmark. These hyperparameters are reported in Table II. For LR, we explored Lasso (L1) and Ridge (L2) regression [21] for regularization, which prevents overfitting and reduces model complexity. The C parameter [22], which is the inverse of regularization strength, is also explored for LR. Similarly, we also explored the C parameter for the SVM.
For KNN, we explore different values of K neighbors, together with the different metrics for computing the distances among samples. For DT, we use different criteria for computing the impurity such as Gini impurity [23] and entropy [23], with tree depth up to 20 levels. RF also takes the same hyperparameters with the addition of the number of tree estimators ranging from 10 to 100. We explore different sizes of hidden layers for MLP together with their activation functions such as logistic [24], rectified linear unit (relu) [24], and hyperbolic tangent (tanh) [24]. We also explore different weight optimizers, which include the Limited memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) algorithm [25], Stochastic Gradient Descent (SGD) [25], and Adam optimizer [26]. The hyperparameter search is conducted using scikit-learn's GridSearchCV function. The model training and validation are implemented using 3-fold cross-validation.
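As a rough illustration of the search procedure described above, the sketch below runs an exhaustive grid search with 3-fold cross-validation for a Decision Tree. The synthetic data and the specific grid values are assumptions chosen for brevity; they are not the values of Table II.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the flow features.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Illustrative grid: impurity criterion and maximum tree depth.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 5, 10, 20],
}

# Every combination is trained and validated with 3-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
    return_train_score=True,
)
search.fit(X_train, y_train)

print("best hyperparameters:", search.best_params_)
print("mean validation accuracy:", search.best_score_)
print("held-out test accuracy:", search.score(X_test, y_test))
```

The same pattern applies to the other classifiers by swapping the estimator and its grid, for example `C` for LR and SVM or `n_neighbors` and `metric` for KNN.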
\[
\text{Precision } (Pr) = \frac{TP}{TP + FP}, \qquad \text{Recall } (Rc) = \frac{TP}{TP + FN}
\]

IV. DATA ANALYSIS
DDoS attacks are known to overwhelm the victim servers with both incoming traffic (e.g., volumetric attacks) and outgoing traffic (e.g., reflection attacks) beyond what they can handle. As a first step, we investigated the total packet flows for both forward (attacker to the victim) and backward (victim to the attacker) directions. These features are included among the multiple features used in this study. Figure 1 shows the total forward versus backward packets in a flow, differentiated according to regular and DDoS flows. Even with only these two features, we can see a pattern for the different DDoS attacks. Normal traffic mostly has an equal number of forward and backward packets, even in Figure 1c, where we skewed the figure to show the large difference between incoming traffic (maximum of 300,000 packets) and outgoing traffic (maximum of 25,000 packets). Figure 1a shows the DDoS Botnet from 2012, which has a slightly increased number of backward packets versus the forward packets. In this scenario, seven users managed to infiltrate the servers and force them to download and run the HTTP GET command, which resulted in full and partial inaccessibility [13].
For DoS attacks (e.g., Hulk, GoldenEye, slowloris, and slowHTTPtest), the features do not show significant change, as depicted in Figure 1b. The difference between the number of forward and backward packets is more evident in the 2018 LOIC (UDP and HTTP) attacks. Figure 1c shows a clear separability between DDoS and regular flows. Note that for a single regular flow, the largest number of forward packets reached only 20,000 packets. A single attack flow can reach up to 200,000 forward packets for an HTTP attack and up to 300,000 forward packets for a UDP attack, which is respectively 10 and 15 times the maximum number of forward packets in a regular flow.
The DDoS attacks of the 2019 dataset show characteristics similar to the 2018 LOIC attacks. The total number of forward packets differs vastly from the number of backward packets, as shown in Figure 1d. For instance, a single DDoS traffic flow using the DNS protocol reached a maximum of 100,000 packets. Thus, we expect high detection rates for these datasets given their clear separability, even when using traditional ML methods.
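The ratios quoted for the 2018 dataset can be checked with a few lines of arithmetic. The sketch below uses only the maxima stated in the text; the dictionary keys are hypothetical labels introduced for illustration.

```python
# Maximum forward packets observed in a single flow, as quoted in the text
# for the 2018 dataset. The key names are illustrative labels only.
max_fwd_packets = {
    "regular": 20_000,
    "LOIC-HTTP": 200_000,
    "LOIC-UDP": 300_000,
}

baseline = max_fwd_packets["regular"]
# Ratio of each attack maximum to the regular-flow maximum.
ratios = {
    name: count / baseline
    for name, count in max_fwd_packets.items()
    if name != "regular"
}
print(ratios)  # {'LOIC-HTTP': 10.0, 'LOIC-UDP': 15.0}
```

A gap of an order of magnitude in a single feature is what makes these flows easy to separate, even for simple threshold-like models such as a shallow decision tree.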

V. EXPERIMENTAL RESULTS
We report the training evaluation of the ML models during the hyperparameter search in this section. Then, we show the classification results of all the ML methods using both the accuracy measures and the ROC curves of the final testing evaluation. Finally, we show the comparison of our results with the existing DL approaches.
Our experiments were conducted using a server node with 32 cores and 64 GB of RAM. We used Python's scikit-learn module for the development of ML models. The final detector was also configured and tested in a real-world deployment.

A. Training Evaluation
The training evaluation is shown in Figure 2, depicting the average training and validation accuracies for each hyperparameter. Figure 2 shows only the result for the ISCXIDS 2012 dataset, which has patterns similar to the other datasets.
For the DT, the figure shows that choosing either Gini or entropy to measure node impurity does not incur a significant change in detection accuracy. On the other hand, the depth of the tree plays a huge role in improving the accuracy. As the depth increases, the training and validation accuracies increase. However, the validation accuracy stops improving earlier than the training accuracy, which indicates that deeper trees tend to overfit. The optimum value chosen by GridSearchCV is max depth = 11, where the validation accuracy (rather than the training accuracy) is the highest. The same pattern can be observed for the other datasets.
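The selection rule described above, picking the depth with the highest validation accuracy rather than the highest training accuracy, can be sketched as follows. The synthetic data with label noise and the depth range are assumptions made for illustration; the resulting depth is not expected to match the value reported for the 2012 dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise so that deep trees can overfit.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, flip_y=0.1, random_state=0
)

depths = list(range(1, 21))
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": depths},
    cv=3,
    return_train_score=True,  # keep training accuracy for comparison
)
search.fit(X, y)

train_acc = search.cv_results_["mean_train_score"]
val_acc = search.cv_results_["mean_test_score"]
for depth, tr, va in zip(depths, train_acc, val_acc):
    print(f"max_depth={depth:2d}  train={tr:.3f}  validation={va:.3f}")

# The chosen depth maximizes the mean validation accuracy.
best_depth = depths[int(np.argmax(val_acc))]
print("selected max_depth:", best_depth)
```

In the printed table, training accuracy typically keeps rising with depth while validation accuracy flattens or drops, which is the overfitting pattern discussed in the text.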
For the RF, similar patterns can be seen for the depth and criterion from DT, as shown in Figure 2d. The number of estimators does not have a strong influence on increasing accuracy. The patterns for these three hyperparameters are also similar to the other datasets.
For the KNN, the number of neighbors, K, has more influence on accuracy than the distance metric. As K increases, the accuracy decreases, which is the same on all datasets. Thus, K=10 is optimum for all the KNN models. For the distance metric, Manhattan achieved the best accuracy while Chebyshev achieved the lowest. This result is also the same for the 2017 and 2019 datasets, while Chebyshev performed best for the 2018 dataset.
For the LR, the L1 and L2 regularization parameters show little difference in accuracy across all datasets. On the other hand, the C parameter has a significant influence on accuracy. For the SVM, the effect of the C parameter is uniform across all the datasets, where accuracy decreases as the C parameter increases.
For MLP, the weight optimizer has more influence on accuracy than the hidden layer numbers and their activation functions. For the optimizer, SGD always achieved the lowest accuracy among all the datasets. The Adam optimizer performs best for the 2012 and 2017 datasets while LBFGS performs best for the remaining datasets. Although the differences are minuscule, relu achieved the highest accuracy for the 2017 dataset while tanh performed best for the remaining datasets.
Finally, the average fitting and validation time during the training phase is shown in Figure 3.
Table III shows the results of ML models for the four main test datasets, reporting the overall accuracy (Ac), F1-measure (F1), precision (Pr), and recall (Rc). For the 2012 dataset, the best result was achieved by RF, reaching 98.3% Ac using entropy with a tree depth of 14. Similarly, DT also provides high Ac, which reached 98.2% using Gini and a tree depth of 11. Linear SVM and LR both achieved 96.3% Ac using a C parameter of 0.01 and 0.1, respectively. All of the algorithms yield high accuracy except for NB.
For the 2017 dataset, DT, RF, and KNN reached over 99% Ac. DT achieved the highest with a maximum tree depth of 14. RF also achieved high Ac with a maximum tree depth of 20 levels and 100 tree estimators. Furthermore, MLP reached 98.9% Ac with relu as the activation function in the hidden nodes and the Adam weight optimizer. LR and SVM also achieved high Ac with a C parameter equal to 10 and 0.001, respectively. NB again had the lowest performance in this dataset.

All of the algorithms yield high detection Ac in the 2018 dataset. NB reached 91.8% Ac while the rest achieved 97.5% or higher. DT performed best with 99.9% Ac, followed closely by KNN. RF also achieved 99.8% Ac using 20 tree estimators. Both DT and RF used entropy with a maximum tree depth of 17 and 20 levels, respectively. MLP also reached 99.8% Ac with 100 hidden layers. The linear models, SVM and LR, achieved 97.5% and 98.6% accuracy using a C parameter of 0.001 and 0.1, respectively.
The models yield the highest detection accuracy using the 2019 DDoS dataset. NB achieved 98.3% Ac while the rest obtained over 99%. RF performed best, reaching 99.97% Ac, followed closely by DT. The RF and DT models use 20 and 19 levels of tree depth, respectively. MLP also achieved 99.93% using 80 hidden layers with the tanh activation function and the LBFGS weight optimizer. The linear models, LR and SVM, reached the same accuracy of 99.79% using a C parameter value of 100 and 0.1, respectively.
These results agree with the data analysis in Section IV, where the 2018 and 2019 datasets show clear separability in the number of forward and backward packets, allowing them to have better Ac results compared to the others.
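For reference, the four measures reported in Table III can be computed from confusion-matrix counts as sketched below. Precision and recall follow the definitions given in Section III; accuracy and F1 use the standard textbook definitions, which is an assumption since the paper's own formulas for them are not reproduced here. The function name and the example counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute Ac, Pr, Rc, and F1 from confusion-matrix counts.

    tp: attack flows flagged as attacks     fp: benign flows flagged as attacks
    fn: attack flows flagged as benign      tn: benign flows flagged as benign
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return {"Ac": accuracy, "Pr": precision, "Rc": recall, "F1": f1}


# Hypothetical counts, chosen only to illustrate the computation.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name} = {value:.4f}")
```

With these example counts the accuracy is 0.925 and the precision is 0.9, which shows how the measures can diverge when false positives and false negatives are unbalanced.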

C. ROC curves
In addition to accuracy metrics, the AUC of each ML model is also reported in Figure 4, showing only the distinct values. Figure 4a shows the AUC of the ML models for the 2012 Botnet. It shows high AUC values for most of the ML models except for NB. It confirms the results of the accuracy metrics and also shows that RF is the most robust classifier. MLP, KNN, DT, and LR also achieved an AUC of 0.99.
For the 2017 dataset, Figure 4b shows that RF has achieved the largest AUC, which also achieved the highest detection accuracy for this dataset. DT, MLP, and KNN also achieved similar results. NB still has the lowest performance in detection confidence. Figure 4c also shows that RF achieved the highest AUC for the 2018 dataset, which also achieved the highest Ac of over 99% for this dataset. DT, KNN, and MLP have also achieved results similar to RF. Finally, Figure 4d shows the robustness of all ML models for the 2019 DDoS dataset, which achieved an AUC greater than 0.98.
The accuracy metrics and AUC evaluation show that RF and DT achieved outstanding results for DDoS attack detection. On average, DT outperformed RF only by 0.002% Ac while RF outperformed DT only by 0.002 on average AUC. Since RF is an ensemble method composed of multiple DTs that overcomes a single DT's tendency to overfit, we choose RF to represent our DDoS detection scheme.
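The AUC values discussed in this subsection can be obtained from the predicted class probabilities of a fitted classifier. The following is a minimal sketch on synthetic data; the Random Forest settings and the data are assumptions, not the tuned models of this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for benign (0) and DDoS (1) flows.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0)
clf.fit(X_train, y_train)

# Probability of the positive (attack) class is used as the ranking score.
scores = clf.predict_proba(X_test)[:, 1]

# ROC curve points (false positive rate vs. true positive rate) and its area.
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print(f"AUC = {auc:.3f}")
```

Because the AUC is computed from scores rather than hard labels, it reflects how well a model ranks attack flows above benign ones across all decision thresholds, which complements the single-threshold accuracy figures.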

D. Literature Comparison
The evaluation concludes with a comparison with current DL approaches that utilize the same datasets. As shown in Figure 5, our results using RF, called MLDDoS, are comparable to current approaches. For the 2012 dataset, DLDDoS [6] uses Deep Neural Networks (DNN), LUCID [7] and TR-IDS [17] use CNNs, while we show the best result of DeepDefense [10], which uses LSTM. The figure shows that our approach has similar detection performance to existing DL methods.
For the 2017 dataset, we have outperformed DLDDoS [6], LUCID [7], DeepGFL [14], and the DL approach from [16] using CNN+LSTM. For DeepGFL, we reported the result for detecting DoS Hulk in the figure. LUCID, DLDDoS, and MLDDoS have similar detection performance in the 2018 dataset. Finally, only DLDDoS and MLDDoS have utilized the most recent 2019 dataset at the time of writing, and both achieved high detection performance.

VI. CONCLUSION
In this paper, we show that DDoS attacks can be detected with high accuracy using only traditional ML algorithms. The search for the optimum hyperparameters also supported the development of different ML models to yield high detection performance. Our evaluation shows that RF and DT achieved the best performance, while NB performed worst since it does not have hyperparameters for tuning. We adopted RF as our DDoS detector since it is an ensemble technique composed of DTs that combats a single DT's overfitting tendencies. We also compared our results to existing approaches in the literature that utilize DL methods with the same datasets. This paper also provides a detailed analysis of the data and the ML models, which is missing in most ML studies. By analyzing the raw data, we found patterns in the attacks and identified the ML model parameters that are important to tune to increase detection performance.