Deep learning with focal loss approach for attacks classification

ABSTRACT


RELATED WORK
The Network Intrusion Detection System (NIDS) has been studied widely over the past several years. This section briefly discusses some published deep learning approaches, in particular those addressing imbalanced datasets. In 2019, Lin et al. [26] used deep learning for dynamic network anomaly detection. The synthetic minority oversampling technique (SMOTE) algorithm was applied experimentally to handle the imbalanced class problem in the CSE-CIC-IDS2018 dataset. As a classifier, a long short-term memory (LSTM)-based deep neural network combined with an attention mechanism (AM) was used to enhance performance. The SMOTE algorithm raises the proportion of minority-class samples and thereby optimizes the deep learning model. The model achieved its best results with an accuracy of 96.2% and a recall of 98% over 6 attack categories.
More recently, Zhang et al. [27] introduced a hybrid approach that combines SMOTE and Gaussian mixture model (GMM) based clustering to improve the minority classes' detection rate. The SMOTE and Gaussian mixture (SGM) processing was integrated with a convolutional neural network (CNN) for binary and multi-class classification. They claimed that the SGM model increases detection and reduces the time cost. The proposed method was evaluated against 5 class-imbalance techniques and 2 classification algorithms, and verified using the University of New South Wales-NB 2015 (UNSW-NB15) and the Canadian Institute for Cybersecurity intrusion detection system 2017 (CICIDS2017) datasets. The evaluation on the CICIDS2017 dataset shows that the method achieves an excellent detection rate of 99.85% in the 15-class classification. However, the detection rate for the web attack brute force class is still less than 50%, lower than with random oversampling (ROS) and SMOTE. On the UNSW-NB15 dataset, the detection rates for the binary and 10-class classifications reach 99.74% and 96.54%, respectively.
Abdulhammed et al. [23] used various techniques, such as over-sampling, under-sampling, spread subsample, and class balancer, to solve the imbalanced data problem for binary classes. Several classifiers, such as random forest (RF), DNN, voting, and variational auto-encoder, were used in the evaluation. The experiments on the CIDDS-001 dataset showed that DNN with the down-sampling method and class balancer is the most effective; the experimental results indicated that the class distribution has only a slight impact on the classification process. Furthermore, Abdulhammed et al. [36] proposed uniform distribution based balancing (UDBB) for imbalanced classes. To reduce features, the auto-encoder (AE) and principal component analysis (PCA) were evaluated with various classification methods. The simulation results on the original distribution of the CICIDS2017 dataset showed that PCA produces better accuracy than AE, at 99.6%. However, by implementing UDBB, the detection accuracy was reduced to 98.9%, although some attacks were detected better. In another experiment, Hua [29] used under-sampling and feature selection in pre-processing and proposed traffic classification using LightGBM on the CSE-CIC-IDS2018 dataset. The model used only 10 features, selected using random forest. They compared their model with various machine learning algorithms and CNN deep learning. The best overall accuracy obtained reached 98.37%. However, the model's behavior on the minority classes was not discussed.
Yang et al. [37] applied an improved conditional variational auto-encoder (ICVAE) with a deep neural network in NIDS. ICVAE training explores the relationship between data features and attack classes, aiming to balance the training dataset and improve detection performance on minority attacks. They used cross-entropy as the reconstruction loss function of the decoder. Their results showed that the best individual detection system obtains up to 89.08% and 85.97% multi-class classification accuracy on the UNSW-NB15 and NSL-KDD datasets, respectively. They claimed that ICVAE-DNN increases the detection rates of minority and unknown attacks. An unsupervised auto-encoder was also used by Li et al. [38] to overcome imbalance problems in NIDS. They used random forest to select significant features in the CSE-CIC-IDS2018 dataset and performed anomaly detection for each attack. However, the results of AE-IDS for web attacks (SQL injection, brute force web, and brute force-XSS) are still low and need to be optimized. Similarly, an unsupervised auto-encoder model was used by Zhao et al. [39], who introduced the semi-supervised discriminant auto-encoder (SSDA) to handle new attacks. Inspired by the existing research, this study uses a DAE to extract attack data. Furthermore, the focal loss is used to increase the detection rate of minority attacks. The CSE-CIC-IDS2018 dataset is used to test the model in multi-class classification and to compare the impact of the three loss functions on the imbalanced classes.

RESEARCH METHOD
This study improves an intrusion detection system's ability to detect minority attack classes using a deep learning model with a deep auto-encoder (DAE) pre-training process and fine-tuning using a DNN. The classification process used 3 scenarios, comprising categorical cross-entropy (CE) loss, focal loss (FL), and weighted categorical cross-entropy (WCE), as illustrated in Figure 1. The model was evaluated using CSE-CIC-IDS2018, which represents a recent attack dataset [40].

Dataset
The CSE-CIC-IDS2018 dataset consists of 80 features, including the label. The features were generated and extracted with CICFlowMeter [40], [41]. The designed scenario consists of 6 attack types: denial of service (DoS), distributed denial of service (DDoS), botnet, brute-force, web attacks, and infiltration, as presented in Table 1. They are grouped into 14 attack sub-classes. The total amount of data is 16,232,943 records, dominated by 83.07% benign traffic. This study used 10% of the data for training and 2.5% for testing. Benign traffic makes up about 51% of the sampled data, and the malicious classes were sampled in proportion to their original amounts. Table 1 details the composition of the training and testing data used. The infiltration attack is a stealthy attack that utilizes an internal network for illegal access. The characteristics of infiltration traffic are very close to benign traffic, which makes it difficult for a network IDS to detect [42]. Consequently, the infiltration attack was excluded from the experiment, since this study focuses on the effect of imbalanced class factors on detection accuracy.

Pre-processing
From the 80 features of the CSE-CIC-IDS2018 dataset [40], the timestamp feature was eliminated, and only 79 were used. The timestamp encodes information about when an attack occurred. It is quite essential for prediction in time series, but not for classification, where the model must recognize the attack based on its characteristics. The flow duration feature has more impact on identifying attacks such as DDoS and DoS, due to their rapid nature. In an attack detection and classification model, the time of occurrence is not needed, because in practice an attack can happen at any time. Therefore, this feature is eliminated, as in previous studies [26], [27], [38].
The first stage of dataset pre-processing is feature encoding, which transforms the data from categorical into numerical form. The feature encoding process, using one-hot encoding, changed the protocol and label features into numeric data. The process mapped the protocol feature to 3 instances: transmission control protocol (TCP), user datagram protocol (UDP), and Hop-by-Hop IPv6 (HOPOPT). The label feature became 6 attack categories after eliminating the infiltration class. Finally, 80 data features and 6 label features were obtained. The next stage was feature scaling, which brings all data values into a specified range; this step prevents features with large values from dominating the others. Feature scaling uses the same min-max scaling approach with range [0, 1] as in a previous study [43]. After pre-processing, the data is ready for the training and testing process.
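The two pre-processing steps above can be sketched in NumPy as follows. This is a minimal illustration with hypothetical toy values, not the paper's actual pipeline, which processes the full dataset with CICFlowMeter output:

```python
import numpy as np

def one_hot(values, categories):
    """One-hot encode a categorical column given its known category list."""
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        out[row, index[v]] = 1.0
    return out

def min_max_scale(x):
    """Scale each numeric column into [0, 1]; constant columns map to 0."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (x - lo) / span

# Toy protocol column mapped to the 3 instances named in the text.
protocols = ["TCP", "UDP", "TCP", "HOPOPT"]
proto_1h = one_hot(protocols, categories=["TCP", "UDP", "HOPOPT"])

# Toy numeric features: one varying column, one constant column.
features = np.array([[10.0, 3.0], [20.0, 3.0], [0.0, 3.0], [5.0, 3.0]])
scaled = min_max_scale(features)
```

In practice the same minimum and maximum computed on the training split would also be applied to the test split, so that no information leaks from testing data into the scaling.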

Deep learning architecture
The proposed intrusion detection system model is designed using a pre-training and fine-tuning process. The deep learning architecture uses automatic feature extraction with a deep auto-encoder (DAE) in the pre-training stage and a deep neural network (DNN) architecture in the fine-tuning stage. The DAE performs feature extraction through an encoding and a decoding phase. An auto-encoder generates an output x̂ that is reconstructed from the input x. In a single-layer auto-encoder with input vector x ∈ ℝ^d, the encoding function h for hidden layer l (l = 1) in forward propagation is notated as shown in (1), and the decoding function as shown in (2):

h = f(W_1 x + b_1)   (1)

x̂ = g(W_2 h + b_2)   (2)

where W_1 and W_2 are weight matrices, b_1 and b_2 are bias vectors, x is an input vector from the dataset, and f(.) and g(.) are the activation functions used in the hidden layer. The experiments used several variants of the ReLU activation function, such as SeLU, PReLU, ELU, and Leaky ReLU, in the hidden layers, and sigmoid in the last layer of the auto-encoder.
A deep auto-encoder (DAE) is the development of a single AE with a higher number of layers, in which the encoder and decoder are compositions of the functions f and g, respectively. The proposed architecture uses 7 hidden layers, with the output reconstruction x̂ = g_1(g_2(g_3(f_3(f_2(f_1(x)))))). The reconstruction of the input into the output x̂ uses the MSE loss function shown in (3):

L_MSE(x, x̂) = (1/n) Σ_{i=1}^{n} (x_i − x̂_i)²   (3)

The backpropagation process drives the loss value close to zero. In the fine-tuning stage, the output ŷ (predicted target label) approaches y (target label vector) by using the softmax(.) activation function in the last layer. For the whole training dataset (x_i, y_i), the objective to minimize is:

J = (1/N) Σ_{i=1}^{N} ℒ(y_i, ŷ_i)

where ℒ is the loss function and N is the number of training samples. One focus of this study is to compare the loss function ℒ using cross-entropy, weighted cross-entropy, and focal loss.
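The single-layer encode-decode pass and the MSE reconstruction loss in (1)-(3) can be sketched in NumPy as follows. The layer sizes and random inputs here are illustrative assumptions (80 input features, a hypothetical 32-unit bottleneck), not the paper's actual architecture or trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse(x, x_hat):
    """Reconstruction loss of (3), averaged over all entries."""
    return float(np.mean((x - x_hat) ** 2))

# Toy dimensions: 80 scaled flow features compressed to 32 latent units.
n_in, n_hidden = 80, 32
W1, b1 = rng.normal(scale=0.05, size=(n_in, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(scale=0.05, size=(n_hidden, n_in)), np.zeros(n_in)

x = rng.random((4, n_in))        # batch of 4 records scaled into [0, 1]
h = relu(x @ W1 + b1)            # encoder, eq. (1): h = f(W1 x + b1)
x_hat = sigmoid(h @ W2 + b2)     # decoder, eq. (2): x_hat = g(W2 h + b2)
loss = mse(x, x_hat)             # eq. (3), driven toward zero by training
```

A deep auto-encoder stacks several such encoder layers before the bottleneck and mirrors them in the decoder, exactly as the composition x̂ = g_1(g_2(g_3(f_3(f_2(f_1(x)))))) describes.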
The DNN model carried out the training process for classifying multi-class attacks. A hyperparameter tuning process was performed to obtain the best deep learning model by examining the detection rate of the attack classification. The tuning process tried various model variants based on the number of hidden layers, number of nodes, learning rate, batch size, activation function, and kernel initialization.

Loss function
In the multi-class case with number of classes C (C > 2), the categorical cross-entropy (CE) loss function is:

L_CE = − Σ_{c=1}^{C} y_c log(ŷ_c)

where C is the number of classes, y_c indicates the ground-truth class, and ŷ_c ∈ [0, 1] is the model's predicted probability for class c. Here y_c = 1 if c is the actual label of the sample; otherwise, it equals 0.
For imbalanced class cases, the CE loss function is modified by adding a weighting factor α [44] to obtain the weighted CE shown in (7):

L_WCE = − Σ_{c=1}^{C} α_c y_c log(ŷ_c)   (7)

where α_c is the weight factor for class c. The deficiency of CE loss is that the many easy samples contribute a significant accumulation of loss that overwhelms the rare class [33], [34]. Therefore, to resolve the extreme imbalance issue in multi-class attack classification, the third scenario takes advantage of the focal loss function proposed by Lin et al. [33]. The focal loss function does not give the same weight to all training data; instead, it reduces the weight of well-classified data, so that training is emphasized on data that are difficult to classify, as shown in (8):

L_FL = − Σ_{c=1}^{C} α_c (1 − ŷ_c)^γ y_c log(ŷ_c)   (8)

with γ as a modulating factor that reduces the weight of well-classified classes. When γ = 0, the loss is equal to cross-entropy; therefore γ ≥ 0 is set to evaluate the effect of down-weighting well-classified samples. The parameter α is the weight that balances the focal loss and increases the accuracy for the imbalanced classes.
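The two loss variants can be written in NumPy as below. This is a minimal sketch for intuition only, not the paper's Keras implementation; it uses a single scalar α applied to the true-class term, as in Lin et al.'s binary formulation, rather than per-class weights:

```python
import numpy as np

def categorical_ce(y_true, y_pred, eps=1e-7):
    """Categorical cross-entropy, averaged over the batch."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    return float(-np.mean(np.sum(y_true * np.log(p), axis=1)))

def focal_loss(y_true, y_pred, gamma=1.0, alpha=0.5, eps=1e-7):
    """Multi-class focal loss: -alpha * (1 - p_t)^gamma * log(p_t)."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    p_t = np.sum(y_true * p, axis=1)   # predicted probability of true class
    return float(np.mean(-alpha * (1.0 - p_t) ** gamma * np.log(p_t)))

y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
easy = np.array([[0.95, 0.03, 0.02], [0.05, 0.90, 0.05]])  # well classified
hard = np.array([[0.40, 0.35, 0.25], [0.30, 0.45, 0.25]])  # hard samples

# With gamma = 0 and alpha = 1, focal loss reduces to plain cross-entropy.
assert np.isclose(focal_loss(y_true, easy, gamma=0.0, alpha=1.0),
                  categorical_ce(y_true, easy))

# The modulating factor discounts easy samples far more than hard ones.
ratio_easy = focal_loss(y_true, easy) / categorical_ce(y_true, easy)
ratio_hard = focal_loss(y_true, hard) / categorical_ce(y_true, hard)
```

The ratio comparison makes the mechanism of (8) concrete: well-classified samples keep only a small fraction of their cross-entropy loss, while hard (typically minority-class) samples keep most of theirs.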

Experimental setup and performance metrics
The experiments were run on a cloud machine in the Google Colaboratory platform. The model was developed in the Python programming language, with computation using the TensorFlow-GPU backend of Keras [45], a deep learning framework. The hyperparameter tuning process used the Talos library [46]. The evaluation used accuracy, sensitivity, and specificity to measure the performance of the proposed model. Accuracy assesses the model's ability to classify attacks correctly; however, in imbalanced datasets the predicted results are dominated by the large classes, so it is also necessary to examine the model's sensitivity and specificity [47]. Sensitivity shows how precisely the model detects an attack, and specificity shows the probability that the model does not mistake other traffic for an attack.
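The three metrics can be computed per class from a confusion matrix, as in this NumPy sketch with hypothetical toy labels (in the paper they are computed over the CSE-CIC-IDS2018 test split):

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes):
    """Overall accuracy plus per-class sensitivity (recall) and specificity."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                      # rows: actual, columns: predicted
    total = cm.sum()
    sens, spec = [], []
    for c in range(n_classes):
        tp = cm[c, c]
        fn = cm[c].sum() - tp              # class-c samples predicted as other
        fp = cm[:, c].sum() - tp           # other samples predicted as class c
        tn = total - tp - fn - fp
        sens.append(tp / (tp + fn) if tp + fn else 0.0)
        spec.append(tn / (tn + fp) if tn + fp else 0.0)
    accuracy = np.trace(cm) / total
    return accuracy, np.array(sens), np.array(spec)

# Toy 3-class example: 6 samples, one class-0 sample misclassified as class 1.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
acc, sens, spec = per_class_metrics(y_true, y_pred, 3)
```

Reporting sensitivity and specificity per class is what exposes minority-class behavior that overall accuracy hides, which is the point made in the paragraph above.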

RESULTS AND ANALYSIS

Hyper-parameter tuning
The hyper-parameter tuning process is crucial for obtaining the network architecture (number of neurons and layers) and the most appropriate hyper-parameter values for the deep learning model. In the initial phase, tuning was performed over the number of hidden layers and nodes, batch size, learning rate, activation function, and kernel initializer for the deep learning model (DAE-DNN). These experiments used categorical cross-entropy as the loss function. The best architecture used the lecun_uniform kernel initializer and the Leaky ReLU activation function, selected through the tuning process over various activation functions. The batch size for the best model is 256, obtained through tuning over batch sizes of 32, 64, and 256.
After obtaining the best model with the CE loss function as the basis of comparison, tuning of the focal loss parameters was performed. Two parameters were tuned in the multi-class focal loss: γ, which controls the modulating factor, and α, the weight factor for the class. The focal loss parameters were tuned over the ranges γ ∈ [0, 5] and α ∈ [0, 1], as recommended in [33].
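Such a joint search over γ and α can be sketched as a simple grid, as below. The `evaluate` function here is a hypothetical stand-in for the Talos-based training-and-validation run; its toy surrogate is shaped only to peak at the (γ = 1, α = 0.5) optimum reported later, not to reproduce real accuracies:

```python
from itertools import product

def evaluate(gamma, alpha):
    """Hypothetical stand-in for training the DAE-DNN with one (gamma, alpha)
    pair and returning validation accuracy; toy surrogate for illustration."""
    return 0.98 - 0.01 * abs(gamma - 1.0) - 0.02 * abs(alpha - 0.5)

# Grid over the recommended ranges gamma in [0, 5] and alpha in [0, 1].
gammas = [0, 0.5, 1, 2, 5]
alphas = [0.1, 0.25, 0.5, 0.75, 1.0]
best = max(product(gammas, alphas), key=lambda ga: evaluate(*ga))
```

Because the two parameters interact, each (γ, α) pair must be trained and scored together rather than tuning one parameter at a time.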

Results of focal loss variants
The effectiveness of the focal loss function in attack classification was measured by taking the best tuning result for the focal loss parameters. The training process was performed with the number of epochs set to 30. Table 2 summarizes the overall hyper-parameter tuning results for the focal loss function with various values of γ and α. The weight α assigned to the rare class has a stable range; however, it interacts with γ, making it necessary to select the two parameters together, as shown in Tables 2 (a) and 2 (b). In general, the best α decreased slightly as γ increased; in this case, α = 0.5 works best when γ = 1. The best results are an accuracy of 98.223%, a sensitivity of 98.223%, and a specificity of 99.814% over all attack classes.
The proposed model reached its highest accuracy at γ = 1. This is reasonable because the modulating factor minimizes the loss contribution of dominant-class samples that are easily classified. As γ increases, the modulating factor (1 − ŷ)^γ of correctly classified samples decreases, which relatively increases the weight of the minority-class samples that are difficult to classify. As a result, at larger γ the model focuses so strongly on the difficult samples that overall classification accuracy is lowered.

Performance and comparison
The results of configuring NIDS with focal loss (NIDS-FL) were evaluated by comparing them with cross-entropy loss (NIDS-CE) and weighted cross-entropy loss (NIDS-WCE). The same network architecture (number of hidden layers and nodes) and hyper-parameter values were used. The NIDS-CE and NIDS-WCE configurations do not use the γ and α parameters. The weighted cross-entropy used a balanced mode, meaning the class weights are set as if the smaller classes were replicated until the number of samples in the minority and larger classes were equal.
In the first stage, training was conducted with 30 epochs. After the training process, an evaluation was performed using the testing data. Figure 2 shows all the metric comparisons for the loss function variants. For 30 epochs, almost all of the models' overall performance using the focal loss function was better than with CE and WCE: the accuracy is 98.23%, precision 98.34%, recall (sensitivity) 98.23%, and specificity 98.25%, as shown in Figure 2 (a). This research used multi-class classification for BoT, Brute Force, DDoS, DoS, and web attacks. The results showed that the NIDS's performance using focal loss was higher than with cross-entropy and weighted cross-entropy, as presented in Figure 2 (a). The detailed results for each class can be observed in Table 3. The proposed model detected BoT and DDoS attacks excellently, with an overall performance above 99.9%. The overall performance for DoS attacks was higher than 90%, while for Brute Force, recall attained 94% with a precision of only 84%; the file transfer protocol (FTP) brute force attack has characteristics that resemble slow-hypertext transfer protocol (HTTP) attack traffic.

To investigate the performance of the loss functions on the imbalanced dataset, this study examined the models' efficiency in classifying types of attacks, especially the minority classes. The web attacks are a minority class that amounts to only 0.05% of the total training data in Table 1. According to Table 3 and Figure 2 (b), NIDS-FL outperforms the other methods in classifying web attacks. There is a significant increase in precision, recall, and F1-score compared to the models that use CE and WCE losses. The recall (sensitivity) reaches 74.14%, an increase of approximately 7% over CE as the baseline loss function. The model that uses WCE on the minority classes at 30 epochs does not have excellent sensitivity, although it has a good precision value.

TELKOMNIKA Telecommun Comput El Control
As shown in Figure 2 (c), the loss value of the model with the FL function is smaller than with CE and WCE. The number of misclassified attacks was also compared: the error count of NIDS-FL in Figure 2 (d) is lower than the other models, at only 6,965. The superiority of FL shows the effect of the modulating and weight factors in focal loss: by selecting the most appropriate modulating factor and weight for the imbalanced classes, the loss and misclassification values can be minimized, especially for the minority classes. In the next phase, the models were evaluated with the number of epochs increased to 200. As shown by the loss in Figure 3 (a), the proposed deep learning model using FL with the previously selected hyper-parameters converges faster than with CE and WCE. Figure 3 (b) shows that the network using FL stabilizes after around 30 epochs, in contrast to 100 epochs and 80 epochs for CE loss and WCE loss, respectively. However, as the number of epochs increases, the models using the cross-entropy function are more stable, and their performance tends to keep improving compared to the models that use the focal loss and weighted cross-entropy functions. This improvement is reasonable because the hyper-parameter tuning process was performed on a model with a cross-entropy loss function, which produced the hyper-parameters most appropriate for that model. The resulting focal loss curve tends to fluctuate due to the factor γ; because of the oscillating curve, the model has a slightly higher validation value when the curve reaches its peak. Table 4 shows the best comparison of results for the 3 models after 200 epochs. The overall performance results are almost the same in terms of accuracy, precision, recall, F1-score, and specificity; for instance, the recall values differ by less than 0.01%.
For the web attack minority class, the overall performance after 200 epochs for the model using the focal loss function amounted to 97.76% precision, 75.29% recall, and 85.07% F1-score. Models with a deep auto-encoder pre-training process have advantages in overcoming the imbalance problem: increasing the number of layers in the deep auto-encoder network raises the number of complicated features learned from the original data. The transfer layer with 4 deep encoding layers in the DAE (7 hidden layers) improves the classification results of the DNN fine-tuning process, despite the extreme class imbalance. Among the 3 models, even though the web-attack training data is only 0.05% of the total data, the sensitivity reaches 74.14% in the model that uses the focal loss function. In the model that uses cross-entropy, the results are acceptable, with a sensitivity reaching almost 68%; in the weighted cross-entropy model, the sensitivity is only 41.38%. The weights and biases in the DNN network are initialized from the values transferred from the encoding layers of the DAE. The weighted cross-entropy, which uses a weight factor inversely proportional to the class frequency, affects training in a way that is unsuitable for the positive samples at the beginning of the learning process.
The final layer of the deep neural network uses the softmax loss function. In the DAE optimization, the weights are initialized using lecun_uniform, so the output of each layer is uniformly distributed, as is the value of the output layer for the softmax function. Consequently, the minority positive samples are less influential in the initial training stage. Therefore, in deep learning using focal loss, the last layer's bias term is initialized to a non-zero value [33], so that more focus is directed at the positive examples in the early training stage and the whole training process is likely to be effective. The behavior of the cross-entropy and weighted cross-entropy losses likewise depends on the weight initialization, and the different loss functions produced varied predictions across the models. At 30 epochs, the model using FL achieved slightly better accuracy than the one using CE loss. Another advantage of the FL function is that the cost loss value is near zero; the loss of the cost function represents how well the model learns with respect to the training examples.
In general, without modifying the distribution of the CSE-CIC-IDS2018 dataset, the model results were satisfactory. These results were better than previous studies using deep learning and resampling techniques on the same dataset, as shown in Table 5. The overall accuracy of 98.27% was better than the deep learning model developed by Lin et al. [26], who achieved an accuracy of 96.2% using the SMOTE algorithm for the imbalanced classes and an LSTM-based DNN model combined with an attention mechanism (AM). Their study also compared various machine learning techniques on over-sampled datasets, and none of those methods performs significantly better than the model proposed in this study. For the web attack classes as minority samples, they claimed their models reached a recall of 98%, better than the model in this study at 75.29%; however, their precision and F1-score for the web attack class are only 30%. In contrast, the precision and F1-score for the web attack classes with focal loss in this study are 97.76% and 85.07%, respectively, after 200 epochs of training.
This study is thus superior in precision and F1-score compared with the previous research detailed in Table 5. Hua [29] used various machine learning and deep learning methods, with under-sampling and feature selection techniques for pre-processing. The recall of their LightGBM model was around 0.1% higher than the model proposed in this study; however, the precision of this study's model is 0.24% higher than theirs. Unfortunately, their research did not explain the model's effect on web attacks as minority classes. Zhao et al. [39] with the semi-supervised discriminant auto-encoder (SSDA), and Ferrag et al. [48] with a deep auto-encoder, utilized unsupervised learning without modifying the data distribution of the dataset. However, with the hyper-parameter tuning performed in this study, the proposed deep learning model resulted in better detection. The proposed model works better with a smaller data proportion, and deep learning using focal loss solves much of the imbalanced-class problem.
This study has successfully demonstrated the significance of deep learning using a deep auto-encoder as a feature reduction technique with focal loss functions, providing better results across several performance metrics for IDS, especially on imbalanced classes. However, there are certain limitations in this study. In the evaluation process, the infiltration attack class was eliminated in the initial test without modifying the dataset, because it often caused misclassification with the benign class. Future studies should develop a deep learning model with a two-stage classification that detects infiltration attacks. It is also believed that the focal loss function can mitigate the imbalanced-class problem in other deep learning algorithms; subsequent studies should therefore evaluate the influence of the focal loss function with different deep learning algorithms. This study used the CSE-CIC-IDS2018 dataset for the training and testing processes; future research should cover an anomaly-based online intrusion detection system.

CONCLUSION
This study presented a new deep learning model to address the problem of classifying multi-class attacks. The network architecture was partitioned into automatic feature extraction with a deep auto-encoder using 7 hidden layers, and a classifier with a fully connected deep neural network. The focal loss function was adjusted to the proposed model on an imbalanced dataset, and the proposed model converged faster with focal loss than with cross-entropy loss and weighted cross-entropy loss. Concerning the web attack classes as minority samples, the evaluation results on CSE-CIC-IDS2018 show that the deep learning method with focal loss is a high-quality classifier with 98.38% precision, 98.27% sensitivity, and 99.82% specificity. Future studies could build on this research in several aspects. First, the use of the focal loss function on imbalanced datasets should be evaluated across various datasets. In this research, the infiltration attack class, which behaves in the same way as benign traffic, was eliminated; future studies should improve the deep learning model with a two-stage approach to filter such infiltration attacks.