Online traffic classification for malicious flows using efficient machine learning techniques

The rapid growth of network technology has brought various network problems, and attacks are becoming more sophisticated than defenses. In this paper, we propose traffic classification using a machine learning technique with statistical flow features, namely the five-tuple, as the training dataset. A rule-based system, Snort, is used to identify the most harmful data packets and to reduce the training set to a manageable size. The performance of a training dataset that contains malicious flows of all priorities is compared with that of a dataset that contains only priority 1 malicious flows. The performance of different machine learning (ML) algorithms is analyzed in terms of accuracy and efficiency. Results show that Naïve Bayes achieved an accuracy of up to 99.82% on the all-priorities training dataset and 99.92% on the extracted priority 1 training dataset, with the latter model built in 0.06 seconds, and it was therefore chosen to classify traffic in the real-time process. It is demonstrated that taking just the five-tuple information as features, and using Snort alert information to extract only the important flows and reduce the size of the dataset, is sufficient to supply a classifier with high efficiency and accuracy that can sustain the safety of the network.


INTRODUCTION
According to the 2019 Webroot Threat Report, 93.6% of malware was seen on only a single computer. This is the highest annual rate ever observed, although the figure has exceeded 90% since 2014. More than two-thirds of IT security professionals expect a successful cyber-attack in 2020 [1], [2]. Numerous traffic classification techniques have been used, such as port-based, payload-based, statistical and behavioural approaches, with the common aim of classifying data packets or flows effectively. However, network attack tactics have gradually become more complex and can hardly be detected [3]. For example, the growing tendency of application developers to evade detection leaves the field of network traffic classification open for further research. In short, this paper aims to propose a solution for real-time network traffic classification that addresses the current research gap and improves safety in the cyber world.
Network traffic classification [4], [5] identifies the flows in network traffic and assigns each flow to a class according to feature information such as port number [6], [7], payload [8]-[11], [12], traffic behaviours [13]-[16] and flow information [4], [8], [9], [12], [17]. In this paper, an efficient flow-information-based classification of suspicious network traffic flows is introduced. Here, 'efficient' means that an additional rule-based system, Snort, is used to reduce the size of the training data in order to improve detection accuracy and efficiency. The evaluation starts with offline traffic classification. The computational performance of several machine learning (ML) algorithms is compared, in terms of accuracy and efficiency, using the reduced training dataset that consists of only the most severe malicious data. This comparison is done to choose the best classifier, which is later used for online traffic classification. The paper is organized as follows. Section 2 describes the literature review. Section 3 presents the methodology. Section 4 provides the results and performance of the proposed idea. The conclusion is given in Section 5.
TELKOMNIKA Telecommun Comput El Control  Online traffic classification for malicious flows using efficient machine learning techniques (YingYenn Chan) 1397

LITERATURE REVIEW
2.1. Overview
Several common methods have been used for traffic classification. The port-based approach has been widely used and is considered the quickest and least resource-consuming method for classifying network traffic packets. Some applications, such as WWW and email, have fixed (or traditionally used) port numbers, so it is easy to detect the traffic that belongs to them. However, other applications, such as peer-to-peer (P2P), games and multimedia, do not have a fixed port number. Instead, they use the port numbers of other widely used applications, such as hypertext transfer protocol (HTTP) or file transfer protocol (FTP) connections [18]. This approach can therefore yield poor outcomes, because attackers can easily change the port number periodically and pretend to be a normal application [6].
The payload-based approach examines the contents of packets to identify the type of traffic. This method identifies all signatures found in the application-layer payload. Once the algorithm has collected a set of unique payload signatures, it performs well for most types of traffic, such as HTTP, FTP and simple mail transfer protocol (SMTP) [6]. However, the effectiveness of these methods has been greatly reduced because customers use encrypted flows, while governments have decided to ban third parties from examining payloads for safety purposes. In addition, inspecting the syntax of packet payloads can impose a heavy operational load and delay [9].
The heuristic approach works by checking the suspicious behaviours of targeted files, monitoring system files (such as system documents and services), and observing system processes and application programming interfaces. There are two types of detection, static and dynamic, and the important difference between them is whether the detected documents are executed in order to check for suspicious operations. However, this method cannot guarantee to find all Trojans, and installing the system on every computer in the network is also an issue. If any computer in the network is not covered by this method, malware could spread from that unprotected computer to the others until the entire network is infected [15].
The statistical approach using ML algorithms relies on classifying statistical information such as the frequency and length of bytes, the size of packets and the inter-arrival time of transmitted packets. This technique is fast and is capable of detecting and analysing the class of unknown applications. With complete statistical information about the inspected packets, classifying the hidden protocol becomes easy. This approach has become popular because encrypting application traffic is a growing trend that challenges classification tools that previously claimed high accuracy in application classification. However, the accuracy might drop if the training data is insufficient, as a large amount of data is needed for the learning process [4], [9]. In Weka [19], various ML algorithms are tested and compared. Naïve Bayes [20] is a useful tool for knowledge representation as well as reasoning. It can calculate the probability of a new subset of variables when another subset of random variables, known as the evidence variables, is provided. Naïve Bayes is a simple probabilistic classifier that follows Bayes' theorem, as shown in (1):
$$ P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)} \tag{1} $$
where P(c|x) is the posterior probability, P(x|c) is the probability of the predictor given the class, P(c) is the class prior probability and P(x) is the predictor prior probability. A benefit of Naïve Bayes is that the parameters needed for classification can be estimated from a small amount of training data.
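To make the use of Bayes' theorem concrete, the following is a minimal sketch of a categorical Naïve Bayes classifier. It is an illustration only and not Weka's implementation: treating every feature as a categorical value, the Laplace smoothing constant `alpha`, and the toy flows in the usage note are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-feature value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    vocab = defaultdict(set)             # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab


def posterior(row, model, alpha=1.0):
    """Return P(c|x) for every class c (alpha is an assumed smoothing constant)."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(c)
        for i, value in enumerate(row):
            # Smoothed likelihood P(x_i | c); features are assumed independent.
            numerator = value_counts[(i, c)][value] + alpha
            denominator = n_c + alpha * len(vocab[i])
            score *= numerator / denominator
        scores[c] = score  # proportional to P(x|c) * P(c)
    evidence = sum(scores.values())  # P(x), the normalising constant
    return {c: s / evidence for c, s in scores.items()}
```

For instance, training on five made-up flows described only by (destination port, protocol number), `train_naive_bayes([("80", "6"), ("443", "6"), ("4444", "6"), ("4444", "17"), ("80", "6")], ["normal", "normal", "malware", "malware", "normal"])`, and then calling `posterior(("4444", "6"), model)` gives the malware class the larger posterior, because port 4444 appears only in the malware examples.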

Related works
Researchers have proposed many ways to tackle the issue using the methods presented in Section 2.1. For example, Shim et al. [11] proposed a method that generates payload signatures automatically. This can cut down the time spent generating signatures manually. However, the application traffic used as input has to be collected manually. Moreover, signature extraction is difficult even when the traffic is gathered using one single function, and the traffic can easily become mixed with other features.
Galal et al. [14] proposed a behaviour-based feature model that helps to define suspicious activities exhibited by malware. The major difficulty of this method is the runtime overhead. Bekerman et al. [15] extracted 972 behavioural features across different protocols and network layers, but their approach might classify data packets wrongly, causing false positives. Once attacks change their behaviour, the classifiers no longer work well and need to be retrained. Finamore et al. [12] integrated flow and payload statistical features in a clustering approach for classifying unknown traffic. However, the number of clusters has to be very high to attain good classification accuracy, which means that a large dataset is required even for a small application [17]. Furthermore, the quality of their features may be limited by encrypted application-layer protocols. Zhang et al. [17] also proposed an approach that can identify anonymous flows created by unknown applications and uses the associated information in the actual network traffic to enhance the classification result. However, this method often leads to a high false detection rate.
Therefore, in view of these research gaps, the solution proposed in this paper is malicious traffic classification using an ML method and statistical features, i.e. the five-tuple, further assisted by Snort alerts for efficient classification. Unlike the similar existing works mentioned above [12], [17], which use traffic of all priorities, the malicious traffic priorities reported in the Snort alerts are used to augment the ML method. They are used to extract only the important features and avoid redundant data, while at the same time speeding up the training process and improving the accuracy of the selected Naïve Bayes classifier.

RESEARCH METHOD
The project was executed on a Dell Inspiron 14 laptop running Windows 10 Enterprise 64-bit, version 1809 (build 17761.1158), with an Intel® Core™ i5-5200U CPU @ 2.20GHz and 4.00GB of installed RAM. This device is used as the main laptop throughout the project implementation. During online classification, a second laptop is added to transmit real-time data packets to the main laptop. The second laptop (laptop B) is a Dell laptop running Windows 10 Pro 64-bit, with an Intel® Core™ i5-4310U CPU @ 2.00GHz and the same 4.00GB of RAM. Figure 1 shows the overall framework of this paper. There are two phases, offline and online classification. Several stages are needed in phase 1 to reach optimum accuracy and performance, as a stable foundation for the online classification in phase 2. As illustrated in Figure 1, these stages are flow reconstruction, feature extraction, Snort filtering of high-severity flows, classifier model generation and comparison of the results of different classifier algorithms. The key evaluators are the accuracy and efficiency of the model. Accuracy measures how correctly flows are classified, while efficiency is the processing time needed to build the model. The most suitable classifier model is selected according to the best accuracy and efficiency. Phase 2 focuses on online classification, which implements steps similar to those in phase 1 in order to obtain the same performance as offline classification and so prove its effectiveness.

Five-tuples as input features
A 5-tuple can be defined as a set of five values that identify a transmission control protocol/internet protocol (TCP/IP) connection. It consists of the source IP address, the destination IP address, the source port number, the destination port number and the protocol used. Each protocol type has its own ID number; for example, TCP has protocol number 6 and UDP has protocol number 17. In our case, the protocol ID replaces the protocol type in the dataset. The features are arranged as follows: destination IP, destination port, source IP, source port, transport layer protocol and type of flow. Below is the example of the traffic flows. 192
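The feature layout described above can be sketched as a small record type. This is a hypothetical illustration rather than the paper's own data format: the field names, the `make_record` and `to_csv` helpers, and the documentation-range IP addresses in the usage note are assumptions; only the feature order and the protocol numbers 6 (TCP) and 17 (UDP) come from the text.

```python
import csv
import io
from typing import NamedTuple

# IANA protocol numbers for the two transport protocols named in the text.
PROTOCOL_NUMBERS = {"TCP": 6, "UDP": 17}


class FlowRecord(NamedTuple):
    """One flow, in the feature order given in the text (field names assumed)."""
    dst_ip: str
    dst_port: int
    src_ip: str
    src_port: int
    protocol: int   # protocol ID replaces the protocol name
    flow_type: str  # class label, e.g. "malware" or "normal"


def make_record(dst_ip, dst_port, src_ip, src_port, protocol_name, flow_type):
    """Build a record, mapping the protocol name to its ID number."""
    protocol_id = PROTOCOL_NUMBERS[protocol_name.upper()]
    return FlowRecord(dst_ip, int(dst_port), src_ip, int(src_port),
                      protocol_id, flow_type)


def to_csv(records):
    """Serialise records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(FlowRecord._fields)
    writer.writerows(records)
    return buffer.getvalue()
```

As a usage example with made-up values, `to_csv([make_record("203.0.113.7", 80, "198.51.100.23", 49152, "tcp", "malware")])` produces a header line followed by one row in which the protocol column holds 6.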

Offline classification process
Offline classification in phase 1 begins when a large amount of offline malware data is downloaded from reliable websites [21]-[23]. As the files come from different sources, they must be merged in Wireshark. After the files are compiled into one, the data packets are transferred to Caploader [24] to be reconstructed into flows [18], [25], so that the five-tuples of the data packets can be extracted in csv format using Caploader. External assistance from Snort is also required, because Snort can inspect the injected network packets for possible malicious traffic using predefined rules and write to the alert file whenever a packet signature matches one of the rules. Snort is run in intrusion detection system (IDS) mode to scan all incoming packets, classify them by priority, ranging from 1 to 4, and filter out the non-malicious data packets. It then generates a file in pcap format and a log text file. The log text file is the outcome of the Snort analysis, in which all malicious packets are listed one by one together with their priority levels and other packet information, while the pcap file consists of only the malicious data packets. As shown in Figure 2, and using the Snort alert file shown in Figure 3, the five-tuples of priority 1 flows are filtered out and arranged in a csv file along with the list of clean, normal data features, and labels are added to form the dataset that is used to create the generative model afterwards. This step reduces the training set size and forms a good, compact training dataset. Flows with priority 1 are the most harmful malicious packets according to the Snort rules, so the extracted priority 1 flows are all highly malicious.
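The priority filtering step can be sketched as follows. This is a hedged illustration, not the authors' tooling: it assumes the alert file uses Snort's one-line "fast" alert layout, in which each alert ends with `[Priority: N] {PROTO} src_ip:src_port -> dst_ip:dst_port`. The function name `priority_flows` and the two sample alerts (rule IDs, messages and addresses) are made up for the example.

```python
import re

PROTOCOL_NUMBERS = {"TCP": 6, "UDP": 17}

# Assumed layout of the tail of a Snort "fast" alert line, for example:
# [Priority: 1] {TCP} 198.51.100.23:49152 -> 203.0.113.7:4444
ALERT_TAIL = re.compile(
    r"\[Priority:\s*(?P<priority>\d+)\]\s*"
    r"\{(?P<proto>[A-Za-z0-9]+)\}\s*"
    r"(?P<src_ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<src_port>\d+)\s*->\s*"
    r"(?P<dst_ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<dst_port>\d+)"
)


def priority_flows(alert_lines, wanted_priority=1):
    """Return the set of five-tuples whose alerts carry the wanted priority."""
    flows = set()
    for line in alert_lines:
        match = ALERT_TAIL.search(line)
        if match is None:
            continue  # not an alert line, or no ports (e.g. ICMP)
        if int(match["priority"]) != wanted_priority:
            continue
        protocol_id = PROTOCOL_NUMBERS.get(match["proto"].upper())
        if protocol_id is None:
            continue  # protocol outside the TCP/UDP mapping
        # Feature order: destination IP, destination port, source IP,
        # source port, protocol ID.
        flows.add((match["dst_ip"], int(match["dst_port"]),
                   match["src_ip"], int(match["src_port"]), protocol_id))
    return flows


# Hypothetical alert lines used only to exercise the parser.
SAMPLE_ALERTS = [
    "01/15-10:22:31.123456  [**] [1:1000001:1] HYPOTHETICAL trojan rule [**] "
    "[Classification: A Network Trojan was detected] [Priority: 1] "
    "{TCP} 198.51.100.23:49152 -> 203.0.113.7:4444",
    "01/15-10:22:32.654321  [**] [1:1000002:1] HYPOTHETICAL policy rule [**] "
    "[Classification: Potentially Bad Traffic] [Priority: 2] "
    "{UDP} 198.51.100.24:53000 -> 203.0.113.8:53",
]
```

Calling `priority_flows(SAMPLE_ALERTS)` keeps only the first alert's five-tuple. Returning a set also removes duplicate alerts for the same flow, which is one simple way to keep the training set compact.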
The statistics of the training set are shown in Table 1. To prove the effectiveness of using only the priority 1 dataset, a complete all-priorities dataset is also prepared for comparison, as shown in Table 1. The csv files of the datasets are converted to the Arff format so that Weka [19] can analyse them further with different algorithms, namely Bayes Net [20], [26], Naïve Bayes [27], [28], Random Tree [29], J48 [30], [31] and ZeroR [32], [33], to find the best efficiency and accuracy. In this paper, five algorithms that are suitable for text classification [34] are used and analyzed. After the best algorithm is chosen, a classifier model is generated and saved, which marks the end of phase 1.
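The csv to Arff conversion can be sketched as below. This is an assumption-laden illustration, since the paper does not state how the attributes were declared: here every column is declared as a nominal attribute listing the values observed in the data, and values are assumed to contain no spaces, commas or quotes. The function name `csv_to_arff` and the relation name are hypothetical.

```python
import csv
import io


def csv_to_arff(csv_text, relation="flows"):
    """Convert CSV text with a header row into ARFF text.

    Assumption: every column is treated as nominal, and values need no quoting.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation %s" % relation, ""]
    for column, name in enumerate(header):
        # A nominal attribute lists every value it may take.
        values = sorted({row[column] for row in data})
        lines.append("@attribute %s {%s}" % (name, ",".join(values)))
    lines.append("")
    lines.append("@data")
    lines.extend(",".join(row) for row in data)
    return "\n".join(lines) + "\n"
```

Declaring ports and addresses as nominal keeps the sketch short, but it means a value unseen during training cannot be represented in a test file that reuses the same header. Declaring ports as numeric attributes would be an alternative design with different classifier behaviour.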

Online classification process
Online classification follows in phase 2, in which command lines with different roles are included in a C++ program. The processes in the program are exactly the same as in the offline classification process, but run in real time, as shown in Algorithm 1. The traffic is transmitted from laptop B to the main laptop through PlayCap, and Wireshark, installed on the main laptop, captures and analyzes the flows, which are then classified into malicious and normal flows according to the similarity of their features to the generative model. The flowchart in Figure 4 illustrates the whole process of online traffic classification, and the statistics of the captured flows are shown in Table 2. Because online classification needs incoming real-time data packets, laptop B is prepared to serve as a router that sends data packets to the main laptop. PlayCap is installed on laptop B and loaded with the prepared pcap dataset. It sends the packets one by one from the prepared file to the main laptop as 'real time' incoming data. The main laptop captures all the received data packets using Wireshark. For the purpose of the experiment, packets are first captured for 30 seconds using Dumpcap from Wireshark and saved in a PcapNg file. Then, Tshark extracts the five-tuple information from the TCP flows and writes it to a csv file. The C++ program changes the header of the saved csv file to the same header as the training dataset so that it can be read in Weka. After that, Weka classifies the flows into two classes, malware and normal, according to their statistical features and based on the generative model. The output of the online predictions is compared with the outcome of the offline procedure to verify that the method can be implemented in real life.
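The header alignment step can be sketched as follows. The paper performs this step in a C++ program that is not reproduced in the text, so this Python version is only a stand-in under stated assumptions: the training header names, the incoming header in the usage note, and the choice to append `?` (the missing-value marker used by Weka's Arff format) as the unknown class label are all hypothetical.

```python
import csv
import io

# Assumed header of the training dataset; the last column is the class label.
TRAINING_HEADER = ["dst_ip", "dst_port", "src_ip", "src_port",
                   "protocol", "flow_type"]


def align_header(captured_csv, training_header=TRAINING_HEADER,
                 unknown_label="?"):
    """Replace the capture tool's header with the training header.

    Each data row gets a placeholder class value so the column count
    matches the training dataset.
    """
    rows = list(csv.reader(io.StringIO(captured_csv)))
    body = rows[1:]  # drop the header written by the capture tool
    n_features = len(training_header) - 1
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(training_header)
    for row in body:
        if len(row) != n_features:
            continue  # skip malformed or incomplete rows
        writer.writerow(row + [unknown_label])
    return out.getvalue()
```

For example, a captured file whose header reads `ip.dst,tcp.dstport,ip.src,tcp.srcport,ip.proto` (an assumed export layout) comes back with the training header and with `?` in the class column of every flow, ready to be scored by the saved model.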

Evaluation parameters
In order to analyse and evaluate the results, the confusion matrix shown in Table 3 is used to assess each classifier.
TP, FP, TN and FN symbolize the number of correctly identified malware flows, the number of wrongly identified normal flows, the number of correctly identified normal flows and the number of wrongly identified malware flows, respectively. Using these confusion matrix symbols, the parameters that assess classification performance can be defined. The parameters used to evaluate classifier performance are shown in Table 4, as (2) to (7).
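As an illustration of how such parameters follow from the four confusion matrix counts, the sketch below computes several commonly used metrics. It uses the standard textbook definitions with malware as the positive class (TP: malware classified as malware, FP: normal classified as malware, TN: normal classified as normal, FN: malware classified as normal); these are assumed to correspond to Table 4, which is not reproduced here, and the function name is hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard metrics from confusion matrix counts.

    Ratios with an empty denominator are reported as 0.0.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)  # share of predicted malware that is malware
    recall = ratio(tp, tp + fn)     # share of malware that was found
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "false_positive_rate": ratio(fp, fp + tn),
        "f_measure": ratio(2 * precision * recall, precision + recall),
    }
```

With made-up counts `tp=90, fp=5, tn=100, fn=5`, the accuracy is 190/200 = 0.95, and precision and recall are both 90/95.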

RESULTS AND ANALYSIS
In order to investigate the functionality of the proposed method, a series of tests was conducted in two parts, offline and online analysis. The purpose of the offline part is to determine the effectiveness of the five-tuple priority 1 dataset in traffic classification using a variety of algorithms in Weka, while the online analysis further evaluates the effectiveness of the finalized model on incoming real-time data.

Test scenario-offline classification
In the offline analysis, five algorithms were first tested and analysed, as shown in Table 5, on two types of dataset (the all-priorities dataset and the training dataset consisting of priority 1 flows from Table 1). The tested algorithms are Naïve Bayes, Random Tree, J48, Bayes Net and ZeroR. The first four algorithms were chosen because they work better with text classification, whereas ZeroR is used as the minimum benchmark of model performance. This evaluation is intended to confirm that using only the priority 1 dataset is sufficient to produce a decent and promising generative model.

Effect of algorithms on datasets using 10-fold cross-validation
The prediction is done using the 10-fold cross-validation test option. The number of folds is chosen so that each subset of the data is large enough to be statistically representative of the full dataset. The number of folds is usually 5 or 10; as the value gets higher, the difference in size between the training set and the resampling subsets becomes smaller. In practice, 10-fold cross-validation is usually adequate for estimating the average accuracy of a classifier [35], [36]. Hence, 10-fold cross-validation is selected without further experimentation. The accuracy and efficiency of each algorithm tested on the all-priorities dataset and on the priority 1 dataset are listed in Table 5. The two training datasets differ by 12346 instances: the full dataset contains 92251 instances, while the extracted dataset has only 79905. According to Table 5, the performance on the priority 1 dataset is generally better than or similar to that on the all-priorities dataset. The smaller dataset takes less time than the full-priorities dataset and is hence more efficient. Except for J48 and ZeroR, which have slightly lower accuracy, all other algorithms tested on the priority 1 dataset are expected to predict better when a new dataset is supplied. Overall, the experiment shows that using the extracted priority 1 dataset gives better performance in terms of efficiency and accuracy. Among all the algorithms, Naïve Bayes is the most convincing classifier owing to its accuracy of 99.92% and efficiency of 0.06 s when the priority 1 dataset and 10-fold cross-validation were applied. Table 6 presents the Naïve Bayes classification performance for the priority 1 dataset in detail. There are a number of important parameters to emphasize other than accuracy and efficiency. The confusion matrix gives the raw number of correctly and incorrectly classified instances. The sum of the cells aa, ab, ba and bb equals the total number of instances, 79905, where a and b are the class labels (normal and malware).
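The cross-validation procedure can be sketched in a few lines. This is a simplified illustration, not Weka's implementation: it uses a plain shuffled split rather than a stratified one, and the helper names and the `seed` parameter are assumptions. A ZeroR-style majority-class baseline is included because the text uses ZeroR as the minimum benchmark.

```python
import random
from collections import Counter


def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal subsets
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def cross_validate(rows, labels, fit, predict, k=10):
    """Return the overall accuracy over all k held-out folds."""
    correct = 0
    for train, test in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(model, rows[i]) == labels[i] for i in test)
    return correct / len(rows)


def zero_r_fit(rows, labels):
    """ZeroR-style baseline: remember only the majority class."""
    return Counter(labels).most_common(1)[0][0]


def zero_r_predict(model, row):
    """Always predict the majority class, ignoring the features."""
    return model
```

With 79905 instances and 10 folds, this split yields five test folds of 7991 instances and five of 7990, so each model is trained on roughly 90% of the data. Any classifier exposing the same `fit` and `predict` shape can be passed to `cross_validate` in place of the baseline.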
High precision means that the algorithm returns substantially more relevant results than irrelevant ones, while high recall indicates that most of the relevant results are returned. The F-measure summarizes the model accuracy by combining the recall and precision of the model. Other parameters, such as the receiver operating characteristic (ROC) area and the Kappa statistic, also verify the accuracy level of the model. The optimum classifier would have ROC area and Kappa values approaching or equal to 1. As shown in Table 6, Naïve Bayes is a promising classifier, as both its ROC area and its Kappa statistic are almost equal to 1.
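The Kappa statistic can be computed directly from a confusion matrix. The sketch below uses the standard definition of Cohen's kappa, (observed agreement - chance agreement) / (1 - chance agreement), with rows as actual classes and columns as predicted classes; the function name and the example matrix are illustrative, and the values of Table 6 are not reproduced here.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix.

    matrix[i][j] is the number of instances of actual class i that were
    predicted as class j. Undefined (division by zero) when the chance
    agreement equals 1, for example when only one class is present.
    """
    n = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected if predictions were independent of the actual class.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1 - expected)
```

For a made-up matrix `[[45, 5], [10, 40]]`, the observed agreement is 0.85 and the chance agreement is 0.5, so kappa is 0.7; a matrix with all instances on the diagonal gives exactly 1, the value a classifier is said above to approach.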

Test scenario-online classification
The statistics of the dataset supplied as the new test set are shown in Table 2. The online analysis started by supplying data from laptop B using PlayCap, while the main laptop captured the data, simulating traffic classification in real life. The Naïve Bayes model is used to predict the newly captured data from a csv file with edited headers, because the newly supplied dataset must have the same attributes as the dataset in the saved model; otherwise, it will not be recognized. Using the TCP flows from Table 2 that were captured from laptop B, the program attempts to match the attributes of the two datasets before prediction. As shown in Figure 5, the first five attributes matched perfectly; the exception is the last one, which is the prediction result we are looking for. It is reported as mismatched because the incoming data flows do not carry a label, unlike the dataset of the saved model. After prediction, the program generates results that contain both the actual and the predicted class for the type label of each instance in the test set, as shown in Figure 5. Its prediction error should be exactly the same as when the Weka GUI is used. To validate the online classification result, a Weka Explorer test was run with the saved model on the same newly captured PcapNg file sent by PlayCap on laptop B. As shown in Figure 6, the predictions on the test set are exactly the same as the results produced by online classification in Figure 5. The prediction value in the output is the probability that the prediction is correct; hence, the closer it is to '1', the better the accuracy of each prediction. The predicted classes are also correctly assigned: to ease verification, only malware packets were sent from laptop B, and every instance was labelled 'Malware', which shows that all predictions are correct and that the approach is feasible in practice.
The outcome of the experiments verifies that the ML method using statistical features and assisted by Snort alerts is sufficient and precise enough to provide an accurate answer for each instance. The performance on the priority 1 dataset, which contains only the important features and avoids redundant data, is generally better than or similar to that on the all-priorities dataset. Training with the smaller dataset takes less time than with the full-priorities dataset and can produce more efficient traffic classification performance.
Figure 5. Output prediction of online classification
Figure 6. Output prediction of offline classification

CONCLUSION
In conclusion, traffic classification using an ML approach with five-tuple features and assisted by Snort alert information can provide efficient classification, as shown by the classification accuracy and training time achieved. The classifier is produced within the desired time frame, and the outcome closely follows expectations. The proposed method is capable of reducing unclassified network traffic and is a promising way of securing Internet users. Combined with real-time online classification of unwanted data, it can sustain the safety of our devices and information. Some recommendations are offered to improve this project for future researchers. To obtain a more comprehensive result, the training dataset could include more data flows covering a wider variety of protocols, so that the classifier is more accurate when tested with a new dataset. Moreover, experiments on a real network with different types of malicious traffic should be carried out, as they would greatly improve this research.