A Big Data Analytical Framework for Intrusion Detection Based On Novel Elephant Herding Optimized Finite Dirichlet Mixture Models

ABSTRACT


INTRODUCTION
The field of Data Science (DS) applies advanced analytical methods and scientific concepts to draw useful commercial insights from data. Basic analytics gives only a general description of data, whereas advanced analytics, which is our interest here, offers a deeper understanding of detailed data and puts greater emphasis on forecasting future occurrences by identifying patterns and making predictions [1]. Modern life is increasingly influenced by networks, making cybersecurity a crucial area of study. Anti-virus software, firewalls, and Intrusion Detection Systems (IDSs) are the key cybersecurity tools. These tools defend networks against both internal and external intrusions. An IDS plays a crucial role in ensuring cybersecurity by keeping track of the hardware and software configurations inside a network [2]. The first IDS was proposed in 1980, and many mature IDS products have since emerged. However, many IDSs still have a high false alarm rate and often emit alerts for low-threat situations. This increases the workload of security analysts and raises the possibility that very damaging attacks go unnoticed. As a consequence of extensive study, IDSs with higher detection rates and fewer false alarms have been developed. Existing IDSs also have the drawback of being unable to recognize unknown attacks. Because network environments change quickly, new attack types emerge steadily. It is therefore essential to create IDSs that can recognize unknown attacks [3][4].
As a result, cybersecurity is quickly becoming one of the most pressing problems in contemporary society. Monitoring and analysis of network traffic data are essential for detecting likely attack trends. Businesses and IT companies worldwide have been investing in data science to create more sophisticated IDSs that can prevent damaging attacks and provide stronger cybersecurity [5]. Big data analytics in security requires the capacity to collect enormous volumes of digital data in order to analyze, visualize, and derive insights that help predict and stop cyber attacks. Together with security technology, it improves our cyber defense posture. The field encompasses a variety of techniques from computer science, statistics, and data technology, including the well-known Machine Learning (ML) techniques [6]. However, because of the enormous amount of heterogeneous big data produced by many sources, standard data analytics and shallow ML approaches are ineffective in dealing with such security risks directly. Notably, classical ML approaches struggle with processing complexity and latency and may be unable to capture the complex, time-varying, non-linear relationships present in huge datasets [7]. The main contributions of the paper are:
• To evaluate the dependability of this technique for intrusion detection, we compare it with three other methods on two benchmark datasets, UNSW-NB15 and NSL-KDD, after pre-processing the data with z-score normalization.
• We develop a novel EHO-FDMM-based intrusion detection technique to efficiently detect harmful events in this framework.
The remainder of the paper is organized as follows: related works are reviewed in Section 2, the method is introduced in Section 3, the results and discussion are presented in Section 4, and the paper is concluded in the final section.

RELATED WORKS
The publication [8] examined whether the CRoss-Industry Standard Process for Data Mining (CRISP-DM) is still appropriate for data science projects. The authors contend that the process model still largely holds if the project is goal-directed and process-driven. However, as DS projects become more exploratory, the directions they may take become more diverse, which calls for a more adaptable model. The field of supply chain management (SCM) is paying increasing attention to big data analytics (BDA). The investigation [9] aimed to propose a categorization of BDA applications for supply chain demand forecasting, identify the gaps, and give recommendations for future research. Because of various constraints, typical IDS techniques need to be updated and enhanced before they can be used in the Internet of Things (IoT). These constraints include resource-constrained devices, the restricted memory and battery capacity of nodes, and a specialized protocol stack. The research [10] created a lightweight attack detection technique that uses a supervised ML-based Support Vector Machine (SVM) to identify an adversary trying to inject extraneous data into the IoT network. The research [11] addressed network intrusion detection, the process of discovering hostile activity in a network by examining the behavior of network traffic. To identify anomalies, IDSs often make use of data mining methods. Because spotting anomalies in high-dimensional network traffic features is a laborious operation, IDSs rely heavily on dimensionality reduction as a key component. The study [12] proposed Passban, an intelligent IDS that can safeguard the IoT devices directly connected to it. The system is distinctive in that it can be deployed directly on very inexpensive IoT gateways. As a result, it takes full advantage of the edge computing paradigm to identify cyber threats.
IDSs are among the most reliable options, particularly those developed with the assistance of Artificial Intelligence (AI). The study [13] proposed a fully automated, AI-based IDS for securing fog computing against cyberattacks. The proposed model uses multi-layered Recurrent Neural Networks (RNNs) to secure fog computing, which is located very close to end users and IoT devices. The research [14] developed an IDS architecture based on collaborative learning and feature selection. Its first stage is the CFS-BA heuristic method for dimensionality reduction, which chooses the best feature subset based on the correlation between features. The research [15] introduced a Variational Long Short-Term Memory (VLSTM) learning model for intelligent anomaly detection based on a reconstructed feature representation, to address the trade-off between dimensionality reduction and feature retention in imbalanced Industrial Big Data (IBD).

METHOD
This section explains the proposed technique for using the EHO-FDMM to build an efficient IDS, as well as the mathematical aspects of data modeling and estimation using the Dirichlet Mixture Model (DMM). Figure 1 depicts the structure of the EHO-FDMM.

Dataset
To analyze the effectiveness of the proposed methodology, several standalone datasets containing a variety of normal and malicious records are available, including KDD CUP 99, NSL-KDD, and UNSW-NB15. The NSL-KDD dataset is an improved version of the KDD CUP 99 dataset: redundant records were removed from the KDD CUP 99 training and testing sets to prevent classifiers from favoring the most frequent records. Like KDD CUP 99, the NSL-KDD dataset comprises 41 features and a class label for each record.
The UNSW-NB15 dataset combines real, recent records of attacks and normal behavior. Its network packets amount to 1,450,133 records, stored in four CSV files totaling around 100 Gigabytes. Each record has 58 features, along with a class label, which highlights its high dimensionality. With an average speed of 5 to 10 MB/s between sources and destinations, it reflects the higher data rates of Ethernet transfers and closely simulates real network conditions [16].

Pre-processing using z-score normalization
Z-score normalization is a common pre-processing technique used in intrusion detection systems to standardize the scale of input data. It involves subtracting the mean value of the data and dividing it by the standard deviation. The resulting data has a mean of zero and a standard deviation of one, which makes it easier to compare and analyze.
The equation for z-score normalization is as follows:
$$z = \frac{x - \mu}{\sigma} \qquad (1)$$
where $z$ is the standardized value, $x$ is the original value, $\mu$ is the mean of the data, and $\sigma$ is the standard deviation of the data.
To use z-score normalization in an intrusion detection system, the first step is to calculate the mean and standard deviation of the training data for each feature. Then, for each new data point, the z-score is calculated using the above equation. If the z-score falls outside a certain threshold, the system raises an alarm indicating a potential intrusion.
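The two steps just described (fitting the mean and standard deviation on the training data, then standardizing each new record) can be illustrated with a minimal Python sketch. The function names, the use of the population standard deviation, the handling of constant features, and the threshold of 3.0 are assumptions made for illustration; they are not specified in this paper.

```python
import math

def fit_zscore(rows):
    """Return per-feature (means, stds) computed from the training rows."""
    n = len(rows)
    n_features = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_features)]
    stds = []
    for j in range(n_features):
        # Population variance (divide by n); an assumption of this sketch.
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        stds.append(math.sqrt(var))
    return means, stds

def transform_zscore(row, means, stds):
    """Standardize one record; constant features (std == 0) are mapped to 0.0."""
    return [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]

def exceeds_threshold(z_row, threshold=3.0):
    """Flag a record if any feature's |z| exceeds the (assumed) threshold."""
    return any(abs(z) > threshold for z in z_row)

# Toy usage with hypothetical two-feature records.
train = [[2.0, 100.0], [4.0, 100.0], [6.0, 100.0]]
means, stds = fit_zscore(train)
z_new = transform_zscore([4.0, 100.0], means, stds)
alarm = exceeds_threshold(z_new)
```

The statistics are estimated once on the training data and reused for every new record, so that test records are scaled in the same way as the records the model was trained on.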

Intrusion detection using Finite Dirichlet Mixture Model (FDMM)
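The FDMM is a probabilistic model that assumes the data are generated by a finite number of Dirichlet distributions whose parameters are unknown. The number of components is not known a priori, but it can be estimated from the data. The model assigns each data point to one of the components with a probability proportional to its likelihood under that component. In intrusion detection, the FDMM can be used to detect anomalies in network traffic: the model is trained on a collection of legitimate network traffic so that it learns the profile of normal activity, and once trained it can classify new network traffic as either normal or anomalous. The finite mixture model is an effective and flexible probabilistic tool for modeling network data. It can be viewed as a convex combination of two or more Probability Density Functions (PDFs), whose joint characteristics can approximate an arbitrary distribution. Figure 2 depicts a finite mixture of K-component Dirichlet distributions, which is denoted by
$$p(W \mid \pi, \alpha) = \sum_{j=1}^{K} \pi_j \, \mathrm{Dir}(W \mid \alpha_j) \qquad (2)$$
where $\pi = (\pi_1, \ldots, \pi_K)$ are the mixing coefficients, which are positive and satisfy $\sum_{j=1}^{K} \pi_j = 1$, $\alpha = (\alpha_1, \ldots, \alpha_K)$, and $\mathrm{Dir}(W \mid \alpha_j)$ is the Dirichlet distribution of component $j$ with positive parameters $\alpha_j = (\alpha_{j1}, \ldots, \alpha_{jT})$:
$$\mathrm{Dir}(W \mid \alpha_j) = \frac{\Gamma\left(\sum_{t=1}^{T} \alpha_{jt}\right)}{\prod_{t=1}^{T} \Gamma(\alpha_{jt})} \prod_{t=1}^{T} W_t^{\alpha_{jt} - 1} \qquad (3)$$
Here $W = (W_1, \ldots, W_T)$, with $\sum_{t=1}^{T} W_t = 1$ and $0 \le W_t \le 1$ for $t = 1, \ldots, T$, and $T$ is the dimension of $W$. It is important to note that the Dirichlet distribution is used here as a parent distribution that directly models the data, rather than as a prior.
Suppose that a set of $N$ independent and identically distributed vectors $\mathcal{W} = \{W_1, \ldots, W_N\}$ is generated from the mixture distribution in Equation (2). The likelihood function of the FDMM is then
$$p(\mathcal{W} \mid \pi, \alpha) = \prod_{i=1}^{N} \left\{ \sum_{j=1}^{K} \pi_j \, \mathrm{Dir}(W_i \mid \alpha_j) \right\} \qquad (4)$$
The finite mixture model in Equation (2) is a latent variable model. We therefore introduce, for each vector $W_i$, a K-dimensional binary random vector $Y_i = \{Y_{i1}, \ldots, Y_{iK}\}$ with $Y_{ij} \in \{0, 1\}$ and $\sum_{j=1}^{K} Y_{ij} = 1$, where $Y_{ij} = 1$ if $W_i$ belongs to component $j$ and 0 otherwise. The latent variables $Y = \{Y_1, \ldots, Y_N\}$ are hidden variables that do not appear explicitly in the model. The conditional distribution of $Y$ given the mixing coefficients $\pi$ is defined as
$$p(Y \mid \pi) = \prod_{i=1}^{N} \prod_{j=1}^{K} \pi_j^{Y_{ij}} \qquad (5)$$
The likelihood function with latent variables, that is, the conditional distribution of the dataset $\mathcal{W}$ given the class labels $Y$, can therefore be written as
$$p(\mathcal{W} \mid Y, \alpha) = \prod_{i=1}^{N} \prod_{j=1}^{K} \mathrm{Dir}(W_i \mid \alpha_j)^{Y_{ij}} \qquad (6)$$
Given a dataset $\mathcal{W}$ and a set of parameters, learning the mixture, which involves both estimating the parameters and selecting the number of components (K), is a significant issue.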
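To make the finite Dirichlet mixture concrete, the following Python sketch evaluates the Dirichlet log-density, the mixture log-density, and the posterior probability that a record belongs to each component. It is a minimal illustration under stated assumptions, not the implementation used in this paper: the function names are hypothetical, and each record is assumed to have been rescaled beforehand so that its entries are strictly positive and sum to one, as the Dirichlet distribution requires.

```python
import math

def dirichlet_logpdf(w, alpha):
    """Log-density of a Dirichlet distribution at w (entries > 0, summing to 1)."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(x) for a, x in zip(alpha, w))

def _weighted_terms(w, weights, alphas):
    """log(pi_j) + log Dir(w | alpha_j) for every component j."""
    return [math.log(p) + dirichlet_logpdf(w, a) for p, a in zip(weights, alphas)]

def mixture_logpdf(w, weights, alphas):
    """Log of the mixture density, computed with log-sum-exp for stability."""
    terms = _weighted_terms(w, weights, alphas)
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def responsibilities(w, weights, alphas):
    """Posterior probability that w was generated by each component."""
    terms = _weighted_terms(w, weights, alphas)
    top = max(terms)
    unnorm = [math.exp(t - top) for t in terms]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def log_likelihood(data, weights, alphas):
    """Log-likelihood of a dataset: sum of per-record mixture log-densities."""
    return sum(mixture_logpdf(w, weights, alphas) for w in data)

# Toy usage with hypothetical parameters for a two-component mixture.
weights = [0.3, 0.7]
alphas = [[2.0, 1.0, 1.0], [1.0, 1.0, 4.0]]
record = [0.5, 0.25, 0.25]
posterior = responsibilities(record, weights, alphas)
```

Working in log space avoids numerical underflow when the number of features is large, which is the usual situation for network traffic records.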
The FDMM approach has several advantages over other intrusion detection methods. First, it can detect both known and unknown types of attacks. Second, it can adapt to changing network traffic patterns over time. Third, it has a low false positive rate, which means that it is less likely to classify normal traffic as anomalous. As a result, the FDMM is a powerful and flexible model and a promising approach for securing computer networks.

Elephant Herding Optimization (EHO)
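EHO is a recent optimization algorithm that mimics the behavior of elephants in a herd. In a herd, elephants cooperate and communicate with each other to achieve a common goal, such as finding food or water. The EHO algorithm applies this concept of cooperation and communication among individuals to solve optimization problems.
In the basic EHO algorithm, each iteration consists of two stages: the clan updating operation and the separating operation. The separating operation is executed after the updating operation, and together they determine the search direction and the level of local search detail of the method. The elephant population is initialized at random and then split into n clans, each containing the same number of elephants. The position of elephant $j$ in clan $c_i$ is updated in each iteration by Equation (7):
$$x_{new,c_i,j} = x_{c_i,j} + \alpha \, (x_{best,c_i} - x_{c_i,j}) \, r \qquad (7)$$
where $x_{c_i,j}$ and $x_{new,c_i,j}$ are the old and new positions, $x_{best,c_i}$ is the position of the matriarch (the fittest elephant of the clan), $\alpha \in [0, 1]$ is a scalar scale factor (unrelated to the Dirichlet parameters of the FDMM), and $r \in [0, 1]$ is a uniform random number.
Equation (8) is used to update the position of the female matriarch, for which $x_{c_i,j} = x_{best,c_i}$, and Equation (9) defines the center of the clan:
$$x_{new,c_i,j} = \beta \, x_{center,c_i} \qquad (8)$$
$$x_{center,c_i,d} = \frac{1}{n_{c_i}} \sum_{j=1}^{n_{c_i}} x_{c_i,j,d} \qquad (9)$$
where $\beta \in [0, 1]$ is a scale factor, $d$ indexes the dimensions of the search space, and $n_{c_i}$ is the number of elephants in clan $c_i$.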
EHO is an optimization method that uses the behavior of elephants in a herd to guide the search for the optimal solution. The algorithm is based on the concept of cooperation and communication among individuals, and it is effective in solving a wide range of optimization problems.
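The clan updating and separating operations can be sketched in Python as follows. This is a minimal illustration of a basic EHO loop for minimization, not the authors' implementation: the function name, the default parameter values, the per-dimension random factor, the clipping to the search bounds, the uniform re-initialization of the worst elephant, and the bookkeeping of the best solution found so far are all assumptions of the sketch.

```python
import random

def eho_minimize(fitness, dim, lower, upper, n_clans=3, clan_size=5,
                 alpha=0.5, beta=0.1, iterations=100, seed=0):
    """Minimize `fitness` over the box [lower, upper]^dim with a basic EHO loop."""
    rng = random.Random(seed)

    def new_elephant():
        return [rng.uniform(lower, upper) for _ in range(dim)]

    def clip(value):
        return min(max(value, lower), upper)

    clans = [[new_elephant() for _ in range(clan_size)] for _ in range(n_clans)]
    best = list(min((e for clan in clans for e in clan), key=fitness))

    for _ in range(iterations):
        for clan in clans:
            clan.sort(key=fitness)            # clan[0] is the matriarch (fittest)
            matriarch = list(clan[0])
            center = [sum(e[d] for e in clan) / clan_size for d in range(dim)]
            # Clan updating, Equation (7): move each elephant toward the matriarch.
            for j in range(1, clan_size):
                clan[j] = [clip(x + alpha * (m - x) * rng.random())
                           for x, m in zip(clan[j], matriarch)]
            # Matriarch update, Equation (8): scaled copy of the clan center (9).
            clan[0] = [clip(beta * c) for c in center]
            # Separating operation: replace the worst elephant with a random one.
            clan.sort(key=fitness)
            clan[-1] = new_elephant()
            # Keep the best solution seen so far (bookkeeping added by this sketch).
            candidate = min(clan, key=fitness)
            if fitness(candidate) < fitness(best):
                best = list(candidate)
    return best, fitness(best)

# Toy usage on a hypothetical test function (the sphere function).
sphere = lambda x: sum(v * v for v in x)
best_position, best_value = eho_minimize(sphere, dim=3, lower=-5.0, upper=5.0)
```

Because the matriarch update scales the clan center by $\beta < 1$, the basic algorithm is biased toward the origin of the search space. This helps on toy functions such as the sphere function, but it should be kept in mind when the optimum lies elsewhere.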

Elephant Herding Optimized-Finite Dirichlet Mixture Model (EHO-FDMM)
The Elephant Herding Optimized Finite Dirichlet Mixture Model (EHO-FDMM) applies an ML model in an intrusion detection system to identify and classify patterns of behavior in network traffic that may indicate an attempted intrusion or attack. The EHO-FDMM is a variant of the Dirichlet Mixture Model (DMM), a probabilistic model used in machine learning for clustering and classification tasks, whose performance is optimized by EHO, a swarm intelligence algorithm inspired by the behavior of elephant herds in nature.
The EHO-FDMM in intrusion detection can be formulated mathematically using the following equation:
$$p(w \mid \theta) = \sum_{k=1}^{K} \pi_k \, p(w \mid \theta_k)$$
where $p(w \mid \theta)$ represents the probability of observing network traffic data $w$ given the model parameters $\theta$, $K$ is the number of mixture components, $\pi_k$ represents the weight of the $k$-th mixture component, and $p(w \mid \theta_k)$ represents the probability of observing data $w$ given the parameters of the $k$-th mixture component.
The EHO-FDMM optimizes the values of the parameters $\theta$ and the weights $\pi_k$ using EHO. The algorithm iteratively adjusts the values of $\theta$ and $\pi_k$ to maximize the likelihood of the observed network traffic data.
The EHO-FDMM can be used in conjunction with other intrusion detection techniques, such as signature-based detection and anomaly-based detection, to provide a more comprehensive and accurate approach to intrusion detection. By combining ML and swarm intelligence, the EHO-FDMM has the potential to improve the effectiveness of network security measures and to detect new and emerging threats in real time.
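One possible way to couple the two parts is to let each elephant position encode a full set of mixture parameters and to use the negative log-likelihood of the training data as the fitness to be minimized. The Python sketch below illustrates such a fitness function. The encoding is an assumption of this sketch and is not described in the paper: the first entries of the position vector are turned into weights with a softmax, and the remaining entries are exponentiated to obtain positive Dirichlet parameters. Records are assumed to be strictly positive and to sum to one.

```python
import math

def decode(position, n_components, dim):
    """Map an unconstrained position vector to (weights, alphas)."""
    logits = position[:n_components]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    weights = [e / total for e in exps]          # softmax: positive, sums to 1
    raw = position[n_components:]
    alphas = [[math.exp(raw[k * dim + t]) for t in range(dim)]  # exp keeps alpha > 0
              for k in range(n_components)]
    return weights, alphas

def negative_log_likelihood(position, data, n_components):
    """Fitness to minimize: minus the mixture log-likelihood of the data."""
    dim = len(data[0])
    weights, alphas = decode(position, n_components, dim)
    total = 0.0
    for w in data:
        terms = []
        for pi_k, alpha in zip(weights, alphas):
            log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
            log_dir = log_norm + sum((a - 1.0) * math.log(x) for a, x in zip(alpha, w))
            terms.append(math.log(pi_k) + log_dir)
        top = max(terms)                          # log-sum-exp for stability
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return -total

# Toy usage: a position of length K + K*T for K = 2 components and T = 3 features.
data = [[0.2, 0.3, 0.5], [0.1, 0.1, 0.8]]
position = [0.0] * (2 + 2 * 3)
fitness_value = negative_log_likelihood(position, data, n_components=2)
```

With this encoding, any population-based optimizer that minimizes a function over a bounded box can search the parameter space without violating the constraints on the weights and the Dirichlet parameters. Keeping the search bounds moderate (for example, a few units around zero) avoids overflow in the exponentials.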

RESULTS AND DISCUSSION
This section covers the datasets used to evaluate the proposed approach, followed by the evaluation measures used to compare its effectiveness with that of other methods. It also analyzes the variance of the features selected from the NSL-KDD and UNSW-NB15 datasets.

Analysis of Performance
The proposed EHO-FDMM-based IDS technique was evaluated in several experiments on the two datasets using external evaluation metrics, namely accuracy, Detection Rate (DR), and False Positive Rate (FPR), which depend on four terms: true positive (TP), true negative (TN), false negative (FN), and false positive (FP). TP is the number of actual attack records classified as attacks, TN is the number of actual normal records classified as normal, FN is the number of actual attack records classified as normal, and FP is the number of actual normal records classified as attacks. These measures are defined as follows.
Accuracy is the proportion of all normal and attack records that are correctly classified, as defined in Equation (12):
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (12)$$
Figure 3 and Table 1 compare the accuracy of the EHO-FDMM with that of other current techniques.
The Detection Rate (DR) is the proportion of attack records that are correctly identified, as defined in Equation (13):
$$\mathrm{DR} = \frac{TP}{TP + FN} \qquad (13)$$
Figure 4 and Table 2 contrast the DR of the proposed approach with that of other existing methods.
The False Positive Rate (FPR) is the proportion of normal records that are mistakenly identified as attacks, as defined in Equation (14):
$$\mathrm{FPR} = \frac{FP}{FP + TN} \qquad (14)$$
Figure 5 and Table 3 compare the FPR of the proposed method with that of other methods.
Figure 5. Comparison of FPR
To assess the EHO-FDMM-based IDS approach on the NSL-KDD dataset, its results were compared with those of three alternative methods: Triangle Area Nearest Neighbours (TANN) [17], Euclidean Distance Map (EDM) [18], and Multivariate Correlation Analysis (MCA) [19]. The overall DRs and FPRs are shown in Table 4. These approaches were chosen for comparison because they are recent and report statistical measures comparable to those of our EHO-FDMM. The TANN, EDM, and MCA had accuracies of 81%, 76%, and 83%, DRs of 91.2%, 94.3%, and 96.1%, and FPRs of 9.5%, 7.3%, and 4.1%, respectively. The EHO-FDMM, in comparison, achieved superior results with 98% accuracy, 97.3% DR, and 2.3% FPR.
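The three measures can be computed directly from the confusion counts, as the following minimal Python sketch shows. The function names and the label convention (1 for attack, 0 for normal) are assumptions made for illustration, and the counts in the usage example are hypothetical values, not results from this paper.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN, with label 1 = attack and label 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def ids_metrics(tp, tn, fp, fn):
    """Return (accuracy, detection rate, false positive rate) as fractions."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, detection_rate, false_positive_rate

# Toy usage with hypothetical labels: three attacks and three normal records.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
accuracy, dr, fpr = ids_metrics(*confusion_counts(y_true, y_pred))
```

Reporting DR and FPR alongside accuracy matters for intrusion detection because attack and normal records are usually imbalanced, so a high accuracy alone can hide a poor detection rate.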

Datasets Applied to Analysis
The features from the two datasets that were chosen for the performance assessment of the DMM-based IDS approach are provided in Table 5, along with the overall DR, accuracy, and FPR scores. In the NSL-KDD dataset, as the v value grew steadily from 1.5 to 3, the overall DR and accuracy climbed from 82.2% to 86.8% and from 92.1% to 96.7%, respectively, whereas the overall FPR decreased from 2.2% to 1.4%.
Similarly, in the UNSW-NB15 dataset, as the v value grew from 1.5 to 3, the overall accuracy and DR climbed from 84.1% to 93.9% and from 89.1% to 94.3%, respectively, while the overall FPR decreased from 9.2% to 5.8%.
The main reason the EHO-FDMM-based IDS approach performed better than the other techniques is that the FDMM fits the boundaries of each feature precisely, because it provides a set of probabilities used to compute the PDF of every record for every feature. Nevertheless, although the EHO-FDMM-based IDS approach had the lowest FPR and highest DR on the NSL-KDD dataset, it performed much worse on the UNSW-NB15 dataset because of the subtle differences between normal and abnormal records. This reflects the intricate patterns of contemporary attacks, which closely resemble normal behavior.

CONCLUSION
This research proposed a scalable framework with three primary modules: data source, pre-processing, and the proposed detection technique. The first module detects and gathers network data from a distributed database so that large-scale environments can be managed easily, while the second module analyzes and filters the network data to increase the performance of the proposed technique. The third module, the EHO-FDMM-based intrusion detection, uses a lower-upper confidence interval as an indicator to identify abnormal data. The performance assessment of the EHO-FDMM-based intrusion detection system showed that it was more accurate than several other important techniques. In the future, we will investigate further statistical methods and combine them to provide a visual tool for analyzing, monitoring, and making decisions on individual intrusions. We will also extend this research to integrate the architecture of the proposed framework with SCADA and cloud computing platforms.