A Hypothesis Testing tool for the comparison of different Cyber-Security Mitigation Strategies in IoT

Internet of Things (IoT) is a field with tremendous growth that already shows great impact in numerous domains. Simultaneous with this development is the need for better cyber-security: IoT systems are attacked by various adversaries targeting IoT services, platforms and networks, and these attacks can have disruptive consequences. Such attacks can be countered using multiple strategies with different effects on the system. This paper proposes a novel approach based on Machine Learning and Statistical Hypothesis Testing, which allows the security operator to investigate how different strategies affect various KPIs related to the security of the IoT network, and whether the KPIs resulting from modifications to a mitigation strategy are statistically different from those produced by a starting mitigation action set.


I. INTRODUCTION
From Industry 4.0 to smart cities, the growth of the Internet of Things (IoT) ecosystem has been empowering tremendous advancements and extensive impact in multiple domains, while more industrial and commercial opportunities can be foreseen in the future. In tandem with this expansion has been the rise of attacks against IoT systems, attacks which can potentially become as pervasive as the spread of IoT applications themselves. Successful attacks against the IoT can cause disruption of the services provided, breaches of sensitive or private data, large monetary losses, damage to property and even physical harm [1].
Designing secure IoT systems is the first step towards guarding them from malicious acts. However, new exploits and vulnerabilities are constantly discovered, so a robust attack detection system, combined with methods and tools to mitigate new threats, is extremely important for preserving the soundness of any IoT system. This is a challenging task, since multiple attacks target different layers of the IoT network, and each of these attacks can be countered by multiple mitigation actions [2].
The plethora of possible attack responses creates the need for tools that help the network operator investigate how different mitigation actions affect the network, so that she can make proper decisions for its safeguarding. The outcomes of different mitigation strategies can be quantified using carefully selected Key Performance Indicators (KPIs). Such a tool is described and evaluated in the remainder of this paper.

A. Key Contributions
This paper presents a novel approach to distinguishing differences between sets of mitigation actions used to counter an attack or threat against an IoT network. It allows the Security Operator of the IoT network to modify a set of existing mitigation actions and see the impact of the changes. More specifically, the KPI values resulting from different mitigation sets are used to form clusters of such sets using a machine learning algorithm. Then, the difference between these clusters is evaluated by means of a p-value provided by a method based on Statistical Hypothesis Testing. To our knowledge, this is a novel approach not suggested elsewhere in the available literature.
The rest of the paper is structured as follows: Section II briefly presents relevant literature, Section III presents the methodology used to formulate the proposed approach, and Section IV contains the results of experiments that were performed to validate the proposed method. Finally, Section V contains the conclusions of this paper along with future research directions and aims.

II. RELATED WORK
This section briefly presents the use of Hypothesis Testing based algorithms in the context of IoT networks through examples found in recently published literature, along with details concerning the four KPIs used in this work to quantify the effects of one or more mitigation actions on the system.

A. Statistical Hypothesis Testing in the context of IoT networks
In statistics, hypothesis testing is the use of a sample of data to evaluate the plausibility of a hypothesis concerning the distribution of the sample data. Algorithms based on Hypothesis Testing have been successfully applied to numerous IoT domain-specific problems.
Such methods have been extensively employed for attack detection. For example, in [3] such an algorithm is used to detect a link flooding attack on an IoT network, in [4] Hypothesis Testing is used against spectrum sensing data falsification in cognitive IoT networks, while in [5] Li et al. use it to empower a distributed attack detection system.
Examples of other uses are numerous: Hypothesis Testing based algorithms have been used to poll the values of multiple KPIs in a fog-based IoT sensor network [6], to facilitate protocols for the authentication of IoT devices [7], and to manage the privacy of smart energy meters [8]. Moreover, decentralized methods have been proposed to reduce IoT sensor energy consumption [9], to safeguard privacy [10], and to preserve robustness in the presence of noise and small data sets [11]. To our knowledge, no other hypothesis-based application for evaluating different mitigation strategies in IoT has been presented in the available literature.

B. Security Related Key Performance Indicators for the selection of Mitigation Actions
This section presents the four Key Performance Indicators (KPIs) selected as the metrics used in this work to describe the mitigation actions that have been deployed to secure an IoT network. They were selected based on the results of an extensive literature review, which is available in [12].
The first KPI is the Common Vulnerability Scoring System (CVSS), an open industry standard for assessing the severity of cyber-security vulnerabilities. Severity is expressed as a score in the range $[0, 10]$, with 10 representing a vulnerability of the highest severity. The score is calculated using predefined values and equations, available in [13]. Using CVSS, security experts can easily share discovered vulnerabilities via public databases such as the National Vulnerability Database [14]. Moreover, any attack vector can easily be translated to a vulnerability in the device it affects. In Section IV, the following formula is used to calculate the CVSS score for a set of mitigation actions applied to the system: $\mathrm{CVSS} = 10 - \mathrm{mean}(\mathrm{CVSS}_{\text{all detected vulnerabilities}})$.
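As a minimal illustration of this formula in code (the function name and the example scores below are ours, not part of the original implementation):

import statistics

def cvss_kpi(vulnerability_scores):
    # 10 minus the mean CVSS score of all detected vulnerabilities,
    # so a higher KPI value indicates a less vulnerable system.
    if not vulnerability_scores:
        return 10.0  # no remaining vulnerabilities
    return 10.0 - statistics.mean(vulnerability_scores)

# Example: three detected vulnerabilities with CVSS scores 9.8, 5.3 and 4.0
print(cvss_kpi([9.8, 5.3, 4.0]))  # 10 - 6.37 = 3.63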
The second KPI is the Return on Response Investment (RORI), an index associated with a set of mitigation actions. This KPI can be used to evaluate candidate plans by ranking them based on their efficiency in stopping potential attacks while simultaneously preserving the best possible service for users. The authors of [15] provide the formula used in this paper to calculate RORI, which combines the following quantities:
• The financial cost expected to occur annually in the absence of a mitigation strategy, referred to as the Annual Loss Expectancy (ALE).
• The Risk Mitigation (RM), which estimates the coverage of one or more actions in the mitigation of an attack.
• The cost expected to occur due to the application of mitigations, called the Annual Response Cost (ARC).
• The fixed cost associated with the system infrastructure, regardless of the application of a mitigation strategy, called the Annual Infrastructure Value (AIV).
Then:
$$RORI = \frac{(ALE \times RM) - ARC}{ARC + AIV} \times 100.$$
The third KPI is the Vulnerability Coverage (VC). The VC of a mitigation action $cm_i$ is defined as the number of vulnerabilities it covers when applied, divided by the total number of vulnerabilities, so $VC \in [0, 1]$ [16]. Disjoint VC includes the vulnerabilities covered by a single countermeasure, whereas joint VC refers to the vulnerabilities covered by multiple countermeasures.
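A small sketch of the two KPIs above follows; the figures in the usage example are illustrative only and not taken from the paper.

def rori(ale, rm, arc, aiv):
    # Expected loss avoided by the mitigation (ALE * RM), net of its annual
    # cost (ARC), relative to the total cost base (ARC + AIV), as in [15].
    return (ale * rm - arc) / (arc + aiv) * 100.0

def vulnerability_coverage(covered, total):
    # Fraction of all known vulnerabilities addressed by the action, in [0, 1].
    return len(covered) / total

# Example with illustrative yearly figures:
print(rori(ale=100_000, rm=0.8, arc=15_000, aiv=50_000))    # 100.0
print(vulnerability_coverage({"CVE-A", "CVE-B"}, total=8))  # 0.25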
Finally, the Deployment Cost KPI evaluates the deployment cost of the mitigation actions by considering the deployment time, the consumed resources and the importance of the device affected by the countermeasure, as assessed by the network security operator [17]. To calculate it, three quantities are required. The first is the Deployment Time (DT), measured in milliseconds: the time required for a mitigation action to be deployed, which can be assessed using historical data and dynamically updated. The second is the Device Importance (DI), which is assessed by the network security operator considering the specifics of each use case; a value with $DI \in [0, 1]$ is assigned to each device. The last quantity is the Resource Consumption (RC), which can be imputed either from measurements or from a ranking scheme chosen based on the network operator's expertise, e.g. RC = {Very Low: 1, Low: 2, Medium: 3, High: 4, Very High: 5}. The KPI value is then calculated by the following formula: $\text{Deployment Cost} = DT \times DI \times RC$.
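A direct transcription of this formula into code might look as follows; the dictionary encoding of the ranking scheme is our assumption.

RC_SCALE = {"very low": 1, "low": 2, "medium": 3, "high": 4, "very high": 5}

def deployment_cost(dt_ms, device_importance, resource_consumption):
    # Deployment Cost = DT * DI * RC, with DT in milliseconds,
    # DI in [0, 1] set by the operator and RC on the 1-5 scale.
    assert 0.0 <= device_importance <= 1.0
    return dt_ms * device_importance * RC_SCALE[resource_consumption]

# Example: a 120 ms action on a fairly important device with medium resource use
print(deployment_cost(120, 0.7, "medium"))  # 120 * 0.7 * 3 = 252.0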

III. PROPOSED MODEL
This section presents a method to distinguish between different sets of mitigation actions. More specifically, the KPI values resulting from different mitigation sets are used to cluster them. Then, the difference between these clusters is evaluated by means of a p-value provided by a Monte Carlo (MC) based method named Statistical Significance of Clustering (SigClust) using Soft Thresholds. Figure 1 presents a high level overview of the proposed tool.
A brief presentation of the SigClust method with Soft Thresholds, as shown in [18], follows. Let $X = [x_1, x_2, \cdots, x_n]$, $x \in \mathbb{R}^d$, be a data set of $n$ observations, each containing the values of $d$ different KPIs. The method starts from the null hypothesis that the data of $X$ come from a single multivariate Gaussian distribution $N(\mu, \Sigma)$. A test level $\alpha$, e.g. $\alpha = 0.05$, is pre-specified, against which the hypothesis is finally tested.
Let $C_1$ and $C_2$ be two disjoint sets resulting from the application of a clustering algorithm to the data points contained in $X$, i.e. $C_1 \cup C_2 = \{1, 2, \cdots, n\}$. Then, an indicator of the strength of the clusters can be obtained through the Cluster Index (CI), which is used as the test statistic of the method:
$$CI = \frac{\sum_{k=1}^{2} \sum_{i \in C_k} \| x_i - \bar{x}_k \|^2}{\sum_{i=1}^{n} \| x_i - \bar{x} \|^2},$$
where $\bar{x}_k$ is the mean of cluster $k \in \{1, 2\}$ and $\bar{x}$ is the overall mean. Estimates $(\hat{\lambda}_1, \cdots, \hat{\lambda}_d)$ for the eigenvalues of $\Sigma$ must also be computed. Let $(\tilde{\lambda}_1, \cdots, \tilde{\lambda}_d)$ be the eigenvalues of the sample covariance matrix. In the soft-thresholding approach of [18], the eigenvalues of $\Sigma$ are estimated by minimizing the negative log-likelihood of the sample covariance, expressed through the corresponding precision matrix, subject to positive semi-definiteness constraints and to an additional constraint, governed by a tuning parameter $M \geq 0$, that controls the signal versus the noise of the data; the soft threshold $\tau$ is obtained by solving the corresponding constraint equation. Using the estimated eigenvalues, the theoretic optimal CI value is obtained. The rest of the process comprises the following four steps: 1) Initially, data from the null distribution are simulated: $(x_1, \ldots, x_d)$ are independent, with $x_j \sim N(0, \max(\hat{\lambda}_j, \hat{\sigma}_N^2))$. 2) Then, these data are clustered using the k-means algorithm with $k = 2$, and the corresponding CI value is calculated. 3) By repeating this process a large number of times, an empirical distribution of CI values is obtained.
Using the CI values obtained by the simulation, a p-value for the CI value of $X$ is calculated. 4) Finally, a conclusion is derived based on the test level $\alpha$. Using the p-value from the final step, the following hypotheses are tested:
• $H_0$: the clusters come from the same distribution ($p > 0.05$), or
• $H_1$: the clusters come from different distributions ($p \leq 0.05$).
To apply this method to distinguishing between different mitigation action sets, the following procedure is followed: Let $X = [x_1, x_2, \cdots, x_n]$, $x \in \mathbb{R}^d$, be a data set of $n$ historical observations, each containing the values of $d$ different KPIs, with $x_n$ being the latest observation. Let $x_{n+1}$ be the KPI values occurring from modifying the mitigation actions that produced $x_n$. A clustering algorithm is applied to all data, resulting in a partition of clusters $C = \{C_1, C_2, \cdots, C_j\}$, $j \leq n$, over all data points.
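A minimal Python sketch of the Monte Carlo core of this test follows. It is illustrative rather than the exact implementation of [18]: the function names (cluster_index, sigclust_pvalue), the MAD-based noise estimate, and the use of the raw sample eigenvalues with a noise floor in place of the full soft-threshold estimator are simplifying assumptions.

import numpy as np
from sklearn.cluster import KMeans

def cluster_index(X, labels):
    # Two-cluster CI: within-cluster sum of squares over total sum of squares.
    total = np.sum((X - X.mean(axis=0)) ** 2)
    within = sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
                 for k in np.unique(labels))
    return within / total

def sigclust_pvalue(X, n_sim=1000, random_state=0):
    # Simulate the single-Gaussian null, 2-means cluster each draw, and
    # compare the empirical CI distribution with the observed CI.
    rng = np.random.default_rng(random_state)
    n, d = X.shape
    obs_labels = KMeans(n_clusters=2, n_init=10,
                        random_state=random_state).fit_predict(X)
    obs_ci = cluster_index(X, obs_labels)
    # Null eigenvalues: sample covariance spectrum with a noise floor
    # (a crude stand-in for the soft-threshold estimates of [18]).
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    sigma2_n = np.median(np.abs(X - np.median(X))) ** 2  # MAD-based noise level
    null_eigvals = np.maximum(eigvals, sigma2_n)
    null_cis = np.empty(n_sim)
    for i in range(n_sim):
        Z = rng.normal(size=(n, d)) * np.sqrt(null_eigvals)
        labels = KMeans(n_clusters=2, n_init=10, random_state=i).fit_predict(Z)
        null_cis[i] = cluster_index(Z, labels)
    # A smaller CI means stronger clustering, so the p-value is the fraction
    # of null draws that cluster at least as strongly as the observed data.
    return float(np.mean(null_cis <= obs_ci))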
Let $C_A$ and $C_B$ be the clusters to which the data points $x_n$ and $x_{n+1}$ belong. If the size $S$ of a cluster is smaller than a predefined minimum size $C$, the following correction is applied: the mean and standard deviation of each KPI are calculated and used to create $C - S$ synthetic data points for the cluster, drawn from an isotropic Gaussian distribution, i.e. a Gaussian distribution whose covariance matrix has the simplified form $\Sigma = \sigma^2 I$.
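The cluster size correction can be sketched as follows; pad_small_cluster is an illustrative name, and collapsing the per-KPI standard deviations into a single isotropic sigma by averaging them is our assumption, since the derivation of sigma is not stated above.

import numpy as np

def pad_small_cluster(points, min_size, random_state=0):
    # If a cluster has S < C members, append C - S synthetic points drawn
    # from an isotropic Gaussian (covariance sigma^2 * I) fitted to the
    # cluster's per-KPI mean and standard deviation.
    rng = np.random.default_rng(random_state)
    s, d = points.shape
    if s >= min_size:
        return points
    mu = points.mean(axis=0)
    sigma = points.std(axis=0).mean()  # assumption: average per-KPI deviations
    synthetic = rng.normal(loc=mu, scale=sigma, size=(min_size - s, d))
    return np.vstack([points, synthetic])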
Proceeding with the process, when $C_A = C_B$ the case is trivial and no testing is required, since the points belong to the same cluster. Otherwise, let $X_{cluster} \subset X$ be the subset of all points that belong to either $C_A$ or $C_B$. Then, the Statistical Significance of Clustering methodology described above is applied as is, answering the question: "Are $C_A$ and $C_B$ different in terms of their underlying distribution, and is this difference statistically significant?"
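Putting the pieces together, the decision flow of this section might look like the following sketch, which reuses the hypothetical sigclust_pvalue function from the earlier listing:

import numpy as np

def compare_mitigation_sets(X, labels, idx_current, idx_modified):
    # Trivial case: the current and the modified KPI points share a cluster.
    c_a, c_b = labels[idx_current], labels[idx_modified]
    if c_a == c_b:
        return {"different": False, "p_value": None}
    # Otherwise, run SigClust on the union of the two clusters involved.
    subset = X[(labels == c_a) | (labels == c_b)]
    p = sigclust_pvalue(subset)
    return {"different": p <= 0.05, "p_value": p}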

IV. PERFORMANCE EVALUATION
This section evaluates the performance of the Hypothesis Testing tool under two different scenarios. The first scenario showcases the approach on a data set with clear cluster memberships; additionally, it is used to investigate the sensitivity of the approach to changes in different parameters of the clustering algorithm and in the size of the underlying data set.
The second scenario investigates the operation and effectiveness of the proposed method on a data set with noise and no clear cluster membership. Table III shows the steps of the algorithm used for Hypothesis Testing.
The Hypothesis Testing tool was written in Python. The experiments using the Jackstraw method were performed with the implementation described in [19], using the R Statistical Language.

A. Experimental setup
For the clustering of the KPI values, the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) [20] algorithm was used. This machine learning algorithm has been shown to operate well in the presence of data with noise and outliers, successfully recognizing clusters with arbitrary shapes, different sizes and dissimilar densities.
Two sets of four KPI values are examined: one is completely synthetic and free of noise, while the second is obtained by simulating a scenario where 150 devices are under attack, i.e. 30 of each of the device types presented in Table I.
The HDBSCAN algorithm is heavily affected by the choice of a parameter that indicates the minimum number of members a cluster must have to be considered valid. To determine this number, an arbitrarily chosen integer m, the maximum value of the minimum cluster membership, was selected. Then, for each minimum cluster membership value M ∈ {2, 3, · · · , m}, HDBSCAN was used to cluster the data and the Variance Ratio Criterion score V [21] was calculated: a higher value indicates that clusters are dense and well separated.
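As an illustration, this tuning loop could be implemented as follows, using the hdbscan package and scikit-learn's Calinski-Harabasz (Variance Ratio Criterion) score; scoring only the non-noise points is our assumption, since HDBSCAN labels outliers as -1 and the criterion is undefined for them.

import hdbscan  # pip install hdbscan
import numpy as np
from sklearn.metrics import calinski_harabasz_score

def tune_min_cluster_size(X, m_max=50):
    # Sweep min_cluster_size over {2, ..., m_max} and keep the value
    # maximising the Variance Ratio Criterion score V.
    best_m, best_v = None, -np.inf
    for m in range(2, m_max + 1):
        labels = hdbscan.HDBSCAN(min_cluster_size=m).fit_predict(X)
        mask = labels != -1  # drop points HDBSCAN marks as noise
        if len(np.unique(labels[mask])) < 2:
            continue  # the criterion needs at least two clusters
        v = calinski_harabasz_score(X[mask], labels[mask])
        if v > best_v:
            best_m, best_v = m, v
    return best_m, best_v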
The proposed method is compared to a similar state-of-the-art (SoA) method originating from the field of bio-informatics, termed the Jackstraw method [19], as implemented in [22]. This method calculates the significance of cluster membership instead of the difference between clusters. It produces a p-value to test the following hypotheses:
• $H_0$: the point examined belongs to the same distribution as the other members of its cluster ($p > 0.05$), or
• $H_1$: the point comes from a cluster with a different distribution ($p \leq 0.05$).
To compare two clusters using the Jackstraw method, the following assumption was used for the experimental results presented: two clusters have a statistically significant difference only if the Jackstraw algorithm correctly ascertains cluster membership for the members of both clusters.

B. Experimental Results
In all experiments, it is assumed that a data point, i.e. the current KPI values of the system, belongs to a cluster; then a data point representing a modified version of the original, occurring from a set of modified mitigation actions, is tested for differences.

C. Experiment 1: Synthetic data-set from three isotropic distributions
In this experiment, the proposed Hypothesis Testing tool is applied to synthetic data with clear cluster memberships, to investigate the sensitivity of the SigClust and HDBSCAN algorithms to changes in their underlying parameters. Three clusters are used as the most trivial case. Additionally, experiments with different numbers of iterations for the Monte Carlo process were carried out, and the run-time of each is reported, in order to ascertain an appropriate number for the second experiment.
The synthetic data set was created by sampling three discrete isotropic Gaussian distributions with pre-specified centres and deviations, shown in Table II. Data sets of different sizes were created to determine the data set size needed for the method to produce correct results. Experiments were performed with the following data set sizes: N = [200, 500, 1000, 2500, 3000, 4000, 5000, 10000], with equally sized clusters in each case. For each data set, the Monte Carlo process was run with different numbers of iterations. In all cases, the expected result, i.e. statistical significance only when the clusters were different, was found when the data set size was larger than N = 1000, for any number of iterations. However, for N ≤ 1000, the algorithm cannot correctly distinguish between members of the same cluster.
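The data generation step can be sketched as follows; the centres and deviations below are illustrative placeholders, not the actual values of Table II.

import numpy as np

# Placeholder centres and deviations for three clusters in 4-D KPI space.
CENTRES = np.array([[0.0, 0.0, 0.0, 0.0],
                    [5.0, 5.0, 5.0, 5.0],
                    [10.0, 0.0, 10.0, 0.0]])
SIGMAS = [0.5, 0.5, 0.5]

def make_dataset(n_total, seed=0):
    # Sample three equally sized isotropic Gaussian clusters.
    rng = np.random.default_rng(seed)
    per_cluster = n_total // len(CENTRES)
    X = np.vstack([rng.normal(loc=c, scale=s, size=(per_cluster, CENTRES.shape[1]))
                   for c, s in zip(CENTRES, SIGMAS)])
    y = np.repeat(np.arange(len(CENTRES)), per_cluster)
    return X, y

# One data set per size examined in the experiment:
datasets = {n: make_dataset(n) for n in [200, 500, 1000, 2500, 3000, 4000, 5000, 10000]}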
For the experiments, three different cases are examined: in the first case, both the original and the modified data point belong to the same cluster (Cluster A). In the second and third cases, the two data points belong to different clusters: A and B, and A and C, respectively. In all cases, the clusters have an equal number of members. Table III contains the results for N = 500. In the case where two points from the same cluster are examined, all results are False, i.e. the test fails to recognize that the points belong to the same cluster. However, the opposite holds when comparing points from different clusters: for all numbers of iterations, the algorithm correctly indicates that statistical significance is found, i.e. the points come from different clusters.

D. Experiment 2: Simulated data-set
In this experiment, the tool is applied to a data set with noise and no clear cluster membership. This data set was produced by applying the mitigation selection algorithm presented in [12] to the mitigation rules found in Table I. The data examined in this experiment were found to optimally contain ten clusters, with a minimum of 150 members (Variance Ratio Criterion score V = 9240.615), while many of the points were determined to be noise, i.e. not assigned to any of the clusters. Figure 2 shows the distribution of the data and the clusters found, while Table V contains the number of points belonging to each cluster. The data set has 6300 unique data points, and in some cases the clusters do not exceed the threshold of N > 1000 determined to be needed in Experiment 1.
For this experiment, all cluster pairs were cross-examined for significance (47 unique pairs), using both the SigClust and the Jackstraw methods. The algorithm was initially tested without the cluster size correction, and correctly assessed only 80.43% of the pairs: the majority of the assessments are correct when different clusters are compared, but erroneous assessments occur when members of the same cluster are compared. Since there are clusters with a membership count lower than 1000, the cluster size correction was then applied: for each such cluster, enough synthetic data points are created, using isotropic Gaussian distributions, to reach the membership threshold C = 1000. These data were only used in the case of a comparison of a cluster with itself, and not for clustering or for inter-cluster comparisons. Using the correction, the number of correct assessments rose to 95.65%. The proposed method with the correction outperforms both the method without the correction and the SoA method, which achieves 89.13% accuracy. Finally, the F1 scores for each method are: 0.8889 for the corrected method, 0.3077 for the method without the correction, and 0.7619 for the Jackstraw method.

V. CONCLUSIONS
This paper presented a novel mechanism to ascertain the difference between a starting and a modified set of mitigation actions. This mechanism allows the system operator to explore the effects of different mitigation plans on the system, in terms of four cyber-security KPIs, using statistics and machine learning tools.
To reiterate, experiments show that the proposed tool achieves 95.65% accuracy in discerning whether a statistically significant difference exists between different mitigation plans, using a data set with numerous outliers. The algorithm outperforms a similar SoA method by 6.52 percentage points in terms of accuracy.
The Hypothesis Testing tool is currently being integrated with the system developed in the SerIoT project [1], and more specifically with an approach using Software Defined Networking (SDN) controllers, an automated Mitigation Engine and a Visual Analytics Dashboard that allows a user-friendly overview and operation in cases where manual intervention in the system is required.
Finally, we plan to further augment the proposed method in order to improve its performance and usability by introducing two enhancements. The first is a mechanism that will automatically fine-tune the model used to cluster the mitigation action sets. The second will involve experimentation with ensemble methods that combine the results of the SigClust and Jackstraw models to obtain a method with improved performance.