Dataset Reduction Framework For Intelligent Fault Detection In IoT-based Cyber-Physical Systems Using Machine Learning Techniques

Intelligent Fault Detection (IFD), the use of machine learning-based methods and algorithms for fault detection in modern systems, has become important due to the large volume of data generated by the devices embedded in such systems. A typical example is Internet of Things (IoT)-based Cyber-Physical Systems (CPS), where IoT devices enable better monitoring and control but, due to their nature, are also susceptible to component faults. IFD depends on the amount of data generated in such systems and on its representation in terms of system characteristics (features). Instance-based dataset reduction schemes used in Machine Learning (ML) aim to reduce the volume of data required during training while preserving testing accuracy. Such reductions lower the storage and processing time required for the trained models, which enables the use of lightweight IFD approaches in the embedded devices found at the core of IoT-based CPS. In this work, we propose a machine learning-based framework for instance-based dataset reduction applied to IFD models. The proposed framework is experimentally evaluated over two datasets. Results show that a reduction of up to 15.51% is possible, with an average accuracy improvement of 17% on the set of evaluated classification algorithms.


I. INTRODUCTION
Cyber-Physical Systems (CPS) are a combination of computational, networking and physical processes. The extensive integration of CPS in critical infrastructures has elevated their role in ensuring economic development [1]. Hence, their resilience has become of utmost importance in all aspects of modern life. The proliferation of IoT technologies and their integration within CPS enables better monitoring, control and management of these systems. The constant communication of IoT devices continuously generates a large amount of data, which makes machine learning-based approaches applicable. At the same time, due to the nature of the integrated IoT devices, these systems are increasingly susceptible to component faults [2].
IFD refers to the use of machine learning methods to build models that detect faults. Such models depend on data collection and on the representation of instances in terms of system characteristics (features). Traditional artificial feature selection procedures [3] use machine learning methods to select the features that best represent the instances of a dataset, while more recent approaches focus on the data collection procedure itself [4]. To the best of our knowledge, no existing work on IFD models takes into consideration instance-based dataset reduction, a machine learning technique that reduces the volume of data needed during training and leads to less storage and processing time for the trained models.
Fault detection in CPS has emerged as a challenging task due to the heterogeneity and large scale of such systems and the complexity of defining faulty behaviour, which is a dynamic problem. Some approaches [5] utilise sensor and alarm data, characterising the new era of IoT-based CPS. Such IoT solutions for fault detection combine machine learning approaches with human expertise, while more recent approaches focus on the use of Deep Learning for fault detection. These procedures are based on two steps: big data collection, referring to the generation of large amounts of data using cloud-based solutions, and deep learning-based diagnosis and detection, which learns features from the data and recognises the healthy state of the system [6].
Dataset reduction is a technique used in machine learning that aims to reduce the volume of data in a dataset. One way of doing this is instance-based dataset reduction. In this approach, the dataset is reduced by removing instances that are used as part of the training, which reduces the training time and computational resources required for the trained models while preserving or improving testing accuracy. An example of this approach is applied with instance-based classification algorithms, which perform their learning process at the instance level. These processes try to approximate the unknown function of the trained model by assigning class labels to the actual instances, and they remove redundant instances without degrading model accuracy [7]. Another form of dataset reduction removes unnecessary instances for algorithms that build decision boundaries (hyperplanes), where the instances removed are those that lie far from the constructed boundaries, as in the Support Vector Machine (SVM) algorithm [8].
In this work we propose a generic ML-based dataset reduction framework inspired by the fault list reduction methodologies used in digital systems. More specifically, our framework combines techniques for feature evaluation and ranking, such as Information Gain, with the notion of fault dominance. The proposed instance-based dataset reduction aims to reduce the volume of data needed during training while maintaining or increasing the testing accuracy. The main motivation of our framework is to derive a machine learning model for fault detection that requires less processing time and fewer computational resources, and that can therefore be applied in the embedded devices found at the core of IoT-based CPS. The proposed framework is evaluated on a dataset from a simulated IoT-based CPS environment [9] and on a power system testbed model [10].

II. PROBLEM DEFINITION
We consider a dataset $TR$ consisting of instances from the set of classes $\{N, F_1, F_2, F_3, \ldots, F_n\}$ used for IFD models. The classes are the normal/healthy class $N$ and a set of $n$ abnormal faulty classes, each denoted by $F_i$. Moreover, let $Alg = \{A_1, A_2, A_3, A_4, \ldots, A_m\}$ be a set of $m$ machine learning classification algorithms that are used for IFD models based on $TR$. Each model of an algorithm $A_i$ has an accuracy, denoted by $ac^{TR}_{A_i}$. Based on Table I, accuracy is computed by counting the number of correct predictions made in the test phase of each model, which is trained in advance with a subset of the instances of the initial dataset. Here $a$ and $d$ are the correct predictions made by the classifier and $b$ and $c$ the incorrect ones. Accuracy is a percentage metric (%) and is computed as $ac = \frac{a+d}{a+b+c+d}$. $ac^{TR}_{\mu}$ denotes the average accuracy achieved by the set of classification algorithms $Alg$ and the models constructed based on $TR$.
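As a minimal sketch of the accuracy definition above, the following Python computes the per-model accuracy from the four confusion-matrix counts and the average over a set of algorithms. The function names and the example counts are illustrative assumptions, and the average is assumed to be the arithmetic mean over $Alg$.

```python
def accuracy(a: int, b: int, c: int, d: int) -> float:
    """Accuracy in percent: a and d are correct predictions, b and c incorrect."""
    total = a + b + c + d
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return 100.0 * (a + d) / total


def average_accuracy(per_algorithm: dict[str, float]) -> float:
    """Arithmetic mean of the per-algorithm accuracies (assumed meaning of ac_mu)."""
    if not per_algorithm:
        raise ValueError("no algorithms given")
    return sum(per_algorithm.values()) / len(per_algorithm)


# Hypothetical counts, not taken from the paper: 27 correct out of 30 predictions.
print(accuracy(a=12, b=1, c=2, d=15))

# Hypothetical per-algorithm accuracies for two algorithms.
print(average_accuracy({"A1": 90.0, "A2": 94.0}))
```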

Problem II.1 (Dataset Reduction). Given a dataset $TR$ consisting of instances from the set of classes $\{N, F_1, F_2, F_3, \ldots, F_n\}$ and a set of classification algorithms $Alg = \{A_1, A_2, A_3, A_4, \ldots, A_m\}$ with average accuracy $ac^{TR}_{\mu}$, we aim to find a reduced dataset $TR'$ consisting of instances from the set of classes $\{N, F_1, F_2, F_3, \ldots, F_{n'}\}$ with $n' \le n$, such that for each algorithm $A_i$ we have $ac^{TR'}_{A_i} \ge ac^{TR}_{A_i}$. This entails $ac^{TR'}_{\mu} \ge ac^{TR}_{\mu}$. Thus we aim to find a reduction that maintains or increases the testing accuracy, both per algorithm and on average, by removing instances from the training dataset.
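The step from the per-algorithm condition to the average can be made explicit by averaging the $m$ inequalities. The short derivation below writes $TR'$ for the reduced dataset and assumes, as a labelled assumption, that the average accuracy is the arithmetic mean over the algorithms in $Alg$:

```latex
ac^{TR'}_{\mu}
  = \frac{1}{m} \sum_{i=1}^{m} ac^{TR'}_{A_i}
  \;\ge\; \frac{1}{m} \sum_{i=1}^{m} ac^{TR}_{A_i}
  = ac^{TR}_{\mu},
\qquad \text{since } ac^{TR'}_{A_i} \ge ac^{TR}_{A_i} \text{ for every } i = 1, \ldots, m .
```

The converse does not hold in general: an improved average can hide a drop for a single algorithm, which is why the problem statement imposes the condition per algorithm.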

III. PROPOSED FRAMEWORK
Our proposed framework consists of three steps, as illustrated in Fig. 1. The framework deals with datasets containing healthy (normal) and faulty (abnormal) instances. The proposed reduction is performed by removing instances of the faulty classes, thus reducing the size of the dataset. This is done with respect to Problem II.1, in order to achieve higher accuracy both per algorithm and on average. We now describe each step and analyse its complexity as a function of the number of faulty classes in the dataset, $n$.
Possible dominant relation cases are pairs of faulty classes that have at least half of their most significant features in common, so that their coexistence in the dataset $TR$ might not contribute to improving the accuracy of the proposed models. Thus, the necessary condition for extracting these cases is that two faulty classes $(F_i, F_j)$ share at least half of their most significant features. Specifically, let $CF_{i,j}$ denote the set of features that are common among the most significant features of the two faulty classes $(F_i, F_j)$, as derived from the previous step, and let $|CF_{i,j}|$ denote the number of these common features. If $|CF_{i,j}| \ge \lceil |MSF_{F_i}|/2 \rceil$, a possible dominant case exists between the faulty classes $(F_i, F_j)$. The complexity of this step is $O(n^2)$, as we need to examine all pairs of faulty classes.
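The pairwise check for possible dominant cases can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the input is a hypothetical mapping from each faulty class to its list of most significant features, the feature names are invented, and the threshold uses the feature list of the first class of each pair, which is symmetric when every class keeps the same number of features.

```python
from itertools import combinations
from math import ceil


def possible_dominant_cases(msf: dict[str, list[str]]) -> list[tuple[str, str, set[str]]]:
    """Return (F_i, F_j, CF_ij) for every pair sharing at least half of the top features."""
    cases = []
    # Examining every unordered pair of faulty classes gives the O(n^2) cost.
    for f_i, f_j in combinations(sorted(msf), 2):
        common = set(msf[f_i]) & set(msf[f_j])      # CF_{i,j}
        threshold = ceil(len(msf[f_i]) / 2)         # ceil(|MSF_{F_i}| / 2)
        if len(common) >= threshold:
            cases.append((f_i, f_j, common))
    return cases


# Hypothetical rankings with four features per class (names are made up).
example = {
    "F1": ["f1", "f2", "f3", "f4"],
    "F2": ["f1", "f2", "f5", "f6"],
    "F3": ["f7", "f8", "f9", "f1"],
}
for f_i, f_j, common in possible_dominant_cases(example):
    print(f_i, f_j, sorted(common))
```

In this toy input only F1 and F2 share two of four features, so they form the single candidate pair passed on to the experimental evaluation.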

3) Step 3: Experimental Evaluation Of Cases Using ML Algorithms: Based on the possible dominant relation pairs of classes derived from Step 2, in this step we define the dominant operator, which allows us to determine whether a reduction of faulty instances is possible. Let $TW_j$ be the dataset consisting of all the instances in $TR$ except the instances of class $F_j$, and let $F_{TRAIN}$ be the set of classes without $F_j$. Given a classification algorithm $A_i \in Alg$, let $ac^{TR}_{A_i}$ be the accuracy obtained with all the faulty instances included in training, and $ac^{TW_j}_{A_i}$ the accuracy of the same algorithm $A_i$ with the instances of $F_j$ used only as part of the testing set. The instances of $F_j$ can then be removed from $TR$ if $ac^{TW_j}_{A_i} \ge ac^{TR}_{A_i}$. Specifically, the dominant operator $\gg_{Alg}$ over the set of all classification algorithms is defined as:
Definition III.1 (Dominant Operator). The dominant operator $\gg_{Alg}$, using a set of machine learning classification algorithms $Alg = \{A_1, A_2, \ldots, A_m\}$, is such that, given datasets $TR$ and $TW_j$ and accuracies $ac^{TW_j}_{A_i}$ and $ac^{TR}_{A_i}$, the relation holds when $ac^{TW_j}_{A_i} \ge ac^{TR}_{A_i}$ for every $A_i \in Alg$.
In such a case, a model for detecting faults using the set of algorithms in $Alg$ can be trained using only the instances of $TW_j$ while still being able to detect faults in the system, with higher accuracy both per algorithm and on average. For each pair of classes in a case, the procedure described above is applied to each class separately. The complexity of this step is $O(n^2)$, as the number of cases in terms of pairs can be up to $n(n-1)$.
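A minimal sketch of this evaluation step is given below. It assumes the accuracies have already been measured externally (for example with a tool such as WEKA) and are supplied as plain dictionaries; the helper names, the instance representation and all numeric values are hypothetical.

```python
def without_class(dataset, label):
    """Build TW_j: every (features, class) instance of TR except those labelled F_j."""
    return [(features, cls) for features, cls in dataset if cls != label]


def keeps_accuracy(acc_tr: dict[str, float], acc_tw_j: dict[str, float]) -> bool:
    """True when training without F_j does not lower accuracy for any algorithm."""
    return all(acc_tw_j[alg] >= acc_tr[alg] for alg in acc_tr)


def removable_classes(cases, acc_tr, acc_tw):
    """Apply the check to each class of each candidate pair, separately."""
    removable = []
    for f_i, f_j in cases:
        for f in (f_i, f_j):
            if f not in removable and keeps_accuracy(acc_tr, acc_tw[f]):
                removable.append(f)
    return removable


# Hypothetical accuracies (percent) for two algorithms and one candidate pair.
acc_tr = {"NB": 90.0, "J48": 92.0}                  # trained on the full TR
acc_tw = {
    "F1": {"NB": 91.0, "J48": 92.0},                # trained on TR without F1
    "F2": {"NB": 89.0, "J48": 95.0},                # trained on TR without F2
}
print(removable_classes([("F1", "F2")], acc_tr, acc_tw))
```

Here F2 is rejected even though J48 improves, because NB gets worse; the per-algorithm condition must hold for every member of the set. If both classes of a pair pass the check individually, this sketch reports both, and whether they can be removed together would need a further evaluation.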

IV. IOT EXPERIMENTAL SETUP AND RESULTS
We first evaluate our proposed framework in a simulated IoT environment, as described in [9]. In that work, the authors consider an IoT-enabled Energy Aware Smart Home (EASH). The communication environment of this system is simulated in the OPNET simulator, and communication is performed using the Zigbee protocol. The topology is a star, where peripheral monitoring elements report energy consumption measurements to a central coordinator every minute. The dataset consists of normal, faulty and attack scenarios, but for our evaluation we consider only the instances of the faulty scenarios. The faulty classes are F1: Low Energy Failure, F2: Routing Failure and F3: Packet Dropped Failure. We keep the same set of classification algorithms as the one used for the experiments in that work, together with the same experimental tool, the Waikato Environment for Knowledge Analysis (WEKA) [11]. Thus $Alg = \{$NaiveBayes (NB), J48, Multilayer Perceptron (MLP), Multinomial Logistic Regression (MLR)$\}$. To ensure that the set $Alg$ is representative enough, we choose four different algorithms that belong to three separate categories of classification algorithms: tree-based, function-based and probabilistic. The baseline average accuracy $ac^{TR}_{\mu}$, where $TR$ consists of instances from the set of classes $\{N, F_1, F_2, F_3\}$, equals 95.8%. The dataset contains 120 instances, of which 48 are normal instances, and each faulty class has 24 instances. Evaluation was performed using the percentage-split approach: we use 75% of the data as training data and the remaining 25% as testing data. Below we explain how the steps of the proposed framework are applied using this dataset.
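The dataset sizes quoted above can be worked through with a few lines of arithmetic. The sketch below only reproduces the counts; it does not model how WEKA orders or randomises instances before splitting, and the choice of F1 as the removed class is an arbitrary assumption made for illustration.

```python
# Class sizes stated in the text: 48 normal instances and 24 per faulty class.
counts = {"N": 48, "F1": 24, "F2": 24, "F3": 24}


def percentage_split(total: int, train_fraction: float = 0.75) -> tuple[int, int]:
    """Sizes of the training and testing parts for a percentage split."""
    train = round(total * train_fraction)
    return train, total - train


def reduction_percentage(class_counts: dict[str, int], removed_class: str) -> float:
    """Share of the dataset (in percent) removed by dropping one class."""
    return 100.0 * class_counts[removed_class] / sum(class_counts.values())


total = sum(counts.values())
train_size, test_size = percentage_split(total)
print(total, train_size, test_size)
print(reduction_percentage(counts, "F1"))
```

Dropping any single faulty class removes 24 of the 120 instances, which is consistent with the 20.0% reduction reported for this dataset in the conclusion.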
• Step 1: Features Ranking For Faulty Classes: Feature ranking for the faulty classes of the dataset is performed using the Information Gain ranking scheme, as defined in the description of the framework. For each class we keep the ten most significant features in rank order; these are shown in Table III.
No dominant relation can be derived in this case, as the accuracy is not improved in either scenario for all classification algorithms. Based on the experimental evaluation performed above, dataset $TR$ can be reduced to $TW_4$ by removing the F4 instances (Case #1), as this was the only scenario that showed an accuracy improvement for all experimental algorithms. The average accuracy on the reduced dataset, which consists of instances from the set of classes $\{N, F_1, F_2, F_3, F_5, F_6\}$, equals 90.46%. Thus, the reduction leads to an accuracy improvement of 12.4% on average with a dataset reduction of 15.51%. With a bigger dataset, we also observed that the algorithms that are not based on function optimisation (J48 and NB) are further improved by reduction schemes.
VI. CONCLUSION
This paper presents preliminary results on the dataset reduction framework proposed for the application domain of IFD in datasets from IoT-based CPS. Results show that the proposed framework improved the accuracy of the models by removing redundant instances from the actual datasets. Specifically, an improvement of 12.4% on average, with a dataset reduction of up to 15.51%, was achieved on the large dataset we examined for the power system testbed. Moreover, for the simulated IoT-based dataset, the reduction achieved was 20.0%, with an average accuracy improvement of 4.2%. Such reductions enable faster model training and lower storage requirements, leaving room for lightweight model solutions.
In future work, we aim to examine additional approaches for dataset reduction that consider the removal of instances without focusing on their class. Moreover, we aim to expand our framework and experiments to deal with other sources of abnormalities affecting such systems, such as attacks, and to examine the ability of our framework on intrusion detection datasets.

VII. ACKNOWLEDGEMENT
This work has been supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 739551 (KIOS CoE) and by the Republic of Cyprus through the Directorate General for European Programmes, Coordination and Development.