Efficient Cyber Attacks Detection in Industrial Control Systems Using Lightweight Neural Networks

Industrial control systems (ICSs) are widely used and vital to industry and society. Their failure can have a severe impact on both the economy and human life. Hence, these systems have become an attractive target for attacks, both physical and cyber. A number of attack detection methods have been proposed; however, they are characterized by an insufficient detection rate, a substantial false positive rate, or are system specific. In this paper, we study an attack detection method based on simple and lightweight neural networks, namely, 1D convolutions and autoencoders. We apply these networks to both the time and frequency domains of the collected data and discuss the pros and cons of each approach. We evaluate the suggested method on three popular public datasets and achieve detection metrics matching or exceeding previously published detection results, while featuring a small footprint, short training and detection times, and generality.


INTRODUCTION
Industrial control systems (ICSs), also known as supervisory control and data acquisition (SCADA), combine distributed computing with physical process monitoring and control. They comprise elements providing feedback from the physical world (sensors), elements influencing it (actuators), and networks of computers and controllers that process the feedback data and issue commands to the actuators. Many ICSs are safety critical, and an attack interfering with their functionality can cause substantial financial and environmental harm and endanger people's lives.
The importance of ICSs makes them an attractive target for attacks, particularly cyber attacks. Several high impact incidents of this kind have been reported in recent years, including the attack on a power grid in Ukraine [44], the infamous Stuxnet malware that targeted nuclear centrifuges in Iran [24], and attacks on a Saudi oil company [9]. In the past, ICSs ran on proprietary hardware and software in physically secure locations, but more recently they have adopted a common information technology (IT) stack and remote connectivity. This trend exposes ICSs to cyber threats that leverage attack tools built for the common technology stack. At the same time, the ICS defender's toolbox is limited due to the need to support legacy protocols built without modern security features, as well as the inadequate processing capabilities of the endpoints. This problem can be addressed by utilizing traditional IT network-based intrusion detection systems (IDS) to identify malicious activity, which does not rely on endpoint computational resources. However, the very low number of known attacks on ICSs renders this approach ineffective. Alternatively, model-based methods have been proposed to detect the anomalous behavior of the monitored ICS [22,31,37,43]. Unfortunately, creating an accurate model of complex physical processes is a very challenging task. It requires an in-depth understanding of the system and its implementation, and is time consuming and difficult to scale. Thus, recent studies have utilized machine learning to model the system. Some of them used supervised learning [3,18] and achieved high precision; however, the supervised learning approach is limited to the modeled attacks only. To overcome this obstacle, a number of other studies used unsupervised deep neural networks for detecting anomalies and attacks in ICS data [13,21,27].
Kravchik and Shabtai [23] suggested using unsupervised neural networks based on 1D CNNs and demonstrated the detection of 31 of 36 cyber attacks in the popular SWaT dataset [12], improving upon previously published results. This paper aims to answer the following questions, which were not addressed in [23].
• Can the generality and effectiveness of 1D CNN be validated using more datasets, preferably from different system types?
• Can the detection of an anomaly be interpreted in such a way that it provides actionable insights to the system operator, namely how to pinpoint the attacked sensors or actuators?
• Are there other lightweight alternative neural network architectures?
• Will detection in the frequency domain provide any benefits compared to detection in the time domain?
The contributions of this paper are as follows.
• An effective and generic method for detecting anomalies and cyber attacks in ICS data using 1D CNN and undercomplete autoencoders (UAE). The method was validated on three public datasets and achieved better performance than previously published results.
• A method for robust feature selection based on the Kolmogorov-Smirnov test.
• A method for attack detection explanation (attack localization) using neural network-based models.
• The efficient application of the above-mentioned method to the frequency domain, which provides high detection scores.
The rest of this paper is organized as follows. In section 2 we present the necessary background on ICSs, the considered threat model, relevant neural network architectures, and time-frequency transformation. In section 3 we survey related work, focusing on research into detection based on the physical system state. Section 4 introduces the datasets we used for validation. We describe our detection and scoring method in section 5. The experiments and their results are described in section 6, and section 7 concludes the paper.

BACKGROUND

Industrial Control Systems
A typical ICS combines network-connected computers with physical processes, which are both controlled by these computers and provide the computers with feedback. The key components of an ICS include sensors and actuators that are connected to a local computing element, commonly called a programmable logic controller (PLC) or remote terminal unit (RTU). Sensors and actuators are usually connected to the PLC with a direct cable connection, and commands are sent to the PLC via a local networking protocol, such as CAN, Profibus, DNP3, IEC 61850, IEC 62351, Modbus, or S7. PLCs of the remote nodes are connected to the central control unit via protocols such as TCP over a wireless, cellular, or wired network. The connection is made with the help of a data acquisition system (DAS) that bridges between the local and remote networking protocols. The central control node contains a master terminal unit (MTU) that applies the control logic to RTUs and provides management capabilities to a human machine interface (HMI) computer. A historian server that collects and stores data from the RTUs is another important ICS component. An additional common element of ICSs is an engineering workstation running SCADA software that provides a means to both control the PLCs and change their internal logic.

Attacks on ICSs and Threat Model
The central role of ICSs in critical infrastructure, medical devices, transport, and other areas of human society makes them an attractive target for attacks. Motives for such attacks are diverse and include criminals seeking control of an important asset or blackmailing a victim, industrial espionage and sabotage, political reconnaissance, cyber war, and privacy evasion.
An ICS can be attacked using several attack vectors, including software, hardware, communication protocols, the physical environment, and human elements. In this research, we don't assume any specific attack vector. Our threat model considers a powerful adversary that is able to influence the physical state of the protected ICS. This can be achieved in numerous ways. Manipulating the network traffic between the PLC and actuators allows overriding the commands sent to them. Manipulating the network traffic from sensors to the PLC allows spoofing the actual physical state of the controlled process and tricks the PLC into issuing incorrect commands. The adversary can also achieve their goals by altering the PLC's internal state using the SCADA-PLC communication interface or by changing the set points of the controlled process via the same interface. A schematic ICS architecture, its control and remote segments, and different attack locations are illustrated in Figure 1.
In this research, we apply a physics-based attack detection approach. The central idea of this approach is that the behavior of the protected system complies with immutable laws of physics and therefore can be modeled. Monitoring the physical system state and its deviation from the model allows for detecting anomalous behavior, which includes spoofed sensor readings and injected control commands. For example, opening a valve should result in an increase in the water level. If the level does not increase, or the speed of the increase is higher than usual, the sensor reading might be falsified or the sensor might be faulty. We use deep neural networks (DNN) to model the system's behavior based on observation of its physical state. The generality of DNNs allows us to model not only the physical dynamics of the monitored system but also the internal PLC logic influencing the physical state transitions. Our method will be described in more detail in section 5.

Convolutional Neural Networks
Convolutional neural networks (CNN) are feedforward neural networks popular in image processing. In the basic neural network model, the layers are fully connected, which means that a unit (a neuron) is connected to all units in the following layer. This requires the neuron to hold a very large number of weights on these connections. This structure does not scale well, and such a large number of parameters will usually lead to overfitting. In addition, fully connected networks ignore the input topology: input variables can be presented in any order, and the outcome will be the same. However, many kinds of data, such as images, have a distinct structure, and nearby pixels are highly correlated. CNNs address these deficiencies by applying convolutions, or filters, to small regions of the input instead of performing matrix multiplication over the entire input at once. The filter uses the same weights across all of the locations and thus detects features regardless of their position in the image. A convolutional layer consists of several feature maps, each detecting a different input feature. 1D CNNs can be successfully used for time series processing, because time series have distinct 1D (time) locality that can be extracted by convolutions [25].
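To make the weight-sharing idea concrete, the following is a minimal plain-Python sketch of a single 1D convolutional filter sliding over a time series. The function name `conv1d`, the two-tap difference filter, and the toy signal are illustrative assumptions, not the networks used in this paper.

```python
def conv1d(series, kernel):
    """Slide one shared kernel over the series ("valid" mode, stride 1).

    As in CNN layers, this is a cross-correlation: the kernel is not flipped.
    """
    k = len(kernel)
    return [
        sum(kernel[j] * series[i + j] for j in range(k))
        for i in range(len(series) - k + 1)
    ]


# Hypothetical signal with one upward and one downward step.
signal = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]

# A difference filter: the same two weights are reused at every position,
# so a step is detected regardless of where in time it occurs.
edge_filter = [-1.0, 1.0]

feature_map = conv1d(signal, edge_filter)
```

The feature map is nonzero only at the two positions where the signal changes, which illustrates how a small shared filter extracts local temporal features.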

Autoencoders
An autoencoder (AE) is a neural network trained to reproduce its input [15], thereby learning useful properties of the data. This is achieved by applying constraints on the network which prevent copying the input to the output and cause the network to learn a compact representation of the data. Due to this ability, autoencoders are widely used for dimensionality reduction and feature learning [15]. Autoencoders have two major parts: an encoder that transforms the input into some internal representation, and a decoder that reconstructs the input from this representation. The simplest kind of autoencoder is an undercomplete autoencoder (UAE), which passes the data through a bottleneck of a hidden layer with smaller dimensions than the input and output. This bottleneck forces the network to learn a subspace capturing the principal features of the data. Another way of forcing an autoencoder to learn important structural features of the input is letting it reconstruct the original input from the input corrupted by noise. A common way to corrupt the input is to add some Gaussian noise to it. Such autoencoders are called denoising autoencoders (DAE).
Variational autoencoders (VAE) [8] have become very popular in unsupervised learning of complex distributions and in generating images of different kinds. While regular autoencoders learn a compact representation of the input data, there is no constraint on this compact representation. For example, given an autoencoder network trained with many images of dogs, we still don't know how to build an internal representation that could generate a dog when passed to the decoder part of the network. VAE solves this problem by applying constraints on the distribution of the compact representation (called a latent variable or code). To impose this constraint, the loss function is a sum of the data reconstruction error (generalization error) and a deviation of the latent variable distribution from some chosen prior distribution, commonly the unit Gaussian. Once the network is trained, it is possible to generate new images of dogs by drawing samples from the unit Gaussian distribution and passing them to the decoder. VAE has three parts: encoder, decoder, and prior distribution, and is illustrated in Figure 2. The encoder creates a distribution of the latent variables for the given input, and the decoder returns a distribution of inputs corresponding to the given latent variables. The network is trained to maximize the likelihood of the data under the codes it assigns to it, while keeping the distribution of the codes close to the chosen prior.
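The two-term VAE loss described above can be sketched as follows. This is an illustration under stated assumptions: the text only speaks of a "deviation" from the prior, and the sketch assumes the commonly used Kullback-Leibler divergence between a diagonal Gaussian code distribution and the unit Gaussian, with a squared-error reconstruction term. The function names are hypothetical.

```python
import math


def kl_to_unit_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    Assumption: the encoder outputs a mean and a log-variance per latent
    dimension, which is a common but not the only possible parameterization.
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def vae_loss(x, x_reconstructed, mu, log_var):
    """Reconstruction error plus deviation of the code distribution from the prior."""
    reconstruction_error = sum(
        (a - b) ** 2 for a, b in zip(x, x_reconstructed)
    )
    return reconstruction_error + kl_to_unit_gaussian(mu, log_var)
```

When the code distribution already equals the unit Gaussian (zero mean, zero log-variance) and the reconstruction is perfect, both terms vanish and the loss is zero; any shift of the code mean away from zero is penalized by the second term.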
The Fourier transform decomposes a function of time into its frequency components:

$$\hat{f}(k) = \int_{-\infty}^{\infty} f(x)\, e^{-2\pi i k x}\, dx$$

where $f$ is some function depending on time $x$, $\hat{f}$ is its Fourier transform, and $k$ is the frequency. When dealing with periodic data samples, rather than a continuous function, the discrete Fourier transform is used:

$$\hat{f}_k = \sum_{n=0}^{N-1} f_n\, e^{-2\pi i k n / N}$$

where $f_n$ denotes the $n$-th sample of $f$ and $N$ is the number of samples. The Fourier transform of a time series provides its spectrum over the entire period of time measured. In order to detect changes in the signal spectrum over time, the short time Fourier transform (STFT) is used. STFT applies the transform to short overlapping segments of the time series.
Frequency domain analysis provides several advantages. First, it provides a more compact and concise representation of most of the dominant signal components. Second, it allows for the detection of attacks involving a frequency change of regular operation modes, e.g., rapidly starting and stopping the engine. Lastly, according to the uncertainty principle [7], functions localized in the time domain (e.g., a short spike) are spread across many frequencies, and functions that are concentrated in the frequency domain are spread across the time domain. This means that slow attacks that usually evade time domain detection methods will stand out in frequency analysis, but short attacks will be difficult to detect using it.
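The discrete and short time Fourier transforms discussed above can be sketched in plain Python as follows. The function names, the rectangular (unweighted) window, the window and step sizes, and the toy sine signal are assumptions made for illustration; a practical implementation would use an FFT routine and typically a tapered window.

```python
import cmath
import math


def dft(samples):
    """Direct O(N^2) discrete Fourier transform of a list of samples."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def stft_magnitudes(series, window, step):
    """Magnitude spectra of overlapping segments (rectangular window)."""
    frames = []
    for start in range(0, len(series) - window + 1, step):
        segment = series[start:start + window]
        frames.append([abs(c) for c in dft(segment)])
    return frames


# Hypothetical signal: a sine completing exactly 2 cycles in 16 samples.
series = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
spectrum = [abs(c) for c in dft(series)]

# Two repetitions of the signal, analysed with 50% overlapping windows.
frames = stft_magnitudes(series + series, window=16, step=8)
```

The energy of the sine concentrates in a single frequency bin (and its mirror image), which is the compact representation referred to above; each STFT frame gives the spectrum of one segment, so a change in the dominant frequency over time would appear as a change between frames.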

RELATED WORK
The area of anomaly and intrusion detection in ICSs has been widely studied. Extensive surveys [17,20,32] and surveys of surveys [10] are devoted to the classification of research in this field. In our review of related work, we will focus on ICS anomaly and cyber attack detection using the physical state of the system as measured by the sensors. As noted in [11], the first step in physics-based detection is a system state prediction. By observing the deviation between the predicted and reported system state, a decision is made on whether an attack or anomaly is detected and how to score it. Hence, one of the main ways to classify the research is by the prediction method used. Auto regressive (AR) models are used to predict the system state in [16] and [28]. While popular in time series analysis, these models have limitations in multivariate systems, when the state of one observed variable is correlated with another. In our research, we use deep neural networks (DNNs) that don't have these limitations.
Another popular way to model the system is rooted in control theory and uses subsystem model identification based on equation (3), which describes a linear dynamical system:

$$x_{k+1} = A x_k + B u_k + \epsilon_k, \qquad y_k = C x_k + D u_k + e_k \tag{3}$$

where $x_k$ is the system state at time $k$, $u_k$ denotes the controller commands to actuators, $y_k$ denotes the sensor measurements, $\epsilon_k$ is the perturbation noise, $e_k$ is the sensor noise, and $A$, $B$, $C$, $D$ are matrices modeling the dynamics of the system. This approach has been used in previous research, such as [30,34,35]. The limitations of linear dynamical system modeling include the requirement for controller command measurement, a requirement which is not met in most datasets. In addition, many attack scenarios involve altering PLC logic and do not violate system dynamics. For both of these reasons, we chose to use DNNs that are more flexible on both counts. Specification-based system modeling can also be very effective, as shown by [33] and [31]. In [33], the authors used behavior rules to specify the safe system state for medical CPSs and monitor deviation from these rules. Distributed invariants-based mechanisms for smart grids are presented in [38] and [39]. In [39], the detection is based on observing the physical state of the shared system, detecting violations of the power conservation invariants, and identifying the rogue component by verifying the invariants in its topological neighborhood. While effective in rogue CPS controller identification, the solution in [39] is very specific to smart grids, where the physical invariants are well known and simple. Rahman et al. [38] used multiple computationally powerful agents that communicated with each other. One main drawback of these approaches is their specificity: the solution should be tailored to the system and its operating conditions, while our approach is generic and requires no manual configuration.
In a recent competition on water distribution system cyber attack detection (BATADAL, the BATtle of the Attacks Detection Algorithms [42]), seven teams demonstrated their solutions on a simulated dataset. The best results were shown by the authors of [19], who were able to model the system precisely using MATLAB. The main limitation of this solution is its reliance on the need and ability to create a precise system model, a task that is both non-generic and difficult. Another work that achieved a high score in the competition is [1], in which the authors proposed a three-layer method, where the first layer detects statistical anomalies, the second layer is a neural network aimed at finding contextual inconsistencies with normal operation, and the third layer uses principal component analysis (PCA) on all sensor data to classify the samples as normal or abnormal. Our work differs from [1] in the following ways. First, we study the efficiency of a single generic mechanism, as opposed to the multilevel system used by [1]. Second, our solution evaluates types of neural networks not covered by [1]. Finally, we study frequency domain anomaly detection.
Another relevant study from the BATADAL competition is [5]. In addition to other detection mechanisms, the authors of [5] used variational autoencoders (VAE) to calculate the reconstruction probability of the data. In our research, we found that VAEs are not very accurate in reconstructing time series data. Therefore, we suggest using simpler autoencoder models and demonstrate their effectiveness at this task.
Neural networks are used in other physics-based cyber attack detection studies [13,21,27]. Unlike our work, these studies use more complex recurrent and graphical models and do not study the frequency domain.
Autoencoders have been used for anomaly and intrusion detection before [29,40]. The differences between this work and [29] are that in our work (1) AEs are applied to raw physical signals without statistical feature extraction, and (2) AEs are applied to the frequency domain. We extend the research in [40] by applying AEs to cyber attack detection in time series combining control, status, and raw physical data, as well as applying them to the frequency domain. We also enhance the architecture of the network and present a feature selection method which improves network performance.

DATASETS

SWaT
The Secure Water Treatment (SWaT) testbed was built at the Singapore University of Technology and Design. Although a detailed description of the testbed and the dataset can be found in [12], we provide a brief description below. The testbed is a fully operational scaled down water treatment plant. As shown in Figure 3, the water goes through a six-stage process. Each stage is equipped with a number of sensors and actuators. Sensors include flow meters, water level meters, and conductivity and acidity analyzers. Water pumps, chemical dosing pumps, and valves that control inflow are the actuators. The sensors and actuators of each stage are connected to the corresponding PLC, and the PLCs are connected to the SCADA system.
The dataset contains seven days of recording under normal conditions and four days during which 36 attacks were conducted. The entire dataset contains 946,722 records, labeled as either attack or normal, with 51 attributes corresponding to the sensor and actuator data. The threat model used in the experiment is a system that has already been infected by attackers who spoof the system state to the PLCs, causing erroneous commands to be issued to the actuators, or override the PLC commands with malicious ones. A table containing a description and the timing of the attacks is provided in [12]. Each attack aims to achieve some physical effect on the system. For example, attack 30 aims to underflow the tank in the first stage. For that purpose, the value of the water level sensor LIT101 is fixed at 700 mm, while the pump P101 controlling water outflow is kept open for 20 minutes. Figure 4 shows the attack, its effects, and the time it takes the system to stabilize. The attacks were usually not stealthy, i.e., when a command was issued to the actuator, and the actuator changed the system state, the change was not hidden by the attackers.

BATADAL
The BATADAL dataset represents a water distribution network comprising seven storage tanks with eleven pumps and five valves, controlled by nine PLCs (see Figure 5). The network was generated with epanetCPA [41], a MATLAB toolbox that allows for the injection of cyber attacks and simulates the response of the network to these attacks.
There are 43 variables representing tank water levels, the flow and status of all of the pumps, as well as inlet and pressure for the pumping stations and valves. The training data simulates hourly measurements collected for 365 days, resulting in 8,761 records. The test dataset contains 2,089 records (from 87 days of recording). There are seven attacks present in the test data. The attacks involved malicious actuator activation, PLC set point changes, and sensor measurement manipulation. In addition, the attacks were concealed from the SCADA system by replacing the PLC-to-SCADA communication data with the data recorded at the same hour during normal operation. Figure 6 illustrates attack 12 (the attack numbers start at eight). The goal of this attack is to cause tank T2 to overflow. The L_T2 sensor's readings are altered to report lower levels and cause PLC3 to keep the valve V2 open. At the same time, the traffic from PLC3 to SCADA is modified to replay previously recorded values of L_T2, as well as V2 flow and pressures. Figure 6 shows that the status of the valve was not replayed, although the authors of [42] reported it was. Also, one can see that immediately after the attack the system returns to its regular cycle. This looks unrealistic, as the tank must be in an overflow state and it should take time to consume the excess water it contains. We believe that these are limitations of the simulation.

ANOMALY DETECTION AND SCORING METHOD
The anomaly detection method used in this research is based on the one used in [23] and extends it. We trained a neural network until its training error reached the desired value (usually less than 0.1). The network is used to predict the future values of the data features based on previous values. Thus, the model performs the function

$$(\hat{y}_{n+h}, \ldots, \hat{y}_{n+h+m-1}) = f(y_{n-l}, \ldots, y_{n-1})$$

where $y_i$ is a feature vector at time $i$, $\hat{y}_i$ is the estimation of the feature vector, $l$ and $m$ represent the input and output sequence length respectively, $h$ is the prediction horizon, and $n$ is the time moment following the data used for prediction. We generalized the method to allow the prediction of arbitrary length sequences in the future with a chosen horizon, e.g., predicting 256 time steps starting with the fifth time step from the last input time. The residual vectors are calculated as:

$$r_i = |y_i - \hat{y}_i|$$

The residuals are used to trigger the anomaly alert in one of two ways. In the first, the residuals are normalized by dividing them by the maximal per feature residuals for the training data, and the maximum of the normalized residuals is compared to a threshold $\tau$:

$$\max_j \frac{r_i^j}{r_{max}^j} > \tau$$

where $r_i^j$ is the residual of feature $j$ at time $i$ and $r_{max}^j$ is the maximal residual of feature $j$ in the training data. In order to prevent false alarms on short-term deviations, we require that the residual exceed the threshold for at least a specified duration of time window $w$. Thus, an anomaly alert $A_i$ at time $i$ is determined by:

$$A_i = \prod_{k=i-w+1}^{i} \mathbb{1}\left[\max_j \frac{r_k^j}{r_{max}^j} > \tau\right]$$

Figure 5: Hierarchy of the water distribution system used in the BATADAL dataset [6].
Figure 6: Attack 12 on the L_T2 sensor.
The hyperparameters $\tau$ and $w$ are determined by setting a maximal accepted false alarm rate in the training data and finding the solution to:

$$(\tau, w) = \arg\min_{\tau, w}\ \omega_\tau\, \omega_w \quad \text{subject to} \quad |A(\tau, w)| \le FP_{max}$$

where $\omega_\tau$ and $\omega_w$ are the weights of the threshold and the window, $A(\tau, w)$ is the set of attack alerts detected with the specific threshold and window values, and $FP_{max}$ is the maximal allowed number of false positives in the training data. In other words, we are looking for the hyperparameter values that don't produce more than the permitted number of false alerts, while minimizing the product of their weights. The weight of a hyperparameter is proportional to its index in the argument space. For example, if the possible window value space is 5, 10, 15, 20, the corresponding weights will be 0.25, 0.5, 0.75, 1. Using weights allows us to normalize the contribution of both hyperparameters regardless of their absolute values.
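The first alerting scheme and the hyperparameter search described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function names are hypothetical, an alert is assumed to fire once the normalized score has exceeded the threshold for `w` consecutive time steps, and false positives on the attack-free training data are counted as contiguous alert runs.

```python
def normalized_max_residual(residuals, train_max):
    """Largest per-feature residual after division by the training maximum."""
    return max(r / m for r, m in zip(residuals, train_max))


def alerts(scores, tau, w):
    """Flag time i only if the score exceeded tau for w consecutive steps ending at i."""
    flags, run = [], 0
    for score in scores:
        run = run + 1 if score > tau else 0
        flags.append(run >= w)
    return flags


def count_alert_events(flags):
    """Number of contiguous runs of raised alerts."""
    return sum(
        1 for i, f in enumerate(flags) if f and (i == 0 or not flags[i - 1])
    )


def select_hyperparameters(train_scores, taus, windows, fp_max):
    """Pick (tau, w) with the smallest weight product that keeps training alerts <= fp_max.

    The weight of a value is its 1-based index divided by the size of its
    search space, e.g. windows 5, 10, 15, 20 get weights 0.25, 0.5, 0.75, 1.
    """
    best = None
    for ti, tau in enumerate(taus, start=1):
        for wi, w in enumerate(windows, start=1):
            weight = (ti / len(taus)) * (wi / len(windows))
            false_alerts = count_alert_events(alerts(train_scores, tau, w))
            if false_alerts <= fp_max and (best is None or weight < best[0]):
                best = (weight, tau, w)
    return None if best is None else (best[1], best[2])


# Hypothetical normalized scores computed on attack-free training data.
train_scores = [0.2, 1.5, 0.3, 1.2, 1.3, 0.1]
chosen = select_hyperparameters(
    train_scores, taus=[1.0, 2.0], windows=[1, 2, 3], fp_max=0
)
```

In this toy search, the low threshold only avoids false alerts when paired with the longest window (weight product 0.5), whereas the high threshold avoids them even with the shortest window (weight product of about 0.33), so the latter pair is selected.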
The second way to detect the attacks differs from the first one by normalizing the residuals using their mean and standard deviation (by feature) and is described in [23]. In this research, we found that in the case of the SWaT dataset, using the residuals' mean and standard deviation based on the test data produced better results than using the statistics based on the training data. Updating threshold statistics with test data, a common practice in online anomaly detection, compensates for data drift. This finding hinted at the presence of data drift in the SWaT dataset, and it was indeed detected, as we describe in section 6.
In order to produce results comparable with previous research, we used the same performance metrics as other works using the corresponding dataset. For SWaT, the metrics are precision, recall, and $F_1$, and they are calculated based on the log record labels contained in the dataset.
In the BATADAL competition, the score was calculated as a weighted sum:

$$S = \gamma\, S_{TTD} + (1 - \gamma)\, S_{CLF}$$

where $S_{TTD}$ is the time-to-detection score, $S_{CLF}$ is the classification score, and $\gamma$ determines the relative importance of the two scores and is set to 0.5. The details of the calculation of both scores are based on the log record labels and are described in [42].
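For reference, the record-level SWaT metrics can be computed from per-record labels as in the sketch below. The function names and the toy label vectors are illustrative assumptions; the definitions of precision, recall, and F1 are the standard ones.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(labels, predictions):
    """Record-level metrics; 1 (or True) marks an attack record."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


# Hypothetical labels and detector output for six log records.
labels = [0, 1, 1, 1, 0, 0]
predictions = [0, 1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(labels, predictions)
```

As a consistency check on reported numbers, a precision of 0.968 and a recall of 0.791 yield an F1 of approximately 0.871.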

Data Analysis and Preprocessing
Our detection mechanism builds upon the ability to model and predict the system's behavior. To fulfill this requirement, the following assumption must hold: the training data must be representative of the test data. More specifically, the training data should contain all of the (latent) states and transitions between them that appear in the test data. In other time-series forecasting techniques, e.g., AR models and recurrent neural networks, there is a stronger requirement that the data needs to be stationary (i.e., maintains its probability distribution over time) or can be transformed into stationary data [15]. We found that a number of SWaT features do not have the same distribution in the training and test data (see Figure 7). In order to obtain a quantitative measure of the similarity between the probability distributions of the training and test data, we used the Kolmogorov-Smirnov test (K-S test) [4]. We chose the K-S test because it is non-parametric and isn't based on any assumptions about the probability distributions tested. It is also more sensitive than comparing the mean and standard deviation, or the t-test, both of which do not work well with multimodal and non-normal distributions. The K-S test statistic for two distributions is the maximal difference between their empirical cumulative distribution functions (ECDF):

$$D = \sup_x |F_1(x) - F_2(x)|$$

where $F_1$ and $F_2$ are the ECDFs of the compared distributions. They can be found as:

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I_{[-\infty, x]}(X_i)$$

where $I_{[-\infty, x]}(X_i)$ is the indicator function, equal to 1 if $X_i \le x$ and 0 otherwise. The original K-S test is limited to fully specified distributions [36]; however, we found the slight modification described below useful as a concise metric for filtering out features unsuitable for modeling. Using the maximum as a statistic makes the K-S test extremely sensitive to small CDF differences when the distribution's mean is slightly offset on the x axis. To increase the test's robustness, we used the area between the CDFs instead, calculated as

$$K\text{-}S^* = \sum_x |F_1(x) - F_2(x)|$$

Figure 7 illustrates three SWaT features, their values over time, histograms, and K-S and K-S* statistics. We calculated the K-S* statistic for all SWaT and BATADAL features. The features were normalized to the (0,1) scale. As Figure 8 shows, many SWaT features differ greatly between the training and test sets. Such features would create a lot of false positive alarms and must be excluded from the modeling. In addition to data normalization and feature statistic profiling, we subsampled the SWaT data at a five second rate. Subsampling provides a regularization mechanism which prevents overfitting and allows us to operate with a smaller amount of data. As for the BATADAL dataset, all but one (P_J280) of its features have very low K-S* metrics (10 or less). This striking difference between the real world and simulated data stresses the need to validate any findings in realistic setups.
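The feature profiling steps described above (normalization, the K-S statistic, the area-based K-S* variant, and subsampling) might be sketched as follows. Several details are assumptions made for this illustration and are not confirmed by the text: the ECDFs are compared on a fixed, evenly spaced grid over the normalized range, K-S* is approximated as the sum of absolute ECDF differences over that grid, and the helper names and grid size are hypothetical.

```python
def min_max_normalize(values):
    """Scale values linearly to the [0, 1] range (constant input maps to 0.0)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def ecdf(sample, x):
    """Empirical CDF: fraction of sample values that are <= x."""
    return sum(1 for v in sample if v <= x) / len(sample)


def ks_statistics(train, test, grid_size=100):
    """Return (K-S, K-S*) for two samples already scaled to [0, 1].

    K-S is the largest ECDF gap on the grid; K-S* sums the gaps over the
    grid as a discrete stand-in for the area between the two curves.
    """
    grid = [i / grid_size for i in range(grid_size + 1)]
    gaps = [abs(ecdf(train, x) - ecdf(test, x)) for x in grid]
    return max(gaps), sum(gaps)


def subsample(records, step):
    """Keep every step-th record, e.g. step=5 for one-second data."""
    return records[::step]


# Hypothetical feature whose distribution shifted between training and test.
train_feature = [0.1, 0.2, 0.3, 0.4]
test_feature = [0.6, 0.7, 0.8, 0.9]
ks, ks_star = ks_statistics(train_feature, test_feature)
```

A feature whose training and test samples coincide scores zero on both statistics, while a shifted feature such as the one above reaches the maximal K-S value of 1 and a large K-S*, marking it as a candidate for exclusion from modeling.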

Evaluating 1D CNN Performance with BATADAL Dataset
To answer our first research question, we validated the effectiveness of 1D CNNs with the BATADAL dataset. We modeled all features, except for P_J280 due to its high K-S* value. We used an eight-layer 1D CNN with a sequence length of 18 data points and achieved the scores presented in Table 1. As Table 1 shows, 1D CNN detected all of the attacks and achieved high scores; however, it did not achieve the performance of the best BATADAL competitors. Some of the attacks were not detected due to the attack concealment techniques used in BATADAL. We tried multiple hyperparameter configurations, using both grid search and genetic algorithms [14], but were unable to achieve a better score with the 1D CNN architecture. Therefore, we conclude that while 1D CNN networks are indeed effective in detecting cyber attacks in cyber physical systems, there is room for improvement in terms of precision, recall, and timeliness of detection.

Attack Detection Explainability
Once an attack has been detected, the ability to localize the attack is very important. Using a neural network to model each feature in the monitored system allows us to assess which sensors and actuators were involved in the attack. The attack indicator for a feature $i$ at a time $t$ is the corresponding residual $r_t^i$ exceeding the threshold $\tau$. Analyzing the attack location detection of 1D CNN, we observed the advantages of combined feature modeling over modeling the features separately. When each feature is modeled separately, the model often makes a prediction based on the recent past and thus is mainly useful for detecting abrupt non-characteristic changes of the feature. To counter this effect, we increased the prediction horizon, so that recent past values become less useful. This resulted in the discovery of more attacks as well as in more false positives. On the other hand, when we modeled a number of features related to a single PLC or a number of related PLCs together, the 1D CNN models captured the dependencies between them. This results in more complete and accurate attack detection, both in terms of time and location. We also observed that spoofing a single feature might trigger behavior changes of multiple features, resulting in all of them being considered anomalous, as shown in Figure 9.
Table 2 summarizes attack location detection for the attacks on the first stage of the SWaT testbed.
As Table 2 shows, 1D CNN can almost always identify the feature that was attacked directly, and can also locate the related features influenced by the attack. The attack location detection in the BATADAL dataset is summarized in Table 3. In the BATADAL dataset, the attacks were concealed by replaying previously recorded valid data, but the network was able to detect them by detecting anomalies in the dependent features that were not replayed.
1 Highlighted in bold font are the directly attacked features.
2 More features were detected as anomalous. Only the most strongly indicated ones are listed.
To answer our second research question, our modeling method locates the attacked features, provided they are not replayed. However, we found that an attack may trigger a reaction in many features, and we have not yet found a means of distinguishing the original cause from its consequences.
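The localization rule described in this section (a feature is implicated when its residual exceeds the threshold τ) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the residual values are invented for the example, the function name `locate_attack` is hypothetical, and a single shared threshold `tau` is assumed.

```python
def locate_attack(residuals, tau):
    """For each time step, list the features whose residual exceeds tau.

    residuals: dict mapping feature name -> list of residuals over time
    tau: detection threshold (a single shared value is an assumption)
    """
    n_steps = len(next(iter(residuals.values())))
    return [
        sorted(name for name, series in residuals.items() if abs(series[t]) > tau)
        for t in range(n_steps)
    ]

# Invented residuals, loosely modeled on the first SWaT stage feature names.
residuals = {
    "MV101": [0.1, 0.9, 0.8, 0.1],
    "FIT101": [0.0, 0.2, 0.7, 0.6],
    "LIT101": [0.1, 0.1, 0.3, 0.9],
}
flagged = locate_attack(residuals, tau=0.5)
```

In this toy example the directly spoofed feature is flagged first and the dependent features follow, which mirrors the cause-versus-consequence ambiguity noted above.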

Undercomplete Autoencoders
The observation that feature correlations are very effective in detecting cyber attacks led us to experiment with autoencoders (AEs), a network architecture that excels in finding latent feature correlations. After experimenting with multiple AE architectures, such as LSTM-based AEs, variational AEs (VAEs), and denoising AEs, we discovered that the best detection performance is achieved with the simplest undercomplete AEs. As VAEs were used in related work [5], we explored multiple VAE configurations using both grid search and genetic algorithms, and discovered that the generative nature of VAEs causes less precise predictions and lower recall. This VAE behavior is consistent with the results obtained in image generation, as reported in [15]. As Table 1 shows, VAEs obtained a lower score than simple non-generative AEs.
The best results were achieved using the AE network variant adapted for multivariate sequence reconstruction. The network design is as follows:
• an optional corruption layer applying Gaussian noise to the input sequence,
• a fully connected layer with a ReLU activation function inflating the input; the purpose of this layer is to enlarge the hypothesis space,
• an encoding layer that flattens the input and produces its compact representation using a fraction of the input size; in our experiments the best results were achieved with a compact representation half the size of the input,
• a decoding layer reconstructing the original sequence from its compact representation.

Table 4: SWaT attack detection performance comparison.

Method         Precision  Recall  F1
DNN [21]       0.983      0.678   0.803
SVM [21]       0.925      0.699   0.796
TABOR [27]     0.862      0.788   0.823
1D CNN [23]    0.968      0.791   0.871
AE             0.890      0.803   0.844
AE Frequency   0.924      0.827   0.873
This architecture can deal with sequences of arbitrary length, but the best results were achieved with a length of three. As shown in Table 1, this simple AE network produced better detection scores than the 1D CNN and approached the best results of the competition winners; its attack location capabilities were also better than those of the 1D CNN. As Table 3 shows, in most cases AEs were able to pinpoint the attacked features despite the concealment of the attack. As a point of comparison, we also used PCA with the same number of components used with AEs. As expected, AEs perform better than PCA, as they are able to capture non-linear dependencies between features [15]. We can make two observations from our comparison of the PCA, VAE, and AE performance (see Table 1). First, a high PCA score is probably the result of the simulated nature of the data. Second, more complex detectors such as VAEs can actually degrade detection performance. To verify our first observation, we used PCA for detection on the SWaT dataset. Only four attacks were discovered, and there were 13 false positives, so we conclude that a high detection score on simulated datasets is not a sufficient indicator of a detector's performance in the general case.
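The layer stack of the undercomplete AE described in this section can be sketched as an untrained forward pass in plain Python. This is only a shape-level illustration under stated assumptions: the inflation factor of 2, applying the inflating layer per time step, linear encoder and decoder activations, and all function names are hypothetical choices that the text does not specify. Only the sequence length of three and the half-size code come from the text.

```python
import random

def dense(x, weights, bias, relu=False):
    """Fully connected layer y = W x + b, optionally followed by ReLU."""
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [max(0.0, y) for y in out] if relu else out

def init(n_out, n_in, rng):
    """Random (untrained) weights and zero biases for an n_in -> n_out layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_autoencoder(n_features, seq_len=3, inflate=2, code_ratio=0.5, seed=0):
    rng = random.Random(seed)
    n_in = n_features * seq_len
    n_code = max(1, int(n_in * code_ratio))  # compact code: half of the input size
    return {
        "inflate": init(n_features * inflate, n_features, rng),
        "encode": init(n_code, n_features * inflate * seq_len, rng),
        "decode": init(n_in, n_code, rng),
        "shape": (seq_len, n_features),
    }

def reconstruct(model, window, noise_std=0.0, seed=0):
    """Run one window (seq_len x n_features) through the AE."""
    seq_len, n_features = model["shape"]
    if noise_std > 0.0:  # optional corruption layer (Gaussian noise)
        rng = random.Random(seed)
        window = [[v + rng.gauss(0.0, noise_std) for v in step] for step in window]
    inflated = [dense(step, *model["inflate"], relu=True) for step in window]
    flat = [v for step in inflated for v in step]   # flatten before encoding
    code = dense(flat, *model["encode"])            # compact representation
    out = dense(code, *model["decode"])             # reconstruct the sequence
    recon = [out[t * n_features:(t + 1) * n_features] for t in range(seq_len)]
    return recon, code

model = build_autoencoder(n_features=4)
window = [[0.1, 0.2, 0.3, 0.4]] * 3
recon, code = reconstruct(model, window)
```

After training (not shown), the per-feature difference between `window` and `recon` would play the role of the residual used for detection and localization.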
To verify the effectiveness of our AE architecture in a more realistic setup, we applied it to the SWaT dataset. As seen in Table 4, the AE network obtained a high F1 score, comparable to the best results we achieved using a 1D CNN. This result is particularly impressive considering that PCA performed very poorly on the SWaT dataset. In addition, AE networks are very small (as we are working with short sequences) and fast to train. This answers our third research question: AEs are a lightweight and effective alternative neural network architecture that can be used for anomaly and cyber attack detection in CPSs.

Frequency Domain Detection
Our final research question seeks to explore the usefulness of frequency domain attack detection. Although the time and frequency domains represent the same information, the compactness of periodic signal representation in the frequency domain could help in detecting anomalies. The following method was used to create the signal representation in the frequency domain.
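(1) Determine the dominant frequency of each signal (the frequency with the most energy) using the discrete fast Fourier transform (DFFT).
(2) Determine the window for the short time Fourier transform (STFT) based on the dominant frequency period. It was found that the optimal window is between one and two periods of the dominant frequency.
(3) Transform the signals into their frequency representation.
  (a) Split the entire signal into overlapping windows.
  (b) Perform STFT for each window.
  (c) Partition the entire STFT spectrum into a number of bins. Calculate the total energy of the signal in each bin.
  (d) Pick a small number of bins with the most energy. The energy values will represent the feature in the frequency domain for the corresponding time window. We found that two or three bins were sufficient for representing the features.
(4) Apply the chosen neural network model (1D CNN or AE) to the frequency domain representation.
As each feature is represented separately, the ability to locate the attack is maintained.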
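The transformation steps of this section can be sketched for a single feature as follows. This is a hedged illustration, not the authors' code: it uses a naive DFT with a rectangular window instead of an optimized FFT or STFT routine, and the 50% overlap, the four equal-width bands, and the choice of ranking bands by total energy over all windows are assumptions made for the example.

```python
import cmath
import math

def dft_energy(x):
    """Energy |X[k]|^2 of the non-negative frequency bins (naive DFT)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
        for k in range(n // 2 + 1)
    ]

def dominant_period(x):
    """Step 1: period (in samples) of the non-DC frequency with the most energy."""
    mean = sum(x) / len(x)
    energy = dft_energy([v - mean for v in x])
    k = max(range(1, len(energy)), key=energy.__getitem__)
    return len(x) / k

def frequency_features(x, periods_per_window=1.5, n_bands=4, n_keep=2, overlap=0.5):
    """Steps 2 and 3: windowed band energies for one feature."""
    win = int(round(periods_per_window * dominant_period(x)))  # step 2
    hop = max(1, int(win * (1 - overlap)))
    frames = []
    for start in range(0, len(x) - win + 1, hop):       # (a) overlapping windows
        energy = dft_energy(x[start:start + win])       # (b) spectrum per window
        bands = [0.0] * n_bands
        for k, e in enumerate(energy):                   # (c) energy per band
            bands[k * n_bands // len(energy)] += e
        frames.append(bands)
    totals = [sum(f[b] for f in frames) for b in range(n_bands)]
    ranked = sorted(range(n_bands), key=totals.__getitem__, reverse=True)
    keep = sorted(ranked[:n_keep])                       # (d) most energetic bands
    return [[f[b] for b in keep] for f in frames], win

# Synthetic periodic signal with a period of 20 samples.
signal = [math.sin(2 * math.pi * t / 20) for t in range(200)]
features, window_len = frequency_features(signal)
```

Each row of `features` is the frequency domain representation of the feature for one time window and would be the input to the detection model in step 4.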
We used an AE network for both SWaT and BATADAL frequency domain detection. The network consisted of one to three fully connected layers followed by an encoder and decoder. For BATADAL, we were able to match the score of the best previously published result [19] (see Table 1). However, the detector of [19] is built for the specific BATADAL configuration, while our architecture is generic.
For SWaT, we first conducted a statistical analysis of the frequency domain representation and removed those features that differed significantly between the training and test data. On the remaining data we were able to obtain an F1 score of 0.873, slightly better than the previously published results. While these results are very encouraging and suggest further study and validation, we discovered one limitation of frequency domain detection. In order to transform the data into the frequency representation, we use windows of at least one period of the dominant frequency. As a consequence, the method cannot distinguish between short attacks that quickly follow one another within the same window. Even though in reality this might be a mild concern, in the SWaT dataset many short attacks occur in succession. Our method usually detects them as one long attack, which reduces the precision metrics. Thus, in general, the answer to our last research question is positive. While frequency domain analysis can contribute to attack detection, it has its limitations.
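The effect of window-level resolution on precision can be illustrated with a toy calculation. The sketch below assumes an idealized detector that flags every sample of any window overlapping an attack, and uses non-overlapping windows for simplicity; the attack intervals and window length are invented for the example.

```python
def window_level_detection(attacks, n_samples, window):
    """Flag every sample of any window that overlaps an attack interval."""
    truth = [False] * n_samples
    for start, end in attacks:
        for t in range(start, min(end, n_samples)):
            truth[t] = True
    detected = [False] * n_samples
    for w in range(0, n_samples, window):
        if any(truth[w:w + window]):
            for t in range(w, min(w + window, n_samples)):
                detected[t] = True
    return truth, detected

def count_segments(flags):
    """Number of contiguous runs of True values."""
    return sum(1 for i, f in enumerate(flags) if f and (i == 0 or not flags[i - 1]))

# Two short attacks that fall into the same 200-sample window.
truth, detected = window_level_detection([(100, 130), (150, 180)], n_samples=400, window=200)
true_positives = sum(t and d for t, d in zip(truth, detected))
precision = true_positives / sum(detected)
```

Here two separate attacks are reported as one long detection, and the point-wise precision drops to 0.3 even though both attacks were found.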

Additional Validation with the WADI Dataset
While the proposed method and neural networks achieved excellent results with BATADAL, the simulated nature of the data requires additional validation with a different, nonsimulated dataset. Finding another high quality nonsimulated cyber physical dataset containing attacks is not an easy task. The best candidate we could find is WADI [2], collected from a scaled down water distribution testbed and built by the authors of SWaT. The testbed consists of a number of large water tanks that supply water to consumer tanks. The dataset contains 16 attacks whose goal is to stop the water supply to the consumer tanks. The attacks were conducted by opening valves and spoofing sensor readings, and were partially concealed. The dataset is significantly larger than the SWaT and BATADAL datasets, and contains 1,209,610 data points in the training set and 126 features. The WADI dataset was made public recently, and very few attack detection results utilizing this dataset have been published. In [31], the authors proposed an agent-based framework for CPS modeling and used it to detect attacks on the WADI dataset.
Unfortunately, the authors of [31] did not publish the quantitative metrics of the detection results, only reporting that 12 of 16 attacks were detected. In [26], the authors use LSTM-based generative adversarial networks (referred to as MAD-GAN in the paper) and show that they outperform other methods, such as PCA, K-nearest neighbors (KNN), and feature bagging (FB), on the SWaT and WADI datasets.
We performed data preprocessing as described above, subsampling the data at a 1/10 rate and removing the features with extremely different statistics. We were able to achieve substantially better results than those reported in [26], as shown in Table 5. The unstable feature removal improved our ability to detect attacks in general, as our PCA result is significantly higher than the result reported in [26]. For the 1D CNN, we modeled each WADI PLC separately and then merged the detection. A 1D CNN model with eight layers and sequences of 16 data points successfully detected 14 attacks and outperformed both PCA and MAD-GAN. However, the best results in our experiments were achieved by autoencoders, using the AE architecture described above with sequences of length 18 (see Table 5).
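The two mechanical steps mentioned here, subsampling and merging per-PLC detections, can be sketched as below. Both functions are assumptions about details the text leaves open: subsampling is taken to mean keeping every tenth sample, and merging is taken to be a logical OR over the per-PLC detection flags. The PLC names and flag values are invented.

```python
def subsample(rows, rate=10):
    """Keep every `rate`-th sample (assumed meaning of subsampling at a 1/10 rate)."""
    return rows[::rate]

def merge_detections(per_plc_flags):
    """Flag a time step if any per-PLC model flags it (assumed OR merge)."""
    return [any(flags) for flags in zip(*per_plc_flags.values())]

# The WADI training set size quoted in the text, reduced by a factor of ten.
n_subsampled = len(subsample(range(1209610)))

per_plc_flags = {
    "PLC1": [False, True, False, False],
    "PLC2": [False, False, True, False],
}
merged = merge_detections(per_plc_flags)
```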
The WADI dataset did not appear to be suitable for frequency domain analysis. Only 44 of 127 features had a clear dominant frequency, and the frequency was very low (with a period of 1440 minutes or 24 hours). Such very long periods result in poor resolution in detecting short attacks (attacks in the WADI dataset are about ten minutes long). Other features did not have any clear periodicity. Inapplicability of frequency domain analysis to the WADI dataset and similar data is therefore a limitation of this approach.
To summarize, we successfully validated our detection methods on the WADI dataset. There were two other important observations. First, frequency domain analysis is limited to data with strong periodicity and is less effective when the periods are very long. Second, the absolute performance results of attack detection for the WADI dataset are not high and require further study.
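To make the resolution problem concrete, the following back-of-the-envelope calculation compares a ten-minute attack with the analysis window. It assumes the window of one to two dominant periods that was found optimal in the frequency domain experiments above; the symbols are introduced only for this illustration.

```latex
\[
\frac{T_{\text{attack}}}{T_{\text{window}}}
  = \frac{10\ \text{min}}{1440\ \text{min}} \approx 0.0069
  \quad \text{(one period)},
\qquad
\frac{10\ \text{min}}{2 \times 1440\ \text{min}} \approx 0.0035
  \quad \text{(two periods)}.
\]
```

Under this assumption an attack occupies well below one percent of the window, so its contribution to the window's band energies is likely to be small.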

CONCLUSIONS
In this paper, we studied the effectiveness of 1D CNN and AE-based anomaly and cyber attack detection mechanisms, answering our previously mentioned research questions as follows.
• Based on our experiments, we conclude that both 1D CNNs and AEs achieve or exceed the state-of-the-art performance on the three public datasets utilized, while maintaining generality, simplicity, and a small footprint. It is not clear whether one of these architectures is always preferable over the other, and we plan to extend our research with more datasets to investigate this further. In the meantime, we recommend an ensemble consisting of both models when possible. If a single model must be chosen, AEs will likely work out-of-the-box in most cases, while 1D CNNs will require a round of hyperparameter tuning to eliminate false positives.
• The attack detection method we use allows us to pinpoint the specific attack location. However, as spoofing attacks can trigger a number of changes in the related features, all of these features will be considered attacked by our method; in such cases, differentiation between the causes and reactions can be made using a priori knowledge of the PLC input and output. Algorithmic means of distinguishing between PLC input and output solely by data are a topic for future research.
• We found frequency domain analysis very helpful in anomaly and attack detection. Its applicability is subject to a number of practical limitations; if they are met, frequency domain analysis can provide strong results.
Although we have not addressed our method's robustness to adversarial attacks, we have shown that our networks were able to detect attacks concealed by replay. Another topic for future research is investigating the network's resilience to white-box attacks that target the network itself. Finally, we found it difficult to obtain high quality datasets for this and related research. Simulated data may have intrinsic properties that allow for easier anomaly detection using certain techniques (e.g., PCA or frequency transformation). This makes results obtained on such data less representative. On the other hand, many real-world datasets do not consider system stabilization after the attacks, contain inconsistent features, or introduce new system states during testing. All of these prevent accurate assessment of detection algorithm performance. We contacted the SWaT testbed's maintainers, suggesting improvements that can be made, and they expressed interest in cooperating on this in the future.

ACKNOWLEDGEMENTS
The authors thank iTrust Centre for Research in Cyber Security, Singapore University of Technology and Design for creating and providing the SWaT and the WADI datasets, Dr. Elchanan Zwecher for his valuable insights, and Rafael Defense Systems for supporting this work.

Figure 1: A schematic ICS architecture and possible attack locations.

2.5 Time-Frequency Domain Transformation
Raw data measured by ICS sensors produces a time series. While in most ICS anomaly and attack detection research this data is processed directly, it is very common in signal processing to analyze data in the frequency domain. The Fourier transform (1) allows us to build the signal's frequency domain representation and preserves the signal's energy according to Parseval's theorem.
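For a sampled signal, the energy preservation stated by Parseval's theorem can be checked numerically. The sketch below uses the discrete form of the theorem with an unnormalized naive DFT, where the frequency domain sum is divided by the number of samples; the signal values are arbitrary and chosen only for illustration.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform X[k] = sum_t x[t] * exp(-2*pi*i*k*t/N)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

signal = [0.0, 1.0, 0.5, -0.5, -1.0, 0.25, 0.75, -0.25]
time_energy = sum(v * v for v in signal)
freq_energy = sum(abs(c) ** 2 for c in dft(signal)) / len(signal)
```

The two energies agree up to floating point error, which is why band energies computed in the frequency domain are a faithful summary of the signal's energy.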

Figure 4: Attack 30 on the LIT101 sensor. LIT101 measures the water level in the first tank. P101 pumps the water out of the tank to the second stage processing. Note that after the attack is over, it takes a long time until the system returns to its normal production cycle.

Figure 7: Features statistics comparison. LIT101 has a very similar distribution in both the training and test data. AIT401 has a similar but slightly offset distribution. K-S has a high value, but K-S* correctly classifies the distributions as close. AIT201 has very different distributions.

Figure 8: K-S* statistic for the SWaT dataset. A number of features differ significantly between the training and test sets.

Figure 9: SWaT Attack 1 location interpretation. The areas highlighted in red indicate the detected abnormal feature behavior. The attack opened the MV101 valve, letting more water into an already full tank. The model detects other related features as being abnormal too, e.g., water flow into a full tank (indicated by FIT101), and abnormally high water level (indicated by LIT101). Attacks 2 and 3, which were carried out soon after the first attack, can be seen as well.

(3) Transform the signals into their frequency representation.
  (a) Split the entire signal into overlapping windows.
  (b) Perform STFT for each window.
  (c) Partition the entire STFT spectrum into a number of bins. Calculate the total energy of the signal in each bin.
  (d) Pick a small number of bins with the most energy. The energy values will represent the feature in the frequency domain for the corresponding time window. We found that two or three bins were sufficient for representing the features.
(4) Apply the chosen neural network model (1D CNN or AE)

Table 1: Comparative performance of neural networks on the BATADAL dataset.

Table 2: Attack location detection for the attacks on the first stage of the SWaT testbed. Directly attacked features are highlighted in bold.

Table 3: Attack location detection for BATADAL.

Table 4: SWaT attack detection performance comparison.

Table 5: Comparative performance of attack detection for the WADI dataset.