
1 Introduction

Industrial Control System (ICS) refers to systems that monitor and control critical national infrastructure and industrial processes such as power, gas, water and sewage, nuclear power, transportation, and manufacturing. Early ICSs were isolated systems designed to ensure availability and differed from traditional Information Technology (IT) systems. The Programmable Logic Controller (PLC), the main element of the control system, was not connected to the network, so there were few threats other than physical disturbances or natural disasters. ICS manufacturers could therefore design and operate systems for closed networks without considering security at all. However, as ICS, which used to run on safe closed networks, has been converted to open networks with the development of information and communication technology, security vulnerabilities have been exposed [1, 2]. In order to prevent and respond to such security incidents, systematic security technology for ICS is required [3].

Attacks on ICS have occurred continuously, and their frequency has been increasing in recent years. A representative security incident targeting ICS is Stuxnet. Discovered in June 2010, it infected Microsoft Windows machines and then monitored and sabotaged industrial facilities by targeting PLCs. This incident sparked interest in ICS security [4]. In December 2015, a cyberattack on Ukrainian utilities resulted in a massive blackout affecting more than 80,000 people [5]. In December 2017, an attack was detected at a chemical plant in Saudi Arabia while it was preparing to exploit a zero-day firmware vulnerability through the Engineering Workstation (EWS); the attackers had penetrated the IT network in 2014 and went undetected for more than three years. Such ICS security incidents can cause not only economic loss but also supply chain problems due to halted production, as well as human casualties due to explosions. In this paper, we introduce an anomaly detection method using a Recurrent Neural Network (RNN) and a time series analysis technique in an ICS environment. The contributions of this paper are as follows:

  • An anomaly detection method for multivariate industrial process time series data;

  • Various approaches to reducing false positives in the anomaly detection model;

  • A higher TaPR score for anomaly detection on the HAI dataset when using our method (Result of HAICon2020);

The rest of this paper is organized as follows. Section 2 describes the ICS operating structure and HAICon2020, and summarizes related research on anomaly detection. Section 3 describes the proposed model for anomaly detection in ICS and approaches to improve performance. Section 4 describes the dataset used, HAICon2020 results, and anomaly detection results. Finally, Sect. 5 describes the conclusions of this study and future work.

2 Related Work

2.1 Industrial Control System Operation Structure

ICS components include controllers, HMIs, actuators, and sensors, which communicate over various industrial protocols (see Fig. 1). An ICS consists of multiple PLCs. PLCs are used in both Supervisory Control and Data Acquisition (SCADA) systems and Distributed Control Systems (DCS) as the control component of the full hierarchical system, providing local management of processes through feedback control. Each PLC component works with a Process Variable (PV), a Set Point (SP), and a Control Variable (CV). In the process, the controller interprets the PV, which is collected through sensors measuring quantities such as pressure, temperature, and speed, and generates the CV, which drives actuators such as valves and motors. The HMI, used for integrated ICS control, monitors multiple PLCs and can change the SPs to suit process procedures [6].

Fig. 1. Industrial control system operation structure.

2.2 ICS Security Threat Detection AI Contest (HAICon2020)

Cybersecurity threats to the control systems of national infrastructure and industrial facilities are on the rise. Cyberattacks on critical national facilities can cause enormous and irreparable damage to society. Consequently, countries around the world are focusing on developing security technologies. In this context, a dataset that accurately reflects the characteristics of on-site control systems and covers various types of control-system cyberattacks is an essential element for AI-based security research. The National Security Research Institute built a control system testbed using industrial control devices, sensors, and actuators from General Electric, Emerson, and Siemens, and used it to develop the HIL-based Augmented ICS (HAI) dataset. The first version of the HAI dataset, HAI 1.0, was made available on GitHub and Kaggle in February 2020 and included ICS operational data from normal and anomalous situations covering 38 attacks [7, 8]. The improved HAI 2.0 dataset was released at HAICon2020, the first competition in Korea for machine learning and deep learning models that detect attacks and abnormal situations by learning only normal data from the HAI 2.0 dataset created for ICS security research. Performance is measured with TaPR, an evaluation method specialized for time series anomaly detection [9]. The public score is computed on about 30% of the total test data and posted on the leaderboard during the competition. The private score is computed on the rest of the test data and released immediately after the competition ends; models are ranked by the final private score. A total of 928 teams, including the authors, participated in this competition.

2.3 Research on Anomaly Detection

Anomaly detection, a type of intrusion detection system (IDS), is an important data analysis task that detects anomalies or abnormalities in a given dataset. Anomaly detection has been extensively studied in statistics, and AI technologies are increasingly being used to automate it. Anomalies are defined as deviations from normal patterns; however, defining an anomaly is still difficult, because anomalies are rare and we have no prior knowledge of all their types. Classical approaches to anomaly detection include OC-SVM, SVDD, and KDE. What these methods have in common is that they are all unsupervised and each constitutes a single approach to anomaly detection; such a single approach alone is not sufficient for time series data [10]. Currently, deep learning approaches are mainly used for anomaly detection. The open Secure Water Treatment (SWaT) dataset has been widely used for ICS research, and use of the recently published HAI dataset is increasing.

A study using the HAI dataset [11] proposes a supervised machine learning model using SMOTE (Synthetic Minority Oversampling Technique) to solve the data imbalance problem in anomaly detection. The authors chose KNN, DT, and RF classifiers to compare their performance, and the experimental results show that RF outperforms the other algorithms. Another study proposes deep-learning-based Autoencoder and SVDD approaches on the same dataset [12]. They calculate the difference between predicted and actual values and apply a Cumulative Distribution Function (CDF)-based statistical method to flag the top 2% of data as anomalies. As a result, the SAE model showed a higher detection rate than the SVDD model.

There are also studies using the Autoencoder in combination with other models [13]. They propose an LSTM Autoencoder model using the SWaT dataset and compare predictions of a typical LSTM model with the reconstruction results of the LSTM Autoencoder. Unlike typical Autoencoder neural networks that predict or reconstruct data individually, it shows improved performance with 88.5% recall and an 87.0% f1-score. Calculating the error between the predicted value and the actual value does not by itself determine what is considered an anomaly; performance varies greatly depending on the threshold. There are also studies that propose anomaly detection methods based on statistical deviation calculations [14]. Most statistical methods use the mean and standard deviation to calculate a z-score; based on the z-score, one can choose which top percentage of the entire test dataset is considered anomalous.
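As a minimal sketch of this z-score thresholding idea (not the exact method of [14]; the 2% cutoff is illustrative), assuming a one-dimensional array of prediction errors:

```python
import numpy as np

def zscore_top_percent(errors: np.ndarray, top_percent: float = 2.0) -> np.ndarray:
    """Flag the top `top_percent` of points (by z-score) as anomalies.

    `errors` is assumed to be a 1-D array of prediction errors; the 2%
    cutoff mirrors the CDF-based approach cited above and is illustrative.
    """
    z = (errors - errors.mean()) / errors.std()
    cutoff = np.percentile(z, 100 - top_percent)
    return z > cutoff  # boolean mask: True where the point is considered anomalous
```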

It is important to accurately identify various attacks in sequential data such as ICS and network traffic. There are also studies on intrusion detection using RNNs, which are well suited to sequential data [15]. They use the sequential NSL-KDD dataset to compare RNNs with other machine learning methods, and the experimental results show that RNNs model sequential data with high accuracy and outperform conventional machine learning classification methods. Besides RNN-based methods, another study [16] proposes a new algorithm based on SR (Spectral Residual) and CNN for time series anomaly detection. The CNN automatically extracts a fixed-size set of features from the time series data. SR then subtracts the predictable value from the actual value to obtain a residual, which is small when there is little difference and large otherwise; if it exceeds a preset threshold, the point is considered abnormal [17]. There are also studies comparing the performance of different combinations of DNN architectures, including variants of CNNs and RNNs. Performed on the SWaT dataset, this work successfully detected 31 of 36 different cyberattacks with 3 false positives, demonstrating that CNNs and RNNs are effective for time series prediction tasks.

3 Proposed Model

3.1 Overview

We propose a stacked Bi-LSTM model, a member of the Recurrent Neural Network (RNN) family, to detect anomalies in ICS. The training dataset consists of unlabeled normal data, so unsupervised learning is required. We normalize the data before training so that no feature dominates the others. The RNN model takes three-dimensional input (samples, time steps, features) and predicts the next data point. Anomaly detection is based on an anomaly score calculated as the difference between the actual and predicted values. We also applied three methods to improve performance. Finally, we evaluate our model using Time-series Aware Precision and Recall (TaPR), which is suitable for time series evaluation. The overall structure of the proposed model for improving anomaly detection performance is shown below (see Fig. 2).

Fig. 2. Proposed model for anomaly detection.

3.2 Data Preprocessing

Data Normalization.

There are two reasons why normalization is necessary before input features are supplied to a neural network. First, if some features have values that are large compared to others, those features dominate the learning, making the neural network's predictions inaccurate. Second, forward propagation computes inner products of the weights with the input features, so very large values (for image and non-image data alike) require considerable time and memory to compute the output; the same is true of backpropagation. As a result, the model converges slowly if the input is not normalized. To address this, we use the min-max scaling technique to normalize each feature to the range 0 to 1. For fields whose values never change, all values are set to zero. Subsequently, exponential smoothing is applied to reduce noise from sensors and actuators. Finally, we check whether the normalized data contain values less than zero or greater than one.
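The following is a minimal sketch of this preprocessing, assuming the data is held in pandas DataFrames and that the scaling statistics come from the training data; the smoothing factor `alpha` is illustrative, not a value reported here:

```python
import pandas as pd

def preprocess(train_df: pd.DataFrame, df: pd.DataFrame, alpha: float = 0.9) -> pd.DataFrame:
    """Min-max scale `df` with training-set statistics, then smooth it."""
    lo, hi = train_df.min(), train_df.max()
    constant = (hi == lo)                       # fields whose values never change
    scaled = df.copy()
    scaled.loc[:, ~constant] = (df.loc[:, ~constant] - lo[~constant]) / (hi[~constant] - lo[~constant])
    scaled.loc[:, constant] = 0.0               # constant fields are set to zero
    smoothed = scaled.ewm(alpha=alpha).mean()   # exponential smoothing against sensor noise
    # Scaling with training statistics can push validation/test values outside [0, 1]
    print("values outside [0, 1]:", bool(((smoothed < 0) | (smoothed > 1)).any().any()))
    return smoothed
```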

Time-Series Data Input and Output Definition.

An RNN is a sequence model that processes inputs and outputs as sequences. Time series data carry their features sequentially, so samples of equal size must be generated through a sliding window, a process called sampling. Setting the sliding window size too large or too small when training an RNN on time series data makes it difficult to obtain appropriate predictions. Assuming the RNN model predicts the next point t + 1 based on the current point t, if the window is too large, the range predicted as an attack tends to grow to roughly the sliding window size; if the window is too small, each sequence is not distinctive enough, so various attacks cannot be detected. A sketch of this windowing step follows.
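A minimal sketch of the sliding-window sampling, assuming a two-dimensional array of normalized values (time × features); the window size of 60 and stride of 1 follow Sect. 4.2:

```python
import numpy as np

def make_windows(data: np.ndarray, window: int = 60, stride: int = 1):
    """Slice `data` (time x features) into (samples, time steps, features).

    Each sample uses the first `window - 1` rows as input and the last row
    as the prediction target, matching a model that predicts the next time
    step from the preceding sequence.
    """
    xs, ys = [], []
    for start in range(0, len(data) - window + 1, stride):
        seq = data[start:start + window]
        xs.append(seq[:-1])   # input: the first window-1 time steps
        ys.append(seq[-1])    # target: the following time step
    return np.stack(xs), np.stack(ys)
```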

3.3 Proposed Anomaly Detection Model Design

We compared various algorithms for processing time series data. Algorithms such as the AutoEncoder, KNN, and DNN perform point anomaly detection, flagging points that fall outside the range of normal data; they were excluded because they cannot reflect the temporal information in a time series. We also considered using a CNN to extract local features and then combining it with other algorithms. Existing work has combined CNNs with LSTMs and AutoEncoders with good performance, but not on HAI datasets. CNNs are suited to images, which provide many pixels as features, whereas the HAI dataset has only 79 features, far fewer than an image, so we judged that extracting meaningful local features would lose too much information. The RNN family is suitable for time series because it can predict the present from previously observed sequences. We compared the performance of RNN, GRU, and LSTM on the HAI 2.0 dataset. GRU and LSTM performed similarly, but LSTM slightly outperformed GRU, so we chose LSTM. The plain RNN performed poorly due to the vanishing gradient problem, in which weight gradients shrink during backpropagation.

We propose a stacked bidirectional LSTM model. A unidirectional LSTM could be used, but we expect a bidirectional LSTM to be more powerful, so we chose the bidirectional variant. A stacked LSTM increases the complexity of the model by stacking LSTM layers, allowing it to learn from complex and large amounts of data; this can enable more sophisticated learning than simply increasing the number of cells in a single LSTM layer. However, stacking many layers does not keep improving performance. We apply skip connections across three bidirectional LSTM layers. This minimizes the information loss that occurs as the input passes through multiple layers, and the final prediction is the sum of the first value of the input sequence and the model output, which also prevents the weights from becoming too large or too small. To avoid overfitting during training, instead of dropout, which randomly deletes nodes, we use callback functions that provide early stopping and model selection. The anomaly score is the error between the actual value and the predicted value, computed with MAE, a typical measure; if the anomaly score is greater than or equal to the threshold, the point is considered an attack.
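A minimal Keras sketch of this architecture follows, assuming the window and feature sizes from Sect. 4.2; the final Dense projection is an assumption needed to match the feature dimension rather than a detail stated above:

```python
from tensorflow.keras import layers, models

def build_bilstm(window: int = 60, n_features: int = 79, hidden: int = 100):
    """3-stacked Bi-LSTM that predicts the next time step from a window."""
    inputs = layers.Input(shape=(window - 1, n_features))
    x = layers.Bidirectional(layers.LSTM(hidden, return_sequences=True))(inputs)
    x = layers.Bidirectional(layers.LSTM(hidden, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(hidden))(x)
    out = layers.Dense(n_features)(x)                      # project back to the feature dimension
    first = layers.Lambda(lambda t: t[:, 0, :])(inputs)    # first value of the input window
    out = layers.Add()([out, first])                       # skip connection described above
    model = models.Model(inputs, out)
    model.compile(optimizer="adam", loss="mse")            # MSE loss and Adam optimizer (Sect. 4.2)
    return model
```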

3.4 Attempts to Improve Performance

Anomaly Score Moving Average.

We expect the anomaly score to rise when an attack starts and fall when it ends, but this is not always the case, so we smooth the score with a moving average. In statistics, a moving average analyzes data by computing means over successive subsets of the entire dataset; it is suitable for detecting trends or variations and is also used in stock price analysis. However, in anomaly detection, the mean of the anomaly score over the whole dataset is close to zero because of the data imbalance between normal and anomalous data. We therefore apply a slightly modified moving average, which does not consider the overall mean, to the anomaly score described in Sect. 3.3: the final score at each prediction point is the average of the score at that point and the mean of the most recent N scores. The final anomaly score is considered an attack if it is above the threshold; otherwise it is considered normal. Visualizing the anomaly score helps set an appropriate threshold (see Fig. 3). Due to the nature of time series data, anomaly scores occurring after the prediction time are not considered; however, if the moving average is computed over the anomaly scores both before and after the prediction time, performance is higher than when using only the previously observed scores. A sketch of this smoothing is given after Fig. 3.

Fig. 3. Visualization of anomaly score on the validation dataset.
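A minimal sketch of the modified moving average, assuming a one-dimensional array of per-timestep anomaly scores; N = 10 follows Sect. 4.2, and the handling of the first time step is an assumption:

```python
import numpy as np

def smoothed_scores(scores: np.ndarray, n: int = 10) -> np.ndarray:
    """Average each anomaly score with the mean of the previous `n` scores."""
    out = np.empty(len(scores), dtype=float)
    for t in range(len(scores)):
        recent = scores[max(0, t - n):t] if t > 0 else scores[:1]
        out[t] = (scores[t] + recent.mean()) / 2.0
    return out

# Time steps whose smoothed score meets the threshold are labeled as attacks,
# e.g. labels = (smoothed_scores(scores) >= threshold).astype(int)
```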

Relabeling Short Predicted Attack and Normal Ranges.

Even after applying the moving average to the anomaly score, false positives can occur when the score stays close to the threshold. In such cases the anomaly score repeatedly crosses the threshold, so the data is alternately labeled as attack and normal. This increases false positives and degrades the performance of the anomaly detection model. To minimize false positives, we relabel attack or normal ranges that are predicted for less than a certain duration back to the preceding label. In this way, a slight fluctuation in the anomaly score neither turns an attack into normal nor turns normal into an attack. Considering the time series properties in this way can reduce both false positives and false negatives, as in the sketch below.
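A minimal sketch of this relabeling, assuming a time-ordered sequence of 0/1 predictions; the minimum run length is illustrative and not a value stated above:

```python
def relabel_short_segments(labels, min_len=10):
    """Revert attack/normal runs shorter than `min_len` to the preceding label.

    `labels` is a time-ordered sequence of 0 (normal) / 1 (attack) predictions;
    `min_len` is an illustrative minimum run length, not a value from the paper.
    """
    labels = list(labels)
    i = 1
    while i < len(labels):
        j = i
        while j < len(labels) and labels[j] == labels[i]:
            j += 1                                   # [i, j) is one run of identical labels
        if labels[i] != labels[i - 1] and (j - i) < min_len:
            labels[i:j] = [labels[i - 1]] * (j - i)  # absorb the short run into the previous label
        i = j
    return labels
```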

Attack Detection Policy at High and Low Window Size.

We compared the results obtained with different sliding window sizes. If the window size is set high, attack types with higher anomaly scores are detected over a range wider than the actual attack range. Conversely, a low window size narrows this over-detected range, but certain attack types are missed because their anomaly scores are low. The figure below shows the anomaly scores at window sizes of 60 and 10 for the five attack types in the labeled validation dataset (see Fig. 4). Overall, the attack types were detected well, but types 4 and 5 were detected over a range wider than the actual attack range because of the high window size; the reason is that the high window size affects the subsequent input sequences. To reduce false positives, attacks are detected by combining the results from the high and low window sizes. In this study, the window size was set to 60 for the high setting and 10 for the low setting under otherwise identical conditions. The high-window-size model detects the starting point of an attack, and the low-window-size model detects its ending point. This combination is applied only to attack ranges with an anomaly score of 0.1 or more, as sketched below.
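One possible reading of this policy is sketched below, assuming the label and score arrays of the two models are already aligned in time; `score_floor` corresponds to the 0.1 value mentioned above, and the exact merging rule is an assumption rather than a detail given in the text:

```python
import numpy as np

def combine_windows(high_labels, low_labels, high_scores, score_floor=0.1):
    """Combine detections from the high- and low-window models.

    For each attack range found by the high-window model whose peak anomaly
    score reaches `score_floor`, the end of the range is trimmed to the last
    attack point reported by the low-window model inside that range.
    """
    combined = np.asarray(high_labels, dtype=int).copy()
    low = np.asarray(low_labels, dtype=int)
    scores = np.asarray(high_scores, dtype=float)
    t = 0
    while t < len(combined):
        if combined[t] == 1:
            end = t
            while end < len(combined) and combined[end] == 1:
                end += 1                              # [t, end) is one detected attack range
            if scores[t:end].max() >= score_floor:
                low_hits = np.where(low[t:end] == 1)[0]
                if len(low_hits) > 0:
                    combined[t + low_hits[-1] + 1:end] = 0   # trim the tail of the range
            t = end
        else:
            t += 1
    return combined
```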

3.5 Evaluation Method (TaPR)

In general, precision, recall, and the f1-score are commonly used to evaluate the classification of normal and abnormal conditions. However, from the standpoint of ICS operation, the various abnormal ranges need to be detected accurately. In addition, if the process is stopped because of a false positive, availability is hard to guarantee, so precision and recall should be weighted differently in the evaluation. Therefore, it is recommended to evaluate performance with TaPR when using HAI datasets. The key elements of the TaPR assessment are as follows.

  • Does the prediction result detect anomalies without false positives?

  • How many diverse anomalous ranges can be found?

In the scoring method based on these main evaluation factors of TaPR, a high score is given even if only part of an attack range is found, while even a small number of false positives leads to a large deduction. In this paper, the performance of the detection model was evaluated with TaPR, which reflects these characteristics of the ICS.

4 Experimental Result

The experiments were implemented in Python 3.7, using the Keras 2.4.3 and TensorFlow 2.3 libraries for deep learning training and inference. Jupyter Notebook 6.2.0 was used for development, and an NVIDIA Tesla T4 GPU was used for training.

Fig. 4. Anomaly score graph at high and low window size.

4.1 Dataset

In this paper, we used the HAI 2.0 dataset released for HAICon2020. The training dataset consists of normal data collected every second over a total of three collection periods and has no labels. The validation dataset has labels for attacks and anomalies and contains five attack situations. The test dataset consists of 358,804 records generated over four collection periods, and the goal is to design an appropriate model that detects the unknown attacks and anomalies contained in it. The composition of the HAI 2.0 dataset is shown in the table below (see Table 1). All datasets consist of 79 features: a time field indicating the timestamp, with the remaining features being de-identified sensor and actuator values.

Table 1. HAI 2.0 dataset configuration.

4.2 Anomaly Detection Result

Based on the training dataset, we normalized the validation and test datasets using the min-max scaling technique. The proposed model uses a sliding window size of 60 and a stride of 1, i.e., it predicts the 60th second from the sequence of the 1st to the 59th second. The model is a 3-stacked bidirectional LSTM with a hidden cell size of 100; no dropout is used, and callback functions are used instead. The callbacks compute the loss on the validation dataset at every epoch while the model is training and save the model if the loss is lower than in the previous epoch; otherwise training proceeds to the next epoch. If the loss does not decrease for 4 epochs, training is stopped to avoid overfitting. The loss function is MSE and the optimizer is Adam; the batch size is 512 and the maximum number of epochs is 32. We used the skip-connection technique, making predictions by summing the first values of the input window and the output of the Bi-LSTM model.

The initial anomaly score is calculated as the difference between the actual and predicted values. The final anomaly score is the average of two terms: the score at the prediction time and the mean score over the previous 10 s. The threshold for the anomaly score is set to 0.019 using the validation dataset; exceeding the threshold is considered an attack. Around this threshold, normal and attack ranges predicted for only a short duration are relabeled to the preceding label, as described in Sect. 3.4. We then train another model with a low sliding window size of 10 and detect attacks in the same way, with a threshold of 0.008. For attack ranges in which the proposed high-window model yields a score of 0.1 or higher, the end point of the attack is taken from the low-window model's result.

We placed second out of 931 teams at HAICon2020, where we submitted our test results, with a public score of 0.98031 and a private score of 0.93614. By reflecting the characteristics of the time series data to improve anomaly detection performance, we confirmed that the TaPR score increases. The validation dataset results (see Tables 2 and 3) and test dataset results (see Table 4) are as follows.

Table 2. Classification evaluation metrics results from validation dataset.
Table 3. TaPR results on validation dataset.
Table 4. TaPR results on test dataset.
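For reference, a minimal sketch of the training and detection setup described above is shown below; it reuses the model and smoothing sketches from Sects. 3.3 and 3.4 and the windowed arrays from Sect. 3.2, and the checkpoint file name is hypothetical:

```python
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

def train_and_detect(model, train_x, train_y, val_x, val_y, test_x, test_y,
                     threshold=0.019, n=10):
    """Train with early stopping/checkpointing and label the test data."""
    callbacks = [
        EarlyStopping(monitor="val_loss", patience=4),        # stop after 4 epochs without improvement
        ModelCheckpoint("best_model.h5", monitor="val_loss",   # hypothetical file name
                        save_best_only=True),                  # keep only the best epoch
    ]
    model.fit(train_x, train_y, validation_data=(val_x, val_y),
              batch_size=512, epochs=32, callbacks=callbacks)

    pred = model.predict(test_x)
    scores = np.abs(pred - test_y).mean(axis=1)     # MAE anomaly score per time step
    smoothed = smoothed_scores(scores, n=n)         # moving average from the Sect. 3.4 sketch
    return (smoothed >= threshold).astype(int)      # 1 = attack, 0 = normal
```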

5 Conclusion

Because ICS was traditionally operated in closed network environments, security was not considered at all. However, the transition of ICS, which used to be considered safe, to open networks has revealed a variety of security vulnerabilities, and systematic security technology for ICS is required to prevent and respond to the resulting incidents. In this paper, we participated in HAICon2020 and proposed a Bi-LSTM-based anomaly detection model using the HAI 2.0 dataset published there. We made various attempts to improve performance by considering the features of time series data and confirmed that the TaPR score increased. The proposed model ranked second among the 931 teams participating in HAICon2020, so it can be considered a proven model for detecting anomalies in time series ICS data. However, because some anomalies remain difficult to detect in industrial settings due to a lack of training data and features, it will be necessary to extract new features or add components capable of detecting them. In addition, since the current model only detects whether an anomaly exists in the overall industrial process, we plan to conduct further research to characterize the device that caused the abnormal process. Notifying the security administrator of this result is expected to enable faster response and recovery.