1 Introduction

Cyber Physical Systems (CPSs) are built by integrating computational algorithms and physical components for various mission-critical tasks. Examples of such systems include public infrastructures such as smart power grids, water treatment and distribution networks, and transportation, as well as robotics and autonomous vehicles. These systems are typically large and geographically dispersed, hence they are increasingly connected to networks for remote monitoring and control. However, such network connectivity increases the likelihood of cyber attacks. It is therefore necessary to develop techniques to defend CPSs against attacks, whether cyber or physical. A “cyber attack” refers to an attack that is transmitted through a communications network to affect system behavior with an intention to cause some economic harm. A “physical attack” targets a physical component, such as a motor or a pump, to disrupt the state of the system.

Research efforts in securing CPSs from such attacks have been ongoing. However, few operational datasets are available to this research community to advance the field of securing CPSs. While there are datasets for Intrusion Detection Systems (IDS), these datasets focus primarily on network traffic. Such datasets include, for example, the DARPA Intrusion Detection Evaluation Dataset [3] and the NSL-KDD99 [2] datasets. These data are a collection of raw TCP dumps gathered over a period of time, which include various intrusions simulated in a military network environment. Such datasets are thus not suitable for CPS IDS. The only other publicly available datasets for CPS known to the authors are provided by the Critical Infrastructure Protection Center at the Mississippi State University (MSU) [4]. Their datasets [4] comprise data obtained from their Power, Gas and Water testbeds. The power dataset is based on a simulated smart grid, whereas their water and gas datasets were obtained from a very small scale laboratory testbed. However, as acknowledged by the authors themselves, these datasets have been found to contain some unintended patterns that can be used to easily distinguish attacks from non-attacks using machine learning algorithms. Although the gas dataset was updated in 2015 [4] to provide more randomness, it was obtained from a small scale testbed which may not reflect the true complexity of CPSs. Hence, there is no publicly available realistic dataset of sufficient complexity from a modern CPS that contains both network traffic data and physical properties of the CPS.

The goal of this paper is to provide a realistic dataset that can be utilised to design and evaluate CPS defence mechanisms. In this paper, we present a dataset obtained from the Secure Water Treatment (SWaT) testbed.

The main objective of creating this dataset and making it available to the research community is to enable researchers to (1) design and evaluate novel defence mechanisms for CPSs, (2) test mathematical models, and (3) evaluate the performance of formal models of CPS. The key contributions of the paper are as follows:

  1. A large scale labelled (normal and attack) dataset collected from a realistic testbed of sufficient complexity.

  2. Network traffic and physical properties data.

The remainder of this paper is organised as follows. Section 2 describes the SWaT testbed in which the data collection process was implemented. Section 3 presents the attacks used in this data collection procedure. Section 4 describes the entire data collection process including the types of data collected. The paper concludes in Sect. 5.

Fig. 1. Actual photograph of SWaT testbed

2 Secure Water Treatment (SWaT)

As illustrated in Fig. 1, SWaT is a fully operational scaled-down water treatment plant with a small footprint, producing 5 gallons/minute of doubly filtered water. This testbed replicates large modern water treatment plants such as those found in cities. Its main purpose is to enable experimentally validated research in the design of secure and safe CPSs. SWaT has six main processes, corresponding to the physical and control components of the water treatment facility, which together form the six-stage filtration process shown in Fig. 2.

2.1 Water Treatment Process

The first process (P1) begins by taking in raw water and storing it in a tank. The water is then passed through the pre-treatment process (P2). In this process, the quality of the water is assessed. Chemical dosing is performed if the water quality is not within acceptable limits. The water then reaches P3, where undesirable materials are removed using fine filtration membranes. After the residuals are filtered through the Ultra Filtration system, any remaining chlorine is destroyed in the Dechlorination process (P4) using Ultraviolet lamps. Subsequently, the water from P4 is pumped into the Reverse Osmosis (RO) system (P5) to reduce inorganic impurities. In the last process, P6, water from the RO is stored and ready for distribution in a water distribution system. In the case of SWaT, the treated water can be transferred back to the raw tank for re-processing. However, for the purpose of data collection, the water from P6 is disposed of to mimic water distribution.

2.2 Communications

SWaT consists of a layered communication network, Programmable Logic Controllers (PLCs), Human Machine Interfaces (HMIs), a Supervisory Control and Data Acquisition (SCADA) workstation, and a Historian. Data from the sensors is available to the SCADA system and recorded by the Historian for subsequent analysis.

As illustrated in Fig. 3, there are two networks in SWaT. Level 1 is a star network that allows the SCADA system to communicate with the six PLCs dedicated to the individual processes. Level 0 is a ring network that transmits sensor and actuator data to the relevant PLC. The sensors, actuators and PLCs all communicate via either wired or wireless links (manual switches allow changing between wireless and wired modes).

Fig. 2. SWaT testbed processes overview

Fig. 3. SWaT testbed processes overview

In the data collection process, only network data through wired communications was collected.

3 Attack Scenarios

A systematic approach was used to attack the system. We used the attack model of [1], which considers the intent space of an attacker for a given CPS. This attack model can be used to generate attack procedures and functions that target a specific CPS. In our case, an attack model targeting the SWaT testbed was derived. We launched the attacks through the data communication link in Level 1 of the network (Fig. 3). In essence, we hijack the data packets and manipulate the sensor data before sending the packets to the PLCs. We assume that an attacker is successful in launching an attack, hence the number of possible attack scenarios is infinite.

The attack model [1] for CPS is abstracted as a sextuple (M, G, D, P, \(S_{0}\), \(S_{e}\)), where M is a potentially infinite set of procedures to launch attacks, G is a subset of a finite set of attacker intents, D is the domain model for the attacks derived from the CPS, P is a finite set of attack points, and \(S_{0}\) and \(S_{e}\) are infinite sets of states of the CPS that denote, respectively, the possible start and end states of interest to the attacker. An attack point in a CPS could be a physical element or an entry point through the communications network connecting sensors or actuators to the controllers (PLCs) and the SCADA system.
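One way to hold the sextuple in code is sketched below. This is only an illustration under stated assumptions: the class name `AttackModel`, the field names, the use of predicates to stand in for the infinite state sets, and the 800 mm threshold in the example are all hypothetical and are not taken from [1] or from the dataset.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable

# Hypothetical representation of a CPS state: sensor/actuator name -> value.
State = Dict[str, float]


@dataclass(frozen=True)
class AttackModel:
    """Illustrative container for the sextuple (M, G, D, P, S0, Se)."""
    procedures: Iterable[str]                # M: possibly infinite, so any iterable
    intents: FrozenSet[str]                  # G: subset of a finite set of intents
    domain_model: str                        # D: domain model derived from the CPS
    attack_points: FrozenSet[str]            # P: finite set of attack points
    is_start_state: Callable[[State], bool]  # S0: infinite set, given as a predicate
    is_end_state: Callable[[State], bool]    # Se: infinite set, given as a predicate


# Example instance; the 800 mm threshold is an assumed value for illustration.
overflow = AttackModel(
    procedures=["spoof_level_reading"],
    intents=frozenset({"overflow tank"}),
    domain_model="SWaT",
    attack_points=frozenset({"LIT101"}),
    is_start_state=lambda s: s["LIT101"] < 800.0,
    is_end_state=lambda s: s["LIT101"] >= 800.0,
)
```

Representing \(S_{0}\) and \(S_{e}\) as predicates rather than explicit collections is a design choice made here because both sets are infinite.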

From the above discussion, it is clear that the space of potential attacks is large. The massive size of the attack space arises from varying the method M, the potential attack points P, and the start and end states of the CPS. SWaT consists of six stages, where each stage contains a different number of sensors and actuators. Based on the attack points in each stage, the attacks are divided into four types.

  1. Single Stage Single Point (SSSP): A Single Stage Single Point attack focuses on exactly one point in a CPS.

  2. Single Stage Multi Point (SSMP): A Single Stage Multi Point attack focuses on two or more attack points in a CPS but on only one stage. In this case, the set P consists of more than one element in a CPS selected from any one stage.

  3. Multi Stage Single Point (MSSP): A Multi Stage Single Point attack is similar to an SSSP attack except that the SSSP attack is now performed on multiple stages.

  4. Multi Stage Multi Point (MSMP): A Multi Stage Multi Point attack is an SSMP attack performed on two or more stages of the CPS.
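The four categories can be expressed as a small classification rule over the set of attack points. The sketch below is illustrative only: the function name `classify_attack` and the `(stage, component)` encoding are assumptions, and it further assumes that a multi-stage attack with more than one attack point in at least one stage counts as MSMP, a boundary case the definitions above do not settle.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_attack(points: Iterable[Tuple[int, str]]) -> str:
    """Classify an attack from its (stage, component) attack points."""
    unique_points = set(points)
    if not unique_points:
        raise ValueError("an attack needs at least one attack point")

    # Number of distinct attack points in each affected stage.
    per_stage = Counter(stage for stage, _ in unique_points)

    if len(per_stage) == 1:
        return "SSSP" if len(unique_points) == 1 else "SSMP"
    # Assumption: any stage with more than one point makes the attack MSMP.
    return "MSSP" if max(per_stage.values()) == 1 else "MSMP"


examples = {
    "SSSP": [(1, "LIT101")],
    "SSMP": [(1, "LIT101"), (1, "P101")],
    "MSSP": [(1, "LIT101"), (3, "LIT301")],
    "MSMP": [(1, "LIT101"), (1, "P101"), (3, "LIT301"), (4, "LIT401")],
}
```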

For a detailed description of the attacks generated, we refer the reader to the dataset website (footnote 1). The data collection process consisted of the following steps.

Step 1: Define each attack based on the number of attack points and places.

Step 2: Design each attack based on the attack point (i.e. the actuator or sensor to be affected), the start state, the type of attack, the value of the selected sensor data to be sent to the PLC, and the intended impact.

A total of 36 attacks were launched during the data collection process. The breakdown of these attacks is listed in Table 1. The duration of an attack varies with the attack type. A few attacks, each lasting 10 min, are performed consecutively with a gap of 10 min between successive attacks. Other attacks are performed only after letting the system stabilise following the previous attack. The duration of system stabilisation varies across attacks. Some attacks have a stronger effect on the dynamics of the system and so require more time for the system to stabilise. Simpler attacks, such as those that affect flow rates, require less time to stabilise. Also, some attacks do not take effect immediately.

Table 1. Number of attacks per category

4 Data Collection Process

The data collection process lasted for a total of 11 days. SWaT ran non-stop, 24 hours a day, during the entire 11-day period. SWaT was run without any attacks during the first seven of the 11 days. Attacks were launched during the remaining four days. Various attack scenarios, discussed in Sect. 3, were implemented on the testbed. These attacks were of various intents and lasted from a few minutes to an hour. Depending on the attack scenario, the system was either allowed to reach its normal operating state before another attack was launched or the attacks were launched consecutively.

The following assumptions are made during the data collection process.

  1. The system will stabilise and reach its operational state within the first seven days of normal operation.

  2. Data is recorded once every second, assuming that no significant attack on the SWaT testbed can be launched in less than one second.

  3. The PLC firmware does not change.

All tanks in SWaT were emptied prior to starting data collection; i.e. the data collection process starts from an empty state of SWaT. This initialization was deemed necessary to ensure that all the tanks are filled with unfiltered water and not pre-treated.

Table 2. Sensor and actuator description of the SWaT testbed.

4.1 Physical Properties

All the data was logged continuously, once every second, into a Historian server. Data recorded in the Historian was obtained from the sensors and actuators of the testbed. Sensors are devices that convert a physical parameter into an electronic output (i.e. an electronic value), whereas actuators are devices that convert a signal into a physical output (e.g. turning a pump off or on).

Fig. 4. First 10 h of data collection

The dataset describes the physical properties of the testbed in operational mode. In total, 946,722 samples, each comprising 51 attributes, were collected over 11 days. Data capturing the physical properties can be used for profiling cyber-attacks. Table 2 describes the different sensors and actuators in SWaT that served as the source of the data.
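As a rough sanity check on the sample count, the reported figure can be compared with the number of seconds in 11 full days at one sample per second. The snippet below is plain arithmetic on the numbers quoted above. The source does not explain the small difference, so no cause is assumed here.

```python
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400 samples per day at 1 Hz

max_samples = 11 * SECONDS_PER_DAY      # upper bound for 11 full days
reported_samples = 946_722              # figure quoted in the text

shortfall = max_samples - reported_samples
coverage = reported_samples / max_samples

print(f"upper bound : {max_samples:,}")
print(f"shortfall   : {shortfall:,} s (about {shortfall / 60:.0f} min)")
print(f"coverage    : {coverage:.2%}")
```

The reported count therefore corresponds to roughly an hour less than 11 complete days of logging at one sample per second.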

As the data collection process started from an empty state, it took about 5 h for SWaT to stabilise. Figure 4(a) indicates a steady flow of water into the tank in P1 (the level of the tank is reported by sensor LIT101). Figure 4(b) shows that it took approximately 5 h for the tank to fill up and reach its operational state. For the tanks in stages P3 and P4 (levels reported by sensors LIT301 and LIT401 respectively), it took approximately 6 h for the tanks to be filled up. This is because the water from P1 is sent to P2 for chemical dosing before it reaches P3, hence an additional hour is needed to fill up the tank. The water from P3 is subsequently sent to P4 for dechlorination.

Figures 5(a) and (b) illustrate consequences of cyber attacks. Figure 5(a) illustrates a disturbance in the usual cycle of the readings from sensor LIT101 between 6:30 pm and 6:42 pm. This was an SSSP attack with the intention of overflowing the tank by shutting pump P101 off and manipulating the value of LIT101 to be at 700 mm for 12 min. The effects are observed immediately and over the next hour, before the data stabilises nearly two hours later. Similarly, Fig. 5(b) shows the consequence of an SSSP attack with the intention to underflow the tank and damage pump P101. In this attack, sensor LIT301 was attacked between 12:08 pm and 12:15 pm to increase the reported level to 1100 mm. This deceives the PLC into concluding that there is an oversupply of water, so it turns the pump on to supply water to P4. In reality, the water level falls below the low mark while the pump is still active. Given sufficient time, this attack can cause the tank in P3 to underflow, thus stagnating the output of the plant and damaging the pumps.
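The underflow mechanism can be illustrated with a toy simulation. This is not a model of SWaT: the function `run_tank`, the control rule, the low and high marks, and the fill and drain rates are all hypothetical values chosen for illustration. Only the spoofed reading of 1100 mm comes from the text.

```python
def run_tank(steps, spoof=None, level=800.0, low=250.0, high=1000.0,
             fill=30.0, drain=50.0):
    """Return the true tank level (mm) after each step of a toy control loop.

    The controller switches the outlet pump on when the *reported* level
    reaches `high` and off when it drops to `low`. If `spoof` is given,
    the controller always sees that value instead of the true level.
    """
    pump_on = False
    history = []
    for _ in range(steps):
        reported = level if spoof is None else spoof
        if reported >= high:
            pump_on = True
        elif reported <= low:
            pump_on = False
        # True level: constant inflow, minus outflow while the pump runs.
        level = max(level + fill - (drain if pump_on else 0.0), 0.0)
        history.append(level)
    return history


normal = run_tank(100)
attacked = run_tank(100, spoof=1100.0)   # reported level pinned at 1100 mm
print(min(normal), min(attacked))
```

With the spoofed reading, the controller never sees the low mark, so the pump keeps running and the true level drains to zero, whereas the unattacked loop cycles between the two marks.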

Fig. 5. Attack data plots

4.2 Network Traffic

Network traffic was collected using commercially available equipment from Check Point\(^{\textregistered }\) Software Technologies Ltd (footnote 2). This equipment was installed in the SWaT testbed specifically to collect all the network traffic for analysis. For the purpose of data collection, however, we retained the network traffic attributes that are valuable for intrusion detection, as listed in Table 3. As with the physical properties, the collection of network traffic began the moment the testbed was switched to operational mode. The attacks were performed at Level 1 of the SWaT network, as discussed in Sect. 2. The network data captures the communication between the SCADA system and the PLCs. Hence, the attacks were launched by hijacking the packets as they travel between the SCADA system and the PLCs. During the process, the network packets are altered to reflect the spoofed values from the sensors.

Table 3. Network traffic data

4.3 Labelling Data

As the attacks performed in this paper were carried out through a controlled process, labelling of the data turned out to be straightforward. During the operational mode of the testbed, any actions on the testbed were required to be logged. Hence, all attacks performed for the purpose of data collection were logged with the information in Table 4.

Table 4. Attack logs

Labelling of Physical Properties. Each data item corresponding to a sensor or an actuator was collected individually into a CSV file. Each CSV file contains the server name, sensor name, value at that point in time, time stamp, questionable, annotated and substituted. As all the attributes come from the same server, questionable, annotated and substituted are redundant and were hence removed. All the remaining data was then combined into a single CSV file. Figure 6 illustrates a snapshot of the physical properties data. Using the attack logs, the data was subsequently labelled manually based on the start and end times of the attacks.
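The interval-based labelling step can be sketched as follows. The source describes the labelling as manual, so this is merely one way such a step could be automated. The column names, the timestamp format, the dates, the sensor values and the inclusive treatment of the start and end times are assumptions, not details of the released files.

```python
from datetime import datetime

# Hypothetical timestamp format; check the actual CSV before reusing it.
TIME_FORMAT = "%d/%m/%Y %H:%M:%S"


def parse(ts):
    return datetime.strptime(ts, TIME_FORMAT)


def label_rows(rows, attack_log):
    """Add a 'Label' field to each row using (start, end) attack intervals."""
    intervals = [(parse(start), parse(end)) for start, end in attack_log]
    labelled = []
    for row in rows:
        t = parse(row["Timestamp"])
        # Assumption: both interval endpoints count as part of the attack.
        attacked = any(start <= t <= end for start, end in intervals)
        labelled.append({**row, "Label": "Attack" if attacked else "Normal"})
    return labelled


# Made-up log entry and rows; the date and values are placeholders.
attack_log = [("01/01/2016 18:30:00", "01/01/2016 18:42:00")]
rows = [
    {"Timestamp": "01/01/2016 18:29:59", "LIT101": 512.3},
    {"Timestamp": "01/01/2016 18:30:00", "LIT101": 700.0},
    {"Timestamp": "01/01/2016 18:42:00", "LIT101": 700.0},
    {"Timestamp": "01/01/2016 18:42:01", "LIT101": 655.1},
]
labels = [r["Label"] for r in label_rows(rows, attack_log)]
```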

Fig. 6. Example of physical properties data

Labelling of Network Traffic. The network data was separated into multiple CSV files with a limit of 500,000 packets (lines) per file for easier processing. However, as the data was captured at one-second resolution, there are instances of overlap where multiple rows reflect different activities but carry the same time stamp. As with the physical properties, the data was labelled based on the start and end times of the attacks recorded in the attack logs. Figure 7 illustrates a snapshot of the network data saved as a CSV file.
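The splitting step and the timestamp overlap can both be illustrated with a few lines of standard-library code. The helper names `chunked` and `rows_per_timestamp` are hypothetical, and the sketch works on in-memory rows rather than on the actual capture files. Only the 500,000-line limit is taken from the text.

```python
from collections import Counter
from itertools import islice


def chunked(rows, limit=500_000):
    """Yield successive lists of at most `limit` rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, limit))
        if not chunk:
            return
        yield chunk


def rows_per_timestamp(timestamps):
    """Count how many rows share each one-second time stamp."""
    return Counter(timestamps)


# 1.2 million dummy packets end up in three files' worth of rows.
sizes = [len(chunk) for chunk in chunked(range(1_200_000))]

# Made-up time stamps: two packets fall within the same second.
counts = rows_per_timestamp(["18:30:00", "18:30:00", "18:30:01"])
```

Because several packets can share a time stamp, a time stamp alone does not identify a row, so any join between the network and physical data on time is many-to-one.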

Fig. 7. Example of network data

5 Conclusion

The lack of reliable and publicly available CPS datasets is a fundamental concern for researchers investigating the design of secure CPSs. There are currently no such large scale public datasets available as there are no open CPS facilities. Real industrial CPS facilities would not be able to provide accurate datasets as faults or attacks can only be assumed at best.

The data collected from the SWaT testbed reflects a real-world environment, which helps to ensure the quality of the dataset in terms of both normal and attack data. The attacks carried out by the authors illustrate how such attacks can take place in modern CPSs and give us the ability to accurately label the data for subsequent use. The information and data provided with this paper include both network and physical properties, stored in CSV file format.

Our goal is to make the collection of CPS datasets an on-going process to benefit researchers. The data collected will be continuously updated to include datasets from new testbeds as well as new attacks devised by our research team.