1 Introduction

Smart grids are electrical grids that use information and communication technology (ICT) to provide efficient, reliable distribution and transmission, and the importance of security and trust in them cannot be overstated. Among the several emerging security vulnerabilities, the false data injection (FDI) attack is one of the most serious, with the potential to increase energy distribution costs drastically (Ahmed and Pathan 2020). The reliable operation of any power system depends mainly on appropriate protection schemes developed for line faults and emergencies. A reliable protection scheme enables faster fault detection so that the power supply can be restored as soon as possible after a failure. In recent years, with the rapid assimilation of the physical energy transmission system in the smart grid with cybernetic information and communication tools, the possibility of cyber-attacks poses a severe challenge to the development and implementation of reliable protection mechanisms. Fault protection components play an essential role in the overall operation and control of the power system. Increased pressure for rapid fault detection and a reduction in fault levels, driven by the penetration of renewable energy, has caused a shift from classic protection schemes using local measurements to "wide-area measurement-based protection schemes" (Phadke et al. 2008). The effective performance of a protection scheme based on wide-area measurement is highly dependent on the sensor information transmitted to the control center over the network. The power system is heavily dependent on the public communication network for reliable monitoring and operation, making it vulnerable to network attacks (Sridhar et al. 2011).

The false data injection attack (FDIA) is considered the most effective network attack, in which the attacker can disrupt the entire power grid with minimal effort. During an FDIA, the attacker destroys the integrity of a set of measurements used in the protection algorithm by altering the meter/sensor measurements (Liang et al. 2016; Liu and Li 2017). The protection algorithm is part of the backup protection strategy and is operated by the control center. Transmission of erroneous data to the control center can cause unnecessary control actions, leading to unexpected events or even power outages. Therefore, the current scenario requires a protection scheme that is immune to data falsification and/or includes components for the preventive detection of false data injection. Conventional bad data detection methods that are part of the state estimator should detect any malicious manipulation of sensor information. However, Liu et al. (2011) have proven that attackers with sufficient knowledge of the system dynamics can bypass bad data detection and inject arbitrary errors into state variables using FDIA. Thus, manipulation of the sensor information during an attack can provide a misleading picture of the dynamics and operation of the system, causing the relay to malfunction during a fault or to trip unnecessarily and isolate parts of the system. The malfunction of the protection relay and the delay in detecting this type of attack can cause enormous economic losses, damage to assets, and the collapse of the subsystems and control mechanisms related to power systems.

Several factors have contributed to a significant increase in false data injection attacks, among them continuous real-time online monitoring using sensors (CT, PT, PMU) and communication networks that carry current or voltage signal information from different locations or buses. Recent work on FDIA in the power grid mainly focuses on FDIA modeling, attack detection, and defense measures (Liu et al. 2016). The possible impact of FDIA on the power system has been addressed in (Liang et al. 2016; Liu et al. 2011; Deng et al. 2016). Noteworthy solutions reported for FDIA detection in power grids are based on transmission line susceptance measurement (Deng and Liang 2018), reactance perturbation (Liu et al. 2018), joint transformation (Singh et al. 2017), extreme learning machines (Yang et al. 2017), sparse optimization (Liu et al. 2014), and the cumulative sum method (Li et al. 2014). Yang et al. (2013) proposed a countermeasure against FDIA in which the sensors measuring the power injected at buses connected to several other buses are secured; the inaccessibility of these sensors makes it difficult for attackers to mount an FDIA. In Bi and Zhang (2014) and Deng et al. (2015), a defense mechanism is proposed to protect a set of state variables.

Das et al. (2019) proposed a simple, economically viable, and FDIA-resilient scheme based on the logical analysis of data (LAD) for attacks on power systems, under the assumption that the adversary has complete knowledge of the system dynamics. The rule-based fault detection scheme identifies a limited set of sensors to be secured using cryptographic protocols, tamper-resistant hardware, and encryption, and performs data analysis by mapping the secured sensor information. This paper implements the LAD process with adaptations to simulate and optimize the results. The LAD process uses top-down and bottom-up approaches to produce pure patterns; the adaptations introduced here produce combined and impure patterns to maximize the performance of the LAD model. The adapted pattern generation produced 38 percent more true positive outcomes than the original model. The model uses a greedy algorithm to optimize the number of attributes, based on the conclusions drawn in Almuallim and Dietterich (1994), which compares several feature minimization techniques and evaluates each on worst-case scenarios, time complexity, and average accuracy. The implementation analyzes the Mississippi State University and Oak Ridge National Laboratory power system datasets to identify FDIA scenarios. The F1 scores of various classifiers on the same dataset are discussed in Liu and Li (2017) and visualized below. The proposed architecture yields an F1 score of 0.86, outperforming the traditional classifiers.

Fig. 1 F1 scores of different learners

Lastly, we explain the motivation behind employing Logical Analysis of Data (LAD) in smart grids. Machine learning is widely used in data analytics; however, as both academics and industry observe, practically all machine learning systems operate statistically or optimize blindly. The causal logic behind their results remains a black box, restricting machine learning's usefulness. Our LAD research model employs machine learning to increase application accuracy and rules to ensure the interpretability of results, starting from a structured approach to causal reasoning. Simultaneously, we propose a new rule system based on a mathematically adapted model that can handle potentially enormous datasets with limited computation.

2 LAD model structure and terminologies

The logical analysis of numerical data (LAD) is a combinatorics- and optimization-based data analysis method. In Boros et al. (1997), the authors develop the theoretical foundation of the binarization process and study the combinatorial optimization problems related to minimizing the number of binary variables. The paper establishes nineteen theorems and several lemmas along with proofs. It shows that, for any numerical dataset, one can check in polynomial time whether there is a binarization admitting an extension in a given class. A linear integer programming problem is formulated to provide an algorithmic framework for this minimization problem. Set covering problems and some heuristic algorithms are implemented and elaborated further for improvements. LAD detects structural information about datasets, which provides powerful means to solve various problems; it mainly contributes to classification, automatic knowledge acquisition in expert systems, model-based decision support system development, database inconsistency detection, and feature selection.

The LAD methodology was first proposed for the case of binary data. LAD has applications in numerous disciplines, such as economics and business, seismology, and oil exploration, a few typical examples of binary classification problems. The papers Boros et al. (2000) and Hammer and Bonates (2006) are exemplary LAD use cases. They describe the implementation and wide applicability of LAD to the Australian Credit Card, Boston Housing, Breast Cancer (Wisconsin), and Congressional Voting datasets; pilot experiments such as Oil Exploration, Psychometric Testing, and Labor Productivity in China; and an in-depth application of LAD to prognosis and diagnosis in medical data analysis, with case studies of ovarian cancer diagnosis using a large proteomic dataset and genome-data-based breast cancer prognosis, respectively. These applications depict the robustness of LAD across scenarios. The dissertation (Bonates 2007) shows efficient ways of constructing LAD classification models that have high accuracy and require minimal control parameters; it also extends the LAD methodology to the important class of regression problems that frequently appear in data analysis tasks. In this paper, the implementation in Python behaves as an ML classification framework and can adapt to changes with ease. The following sections describe the architecture of the code, its adaptations, the results, and the conclusions drawn.

2.1 Mathematical background

The essential mathematical foundation components of LAD (Alexe et al. 2007) are the following:

  • To remove superfluous variables from the original dataset, we select a (usually minimal) subset S of variables that can discriminate positive from negative observations. We work with the projections \(+S\) and \(-S\) onto this set of variables in the following steps. While most data analysis methods include a "feature extraction" step, the LAD methodology uses it differently: it emphasizes the interaction of variables, retaining both those that individually influence the positive or negative nature of observations and those whose "collective" or "combinatorial" effect is significant.

  • We cover \(+S\) with a family of (potentially overlapping) homogeneous subsets of the reduced real space, where each subset intersects \(+S\) but is disjoint from \(-S\). LAD only considers intervals in \(\mathbb {R}^{|S |}\) with faces parallel to the axes; these intervals are referred to as "positive patterns". For finding "negative patterns", a similar construction is used with \(-S\).

  • A subset of positive (respectively, negative) patterns is discovered whose union encompasses all the observations in \(+S\) (respectively, \(-S\)). A "model" is a collection of these two subsets of intervals.

  • A classification rule assigns a positive or negative character to each observation covered by the union of the two subsets of intervals of the model, leaving the observations not covered by this union "unclassified".

  • The resulting classification system’s correctness is verified using one of the standard validation methods.

The basic structure of LAD starts with a set of observations S, consisting of observations of two classes, positive and negative. Hence S is partitioned into \(+S\) and \(-S\) for the two classes. Each observation carries n attributes labeled \(a_1, a_2, \ldots , a_n\). Each attribute is then analyzed to generate cut points labeled \(t_1, t_2, \ldots \). These cut points generate binarized attributes \(ba_1, ba_2, \ldots \); the set of all binarized attributes is labeled V. Then support set generation takes place, yielding a minimal set of binarized attributes labeled Q. Thereafter, Q is used to produce patterns \(p_1, p_2, \ldots \). Finally, a classification model is built using all the generated patterns.

Fig. 2 Phases of LAD

2.2 Binarization

The binarization procedure is as follows. The simplest non-binary attribute is the nominal (or descriptive) attribute. A typical nominal attribute is color, whose value can be red, green, yellow, etc. The binarization of such an attribute is done directly by associating each value \(v_s\) of the attribute x with a Boolean variable. In the particular case of nominal attributes that are already binary, i.e., take only two values, no additional binary variables are introduced; the values are simply renamed 0 and 1 (Boros et al. 2000).

$$\begin{aligned} b\left( x, v_{s}\right) =\left\{ \begin{array}{cc}1 &{} \text{ if } x=v_{s} \\ 0 &{} \text { otherwise }\end{array}\right. \end{aligned}$$

The binarization of ordered attributes is common in many areas of human activity. For example, blood pressure, body temperature, pulse rate, and other medical parameters are called "normal" or "abnormal" depending on whether they are within or outside a specific range. In many other examples, the parameter (for example, blood sugar level) is called "normal" or "abnormal" depending on whether it is above or below a certain threshold. In all of these examples, the binarization is done implicitly by comparing the value of a numeric attribute with some standard cut point (critical value). The same principle is applied to binarize real numerical values. The variables are split into two types: level and interval. Level variables, as the name suggests, create levels; that is, the binarization occurs based on the value being above or below each cut point t, labeling it 1 and 0, respectively.

$$\begin{aligned} b(x, t)=\left\{ \begin{array}{ll}1 &{} \text { if } x \ge t \\ 0 &{} \text { if } x<t\end{array} \right. \end{aligned}$$

Similarly, binarization for interval variables takes place if the value lies between two cut points.

$$\begin{aligned} b\left( x, t^{\prime }, t^{\prime \prime }\right) =\left\{ \begin{array}{ll}1 &{} \text{ if } t^{\prime } \le x<t^{\prime \prime } \\ 0 &{} \text { otherwise }\end{array} \right. \end{aligned}$$

While binarizing numerical attributes, the unique values are sorted in an array to calculate the cut points. In order to make this binarization procedure more robust with respect to measurement errors (in the case of numerical attributes), we take each cut point as the midpoint between consecutive unique values, \( t_{s}=\frac{1}{2}\left( v_{s-1}+v_{s}\right) \).

While dealing with huge datasets, we set a threshold on the number of critical values generated from a particular attribute to limit the generation of ambiguous cut points. For example, for an attribute yielding 3000 cut points, the critical points generated could be limited to 200. If the attribute fails to generate fewer than 200 points, we coarsen it by rounding off the last digit of each value and recomputing until the number of cut points falls below the threshold.
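A minimal Python sketch of this procedure is shown below; the progressive rounding loop is one plausible reading of the "round off each point's last digit" rule, and all names are illustrative:

```python
import numpy as np

def cut_points(values, threshold=200):
    """Midpoints between consecutive unique values; if too many cut
    points arise, coarsen the attribute by rounding and recompute."""
    values = np.asarray(values, dtype=float)
    decimals = 10                                 # start near full precision
    while True:
        unique = np.unique(values.round(decimals))
        cuts = (unique[:-1] + unique[1:]) / 2.0   # t_s = (v_{s-1} + v_s) / 2
        if len(cuts) < threshold or decimals == 0:
            return cuts
        decimals -= 1                             # round off one more digit

def binarize_level(values, cuts):
    """Level variables: b(x, t) = 1 if x >= t, else 0 (one column per cut)."""
    values = np.asarray(values, dtype=float)
    return (values[:, None] >= cuts[None, :]).astype(int)
```

Interval variables can be binarized analogously by testing membership between consecutive cut points.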

2.3 Support set generation

After obtaining a binary dataset, the elimination of redundant attributes is prioritized. Every LAD archive of observation points is partitioned into a set of positive (\(+S\)) and negative (\(-S\)) classes, i.e., we assume that no observation point is present in both simultaneously. This property is known as being contradiction-free, a basic requirement to be maintained by any correct binarization technique, and one clearly preserved by our process. A set of binary attributes is called a support set Q if the archive obtained by eliminating all the other attributes remains contradiction-free. A support set is called irredundant if no proper subset of it is a support set (Boros et al. 2000). The support set generation method used here is the Mutual-Information-Greedy (MIG) algorithm (Almuallim and Dietterich 1994). That paper concludes that the MIG algorithm maintains good average-case performance, improving all the learning processes it was applied to, while exhibiting rather bad worst-case performance. The MIG algorithm uses the following entropy-based score function:

$$\begin{aligned} \text {MIG score}= & {} -\sum _{i=0}^{2^{|Q |}-1} \frac{p_{i}+n_{i}}{|\text {Sample} |}\left[ \frac{p_{i}}{p_{i}+n_{i}} \log _{2} \frac{p_{i}}{p_{i}+n_{i}}\right. \\&\quad \left. +\frac{n_{i}}{p_{i}+n_{i}} \log _{2} \frac{n_{i}}{p_{i}+n_{i}}\right] . \end{aligned}$$

In the Mutual-Information-Greedy algorithm, the feature that leads to the minimum entropy when added to the current partial solution is selected as the best feature. The best feature is used to partition each group of training samples until each group is either solely positive or solely negative. An example of the execution, without splitting the training sample into all \(2^{|Q |}\) groups, is elaborated below.

Algorithm 1 Example execution of the Mutual-Information-Greedy algorithm
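As a hedged illustration of the procedure, the following Python sketch implements the greedy selection with the score above; it is a direct, unoptimized reading of the formula and assumes binary attributes and a contradiction-free archive:

```python
import math

def mig_score(samples, labels, features):
    """Entropy of the class labels after partitioning the samples into
    groups that agree on every attribute in `features` (lower is better)."""
    groups = {}
    for x, y in zip(samples, labels):
        key = tuple(x[f] for f in features)
        p, n = groups.get(key, (0, 0))
        groups[key] = (p + (y == 1), n + (y == 0))
    total = len(samples)
    score = 0.0
    for p, n in groups.values():
        for c in (p, n):
            if c:
                score -= (c / total) * math.log2(c / (p + n))
    return score

def mig_support_set(samples, labels, n_attrs):
    """Greedy MIG: repeatedly add the binary attribute whose inclusion
    minimizes the score until every group is pure (score reaches 0)."""
    chosen = []
    while mig_score(samples, labels, chosen) > 0 and len(chosen) < n_attrs:
        best = min((f for f in range(n_attrs) if f not in chosen),
                   key=lambda f: mig_score(samples, labels, chosen + [f]))
        chosen.append(best)
    return chosen

# Example: attribute 1 alone separates the two classes.
samples = [(0, 1), (1, 1), (0, 0), (1, 0)]
labels = [1, 1, 0, 0]
print(mig_support_set(samples, labels, n_attrs=2))  # [1]
```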

2.4 Pattern recognition

Patterns are combinations of Boolean attributes with specific orientations that help us discriminate between positive and negative classes. For example, if the combination of binary attributes b1 = 1 and b2 = 0 is found only in positive observations and never in negative ones, it becomes a well-defined positive pattern; the symmetrical definition for negative patterns also holds. The simplest pattern generation method is based on combinatorial enumeration. Given that there are various possible quality metrics for any given pattern, it is important that the pattern generation process not lose any of the best patterns, where a best pattern is one that classifies many data points of a specific class at once. Any pattern generation technique should follow two basic principles. The first is the simplicity principle: short patterns are preferred over longer ones. The second concerns comprehensiveness: all observations of a particular class should be classified by at least one of the patterns. We followed a bottom-up approach up to third-degree positive patterns for our model, as higher-degree pattern generation was not computationally feasible. The pattern generation process is explained through Algorithm 2 (Das et al. 2020).

Algorithm 2 Bottom-up pattern generation (Das et al. 2020)
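Since Algorithm 2 itself appears as a figure in the original, the following simplified Python sketch (not the exact algorithm of Das et al. 2020) conveys the idea of bottom-up enumeration; observations are tuples of binary values, and impurity = 0 yields the pure patterns described above:

```python
from itertools import combinations, product

def covers(pattern, obs):
    """A pattern is a conjunction of (attribute, value) literals."""
    return all(obs[a] == v for a, v in pattern)

def generate_patterns(pos, neg, n_attrs, max_degree=3, impurity=0.0):
    """Bottom-up enumeration of positive patterns up to `max_degree`
    literals; a pattern is kept if the share of covered observations
    that are negative is at most `impurity` (0.0 keeps only pure ones)."""
    patterns = []
    for degree in range(1, max_degree + 1):
        for attrs in combinations(range(n_attrs), degree):
            for values in product((0, 1), repeat=degree):
                pattern = tuple(zip(attrs, values))
                p_cov = sum(covers(pattern, o) for o in pos)
                n_cov = sum(covers(pattern, o) for o in neg)
                if p_cov and n_cov <= impurity * (p_cov + n_cov):
                    patterns.append((pattern, p_cov, n_cov))
    return patterns
```

A production implementation would additionally prune supersets of already-found patterns, in line with the simplicity principle.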

While generating patterns with limited computational resources, we were able to build classifiers that were highly accurate but not comprehensive enough. We therefore introduce two adaptations in our model, inspired by Das et al. (2020). First, we introduce imperfect patterns: patterns that cover some incorrectly classified observations, as long as their share stays below a certain threshold. We set that threshold at 10 percent for our model, which substantially eased the set covering problem. The second adaptation is combined patterns. We generated a hybrid fifth-degree pattern as the conjunction of a third-degree pattern and the negation of a second-degree pattern, for example, (b1 = 1 and b2 = 1 and b3 = 1) and not (b4 = 1 and b5 = 1). This helps cover more observations while maintaining accuracy and avoiding the time complexity of generating fifth-degree patterns directly.
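The hybrid construction can be sketched as follows; the attribute indices and the sample observation are illustrative:

```python
def covers(pattern, obs):
    """A conjunction of (attribute, value) literals."""
    return all(obs[a] == v for a, v in pattern)

def hybrid_covers(p3, p2, obs):
    """Hybrid fifth-degree pattern: satisfies p3 AND NOT p2,
    e.g. (b1=1 and b2=1 and b3=1) and not (b4=1 and b5=1)."""
    return covers(p3, obs) and not covers(p2, obs)

# The example from the text, with attributes b1..b5 0-indexed:
p3 = ((0, 1), (1, 1), (2, 1))      # b1 = 1, b2 = 1, b3 = 1
p2 = ((3, 1), (4, 1))              # b4 = 1, b5 = 1
obs = (1, 1, 1, 0, 1)              # p3 holds, p2 does not
print(hybrid_covers(p3, p2, obs))  # True
```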

3 LAD case study with a sample dataset

Table 1 Sample data

The steps followed to carry out the Logical Analysis of Data on the given sample dataset are as follows:

  (a) Consider the dataset given in Table 1. We observe that the attributes a1, a2 and a3 are nominal, numerical, and binary in nature, respectively.

  (b) The first attribute, denoting shapes, is nominal in nature and can be converted into three binarized attributes: each unique shape becomes a binary variable.

  (c) The second attribute is numerical in nature, and hence cut points are to be calculated. The cut points formed are [3.5, 9, 17, 25.5, 31]. The binary variables formed are shown in Table 2.

  (d) The last attribute has Boolean values and is thus already binary in nature.

  (e) Putting it all together yields the binarized dataset shown in Table 2.

Table 2 Binarized form of sample data

The support set generation technique later minimizes the binarized dataset, as it may contain redundant attributes. Patterns are recognized from the dataset obtained after computing the support set, and the classifier is modeled. For example, in the above example, we can directly observe one such pattern, b3 = 1 and b4 = 1, that is unique to the \(+S\) set.
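For illustration, the midpoint rule of Sect. 2.2 can be checked in a few lines. Since the numeric values of Table 1 are not reproduced in this text, the values below are merely one hypothetical set consistent with the cut points listed in step (c):

```python
# Hypothetical numeric values consistent with the cut points in step (c);
# the actual Table 1 values are not reproduced here.
values = [2, 5, 13, 21, 30, 32]

unique = sorted(set(values))
cuts = [(a + b) / 2 for a, b in zip(unique, unique[1:])]
print(cuts)  # [3.5, 9.0, 17.0, 25.5, 31.0]
```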

4 Cyber physical system survey

Cyber-physical systems (CPS) refer to a new generation of systems with integrated computational and physical capabilities that can interact with humans through many new modalities. The ability to interact with and expand the capabilities of the physical world through computation, communication, and control is a key enabler for future technology developments. A comprehensive summary of anomaly detection strategies is provided by Chandola et al. (2009); this early survey, however, does not include deep learning-based approaches for CPS. Commodity IoT solutions have revolutionized people's lives: smart home applications, for example, allow users to interact with house appliances automatically. Methods for analyzing programs to safeguard privacy and find vulnerabilities in these applications have been proposed in Celik et al. (2019). Meanwhile, Giraldo et al. (2018) looked into anomaly detection approaches based on the physical features of CPS (for example, the evolution of the physical system under control). Findings of studies on SCADA system network security, with a focus on risk assessment approaches, are described in Cherdantseva et al. (2016). Reviews of anomaly detection methodologies in CPS were published by Mitchell and Chen (2014), Nazir et al. (2017), and Zacchia Lun et al. (2018); however, the approaches they cover do not incorporate deep learning methods and are more traditional, such as state estimation and intrusion detection. Chalapathy and Chawla (2019) survey deep learning-based anomaly detection systems, though not specifically for traditional CPS.

5 Power system attack case study using LAD model

The dataset used for the model is the Power System Attack Datasets provided by Mississippi State University and Oak Ridge National Laboratory. The natural and attack states of the power system are treated as positive and negative observations, respectively. This dataset is used because power system disturbances are complex in nature and can be attributed to a wide range of sources, including man-made and natural events. Currently, power system operators depend heavily on human judgment in deciding the appropriate course of action for the cause of, and response to, the disturbance experienced. In the case of cyber attacks on the power system, human judgment is less reliable, since there is a deliberate attempt to disguise the attack and deceive the operators as to the true state of the system. To support the human decision-maker, we explore the viability of the LAD model as a means of discriminating between types of power system disturbances, and we focus specifically on detecting cyber-attacks, where deception is a core tenet of the event. The five types of scenarios covered in the datasets are short-circuit fault, line maintenance, remote tripping command injection (attack), relay setting change (attack), and data injection (attack). The scenarios are explained below (Borges et al. 2014).

Short-circuit fault: a short that can occur at any point along the power line; a percentage range indicates its location.

Line maintenance: a remote relay trip command is issued to open one or more breakers.

Remote tripping command injection (attack): the attacker sends a false command to a relay to open the breaker.

Relay setting change (attack): the attacker changes the relay configuration to prevent it from tripping when an actual fault occurs.

Data injection (attack): a genuine fault is imitated by changing parameters such as current and voltage in order to induce a blackout.

The data was drawn from 15 datasets containing 128 features and thousands of samples each. Across the classification files, an average of 3,711 attack instances and 1,221 normal instances per file were included in the analysis.

5.1 Data pre-processing

All columns with more than sixty percent missing values were eliminated, and then the rows with remaining missing values were filtered out. The final clean dataset consisted of 31,514 samples. An 80-20 (train-test) random split was performed on each dataset using the sklearn library.
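A minimal pandas/scikit-learn sketch of this preprocessing follows; the CSV path and the label column name (marker) are assumptions about the dataset layout:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def preprocess(csv_path, label_col="marker"):
    """Drop columns with >60% missing values, then rows with any
    remaining missing values, and return an 80-20 random split."""
    df = pd.read_csv(csv_path)
    df = df.loc[:, df.isna().mean() <= 0.60]  # column filter
    df = df.dropna()                          # row filter
    X = df.drop(columns=[label_col])
    y = df[label_col]
    return train_test_split(X, y, test_size=0.20, random_state=42)

# X_train, X_test, y_train, y_test = preprocess("data1.csv")
```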

5.2 Classification metrics

For each classified dataset, a confusion matrix is calculated, consisting of the true and false positives and negatives. Four classification metrics, namely accuracy, precision, recall, and F1 score, are calculated as follows.

$$\begin{aligned} \text {Accuracy}= & {} (\text {TP} + \text {TN}) / (\text {TP} + \text {TN} + \text {FP} + \text {FN})\\ \text {Precision}= & {} \text {TP} / (\text {TP} + \text {FP})\\ \text {Recall}= & {} \text {TP} / (\text {TP} + \text {FN})\\ \text {F1}= & {} (2 \times \text {Precision} \times \text {Recall}) / (\text {Precision} + \text {Recall}) \end{aligned}$$

Here TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively (Hossin and Sulaiman 2015).
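These formulas translate directly into a small Python helper (a plain transcription; the function name is illustrative):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from the
    confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```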

6 Results

The train and test results for detecting cyberattacks on smart grids using the Power System Attack dataset and the LAD model are tabulated in Fig. 3. The dataset is distributed across 15 files. Each dataset is split 80-20 into train and test sets, and the LAD model is then applied, recording performance on the train and test sets in terms of accuracy, precision, recall, and F1 score. The average accuracy, precision, recall, and F1 score on the test datasets are 79%, 83.9%, 88.4%, and 86%, respectively. In the graphical representation of the LAD test results (Fig. 4), the blue, orange, grey, and yellow trend lines represent the accuracy, precision, recall, and F1 score, respectively. It portrays the consistent performance of the LAD model across all subparts of the dataset.

Fig. 3 LAD train and test results data

The performance of LAD and the most widely used machine learning and deep learning models on the dataset is compared in Table 3 based on the F1 score, which also sets the LAD results against previous studies on the same dataset as given in Hink et al. (2014). It is observed that LAD outperforms the state-of-the-art classification techniques, having obtained an 86% score, higher than the rest of the techniques. Beyond overall performance, the in-depth analysis and impact of introducing the adaptations are visualized in Table 4 by comparing the performance of the LAD and the adapted LAD model on a random subpart of the dataset. The adapted LAD model contains the impure and hybrid patterns discussed in the pattern recognition section of the paper.

It can be deduced from the confusion matrix given in Table 4 that the standard LAD model provides very high precision, whereas the adaptations yield higher accuracy and recall and create an overall positive impact on classification.

Fig. 4 LAD test results

7 Gap analysis and contribution

The LAD model is compared with other machine learning models based on the F1 score. We can observe from Table 3 that our model has achieved a score of 86%, the highest among all the classifiers mentioned. Thus, we can say that the LAD model outperforms most state-of-the-art classification techniques. We have also introduced adaptations in our LAD model, and these adaptations improve the results even further. It can be observed from Table 4 that the false negatives have been reduced for both train and test data. Using the adapted LAD model, we achieve a recall of 97.53%, which is very high compared to the standard LAD model. More importantly, however, the LAD model introduces explainability within the classifier while generating results. It identifies the features involved in the attack, so we can focus on those features that are vulnerable to attack. All of this can potentially take place in real time and with minimal computation. The dataset required to produce the desired results contains only 31,514 observations; thus, LAD does not require a huge dataset for classification. These observations distinguish LAD from the most commonly used machine learning algorithms.

Table 3 Performance comparison of LAD model with other machine learning classifiers based on F1 Score
Table 4 Changes in confusion matrix due to adaptations

8 Conclusions

As a classification technique, LAD appears to be competitive with the well-established methods in this area. It is easily interpretable and has wide applications, given that it is not bound to any specific type of dataset. Its high classification accuracy, comparable to and frequently exceeding that of other methods, and its ability to handle some missing data make the model robustly applicable. It is also worth noting that imperfect patterns can improve the model given a suitable threshold, and that combining a few different kinds of patterns has helped reduce computation and time complexity. The results of the Power System Attack case study show many opportunities for LAD in developing new smart grids. The paper concludes by opening the following discussions:

  • Exploring more applications of LAD in different sectors and expanding its concepts to ternary or even multi-class systems.

  • While degree-three computations are feasible for most datasets, with the right resources LAD could also be applied to Big Data problems using higher-degree patterns and combinations thereof.