
1 Introduction

Nowadays, advanced industrial monitoring systems such as Supervisory Control and Data Acquisition (SCADA) and Centralized Monitoring Systems (CMS) are equipped with an enormous number of sensors, thanks to the rapid growth of sensing technologies. These sensors can be used for condition monitoring purposes and, by means of pre-defined thresholds, can indicate different normal and abnormal states of the system. For abnormal states, which can be caused by a component malfunction or a fault occurrence, the monitoring and alarm system generates alarms [1]. The generated alarms enable operators to understand and resolve the system's issues and to plan its maintenance. The cost and performance of operating and maintaining industrial systems are strongly tied to the performance of their alarm systems. Thus, an alarm system with low false alarm and missed alarm rates can bring a considerable economic benefit to the targeted industry, especially one with accessibility constraints such as offshore wind farms [2]. Alarm management is challenging for any company, given the large number of alarms that might be generated at the same time [3]. For instance, operators faced difficult situations due to redundant and confusing information provided to them during the catastrophic accident at the Three Mile Island nuclear power plant in 1979, the worst nuclear accident in US history; much of the information collected during the accident was irrelevant and misleading [4].

To increase safety and efficacy, many actions, tools, evaluation metrics and policies have been developed. For example, three specific indices were used in [5] to analyse an alarm system's performance and safety in the case of abrupt faults: the Averaged Alarm Delay (AAD), the Missed Alarm Rate or Probability (MAR/MAP), and the False Alarm Rate or Probability (FAR/FAP). To design and optimize alarm systems for mixture processes and intermittent faults, [6] presented a time-variant finite mixture model to statistically describe the behaviour of a process variable affected by an intermittent fault; calculation methods for FAR and AAD were provided, and a new time-variant missed alarm rate (MAR(t)) was introduced that reflects the missed alarm rate during the emerging stage of an intermittent fault. Process variables are the parameters or quantities that should be kept within the desired limits.

[7] developed generalized delay timers, a novel way of improving traditional delay timers. In addition, a Markov model was used to calculate the FAR, MAR, and Expected Detection Delay (EDD) in order to compare the performance of this technique with that of a standard delay timer. Regarding the performance evaluation of monitoring systems with adaptive and variable alarm thresholds, [8] introduced a new method combining semi-Markov processes and temporal logic gates. [9] proposed a new technique for obtaining an optimal alarm filter that incorporates plant and control system information while allowing the independence requirement to be relaxed.

A review of the existing literature on the performance assessment and improvement of alarm systems makes it clear that there are few publications regarding the adaptive design of alarm systems in the presence of concept drift or distributional shift in the incoming data. To address the above-mentioned issues in monitoring and alarm systems, in this paper we propose an adaptive approach with the following contributions: A) real-time evaluation of the designed alarm system, and B) adjustment of the alarm system's design parameters based on statistical difference measures.

2 Problem Formulation

2.1 Detecting Alarm States

The most common way of detecting an alarm state is to compare the value of a process variable to a constant high (low) alarm trip point, i.e.

$$\begin{aligned} x_a(t)= \begin{cases} 1 & x(t) > x_{htp}\quad \text {or} \quad x(t)< x_{ltp} \\ 0 & x_{ltp} \le x(t) \le x_{htp} \end{cases} \end{aligned}$$
(1)

where x(t) denotes the process variable, \(x_a(t)\) denotes the alarm variable, and \(x_{htp}\) and \(x_{ltp}\) respectively denote the high and low trip points. An example is a drum level on a large-scale thermal power plant, associated with a high alarm trip point of 100 and a low alarm trip point of \(-100\). Figure 1(a) presents 1-hour samples of x(t) with a sampling period of 1 s. A discrete-valued alarm variable \(x_a(t)\) may be used to mathematically describe alarm states in alarm systems. In Fig. 1(b), the samples of the two alarm variables associated with x(t) are shown. That is, the high (low) alarm variable is assigned the value 1 if x(t) is greater than 100 (less than \(-100\)), and 0 otherwise [3]. The alarm occurrence and alarm clearance are defined as the change of an alarm variable from 0 to 1 and from 1 to 0, respectively. Note the two rapid changes in the low alarm variable between 23:47:38 and 23:47:42, which are apparent in the magnified plot in Fig. 1.
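As an illustration, the thresholding rule in (1) can be sketched in a few lines of Python; the signal, trip-point values, and function name below are purely illustrative and not taken from any particular plant or library.

```python
import numpy as np

def alarm_variable(x, x_htp=100.0, x_ltp=-100.0):
    """Alarm variable x_a(t) of Eq. (1): 1 when x(t) exceeds the high trip
    point or falls below the low trip point, 0 otherwise."""
    x = np.asarray(x, dtype=float)
    return ((x > x_htp) | (x < x_ltp)).astype(int)

# Example: a noisy, drum-level-like signal sampled once per second for 1 hour
rng = np.random.default_rng(0)
x = 80.0 * rng.standard_normal(3600)
x_a = alarm_variable(x)
# Alarm occurrences are 0 -> 1 transitions, clearances are 1 -> 0 transitions
occurrences = np.flatnonzero(np.diff(x_a) == 1) + 1
clearances = np.flatnonzero(np.diff(x_a) == -1) + 1
print(x_a.sum(), "alarm samples;", occurrences.size, "occurrences;",
      clearances.size, "clearances")
```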

Fig. 1. (a) Samples of a process variable (solid) collected together with the alarm trip points (dot-dash). (b) Alarm variables \(x_a(t)\) for the high (solid) and low (dot-dash) alarm trip points [3]

2.2 Abrupt Faults

Assume that a process variable is in its normal state with distribution p(x). An abrupt fault causes a change in the statistical properties of the variable, for instance, its mean. In this case, the PDF of the variable instantly changes to q(x) in the faulty state. Figure 2 illustrates the distributions p(x) and q(x), which describe the variable in the normal and abnormal operation states, respectively. For this process variable, the well-known alarm performance indices are defined as [10]:

  • False Alarm Rate (FAR): In this work, the likelihood of an alarm occurring during the normal state is denoted by \(p_1\). When the process variable is in its normal operating condition, the \(\textrm{FAR}\) index gives the probability of an alarm being triggered; this means that the operator has received an incorrect alarm that does not require a response. The \(\textrm{FAR}\) value is computed from the process variable distribution as in (2)-(3) below (note that, due to the computational similarity of the indices for the high and low trip points, the index is calculated only for the high trip point). In other words, the index gives the probability that the process variable exceeds the high trip point, i.e., that an alarm is raised while the process is operating normally.

  • Missed Alarm Rate (MAR): The likelihood of missing an alarm during a faulty or abnormal condition is given by \(q_2\). The MAR index represents the likelihood of an alarm not being triggered while the process variable is behaving abnormally [10]. This index is more critical than the FAR index, since a false alarm may disturb the operator and waste their time, while a missed alarm may result in an incident while the operator is unaware of it.

Fig. 2. Normal (p(x)) and abnormal (q(x)) PDFs

For the process variable in Fig. 2, we define the four parameters \(p_1\), \(p_2\), \(q_1\), and \(q_2\) as

$$\begin{aligned} \begin{aligned} p_1&=\int _{x_{htp}}^{\infty }p(x)\textrm{d}x, \quad p_2=\int _{-\infty }^{x_{htp}}p(x)\textrm{d}x \\ q_1&=\int _{x_{htp}}^{\infty }q(x) \textrm{d}x, \quad q_2=\int _{-\infty }^{x_{htp}}q(x) \textrm{d}x \end{aligned} \end{aligned}$$
(2)

Then, it is straightforward to show that the performance indices are calculated as [10]:

$$\begin{aligned} \textrm{FAR}=p_1, \textrm{MAR}=q_2 \end{aligned}$$
(3)
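For concreteness, the sketch below (Python, assuming Gaussian normal and abnormal densities with illustrative parameters) evaluates \(p_1\) and \(q_2\) of (2) and hence the FAR and MAR of (3).

```python
from scipy.stats import norm

def far_mar(x_htp, mu_p, sigma_p, mu_q, sigma_q):
    """FAR = p1 = P(x > x_htp | normal) and MAR = q2 = P(x <= x_htp | abnormal),
    following Eqs. (2)-(3) for Gaussian normal/abnormal densities."""
    p1 = norm.sf(x_htp, loc=mu_p, scale=sigma_p)    # tail mass above the trip point
    q2 = norm.cdf(x_htp, loc=mu_q, scale=sigma_q)   # mass below the trip point
    return p1, q2

# Illustrative values (the N(2,1)/N(4,1) pair of Example 1, threshold 3.25)
far, mar = far_mar(3.25, mu_p=2.0, sigma_p=1.0, mu_q=4.0, sigma_q=1.0)
print(f"FAR = {far:.3f}, MAR = {mar:.3f}")
```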

Here, \(x_{htp}\) is the alarm threshold, \(t_a\) is the time of the first alarm after the fault occurrence, and \(t_f\) is the time of the fault occurrence. In Fig. 2, the adjusted threshold is indicated by a red vertical dashed line. The overlapping region in this figure may lead to false detection, since the designed alarm system cannot determine which operation state the process variable is associated with: a generated alarm might be a false alarm, or a non-alarm sample of the alarm variable might be a missed alarm. Determining the best threshold is therefore a classification problem, and by considering the region where the two PDFs overlap one can see the potential for errors. The likelihood of an error may be determined as [11]

$$\begin{aligned} P(error)=\int _{-\infty }^{\infty }P(error|x)P(x)dx \end{aligned}$$
(4)

where P(error|x) is the minimum of the posterior probabilities of the normal and abnormal states, as in (5).

$$\begin{aligned} P(error|x)=\min [P(NS|x), P(ANS|x)] \end{aligned}$$
(5)

NS and ANS denote the normal and abnormal states of operation, respectively, as shown in Fig. 2. The likelihood of error may be expressed in two parts by dividing the space into the two regions \(R_1=\{x\in \mathbb {R}\mid x < x_{htp}\}\) and \(R_2= \{x\in \mathbb {R}\mid x > x_{htp}\}\).

$$\begin{aligned} \begin{aligned} P(error) & = P(x\in R_1, NS)+P(x\in R_2, ANS)\\ {} & = \int _{R_1}P(x|NS)P(NS)dx+\\ {} &\int _{R_2}P(x|ANS)P(ANS)dx \end{aligned} \end{aligned}$$
(6)
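For two Gaussian class-conditional densities, (4)-(6) can be checked numerically; the sketch below (illustrative parameters, equal priors assumed) integrates the pointwise minimum of the two joint densities.

```python
import numpy as np
from scipy.stats import norm

def bayes_error(mu_ns, s_ns, mu_ans, s_ans, p_ns=0.5, p_ans=0.5):
    """Numerically evaluate Eq. (4): P(error) is the integral of
    min[P(NS|x), P(ANS|x)] P(x), i.e., the minimum of the two joint densities."""
    x = np.linspace(min(mu_ns, mu_ans) - 10.0, max(mu_ns, mu_ans) + 10.0, 20001)
    dx = x[1] - x[0]
    joint_ns = p_ns * norm.pdf(x, mu_ns, s_ns)       # P(x|NS) P(NS)
    joint_ans = p_ans * norm.pdf(x, mu_ans, s_ans)   # P(x|ANS) P(ANS)
    return np.minimum(joint_ns, joint_ans).sum() * dx

# N(2,1) vs N(4,1) with equal priors: Bayes error is about 0.159
print(f"P(error) = {bayes_error(2.0, 1.0, 4.0, 1.0):.4f}")
```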

In (6), \(P(x\in R_2, NS)\) and \(P(x\in R_1, ANS)\) correspond to the false alarm and missed alarm contributions (the FAR and MAR indices weighted by the state priors), respectively. To ease the minimization problem, consider the following inequality rule [12].

$$\begin{aligned} \min [a,b]\le a^\gamma b^{1-\gamma }\quad \text {where}\quad a,b\ge 0\quad \text {and}\quad 0\le \gamma \le 1 \end{aligned}$$
(7)

Equation (5) may then be rewritten as (8). When the worst-case situation or the upper bound on the error is considered, the '\(\le \)' in (7) may be interpreted as '\(=\)'.

$$\begin{aligned} \begin{aligned} P\left( error|x\right) = \min \left[ P\left( NS|x\right) ,\ P\left( ANS|x\right) \right] = \\ \min \left[ \frac{P\left( x|NS\right) P\left( NS\right) }{P\left( x\right) },\ \frac{P\left( x|ANS\right) P\left( ANS\right) }{P\left( x\right) }\right] \end{aligned} \end{aligned}$$
(8)

By using the inequality rule and equation (8), we have

$$\begin{aligned} \begin{aligned} P\left( error|x\right) \ \le \left( \frac{P\left( x|NS\right) P\left( NS\right) }{P\left( x\right) }\right) ^\gamma \times \\\left( \frac{P\left( x|ANS\right) P\left( ANS\right) }{P\left( x\right) }\right) ^{1-\gamma } \end{aligned} \end{aligned}$$
(9)

Equation (10) can be derived from equations (4) to (9).

$$\begin{aligned} \begin{aligned} P\left( error\right) \ \le \left( P\left( NS\right) \right) ^\gamma \left( P\left( ANS\right) \right) ^{1-\gamma }\ \\ \int _{-\infty }^{+\infty }{\left( P\left( x|NS\right) \right) ^\gamma \left( P\left( x|ANS\right) \right) ^{1-\gamma }dx} \end{aligned} \end{aligned}$$
(10)

It is critical to evaluate the worst-case scenario in safety assurance, which leads to (11), also known as the Chernoff upper bound on the error [12].

$$\begin{aligned} \begin{aligned} P\left( error\right) \ = P\left( NS\right) ^\gamma P\left( ANS\right) ^{1-\gamma }\ \\ \int _{-\infty }^{+\infty }{P\left( x|NS\right) ^\gamma P\left( x|ANS\right) ^{1-\gamma }dx} \end{aligned} \end{aligned}$$
(11)

The integral component of (11) can be evaluated in closed form using (12) [12] if the class probability distributions belong to the normal or exponential distribution families.

$$\begin{aligned} \int _{-\infty }^{+\infty }{P\left( x|NS\right) ^\gamma P\left( x|ANS\right) ^{1-\gamma }dx}=e^{-\theta \left( \gamma \right) } \end{aligned}$$
(12)

The term \(\theta \left( \gamma \right) \) can be calculated using (13), where \(\mu \) and \(\varSigma \) are the mean vector and covariance matrix of each class, respectively.

$$\begin{aligned} \begin{aligned} \theta \left( \gamma \right) =\frac{\gamma \left( 1-\gamma \right) }{2}\left[ \mu _2-\mu _1\right] ^T\left[ \gamma \varSigma _1+\left( 1-\gamma \right) \varSigma _2\right] ^{-1}\left[ \mu _2-\mu _1\right] \\ +\frac{1}{2}\log \frac{\left| \gamma \varSigma _1+\left( 1-\gamma \right) \varSigma _2\right| }{\left| \varSigma _1\right| ^\gamma \left| \varSigma _2\right| ^{1-\gamma }} \end{aligned} \end{aligned}$$
(13)

Equation (13) reduces to the Bhattacharyya distance when \(\gamma =0.5\). When \(\varSigma _1=\varSigma _2\), it can be shown that this value of \(\gamma \) is optimal [12, 13]. For simplicity, the Bhattacharyya distance is used to illustrate the technique in this work. It should be noted that the estimated error bound could be larger than the true value in certain circumstances. This is permissible, however, since an overestimation of the classifier error would not pose a safety risk (although it may affect performance). The probability of obtaining a correct classification may be computed using (14), since \(P\left( error\right) \) and \(P\left( correct\right) \) are complementary.

$$\begin{aligned} P\left( correct\right) =1-\sqrt{P\left( NS\right) P\left( ANS\right) }\, e^{-\theta \left( \gamma \right) } \end{aligned}$$
(14)
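A minimal sketch of (13)-(14) for Gaussian classes is given below; the function names are ours, and the priors and class parameters are illustrative.

```python
import numpy as np

def theta(mu1, cov1, mu2, cov2, gamma=0.5):
    """Chernoff exponent theta(gamma) of Eq. (13); gamma = 0.5 gives the
    Bhattacharyya distance between two Gaussian classes."""
    mu1, mu2 = np.atleast_1d(mu1).astype(float), np.atleast_1d(mu2).astype(float)
    cov1, cov2 = np.atleast_2d(cov1).astype(float), np.atleast_2d(cov2).astype(float)
    cov = gamma * cov1 + (1.0 - gamma) * cov2
    d = mu2 - mu1
    quad = 0.5 * gamma * (1.0 - gamma) * d @ np.linalg.solve(cov, d)
    logdet = 0.5 * np.log(np.linalg.det(cov)
                          / (np.linalg.det(cov1) ** gamma
                             * np.linalg.det(cov2) ** (1.0 - gamma)))
    return quad + logdet

def p_correct(mu1, cov1, mu2, cov2, p_ns=0.5, p_ans=0.5, gamma=0.5):
    """Eq. (14): 1 - sqrt(P(NS) P(ANS)) * exp(-theta(gamma))."""
    return 1.0 - np.sqrt(p_ns * p_ans) * np.exp(-theta(mu1, cov1, mu2, cov2, gamma))

# N(2,1) vs N(4,1): theta(0.5) = 0.5, so Eq. (14) gives about 0.697 with equal priors
print(theta(2.0, 1.0, 4.0, 1.0), p_correct(2.0, 1.0, 4.0, 1.0))
```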

In most cases, the Chernoff upper bound on the error is used to determine the separability of two classes of data; in this work, however, equation (14) is used to determine the similarity of two classes. In other words, if a class is compared against itself under the optimal setting, \(P\left( error\right) \) should be one and \(P\left( correct\right) \) should be zero. The purpose is to check whether the data distribution used during training matches the data distribution seen in the field (or not).

The integral component of \(P\left( error\right) \) may be expressed in terms of the cumulative distribution functions as in (15) if \(P\left( NS\right) = P\left( ANS\right) \).

$$\begin{aligned} \begin{aligned} P\left( error\right) = \left( \int _{T}^{+\infty }{P_{NS}\left( x\right) }dx+\int _{-\infty }^{T}{P_{ANS}\left( x\right) }dx\right) \\ =\ 1\ -\left( F_{NS}\left( T\right) -F_{ANS}\left( T\right) \right) \end{aligned} \end{aligned}$$
(15)

In addition, Equation (15) illustrates the link between the likelihood of error (and hence accuracy) and the statistical difference between the Cumulative Distribution Functions (CDFs) of the two states. ECDF-based statistical measures such as the Kolmogorov-Smirnov distance (KSD) (Eq. 16) and similar distance measures can be used to predict the error at runtime [14, 15].

$$\begin{aligned} P(error) \approx 1 - KSD = 1 - \sup _{x} { \left| F_{NS}\left( x\right) -F_{ANS}\left( x\right) \right| } \end{aligned}$$
(16)

It should be noted that not all ECDF-based distances are bounded between zero and one, so in certain circumstances a scaling coefficient may be needed before they can be used as an accuracy estimate. The relationship between ECDF-based distances and accuracy is examined in Sect. 4.
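As an illustration of (16), the following sketch estimates the runtime error from two sets of samples using the two-sample KS statistic from SciPy; the sample sizes and distribution parameters are illustrative only.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_error_estimate(samples_ns, samples_ans):
    """Runtime estimate based on Eq. (16): P(error) ~ 1 - KSD, where KSD is
    the two-sample Kolmogorov-Smirnov statistic between the empirical CDFs
    of the normal-state and abnormal-state samples."""
    ksd = ks_2samp(samples_ns, samples_ans).statistic
    return 1.0 - ksd

rng = np.random.default_rng(1)
ns = rng.normal(2.0, 1.0, 5000)    # samples attributed to the normal state
ans = rng.normal(4.0, 1.0, 5000)   # samples attributed to the abnormal state
print(f"estimated P(error) = {ks_error_estimate(ns, ans):.3f}")
```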

Fig. 3. Flowchart of the proposed approach

3 Safe Designed Alarm System

First and foremost, it is important to stress that the emphasis of this research is on the design of alarm systems for abrupt faults. The flowchart in Fig. 3 demonstrates how we see the idea being used in practice. The flowchart has two main parts: the designing phase and the application phase. I) The designing phase is an offline procedure that uses historical data from a given process variable to design an alarm system based on alarm system performance indices such as the missed alarm rate, false alarm rate, and average alarm delay. To construct the performance assessment indices, this phase contains a change detection method for the process variable. At the end of this phase, all indices of the optimal design are saved for later comparison. II) The application phase is an adaptive procedure in which real-time data are fed to the system; at this stage, nothing is known about the statistical characteristics of the real-time data. For example, consider an alarm system designed to monitor the pressure of the main steam driving the power turbine of a thermal power plant. The design policy is supposed to enable the alarm system to trigger an alarm as soon as possible after an abrupt fault occurs. In the application phase, it is important to keep in mind that the incoming data are not labelled as faulty or non-faulty. As a result, it is impossible to know in advance whether the designed alarm system will operate as well as it did during the designing phase. The PDF and statistical parameters of each class can be estimated as input samples are gathered. Because the system needs a sufficient number of samples to detect the statistical difference correctly, a buffer of samples may also be required before proceeding. Using the modified Chernoff error bound presented in [11], the statistical difference between each state of operation in the designing phase and in the application phase is compared. If the statistical difference is very low, the results and accuracy of the designed alarm system can be trusted; in the power turbine example, the alarm system would continue its operation in this case, holding the design policies set in the designing stage. Conversely, if the statistical difference is large, the findings and accuracy of the designed alarm system are no longer regarded as acceptable (because of the large disparity between the trusted and observed data). In this case, the system should use an alternative design policy or notify a human operator; in the above example, the alarm system could ask the operator to adjust the design parameters of the alarm system.
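The decision logic of the application phase can be sketched as follows; the helper name, the use of a single Wasserstein-distance test, and the threshold value are illustrative assumptions rather than part of the method's specification.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def application_phase_step(buffered, trusted, distance_threshold=0.5):
    """One pass of the application phase in Fig. 3: compare the buffered
    real-time samples against the trusted design-time data and decide whether
    the existing alarm design can still be trusted (schematic only)."""
    d = wasserstein_distance(buffered, trusted)
    if d <= distance_threshold:
        return "keep current design"              # statistical difference is small
    return "redesign or notify the operator"      # the distribution has drifted

rng = np.random.default_rng(2)
trusted = rng.normal(2.0, 1.0, 10_000)   # design-time normal-state data
drifted = rng.normal(2.0, 2.0, 2_000)    # buffered data with a variance shift
print(application_phase_step(drifted, trusted))
```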

4 Statistical Difference Values

In this section, the statistical distance values used in the comparison stage of the application phase (the yellow box in Fig. 3) are presented. In the application phase there is a buffer to gather enough samples. An expert should determine the buffer size at design time so that the gathered data captures the statistical properties of the operation state. It is worth noting that the incoming data are not assumed to belong to a specific operation state. After collecting sufficient samples, the alarm system designed in the previous step is used to identify the operation state based on the generated alarms.

Fig. 4. Different statistical distance measures.

The statistical properties of the buffered data are gathered and compared to the initial data set using ECDF-based statistical distance measures such as Kolmogorov-Smirnov (KS), Kuiper (K), Anderson-Darling (AD), Cramer-Von Mises (CVM), and Wasserstein (W) [14]. Additionally, during the design phase, an expected confidence level for each statistical distance measure should be determined. The confidence level is obtained from the comparison described above and is then compared against the predefined confidence threshold. Three distinct possibilities are examined: 1) when the confidence is slightly lower than the threshold, the system should collect additional data; 2) when the confidence is significantly lower than the predefined threshold, it is assumed that the incoming data have not been seen by the designed alarm system previously and a human-in-the-loop procedure should be considered; and 3) when the confidence is higher than the predefined threshold, the designed alarm system's results are accepted and a report of the system's findings is stored. To illustrate, consider a case in which a process variable is affected by natural noise, which alters the process variable's statistical behaviour. As a consequence, the number of generated alarms varies, and the safe design algorithm warns the operator. The operator then determines whether or not the process variable (say, a chemical process) is running properly. If the process is running properly, the alarm system must be redesigned to include the newly buffered data; otherwise, the alarm system has successfully identified the anomalous condition. The algorithm notifies the operator simply by comparing the statistical difference values, as well as the estimated FAR and MAR of the buffered data, with those of the initial data. Figure 4 shows the different statistical measures and their differences; since this work is largely inspired by [16], we use the same explanation of how the statistical difference measures are applied when comparing the ECDFs of the trusted (initial) data and the real-time data. As can be seen, the KS distance quantifies the maximum difference between two ECDFs. The KS distance cannot determine which ECDF has the greater value, whereas the Kuiper distance quantifies the maximum deviations in both directions; when two sets have the same mean but distinct variances, the Kuiper distance therefore provides a more informative metric than the KS distance. As shown in Fig. 4(c), the Wasserstein distance (WD) effectively computes the area between the two ECDFs; as a result, the WD is more sensitive to changes in the shape of the distributions. The CVM distance is comparable to the WD, but quicker to compute; when the step size of the CVM algorithm is reduced, its results approach those of the WD. [16] provides further insight on ECDF distance measures. Based on the specific attributes of the above statistical distances, we can change the design policy and obtain a more detailed view of the monitored data. This helps us adjust the design parameters of the alarm system in real time and enhance the performance of the designed alarm system.
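The sketch below shows one way the five ECDF-based measures could be computed for the initial and buffered data with SciPy; the Kuiper distance is not available in SciPy and is therefore computed directly from the two ECDFs, and all parameter values are illustrative.

```python
import numpy as np
from scipy.stats import (anderson_ksamp, cramervonmises_2samp, ks_2samp,
                         wasserstein_distance)

def ecdf_distances(initial, buffered):
    """Compare buffered real-time data with the initial (design-time) data
    using the ECDF-based measures discussed above."""
    grid = np.sort(np.concatenate([initial, buffered]))
    f_init = np.searchsorted(np.sort(initial), grid, side="right") / initial.size
    f_buff = np.searchsorted(np.sort(buffered), grid, side="right") / buffered.size
    diff = f_init - f_buff
    return {
        "KS": ks_2samp(initial, buffered).statistic,
        "Kuiper": diff.max() - diff.min(),               # D+ + D-
        "AD": anderson_ksamp([initial, buffered]).statistic,
        "CVM": cramervonmises_2samp(initial, buffered).statistic,
        "W": wasserstein_distance(initial, buffered),
    }

# Illustrative comparison: same mean, shifted variance
rng = np.random.default_rng(3)
print(ecdf_distances(rng.normal(2, 1, 5000), rng.normal(2, 1.5, 5000)))
```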

5 Simulated Example

In this section, we present an example to show how the flowchart works. Figure 5 depicts the statistical difference measures in relation to accuracy measures for a basic classification approach, a simple linear classifier. In the flowchart explored in this study, statistical difference measures are used at the evaluation stage to compare real-time data to the initial data. Figure 5(a-b) shows how accuracy changes with respect to the WD and KS measures; as can be seen, the values change in a predictable manner. Figure 5(c-d) shows how the WD and KS measures change with respect to different values of the variance. In the following example, we use a Monte Carlo simulation to examine whether the changes follow a predictable pattern, and instead of accuracy values we use the well-known alarm system performance indices.

Fig. 5. Increasing the variance of the test data (from 1 to 5) for both normal and abnormal classes: (a-b) accuracy in relation to WD and KSD; (c-d) variance values in relation to WD and KSD [16]

At this point, we know that there is an orderly relation between the WD measure and the indices. Based on this, we can provide strategies for designing alarm systems that rely on real-time WD measurements of the data and predicted values of MAR and FAR. In this paper, we only take into account ROC-curve-based threshold optimization for the alarm system.

Example 1:

In this example, we calculated the optimum threshold for a process variable with \(x\sim N(2,1)\) as its normal operation distribution and \(x\sim N(4,1)\) as its abnormal operation distribution. The optimum threshold is taken as the one corresponding to the knee point of the ROC curve, which here is 3.25 (the optimization method is the same as in [5]). We performed the same calculation for different distributional shifts in terms of variance values. Based on the flowchart, only some of the WD values are accepted, through a comparison of the WD values with a predefined threshold (here, \(D_{th}=0.5\)); in other words, some WD values correspond to data shifts that do not cause any obvious change in the statistical behaviour of the data. This threshold is adjusted based on the importance level of the process variable in terms of safety and security. We also applied variance shifts of 1:0.05:5 to the variance of the normal and abnormal data sets individually and predicted the FAR and MAR indices using the Monte Carlo simulation.
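A sketch of the ROC sweep for the N(2,1)/N(4,1) pair is given below; note that the knee-point criterion used here (closest point to the ideal corner) is only one common convention and is not necessarily the optimization used in [5], so the resulting threshold may differ slightly from the 3.25 reported above.

```python
import numpy as np
from scipy.stats import norm

# Normal N(2,1) and abnormal N(4,1) operating distributions of Example 1
mu_p, s_p, mu_q, s_q = 2.0, 1.0, 4.0, 1.0

thresholds = np.linspace(0.0, 6.0, 601)
far = norm.sf(thresholds, mu_p, s_p)   # P(x > T | normal) = FAR(T)
tpr = norm.sf(thresholds, mu_q, s_q)   # detection rate = 1 - MAR(T)

# One common knee definition: the threshold closest to the ideal corner
# (FAR = 0, detection rate = 1); other conventions give slightly different values.
knee = int(np.argmin(far ** 2 + (1.0 - tpr) ** 2))
print(f"threshold = {thresholds[knee]:.2f}, "
      f"FAR = {far[knee]:.3f}, MAR = {1.0 - tpr[knee]:.3f}")
```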

Fig. 6. Corresponding FAR, MAR and variance indices to WD values - variance shift for the normal data

Fig. 7. Corresponding FAR, MAR and variance indices to WD values - variance shift for the abnormal data

Figures 6 and 7 show the results of the Monte Carlo simulation for variance shifts applied to the normal and abnormal data sets, respectively. For the Monte Carlo simulation, data are generated \(5\times 10^5\) times for both the normal and abnormal conditions. From the false negative and false positive entries of the confusion matrix constructed for each run, the frequencies of false alarms and missed alarms (1 as positive and 0 as negative) are calculated: for the normal operating state, the average number of false alarms is obtained by averaging the false positive counts, and for the abnormal operating state, the average number of missed alarms is obtained in the same way. Based on the Monte Carlo results, it can be assumed that a shift in variance causes predictable changes in the MAR and FAR. Consider the situation where the WD statistical difference is 0.70, the predicted MAR index is 0.28, and the pre-adjusted threshold is 3. Since the change only applies to the abnormal state, the FAR index is the same as when the alarm system was initially designed. By applying the ROC curve, the alarm system's threshold is redesigned. Figure 8 illustrates the ROC curve used to determine the optimal threshold in light of the new data shift. The new MAR is 0.21 and the new FAR is 0.2, in accordance with the optimal threshold of 0.28.
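The Monte Carlo estimation of FAR and MAR under a variance shift can be sketched as follows; the design threshold, shift values, and sample count follow the example above, but the implementation details are only an assumption of how such a simulation could be coded.

```python
import numpy as np

def mc_far_mar(threshold, var_normal, var_abnormal, n=500_000, seed=0):
    """Monte Carlo estimate of FAR and MAR: normal ~ N(2, var_normal),
    abnormal ~ N(4, var_abnormal); an alarm is raised when a sample exceeds
    `threshold` (5e5 draws per state, as in the text)."""
    rng = np.random.default_rng(seed)
    normal = rng.normal(2.0, np.sqrt(var_normal), n)
    abnormal = rng.normal(4.0, np.sqrt(var_abnormal), n)
    far = np.mean(normal > threshold)      # false alarms in the normal state
    mar = np.mean(abnormal <= threshold)   # missed alarms in the abnormal state
    return far, mar

# Variance shift applied to the abnormal data only; threshold kept at the
# Example 1 design value of 3.25
for var_q in (1.0, 2.0, 3.0):
    print(var_q, mc_far_mar(3.25, var_normal=1.0, var_abnormal=var_q))
```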

Fig. 8. ROC curve used to set the threshold

6 Conclusion

In the presented work, we attempted for the first time to propose an adaptive design method for an alarm system. We evaluated the degree of dissimilarity between the real-time process variable and the data used to design the alarm system using statistical difference measures. We illustrated our work with a flowchart, which clarifies the sequence of the method's stages and the operators' duties depending on the method's output. Finally, we validated the method through a Monte Carlo simulation, whose results are consistent with the expectations. In future work, we will extend the approach so that it may be applied to other types of faults, including intermittent and incipient faults, and also to multivariate alarm systems.