Abstract
Alarm systems are used in the process industries to notify operators of abnormal process conditions or equipment faults. An alarm system must be designed appropriately to maximize the likelihood of safe and efficient operation. With the advancement of industrial and information technologies, real-time monitoring has been shown to be an efficient way of ensuring operational safety and efficiency. Although alarm systems are an essential part of distributed control systems and their performance has improved over the last decade, the adaptive design of uni-variate alarm systems has received remarkably little attention in research. In this paper, we propose an adaptive design method that evaluates and enhances the performance of an alarm system at runtime. The approach is presented as a flowchart that enables adaptive design of the alarm system based on the statistical characteristics of the process variable, considering both the performance deterioration of the alarm system itself and the distributional shift of the process variable. As a result, the design parameters of the alarm system can be adjusted in real time, and the alarm system operates more efficiently. The proposed method is validated using a simulated example.
This work was supported by the National Natural Science Foundation of China (No. 61873142) and the National Science and Technology Innovation 2030 Major Project (No. 2018AAA0101604) of the Ministry of Science and Technology of China.
Keywords
- Alarm System
- Distributional Shift
- Adaptive Design
- Statistical Difference Measures
- Performance Deterioration
1 Introduction
Nowadays, advanced industrial monitoring systems such as Supervisory Control and Data Acquisition (SCADA) and Centralized Monitoring System (CMS) are equipped with an enormous number of sensors thanks to the rapid growth of sensing technologies. These sensors can be used for condition monitoring, and with pre-defined thresholds they can indicate the normal and abnormal states of the system. For abnormal states caused by a component malfunction or a fault, a monitoring and alarm system generates alarms [1]. The generated alarms enable the operators to understand and solve the system’s issues and to plan its maintenance. The cost and performance of operating and maintaining industrial systems are closely tied to the performance of their alarm systems. Thus, an alarm system with low false detection and missed detection rates can bring a large economic benefit to an industry, especially one with limited accessibility such as offshore wind farms [2]. Alarm management is challenging for any company, given the large number of alarms that might be generated at the same time [3]. For instance, during the 1979 accident at the Three Mile Island nuclear power plant, the worst nuclear accident in US history, operators faced difficult situations because of the redundant and confusing information provided to them. Much of the information collected during the accident was irrelevant or misleading [4].
To increase safety and efficacy, many actions, tools, evaluation metrics and policies have been created. For example, [5] used three indices to analyse an alarm system’s performance and safety in the case of abrupt faults: Averaged Alarm Delay (AAD), Missed Alarm Rate or Probability (MAR/MAP), and False Alarm Rate or Probability (FAR/FAP). To design and optimize alarm systems for mixture processes and intermittent faults, [6] presented a time-variant finite mixture model that statistically describes the behaviour of a process variable affected by an intermittent fault. That work provides calculation methods for FAR and AAD and introduces a new time-variant missed alarm rate (MAR(t)), which reflects the missed alarm rate during the emerging stage of an intermittent fault. Here, process variables are the parameters or quantities that are to be kept within their correct limits.
[7] developed generalized delay timers, a novel way of improving traditional delay timer systems. A Markov model was used to calculate FAR, MAR, and Expected Detection Delay (EDD) in order to compare the performance of this technique with that of a standard delay timer. For the performance evaluation of monitoring systems with adaptive and variable alarm thresholds, [8] introduced a method that combines a semi-Markov process with temporal logic gates. [9] proposed a technique for obtaining an optimal alarm filter that incorporates plant and control system information while allowing the independence requirement to be relaxed.
A review of the existing literature on performance assessment and improvement of alarm systems shows that few publications address the adaptive design of alarm systems in the presence of concept drift or distributional shift in the incoming data. To address this gap, this paper proposes an adaptive approach with the following contributions: A) real-time evaluation of the designed alarm system, and B) adjustment of the design parameters of the alarm system based on statistical difference measures.
2 Problem Formulation
2.1 Detecting Alarm States
The most common way of detecting an alarm state is to compare the value of a process variable to a constant high (low) alarm trip point, i.e.
\[x_a(t)=\begin{cases}1, & \text{if } x(t)>x_{htp}\ \left(x(t)<x_{ltp}\right)\\ 0, & \text{otherwise}\end{cases} \tag{1}\]
where x(t) denotes the process variable, \(x_a(t)\) denotes the alarm variable, and \(x_{htp}\) and \(x_{ltp}\) respectively denote the high and low trip points. An example is a drum level in a large-scale thermal power plant, which is associated with the high alarm trip point 100 and the low alarm trip point \(-100\). Figure 1(a) presents 1-hour samples of x(t) with a sampling period of 1 s. A discrete-valued alarm variable \(x_a(t)\) may be used to describe alarm states mathematically. Figure 1(b) shows the samples of the two alarm variables associated with x(t). That is, the high (low) alarm variable takes the value 1 if x(t) is greater (less) than \(100\) \((-100)\), and 0 otherwise [3]. Alarm occurrence and alarm clearance are defined as the change of an alarm variable from 0 to 1 and from 1 to 0, respectively. Note the two rapid changes in the low alarm variable between 23:47:38 and 23:47:42, which are visible in the magnified plot in Fig. 1.
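The trip-point comparison and the definitions of alarm occurrence and clearance can be sketched in a few lines of Python. This is only an illustration: the function names and the short sample sequence are hypothetical, and only the trip points 100 and \(-100\) are taken from the drum-level example.

```python
from typing import List, Tuple


def alarm_variables(x: List[float], x_htp: float, x_ltp: float) -> Tuple[List[int], List[int]]:
    """Return the high and low alarm variables (1 = alarm state, 0 = otherwise)."""
    high = [1 if v > x_htp else 0 for v in x]
    low = [1 if v < x_ltp else 0 for v in x]
    return high, low


def count_transitions(x_a: List[int]) -> Tuple[int, int]:
    """Count alarm occurrences (0 -> 1) and alarm clearances (1 -> 0)."""
    pairs = list(zip(x_a, x_a[1:]))
    occurrences = sum(1 for prev, cur in pairs if (prev, cur) == (0, 1))
    clearances = sum(1 for prev, cur in pairs if (prev, cur) == (1, 0))
    return occurrences, clearances


# Hypothetical samples of x(t); the trip points follow the drum-level example.
samples = [20.0, 95.0, 104.0, 101.0, 60.0, -102.0, -98.0, -105.0, -40.0]
high, low = alarm_variables(samples, x_htp=100.0, x_ltp=-100.0)
print("high alarm variable:", high)
print("low alarm variable: ", low)
print("low alarm (occurrences, clearances):", count_transitions(low))
```

The low alarm variable in this toy sequence switches on and off twice within a few samples, which mirrors the rapid changes noted for Fig. 1.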
2.2 Abrupt Faults
Assume that a process variable is in its normal state with distribution p(x). An abrupt fault causes a change in the statistical properties of the variable, for instance, its mean. In this case, the PDF of the variable instantly changes to q(x) in the faulty state. Figure 2 illustrates the distributions, p(x) and q(x) which indicate the variable distributions in normal and abnormal operation states, respectively. For this process variable, the well-known alarm performance indices are defined as [10]:
- False Alarm Rate (FAR): The FAR index is the probability that an alarm is raised while the process variable is in its normal operating condition; in this work it is denoted by \(p_1\). Such an alarm is incorrect and requires no response from the operator. Because the indices for the high and low trip points are computed in the same way, the index is given here for the high trip point only:
\[\textrm{FAR}=\int_{x_{htp}}^{+\infty}p(x)\,dx\]
That is, the FAR is the probability that the process variable exceeds the high trip point, and hence raises an alarm, while the process is operating normally.
- Missed Alarm Rate (MAR): The MAR index is the probability that no alarm is triggered while the process variable is in a faulty or abnormal condition [10]; it is denoted by \(q_2\). This index is more critical than the FAR index, since a false alarm may disturb the operator and waste time, whereas a missed alarm may result in an incident of which the operator is unaware.
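For Gaussian distributions, the two indices above follow directly from the cumulative distribution functions. The sketch below is a minimal illustration under assumed values: the distributions \(N(2,1)\) and \(N(4,1)\) are borrowed from the simulated example later in the paper, while the threshold 3.0 and the function name are chosen here purely for demonstration.

```python
from statistics import NormalDist
from typing import Dict


def alarm_probabilities(normal: NormalDist, abnormal: NormalDist, x_htp: float) -> Dict[str, float]:
    """Probabilities of alarm / no alarm in the normal and abnormal states."""
    return {
        "p1": 1.0 - normal.cdf(x_htp),    # alarm while normal   -> FAR
        "p2": normal.cdf(x_htp),          # no alarm while normal
        "q1": 1.0 - abnormal.cdf(x_htp),  # alarm while abnormal
        "q2": abnormal.cdf(x_htp),        # no alarm while abnormal -> MAR
    }


# NormalDist takes the standard deviation, so variance 1 corresponds to sigma = 1.
probs = alarm_probabilities(NormalDist(2.0, 1.0), NormalDist(4.0, 1.0), x_htp=3.0)
far, mar = probs["p1"], probs["q2"]
print(f"FAR = {far:.4f}, MAR = {mar:.4f}")
```

Raising the threshold lowers the FAR and raises the MAR, which is the trade-off the design procedure has to balance.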
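For the process variable in Fig. 2, we define the four parameters \(p_1\), \(p_2\), \(q_1\), and \(q_2\) as
\[p_1=\int_{x_{htp}}^{+\infty}p(x)\,dx,\quad p_2=\int_{-\infty}^{x_{htp}}p(x)\,dx,\quad q_1=\int_{x_{htp}}^{+\infty}q(x)\,dx,\quad q_2=\int_{-\infty}^{x_{htp}}q(x)\,dx \tag{2}\]
Then, it is straightforward to show that the performance indices are calculated as [10]:
\[\textrm{FAR}=p_1,\quad \textrm{MAR}=q_2,\quad \textrm{AAD}=E\left(t_a-t_f\right) \tag{3}\]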
Here, \(x_{htp}\) is the alarm threshold, \(t_f\) is the time of the fault occurrence, and \(t_a\) is the time of the first alarm after the fault occurrence. In Fig. 2, the adjusted threshold is indicated by a red vertical dashed line. Determining the best threshold is a classification problem: in the region where the two PDFs overlap, the designed alarm system cannot determine which state of operation a sample belongs to, so a generated alarm might be a false alarm and a non-alarm sample of the alarm variable might be a missed alarm. The likelihood of such an error can be written as [11]
\[P(error)=\int_{-\infty}^{+\infty}P(error|x)\,P(x)\,dx \tag{4}\]
where \(P(error|x)\) is the minimum of the posterior probabilities of the normal and abnormal states, as in (5).
\[P(error|x)=\min\left[P(NS|x),\,P(ANS|x)\right] \tag{5}\]
NS and ANS denote the normal and abnormal states of operation, respectively, as shown in Fig. 2. The likelihood of error may be expressed in two parts by dividing the space into two regions, \(R_1=\{x\in \mathbb{R}\,|\,x < X_{utp}\}\) and \(R_2=\{x\in \mathbb{R}\,|\,x > X_{utp}\}\).
In other words, with \(R_2\) the region above the threshold and \(R_1\) the region below it, \(P(x\in R_2, NS)\) and \(P(x\in R_1, ANS)\) correspond to the FAR and MAR indices, respectively. To ease the minimization problem, consider the following inequality rule [12].
Equation (5) may then be written as (8). When considering the worst-case situation, that is, the upper bound of the error, the ‘\(\le \)’ in (7) may be taken as ‘\(=\)’.
By using the inequality rule and equation (8), we have
Equation (10) can be derived from equations (4) to (9).
It is critical to evaluate the worst-case scenario in safety assurance, which might lead to (11), also known as the Chernoff upper bound of error [12].
The integral component of (11) could be solved using (12) [12] if the probability distributions of the classes follow normal or exponential distribution families.
The term \(\theta \left( \gamma \right) \) can be calculated using (13), where \(\mu \) and \(\varSigma \) are the mean vector and covariance matrix of each class, respectively.
Equation (13) reduces to the Bhattacharyya distance when \(\alpha =0.5\). When \(\varSigma _1=\varSigma _2\), it can be shown that this value is optimal [12, 13]. For simplicity, the Bhattacharyya distance is used to illustrate the technique in this paper. It should be noted that the estimated error bound may be larger than the true error in certain circumstances. This is acceptable, however, since overestimating the classifier error does not pose a safety risk (although it may affect performance). Since \(P\left( error\right) \) and \(P\left( correct\right) \) are complementary, the probability of a correct classification may be computed using (14).
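As an illustration of the bound discussed above, the following sketch evaluates the Bhattacharyya distance for two univariate Gaussian classes and the resulting upper bound \(\sqrt{P(NS)P(ANS)}\,e^{-D_B}\) on the error, in the standard form given in [12]. The function names, the equal priors, and the distributions \(N(2,1)\) and \(N(4,1)\) are assumptions made for this example only.

```python
import math
from statistics import NormalDist


def bhattacharyya_distance(mu1: float, var1: float, mu2: float, var2: float) -> float:
    """Bhattacharyya distance between two univariate Gaussian densities."""
    mean_term = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
    var_term = 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2)))
    return mean_term + var_term


def error_upper_bound(d_b: float, prior_ns: float = 0.5) -> float:
    """Bhattacharyya upper bound on the classification error."""
    return math.sqrt(prior_ns * (1.0 - prior_ns)) * math.exp(-d_b)


d_b = bhattacharyya_distance(2.0, 1.0, 4.0, 1.0)
bound = error_upper_bound(d_b)

# Exact error for equal priors and equal variances: the best threshold is the midpoint 3.0.
exact = 0.5 * (1.0 - NormalDist(2.0, 1.0).cdf(3.0)) + 0.5 * NormalDist(4.0, 1.0).cdf(3.0)
print(f"D_B = {d_b:.3f}, bound = {bound:.4f}, exact error = {exact:.4f}")
```

In this setting the bound is visibly larger than the exact error, which is consistent with the remark that the estimate may be conservative.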
In most cases, the Chernoff upper bound of error is used to determine the separability of two classes of data; in this work, however, equation (14) is used to determine the similarity of two classes. In other words, if the \(P\left( error\right) \) of a class is compared to itself in an optimized environment, the answer should be one, whereas \(P\left( correct\right) \) should be zero. The purpose is to check whether the data distribution seen during design matches the data distribution observed in the field.
The integral component of \(P\left( error\right) \) may be transformed to the cumulative distribution function as (15) if \(P\left( NS\right) = P\left( ANS\right) \).
In addition, Equation (15) illustrates the link between likelihood of error (and also accuracy) and statistical difference between two Cumulative Distribution Functions (CDF) of two states. ECDF-based statistical measures such as the Kolmogorov-Smirnov distance (KSD) (Eq. 16) and similar distance measures can be used to predict the error at runtime [14, 15].
It should be noted that not all ECDF-based distances are bounded between zero and one, so in certain circumstances a scaling coefficient may be needed before they can be used as an estimate of accuracy. The relationship between ECDF-based distance and accuracy is examined in Sect. 4.
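A minimal sketch of the Kolmogorov-Smirnov distance between two empirical CDFs is given below, using only the Python standard library. The helper names and the small sample sets are illustrative assumptions; the distance is evaluated at the pooled sample points, where the largest gap between two step functions is attained.

```python
from bisect import bisect_right
from typing import Callable, Sequence


def ecdf(sample: Sequence[float]) -> Callable[[float], float]:
    """Return the empirical CDF of a sample as a right-continuous step function."""
    data = sorted(sample)
    n = len(data)
    return lambda x: bisect_right(data, x) / n


def ks_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Maximum absolute difference between the ECDFs of two samples."""
    f_a, f_b = ecdf(a), ecdf(b)
    pooled = sorted(set(a) | set(b))
    return max(abs(f_a(x) - f_b(x)) for x in pooled)


trusted = [1.0, 2.0, 3.0, 4.0]   # hypothetical design-phase samples
runtime = [3.0, 4.0, 5.0, 6.0]   # hypothetical runtime samples
print("KS distance:", ks_distance(trusted, runtime))
```

The KS distance always lies between zero (identical samples) and one (no overlap), which makes it convenient to compare against a fixed threshold.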
3 Safe Designed Alarm System
First and foremost, it is important to stress that this research focuses on the design of alarm systems for abrupt faults. The flowchart in Fig. 3 shows how we envisage the idea being used in practice. It has two main parts, the designing phase and the application phase. I) The designing phase is an offline procedure that uses historical data of a given process variable to design an alarm system based on performance indices such as the missed alarm rate, the false alarm rate, and the averaged alarm delay. To construct these performance indices, this phase includes a change detection method for the process variable. The indices of the resulting design are saved for later comparison. II) The application phase is an adaptive procedure in which real-time data is provided to the system; at this stage, nothing is known about the statistical characteristics of the real-time data. For example, consider an alarm system designed to monitor the pressure of the main steam driving the power turbine of a thermal power plant. The design policy is intended to make the alarm system trigger an alarm with the least possible delay after an abrupt fault occurs. In the application phase, it is important to keep in mind that the incoming data is not labelled as faulty or non-faulty. As a result, it cannot be known in advance whether the designed alarm system will operate as well as it did during the designing phase. The PDF and statistical parameters of each class can be estimated as input samples are gathered. Because the system needs a sufficient number of samples to detect a statistical difference correctly, a buffer of samples may be required before proceeding. Using the modified Chernoff error bound presented in [11], the statistical difference between each state of operation in the designing phase and in the application phase is then computed.
If the statistical difference is very low, the results and accuracy of the designed alarm system can be trusted. In the power turbine example, the alarm system would then continue operating with the policies chosen in the designing phase. Conversely, if the statistical difference is large, the results and accuracy of the designed alarm system are no longer regarded as acceptable, because of the large disparity between the trusted and the observed data. In this case, the system should use an alternative design policy or notify a human operator. In the above example, the alarm system could ask the operator to adjust the design parameters of the alarm system.
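The buffer-and-compare logic of the application phase can be sketched as follows. This is a simplified illustration rather than the flowchart itself: the function names, the returned state labels, the buffer size of five, the KS distance as the statistical difference, and the threshold 0.5 are all assumptions made for the example.

```python
from bisect import bisect_right
from collections import deque
from typing import Deque, List, Sequence


def ks_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Maximum absolute difference between the ECDFs of two samples."""
    sa, sb = sorted(a), sorted(b)
    return max(abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
               for x in sa + sb)


def application_step(buffer: Deque[float], sample: float,
                     design_data: Sequence[float], d_th: float) -> str:
    """Add one runtime sample and decide what the alarm system should do."""
    buffer.append(sample)                    # a full deque drops its oldest sample
    if len(buffer) < buffer.maxlen:
        return "collecting"                  # not enough samples to compare yet
    distance = ks_distance(design_data, list(buffer))
    return "keep_design" if distance <= d_th else "notify_operator"


design_data = [1.8, 1.9, 2.0, 2.1, 2.2]      # hypothetical trusted data
stream = [1.9, 2.0, 2.1, 1.8, 2.2, 4.0, 4.1, 3.9, 4.2, 4.0]
buffer: Deque[float] = deque(maxlen=5)
states: List[str] = [application_step(buffer, s, design_data, d_th=0.5) for s in stream]
print(states)
```

A sliding buffer is used here so that the decision reacts gradually as shifted samples replace trusted ones; a block-wise buffer would be an equally valid design choice.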
4 Statistical Difference Values
This section describes the statistical distance measures used in the comparison stage of the application phase, indicated by the yellow box in Fig. 3. A buffer is used in the application phase to gather enough samples. An expert should determine the buffer size at design time so that the gathered data captures the statistical properties of the operation state. It is worth noting that the incoming data is not assumed to belong to a specific operation state. After sufficient samples have been collected, the alarm system designed in the previous step is used to identify the operation state based on the generated alarms.
The statistical properties of the buffered data are compared to those of the initial data set using ECDF-based statistical distance measures such as Kolmogorov-Smirnov (KS), Kuiper (K), Anderson-Darling (AD), Cramer-Von Mises (CVM), and Wasserstein (W) [14]. In addition, an expected confidence level for each statistical distance measure should be determined during the design phase. The confidence obtained from the comparison described above is then checked against this predefined confidence threshold. Three distinct cases are considered: 1) when the confidence is slightly lower than the threshold, the system should collect additional data; 2) when the confidence is significantly lower than the predefined threshold, it is assumed that the incoming data has not been seen by the designed alarm system before, and a human-in-the-loop procedure should be considered; and 3) when the confidence is higher than the predefined threshold, the results of the designed alarm system are accepted and a report of the system’s findings is stored. To illustrate, consider a case in which a process variable is affected by natural noise that alters its statistical behavior. As a consequence, the number of generated alarms varies, and the safe design algorithm warns the operator. The operator then determines whether or not the process (say, a chemical process) is running properly. If the process is running properly, the alarm system must be redesigned to include the newly buffered data. Otherwise, the alarm system has successfully identified the abnormal condition. The algorithm notifies the operator simply by comparing the statistical difference values, together with the estimated FAR and MAR of the buffered data, with those of the initial data.

Figure 4 shows the different statistical measures and how they differ. Since this work is largely inspired by [16], we follow the explanation given there of how statistical difference measures are applied to compare the ECDFs of the trusted (initial) data and the real-time data. As can be seen, the KS distance quantifies the maximum vertical difference between two ECDFs. The KS distance cannot indicate which ECDF has the greater value, whereas the Kuiper distance quantifies the two maxima, upward and downward. When two sets have the same mean but distinct variances, the Kuiper distance provides a more accurate metric than the KS distance. As shown in Fig. 4(c), the Wasserstein distance (WD) in effect computes the area between two ECDFs. As a result, the WD is more sensitive to changes in the shape of the distributions. The CVM distance is comparable to the WD, except that it is quicker to compute. When the step size of the CVM algorithm is reduced, its results approach those of the WD. [16] provides further insight into ECDF distance measures. Based on the specific properties of the above statistical distances, we can change the design policy and obtain a more detailed view of the monitored data. This helps adjust the design parameters of the alarm system in real time and enhances the performance of the designed alarm system.
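The differences between the KS, Kuiper and Wasserstein distances can be made concrete with a small sketch. It assumes two deterministic samples with the same mean and different spread, built from standard normal quantiles; the helper names are illustrative, and the Wasserstein distance is implemented only for samples of equal size, where it reduces to the mean absolute difference of the sorted values.

```python
from bisect import bisect_right
from statistics import NormalDist
from typing import Sequence, Tuple


def ecdf_gaps(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Largest upward and downward gaps between the ECDFs of a and b."""
    sa, sb = sorted(a), sorted(b)
    diffs = [bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb) for x in sa + sb]
    return max(diffs), max(-d for d in diffs)


def ks_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return max(ecdf_gaps(a, b))   # single largest gap, direction ignored


def kuiper_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(ecdf_gaps(a, b))   # largest gap up plus largest gap down


def wasserstein_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """First-order Wasserstein distance (area between ECDFs), equal sample sizes only."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes samples of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


n = 200
unit = NormalDist(0.0, 1.0)
trusted = [unit.inv_cdf((i + 0.5) / n) for i in range(n)]  # quantiles of N(0, 1)
shifted = [3.0 * v for v in trusted]                        # same mean, larger variance

ks = ks_distance(trusted, shifted)
kuiper = kuiper_distance(trusted, shifted)
wd = wasserstein_distance(trusted, shifted)
print(f"KS = {ks:.3f}, Kuiper = {kuiper:.3f}, WD = {wd:.3f}")
```

For this pure variance change the two ECDFs cross at the common mean, so the Kuiper distance picks up both gaps while the KS distance reports only one of them.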
5 Simulated Example
This section presents an example that shows how the flowchart works. Figure 5 depicts the statistical difference measures in relation to accuracy for a basic classification approach, namely a simple linear classifier. In the flowchart considered in this study, statistical difference measures are used at the evaluation stage to compare real-time data with the initial data. Figure 5(a-b) shows how the accuracy changes with respect to the WD and KS measures. As can be seen, the values change in a predictable manner. Figure 5(c-d) shows how the WD and KS measures change with the variance. In the following example, Monte-Carlo simulation is used to check whether such a predictable pattern also exists when, instead of accuracy, the well-known performance indices of alarm systems are used.
At this point, it is known that there is an orderly relation between the WD measure and the indices. Based on this, strategies can be provided for designing alarm systems from real-time WD measurements of the data and the predicted values of MAR and FAR. In this paper, only threshold optimization based on the ROC curve is considered for the alarm system.
Example 1:
In this example, the optimum threshold is calculated for a process variable with \(x\sim N(2,1)\) as its normal operation distribution and \(x\sim N(4,1)\) as its abnormal operation distribution. The optimum threshold is taken as the one corresponding to the knee point of the ROC curve, which here is 3.25 (the optimization method is the same as in [5]). The same calculation is repeated for different distributional shifts in terms of variance values. Based on the flowchart, only some of the WD values are accepted, which is decided by comparing the WD values to a predefined threshold (here, \(D_{th}=0.5\)). In other words, some WD values correspond to data shifts that do not cause any obvious change in the statistical behavior of the data. This threshold is adjusted according to the importance of the process variable in terms of safety and security. Variance shifts from 1 to 5 in steps of 0.05 (1:0.05:5) are also applied to the normal and abnormal data sets individually, and the FAR and MAR indices are predicted by Monte-Carlo simulation.
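A threshold sweep over the ROC curve for this example can be sketched as below. The selection rule used here, the point closest to the ideal corner FAR = 0 and MAR = 0, is an assumption made for illustration; the optimization in [5] may use a different criterion, so the threshold found by this sketch need not coincide with the value quoted above. The grid of candidate thresholds and the function names are likewise hypothetical.

```python
from statistics import NormalDist
from typing import List, Tuple

RocPoint = Tuple[float, float, float]  # (threshold, FAR, MAR)


def roc_points(normal: NormalDist, abnormal: NormalDist, thresholds: List[float]) -> List[RocPoint]:
    """FAR and MAR of a high-trip-point alarm for each candidate threshold."""
    return [(t, 1.0 - normal.cdf(t), abnormal.cdf(t)) for t in thresholds]


def closest_to_ideal(points: List[RocPoint]) -> RocPoint:
    """Pick the point nearest to FAR = 0, MAR = 0 (assumed knee criterion)."""
    return min(points, key=lambda p: p[1] ** 2 + p[2] ** 2)


thresholds = [i * 0.05 for i in range(121)]  # candidate thresholds from 0 to 6
normal, abnormal = NormalDist(2.0, 1.0), NormalDist(4.0, 1.0)
best_t, best_far, best_mar = closest_to_ideal(roc_points(normal, abnormal, thresholds))
print(f"threshold = {best_t:.2f}, FAR = {best_far:.4f}, MAR = {best_mar:.4f}")

# Repeating the sweep after a variance shift of the abnormal state (variance 4, sigma 2).
shifted_t, shifted_far, shifted_mar = closest_to_ideal(
    roc_points(normal, NormalDist(4.0, 2.0), thresholds))
print(f"after shift: threshold = {shifted_t:.2f}, FAR = {shifted_far:.4f}, MAR = {shifted_mar:.4f}")
```

The same sweep can be rerun whenever the monitored distance indicates a relevant shift, which is the redesign step referred to in the flowchart.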
Figures 6 and 7 show the results of the Monte-Carlo simulation for variance shifts on the normal and abnormal data sets, respectively. For the Monte-Carlo simulation, data are generated \(5\times 10^5\) times for both normal and abnormal conditions. The frequencies of false alarms and missed alarms (1 as positive and 0 as negative) are calculated from the false positive and false negative entries of the confusion matrix constructed for each observation. For the normal operating state, the average number of false alarms is obtained by averaging the false positive counts; for the abnormal operating state, the average number of missed alarms is obtained in the same way from the false negative counts. Based on the Monte-Carlo results, it can be assumed that a shift in variance causes predictable changes in MAR and FAR. Consider the situation where the WD statistical difference is 0.70, the predicted MAR index is 0.28, and the pre-adjusted threshold is 3. Since the change is applied only to the abnormal state, the FAR index is the same as when the alarm system was initially designed. The threshold of the alarm system is then redesigned using the ROC curve. Figure 8 illustrates the ROC curve used to determine the optimal threshold in light of the new data shift. The new MAR is 0.21 and the new FAR is 0.2 in accordance with the optimal threshold of 0.28.
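The Monte-Carlo estimation of FAR and MAR under a variance shift of the abnormal state can be sketched as follows. The sketch uses a fixed threshold of 3, a much smaller sample count than the \(5\times 10^5\) used above, and a coarse grid of variances; these values, the seed, and the function name are assumptions chosen to keep the example fast.

```python
import math
import random
from typing import List, Tuple


def monte_carlo_far_mar(threshold: float, normal_sd: float, abnormal_sd: float,
                        n: int = 20000, seed: int = 0) -> Tuple[float, float]:
    """Estimate FAR and MAR by sampling N(2, normal_sd^2) and N(4, abnormal_sd^2)."""
    rng = random.Random(seed)
    # False alarm: a normal-state sample exceeds the threshold.
    false_alarms = sum(rng.gauss(2.0, normal_sd) > threshold for _ in range(n))
    # Missed alarm: an abnormal-state sample stays at or below the threshold.
    missed_alarms = sum(rng.gauss(4.0, abnormal_sd) <= threshold for _ in range(n))
    return false_alarms / n, missed_alarms / n


variances = [1.0, 2.0, 3.0, 4.0, 5.0]        # shift applied to the abnormal state only
results: List[Tuple[float, float]] = [
    monte_carlo_far_mar(3.0, normal_sd=1.0, abnormal_sd=math.sqrt(v)) for v in variances
]
for v, (far, mar) in zip(variances, results):
    print(f"abnormal variance {v:.0f}: FAR = {far:.3f}, MAR = {mar:.3f}")
```

Because the shift affects only the abnormal state, the estimated FAR stays the same across the runs while the estimated MAR grows with the variance, which is the kind of predictable trend the Monte-Carlo study relies on.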
6 Conclusion
In this work, we proposed, for the first time, an adaptive design method for alarm systems. Using statistical difference measures, we evaluated the degree of dissimilarity between the real-time process variable and the process variable data used to design the alarm system. We illustrated the method with a flowchart that clarifies the sequence of its stages and the operators’ duties depending on its output. Finally, we validated the method through Monte-Carlo simulation, whose results are consistent with expectations. In future work, we will extend the approach so that it can be applied to other types of faults, including intermittent and incipient faults, and to multi-variate alarm systems.
References
1. Izadi, I., Shah, S.L., Shook, D.S., Chen, T.: An introduction to alarm analysis and design. IFAC Proc. Vol. 42(8), 645–650 (2009). Elsevier
2. May, A., McMillan, D.: Condition based maintenance for offshore wind turbines: the effects of false alarms from condition monitoring systems. In: ESREL (2013)
3. Wang, J., Yang, F., Chen, T., Shah, S.L.: An overview of industrial alarm systems: main causes for alarm overloading, research status, and open problems. IEEE Trans. Automat. Sci. Eng. 13(2), 1045–1061 (2016). IEEE
4. Zang, H., Yang, F., Huang, D.: Design and analysis of improved alarm delay-timers. IFAC-PapersOnLine 48(8), 669–674 (2015). Elsevier
5. Xu, J., Wang, J., Izadi, I., Chen, T.: Performance assessment and design for univariate alarm systems based on FAR, MAR, and AAD. IEEE Trans. Automat. Sci. Eng. 9(2), 296–307 (2011). IEEE
6. Asaadi, M., Izadi, I., Hassanzadeh, A., Yang, F.: Assessment of alarm systems for mixture processes and intermittent faults. J. Process Control 114, 120–130 (2022)
7. Adnan, N.A., Cheng, Y., Izadi, I., Chen, T.: Study of generalized delay-timers in alarm configuration. J. Process Control 23(3), 382–395 (2013). Elsevier
8. Aslansefat, K., Gogani, M.B., Kabir, S., Shoorehdeli, M.A., Yari, M.: Performance evaluation and design for variable threshold alarm systems through semi-Markov process. ISA Trans. 97, 282–295 (2020). Elsevier
9. Roohi, M.H., Chen, T., Guan, Z., Yamamoto, T.: A new approach to design alarm filters using the plant and controller knowledge. Indust. Eng. Chem. Res. 60(9), 3648–3657 (2021). ACS Publications
10. Xu, J., Wang, J., Izadi, I., Chen, T.: Performance assessment and design for univariate alarm systems based on FAR, MAR, and AAD. IEEE Trans. Automat. Sci. Eng. 9(2), 296–307 (2012). IEEE
11. Aslansefat, K., Sorokos, I., Whiting, D., Tavakoli Kolagari, R., Papadopoulos, Y.: SafeML: safety monitoring of machine learning classifiers through statistical difference measures. In: Zeller, M., Höfig, K. (eds.) IMBSA 2020. LNCS, vol. 12297, pp. 197–211. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58920-2_13
12. Fukunaga, K.: Introduction to Statistical Pattern Recognition. Elsevier (2013)
13. Nielsen, F.: The chord gap divergence and a generalization of the Bhattacharyya distance. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2276–2280 (2018)
14. Deza, M.M., Deza, E.: Distances in probability theory. In: Encyclopedia of Distances, pp. 257–272. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-44342-2_14
15. Mathias, R.: Empirical behaviour of tests for the beta distribution and their application in environmental research. Stochast. Environ. Res. Risk Assessm. 25, 79–89 (2011)
16. Aslansefat, K., Kabir, S., Abdullatif, A., Vasudevan, V., Papadopoulos, Y.: Toward improving confidence in autonomous vehicle software: a study on traffic sign recognition systems. Computer 54(8), 66–76 (2021). IEEE
17. Naghoosi, E., Izadi, I., Chen, T.: A study on the relation between alarm deadbands and optimal alarm limits. In: Proceedings of the 2011 American Control Conference, pp. 3627–3632 (2011). IEEE
18. Izadi, I., Shah, S.L., Shook, D.S., Kondaveeti, S.R., Chen, T.: A framework for optimal design of alarm systems. IFAC Proc. Vol. 42(8), 651–656 (2009)
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Asaadi, M., Aslansefat, K., Izadi, I., Yang, F. (2024). Adaptive Design of Uni-Variate Alarm Systems Based on Statistical Distance Measures. In: Xin, B., Kubota, N., Chen, K., Dong, F. (eds) Advanced Computational Intelligence and Intelligent Informatics. IWACIII 2023. Communications in Computer and Information Science, vol 1931. Springer, Singapore. https://doi.org/10.1007/978-981-99-7590-7_9
DOI: https://doi.org/10.1007/978-981-99-7590-7_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-7589-1
Online ISBN: 978-981-99-7590-7
eBook Packages: Computer Science, Computer Science (R0)