1 Introduction

The industrial standard ISA-18.2 (2009) [1] defines an alarm system as “the collection of hardware and software that detects an alarm state, communicates the indication of that state to operators, and records changes in the alarm state”. Most businesses have a modern computerized monitoring system that uses alarms to maintain safety and efficiency.

Alarms can also be classified as follows:

  • A false alarm is an alarm that is reported when there is no fault.

  • A nuisance alarm is an alarm that is true but redundant, i.e. the operator receives more than one alert for the same condition.

  • A missed alarm is the opposite of a false alarm: it occurs when there is a fault in the system and no alarm has been activated.

  • A chattering alarm performs many transitions between the normal and abnormal states because the variable repeatedly crosses the alarm threshold.

Chattering alarms and alarm floods can be confused, but they are not the same. An alarm flood occurs when the operator receives many alarms in a short period of time. It can be caused by correlation between variables, so that several alerts are triggered at the same time. A chattering alarm is a single alarm that repeatedly crosses the alarm threshold and thereby generates many alerts. According to the ISA-18.2 (2009) standard [1], chattering alarms are those that repeat more than 3 times in a minute.
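The chattering rule quoted above can be illustrated with a minimal sketch. It assumes that the alarm log is available as (timestamp in seconds, tag) pairs, which is a hypothetical layout, and it flags a tag as chattering when it is raised more than 3 times within 60 s. The function name, parameters and tag names are illustrative only.

```python
from collections import defaultdict

def find_chattering_tags(events, max_repeats=3, window_s=60.0):
    """Return the tags raised more than `max_repeats` times within `window_s` seconds.

    `events` is an iterable of (timestamp_seconds, tag) pairs (assumed layout).
    """
    times_by_tag = defaultdict(list)
    for timestamp, tag in events:
        times_by_tag[tag].append(timestamp)

    chattering = set()
    for tag, times in times_by_tag.items():
        times.sort()
        start = 0  # index of the oldest occurrence still inside the window
        for end, t in enumerate(times):
            while t - times[start] >= window_s:
                start += 1
            if end - start + 1 > max_repeats:
                chattering.add(tag)
                break
    return chattering

# Hypothetical log: TI-101 repeats every 10 s, PI-205 every 100 s.
log = [(0, "TI-101"), (10, "TI-101"), (20, "TI-101"), (30, "TI-101"),
       (0, "PI-205"), (100, "PI-205"), (200, "PI-205"), (300, "PI-205")]
chattering = find_chattering_tags(log)
print(chattering)
```

Only the repetition of a single tag is examined here; an alarm flood, by contrast, would be detected from the total alarm rate over all tags.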

Detection delay is also a key concept. It occurs when an alarm is not activated at the instant the failure occurs. It can be caused by the alarm system itself (deadband, delay-timer, etc.).

All industrial systems contain sensors and actuators for detecting and controlling possible faults. These components can create false alarms, which make the control system inefficient and reduce its performance [2, 3].

Fault detection is an important research area from both the academic and the industrial point of view, and numerous detection and control methods have been designed and developed. Some systems can prioritise alarms according to their severity. When an alarm is triggered, the operator must acknowledge it, understand it and identify its cause in order to assess its significance and act to return the operation to its normal state. According to the Engineering Equipment and Materials Users Association (EEMUA, 2007) [4], an operator needs about 10 min to respond adequately to an alarm, i.e. the operator should not receive more than 6 alarms per hour for correct operation of the system.

Currently, operators receive a large number of alarms, sometimes more than they need or can handle (alarm flooding). This can distract the operator to the point that critical alarms are ignored, and some operators therefore become reluctant to rely on the monitoring system. If the system produces many false positive alarms, it must be redesigned to reduce the number of false and nuisance alarms. According to references [2, 3], most of the alarms received by industrial plant operators are false. There are several methods to improve the alarm system, such as multivariable data analysis or the use of filters; the most important of these are discussed in this chapter.

A few decades ago, only a few selected variables could be monitored. Because alarms were difficult to implement, these variables had to be important for the proper performance of the system and for the quality of the process. Each alarm had to be wired from the sensor to the control room, which was costly. In addition, the control room had limited space and had to contain numerous control devices. For these reasons, alarms had to be well designed and reliable, and they guided the operators as good indicators of the correct functioning of the production system.

Nowadays, owing to the development of hardware and software, a large number of alarms can be implemented at low cost. Many process variables can be measured and stored in databases. The alarm system communicates with the operator through the Human Machine Interface (HMI) or an annunciator panel. Many variables are continuously available on the operator panel for monitoring and control. This leads to many alarms, some of which are false, chattering or nuisance alarms. Hollifield et al. claim that chattering alarms are the most common type of alarm, accounting for 70% of all the alarms they found [5].

Table 1 summarizes the number of alarms produced in 39 industrial plants. The alarms are classified according to industrial sector and time of occurrence.

Table 1 Performance metrics of industrial alarm system, study carried out in 39 industrial plants [6]

According to Walker et al. [7], U.S. businesses lose $13 billion a year due to improper use of alarms. The worldwide costs generated by false alarms are difficult to quantify, but they are estimated to be billions of dollars every year. Unnecessary stoppages cause a significant loss of production. For this reason, alarm systems should prevent damage to equipment, downtime and reduced production. Furthermore, process systems should be controlled to improve the efficiency, availability, quality and reliability of the production process [8,9,10].

There is a great deal of research on alarm management, but the behaviour of the operator in the event of an anomaly is rarely studied. Hu et al. analysed the actions of operators in response to univariate alarms [11]. The alarm system must indicate clearly and precisely to operators which processes require further attention. They conclude that operators must adjust the control devices to resolve process anomalies: they can suppress an alarm to ignore it temporarily, in the case of nuisance alarms, or they can change the state of the device if an alarm has occurred.

Alarm systems should be designed to help operators regulate processes and manage anomalies. Several guides, such as those of ISA and EEMUA, have been written for the design, implementation and maintenance of alarm systems [1, 4]. Alarms must be used to ensure the safety of systems and processes. The actions of the operators depend on the severity of the faults or anomalies announced by the alarms. Alerts can be visual or audible.

Monitoring systems are essential to ensure the reliability of the operation of industrial systems [12, 13]. Hybrid methods are expected to provide more robust models for modern and complex installations. Research aims to reduce the large production losses and high repair costs caused by inadequate alarm systems [14,15,16].

Many problems are involved in alarm systems. Izadi et al. showed the most common causes [17]: improperly designed alarms, miscalibrated equipment, oscillations in general, changes in status during switching off or on that are not taken into account, and noise and/or outliers that are not considered.

The objective of this chapter is to illustrate the methods and techniques used in several sectors to implement an optimal alarm system. The aim is to obtain higher quality, higher performance, lower production costs, fewer breakdowns and safer processes.

2 Confusion Matrix

An optimal alarm system provides the necessary tools for operators to detect faults and take corrective action to return the process to its normal condition. In practice, the alarm system may be faulty or poorly calibrated and therefore give incorrect results. A missed alarm occurs when the value of the variable deviates, e.g. exceeds the threshold, but the system does not detect it. The opposite case is a false alarm, also called a false positive, in which the system generates an alarm although no fault has actually occurred.

Signals can give these two types of errors due to threshold selection. If the threshold is set very tight to reduce the probability of a missed alarm, the system becomes more sensitive to random noise and transient deviations, which leads to more false alarms. On the other hand, if the threshold is raised, the number of false alarms decreases at the cost of more missed alarms. The selection of the threshold is therefore decisive for system reliability. In the majority of cases, missed alarms are considered more important than false alarms because their consequences may be greater. A basic tool to visualize false alarms in contrast to missed alarms is the confusion matrix.
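The trade-off can be illustrated numerically with a minimal sketch. It assumes a high alarm (raised when the value exceeds the threshold) and Gaussian distributions for the variable in the normal and abnormal states; the means and standard deviations below are made-up values, not data from any cited study.

```python
from statistics import NormalDist

def far_mar(threshold, normal, abnormal):
    """False and missed alarm probabilities for a high alarm at `threshold`."""
    far = 1.0 - normal.cdf(threshold)  # normal operation, yet above the threshold
    mar = abnormal.cdf(threshold)      # abnormal operation, yet below the threshold
    return far, mar

# Assumed distributions of the process variable in each state.
normal = NormalDist(mu=0.0, sigma=1.0)
abnormal = NormalDist(mu=3.0, sigma=1.0)

for threshold in (0.5, 1.5, 2.5):
    far, mar = far_mar(threshold, normal, abnormal)
    print(f"threshold={threshold:.1f}  FAR={far:.4f}  MAR={mar:.4f}")
```

Raising the threshold lowers the false alarm probability and raises the missed alarm probability, which is the compromise described in the paragraph above.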

A confusion matrix, also called a contingency table, is an evaluation tool for categorical statistical data [18]. The table determines whether the value supplied by the alarm system matches the actual value. The rows of the matrix are the responses of the alarm system and the columns are the actual values. There are 4 possible cases. If an alarm condition has occurred and the system indicates an alarm, it is a true positive (TP). If no alarm condition has occurred but the system indicates one, it is a false positive (FP). If an alarm condition has occurred and the system does not identify it, it is a false negative (FN), or missed alarm. Finally, if no alarm condition occurs and the system indicates none, it is a true negative (TN). The main diagonal shows the cases in which the system has acted correctly, whereas the other diagonal shows the cases in which an error has occurred (Table 2).

Table 2 Confusion matrix

The following rates are obtained from the confusion matrix.

$$ FP\,rate = \frac{FP}{N} = \frac{negatives\,incorrectly\,classified}{Total\,negatives} $$
$$ Precision = \frac{TP}{TP + FP} $$
$$ Accuracy = \frac{TP + TN}{all\,cases} $$
$$ Sensitivity = \frac{TP}{TP + FN} $$
$$ Specificity = \frac{TN}{FP + TN} $$
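As an illustration, the rates above can be computed directly from the four counts of the confusion matrix. The sketch below uses made-up counts and a hypothetical function name; it does not guard against empty classes (zero denominators).

```python
def alarm_rates(tp, fp, fn, tn):
    """Rates derived from the confusion matrix counts (no zero-division guard)."""
    return {
        "fp_rate": fp / (fp + tn),               # FP / total negatives
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
    }

# Made-up example: 1000 samples, 45 of them with a real alarm condition.
rates = alarm_rates(tp=40, fp=10, fn=5, tn=945)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Note that the FP rate and the specificity always add up to 1, since both are computed over the total number of negatives.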

3 Process and Alarm Data

The methods for alarm detection are used to improve the efficiency of the global process. However, a large amount of data must be provided to use these techniques.

Two types of data are fundamental to the management of the alarm system:

  • Process data: measurements of process variables taken at regular intervals and stored in a database. They provide information for the identification of the optimal alarm system.

  • Alarm data: messages generated by the distributed control system (DCS) and stored in an alarm log.

These data are important to analyse because they can help to identify the causes of the current alarm system overload. It is also important to test academically developed methods and techniques against industrial data from a real environment. For instance, Wang et al. explored the main factors behind this problem and concluded that [19]: chattering alarms frequently occur due to noise or disturbances, alarm variables are incorrectly configured, alarm design is isolated from related variables, and abnormalities propagate through physical connections.

System performance should be evaluated throughout the alarm management lifecycle, including at run time. Kondaveeti et al. [20] offer a tutorial on chattering alarms, which arise from poor design or incorrect configuration of the alarm system and are difficult to identify. A chatter index is proposed to reduce the effort needed to identify and quantify chattering alarms. In reference [21], a quantitative measure is proposed to estimate the degree of chattering. The method for evaluating the chatter index is based on alarm parameters and statistical properties of the process variable. Process data are divided according to approximate distribution characteristics, and each distribution is estimated separately. The distribution of process data is obtained by adding all run length distributions together. A mathematical function developed by analytical methods is intended to reduce chattering alarms.

Hu et al. proposed a framework that combines causality inference from process data with alarm data and thus helps the operator to reduce alarm floods [22]. Alarm data can be used to identify root alarm tags, which reduces the number of alarms that require attention. Root cause and effect analysis can be used to detect root cause alarms. The causal relationships can be detected by extracting the process variables associated with the root alarms. Finally, the root cause can be confirmed using the causal map of the process variables and some knowledge of the process. The method reduces the number of alarms, since only root alarm tags are announced, and operators can identify the root cause quickly because the causal relationship is detected. The method was applied to industrial process and alarm data, and the results showed good performance.

4 Alarm Flood

According to the International Electrotechnical Commission (IEC, 2014) [23], alarm flooding occurs when alarms appear on the control panels at a faster rate than the operator can manage them. This makes it necessary to determine the root cause of the alarms in order to control the system optimally.

An alarm flood is usually triggered by a primary event and its consequential events [24]. Root cause alarms should be distinguished from consequential alarms to reduce the number of alarms. The alarm data make it possible to list the primary alarms and the process variables related to them. In addition, the causal relationships between the alarms are obtained from the alarm data. Subsequently, the process data help to support or refine the root cause analysis.

Historical alarm data enable a new analysis method to eliminate alarm flooding [25]. These data are grouped according to a base of alarm occurrences. Alarm floods have similar patterns; if these patterns are analysed and classified, the method can lead to the root cause of an anomaly. The operator will therefore have fewer false alarms and will be able to react better to alarm floods. Hu et al. applied a fast sequence alignment method to speed up the calculation and improve the computational efficiency of the algorithms [26]. The method is intended to be more sensitive to higher priority alarms, and it tends to ignore alarms that occur simultaneously in order to avoid alarm floods. A set-based comparison avoids unnecessary calculations for irrelevant alarm tags. The results obtained in industrial cases show that the method is faster than existing algorithms, so operators have more time to take the correct action and resolve the failure.

An alarm that performs repeated transitions between the normal and the abnormal state is called a chattering alarm. This is mainly due to signal noise and to the variable operating near the alarm limit. Chattering alarms cause many false alarms. One proposal is to redesign the control system so that these alarms are eliminated by grouping: consecutive alarms that fall within a narrow time window form a cluster and become a single alarm, and only one alarm message is sent to the operator per cluster when the alarm appears. This is a simple method to reduce alarm flooding.
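A minimal sketch of this grouping idea is given below for the occurrences of a single alarm tag. It assumes that an occurrence joins the current cluster when it follows the previous occurrence by at most `window_s` seconds; this gap-based rule and the 10 s window are assumptions chosen for illustration, not values taken from the cited work.

```python
def group_alarms(times, window_s):
    """Cluster the occurrence times of one alarm tag by the gap between occurrences."""
    clusters = []
    for t in sorted(times):
        if clusters and t - clusters[-1][-1] <= window_s:
            clusters[-1].append(t)  # close to the previous occurrence: same cluster
        else:
            clusters.append([t])    # otherwise a new cluster starts
    return clusters

# Hypothetical occurrence times (seconds) of one chattering alarm.
occurrences = [0, 5, 9, 100, 103, 500]
clusters = group_alarms(occurrences, window_s=10)

# Only the first occurrence of each cluster is announced to the operator.
messages = [cluster[0] for cluster in clusters]
print(clusters)
print(messages)
```

In this example six occurrences are reduced to three operator messages.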

Rodrigo et al. [27] build on the previous line of work. They claim that by combining the alarm log, process data analysis and connectivity analysis, alarms can be grouped together and their root alarm identified. Figure 1 shows the workflow to reduce alarm floods.

Fig. 1 Workflow for alarm systems

The first step is to remove chattering alarms. According to reference [25], the minimum permissible interval between two occurrences of the same alarm should be 10 min; if the elapsed time is shorter, the second alarm is eliminated.

In the next step, the alarm log is divided into intervals of 10 min and an alarm threshold is set, which must be more than 10 alarms per time interval and per operator. Consecutive intervals with more alarm occurrences than the defined threshold are merged.
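This step can be sketched as follows, assuming the alarm log of one operator is reduced to a list of alarm timestamps in seconds. The function name and the example timestamps are hypothetical; the 10 min interval and the threshold of 10 alarms follow the text.

```python
from collections import Counter

def find_flood_periods(timestamps, interval_s=600, threshold=10):
    """Return merged (start, end) periods whose 10 min intervals exceed the threshold."""
    counts = Counter(int(t // interval_s) for t in timestamps)
    flagged = sorted(i for i, n in counts.items() if n > threshold)

    periods = []  # each entry is [first_interval_index, last_interval_index]
    for i in flagged:
        if periods and i == periods[-1][1] + 1:
            periods[-1][1] = i      # consecutive flagged interval: extend the period
        else:
            periods.append([i, i])  # start a new flood period
    return [(first * interval_s, (last + 1) * interval_s) for first, last in periods]

# Hypothetical log: 11 alarms in interval 0, 12 in interval 1, 3 in interval 2
# and 11 in interval 5.
timestamps = ([i * 50 for i in range(11)]
              + [600 + i * 40 for i in range(12)]
              + [1200, 1300, 1400]
              + [3000 + i * 10 for i in range(11)])
flood_periods = find_flood_periods(timestamps)
print(flood_periods)
```

The first two intervals are merged into a single flood period, while the interval with only 3 alarms is not flagged.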

Using sequence pattern matching, the alarm flood sequences are grouped together. In this case, the method described in reference [28] is based on a modified Smith-Waterman (MSW) algorithm, although other algorithms can be applied, such as agglomerative hierarchical clustering (AHC).
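To give an idea of how sequence alignment scores the similarity of two alarm floods, the sketch below implements the classic, unmodified Smith-Waterman local alignment score on sequences of alarm tags. It is not the modified algorithm of reference [28]: the scoring values (match, mismatch, gap) and the tag names are assumptions chosen for illustration, and time stamps are ignored.

```python
def smith_waterman_score(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between two sequences of alarm tags."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    h = [[0] * cols for _ in range(rows)]  # dynamic programming table
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            h[i][j] = max(0,                    # start a new local alignment
                          h[i - 1][j - 1] + step,  # align the two tags
                          h[i - 1][j] + gap,       # gap in seq_b
                          h[i][j - 1] + gap)       # gap in seq_a
            best = max(best, h[i][j])
    return best

# Hypothetical alarm flood sequences (tags in order of occurrence).
flood_1 = ["PT101.HI", "FT205.LO", "TT310.HI", "LT400.HI"]
flood_2 = ["XX900.HI", "FT205.LO", "TT310.HI", "LT400.HI", "YY901.LO"]
score = smith_waterman_score(flood_1, flood_2)
print(score)
```

A higher score indicates a longer common pattern, so pairwise scores of this kind can serve as the similarity measure when alarm flood sequences are grouped.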

In the fourth step, the alarm flood sequences are grouped and a set of templates is created to cancel out the anomalies of all the clusters in the process.

The last step is perhaps the most complicated. It should be noted that the causal alarm is not necessarily the first alarm, because the moment at which an alarm is triggered depends on its alarm limit settings, and the time elapsed between the anomaly occurring and the alarm being triggered is probabilistic. Algorithms are then applied to determine the root cause of the alarm. Many papers apply different algorithms [29, 30]; the best algorithm depends on the case study.

In summary, an alarm log, historical process data and connectivity analysis are used to reduce alarm floods by grouping the different alarms and determining the causal alarm.

There is no single solution to improve the alarm system; instead, there are different workflows with various processes. For instance, in signal-based methods the process variables are monitored and compared with thresholds (called alarm limits). These are currently the most widely used techniques in industry and are implemented in many modern distributed control systems (DCS).

There are also many classifications of alarm system techniques; some of the techniques applied are threshold design, data processing, multivariate process monitoring, model-based process monitoring and state-based priority setting [31,32,33].

Another classification of alarm systems depends on their design, which can be univariate or multivariate. The univariate design comprises the alarm threshold, deadband, delay-timer and filtering (see Fig. 2), each designed individually for each variable. In the multivariate design, alarms are derived from a linear combination of several process variables.

Fig. 2 Univariate alarming methods diagram
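Of the univariate techniques listed above, the deadband is perhaps the simplest to illustrate. The sketch below assumes a high alarm that is raised when the value exceeds the limit and cleared only when the value falls below the limit minus the deadband; the signal values, limit and deadband width are made up.

```python
def alarm_with_deadband(values, limit, deadband):
    """Alarm state per sample for a high alarm with a deadband below the limit."""
    states = []
    active = False
    for x in values:
        if not active and x > limit:
            active = True    # raise when the limit is exceeded
        elif active and x < limit - deadband:
            active = False   # clear only below (limit - deadband)
        states.append(active)
    return states

def count_activations(states):
    """Number of transitions from the normal state to the alarm state."""
    return sum(1 for prev, cur in zip([False] + states, states) if cur and not prev)

# Made-up signal hovering around a limit of 10.
signal = [9.8, 10.2, 9.9, 10.1, 9.7, 9.3, 10.4]
without_deadband = count_activations(alarm_with_deadband(signal, 10.0, 0.0))
with_deadband = count_activations(alarm_with_deadband(signal, 10.0, 0.5))
print(without_deadband, with_deadband)
```

With the deadband, small excursions just below the limit no longer clear the alarm, so the same signal produces fewer activations, at the price of the alarm staying active longer.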

Alarm flooding is difficult to suppress with delay timers or deadbands because of consequential alarms. Lai and Chen present an extended algorithm for the optimal alignment of multiple alarm flood sequences to obtain their common pattern [28, 34]. The technique requires similarity scoring functions, a dynamic programming equation, and tracking and alignment generation. They propose to develop new algorithms that match online alarm messages against a database of patterns to alert operators in the event of an alarm flood.

A data-driven method [35] based on historical alarm data is also employed to detect frequent patterns in alarm floods. The results showed that the method is effective in finding patterns and reducing pattern redundancy. A holistic view of alarms is also employed for an intuitive understanding of alarm patterns.

Alarm flood sequence alignment (AFSA) methods provide fault inference by assessing the similarity of alarm sequences. Guo et al. proposed a new AFSA method, match-based accelerated alignment (MAA), which analyses alarm matches [36]. It is important because its alignment results reveal to a large extent the real similarity of alarm floods.

The alarm flood is a problem for the alarm system. There are several methods and techniques to avoid it; the main ones are discussed in the following sections.

5 Long Standing Alarm

Long-standing alarms have several definitions. For example, ISA-18.2 defines a long-standing alarm as “an alarm that remains in the alarm state for an extended period of time (e.g. 24 h)” [37]. According to EEMUA (2013), an alarm that remains active for a complete operating shift is considered a long-standing alarm [38]. In general, a long-standing alarm, as its name suggests, has a long duration, but authors do not agree on the threshold for this time. In this chapter, three main causes of these alarms are indicated:

  • With modern computerized monitoring systems, alarms are easily created by entering trigger point values; they are often implemented without special care, which generates many misconfigured alarms.

  • Operating states such as start-up, average rate, etc., which have different demands, are often not taken into account; conditions are then flagged as alarms when in fact they are not, e.g. when the equipment is switched off.

  • Process variables vary across different states, but the alarm trigger points are constant. It would be useful to compare the alarm variables with the measurements of the process variables and thus generate new alarm thresholds.

6 Graphical Methods

Alarm data visualization tools are employed to detect nuisance alarms [39, 40], e.g. the High Density Alarm Plot (HDAP) and the Alarm Similarity Color Map (ASCM). These graphical tools have proven useful in identifying chattering alarms.

HDAP presents the most frequent alarms over a given period. A sample size of 10 min is recommended, in line with the acceptable annunciation rate according to EEMUA. The tool uses colour for emphasis; for example, red indicates unacceptable chatter behaviour [41].

ASCM highlights related and redundant alarms. The tool shows the alarms reordered according to their similarity and time of occurrence. The result depends on the time span analysed, the number of top alarms, the linkage type used to build the clusters and the leaf ordering method. The tool displays the data in a colour-coded matrix, which allows groups of related alarms to be identified and provides information on the interactions of the process.

Graphical representations provide valuable feedback to improve the alarm system and thus reduce false alarms. For example, Yang et al. used a pseudo data correlation map because [42]: (1) it is robust to false, missed and chattering alarms; (2) it indicates whether the correlation is positive or negative, and the similarity; (3) the pseudo data can be used in other statistical analyses to cross-check the results obtained. The method consists of the following phases:

  (a) The Gaussian kernel method is applied to the binary alarm data to generate continuous pseudo time series.

  (b) A correlation colour map of the pseudo data, or transformed data, is used to show the set of correlated variables.

  (c) Statistical methods are applied to find redundant alarm tags, or to group correlated alarms.
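The first two phases can be sketched as follows. Each binary alarm sequence is smoothed with a Gaussian kernel to obtain a continuous pseudo time series, and the Pearson correlation between two pseudo series is then computed. The kernel width `sigma` and the example sequences are assumptions made for illustration and are not taken from reference [42].

```python
import math

def gaussian_pseudo_series(binary, sigma=2.0):
    """Smooth a 0/1 alarm sequence with a Gaussian kernel of width `sigma` samples."""
    n = len(binary)
    return [sum(b * math.exp(-0.5 * ((t - k) / sigma) ** 2)
                for k, b in enumerate(binary))
            for t in range(n)]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long, non-constant series."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Two hypothetical alarms: B is always raised one sample after A.
alarm_a = [1 if t in (5, 20) else 0 for t in range(30)]
alarm_b = [1 if t in (6, 21) else 0 for t in range(30)]

raw_corr = pearson(alarm_a, alarm_b)
pseudo_corr = pearson(gaussian_pseudo_series(alarm_a), gaussian_pseudo_series(alarm_b))
print(f"raw: {raw_corr:.3f}  pseudo: {pseudo_corr:.3f}")
```

Because the two alarms never occur at exactly the same sample, the raw binary data show almost no correlation, whereas the smoothed pseudo series reveal that the alarms are closely related. Computing this coefficient for every pair of alarms would give the matrix displayed in the correlation colour map.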

There are several difficulties in applying this method, such as parameter tuning: the map is sensitive to its parameters, i.e. some degrees of freedom must be tuned to optimize the display. However, it has been shown that this method performs better than the alarm similarity colour map as long as the parameters are set properly.

7 Univariate Alarming Methods

Univariate alarming methods are the most commonly used methods in alarm systems [43]. They are used because the information they show about a single signal is simple and clear, so operators can make decisions easily. However, more complex alarms require other techniques such as multi-setpoint settings, moving windows, neural network methods, etc. [44].

The most important univariate alarming methods are shown in Fig. 2.

7.1 Alarm Filtering

Filters are widely used in practice because they serve different purposes, for example: eliminating erroneous or undesirable data, reducing noise, extracting data characteristics, modifying the statistical distribution of data, and grouping data according to their frequency. The most popular filters are the moving average, the exponentially weighted moving average (EWMA) and the cumulative sum. Izadi et al. presented filters used to improve the receiver operating characteristic (ROC) curve [45].
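Two of the filters mentioned above can be sketched in a few lines. The window length, the smoothing factor `alpha` and the example signal are assumed values; the moving average below uses a shorter window for the first samples, which is one of several possible start-up conventions.

```python
def moving_average(values, window):
    """Moving average over the last `window` samples (fewer at the start)."""
    out = []
    running = 0.0
    for i, x in enumerate(values):
        running += x
        if i >= window:
            running -= values[i - window]  # drop the sample that left the window
        out.append(running / min(i + 1, window))
    return out

def ewma(values, alpha):
    """Exponentially weighted moving average; assumes a non-empty input."""
    out = []
    level = values[0]
    for x in values:
        level = alpha * x + (1.0 - alpha) * level
        out.append(level)
    return out

# Made-up signal with a single noise spike above an alarm limit of 10.
signal = [9, 9, 11, 9, 9]
limit = 10
raw_alarm = any(x > limit for x in signal)
filtered_alarm = any(x > limit for x in moving_average(signal, 3))
print(raw_alarm, filtered_alarm)
print(ewma([0, 10, 10], 0.5))
```

The spike triggers an alarm on the raw signal but not on the filtered one. The same smoothing also delays the response to a genuine step change, as the EWMA output for the step input shows.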

Filtering techniques for alarm systems have some disadvantages, which can be measured by the false alarm rate (FAR), missed alarm rate (MAR) and expected detection delay (EDD). Tan et al. [46] worked with rank order filters to avoid these disadvantages. They obtained two results when the probability density function (PDF) of the raw data is known: the performance curves of these filters can be calculated directly, and the EDD can be estimated, which is impossible for general filters. The experimental results showed that the order of the filter offers one degree of freedom for the system design, and the window size offers another. These results are limited to univariate alarms; therefore, it is recommended to work with multivariate systems.

Accuracy is given by the false alarm rate, and efficiency is related to the detection delay and the complexity of the methodology used [47]. Cheng et al. used a method to create an optimal filter design with the aim of improving performance [48]. In this case, the optimal performance curves show that the moving average filter outperforms other linear filters. The authors propose as future work to study the performance of the generalized medium filter to obtain a robust optimal filter design method. Izadi et al. consider filtering, alarm delay and deadband to be simple techniques that can reduce nuisance alarms and the FAR [45].

7.2 Alarm Delay-Timer

Filters apply a transformation to continuous functions, whereas alarm delay timers apply a transformation to discrete functions. Timers are used for their simplicity and efficiency. They can reduce the FAR and MAR, but their disadvantage is a delayed response.

The main elements of univariate alarm design are the set point, the dynamic order and the alarm algorithm. Su et al. proposed an alarm method with multiple setpoint delay timers [43]. It achieves a balance between the accuracy and sensitivity of the alarm system by providing direct transitions from each delay timer sub-state to the alarm state. The FAR, MAR and averaged alarm delay (AAD) are reduced by this methodology. Xu et al. studied the efficiency of a univariate system using the FAR, MAR and AAD, with emphasis on the calculation of these rates [49]. The proposed method was applied to an industrial case, and it was concluded that it can be used for power and petrochemical plants. Zang et al. employed an improved delay timer method, in which the univariate alarm was configured with multiple commands and set points [50]. In contrast to conventional alarm timers, these timers had an alarm announcement set point and an alarm end set point. Enhanced alarm timers have more design parameters but offer better performance according to the Markov chain analysis. Markov chains are simple mathematical models generally employed for random phenomena. They apply to systems in which the state at observation n + 1 depends only on the state at observation n, i.e. changes in the system depend on the current state and not on the way it was reached. Adnan et al. showed that delay timers provide flexibility in the design of alarms [51]. The use of delay timers is common practice in industry, as it is a simple technique to reduce the FAR and MAR, at the cost of a longer EDD.
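A conventional delay timer can be sketched as a small state machine. In the version assumed here, the alarm is raised only after `on_delay` consecutive samples above the limit and cleared only after `off_delay` consecutive samples below it. This is a generic illustration with made-up values, not the multi-setpoint designs of references [43] or [50].

```python
def delay_timer_alarm(values, limit, on_delay, off_delay):
    """Alarm state per sample for a high alarm with on-delay and off-delay timers."""
    states = []
    active = False
    run = 0  # consecutive samples that contradict the current alarm state
    for x in values:
        above = x > limit
        if above != active:
            run += 1
            needed = off_delay if active else on_delay
            if run >= needed:
                active = above  # enough consecutive evidence: switch state
                run = 0
        else:
            run = 0             # the sample agrees with the state: reset the timer
        states.append(active)
    return states

# Made-up signal with isolated excursions around a limit of 10.
signal = [11, 9, 11, 11, 11, 9, 11, 9, 9, 9]
states = delay_timer_alarm(signal, limit=10, on_delay=3, off_delay=3)
print(states)
```

The isolated excursion at the first sample does not raise the alarm, and the isolated dip in the middle does not clear it, but the alarm is announced two samples after the sustained excursion begins. Since the next state depends only on the current state and the new sample, the timer sub-states lend themselves to the Markov chain analysis mentioned above.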

Noise is one of the causes of chattering alarms. If a signal is well defined by its period and amplitude but contains noise large enough to cross the trigger point many times, a chattering alarm occurs. Wang and Chen proposed an online method to detect and reduce chattering alarms due to oscillation [52]. The presence of oscillation can be determined through a revised chattering index and a method based on the discrete cosine transform. An adjusted alarm setting or a delay timer is then used to reduce the alarms. Wang and Chen [53] proposed one rule for detecting chattering alarms caused by random noise and another for repeating alarms, based on the duration and interval of alarms and on regular patterns. The method works online and uses an m-sample delay timer to eliminate chattering and repeating alarms. The effectiveness of the method was tested on 3 industrial examples in terms of the FAR, MAR and AAD (Fig. 3).

Fig. 3 Flowchart for the design and use of delay-timers

8 Multivariate Alarming Methods

Some methods set the alarm limits by studying the correlations between the process data and the alarm data [54]. Multivariate statistical process control (MSPC) is a methodology applied for monitoring in many manufacturing processes [55]. It basically consists of three steps:

  (1) While the process is under normal operating conditions, historical data are collected and stored in the database, and a statistical model is developed.

  (2) The control limits are fixed for the statistical model.

  (3) If the online data exceed the control limits, the event is classified as a process failure.
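The three steps can be sketched for two correlated process variables using Hotelling's T² statistic, which is one common choice of MSPC statistic and is assumed here for illustration. The historical data are simulated, the control limit uses the chi-square approximation with 2 degrees of freedom at the 99% level (valid for approximately Gaussian data and a large sample), and all names are hypothetical.

```python
import math
import random

def fit_normal_model(samples):
    """Step 1: mean and covariance of two variables under normal operation."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in samples) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in samples) / (n - 1)
    return mean_x, mean_y, var_x, var_y, cov_xy

def t_squared(point, model):
    """Hotelling's T^2 of a point, using the closed-form inverse of a 2x2 covariance."""
    mean_x, mean_y, var_x, var_y, cov_xy = model
    dx, dy = point[0] - mean_x, point[1] - mean_y
    det = var_x * var_y - cov_xy ** 2
    return (var_y * dx * dx - 2.0 * cov_xy * dx * dy + var_x * dy * dy) / det

# Simulated historical data: y closely follows x during normal operation.
rng = random.Random(0)
history = []
for _ in range(500):
    x = rng.gauss(0.0, 1.0)
    history.append((x, x + 0.3 * rng.gauss(0.0, 1.0)))
model = fit_normal_model(history)

# Step 2: 99% control limit; the chi-square quantile with 2 degrees of
# freedom has the closed form -2 * ln(1 - 0.99).
control_limit = -2.0 * math.log(0.01)

def out_of_control(point):
    """Step 3: compare the statistic of an online sample with the control limit."""
    return t_squared(point, model) > control_limit

print(out_of_control((2.0, 2.0)), out_of_control((2.0, -2.0)))
```

The point (2, −2) is flagged even though each variable on its own stays within roughly two standard deviations, because it violates the correlation learned from the historical data. This is the kind of fault that univariate limits cannot detect.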

Historical process data are subjected to multivariate statistical techniques to determine the control limits of the statistics of the study variables; if the actual values exceed a control limit, the point is classified as “out of control”. This constitutes fault detection; the next step is to identify the root cause of the process fault [56].

False alarms can have different causes; deficiencies of the alarm system and random effects are two of the main ones. System deficiency may be due to the difference between the statistical model and the real process. Random effects may also cause false alarms: fluctuations in the monitored process can cause the actual variables to deviate from their nominal values, so false alarms can occur even though the process is working correctly. Many authors have used statistical approaches to avoid randomly induced false alarms [57,58,59,60], e.g. Bernoulli and binomial distributions, the conventional method based on principal component analysis (PCA) [61], etc. However, the real process variables tend to be autocorrelated; therefore, time series modelling approaches are needed.

One of the main methods of multivariate analysis is the correlation method. In many processes, one variable can be affected by one or several other variables, i.e. different alarm thresholds generate different alarm data and hence different correlations. To optimize these multivariate alarm thresholds, numerous statistical methodologies and algorithms have been applied to reveal interactions between variables and determine the correlated key variables. They are grouped as:

  • Grouping Variables

  • Correlation Methods

  • Advanced Methods

  • Intrusion Detection System (IDS).

9 Conclusions

An optimal alarm system should inform and guide; each alarm should have a defined response and allow the operator adequate time to respond to it. Alarms must be relevant, unique, prioritized and understandable. The alarm system must identify the alarm, classify it, set priorities and finally alert the operator if necessary, visually or audibly.

From the study of false alarms, it is concluded that three of the most important reasons for their existence are: (1) the process undergoes state changes such as switching on and off, which are flagged as abnormalities and propagate through physical connections; (2) the alarms are poorly configured and have redundant measurements; and (3) there are causal relationships between the variables studied, while alarm design is isolated from related variables.

There are many classifications of alarm systems, depending on how they treat the information, the type of study variable, the algorithms applied, etc.

Many types of alarm systems are used in industry; however, false, nuisance or chattering alarms have not yet been completely eliminated. Although many resources are devoted to this problem, an optimal solution has not yet been achieved. Thanks to the development of new technologies and the increase in data processing capacity, it will be possible to improve these methods by means of dynamic systems in which the historical data provide feedback capable of handling the process correctly.