
1 Introduction and Background

The emerging Industry 4.0 technologies provide reliable and accessible smart systems that enable the spread of predictive maintenance (PdM) practices [1]. The importance of PdM is demonstrated by its ability to extend the useful life of machines and to decrease maintenance costs [2]. Developing a reliable PdM system requires the collection and analysis of large amounts of data covering relevant time frames [3], as well as the definition of an indicator of the system's health [4]. An appropriate Health Indicator (HI) allows the deviation from the regular operating performance of the system to be understood in terms of its Remaining Useful Life (RUL). Defining an HI increases the knowledge of the system by focusing the analysis on the most relevant information sources. In this sense, a precise HI makes it possible to predict the RUL of a system confidently [5] and is thus the main focus of much research (e.g., [6, 7]).

Recent contributions have developed HIs using several different techniques. For instance, some works developed multi-objective models to derive the health of the system for fault diagnostics and prognostics [8], while others implemented artificial neural networks with K-means clustering [9] or genetic algorithms [10]. The application areas are also heterogeneous: some works focus on semiconductor equipment [11], others on the vibration analysis of wind turbines [12]. HIs can indeed be applied to define the remaining useful life of mechanical machinery, as shown by several works (e.g., [13, 14]).

Given these premises, this work proposes an approach to model the health indicator of a sub-plant of an oil refinery and to identify the component causing the performance loss. Predictive maintenance interventions on specific components are performed to avoid system stoppages, and they are prioritized through Association Rule Mining (ARM). Indeed, the Association Rules (ARs) among component failures preceding a plant stoppage are used as a guide to determine, with a certain probability level, which components caused the HI worsening. In this way, the ARs help identify relationships within a dataset that are not immediately identifiable [15]. In recent literature, ARM has been applied to different fields, ranging from behavior description [16] to sub-assembly production scheduling [17]. From a predictive maintenance perspective, ARM has already been applied to detect relationships among component failures [18]. Despite the valuable implementations proposed, among others, by the named authors, there is a lack of research on the joint application of HI definition and ARM.

2 Methodology

In the following, the proposed procedure to define the Health Indicator is described. The procedure is general, so that it can be applied to various equipment as long as sensor readings are available. The input dataset contains the readings of one or more system sensors; for simplicity of explanation, a single sensor is considered. Seven fundamental steps are performed to model the HI:

  1.

    Standardization of the signals (Ss) from system sensors and partitioning of the initial dataset into training and testing sets.

  2.

    Modelling of the Health Indicator (HI): the time between two stoppages represents the life duration of the system. The HI is built as a degradation profile in which, at the beginning of the life, the reliability of the system is equal to 1 (hence, maximum) while, at the moment of the stoppage, it is minimum (hence, equal to 0). The behavior of the HI is described by Eqs. (1)–(3), where \(DUR_{i,m}\) is the time between two stoppages of category m for the i-th machine, \(TI^{\prime}_{i,m}\) indicates the remaining hours before the stoppage, and \(TI^{\prime\prime}_{i,m}\) is its normalized value (the machine index i is dropped in the equations for brevity).

$$TI^{\prime}_{m} = \left[ \begin{array}{cccc} DUR_{m} - 1 & DUR_{m} - 2 & \ldots & 0 \end{array} \right]$$
(1)
$$TI^{\prime\prime}_{m}(t) = \frac{TI^{\prime}_{m}(t)}{DUR_{m}}$$
(2)
$$HI_{m}(t) = TI^{\prime\prime}_{m}(t) + \left( 1 - TI^{\prime\prime}_{m}(t = 1) \right)$$
(3)

Equation 3 is such that the first value of \(HI_{i,m}\) is equal to 1. In particular, \(HI_{i,m}\) represents an initial HI for the considered machine and stoppage, as sketched below.
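As a minimal sketch of Eqs. (1)–(3), the following fragment (in Python, used here purely for illustration) builds the initial HI of a single machine and stoppage category; the 72-hour duration is a hypothetical value.

```python
import numpy as np

def initial_hi(dur_m):
    """Build the initial Health Indicator of Eqs. (1)-(3).

    dur_m is the number of hours between two stoppages of category m.
    """
    ti_prime = np.arange(dur_m - 1, -1, -1)  # Eq. (1): [DUR-1, DUR-2, ..., 0]
    ti_second = ti_prime / dur_m             # Eq. (2): normalization
    hi = ti_second + (1 - ti_second[0])      # Eq. (3): shift so that HI(t=1) = 1
    return hi

hi = initial_hi(72)    # hypothetical 72-hour run between two stoppages
print(hi[0], hi[-1])   # 1.0 at the start, close to 0 at the stoppage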

Once the HIs have been calculated for each machine and stoppage, the HIs and the Ss can be correlated through a linear interpolation, in order to find the transformation coefficient (Eq. 4) able to translate the information from the measurement space to the HI space.

$$HI_{total} = b \cdot S_{s,total}$$
(4)

\(HI_{total}\) represents the array composed of all the determined HIs (\(HI_{total}=[HI_{1,m}\; HI_{2,m} \dots HI_{n,m}]\)) and, in the same way, \(S_{s,total}\) is the array composed of all the standardized signals (\(S_{s,total}=[S_{s,1}\; S_{s,2} \dots S_{s,n}]\)). The parameter b takes the form of an array if readings from more than one sensor are available. In the present paper, it is a scalar, since readings from a single sensor are available.
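A least-squares fit is one way to obtain the coefficient of Eq. (4). The sketch below uses synthetic arrays; with multiple sensors, \(S_{s,total}\) would become a matrix and b an array.

```python
import numpy as np

# Hypothetical standardized signal and HI arrays, concatenated over all
# machines and stoppages as in Eq. (4); values are synthetic.
rng = np.random.default_rng(0)
s_total = rng.standard_normal(500)
hi_total = 0.8 * s_total + 0.05 * rng.standard_normal(500)

# Least-squares estimate of the scalar b in HI_total = b * S_s,total.
b, *_ = np.linalg.lstsq(s_total.reshape(-1, 1), hi_total, rcond=None)
print(f"estimated b = {float(b[0]):.3f}")
```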

Once the transformation coefficient b has been identified, the transformed HI* is calculated for each machine and stoppage according to Eq. 5:

$$HI_{i,m}^{*} = b \cdot S_{s,i}$$
(5)
  1. 3.

    Once all the transformed \(HI^{*}_{i,m}\) have been calculated, a non-linear interpolation is performed to correlate the \(HI^{*}_{i,m}\) with time (in the form of Eq. 6).

$${HI}_{i,m}^{*}={f}_{i,m}\left({t}_{s,i}\right)$$
(6)

In particular, the function f() can take the same form for all the stoppage categories, or different forms to define a different profile for each category. At this point, the function f() is stored to be used in the KNN algorithm. Thus, the system training is completed.
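A minimal sketch of the fit of Eq. (6) follows; since the text leaves the functional form of f() open, the exponential profile below is only an illustrative assumption, and the HI* samples are synthetic.

```python
import numpy as np
from scipy.optimize import curve_fit

def f(t, a, c):
    # Assumed exponential degradation profile; the paper leaves the
    # form of f() open, so this choice is illustrative only.
    return a * np.exp(-c * t)

# Hypothetical transformed HI* samples for one machine and stoppage category.
rng = np.random.default_rng(0)
t = np.linspace(0, 72, 73)
hi_star = np.exp(-0.03 * t) + 0.02 * rng.standard_normal(t.size)

params, _ = curve_fit(f, t, hi_star, p0=(1.0, 0.01))
print("fitted (a, c):", params)  # f() is stored and reused by the KNN step
```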

  4.

    During the testing phase, the testing signal is transformed as well, using the coefficient b determined at step 2. The duration of the standardized testing signals (\(S_{s(test)}\)) is also assessed (\(t^{*}_{s(test)}\)) and the functions \(f(t^{*}_{s(test)})\) are evaluated for all the \(S_{s(test)}\).

  5.

    \(HI^{*}_{j,m,test} = f(t^{*}_{s(j,test)})\) and \(HI^{*}_{i,m} = f(t^{*}_{s})\) are compared through the K-nearest neighbours (KNN) algorithm in order to identify the closest similarity profile. KNN is a machine learning technique applicable to regression and classification problems; the main idea at its basis is that the closer an instance is to a data point, the more similar the two are considered by the KNN [19]. The reference functions \(HI^{*}_{i,m} = f_{i,m}(t_{s,i})\) are taken as models to be compared with the newly defined ones, \(HI^{*}_{test} = f(t^{*}_{s(test)})\). The distances \(d_{ij}\) (e.g., Euclidean distances) between \(HI^{*}_{j,m,test}\) and \(HI^{*}_{i,m}\) are used to calculate the similarity weights between the testing and the training units. The similarity weight \(sw_{ij}\) is determined as reported in Eq. 7; the weights are then ranked in descending order and, finally, the number of similar units is determined (Eq. 8). Specifically, k refers to the number of functions to be selected for the comparison, while N is the number of training units. A sketch of this comparison follows Eqs. (7)–(8).

$$sw_{ij} = \exp\left(-d_{ij}^{2}\right)$$
(7)
$$SU = \min\left(k, N\right)$$
(8)
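A minimal sketch of the comparison of Eqs. (7)–(8), assuming the HI* profiles are sampled on a common hourly grid and compared on their overlapping span:

```python
import numpy as np

def most_similar_units(hi_test, hi_train, k):
    """Rank the training HI* profiles by similarity to a testing profile.

    The Euclidean distance is used, one of the admissible choices
    mentioned in the text.
    """
    weights = []
    for hi_ref in hi_train:
        n = min(len(hi_test), len(hi_ref))
        d_ij = np.linalg.norm(hi_test[:n] - hi_ref[:n])  # distance d_ij
        weights.append(np.exp(-d_ij ** 2))               # Eq. (7)
    su = min(k, len(hi_train))                           # Eq. (8)
    order = np.argsort(weights)[::-1]                    # descending similarity
    return [(int(i), weights[i]) for i in order[:su]]

# Hypothetical reference profiles and testing profile.
train = [np.exp(-0.03 * np.arange(60)), np.exp(-0.10 * np.arange(40))]
test = np.exp(-0.032 * np.arange(30))
print(most_similar_units(test, train, k=5))  # the slow-decay profile ranks first
```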
  6.

    Starting from the KNN results, a Weibull distribution is fitted to the k similar profiles and its median is determined.

  7.

    Subtracting \(t^{*}_{s(test)}\) from the median determined at step 6, the RUL is assessed, as sketched below.
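A minimal sketch of steps 6 and 7, assuming (as one reading of step 6) that the Weibull distribution is fitted to the life durations of the k most similar training units; all numbers are hypothetical.

```python
import numpy as np
from scipy.stats import weibull_min

# Hypothetical failure times (hours) of the k most similar training units
# returned by the KNN step.
similar_durations = np.array([68.0, 74.0, 71.0, 80.0, 65.0])

# Fit a two-parameter Weibull distribution (location fixed at 0).
shape, loc, scale = weibull_min.fit(similar_durations, floc=0)
median_life = weibull_min.median(shape, loc=loc, scale=scale)

t_test = 50.0               # elapsed time of the testing unit (hypothetical)
rul = median_life - t_test  # step 7: RUL = median - t*_s(test)
print(f"median life = {median_life:.1f} h, RUL = {rul:.1f} h")
```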

Finally, the proposed approach requires the extraction of the ARs describing the relationships between component failures and plant stoppages. In this way, when a deviation in the operating performance is detected, the estimated RUL is used as the time window within which to inspect the components most likely to cause the stoppage, so that their normal functioning can be restored and, possibly, the actual stoppage avoided.

Mining the Association Rules from a dataset implies the extraction of non-trivial attribute-value associations which are not immediately detectable due to the size of the dataset [20]. Consider a set of Boolean data \(D=\{d_{1}, d_{2}, \dots d_{m}\}\), named items, and a set of transactions \(T = \{t_{1}, t_{2}, \dots t_{k}\}\); a transaction \(t_{i}\) is a subset of items. An Association Rule a → b is an implication between two itemsets (a and b) taken from D, whose intersection is empty (a ∩ b = Ø). To evaluate the goodness of a rule, different metrics can be used, such as the support (Supp) and the confidence (Conf):

  • \(Supp(a, b) = \frac{\#(a, b)}{\#(T)}\): it measures the fraction of transactions containing both a and b over the total number of transactions.

  • \(Conf\left( {a \to b} \right) = \frac{{supp\left( {a, b} \right)}}{supp\left( a \right)}\): it measures the conditional probability of the occurrence of b, given the fact that a occurred.

For the purposes of this work, an AR \(a \to b\) is the implication relating the component requiring a work order (\(a\)) to the stoppage \((b)\). The \(Supp(a, b)\) thus expresses the joint probability of having a failure on a component and a stoppage, while the \(Conf\left( {a \to b} \right)\) represents the probability of having a stoppage given that component \(a\) failed. In this work, the FP-growth algorithm [21] is applied to perform the ARM.
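As a minimal illustration of the two metrics, the fragment below computes Supp and Conf on a handful of hypothetical daily transactions merging work-ordered components and stoppage labels.

```python
# Hypothetical transactions: components under work order and the stoppage
# category recorded on the same day, as in the merged dataset of this work.
transactions = [
    {"Furnace", "SHUT_DOWN"},
    {"Furnace", "Chiller", "SHUT_DOWN"},
    {"Condensation detector"},
    {"Furnace"},
]

def supp(itemset):
    # Supp: fraction of transactions containing the whole itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def conf(a, b):
    # Conf(a -> b) = Supp(a, b) / Supp(a).
    return supp(a | b) / supp(a)

print(supp({"Furnace", "SHUT_DOWN"}))      # 0.5
print(conf({"Furnace"}, {"SHUT_DOWN"}))    # ~0.67: P(stoppage | failure of a)
```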

When the health indicator highlights the risk of a stoppage, the components are inspected to check their functioning, sorting them by decreasing confidence value. If a failure or a malfunctioning is detected on the first component, it is replaced; otherwise, the following one is inspected. Depending on the maintenance policy adopted by the company, the inspection can involve all the components included in the ARs, stop when the first failure is detected, or cover the ARs only down to a certain confidence threshold, as sketched below.
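A schematic sketch of the inspection policies just described, with hypothetical rules and a stubbed field inspection:

```python
# Hypothetical confidence-ranked rules component -> stoppage.
rules = [("Furnace", 0.60), ("Condensation detector", 0.20), ("Chiller", 0.20)]

def check_component(component):
    # Placeholder for the physical inspection; True means a fault is found.
    return component == "Furnace"  # hypothetical outcome

def inspect(rules, conf_threshold=0.0, stop_at_first_failure=True):
    fixed = []
    for component, conf in sorted(rules, key=lambda r: r[1], reverse=True):
        if conf < conf_threshold:
            break  # policy: cover the ARs only down to a confidence threshold
        if check_component(component):
            fixed.append(component)  # replace or repair the faulty component
            if stop_at_first_failure:
                break  # policy: stop at the first detected failure
    return fixed

print(inspect(rules))  # ['Furnace']
```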

3 Application

The refinery considered for the case study is located in Italy and has a processing capacity of 85,000 barrels per day. The sub-plant taken into consideration in this application is the Topping unit. Data refer to a three-year time interval; specifically, the mass flow across the plant is collected hourly for each day. Three categories of stoppages or flow decelerations are identified by the company: Non-Significant (NS), if the reduction of the daily mass flow is between 20% and 50%; Slowdown (SLD), if the reduction is between 50% and 90%; Shutdown (SHD), if the reduction is between 90% and 100%.

The dataset containing these data is structured as reported in Table 1: the first column indicates the date of the acquisition; the following twenty-four columns report the hourly mean value of the mass flow registered across the plant; the last column indicates the kind of stoppage occurred during the day (if any). The mass-flow measures are also used to train and test the proposed approach: in all, 1095 rows and 24 columns are standardized and used to this end. The approach evaluation is carried out on an Intel® Core™ i7-6700HQ CPU @ 2.60 GHz, using Matlab 2019©.

Once the dataset is standardized, steps 1–5 of the proposed approach are carried out. Figure 1 displays, in pink, the HI profiles obtained through the algorithm for the three stoppage categories and, in black, the current trend. Evidently, the latter cannot be considered an anticipator of the NS stoppage but, given its trend, is far more similar to the SHD one. Hence, through steps 6 and 7 of the proposed algorithm, the RUL can be determined and the relationships between the component failures and SHD stoppages can be investigated.

Table 1. Excerpt of the mass flow dataset indicating the sub-plant, the date of the acquisition, the hourly measurements and the stoppage category.
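As a rough sketch of how the stoppage labels of Table 1 can be assigned, the following fragment classifies a day from its hourly mass-flow readings; the nominal flow used as reference is a hypothetical parameter.

```python
import numpy as np

def stoppage_category(daily_flow, nominal_flow):
    """Label a day by the reduction of its mean mass flow with respect to
    the nominal value, following the company's thresholds."""
    reduction = 1 - daily_flow.mean() / nominal_flow
    if 0.9 <= reduction <= 1.0:
        return "SHD"  # Shutdown
    if 0.5 <= reduction < 0.9:
        return "SLD"  # Slowdown
    if 0.2 <= reduction < 0.5:
        return "NS"   # Non-Significant
    return ""         # regular operation

# Hypothetical day: 24 hourly readings at 30% of the nominal mass flow.
day = np.full(24, 300.0)
print(stoppage_category(day, nominal_flow=1000.0))  # -> "SLD"
```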
Fig. 1. The HIs comparison.

A second dataset, i.e., the work order list, is taken into account to identify the components requiring a maintenance intervention on a specific day (Table 2). This dataset is integrated with the one reported in Table 1 so that the association rules can be mined.

Table 2. Work Order (WO) dataset detailed by sub-plant, date and component.

The relationships between component failures and stoppage categories are derived through Association Rule Mining. Specifically, the interest lies in identifying the work orders that historically preceded the stoppages, in order to use them as guidelines to understand which failure might cause a stoppage and to intervene accordingly. In this application, the ARM is executed using the well-known data analytics platform RapidMiner, widely adopted thanks to its graphical interface. In all, 120 rules have been identified, setting the minimum support threshold to 0, so as not to lose any potential relation between components and stoppage categories (and vice versa). The minimum confidence, instead, is set to 0.1. The set of ARs in the form componenti → stoppagej is used to identify the components possibly affecting the abnormal functioning of the plant. In the proposed example, the deviation of the mass flow appears related to a SHD stoppage: 14 rules have been identified, even though only an excerpt is reported (Table 3).

Table 3. ARs relating the WO and the SHD stoppage.
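The mining itself is run in RapidMiner; as an illustrative alternative under the same settings, the sketch below derives the component → SHUT_DOWN confidences with the mlxtend implementation of FP-growth (an assumed dependency), on hypothetical transactions. Since mlxtend requires a strictly positive minimum support, a negligible threshold stands in for the value of 0 used in this work.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Hypothetical daily transactions (work-ordered components plus the
# stoppage category observed on the same day).
transactions = [
    ["Furnace", "SHUT_DOWN"],
    ["Furnace", "Chiller", "SHUT_DOWN"],
    ["Furnace"],
    ["Condensation detector", "SHUT_DOWN"],
    ["Pump"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Minimum support of 0 in the paper; mlxtend needs a positive value.
freq = fpgrowth(onehot, min_support=1e-9, use_colnames=True)
supp = dict(zip(freq["itemsets"], freq["support"]))

# Confidence of each rule component -> SHUT_DOWN, filtered at 0.1.
for comp in ["Furnace", "Chiller", "Condensation detector", "Pump"]:
    pair = frozenset([comp, "SHUT_DOWN"])
    if pair in supp:
        confidence = supp[pair] / supp[frozenset([comp])]
        if confidence >= 0.1:
            print(f"{comp} -> SHUT_DOWN: confidence {confidence:.2f}")
```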

This implies that the first component to be checked is the Furnace since, from the actual data of past events, it requires a work order before the occurrence of a SHD (in other words, the rule Furnace → SHUT_DOWN has a confidence of 0.60). The following components to be checked, i.e., the Condensation detector and the Chiller, are the ones with the second-highest confidence value. During the inspection, the technician may detect a failure or a malfunctioning; if a failure or a malfunctioning is detected in one or more components, they should be fixed in the attempt to avoid the occurrence of the stoppage. Remarkably, the order of the inspection is relevant in terms of time: according to the profile of the HI, the RUL is determined, and the inspection, as well as the preventive replacement of components, should be carried out within the time limit imposed by the RUL.

4 Discussion and Conclusions

The proposed approach is well suited to the dynamic environment characterizing the maintenance field. Indeed, it supports the definition of the RUL of a system and, accordingly, defines the roadmap to inspect the possible causes of the performance losses. In this way, it is possible to fix malfunctioning promptly, so that the stoppage of the system can be avoided or, at least, the flow can be restored shortly. One of the main advantages of the proposed approach is that part of the analysis is carried out offline (e.g., training and testing on the proposed datasets, association rule mining), while its application can run online, during the functioning of the system. The datasets on which the analysis is based can be updated without any impact on the approach implementation. In the proposed example, a single sensor is considered; however, the approach is easily extendable to case studies receiving data from more sensors, since the proposed algorithm is general.

The accuracy of the proposed approach strictly depends on the quality of the collected data. Before starting the implementation of the algorithm, some preliminary activities on the data are required: it is necessary to clean the data from inconsistencies, replace the missing values, eliminate disturbances in the sensor system and, finally, filter the data to limit the boundaries of the study. This ensures that the starting database is solid, so that the results are reliable too. As shown in Table 4, the prediction error varies with the percentage of validation data considered: selecting 70% of the dataset yields the minimum prediction error, compared to the 50% and 90% cases. These outcomes, however, are not generalizable, since they are strictly related to the specific case study, the sampling time and the initial prediction instant.

Table 4. Prediction error ranges and percentiles varying the validation data percentage for Shut-down stoppages.
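The figures in Table 4 are case-specific and are not reproduced here; purely as a sketch of how such error ranges and percentiles can be computed for a given validation split, the fragment below summarizes synthetic RUL prediction errors.

```python
import numpy as np

def error_summary(rul_true, rul_pred):
    # Error range and percentiles, in the spirit of Table 4.
    err = np.asarray(rul_pred) - np.asarray(rul_true)
    return {
        "min": float(err.min()),
        "25th": float(np.percentile(err, 25)),
        "median": float(np.percentile(err, 50)),
        "75th": float(np.percentile(err, 75)),
        "max": float(err.max()),
    }

# Hypothetical true and predicted RULs for one validation split (e.g., 70%).
rng = np.random.default_rng(0)
rul_true = rng.uniform(10, 80, size=30)
rul_pred = rul_true + rng.normal(0, 5, size=30)
print(error_summary(rul_true, rul_pred))
```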

It should be considered that the data used in this study are routinely collected in contexts such as process industries and refineries, as they serve to establish the normal operation of plants and to support basic maintenance activities. They are not the result of complex reprocessing or of acquisitions performed specifically for this purpose: only limited reworking has been performed (e.g., standardization and filtering). The algorithm is therefore based on the data already available to the organization. It provides promising results in terms of prediction and in reasonable times, being also able to use data stored for other purposes and thus requiring a minimum effort in terms of data collection. From the company's perspective, knowing in advance the type of intervention to be made to avoid stoppages, or to intervene promptly, represents a considerable benefit, since the costs related to such stops are saved. In this work, the goodness of the algorithm is verified at a theoretical level through simulations. Further developments of this work regard a real-world evaluation of the actual advantages.