1 Introduction

Condition-Based Maintenance (CBM), which comprises diagnosis and prognosis tasks, is important for expensive machine assets such as wind turbines [1]. Wind turbines, for example, are prone to gearbox, brake and rotor blade failures that result in long downtimes [1]. In CBM, condition monitoring data such as temperature, vibration, and pressure measurements are acquired from the machine and subsequently used to infer the health of the machine (i.e. diagnosis) and to determine the remaining useful life of the machine (i.e. prognosis) [1]. Vibration-based methods are most commonly used to monitor the condition of rotating machine components such as bearings and gears [2, 3].

Damage usually produces only subtle changes in the condition monitoring data, which are difficult to detect [2]. This makes processing the data into more meaningful representations essential. Several signal processing and learning-based condition monitoring methods have been developed (e.g. Refs. [4, 5]) to address the challenges posed by the weak signal components-of-interest. In particular, extraneous signal components, time-varying operating conditions, and non-Gaussian noise mask the components-of-interest and impede diagnosis [3, 6].

Ultimately, the diagnosis and prognosis steps are used to inform maintenance decisions and therefore the diagnosis and prognosis methods should be carefully designed. In the diagnosis step, the damaged component and damage severity are inferred from the condition monitoring data and then used by prognosis algorithms to predict the remaining useful life. Many data-driven prognosis approaches have been proposed in the literature [7]. However, these methods require representative historical fault data from the different damage modes, which are difficult to obtain in practice [8]. Hence, inferring the relevant fault information is critical.

Therefore, in this work, we highlight that identifiability is essential for developing fault diagnosis and prognosis methods. Firstly, we present the identifiability problem for rotating machine condition monitoring in Sect. 2, whereafter we highlight the different tasks in the fault diagnosis and prognosis process in Sect. 3. In Sect. 4, we propose requirements for the development of fault diagnosis and prognosis methods, whereafter we conclude the work in Sect. 5.

2 Identifiability for Condition Monitoring

2.1 The Condition Monitoring Problem

Figure 1(a) shows the condition monitoring problem for a gearbox with four bearings (B1, B2, B3, B4), a gear, a pinion and two events E1 and E2. In Fig. 1(b), the sources of the events are shown: The pinion is damaged, resulting in event E1, and the outer race of the bearing is damaged, resulting in event E2. In this example, two sensors, S1 and S2, are placed on the casing of the gearbox. Ultimately, we would like to infer the gearbox’s condition from the measurements of one or both sensors. Actual measurements from different sensors on a gearbox are shown to emphasise the influence of the sensor positions.

Fig. 1. The condition monitoring problem: (a) The gearbox, two events inside the gearbox E1 and E2, four bearings B1, B2, B3, B4 and two sensors, S1 and S2, are shown. (b) The two damaged components causing the events are shown. (c) Some of the transmission paths between the events and the sensors are shown. (d) Example signals are shown for the two sensors. These were acquired from two sensors on the test-rig described in Ref. [3]. (e) The legend for the events and the transmission paths.

The excitations due to the damage should travel between the source (i.e. the events) and the sensors [9], with possible transmission paths highlighted in Fig. 1(c). The time domain signals can be decomposed as follows:

$$\begin{aligned} x_{1}(t)&= h_{11}(t) \otimes e_{1}(t) + h_{12}(t) \otimes e_{2}(t) + n_{1}(t),\end{aligned}$$
(1)
$$\begin{aligned} x_{2}(t)&= h_{21}(t) \otimes e_{1}(t) + h_{22}(t) \otimes e_{2}(t) + n_{2}(t), \end{aligned}$$
(2)

where \(x_{i}(t)\) and \(n_{i}(t)\) are the vibration signal and the extraneous components in the ith sensor signal, respectively. The jth event, attributed to the excitation at the source of the damage (e.g. the rolling element’s interaction with a spall), is denoted \(e_{j}\). The impulse response function that captures the transmission path between event j and sensor i is denoted by \(h_{ij}\). The measured vibration signal is discrete and denoted \(\boldsymbol{x} = [x(0), x(\varDelta t), x(2 \cdot \varDelta t), \ldots ]\), where \(\varDelta t\) is the sampling period. By using the convolution theorem, it is possible to decompose the signals as follows:

$$\begin{aligned} \left[ \begin{array}{c} X_{1}(f) \\ X_{2}(f) \end{array} \right] = \left[ \begin{array}{c c} H_{11}(f) &{} H_{12}(f) \\ H_{21}(f) &{} H_{22}(f) \\ \end{array} \right] \left[ \begin{array}{c} E_{1}(f) \\ E_{2}(f) \end{array} \right] + \left[ \begin{array}{c} N_{1}(f) \\ N_{2}(f) \end{array} \right] , \end{aligned}$$
(3)

where \(H_{ij}\), \(X_{i}\), \(E_{j}\), \(N_{i}\) are the Fourier transforms of \(h_{ij}\), \(x_{i}\), \(e_{j}\) and \(n_{i}\) respectively. Changes in the signal \(X_{i}\) can be caused by changes in the frequency response functions \(H_{i1}\) and \(H_{i2}\); the events \(E_{1}\) and \(E_{2}\); and/or the noise components \(N_{i}\). Therefore, it is important to carefully interpret the changes in the raw signal and its statistics. Ultimately, we desire to extract or enhance the event information \(E_{j}\) to perform diagnosis and prognosis.
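The decomposition in Eqs. (1) to (3) can be checked numerically. The following is a minimal sketch, assuming synthetic impulse-train events, decaying-sinusoid impulse responses and Gaussian extraneous components; all signal shapes, gains and lengths are hypothetical and are not taken from the test-rig data. It builds the two sensor signals by convolution and confirms that their spectra match the frequency domain model once all sequences are zero-padded to the full convolution length.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256  # number of event samples (illustrative)

# Hypothetical events e_j: two impulse trains with different periods.
e1 = np.zeros(n)
e1[::32] = 1.0
e2 = np.zeros(n)
e2[5::50] = 0.5

# Hypothetical impulse responses h_ij: short decaying sinusoids.
t = np.arange(16)

def decaying_sine(freq, gain):
    return gain * np.exp(-0.3 * t) * np.sin(2 * np.pi * freq * t + 0.5)

h = [[decaying_sine(0.10, 1.0), decaying_sine(0.25, 0.3)],
     [decaying_sine(0.10, 0.2), decaying_sine(0.25, 1.0)]]

m = n + len(t) - 1  # length of the full linear convolution
noise = 0.01 * rng.standard_normal((2, m))  # extraneous components n_i

# Eqs. (1)-(2): each sensor sees both events through its own transmission paths.
x = np.array([
    np.convolve(h[i][0], e1) + np.convolve(h[i][1], e2) + noise[i]
    for i in range(2)
])

def spectrum(s):
    # Zero-pad to length m so that the DFT product equals the linear convolution.
    return np.fft.fft(s, m)

# Eq. (3): the same model written in the frequency domain.
X_model = np.array([
    spectrum(h[i][0]) * spectrum(e1) + spectrum(h[i][1]) * spectrum(e2) + spectrum(noise[i])
    for i in range(2)
])
X_direct = np.fft.fft(x, axis=1)

max_error = np.max(np.abs(X_model - X_direct))
print("largest mismatch between Eq. (3) and the FFT of Eqs. (1)-(2):", max_error)
```

The mismatch should sit at the level of floating-point round-off. Without the zero-padding, the DFT product would correspond to a circular rather than a linear convolution.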

2.2 Identifiability Problem

Identifiability and observability are closely related; observability indicates whether we can infer the latent state of the system from the system’s response, while identifiability indicates whether we can infer the system’s parameters from its response [10]. In condition monitoring, we typically would like to infer the parameters of the damage (e.g. size of the damage) and therefore we present this discussion in an identifiability context. We will use Eq. (3) to present the identifiability problem for condition monitoring, without accounting for estimation errors (e.g. the influence of finite length signals). We can write Eq. (3) in matrix form

$$\begin{aligned} \boldsymbol{X}(f) = \boldsymbol{H}(f) \cdot \boldsymbol{E}(f) + \boldsymbol{N}(f), \end{aligned}$$
(4)

and leverage elementary linear algebra theory to gain insight into the identifiability of the events. The following four interesting cases are considered here for highlighting the identifiability problem:

$$\begin{aligned} \boldsymbol{H}_{1} = \left[ \begin{array}{c c} a &{} 0 \\ 0 &{} b \\ \end{array} \right] , \quad \boldsymbol{H}_{2} = \left[ \begin{array}{c c} a &{} c \\ c &{} b \\ \end{array} \right] , \quad \boldsymbol{H}_{3} = \left[ \begin{array}{c c} a &{} a \\ a &{} a \\ \end{array} \right] , \quad \boldsymbol{H}_{4} = \left[ \begin{array}{c c} a &{} 0 \\ 0 &{} 0 \\ \end{array} \right] , \end{aligned}$$
(5)

where \(a \in \mathbb {C}\), \(b \in \mathbb {C}\), \(c \in \mathbb {C}\). If the frequency response function matrix \(\boldsymbol{H} = \boldsymbol{H}_{1}\), the events will feature independently in the sensors. Therefore, different sensors can be used to characterise specific events. If \(\boldsymbol{H} = \boldsymbol{H}_{2}\) and assuming \(\text {det}(\boldsymbol{H}) \ne 0\), the matrix is invertible and therefore we obtain two signals with independent event information using

$$\begin{aligned} \boldsymbol{H}^{-1}(f) \cdot \boldsymbol{X}(f) = \boldsymbol{E}(f) + \boldsymbol{H}^{-1}(f) \cdot \boldsymbol{N}(f), \end{aligned}$$
(6)

where \(\boldsymbol{H}^{-1}(f) \cdot \boldsymbol{X}(f)\) is the spectrum of the processed signal. Note that unless we know \(\boldsymbol{N}(f)\), we cannot identify the events \(E_{j}\) themselves, but only scaled and translated versions of the events, i.e. \(\alpha \cdot E_{j} + \beta \). This is sufficient for most condition monitoring tasks. If \(\boldsymbol{H} = \boldsymbol{H}_{3}\), the rows are linearly dependent, which means that the sensors contain duplicate information (e.g. the sensors are symmetric for the structure and the excitations) and \(\boldsymbol{H}_{3}\) is not invertible. This means that we cannot recover the original events from the response. Lastly, if \(\boldsymbol{H} = \boldsymbol{H}_{4}\), it is only possible to identify the information concerning event 1; event 2 is not identifiable irrespective of the signal processing algorithm used. These four academic examples highlight the importance of sensor positioning to obtain well-conditioned matrices.
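The four cases in Eq. (5) can be explored with a few lines of linear algebra. The sketch below is illustrative only: the values of a, b and c and the event spectra are hypothetical, the model is evaluated at a single frequency, and the noise term of Eq. (4) is set to zero. It reports the rank and condition number of each matrix and attempts to recover the events only when the matrix has full rank.

```python
import numpy as np

# Hypothetical complex entries of the frequency response matrix at one frequency.
a, b, c = 1.0 + 0.5j, 0.8 - 0.2j, 0.3 + 0.1j

cases = {
    "H1": np.array([[a, 0], [0, b]]),  # events appear independently in the sensors
    "H2": np.array([[a, c], [c, b]]),  # coupled, but invertible for these values
    "H3": np.array([[a, a], [a, a]]),  # linearly dependent rows
    "H4": np.array([[a, 0], [0, 0]]),  # event 2 never reaches either sensor
}

# Hypothetical event spectra E_1(f) and E_2(f) at the same frequency.
E = np.array([2.0 + 1.0j, -1.0 + 0.5j])

summary = {}
for name, H in cases.items():
    rank = int(np.linalg.matrix_rank(H))
    X = H @ E  # noise-free version of Eq. (4)
    # Eq. (6) only applies when H is invertible, i.e. when it has full rank.
    recovered = np.linalg.solve(H, X) if rank == 2 else None
    summary[name] = (rank, recovered)
    print(name, "rank:", rank, "condition number:", np.linalg.cond(H))
```

For the rank-deficient cases, the condition number is infinite or extremely large, which is the numerical counterpart of the statement that the events cannot be recovered from the response.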

Many condition monitoring algorithms (e.g. [2, 3, 11]) only utilise the information from a single sensor to infer the machine’s condition. In some classes of condition monitoring algorithms (e.g. [12, 13]), we aim to design filters \(P_{ij}(f)\) that can be applied to the data of a single sensor \(X_{i}(f)\) so that the jth event is identifiable from the processed signals. The spectrum of these processed signals can be decomposed as follows:

$$\begin{aligned} \left[ \begin{array}{c} P_{i1}(f) \\ P_{i2}(f) \end{array} \right] X_{i}(f) = \left[ \begin{array}{c c} \alpha _{1} &{} 0 \\ 0 &{} \alpha _{2} \\ \end{array} \right] \cdot \left[ \begin{array}{c} E_{1}(f) \\ E_{2}(f) \end{array} \right] + \left[ \begin{array}{c} \beta _{1} \\ \beta _{2} \end{array} \right] , \end{aligned}$$
(7)

where \(P_{ij}(f) \cdot X_{i}(f)\) is the spectrum of the processed signal aiming to extract event \(E_j\), \(\alpha _{j}\) is a scaling term and \(\beta _{j}\) is an offset term. The scaled events \(E_{j}\) are identifiable if we can design a filter

$$\begin{aligned} P_{ij}(f) = k \cdot H_{ij}(f), \end{aligned}$$
(8)

where \(k \ne 0\) and \(H_{i1}(f) \cdot H_{i2}(f) = 0\), i.e. the frequency response functions of the two events have no spectral overlap. This is possible if the two events manifest in two separate narrow frequency bands. Targeted informative frequency band identification methods (e.g. [11, 12]) and targeted blind deconvolution algorithms [13] aim to find the optimal filters \(P_{ij}(f)\) to extract the events-of-interest.
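A short worked substitution may help to show why the filter in Eq. (8) isolates a single event. It is a sketch that assumes the single-sensor model in row i of Eq. (3) and considers the filter targeting event 1; matching the resulting terms to \(\alpha _{1}\) and \(\beta _{1}\) in Eq. (7) is an interpretation, under which both terms depend on frequency:

$$\begin{aligned} P_{i1}(f) \cdot X_{i}(f)&= k \cdot H_{i1}(f) \cdot \left[ H_{i1}(f) \cdot E_{1}(f) + H_{i2}(f) \cdot E_{2}(f) + N_{i}(f) \right] \\&= k \cdot H_{i1}^{2}(f) \cdot E_{1}(f) + k \cdot H_{i1}(f) \cdot H_{i2}(f) \cdot E_{2}(f) + k \cdot H_{i1}(f) \cdot N_{i}(f) \\&= \underbrace{k \cdot H_{i1}^{2}(f)}_{\alpha _{1}} \cdot E_{1}(f) + \underbrace{k \cdot H_{i1}(f) \cdot N_{i}(f)}_{\beta _{1}}. \end{aligned}$$

The cross term vanishes because \(H_{i1}(f) \cdot H_{i2}(f) = 0\). Repeating the steps with the indices 1 and 2 swapped gives the corresponding result for the filter targeting event 2.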

In conclusion, the measurement signal (and its processed signal) is influenced by the following factors [9]:

  • the transmission path, which means that it is influenced by the sensor location and the damaged component’s location. Planetary gearboxes have time-varying transmission paths between the planet gears and the sensors [14].

  • the excitation characteristics (e.g. if there is a bearing crack or a bearing spall in the inner race or outer race).

  • the extraneous components (e.g. dominant healthy gear mesh components, impulsive environmental noise).

All these effects can impede identifiability of the events and therefore impede the fault diagnosis and prognosis tasks discussed in the next section.

3 Fault Diagnosis and Prognosis

3.1 Overview of Diagnosis and Prognosis Steps

The diagnosis and prognosis problems are shown in Fig. 2 using the available information from a selected sensor. In the condition inference problem, we infer the actual condition (i.e. the damaged component, the damage mode and the size of the damage) from the processed condition monitoring data. The estimated condition is subsequently used with an appropriate degradation model (e.g. Paris’ law) to estimate the system’s remaining useful life. The reliability of the remaining useful life estimation process depends on our ability to diagnose the machine.

Fig. 2. The conventional diagnosis and prognosis problem. (a) Condition monitoring and data processing; (b) Diagnosis; (c) Prognosis.

In summary, the following steps need to be followed to perform diagnosis (Steps 1–4) and prognosis (Step 5):

  1. Damage detection (presented in Sect. 3.2)

  2. Damage component identification (presented in Sect. 3.3)

  3. Damage mode identification (presented in Sect. 3.4)

  4. Fault severity quantification (presented in Sect. 3.5)

  5. Remaining useful life estimation (presented in Sect. 3.6)

3.2 Damage Detection

In the damage detection phase, changes in the condition monitoring data due to the deteriorating machine are detected and flagged by comparing condition indicators against thresholds as shown in Fig. 3(a). For example, the threshold can be determined using the statistical methodology proposed by Antoni and Borghesani [4]. In learning-based methods, the detection threshold is usually determined from the reference density obtained from healthy historical data [5]. However, time-varying operating conditions could influence the condition indicator and could therefore influence the time of detection (i.e. our ability to identify the damage) and increase the false positive rate (i.e. detecting changes due to operating condition as opposed to changes in machine condition). This is highlighted in Fig. 3(b). This problem can be alleviated by appropriately pre-processing the signal (e.g. [15]) or post-processing the condition indicators (e.g. [16]). For example, in Ref. [15], the amplitude modulation caused by the changes in the operating conditions is estimated and attenuated, while retaining the damage information. In contrast, Zimroz et al. [16] first obtain the relationship between a condition indicator and the wind turbine’s power using a regression model. Thereafter, changes in this model are used for condition monitoring. It is expected that the sensitivity of the condition indicator to the operating conditions (e.g. power) is dependent on the condition of the machine and can therefore be used for damage detection.
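The threshold-based detection step can be illustrated with a minimal sketch. It assumes a synthetic condition indicator under constant operating conditions, a threshold taken as a high quantile of the healthy reference data, and a rule that requires several consecutive exceedances before an alarm is raised. The quantile level, the consecutive-count rule and all data are hypothetical choices for illustration; this is not the statistical methodology of Ref. [4].

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical healthy reference data for one condition indicator.
healthy = rng.normal(loc=1.0, scale=0.1, size=500)

# Hypothetical monitored data: 100 healthy samples followed by a slow upward trend.
trend = 1.0 + 0.01 * np.arange(100)
monitored = np.concatenate([
    rng.normal(loc=1.0, scale=0.1, size=100),
    trend + rng.normal(loc=0.0, scale=0.1, size=100),
])

# Assumed threshold: an upper quantile of the healthy reference distribution.
threshold = np.quantile(healthy, 0.999)
exceed = monitored > threshold

# Assumed alarm rule: k consecutive exceedances, to limit isolated false positives.
k = 3
runs = np.convolve(exceed.astype(int), np.ones(k, dtype=int), mode="valid") == k
detection_index = int(np.argmax(runs)) + k - 1 if runs.any() else None

print("threshold:", threshold)
print("first sample at which damage is flagged:", detection_index)
```

If the indicator also responded to operating conditions, the healthy reference distribution would widen and the threshold would rise, delaying detection. This is the effect that the pre-processing and post-processing strategies above aim to remove.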

Fig. 3. (a) The generic fault detection problem is shown for a single threshold and fault detection point. (b) The influence of operating conditions on fault detection. Abbreviations: OC - Operating Condition.

3.3 Damage Component Identification

The damage component identification (also referred to as damage localisation) problem is shown in Fig. 4 for the example gearbox in Fig. 1. As shown in Fig. 4(a), the measured signal \(\boldsymbol{x}\) can be generated by multiple potential events (e.g. a single damage mode or multiple simultaneous damage modes) and contains extraneous components \(\boldsymbol{n} = [n(0), n(\varDelta t), \ldots ]\). The damaged component is identifiable if we can process the signals \(\boldsymbol{x}\) to extract the scaled events \(\alpha \cdot e_{j} + \beta \). This is shown in Eq. (7) with the spectral representation. Methods to identify the damaged component include:

Fig. 4. The damage component identification problem: (a) The signal generation process for the gearbox in Fig. 1; (b) The damage subcomponent identification problem with bearing 3 identified using a processing step; (c) Some of the potential bearing damage cases; (d) Some of the potential gear damage cases.

  • The synchronous average can be used to extract synchronous deterministic components. If there are extraneous synchronous components, these components would contaminate the synchronous average and impede the damaged component’s identifiability.

  • The synchronous average of the squared envelope does not facilitate identifiability as it is sensitive to non-synchronous changes in the signal [3]. The synchronous median of the squared envelope and the synchronous geometric average of the squared envelope are much better suited as they are robust to non-synchronous impulsive components [3].

  • Cyclostationary analysis tools such as the squared envelope spectrum, the spectral correlation and the improved envelope spectrum can be used to determine the characteristics of the component-of-interest [2]. Cyclostationary analysis methods for time-varying speed conditions are proposed in Ref. [17] and cyclostationary analysis methods for impulsive noise conditions are proposed in Refs. [6, 18].

  • In discrepancy analysis, the anomalies’ localised behaviour is used to determine the source of the anomalous behaviour [19].

  • In contrast to blind condition indicators, targeted condition indicators make it possible to focus on specific cyclic orders and can be used to target specific mechanical components [11]. The RMS, for example, is sensitive to any energy change in the data. It therefore cannot be used to identify the damaged component and is not a reliable estimate of the fault severity.
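The robustness argument for the synchronous median and synchronous geometric average in the list above can be illustrated numerically. The sketch below assumes that the squared envelope has already been computed and order-tracked, so that every shaft revolution has the same number of samples. The synthetic squared envelope, the damage-related bump at a fixed shaft angle and the single non-synchronous impulse are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_rev, n_per_rev = 50, 64  # hypothetical: 50 revolutions, 64 angular samples each

# Hypothetical squared envelope: a damage-related bump at angular sample 20
# that repeats in every revolution, plus small random fluctuations.
angle = np.arange(n_per_rev)
template = 1.0 + 2.0 * np.exp(-0.5 * ((angle - 20) / 2.0) ** 2)
squared_envelope = template + 0.1 * rng.standard_normal((n_rev, n_per_rev))

# One large non-synchronous impulse in a single revolution (e.g. environmental noise).
squared_envelope[7, 45] += 500.0

# Three synchronous statistics, each computed across revolutions per angular sample.
sync_average = squared_envelope.mean(axis=0)
sync_median = np.median(squared_envelope, axis=0)
sync_geometric = np.exp(np.log(squared_envelope).mean(axis=0))

print("peak of synchronous average at sample:", int(np.argmax(sync_average)))
print("peak of synchronous median at sample:", int(np.argmax(sync_median)))
print("peak of synchronous geometric average at sample:", int(np.argmax(sync_geometric)))
```

In this synthetic case a single impulse is enough to move the peak of the synchronous average away from the damage location, whereas the median and the geometric average still peak at the damage-related angle.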

To perform damage component identification, the rotational speed of a reference shaft and the system’s kinematics (e.g. the gear mesh frequencies, the ball-pass outer race component of the bearing) need to be known [19]. The damaged component identifiability is also impeded if two components have the same statistical behaviour (e.g. both are first-order cyclostationary) and have very similar characteristic frequencies, e.g. if identical bearings are located on the same shaft. However, using multiple sensors might separate the contributions of the different sources if \(\boldsymbol{H}\) in Eq. (4) is invertible.

3.4 Damage Mode Identification

Fig. 5. The damage mode identifiability problem is illustrated when gear damage is present and only three damage modes are considered. The measured signal is processed to either identify (a) a tooth chip, (b) a tooth crack, or (c) pitting, with the tooth crack identified in this example.

In the damage mode identification phase, the mode of degradation (e.g. crack propagation in the gear tooth, pitting formation and spalling generation) is identified. The damage mode identification problem is shown in Fig. 5 for the case where there is gear damage in the gearbox. The degradation mode influences the remaining useful life of the system and therefore it is important to identify this for prognosis. Feng et al. [20] could distinguish between changes in the signals due to abrasive wear and pitting and therefore the damage mode is identifiable with their proposed procedure.

3.5 Fault Severity Quantification

In the fault severity quantification phase, the damage’s size is estimated (e.g. characteristic crack length, material loss due to wear). This is important as the remaining useful life depends on the size of the damage [21] and this information needs to be known when using the appropriate degradation model. The ICS2 indicator, which measures the degree of second-order cyclostationarity, has a good correlation with the severity of the abrasive wear and pitting [20]. Three typical condition indicator behaviours are shown in Fig. 6. Damage mode identification and fault severity quantification are often addressed in classification-based fault diagnosis, where the raw or processed data are mapped to a damage mode and/or a fault severity. The problem with this is that the degradation is continuous, not discrete, and often we do not have sufficient historical fault data to train the models [8]. Physics-based models can potentially aid with this task, but model calibration becomes an important consideration.

Fig. 6. The typical behaviour of condition indicators is shown. (a) The fault severity is identifiable for this specific case. (b) A condition indicator where the fault severity is not identifiable. (c) The fault severity is non-identifiable when a condition indicator is used that is sensitive to varying operating conditions. This behaviour is shown for three Operating Conditions (OC).

Fig. 7. The Remaining Useful Life (RUL) estimation process: (a) The conventional RUL estimation process is shown for a known threshold, a known degradation path and a known Time-Of-Failure (TOF). (b) The practical prognosis problem is illustrated as a population of failure thresholds, with the Probability Density Function (PDF) denoted by \(\text {PDF}_{t}\), and a population of degradation paths. This makes us uncertain of the actual TOF, which is described by \({\text {PDF}}_{\text {tof}}\).

3.6 Remaining Useful Life (RUL) Estimation

The remaining useful life estimation process is shown in Fig. 7(a) for the ideal case and in Fig. 7(b) for the practical scenario. The larger the variance of the potential degradation path and the appropriate threshold (i.e. the more uncertain we are), the more difficult it becomes to identify the remaining useful life. Hence, it is important to perform the diagnosis phase properly to ensure that the remaining useful life estimates are reliable. Data-driven prognosis methods estimate the remaining useful life from the raw or processed signals directly using the available historical data [7, 22]. However, a representative historical fault dataset is difficult to obtain; there are many potential failure modes and sufficient representative samples of each failure mode are required. Therefore, purely data-driven prognostics methods are difficult to implement in practice, which highlights the need for hybrid prognosis strategies [23].
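The difference between the ideal and the practical scenario can be sketched with a simple Monte Carlo simulation. The example assumes a linear degradation path for a condition indicator, a lognormal population of degradation rates and a normal population of failure thresholds. The linear model, the distributions and all parameter values are hypothetical placeholders; in practice a degradation model suited to the identified damage mode would be used.

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples = 10_000

# Current indicator value obtained from the diagnosis step (hypothetical).
y0 = 1.0

# Ideal case: one known degradation rate and one known failure threshold.
known_rate, known_threshold = 0.05, 3.0
rul_ideal = (known_threshold - y0) / known_rate

# Practical case: populations of degradation paths and of failure thresholds.
rate = rng.lognormal(mean=np.log(known_rate), sigma=0.3, size=n_samples)
threshold = rng.normal(loc=known_threshold, scale=0.2, size=n_samples)

# Assumed linear path y(t) = y0 + rate * t; the RUL is the threshold-crossing time.
rul = (threshold - y0) / rate

rul_median = np.median(rul)
rul_low, rul_high = np.percentile(rul, [5, 95])

print("ideal RUL:", rul_ideal)
print("median RUL:", rul_median)
print("5th to 95th percentile RUL:", rul_low, rul_high)
```

Widening either population widens the resulting RUL interval, which mirrors the statement that larger variance in the degradation path and threshold makes the remaining useful life harder to identify.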

4 Proposed Requirements for Fault Diagnosis and Prognosis Methods

In this work, the following requirements are proposed for the different phases of fault diagnosis and prognosis:

  • Sensor placement: The sensors should be placed to ensure that the matrix \(\boldsymbol{H}\) in Eq. (4) is as well-conditioned as possible. This means that if more sensors are added to the monitored machine, it should be ensured that the new sensor is as informative as possible (e.g. that the damage detection accuracy is increased [24]).

  • Damage detection: Condition indicators should only change when the machine degrades and should be robust to changes in operating conditions. Robustness against varying operating conditions is possible by pre-processing the signals (e.g. [15]) or processing the condition indicators (e.g. [16]).

  • Damage component identification: Condition indicators should only be sensitive to the presence of a subset of possible damage events (e.g. different bearing damage modes) to ensure that the condition indicators can be properly interpreted. A method to perform this task is presented in Ref. [3].

  • Damage mode identification: Condition indicators should provide a unique representation for the different damage modes to ensure the damage mode is identifiable. The appropriate degradation model can subsequently be selected or inferred.

  • Remaining useful life estimation: The RUL estimate should only be dependent on the component-of-interest and should not be influenced by changes in the condition of other machine components or changes in the machine’s operating conditions. This can be performed by using a robust condition indicator.

The damage mode identification problem is especially challenging because the damage often manifests as weak components in the signal and there might be subtle differences between the different damage modes. However, Eq. (4) highlights that by using multiple well-placed sensors and by having some prior knowledge about the transmission paths (e.g. from a physics-based model) it could be possible to have a better view of the events, which could aid in identifying the damage mode and the severity of the damage.

5 Conclusions

In this work, we presented the identifiability problem for rotating machine fault diagnosis and prognosis and highlighted its significance for the different fault diagnosis and prognosis tasks. The importance of sensor location and the importance of using the appropriate signal processing methods are emphasised. We also proposed requirements for the development and implementation of fault diagnosis and prognosis methods. We believe these requirements can help to develop reliable condition-based maintenance methods for machinery such as wind turbines; they will be investigated in more detail in future work.