In Chaps. 3 and 8 we analyzed times to occurrence of an event of interest, such as the failure of a component or system. If the failure is not repaired, but the component or system is replaced following failure, then the earlier analysis methods are applicable. In this chapter, however, we consider the case in which the failed component or system is repaired and placed back into service. Analysis in this situation is somewhat more complicated, and the details depend principally upon the state of the system or component after repair. We will consider two cases:

  • Repair leaves the component or system the same as new,

  • Repair leaves the component or system the same as old.

Both of these cases are modeling assumptions that an analyst must make, and they lead to different aleatory models for the failure time. We will provide some qualitative guidance for when each assumption might be appropriate, and we will provide some qualitative and quantitative model checks that can be used to check the reasonableness of the assumption.

In all of the following discussion, we assume that we can ignore the time it takes to actually repair a component or system that has failed. This allows us to treat the failure process as a simple point process. This assumption is typically reasonable either because repair time is short with respect to operational time, or because we are only concerned with operational time, so time out for repair is accounted for through component or system maintenance unavailability estimates.

9.1 Repair Same as New: Renewal Process

In this case, repair leaves the failed component or system in the same state as a new component or system. The times between failures are thus independent and will be assumed to be identically distributed. If the times between failures are exponentially distributed, the methods of Chap. 3 can be applied. If the times between failures are not exponentially distributed (e.g., Weibull or lognormal), then the methods of Chap. 8 can be applied. Note in both cases that it is the times between failures that are analyzed, not the cumulative failure times.

The assumption of repair same as new is plausible when the entire component or system is replaced or completely overhauled following failure. Examples would be replacement of a failed circuit board or rewinding a motor.

9.1.1 Graphical Check for Time Dependence of Failure Rate in Renewal Process

If one assumes a renewal process, then a qualitative check on whether the failure rate is constant can be done using the times between failures. If the failure rate is constant in a renewal process, then the times between failures are exponentially distributed. If one plots the ranked times between failures on the x-axis, and the cumulative sum of 1/n_t on the y-axis, where n_t is the number of components still operating just before time t, the result should be approximately a straight line if the failure rate is constant. If the slope increases (decreases) with time, this suggests a renewal process whose failure rate likewise increases (decreases) with time. Such a plot is referred to as a cumulative hazard plot.

Consider the 25 cumulative times of failure (in days) for a servo motor shown in Table 9.1, taken from [1].

There are 25 times, so the cumulative hazard plot increases by 1/25 at 27.1964, by 1/24 at 74.28, and so on. The plot is shown in Fig. 9.1, and appears to indicate an increasing failure rate with time. Quantitative analysis of these times can be carried out using the methods for exponential durations in Chap. 3. A quantitative check will be described below, after we have discussed the other extreme: repair same as old.
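The construction just described can be sketched in a few lines of Python (a minimal illustration, not from the text; the function name and the data in the usage comment are hypothetical):

```python
def cumulative_hazard_points(interarrival_times):
    """Return (ranked time, cumulative hazard) pairs for a cumulative hazard plot.

    At the k-th smallest of n times, n - k + 1 units are still "operating",
    so the hazard estimate increases by 1/(n - k + 1) at that time.
    """
    times = sorted(interarrival_times)
    n = len(times)
    points, h = [], 0.0
    for k, t in enumerate(times, start=1):
        h += 1.0 / (n - k + 1)  # increment: 1/n at the first time, 1/(n-1) next, ...
        points.append((t, h))
    return points

# With 25 times, the first increment is 1/25 = 0.04, the second 1/24, as in the text.
```

Plotting the returned pairs and judging the straightness of the result reproduces the qualitative check described above.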

Fig. 9.1 Cumulative hazard plot for data in Table 9.1, suggesting increasing failure rate with operating time

9.2 Repair Same as Old: Nonhomogeneous Poisson Process

In the case where repair leaves the component in the condition it was in immediately preceding failure, the times between failures may not be independent. For example, if the component is wearing out over time (aging), then later times between failures will tend to be shorter than earlier ones, and conversely if the component is experiencing reliability growth. In these cases, the times between failures will also fail to be identically distributed. Recall that to apply the methods of Chap. 3, both of these assumptions must be met.

Repair to a state that is the same as old is a reasonable default assumption for most components in a risk assessment, because a typical component is composed of subcomponents. When failure occurs, only a portion of the component (one or more subcomponents) is typically repaired, so the majority of the subcomponents are left in the condition they were in at the time of failure.

9.2.1 Graphical Check for Trend in Rate of Occurrence of Failure When Repair Is Same as Old

To avoid the confusion that can arise from using the same terminology in different contexts, we will follow [2] and use rate of occurrence of failure (ROCOF) instead of failure rate for the case where we have repair same as old. If the ROCOF is constant with time, then the times between failures will not tend to get shorter (aging) or longer (reliability growth) over time. If one plots the cumulative number of failures on the y-axis versus cumulative failure time on the x-axis, the resulting plot will be approximately a straight line if the ROCOF is constant. If aging is occurring (increasing ROCOF), the slope will increase with time, as the times between failures get shorter. If reliability growth is occurring (decreasing ROCOF), the slope will decrease with time, as the times between failures get longer.
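As a minimal sketch (Python; the function names and data are hypothetical, not from the text), the plotted points are simply cumulative failure time versus cumulative failure count, and a crude numerical companion to the visual check is to compare the mean time between failures in the first and second halves of the record:

```python
def cumulative_failure_points(cumulative_times):
    """(t_i, i) pairs: cumulative failure time on the x-axis, cumulative count on the y-axis."""
    return [(t, i) for i, t in enumerate(sorted(cumulative_times), start=1)]

def interarrival_trend(cumulative_times):
    """Ratio of mean time between failures in the second half to that in the first half.

    A ratio well below 1 (later gaps shorter) is consistent with an increasing
    ROCOF (aging); a ratio well above 1 with reliability growth.
    """
    ts = sorted(cumulative_times)
    gaps = [b - a for a, b in zip([0.0] + ts[:-1], ts)]  # times between failures
    half = len(gaps) // 2
    first, second = gaps[:half], gaps[half:]
    return (sum(second) / len(second)) / (sum(first) / len(first))
```

This numerical ratio is only a rough screen; the plot itself remains the primary qualitative check.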

Consider the 25 cumulative times in standby at which a cooling unit failed, shown in Table 9.2 and taken from [1]. Because the cooling unit consists of a large number of subcomponents, and only one or two of these were replaced at each failure, assume repair leaves the cooling unit in the state it was in immediately prior to failure. A plot of the cumulative number of failures versus cumulative failure time can be used to check if there appears to be a time trend in the ROCOF for the cooling unit. The plot is shown in Fig. 9.2. The slope appears to be increasing with time, suggesting that the ROCOF for the cooling unit is increasing as a function of calendar time.

Fig. 9.2 Cumulative failure plot for data in Table 9.2, suggesting increasing ROCOF over time

An interesting exercise is to construct a cumulative failure plot for the data in Table 9.1, where the failure rate appeared to be an increasing function of operating time under the assumption that repair was same as new, a renewal process. The plot, shown in Fig. 9.3, does not suggest an increasing ROCOF under the assumption of repair same as old. This illustrates a subtle point in analyzing repairable systems. If repair is same as new after each failure, then times between failures will not exhibit a trend over calendar time: aging or reliability growth occurs only over the time between one failure and the next, because the system returns to new, and the clock is reset, after each failure. Metaphorically, the system is reincarnated after each failure. On the other hand, when repair is same as old after each failure, aging or reliability growth occurs over calendar time, and one can then expect to see a trend in the slope of cumulative failures versus time. Metaphorically, under repair same as old, the system is merely resuscitated after each failure. Therefore, absence of a trend in the cumulative failure plot may suggest no aging or reliability growth under the assumption of repair same as old, but there may still be aging or reliability growth between failures under the assumption of repair same as new; the cumulative hazard plot shown earlier can be used to check for that possibility.

Table 9.1 Cumulative failure times of a servo motor, from [1]
Table 9.2 Twenty-five cumulative times in standby at which a cooling unit failed, taken from [1]
Fig. 9.3 Cumulative failure plot for data in Table 9.1, showing lack of trend in slope over calendar time

Figures 9.4 and 9.5 show cumulative failures versus cumulative time for 1,000 simulated failure times from two different renewal processes, one in which failure rate is decreasing with increasing operating time, the other where failure rate is increasing with operating time. Note in both cases that the cumulative failure plot produces a straight line, reinforcing the conclusion that this plot is useful for checking for aging or reliability growth under the same-as-old assumption for repair, but it cannot detect a time-dependent failure rate under the same-as-new repair assumption. The corresponding cumulative hazard plots in Figs. 9.6 and 9.7 are useful for this purpose when repair is same as new.
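The behavior shown in Figs. 9.4 and 9.5 is easy to reproduce (a Python sketch, not from the text; the Weibull parameter values used below are hypothetical): simulate Weibull interarrival times for a renewal process and check that cumulative failure count is very nearly a linear function of cumulative time, regardless of whether the interarrival failure rate is increasing or decreasing.

```python
import random

def simulate_renewal_weibull(n, beta, scale, seed=1):
    """Cumulative failure times for a renewal process with Weibull interarrival times.

    beta < 1 gives a decreasing failure rate between failures; beta > 1, increasing.
    """
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.weibullvariate(scale, beta)  # weibullvariate(scale, shape)
        times.append(t)
    return times

def linear_correlation(xs, ys):
    """Pearson correlation: a crude measure of how straight the cumulative plot is."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5
```

For 1,000 simulated times with, say, shape 0.8 or 1.5, the correlation between cumulative time and cumulative count is very close to 1, mirroring the straight lines in the figures.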

Fig. 9.4 Cumulative failure plot for 1,000 simulated failure times from renewal process with decreasing failure rate

Fig. 9.5 Cumulative failure plot for 1,000 simulated failure times from renewal process with increasing failure rate

Fig. 9.6 Cumulative hazard plot for 1,000 simulated failure times from renewal process with decreasing failure rate

Fig. 9.7 Cumulative hazard plot for 1,000 simulated failure times from renewal process with increasing failure rate

9.2.2 Bayesian Inference Under Same-as-Old Repair Assumption

As stated above, if there is an increasing or decreasing trend in the ROCOF over time, then the times between failures will not be independently and identically distributed, and thus the methods of Chaps. 3 and 8, which rely on this condition, cannot be applied. In particular, one should not simply fit a distribution (Weibull, gamma, etc.) to the cumulative failure times or the times between failures. Instead, the likelihood function must be constructed using the fact that each failure time, after the first, is dependent upon the preceding failure time. One must also specify a functional form for the ROCOF. We will assume a power-law form for our analysis here, as this is a commonly used aleatory model, sometimes referred to in the reliability literature as the Crow-AMSAA model. The equation for the power-law form of the ROCOF is

$$ \lambda (t) = \frac{\beta }{\alpha }\left( {\frac{t}{\alpha }} \right)^{\beta - 1} \quad \alpha ,\beta > 0 $$

In this model there are two unknown parameters, which we denote as α and β. The shape parameter β determines how the ROCOF changes over time, and the scale parameter α sets the units in which time is measured. If β < 1, reliability growth is occurring; if β > 1, aging is taking place; and if β = 1, the ROCOF is constant and there is no trend over time.
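The power-law ROCOF and its integral, the expected cumulative number of failures Λ(t) = (t/α)^β, can be written directly (a Python sketch; the function names are ours, not from the text):

```python
def rocof(t, alpha, beta):
    """Power-law ROCOF: lambda(t) = (beta/alpha) * (t/alpha)**(beta - 1)."""
    return (beta / alpha) * (t / alpha) ** (beta - 1)

def expected_failures(t, alpha, beta):
    """Mean function Lambda(t) = (t/alpha)**beta, the integral of the ROCOF from 0 to t."""
    return (t / alpha) ** beta
```

Note that Λ(t) = (t/α)^β is the quantity that appears in the exponent of the likelihood below.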

If the process is failure-truncated, and t_i is the cumulative time until the ith failure, then the likelihood function is given by

$$ f(t_{1}, t_{2}, \ldots, t_{n} \mid \alpha, \beta) = \frac{\beta^{n}}{\alpha^{n\beta}} \left( \prod\limits_{i = 1}^{n} t_{i}^{\beta - 1} \right) \exp\left[ - \left( \frac{t_{n}}{\alpha} \right)^{\beta} \right] $$
(9.1)

This equation can be derived from the fact that the time to first failure has a Weibull(β, α) distribution, and each succeeding cumulative failure time has a Weibull distribution truncated on the left at the preceding failure time.
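For numerical work outside OpenBUGS, the logarithm of the likelihood in Eq. 9.1 is convenient (a Python sketch for failure-truncated data; the function name is ours):

```python
import math

def power_law_loglik(times, alpha, beta):
    """Log of the likelihood in Eq. 9.1 for failure-truncated data.

    `times` are the cumulative failure times t_1 < t_2 < ... < t_n.
    """
    n = len(times)
    return (n * math.log(beta)
            - n * beta * math.log(alpha)
            + (beta - 1) * sum(math.log(t) for t in times)
            - (times[-1] / alpha) ** beta)  # exponent term uses the last failure time
```

This could be used, for example, to find maximum likelihood estimates of α and β as a frequentist check on the Bayesian results.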

The underlying DAG model is shown in Fig. 9.8 and the OpenBUGS script is listed in Table 9.3. Model checking in this case uses the Bayesian analog of the Cramér-von Mises statistic described in Chap. 4, and is based on the fact that, for the power-law process, if one defines z_i = (t_i/α)^β, then the increments z_i − z_(i−1) are exponentially distributed with rate 1; see [3].
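The transformation underlying this model check is easy to sketch (Python; the function name is hypothetical): transform each cumulative failure time and difference the results, which should then resemble a sample of exponential(1) interarrival times if the power-law model fits.

```python
def exp1_increments(times, alpha, beta):
    """z_i = (t_i/alpha)**beta; return the increments z_i - z_{i-1}, with z_0 = 0.

    Under the power-law model these increments are exponentially
    distributed with rate 1, which is the basis of the model check.
    """
    z = [(t / alpha) ** beta for t in times]
    return [b - a for a, b in zip([0.0] + z[:-1], z)]
```

In the Bayesian setting, these increments would be recomputed at each posterior sample of (α, β) to obtain the posterior predictive check.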

Fig. 9.8 DAG for modeling failure with repair under same-as-old repair assumption

Table 9.3 OpenBUGS script for analyzing data under same-as-old repair assumption (power-law process)

We run three chains. Convergence to the joint posterior distribution appears to occur within the first 1,000 samples, so we discard these as burn-in. We ran another 10,000 samples to estimate parameter values, obtaining a posterior mean for β of 1.94 with a 90% credible interval of (1.345, 2.65). The posterior probability that β > 1 is near unity, suggesting a ROCOF that is increasing with time, corresponding to aging and agreeing with the cumulative failure plot in Fig. 9.2. The Bayesian p-value is 0.57, close to the ideal value of 0.5.

The script in Table 9.3 analyzes a process that is failure-truncated, that is, observation stops after the last failure. It is also possible to have a time-truncated process, in which observation continues after the last failure, up to the stopping time τ. To handle this case, only one line in the OpenBUGS script needs to be changed. The line defining the log-likelihood changes to the following:

for(j in 1:M) {
   phi[j] <- -log(beta) + beta*log(alpha) - (beta-1)*log(t[j]) + pow(tau/alpha, beta)/M
}

The stopping time (tau) is loaded as part of the observed data.
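The corresponding change outside OpenBUGS is equally small: in the log-likelihood, the failure-truncated exponent term (t_n/α)^β is replaced by (τ/α)^β, where τ is the stopping time (a Python sketch; the function name is ours):

```python
import math

def power_law_loglik_time_truncated(times, tau, alpha, beta):
    """Log-likelihood for a time-truncated power-law process observed to time tau.

    Identical to the failure-truncated form except that the exponent
    term uses the stopping time tau rather than the last failure time t_n.
    """
    n = len(times)
    return (n * math.log(beta)
            - n * beta * math.log(alpha)
            + (beta - 1) * sum(math.log(t) for t in times)
            - (tau / alpha) ** beta)
```

When τ equals the last failure time, this reduces to the failure-truncated log-likelihood, mirroring the single-line change in the OpenBUGS script.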

9.3 Incorporating Results into PRA

How quantitative results are incorporated into a PRA model depends again on the process assumed to describe the failures. If repair is assumed to be same as new, leading to a renewal process, then the aleatory model for failure time in the PRA is the renewal distribution (e.g., Weibull). Most PRA software packages can only model an exponential distribution for failure time, so this would appear to be a problem. A way around it for operating equipment with a Weibull renewal distribution is as follows. Assume the mission time for the operating equipment is t_m, and assume we have determined that failures of the operating equipment can be described by a renewal process with a Weibull distribution for times between failures, with shape β and scale λ. That is, the density function for times between failures is given by

$$ f(t) = \beta \lambda t^{\beta - 1} \exp \left( { - \lambda t^{\beta } } \right) $$

This is the aleatory model for failure, which, as noted above, most PRA software does not include as an option. The event of interest in the PRA is the probability that the equipment fails before the end of the mission time. To have the PRA software calculate this probability, one can input an exponential distribution with rate λ as the stochastic model in the PRA, but replace the mission time by (t_m)^β. This is adequate for a point estimate, but may not allow epistemic uncertainty in β and λ to be propagated through the PRA model. If the renewal distribution is gamma or lognormal, there is no such simple work-around. Thus, unless lognormal or gamma is clearly preferred over Weibull, it is best to use a Weibull renewal distribution, as it will be easiest to incorporate into the PRA.
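The work-around can be verified numerically (a Python sketch; the parameter values in the test are hypothetical): an exponential model with rate λ, evaluated at the transformed mission time (t_m)^β, gives exactly the Weibull failure probability.

```python
import math

def weibull_fail_prob(t_m, beta, lam):
    """P(failure before mission time t_m) for the Weibull density
    f(t) = beta * lam * t**(beta - 1) * exp(-lam * t**beta)."""
    return 1.0 - math.exp(-lam * t_m ** beta)

def exponential_fail_prob(mission_time, lam):
    """P(failure before mission_time) under an exponential model with rate lam."""
    return 1.0 - math.exp(-lam * mission_time)

# Passing t_m**beta as the "mission time" to the exponential model
# reproduces the Weibull failure probability exactly.
```

This identity is what makes the trick exact for a point estimate, even though the PRA software never sees the Weibull model itself.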

If the failures are assumed to be described by a power-law NHPP (repair same as old), the process for incorporating the results into the PRA is more complicated. If the component at hand is new, and its failures are assumed to be described by an NHPP with parameters estimated from past data, then the time to first failure of the component is the variable of interest, and this time will be Weibull-distributed with shape parameter β and scale parameter λ, where λ is given in terms of the power-law parameters by λ = α^(−β). See above for how to “trick” the PRA software into using a Weibull stochastic model for time to first failure.

If the parameters have been estimated from a component’s failure history, and this history continues into the future, then one is interested in the probability that the next failure will occur before the end of the mission. The distribution of the next cumulative failure time is not a simple Weibull distribution; its cumulative distribution function is instead that of a left-truncated Weibull distribution. The probability that the next failure will occur within an additional time t, given that the last failure occurred at (cumulative) time T, is given by

$$ F(t) = 1 - \exp \{ - \lambda [(T + t)^{\beta } - T^{\beta } ]\} $$

where λ = α^(−β). Adding the lines below to the OpenBUGS script in Table 9.3 encodes this approach.

t.miss <- 24
t.window <- t[M] + t.miss
prob.fail <- 1 - exp(-pow(alpha, -beta)*(pow(t.window, beta) - pow(t[M], beta)))

Running the script as above, we find a mean probability of failure over a 24-h mission time of 0.53, with a 90% interval of (0.36, 0.69). One item to note is that the parameters of the power-law process, α and β, are highly correlated; the rank correlation between them is about 0.95 for the analysis above. This correlation must be taken into account in the calculation of the failure probability; ignoring it leads to an overestimate of the uncertainty. With no correlation between α and β, the 90% interval for the failure probability over a 24-h mission time is estimated to be (0.16, 0.99), much wider than the interval obtained using OpenBUGS, which accounts for the correlation automatically.
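For a single (α, β) pair, the conditional failure probability above can be sketched in Python (the parameter values in the test are hypothetical); to capture the full epistemic uncertainty, including the α-β correlation, one would evaluate this over posterior samples of (α, β), as OpenBUGS does.

```python
import math

def next_failure_prob(T, t, alpha, beta):
    """P(next failure within additional time t | last failure at cumulative time T)
    for the power-law process: 1 - exp(-lam * ((T + t)**beta - T**beta)),
    where lam = alpha**(-beta)."""
    lam = alpha ** (-beta)
    return 1.0 - math.exp(-lam * ((T + t) ** beta - T ** beta))
```

When β = 1 this reduces to the memoryless exponential result 1 − exp(−t/α), independent of T, as expected for a constant ROCOF.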

9.4 Exercises

  1.

    The following times, in minutes, are assumed to be a random sample from an exponential distribution: 1.7, 1.8, 1.9, 5.8, 10.0, 11.3, 14.3, 16.6, 19.4, and 54.8. Plot the cumulative hazard function for these times. Do the times appear to be exponentially distributed?

  2.

    Derive Eq. 9.1.