1.1 Aim and Scope of the Book

As the title suggests, the book is devoted to stochastic models for reliability. This very wide topic is naturally ‘censored’ by the current research interests of the authors in the field, which are: shock models, burn-in, and stochastic modeling in heterogeneous populations. At first sight, it seems that these three areas of research are rather ‘independent’. However, it turns out that they can be naturally combined within a unified framework, and some results of this kind have already been reported in our recent publications. As most real-life populations are heterogeneous, taking this property into account in the reliability analysis of various problems can only increase the adequacy of the corresponding modeling. Furthermore, all objects operate in a changing environment. One of the ways to model the impact of this environment is via external shocks occurring in accordance with some point process (e.g., the Poisson process or the renewal process). By a ‘shock’ we understand an ‘instantaneous’, potentially harmful event. Depending on its magnitude, a shock can destroy an operating system (failure), leave it unchanged (as good as old), or, e.g., increase its wear (deterioration) by some increment. Numerous shock models have been developed and reported in the reliability-related literature during the past 50 years. However, only a few papers (mostly by the present authors) deal with shocks in heterogeneous populations and with shocks as a method of burn-in.

Burn-in is a method of ‘eliminating’ initial failures prior to field usage. To burn-in a component or a system means to subject it to a period of simulated use prior to the actual operation. Due to the high failure rate at the early stages of a component’s life, burn-in has been widely accepted as an effective method of screening out early failures before systems are actually used in field operation. Under the assumption of decreasing or bathtub-shaped failure rate functions, various problems of determining optimal burn-in have been intensively studied in the literature. In conventional burn-in, the main parameter of the procedure is its duration. However, in order to shorten the length of this procedure, burn-in is often performed in an accelerated environment. This indicates that high environmental stress can be more effective in eliminating weak items from a population. In this case, obviously, larger values of stress should correspond to shorter durations of burn-in. By letting the stress increase, we can end up (in the limit) with very short (negligible) durations, in other words, with shocks.

One of the essential features of conventional burn-in is that it is performed for items with a decreasing (at least, initially) failure rate. Indeed, by burning in items for some time, we eventually decrease the failure rate for future usage. One of the main causes that ‘force’ the failure rate to decrease is the heterogeneity of populations of items: the weakest subpopulations die out first. When a population consists of ordered (in some suitable stochastic sense) subpopulations, the population failure rate is usually initially decreasing. It can also have a bathtub or a more complex shape. It turns out that under certain assumptions, burn-in for populations of heterogeneous items can be justified even in the case when the population failure rate is increasing. This counterintuitive finding, among others, shows the importance of taking the heterogeneity of manufactured items into account.

We consider positive (non-negative) random variables, which are called lifetimes. The time to failure of an engineering component or a system is a lifetime, as is the time to death of an organism. The number of casualties after an accident and the wear accumulated by a degrading system are also positive random variables. Although we deal here mostly with engineering applications, the reliability-based approach to lifetime modeling for organisms underlies several meaningful examples and applications in the book. Obviously, the human organism is not a machine, but nothing prevents us from using the stochastic reasoning developed in reliability theory for modeling the life spans of organisms.

An important tool and characteristic for the reliability analysis in our book is the failure rate function that describes a lifetime. It is well known that the failure rate function can be interpreted as the probability (risk) of failure in an infinitesimally small unit interval of time. Owing to this interpretation and some other properties, its importance in reliability, survival analysis, risk analysis, and other disciplines is hard to overestimate. For example, an increasing failure rate of an object is an indication of its deterioration or aging of some kind, which is an important property in various applications. Many engineering (especially mechanical) items are characterized by processes of “wear and tear” and, therefore, their lifetimes are described by an increasing failure rate. The failure (mortality) rate of humans at adult ages is also increasing. The empirical Gompertz law of human mortality specifies an exponentially increasing mortality rate. On the other hand, a constant failure rate is usually an indication of a non-aging property, whereas a decreasing failure rate can describe, e.g., a period of “infant mortality” when early failures, bugs, etc., are eliminated or corrected. This, as was mentioned, is also very important for the justification of burn-in, which is usually performed on items characterized by a decreasing or bathtub-shaped failure rate. Therefore, the shape of the failure rate plays an important role in reliability analysis. When the lifetime distribution function \( F(t) \) is absolutely continuous, the failure rate \( \lambda (t) \) can be defined as \( F^{\prime}(t)/(1 - F(t)) \). In this case, there exists a simple, well-known exponential representation of \( F(t) \) (Sect. 2.1), which characterizes the distribution function via the failure rate \( \lambda (t) \). Moreover, the failure rate contains information about the chances of failure of an operating object in the next sufficiently small interval of time. Therefore, the shape of \( \lambda (t) \) is often much more informative in the described sense than, for example, the shapes of the distribution function or of the probability density function. The mean remaining lifetime, in turn, contains information about the remaining life span and, in combination with the failure rate, creates a useful tool for reliability analysis.
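
For convenience, we record this exponential representation explicitly (a standard fact, discussed in detail in Sect. 2.1):

$$ F(t) = 1 - \exp \left\{ - \int_0^t \lambda (u)du \right\}, \quad t \ge 0. $$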

In this text, we consider several generalizations of the ‘classical’ notion of the failure rate \( \lambda (t) \). One of them is the random failure rate. Engineering and biological objects usually operate in a random environment. This random environment can be described by a stochastic process \( \{ Z_{t} ,t \ge 0\} \) (e.g., a point process of shocks) or, as a special case, by a random variable \( Z \). Therefore, the failure rate that corresponds to a lifetime \( T \) can also be considered as a stochastic process \( \lambda (t,Z_{t} ) \) or \( \lambda (t,Z) \). These functions should be understood conditionally on realizations, i.e., as \( \lambda (t|z(u),\;0 \le u \le t) \) and \( \lambda (t|Z = z) \), respectively. Similar considerations are valid for the corresponding distribution functions \( F(t,Z_{t} ) \) and \( F(t,Z) \).
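
As a simple illustration of the latter case (the widely used multiplicative frailty model, given here as a standard special case rather than as a model specific to this book), let \( \lambda (t,Z) = Z\lambda (t) \). Conditioning on \( Z = z \) and averaging yields the observed (population) survival function

$$ P(T > t) = E\left[ \exp \left\{ - Z\int_0^t \lambda (u)du \right\} \right]. $$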

Another important generalization of the conventional failure rate \( \lambda (t) \) deals with repairable systems and considers the failure rate of a repairable component as an intensity process (stochastic intensity) \( \{ \lambda_{t} ,t \ge 0\} \). The ‘randomness’ of the failure rate in this case is due to random times of repair. Assume for simplicity that the repair action is perfect and instantaneous. This means that after each repair a component is ‘as good as new’. Let the governing failure rate for this component be \( \lambda (t) \). Then the intensity process at time \( t \) for this simplest case of perfect repair is defined as

$$ \lambda_{t} = \lambda (t - T_{ - } ), $$

where \( T_{ - } \) denotes the random time of the last repair (renewal) before \( t \). Therefore, the probability of a failure in \( [t,t + dt) \) is \( \lambda (t - T_{ - } )dt \), which should also be understood conditionally on realizations of \( T_{ - } \). This, and the more general notion of the stochastic intensity for general orderly point processes, will be used extensively throughout the book.
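
To fix ideas, here is a minimal Python sketch (our own illustration: the Weibull governing failure rate and all numerical values are assumptions made only for this example) that simulates one realization of the perfect-repair renewal process and evaluates the intensity process \( \lambda_{t} = \lambda (t - T_{ - } ) \) along this realization.

```python
import numpy as np

# Governing (Weibull) failure rate: lambda(t) = (k / s) * (t / s)**(k - 1).
# The Weibull form and all numerical values are assumptions for this sketch.
k, s = 2.0, 10.0

def governing_rate(t):
    return (k / s) * (t / s) ** (k - 1)

rng = np.random.default_rng(1)
horizon = 50.0

# Perfect, instantaneous repair: renewal times are partial sums of i.i.d. lifetimes.
renewals, t = [], 0.0
while True:
    t += s * rng.weibull(k)          # one Weibull(k, s) lifetime
    if t > horizon:
        break
    renewals.append(t)
renewals = np.array(renewals)

def intensity(t):
    """Stochastic intensity lambda_t = lambda(t - T_-), where T_- is the
    last renewal time before t (taken as 0 if there has been no repair)."""
    past = renewals[renewals < t]
    t_last = past[-1] if past.size else 0.0
    return governing_rate(t - t_last)

for t in (5.0, 15.0, 30.0):
    print(f"t = {t:5.1f}:  lambda_t = {intensity(t):.4f}")
```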

Our presentation combines classical and recent results of other authors with our research findings of recent years. We discuss the subject using only the necessary tools and approaches and do not intend to present a self-contained textbook on reliability theory. The choice of topics is driven by the research interests of the authors. The excellent encyclopedic books by Lai and Xie [6] and Marshall and Olkin [7] give a broad picture of modern mathematical reliability theory and also provide up-to-date reference sources. Along with the classical text by Barlow and Proschan [2], an excellent textbook by Rausand and Høyland [8], and a mathematically oriented reliability monograph by Aven and Jensen [1], these books can be considered as first-choice complementary or further general reading. On the other hand, a useful introduction to burn-in can be found in Jensen and Petersen [5], whereas numerous relevant facts and results on stochastics for heterogeneous populations are covered in Finkelstein [4].

The book is mostly targeted at researchers and ‘quantitative engineers’. The first two chapters, however, can be used by undergraduate students as a supplement to a basic course in reliability; for the rest of the book, the reader should be familiar with the basics of reliability theory. The other parts can form a basis for graduate courses on shock modeling, burn-in, and mixture failure rate modeling for students in probability, statistics, and engineering.

Note that all necessary acronyms and notation are defined in the appropriate parts of the text, when the corresponding symbol or abbreviation is used for the first time. For convenience, where appropriate, these explanations are often repeated later in the text as well, so that each section is self-contained in terms of notation.

1.2 Brief Overview

Chapter 2 is devoted to reliability basics and can be viewed as a brief introduction to some reliability notions and results that are extensively used in the rest of the book. We pay considerable attention to the crucial reliability notions of the failure (hazard) rate and the remaining (residual) life functions. The shapes of the failure rate and of the mean remaining life function are especially important for the presentation of the chapters devoted to burn-in and heterogeneous populations. On the other hand, the sections devoted to basic properties of stochastic point processes are helpful for the presentation of Chaps. 3 and 4, which deal with the theory and applications of shock models. Note that, in this chapter, we consider mostly those facts, definitions, and properties that are necessary for the subsequent presentation and do not aim at a general introduction to reliability theory.

Chapter 3 deals mostly with basic shock models and their simplest applications. Along with discussing some general approaches and results, we present the material necessary for describing our recent results on shock modeling in Chap. 4. As in the other chapters of this book, we do not intend to perform a comprehensive literature review of this topic, but rather concentrate on the notions and results that are vital for the subsequent presentation. We understand the term “shock” in a very broad sense as some instantaneous, potentially harmful event (e.g., electrical impulses of large magnitude, demands for energy in biological objects, insurance claims in finance, etc.). The consequences of shocks for a system (object) are basically twofold. On the one hand, under certain assumptions, we can consider shocks that either ‘kill’ a system or are successfully survived without any impact on its future performance (as good as old). The corresponding models are usually called extreme shock models. On the other hand, the setting in which each shock results in an additive damage (wear) to a system is often described in terms of cumulative shock models. In the latter case, the failure occurs when the cumulative effect of shocks reaches some deterministic or random level and, therefore, this setting is useful for modeling degradation (wear). We first briefly discuss several of the simplest stochastic models of wear that are helpful in describing basic cumulative shock models. In the rest of the chapter, we mostly consider the basic facts about extreme and cumulative shock models and also describe several meaningful modifications and applications of extreme shock modeling.
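
To make the two settings concrete, the following Python sketch (again our own illustration; the homogeneous Poisson rate, the survival probability of a shock, the damage distribution, and the failure boundary are all assumed values) simulates a Poisson process of shocks and applies both failure mechanisms to the same realization.

```python
import numpy as np

rng = np.random.default_rng(7)
rate, horizon = 0.5, 100.0          # Poisson shock rate and observation window (assumed)
p = 0.9                             # probability that a shock is survived (extreme model)
mean_damage, boundary = 1.0, 20.0   # i.i.d. damages and failure boundary (cumulative model)

def shock_times():
    """Arrival times of a homogeneous Poisson process on [0, horizon]."""
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / rate)
        if t > horizon:
            return np.array(times)
        times.append(t)

def extreme_model_failure(times):
    """Failure at the first shock that is not survived (each shock is
    survived independently with probability p)."""
    for t in times:
        if rng.random() > p:
            return t
    return np.inf                   # all shocks in the window were survived

def cumulative_model_failure(times):
    """Failure when accumulated i.i.d. exponential damages cross the boundary."""
    wear = 0.0
    for t in times:
        wear += rng.exponential(mean_damage)
        if wear >= boundary:
            return t
    return np.inf

times = shock_times()
print("extreme shock model, failure time:   ", extreme_model_failure(times))
print("cumulative shock model, failure time:", cumulative_model_failure(times))
```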

In Chap. 4, we extend and generalize the approaches and results of the previous chapter to various reliability-related settings of a more complex nature. We relax some assumptions of the traditional models, except the one that defines the underlying shock process as the nonhomogeneous Poisson process (NHPP). Only in the last section of this chapter do we suggest an alternative to the Poisson process, which we call the geometric point process. It is remarkable that although the members of the class of geometric processes do not possess the property of independent increments, some shock models for this class can be effectively described without specifying the corresponding dependence structure. The chapter is rather technical in nature; however, the formulations of the results are reasonably simple and are illustrated by meaningful examples. In extreme shock models, only the impact of the current, possibly fatal shock is usually taken into account, whereas in cumulative shock models, the impact of the preceding shocks is accumulated as well. In this chapter, we also combine extreme shock models with specific cumulative shock models and derive the probabilities of interest, e.g., the probability that the process will not be terminated during a ‘mission time’. We also consider some meaningful interpretations and examples.
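
Since the NHPP assumption is central to this chapter, we note that such a shock process is easy to simulate by the standard Lewis–Shedler thinning method; the sketch below (the intensity function and the time horizon are illustrative assumptions) generates the arrival times of an NHPP whose rate function is bounded on the interval of interest.

```python
import numpy as np

rng = np.random.default_rng(3)

def nhpp_times(rate, rate_max, horizon):
    """Lewis-Shedler thinning: simulate a homogeneous Poisson process with
    rate rate_max and keep each arrival t with probability rate(t) / rate_max."""
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / rate_max)
        if t > horizon:
            return np.array(times)
        if rng.random() < rate(t) / rate_max:
            times.append(t)

# Illustrative increasing intensity nu(t) = 0.1 + 0.02 t on [0, 50] (assumed).
nu = lambda t: 0.1 + 0.02 * t
arrivals = nhpp_times(nu, nu(50.0), 50.0)
print(len(arrivals), "shocks; first few at:", np.round(arrivals[:5], 2))
```

Thinning requires only an upper bound on the rate function, which makes it convenient when the cumulative intensity has no closed-form inverse.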

Chapter 5 deals with heterogeneity in stochastic modeling. Homogeneity of objects is a property that is very rare in nature and in industry; one can hardly find truly homogeneous populations in real life, and yet most reliability modeling deals with the homogeneous case. Due to the instability of production processes, environmental and other factors, most real-life populations of manufactured items are heterogeneous. Similar considerations are obviously true for biological items (organisms). Neglecting heterogeneity can lead to serious errors in the reliability analysis of items and, as a consequence, to substantial economic losses. The stochastic analysis of heterogeneous populations poses the significant challenge of developing mathematical descriptions of the corresponding reliability indices. Mixtures of distributions usually present an effective mathematical tool for modeling heterogeneity, especially when we are interested in the failure rate, which is a conditional characteristic. In heterogeneous populations, the analysis of the shape of the mixture (population) failure rate becomes even more meaningful. It is well known, e.g., that mixtures of decreasing failure rate (DFR) distributions are always DFR. On the other hand, the failure rate of a mixture of increasing failure rate (IFR) distributions can decrease, at least in some intervals of time. Note that IFR distributions are often used to model lifetimes governed by aging processes. Therefore, the operation of mixing can dramatically change the pattern of population aging, e.g., from positive aging (IFR) to negative aging (DFR). These properties are very important when considering burn-in for heterogeneous populations of manufactured items. In this chapter, we first present a brief survey of results relevant to our discussion in this and the subsequent chapters. In the rest of the chapter, some new applications of mixture failure rate modeling are discussed and basic facts to be used in the subsequent chapters are presented.
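
A quick numerical illustration of this mixing effect (our own toy example with assumed parameters, not a model from Chap. 5): for a two-point mixture, the mixture failure rate is the ratio of the mixed density to the mixed survival function, and with two IFR Weibull subpopulations whose scales differ substantially, it can be non-monotone even though each subpopulation failure rate increases.

```python
import numpy as np

# Two IFR Weibull subpopulations (common shape k > 1, very different scales).
k = 2.0
scale_weak, scale_strong = 1.0, 10.0
p_weak = 0.3                        # mixing proportion of the weak subpopulation

def S(t, s):
    """Weibull survival function."""
    return np.exp(-((t / s) ** k))

def f(t, s):
    """Weibull density."""
    return (k / s) * (t / s) ** (k - 1) * S(t, s)

def mixture_rate(t):
    """Mixture failure rate: mixed density over mixed survival function."""
    num = p_weak * f(t, scale_weak) + (1 - p_weak) * f(t, scale_strong)
    den = p_weak * S(t, scale_weak) + (1 - p_weak) * S(t, scale_strong)
    return num / den

for t in (0.5, 1.0, 2.0, 3.0, 5.0, 10.0):
    print(f"t = {t:4.1f}:  mixture failure rate = {mixture_rate(t):.4f}")
```

With these values, the printed rates first increase, then drop sharply as the weak subpopulation dies out, and then increase again, following the failure rate of the strong subpopulation.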

In Chap. 6, we introduce the concept of burn-in and review the ‘initial research’ in this area. Burn-in is a method of eliminating initial failures (infant mortality) in items before they are shipped to customers or put into field operation. It is important to determine an optimal duration of burn-in because, if this procedure is too short, the items with shorter lifetimes will still remain in the population, whereas, if it is too long, it shortens the life spans of items with ‘normal’ lifetimes and also incurs additional costs. By investigating the relationship between the population failure rate and the corresponding performance quality measures, we illustrate how the burn-in procedure can be justified for items with initially decreasing failure rates. First, we review some important ‘classical’ papers that consider the minimization of various cost functions for given optimization criteria. Burn-in is generally considered to be expensive and, therefore, its length is usually limited. Furthermore, for today’s highly reliable products, many latent failures or weak components require a long time to detect or identify. Thus, as stated in Block and Savits [3], to shorten this procedure, burn-in is often performed in an accelerated environment. Therefore, in the last part of this chapter, we introduce several stochastic models for accelerated burn-in.
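
As a toy numerical illustration of this trade-off (the cost structure, the two-point exponential mixture, and all parameter values below are entirely our own assumptions for this sketch and are not taken from Chap. 6), one can minimize a simple expected cost that charges for burn-in time and for a field failure during a mission.

```python
import numpy as np

# Two-point exponential mixture population (all values are assumptions).
p_weak, rate_weak, rate_strong = 0.4, 1.0, 0.05

def S(t):
    """Population (mixture) survival function; its failure rate is DFR."""
    return p_weak * np.exp(-rate_weak * t) + (1 - p_weak) * np.exp(-rate_strong * t)

tau = 10.0      # field mission time
c_time = 0.2    # burn-in cost per unit of time
c_fail = 10.0   # cost of a failure during the field mission

def expected_cost(b):
    """Linear burn-in cost plus the expected field-failure cost for an item
    that has survived burn-in of duration b."""
    mission_failure = 1.0 - S(b + tau) / S(b)   # P(fail in mission | survived b)
    return c_time * b + c_fail * mission_failure

grid = np.linspace(0.0, 15.0, 1501)
costs = np.array([expected_cost(b) for b in grid])
print(f"optimal burn-in duration ~ {grid[np.argmin(costs)]:.2f}, "
      f"minimal expected cost {costs.min():.3f}")
```

If the burn-in cost rate is set to zero, the optimum moves to the right end of the grid, reflecting the fact that, for this DFR population, longer burn-in always improves the conditional mission reliability.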

Chapter 7 deals mostly with burn-in for repairable items. When a non-repairable item fails during burn-in (the case considered in the previous chapter), it is simply scrapped and discarded. However, when dealing with expensive products or complex devices, the complete product will typically not be discarded because of a failure during burn-in; rather, a repair will be performed. Following the influential survey by Block and Savits [3], there has been intensive research on burn-in for repairable systems. The main directions of recent studies include the following: (i) various reliability models that jointly deal with burn-in and maintenance; (ii) burn-in procedures for general failure models; (iii) stochastic models for accelerated burn-in. In this chapter, recent developments in burn-in methodology are reviewed, focusing mainly on burn-in procedures for minimally repairable systems. General repair models for burn-in constitute an interesting and challenging topic for further studies.

Chapter 8 is devoted to burn-in for heterogeneous populations of items. In Chaps. 6 and 7, burn-in procedures for homogeneous populations are discussed. Burn-in can usually be justified when the failure rate of a population is decreasing or bathtub-shaped, and it is well known that the heterogeneity of populations is often the reason for the initial decrease in the failure rate. In this chapter, the optimal burn-in procedure is investigated without assuming that the population failure rate is bathtub-shaped. We first consider a mixed population composed of two ordered subpopulations: the subpopulation of strong items (items with ‘normal’ lifetimes) and that of weak items (items with shorter lifetimes). Then the continuous mixture model is also discussed in detail. Our goal is to describe the optimization of various characteristics of the performance quality of items after burn-in. It is well known that when the failure rate of a component is increasing there is no need to perform burn-in, and that only when it is decreasing or non-monotone is there a case for burn-in. We show that this reasoning is usually valid only for homogeneous populations. When we deal with heterogeneous populations, the situation can be dramatically different, and burn-in can be justified even for increasing failure rates. Furthermore, for heterogeneous populations, there exists a risk of selecting items with poor reliability characteristics (i.e., with large failure rates), which is undesirable in practice. Therefore, to account for this situation, we also develop a special burn-in procedure that minimizes these specific risks.

In Chap. 9, we apply the stochastic theory of shocks described in the previous parts of this book to burn-in modeling. In conventional burn-in, the main parameter of the procedure is its duration; however, in order to shorten it, burn-in is often performed in an accelerated environment. This indicates that a large environmental stress can be effective in eliminating weak items from a population. In this case, obviously, larger values of stress should correspond to shorter durations of burn-in. By letting the stress increase, we can end up (in the limit) with very short (negligible) durations, in other words, with shocks. Then the stress level can be considered as a controllable parameter for the corresponding optimization, which in a loose sense is an analog of the burn-in duration in accelerated burn-in. This general reasoning suggests that ‘electrical’, ‘thermal’, and ‘mechanical’ shocks can be used for burn-in in heterogeneous populations of items. Therefore, in this chapter, we consider shocks (i.e., ‘instantaneous’ stresses of large level) as a method of burn-in and develop the corresponding optimization model. As in the previous chapters, we assume that our population is a mixture of stochastically ordered subpopulations, and, as before, we consider both discrete and continuous mixture models. Under this and some other natural assumptions, we discuss the problem of determining the optimal severity level of a stress. We also develop a burn-in model for items that operate in an environment with shocks. For this, we assume that there are two competing causes of failure: the ‘usual’ one (in accordance with the aging processes in a system) and environmental shocks. A new type of burn-in via controlled (laboratory) test shocks is considered, and the problem of obtaining the optimal level (severity) of these shocks is investigated as well.

Chapter 10 describes Environmental Stress Screening (ESS) as another (although related to burn-in) method of eliminating weak items. There are different ways of improving the reliability characteristics of manufactured items. The most common methodology adopted in industry, as described in the previous chapters, is burn-in, which is a method of ‘eliminating’ initial failures (infant mortality). Usually, to burn in a component or a system means to subject it to a fixed period of simulated use prior to actual operation. Thus, the ‘sufficient condition’ for employing traditional burn-in is an initially decreasing failure rate. It should be noted, however, that not all populations of engineering items that contain ‘weaker’ items to be eliminated exhibit this shape of the failure rate. For example, the ‘weakness’ of some manufactured items can result from latent defects that create additional failure modes. The failure rate in this case is not necessarily decreasing and, therefore, traditional burn-in should not be applied. However, by applying a short-time excessive stress, the weaker items in a population with an increasing failure rate can be eliminated by the ESS and, therefore, the reliability characteristics of the population of items that have successfully passed the ESS test can be improved. This is a crucial distinction between the ESS and burn-in. Another important distinction of the considered model from burn-in is that the ESS can also create new defects in items that were previously defect-free. In this chapter, we develop stochastic models for the ESS, analyze its effect on the population characteristics of the screened items, and describe the related optimization problems. We assume that, due to substandard materials or a faulty manufacturing process, some of the manufactured items are susceptible to an additional cause of failure (failure mode), i.e., shocks (such as electrical or mechanical shocks). We define the ESS as a procedure of applying a shock of controlled magnitude, i.e., a short-time excessive stress.