1 Introduction

Maintenance is defined in an IEEE/PES Task Force report [1] as an activity “wherein an unfailed device has, from time to time, its deterioration arrested, reduced or eliminated.” It is an important part of asset management. As deterioration increases, the asset value (condition) of a device decreases; the connection between asset value, time, maintenance and reliability is shown in Fig. 1. The curves in the figure are called life curves. Since they are derived from probabilistic information, the times shown represent means.

Fig. 1 Life curves

The maintenance policy is aimed at achieving failure-free operation of the system and prolonging the remaining life of equipment. The remaining lifetime of a device depends to a large extent on two factors: the frequency of inspections (technical surveys) and the quality of repairs (for a given part of a device, either only the most crucial and necessary repairs can be made or a complete overhaul can be performed). Deciding both when the inspections should be performed and which components should be repaired is a difficult task. Usually, when an inspection takes place, the equipment is temporarily unavailable, which results in additional costs. As a result, utilization costs can become inflated if inspections are made too frequently.

To address this problem, we will start this chapter with a discussion of how the life curves can be used to find an optimal maintenance policy. The presentation will show that once the life curves are generated for various maintenance policies, the most advantageous one from the life-extension point of view can be selected by performing a sensitivity analysis. However, to find a truly optimized policy, we need a mathematical model. We will briefly discuss deterministic and probabilistic modeling of maintenance activities and then focus on a Markov model that can be optimized with respect not only to the remaining life of the equipment but also to its availability and the maintenance costs.

Maintenance activities on power system equipment are not undertaken in isolation from the system performance requirements. For the Maintenance or Asset Sustainment function at an electric utility, the following aspect of the decision-making process is of particular interest:

Faced with multiple options for re-investment on a particular set of equipment like breakers, disconnects, transformers, etc. (e.g., do nothing, continue with current maintenance practice, refurbish, replace, monitor, and so on), what is the best course of action to maximize reliability at minimum cost?

The effects of changes in maintenance policy are difficult to foresee since there is usually no historical data reflecting the performance of the component subject to the revised policy. Here, mathematical models offer invaluable help and, as mentioned above, one such model utilizing a Markov chain will be investigated in this chapter.

Changes in a component maintenance policy are usually undertaken in order to improve the reliability of supply to the customers most affected by the performance of this component and, occasionally, to generate savings in system operation and maintenance for the utility. This aspect of the problem is seldom modeled mathematically, mostly because of a lack of easy-to-use and reliable tools for modeling the complex operation of a substation or of a small area with several substations. This problem will be addressed in more detail in this chapter as well.

Occasionally, changes in a component maintenance policy may have a profound effect on the reliability of the larger area or even on the entire system. To measure this effect, one needs to model reliability of the entire power system under consideration.

Recognizing the interdependence of a component maintenance policy with area and system reliability, a new paradigm in reliability analysis, combining the notions of component, small-area and system reliability into a single application, was proposed in [2] and will be summarized here. This is a conceptual leap from the traditional thinking where each aspect of system operation is analyzed separately. The concepts reflecting this new way of thinking were implemented in a computer platform that allows an analysis of a component maintenance policy in the context of customer, area and system needs. An example of a study with this platform will be discussed. The idea of linking component maintenance with a small-area distribution reliability analysis has been explored in [3, 4]; however, no analytical tools were involved in that analysis.

Looking at a component maintenance policy from a larger perspective brings one additional important aspect into play. Namely, with a limited maintenance budget, the question arises which component in the system should be maintained first. Traditionally, maintenance policies followed a time-based pattern suggested by the equipment manufacturers. Recently, Reliability Centered Maintenance (RCM) has been applied in many electric utilities.

The cornerstone of the RCM methodology is a classification of component importance in the system operation. This aspect is also implemented in the approach described in this chapter. A numerical example illustrating these concepts on the 24-bus IEEE reliability test system is presented in the final part of the chapter.

2 Selecting the Best Maintenance Alternative

From a system point of view, the governing thought is that, starting with a prescribed budget, we have to identify the assets that should be put on the priority list of components whose maintenance policy will affect the key performance indicators the most. The initial step in the analysis is, therefore, a bulk electric system reliability study. The study involves analysis of the effects of failures of major components such as lines, transformers and generators. The methods for bulk power system reliability modeling are well established and, since the literature is vast, the reader is referred to one of the many papers listed in [5–9]. Since all commercial programs for assessing BES reliability are geared to the analysis of very large systems, the buses are assigned zero failure rates at this stage. As a result of this step, the most vulnerable load buses (delivery points) are identified. From this analysis, a single bus or a set of neighboring buses is selected for further studies.

The selected buses form a small area that is now analyzed in more detail. A network diagram of the selected area shows all the important components that affect customer reliability in this region. Such an area can have several hundred components that are maintained and are subject to failure. A reliability analysis of the reduced system is now performed taking into account constituent components’ reliability characteristics. This includes modeling of protection system operation, common mode outages, maintenance-dependent outages and others. The component reliability indices may be either based on the utility’s historical experience or can be taken from the available external databases. As a result of this analysis, the reliability indices computed at this stage are assigned to the bus(es) representing the station(s) in the BES study or to the components connected to the buses of interest. These indices are now entered in the bulk electric system reliability program and the first step is repeated. There are several possible ways the station indices could be transferred to the BES reliability evaluation program. The new reliability indices reflect the performance of individual components forming the substations in the selected area. This will be our base case scenario.

It is important to mention that this part of the proposed approach is philosophically different from the methods published in the literature dealing with the evaluation of BES reliability taking into account station-originated outages [5, 10–13]. Paper [10] proposed some models to take station-originated failures into account in bulk electric system reliability evaluation. Some computational techniques that evaluate station-related failures have also been proposed in the remaining references listed above. However, these papers have concentrated on the concepts and effects of station-originated outages and not on methods of identifying them. Paper [14] simulated various failure modes of station components and computed the reliability indices of connected lines and generators. The approach described in this chapter is a variation of this concept with a more comprehensive modeling of substation operation.

Even though we have “homed in” on the area of interest, there are still too many components for which the maintenance policies should be analyzed. The next step is then a prioritization of all components in the selected region. There are two types of prioritization lists, one ranking the components on the basis of their structural importance, the other on the basis of their reliability importance. These two lists may result in quite different rankings of components, as discussed in the numerical example presented here. Components at the top of either list are selected for further studies.

A model of the component deterioration process, taking into account the presently applied maintenance policy, is now built. One of the outcomes of this process is the evaluation of the component failure rate. If the computed failure rate differs significantly from the one used in the base case study, the above two steps can be repeated with the new failure rate of the component and a new base case scenario established.

We are now ready to contemplate changes in the maintenance policy of the selected component(s). The analysis results in a new failure rate for this component. The area and system studies are then repeated with the new information and the effect of the new maintenance policy analyzed. The diagram in Fig. 2 summarizes the procedure described above. In this figure, the programs used by the author are named: REAL for BES reliability evaluation, WinAREP for small area reliability analysis, and Asset Management Planner (AMP) and RiBAM for component maintenance investigations. The programs are described in more detail later in the chapter.

Fig. 2 Flowchart for maintenance strategy optimization

The proposed approach required the construction of a computer platform that allows seamless transfer of data and results between the various computer programs forming its constituent parts. The computational engines employed in the platform are only briefly described here. References to published articles describing their features are included.

The presentation in this chapter will start with the analysis of component maintenance policies and their optimization and will conclude with a review of the effect of the component policy on system reliability.

3 Maintenance Optimization with Life Curves [2]

Conditions for three maintenance policies are illustrated in Fig. 1, including Policy 0 where no maintenance is performed at all, and Policies 1 and 2 where maintenance is performed according to different rules.

Let failure be defined as the asset condition where the asset value becomes zero, and lifetime as the mean time it takes to reach this condition; furthermore, let reliability be linked with the mean time to failure. Now, the life extensions from T0 to T1 when Policy 1 is applied instead of Policy 0, and from T1 to T2 when Policy 1 is replaced by Policy 2, can be clearly seen in the figure. So are the changes in the asset condition (value) at any time T. Note that in a given study, failure, lifetime and reliability can be defined differently; e.g., failure could be tied to any asset condition which is deemed unacceptable.

As far as reliability is concerned, Policy 2 is clearly superior to Policy 1. It is also obvious that maintenance affects component and system reliability. But maintenance has its own costs, and when comparing policies, this has to be taken into account. The increasing costs of carrying out maintenance more frequently must be balanced against the gains resulting from improved reliability. When costs are also considered, Policy 2 in Fig. 1 may be very costly and, therefore, may not be superior to Policy 1.

It is possible to study life curves to find an optimal maintenance policy. However, since no mathematical model exists to represent the relationships shown in Fig. 1, the way to proceed is to do a case analysis, as shown in the following example.

3.1 Example for High-Voltage Air-Blast Breakers Using the Life Curve Concept

3.1.1 General

This study involves the analysis of several breakers with a total operating history of about 100 breaker-years. According to the current policy, three types of maintenance are routinely performed on each breaker. About every 8 months, minor maintenance (timing adjustments, lubrication) is carried out at a cost of about $700. Its average duration is 0.25 day. Approximately every 10 years, medium maintenance is performed involving replacement of some parts, taking on the average 2 days, at a cost of about $6,000. Major maintenance involving breaker overhaul takes place every 15 years, with an average duration of 22 days and a cost of about $75,000.

In the study, four alternative maintenance policies are compared. The first option is to continue with the present maintenance policy. The second is to do nothing, i.e., to run the equipment in the future without any maintenance. The third is to perform major overhaul, followed by a slightly modified version of the original policy. The last option is to replace the equipment with a new one ($90,000) and continue with the modified maintenance policy. The modified policy differs from the original one in that the minor maintenance after overhaul or replacement is scheduled every 15 months instead of every 8 months, a reduced maintenance policy.

3.1.2 Life Curves

Figure 3 shows life curves for the three “basic” policies. Considering the four options above, the life curves for two of them, the one involving replacement and the one stopping all maintenance, are shown in Fig. 3. These curves were derived with the AMP and RiBAM programs, described later, using the following assumptions: (a) the “present” moment when the choice is made among the options is 20 years into the life of the breakers, (b) if the option chosen requires action (replacement, overhaul), there is a delay of 3 years before the action is implemented, and (c) upon failure, repair is performed which brings the device to an assumed 90% of its original condition.

Fig. 3 Life curves with (a) replacement, (b) no maintenance, after a 3-year delay

3.1.3 Cost Studies

In a financial evaluation, a time horizon must be selected which usually starts at the present, when the study is made, and includes a predetermined number of years for which the costs of the various operating and maintenance options are calculated and compared. For the present study, a time horizon of 10 years was selected. This is also shown in Fig. 3 by a horizontal line between the 20th and 30th year marks.

Cost computations involve the calculation of the expected number of failures, and of the various types of maintenance activities, during the specified time horizon. The cost of each maintenance activity is expressed by its present value. The costs are then expressed as functions of the delay.
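The present-value comparison can be sketched in a few lines of code. In the sketch below, the expected yearly event counts, unit costs and discount rate are illustrative placeholders, not the values used in the actual study.

```python
# Sketch of a present-value cost comparison for one maintenance option.
# The expected yearly event counts, unit costs and discount rate are
# illustrative placeholders, not the data of the breaker study.

def present_value_cost(expected_counts, unit_costs, discount_rate, horizon_years):
    """Discount the expected yearly cost of failures and maintenance
    activities over the study horizon and return the total present value."""
    total = 0.0
    for year in range(1, horizon_years + 1):
        yearly = sum(expected_counts[a] * unit_costs[a] for a in unit_costs)
        total += yearly / (1.0 + discount_rate) ** year
    return total

# Hypothetical expected numbers of events per year for one option.
counts = {"failure": 0.05, "minor": 1.5, "medium": 0.1, "major": 0.067}
costs = {"failure": 120_000, "minor": 700, "medium": 6_000, "major": 75_000}

print(f"PV over 10 years: ${present_value_cost(counts, costs, 0.06, 10):,.0f}")
```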

Figure 4 illustrates the present costs for all options with a 3-year delay for each. The diagram shows that in the given case the best option (of those considered) is to continue with the original maintenance policy. The expected cost of this is $100,000 for the 10-year time horizon. The costs are highest for the “Stop All Maintenance” option because the probability of failure is much higher than for the other options. The maintenance cost is high for the “Continue as Before” policy because minor maintenance is performed quite often.

Fig. 4 Cost diagram for the various options

To summarize, while for highest reliability the “Install New Breaker” policy is the choice, for lowest costs the “Continue as Before” option must be selected.

3.1.4 Sensitivity Studies

It is quite possible to perform optimization of each option with regard to, say, maintenance frequency. Such studies were not carried out. Instead, sensitivity studies were performed to find out how “robust” the findings are if some of the input values are subject to uncertainty. While details are not discussed here, the results show that the cost of the option “Continue as Before” appears to be little affected if several of the input data are varied around their assumed values.

3.2 Review of Maintenance Approaches

3.2.1 Regular Versus “As Needed” Maintenance

Maintenance has been performed for a long time on a great variety of devices, machines and structures. Traditionally, maintenance policies have been chosen either on the basis of long-time experience or by following the recommendations of manuals issued by manufacturers. In both cases, maintenance has been carried out at regular, fixed intervals. This practice is also called scheduled maintenance and, to this day, this is the maintenance policy most frequently used by electric utilities.

It was found, however, that scheduled maintenance may be quite costly in the long run, and may not extend component lifetime sufficiently. For the last 15 years or so, variations of a new approach have been tried and implemented by many industrial undertakings and several electric power utilities. The essence of this approach is that maintenance should be undertaken not regularly but only when needed. Such an approach is called predictive maintenance. To find out when maintenance is needed, condition monitoring—periodic or continuous—and appropriate criteria for triggering action are required.

3.2.2 Improvement Versus Replacement

Maintenance activities may result in the restoration of a device to a condition better than the one it was found in, or in its replacement with a new one. However, for a long time it had been assumed that any restoration would result in “as new” conditions, which clearly is not what happens in practice. Most often, only limited improvement takes place; this, however, is very difficult to take into account.

A large number of replacement policies are described in the literature; in fact, most of the literature concerns itself with replacement only, neglecting the possibility that maintenance may result in smaller improvements at smaller costs. Maintenance policies involving limited condition improvement are mostly based on experience, and such empirical approaches cannot predict and compare changes in reliability as a result of applying various maintenance policies.

3.2.3 Empirical Approaches Versus Mathematical Models

Empirical approaches are based on experience and manufacturers’ recommendations. This does not mean that they are necessarily very simple. The method called Reliability Centered Maintenance (RCM), introduced about 18 years ago [15], is empirical, yet quite sophisticated. It is based on condition monitoring (and, therefore, may not follow rigid maintenance schedules), failure cause analysis and an investigation of operating needs and priorities. From this information, it selects the critical components in a system (those which are dominant contributors to system failure or to the resulting financial loss) and initiates more stringent maintenance programs for these components. It assists in deciding where the next dollar budgeted for maintenance should go.

An important advantage of the RCM approach is that it also considers external, non-deterioration-originated failures (e.g., those caused by weather, animals and humans). A good example is the case of overhead lines in distribution systems. According to fault and interruption statistics in the UK, the percentages of failure causes of such lines are the following (since only the dominant failure causes are shown, the percentages are rounded and do not add up to 100):

  • Weather 55%;

  • Damage from animals 5%;

  • Human damage 3%;

  • Trees 11%;

  • Aging 14%.

The conclusion appears to be that the maintenance budget for overhead lines should be divided almost equally between internal and external programs. The external budget would be spent mostly on tree trimming and some design changes, such as the erection of barriers and fences.

Maintenance policies based on mathematical models are much more flexible than heuristic policies. Mathematical models can incorporate a wide variety of assumptions and constraints, but in the process they can become quite complex. A great advantage of the mathematical approach is that the outcomes can be optimized. Optimization with regard to changes in some basic model parameter can be carried out for maximal reliability or minimal costs.

Mathematical models can be deterministic or probabilistic. Since maintenance models are used for predicting the effects of maintenance in the future, probabilistic methods are more appropriate than deterministic ones, even if the price for their use is increased complexity and a consequent loss in transparency. For these reasons, the use of such methods is spreading only slowly.

The simpler mathematical models are still based on fixed maintenance intervals (scheduled maintenance), and optimization will be carried out, in most cases, through sensitivity analysis, by varying, say, the frequency of maintenance. More complex models incorporate the idea of condition monitoring where decisions about the timing and amount of maintenance are dependent on the actual condition of the device (predictive maintenance). Such policies can be optimized with respect to any of the model parameters, such as the frequency of inspections.

3.3 Linking Component Reliability and Maintenance: A Probabilistic Approach

3.3.1 Basic Models

A simple failure-repair process for a deteriorating device is shown in Fig. 5, where the various states in the diagram are identified. The deterioration process is represented by a sequence of stages of increasing wear, finally leading to equipment failure. Deterioration is, of course, a continuous process in time; it is considered here in discrete steps only for easier modeling.

Fig. 5 State diagram with stages of deterioration (D1, D2, …); F failure state

The number of deterioration stages may vary, and so do their definitions. In most applications, the stages are defined through physical signs such as markers of wear or corrosion. This, of course, makes periodic inspections necessary to determine the stage of deterioration the device has reached. The mean times spent in the stages are usually uneven, and are selected from performance data or by judgment based on experience.

The process in Fig. 5 can be readily represented by a probabilistic mathematical model. If the rates of transitions shown between the states can be assumed time-independent, the mathematical models describing such processes are known as Markov models. Well-known techniques exist for the solution of these models [16]. It can be proven that in a Markov model the times of transitions between states are exponentially distributed. This property and the constant-rate property follow from each other.

3.3.2 The Effect of Maintenance

One way of incorporating maintenance into the model in Fig. 5 is shown in Fig. 6. It is immediately clear that in this arrangement there is no assumption made that maintenance would produce “new” conditions; in fact, the effect of maintenance can now be limited: it is assumed that it will improve the device’s condition to that which existed in the previous stage of deterioration. This contrasts with many strategies described in the literature where maintenance is considered equivalent to replacement.

Fig. 6 State diagram with three stages of deterioration and maintenance (F failure state)

If a failure has external causes (e.g., inclement weather), there is a single step from the working to the failed state. Now, the constant failure-rate assumption leads to the result that maintenance cannot produce any improvement because the chances of failure in any future time interval are the same with or without maintenance (a property of the exponential distribution). That maintenance will not do any good in such cases agrees with experience as expressed by the oft-quoted piece of wisdom: “If it ain’t broke, don’t fix it!” The situation is quite different for deterioration processes where the times from new conditions to failure are not exponentially distributed even if the times between subsequent stages of deterioration are (this can be rigorously proven). In such a process, maintenance will bring about improvement, and one can conclude that if failures are the consequence of aging, maintenance has an important role to play.
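The statement about constant failure rates follows directly from the memoryless property of the exponential distribution. For a constant failure rate λ, the probability of surviving a further time t, given survival up to time s, is

$$ P(T > t + s \mid T > s) = \frac{e^{-\lambda (t + s)}}{e^{-\lambda s}} = e^{-\lambda t} = P(T > t), $$

so the residual life after any survival time s, with or without maintenance, has the same distribution as the life of a new device.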

In Fig. 6, the dotted-line transitions to and from state M1 indicate that maintenance while in state D1 should really not be performed because it would lead back to state D1 and, therefore, it would be meaningless. State M1 could be omitted if the maintainer knew that the deterioration process was still in its first stage and, therefore, no maintenance was necessary. Otherwise, maintenance must be carried out regularly from the beginning, and state M1 must be part of the diagram.

It should be observed that this and similar models solve the problem of linking maintenance and reliability. Upon changing any of the maintenance parameters, the effect on reliability (say, the mean time to failure) can be readily computed.

3.3.3 A Practical Model

A more sophisticated model [17] based on the scheme in Fig. 6 and tested in practical applications is shown in Fig. 7. A program, called Asset Management Planner (AMP), using this model, was developed by Kinectrics Inc. in Toronto, Canada. It computes the probabilities, frequencies and mean durations of the states of a component exposed to deterioration but undergoing regular inspections and receiving preventive maintenance.

Fig. 7 The AMP model

Without maintenance, the path from the onset (entering D1) would run through the stages of deterioration to the failure state F. With maintenance, this straight path to failure is regularly deflected by inspection and maintenance.

According to the diagram, in all stages of deterioration, regular inspections take place (I1, I2, I3), possibly several times, and at the end of each inspection a decision is made to continue with minor (M) or major (MM) maintenance, or forgo maintenance and return the device to the state of deterioration it was in before the inspection. Another point of decision is after minor maintenance when, if the results are considered unsatisfactory, major maintenance can be initiated.

The result of all maintenance activities is expected to be a single-step improvement in the deterioration chain, following the principle shown in Fig. 7. However, allowances are made for instances when no improvement is achieved or even when some damage is done during maintenance, the latter resulting in the next stage of deterioration.

The choice probabilities (at the points of decision-making) and the probabilities associated with the various possible outcomes are based on user input and are estimated from historical records.

Another technique, developed for computing the so-called first passage times (FPTs) between states, will provide the average times of first reaching any state from any other state. Although not shown, the technique is implemented in the AMP model. If the end state is F, the FPTs are the mean remaining lifetimes from any of the initiating states.

This information is necessary for constructing life curves. It can be observed that the AMP model can handle both scheduled (regular) and predictive (as needed) maintenance policies.

Figure 6 shows an arrangement for scheduled maintenance: the rate of initiating maintenance is always the same (this rate is the reciprocal of the mean time to maintenance; the actual times constitute a random variable).

The scheme in Fig. 7 incorporates an arrangement for predictive maintenance. Condition monitoring is done through regular inspections, and if it is found that no maintenance is needed the device is returned to the “main line” without undergoing maintenance.

3.3.3.1 Mathematical Description of the Model in Fig. 7

Transition Rates. Assuming that the transition rates between states are known (computed from historical data), the transition rate matrix Q can be built with elements λij denoting the transition rate from state i to state j and:

$$ \lambda_{ii} = - \sum\limits_{j,j \ne i} {\lambda_{ij} } . $$
(1)

The transition rates from states Dx to Ix are computed as reciprocals of the mean times to inspection, while the transition rates from states Dx to Dy are reciprocals of the mean times for the device to reach the next stage of deterioration without any maintenance.

The repair states are characterized by two parameters: duration and the probabilities of departure to other states. The duration of a state can be determined from historical records for both the Ix and Mx states; in the first case, it is the average duration of inspections, in the second, the average time of performing the repairs. The departure rate from state i to j is then defined as

$$ \lambda_{ij} = {\frac{{p_{ij} }}{{d_{i} }}} $$
(2)

where pij is the probability of transition from state i to j.

We also have

$$ \sum\limits_{j} {p_{ij} = 1} \quad {\text{for}}\;i = 1,2, \ldots ,n $$
(3)

where n is the number of the repair and inspection states.

This definition of the transition rate matrix describes a semi-Markov process [18].

Cost of a State. In addition, for every state one can define the cost of residing in this state. This is especially important for the Mx states because it represents the cost of repairs.

Both values (costs and duration) can be written as two vectors: cost C and duration D.
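To make the construction of Q concrete, the following sketch assembles the matrix from assumed mean durations and departure probabilities according to (1) and (2). The state set and all numerical values are illustrative placeholders, not the AMP breaker data of Fig. 9.

```python
import numpy as np

# Minimal sketch of assembling the transition rate matrix Q for a model of
# the type shown in Fig. 7.  Rates out of the deterioration states Dx are
# entered directly as reciprocals of mean times (to inspection, or to the
# next stage of deterioration); departures from inspection, maintenance and
# failure states use (2), lambda_ij = p_ij / d_i; diagonal terms follow (1).
# All states and numbers are illustrative placeholders.

states = ["D1", "D2", "I1", "I2", "M2", "F"]
idx = {s: i for i, s in enumerate(states)}
Q = np.zeros((len(states), len(states)))

def rate(i, j, value):
    Q[idx[i], idx[j]] = value

# Deterioration states (rates in 1/years).
rate("D1", "I1", 1 / 0.67)     # inspection roughly every 8 months
rate("D1", "D2", 1 / 7.0)      # next deterioration stage after about 7 years
rate("D2", "I2", 1 / 0.67)
rate("D2", "F",  1 / 5.0)      # failure if nothing is done

# Inspection, maintenance and failure states: lambda_ij = p_ij / d_i (d_i in years).
d_I, d_M = 0.25 / 365, 2.0 / 365
rate("I1", "D1", 1.0 / d_I)                                # stage 1: no maintenance needed
rate("I2", "D2", 0.6 / d_I); rate("I2", "M2", 0.4 / d_I)   # decision after inspection
rate("M2", "D1", 0.9 / d_M); rate("M2", "D2", 0.1 / d_M)   # single-step improvement
rate("F",  "D1", 1 / (10 / 365))                           # repair back to service

np.fill_diagonal(Q, -Q.sum(axis=1))                        # eq. (1)
assert np.allclose(Q.sum(axis=1), 0.0)
print(np.round(Q, 2))
```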

4 Optimal Maintenance Policies for Power Equipment

There are many maintenance optimization models utilizing simplified deterministic mathematics [19]. One such model is presented as an illustration in the next section. A more sophisticated probabilistic optimization model is discussed afterwards.

4.1 A Simple Deterministic Model [19]

Consider a device that breaks down from time to time. To reduce the number of breakdowns, inspections are made n times a year when minor modifications may be carried out. The optimal number of inspections is to be determined which minimizes the total yearly outage time, consisting of the repair times after failures and the inspection durations.

Let the failure rate be λ(n) occurrences per year, where λ is independent of time but is a function of the inspection frequency. Therefore, the total downtime T(n) is also a function of n. Further, let it be assumed that

$$ \lambda (n) = {\frac{k}{n + 1}} $$
(4)

where the numerical value of k indicates the failure frequency when no inspections are made. If tr is the average duration of one repair and ti the average duration of one inspection, then

$$ T(n) = \lambda (n)t_{\text{r}} + nt_{\text{i}} . $$
(5)

Substituting (4), taking the derivative of T(n) with respect to n, and equating it with zero,

$$ {\frac{{{\text{d}}T(n)}}{{{\text{d}}n}}} = {\frac{{ - kt_{\text{r}} }}{{(n + 1)^{2} }}} + t_{\text{i}} = 0. $$
(6)

From (6), the optimal value of n becomes

$$ n_{\text{opt}} = \left( {{\frac{{kt_{\text{r}} }}{{t_{\text{i}} }}}} \right)^{0.5} - 1 $$
(7)

with k = 5 per year, tr = 6 h and ti = 0.6 h, one obtains nopt = 6.07 per year, i.e., the optimal inspection frequency is about one inspection every 2 months. The total outage time is then T(6) = 7.9 h/year, whereas without inspections it would be T(0) = 30 h/year.
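This result is easy to reproduce; the short sketch below simply re-evaluates (4), (5) and (7) with the stated values.

```python
# Numeric check of the simple deterministic model, eqs. (4)-(7).
k, t_r, t_i = 5.0, 6.0, 0.6            # failures/year with no inspections; hours

def outage_time(n):
    """Total yearly outage time T(n) = k*t_r/(n+1) + n*t_i, i.e., (5) with (4)."""
    return k * t_r / (n + 1) + n * t_i

n_opt = (k * t_r / t_i) ** 0.5 - 1     # eq. (7)
print(f"n_opt = {n_opt:.2f} inspections/year")   # about 6.07
print(f"T(6)  = {outage_time(6):.1f} h/year")    # about 7.9
print(f"T(0)  = {outage_time(0):.1f} h/year")    # 30.0
```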

As can be seen, optimization is easily included in mathematical models. On the other hand, modeling the relation between maintenance (inspection) and reliability (failure rate) is still a problem. In the example above, this relation is given by (4). It should be observed that this relation is assumed, and not a result of calculations. What is missing is a mathematical model where this relation is part of the model itself, and the effect of maintenance on reliability is part of the solution.

4.2 Maintenance Optimization with a Probabilistic Model

The following section describes a mathematical model for the selection of an optimal maintenance policy [20]. The original model described above presented a method for calculating the remaining life of equipment without suggesting how the modeled maintenance policy could be optimized. Here, we will define several possible optimization procedures for finding the best maintenance policy. The optimization process will be illustrated with an optimization algorithm for Markov models utilizing a simulated annealing approach in a practical numerical example involving high-voltage circuit breakers.

4.2.1 The Objective Function

In the optimization procedure discussed here, the quantities of interest are: (1) the Remaining Life of Equipment represented in the model as the FPT from the current deterioration state to the failure state [21], (2) the Life Cycle Costs represented as the cost of maintenance and failure, and (3) equipment Unavailability. Our goal is thus to define an optimization model that would minimize a function of these three parameters, i.e.:

$$ F(r) = \min \,f({\text{total}}\_{\text{cost}}, - {\text{FPT}},{\text{unavailability}}). $$
(8)

Vector r symbolizes the parameters of the model that can be varied and is described later in this chapter. To transform the multi-objective optimization problem described by (8) into a more practical single-objective formulation, f is defined as a special function that brings the three parameters to the same units of measurement; this function is described below.

The nature of the problem leads us to the decision to use an algorithm based on simulated annealing [20] to find an optimal solution.

A brief review of the way that the three input parameters are evaluated is given as follows.

4.2.1.1 FPT

In Markov theory, the FPT Tij represents the time when the model, starting from state i, reaches state j for the first time. In the considered model, the quantity of most interest is the time when the device reaches state F; this FPT is equivalent to the remaining lifetime of the equipment. The FPT is measured in years.
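A minimal sketch of the FPT calculation for a continuous-time Markov model is given below; the mean times to first reach the failure state are obtained by deleting the row and column of F from the transition rate matrix and solving a linear system. The three-state chain used here is a toy example, not the breaker model of Fig. 9.

```python
import numpy as np

# Mean first passage time (FPT) to the failure state F from every other
# state of a continuous-time Markov model with transition rate matrix Q.

def mean_fpt_to_failure(Q, failure_index=-1):
    """Expected hitting times t of the failure state satisfy
    Q_trunc @ t = -1 (with t_F = 0), where Q_trunc is Q with the
    failure state's row and column removed."""
    n = Q.shape[0]
    keep = [i for i in range(n) if i != failure_index % n]
    Q_trunc = Q[np.ix_(keep, keep)]
    return np.linalg.solve(Q_trunc, -np.ones(len(keep)))

# Toy chain D1 -> D2 -> F with repair F -> D1 (rates in 1/years):
Q = np.array([[-0.5,  0.5,  0.0],
              [ 0.0, -0.2,  0.2],
              [ 1.0,  0.0, -1.0]])
print(mean_fpt_to_failure(Q))   # remaining life from D1 and D2: [7.  5.] years
```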

4.2.1.2 Unavailability

During both inspection and repair, the device is temporarily out of service. The proposed model enables computation of the equipment unavailability. This value is usually measured in days per year.

In the model shown in Fig. 7, several values can be treated as parameters that can be modified in order to find the optimal solution. These parameters are:

  • frequency of making inspections (time to inspections). These parameters correspond to transitions from states Dx to Ix;

  • funds spent on maintenance (cost of states Mx);

  • durations of the repair states.

The last two items define the depth and the speed of repairs. Each of these quantities can be varied independently or simultaneously. The possible optimization scenarios will be described next.

4.2.2 Parameters of the Maintenance Optimization Problem

The term “optimal maintenance policy” implies a selection of maintenance parameters for which the function in (8) will reach its minimum. The parameters that can be varied are described as follows.

4.2.2.1 Time to Inspection Optimization (TTI)

Since a device is temporarily unavailable during inspections and every inspection entails additional costs, the main aim of the TTI optimization is to find the best points in time to perform the inspections.

Parameters that can be optimized are transition rates between deterioration states (Dx) and inspection states (Ix). This type of optimization changes the values of the elements of the transition rate matrix Q.

4.2.2.2 Cost Optimization

The second group of parameters that can be optimized are the costs of the states that represent the repairs (cost of the Mx states). After making a decision to spend additional funds on repair, one can expect the following effects:

  • time spent in the repair stage will be shorter;

  • better (deeper) repair—the equipment will be reconditioned with more care and it will end up in a “better” deterioration state. It does not necessarily mean though, that the repair time will be shorter.

4.2.2.3 Maintenance Time Optimization

In this optimization, we assume that the time of repair (the elements of the duration vector D) is a function of the funds spent on repair and that the probability of transition from state Mx to Dx is constant (the probability matrix P does not change). The parameters that are optimized are the elements of the cost vector C representing the repair states. On the basis of these costs, the duration of every state is computed (the elements of the duration vector D) and then the elements of the transition rate matrix Q are recalculated using (2).

4.2.2.4 Maintenance Depth Optimization

This type of optimization assumes that the probability of transition from a repair state to any other state is a function of the funds spent on repair, i.e., pi,j = f(ci). The rationale for this is that the more funds are spent on maintenance, the more likely it is that the equipment will end up in a higher (better) deterioration state than before the repairs. The parameters that are optimized are the elements of the cost vector C, but in this case the duration of the repair is constant (duration vector D does not change). After modification of the elements of the probability matrix P, the elements of the transition rate matrix Q are recalculated using (2).

4.2.3 Constraints

The constraints in this problem relate to the permissible changes in the components of the cost, probability and duration vectors. Thus, there are lower and upper limits on the amount of money available for maintenance and minimum and maximum times between inspections. Section 4.3 presents boundary conditions used in the numerical example. The optimization problems defined above will be solved using a simulated annealing algorithm [20].
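Reference [20] gives the details of the algorithm used; the following is only a generic simulated annealing skeleton for minimizing an objective of the form (8) under box constraints. The objective used here is a simple stand-in; in the actual model it would be evaluated by rebuilding Q from the candidate parameters and computing FPT, cost and unavailability.

```python
import math
import random

def simulated_annealing(f, r0, bounds, t0=1.0, cooling=0.95, steps=2000, seed=0):
    """Minimize f over the parameter vector r subject to box constraints."""
    rng = random.Random(seed)
    r, best = list(r0), list(r0)
    f_r = f_best = f(r)
    temp = t0
    for _ in range(steps):
        # Perturb one randomly chosen parameter within its bounds.
        i = rng.randrange(len(r))
        lo, hi = bounds[i]
        cand = list(r)
        cand[i] = min(hi, max(lo, r[i] + rng.gauss(0.0, 0.1 * (hi - lo))))
        f_cand = f(cand)
        # Accept improvements always, worse moves with Boltzmann probability.
        if f_cand <= f_r or rng.random() < math.exp(-(f_cand - f_r) / temp):
            r, f_r = cand, f_cand
            if f_r < f_best:
                best, f_best = list(r), f_r
        temp *= cooling
    return best, f_best

# Placeholder parameters: one time-to-inspection (years) and one repair cost ($).
bounds = [(1 / 365, 1.0), (100.0, 10_000.0)]
objective = lambda r: (r[0] - 0.5) ** 2 + ((r[1] - 4_000) / 10_000) ** 2
print(simulated_annealing(objective, [0.1, 500.0], bounds))
```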

4.2.4 Definition of the Optimization Function

Since the quantities to be optimized are expressed in different units and are of a different order of magnitude, it would be very difficult to formulate the objective function that would be just an algebraic sum of these variables. To address this problem, [20] proposed to use a notion of utility from a multi-attribute utility theory (MAUT) [22].

MAUT is one of the methods that form a multi-criteria decision analysis (MCDA) for quantifying the value of something (e.g., a project) based on its characteristics, impacts, and other relevant “attributes”. It is useful for project prioritization because it provides a relatively simple and defensible way to capture all sources of project value, including non-financial (or “intangible”) components of value.

More precisely, MAUT is an approach for deriving a “utility function” that, according to decision theory, quantifies a decision-maker’s preferences over the available alternatives to a decision. The utility function, u, is such that the best alternative is the one that optimizes u.

In order to evaluate the optimal parameters of the maintenance policy, there is a need to compare three values expressed in different measures: FPT, expressed in years; cost, expressed in thousands of dollars per year; and unavailability, expressed in days per year. This is achieved by introducing a suitable utility function, which is described next.

4.2.4.1 Utility Functions

Calculation of utility requires the definition of a utility function. The form of this function determines the willingness to accept a higher risk of failure in order to find a better solution. A utility function can be constructed to reflect the risk preference of the analyst; from this viewpoint, the analyst can be classified as:

  • risk-seeker,

  • risk-averse,

  • risk-neutral.

Their characteristics are presented in Fig. 8.

Fig. 8 Characteristics of different utility functions

One of the commonly used utility functions is a power expression shown as follows:

$$ u(x) = {\frac{{(x - a)^{R} }}{{(b - a)^{R} }}}. $$
(9)

Parameter R defines the risk-acceptance attitude, with a risk-seeker characterized by a value greater than one and a risk-averse person by a value smaller than one. A risk-neutral analyst is assigned the value R = 1. The constants a and b in (9) represent the minimum and the maximum value of the variable x, respectively.

A very useful characteristic of the utility function in (9) is the fact that calculated utility values are between 0 and 1. In our optimization problem, each of the three optimized parameters is represented by (9) with the same parameter R.
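As an illustration of (9), the short sketch below evaluates the utility of the same attribute value under three risk attitudes; the attribute range is assumed for the example only.

```python
# Power utility of eq. (9); R > 1 models a risk-seeking, R < 1 a risk-averse
# and R = 1 a risk-neutral analyst.

def utility(x, a, b, R):
    """Utility of x scaled to [0, 1] over the assumed range [a, b]."""
    return ((x - a) / (b - a)) ** R

# Example: remaining life attribute with an assumed range of 5 to 25 years.
for R in (0.2, 1.0, 3.0):
    print(f"R = {R}: u(10) = {utility(10.0, 5.0, 25.0, R):.3f}")
# The risk-averse curve (R = 0.2) already assigns a high utility to a modest
# improvement, while the risk-seeking curve (R = 3) rewards mainly values
# close to the maximum b.
```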

4.3 Numerical Example

The ideas described above will be demonstrated with a model of maintenance policy for high-voltage, air-blast circuit breakers with real historical values of model parameters.

4.3.1 Model Parameters

Figure 9 shows the model with the transition rates and probability of transitions between the states indicated on the arrows. These parameters are the same as used in the numerical example discussed in [17].

Fig. 9 Model of maintenance policy with transition rates between the states (d days, y years) and probability of transitions between them

After applying (1)–(3), the transition rate matrix Q is shown in Table 1.

Table 1 Transition rate matrix Q for the model in Fig. 9

The input values of the duration and costs of the states, obtained from historical data supplied by a large utility in Canada, are summarized in Table 2.

Table 2 Duration and cost of each state

4.3.2 Base Case Results

The steady-state probabilities and other model parameters computed from the standard Markov equations and transition rate matrix are shown in Tables 3 and 4.

Table 3 Steady-state probability of each state
Table 4 Solution of the base case maintenance policy model

In reality, the states shown with zero probability in Table 3 have non-zero probabilities, but the values are smaller than 0.0001 and are therefore not shown here.
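The “standard Markov equations” referred to above amount to solving pQ = 0 with the state probabilities summing to one. A minimal sketch, using a toy generator matrix rather than the breaker model of Fig. 9, follows.

```python
import numpy as np

# Steady-state probabilities p of a continuous-time Markov model:
# p @ Q = 0 together with sum(p) = 1.

def steady_state(Q):
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones(n)])        # p Q = 0 plus the normalization row
    b = np.zeros(n + 1)
    b[-1] = 1.0
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p

# Toy 3-state generator (not the breaker model).
Q = np.array([[-0.5,  0.5,  0.0],
              [ 0.0, -0.2,  0.2],
              [ 1.0,  0.0, -1.0]])
print(np.round(steady_state(Q), 4))         # [0.25  0.625 0.125]
```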

4.3.3 Model Optimization

The assumption made in the numerical analysis presented here is that an increase in the amount of money spent on maintenance can result either in shorter durations of the repairs or in a greater depth of the repairs, leading to a higher probability of the equipment landing in a better state. Therefore, it is assumed that the durations of the repairs and the probabilities of transition from a repair state to other states are functions of the cost of the state, i.e.,

$$ (d_{i} ,p_{i,j} ) = g(c_{i} ) $$
(10)

Function g has the form g(x) = αx^d for the duration variables and g(x) = β − γ ln(x) for the probability variables, with the values of α, β, γ and d different for each value of i and j. Our goal is to find the values of the model parameters for which (8) is minimized. We will assume that the analyst is risk-averse and assign the value R = 0.2.

The objective function f in (8) is an algebraic sum of utility functions (9) for the variables FPT, unavailability and total_cost and is given by

$$ f(c) = w_{1} u({\text{total}}\_{\text{cost}}) + w_{2} u({\text{unav}}) - w_{3} u({\text{FPT}}) $$
(11)

with weights w1 = w3 = 1 and w2 = 0.5 assigned arbitrarily by the author.

4.3.4 Constraints

The following limiting values were adopted for the model parameters. The lower bound of the costs is the present value used by the utility and the upper bound is the cost of the next level of repair:

$$ 1/\lambda_{i,j} \in \langle 1{\text{d}},365{\text{d}}\rangle $$
(12)
$$ \begin{array}{*{20}c} {c_{{{\text{M}}1}} ,c_{{{\text{M}}2}} ,c_{{{\text{M}}3}} \in \langle \$ 100,\$ 10 ,000\rangle } \hfill \\ {c_{{{\text{MM}}1}} ,c_{{{\text{MM}}2}} ,c_{{{\text{MM}}3}} \in \langle \$ 10 ,000,\$ 100 ,000\rangle } \hfill.\\ \end{array} $$
(13)

This defines completely the optimization problem. The results are discussed later.

4.3.5 Simulation Results

After a series of simulations, the SA algorithm gives the optimized costs of the breaker states shown in Table 5, with the other computed parameters shown in Table 6.

Table 5 Optimal cost of each state
Table 6 Parameters of maintenance policy model

All three parameters were improved by applying the optimal maintenance policy. The unavailability and the total utilization cost of the breaker were reduced by about 30% each and the expected remaining life was increased by about 60%. This is the effect of increasing the expenditures on all maintenance activities. At the same time, the probability of moving from the maintenance state to a higher state has increased compared to the base case, hence the overall improvement in all components of the objective function. For example, the probability of moving from states MM2 and MM3 to D1 increased by about 10% accompanied by a substantial increase in the transition probability from the minor maintenance states M2 and M3 to states D1 and D2, respectively.

Finally, a sensitivity study was performed to determine the effect of the value of the parameter R. Figure 10 presents the dependence of the objective function −u(x) at the optimum value of the vector x = x* on R. The values of the independent variable have been normalized as follows:

Fig. 10 Results of the sensitivity analysis

$$ x^{\prime} = {\frac{x}{{x_{\max } - x_{\min } }}}. $$
(14)

Two cost ranges were considered. In addition to the one given by (13), a reduced range was defined as follows: for minor maintenance between $100 and $5,000 and for major maintenance between $30,000 and $70,000.

The first observation is that the parameter R plays a role only in the narrow-interval case. In that case, the best values of the objective function are obtained for the risk-averse decision-maker, with virtually no distinction between the risk-neutral and risk-taking persons.

5 System Effect of a Component Maintenance

5.1 Bulk Power System Reliability Evaluation

The first step of the analysis performed by the computer platform is the evaluation of the reliability of a bulk electric system. A general approach adopted in many computer programs for this type of analysis is presented in Fig. 11. In this figure, a particular implementation in the software called REAL [23, 24] is shown.

Fig. 11 Simplified flowchart of the REAL model

A brief characterization of the blocks in Fig. 11 is given as follows.

  • Sequential or pseudo-chronological Monte Carlo simulation is used to select system states [25].

  • DC power flow model is used to analyze the system states.

  • Linear programming (LP) is used to solve, by redispatching and load shedding, system problems (i.e., overloads).

  • Failure/repair rates are considered for both generation and transmission equipment.

Since reliability computations often involve the analysis of large systems, two measures are introduced to improve the efficiency of the computations. First, a pseudo-chronological Monte Carlo simulation is employed during the reliability evaluation process [25]. The second measure involves a division of the entire network into three parts [23].
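The pseudo-chronological sampling of [25] and the DC power flow and LP remedial-action models are beyond the scope of a short sketch. The simplified, non-chronological state-sampling loop below only illustrates the principle of selecting system states and accumulating a loss-of-load estimate; the component data and the failure test are made up for the illustration.

```python
import random

# Simplified non-chronological Monte Carlo state sampling for a loss-of-load
# probability estimate.  The actual REAL engine uses pseudo-chronological
# sampling and analyzes each sampled state with a DC power flow and an LP
# remedial-action model; here a simple capacity test stands in for that step.

random.seed(1)

# (forced outage rate, capacity in MW) of each generating unit -- illustrative.
units = [(0.05, 200), (0.05, 200), (0.08, 300), (0.10, 400)]
load = 700.0                       # constant load for the sketch, MW

def available_capacity():
    """Draw an up/down state for every unit and return the available capacity."""
    return sum(cap for q, cap in units if random.random() > q)

n_samples = 200_000
failures = sum(1 for _ in range(n_samples) if available_capacity() < load)
print(f"Estimated LOLP = {failures / n_samples:.4f}")
```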

The first part of the network (i.e., the Equipment Outage Area) involves a full representation of the random behavior of transmission and generation elements. The second, larger, network (i.e., the Optimization Area) involves representation of all its elements for load flow and remedial action analysis. The elements in the second network that do not belong to the first network are not allowed to fail, but generators may be redispatched and load can be cut, if necessary. Finally, the third network (i.e., the External Area) includes both previous networks and an equivalent representation of the remaining components of the original load flow file. The idea of performing outage simulation on a part of the entire network was first introduced in [26]; the concept introduced there would be equivalent to using only the outage and optimization areas in our approach. The representation of the networks is illustrated schematically in Fig. 12.

Fig. 12 Network representation for outage scheduling

The output of this part of the analysis is a set of standard reliability indices including the loss of load costs. The indices are computed for the system and for each bus. From the bus indices, the area for further analysis is selected.

5.2 Small Area Reliability Study

5.2.1 Reliability Evaluation Principles

Area supply reliability is commonly measured in terms of BES delivery point interruptions. The key indices used are the interruption frequency, duration and probability. Delivery point interruptions in a BES can occur for several reasons; most of these interruptions are attributable to facility outages or security problems in the transmission system. The proposed approach uses continuity of supply as the only failure criterion, since equipment overload issues are tackled in the bulk electric system reliability studies.

The power system is designed to operate with protection schemes to minimize the effects of the component outage events resulting from the different phenomena described below. Often these effects are localized to an operating area such that widespread outages in the power system will not occur. Therefore, the reliability indices of a delivery point can be studied by modeling component outages within an area containing the delivery point and a few buses away.

A delivery point interruption, which is referred to as a system failure in this part of the chapter, is seldom the result of a single outage event; an overlapping of outage events is the more likely cause. Since the outage events are confined to the initiating faulted components and do not spread widely, good results can still be achieved by limiting the study of overlapping outage events to three or fewer components.

The approach is based on the Area Reliability Evaluation Program (AREP) developed by Hydro One Networks [27, 28] to calculate reliability indices of a selected group of customers supplied by a series of power sources. The method of minimal cuts is used to assess a continuity of supply from sources to sinks and to evaluate the reliability indices.

5.2.2 System Modeling and Component Data

The various phenomena modeled in the area reliability evaluation include:

  • independent outages caused by faults,

  • independent outages caused by false trips,

  • common mode outages caused by faults,

  • common mode outages caused by false trips,

  • breaker failure (active and passive),

  • protection-dependent failure,

  • maintenance-dependent outages,

  • repair-dependent outages,

  • maintenance events,

  • normally open breakers,

  • operation of various protection zones.

The calculation of the frequency and duration of the various outages involving the above phenomena is discussed in [27]. The effect of adverse weather is also included.

5.2.3 Classification of Outages

A delivery point is assumed to be interrupted if and only if all the electrical paths between the delivery point and all source points are interrupted. Interruptions are grouped into four types of system failures with different interruption durations as follows:

  • Permanent: if the interrupted delivery point(s) can only be restored by repairing the corresponding component(s) on permanent outage.

  • Switching: if the interrupted delivery point(s) can only be restored by isolating the corresponding component(s) on permanent outage.

  • Temporary: if the interrupted delivery point(s) can only be restored by restoring the outage component(s) to service via manual reclosing of disconnects and circuit breakers.

  • Transient: if the interrupted delivery point(s) can only be restored by auto reclosure.

5.2.4 Calculation of Reliability Indices

As mentioned above, the reliability indices for small area studies are calculated using a minimal cut set approach. This method, although an approximation, yields very accurate results and is much more practical for larger systems than a Markov process approach. The frequency and duration equations can be found in [16, 29, 30].
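For orientation, the familiar second-order equations for a minimal cut formed by two independent, repairable components (see, e.g., [16]) can be sketched as follows; the component data are illustrative only.

```python
# Approximate frequency and duration of a second-order minimal cut
# (two independent, repairable components).  Rates are per year and
# repair times are in years; the data below are illustrative.

def second_order_cut(l1, r1, l2, r2):
    """Return the overlapping-outage rate, mean duration and unavailability
    of a minimal cut of two independent components."""
    rate = l1 * l2 * (r1 + r2)          # frequency of overlapping outages
    duration = r1 * r2 / (r1 + r2)      # mean duration of the overlap
    return rate, duration, rate * duration

# Two supply paths, each failing 0.5 times/year with an 8-hour repair time.
lam, r = 0.5, 8.0 / 8760.0
f, d, u = second_order_cut(lam, r, lam, r)
print(f"cut rate = {f:.2e} /yr, duration = {d * 8760:.1f} h, unavailability = {u:.2e}")

# Delivery point indices are then approximated by summing the contributions
# of all minimal cuts between the sources and the delivery point.
```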

5.2.5 Area Network Representation

In order to perform area supply reliability analysis, the pattern of the power flow from the sources to the delivery points has to be established. Therefore, a direction of power flow has to be assigned to every connection specified in the network. The connection elements used to connect two adjacent components can be either uni-directional or bi-directional.

System components are normally protected by the nearest breakers. The protection zone for each component is established using this rule. These are defined as standard protection zones. As an alternative, a nonstandard protection zone for component(s) or breaker(s) can be established to override the standard protection zone defined above.

The approach assumes that all components in the electrical system being analyzed are self-switched; i.e., each component can be isolated from the electrical system without also isolating another component. Nevertheless, the user may specify a component to be switched out together with other components, i.e., a nonstandard switching zone.

In many substations, certain breakers can be normally open. When a permanent system failure is identified by the software, attempts will be made to restore the system by closing the normally open breakers one by one in a specified order. If none of the closings are successful in restoring the system, then the classification of the failure remains as permanent. If, however, any one closing restores the system, then the failure is classified as switching, and frequency and duration calculations are performed accordingly.

A standard failure criterion would state that the supply of M out of N delivery points must be interrupted in order to have a system failure.

5.2.6 Study of Independent and Dependent Outages

Independent and dependent component outages are studied in the area reliability evaluation using failure modes and effect analysis. Although the general approach used is the same, different techniques are used to perform analysis for these two types of outages.

5.2.6.1 Independent Study

The minimal cut technique is used to simulate independent outage events and determine if they result in delivery point interruptions. The basic idea of this technique is that once an independent outage event or overlapping of two independent outage events causes an interruption, it will not be combined further with other outage events in the failure modes and effect analysis.

5.2.6.2 Dependent Study

While studying dependent outages, the three-component rule (maximum three components can be out of service at the same time) as well as transitional analysis are used to perform failure effect analysis. Unlike independent outages, the minimal cut set approach is not used to perform failure effect analysis when studying dependent outages. This is due to the fact that when dealing with dependent outages, system states that result in failures do not have to be minimal cut states. For this reason, transitional analysis is used to carry out failure effect analysis using a state transition diagram [27].

5.2.7 Ranking of Components

A crucial functionality introduced in ASSP relates to the selection, for the in-depth maintenance policy analysis, of those components that are the most important from the system operation point of view. A component’s contribution to the system failure is termed its importance. It is a function of failure characteristics and system structure. An importance analysis is akin to a sensitivity analysis and is thus useful for system design, operation and optimization. For example, we can estimate possible variations in system failure probability caused by uncertainties in component reliability parameters. Inspection, maintenance and failure detection can be carried out in the order of component importance, and systems can be upgraded by improving components with relatively large importance.

We will consider two ranking methods of system components. Both are based on the calculation of the derivative of the system failure probability with respect to the component probability of failure. This derivative, which will be used as one of the importance indicators, gives a measure of the sensitivity of the system failure probability with respect to the given component reliability. A given component can be important because this derivative takes a high value. This normally is the case when the system fails when this component fails, or this component appears in one or more minimal cuts that have a small number of other components.

The partial derivative considered here can have a high value even when the component has a very small probability of failure. In order to take into account the contribution of the component failure probability to the system probability of failure, we will introduce a second measure called criticality importance.

The basic mathematical principles used in the development of the ranking tables are given in [16]. The implementation in the computer platform is an extension of this approach and is briefly discussed below.

5.2.7.1 Structural Importance

This is the simplest of the importance criteria and is merely the partial derivative (the classical sensitivity) of the probability of system failure p F with respect to a component failure probability p j . Thus, for the jth component, we have

$$ {\text{IST}}_{j} = {\frac{{\partial p_{\text{F}} }}{{\partial p_{j} }}}. $$
(15)

Since the probability of system failure is a linear function of the component failure probability, the expression for the system probability of failure can be written as

$$ p_{\text{F}} = p_{j} K_{j} + (1 - p_{j} )L_{j} + H_{j} . $$
(16)

The constant K j is the sum of all terms in the expression for the probability of failure that contain the factor p j , with p j excluded; L j contains the terms with the factor q j  = 1 − p j , with q j excluded; and H j contains the terms that have neither p j nor q j .

From (15), we have

$$ {\text{IST}}_{j} = K_{j} - L_{j} . $$
(17)

Thus, the structural importance can be easily evaluated if a mathematical expression for system failure can be written in the form of (16). Since, in a usual area reliability problem, there can be hundreds of minimal cuts, constructing a symbolic expression for the system failure probability can be a formidable task. As part of this development, a very efficient algorithm to accomplish this task has been programmed in the computer platform described here.

The structural importance of components can be used to evaluate the effect of an improvement in component reliability on the delivery point(s) reliability, as follows: By the chain rule of differentiation, we have

$$ {\frac{{\partial p_{\text{F}} }}{\partial t}} = \sum\limits_{j = 1}^{m} {(K_{j} - L_{j} ){\frac{{{\text{d}}p_{j} }}{{{\text{d}}t}}}} = \sum\limits_{j = 1}^{m} {{\text{IST}}_{j} {\frac{{{\text{d}}p_{j} }}{{{\text{d}}t}}}} $$
(18)

where t is a common parameter—say, the time elapsed since the system development began. Thus, the rate at which system failure probability decreases is a weighted combination of the rates at which component probabilities of failure decrease, where the weights are the structural importance numbers.

From (18), we may also obtain

$$ \Updelta p_{\text{F}} = \sum\limits_{j = 1}^{m} {(K_{j} - L_{j} )} \Updelta p_{j} = \sum\limits_{j = 1}^{m} {{\text{IST}}_{j} } \Updelta p_{j} $$
(19)

where Δp F is the perturbation in system failure probability corresponding to perturbations Δp j in component failure probabilities.

5.2.7.2 Criticality Importance

The criticality importance considers the fact that it is more difficult to improve the more reliable components than to improve the less reliable ones. Dividing both sides of (19) by p F, we obtain

$$ {\frac{{\Updelta p_{\text{F}} }}{{p_{\text{F}} }}} = \sum\limits_{j = 1}^{m} {{\frac{{p_{j} }}{{p_{\text{F}} }}}{\text{IST}}_{j} } {\frac{{\Updelta p_{j} }}{{p_{j} }}}. $$
(20)

The criticality importance of the jth component is defined as

$$ {\text{ICR}}_{j} = {\frac{{p_{j} {\text{IST}}_{j} }}{{p_{\text{F}} }}}. $$
(21)

Thus, the criticality importance is a fractional sensitivity.
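The two importance measures can be illustrated with a minimal numerical sketch. Here the probability of system failure is evaluated exactly by enumerating all component states of a small system described by its minimal cut sets, so the linearity of p F in each p j holds and IST j can be obtained simply as p F(p j  = 1) − p F(p j  = 0), in line with (15)–(17); ICR j then follows from (21). The cut sets and probabilities below are invented, and the brute-force enumeration is not the efficient symbolic algorithm implemented in the platform.

```python
from itertools import product

def system_failure_prob(p, cuts):
    """Exact p_F for a system that fails when any minimal cut set is fully failed.

    p    : dict mapping component name -> failure probability p_j
    cuts : list of minimal cut sets (sets of component names)
    Enumerates all component states, so it is only intended for small systems."""
    comps = sorted(p)
    pf = 0.0
    for states in product([0, 1], repeat=len(comps)):
        failed = {c for c, s in zip(comps, states) if s}
        prob = 1.0
        for c, s in zip(comps, states):
            prob *= p[c] if s else (1.0 - p[c])
        if any(cut <= failed for cut in cuts):
            pf += prob
    return pf

def importances(p, cuts):
    """Structural importance IST_j = p_F(p_j=1) - p_F(p_j=0) (p_F is linear in p_j);
    criticality importance ICR_j = p_j * IST_j / p_F, as in (21)."""
    pf = system_failure_prob(p, cuts)
    result = {}
    for j in p:
        ist = system_failure_prob({**p, j: 1.0}, cuts) - system_failure_prob({**p, j: 0.0}, cuts)
        result[j] = (ist, p[j] * ist / pf if pf > 0 else 0.0)
    return result

# Hypothetical three-component system with two minimal cuts {A} and {B, C}.
probs = {'A': 0.01, 'B': 0.05, 'C': 0.05}
print(importances(probs, [{'A'}, {'B', 'C'}]))
```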

5.3 Analysis of Component Maintenance Policy

Once the components have been selected for further analysis, the modules that analyze the effect of changes in the component maintenance policy on its remaining life and failure rate are called from the computer platform. The most important features of the proposed approach are described below.

5.3.1 Calculation of the Remaining Life of the Equipment

The remaining life of the equipment is computed using the Markov model as shown in Fig. 8. The calculations use the notion of the first passage time (FPT), discussed in the next subsection.

5.3.1.1 FPT

Let T ij be the FPT from state i to state j in a finite-state, continuous-parameter Markov chain (CTMC) {Z(t), t ≥ 0} with state space Ω = {1, 2, …, n}; the continuous parameter is usually time. The transition rate matrix is defined as A = [λ ij ], where λ ij (i ≠ j) represents the transition rate from state i to state j and the diagonal elements are λ ii  = −∑ j≠i λ ij . We let η = max|λ ij |. Let C represent the set of absorbing states and B (=Ω − C) the set of transient states in the CTMC. From the matrix A, a new matrix A B of size |B| × |B|, where |B| is the cardinality of the set B, can be constructed by restricting A to only the states in B.

Since Z(t) is distributionwise equivalent to a discrete-parameter Markov chain subordinated to a Poisson process with rate η (uniformization), we have [18]:

$$ T_{ij} = \sum\limits_{k = 1}^{{N_{ij} }} {V_{k} } \quad {\text{where each}}\;V_{k} \;{\text{has density}}\;\eta \,{\text{e}}^{ - \eta t} . $$
(22)

Thus, \( E(T_{ij} ) = E(N_{ij} ){\frac{1}{\eta }} \), where N ij is the FPT of a discrete parameter Markov chain. But (see page 167 of [16]):

$$ E(N_{ij} ) = \sum\limits_{k = 1}^{n} {n_{ik} } ,\quad [n_{ik} ] = \left( {I - \left( {I + {\frac{{A_{\text{B}} }}{\eta }}} \right)} \right)^{ - 1} = - \eta A_{\text{B}}^{ - 1} ,\quad {\text{so}}\;E(N_{ij} ) = - \eta \sum\limits_{k = 1}^{n} {A_{\text{B}}^{ - 1} (i,k)} $$
(23)

where only the transient states are in A B. Hence,

$$ E(T_{ij} ) = \sum\limits_{k = 1}^{n} { - A_{\text{B}}^{ - 1} (i,k)} . $$
(24)

Those FPTs are used to generate the life curve of the equipment.
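A minimal sketch of (24) follows: since the uniformization rate η cancels, only the restriction A B of the rate matrix to the transient states is needed, and the expected FPT to absorption from each transient state is obtained by solving one linear system. The two-state usage example is hypothetical.

```python
import numpy as np

def expected_fpt(A, transient):
    """Mean first passage times to absorption, Eq. (24): E(T_i) = -sum_k A_B^{-1}(i, k).

    A         : full transition rate matrix (rows sum to zero)
    transient : indices of the transient states (the set B)
    Returns one expected time to absorption per transient state."""
    A = np.asarray(A, dtype=float)
    A_B = A[np.ix_(transient, transient)]
    return -np.linalg.solve(A_B, np.ones(len(transient)))

# Hypothetical 2-state example: state 0 is transient with rate 0.5/yr to the absorbing state 1.
A = [[-0.5, 0.5],
     [ 0.0, 0.0]]
print(expected_fpt(A, transient=[0]))   # -> [2.0] years
```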

5.3.1.2 Life Curves of the Equipment

The concepts of a life curve and discounted costs are useful to show the effect of equipment aging with time and were discussed at the beginning of this chapter. Figure 1 shows an example of two life curves for the same type of equipment under two different maintenance strategies. If a “better” maintenance strategy is selected, or if the operating conditions are more favorable, the equipment will last longer and at a particular point in time, its condition (or asset value) will be higher.

The generation of a life curve requires several steps, explained below with the help of Fig. 13:

Fig. 13
figure 13

Development of life curves a without maintenance, b with maintenance

  • First, the borderlines between the deterioration stages D1, D2 and D3, expressed in terms of percentages of equipment condition, are marked on the vertical axis and entered into the program.

  • Next, AMP/FPT calculations are carried out by the program to determine the FPTs between states D1 and D2, D1 and D3, and D1 and F. These are entered on the time axis of Fig. 13. Since the AMP model is used, the effects of maintenance are already incorporated.

  • If there were no maintenance, the FPTs D1D2*, D1D3* and D1F* would be obtained and the corresponding life curve would run as shown. With maintenance, the life curve is no longer a smooth line but a rugged one, indicating the deterioration between maintenance actions and the improvements caused by them. A crude realization of the process is shown in Fig. 13. It is a deterministic approximation that does not consider all possibilities inherent in the AMP model; nevertheless, it helps to visualize how an equivalent smooth life curve is constructed.

  • The equivalent smooth life curve is drawn by observing the following simple rules: at time 0 it must be at 100%, and at D1F it must be at 0%. At the remaining two ordinates, by arbitrary convention, it should lie near the lower quarter of the respective deterioration bands (in Fig. 13, the midpoints are used, following an earlier convention). A simple numerical sketch of this construction is given after the list.
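The following sketch implements the construction just described under stated assumptions: the stage boundaries (in percent of condition) and the FPTs D1D2, D1D3 and D1F are taken as given from the AMP/FPT calculation, the two intermediate ordinates are placed within the lower quarter of the respective deterioration bands (this reading of the lower-quarter rule, like the boundary and FPT values used, is an assumption made for illustration), and the equivalent smooth curve is approximated by piecewise-linear interpolation between the anchor points.

```python
import numpy as np

def smooth_life_curve(fpt_d1d2, fpt_d1d3, fpt_d1f, d2_top, d3_top, anchor=0.25):
    """Anchor points of the equivalent smooth life curve.

    d2_top, d3_top : upper boundaries (in % of condition) of the D2 and D3 bands.
    The curve starts at 100% at time 0, ends at 0% at the FPT to failure, and
    passes near the lower quarter of the D2 and D3 bands at the corresponding
    FPTs (with anchor=0.5 one recovers the earlier midpoint convention)."""
    d2_bottom, d3_bottom = d3_top, 0.0
    times = [0.0, fpt_d1d2, fpt_d1d3, fpt_d1f]
    condition = [100.0,
                 d2_bottom + anchor * (d2_top - d2_bottom),
                 d3_bottom + anchor * (d3_top - d3_bottom),
                 0.0]
    return np.array(times), np.array(condition)

# Hypothetical boundaries (D1/D2 at 70%, D2/D3 at 35%) and FPTs of 12, 25 and 40 years.
t, c = smooth_life_curve(12.0, 25.0, 40.0, d2_top=70.0, d3_top=35.0)
condition_at_20yr = np.interp(20.0, t, c)   # condition read off the piecewise-linear curve
```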

5.4 Numerical Example

To illustrate how the complete study is performed, let us assume that a maintenance budget for the high-voltage breakers in the system under consideration is specified and our task is to examine several asset sustainment options for these pieces of equipment. For illustration purposes, we have selected the IEEE Modified Reliability Test System (MRTS). MRTS is a modification of the IEEE RTS [31], with the objective of stressing the transmission network. Bearing in mind this objective, the original generating capacities and peak loads are multiplied by two. The system has 24 buses, 38 circuits and 14 plants (32 generating units). The total installed capacity is 6,810 MW, with a peak load of 5,650 MW. Even though the computer platform can handle very large power systems, as described in [23], a relatively small system was selected for illustrative purposes because (1) it is familiar to many power system engineers, and (2) it allows a better understanding of the procedures adopted in the platform.

The following sections describe the procedure indicated in Fig. 2.

5.4.1 Bulk Electric System Reliability Study

The first step in the study is to set up the BES network information. Figure 14 displays the 24-bus IEEE RTS [31]. All the electrical and reliability parameters are as specified in this reference, with the exception of the load and generation quantities, which are doubled from the original values. A chronological load model is used, with the load curve represented by 8,760 hourly values. Outages of generators, lines and transformers are considered.

Fig. 14
figure 14

The diagram of the 24-bus RTS system

In this study, a pseudo-chronological simulation was selected with 50,000 samples. A flat load curve was applied and, since the network is small, the entire system was selected as an outage area. The loss of load cost was selected as the governing reliability index. The unit interruption cost curves are taken from Ontario Hydro [32]. The participation of each consumer class per bus is the same as that used in [25]. The system total is: 19.2% residential (1,092 MW), 24.2% commercial (1,379 MW) and 56.6% industrial (3,229 MW). The computed LOLC values are shown in Fig. 15.

Fig. 15
figure 15

LOLC values for selected buses

We can observe that buses 14 and 16 have the highest LOLC values. The area around these buses is shown in Fig. 16.

Fig. 16
figure 16

Area selected for detailed study

The three stations represented by buses 14, 15 and 16 will be modeled in detail in the small area reliability study.

5.4.2 Small Area Reliability Study

5.4.2.1 Assigning Reliability Indices to the Selected Buses

A diagram showing details of the substations represented by buses 14, 15 and 16 is given in Fig. 17. The bus and breaker failure rates and repair times are not a part of the RTS database. The adopted parameters are summarized in Table 7.

Fig. 17
figure 17

Three stations representing buses 14, 15 and 16 in Fig. 16. The relative location is the same as in Fig. 14

Table 7 Component reliability parameters

The duration of a permanent outage of a breaker is set at 6 weeks. In addition to a fault, a failure to open or close is also modeled for breakers, and the probability of a stuck breaker is equal to 0.006. The failure rate for transient outages is assumed to be 0, and the duration of temporary outages is assumed to be 30 min for all components. Each of the three stations has one node denoted as a sink to represent the load connected at the station. The stations representing buses 15 and 16 in Fig. 17 have 2 (nodes 14 and 22) and 3 (nodes 27, 47 and 48) source nodes, respectively, representing possible power inflows to the station either from the generator or from the external lines connected to this station but not represented in Fig. 16. In this example, each of the substations in Fig. 17 is analyzed individually. When the substation representing bus 16 is analyzed in this way, it has 4 sources (the additional source represents a possible inflow from bus 15) and 2 sinks (one for its own load and the other representing the supply to bus 14).

Table 8 summarizes the reliability indices for the station with the 6-diameter arrangement in Fig. 9 representing bus 15 in Fig. 16.

Table 8 Reliability indices for the station representing Bus_15

The analysis is then repeated for the remaining two stations, and the results from the last row of tables similar to Table 2 are used in the BES reliability study. A new base case is thus established. The results are similar to the ones shown in Fig. 15. Figure 18 shows the differences in the LOLC between the base case and the case where all buses are 100% reliable. Only the buses with load in this system are shown.

Fig. 18
figure 18

The difference in the LOLC between the base case and the studies in Fig. 15

We can observe that buses 1, 10 and 14 show a substantial increase in the LOLC values and bus 16 shows a comparative decrease in the customer interruption cost.

The next step in the analysis is the creation of the component ranking tables.

5.4.2.2 Criticality Ranking of Components

There are 24 high-voltage breakers in the 3 stations shown in Fig. 17. A visual inspection of the station configurations immediately points out that the breakers in the ring bus are very important for the station with this configuration. The criticality of the other breakers is not so apparent. The results of the criticality analysis are summarized in Fig. 19.

Fig. 19
figure 19

Criticality ranking of system components

The elements in this table are ordered according to their criticality importance. Several interesting observations can be drawn from this table:

  • The two methods rank the system components quite differently. For example, all the breakers in the ring bus substation and three other breakers appear in the list of the 10 most critical components, whereas, from a structural point of view, only buses (and line 44) enter this list.

  • Out of the 24 breakers in this system, only 11 are important from either a structural or a criticality point of view. This is because, in this analysis, only single and two-element minimal cut sets were considered and the remaining breakers did not appear in any of these cuts.

  • Out of all the breakers in this system, breaker 41 is the most important from the criticality point of view. Both methods would rank the breakers in the same way.

Since a group of breakers at the ring bus (bus 14 in Fig. 14) is at the top of the list, the maintenance activities at this station are reviewed in more detail in the next section.

5.4.3 Analysis of the Breaker Maintenance Policy

We will analyze the performance of the high-voltage breakers. We will start by reviewing the present maintenance policy.

5.4.3.1 Present Maintenance Policy

The present maintenance policy is described earlier in the chapter in the numerical example discussing the application of the life curves. We will recall that, reflecting a general utility practice, we assumed that three types of maintenance are routinely performed on each breaker.

About every 8 months to a year, a minor maintenance is performed involving timing adjustments and lubrication at a cost of about $700. Medium maintenance involving replacement of some parts is performed approximately every 10 years and costs about $6,000. A major maintenance involving breaker overhaul takes place every 15 years and costs about $75,000.

In the breaker example, three types of maintenance were modeled. The equipment could be in one of four possible states: “as new—D1”, “slightly deteriorated—D2”, in “major deterioration state—D3” or “failure—F”. The output of this analysis yields information shown in Fig. 20.

Fig. 20
figure 20

Reliability indices for the air-blast breakers

As the result of the study, we obtain the expected time to failure from various deterioration states (these times range from 40.5 to 27.1 years from “as new” to “badly deteriorated” breaker, respectively) and the percentage of the lifetime that the breaker is expected to be in each deterioration state.
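To connect these outputs with the FPT formulas of Sect. 5.3.1.1, the sketch below builds a four-state (D1, D2, D3, F) rate matrix and computes both the mean time to failure from each deterioration state and the fraction of the pre-failure lifetime spent in each state. The deterioration and maintenance rates are invented for illustration; they are not the breaker data used in this study.

```python
import numpy as np

# Hypothetical transition rates (per year): deterioration D1->D2->D3->F plus
# maintenance-driven improvements D2->D1 and D3->D2. Not the chapter's breaker data.
lam_12, lam_23, lam_3f = 0.12, 0.15, 0.20
mu_21, mu_32 = 0.05, 0.04

A = np.array([
    [-lam_12,             lam_12,             0.0,               0.0   ],
    [ mu_21,  -(mu_21 + lam_23),              lam_23,            0.0   ],
    [ 0.0,                mu_32,  -(mu_32 + lam_3f),             lam_3f],
    [ 0.0,                0.0,                0.0,               0.0   ],   # F is absorbing
])

A_B = A[:3, :3]                # restrict to the transient states D1, D2, D3
M = -np.linalg.inv(A_B)        # M[i, k] = expected time spent in state k before failure, starting from i
ttf = M.sum(axis=1)            # Eq. (24): mean time to failure from D1, D2 and D3
fractions = M[0] / ttf[0]      # share of the pre-failure lifetime in each state, starting as new

print("Mean time to failure from D1, D2, D3 (years):", np.round(ttf, 1))
print("Fraction of lifetime in D1, D2, D3:", np.round(fractions, 2))
```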

5.4.3.2 Alternative Maintenance Policy

The next step in the analysis is the selection of a maintenance procedure: either staying with the present policy (as described above), performing a major refurbishment, or replacing the breaker with a new one, possibly of a different type. These studies will be performed by analyzing the life curves of the equipment and the associated costs. A description of the computer program used to perform this study can be found in [33]. The time horizon for the calculation of the discounted costs is set at 10 years, with inflation and discount rates of 3 and 5%, respectively. The system and penalty costs associated with equipment failure are equal to $10,000 each in this example.

In order to calculate the effect of the revised maintenance policy, we need to specify the present asset condition (assumed at 80% in this example) or asset value. This information determines where the equipment is located on the life curve.

In order to analyze maintenance alternatives, possible actions need to be defined. We will consider three possible maintenance actions for the breakers in the station represented by bus no. 14 with the ring construction. In addition to continuing present maintenance policy, we will consider a major refurbishment of the air-blast breakers at a cost of $75k per breaker or a replacement with a newer design (e.g., SF6 construction). The cost of a new breaker is set at $150k. The refurbished breakers will have the same maintenance policy as the current system whereas the new design will have the minor maintenance performed once a year rather than every eight months and medium repairs every 5 years. With the financial and engineering data specified, the calculations of reliability and cost information can now proceed.
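As a back-of-the-envelope illustration of how the discounted costs might be assembled, the sketch below escalates each future expense at the inflation rate and discounts it back at the discount rate over the 10-year horizon. The repair schedule, the timing of the major overhaul within the horizon and the failure probability used to weight the failure cost are simplified assumptions, not the model of the program described in [33].

```python
def present_value(cost_today, year, inflation=0.03, discount=0.05):
    """Escalate a today's-dollar cost to `year` and discount it back to present value."""
    return cost_today * (1 + inflation) ** year / (1 + discount) ** year

def discounted_cost(events, horizon=10, **rates):
    """Sum the present values of (year, cost) events falling within the time horizon."""
    return sum(present_value(c, y, **rates) for y, c in events if y <= horizon)

# Hypothetical schedule for the present policy over 10 years: minor maintenance every
# 8 months (~$700), one medium repair (~$6,000) and one major overhaul (~$75,000).
minor = [(8 * k / 12.0, 700.0) for k in range(1, 16)]
medium, major = [(10.0, 6000.0)], [(7.0, 75000.0)]
maintenance_pv = discounted_cost(minor + medium + major)

# Expected failure cost: an assumed probability of failure over the horizon times
# the system-plus-penalty cost ($10,000 each); the probability here is invented.
failure_pv = 0.05 * present_value(20000.0, 5.0)

print(round(maintenance_pv), round(failure_pv))
```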

5.4.4 Life Curves

Figure 21 shows the life curves of the selected breaker under various maintenance policies considered in this example. The two new maintenance alternatives will be introduced after a 3-year delay. In the replacement action, the equipment is assumed to return to as new condition.

Fig. 21
figure 21

Life curves for various maintenance policies

5.4.5 Cost Curves

Cost computations involve calculation of the number of different types of repairs during the specified time horizon. The number of repairs in the period before and after the action is taken is computed separately. The cost of each repair is then expressed by its present value. Failure and associated costs are computed as expected values. The probabilities of failure before and after the action are computed by the program.

The cost curves are presented as functions of the delay. Since each action can have a different delay, the delay time can be specified either in years or as a percentage of the specified delay time. The latter option allows the curves for all actions to be displayed on one screen without providing a separate delay scale for each action.

Figure 22 shows the cost diagram for all actions with a 3-year delay for each (100% of assumed value).

Fig. 22
figure 22

Cost diagram for various actions with a 3-year delay. The components of each bar are as follows (from the bottom): failure cost, maintenance cost, refurbishment action cost

If a 3-year delay was contemplated in the application of any action, the best policy would be to either stay with the present policy or to perform a major refurbishment with the resulting cost of about $95,000. The major portion of this cost is due to the refurbishment action itself. We can observe that for the 10-year time horizon, the expected cost of failure (computed during the analysis but not shown here) is fairly small in all the cases since the probability of failure is small for these actions. The maintenance cost is high for the present maintenance policy because minor repairs are performed quite often and, during the 10-year horizon, one medium and one major repair will be performed.

5.4.6 Comparative Studies

The new breaker failure rates result in improved station reliability characteristics. In particular, the new bus failure rate is equal to 0.114 (1/year) compared with 0.131 (1/year) in the base case. Repeating the BES reliability study with this new information yields revised LOLC values. The difference between the new results and those of the base case is shown in Fig. 23.

Fig. 23
figure 23

The decrease in $M of the LOLC values with SF6 breakers installed at the substation represented by bus 14 compared to the base case

We can observe that the reduction of the loss of load cost over a 1 year period reaches about $1 M. This is a substantial saving, but before the final decision is made, all the relevant costs should be considered. Table 9 shows the comparison of the results obtained with all the computer programs of the computer platform for the two alternative maintenance policies discussed above.

Table 9 Comparison of two maintenance alternatives

We can observe that from the reliability point of view, both alternatives are very similar. The expected energy not supplied, the frequency and duration of interruptions are very close. However, the economic considerations would favor the alternative involving breaker replacement.

The cost comparison of the two alternatives shows that even though the installation of the new breakers at this station will result in an additional cost of about $300,000 over the 10-year planning horizon, the savings in the cost of load interruptions are much greater, reaching about $1,000,000 per year. In addition, the average life of the replacement breaker is about 50% greater than the old one.

6 Conclusions

In this review, a survey is offered of the various maintenance methods available to operators. The methods range from the simplest, “follow-the-manual” types to detailed probabilistic approaches. To get the most out of maintenance, one would have to select a mathematical model where optimization is possible, whether for highest reliability or for lowest operating costs. There can be little doubt that such probabilistic models would be the best tools for identifying policies that provide the highest cost savings.

Another choice, of which operators are becoming more and more aware, is to apply a maintenance policy based not on a rigid schedule but on the “as needed” principle. This can be implemented with or without mathematical models; an example of the latter is the RCM approach. RCM, steadily gaining in popularity, is based on an analysis of failure causes and past performance, and helps to decide where to put the next dollar budgeted for maintenance. The method is good for comparing policies, but not for true optimization.

This chapter discusses models that can be applied for finding an optimal maintenance policy for power equipment. The emphasis is placed on an optimization formulation of a maintenance policy model based on semi-Markov processes. The model allows the analysis of the influence of the maintenance policies, defined by the durations of repairs, the effects and the costs of various repair actions on the remaining life and the lifetime utilization costs of the equipment. Different types of possible Markov model optimizations were discussed and a simulated annealing algorithm was introduced. This algorithm was found to be very well suited to the solution of the maintenance optimization problem.

The operation of the proposed model was demonstrated on a numerical example for high-voltage circuit breakers with a significant improvement of the three important parameters defining a maintenance policy: the remaining life of equipment, the total utilization cost and the unavailability.

In today’s competitive environment, cost optimization is becoming even more important. This is particularly true for transmission and distribution equipment where the maintenance choices described in this chapter fully apply. As for generating units, the situation is somewhat different.

In the past, the practice was to centrally plan and coordinate the maintenance of generators within a given jurisdiction. Maintenance was done during low-load seasons and the timing was influenced by such considerations as system risk and production cost. In the deregulated scenario, maintenance may not be centrally planned or even coordinated. Generator owners may tend to keep the units running when the market clearing price of electric energy is high, and perform maintenance only when the market price is low. Even then, they may wish to sell energy to another jurisdiction where the periods of high load (and high market price) are different from those near the unit’s location. Therefore, the decision when to maintain a generator will be heavily influenced by profit incentives and the optimal cost of maintenance and repair would be assessed in this context. But even then, some of the approaches and programs discussed in this review would retain their relevance.

The consequences of alternative maintenance actions can be analyzed from three different points of view. Engineers might be mostly interested in the effect of the asset sustainment policy on the asset condition and the probability of failure of the equipment. The condition of the asset can be visualized in the form of a life curve. Development of such curves constitutes a significant part of the analysis, and Markov models can be used for this purpose. Solving the equipment Markov model provides information about the probability of failure during the specified time horizon. This probability is, in turn, used to determine the expected cost of the equipment maintenance and failure during this time. The financial information is presented in terms of present values, taking into account the anticipated inflation rate and the corporate discount rate. The system effects are analyzed with the help of two different approaches: area and bulk electric system reliability programs, the latter also taking into account the customer interruption costs.

The proposed approach was illustrated in a study analyzing the effect of changes in breaker maintenance policy on the performance of the equipment itself, on the reliability of a small area around the substation where the breakers are installed, and on the reliability of the entire bulk electric system.

The main feature of this approach is the requirement for a seamless transition between various computational modes. The information transferred can be as basic as the equipment failure rates under various maintenance scenarios, as happens between BES and small area analysis programs or between small area and component analysis modules, or as complex as the complete maintenance strategy and life curves transfer between the computational modules. This innovative design will allow a comprehensive analysis of asset sustainment alternatives in a way that was not possible until now.

The ability to combine engineering and financial information coupled with the ease of use of the computer platform has proven to be an important asset for the re-investment decision-making process.