Keywords

1 Introduction

As reliability has become a major concern in most engineering and computing systems, component reliability improvement and the provision of redundant components are often used together to maintain system reliability [1]. Preventive maintenance, which recovers a component from failure or degraded state, could improve the reliability of single component and is often used jointly with component redundancy for multi-performance-state systems [2]. Similar as traditional redundancy, the configuration of preventive maintenance also introduces extra cost to the whole system, so a great majority of researches have concentrated on the optimization of joint mechanism [3, 4]. Existing researches mainly assume that the repair process is taken immediately after component failure or degradation. However, in large distributed computing systems, e.g. cloud based systems, it is difficult to identify failures in various distributed components. Routine monitoring and inspection mechanisms are often conducted to detect component status and to trigger component repair process [5]. Therefore, the inspection policy plays a key role in the whole system maintenance, but few studies have concentrated on the optimization of joint redundancy and inspection-based maintenance mechanism.

A lot of researches have been taken on the subject of joint traditional fault tolerance mechanism analysis and optimization [11, 12]. For multiple kinds of redundancy strategies, Coit [6], Tavakkoli-Moghaddam et al. [7] and Chambari et al. all [8] studied redundancy optimization problem while simulated annealing algorithm [9] and genetic algorithm [10] were used to get solutions. For maintenance mechanism, Soro et al. evaluated the reliability and performance of multi-state degraded systems with redundancy and imperfect preventive maintenance [2] and then Nourelfath et al. proposed a SP/TG heuristic approach to optimize the joint mechanism in series-parallel systems [3]. Liu et al. considered the deterioration effect after maintenance in this model and conduct optimization using genetic algorithm [4]. The mentioned researches didn’t consider monitoring and inspection mechanism for repair process and were not applicable for cloud computing systems. Our previous research conducted a preliminary study on joint optimized redundancy and inspection rate [13]. Due to the computation complexity of state transition analysis, the model was only built for subsystems with two redundant components when inspection used. For systems with k-out-of-n redundancy architecture, Martorell al. analyzed the optimal test interval of a two-component parallel system based on availability analysis [21]. Vaurio developed the optimization on test and maintenance interval time for k-out-of-n system with four components and series systems [22, 23]. Cepin and Mavko also conducted test strategies and maintenance optimization by minimization of risk through simulated annealing [24, 25]. Torres-Echeverria et al. presented an optimization model for the maintenance and test policy design for a system using MooN voting redundancies [26]. In these researches, analytical availability and periodic test cycle analysis are conducted on probabilistic methods, but the working mechanism inside the test or maintenance strategy has not been presented.

The main objective of this paper is to define the optimal system structure and the inspection policy for each component, so that the series-parallel system performance is maximized, subject to the availability and cost budget constraints. The state transition process is first analyzed for subsystems with redundant components and a routine monitoring and inspection mechanism. Because of the recovering effect of maintenance process, the steady state reliability and performance is analyzed from the state transition process. The cost is a combination of redundancy cost and maintenance cost. While original redundancy allocation problem (RAP) has been proven to be NP-hard problem [14], the introduction of inspection-based maintenance strategy even increases the complexity of the optimization problem. Therefore, a genetic algorithm is used to search the optimal solution of redundancy and inspection routine for each subsystem and illustrative examples are presented to explain the analysis and calculation process. The following sections are organized as follows: the problem description and optimization model is listed in the following section and the solution technique is then proposed; illustrate examples are presented with experimental results analysis; conclusions are summarized at last with future work.

2 The Optimization Model

Notations

m

Number of sequential subsystems

A i

Steady state availability of subsystem i

n i

Number of redundant components in subsystem i

p ri

Work processing rate of each component in subsystem i

λ i

Failure rate of each component in subsystem i

P ri

Steady state performance rate of subsystem i

μ i

Repair rate of each component in subsystem i

q d

Degradation rate in maintenance state

n ri

Maximal number of components being repaired in parallel in subsystem i

n, λ m , n r

Corresponding vectors of n i , λ mi , n ri in all subsystems

q i

Probability of successful repair for each component in subsystem i

λ mi

Monitoring rate of maintenance mechanism in subsystem i

π i,α

Steady state probability of state α in subsystem i

Q i,αβ

Transition rate between two states α and β in subsystem i

μ si

Component detection rate for subsystem i

C sys

Overall system redundancy cost

Q i

The generator matrix of Q i,αβ

M sys

Monitoring and inspection cost per time unit

π i

Vector of π i,α

R sys

The repair facility cost

P sys

Average performance rate of the system

A sys

Steady-state system availability

2.1 Problem Description

Assume that there are m sequential subsystems in one system and for subsystem i(1 ≤ i ≤ m), n i components are running in parallel or kept as backup. The structure of such series-parallel system is shown in Fig. 1.

Fig. 1.
figure 1

Structure of general series-parallel system.

Following the fault tolerance mechanism design in researches [5], n i components are running initially for each subsystem. A routine monitoring and inspection mechanism is used to detect the status of components. When a component fails or degrades, it is replaced by other standby backups instantly. The switched component is repaired or restarted in background and placed as backup components. A brief illustration of the process in each subsystem is shown in Fig. 2.

Fig. 2.
figure 2

Illustration of joint redundancy and inspection-based maintenance mechanism.

To illustration the working process of each subsystem, the state transition analysis is conducted on the subsystem with the redundant components. Each subsystem is presented by a multi-state model, shown as Fig. 3, where it is assumed that:

Fig. 3.
figure 3

State transition diagram joint redundancy and inspection-based maintenance.

  • The working and failure pattern of each component inside one subsystem is the same. In real-world applications, there is more than one type of components inside each subsystem and the failure pattern of these components is not the same. In this case, the patterns of the component with the worst reliability and performance could be used as the common pattern for calculation. Specifically, when a cloud-computing or grid-computing system is chosen for analysis, the same CPU and memory resources would be allocated to each virtual component in one system, it is reasonable to assume that the time-to-failure of each component follows the similar distribution pattern. The time to failure of components for subsystem i is exponentially distributed with rate λ i . The repair process is exponentially distributed with rate μ i .

  • The repair process for each component is carried out in background as long as the system is working. At most n ri failed components could be repaired in parallel with each repair process accomplished successfully with probability q i .

  • The routine inspection mechanism is run every certain time units. An ideal approach would be to take the time to the next inspection to be uniformly distributed. Since the difference between normal distribution and exponential distribution assumption for routine inspection mechanism is proven to be small [15], the periodical inspection/detection process is approximated to be exponentially distributed with λ mi to reduce model complexity.

  • The time used for component status detection is exponentially distributed with rate μ si . It is reasonable to assume that the time is much less than the time of repair process and is not taken into consideration when repair is conducted to avoid state explosion.

In this diagram, state k ij (1 ≤ im, 1 ≤ j ≤ n i ) represents the normal working state of ith subsystem with j active components. State k ij M stands for the corresponding repair and maintenance state from k ij . The performance of these states would be affected by the maintenance process running in the background while only the state of 0 and 0 M are failure states. State transition between states k ij is triggered by the random component failure. State transition from k ij to k ij M is the process waiting for the next inspection. State transition from k ij M back to k ij is the inspection/detection and the repair process. For any state k ij M, n i  − k ij components already fail in the system and repairmen process should be taken on these components. While at most n ri failed components could be repaired each time, the probability of only w VM is migrated successfully is denoted by \( C_{{n_{\hbox{min} } }}^{w} (q_{i} )^{w} (1 - q_{i} )^{{n_{\hbox{min} } - w}} \), where n min is the minimal value of n i  − k ij and n ri .

The overall state of the whole system is determined by the redundancy, inspection and maintenance policy on each subsystem. So, our main objective is to get the appropriate value of this configuration policy through optimization.

2.2 System Performance Evaluation Model

The reliability of a series-parallel system is determined by the reliability of each subsystem. Before constructing the optimization model for the whole system, the reliability and performance evaluation on top of the state transition diagram of subsystem i is first conducted. Since this kind of system is supposed to be running for a long time, steady state method is used to analyze this diagram as an irreducible CTMC [15].

Let S denotes the set of all states in Fig. 3 {0, … n i , 0 M, …, n i M}. Let π i,α denote the steady state probability of state α (α ∈ S) in subsystem i and π i denotes the vector [π ni , … π 0, π ni M, …, π 0 M]. Let Q i,αβ denotes the transition rate between two states (α ∈ S, β ∈ S) in subsystem i and Q i denotes the generator matrix of Q i,αβ . The value of Q i,αβ is

$$ Q_{i,\alpha \beta } = \left\{ {\begin{array}{*{20}l} {\alpha \lambda_{i} ,1 \le \alpha \le n_{i} ,\beta = \alpha - 1} \hfill \\ {\lambda_{mi} ,0 \le \alpha \le n_{i} ,\beta = \alpha M} \hfill \\ {\mu_{si} ,\alpha = n_{i} M,\beta = n_{i} } \hfill \\ {C_{w}^{\beta - \gamma } q_{i}^{\beta - \gamma } (1 - q_{i} )^{w - (\beta - \gamma )} w\mu_{i} ,w = \hbox{min} \{ n_{ri} ,n_{i} - \gamma \} ,\alpha = \gamma M,\gamma \le \beta \le n_{i} ,0 \le \gamma < n_{i} } \hfill \\ { - w\mu_{i} ,w = \hbox{min} \{ n_{ri} ,n_{r} - \gamma \} ,\alpha = \beta = \gamma M,0 \le \gamma < n_{i} } \hfill \\ { - \mu_{si} ,\alpha = \beta = n_{i} M} \hfill \\ { - \alpha \lambda_{i} - \lambda_{mi} ,0 \le \alpha \le n_{i} ,\alpha = \beta } \hfill \\ \end{array} } \right. $$
(1)

Considering the Kolmogorov’s backward equation [15], the limits value of π i,α is calculated as

$$ \mathop {\lim }\limits_{t \to \infty } \frac{{d\pi_{i,\alpha } (t)}}{dt} = (\sum\limits_{\beta \in S\,\&\,\beta \ne \alpha } {\pi_{i,\beta } Q_{i,\beta \alpha } } ) - \pi_{i,\alpha } Q_{i,\alpha \alpha } = 0 \Rightarrow \left\{ {\begin{array}{*{20}l} {\varvec{\pi}_{i} \varvec{Q}_{i} = {\mathbf{0}}} \hfill \\ {\sum\limits_{\alpha \in S} {\pi_{i,\alpha } } = 1} \hfill \\ \end{array} } \right. $$
(2)

which could also be represented in matrix form.

Given the certain system structure, the failure and recover rate of each component, π i,α is a function of the number of redundant components n i , the inspection rate λ mi and the maximum number of parallel repairing components n ri : π i,α  = π i,α (n i , λ mi , n ri ). Following the steady state analysis pattern, steady state availability A i is used to describe subsystem reliability and the expected performance rate P ri is used to describe the subsystem performance. Steady state availability A i is the probability of the subsystem in the working state. Since only states 0 and 0 M are failure states, given the value of n i , λ mi and n ri , the value of A i could be calculated as:

$$ A_{i} = \sum\limits_{\alpha \in S\,\&\,\alpha \ne 0,0M} {\pi_{i,\alpha } } = A_{i} \left( {n_{i} ,\lambda_{mi} ,n_{ri} } \right) $$
(3)

The expected performance rate P ri is the average work processing rate performed by all the working components in the subsystem. Let S i,U and S i,M denotes the set of normally working states {π ni , … π 1} and corresponding maintenance states {π ni M, …, π 1 M} for subsystem i. In the normally working state α ∈ S i,U . The work processing rate of the subsystem is the sum of processing rate of each working component. In the maintenance state β ∈ S i,M , the work processing rate would be degraded due to the inspection and maintenance process running in the background. Assume that the work processing rate for each component is p ri and the degradation rate is q d . The value of P ri is:

$$ P_{ri} = \sum\limits_{\alpha \in S} {\pi_{i,\alpha } p_{ri,\alpha } } = \sum\limits_{{\alpha \in S_{i,U} }} {\pi_{i,\alpha } \alpha p_{ri} } + \sum\limits_{{\beta \in S_{i,D} }} {q_{d} \pi_{i,\beta } kp_{ri} } = P_{ri} \left( {n_{i} ,\lambda_{mi} ,n_{ri} } \right),\beta = kM. $$
(4)

2.3 Optimization Model Formulation

The main objective of the optimization model is to define the optimal system structure and the maintenance policy for each component, so that the multi-state system performance is maximized, subject to steady state availability and cost budget constraints.

The objective function is evaluated as the average performance rate of the whole system P sys , which is calculated from the performance rate of each subsystem. The working processing time of a series system is the sum of the processing time of each subsystem and the corresponding rate is evaluated as

$$ P_{sys} = {1 \mathord{\left/ {\vphantom {1 {\sum\limits_{i = 1}^{m} {(1/P_{ri} \left( {n_{i} ,\lambda_{mi} ,n_{ri} } \right)) = f(\varvec{n},\varvec{\lambda}_{\varvec{m}} ,\varvec{n}_{\varvec{r}} )} }}} \right. \kern-0pt} {\sum\limits_{i = 1}^{m} {(1/P_{ri} \left( {n_{i} ,\lambda_{mi} ,n_{ri} } \right)) = f(\varvec{n},\varvec{\lambda}_{\varvec{m}} ,\varvec{n}_{\varvec{r}} )} }} $$
(5)

where n, λ m , n r are the corresponding vectors for each parameter in all the subsystem.

The constraint functions include the constraint of reliability, redundancy cost and maintenance cost. The system steady state availability A sys is the probability of the whole system in working state, which is the product of each subsystem’s availability:

$$ A_{sys} = \prod\limits_{i = 1}^{m} {A_{i} \left( {n_{i} ,\lambda_{mi} ,n_{ri} } \right)} = g(\varvec{n},\varvec{\lambda}_{\varvec{m}} ,\varvec{n}_{\varvec{r}} ). $$
(6)

The overall system redundancy cost C sys is the cost to configure redundant components in each subsystem, which depends on the number of redundant components. The monitoring and inspection cost is the cost to monitor and detect the status of each subsystem periodically and it is related to the monitoring rate in each subsystem. While the cost is also related to the system running time, M sys stands for the monitoring and inspection cost per time unit. The repair facility cost R sys is the cost used to repair failed components, which is determined by the maximal number of repair facilities prepared for each subsystem. Let c i denotes the redundancy cost of components in subsystem i, m di stands for the cost for one inspection and detection process and the r i is the cost of each prepared repair facility. The cost constraint functions are represented as

$$ C_{sys} = \sum\limits_{i = 1}^{m} {c_{i} n_{i} } ,M_{sys} = \sum\limits_{i = 1}^{m} {m_{di} \lambda_{mi} } ,R_{sys} = \sum\limits_{i = 1}^{m} {r_{i} n_{ri} } $$
(7)

When redundancy and inspection-based mechanism are both used in one system, it is difficult to choose the configuration settings of each kind of strategy in each subsystem. Based on these objective and constraint functions, the parameter determination problem is cast into an optimization problem. The optimization model is defined as choosing appropriate system configuration parameters n, λ m , n r to maximize system work processing rate under the availability and cost constraints:

$$ \begin{aligned} {\text{Max }}P_{sys} &= f(\varvec{n},\varvec{\lambda}_{\varvec{m}} ,\varvec{n}_{\varvec{r}} ) \\ s.t.\;A_{sys} &= g(\varvec{n},\varvec{\lambda}_{\varvec{m}} ,\varvec{n}_{\varvec{r}} ) \ge A_{0} ,C_{sys} = \sum\limits_{i = 1}^{m} {c_{i} n_{i} } \le C_{0} ,M_{sys} = \sum\limits_{i = 1}^{m} {m_{di} \lambda_{mi} } \le M_{0} ,R_{sys} = \sum\limits_{i = 1}^{m} {r_{i} n_{ri} } \le R_{0} \\ & \text{1} \le n_{ri} \le n_{i} \le n_{0} ,\text{0} < \lambda_{mi} \le \lambda_{m0} \\ \end{aligned} $$
(8)

A 0, C 0, M 0 and R 0 represents the corresponding constraints value and n 0, λ m0 are the value range of configuration parameters.

3 Solution Technique

Although Lagrange multiplier method could be used to solve the optimization problem by calculating partial derivative values, it asks for the closed-form representation of the optimization model. These is no closed-form formula of reliability and performance without certain value of n i and n ri , thus the proposed model is not a traditional non-linear optimization problem and could not be solved using exact method. So, in this section, a genetic algorithm is used to search the near-optimal solution.

3.1 Genetic Algorithm Framework

Among several kinds of evolutionary algorithms proposed for combinatorial optimization problem, genetic algorithm (GA) conducts a random, yet directed search process wherein solutions evolves according to biological reproduction rules [16]. It is superior to gradient descent technique or random sampling algorithm and has been used in many researches for RAP [17, 18]. The general framework of this algorithm is:

Following similar framework, the significant parts in genetic algorithm are the solution encoding mechanism, genetic operators and the fitness calculation method. The design of these parts is listed in the following subsections.

3.2 Solution Encoding Mechanism

Each individual in the population set represents a specific solution to the optimization problem. Since there are 3 × m elements to search in the optimization model, a triple-element encoding mechanism is used to represent each solution to the problem [19, 20]. Each solution in our algorithm is represented by a 3 × m matrix. Figure 4 illustrates an example of the encoded solution using this mechanism with m = 6.

Fig. 4.
figure 4

Example of triple-element encoding mechanism.

For the matrix shown in Fig. 4, the first row represents the redundancy of components in each subsystem n, the second row stands for the number of repair facilities for each subsystem n r while the third row is the inspection and monitoring rate of each subsystem λ m . Each column of this matrix is the redundancy and inspection-based maintenance configuration of each subsystem. The initialization of a set of these individuals is accomplished by randomly generating values of each element in the matrix. The initial value of n i or n ri is a random integer which satisfies the condition of 1 ≤ n ri  ≤ n i  ≤ n 0. The initial value of λ mi is real number randomly generate between 0 and λ m0.

3.3 Genetic Operators

In the crossover operation, a 3 × m crossover mask is generated as Fig. 5. There are only 0 and 1 in this mask matrix and this matrix is randomly composed by the two values. For two randomly selected individuals p x and p y (1 ≤ x, y ≤ N), once an element in the mask matrix equals to 1, crossover operation is conducted on the value of elements in the same position of the two solution matrixes. For integer values as n i or n ri simple swap operation is used in crossover operation. For real values of λ mi , the simulated binary crossover (SBX) operator is used to switch two elements:

$$ \begin{array}{*{20}c} {p_{x} .\tilde{\lambda }_{mi} ' = \frac{{((1 - \beta )p_{x} .\tilde{\lambda }_{mi} + (1 + \beta )p_{y} .\tilde{\lambda }_{mi} )}}{2}} \\ {p_{y} .\tilde{\lambda }_{mi} ' = \frac{{((1 + \beta )p_{x} .\tilde{\lambda }_{mi} + (1 - \beta )p_{y} .\tilde{\lambda }_{mi} )}}{2}} \\ \end{array} ,\beta = \left\{ {\begin{array}{*{20}l} {(2\alpha )^{{\frac{1}{{\eta { + }1}}}} ,0 \le \alpha < 0.5} \hfill \\ {\frac{1}{{((2(1 - \alpha ))^{{\frac{1}{{\eta { + }1}}}} )}},0.5 \le \alpha < 1} \hfill \\ \end{array} } \right. $$
(9)

where \( \eta \) is the crossover index and \( p_{x} .\tilde{\lambda }_{mi} \) is the normalized value of \( p_{x} .\lambda_{mi} \).

Fig. 5.
figure 5

Crossover operation for triple-element encoded individuals.

In the mutation operation, similarly, a 3 × m mutation mask is generated with randomly chosen 0 and 1, as Fig. 6. At most two elements could be set to 1 at each row to avoid dramatic change in the mutated individual. For any randomly chosen individual p x , if any element in mutation mask is 1, the corresponding element at the same position of p x is changed randomly. The changing process of integer element such as n i or n ri is similar as the population initialization process. For real numbers λ mi , polynomial mutation operator is used to make changes as:

$$ p_{x} .\tilde{\lambda }_{mi} ' = \left\{ {\begin{array}{*{20}l} {\begin{array}{*{20}l} {p_{x} .\tilde{\lambda }_{mi} + (2r)^{{\frac{1}{{\eta_{m} + 1}}}} - 1,{ 0} \le r < 0.5} \hfill \\ {p_{x} .\tilde{\lambda }_{mi} + 1 - [2(1 - r)]^{{\frac{1}{{\eta_{m} + 1}}}} ,{ 0} . 5\le r < 1} \hfill \\ \end{array} } \hfill \\ \end{array} } \right. $$
(10)

where \( \eta_{m} \) denotes the mutation index and \( p_{x} .\tilde{\lambda }_{mi} \) is the normalized value.

Fig. 6.
figure 6

Mutation operation for triple-element encoded individuals.

3.4 Fitness Value Calculation

The optimization problem in (8) is a single-objective multi-constraint optimization problem. The objective value represents the optimality of one individual while the constraint value represents whether this individual meets the requirement. One individual p x outperforms another p y if and only if the objective value of p x is larger than the value of p y and the constraint value of p x meets requirement. Thus, the fitness value I of each individual is calculated as the objective value meeting constraints value:

$$ I = \left\{ {\begin{array}{*{20}l} {R_{sys} ,A_{sys} \ge A_{0} ,C_{sys} \le C_{0} ,M_{sys} \le M_{0} ,R_{sys} \le R_{0} } \hfill \\ {0,else} \hfill \\ \end{array} } \right. $$
(11)

In selection process, a weight value w x is assigned for each individual so that individuals with larger fitness value will be assigned a higher possibility in selection:

$$ w_{x} = \left\{ {\begin{array}{*{20}l} {w_{x - 1} + {{I_{x} } \mathord{\left/ {\vphantom {{I_{x} } {\sum\limits_{y = 1}^{N} {I_{y} } }}} \right. \kern-0pt} {\sum\limits_{y = 1}^{N} {I_{y} } }}} \hfill \\ {0,x = 0.} \hfill \\ \end{array} } \right. $$
(12)

4 Illustrative Examples

In this section, numerical examples are listed to illustrate the optimization of joint redundancy and inspection-based maintenance mechanism and the process of searching near-optimal solutions. System parameters are mainly collected from a cloud-based system used for distributed and parallel processing [27]. Data from 8 series-parallel subsystems in this system are used for analysis. For each subsystem, the failure rate, repair rate, job processing rate of one component are extracted from the operation profiles and presented in Table 1. According to the maintenance schedule, the value of μ si for each component is 30 per hour and the inspection and monitoring cost of each component is set as 1 unit. The repair successful probability q i is 0.9 and the performance degradation rate is 0.8. The redundancy cost c i and repair facility cost r i are estimated and shown in Table 1.

Table 1. Component parameters for example

4.1 Reliability and Performance Analysis

The reliability and performance analysis is first conducted on the first subsystem. The failure rate λ i and repair rate μ i is 0.00499 and 0.551 per hour. The inspection and detection rate μ s i is 30 per hour. Using only one subsystem, the working processing rate p ri in this experiment is set as 1 unit. First of all, the inspection rate is set as 0.5 per hour and the number of working repair facility is set as 3. The corresponding reliability and performance of this subsystem with redundancy change from 1 to 10 are shown in Fig. 7(a). Then the number of redundant component is set as 5 and inspection rate changes from 0 to 1 per hour. The result of subsystem with different number of repair facilities is included in Fig. 7(b). At last, the number of redundant component is set as 5 and the inspection rate is 0.5 per hour, the result of subsystem with the number of repair facilities changing from 1 to 10 is shown as Fig. 7(c).

Fig. 7.
figure 7

Subsystem reliability and performance change with different parameters.

It is shown from Fig. 7 that the increase of redundant components and inspection rate has significant impact on the subsystem reliability and performance change. However, with the continuous change of these parameters, the increase trend of both reliability and performance rate declines and converges to 0, except for the performance increase with redundancy increase. The change in the number of repair facility has little impact on both reliability and performance increase. When the number of working repair facilities is larger than then number of redundant components, there is no change in both system metrics. The experiment data shows that three kinds of configuration parameters have total different impact on the overall system.

4.2 Mechanism Optimization

Figure 8 illustrates the overall system reliability and performance change with the configuration parameters change of the first subsystem. The initial redundancy and repair facility for each subsystem is set as 1 and the inspection rate is 0.1 per hour. It is shown that the impact of parameter change on the reliability and performance improvement generally declines dramatically as the parameters increases.

Fig. 8.
figure 8

System reliability and performance change of different groups of parameters.

So in this section, the maximum number of redundant components and repairing facilities in any subsystem is set as 10 and 5 respectively. The maximal inspection rate is set as 1. The maximum generation in the algorithm is set as 200 and the population size is 100. System parameters in Table 1 is used in this experiment for redundancy optimization. Given availability constraint 0.99, redundancy cost 70, repair facilities cost 70 and inspection cost 5, the optimal configuration of redundancy and maintenance policy using the joint mechanism is calculated and the fitness value of each generation is shown in Fig. 9. The fitness value is the negative value of performance rate: −P i . The result in Fig. 9 indicates that the algorithm already converges within 50 generations.

Fig. 9.
figure 9

The mean and best fitness value of each generation.

The calculated optimal job performance rate is 0.7448 in this test case and the corresponding configuration parameter for each subsystem is shown in Table 2.

Table 2. Optimal component configuration result

In the following experiments, four kinds of cost constraints are changed to show the different result on joint mechanism optimization. Since GA is a stochastic search algorithm, five trials are performed for each experiment and the best solution is considered as the final solution. First of all, the availability constraint is kept as 0.99 with repair facilities cost as 70 and inspection cost as 5. The result of maximized performance change with near-optimal solution is drawn in Fig. 10(a) while the redundancy cost changes from 60 to 100. Then, using the same availability constraint with redundancy cost as 70 and changing the value of repair facilities cost, the corresponding optimization result for performance rate is shown in Fig. 10(b). Keeping the cost constraints for both redundancy and repair as 70 and changing the value of inspection cost from 1 to 10, Fig. 10(c) shows the optimization result under availability constraint 0.90. Figure 10(d) presents the result for availability constraint change from 0.90 to 0.99.

Fig. 10.
figure 10

The change of optimization result with different constraints.

It is shown in Fig. 10 that given different cost constraint, different system configuration would be get with different performance optimization result. Loosening any kind of cost constraint could all get better result on performance optimization. Specifically, the change of redundancy cost brings significant change on the optimization result while only little change is obtained for the other two kinds of cost constraints change. On the contrary, once the availability constraint could be met under the same cost, different availability constraints generally lead to similar system configuration and performance.

5 Conclusion and Future Work

Aiming at the optimal joint redundancy and inspection-based maintenance configuration problem in series-parallel system, this paper pays attention to the routine monitoring and inspection mechanisms, which helps to detect components status and trigger repair process. Considering the shared repair facilities for different component in each subsystem, a state transition models is first built to analyze the inspection-based maintenance mechanism and Markov chain theory is used to get steady state availability and performance metrics for the system. The optimization problem is built to search the appropriate system structure and maintenance policy maximizing system performance while meeting availability and cost constraints. Due to the complexity of uncertain optimization model, genetic algorithm is used to search optimal solution, built upon triple-element encoding mechanism. Specific genetic operators are designed for the triple-element encoding method and illustrative examples are conducted to explain the calculation and optimization process. Experiment results show that the optimization model and corresponding solution technique could be used to search optimal system configuration under given constraints and different cost constraints would lead to different optimization result while meeting availability constraints. This paper takes a preliminary study on the optimization of joint redundancy and inspection-based maintenance model. The analysis model is simplified on top of several assumptions, such as the exponential distribution time-to-failure. In the future, this model could be modified to be suited for more general failure distributions. Meanwhile, some local search operator could be combined into the genetic algorithm to improve the algorithm efficiency.