
In the current international marketplace, continuous quality improvement is pivotal for maintaining a competitive advantage. Although quality improvement activities are most efficient and cost-effective when implemented as part of the design and development stages (off-line), on-line activities such as statistical process control (SPC) are vital for maintaining quality during manufacturing.

Statistical process control (SPC) is an effective tool for achieving process stability and improving process capability through variation reduction. Primarily, SPC is used to classify sources of process variation as either common cause or assignable cause. Common cause variations are inherent to a process and can be described implicitly or explicitly by stochastic models. Assignable cause variations are unexpected and difficult to predict beforehand. The basic idea of SPC is to quickly detect and correct assignable cause variation before quality deteriorates and defective units are produced. The primary SPC tool was developed in the 1920s by Walter Shewhart of Bell Telephone Laboratories and has been tremendously successful in manufacturing applications [1,2,3].

Robust design (RD) is a systematic methodology that uses statistical experimental design to improve the design of products and processes. By making product and process performance insensitive (robust) to hard-to-control disturbances (noise), robust design simultaneously improves product quality, the manufacturing process, and reliability. The RD method was originally developed by the Japanese quality consultant, Genichi Taguchi [4]. Taguchi’s 1980 introduction of robust parameter design to several major American industries resulted in significant quality improvements in product and process design [5]. Since then, a great deal of research on RD has improved related statistical techniques and clarified underlying principles.

In addition, many RD case studies have demonstrated phenomenal cost savings. In the electronics industry, Kackar and Shoemaker [6] reported a 60% process variance reduction; Phadke [5] reported a fourfold reduction in process variance and a twofold reduction in processing time – both from running simple RD experiments. In other industries, the American Supplier Institute (1983–1990) reported a large number of successful case studies in robust design.

Although most data is multivariate in nature, research in both areas has largely focused on normally distributed univariate characteristics (responses). Montgomery and Woodall [2] present a comprehensive panel discussion on SPC (see also Woodall and Montgomery [7]) and multivariate methods are reviewed by Lowry and Montgomery [8] and Mason [9]. Seminal research papers on RD include Kackar [10], Leon et al. [11], Box [12], Nair [13], and Tsui [14]. RD problems with multiple characteristics are investigated by Logothetis and Haigh [15], Pignatiello [16], Elsayed and Chen [17], and Tsui [18]. This research has yielded techniques allowing engineers to effectively implement SPC and RD in a host of applications.

This chapter briefly revisits the major developments in both SPC and RD that have occurred over the last 30 years and suggests future research directions while highlighting multivariate approaches. Section 11.1 covers SPC of univariate and multivariate random variables for both Shewhart (including \( \overline{x} \) and s charts) and non-Shewhart approaches (CUSUM and EWMA), while assessing the effects of autocorrelation and automatic process control. Section 11.2 considers univariate RD, emphasizing performance measures and modeling for loss functions, dual responses, and desirability functions; deals with multivariate and dynamic RD; and recaps RD case studies from the statistics literature in manufacturing, process control, and tolerance design. Finally, Sect. 11.3 provides an overview of the PHM and SHMM frameworks.

1 Statistical Process Control for Single Characteristics

The basic idea in statistical process control is a binary view of the state of a process; in other words, it is either running satisfactorily or not. Shewhart [19] asserted that the process state is related to the type of variation manifesting itself in the process. There are two types of variation, called common cause and assignable or special cause variation. Common cause variation refers to the assumption that “future behavior can be predicted within probability limits determined by the common cause system” [20]. Special cause variation refers to “something special, not part of the system of common causes” [21]. A process that is subject only to common cause variation is “statistically” in control, since the variation is inherent to the process and can be eliminated only with great difficulty. The objective of statistical process control is to identify and remove special cause variation as quickly as possible.

SPC charts essentially mimic a sequential hypothesis test to distinguish assignable cause variation from common cause variation. For example, a basic mathematical model behind SPC methods for detecting change in the mean is

$$ {X}_t={\eta}_t+{Y}_t, $$

where Xt is the measurement of the process variable at time t, and ηt is the process mean at that time. Here Yt represents variation from the common cause system. In some applications, Yt can be treated as an independently and identically distributed (iid) process. With few exceptions, the mean of the process is constant except for abrupt changes, so

$$ {\eta}_t=\eta +{\mu}_t, $$

where η is the mean target and μt is zero for t < t0 and has nonzero values for t ≥ t0. For analytical simplicity, step changes are often assumed; in other words, μt remains at a new constant level μ for t ≥ t0.

1.1 SPC for i.i.d. Processes

The statistical goal of SPC control charts is to detect the change point t0 as quickly as possible and trigger corrective action to bring the process back to the quality target. Among many others, the Shewhart chart, the EWMA chart, and the CUSUM chart are three important and widely used control charts.

1.1.1 Shewhart Chart

The Shewhart control chart monitors the process observations directly,

$$ {W}_t={X}_t-\eta . $$

Assuming that the standard deviation of Wt is σW, the stopping rule of the Shewhart chart is defined as |Wt| > LσW, where the constant L is prespecified to maintain particular probability properties.

1.1.2 EWMA Chart

Roberts [22] introduces a control charting algorithm based on the exponentially weighted moving average of the observations,

$$ {W}_t=\sum \limits_{i=0}^{\infty }{w}_i\left({X}_{t-i}-\eta \right), $$

where \( {w}_i=\lambda {\left(1-\lambda \right)}^i \) (0 < λ ≤ 1). It can be rewritten as

$$ {W}_t=\left(1-\lambda \right)\!{W}_{t-1}+\lambda\!\left({X}_t-\eta \right), $$
(11.1)

where W0 = 0 or the process mean. The stopping rule of the EWMA chart is |Wt| > LσW, where \( {\sigma}_W=\sqrt{\lambda/\!\left(2-\lambda \right)}{\sigma}_X \). The Shewhart chart is a special case of the EWMA chart with λ = 1. When the underlying process is i.i.d., the EWMA chart with a small λ value is sensitive to small and medium shifts in mean [23].

1.1.3 CUSUM Chart

Page [24] introduces the CUSUM chart as a sequential probability ratio test. It can be obtained by letting λ approach zero in (11.1). The CUSUM algorithm assigns equal weights to past observations, and its tabular form consists of two quantities,

$$ {\displaystyle \begin{array}{l}{W}_t^{+}=\max \left[0,{W}_{t-1}^{+}+\left({X}_t-\eta \right)-k{\sigma}_X\right],\\ {}{W}_t^{-}=\min \left[0,{W}_{t-1}^{-}+\left({X}_t-\eta \right)+k{\sigma}_X\right],\end{array}} $$

where \( {W}_0^{+}={W}_0^{-}=0 \). It can be shown that the CUSUM chart with kσX = μ/2 is optimal for detecting a mean shift of size μ when the observations are i.i.d.

Because of the randomness of the observations, these control charts may trigger false alarms – out-of-control signals issued when the process is still in control. The expected number of units measured between two successive false alarms is called the in-control average run length (ARL0). When a special cause presents itself, the expected period before a signal is triggered is called the out-of-control average run length (ARL1). The ideal control chart has a long ARL0 and a short ARL1. The Shewhart chart typically uses the constant L = 3, so that the in-control ARL is about 370 when the underlying process is i.i.d. with a normal distribution.
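
To make the three stopping rules concrete, the following sketch (not from the chapter; the design constants lam, k, and L are illustrative) computes the Shewhart, EWMA, and CUSUM statistics for an i.i.d. normal stream and checks by simulation that the 3-sigma Shewhart chart has an in-control ARL of roughly 370.

```python
# Minimal sketch of the three basic charting statistics for an i.i.d.
# N(eta, sigma^2) stream, plus a Monte Carlo check of the Shewhart ARL0.
import numpy as np

rng = np.random.default_rng(0)
eta, sigma = 0.0, 1.0          # in-control mean and standard deviation
lam, k, L = 0.2, 0.5, 3.0      # EWMA weight, CUSUM reference, Shewhart limit

def ewma_stats(x):
    """W_t = (1 - lam) * W_{t-1} + lam * (x_t - eta), with W_0 = 0 (Eq. 11.1)."""
    w, out = 0.0, []
    for xt in x:
        w = (1 - lam) * w + lam * (xt - eta)
        out.append(w)
    return np.array(out)

def cusum_stats(x):
    """Two-sided tabular CUSUM with reference value k * sigma."""
    wp = wm = 0.0
    out = []
    for xt in x:
        wp = max(0.0, wp + (xt - eta) - k * sigma)
        wm = min(0.0, wm + (xt - eta) + k * sigma)
        out.append((wp, wm))
    return np.array(out)

def shewhart_run_length(max_n=100_000):
    """Number of observations until |X_t - eta| exceeds L * sigma."""
    for t in range(1, max_n + 1):
        if abs(rng.normal(eta, sigma) - eta) > L * sigma:
            return t
    return max_n

arl0 = np.mean([shewhart_run_length() for _ in range(2000)])
print(f"estimated in-control ARL of the 3-sigma Shewhart chart: {arl0:.0f}")  # about 370
```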

These SPC charts are very effective for monitoring the process mean when the process data is i.i.d. It has been shown that the Shewhart chart is sensitive for detecting large shifts while the EWMA and CUSUM charts are sensitive to small shifts [23]. However, a fundamental assumption behind these SPC charts is that the common cause variation is free of serial correlation. Due to the prevalence of advanced sensing and measurement technology in manufacturing processes, the assumption of independence is often invalid. For example, measuring critical in-process dimensions is now possible on every unit in the production of discrete parts. In continuous process production systems, the presence of inertial elements such as tanks, reactors, and recycle streams often results in significant serial correlation in the measured variables. Serial correlation creates many challenges and opportunities for SPC methodologies.

1.2 SPC for Autocorrelated Processes

Traditional SPC charts have been shown to function poorly when monitoring serially correlated processes [25, 26]. To accommodate autocorrelation, the following time series methods have been proposed.

1.2.1 Modifications of Traditional Methods

One common SPC strategy is to plot the autocorrelated data on traditional charts whose limits have been modified to account for the correlation. Johnson and Bagshaw [27] and Bagshaw and Johnson [28] consider the effects of autocorrelation on CUSUM charts using the weak convergence of cumulative sums to a Wiener process. Another alternative is the exponentially weighted moving average chart for stationary processes (EWMAST) studied by Zhang [29]. Jiang et al. [30] extend this to a general class of control charts based on autoregressive moving average (ARMA) charts. The monitoring statistic of an ARMA chart is defined to be the result of a generalized ARMA(1, 1) process applied to the underlying process {Xt},

$$ {\displaystyle \begin{aligned}{W}_t&={\theta}_0{X}_t-\theta {X}_{t-1}+\phi {W}_{t-1}\\ &={\theta}_0\left({X}_t-\beta {X}_{t-1}\right)+\phi {W}_{t-1},\end{aligned}} $$
(11.2)

where β = θ/θ0 and θ0 is chosen so that the sum of the coefficients is unity when Wt is expressed in terms of the Xt’s, so θ0 = 1 + θ − ϕ. The authors show that these charts exhibit good performance when the chart parameters are chosen appropriately.
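
A minimal sketch of the ARMA charting statistic of Eq. (11.2) follows, with the observations centered at the in-control mean η; the values of θ and ϕ are illustrative rather than the tuned choices discussed by the authors.

```python
# Minimal sketch of the ARMA(1,1) charting statistic of Eq. (11.2).
import numpy as np

def arma_chart(x, eta, theta, phi):
    theta0 = 1.0 + theta - phi            # makes the coefficients sum to one
    w = np.zeros(len(x))
    for t in range(1, len(x)):
        w[t] = theta0 * (x[t] - eta) - theta * (x[t - 1] - eta) + phi * w[t - 1]
    return w
```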

1.2.2 Forecast-Based Monitoring Methods

Forecast-based charts started with the special-cause charts (SCC) proposed by Alwan and Roberts [31]. The general idea is to first apply a one-step-ahead predictor to the observation {Xt} and then monitor the corresponding prediction error,

$$ {W}_t={e}_t, $$
(11.3)

where \( {e}_t={X}_t-{\hat{X}}_t \) is the forecast error of predictor \( {\hat{X}}_t \). The SCC method is the first example that uses minimum mean squared error (MMSE) predictors and monitors the MMSE residuals. When the model is accurate, the MMSE prediction errors are approximately uncorrelated. This removal of correlation means that control limits for the SCC can be easily calculated from traditional Shewhart charts, EWMA charts, and CUSUM charts. Another advantage of the SCC method is that its performance can be analytically approximated.

The SCC method has attracted considerable attention and has been extended by many authors. Among them, Harris and Ross [25] and Superville and Adams [32] investigate process monitoring based on the MMSE prediction errors for simple autoregressive [AR(1)] models; Wardell et al. [33, 34] discuss the performance of SCC for ARMA(1, 1) models; and Vander Wiel [35] studies the performance of SCC for integrated moving average [IMA(0, 1, 1)] models. SCC methods perform poorly when detecting small shifts since a constant mean shift always results in a dynamic shift pattern in the error term.
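
As an illustration of the residual-monitoring idea, the following sketch assumes an AR(1) common-cause model Xt = η + ϕ(Xt−1 − η) + at; the MMSE one-step forecast and its prediction errors are computed directly, and the errors can then be plotted on any of the standard charts above. In practice ϕ would be estimated from in-control data.

```python
# Minimal SCC-style sketch under an assumed AR(1) common-cause model.
import numpy as np

def scc_residuals(x, eta, phi):
    """One-step MMSE prediction errors e_t = X_t - Xhat_t for an AR(1) model."""
    e = np.zeros(len(x))
    for t in range(1, len(x)):
        xhat = eta + phi * (x[t - 1] - eta)    # MMSE forecast made at time t-1
        e[t] = x[t] - xhat
    return e[1:]          # approximately uncorrelated when the model is correct
```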

In general this approach can be applied to any predictor. Montgomery and Mastrangelo [36] recommend the use of EWMA predictors in the SCC method (hereafter called the M–M chart). Jiang et al. [37] propose the use of proportional-integral-derivative (PID) predictors

$$ {\displaystyle \begin{aligned}{\hat{X}}_t&={\hat{X}}_{t-1}+\left({k}_{\mathrm {P}}+{k}_{\mathrm {I}}+{k}_{\mathrm {D}}\right)\!{e}_{t-1}\\ &\quad -\left({k}_{\mathrm {P}}+2{k}_{\mathrm {D}}\right){e}_{t-2}+{k}_{\mathrm {D}}{e}_{t-3},\end{aligned}} $$
(11.4)

where kP, kI, and kD are parameters of the PID controller defined in Sect. 11.1.3. The family of PID-based charts includes the SCC, EWMA, and M–M charts as special cases. Jiang et al. [37] show that the predictors of the EWMA chart and M–M chart may sometimes be inefficient and the SCC over-sensitive to model deviation. They also show that the performance of the PID-based chart is affected by the choice of chart parameters. For any given underlying process, one can therefore tune the parameters of the PID-based chart to optimize its performance.
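
A minimal sketch of a chart driven by the PID predictor of Eq. (11.4) is given below; the gains kP, kI, and kD are illustrative, the start-up values are simply set to zero, and the monitored statistic is the prediction error.

```python
# Minimal sketch of a PID-predictor-based chart built around Eq. (11.4).
import numpy as np

def pid_prediction_errors(x, kP=0.2, kI=0.2, kD=0.05):
    n = len(x)
    xhat = np.zeros(n)
    e = np.zeros(n)
    for t in range(n):
        if t >= 3:
            xhat[t] = (xhat[t - 1]
                       + (kP + kI + kD) * e[t - 1]
                       - (kP + 2 * kD) * e[t - 2]
                       + kD * e[t - 3])
        e[t] = x[t] - xhat[t]      # monitored prediction error
    return e

# Special cases: kP = kD = 0 gives the EWMA predictor (M-M chart) with lambda = kI,
# and replacing xhat by an MMSE predictor recovers the SCC.
```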

1.2.3 GLRT-Based Multivariate Methods

Since forecast-based residual methods monitor a single statistic et, they often suffer from the problem of a narrow “window of opportunity” when the underlying process is positively correlated [35]. If the shift occurrence time is known, the problem can be alleviated by including more historical observations/residuals in the test. This idea was first proposed by Vander Wiel [35] using a generalized likelihood ratio test (GLRT) procedure. Assuming residual signatures {δi} when a shift occurs, the GLRT procedure based on residuals is

$$ {W}_t=\underset{0\le k\le p-1}{\max}\left|\sum \limits_{i=0}^k{\delta}_i{e}_{t-k+i}\right|/\sqrt{\sum \limits_{i=0}^k{\delta}_i^2}, $$
(11.5)

where p is the prespecified size of the test window. Apley and Shi [38] show that this procedure is very efficient in detecting mean shifts when p is sufficiently large. Similar to the SCC methods, this procedure is model-based, and the accuracy of the signature strongly depends on the window length p. If p is too small and a shift is not detected within the test window, the signature in (11.5) might no longer be valid and the test statistic no longer efficient.
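
A minimal sketch of the statistic in (11.5) follows. The all-ones signature used as the default here corresponds to a step mean shift with approximately uncorrelated residuals and is only an illustration, since the true signature depends on the process model.

```python
# Minimal sketch of the window-based GLRT statistic of Eq. (11.5).
import numpy as np

def glrt_stat(e, p, delta=None):
    """GLRT charting statistic at the most recent time point of residual series e."""
    e = np.asarray(e, dtype=float)
    if delta is None:
        delta = np.ones(p)        # illustrative step-shift signature for ~i.i.d. residuals
    best = 0.0
    for k in range(p):            # candidate shift k steps back uses e_{t-k}, ..., e_t
        window = e[-(k + 1):]
        num = abs(np.dot(delta[:k + 1], window))
        best = max(best, num / np.sqrt(np.sum(delta[:k + 1] ** 2)))
    return best
```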

Note that a step mean shift at time t − k + 1 results in a signature

$$ {d}_k={\left(0,\cdots, 0,\overset{k}{\overbrace{1,\cdots, 1}}\right)}^{\prime}\quad \left(1\le k\le p\right) $$

and

$$ {d}_k={\left(1,1,\cdots, 1\right)}^{\prime}\quad \left(k>p\right) $$

on \( {U}_t={\left({X}_{t-p+1},{X}_{t-p+2},\cdots, {X}_t\right)}^{\prime } \). To test these signatures, the GLRT procedure based on the observation vector Ut is defined as

$$ {W}_t=\underset{0\le k\le p-1}{\max}\left|{d}_k^{\prime }{\Sigma}_U^{-1}{U}_t\right|/\sqrt{d_k^{\prime }{\Sigma}_U^{-1}{d}_k}, $$
(11.6)

where ΣU is the covariance matrix of Ut. Jiang [39] points out that this GLRT procedure is essentially model-free and always matches the true signature of Ut regardless of the timing of the change point. If a non-step shift in the mean occurs, multivariate charts such as Hotelling’s T2 charts can be developed accordingly [40].

1.2.4 Monitoring Batch Means

One of the difficulties with monitoring autocorrelated data is accounting for the underlying autocorrelation. In simulation studies, it is well known that batch means reduce autocorrelation within data. Motivated by this idea, Runger and Willemain [41, 42] use a weighted batch mean (WBM) and a unified batch mean (UBM) to monitor autocorrelated data. The WBM method weights the observations when forming each batch mean, defines the batch size so that the autocorrelation among batches is reduced to zero, and requires knowledge of the underlying process model [43]. The UBM method determines the batch size so that the autocorrelation remains below a certain level and is “model-free”. Runger and Willemain show that the UBM method is simple and often more cost-effective in practice.

Batch-means methods not only develop statistics based on batch means, but also provide variance estimation of these statistics for some commonly used SPC charts. Alexopoulos et al. [44] discuss promising methods for dealing with correlated observations, including nonoverlapping batch means (NBM), overlapping batch means (OBM), and standardized time series (STS).
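
A minimal sketch of the batch-means idea is shown below, assuming an AR(1) series purely for illustration: with a suitable batch size, the batch means have a much smaller lag-1 autocorrelation and can be plotted on an ordinary Shewhart chart.

```python
# Minimal sketch of monitoring nonoverlapping batch means of an autocorrelated series.
import numpy as np

def batch_means(x, b):
    m = len(x) // b
    return x[:m * b].reshape(m, b).mean(axis=1)

def lag1_autocorr(z):
    z = z - z.mean()
    return float(np.dot(z[:-1], z[1:]) / np.dot(z, z))

# Illustration: AR(1) data with phi = 0.8; batching shrinks the serial correlation.
rng = np.random.default_rng(1)
x = np.zeros(20_000)
for t in range(1, len(x)):
    x[t] = 0.8 * x[t - 1] + rng.normal()

print(lag1_autocorr(x), lag1_autocorr(batch_means(x, b=50)))  # ~0.8 vs. near zero
```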

1.3 SPC Versus APC

Automatic process control (APC) complements SPC as a variation reduction tool for manufacturing industries. While SPC techniques are used to reduce unexpected process variation by detecting and removing the cause of variation, APC techniques are used to reduce systematic variation by employing feedforward and feedback control schemes. The relationships between SPC and APC are important to both control engineers and quality engineers.

1.3.1 Feedback Control Versus Prediction

The feedback control scheme is a popular APC strategy that uses the deviation of output from target (set-point) to signal a disturbance of the process. This deviation or error is then used to compensate for the disturbance. Consider a pure-gain dynamic feedback-controlled process, as shown in Fig. 11.1. The process output can be expressed as

$$ {e}_t={X}_t+{Z}_{t-1}. $$
(11.7)
Fig. 11.1 Automatic process control

Suppose \( {\hat{X}}_t \) is an estimator (a predictor) of Xt that can be obtained at time t − 1. A realizable form of control can be obtained by setting

$$ {Z}_{t-1}=-{\hat{X}}_t $$
(11.8)

so that the output error at time t becomes

$$ {e}_t={X}_t-{\hat{X}}_t, $$
(11.9)

which is equal to the “prediction error”. Control and prediction therefore have a one-to-one correspondence via (11.8) and (11.9).

As shown in Box and Jenkins [45], when the process can be described by an ARIMA model, the MMSE control and the MMSE predictor have exactly the same form. Serving as an alternative to the MMSE predictor, the EWMA predictor corresponds to the integral (I) control [46] and is one of the most frequently used prediction methods due to its simplicity and efficiency. In general, the EWMA predictor is robust against nonstationarity due to the fact that the I control can continuously adjust the process whenever there is an offset.

An extension of the I control is the widely used PID control scheme,

$$ {Z}_t=-{k}_{\mathrm{P}}{e}_t-{k}_{\mathrm{I}}\frac{1}{1-B}{e}_t-{k}_{\mathrm{D}}\left(1-B\right)\!{e}_t, $$
(11.10)

where kP, kI, and kD are constants that, respectively, determine the amount of proportional, integral, and derivative control action. The corresponding PID predictor (11.4) can be obtained from (11.8) and (11.10). When kD = 0 (so that the coefficients of et−1 and et−2 in (11.4) reduce to kP + kI and −kP), we have a PI predictor corresponding to the proportional-integral control scheme commonly used in industry.
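
A minimal discrete-time sketch of the PID adjustment rule (11.10) follows, written in incremental (velocity) form so that the integral term does not require an explicit running sum; the gains are illustrative.

```python
# Minimal sketch of the PID control action of Eq. (11.10) in incremental form.
def pid_adjustments(errors, kP=0.2, kI=0.1, kD=0.05):
    """Return the control actions Z_t for a sequence of output errors e_t."""
    z, out = 0.0, []
    e1 = e2 = 0.0                      # e_{t-1} and e_{t-2}
    for e in errors:
        z += -kP * (e - e1) - kI * e - kD * (e - 2 * e1 + e2)
        out.append(z)
        e2, e1 = e1, e
    return out
```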

1.3.2 Process Prediction Versus Forecast-Based Monitoring Methods

As discussed in Sect. 11.1.2, one class of SPC methods for autocorrelated processes starts from the idea of “whitening” the process and then monitoring the “whitened” process with time series prediction models. The SCC method monitors MMSE prediction errors and the M–M chart monitors the EWMA prediction error. Although the EWMA predictor is optimal for an IMA(0, 1, 1) process, the prediction error is no longer i.i.d. when predicting other processes. Most importantly, because the I control underlying the EWMA predictor compensates for mean shifts in steady state, the prediction error returns toward zero after a shift, which makes the M–M chart very problematic for detecting small shifts in mean.

Since PID control is very efficient and robust, PID-based charts motivated by PID predictors outperform SCC and M–M charts. Moreover, APC-based knowledge of the process can clarify the performance of PID-based charts. In summary, the P term keeps the process output close to the set point and is thus sensitive in SPC monitoring, whereas the I term always yields control action regardless of error size, which leads to zero steady-state error; this implies that the I term is dominant in SPC monitoring. The purpose of derivative action in PID control is to improve closed-loop stability, and the corresponding D term is less sensitive in SPC monitoring. Although there is no connection between the EWMA predictor and the EWMA chart, it is important to note that the I control leads to the EWMA predictor and that the EWMA prediction-based chart is the M–M chart. As shown in Jiang et al. [37], the EWMA chart is the same as the P-based chart.

1.4 SPC for Automatically Controlled Processes

Although APC and SPC techniques share the objective of reducing process variation, their advocates have quarreled for decades. It has recently been recognized that the two techniques can be integrated to produce more efficient tools for process variation reduction [47,48,49,50,51,52]. This APC/SPC integration employs an APC rule to regulate the system and superimposes SPC charts on the APC-controlled system to detect process departures from the system model. Using Deming’s terminology, the APC scheme is responsible for reducing common cause variation, while the SPC charts are responsible for reducing assignable cause variation. From the statistical point of view, the former part resembles a parameter estimation problem for forecasting and adjusting the process and the latter part emulates a hypothesis test of process location. Figure 11.2 pictures a conceptual integration of SPC charts into the framework of a feedback control scheme. To avoid confusion, Box and Luceno [46] refer to APC activities as process adjustment and to SPC activities as process monitoring. Since this chapter emphasizes SPC methods for quality improvement, we discuss only the monitoring component of APC/SPC integration.

Fig. 11.2 APC/SPC integration

As discussed in Sect. 11.1.3, control charts developed for monitoring autocorrelated observations shed light on the monitoring of integrated APC/SPC systems. A natural recommendation is to apply SPC monitoring to the output of the automatically controlled process, which is equivalent to a forecast-based control chart built on the corresponding predictor. For example, if the process is controlled by an MMSE controller, monitoring the output is exactly the same as the SCC method. As with forecast-based methods, the effect of an assignable cause is always contaminated by the APC control action, which results in a limited “window of opportunity” for detection [35]. As an alternative, some authors suggest that monitoring the APC control action may improve the probability of detection [20]. Jiang and Tsui [53] compare the performance of monitoring the output vs. the control action of an APC process and show that for some autocorrelated processes monitoring the control action may be more efficient than monitoring the output of the APC system.

In general, the performance achieved by SPC monitoring an APC process depends on the data stream (the output or the control action) being measured, the APC control scheme employed, and the underlying autocorrelation of the process. If information from the process output and the control action can be combined, a universal monitor with higher SPC efficiency [51] can be developed. Kourti et al. [54] propose a method of monitoring process outputs conditional on the inputs or other changing process parameters. Li et al. [55] propose multivariate control charts such as Hotelling’s T2 chart and the Bonferroni approach to monitor the output and control action simultaneously. Defining the vector of outputs and control actions as Vt = (et,⋯, et−p+1, Xt,⋯, Xt−p+1)′, a dynamic T2 chart with window size p monitors the statistic

$$ {W}_t={V}_t^{\prime }{\Sigma}_V^{-1}{V}_t, $$

where ΣV is the covariance matrix of Vt [56]. Wt follows a χ2 distribution during each period given known process parameters. However, strong serial correlation exists so that the χ2 quantiles cannot be used for control limits. By recognizing the mean shift patterns of Vt, Jiang [57] develops a GLRT procedure based on Vt. This GLRT procedure is basically univariate and more efficient than the T2 chart.
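
A minimal sketch of the dynamic T2 statistic is given below; the function names are hypothetical, and ΣV (or its estimate from Phase I windows of stacked errors and observations) is assumed to be available.

```python
# Minimal sketch of the dynamic T^2 statistic W_t = V_t' Sigma_V^{-1} V_t.
import numpy as np

def dynamic_t2(errors, observations, p, sigma_v_inv):
    """T^2 statistic for the window ending at the most recent time point."""
    v = np.concatenate([errors[-p:], observations[-p:]])   # V_t
    return float(v @ sigma_v_inv @ v)

# Sigma_V would typically be estimated from Phase I windows (rows of stacked V_t):
# v_rows = np.array([...]); sigma_v_inv = np.linalg.inv(np.cov(v_rows, rowvar=False))
```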

1.5 Design of SPC Methods: Efficiency Versus Robustness

Among many others, the minimization of mean squared error/prediction error is one of the important criteria for prediction/control scheme design. Although the special cause chart is motivated by MMSE prediction/control, many previously mentioned SPC charts such as the PID chart have fundamentally different criteria from those of the corresponding APC controllers. When selecting SPC charts, the desired goal is maximization of the probability of shift detection.

For autocorrelated processes, Jiang et al. [37] propose an ad hoc design procedure using PID charts. They demonstrate how two capability indices defined by signal-to-noise ratios (SNR) play a critical role in the evaluation of SPC charts. Denote by σW the standard deviation of the charting statistic Wt, and by μT and μS the shift levels of Wt at the first step after the shift and after the shift has persisted long enough (steady state), respectively. The transient state ratio is defined as CT = μT/σW, which measures the capability of the control chart to detect a shift in its first few steps. The steady state ratio is defined as CS = μS/σW, which measures the ability of the control chart to detect a shift in its steady state. These two signal-to-noise ratios determine the efficiency of the SPC chart and can be manipulated by selecting control chart parameters.

For a particular mean shift level, if the transient state ratio/capability can be tuned to a high value (say 4–5) by choosing appropriate chart parameters, the corresponding chart will detect the shift very quickly. Otherwise the shift will likely be missed during the transient state and will need to be detected in later runs. Since a high steady state ratio/capability heralds efficient shift detection at steady state, a high steady state ratio/capability is also desired. However, the steady state ratio/capability should not be tuned so high that it results in an extremely small transient ratio/capability, indicative of low probability of detection during the transient state. To endow the chart with efficient detection at both states, a tradeoff is needed when choosing the charting parameters. An approximate CS value of 3 is generally appropriate for balancing the values of CT and CS.

One of the considerations when choosing an SPC method is its robustness to autocorrelated and automatically controlled processes. Robustness of a control chart refers to how insensitive its statistical properties are to model mis-specification. Reliable estimates of process variation are of vital importance for the proper functioning of all SPC methods [58]. For process Xt with positive first-lag autocorrelation, the standard deviation derived from moving range is often underestimated because

$$ E\!\left({\hat{\sigma}}_{\mathrm{MR}}\right)=E\left(\overline{\mathrm{MR}}/{d}_2\right)={\sigma}_X\sqrt{1-{\rho}_1}, $$

where ρ1 is the first-lag correlation coefficient of Xt [59].
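
A small simulation, assuming an AR(1) process, illustrates this bias: the moving-range estimate \( \overline{\mathrm{MR}}/{d}_2 \) (with d2 = 1.128 for moving ranges of two observations) undershoots σX by roughly the factor \( \sqrt{1-{\rho}_1} \).

```python
# Small simulation of the moving-range bias for an AR(1) process.
import numpy as np

rng = np.random.default_rng(2)
rho, n = 0.7, 200_000
x = np.zeros(n)
for t in range(1, n):
    x[t] = rho * x[t - 1] + rng.normal()

sigma_x = x.std()
mr_bar = np.mean(np.abs(np.diff(x)))        # average moving range of two observations
print(mr_bar / 1.128 / sigma_x, np.sqrt(1 - rho))   # both approximately 0.55
```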

A more serious problem with higher sensitivity control charts such as the PID chart is that they may be less robust than lower sensitivity control charts such as the SCC. Berube et al. [60] and Luceno [61] conclude that PID controllers are generally more robust than MMSE controllers against model specification error. However, Jiang et al. [37] show that PID charts tend to have a shorter “in-control” ARL when the process model is mis-specified, since model errors can be viewed as a kind of “shift” from the “true” process model. This seems to be a discouraging result for higher sensitivity control charts. In practice, a trade-off is necessary between sensitivity and robustness when selecting control charts for autocorrelated processes. Apley and Lee [62] recommend using a conservative control limit for EWMA charts when monitoring MMSE residuals. By using the worst-case estimation of the residual variance, the EWMA chart can be robustly designed for the in-control state with a slight efficiency loss in the out-of-control state. This design strategy can be easily generalized to other SPC methods for autocorrelated or automatically controlled processes.

1.6 SPC for Multivariate Characteristics

Modern sensing technology allows frequent measurement of key quality characteristics during manufacturing, and many in-process measurements are strongly correlated with each other. This is especially true for measurements related to safety, fault detection and diagnosis, quality control, and process control. In an automatically controlled process, for example, process outputs are often strongly related to process control actions. Joint monitoring of these correlated characteristics ensures appropriate control of the overall process. Multivariate SPC techniques have recently been applied to novel fields such as environmental monitoring and the detection of computer intrusions.

The purpose of multivariate on-line techniques is to investigate whether measured characteristics are simultaneously in statistical control. A specific multivariate quality control problem is to consider whether an observed vector of measurements x = (x1,…, xk) exhibits a shift from a set of “standard” parameters \( {\boldsymbol{\mu}}^0={\left({\mu}_1^0,\dots, {\mu}_k^0\right)}^{\prime } \). The individual measurements will frequently be correlated, meaning that their covariance matrix Σ will not be diagonal.

Versions of the univariate Shewhart, EWMA and CUSUM charts have been developed for the case of multivariate normality.

1.6.1 Multivariate T2 Chart

To monitor a multivariate vector, Hotelling [63] suggested an aggregated statistic equivalent to the Shewhart control chart in the univariate case,

$$ {T}^2={\left(\boldsymbol{x}-{\boldsymbol{\mu}}^0\right)}^{\prime }{\hat{\Sigma}}_{\boldsymbol{x}}^{-1}\;\left(\boldsymbol{x}-{\boldsymbol{\mu}}^0\right), $$
(11.11)

where \( {\hat{\Sigma}}_x \) is an estimate of the population covariance matrix Σ. If the population covariance matrix is known, Hotelling’s T2 statistic follows a χ2 distribution with k degrees of freedom when the process is in control. A signal is triggered when T2 > χ2k,α. One of the important features of the T2 chart is that its out-of-control performance depends solely on the noncentrality parameter \( \delta =\sqrt{{\left(\boldsymbol{\mu} -{\boldsymbol{\mu}}^0\right)}^{\prime }{\hat{\Sigma}}_{\boldsymbol{x}}^{-1}\left(\boldsymbol{\mu} -{\boldsymbol{\mu}}^0\right)}, \) where μ is the actual mean vector. This means that its detection performance is invariant along the contours of the multivariate normal distribution.
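
A minimal sketch of the T2 chart of (11.11) with a known in-control covariance matrix follows, using the χ2 control limit described above; the α level is illustrative.

```python
# Minimal sketch of the Hotelling T^2 chart with a chi-square control limit.
import numpy as np
from scipy.stats import chi2

def t2_chart(x_rows, mu0, sigma, alpha=0.005):
    """Return the T^2 statistic for each observation vector and the control limit."""
    sinv = np.linalg.inv(sigma)
    d = x_rows - mu0
    t2 = np.einsum("ij,jk,ik->i", d, sinv, d)     # row-wise quadratic forms
    ucl = chi2.ppf(1 - alpha, df=x_rows.shape[1])
    return t2, ucl
```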

1.6.2 Multivariate EWMA Chart

Hotelling’s T2 chart essentially utilizes only current process information. To incorporate recent historical information, Lowry et al. [64] develop a multivariate EWMA chart

$$ {W}_t^2={\boldsymbol{w}}_t^{\prime }{\Sigma}_{\boldsymbol{w}}^{-1}{\boldsymbol{w}}_t, $$

where wt = Λ(xt − μ0) + (I − Λ)wt−1 and Λ = diag(λ1, λ2,⋯, λk). For simplicity, λi = λ (1 ≤ i ≤ k) is generally adopted, in which case Σw = [λ/(2 − λ)]Σx.
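
A minimal sketch of the multivariate EWMA statistic with a common weight λ is shown below, using the asymptotic covariance Σw = [λ/(2 − λ)]Σx as in the text; the value of λ is illustrative.

```python
# Minimal sketch of the multivariate EWMA (MEWMA) charting statistic.
import numpy as np

def mewma_stats(x_rows, mu0, sigma_x, lam=0.1):
    sigma_w_inv = np.linalg.inv(lam / (2 - lam) * sigma_x)   # asymptotic Sigma_w
    w = np.zeros(x_rows.shape[1])
    out = []
    for x in x_rows:
        w = lam * (x - mu0) + (1 - lam) * w
        out.append(float(w @ sigma_w_inv @ w))
    return np.array(out)
```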

1.6.3 Multivariate CUSUM Chart

There are many CUSUM procedures for multivariate data. Crosier [65] proposes two multivariate CUSUM procedures, cumulative sum of T (COT) and MCUSUM. The MCUSUM chart is based on the statistics

$$ {\boldsymbol{s}}_t=\left\{\begin{array}{ll}\mathbf{0}& \mathrm{if}\, {C}_t\le {k}_1\\ {}\left({\boldsymbol{s}}_{t-1}+{\boldsymbol{x}}_t\right)\left(1-{k}_1/{C}_t\right)& \mathrm{if}\, {C}_t>{k}_1,\end{array}\right. $$
(11.12)

where s0 = 0, \( {C}_t=\sqrt{{\left({\boldsymbol{s}}_{t-1}+{\boldsymbol{x}}_t\right)}^{\prime }{\Sigma}_{\boldsymbol{x}}^{-1}\!\left({\boldsymbol{s}}_{t-1}+{\boldsymbol{x}}_t\right)} \), and k1 > 0. The MCUSUM chart signals when \( {W}_t={\boldsymbol{s}}_t^{\prime }{\Sigma}_{\boldsymbol{x}}^{-1}{\boldsymbol{s}}_t>{h}_1 \). Pignatiello and Runger [66] propose another multivariate CUSUM chart (MC1) based on the vector of cumulative sums,

$$ {W}_t=\max\!\left(0,\sqrt{D_t^{\prime }{\Sigma}_{\boldsymbol{x}}^{-1}{D}_t}-{k}_2{l}_t\right) $$
(11.13)

where k2 > 0, \( {D}_t={\sum}_{i=t-{l}_t+1}^t{\boldsymbol{x}}_i \), and

$$ {l}_t=\left\{\begin{array}{ll}{l}_{t-1}+1& \mathrm{if}\;{W}_{t-1}>0\\ {}1& \mathrm{otherwise}.\end{array}\right. $$
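
A minimal sketch of the MC1 statistic (11.13) follows, assuming the observation vectors have been centered at the in-control mean; the reference value k2 is illustrative.

```python
# Minimal sketch of the MC1 multivariate CUSUM statistic of Eq. (11.13).
import numpy as np

def mc1_stats(x_rows, sigma_x, k2=0.5):
    sinv = np.linalg.inv(sigma_x)
    out, l = [], 1
    for t in range(len(x_rows)):
        d = x_rows[t - l + 1: t + 1].sum(axis=0)      # D_t over the last l_t observations
        w = max(0.0, float(np.sqrt(d @ sinv @ d)) - k2 * l)
        out.append(w)
        l = l + 1 if w > 0 else 1                     # l_t recursion
    return np.array(out)
```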

Once an out-of-control signal is triggered from a multivariate control chart, it is important to track the cause of the signal so that the process can be improved. Fault diagnosis can be implemented by decomposing the T2 statistic after a signal; components with large contributions are suspected to be faulty. Orthogonal decompositions such as principal component analysis are popular tools. Hayter and Tsui [67] propose other alternatives which integrate process monitoring and fault diagnosis. Jiang and Tsui [68] provide a thorough review of these methods.

1.6.4 Variable-Selection-Based Multivariate Chart

In high dimensional applications, it is very rare to see all variables of interest or quality characteristics change or shift at the same time. Rather, a typical phenomenon observed in practice is that a subset of variables, dominated by a common latent physical mechanism or component, deviates from its normal condition due to abnormal changes of the common mechanism or component [69,70,71]. By penalizing likelihood functions to locate potential out-of-control variables, Wang and Jiang [72] and Zou and Qiu [73] independently propose to monitor a variable-dimension T2 statistic, which has better efficiency than the traditional full-size T2 statistic. Zou et al. [74] and Jiang et al. [75] further utilize the LASSO algorithm for fault diagnosis.

1.6.5 Multivariate Chart Using Real-Time Contrast

Instead of monitoring departures from a nominal mean vector in Phase II, multivariate RTC control charts monitor distances between real-time data and Phase I reference data using classification methods. Mis-classification probabilities serve as a reasonable candidate for monitoring differences between the two populations [76]. Classification methods such as linear discriminant analysis (LDA) and support vector machines (SVM) can be deployed, and kernel-based methods can also be adapted to account for a nonlinear boundary between the Phase I and Phase II data [77, 78]. Since these classification methods look for a projection direction that optimizes a certain “distance” metric, the projection pursuit idea can be generalized by measuring the empirical divergence between the two probability distributions for real-time monitoring [79].

1.7 SPC for Profile Monitoring

In many applications, the quality of a process or product is best characterized and summarized by a functional relationship between a response variable and one or more explanatory variables. Profile monitoring is used to understand and to check the stability of this relationship over time. At each sampling stage one observes a collection of data points that can be represented by a curve (or profile). In some calibration applications, the profile can be represented adequately by a simple linear regression model, while in other applications more complicated models are needed.

Profile monitoring is very useful in an increasing number of practical applications. Much of the work in the past few years has focused on the use of more effective charting methods, the study of more general shapes of profiles, and the study of the effects of violations of assumptions. There are many promising research topics yet to be pursued given the broad range of profile shapes and possible models. Woodall et al. [80] highlighted the following important issues when monitoring profiles:

  1. The usefulness of carefully distinguishing between Phase I and Phase II applications.

  2. The decision regarding whether or not to include some between-profile variation in common cause variation.

  3. The use of methods capable of detecting any type of shift in the shape of the profile.

  4. The use of the simplest adequate profile model.

Paynabar et al. [81] developed a new modeling, monitoring, and diagnosis framework for Phase I analysis of multichannel profiles. Woodall et al. [82] conducted a comprehensive review on the use of control charts to monitor process and product quality profiles.

2 Design of Experiment and Robust Parameter Design

2.1 Robust Design for Single Responses

Taguchi [4] introduced parameter design, a method for designing processes that are robust (insensitive) to uncontrollable variation, to a number of American corporations. The objective of this methodology is to find the settings of design variables that minimize the expected value of squared-error loss defined as

$$ L\!\left(Y,t\right)={\left(Y-t\right)}^2, $$
(11.14)

where Y represents the actual process response and t the targeted value. A loss occurs if the response Y deviates from its target t. This loss function originally became popular in estimation problems considering unbiased estimators of unknown parameters. The expected value of (Y − t)2 can be easily expressed as

$$ {\displaystyle \begin{aligned}E(L)&={A}_0E{\left(Y-t\right)}^2\\ &={A}_{0}\!\left[\mathrm {Var}(Y)+{\left(E(Y)-t\right)}^2\right],\end{aligned}} $$
(11.15)

where E(Y) and Var(Y) are the mean and variance of the process response and A0 is a proportionality constant representing the economic cost of the squared-error loss. If E(Y) is on target, then the expected squared-error loss reduces to the process variance (up to the constant A0). Its similarity to the criterion of least squares in estimation problems makes the squared-error loss function easy for statisticians and engineers to grasp. Furthermore, the calculations for most decision analyses based on squared-error loss are straightforward and easily seen as a trade-off between variance and the square of the off-target factor.

Robust design (RD) assumes that the appropriate performance measure can be modeled as a transfer function of the fixed control variables and the random noise variables of the process as follows:

$$ Y=f\left(\boldsymbol{x},\boldsymbol{N},\boldsymbol{\theta} \right)+\epsilon, $$
(11.16)

where x = (x1,…, xp)T is the vector of control factors, N = (N1,…, Nq)T is the vector of noise factors, θ is the vector of unknown response model parameters, and f is the transfer function for Y. The control factors are assumed to be fixed and represent the fixed design variables. The noise factors N are assumed to be random and represent the uncontrolled sources of variability in production. The pure error ϵ represents the remaining variability that is not captured by the noise factors and is assumed to be normally distributed with zero mean and finite variance.

Taguchi divides the design variables into two subsets, x = (xa, xd), where xa and xd are called respectively the adjustment and nonadjustment design factors. An adjustment factor influences process location while remaining effectively independent of process variation. A nonadjustment factor influences process variation.

2.1.1 Experimental Designs for Parameter Design

2.1.1.1 Taguchi’s Product Arrays and Combined Arrays

Taguchi’s experimental design takes an orthogonal array for the controllable design parameters (an inner array of control factors) and crosses it with another orthogonal array for the factors beyond reasonable control (an outer array of noise factors). At each test combination of control factor levels, the entire noise array is run and a performance measure is calculated. Hereafter we refer to this design as the product array. These designs have been criticized by Box [12] and others for being unnecessarily large.

Welch [83] combined columns representing the control and noise variables within the same orthogonal array. These combined arrays typically have a smaller number of test runs and do not replicate the design. The lack of replication prevents unbiased estimation of random error, but we will later discuss research addressing this limitation.

2.1.1.1.1 Which to Use: Product Array or Combined Array

There is a wide variety of expert opinion regarding choice of experimental design in Nair [13]. The following references complement Nair’s comprehensive discussion. Ghosh and Derderian [84] derive robustness measures for both product and combined arrays, allowing the experimenter to objectively decide which array provides a more robust option. Miller et al. [85] consider the use of a product array on gear pinion data. Lucas [86] concludes that the use of classical, statistically designed experiments can achieve the same or better results than Taguchi’s product arrays. Rosenbaum [87] reinforces the efficiency claims of the combined array by giving a number of combined array designs which are smaller for a given orthogonal array strength or stronger for a given size. Finally, Wu and Hamada [88] provide an intuitive approach to choosing between product and combined array based on an effect-ordering principle.

They list the most important class of effects as those containing control–noise interactions, control main effects and noise main effects. The second highest class contains the control–control interactions and the control–control–noise interactions while the third and least important class contains the noise–noise interactions. That array producing the highest number of clear effect estimates in the most important class is considered the best design.

Noting that the combined array is often touted as being more cost-effective due to an implied smaller number of runs, Wu and Hamada place the cost comparison on a more objective basis by factoring in both cost per control setting and cost per noise replicate. They conclude that the experimenter must prioritize the effects to be estimated and the realistic costs involved before deciding which type of array is optimal.

2.1.1.2 Choosing the Right Orthogonal Array for RD

Whether the experimenter chooses a combined or product array, selecting the best orthogonal array is an important consideration. The traditional practice in classical design of experiments is to pick a Resolution IV or higher design so that main effects are aliased only with three-factor or higher-order interactions, of which there are relatively few known physical examples.

However, the estimation of main effects is not necessarily the best way to judge the value of a test design for RD. The control–noise interactions are generally regarded as being as important as the control main effects for fine-tuning the final control factor settings for minimal product variation. Hence evaluation of an experimental design for RD purposes must take into account the design’s ability to estimate the control–noise interactions deemed most likely to affect product performance.

Kackar and Tsui [89] feature a graphical technique for showing the confounding pattern of effects within a two-level fractional factorial. Kackar et al. [90] define orthogonal arrays and describe how Taguchi’s fixed element arrays are related to well known fractional factorial designs. Other pieces related to this decision are Hou and Wu [91], Berube and Nair [60], and Bingham and Sitter [92].

2.1.1.3 D-Optimal Designs

Several authors have shown how D-optimal designs can be exploited in RD experiments. A D-optimal design minimizes the volume of the confidence ellipsoid for the parameters being estimated from an assumed model. The key strength of these designs is their invariance to linear transformation of model terms, and their characteristic weakness is a dependence on the accuracy of the assumed model. By using a proper prior distribution to attack the singular design problem and make the design less model-dependent, Dumouchel and Jones [93] provide a Bayesian D-optimal design needing little modification of existing D-optimal search algorithms.

Atkinson and Cook [94] extend the existing theory of D-optimal design to linear models with nonconstant variance. With a Bayesian approach they create a compromise design that approximates preposterior loss. Vining and Schaub [95] use D-optimality to evaluate separate linear models for process mean and variance. Their comparison of the designs indicates that replicated fractional factorials of assumed constant variance best estimate variance while semi-Bayesian designs better estimate process response.

Chang [96] proposes an algorithm for generating near D-optimal designs for multiple response surface models. This algorithm differs from existing approaches in that it does not require prior knowledge or data-based estimates of the covariance matrix to generate its designs. Mays [97] extends the quadratic model methodology of RSM to the case of heterogeneous variance by using the optimality criteria D (maximal determinant) and I (minimal integrated prediction variance) to allocate test runs to locations within a central composite design.

2.1.1.4 Other Designs

The remaining references discuss types of designs used in RD which are not easily classified under the more common categories previously discussed.

Pledger [98] divides noise variables into observable and unobservable and argues that one’s ability to observe selected noise variables in production should translate into better choices of optimal control settings. Rosenbaum [99] uses blocking to separate the control and noise variables in combined arrays, which were shown in Rosenbaum [87] to be stronger for a given size than the corresponding product array designs. Li and Nachtsheim [100] present experimental designs which don’t depend on the experimenter’s prior determination of which interactions are most likely significant.

2.1.2 Performance Measures in RD

In Sect. 11.2.1, we compared some of the experimental designs used in parameter design. Of equal importance is choosing which performance measure will best achieve the desired optimization goal.

2.1.2.1 Taguchi’s Signal-to-Noise Ratios

Taguchi introduced a family of performance measures called signal-to-noise ratios whose specific form depends on the desired response outcome. The case where the response has a fixed nonzero target is called the nominal-the-best case (NTB). Likewise, the cases where the response has a smaller-the-better target or a larger-the-better target are, respectively, called the STB and LTB cases.

To accomplish the objective of minimal expected squared-error loss for the NTB case, Taguchi proposed the following two-step optimization procedure: (i) Calculate and model the SNRs and find the nonadjustment factor settings which maximize the SNR. (ii) Shift mean response to the target by changing the adjustment factor(s).
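
A minimal sketch of step (i) for the NTB case is given below, assuming replicated responses at each inner-array (control) setting and the usual SNR form 10 log10(ȳ²/s²); step (ii) would then move the mean to target with an adjustment factor.

```python
# Minimal sketch of step (i) of Taguchi's two-step procedure for the NTB case.
import numpy as np

def ntb_snr(y_rows):
    """One SNR value per control-factor setting (one row of replicates per setting)."""
    ybar = y_rows.mean(axis=1)
    s2 = y_rows.var(axis=1, ddof=1)
    return 10 * np.log10(ybar ** 2 / s2)

# y = np.array([[...], [...], ...])          # replicated responses, one row per setting
# best_setting = int(np.argmax(ntb_snr(y)))  # step (i); step (ii) adjusts the mean to target
```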

For the STB and LTB cases, Taguchi recommends directly searching for the values of the design vector x which maximize the respective SNR. Alternatives for these cases are provided by Tsui and Li [101] and Berube and Wu [102].

2.1.2.2 Performance Measure Independent of Adjustment (PerMIAs)

Taguchi did not demonstrate how minimizing the SNR would achieve the stated goal of minimal average squared-error loss. Leon et al. [11] defined a function called the performance measure independent of adjustment (PerMIA) which justified the use of a two-step optimization procedure. They also showed that Taguchi’s SNR for the NTB case is a PerMIA when both an adjustment factor exists and the process response transfer function is of a specific multiplicative form. When Taguchi’s SNR complies with the properties of a PerMIA, his two-step procedure minimizes the squared-error loss.

Leon et al. [11] also emphasized two major advantages of the two-step procedure:

  • It reduces the dimension of the original optimization problem.

  • It does not require reoptimization for future changes of the target value.

Box [12] agrees with Leon et al. [11] that the SNR is only appropriately used in concert with models where process sigma is proportional to process mean. Maghsoodloo [103] derives and tabulates exact mathematical relationships between Taguchi’s STB and LTB measures and his quality loss function.

Leon and Wu [104] extend the PerMIA of Leon et al. [11] to a maximal PerMIA which can solve constrained minimization problems in a two-step procedure similar to that of Taguchi. For nonquadratic loss functions, they introduce general dispersion, location, and off-target measures, while developing a two-step process. They apply these new techniques in a number of examples featuring additive and multiplicative models with nonquadratic loss functions. Tsui and Li [101] establish a multistep procedure for the STB and LTB problem based on the response model approach under certain conditions.

2.1.2.3 Process Response and Variance as Performance Measures

The dual response approach is a way of finding the optimal design settings for a univariate response without the need to use a loss function. Its name comes from its treatment of mean and variance as responses of interest which are individually modeled. It optimizes a primary response while holding the secondary response at some acceptable value.

Nair and Pregibon [105] suggest using outlier-robust measures of location and dispersion such as median (location) and interquartile range (dispersion). Vining and Myers [106] applied the dual response approach to Taguchi’s three SNRs while restricting the search area to a spherical region of limited radius. Copeland and Nelson [107] solve the dual response optimization problem with the technique of direct function minimization. They use the Nelder-Mead simplex procedure and apply it to the LTB, STB, and NTB cases. Other noteworthy papers on the dual response method include Del Castillo and Montgomery [108] and Lin and Tu [109].

2.1.2.4 Desirability as a Performance Measure

The direct conceptual opposite of a loss function, a utility function maps a specific set of design variable settings to an expected utility value (value or worth of a process response). Once the utility function is established, nonlinear direct search methods are used to find the vector of design variable settings that maximizes utility.

Harrington [110] introduced a univariate utility function called the desirability function, which gives a quality value between zero (unacceptable quality) and one (further improvement would be of no value) of a quality characteristic of a product or process. He defined the two-sided desirability function as follows:

$$ {d}_i={\mathrm{e}}^{-{\left|{Y}_i^{\prime}\right|}^c}, $$
(11.17)

where e is the base of the natural logarithm, c is a positive number subjectively chosen for curve scaling, and Y′i is a linear transformation of the univariate response Yi whose properties link the desirability values to product specifications. It is of special interest to note that for c = 2, a mid-specification target, and response values within the specification limits, this desirability function is simply the exponential of a negative scaled squared-error loss.
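
A minimal sketch of this two-sided desirability follows; the linear transformation mapping the specification limits to Y′ = ±1 is one common choice and is an assumption made here for illustration.

```python
# Minimal sketch of Harrington's two-sided desirability of Eq. (11.17).
import numpy as np

def two_sided_desirability(y, lsl, usl, c=2.0):
    # Map LSL -> -1, USL -> +1, mid-specification -> 0 (an illustrative choice of Y').
    y_prime = (2.0 * np.asarray(y) - (usl + lsl)) / (usl - lsl)
    return np.exp(-np.abs(y_prime) ** c)
```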

2.1.2.5 Other Performance Measures

Ng and Tsui [111] derive a measure called q-yield which accounts for variation from target among passed units as well as nonconforming units. It does this by penalizing yield commensurate with the amount of variation measured within the passed units. Joseph and Wu [102] develop modeling and analysis strategies for a general loss function where the quality characteristic follows a location-scale model. Their three-step procedure includes an adjustment step which moves the mean to the side of the target with lower cost. Additional performance measures are introduced in Joseph and Wu [112] and Joseph and Wu [113].

2.1.3 Modeling the Performance Measure

The third important decision the experimenter must grapple with is how to model the chosen performance measure. Linear models are by far the most common way to approximate loss functions, SNR’s, and product responses. This section covers response surface models, the generalized linear model, and Bayesian modeling.

2.1.3.1 Response Surface Models

Response surface models (RSM) are typically second-order linear models with interactions between the first-order model terms. While many phenomena cannot be accurately represented by a quadratic model, the second-order approximation of the response in specific regions of optimal performance may be very insightful to the product designer.

Myers et al. [114] make the case for implementing Taguchi’s philosophy within a well established, sequential body of empirical experimentation, RSM. The combined array is compared to the product array and the modeling of SNR compared to separate models for mean and variance. In addition, RSM lends itself to the use of mixed models for random noise variables and fixed control variables. Myers et al. [115] incorporate noise variables and show how mean and variance response surfaces can be combined to create prediction limits on future response.

2.1.3.1.1 Analysis of Unreplicated Experiments

The most commonly cited advantage of modeling process responses rather than SNR is the use of more efficient combined arrays. However, the gain in efficiency usually assumes there is no replication for estimating random error. Here we review references for analyzing the data from unreplicated fractional factorial designs.

Box and Meyer [116] present an analysis technique which complements normal probability plots for identifying significant effects from an unreplicated design. Their Bayesian approach assesses the size of contrasts by computing a posterior probability that each contrast is active. They start with a prior probability of activity, assume normality of the significant effects, and deliver a nonzero posterior probability for each effect.

Lenth [117] introduces a computationally simple and intuitively pleasing technique for measuring the size of contrasts in unreplicated fractional factorials. The Lenth method uses standard T statistics and contrast plots to indicate the size and significance of the contrast. Because of its elegant simplicity, the method of Lenth is commonly cited in RD case studies.
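
A minimal sketch of Lenth's pseudo standard error (PSE) for the contrasts of an unreplicated two-level design follows; the trimming constant 2.5 and the t-quantile with m/3 degrees of freedom follow Lenth's original proposal.

```python
# Minimal sketch of Lenth's method for unreplicated fractional factorials.
import numpy as np
from scipy.stats import t as t_dist

def lenth_margin_of_error(contrasts, alpha=0.05):
    c = np.abs(np.asarray(contrasts, dtype=float))
    s0 = 1.5 * np.median(c)
    pse = 1.5 * np.median(c[c < 2.5 * s0])        # trimmed median screens out active effects
    df = len(c) / 3.0
    return t_dist.ppf(1 - alpha / 2, df) * pse    # contrasts exceeding this are flagged
```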

Pan [118] shows how failure to identify even small and moderate location effects can subsequently impair the correct identification of dispersion effects when analyzing data from unreplicated fractional factorials. Wu and Hamada [88] propose a simple simulation method for estimating the critical values employed by Lenth in his method for testing significance of effects in unreplicated fractional factorial designs.

McGrath and Lin [119] show that a model that does not include all active location effects raises the probability of falsely identifying significant dispersion factors. They show analytically that without replication it is impossible to deconfound a dispersion effect from two location effects.

2.1.3.2 Generalized Linear Model

The linear modeling discussed in this chapter assumes normality and constant variance. When the data do not exhibit these properties, the most common approach is to model a compliant, transformed response. In many cases, this is hard or impossible. The generalized linear model (GLM) was developed by Nelder and Wedderburn [120] as a way of modeling data whose probability distribution is any member of the single-parameter exponential family.

The GLM is fitted by obtaining the maximum likelihood estimates for the coefficients of the terms in the linear predictor, which may contain continuous, categorical, interaction, and polynomial terms. Nelder and Lee [121] argue that the GLM can extend the class of useful models for RD experiments to data sets for which a single transformation cannot simultaneously satisfy the important criteria of normality, separation, and parsimony. Several examples illustrate how the link functions are chosen.

Engel and Huele [122] integrate the GLM within the RSM approach to RD. Nonconstant variance is assumed and models for process mean and variance are obtained from a heteroscedastic linear model of the conditional process response. The authors claim that nonlinear models and tolerances can also be studied with this approach. Hamada and Nelder [123] apply the techniques described in Nelder and Lee [121] to three quality improvement examples to emphasize the utility of the GLM in RD problems over its wider class of distributions.

2.1.3.3 Bayesian Modeling

Bayesian methods of analysis are steadily finding wider employment in the statistical world as useful alternatives to frequentist methods. In this section we mention several references on Bayesian modeling of the data.

Using a Bayesian GLM, Chipman and Hamada [124] overcome the GLM’s potentially infinite likelihood estimates from categorical data taken from fractional factorial designs. Chipman [125] uses the model selection methodology of Box and Meyer [126] in conjunction with priors for variable selection with related predictors. For optimal choice of control factor settings, he finds posterior distributions to assess the effect of model and parameter uncertainty.

2.2 Robust Design for Multiple Responses

Earlier we discussed loss and utility functions and showed how the relation between off-target and variance components underlies the loss function optimization strategies for single responses. Multi-response optimization typically combines the loss or utility functions of individual responses into a multivariate function to evaluate the sets of responses created by a particular set of design variable settings. This section is divided into two subsections which deal, respectively, with the additive and the multiplicative combination of loss and utility functions.

2.2.1 Additive Combination of Univariate Loss, Utility and SNR

The majority of multiple-response approaches additively combine the univariate loss or SNR performance measures discussed above. In this section, we review how these performance measures are additively combined and examine their relative advantages and disadvantages as multivariate objective functions.

2.2.1.1 Multivariate Quadratic Loss

For univariate responses, expected squared-error loss is a convenient way to evaluate the loss caused by deviation from target because of its decomposition into squared off-target and variance terms. A natural extension of this loss function to multiple correlated responses is the multivariate quadratic loss (MQL) function of the deviation vector (Y − τ), where Y = (Y1, …, Yr)T and τ = (τ1, …, τr)T, i.e.,

$$ \mathrm{MQL}\!\left(\boldsymbol{Y},\tau \right)={\left(\boldsymbol{Y}-\tau \right)}^{\mathrm{T}}\boldsymbol{A}\left(\boldsymbol{Y}-\tau \right), $$
(11.18)

where A is a positive definite constant matrix. The values of the constants in A are related to the costs of nonoptimal design, such as the costs related to repairing and/or scrapping noncompliant product. In general, the diagonal elements of A represent the weights of the r characteristics and the off-diagonal elements represent the costs related to pairs of responses being simultaneously off-target.

It can be shown that, if Y follows a multivariate normal distribution with mean vector E(Y) and covariance matrix ΣY, the average (expected) loss can be written as:

$$ {\displaystyle \begin{aligned}E\!\left(\mathrm{MQL}\right)&=E\!\left[{\left(\boldsymbol{Y}-\tau \right)}^{\mathrm{T}}\boldsymbol{A}\!\left(\boldsymbol{Y}-\tau \right)\right]\\ &=\mathrm{Tr}\left(\boldsymbol{A}{\varSigma}_{\boldsymbol{Y}}\right)+{\left[E\!\left(\boldsymbol{Y}\right)-\tau \right]}^{\mathrm{T}}\boldsymbol{A}\!\left[E\!\left(\boldsymbol{Y}\right)-\tau \right].\end{aligned}} $$
(11.19)

The simplest approach to solving the RD problem is to apply algorithms that directly minimize the average loss function in (11.19). Since the mean vector and covariance matrix are usually unknown, they can be estimated by the sample mean vector and sample covariance matrix, or by a fitted model based on a sample of observations of the multivariate responses. The off-target term [E(Y) − τ]TA[E(Y) − τ] and the trace term Tr(AΣY) are multivariate analogs of the squared off-target component and variance of the univariate squared-error loss function. This decomposition shows how moving all response means to target reduces the expected multivariate loss to the Tr(AΣY) term. The trace-covariance term shows how the values of A and the covariance matrix ΣY directly affect the expected multivariate loss.
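The decomposition in (11.19) is straightforward to check numerically; the sketch below uses arbitrary, made-up values of A, E(Y), ΣY, and τ and compares the closed-form expected loss with a Monte Carlo average.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative quantities for r = 2 responses (all values are made up)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # positive definite cost matrix
mu = np.array([1.2, 0.8])           # E(Y)
Sigma = np.array([[0.30, 0.05],
                  [0.05, 0.20]])    # covariance matrix of Y
tau = np.array([1.0, 1.0])          # target vector

# Closed form: E(MQL) = Tr(A Sigma) + (E(Y) - tau)^T A (E(Y) - tau)
bias = mu - tau
expected_loss = np.trace(A @ Sigma) + bias @ A @ bias

# Monte Carlo check under multivariate normality
Y = rng.multivariate_normal(mu, Sigma, size=200_000)
dev = Y - tau
mc_loss = np.einsum("ij,jk,ik->i", dev, A, dev).mean()

print(expected_loss, mc_loss)   # the two values should agree closely
```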

2.2.1.2 Optimization of Multivariate Loss Functions

For the expected multivariate quadratic loss of (11.19), Pignatiello [16] introduced a two-step procedure for finding the design variable settings that minimize this composite cost of poor quality. Tsui [18] extended Pignatiello’s two-step procedure to situations where responses may be NTB, STB or LTB.

To this point we have examined squared-error loss functions whose expected value is decomposed into off-target and variance components. Ribeiro and Elsayed [127] introduced a multivariate loss function which additionally considers fluctuation in the supposedly fixed design variable settings. Ribeiro et al. [128] add a term for manufacturing cost to the gradient loss function of Ribeiro and Elsayed.

2.2.1.3 Additive Formation of Multivariate Utility Functions

Kumar et al. [129] suggest creating a multiresponse utility function as the additive combination of utility functions from the individual responses where the goal is to find the set of design variable settings that maximizes overall utility. Additional papers related to this technique include Artiles-Leon [130] and Ames et al. [131].

2.2.1.4 Quality Loss Functions for Nonnegative Variables

Joseph [132] argues that, in general, processes should not be optimized with respect to a single STB or LTB characteristic, but rather with respect to a combination of them. He introduces a new class of loss functions for nonnegative variables which accommodates the cases of unknown target and asymmetric loss and which can be additively combined for the multiresponse case.

2.2.2 Multivariate Utility Functions from Multiplicative Combination

In this section, a multivariate desirability function is constructed from the geometric average of the individual desirability functions of each response.

The geometric average (GA) of r components (d1,…, dr) is the rth root of their product:

$$ \mathrm{GA}\left({d}_1,\dots, {d}_r\right)={\left(\prod \limits_{i=1}^r{d}_i\right)}^{\frac{1}{r}}. $$
(11.20)

The GA is thus a multiplicative combination of the individual components. When combining individual utility functions whose values are scaled between zero and one, the GA never exceeds their arithmetic average and equals zero whenever any single utility is zero. When rating the composite quality of a product, this prevents a single unacceptable response from being masked by the others, since a very low value on any crucial characteristic (such as safety features or cost) drives the composite rating toward zero and renders the entire product worthless to the end user.
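A small illustrative sketch, not drawn from the cited papers, using a simple piecewise-linear two-sided desirability (the limits and targets are arbitrary) together with the geometric average of (11.20):

```python
import numpy as np

def two_sided_desirability(y, low, target, high):
    """Piecewise-linear desirability: 1 at the target, 0 outside [low, high]."""
    d = (y - low) / (target - low) if y <= target else (high - y) / (high - target)
    return float(np.clip(d, 0.0, 1.0))

def geometric_average(ds):
    """Overall desirability: the r-th root of the product of the individual d_i."""
    ds = np.asarray(ds, dtype=float)
    return float(np.prod(ds) ** (1.0 / len(ds)))

# Hypothetical responses observed at one design setting
d1 = two_sided_desirability(9.6,   low=8.0,  target=10.0,  high=12.0)    # ~0.80
d2 = two_sided_desirability(0.43,  low=0.40, target=0.50,  high=0.60)    # ~0.30
d3 = two_sided_desirability(100.5, low=95.0, target=100.0, high=105.0)   # ~0.90

overall = geometric_average([d1, d2, d3])
print(d1, d2, d3, overall)   # overall is dragged down by the weakest response; it is 0 if any d_i is 0
```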

2.2.2.1 Modifications of the Desirability Function

In order to allow placement of the ideal target value anywhere within the specifications, Derringer and Suich [133] introduced a modified version of the desirability functions of Harrington [110] which encompassed both one-sided and two-sided response specifications. Additional extensions of the multivariate desirability function were made by Kim and Lin [134].

2.2.3 Alternative Performance Measures for Multiple Responses

Duffy et al. [135] propose using a reasonably precise estimate of multivariate yield, obtained via Beta distribution discrete point estimation, as an efficient alternative to Monte Carlo simulation. This approach is limited to independently distributed design variables. Fogliatto and Albin [136] propose using prediction variance as a multiresponse optimization criterion. They measure prediction variance using the coefficient of variation (CV) of the prediction, since it provides a normalized measure of prediction variance. Plante [137] considers the use of maximal process capability as the criterion for choosing control variable settings in multiple-response RD situations. He uses the concepts of process capability and desirability to develop process capability measures for multiple-response systems.

2.3 Dynamic Robust Design

2.3.1 Taguchi’s Dynamic Robust Design

Up to this point, we have discussed only static RD, where the target for the response is a given, fixed level and the response is affected only by control and noise variables. In dynamic robust design (DRD) a third type of variable exists: the signal variable M, whose magnitude directly affects the mean value of the response. The experimental design recommended by Taguchi for DRD is a product array in which an inner control array is crossed with an outer array containing the sensitivity factors and a compound noise factor.

A common choice of dynamic loss function is the quadratic loss function popularized by Taguchi,

$$ L\left[Y,t(M)\right]={A}_0{\left[Y-t(M)\right]}^2, $$
(11.21)

where A0 is a constant. This loss function provides a good approximation to many realistic loss functions. It follows that the average loss becomes

$$ {\displaystyle \begin{aligned}R\left(\boldsymbol{x}\right)&={A}_0{E}_M{E}_{N,\epsilon}{\left[Y-t(M)\right]}^2\\ &={A}_0{E}_M\left\{{\mathrm {Var}}_{N,\epsilon}(Y)+{\left[{E}_{N,\epsilon}(Y)-t(M)\right]}^2\right\}.\end{aligned}} $$
(11.22)

Taguchi identifies dispersion and sensitivity effects by modeling the SNR and the sensitivity, respectively, as functions of the control factors. His two-step procedure for DRD first finds control factor settings that maximize the SNR and then uses the remaining, non-SNR-related control variables to adjust the process to the targeted sensitivity level.

2.3.2 References on Dynamic Robust Design

Ghosh and Derderian [138] introduce the concept of robustness of the experimental plan itself to the noise factors present when conducting DRD. For combined arrays they consider blocked and split-plot designs, and for product arrays they consider univariate and multivariate models. In product arrays they achieve robustness by choosing settings that minimize the noise-factor effects on process variability, while for combined arrays they attempt to minimize the interaction effects between control and noise factors.

Wasserman [139] clarifies the use of the SNR for the dynamic case by explaining it in terms of linear modeling of process response. He expresses the dynamic response as a linear model consisting of a signal factor, the true sensitivity (β) at specific control variable settings, and an error term. Miller and Wu [140] prefer the term signal-response system to dynamic robust design for its intuitive appeal and its identification of two distinct types of signal-response systems. They call them measurement systems and multiple target systems, where this distinction determines the performance measure used to find the optimal control variable settings.
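Following this linear-model view of the dynamic problem, a rough sketch (with simulated, hypothetical data) fits Y = α + βM + ε at each control setting by least squares and computes the commonly used dynamic SNR, 10 log10(β̂²/σ̂²), to compare settings:

```python
import numpy as np

def dynamic_snr(M, Y):
    """Fit Y = a + b*M + e by least squares; return slope, error variance, and SNR (dB)."""
    X = np.column_stack([np.ones_like(M), M])
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ coef
    sigma2 = resid @ resid / (len(Y) - 2)     # error variance estimate
    beta = coef[1]
    return beta, sigma2, 10 * np.log10(beta**2 / sigma2)

rng = np.random.default_rng(1)
M = np.tile([1.0, 2.0, 3.0], 10)              # signal levels, replicated

# Two hypothetical control settings with the same sensitivity but different noise
Y_a = 0.2 + 1.5 * M + rng.normal(scale=0.40, size=M.size)
Y_b = 0.1 + 1.5 * M + rng.normal(scale=0.15, size=M.size)

for name, Y in [("setting A", Y_a), ("setting B", Y_b)]:
    beta, s2, snr = dynamic_snr(M, Y)
    print(name, round(beta, 2), round(s2, 3), round(snr, 1))
# A two-step analysis would first choose the setting with the larger SNR and then
# use a sensitivity (adjustment) factor to bring beta to its target value.
```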

Lunani et al. [141] present two new graphical procedures for identifying suitable measures of location and dispersion in RD situations with dynamic experimental designs. McCaskey and Tsui [142] show that Taguchi’s two-step procedure for dynamic systems is only appropriate for multiplicative models and develop a procedure for dynamic systems under an additive model. For a dynamic system this equates to minimizing the sum of process variance and bias squared over the range of signal values.

Tsui [143] compares the effect estimates obtained using the response model approach and Taguchi’s approach for dynamic robust design problems. Recent publications on DRD include Joseph and Wu [144], Joseph and Wu [145], and Joseph [146].

2.4 Applications of Robust Design

2.4.1 Manufacturing Case Studies

Mesenbrink [147] applied the techniques of RD to optimize three performance measurements of a high-volume wave soldering process, achieving significant quality improvement by using a mixed-level fractional factorial design to collect ordered categorical data on the soldering quality of component leads in printed circuit boards. Lin and Wen [148] apply RD to improve the uniformity of a zinc coating process.

Chhajed and Lowe [149] apply the techniques of RD to the problem of structured tool management. For the cases of tool selection and tool design, they use Taguchi’s quadratic loss function to find the most cost-effective way to accomplish the processing of a fixed number of punched holes in sheet metal products.

2.4.2 Reliability Applications

Reliability is the study of how to make products and processes function for longer periods of time with minimal interruption. It is a natural area for RD application and the Japanese auto industry has made huge strides in this area compared to its American counterpart. In this section several authors comment on the application of RD to reliability.

Hamada [150] demonstrates the relevance of RD to reliability improvement. He recommends the response model approach for the additional information it provides on control–noise interactions and suggests alternative performance criteria for maximizing reliability. Kuhn et al. [151] extend the methods of Myers et al. [114] for linear models and normally distributed data to achieve a robust process when time to an event is the response.

2.4.3 Tolerance Design

This chapter has focused on RD, which is synonymous with Taguchi’s methods of parameter design. Taguchi has also made significant contributions in the area of tolerance design. This section reviews articles which examine developments in the techniques of tolerance design.

D’Errico and Zaino [152] propose a modification of Taguchi’s approach to tolerance design based on a product Gaussian quadrature, which provides better estimates of high-order moments and outperforms the basic Taguchi method in most cases. Bisgaard [153] proposes using factorial experimentation as a more scientific alternative to trial and error when designing tolerance limits for mating components of assembled products.

Zhang and Wang [154] formulate the robust tolerance problem as a mixed nonlinear optimization model and solve it using a simulated annealing algorithm. The optimal solution allocates assembly and machining tolerances so as to maximize the product’s insensitivity to environmental factors. Li and Wu [55] combined parameter design with tolerance design.

Maghsoodloo and Li [155] consider linear and quadratic loss functions for determining an optimal process mean which minimizes the expected value of the quality loss function for asymmetric tolerances of quality characteristics. Moskowitz et al. [156] develop parametric and nonparametric methods for finding economically optimal tolerance allocations for a multivariable set of performance measures based on a common set of design parameters.

3 Reliability and Prognostics and Health Management

3.1 Prognostics and Health Management

Prognostics and health management (PHM) is a framework that offers comprehensive solutions for monitoring and managing the health status of individual machines and engineering systems. In recent years, PHM has emerged as a popular approach for improving reliability, maintainability, safety, and affordability. Concepts and components of PHM have been applied in many domains, such as mechanical engineering, electrical engineering, and statistical science.

Due to the high impact and extreme costs associated with system failures, it is important to develop methods that can predict and prevent such catastrophes before they occur. Many methods have been developed in domains such as electronics-rich systems, the aerospace industry, and even the public health environment [157, 158], and these can be grouped under the framework of prognostics and health management (PHM). Prognostics is the process of predicting the future reliability of a product by assessing its degradation from its expected normal operating conditions; health management is the process of monitoring, in real time, the extent of deviation or degradation from normal operating conditions [159, 160]. Traditional reliability prediction methods (e.g., US Department of Defense Mil-Hdbk-217 and Telcordia SR-332 (formerly [161])) rely on the strong assumption that the constant hazard rate of each component can be modified to account for various operating and environmental conditions. In the PHM approach, the system’s health status is monitored in real time, and the reliability and hazard functions are dynamically updated based on in situ measurements and historical data. Given the success of the PHM approach, new PHM techniques and methods are needed to apply and implement PHM in other, less developed domains.

Due to the increasing complexity of modern systems, one of the most prominent problems is the No Fault Found (NFF) problem (also called “trouble not identified,” “intermittent malfunction,” etc.) [162,163,164], particularly in electronics-rich systems. It refers to the situation in which no failure or fault can be detected or duplicated during laboratory tests even though the failure has been reported in the field [165]. NFF issues not only make diagnosis extremely difficult but also result in significant maintenance costs. As reported by Williams et al. [165], NFF failures account for more than 85% of all field failures and 90% of overall maintenance costs in avionics, respectively, and they cost the US Department of Defense roughly 2 ∼ 10 billion US dollars annually [166]. Similarly, NFF contributes to significant operational costs in many other domains. In addition, NFF contributes to potential safety hazards in other industries. For example, both Toyota and the National Highway Traffic Safety Administration (NHTSA) spent considerable time and effort investigating the root causes of sudden acceleration failures in certain car models, which have been linked to 89 deaths in 71 crashes since 2000 [167]. Unfortunately, no conclusive finding has been reached despite efforts to replicate the failures under a variety of laboratory conditions. These examples show that intermittent faults are often related to the environmental conditions and operation histories of a particular individual system and are thus difficult to duplicate under new, unknown random disturbances. Traditional laboratory testing and assessment data provide information only on the characteristics of the population and are insufficient for accurate prediction of individual performance. To reduce maintenance costs and avoid safety hazards caused by NFF, the PHM approach shifts from traditional population-level data modeling to individual-level data modeling.

In response to these challenges, the rapid development of information and sensing technology has enabled the collection of many in situ measurements during operation, providing the capability for real-time data management and processing for each individual unit. These advancements provide a great opportunity to develop sophisticated models with increasing prognostic accuracy for individual items. For instance, many different types of data covering the whole life cycle of a product can be easily retrieved, especially in critical applications. These data may include production process information, quality records, operation logs, and sensor measurements. Moreover, unlike manually entered data, which are slow, costly, and error-prone to collect, many current records are accurate and timely due to advancements in automation. Radio frequency identification (RFID) technology, for example, is commonly used in supply chain distribution networks, healthcare, and even military applications because it provides reliable and timely tracking of products and components. Advanced sensor technologies also enable abundant measurements at both macro and micro scales, such as vibration, frequency response, magnetic fields, and current/voltage, to name a few.

In general, the typical workflow in a PHM system can be conceptually divided into three major tasks: fault diagnostics, prognostics, and condition-based maintenance. The first task is the diagnosis of system failures and the identification of their root causes. The root causes provide useful information for prognostics and feedback for system design improvement. The prognostic task takes the processed data, system models, or failure mode analyses as inputs and employs prognosis algorithms to update the degradation models used for failure time prediction. The last task uses the prognosis results, together with the associated costs and benefits, to determine the optimal maintenance actions that achieve minimal operating costs and risks. All three tasks must be executed dynamically and in real time.
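As a toy illustration of the prognostic task only (not any specific method from the references), the sketch below fits a linear degradation path to in situ measurements from a single unit and reports the remaining useful life (RUL) as the time until the fitted path crosses a failure threshold; the data, units, and threshold are hypothetical.

```python
import numpy as np

def predict_rul(times, degradation, threshold):
    """Fit degradation = a + b*t by least squares and extrapolate to the failure threshold."""
    X = np.column_stack([np.ones_like(times), times])
    (a, b), *_ = np.linalg.lstsq(X, degradation, rcond=None)
    if b <= 0:
        return np.inf                      # no upward degradation trend detected yet
    t_fail = (threshold - a) / b           # predicted threshold-crossing time
    return max(t_fail - times[-1], 0.0)    # remaining life measured from the last observation

# Hypothetical in situ measurements (e.g., wear level recorded at inspection times)
t = np.array([0.0, 10.0, 20.0, 30.0, 40.0, 50.0])
wear = np.array([0.02, 0.11, 0.23, 0.29, 0.41, 0.52])

print(round(predict_rul(t, wear, threshold=1.0), 1))   # estimated time until the wear limit
```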

3.2 Systems Health Monitoring and Management

Systems health monitoring and management (SHMM) refers to the framework of continuous surveillance, analysis, and interpretation of relevant data for system maintenance, management, and strategic planning, where “system” is generally defined as “an organized set of detailed methods, procedures, and routines created to carry out a specific activity or solve a specific problem,” ranging from mechanical systems to public health [168,169,170]. SHMM differs from PHM in its emphasis and in its definitions of monitoring, prognostics, and management, and can be considered an extended version of PHM. More specifically, system health monitoring includes detection, forecasting, diagnostics, and prognostics, while system health management includes decision, financial, and risk management. A fundamental problem in SHMM is how to make use of correlated active and passive data in various tasks of prediction and forecasting, monitoring and surveillance, fault detection and diagnostics, engineering management, supply chain management, and many more. In applications to complex human and engineering systems, challenging research problems arise in various domains driven by big data analytics, such as syndromic surveillance [171, 172], electronics-rich system management [173], simulation and optimization of emergency departments in medical systems [174], and mass transit planning [175].

SHMM covers many broad topics, ranging from experimental design and data collection to data mining and analytics, optimization, and decision-making. As more and more systems become data-rich, the theoretical foundation of SHMM is a natural complement to the “data to knowledge to action” paradigm and will therefore benefit from future developments in data science. Data science shows all the signs of growing into a discipline in its own right, with a strong theoretical foundation at its heart; such foundations are paramount in the development of any new scientific field. Specifically, theoretical research on the foundation of SHMM will build upon foundational research in data science, which is intrinsically interdisciplinary. In particular, establishing the theoretical basis of SHMM is likely to involve interdisciplinary collaboration among computer scientists, mathematicians, and statisticians, as these three disciplines are at the heart of the theoretical foundation of SHMM’s closest relative, data science.

For SHMM to make an impact on real-life applications, close collaboration is required between SHMM researchers from different disciplines and domain experts. We believe that much of the theoretical foundation of SHMM lies at the intersection of computer science, statistics, and mathematics. Each of those disciplines, however, has been built around particular ideas and in response to particular problems that may have existed for a long time. Thus, the research development of SHMM requires rethinking not only how those three foundational areas interact with each other, but also how each interacts with specific implementations and applications. In particular, the design requirements of business, internet, and social media applications lead to questions that tend to be very different from those of past scientific and medical applications. Both the similarities and differences between these areas are striking. Designing the theoretical foundations of SHMM requires paying attention to the problems of researchers implementing SHMM in specific fields as well as to the environments and platforms where computations are to be done. A general framework of SHMM is summarized in Fig. 11.3.

Fig. 11.3 General framework of SHMM

In SHMM, one frequently encounters mixed-type and multi-modality data. For example, a typical dataset may be aggregated from many data sources, including imaging, numerical, graph, and text data. Although each specific data type has been researched intensively in isolation, a unified framework would be a more desirable approach for studying mixed data systematically. This topic has both theoretical and applied implications and would benefit from collaboration among statistics, theoretical computer science, mathematics, and practitioners of SHMM. Further research promises to lead to breakthroughs and important progress in science and engineering. A comprehensive review of SHMM can be found in the work by Tsui et al. [176].