1 Introduction

Semiconductor manufacturing consists of hundreds of individual steps that a wafer must pass through in order to become a final product. Recently, these individual operations have become more complex and process dimensions have become smaller, which has increased the importance of precise process monitoring and quality control. In typical semiconductor manufacturing, process monitoring and quality control involve an actual metrology process and statistical process control (SPC) techniques [24, 51, 54].

Although metrology-based SPC is the most widely used quality control scheme, it has some limitations. First, there is a trade-off between the effectiveness (high yield) and the efficiency (cycle time) of the manufacturing process. Metrology is not a value-added operational process; it is used only for process monitoring. If process engineers implement more frequent metrology processes between operational processes, the total number of processes increases. In addition, the remaining wafers must be held until the investigation of the sampled wafers is completed. As a consequence, the total production cycle time increases in exchange for a higher yield rate [13, 55]. Second, wafer-to-wafer quality control is practically impossible as long as sampling techniques are used in the metrology process [14, 41]. Sampling-based metrology assumes that the metrology measurements of the other wafers are consistent with those of the one or two wafers sampled from the same lot. In a real process, however, variations in quality arise from many unexpected deviations in the process. Therefore, there is always a higher risk of both missed wafers (faulty wafers that are not picked out by the process) and false alarms (normal wafers that are picked out as faulty), compared to a scenario in which wafer-to-wafer quality control is possible.

In order to overcome such limitations of metrology-based SPC, virtual metrology has been highlighted as a new scheme of advanced process control (APC) that makes wafer-to-wafer quality control possible in semiconductor manufacturing [17, 18, 56]. The purpose of virtual metrology (VM) is to support process monitoring and quality control by predicting the metrological values of every wafer without implementing an actual metrology process, based on process sensor data that are collected during the operation. The development of an accurate virtual metrology model offers many benefits. First, process engineers can take more appropriate actions to improve the final yield, such as adjusting operation recipes, based on information that is richer than that produced by an actual metrology process [11]. Second, once a prediction model is built, the number of wafers measured by the actual metrology equipment can be significantly decreased, because only a few wafers are required in order to maintain and update the model. Thus, the total production cycle time and the resources required by the actual metrology process are decreased, resulting in higher production efficiency [13]. Third, real-time process drift detection [41] as well as wafer-to-wafer (run-to-run; R2R) process control [35] becomes possible, since virtual metrology provides continuous process monitoring at the wafer level.

Due to its many benefits, virtual metrology has been widely studied since the late 1990s, and the research in this area has developed in two main directions. The first direction is to improve the prediction accuracy of virtual metrology models by developing new prediction algorithms or by selecting (extracting) relevant input variables (predictors) [4, 15, 41, 42, 46]. The second direction involves building a real-time process control system by integrating virtual metrology with an R2R control scheme [6, 12, 35, 47, 48, 64]. Although previous studies have achieved noticeable progress in both directions, most of them rely on a common but unrealistic assumption: that VM prediction results are accurate and reliable. In practice, however, VM is subject to two types of intrinsic risk. The first is model risk, which is related to inaccurate prediction results. When R2R control is running, process recipe manipulation or equivalent actions are selectively activated based upon each wafer's VM prediction results. If the VM prediction result is accurate, the follow-up actions are appropriate. However, if the VM result is inaccurate, the follow-up actions will cause additional problems, because the operation is based upon wrong information. The second type of risk is data risk, which is related to the difference between the data used to build a VM model (training data) and the data used for predicting metrological values (test data). When VM prediction is highly accurate, this means that the functional relation between the input variables (process sensor parameters) and the target variables (metrological values) of the training data was well captured by the model. However, no model can make a very accurate prediction when highly heterogeneous data, which were not seen when the model was built, are provided, even though a certain degree of generalization is possible. As a consequence, this heterogeneous data may increase the uncertainty of the R2R control. As mentioned earlier, a great deal of research has focused upon lowering the model risk; only a few works, however, have been devoted to lowering the data risk [16].

In this paper, in order to reduce the data risk, we propose a means of evaluating the reliability level of VM prediction results based upon novelty detection techniques. To do so, VM prediction models and novelty detection models are built based upon the same training data. When a new wafer arrives, its process sensor data is provided simultaneously to both the VM and novelty detection models. If the sensor data are similar enough to those of the training wafers, a high-reliability score is assigned to the wafer’s VM prediction result; if not, a low-reliability score is assigned. Process engineers can then increase the flexibility of process control and enhance overall productivity by selectively utilizing a wafer’s VM prediction results, based on its reliability level. The main contributions of this paper can be summarized as follows:

  • A reliability level for each VM prediction result is evaluated.

  • Novelty detection algorithms and their combinations are employed to evaluate the homogeneity of the input sensor values.

  • The practical applicability of the proposed framework is verified.

The rest of this paper is structured as follows. In Sect. 2, we briefly review the research articles related to VM and novelty detection. In Sect. 3, we present the structure of our reliability evaluation system and its individual components. In Sect. 4, we explain the experimental settings such as data description, variable selection methods, VM prediction models, algorithm parameters, and performance measures. In Sect. 5, we analyze the effect of the reliability evaluation models in terms of two prediction accuracy measures. In Sect. 6, along with some concluding remarks, we discuss areas of future work.

2 Related work

2.1 Virtual metrology

In semiconductor manufacturing, a general process of metrology-based SPC is as follows. First, 25 wafers in a single unit, called a lot, are processed in an individual piece of operational equipment that is guided by a predefined operational manual called a recipe. Second, in order to check whether the wafers in the lot were processed properly, only one or two wafers are provided as samples to the metrology equipment. This equipment then measures the pre-determined parameters that are considered critical to the yield rate of the final product, such as translation, rotation, and magnification. If all these measurements of the sampled wafer meet the process control criteria, then all the wafers in the same lot are transferred to the next process; if not, they either undergo an additional calibration process or are discarded.

The conceptual difference between actual metrology and virtual metrology is illustrated in Fig. 1. In actual metrology, only a few wafers are sampled when an operating process is completed, and they are provided to the metrology equipment in order to measure quality-related indicators. If the measurements of these indicators are within the control limit, all the wafers in the same lot pass the examination and are transferred to the next operational process. If the measurements are not within the control limit, then either an additional operation is conducted or the wafers are discarded, depending upon the degree of error. In virtual metrology, on the other hand, a prediction model is built based upon equipment sensor data that are collected during the operation (inputs, predictors, independent variables) as well as actual metrological values (outputs, targets, dependent variables). Because sampled wafers provide both input and output data, the model is trained with these wafers. Once the model is built, the sensor data from the process equipment for every wafer are provided to the model, and its metrological values are predicted in real time without an actual metrology process. If the model can determine the functional relationship between the process sensor data and the metrological values, it becomes possible to obtain metrological values for every wafer in the lot without an actual metrology process.

Fig. 1 The conceptual difference between actual metrology (top) and virtual metrology (bottom)

There are two main streams of VM-related research: (1) developing new prediction algorithms or selecting (extracting) relevant input variables (predictors) to improve the prediction accuracy of virtual metrology systems; and (2) integrating virtual metrology with an R2R control scheme to build a real-time process control system.

With regard to the first VM research direction, Cheng and Cheng [15] employed a 4-layer feed-forward neural network to build a VM prediction model. A total of 2,356 input variables were utilized to predict three metrological values (thickness mean, range, and uniformity) in an advanced 300 mm FAB environment in Taiwan. Despite the complicated network structure, their VM model achieved a maximum error rate of 1.7 % and a maximum mean absolute percentage error (MAPE) of 0.39 %. Besnard and Toprac [4] built a regression tree based on various types of data, such as raw FDC data, preceding metrology measurements, and context information. Before training the regression tree, irrelevant input variables, such as those that were not normally distributed or that were highly correlated with each other, were removed; their VM model then achieved an 85 % correlation between actual and predicted metrological values. Lin et al. [41] extracted relevant variables using principal component analysis (PCA), then built a prediction model based upon radial basis function (RBF) networks. Their virtual metrology model achieved a <1 % mean absolute percentage error (MAPE) in a CVD process environment. Pang et al. [46] showed that a very low MAPE could be achieved by taking into account the effects of different tools in different steps, based upon a combination of clustering techniques and multivariate analysis of covariance (MANCOVA). Lynn et al. [42] improved the prediction accuracy of VM models by employing a weighted partial least squares regression to reflect the relative importance of process sensor parameters.

With regard to the second VM research direction, Qin et al. [48] presented a fab-wide R2R control framework combining fault detection and classification (FDC) with VM, and highlighted critical issues for the success of the framework, such as updating prediction models and embedding FDC inside the VM models. Khan et al. [35] tried to improve VM prediction accuracy as well as R2R control flexibility by designing an R2R framework in which VM models are embedded inside an operational process, and adjacent VM models are connected and exchange process information. In order to integrate the VM models into an R2R control system, statistical or machine learning algorithms are employed. (Multivariate) linear regression is the simplest R2R controller, and it was adopted in early studies such as photolithography overlay control [6] and the lithography process [12]. As a non-linear R2R controller, neural networks are most commonly used; they have been adopted in various semiconductor processes such as reactive ion etching [40], chemical vapor deposition (CVD) [61], chemical–mechanical planarization [10, 64], and photolithographic steppers [47].

2.2 Novelty detection

Novel instances or outliers are defined as “observations that deviate so much from other observations as to arouse suspicions that they were generated by a different mechanism” [29]. The purpose of novelty detection is to identify those novel observations that occur rarely among abundant normal instances [33]. For a novelty detection task, two different learning frameworks are available: binary classification and one-class classification. The former learns both the normal and novel classes during training, whereas the latter generalizes only the normal class. The difference between the class boundaries generated by binary classification and one-class classification is illustrated in Fig. 2. Because a small number of crosses are located on the right side, binary classification algorithms divide the data space as shown in Fig. 2a; if the points A and B are newly given, they are classified as circles. In one-class classification, on the other hand, only the circles are used to describe the normal class, so the decision boundary becomes a rectangle that envelops the given observations (Fig. 2b), and the points A and B are determined to be novel. One-class classification is more effective than binary classification under certain circumstances, such as when the class imbalance is severe or when it is practically impossible to gather data for a certain class. Tax and Duin [59] pointed out that sample size and class overlap are two main features of one-class datasets, so a newly developed classifier should be designed to cover these features as widely as possible.

Due to its practical importance, a number of one-class classification algorithms have been introduced; they can be grouped into four major categories: (1) distribution-based, (2) clustering-based, (3) distance-based, and (4) support vector-based methods. Distribution-based methods assume that normal observations are drawn from a specific distribution, so the main task of the algorithm is to estimate that distribution's parameters. The Gaussian density estimator [3], the mixture of Gaussians density estimator [44], and the Parzen window density estimator [21] belong to this category. Clustering-based methods relax the distributional assumption: the normal class is defined as a union of a number of distinct, arbitrarily shaped clusters. They can be grouped into three sub-categories: (1) partitional clustering, (2) hierarchical clustering, and (3) density-based clustering. K-means clustering [9] and K-medoids clustering [65] are representative partitional clustering algorithms, whereas BIRCH [66], CURE [25], ROCK [26], Chameleon [34], and Z-windows [7] are representative hierarchical clustering algorithms. Among density-based clustering methods, DBSCAN [22], OPTICS [2], and LOF [8] are commonly used. Distance-based methods employ nearest neighbor learning for novelty detection: the novelty score of a new observation is proportional to the aggregated distance to its nearest neighbors, and various algorithms are possible depending on the distance measure and the aggregation method [1, 27, 33, 37, 49]. Support vector-based methods generate an arbitrarily shaped, closed class boundary that describes the normal class well in the input space by mapping the data into a higher dimensional feature space to achieve better generalization.
The one-class support vector machine (1-SVM) [52] and support vector data description (SVDD) [57] are two well-known support vector-based algorithms. The former finds the farthest hyperplane from the origin, above which as many normal observations are placed as possible, whereas the latter finds the most compact hypersphere that envelops as many normal observations as possible. It has been proved that 1-SVM and SVDD produce the same class boundary when a Gaussian kernel function is used [57]. Due to their high generalization ability, support vector-based novelty detection algorithms have been successfully applied to various practical domains such as image classification [20, 38] and chemical process monitoring [31].

Fig. 2 The classification boundary of binary classification (a) and one-class classification (b)

Rather than using single novelty detection algorithms, ensembles of one-class classifiers have been highlighted as a means of improving detection performance. Krawczyk and Wozniak [39] proposed five diversity measures for selecting effective committee members; in an empirical study with a large number of datasets, the entropy-based measure returned the best performance, followed by the sphere intersection measure and the energy measure. Krawczyk and Filipczuk [38] proposed an efficient medical decision support framework for breast cancer diagnosis, in which the entire dataset is decomposed into three classes, a novelty detection algorithm is applied to each class, and an ensemble of one-class classifiers is constructed for each class to improve detection performance. Cyganek [20] and Yeh et al. [63] attempted to construct one-class support vector ensembles; the former divided the training data into a number of homogeneous clusters in the feature space and applied a 1-SVM to each cluster, whereas the latter adopted the AdaBoost framework [23]. Wilk and Wozniak [62] extended binary classification to multi-class classification by employing a fuzzy inference system with a set of one-class classifiers; their experimental results show that the fuzzy combiner yields consistently lower error rates than other combination methods.

In this study, since we assume that all training wafers are homogeneous, we only have examples of a normal class; the one-class classification-based novelty detection scheme is therefore more suitable than binary classification for assigning a reliability level to VM prediction results. In addition, we combine a set of one-class classifiers to improve the stability of the reliability levels produced by individual novelty detectors.

3 Reliability evaluation of virtual metrology prediction results

The conceptual structure of our reliability evaluation system for VM, which is illustrated in Fig. 3, differs from a traditional VM system in the following ways. In a traditional VM system, a prediction model is trained based on the process sensor data and the actual metrological values of wafers that are inspected by actual metrology equipment. When an operation on a wafer is completed, its process sensor data are provided to the VM model for prediction of its metrological values. In our reliability evaluation system, however, a novelty detection model is also built in addition to the VM model, based only on the process sensor data of the training wafers. When an operation on a new wafer is completed, its process sensor data are provided to the VM model and the novelty detection model at the same time, in order to predict the metrological values and to evaluate the similarity between the sensor data of the new wafer and those of the training wafers, respectively. If the novelty detector determines that the process sensor data of the new wafer are similar enough to those of the training wafers, the new wafer is considered to be drawn from the same underlying distribution as the training wafers, and a high-reliability score is therefore assigned to its VM prediction results. If the degree of similarity is insufficient, the new wafer is considered to be drawn from an underlying distribution that is different from that of the training wafers, and a low-reliability score is assigned.

Fig. 3 The conceptual structure of our reliability evaluation system for VM

In order to build the reliability evaluation system for VM, two types of prediction models are necessary: a regression model for VM and a novelty detection model for reliability evaluation. Regression models generate continuous outcomes by capturing the functional relationships between predictors, either discrete or continuous, and targets. Novelty detection models generate binary outcomes (0 or 1) by generalizing given data that consist only of predictors. In order to explore the effects and consequences of reliability evaluation, we employed three regression algorithms for VM prediction and five novelty detection algorithms for reliability evaluation. In the next subsections, we briefly introduce the regression and novelty detection algorithms adopted in our experiments.

3.1 Virtual metrology models

Three regression algorithms were employed for VM prediction in our experiments: multiple linear regression (MLR), k-nearest neighbor (k-NN) regression, and artificial neural networks (ANN). MLR [50] estimates the functional relationship between multiple input variables and single or multiple target variables in the form of a linear equation. Compared to more complex algorithms, MLR offers a number of advantages, such as a closed analytic form, computational efficiency, and fewer user-specific parameters. However, its performance degrades when there is a non-linear relationship between the predictors and the targets.

Let \(y_{ik}\) denote the kth metrological value of the ith wafer, and let \(x_{ij}\) denote the jth process sensor value of the ith wafer. Then, the MLR equation with p predictors, d targets, and n training wafers can be written as:

$$y_{ik} = \beta_{k0} + \beta_{k1} x_{i1} + \beta_{k2} x_{i2} + \cdots + \beta_{kp} x_{ip}, \quad \text{for } k = 1, 2, \ldots, d, \; i = 1, 2, \ldots, n.$$
(1)

This can be rewritten in a matrix form as:

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta}, \quad \mathbf{Y} = \begin{pmatrix} y_{11} & \cdots & y_{1d} \\ \vdots & \ddots & \vdots \\ y_{n1} & \cdots & y_{nd} \end{pmatrix}, \quad \mathbf{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_{10} & \cdots & \beta_{d0} \\ \vdots & \ddots & \vdots \\ \beta_{1p} & \cdots & \beta_{dp} \end{pmatrix}.$$
(2)

The coefficient matrix \(\boldsymbol{\beta}\) of the above equation can be obtained by minimizing the squared error (residual) between the targets (\(\mathbf{Y}\)) and the predictions (\(\hat{\mathbf{Y}}\)) in Eq. (3), using the ordinary least squares (OLS) method as in Eq. (4):

$$E = \frac{1}{2}\sum_{i=1}^{n} e_{i}^{2} = \frac{1}{2}\det\!\left[(\mathbf{Y} - \hat{\mathbf{Y}})^{\mathrm{T}}(\mathbf{Y} - \hat{\mathbf{Y}})\right] = \frac{1}{2}\det\!\left[(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^{\mathrm{T}}(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})\right].$$
(3)
$$\frac{\partial E}{\partial \boldsymbol{\beta}} = \mathbf{X}^{\mathrm{T}}\mathbf{Y} - \mathbf{X}^{\mathrm{T}}\mathbf{X}\boldsymbol{\beta} = 0, \quad \boldsymbol{\beta} = (\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{Y}.$$
(4)
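To make the OLS solution concrete, the following is a minimal NumPy sketch of Eq. (4) (not the authors' implementation; array shapes and names are illustrative):

```python
import numpy as np

# Illustrative sizes: n training wafers, p sensor variables, d metrology targets
n, p, d = 100, 10, 4
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(n, p))   # process sensor data (predictors)
Y = rng.normal(size=(n, d))       # actual metrology values (targets)

# Prepend a column of ones so the first row of beta holds the intercepts (Eq. 2)
X = np.hstack([np.ones((n, 1)), X_raw])

# OLS estimate beta = (X^T X)^{-1} X^T Y (Eq. 4); lstsq is numerically safer
# than forming the inverse explicitly
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)

Y_hat = X @ beta                  # predicted metrology values
```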

ANN [5] is one of the most widely used non-parametric regression algorithms in many fields, including that of virtual metrology, due to its ability to capture non-linear relationships between predictors and targets. A 3-layer feed-forward neural network was employed in our experiments. In ANN, the targets are expressed as a combination of input values and weights as follows:

$$y_{k} = \sum_{q=1}^{h} w_{kq}^{(2)}\, g\!\left(\sum_{r=1}^{p} w_{qr}^{(1)} x_{r}\right), \quad k = 1, 2, \ldots, d,$$
(5)

where \(w_{kq}^{(2)}\), \(w_{qr}^{(1)}\), and g(•) denote the weight connecting the kth output node and the qth hidden node, the weight connecting the qth hidden node and the rth input node, and the activation function, respectively. Training an ANN is equivalent to optimizing the weights in Eq. (5), which is done by minimizing an objective loss function, generally the least squared residual in Eq. (3).
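As a hedged sketch of such a network, the single-hidden-layer regressor below (using scikit-learn, which the authors do not mention; data and parameter values are illustrative) minimizes the squared residuals as described above:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))    # sensor data of 100 wafers
Y = rng.normal(size=(100, 4))     # four metrology targets

# One hidden layer with a sigmoid activation, as in Eq. (5); the number of
# hidden nodes is a tuning parameter (cf. Table 3)
ann = MLPRegressor(hidden_layer_sizes=(10,), activation='logistic',
                   solver='lbfgs', max_iter=2000, random_state=0)
ann.fit(X, Y)                     # weights optimized by least squared residuals
Y_hat = ann.predict(X)
```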

k-NN [28] is the most popular memory-based learning algorithm. Since it does not require a training procedure, it is employed in a number of tasks that require rapid model update. k-NN predicts the target values of a new instance based on the similarity information between the new instance and its neighbor instances. Once a new instance is provided, k-NN first searches the k most similar instances in the reference data set. Next, the weight for each selected neighbor instance is assigned; the greater the similarity, the greater the weight. The target values of the selected neighbors are then aggregated using a predefined combining rule to produce the target value of the new instance:

$$\hat{y} = \sum_{j \in \mathrm{NN}(\mathbf{x})} w_{j}\, y_{j},$$
(6)

where NN(x) and \(w_j\) denote the index set of the k-nearest neighbors of the new instance x, and the weight assigned to the jth nearest neighbor, respectively. In k-NN learning, two user-specific parameters must be specified: the number of nearest neighbors (k) and the weight allocation method. Here, we adopted the locally linear reconstruction (LLR) method [32], due to its ability to determine these two parameters in a structured way, unlike other heuristic-based approaches. LLR finds the optimal weights for the nearest neighbors by minimizing the reconstruction error E(w) between the target instance and the projection made by its neighbors, which is defined as follows:

$$E(\mathbf{w}) = \frac{1}{2}\left\|\mathbf{x}_{t} - \sum_{j=1}^{k} w_{j} \tilde{\mathbf{x}}_{j}\right\|^{2},$$
(7)

where \(\mathbf{x}_{t}\), \(\tilde{\mathbf{x}}_{j}\), and \(w_{j}\) are the target instance, the jth nearest neighbor of \(\mathbf{x}_{t}\), and the weight assigned to \(\tilde{\mathbf{x}}_{j}\), respectively. By solving this quadratic program, LLR finds the optimal set of weights systematically.
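Treating Eq. (7) as written, i.e., without the additional constraints of the full LLR formulation in [32] (which would require a quadratic programming solver), the weights can be sketched as an ordinary least-squares problem; names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
x_t = rng.normal(size=5)            # target instance (5 sensor variables)
X_nbrs = rng.normal(size=(3, 5))    # its k = 3 nearest neighbors, one per row

# Minimize ||x_t - X_nbrs^T w||^2 over the neighbor weights w (Eq. 7)
w, *_ = np.linalg.lstsq(X_nbrs.T, x_t, rcond=None)

# k-NN prediction (Eq. 6): weighted combination of the neighbors' targets
y_nbrs = rng.normal(size=3)         # illustrative neighbor target values
y_hat = w @ y_nbrs
```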

3.2 Novelty detection (one-class classification) algorithms

In order to assign a level of reliability to a wafer’s VM prediction results, we performed an evaluation to compare the homogeneity of the process sensor data for a new wafer with that of the training wafers based on novelty detection techniques. Once a set of instances is provided, novelty detection algorithms characterize and generalize the data, assuming that they are drawn from the same underlying distribution. When a new instance is provided, its novelty score is computed. It is determined as being novel if the novelty score is greater than the given threshold; if not, it is considered normal.

A total of five novelty detection algorithms were employed: the Gaussian density estimator (Gauss), the mixture of Gaussians (MoG), K-means clustering (KMC), k-nearest neighbors (k-NN), and support vector data description (SVDD). Gauss [3] is the simplest parametric novelty detection method. It assumes that normal data are generated from a Gaussian distribution, as shown in Eq. (8).

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\!\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right].$$
(8)

When a set of training instances is given, Gauss estimates its two model parameters, \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\), the mean vector and covariance matrix of the normal training data. Then, whenever a new instance is provided, its probability density is computed using Eq. (8) with the estimated parameters. If the density is high enough, the new instance is considered to be drawn from the same distribution as the training data and is given a high-reliability score; if it is low, the instance is not considered to be from the same distribution and is given a low-reliability score.
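A minimal sketch of this estimator, using SciPy's multivariate normal density (synthetic data; names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
X_train = rng.normal(size=(100, 5))   # sensor data of normal training wafers

# Estimate the two parameters of Eq. (8): mean vector and covariance matrix
mu = X_train.mean(axis=0)
Sigma = np.cov(X_train, rowvar=False)

# Density of a new wafer under the fitted Gaussian; a low density suggests a
# heterogeneous wafer, hence a low-reliability VM prediction
x_new = rng.normal(size=5)
p = multivariate_normal(mean=mu, cov=Sigma).pdf(x_new)
```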

Gauss requires a very strict assumption of unimodality, which is often violated in practice. To obtain a more flexible density estimate, MoG [44] allows more than one mode, and the probability is estimated by a linear combination of K individual distribution components as follows:

$$p(\mathbf{x}) = \sum_{k=1}^{K} P(k)\, p_{k}(\mathbf{x}),$$
(9)

where K, P(k), and \(p_k(\mathbf{x})\) are the number of components in the mixture model, the prior probability of the kth component, and the conditional probability of x under the kth component, respectively. When a new instance is provided, its probability is computed, and it is determined to be normal only if the probability is high enough. In MoG, each component is assumed to be a Gaussian distribution, and the parameters of each Gaussian are optimized by the expectation–maximization (EM) algorithm [5].
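A short sketch using scikit-learn's EM-based mixture estimator (not part of the authors' setup; K and the data are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X_train = rng.normal(size=(100, 5))

# Fit a K-component Gaussian mixture by EM (Eq. 9); K is a tuning parameter
mog = GaussianMixture(n_components=3, random_state=0).fit(X_train)

# score_samples returns log p(x); a low value marks the wafer as novel
log_p = mog.score_samples(rng.normal(size=(1, 5)))
```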

KMC [57] is similar to MoG in that it groups the normal data into K clusters, where instances within the same cluster are homogeneous, while those in different clusters are heterogeneous. However, KMC does not require a Gaussian assumption for each cluster. Given a normal data set, KMC finds the K centroids that minimize the within-cluster sum of squared errors:

$$\arg\min_{C} \sum_{i=1}^{K} \sum_{\mathbf{x}_{j} \in C_{i}} \|\mathbf{x}_{j} - \mathbf{c}_{i}\|^{2},$$
(10)

where \(\mathbf{c}_{i}\) is the centroid of \(C_i\), and C is the union of all clusters (\(C = C_1 \cup \cdots \cup C_K\)). When a new instance \(\mathbf{x}_n\) is provided, its novelty score is determined by the distance to the nearest centroid, as follows:

$$\text{Novelty score}(\mathbf{x}_{n}) = \|\mathbf{x}_{n} - \mathbf{c}_{i}\|, \quad \text{where } \|\mathbf{x}_{n} - \mathbf{c}_{i}\| \le \|\mathbf{x}_{n} - \mathbf{c}_{j}\| \;\text{for all } j \ne i.$$
(11)
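A sketch of Eqs. (10) and (11) with scikit-learn's KMeans (illustrative values):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X_train = rng.normal(size=(100, 5))

# K centroids minimizing the within-cluster sum of squared errors (Eq. 10)
kmc = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# Novelty score of a new wafer: distance to the nearest centroid (Eq. 11);
# transform() returns the distances to all K centroids
x_new = rng.normal(size=(1, 5))
score = kmc.transform(x_new).min(axis=1)
```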

In k-nearest neighbor learning, when a new instance is provided, its k most similar instances are selected based on a certain similarity metric, such as the Euclidean distance. The novelty score is then computed by aggregating this similarity information. Among the various combination methods, we adopted the hybrid novelty score [33], due to its ability to consider distance and local topology simultaneously; it is computed as follows:

$$d_{\text{hybrid}}(\mathbf{x}) = d_{\text{avg}}(\mathbf{x}) \times \left(\frac{2}{1 + \exp(-d_{\text{c-hull}}(\mathbf{x}))}\right),$$
(12)

where \(d_{\text{avg}}\) is the average distance to the k-nearest neighbors, and \(d_{\text{c-hull}}\) is the distance to the convex hull formed by the neighbors, as shown in Eq. (13):

$$d_{\text{avg}}(\mathbf{x}) = \frac{1}{k}\sum_{i=1}^{k}\|\mathbf{x} - \mathbf{x}^{i}\|, \quad d_{\text{c-hull}}(\mathbf{x}) = \left\|\mathbf{x} - \sum_{i=1}^{k} w_{i}\mathbf{x}^{i}\right\|,$$
(13)

where \(\mathbf{x}^{i}\) is the ith nearest neighbor, and \(w_{i}\) is its corresponding weight obtained by solving the LLR problem.
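The hybrid score of Eqs. (12) and (13) can be sketched as follows; here the LLR weights are approximated by unconstrained least squares, as in the sketch in Sect. 3.1 (data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
X_train = rng.normal(size=(100, 5))
x = rng.normal(size=5)
k = 5

# The k nearest training wafers and their distances
dists = np.linalg.norm(X_train - x, axis=1)
idx = np.argsort(dists)[:k]
nbrs = X_train[idx]

d_avg = dists[idx].mean()                      # average distance (Eq. 13)

# Distance to the neighbors' convex-hull projection (Eq. 13), with the LLR
# weights approximated by unconstrained least squares
w, *_ = np.linalg.lstsq(nbrs.T, x, rcond=None)
d_chull = np.linalg.norm(x - w @ nbrs)

d_hybrid = d_avg * (2.0 / (1.0 + np.exp(-d_chull)))   # hybrid score (Eq. 12)
```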

SVDD [57, 58] is a novelty detection algorithm based on structural risk minimization [60]; it solves the problem in a feature space using the kernel trick [45, 53]. SVDD finds the hypersphere with minimum volume that encloses as many normal instances as possible in the feature space. Let R and \(\mathbf{a}\) denote the radius and the center of the hypersphere, respectively; the optimization problem to be solved is stated as:

$$\begin{aligned} &\min\; R^{2} + C\sum_{i=1}^{n}\xi_{i}, \\ &\text{s.t.}\;\; \|\varPhi(\mathbf{x}_{i}) - \mathbf{a}\|^{2} \le R^{2} + \xi_{i}, \quad \xi_{i} \ge 0, \;\; \forall \mathbf{x}_{i}, \end{aligned}$$
(14)

where \(\varPhi(\mathbf{x}_{i})\) is the image of the input \(\mathbf{x}_{i}\) in the feature space and \(\mathbf{a}\) is the center of the normal class instances in the feature space. The solution can be found by formulating the Wolfe dual problem and applying the kernel trick. When a new instance \(\mathbf{x}_n\) is provided, its novelty score can be measured as follows:

$$\text{Novelty score}(\mathbf{x}_{n}) = R^{2} - \|\varPhi(\mathbf{x}_{n}) - \mathbf{a}\|^{2}.$$
(15)
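Since a 1-SVM with a Gaussian kernel yields the same class boundary as SVDD [57], scikit-learn's OneClassSVM can stand in for SVDD in a sketch (nu and gamma below are illustrative stand-ins for the error cost C and the kernel width σ):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(6)
X_train = rng.normal(size=(100, 5))

# nu bounds the fraction of training wafers treated as errors; gamma plays
# the role of the Gaussian kernel width
oc = OneClassSVM(kernel='rbf', nu=0.05, gamma=0.1).fit(X_train)

# decision_function > 0 inside the boundary (normal), < 0 outside (novel)
score = oc.decision_function(rng.normal(size=(1, 5)))
```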

In an attempt to improve the stability of the reliability levels obtained by individual novelty detection models, we construct a fusion model of novelty detectors. Since the main purpose of this study is to verify the practical applicability of novelty detection algorithms as a reliability indicator for VM prediction results, we adopted a simple majority voting scheme for aggregating the novelty detection algorithms, rather than the more sophisticated methods discussed in Sect. 2.2:

$$\mathrm{NI}_{\text{Fusion}}(\mathbf{x}_{n}) = \delta\!\left(\sum_{j=1}^{p} \mathrm{NI}_{j}(\mathbf{x}_{n}) > \frac{p}{2}\right),$$
(16)

where p is the number of individual novelty detectors (p = 5 in our experiments). \(\mathrm{NI}_{\text{Fusion}}\) and \(\mathrm{NI}_{j}\) denote the novelty indicators of the fusion model and the jth individual novelty detector, respectively, each returning 1 if \(\mathbf{x}_n\) is determined to be novel and 0 if it is determined to be normal. \(\delta\) is an indicator function that returns 1 if the condition in the parentheses is met and 0 otherwise.
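Eq. (16) reduces to a few lines of code; a sketch with illustrative indicator values:

```python
import numpy as np

# Binary novelty indicators from the p = 5 individual detectors for one wafer
# (1 = novel, 0 = normal); the values are illustrative
ni = np.array([1, 0, 1, 1, 0])

# Majority vote (Eq. 16): the wafer is novel iff more than half the detectors agree
ni_fusion = int(ni.sum() > len(ni) / 2)
```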

4 Experimental settings

4.1 Data

In order to analyze the effect of the proposed reliability evaluation models, we collected data from 117 process sensors in two pieces of photolithography equipment at an actual semiconductor manufacturing company in South Korea as inputs, along with eight metrological values as outputs. Since preventive maintenance (PM) was performed seven times during the data collection, we divided the entire data set into eight segmented periods, using the occasions of PM as the points of separation. The number of wafers collected in each period for each piece of equipment is summarized in Table 1. The first 100 wafers in each period were used for training the VM prediction models and the novelty detection algorithms (including cross-validation for selecting algorithm parameters and variable selection), and the remaining wafers were used for performance evaluation.

Table 1 The number of wafers collected in each period for each piece of equipment

4.2 Variable selection

In our experiments, a total of 117 input variables (process sensor parameters) were collected. Not only was this too many compared to the number of training wafers, but a number of irrelevant variables were also included in the raw data set. Therefore, we reduced the dimensionality of the input in order to improve the prediction performance and the model training efficiency. We adopted stepwise variable selection and a genetic algorithm (GA) in order to select the most relevant variables. The stepwise variable selection process begins with the single most relevant input variable, and the following two procedures are then conducted alternately until every significant variable is included: (1) among the candidates, the one that most improves the prediction accuracy is added (selection); and (2) among the selected variables, the one that is most irrelevant to the prediction accuracy is removed (elimination). Note that it is not necessary to remove a variable in every elimination step; a selected variable is removed only if the prediction performance can be maintained without it. Figure 4a illustrates an example of stepwise variable selection. In steps 2 and 4, no variable is eliminated because there is no prediction performance improvement. However, variable x_i is removed in step 6, since the prediction performance is enhanced when it is excluded from the selected variable set. In step 9, when there are no more variables to add, the stepwise variable selection terminates. Although stepwise variable selection converges rapidly to a subset of significant variables, the subset is usually not optimal when a large number of input variables are considered. In this circumstance, GA can be a better alternative. GA finds a near-optimal set of input variables through evolutionary procedures such as selection, crossover, and mutation [19, 30]. Figure 4b illustrates the process of GA variable selection. Initially, a sufficient number of chromosomes, called a population, are created. Each chromosome is a binary vector in which each element, called a gene, designates the usage of the corresponding input variable: 1 for used, 0 for not used. Next, VM models and novelty detection algorithms are trained with the candidate variables in each chromosome, and the chromosome's fitness value is evaluated. Since the purpose of our study is to discriminate well between normal and novel wafers in a VM process for flexible process control, it is desirable to maximize the difference in VM prediction errors between highly reliable wafers and unreliable wafers. Thus, we define the fitness function of the GA as follows,

Fig. 4 Variable selection based on stepwise selection and the genetic algorithm (GA)

$$\text{Fitness} = \mathrm{MAE}(W_{L}) - \mathrm{MAE}(W_{H}),$$
(17)

where MAE is the mean absolute error defined in Eq. (18), and W_L and W_H denote the sets of wafers classified as novel (low reliability) and normal (high reliability), respectively. Chromosomes with high fitness values survive and generate a new population through analogues of biological reproduction such as crossover and mutation. Crossover exchanges some genes between two chromosomes, whereas mutation reverses the value of certain genes (e.g., from 0 to 1) with a low probability. In this way, input variables with high prediction performance are kept throughout the generations, while those with low performance naturally die out. Once this cycle (selection, crossover, and mutation) is repeated a sufficient number of times, we can identify a pseudo-optimal set of variables.
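A minimal sketch of this fitness evaluation (the MAE of Eq. (18) applied to the two wafer groups; the function and variable names are illustrative):

```python
import numpy as np

def fitness(y_true, y_pred, is_novel):
    """GA fitness (Eq. 17): MAE of the low-reliability wafers (W_L) minus MAE
    of the high-reliability wafers (W_H); larger means better discrimination."""
    abs_err = np.abs(y_true - y_pred)
    return abs_err[is_novel].mean() - abs_err[~is_novel].mean()
```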

The number of selected input variables in each period for each VM prediction model is summarized in Table 2. It was observed that there was significant redundancy among the process sensor parameters. At most, 38 input variables were selected for EQ1’s first and third periods by GA with MLR, which still represented a 67 % reduction of the original variables. In an extreme case, only two input variables were selected for EQ1’s third period by stepwise selection with k-NN. We would note that, regardless of the prediction algorithm used, fewer input variables were selected with the stepwise selection than by GA for both pieces of equipment; the number of input variables selected by GA for the same equipment/period/prediction algorithm pair was more than twice the number obtained by stepwise selection. The reason for this is that GA has a larger coverage of the search space than stepwise selection, so an individual variable has a greater chance of being considered for selection.

Table 2 The number of selected input variables

4.3 Algorithm parameters and performance measures

In our experiments, three regression algorithms (MLR, k-NN, ANN) were employed for VM prediction, and five novelty detection algorithms (Gauss, MoG, KMC, k-NN, SVDD) were employed for reliability evaluation. Except for MLR and Gauss, each of the adopted algorithms requires algorithm-specific parameters to be determined. The parameters for each algorithm and their candidate values are summarized in Table 3. k-NN regression and k-NN novelty detection require the number of nearest neighbors (k), while MoG and KMC require the number of clusters (K). H is the number of hidden nodes for ANN. With the Gaussian kernel, two parameters must be optimized for SVDD: the width of the Gaussian kernel (σ) and the cost of errors (C). Note that although SVDD can take other forms of kernel, such as the linear or polynomial kernel, we adopted the Gaussian kernel since it is the most commonly adopted and has shown better performance in practical use [36, 43, 67]. These algorithm parameters were optimized by a fivefold cross-validation process using the training data set. Initially, a set of parameters for the regression algorithms and novelty detectors is fixed. Then, variable selection is conducted with these fixed parameters. As a result, the best variable set is obtained for each set of algorithm parameters. Finally, the best parameter–variable set combination is determined using the same fitness function as in the GA.

Table 3 The algorithm-specific parameters for each algorithm and the candidate values

When a new wafer is provided, a binary reliability level (low or high) for its VM prediction is determined by the novelty detection model as follows. Once a novelty detection model is trained with the same wafers used for building a VM prediction model, the novelty scores of the training wafers are computed and sorted in descending order. Next, the value at the top 5th percentile is set as the cut-off (threshold). If the novelty score of the new wafer is higher than the threshold, its VM prediction results are labeled as low; if not, its prediction results are labeled as high. This means that if the process sensor data of a new wafer are similar to those of more than 95 % of the training wafers, the training wafers and the new wafer are considered homogeneous, and the VM prediction results of the new wafer are considered highly reliable, because the new wafer's sensor pattern was sufficiently learned by the VM model. In the opposite case, the training wafers and the new wafer are considered heterogeneous, so the VM prediction results of the new wafer are considered unreliable, because the VM model did not have enough learning opportunities.
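A sketch of this cut-off rule (the top 5 % of the training novelty scores corresponds to the 95th percentile; data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
train_scores = rng.gamma(2.0, size=100)   # novelty scores of the training wafers

# Cut-off: the top 5 % of training novelty scores, i.e., the 95th percentile
threshold = np.percentile(train_scores, 95)

def reliability(score):
    # A higher novelty score than the threshold -> low-reliability VM prediction
    return 'low' if score > threshold else 'high'
```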

Based on the evaluated reliability level, the performance of VM is analyzed in terms of two accuracy measures: the MAE and the percentage of absolute range error (PARE). MAE is the absolute difference between the actual and the predicted metrology values, averaged over wafers:

$${\text{MAE}} = \frac{1}{n}\sum\limits_{i = 1}^{n} {|y_{i} - \hat{y}_{i} |} ,$$
(18)

where n is the total number of test wafers, and \(y_{i}\) and \(\hat{y}_{i}\) are the actual and predicted metrological values, respectively, of the ith wafer. Since the scale of the actual metrological values is very small, i.e., less than \(10^{-2}\), we used an adjusted MAE obtained by multiplying Eq. (18) by 100. PARE is defined as the proportion of wafers whose prediction error is within the tolerance level, and is computed as:

$${\text{PARE}} = \frac{1}{n}\sum\limits_{i = 1}^{n} {I\left( {\left| {y_{i} - \hat{y}_{i} } \right| < \theta } \right)} ,$$
(19)

where I(•) is an indicator function that returns 1 if the condition in the parentheses is satisfied, and 0 if it is not. θ is the tolerance level determined by the process recipe. In our experiments, θ was set to 0.003, the same value used in the actual manufacturing process.
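Both measures are straightforward to compute; a sketch with illustrative names:

```python
import numpy as np

def adjusted_mae(y_true, y_pred):
    """Adjusted MAE: Eq. (18) multiplied by 100, since the raw metrology
    values are smaller than 10^-2."""
    return 100 * np.mean(np.abs(y_true - y_pred))

def pare(y_true, y_pred, theta=0.003):
    """PARE (Eq. 19): proportion of wafers whose absolute prediction error
    is within the tolerance theta (0.003 in the actual process recipe)."""
    return np.mean(np.abs(y_true - y_pred) < theta)
```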

5 Experimental results

Note again that we use the term W_H to refer to wafers whose reliability level is labeled high and W_L for wafers whose reliability level is labeled low. The summary statistics of the proportions of W_L, as determined by the individual novelty detection algorithms and their fusion model, are given in Table 4, and their distributions are shown in Fig. 5. We would note that there were a total of 48 cases for each novelty detection algorithm for each piece of equipment: 8 periods × 3 VM prediction models × 2 variable selection methods. Although we used the same cut-off threshold for both the high and low reliability levels, i.e., the top 5 % novelty score of the training wafers, the distributions of the proportions of W_L were quite different, depending upon the novelty detection algorithm. It was observed that KMC assigned a low reliability level to wafers most strictly, as only 2.82 % (EQ1) and 2.12 % (EQ2) of the test wafers were labeled as low, on average. In addition, the variation of the proportions obtained using KMC was the smallest among the novelty detection algorithms. Gauss and k-NN assigned low reliability to 4–6 % of the test wafers on average, and these proportions were consistent with the threshold setting, i.e., 5 %. In general, MoG and SVDD were found to over-fit the training data, because they assigned low reliability to more test wafers than expected, and the variations were also large. MoG assigned low reliability to 7.11 % (EQ1) and 6.21 % (EQ2) of the test wafers, on average. The discrepancy between the proportions of W_L in the training data versus the test data was greatest when SVDD was used: more than 10 % of the test wafers were determined to be W_L, with the highest degree of variation, as shown in Fig. 5. The fusion model assigned low reliability to slightly fewer wafers than the threshold implies; 4.34 % and 3.09 % of the test wafers were determined to be unreliable for EQ1 and EQ2, respectively. Based on these results, we are able to make an immediate suggestion regarding the adoption of a novelty detection algorithm. If process engineers require strict process control, it would be better to use a novelty detection algorithm that raises alarms frequently, such as SVDD. If they require a lesser degree of control, then a novelty detection algorithm that seldom assigns low reliability to wafers, such as KMC, should be employed.

Table 4 The summary statistics of the proportion of W_L
Fig. 5 The proportion of W_L determined by each novelty detection algorithm

The VM prediction performance according to reliability level, in terms of the adjusted MAE, is summarized in Table 5. In principle, 64 MAEs for each of W_H and W_L can be obtained for each VM model–novelty detector pair, because there were two pieces of equipment, eight periods, and four targets. However, for certain equipment–period–target cases, all wafers were assigned a high reliability level, so MAE(W_L) could not be obtained; we discarded those cases when computing the statistics for W_L. In addition, Table 6 shows the proportion of equipment–period–target pairs that resulted in lower adjusted MAEs for W_H than for W_L. If this proportion is large, we can conclude that the novelty detection algorithm is effective, because the wafers with high reliability resulted in smaller errors than those with low reliability. In other words, the greater the proportion, the more effective the novelty detection algorithm.

Table 5 The summary statistics of the adjusted MAE with respect to W_H and W_L
Table 6 The proportion of equipment–period–target pairs where the adjusted MAE increase is greater than 0

Based on Tables 5 and 6, the following observations can be made. First, W_L resulted in a much higher MAE than W_H, regardless of the VM models, novelty detection algorithms, or variable selection methods. On average, the MAE of W_L was more than 50 % higher than that of W_H for the same VM model/variable selection/novelty detection combination, with just a few exceptions. We would note that there were a few extremely high adjusted MAEs for W_L, which bias the average adjusted MAEs. In terms of the median, however, W_L still resulted in more than 20 % higher adjusted MAEs than W_H for most combinations. This finding supports our hypothesis that a VM prediction result is not reliable when a test wafer's input data and those of the training wafers are heterogeneous.

Second, among the variable selection methods, the average adjusted MAEs for W_H and W_L are generally lower with stepwise selection when MLR and ANN are employed as VM models; when k-NN is employed, the average adjusted MAEs for W_H are not significantly different, but those for W_L are much lower with GA selection than with stepwise selection. Accordingly, the proportions in Table 6 are greater with GA than with stepwise selection for MLR and ANN in most cases, whereas for k-NN, stepwise selection and GA selection each resulted in higher proportions in three cases. We would note that, in general, the input variables selected by GA outnumbered those obtained by stepwise selection. Looking at the adjusted MAEs on the training data, the error rates with GA selection were lower than those with stepwise selection for all VM prediction models. However, their levels of performance on the test data were reversed, with the only exception of k-NN for W_L. A possible explanation is that because GA takes a broader search space into account than stepwise selection, GA selection brings a higher risk of over-fitting in practice.

Third, among the VM prediction models, MLR and k-NN resulted in a similar level of performance in terms of the adjusted MAE for W_H and the adjusted MAE difference between W_H and W_L, while ANN was not as accurate as the others. However, when we look at the adjusted MAE increase shown in Table 6, MLR and ANN were more effective than k-NN. We would note that in an ideal situation, the value in each cell of Table 6 should be 1, because a good VM model with a proper novelty detection algorithm always makes more accurate predictions for W_H than for W_L. However, for some equipment/period/target pairs, only a few wafers, e.g., fewer than five, were identified as W_L, and some of them were in fact false alarms. In such cases, the MAE of W_L could be smaller than that of W_H. Although k-NN made fairly accurate predictions for W_H, it failed to distinguish W_L from W_H. ANN, on the other hand, succeeded in distinguishing W_L from W_H, but its MAE for W_H was higher than that of MLR. Overall, when considering both the MAE for W_H and the adjusted MAE increase, MLR was found to be the best model under our experimental settings. However, we should recall that only 100 training wafers were used in our experiment, due to the difficulty of acquiring actual data. If a sufficient number of training wafers were provided, it is possible that more complex regression algorithms such as k-NN or ANN would perform better than a simple linear model.

Fourth, the best novelty detection algorithm depended upon the VM prediction model. Gauss was found to be the best for MLR in terms of the median adjusted MAE, the difference between the MAEs of W_H and W_L, and the proportion of adjusted MAE increase. For the other two VM models, k-NN was found to be the best by the same criteria. It is interesting that the simplest novelty detection algorithm, i.e., Gauss, was best suited to the simplest (linear) VM model, while the more complicated novelty detection algorithm, i.e., k-NN, went well with the non-linear VM models. It is worth noting that although the fusion of individual novelty detectors did not result in the lowest adjusted MAEs for W_H, it gave a remarkable performance in terms of the proportion of adjusted MAE increase. Across the six VM model–variable selection pairs, the fusion novelty detector resulted in the highest proportions of adjusted MAE increase (Table 6), with the exception of the ANN–GA pair; even there, its adjusted MAE increase proportion was 0.9841, which is very close to the best result (1, MLR) and much higher than the others. This implies that the fusion of novelty detectors can reduce the variation of the individual detectors, so that a more stable performance can be achieved. We would also note that among the novelty detection algorithms, SVDD displayed behavior different from that of the others. Its mean and median adjusted MAE for W_H were as low as those of the other algorithms, but the gap between W_H and W_L was significantly narrower. As explained earlier, SVDD rejected many wafers, assigning them a low reliability level. Some of these rejected wafers were indeed not similar to the training wafers, but the others were labeled low even though they were actually drawn from the same underlying distribution. The MAE of those wafers was not as large as that of an actual novel wafer, which diluted the MAE of W_L. As a consequence, we would not recommend the use of a conservative novelty detection algorithm such as SVDD unless one wants very strict process monitoring and is willing to accept a large number of alarms.

The VM prediction performance according to reliability level, in terms of PARE, is summarized in Table 7. First of all, similar to the results obtained in terms of the adjusted MAE, W_H resulted in higher average PAREs (better performance) than W_L, regardless of the VM model, variable selection algorithm, or novelty detection algorithm. The average PARE of W_H for a given VM model/variable selection/novelty detection combination was at least 10 % higher than that of W_L. With stepwise selection, the performance of MLR and ANN seemed indistinguishable from one another, since the PAREs of W_H were always >0.9, but those of W_L were smaller than 0.8, except for SVDD. Although k-NN resulted in similar PAREs for W_H, its PAREs for W_L were greater than those obtained with the other VM models, so the difference in PARE between W_H and W_L became narrower. Unless a VM model can predict both W_H and W_L very well, this is not desirable, because the reliability level attached to the prediction results then becomes of no use. Looking back at Table 5, k-NN did not have as good a prediction power for W_L as for W_H. We can therefore conclude that the high PARE for W_L obtained with k-NN was not so much due to many of its predictions being accurate, but rather to their being marginally within the threshold θ in Eq. (19).

Table 7 The summary statistics of the PARE with respect to W_H and W_L

Second, it is worth pointing out that variable selection with GA was comparable to stepwise variable selection only when MLR was adopted as the VM model. With the other two regression models, the average PARE of W_H was not significantly greater than that of W_L, and was even lower in some cases. We suspect that because GA covers a broader search space than stepwise selection, it was more likely to over-fit the training data. Since MLR is a linear model, it has a relatively lower complexity than k-NN and ANN, and this low complexity compensates for the over-fitted variable selection results. k-NN and ANN, on the other hand, are models of higher complexity that can generate arbitrarily shaped regression fits. Thus, the over-fitted variable selection results were not controlled by the VM model, which resulted in prediction performance degradation.

Third, among the VM model–novelty detection pairs, MLR with Gauss was found to be the best combination. Although the average PARE was not the highest with the MLR–Gauss pair, the difference between the best PARE and that of the MLR–Gauss pair was negligible, and the difference between the PAREs of W_H and W_L was maximized with this combination. Similar to the results in terms of the adjusted MAE, however, the fusion of novelty detectors performed best in terms of stability. Table 8 shows the proportion of equipment–period–target pairs in which the average PARE of W_H is greater than that of W_L, among a total of 64 pairs. It confirms that the fusion model was outstanding for all VM model–variable selection combinations: at least 70 % of pairs resulted in a higher average PARE for W_H than for W_L (k-NN–GA), while more than 92 % did so when the MLR–GA combination was employed.

Table 8 The proportion of equipment–period–target pairs where the PARE of W_H is greater than that of W_L

In summary, in terms of the adjusted MAE and PARE, we can make the following observations. First, every novelty detection algorithm was useful in detecting wafers that would produce less reliable VM predictions. Second, stepwise variable selection generally resulted in better reliability estimation performance than GA selection, because it prevented over-fitting of the training data. Third, among the candidate regression models and novelty detection algorithms, the MLR–Gauss pair produced effective reliability evaluation as well as accurate VM prediction. Fourth, constructing a fusion model can improve the stability of the proposed framework, since the proportion of equipment–period–target pairs with better performance for W_H than for W_L, in terms of both adjusted MAE and PARE, is higher with the fusion model than with the individual novelty detectors.

Figure 6 shows a number of VM prediction results and their corresponding reliability levels for certain equipment/period/target pairs with certain VM model/variable selection/novelty detection combinations. We would note that the small circles represent actual metrological values, while the empty squares and large circles represent the predicted metrological values of W_H and W_L, respectively. We would also note that, in general, the variation of the predicted metrological values is smaller than that of the actual values, because none of the regression models was designed to learn the natural noise. In Fig. 6a there are four wafers (wafer IDs 7, 11, 12, and 21) with low reliability, and their actual and predicted metrology values are notably different from those of the wafers with high reliability, except for one wafer (wafer 12). In Fig. 6b, c, two wafers with low reliability have VM values that are very different from the actual ones, while the other two wafers have VM predictions that are similar to the actual ones, despite the low reliability. In Fig. 6d, only one wafer turned out to be unreliable, and its prediction value is very different from its actual metrology value. With these VM prediction results and the evaluated reliability, a process engineer could take appropriate action as follows. Assume a wafer's reliability level is high. If its predicted metrology value is within the control limit, we can conclude that the process is operating properly, and no action needs to be taken. If, on the other hand, its predicted metrology value is outside the control limit, one can conclude that something went wrong during the operation; in this case, proper follow-up action such as tool adjustment or recipe modification should be performed. If, however, a wafer's reliability level turns out to be low, then no action should be taken based on its predicted VM values alone until its actual metrology value is measured, because the prediction is not trustworthy. If the predicted metrology value is outside the control limit but the actual metrology value turns out to be within it, we thereby avoid performing unnecessary additional operations. The remaining cost of the reliability evaluation arises when a wafer's reliability level is low but both its predicted VM value and its actual metrology value are within the control limit; the cost is the resource needed for the additional actual metrology. However, by updating the novelty detection model with the inclusion of that wafer, we can improve the reliability evaluation model in the long run.

Fig. 6 The actual metrology values (small circles) and the VM predictions of W_H (empty squares) and W_L (large circles)

6 Conclusion

In this paper, we have proposed a framework for evaluating the reliability of VM predictions to support the selective usage of VM results, in order to facilitate flexible process control. In order to determine the reliability level, we propose the use of novelty detection algorithms that determine the homogeneity between a test wafer and training wafers. If the test wafer is determined to be similar to the training wafers, its VM prediction is considered highly reliable; if not, it is considered unreliable. In order to analyze the effect of the proposed reliability evaluation methods, we conducted extensive experiments using two variable selection methods, three VM prediction models, and five novelty detection algorithms as well as their fusion model, based on actual process and metrology data. The experimental results showed that every novelty detection algorithm could satisfy our purpose, but specifically, the MLR–Gauss or MLR–fusion pair with stepwise variable selection was outstanding. We also demonstrated that, based on the evaluated reliability level and predicted metrological values, an appropriate follow-up action can be taken that will facilitate accurate and flexible process control.

Apart from the experimental results noted above, there are a few limitations of the present work that suggest further directions for research. First, because there is no clear definition of an outlier in the actual manufacturing control system, we could not evaluate the performance of the novelty detectors with more diversified measures, such as the rejection ratio of the normal class (false alarms) and the acceptance ratio of the novel class (misses). What we have done instead is to evaluate the latent effect of outliers indirectly, by comparing the performances for the normal and novel classes determined by the novelty detectors. It would therefore be worth applying our framework to a process control system that has a clear definition of outliers. Second, because we had difficulty collecting actual data from a semiconductor manufacturing process, we could not investigate the long-term effect of the reliability evaluation models; long-term VM prediction and reliability evaluation models should be developed and analyzed. Third, although we provided a general guideline for the selective usage of the reliability evaluation results, its practical impact should be studied by implementing the reliability evaluation methodology in a wafer-to-wafer control scheme.