Introduction

Semiconductor wafer fabrication is a complex, lengthy, and costly process which involves hundreds of complicated chemical steps and requires monitoring a great number of process parameters (Chien et al. 2014). Due to such complexity, it is nearly impossible to produce wafers without any defects, even when well-trained process engineers operate highly automated and precisely positioned equipment in a nearly particle-free environment (Wang et al. 2006).

Fig. 1 Typical wafer map pattern types

A wafer map is a graphical representation of a silicon wafer on which all good and defective die are recorded. Wafer map defects usually form clusters (Hansen et al. 1997), and Fig. 1 illustrates typical defect pattern types. For example, in the center pattern most defective die lie at the center of the wafer map, in the scratch pattern most defective die form a scratch, in the edge-ring pattern most defective die lie in the edge-ring region, and so on. These defect cluster patterns can provide clues for identifying process failures in semiconductor manufacturing. For example, a uniformity problem during chemical-mechanical planarization can cause the center pattern, inappropriate wafer handling or poor shipment can cause the scratch pattern, a layer-to-layer misalignment during the storage-node process can cause the edge-ring pattern, and so on. Therefore, there is a strong need to classify defect patterns accurately so that the root causes of failures can be identified quickly.

In recent years, the convolutional neural network (CNN) (Krizhevsky et al. 2012) has become one of the most popular deep learning methods and has shown excellent performance in a wide variety of areas, including image classification (Krizhevsky et al. 2012; Rakhlin et al. 2018), defect pattern classification (Kim et al. 2019; Lin et al. 2019; Kyeong and Kim 2018; Nakazawa and Kulkarni 2018), recommender systems (Van den Oord et al. 2013), speech recognition (Xiong et al. 2018), natural language processing (Kim 2014), and face recognition (Ding and Tao 2018). CNN has several advantages: it requires little specific preprocessing, no prior knowledge, and no human effort for feature extraction, and it is also a good feature extractor.

Error-correcting output codes (ECOC) (Dietterich and Bakiri 1995) combined with multiple binary classifiers have shown high classification accuracy in multi-class classification problems. By combining the advantages of CNN and ECOC, in this research we present an image-based wafer map defect pattern classification method. The presented method consists of two main steps: without any specific preprocessing, high-level abstraction features are extracted from a CNN, and then the extracted CNN features are fed to a combination of ECOC and support vector machines (SVM) (Cortes and Vapnik 1995) for wafer map defect pattern classification.

The main contributions of this study can be summarized as follows:

  1. CNN, ECOC, and SVM, as well as their combinations, are not new multi-class classification methods. However, to the best of our knowledge, this is the first time the presented method has been applied to wafer map defect pattern classification.

  2. For performance comparison, the CNN and CNN feature-based SVM (CNN-SVM) classification methods are considered, and six different binary classifiers, including SVM, are used in ECOC. Among all of them, the presented method shows the best performance.

The rest of the paper is organized as follows. The “Related work” section discusses related work on wafer map defect pattern classification, and the “Method” section presents the framework of the presented method. The experimental results on 20,000 wafer maps are reported in the “Experimental results” section. Finally, conclusions are given in the “Conclusions” section.

Related work

Wafer map defect pattern classification is a multi-class classification task. Therefore, in this section, wafer map defect pattern classification and multi-class classification are discussed in more detail.

Wafer map defect pattern classification

A great number of methods have been proposed for wafer map defect pattern classification.

In (Fan et al. 2016), Ordering Points To Identify the Clustering Structure (OPTICS) (Ankerst et al. 1999) is first applied to remove outlier defects, and then the extracted density-based and geometry-based features are used as input to an SVM for classification. In (Piao et al. 2018), a spatial filter (Gonzalez and Woods 2006) is applied to remove outlier defective die, and the extracted Radon transform-based features are used as input to a decision tree ensemble for classification. A set of novel rotation- and scale-invariant features is used as input to an SVM for classification in (Wu et al. 2015). In (Ooi et al. 2013), polar Fourier transform and rotational moment invariant features are used as input to an alternating decision tree classifier. In (Chang et al. 2012), a spatial filter is used to remove outlier defective die; then a linear Hough transform is used to detect line spatial patterns, a circular Hough transform is used to identify bull’s-eye and blob spatial patterns, and a zone ratio approach is used to pinpoint ring and edge spatial patterns. In (Yuan et al. 2010), support vector clustering (Ben-Hur et al. 2001) is used to remove outlier defective die, and a Bayesian mixture model is proposed to model the defect cluster distributions, where amorphous/linear, curvilinear, and ring patterns are modeled by a multivariate normal distribution, a principal curve, and a spherical shell, respectively. In (Wang 2008), a spatial filter is used to remove outlier defective die, a hybrid scheme combining entropy fuzzy c-means with spectral clustering is used to extract the defect clusters, and finally convexity and eigenvalue ratio are used to classify the defect pattern type. A DBSCANWBM framework (Jin et al. 2019) is proposed in which outliers are detected by applying DBSCAN with optimal parameter values and the detected outliers are removed differently according to the considered pattern types; the detected patterns are then classified based on discriminative features of the pattern types.

Recently, two image-based wafer map defect pattern classification methods have been proposed (Kyeong and Kim 2018; Nakazawa and Kulkarni 2018). The method in (Kyeong and Kim 2018) is a mixed-type pattern detection method in which an individual classification model is built separately for each of the circle, ring, scratch, and zone defect pattern types. In (Nakazawa and Kulkarni 2018), a CNN is directly applied to classify twenty-two classes of pattern types, and the image features extracted from the fully connected layer are used for wafer map image retrieval.

Multi-class classification

Multi-class classification is the task of classifying an unknown object into one of several pre-defined classes. Generally speaking, multi-class classification methods can be categorized into two groups. The first group consists of direct multi-class classification methods, which include classification and regression trees (CART) (Breiman et al. 1984), ID3 (Quinlan 1986), C4.5 (Quinlan 1993), naive Bayes (NB), k-nearest neighbors (kNN), multi-class SVM, neural networks (Hagan et al. 1996), CNN (Krizhevsky et al. 2012), and so on. In contrast, the second group consists of indirect methods, which decompose the multi-class problem into a set of binary subproblems. According to the commonly used decomposition strategies, the second group can be further divided into three broad categories: one-vs-all, one-vs-one, and ECOC. In this research, we focus on the ECOC model; interested readers can see (Lorena et al. 2008) for a more detailed review of combining binary classifiers in multi-class problems.

Summary of related work

A great number of publications have shown that ECOC can improve classification accuracy (Dietterich and Bakiri 1995; Ali Bagheri et al. 2012; García-Pedrajas and Ortiz-Boyer 2011; Zheng et al. 2008; Liu 2006; Al-Shargie et al. 2018). Among the many available binary classifiers, SVM is a particularly common choice for ECOC (Zheng et al. 2008; Liu 2006; Al-Shargie et al. 2018; Othman and Rad 2019; Abd-Ellah et al. 2018; Dorj et al. 2018).

CNN itself can be used for multi-class classification. However, instead of using the CNN directly for classification, the CNN features can be extracted first and an SVM then used for classification to improve accuracy (Tang 2013). Other combinations of CNN and SVM can also be found in (Huang and LeCun 2006; Niu and Suen 2012; Xue et al. 2016).

As described above, since CNN can extract good features and ECOC with SVM binary classifiers can obtain high classification accuracy, we expect that CNN feature-based ECOC with SVM binary classifiers can also obtain high classification accuracy. This is exactly why the combination of CNN, ECOC, and SVM is used in this research for wafer map defect pattern classification. Combinations of CNN, ECOC, and SVM can also be found in some other domains (Othman and Rad 2019; Abd-Ellah et al. 2018; Dorj et al. 2018).

Method

The main framework of the presented method is given in Fig. 2 (a minimal code sketch follows the figure), where

  1. wafer map image data are used to train the CNN model,

  2. CNN features are extracted from the fully connected layer of the trained CNN,

  3. the extracted features and class labels are fed to ECOC (with SVM used as binary classifiers) and 10-fold cross-validation is performed, and

  4. the final classification accuracy is evaluated.

Fig. 2 Framework
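To make the four framework steps concrete, the following is a minimal, hedged Python sketch (the paper's experiments are implemented in MATLAB with LIBSVM, so this is only an illustrative equivalent). The tiny CNN, the random images, and the labels below are stand-ins so the example runs end to end; `OneVsOneClassifier` stands in for ECOC with one-vs-one coding.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Step 1: train a CNN on wafer map images (stand-in: an untrained toy network).
cnn = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
    nn.Flatten(),
    nn.Linear(4 * 4 * 4, 8),              # fully connected layer, one output per class
)
images = torch.randn(200, 1, 64, 64)      # toy grayscale wafer maps (64x64 to stay fast)
labels = np.arange(200) % 8               # toy labels for 8 pattern types

# Step 2: extract features from the fully connected layer of the (trained) CNN.
with torch.no_grad():
    features = cnn(images).numpy()

# Steps 3-4: one-vs-one ECOC with linear SVM binary learners, 10-fold cross-validation.
ecoc_svm = make_pipeline(StandardScaler(), OneVsOneClassifier(LinearSVC()))
accuracy = cross_val_score(ecoc_svm, features, labels, cv=10)
print(f"10-fold cross-validated accuracy: {100 * accuracy.mean():.2f}%")
```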

Convolutional neural network

CNN is a class of deep neural networks which has been shown to be particularly effective in image classification, image and video recognition, object detection, recommender systems, medical image analysis, natural language processing, and so on. A CNN consists of an input layer, an output layer, and many hidden layers between them. The hidden layers are a combination of convolution layers, normalization layers, pooling layers, and fully connected layers.

The most important operation in a CNN takes place in the convolution layers, where filters are convolved with the input image and the output of each convolved image is used as the input to the next layer. In this way, the CNN combines the lower-level features of earlier layers to form higher-level image features. These higher-level image features are better suited for classification since they provide greater levels of abstraction (Donahue et al. 2014). In addition, there is no need to manually extract useful features, since the features are learned directly by the CNN. These advantages are precisely why we use a CNN in this research.
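The convolution operation itself is simple; the following NumPy sketch (not from the paper) shows how a 3 x 3 filter slides over a single-channel image to produce a feature map. In a CNN the filter weights are learned, whereas the filter here is hand-crafted for illustration.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a k x k kernel over a 2-D image (valid mode, stride 1)."""
    k = kernel.shape[0]
    h, w = image.shape
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

image = np.random.rand(8, 8)          # toy single-channel input
edge_filter = np.array([[1, 0, -1],   # a hand-crafted 3 x 3 filter; in a CNN the
                        [1, 0, -1],   # filter weights are learned from data
                        [1, 0, -1]])
feature_map = conv2d(image, edge_filter)
print(feature_map.shape)              # (6, 6)
```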

ECOC classification

ECOC (Dietterich and Bakiri 1995) is a powerful tool for multi-class classification which can not only improve the classification accuracy but also reduce both variance and bias errors (Kong and Dietterich 1995). This is the reason why we choose ECOC as our classification model.

The main idea of ECOC is to combine multiple binary classifiers for multi-class classification. In this research, SVM binary classifiers are utilized as the ECOC base classifiers. In the following, ECOC and SVM are briefly described.

ECOC

ECOC consists of coding and decoding steps.

In the coding step, the code matrix M is defined as:

$$\begin{aligned} \mathrm{M} \in {\{ 1, - 1\} ^{C \times L}} \end{aligned}$$
(1)

where C is the number of classes, L is the length of the codewords (the number of binary classifiers), each row represents the codeword of the corresponding class, and each column represents the corresponding binary classifier. The value \({\mathrm{M}_{cl}} = 1 (\mathrm{{or}}\, - 1)\) means that the samples associated with class c are treated as the positive (or negative) class by the lth binary classifier. Each of the L binary classifiers is then trained according to the partition of the classes in the corresponding column of M.

In the decoding step, the class labels of the test data are predicted. For a given test sample, each classifier generates a value of 1 or -1, so that an output vector of length L is obtained. The output vector is compared with each codeword in the code matrix, and the class whose codeword is closest to the output vector is chosen as the predicted class.

In ECOC, the number of classifiers L depends on the coding design, which determines which classes are used to train each binary classifier. There are many coding designs; however, one-versus-all (Nilsson 1965) and one-versus-one (Hastie and Tibshirani 1998) are the most widely used. In one-versus-all, n classifiers are needed since an n-class problem is converted into n two-class problems, and in the ith two-class problem class i is separated from the remaining classes. In one-versus-one, \(n(n - 1)/2\) classifiers are needed since an n-class problem is converted into \(n(n - 1)/2\) two-class problems, one for each pair of classes. In this research, the one-versus-one coding design is used.
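The following NumPy sketch (not from the paper) illustrates the decoding step described above: each of the L binary classifiers outputs +1 or -1, and the predicted class is the row of M whose codeword has the smallest Hamming distance to the output vector. The code matrix below is the exhaustive code for C = 4 classes and L = 7 classifiers rather than the one-versus-one design used in the paper; it is chosen because its codewords are far enough apart to correct a single classifier error.

```python
import numpy as np

M = np.array([[ 1,  1,  1,  1,  1,  1,  1],   # codeword of class 0
              [-1, -1, -1, -1,  1,  1,  1],   # codeword of class 1
              [-1, -1,  1,  1, -1, -1,  1],   # codeword of class 2
              [-1,  1, -1,  1, -1,  1, -1]])  # codeword of class 3

def ecoc_decode(outputs, code_matrix):
    """Return the class whose codeword is closest to `outputs` in Hamming distance."""
    hamming = np.sum(code_matrix != outputs, axis=1)
    return int(np.argmin(hamming))

# Outputs for a class-2 sample in which the first binary classifier made a mistake:
outputs = np.array([1, -1, 1, 1, -1, -1, 1])
print(ecoc_decode(outputs, M))                # 2: the single error is corrected
```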

Support vector machines

SVM (Cortes and Vapnik 1995) was originally introduced for binary classification. SVM searches for the maximum-margin hyperplane.

Training data of instance-label pairs can be expressed as:

$$\begin{aligned} ({x_1},{y_1}),({x_2},{y_2}), \ldots ,({x_n},{y_n}),\,{x_i} \in {R^d},\,{y_i} \in \{ + 1, - 1\} \end{aligned}$$
(2)

where \(y_i\) is the class label which can take one of two values, either +1 or -1, n is the number of training samples and d is the number of dimensions.

If the training dataset is linearly separable, any separating hyperplane can be written as:

$$\begin{aligned} {w^T}{x_i} + b = 0,\,i = 1,2, \ldots ,n \end{aligned}$$
(3)

where w is a d-dimensional weight vector (\(w = (w_1,w_2, \ldots , w_d)\)) and b is a bias term.

Taking the margin on each side into account, the constraints on the hyperplane can be written as:

$$\begin{aligned} \begin{aligned}&{w^T}{x_i} + b \ge 1\quad \mathrm{for}\,\,{y_i} = + 1 \\&{w^T}{x_i} + b \le - 1\quad \mathrm{for}\,\,{y_i} = - 1 \\ \end{aligned} \end{aligned}$$
(4)

The two inequalities in Eq. 4 can be combined into the equivalent form:

$$\begin{aligned} {y_i}({w^T}{x_i} + b) \ge 1,\,i = 1,2, \ldots ,n \end{aligned}$$
(5)

The weight vector w and the bias term b need to be learned from the training data to find the maximum-margin hyperplane. They can be obtained by solving the following minimization problem for w and b, subject to Eq. 5:

$$\begin{aligned} \mathrm{{minimize}}\,\,\,\,\,\,\,\,\frac{1}{2}{\left\| w \right\| ^2} \end{aligned}$$
(6)

where \(\left\| w \right\| \) is the Euclidean norm of w, that is, \(\sqrt{w \cdot w} = \sqrt{w_1^2 + w_2^2 + \cdots + w_d^2}\).

Such an optimization problem can be solved via the following Lagrange formulation, which is easier to handle and also extends to the nonlinear case:

$$\begin{aligned} {\mathcal {L}}(w,b,\alpha ) = \frac{1}{2}{\left\| w \right\| ^2} - \sum \limits _{i = 1}^n {{\alpha _i}({y_i}({w^T}{x_i} + b) - 1)} \end{aligned}$$
(7)

where \(\alpha = {({\alpha _1}, \ldots ,{\alpha _n})^T}\) and \({\alpha _i}\) are the nonnegative Lagrange multipliers.

Equation 7 needs to be minimized with respect to w and b. The optimal solution is given by the saddle point and the solution satisfies the following Karush–Kuhn–Tucker conditions:

$$\begin{aligned}&\frac{{\partial {\mathcal {L}}(w,b,\alpha )}}{{\partial w}} = 0 \end{aligned}$$
(8)
$$\begin{aligned}&\frac{{\partial {\mathcal {L}}(w,b,\alpha )}}{{\partial b}} = 0 \end{aligned}$$
(9)

Equations 8 and 9 lead to:

$$\begin{aligned}&w = \sum \limits _{i = 1}^n {{\alpha _i}{y_i}{x_i}} \end{aligned}$$
(10)
$$\begin{aligned}&\sum \limits _{i = 1}^n {{\alpha _i}{y_i} = 0} \end{aligned}$$
(11)

Substituting Eqs. 10 and 11 into Eq. 7, we obtain the following dual problem:

$$\begin{aligned}&\mathrm{{maximize}}\,\,\,\,\,\,{{{\mathcal {L}}}}(\alpha ) = \sum \limits _{i = 1}^n {{\alpha _i}} - \frac{1}{2}\sum \limits _{i,j = 1}^n {{\alpha _i}{\alpha _j}{y_i}{y_j}x_i^T{x_j}} \end{aligned}$$
(12)
$$\begin{aligned}&\mathrm{{subject}}\,\,\,\mathrm{{ to}}\,\,\,\sum \limits _{i = 1}^n {{\alpha _i}{y_i} = 0} ,\,\,\,{\alpha _i} \ge 0\,\,\,i = 1,2, \ldots ,n \end{aligned}$$
(13)

In the solution, training samples with \({\alpha _i} > 0\) are support vectors and all the other training samples have \({\alpha _i} = 0\). Once the support vectors are found, the class label of a given test sample can be predicted with the following decision function:

$$\begin{aligned} D(x) = \sum \limits _{i \in S} {{\alpha _i}{y_i}x_i^Tx + b} \end{aligned}$$
(14)

where S is the set of support vector indices, b is a numeric parameter determined automatically by the optimization, and x is the test sample.

However, if the training data are not linearly separable, the constraints in Eq. 4 are too strict and no such hyperplane exists. In this case, slack variables can be used to relax the constraints. Readers interested in the non-separable and nonlinear cases are referred to (Cortes and Vapnik 1995) for more details.
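The decision function of Eq. 14 can be checked numerically. The small sketch below (not from the paper) fits a linear SVM with scikit-learn and recomputes Eq. 14 from its support vectors: in scikit-learn, `dual_coef_` stores \(\alpha_i y_i\) for the support vectors and `intercept_` stores b.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)   # toy separable data
clf = SVC(kernel="linear").fit(X, y)

x_test = X[:5]
# Eq. 14: D(x) = sum_{i in S} alpha_i * y_i * x_i^T x + b
manual = x_test @ clf.support_vectors_.T @ clf.dual_coef_[0] + clf.intercept_[0]
print(np.allclose(manual, clf.decision_function(x_test)))    # True
```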

Experimental results

Data

Table 1 shows the labeled WM-811K (Wu et al. 2015) dataset and the dataset used for the experiments. The WM-811K dataset consists of 811,457 real wafer maps; for 172,950 of them, domain experts were recruited to label the pattern types. The labeled WM-811K dataset contains 9 types of defect patterns (center, donut, edge-loc, edge-ring, loc, near-full, random, scratch, and none), and its pattern type distribution is shown in Table 1a. However, the near-full pattern type is excluded from the experiments since it can be identified simply by the defect cover ratio, whose computation cost is much lower than that of the presented method. Therefore, the remaining eight pattern types are considered in this research.

As can be seen from Table 1a, the number of wafer maps per pattern type is highly imbalanced. The number of wafer maps in each pattern type is therefore fixed at 2500 to balance the pattern types in our experiments. Since the number of wafer maps in each of the center, edge-loc, edge-ring, loc, and none pattern types is greater than 2500, 2500 wafer maps are randomly selected from each of these pattern types. In contrast, the number of wafer maps in each of the donut, random, and scratch pattern types is less than 2500, so synthetic wafer maps are generated for these pattern types. Each synthetic wafer map is generated from a selected real wafer map by removing two defective die, where the pair of defective die is randomly selected from all pair combinations of two different defective die. In this way, four, two, and two synthetic wafer maps are respectively generated from each wafer map of the 555 donut, 866 random, and 1193 scratch patterns, giving a total of 2220 (555 * 4), 1732 (866 * 2), and 2386 (1193 * 2) synthetic wafer maps for the donut, random, and scratch patterns, respectively. Then the missing 1945 (2500 - 555) donut, 1634 (2500 - 866) random, and 1307 (2500 - 1193) scratch wafer maps are randomly selected from the generated 2220 donut, 1732 random, and 2386 scratch synthetic wafer maps to fulfill the quantity requirements. Table 1b shows the pattern type distribution of the dataset used for the experiments; in total, 20,000 wafer map images are used.

The WM-811K dataset is stored in numeric format, so each numeric wafer map needs to be converted into grayscale image data. Grayscale images of size [256 256] are used as the CNN input.

Table 1 Labeled WM-811K dataset and used dataset for experiments
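The synthetic wafer map generation described above can be sketched as follows. This is a hedged illustration, assuming the usual WM-811K encoding (0 = no die, 1 = good die, 2 = defective die); the paper does not specify the value used to replace a removed defective die, so the chosen pair is relabeled as good die here.

```python
import numpy as np

def make_synthetic(wafer_map, rng):
    """Create one synthetic wafer map by removing a randomly chosen pair of
    two different defective die from a real wafer map."""
    defect_idx = np.argwhere(wafer_map == 2)              # positions of defective die
    chosen = rng.choice(len(defect_idx), size=2, replace=False)
    synthetic = wafer_map.copy()
    for r, c in defect_idx[chosen]:
        synthetic[r, c] = 1                               # remove the defective die
    return synthetic

rng = np.random.default_rng(0)
toy = rng.integers(0, 3, size=(26, 26))                   # stand-in wafer map
delta = np.count_nonzero(toy == 2) - np.count_nonzero(make_synthetic(toy, rng) == 2)
print(delta)                                              # 2 defective die removed
```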

CNN configurations

Fig. 3 shows the CNN architecture used in this research. The architecture is composed of an input layer, 7 convolution-related layers, a fully connected layer, a softmax layer, and an output layer. Each of the 7 convolution-related layers consists of a convolutional layer, a batch normalization layer, a ReLU (rectified linear unit) layer, and a max pooling layer.

In each convolutional layer, a filter of size [3 3] with zero padding of size 1 is applied. The [3 3] filter size is used since it is the smallest filter size that can capture left/right, up/down, and center information. Defects in the edge-ring and edge-loc patterns are located at the wafer map edges, and zero padding prevents the convolution operation from losing such edge information. The batch normalization layer normalizes the activations and gradients propagating through the network, making network training an easier optimization problem; placing batch normalization layers between convolutional layers and ReLU layers speeds up network training and reduces the sensitivity to network initialization. After each batch normalization layer, the most widely used activation function, ReLU, is applied. After each ReLU layer, a max pooling layer is used to reduce the feature map size and remove redundant spatial information. This reduction makes it possible to increase the number of filters in deeper convolutional layers without requiring a large amount of computation per layer. Therefore, as can be seen from Fig. 3, the only difference between the 7 convolution-related layers is the number of filters. In each max pooling layer, a pool size of [2 2] and a stride of [2 2] are applied so that the pooling regions do not overlap. A fully connected layer is a layer in which each neuron is connected to all neurons in the preceding layer; it combines the features to classify the images, and its number of outputs equals the number of classes. The softmax layer normalizes the output of the fully connected layer. Stochastic gradient descent with momentum is used for CNN training. Deeper layers contain higher-level features that are constructed from the lower-level features of earlier layers; therefore, the fully connected layer right before the classification layer is used to extract the CNN features for CNN feature-based classification.

Fig. 3 CNN architecture
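For readers who prefer code, the following PyTorch sketch reproduces the structure of Fig. 3: seven blocks of 3 x 3 convolution (zero padding 1), batch normalization, ReLU, and 2 x 2 max pooling with stride 2, followed by a fully connected layer with 8 outputs. The per-block filter counts below are assumptions, since the paper only states that the blocks differ in their number of filters; the softmax is applied inside the cross-entropy loss.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),   # non-overlapping pooling
    )

channels = [1, 8, 16, 32, 64, 128, 256, 512]     # assumed filter counts per block
blocks = [conv_block(channels[i], channels[i + 1]) for i in range(7)]

model = nn.Sequential(
    *blocks,
    nn.Flatten(),                    # a 256x256 input becomes 2x2 after seven poolings
    nn.Linear(512 * 2 * 2, 8),       # fully connected layer: one output per class
)
# Training uses stochastic gradient descent with momentum, as in the paper.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()    # applies the softmax internally

x = torch.randn(4, 1, 256, 256)      # a toy batch of grayscale wafer map images
print(model(x).shape)                # torch.Size([4, 8])
```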

Classification accuracy

The CNN and CNN-SVM classification methods are used for comparison. ECOC can be combined with any binary classifier; for comparison, six different binary classifiers, namely NB, linear discriminant analysis (LDA), CART, kNN, logistic regression (LOGISTIC), and SVM, are applied in CNN feature-based ECOC (CNN-ECOC) classification. Therefore, in this subsection, experimental results of eight classification methods in total are presented. 10-fold cross-validation is applied to all eight classification methods. Due to the random weight initialization in CNN and the random partition of samples in 10-fold cross-validation, each of the eight classification methods is run ten times to assess the accuracy variance. The classification results of CNN, CNN-SVM, and CNN-ECOC are shown in Tables 2, 3, and 4, respectively. For convenience, the six classification methods in Table 4 are abbreviated as CNN-ECOC-NB, CNN-ECOC-LDA, CNN-ECOC-CART, CNN-ECOC-kNN, CNN-ECOC-LOGISTIC, and CNN-ECOC-SVM, and the corresponding classification results are shown in Table 4a–f, respectively. Among them, the method presented in this research is CNN-ECOC-SVM.

LIBSVM (Chang and Lin 2011) is used for CNN-SVM, while all the other programs are implemented with MATLAB 2018b (Matlab 2018). Except in CNN classification, the extracted CNN features are first standardized, and the one-vs-one decomposition strategy is applied in all remaining seven classification methods. Linear SVM is applied in both CNN-SVM and CNN-ECOC-SVM. The default template functions ’templateNaiveBayes’, ’templateDiscriminant’, ’templateTree’, ’templateKNN’, and ’templateSVM’ are used for the NB, LDA, CART, kNN, and SVM binary classifiers, respectively. The template function ’templateLinear’ is used for the LOGISTIC binary classifier, with the learner specified as ’logistic’. In these template functions, all input parameters keep their default values during training, except for the learner option ’logistic’ in the ’templateLinear’ function. The default values of these template functions can be found in MATLAB 2018b (Matlab 2018).
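For readers without MATLAB, the following scikit-learn sketch mirrors this comparison. The Python estimators are rough, not exact, equivalents of the MATLAB template functions, each wrapped in a one-vs-one decomposition over standardized CNN features; the random features and labels below are stand-ins for the extracted CNN features.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

binary_learners = {
    "NB": GaussianNB(),
    "LDA": LinearDiscriminantAnalysis(),
    "CART": DecisionTreeClassifier(),
    "kNN": KNeighborsClassifier(),
    "LOGISTIC": LogisticRegression(max_iter=1000),
    "SVM": LinearSVC(),
}

rng = np.random.default_rng(0)
features = rng.normal(size=(400, 64))     # stand-in for extracted CNN features
labels = np.arange(400) % 8               # stand-in labels for 8 pattern types

for name, learner in binary_learners.items():
    model = make_pipeline(StandardScaler(), OneVsOneClassifier(learner))
    acc = cross_val_score(model, features, labels, cv=10).mean()
    print(f"CNN-ECOC-{name}: {100 * acc:.2f}%")
```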

In Tables 2 and 3 and each sub-table of Table 4, the first (’Iter’) column shows the iteration number, the rightmost (’Avg’) column shows the average classification accuracy of each iteration, and the bottom two rows show, respectively, the average classification accuracy and the standard deviation of each pattern type over the 10 iterations.

From Tables 2 and 3, it can be seen that CNN-SVM significantly outperforms CNN since

  1. even the lowest per-iteration average classification accuracy of CNN-SVM (97.40%) is much higher than the highest one of CNN (90.90%),

  2. the average classification accuracy of each pattern type in CNN-SVM is higher than that of the corresponding pattern type in CNN, and

  3. the standard deviation of each pattern type in CNN-SVM is much lower than that of the corresponding pattern type in CNN. Although there is a slight increase in the overall standard deviation, the overall average classification accuracy is improved by 7.42 percentage points (97.92% vs. 90.50%).

The only difference between these two methods is in their objectives. The softmax layer in CNN minimizes the cross-entropy (equivalently, maximizes the log-likelihood), while SVM simply finds the maximum margin between data samples of different classes. Such a significant improvement is mainly due to the better generalization ability of SVM compared with softmax.
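For concreteness, the cross-entropy objective minimized by the softmax layer can be written in its standard textbook form (not taken from the paper) as

$$\begin{aligned} {L_{\mathrm{softmax}}} = - \sum \limits _{i = 1}^n {\log \frac{{{e^{{z_{i,{y_i}}}}}}}{{\sum \nolimits _{j = 1}^C {{e^{{z_{i,j}}}}} }}} \end{aligned}$$

where \(z_{i,j}\) is the jth output of the fully connected layer for sample i and \(y_i\) here denotes the class index of sample i, whereas the SVM solves the margin maximization problem of Eqs. 5 and 6, whose solution depends only on the support vectors near the decision boundary.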

Table 2 CNN classification accuracy (%) (CNN)
Table 3 CNN feature-based SVM classification accuracy (%) (CNN-SVM)
Table 4 CNN feature-based ECOC classification accuracy (%)

In terms of classification accuracy, among all eight classification methods, CNN-ECOC-SVM obtains the highest overall classification accuracy (98.43%, Table 4f) and CNN-SVM obtains the second highest (97.92%, Table 3):

  1. In terms of the average classification accuracy of each iteration: even the lowest per-iteration average accuracy of CNN-ECOC-SVM (98.26%) is higher than the highest one of every compared method except CNN-SVM. Compared with CNN-SVM in Table 3, 98.26% is lower in only two iterations, the 5th (98.29%) and the 6th (98.35%).

  2. In terms of the average classification accuracy of each pattern type over 10 iterations: CNN-ECOC-SVM does not beat all compared methods in all eight pattern types. However, it obtains the highest average classification accuracy in the center, edge-loc, edge-ring, and loc pattern types. In the donut pattern type, only CNN-SVM (99.98%) and CNN-ECOC-kNN (100%) are higher than CNN-ECOC-SVM (99.95%); in the random pattern type, only CNN-ECOC-kNN (99.96%) is higher than CNN-ECOC-SVM (99.87%); in the scratch pattern type, only CNN-ECOC-kNN (99.09%) is higher than CNN-ECOC-SVM (98.69%); and in the none pattern type, only CNN-ECOC-LDA (99.51%) is higher than CNN-ECOC-SVM (98.39%). Compared with CNN-SVM, CNN-ECOC-SVM wins in all pattern types except donut, where CNN-SVM obtains 99.98% and CNN-ECOC-SVM obtains 99.95%.

  3. In terms of the standard deviation of each pattern type over 10 iterations: compared with CNN-SVM, only the standard deviations of the center and donut pattern types are higher in CNN-ECOC-SVM. However, in the center pattern type CNN-ECOC-SVM obtains a higher average classification accuracy (99.28%) than CNN-SVM (99.09%), and although CNN-ECOC-SVM obtains a slightly lower average accuracy in the donut pattern type (99.95% vs. 99.98%), its overall standard deviation (0.19) is lower than that of CNN-SVM (0.30).

From Tables 2 and 3 and each sub-table of Table 4, a summary of the accuracy comparison is presented in Table 5, where each row is taken from the average classification accuracy and overall standard deviation rows of the corresponding method. In Table 5, the best value in each column is marked in bold.

Table 5 A summary comparison of classification accuracy

A one-way ANOVA is applied to the 10 iterations of each of the eight classification methods to determine whether they share a common average classification accuracy. The mean difference is considered significant at the 0.05 level. The corresponding result is given in Table 6, where the columns ’Source’, ’SS’, ’df’, ’MS’, ’F’, and ’Prob>F’ respectively indicate the source of the variability, the sum of squares due to each source, the degrees of freedom associated with each source, the mean squares for each source, the F-statistic, and the p value. In Table 6, the small p value of 2.8599e\(-\)54 indicates that the average classification accuracies of the 10 iterations of the eight methods are not all the same.

Table 6 ANOVA result

In addition, the result of pairwise comparisons for the one-way ANOVA is presented in Table 7. In Table 7, the first two columns show the pair of compared methods, the ’Lower confidence’ and ’Upper confidence’ columns contain the lower and upper confidence bounds, and the ’p value’ column contains the p value for the hypothesis test that the corresponding mean difference is not equal to 0. The two methods compared in each of rows 1-10, 14-22, 24, 25, and 27 are significantly different, since the confidence interval does not include zero and the p value is smaller than 0.05. In contrast, the two methods in each of rows 11-13, 23, 26, and 28 are not significantly different. From rows 13 and 28, it can be seen that the presented method CNN-ECOC-SVM is not significantly different from CNN-SVM and CNN-ECOC-LOGISTIC.

Table 7 Pairwise comparison test
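The same analysis can be reproduced outside MATLAB (the paper uses MATLAB's anova1/multcompare) with SciPy and statsmodels, as in the hedged sketch below: a one-way ANOVA over the per-iteration accuracies of the eight methods, followed by pairwise multiple comparisons. The accuracy arrays are hypothetical stand-ins for the 10 iterations of each method.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
methods = ["CNN", "CNN-SVM", "CNN-ECOC-NB", "CNN-ECOC-LDA",
           "CNN-ECOC-CART", "CNN-ECOC-kNN", "CNN-ECOC-LOGISTIC", "CNN-ECOC-SVM"]
acc = {m: 90 + rng.normal(scale=0.5, size=10) for m in methods}   # 10 iterations each

f_stat, p_value = f_oneway(*acc.values())
print(f"F = {f_stat:.3f}, p = {p_value:.3g}")   # a small p means the means differ

values = np.concatenate(list(acc.values()))
groups = np.repeat(methods, 10)
print(pairwise_tukeyhsd(values, groups, alpha=0.05))   # confidence intervals + p values
```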

In addition to classification accuracy, the performance measures of precision, recall, specificity, and F-measure are also considered, and the comparison results are shown in Table 8. For each classification method in Table 8, the iteration with the highest average classification accuracy over the 10 iterations is selected for comparison; the selected iteration number for each method is shown in the ’Iteration no.’ column. In Table 8, the best value of each performance measure is marked in bold. It can be seen from Table 8 that CNN-ECOC-SVM achieves the highest accuracy, precision, recall, specificity, and F-measure of 98.79%, 98.79%, 98.79%, 99.83%, and 98.79%, respectively.

Table 8 A comparison of performance measures
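The per-class measures in Table 8 can be derived from a confusion matrix; the small sketch below (not from the paper) shows one way to compute the macro-averaged values. Specificity is not provided by scikit-learn directly, so it is derived from the true negatives and false positives.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def macro_measures(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = cm.sum() - tp - fp - fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                       # also called sensitivity
    specificity = tn / (tn + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return {name: values.mean() for name, values in
            {"precision": precision, "recall": recall,
             "specificity": specificity, "F-measure": f_measure}.items()}

y_true = np.array([0, 0, 1, 1, 2, 2, 2, 3])       # toy labels and predictions
y_pred = np.array([0, 1, 1, 1, 2, 2, 3, 3])
print(macro_measures(y_true, y_pred))
```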

The extensive experimental results confirm our original expectation. The main reasons why CNN-ECOC-SVM obtains such good performance are that the CNN extracts highly discriminative features and the combination of ECOC and SVM improves the classification accuracy.

An interesting observation is that every one of the eight classification methods shows its two highest standard deviation values in the edge-loc and loc pattern types. We conjecture that this phenomenon comes from the lack of clear differences between these two defect pattern types; a deeper investigation of the explanation is left for future work.

The WM-811K (Wu et al. 2015) dataset has also been used with the SVM-based method (Wu et al. 2015), OPTICS-SVM (Fan et al. 2016), JLNDA-FD (Yu and Lu 2016), the decision tree ensemble learning-based method (Piao et al. 2018), and the soft voting ensemble (SVE) (Saqlain et al. 2019) for wafer map defect pattern classification. Since the datasets used in these studies are not completely identical and also differ from the one used in this research, a direct comparison is impossible. However, for a rough comparison with CNN-ECOC-SVM, their overall classification accuracies are as follows: SVM-based method 94.63%, OPTICS-SVM 94.30%, JLNDA-FD 90.50%, decision tree ensemble learning-based method 90.50%, and SVE 95.87%. Clearly, the overall classification accuracy of CNN-ECOC-SVM is considerably higher than that of these state-of-the-art methods.

Conclusions

In this research, we present an image-based wafer map defect pattern classification method. The presented method consists of two main steps: feature extraction and classification. In the feature extraction step, high-level features are extracted from a CNN, and in the classification step the extracted CNN features are used to train an ECOC model for wafer map defect pattern classification. In the ECOC model, SVM binary classifiers are employed.

Experimental results on 20,000 wafer maps show that the presented method achieves the best classification accuracy, up to 98.43%, in comparison with the other wafer map defect pattern classification methods considered.

There is still much work to be done in the future. In this research, only the one-vs-one decomposition strategy is applied; to further improve the classification accuracy, other decomposition strategies and other binary classifiers also need to be considered.