1 Introduction

Surface defect inspection refers to the inspection of a finished product's surface to identify defects such as scratches, pits, protrusions, and stains. Manual surface inspection performed by quality inspectors suffers from low efficiency, high labor intensity, low accuracy, and poor real-time performance, and cannot satisfy the continuously increasing quality standards of industrial manufacturing processes. As one of the key technologies in manufacturing, machine vision provides a solution that fulfills the increasing demands for quality documentation and product traceability by using engineered systems to perform quality inspection and remove defective products from production lines [1]. Machine vision systems offer high precision, high efficiency, high-speed and continuous detection, and non-contact measurement. Consequently, a large variety of solutions and applications have been developed in this field since the 1980s, and their number continues to grow. European and North American market data reveal that the growth of machine vision applications generally outpaces overall economic growth. Moreover, China has also become a major market for machine vision in recent years [2]. According to [3], the global machine vision market was approximately USD 7.2 billion in 2017, growing 6.8% year-on-year.

Golnabi and Asadpour [4] classified the applications of machine vision into four categories: visual inspection, process control, part identification, and robotic guidance and control. Among these, automated visual inspection (AVI) is the most significant and widely used application, and numerous studies have been conducted on AVI algorithms. Traditional AVI algorithms can be classified into statistical methods, spectral methods, model-based methods, and learning-based methods [5], which generally comprise two stages (feature extraction and defect identification). Evidently, these depend heavily on features designed by human experts and are sensitive to variations in the application conditions. In recent years, deep learning has achieved remarkable performance in face recognition, speech recognition, natural language processing, etc. However, it has relatively few applications in the field of AVI. The probable reason is that deep learning relies strongly on large amounts of training data, whereas surface defect datasets are generally small and challenging to collect and label. Nonetheless, compared with traditional defect detection methods, deep learning-based methods can learn high-level features automatically from training data without manual feature design. They are more versatile in detecting different types of defects and less sensitive to variations in application conditions. In our work, recent advances and applications of deep learning-based AVI algorithms are investigated. In particular, we focus on AVI of industrial products in semiconductor, steel, and fabric manufacturing processes. Our literature survey indicates that a large number of AVI methods and applications have been studied in these fields. We believe that they are the most important application areas of AVI for the following reasons. The complexity and miniaturization of printed circuit boards (PCBs) and integrated circuits (ICs) may make inspection feasible only through AVI systems; presently, many steps in semiconductor production can be performed reliably only through the use of machine vision [6]. AVI is also essential for quality control in the steel manufacturing process, because traditional manual surface inspection procedures are inadequate for guaranteeing surface quality [7]. In the fabric manufacturing process, AVI is an important means of replacing manual inspection, although it remains a challenging task owing to the variability of textures and the diversity of defects [8]. This has motivated many studies seeking better solutions.

The remainder of this paper is organized as follows: Prior literature reviews are presented in Section 2. The hardware, software, and algorithms of AVI are described in Section 3. In Section 4, traditional surface defect detection approaches are reviewed, including statistical methods, spectral methods, model-based methods, and learning-based methods. In Section 5, our analysis of deep learning networks and public datasets for surface defect inspection is presented, followed by our investigation of deep learning-based inspection approaches and their applications in the steel, fabric, and semiconductor industries. The challenges and solutions in this field are also discussed. This work is concluded in Section 6.

2 Prior literature review

A number of reviews and surveys of AVI methods have been conducted since the 1980s. The reviews published in the early years are available in [9], whereas recent review papers are presented chronologically in this section and illustrated in Table 1.

Table 1 Recent literature review papers

Malamas et al. [10] presented a review on industrial vision systems, applications, and tools in 2003 and discussed the important issues and directions for designing and developing industrial vision systems. In 2008, Xie [11] systematically reviewed the advances in surface inspection using computer vision and image processing techniques, particularly texture analysis methods, under four categories: statistical approaches, structural approaches, filter-based methods, and model-based approaches. Kumar [12] surveyed computer vision-based fabric defect detection methods in 2008. He divided the methods into three categories: statistical, spectral, and model-based approaches. The paper also indicated that a combination of statistical, spectral, and model-based approaches could yield better results than any individual approach. Mahajan et al. [13] reviewed and described fabric defect detection methods for visual inspection. They characterized the feature extraction and decision-making methods into three categories: statistical, spectral, and model-based methods. Hani et al. [14] presented a literature review of pattern recognition algorithms for the automated visual inspection of surface mount device printed circuit boards (SMD-PCBs). The review focused on segmentation algorithms, feature extraction algorithms, and the performance evaluation of different types of classifiers. Ngan et al. [15] offered a survey of fabric defect detection methods with a description of their characteristics, strengths, and weaknesses in 2011. They divided the methods into seven approaches (statistical, spectral, model-based, learning, structural, hybrid, and motif-based). Neogi et al. [7] presented a comprehensive review of vision-based steel surface inspection systems in 2014. The review covered the overall aspects of steel surface inspection and classified steel surfaces into six types: slab, billet, plate, hot strip, cold strip, and rod/bar. In 2015, Huang and Pan [9] studied AVI systems and reviewed their applications in the surface inspection of semiconductor products including wafers, TFT-LCDs, and light-emitting diodes (LEDs). They classified the inspection algorithms into projection methods, filter-based approaches, learning-based approaches, and hybrid methods. Hanbay et al. [16] presented a comprehensive literature review of fabric defect detection methods in 2016. Defect detection methods were divided into structural approaches, statistical approaches, spectral approaches, model-based approaches, learning approaches, and hybrid approaches. The main concepts underlying these approaches, as well as their strengths and weaknesses, were discussed. Anitha and Rao [17] reviewed the defect detection methods for various categories of PCB, such as single-layer, double-layer, and multilayer bare PCBs and assembled PCBs, in 2017. In 2018, Sun et al. [3] studied the research status and trends of steel inspection from the perspectives of the detected object, hardware, and software. In addition, the detection algorithms were divided into statistical methods, filtering-based methods, model-based methods, and machine learning-based methods.

3 AVI system

The principle of designing an automated visual inspection system is to replace the manual inspection process completely [18], as shown in Fig. 1. AVI is composed mainly of the following processes: image acquisition, defect detection, and quality control. The image acquisition process is aimed at measuring and acquiring images of the object to be inspected using an optical system. The optical system consists of a digital or analog camera with a CCD or CMOS sensor as well as a lighting system. The defect detection process refers to defect detection and recognition using image-processing techniques such as image preprocessing, feature extraction, and classification. The detection results are output to a quality control system to serve as a guide for rejecting defective products. The detection results may include information on whether a sensed image is defective or defect free, the severity of the defects, and the category of the defects.

Fig. 1
figure 1

Fundamental principle of automated visual inspection system

3.1 Camera and lighting

The sensor, which generates the image, is the most important part of a camera. There are two main sensor types: CCD and CMOS. Compared with CCD, CMOS image sensors are a more recent technology and are based predominantly on digital circuitry. CMOS sensors can conveniently incorporate functions such as analog-to-digital conversion, addressing, windowing, gain and offset adjustments, and smart preprocessing on the chip. CMOS is expected to become the dominant sensor technology for machine vision in the future [2]. Another trend in AVI is the smart camera, which consists of a sensor and a processing core that performs the major image processing operations in situ and transmits only the necessary information to the computer workstation [19].

Cameras can be categorized as analog or digital depending on whether they produce an analog or digital video signal after acquiring an image. The transmission of an analog signal requires a special interface card called a frame grabber, whereas a digital camera performs the analog-to-digital conversion internally and transmits the digital video signal to a computer. Analog video transmission was the dominant technology in the machine vision industry for a long time. However, because analog video transmission may degrade image quality, its use has declined in recent years, and advanced AVI systems typically use digital video transmission. Apart from higher image quality, digital cameras offer significantly higher resolutions and frame rates, a significantly smaller size, and lower power requirements than analog cameras [1].

A suitable lighting system makes the entire vision inspection system more efficient and accurate. The types of light sources commonly used in machine vision include incandescent lamps, xenon lamps, fluorescent lamps, and light-emitting diodes (LEDs) [1]. Presently, LED is the primary illumination method for machine vision. LEDs have a long service life, commonly longer than 100,000 h. Their brightness can be controlled conveniently, with low power consumption and low heat production. They can be designed in different sizes and shapes and can irradiate at different angles. Guidelines for configuring LED light sources include achieving good contrast between the foreground and background for reliable measurement, as well as good contrast among the internal features [20].

3.2 Software and algorithms

The most frequently used image processing software for AVI includes OpenCV, Halcon, VisionPro, etc. OpenCV [21] is an open-source image processing library with algorithms for image smoothing, morphological transformations, image pyramids, image moments, thresholding operations, histogram calculation and comparison, template matching, etc. In addition, machine learning algorithms (such as support vector machines) and deep neural networks (such as the GoogLeNet network) are included. Halcon [22] contains an image processing library used for blob analysis, morphology, matching, measuring, and identification. In addition, it provides 3D vision using shape-based and surface-based 3D matching, as well as deep learning algorithms based on CNNs. VisionPro [23] is a library with pattern matching, blob, caliper, line location, and image filtering algorithms. It also offers deep learning-based image analysis.

The most frequently used online defect detection algorithms in industrial AVI are reference-based approaches and rule-based approaches. Reference-based approaches consist primarily of image subtraction and template matching. These measure the difference between a sensed image of the object to be inspected and a predefined reference pattern [9]. Image subtraction performs a pixel-by-pixel subtraction of a sensed image and a reference ideal image; the defects of the object appear in the subtracted image. Image subtraction is simple and can be implemented directly. However, it is excessively sensitive to image variation and may cause many false positives. Template matching is a feature-level comparison of the extracted object features with predefined ideal templates, which are composed of feature patterns or models. The fundamental form of template matching is to move an image of the object to be detected across the template image and compute a similarity measure at each position [24]. The reference-based approach is intuitive, convenient for practical application, and reliable for detecting possible defects. However, it exhibits problems including inflexibility to variation and the need to store and maintain a large number of reference patterns.
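To make the two reference-based schemes concrete, the following Python/OpenCV sketch applies image subtraction and template matching to synthetic placeholder images; the threshold value, template location, and similarity measure are illustrative assumptions rather than recommended settings.

```python
import cv2
import numpy as np

# Placeholder images standing in for the acquired and reference images of a part.
reference = np.full((256, 256), 200, dtype=np.uint8)
cv2.circle(reference, (128, 128), 40, 60, -1)        # a dark feature that should be present
sensed = reference.copy()
cv2.line(sensed, (30, 30), (90, 60), 255, 3)         # a synthetic scratch-like defect

# --- Image subtraction: pixel-by-pixel comparison against the ideal reference ---
diff = cv2.absdiff(sensed, reference)
# Pixels whose difference exceeds an application-dependent threshold are flagged as defects.
_, defect_mask = cv2.threshold(diff, 30, 255, cv2.THRESH_BINARY)
print("defective:", cv2.countNonZero(defect_mask) > 0)

# --- Template matching: slide a feature template across the sensed image ---
template = reference[88:168, 88:168]                 # an 80x80 patch around the expected feature
scores = cv2.matchTemplate(sensed, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(scores)
print("best match score:", max_val, "at", max_loc)   # a low score suggests a missing or distorted pattern
```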

A rule-based approach involves the extraction of features from the sensed object and comparison of those features to a list of rules that describes an ideal model. It can circumvent the need for an extensive database of templates by examining the sensed object with respect to a list of design rules or against the features that can be extracted from design rules [6]. The rules can utilize attributes such as surface area, perimeter, ratio of perimeter to area, number of holes, area of holes, minimum enclosing bounding box area, maximum radius, and minimum radius [25]. For PCB, the design rules can be [26] (1) the minimum and maximum trace widths for all the traces used, (2) the minimum and maximum circular pad diameters, (3) the minimum and maximum hole diameters, (4) the minimum conductor clearance, and (5) the minimum annular rings and trace termination rules. The disadvantage of the rule-based approach is that it may omit the flaws that do not violate the rules [27] or may require complicated schemes to eliminate false alarms [28].
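As a hedged sketch of how such rule checks might look in practice, the OpenCV snippet below extracts blobs from a binary image of a sensed part and tests them against a small rule list; the attribute limits and the shape rule are hypothetical placeholders, not real design rules.

```python
import cv2
import numpy as np

def check_rules(binary_image, min_area=200.0, max_area=5000.0, max_holes=2):
    """Toy rule check: flag blobs whose attributes violate the (hypothetical) design rules."""
    contours, hierarchy = cv2.findContours(binary_image, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
    violations = []
    for i, contour in enumerate(contours):
        if hierarchy[0][i][3] != -1:      # skip inner contours (the holes themselves)
            continue
        area = cv2.contourArea(contour)
        perimeter = cv2.arcLength(contour, True)
        holes = sum(1 for h in hierarchy[0] if h[3] == i)   # child contours count as holes
        if not (min_area <= area <= max_area):
            violations.append((i, "area", area))
        if holes > max_holes:
            violations.append((i, "holes", holes))
        if area > 0 and perimeter ** 2 / area > 100.0:      # crude perimeter-to-area shape rule
            violations.append((i, "shape", perimeter ** 2 / area))
    return violations

# Example on a synthetic binary image standing in for a thresholded part image.
img = np.zeros((200, 200), dtype=np.uint8)
cv2.rectangle(img, (40, 40), (160, 160), 255, -1)   # one large blob, deliberately too big
print(check_rules(img))                             # reports an "area" violation
```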

In the early years, most of the industrial inspection systems utilized the template matching approach and rule-based comparison schemes [6]. However, these have been evolving into intelligent classifiers that have the capability to learn complex and subtle classification strategies [19]. With the advent of the state-of-the-art deep learning techniques, a number of new algorithms have also been developed for better surface defect detection. As these algorithms mature, they will eventually promote the development of industrial surface defect detection algorithms.

3.3 Evaluation metrics

The commonly used evaluation metrics include error escape rate, false alarm rate, accuracy, precision, recall, and F1-score. The escape rate and false alarm rate are frequently used alongside accuracy to evaluate the performance of defect detection algorithms. The error escape rate is the ratio of the number of defective samples detected as defect-free to the total number of defective samples, whereas the false alarm rate is the ratio of the number of defect-free samples detected as defective to the total number of defect-free samples.

$$ {\displaystyle \begin{aligned} Error\ Escape\ Rate &= FN/\left(TP+FN\right)\\ False\ Alarm\ Rate &= FP/\left(TN+FP\right)\\ Accuracy &= \left(TP+TN\right)/\left(TP+FN+TN+FP\right)\\ Precision &= TP/\left(TP+FP\right)\\ Recall &= TP/\left(TP+FN\right)\\ F1\text{-}Score &= \left(2\times Precision\times Recall\right)/\left(Precision+Recall\right)\end{aligned}} $$

where TP, FN, TN, and FP represent the numbers of true positives, false negatives, true negatives, and false positives, respectively, with a defective sample treated as a positive.
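As a minimal illustration, the following Python sketch computes these metrics from the four confusion-matrix counts; the numeric counts in the example call are made up for demonstration.

```python
def inspection_metrics(tp, fn, tn, fp):
    """Compute the evaluation metrics above, treating a defective sample as a positive."""
    escape_rate = fn / (tp + fn)          # defective samples missed by the inspector
    false_alarm_rate = fp / (tn + fp)     # defect-free samples wrongly flagged as defective
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_score = 2 * precision * recall / (precision + recall)
    return escape_rate, false_alarm_rate, accuracy, precision, recall, f1_score

# Illustrative counts: 95 defects caught, 5 missed, 890 good parts passed, 10 falsely rejected.
print(inspection_metrics(tp=95, fn=5, tn=890, fp=10))
```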

4 Traditional AVI algorithms

Traditional methods for defect detection proceed in two stages: feature extraction and defect identification. Features could be in the spatial domain, such as histogram, local binary pattern (LBP), and co-occurrence matrix, or in the transform domain, such as Fourier transform, wavelet transform, and Gabor transform [29]. Following feature extraction, defect identification can be performed by using common pattern classifiers such as SVM, K-nearest neighbor, random forest, and K-means. From the perspective of feature extraction and identification, surface defect detection can be categorized mainly into four general approaches: statistical methods, spectral methods, model-based methods, and learning-based methods [5]. The comparative studies are available in [3, 5, 9, 11,12,13, 15, 16]. Traditional defect inspection methods and their applications are illustrated in Table 2.

Table 2 Traditional defect inspection methods and their applications

4.1 Statistical methods

Statistical methods measure the spatial distribution of pixel values under the assumption that the statistics of defect-free regions are stationary [13]. Defects are detected using first-order statistics such as mean-, variance-, and histogram-based computations, in conjunction with second-order statistics based on the co-occurrence matrix [30]. Popular statistical methods include histogram properties, the co-occurrence matrix, mathematical morphology, and the local binary pattern (LBP). Commonly used histogram statistics include the range, mean, geometric mean, harmonic mean, standard deviation, variance, and median, as well as histogram comparison statistics such as the L1 norm, L2 norm, Bhattacharyya distance, and Matusita distance [11]. Histogram properties have been successfully used in real-world applications because they are convenient to implement and invariant to rotation and translation [31]. The spatial gray-level co-occurrence matrix (GLCM) introduced by Haralick [32] is widely used for texture defect detection. It describes the spatial distribution of texture by calculating the gray-level correlation between pairs of pixels. Commonly used GLCM features include contrast, correlation, energy, entropy, and uniformity. Mathematical morphology is based on lattice theory and topology. It includes operations such as erosion and dilation, opening and closing, skeleton extraction, ultimate erosion, the hit-or-miss transformation, the morphological gradient, the top-hat transformation, particle analysis, and the watershed transformation [3]. Mathematical morphology is highly suitable for defect detection on random or natural textures [16]. LBP, introduced by Ojala [33], considers the neighborhood of a pixel and compares the gray value of the center pixel with those of the other pixels in the neighborhood [34]. LBP is widely used in surface defect detection because it is robust to grayscale variations such as illumination changes [35].
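As a hedged sketch, the following Python snippet shows how GLCM statistics and a uniform LBP histogram might be extracted with scikit-image; the distances, angles, radius, and number of neighbors are illustrative choices rather than recommended settings.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops, local_binary_pattern

def statistical_features(gray_patch):
    """Extract illustrative GLCM and LBP features from an 8-bit grayscale patch."""
    # Gray-level co-occurrence matrix at distance 1 for four orientations.
    glcm = graycomatrix(gray_patch, distances=[1],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=256, symmetric=True, normed=True)
    glcm_feats = [graycoprops(glcm, prop).mean()
                  for prop in ("contrast", "correlation", "energy", "homogeneity")]

    # Uniform LBP with 8 neighbors at radius 1 yields codes 0-9, summarized as a histogram.
    lbp = local_binary_pattern(gray_patch, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=np.arange(11), density=True)
    return np.concatenate([glcm_feats, lbp_hist])

# Example on a random 64x64 patch standing in for a real surface image.
patch = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
print(statistical_features(patch).shape)   # (14,) -> 4 GLCM features + 10 LBP bins
```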

Several recent applications of statistical methods are available in [36,37,38,39]. Ashour et al. [36] presented a method based on gray-level co-occurrence matrix and discrete shearlet transform in 2018. Luo et al. [37] proposed a generalized completed local binary pattern framework with two variants for steel surface defect classification, in 2018. Li et al. [38] presented a fabric defect detection algorithm based on saliency histogram features, in 2019. Luo et al. [39] investigated the LBP method and proposed a selectively dominant LBP to quantitatively exploit the functional information from non-uniform patterns, in 2019.

4.2 Spectral methods

Spectral methods are also called filter-based methods. These methods transform the image from the spatial domain to the frequency domain by a mathematical transformation for feature extraction; examples are the Fourier transform, Gabor filters, and the wavelet transform, and there are numerous applications of these filter-based methods. The Fourier transform is an important frequency-based analysis method for defect detection. It provides global information by analyzing the frequency content of the signal over the whole image; however, it cannot analyze the local details of an image [35]. A Gabor filter is a form of short-time Fourier transform whose window function is a Gaussian. It is used extensively in texture defect detection because it can be customized with different scale and angle values for different texture structures [16]. The wavelet transform is based on multi-resolution signal decomposition theory. It offers localized information in the horizontal, vertical, and diagonal directions of an input image [15]. Li and Tsai [40] presented a wavelet-based defect detection method for solar wafer images with inhomogeneous texture, in 2012. Malek et al. [41] optimized automated online fabric inspection using the fast Fourier transform and cross-correlation, in 2013. Bissi et al. [42] adopted a Gabor filter for the automated defect detection of uniform and structured fabrics, in 2013. Hu et al. [43] presented automated defect detection in textured materials using wavelet-domain hidden Markov models, in 2014. Wen et al. [44] developed a new fabric defect detection method based on adaptive wavelets by designing appropriate wavelet bases for different fabric images, in 2014. Hu et al. [45] presented an unsupervised defect detection method for textiles based on Fourier analysis and wavelet shrinkage, in 2015. Bi et al. [46] presented a defect detection method for LCDs using Gabor filters, in 2015. Hu [47] presented an approach that addressed defect detection on textured surfaces by using an optimized elliptical Gabor filter, in 2015. Tong et al. [48] established a defect detection model using an optimized Gabor filter to address the woven fabric inspection problem in the textile industry, in 2016. Chol et al. [49] presented an algorithm for detecting pinholes in steel slabs by using a Gabor filter combination and morphological features, in 2017. Ma et al. [50] presented a surface defect detection method based on improved Gabor filters for scratch identification in industrial pipelines, in 2018.
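The sketch below illustrates one possible way to apply a small Gabor filter bank for texture anomaly highlighting using OpenCV; the kernel size, wavelength, sigma, and thresholding rule are illustrative assumptions, not tuned values.

```python
import cv2
import numpy as np

def gabor_defect_map(gray, num_orientations=4):
    """Filter a grayscale image with a small Gabor bank and threshold the strongest response."""
    responses = []
    for k in range(num_orientations):
        theta = k * np.pi / num_orientations
        kernel = cv2.getGaborKernel(ksize=(21, 21), sigma=4.0, theta=theta,
                                    lambd=10.0, gamma=0.5, psi=0)
        kernel /= np.abs(kernel).sum() + 1e-8   # rough normalization of each kernel
        responses.append(cv2.filter2D(gray.astype(np.float32), cv2.CV_32F, kernel))
    response = np.max(np.stack(responses), axis=0)   # strongest response over orientations
    # Pixels deviating strongly from the mean response are candidate defect pixels.
    threshold = response.mean() + 3.0 * response.std()
    return (response > threshold).astype(np.uint8) * 255

# Example on a synthetic textured patch standing in for a fabric or steel image.
texture = (np.random.rand(128, 128) * 255).astype(np.uint8)
defect_map = gabor_defect_map(texture)
```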

4.3 Model-based methods

Model-based methods construct representations of images by modeling multiple properties of defects [51]. The popular model-based methods are the Markov random field (MRF) [52] and auto-regressive model [53]. In MRF, two random fields named the labeling field and feature field are used to describe the image, and a distribution function is used to describe the distribution of feature vectors under the condition of the labeling field [3]. The application of MRFs for surface inspection can be traced to the 1990s [54]. Recently, Xu and Huang [55] developed a Gaussian Markov random field model for automatic pattern extraction and defect detection in nanomaterials, in 2012. An auto-regressive model describes the linear dependence between different pixels of an image by using linear equation systems, which incur less computational effort and cost than nonlinear systems [16]. Recently, Kulkarni et al. [56] presented an automatic surface defect detection algorithm using a two-dimensional auto-regressive model for fringe-projected-surface images, in 2019.

4.4 Learning-based methods

Learning-based methods are developed with machine learning and pattern recognition algorithms [9]. Popular pattern recognition algorithms such as the support vector machine (SVM) [57], artificial neural network (ANN), k-nearest neighbor (k-NN) [58], random forest [59], genetic algorithms, and clustering methods are applied frequently for defect classification. Among these, SVM is one of the most widely used classifiers for traditional surface defect detection.

4.4.1 SVM for surface inspection

Jia et al. [60] presented a real-time visual inspection system that used SVM to automatically learn complicated defect patterns for steel surface inspection, in 2004. Gao et al. [61] presented an algorithm for fabric defect detection based on dimensional histogram statistic and SVM, in 2006. Kang et al. [62] proposed an automated defect classification algorithm based on machine learning and the SVM classifier for TFT-LCD panel inspection, in 2009. Baly and Hajj [63] applied SVM for wafer classification and illustrated the selection of the values of SVM parameters, in 2012. Huang and Lu [64] proposed an automatic defect classification algorithm for TFT-LCD by using a linear SVM based on features including shape, histogram, and color, in 2013. Xie et al. [65] presented a defect detection and classification approach for PCBs and wafers, using SVM with a combination of median filter, background removal, morphological operation, and segmentation, in 2013. Zhang et al. [66] introduced an automated defect detection method for PCB, in 2018. In this method, detection was achieved by obtaining the defect region based on template matching, extracting the histogram and geometric features of the defect region, and using SVM classifier for recognition and classification.
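The snippet below is a minimal, hedged sketch of this classical two-stage pipeline (hand-crafted features followed by an SVM classifier) using scikit-learn; the feature vectors and labels are randomly generated placeholders, and the kernel and hyperparameters are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: 600 feature vectors (e.g., GLCM/LBP/Gabor features) with binary labels,
# where 0 = defect-free and 1 = defective. Real features would come from the extractors above.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 24))
y = rng.integers(0, 2, size=600)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# RBF-kernel SVM; in practice C and gamma would be selected by cross-validation.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```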

4.4.2 ANN for surface inspection

Kumar et al. [67] proposed an approach for segmenting local textile defects using a feed-forward neural network, in 2003. Herein, principal component analysis (PCA) for dimension reduction of feature vectors was applied. Kang and Liu [68] introduced a method for detecting local defects in cold rolled strips, in 2005. PCA using singular value decomposition was also employed to reduce the dimension of the extracted feature vector. The feed-forward neural network was then adopted to detect the defects in the steel strips. Yang et al. [69] recommended a hybrid defect recognition method for steel surface inspection, in 2007. They used neural networks for identification and morphology processing for noise filtering. Ashour et al. [70] proposed a supervised texture classification method based on the feed forward ANN and the multi-class SVM, in 2008. Chen et al. [71] adopted four neural networks, namely, backpropagation, radial basis function, and two learning vector quantization networks, for TFT-LCD defect identification, in 2009. Tseng et al. [72] proposed an automatic defect classification scheme for color-filter production through three stages, namely, defect extraction, feature description, and defect-type classification using a neural network decision tree classifier, in 2011.

4.4.3 Other learning-based algorithms

Other learning-based algorithms include random forest, clustering methods, and genetic algorithms. Several application examples are presented here. Kwon and Kang [73] proposed a defect detection algorithm based on random forest to determine the irregularity of the variety surface, in 2011. Tseng et al. [74] proposed an automatic detection method for multicrystalline solar cells, using binary clustering of features, in 2015. Hu et al. [75] recommended a hybrid chromosome genetic algorithm for surface defect classification of a large-scale strip steel image collection, in 2016. Tian and Xu [76] developed an algorithm for identifying surface defects in hot rolled steel plates, based on a genetic algorithm and an extreme learning machine, in 2017. Piao et al. [77] proposed a decision tree ensemble learning-based method for wafer map failure pattern recognition, in 2018.

4.5 Combination methods

Different studies divide defect detection methods into different categories. The categorization generally includes statistical, spectral, and model-based methods and occasionally also includes learning-based methods, structural methods, or other methods not described in this paper. Our literature survey reveals that, regardless of how these methods are classified, combinations of these methods can achieve the best performance.

Several representative applications of combined methods are presented here. Celik et al. [78] developed a system for fabric inspection that performs feature extraction based on the wavelet transform, double-thresholding binarization, and morphological operations, and defect classification using the gray-level co-occurrence matrix and a feed-forward neural network. Wang et al. [79] proposed an online diagnosis system based on clustering techniques to identify spatial defect patterns in semiconductor manufacturing. Specifically, a spatial filter was used to assess whether the input data contained systematic clusters and to extract them from the noisy input. Then, an integrated clustering scheme combining fuzzy C-means was adopted to separate the defect patterns. Furthermore, a decision tree was applied for decision-making. Nguyen et al. [80] proposed an automatic defect detection system for organic light-emitting diode (OLED) panels by combining three learning-based algorithms: SVM, random forest, and k-NN. Candidate features were designed, and feature selection using PCA and random forest was adopted. Then, a hierarchical structure of classifiers (SVM, random forest, k-NN) was applied for defect identification.

5 Deep learning-based AVI algorithms

5.1 Deep learning networks and defects database

5.1.1 Deep learning networks

In its early years (starting in 2006) [81], deep learning was applied to the MNIST digit image classification problem, breaking the supremacy of SVMs [82]. A further breakthrough was then achieved on the ImageNet dataset [83] in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Recently, Feng et al. [84] discussed how deep neural network algorithms accomplish computer vision tasks such as image classification, object detection, and image segmentation. The survey covered image classification networks including AlexNet, VGGNet, GoogLeNet, ResNet, and DenseNet, as well as object detection algorithms including Faster-RCNN, YOLO, and SSD.

As a deep convolutional neural network, AlexNet [85] consists of five convolutional layers, three max-pooling layers, and three fully connected layers, with 60 million parameters and 650,000 neurons. AlexNet is regarded as the foundation of modern deep CNNs [84]. VGGNet [86] is a significantly deeper CNN achieved by stacking convolutional layers and using an architecture with very small (3 × 3) convolution filters, pushing the depth to 16–19 weight layers. In VGGNet, a stack of convolutional layers is followed by three fully connected layers and a final soft-max layer. GoogLeNet [87] modifies the convolution layers by using the Inception module to extend the depth and width of the network. It has 22 layers. In the Inception module, 1 × 1 convolutions are used before the 3 × 3 and 5 × 5 convolutions to reduce the computation cost. Google Inception-v3 [88] saves computation cost further by factorizing convolutions into smaller or asymmetric convolutions. For example, the 5 × 5 convolution is decomposed into two 3 × 3 convolution operations, and a convolution of kernel size n × n is decomposed into two convolutions of sizes 1 × n and n × 1. Although increasing the depth of a network helps obtain higher accuracy, once the number of network layers increases beyond a certain extent, the training accuracy saturates and then declines rapidly. ResNet [89] proposes residual building blocks to address this degradation problem. This involves adding parameter-free identity shortcut connections to feed-forward neural networks. The residual building blocks with shortcut connections fully utilize features from previous layers to alleviate the degradation problem; therefore, the network performance can be improved by stacking more residual blocks, which enables ResNet to have up to 152 layers. To further strengthen feature reuse and propagation, DenseNet [90] connects each layer to every other layer in a feed-forward fashion. A DenseNet network with L layers has L × (L + 1) / 2 direct connections. Owing to this dense connection structure, DenseNets can scale up to hundreds of layers without optimization challenges.
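To make the residual idea concrete, the following PyTorch sketch shows a minimal residual building block with a parameter-free identity shortcut; the channel count and input size are illustrative.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: two 3x3 convolutions plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)    # parameter-free identity shortcut connection

x = torch.randn(1, 64, 56, 56)       # e.g., a feature map from an earlier layer
print(ResidualBlock(64)(x).shape)    # torch.Size([1, 64, 56, 56])
```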

Apart from the aforementioned deeper and larger CNNs, a set of lightweight CNNs has been developed to reduce computation complexity while maintaining high accuracy. They are suitable for mobile or real-time applications that have limited computation resources or high computation speed requirements, such as the online AVI applications discussed in this paper.

The classical lightweight neural networks include SqueezeNet, MobileNet, and ShuffleNet. SqueezeNet [91] is a deep CNN that uses the fire module, which comprises a squeeze layer and an expand layer. The squeeze layer decreases the number of input channels to the expand layer and thereby reduces the number of parameters; furthermore, the majority of the 3 × 3 filters are replaced by 1 × 1 filters to reduce the number of parameters further. SqueezeNet achieves an accuracy level similar to that of AlexNet on ImageNet with 50× fewer parameters. MobileNet [92] is also a lightweight neural network adapted for mobile and embedded vision applications with high accuracy. It utilizes depthwise separable convolution [93], factorizing a standard convolution into a depthwise convolution and a pointwise convolution to reduce computation and model size substantially. The depthwise convolution applies a filter to each input channel; the pointwise convolution then applies a 1 × 1 convolution to combine the outputs of the depthwise convolution. The cost of a 3 × 3 depthwise separable convolution is 3 × 3 × M × D × D + M × N × D × D, whereas the cost of a 3 × 3 standard convolution is 3 × 3 × M × N × D × D, where M is the number of input channels, N is the number of output channels, and D × D is the size of the output feature map. Therefore, compared with the standard convolution, the 3 × 3 depthwise separable convolution reduces the amount of computation by a factor of 8 to 9 at only a small reduction in accuracy [92]. MobileNet-v2 [94] introduces a novel inverted residual layer to decrease the number of operations further. ShuffleNet [95] is also a computationally efficient CNN model designed for mobile devices. It has two novel operations: pointwise group convolution and channel shuffle. Pointwise group convolution is used to reduce computation complexity, whereas channel shuffle aids the information flow across feature maps. ShuffleNet-V2 [96] introduces an operation called channel split to further improve the performance of the first version.
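The short PyTorch sketch below contrasts a standard 3 × 3 convolution with a depthwise separable one to illustrate the cost reduction described above; the channel counts are arbitrary illustrative values.

```python
import torch.nn as nn

M, N = 64, 128   # illustrative numbers of input and output channels

# Standard 3x3 convolution: ~3*3*M*N multiply-adds per output pixel.
standard = nn.Conv2d(M, N, kernel_size=3, padding=1, bias=False)

# Depthwise separable convolution: a per-channel 3x3 depthwise convolution (groups=M)
# followed by a 1x1 pointwise convolution; ~(3*3*M + M*N) multiply-adds per output pixel.
depthwise_separable = nn.Sequential(
    nn.Conv2d(M, M, kernel_size=3, padding=1, groups=M, bias=False),  # depthwise
    nn.Conv2d(M, N, kernel_size=1, bias=False),                       # pointwise
)

def count_params(module):
    return sum(p.numel() for p in module.parameters())

# Prints 73728 vs 8768, roughly the 8-9x reduction noted in the text.
print(count_params(standard), count_params(depthwise_separable))
```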

An object detection task is occasionally a part of a defect detection process; it is aimed at identifying the location of the object of interest. The most popular deep learning-based object detection algorithms are Faster-RCNN, YOLO, and SSD. Faster-RCNN [97] introduces a region proposal network (RPN), which is a fully convolutional network for proposal generation. It integrates the RPN and Fast-RCNN [98] to share convolutional features and achieve high object detection accuracy. However, Faster-RCNN still has a low detection speed, because it is a two-stage method that detects objects through region proposal followed by region classification. YOLO [99] and SSD [100] are one-stage object detection methods that detect objects using regression. In YOLO, a neural network predicts bounding boxes and class probabilities directly from full images in a single evaluation. Although YOLO can achieve real-time speed, it is less accurate than the two-stage Faster-RCNN. The single shot multibox detector (SSD) outperforms YOLO in accuracy owing to two major improvements. First, SSD extracts important features from multi-scale CNN feature maps. Second, it adopts a number of default bounding boxes, following the anchor concept proposed by Faster-RCNN [84].

Another set of deep learning algorithms suitable for surface defect inspection comprises unsupervised and semi-supervised learning methods. The representative methods are the auto-encoder and the generative adversarial network (GAN). The auto-encoder is a typical unsupervised learning algorithm based on two neural networks called the encoder and the decoder. It was introduced by Rumelhart et al. [101] in 1986 and extended to the deep auto-encoder by Hinton et al. [102] in 2006. To achieve higher robustness than the deep auto-encoder, the denoising auto-encoder [103] (introduced in 2008) adopts an approach that combines corruption and denoising to make the learned representations robust to partial corruption of the input pattern. The denoising auto-encoder is one of the common options for surface defect detection when unsupervised deep learning algorithms are considered. GAN is an unsupervised learning framework introduced by Goodfellow et al. [104] in 2014 that has since been developed further [105]. It contains a generative model G and a discriminative model D. G captures the data distribution with the aim of maximizing the probability of D committing an error, while D estimates the probability that a sample originated from the training data rather than the generative model. The framework corresponds to a minimax two-player game. In addition, GAN can be extended for semi-supervised learning [106], which combines supervised learning and unsupervised learning in one framework. The above-mentioned deep learning networks suitable for defect detection are illustrated in Table 3.

Table 3 Deep learning networks suitable for defect detection
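For illustration, the PyTorch sketch below outlines a tiny convolutional denoising auto-encoder of the kind commonly used for unsupervised surface defect detection: it is trained to reconstruct defect-free patches from corrupted inputs, and at inference time large reconstruction errors indicate defects. The architecture, patch size, and noise level are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenoisingAutoEncoder(nn.Module):
    """Tiny convolutional denoising auto-encoder for single-channel surface patches."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=2, stride=2), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DenoisingAutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

clean = torch.rand(8, 1, 64, 64)                              # placeholder defect-free patches
noisy = (clean + 0.2 * torch.randn_like(clean)).clamp(0, 1)   # corrupted inputs
loss = criterion(model(noisy), clean)   # learn to reconstruct the clean patch from the noisy one
loss.backward()
optimizer.step()
# At inference, the per-pixel |input - reconstruction| error serves as a defect (anomaly) map.
```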

5.1.2 Defects database

We have conducted a survey of publicly available datasets containing surface image samples of steel, textile, and semiconductor products. Information on a few datasets is presented in Table 4. This database information is provided to help researchers from the AVI and deep learning communities initiate further innovation and applications of deep learning for solving traditional AVI problems.

Table 4 Publicly available surface image datasets

The DAGM texture database [107] was provided by the open competition “Weakly supervised learning for industrial optical inspection” organized by DAGM (German chapter of the International Association for Pattern Recognition) and the GNSS (German Chapter of the European Neural Network Society). The DAGM dataset consists of six types of artificially generated texture images. Each type has 1000 non-defective images and 150 defective images with a labeled defect on the background texture.

WM-811K [108] is a large publicly accessible dataset of wafer maps, containing 811,457 real-world wafer maps, among which 696,599 are unique. Approximately 20% of the wafer maps are labeled with one of nine types (54,356 in the training set and 118,595 in the test set), comprising eight defective types (Center, Donut, Edge-local, Edge-ring, Local, Near-full, Random, and Scratch) and a normal type.

The Northeastern University (NEU) surface defect database [109] contains six types of typical surface defects observed on hot-rolled steel strips: rolled-in scale (RS), patches (Pa), crazing (Cr), pitted surface (PS), inclusion (In), and scratches (Sc). The database contains 1800 grayscale images, with 300 samples of each defect type.

The DeepPCB dataset [110] contains 1500 image pairs of PCBs. Each of these consists of a defect-free template image and an aligned tested image with annotations including positions. Six common types of PCB defects are provided: open, short, mousebite, spur, pin hole, and spurious copper.

The magnetic tile defect dataset [111] contains 2688 defect images covering six common magnetic tile defect types, with pixel-level ground truth labeled. The solar cell dataset [112] contains 2624 samples of 300 × 300 pixel 8-bit grayscale images of functional and defective solar cells with varying degrees of degradation, extracted from 44 solar modules.

RSDDs [113] contain images of two types of rail surface defects. One type comprises images of express rails (67 images), whereas the other type comprises images of common/heavy haul rails (128 images). Every image contains at least one defect and has a complex background with substantial noise.

TILDA [114] is a benchmark database for textile defect detection. It contains 3200 images of eight representative textile types. Each textile is classified into seven defective types and a defect-free type, and each type consists of 50 images (768 × 512 pixel, 8-bit, gray level image).

In addition to the above-mentioned defect datasets, a few datasets are available that contain fabric or texture images without defects. For example, Fabrics Dataset [116] consists of approximately 2000 images of garment and fabric samples. The Kylberg texture dataset [115] contains 28 texture classes with 160 unique texture patches per class. The texture patch size is 576 × 576 pixels, and all the patches are normalized. The KTH-TIPS database [117] presently contains images of ten types of texture materials: sandpaper, crumpled aluminum foil, Styrofoam, sponge, corduroy, linen, cotton, brown bread, orange peel, and cracker B. Although there are no defect images in these datasets, they can be used for image classification or defect detection by synthesizing defects on them.

5.2 Research status of deep learning-based AVI algorithms and applications

The advent of the aforementioned deep learning techniques has inspired a number of novel deep learning-based defect detection algorithms. These algorithms integrate the two phases of traditional detection methods, i.e., feature extraction and defect identification, into a single phase: they extract features and classify defects simultaneously by learning from the training samples. They do not require a set of hand-designed features, such as the statistical or spectral features used in traditional methods. Without an expert-designed feature set, deep learning-based detection algorithms can automatically generate discriminative features from the training set and enable users to circumvent the manual identification of rules for feature extraction or classification. Furthermore, they are generally capable of achieving higher detection accuracy [118].

A literature survey indicates that most deep learning-based surface defect detection approaches employ deep CNN-based supervised learning for defect recognition. The CNN is the most popular and widely used group of deep learning algorithms because of its wide application potential in pattern recognition [119]. It is a deep neural network architecture specialized for image processing and pattern recognition, whose hierarchical structure enables the extraction of multilevel image features for accurate pattern identification [120]. A CNN consists of three types of layers: convolutional layers, pooling layers, and fully connected layers. The convolutional layer learns a feature representation of the input and outputs a feature map. The pooling layer is used for dimensionality reduction of the feature map. The fully connected layer maps the input data to a feature vector for final classification. As the CNN exhibits a unique feature-learning capability, learning features from image samples automatically and with strong reliability, it is generally the preferred option for surface quality inspection using deep learning.
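As a hedged illustration of how these three layer types are typically stacked for defect classification, the following PyTorch sketch defines a small CNN for grayscale surface patches; the patch size, channel counts, and number of defect classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimpleDefectCNN(nn.Module):
    """Small CNN classifying 64x64 grayscale surface patches into defect categories."""
    def __init__(self, num_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, num_classes),     # one logit per defect category
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = SimpleDefectCNN()(torch.randn(4, 1, 64, 64))   # a batch of 4 placeholder patches
print(logits.shape)   # torch.Size([4, 6])
```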

The deep CNN-based approaches and applications of surface defect detection in the semiconductor industry are described in Table 5. The deep CNN-based approaches and applications for surface defect inspection of fabrics are illustrated in Table 6. The deep CNN-based approaches and applications for surface inspection of steel and other products are presented in Table 7. Semi-supervised learning methods and unsupervised learning methods are demonstrated in Table 8. These employ auto-encoder-based methods, Faster-RCNN, YOLO, SSD, and GAN for unsupervised or semi-supervised learning of surface defects.

Table 5 Deep CNN-based supervised learning approaches for surface inspection in semiconductor industry
Table 6 Deep CNN-based supervised learning approaches for fabric inspection
Table 7 Deep CNN-based supervised learning approaches for inspecting steel and other products
Table 8 Semi-supervised learning and unsupervised learning-based approaches for surface defect detection

5.2.1 CNN-based supervised learning methods for semiconductors

Yang et al. [121] presented an online detection method for Mura defects by combining a deep convolutional feature extractor and a sequential extreme learning machine classifier. It is capable of learning and recognizing a Mura defect image within 1.5 ms. Kim et al. [122] proposed a CNN network for surface mount technology (SMT) defect detection by modifying AlexNet and adopting the ResNet structure. Additional input image transformation was conducted by histogram stretching and chip region extraction to improve the detection accuracy. Kim et al. [123] proposed a CNN-based defect image classification model based on residual networks for the through-silicon via (TSV) process and achieved a classification accuracy of up to 97.2%. Jang et al. [124] proposed a defect inspection method using a deep CNN and defect probability images obtained from traditional inspection techniques; it outperforms a conventional CNN model using RGB or grayscale images. Zhang et al. [125] proposed a multi-task CNN model to handle the multi-label PCB classification problem by defining each label learning as a binary classification task. They achieved good performance on the PCB defect dataset. Deng et al. [126] proposed an automatic defect verification system combining fast circuit comparison and deep CNN-based defect classification to decrease the false alarm rate of AVI in the PCB industry. Ghosh et al. [127] proposed a transfer learning-based method to classify PCB defects without reference images or the need to locate the defects in the images. An adaptation network was trained by extracting mid-level representations of PCB images from an intermediate layer of a pre-trained Inception-v3 network. Wei et al. [128] studied the extraction of defect areas using morphology and a deep CNN for PCB defect classification. They achieved significantly better results than a traditional classification algorithm based on digital image processing, on a dataset containing 1818 images. Nakazawa et al. [129] presented a CNN-based wafer map defect pattern classification method for synthetic wafer maps containing 22 defect classes generated theoretically. They achieved an overall classification accuracy of 98.2%. Yuan-Fu [130] employed CNN and extreme gradient boosting for wafer map retrieval tasks and defect pattern classification. They observed CNN to be more applicable for wafer map image classification because it is capable of learning relevant features from the input image. Ishida et al. [131] proposed a deep CNN network based on VGG to recognize wafer map failure patterns. A data augmentation technique with noise reduction was used for data processing. Experimental results on a benchmark dataset demonstrated the high accuracy of the method. Cheon et al. [132] proposed a wafer surface defect detection method combining CNN and k-NN. It can extract effective features for defect classification without additional feature extraction algorithms and achieves high classification performance on wafer surface defects. Banda et al. [133] used deep learning to identify defective photovoltaic cells automatically, based on CNNs including the LeNet, CifarCNN, and GoogLeNet architectures. The method successfully distinguished between defective and normal photovoltaic cells. Deitsch [112] investigated two approaches for automatic defect detection in solar photovoltaic cells: an approach based on hand-crafted features classified by an SVM and an end-to-end deep CNN approach. Experiments revealed the CNN-based approach to be more accurate than the SVM-based approach. Lin et al. [134] proposed an application of CNN to LED chip defect inspection. In the CNN, a class activation mapping technique was introduced to localize defect regions exactly. They achieved an accuracy of 94.96% for LED chip defect inspection.

5.2.2 CNN-based supervised learning methods for fabric

Park et al. [135] proposed a new surface defect inspection method for the automatic visual inspection of dirt, scratches, burrs, and wear on part surfaces. CNNs with different depths and layer nodes were tested to select an adequate structure for defect inspection. Weimer et al. [136] proposed a CNN method for texture surface defect recognition. They utilized 70% of 1,299,200 samples obtained after data augmentation for training and achieved a classification accuracy of 99.2%. Wang et al. [137] proposed a deep CNN for defect detection that requires little prior knowledge of the images and is robust to noise. They achieved fast detection as well as high accuracy on a benchmark database. Jeyaraj et al. [138] proposed a multi-scaling CNN algorithm for fabric defect detection. They achieved an average accuracy of 96.55% on six different fabric materials, which is higher than that of conventional fabric defect detection methods. Gao et al. [8] investigated the problem of woven fabric defect detection using a CNN with multiple convolution and max-pooling layers. They obtained an overall detection accuracy of 96.52%. Furthermore, the authors constructed a high-quality database that includes images of common defects in solid-color woven fabric. Li et al. [118] proposed a compact CNN architecture with a multilayer perceptron for detecting several common fabric defects. In addition, multi-scale analysis, filter factorization, multiple-location pooling, and parameter reduction were used to improve the detection accuracy.

5.2.3 CNN-based supervised learning methods for steel

Saiz et al. [139] proposed an automatic defect classification method for steel surfaces with two independent stages: preprocessing and CNN-based classification. They achieved a classification rate of 99.95%, outperforming other traditional detection methods on a publicly available dataset. Chen et al. [140] proposed an ensemble approach that integrates three deep CNNs for steel surface defect recognition: ResNet-32 and the wide residual networks (WRNs) WRN-28-10 and WRN-28-20. Liu et al. [141] proposed a new neural network utilizing the Google Inception architecture and a residual structure for steel defect detection. They achieved an accuracy of over 99.47%. Vannocci et al. [142] proposed an application of CNNs to classifying steel strip images and compared it with classical machine learning approaches, thereby establishing the effectiveness and general validity of deep learning. Song et al. [143] developed a deep CNN-based detection method for micro defects on metal screw surfaces. A comparison with traditional template matching-based techniques and LeNet-5 demonstrated the superiority of the proposed deep CNN-based method. Chun and Zhao [144] combined a CNN and an SVM to inspect industrial products more effectively; here, the CNN was used for feature extraction and the SVM for decision-making. Soukup et al. [145] trained classical deep CNNs on a database of photometric stereo images of rail surfaces in a purely supervised manner. They achieved significantly higher performance than a traditional model-based approach. Ren et al. [51] presented a deep learning-based approach requiring only a small amount of training data for automated surface inspection, evaluated on three public datasets and one industrial dataset. It was realized by extracting patch features using a deep CNN, generating a defect heat map based on the patch features, and predicting the defect area by thresholding and segmenting the heat map.

5.2.4 Semi-supervised learning and unsupervised learning methods

Li et al. [29] proposed a Fisher criterion-based stacked denoising auto-encoder with the objective of learning more discriminative features for patterned fabric defect detection when only limited defective samples are available. Mei et al. [146] proposed an unsupervised learning-based automated approach using a multi-scale convolutional denoising auto-encoder network and a Gaussian pyramid to detect and localize fabric defects. They achieved good overall performance. Mujeeb et al. [147] proposed an unsupervised learning algorithm to detect surface-level defects by training a deep auto-encoder network on input reference images. During training, various copies were generated automatically through data augmentation. The unsupervised algorithm does not rely on the availability of defect samples for training. Siegmund et al. [148] presented a comprehensive defect detection method for two common groups of fabric defects. The proposed method employed VGG and ResNet for feature extraction; then, a region proposal network (RPN) and Faster-RCNN were used to generate region proposals and detect objects. Li et al. [149] provided an end-to-end solution for surface defect detection in steel strips. They improved YOLO by making it completely convolutional and capable of simultaneously predicting the class, location, and size information of defect regions. Li et al. [150] proposed a surface defect detection method that adopts an SSD network combined with MobileNet to identify the types and locations of surface defects. The method can detect surface defects more accurately and rapidly than traditional machine learning methods. Yang et al. [151] proposed a real-time defect detection algorithm for tiny parts, based on an SSD and the speed model. The detection accuracies for defect types 1, 2, 3, and 4 are 98.00%, 99.00%, 97.80%, and 79.40%, respectively. Di et al. [152] proposed a semi-supervised learning method based on a convolutional auto-encoder and a semi-supervised GAN to classify surface defects on steel, which yielded remarkable performance. Gao et al. [153] proposed a semi-supervised deep learning method called PLCNN for steel surface defect recognition. PLCNN is a convolutional neural network improved by pseudo-labeling, so that unlabeled data can be used in the training process. A comparative analysis with other conventional methods demonstrated that the proposed method achieves a significant improvement with the help of the unlabeled samples.

6 Discussions

6.1 Analysis of the deep learning-based defect detection algorithms

There are three paradigms for deep learning-based defect detection: supervised learning, unsupervised learning, and semi-supervised learning. Supervised learning is the most widely used; it can achieve high detection accuracy and is reliable for online industrial applications given sufficient training data. This is illustrated in Table 5, Table 6, and Table 7. We can conclude from the three tables that supervised learning-based defect detection methods generally utilize convolutional neural networks, adopting one of three approaches: transfer learning, constructing a CNN based on classical network structures such as ResNet, or constructing a CNN from scratch by stacking convolutional layers, pooling layers, and fully connected layers.

The transfer learning approach transfers a pre-trained model, which has already been trained on a large dataset, to the target detection problem, which generally has only a small training dataset. Transfer learning relaxes the hypothesis that the training data must be independent and identically distributed with the test data [154]. This enables a deep learning model to utilize the knowledge (including the model structure and pre-trained parameters) of a model trained in another field. The common transfer learning recipe is to leverage a pre-trained network and alter the final layers to fine-tune the weight parameters on the target dataset [155]. For example, Ghosh et al. [127] proposed a transfer learning-based method to classify PCB defects by utilizing a pre-trained Inception-v3 network.
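The PyTorch/torchvision sketch below illustrates this common fine-tuning recipe under stated assumptions: an ImageNet pre-trained ResNet-18 backbone is chosen purely as an example, the pre-trained layers are frozen at first, and the final fully connected layer is replaced to match six hypothetical defect classes (e.g., the six NEU categories).

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_DEFECT_CLASSES = 6   # illustrative; e.g., the six NEU steel defect categories

# Load an ImageNet pre-trained backbone (ResNet-18 chosen only as an example).
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pre-trained layers so that, initially, only the new head is trained.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final fully connected layer to match the target defect classes.
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_DEFECT_CLASSES)

# Only the new head's parameters are optimized in this sketch; in practice the whole
# network is often unfrozen later for end-to-end fine-tuning on the defect dataset.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
```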

The second approach is to utilize the classical network structures such as AlexNet, VGGNet, Google Inception networks, and ResNet and perform a few modifications to make them adaptive for solving the target detection problems. This is the common way of utilizing CNN for defect detection. The aforementioned papers including Yang et al. [121], Kim et al. [122], Kim et al. [123], Ishida et al. [131], Banda et al. [133], Saiz et al. [139], Chen et al. [140], Liu et al. [141], Vannocci et al. [142], and Chun and Zhao [144] can be assigned into this category.

The third approach is to propose a novel CNN structure with a customized depth and width by stacking the three types of layers (convolutional, pooling, and fully connected) in different ways. The convolutional layers learn feature representations of the input and output feature maps. The pooling layers reduce the dimensionality of the feature maps. The fully connected layers map the resulting features to a vector for final classification. The most suitable depth and width of a proposed CNN can be determined through comparative experiments. The aforementioned papers by Zhang et al. [125], Deng et al. [126], Wei et al. [128], Nakazawa et al. [129], Cheon et al. [132], Park et al. [135], Wang et al. [137], Jeyaraj et al. [138], Gao et al. [8], Li et al. [118], Song et al. [143], and Soukup et al. [145] fall into this category.
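A minimal from-scratch sketch of such a network is given below; the layer counts, channel widths, and six-class output are arbitrary illustrative choices rather than those of any cited paper, and PyTorch is assumed.

```python
import torch
import torch.nn as nn

class SimpleDefectCNN(nn.Module):
    """A small CNN built by stacking convolutional, pooling, and FC layers."""

    def __init__(self, num_classes: int = 6, in_channels: int = 1):
        super().__init__()
        # Convolutional layers learn feature representations of the input.
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                 # pooling reduces spatial size
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        # Fully connected layers map the feature maps to class scores.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = SimpleDefectCNN(num_classes=6)
print(model(torch.randn(2, 1, 128, 128)).shape)  # torch.Size([2, 6])
```

The depth and width of such a network would then be tuned empirically, as noted above.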

Another conclusion from Tables 5–7 is that some of these CNN-based supervised learning methods incorporate other image processing or pattern recognition techniques. For example, Yang et al. [121] combined a CNN with a sequential extreme learning machine for Mura defect detection. Cheon et al. [132] proposed a wafer defect detection method that combines a CNN with k-NN. Li et al. [118] proposed a compact CNN architecture with a multilayer perceptron for fabric defect detection. Combining different methods in this manner is a feasible and occasionally effective way to achieve the detection objective.
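A hedged sketch of this kind of hybrid is shown below. It is loosely inspired by the CNN + k-NN combination of [132] but does not reproduce it: a small CNN (which in practice would be trained first) is truncated and used as a fixed feature extractor, and the resulting feature vectors are classified with scikit-learn's KNeighborsClassifier; all sizes and class counts are illustrative.

```python
import torch
import torch.nn as nn
from sklearn.neighbors import KNeighborsClassifier

class FeatureExtractor(nn.Module):
    """A CNN used as a feature extractor by omitting the classification head."""

    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),        # one value per channel
            nn.Flatten(),                   # -> 32-dimensional feature vector
        )

    def forward(self, x):
        return self.body(x)

extractor = FeatureExtractor().eval()

# Dummy labeled training images and unlabeled test images (illustrative only).
train_images = torch.randn(100, 1, 64, 64)
train_labels = torch.randint(0, 4, (100,))
test_images = torch.randn(10, 1, 64, 64)

with torch.no_grad():
    train_feats = extractor(train_images).numpy()
    test_feats = extractor(test_images).numpy()

# k-NN performs the final defect classification on the CNN features.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_feats, train_labels.numpy())
print(knn.predict(test_feats))
```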

Moreover, we plotted a graph that relates the total number of images in the datasets used by the CNN-based methods listed in Tables 5, 6, and 7 to the accuracy of these methods. There are 29 application examples in the three tables. We omit four of them: [122], which does not report image information; [123, 131], which use the largest numbers of images; and [51], which uses three datasets. The data for the remaining 25 application examples are analyzed and illustrated in Fig. 2. As shown in Fig. 2, most of these methods use approximately 5000 image samples for training and testing, and they achieve high defect detection accuracy, with an average of 96.82%. Therefore, for defect detection, CNN-based methods generally require only thousands of original image samples to obtain high detection accuracy; their demand for training data is not as high as might be anticipated. Furthermore, as shown in Fig. 2, the trend of the accuracy curve does not follow that of the image-number curve, which indicates that a larger number of training images does not guarantee higher accuracy. This contradicts the usual belief that more training data improve model performance and alleviate overfitting, a highly common problem when a large deep CNN model is applied to a small dataset. We may explain this as follows. Obtaining sufficient training data is an important means of enabling a model to generalize well from the training set to the test set and avoid overfitting; alternatives are regularization techniques or modifications to the network architecture. In addition to improving the generalization capability, we also need to minimize the gap between the training error and human-level error by using larger neural networks, adopting appropriate hyperparameters, or trying better optimization algorithms during training. Once both good generalization capability and high training accuracy are obtained, the accuracy on a specific defect detection problem depends on the difficulty of the detection problem itself. For instance, if it is difficult for humans to recognize tiny defects or to distinguish defects with similar features, then supervised learning-based methods face the same dilemma. This may explain why the PCB and solar cell defects in the following methods are not easy to recognize: [125], whose multi-category accuracy is 89.89%; [127], whose accuracy is 91.125%; and [112], whose accuracy is 88.42%. Samples of these defects are shown in Fig. 3.

Fig. 2 Number of images and accuracy of the CNN-based methods listed in Tables 5, 6, and 7

Fig. 3 Samples of PCB defects in [125] (left) and [127] (middle) and solar cell defects in [112] (right)

As shown in Fig. 3, it is not easy even for humans to clearly identify and classify these defects, let alone supervised learning methods. In such cases, it is particularly important to set up a suitable lighting system that achieves good contrast between the foreground and background, so that the AVI system can capture these defective images clearly without introducing additional noise. Any noise that is introduced can further confuse supervised learning methods during defect recognition. Furthermore, it is also important to label these defects correctly; otherwise, supervised learning methods become more prone to incorrect judgments. In contrast, as illustrated in Fig. 4, the defects in the NEU dataset are larger and more obvious, and the features of the various defect types differ significantly; they are therefore easier for humans to recognize, and the same holds for supervised learning methods. As shown in Table 7, [139, 140] achieved much higher accuracies of up to 99.95% and 99.889%, respectively.

Fig. 4 Samples of defects in the NEU dataset [140]

Compared to the above CNN-based supervised learning approaches, far fewer studies on unsupervised or semi-supervised learning-based defect detection have been conducted. The most frequently used unsupervised learning frameworks are the auto-encoder and the GAN. As presented in Table 8, Li et al. [29] and Mei et al. [146] proposed denoising auto-encoders for fabric defect detection, and Mujeeb et al. [147] utilized a deep auto-encoder network for surface defect detection. However, unsupervised learning is less reliable than supervised learning and therefore has few online industrial AVI applications. Semi-supervised learning provides an alternative when insufficient labeled data are available: it can achieve precision similar to that of supervised learning while using fewer labeled samples. However, the state-of-the-art semi-supervised learning technology developed by the deep learning community has rarely been employed for defect detection.

6.2 Challenges and solutions

As demonstrated in the previous section, most deep learning-based surface defect detection approaches employ deep CNN-based supervised learning for defect recognition, and they are typically implemented in three ways. The first is transfer learning, which utilizes the knowledge (including the model structure and pre-trained parameters) of a model trained in another field and fine-tunes it on the target dataset in order to reduce the amount of training data or the training time. The second is to adopt classical convolutional neural network structures, such as Inception-v3 and ResNet, and modify them to a certain extent to make them suitable for the target defect detection problem. The third is to construct a convolutional neural network from scratch by stacking convolutional, pooling, and fully connected layers and to train it to achieve the desired accuracy. However, these methods mainly focus on the accuracy of defect recognition and classification and pay less attention to achieving high efficiency and low computational cost. To improve detection accuracy, they generally tend to deepen or widen the network, which consumes considerable computing time and requires high-performance computing resources. They are therefore less able to meet the millisecond-level real-time detection requirements of industrial AVI applications, which limits their use in industrial fields. Hence, how to build a deep CNN-based defect detection model that meets both high-precision and real-time requirements is a challenge for deep learning-based AVI applications.

A probable solution is to directly use lightweight networks such as SqueezeNet [91], MobileNet [92], and ShuffleNet [95] as the backbone networks for defect detection, because they are tailored for mobile applications and aim to balance computational cost against accuracy. The details of these networks have been described in the previous section. An alternative solution is to utilize efficient convolution algorithms, such as the depthwise separable convolution [93] applied in MobileNet and the fire module introduced in SqueezeNet. When the goal is to reduce the computational cost of convolution, depthwise separable convolution is usually the first choice, because a 3×3 depthwise separable convolution requires roughly 8 to 9 times less computation than a standard convolution at only a small reduction in accuracy. It is realized by decomposing the standard convolution into a depthwise convolution (each input channel is convolved with its own filter) and a pointwise convolution (a 1×1 convolution that combines the outputs of the depthwise convolution). The fire module consists of a squeeze convolution layer (containing only 1×1 filters) feeding into an expand layer that has a mix of 1×1 and 3×3 convolution filters. The number of filters in the squeeze layer is set to be smaller than the number in the expand layer, so the squeeze layer limits the number of input channels seen by the 3×3 filters, thereby reducing the amount of computation [91]. These convolution algorithms can greatly help achieve high detection speed while maintaining high detection accuracy. For instance, we have proposed a lightweight deep convolutional neural network based on depthwise separable convolution and a squeeze-and-expand mechanism to detect surface defects in copper clad laminate (CCL) images obtained from an industrial CCL production line; high computation speed was achieved while maintaining good detection accuracy [156]. In general, by developing more lightweight networks or more efficient convolution algorithms, we can strike a balance between computational cost and accuracy and ultimately realize rapid and accurate defect detection in industrial online applications.
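The two building blocks can be sketched in PyTorch as follows; channel counts are illustrative, and the code is a schematic rendering of the ideas rather than the exact MobileNet or SqueezeNet implementation.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # Depthwise: groups=in_ch gives one 3x3 filter per input channel.
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution combines the depthwise outputs.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class FireModule(nn.Module):
    """Squeeze (1x1) layer feeding an expand layer of mixed 1x1/3x3 filters."""

    def __init__(self, in_ch: int, squeeze_ch: int, expand_ch: int):
        super().__init__()
        # The squeeze layer has fewer filters than the expand layer, limiting
        # the number of input channels seen by the 3x3 filters.
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, 1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, 1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.act(self.squeeze(x))
        return torch.cat([self.act(self.expand1x1(s)),
                          self.act(self.expand3x3(s))], dim=1)

x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 56, 56])
print(FireModule(32, 16, 64)(x).shape)          # torch.Size([1, 128, 56, 56])
```

Stacking such blocks in place of standard 3×3 convolutions is what gives these lightweight networks their reduced computational cost.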

Another challenge faced by deep learning-based AVI applications is that deep neural networks usually require a large amount of labeled data as training samples, whereas the preparation of labeled data incurs significant labor and time costs. Moreover, it is occasionally highly challenging or infeasible to label or collect sufficient training data. In addition, high-speed industrial production lines often produce defects that have never appeared before; because such new defects are not included in the training samples, this also impedes the application of deep learning in industrial AVI. Therefore, how to use deep learning for defect detection when a large amount of labeled data cannot be provided remains a challenge.

Data augmentation technology can alleviate the problem of insufficient training samples to some extent. It preprocesses the original images by performing image transformations (such as flipping, random cropping, re-scaling, and color shifting) to expand the original dataset. The transformed image samples are added to the original dataset to form an expanded dataset, which is fed to the network for training. Data augmentation can also be performed automatically during training [147]. However, it cannot completely address the problem of insufficient data. Unsupervised learning can address the deficiency of training data, but it is less reliable than supervised learning and is thereby infeasible for online industrial AVI applications. An alternative solution is the semi-supervised learning paradigm, which can achieve precision similar to that of supervised learning while using fewer labeled samples. It uses both labeled and unlabeled data for training, in contrast to supervised learning (all data labeled) and unsupervised learning (all data unlabeled) [157], and thus maximizes the use of unlabeled data, which are relatively easy to obtain. Traditional semi-supervised learning includes generative modeling, graph-based methods, etc.; details of these methods and more comprehensive overviews are provided in [157,158,159]. The more recently proposed GAN belongs to the category of generative modeling and is one of the research hotspots in semi-supervised deep learning [160, 161]. However, it may suffer from unstable training and be too complicated to use in online AVI applications. Many recent approaches for semi-supervised learning add a loss term that is computed on unlabeled data and encourages the model to generalize better to unseen data by means of entropy minimization, which encourages the model to output confident predictions on unlabeled data, and regularization, which encourages the model to produce the same output distribution when its inputs are perturbed and avoids overfitting the training data [162]. For instance, Berthelot et al. [162] from Google Research proposed a holistic semi-supervised learning algorithm named MixMatch, which introduces a unified loss term for unlabeled data that seamlessly reduces entropy; MixMatch has obtained state-of-the-art results across many datasets. Zheng et al. [163] proposed a sophisticated algorithm based on MixMatch for automated surface inspection and showed that it is effective on two public defect datasets (DAGM and NEU) and one industrial dataset (CCL).
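To make the mechanism concrete, the sketch below (a simplified illustration assuming PyTorch and torchvision, not MixMatch itself or the method of [163]) combines standard augmentations with a consistency term on unlabeled images: the model is asked to produce similar output distributions for two differently augmented views of the same unlabeled image, in addition to the usual cross-entropy loss on labeled images.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# Typical augmentation pipeline used to expand a small defect dataset
# (applied here batch-wise to image tensors for brevity).
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomResizedCrop(128, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

def semi_supervised_loss(model, labeled_x, labels, unlabeled_x, lam=1.0):
    """Cross-entropy on labeled data + consistency loss on unlabeled data."""
    # Supervised term on the (augmented) labeled batch.
    sup_loss = F.cross_entropy(model(augment(labeled_x)), labels)

    # Consistency term: two augmented views of the same unlabeled images
    # should yield similar predicted distributions; the second view is
    # detached and serves as the target.
    p1 = F.softmax(model(augment(unlabeled_x)), dim=1)
    p2 = F.softmax(model(augment(unlabeled_x)), dim=1).detach()
    consistency = F.mse_loss(p1, p2)

    return sup_loss + lam * consistency

# Usage: loss = semi_supervised_loss(model, labeled_imgs, labels, unlabeled_imgs)
# where `model` is any classifier returning logits, e.g. a defect CNN.
```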

In general, the challenges of achieving accurate and fast detection and of coping with the lack of sufficient training samples hinder the application of deep learning in industrial AVI. Probable solutions include lightweight neural networks, efficient convolution algorithms, automatic data augmentation, the semi-supervised deep learning paradigm, and other deep learning technologies that are still under development. Although extensive research has been conducted on deep learning-based defect detection, there is still considerable room for improvement in accuracy and computation speed, and the state of the art in deep learning should continue to be studied comprehensively to make online AVI applications more practical.

7 Conclusion

Traditional defect detection algorithms generally conduct detection in two stages: feature extraction and defect identification. They require a set of manually designed features, which depend heavily on extensive domain knowledge. Furthermore, these methods tend to work effectively only under specified conditions and are sensitive to input variations; once the application conditions vary, the algorithm must be adjusted substantially.

The recent advancement of deep learning provides generic tools that conduct detection in one stage: features are learned and defects are identified simultaneously. Deep learning is capable of learning high-level features from training data automatically, without requiring an additional feature extractor or domain expert knowledge. A deep network-based detection approach is applicable to different objects and defect types as long as it is trained on the corresponding data. Moreover, it is insensitive to variations in the input or application conditions, provided the training data do not vary substantially. In general, compared to traditional defect detection methods, deep learning-based detection approaches are more automatic, more generic, and more robust, because they do not require manually designed features, are applicable to different objects and defect types, and are insensitive to variations.

Notwithstanding these advantages, deep learning-based defect identification has rarely been used in practical industrial applications, and it remains an unsolved problem when image samples are insufficient. There are three paradigms for deep learning-based defect detection: supervised learning, unsupervised learning, and semi-supervised learning. Supervised learning is the most widely used and is capable of achieving high detection accuracy. However, it has the disadvantage of being strongly dependent on a large amount of labeled training data; the preparation of labeled training data incurs significant labor and time costs, and it is occasionally highly challenging or infeasible to label or collect sufficient training data. Unsupervised learning can address the deficiency of training data, but it is less reliable than supervised learning and is therefore infeasible for online industrial AVI applications. Semi-supervised learning may provide a solution that achieves precision similar to that of supervised learning while using fewer labeled samples. It uses both labeled and unlabeled data for training and maximizes the use of unlabeled data, which are relatively easy to obtain. Many recent approaches for semi-supervised learning add a loss term that is computed on unlabeled data and encourages the model to generalize better to unseen data by means of entropy minimization, which encourages the model to output confident predictions on unlabeled data, and regularization, which encourages the model to produce the same output distribution when its inputs are perturbed and avoids overfitting the training data.

In addition, the absence of a large number of training samples for supervised learning-based defect detection can be alleviated through data augmentation technology. There are two approaches to conducting data augmentation. One is to preprocess the original images to expand the original dataset; this is implemented by performing image transformations such as flipping, random cropping, re-scaling, and color shifting, and the transformed image samples are added to the original dataset to form an expanded dataset, which is fed to the network for training. The alternative is to generate images automatically through data augmentation during training. The latter method can also be utilized in a semi-supervised learning framework.

Another challenge faced by deep learning-based defect detection is meeting the millisecond-level real-time detection requirements of industrial applications while maintaining high accuracy. By developing lightweight neural networks or efficient convolution algorithms, we can strike a balance between computational cost and accuracy and ultimately realize rapid and accurate deep learning-based defect detection in industrial online applications.

As semi-supervised learning and data augmentation can be used to alleviate or address the absence of a large number of training samples, and as lightweight neural networks and efficient convolution algorithms can be employed to improve computation speed, we consider that deep learning has the potential to gradually replace traditional defect detection algorithms. Future directions for deep learning-based defect detection may include the utilization of automated data augmentation during training, the development of semi-supervised learning approaches to alleviate the problem of insufficient training data, and the innovation of efficient convolution algorithms and lightweight neural networks to meet real-time computation requirements. We believe that, with the continuous development of deep learning, surface defect inspection using deep learning has a promising future.