1 Introduction

With the rapid development of convolutional neural networks in object detection and image recognition [1], the YOLO object detection model has become a research hotspot owing to its fast inference, flexibility, and strong generalization performance [2].

An FPGA is a semi-custom circuit with programmable input/output units, embedded RAM, dedicated routing resources, and configurable logic blocks. It processes large volumes of data in parallel to accelerate computation while consuming very little power. Compared with a CPU, an FPGA therefore offers distinct advantages for accelerating convolutional neural networks [3] and is well suited to deployment on small mobile devices.

At present, most domestic mobile phone lens manufacturers rely on manual visual inspection to detect lens defects. This approach is inefficient, has a high false detection rate, is easily influenced by the inspectors' subjective judgment, and incurs high labor costs [4], so it does not fit the current trend toward automation. In recent years, with the development of machine vision technology, its application to lens defect detection has steadily deepened [5]. Zhu et al. used a blackbody as a dark background to improve the contrast of lens defects and applied image analysis for detection; this method takes 5 s per piece [6]. Zhu optimized the point-source projection lighting scheme of a precision microscope and other instruments to obtain optimal imaging of surface defects, analyzed the shape and size of lens defects, and wrote a dedicated image processing algorithm for each defect type; the cost of this method was high [7]. Zhu trained a simplified YOLO network based on deep learning and built a PC-controlled detection system to identify various defects in resin lens samples; the detection power consumption of this method was high [8].

Building on the detection methods above, this article adopts optimization strategies such as parameter reordering, dynamic quantization, loop unrolling, and loop tiling to design a software/hardware co-processing platform based on FPGA hardware acceleration, and realizes real-time detection of mobile phone lens defects. The experimental results show that the platform achieves low detection latency, low power consumption, low cost, high accuracy, and high defect localization accuracy, which demonstrates the effectiveness and practicality of the accelerated detection platform.

2 Material and Methods

2.1 Selection of Network Model

The R-CNN family of algorithms uses a two-stage architecture, and its detection speed cannot meet real-time requirements. YOLOv1 [9] is unsuitable for mobile phone lens defect detection because its immature bounding box prediction mechanism and loss calculation lead to poor localization accuracy for small targets. YOLOv2 [10] is a single-stage network proposed by Redmon et al. in 2017. Its core idea is to recast object detection as a single regression problem over bounding boxes and class probabilities [11] and to introduce the anchor mechanism, which improves the recall rate during training and enhances the classification of fine-grained image features. It is suitable for detecting small targets at high speed and can achieve real-time detection.

Because the FPGA development board is low-cost and its computing resources are relatively limited, YOLOv2 is selected for its high detection accuracy on small targets, low detection latency, and lightweight network structure.

2.2 YOLOv2 Network Structure

The overall structure of the YOLOv2 network is shown in Fig. 9.1.

Fig. 9.1

Overall structure of YOLOv2 network

As shown in Fig. 9.1, the YOLOv2 network is composed of CBL, MCN, Route, and Reorg components. CBL is the smallest building block of the YOLOv2 structure and consists of Conv, BN, and Leaky ReLU. The convolution operation (Conv) performs feature extraction and produces feature maps. The batch normalization (BN) layer applies a linear transformation to the feature map elements to accelerate training convergence [12]. The activation function applies a nonlinear transformation to the feature map elements to enhance the nonlinear fitting ability of the network. Except for the final output layer, which uses Sigmoid, all activations are Leaky ReLU, as shown in (9.1).

$$f\left( x \right) = \left\{ {\begin{array}{*{20}c} x \times 0.1, & x < 0 \\ x, & x \ge 0 \\ \end{array} } \right.$$
(9.1)
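As a quick illustration, (9.1) can be applied element-wise to a feature map, for example with NumPy; this is only a behavioral sketch, not the fixed-point implementation used on the FPGA.

```python
import numpy as np

def leaky_relu(x: np.ndarray, slope: float = 0.1) -> np.ndarray:
    """Element-wise Leaky ReLU from (9.1): x for x >= 0, 0.1 * x otherwise."""
    return np.where(x >= 0, x, slope * x)
```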

MCN is a down-sampling component composed of a max pooling layer (MAX) and N CBLs. The max pooling layer down-samples the feature map, reducing its size while preserving its salient features.

Route is the routing layer, which concatenates feature maps from different layers to fuse multi-scale feature information.

Reorg is a reordering layer that rearranges features: a large feature map is divided into several smaller feature maps so that it can be combined with deeper, lower-resolution feature maps.

2.3 Accelerator Framework

In the acceleration scheme, the computation-heavy and structurally repetitive parts of the neural network are offloaded to the FPGA, and a dynamically configurable neural network acceleration IP is designed. The PS side stores the structural parameters of the neural network and obtains the computation results by calling the acceleration IP. The overall framework of the neural network acceleration IP is shown in Fig. 9.2.

Fig. 9.2

Neural network acceleration IP framework

As shown in Fig. 9.2, the input and output modules are implemented in the FPGA, and the convolution, pooling, and reordering computations are encapsulated in the IP module together with AXI-Lite and AXI-Full interfaces. The PS side configures the computation units through AXI-Lite, the weight parameters and sample feature data are fed into the convolutional neural network acceleration IP through AXI-Full, and the computed results are read back through AXI-Full.

2.4 Integrated Convolutional Neural Network IP

Input and Output Module

The input and output module acts as the master device; its interface adopts AXI-Full, and the DDR3 on the PS side serves as the slave device. Considering the limited BRAM resources of the FPGA, loop tiling is used to partition the feature maps into blocks, as shown in (9.2) and (9.3).

$$Tci = \left( {Tco - 1} \right) \times S + K$$
(9.2)
$$Tri = \left( {Tro - 1} \right) \times S + K$$
(9.3)

Here, \(Tci\) and \(Tri\) are the width and height of the input feature map tile, \(Tco\) and \(Tro\) are the width and height of the output feature map tile, \(S\) is the convolution stride, and \(K\) is the convolution kernel size (a numerical sketch of these relations is given after Fig. 9.4). The process of unrolling the input and output modules along the depth dimension is shown in Fig. 9.3.

Fig. 9.3

Input and output block segmentation diagram

In Fig. 9.3, Tn is the input tile depth and Tm is the output tile depth. A double-buffer design with pipelined operation is adopted to increase data throughput and avoid a performance bottleneck for the subsequent convolution acceleration module. The pipelined double-buffered input and output structure is shown in Fig. 9.4.

Fig. 9.4

Pipelined input and output structure of double buffer

For the input structure in Fig. 9.4, A transfers data from DDR3 to Buffer1, a transfers data from DDR3 to Buffer2, B transfers data from Buffer1 to the on-chip cache, and b transfers data from Buffer2 to the on-chip cache. For the output structure in Fig. 9.4, A transfers data from the on-chip output cache to Buffer1, a transfers data from the on-chip output cache to Buffer2, B transfers data from Buffer1 to DDR3, and b transfers data from Buffer2 to DDR3.
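To make the role of (9.2) and (9.3) concrete, the short sketch below computes the input tile size required for a chosen output tile; the example values (3 × 3 kernel, stride 1, 26 × 26 output tile) are illustrative assumptions rather than the tile sizes used in the actual design.

```python
def input_tile_size(tco, tro, stride, kernel):
    """Input tile width/height needed to produce a Tco x Tro output tile, per (9.2)/(9.3)."""
    tci = (tco - 1) * stride + kernel
    tri = (tro - 1) * stride + kernel
    return tci, tri

# Example: a 3 x 3 kernel with stride 1 producing a 26 x 26 output tile
# needs a 28 x 28 input tile, which bounds the on-chip input buffer size.
print(input_tile_size(26, 26, stride=1, kernel=3))  # (28, 28)
```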

Convolution Module

Convolution accounts for the vast majority of the computation in the neural network, and its essence is multiply-accumulate operations. To improve parallelism, the two optimization strategies of loop unrolling and loop tiling are adopted. The loops over the input and output feature maps are unrolled along two dimensions, so that Tm output feature maps and Tn input feature maps are processed in parallel at each step (see the sketch below). The input feature maps are partitioned into tiles, and only one tile is computed at a time, which reduces the BRAM requirements of the FPGA, reuses on-chip data, and reduces the number of reads and writes to off-chip memory.
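The loop structure below is a software sketch of this tiled, partially unrolled convolution; array shapes and names are illustrative, and on the FPGA the two innermost channel loops (over Tm and Tn) are the ones unrolled into parallel multiply-accumulate units.

```python
import numpy as np

def conv_tile(in_tile, weights, out_tile, stride=1):
    """One output tile: out_tile[Tm][Tro][Tco] accumulates
    in_tile[Tn][Tri][Tci] convolved with weights[Tm][Tn][K][K]."""
    Tm, Tn, K, _ = weights.shape
    _, Tro, Tco = out_tile.shape
    for kr in range(K):                  # kernel rows
        for kc in range(K):              # kernel columns
            for r in range(Tro):         # output rows within the tile
                for c in range(Tco):     # output columns within the tile
                    # On the FPGA the two loops below are fully unrolled,
                    # giving Tm x Tn parallel multiply-accumulate units.
                    for tm in range(Tm):
                        for tn in range(Tn):
                            out_tile[tm, r, c] += (
                                weights[tm, tn, kr, kc]
                                * in_tile[tn, r * stride + kr, c * stride + kc]
                            )
    return out_tile
```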

Maximum Pooling Module

Max pooling can be regarded as a special convolution. It requires no weight parameters, pools each channel of the input feature map independently, and its corresponding operation unit is a comparator. The corresponding hardware structure is shown in Fig. 9.5.

Fig. 9.5

Hardware structure corresponding to maximum pooling

For the 2 × 2 max pooling in Fig. 9.5, the register REG is initially zero. As the four values are input in turn, the comparator writes the larger value into REG. After 2 × 2 clock cycles the comparison is complete, the largest value is output, and REG is cleared. Since comparators mainly consume LUT resources, the pooling design also adopts the loop tiling optimization strategy.
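A behavioral model of this comparator/REG structure is sketched below; it assumes non-negative input values (consistent with REG starting at zero) and a K × K window streamed one value per cycle.

```python
def max_pool_window(values):
    """Behavioral model of the comparator/REG structure in Fig. 9.5:
    stream the window values in one per cycle and keep the running maximum."""
    reg = 0                 # REG starts at zero (assumes non-negative inputs)
    for v in values:        # one value per clock cycle, K*K cycles per window
        if v > reg:         # the comparator
            reg = v         # update REG with the larger value
    return reg              # output the maximum; REG is then cleared

# Example: one 2 x 2 pooling window
print(max_pool_window([3, 7, 2, 5]))  # 7
```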

Reordering Module

Reordering within FPGA can be seen as a multiplexer, as shown in Fig. 9.6.

Fig. 9.6

The corresponding structure of reordering

In Fig. 9.6, one input buffer corresponds to four output buffers to achieve 2 × 2 reordering.

2.5 Creation of the Defect Lens Data Set and Model Training

Four common mobile phone lens defects (petal injury, glue hole, crack, and membrane crack) are selected as the research objects for building the data set. Labeling software is used to annotate the lens defect detection dataset. Because the number of real samples is limited, the sample set is expanded through data augmentation (rotation, affine transformation, image enhancement, and noise addition), as sketched below. This increases the number of samples, varies the orientation and size of the defects, and improves robustness to image noise [13].
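A minimal augmentation sketch using OpenCV and NumPy is given below; the specific angles, noise level, and contrast factors are illustrative assumptions, since the exact augmentation parameters are not specified here, and in practice the bounding box annotations must be transformed together with the images.

```python
import cv2
import numpy as np

def augment(img):
    """Generate augmented copies of one lens image: rotation, affine transform,
    contrast/brightness enhancement, and Gaussian noise (illustrative parameters)."""
    h, w = img.shape[:2]
    out = []
    # Rotation: changes defect orientation
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), 90, 1.0)
    out.append(cv2.warpAffine(img, rot, (w, h)))
    # Affine transform: slightly changes defect shape and position
    src = np.float32([[0, 0], [w - 1, 0], [0, h - 1]])
    dst = np.float32([[0, h * 0.05], [w * 0.95, 0], [w * 0.05, h * 0.95]])
    out.append(cv2.warpAffine(img, cv2.getAffineTransform(src, dst), (w, h)))
    # Image enhancement: simple contrast/brightness adjustment
    out.append(cv2.convertScaleAbs(img, alpha=1.2, beta=10))
    # Noise addition: improves robustness to sensor noise
    noise = np.random.normal(0, 8, img.shape).astype(np.float32)
    out.append(np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8))
    return out
```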

YOLOv2 training parameters: To balance the training effect and memory pressure, the batch size (number of images fed to the network per iteration) is 64, and the number of subdivisions per batch is 8. To speed up training and avoid overfitting, the momentum constant is 0.9 and the weight decay regularization coefficient is 0.0005. The initial learning rate is 0.001, and the total number of iterations is 200. The learning rate schedule is "steps": when the iteration count reaches 160 and 180, the learning rate is reduced to 0.0001 and 0.00001, respectively, so that the model learns more effectively as training progresses. The number of classes in the region layer is changed to 4, and the number of convolution kernels in the last convolution layer is changed to 45, since each of the 5 anchor boxes predicts 4 class probabilities, 4 box coordinates, and 1 confidence score, i.e., 5 × (4 + 5) = 45.

During training, the curves of the model's loss and recall rate are shown in Fig. 9.7a, b, respectively.

Fig. 9.7

Change curve of training process

Figure 9.7a shows that the loss function converges to 0.01 after about 90 iterations, and Fig. 9.7b shows that the recall rate approaches 1 after about 70 iterations. Based on these training curves, the trained weights of the YOLOv2 network model perform well. The trained weights are then used to detect the validation set samples, and the results are shown in Table 9.1.

Table 9.1 Test results of model validation set

Table 9.1 shows that the model detection accuracy is 96.13%, which indicates that the training effect is satisfactory and the model can be ported in the next step.

2.6 Porting of YOLOv2

The source code of the darknet framework used by the YOLOv2 feature extraction network is rewritten so that the weight file is split into two files containing the weight and bias parameters. The parameter mapping is shown in (9.4).

$$y = A \times X + B$$
(9.4)

In the above equation, A and B are mapping coefficients. The weight and bias parameters of the convolution kernel are then updated as shown in (9.5) and (9.6).

$${\text{weight}}_{i} = {\text{weight}}_{i} \times A_{i}$$
(9.5)
$${\text{bias}}_{i} = {\text{bias}}_{i} \times A_{i} + B_{i}$$
(9.6)

In the above equations, \(A_{i}\) and \(B_{i}\) are the mapping coefficients of feature map \(i\). The mapped values are stored in binary form in the separated files, which reduces the amount of computation and increases the computation speed [14].
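This mapping is the standard folding of the batch normalization transform y = A × X + B into the preceding convolution. The sketch below assumes the per-channel coefficients A_i and B_i of (9.4) have already been computed from the BN parameters.

```python
import numpy as np

def fold_bn(weights, biases, A, B):
    """Fold the per-channel mapping y = A*x + B into the conv parameters,
    per (9.5) and (9.6). weights: [out_ch, in_ch, K, K]; biases, A, B: [out_ch]."""
    folded_w = weights * A[:, None, None, None]   # weight_i = weight_i * A_i
    folded_b = biases * A + B                     # bias_i   = bias_i * A_i + B_i
    return folded_w, folded_b

# The folded weights and biases are then written to two separate binary files,
# e.g. with folded_w.astype(np.float32).tofile("weights.bin").
```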

Data Quantization

Before quantization, each weight and bias parameter occupies 32 bits, which demands high bus bandwidth and indirectly limits the computation rate. In this article, dynamic fixed-point 16-bit quantization [15] is used to reduce the parameter bit width. The optimal exponent for the weights of each layer is found by exhaustive search, as shown in (9.7).

$$\exp_{w} = \arg \min \mathop \sum \limits_{i = 0}^{n} \left| {w_{float}^{i} - w_{{\left( {bw,\exp_{w} } \right)}}^{i} } \right|$$
(9.7)

In the above equation, \(\exp_{w}\) is the exponent that minimizes the sum of the absolute quantization errors of all parameters in the layer, \(n\) is the number of parameters to be quantized in the layer, \(w_{float}^{i}\) is the original floating-point value of the \(i\)-th parameter, and \(w_{{\left( {bw,\exp_{w} } \right)}}^{i}\) is the fixed-point value of the \(i\)-th parameter under bit width \(bw\) and exponent \(\exp_{w}\), converted back to floating point for the comparison.
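The exponent search of (9.7) can be sketched as follows; the rounding mode and the range of candidate exponents are assumptions, since the source does not specify them.

```python
import numpy as np

def quantize(w, bw, exp):
    """Dynamic fixed point: represent w as integer * 2**exp with bw-bit signed
    integers, then convert back to float to measure the quantization error."""
    scale = 2.0 ** exp
    qmin, qmax = -(2 ** (bw - 1)), 2 ** (bw - 1) - 1
    q = np.clip(np.round(w / scale), qmin, qmax)
    return q * scale

def best_exponent(w, bw=16, exp_range=range(-20, 5)):
    """Exhaustive search for the exponent minimizing the L1 error of (9.7)."""
    errors = {e: np.abs(w - quantize(w, bw, e)).sum() for e in exp_range}
    return min(errors, key=errors.get)

# Example: quantize one layer's weights to 16-bit dynamic fixed point
layer_w = np.random.randn(1024).astype(np.float32) * 0.1
e = best_exponent(layer_w)
layer_w_q = quantize(layer_w, 16, e)
```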

Weight Reordering

Because the convolution operation is optimized with loop tiling, the weights are fetched tile by tile. When the designed algorithm is converted into an RTL circuit by HLS, simulation shows that, because the weight parameters are stored separately, each AXI-Full burst transfer is only K × K in size, which cannot make full use of the DRAM transmission bandwidth. To exploit the bandwidth fully, the weight parameters are reordered, as shown in Fig. 9.8 (a sketch of the reordering follows the figure).

Fig. 9.8

Weight parameter reordering
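The sketch below illustrates one possible reordering consistent with Fig. 9.8: weights in the original [M][N][K][K] layout are regrouped so that all Tm × Tn × K × K values needed by one convolution tile are contiguous, allowing one long AXI-Full burst instead of many K × K bursts. The exact layout is an assumption based on the figure.

```python
import numpy as np

def reorder_weights(w, Tm, Tn):
    """Rearrange weights [M][N][K][K] so that each (Tm x Tn x K x K) tile is
    stored contiguously, matching the order in which the accelerator reads them."""
    M, N, K, _ = w.shape
    assert M % Tm == 0 and N % Tn == 0, "illustrative sketch: require exact tiling"
    tiles = []
    for m0 in range(0, M, Tm):          # loop over output-channel tiles
        for n0 in range(0, N, Tn):      # loop over input-channel tiles
            tiles.append(w[m0:m0 + Tm, n0:n0 + Tn].reshape(-1))
    return np.concatenate(tiles)        # flat array written to the weight file
```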

PS and PL Side Collaborative Real-Time Detection

The PS side is mainly responsible for storing the neural network structure, driving the USB camera, calling the integrated neural network acceleration IP, preprocessing the images captured by the camera, and post-processing the final output of the acceleration IP. The collaborative design framework is shown in Fig. 9.9.

Fig. 9.9

PS and PL side collaboration framework

As shown in Fig. 9.9, the PS side first reads the weight and bias files stored on the SD card and places them in the external DDR3 memory. It then drives the USB camera to capture real-time video frames, buffers the captured images, and preprocesses them, mainly by resizing the images and normalizing the pixel values. Finally, it reads the neural network structure and configures the integrated neural network acceleration IP according to the input and output layers and the quantization exponents.

OpenCV is used to drive the camera, and its capture interface buffers 4 video frames internally. Because continuous acquisition would otherwise cause the processed frames to lag behind real time, a program segment is added to discard the buffered frames, which would otherwise add latency to the lens detection.
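A simplified PS-side sketch under the PYNQ framework is shown below; the bitstream name, the IP instance name, the 416 × 416 input size, and the number of flushed frames are illustrative assumptions rather than values taken from this work.

```python
import cv2
import numpy as np
from pynq import Overlay, allocate

# Load the bitstream and get a handle to the acceleration IP
# ("yolov2.bit" and "yolo_accel_0" are hypothetical names).
overlay = Overlay("yolov2.bit")
accel = overlay.yolo_accel_0

cap = cv2.VideoCapture(0)  # USB camera

def grab_fresh_frame(cap, flush=4):
    """Discard the frames OpenCV has buffered so the returned frame is current."""
    for _ in range(flush):
        cap.grab()
    ok, frame = cap.read()
    return frame if ok else None

def preprocess(frame, size=416):
    """Resize to the network input size and normalize pixels to [0, 1]."""
    img = cv2.resize(frame, (size, size))
    return img.astype(np.float32) / 255.0

frame = grab_fresh_frame(cap)
net_in = preprocess(frame)

# Copy the image into a DMA-able buffer; the accelerator would then be
# configured over AXI-Lite (register offsets omitted here) and would read the
# buffer's physical address over AXI-Full.
in_buf = allocate(shape=net_in.shape, dtype=np.float32)
in_buf[:] = net_in
```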

To simplify the detection platform, HDMI is used for display: DIGILENT's open-source RGB2DVI IP core performs the data bit conversion, and XILINX's VTC IP core controls the video timing. To generate correct video timing, the raw video data must be supplied. After the USB camera captures the image data, the data are stored in the external DDR3 memory. Since the DDR3 controller is on the PS side, the video data must be transferred to the PL side through the AXI-Full interface. To increase the video transfer rate and reduce the workload of the PS side, the VDMA IP core is instantiated on the PL side; it reads the DDR3 data over AXI-Full and converts them into AXI-Stream format. The AXIS2VIDEO IP core then converts the data carried on the AXI-Stream bus into 24-bit RGB data and, driven by the video timing received from the VTC IP core, outputs them to the RGB2DVI IP core. The AXIS2VIDEO IP core interacts with the VDMA IP core through the AXI-Stream bus.

When displaying the real-time detection results, considering the limited computing power of the PS side and to reduce display time, the ipywidgets library is used to display the processed image directly. If the image contains defects, it is saved; otherwise it is simply overwritten by the next frame.
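A minimal display sketch along these lines, assuming a Jupyter environment on the board; the widget setup and file naming are illustrative.

```python
import cv2
import ipywidgets as widgets
from IPython.display import display

view = widgets.Image(format='jpeg')      # image widget updated in place
display(view)

def show_result(frame, boxes, frame_id):
    """Push an annotated frame to the widget; save it only if defects were found."""
    ok, jpg = cv2.imencode('.jpg', frame)
    if ok:
        view.value = jpg.tobytes()       # overwrite the previous frame on screen
    if boxes:                            # non-empty list of predicted defect boxes
        cv2.imwrite(f"defect_{frame_id}.jpg", frame)
```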

The real-time lens defect detection process with the FPGA hardware accelerator is shown in Fig. 9.10.

Fig. 9.10

FPGA hardware acceleration lens defect real-time detection

In Fig. 9.10, the convolution, max pooling, and reordering layers are accelerated within the integrated neural network acceleration IP according to their division of labor, while the routing layer only records the input and output address information. Post-processing consists mainly of the final detection layer, which runs on the PS side to localize the lens defect prediction boxes and assign their categories.

3 Results

3.1 Hardware and Software Platform Environment

The hardware and software platform environment used in the experiment is shown in Table 9.2.

Table 9.2 Hardware and software platform configuration

This experiment uses TUL's PYNQ-Z2 development board based on the Zynq-7020 chip. The main resource allocation of the Zynq-7020 chip is shown in Table 9.3.

Table 9.3 Zynq7020 partial main resource allocation

The PYNQ-Z2 supports the open-source PYNQ framework, which can be loosely understood as combining Python with ZYNQ. Python library modules can be called within the framework, enabling efficient embedded development and making full use of software and hardware collaboration [16]. Although RTL programming cannot be done directly in the Python environment, the PYNQ framework makes it convenient to port the YOLOv2 model.

3.2 Real-Time Detection

Samples from the mobile phone lens defect validation set are detected in real time and displayed on the panel; the result is shown in Fig. 9.11.

Fig. 9.11

Real-time detection effect

In Fig. 9.11a the model correctly boxes two petal injury defects, in Fig. 9.11b one glue hole defect, in Fig. 9.11c two crack defects, and in Fig. 9.11d one membrane crack defect. The text above each target box is the predicted category label. The test results show that the model can predict and distinguish the four kinds of defects, and when multiple defects coexist, essentially no defects are missed.

During real-time detection, the adjustment of the camera focal length directly affects the image quality of the video frames and thus indirectly affects the detection results. The real-time detection log is shown in Fig. 9.12.

Fig. 9.12

Real-time detection log

Figure 9.12 shows that the FPGA-accelerated part completes within 1 s. Adding the time consumed by preprocessing, post-processing, and displaying the detection results on the PS side, and averaging over multiple samples, the FPGA platform takes 2 s to complete the detection of one real-time image.

3.3 Hardware Resource Consumption

The hardware resource consumption after porting the accelerated YOLOv2 network to the PYNQ-Z2 development board is shown in Fig. 9.13.

Fig. 9.13

Hardware resource consumption

Figure 9.13 shows that BRAM and DSP are the most heavily consumed resources. BRAM usage is high because the weight parameters and the input and output feature map tiles must be buffered on chip for each computation. DSP usage is high because, to maximize the acceleration of the large number of multiply and add operations in the neural network, these operations are performed in parallel.

3.4 Detection Power Consumption

The power report from the Vivado platform for the PYNQ-Z2 development board running the YOLOv2 model for lens defect detection is shown in Fig. 9.14.

Fig. 9.14

Power consumption synthesis report

Figure 9.14 shows that the static power consumption is 0.198 W, the dynamic power consumption is 2.775 W, and the total power consumption is 2.953 W, confirming that the power consumption of the FPGA is very low.

3.5 Detection Accuracy

After the focal length is adjusted and the imaging is clear and stable, 500 validation set samples are detected in real time. The statistical results are shown in Table 9.4.

Table 9.4 The verification results of lens detection under FPGA

Analysis of Table 9.4 shows that the overall accuracy of FPGA real-time detection is 90.80%, with petal injury having the lowest accuracy. Compared with the other defects, petal injury has a smaller feature size and less contrast between the defect shape and the background, which makes feature extraction more difficult and leads to a lower detection rate.

4 Discussion

The official YOLO description of detection in a CPU environment is shown in Fig. 9.15.

Fig. 9.15

YOLO official description

According to Fig. 9.15, detecting one image with official YOLO in a CPU environment takes 6 to 12 s.

FPGA versus CPU detection performance is shown in Table 9.5.

Table 9.5 Comparison of FPGA and CPU detection performance

Analysis of Table 9.5 shows that the detection speed of the FPGA hardware-accelerated lens defect detection is 3.5 times that of the CPU environment, so the FPGA acceleration effect is significant. The FPGA also has a clear advantage in power consumption; for equipment or instruments that run for long periods, choosing the FPGA saves more energy. The detection accuracy of the FPGA is slightly lower than that of the CPU environment, mainly because the weight and bias parameters are quantized to dynamic fixed-point 16 bits, and the reduced bit width causes a loss of precision and hence a drop in accuracy.

Table 9.6 compares the defect detection performance of this platform with the three methods from the references.

Table 9.6 Method performance comparison

Table 9.6 shows that the platform in this article outperforms the other three machine vision methods in detection latency, power consumption, cost, and applicable scenarios, and meets the requirements of low-cost, low-power real-time detection of lens defects.

5 Conclusions

To address the high latency, high power consumption, high cost, and demanding deployment conditions of current machine vision methods for mobile phone lens defect detection, and to meet the requirements of low latency, low power consumption, high accuracy, and strong stability for real-time applications on small mobile devices, this article adopts optimization strategies such as parameter reordering, dynamic quantization, loop unrolling, and loop tiling to design a software/hardware co-processing platform based on FPGA hardware acceleration, and realizes real-time detection of lens defects. The experimental results show that the FPGA completes real-time detection of one image in 2 s, 3.5 times the CPU speed, with a total power consumption of 2.953 W, equivalent to 3.6% of the CPU's, so the performance-per-watt is improved by about 97 times. The defect localization accuracy is high and the detection accuracy is 90.80%, demonstrating that the detection platform has clear performance advantages.

Limited by time and resources, this platform has the following limitations. First, the current FPGA detection accuracy is 5.33% lower than that of the CPU; in future work, the dynamic fixed-point 16-bit quantized model can be iteratively fine-tuned over multiple training rounds to keep the accuracy loss within an acceptable range. Second, the post-processing stage of the real-time detection flow is not accelerated by the FPGA and takes about 0.5 s; implementing it in the FPGA would further reduce the real-time detection latency.