1 Introduction

Welding is widely used in railways, bridges, chemical equipment manufacturing [1], and other industrial fields. With the continuous improvement of industrial production levels, the requirements for welding quality are becoming increasingly stringent. However, due to limitations of the working environment, welding defects such as cracks and pores are inevitably produced [2]. Because welding defects can lead to significant safety risks, identifying them is important.

Currently, the mainstream defect identification methods include traditional and deep learning methods [3]. The traditional method classifies defects based on statistical information and image features [4], involving defect segmentation, feature extraction, feature selection, and defect identification [5]. Weld defect segmentation [6] mainly extracts the defect areas, such as the improved OTSU method [7]. Feature extraction obtains feature sets that describe defects, such as edge features, area-based features, and texture features [8], and feature selection mainly removes redundant features and noise while retaining useful features. Defect identification involves identifying the type and nature of defects and is the core stage of the entire defect identification system.

Defect identification mostly adopts multi-source information fusion decision-making [9], Bayesian methods [10], support vector machines, and fuzzy logic. Traditional defect recognition methods determine the defect’s target point within the image by extracting feature points and geometric relationships while the external environment remains relatively fixed. For instance, Zhou et al. [11] used the Hough transform to identify welds, and Xu [12] developed a SIFT feature point extraction method for weld identification. However, traditional defect identification methods suffer from poor generalization ability, so extra cost must be invested in the welding process to cooperate accurately with the visual sensor and enhance welding accuracy. Such cooperation places extremely high requirements on the working environment: once the environment changes even slightly, the system must be remodeled, adjusted, and calibrated.

In recent years, deep learning methods based on convolutional neural networks (CNNs) have become a research hotspot in image processing and pattern recognition [13,14,15]. Unlike traditional recognition methods, a CNN can extract high-dimensional nonlinear image features with good stability and is less susceptible to interference from external local conditions. As a feedforward neural network, a CNN avoids the complex pre-processing found in traditional recognition algorithms by taking the original image as the model’s input and extracting its features directly. Due to their high recognition efficiency and strong data representation ability, CNNs have been widely used in various fields [16,17,18]. For example, Khumaidi et al. [19] utilized a CNN to classify weld defects in images obtained by a webcam. Although the final accuracy reached 95.8%, this method required a large sample size to achieve high precision. Jiao et al. [20] extracted 15 characteristic parameters that characterize the weld surface defect status from the weld image and combined them with a BP neural network to identify internal weld surface defects. However, the overall recognition accuracy (91%) was inadequate for actual industrial production. Zhou et al. [21] proposed an improved U-Net robust weld recognition algorithm based on a fused attention mechanism. This method improved the feature fusion and loss function and added a feature classification structure to output the corresponding weld type name. Experimental results show that it achieved a weld recognition accuracy of 95.6%. However, it is mainly used for weld detection, and its effectiveness in welding defect recognition remains to be explored. Zhang [22] proposed a welding defect identification algorithm based on principal component analysis to extract crack defects that may occur in the welding area of long-distance pipelines. To enhance recognition accuracy, CNNs have become deeper, and key technologies such as skip connections (SC) [23] and batch normalization (BN) [24] have been proposed to ease the training of such deep networks.

Among classification networks, common deep learning models are VGG-16 [25], PReLU-nets [26], and ResNet [27]. Although these networks attain high recognition accuracy, their parameter counts and computational costs are excessive and thus inappropriate for resource-constrained devices. Moreover, in actual welding, the hardware resources available to a weld defect identification model are limited, so deploying such a network must consider computing power, power consumption, and hardware storage space while ensuring fairly high recognition accuracy. Therefore, how to compress a deep CNN model’s computational complexity and storage footprint has become a current research hotspot. Indeed, Han et al. [28] used pruning, clustering, and Huffman coding to compress the model’s storage space. Nevertheless, applying such a network on a mobile terminal at an industrial site requires the model to be lightweight while attaining high recognition accuracy, reducing the calculation burden, and minimizing computing power requirements.

Spurred by the above analysis, this study improves SqueezeNet so that it has fewer parameters and higher recognition accuracy and does not require many training samples to identify weld defects. The improved SqueezeNet model’s performance is analyzed and compared with existing defect classification algorithms in terms of precision and accuracy. The results further verify the effectiveness of the proposed model.

2 SqueezeNet network model

The SqueezeNet model is a lightweight network proposed by Iandola et al. [29] in 2016, based on AlexNet. It shrinks the network’s convolution kernels and uses a mean pooling layer instead of the fully connected layer [30]. These modifications reduce the parameter count to about 1/50 of the original AlexNet while maintaining recognition accuracy, and the model occupies only 4.8 MB of memory.

Figure 1 illustrates the SqueezeNet model, which mainly comprises convolutional layers (Conv), max pooling layers (Maxpool/2), Fire modules (Fire), and a global average pooling layer (GAP). An input image initially passes through the first convolutional layer, which uses 96 7 × 7 convolution kernels for preliminary feature extraction, immediately followed by a max pooling layer that reduces the spatial dimensions of the feature map. The image then progresses through several Fire modules in sequence. Figure 2 shows the specific structure of the Fire module highlighted in blue in Fig. 1. The Fire module primarily consists of a squeeze layer and an expand layer. The squeeze layer employs 1 × 1 convolution kernels to decrease the number of input feature channels, while the expand layer replaces part of the 3 × 3 convolution kernels with 1 × 1 kernels, effectively reducing the model’s parameter count. Additionally, a max pooling layer is inserted after certain Fire modules to further reduce the spatial size. Following the final Fire module, the network includes a convolutional layer with 1000 output channels, matching the 1000-category image classification task. At the network’s end, a global average pooling layer (highlighted in orange) replaces the traditional fully connected layer; it computes the average value of each feature map and produces a fixed-size output, which not only decreases the parameter count but also reduces the risk of overfitting. Finally, a softmax layer transforms the output of the global average pooling layer into a probability distribution over the categories.

Fig. 1 The network architecture of SqueezeNet

Fig. 2 Structural diagram of the Fire module in SqueezeNet
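For concreteness, the following is a minimal PyTorch sketch of the Fire module described above. It mirrors the squeeze/expand structure of Fig. 2; the channel sizes in the usage example follow the first Fire module of the original SqueezeNet and are not specific to this paper.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Fire module: a 1x1 squeeze layer followed by parallel 1x1 and 3x3 expand branches."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        # The two expand branches are concatenated along the channel dimension
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# First Fire module of the original SqueezeNet: 96 -> 16 -> (64 + 64) channels
fire = Fire(96, 16, 64, 64)
print(fire(torch.randn(1, 96, 55, 55)).shape)  # torch.Size([1, 128, 55, 55])
```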

3 Proposed method

This paper improves the classic lightweight network SqueezeNet to build a lightweight CNN model for weld defect recognition. The major improvements are as follows. First, the Fire module of the SqueezeNet model is improved by applying depthwise separable convolution to further reduce the number of parameters. Second, residual connections are introduced into SqueezeNet to solve the deep-network degradation problem. Third, the ECA channel attention module is introduced to extract important features and improve the model’s recognition efficiency.

3.1 Improvements to the fire module

In 2017, Google proposed a lightweight network called MobileNet, which has since evolved to v3 [31]. MobileNet v1 popularized the concept of depthwise separable convolution, which comprises a depthwise convolution (DW Conv) combined with a pointwise convolution (PW Conv). Figure 3 depicts the two main stages of depthwise separable convolution: depthwise convolution and pointwise convolution. Initially, the depthwise convolution operates on each channel of the input feature map separately, using one convolution kernel per channel, and the outputs of these kernels are then concatenated to form the result of the stage. This avoids the need to convolve over all input channels, as standard convolution does, significantly reducing the number of parameters and the computational effort. Following the depthwise convolution, batch normalization is applied to enhance training efficiency and stability, and a ReLU activation function introduces nonlinearity, enabling the model to capture more complex features. Next, the pointwise convolution processes the output of the depthwise convolution with a 1 × 1 convolution kernel; its primary purpose is to amalgamate the features from the various channels produced during the depthwise convolution. As in the earlier stage, the pointwise convolution is followed by batch normalization and ReLU activation to further stabilize network behavior and bolster its nonlinear properties.

Fig. 3 Flowchart of depthwise separable convolution
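As an illustration of the two stages just described, here is a minimal PyTorch sketch of a 3 × 3 depthwise separable convolution with batch normalization and ReLU after each stage; it is a generic building block, not code from MobileNet itself.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution,
    each stage followed by batch normalization and ReLU."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Depthwise stage: groups=in_ch gives each input channel its own 3x3 kernel
        self.depthwise = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                      groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True))
        # Pointwise stage: 1x1 convolution fuses information across channels
        self.pointwise = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```

For C_i input channels and C_o output channels, the convolution weights drop from 9·C_i·C_o in a standard 3 × 3 convolution to 9·C_i + C_i·C_o here, which is the source of the parameter savings.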

The Fire module is the core of SqueezeNet and uses lightweight strategies to reduce the network’s parameters. Figure 4a shows that the traditional Fire module mainly consists of two convolution layers: a squeeze layer using 1 × 1 convolution kernels and an expand layer using a mixture of 1 × 1 and 3 × 3 convolution kernels [32]. A ReLU activation function follows each convolution layer to give the model stronger representation capability and introduce nonlinearity.

Fig. 4 Structural comparison of the Fire module and the N-Fire module

Replacing the convolution kernels in the Fire module reduces the model’s parameters to a certain extent. However, as SqueezeNet deepens by stacking many Fire modules, the 3 × 3 convolutions still produce many parameters, so the Fire module must be improved further.

As shown in Fig. 4b, the new Fire module (N-Fire) differs from the original in two major ways. First, a BN (batch normalization) layer is added before each ReLU activation function, which stabilizes the gradient of each layer and helps the network converge during training. Second, following the MobileNet concept, the original 3 × 3 standard convolution of the expand layer is replaced with a 3 × 3 depthwise separable convolution. This not only preserves the model’s performance but also significantly reduces the number of parameters of the SqueezeNet model.
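A possible PyTorch sketch of the N-Fire module under these two changes is given below; the exact layer ordering is an assumption based on Fig. 4b, not the authors’ released code.

```python
import torch
import torch.nn as nn

class NFire(nn.Module):
    """N-Fire sketch: 1x1 squeeze, then 1x1 and depthwise-separable 3x3 expand
    branches; a BN layer precedes every ReLU as described above."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Sequential(
            nn.Conv2d(in_ch, squeeze_ch, 1, bias=False),
            nn.BatchNorm2d(squeeze_ch), nn.ReLU(inplace=True))
        self.expand1x1 = nn.Sequential(
            nn.Conv2d(squeeze_ch, expand1x1_ch, 1, bias=False),
            nn.BatchNorm2d(expand1x1_ch), nn.ReLU(inplace=True))
        # 3x3 branch as depthwise (groups=squeeze_ch) + pointwise convolutions
        self.expand3x3 = nn.Sequential(
            nn.Conv2d(squeeze_ch, squeeze_ch, 3, padding=1,
                      groups=squeeze_ch, bias=False),
            nn.BatchNorm2d(squeeze_ch), nn.ReLU(inplace=True),
            nn.Conv2d(squeeze_ch, expand3x3_ch, 1, bias=False),
            nn.BatchNorm2d(expand3x3_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        x = self.squeeze(x)
        return torch.cat([self.expand1x1(x), self.expand3x3(x)], dim=1)
```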

3.2 Use of residual structures

The original SqueezeNet model is a simple tandem structure similar to the VGG network. To overcome the vanishing and exploding gradient problems caused by deepening the network, and to prevent the recognition accuracy from saturating or even decreasing, a residual structure is introduced into SqueezeNet.

Figure 5 depicts the architecture of a residual block, commonly used in deep neural networks to address the vanishing gradient problem. The input X first passes through a weight layer followed by ReLU activation, forming the function F(X). It then proceeds to a second weight layer. The output of this sequence is added to the original input X, forming F(X) + X, which then passes through another ReLU activation. The addition of X directly to the output of the network layers allows the gradient to flow directly through the network, thereby preserving the gradient magnitude and improving training efficiency.

Fig. 5 Schematic diagram of the residual structure
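The shortcut described above can be expressed as a small wrapper. The sketch below assumes the wrapped stack F preserves both the channel count and the spatial size of its input, as identity shortcuts require.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Identity shortcut around a stack of layers: out = ReLU(F(x) + x).
    Assumes the stack preserves the input's channel count and spatial size."""
    def __init__(self, *layers):
        super().__init__()
        self.body = nn.Sequential(*layers)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # F(X) + X, then ReLU
```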

3.3 Use of ECA modules

The channel attention mechanism plays an important role in improving the performance of deep CNNs. The ECA module is an ultra-lightweight channel attention mechanism. By learning from the feature channels, it assigns each channel a different attention value, allowing the network to allocate computing resources more reasonably and improve the model’s accuracy with only a small increase in parameters.

The ECA module structure is shown in Fig. 6. The module applies a one-dimensional convolution directly after the global average pooling layer and removes the fully connected layer; it thereby avoids dimensionality reduction while effectively capturing cross-channel interactions. Although the ECA module involves only a few parameters, it achieves good results. The one-dimensional convolution realizes the cross-channel information interaction, and its kernel size adapts to the channel dimension through a function. Given channel dimension C, the kernel size K is adaptively determined by:

$$K={\left|\frac{{\log }_{2}\left(C\right)}{\gamma }+\frac{b}{\gamma }\right|}_{odd}$$
(1)

where \(\gamma \) and \(b\) are hyperparameters, and \({\left|\cdot \right|}_{odd}\) denotes rounding to the nearest odd number so that K is a valid one-dimensional convolution kernel size.

Fig. 6 ECA module structure diagram

Figure 6 depicts the implementation process of the ECA module. First, the input feature map passes through the global average pooling layer (GAP), which turns the feature map from a matrix [H, W, C] into a vector [1, 1, C]. Then, the adaptive one-dimensional convolution kernel size K is calculated from the number of feature map channels. The obtained K is used in a one-dimensional convolution to obtain the weight of each feature map channel. Finally, the normalized weights and the original input feature map are multiplied channel by channel to generate the weighted feature map. In this paper, \(\gamma \) and \(b\) are set to 2 and 1, respectively, so the convolution kernel size K equals 3.
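These steps translate directly into code. The sketch below follows the reference ECA design, computing K from Eq. (1) (rounded up to the nearest odd number) and using a bias-free 1D convolution followed by a sigmoid; it is a generic implementation sketch rather than the authors’ code.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """ECA channel attention: GAP -> 1D convolution across channels -> sigmoid."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # Adaptive kernel size from Eq. (1), forced to the nearest odd integer
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 else k + 1
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # [N, C, H, W] -> [N, C, 1, 1] -> [N, 1, C] for the 1D convolution
        y = self.gap(x).squeeze(-1).transpose(-1, -2)
        y = self.sigmoid(self.conv(y)).transpose(-1, -2).unsqueeze(-1)
        return x * y  # channel-wise reweighting, broadcast over H and W
```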

3.4 Improved SqueezeNet

Figure 7a illustrates the final improved SqueezeNet network model, which contains the N-Fire modules, the residual structure, and an ECA module. The input feature map initially passes through a convolutional layer (Conv1) followed by a max pooling layer (Maxpool/2). It then passes through two residual blocks (Resblock1 and Resblock2), each followed by a max pooling layer that reduces the size of the feature map. Resblock2 is followed by an ECA module, which boosts the model’s ability to recognize important features by adjusting the weights of channel features, enhancing overall performance. The subsequent stages comprise an N-Fire module (N-Fire9), a convolutional layer (Conv10), and global average pooling (GAP); finally, the classification result is obtained through a softmax layer.

Fig. 7 Structure of the improved SqueezeNet

As shown in Fig. 7b, Resblock1 consists of three N-Fire modules (N-Fire2, N-Fire3 and N-Fire4), each of which is connected by a ReLU activation function. The introduction of the N-Fire module significantly reduces the model's parameter count. The green line in the figure represents the residual connection, which directly connects the input and output of the N-Fire module. This design helps alleviate the gradient vanishing problem in deep networks and ensures that the gradient can be effectively propagated in the network.

The structure of Resblock2 is shown in Fig. 7c. It is similar to Resblock1 and contains four modules: N-Fire5, N-Fire6, N-Fire7, and N-Fire8. Each module is followed by a ReLU activation function. The residual connection also directly connects the input of the N-Fire module to its output, improving the ability to learn complex functions while maintaining the depth and efficiency of the network.
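Putting the pieces together, a compact sketch of the Fig. 7 architecture is shown below, reusing the NFire, ResidualBlock, and ECA classes sketched earlier. The channel widths and kernel strides are illustrative assumptions, not the paper’s exact configuration.

```python
import torch.nn as nn

def improved_squeezenet(num_classes=4):
    # Sketch of Fig. 7a; NFire, ResidualBlock, and ECA are defined above
    return nn.Sequential(
        nn.Conv2d(3, 96, kernel_size=7, stride=2),   # Conv1
        nn.ReLU(inplace=True),
        nn.MaxPool2d(3, stride=2),                   # Maxpool/2
        ResidualBlock(NFire(96, 16, 48, 48),         # Resblock1: N-Fire2..4
                      NFire(96, 16, 48, 48),
                      NFire(96, 16, 48, 48)),
        nn.MaxPool2d(3, stride=2),
        ResidualBlock(NFire(96, 32, 48, 48),         # Resblock2: N-Fire5..8
                      NFire(96, 32, 48, 48),
                      NFire(96, 32, 48, 48),
                      NFire(96, 32, 48, 48)),
        nn.MaxPool2d(3, stride=2),
        ECA(96),                                     # channel attention
        NFire(96, 64, 128, 128),                     # N-Fire9
        nn.Conv2d(256, num_classes, kernel_size=1),  # Conv10
        nn.AdaptiveAvgPool2d(1),                     # GAP
        nn.Flatten())                                # softmax is applied in the loss
```

Note that the identity shortcuts require each residual block to keep the channel count constant, which is why the N-Fire expand widths inside a block sum to the block’s input width in this sketch.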

4 Datasets and evaluation indicators

4.1 Experimental datasets

The weld defect dataset used in the following experiments was created by the Institute of Computer Application of Liaoning Normal University. It is a line structured light weld dataset captured by a laser sensor based on the triangulation ranging principle [33]. The images are divided into four categories: welding depressions, welding holes, welding burrs, and no defects. Figure 8 presents structured light welds of the four types. Welding depressions are caused by improper heat control; welding holes are caused by trapped gas or incomplete material evaporation; welding burrs are sharp, protruding metal pieces or spatters formed by uneven metal melting and solidification.

Fig. 8 Line structured light weld images

The original dataset is small and the number of images per category is inconsistent, so the class balance is poor and the dataset must be enriched. To ensure the effectiveness of subsequent image classification and recognition training, we employ the same data augmentation method as in reference [33]. Images without specific defect types were enlarged using symmetry and rotation, while images of the hole type were enlarged by rotating them 90° clockwise about the image origin. Following augmentation, the dataset comprises 2000 images, evenly distributed with 500 images in each of the four categories.

The image dataset is divided into training, validation, and test sets using a 3:1:1 ratio. The training set is used to train the model, and the validation set is used to evaluate it. During training, the input images are pre-processed using Mosaic data enhancement and adaptive image scaling.
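A minimal loading-and-splitting sketch is shown below; the weld_data folder layout is hypothetical, and the sketch uses a plain resize in place of the Mosaic augmentation mentioned above.

```python
import torch
from torchvision import datasets, transforms

# Hypothetical layout: weld_data/<class_name>/*.png with the four categories
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()])
full = datasets.ImageFolder("weld_data", transform=transform)

# 3:1:1 split of the 2000 augmented images -> 1200 / 400 / 400
n = len(full)
n_train, n_val = int(n * 0.6), int(n * 0.2)
train_set, val_set, test_set = torch.utils.data.random_split(
    full, [n_train, n_val, n - n_train - n_val],
    generator=torch.Generator().manual_seed(0))
```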

4.2 Evaluation indicators

This article employs a range of evaluation metrics to assess the model's performance, including Precision, Accuracy, Recall, and F1-score. Additionally, to evaluate the model's efficiency, FLOPs and Parameters are also used. The classification of real samples involves two types: positive and negative samples. TP (True Positive) indicates that positive samples are correctly identified as positive, while TN (True Negative) denotes that negative samples are correctly identified as negative. Conversely, FP (False Positive) occurs when negative samples are incorrectly classified as positive, and FN (False Negative) refers to instances where positive samples are mistakenly identified as negative.

Precision is the proportion of correctly identified positive samples among all samples that are predicted to be positive. The formula for calculating precision is as follows:

$$Precision=\frac{TP}{TP+FP}$$
(2)

Accuracy, which is the percentage of all correctly predicted samples in the total sample, is calculated as follows:

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
(3)

where TP + TN is the sum of the correctly predicted positive and negative classes. Generally, the higher the accuracy, the better the model’s classification effect.

Recall, which is the percentage of the positive samples that are correctly identified, is calculated as follows:

$$Recall=\frac{TP}{TP+FN}$$
(4)

F1-score is a classification metric defined as the harmonic mean of precision and recall, calculated as follows:

$$\mathrm{F1}\text{-}\mathrm{score}=2\times \frac{Precision\times Recall}{Precision+Recall}$$
(5)
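As a sanity check on Eqs. (2)–(5), the sketch below computes these metrics directly from a confusion matrix whose rows are true labels and columns are predicted labels; it is a generic helper, not code from the paper.

```python
import numpy as np

def classification_metrics(cm):
    """Per-class precision, recall, F1, and overall accuracy from a
    confusion matrix (rows: true labels, columns: predicted labels)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp      # predicted as the class but actually another
    fn = cm.sum(axis=1) - tp      # actually the class but predicted as another
    precision = tp / (tp + fp)    # Eq. (2)
    recall = tp / (tp + fn)       # Eq. (4)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (5)
    accuracy = tp.sum() / cm.sum()                      # Eq. (3), multiclass form
    return precision, recall, f1, accuracy
```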

5 Results and analysis

This section uses the enriched weld dataset. The training set, a subset of the entire dataset, is used to train the model, and the validation set is used to make predictions after training and to verify the performance of the improved SqueezeNet network model. The resulting performance is compared using the metrics presented in Sect. 4.2.

5.1 Experimental environment and parameter settings

Training a CNN involves many calculations and requires configuring a corresponding deep learning environment. Thus, the following experiments are conducted on a cloud server with an Intel Xeon Platinum 8225C CPU with 43 GB of memory and an NVIDIA RTX 2080Ti GPU with 11 GB of video memory. The PyTorch 1.11.0 deep learning framework is used together with Python 3.8 in a CUDA environment to build the weld defect identification network.

The input image size of the network is 224 × 224, the batch size is 8, the number of iterations is 200, and the learning rate is 0.002. During training, the adaptive moment estimation (Adam) algorithm dynamically adjusts the learning rate of the parameters so that parameter updates stay within a reasonable range and training remains stable.
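Under these settings, the training loop reduces to the standard PyTorch pattern below; it reuses the improved_squeezenet and train_set sketches from earlier sections and is a schematic, not the authors’ training script.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = improved_squeezenet(num_classes=4).to(device)       # sketch from Sect. 3.4
loader = DataLoader(train_set, batch_size=8, shuffle=True)  # split from Sect. 4.1
optimizer = torch.optim.Adam(model.parameters(), lr=0.002)
criterion = nn.CrossEntropyLoss()

for epoch in range(200):
    model.train()
    for images, labels in loader:  # images are 224 x 224 after resizing
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```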

5.2 Ablation experiments

The effect of each specific improvement on our method’s defect identification performance is investigated through an ablation study involving five experiments. The first experiment uses the original SqueezeNet model as the baseline network [34]. The second adds the residual module to SqueezeNet, and the third modifies the Fire module to adopt the improved depthwise separable method. The fourth incorporates the ECA channel attention mechanism, and the fifth combines the residual module, the improved Fire module, and the ECA channel attention mechanism with the original SqueezeNet model. For clarity, these five models are denoted SqueezeNet0, SqueezeNet1, SqueezeNet2, SqueezeNet3, and SqueezeNet4.

These ablation experiments assess the impact of each structural enhancement on model performance. According to Table 1, SqueezeNet0 has 738,502 parameters. Incorporating the residual module raises SqueezeNet1 slightly to 744,262 parameters. SqueezeNet2, which uses the improved depthwise separable method, reduces the parameter count significantly to 449,862. SqueezeNet3, which introduces the ECA channel attention mechanism, maintains a parameter count comparable to SqueezeNet0, while SqueezeNet4, which integrates all the improvements, has the lowest count at 255,947. In terms of computational complexity, SqueezeNet1 and SqueezeNet3 are on the higher end, whereas SqueezeNet2 and SqueezeNet4 demonstrate lower FLOPs; in particular, SqueezeNet4’s computational cost drops to 356.06M FLOPs, showing very high computational efficiency. Regarding accuracy, SqueezeNet4 achieves the highest performance at 98.00%, while SqueezeNet1 and SqueezeNet2 also achieve commendable accuracies of 97.00% and 97.75%, respectively. These findings show that the structural improvements and their integration effectively enhance the SqueezeNet model’s performance, reducing computational complexity while increasing classification accuracy.

Table 1 Comparison of the improved model effects

5.3 Model comparison experiments

We compare our method against other common classification models for a more comprehensive evaluation. Specifically, we compare the final improved SqueezeNet network, i.e., SqueezeNet4, against ResNet, Inceptionv4, and MobileNetv3. The prediction performance is compared after 200 training iterations on the same platform with the same hyperparameters.

As shown in Table 2, SqueezeNet4 outperforms SqueezeNet0 across all performance metrics thanks to the structural improvements, notably reducing computational complexity to 356.06M FLOPs. Traditional, larger models such as ResNet and Inceptionv4 provide stable performance but also require substantial computing resources; among them, Inceptionv4 has the highest computational demand at 6153.58M FLOPs. Compared with MobileNetv3, SqueezeNet4 maintains high accuracy while being further optimized in parameter count. Indeed, SqueezeNet4 has only 0.26M parameters, confirming that it is lightweight.

Table 2 Comparison of different models

Based on the above comparative data, the improved SqueezeNet significantly reduces the number of parameters and floating-point operations while ensuring high recognition accuracy and precision. Overall, the improved SqueezeNet is very convenient for lightweight deployment.

5.4 Visual analysis of species identification results

After training the improved SqueezeNet, it is tested on the weld defect dataset. The confusion matrix in Fig. 9 shows the relationship between the true and predicted labels for the four welding defect categories (MC, OX, QC, and WD). The rows of the matrix represent the true labels, and the columns represent the predicted labels. The values on the diagonal (47, 49, 50, 50) are the number of correct predictions for each category: 47 samples in the MC category are correctly predicted, 49 in the OX category, and all 50 samples in each of the QC and WD categories. The off-diagonal elements are the prediction errors; for example, 3 samples in the MC category are incorrectly predicted as WD, and 1 sample in the OX category is incorrectly predicted as MC. The largest counts lie on the diagonal, indicating that most defect features are successfully predicted and verifying the effectiveness of the improved model for weld defect identification.

Fig. 9 Confusion matrix

According to the confusion matrix, the accuracy and recall of the improved SqueezeNet model for each weld category can be calculated for specific quantitative analysis (Table 3).

Table 3 Weld defect classification test results

Table 3 reveals that the overall prediction results of the improved model on the weld defect test set are good. The accuracy and recall rates are relatively high; among the categories, the recognition accuracy of welding burrs (MC) is the lowest (94%), while that of welding holes (QC) and no defects (WD) is the highest (100%).

5.5 K-fold cross validation

Cross-validation is a statistical analysis technique employed to assess the generalization capability of a model, specifically its proficiency in predicting unseen data. This method facilitates a more comprehensive evaluation of the model's performance across various subsets, thus enhancing its reliability and accuracy.

K-fold cross-validation is a widely utilized method for validating models, particularly for assessing the generalization capability of statistical models. In this approach, the dataset is divided into K equal-sized subsets. The process involves several key steps: firstly, the entire dataset is randomly split into K non-overlapping subsets, each approximately of the same size. Subsequently, one of these subsets is selected as the test set, while the remaining K-1 subsets serve as the training set. The model is then trained on the training set and evaluated on the test set. This procedure is repeated K times, each time with a different subset acting as the test set and the others as the training set. Ultimately, the model's performance is determined by the average of the results from the K separate tests. This method ensures a comprehensive and fair evaluation of the model by allowing each data point an equal chance to be tested.

In this article, the K value for the K-fold cross-validation method is set to 5. Accuracy and macro-F1 scores are utilized to evaluate the outcomes of the cross-validation process.
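A possible 5-fold evaluation sketch is shown below; train_eval is a hypothetical helper that trains the model on one split and returns the true and predicted labels for the held-out fold, and full is the ImageFolder dataset from Sect. 4.1.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score

labels = np.array(full.targets)   # class label of every image in the dataset
accs, f1s = [], []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(np.zeros(len(labels)), labels):
    # train_eval is a hypothetical helper: train on train_idx, predict test_idx
    y_true, y_pred = train_eval(train_idx, test_idx)
    accs.append(accuracy_score(y_true, y_pred))
    f1s.append(f1_score(y_true, y_pred, average="macro"))
print(f"mean accuracy {np.mean(accs):.4f}, mean macro-F1 {np.mean(f1s):.4f}")
```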

Table 4 presents the results of the 5-fold cross-validation of the SqueezeNet4 model, listing the outcome of each of the five folds. Fold 1 yields an accuracy of 94.75% and an F1-score of 95.00%, and the metrics improve over subsequent folds, reaching an accuracy of 97.75% and an F1-score of 97.00% at fold 5. The consistently high scores across all folds indicate that the model performs reliably on different subsets of the data, suggesting low variance and bias in the evaluation.

Table 4 K-fold cross validation test results

5.6 Specific application

The lightweight welding defect identification algorithm based on a convolutional neural network, proposed in this article, will be implemented in our welding robot to inspect weld quality both during and after the welding process. The welding robot, depicted in Fig. 10, includes components such as a mechanical arm, welding gun, camera, weld sensor, and a four-wheel differential chassis. An Advantech 276 industrial computer serves as the host to control the robot's welding operations. Since the welding control program, various dependency packages, and video monitoring resources are all stored on this industrial computer, the memory available to support the welding defect identification model is limited. Despite these limitations, the proposed welding defect model can meet the real-time requirements for monitoring weld quality throughout the welding process.

Fig. 10 Schematic diagram of the welding robot

6 Conclusion

This paper proposes a CNN-based algorithm for the problem of weld defect recognition in the welding process. Building on the SqueezeNet network, the algorithm improves the Fire module and adds a residual structure and the ECA channel attention mechanism, and experiments demonstrate that the improved model can effectively identify four types of weld defects.