1 Introduction

The safe and stable operation of transmission lines is a basic demand of society, and insulators are an important part of transmission lines. Insulators exposed to the outdoor environment for a long time are prone to breakage, self-explosion, string falling, and other defects. These defects can lead to serious problems such as the interruption of transmission lines and the collapse of power systems. Therefore, insulator defect detection is very important for the stability of the transmission system, and it is also a challenging task [1, 2]. The inspection methods for insulator defects mainly include manual inspection, helicopter inspection [3], robot inspection [4], and drone inspection [5]. Manual detection of insulator defects is inefficient, costly, and prone to fatigue errors. Automatic insulator defect detection based on aerial images has therefore become a prevalent choice. The aerial images are taken by helicopters or unmanned aerial vehicles (UAVs) and sent back to a computer for image processing to assist technical personnel [6]. In recent years, with the rapid improvement of machine vision, object detection technologies based on machine learning have been used extensively in fields such as biological imaging, agriculture, and human-computer interaction [7,8,9,10].

Insulator defect detection algorithms can be divided into two categories: traditional machine learning algorithms and deep learning algorithms. Generally, traditional machine learning algorithms extract features using image processing techniques and then apply a detection technique such as support vector machines, template matching, or adaptive thresholding. Combining color features with spatial features, an insulator defect detection algorithm based on spatial morphological features is proposed in [11]. A laser online monitoring method is proposed in [12] to detect insulator conditions and predict the occurrence of flashover. A fusion algorithm that combines shed contour features and ray similarity matching is proposed in [13]. A composite insulator defect detection method based on Hough transform ellipse detection is proposed in [14]: the Canny operator is used to extract the edge of the insulator, and after elliptic curve fitting, the result is compared with the edge curve of the real insulator to detect the contour of the insulator. The above detection methods are based on traditional machine learning algorithms and belong to the category of shallow learning architectures. Such algorithms have difficulty detecting defects in poor-quality images, which affects the accuracy and robustness of detection.

In object detection, deep learning-based algorithms have been at the center of attention in numerous machine vision tasks [15, 16]. In general, deep learning algorithms are composed of a multilayer neural network that learns features directly from highly nonlinear data without manual feature extraction. The faster region-based convolutional neural network (Faster R-CNN) [17], the single shot detector (SSD) [18], and you only look once (YOLO) [19] are the most popular deep learning networks used in insulator defect detection tasks. The Faster R-Transformer model proposed in [20] combines a self-attention mechanism and a convolutional neural network for insulator detection.

In [21], the residual neural network (ResNet) is combined with the Faster R-CNN network and connected to a fully convolutional network (FCN) to inspect the insulator. However, the false detection rate of the above methods increases because the surrounding detection environment varies greatly. In [22], a detection network combining the lightweight MobileNet network and the YOLOv4 network was proposed, and the reported mAP reaches 93.81%. In this paper, the following problems are addressed to realize insulator defect detection: (1) the generalization capability of the algorithm should be improved to overcome the interference caused by the complexity of aerial images; (2) the detection accuracy and the detection speed should be improved together. An improved method based on the you only look once version 4 (YOLOv4) [23] network is proposed for insulator defect detection in aerial images captured in highly diverse outdoor environments.

The remainder of this paper is arranged as follows. The proposed insulator defect detection method is elaborated in Sect. 2. The experimental details and the results compared with other algorithms are shown in Sect. 3. In Sects. 4 and 5, we discuss the training strategy of the network model and analyze the detection effect in various scenarios. Finally, Sect. 6 is devoted to the conclusion.

2 Materials and Methods

2.1 Proposed Approaches

The China Power Line Insulator Dataset (CPLID) [24] used in this paper is provided by the State Grid Corporation of China and has been published online. It contains 848 high-resolution images, each with a resolution of 1152 × 864 pixels. The dataset consists of two categories: 600 ordinary insulator images and 248 defective insulator images. These images were taken by UAVs from different angles. Figure 1 shows some images from the CPLID. In Fig. 1, from left to right, there are different insulators on power pylons erected on water, in forests, and in villages.

Fig. 1 Typical images in the CPLID

Although the CPLID contains numerous real power transmission scenes, the number of defect images is insufficient for defect detection. In this section, a new data augmentation method composed of the affine transformation and the mosaic method is proposed. The CPLID is expanded: the original images are used as the test set, and the augmented images are used as the training set. The operation details are described below.

The dataset can be augmented directly through general affine transformations, such as translation, scaling, and rotation. In the CPLID, the insulators are mainly located at the center of the aerial image and occupy a large area. Therefore, only the rotation operation is adopted to construct the new dataset, to avoid losing objects during other operations. Each image is rotated by a random angle around its center point. The rotation operation can be calculated as:

$$\left[\begin{array}{c}u\\ v\\ 1\end{array}\right]=\left[\begin{array}{ccc}{cos}\theta & -{sin}\theta & 0\\ {sin}\theta & {cos}\theta & 0\\ 0& 0& 1\end{array}\right]\left[\begin{array}{c}x\\ y\\ 1\end{array}\right]$$
(1)

where (x, y) and (u, v) denote the location of a pixel in the original image and in the transformed image, respectively, and θ represents the rotation angle. Images after random-angle rotation are shown in Fig. 2. For example, Fig. 2a is an image with θ = 0°, and Fig. 2b shows another image with θ = 330°.
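Eq. (1) as written rotates a pixel about the coordinate origin, whereas the text rotates each image about its center point. The following is a minimal sketch of how the two can be reconciled, assuming the center is first translated to the origin and moved back afterwards. The function names and the example point are hypothetical and are not taken from the paper.

```python
import numpy as np

def rotation_about_center(theta_deg, width, height):
    """Return a 3x3 homogeneous matrix rotating pixel coordinates
    by theta_deg around the image center (assumed at width/2, height/2)."""
    theta = np.deg2rad(theta_deg)
    c, s = np.cos(theta), np.sin(theta)
    # The rotation matrix of Eq. (1)
    rotate = np.array([[c, -s, 0.0],
                       [s,  c, 0.0],
                       [0.0, 0.0, 1.0]])
    cx, cy = width / 2.0, height / 2.0
    to_origin = np.array([[1.0, 0.0, -cx],
                          [0.0, 1.0, -cy],
                          [0.0, 0.0, 1.0]])
    back = np.array([[1.0, 0.0, cx],
                     [0.0, 1.0, cy],
                     [0.0, 0.0, 1.0]])
    # Translate to origin, rotate, translate back
    return back @ rotate @ to_origin

def rotate_points(points_xy, matrix):
    """Apply the homogeneous transform to an (N, 2) array of (x, y) points."""
    pts = np.asarray(points_xy, dtype=float)
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))])
    return (matrix @ homogeneous.T).T[:, :2]

# Example on a CPLID-sized image (1152 x 864): a point 100 px to the
# right of the center, rotated by 90 degrees
M = rotation_about_center(90, 1152, 864)
moved = rotate_points([[676, 432]], M)
```

The same matrix can be applied to the corners of each labeled box so that the annotations follow the rotated image; how the paper updates its labels is not stated, so this is only one plausible choice.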

Fig. 2 Affine transformation for different images

After that, the transformed images are processed with the mosaic method, a data augmentation method introduced with YOLOv4 that mixes four images to enrich the background of the detected objects. As illustrated in Fig. 3, the four transformed images are cropped, scaled, randomly spliced into one image, and resized to a specific size. In this figure, the white frames indicate the insulator objects to be detected.
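The splicing step can be illustrated with a simplified sketch. It assumes each of the four images is resized with nearest-neighbor sampling to fill one quadrant around a randomly chosen split point. Cropping and the remapping of bounding boxes, which a full mosaic implementation needs, are omitted here. The function names and the range of the split point are assumptions.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbor resize of an (H, W, C) image using index sampling."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows][:, cols]

def mosaic(images, out_size=416, rng=None):
    """Splice four images into one out_size x out_size canvas."""
    assert len(images) == 4, "mosaic mixes exactly four images"
    rng = np.random.default_rng() if rng is None else rng
    # Random split point, kept away from the borders (assumed range)
    cx = int(rng.integers(out_size // 4, 3 * out_size // 4))
    cy = int(rng.integers(out_size // 4, 3 * out_size // 4))
    canvas = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
    # (y0, y1, x0, x1) for top-left, top-right, bottom-left, bottom-right
    regions = [(0, cy, 0, cx), (0, cy, cx, out_size),
               (cy, out_size, 0, cx), (cy, out_size, cx, out_size)]
    for img, (y0, y1, x0, x1) in zip(images, regions):
        canvas[y0:y1, x0:x1] = resize_nearest(img, y1 - y0, x1 - x0)
    return canvas
```

Because each quadrant is scaled independently, the objects appear at varied sizes and against mixed backgrounds, which is the effect the mosaic method is intended to produce.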

Fig. 3 Example of images with data augmentation

Training the network with the default anchor box sizes has various limitations, such as low detection performance and predicted boxes whose sizes do not match the actual objects. Anchor boxes determined from prior knowledge can significantly improve detection performance. Therefore, the anchor box sizes are redesigned based on K-means, an iterative clustering algorithm.

The training images are resized to 416 × 416. Three feature maps of sizes 13 × 13, 26 × 26, and 52 × 52 are extracted, and each feature map is assigned three anchor boxes. The ratio of width to height is regarded as the clustering object. All the ratio data are divided into nine groups, and the cluster center of each group is selected through multiple clustering runs. The average of all clustering results is set as the final anchor box size for network training. The anchor box sizes are set to (23,22), (86,21), (112,35), (186,46), (251,180), (279,87), (291,47), (293,65), and (294,121), respectively.
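A minimal sketch of anchor clustering is given below. It uses the variant commonly paired with YOLO, which clusters (width, height) pairs with IoU as the similarity measure. This is an assumption and may differ from the exact procedure used in this paper, which describes the width-to-height ratio as the clustering object and averages several runs. All function names are hypothetical.

```python
import numpy as np

def wh_iou(boxes, centers):
    """IoU between (w, h) pairs, assuming all boxes share one corner."""
    inter = (np.minimum(boxes[:, None, 0], centers[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centers[None, :, 1]))
    area_b = boxes[:, 0] * boxes[:, 1]
    area_c = centers[:, 0] * centers[:, 1]
    return inter / (area_b[:, None] + area_c[None, :] - inter)

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster (w, h) pairs into k anchors, sorted by area."""
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # Assign each box to the center with the highest IoU
        assign = np.argmax(wh_iou(boxes, centers), axis=1)
        # Update each center as the mean of its members (keep it if empty)
        new_centers = np.array([
            boxes[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]
```

Running the function with several seeds and averaging the sorted results would correspond to the "multiple clustering" step described above.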

2.2 Insulator Defect Detection Method

The YOLO network is an end-to-end object detection algorithm proposed in 2016 that treats the detection task directly as a regression task. The advantage of the algorithm is that the context information of images can be extracted, which effectively improves accuracy and detection speed. The schematic diagram of insulator defect detection using the YOLO network is shown in Fig. 4. As shown in Fig. 4a, the input insulator image is divided into a 7 × 7 grid, and 9 bounding boxes need to be predicted for each grid cell. Furthermore, each bounding box contains 5 values: x, y, w, h, and a confidence score. Specifically, x and y represent the coordinates of the center of the target to be detected, w and h represent its width and height, and the confidence score denotes the probability that the detected target belongs to the defect or insulator class. Figure 4b draws all the prior boxes (7 × 7 × 9) of the insulator image. Some of these boxes are thicker and some are thinner, which indicates different confidence levels: thicker borders denote higher confidence and thinner borders denote lower confidence. Figure 4c depicts the target classes to which the different grid cells belong, with purple indicating the image background, yellow indicating the insulator string, and red indicating the insulator defect. Finally, the location of the defect target is found by non-maximum suppression and confidence comparison, and it is framed in the input insulator image as shown in Fig. 4d.
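The final filtering step can be sketched as a greedy non-maximum suppression. The version below is a minimal illustration, assuming boxes are given as corner coordinates (x1, y1, x2, y2) and that suppression is applied to one class at a time. The 0.5 threshold and the function names are assumptions.

```python
import numpy as np

def iou_with_many(box, boxes):
    """IoU between one box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept, highest confidence first."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(-np.asarray(scores, dtype=float))
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        # Drop every remaining box that overlaps the kept box too much
        overlaps = iou_with_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

# Two overlapping candidates and one separate candidate
kept = non_max_suppression(
    [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]],
    [0.9, 0.8, 0.7],
)
```

In this example the second box overlaps the first with an IoU of about 0.68, so it is suppressed, and the first and third boxes are kept.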

Fig. 4 Schematic diagram of insulator defects detection using YOLO network

Fig. 5 Schematic diagram of YOLOv4 network

The network structure of YOLOv4 is shown in Fig. 5. When YOLOv4 is used to detect insulator defects, the image is first input into YOLOv4's backbone network, cross-stage-partial-connections (CSP) Darknet53, which improves on the YOLOv3 backbone network [25]. The CSPNet structure [26] is added to enrich the insulator image features and improve the detection accuracy of the network. Secondly, the feature map of the insulator image is generated by stacked CBL convolution blocks, each of which consists of a convolution (Conv) layer, a batch normalization (BN) layer, and a Leaky ReLU layer. Thirdly, the feature map passes through the spatial pyramid pooling network (SPP-Net), which increases the receptive field [27]. Then, the top-down feature pyramid network (FPN) [28] is used to convey semantic features, and the bottom-up path aggregation network (PAN) [29] is used to convey positioning features, so as to enhance the ability of the network to extract insulator features. Finally, the prediction results of YOLOv4 are output.

Next, this section introduces the framework of the power line insulator defect detection procedure. The flow chart is shown in Fig. 6.

Fig. 6 The flowchart of the insulator defect detection

First, the original dataset is processed through data augmentation, and the anchor box sizes are redesigned based on the K-means algorithm. Second, the training set is input into the network to obtain the optimal model, and the trained model is evaluated on the test set. Finally, the images are input into the trained model for detection. Moreover, to verify whether the defects obtained from the model are labeled correctly in the image, each defect should be located within the area of an insulator object. If the location of a defect is outside the insulator area, it is considered an erroneous detection. In that case, a warning message is sent and the image is examined manually.
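The consistency check between defect and insulator boxes can be sketched as follows. The sketch assumes a defect counts as "located in the insulator area" when the center of its box falls inside at least one insulator box. The paper does not state the exact criterion, so this rule and the function names are assumptions.

```python
def center_inside(defect, insulator):
    """True if the center of the defect box lies inside the insulator box.
    Boxes are (x1, y1, x2, y2) in pixel coordinates."""
    cx = (defect[0] + defect[2]) / 2.0
    cy = (defect[1] + defect[3]) / 2.0
    return insulator[0] <= cx <= insulator[2] and insulator[1] <= cy <= insulator[3]

def flag_suspicious_defects(defects, insulators):
    """Return indices of defect boxes that lie in no insulator box.
    These would trigger the warning and manual examination."""
    return [
        i for i, defect in enumerate(defects)
        if not any(center_inside(defect, ins) for ins in insulators)
    ]

# One insulator, one plausible defect, and one defect far from the insulator
insulators = [(100, 100, 400, 200)]
defects = [(150, 120, 180, 160), (500, 500, 520, 520)]
suspicious = flag_suspicious_defects(defects, insulators)
```

A stricter rule, such as requiring the whole defect box to be contained in the insulator box, could be substituted by changing `center_inside` only.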

The pseudo code of the feature extraction algorithm in this paper is as follows:

figure a

2.3 Evaluation Metrics

Five common indices, namely precision, recall, F1 score, mean average precision (mAP), and frames per second (FPS), are used to evaluate our defect detection method. A prediction is counted as correct when the intersection-over-union (IoU), which represents the coincidence rate between the predicted box and the actual box, is greater than a threshold. The formulas for calculating precision and recall are as follows:

$$ {\text{Precision}} = \frac{{TP}}{{TP + FP}} $$
(2)
$$ {\text{Recall}} = \frac{{TP}}{{TP + FN}} $$
(3)

where TP is the number of objects detected correctly, FP is the number of objects detected incorrectly, and FN is the number of objects that are missed. Precision and recall are a pair of conflicting indicators: when one increases, the other usually decreases. Therefore, the F1 score is used to balance the two. The F1 score is calculated as follows:

$$F1=\frac{2\times P\times R}{P+R}$$
(4)
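Eqs. (2) to (4), together with the IoU criterion, can be written as a short sketch. The corner-coordinate box format (x1, y1, x2, y2), the handling of empty denominators, and the function names are assumptions made for illustration.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from detection counts, Eqs. (2)-(4)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Two boxes of area 100 that share half of their area
overlap = box_iou((0, 0, 10, 10), (5, 0, 15, 10))
# 8 correct detections, 2 false detections, 2 missed objects
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

Here the overlap is 50/150, so this pair would not pass a 0.5 IoU threshold and the prediction would be counted as a false positive.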

The AP value is the area under the precision-recall curve. Generally, models with good detection performance have a higher AP value. The mAP is the average AP over all object categories, which measures the performance of the model across categories. The formulas are:

$$AP={\int }_{0}^{1}P\left(R\right)dR$$
(5)
$$ {\text{mAP}} = \frac{1}{n}\sum\limits_{{i = 1}}^{n} A P_{i} $$
(6)
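A numerical version of Eqs. (5) and (6) is sketched below. It assumes the integral is evaluated with the all-point interpolation commonly used for PASCAL VOC-style evaluation, in which the precision curve is first made monotonically non-increasing. The paper does not state which integration scheme it uses, and the function names are hypothetical.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve for one class.
    scores: confidence of each detection
    is_tp:  1 if the detection matched a ground-truth box, else 0
    num_gt: number of ground-truth objects of this class"""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # Pad the curve and take the monotone envelope of the precision
    r = np.concatenate([[0.0], recall, [1.0]])
    p = np.concatenate([[0.0], precision, [0.0]])
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum rectangle areas where the recall changes
    idx = np.nonzero(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """Eq. (6): average of the per-class AP values."""
    return float(np.mean(ap_per_class))

# Three detections for a class with two ground-truth objects
ap_example = average_precision([0.9, 0.8, 0.7], [1, 0, 1], num_gt=2)
```

In this example the recall reaches 0.5 at precision 1 and 1.0 at precision 2/3, so the area is 0.5 × 1 + 0.5 × 2/3.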

The FPS denotes the number of frames per second. The FPS is affected by hardware conditions, so it is generally tested under the same hardware conditions.

3 Experiment and Results

The data augmentation method is used to expand the CPLID from 848 to 2544 images, including 1800 normal insulator images and 744 defect images. The deep learning framework is TensorFlow. Other configurations of the system include an NVIDIA Quadro P2200 GPU, an Intel Core i9-9900K CPU at 3.6 GHz, GPU acceleration library 10.1, and the Windows 10 operating system. In the YOLOv4 paper, accuracy is improved by introducing features including mosaic data augmentation, the cosine annealing scheduler, and label smoothing. After studying the impact of these features during training (detailed in the discussion section), only mosaic data augmentation is adopted in this task.

The features extracted in the backbone layers are universal. Inspired by transfer learning, the training phase is divided into two stages, namely the freezing stage and the thawing stage. At the beginning of training, the backbone network layers are frozen so that more memory is available for training the deeper layers, which also reduces the time needed for training. During the freezing stage, the initial learning rate is set to 0.001 and the maximum learning rate is set to 0.01. During the thawing stage, the initial learning rate is 0.0001 and the maximum learning rate is 0.001. In each stage, the model is trained for 25 epochs, and the batch size is set to 2.

3.1 Comparison with Other Defect Detection Methods

This section verifies the performance of the model in terms of detection accuracy, computation, and detection speed. The proposed method and the other models, including SSD, Faster-RCNN, YOLOv3 [30], YOLOv4-tiny, and YOLOv5x [31], are evaluated on the same dataset. The SSD is a typical one-stage object detection algorithm, which adopts an end-to-end detection method. The Faster-RCNN is a popular two-stage object detection algorithm with high detection accuracy. The YOLOv3 is a one-stage object detection algorithm with real-time performance. The YOLOv4-tiny is a streamlined version of YOLOv4, which reduces the number of layers requiring feature fusion. YOLOv5 was released after YOLOv4, and it includes four versions: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. The detection results of SSD, Faster-RCNN, YOLOv3, YOLOv4-tiny, YOLOv5x, and the proposed method are shown in Table 1.

Table 1 Detection performance of several detection models

YOLOv5x and YOLOv4(ours) achieve high mAP scores of 99.01% and 99.08%, respectively. The precision of YOLOv4(ours) is 91%. Meanwhile, its AP value for the defect class is 100%, which indicates that all defects have been detected. The F1 value of YOLOv4(ours) is also significantly better than that of the other algorithms. The recall of YOLOv5x and YOLOv4(ours) is 91.98% and 98.84%, and their precision is 99.18% and 91%, respectively. Achieving high precision and high recall simultaneously is very difficult in practice, so one must decide which is more important for the task. Usually, an increase in precision is accompanied by a decrease in recall, and vice versa.

The Faster R-CNN network is a typical two-stage object detection network, which needs to generate candidate regions of size W×H×K through the region proposal network (RPN) operation, thus its FPS value is low. Moreover, the ROI-Pooling layer in the Faster R-CNN network is a fully connected layer, which will generate many repeated operations when training the network, thereby reducing the training speed of the model. All the feature information used for classification and localization of SSD comes from different feature layers, and the size of the a priori frame is fixed, which causes the detection result to not match the target size in the image, and the recognition and localization of small targets are not accurate enough. Therefore, the results in Table 1 show that Faster R-CNN and SSD have lower AP values for defects.

The specific experimental results are shown in Fig. 7, and the precision and recall curves of YOLOv4(ours) are shown in Fig. 8. Consistent with the numerical results in Table 1, the superiority of YOLOv4(ours) in metrics such as Recall, mAP@0.5, and F1 is clearly depicted in Fig. 7. Specifically, SSD and Faster-RCNN perform poorly on mAP@0.5 and F1, with values below 50, while YOLOv4(ours) approaches 100. On the two conflicting metrics, Precision and Recall, our algorithm behaves similarly to YOLOv5 overall, but it is superior to YOLOv5 on Recall. In terms of detection speed, our algorithm and YOLOv5 achieve a similar level of performance, both around 60 frames per second. The experimental results show that the YOLOv4-based detection algorithm proposed in this paper is well suited to insulator defect detection and provides a competitive solution for the intelligent inspection of the power grid.

Fig. 7 The performance of various algorithms

Fig. 8 The precision and recall curves of insulator defects detection

The detailed cross-sectional comparisons between our model and previous works are shown in Table 2. In this table, the three previous insulator defect detection models with the best results are selected and named No.1–No.3, respectively. Specifically, No.1 refers to the defect detection algorithm based on YOLOv5 from Hefei Institutes of Physical Science [15]. According to the paper, its mAP can reach 99.05% and its accuracy can reach 86.8%. No.2 is the algorithm from Zhengzhou University [21], which incorporates the Faster R-CNN network and the Resnet-101 network, with an accuracy of 96.83%. No.3 represents the algorithm from Wuhan University, which has a relatively high mAP and accuracy of 93.81% and 97.26%, respectively. No.4 is the insulator defect detection algorithm based on YOLOv4 proposed in this paper.

Table 2 Comparisons between our results and previous works.

To further verify the generality of the proposed method, experiments are performed with the VisDrone 2019 dataset [32, 33]. Recognizing objects in UAV images has been a hot topic in the field of object detection. The VisDrone 2019 dataset was collected by the AISKYEYE team at the Machine Learning and Data Mining Lab of Tianjin University and used as the competition dataset at the International Conference on Computer Vision (ICCV) 2019 workshop. It contains 8599 images, including 6471 images as training set, 548 images as validation set, and 1610 images as test set. There are a total of 10 categories in this dataset: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor.

Table 3 YOLOv4 versus YOLOv3 performance for VisDrone

For a more comprehensive comparison, a series of experiments has been carried out, and the results are presented in Table 3. The detection results of our improved method are compared with those of YOLOv3 and YOLOv4-tiny. Specifically, the first column of Table 3 gives the network name, the next four columns list the AP values of typical categories in the dataset, namely car, bus, van, and truck, and the last column is the mAP value. The best detection mAP of 10.68 is reported by our proposed method, and its AP values for bus, van, and truck are also higher than those of YOLOv3 and YOLOv4-tiny. In conclusion, the proposed algorithm has better detection performance on the VisDrone dataset.

3.2 Effect of Improved Methods

A series of comparison experiments have been conducted to evaluate the effectiveness of the proposed method. The detection results of the YOLOv4 algorithm trained with different improved methods are shown in Table 4. The first row represents training with the original method, the second row represents training after data augmentation, the third row represents training after redesigning the anchor box sizes, and the last row represents training with both data augmentation and anchor box redesign.

It can be found that the precision decreases slightly when training with the new anchor boxes alone, which may be due to insufficient samples. The precision is significantly improved after data augmentation. When data augmentation and the K-means method are used together, the precision is greatly improved. This shows that combining the two methods improves the detection performance.

Table 4 The effect of data augmentation and anchor box redesign

4 Discussion

4.1 Influence of Features on Training

In this section, the effects of different features on training, including cosine annealing scheduler, label smoothing, and mosaic data augmentation, are introduced and analyzed.

4.1.1 Altering Learning Rate with the Cosine Annealing Scheduler

The learning rate is updated by the cosine annealing scheduler to simulate the learning restart process. It starts with a large learning rate, drops to the minimum value relatively quickly, and then increases rapidly again, which helps prevent the weights from falling into a local minimum during gradient descent. A schematic diagram of updating the learning rate based on the cosine annealing scheduler is shown in Fig. 9. In the figure, the vertical axis represents the learning rate and the horizontal axis corresponds to the training progress, which is directly related to time.
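The schedule can be sketched with the usual cosine annealing formula with restarts. The cycle length and the learning rate bounds below are placeholder values chosen for illustration and are not reported settings from this paper. The function name is hypothetical.

```python
import math

def cosine_annealing_lr(step, cycle_len, lr_max, lr_min):
    """Learning rate at a given step for cosine annealing with restarts.
    Within each cycle the rate decays from lr_max towards lr_min along a
    half cosine, then jumps back to lr_max at the start of the next cycle."""
    t = step % cycle_len  # position inside the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / cycle_len))

# Two cycles of 10 steps each (placeholder values)
schedule = [cosine_annealing_lr(s, 10, lr_max=0.01, lr_min=0.001) for s in range(20)]
```

The jump back to the maximum at the start of each cycle is the restart that the text describes as a rapid increase of the learning rate.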

Fig. 9 Schematic diagram of updating learning rate based on cosine annealing scheduler

4.1.2 Label Smoothing

In addition, to prevent the network from overfitting and to improve its generalization ability during training, label smoothing is used to convert hard labels to soft labels. It is a regularization method that adds a penalty factor to soften the output labels. The method is implemented as:

$$\hat{y}_{i} = y_{{hot}} \left( {1 - \alpha } \right) + \alpha /num$$
(7)

where \({\widehat{y}}_{i}\) represents the smoothed label, \(y_{{{{hot}}}}\) represents the original one-hot label, \(\alpha\) represents the smoothing factor, and \(num\) represents the number of categories. In this paper, the smoothing factor is set to 0.1.
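Eq. (7) can be checked with a short sketch. It assumes two categories (insulator and defect), which matches the CPLID classes described earlier. The function name is hypothetical.

```python
import numpy as np

def smooth_labels(y_hot, alpha=0.1):
    """Apply Eq. (7): y = y_hot * (1 - alpha) + alpha / num,
    where num is the number of categories (last axis of y_hot)."""
    y_hot = np.asarray(y_hot, dtype=float)
    num = y_hot.shape[-1]
    return y_hot * (1.0 - alpha) + alpha / num

# One-hot label for the first of two categories
soft = smooth_labels([1, 0], alpha=0.1)
```

With α = 0.1 and two categories, the hard label (1, 0) becomes (0.95, 0.05), and the smoothed values still sum to 1.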

The network is trained with different strategies, each of which is tested with different IoU values. The details of the training strategies are shown in Table 5, and the mAP values of the detection results are shown in Table 6. IoU@0.25, IoU@0.50, IoU@0.75, and IoU@0.90 represent the mAP values of the network prediction results when the IoU threshold is set to 0.25, 0.5, 0.75, and 0.9, respectively. It can be seen from Table 6 that the accuracy of the algorithm decreases as the IoU threshold increases. For all four strategies, the mAP drops to its minimum when the IoU threshold is set to 0.9. The reason is that when the IoU threshold is set too high, predicted boxes that do not overlap the actual box closely enough are filtered out, so the corresponding objects are counted as missed. When the IoU threshold is set to 0.5 or 0.75, the mAP value of strategy 2 is significantly higher than that of the other strategies. Based on the above experimental results, and in order to improve the recall of the prediction results, the IoU threshold is set to the relatively low value of 0.5 in this paper.

Table 5 Training network under different feature combinations
Table 6 Experimental results of different strategies

5 Robustness Test

Some typical images with different light intensities and different background complexity are tested, and the results are compared with those of other models to further validate the detection effectiveness of the proposed method.

The results are shown in Figs. 10, 11, and 12, respectively.

Figure 10 shows a comparison of the detection results under good light conditions. It can be seen that all objects in the image are recognized by YOLOv4(ours) in Fig. 10f, and the predicted boxes match the actual boxes best. In Fig. 10a, both the defect and the distant insulator are missed. In Fig. 10c, the defect is detected by YOLOv3, but the insulator in the distance is missed. The Faster R-CNN, YOLOv4-tiny, and YOLOv5x accurately locate the insulator, but all miss the defect, as shown in Fig. 10b, d, and e.

Fig. 10 The detection result under good light conditions

Fig. 11 The detection result under the poor light conditions

Figure 11 shows the detection result of an image with poor light conditions. As shown in Fig. 11f, the proposed network still accurately locates all the objects. In Fig. 11d, the defect is detected by YOLOv4-tiny, but the location is not accurate enough, and the bounding box of the defect is larger than the actual box. The other algorithms, as shown in Fig. 11a, b, c, and e, only locate the insulators and miss the defects. These algorithms struggle to cope with changes in the outdoor detection environment.

Figure 12 shows the detection comparison of different algorithms under complex backgrounds. Because the shooting angle cannot be precisely controlled, some of the insulators in the images are occluded. As shown in Fig. 12a, b, c, and d, the insulators occluded by the transmission line pylon are missed. In Fig. 12f, all objects are detected by YOLOv4(ours). The proposed method has high robustness and high detection accuracy, and is therefore more suitable for insulator defect detection.

Fig. 12 The detection result under complex background

6 Conclusions

This paper discusses insulator defect detection in aerial images of the power grid. Firstly, the YOLOv4 network is used as the basic model to analyze the defect features, and the anchor box sizes are optimized. Secondly, in response to the limited amount of data, the dataset is expanded by the proposed augmentation method. Finally, under the same experimental conditions, the proposed method is compared with other advanced algorithms. This paper draws the following conclusions from the experiments:

(1) By implementing data augmentation and anchor box redesign, our proposed method is superior to the other comparison algorithms, including SSD, Faster-RCNN, and released versions of YOLO. Compared with YOLOv4-tiny, the streamlined version of YOLOv4, our algorithm improves the detection precision by 37.2%. Moreover, it also outperforms YOLOv5x in several evaluation metrics of the insulator defect detection results, such as Recall, AP value, mAP, and F1.

(2) We also conduct a cross-sectional comparison of our proposed method with previous works and further validate its generality on the VisDrone 2019 dataset. The experimental results all show the effectiveness and superiority of the proposed method.

(3) The robustness test results demonstrate that our proposed method performs well under different light intensities and complex environmental backgrounds and can accurately detect all targets, which is significantly better than the other comparison algorithms.

However, the inspection task for power transmission systems covers not only the defect detection of insulators but also other common components that may cause power grid failures, including poles, ground wires, and fittings. In the future, multi-fault detection and classification will be carried out at the same time, and visual intelligent detection software will be built to realize intelligent inspection of the power transmission system.

7 Supporting information

Images and data from this study are available on Figshare at https://figshare.com/articles/dataset/Power_Line_Insulators_Dataset/11826483 and at https://doi.org/10.5281/zenodo.3656611.