
1 Introduction

Electricity, as a critical basic industry, has a significant impact on people's lives and property, the national economy, and overall economic development [1]. However, because power transmission lines are exposed to natural elements such as rain, snow, and airborne debris for long periods, they are prone to problems such as foreign object attachment [2]. If these issues are not addressed promptly, they can compromise the stability of the power grid. Current solutions often rely on computer vision techniques to detect foreign objects on power transmission lines. For example, Zhang proposed a method named RCNN4SPTL, which replaces the feature extraction network of Faster RCNN with the more lightweight SPTL-Net, thereby improving detection speed [3]. Wang et al. proposed an SSD-based method for power transmission line detection and studied how different feature extraction networks and network parameters affect the accuracy and speed of object detection [4]. Jiang et al. detected bird nests on power transmission lines, cropping the detection results into sub-images and using an HSV color space model to filter out those that do not contain nests, which significantly improves detection accuracy [5]. Qiu focused on birds as an object category and proposed a lightweight YOLOv4-tiny network for bird detection on power transmission lines, providing a basis for preventing bird-related grid outages [6]. In real-world scenarios, however, the foreign objects that can intrude on power transmission lines include birds, bird nests, balloons, and kites [7]. The images captured by drones and other equipment have high resolution, while these foreign objects are relatively small, which makes foreign object intrusion detection challenging. This article proposes a lightweight hybrid object detection network named GridFormer that combines the inductive bias of convolution with the global modeling capability of transformers [8] to achieve good generalization on object detection tasks [9, 10]. The method strikes a new tradeoff between computational cost and detection accuracy, providing an effective solution for foreign object intrusion detection on power transmission lines.

2 Methods

2.1 Model Design

This paper constructs a lightweight object detection model based on a CNN-Transformer network, shown in Fig. 1. It consists of three parts. The first is a hybrid feature extraction network that extracts semantic information from the image. The second is a feature fusion part that fuses feature maps at different levels through up-sampling and down-sampling operations to generate a multi-scale feature pyramid. The last is the classification and localization part, which introduces an auxiliary head and a dynamic label assignment strategy so that the middle layers of the network receive richer gradient information and learn more during training. The model combines the advantages of convolution and the transformer to seek a Pareto improvement in computational cost and detection accuracy on the object detection task.

Fig. 1. The overall framework of GridFormer.

In this paper, the self-attention mechanism is used only in the deep part of the network. Because the feature maps there are small enough, this avoids dividing the feature map into patches, reduces the influence of positional encoding, and preserves the overall consistency of the feature map.

2.2 Attention Block

The structure of the Attention Block follows the design of the Transformer encoder and is shown in the dashed box in Fig. 1. ViT usually uses absolute position encoding, where each patch corresponds to a unique position code, so the network cannot achieve translation invariance. The Attention Block instead uses a 3\(\,\times \,\)3 depthwise separable convolution to introduce the translation invariance of convolution into the Transformer module, and stabilizes network training with a residual connection. The computation is shown in Eq. 1:

$$\begin{aligned} f(x)=DWConv(x)+x \end{aligned}$$
(1)
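A minimal PyTorch sketch of this residual depthwise convolution is given below; the module name and channel argument are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class ResidualDWConv(nn.Module):
    """Residual 3x3 depthwise convolution, f(x) = DWConv(x) + x (Eq. 1)."""
    def __init__(self, channels: int):
        super().__init__()
        # groups=channels makes the 3x3 convolution depthwise
        self.dwconv = nn.Conv2d(channels, channels, kernel_size=3,
                                padding=1, groups=channels, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dwconv(x) + x
```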

The key design in the Attention Block is LMHSA. Given an input of size \(\mathbb {R}^{n*c}\), the original multi-head attention mechanism first generates the corresponding Query, Key, and Value. Then, by taking the dot product of Query and Key, it produces a weight matrix of size \(\mathbb {R}^{n*n}\):

$$\begin{aligned} Attention(Q,K,V)=softmax(\frac{QK^T}{\sqrt{d}})*V \end{aligned}$$
(2)

Because the input features are large, this process consumes considerable computational resources, making the network difficult to train and deploy. We therefore use a \(k\times k\) average pooling kernel to downsample the Key and Value branches. The calculation process of the lightweight self-attention mechanism is shown in Fig. 2:

Fig. 2. Calculation process diagram of the lightweight self-attention mechanism.

This yields two smaller feature maps \(K'\) and \(V'\):

$$\begin{aligned} K'=AvgPool(K)\in \mathbb {R}^{\frac{n}{k^2}*c} \end{aligned}$$
(3)
$$\begin{aligned} V'=AvgPool(V)\in \mathbb {R}^{\frac{n}{k^2}*c} \end{aligned}$$
(4)

where k is the kernel size and stride of the pooling operation, n is the product of the input feature map's height H and width W, and c is the number of feature map channels. Introducing a pooling layer into the self-attention module to downsample the feature map saves computation and memory. LMHSA is composed of multiple lightweight self-attention heads, see Eq. 5:

$$\begin{aligned} LMHSA(x)=Concat_{i=[1:h]} [softmax(\frac{Q_i{K'_i}^T}{\sqrt{d}})*V'_i] \in \mathbb {R}^{n*d} \end{aligned}$$
(5)

The computational complexity of LMHSA is:

$$\begin{aligned} O(LMHSA)=h*(\frac{n*c^2}{k^2}+\frac{n^2*c}{k^4}+O_\phi (\frac{n^2}{k^2})) \end{aligned}$$
(6)

where n is the product of the length and width of the output feature map, h is the number of heads of the multi-head self-attention mechanism, k is the kernel size and step size of the pooling kernel, c is the number of channels of the input feature map and the output feature map, and \(\phi \) denotes the softmax activation function.

The computational complexity of the standard self-attention mechanism can be expressed as:

$$\begin{aligned} O(MHSA)=h*(n*c^2+n^2*c+O_\phi (n^2)) \end{aligned}$$
(7)

Compared with MHSA, the computational cost of LMHSA is about \(\frac{1}{k^2}\) that of MHSA. LMHSA therefore effectively reduces computational cost, and the smaller matrix operations are friendlier to network training and inference.
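A simplified PyTorch sketch of LMHSA as described above follows; the exact projection layout and module interface are our assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class LMHSA(nn.Module):
    """Lightweight multi-head self-attention: Key and Value are spatially
    downsampled by k x k average pooling before attention (Eqs. 2-5)."""
    def __init__(self, channels: int, heads: int, k: int):
        super().__init__()
        assert channels % heads == 0
        self.heads, self.d = heads, channels // heads
        self.scale = self.d ** -0.5
        self.q = nn.Linear(channels, channels)
        self.kv = nn.Linear(channels, 2 * channels)
        self.pool = nn.AvgPool2d(kernel_size=k, stride=k)
        self.proj = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # x: (B, n, c) token sequence with n = H * W
        B, n, c = x.shape
        q = self.q(x).reshape(B, n, self.heads, self.d).transpose(1, 2)
        k_full, v_full = self.kv(x).chunk(2, dim=-1)          # (B, n, c) each

        def downsample(t: torch.Tensor) -> torch.Tensor:
            # K' = AvgPool(K), V' = AvgPool(V) over the H x W grid (Eqs. 3-4)
            t = t.transpose(1, 2).reshape(B, c, H, W)
            t = self.pool(t).flatten(2).transpose(1, 2)       # (B, n/k^2, c)
            return t.reshape(B, -1, self.heads, self.d).transpose(1, 2)

        k_, v_ = downsample(k_full), downsample(v_full)
        attn = (q @ k_.transpose(-2, -1)) * self.scale        # (B, h, n, n/k^2)
        out = attn.softmax(dim=-1) @ v_                       # (B, h, n, d)
        out = out.transpose(1, 2).reshape(B, n, c)
        return self.proj(out)
```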

3 Training Methods

To improve accuracy and generalization ability, this paper introduces Mixup [13] and Mosaic [14] data augmentation, cosine annealing [15], and label smoothing [16] to train the model.

3.1 Data Augmentation

The currently available foreign object dataset is small, and the amount of training data is crucial to how well the network model trains. We therefore use data augmentation to expand the dataset. Commonly used augmentation methods include spatial and colour transformations. In addition, this paper uses image-mixing augmentation methods such as Mosaic and Mixup. Mosaic augmentation improves on CutMix and aims to enrich the image background while improving the model's detection of small objects.
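As an illustration, Mixup blends pairs of images with a Beta-sampled coefficient. The sketch below is a generic classification-style formulation (for detection, the box annotations of both images are typically kept), not the paper's exact augmentation pipeline.

```python
import numpy as np
import torch

def mixup(images: torch.Tensor, labels: torch.Tensor, alpha: float = 0.5):
    """Blend a batch with a shuffled copy of itself: x = lam * x_i + (1 - lam) * x_j."""
    lam = float(np.random.beta(alpha, alpha))
    index = torch.randperm(images.size(0))
    mixed = lam * images + (1.0 - lam) * images[index]
    # The loss is mixed accordingly: lam * loss(labels) + (1 - lam) * loss(labels[index])
    return mixed, labels, labels[index], lam
```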

3.2 Cosine Annealing

To prevent the model from getting stuck in local optima, a learning rate schedule can be used, and one standard choice is cosine annealing. The learning rate is gradually reduced along a cosine curve so that the model can better search for the global optimum during training. This strategy is widely used in deep neural network training and works well in practice. The learning rate is reduced by the cosine annealing function, denoted as

$$\begin{aligned} \eta _t=\eta _{min}^i+\frac{1}{2}(\eta _{max}^i-\eta _{min}^i)[1+cos(\frac{T_{cur}}{T_i}\pi )] \end{aligned}$$
(8)

where \(\eta _{max}\) and \(\eta _{min}\) are the maximum and minimum values of the learning rate, respectively, and \(T_{cur}\) and \(T_i\) are the current iteration and the total number of iterations in the cycle, respectively.
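A small sketch of Eq. 8 as a plain function follows (PyTorch's `torch.optim.lr_scheduler.CosineAnnealingLR` implements the same schedule); the default values match the learning rates used in Sect. 4.2.

```python
import math

def cosine_annealing_lr(t_cur: int, t_total: int,
                        lr_max: float = 1e-3, lr_min: float = 1e-5) -> float:
    """Eq. 8: decay the learning rate from lr_max to lr_min along a cosine curve."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / t_total))

# Learning rate at the start, middle, and end of a 100-step cycle
for t in (0, 50, 100):
    print(t, cosine_annealing_lr(t, 100))
```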

3.3 Label Smoothing

One-hot encoded labels in multi-class problems tend to cause overfitting because the model pushes predicted probabilities toward 1. To address this, label smoothing can be used to soften the model's targets and reduce the risk of overfitting. Label smoothing smooths the categorical labels as:

$$\begin{aligned} y'_i=y_i(1-\varepsilon )+\frac{\varepsilon }{M} \end{aligned}$$
(9)

where \(y'_i\) is the label after label smoothing, \(y_i\) is the one-hot label encoding, M is the number of categories, and \(\varepsilon \) is the label smoothing hyperparameter.
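A minimal sketch of Eq. 9 applied to one-hot labels is shown below (illustrative only; where smoothing enters the detection loss is not shown).

```python
import torch

def smooth_labels(one_hot: torch.Tensor, eps: float = 0.005) -> torch.Tensor:
    """Eq. 9: y' = y * (1 - eps) + eps / M, with M the number of categories."""
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / num_classes

# Example with M = 4 categories (nest, balloon, kite, trash) and eps = 0.005
y = torch.tensor([0.0, 1.0, 0.0, 0.0])
print(smooth_labels(y))  # approx. [0.0013, 0.9963, 0.0013, 0.0013]
```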

3.4 Anchor Clustering

The scheme chosen in this paper is an anchor-based object detection model. Manually chosen anchors make it difficult to improve accuracy, so we first use the K-means clustering algorithm to cluster the manually labelled ground-truth bounding boxes in the training set to obtain suitable anchor sizes, and then select nine anchors based on the average IoU to predict bounding boxes and improve detection accuracy. The clustering steps are as follows (a code sketch follows the list):

1. Randomly select N boxes as initial anchors;

2. Using the IoU metric, assign each box to the anchor closest to it;

3. Calculate the mean width and height of all boxes in each cluster and update each anchor accordingly;

4. Repeat steps 2 and 3 until the anchors no longer change or the maximum number of iterations is reached.
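A minimal NumPy sketch of this IoU-based K-means is given below; the 1 − IoU distance formulation and the function names are our assumptions, as the paper does not give implementation details.

```python
import numpy as np

def wh_iou(boxes: np.ndarray, anchors: np.ndarray) -> np.ndarray:
    """IoU between (w, h) pairs, assuming boxes and anchors share a top-left corner."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes: np.ndarray, n: int = 9, max_iters: int = 300, seed: int = 0) -> np.ndarray:
    """Cluster ground-truth (w, h) boxes with 1 - IoU as the distance (steps 1-4 above)."""
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), n, replace=False)].copy()  # step 1
    assign = np.full(len(boxes), -1)
    for _ in range(max_iters):
        new_assign = wh_iou(boxes, anchors).argmax(axis=1)            # step 2
        if (new_assign == assign).all():                              # step 4
            break
        assign = new_assign
        for i in range(n):                                            # step 3
            if (assign == i).any():
                anchors[i] = boxes[assign == i].mean(axis=0)
    return anchors[np.argsort(anchors.prod(axis=1))]
```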

The anchor cluster centers of the transmission line foreign object intrusion dataset in this paper, computed by K-means, are [[38, 49], [81, 55], [61, 88], [78, 153], [110, 120], [144, 98], [155, 182], [200, 257], [317, 380]], as illustrated in Fig. 3. Figure 4 shows the anchor clustering for the Pascal VOC dataset.

Fig. 3. Anchor clustering of the transmission line foreign object detection dataset.

Fig. 4. Anchor clustering of the Pascal VOC dataset.

4 Experiments

4.1 Datasets

Notably, few open-source datasets address foreign object intrusion on power grids. The foreign object detection dataset used in this paper comes mainly from the 2nd Guangzhou Pazhou Algorithm Competition, Complex Scene-Based Transmission Corridor Hidden Dangerous Object Detection [11], with a total of 800 annotated images; the training and test sets are split 9:1. The foreign object categories are nest, balloon, kite, and trash. This paper also validates the model on the open-source Pascal VOC dataset [12]. Pascal VOC 2007 has 9,963 images containing 24,640 labelled objects, and Pascal VOC 2012 has 11,540 images containing 27,450 labelled objects; both cover the same 20 object classes for the object detection task.

4.2 Experimental Settings

The GridFormer model is built with the PyTorch framework, and the network parameters are optimized with the Adam optimizer. The initial learning rate is 1e−3, the minimum learning rate is 1e−5, cosine annealing is used to decay the learning rate, label smoothing is set to 0.005, the input resolution is 640 × 640, the batch size is 8, and the maximum number of epochs is 100. All training is done on an NVIDIA RTX 3080 GPU.
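The settings above roughly correspond to an optimizer and scheduler setup like the following sketch; the stand-in model and the omitted data pipeline are placeholders, not the paper's code.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

# Stand-in module; in practice this would be the GridFormer detector
model = torch.nn.Conv2d(3, 16, kernel_size=3)
max_epochs = 100

# Adam with initial lr 1e-3, cosine-annealed towards the minimum lr 1e-5
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=max_epochs, eta_min=1e-5)

for epoch in range(max_epochs):
    # ... one training pass over 640 x 640 inputs with batch size 8 ...
    optimizer.step()   # placeholder for the real per-batch updates
    scheduler.step()
```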

4.3 Evaluation Metrics

To evaluate the lightweight object detection model fairly, this paper uses the mean of the per-category APs (mAP) to measure detection accuracy. The number of floating-point operations (GFLOPs) measures the model's computation, and the number of parameters, Params (M), measures the model's complexity; together they reflect the computational cost of the model. Frames per second (FPS) measures inference speed. These four metrics together give a comprehensive picture of the trained model's performance.

4.4 Results

This paper runs the relevant experiments and tests on the transmission corridor hazard dataset; the results are shown in Table 1. GridFormer achieves 96.78% mAP with only 25.39M parameters, and the P-R curves for the four categories in the dataset are shown in Fig. 5.

We also compare against the well-designed GhostNet. GridFormer improves mAP by 4.96% over GhostNet, and the model reaches 68.7 FPS at inference, which satisfies the requirements of real-time foreign object detection on transmission lines.

Table 1. Performance of GridFormer on transmission line foreign object detection dataset

This paper also validates the effectiveness and generalization performance of GridFormer on the Pascal VOC dataset, and the experimental results are shown in Table 2.

Fig. 5. P-R curves for each category of the transmission line foreign object detection dataset.

Table 2. Performance of GridFormer on Pascal VOC dataset

Compared with GhostNet, GridFormer improves the AP in 15 categories; the cow category improves by more than 10 AP points, and both the aeroplane and horse categories exceed 91 AP. The four categories with lower AP (bottle, chair, diningtable, and pottedplant) all belong to the Household superclass; cluttered categories, severe occlusion in household scenes, and large intra-class variation of the detected objects lead to poorer performance at inference.

4.5 Ablation Studies

This paper experimentally compares the effect of different training methods on the results. The effects of data augmentation, cosine annealing, and label smoothing on the object detection results are further explored.

Table 3 shows the effect of the training methods on model accuracy and F1 score. Experiments 2, 3, and 4 introduce Mosaic and Mixup data augmentation to generate new samples; compared with Experiment 1, which does not use these augmentations, Experiment 2 gains 1.4% mAP. Comparing Experiments 2 and 3 shows that cosine annealing improves accuracy by 0.6% mAP. Experiment 4 adds label smoothing, which improves accuracy by 1.97% mAP. We attribute this to the severe class imbalance in the dataset, where the nest category dominates; by adjusting the probability distribution of the category labels, label smoothing better handles the scarce classes and improves the model's ability to detect all categories.

Table 3. Effect of data augmentation, cosine annealing, and label smoothing on the detection.

5 Conclusion

In power grid transmission line systems, foreign object detection is an important protective measure for ensuring that transmission lines operate normally. This paper proposes GridFormer, a lightweight object detection model built on a CNN-Transformer hybrid feature extraction network, which combines the inductive bias of convolution with the long-range dependency modeling of the transformer and retains strong transformer performance even on smaller datasets. Applied to transmission line foreign object detection, GridFormer adapts well to diverse foreign object shapes, small detection targets, and limited data. The experiments show that the model finds a new tradeoff between inference speed and detection accuracy.