
1 Introduction

Object detection has developed significantly with the emergence of a large number of models such as Swin-Transformer [8] and DAMO-YOLO [12]. Small object detection is important in many fields such as industry, military, and autonomous driving. However, small objects themselves carry few features, while detection scenarios often demand both high accuracy and real-time performance. How to balance these requirements is therefore a challenge for researchers [4].

Some models, such as Next-ViT [7], PPYOLOE [11], and PP-PicoDet [14], have been proposed that achieve high detection accuracy with lightweight model sizes. Existing improvements to loss functions have focused on calculating the deviation between predicted and ground-truth bounding boxes more accurately; there is little work on improving loss functions based on the characteristics of the detection targets. Existing loss functions include IoU [15], GIoU [10], DIoU [19], CIoU [19], EIoU [17], and Alpha-IoU [5]. Researchers have explored factors such as intersection over union, center point distance, aspect ratio, and orientation to make the loss function more accurate. By providing more accurate guidance during training, such loss functions can improve detection performance.

In this paper, a novel attention-based loss function called SC-AttentionIoU is presented, which guides the network to pay more attention to the key regions of the detection target. This loss function identifies the spatial regions with significant features of the detection target and incorporates this prior knowledge. The attention in the loss function determines the weights of regions inside the ground-truth bounding box: high weights are assigned to the key feature regions of small targets, while low weights are assigned to irrelevant or defective regions. In summary, this paper presents the following innovations:

  1.

    The attention-based loss function SC-AttentionIoU is proposed. It incorporates attention into the loss calculation and guides the model to focus on the key features of small objects to improve detection accuracy.

  2.

    A novel weight generation strategy is proposed in SC-AttentionIoU. Different weight regions are generated within the ground-truth box based on the distribution characteristics of small objects, which allows the SC-AttentionIoU loss value to be calculated more accurately.

  3.

    An aspect-ratio weighting factor is proposed to address the potential problems caused by introducing attention into the loss function. This factor allows the model to not only focus on key features but also consider other regions during training, resulting in more accurate predicted box locations.

This paper is organized as follows. Section 2 provides a brief review of related methods for loss functions. Section 3 describes the proposed methods. Section 4 presents experimental results, and Sect. 5 provides conclusions.

2 Related Work

2.1 Small Object Detection

Small object detection has always been a challenging problem. Small objects are usually defined as objects that occupy only a very small area of the image, sometimes less than 10 pixels. Such objects are often affected by factors such as image blur, noise, and scale variation, making them difficult to detect and recognize accurately. To address this problem, Facebook AI proposed a Transformer-based detection method called DETR [2], which uses deformable convolution and feature pyramid techniques to address object size and scale variation in small object detection. Many researchers have also attempted to improve small object detection through better network structures and feature extraction methods. Yang et al. combined spatial attention, channel attention, and self-attention to improve the detection of small objects in intelligent transportation [13]; the resulting model converges faster and achieves higher accuracy. Zhao et al. proposed a detection model [18] based on feature fusion and anchor point improvement, which reduces the missed detection rate of small targets in complex backgrounds while maintaining a considerable detection speed.

2.2 Loss Function

The most important component of the loss function for object detection is the bounding box regression term, which is used to calculate the offset between the predicted box and the ground-truth box. IoU [15] was first proposed to calculate the intersection-over-union ratio between the predicted box and the ground-truth box. GIoU [10] built on IoU and used the minimum enclosing rectangle to calculate the offset between the predicted box and the ground-truth box. DIoU [19] replaced the minimum enclosing rectangle with the Euclidean distance between the center points, which measures the distance between the predicted box and the ground-truth box more accurately. CIoU [19] added an aspect-ratio penalty to DIoU to address the insensitivity of IoU to shape. EIoU [17] further improved the aspect-ratio term by penalizing the width and height differences separately, yielding more accurate results. SIoU introduced the influence of the orientation between the predicted box and the ground-truth box.

3 Method

The SC-AttentionIoU proposed in this paper focuses on the coarseness of object detection annotations and the lack of features in small objects. It is well known that in annotated images the detected object does not always occupy the entire area of the bounding box [9]; the bounding box often also contains unrelated regions such as other objects or background, as shown in Fig. 1. These unrelated regions interfere with model training. Since small objects contain few features, it is feasible to improve detection accuracy by focusing on their key features. To reduce the impact of unrelated regions, this paper introduces attention within the bounding box and assigns different weights to different regions inside it. This allows the model to focus more on regions with higher weights during bounding box regression and reduces the interference of unrelated regions. However, if the model focuses too much on the key features, it may generate predicted boxes that are closer to the distribution of the key features but do not match the ground-truth boxes. To avoid this issue, this paper further improves AttentionIoU and proposes SC-AttentionIoU.

Fig. 1.
figure 1

Small object images from the PCB dataset and the SSDD dataset.

3.1 AttentionIoU

In the IoU calculation, all internal regions of the predicted and ground-truth boxes are weighted equally. When the overlap area remains constant and the sizes of the predicted box and the true box are consistent, predicted boxes of varying shapes have an identical intersection-over-union ratio with the true box, so the result is insensitive to shape variations. Moreover, in the initial training stage, this uniform weighting of the internal regions of the true box leads the model to pay equal attention to irrelevant regions and to critical features within the true box, which is disadvantageous for model training.

When a target contains a large number of features, it is difficult to select the key ones, because the importance of each feature for model training cannot be determined from prior knowledge. Conversely, for small targets with fewer features, the key features can be identified more easily, allowing the model to focus on critical features and improve detection accuracy. As shown in Fig. 1, the key features are clearly located at the center of the true box.

One way to apply weights to the key features of detection targets is to use attention-based networks [3, 6] to add pixel-level weight annotations to the object within the true box. However, this approach greatly increases the annotation cost. Therefore, this paper adjusts the true box according to the distribution of the key feature regions and generates a novel weight box. The weight box is located inside the true box and is closer to the distribution of the key feature regions; it marks the critical features of small targets as the high-weight area. The region between the weight box and the true box is referred to as the low-weight area and primarily comprises irrelevant regions, as depicted in Fig. 2.

Fig. 2.
figure 2

The red box is the annotation box, and the yellow box is the weight box. (Color figure online)

The formula of AttentionIoU is

$$\begin{aligned} AttentionIoU=\frac{(\theta \cap \rho )+\varepsilon (\exists \cap \rho )}{(\theta +\rho +\varepsilon \cdot \exists )-(\theta \cap \rho )} \end{aligned}$$
(1)

where \(\rho \) denotes the prediction box, \(\theta \) denotes the ground-truth box, \(\exists \) denotes the weight box, and \(\varepsilon \) denotes the weight factor of the weight box, with \(\varepsilon \in [1,5]\).

The weight box is located inside the true box, so the intersection between the weight box and the predicted box is contained within the intersection between the true box and the predicted box. When calculating the IoU, the part of the intersection that falls inside the weight box is therefore counted with a factor of \(1+\varepsilon \), which increases the weight of the weight box region. The formula for generating the coordinates of the weight box is

$$\begin{aligned} \exists _{xi}=x_i^{gt}+(-1)^{i+1}\cdot \frac{w^{gt}}{\tau } \end{aligned}$$
(2)
$$\begin{aligned} \exists _{yi}=y_i^{gt}+(-1)^{i+1}\cdot \frac{h^{gt}}{\mu } \end{aligned}$$
(3)

where \(\tau \) is the generation factor for the x-coordinates of the weight box and \(\mu \) is the generation factor for the y-coordinates; both factors are chosen based on prior knowledge of the detection targets. \(\exists _{xi}\) and \(\exists _{yi}\) denote the x- and y-coordinates of the top-left (\(i=1\)) and bottom-right (\(i=2\)) corners of the weight box, \(x_i^{gt}\) and \(y_i^{gt}\) denote the corresponding corner coordinates of the true box, and \(w^{gt}\) and \(h^{gt}\) are the width and height of the true box, respectively.
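To make the weight box construction and the weighted overlap concrete, the following PyTorch-style sketch implements Eqs. (1)-(3) for axis-aligned boxes given as (x1, y1, x2, y2). The function names and the example values \(\tau =\mu =4\) and \(\varepsilon =2\) are our illustrative assumptions and are not prescribed by the paper.

```python
import torch

def weight_box(gt, tau=4.0, mu=4.0):
    """Shrink a ground-truth box (x1, y1, x2, y2) toward its interior to
    obtain the weight box of Eqs. (2)-(3); tau and mu are the generation
    factors for the x- and y-coordinates."""
    x1, y1, x2, y2 = gt.unbind(-1)
    w, h = x2 - x1, y2 - y1
    return torch.stack((x1 + w / tau, y1 + h / mu,
                        x2 - w / tau, y2 - h / mu), dim=-1)

def box_area(box):
    return ((box[..., 2] - box[..., 0]).clamp(min=0) *
            (box[..., 3] - box[..., 1]).clamp(min=0))

def intersection(a, b):
    lt = torch.max(a[..., :2], b[..., :2])   # top-left corner of the overlap
    rb = torch.min(a[..., 2:], b[..., 2:])   # bottom-right corner of the overlap
    wh = (rb - lt).clamp(min=0)
    return wh[..., 0] * wh[..., 1]

def attention_iou(pred, gt, eps_w=2.0, tau=4.0, mu=4.0):
    """Eq. (1): IoU that up-weights the overlap with the weight box.
    eps_w plays the role of epsilon (assumed to lie in [1, 5])."""
    wb = weight_box(gt, tau, mu)
    inter_gt = intersection(pred, gt)        # overlap with the true box
    inter_wb = intersection(pred, wb)        # overlap with the weight box
    union = box_area(gt) + box_area(pred) + eps_w * box_area(wb) - inter_gt
    return (inter_gt + eps_w * inter_wb) / union.clamp(min=1e-7)
```

Because the weight box lies inside the true box, the region inside it contributes to the numerator with an effective weight of \(1+\varepsilon \), exactly as described above.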

3.2 SC-AttentionIoU

Adding a weight box inside the true box allows the model to focus on the key features of the small object as much as possible. However, it also makes the predicted box fit the size and shape of the weight box more closely, so if the difference between the weight box and the true box is large, detection accuracy suffers. To solve this problem, this paper uses the aspect ratio as a weight on AttentionIoU and proposes SC-AttentionIoU. The formula for SC-AttentionIoU is

$$\begin{aligned} \text {SC-AttentionIoU}=1-\partial \cdot AttentionIoU+\frac{\rho ^2(b,b^{gt})}{c^2} \end{aligned}$$
(4)

where \(\partial \) is a weight calculated from the aspect ratio, \(\rho ^2(b,b^{gt})\) is the squared Euclidean distance between the centers of the predicted box and the ground-truth box, and \(c\) is the diagonal length of their minimum enclosing box, as in DIoU. After introducing AttentionIoU, the IoU value becomes sensitive to the shape of the predicted box. In exploring the aspect ratio of the predicted box, CIoU and EIoU investigated two different ways of calculating the aspect-ratio penalty term. However, since AttentionIoU itself is sensitive to shape, the previous approach of separating the IoU from the aspect-ratio factor is no longer feasible. This paper therefore uses the aspect ratio as a weight on the IoU value and combines it with AttentionIoU in the calculation. The formula for \(\partial \) is

$$\begin{aligned} \partial =1-\frac{\frac{\rho ^{2}(w,w^{gt})}{(w^{c})^{2}}+\frac{\rho ^{2}(h,h^{gt})}{(h^{c})^{2}}}{T} \end{aligned}$$
(5)

where \(\partial \) cannot be 0, and \(T\ge 2\) is a regulation factor for the value range of \(\partial \); in this paper the value range of \(\partial \) is \([(T-2)/T,1]\). \(w^{gt}\) and \(h^{gt}\) denote the width and height of the annotated box, and \(w^{c}\) and \(h^{c}\) are the width and height of the minimum enclosing rectangle of the true box and the predicted box.
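Under the same assumptions, a minimal sketch of Eqs. (4)-(5) is given below; it reuses the attention_iou helper (and the torch import) from the previous sketch, and the default \(T=4\) is only an illustrative value.

```python
def sc_attention_iou_loss(pred, gt, eps_w=2.0, tau=4.0, mu=4.0, T=4.0):
    """Eqs. (4)-(5): SC-AttentionIoU loss; T (>= 2) controls the range
    of the aspect-ratio weight, which lies in [(T-2)/T, 1]."""
    # Squared centre distance and enclosing-box diagonal, as in DIoU.
    cp = (pred[..., :2] + pred[..., 2:]) / 2
    cg = (gt[..., :2] + gt[..., 2:]) / 2
    rho2 = ((cp - cg) ** 2).sum(-1)
    lt = torch.min(pred[..., :2], gt[..., :2])
    rb = torch.max(pred[..., 2:], gt[..., 2:])
    cw, ch = (rb - lt).unbind(-1)
    c2 = (cw ** 2 + ch ** 2).clamp(min=1e-7)

    # Aspect-ratio weight of Eq. (5), built from width/height differences.
    wp, hp = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    wg, hg = gt[..., 2] - gt[..., 0], gt[..., 3] - gt[..., 1]
    alpha = 1 - ((wp - wg) ** 2 / cw.clamp(min=1e-7) ** 2 +
                 (hp - hg) ** 2 / ch.clamp(min=1e-7) ** 2) / T

    return 1 - alpha * attention_iou(pred, gt, eps_w, tau, mu) + rho2 / c2
```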

SC-AttentionIoU already takes the effect of the aspect ratio into account, so no separate aspect-ratio penalty term is added. As in CIoU, when the IoU is 0 the loss is guided by the Euclidean center distance rather than the aspect ratio. This design speeds up the convergence of model training. The comparison between existing loss functions and the proposed loss function on the PCB dataset is shown in Fig. 3; the results show that SC-AttentionIoU converges faster and reaches a lower loss value.

Fig. 3.
figure 3

Bounding box regression loss curves of different loss functions.

4 Experiment

4.1 Experiment Setting

In this paper, YOLOv5 was used as the baseline and compared with existing loss functions on two datasets: the ship remote sensing dataset [16] released by the Department of Electronic and Information Engineering of the Naval Aeronautical and Astronautical University, and the printed circuit board defect dataset (PCB dataset) released by the Intelligent Robot Open Laboratory of Peking University. The former dataset consists mostly of small targets; the latter contains 1,386 images with six types of small defect targets.

In training, the image size is adjusted to 640 \(\times \) 640, and data augmentation techniques such as mosaic [1], translation, and flipping are used. The optimizer used is SGD, and the learning rate follows a cosine annealing scheme with an initial value of 0.01. Results were obtained after running 300 epochs on a Tesla A100-SXM4 GPU.
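As a reference for reproduction, the following minimal PyTorch snippet shows an SGD optimizer with a cosine-annealed learning rate starting at 0.01 over 300 epochs; the placeholder model and the momentum value are our assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # placeholder standing in for the YOLOv5 network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    # ... one training epoch over the 640 x 640 augmented images ...
    scheduler.step()
```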

4.2 Experimental Indicators

According to the needs of small target detection scenarios, this paper selects precision, recall, and mAP as detection indicators. Precision measures the accuracy of the model in identifying targets, while recall measures the model's ability to identify all true targets. mAP reflects the combined level of precision and recall. mAP@0.5 focuses on the average precision at a relatively lenient IoU threshold (0.5), while mAP@0.5-0.95 averages the precision over stricter thresholds (0.5 to 0.95).
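For clarity, these indicators reduce to the usual detection formulas; the helper below is a short illustrative sketch with hypothetical counts, showing how precision, recall, and mAP@0.5-0.95 are assembled.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / max(tp + fp, 1), tp / max(tp + fn, 1)

# mAP@0.5 uses the single IoU threshold 0.5, while mAP@0.5-0.95 averages
# the per-class AP over the thresholds 0.50, 0.55, ..., 0.95.
iou_thresholds = [0.50 + 0.05 * i for i in range(10)]
```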

4.3 Comparative and Ablation Experiments

Table 1 shows the experimental results of SC-AttentionIoU and other loss functions on the SSDD [16] dataset. The proposed loss function is designed based on the characteristics of small objects, whereas the SSDD dataset has a simple background, few categories, and contains a small number of relatively large objects. The experimental results show that the mAP@.5 score of the proposed loss function exceeds DIoU by 0.7%, CIoU by 1.2%, EIoU by 1.3%, and SIoU by 0.5%. This demonstrates that the loss function remains effective even under interference from relatively large objects. The test results are shown in Fig. 4.

Table 1. SSDD dataset loss function comparative test
Fig. 4.
figure 4

Comparison of partial detection results with different loss functions.

Table 2 shows the experimental results of SC-AttentionIoU and other loss functions on the PCB defect dataset. Compared to SSDD, the PCB defect dataset has more defect types and more varied defect shapes, which makes it suitable for verifying the improvement of the proposed loss function in complex backgrounds. The experimental results show that the proposed loss function outperforms DIoU by 6.6%, CIoU by 4.5%, EIoU by 6.2%, and SIoU by 3.6% in terms of recall. This demonstrates that the proposed loss function improves the detection of small targets in complex backgrounds. In terms of mAP@.5-.95, the proposed loss function is 1.85% higher than the average of the other loss functions. Therefore, the proposed loss function can effectively improve the detection ability of the model on small target datasets. The test results are shown in Fig. 5.

Table 2. PCB dataset loss function comparative test
Fig. 5.
figure 5

PCB Dataset Detection Results.

Table 3 shows the experimental results of SC-AttentionIoU and other detection models on the PCB defect dataset. We combine advanced models such as PPYOLOE, YOLOv6, and PP-PicoDet with the proposed SC-AttentionIoU and compare them against the same models using the widely adopted CIoU, to verify the robustness of SC-AttentionIoU across different detection models. * indicates the use of the proposed loss function. The experimental results show that, combined with SC-AttentionIoU, the YOLOv6 model improves the mAP@.5-.95 score by 1.4%, the PPYOLOE-s model by 4.3%, and the PP-PicoDet model by 2.6%. Therefore, the proposed loss function shows strong robustness across different detection models.

Table 3. Ablation experiments of SC-AttentionIoU on different detection models

Table 4 shows the experimental results of AttentionIoU combined with other loss functions on the SSDD ship dataset. We combine the widely used CIoU, EIoU, and SIoU with AttentionIoU by replacing the IoU term in each loss function with AttentionIoU, in order to validate the effectiveness and robustness of the proposed AttentionIoU idea. * denotes the use of the AttentionIoU proposed in this paper. Experimental results show that, when combined with AttentionIoU, CIoU* improves mAP@.5-.95 by 2% compared to CIoU, EIoU* improves mAP@.5-.95 by 2% compared to EIoU, and SIoU* improves mAP@.5-.95 by 2% compared to SIoU. Therefore, the proposed AttentionIoU demonstrates strong robustness and effectiveness. The formulas for CIoU*, EIoU*, and SIoU* are shown below.

$$\begin{aligned} \text {CIoU}^*=1-AttentionIoU+\frac{\rho ^2(\textbf{b},\textbf{b}^{gt})}{c^2}+\alpha v \end{aligned}$$
(6)
$$\begin{aligned} \text {EIoU}^*=1-AttentionIoU+\frac{\rho ^{2}(\textbf{b},\textbf{b}^{gt})}{c^{2}}+\frac{\rho ^{2}(\textbf{w},\textbf{w}^{gt})}{C_{w}^{2}}+\frac{\rho ^{2}(\textbf{h},\textbf{h}^{gt})}{C_{h}^{2}} \end{aligned}$$
(7)
$$\begin{aligned} \text {SIoU}^*=1-AttentionIoU+\frac{\rho ^2(\textbf{b},\textbf{b}^{gt})}{c^2}+\frac{\varDelta +\varOmega }{2} \end{aligned}$$
(8)

where \(\alpha \) is a trade-off parameter and \(v\) measures the consistency of the aspect ratio. \(\frac{\rho ^{2}(\textbf{w},\textbf{w}^{gt})}{C_{w}^{2}}+\frac{\rho ^{2}(\textbf{h},\textbf{h}^{gt})}{C_{h}^{2}}\) represents the width and height losses between the predicted box and the ground-truth box. \(\varDelta \) represents the distance loss and \(\varOmega \) represents the shape loss. These variables are not discussed in detail due to the limited length of this paper.
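To illustrate how the plain IoU term is replaced by AttentionIoU in these variants, a hedged sketch of Eq. (6) (CIoU*) is given below, reusing the attention_iou helper from the Sect. 3.1 sketch. Computing the trade-off parameter \(\alpha \) with AttentionIoU in place of IoU is our assumption; the paper does not spell out this detail.

```python
import math

def ciou_star_loss(pred, gt, eps_w=2.0, tau=4.0, mu=4.0):
    """Eq. (6): CIoU loss with AttentionIoU substituted for IoU."""
    cp = (pred[..., :2] + pred[..., 2:]) / 2
    cg = (gt[..., :2] + gt[..., 2:]) / 2
    rho2 = ((cp - cg) ** 2).sum(-1)                      # squared centre distance
    lt = torch.min(pred[..., :2], gt[..., :2])
    rb = torch.max(pred[..., 2:], gt[..., 2:])
    c2 = ((rb - lt) ** 2).sum(-1).clamp(min=1e-7)        # enclosing diagonal squared

    wp, hp = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    wg, hg = gt[..., 2] - gt[..., 0], gt[..., 3] - gt[..., 1]
    aiou = attention_iou(pred, gt, eps_w, tau, mu)
    v = (4 / math.pi ** 2) * (torch.atan(wg / hg.clamp(min=1e-7)) -
                              torch.atan(wp / hp.clamp(min=1e-7))) ** 2
    with torch.no_grad():
        alpha = v / ((1 - aiou) + v).clamp(min=1e-7)
    return 1 - aiou + rho2 / c2 + alpha * v
```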

Table 4. Ablation experiments of AttentionIoU on different loss functions

In summary, SC-AttentionIoU achieves a measurable improvement on small object detection datasets. This paper also points out the potential problems caused by changing the weights of the internal regions of the ground-truth boxes and proposes corresponding solutions. This shows that exploring the weights of the internal regions of ground-truth boxes is beneficial. Our work provides a new direction for exploring loss functions in small object detection and is beneficial for the development of future object detection technologies.

5 Conclusion

This paper proposed the AttentionIoU loss, which extends the IoU loss in bounding box regression to emphasize the key feature regions of small objects and thereby improve detection accuracy. Moreover, to address the issue that changing the internal weights of the ground-truth boxes affects the aspect-ratio loss in AttentionIoU, the SC-AttentionIoU loss function was presented. Finally, the proposed loss function was validated on two small object datasets and compared with other existing loss functions, demonstrating its effectiveness in small object detection.