1 Introduction

Object detection is a widely studied task that aims to locate and classify objects of interest. In recent years, object detection has achieved remarkable progress owing to the powerful representational ability of Convolutional Neural Networks (CNNs) and the availability of an enormous amount of data [4]. However, as an important branch of object detection, small object detection has always been a bottleneck for detector performance. Small objects, typically defined as objects covering fewer than 1024 pixels (\(32 \times 32\)) [18], are of great research significance in practical scenarios such as remote sensing detection [1, 14], disaster rescue [22, 38], and intelligent transportation systems [20, 31]. Unfortunately, the features of small objects are extremely limited, making them susceptible to background and noise interference. Moreover, these weak features are easily lost during feature extraction and downsampling, leading to a noticeable drop in detection performance on small objects. For example, Faster R-CNN [24] achieves an mAP of \(41.0\%\) and \(48.1\%\) for medium and large objects on the COCO dataset [18], respectively, but the result for small objects drops to only \(21.2\%\). Therefore, as a task with both theoretical significance and practical demand, effectively improving detection performance on small objects is an urgent and important problem.

Fig. 1. Pictorial demonstrations of existing feature pyramid networks.

In order to detect objects of various sizes, advanced detectors often adopt a divide-and-conquer approach that utilizes larger receptive fields to detect large objects and smaller ones to detect small objects. This principle is usually embodied in the Feature Pyramid Network (FPN) [16]. As shown in Fig. 1, many studies have recognized the importance of FPN and attempted to fuse low-level and high-level features in a more effective manner to obtain better detection results. Consequently, many FPN variants have been devised to achieve more comprehensive feature fusion [7, 19, 27]; we collectively refer to them as expanded FPNs. However, the fusion strategies of expanded FPNs generally rely on element-wise addition, and the only difference between them is which feature levels are fused. In contrast, to extract detailed features that are conducive to small object detection, element-wise subtraction between corresponding levels may better capture edge information [28]. It should be noted that in the high-level feature layers, the information of small objects is almost entirely lost after repeated downsampling, so subtracting such features cannot extract small object features; instead, it may erase main body features. Therefore, a hierarchical feature fusion strategy is necessary. On the other hand, we notice that the fused features contain information at different scales. Using such global information to guide the refinement of each feature level can improve detection performance [15].

Based on the above observations, we propose a novel approach called Hierarchical Focused Feature Pyramid Network (HFFPN). HFFPN mainly consists of two parts: the Hierarchical Feature Subtraction Module (HFSM) and Feature Fusion Guidance Attention (FFGA). HFSM leverages feature subtraction to obtain the edge information of objects. To avoid erasing main body information through subtraction at higher semantic levels, HFSM adopts a hierarchical subtraction strategy. Besides, the proposed FFGA introduces a novel attention mechanism for small object detection by incorporating both self-features and higher-level features in the generation of attention weights, deviating from common self-attention methods [12, 30] that rely solely on self-features. Adjacent feature levels often contain richer interaction information, with low-level features in particular assisting high-level features in exploring potential information about small objects.

To sum up, our contributions are summarized as follows:

  • We design a brand-new Hierarchical Feature Subtraction Module (HFSM). It fully utilizes the information difference between feature layers and helps to improve the performance of small object detection. The hierarchical strategy employed in HFSM further enhances the robustness of the model.

  • We introduce a Feature Fusion Guidance Attention (FFGA) to utilize the globally fused information. The attention mechanism highlights useful information and suppresses noise by reweighting each layer's own features, helping to explore potential information of small objects.

  • Extensive experiments on the DOTA and COCO datasets demonstrate that the proposed HFFPN significantly improves the performance of the baseline algorithm and surpasses the current state-of-the-art detectors.

2 Related Work

2.1 Small Object Detection

With the development of deep learning, extensive research has been carried out on small object detection. There have been numerous attempts to enhance the performance of small object detection from different perspectives, all with the common goal of increasing the exploitable features of small objects. SCRDet [37] builds a more refined feature fusion network by introducing flexible downsampling strides, allowing a broader spectrum of smaller objects to be detected with greater precision. R3Det [36] designs a feature refinement module to enhance the detection performance of small objects. Oriented RepPoints [13] captures features from adjacent objects and background noise for adaptive point learning, utilizing contextual information to discover small objects.

2.2 Feature Pyramid Network

It is a consensus that shallow layers are usually rich in detailed information but lack abstract semantic information, while deeper layers are the opposite due to downsampling. Smaller objects predominantly rely on shallow features and can be more effectively detected by detectors with smaller receptive fields. The Feature Pyramid Network [16] combines deep-layer and shallow-layer features by building a top-down pathway to form a feature pyramid. PAFPN [19] enriches the feature hierarchy by adding a bottom-up path, enhancing deeper features without losing information from the shallow layers. HRFPN [27] utilizes multiple cross-branch convolutions to enhance feature expression. NAS-FPN [7] searches for the optimal combination method for feature fusion at each layer.

2.3 Self-Attention

The attention mechanism exhibits an impressive capability to quickly concentrate on and distinguish objects within a scene while effectively ignoring irrelevant aspects. Self-attention is a powerful technique in deep learning that allows a model to selectively focus on different parts of the input, effectively capturing dependencies and relationships within it. Spatial self-attention and channel self-attention are two common kinds of self-attention. SENet [12] is the first proposed channel attention; it uses an SE block to gather global information through channel-wise relationships and enhance the representation capacity. CBAM [30] sequentially generates attention feature maps in both channel and spatial dimensions for adaptive feature refinement, resulting in the final feature map. The self-attention mechanism has shown outstanding performance in handling small objects to some extent. SCRDet [37] utilizes pixel attention and channel attention to highlight small object regions while mitigating the impact of noise interference. CrossNet [14] develops a cross-layer attention module to enhance the detection of small objects by generating more pronounced responses.

3 Methodology

3.1 Overview

In order to fully utilize the information of small objects, we propose a novel feature pyramid network, named HFFPN, as shown in Fig. 2. The detector receives the input image I and sends it to the backbone network for feature extraction. During downsampling, the image feature \(C_i\) gradually becomes richer in semantic information while losing detailed information. \(C_i\) is then passed through the proposed Hierarchical Feature Subtraction Module (HFSM) to obtain the intermediate feature \(M_i\) in a top-down manner. Next, \(M_i\) is further fused through a convolution with a kernel size of 3 to obtain the fused feature \(P_i\). Finally, \(P_i\) is sent to the proposed Feature Fusion Guidance Attention (FFGA) to obtain the focused feature \(O_i\), which concentrates on effective information, especially that of small objects. The focused feature \(O_i\) is used by the model to predict the category and location of objects. A schematic sketch of this pipeline is given below.
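The following is a minimal sketch (not the authors' released code) of the forward flow described above. The names `backbone`, `hfsm`, and `ffga_modules` are assumptions; `hfsm` is assumed to already include the \(3\times 3\) fusion convolutions, matching the sketches in Sects. 3.2 and 3.3.

```python
# Schematic HFFPN forward pass; all module names and interfaces are assumptions.
def hffpn_forward(image, backbone, hfsm, ffga_modules):
    feats = backbone(image)                # C_2, ..., C_t from the backbone
    pyramid = hfsm(feats)                  # M_i -> 3x3 conv -> P_2, ..., P_t
    outs = []
    for i, p in enumerate(pyramid):
        if i + 1 < len(pyramid):           # every level except the top one
            outs.append(ffga_modules[i](p, pyramid[i + 1]))  # O_i (Eq. 7)
        else:
            outs.append(p)                 # O_t = P_t, no higher level to guide it
    return outs                            # focused features fed to the detection head
```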

Fig. 2. Overview of the proposed HFFPN, which consists of HFSM and FFGA.

3.2 Hierarchical Feature Subtraction Module

The Hierarchical Feature Subtraction Module (HFSM) is designed to enhance the specific details of low-level features in the feature pyramid. Generally, features at the bottom of the pyramid have higher resolution and smaller receptive fields, and contain local information such as edges, textures, and colors, which is crucial for detecting small objects. However, the widely used fusion strategy, i.e., element-wise addition, fails to enhance the local information that is unique to each level. To cope with this, we propose HFSM, which adopts a hierarchical subtraction operation to highlight the local information and thereby alleviate the above problem. The specific process of HFSM is as follows.

Firstly, the input image I passes through the backbone network to obtain the image feature \(C_i\):

$$\begin{aligned} C_i = \begin{cases} I, & i=0, \\ \mathcal{F}(C_{i-1}), & i=1,\dots,t, \end{cases} \end{aligned}$$
(1)

where \(\mathcal {F}(\cdot )\) denotes the convolution block in the backbone and t is the number of feature layers.

Secondly, \(C_i\) is processed by HFSM to obtain the intermediate feature \(M_i\). The proposed HFSM aims to better extract detailed information from different feature levels. The subtraction operation captures the differential information between two feature levels, which often includes fine-grained or edge information crucial for detecting small objects. Afterwards, the intermediate features are further fused through a \(3\times 3\) convolutional layer. These processes can be represented by the following equations:

$$\begin{aligned} M_i = \begin{cases} \sigma(C_i), & i=t, \\ \sigma(C_i) \oplus \texttt{UP}(M_{i+1}), & i=l+1,\dots,t-1, \\ \frac{1}{2}\bigl(\sigma(C_i) \oplus \texttt{UP}(M_{i+1})\bigr) \oplus \left| \sigma(C_i) \ominus \texttt{UP}(M_{i+1})\right|, & i=2,\dots,l, \end{cases} \end{aligned}$$
(2)
$$\begin{aligned} P_i = \texttt{conv}_{3\times 3}(M_i). \end{aligned}$$
(3)

where \(\sigma (\cdot )\) denotes a \(1\times 1\) convolution, and \(\texttt{UP}(\cdot )\) represents upsampling with a ratio of 2. \(\oplus \) and \(\ominus \) denote element-wise addition and element-wise subtraction, respectively. \(\left| \cdot \right| \) indicates the operation of taking absolute values. l is a hyperparameter for the hierarchical strategy. A possible implementation is sketched below.
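To make Eqs. (2)-(3) concrete, the following PyTorch-style sketch shows one possible implementation of HFSM. The channel width (256), the nearest-neighbor upsampling mode, and the module interface are assumptions made for illustration and are not specified in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HFSM(nn.Module):
    """Sketch of the Hierarchical Feature Subtraction Module (Eqs. 2-3).

    `in_channels` lists the channels of C_2..C_t (high to low resolution);
    `l` is the hierarchical level at or below which subtraction is applied.
    """

    def __init__(self, in_channels, out_channels=256, l=2):
        super().__init__()
        self.l = l
        # sigma(.): 1x1 lateral convolutions, one per input level
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        # conv_3x3(.): fusion convolutions producing P_i (Eq. 3)
        self.fuse = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels)

    def forward(self, feats):
        # feats = [C_2, ..., C_t], ordered from high to low resolution
        laterals = [lat(c) for lat, c in zip(self.lateral, feats)]
        mids = [None] * len(laterals)
        mids[-1] = laterals[-1]                       # M_t = sigma(C_t)
        for i in range(len(laterals) - 2, -1, -1):    # top-down pathway
            up = F.interpolate(mids[i + 1], size=laterals[i].shape[-2:],
                               mode='nearest')        # UP(M_{i+1})
            added = laterals[i] + up                  # sigma(C_i) + UP(M_{i+1})
            if i + 2 <= self.l:                       # pyramid levels start at 2
                mids[i] = 0.5 * added + torch.abs(laterals[i] - up)
            else:
                mids[i] = added
        return [f(m) for f, m in zip(self.fuse, mids)]  # P_i (Eq. 3)
```

As a usage example, with a standard ResNet backbone `in_channels` would typically be `[256, 512, 1024, 2048]` for C2-C5; this value is likewise an assumption for illustration.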

3.3 Feature Fusion Guidance Attention

Feature Fusion Guidance Attention (FFGA) is a generalized self-attention mechanism that can effectively focus on useful information, especially small object information. In the feature pyramid, the fused features contain multi-scale information from different levels, and adjacent levels have stronger complementary abilities in feature distribution due to their similar receptive fields. Based on the features of adjacent levels, the attention is designed to guide the current feature level to focus on useful parts, which effectively improves the quality of each feature layer and thus the detection performance. The process of FFGA guiding feature focusing is illustrated in Fig. 3.

Fig. 3. Diagram of the FFGA.

Firstly, the inputs of FFGA are the current-layer feature \(P_i \in \mathbb {R}^{C\times H\times W}\) and the adjacent higher-level feature \(P_{i+1} \in \mathbb {R}^{C\times \frac{H}{2}\times \frac{W}{2}}\). After upsampling \(P_{i+1}\), the two features are concatenated along the channel dimension to obtain the guided feature \(F_g \in \mathbb {R}^{2C\times H\times W}\). \(F_g\) is then sequentially fed into the channel attention (CA) and spatial attention (SA) modules, yielding the attention feature \(F_a \in \mathbb {R}^{2C\times H\times W}\). Afterwards, \(F_a\) is passed through a \(1\times 1\) convolution to generate the attention map \(W_a \in \mathbb {R}^{1\times H\times W}\). This map is multiplied as an attention weight with the current-layer feature \(P_i\) to obtain the focused feature \(O_i \in \mathbb {R}^{C\times H\times W}\) after attention guidance. This process can be represented by the following formulas:

$$\begin{aligned} F_g &= \texttt{concat}(P_i, \texttt{UP}(P_{i+1})), \end{aligned}$$
(4)
$$\begin{aligned} F_a &= SA(CA(F_g)), \end{aligned}$$
(5)
$$\begin{aligned} W_a &= \texttt{conv}_{1\times 1}(F_a) , \end{aligned}$$
(6)
$$\begin{aligned} O_i &= \begin{cases} P_i \otimes W_a, & i=2,\dots,t-1, \\ P_i, & i=t, \end{cases} \end{aligned}$$
(7)

where the composition of the channel attention and spatial attention modules is detailed in Fig. 3. They share a similar structure that mainly consists of an average pooling layer, a \(1\times 1\) convolution layer followed by a ReLU activation, and a \(1\times 1\) convolution layer followed by a sigmoid activation. The input feature generates attention along the channel and spatial dimensions in the two modules, respectively. After dimension expansion, these attention maps are element-wise added to the original feature, allowing it to obtain different degrees of attentional gain in the channel and spatial dimensions. A hedged sketch of FFGA is given below.
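The following PyTorch-style sketch follows Eqs. (4)-(7) and the CA/SA structure described above. The channel reduction ratio, the hidden width of the spatial branch, and the use of channel-wise mean pooling in SA are assumptions made for illustration, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """CA branch: global average pooling -> 1x1 conv + ReLU -> 1x1 conv + sigmoid,
    added back to the input after dimension expansion (reduction ratio assumed)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(F.adaptive_avg_pool2d(x, 1))   # C x 1 x 1 channel weights
        return x + w.expand_as(x)


class SpatialAttention(nn.Module):
    """SA branch: channel-wise mean pooling (assumed) -> 1x1 conv + ReLU ->
    1x1 conv + sigmoid, added back to the input."""

    def __init__(self, hidden=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Conv2d(1, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 1), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=1, keepdim=True))   # 1 x H x W spatial weights
        return x + w.expand_as(x)


class FFGA(nn.Module):
    """Sketch of Feature Fusion Guidance Attention (Eqs. 4-7)."""

    def __init__(self, channels=256):
        super().__init__()
        self.ca = ChannelAttention(2 * channels)
        self.sa = SpatialAttention()
        self.to_map = nn.Conv2d(2 * channels, 1, kernel_size=1)

    def forward(self, p_i, p_up):
        up = F.interpolate(p_up, size=p_i.shape[-2:], mode='nearest')
        f_g = torch.cat([p_i, up], dim=1)          # Eq. 4: guided feature
        f_a = self.sa(self.ca(f_g))                # Eq. 5: attention feature
        w_a = self.to_map(f_a)                     # Eq. 6: 1 x H x W attention map
        return p_i * w_a                           # Eq. 7 (top level keeps P_t)
```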

4 Experiments

4.1 Datasets

DOTA [32] is a rotation-based small object dataset in the remote sensing field. It contains 2,806 images with a total of 188,282 instances. The detection targets in DOTA include 15 common categories in remote sensing images, namely Bridge (BR), Harbor (HA), Ship (SH), Plane (PL), Helicopter (HC), Small vehicle (SV), Large vehicle (LV), Baseball diamond (BD), Ground track field (GTF), Tennis court (TC), Basketball court (BC), Soccer-ball field (SBF), Roundabout (RA), Swimming pool (SP), and Storage tank (ST). COCO [18] is the most popular dataset for object detection. Due to its definition of small objects and its specialized evaluation metric mAP\(_s\), COCO is commonly used as a well-recognized benchmark for small object detection.

4.2 Experiment Settings

We employed Resnet50 and Resnet101 [11] pre-trained on ImageNet [25] as backbone networks. We utilized the SGD algorithm with a momentum of 0.9 and a weight decay of 0.0001 for network optimization. The learning rate is warmed up at a rate of 0.001 per iteration for the first 500 iterations. The training schedule for all experiments was consistent: we trained for 12 epochs on the two datasets, and the learning rate decays at epochs 8 and 11 by a factor of 0.1. The code for all experiments was built on the MMDetection [2] platform. A configuration sketch is given below.
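For reference, a minimal MMDetection-style schedule and optimizer fragment matching the settings above might look as follows; the base learning rate and warmup ratio are assumptions, since only the warmup length, momentum, weight decay, epoch count, and decay epochs are stated in the text.

```python
# Hypothetical MMDetection config fragment; values not stated above are assumptions.
optimizer = dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=0.0001)
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,      # warm up over the first 500 iterations
    warmup_ratio=1.0 / 3,  # assumed warmup starting ratio
    step=[8, 11])          # decay the learning rate by 0.1 at epochs 8 and 11
runner = dict(type='EpochBasedRunner', max_epochs=12)
```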

4.3 Comparison Results

Results on DOTA. We selected the RoI Transformer [5], a general method for aerial object detection, as the baseline algorithm. Table 1 reports the comparison results on the DOTA test set. With Resnet50 as the backbone, our method obtains \(76.64\%\) mAP\(_{50}\), improving the baseline by approximately \(1\%\) and surpassing the state-of-the-art algorithms. With Resnet101 as the backbone, HFFPN also increases the baseline's performance by \(0.87\%\) mAP\(_{50}\), achieving the best result on the DOTA dataset. These results fully demonstrate HFFPN's advantages in small object detection and reflect its potential for practical applications. Figure 4 provides a more intuitive visual comparison.

Table 1. Comparison with state-of-the-art methods on the DOTA test set. The reported results come from AerialDetection [6] and OBBDetection [33]. \(\ddag \) indicates the result of our re-implementation. Note that we only list some classes for better display.

Results on COCO. On the COCO dataset, we applied HFFPN to two-stage [24], one-stage [17], and anchor-free [26] detectors, respectively. Table 2 shows the performance gain brought by HFFPN. Although the overall mAP improvement is not significant due to the small proportion of small objects in the COCO dataset, the consistent and significant increase in the mAP\(_{s}\) metric indicates that HFFPN makes detectors more capable of detecting small objects while maintaining their detection capabilities of other scales of objects.

Table 2. Comparison experiment on COCO. The baseline results come from [2].

4.4 Ablation Study

To further verify the advantages and effectiveness of the proposed method, we conduct a series of experiments on the DOTA dataset. The baseline algorithm is RoI Transformer with Resnet50.

Evaluation for Component Effectiveness. To evaluate the effects of HFSM and FFGA, we carry out several ablation experiments, and the experimental results are shown in Table 3. Without any improvement schemes, the mAP\(_{50}\) of the baseline is \(75.70\%\). The introduction of HFSM and FFGA gradually improves the detection accuracy to \(76.24\%\) and \(76.64\%\), respectively. The results indicate that each component of HFFPN brings improvement to the detector.

Fig. 4. Visualization on the DOTA test set. The yellow circles highlight the differences in detection results. We can easily find that HFFPN (second row) helps detect more small objects and achieves higher accuracy in classification and regression. (Color figure online)

Evaluation on Different Settings of l in HFSM. The hierarchical level l, a hyperparameter in HFSM, determines in which feature layers the feature subtraction is performed. Specifically, feature subtraction is applied when the feature level is not higher than l. Table 4 shows the results under different values of l. When l is 2, the performance of the baseline with HFFPN reaches its highest. If we do not employ the hierarchical strategy and instead set l to 5, so that feature subtraction is performed between every pair of adjacent levels, we observe a significant drop in results. The hierarchical strategy ensures that the subtraction is performed only on detailed features, making it applicable to a wide range of input images and thus enhancing the model's robustness.

Comparison with Other FPNs. Table 5 presents the performance of the baseline algorithm with different FPNs. It can be observed that some expanded FPNs do enhance the detector's performance to some extent, but the improvements are not as significant as those of the proposed HFFPN.

Evaluation on Different Detectors. To verify that the proposed HFFPN is a common method for most detectors, experiments were conducted on several different detectors. Table 6 shows the comparison results of these detectors with or without using HFFPN. The experimental results show that the use of HFFPN has led to performance improvements for all detectors, strongly indicating the universality and effectiveness of the proposed method.

Table 3. Evaluation on the effectiveness of each component. FS, HS, CA and SA denote feature subtraction, hierarchical strategy, channel attention, and spatial attention, respectively.
Table 4. Results of different l.
Table 5. Comparison with other FPNs.
Table 6. Improvements on DOTA by applying HFFPN to different detectors.

5 Conclusion

To better utilize the detailed information for small object detection, this paper proposes a hierarchical focused feature pyramid network. It mainly contains a hierarchical feature subtraction module and feature fusion guidance attention. This design overcomes the problem of neglecting edge information that exists in common FPN methods, thus improving the detection ability of small objects without affecting the detection performance of objects at other scales. Comparison and ablation experiments on multiple datasets demonstrate the excellent performance of the proposed method, fully verifying the effectiveness of HFFPN.