1 Introduction

Neural networks, especially deep neural networks, have fundamental advantages over traditional methods for visual computing [1,2,3,4,5,6,7,8,9,10,11,12,13]. For the object detection task, since R-CNN [14] was proposed 4 years ago, the accuracy on the VOC [15] dataset has gradually improved. Different from R-CNN and Fast R-CNN [16], Faster R-CNN is fully based on convolutional networks. Furthermore, one-stage object detection approaches, such as SSD, merge the two stages of Faster R-CNN to produce the bounding boxes and the labels in a single output. Although the accuracy of one-stage detectors is a little lower than that of two-stage detectors, they have the advantages of a concise network architecture and high speed.

The above networks are used in many applications [17,18,19,20,21]. The typical network training rule is to train the networks by minimizing their average error over the training data, which is known as the empirical risk minimization (ERM) principle [22]. Classical machine learning theory tells us that the convergence of ERM can be guaranteed as long as the size of the learning machine does not increase with the amount of training data [22, 23].

However, recent research [24] casts doubt on ERM, showing that it allows large neural networks to memorize (instead of generalize from) the training data, even when previous works apply many tricks such as strong regularization, and even when the classification labels are assigned randomly.

In many recent applications of neural networks [25, 26], performance is easily affected by the training and testing data. Networks trained with ERM may give opposite (erroneous) predictions on unseen (testing) examples. Therefore, generalization remains a challenge.

Typical data augmentation methods that address the above problems can be found in the classification task [27] and can be formalized by the vicinal risk minimization (VRM) principle [28], which trains networks on similar but different examples. The basic methods include slight image rotation, random cropping, horizontal flipping, mild scaling, etc. Other augmentation methods include noisy labeling, which adds noise to the labels [29], and label smoothing, which softens the labels so that they contain no explicit ones and zeros [30]. Blending methods blend the inputs and their targets across different classes [23, 31, 32] and achieve dramatic improvements in the classification task.

However, the above data augmentation methods are oriented toward the classification task, under the assumption that examples in the vicinity share the same class, so they are not suitable to be applied directly to the detection task.

For the classification task, the classifier only needs to produce one prediction for each image. For the detection task, however, the detector has to predict both the locations and the categories of all objects, so the complexity of detection is much higher than that of classification. Therefore, directly moving the above blending method from the classification task to the detection task puts more pressure on training and makes it difficult for the network to converge to an optimal state, eventually leading to performance degradation.

To solve the above problem, we present a multi-phase blending method for object detectors and achieve remarkable improvements in accuracy.

Firstly, we propose a scheduled and incremental coefficient to control the blending intensity. We construct a sigmoid formulation to guide the multi-phase training process. (1) In the initial phase, the intensity starts from almost zero and increases slowly and smoothly, so the network has time to fit itself to the difficult object detection dataset and converge to a good state. (2) In the second phase, the intensity grows rapidly and reaches full intensity in a short time, so the blending method starts to amplify its regularization effect on the detector. (3) In the last phase, the detector is trained with full intensity until the detection network converges. Based on this dynamic coefficient, we propose an incremental blending method in which the blending degree is controlled by the coefficient. In this way, more complex and varied training data can be created to regularize the network, while the training process does not become too tough for the network.

Secondly, we also design a hybrid loss function with incremental intensity. Unlike the original loss function, our hybrid formulation blends the classification loss and the regression loss separately, and the blending intensity of both parts increases smoothly at the beginning, controlled by our scheduled coefficient.

Thirdly, the blending method further increases the number of negative examples by creating hybrid categories with more background than objects, which generally count as negative examples. For the detection task, too many negative examples do not help in detecting positive examples; on the contrary, they make the training process more difficult. Therefore, we discard more negative examples in our multi-phase training process than typical training procedures do.

Finally, our experiments show that the proposed method regularizes object detection networks and eventually improves their performance on complex detection tasks.

The proposed method is highly valuable because it improves the detector's performance without increasing its computational cost; the only price is more time spent in the training phase. Moreover, it is a compact and independent module that is easy to use.

The rest of this paper is organized as follows. Section 2 briefly reviews the related work in object detection. Section 3 presents our regularization method for one-stage object detectors. Section 4 conducts the experiments and discusses empirical results. Section 5 analyzes the highlights of the proposed network. Section 6 concludes this work and discusses future work.

2 Related work

2.1 Detection networks

Two-stage detector R-CNN [14] is a standard two-stage object detection framework. Girshick et al. [14] combine the steps of cropping box proposals, e.g., via selective search [33], and classifying them through a CNN model, yielding a significant accuracy gain. For speed, Fast R-CNN [16] runs the feature extractor over the entire image only once and then feeds the result into a spatial pooling layer, called RoI pooling, thus allowing the features to be reused in classification.

Faster R-CNN [34] shows that the quality of object proposals can be optimized by deep neural networks and replaces the independent proposal generators of its predecessors with a region proposal network (RPN). The RPN has a set of boxes, named anchors, paved over the image at different locations, scales, and aspect ratios, and it is trained to make class-agnostic objectness predictions and to regress offsets that fit the object location for each anchor.

Faster R-CNN has later been extended into many more advanced versions. A typical extension is Mask R-CNN [35], which uses a parallel branch to segment the object mask and introduces a RoIAlign layer that fixes misalignment and improves detection accuracy.

One-stage detector The typical one-stage detectors are YOLO [36, 37] and SSD [38]. YOLO predicts confidences and locations for multiple objects from the whole feature map. It runs very fast because it eliminates the proposal generation stage; however, its performance is limited. SSD [38] is another one-stage object detection approach and is widely used in pedestrian detection, car detection, object tracking, etc. Different from two-stage detection, SSD produces the bounding boxes and class labels from the feature map at the same time, through its location layer and classification layer, so this framework is faster than two-stage detectors but less accurate.

RFBNet [39] improves the basic SSD. It adds a module called the Receptive Field Block (RFB), which consists of several convolutional kernels of different sizes in parallel. Compared with the Inception block [30], RFB uses different stride lengths and bigger kernels to ensure that the feature map is covered. The RFB block thus expands the receptive field of the layers, giving them access to more information.

Unless otherwise noted, our work is in the context of one-stage detection networks.

2.2 Data augmentation methods

Intuitive image operations Most existing data augmentation methods used in object detection are limited to intuitive image operations (such as cropping and rotation) that make only minor changes to the object. These operations do not substantially change the images.

Noisy label Learning with noisily labeled training data has been extensively studied in the machine learning and computer vision literature, but limitations remain. Experiments in [40] show that classifiers inferred by label-noise-robust algorithms are still affected by label noise. Many studies have shown that label noise can adversely impact the classification accuracy of induced classifiers [41]. Bartlett et al. [42] proved that most loss functions are not completely robust to label noise.

Label smoothing There exist several related label smoothing methods [23, 30].

Szegedy et al. [30] soften the label by distributing a small amount of probability to every class, which enhances regularization and yields a small improvement. This method encourages the model to be less confident: it regularizes the model and makes it more adaptable by preventing the largest logit from becoming much larger than all the others. Although it has a positive effect on generalization, this soft method is not explicit because the label softening is random, and it has little influence on some networks. In contrast to [30], we use explicit image information to achieve the same anti-overfitting effect while avoiding wrong information.
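For reference, the softening in [30] can be sketched in a few lines of PyTorch (the helper name and the eps value are our illustrative choices, not from [30]):

```python
import torch

def smooth_labels(targets: torch.Tensor, num_classes: int, eps: float = 0.1) -> torch.Tensor:
    """Soften one-hot labels: the true class keeps 1 - eps of the
    probability mass, and eps is spread uniformly over all classes [30]."""
    one_hot = torch.zeros(targets.size(0), num_classes)
    one_hot.scatter_(1, targets.unsqueeze(1), 1.0)
    return one_hot * (1.0 - eps) + eps / num_classes
```

With eps = 0.1 and 20 classes, every class receives an extra 0.1/20, in line with the comparison settings used later in Sect. 4.7.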

Furthermore, [23, 31, 32] assume that a linear relationship between images and their labels also affects the generalization of models. They adopt another way to obtain the vicinal distribution: they mix two original images by simply adding them together with a random weight, add the two labels together with the same weight, and use the resulting images and labels to train the neural networks.
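The following is a minimal sketch of this blending for classification (the helper name is ours; sampling the weight from a Beta distribution follows [23]):

```python
import numpy as np
import torch

def mixup(images: torch.Tensor, labels_onehot: torch.Tensor, beta: float = 0.2):
    """Mix a batch with a shuffled copy of itself: images and one-hot
    labels are combined with the same random weight lambda [23]."""
    lam = float(np.random.beta(beta, beta))
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * labels_onehot + (1.0 - lam) * labels_onehot[perm]
    return mixed_images, mixed_labels
```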

Our work differs from the above literature [23, 30,31,32] as follows: (1) It is aimed at object detection, which involves both regression and classification problems, while the above methods address only classification problems. (2) In addition to one type of blended loss function for the labels, our method constructs two types of hybrid loss functions, for both labels and locations: a hybrid classification loss function and a hybrid regression loss function. (3) To alleviate the difficulty of training on the complex data created by the blending operations, we propose a scheduled and incremental blending parameter that smoothly controls the blending intensity, and we discard more negative examples.

2.3 Contribution

As a brief summary of this section, our contributions are as follows: (1) We design a smooth, scheduled, and incremental coefficient with a sigmoid formulation to control the blending intensity across the phases and propose a blending method based on this dynamic and incremental intensity. (2) We propose two incremental hybrid loss functions, a hybrid classification loss function and a hybrid regression loss function, in addition to the original loss function. (3) We further enhance the hard negative mining method by discarding more negative examples (Fig. 1).

Fig. 1

The overview of multi-phase training. The blending intensity smoothly increases according to our scheduled blending intensity. In the initial phase, the intensity starts from almost zero and increases slowly and smoothly. In the second phase, the intensity grows quickly and reaches a high level of full intensity in a short time. In the last phase, the detector is trained with full intensity until the detection network converges

3 The proposed method

3.1 The principle of the proposed method

Firstly, the widely used data augmentation methods based on intuitive image operations increase the number of true images, which are stable and concise for training both classification and object detection networks. The blending method instead creates data of a blended class that is closer to one of the two source classes. The blended data expand the training space, and the soft labels of the blended data make the nearby feature space smoother (Fig. 2). However, the blending method creates inexact data, which is acceptable for a classification network but hard for a detection network. Therefore, we propose a new multi-phase method that smoothly controls the blending intensity across the phases, so that the network can adapt gradually.

Fig. 2

The red dots are data of a class in the natural distribution, and the green dots are another class. Blended data are created in the vicinal space of the red dots, to expand the training space and make the feature space smoother

Secondly, when predicting the positions of bounding boxes, the coordinates are continuous values. The softened labels are also continuous values, which match the object detection task very well. Therefore, we propose two incremental hybrid loss functions, a hybrid classification loss function and a hybrid regression loss function, in addition to the original loss function.

The basic idea of the proposed method is illustrated in Fig. 1.

3.2 Gaps between classification and detection

Gaps always exist between the classification and detection tasks. To initially test the performance of the blending method on regression problems, we conduct a fundamental experiment that shows its effect on the regression problem in object detection.

The experiment is set up as follows. As shown in Fig. 3, we create a white 10\(\times \)10 square containing a 5\(\times \)5 black box. We sample a data distribution from the original distribution to simulate the natural situation in which detection datasets (like PASCAL VOC) are sampled from the distribution of natural images. In this experiment, only 10 of the 25 samples are selected as training data. In the test phase, we use all the data to test the trained model.

For the training data distribution \(\mathcal {D} := \{(x_i, y_i)\}_{i=1}^{n}\) over locations of the black block, it is a sample distribution from the real distribution. We denote \(x_i\) as the image pixels and \(y_i\) as its location values.

Firstly, we construct a new distribution \(\mathcal {D}_v := \{(\tilde{x}_i, \tilde{y}_i, \tilde{z}_i)\}_{i=1}^{m}\) from \(\mathcal {D}\) for the images by the proposed blending operation:

$$\begin{aligned} \left\{ \begin{aligned} \tilde{x_i}&=\lambda x_i + (1-\lambda ) x_j \\ \tilde{y_i}&=y_i \\ \tilde{z_i}&=y_j\\ \end{aligned} \right. \end{aligned}$$
(1)

where \( \lambda \sim \hbox {Beta}(\beta , \beta )\). In our experiment, we set \(\beta = 0.1\).

Fig. 3

Left is the original image (black box) and right is the blended image (blended black box)

Secondly, we detect the location of the black box with a small-scale AlexNet for this initial test, in which we train the network with the loss function \(l_\mathrm{{hybrid}}\) (Eq. 4):

$$\begin{aligned} \hbox {loss}_p(\theta )= & {} L_\mathrm{{SM}}(f_{\theta }(\tilde{x_i}), y_i) \end{aligned}$$
(2)
$$\begin{aligned} \hbox {loss}_q(\theta )= & {} L_\mathrm{{SM}}(f_{\theta }(\tilde{x_i}), z_i) \end{aligned}$$
(3)
$$\begin{aligned} l_\mathrm{{hybrid}}(\theta )= & {} \lambda \hbox {loss}_p(\theta ) + (1-\lambda ) \hbox {loss}_q(\theta ) \end{aligned}$$
(4)

where \(L_\mathrm{{SM}}\) denotes the Smooth L1 loss, and \(f_{\theta }\) and \(\theta \) are the model and its weights.
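For reference, this toy hybrid loss (Eqs. 2–4) can be sketched in PyTorch as follows (the helper name is ours):

```python
import torch.nn.functional as F

def l_hybrid(model, x_tilde, y_i, z_i, lam):
    """Hybrid regression loss of Eq. (4): Smooth L1 against the
    locations of both source images, weighted by lambda."""
    pred = model(x_tilde)
    loss_p = F.smooth_l1_loss(pred, y_i)  # Eq. (2)
    loss_q = F.smooth_l1_loss(pred, z_i)  # Eq. (3)
    return lam * loss_p + (1.0 - lam) * loss_q
```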

Fig. 4

Graphs show the losses of models trained with the blending method (red and green) and without blending (blue); note that green is a bad example. a shows that the blending method fluctuates more during training. In b, the red line refers to a model that performs better than the baseline, while the green one performs worse than the baseline model. This means that the final models trained with the original blending method are inconsistent and not always good

As shown in Fig. 4, the experimental results show that the blending method has the potential to improve the detection model, but the training process is unstable, which is why we use a scheduled intensity.

In another experiment, we test the naive application of the blending method on VOC 2007 (Table 1). The performance is worse than that of the original model.

Table 1 Ablation analysis of multi-phased training
Fig. 5

The overview of incremental blending method. We design two incremental hybrid loss functions containing hybrid classification loss function and hybrid regression loss function

3.3 Blending intensity for training detectors

Unlike image classifiers, object detectors are usually harder to train due to their complexity, especially when using the blending method.

  • In the context of this research, the detectors simultaneously produce two different losses: the classification loss and the regression loss. So the complexity of the detection task is higher than that of the classification task.

  • Besides, for each point on the last feature map of the object detector, predictions of both category and location are made. Therefore, the loss function of detectors is more complex than the loss function of classifiers.

  • Furthermore, the blending method creates hybrid categories, of objects with objects or objects with background, whose hybrid labels are combined from the labels of the original objects, so these human-made images and labels are more complex than the original images and labels.

The above analysis shows that it is not suitable to apply the blending method directly to object detectors. Therefore, we propose a multi-phase blending method with incremental blending intensity for training detection networks.

3.4 Enhanced hard negative mining

In the training process of typical object detectors, after the matching step, most of the default boxes are negatives, especially when the number of possible default boxes is large. This introduces a severe imbalance between the positive and negative training examples [38].

The existing approach in typical one-stage detectors is hard negative mining: all negative examples are sorted by the highest confidence loss for each default box, and the top ones are picked so that the ratio between negatives and positives stays at a fixed value.

We consider this problem to be more serious in our method. The blending method further increases the number of negative examples by creating hybrid categories with more background than objects, which generally count as negative examples. However, too many negative examples do not help in detecting positive examples; on the contrary, they make the training process more difficult.

Based on the above observation, we discard more negative examples in our multi-phase training process than previous training methods do.

3.5 Blending training architecture and principles

In the one-stage detector, all the labels and the bounding boxes of objects come out simultaneously. The network produces a fixed-size matrix containing all the information of both the detected objects and backgrounds. Each prediction is related to the corresponding area.

Therefore, we can blend two blocks of fixed-size outputs with correct alignment. In this way, we can blend both the images and the labels (a softening effect) in the object detection task, and we propose a novel training architecture: a blending training architecture with incremental blending intensity.

The architecture and principles of the proposed method are shown in Fig. 5:

  • Before inputting data batches to the base network, we apply a pairwise operation that hybridizes pairs of images, in addition to the intuitive image processing operations.

  • At the tail of the network, we present a hybrid loss function called HLoss, which contains the hybrid classification loss and the hybrid regression loss.

  • The blending degree is controlled by the scheduled and incremental blending intensity.

3.6 Details of the algorithm

For convenience, we abbreviate our method, the multi-phase blending method, as MPB. MPB includes the following three parts:

3.6.1 Scheduled blending intensity

We design our scheduled blending intensity \(\lambda \) through a sigmoid formulation,

$$\begin{aligned} \lambda = \frac{\hat{\lambda }}{1 + \hbox {e}^{-\alpha (epoch-n)}} \end{aligned}$$
(5)

where \(\hat{\lambda }\) is the highest value of the blending intensity, \(\alpha \) and n are the hyperparameters of \(\lambda \), and epoch denotes the current epoch during training. In most of our experiments, \(\hat{\lambda }\), \(\alpha \), and n are set to 0.02, 0.1, and 200, respectively. For typical detection networks, when the epoch reaches around 200, the network enters a premature stage in which the loss curve becomes smooth and the performance stays stable. From this stage on, the smoothly incremental blending intensity further improves the performance of the networks.
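Equation (5) amounts to a few lines of code; the following sketch uses the default hyperparameters stated above:

```python
import math

def blending_intensity(epoch: int, lam_hat: float = 0.02,
                       alpha: float = 0.1, n: int = 200) -> float:
    """Scheduled blending intensity of Eq. (5): near zero in the early
    epochs, rising quickly around epoch n, saturating at lam_hat."""
    return lam_hat / (1.0 + math.exp(-alpha * (epoch - n)))

# With the defaults: epoch 100 -> ~9e-7 (phase 1), epoch 200 -> 0.01
# (phase 2), epoch 300 -> ~0.02, i.e., full intensity (phase 3).
```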

3.6.2 Blending method

The blending method includes three major procedures as follows.

In the first step, for a training batch, we randomly select two images and blend them by \( x = \lambda x_1 + (1-\lambda ) x_2\). We construct a new distribution \(\mathcal {D}_v\) from the source distribution \(\mathcal {D}_s := \{(x_i, y_i)\}_{i=1}^{n}\):

$$\begin{aligned} \mathcal {D}_v := \{(\tilde{x}_i, \tilde{y}_{pi}, \tilde{y}_{qi})\}_{i=1}^{m} \end{aligned}$$
(6)

where \(\tilde{x}_i = \lambda x_{pi} + (1-\lambda ) x_{qi} \), \( (x_{pi}, y_{pi}), (x_{qi}, y_{qi}) \in \mathcal {D}_s \), and \(\lambda \) is the scheduled blending intensity from Eq. (5). We then input these blended images from the distribution \(\mathcal {D}_v\) to compute the feature maps.
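This first step can be sketched as follows (the tensor layout and helper name are our assumptions; detection targets are kept as per-image structures):

```python
import torch

def blend_batch(images: torch.Tensor, targets: list, lam: float):
    """Step 1 of MPB: pair each image with a random partner from the
    same batch and blend the pixels with weight lambda (Eq. 6). Both
    sets of targets are returned for the hybrid losses below."""
    perm = torch.randperm(images.size(0))
    blended = lam * images + (1.0 - lam) * images[perm]
    targets_p = targets                     # y_p: targets of the base images
    targets_q = [targets[i] for i in perm]  # y_q: targets of the partners
    return blended, targets_p, targets_q
```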

In the second step, we calculate the classification loss and the regression loss on the feature maps. The basic classification loss is the cross-entropy loss, where \((x,y) \in \mathcal {D}_s\) and \(\theta \) denotes the parameters of the network:

$$\begin{aligned} \hbox {loss}_\mathrm{{cls}}(\theta ) = \frac{1}{m} \sum _{i=1}^m L_\mathrm{{CE}} ( f_{\theta }(x_i), y_i). \end{aligned}$$
(7)

Here we present a new loss function, in which we replace the basic loss with a weighted sum of two losses, where \( (\tilde{x_i}, y_{pi}, y_{qi}) \in \mathcal {D}_v\):

$$\begin{aligned}&\hbox {loss}_i(\theta ) =\frac{1}{m}\sum _{i=1}^m L_\mathrm{{CE}}(f_{\theta }(\tilde{x_i}), y_{pi})\end{aligned}$$
(8)
$$\begin{aligned}&\hbox {loss}_j(\theta ) = \frac{1}{m}\sum _{i=1}^m L_\mathrm{{CE}}(f_{\theta }(\tilde{x_i}), y_{qi})\end{aligned}$$
(9)
$$\begin{aligned}&\hbox {loss}_\mathrm{{hybrid}}(\theta ) =\lambda \hbox {loss}_i(\theta ) + (1-\lambda ) \hbox {loss}_j(\theta )\end{aligned}$$
(10)
$$\begin{aligned}&\hbox {loss}_\mathrm{{cls}}(\theta ) = \hbox {loss}_\mathrm{{hybrid}}(\theta ) \end{aligned}$$
(11)

For the localization loss, we modify it in the same way, where \(L_\mathrm{{SM}}\) refers to the Smooth L1 loss:

$$\begin{aligned}&\hbox {loss}_i(\theta ) = \frac{1}{m}\sum _{i=1}^m L_\mathrm{{SM}}(f_{\theta }(\tilde{x_i}), y_{pi})\end{aligned}$$
(12)
$$\begin{aligned}&\hbox {loss}_j(\theta ) = \frac{1}{m}\sum _{i=1}^m L_\mathrm{{SM}}(f_{\theta }(\tilde{x_i}), y_{qi})\end{aligned}$$
(13)
$$\begin{aligned}&\hbox {loss}_\mathrm{{hybrid}}(\theta ) = \lambda \hbox {loss}_i(\theta ) + (1-\lambda ) \hbox {loss}_j(\theta )\end{aligned}$$
(14)
$$\begin{aligned}&\hbox {loss}_\mathrm{{loc}}(\theta ) = \hbox {loss}_\mathrm{{hybrid}}(\theta ) \end{aligned}$$
(15)

In the third step, we obtain the HLoss by adding \(\hbox {loss}_\mathrm{{loc}}\) and \(\hbox {loss}_\mathrm{{cls}}\) together, and we minimize it to train our network:

$$\begin{aligned} \hbox {HLoss}(\theta ) = \hbox {loss}_\mathrm{{cls}}(\theta ) + \gamma \hbox {loss}_\mathrm{{loc}}(\theta ). \end{aligned}$$
(16)

We set \(\gamma \) to 1 in our experiments.
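Putting the second and third steps together, the following sketch computes the HLoss; here cls_loss and loc_loss stand for the detector's matched cross-entropy and Smooth L1 losses, and the wrapper itself is our illustration:

```python
def hloss(cls_loss, loc_loss, preds, targets_p, targets_q,
          lam: float, gamma: float = 1.0):
    """HLoss of Eq. (16): both component losses are computed against
    the two sets of targets and mixed with weight lambda (Eqs. 8-15)."""
    loss_cls = lam * cls_loss(preds, targets_p) + (1.0 - lam) * cls_loss(preds, targets_q)
    loss_loc = lam * loc_loss(preds, targets_p) + (1.0 - lam) * loc_loss(preds, targets_q)
    return loss_cls + gamma * loss_loc
```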

3.6.3 Enhanced hard negative mining

After the blending operation, we sort all negative examples by the highest confidence loss for each default box and pick the top ones. We keep the ratio between negatives and positives at 3; in addition, we randomly discard \(20\%\) of these retained negative examples.
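This enhanced mining can be sketched over the per-default-box confidence losses as follows (the 1-D tensor layout and helper names are our assumptions):

```python
import torch

def mine_negatives(conf_loss: torch.Tensor, pos_mask: torch.Tensor,
                   neg_pos_ratio: int = 3, drop_frac: float = 0.2) -> torch.Tensor:
    """Rank negatives by confidence loss, keep at most neg_pos_ratio x
    the number of positives, then randomly discard drop_frac of them."""
    num_pos = int(pos_mask.sum())
    ranked = conf_loss.clone()
    ranked[pos_mask] = float('-inf')                  # never rank positives as negatives
    num_neg = min(neg_pos_ratio * num_pos, int((~pos_mask).sum()))
    _, neg_idx = ranked.topk(num_neg)                 # hardest negatives first
    keep = neg_idx[torch.rand(num_neg) >= drop_frac]  # enhanced step: drop 20% at random
    return keep
```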

4 Experiments

We evaluate the method on several networks using the same datasets: PASCAL VOC [15] and MS COCO [43], which have 20 and 80 object categories, respectively.

In PASCAL VOC 2007, a predicted bounding box is positive if its Intersection over Union (IoU) with the ground truth is higher than 0.5, while COCO uses various thresholds for a more comprehensive evaluation. The metric used to evaluate detection performance is the mean average precision (mAP).

In MS COCO, following the settings of other studies, we use trainval35k as the training set, which consists of train2014 plus val2014 excluding the minival split. We report results on test2015 as the evaluation result. All our training is based on a single 1080Ti GPU with PyTorch as the platform; we give the details of each experiment in the following parts.

4.1 PASCAL VOC

In this experiment, we follow [38] by using the same settings and hyperparameters.

For SSD + MPB, we use SGD as the optimizer with an initial learning rate of 0.004, momentum of 0.9, 400 epochs, weight decay of 0.0005, and batch size of 32. We set \(\gamma \) to 1 and \(\hat{\lambda }\) to 0.1. We used a warm restart strategy [44] to accelerate training, gradually ramping the learning rate up from \(10^{-6}\) to 0.004 over the first 5 epochs. After the warm-up phase, the learning rate decays back to \(10^{-6}\) by epoch 200 and is kept there for the following epochs. For the blending training parameters, \(\hat{\lambda }\), \(\alpha \), and n are set to 0.02, 0.1, and 200, respectively. We trained the model for 7.5 hours in total and reached the best model at epoch 340. For DSSD and YOLOv2, the settings are almost the same as for SSD.

For RFB + MPB, we use a similar strategy and similar parameters. Most settings follow [39]. We use SGD as the optimizer with an initial learning rate of 0.004 and momentum of 0.9. We set the batch size to 32, the weight decay to 0.0005, and the number of epochs to 400. We also use the warm-up strategy, gradually ramping the learning rate up from \(10^{-6}\) to \(4\times 10^{-3}\) over the first 15 epochs. After the warm-up phase, the learning rate decays back to \(10^{-6}\) by epoch 250 and is kept there for the following epochs. Similarly, \(\hat{\lambda }\), \(\alpha \), and n are set to 0.02, 0.1, and 200, respectively. We reached the best model at around epoch 390.

Tables 2 and 5 compare the networks with and without MPB on the VOC 2007 test set. SSD* denotes the updated SSD results with more data augmentation [38]. For a fair comparison, we reimplement SSD* with PyTorch 0.4 and CUDA 9.0 and apply our method in the same environment, using the same data augmentation methods as in [38]. With our method, SSD* is greatly improved, by \(1.3\%\); DSSD and YOLOv2 are also upgraded, by \(0.8\%\) and \(0.6\%\). The latest fast one-stage detector, RFBNet, is also clearly improved, by \(0.4\%\) and \(0.3\%\) for RFB300 and RFB512, respectively.

Table 2 Detection results on PASCAL VOC 2007

Another experiment, on PASCAL VOC 2012, is shown in Table 3. The settings are the same as in the above experiments, and the training set used in this part is 07++12, which denotes trainval2007 + test2007 + trainval2012. The improvements on the VOC 2012 test set are also marked: SSD*, YOLOv2, and RFBNet512 are improved by \(1.1\%\), \(0.6\%\), and \(0.2\%\), respectively.

Table 3 Detection results on PASCAL VOC 2012
Table 4 Comparison between our method and others on MS COCO
Table 5 Class-specific comparative results of MPB on PASCAL VOC 2007

4.2 MS COCO

In this experiment, the hyperparameters are the same as in the previous literature [39] on COCO.

In the previous literature, the basic learning rate is set to 0.002 and the maximum epoch to 300. We train our network with trainval35k, which is also used by previous networks. The top one-stage detection network from [39] on COCO is RFB512-E, so we also apply our method to RFB512-E in this experiment. As shown in Table 4, our method improves RFBNet300 and RFB512-E by \(0.8\%\) and \(0.6\%\), respectively. Although MS COCO is more difficult than PASCAL VOC and contains more hard or unclear objects, our method still works well and achieves an even better improvement than on VOC (Table 5).

4.3 Performance on LRP

Localization recall precision (LRP) [51] is a new performance metric for object detectors that can directly measure bounding box localization accuracy. As with mAP, moLRP is the performance metric for the entire detector. The mean optimal box localization, FP, and FN components, denoted by \(moLRP_{IoU}\), \(moLRP_{FP}\), and \(moLRP_{FN}\), respectively, are similarly defined as the means of the class-specific components. We test our models and report the results in Table 6. For each metric, smaller is better.

From Table 6, we can see that \(moLRP_{IoU}\), \(moLRP_{FP}\), and \(moLRP_{FN}\) are all decreased by MPB, which demonstrates improvements in both classification and localization.

Table 6 Experimental results of SSD and RFB through LRP on MS COCO

4.4 Two-stage detector

We also test on Faster R-CNN; the results are shown in Table 7. In this experiment, the network settings are the same as in the original work [34]. The other MPB settings are the same as in the SSD + MPB experiment.

4.5 Ablation experiments

4.5.1 Blending method

In order to better understand the proposed network, we investigate the effect of each component of HLoss and compare it with [38]. The comparison is shown in Table 8.

Firstly, we set up the network by applying our method only to the localization part: we apply the blending method to location prediction by adding the HLoss at the tail of the localization branch. For the classification part, since the input images are blended before training, we keep the random parameter \(\lambda \) greater than 0.5 to ensure that the first image is the dominant one and calculate the loss against it. The results show that the method indeed improves performance on the regression task.

Secondly, we apply the HLoss only to classification. We implement a similar network by adding the HLoss only at the tail of the classification branch; most of the operations are similar to the above.

The results show that HLoss in both components contributes to the improvement in performance for object detection. A combination of them achieves the best result.

Table 7 Comparative results for Faster RCNN with or without MPB
Table 8 Ablation analysis for hybrid loss function
Table 9 Ablation analysis of multi-phased training

4.5.2 Scheduled blending intensity

The comparison between models trained with and without the scheduled blending intensity is shown in Table 9. As we can see, models without scheduled blending training end up even worse than the baseline model, because object detection datasets are not easy for networks to learn. Scheduled blending training overcomes this difficulty by gradually increasing the blending intensity, giving the network time to adapt to the object detection datasets.

In Fig. 6, the loss with a fixed ratio converges slowly, and a larger blending intensity makes the network harder to converge, but the loss of our method converges faster because of the low intensity in the early phase.

Fig. 6

The confidence and location losses of models with different blending intensity schedules

4.5.3 Enhanced hard negative mining

The comparison between models trained with and without enhanced hard negative mining is shown in Table 9. As we can see, enhanced hard negative mining improves the performance of the detection networks, because the blending method creates many more negative examples, which disturb the training process.

Fig. 7

Comparison between RFBNet and RFBNet+MPB. RFBNet+MPB performs better on low-confidence objects and gives more detections in uncertain areas

4.6 Comparison of blending schedules

In this experiment, three groups are compared: (1) networks trained with no blending; (2) networks trained with a fixed ratio (blending intensity set to 0.02, 0.05, and 0.1); and (3) networks trained with a scheduled ratio (we test a linear schedule, an exponential schedule, and a sigmoid schedule).

Table 10 Comparison between different schedules on PASCAL VOC 2007
Table 11 Comparison between MPB and other methods on PASCAL VOC 2007
Table 12 Comparison of performances for different quantity of blended data

According to Table 10, blending with a fixed ratio makes the networks worse, and the linearly scheduled blending method also performs badly because its blending intensity increases too fast early on. The exponential schedule performs better than the baseline but worse than the sigmoid schedule, due to its low intensity in the middle of training.

4.7 Comparison with other data augmentation methods

We also compare our method with other data augmentation methods that work on one-stage detectors: label smoothing [30], random erasing [52], and traditional methods. SSD is the typical network, and SSD* comes with the extra augmentation methods of [38]. For SSD* + LM (label smoothing), we soften the classification labels of each object by setting the values to 0.9 and 0.1/20 where they were previously 1 and 0, respectively. For SSD* + RE (random erasing), we use its default settings. SSD* + MPB is the variant with our blending method. All networks are trained in the same environment with the same hyperparameters. Our method further improves detection models on top of traditional augmentation methods, and it is more effective than label smoothing and random erasing.

The final results are shown in Table 11; our method is better than label smoothing.

Fig. 8

In a, the blue bar shows the number of weights that decreased from the original SSD to SSD with MPB, and the pink bar shows the number of weights that increased. Similarly, in b, the blue and pink bars show the numbers of biases that decreased and increased, respectively

Fig. 9

The number of objects detected by SSD with and without MPB. The results are obtained on the PASCAL VOC 2007 test dataset

Fig. 10

More comparison examples, selected from the MS COCO dataset. As we can see, the model trained with MPB gives more plausible prediction boxes than the original one

4.8 Quantity of blended data

We conducted an experiment on CIFAR10 to compare different amounts of blended data. Five CIFAR10 datasets are designed: the original dataset of 50k images and 4 expanded datasets (75k, 100k, 125k, 150k). VGG19 is trained on these datasets, and the final results are shown in Table 12. The models trained on the expanded datasets clearly outperform the original model, and performance improves further with more additional blended data.

5 Analysis

Based on Sect. 4, our method improves the performance of object detection networks. In this section, we analyze in depth how this architecture obtains better results.

Firstly, through the proposed pairwise blending operation, the diversity of the dataset is enhanced, which improves the regularization and generalization of the network. The following experimental observations confirm this idea.

  • Our experiment compares all the weights of the original SSD network and of the improved network with MPB, both trained in the previous experiments.

  • As shown in Fig. 8, the weights and biases are decreased by MPB, which means it actually regularizes the network.

Fig. 11

Bad examples on the MS COCO dataset: a gives a wrong bounding box for the giraffe, b assigns both handbag and backpack labels to the handbag with low confidence, c gives a wrong bounding box for the fork, d labels the window incorrectly

Secondly, we analyze the final detection results to show how our network improves the confidences of the previous RFBNet, as follows.

  • As shown in Fig. 7, in the best case, the ski, the confidence grows 4\(\times \) from less than 0.1 in RFBNet to 0.4 in RFBNet+MPB. In the worst case, the woman in green, the confidence varies only slightly, from 0.96 in RFBNet to 0.94 in RFBNet+MPB (Fig. 8).

  • Networks trained with our method try to give more confidence to uncertain objects, such as small and illegible objects that are hard for previous methods to detect (Fig. 9). Our method causes slight fluctuations on high-confidence objects due to the softening effect, but this does not impact the final result. More examples are given in Figs. 10 and 11.

Thirdly, we also explore the improvement in the number of successfully detected objects with medium or low overlap (\(\hbox {overlap} \le 0.5\)) with the ground truth, as follows.

  • As shown in Fig. 9, RFBNet+MPB increases the number of detected medium- or low-overlap objects by \(15.2\%\), which means that RFBNet+MPB gives more correct detections.

  • Benefiting from the higher successful detection rate, RFBNet+MPB gives more accurate predictions than the original network, which eventually leads to a decrease in the regression loss.

6 Conclusion and future work

In this paper, we propose a novel multi-phase blending method with incremental blending intensity for training detection networks. In addition, we design an incremental hybrid loss function containing both a classification loss function and a regression loss function. Furthermore, we discard more negative examples than existing methods do. In this way, we stabilize the training process of object detection networks and regularize them to achieve remarkable improvements on one-stage detectors. The experiments demonstrate the validity of the proposed method. One limitation is that the hyperparameters are handcrafted: several experiments are needed to find the best hyperparameters for each model. Thus, in future work, we will explore adaptive blending training methods to automatically search for the optimal hyperparameters. Secondly, we want to continue the research on other specific problems in the detection task. Finally, we also plan to extend our idea to other areas of computer science and applications [53,54,55,56,57,58,59,60,61], especially intelligent computing [62,63,64] and visual computing [65,66,67].