1 Introduction

Object detection is one of the fundamental problems in computer vision, and it has been substantially advanced by the progress of deep learning over the past few years. Prevalent object detectors mostly regard detection as a problem of classifying candidate boxes [4, 5, 17]. This view has benefited from the increasingly successful application of convolutional neural networks (CNNs) to image recognition tasks [18, 25, 26, 27]. As a result, an increasing number of object detection methods based on CNNs [2, 10, 19] have been proposed. These structurally diverse frameworks have improved detection accuracy to a certain degree, and many achieve real-time performance on common benchmark datasets. However, images typically contain occlusions and small objects to which most current object detection methods are not sensitive, and this insensitivity inevitably restricts detection accuracy. Therefore, developing detection methods that are sensitive to occlusions and small objects is a key problem that must be addressed to provide more robust object detection.

In general, small object detection is a special case of detecting objects with vastly different size scales, which is a very common problem in object detection and makes the detection of small objects particularly challenging. Current object detection methods therefore accommodate small objects by generating feature representations at different scales. A number of empirical studies [13, 14, 17] have suggested that representations built from multi-scale feature maps, and large-scale feature maps in particular, are very helpful for detecting small objects. This indicates that multi-scale feature extraction can be expected to enhance small object detection, which motivates us to design a multi-scale feature extraction method and integrate it into our model. Sensitivity to occlusions, by contrast, is generally pursued by capturing a large number of variations in visual features within a large dataset. However, capturing all possible occlusions within a dataset is not feasible: occlusions with low probabilities will be absent from even very large datasets, and expanding a dataset simply to include them is highly inefficient. We therefore turn to an adversarial network to generate the occlusions we need.

This paper addresses these problems by proposing an improved approach denoted as Robust Faster R-CNN. The design employs a cascaded network structure based on the Faster R-CNN architecture to extract features from objects of different scales in multi-label data. In addition, we train an adversarial network to generate training samples with occlusions that significantly affect the classification ability of the model, which improves its robustness to occlusions. Furthermore, we design a multi-scale RoIAlign operation that adds multiple pool sizes to adapt the detection ability of the network to objects of different sizes. Experimental results on the PASCAL VOC 2007 and 2012 datasets, which are widely used benchmarks for evaluating object detection performance, demonstrate that our approach performs more effectively and more accurately than several state-of-the-art approaches.

2 Related work

In the past few years, a great deal of work has been devoted to object detection models. These models are usually based on one of two types of frameworks. The first kind relies on region proposals. These region-based methods divide the object detection task into two stages: in the first stage, a dedicated region proposal network (RPN) is grafted onto a deep convolutional neural network (CNN) to extract features from proposed regions and thereby generate high-quality candidate boxes; in the second stage, a region-wise sub-network classifies and refines these candidate boxes. The second kind is region-free and performs object detection in a single stage.

With the rise of CNNs, two-stage methods such as R-CNN [5], Fast R-CNN [4], Faster R-CNN [17], SPPnet [6], and R-FCN [2] have quickly become the mainstream of object detection. R-CNN [5] extracted region proposals using the Selective Search method [23] and adopted a linear support vector machine (SVM) as the classifier for the proposals. However, its region proposal generation process was computationally slow. Accordingly, Fast R-CNN [4] increased the speed of this process with a novel RoIPooling operation (a special case of spatial pyramid pooling) that allowed the classification layers to reuse features computed over the CNN feature maps. Faster R-CNN [17] then replaced the Selective Search method with a region proposal network to further increase the speed of proposal generation; moreover, its convolutional layers were shared with the other components, which enabled end-to-end training of the entire network. Faster R-CNN served as the basis of the winning entries in the ILSVRC and COCO 2015 competitions, and it obtained an mAP of 69.9% on the PASCAL VOC 2007 dataset [3]. In addition, single-stage object detection methods such as SSD [14], YOLO [16], and RON [11] have been developed in recent years. These methods directly estimate object candidates without relying on region proposals and are therefore computationally faster than two-stage methods. While they perform well on salient, common objects, they struggle to recognize occluded and small objects.

The current success of object detection is closely tied to the use of large-scale datasets. For the occlusion problem, however, rare occlusions are difficult to find even in large-scale datasets, and expanding a dataset by collecting rare occlusion samples is inefficient. Therefore, instead of attempting to collect data that contain rare occlusions, we attempt to generate such occlusions, which has led us to study adversarial networks. As an alternative to relying on large-scale datasets to capture all possible variations of visual features, A-Fast-R-CNN [24] proposed training an Adversarial Spatial Dropout Network (ASDN) to generate low-probability adversarial examples in the convolutional features, and this approach has demonstrated good performance [22]. This inspired our strategy for improving the robustness of our model to occlusions. Other methods have proposed the use of a cascaded network to recognize occluded or invisible key points [1]. In addition, a \(1\times 1\) convolutional layer has been employed to reduce the number of network parameters and thereby accelerate computation [21]. Although these developments have considerably improved object detection for images with small objects and occlusions, none of these methods effectively solves both problems simultaneously with reasonable accuracy and computational speed.

In contrast, the proposed method combines a highly effective network structure, multi-layer fusion, multi-scale pooling, and a more effective training strategy to take full advantage of CNNs for object detection, and it extracts features at different size scales without substantially reducing computational speed. In conjunction with the adversarial network, the proposed method can adapt to widely varying object characteristics in multi-label images.

3 Improved model

3.1 Multi-cascaded network

The features at different depths of a CNN correspond to different levels of semantic information. In general, features extracted by a deep network contain a greater proportion of high-level semantic information, while features extracted by a shallow network retain more detail. Therefore, the feature map becomes increasingly abstract as the depth of the network increases, and the reduced proportion of detailed information degrades the recognition of small objects. Nearly all current methods that achieve good classification and detection results address this problem by adopting image pyramids, i.e., multi-scale training. However, this approach is computationally intensive, which has led to efforts to enhance the recognition of multi-scale objects by modifying the network structure instead.

Fig. 1

The improved Faster R-CNN model with multi-scale RoIAligns and cascaded network structure

The above-described effect of network depth on performance is clearly demonstrated by the VGG16 model [20], which is illustrated in Fig. 1. As the figure shows, the convolution layers of VGG16 adopt successive small \(3\times 3\) convolutional kernels, which increases the depth of the network while reducing the number of parameters and thereby the computational complexity of the model. In addition, the use of smaller kernels allows a greater number of filters than architectures built on large convolution kernels, such as AlexNet [12], which in turn allows more activation functions and enhances the learning of more complex patterns and concepts. However, small convolutional kernels provide less information regarding the scale, shape, and position of objects, particularly small objects. Furthermore, padded edge features are counted several times, which introduces errors. In contrast, larger convolution kernels capture more spatial context, which facilitates the recognition of objects at different scales. It must be noted, however, that the number of convolutional kernels directly determines the number of parameters, so the number of kernels in the cascaded networks is mainly constrained by the parameter budget: if the added convolutional layers bring too many parameters into the network, performance will undoubtedly suffer. The number of parameters must therefore be controlled while performance is improved, so an optimal tradeoff is required between the quality of the feature representation and the computational cost. This is addressed in the present work by the multi-cascaded structure of the improved Faster R-CNN model illustrated in Fig. 1. The structure adds two shallow networks to the original VGG16 model, where one branch contains five \(5\times 5\) convolution kernels and the other contains three \(7\times 7\) convolution kernels. The two shallow networks produce final output feature maps of the same size as that of the VGG16 model, but with more detailed object information owing to their higher effective resolution. Each cascaded network has the same number of pooling layers, as marked in the figure, which ensures that the feature maps used for fusion are consistent in size. The concat layer splices the feature maps and keeps the fused feature map size constant; in practice this is matrix splicing along the channel dimension, as shown in Fig. 1. Batch normalization (BN) and scale operations are added after each convolution layer, which increases the training rate and improves the classification effect [8].
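
To make the cascaded structure concrete, the following is a minimal PyTorch sketch of a three-branch feature extractor in the spirit of Fig. 1: a deep 3×3 (VGG-style) branch plus shallow 5×5 and 7×7 branches whose outputs are spliced by a concat layer. The branch depths, channel widths, and pooling placement are illustrative assumptions rather than the exact configuration used in the paper.

```python
# Illustrative sketch of the cascaded feature extractor (not the authors' code).
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel, n_convs, pool=True):
    """n_convs convolutions of the given kernel size, each followed by
    BatchNorm (the BN + scale operation) and ReLU, optionally ending in pooling."""
    layers = []
    for i in range(n_convs):
        layers += [
            nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel, padding=kernel // 2),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        ]
    if pool:
        layers.append(nn.MaxPool2d(2, 2))
    return nn.Sequential(*layers)

class CascadedBackbone(nn.Module):
    """Three parallel branches: a deep 3x3 (VGG-style) branch and two shallow
    branches built from 5x5 and 7x7 kernels. All branches apply the same number
    of 2x2 poolings, so their outputs share one spatial size and can be
    concatenated along the channel axis (the 'concat' fusion)."""
    def __init__(self):
        super().__init__()
        # Deep branch: a truncated VGG-like stack (illustrative depth).
        self.branch3 = nn.Sequential(
            conv_block(3, 64, 3, 2), conv_block(64, 128, 3, 2),
            conv_block(128, 256, 3, 3), conv_block(256, 512, 3, 3),
        )
        # Shallow branch with five 5x5 convolutions.
        self.branch5 = nn.Sequential(
            conv_block(3, 64, 5, 2), conv_block(64, 128, 5, 1),
            conv_block(128, 256, 5, 1), conv_block(256, 256, 5, 1),
        )
        # Shallow branch with three 7x7 convolutions plus an extra pooling
        # so that all branches reduce the input by the same factor.
        self.branch7 = nn.Sequential(
            conv_block(3, 64, 7, 1), conv_block(64, 128, 7, 1),
            conv_block(128, 256, 7, 1), nn.MaxPool2d(2, 2),
        )

    def forward(self, x):
        f3, f5, f7 = self.branch3(x), self.branch5(x), self.branch7(x)
        return torch.cat([f3, f5, f7], dim=1)  # channel-wise splicing (concat layer)
```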

3.2 Parameter transferring

As shown in Fig. 2, the parameters pre-trained with the Faster R-CNN model are transferred directly to the improved Faster R-CNN model to reduce the training time [15]. Only the pre-trained parameters of the fc6 layer are not transferred, because the multi-scale RoIAlign in the improved model changes the input dimensions of fc6. Additional training is then conducted to fine-tune the parameters of the improved Faster R-CNN model. Moreover, the transferred fc7 layer helps guarantee the representational capability of the transferred model parameters.
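
A minimal sketch of this transfer, assuming standard PyTorch state dictionaries and that the re-dimensioned layer is named fc6: every pre-trained parameter whose name and shape match is copied, and fc6 keeps its fresh initialization.

```python
# Hedged sketch of the parameter transfer; layer names are assumptions.
def transfer_parameters(improved_model, pretrained_state_dict):
    own_state = improved_model.state_dict()
    transferred = {
        name: param for name, param in pretrained_state_dict.items()
        if not name.startswith("fc6")                 # fc6 dims changed: re-initialize instead
        and name in own_state
        and own_state[name].shape == param.shape      # copy conv layers, fc7, cls/bbox heads
    }
    # strict=False leaves the randomly initialized fc6 (and any new layers) untouched.
    improved_model.load_state_dict(transferred, strict=False)
```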

Fig. 2

Faster R-CNN parameter transference to the improved Faster R-CNN model

3.3 Multi-scale ROIAligns

The RoIPool operation [17] is a standard operation for extracting a small feature map (e.g., \(7\times 7\)) from an RoI. First, RoIPool quantizes a floating-point RoI to the discrete granularity of the feature map; this quantized RoI is then subdivided into spatial bins, which are themselves quantized; finally, the feature values in each bin are aggregated (usually by max pooling). For example, quantization is performed on a continuous coordinate x by computing round(\(\hbox {x}/16\)), where 16 is the feature map stride and round(\(\cdot\)) denotes rounding. These quantizations introduce misalignments between the RoI and the extracted features. While this may not impact classification, which is robust to small translations for large objects, it has a strongly negative effect on predicting pixel-accurate object boxes (i.e., for small objects). Furthermore, the RoIPool operation breaks pixel-to-pixel translation equivariance. Therefore, the present work adopts the RoIAlign operation proposed in Mask R-CNN [7], which eliminates the harsh quantization of RoIPool and properly aligns the extracted features with the input. As shown in Fig. 3, RoIAlign avoids any quantization of the RoI boundaries or bins (e.g., it applies \(\hbox {x}/16\) rather than round(\(\hbox {x}/16\))). Bilinear interpolation is then employed to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and the result is aggregated using a max pooling operation.
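
The difference in coordinate handling can be illustrated with torchvision's roi_pool and roi_align operations; the stride of 16 and the 7×7 output follow the description above, while the feature map and box values are illustrative.

```python
# Sketch contrasting quantized RoIPool with interpolated RoIAlign.
import torch
from torchvision.ops import roi_align, roi_pool

feature_map = torch.randn(1, 512, 38, 50)          # backbone features at stride 16
box = torch.tensor([[101.3, 55.7, 260.9, 180.2]])  # one RoI in image coordinates

# RoIPool-style quantization snaps coordinates to the feature grid, e.g.
# round(x / 16), which misaligns the RoI and its features.
quantized_coords = torch.round(box / 16.0)
pooled_quant = roi_pool(feature_map, [box], output_size=(7, 7), spatial_scale=1.0 / 16)

# RoIAlign keeps x / 16 as a floating-point coordinate, samples 4 points per bin
# with bilinear interpolation, and aggregates them per bin.
pooled_align = roi_align(feature_map, [box], output_size=(7, 7),
                         spatial_scale=1.0 / 16, sampling_ratio=4)
print(quantized_coords, pooled_quant.shape, pooled_align.shape)  # both pooled: (1, 512, 7, 7)
```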

Fig. 3

Process of the RoIAlign operation, where the dashed grid represents a feature map, the solid lines an RoI (with \(7\times 7\) bins), and the dots the 4 sampling points in each bin. Here, RoIAlign computes the value of each sampling point by bilinear interpolation based on the nearby grid points on the feature map. No quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points

The Faster R-CNN framework tends to lose a considerable amount of object information during feature map generation, which seriously detracts from its small object detection performance [15]. For example, an object originally composed of \(32\times 32\) pixels retains only \(2\times 2\) pixels in the last layer of the feature map. This problem is generally addressed by enlarging the feature map and utilizing smaller anchor scales in the RPN. The Faster R-CNN framework applies the RoIPool operation to the feature map with a pool size of \(7\times 7\) for each RoI proposed by the RPN. However, capturing object features at different size scales is difficult with a single pool size. The present work addresses this issue by adding two pool sizes of \(11\times 3\) and \(3\times 11\), as previously proposed for enhancing small object detection in the R2CNN model [9]. The \(3\times 11\) pool size is designed to capture more horizontal features, and therefore aids in the detection of objects whose widths are much greater than their heights. In contrast, the \(11\times 3\) pool size captures more vertical features and is therefore helpful for detecting objects whose heights are much greater than their widths. Furthermore, a pool size of \(11\times 11\) is added to enhance the robustness of the proposed model in detecting objects at small size scales. In addition, the adoption of smaller anchor scales has been demonstrated to enhance small object detection [9]. We therefore added a smaller anchor scale to the original scales of (8, 16, 32), so that the proposed model utilizes anchor scales of (4, 8, 16, 32), which generates 12 anchors per location in the RPN. The proposed multi-scale RoIAlign operation can therefore pool features extracted at variable size scales, and thereby improves the accuracy of object detection.
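
A hedged sketch of the multi-scale RoIAlign described above, built on torchvision's roi_align: each RPN proposal is pooled at the four sizes 7×7, 11×11, 3×11, and 11×3, producing four feature maps per RoI (tensor sizes here are illustrative).

```python
# Sketch of multi-scale RoIAlign pooling for one proposal.
import torch
from torchvision.ops import roi_align

def multi_scale_roi_align(feature_map, rois, spatial_scale=1.0 / 16):
    # square, small-object, horizontally biased, vertically biased pool sizes
    pool_sizes = [(7, 7), (11, 11), (3, 11), (11, 3)]
    return [roi_align(feature_map, rois, output_size=size,
                      spatial_scale=spatial_scale, sampling_ratio=2)
            for size in pool_sizes]

anchor_scales = (4, 8, 16, 32)   # with 3 aspect ratios -> 12 anchors per RPN location
features = torch.randn(1, 512, 38, 50)
rois = [torch.tensor([[48.0, 32.0, 112.0, 96.0]])]   # one proposal from the RPN
pooled = multi_scale_roi_align(features, rois)
print([p.shape for p in pooled])
# [(1, 512, 7, 7), (1, 512, 11, 11), (1, 512, 3, 11), (1, 512, 11, 3)]
```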

3.4 Feature descending fusion

In addition, the multi-scale RoIAlign operation provides the subsequent fully connected layer with larger input dimensions and thereby increases the computational time of object detection. The improved Faster R-CNN model therefore uses convolution and pooling layers to reduce the parameter redundancy of the fully connected layer, as has been previously proposed [21]. Because the multi-scale RoIAlign outputs have different dimensions, direct feature splicing is impossible. This can be addressed with a flatten layer that transforms each pooled feature map (a multidimensional matrix) into a one-dimensional vector, as applied in the R2CNN model. The present work additionally avoids parameter redundancy before the flatten layer by reducing the number of model parameters with a convolutional layer having a kernel size of \(1\times 1\) and a stride of 1. We accordingly reduce the dimensionality of each of the four pooled feature maps: the \(7\times 7\) feature map is reduced to 512 channels, the \(11\times 11\) feature map to 128 channels, and the \(3\times 11\) and \(11\times 3\) feature maps to 256 channels each. Flatten layers then transform the pooled feature maps into four one-dimensional vectors, and a concat layer passes the concatenated vector to the fully connected layer.
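
The following sketch shows this descending fusion under the channel counts given above (7×7 → 512, 11×11 → 128, 3×11 and 11×3 → 256 each); the module layout is an illustrative assumption.

```python
# Sketch of the 1x1-convolution reduction, flatten, and concat fusion.
import torch
import torch.nn as nn

class DescendingFusion(nn.Module):
    def __init__(self, in_channels=512):
        super().__init__()
        self.reduce = nn.ModuleList([
            nn.Conv2d(in_channels, 512, kernel_size=1, stride=1),  # for the 7x7 map
            nn.Conv2d(in_channels, 128, kernel_size=1, stride=1),  # for the 11x11 map
            nn.Conv2d(in_channels, 256, kernel_size=1, stride=1),  # for the 3x11 map
            nn.Conv2d(in_channels, 256, kernel_size=1, stride=1),  # for the 11x3 map
        ])

    def forward(self, pooled_maps):
        # pooled_maps: the four multi-scale RoIAlign outputs for the same RoIs
        vectors = [conv(m).flatten(start_dim=1)       # flatten layer per branch
                   for conv, m in zip(self.reduce, pooled_maps)]
        return torch.cat(vectors, dim=1)              # concat layer -> input of fc6
```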

This process is illustrated in Fig. 4, where we have added a \(1\times 1\) convolution layer after each multi-scale pooling layer. As is well known, a convolution with a \(1\times 1\) kernel typically acts to decrease dimensionality, which here refers to the number of feature channels (thickness), while the width and height of the feature map are not changed.

Fig. 4

The feature filtering structure using convolution layers with a kernel size of \(1\times 1\)

4 Adversarial network

The functionality of an adversarial network A(X), where X is a set of features, is first analyzed by comparing the loss function obtained for an object detector network F(X) to that obtained for A(X), while adopting the terms \({{F}_{c}}\)(X) and \({{F}_{l}}\)(X) to represent the class and predicted bounding box location outputs, respectively, and C and L to represent the respective groundtruth class and bounding box locations for X. Accordingly, the loss function of F(X) can be given as follows.

Fig. 5

Architecture of the Adversarial Spatial Dropout Network (ASDN) in combination with the improved Faster R-CNN framework. Occlusion masks are created to generate training examples that are difficult to classify

$$\begin{aligned} E_F = E_{\mathrm{softmax}}(F_c(X), C) + E_{\mathrm{bbox}}(F_l(X), L) \end{aligned}$$
(1)

Here, the first term is the softmax classification loss and the second term is the bounding box regression loss based on \({{F}_{l}}\)(X) and L. The purpose of the adversarial network is to learn to predict the features X that the detector would fail to classify accurately. Accordingly, A(X) generates new adversarial examples for a given X, which are then added to the training samples. The adversarial network is trained via the following loss function.

$$\begin{aligned} E_A = -E_{\mathrm{softmax}}(F_c(A(X)), C) \end{aligned}$$
(2)

Therefore, obtaining a low value of \({{E}_{F}}\) for examples generated by A(X) that are easily classified by F(X) results in a high value of \({{E}_{A}}\). In contrast, obtaining a high value of \({{E}_{F}}\) for examples generated by A(X) that are difficult for F(X) to classify results in a low value of \({{E}_{A}}\). As such, the two networks perform exactly opposite tasks.
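
The two objectives can be written compactly as loss functions; the sketch below uses a smooth L1 loss as a stand-in for the bounding-box term \(E_{bbox}\), and the function and tensor names are illustrative.

```python
# Sketch of the opposing objectives in Eqs. (1) and (2).
import torch.nn.functional as F

def detector_loss(cls_scores, bbox_pred, labels, bbox_targets):
    # E_F = E_softmax(F_c(X), C) + E_bbox(F_l(X), L)
    return F.cross_entropy(cls_scores, labels) + F.smooth_l1_loss(bbox_pred, bbox_targets)

def adversary_loss(cls_scores_on_adv, labels):
    # E_A = -E_softmax(F_c(A(X)), C): low exactly when the detector's loss is high
    return -F.cross_entropy(cls_scores_on_adv, labels)
```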

4.1 Adversarial spatial dropout network training

We apply stage-wise training to the ASDN, as was conducted in a previous work [24]. The ASDN is first pre-trained on the multi-label image dataset to obtain a preliminary perception of the data before the joint training of the ASDN and the improved Faster R-CNN; during this pre-training, all of the other network layers are kept fixed.

Fig. 6

a Examples of occlusions that are sifted to select the hard occlusions, and are used as the groundtruth for training the ASDN. b Examples of occlusion masks generated by the ASDN, where the black regions represent occlusions representative of the most significant pixels for classification

As shown in Fig. 5, the ASDN shares the structure of the improved Faster R-CNN framework in terms of the convolutional layers, the RoIPooling layer, and the fully connected layers. The convolutional features of each feature map after the RoIPooling layer are used as the inputs of the ASDN. Given a feature map of size \(d\times d\), the ASDN generates a mask indicating which parts of the feature map to occlude (by assigning zeros), with the aim of increasing the value of \({{E}_{F}}\) obtained by the improved Faster R-CNN through occluded features that are more difficult to classify. To create training targets, a \(d/3 \times d/3\) sliding window deletes the values of all channels at its current position, thereby generating a new feature vector. All of the feature vectors obtained in this manner are passed to the softmax loss layer, and the feature vector yielding the highest loss is selected. A single \(d \times d\) mask is then created with 1 at the selected window location and 0 elsewhere. The sliding-window process is illustrated by mapping the window back to the image, as shown in Fig. 6a. In this way, the ASDN generates spatial masks for n feature maps and obtains n training samples with high losses. The binary cross entropy loss used to train the ASDN is given as follows.
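
A sketch of this hard-occlusion search for a single RoI feature, assuming a scalar class-index label and a `classify` function that applies the detector's classification head; all names are illustrative, and the exact search used by the authors may differ.

```python
# Sketch: slide a d/3 x d/3 window, zero the covered channels, and keep the
# mask that produces the highest classification loss (the "hard" occlusion).
import torch
import torch.nn.functional as F

def hardest_occlusion_mask(feature, label, classify):
    """feature: (C, d, d) pooled RoI feature; classify: features -> class scores."""
    C, d, _ = feature.shape
    w = d // 3
    best_loss, best_mask = None, None
    with torch.no_grad():
        for y in range(0, d - w + 1):
            for x in range(0, d - w + 1):
                occluded = feature.clone()
                occluded[:, y:y + w, x:x + w] = 0       # drop all channels in the window
                loss = F.cross_entropy(classify(occluded.unsqueeze(0)),
                                       label.unsqueeze(0))
                if best_loss is None or loss > best_loss:
                    best_loss = loss
                    best_mask = torch.zeros(d, d)
                    best_mask[y:y + w, x:x + w] = 1     # 1 marks the occluded window
    return best_mask                                     # ground truth for the ASDN
```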

$$\begin{aligned} E= & {} -\frac{1}{n}\sum \limits _{p}^{n}\sum \limits _{i,j}^{d} \left[ {\tilde{M}}_{ij}^{p}{{A}_{ij}} \left( {{X}^{p}}\right) \right. \nonumber \\&+\left. \left( 1-{\tilde{M}}_{ij}^{p}\right) \left( 1-{{A}_{ij}}\left( {{X}^{p}}\right) \right) \right] \end{aligned}$$
(3)

Here, \({{A}_{ij}}({{X}^{p}})\) represents the output of the ASDN at location (i, j) given an input feature map \({{X}^{p}}\); if \({\tilde{M}}_{ij}^{p}\) = 1, the values of all channels at the corresponding spatial location of the feature map are dropped out. The output generated by the ASDN is not a binary mask but a continuous heatmap. The ASDN uses importance sampling to select one third of the pixels in the heatmap, which are assigned a value of 1, while the remaining two thirds are set to 0. As illustrated in Fig. 6b, applying the occlusions that generate high loss during ASDN learning reveals which parts of objects are most significant for classification; we then use the masks to occlude these parts and make the classification harder.
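
A simplified sketch of the heatmap-to-mask conversion, which replaces the importance sampling described above with a plain top-k selection of the highest-scoring third of the positions.

```python
# Sketch: binarize the ASDN heatmap by occluding its top third of positions.
import torch

def heatmap_to_mask(heatmap):
    flat = heatmap.flatten()
    k = flat.numel() // 3
    mask = torch.zeros_like(flat)
    mask[flat.topk(k).indices] = 1.0      # occlude the most important 1/3 of pixels
    return mask.view_as(heatmap)
```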

4.2 Joint training

We jointly optimize the pre-trained ASDN and our improved Faster R-CNN model. In the joint model, the ASDN shares the convolutional layers and the RoIPooling layer with the improved Faster R-CNN model but uses its own separate fully connected layers. Naturally, the parameters of the two networks are optimized independently, in accordance with their diametrically opposed tasks. For training the improved Faster R-CNN model, we first use the pre-trained ASDN to generate masks that modify the feature maps after the RoIPooling layer during forward propagation, and then pass the modified features to the improved Faster R-CNN model for loss calculation and training. Although the features are modified, their labels remain unchanged. This introduces more diverse examples when training the improved Faster R-CNN model and results in greater robustness for classifying objects with occlusions. For training the ASDN, the sampling strategy used to convert the heatmap into a binary mask makes the classification loss non-differentiable, so the gradients from the classification loss cannot be back-propagated through it during training. As in A-Fast-R-CNN [24], only the hard example masks, i.e., the binary masks that lead to significant drops in the Robust Faster R-CNN classification scores, are used as ground truth to train the adversarial network with the loss in Eq. (3).
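
The joint optimization can be sketched roughly as follows, reusing the `heatmap_to_mask` helper from above and assuming that `asdn` maps pooled RoI features to sigmoid heatmaps and that `detector.roi_head` returns class scores and box regressions; this is a simplified illustration under those assumptions, not the authors' training code.

```python
# Rough sketch of one joint-training step.
import torch
import torch.nn.functional as F

def joint_step(pooled, labels, bbox_targets, hard_masks, detector, asdn,
               det_opt, asdn_opt):
    # 1) Train the detector on ASDN-occluded features (labels stay unchanged).
    with torch.no_grad():
        masks = torch.stack([heatmap_to_mask(h) for h in asdn(pooled)])
    occluded = pooled * (1.0 - masks.unsqueeze(1))   # zero all channels at masked positions
    cls_scores, bbox_pred = detector.roi_head(occluded)
    det_loss = F.cross_entropy(cls_scores, labels) + \
               F.smooth_l1_loss(bbox_pred, bbox_targets)
    det_opt.zero_grad()
    det_loss.backward()
    det_opt.step()

    # 2) Train the ASDN to reproduce the pre-computed hard (high-loss) masks, Eq. (3),
    #    since the heatmap-to-mask sampling step itself is non-differentiable.
    heatmaps = asdn(pooled.detach())
    asdn_loss = F.binary_cross_entropy(heatmaps, hard_masks)
    asdn_opt.zero_grad()
    asdn_loss.backward()
    asdn_opt.step()
    return det_loss.item(), asdn_loss.item()
```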

5 Experiment

5.1 Datasets and evaluation metrics

The PASCAL VOC 2007 and 2012 datasets employed in the experiments contain a total of 9963 and 22,531 images, respectively, and are divided into train, val, and test subsets. Our experiments employed 5011 trainval and 4952 test images for VOC 2007 and 11,540 trainval and 10,991 test images for VOC 2012. The average precision (AP) and the mean AP (mAP) were employed as the evaluation metrics in compliance with the PASCAL challenge protocols. Test speed and convergence speed are also important metrics for evaluating model performance. The experimental results obtained for the proposed improved Faster R-CNN and Robust Faster R-CNN frameworks were compared with results obtained using several state-of-the-art approaches, including Faster R-CNN, A-Fast-R-CNN, SSD, and RON. All of the experimental results were obtained on a PC equipped with an i7 processor with a 4.20 GHz clock speed, a single GTX 1080Ti GPU, and 16 GB of memory.

5.2 Convergence and joint model training

We initialized the parameters of the improved Faster R-CNN with the Faster R-CNN parameters trained on the VOC 2007 trainval subset. To accommodate the changed dimensions of the fully connected fc6 layer in the improved model, this layer was initialized from a zero-mean Gaussian distribution with a standard deviation of 0.01. The learning rate was set to 0.01 and scaled by a factor of 0.1 every 20 epochs, with momentum and weight decay values of 0.9 and 0.0005, respectively, for a total of 60 epochs. Training of the Faster R-CNN model and the improved Faster R-CNN model thus ran for 60 epochs, where each epoch consisted of 2000 iterations. The mAP values of the training models were calculated at different iterations during training prior to generating the final models, and the results are presented in Fig. 7. The figure indicates that the mAP scores began to converge after a little less than 40 epochs, or 70K iterations. At that point, the improved Faster R-CNN training model yielded an mAP of 77.5% and the Faster R-CNN training model 73.2%. These results demonstrate that the improved Faster R-CNN model converges faster than the Faster R-CNN model. The ASDN was pre-trained for 12K iterations. The joint model was then trained for 120K iterations, again with a varying learning rate that was initially 0.001 and decreased to 0.0001 after 60K iterations, using the momentum and weight decay values given above.
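
A short sketch of this optimization setup in PyTorch, assuming the re-dimensioned fully connected layer is exposed as `model.fc6`.

```python
# Sketch of the training schedule: SGD with momentum 0.9, weight decay 0.0005,
# lr 0.01 scaled by 0.1 every 20 epochs, and fc6 re-initialized from a
# zero-mean Gaussian with standard deviation 0.01.
import torch
import torch.nn as nn

def configure_training(model):
    nn.init.normal_(model.fc6.weight, mean=0.0, std=0.01)   # re-initialized layer (assumed name)
    nn.init.zeros_(model.fc6.bias)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=0.0005)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
    return optimizer, scheduler   # call scheduler.step() once per epoch (2000 iterations)
```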

Fig. 7

The mAP scores obtained during model training based on the PASCAL VOC 2007 dataset

5.3 Ablation experiments

The ablation experiments were designed to evaluate the influence of different anchor scales and different pool sizes on the object detection performance of models trained on the VOC 2007 dataset: Faster R-CNN; the cascaded network, which has the same structure as the improved Faster R-CNN but adopts the standard RoIPool operation; and Robust Faster R-CNN. Although RoIPooling can also capture features of objects at different scales, it yields lower accuracy than RoIAlign, because RoIAlign removes the strict quantization of RoIPooling and correctly aligns the extracted features with the RoI. The quantization in RoIPooling introduces misalignment between the RoI and the extracted features, breaks pixel-to-pixel translation equivariance, and loses some feature information. While this may not affect the accuracy of detecting large objects, for small objects the quantization reduces recognition accuracy. The results are presented in Table 1. They clearly indicate that the cascaded network with four pool sizes (\(3\times 11\), \(11\times 3\), \(7\times 7\), \(11\times 11\)) performed better than Faster R-CNN with a single pool size (\(7\times 7\)), as well as the cascaded network with a single pool size (\(7\times 7\)) or with three pool sizes (\(3\times 11\), \(11\times 3\), \(7\times 7\)). First, these results demonstrate the benefit of the developed multi-scale RoIAlign operation over the standard RoIPool operation owing to its enhanced capability to extract features at variable scales. Second, the results demonstrate that the cascaded network we designed has a clearly positive effect on accuracy, since it can capture more information and thereby recognize more objects of different sizes. As the network depth increases, the feature map becomes increasingly abstract, and some information, especially about small objects, is lost through convolution and pooling; a low-resolution feature map is unfavorable for recognizing small objects. Hence, we designed the cascaded network structure to extract features from objects with different scales. Finally, the results demonstrate the benefit of including horizontally and vertically biased pool sizes, and show that adding the \(11\times 11\) pool size further enhances the detection performance of the cascaded network, mainly because it improves the detection of smaller objects in the VOC dataset. FT denotes fine-tuning, which also contributed considerably to the improvement in model performance. Compared with the improved Faster R-CNN model, the Robust Faster R-CNN model provided an mAP that was 2.3% greater, reflecting the effectiveness of the ASDN.

Table 1 Ablation experiment results on the VOC 2007 dataset
Table 2 Object detection results on the PASCAL VOC 2007 dataset
Table 3 Object detection results on the PASCAL VOC 2012 dataset

5.4 Results

The AP and mAP values obtained on the VOC 2007 and VOC 2012 datasets by the proposed object detection frameworks and the various state-of-the-art approaches are listed in Tables 2 and 3, respectively. The results indicate that the detection performance of the proposed Robust Faster R-CNN model is significantly better than that of Faster R-CNN, and its detection accuracy for small objects in particular, such as birds and plants, is significantly improved. These results confirm the feasibility of the proposed multi-scale RoIAlign operation. Although the mAP value obtained by the state-of-the-art RON approach on the VOC 2007 dataset is slightly greater than that obtained by the proposed robust model, the robust model performs 4.3% better than Faster R-CNN, which demonstrates the effectiveness of our approach. The results in the tables also clearly show that the inclusion of the ASDN provides significantly greater object detection performance than the improved Faster R-CNN model alone, which confirms the effectiveness of the ASDN.

Some examples of object detection results obtained by the Robust Faster R-CNN model for the VOC 2007 and 2012 datasets are shown in Fig. 8. These examples demonstrate qualitatively that Robust Faster R-CNN can recognize objects with different sizes and width-to-height aspect ratios, and can predict their locations well, particularly for objects like planes, birds, and people. The results in Fig. 8 also demonstrate the robustness of the proposed approach to occlusions, such as in the car, plant, and people images that include occlusions.

We also qualitatively compare examples of object detection results obtained by the Robust Faster R-CNN and Faster R-CNN models on the VOC 2007 and 2012 datasets in Fig. 9. In the first case, a bus occluded at the top left of the image is missed by Faster R-CNN, while the proposed method correctly labels this vague object as a bus. In the second case, a woman on the rightmost side of the figure is visible with only half of her body and is carrying a small child in her arms. Here, Faster R-CNN detects no object at this location, while our proposed method detects a person. These examples represent a striking contrast between Faster R-CNN and the proposed method. Finally, the third case presents an occluded chair, which is missed by Faster R-CNN, while the proposed method correctly labels it as a chair. These illustrations demonstrate the clear advantages of the proposed method over Faster R-CNN for identifying small objects and objects with occlusions.

Fig. 8

Selected examples of object detection results on the PASCAL VOC 2007 and VOC 2012

Fig. 9

Qualitative results of Faster R-CNN vs. Robust Faster R-CNN on VOC. In each pair of detection results (top vs. bottom), the top is the result of Faster R-CNN and the bottom is the result of Robust Faster R-CNN

Finally, Table 4 lists the detection and computational performance results obtained by Faster R-CNN and Robust Faster R-CNN, which are two-stage methods, and SSD and RON, which are single-stage methods, for the VOC 2012 dataset images. Here, we collected the object detection time for each image, and averaged all of the detection times (ms/image). The results indicate that the two-stage methods generally provide a greater accuracy but lower computational speed than the one-stage methods. In addition, we note that, while the computational speed of Robust Faster R-CNN was less than that of Faster R-CNN, this is expected because the use of the multi-scale RoIAlign operation in Robust Faster R-CNN consumes more computational time than the RoIPool operation in Faster R-CNN. Moreover, the difference between the two is quite small, and Robust Faster R-CNN still meets the requirements of real-time object detection. Consequently, the proposed approach provides dramatically increased detection performance relative to Faster R-CNN with only a slight reduction in computational speed.

Table 4 Detection and computational performance results of the proposed Robust Faster R-CNN and one-stage SSD and RON methods on the PASCAL VOC 2012 dataset

6 Conclusion

This paper presented an effective framework, denoted Robust Faster R-CNN, for detecting objects with different size scales and occlusions. The use of a cascaded network together with the multi-scale RoIAlign operation to learn semantic multi-scale feature representations made the proposed model robust to objects with different sizes and width-to-height aspect ratios, such as people, cars, and planes. An ASDN was combined with the proposed network to generate training samples with occlusions that significantly affect the classification ability of the model, which improved its robustness to occlusions. Experimental results obtained by the proposed approach and various state-of-the-art approaches on the PASCAL VOC 2007 and 2012 datasets demonstrated that the Robust Faster R-CNN model generally obtains superior detection accuracy, while its detection speed is not significantly reduced relative to that of the Faster R-CNN model.