1 Introduction

The essence of object detection is to locate and classify objects: rectangular bounding boxes are used to locate the detected objects, and the categories of the objects are classified. Object detection is closely related to object classification, semantic segmentation, and instance segmentation; the details are illustrated in Fig. 1. Object detection is an important area of computer vision and has important applications in scientific research and practical industrial production, such as face detection [215], text detection [94, 282], pedestrian detection [170, 274], logo detection [87, 108], video detection [102, 103], vehicle detection [23, 54], and medical image detection [145]; the details are shown in Fig. 2. In recent decades, the limitations of computing resources, datasets, and basic theories restricted the development and application of deep neural networks [122]. Therefore, in the field of computer vision, traditional object detection algorithms remained popular. Traditional object detection algorithms include DPM [58], Selective Search [224], Oxford-MKL [228], and NLPR-HOGLBP [263], etc. The basic architecture of traditional object detection algorithms is mainly divided into a region selector, a feature extractor, and a classifier, as demonstrated in Fig. 3.

Fig. 1

a Object classification needs to identify the category of objects in an image. b Object detection not only needs to identify the category of objects, but also needs to locate the objects with rectangular bounding boxes. c Semantic segmentation only needs to predict the category of each pixel, and does not need to distinguish object instances. d Instance segmentation needs to predict both the category of each pixel and the object instances

Fig. 2

The application of object detection can be divided into generic object detection and dedicated object detection

Fig. 3

The basic architecture of traditional object detection algorithms. The region selector mainly uses sliding windows of different sizes and ratios to slide over the image from left to right and top to bottom with a certain step size. The image blocks cropped by the sliding window are transformed into images of uniform size. The feature extractor mainly uses HOG [40], Haar [140], SIFT [155], and other algorithms to extract features from the image blocks. Finally, the classifier uses algorithms such as SVM [36] and AdaBoost [229] to identify the object category

Although traditional object detection is relatively mature, it has inherent shortcomings. First, the sliding-window [57] based region selection strategy has high computational complexity and high window redundancy. Second, the morphological diversity of appearances, illumination changes, and backgrounds makes it difficult to design robust features manually. During 2010-2012, only small gains were obtained by building ensemble systems and employing minor variants of existing methods [66]. Deep convolutional neural networks can learn image features from low level to high level, and these learned features are very robust [8, 105, 168]. Therefore, researchers gradually turned their attention to DCNNs.

With the increase of computing power and the number of image datasets [194], there are more opportunities for the development of DCNN-based object detection. In 2012, A. Krizhevsky et al. proposed a DCNN called AlexNet [116], which won the ILSVRC-2012 competition (top-5 error rate of 15.3%). This work triggered a wave of research and application of deep convolutional neural networks. In 2014, R. Girshick et al. proposed RCNN (Regions with CNN features) [66], a milestone in applying DCNN-based methods to object detection. In 2015, J. Redmon et al. proposed an object detection system based on a single neural network called YOLO (You Only Look Once: Unified, Real-Time Object Detection) [187], which was presented at CVPR 2016. With the publication of this series of papers, DCNN-based object detection methods broke through the bottleneck of traditional object detection methods, and object detection entered the deep learning era. Some traditional object detection methods and DCNN-based object detection methods are shown in Fig. 4.

Fig. 4

Classical traditional object detection methods (such as Oxford-MKL [228], DPM [58], NLPR-HOGLBP [263], and Selective Search [224]) and object detection based on DCNNs. It can be seen from the figure that the revival of deep learning transformed the handcrafted features of object detection into learned features, which is the fundamental difference between the two. Two-stage object detection and one-stage object detection are elaborated in Section 4

Two trends can be observed from the development of object detection in recent years. First, detection accuracy is continuously improved to satisfy various complex application scenarios. Second, detection speed is improved to satisfy real-time applications while maintaining accuracy. Therefore, attention must be paid to the trade-off between accuracy and speed [93, 136, 174] in future research, as this trade-off is crucial for obtaining state-of-the-art results.

More than 300 papers in the field of object detection are cited in this review, most of which are based on deep learning. These include two published review papers, Deep Learning for Generic Object Detection: A Survey [147] and Object Detection with Deep Learning: A Review [285]. Compared with them, this review offers many improvements and richer content. In particular, it highlights the pipeline of object detection architectures. First, the backbone networks and loss functions are introduced, which clearly reflect the components and effects of the architectures. The specific object detection architectures are then introduced, which makes them easier for researchers to follow on the basis of the previous introductions. The structure of this review is clearer and the content of each section is more reasonable compared with the two published review papers.

Five contributions are listed as follows:

  1. (1).

The development of deep convolutional neural networks is summarized, and the backbone networks used for object detection in recent years are compared.

  2. (2).

The network frameworks are analyzed and compared in detail, and the loss functions of object detection are summarized.

  3. (3).

Some guidance on future developments and advancements in object detection is provided.

  4. (4).

Difficulties in object detection and their solutions are summarized.

  5. (5).

    The applications of object detection are summarized and the technical details are fully analyzed.

The remainder of this review is organized as follows. In Section 2, the architectures of the backbone networks are introduced in detail, and their performance and parameters are compared and analyzed. In Section 3, the loss functions for object detection are summarized, and their constructions and training strategies in actual architectures are described. Section 4 summarizes the milestone object detection architectures. Section 5 summarizes some important complex problems in the field of object detection. Section 6 summarizes the datasets and evaluation criteria. Section 7 summarizes the applications of object detection. Section 8 summarizes the future development directions of object detection.

2 Backbone network for object detection

A variety of DCNNs with powerful capabilities have been proposed. Most of them have innovative architectures, which are shown in Fig. 5. These DCNNs serve as the backbone networks for object detection (as well as classification and segmentation [37, 152]). In order to improve the performance of feature representation, network architectures have become more and more complicated (deeper layers and more parameters). For environments with limited computing power and storage, such as mobile devices [248], autonomous driving [24, 218], and industrial production, lightweight network structures have been proposed, which simplify the network structure without reducing the feature representation capability. In Table 1, object detection backbone networks are listed. It can be found that the accuracy of the complex backbone networks (CBNs) can be improved by increasing the depth of the network. It can also be found that reducing the parameters in reasonable ways does not affect the accuracy of the lightweight backbone networks (LBNs).

Fig. 5

Milestones of deep convolutional neural network architectures and network modules

Table 1 We summarize and compare the commonly used object detection backbone networks, highlighting and comparing complex backbone networks (CBNs) and lightweight backbone networks (LBNs)

2.1 Complex backbone network

AlexNet

The first convolutional neural network, called LeNet-5, was proposed by Yann LeCun in 1998 based on previous research work. It achieves an accuracy of about 98% on the MNIST dataset [123]. LeNet-5 is a classical convolutional neural network used to recognize handwritten numeric characters. Its emergence determined the basic architecture of deep convolutional neural networks: the convolutional layer, pooling layer, and fully-connected layer in LeNet-5 are the basic components of DCNNs. It was the first practically usable convolutional neural network; however, little progress was made in the following decade, as its expansion was limited by computing power. At the same time, traditional machine learning algorithms such as SVM [36, 246] could achieve the same accuracy or even better, so convolutional neural networks did not attract much attention. In 2012, AlexNet won the ILSVRC-2012 competition. On the test set of ILSVRC-2010, it achieved top-1 and top-5 error rates of 37.5% and 17.0%. On the test set of ILSVRC-2012, it achieved a top-5 error rate of 15.3%, far better than the second place. AlexNet increases the depth and breadth of the LeNet-5 architecture. It consists of five convolutional layers (Conv), three max-pooling layers, and three fully-connected layers (FC), with a total of 60 million parameters [116]. The architecture is shown in Fig. 5a.

The great success of AlexNet is due to the following technologies:

  1. (1).

Data augmentation methods (such as horizontal flipping, random cropping, translational transformation, and color/illumination transformation) are used to extend the dataset and reduce overfitting (see the sketch after this list).

  2. (2).

Traditional activation functions (such as Sigmoid and Tanh) are replaced by the ReLU activation function [163], which alleviates the vanishing gradient problem in deeper networks.

  3. (3).

A fraction of neurons is randomly dropped using the Dropout [209] regularization method during training, which reduces the complex co-adaptation of neurons and the probability of overfitting.

  4. (4).

Multi-GPU parallel computing is used, with communication between certain layers, which speeds up network training.
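As a concrete illustration of the data augmentation strategy in point (1), the following sketch approximates AlexNet-style augmentation with torchvision transforms (horizontal flipping, random cropping, translation, and color jitter). The transform parameters are illustrative assumptions, not the original AlexNet settings.

```python
# Illustrative approximation of AlexNet-style data augmentation using torchvision.
# The parameter values below are assumptions for demonstration, not the original settings.
import torchvision.transforms as T

train_transform = T.Compose([
    T.Resize(256),                                      # rescale the shorter side
    T.RandomCrop(224),                                  # random cropping
    T.RandomHorizontalFlip(p=0.5),                      # horizontal flipping
    T.RandomAffine(degrees=0, translate=(0.05, 0.05)),  # small translational transformation
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),  # color/illumination change
    T.ToTensor(),
])
```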

Since the success of AlexNet, various DCNNs have emerged, such as VGGNet [207], ZFNet [266], GoogLeNet [213], and ResNet [82]. Throughout the development of network structures, the ways to improve the performance of network models include increasing the depth of the networks, as well as carefully designing subtle topologies and bottlenecks to reduce the parameters of the network models and improve generalization ability.

ZFNet

Researchers wanted a visualization technique for the convolutional layers that makes it possible to intuitively analyze how the image feature maps change at each layer. Matthew D. Zeiler et al. proposed a method for visualizing feature maps using unpooling and deconvolution [267, 268] layers in ZFNet [266]. The authors changed the size of the convolution kernel of the first layer in AlexNet from 11 × 11 to 7 × 7 and reduced the stride from 4 to 2. These changes preserve more low-level features. A small convolution kernel can reduce the downsampling rate, which is conducive to locating large objects and recognizing small objects [135].

GoogLeNet

In the currently popular two-stage object detectors, such as the RCNN series [65, 66, 191], object detection is divided into two stages: first the object is located, and then a DCNN classifies the located objects. Therefore, improving feature representation benefits both location and classification. A straightforward way to improve performance is to increase the number of layers in the network and the number of neurons in each layer; however, the larger the network, the more parameters there are. The 1 × 1 convolution kernel proposed by Min Lin et al. in Network-in-Network [141] is introduced in GoogLeNet [213]. The 1 × 1 convolution kernel can not only reduce or increase the dimension, but also implement cross-channel information integration. The dimensionality reduction of the 1 × 1 convolution kernel reduces the computational complexity while increasing the depth and width of the network. Following this idea, the authors proposed an Inception module [213] (see Fig. 5c) with dimension reductions. Nine Inception modules are used in the network to change the serial structure into a parallel structure, and the fully-connected layer is replaced with an average pooling layer, which reduces the required parameters from 7 × 7 × 1024 to 1 × 1 × 1024.
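The role of the 1 × 1 convolution as a dimension-reduction bottleneck can be illustrated with a minimal Inception-style branch in PyTorch. This is a simplified sketch of the idea, not the exact GoogLeNet module; the channel sizes are illustrative assumptions.

```python
# Minimal sketch of an Inception-style branch with 1x1 dimension reduction (illustrative only).
import torch
import torch.nn as nn

class InceptionBranch(nn.Module):
    def __init__(self, in_ch, reduce_ch, out_ch):
        super().__init__()
        # 1x1 convolution reduces the channel dimension before the expensive 3x3 convolution
        self.reduce = nn.Conv2d(in_ch, reduce_ch, kernel_size=1)
        self.conv3x3 = nn.Conv2d(reduce_ch, out_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.conv3x3(self.relu(self.reduce(x))))

# In the full module, several branches (1x1, 3x3, 5x5, pooling) run in parallel
# and their outputs are concatenated along the channel dimension.
x = torch.randn(1, 192, 28, 28)
print(InceptionBranch(192, 64, 128)(x).shape)  # torch.Size([1, 128, 28, 28])
```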

VGGNet

VGGNet increases the depth of AlexNet to 16-19 layers [207], which improves the feature representation of the network. The mainstream architectures are VGG16 and VGG19, and the architecture of VGG16 is shown in Fig. 5b. AlexNet and ZFNet use an 11 × 11 kernel (stride 4) and a 7 × 7 kernel (stride 2) in the first convolutional layer, respectively. The size of the convolution kernel is further reduced in VGGNet, and 3 × 3 convolution kernels (stride 1) are used in every layer. The small kernel and stride are more conducive to extracting the location information of objects in the image. Replacing a large kernel with stacked small kernels has the advantage of increasing the depth of the network while keeping the receptive field unchanged, so the feature representation capability of the network model is enhanced while the parameters are reduced.
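For example, with C input and C output channels (and ignoring biases), two stacked 3 × 3 convolutions cover the same effective 5 × 5 receptive field as a single 5 × 5 convolution but require fewer parameters:

$$ 2 \times (3 \times 3 \times C \times C) = 18C^{2} < 25C^{2} = 5 \times 5 \times C \times C $$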

ResNet

The vanishing gradient and gradient explosion problems may occur as the number of layers increases. These two problems were effectively handled in [207, 213], which enables networks of tens of layers to converge with the stochastic gradient descent (SGD) algorithm. However, as the depth of the network continues to increase, the accuracy saturates and then degrades rapidly during training; this phenomenon is called degradation [82]. In order to solve this problem, Kaiming He et al. proposed a residual learning module, which allows the depth of the network to be increased to hundreds of layers, so that the feature representation capability of the network is further enhanced. With a top-5 error rate of 3.57%, ResNet won first place in the ILSVRC-2015 classification and detection tasks.

The residual learning module is essentially a shortcut connection, adding the result \(\mathcal {F}\)(X) obtained from a layer or stacked layers to the input value X; the expected output is \({\mathscr{H}}\)(X)=\(\mathcal {F}\)(X)+X. Since the residual learning module solves the training degradation problem, the depth of the network can be increased and the performance continuously improved. ResNet50 and ResNet101 are widely used as backbone networks for object detection.
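As a minimal illustration of the shortcut connection \({\mathscr{H}}\)(X)=\(\mathcal {F}\)(X)+X, the following PyTorch sketch implements a basic residual block. It is an illustrative simplification, not the exact bottleneck block used in ResNet50/101.

```python
# Minimal residual (shortcut connection) block: output = F(x) + x (illustrative sketch).
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x                        # identity shortcut X
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))     # F(X)
        return self.relu(out + residual)    # H(X) = F(X) + X
```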

DetNet

Classical DCNNs are trained on classification tasks and then used as backbone networks for object detection and other tasks. Through research on generic DCNNs and the object detection task, Zeming Li et al. proposed DetNet, a backbone network designed for object detection [135], which makes up for the shortcomings of generic backbone networks in the object detection task. Generic backbone networks use a large down-sampling rate, which ensures a large receptive field beneficial to image classification, but is not conducive to accurately locating large objects or recognizing small objects. Compared with traditional backbone networks, DetNet uses dilated convolutions [261, 262] in the last few stages instead of further down-sampling, which maintains a large receptive field while keeping a high resolution.

Inspired by the architecture and ideas of classical DCNNs, researchers are constantly improving or combining these networks. InceptionV2 [97] inherits the ideas of InceptionV1 (GoogLeNet) [213]: it uses two 3 × 3 kernels instead of one 5 × 5 kernel, and introduces Batch Normalization (BN) to speed up network training. Since small kernels can improve the performance of networks, Christian Szegedy et al. proposed InceptionV3 [214], which replaces an n × n kernel with a 1 × n kernel followed by an n × 1 kernel. Similarly, ResNet [82] demonstrates the advantages of shortcut connections in deep networks; therefore, the InceptionResNet [212] networks were formed by combining the Inception architecture with shortcut connections. Research on DCNNs as backbone networks is still in progress. Due to space limitations, networks such as ResNeXt [252], DenseNet [92], and SE-ResNet [89] are not described in detail.

2.2 Lightweight backbone network

The architectures of the complex backbone networks are described in Section 2.1. Their main development direction is to deepen the network to improve performance: for example, 8 layers for AlexNet, 16 layers for VGGNet, 22 layers for GoogLeNet, 50 to 152 layers for ResNet, and even thousands of layers for ResNet and DenseNet [82, 92, 116, 207, 213]. This increases the size of the network parameters, even though some methods are used to reduce them, such as the kernel decomposition methods proposed in InceptionV1/V2/V3 [97, 213, 214]. In short, the increase of model parameters brings two problems: storage space and testing time. Since the backbone network accounts for about 90% of the computation and storage of object detection, these two problems hinder the application of DCNN-based object detection in real scenes where storage and computing resources are limited, such as face recognition [215, 240], autonomous driving [24, 218], mobile phones [216, 248], industrial production [245], and embedded systems [158, 181]. In order to promote the industrialization of DCNNs, researchers have proposed lightweight networks that reduce network parameters while maintaining network performance.

SqueezeNet

Traditional works use three operations to compress networks: singular value decomposition (SVD), network pruning, and deep compression [43, 65, 254]. However, SqueezeNet is not a method to compress an existing network; it is a network architecture with few parameters. A new building block, called the Fire Module [96], is proposed in SqueezeNet. The Fire Module consists of a squeeze layer and an expand layer, where the squeeze layer consists only of 1 × 1 convolution filters, and the expand layer consists of 1 × 1 and 3 × 3 convolution filters (kernels). The output of the expand layer is the concatenation of the feature maps computed by the 1 × 1 and 3 × 3 convolution filters. Finally, the parameter size is reduced to less than 0.5 MB using deep compression [73], about 1/510 of the AlexNet parameters.
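A minimal PyTorch sketch of the Fire Module (a squeeze 1 × 1 layer followed by parallel 1 × 1 and 3 × 3 expand filters whose outputs are concatenated) is shown below; the channel sizes are illustrative assumptions.

```python
# Minimal Fire Module sketch: squeeze (1x1) -> expand (1x1 and 3x3 in parallel, concatenated).
import torch
import torch.nn as nn

class FireModule(nn.Module):
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.relu(self.squeeze(x))
        # concatenate the two expand branches along the channel dimension
        return torch.cat([self.relu(self.expand1x1(s)),
                          self.relu(self.expand3x3(s))], dim=1)
```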

Xception

Xception is an improvement over InceptionV3 that replaces the convolutions in InceptionV3 with depthwise separable convolutions [34]. The depthwise separable convolution proposed in Xception is widely used by MobileNet [88, 196], ShuffleNet [157, 250], and other network architectures. Although the depthwise separable convolution reduces computational complexity in Xception, its implementation in DCNNs is not efficient enough. The structure has therefore been continuously improved, as explained for MobileNet below.

MobileNet

SqueezeNet uses a bottleneck called the Fire Module to build a lightweight backbone network [96]. MobileNet instead relies on the depthwise separable convolution [88], a convolution method different from the traditional one. The depthwise separable convolution splits a standard convolution into a depthwise convolution and a 1 × 1 pointwise convolution. In the depthwise convolution, each feature channel is convolved with only one convolution kernel, so the number of convolution kernels equals the number of feature channels; the pointwise convolution is a 1 × 1 convolution. For example, for a convolution with kernel size DK × DK, M input channels, N output channels, and an output feature map of spatial size DF × DF, the traditional convolution requires DK × DK × M × N × DF × DF multiply-add operations, whereas the depthwise separable convolution requires DK × DK × M × DF × DF + M × N × DF × DF, so the ratio between the two is \(\frac {1}{N}+\frac {1}{{D_{K}^{2}}}\). In this way, the amount of calculation is greatly reduced.
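A minimal PyTorch sketch of a depthwise separable convolution is shown below: a depthwise convolution with the group count equal to the channel number, followed by a 1 × 1 pointwise convolution. Layer sizes and the BN/ReLU arrangement are illustrative assumptions.

```python
# Depthwise separable convolution sketch: depthwise conv (groups = channels) + 1x1 pointwise conv.
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        # depthwise: each input channel is convolved with its own single kernel
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride=stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        # pointwise: 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))
```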

MobileNetV2

Shortcut connections improve network performance in ResNet [82]. Therefore, the MobileNetV2 convolution block combines the depthwise separable convolution of MobileNet with a shortcut connection [196]. In order to obtain more features, a 1 × 1 convolution that expands the number of channels is added before the depthwise convolution, and a linear bottleneck is used instead of the ReLU activation function to prevent the features from being destroyed. The core of MobileNetV2 is thus the inverted residual with linear bottleneck.

ShuffleNet

The depthwise separable convolution proposed in Xception [34] transforms the traditional convolution into a combination of a depthwise convolution and a pointwise convolution. This reduces the number of parameters and the amount of calculation, but the cost of the pointwise convolution is still large. In ShuffleNet [250], the authors proposed to use pointwise group convolution instead of pointwise convolution: the convolution is performed within each group, which significantly reduces the computation. A channel shuffle operation is then used to exchange information between groups. A residual block with depthwise separable convolution, pointwise group convolution, and channel shuffle forms the ShuffleNet unit, and ShuffleNet is built by stacking these units.
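A minimal sketch of the channel shuffle operation is shown below: the channels are reshaped into groups, transposed, and flattened again so that information can flow between group convolutions. The group count here is an illustrative assumption.

```python
# Channel shuffle sketch: reshape channels into (groups, channels_per_group), transpose, flatten.
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # interleave channels across groups
    return x.view(n, c, h, w)

x = torch.randn(1, 8, 4, 4)
print(channel_shuffle(x, groups=2).shape)      # torch.Size([1, 8, 4, 4])
```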

ShuffleNetV2

In the study of lightweight networks, researchers argued that speed, rather than FLOPs, should be used to measure efficiency, and derived four design guidelines for lightweight networks that balance speed and accuracy. Based on these four guidelines, they designed the ShuffleNetV2 network structure [157]. A channel split is introduced on the basis of ShuffleNetV1: the input feature channels of each unit are divided into two parts, one of which passes through three convolutions. Unlike ShuffleNetV1, two 1 × 1 convolutions are used instead of two group convolutions. After the convolution operations, the two branches are concatenated, and finally a channel shuffle operation exchanges information between the two branches.

PeleeNet

The core idea of MobileNetV1/V2 and ShuffleNetV1/V2 is the depthwise separable convolution [34], which reduces computation and storage but lacks an efficient implementation. Based on DenseNet [92], researchers proposed PeleeNet [237] as a backbone network for object detection. In PeleeNet, two-way dense layers are used to obtain receptive fields of different scales, so that features of small-scale and large-scale objects are learned simultaneously. A stem block before the first dense layer improves the feature representation without increasing the amount of computation. To make the network suitable for mobile applications, a dynamic number of channels in the bottleneck layer, a transition layer without compression, and a composite function are further highlights.

3 Loss function for object detection

In deep learning and machine learning, the loss function is also called the cost function. Its main purpose is to measure the deviation between the predicted value of the network and the true value of the sample. The smaller the value of the loss function, the better the network model is trained, which reflects the convergence and robustness of the network model. Object detection is divided into object classification and object location; therefore, the loss function includes a classification loss (Cls Loss) and a location loss (Loc Loss). The loss functions of some classic object detection architectures are summarized in Table 2. The classification loss belongs to classification, and the location loss belongs to bounding-box regression. In recent works, many innovative methods have been proposed for loss function design and loss-based training strategies for network architectures. Representative designs include the stage-wise training in RCNN/SPPNet [66, 81] and the multi-task training in Fast/Faster RCNN [65, 191]. Stage-wise training separates classification and location, while multi-task training combines classification and location into a single loss function. Moreover, the loss function can also achieve specific goals: for example, Xinlong Wang et al. proposed a Repulsion Loss to solve the dense occlusion problem [242], and Tsung-Yi Lin et al. proposed the Focal Loss to solve the class imbalance problem [142]. These results demonstrate that the design of the loss function has a positive effect on the robustness of the network model. Next, the classification losses and regression losses of object detection are briefly introduced; then, the training strategies and specific loss functions of the milestone object detectors are described.

Table 2 The compositions and characteristics of the loss functions of the classical object detection architectures

3.1 Classification loss

Hinge loss

[6, 61] is a proxy function of the 0-1 loss function. It can be used as a loss for the max-margin problem in machine learning or deep learning, and can be extended to multi-class support vector machine (SVM) loss. The standard form of Hinge loss is listed as follows, which is suitable for binary classification.

$$ \begin{array}{c}{L(y)=\max (0,1-t \cdot y)} \\ {y=w \cdot x+b} \end{array} $$
(1)

where (w, b) are the hyperplane parameters, x is the data vector to be classified, y is the raw output of the classifier (not the predicted class label), and t ∈ {− 1, 1} represents the intended (true) value.

The binary classification case does not apply to all actual situations, so the hinge loss needs to be extended to multiple classifications. It is defined as follows:

$$ L(y)=\sum\limits_{j \neq i} \max \left( 0, y_{j}-y_{i}+1\right) $$
(2)

The formula subtracts the true class score yi from the scores yj of the other classes and sums the results; the smaller the sum, the lower the scores of the wrong classes.

Cross entropy loss

[41] is also called the log loss and is used with the softmax classifier. The softmax function converts a (k + 1) × 1 dimensional feature vector into a (k + 1) × 1 dimensional probability distribution, and the index of the maximum probability is the predicted category label of the sample. Therefore, according to the characteristics of the softmax function, the cross entropy loss can be defined as follows:

$$ L(p, u)=-\sum\limits_{i=0}^{k} u_{i} \log p_{i} $$
(3)

where \(p=\left (p_{0}, \cdots , p_{k}\right )\) is the probability distribution over the K + 1 categories calculated by the softmax, and u is the true category label. The cross entropy loss measures the distance between the predicted distribution and the true target. The function is convex, which gives good convergence under gradient descent, and it is more suitable for multi-category classification than the hinge loss.

If the multi-category task becomes a binary-category task, the cross entropy loss can be reduced to the binary cross entropy loss, which can be considered as a special form of multi-category. Its form is defined as follows:

$$ L(p, u)=-u \log p-(1-u) \log (1-p) $$
(4)

Here, the category label u ∈ {0, 1}, and p indicates the predicted probability of belonging to the category u = 1.

The above losses are mainly used for classifiers: the hinge loss in RCNN/SPPNet [66, 81], the cross entropy loss in Fast RCNN/Faster RCNN/SSD [65, 150, 191], and the binary cross entropy loss in YOLOV3 [189]. However, in YOLO [187], the classifier uses a squared loss, which is normally used for bounding-box regression. This shows that the design of the loss function depends on the network architecture and is not fixed.
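As a small illustration of these classification losses, the following NumPy sketch evaluates equations (2)-(4) on toy scores and labels; all numbers are arbitrary example values.

```python
# Toy evaluation of the multi-class hinge loss (2), cross entropy loss (3),
# and binary cross entropy loss (4). All numbers are arbitrary illustrative values.
import numpy as np

scores = np.array([2.0, 1.0, -0.5])            # raw classifier scores y_j for 3 classes
true_class = 0

# multi-class hinge loss, Eq. (2): sum over j != i of max(0, y_j - y_i + 1)
hinge = sum(max(0.0, scores[j] - scores[true_class] + 1.0)
            for j in range(len(scores)) if j != true_class)

# softmax + cross entropy loss, Eq. (3) with a one-hot true label u
p = np.exp(scores) / np.exp(scores).sum()
cross_entropy = -np.log(p[true_class])

# binary cross entropy loss, Eq. (4)
u, p_pos = 1, 0.8                              # true label and predicted probability
bce = -u * np.log(p_pos) - (1 - u) * np.log(1 - p_pos)

print(hinge, cross_entropy, bce)
```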

3.2 Location loss

The squared loss is one of the basic loss functions for bounding-box regression, also known as the L2 loss [99]. It measures the squared difference between the target value and the predicted value. Its basic form is defined as follows:

$$ L(y, f(x))=(y-f(x))^{2} $$
(5)

Here, y represents the target value corresponding to the input data x, and f(x) represents the predicted value obtained by the mapping f. For input data \(X=\left \{x_{1}, x_{2}, \cdots , x_{n}\right \}\) with target values \(Y=\left \{y_{1}, y_{2}, \cdots , y_{n}\right \}\), the loss function becomes:

$$ L(Y, f(X))=\sum\limits_{i=1}^{n}\left( y_{i}-f\left( x_{i}\right)\right)^{2} $$
(6)

which is called the Residual Sum of Squares (RSS). The mean value of RSS is usually used as the regression loss, which is called the Mean Square Error (MSE).

$$ L(Y, f(X))=\frac{1}{n} \sum\limits_{i=1}^{n}\left( y_{i}-f\left( x_{i}\right)\right)^{2} $$
(7)

Another loss function for bounding-box regression is the absolute loss, called the L1 loss [99]. The difference from the L2 loss is that L1 represents the sum of the absolute values of the differences between the target values and the predicted values. The basic form is expressed as follows:

$$ L(y, f(x))=|y-f(x)| $$
(8)

If there are n samples, the loss function becomes the following form, which can be called the Sum of Absolute Differences (SAD).

$$ L(Y, f(X))=\sum\limits_{i=1}^{n}\left|y_{i}-f\left( x_{i}\right)\right| $$
(9)

The mean value of SAD is usually used as the regression loss, which is called the Mean Absolute Error (MAE) [22, 247].

$$ L(Y, f(X))=\frac{1}{n} \sum\limits_{i=1}^{n}\left|y_{i}-f\left( x_{i}\right)\right| $$
(10)

The L1 loss and L2 loss [65] have their own advantages and disadvantages when used for bounding-box regression. The L1 loss is more robust to outliers, but it has a point where it is not differentiable, which makes gradient descent less efficient. The gradient of the L2 loss is accurate and simple to compute, but the loss is more sensitive to outliers. Therefore, the advantages of the L1 and L2 losses are combined in the design of the bounding-box regression loss function; Section 3.3 shows the design method.

3.3 Loss-based training strategies

Stage-wise training method

RCNN/SPPNet are pioneering DCNN-based approaches to object detection and use a stage-wise training method. First, an SVM is trained for classification, and then bounding-box regression [66, 81] is trained to accurately locate the objects. Therefore, a stage-wise loss is used during training: after the SVM loss function is optimized, the bounding-box regression loss function is optimized.

The loss function of the SVM classification algorithm is the Hinge loss with the L2 regularization term [162]. The function form is defined as follows:

$$ L_{c l s}=c \sum\limits_{i} \max \left( 0,1-p_{i}^{*} \cdot p_{i}\right)+\frac{1}{2}\|w\|^{2} $$
(11)

Here, \(p_{i}^{*}\) represents the true category of the object, pi represents the probability of the predicted object category, and i is the index of the mini-batch.

The bounding-box regression in RCNN/SPPNet uses the L2 loss as the basic skeleton of the loss function. The main principle is to penalize the deviation between the predicted bounding-box and the ground truth in order to improve the robustness of the prediction. The regression targets and loss are defined as follows:

$$ \begin{aligned} t_{x}^{*} &=\left( x^{*}-x\right) / w \quad, \quad t_{y}^{*}=\left( y^{*}-y\right) / h \\ t_{w}^{*} &=\log \left( w^{*} / w\right) \quad, \quad t_{h}^{*}=\log \left( h^{*} / h\right) \end{aligned} $$
(12)
$$ L_{l o c}=\sum\limits_{i}\left( t_{*}^{i}-w_{*}^{T} \phi\left( t^{i}\right)\right)^{2} $$
(13)

Here, the true coordinate is \(t^{*}=\left (x^{*}, y^{*}, w^{*}, h^{*}\right )\), the predicted coordinate is t = (x, y, w, h), where (x, y) represents the coordinate of the box center, (w, h) represents the width and height of the box. \(w_{*}^{T}\) is the learned parameter, and \(\phi \left (t^{i}\right )\) is the feature vector.

Multi-task training method

Stage-wise training cannot achieve end-to-end training, which motivates multi-task training [20]. By summarizing the shortcomings of RCNN, Fast RCNN uses a softmax classifier instead of the SVM classifier [65]. From the network structure, each RoI outputs two feature vectors through two fully-connected layers, which are fed into the softmax classifier and the bounding-box regressor in parallel. Finally, classification and regression are integrated into a unified loss function, called the multi-task loss. It combines the classification loss and the location loss, which improves the accuracy of the network model and enables end-to-end training. The multi-task loss function is defined as follows:

$$ L\left( p, u, t^{u}, v\right)=L_{\text{cls}}(p, u)+\lambda[u \geq 1] L_{\text{loc}}\left( t^{u}, v\right) $$
(14)
$$ L_{\text{cls}}(p, u)=-\log p_{u} $$
(15)

where \(p=\left (p_{0}, \cdots , p_{k}\right )\) is the probability distribution over the K + 1 categories calculated by the softmax, and u is the true category. In Lloc, \(t^{u}=\left ({t_{x}^{u}}, {t_{y}^{u}}, {t_{w}^{u}}, {t_{h}^{u}}\right )\) is the predicted offset for the true class u, obtained from \(t^{k}=\left ({t_{x}^{k}}, {t_{y}^{k}}, {t_{w}^{k}}, {t_{h}^{k}}\right )\) using the parameterization method. \(v=\left (v_{x}, v_{y}, v_{w}, v_{h}\right )\) is the ground-truth regression target of the true category u, where (x, y) represents the box center and (w, h) the width and height of the box. The Iverson bracket indicator function [u ≥ 1] eliminates the effect of background RoIs: when u ≥ 1 the function is 1, otherwise it is 0. The bounding-box loss Lloc is shown as follows:

$$ L_{\text{loc}}\left( t^{u}, v\right)=\sum\limits_{i \in\{x, y, w, \mathrm{h}\}} \text{smooth}_{L_{1}}\left( {t_{i}^{u}}-v_{i}\right) $$
(16)

In which

$$ \text{smooth}_{L_{1}}(x)=\left\{\begin{array}{ll}{0.5 x^{2}} & {\text{ if }|x|<1} \\ {|x|-0.5} & {\text{ otherwise }} \end{array}\right. $$
(17)

The \(\text {smooth}_{L_{1}}(x)\) combines the advantages of the L1 and L2 losses. Here x represents the difference between the target value and the predicted value. When |x| > 1, the L2 loss amplifies the error; in particular, outliers incur a very high penalty. Therefore, for |x| > 1 the L1 loss, whose error grows linearly, eliminates the sensitivity of the loss function to outliers. When |x| ≤ 1, the L1 loss is not differentiable at x = 0, which affects the convergence of the network during gradient descent, so the L2 loss is selected. In short, \(\text {smooth}_{L_{1}}(x)\) eliminates the sensitivity to outliers and is more robust, avoiding the gradient explosion that can occur when RCNN or SPPNet is trained with the L2 loss.
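A minimal sketch of \(\text {smooth}_{L_{1}}\) from equation (17), compared numerically against the plain L1 and L2 losses on toy values:

```python
# smooth_L1 of Eq. (17): quadratic (L2-like) for |x| < 1, linear (L1-like) otherwise.
import numpy as np

def smooth_l1(x):
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

diff = np.array([0.2, 0.8, 2.0, 5.0])          # target minus prediction (toy values)
print("L2       :", diff ** 2)                 # grows quadratically, sensitive to outliers
print("L1       :", np.abs(diff))              # linear, non-differentiable at 0
print("smooth_L1:", smooth_l1(diff))           # quadratic near 0, linear for large errors
```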

Training the region proposal network (RPN) [191] is required in Faster RCNN. The loss function of the RPN is designed according to the multi-task loss function of Fast RCNN. Unlike the multi-class case, only a binary category label is assigned to each anchor box in the RPN, indicating whether it is an object. With reference to equation (14), the loss function of the RPN is defined as follows:

$$ L\left( \left\{p_{i}\right\},\left\{t_{i}\right\}\right)=\frac{1}{N_{c l s}} \sum\limits_{i} L_{c l s}\left( p_{i}, p_{i}^{*}\right)+\lambda \frac{1}{N_{r e g}} \sum\limits_{i} p_{i}^{*} L_{r e g}\left( t_{i}, t_{i}^{*}\right) $$
(18)

Here, pi is the predicted probability that anchor i is an object. If the anchor is positive (an object), \(p_{i}^{*}\) is 1; if the anchor is negative (not an object), \(p_{i}^{*}\) is 0. The classification loss Lcls follows equation (4). Because of the specificity of the RPN, some minor changes are made to the bounding-box regression, which is defined as:

$$ L_{r e g}\left( t_{i}, t_{i}^{*}\right)=R\left( t_{i}-t_{i}^{*}\right) $$
(19)

R is the \({smooth}_{L_{1}}\) function defined in equation (17). An improved parameterization is used for the bounding-box regression of the anchors, which inherits equation (12) with minor differences and is defined as follows:

$$ \begin{aligned} t_{\mathrm{x}} &=\left( x-x_{\mathrm{a}}\right) / w_{\mathrm{a}}, \quad t_{\mathrm{y}}=\left( y-y_{\mathrm{a}}\right) / h_{\mathrm{a}} \\ t_{\mathrm{w}} &=\log \left( w / w_{\mathrm{a}}\right), \quad t_{\mathrm{h}}=\log \left( h / h_{\mathrm{a}}\right) \\ t_{\mathrm{x}}^{*} &=\left( x^{*}-x_{\mathrm{a}}\right) / w_{\mathrm{a}}, \quad t_{\mathrm{y}}^{*}=\left( y^{*}-y_{\mathrm{a}}\right) / h_{\mathrm{a}} \\ t_{\mathrm{w}}^{*} &=\log \left( w^{*} / w_{\mathrm{a}}\right), \quad t_{\mathrm{h}}^{*}=\log \left( h^{*} / h_{\mathrm{a}}\right) \end{aligned} $$
(20)

Here, x, y, w, and h denote the center coordinates, width, and height of the predicted box; variables with subscript a and superscript * denote the anchor box and the ground-truth box, respectively.
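The anchor parameterization of equation (20) can be written as a small encode/decode pair. The (x, y, w, h) box layout and the function names below are assumptions for illustration.

```python
# Anchor-box parameterization of Eq. (20): encode a box relative to its anchor and decode back.
import numpy as np

def encode(box, anchor):
    x, y, w, h = box                  # box in (center_x, center_y, width, height) format
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha, np.log(w / wa), np.log(h / ha)])

def decode(t, anchor):
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return np.array([tx * wa + xa, ty * ha + ya, wa * np.exp(tw), ha * np.exp(th)])

anchor = np.array([50.0, 50.0, 32.0, 64.0])
gt_box = np.array([54.0, 48.0, 40.0, 60.0])
t = encode(gt_box, anchor)
print(np.allclose(decode(t, anchor), gt_box))   # True: decode inverts encode
```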

Unlike two-stage detectors such as Fast/Faster RCNN, YOLO [187] is a single-stage detector, so the design of its loss function is different. In the YOLO detector, a full image is passed through a single DCNN to output the bounding-boxes and class probabilities of the objects. The image is divided into an S × S grid, and each grid cell is responsible for detecting the objects whose centers fall into that cell. Each grid cell predicts B bounding-boxes and a confidence score for each bounding-box; the confidence scores indicate the probability that the bounding-box contains an object and how accurate the box is. The confidence score is defined as follows:

$$ \text{Confidence score}=\text{Pr}(\text{Object}) * \text{IOU}_{\text{pred}}^{\text{truth}} $$
(21)

Pr(Object) is 1 if an object exists in the grid cell and 0 otherwise. \(\text {Pr}\left (\text { Class }_{i} | \text { Object }\right )\) denotes the C conditional class probabilities predicted for each grid cell (per grid cell, not per bounding-box). At test time, the class-specific confidence score of each bounding-box is obtained as follows:

$$ \text{Pr}\left( \text{ Class }_{i} | \text{ Object }\right) * \text{Pr}(\text{ Object }) * \text{IOU}_{\text{pred }}^{\text{truth }}=\text{Pr}\left( \text{ Class }_{i}\right) * \text{IOU}_{\text{pred }}^{\text{truth }} $$
(22)

This gives both the class-specific confidence score of each bounding-box and a measure of how well the bounding-box fits the object. In the training stage, although each grid cell predicts B bounding-boxes, only the bounding-box with the highest IoU with the object ground truth is responsible for the prediction. The multi-task loss function is defined as follows:

$$ \begin{aligned} L={}&\lambda_{coord} \sum\limits_{i=0}^{S^{2}} \sum\limits_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[\left( x_{i}-\hat{x}_{i}\right)^{2}+\left( y_{i}-\hat{y}_{i}\right)^{2}\right] \\ &+\lambda_{coord} \sum\limits_{i=0}^{S^{2}} \sum\limits_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[\left( \sqrt{w_{i}}-\sqrt{\hat{w}_{i}}\right)^{2}+\left( \sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\right)^{2}\right] \\ &+\sum\limits_{i=0}^{S^{2}} \sum\limits_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left( C_{i}-\hat{C}_{i}\right)^{2}+\lambda_{noobj} \sum\limits_{i=0}^{S^{2}} \sum\limits_{j=0}^{B} \mathbb{1}_{ij}^{noobj}\left( C_{i}-\hat{C}_{i}\right)^{2} \\ &+\sum\limits_{i=0}^{S^{2}} \mathbb{1}_{i}^{obj} \sum\limits_{c \in \text{classes}}\left( p_{i}(c)-\hat{p}_{i}(c)\right)^{2} \end{aligned} $$
(23)

where \(\mathbb{1}_{i}^{obj}\) indicates that an object appears in grid cell i, and \(\mathbb{1}_{ij}^{obj}\) indicates that the j-th bounding-box in grid cell i is responsible for the prediction. (xi, yi) is the center of the bounding-box as an offset from the grid cell boundary, and (wi, hi) are the width and height of the bounding-box normalized to the width and height of the input image. The parameters λcoord and λnoobj reduce the impact of grid cells that contain no object on the stability of the model.
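Since the confidence target in equations (21)-(23) relies on the IoU between a predicted box and the ground truth, a minimal IoU computation is sketched below; the corner-format box layout is an assumption for illustration.

```python
# Minimal IoU computation for two boxes in (x1, y1, x2, y2) corner format (illustrative).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection rectangle
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter + 1e-9)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, approximately 0.143
```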

4 The architectures of object detection

Object detection can be divided into object location and object classification. Since DCNNs have strong feature representation power [116, 122], the mainstream object detection architectures are based on DCNNs and can be divided into two categories. The first is the two-stage object detection architecture, which separates object location from object classification: region proposals are generated first and then classified. Its main advantage is high detection accuracy, and its main disadvantage is slow detection speed. For example, RCNN [66], SPPNet [81], Fast RCNN [65], Faster RCNN [191], Mask RCNN [80], and RFCN [38] are all two-stage architectures. The second is the one-stage object detection architecture, which locates and classifies objects directly through a DCNN without separating the two tasks: the class probabilities and location coordinates of an object are generated in a single stage. It does not require a region proposal process, which makes it simpler than two-stage detection. Its main advantage is high detection speed, but the detection accuracy is generally lower than that of two-stage architectures. For example, OverFeat [197], the YOLO series [187,188,189], SSD [150], DSSD [59], FSSD [137], and DSOD [199] belong to one-stage detection. The performance parameters of some classic two-stage and one-stage detectors are shown in Table 3, the milestones of object detection evolution are shown in Fig. 6, and the highlights, properties, and shortcomings of the milestone architectures are summarized in Table 4. The two-stage architectures, the one-stage architectures, and the open source object detection platforms are introduced below.

Fig. 6

The milestones of object detection evolution, in which AlexNet [116] serves as a watershed between traditional methods [58, 155, 208, 224, 229, 241] and DCNNs-based methods. The development of object detection based on DCNNs is mainly divided into one-stage detection architectures [142, 150, 187, 277, 283] and two-stage detection architectures [15, 16, 65, 66, 80, 144, 191]

Table 3 Comparison of milestone two-stage object detectors and one-stage object detectors on the PASCAL VOC2007 test set
Table 4 Summarization of highlights, properties, and shortcomings of the milestone object detection architectures

4.1 Two-stage object detection architecture

RCNN

Inspired by AlexNet’s great success in image feature extraction, R. Girshick et al. used a DCNN as the feature extraction backbone [116] instead of HOG [40, 58, 180], SIFT [155], and other traditional feature extraction algorithms, and combined it with region proposal algorithms (such as Selective Search [224], Objectness [1], category-independent object proposals [49], CPMC [19], and MCG [4]) to form the RCNN (Regions with CNN features) [66] architecture, as shown in Fig. 7. The pipeline of the RCNN architecture is as follows. First, the selective search algorithm generates about 2000 category-independent region proposals. Second, each region proposal is fed into the DCNN to extract a 4096-dimensional feature representation. Last, the features are classified using the SVM algorithm, and bounding-box regression and greedy non-maximum suppression (NMS) [166] are applied to fine-tune the bounding-boxes. In summary, RCNN improves performance by about 30% over traditional object detection algorithms [58, 192].

Fig. 7

The details of the classical two-stage object detectors, including RCNN, Fast RCNN, and Faster RCNN. The left side shows the training process, the right side shows the testing process, and the figure reflects the relationship between training and testing

Although RCNN pushed object detection into the era of neural networks, it still has three disadvantages that prevent its application in real scenarios. The three disadvantages are listed as follows:

  1. (1).

About 2000 region proposals must be extracted and cached for each image, which consumes a lot of storage space and I/O resources.

  2. (2).

When AlexNet [116] is used as the backbone network, the cropped/warped region block is deformed into a 227 × 227 RGB image. This truncation or stretching of the object image results in a loss of object information.

  3. (3).

Extracting features separately for each region proposal does not exploit the feature-sharing capability of DCNNs, resulting in a large waste of computing resources.

SPPNet

SPPNet removes the cropping/warping of the image used in RCNN and adds a spatial pyramid pooling (SPP) layer (similar to SPM [121]) after the last convolutional (Conv) layer. The problem of missing object information caused by cropping/warping is thus solved, and an image of arbitrary size can be fed into the DCNN to compute a fixed-length feature vector (from a 21-bin spatial pyramid) for the fully-connected (FC) layer [81]. Sharing the feature maps of the entire image makes SPPNet 10 to 100× faster than RCNN at test time. However, the framework of SPPNet is essentially the same as RCNN, so end-to-end training is still not achieved, and the convolutional (Conv) layers cannot be fine-tuned in SPPNet, which limits the accuracy of the network [65].

Fast RCNN

By analyzing the disadvantages of SPPNet and RCNN in speed, space consumption, and training, R. Girshick et al. proposed the RoI (Region of Interest) pooling layer, which is a single-level spatial pyramid pooling (SPP) [81]. In Fast RCNN, the feature map of the whole image is computed by the DCNN. The selective search (SS) [224] algorithm finds the region proposals in the image and maps them onto the feature map. The RoI pooling layer then maps each feature region to a fixed-size feature vector and feeds it to the fully-connected (FC) layers (a minimal RoI pooling usage sketch is given after the list below). Finally, the softmax predicts the object categories and the bounding-box regression accurately locates the objects; the architecture is depicted in Fig. 7. Fast RCNN uses a multi-task loss to jointly train classification and bounding-box regression, so that the two tasks share convolutional features. Thus the stage-wise training of SVM + bounding-box regression (stage-wise training [66, 81, 294]) is transformed into multi-task training [20, 177]. Because of these innovations, the advantages of Fast RCNN compared with RCNN/SPPNet are listed as follows:

  1. (1).

Fast RCNN has higher accuracy than RCNN/SPPNet.

  2. (2).

    Due to multi-task loss, the detector training is end-to-end.

  3. (3).

Fast RCNN training can update all network layers, which is superior to SPPNet, where only the fully-connected (FC) layers are updated.

  4. (4).

    Hard disk storage is not required for feature caching.

  5. (5).

    Training and testing are faster than RCNN/SPPNet.
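As mentioned above, the RoI pooling layer maps each region proposal to a fixed-size feature. A minimal usage sketch with torchvision's built-in operator is shown below; the feature map size, box coordinates, and spatial_scale are illustrative assumptions.

```python
# Illustrative use of RoI pooling: each RoI is pooled to a fixed 7x7 feature map.
import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 256, 50, 50)          # backbone feature map (e.g. 1/16 of an 800x800 image)
# RoIs in image coordinates: (batch_index, x1, y1, x2, y2); values are toy examples
rois = torch.tensor([[0, 100.0, 100.0, 400.0, 300.0],
                     [0, 250.0,  50.0, 600.0, 500.0]])
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)                              # torch.Size([2, 256, 7, 7])
```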

Faster RCNN

Although Fast RCNN greatly improves accuracy and speed, it still generates the roughly 2000 region proposals/RoIs with the selective search algorithm [224, 225]. The selective search algorithm needs to search all region proposals in the image and map them onto the feature maps, which is very time-consuming: in testing, Fast RCNN took 2.3 s per prediction, 2 s of which were used to generate the 2000 RoIs. The traditional region proposal algorithms [4, 114, 224, 225, 296] thus became the bottleneck of the object detection architecture. To solve this shortcoming, Shaoqing Ren et al. proposed a region proposal network (RPN) in Faster RCNN [191] (see Fig. 7) to replace the selective search algorithm for generating region proposals. The RPN integrates region proposal extraction into the DCNN, sharing the convolutional features of the full image with the detection network. Since the region proposals share the convolutional feature maps and the RPN runs on the GPU, this step is nearly cost-free.

The RPN [191] is a fully convolutional network connected to the last convolutional layer of the backbone network. A feature map of any size is fed into the RPN, which outputs a set of rectangular object proposals with objectness scores. The RPN makes K predictions of different scales and aspect ratios for each sliding position on the feature map: two sibling layers output 4 × K coordinates through a box-regression layer (reg) and 2 × K scores through a box-classification layer (cls). If the feature map size is n × n, then n × n × K RoIs are output. The RPN reduces the number of proposals and improves their quality, so the localization accuracy and speed of the object detection network are greatly improved (common datasets need only 300 proposals per image, at 5 fps on a GPU). In addition to the RPN, researchers have proposed many DCNN-based region proposal methods with good results in recent years, such as DeepBox [117], DeepMultiBox [50], SharpMask [178], and DeepProposal [63].
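A minimal sketch of how K scale/aspect-ratio anchor combinations can be laid out at each position of a feature map is shown below; the stride, scales, and ratios are illustrative assumptions, not the exact Faster RCNN settings.

```python
# Illustrative anchor generation: K = len(scales) * len(ratios) anchors per feature-map position.
import numpy as np

def generate_anchors(fm_size=4, stride=16, scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for i in range(fm_size):
        for j in range(fm_size):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride   # position center in image coords
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)     # anchor width/height, aspect ratio r
                    anchors.append([cx, cy, w, h])
    return np.array(anchors)                                  # shape: (fm_size*fm_size*K, 4)

print(generate_anchors().shape)  # (96, 4): 4x4 positions x 6 anchors per position
```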

RFCN

The algorithmic ideas and performance of the RCNN series mark milestones in object detection. The architecture essentially consists of two subnetworks (Faster RCNN consists of three, adding the RPN [191]): the first subnetwork is the backbone for feature extraction, and the second completes the detection, i.e., classification and location. The RoI pooling layer inserted between the two subnetworks converts multi-scale feature maps into fixed-size feature maps, but this step breaks the translation invariance of the network, which is unfavorable for object classification. In order to balance translation invariance and translation variance, Jifeng Dai et al. proposed position-sensitive score maps in RFCN [38], whose region detection is based on fully convolutional computation over the full image. RFCN shares the full convolution and uses position-sensitive score maps/position-sensitive RoI pooling to bring translation variance into the convolutional layers, which is beneficial for object location. All learnable layers have two characteristics: they are convolutional and shared across the full image, and they can encode spatial information for object detection. Using ResNet101 [82] as the backbone, RFCN achieves 83.6% mAP on the PASCAL VOC 2007 test set and 82.0% mAP on the PASCAL VOC 2012 test set, and is faster than Faster RCNN with ResNet101.

Mask RCNN

Mask RCNN is an extension of Faster RCNN that adds a mask branch for predicting the segmentation of RoIs, in parallel with object classification and bounding-box regression [80]. It can perform object detection and instance segmentation simultaneously. The mask branch is a streamlined fully convolutional network that generates a segmentation mask for each RoI. Due to the integer quantization of RoI pooling [65, 191], the feature map region and the original image region are not aligned, which introduces a bias when predicting pixel-level masks. Kaiming He et al. therefore proposed the RoIAlign [80] layer to replace the RoI pooling layer, using bilinear interpolation to achieve pixel-level alignment; RoIAlign also improves the accuracy of the object detection branch. The experimental results show that Mask RCNN with ResNet-101-FPN [82, 144] as the backbone performs much better than Faster RCNN with G-RMI. Mask RCNN won the COCO 2016 object detection challenge, and its AP is 3 points higher than that of Faster RCNN with Inception-ResNet-v2-TDM [203, 212].

FPN

Tsung-Yi Lin et al. proposed the Feature Pyramid Network (FPN) in 2017 [144]. The FPN can be regarded as a method of feature fusion and detection. Before FPN, detectors mainly used the top-level features for detection or performed detection independently on different feature layers. These methods cannot take both classification and location information into account, because low-level features carry weak semantic information but rich location information, while high-level features carry rich semantic information but coarse location information. The FPN therefore fuses different feature layers through a top-down pathway with lateral connections and performs detection on the fused multi-level feature layers. This method greatly enhances detection performance, and Faster RCNN with FPN achieves state-of-the-art results on the MS COCO dataset [144].
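A minimal sketch of the FPN-style top-down pathway with lateral connections is shown below. The channel numbers and the three-level input are illustrative assumptions; the real FPN attaches this structure to the ResNet stages.

```python
# Minimal FPN-style fusion sketch: lateral 1x1 convs + top-down upsampling and addition.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024), out_ch=256):
        super().__init__()
        self.laterals = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_channels])
        self.smooths = nn.ModuleList([nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                      for _ in in_channels])

    def forward(self, feats):                   # feats: low-level -> high-level feature maps
        laterals = [l(f) for l, f in zip(self.laterals, feats)]
        # top-down: upsample the higher level and add it to the lateral connection below it
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [s(l) for s, l in zip(self.smooths, laterals)]

feats = [torch.randn(1, 256, 64, 64), torch.randn(1, 512, 32, 32), torch.randn(1, 1024, 16, 16)]
print([p.shape for p in TinyFPN()(feats)])      # three fused maps, each with 256 channels
```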

4.2 One-stage object detection architecture

OverFeat

In 2013, Yann LeCun et al. proposed the famous OverFeat architecture [197], which utilizes the feature sharing of DCNNs to integrate object classification and object location into one network. The main idea of OverFeat is to extract patches using a multi-scale fast sliding window [77] on the last pooling layer of the DCNN, predict a classification score for each patch, and merge the patches according to their scores. In this way, the complex shapes and multiple sizes of object images are handled. OverFeat uses the classification and regression of a DCNN to achieve object classification and localization. Compared with RCNN [66], OverFeat has an obvious advantage in speed but is inferior in accuracy.

YOLO

Although Faster RCNN uses the RPN to reduce the number of region proposals from about 2000 (RCNN/Fast RCNN) to about 300, there are still overlaps between region proposals [191]. This inevitable overlap leads to repetitive computation, making it difficult for the object detection architecture to break through the speed bottleneck. In 2015, J. Redmon et al. proposed YOLO (You Only Look Once), an end-to-end single neural network [187]. It regresses class probabilities and bounding-boxes directly from a full image. YOLO divides the full image into S × S grid cells, and each grid cell is responsible for detecting the objects whose centers fall into it. Each grid cell predicts C class probabilities, B bounding-boxes, and their confidence scores, so the full image is encoded into an S × S × (5B + C) output tensor. Figure 8 shows the architecture of YOLO.

Fig. 8

The details of the classical one-stage object detectors, including YOLO and SSD. The upper part shows the training process and the lower part shows the testing process, reflecting the relationship between training and testing

The YOLO detection network consists of 24 convolutional layers and 2 fully-connected layers, with a 448 × 448 input image. In the paper, the network is trained on the PASCAL VOC dataset, so S = 7, B = 2, and C = 20, and the final prediction is a 7 × 7 × 30 tensor. YOLO reaches up to 45 fps on the test set, which satisfies real-time image processing requirements, but it has the following shortcomings to be improved:

  1. (1).

YOLO does not work well for dense small objects, because each grid cell can only predict two bounding-boxes, which share a single class prediction.

  2. (2).

The generalization ability is weak when objects of the same class appear with new aspect ratios in the test images.

  3. (3).

The shortcomings of the loss function affect the detection performance.

YOLO9000/V2

Although YOLO achieves real-time object detection, it suffers from many localization errors and low recall. In order to achieve higher accuracy, YOLOV2 [188] makes the following improvements over YOLOV1:

  1. (1).

The introduction of batch normalization [97] speeds up network convergence and enhances network generalization.

  2. (2).

A high-resolution classifier is trained to accommodate higher-resolution images [161].

  3. (3).

To address YOLO's weak generalization to objects with various aspect ratios, the anchor idea of Faster RCNN [191] is introduced into YOLOV2, so that each grid cell predicts anchor boxes of multiple scales and aspect ratios.

  4. (4).

YOLOV2 uses the K-means [231] clustering algorithm to automatically find prior bounding-boxes, which improves detection performance.

  5. (5).

YOLOV2 constrains the predicted center offsets relative to the grid cell to the range between 0 and 1, which resolves the instability of the network model.
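
Below is a minimal sketch of this prior-box clustering, under the common assumption that the distance metric is 1 − IoU computed from box widths and heights only; the synthetic data and the choice of k = 5 are illustrative.

import numpy as np

def iou_wh(box, clusters):
    # IoU between one (w, h) box and k cluster (w, h) boxes, ignoring positions.
    inter = np.minimum(box[0], clusters[:, 0]) * np.minimum(box[1], clusters[:, 1])
    union = box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    # boxes: (N, 2) array of ground-truth widths and heights.
    rng = np.random.default_rng(seed)
    clusters = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        dists = np.array([1 - iou_wh(b, clusters) for b in boxes])   # (N, k) distances
        assign = dists.argmin(axis=1)
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                        else clusters[j] for j in range(k)])
        if np.allclose(new, clusters):
            break
        clusters = new
    return clusters   # the k prior bounding-box shapes

# Example with synthetic (w, h) pairs:
boxes = np.abs(np.random.default_rng(1).normal([80, 120], [30, 40], size=(500, 2)))
print(kmeans_anchors(boxes, k=5))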

In order to achieve faster speed, the authors proposed the Darknet-19 [186] backbone network, inspired by VGGNet [207]. They use a joint training mechanism on classification datasets (ImageNet [42]) and object detection datasets (COCO [143]), so the network learns many more object categories; the jointly trained network is called YOLO9000.

YOLOV3

YOLOV3 [189] inherits the ideas of YOLOV1 and YOLOV2/9000 and improves on their shortcomings to achieve a balance between speed and accuracy. To achieve this goal, the authors combine residual blocks [82], a feature pyramid network (FPN) [144], and a binary cross-entropy loss to upgrade YOLO to YOLOV3. These changes make the detection network suitable for more complex tasks (more categories and multi-scale objects).

SSD

The RCNN series and YOLO have their own advantages in speed and accuracy: the RCNN series has high detection accuracy but is slow, while YOLO is fast but generalizes poorly to objects with large scale variation and detects small objects weakly. By drawing on the advantages of Faster RCNN and YOLO, Wei Liu et al. proposed the Single Shot MultiBox Detector (SSD) [150]. SSD uses VGG16 as the backbone network for feature extraction, replaces FC6/FC7 with Conv6/Conv7, and then adds four convolutional layers (Conv8, Conv9, Conv10, and Conv11). The design idea of SSD is hierarchical feature extraction, as can be seen from Fig. 8: the one-stage network is divided into 6 stages, and each stage extracts feature maps of a different semantic level and performs object classification and bounding-box regression. The multi-scale feature maps, combined with the anchor mechanism of Faster RCNN, adapt SSD to the detection of multi-scale objects. SSD512 with VGG16 is significantly more accurate than Faster R-CNN while being about 3 times faster, and SSD300 runs at 59 fps, faster than YOLO, with markedly better detection quality [150].
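
The following PyTorch sketch illustrates the idea of attaching classification and regression heads to feature maps from several stages; the channel counts, anchor count, and class count are illustrative, not the exact SSD configuration.

import torch
import torch.nn as nn

class MultiScaleHeads(nn.Module):
    """Attach a 3x3 classification/regression head to each feature map, SSD-style."""
    def __init__(self, in_channels=(512, 1024, 512), num_anchors=6, num_classes=21):
        super().__init__()
        self.cls_heads = nn.ModuleList(
            nn.Conv2d(c, num_anchors * num_classes, 3, padding=1) for c in in_channels)
        self.reg_heads = nn.ModuleList(
            nn.Conv2d(c, num_anchors * 4, 3, padding=1) for c in in_channels)

    def forward(self, feature_maps):
        # feature_maps: list of tensors taken from different stages of the backbone.
        cls_out = [h(f) for h, f in zip(self.cls_heads, feature_maps)]
        reg_out = [h(f) for h, f in zip(self.reg_heads, feature_maps)]
        return cls_out, reg_out

# Feature maps of decreasing resolution, as produced by successive backbone stages:
feats = [torch.randn(1, 512, 38, 38), torch.randn(1, 1024, 19, 19), torch.randn(1, 512, 10, 10)]
cls, reg = MultiScaleHeads()(feats)
print([c.shape for c in cls])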

DSSD and FSSD

In order to improve the ability of SSD to exploit low-level feature maps, DSSD [59] uses ResNet101 as the backbone network; the addition of deconvolution modules and skip connections [82] enhances the representation of the low-level feature maps and achieves a certain degree of feature fusion. Similarly, FSSD [137] fuses low-level features into high-level features on top of SSD, which significantly improves accuracy.

4.3 Open source object detection platform

With the progress of research on object detection in recent years, a large number of excellent DCNN-based object detection architectures have been proposed. Implementing these architectures on a scalable, unified platform allows researchers to conduct experiments quickly and easily. To meet the demand for high-quality, high-performance object detection platforms, Facebook AI Research (FAIR) open-sourced the Detectron platform [53], Google open-sourced an object detection API [217], and the CUHK & SenseTime Joint Lab open-sourced the MMDetection library [27]. These open-source platforms integrate many landmark object detection architectures and backbone networks, together with a large number of benchmark results and pre-trained models in their Model Zoo libraries. Table 5 summarizes the characteristics of the three platforms.

Table 5 Summarization of open source object detection platforms

5 Complex problem of object detection

Object detection encounters many obstacles in real-world applications, including dense occlusion, small object detection, class imbalance, multi-scale object detection, and redundant detections. Researchers have proposed many methods to respond to these challenges, and the resulting solutions allow DCNN-based object detection to take an important step toward practical applications.

5.1 Dense occlusion

The dense occlusion problem often occurs in pedestrian detection [45], autonomous driving [24, 31], and other practical application scenarios. It can be divided into two situations: occlusion between objects of the same category and occlusion between objects of different categories. Occlusion leads to loss of object information, causing missed and false detections. In traditional object detection algorithms, researchers exploit additional cues, such as gray-level information, boundary information, and local features, to overcome dense occlusion. This review focuses on DCNN-based methods for dealing with the dense occlusion problem.

Influenced by the idea of Generative Adversarial Nets (GANs) [67], Xiaolong Wang et al. proposed A-Fast-RCNN by combining GANs with Fast RCNN. Based on GANs, the authors designed an Adversarial Spatial Dropout Network (ASDN) [243], which generates occlusion samples to train the detection network and thereby makes it more robust to occlusions. ASDN acts only as a generator in the network: it plays a role during training and does not participate in testing. A-Fast-RCNN thus improves the robustness of the detection network to occluded objects by adding network modules. In more recent research, Face++ was the first to address crowded pedestrian detection from the perspective of the loss function, proposing the Repulsion Loss (RepLoss) [242]. This method, inspired by the attraction and repulsion of magnets, is also suitable for generic object detection. The loss function consists of three parts: the attraction term LAttr, the repulsion term LRepGT, and the repulsion term LRepBox, defined as follows:

$$ {L=L_{\text{Attr}}+\alpha * L_{\text{RepGT}}+\beta * L_{\text{RepBox}}} $$
(24)

where the weighting factors α and β balance LRepGT and LRepBox. The attraction term LAttr denotes the loss between the predicted bounding-box and its ground truth. \(P=\left (l_{P}, t_{P}, w_{P}, h_{P}\right )\) are the coordinates of a proposal and \(G=\left (l_{G}, t_{G}, w_{G}, h_{G}\right )\) are the coordinates of a ground truth, where l and t denote the coordinates of the left-top corner of a box and w and h its width and height, respectively. The set \(\mathcal {P}_{+}=\{P\}\) contains all positive proposals, and the set \(\mathcal {G}=\{G\}\) contains all ground truths in one image. LAttr is defined as follows:

$$ {{G}_{Attr}^{P}=\underset{G \in \mathcal{G}}{\arg\max} IoU(G, P)} $$
(25)
$$ {L_{\text{Attr}}=\frac{{\sum}_{P \in \mathcal{P}_{+}} \text{Smooth}_{L 1}\left( B^{P}, G_{A t t r}^{P}\right)}{\left|\mathcal{P}_{+}\right|}} $$
(26)

where \(P \in \mathcal {P}_{+}\), BP is the predicted box regressed from proposal P.

The repulsion term LRepGT represents the loss between the predicted bounding-box and the neighboring ground-truth boxes of the same class other than its designated target. It is defined as follows:

$$ {{G}_{Rep}^{P}= \underset{G \in \mathcal{G}\setminus\{G_{Attr}^{P}\}}{\arg\max} IoU(G, P)} $$
(27)
$$ {{IoG}(B, G) \triangleq \frac{area(B \cap G)}{area(G)}, {IoG}(B, G) \in[0,1]} $$
(28)
$$ \text{Smooth}_{ln}=\left\{\begin{array}{ll} {-\ln (1-x)} & {x \leq \sigma} \\ {\frac{x-\sigma}{1-\sigma}-\ln (1-\sigma)} & {x>\sigma} \end{array}\right. $$
(29)
$$ {L_{\text{RepGT}} = \frac{{\sum}_{P \in \mathcal{P}_{+}} \text{Smooth}_{ln}\left( IoG(B^{P}, G_{Rep}^{P})\right)}{|\mathcal{P}_{+}|}} $$
(30)

where \(IoG(B^{P}, {G}_{Rep}^{P})\) is the overlap between BP and \(G_{Rep}^{P}\), as defined in equation (28). Smoothln is a continuously differentiable function on (0,1), and σ ∈ [0,1) adjusts the sensitivity of the repulsion loss to outliers.
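
A small sketch of these two quantities as defined in equations (28) and (29), with boxes given in the (left, top, width, height) format used above; the example values are illustrative.

import numpy as np

def iog(B, G):
    # Intersection over the ground-truth area, eq. (28); boxes are (l, t, w, h).
    lx = max(B[0], G[0]); ty = max(B[1], G[1])
    rx = min(B[0] + B[2], G[0] + G[2]); by = min(B[1] + B[3], G[1] + G[3])
    inter = max(0.0, rx - lx) * max(0.0, by - ty)
    return inter / (G[2] * G[3])

def smooth_ln(x, sigma=0.5):
    # Smoothed ln penalty of eq. (29); sigma adjusts the sensitivity to outliers.
    x = np.asarray(x, dtype=float)
    return np.where(x <= sigma, -np.log(1 - x), (x - sigma) / (1 - sigma) - np.log(1 - sigma))

print(iog((10, 10, 40, 40), (30, 30, 40, 40)))   # 0.25
print(smooth_ln([0.2, 0.9]))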

The repulsion term LRepBox represents the loss between predicted bounding-boxes that are regressed from proposals assigned to different ground-truth objects. It is defined as follows:

$$ {L_{\text{RepBox}}=\frac{{\sum}_{i \neq j} \text{Smooth}_{ln}\left( IoU\left( B^{P_{i}}, B^{P_{j}}\right)\right)}{{\sum}_{i \neq j} \mathbb{1}\left[IoU\left( B^{P_{i}}, B^{P_{j}}\right)>0\right]+\epsilon}} $$
(31)

where \(\mathcal {P}_{+}\) is divided into \(|\mathcal {G}|\) disjoint subsets according to the ground-truth object each proposal is assigned to, \(\mathcal {P}_{+}=\mathcal {P}_{1} \cup \mathcal {P}_{2} \cup {\ldots } \cup \mathcal {P}_{|\mathcal {G}|}\). Two proposals \(P_{i} \in \mathcal {P}_{i}\) and \(P_{j} \in \mathcal {P}_{j}\) are sampled from two different subsets, where \(i, j=1,2, \ldots ,|\mathcal {G}|\) and i ≠ j, and \(B^{P_{i}}\) and \(B^{P_{j}}\) denote the corresponding predicted bounding-boxes. 𝜖 is a small constant that prevents division by 0, and \(\mathbb{1}[\cdot]\) is the indicator function.

A-Fast-RCNN and RepLoss prove the feasibility of addressing the dense occlusion problem from the network architecture and from the loss function, respectively. Subsequently, Shifeng Zhang et al. proposed OR-CNN [276], published in ECCV 2018. Based on Faster RCNN [191], OR-CNN adds a compactness term to the regression loss, named Aggregation Loss, which reduces missed detections of overlapping objects. At the same time, a Part Occlusion-aware RoI Pooling Unit is used instead of RoI Pooling to reduce the impact of the occluded parts on the global features. Experimental results show that these three algorithms achieve satisfactory performance on the CityPersons dataset [275].

5.2 Multi-scale objects detection

Multi-scale object detection is one of the most complicated difficulties in the field of object detection. A convolutional neural network is a typical hierarchical structure: each layer abstracts the feature maps of the image, and the semantic information represented by the feature maps differs from layer to layer. This inherent hierarchy shapes how DCNN-based detectors handle multi-scale objects. For example, RCNN [66] and YOLO [187] perform object classification and bounding-box regression only on the last layer of feature maps, which loses much feature information and is obviously not conducive to detecting multi-scale objects. In particular, YOLO is not robust to small objects, because their features are largely lost by the last convolutional layer.

The combination of information fusion and hierarchical structures of DCNNs, researchers proposed the idea of multi-layer feature fusion and multi-layer detection to solve the problem of multi-scale object detection. By summarizing the relevant works, researchers have many achievements around this difficult problem, and many classical network architectures are proposed. These include FPN [144], SNIP [204], FSSD [137], SPPNet [81], Hypercolumns [76], HyperNet [112], ION [7], RON [111], SSD [150], DSSD [59], DSOD [199], MSCNN [14], RBFNet [148], RefineDet [277], STDN [288], DES [281], PFPNet [107], and so on.

Following the idea of multi-layer feature fusion, Hypercolumns [76] fuses the feature maps of every layer: for each pixel, the activations of all network layers above that pixel are concatenated into a feature vector called the hypercolumn, which contains more detailed localization and classification information. Similarly, skip-layer features are applied in ION [7] and HyperNet [112]. In HyperNet, multi-level feature maps are extracted through skip connections [82], and the multi-scale hyper-feature obtained by fusion includes both high-level and low-level semantics, which improves the robustness of multi-scale object detection.

The idea of multi-layer detection exploits the fact that, since each layer represents different semantic features, different feature maps can be used to detect objects of different scales. SSD [150] performs detection on the feature maps generated by six different layers and finally uses NMS for post-processing. Songtao Liu et al. introduced a Receptive Field Block (RFB) [148] into the SSD architecture: a multi-branch structure composed of convolutional layers of different kernel sizes and dilated convolutions [261, 262] increases the receptive field, and the outputs of the branches with different sizes and dilation ratios are concatenated. The results show that RFB improves the robustness of detection. MSCNN [14] optimizes Faster RCNN for detecting multi-scale objects, especially small objects; its core is a multi-scale object proposal network composed of a multi-level network structure.

Multi-layer feature fusion and multi-layer detection can also be combined to improve the robustness of multi-scale object detection. In FPN [144], the bottom-up pathway down-samples the image while the top-down pathway up-samples coarse feature maps; lateral connections merge the bottom-up and top-down feature maps, and each merged level is used to make predictions. In DSSD and FSSD, skip connections and feature fusion are likewise used to transform the SSD architecture.
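
A simplified PyTorch sketch of such a top-down merge with lateral connections; the channel numbers and the number of pyramid levels are illustrative and do not correspond to any specific published configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """FPN-style fusion: 1x1 lateral convs plus up-sampled top-down features."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):
        # feats: bottom-up feature maps, ordered from high resolution to low resolution.
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        merged = [laterals[-1]]
        for lat in reversed(laterals[:-1]):
            up = F.interpolate(merged[0], size=lat.shape[-2:], mode="nearest")
            merged.insert(0, lat + up)   # lateral connection + up-sampled coarser map
        return [s(m) for s, m in zip(self.smooth, merged)]

feats = [torch.randn(1, 256, 64, 64), torch.randn(1, 512, 32, 32), torch.randn(1, 1024, 16, 16)]
print([p.shape for p in TopDownFusion()(feats)])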

5.3 Class Imbalance

One of the reasons why the accuracy of one-stage object detection is lower than that of two-stage object detection is class imbalance; the region proposal stage of two-stage detectors effectively mitigates it [142]. Examples can be divided into hard positives, hard negatives, easy positives, and easy negatives, and hard examples are far fewer than easy ones. Representative methods for this problem include OHEM [202], Focal Loss [142], CC-Net [169], and RON [111]. The main idea of OHEM is to select hard examples as training samples. Focal Loss re-weights examples through the loss function, making the network pay more attention to hard examples. CC-Net uses a cascaded network structure that filters out large numbers of easy examples in earlier stages and concentrates on the few hard examples in later stages. RON uses an objectness prior map [111] to distinguish object foreground from background, which keeps positive and negative examples at a certain ratio; network training is then completed in combination with the loss strategy.

Among them, Focal Loss points out a new way to solve class imbalance: from the perspective of the loss function rather than the network structure. Analysis of one-stage detectors shows that the imbalance between hard and easy (positive and negative) examples results in low detector accuracy. Tsung-Yi Lin et al. addressed this from the loss-function perspective and proposed the Focal Loss, which is modified from the cross-entropy (CE) loss.
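
A minimal sketch of this modification of the cross-entropy loss for binary classification; the values α = 0.25 and γ = 2 are the commonly reported settings, and this is not the authors' reference implementation.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) for binary targets."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()      # easy examples are down-weighted

logits = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(logits, targets))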

5.4 Post-processing of redundant bounding-boxes

The post-processing of redundant bounding-boxes de-duplicates repeated bounding-boxes and selects the most accurate ones as the results. This de-duplication can be applied both to intermediate stages of object detection and to the final results. For example, Faster RCNN uses non-maximum suppression (NMS) to reduce the number of region proposals to 300 in the final step of the Region Proposal Network (RPN) [191]; this reduces computation without compromising the final detection accuracy. YOLO and SSD use NMS to generate the ultimate detection results. NMS is the original algorithm for eliminating duplicates. Inspired by NMS, UMIACS proposed the Soft-NMS algorithm [10], and based on NMS and Soft-NMS, CMU and Face++ proposed the Softer-NMS algorithm [84]. Face++ published IoU-Net [100] in 2018, which makes up for the shortcoming that NMS uses only the classification confidence of a bounding-box as the de-duplication criterion: it introduces a localization confidence into the non-maximum suppression process, which further improves the localization accuracy of the bounding-boxes.

5.4.1 NMS

The core idea of non-maximum suppression (NMS) [10, 166] is to use local maxima to suppress non-maximum values and retain only the maxima. In object detection, each bounding-box receives a class label and a confidence score after classification, but the bounding-boxes overlap. NMS is therefore needed to select the highest-scoring bounding-box in each neighborhood and suppress the bounding-boxes with lower confidence. The algorithm is described in Fig. 9a.

Fig. 9
figure 9

The pseudo code of the post-processing of the redundant bounding-boxes, where a is NMS, b is soft-NMS, c is softer-NMS, and d is IoU-guided NMS

Although NMS [10] works well for de-duplicating bounding-boxes, it causes missed detections for overlapping (densely occluded) objects, and its threshold is hard to set. The re-scoring function of the NMS algorithm is defined as follows, with the parameters of equation (32) defined in Fig. 9a.

$$ s_{i}=\left\{\begin{array}{ll}{s_{i},} & {\text{iou}\left( \mathcal{M}, b_{i}\right)<N_{t}} \\ {0,} & {\text{ iou }\left( \mathcal{M}, b_{i}\right) \geq N_{t}} \end{array}\right. $$
(32)
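
A plain sketch of the greedy procedure that equation (32) describes, with boxes given as (x1, y1, x2, y2) corners and scores as classification confidences; the threshold and example boxes are illustrative.

import numpy as np

def iou_matrix(box, boxes):
    # IoU of one box against an array of boxes; boxes are (x1, y1, x2, y2).
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]                             # box M with the highest remaining score
        keep.append(int(i))
        ious = iou_matrix(boxes[i], boxes[order[1:]])
        order = order[1:][ious < iou_thresh]     # suppress boxes overlapping M too much
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2]: the second box is suppressed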

5.4.2 Soft-NMS

In order to deal with some of the shortcomings of NMS, UMIACS proposed the Soft-NMS algorithm [10] (described in Fig. 9b), published in ICCV 2017. During the de-duplication of redundant bounding-boxes, Soft-NMS does not delete (suppress) bounding-boxes whose IoU with the selected box exceeds a threshold; instead, it decays their classification confidence and carries them into the next iteration. For decaying the classification confidence, the authors proposed a linear penalty function and a Gaussian penalty function, defined as follows, with the parameters of equations (33) and (34) defined in Fig. 9b.

$$ s_{i}=\left\{\begin{array}{ll}{s_{i},} & {\text{iou}\left( \mathcal{M}, b_{i}\right)<N_{t}} \\ {s_{i}\left( 1-\text{iou}\left( \mathcal{M}, b_{i}\right)\right),} & {\text{iou}\left( \mathcal{M}, b_{i}\right) \geq N_{t}} \end{array}\right. $$
(33)
$$ s_{i}=s_{i} e^{-\frac{\text{iou}\left( \mathcal{M}, b_{i}\right)^{2}}{\sigma}}, \forall b_{i} \notin \mathcal{D} $$
(34)
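
A sketch of the two decay functions of equations (33) and (34); in the full algorithm they replace the hard suppression step inside the NMS loop sketched earlier, and the threshold and σ values here are illustrative.

import numpy as np

def linear_decay(scores, ious, Nt=0.5):
    # Eq. (33): scores of boxes with IoU >= Nt are scaled by (1 - IoU).
    return np.where(ious < Nt, scores, scores * (1.0 - ious))

def gaussian_decay(scores, ious, sigma=0.5):
    # Eq. (34): every score is decayed continuously with a Gaussian penalty.
    return scores * np.exp(-(ious ** 2) / sigma)

scores = np.array([0.9, 0.8, 0.7])
ious = np.array([0.0, 0.6, 0.1])   # overlap of each box with the currently selected box M
print(linear_decay(scores, ious))
print(gaussian_decay(scores, ious))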

5.4.3 IoU-Net

NMS and Soft-NMS sort bounding-boxes by classification confidence, but the localization confidence of the bounding-boxes is not utilized; as a result, more accurately localized bounding-boxes may be suppressed during the NMS process. Therefore, Face++ proposed IoU-Net [100], which introduces localization confidence into the NMS process. The IoU between the predicted bounding-box and its ground truth, as estimated by the network, is used as the localization confidence and as the sorting criterion instead of the classification confidence; a clustering-like rule is then used to update the classification confidence. This procedure is called IoU-guided NMS and is depicted in Fig. 9d.

5.4.4 Softer-NMS

Softer-NMS [84] introduces a KL loss together with Gaussian and Dirac-delta distributions into bounding-box regression in order to obtain the variances of the four coordinates of the predicted bounding-box; this is called bounding-box regression with KL loss. The standard deviation of the bounding-box, obtained from the coordinate variances, serves as the localization confidence. During suppression, Softer-NMS builds on Soft-NMS and performs a localization-confidence-weighted average over the bounding-boxes whose overlap exceeds a certain threshold. This method addresses the problem that classification confidence and localization confidence are not positively correlated. The algorithm is described in Fig. 9c.

5.5 Detection speed

The speed of object detection is an important performance indicator and is critical for applications. The lightweight backbone networks mentioned in Section 2.2 aim to speed up object detection, and network model compression can also improve detection speed. The main techniques are network model pruning, network model quantization, and network model distillation.

Network model pruning

is a model compression method that removes unimportant convolutional filters or connections to reduce the number of model parameters; the core of pruning is therefore to find the unimportant filters or connections. Hao Li et al. proposed a method based on weight magnitude [125]: the absolute values of each filter's weights are summed, and the filters with the lowest sums are removed, which reduces the model parameters. However, Jian-Hao Luo et al. argue that it is difficult to determine the importance of a filter from its weight values alone, and proposed an entropy-based pruning method that uses entropy to determine filter importance [156]. The authors of [3] proposed random pruning followed by measuring the performance of each pruned model to find a locally optimal pruning scheme. Tien-Ju Yang et al. used the energy consumption of each layer to decide which layer to prune [258]: layers with high energy consumption are selected for pruning, with weight magnitudes used to assist the judgment.
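
A minimal sketch of the weight-magnitude criterion described above: the absolute values of each filter's weights are summed and the filters with the smallest sums are dropped. The layer size and pruning ratio are illustrative, and in a full network the input channels of the following layer would also have to be pruned accordingly.

import torch
import torch.nn as nn

def prune_filters_by_l1(conv, keep_ratio=0.75):
    """Return a new Conv2d keeping only the filters with the largest L1 norms."""
    with torch.no_grad():
        l1 = conv.weight.abs().sum(dim=(1, 2, 3))        # one L1 norm per output filter
        n_keep = max(1, int(conv.out_channels * keep_ratio))
        keep_idx = l1.argsort(descending=True)[:n_keep]
        pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                           conv.stride, conv.padding, bias=conv.bias is not None)
        pruned.weight.copy_(conv.weight[keep_idx])
        if conv.bias is not None:
            pruned.bias.copy_(conv.bias[keep_idx])
    return pruned

conv = nn.Conv2d(64, 128, 3, padding=1)
print(prune_filters_by_l1(conv).weight.shape)   # torch.Size([96, 64, 3, 3])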

Network model quantization

reduces the bit-width used to encode the weights; for example, a 32-bit floating-point weight can be replaced by a low-bit (even 1-bit) representation, which reduces the number of stored parameters. At present, model binarization is a main research direction in quantization. For example, Binarized Neural Networks [249] extend BinaryConnect by quantizing not only the weights but also the activations to 1 bit, which reduces memory consumption and simplifies many multiply-accumulate operations into bitwise XNOR-count operations. Similarly, XNOR-Net [185] binarizes both the weights and the activations, achieving 32× storage compression and a 58× speed-up.
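
A sketch of the basic forward binarization step used by such methods: weights are replaced by their sign, scaled by a per-filter factor equal to the mean absolute weight to reduce quantization error; the full training procedure (straight-through gradients, activation binarization) is omitted.

import torch

def binarize_weights(w):
    """Approximate w by alpha * sign(w), with one alpha per output filter."""
    alpha = w.abs().mean(dim=(1, 2, 3), keepdim=True)   # per-filter scaling factor
    return alpha * torch.sign(w)

w = torch.randn(16, 8, 3, 3)
w_bin = binarize_weights(w)
print(w_bin.unique().numel() <= 2 * 16)   # at most two values (+/- alpha) per filter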

Network model distillation

is a kind of transfer learning that uses a pre-trained complex model (the teacher) to train a simple model (the student), thereby achieving compression. For detection, Guobin Chen et al. used highly sophisticated detector models as teachers to guide the learning of efficient student models [25]. Yu Hao et al. proposed a new end-to-end object detection architecture that combines incremental learning with model distillation, which improves both detection speed and accuracy [74].
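
A sketch of the basic logit-distillation objective often used in such work: the student is trained with a weighted sum of the hard-label loss and a softened KL divergence to the teacher's outputs. The temperature and weighting are illustrative, and the cited detection methods distill richer signals than plain classification logits.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Weighted sum of hard-label cross entropy and soft teacher-student KL divergence."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1 - alpha) * soft

student = torch.randn(8, 20)   # e.g. logits for 20 classes from a small student model
teacher = torch.randn(8, 20)   # logits from a large pre-trained teacher model
labels = torch.randint(0, 20, (8,))
print(distillation_loss(student, teacher, labels))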

5.6 Small object detection

Small object detection is one of the difficulties in object detection. Improving it will promote related applications such as autonomous driving, remote sensing image detection, industrial defect detection, and medical image detection. At present, some classic detection methods (such as Faster RCNN, YOLO, and SSD) are not ideal for detecting small objects. Analyzing the characteristics of small objects reveals the following main reasons:

  1. (1).

Small objects occupy few pixels in the original image, which leaves little feature information for detection (objects with a resolution below 32 × 32 pixels are generally defined as small objects).

  2. (2).

After the original image passes through the down-sampling layers of the backbone network, the location information of small objects may be lost.

  3. (3).

Small objects appear less frequently in the images, resulting in an imbalance between small objects and medium or large objects.

In recent years, researchers have proposed several technical solutions for small object detection. Mate Kisantal et al. proposed a small-object data augmentation method for the MS COCO dataset that over-samples images containing small objects and copy-pastes small objects within them [160]. Tsung-Yi Lin et al. proposed the feature pyramid network, which improves multi-scale detection through multi-layer feature fusion and multi-layer detection and also enhances small object detection [144]. Jianan Li et al. proposed Perceptual Generative Adversarial Networks (GANs) to improve small object detection [128]; Perceptual GANs mine the structural associations between objects of different scales to improve the feature representation of small objects. Chenyi Chen et al. redesigned the anchor sizes and introduced a context model based on Faster RCNN to improve small object detection. Jun Wang et al. proposed H-CNN to detect small ship objects in synthetic aperture radar (SAR) images [234]. Yantao Wei et al. proposed a multiscale patch-based contrast measure for infrared small-target detection, which effectively suppresses the interference of background clutter [244]. Liangkui et al. proposed a 7-layer DCNN to achieve automatic extraction of small object features and end-to-end suppression of clutter [139]. The problem of losing small object locations can be mitigated by combining high-level and low-level features, as in DSSD [59], Feature-Fused SSD [17], and MDSSD [253], which are based on this structure.

6 Datasets and evaluation criteria

The evaluation criteria applied to standard datasets are the yardstick for measuring the performance of detection networks. Table 6 summarizes the performance of some object detection methods on the MS COCO dataset over the past five years. The datasets and evaluation criteria are summarized below.

Table 6 Detection results of the detectors published in the top conferences or journals on the MS COCO test-dev(%) in recent years. S: small objects, M: medium objects, L: large objects

6.1 Datasets

The training of DCNNs is inseparable from image datasets, which have played an indelible role in the development of DCNNs. Whether the task is image recognition, object detection, or instance segmentation, each has its own methods and its own dedicated datasets. Excellent datasets are important drivers of the current impressive progress in object detection. In the past 20 years, many excellent datasets have emerged, and some of them have become important industry benchmarks for measuring algorithm performance. Their maintainers revise and expand the datasets to enrich them, which provides better support for DCNNs and promotes the continuous development of the field. For example, the release of the ImageNet dataset [42, 195] formed a unified standard for training and testing supervised convolutional neural networks; training on a common dataset makes it easy to form a unified evaluation standard and eliminates the differences among local datasets. The increase in the number and quality of images in the datasets certainly promotes the development of deep learning.

The image datasets are abundant, ranging from a few megabytes to terabytes, and can be applied to object detection, image classification, and instance segmentation. They include MNIST [123], CIFAR-10 [115], CIFAR-100 [115], ImageNet [42], Open Images [113], MS COCO [143], PASCAL VOC [51, 52], SUN [251], Tiny Images [222], and Places [286]. Figure 10 shows example images from four commonly used datasets.

Fig. 10
figure 10

Image examples of ImageNet, PASCAL VOC, Open Images and MS COCO

The entry-level MNIST dataset is widely used for training and testing in machine learning due to its single objects and small images; the classical LeNet was designed for it [123]. Similar to MNIST, CIFAR-10 and CIFAR-100 are of moderate size (about 170 MB) and are a good choice for image classification algorithms. These two datasets are subsets of the very large Tiny Images dataset. CIFAR-10 is more diverse than the MNIST handwritten-digit dataset, but it has only ten categories, which is a limitation; researchers therefore compiled CIFAR-100 on the same basis. CIFAR-100 is similar to CIFAR-10 but has 100 classes with 600 images per class, and the 100 classes are grouped into 20 superclasses, so each image has two labels, a class label and a superclass label. It is worth mentioning that the classes of CIFAR-10 are completely mutually exclusive.

The larger Caltech datasets were released by the California Institute of Technology. They mainly include two datasets, Caltech-101 (131 MB) [56] and Caltech-256 (1.2 GB) [70]. Caltech-101 has 102 classes and 9,145 images, comprising 101 ordinary object categories and one miscellaneous (background) class; the miscellaneous class is usually removed in practice. Caltech-256 is similar to Caltech-101 and contains 30,607 images.

The PASCAL VOC dataset (2 GB) [52], which pioneered algorithm competitions, provides a set of standardized datasets for image recognition, segmentation, and classification. The training set consists of labelled images; objects in the dataset fall into 4 broad categories: people, animals, vehicles, and indoor objects. Each image corresponds one-to-one to an annotation file containing bounding boxes and class labels, and one image may contain multiple objects of different categories. The PASCAL VOC dataset has good image quality and complete labels, which makes it very suitable for evaluating algorithm performance. The ratio of training to testing images is about 1:1, and the category distributions are similar. The SUN dataset (7 GB) [251] focuses on scene recognition and object detection; it consists of 908 scene categories and 4,479 object categories, with 313,884 labeled objects.

The MS COCO dataset (40 GB) [143], sponsored by Microsoft, is very large compared with the datasets above. Its annotations include not only category and location information but also semantic text descriptions of the images. Like ImageNet in the field of image classification and detection, the MS COCO dataset has become a yardstick for evaluating algorithm performance in visual semantic understanding. The pre-trained models in Google's open-source detection API are based on the MS COCO dataset. The dataset is aimed at scene understanding and contains 91 classes, 328,000 images, and 2,500,000 labels. MS COCO has fewer categories than ImageNet and SUN, but more images per category, which makes it easier for models to learn to analyze a certain type of object in a certain scene during training. The MS COCO dataset contains more images and classes than the PASCAL VOC dataset mentioned earlier.

As another example, the ImageNet dataset [42] is at the TB level and is widely applied in the vision field. Much research in computer vision, such as image classification and object detection, is carried out on ImageNet. It is maintained by a dedicated team, has detailed documentation, and is very easy to use. Annotation problems in ImageNet are also fixed centrally and the dataset is reissued once a year, with the latest version recommended. ImageNet is even regarded as the benchmark for algorithm performance evaluation in computer vision; it contains more than 14 million images and covers more than 20,000 categories. The well-known ILSVRC image classification and object detection challenge [195] is based on the ImageNet dataset.

Google released a dataset named Open Images in 2016 [113]. It has about 9 million images and about 6,000 category labels, and it became new data support for the computer vision community to develop models. At the same time, the large amount of image data in Open Images can guarantee the thorough training of deep network models. Later, Google announced Open Images V4, which contains 15.4 million bounding-boxes for 600 categories on 1.9 million images and is the largest existing dataset with object location annotations; the ECCV 2018 image challenge was held on it. Much of the annotation work was carried out by professional annotators, which guarantees high accuracy and consistency. In addition, the image categories in Open Images are quite diverse, and most images depict complex scenes containing multiple objects. The overall download size of the dataset is only 1.5 GB, because Open Images provides only the URLs of the images.

6.2 Evaluation criteria

How to evaluate the performance of algorithms on unified datasets is an important issue. At present, there are three main performance criteria: Precision, Recall, and frame rate. There are also other closely related metrics, including IoU, the PR curve, the ROC curve, and AUC.

IoU (Intersection-over-Union) is an important concept in object detection. It is the ratio between the area of intersection and the area of union of the predicted bounding-box and the ground-truth bounding-box, and it can also be seen as the similarity of the two sets, expressed by the Jaccard index. IoU therefore measures the localization accuracy of object detection: the closer the IoU is to 1, the more accurate the detection.

Based on IoU, performance is measured by Precision and Recall. To obtain the Precision and Recall of a network model, the numbers of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN) on the validation or testing dataset are needed, and IoU is used to determine whether each detection is correct. Suppose the goal of the model is to detect cats: it is necessary to determine whether each "cat" detected by the model is true (TP) or false (FP). Setting IoU = 0.5, a detection with IoU > 0.5 is considered correct, and otherwise it is wrong. PASCAL VOC uses IoU = 0.5, while the MS COCO competition averages over IoU thresholds between 0.5 and 0.95. With IoU = 0.5, the numbers of correct and incorrect detections can be obtained, and the precision and recall of each category in one image can be calculated as follows:

$$ P_{C_{ij}}=\frac{TP_{C_{ij}}}{TP_{C_{ij}}+FP_{C_{ij}}} $$
(35)
$$ \text{Recall}_{C_{ij}}=\frac{TP_{C_{ij}}}{TP_{C_{ij}}+FN_{C_{ij}}} $$
(36)

where \(P_{C_{ij}}\) represents the Precision of category Ci in the j th image, while \(\text {Recall}_{C_{ij}}\) represents the Recall of category Ci in the j th image.

The Average Precision (AP) of the category Ci can be calculated:

$$ A P_{C_{i}}=\frac{1}{m} \sum\limits_{j=1}^{m} P_{C_{i j}} $$
(37)

There are multiple categories \(\left \{C_{1}, C_{2}, \cdots , C_{n}\right \}\) for the dataset. Therefore, the mean Average Precision (mAP) of the entire category can be calculated, as follows:

$$ m A P=\frac{1}{n} \sum\limits_{i=1}^{n} A P_{C_{i}} $$
(38)

In actual scenes there may be multiple object categories to detect, so mAP is used to describe the detection performance of the model over all object categories.
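
A small sketch that follows the simplified per-image definitions of equations (35)-(38); the TP/FP/FN counts are illustrative, and standard benchmarks instead compute AP as the area under the precision-recall curve.

import numpy as np

def precision_recall(tp, fp, fn):
    # Eq. (35) and (36) for one category in one image.
    return tp / (tp + fp), tp / (tp + fn)

# Per-image (TP, FP, FN) counts for two categories, decided with an IoU threshold of 0.5:
counts_cat1 = [(4, 1, 0), (3, 2, 1), (5, 0, 2)]
counts_cat2 = [(2, 2, 1), (1, 1, 1), (3, 1, 0)]

ap_per_cat = []
for counts in (counts_cat1, counts_cat2):
    precisions = [precision_recall(*c)[0] for c in counts]
    ap_per_cat.append(np.mean(precisions))     # eq. (37): mean precision over images
print("mAP:", np.mean(ap_per_cat))             # eq. (38): mean over categories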

To evaluate the performance of a detector more comprehensively, Precision and Recall are used to construct the Precision-Recall (PR) curve. If Precision can remain high as Recall increases, the detector performs well; a poor detector has to sacrifice a lot of Precision to keep Recall at a high level. Because the PR curve is built from Precision and Recall, it is mainly concerned with the positive examples in the dataset, and since positive predictions are the primary focus of testing, PR curves are widely used in many papers.

The Receiver Operating Characteristic (ROC) curve corresponds to the PR curve, the difference being that the ROC curve uses FPR (false positive rate) and TPR (true positive rate). The closer the ROC curve is to the upper-left corner, the better the detector performs; the point (0, 1) corresponds to all positive examples being ranked before all negative examples, which is the ideal state. To indicate the quality of a detector more intuitively, the area under the ROC curve (AUC) can be used: the AUC value is typically between 0.5 and 1.0, and a larger value represents better performance.

When the proportion of positive and negative samples in the testing set is uneven, the ROC curve describes the performance of a detector more stably, whereas the PR curve is greatly affected under the same conditions. The ROC curve is therefore a more balanced evaluation method than the PR curve, and the choice between them should be made according to actual needs. Here are some suggestions:

  1. (1).

    If it is necessary to evaluate the overall performance of the detector, the ROC curve should be used.

  2. (2).

    In the case of uneven categories, and if you want to remove the impact of category distribution on performance evaluation, the ROC curve should be used.

  3. (3).

    If you need to test the impact of different categories on performance of the detector, the PR curve should be used.

  4. (4).

    Most applications should focus on the detection of positive examples in the same category. Therefore, the PR curves should be used.

  5. (5).

When the categories are imbalanced, the ROC curve tends to give a more optimistic result due to its higher tolerance for negative examples.

7 Applications

As one of the three basic tasks of computer vision, object detection has a wide range of applications in real-world scenarios. In real-world application scenarios, object detection differs in technology implementation depending on the specific tasks. Important applications of object detection are reviewed in this section, including face detection [240], salient object detection [11], pedestrian detection [45], remote sensing image detection [193] and medical image detection [145].

7.1 Face detection

Face detection is one of the most important application areas of object detection and is the basis of face recognition, face alignment, gender recognition, and sentiment analysis. In the real world, face detection is a challenging task due to variations in facial appearance, illumination, pose, and occlusion.

The purpose of face detection is to determine whether faces are present in an image and, if so, to locate them. Traditional face detection is mainly based on sliding windows and handcrafted feature extractors: face template features are matched against the image features under the sliding window to determine the face positions. The representative method is the VJ detector designed by Viola and Jones in 2001 [230], which uses Haar features and cascaded AdaBoost classifiers to construct the detector and greatly increases detection speed and accuracy. In addition, ACF [256] and DPM [57] also improve the performance of face detection. However, traditional face detection algorithms still have many problems. With the advent of the deep learning era, face detection based on deep learning shows strong performance.

Deep learning-based object detection algorithms have achieved great success in generic object detection, and many face detection algorithms evolve from them. Haoxiang Li et al. proposed Cascade CNN [126], which contains multiple cascaded DCNN classifiers and, to some extent, solves the sensitivity to illumination and angle in real-world scenarios. Zhang et al. proposed a multi-task face detection algorithm with a cascaded architecture similar to Cascade CNN, called MTCNN [273]; it integrates face detection and facial landmark detection into a three-stage framework. Similar to Cascade CNN, Faceness-Net [257] is a coarse-to-fine detection pipeline that uses multiple DCNN-based classifiers to detect faces. To improve the classifier, Hao Wang et al. proposed Face RCNN, which is based on Faster RCNN and adds a center loss on top of the softmax loss [101]. In response to small-object and multi-scale problems in face detection, Peiyun Hu et al. proposed a hybrid-resolution model that processes image pyramids and applies scale-specific detectors [90]. M. Najibi et al. proposed SSH [165], which implements multi-scale face detection by detecting on feature maps of different scales. Small-object and multi-scale face detection can also be handled with anchor strategies, as in FaceBoxes [278], S3FD [279], and ScaleFace [75].

7.2 Salient object detection

The role of salient object detection is to highlight the main object regions in the image, also known as the salient regions. As an important application of computer vision and object detection, salient object detection is widely used in image understanding, video understanding, computer graphics and robot navigation.

In the era before deep learning, Itti et al. proposed the earliest saliency model, based on center-surround mechanisms, to detect spatially discontinuous objects in a scene [98]. Liu et al. proposed to cast saliency detection as binary segmentation, which promoted the development of salient object detection [149]. Yu et al. proposed to determine a background score for each region based on observations of the background and the salient regions [264].

Since DCNNs have strong feature representation capabilities, introducing DCNNs into salient object detection is a natural trend. R. Zhao et al. proposed the MCDL architecture [284], which extracts local and global contexts with multi-layer perceptrons (MLPs) and then classifies foreground and background. Other multi-layer perceptron-based methods include SuperCNN [83], MAP [271], LEGS [235], and MDF [124]. Although these MLP-based methods improve performance, they are insensitive to spatial information and very time consuming. Currently, the most advanced salient object detection methods are based on fully convolutional networks. L. Wang et al. proposed recurrent fully convolutional networks (RFCN) to further improve detection performance [236]. P. Hu et al. proposed a set of deep networks that generate compact and uniform saliency maps by distinguishing pixels near object boundaries [91]. Based on the DeepLab algorithm, J. Zhang et al. proposed DUS, which learns latent saliency and noise patterns from the pixel-level supervision produced by several heuristic saliency methods [272]. A more detailed summary can be found in the review [239].

7.3 Pedestrian detection

Pedestrian detection is widely used in intelligent surveillance, autonomous driving, and robotic navigation. The problems it faces are more complicated than those of generic object detection: pedestrians have the characteristics of both rigid and non-rigid objects and are therefore more susceptible to the influence of posture, dense occlusion, illumination, and viewing angle, so pedestrian detection is even harder than face detection.

Traditional pedestrian detection is based on handcrafted feature extractors, similar to traditional face detection. A representative example is the HOG + SVM pedestrian detection algorithm proposed by Navneet Dalal et al. in CVPR 2005 [39], which uses edge orientation and intensity information to describe the shape and appearance of pedestrians.

DCNNs are well suited to pedestrian detection because the feature representation ability of their multi-layer nonlinear mappings is stronger than that of traditional handcrafted feature extractors. Faster RCNN does not perform well for pedestrian detection; Zhang et al. attributed this to the low resolution of the feature maps and the lack of hard negative example mining. They therefore used the RPN to handle small objects and hard negative examples on the basis of Faster RCNN, and then used random forests to classify the proposal regions [274]. Also based on Faster RCNN, Jiayuan Mao et al. proposed HyperLearner, which enhances the discrimination of pedestrians and backgrounds by modifying the anchor scales [159]. For the multi-scale problem of pedestrians, Jianan Li et al. designed two sub-networks that detect large-scale and small-scale pedestrians in parallel, and then used a scale-aware weighting to merge the two sub-networks [127]. To address occlusion, Yonglong Tian et al. proposed DeepParts [220], which divides the human body into multiple parts, detects them separately, and then combines the results. Occlusion can also be handled with specialized loss functions, such as the Repulsion Loss [242] proposed by Xinlong Wang et al. and the Aggregation Loss [276] proposed by Zhang et al.

7.4 Remote sensing image detection

Remote sensing image detection is mainly used in military reconnaissance, land and resources survey, urban planning and traffic navigation. The objects are varied, including aircraft, ships, vehicles, roads, airports, ports and various buildings. Remote sensing image detection mainly has the following difficulties:

  1. (1).

The wide field of view of remote sensing images leads to high image resolution, which puts high demands on the speed of object detection.

  2. (2).

The wide field of view also makes objects small relative to the image, so small object detection is another difficulty for remote sensing images.

  3. (3).

The objects in natural images are mostly horizontal, whereas remote sensing images are taken from overhead, so rotation invariance of the objects is an important issue.

  4. (4).

    The background of remote sensing image is quite complex.

At present, deep learning-based remote sensing image detection is devoted to solving these problems. Adam Van Etten proposed YOLT [226] based on the fast YOLOv2, which improves the speed of high-resolution remote sensing image detection and increases the resolution of the feature maps to help with small object detection. Yang Long et al. proposed unsupervised score-based bounding box regression (USB-BBR), combined with non-maximum suppression, to optimize bounding boxes and enhance the ability to locate small objects [153]. Jiangmiao Pang et al. proposed R2-CNN, which enhances the detection of small objects in remote sensing images by introducing attention mechanisms [173]. Li Ke et al. deal with rotation invariance by adding multi-angle anchors to the RPN [130]. Gong Cheng et al. proposed a rotation-invariant layer to handle rotation invariance [33]. Chen Wang et al. proposed an end-to-end multi-scale visual attention network (MS-VANs) [232]; the core idea is to learn a visual attention sub-network for the feature map of each scale, in order to highlight the objects and suppress the backgrounds.

7.5 Medical image detection

Medical image detection can assist doctors in accurately analyzing lesion areas, greatly improving the accuracy of medical diagnosis and reducing doctors' manual workload. Currently, the most abundant collection of open medical image datasets can be found at https://grand-challenge.org/challenges/.

Aryan Mobiny et al. applied the capsule network [119] to lung cancer detection, which improves detection speed. Li et al. introduced an attention mechanism into DCNNs and applied it to glaucoma detection [131]. Kawahara et al. proposed a multi-stream CNN to classify skin lesions, in which each stream works on images of a different resolution [106]. Kong et al. combined an LSTM-RNN with a CNN to detect end-diastolic and end-systolic frames in cardiac MRI (Magnetic Resonance Imaging) sequences [110]. Hwang et al. proposed a weakly supervised deep learning method to detect nodules in chest X-rays and lesions in mammograms [95].

8 Conclusions

This review paper provides a comprehensive and detailed review of deep learning-based object detection methods from three aspects: backbone networks, detection architectures, and loss functions. It also presents a detailed analysis of the complex problems involved; starting from these problems, we sum up the outstanding solutions that have emerged in recent years. We introduce the evaluation metrics and datasets and summarize the important applications of object detection. It is worth mentioning that we also summarize the current open-source platforms for object detection, with the aim of helping researchers choose an appropriate platform. Finally, we give some potential development directions for object detection.

From the development tendency of backbone networks, we can see that the direction is to increase the number of network layers and the number of neurons in each layer. The pursuit of performance makes networks larger and larger, leading to explosive growth in computation; at the same time, vanishing or exploding gradients may occur, which is obviously unacceptable. To solve this problem, researchers usually adopt two methods: the first is to use appropriate rules to remove some neurons in intermediate layers, reducing the number of parameters and, to some extent, avoiding over-fitting [195, 209]; the second is convolution kernel decomposition, which replaces larger convolution kernels with smaller ones. This adjusts the network depth and enhances the representational power of the network while keeping the receptive field constant and reducing the number of parameters [34, 212,213,214]. It is worth mentioning that ResNet [82] effectively solves the degradation problem by means of residual learning. Based on the differences between object detection and image classification, Zeming Li et al. proposed DetNet for object detection tasks, which makes the deep network more specialized; it overcomes some deficiencies of non-dedicated networks in object detection and significantly improves performance. InceptionV2 [97] and InceptionV3 [214] also provide new ideas from the perspectives of batch normalization and network factorization. In actual application scenarios, resource constraints have given rise to lightweight networks, and an important issue is how to reduce the computational load without losing performance. Overall, the backbone network of an object detection architecture accounts for about 90% of its computation and storage, so the design of efficient and lightweight backbone networks will be a key research direction in the future.

In this review paper, we summarize the one-stage and two-stage detectors at the level of network architecture. The two-stage architectures first generate object region proposals and then perform regression and classification on them, as in Faster RCNN [191]; they are better in accuracy than one-stage detectors such as YOLO [187], but their speed is significantly lower. In practice, accuracy and speed need to be balanced. Generally, once accuracy reaches a certain level, the cost of improving it further becomes unacceptable, and in some fields real-time requirements make the speed of the object detection framework more important than its accuracy.

The loss function plays an important role in deep learning and machine learning. In recent works, many innovative methods have been proposed around the loss function, including stage-wise loss functions [66, 81] and multi-task loss functions [65, 191]. Multi-task loss functions enable end-to-end training and achieve higher accuracy than pipelined multi-stage loss functions. It is worth mentioning that the Focal Loss [142] and the Repulsion Loss [242] are used to address the complex problems of class imbalance and dense occlusion, respectively. Therefore, although classic loss functions can usually be reused, constructing a more appropriate loss function makes the learning framework more robust.

At present, the development of object detection, and indeed of the whole field of artificial intelligence, is mainly data-driven. Thanks to the continuous efforts of scholars, open-source datasets have become very rich, and some standard datasets have become benchmarks for evaluating model performance in competitions. However, in an era of expanding application scenarios and exploding data volumes, datasets with good labels are still relatively scarce, and manually labeling image training datasets costs enormous labor and time. Therefore, how to label data more effectively and how to learn effectively from fewer samples are key issues in the field; current limited-sample learning is far from meeting these needs, and enabling models to detect objects quickly and accurately from little data is an important research direction. However, the performance of unsupervised learning [18, 48, 78, 154, 183] is still unsatisfactory at present, so how to use small and medium sized datasets for supervised learning to achieve high-precision results will be a hot topic. In addition, it is feasible to reduce the supervision cost by using weak labels of image data, pre-training on large detection datasets [79, 206, 255], and then carrying out multi-category supervised learning on this basis. Researchers have made many constructive achievements in recent years: (1) Learning from small samples with data augmentation was first studied by Fei-Fei Li et al. under the name one-shot learning [55]. To solve this problem, researchers turned their attention to Siamese networks [9] again; the Siamese network [12] is a simple and powerful architecture. In face detection and recognition, Gregory Koch et al. combined one-shot learning with Siamese neural networks and converted the one-shot problem into a verification problem in image recognition by learning the feature similarity between image pairs [109]. In addition, Hao Chen et al. proposed a low-shot transfer detector (LSTD) [26]. They built an effective target-domain detector by using rich source-domain knowledge and few training samples; the design of LSTD unifies the advantages of SSD [150] and RCNN [66] in one deep framework. (2) Using the semantic information of known categories to detect unknown categories is known as zero-shot learning [184]. This problem is obviously more challenging than one-shot learning, since it requires the machine to have a certain reasoning ability, and several problems arise in this domain. The first is the domain shift problem: since the feature dimension of the samples is often larger than the semantic dimension, the learned mapping often loses useful information; to preserve more information, the samples are usually mapped to the semantic space and then reconstructed. The second is the hubness problem: zero-shot learning relies on KNN, so the hubness problem may appear, and it can be aggravated if ridge regression is used. To alleviate the hubness problem, an additional mapping from the semantic space back to the feature space can be learned [200]; this method reaches 76.5% accuracy on the AWA dataset [184]. Generative models, such as auto-encoders and GANs [67], can also be used: they transform the problem into a supervised classification problem and avoid the KNN operation. If there is a semantic gap, the manifold formed by the samples in feature space is inconsistent with the manifold formed by the categories in semantic space.
The two manifolds can then be adjusted to be consistent, and the mapping between them can be learned [134]. (3) Incremental learning [21] continues learning from new data on the basis of pre-trained models while ensuring that the previously learned detection ability is not lost. The major problem in incremental learning is therefore catastrophic forgetting [201], i.e., balancing the relationship between new knowledge and old knowledge.

In the future, more efficient detection frameworks can improve the real-time performance and accuracy of embedded detection applications, which will make object detection more widely applicable. At present, contextual information is used only to a limited extent, so it is necessary to consider how to combine such information organically with more refined object instance segmentation. Research on object detection in 3D and depth (RGB-D) images is still scarce and requires more attention. We are optimistic about the development of this field. Finally, we present several promising directions based on existing object detection technologies.

Object detection of high resolution images and videos

Current object detection methods are mainly designed for small and medium sized images. Mingfei Gao et al. introduced a general framework [60] to reduce the computational cost of object detection while significantly improving the detection accuracy for objects of different scales in high-resolution images. In this architecture, the R-net uses coarse detection results to predict the potential accuracy gain of analyzing a region at higher resolution, and the Q-net then sequentially selects regions to zoom in on. Experiments on the Caltech Pedestrians dataset [45, 46] show that their approach reduces pixel processing by more than 50% while maintaining high detection accuracy; the results are also verified on the high-resolution YFCC100M [167] testing dataset.

AR and VR scene detection

At present, AR and VR technologies are widely used in the game, film, and military industries. Object detection and scene detection in these settings are also very interesting directions [68, 164, 211]. For example, human vision and imaging devices can be overlaid in augmented reality scenarios, and combining object detection with augmented reality can enable human-in-the-loop detection and control modes.

Depth image (RGB-D) detection

At present, RGB-D images add depth information to color images and are widely used in autonomous driving and robot vision. In the field of autonomous driving, object detection on RGB-D data [30, 31, 72, 118, 210] is very important: accurate estimation of the location and orientation of different objects directly affects scene understanding, motion-state adjustment, and path planning. Recently, Martin Simon et al. proposed Complex-YOLO [205], a network that detects 3D objects in point clouds in real time. The network extends YOLOv2 [188] with a specific complex regression strategy to estimate multi-class 3D bounding boxes in Cartesian space.
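The essence of this complex regression strategy is to regress the orientation of each 3D box as two values interpreted as the imaginary and real parts of a complex number, so the yaw angle is recovered without an ambiguous direct angle regression. A minimal sketch of decoding such a prediction, assuming PyTorch and a hypothetical tensor layout, is shown below.

```python
import torch

def decode_yaw(t_im: torch.Tensor, t_re: torch.Tensor) -> torch.Tensor:
    """Recover the yaw angle from predicted imaginary and real parts.
    atan2 keeps the angle in (-pi, pi] and avoids the discontinuity of a
    direct single-value angle regression."""
    return torch.atan2(t_im, t_re)

# Hypothetical predictions for 4 boxes.
t_im = torch.tensor([0.0, 1.0, 0.0, -1.0])
t_re = torch.tensor([1.0, 0.0, -1.0, 0.0])
print(decode_yaw(t_im, t_re))  # 0, pi/2, pi, -pi/2
```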

Video stream object detection

Video stream object detection [85, 102, 103, 104] is mainly applied to mobile devices and security monitoring devices. It must consider not only the information in each frame, but also the relationships between frames. Due to limited computing resources, a reasonable network architecture must be designed to meet the accuracy, speed, and storage requirements of real-time video stream object detection. In practical application scenarios, it is also necessary to deal with redundant features between adjacent frames, frame blur or jitter, and crowd occlusion. As a result, it is difficult for current object detection algorithms to obtain good results on video streams. To address these problems, Evan Shelhamer et al. proposed reusing the features of previous frames and feeding them into the cheaper parts of the network to compute the final features [198]. Xizhou Zhu et al. achieved good results by using optical flow to improve the speed and accuracy of video recognition [292, 293]. Based on this improvement, they further proposed a lightweight network structure suitable for mobile video object detection, in which a lightweight object detector is applied to sparse key frames and the entire network can be trained end to end; the system achieves 60.2% mAP on a mobile phone at 25.6 fps [291]. For cases where the object appearance deteriorates in some video frames, a typical solution is to enhance the features of each frame by aggregating adjacent frames. However, due to object motion and viewpoint changes, the object features usually cannot be spatially aligned across frames. To meet this challenge, Wang et al. proposed an end-to-end fully motion-aware network (MANet) [238], which calibrates object features at both the pixel level and the instance level in a unified framework and obtains good results on the ImageNet VID dataset.
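The flow-based acceleration in [292, 293] rests on a simple operation: instead of recomputing deep features for every frame, the key-frame feature map is warped to the current frame with an estimated optical-flow field. A minimal sketch of such bilinear warping is given below, assuming a recent PyTorch version; the flow field itself is taken as given and the tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def warp_features(key_features: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a key-frame feature map (N, C, H, W) to the current frame using a
    flow field (N, 2, H, W) given in pixels, via bilinear sampling."""
    n, _, h, w = key_features.shape
    # Build a base sampling grid in pixel coordinates, then shift it by the flow.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)      # (1, 2, H, W)
    coords = base + flow
    # Normalize to the [-1, 1] range expected by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)              # (N, H, W, 2)
    return F.grid_sample(key_features, grid, align_corners=True)

# Hypothetical usage: zero flow returns the key-frame features unchanged.
feat = torch.randn(1, 8, 16, 16)
flow = torch.zeros(1, 2, 16, 16)
warped = warp_features(feat, flow)
```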

Contextual information

In real-world scenarios, the human visual system recognizes an object not only from the properties of the object itself, but also from contextual information, which includes background information, environmental information, and the relationships between objects. Such information can improve the robustness of detectors for multi-scale objects, occluded objects, and blurred objects. In recent years, some research has made substantial progress in this area [28, 29, 44, 171, 182, 294]. It is worth noting that introducing Long Short-Term Memory networks [86] and Recurrent Neural Networks [69, 122] into the processing of contextual information can help object detection. Representative network architectures include ION [7], ACCNN [129], Recurrent CNN [138], and GBDNet [269]. This line of research is of great significance for object detection and is likely to become a mainstream direction in the future.
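One straightforward way to inject contextual information, in the spirit of (but much simpler than) architectures such as ION [7] and ACCNN [129], is to concatenate each region's features with a pooled whole-image context vector before classification. The following is a hypothetical PyTorch head sketching this idea; the dimensions and class names are assumptions, not the designs of the cited papers.

```python
import torch
import torch.nn as nn

class ContextAwareHead(nn.Module):
    """Classify each region from its own features concatenated with a
    pooled whole-image context vector."""
    def __init__(self, roi_dim: int, ctx_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(roi_dim + ctx_dim, num_classes)

    def forward(self, roi_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # roi_feats: (num_rois, roi_dim); image_feats: (C, H, W)
        context = image_feats.mean(dim=(1, 2))             # global average pooling
        context = context.expand(roi_feats.size(0), -1)    # share context across ROIs
        return self.classifier(torch.cat([roi_feats, context], dim=1))

# Hypothetical usage: 10 ROIs, 256-d ROI features, 128-channel image feature map.
head = ContextAwareHead(roi_dim=256, ctx_dim=128, num_classes=21)
logits = head(torch.randn(10, 256), torch.randn(128, 14, 14))
```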

In this review, we have summarized the DCNNs-based object detection algorithms that emerged after deep learning became popular, and the traditional object detection algorithms are also mentioned. By summarizing existing algorithms, technologies, and architectures, this paper reviews the current state of object detection and offers a prospect for its future. We believe that object detection will make great progress with the continuous development of basic theory and related hardware in the future.