1 Introduction

Deep learning (DL) has enabled computer vision systems to learn image features effectively. Most object detectors use a deep neural network as a backbone to extract features from images, together with a detection network that detects objects in images or videos. An object detection method locates objects of particular categories in images or videos, and the topic has attracted considerable research attention over the last decade. The technology has been applied in self-driving cars, face detection, activity recognition, pedestrian detection, medical imaging, robotics, object counting, and crop monitoring. Recent growth in computational power has contributed significantly to the development of object detection techniques [1,2,3,4,5,6,7,8].

Several benchmark datasets, for instance KITTI, Caltech, MS COCO, PASCAL VOC, and Open Images V5, have played an important role in improving object detection. The organizations that maintain them publish the images and videos together with instructions for their use, so anyone can download these datasets and conduct experiments. At present, detectors based on deep neural networks can be divided into two classes:

  • Two-stage detector

  • One-stage detector.

R-CNN [1] and its variants are examples of two-stage detectors, whereas YOLO [2] and its variants are one-stage detectors. Two-stage detectors are generally more accurate in localization and classification, whereas one-stage detectors are faster and better suited to real-time detection. The stages of a two-stage detector can be identified explicitly; for instance, in Faster R-CNN [4], the first stage is the region proposal network (RPN), which proposes candidate bounding boxes, and in the second stage features are extracted from these boxes with the RoI pooling operation. The architecture of a two-stage detector is shown in Fig. 1. One-stage detectors, on the other hand, directly predict bounding boxes and the corresponding class label of each box from the input image. The architecture of a one-stage detector is shown in Fig. 2.

Fig. 1 Two-stage detector

Fig. 2 One-stage detector

The main objective of this survey is to provide a comprehensive understanding of DL-based two-stage object detection. The authors have reviewed numerous papers and their contributions to object detection. Although many surveys cover object detection more broadly, the literature lacks a paper focusing on recent developments and on the state-of-the-art methods that have achieved great success in two-stage object detection. Furthermore, the authors summarize the convolutional neural network (CNN) architectures that serve as backbone feature extractors in the detection task, and describe the most popular two-stage detectors. The authors also summarize popular benchmark datasets for object detection and evaluation metrics.

Section 2 presents the problem definition. Section 3 discusses popular backbone architectures for object detection. Section 4 describes two-stage detection methods in detail. Section 5 summarizes applications of object detection. Finally, the conclusion is given in Sect. 6.

2 Problem Definition

Reliance on handcrafted features was one of the main obstacles to achieving good accuracy in computer vision tasks. With the rise of DL methods, accuracy on vision problems has improved significantly. One of the major problems is object classification, which refers to assigning the objects present in an image to their respective classes. Object detection, also described as object category detection, is a more complex task than classification, as it involves predicting both the class of each object and its precise location in the input image.
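To make the difference between the two tasks concrete, the following minimal sketch represents a single detection as a class label, a confidence score, and a bounding box, and computes intersection over union (IoU), a commonly used measure of how closely a predicted location matches a reference box. The names `Box`, `Detection`, and `iou` and the corner-coordinate convention are illustrative assumptions, not definitions taken from this survey.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box given by corner coordinates (assumed convention)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # A box with non-positive width or height has zero area.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


@dataclass(frozen=True)
class Detection:
    """One detector output: what the object is and where it is."""
    label: str
    score: float
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes, a value in [0, 1]."""
    overlap = Box(max(a.x1, b.x1), max(a.y1, b.y1),
                  min(a.x2, b.x2), min(a.y2, b.y2)).area()
    union = a.area() + b.area() - overlap
    return overlap / union if union > 0 else 0.0
```

A classifier would return only the `label` field for the whole image, whereas a detector returns one `Detection` per object, so its output can be compared against annotated boxes with `iou`.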

3 Popular Backbone Architecture

The primary requirement for good object detection is a good feature representation. If the learned features are sufficiently discriminative, high detection accuracy can be achieved. The backbone deep CNN (DCNN) architectures most widely used in object detection are AlexNet, VGGNet, ResNet, InceptionNet, and ResNeXt.

AlexNet, proposed by Krizhevsky et al. in 2012 [3], was the first of these architectures. It learned good representations from input images with only eight layers and improved accuracy by a large margin in the ILSVRC classification challenge [4]. VGG-16, with 16 layers, built on AlexNet. When the number of layers was increased further to 20, however, the network showed a dip in accuracy. In [5], the concept of a skip connection was introduced and ResNet was proposed, which reduced the optimization difficulties of deep networks. This network can be extended to 100 layers with relatively few parameters compared with VGGNet and AlexNet. Several variants were proposed later.
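The skip connection can be illustrated with a small sketch. It assumes a toy block that operates on plain Python vectors with two square weight matrices; real ResNet blocks use convolutions and normalization, and the names `linear`, `relu`, and `residual_block` are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def relu(v: Vector) -> Vector:
    return [max(0.0, x) for x in v]


def linear(weights: Matrix, v: Vector) -> Vector:
    """Multiply a weight matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(v: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """Return relu(F(v) + v), where F(v) = w2 * relu(w1 * v).

    The "+ v" term is the skip connection: the input bypasses the
    learned layers and is added back to their output.
    """
    f = linear(w2, relu(linear(w1, v)))
    return relu([fi + vi for fi, vi in zip(f, v)])
```

With all weights set to zero the block returns its (non-negative) input unchanged, which gives an intuition for why stacking such blocks is easier to optimize: each block only has to learn a correction to the identity mapping.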

4 Detection Scheme Built on Deep Learning

In a two-stage detector, the first stage generates proposals, that is, regions in which objects may be present. In the second stage, predictions are made for each generated proposal. On benchmark datasets, current two-stage detectors predict object locations more accurately than one-stage detectors.

4.1 R-CNN

R-CNN was one of the first object detectors built on a CNN. After the success of CNNs in classification tasks, Girshick et al. proposed R-CNN for object detection. R-CNN detects objects in three phases:

  (i) Region generation phase

  (ii) Feature extraction phase

  (iii) Prediction phase for classification and regression.

In the first phase, R-CNN uses the selective search algorithm (SRA) to select candidate regions in every input image; the selected regions are known as proposed regions. Selective search proposes about 2000 regions per image in which objects may be present. In the second phase, the selected regions are cropped, resized, and fed into the CNN, which produces a 4096-dimensional feature vector for each region. In the final phase, classification and bounding box prediction are performed. The architecture of R-CNN is shown in Fig. 3.

Fig. 3 R-CNN
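The second phase, in which every proposed region is cropped and resized before it reaches the CNN, can be sketched as follows. This is a simplified illustration under stated assumptions: the image is a grayscale list of rows, the regions are hypothetical stand-ins for selective search output, and nearest-neighbour sampling is used for the resize. The function names are illustrative.

```python
from typing import List, Tuple

Image = List[List[float]]            # grayscale image as rows of pixels
Region = Tuple[int, int, int, int]   # (x1, y1, x2, y2), end-exclusive


def crop_and_resize(image: Image, region: Region, size: int) -> Image:
    """Crop `region` and warp it to size x size (nearest neighbour)."""
    x1, y1, x2, y2 = region
    h, w = y2 - y1, x2 - x1
    return [
        [image[y1 + (i * h) // size][x1 + (j * w) // size]
         for j in range(size)]
        for i in range(size)
    ]


def rcnn_inputs(image: Image, regions: List[Region], size: int) -> List[Image]:
    """Build one fixed-size CNN input per proposed region."""
    return [crop_and_resize(image, r, size) for r in regions]
```

Because `rcnn_inputs` returns one input per region, the CNN has to run once for each of the roughly 2000 proposals in an image.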

R-CNN improved object detection performance over traditional algorithms by a large margin. However, it still has a few flaws:

  (i) Extracting features from the 2000 selected regions with a deep CNN requires a long computation time.

  (ii) Optimization is difficult, as the network is divided into three stages.

  (iii) Inference on test images is slow.

4.2 Fast R-CNN

To address the limitations of SPPNet and R-CNN, Ross Girshick proposed a faster detection algorithm called Fast R-CNN. It is similar to R-CNN, except that the whole image, rather than each generated region, is fed to the CNN to obtain a convolutional feature map. The part of the feature map corresponding to each proposed region then passes through the RoI pooling layer, which produces a feature of fixed size. These fixed-size features are fed to the classification and regression layers to predict class labels and bounding boxes, respectively. Fast R-CNN thus extracts features once from the entire image, whereas R-CNN runs the CNN separately on each of the 2000 regions. This saves considerable time during training and testing. The architecture of Fast R-CNN is shown in Fig. 4.

Fig. 4 Fast R-CNN
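The way the RoI pooling layer turns regions of different sizes into a fixed-size feature can be sketched as below. This is a minimal single-channel version that assumes integer RoI coordinates given in feature-map cells; the bin-boundary rule and the name `roi_max_pool` are illustrative choices, not the exact layer definition.

```python
from typing import List, Tuple

FeatureMap = List[List[float]]
RoI = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in cells, end-exclusive


def roi_max_pool(fmap: FeatureMap, roi: RoI, out_size: int) -> FeatureMap:
    """Max-pool the RoI into an out_size x out_size grid of bins."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_size):
        # Row range of bin i; every bin covers at least one cell.
        ys = y1 + (i * h) // out_size
        ye = max(ys + 1, y1 + ((i + 1) * h) // out_size)
        row = []
        for j in range(out_size):
            xs = x1 + (j * w) // out_size
            xe = max(xs + 1, x1 + ((j + 1) * w) // out_size)
            row.append(max(fmap[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled
```

The feature map is computed once per image, and each region only costs one cheap pooling pass, which is where the saving over R-CNN comes from.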

4.3 Faster R-CNN

Both previous networks relied on the traditional SRA to generate regions, which is slow and captures only low-level features. Faster R-CNN, developed in [4], uses the RPN to generate regions with a CNN: the input image is fed into the CNN, and the RPN proposes regions from the resulting features. Region proposal generation is also faster because the RPN shares a common set of CNN layers with the detection network. After being generated, the regions are reshaped to a fixed size by the RoI pooling layer. The pooled features are then fed to the classification and regression layers for label classification and offset prediction. Faster R-CNN achieved better results than its predecessors on object detection benchmark datasets, for instance MS COCO, PASCAL VOC, and ILSVRC [5]. The stages of the network are outlined in Fig. 5.

Fig. 5 Faster R-CNN
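The offset prediction made by the regression layer can be illustrated with the box parameterization commonly used in the R-CNN family, in which the network predicts four offsets relative to a reference box instead of absolute coordinates. The sketch below assumes boxes in centre format (cx, cy, w, h); the function names are hypothetical.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]      # (cx, cy, w, h)
Offsets = Tuple[float, float, float, float]  # (tx, ty, tw, th)


def decode_box(reference: Box, offsets: Offsets) -> Box:
    """Apply predicted offsets to a reference box."""
    cx, cy, w, h = reference
    tx, ty, tw, th = offsets
    # Centre shifts are scaled by the box size; sizes change by a log factor.
    return (cx + tx * w, cy + ty * h, w * math.exp(tw), h * math.exp(th))


def encode_box(reference: Box, target: Box) -> Offsets:
    """Inverse of decode_box: the offsets a regressor is trained to output."""
    cx, cy, w, h = reference
    gx, gy, gw, gh = target
    return ((gx - cx) / w, (gy - cy) / h, math.log(gw / w), math.log(gh / h))
```

Zero offsets leave the reference box unchanged, so the regressor only has to learn small corrections to each proposal.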

4.4 R-FCN

In R-FCN [5], the fully connected (FC) layers that follow the RoI layer are removed, and nearly all of the heavy computation is moved before the RoI layer. R-FCN generates position-sensitive score maps, which encode position information for each class. A position-sensitive RoI layer is applied to extract features from these score maps. R-FCN then applies simple average voting to the features extracted by the RoI layer to generate a class vector. Finally, the softmax function is applied to this vector to predict the class scores. The architecture of R-FCN is shown in Fig. 6.

Fig. 6 R-FCN
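The pooling, voting, and softmax steps can be sketched as follows. This is a simplified illustration that assumes one score map per class and per bin of a k x k grid, stored as `score_maps[c][i * k + j]`; this storage layout and the function names are assumptions made for the example.

```python
import math
from typing import List, Tuple

Map2D = List[List[float]]
RoI = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in cells, end-exclusive


def bin_mean(m: Map2D, xs: int, xe: int, ys: int, ye: int) -> float:
    cells = [m[y][x] for y in range(ys, ye) for x in range(xs, xe)]
    return sum(cells) / len(cells)


def ps_roi_scores(score_maps: List[List[Map2D]], roi: RoI, k: int) -> List[float]:
    """Position-sensitive pooling followed by average voting per class."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    votes = []
    for class_maps in score_maps:
        total = 0.0
        for i in range(k):
            ys = y1 + (i * h) // k
            ye = max(ys + 1, y1 + ((i + 1) * h) // k)
            for j in range(k):
                xs = x1 + (j * w) // k
                xe = max(xs + 1, x1 + ((j + 1) * w) // k)
                # Bin (i, j) reads only from its own dedicated score map.
                total += bin_mean(class_maps[i * k + j], xs, xe, ys, ye)
        votes.append(total / (k * k))  # simple average voting
    return votes


def softmax(v: List[float]) -> List[float]:
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    s = sum(exps)
    return [e / s for e in exps]
```

A high response contributes to a class only if it appears in the bin its score map is responsible for, which is how position information is retained without per-region FC layers.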

4.5 Mask R-CNN

For pixel-level detection, He et al. [6] developed an instance segmentation algorithm, Mask R-CNN, which can be viewed as an extension of Faster R-CNN. Mask R-CNN uses a two-phase strategy. In the first phase, it uses the RPN to generate regions where objects might be present. In the second phase, it predicts a binary mask for each region based on the feature map. A CNN-based mask-generating branch is used to better capture the relevant areas. Mask R-CNN uses an RoIAlign layer in place of the RoI pooling layer on top of the backbone architecture. Mask R-CNN is simple to implement and achieves better accuracy on the instance segmentation task. Figure 7 shows the architecture of Mask R-CNN.

Fig. 7 Mask R-CNN
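The difference between RoIAlign and RoI pooling can be illustrated with a small sketch: rather than rounding region boundaries to whole feature-map cells, RoIAlign samples the feature map at fractional positions with bilinear interpolation. The version below is a simplification under stated assumptions: it uses a single channel, treats cell centres as integer coordinates, and takes one sample per bin, whereas practical implementations typically average several samples per bin. The function names are illustrative.

```python
import math
from typing import List, Tuple

FeatureMap = List[List[float]]
RoI = Tuple[float, float, float, float]  # (x1, y1, x2, y2), fractional


def bilinear_sample(fmap: FeatureMap, x: float, y: float) -> float:
    """Interpolate fmap at fractional (x, y) from its four neighbours."""
    h, w = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), w - 1.0)   # clamp to the valid range
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * fmap[y0][x0] + ax * fmap[y0][x1]
    bottom = (1 - ax) * fmap[y1][x0] + ax * fmap[y1][x1]
    return (1 - ay) * top + ay * bottom


def roi_align(fmap: FeatureMap, roi: RoI, out_size: int) -> FeatureMap:
    """Sample the centre of each bin; no coordinate is rounded."""
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    return [
        [bilinear_sample(fmap, x1 + (j + 0.5) * bin_w, y1 + (i + 0.5) * bin_h)
         for j in range(out_size)]
        for i in range(out_size)
    ]
```

Avoiding the rounding step keeps the pooled features aligned with the input pixels, which matters more for pixel-level masks than for box-level predictions.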

5 Object Detection Application

Object detection has a wide range of applications in real-world scenarios, spanning from the social to the personal level (Fig. 8).

Fig. 8 Benchmark datasets: examples a and b are from the PASCAL VOC dataset, c and d are from the MS COCO dataset, e and f are from Open Images V5, and g and h are from the Caltech dataset

  • Face Detection: Face detection is a prominent area of computer vision; it involves detecting human faces in images or videos. It has many applications, for instance in security, health care, and advertising.

  • Pedestrian Detection: Several dedicated datasets have been published for pedestrian detection. The EuroCity Persons dataset, for example, contains annotations of pedestrians, cyclists, and other riders in traffic areas.

  • Text Detection: Text detection deals with detecting text areas in images or videos. It has many applications, for example identifying vehicles by reading number plates and assisting visually impaired persons.

6 Conclusion

Over the last few years, with the advancement of DL, object detection has evolved rapidly. In this survey, the authors reviewed the recent literature on object detection, covering two-stage object detection methods and describing backbone architectures. The authors also covered popular benchmark datasets for object detection and evaluation metrics. The authors further attempted to explain the terminology in a systematic manner so that the survey helps readers better comprehend object detection based on deep learning.