1 Introduction

Computer vision technology is used extensively in industry, automation, consumer markets, medicine, entertainment, defense, and surveillance, to mention a few. Widespread applications such as scene understanding, video surveillance, robotics, and self-driving cars have triggered extensive research in computer vision during the most recent decade. Visual recognition systems, encompassing image classification, localization, and detection, have gained great research momentum owing to significant developments in neural networks, especially deep learning, and have attained remarkable performance [1]. The last 4 years have witnessed a great improvement in the performance of computer vision tasks, especially those using deep convolutional neural networks (DCNNs) [2].

Several factors are responsible for the proliferation of DCNNs, viz. (i) the availability of large, fully annotated training datasets, (ii) powerful GPUs that can train large-scale neural network models in parallel, and (iii) state-of-the-art training strategies and regularization methods. Object detection is one of the crucial challenges in computer vision, and it has been handled efficiently by DCNNs [3], restricted Boltzmann machines (RBMs) [4], autoencoders [5], and sparse coding representations [6]. This paper aims to highlight state-of-the-art approaches for object detection based on DCNNs.

The paper is organized as follows. Section 2 introduces object detection. The fundamental building blocks of DCNNs from the perspective of object detection are described in Sect. 3. The state-of-the-art DCNN-based approaches for object detection are discussed in Sect. 4. The paper is concluded in Sect. 5.

2 Object Detection

An image or video contains instances of one or more classes of real-world objects and abstract things such as humans, faces, buildings, and scenes. The aim of object detection is to determine whether an instance of a given class is present in the image and to estimate the location of every instance of every class by outputting a bounding box that overlaps the object instance, along with the obtained detection accuracy, irrespective of partial occlusion, pose, scale, lighting conditions, location, and camera position. It is generally carried out using feature extraction and learning algorithms. Object detection is a preliminary step for various computer vision tasks such as object recognition and scene understanding from images, and activity recognition and anomalous behavior detection from videos. Detecting one or more instances of a single class of object in an image or video is termed single-class object detection. Multi-class object detection deals with detecting instances of more than one class of objects in an image or video. The following challenges need to be handled while detecting objects in images.

  • Image-based challenges

Many computer vision applications require multiple objects to be detected in an image. Object occlusion (partial or full), noise, and illumination changes make detection a challenging task. Camouflage is a challenge in which the object of interest resembles the background scene. This challenge needs to be handled in surveillance applications. It is also necessary to detect objects under multiple views (lateral, front), poses, and resolutions. Object detection should be invariant to scale, lighting conditions, color, viewpoint, and occlusion.

  • Processing challenges

Detecting objects at large scale without losing accuracy is a primary requirement of object detection tasks. Some applications require robust and efficient detection approaches, whereas others require real-time object detection. Specialized hardware such as GPUs, together with deep learning techniques, makes it possible to train multiple neural networks in a parallel and distributed way, which helps to detect objects in real time.
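
The bounding-box output described at the start of this section is commonly scored by how well a predicted box overlaps the annotated one. The following is a minimal sketch of one widely used overlap measure, intersection over union (IoU); the measure is not named in this paper, and the (x_min, y_min, x_max, y_max) box format and the function name `iou` are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are assumed to be (x_min, y_min, x_max, y_max) tuples.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    # Union = sum of both areas minus the doubly counted overlap.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


predicted = (10, 10, 50, 50)
ground_truth = (30, 30, 70, 70)
print(round(iou(predicted, ground_truth), 3))  # 0.143
```

A value of 1.0 means the two boxes coincide and 0.0 means they do not overlap at all.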

3 Building Blocks of Convolutional Neural Network

DCNNs were first used for image classification. After achieving state-of-the-art performance in image classification, DCNNs have been used for more complex tasks such as object detection in images and videos.

3.1 CNN Architecture

A convolutional neural network (CNN) is a kind of feedforward neural network whose connectivity pattern between neurons is inspired by the organization of the animal visual cortex. Figure 1 shows the architecture of a CNN. Being hierarchical in nature, a CNN consists of convolution layers with an activation function such as the rectified linear unit (ReLU), followed by pooling layers and, finally, fully connected layers. The CONV-ReLU-POOL pattern is repeated so that the image is progressively reduced spatially. The neurons are arranged in three dimensions: width, height, and depth. In the input layer, depth corresponds to the color channels of the image. The image to be recognized is fed as an input of size [width × height × depth]. The image is convolved with a filter in order to extract a specific feature. This is done by taking the dot product of the filter weights and the region of the image under the filter, which gives the output of one neuron in the convolution layer. The process is repeated for the desired number of filters, and one feature map is created per filter. ReLU applies the activation function element-wise, making the network nonlinear. Generally, a convolution layer is followed by a pooling layer, which down-samples the feature map along its spatial dimensions and thereby reduces the complexity of the network and the number of parameters to be learned. The last layer in the CNN is a fully connected layer, which gives the output of the image recognition task in the form of scores for the object classes. The highest score indicates the presence of the corresponding object class in the image.

Fig. 1 Architecture of convolutional neural network
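
The convolution and ReLU steps described in this subsection can be sketched in a few lines. The example below is only an illustration under simplifying assumptions: a single-channel image (depth 1), one hand-picked 2 × 2 filter, no padding, and no bias term. The function names `conv2d_single` and `relu` are hypothetical and do not come from the surveyed works.

```python
def conv2d_single(image, kernel, stride=1):
    """Slide one square filter over one 2D channel without padding.

    Each output value is the dot product of the filter weights and the
    image region under the filter (cross-correlation, as is usual in CNNs).
    """
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for di in range(k):
                for dj in range(k):
                    total += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


def relu(feature_map):
    """Apply max(0, x) element-wise."""
    return [[max(0.0, value) for value in row] for row in feature_map]


image = [
    [1, 2, 0, 1],
    [0, 1, 3, 1],
    [2, 1, 0, 0],
    [1, 0, 1, 2],
]
vertical_edge_filter = [
    [1, -1],
    [1, -1],
]

conv_out = conv2d_single(image, vertical_edge_filter)
print(conv_out)        # [[-2.0, 0.0, 1.0], [0.0, -1.0, 2.0], [2.0, 0.0, -1.0]]
print(relu(conv_out))  # [[0.0, 0.0, 1.0], [0.0, 0.0, 2.0], [2.0, 0.0, 0.0]]
```

With a 4 × 4 input, a 2 × 2 filter, and a stride of 1, the feature map is 3 × 3, which shows how the spatial size shrinks even before pooling. A real layer would repeat this for every filter and sum over all input channels.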

3.2 Pooling Layers

Adding a pooling layer between consecutive convolutional layers reduces the number of parameters and the complexity of the network, and thus helps to control overfitting. A pooling layer provides a degree of translation invariance. It takes activation maps as input and operates on every patch of the selected map. There are various kinds of pooling layers.

  • Max pooling: Pooling is applied independently to each depth slice of the input. Figure 2 shows the working of max pooling, where a filter of size 2 × 2 is applied over the patches of an activation map with a stride of 2. The maximum of the 4 numbers in each patch is chosen and stored in the output matrix, yielding a spatially down-sampled map. Another pooling approach is average pooling, which takes the average of the neighborhood values. Max pooling gives better results than average pooling [7].

    Fig. 2 Max pooling over a single depth slice

  • Deformation constrained pooling (def-pooling): Def-pooling is applied in order to model the deformation of object parts under geometric constraints with an associated penalty [8]. It can learn the deformable properties of object parts and share visual patterns at any level of information abstraction and composition.

  • Fully connected layers: These layers perform high-level reasoning in a CNN. Each neuron is connected to all the activations in the previous layer, and the 2D feature maps are converted into a one-dimensional feature vector. Fully connected layers have a large number of parameters and therefore require substantial computational resources.
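
The 2 × 2, stride-2 max pooling described above, and average pooling for comparison, can be sketched as follows. The function name `pool2d`, its parameters, and the sample activation values are assumptions chosen for illustration. They are not the values shown in Fig. 2.

```python
def pool2d(activation_map, size=2, stride=2, mode="max"):
    """Down-sample one depth slice with max or average pooling."""
    pooled = []
    for i in range(0, len(activation_map) - size + 1, stride):
        row = []
        for j in range(0, len(activation_map[0]) - size + 1, stride):
            # Collect the size x size patch whose top-left corner is (i, j).
            patch = [
                activation_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            if mode == "max":
                row.append(max(patch))
            else:
                row.append(sum(patch) / len(patch))
        pooled.append(row)
    return pooled


activation_map = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

print(pool2d(activation_map, mode="max"))      # [[6, 8], [3, 4]]
print(pool2d(activation_map, mode="average"))  # [[3.75, 5.25], [2.0, 2.0]]
```

Either way, the 4 × 4 slice becomes 2 × 2, so the following layers see a quarter of the activations. In a full network the same operation is applied to every depth slice separately.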

3.3 Regularization

Regularization is defined as “any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error” [9]. Owing to the large number of trainable parameters and the ability of deep architectures to learn many levels of abstraction, there is a risk that the model performs poorly on test data, i.e., the model fits the training data so closely that it generalizes poorly to new data. This is known as overfitting. Regularization strategies are required in order to curb overfitting.
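
As one concrete illustration, a common regularization strategy is an L2 penalty on the weights (weight decay). This paper does not single out any particular strategy, so the choice of L2, the function name, and the hyperparameter values below are assumptions made purely for the sketch.

```python
def sgd_step_with_weight_decay(weights, gradients, learning_rate=0.1, weight_decay=0.01):
    """One gradient-descent step on: data_loss + (weight_decay / 2) * sum(w ** 2).

    `gradients` holds the gradient of the data loss only; the penalty adds
    weight_decay * w to each component, which pulls weights toward zero.
    """
    return [
        w - learning_rate * (g + weight_decay * w)
        for w, g in zip(weights, gradients)
    ]


weights = [1.0, -2.0]
zero_gradients = [0.0, 0.0]  # no signal from the data: only the penalty acts
print(sgd_step_with_weight_decay(weights, zero_gradients, weight_decay=0.5))
```

Even with a zero data gradient, each weight moves toward zero, which discourages the large weights that often accompany overfitting.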

4 State-of-the-Art Object Detection Approaches Using DCNN

Table 1 compares state-of-the-art DCNN-based approaches for object detection. DCNNs have been used extensively for image classification and have achieved state-of-the-art results [1].

Table 1 Comparative study of state-of-the-art approaches for object detection

DCNNs were first used for object detection by Szegedy et al. [10]. The authors formulated object detection as a regression problem on object bounding box masks and defined object detection as the estimation of the class and the localization of the object in the image. Their approach, known as DetectorNet, replaces the last layer of the AlexNet [1] architecture with a regression layer in order to localize objects using DNN-based object mask regression. To detect multiple instances of the same object precisely, DetectorNet applies multi-scale box inference followed by a refinement procedure. However, this approach does not handle multiple classes of objects, since it uses only a single mask regression.

To demystify the features extracted by a CNN model and to diagnose the errors associated with the model, Zeiler and Fergus (ZFNet) [3] put forth a novel visualization technique based on a multilayered deconvolutional network (deconvnet). The deconvnet projects feature activations back into the pixel space of the image.

Deformable DCNN (DeepID-Net) encompasses feature representation learning, part deformation learning, context modeling, model averaging, and bounding box location refinement [11] and uses a cascaded CNN for object detection. It models objects as deformable parts. Regions with CNN features (R-CNN) [12] takes an input image and extracts ‘n’ bottom-up region proposals using segmentation. Once the region proposals are obtained, it computes CNN features for each proposal and classifies the proposals using class-specific SVMs. This method acts as a baseline model for a large number of object detection approaches. Fast R-CNN [13] extends R-CNN to speed up the training and testing phases and to improve detection accuracy. Fast R-CNN still depends on a separate region proposal step for each image, which incurs a large computational cost. This drawback has been removed in Faster R-CNN by Ren et al. [14]. The authors put forth the region proposal network (RPN), a fully convolutional network that shares full-image convolutional features with the detection network and simultaneously predicts object bounds and objectness scores at each position, which makes region proposals nearly cost-free.

Faster R-CNN merges the RPN with Fast R-CNN to create a unified network. It works on the principle of an “attention mechanism” in which the RPN tells the network where to look for object bounds. In the paper by Markus et al. [15], multiple CNN models are used to detect objects at multiple scales. The papers by Lee et al. [16] and Cheng et al. [17] are based on region proposals. Lee et al. [16] handled intra-class and interclass variability among objects using multi-scale CNN templates and non-maximum suppression. In RIFD-CNN by Cheng et al. [17], on the other hand, object rotation and intra-class and interclass variability are handled by introducing a rotation-invariant layer and a Fisher discriminative layer in the network, respectively. For detecting and localizing small objects, contextual information based on a multi-scale CNN model is used in [18], which handles variation in object scale. The aforementioned approaches each focus mainly on a specific challenge in object detection, viz. multi-scale modeling, fast and real-time detection, detection accuracy and localization, or interclass and intra-class variation of objects.
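
Non-maximum suppression, mentioned above in connection with Lee et al. [16], removes duplicate detections of the same object. The following is a minimal sketch of the common greedy variant, not the exact procedure of [16]. The box format (x_min, y_min, x_max, y_max), the overlap threshold of 0.5, and the function names are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any
        # higher-scoring box that has already been kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The second box overlaps the first almost entirely and has a lower score, so it is suppressed, while the distant third box survives. In multi-class detection this step is typically run once per class.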

It is important to combine these challenges and address them with a unified object detection framework that can detect objects in different complex scenarios and thereby enhance the usability of such object detection systems.

5 Conclusion

This paper compares some of the noteworthy approaches to object detection based on DCNNs. DCNN-based approaches are found to be suitable for images and are also applicable to detecting moving objects in video [19]. The need of the hour is to develop an object detection model that generalizes across different application scenarios such as face recognition, emotion detection, and abandoned (suspicious) object detection. Transfer learning for training deep networks would help to cope with this issue [20].

The efficacy of object detection frameworks depends mainly on the learning mode, the method of processing the images (parallel programming), and the platform (CPU, GPU). Continuous change in the scene implies a change in the behavior of the objects to be detected. Therefore, such systems must continuously learn the many features of objects and detect them despite changes in their orientation, view, and form. In addition, real-time detection of objects [21] helps in taking proactive measures or raising alarms for effectively monitoring and controlling public and private places that require utmost security.

Object detection is a very promising area that can be applied in computer vision and robotics systems, drone-camera-based surveillance, etc. It is extremely useful in places such as deep mines and in expeditions exploring the deep ocean floor, where human presence is not feasible.