
In many respects, technological intervention in human problems has shifted from assisting people to being depended on completely, especially since the rise of artificial intelligence and deep learning. Object detection is one of the tasks gaining prominence in almost every sector. Numerous reviews of the area already exist, so we avoid restating the same topics. Instead, we concentrate on the less frequently discussed attributes of object detection.

The main motivation of our study is to highlight that indirect parameters of object detection can also provide a significant improvement in performance. We also briefly review the predominant methods, including those of the pre-deep learning era. Further, we outline the best-researched applications of object detection from various domains over the decades.

The manuscript is organized as follows. The second section briefly reviews the predominant methods, and the third section analyzes the indirect parameters of object detection. The fourth section outlines applications of object detection, and the last section draws conclusions.

Review on Predominant Methods

Early object detection was based on template matching and part-based object representations [16]. The focus was on particular objects whose spatial layout is roughly rigid (such as faces). Until 1990, recognition was based on the object’s geometric structure [43]. Later, the focus shifted from geometry to statistical classifiers built on feature representations (such as AdaBoost [59], SVM [39] and neural networks). Classifiers built on global handcrafted feature extraction set the stage for subsequent research in the field. The appearance feature representation later shifted from global representation [37] to local representation. Local representations are robust to variations in rotation, occlusion, scale, viewpoint and illumination. Representative methods include SIFT [36], Haar-like features [59], shape contexts [6], local binary patterns (LBP), histogram of oriented gradients (HOG) [12] and region covariance. After local features are extracted, they are combined either through straightforward concatenation or through feature pooling encoders, including bag of visual words [11], spatial pyramid matching and Fisher vectors [42]. Through these methods, local handcrafted feature descriptors gained a reputation for their invariance to geometric transformations.

In the deep learning era, the feature representation of an object is learned automatically by a convolutional neural network (CNN). The convolution layers of a CNN are responsible for feature extraction; the extracted features are then learned in the fully connected layers; and finally the classification layer assigns class-specific labels. A CNN extracts features layer by layer: the initial layers extract elementary features, and the deeper layers extract more robust features. The deep layers combine the features extracted in the initial layers to obtain more discriminative features [54, 57].

Object detectors use a CNN as the backbone for object detection [7]. Predominant methods in the deep learning era include both single- and two-stage detectors [22, 26, 34, 45].

A single-stage detector combines class label prediction and bounding box regression in a single pipeline that involves neither an external nor an internal object proposal. Commonly, it partitions the input image into a coarse grid, and in each cell objects are classified and bounding boxes are adjusted. Representative methods include DetectorNet, OverFeat [48], YOLO [45] and SSD [35], as shown in Table 1. All these methods are alike in that they resolve class labels in each cell; they differ in how they train the bounding box regressor and resolve class labels simultaneously. YOLO and SSD are the two most important single-stage detectors. YOLO [45] assigns a probability to every class in each cell. The class with the highest probability is selected, and the bounding boxes are adjusted with respect to the size of the object. Moreover, classification and bounding box training are carried out in parallel, end to end. Because it is faster than a two-stage detector, a single-stage detector has the upper hand in real-time applications such as detecting pedestrians and other moving objects. SSD combines the advantages of YOLO and a two-stage detector, aiming to be as fast as YOLO and as accurate as a two-stage detector (Faster RCNN). The SSD [34] architecture applies small fixed-size convolutions with stride two in its successive feature layers. Each consecutive layer therefore produces a smaller feature map, and SSD attaches a classifier and a box regressor to each of these layers to detect objects of varying size.

The two-stage detector, on the other hand, adds an object proposal step before resolving the class label and the bounding box regression. An external region proposal is the computational bottleneck of a two-stage detector. However, these methods are preferred when accuracy matters more than speed, because the object proposal searches the image for clues to an object. They are therefore effective at identifying even small objects (nanoparticles, cells); such applications are discussed in “Applications of object detection”. Initially, object proposals were generated externally using an objectness property based on the object’s edges, color, texture and gradient. This approach greatly reduces the search space, but the external object proposal was impractical because it consumed considerable time. Researchers therefore began to integrate object proposals into the deep convolutional neural network (DCNN) pipeline, which increased performance substantially. Representative two-stage detectors include RCNN, Fast RCNN [23] and Faster RCNN, as shown in Table 1. Faster RCNN generates its object proposals within the DCNN pipeline.
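The grid-based decision of a single-stage detector described above can be illustrated with a minimal Python sketch. It is not the YOLO implementation from [45]: the function names, the grid size and the confidence threshold are hypothetical and chosen only for illustration. The sketch maps an object center to its grid cell and, for each cell, keeps the class with the highest probability.

```python
from typing import List, Tuple


def cell_of(cx: float, cy: float, img_w: int, img_h: int, grid: int) -> Tuple[int, int]:
    """Return (row, col) of the grid cell that contains the point (cx, cy)."""
    col = min(int(cx / img_w * grid), grid - 1)  # clamp points on the right edge
    row = min(int(cy / img_h * grid), grid - 1)  # clamp points on the bottom edge
    return row, col


def best_class(probs: List[float]) -> Tuple[int, float]:
    """Return (class index, probability) of the highest-scoring class."""
    idx = max(range(len(probs)), key=lambda i: probs[i])
    return idx, probs[idx]


def decode_grid(cell_probs: List[List[List[float]]],
                threshold: float = 0.5) -> List[Tuple[int, int, int, float]]:
    """For every cell keep the top class if its probability clears the threshold.

    cell_probs[r][c] is the list of class probabilities predicted for cell (r, c).
    The threshold value is an assumption made for this sketch.
    """
    detections = []
    for r, row in enumerate(cell_probs):
        for c, probs in enumerate(row):
            cls, p = best_class(probs)
            if p >= threshold:
                detections.append((r, c, cls, p))
    return detections
```

A real detector would additionally regress box offsets for each retained cell; the sketch covers only the per-cell classification step.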

Table 1 Predominant methods of object detection

To summarize, single- and two-stage detectors such as Faster RCNN, YOLO and SSD are frequently used in various applications, as discussed in “Applications of object detection”. Faster RCNN is more accurate, whereas YOLO is faster. SSD combines aspects of both Faster RCNN and YOLO.

Indirect Parameters of Object Detection

As discussed in “Review on Predominant Methods”, the architecture of an object detector plays a key role in its performance. A two-stage detector generates object proposals before classification and regression, whereas a single-stage detector roughly divides the input image into a coarse grid and omits the proposal step. These distinct architectures yield different results: a two-stage detector attains good accuracy, while a single-stage detector attains good speed. As there are numerous reviews of both detector families, our survey avoids restating the same methods. Instead, we focus on parameters other than architectural design that can contribute to the performance of object detectors. These indirect parameters include:

  • Context

  • Object proposal

  • Data augmentation

  • Localization error

  • Training strategy


Context

Context plays a significant role in object recognition, especially when the extracted features are insufficient for prediction, i.e., when the detection framework encounters occlusion, small objects or low image quality. Modeling context provides additional clues for prediction. For instance, when detecting objects in a kitchen, the likely objects are a chimney, gas stove, vessels, cooker, etc.

The context broadly falls into two categories: (a) global context and (b) local context.

  a. Global context: It models the entire scene. For example, detection in office premises will predict the presence of cubicles, laptops and computer systems. Contextual details are combined with the regular feature representation for the final prediction [18].

  b. Local context: It represents the relationships between objects. An object’s boundary gives additional details about its interaction with other objects. Expanding the object’s boundary and exploiting the surrounding region provides supplementary information, such as what lies above, below, behind, to the right and to the left of the object; these structural constraints give an additional clue for prediction. For example, the proposed object may be a door lock if the object behind it is a door and the proposed object is smaller than a door [18].
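As a rough illustration of the local context idea, the following Python sketch enlarges a bounding box around its center to capture the surrounding region and derives a coarse spatial relation between two boxes. The function names, the scale factor and the box format (x1, y1, x2, y2) in pixels are assumptions made for this example and are not taken from [18].

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def expand_box(box: Box, scale: float, img_w: int, img_h: int) -> Box:
    """Enlarge a box around its center by `scale` and clip it to the image."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = (x2 - x1) * scale / 2
    half_h = (y2 - y1) * scale / 2
    return (max(0.0, cx - half_w), max(0.0, cy - half_h),
            min(float(img_w), cx + half_w), min(float(img_h), cy + half_h))


def relative_position(a: Box, b: Box) -> str:
    """Coarse position of box a relative to box b, based on box centers.

    Image coordinates are assumed, so y grows downward.
    """
    dx = (a[0] + a[2]) / 2 - (b[0] + b[2]) / 2
    dy = (a[1] + a[3]) / 2 - (b[1] + b[3]) / 2
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "below" if dy > 0 else "above"
```

Features pooled from the expanded region, together with such pairwise relations, could then be combined with the object’s own features; how that combination is done depends on the specific framework.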

A DCNN exploits contextual details without explicit modeling, since the CNN architecture enforces a hierarchical feature representation. Nevertheless, dedicated research has explicitly modeled local and global context; representative frameworks include CoupleNet [65], ORN [29], DeepIDNet [30] and ION [5]. In addition to the CNN’s hierarchical feature representation, both single- and two-stage detectors model context implicitly. In particular, the single-stage detector looks at the entire image for detection, thereby modeling global context. In the two-stage detector, the regressor subnetwork refines the object boundary by exploiting the region around that boundary, thereby modeling local context.

Object Proposal

Object proposal is a preprocessing step performed before the actual detection. In the absence of object proposals, the detector must scan windows at different scales and aspect ratios [12, 15, 59, 66], which imposes a heavy computational load and makes the entire process very slow. Object proposal eases the detection framework by selectively providing a few proposals [58] based on objectness properties (edge, texture, color, gradient) [35]. After the rise of DCNNs, selective search became the computational bottleneck of the entire detection framework. It has been shown that a DCNN is highly proficient at locating objects from its convolution layers [46]. This idea later led to proposing objects within the detection framework itself. DCNN-based proposal has a computational advantage over external proposal methods (selective search, MCG and EdgeBoxes [30]) and provides a unified framework that combines proposal, classification and bounding box regression. The first such DCNN-based proposal method was the region proposal network (RPN) [46]; combining the RPN with Fast RCNN produced Faster RCNN [46], a milestone in object detection. Many DCNN-based proposal methods have since followed; representative methods include DeepProposal [19], ZIP [32] and DeNet [56], which further improved the performance of object proposal. A two-stage detector with an RPN has been key to many detection challenges, including Pascal VOC and COCO. Nevertheless, DeepProposal [19], ZIP [32] and DeNet [56] achieve a performance gain over the RPN at the cost of a slight additional computational load.
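To give a feel for the computational load of exhaustive scanning, the following Python sketch counts the windows a sliding-window detector would have to evaluate over several scales and aspect ratios. The function names and all numeric settings (base size, scales, ratios, stride) are hypothetical and serve only as an illustration; they are not taken from the cited works.

```python
from typing import Sequence


def count_windows(img_w: int, img_h: int, win_w: int, win_h: int, stride: int) -> int:
    """Number of window positions of size win_w x win_h at the given stride."""
    if win_w > img_w or win_h > img_h:
        return 0
    nx = (img_w - win_w) // stride + 1
    ny = (img_h - win_h) // stride + 1
    return nx * ny


def exhaustive_search_size(img_w: int, img_h: int, base: int,
                           scales: Sequence[float], ratios: Sequence[float],
                           stride: int) -> int:
    """Total windows over all scale and aspect-ratio combinations.

    A ratio r is interpreted as width / height, with the window area kept
    at (base * scale) ** 2. This convention is an assumption of the sketch.
    """
    total = 0
    for s in scales:
        for r in ratios:
            win_w = int(round(base * s * r ** 0.5))
            win_h = int(round(base * s / r ** 0.5))
            total += count_windows(img_w, img_h, win_w, win_h, stride)
    return total
```

Even with a modest image size and only a few scales and ratios, the count grows quickly, which is the motivation for restricting classification to a small set of proposals.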

Data Augmentation

Data augmentation refers to artificially subjecting the training data to various transformations, such as scaling, cropping, rotating, flipping, distorting and adding noise, while leaving the underlying category unchanged. Augmentation produces more training samples, helps generalization and avoids overfitting [41, 61]. Researchers [14, 24] proposed enlarging datasets by pasting segmented objects into realistic images. Further, Dvornik et al. [13] showed that correctly modeling an object’s local context is key to placing it in the right surroundings [34].
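For detection, an augmentation must transform the bounding boxes together with the pixels. The following Python sketch shows this for a horizontal flip; it is a minimal illustration using nested lists as the image, and the function name and the box convention (x1, y1, x2, y2) with an exclusive right edge are assumptions of the example.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), x2 and y2 exclusive


def hflip(image: Sequence[Sequence[int]], boxes: Sequence[Box]) -> Tuple[List[List[int]], List[Box]]:
    """Flip an image left to right and mirror its boxes accordingly.

    The object category attached to each box is unchanged by the flip.
    """
    width = len(image[0])
    flipped = [list(row[::-1]) for row in image]
    new_boxes = [(width - x2, y1, width - x1, y2) for (x1, y1, x2, y2) in boxes]
    return flipped, new_boxes
```

Applying the flip twice returns the original image and boxes, which is a convenient sanity check for this kind of transformation.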

Localization Error

IOU (intersection over union) is an evaluation metric for localization, and localization quality ultimately affects the whole detection framework. IOU compares the predicted bounding box with the ground truth and is ordinarily expected to be greater than or equal to 0.5. The bounding box regressor optimizes the bounding area, aiming to increase IOU in parallel with classification. Bounding boxes are a coarse estimate, so background pixels are included in a bounding box, which degrades localization performance. Usually, a post-processing step such as non-maximum suppression [8, 28, 34] is applied to remove redundant bounding boxes. However, a well-localized box can be suppressed because of wrong alignment. A few approaches have been developed to minimize localization error; representative methods include MRCNN [20], CRAFT [62] and cascade RCNN [9]. In MRCNN, RCNN is applied several times to adjust the bounding box iteratively. CRAFT [62] and AttractioNet [21] adopt multistage detection to obtain the best proposals, which are then handed over to Fast RCNN. Cai and Vasconcelos proposed cascade RCNN, an extension of multistage RCNN, in which cascaded RCNN stages are trained sequentially, each with an increasing IOU threshold.
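The two operations mentioned above, IOU and non-maximum suppression, can be sketched in a few lines of Python. This is a plain greedy variant written for illustration, assuming boxes in (x1, y1, x2, y2) format; the function names and the default threshold of 0.5 are choices made for this example rather than the exact procedure of any cited method.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy non-maximum suppression; returns indices of the kept boxes.

    Boxes are visited in order of decreasing score, and a box is kept only
    if it does not overlap an already kept box by more than the threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Because the ranking uses only the classification score, a box with a better IOU against the ground truth but a lower score is discarded whenever it overlaps a higher-scoring box.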

Training Strategy

A deep learning detection framework requires massive data to perform well. Moreover, data augmentation is commonly applied during training to alleviate scale variation problems. Training with massive data tends to complicate and overload the process, so effective training and fast convergence are of utmost concern. A few training strategies have proved effective in the literature. Singh and Davis proposed SNIP [8, 50, 51, 52], a training technique that reduces scale variation without shrinking the training data. Singh et al. proposed SNIPER, which efficiently processes only the context regions around the ground truth at the relevant scale instead of processing the entire image pyramid. The mini-batch size plays a key role in fast convergence. Peng et al. proposed MegDet, which enables a large mini-batch size and thereby faster training and rapid convergence. Further, Peng et al. introduced concurrent GPU training that completes COCO training in four hours by processing on 128 GPUs concurrently, with the help of cross-GPU batch normalization and a novel learning rate policy; this approach won the COCO 2017 detection challenge [40].
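The text above does not spell out the learning rate policy used with large mini-batches. As a hedged illustration only, the following Python sketch shows one commonly used heuristic: scale the learning rate linearly with the mini-batch size and ramp it up gradually during a warmup phase. Whether this matches the policy in [40] is an assumption, and the function names and all numbers used with them are hypothetical.

```python
def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Linear scaling heuristic: the learning rate grows with the batch size."""
    return base_lr * batch / base_batch


def lr_at_step(step: int, warmup_steps: int, start_lr: float, target_lr: float) -> float:
    """Linear warmup from start_lr to target_lr, then constant at target_lr."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps


# Illustrative values only: a learning rate tuned for a batch of 16 images,
# rescaled for a batch of 256 and reached after 500 warmup steps.
target = scaled_lr(base_lr=0.02, base_batch=16, batch=256)
schedule = [lr_at_step(s, 500, 0.02, target) for s in (0, 250, 500)]
```

The warmup phase is a design choice intended to avoid instability early in training, when a large learning rate is applied to randomly initialized or freshly fine-tuned weights.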

Comparative Analysis of Indirect Performance Parameters

In sections “Context”, “Object proposal”, “Data augmentation”, “Localization error” and “Training strategy”, we discussed the indirect performance parameters and the corresponding representative methods. To highlight the effectiveness of these parameters and the settings in which they yield a larger gain than general detectors, we analyze them against recent works, as shown in Table 2. The comparison follows the standard evaluation sequence depicted in Fig. 1.

Table 2 Comparative analysis of indirect performance parameters of object detection

Fig. 1 Comparative analysis of indirect performance parameters

We carried out the comparative analysis of the indirect performance parameters for different problem scenarios: insufficient datasets, insufficient feature extraction, small objects, localization error and the training of massive deep learning models. The results in Table 2 show that indirect parameters can contribute significantly to performance when a generic detector loses performance because of a lack of data samples, insufficient feature extraction, class imbalance, etc. Based on the comparative analysis, we suggest a specific parameter for each problem scenario:

  • Data augmentation approach: When training data are lacking.

  • Context modeling: When feature extraction is insufficient or the image quality is poor.

  • Object proposal: When detecting very small or tiny objects.

  • Effective localization methods: When there is a large class imbalance among the classes in the dataset.

  • Training strategy: When training on a huge volume of data.

Applications of Object Detection

Object detection has been widely used in numerous applications, especially in the medical, military, security, anomaly detection, and science and engineering fields, as shown in Fig. 2.

Fig. 2 Applications of an object detector

Medical Field

Brain Tumor

Manual segmentation of brain tumors is a laborious and time-consuming task for radiologists. Deep learning has emerged as a feasible alternative for applications in medical imaging. Such models learn discriminative features automatically: a neural network can learn the essential features of a brain image in order to classify and segment tumors. This approach outperforms manual segmentation and classical machine learning approaches in terms of reducing false positives. Among deep learning-based methods, CNNs have provided the best performance for brain tumor segmentation [44].

Radiolucent Lesions

This application concerns the identification and segmentation of mandibular radiolucent lesions on panoramic radiographs. It targets five radiolucent lesions that occur regularly in the mandible: radicular cysts, dentigerous cysts, ameloblastomas, odontogenic keratocysts and simple bone cysts. A deep learning approach has shown a high standard of detection and classification performance for radiolucent lesions of the mandible [3].

Cell Biology

Segmenting cells from blood or other tissue is challenging because of variability in morphology, color intensity and cell size. Nevertheless, deep learning-based vision has displayed outstanding accuracy in classifying and detecting B cells and T cells separated by a microfluidic chip [55].

Security Field

Luggage Scanner

Across the international travel network, increased traveller throughput and expanded border security (e.g., postal, shipping and freight) demand timely automated image identification. Convolutional neural networks (CNNs), a leading approach to modern object detection problems, are also applied to X-ray baggage images to identify potential threat objects (guns, shuriken, razor blades and knives). The research results highlight that CNNs achieve exceptional accuracy in detecting threat objects [2].

Anomaly Detection

Identification of Defects in Tiny Particles

Tiny tools and particles (less than 1 cm) can develop flaws because of working conditions and poor design. Mass-produced products are prone to faults such as holes, sags and abrasions, and decay and fatigue damage arise in day-to-day operation. Deep learning’s effective feature extraction capability is used to identify defects in such tiny particles. An SSD object detector was shown to detect flaws in 0.8-cm darning needles accurately [63].

Anomaly in Steel Structure

Bolt loosening affects the safety of a steel structure and may lead to severe accidents. Because of the complicated fluctuation properties of bolt joints, it is hard to recognize bolt loosening in steel structures from a conventional dynamics perspective. However, deep learning’s strong feature extraction capability is used to detect the screw and the screw number. From the detected set, bolt loosening is effectively identified using trigonometric relationships [64].

Anomaly in Food Particles

The identification of foreign objects plays an essential part in agricultural commodities. Foreign particles enter food products in various ways, and foreign objects in food are a major source of customer complaints. Identifying them is therefore hugely significant for quality and health and is a major concern for food safety regulation. Manual extraction of foreign objects from food material is a time-consuming and laborious task. A DCNN has been applied to segregate foreign objects from walnuts. The DCNN overcomes the clustering of walnuts and foreign particles, which was a challenging problem for manual feature engineering. The DCNN performed at above 99% on more than 100 test images [47].

Science and Engineering

Nanoparticles Segmentation

Images produced by microscopes are large in number and in resolution. The shapes, size distributions and properties of nanoparticles play an essential part in interpreting a material. Hence, the nanoparticles in each image should be detected and identified to give a measurable guide. Segmenting these particles is challenging because of overlapping instances and variable particle sizes and shapes. Moreover, manual detection and segmentation of nanoparticles are laborious and time-consuming. A deep learning paradigm is used to detect and segment nanoparticles in transmission electron microscopy (TEM) images. A multiple-output convolutional neural network (MO-CNN) is used for concurrent recognition and segmentation of nanoparticles. The proposed deep learning approach is powerful and efficient, achieves high precision, and is capable of analyzing nanoparticles even with overlapping particles and complex backgrounds [38].



Surveillance plays a crucial role in the continuous analysis of massive amounts of critical visual information. For a human operator, detecting targets, monitoring security-sensitive areas and watching for possible suspicious activities increase cognitive load and cause exhaustion; moreover, the work is prone to error. The best alternative is to use computer vision to detect suspects and monitor with ease, which also lowers the probability of error [4].


Conclusion

This survey briefly reviewed the predominant methods of object detection, starting from pre-deep learning methods. Most importantly, we gave preference to the indirect parameters. From the review and comparative analysis, we conclude that indirect performance parameters play a crucial role in various problems across different areas by boosting performance in comparison with a generic detector. Furthermore, we have highlighted which parameters are the appropriate choice for different problem conditions.

Moreover, we have shown the strengths of object detection in various domains. Results from the various applications show that deep learning methods outperform conventional methods and go beyond human visual perception. The transition of technological intervention from assisting to complete dependence has therefore raised serious concern about the future role of human intervention in various tasks. Nevertheless, it is essential to engage with the evolving technology in order to keep pace with the fast-evolving machine-centric period.