1 Introduction

In recent years, object detection and tracking have gained increasing attention owing to their wide range of applications and recent breakthrough research. Object detection and tracking are equally important in real-world applications and in academia. Real-world applications include autonomous driving, security monitoring, transportation surveillance, and robotic vision [1]. A variety of sensing modalities, such as radar, Light Detection and Ranging (LIDAR), and computer vision (CV), have become available for object detection and tracking. Imaging technology has progressed immensely in recent years: cameras are cheaper, smaller, and of higher quality than ever before. Concurrently, computing power has increased dramatically, with platforms geared toward parallelization, such as multi-core processors and graphics processing units (GPUs). Such hardware advances allow CV-based object detection and tracking to pursue real-time implementation. Rapid development in deep convolutional neural networks (CNNs) and the enhanced computing power of GPUs are the main reasons behind the fast evolution of CV-based object detection and tracking.

In this context, let us mention the evolution of deep learning (DL) from machine learning (ML) and their characteristic differences. ML is a branch of artificial intelligence (AI), and it basically means learning patterns from examples or sample data. Here the machine is given access to the data and has the ability to learn from it. The data (or examples) could be labeled, unlabeled, or a combination of the two; accordingly, the learning could be supervised, unsupervised, or semi-supervised. Artificial neural networks (ANNs), which can learn the relation between input and output from examples, are good candidates for ML. ANNs enjoy characteristics like adaptivity, speed, robustness/ruggedness, and optimality. In the early 2000s, certain breakthroughs in multi-layered neural networks facilitated the advent of deep learning. DL means learning in depth, in different stages [2]. DL is thus a specialized form of ML which takes the latter to the next level in an advanced form. DL is characterized by learning data representations, in contrast to task-specific algorithms [3]. The CNN represents one such deep architecture, and it is the most popular for learning with images and video.

In the DL framework, the problem of object recognition can be viewed as the task of labeling different objects in an image frame with their correct classes and predicting their bounding boxes with high probability. The learning performance in DL depends on the number of samples (or previous experiences): the larger the number, the more accurate the performance. Today, we have abundant data, which in turn makes DL a meaningful choice [3, 4]. However, unlike conventional (shallow) learning, DL often needs hundreds or thousands of images to obtain the best results; the term "shallow" is meant in contrast to "deep" [3, 4]. DL is therefore computationally intensive and difficult to engineer, and it requires a high-performance GPU to provide very fast object recognition and motion detection.

DL models can be used in both generic and domain-specific object detection and tracking. In the detection network, a deep CNN is used as a backbone to extract the key features from an input image/video frame. These features are used to localize and classify the objects in the same frame. Thereafter, in object tracking, the detected objects are tracked from frame to frame based on feature nearness. Object detection refers to scanning and searching for objects of certain classes (e.g., human, car, and building) in an image/video frame. Diverse studies have been conducted in the object detection domain, including edge detection [5, 6], image segmentation [7, 8], pose detection [9], face detection [10], multi-category detection [11], pedestrian detection [12], scene text detection [13], and salient object detection [14, 177]. Object detection is at the heart of scene understanding, so it is widely used in various fields, including security, military, transportation, and medical applications. Further, segmentation is the mother task of object detection in an image. Segmentation can be performed using various conventional and modern approaches [15]; better segmentation results in higher object detection accuracy. As the task is unsupervised, segmentation poses several challenging issues.

Object detection can be performed using either image processing techniques or DL networks. Image processing techniques usually do not require historical data for training and are unsupervised in nature, but they are limited by various factors, such as complex scenarios and illumination, occlusion, and clutter effects. All these issues are better tackled in DL-based object detection. DL networks are supervised in nature, and their working is constrained by the need for a huge amount of training data and GPU computing power. Many benchmark datasets, for example, Caltech [16], KITTI [17], ImageNet [18], PASCAL VOC [19], MS COCO [20], and V5 [21], have already been developed in the object detection field. Owing to the availability of such huge amounts of data and the development of GPUs, DL-based object detection is widely accepted by researchers.

Object detection is followed by object tracking. The aim of object tracking is to localize a detected object over time and link it to its trajectory. An efficient and robust system design is required to track objects in either a domain-specific or a generic scenario, and this target is fulfilled by recently developed DL networks. Consider, for example, the research on DL networks for image classification in the ILSVRC 2012 competition [22], where the error rate was reduced by 10% compared to conventional methods. Thereafter, ever deeper learning networks were gradually developed for image classification and were well received by the vision community for their efficiency. Advancements in object detection are observed in face recognition [23], person re-identification [24], image semantic segmentation [15, 25], and action recognition [26], among others. All the successes of DL networks for object detection inspire improvements in object tracking. However, DL networks cannot be directly used for object tracking, since for tracking, objects first need to be detected [27,28,29] from the image frame, either manually or by a network using supervised or semi-supervised learning. This learning task requires huge numbers of samples to learn the features of the selected object(s). Earlier DL networks [30] were inferior to the correlation filter [31] for object tracking. Thereafter, different strategies have been proposed to improve DL for object tracking [3, 32, 33]. These strategies may be classified based on three main aspects: i) more samples are used to perform feature learning for tracking objects [34, 35], ii) features are extracted from multiple layers or low layers of deep CNNs [36, 37], and iii) end-to-end deep networks are developed to obtain the tracking results directly [38]. Recently, two reviews [39, 40] have been published on DL for object tracking. Multiple object tracking (MOT) is more complicated than single object tracking and is more applicable in real-time scenarios; the research on MOT has therefore attracted overwhelming attention. Although DL has been observed to be efficient for MOT problems, the tracking performance rests entirely on the success of proper image localization and classification [3, 28, 29, 41, 178]. It is therefore necessary to summarize and analyze the existing DL networks for both object detection and tracking. Recent surveys cover DL-based object detection [1, 42] and DL-based object tracking [40], but each covers only one of the two tasks, not both together.

The present review deals with DL-based object detection and tracking, considering the two tasks individually and in combination. In other words, it additionally analyzes which combinations of detectors and trackers are suitable for which kinds of data; in that sense, this review integrating DL-based object detection and tracking is the first of its kind. With the rapid development in CV research, the article provides a systematic and comprehensive study of the characteristic features, functionalities, and performance of the various state-of-the-art methods that offer efficient solutions and new directions in this domain. It provides an overview of how different DL models are being deployed in generic object detection, specific object detection, and object tracking, as well as in finding the best combined detector-tracker models. This facilitates the selection of appropriate deep models for multi-object detection and tracking, and in turn enhances the scope for further improvement. These are followed by some crucial application areas of object detection, various challenging research issues in detection and tracking, and certain concerns for future researchers in DL. The last aspect is crucial as a caution to beginners in DL and AI research. A comprehensive bibliography of up-to-date research on DL-based object detection and tracking is also presented.

The article proceeds as follows: Section 2 presents the broad approaches to object detection and tracking. Generic object detectors are presented in Section 3. Reviews of the application of CNNs to various specific detection tasks are given in Section 4. Section 5 elaborates the most representative and pioneering DL-based approaches to object tracking. The results of a detailed analysis of deep networks for both object detection and tracking are stated in Section 6. We conclude the paper in Section 7. Various applications and challenges of the object detection and tracking task, together with some concerns, are discussed in Section 8.

2 Object detection and tracking: Broad approaches

In this section, we briefly discuss different approaches, both conventional and DL-based, for multi-object detection and tracking, along with their characteristic features. As mentioned before, both object detection and tracking are important in the field of CV. In general, object detection is performed in two steps: finding the foreground entities (using features), which are considered object hypotheses, and then verifying these candidates (using a classifier). We divide object detection into three broad categories: i) appearance-based, ii) motion-based, and iii) DL-based. Appearance-based approaches use image processing techniques to recognize objects directly from images/video, but they usually fail to detect occluded objects. Motion-based approaches, in contrast, use a sequence of images to recognize objects; these methods may not function properly for detecting objects in complex scenarios. DL-based approaches use appearance features, motion features, or their combination for object detection in images/video frames. Owing to recent technological breakthroughs, DL-based approaches to object detection have gained much more attention than either appearance- or motion-based approaches.

Deep CNNs are used as the backbone in DL-based object detectors to extract features from the input image/video frame; these features are then used to classify the object(s). DL-based approaches fall into two categories: i) two-stage detectors [43] and ii) one-stage detectors [44]. In two-stage detectors, approximate object regions are first proposed using deep features, and then these features are used for classification as well as bounding-box regression of the object candidates. In one-stage detectors, on the other hand, bounding boxes are predicted over the images without the region proposal step. This process consumes less time and hence can be used in real-time devices. Two-stage detectors achieve high detection accuracy, whereas one-stage detectors offer high speed. Various backbone networks (feature generation networks) used in DL-based object detection include i) AlexNet [45], ii) ResNet [46], and iii) VGG16 [43], among others. With the advancement of backbone networks and the increasing capability of GPUs, remarkable progress has been achieved in two-stage object detectors. Recently, the concept of granular computing has been embedded in deep networks in order to enhance computation speed significantly while keeping a balance with detection accuracy; examples are granulated CNN [3] and granulated RCNN [178]. Detailed reviews of DL-based generic and specific object detection are provided in Sections 3 and 4.

As said earlier, the task of object detection is followed by that of object tracking. Tracking serves two major purposes: i) predicting the location of foreground objects in videos and ii) correctly associating detected objects with trajectories in the current frame. Optical flow is used in [47] to track objects by measuring the distance between the new detection and the displacement of the trajectory. In [48], the motion of a newly detected object in the current frame is estimated by a Kalman filter. Since real-life dynamic problems are often non-linear, several variations of the traditional Kalman filter have been developed, such as the Extended Kalman Filter (EKF) [49] and the particle filter [50]; both handle non-linear transformations of random variables.
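To make the motion model concrete, here is a minimal sketch of a constant-velocity Kalman filter of the kind used in [48] to estimate a detected object's motion; the state layout and noise values are illustrative assumptions rather than those of any specific tracker.

```python
# Minimal constant-velocity Kalman filter tracking one object center.
import numpy as np

class KalmanTracker:
    def __init__(self, x0, y0, dt=1.0):
        # State: [x, y, vx, vy]; measurement: [x, y].
        self.x = np.array([x0, y0, 0.0, 0.0])
        self.P = np.eye(4) * 10.0                       # state covariance
        self.F = np.array([[1, 0, dt, 0],               # constant-velocity transition
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)  # observe position only
        self.Q = np.eye(4) * 0.01                       # process noise (assumed)
        self.R = np.eye(2) * 1.0                        # measurement noise (assumed)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                               # predicted location

    def update(self, z):
        y = np.asarray(z, dtype=float) - self.H @ self.x    # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)            # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

kt = KalmanTracker(10.0, 20.0)
print(kt.predict())          # predicted center before seeing the new detection
kt.update([12.0, 21.5])      # correct with the associated detection
```

In a tracker, the predicted location is compared against new detections for association; the EKF and particle filter mentioned above replace the linear predict/update steps when the motion or measurement model is non-linear.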

In recent years, deep architectures have gained popularity in MOT. We roughly classify deep architecture-based MOT into three categories. The first category involves deep feature-based MOT enhancement, where the (semantic) features are typically extracted from a deep CNN; an example is multiple hypothesis tracking (MHT) [51]. The second category includes MOT using end-to-end deep CNN learning; such end-to-end DL networks, viz., RNN-LSTM and hierarchical RNN models, are developed in [38]. The third category involves MOT using deep network embedding, where the core part of the tracker is designed around a deep CNN. A detailed review of all the tracking categories is provided in Section 5.

Since the performance of object tracking depends on the performance of detection, Section 6 provides a comparative analysis of performance and challenges among different combinations of detectors and trackers on various videos. The purpose is to show which pair of detector and tracker is suitable for which kind of data. For this analysis, we have focused only on those DL-based multi-object detection and tracking algorithms which are competitive on the benchmark datasets.

3 Generic object detectors

Generic object detectors aim to locate and classify objects in an image and to label them with rectangular bounding boxes indicating the confidence of existence. Generic object detectors are of two types: two-stage detectors and one-stage detectors. Two-stage detectors follow the traditional object detection pipeline, i.e., object localization followed by classification, whereas one-stage detectors treat object detection as a regression/classification problem. For both types, classification is based on features generated by a feature generation network, called the backbone network. Detailed discussions of backbone networks, two-stage detectors, and one-stage detectors are provided in Sections 3.1, 3.2, and 3.3, respectively.

3.1 Backbone networks

This network acts as the feature generation network for object detection: it takes an image as input and generates its feature map. CNNs and their variants are used as backbone networks. Most backbone networks for object detection perform feature generation in the convolution layers and classification in the last fully connected layers. Example deep networks include AlexNet [45], ZFNet [43], and VGG16 [52]. Improved versions of the basic deep networks are also available; for instance, in [53], an existing network is made much deeper by adding specially designed layers and by replacing or removing some existing layers. Specially designed deep networks are also used [44, 54] to meet specific requirements. To achieve better accuracy and efficiency, researchers can choose deeper and denser backbones, such as ResNet [55], ResNetXt [56], and AmoebaNet [57], or lightweight backbones, such as MobileNet [58], SqueezeNet [59], Xception [60], and MobileNetV2 [61]. The lightweight backbones can meet the requirements of mobile applications. Applications demanding a high degree of precision require complex backbone structures, but real-time video surveillance systems require high processing speed as well as high accuracy [44]. Researchers therefore focus on improved backbones that adapt to the detection architecture and strike a fair trade-off between accuracy and speed.

As mentioned earlier, deeper and densely connected backbones replace shallower and sparsely connected ones to obtain higher detection accuracy. For instance, in [44], VGG16 is replaced by the higher-capacity backbone ResNet, which can extract rich features, and ResNet is adopted in Faster RCNN for a further gain in accuracy. It can thus be said that the quality of the features determines the upper bound of network performance, and deeper, densely connected backbones provide more qualitative features than shallower, sparsely connected ones; further exploration of deeper networks is therefore required. Of the aforesaid networks, let us explain the features of AlexNet [45], as it is used frequently in the subsequent discussions. AlexNet consists of five convolution layers (Conv1-Conv5), three pooling layers (Pool1, Pool2, and Pool5), and three fully connected layers (FC1, FC2, and FC3). It takes an image as input and constructs its reduced feature map as the output of Pool5. The number of channels in this feature map equals the number of filters used in the Conv5 layer. Thereafter, this map is converted to a one-dimensional weighted array through the FC1 and FC2 layers. This array is then fed into a classifier with N class labels through the FC3 layer, where N is the number of trained object classes. During the training of AlexNet, the classification loss, i.e., the error with respect to the class labels of the objects, is minimized through back-propagation. For more details about other deeper networks, one may refer to [62].
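As a quick illustration of the backbone role described above, the following sketch extracts the convolutional feature map with torchvision's AlexNet variant (which differs from the original [45] in minor details such as filter counts); only the convolution/pooling stack is kept, since the fully connected classifier is not needed for feature generation.

```python
# A minimal sketch: AlexNet's convolution/pooling stack (through Pool5) as a
# feature generation network; the FC layers are dropped.
import torch
from torchvision import models

backbone = models.alexnet(weights=None).features   # Conv1..Conv5 + pooling layers
x = torch.randn(1, 3, 224, 224)                    # dummy input image
fmap = backbone(x)
print(fmap.shape)                                  # torch.Size([1, 256, 6, 6]):
                                                   # 256 channels = #filters in Conv5
```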

3.2 Two-stage detectors

Two-stage detectors involve two tasks: object region proposal and object classification. First, object regions are proposed using either conventional methods or deep networks; classification is then based on the features extracted from these proposed regions, which increases the detection accuracy. The basic architecture of a two-stage detector is shown in Fig. 1. Two-stage detectors include the region convolutional neural network (RCNN) [45], Fast RCNN [63], Faster RCNN [43], Mask RCNN [55], R-FCN [64], FPN [53], granulated CNN [3], and granulated RCNN (G-RCNN) [178]. These are explained in the following sections:

Fig. 1: Basic architecture of a two-stage detector [1]

3.2.1 RCNN

RCNN [45] is perhaps the first two-stage detector to show that a deep CNN outperforms conventional methods for object detection. RCNN has four modules. The first module proposes object regions in the image frame. In the second module, a fixed-length feature vector is extracted from these regions. The third module deals with object classification. In the last module, bounding boxes are fitted over the classified objects.

In the first module, a selective search method is adopted to propose the approximate object region(s) in the input image. Then, a deep CNN takes each region proposal as input and generates a fixed-length (4096-dimensional) feature vector that is further used in the classification task. Classification is done through fully connected layers, which need fixed-length input vectors; therefore, the feature vectors extracted from all the region proposals must have the same size. An image may contain one or more objects of different sizes and aspect ratios, so region proposals of different sizes are obtained in the first module. Each proposal is therefore warped to a fixed size before its features are extracted, and this fixed-size feature vector is used for object classification. Here, the feature generation/backbone network consists of five convolution and two fully connected layers, and all convolution parameters are shared across the object categories used for training. Training of RCNN has two stages: RCNN is first trained using a large-scale dataset and then fine-tuned using a particular dataset. In RCNN, the last fully connected layer is connected to (N + 1) classification outputs (N object classes plus 1 for background) for the final object classification. Stochastic Gradient Descent (SGD) is used for fine-tuning the convolution parameters. During fine-tuning, the IoU (intersection over union), i.e., the overlap between a region proposal and the ground truth, is measured; a region proposal with IoU less than 0.5 is considered negative, otherwise positive. The region proposal with the maximum IoU with respect to the ground truth is treated as the ground truth in the subsequent training process. In RCNN, region proposal and classification are performed separately with no shared computation; RCNN therefore consumes a prolonged time for the classification task.
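The IoU criterion used for labeling proposals during fine-tuning is easy to state in code; a minimal version for boxes in (x1, y1, x2, y2) form:

```python
# IoU between two axis-aligned boxes; a proposal with IoU >= 0.5 against a
# ground-truth box is treated as a positive during RCNN fine-tuning.
def iou(box_a, box_b):
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)     # intersection area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)              # intersection over union

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ~ 0.143 -> a negative example
```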

3.2.2 Fast RCNN

The next advanced version of RCNN is Fast RCNN [63], which addressed the runtime issue of RCNN. Fast RCNN takes the entire image as input and generates pooled feature maps corresponding to it. Each region in the pooled map is treated as a region of interest (RoI). This fixed-size RoI map is then passed through three fully connected layers for object classification and bounding-box fitting over the classified object. Since the locations of the pooled features are taken as the probable regions and used directly for classification, significant computation time is saved compared to RCNN. Another difference is that RCNN involves a multi-stage training process, whereas Fast RCNN uses one-stage end-to-end training.

As said earlier, instead of the input region proposals themselves, the RoI pooling map is used for the classification task. This feature map consists of key features belonging to different regions of different sizes; therefore, Fast RCNN does not require warping the region proposals or recomputing features for each of them. Here, truncated singular value decomposition (SVD) is used to compress the weight parameters of the fully connected layers, which accelerates detection. Experimental results revealed that Fast RCNN achieves 66.9% mAP (mean average precision) on the PASCAL VOC 07 dataset [19], whereas RCNN yields 66.0% mAP on the same dataset. The training time of Fast RCNN is 9 times lower than that of RCNN, and Fast RCNN trained with truncated SVD achieves a higher detection speed than RCNN; an Nvidia K40 GPU was used in these experiments. From these results, it is evident that Fast RCNN is better than RCNN in terms of detection performance metrics. However, Fast RCNN still relies on a selective search method to obtain the regions pooled from the convolution feature map, which slows down its operation.

3.2.3 Faster RCNN

Faster RCNN [43] is an improved version of Fast RCNN in terms of detection accuracy and runtime. As stated earlier, Fast RCNN uses a selective search method for region proposal, which makes the system slow. Faster RCNN replaces this method with a region proposal network (RPN), a fully convolutional network that predicts object region(s) efficiently over a wide range of aspect ratios and scales. In Faster RCNN, the time required to generate region proposals is lower than in Fast RCNN because the RPN shares the full-image convolution features and a set of common convolution layers with the detection network. Here, anchors are placed at each convolution feature location to generate region proposals of different sizes. Anchors are spatial windows of different sizes and aspect ratios placed at a location in the input feature map; Faster RCNN uses anchor boxes with three scales and three aspect ratios. A fixed-size (3 × 3) window slides over the output of the last convolution layer, and the center point of each sliding window (i.e., of each set of anchor boxes) corresponds to a location in the original input image. The anchor-based region proposal is parameterized relative to the anchor to predict the bounding box, and the distance between the ground-truth box and the predicted box is then computed to optimize the location of the predicted box. On the PASCAL VOC 07 test set, Faster RCNN achieves a mAP of 69.9%, whereas Fast RCNN achieves 66.9% with shared convolution computations. Moreover, Faster RCNN (testing time 198 ms) is approximately 10 times faster than Fast RCNN (testing time 1830 ms) with the VGG16 network on an Nvidia K40 GPU.
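The anchor mechanism can be illustrated with a short sketch that generates the nine reference boxes (three scales by three aspect ratios) centered at one feature-map location; the scale values are the ones reported in [43], while the exact parameterization in implementations may differ.

```python
# Nine reference anchors (3 scales x 3 aspect ratios) at one location.
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(1.0 / r)      # width/height chosen so that
            h = s * np.sqrt(r)            # w * h ~ s * s and h / w = r
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)

print(make_anchors(8, 8).shape)  # (9, 4): one anchor box per scale-ratio pair
```

The RPN then regresses offsets relative to each anchor and scores it as object or background; anchors tile every feature-map location, so proposals of many sizes are produced in a single pass.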

3.2.4 R-FCN

As mentioned earlier, Faster RCNN has two sub-networks: a fully convolutional (shared) sub-network that is independent of RoIs, and an RoI-based unshared sub-network. Faster RCNN uses deep CNNs such as AlexNet [45] and VGG16 [43] and provides efficient results. In contrast, more recent image classification networks, including ResNets [65] and GoogLeNets [66], are fully convolutional; their architectures suggest building a fully convolutional object detection network without a per-RoI sub-network. However, using Faster RCNN with ResNet and GoogLeNet architectures yields inferior results. This happens because object detection is translation-variant, whereas image classification is translation-invariant: shifting an object within an image should not change its classification, while the translation of an object relative to a bounding box is meaningful for detection. Manually inserting the RoI pooling layer into such a convolutional network compromises this property. To address the issue, R-FCN was proposed in [64].

For each object category in R-FCN, the last convolution layer generates g² position-sensitive score maps over a (g × g) grid. A position-sensitive pooling layer is appended to the last convolution layer to aggregate the responses from these score maps. Finally, for every RoI, the g² scores are averaged to generate an (N + 1)-dimensional vector (N object categories plus 1 background), from which softmax responses are calculated. Another (4 × g²)-dimensional convolution layer is appended to obtain the class-agnostic bounding boxes. The testing speed of R-FCN on both MS COCO and PASCAL VOC is 170 ms per image.
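A minimal sketch of the position-sensitive pooling step, using torchvision's ps_roi_pool with random tensors standing in for real score maps: with g = 3 and N = 20 classes plus background, the last convolution emits g² × 21 = 189 score maps, and averaging the g² pooled cells yields the 21-dimensional class vector described above.

```python
# Position-sensitive RoI pooling as in R-FCN (random tensors, illustrative sizes).
import torch
from torchvision.ops import ps_roi_pool

g, num_classes = 3, 21                        # grid size; N + 1 categories
score_maps = torch.randn(1, g * g * num_classes, 38, 50)   # backbone output
rois = torch.tensor([[0, 10.0, 10.0, 200.0, 150.0]])       # (batch_idx, x1, y1, x2, y2)

pooled = ps_roi_pool(score_maps, rois, output_size=g, spatial_scale=1.0 / 16)
scores = pooled.mean(dim=(2, 3))              # average the g^2 position-sensitive cells
probs = scores.softmax(dim=1)                 # per-RoI class probabilities
print(pooled.shape, probs.shape)              # [1, 21, 3, 3] [1, 21]
```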

3.2.5 FPN

Feature pyramids, built upon image pyramids, have been widely adopted by many object detection systems to improve scale invariance [67, 68]; however, training time and memory consumption are high in this process. In some techniques, the pyramids are built only at test time, which creates an inconsistency between training and testing-time inference [43, 63]. The in-network feature hierarchy of a deep CNN produces feature maps of various spatial resolutions, but it introduces semantic gaps caused by the different depths. This issue is addressed in some studies [69, 70] by starting the pyramid from the middle layers, but the resulting systems miss the higher-resolution maps. The feature pyramid network (FPN), proposed in [53], instead consists of a bottom-up (BU) pathway, a top-down (TD) pathway, and a number of lateral connections, which combine semantically strong (low-resolution) features with semantically weak (high-resolution) features. The BU pathway produces a feature hierarchy by down-sampling the corresponding feature map with a stride of 2. Layers producing output maps of the same size are grouped into network stages, and the output of the last layer of each stage is chosen as the reference set of feature maps for building the TD pathway. In the TD pathway, the feature maps from higher network stages are first up-sampled and then enhanced with the same spatial-sized maps from the BU pathway via lateral connections. A (1 × 1) convolution layer is applied to each BU map to reduce its channel dimension, and the merged map is obtained by element-wise addition. Finally, a (3 × 3) convolution is appended to each merged map to reduce the aliasing effect of up-sampling and to generate the final feature map. This process is iterated until the finest-resolution map is generated. As the feature pyramid extracts rich semantics, FPN achieves this without compromising memory or speed, and it can be used at various stages of object detection.
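One top-down merge step of FPN can be sketched directly from the description above: a 1 × 1 lateral convolution, 2× up-sampling, element-wise addition, and a 3 × 3 smoothing convolution. Channel and map sizes here are illustrative.

```python
# A single FPN top-down merge step (illustrative sizes).
import torch
import torch.nn.functional as F
from torch import nn

d = 256
lateral = nn.Conv2d(512, d, kernel_size=1)           # reduce BU channels to d
smooth = nn.Conv2d(d, d, kernel_size=3, padding=1)   # reduce up-sampling aliasing

c4 = torch.randn(1, 512, 28, 28)                     # bottom-up map at one stage
p5 = torch.randn(1, d, 14, 14)                       # top-down map from the stage above

p4 = smooth(lateral(c4) + F.interpolate(p5, scale_factor=2, mode="nearest"))
print(p4.shape)                                      # [1, 256, 28, 28]: merged map
```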

3.2.6 Mask RCNN

An extended version of Faster RCNN is Mask RCNN [55], which was mainly developed for the instance segmentation task. Here, a ResNet-feature pyramid network (FPN) [53] backbone is added to Faster RCNN [65] to generate informative features, thereby increasing detection accuracy and speed. RoI features extracted from different layers of the FPN have different scales; the FPN thus generates a feature hierarchy consisting of RoI feature maps at different scales. This is done in the BU pathway. The TD pathway, on the other hand, offers higher-resolution features by up-sampling the feature maps from higher pyramid levels; the feature maps at the top of the pyramid are simply the last convolution-layer feature maps of the bottom-up pathway. The same spatial-sized feature maps from the BU and TD pathways are then merged to generate the region proposals. FPN thus produces both higher- and lower-resolution feature maps, yielding significant features that improve detection accuracy.

Detection accuracy is further improved by replacing the RoI pooling layer with RoIAlign to retrieve a (comparatively small) feature map from each RoI. The quantization in traditional RoI pooling causes a misalignment between the RoIs and the pooled features; the RoIAlign layer addresses this issue. Here, the floating-point coordinates of each RoI are kept, and bilinear interpolation is used to compute exact feature values at four regularly sampled locations in each RoI bin; max or average pooling then aggregates these four values per bin. Finally, the bin values are assembled into the per-RoI feature map used for object classification. The aforesaid two modifications improve detection precision: the ResNet-FPN backbone achieves 71.2% AP (average precision) and the RoIAlign operation achieves 70.9% AP on the MS COCO dataset.
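A short usage sketch of RoIAlign via torchvision: the RoI coordinates stay floating-point, and bilinear sampling (here two samples per bin axis) replaces the quantization of RoI pooling. Tensor sizes are illustrative; the 7 × 7 output is the value commonly used for the box head.

```python
# RoIAlign: bilinear sampling at floating-point locations, no quantization.
import torch
from torchvision.ops import roi_align

fmap = torch.randn(1, 256, 50, 64)                     # backbone/FPN feature map
rois = torch.tensor([[0, 17.3, 9.8, 230.6, 180.2]])    # float coords are kept as-is

feats = roi_align(fmap, rois, output_size=(7, 7),
                  spatial_scale=1.0 / 16, sampling_ratio=2)
print(feats.shape)                                     # [1, 256, 7, 7] per-RoI features
```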

3.2.7 Incorporating granular computing in CNN

In this section, we mention some recent developments in CNN that incorporate the concept of granular computing (GrC) for object detection and tracking. Two such new models exist, namely granulated CNN and granulated RCNN, in short G-CNN and G-RCNN, respectively. Before explaining these models, let us briefly describe the concept of granules and granular computing, along with its characteristic features.

Granulation is a basic step of human cognition. Granular computing (GrC) is a nature-inspired information processing framework in which computations/operations are performed on information granules. Granules evolve during the abstraction of knowledge from data. The significance of GrC rests on the realization that precision is sometimes expensive and not very meaningful in modelling and controlling complex systems. When the data has an overlapping character, it may be convenient to represent it in terms of granules (clumps of indiscernible elements drawn together, for example, by likelihood, similarity, proximity, or functionality).

As GrC deals with granules rather than individual elements, it reduces computation time, which makes it well suited to large data sets.

Since DL is computationally intensive while the GrC paradigm reduces computation time, it is appropriate and logical to integrate the two judiciously so as to make the DL framework efficient in computation time using only a CPU. Based on this realization, G-CNN and G-RCNN were formulated for object detection, tracking, and scene description. They are described as follows:

(a) Granulated CNN:

As stated, granulation is the process of forming granules through information abstraction. For processing an image frame in the GrC paradigm, granules may be of equal or unequal size and of regular or irregular shape, although irregular ones are more natural for real-life problems. Region growing can be used to obtain irregular-shaped (natural) spatio-color neighborhood granules. These granules represent both static and moving object regions in the image/video frame. The object regions are then fed to the deep CNN architecture for object classification, resulting in G-CNN. The functioning principle of G-CNN is as follows: instead of scanning the entire image pixel by pixel in the convolution layer, the network jumps only over the previously formed granules. That means, for a (32 × 32) image with N granules, the filter slides only N times instead of over all (32 × 32) pixel locations, where N << (32 × 32). Hence a significant speed-up is observed, at the cost of some accuracy [3]. A loose sketch of such granulation is given below.
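The following is only a loose illustration, not the algorithm of [3]: granules are approximated as connected regions of inter-frame change, with scipy's connected-component labeling standing in for region growing, so that a detector would visit N granule regions instead of every pixel location.

```python
# Illustrative spatio-temporal granulation: granules as connected change regions.
import numpy as np
from scipy import ndimage

def granulate(prev_frame, frame, thresh=25):
    # Granule = connected region of significant inter-frame change.
    moving = np.abs(frame.astype(int) - prev_frame.astype(int)) > thresh
    labels, n = ndimage.label(moving)          # one integer label per granule
    # The bounding slice of each granule is a candidate region for the CNN.
    return ndimage.find_objects(labels), n

prev_f = np.zeros((32, 32), dtype=np.uint8)
cur_f = prev_f.copy()
cur_f[8:16, 10:20] = 200                       # a synthetic moving blob
regions, n = granulate(prev_f, cur_f)
print(n, regions[0])                           # 1 (slice(8, 16, None), slice(10, 20, None))
```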

This is the first investigation [3] incorporating granular computing in a deep CNN framework for object detection. Granulated CNN achieves 48.59% detection accuracy at 1.5 fps on the MS COCO dataset. Further, the concept of Z-numbers [71] was used to provide a granulated linguistic description of the output scene, which is unique.

(b) Granulated RCNN:

G-RCNN [178] is an advanced version of Faster RCNN. Here object detection has two stages: object localization (i.e., RoI extraction) and classification. G-RCNN is effective for extracting RoIs from image/video frames, which is achieved by incorporating the unique concept of granulation into a deep CNN. Granules are constructed using spatio-temporal information and represent the object locations (i.e., regions) in an image/video frame. Unlike Fast and Faster RCNN, G-RCNN uses (i) granules formed over the pooled feature map, instead of the entire feature map, to define RoIs, (ii) only the objects in the RoIs, instead of the entire pooled feature map, for object classification, and (iii) only positive RoIs during training, instead of the entire RoI map. In addition, both images and video can be used for training G-RCNN. All of this improves real-time detection accuracy and speed: G-RCNN with an AlexNet backbone achieves 80.9% detection mAP at 5.6 fps on the PASCAL VOC 12 dataset.

3.3 One-stage detectors

In one-stage detectors, bounding boxes are predicted over the images without the region proposal step, thereby increasing detection speed. The basic architecture of a one-stage detector is shown in Fig. 2. One-stage detectors include YOLO [44], YOLOv2 [46], YOLOv3 [72], SSD [70], DSSD [73], RetinaNet [74], M2Det [75], RefineDet [76], and DCN [77]. These are explained in the following sections:

Fig. 2: Basic architecture of a one-stage detector [1]

3.3.1 YOLO

YOLO [44] is a one-stage object detector designed after Faster RCNN. It is mainly applicable to real-time detection. YOLO predicts fewer than 100 candidate boxes per image, whereas Fast RCNN and Faster RCNN work with 2000 and 300 region proposals per image, respectively. YOLO treats detection as a regression problem, retrieving features from an input image directly to predict class probabilities and bounding boxes. The speed of the YOLO network is 45 fps excluding batch processing on a Titan X GPU, whereas Fast RCNN and Faster RCNN achieve 0.5 fps and 5 fps, respectively, on the same GPU.

The input image is divided into (g × g) grid cells, and features extracted from each cell are used for object classification. Each grid cell predicts B bounding boxes, together with C conditional class probabilities for the C object classes. Two measures are considered for each bounding box: first, a probability (P) indicating whether the box contains any object, and second, the IoU between the box and the ground truth, indicating how accurately the box encloses that object. The bounding box with the highest IoU and a non-zero class probability is taken as the object region. The YOLO network consists of 24 convolution layers and 2 fully connected layers. YOLO is not very good at object localization, which affects its detection accuracy.
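The size of YOLO's prediction tensor follows directly from the description above: each of the g × g cells emits B × 5 box values (x, y, w, h, confidence) plus C class probabilities. With the paper's g = 7, B = 2, C = 20:

```python
g, B, C = 7, 2, 20           # grid size, boxes per cell, classes (PASCAL VOC)
per_cell = B * 5 + C         # each box: x, y, w, h, confidence; plus C class probs
print((g, g, per_cell))      # (7, 7, 30): the detection tensor
print(g * g * per_cell)      # 1470 outputs from the final fully connected layer
```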

Compared to Fast RCNN, YOLO reduces background false positives by a factor of 3. However, YOLO obtains 63.4% mAP at 45 fps, against Fast RCNN (70.0% mAP, 0.5 fps) and Faster RCNN (73.2% mAP, 7 fps). The YOLO detector is limited with respect to high-resolution detection and to single-class prediction per cell.

3.3.2 YOLOv2

YOLOv2 [46] is an advancement of YOLO. Lessons from past training, together with novel concepts, are adopted in YOLOv2 to improve the speed and detection precision of YOLO. YOLOv2 introduces six techniques: i) batch normalization, ii) a high-resolution classifier, iii) convolution with anchor boxes, iv) size and aspect ratio prediction of the anchor boxes, v) fine-grained features, and vi) multi-scale training. These are explained below:

(i) Batch normalization (BN): YOLOv2 is trained using the SGD approach, which operates on mini-batches. For each mini-batch, the mean and variance of the activations are computed, and the activations are normalized to zero mean and a standard deviation of 1, so that all elements in each mini-batch follow the same distribution. This operation is batch normalization [78]. YOLOv2 adds a batch normalization layer ahead of each convolution layer to accelerate convergence and regularize the model. Using BN, the mAP of YOLOv2 increases by 2% over YOLO.

(ii) High-resolution classifier: YOLO's backbone was trained at an input resolution of (224 × 224), whereas YOLOv2 increases the input resolution to (448 × 448). The network therefore has to be adjusted to the new input resolution for the detection task; accordingly, the classification network of YOLOv2 is fine-tuned on (448 × 448) images for 10 epochs. This increases the mAP by 4%.

(iii) Convolution with anchor boxes: As already discussed, Faster RCNN uses anchor boxes as references for generating region proposals, and bounding boxes are predicted as parameterized offsets relative to the reference anchors. This prediction mechanism is adopted in YOLOv2, which then predicts the class and objectness score for each predicted bounding box. This operation increases recall by 7% and reduces the mAP by 0.3%.

(iv) Size and aspect ratio prediction of the anchor box: YOLOv2 applies k-means clustering to the training bounding boxes to obtain better priors. These priors define the center locations of the predicted anchor boxes, and the aspect ratio and size of each anchor box are predicted from the cluster information. This operation improves detection accuracy (see the sketch after this list).

(v) Fine-grained features: As discussed, YOLO was trained with (224 × 224) images. The YOLOv2 architecture modifies YOLO's and, to localize smaller objects, is re-trained with higher-resolution (448 × 448) images. In this re-training process, YOLOv2 uses both higher- and lower-resolution features by stacking adjacent features into different channels. This increases the detection mAP by 1%.

(vi) Multi-scale training: To make the network robust to images of different sizes, a new input dimension is chosen from {320, 352, ..., 608} every ten (randomly selected) batches. This means the same network can detect at different resolutions. For example, YOLOv2 achieves 78.4% mAP at 40 fps at higher resolution, whereas YOLO achieves 63.4% mAP at 45 fps on VOC 07. Although YOLOv2 achieves high detection precision at high speed, it remains limited with respect to high-resolution detection and single-class objects.
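The anchor-prior clustering of point (iv) can be sketched as k-means over training box shapes with the distance d(box, centroid) = 1 − IoU, where IoU is computed as if the boxes shared a corner so that only width and height matter. The data below are random stand-ins for real training boxes.

```python
# YOLOv2-style anchor priors via k-means with a 1 - IoU distance (toy data).
import numpy as np

def iou_wh(wh, centroids):
    # IoU of boxes that share one corner: only widths/heights matter.
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    union = wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes_wh, k=5, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the centroid with highest IoU (lowest 1 - IoU).
        assign = np.array([np.argmax(iou_wh(b, centroids)) for b in boxes_wh])
        for j in range(k):                        # move centroid to cluster mean
            if np.any(assign == j):
                centroids[j] = boxes_wh[assign == j].mean(axis=0)
    return centroids

boxes = np.abs(np.random.default_rng(1).normal(50, 30, size=(200, 2))) + 5
print(kmeans_anchors(boxes, k=5))                 # 5 (width, height) priors
```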

3.3.3 YOLOv3

YOLOv3 [72] is the next advanced version of YOLOv2. The deep CNN Darknet-53 is used as the feature generation network. YOLOv3 uses multi-label classification with overlapping patterns for training, so it can be used for object detection in complex scenarios. Moreover, during training, three feature maps of different scales are used for predicting the bounding boxes, and the last convolution layer generates a three-dimensional tensor containing the class predictions, objectness, and bounding box. YOLOv3 achieves 57.9% mAP on the MS COCO dataset, compared to 53.3% for DSSD513 and 61.1% for RetinaNet. Owing to its multi-class prediction, YOLOv3 is suitable for classifying small objects, but it performs worse in detecting medium and large-sized objects.

3.3.4 SSD

The single-shot detector (SSD) [79] is a one-stage detector that can predict multiple classes. Within SSD, several feature maps of different scales are generated at different layers. SSD predicts class scores for a set of default bounding boxes of varying scales at every location in these feature maps. The default boxes have different scales and aspect ratios for a particular feature map: the scale of the default boxes in one feature map is interpolated between the scales assigned to the lowest and highest feature maps, so that each feature map learns to be responsive to a particular scale of objects. For each default box, multi-label classification scores are predicted. During training, the default boxes are matched against the ground-truth boxes; matched boxes are considered positives and the rest negatives. Since the negatives vastly outnumber the positives, SSD applies hard negative mining: the negative boxes are sorted by loss, and only the highest-loss ones are kept for training, so that the ratio between negatives and positives is at most 3:1. Experiments showed that SSD512 (input image size 512 × 512) produced better results in both speed and mAP with the VGG16 [43] backbone: SSD512 obtained a mAP of 81.6% on the PASCAL VOC 07 test set and 80.0% on the PASCAL VOC 12 test set.
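The scale rule described above is the linear spacing used in the SSD paper: with m feature maps, the scale of map k is s_k = s_min + (s_max − s_min)(k − 1)/(m − 1), with s_min = 0.2 and s_max = 0.9.

```python
# Default-box scales spaced linearly across the m feature maps (SSD paper rule).
def ssd_scales(m=6, s_min=0.2, s_max=0.9):
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

print([round(s, 2) for s in ssd_scales()])   # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```

Each scale is a fraction of the input image size, so the earliest (highest-resolution) map handles the smallest objects and the last map the largest.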

3.3.5 DSSD

The De-convolutional Single Shot Detector (DSSD) [73] is a modified version of SSD. In DSSD, a prediction module and a de-convolution module are added to SSD, and ResNet-101 is used as the backbone. In the prediction module, a residual block is added to each prediction layer, and its output is combined with that of the layer by element-wise addition. The de-convolution module augments the feature-map resolution so that small objects can be detected. By integrating these two modules with SSD, DSSD can predict sets of objects of different sizes. During training, the baseline network ResNet-101 is first pre-trained on the ILSVRC CLSLOC dataset, and the original SSD model (with ResNet-101) is then trained using (513 × 513) images from the same dataset. The parameters of this trained SSD model are then fine-tuned through the training of the de-convolution module. Experiments on both the PASCAL VOC and MS COCO datasets showed the effectiveness of the DSSD513 model [73]; adding the prediction and de-convolution modules to SSD enhances the mAP by 2.2% on the PASCAL VOC 07 test set.

3.3.6 RetinaNet

RetinaNet [74] is another one-stage object detector; it uses the focal loss as its classification loss. One-stage detectors evaluate a dense set of candidate object locations with an extreme imbalance between foreground (positive) and background (negative) classes. Owing to this class imbalance, training is biased toward the majority class, which reduces detection precision. RetinaNet addresses this problem with a loss function named the focal loss, which down-weights the loss assigned to well-classified (mostly negative, background) samples. The loss thus concentrates on the hard training samples and prevents the vast number of easy negatives from dominating training; in this way, RetinaNet can be trained with unbalanced negative and positive samples. Experimental results showed that RetinaNet with a ResNet-101-FPN backbone achieved 39.1% AP on the MS COCO test-dev dataset, compared to 33.2% AP for DSSD513.
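The focal loss is FL(p_t) = −α_t (1 − p_t)^γ log(p_t): samples classified well (p_t near 1, mostly easy background) are down-weighted. A binary-case sketch with the paper's defaults α = 0.25 and γ = 2:

```python
# Focal loss, binary case: down-weights easy (well-classified) samples.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()      # (1 - p_t)^gamma focusing

logits = torch.tensor([4.0, -4.0, 0.1])                    # two easy, one hard sample
targets = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(logits, targets))                         # dominated by the hard sample
```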

3.3.7 M2Det

M2Det was developed in [75] to handle the wide variation of scale across object instances. It comprises a multi-level feature pyramid network (MLFPN) which constructs more effective feature pyramids in three steps. First, multi-level features extracted from multiple layers of the backbone are fused into a base feature. Second, the base feature is fed into a block consisting of joint Thinned U-shape Modules and Feature Fusion Modules to obtain decoder-layer features. Finally, a feature pyramid with multi-level features is built by integrating the decoder layers of equivalent scale. In this way, multi-level, multi-scale features are generated, which are then fed to an SSD head for object localization, classification, and bounding-box fitting. On the MS COCO test-dev dataset with a VGG16 backbone, M2Det achieves an AP of 41.0% at 11.8 fps with a single-scale inference strategy, and an AP of 44.2% with a multi-scale inference strategy. It outperforms RetinaNet800 (with a ResNet-101-FPN backbone) by 0.9% under single-scale inference, although it is two times slower than RetinaNet800.

3.3.8 RefineDet

The RefineDet network [76] has two interconnected modules: (i) an anchor refinement module and (ii) an object detection module, connected through a transfer connection block that transfers features from the former module to the latter for improved object prediction. Training is done end-to-end, and inference has three stages: (i) preprocessing, (ii) detection by the two interconnected modules, and (iii) non-maximum suppression (NMS). Other one-stage detectors, including YOLO, SSD, and RetinaNet, use single-step regression to obtain the final outputs, whereas RefineDet uses a cascaded (two-step) regression to predict hard-to-detect (i.e., small) objects more accurately.
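Stage (iii) above, greedy non-maximum suppression, is standard across detectors and easy to sketch: boxes are visited in descending score order, and any remaining box whose IoU with a kept box exceeds a threshold is discarded.

```python
# Greedy NMS over boxes in (x1, y1, x2, y2) form with per-box scores.
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]              # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        xx1 = np.maximum(x1[i], x1[order[1:]])  # intersections with remaining boxes
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]    # drop heavily overlapping boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))    # [0, 2]: the near-duplicate is removed
```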

3.3.9 DCN

A regular CNN can focus only on features within fixed square regions (according to the kernel), so the receptive field may not cover the pixels of an entire object properly. Deformable convolutional networks (DCNs) [77] handle this issue by producing deformable kernels.

DCN has two variants, DCNv1 and DCNv2. DCNv2 [80] replaces more of the regular convolution layers with deformable convolution layers than DCNv1 does, and all the deformable layers are modulated by a learnable scalar, which enhances the deformable effect and accuracy. DCNv2 achieved 45.3% mAP on the MS COCO test-dev dataset, compared to 41.7% mAP for DCNv1.

In summary, the aforesaid generic detectors enhance accuracy by extracting richer object features and adopting multi-level, multi-scale features for detecting objects of different sizes. To achieve higher speed and precision, one-stage detectors use newly designed loss functions to filter out easy samples, which significantly lowers the number of candidate regions. Adopting deformable convolution layers is effective in addressing geometric variation in images, and modeling the relationships between different objects in an image is also necessary to improve performance. An overview of various object detectors in terms of characteristics such as region proposal, input features, loss function, learning method, and softmax layer is provided in Table 1. Comparative studies of their performance are given in Section 6.

Table 1 Overview of the prominent object detectors

So far, we have explained different detectors and their relative merits and demerits. Let us now consider some applications of CNNs to certain specific detection tasks.

4 Applications of CNN for specific object detection

The specific CNN-based detection tasks discussed here are the detection of faces [81], salient objects [82, 83], and pedestrians [84, 85]. Salient object detection is accomplished with local contrast enhancement and pixel-level segmentation. Face detection and pedestrian detection are closely related to generic object detection and are mainly accomplished with multi-scale adaptation and multi-feature fusion, respectively. Detailed reviews of salient object, face, and pedestrian detection are presented in Sections 4.1, 4.2, and 4.3, respectively.

4.1 DL in salient object detection

Salient object detection aims at focusing on the dominant object regions within an image. It has a wide spectrum of applications, including image cropping [86] and segmentation [6, 15, 87, 88], image retrieval [89], and object detection [53]. There are two broad approaches to detecting salient objects: (i) the BU approach [82] and (ii) the TD approach [83]. The BU approach is based on local feature contrasts drawn from various local and global features, e.g., edges [6, 90] and spatial information [91]. However, multi-scale high-level semantic information cannot be explored with these low-level contrasts; as a consequence, low-contrast saliency maps are generated. The TD approach, in contrast, is task-oriented: prior knowledge about the object category is used to generate the saliency maps, based on which pixels are assigned to a particular object category [92]. In other words, TD saliency detects the specific objects by pruning the BU saliency points [93].

Because multi-scale high-level features matter for various computer vision tasks, including semantic segmentation [92], edge detection [94], and object detection [63], it is quite natural to use CNNs for salient object detection. An earlier study [95] performed searches to obtain optimal features, but this approach is completely data-driven and requires a large amount of training data. This issue is addressed in [96], where saliency prediction is integrated into pre-trained object recognition DNNs: the DNN weights are fine-tuned by transferring saliency evaluation metrics (i.e., KL-divergence and normalized scan-path saliency) based on the specific objective function, and local features combined with global features improve the salient object detection performance. In [97], two independent deep CNNs (DNN-G and DNN-L) are trained using global search and local estimation to obtain global contrast as well as local information and to predict the saliency maps. In [98], a semi-supervised saliency detection network is proposed that integrates visual saliencies from both BU and TD saliency maps; it produces an objectness score by averaging the intensities of multi-scale superpixels.

Salient object detection requires both semantic segmentation and context modeling. A novel superpixel-wise CNN approach, called SuperCNN, is developed in [99] to learn internal representations of saliency efficiently; here, salient object detection is treated as a two-class problem. A deep saliency detection framework named CRPSD is presented in [100], which combines region-level saliency estimation and pixel-level saliency prediction. In addition, multi-scale feature maps are significant for improving detection accuracy; based on this, a deep network called RegionNet is formulated in [101] for salient object detection. This network is based on Fast RCNN and integrates two specific tasks, multi-scale contextual modeling and end-to-end edge preserving, for saliency detection.

4.2 Face detection

Face detection is essential for several face-related applications, including face recognition [102, 103], face synthesis [104], and facial expression analysis [105]. Unlike generic object detection, face detection must recognize and locate face regions covering a very large range of scales. Some generic detectors (e.g., Faster RCNN) have been modified to act as face detectors [106,107,108]. In some studies, CNNs are trained with face landmarks and 3-dimensional modeling. For instance, a unified end-to-end FCN framework, called DenseBox, is proposed in [109] for detecting faces and localizing face landmarks. In [110], a discriminative multi-task learning framework is developed that integrates a CNN with a 3-dimensional mean face model. This framework solves two issues in converting a generic detector into a face detector: anchor boxes are eliminated via the 3-dimensional mean face model, and the RoI pooling layer is replaced with a face-configured pooling layer.

4.3 Pedestrian detection

Generic Faster RCNN is modified in [111] for pedestrian detection: a downstream classifier of boosted forests is applied to high-resolution convolution feature maps and the RPN outputs to handle small instances and hard negative examples. Based on DPM [67], a DL framework called DeepParts is developed in [112] for addressing intricate occlusions within images; DeepParts makes decisions based on 45 fine-tuned DCNN part models and strategies such as part selection and bounding-box shifting. Another deep network, CompACT-Deep [113], combines hand-crafted features and fine-tuned deep CNNs to handle positive proposals with low IoU values and partial occlusion. Yet another deep CNN, the multispectral DNN [70], combines complementary information from color and thermal images for pedestrian detection.

5 Deep learning-based object tracking

Object detection is followed by the object tracking task. Based on the functionalities of DL, MOT methods are classified into three main categories: i) deep network feature-based MOT enhancement, ii) deep network embedding, and iii) end-to-end deep network learning. Generally, it is hard to obtain MOT results using a single network, as several inter-related sub-modules (i.e., detection, feature extraction, and matching) are essential for MOT. Besides, assumptions such as fixed distributions and the Markov property are made to achieve effective tracking performance. These three categories of MOT are explained in Sections 5.1-5.3.

5.1 Enhancement of MOT using deep network features

In this technique, the tracking framework uses semantic deep features instead of conventional handcrafted features to obtain effective tracking performance. The success of DNNs in image classification stems from their ability to learn deep features with rich semantic information. Such features are useful not only for image classification but also for other tasks, including object detection, image segmentation, and MOT.

In object detection and segmentation tasks, deep features are useful for region proposals. Similarly, in the MOT task, deep features are extracted from a deep CNN (AlexNet [45]) and used in MHT [114]. MHT holds multiple association hypotheses for each detected object and builds a hypothesis tree; a scoring function then determines the best hypothesis for a detection, yielding effective tracking performance. The MHT method is extended in [51] with appearance features of reduced dimension, obtained using a multi-output regularized least-squares method. To increase discrimination in the person re-identification task, a wide residual network (WRN) is introduced in [115]: l2-normalized, 128-dimensional deep features are extracted from the WRN and used for cosine softmax classification. These deep features are used to compute two distances (the minimum cosine distance and the Mahalanobis distance) between detections and existing tracks, and the minimum dissimilarity from a cascade of these two distances is used to match a detection with the appropriate track. This method achieves competitive online tracking performance in real time.
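In the spirit of the cosine-distance association just described, detections can be matched to tracks by building a cost matrix of cosine distances between appearance embeddings and solving the assignment with the Hungarian algorithm; the random embeddings below are stand-ins for the network's l2-normalized features, and the 128-dimensional size follows [115].

```python
# Track-detection association via cosine-distance costs and the Hungarian method.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
track_feats = rng.normal(size=(3, 128))   # one appearance embedding per track
det_feats = rng.normal(size=(4, 128))     # one embedding per new detection
track_feats /= np.linalg.norm(track_feats, axis=1, keepdims=True)
det_feats /= np.linalg.norm(det_feats, axis=1, keepdims=True)

cost = 1.0 - track_feats @ det_feats.T    # cosine distance matrix, shape (3, 4)
rows, cols = linear_sum_assignment(cost)  # minimum-cost track-detection pairs
print(list(zip(rows.tolist(), cols.tolist())))
# A real tracker would additionally gate matches whose cost exceeds a threshold
# and start new tracks for the unmatched detections.
```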

Feature learning aims to assess the commonalities between detections and tracks. With this goal, a Siamese CNN [116] with two similar branches is developed for feature learning. Siamese CNNs come in three categories: i) two branches joined by a single cost layer, ii) two branches sharing some common CNN layers, and iii) a double-stream network with stacked inputs. Based on a comparative study [116], the third category is found to be the best for extracting deep features. Both motion information and deep features are fused with a gradient boosting algorithm to solve the tracking problem. The first Siamese CNN architecture is utilized in [117] to learn track-let affinities, replacing the previous features from ILDA [118]. This architecture is extended in [119] to learn the association affinities between existing track-lets and detections. Here, tracking is formulated as a generalized linear assignment problem and solved using a soft-margin approach, with the hinge loss as the network's loss function. Both spatial and temporal information is required in distance learning for the MOT problem; to impart the effects of both constraints, segment-wise Mahalanobis distance-based matrices are used.
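
The following PyTorch sketch illustrates the weight-sharing principle behind such a Siamese affinity network, trained with a hinge-style pair loss as mentioned above. The layer sizes, names, and the specific distance are illustrative assumptions, not the exact architecture of [116] or [119].

```python
import torch
import torch.nn as nn

class SiameseAffinity(nn.Module):
    """Two identical CNN branches with shared weights; the embeddings of an
    image-patch pair are compared to score whether they show the same object.
    Layer sizes here are illustrative only."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, patch_a, patch_b):
        za = self.branch(patch_a)          # both patches pass through the
        zb = self.branch(patch_b)          # same weights (Siamese sharing)
        return torch.norm(za - zb, dim=1)  # small distance => same object

def hinge_pair_loss(dist, same, margin=1.0):
    """Hinge-style pair loss: pull matched pairs together, push mismatched
    pairs apart by at least `margin`. `same` is a boolean tensor."""
    return torch.where(same, dist, torch.clamp(margin - dist, min=0)).mean()
```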

As stated above, pairwise images may be used in a Siamese CNN to learn affinities. Such an architecture can also be used to learn optical flow features extracted by deep CNNs [47]. Optical flow features have proved efficient in on-line data association as well as tracking [120]. Compared to traditional algorithms, deep CNNs can produce more robust and smoother optical flow [47], and optical flow-based features are effective in enhancing tracking performance. In [119], a multi-cut framework is developed to construct a matching cost between detections and track-lets through deep matching features, and to enhance the association outputs. Direct matching between long-term track-lets and detections using deep optical flow can lose information about valid paths and thus fail to exploit them for tracking. Accordingly, the method is modified in [121], where lifted edges are added to encode re-identification deep features for multi-object tracking.

5.2 Deep network embedding-based MOT

In this category, deep CNNs are designed as the core part of the tracking framework. They are usually trained with samples obtained from tracking-related data. Here, a deep CNN is designed to assign classification scores to various track-lets; a deep binary classifier is then developed to indicate whether two detections belong to the same object or not. These deep network embedding-based MOT methods are mainly of three types, depending on the learning task: discriminative deep network learning, deep metric learning, and generative deep network learning. Let the corresponding MOT methods be referred to as DN-MOT, DM-MOT, and GN-MOT, respectively. These methods are explained in Sections 5.2.1, 5.2.2, and 5.2.3, respectively.

5.2.1 DN-MOT

In this approach, object trackers first optimize discriminative models and then search for the best locations in the following frames to associate detections with track-lets. The best locations are obtained according to these discriminative models. As deep CNNs are widely adopted for discriminative tasks, it is natural that discriminative deep network models are used in tracking. As an example, a particle filtering framework is proposed in [122] for MOT. To track each detected object, two CNN-based classifiers are developed. Features from different layers of a deep CNN (VGG16 [43])-based object detector (Faster RCNN) are fed to these classifiers as inputs to classify the detected object. The first classifier uses features from the region proposal to classify the object instance; the second classifier extracts features from the convolution layer and compares the classified object instance with the object's past features to determine whether they are similar. The confidence scores of the classifiers are used to evaluate the weights of the particle filter, and tracking is finally performed by particle filtering, as sketched below. A crucial issue with such a model is that the network is trained off-line, whereas the object's historical features are updated on-line.
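
The following minimal NumPy sketch (hypothetical names) shows how the two classifiers' confidence scores could re-weight the particles; it illustrates the mechanism described above, not the exact update rule of [122].

```python
import numpy as np

def update_particle_weights(weights, instance_conf, similarity_conf):
    """Re-weight particles by the product of the two CNN classifiers'
    per-particle confidences (instance classification and appearance
    similarity to the object's history), then renormalize."""
    weights = weights * instance_conf * similarity_conf
    return weights / weights.sum()
```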

Similar to [122], another MOT framework using object trackers is developed in [123]. Here, the tracker searches for the best candidate among image patches and neighboring detections. To handle occlusion, spatial features are learned from a visibility map using convolution and fully connected layers; these spatial maps improve the tracking accuracy. Moreover, to reduce the time complexity of this model, the RoI pooling layer map, instead of the whole image frame, is shared with the classifier for tracking. The main difference between the studies of [122] and [123] is that the former uses a category classifier, whereas the latter considers occlusion features for tracking.

In tracking, a deep CNN can be used either for classification tasks or for learning regression models. The task of object detection and tracking can be cast as a regression task and learned with the aid of DL [64]. A few MOT studies use regression models, since tracking precision can be enhanced by using a regression loss. For example, the bounding-box regression losses in [124] are used to improve the tracking performance. In [125], the tracking problem is treated as a bounding-box regression task using an RNN; however, this method can hardly handle occlusion and similar-object problems in the MOT task.
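
As an illustration of bounding-box regression, the sketch below implements the standard smooth-L1 loss on the usual (dx, dy, dw, dh) offset parameterization; the encoding follows common detector practice rather than the specific formulation of [124] or [125].

```python
import torch

def bbox_regression_loss(pred_offsets, target_boxes, anchor_boxes, beta=1.0):
    """Smooth-L1 loss on the standard (dx, dy, dw, dh) box parameterization.
    pred_offsets: (N, 4) network outputs; boxes are given as (cx, cy, w, h)."""
    # Encode ground-truth boxes as offsets relative to their anchors.
    t_xy = (target_boxes[:, :2] - anchor_boxes[:, :2]) / anchor_boxes[:, 2:]
    t_wh = torch.log(target_boxes[:, 2:] / anchor_boxes[:, 2:])
    targets = torch.cat([t_xy, t_wh], dim=1)
    diff = torch.abs(pred_offsets - targets)
    # Quadratic near zero, linear for large errors (robust to outliers).
    loss = torch.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta)
    return loss.sum(dim=1).mean()
```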

5.2.2 DM-MOT

In this category, deep metric learning-based methods are used for MOT. Training such MOT methods amounts to learning which track-let belongs to a specific detection and whether two detections belong to the same object, which can be viewed as an image-patch verification process. As in person re-identification [126] or face recognition [23], accurate affinity learning through a distance metric is adopted in DM-MOT methods. In [115], a deep metric learning network, called deep SORT, is designed and trained for person re-identification and MOT problems; here, motion features are fused with appearance features. Deep SORT is good at tracking objects of a single class, but it fails in multi-class object tracking. This is solved by the Multi-class Deep SORT (MCD-SORT) tracker [178]. Both motion and appearance features are used to make the correct association between a detected object and a track-let, but the search for this association is restricted to tracks of the same class, which increases the performance in multi-class tracking.
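
A minimal sketch of the class-gating idea behind MCD-SORT [178] follows: candidate tracks are filtered by predicted class before affinity matching. The data layout, greedy strategy, and function names are illustrative assumptions.

```python
def associate_multiclass(detections, tracks, affinity):
    """Greedy class-gated association: a detection may only be matched to
    tracks of the same predicted class. `affinity(det, trk)` returns a
    similarity score (higher = better)."""
    assignments = {}
    used = set()
    for det in detections:
        candidates = [t for t in tracks
                      if t["cls"] == det["cls"] and t["id"] not in used]
        if not candidates:
            continue  # no same-class track: treat as a newly detected object
        best = max(candidates, key=lambda t: affinity(det, t))
        assignments[det["id"]] = best["id"]
        used.add(best["id"])
    return assignments
```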

A Siamese network is developed in [124] for MOT. Here, quadruplets of image patches are first fed to the network as inputs; three distances are then measured between these image patches, and the network's output provides a ranking among them. Both motion and appearance features are fused in this network with the help of the distance metrics. A CNN based on the triplet loss is developed in [127] to learn the distance metric between track-lets and detections. In [128], it is shown how motion features can be learned using the difference between the LSTM prediction and the detections in the next frame.
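
For reference, the triplet loss used in such metric learning can be written as below; the margin value is an assumption.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull an anchor embedding toward a patch of the same object (positive)
    and away from a different object (negative) by at least `margin`."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```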

Instead of learning the distance metric between detections and track-lets, the investigation in [129] is based on learning the distance metric between two track-lets. The network extracts a set of features from track-lets for each detection. These features are fed to a gated recurrent unit (GRU) network, whose output is pooled temporally and used to build local features in Euclidean space. Based on the distance between the GRU network's outputs, several sub-track-lets are generated. These sub-track-lets are then re-connected into long trajectories with the help of the similarity between global and local features.

5.2.3 GN-MOT

In this approach, generative learning-based methods are used for MOT. This learning strategy is used in deep networks for appropriate parameter estimation, and for the MOT problem [130,131,132] it is used to increase tracking performance. In [133], the posterior probability of an object's movement and its appearance features, assumed Gaussian-distributed, are modeled through linear regression. The parameters of the regression model are learned with the help of a GRU network. The hidden layers of this network are updated after each frame is processed, and are utilized to evaluate the mean and the standard deviation of the distribution for the following frame. In tracking, the joint probability between motion and appearance features is calculated [120]; this joint probability is used to match track-lets with detections in the current frame. A greedy matching algorithm then determines the best associations between detections and existing track-lets. During this process, a preset threshold deletes matching results with low probability values, which reduces the computation time.
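
The greedy matching-with-threshold step described above can be sketched as follows; the matrix layout and the threshold value are illustrative assumptions.

```python
import numpy as np

def greedy_match(prob, threshold=0.5):
    """Greedy assignment on a (num_tracks, num_detections) joint-probability
    matrix: repeatedly take the globally best pair, then remove its row and
    column; pairs below `threshold` are discarded, saving computation."""
    matches = []
    if prob.size == 0:
        return matches
    prob = prob.copy()
    while True:
        i, j = np.unravel_index(np.argmax(prob), prob.shape)
        if prob[i, j] < threshold:
            break  # remaining candidates are too unlikely to be matches
        matches.append((i, j))
        prob[i, :] = -1.0  # this track is taken
        prob[:, j] = -1.0  # this detection is taken
    return matches
```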

An LSTM-based generative model is developed in [134] for prediction. This model consists of an encoder composed of stacked convolution layers, which takes a sequence of ten image frames as input and generates a pixel-wise probability map. The LSTM-based prediction module has two parts: short-term prediction, used to associate detections with track-lets, and long-term prediction, used for updating the trajectories. During this process, detections are generated through a Generative Adversarial Network (GAN). For a given frame, the associated detections are added to existing trajectories, and non-associated detections are considered newly detected objects. When a trajectory is not associated with any detection for more than ten frames, it is deleted from the tracking system.

5.3 End-to-end DL-based MOT

In this technique, DL networks are designed to produce the tracking results directly. The MOT problem has various stages: building the relationships between detections and track-lets, updating the existing trajectories, initializing new track-lets, and deleting trajectories from the tracking system based on some criterion. It is difficult to model all these stages within a single framework and learn them entirely. Of late, the tracking process has been simplified using some assumptions, and accordingly a few end-to-end learning approaches have been developed to implement these stages for MOT.

In the on-line MOT task, the states of track-lets can be estimated with the aid of a recursive Bayesian filter, and each new detection can then be associated with one track-let based on a maximum similarity score. A network, called RNN-LSTM, is developed in [38] to model the stages of MOT. All these stages, such as the state estimation of track-lets, new detections, their matching matrix, and existence probabilities, are embedded into this network, which outputs the updated trajectories. New probability scores corresponding to these trajectories are then computed to check whether a trajectory should be terminated. Here, LSTMs are used to calculate the matching matrix between track-lets and detections, and this matching matrix is used to train the RNN in an end-to-end fashion. The process shows promising tracking results over single-object tracking datasets only. The reasons are: i) this approach considers only motion information, ii) initialization and termination of trajectories do not use context information, and iii) the number of training images is not sufficient for learning this model.

The aforesaid issue is addressed in [135], where a hierarchical RNN model is designed to integrate different features, including appearance, motion, and their interaction features, for each tracked object. This model has three sub-LSTM networks that predict long-term motion features and extract contextual features and multi-frame appearance for track-lets. The features of all these networks are concatenated and fed to the top hierarchy layer of the RNN as input, to measure the matching scores between track-lets and detections in the current frame. For training, each LSTM network is pre-trained individually and fine-tuned after obtaining the results of the top LSTM network of the RNN; training is done in an end-to-end way. This model achieves better results than existing person re-identification methods. Six or fewer frames are used in the hierarchical RNNs to obtain optimal tracking results. This work is further extended in [136], where the detailed operation of the LSTM network for learning appearance features is explored. A multiplication layer is added between the input features and hidden states to realize a regression module, and thereafter a bilinear LSTM module is developed to associate detections with track-lets. This modified LSTM is good at dealing with appearance features only; therefore, a bilinear LSTM for appearance features and a conventional LSTM for motion features are combined to obtain the matching classifier. This combination, called the MHT framework, can be used for on-line tracking.

In globally optimized MOT, tracking can be modeled with the help of network flows and probabilistic graphs. In [137], a min-cost network flow-based DL method is designed for MOT. Here, the loss function is defined as the weighted l2 distance of edge labels. Min-cost network flows are thus built on different layers of the deep model and optimized. Experimental results reveal its effectiveness in global tracking [137]. It is therefore expected that graph model (network flow)-based global tracking algorithms can be extended with deep architectures. An overview of various object trackers, including method, network, and end-to-end training, is given in Table 2.

Table 2 Overview of various trackers

5.4 Deep network structure and training for tracking

A deep network has a huge number of parameters; it is therefore crucial to train the network accurately. Different network structures are utilized in the tracking process. Based on their functionality, deep network structures can be categorized into RNNs, CNNs, and their different integrations and variants. As the training strategy mainly depends on the network structure, we review here the different DL structures with their corresponding training strategies.

5.4.1 CNN-based MOT and training

CNNs are widely used in tracking due to their excellent capability in feature learning. During the training of a CNN, a task-specific objective function is defined and training data for holistic tracking is used. Object tracking follows object detection; therefore, CNNs are initially pre-trained for the object detection task and later fine-tuned for the tracking task.

In order to improve tracking performance, either conventional hand-crafted features are replaced with features extracted from CNN models [51, 138], or CNN models are trained using labeled datasets [115, 121]. Datasets used for training such CNN models include ImageNet [18] and person re-identification datasets, namely CUHK03 [139] and MARS [140]. For example, in deep SORT [115] tracking, the WRN is trained on the MARS dataset. In a real-time tracking context, a person re-identification model trained only on such MARS data may produce a huge number of mis-detections, partial detections, and false alarms. It is therefore necessary to train CNN models with real tracking data [116, 122, 124].

For some nested CNNs, such as STAM-MOT [123] and CNNTCM [119], it is hard to optimize the network by adopting end-to-end training. Therefore, sub-networks are first pre-trained and then cascaded one after another to obtain the whole network, which is then fine-tuned. STAM-MOT is developed using the VGG16 network [43], and it has three sub-networks: (i) a visibility map, (ii) spatial features, and (iii) a classifier. These sub-networks are pre-trained, and the whole network is fine-tuned once the tracking samples are collected. In CNNTCM, the sequence of images is split into a number of segments, and the whole network is fine-tuned using these segments.

5.4.2 RNN-based MOT and training

Unlike CNNs, RNNs are suited to sequence modeling: they are able to predict a tracking state based on historical information. Therefore, RNNs can yield more effective tracking performance than CNNs. However, training an RNN is always difficult, since the integration of both appearance and motion features in an RNN is somewhat involved. As with CNNs, the training of RNNs requires both pre-training of sub-networks and fine-tuning of the entire network.

The integration of an object's long-term motion and its appearance features is done using a combination of LSTM and RNN [38, 135, 136]. To learn the track-lets' state and prediction, and the matching probability between track-lets and object detections, modified RNNs and LSTMs are developed in [38]; both mean-square and log-likelihood errors are used for training. In [136], the LSTM and its bilinear version are used to accommodate various appearance features. Here, the LSTMs are first pre-trained individually with appearance and motion features, and then fine-tuned end-to-end using the training data. Of late, GRU-based RNNs have been used for tracking [129]. Here, regression is adopted for track-let prediction, and the GRU is trained by minimizing the log-likelihood error, as sketched below.
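
A minimal sketch of a GRU-based motion model trained by minimizing a (diagonal) Gaussian negative log-likelihood is given below; the architecture and state parameterization are assumptions in the spirit of [129] and [133], not the exact networks of those works.

```python
import torch
import torch.nn as nn

class GRUMotionModel(nn.Module):
    """Predicts a Gaussian over the next box state (cx, cy, w, h) from the
    history of a track-let; a sketch with illustrative layer sizes."""
    def __init__(self, state_dim=4, hidden=64):
        super().__init__()
        self.gru = nn.GRU(state_dim, hidden, batch_first=True)
        self.mean_head = nn.Linear(hidden, state_dim)
        self.logvar_head = nn.Linear(hidden, state_dim)

    def forward(self, history):            # history: (B, T, state_dim)
        _, h = self.gru(history)           # h: (1, B, hidden)
        h = h.squeeze(0)
        return self.mean_head(h), self.logvar_head(h)

def gaussian_nll(mean, logvar, target):
    """Negative log-likelihood of the observed next state under the predicted
    diagonal Gaussian (up to an additive constant) -- the training loss."""
    return 0.5 * (logvar + (target - mean) ** 2 / logvar.exp()).sum(dim=1).mean()
```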

6 Some results on object detection and tracking

In this section, we summarize the results of some well-known detectors, trackers, and their different combinations over various benchmark datasets, such as ImageNet [18], PASCAL VOC [19], MS COCO [20], MOT2015 [141], and MOT2016 [142]. These datasets are widely used in research because they enable standard comparisons between different algorithms and set goals for solutions. For each dataset, evaluation is done based on some specific performance metrics. The datasets and performance metrics are briefly described in Section 6.1. Based on the nature of the task, the results are categorized into detection results and tracking results, as explained in Sections 6.2 and 6.3.

6.1 Benchmark datasets and performance metrics for detection and tracking

For the detection task, static images are required, whereas for tracking, videos are required. Datasets, such as PASCAL VOC, MS COCO and ImageNet, are utilized for general object detection. MOT2015 and MOT2016 are used for tracking. All these datasets along with performance metrics are discussed in the following sections.

(a) PASCAL VOC:

The PASCAL VOC [19] has two series, PASCAL VOC 07 and PASCAL VOC 12. PASCAL VOC 07 has 5K training and 5K test images, whereas PASCAL VOC 12 has 5.7K training and 5.7K test images. Each series contains 20 object categories: person; bird, cat, cow, dog, horse, and sheep; aeroplane, bicycle, boat, bus, car, motorbike, and train; and bottle, chair, dining table, potted plant, sofa, and tv/monitor. These 20 categories fall into 4 main branches: vehicles, person, animals, and household objects. In the PASCAL VOC datasets, bounding boxes are labeled over 27,000 objects. Some examples of annotated images are shown in Fig. 3.

Fig. 3 Annotated sample images from the PASCAL VOC dataset [1]

(b) MS COCO:

The Microsoft Common Objects in Context (MS COCO) dataset [20] is created for two specific tasks: object detection and segmentation. The dataset contains 91 object categories, of which 82 have more than 5000 labeled instances each. These labeled samples cover all 20 object classes present in the PASCAL VOC datasets. In total, the dataset consists of 2,500,000 labeled instances in 328,000 images. MS COCO concentrates on varied viewpoints and real-world instances (i.e., objects from natural environments), resulting in rich contextual information. Three categories of images in the MS COCO dataset are shown in Fig. 4.

Fig. 4 Image samples from the MS COCO dataset [1]

(c) ImageNet:

ImageNet [18], whose detection challenge is known as ILSVRC2014, is another important large-scale dataset. It has 200 object classes, nearly 450k training images, 20k validation images, and 40k test images, and is used for the object detection task.

(d) MOT:

This dataset has 11 videos, each containing one or two object classes, namely person and car, and is used widely in state-of-the-art MOT approaches. MOT has two parts: MOT2015 [141] and MOT2016 [142]. MOT contains sequences of images from diverse scenarios with different distributions for pedestrian detection.

MOT2015 and MOT2016 consist of 22 and 16 video sequences, respectively. For each benchmark, half of the sequences are used for training and the rest only for testing. These videos are captured at various low and high frame rates, from both moving and static platforms. Other issues, such as illumination, occlusion, and weather conditions, are also varied during the capture of these videos.

(e) Performance metrics:

The performance metrics used in object detection and tracking tasks are:

  (i) Mean average precision (mAP, %): the mean of the average precision scores over all categories.

  (ii) Multi-Object Tracking Accuracy (MOTA, %): the overall tracking accuracy in terms of false positives, false negatives, and identity switches (a computational sketch follows this list).

  (iii) Identity switches (IDS): every trajectory is assigned one ID; IDS is the number of times two trajectories exchange their IDs.

  (iv) Multi-Object Tracking Precision (MOTP, %): the average alignment (overlap) between predicted bounding boxes and the ground truth.

  (v) Mostly tracked targets (MT, %): the percentage of ground-truth trajectories covered by a track hypothesis for at least 80% of their life span.

  (vi) Mostly lost targets (ML, %): the percentage of ground-truth trajectories covered by a track hypothesis for at most 20% of their life span.

  (vii) Speed (frames per second, fps): the number of frames processed per second in detection and tracking.

Metrics mAP and Speed are used for object detection, while MOTA, IDS, MOTP, MT, ML, and Speed are used for object tracking.
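
To make the MOTA definition concrete, the sketch below computes it from the aggregated error counts; the formula MOTA = 1 − (FP + FN + IDS)/GT, summed over all frames, is the standard one.

```python
def mota(num_fp, num_fn, num_idsw, num_gt):
    """Multi-Object Tracking Accuracy: one minus the ratio of all error
    events (false positives, misses, identity switches) to the number of
    ground-truth objects, summed over all frames."""
    return 1.0 - (num_fp + num_fn + num_idsw) / float(num_gt)

# Example: 120 false positives, 380 misses, and 25 ID switches over
# 5000 ground-truth boxes give MOTA = 1 - 525/5000 = 0.895, i.e., 89.5%.
```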

6.2 Analysis of existing general object detection methods

Tables 3 and 4 summarize the performance of various object detectors on the MS COCO and PASCAL VOC datasets, respectively. Both PASCAL VOC and MS COCO are widely used large image databases for object detection and classification. These two public datasets contain large numbers of annotated images and object classes, characterizing varied viewpoints and real-world instances of different kinds of objects from natural environments. As a result, researchers get rich information for training, validation, and testing of their deep models. Accordingly, we have adopted them for the object detection and tracking problems, and for comparing the performance of different models.

Table 3 Detection results of various general object detectors over MS COCO test-dev dataset
Table 4 Detection results of various detectors over PASCAL VOC dataset

In our study, the results of various detectors (e.g., Faster RCNN, Mask RCNN, YOLO, YOLOv2, YOLOv3, SSD, DSSD, FPN, R-FCN, and DCN), trackers (e.g., AMIR, Deep SORT, MHT-DAM, CDA-DDAL, RNN-LSTM, QuadMOT, STAM-MOT, and Siamese CNN), and their different combinations are compared in depth to obtain the best detector-tracker model. These models have different characteristics. For example, Faster RCNN, Mask RCNN, FPN, and R-FCN are widely used two-stage detectors, whereas YOLO, SSD, DSSD, and DCN are among the most advanced one-stage detectors. Among the aforesaid trackers, AMIR and RNN-LSTM are categorized as end-to-end DL-based trackers; MHT-DAM, CDA-DDAL, and Siamese CNN are widely used deep features-based trackers; and QuadMOT, STAM-MOT, and Deep SORT are advanced deep embedding-based trackers. All these detectors and trackers are top-ranked, state-of-the-art models widely used in computer vision; we have therefore adopted them for our comparative study. The results can help researchers who intend to use existing deep models for object detection and tracking, and can also serve as baselines for comparing newly designed models.

We have collected these results from various research papers. From Table 3, it is evident that the typical baseline architectures increase accuracy through the extraction of rich features (i.e., multi-scale and multi-level features) of objects of different sizes. As an example, with VGG16 as the backbone on the MS COCO test-dev dataset, the mAP of RefineDet512 (512×512 input) exceeds that of RefineDet320 (320×320 input) by 3.6%. Two-stage detectors, such as Faster RCNN, Mask RCNN, and FPN and its variants, achieve higher mAP scores than one-stage detectors (e.g., YOLOv2, YOLOv3, SSD, DSSD, and RefineDet); on the other hand, one-stage detectors achieve higher speed. In addition, integrating one-stage and two-stage detectors in one model achieves higher accuracy and speed than either achieves individually. For example, the integrated networks DCNv2+Faster RCNN [80] and NAS-FPN [57] with ResNet backbones achieve the highest detection accuracy on the MS COCO test-dev dataset.

Testing results of various detectors on the PASCAL VOC dataset are shown in Table 4. It is seen that the region proposal network (RPN) enhances detection accuracy compared with conventional region proposal methods (see 85% mAP vs. 78.2% mAP in the 13th and 5th rows of Table 4). More training data results in higher detection accuracy: with the VGG16 network as backbone, Faster RCNN trained on VOC07+VOC12+MSCOCO provides better detection accuracy than Faster RCNN with the same backbone trained on VOC07+VOC12 only. Rich features always provide good results: R-FCN with a ResNet101 backbone is superior to R-FCN with ResNet50 in detection accuracy, and R-FCN is superior to the other two-stage detectors on the PASCAL VOC dataset.

It may be mentioned that deep networks yielding high mAP scores also require high computation time, i.e., they process frames slowly (low fps). For example, DCNv2+Faster RCNN [80] in Table 3 (10th row), which provides mAP = 44.8%, can process only five frames per second (fps = 5), whereas YOLOv2 [46] (11th row) has fps = 45 but yields an mAP of only 33%. Similarly, NAS-FPN [57] (24th row) and RefineDet320+ [76] (20th row) have mAP scores of 48% and 38.6%, with corresponding fps values of 5 and 40.2. That is, there is a trade-off between detection speed and accuracy.

Nothing is free!

Striking a balanced compromise between these two performance indices, depending on the problem and the need, constitutes a big challenge. Here lies the significance of the Granulated CNN [3] (last row), where one can dictate this balance by changing the granule size.

6.3 Results of tracking methods

Comparative performances of some popular trackers on the MOT2015 and MOT2016 datasets are shown in Tables 5 and 6, respectively, based on results available in the existing literature. From the MOT2015 results (Table 5), the end-to-end DL approaches (e.g., MHT-bLSTM and RNN-LSTM) are seen to provide overall better results. Deep network embedding approaches involving deep metrics (e.g., Siamese CNN and DAN) outperform (in terms of fps) the other approaches that use deep features only as representation, except AP-HWDPL. From Table 6, the global optimization methods, namely LMP and GCRA, are seen to outperform the others, including the end-to-end RNN-based models. Further, the MOTA metric deviates less across trackers for MOT2016 data than for MOT2015. This is because object detection on MOT2016 is more stable than on MOT2015.

Table 5 Tracking results over MOT2015 dataset
Table 6 Tracking results over MOT2016 dataset

DL-based trackers with higher-order appearance and motion features are seen to be more stable and robust. For instance, the AMIR tracker is more stable than the LMP tracker: the former is an end-to-end RNN-based tracker involving more features than the latter, which is a globally optimized method with lifted edges. Comparative results for various combinations of detectors and trackers are shown in Table 7. From this table, it is evident that the combination of Faster RCNN and Deep SORT is superior to the other combinations according to all the tracking evaluation metrics.

Table 7 Results of (Detector + Tracker) over MOT2015 dataset

7 Conclusions

In this study, we have provided a detailed review primarily of various deep learning (DL)-based models for generic object detection, specific object detection, and object tracking, considering detection and tracking both individually and in combination. Some key observations on DL-based generic object detection are as follows. The baseline deep architecture of two-stage detectors enhances accuracy by extracting richer features of objects and adopting multi-level and multi-scale features for detecting objects of different sizes. By defining the focal loss function appropriately, one-stage detectors are able to filter out the easy samples (background), thereby greatly reducing the number of target proposals and improving, in turn, the detection speed and precision; this may be applicable to two-stage detectors too. Combining one-stage and two-stage detectors produces better results than either obtains individually. Adopting deformable convolution layers is an effective way to address geometric variation in image frames. Modeling the relationships between different objects in an image, as expected, improves detection performance. Incorporating granulation within the deep learning model improves the computational speed with little sacrifice in detection accuracy.

Some salient observations on DL-based specific object detection are as follows. CNNs facilitate the extraction of salient information from local regions of an image frame. Modeling the visual saliency along the boundaries of different regions using super-pixel segmentation improves CNN performance in occlusion detection. Extraction of multi-scale deep features is significant for characterizing the local context of images. Strengthening the local connections (weight parameters) between different CNN layers, based on local and global information from images, improves object detection.

Similarly, for object tracking, end-to-end DL-based methods are superior to deep feature-based and deep embedding-based methods. Generative networks exhibit outstanding tracking results compared to discriminative networks. Learning higher-order features, or transferring on-line features, is expected to provide good tracking performance in complex environments. Object tracking using higher-order appearance and motion features is seen to be more stable and robust. Finally, the combination of Faster RCNN and Deep SORT is seen to be superior to other combinations in terms of both speed and accuracy, as per the indices considered.

8 Discussions: applications, challenges, and concerns

DL-based object detection and tracking is growing rapidly due to the continuous upgrading of powerful computing equipment. Object detection is followed by object tracking; therefore, tracking accuracy primarily depends on the accuracy of detecting objects over video frames. Comparative studies among various popular detectors and trackers, as well as their different combinations, have been provided in detail. These comparisons are made in terms of both the characteristic features of the models and their performances. In this section, we discuss some current applications and trends of object detection and tracking in different domains, along with several pertinent challenging issues for future investigation. Finally, certain concerns for researchers are mentioned.

8.1 Object detection: applications and challenges

Object detection has been widely applied in various fields, including the military, security, transportation, medicine, and daily life. These are briefly explained, citing references, as follows:

8.1.1 Security

In security, the most popular applications include face detection [155], pedestrian detection [156], and anomaly detection [157]. The objective of face detection is to detect human faces in an image; facial landmark localization, head-pose estimation, and gender recognition are three main components of face detection. Readers may refer to the survey paper [10] for more details about face detection, including the application of DL. Pedestrian detection means detecting pedestrians in natural scenes; for more details, refer to the survey [12]. Anomaly detection has various applications, such as fraud detection, road safety, and health-care monitoring; a good survey is provided in [157].

8.1.2 Military

The military field involves various tasks, for example, object detection using remote sensing [158], topographic survey, and flying-object detection. In remote sensing object detection [158], objects are detected from remote sensing images/videos. This task has two challenges. First, the target size is extremely small, which makes the object detection procedure very time-consuming (i.e., too slow) for practical use. Second, the complex backgrounds often result in false detections. Due to the dearth of information in remote sensing object detection, strong pipelines like Faster RCNN, SSD, FCN, and YOLO do not work well in this domain. Therefore, remote sensing object detection remains a hot research topic. For more details, readers are referred to the survey [159].

8.1.3 Transportation

Object detection in the transportation field involves various applications, such as license plate recognition, automatic driving, and traffic sign recognition. License plate recognition is required for residential access control and for detecting traffic violations. Various features, such as edge, texture, morphology, and sliding concentric windows, are integrated using connected component analysis to make license plate recognition more robust [160]. Recently, DL has been adopted for license plate recognition [161] as well; one may refer to [162] in this context. Sensor fusion is utilized in [163] to obtain features for autonomous driving; the survey [164] provides more details.

8.1.4 Medical

Medical image detection, cancer/disease detection, and health-care monitoring represent some applications of object detection in the medical field. A domain adaptation framework [165] is required for medical image detection. Computer-Aided Diagnosis (CAD) can assist doctors in classifying various types of cancers. Recently, CNNs have been trained with a large-scale glaucoma dataset for glaucoma detection [166]. Two recent survey papers [167, 168] may be referred to.

8.1.5 Life

Applications such as pattern detection, event detection, rain/shadow detection, image caption generation, and species identification represent some key tasks here. Event detection aims to detect real-world events from Internet news about festivals, disasters, talks, and elections, among others; one may refer to the survey in [11] for further information. Research on appropriate pattern detection is challenging due to several factors, including pose variation, scene occlusion, varying illumination, and sensor noise. To achieve promising results, some researchers have designed strong baseline architectures for pattern detection in 2D images [169] and 3-dimensional point clouds [170]. In image caption generation, the computer automatically generates a caption for a given image: the semantic information of the image is captured and expressed using natural language. Both computer vision and natural language processing technologies are used for image caption generation, and it remains a major challenge. The issue is handled by adopting encoder-decoder frameworks, multi-modal embedding, attention mechanisms [169], and most importantly, reinforcement learning [171]; the survey article [171] provides more details. A DL architecture is also designed in [172] for rain detection from images.

The aforesaid applications are just some example applications of DL. There are several other domains where the merits of DL technology are being explored.

8.1.6 Challenging issues

Although the achievements of object detection in various fields are enormous, much scope for further improvement remains. This includes: i) combining single-stage and two-stage detectors for object detection, ii) exploring post-processing methods for improving object detection, iii) developing weakly supervised object detection (WSOD) algorithms, iv) designing unsupervised frameworks for intelligent detection systems, v) developing multi-domain object detectors, vi) adapting multi-task learning to object detection, vii) fusing multi-source information, viii) exploring GAN-based detectors when labeled images are scarce, and ix) making use of cell phone-based family diagnostic tools. Besides these, there are some higher-level challenging issues leading to much broader and deeper future scopes of DL research, as follows:

  (a) One may note that granular computing (GrC) has recently drawn the attention of researchers in designing intelligent systems in general. Its application, based on rough-fuzzy sets, to image processing and to object detection and tracking has been evident [28, 29, 87, 173] for dealing with uncertainties arising from, say, overlapping, occlusion, and the sudden appearance of objects, among others. Since GrC is reputed for computational gain, attempts [3, 178] have recently been made to integrate it judiciously with deep CNNs in order to make the CNN computationally fast while sacrificing little detection accuracy. The formation of granules dictates the extent of compromise, or trade-off balance, between speed and accuracy; this is therefore a challenging issue for future researchers.

  (b) Z-numbers, as introduced by Zadeh in 2011 [71], summarize the meaning of a natural language expression in terms of its qualitative aspect and embedded uncertainty. They may be used to design a framework for the quantitative abstraction of information in describing the output scene of deep networks for video-object detection [3, 179,180,181]. Exploiting the merits of Z-numbers in modeling the interpretability of the output in natural language therefore constitutes another challenge.

  (c) It may be mentioned that fuzzy sets and rough sets are reputed for input/output representation and for learning network parameters [174, 175] when the input data is vague, linguistic, ill-defined, or incomplete. These characteristics may therefore be crucial in designing DNNs for ambiguous situations, and thus need to be explored.

  (d) Further, ANN-based models for machine learning are known as "black-box" models, where even their designers may not be able to explain why the AI arrived at a specific decision. The technical challenge of explaining AI-based decisions is sometimes known as the "interpretability" problem. Deep learning models, being complex AI systems, are naturally non-interpretable; this leads to the issue of the trustworthiness of the output solution. Here comes the necessity of explainable AI systems, i.e., explainable deep models that let a user understand the AI's cognition so as to determine when to trust or accept the output solution and when to discard it. To make this explanation convenient in natural language, fuzzy set theory may be used. One may refer to [176] in this context, concerning the basic concepts for generating linguistic rules that explain the output decision in terms of the input features.

8.2 Object tracking and issues

The task of object tracking aims to detect specific objects in a static image frame and then estimate their moving trajectories over video frames. Object tracking follows object detection; therefore, the difficulties in the tracking task mainly arise from: i) incorrect or imprecise object detection, ii) deciding whether an object is a true newcomer or not, iii) proper association between detections and track-lets, and iv) the occurrence of false alarms. There are further issues concerning tracking. Although a large number of studies have addressed the MOT problem for a single class, the multi-class problem is not yet much explored. Task-specific deep networks are effective in tracking, but they are not suitable for complex conditions. Learning deep networks with higher-order features is required to increase tracking performance. Scenario learning is required to differentiate moving objects from the background and to promote motion prediction; this is useful for moving platforms. End-to-end DL-based tracking approaches output far more false negatives than false positives [135]. All these may constitute a part of future investigations.

8.3 Some concerns

While developing AI and DL technologies for various applications in data science, one may observe their evolution through related technologies/disciplines over the decades, more or less as:

Pattern Recognition (1960’s) → Image Processing (1970’s) → AI/ML/Artificial Neural Networks (1980’s) → Knowledge Based System (1990’s) → Data Mining (2000’s) → Big Data (2010) → Deep Learning and Data-driven Science (2017).

At each evolution of the mother subject, Pattern Recognition (PR), new approaches were developed for its different tasks to handle the varying nature of data as well as decision-making problems. New terms and technologies were accordingly coined, with big hopes. However, a beginner should not suddenly jump into the new technologies without knowing their background theories adequately. For example, to understand DL, one should know artificial neural networks (ANN) and ML (shallow learning); and to understand the latter, one should have a thorough knowledge of pattern recognition. Otherwise, the result may be dissatisfaction and mere blaming of DL technology and CNNs. One may remember, in this context, what happened to ANN research when it revived in the 1980s with big expectations: a lot of R&D (research and development) funding was invested in academia and industry, several new journals appeared, and so on. But within a span of about twelve to fifteen years, the subject lost interest at a rate similar to that of its growth. One of the main reasons was overly quick expectation, without developing the science behind the functioning of this "black-box" system, and the attempt to apply the same set of models or frameworks, be they supervised or unsupervised, to almost every application domain without studying their relevance or recognizing that new application-specific models might have been needed.

One hopes that learning from that previous example will prevent a recurrence of similar feelings in Deep Learning research!