1 Introduction

One of the major tasks in the field of computer vision is to help machines, such as robots, computers, drones, and vehicles, perform the main tasks of the human vision system, such as image comprehension and motion analysis. To realize such intelligent motion analysis, many works have attempted to track visual objects, which has become a high-demand research area in real-time computer vision. The main step of visual tracking is to estimate the trajectory model (i.e., position, direction, shape, etc.) of a tracked object in each scene of a video sequence. A robust tracker assigns consistent labels to the target objects in successive scenes. In short, visual tracking is an operation that seeks to locate, detect, and define the dynamic configuration of one or more objects in the video sequences of one or several cameras.

In recent years, researchers worldwide have been motivated by a broad range of real-world applications, including human activity recognition, video monitoring, visual compression, traffic control surveillance, and human–computer interaction. There are three fundamental stages in scene analysis: moving object detection, object tracking from scene to scene, and object analysis to observe the behavior. Therefore, the development of object tracking is relevant to many purposes. For instance, object tracking has been employed to enhance the analysis of human activities [1]. In the field of intelligent traffic systems, many object tracking techniques have been proposed to address tasks such as traffic surveillance [2]; accident avoidance, especially at traffic intersections [3, 4]; and pedestrian counting [5, 6]. Moreover, the MPEG-4 video compression standard [7, 8] exploits object tracking techniques to allocate more encoding bytes to moving objects in the scene and fewer encoding bytes to the remaining redundant background. Currently, among the most active applications is human–computer interaction, such as hand gesture recognition [9], which requires a powerful visual tracking mechanism. Tracking has also motivated companies such as Sony and Intel to develop cameras appropriate for visual monitoring, like omnidirectional cameras [10, 11] and smart cameras [12]. Additionally, visual tracking is used in modern medicine, where trackers are employed to follow the paths of protein stress granules in cells and discover characteristics of the cell structure [13]. Furthermore, military guidance utilizes visual tracking [14], as in rocket steering, individual combat systems, unmanned aerial vehicle (UAV) flight control, and radar detection.

The aforementioned works demonstrate a wide and growing interest in visual tracking across successive video scenes. Moreover, we can directly observe that these applications strongly depend on the results achieved by the object tracking method. If a tracking method yields inaccurate outputs and unstable results, it cannot be used for such applications. Therefore, the key to growing these applications is to overcome the problems associated with visual tracking. In addition, online robust visual tracking techniques are in high demand, and many works are being developed to deal with online performance. Unfortunately, several challenges make visual tracking of objects complex. To create a robust visual tracking system, the following difficulties need to be considered:

  1. (i)

    The appearance of the object can change with its position and viewing angle, and the object can appear over a large range of scales and distances.

  2. (ii)

    The object could be tracked in highly dynamic scenes. The camera and tracked object are in motion, which makes tracking and analysis of the movement difficult.

  3. (iii)

    Real-time processing is one of the main difficulties. A system should have high-speed performance to work with live scene sequences.

This paper does not review all existing works in visual tracking, as many algorithms have been published every year since the 1990s. Moreover, comparing different trackers is a non-trivial task. For these reasons, the paper only evaluates and compares some state-of-the-art works on visual tracking. As such, the paper will help researchers, especially newcomers, understand the performance of most of the existing trackers they need in order to compare their tracker results with respect to current issues in visual tracking. Another goal of the paper is to highlight the status of visual tracking, present the challenges associated with the trackers, and outline the research direction of recent publications. In addition, we discuss the rapidly rising technique in the tracking community, deep learning, and more specifically, convolutional neural networks (CNNs). We cover many aspects, including measurement analyses, classifications, in-demand implementations, and the future potential of these techniques.

Our work investigates online-learning tracking approaches, for which only a bounding box in the first frame is required. Such approaches can exploit adaptive appearance models, which aim to accommodate continuous target deformations, and they must also guard against the drifting issue. Pre-trained tracking approaches, in which an object is specified before system startup, are not discussed; their performance relies on the sequential video frames as well as on the training data, which can be considered another problem. Furthermore, offline tracking, which refines the whole trajectory by scanning forward and backward through the video, is not considered in our work. Offline tracking is generally driven by the needs of medical applications, whereas we consider the broader application domains of online tracking.

The paper is organized as follows. In Sect. 2, the principle of tracking is briefly reviewed. In Sect. 3, the paper discusses common and milestone visual tracking techniques. Then, CNN-based tracking is discussed in Sect. 4, and experiments and analysis are presented in Sect. 5. Finally, the concluding remarks and summary are given in Sect. 6.

2 Principle of tracking

This section discusses the challenges that impact visual tracking. In addition, we present the different metric methods that are commonly applied to test the output of visual tracking approaches. We further discuss the principles of the classifications of visual tracking based on its applications and methods [15].

2.1 Challenges in visual tracking

Many issues appear when tracking through sequential frames that can potentially cause the failure of object tracking. Below, we consider a group of the most common challenges. Figure 1 summarizes these tracking issues.

Fig. 1

Different tracking challenges

Illumination changes. The light surrounding an object can vary continuously from area to area for many reasons, such as indoor lighting (turning lights on or off) or outdoor conditions (weather, time of day). Shadows also affect the lighting conditions. Similarly, reflectance or transparency effects can appear in the scene, and their occurrence depends on the incident light and the viewing angle. These variations create different color distributions on the tracked object over time, which disturbs the performance of the tracking mechanism.

Camera motion effects. Many applications require embedding one or more cameras, as in the case of vehicles, drones, and body-mounted systems. In such applications, it is difficult to distinguish the motion associated with the tracked object from the motion associated with the embedded camera. An embedded camera can also produce irregular motion that cannot be modeled. This motion, in some cases, causes motion blurring, which deforms the image details, and therefore, decreases the visual tracker performance.

Cluttered environment effects. Cluttered scenes are created by either additional objects, especially objects that are similar to the tracked object, or highly textured backgrounds. This confuses the visual tracking algorithm and results in output drift.

Changes in object model. The tracked object has some geometric degradation, because the frames are projected from the 3D space onto the 2D plane. In other words, the object shape is changed based on 3D-to-2D projection, and therefore, some information is lost.

Effects of frame quality. The sensors and acquisition conditions impact the quality of the consecutive video frames. When the video sequence has been compressed, block artifacts can be observed, which causes the visual tracker to yield an undesired output.

Occlusion effects. Other unwanted objects in the scene may occlude the tracked object. The tracker encounters difficulties in that case, because the tracked object can be hidden, either partially or completely, and sometimes, the most important parts of the object can be hidden from the tracker scene.

Disappearance effects. The tracked object may enter the scene but leave temporarily due to its motion. In another case, the object may be visible across two or more cameras without overlapping views (for instance, a person can enter by one door and leave from another). In these cases, the visual tracker should memorize the tracked object and be able to find it upon its reappearance in the scene. The difference between occlusion and disappearance is that in the former, the object is covered by another unwanted object but is still in the scene (e.g., the object is walking behind a wall or the target person is standing behind a tree). In the case of disappearance, the tracked object is removed completely from the scene for a while, but reappears in either the same or another camera's scene.

Abrupt effects in motion. The motion rate of the tracked object can change abruptly over time. This change can be unpredictable, and therefore, the tracker can misplace the object location due to incorrect location prediction.

Similar appearance effects. When the video sequence frames have objects similar in appearance to the target object, as in the case of tracking vehicles on the road, the differentiation between the correct object and the similar objects becomes a difficult challenge for the visual tracking algorithm.

A robust visual tracking technique is required to withstand the above-mentioned issues appearing in successive frames, which are of strong interest to researchers. Moreover, the tracker must deliver high-quality results at a low computational and time cost. To the best of our knowledge, no current approach satisfies all of these requirements. Therefore, future tracking approaches must still address these challenges in applications such as driver assistance systems, vehicle navigation, traffic surveillance, video player analysis, activity-based recognition, human–computer interfacing, and motion analysis. Visual tracking algorithms can be classified based on their applications, as discussed next.

2.2 Classification of visual tracking algorithms

Classification by camera movement. Visual tracking algorithms can be categorized based on the condition of the camera, which is either stationary (static) or non-stationary (moving). The background is unchanging in the case of a stationary camera; therefore, the foreground and background can be segmented simply. Several works have been presented to segment the foreground and background, such as the mixture of Gaussians (MOG) [16, 17]. In the case of a moving camera, segmenting the foreground from the background is a complex step, because both are changing.

Classification by scene. Visual tracking has two scene scenarios in existing applications. The first is tracking using a single scene, and the second is tracking across multiple scenes. Single-scene tracking depends on one camera to track the target object, whereas multi-scene scenarios depend on a network established by multiple cameras to track the object [18]. In the case of multiple scenes, a unique identifier is assigned to the object, and the object is tracked continuously using the fused images.

Classification by number of moving objects. Visual object tracking can be divided into two classes based on the number of moving objects: single object and multiple objects. Generally, multiple-object tracking is more difficult than single-object tracking. However, both should perform the steps of object detection correctly and object extraction precisely from the video frames. Several factors affect the result of the detection and extraction steps, such as noise, background clutter, and illumination. Multiple-object tracking should also tolerate merging, detection errors, and occlusion among these objects. In contrast, in the case of single-object tracking, we define one object as the target object and all other objects as background.

Evaluation metrics of visual tracking. Visual tracking algorithms can be compared based on qualitative metrics. Unfortunately, qualitative metrics are insufficient, especially when two or more approaches have similar results. Therefore, quantitative methods have been used, and many quantitative metrics for testing the efficiency of trackers have been adopted. Typically, visual tracking performance is compared against the ground truth. This subsection presents the quantitative measures most commonly applied in object tracking. The three main types of error in tracking are deviations, false positives, and false negatives. In the deviation case, the error in the object location is computed relative to the ground truth. In a false positive result, the marked object is not a target object. In a false negative result, the object is missed although it is in the scene.

The overlap between the detected and ground-truth object is calculated based on PASCAL [19]:

$$ \frac{{T^{i} \cap {{GT}}^{i} }}{{T^{i} \cup {{GT}}^{i} }} \ge 0.5 $$
(1)

where $T^i$ is the tracked location in scene i and $GT^i$ is the ground-truth location in the same scene. If Eq. (1) is satisfied, the tracking result is considered consistent with the ground truth. Many researchers also use the PASCAL overlap without a threshold, in the form of the Dice similarity metric [20]. However, most studies apply a threshold, because it makes the metric easy to compute over large video collections.
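To make the overlap criterion concrete, the following minimal Python sketch (with illustrative function names that are not taken from any particular tracking toolkit) computes the intersection-over-union for axis-aligned boxes given as (x, y, width, height) tuples and applies the 0.5 threshold of Eq. (1).

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Coordinates of the intersection rectangle
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def is_correct(tracked_box, gt_box, threshold=0.5):
    """PASCAL criterion of Eq. (1): the frame counts as correctly tracked
    if the overlap with the ground truth reaches the threshold."""
    return iou(tracked_box, gt_box) >= threshold

# Example: a tracked box shifted by 10 pixels from a 100x100 ground-truth box
print(is_correct((10, 10, 100, 100), (0, 0, 100, 100)))  # True (IoU ~ 0.68)
```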

Another popular metric comprises precision and recall.

$$ {\text{Precision}} = \frac{{n_{{tp}} }}{{\left( {n_{{tp}} + n_{{fp}} } \right)}} $$
(2)
$$ {\text{Recall}} = \frac{{n_{{tp}} }}{{\left( {n_{{tp}} + n_{{fn}} } \right)}} $$
(3)

Here, $n_{tp}$, $n_{fp}$, and $n_{fn}$ are the numbers of true positives, false positives, and false negatives in a sequence, respectively. Precision and recall can be combined in the F-score [21]:

$$ F = 2 \cdot \frac{{{\text{Precision }} \cdot {\text{Recall }}}}{{{\text{Precision }} + {\text{Recall }}}} $$
(4)

The F1-score [22] is considered when the overlap area is taken into account:

$$ {{F}}1 = \frac{1}{{N_{\text{frames}} }}\mathop \sum \limits_{i} \left( {2 \cdot \frac{{p^{i} \cdot r^{i} }}{{p^{i} + r^{i} }}} \right) $$
(5)

Here, $p^i$ and $r^i$ are defined as:

$$ p^{i} = \frac{{\left| {T^{i} \cap {\textit{GT}}^{i} } \right|}}{{T^{i} }} $$
(6)
$$ r^{i} = \frac{{\left| {{{T}}^{i} \cap {{GT}}^{i} } \right|}}{{{{GT}}^{i} }} $$
(7)

Equations (5)–(7) together measure the average coverage between the tracked object region and the ground-truth region.

We calculate the object tracking accuracy (OTA) metric as follows:

$$ {\textit{OTA}} = 1 - \frac{{\mathop \sum \nolimits_{i} \left( {n_{{fn}}^{i} + n_{{fp}}^{i} } \right)}}{{\mathop \sum \nolimits_{i} g^{i} }} $$
(8)

Here, $g^i$ denotes the number of ground-truth bounding boxes in frame i. The OTA measures how well the tracked object patches match the ground-truth patches. In the same manner, object tracking precision (OTP) can be expressed with a precision similar to that of Dice [23]:

$$ {{OTP }} = \frac{1}{{\left| {M_{s} } \right|}}\mathop \sum \limits_{{i \in M_{s} }} \frac{{\left| {T^{i} \cap {{GT}}^{i} } \right|}}{{\left| {T^{i} \cup {{GT}}^{i} } \right|}} $$
(9)

Here, $M_s$ refers to the set of frames in a video in which the tracked object area overlaps the ground-truth area. The average tracking accuracy (ATA) is also similar to OTP [24].

$$ {{ATA }} = \frac{1}{{N_{{frames}} }}\mathop \sum \limits_{{i \in M_{s} }} \frac{{\left| {T^{i} \cap {{GT}}^{i} } \right|}}{{\left| {T^{i} \cup {{GT}}^{i} } \right|}} $$
(10)

The method in [25] applied Deviation, in which the central location error is used as a metric for tracking accuracy:

$$ {\text{Deviation}} = 1 - \frac{{\mathop \sum \nolimits_{{i \in M_{s} }} d\left( {T^{i} ,{{GT}}^{i} } \right)}}{{\left| {M_{s} } \right|}} $$
(11)

Here, $d(T^i, GT^i)$ is the normalized distance between the centroids of the patches $T^i$ and $GT^i$ [26].
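As a worked illustration of Eqs. (5)–(9), the sketch below aggregates per-frame box overlaps into the F1, OTA, and OTP scores for a whole sequence. The per-frame false-positive, false-negative, and ground-truth counts are assumed to be supplied by the detection stage and the annotation; the helper and variable names are ours, not from any benchmark toolkit.

```python
def intersection_area(box_a, box_b):
    """Overlap area of two axis-aligned (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    return iw * ih

def sequence_metrics(tracked, ground_truth, fp_counts, fn_counts, gt_counts):
    """tracked / ground_truth: one (x, y, w, h) box per frame.
    fp_counts / fn_counts / gt_counts: per-frame false-positive, false-negative,
    and ground-truth box counts (assumed to be provided by detection/annotation)."""
    n_frames = len(tracked)
    f1_sum, otp_sum, n_overlap = 0.0, 0.0, 0
    for t, g in zip(tracked, ground_truth):
        inter = intersection_area(t, g)
        area_t, area_g = t[2] * t[3], g[2] * g[3]
        if inter > 0:                                        # frame belongs to M_s
            n_overlap += 1
            otp_sum += inter / (area_t + area_g - inter)     # Eq. (9) summand
        p = inter / area_t if area_t else 0.0                # Eq. (6)
        r = inter / area_g if area_g else 0.0                # Eq. (7)
        if p + r > 0:
            f1_sum += 2 * p * r / (p + r)                    # Eq. (5) summand
    f1 = f1_sum / n_frames                                              # Eq. (5)
    ota = 1.0 - (sum(fp_counts) + sum(fn_counts)) / sum(gt_counts)      # Eq. (8)
    otp = otp_sum / n_overlap if n_overlap else 0.0                     # Eq. (9)
    return f1, ota, otp
```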

3 Tracking methods based on online learning

The challenging task in visual tracking is handling the appearance variations of a target object. The appearance variations are categorized as intrinsic or extrinsic. Intrinsic appearance variations include shape deformation and position variation of a target object, whereas extrinsic variations include changes resulting from varying illumination, camera motion, camera viewpoints, and occlusion. These variations should be updated with adaptive mechanisms that have the ability to continuously adapt their modeling and representations. Thus, there is an essential necessity for an online mechanism that learns incrementally. Generally speaking, existing tracking methods are classified into generative methods and discriminative methods [27].

3.1 Tracking using generative online learning

Generative online learning approaches are used to track an object by searching for the areas that are most similar to the target model. The online learning approach is executed in the tracking algorithm to adapt the representative model of the target in response to appearance variations. Next, some recent improvements in online-learning-based generative tracking approaches are detailed. These approaches are inspired by the developments in appearance representation.

The deficiency of appropriate appearance representation is a crucial aspect that weakens the results of visual tracking algorithms. Classical template matching tracking procedures cannot handle appearance variations, because they use static models. Therefore, dynamic templates based on online learning are adopted to model the appearance variations of an object due to changes in illumination and posture.

Jepson et al. [28] applied the online expectation maximization (EM) technique and adopted a wavelet-based mixture model to improve the appearance representation and interpret the tracking factors efficiently. Zhou et al. [29] included the EM based algorithm for updated appearance representation with a particle filter to improve performance. This method has two EM processes, one for improving the appearance representation and the other for concluding the tracking factors. Tu and Tao [30] developed an online EM method to calculate the appearance representation characteristics and improve the histogram space through incoming observations. The EM has efficient characteristics in terms of stability and simplicity. However, the high number of iterations can cause the result to easily reach a local optimum, and the slow convergence can cause target loss and tracking failure.

Fussenegger et al. [31] presented an online method that can adapt a low-dimensional shape model by utilizing incremental principal component analysis (IPCA). This method preserves the latest model of the object to learn variations in the object itself and in the surrounding observation. However, only one sample is handled in each update. Yang et al. [32] converted the scene into grids of histograms of oriented gradients (HOG); the resulting IPCA–HOG descriptor allows the tracking process to address variations in appearance as well as cluttered scenes. Chiverton and Xie [33] developed an online updating method based on an active contour shape and a bootstrapping stage. The bootstrapper is used to obtain the shape characteristics repeatedly from successive scenes. Chiverton et al. [34] applied a memory of object features in a high-dimensional shape coordinate space to adapt high-level shape information online. These shapes are utilized to determine templates. The essential limitation of this technique is that it is unsuitable for real-time processing due to its high computational cost. Furthermore, as in many active contour tracking methods, effective tracking depends on an empirical selection of the factors that adjust the relative contributions of the different model components.
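As a rough illustration of how such IPCA-based appearance models operate, the sketch below maintains a low-dimensional subspace of vectorized object patches with scikit-learn's IncrementalPCA and scores candidates by their reconstruction error. It is a simplified stand-in, assuming patch extraction and candidate sampling are handled elsewhere, and omits the forgetting factors and mean-update refinements used by some of the cited trackers.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

class SubspaceAppearanceModel:
    """Low-dimensional appearance subspace updated online, IPCA-style sketch."""

    def __init__(self, n_components=16):
        self.pca = IncrementalPCA(n_components=n_components)

    def update(self, patches):
        """patches: (n_samples, n_pixels) array of vectorized object patches."""
        self.pca.partial_fit(patches)

    def reconstruction_error(self, patch):
        """Distance of a candidate patch to the learned subspace; lower is better."""
        x = patch.reshape(1, -1)
        recon = self.pca.inverse_transform(self.pca.transform(x))
        return float(np.linalg.norm(x - recon))

# Toy usage with random 32x32 gray patches standing in for real image data
model = SubspaceAppearanceModel(n_components=8)
model.update(np.random.rand(20, 32 * 32))            # batch of initial observations
scores = [model.reconstruction_error(np.random.rand(32 * 32)) for _ in range(5)]
best = int(np.argmin(scores))                         # candidate closest to the subspace
```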

Liu et al. [35] applied a tracking method based on online learning with hybrid models that contain several forms of features, including sketch, texture, and flatness. The hybrid model is learned by computing the most discriminative features from the foreground. Then, the model is developed by modifying the feature confidences and removing the older, less discriminative features with the newer, more discriminative ones from the current scene. Several types of features are used with each other to more fully represent the target than a single feature could. One main weakness of these approaches is the lack of clear removal of the invalid patches during template preservation. To address this issue, Xu et al. [36] derived a hyper model, using HOG, center-symmetric local binary patterns, and color histograms to depict the local statistics of edges, textures, and flatness of objects, which are automatically adjusted online by combining new, efficiently selected patches computed from the fusion of the matching templates and the prospective set.

Kwon and Lee [37] divided the observation model into multiple fundamental observation models to compute the appearance of the object. Each fundamental observation model captures a certain aspect of the appearance of the target and is created dynamically at each time step by sparse principal component analysis (SPCA). The global tracker is constructed from various fundamental trackers corresponding to the fundamental observation and motion templates. Each tracker covers a specific variation in the object or its surroundings and increases the method's stability under numerous variations. However, this technique is insufficient for difficult tracking tasks with severe variations between scenes, because of the fixed number of basic trackers.

To address this issue, Kwon and Lee [38] used a tracker sampler to determine multiple suitable trackers from the tracker space dynamically to update to certain variations. The result of this method is very good, even in a real scene. However, compared with the multi-feature model, the computation of effective templates increases the cost of the computation. Thus, without further enhancement, the algorithm cannot be applied to real-time tracking.

Instead of applying a simple methodology to derive the appearance model for tracking, an online learned subspace representation is employed to provide a compact representation of the object and indicate appearance variations through tracking. The subspace probabilistic model provides an effective calculation.

Ross et al. [39] developed an adaptive probabilistic tracking method based on a probabilistic Markov inference model to adapt the templates of an object by means of incremental eigenbasis updates. Then, to account for the sample mean changing over time, an incremental mean update was incorporated into the learning method [40]. The appearance of the object is learned in order to handle variations due to intrinsic and extrinsic parameters, based on a low-dimensional eigenspace representation. Moreover, a forgetting factor was added to the IPCA-based incremental subspace update process to decrease the impact of earlier observations on the current appearance model. This technique adapts the appearance pattern more suitably and thereby increases the overall tracking performance. An online incremental method based on an appearance class was developed by Lee and Kriegman [41]. It is implemented as a combination of sub-groups and the connectivity between them, with each group described by a principal component analysis (PCA) subspace. This method incorporates a prior appearance model of a class of objects into the appearance model of a specific object of that class by incrementally learning online from the successive frames, including the target instance. Because of the use of such hyper structures, this method has better strategies than a plain online update method for obtaining an accurate appearance model. The limitation of this method is that a prior model is required; in other words, the algorithm can track an object only if it has a model of the object class being tracked.

In all of the above-mentioned frameworks, the tracking relies on image-as-vector representations, which do not obviously use the spatial information within the image pixels.

Researchers introduced image-as-matrix methods or high-order tensors to form representations of image pixels. Li et al. [42] developed a three-dimensional (3D) temporal representation for incremental learning using adaptive updates of the sample mean and eigenbasis. This method represents the appearance of an object more informatively. Wen and Gao [43] combined the retinex image with the original image by defining a weighted subspace representation of an object to account for illumination variations. The online learning mechanism adapts the appearance model to the different illumination conditions caused by light reflectance. This approach does not update the weights but determines them empirically based on the representation. For more informative modeling, covariance matrices of target features in five modes are applied to consider both the spatial and statistical parameters of the object appearance [44]. Each mode of the object updates the eigenbasis and sample mean online to incrementally learn an eigenspace representation that handles appearance variations. The covariance calculation carries a computational burden and cannot be embedded in real-time applications. Wu et al. [45] introduced a framework to lower this computational burden using incremental covariance representation updates. The current covariance model computation relies on weighting that gives newer samples greater impact.

Lu et al. [46] proposed a subspace learning method that exploits a locally-connected graph (LCG). The semantic subspace representation is trained by creating a supervised graph with some labeled object features. The LCG integrates the objects with minor negative features to obtain a robust subspace through projection, which is built before the tracking process. Features of the object are categorized based on semantic details into categories such as illumination, occlusion, and rotation. The LCG uses added label rules to define the subgraph of each class, generating a more informative and reasonable graph to tackle the drifting issue [47].

Under sparsity constraints, the appearance template of the target is defined as a linear combination of only a few basis vectors. Tracking is then performed by comparing features with sufficient accuracy in a learned template subspace.

In Mei and Ling [48], the object is defined as a linear combination of the online-updated object samples and negative samples, and tracking is then cast as a sparse computation task. The sparsity is realized by solving a least squares problem, while partial occlusion, appearance changes, and other challenging cases are handled through an error vector represented by the group of negative samples. This approach showed stable tracking outcomes in experiments; however, it does not handle abrupt pose variations or full occlusion of the object. Liu et al. [49] developed a two-step sparse enhancement method for tracking (Two-step Sparsity Tracking, TST). A sparse set of samples is adopted to decrease the target remodeling error and increase the discriminative power. The template set and the training set are adapted online to improve the efficiency of the tracker. This approach does not address partial occlusion, because the target is modeled as a single entity.
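To illustrate the sparse-representation idea underlying these trackers, the following simplified sketch encodes each candidate patch as a sparse, non-negative combination of object templates using scikit-learn's Lasso and selects the candidate with the smallest reconstruction error. The real methods additionally use trivial templates, particle filtering, and online template updates, all omitted here, and the function names are ours.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_score(candidate, templates, alpha=0.01):
    """Reconstruction error of a vectorized candidate patch using a sparse,
    non-negative combination of the template dictionary (n_templates, n_pixels)."""
    lasso = Lasso(alpha=alpha, positive=True, max_iter=5000)
    lasso.fit(templates.T, candidate)          # dictionary columns are the templates
    recon = templates.T @ lasso.coef_ + lasso.intercept_
    return float(np.linalg.norm(candidate - recon))

def select_target(candidates, templates):
    """Return the index of the candidate best explained by the template set."""
    errors = [sparse_score(c, templates) for c in candidates]
    return int(np.argmin(errors))

# Toy usage: 10 templates and 30 candidate patches of 24x24 pixels
templates = np.random.rand(10, 24 * 24)
candidates = np.random.rand(30, 24 * 24)
best = select_target(candidates, templates)
```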

Liu et al. [50] applied a basis distribution that updates automatically online and a fixed sparse dictionary to represent the appearance of the object. Chen et al. [51] represented the appearance of the object with the actual intensity pixels of the object region. A similarity measure is used to compute the distance between a tracked object and the updated appearance template. The maximum a posteriori approximation is employed to calculate the object conditions in each frame over time, depending on Bayesian inference.

The model of Jia et al. [52] learns online, depending on sparse representation and incremental subspace learning (ISL) to account for the partial occlusion and drifting issue. The learned framework reinforces the tracker to tolerate the changes in an object appearance. Lu et al. [53] used the geometrical information of the object template set based on sparse representation. This approach is called non-local self-similarity regularized coding, and it utilizes K-nearest neighbors (KNNs) to model the structural features of the object. In this model, the weights of the templates are then learned to account for the appearance variations. It has a robust performance, but the tracking speed restricts its application.

3.2 Tracking using discriminative online learning

Discriminative online learning frameworks, called tracking by detection, handle object tracking as a classification task. They simultaneously exploit features of the object and the background. A binary classifier is used to discriminate the object from its background, and it is trained online to address variations in the environment and appearance. This classifier utilizes features from both the object and the background. Next, the various discriminative tracking frameworks that are dependent on online learning are presented by category, according to where the online update technique is applied.
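Before turning to the individual frameworks, the generic tracking-by-detection loop can be sketched as follows: an online linear classifier is updated each frame with object (positive) and background (negative) feature vectors and then scores candidate locations in the next frame. The sketch uses scikit-learn's SGDClassifier with partial_fit and stands in for no specific tracker cited below; feature extraction for image patches is assumed to be available elsewhere.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()                         # online linear classifier (hinge loss)

def update(object_feats, background_feats):
    """Online update with positive (object) and negative (background) samples."""
    X = np.vstack([object_feats, background_feats])
    y = np.array([1] * len(object_feats) + [0] * len(background_feats))
    clf.partial_fit(X, y, classes=np.array([0, 1]))

def locate(candidate_feats):
    """Pick the candidate patch with the highest object-confidence score."""
    scores = clf.decision_function(candidate_feats)
    return int(np.argmax(scores))

# Toy usage with 64-dimensional features standing in for HOG/color descriptors
update(np.random.rand(5, 64), np.random.rand(20, 64))    # update at frame t
best = locate(np.random.rand(50, 64))                    # search at frame t+1
```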

The discriminative tracking frameworks depend directly on the feature space employed. If the features of an object are readily distinguished from its surroundings, the tracker will usually be able to follow it. An online-updated feature space is applied for visual tracking, instead of a static group of features specified a priori. These frameworks adaptively rank the features, and the highest-ranked discriminative features are used in the tracking mechanism.

In [54], training features were extracted from raw images using RGB coordinates. Then, the color transform function was used to convert the RGB space into different color spaces, such as normalized RGB, XYZ, YCbCr, and YIQ. Finally, linear discriminant analysis (LDA) is used to build a histogram for tracking using a single-color coordinate, which is decided online.

Collins et al. [55] developed an adjustable online framework to improve and update the proper features for tracking. All features are ranked by computing the distinctions between the object and the background features over time, to determine the most appropriate features to handle the appearance. Then, the selected features are used to assign pixels in the current frame to either the object or the background category. Both of these approaches use color information, which does not discriminate well under many conditions, to model the object and the background. For example, the tracker loses the object when the object and background areas have identical or very similar color information over successive frames. Nguyen and Smeulders [56] embedded textural features to enhance the modeling of the object and the background. Changes in foreground appearance are captured with features extracted from Gabor filters. This framework is generally robust, but the textural features increase the dimensionality, which makes it unsuitable for real-time purposes.

In Wang et al. [57], the feature selection method in the particle-filtering process has the advantage of using the current background particles. The Fisher discriminant technique is applied to determine the online discriminative features in a large feature space. However, this approach is also inefficient in real time, because of the number of features and the characteristics of the particle filter. Li et al. [58] used 2D LDA to compute on 2D image matrices instead of transforming 2D images into vectors. The method recursively computes an improved projection subspace by updating the model only at specific frame intervals rather than at every frame, resulting in less computation time. However, tracking failure may occur when there is a large variation between successive updates, such as an abrupt occlusion or the face turning to its other side.

Specific features of an object can be used to train an online binary classifier. Sufficient appearance information about the object is not available in advance; therefore, the binary classifier should be continuously updated to compensate for the insufficiency of training features. The initial object location is determined in the current frame, and the classifier then evaluates various probable locations in the surrounding area in successive frames. Avidan [59] applied adaptive boosting (AdaBoost) to build a strong classifier by integrating an ensemble of weak classifiers. Each individual classifier is trained online with several training samples based on an 11D histogram of feature coordinates, including a local orientation histogram and pixel colors. Then, the pixels in the following frames are labeled by the strong classifier as either object or background, generating a confidence map. The new object location is defined by the highest confidence score in the map. This approach requires a small amount of computational time; however, it is sensitive to noisy samples, which disturb the tracker performance.

The methods discussed so far handle variations in appearance, cluttered backgrounds, and short-term occlusion. However, drifting can occur due to accumulated errors in accuracy. To address this issue, a semi-supervised AdaBoost classifier has been developed [60]. The classifier updating process is controlled by a second classifier trained on the first frame. The method labels only the features extracted from the first frame, and subsequent training features remain unlabeled. The performance is unsatisfactory due to tracking errors resulting from the extraction of sub-optimal positive features.

The online multiple-instance learning (MIL) approach [61] has been developed to solve this difficulty. When the classifier is updated with the existing tracker patch as positive features and the surroundings as negative features, the object may not be fully present in the bounding box, or the box may be dominated by background. Therefore, the object area is considered together with additional bounding boxes within close range to generate a positive bag, and multiple negative bags are extracted using bounding boxes at a distant range. Haar-like features are then applied as follows. Prospective bounding boxes are placed uniformly in a circular region around the original area. The maximum classification score is used to define the updated location of the object in the MIL, and the classifier coefficients are updated with the new data points. The MIL framework is computationally expensive due to ambiguity between samples of the positive bags. Batch-mode adaptive MIL (Li et al. [62]) was designed to reduce the computation time by separating training sets into batches instead of applying them all at once, allowing real-time tracking.

For long-term tracking, Kalal et al. [63] developed an efficient method that divides the process into tracking, learning, and detection (TLD). For the tracking part, a short-term approach is used, based on the Kanade–Lucas–Tomasi method, and random forests are used in the detection stage [64]. A positive–negative (P–N) learning module computes false positives and false negatives. The object is defined in the first frame, and then, the pattern is observed by the detector using two-bit binary patterns differentiated from surrounding background patterns. In the subsequent frame, the detector determines the locations of the top 50 scores, and then each potential candidate window is computed using normalized cross-correlation. Next, the prospective window with the greatest similarity to the object is labeled as the same object.

Positive samples are considered to be in the vicinity of the object after the new location is determined, and negative samples are considered to be farther away from the object. The main advantage of this method is the ability to learn a new appearance and to avoid repeating mistakes. However, it also faces several challenges. For example, TLD cannot provide good results in the case of a rotation out of the original plane. For the case where an object leaves the field of view, Hare et al. [65] developed a method named Struck, which relies upon a structured output support vector machine (SVM). It explicitly learns a prediction function to directly compute the object transformation between frames. Alternatively, to address the drift issue, Zhang et al. [66] proposed a multi-expert restoration structure instead of learning only one classifier.

Bolme et al. [67] introduced correlation filters (CFs) into the tracking process and proposed the minimum output sum of squared error (MOSSE) filter. This approach converts the essential convolution operations in the time domain into simple additions and multiplications in the frequency domain. MOSSE performs favorably against state-of-the-art trackers while running at about 600 frames per second. CF-based tracking algorithms have since grown in popularity within the tracking community. Henriques et al. [68] introduced a circulant structure kernel (CSK) algorithm that adopts a dense-sampling training pattern created by circular shifts of an input image patch. The CSK tracker relies on illumination intensity features. It has been extended to use more robust features, such as the histogram of oriented gradients (HOG) in kernelized CFs (KCFs) [69] and color attributes or color names (CNs) [70].
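The frequency-domain formulation behind MOSSE can be sketched in a few lines of NumPy: the filter is obtained in closed form from training patches and their desired Gaussian response maps, and detection reduces to an element-wise multiplication followed by an inverse FFT whose peak gives the target shift. The cosine windowing, log preprocessing, and regularized online updates of the actual tracker are omitted in this simplified sketch.

```python
import numpy as np

def train_mosse(patches, responses, eps=1e-5):
    """Closed-form MOSSE filter: H* = sum(G_i . conj(F_i)) / (sum(F_i . conj(F_i)) + eps)."""
    num = np.zeros_like(np.fft.fft2(patches[0]))
    den = np.zeros_like(num)
    for f, g in zip(patches, responses):
        F, G = np.fft.fft2(f), np.fft.fft2(g)
        num += G * np.conj(F)
        den += F * np.conj(F)
    return num / (den + eps)

def detect(filter_H, patch):
    """Correlation response in the spatial domain; its peak gives the target position."""
    response = np.real(np.fft.ifft2(filter_H * np.fft.fft2(patch)))
    return np.unravel_index(np.argmax(response), response.shape)

# Toy usage: train on one 64x64 patch with a desired Gaussian peak at the centre
ys, xs = np.mgrid[0:64, 0:64]
gauss = np.exp(-((ys - 32) ** 2 + (xs - 32) ** 2) / (2 * 2.0 ** 2))
patch = np.random.rand(64, 64)
H = train_mosse([patch], [gauss])
peak = detect(H, patch)   # ~ (32, 32) for the training patch itself
```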

Danelljan et al. [71] combined two independent CFs for robust scale estimation; prior to this, CF-based trackers could not handle a scale change in the target. Li and Zhu [72] applied a multi-resolution extension of the KCF (denoted SAMF) to handle scale changes. Danelljan et al. [73] simultaneously computed the scale and translation of the target object while minimizing the search space. In another study [74], a color-histogram-based spatial reliability map and channel reliability weights were introduced into the CF framework. HOG attributes are not sensitive to motion blur or variation in illumination, while CN features maintain robustness to shape deformation. Combining the advantages of HOG and CN, Bertinetto et al. [75] created the STAPLE tracker, which produces outstanding results compared to state-of-the-art methods.

3.3 Tracking using combined online learning

Generative approaches can efficiently handle the object appearance, but they perform poorly against complex backgrounds. On the contrary, discriminative approaches have the ability to model complex backgrounds and significant appearance variations. Nevertheless, discriminative approaches cannot handle high noise and are generally prone to drifting. Furthermore, discriminative approaches can be distracted by other objects that have a similar appearance. Thus, many researchers have developed frameworks that combine the advantages of both approaches to create a robust tracker.

Lin et al. [76] proposed a discriminative generative framework. The generative method is applied online to track the observational model and the discriminative method is adopted to compute the next position of the object. Zhang et al. [77] introduced a framework based on graph-based discriminative tracking. This framework integrates Fisher discriminant analysis (FDA) and ISL for tracking. The target subspace and the pattern models of graphs are learned online concurrently to collect the appearance variations and derive the object from its background. This framework attempts to preserve within-class compactness to adjust the position. However, it suffers from the drifting error that is accumulated as the track progresses.

Yu et al. [78] proposed a co-training technique to incorporate both generative and discriminative models. The generative model depends on subspace features learned online to model the object appearance and learn different appearance variations. The discriminative model depends on a continuously updated support vector machine (SVM) classifier with HOG features. The SVM is updated and trained to capture the new appearance. This framework is robust and efficient, but abrupt appearance changes and occlusion disrupt the tracker performance. Yin and Collins [79] introduced a framework to mitigate the accumulated pixel classification inaccuracy. This framework applies the global shape and region-based probability of the object boundary. Yang et al. [80] proposed a novel tracker that depends on a particle filter and sparse representation. Each object is modeled by object templates and surrounding templates with an additional representation error term to learn appearance variations. Both the object templates and the surrounding templates are embedded into a voting method to differentiate between the object and the background.

4 CNN-based tracking

Machine learning has been revolutionized by deep-learning methods, and the tracking community has been working to draw on this subject area to improve visual tracking methods. In general, traditional tracking techniques, including online learning techniques, employ hand-crafted features to improve robustness. Over the last 5 years, deep-learning techniques [81] have produced good results in feature extraction via multi-layer nonlinear transformations in numerous applications. These include computer vision [82, 83], speech recognition [84, 85], and natural language processing [86, 87]. This means that deep-learning processes automatically obtain groups of features from the given images [88]. Preprocessing steps may be used, such as the pyramidal technique [89]. As first described in 2006 [90], the key feature of a deep-learning model is its layers. Essentially, such models depend on a multi-layered architecture of data representation performed within the neural network, and they extract the characteristics directly from the raw input. For image analysis, successive layers learn an increasingly abstract chain of representations, e.g., pixels, then edges, then groups of edges, then shapes.

A deep-learning model is configured by several neurons in various hidden layers. The hidden layers map the inputs to higher-level representations. Generally, the aim of implementing deep learning within tracking is to distinguish patterns more quickly and accurately than a human does, thereby enhancing the efficiency of video applications.

The main advantages of deep-learning methods are as follows:

  1. (i)

    The development of efficient representations and of new architectures to learn and update these representations from large-scale unlabeled data.

  2. (ii)

    The ability to directly deduce a complex set of features at a high level of abstraction.

  3. (iii)

    The ability to learn low-level features from minimally processed input samples.

  4. (iv)

    The ability to make decisions using a large number of datasets.

Typically, deep-learning networks are divided into five main groups: convolutional neural networks (CNNs), deep belief networks (DBNs), stacked auto-encoders, deep Boltzmann machines (DBMs), and deep residual learning (DRL) networks.

4.1 History of deep learning

Historically, deep learning dates back to the inception of artificial neural networks (ANNs) [91]. ANN methods dominated previous decades of work on recognition, segmentation, enhancement, and prediction in areas such as industry, biology, finance, robotics, marketing, medicine, manufacturing, and moving object detection [92]. The concept of deep learning emerged in 1980, when the neocognitron model was proposed by Fukushima [91]. LeCun et al. [93] designed a method to address the recognition of hand-written ZIP codes by utilizing the back-propagation technique in a deep neural network. However, this method had significant drawbacks that made it impractical to use, as the training time was unreasonably long. Nevertheless, deep neural networks have been applied to speech recognition for several years [94].

Over the next two decades, many research groups attempted to reduce the cost of training. Hinton [90] achieved strong results in training multi-layer DBNs by applying an unsupervised restricted Boltzmann machine to pre-train one layer at a time; a supervised backpropagation step was then exploited for additional improvement. After that, several research domains implemented early deep-learning models to handle various issues. Parallel hardware and software configurations have led to remarkable recent achievements in deep-learning methods [95, 96]. Several deep-learning models have been proposed; milestone models include CNNs [97], stacked autoencoders [98], and recurrent neural networks (RNNs) [99].

4.2 CNN-based tracking

The CNN model attracts most researchers with its impressive performance, particularly in the computer vision area. Figure 2 shows the percentage of different deep-learning architectures in recently published works for object detection, recognition, and tracking. We can easily observe that the CNN technique is used in 66% of the applications [100].

Fig. 2

Percentages of different deep-learning architectures

CNNs are composed of a multi-layered artificial neural network architecture, in which each layer contains several neurons. During the first step, the convolutional layers apply a convolution operation between filters and patches of the input image, outputting a feature map. In the second step, each convolutional layer feeds into a layer that applies a nonlinear function to the feature map. The third step is to down-sample the feature map to reduce its size. Down-sampling can be done in several ways, such as minimum, average, or maximum pooling. Depending on the application, these three steps are iterated until the desired high-level feature map is extracted. Finally, the fully-connected output layer generates a certain number of class outputs. The main advantages of the CNN are that it is simple to train and that it is less dependent on prior models and on human domain knowledge than other methods. It directly accepts 2D input data structures and can also accept 3D inputs.
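The layered pipeline just described (convolution, nonlinearity, pooling, and a fully-connected output) can be written as a minimal PyTorch sketch; the layer sizes and input resolution below are arbitrary placeholders rather than those of any tracker discussed in this paper.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal convolution -> ReLU -> pooling -> fully-connected pipeline."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # step 1: convolution
            nn.ReLU(),                                     # step 2: nonlinearity
            nn.MaxPool2d(2),                               # step 3: down-sampling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)  # fully-connected output

    def forward(self, x):                                  # x: (batch, 3, 64, 64)
        feats = self.features(x)
        return self.classifier(feats.flatten(start_dim=1))

logits = TinyCNN()(torch.randn(1, 3, 64, 64))              # -> shape (1, 2)
```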

CNN-based tracking techniques are either generative or discriminative, similar to conventional tracking techniques. The generative techniques use a similarity metric to estimate object template matching within a certain search area. The discriminative techniques apply binary classification in the CNN scheme to effectively distinguish the object from its background. To build a CNN-based tracker, a convenient and simple approach is to substitute hand-crafted features with deep features, captured from the CNN, within a popular tracking method such as the CF.

4.3 CNN-based classification tracking

The spatial information in the last convolutional layer cannot accurately determine the object position. In contrast, earlier CNN layers give an accurate position, but are less robust from the appearance model perspective. Therefore, Ma et al. [101] used a visual geometry group network (VGGNet) [102] to capture the features of the last convolutional layer. These features encode semantics that are robust to coarse appearance changes. They introduced three different convolutional layers (Conv3–4, Conv4–4, and Conv5–4) with three CFs and then combined their associated response maps to locate the object.

Hong et al. [103] extracted discriminative saliency maps using a CNN, then embedded them into an online SVM to update and account for appearance variations. Moreover, the deep layers are implemented not only to capture features, but also to classify them. The deep spatially regularized discriminative CF (DeepSRDCF) [104] approach feeds the feature map from a CNN layer into the framework of SRDCF [105]. Zhu et al. [106] applied the CNN layers in a manner similar to a Faster Region-CNN (Faster R-CNN) to create object patterns, which are combined with an online SVM to compute the object appearance. The CCOT [107] and ECO [108] trackers are based on continuous convolution operators. In the CCOT, the tracker adopts features from three convolutional layers of a pre-trained VGGNet and updates a discriminative continuous convolution operator to increase robustness. The ECO incorporates deep features along with the handcrafted HOG and CN features, and factorizes the convolution operator to reduce the number of parameters. The UPDT model [109] developed the fusion of shallow and deep features to fully exploit the benefits of CNNs.

4.4 CNN-based matching tracking

Recently, the Siamese neural network has gained attention in the area of object tracking. Many researchers have utilized the CNN architecture to learn robust matching methods. In one study [110], the developers of GOTURN modified the Siamese neural network to carry out object tracking for pairs of consecutive frames and adopted a feed-forward network that regresses the bounding box without online training. The developers of SINT [111] introduced a Siamese neural network architecture to compute the similarity between an object pattern in the first frame and candidates in the next frame. This method treats visual tracking as a verification task, which determines the optimal state based on the maximum matching score. Bertinetto et al. [112] proposed a fully-convolutional Siamese model to correlate the object pattern with the current search region in a CNN. Chen et al. [113] developed a generic framework using an efficient two-flow CNN model to combine two inputs, one for the object image patch and the other for the search region patch. The method estimates the appearance of the object within a certain area. Valmadre et al. [114] applied CFs to the output features of a layer within the Siamese architecture. FlowTrack [115] introduces rich optical flow information from successive frames to improve feature encoding and tracking performance.
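The matching operation at the heart of these Siamese trackers can be sketched as follows: a shared embedding network maps both the exemplar and the search region to feature maps, and the exemplar embedding is cross-correlated with the search embedding (implemented here as a convolution) to produce a response map whose peak indicates the target location. The toy embedding below is a stand-in; the cited trackers use deeper, pre-trained backbones and additional normalization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared (Siamese) embedding applied to both the exemplar and the search region
embed = nn.Sequential(nn.Conv2d(3, 32, 3), nn.ReLU(), nn.Conv2d(32, 32, 3))

def response_map(exemplar, search):
    """Cross-correlate exemplar features with search features (SiamFC-style)."""
    z = embed(exemplar)                    # (1, 32, hz, wz) template embedding
    x = embed(search)                      # (1, 32, hx, wx) search embedding
    return F.conv2d(x, z)                  # (1, 1, hx-hz+1, wx-wz+1) score map

resp = response_map(torch.randn(1, 3, 31, 31), torch.randn(1, 3, 95, 95))
peak = torch.nonzero(resp[0, 0] == resp.max())[0]   # row, col of the best match
```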

The developers of SiamFC [116] were the first to adopt CFs to the Siamese network. The developers of CFNet [117] enhanced SiamFC by applying a CF to the exemplar branch to learn the template representation, which makes the Siamese network more robust to appearance variations. Kuai et al. [118] used target object and target template models to improve the efficiency of SiamFC. The developers of Re3 [119] introduced a recurrent network to extract enhanced features created by exemplar branches. In DSiam [120], appearance changes in the target and background can be updated and learned from previous frames online. Dong and Shen [121] adopted a triplet loss operation to increase the robustness of SiamFC and CFNet. The developers of SiamRPN [122] used a Siamese region proposal network (RPN) to compute the bounding boxes of targets.

Deeper and wider Siamese networks [123] depend on a deeper and broader CNN to improve the efficiency of tracking. This method reduces the negative effects of padding, while managing perceptual domain size and network stride. The architecture of this design is very lightweight, and the output is enhanced while ensuring real-time performance. The developers of SiamMask [124] used a simple framework that is able to perform both tracking and segmentation simultaneously in real time. This method improves the full convolutional Siamese neural network by adding a mask branch for target tracking and segmentation.

5 Experiments and analysis

In this section, we evaluate trackers on the popular OTB benchmark [125] and highlight the advantages of using different visual trackers. We outline the trackers used, and then analyze, compare, and discuss the experimental results. Finally, we summarize our conclusions. All trackers were run in MATLAB on a desktop PC with a 2.9 GHz CPU and a GTX 1080 Ti GPU.

5.1 Tracking algorithms

We considered 13 visual trackers: TLD [63], MEEM [66], CSK [68], KCF [69], SAMF [72], DSST [73], CSRDCF [74], STAPLE [75], CF2 [101], CNN-SVM [103], DeepSRDCF [104], CCOT [107], and ECO [108], all of which have displayed good performance with popular benchmarks. We ran the source codes published by the authors and used tracking results for experimental comparisons. The trackers were compared in the presence of fast motion (FM), motion blur (MB), illumination variation (IV), background clutter (BC), and occlusion (Figs. 4, 5, 6, 7, 8). Table 1 displays these results quantitatively, with the results of different tracking algorithms listed for different sequences. Figure 3 presents the precision and success plots of the one-pass-evaluation (OPE) measurement.

Table 1 Comparison results in terms of average center error (unit pixels) on different attributes (the bold indicates the best performance)
Fig. 3

Precision and success plots of compared trackers on the basis of the OPE manner over 13 video sequences for challenging attributes, a FM, b MB, c IV, and d BC
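For reference, the OPE precision and success curves of Fig. 3 can be computed from per-frame results as sketched below, following the standard OTB definitions: the precision score is typically read at a 20-pixel center-error threshold, and the success score is the area under the overlap-threshold curve. Per-frame center errors and overlaps are assumed to be available from the tracker output and the ground truth.

```python
import numpy as np

def precision_curve(center_errors, max_threshold=50):
    """Fraction of frames whose center location error is below each pixel threshold."""
    errors = np.asarray(center_errors, dtype=float)
    thresholds = np.arange(0, max_threshold + 1)
    return thresholds, np.array([(errors <= t).mean() for t in thresholds])

def success_curve(overlaps):
    """Fraction of frames whose overlap (IoU) exceeds each threshold in [0, 1]."""
    ious = np.asarray(overlaps, dtype=float)
    thresholds = np.linspace(0, 1, 21)
    curve = np.array([(ious > t).mean() for t in thresholds])
    return thresholds, curve, curve.mean()    # mean over thresholds approximates the AUC

# Toy usage with synthetic per-frame results
t, prec = precision_curve(np.random.rand(500) * 40)
print("precision @ 20 px:", prec[20])
_, _, auc = success_curve(np.random.rand(500))
```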

Trackers based on deep learning clearly outperformed the traditional trackers. The ECO, CNN-SVM, CF2, and CCOT trackers did much better than the others. Based on these results, we conclude that the utilization of deep-learning features substantially improves tracking over hand-crafted methods. This may be related to the CNN layers, whose parameter sharing and local connectivity make image feature extraction more effective.

Many deep-learning-based methods utilize convolutional features from a single layer, while others, such as the CF2 and CCOT trackers, utilize a combination of multiple convolutional layers. The deep-learning models for the trackers in this study were pre-trained prior to tracking and were not updated during the tracking process. An important benefit of this choice is that the deep-learning models do not require additional computation and memory space. Therefore, research into improving the performance of combining pre-training and online learning within the deep-learning models could be extremely valuable.

5.2 Qualitative comparison on different attributes

Fast motion. Qualitative comparison results on the fast motion sequences (the challenging Deer, Tiger1, Toy, and BlurCar1 sequences) are presented in Fig. 4.

Fig. 4

Qualitative comparison of selected trackers on the a Deer, b Tiger1, c Toy, and d BlurCar1 sequences

Motion blur. Qualitative comparison results on the motion blur sequences (the challenging BlurFace and Box sequences) are presented in Fig. 5.

Fig. 5

Qualitative comparison of selected trackers on the a BlurFace, and b Box sequences

Illumination variation. Qualitative comparison results on the illumination variation sequences (the challenging Skating1, Crowds, and Basketball sequences) are presented in Fig. 6.

Fig. 6

Qualitative comparison of selected trackers on the a Skating1, b Crowds, and c Basketball sequences

Background clutter. Qualitative comparison results on the background clutter sequences (the challenging David3 and Football1 sequences) are presented in Fig. 7.

Fig. 7

Qualitative comparison of selected trackers on the a David3, and b Football1 sequences

Occlusion. Qualitative comparison results on the occlusion sequences (the Rubik and DragonBaby sequences) are presented in Fig. 8.

Fig. 8

Qualitative comparison of selected trackers on the a Rubik, and b DragonBaby sequences

5.3 Discussion and analysis

It is clear that the deep-learning trackers have much smaller average central errors than traditional trackers, for most of the sequences (Table 1), and the ECO tracker performed the best overall in terms of central error. As shown in Table 1, to test each tracker, we considered Deer, Tiger1, Toy, BlurCar1, BlurFace, Box, Skating1, Crowds, Basketball, David3, Football1, Rubik, and DragonBaby sequences, which have various challenges as outlined in the following.

DeepSRDCF enhances tracking output using convolutional features and spatial regularization penalties. However, it did not successfully tackle deformation (Basketball) or occlusion (DragonBaby). The CF2 utilizes different convolutional layers to train multiple CFs. However, it did not successfully handle an object with FM and in-plane rotation (BlurFace). The CCOT successfully tracked objects in most of the selected video sequences.

The objects in Deer (Fig. 4a) and Tiger1 (Fig. 4b) undergo abrupt motions, along with appearance changes caused by motion blur, which makes them difficult to track. Nonetheless, most trackers handled the Deer sequence with some drift, but the TLD tracker did not perform well. Due to the weak re-initialization mechanism in the TLD, it may detect a non-target object with a shape similar to the target (Deer, frame 43). For the Tiger1 sequence, the deep trackers (ECO, CF2, and CNN-SVM) tracked well, whereas the other trackers (MEEM, CSK, KCF, SAMF, DSST, CSRDCF, and STAPLE) did not. This is attributed to the repetitive motion in the sequence, along with the fact that the latter trackers do not have a re-initialization mechanism, and hence they cannot relocate a target after failure.

Figure 6 shows the results from three complicated sequences where illumination is variable. In the Skating1 sequence, a drastic lighting change occurs when the skater moves around the lights. As a result, MEEM, CSK, KCF, and SAMF suffered from severe drift at frames 180 and 379, while the CNN-SVM, DeepSRDCF, CCOT, and ECO trackers performed well.

5.4 Tracking speed analysis

Several parameters influence the computational speed of trackers, aside from differences in user platforms. These include the size of the target bounding box, the number of features, the size of the search region, the number of iterations, and the type of classifier. For example, classification trackers run faster than matching trackers. Most existing deep-learning trackers adopt a CNN to model variations in appearance; some use a CNN to separate the object from its background, while others use it to match candidates with the object. Classifying positive features against negative ones is faster and simpler than matching two sets of features. Therefore, CNN-based classification trackers perform faster than CNN-based matching trackers, as most CNN-based trackers adopt the Siamese neural network to represent prior information instead of fine-tuning online. In the MIL tracker, when the number of Haar features becomes larger, the frame rate decreases but the robustness increases. The average speeds of all 13 trackers are listed in Table 2.

Table 2 Comparison of tracker speeds

6 Conclusion

In tracking, the main critical issue is appearance variations that prevent the tracker from localizing the object efficiently and correctly. Therefore, online learning techniques are being developed to combat sharp appearance variations during tracking. In this paper, the milestone visual object tracking methods based on online learning were discussed considering generative and discriminative methods. The main concepts and features of these frameworks have been presented at the beginning of Sect. 3. To summarize, generative methods consider only the object appearance without background details. In contrast, the discriminative methods compute a boundary region to differentiate the object from its surroundings by considering details on both the object and the background.

In the visual tracking community, the number of successfully tracked objects and the average position error are used to quantitatively assess the tracking result. Generally, discriminative methods achieve better results than generative ones if they have sufficient training instances. However, if the number of samples used in the training step is small or inadequate, generative methods often perform better than discriminative ones. Finally, finding an innovative framework that combines effectiveness and precision, or adaptation and balance, remains an open and vital issue in visual tracking. With the large improvements that CNNs have provided in recent years, impressive CNN architectures can be applied to visual tracking tasks. It is hoped that this presentation of tracking frameworks based on online updating will provide a valuable orientation for newcomers and researchers in related domains.