1 Background

Vision is the main way living beings acquire information from the outside world; scientific investigation has found that more than 80% of all information we receive is acquired through vision. Today, computer vision, pattern recognition, computer graphics, artificial intelligence, etc., are developing swiftly, so it is timely to study computer vision, which enables computers to acquire information as humans do. Computer vision covers all the domains that make computers recognize, analyze, and understand things as humans do.

Visual tracking builds on object detection; it is applied to detect, extract, recognize, and track moving objects in image sequences [40, 93]. Through extraction and tracking, parameters (especially motion parameters such as position, speed, and acceleration) are obtained; these parameters are then processed and analyzed to understand the behavior of the moving target. Today, by fusing technologies of image processing, pattern recognition, artificial intelligence, automatic control, and so on, visual tracking is expected to be applied in many domains, such as radar guidance, aerospace, medical diagnosis, intelligent robots, video compression, human-computer interaction, education, and entertainment.

The main applications of visual tracking are summarized as follows.

1) Intelligent video surveillance (IVS) [57]

IVS is an automatic surveillance system that supervises and inspects the behavior of targets in real time in specific areas such as safety departments, private residences, and public places. It prevents criminal acts by recognizing suspicious behaviors. The main functions in IVS are target detection, recognition, and tracking in video; behavior recognition is the most difficult of these.

2) Video compression/encoding [2]

In object-based video compression (H.265/HEVC etc.), encoding computation is mainly spent on two aspects: block matching and filtering. So it is worthwhile to apply target tracking technology to the encoding method; in this way, an object in an image sequence can be tracked easily and quickly. This enhances matching speed and accuracy, and further raises encoding effectiveness and peak signal-to-noise ratio (PSNR) while reducing the bit error rate (BER).

3) Intelligent traffic system (ITS) [19, 64, 73]

Target detection and visual tracking are widely applied in modern ITS. Target detection is applied to locate vehicles in surveillance video, and visual tracking is then applied to track the detected vehicles to compute traffic flow rate, road condition, and abnormal behavior. A real-time computer vision system for vehicle tracking and traffic surveillance is useful: a method for tracking and counting pedestrians in real time can compute pedestrian density, and detecting and tracking pedestrians at a crossroad can help vehicles move safely through it.

4) Other applications [34, 80, 94]

Visual tracking can be applied in many other domains. On one hand, it is a main technology in the military, for missile guidance, radar detection, unmanned aerial vehicle (UAV) flight control, and individual combat systems. On the other hand, it is also applied in modern medicine, where medical imaging has become a novel technology in clinical diagnostics and therapeutics: tracking can be applied to improve health status and exercise habits via a healthcare PC, and to track the trajectories of protein stress granules in cells and analyze the dynamical features of cell structure.

The development of visual tracking followed the development of computer vision. Before the 1980s, the processing of static images was the main research topic in image processing [75], mainly because of the limited development of computer vision technology, so research and analysis of dynamic image sequences largely applied techniques from static image research. Research on dynamic image sequences became a highlight after the optical flow method was presented [39, 63], and the surge of optical flow studies continued into the 1990s [8]. Recently, there have been papers reviewing the Lucas-Kanade (L-K) method.

Since the 1980s, many visual tracking methods have appeared, some of which are classic methods today. This review introduces and analyzes these methods; to introduce and compare them well, it classifies them into several classes.

1) Classification of visual tracking problem

Visual tracking problems can be classified according to their application.

i) Classification by scene

There are two scenes in real applications of visual tracking: tracking in a single scene and tracking across multiple scenes. Target tracking in a single scene tracks the target in one static scene; in other words, it tracks one designated target in one video (scene) taken by one static camera. Target tracking across multiple scenes tracks multiple targets in a monitoring network constructed from multiple cameras. This method establishes a unique identity for every target and tracks every target continuously through the network by fusing all monitors [41].

ii) Classification by number of moving targets

According to the number of moving targets, visual tracking can be divided into two classes: single-target tracking and multi-target tracking. Single-target tracking is relatively easier than multi-target tracking, but they are essentially the same: both need to detect and extract the tracked target from the scene completely and accurately, and the detection and extraction are influenced by multiple factors, such as changes of background, illumination, and noise. Multi-target tracking must additionally consider occlusion, merging, and separation of the targets. But when one target is considered as the target and the other targets as background, multi-target tracking reduces to single-target tracking.

iii) Classification by movement of camera

In this classification, visual tracking is divided into tracking with a static camera and tracking with a moving camera. With a static camera, the background is fixed, so foreground and background can be separated relatively easily; many methods can do this, such as Gaussian filtering and frame differencing. With a moving camera, it is difficult to separate foreground from background because both are moving; a common practice is to rectify every image (frame of the video) using a sensor in the camera that measures the camera's motion. A minimal sketch of foreground extraction in the static-camera case follows.
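As a rough illustration of the static-camera case, the sketch below separates foreground from background by Gaussian smoothing plus frame differencing; the video file name and the threshold value are placeholder assumptions, not from the original text.

```python
import cv2

# Minimal sketch: foreground extraction with a static camera via
# Gaussian smoothing + frame differencing. "video.avi" and the
# threshold 25 are placeholder assumptions.
cap = cv2.VideoCapture("video.avi")
ok, prev = cap.read()
prev = cv2.GaussianBlur(cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY), (5, 5), 0)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.GaussianBlur(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), (5, 5), 0)
    diff = cv2.absdiff(gray, prev)                                # pixel-wise difference
    _, fg_mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)  # binary foreground mask
    prev = gray
cap.release()
```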

2) Classification of visual tracking method

Visual tracking methods can be classified according to the features they extract.

i) Tracking method based on target geometrical similarity

Tracking methods based on geometrical similarity use an image block called the target template. The method assumes that the target can be expressed by a simple geometrical form, which is stored as the template. The template can be extracted by a target detection method or designated manually. Then the similarity between the target form and the template is computed, and a threshold determines whether the target is present in the tracking scene. This kind of method, such as meanshift, is widely applied to track targets in image sequences. It yields accurate tracking results because it usually uses global information about the target, such as color and texture, but its accuracy degrades when the moving target deforms or is largely occluded.

ii) Tracking method based on local feature

Tracking methods based on local features extract local features of the target, then recognize and track the target in the image sequence by fusing these features. Common local features include edges, extracted by the Canny operator, and corner points, extracted by the SUSAN operator. Tracking with local features copes better with partial occlusion of the target: because the local features are extracted all over the target, recognition is less affected than with geometrical features when the target is partly occluded. The difficulty of this kind of method is to recognize and track the target when its features change through motion; local features can change when the target zooms or rotates.

iii) Tracking method based on contour profile

Tracking methods based on contour profiles are newer than the previous two kinds. This kind of method needs a rough position of the target, provided by a human. A differential equation is then solved recursively to converge the contour to a local minimum of an energy function. The energy function is usually constructed from image features and contour smoothness, such as edges and curvature. The snake algorithm is an example: it controls the movement of the contour through the combined action of image, internal, and external constraining forces, in which the image force pushes the snake curve toward image edges, the internal force constrains local smoothness of the contour, and the external force pushes the snake curve toward the expected local minimum position (which can be provided by a human).

iv) Tracking method based on forecast

Tracking methods based on forecasting transform the target tracking problem into a state estimation problem, where the state of the target contains all the motion features of interest. In this kind of method, the key point is to deduce and express the posterior probability of the target state accurately and effectively from the existing observed data. So forecast tracking is usually based on filtering; two well-known filtering methods are Kalman filtering and particle filtering. In recent years, newer tracking methods have fused a filtering method with other kinds of methods.

So in this review, we present and discuss the mainstream target tracking algorithms of today, including meanshift, the particle filter, active contours, and the SIFT method. In the second section, we discuss the meanshift method and its improvements. We then study the particle filter method and its improvements in the third section. In the fourth section, the active contour method and the SIFT method are studied, and the fifth section presents experimental comparisons. Finally, we summarize these methods and present our views.

2 Meanshift and its extension

2.1 Meanshift algorithm

The meanshift algorithm, named for its shift of the mean vector, was first proposed by Fukunaga et al. [33] in 1975, derived from gradient estimation of a probability density function. Cheng [14] later brought the meanshift algorithm into the computer vision domain in 1995. Comaniciu et al. [23] then applied meanshift to the target tracking problem.

Meanshift is a pattern recognition algorithm that provides a fast search for local extrema of a data probability density distribution. For tracking, it computes the similarity between the object model and candidate image data, and moves the search window by hill climbing along the direction of the similarity gradient. Target tracking is achieved when the meanshift iteration converges to a local extremum.

The meaning of meanshift has changed as its theory was extended. Originally it was a mathematical definition, expressing a mean shift vector. Today, meanshift denotes an iterative process: it computes the shift of the current data mean, moves the current point along the direction of the shifted mean, takes the moved point as the current point, and iterates.

A basic meanshift vector is defined in Eq. 1, where Sh(x) is a sphere of radius h in d-dimensional space, defined in Eq. 2; x is the chosen center point, xi are the sample points, and k is the number of the n sample points xi (i = 1…n) that fall inside Sh(x).

$$ M_h(x)=\frac{1}{k}\sum_{x_i\in S_h(x)}\left(x_i-x\right) $$
(1)
$$ S_h(x)=\left\{y:\left(y-x\right)^T\left(y-x\right)<h^2\right\} $$
(2)

Then, introducing a kernel profile k(·), the kernel density estimate is given in Eq. 3.

$$ \hat{f}_{h,k}(x)=\frac{c_{k,d}}{nh^d}\sum_{i=1}^{n}k\left(\left\|\frac{x-x_i}{h}\right\|^2\right) $$
(3)

Taking the gradient of Eq. 3 yields the meanshift equation in Eq. 4.

$$ \nabla\hat{f}_{h,k}(x)=\frac{2c_{k,d}}{nh^{d+2}}\sum_{i=1}^{n}\left(x-x_i\right)k^{\prime}\left(\left\|\frac{x-x_i}{h}\right\|^2\right) $$
(4)

Then, letting g(x) = −k′(x) with a Gaussian kernel, Eq. 4 transforms into Eq. 5, where the meanshift vector m_{h,G}(x) of x is given in Eq. 6.

$$ \nabla\hat{f}_{h,k}(x)=\hat{f}_{h,G}(x)\cdot\frac{2c_{k,d}}{h^2c_{g,d}}\,m_{h,G}(x) $$
(5)
$$ m_{h,G}(x)=\frac{\sum_{i=1}^{n}x_i\,g\left(\left\|\frac{x-x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n}g\left(\left\|\frac{x-x_i}{h}\right\|^2\right)}-x $$
(6)

Setting m_{h,G}(x) = 0 gives Eq. 7, the fixed point toward which the meanshift vector converges.

$$ x=\frac{\sum_{i=1}^{n}x_i\,g\left(\left\|\frac{x-x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n}g\left(\left\|\frac{x-x_i}{h}\right\|^2\right)} $$
(7)

The meanshift procedure is detailed in the following Alg. 1.

Alg.1 Meanshift Algorithm

Step 1.

Let x be a chosen point in the d-dimensional space; construct a d-dimensional sphere S of radius h around x.

Step 2.

For all xi in S, compute the meanshift vector M_h(x) by Eq. 1 and move x to x + M_h(x); repeat from Step 1 until the length of M_h(x) falls below a given threshold.

Alg.1 finished.
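The fixed-point iteration of Eq. 7 and Alg. 1 can be sketched as follows; this is a minimal NumPy implementation with a flat (uniform-kernel) window, offered as an illustration rather than the original authors' code.

```python
import numpy as np

def mean_shift(points, x, h, tol=1e-3, max_iter=100):
    """Iterate Alg. 1 with a flat window: move x to the mean of the
    samples inside the radius-h sphere until the shift is small."""
    for _ in range(max_iter):
        inside = points[np.linalg.norm(points - x, axis=1) < h]
        if len(inside) == 0:
            break
        new_x = inside.mean(axis=0)            # x + M_h(x), with M_h from Eq. 1
        if np.linalg.norm(new_x - x) < tol:    # converged to a local density mode
            return new_x
        x = new_x
    return x

# Toy usage: samples drawn around a mode near (2, 3).
pts = np.random.randn(500, 2) * 0.5 + np.array([2.0, 3.0])
mode = mean_shift(pts, x=np.array([0.0, 0.0]), h=2.0)
```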

2.2 Algorithm analysis and improvement

The meanshift algorithm has many advantages. First, its computation is relatively light, so it can track a target in real time and efficiently once the target area is known. Second, as a nonparametric density estimation algorithm, meanshift can easily be made a module and integrated with other algorithms to increase tracking effectiveness. Third, meanshift constructs its model with a kernel function, which makes the algorithm robust and insensitive to edge occlusion, target rotation, deformation, and background change.

The meanshift algorithm also has defects in its tracking process. First, the feature representation model of the target lacks an essential update ability, and the tracking process cannot adapt the search window to different sizes, so tracking has a high failure rate when the scale or shape of the target changes. Second, the color histogram is a weak feature as the only description of the target; it does not work well when the color distributions of background and target are similar. Third, meanshift relies on the relation between the current and next frames, so when the target moves fast, the tracking result often converges to a background object whose color distribution is similar to the target's. This is because there is no overlap between two adjacent frames when the target moves quickly; the tracker loses the target because it only searches near the target's position in the earlier frame.

Figure 1 shows one instance of the meanshift algorithm. The videos we used are from Ref. [81].

Fig. 1

An instance of meanshift algorithm in target tracking

In Fig. 1, we find that the meanshift algorithm successfully tracks the moving target with the red rectangle in Fig. 1a, but fails in Fig. 1b. The failure occurs because another target occludes the tracked target; in this case, the meanshift algorithm cannot recognize the tracked target and makes an error.

Today, there are many improved meanshift methods. Comaniciu presented the Camshift (Continuously Adaptive Meanshift) method based on meanshift [21], applying kernel probability density estimation to visual target tracking. Camshift uses a histogram to describe the target, adopts the Bhattacharyya coefficient to measure the distance between the model and the current object, and tracks the target with the meanshift procedure. It achieves a higher tracking rate and effectiveness by using gradient hill climbing instead of a global search.

Considering an image as a matrix, the probability density at a pixel x is determined by drawing a circle centered at x with radius h and applying two rules to all pixels xi in the circle:

1) The probability density is larger when the color of xi is more similar to that of x.

2) The probability density is larger when xi is nearer to x.

So the probability density at pixel x can be constructed as in Eq. 8, where C/(h_s^2 h_r^2) is a normalization constant per unit measure, K(·) is a kernel function, the superscript s denotes the spatial component (spatial distance between x and xi), and the superscript r denotes the color component (color difference between x and xi).

$$ K_{h_s,h_r}(x)=\frac{C}{h_s^2h_r^2}\,K\left(\left\|\frac{x^s-x_i^s}{h_s}\right\|^2\right)K\left(\left\|\frac{x^r-x_i^r}{h_r}\right\|^2\right) $$
(8)

Then, Camshift is detailed in Alg. 2.

Alg.2 Continuously Adaptive Meanshift Algorithm

Step 1. Construct feature representation model of target.

Begin

Take the color distribution of the moving target in the given area as the feature.

Compute the color distribution using the distance between each pixel and the barycenter of the target.

Assign a weight to every pixel of the moving target according to color distribution and distance; the weight is inversely proportional to the distance.

End

Step 2. Construct a description model for the potential areas that may contain the target.

Specify search areas around the target position from the previous result. Extract features in the detection area and generate feature representation models for the potential areas.

Step 3. Compute similarity between feature representation models of target and potential area.

Usually, meanshift applies the Bhattacharyya coefficient to measure the similarity between the color histograms of the target and a potential area. The Bhattacharyya coefficient is computed as in Eq. 9, where a, b denote two histograms, n denotes the number of color bins in each histogram, and a_i, b_i denote the numbers of pixels with color i in each histogram. Other similarity measures have also been applied in this comparison [24].

$$ B\left(a,b\right)=\sum_{i=1}^{n}\sqrt{a_i\cdot b_i} $$
(9)

Step 4. Extract moving target.

While the length of the shift vector is larger than a given threshold:

Choose the area with the largest Bhattacharyya coefficient as the target.

Goto Step 1.

Alg.2 finished.
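For reference, OpenCV ships a CamShift implementation; the sketch below follows the spirit of Alg. 2, with a hue histogram as the target model and Eq. 9 available for histogram similarity. The video file name and the initial window are placeholder assumptions.

```python
import cv2
import numpy as np

def bhattacharyya(a, b):
    """Eq. 9: similarity of two normalized color histograms."""
    return float(np.sum(np.sqrt(a * b)))

# "video.avi" and the initial window (x, y, w, h) are placeholder assumptions.
cap = cv2.VideoCapture("video.avi")
ok, frame = cap.read()
x, y, w, h = 200, 150, 60, 80
hsv_roi = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])  # hue histogram (target model)
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)       # target probability map
    rot_box, (x, y, w, h) = cv2.CamShift(back, (x, y, w, h), crit)  # adaptive-window meanshift
cap.release()
```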

Camshift-based methods apply meanshift over consecutive frames, tracking the target using the color distribution map and k-th order moments. They have high tracking robustness in indoor environments. But Camshift confuses the target with other objects whose colors are similar to the target's, which causes the tracking window to expand.

Figure 2 shows one instance of the Camshift algorithm.

Fig. 2

An instance of camshift algorithm in target tracking

In Fig. 2, we find that the Camshift algorithm successfully tracks the moving target with the red rectangle in Fig. 2a, but fails in Fig. 2b. The failure occurs because another target occludes the tracked target and confuses the Camshift method.

Later, Collins studied blob tracking in the scale space of meanshift [20]. Peng et al. presented an automatic selection method for the kernel window [74]. Li et al. studied the convergence threshold and environment of meanshift [60]. Birchfield et al. improved meanshift with spatiograms instead of histograms [9]. Collins et al. first introduced online learning into the meanshift algorithm [21]. Jeyakar et al. improved meanshift with background-weighted local kernels [45]; their work effectively improves meanshift when the target is partially occluded. Leichter et al. applied the integral histogram to overcome model drift in meanshift [56]. Vojir et al. improved meanshift's adaptation to scale change using the Hellinger distance [82]; this method was compared with TLD (Tracking-Learning-Detection) and shows better robustness [48].

Recently, Li et al. combined spatial context information with meanshift [59]; this algorithm is robust when the target moves fast. Yang et al. applied an HSI color feature histogram in meanshift [91]; they classify superpixels and show good performance for moving target tracking. Jia et al. improved the kernel function of meanshift, which gives higher robustness and more applicable scenarios [47]. Jatoth et al. analyzed the performance of many tracking methods and provided a novel method combining an α-β filter, a Kalman filter, and meanshift for target tracking in frame sequences [44]. Guo et al. extended meanshift to a non-rigid target tracking method [35]; their algorithm further improves tracking accuracy and robustness.

3 Particle filter and its extension

3.1 Particle filter algorithm

The particle filter is extended from the Kalman filter. Unlike the Kalman filter, the particle filter can track targets under nonlinear and non-Gaussian conditions. It disseminates weighted discrete random variables in the state space; as the number of particles tends to infinity, these variables approximate the probability of every state. The particle filter computes the integral of the Bayesian estimate using the law of large numbers and statistical testing. It first samples particles from a defined sampling function, then sets the particles' weights according to their similarity under a likelihood function, and finally computes the posterior probability from the particles and their weights. Since the UKF (Unscented Kalman Filter) is a local approximation method and is not well suited to nonlinear, non-Gaussian conditions, the particle filter is applied instead in such conditions.

Equation 10 computes the probability density function of the color distribution p(y) = {pu(y)}u=1…m at position y, where m denotes the number of features in the feature space, pu(y) denotes the u-th feature of y, n is the number of pixels in the chosen area, x0 is the center position, μ is the normalization factor expressed in Eq. 11, c is the diagonal length of the display area, s(·) returns the serial number (bin index) of a feature, and δ is a flag function with δ(0) = 1 and δ(other) = 0.

$$ p_u(y)=\mu\cdot\sum_{i=1}^{n}k\left(\frac{\left\|y_i-x_0\right\|}{c}\right)\delta\left(s\left(y_i\right)-u\right) $$
(10)
$$ \mu=\frac{1}{\sum_{i=1}^{n}k\left(\frac{\left\|y-x_i\right\|}{c}\right)} $$
(11)
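A minimal sketch of the kernel-weighted histogram of Eqs. 10 and 11, written for a grayscale patch; the bin count and the Epanechnikov-style profile are assumptions made for illustration.

```python
import numpy as np

def weighted_color_hist(patch, center, c, n_bins=16):
    """Kernel-weighted histogram (Eqs. 10-11) for a grayscale patch,
    using the assumed profile k(u) = max(1 - u^2, 0)."""
    rows, cols = patch.shape
    ys, xs = np.mgrid[0:rows, 0:cols]
    d = np.sqrt((xs - center[0]) ** 2 + (ys - center[1]) ** 2) / c
    k = np.clip(1.0 - d ** 2, 0.0, None)                 # k(||y_i - x_0|| / c)
    bins = (patch.astype(np.float64) / 256.0 * n_bins).astype(int)  # s(y_i)
    hist = np.zeros(n_bins)
    for u in range(n_bins):                              # delta(s(y_i) - u) selects bin u
        hist[u] = k[bins == u].sum()
    return hist / (k.sum() + 1e-12)                      # mu of Eq. 11 normalizes
```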

Then, the sample (state) of each particle is defined in Eq. 12, where p_y is the position of particle y, \( \vec{v}_y \) is its velocity vector, and θ is the deflection angle from the horizontal direction.

$$ y=\left\{p_y,\vec{v}_y,\theta,c\right\} $$
(12)

The particle sample set is updated and propagated through the system state transition function in Eq. 13, where A is the state transition matrix and w_{t−1} is Gaussian noise at time t−1.

$$ y_t=Ay_{t-1}+w_{t-1} $$
(13)

The particle filter procedure is detailed in the following Alg. 3.

Alg.3 Particle Filter Algorithm

Step 1. Propagate the particle set.

Assuming \( \{y_{t-1}^i\}_{i=1}^m \) is the sample set of the posterior probability distribution of the target state, apply Eq. 13 to compute the current particle set y_t and the positions of the current particles.

Step 2. Compute weight for all particles.

For each sample of the particle set y_t, compute its color histogram by Eq. 14, which follows from Eq. 10.

$$ p_u\left(y_t\right)=\mu\cdot\sum_{i=1}^{n}k\left(\frac{\left\|y_t^i-x_0\right\|}{c}\right)\delta\left(s\left(y_t^i\right)-u\right) $$
(14)

Then, given the current state, compute the weight of each sample by Eq. 16 (Bayes' theorem), taking the target model as the prior probability distribution as in Eq. 15.

$$ P\left(y_t|p_u\left(y_t\right)\right)=P\left(y_t|p_u\left(y_{t-1}\right)\right) $$
(15)
$$ \omega_t^i=\frac{P\left(y_t|y_t^i\right)}{\sum_{i=1}^{n}P\left(y_t|y_t^i\right)} $$
(16)

Step 3. Estimate the target state.

Finally, the target state is estimated by Eq. 17.

$$ \bar{y}_t=\sum_{i=1}^{n}\omega_t^i\,y_t^i $$
(17)

Alg.3 finished.
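A minimal sketch of one Alg. 3 iteration: propagation by Eq. 13, reweighting by Eq. 16, and state estimation by Eq. 17. The state layout, noise level, and the toy Gaussian likelihood are assumptions; in a real tracker, the likelihood would compare color histograms (e.g., by the Bhattacharyya coefficient).

```python
import numpy as np

def particle_filter_step(particles, weights, A, noise_std, likelihood):
    """One particle filter iteration following Alg. 3."""
    # Step 1: propagate through the state transition model (Eq. 13).
    particles = particles @ A.T + np.random.randn(*particles.shape) * noise_std
    # Step 2: reweight particles by observation likelihood (Eq. 16).
    weights = np.array([likelihood(p) for p in particles])
    weights /= weights.sum()
    # Step 3: estimate the state as the weighted mean (Eq. 17).
    estimate = (weights[:, None] * particles).sum(axis=0)
    return particles, weights, estimate

# Toy usage: 2-D position state and a Gaussian pseudo-likelihood around (1, 1).
A = np.eye(2)
particles = np.random.randn(200, 2)
weights = np.full(200, 1.0 / 200)
lik = lambda p: np.exp(-0.5 * np.sum((p - np.array([1.0, 1.0])) ** 2))
particles, weights, est = particle_filter_step(particles, weights, A, 0.1, lik)
```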

3.2 Algorithm analysis and improvement

The particle filter algorithm has its advantages. First, it can be applied under nonlinear and non-Gaussian conditions; compared with the Kalman filter, which applies only to linear systems, the particle filter is widely applicable. Another advantage is that the computational complexity of the particle filter can be kept manageable with the Monte Carlo method.

The particle filter also has disadvantages. First, there is the particle degeneracy phenomenon: after several iterations, the weights of most particles become very small, because the randomly generated samples predicted from the state variables become invalid samples that contribute little to the estimate. As invalid samples increase, the number of valid samples decreases relatively, causing large deviations in the estimate. Meanwhile, the particle filter slows down because it spends a large amount of computing resources on samples with small weights.

Second, there is the sample impoverishment phenomenon. To avoid particle degeneracy, a variety of resampling algorithms are applied; although resampling eliminates particle degeneracy, it introduces sample impoverishment. When the system noise is relatively small, this problem becomes more serious, because particles with large weights are excessively replicated in the resampling process, decreasing particle diversity. Particles with poor diversity cannot reflect the state of the system through their statistical characteristics.

Figure 3 shows one instance of the particle filter algorithm.

Fig. 3

An instance of particle filter algorithm in target tracking

In Fig. 3, we find that the particle filter algorithm successfully tracks the moving target with particles in Fig. 3a, but fails in Fig. 3b. The failure occurs because of particle degeneracy.

To overcome the weaknesses of the particle filter caused by particle degeneracy and sample impoverishment, many researchers have provided methods to improve its efficiency.

Resampling is applied to reduce particle degeneracy. A frequently used resampling method is multinomial resampling; its sampling variance is presented in Eq. 18, where Ni is the number of times particle i is sampled and \( \sum_{i=1}^{n}N_i=N \).

$$ \mathrm{var}\left(N_i\right)=N\omega_k^i\left(1-\omega_k^i\right) $$
(18)

To reduce resampling variance and improve resampling effectiveness, many resampling methods have been presented, for example residual resampling, systematic resampling, and adaptive resampling [37, 43, 49]. A sketch of systematic resampling follows.
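As an illustration, a common systematic resampling routine is sketched below; it draws one random offset and then evenly spaced positions over the cumulative weights, which keeps the variance of Ni (Eq. 18) lower than multinomial resampling.

```python
import numpy as np

def systematic_resample(particles, weights):
    """Systematic resampling: evenly spaced positions over the weight CDF."""
    n = len(weights)
    positions = (np.random.rand() + np.arange(n)) / n
    idx = np.searchsorted(np.cumsum(weights), positions)
    idx = np.minimum(idx, n - 1)                 # guard against float round-off
    return particles[idx], np.full(n, 1.0 / n)   # weights reset to uniform
```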

To avoid sample impoverishment, researchers have also provided many solutions, such as the regularized particle filter and Markov chain Monte Carlo methods [38, 90]. There are also other particle filter algorithms today. The Gaussian Particle Filter (GPF) applies a Gaussian distribution to construct the importance function [52]: it generates the next particle set from the Gaussian distribution computed over all current particles. GPF is asymptotically optimal when the number of particles tends to infinity and the particle distribution is Gaussian. Chen et al. applied the prior distribution under a simulated annealing (SA) process as the importance function [13]; SA uses the Metropolis method to reach a good particle distribution. The Real-time Particle Filter (RTPF) improves real-time performance by reducing the number of particles [54]; it constructs an optimal importance function to ensure the accuracy of the particle filter.

The particle filter can also be combined with other models to generate tracking methods that are more real-time and robust. Ross et al. presented incremental visual tracking by PCA of the differences generated from the image space of frames before the current frame [65]. Kwon et al. improved Ref. [44] and presented a two-dimensional affine transformation matrix to describe the object [55]. Bao et al. combined L1-minimization tracking with the particle filter to handle rotation, scale transformation, occlusion, and illumination change [6]. Jia et al. applied a structural local sparse appearance model to describe the target [46]. Zhong et al. presented a tracking method using a collaborative model with holistic templates and local representations [95]; their method shows robustness when the background changes. Oron et al. provided a locally orderless tracking method with locally orderless matching of the target [72, 76].

Recently, there has also been much research in this area. Yi et al. applied a robust combination of the particle filter and sparse representation to single-target tracking [92]. Nieto et al. presented a real-time tracking system with good tracking speed and accuracy [70]. Zhou et al. presented a hybrid of the particle filter and sparse representation to avoid the defects of the classical particle filter [96]; their method solves the target drift problem of traditional tracking algorithms.

4 Image processing method for target tracking

4.1 Active contour model

The active contour model is a recognition method in image recognition. It first needs a rough contour of the tracked target; it then solves a differential equation recursively to converge to a local minimum of the energy function. The energy function is usually constructed from image features and the smoothness of the contour profile, such as edges and curvature. Contour-based target tracking places no restriction on target shape or movement.

The snake model is a representative active contour model [50] and can be applied to visual moving target tracking. The snake method uses a deformable contour that moves under the internal and external forces of the image. The target contour is defined by constructing a suitable deformation energy Es(v), where s is the unit parameter domain and v(x(s), y(s)) is a mapping from s ∈ [0, 1] to the image plane. Considering the external force on the target contour as the differential of a potential energy P(v(s)), the total energy of the target contour is defined in Eq. 19, with Es(v) computed as in Eq. 20. In Eq. 20, v is written as v(s) because v is a function of s, so Es(v) can also be written as Es(v(s)).

$$ E(v)=E_s(v)+P(v) $$
(19)
$$ E_s\left(v(s)\right)=\int_0^1\left(k_1{\left|\frac{\partial}{\partial s}v(s)\right|}^2+k_2{\left|\frac{\partial^2}{\partial s^2}v(s)\right|}^2\right)ds $$
(20)

Equation 20 defines the internal deformation energy of an extensible, flexible contour v(s). Es(v(s)) contains two weighting functions, k1(s) and k2(s): k1(s) controls the tension of the contour and k2(s) controls its rigidity. Equation 21 then describes P(v), a scalar function whose domain is the whole image plane I(x, y).

$$ P(v)=\int_0^1 p\left(v(s)\right)ds $$
(21)

Within p(v), Eq. 22 is constructed to attract the snake to the edge (contour) of the target, where k3 controls the attracting strength of the potential energy, Ge denotes a Gaussian smoothing filter with scale parameter e, and the symbol '*' denotes convolution.

$$ p\left(x,y\right)=\pm k_3\cdot\left|G_e*I\left(x,y\right)\right| $$
(22)
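As an illustration, scikit-image provides an active contour (snake) implementation; in the sketch below, alpha and beta play the roles of k1 (tension) and k2 (rigidity) in Eq. 20. The random stand-in image, the initial circle, and the parameter values are placeholder assumptions.

```python
import numpy as np
from skimage.filters import gaussian
from skimage.segmentation import active_contour

img = gaussian(np.random.rand(200, 200), sigma=3)   # stand-in for a real frame
s = np.linspace(0, 2 * np.pi, 200)
init = np.column_stack([100 + 60 * np.sin(s),       # initial contour: a circle
                        100 + 60 * np.cos(s)])
snake = active_contour(img, init, alpha=0.015, beta=10.0, gamma=0.001)
```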

The active contour model has many advantages. First, the snake method unifies image data features, the initial estimate, the target contour, and knowledge-based constraints into a single feature extraction process. Second, given a suitable initialization, the method converges automatically to the minimum-energy state. Third, the extraction area grows while computational complexity decreases, because energy is minimized from coarse to fine in scale space.

The active contour model has two main defects. One is sensitivity to the initial position: it works well when the snake is initialized near the features of the tracked target, but cannot track correctly when the initial position is far from useful features. The other is that the energy is a non-convex function, so the method may converge to a local extremum, or even diverge.

Figure 4 shows one instance of the active contour model.

Fig. 4

An instance of snake in target tracking

In Fig. 4, we find that the snake algorithm successfully tracks the moving target in Fig. 4a, but fails in Fig. 4b. The failure occurs because the contour changes when the two persons overlap: the previous features and deformable contour become those of both persons together. In other words, the snake method treats the two persons in Fig. 4b as one target.

Many researchers have improved the active contour model. Menet et al. provided the B-snake model, in which the target contour is expressed by B-splines that can represent corner points implicitly [66]; this method can be applied to edge detection and contour extraction without prior knowledge of the target contour. The initial snake method converges to a point or a line in the absence of image force, so Cohen et al. presented a balloon model based on the classical active contour model [18]; their model reduces the sensitivity of the snake model and can cross some pseudo edge points (local extrema) in the image. Cohen applied the method to medical image processing, extracting ventricular contours from magnetic resonance and ultrasound images. Allili et al. applied an adaptive mixture model to active contours [4]. Brox et al. provided a level-set-based segmentation method combining spatial and temporal variations with active contours [11]. Chiverton et al. presented an automatic tracking and segmentation method with motion-based bootstrapping and a shape-based active contour [15]. Aitfares et al. presented a mixed target tracking method using region intensities and motion vectors [3]. Ning et al. combined registration and active contour segmentation [71].

Recently, Cai et al. presented DGT (dynamic graph-based tracker) to handle deformation and illumination change [12]. Sun et al. provided a non-rigid target tracking method using a supervised level set model and online boosting [79]. Choi et al. combined optical flow with active contours, where the optical flow handles deformation and speed change and the active contour tracks the contour of the moving target [16]. Mondal et al. combined local neighborhood information with a fuzzy k-nearest-neighbor classifier [67]. Huang et al. improved the snake method and applied it effectively to segmentation and tracking of lymphocytes [42]; their results show a great improvement over previous algorithms. Cuenca et al. constructed a new computational model of the constraining force, adopting GPU-CUDA to compute the convolution between the image and the Gaussian smoothing filter [26]; their method has been adopted in computerized tomography (CT) scanning.

4.2 Scale-invariant feature transform method

Scale-invariant feature transform (SIFT) is a machine vision algorithm that detects and describes local features of an image. SIFT searches for extreme points in scale space and extracts the position, scale, and rotation invariants of those extreme points as features. The algorithm was first presented by David Lowe in 1999 [61] and summarized in 2004 [62]. Many researchers have improved SIFT over the years: PCA-SIFT applies principal component analysis to reduce the dimension of the descriptor, which lowers memory usage and speeds up matching [51]; SURF approximates many steps of SIFT to accelerate recognition [7].

First, applying a Gaussian convolution kernel as the linear kernel of the scale transformation, a scale space on R² can be defined as in Eq. 23, where G(·) is a variable-scale Gaussian function described in Eq. 24, (x, y) denotes the spatial position, and σ controls image smoothness. Then, to search for stable key points in scale space, the difference of Gaussians (DOG) in Eq. 25, built from Gaussian difference kernels at nearby scales, is used instead of the Laplacian of Gaussian (LOG).

$$ L\left(x,y,\sigma\right)=G\left(x,y,\sigma\right)*I\left(x,y\right) $$
(23)
$$ G\left(x,y,\sigma\right)=\frac{1}{2\pi\sigma^2}e^{-\frac{x^2+y^2}{2\sigma^2}} $$
(24)
$$ D\left(x,y,\sigma\right)=L\left(x,y,k\sigma\right)-L\left(x,y,\sigma\right) $$
(25)
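A minimal sketch of Eqs. 23 and 25: smooth the image at two nearby scales and subtract. The choice k = sqrt(2) is an assumption commonly used in the literature, not mandated by the text.

```python
import cv2
import numpy as np

def dog(gray, sigma, k=np.sqrt(2.0)):
    """Difference of Gaussians (Eq. 25) from two smoothed copies (Eq. 23)."""
    g1 = cv2.GaussianBlur(gray, (0, 0), sigma)        # L(x, y, sigma)
    g2 = cv2.GaussianBlur(gray, (0, 0), k * sigma)    # L(x, y, k*sigma)
    return g2.astype(np.float32) - g1.astype(np.float32)
```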

The SIFT algorithm can be divided into four steps, given in Alg. 4.

Alg.4 SIFT Algorithm

Step 1. Extreme value detection in scale space.

The DOG operator is applied to construct the difference scale space. Then every sample point is compared with all its neighbors: 8 points at the same scale and 9 points in each of the adjacent higher and lower scales. A point is kept as an extreme value if it is the extremum among all these 27 points (itself included).

Step 2. Key point positioning.

Positioning the key points and their scales requires function fitting, as presented in Eqs. 26 and 27, where Eq. 27 is obtained by setting D′(z) = 0.

$$ D(z)=D+\frac{\partial D^T}{\partial z}z+\frac{1}{2}z^T\frac{\partial^2D}{\partial z^2}z $$
(26)
$$ \hat{z}=-\frac{\partial^2D^{-1}}{\partial z^2}\frac{\partial D}{\partial z} $$
(27)

Then, feature points with low contrast and unstable edge points are dropped by Eqs. 28 and 29, where ε denotes the dropping threshold, H is the Hessian matrix in Eq. 30, whose eigenvalues measure curvature, and r > 1 is the ratio between the two eigenvalues of H.

$$ \left|D\left(\hat{z}\right)\right|=\left|D+\frac{1}{2}\cdot\frac{\partial D^T}{\partial z}\hat{z}\right|\ge\varepsilon $$
(28)
$$ \frac{\mathrm{Tr}{\left(H\right)}^2}{\mathrm{Det}(H)}<\frac{{\left(r+1\right)}^2}{r} $$
(29)
$$ H=\left[\begin{array}{cc}D_{xx} & D_{xy}\\ D_{xy} & D_{yy}\end{array}\right] $$
(30)

Step 3. Construct SIFT feature vectors.

First, using the gradient direction distribution in the neighborhood of each key point, direction parameters are assigned to the key point so that the operator remains rotation invariant. The primary direction of a key point is computed from the modulus and direction of the gradients; it is the statistical maximum over all gradient directions in the neighborhood of the key position.

After that, according to the primary direction of the key point, one window of size c is extracted, and the gradients of all pixels in this window are weighted by a Gaussian window. The same computation is then processed for all blocks of the key point's window at sizes 2c and c/2. So, if n directions are defined, 16n features describe one key point.

Finally, the illumination effect is removed by normalizing this feature vector.

Step 4. Applying SIFT in matching.

Generate feature vectors for the two images A and B. Then compute the Euclidean distance dij between every key point bi of B and aj of A. Find the two smallest distances di′j and di″j for each j; (bi′, aj) is a matched pair if di′j/di″j < ε, where ε controls the stability of SIFT.
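For reference, OpenCV exposes a SIFT implementation, and the distance-ratio check of Step 4 corresponds to Lowe's ratio test. The image file names and the 0.75 ratio threshold below are placeholder assumptions.

```python
import cv2

img_a = cv2.imread("a.png", cv2.IMREAD_GRAYSCALE)   # placeholder images
img_b = cv2.imread("b.png", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
kp_a, des_a = sift.detectAndCompute(img_a, None)
kp_b, des_b = sift.detectAndCompute(img_b, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)
pairs = matcher.knnMatch(des_b, des_a, k=2)         # two nearest neighbors per key point
good = [p[0] for p in pairs                         # ratio test: d_i'j / d_i''j < epsilon
        if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
```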

SIFT has many advantages. First, SIFT extracts local features of the image; it is robust to rotation, scale zooming, and illumination changes, and also shows good stability under perspective change, affine transformation, and noise. Second, SIFT features carry much information, suiting fast and accurate recognition in large feature databases. Third, even a few targets generate many SIFT features.

SIFT also has some defects. First, traditional SIFT performs badly on non-rigid targets; recognition errors occur when the target deforms. Second, the computational complexity of SIFT is high because each key point is represented by a high-dimensional vector, so the time cost grows quickly as image features increase. Third, SIFT's feature extraction is not sufficient and its recognition effectiveness is not good enough in some scenarios.

Figure 5 shows one instance of SIFT.

Fig. 5

An instance of SIFT in target tracking

In Fig. 5, we find that the SIFT algorithm successfully tracks the moving target with the quadrilateral in Fig. 5a, but fails in Fig. 5b. The failure occurs because many SIFT features present wrong directions, so SIFT degenerates and recognizes incorrectly.

Today, there are many improvements of SIFT. Ke and Sukthankar presented PCA-SIFT in 2004 [51], combining principal component analysis (PCA) with SIFT to reduce the dimensionality of the descriptors and the computational time. Dalal et al. presented HOG, which divides the image into small connected cells that are combined into blocks to compute gradients [27]. Bosch et al. changed the feature selection space and computed SIFT operators in HSV space [10]. Abdel-Hakim and Farag applied RGB color information to construct the CSIFT method [1], extending SIFT from grayscale to color recognition; their method matches features across three colors and applies three dimensions in recognition. Morel et al. presented the affine-SIFT method to improve the affine performance of SIFT [68]; their method adds the two camera angles that cause affine transformation.

SIFT and its derivative algorithms, such as SURF and HOG, are important in visual target tracking and are studied by many researchers. Zhu et al. first applied HOG to human detection [97]. Wang et al. combined multiple features for target tracking [85]. Xu et al. combined HOG with SVM [89]. Chu et al. extracted features with RGB-based SURF, which outperformed gray-based SURF [17]. Babenko et al. applied SURF features to multiple-instance learning [5]. Sun combined HOG features with an online boosting framework and achieved good performance [78]. Xia applied Haar and HOG features to an online boosting method to classify features [87].

SIFT can also be combined with other methods. Wang et al. combined SIFT with Camshift [83], using the Bhattacharyya coefficient to determine matches between SIFT features; their method is robust when the tracked target is deformed, rotated, or occluded. Xie et al. applied SIFT features in a support vector machine (SVM) [88], extending SIFT to the multi-target (rigid) tracking domain; their method has been applied to tracking multiple aircraft.

5 Experimental results and analysis

5.1 Dataset

Although benchmarking of visual target tracking has made much progress recently, it is still necessary to construct a video data set that supports objective evaluation of tracker performance under a variety of complex changes; collecting a representative data set is key to comprehensive performance evaluation of trackers. There are some visual tracking data sets for surveillance, such as CAVIAR [29] and VIVID [22], but the targets in both move simply against static backgrounds, so more complex data sets are needed to evaluate algorithms.

Recently, many researchers have studied the construction of video data sets and evaluation frameworks. Ref. [87] constructs a video test data set of 50 labeled image sequences to evaluate tracking algorithms, and also offers many open trackers (29 algorithms are available now) for comparison. Ref. [77] selects 315 sequences of different kinds and evaluates 19 open trackers on this data set. A strength of Ref. [77] is that it provides online evaluation at http://www.alov300.org, which only requires uploading tracking results on the standard data set; the website then compares the tested algorithm against existing algorithms by F-score and accuracy. Ref. [58] provides a data set of 356 video sequences of 12 kinds, focusing on rigid targets and pedestrians; its website http://www.lv-nus.org/pro/nus_pro.html offers test data from 20 good tracking algorithms.

In 2015, Wang et al. provided an evaluation method that assesses trackers by motion model, feature extractor, observation model, model updater, and ensemble post-processor [84]. Further, in the VOTC (Visual Object Tracking Challenge), data sets and evaluation criteria are uniform [53].

5.2 Experimental parameters

5.2.1 Tracker

This paper selects 9 algorithms to compare against each other. These algorithms are selected because they show effective results in our experiments. The comparison is intended only to show the strengths and weaknesses of the kinds of tracking algorithms introduced in this paper; the experimental results do not imply the overall performance of these algorithms. Table 1 presents the 9 evaluated algorithms, where 'MU' means model update in the table head; 'L' means local, 'H' means holistic, 'T' means template, 'IH' means intensity histogram, 'PCA' means principal component analysis, 'SR' means sparse representation, 'DM' means discriminative model, and 'GM' means generative model in the 'Representation' column; 'PF' means particle filter and 'LOS' means local optimum search in the 'Search' column; 'N' means no and 'Y' means yes in the 'MU' column; 'M' means Matlab code, 'C' means C/C++ code, and 'MC' means the algorithm combines Matlab and C/C++ in the 'Code' column.

Table 1 Evaluated Tracking Algorithms

5.2.2 Test set

This paper applies the test platform of Ref. [87], which contains 50 video sequences for tracker performance testing. Figure 6 shows the first frame of every test sequence, labeled with the standardized bounding box used for comparison (red rectangle in each image). In this paper, we choose 5 representative video sequences to evaluate the 9 algorithms in Table 1.

Fig. 6

1st frame with label of 50 video sequences

Ref. [87] provides 11 different attributes of the data set. In this paper, we analyze tracking accuracy under IV, SV, OCC, DEF, MB, FM, IPR, OPR, OV, BC, and LR, which are listed in Table 2, using the same test data. We then apply experiments to illustrate the application environments, advantages, and defects in the next section.

Table 2 List of the attributes annotated to test sequences [86]

5.2.3 Evaluation criterion

This paper applies the benchmark of Ref. [87] as the evaluation platform, so we apply the evaluation criterion of Ref. [87] to evaluate these algorithms, which is the same as in Refs. [87, 88].

The evaluation criterion is the bounding-box overlap score, defined as S = |γt ∩ γa| / |γt ∪ γa|, where ∩ and ∪ denote the intersection and union of the two regions, | · | denotes the number of pixels in a region, γt is the tracked bounding box, and γa is the ground-truth bounding box. Because using one success-rate value at a specific threshold (e.g., t0 = 0.5) for tracker evaluation may not be fair, the AUC (area under curve) of each success plot is also applied to rank the tracking algorithms.
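A minimal sketch of the overlap score and the AUC of the success plot, under the assumption that boxes are given as (x, y, w, h):

```python
import numpy as np

def overlap_score(box_t, box_a):
    """Overlap S = intersection / union for boxes (x, y, w, h)."""
    x1 = max(box_t[0], box_a[0]); y1 = max(box_t[1], box_a[1])
    x2 = min(box_t[0] + box_t[2], box_a[0] + box_a[2])
    y2 = min(box_t[1] + box_t[3], box_a[1] + box_a[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = box_t[2] * box_t[3] + box_a[2] * box_a[3] - inter
    return inter / union if union > 0 else 0.0

def success_auc(scores, thresholds=np.linspace(0, 1, 101)):
    """Success plot = fraction of frames with S > t; AUC is its mean over t."""
    return float(np.mean([(np.asarray(scores) > t).mean() for t in thresholds]))
```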

We then evaluate robustness. The classic evaluation method is OPE (one-pass evaluation): the tracker is initialized with the accurate position in the first frame, run through the test sequence, and the average accuracy is reported. OPE is sensitive to initialization, and its performance changes with different initial frames. So two other evaluation methods are applied to evaluate robustness to initialization by perturbing it in time (tracking from different frames) and space (tracking from different bounding boxes): TRE (temporal robustness evaluation) and SRE (spatial robustness evaluation).

In this paper, each tracker is tested on more than 29,000 frames in OPE, evaluated 12 times per sequence with more than 350,000 bounding boxes in SRE, and tested on about 310,000 frames (each sequence divided into 20 segments) in TRE.

5.3 Experiment result and analysis

5.3.1 Tracking results

This sub-section consists of two parts. In the first part, we choose 5 representative video sequences from the data set of Ref. [87] to test the 9 selected trackers; test results are provided and analyzed with reference to the 11 annotated attributes. In the second part, we provide the tracking results and analyze the overall performance of the 9 trackers on the 50 video sequences, and we analyze the tracking results of the 9 trackers for the 11 annotated attributes. The experiments in these two parts validate our conclusions about these 4 kinds of algorithms.

In the video sequence 'Walking', there are only three attribute changes: 'SV', 'OCC', and 'DEF'. Indeed, it is a video sequence with few changes and low difficulty. Some tracking results are provided in the following Fig. 7. In Fig. 7, we find that SNA partly misses its target when occlusion by a telegraph pole appears in the 88th frame. In the 203rd frame, a change of background leads to the failure of HAH. CAM fails in the 265th frame because of partial deformation of the target.

Fig. 7

Test with video sequence 1 (Walking 1)

Then we test the algorithms on video sequence Walking 2, which is also relatively simple test data with three attribute changes: 'SV', 'OCC', and 'LR'. Partial tracking results are provided in the following Fig. 8. In Fig. 8, we find that dislocation between the tracking box and the target appears for SNA and KMS because the background changes in the 132nd frame. In the 194th frame, another moving object appears and disturbs tracking by occlusion; tracking errors occur in KMS, SIFT, SNA, and CAM. Then SNA, KMS, CAM, SIFT, and HAH all fail to track the target because of the disturbance of the other moving object in the 302nd frame. Finally, in the 387th-403rd frames, after the disturbing object disappears, these 5 algorithms still cannot track the target accurately.

Fig. 8

Test with video sequence 2 (Walking 2)

Then we test the algorithms on the video sequence 'Woman', which contains more changing attributes: 'IV', 'SV', 'OCC', 'DEF', 'MB', 'FM', and 'OPR'. Experimental results are provided in the following Fig. 9. In Fig. 9, we find that SNA converges to a local part of the target and performs badly under the scale change in the 70th frame. In the 86th frame, the target deforms, which leads to the failure of KMS and CAM. Then HAH, PF, and ASLA fail to track the target because it is partly occluded by a car in the 145th frame. Further, in the 229th frame, partial occlusion against a complex background causes tracking errors for all 9 algorithms. When the target is blurred by its quickening movement and partly occluded by a tree, only HAH partly tracks it in the 566th frame.

Fig. 9

Test with video sequence 3 (Woman)

Then the video sequence 'Singer', with attributes 'IV', 'SV', 'OCC', and 'OPR', is chosen because its complex illumination changes allow comparing the robustness of the tracking algorithms to illumination change. Experimental results are provided in Fig. 10. In Fig. 10, KMS fails to track and SNA converges to a local part of the target when the illumination is enhanced in the 92nd frame. Then KMS and SIFT fail when the target rotates out of the image plane (OPR) in the 148th frame; the tracking box of CAM also enlarges. In the 302nd frame, the target keeps rotating under illumination changes, which leads to the failure of CAM.

Fig. 10

Test with video sequence 4 (Singer)

The final video sequence, 'Skating', has attributes 'IV', 'SV', 'OCC', 'DEF', 'OPR', and 'BC'. It is chosen because it has many adverse attributes that can evaluate the algorithms comprehensively; tracking failures appear easily because of these attributes. Experimental results are provided in Fig. 11. In Fig. 11, SNA fails to track when the target deforms quickly in the 23rd frame. In the 88th frame, only SCM tracks the target under its quick scale variation. Between the 165th and 268th frames, no algorithm tracks effectively because of quick movement, rotation, and occlusion. Between the 329th and 393rd frames, the target moves quickly under large illumination change, which leads to the failure of all tracking algorithms.

Fig. 11

Test with video sequence 5 (Skating)

5.3.2 Evaluation results

After the experimental analysis of the 9 algorithms on 5 video sequences, we apply the benchmark of Ref. [87] to analyze the tracking results of the 9 algorithms on all 50 video sequences. This evaluates the 4 kinds of algorithms discussed in this paper. We first evaluate the overall performance of the 9 algorithms; then the performance on 5 representative attributes of the 11 is analyzed.

Figure 12 presents the overall performance analysis; the AUC value is applied to summarize and rank the algorithms' performance. In Fig. 12, we find that the average performance under TRE is higher than under OPE. This is because TRE tracks many shorter sub-sequences, and tracking algorithms perform better when video sequences are shorter. Similarly, the average performance under SRE is lower than under OPE. The overall performance shows that algorithms based on the particle filter and its extensions perform better; one reason is that there are more particle-filter-based algorithms than algorithms of other kinds. Also, newer algorithms perform better than older ones.

Fig. 12

Overall performance evaluation

Figure 13 shows the performance of the algorithms under background clutter; 21 video sequences have the attribute 'background clutter'. SNA has the lowest average performance of all the algorithms, because SNA needs a clearer background to converge the contour to the target. We also find that SCM is the most robust to complex backgrounds.

Fig. 13

Performance evaluation with background clutter

Figure 14 shows the performance of the algorithms under deformation; 19 video sequences have the attribute 'deformation'. CAM has the lowest average performance of all the algorithms, which accords with our conclusions about it. We also find that SCM and ASLA both perform best when the target deforms.

Fig. 14

Performance evaluation with deformation

Figure 15 shows the performance of the algorithms under illumination change. SNA does not achieve a good average performance because illumination change weakens the separation between background and target. SCM and ASLA both perform well. HAH also performs well because it is based on SIFT, which has illumination invariance.

Fig. 15

Performance evaluation with illumination change

Figure 16 shows the performance of the algorithms under scale variation. CAM does not achieve a good average performance because its adaptability to scale is poor. SCM and ASLA both perform well on this attribute.

Fig. 16

Performance evaluation with scale variation

Figure 17 shows the performance of the algorithms under occlusion. Of all 9 algorithms, only HAH is robust to partial occlusion, because HAH is based on feature extraction and can still extract some features to track the target under occlusion.

Fig. 17

Performance evaluation with occlusion

6 Discussion and conclusion

After decades of technological development, visual target tracking is still a highlight of the computer vision domain. This paper first briefly introduced visual moving target tracking technology and its applications. Target tracking methods were then classified by their approach and the type of extracted features. Further, each kind of tracking method was elaborated with related algorithms, including their advantages and disadvantages as well as improvements. Moreover, the algorithms were compared in terms of advantages, disadvantages, and application scenarios. In summary, this paper provides a comprehensive and objective review of this research domain.

Many problems remain in related research domains, mainly in the following aspects.

1) Occlusion

Although there is much research on recognition under occlusion, most target tracking systems cannot completely solve this problem. Occlusion appears in several conditions: occlusion between different targets, occlusion by foreground or background, and self-occlusion of the object.

The target is only partly visible when occlusion appears, so classical tracking methods, whether using holistic or local features, static or dynamic models, often make tracking mistakes.

Our team has constructed a new target tracking method based on movement features [30-32]. This method can partly handle short-time occlusion, but there is still a long way to go.

2) Feature selection

Target feature selection and measurement is an important problem in target tracking. Appropriate features can improve tracking efficiency and accuracy. How to select features, how to combine different features, and how to weight features are still questions worth researching today.

The Adaboost classifier is widely applied to combine different features, and learning systems are effective for weighting. Recently, deep learning has shown its potential in image recognition, and it will soon be applied in multimedia tracking.

3) Tracking of multiple targets

Earlier target tracking systems were mainly applied to track a single target. Today, the demand for multi-target tracking has become large. The main difference between single-target and multi-target tracking is complexity: compared with single-target tracking, multi-target tracking has uncertainty in the correspondence between observations and targets, and tracking mistakes may occur when multiple targets come close enough together.

Multi-view methods are now widely used to recognize multiple tracked targets. They apply multiple cameras to the same background and use the differences between cameras to detect and track the targets.

Finally, with the revival of neural networks since 2006, deep learning has been widely applied in many research domains, including computer vision [36], and it has also been applied to visual target tracking in recent years. In the 2015 VOTC (Visual Object Tracking Challenge), the best algorithm, MDNet, applied a CNN (Convolutional Neural Network) to learn objects [69]. Meanwhile, Deep-SRDCF also achieved a top-3 result in this challenge [28]. So we can see that deep learning will be another hotspot in this research domain.

In summary, target tracking in video is an important research domain with a wide range of research value. This field has many directions worth exploring.