1 Introduction

Visual tracking is an important and active research topic in the computer vision community because of its wide range of applications, e.g., intelligent video surveillance, human-computer interaction and robotics [43, 44]. The purpose of visual tracking is to estimate the state of a tracked target in a video. It is usually formulated as a search task in which an appearance model first represents the target and a search strategy then infers the state of the target in the current frame. Although it has been extensively studied over the last two decades, tracking remains a challenging problem due to the many appearance variations caused by occlusion, pose, illumination, background clutter, and so on.

Broadly speaking, a tracking algorithm comprises two fundamental components: (1) a motion model (also called a dynamic model), which relates the states of an object over time and predicts its likely state by supplying the tracker with a number of candidate states (e.g., Kalman filter [26], particle filter [25]); and (2) an observation model (also called an appearance model) [30], which represents the tracked object and verifies predictions by evaluating the likelihood of each candidate state in the current frame.

According to the observation models they use, existing object tracking algorithms can be categorized into methods based on templates [1, 14], online classifiers [4, 28], and so on. In template-based algorithms, the tracked object is described by a single template [14] or by multiple templates, and tracking is cast as a search for the regions most similar to the tracked object. Trackers based on online classifiers aim to distinguish the tracked object from its surrounding background by treating tracking as a binary classification problem. Thus, both classic and recent machine learning algorithms have promoted the progress of tracking algorithms and systems, including boosting [20, 21], support vector machines [23, 39], naive Bayes [45], random forests [38], multiple instance learning [4], structured learning [22] and so on. Jia et al. [27] exploited both partial and spatial information of the target through a novel alignment-pooling method and proposed an efficient tracking algorithm based on a structural local sparse appearance model with an adaptive template update strategy. This algorithm made good use of the appearance and spatial structure of the target and reduced the influence of occluded target templates. However, when the target undergoes heavy occlusion (e.g., in long-term tracking videos where the target disappears or is totally occluded), appearance changes, or interference from similar objects and background, its robustness needs further improvement. Effective modeling of the object's appearance is therefore one of the key issues for the success of a visual tracker.

In this paper, we propose a robust tracking algorithm that integrates a generative and a discriminative model. The object appearance model consists of a generative target model and a discriminative classifier. For the generative target model, we adopt the weighted structural local sparse appearance model [27], combining patch-based gray values and Histogram of Oriented Gradients (HOG) features to form the patch dictionary. By sampling positives and negatives, alignment-pooling features are obtained from the patch dictionary through local sparse coding, and a support vector machine is then used to train the discriminative classifier. A robust inter-frame matching based on optical flow [24] and Delaunay triangulation [17, 18], together with template matching, is adopted to improve the proposal distribution of the particle filter and thus enhance tracking performance. The proposed method is embedded into a Bayesian inference framework for visual tracking. Alignment pooling across the local patches within a candidate region yields a similarity measure, while the trained discriminative classifier provides a classification score; the two are multiplied to obtain the particle confidence. Moreover, to adapt to appearance changes, the patch dictionary and the discriminative classifier are updated by incremental learning every five frames. Experimental results on publicly available benchmark video sequences demonstrate the accuracy and effectiveness of our tracker.

The main contributions of this paper are as follows:

  1) Integrating inter-frame matching into the framework of visual tracking; the two parts complement one another, thus improving tracking performance;

  2) A robust inter-frame matching based on optical flow and Delaunay triangulation is adopted to enhance the robustness and accuracy of matching;

  3) The structural local sparse appearance model is weighted by considering the spatial configuration of each local patch of the target;

  4) The object appearance model is enhanced by integrating a generative model and a discriminative classifier.

The remainder of this paper is organized as follows: Section 2 summarizes the previous work most related to ours. The proposed robust tracking method, ASLA_DW (adaptive structural local sparse appearance model with discriminative weighting), is described in Section 3. Experiments and results are provided and analyzed in Section 4. Finally, our work is summarized and conclusions are drawn in Section 5.

2 Related work

Many promising approaches have been proposed to tackle object tracking. These methods can be roughly classified into two categories: generative methods and discriminative methods.

Generative methods represent objects with appearance models and track targets by searching for the image region most similar to the model. Reference templates can be learned from a set of training data. Black et al. [8] learned a subspace model offline to represent targets and simultaneously used parametric optical flow estimation. Aside from static appearance models, online appearance models, which are updated as the appearance of the target changes, have also been presented. Wang et al. [40] constructed a dynamic multi-cue integration model within the particle filter framework. Ross et al. [37] proposed a tracking framework based on incremental image-as-vector subspace learning with a sample mean update. Li et al. [29] modeled target appearance changes by incremental image-as-matrix subspace learning, adaptively updating the sample mean and eigenbasis. Recently, sparse representation [12] has been successfully applied to visual tracking [32, 33]. In this case, the tracker represents each target candidate as a sparse linear combination of dictionary templates that can be dynamically updated to maintain an up-to-date target appearance model. This representation has been shown to be robust against partial occlusion, which leads to improved tracking performance. However, sparse coding based trackers perform computationally expensive l1 minimization at each frame.

Unlike generative methods, discriminative methods regard object tracking as a binary classification problem that distinguishes the target from the background. These methods exploit information from both the target and the background. Avidan [3] combined an optical flow representation with a support vector machine (SVM) classifier for object tracking. Ozuysal et al. [36] proposed a random forest classifier which learned binary features of the target. These methods use predefined, fixed feature sets for classifier learning. In addition, there are many other approaches that select good object features online. Collins et al. [13] presented a two-class variance ratio to select the best discriminative features online. Zhong et al. [46] proposed a weakly supervised online training data selection method for visual tracking. However, these methods may introduce errors from self-updating, which can cause tracking drift. To address this problem, Babenko et al. [5] proposed an online multiple instance learning (MIL) method for object tracking. In [47], Zhou et al. improved MIL by selecting the support instances adaptively and updating them by taking both previously and recently obtained image data into account. However, to remain accurate and robust, most discriminative methods avoid updating their classifiers far away from the initial setting; as a result, drift occurs when the target appearance changes heavily.

In recent years, deep convolutional neural networks (CNNs) have improved state-of-the-art performance in many computer vision applications, and existing methods have also explored the use of CNNs in online tracking. In [41], a three-layer CNN is trained online; without pre-training and with limited training samples obtained online, such a CNN fails to capture object semantics and is not robust to deformation. To improve the proposal distribution of the particle filter, we estimate the translation of the target with an optical flow tracker and Delaunay triangulation by matching detected corners. This is related to the problem of image search. Nie et al. [34] proposed a content-based approach to automatically predict the search performance of image search, and by fully exploring the information from simple visual concepts, Nie et al. [35] presented a scheme to enhance web image reranking for complex queries.

3 Proposed method

The flowchart of our proposed method is shown in Fig. 1. It consists of three components: 1) candidate sampling; 2) similarity calculation; 3) MAP estimation. Given the current frame at time t, and based on the patch dictionary and support vector machine classifier constructed in the first frame, we obtain the candidate samples by combining particle sampling with samples predicted by the optical flow tracker and SURF matching, and solve the local sparse coding to obtain the aligned pooling features. We then use the discriminative classifier to calculate the classification score and, at the same time, spatially weight the aligned pooling features to obtain the particle similarity. Finally, the overall confidence is obtained by multiplying the classification score and the particle similarity. The particle with the maximum confidence is chosen as the tracking result (shown as a red rectangle).

Fig. 1 Flowchart of the proposed method

The novelty of our proposed method is summarized below:

  1) A robust appearance model made up of a generative target model and a discriminative classifier;

  2) Integration of inter-frame matching into the framework of visual tracking;

  3) An improved proposal distribution for the particle filter based on image matching.

3.1 Object appearance model

Our observation appearance model is shown in Fig. 2. It is based on a generative and a discriminative model: the generative model is built from a patch dictionary, and the discriminative model is an SVM (support vector machine) classifier. Given the initial target region in the first frame, we divide the target region into N patches. For each patch, the gray vector and HOG features are extracted and combined to form the patch dictionary. We then sample positives and negatives, extract their feature vectors (combined gray vector and HOG features), and solve the local sparse coding with the patch dictionary to obtain the aligned positive and negative pooling features, which are used as training data for the support vector machine to obtain the discriminative classifier. In the following subsections, we describe each part of our appearance model in detail.

Fig. 2 Our observation appearance model

3.1.1 Histogram of oriented gradients (HOG)

The HOG representation is inspired by the SIFT descriptor proposed by Lowe [31]. It is constructed by dividing the tracking region into non-overlapping grids and then computing the orientation histograms of the image gradient within each grid cell (Fig. 3).

Fig. 3 Region of interest and the corresponding HOG feature

Let θ(x, y) and m(x, y) be the orientation and magnitude of the intensity gradient at image pixel (x, y). The image gradients can be computed via a finite difference mask [−1 0 1] and its transpose. The gradient orientation at each pixel is discretized into one of p values using a contrast-insensitive definition:

$$ B\left(x,y\right)=\operatorname{round}\left(\frac{p\,\theta \left(x,y\right)}{\pi}\right) \mod p $$
(1)

Let b ∈ {0, …, p − 1} range over the orientation bins. The feature vector at (x, y) is:

$$ F{\left(x,y\right)}_b=\begin{cases} m\left(x,y\right) & \text{if } b=B\left(x,y\right) \\ 0 & \text{otherwise} \end{cases} $$
(2)

Let F be a pixel-level feature map for a w × h image and k > 0 be a parameter indicating the side length of a square image region. A dense grid of rectangular "cells" [16] is defined and pixel-level features are aggregated to obtain a cell-based feature map C, with feature vectors C(i, j) for 0 ≤ i ≤ ⌊(w − 1)/k⌋ and 0 ≤ j ≤ ⌊(h − 1)/k⌋. This aggregation reduces the size of the feature map. After bilinear interpolation is used to aggregate the features, each feature is normalized. The resulting feature vector is the HOG descriptor of the image region. The parameters of the HOG descriptor are normally set to p = 9 and k = 8, with blocks of 2 × 2 cells, which leads to a 36-dimensional feature vector per block.
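As an illustration of the patch feature described above, the sketch below combines an l2-normalized gray vector with a HOG descriptor using p = 9 orientation bins, k = 8 pixel cells and 2 × 2-cell blocks. It assumes scikit-image's `hog` function as a stand-in for the descriptor; the original work is implemented in Matlab, so the library choice and the helper name `patch_feature` are illustrative only.

```python
# A minimal sketch of the combined gray + HOG patch feature (illustrative, not
# the original Matlab implementation).
import numpy as np
from skimage.feature import hog

def patch_feature(gray_patch):
    """Concatenate the raw gray vector and a HOG descriptor for one local patch."""
    gray_vec = gray_patch.astype(np.float64).ravel()
    gray_vec /= (np.linalg.norm(gray_vec) + 1e-12)   # l2 normalization, as in the dictionary
    hog_vec = hog(gray_patch,
                  orientations=9,                    # p = 9 orientation bins
                  pixels_per_cell=(8, 8),            # k = 8
                  cells_per_block=(2, 2),            # 2 x 2 cells -> 36-D per block
                  feature_vector=True)
    return np.concatenate([gray_vec, hog_vec])
```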

3.1.2 Structural local sparse representation

The structural local sparse appearance model was proposed in [27]. In this section, we review it briefly and derive the spatial weight of the candidate target used in our method. In [27], Jia et al. sampled a set of overlapping local image patches inside the target region with a fixed spatial layout and used these local patches as the dictionary to encode the local image patches inside the candidate regions, i.e., M = [m_1, m_2, …, m_N, m_{N+1}, …, m_{N×n}] ∈ R^{d×(N×n)}, where d is the dimension of the local image patch vector, n is the number of target templates T = [T_1, T_2, …, T_n], and N is the number of local patches sampled within one target region. Each column of M is obtained by l_2 normalization of the vectorized local image patches. For a target candidate, the local image patches within it are denoted by Y = [y_1, y_2, …, y_N] ∈ R^{d×N}. To save computation time, we use 8-bit gray-scale images for analysis.

During tracking, the local patches within the target candidate region can be sparsely represented as a linear combination of the local patch dictionary by solving

$$ \underset{{\mathrm{b}}_i}{\min}\ \left\Vert {\mathrm{y}}_i-\mathrm{M}{\mathrm{b}}_i\right\Vert_2^2+\lambda \left\Vert {\mathrm{b}}_i\right\Vert_1,\kern0.5em \mathrm{s}.\mathrm{t}.\ {\mathrm{b}}_i\ge 0 $$
(3)

where ||·||_2 and ||·||_1 denote the l_2 and l_1 norms respectively, λ is the regularization parameter, b_i ∈ R^{(N×n)×1} is the sparse code vector of the i-th local image patch, and b_i ≥ 0 means all the elements of b_i are nonnegative. Note that B = [b_1, b_2, …, b_N] represents the sparse coefficients of the local patches within one target candidate region. According to the target templates, the sparse coefficients of each local patch are divided into n (the number of target templates) segments, i.e., \( {\mathrm{b}}_i^T=\left[{\mathrm{b}}_i^{(1)T},{\mathrm{b}}_i^{(2)T},\dots, {\mathrm{b}}_i^{(n)T}\right] \), where

$$ {\mathrm{b}}_i^{(k)T}=\left[{\mathrm{b}}_{\left(k-1\right)\times N+1},{\mathrm{b}}_{\left(k-1\right)\times N+2},\dots, {\mathrm{b}}_{\left(k-1\right)\times N+N}\right]\in {R}^{1\times N} $$
(4)

denotes the k-th segment of b_i. These coefficients are weighted to obtain v_i for the i-th patch,

$$ {\mathrm{v}}_i=\frac{1}{C}\sum_{k=1}^n{\mathrm{b}}_i^{(k)}=\frac{1}{C}\left[\begin{array}{c} {v}_{i1} \\ {v}_{i2} \\ \vdots \\ {v}_{iN} \end{array}\right],\quad i=1,2,\ldots,N $$
(5)

where v_i ∈ R^{N×1} contains the pooled sparse coefficients of the i-th local patch and C is a normalization term, C = v_{i1} + v_{i2} + … + v_{iN}. Since one candidate target contains N local image patches, the vectors v_i form a square matrix V = [v_1, v_2, …, v_N]. According to the spatial layout of the target, each local patch is best described by the block at the same position in the template (i.e., by the sparse coefficients at the aligned positions). Therefore, we take the diagonal elements of the square matrix V as the final sparse coefficients of the local patches within the candidate region, i.e.,

$$ \mathrm{f}=\operatorname{diag}\left(\mathrm{V}\right)=\left[\begin{array}{c} {f}_1 \\ {f}_2 \\ \vdots \\ {f}_N \end{array}\right] $$
(6)

where f is the vector of sparse coefficients of all the local patches, i.e., f_1 is the aligned sparse code of the first patch and f_2 that of the second patch.
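The alignment-pooling step of Eqs. (3)-(6) can be sketched as follows, assuming scikit-learn's non-negative lasso as a stand-in for the sparse solver used in the original Matlab implementation; the regularization scale and the helper name `alignment_pooling` are illustrative.

```python
# A minimal sketch of alignment pooling for one candidate region.
# M: (d, N*n) patch dictionary with l2-normalized columns ordered template by
# template; Y: (d, N) local patches of the candidate.
import numpy as np
from sklearn.linear_model import Lasso

def alignment_pooling(Y, M, N, n, lam=0.01):
    """Return the N aligned sparse coefficients f = diag(V)."""
    # sklearn's Lasso scales the quadratic term, so lam is only proportional to
    # the lambda of Eq. (3).
    solver = Lasso(alpha=lam, positive=True, fit_intercept=False, max_iter=1000)
    V = np.zeros((N, N))
    for i in range(N):
        solver.fit(M, Y[:, i])                  # Eq. (3): y_i ~ M b_i, b_i >= 0
        b = solver.coef_                        # length N*n
        v = b.reshape(n, N).sum(axis=0)         # Eq. (5): sum the n template segments
        V[:, i] = v / (v.sum() + 1e-12)         # normalize by C
    return np.diag(V)                           # Eq. (6): aligned pooled coefficients
```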

3.1.3 Sampling positives and negatives

To initialize the classifier in the first frame, we draw positive and negative samples around the labeled target location. Suppose the location of the target object in the first frame is denoted by l_1 = (x_1, y_1). We use a Gaussian perturbation to draw positive samples in a circular area satisfying ||l_pos − l_1|| < γ, and draw negative samples in an annular area specified by γ < ||l_neg − l_1|| < η, where γ and η are the thresholds defining the circular and annular areas, respectively. The sets l_pos and l_neg denote the locations of the positive and negative candidates, respectively. Without loss of generality, we set the scales of the positive and negative candidates to be the same as that of the labeled target object. We then crop the image regions specified by l_pos and l_neg and compute the sparse code of each image patch to form the training data.
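A minimal sketch of this sampling scheme is given below; the sample counts and the uniform annular draw for negatives are illustrative assumptions, since only the radii γ and η are specified above.

```python
# A sketch of drawing positive and negative sample locations around the labeled
# target center l1 = (x1, y1); gamma and eta are the radii of Section 3.1.3.
import numpy as np

def sample_locations(x1, y1, gamma, eta, n_pos=50, n_neg=200, rng=np.random):
    pos, neg = [], []
    while len(pos) < n_pos:                          # Gaussian perturbation inside the circle
        x, y = x1 + rng.randn() * gamma / 2, y1 + rng.randn() * gamma / 2
        if np.hypot(x - x1, y - y1) < gamma:         # ||l_pos - l_1|| < gamma
            pos.append((x, y))
    while len(neg) < n_neg:                          # uniform draw inside the annulus
        r = rng.uniform(gamma, eta)                  # gamma < ||l_neg - l_1|| < eta
        t = rng.uniform(0.0, 2.0 * np.pi)
        neg.append((x1 + r * np.cos(t), y1 + r * np.sin(t)))
    return pos, neg
```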

3.1.4 Support vector machine classifier

Support vector machines (SVMs) are a family of classification algorithms developed within statistical learning theory and originally formulated for binary classification [11, 15]. SVMs optimize the generalization performance of a decision function inferred from a given set of training data. Given training data and the corresponding labels:

$$ \left({\mathrm{x}}_n,{y}_n\right),n=1,2,\ldots,N,{\mathrm{x}}_n\in {R}^D,{y}_n\in \left\{-1,+1\right\} $$
(7)

SVM learning consists of the following constrained optimization:

$$ \begin{array}{l}\underset{\mathrm{w},{\xi}_n}{\min}\ \frac{1}{2}{\mathrm{w}}^{\mathrm{T}}\mathrm{w}+C\sum_{n=1}^N{\xi}_n\\ {}\mathrm{s}.\mathrm{t}.\kern0.5em {\mathrm{w}}^{\mathrm{T}}{\mathrm{x}}_n{y}_n\ge 1-{\xi}_n\ \forall n\\ {}\kern2em {\xi}_n\ge 0\ \forall n\end{array} $$
(8)

The ξ_n (n = 1, 2, …, N) are slack variables which penalize data points that violate the margin requirements. Note that the bias can be included by augmenting all data vectors x_n with a scalar value of 1. The corresponding unconstrained optimization problem is:

$$ \underset{\mathrm{w}}{\min}\ \frac{1}{2}{\mathrm{w}}^{\mathrm{T}}\mathrm{w}+C\sum_{n=1}^N \max \left(1-{\mathrm{w}}^{\mathrm{T}}{\mathrm{x}}_n{y}_n,0\right) $$
(9)

The objective in Eq. (8) is known as the primal form of the L1-SVM with the standard hinge loss. This optimization problem can be reformulated with Lagrange multipliers and solved by applying quadratic programming to its dual form.
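For illustration, the classifier of Eqs. (8) and (9) can be trained on the pooled features with any hinge-loss linear SVM solver; the sketch below assumes scikit-learn's `LinearSVC` as a stand-in for the solver used in our implementation, with an illustrative value of C.

```python
# A minimal sketch of training the discriminative classifier on the pooled
# sparse features (hinge loss, L2-regularized weights, as in Eq. (8)/(9)).
import numpy as np
from sklearn.svm import LinearSVC

def train_classifier(pos_feats, neg_feats, C=1.0):
    """pos_feats, neg_feats: arrays of shape (n_samples, n_features)."""
    X = np.vstack([pos_feats, neg_feats])
    y = np.hstack([np.ones(len(pos_feats)), -np.ones(len(neg_feats))])
    clf = LinearSVC(C=C, loss='hinge', dual=True)    # primal hinge-loss SVM
    clf.fit(X, y)
    return clf

# During tracking, clf.decision_function(candidate_feats) gives the
# classification score that is multiplied with the generative similarity.
```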

3.2 Predicted sample candidates

To improve the proposal distribution of the particle filter, and to handle the case where the tracked target is totally lost or disappears and reappears after a while, we predict the target location with an optical flow tracker, template matching and SURF keypoint matching to obtain extra candidate samples. With these predictions we can recapture the target and track it robustly. This process is shown in Fig. 4. In the following subsections, we describe each part in detail.

Fig. 4 Predicted sample candidates obtained respectively by the optical flow tracker, template matching and SURF keypoint matching

3.2.1 Translation obtained by optical flow tracker and Delaunay triangulation

  1) Optical flow tracker

Optical flow (or optic flow) [10] is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer (an eye or a camera) and the scene. It is the displacement field of the pixels in an image sequence, i.e., the distribution of the apparent velocities of objects in the image. By estimating the optical flow between video frames, one can measure the velocities of objects in the video. In general, moving objects that are closer to the camera display more apparent motion than distant objects moving at the same speed. Optical flow estimation is used in computer vision to characterize and quantify the motion of objects in a video stream, often for motion-based object detection and tracking systems.

The observed brightness of any object point is assumed constant over time, and points close to one another in the image plane are assumed to move in a similar manner (the velocity smoothness constraint). Suppose we have a continuous image, where f(x, y, t) refers to the gray level of (x, y) at time t. Representing a dynamic image as a function of position and time permits the motion to be expressed as follows.

• Assume each pixel moves but does not change intensity.

• The pixel at location (x, y) in frame t − 1 is at (x + Δx, y + Δy) in frame t.

• Optical flow associates a displacement vector with each pixel.

The optical flow thus describes motion over time as a field of two-dimensional velocity vectors: a direction and a speed of motion are assigned to each pixel at a given position in the image. To make computation simpler and quicker, we reduce the real-world three-dimensional (3-D + time) scene to a (2-D + time) case and describe the image by the 2-D dynamic brightness function I(x, y, t). Provided that the brightness intensity in the neighborhood of a pixel does not change as it moves, we can use the following expression

$$ I\left(x,y,t\right)=I\left(x+\delta x,y+\delta y,t+\delta t\right) $$
(10)

Using a Taylor series expansion for the right-hand side of Eq. (10), we obtain

$$ I\left(x+\delta x,y+\delta y,t+\delta t\right)=I\left(x,y,t\right)+\frac{\partial I}{\partial x}\delta x+\frac{\partial I}{\partial y}\delta y+\frac{\partial I}{\partial t}\delta t+H.O.T $$
(11)

From Eqs. (10) and (11), neglecting the higher order terms (H.O.T.) and after rearrangement, we get

$$ {I}_x{v}_x+{I}_y{v}_y+{I}_t=0 $$
(12)

or in formal vector representation

$$ \nabla I\cdotp \overrightarrow{v}=-{I}_t $$
(13)

where ∇I is the so-called spatial gradient of the brightness intensity, \( \overrightarrow{v} \) is the optical flow (velocity vector) of the image pixel, and I_t is the time derivative of the brightness intensity [6]. Optical flow can thus give significant information about the spatial arrangement of the objects viewed and the rate of change of this arrangement.
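A sketch of the corner-based translation estimate is given below, assuming OpenCV's Shi-Tomasi corner detector and pyramidal Lucas-Kanade flow; the parameter values and the median aggregation are illustrative simplifications (the Delaunay-based refinement and the gravity center of corners described next replace the median in our pipeline).

```python
# A sketch of estimating the inter-frame translation of the target from tracked
# corners (prev_gray, curr_gray: uint8 grayscale frames; bbox: x, y, w, h).
import cv2
import numpy as np

def estimate_translation(prev_gray, curr_gray, bbox):
    x, y, w, h = [int(v) for v in bbox]
    mask = np.zeros_like(prev_gray)
    mask[y:y + h, x:x + w] = 255                     # detect corners inside the target box
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100,
                                 qualityLevel=0.01, minDistance=5, mask=mask)
    if p0 is None:
        return None
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p0, None,
                                             winSize=(15, 15), maxLevel=3)
    good = status.ravel() == 1
    if good.sum() < 3:
        return None
    shift = np.median(p1[good] - p0[good], axis=0).ravel()   # robust (dx, dy)
    return shift                                     # applied to the previous target center
```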

  2) Fine matching based on Delaunay triangulation [17, 18]

There may be some wrong correspondences among the initial matches. To filter out the wrong matches, we adopt the Delaunay triangulation to refine them. The Delaunay triangulation has many remarkable properties that make it the most widely used triangulation.

Let P = {p_1, ⋯, p_n} be a set of points in R^d. The Voronoi cell associated with a point p_i, denoted by V(p_i), is the region of space that is closer to p_i than to any other point in P [9]:

$$ V\left({p}_i\right)=\left\{p\in {R}^d:\forall j\ne i,\left\Vert p-{p}_i\right\Vert \le \left\Vert p-{p}_j\right\Vert \right\} $$
(14)

V(p_i) is the intersection of n − 1 half-spaces bounded by the bisector planes of the segments [p_i p_j], j ≠ i. V(p_i) is therefore a convex polytope, possibly unbounded. The Voronoi diagram of P, denoted by Vor(P), is the partition of space induced by the Voronoi cells V(p_i).

Triangulation is a process that takes a region of space and divides it into sub-regions. The space may be of any dimension; a 2D space is considered here since we are dealing with 2D points, in which case the sub-regions are simply triangles. The Euler formula for a triangulation is:

$$ f-e+v=1 $$
(15)

where f is the number of faces, e the number of edges, and v the number of vertices. A triangulation constructed from the n points of P has N_tri triangles and N_edge edges; in this case, e = N_edge.

$$ {N}_{tri}=2n-2-k $$
(16)
$$ {N}_{edge}=3n-3-k $$
(17)

where k is the number of points of P on the convex hull of P.

The Delaunay triangulation Del(P) of P is defined as the geometric dual of the Voronoi diagram: there is an edge between two points p_i and p_j in the Delaunay triangulation if and only if their Voronoi cells V(p_i) and V(p_j) have a non-empty intersection. It yields a triangulation of P, that is to say a partition of the convex hull of P into d-dimensional simplices (i.e., into triangles in 2D, tetrahedra in 3D, and so on). The formula for Del(P) is Eq. (18). Figure 5a displays an example of a Voronoi diagram and its associated Delaunay triangulation in the plane.

$$ Del(P)=\left\{T\left({p}_i,{p}_j,{p}_k\right)\mid {p}_i,{p}_j,{p}_k\in P,\ C\left({p}_i,{p}_j,{p}_k\right)\cap P\backslash \left\{{p}_i,{p}_j,{p}_k\right\}=\varnothing \right\} $$
(18)
Fig. 5 a The Voronoi diagram (gray edges) of a set of 2D points (red dots) and its associated Delaunay triangulation (black edges). b The Delaunay triangulation of 20 discrete points. c Triangulated meshes

where C(p_i, p_j, p_k) is the circle circumscribing the three vertices p_i, p_j, p_k, which form a Delaunay triangle T(p_i, p_j, p_k).

The algorithmic complexity of the Delaunay triangulation of n points is O(n log n) in 2D [2]. Figure 5b and c show the Delaunay triangulation created from 20 discrete points.
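The following sketch illustrates how a Delaunay triangulation can be used to flag inconsistent correspondences, using scipy's triangulation. The rejection rule (comparing how the lengths of incident Delaunay edges change between frames) is a simple illustrative criterion, not necessarily the exact test of [17, 18].

```python
# A sketch of Delaunay-based filtering of matched corners.
# pts_prev, pts_curr: (N, 2) arrays of matched corners; returns an inlier mask.
import numpy as np
from scipy.spatial import Delaunay

def filter_matches(pts_prev, pts_curr, tol=0.3):
    n = len(pts_prev)
    if n < 4:                                        # too few points to triangulate reliably
        return np.ones(n, dtype=bool)
    tri = Delaunay(pts_prev)                         # triangulate the previous-frame points
    edges = set()
    for simplex in tri.simplices:                    # collect unique triangulation edges
        for a, b in [(0, 1), (1, 2), (0, 2)]:
            edges.add(tuple(sorted((simplex[a], simplex[b]))))
    ratios = {i: [] for i in range(n)}
    for i, j in edges:                               # length change of each edge between frames
        d_prev = np.linalg.norm(pts_prev[i] - pts_prev[j]) + 1e-12
        d_curr = np.linalg.norm(pts_curr[i] - pts_curr[j])
        for k in (i, j):
            ratios[k].append(d_curr / d_prev)
    med = np.median([r for rs in ratios.values() for r in rs])
    per_point = np.array([np.median(ratios[i]) if ratios[i] else med for i in range(n)])
    return np.abs(per_point - med) < tol * med       # keep points whose local scale agrees
```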

3.2.2 Gravity Center of Corners

After inter-frame matching of corners by optical flow, Delaunay triangulation is adopted to refine the matched corners (i.e., to remove outliers). The center of gravity of the refined corners in the current frame is then used as a predicted sample candidate.

3.2.3 Predicted location by template matching

A normalized cross correlation based template matching technique has been used as the measurement scheme. The object’s template and the rectangular window in the image centered at the predicted position form the two inputs to the correlation system. The equations for normalized cross correlation are as follows.

$$ {R}_{ccoeff}\left(x,y\right)=\frac{\sum_{x^{\prime },{y}^{\prime }}\left[{T}^{\prime}\left({x}^{\prime },{y}^{\prime}\right)\cdot {I}^{\prime}\left(x+{x}^{\prime },y+{y}^{\prime}\right)\right]}{\sqrt{\sum_{x^{\prime },{y}^{\prime }}{\left[{T}^{\prime}\left({x}^{\prime },{y}^{\prime}\right)\right]}^2\cdot \sum_{x^{\prime },{y}^{\prime }}{\left[{I}^{\prime}\left(x+{x}^{\prime },y+{y}^{\prime}\right)\right]}^2}} $$
(19)
$$ {T}^{\prime}\left({x}^{\prime },{y}^{\prime}\right)=T\left({x}^{\prime },{y}^{\prime}\right)-\frac{1}{w\cdot h}\sum_{x^{\prime \prime },{y}^{\prime \prime }}T\left({x}^{\prime \prime },{y}^{\prime \prime}\right) $$
(20)
$$ {I}^{\prime}\left(x+{x}^{\prime },y+{y}^{\prime}\right)=I\left(x+{x}^{\prime },y+{y}^{\prime}\right)-\frac{1}{w\cdot h}\sum_{x^{\prime \prime },{y}^{\prime \prime }}I\left(x+{x}^{\prime \prime },y+{y}^{\prime \prime}\right) $$
(21)

where T is the template of the object to be matched, T′ the zero-mean template, I the search window in the image, I′ the zero-mean search window, and R the resulting correlation matrix.

The position of the maximum of the correlation output R_ccoeff is taken as the position of the object. The search window is a rectangular window centered at the predicted position of the object. Its dimensions are taken as twice those of the object to trade off computational cost against the probability of a correct measurement. Figure 6 shows the template image and the result of template matching.

Fig. 6 Template matching example. a Template image; b Template matching result
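A minimal sketch of this measurement is given below, assuming OpenCV's `TM_CCOEFF_NORMED` implementation of Eqs. (19)-(21) and the doubled search window described above; the helper name and the bounding-box convention are illustrative.

```python
# A sketch of the template-matching prediction (uint8 grayscale inputs).
import cv2
import numpy as np

def predict_by_template(frame_gray, template_gray, pred_center):
    th, tw = template_gray.shape
    cx, cy = int(pred_center[0]), int(pred_center[1])
    x0, y0 = max(cx - tw, 0), max(cy - th, 0)        # search window twice the template size
    search = frame_gray[y0:y0 + 2 * th, x0:x0 + 2 * tw]
    if search.shape[0] < th or search.shape[1] < tw:
        return None
    R = cv2.matchTemplate(search, template_gray, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(R)
    top_left = (x0 + max_loc[0], y0 + max_loc[1])    # location of the correlation peak
    return top_left, max_val
```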

3.2.4 Predicted location by SURF matching

SURF, also known as approximate SIFT, employs integral images and an efficient scale space construction to generate keypoints and descriptors very efficiently. SURF has two stages, keypoint detection and keypoint description [7]. In the first stage, rather than using DoGs as in SIFT, integral images allow the fast computation of approximate Laplacian-of-Gaussian images using a box filter. The computational cost of applying the box filter is independent of its size because of the integral image representation. Determinants of the Hessian matrix are then used to detect the keypoints. SURF therefore builds its scale space by keeping the image size the same and varying only the filter size.

The first stage provides invariance to scale and location. In the second stage, each detected keypoint is first assigned a reproducible orientation: Haar wavelet responses are calculated for a set of pixels within a radius of 6s, where s refers to the detected keypoint scale. The SURF descriptor is then computed by constructing a square window centered at the keypoint and oriented along the orientation obtained before. This window is divided into 4 × 4 regular sub-regions and Haar wavelets are calculated within each sub-region. Each sub-region contributes four values, resulting in 64-D descriptor vectors which are then normalized to unit length. The resulting SURF descriptor is invariant to rotation, scale and contrast, and partially invariant to other transformations. Shorter SURF descriptors can also be computed; however, the best results are reported with 64-D SURF descriptors [7]. The result of SURF matching is illustrated in Fig. 7, while Fig. 8 shows all the candidate samples obtained by the methods above, with different predictions shown in different colors.

Fig. 7 Matching result by SURF

Fig. 8 All the candidate samples obtained by the methods above. For the optical flow tracker, the predicted location in the current frame is denoted by a red rectangle; the gravity center of corners by a green rectangle; template matching by a blue rectangle; SURF matching by a magenta rectangle
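A sketch of the SURF prediction step is given below, assuming an OpenCV build that includes the contrib `xfeatures2d` module (SURF is a non-free module); the ratio-test threshold and the use of the mean of matched keypoints as the predicted center are illustrative choices.

```python
# A sketch of predicting the target center by SURF keypoint matching between
# the target template and the current frame (uint8 grayscale inputs).
import cv2
import numpy as np

def predict_by_surf(template_gray, frame_gray, ratio=0.75):
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)  # 64-D descriptors by default
    kp_t, des_t = surf.detectAndCompute(template_gray, None)
    kp_f, des_f = surf.detectAndCompute(frame_gray, None)
    if des_t is None or des_f is None:
        return None
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des_t, des_f, k=2)
    good = [p[0] for p in matches
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]  # Lowe ratio test
    if not good:
        return None
    pts = np.float32([kp_f[m.trainIdx].pt for m in good])
    return pts.mean(axis=0)                           # predicted target center in the frame
```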

3.3 Motion model

Particle filter [25] is a Bayesian sequential importance sampling technique that aims to estimate the posterior distribution of state variables for a given dynamic system. It uses a set of weighted particles to approximate the probability distribution of the state regardless of the underlying distribution, which is very effective for dealing with nonlinear and non-Gaussian systems. As a typical dynamic state inference problem, online visual tracking can be modeled by particle filter.

There exist two fundamental steps in the particle filter method: 1) prediction and 2) update. Let x t denote the state variable describing the affine motion parameters of an object and y t denote its corresponding observation vector (the subscript t indicates the frame index). The two steps recursively estimate the posterior probability based on the following two rules:

$$ p\left({\mathrm{x}}_t|{\mathrm{y}}_{1:t-1}\right)={\displaystyle \int p\left({\mathrm{x}}_t|{\mathrm{x}}_{t-1}\right)}p\left({\mathrm{x}}_{t-1}|{\mathrm{y}}_{1:t-1}\right)d{\mathrm{x}}_{t-1} $$
(22)
$$ p\left({\mathrm{x}}_t|{\mathrm{y}}_{1:t}\right)=\frac{p\left({\mathrm{y}}_t|{\mathrm{x}}_t\right)p\left({\mathrm{x}}_t|{\mathrm{y}}_{1:t-1}\right)}{p\left({\mathrm{y}}_t|{\mathrm{y}}_{1:t-1}\right)} $$
(23)

where x_{1:t} = {x_1, x_2, …, x_t} stands for all available state vectors up to time t and y_{1:t} = {y_1, y_2, …, y_t} denotes the corresponding observations. p(x_t | x_{t−1}) is called the motion model, which describes the state transition between consecutive frames, and p(y_t | x_t) denotes the observation model, which evaluates the likelihood of an observed image patch belonging to the object class.

In the particle filter framework, the posterior p(x_t | y_{1:t}) is approximated by N weighted particles \( {\left\{{\mathrm{x}}_t^i,{w}_t^i\right\}}_{i=1,2,\dots, N} \), which are drawn from an importance distribution q(x_t | x_{1:t−1}, y_{1:t}), and the weights of the particles are updated as

$$ {w}_t^i={w}_{t-1}^i\frac{p\left({\mathrm{y}}_t|{\mathrm{x}}_t^i\right)p\left({\mathrm{x}}_t^i|{\mathrm{x}}_{t-1}^i\right)}{q\left({\mathrm{x}}_t|{\mathrm{x}}_{1:t-1},{\mathrm{y}}_{1:t}\right)} $$
(24)

In this paper, we adopt q(x_t | x_{1:t−1}, y_{1:t}) = p(x_t | x_{t−1}), which is assumed to be a Gaussian distribution, similar to [37]. In detail, six parameters of the affine transform are used, i.e., x_t = {x_t, y_t, θ_t, s_t, α_t, ϕ_t}, where x_t, y_t, θ_t, s_t, α_t, ϕ_t denote the x, y translations, rotation angle, scale, aspect ratio, and skew, respectively. The state transition is formulated as a random walk, i.e., p(x_t | x_{t−1}) = N(x_t; x_{t−1}, ψ), where ψ is a diagonal covariance matrix. Finally, the state x_t is estimated as \( \widehat{{\mathrm{x}}_t}={\displaystyle {\sum}_{i=1}^N{w}_t^i{\mathrm{x}}_t^i} \). We note that the key to designing a practical tracking algorithm is to develop an effective and efficient observation likelihood p(y_t | x_t).
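A compact sketch of one prediction-update step of Eqs. (22)-(24) with the Gaussian random-walk transition above is given below; the resampling scheme, the `observe` callback (returning the product of the generative similarity and the classifier score) and the noise scales are illustrative assumptions.

```python
# A sketch of one particle filter step over the six affine parameters
# [x, y, theta, s, alpha, phi]; weights are assumed normalized on input.
import numpy as np

def particle_filter_step(particles, weights, observe, psi_diag, rng=np.random):
    N = len(particles)
    idx = rng.choice(N, size=N, p=weights)                 # resample by current weights
    particles = particles[idx]
    # predict: p(x_t | x_{t-1}) = N(x_t; x_{t-1}, psi), psi diagonal
    particles = particles + rng.randn(N, 6) * np.sqrt(psi_diag)
    # update: weights proportional to the observation likelihood p(y_t | x_t)
    weights = np.array([observe(p) for p in particles])
    weights = weights / (weights.sum() + 1e-12)
    state = (weights[:, None] * particles).sum(axis=0)     # weighted state estimate
    return particles, weights, state
```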

4 Experiment and results

4.1 Datasets and baselines

The VOT2013 dataset [42] includes various real-life visual phenomena, while containing a small number of sequences to keep the time for performing the experiments reasonably low. The VOT2013 challenge consists of 16 color image sequences with 172 to 770 frames: bicycle, bolt, car, cup, david, diving, face, gymnastics, hand, iceskater, juice, jump, singer, sunshade, torus, and woman. The sequences have been selected to make the tracking a challenging task: objects change aspect or are articulated, the scale and orientation vary, illumination changes and occlusions occur.

Some example frames are shown in Fig.9. The attributes of testing sequences used in our paper are shown in Table 1.

Fig. 9 Examples from the dataset: girl, faceocc1, jogging

Table 1 Tracking sequences used in our experiments

In addition, given the tracking result R_T and the ground truth R_G, we use the detection criterion of the PASCAL VOC challenge [19], i.e., Eq. (25), to evaluate the success rate.

$$ score=\frac{area\left({R}_T\cap {R}_G\right)}{area\left({R}_T\cup {R}_G\right)} $$
(25)
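For reference, a small sketch of the overlap score of Eq. (25) for axis-aligned boxes given as (x, y, w, h):

```python
# Overlap (intersection over union) between a tracking result rt and the
# ground truth rg, both given as (x, y, w, h).
def overlap_score(rt, rg):
    ax0, ay0, ax1, ay1 = rt[0], rt[1], rt[0] + rt[2], rt[1] + rt[3]
    bx0, by0, bx1, by1 = rg[0], rg[1], rg[0] + rg[2], rg[1] + rg[3]
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = rt[2] * rt[3] + rg[2] * rg[3] - inter
    return inter / union if union > 0 else 0.0
```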

4.2 Implementation details

The proposed method has been implemented in Matlab and tested on a 1.73 GHz PC with 2 GB memory. The number of particle samples used in the experiments is 100. During tracking, the pixel values of each frame are normalized into [0, 1]. Each particle is associated with an image patch. After image scaling, the image patch is normalized to N_1 × N_2 pixels; in the experiments, the parameters (N_1, N_2) are chosen as (32, 32). Due to the multiple sub-processes used in our model, our proposed method runs at about 1 fps. The running speeds of the compared methods are shown in Table 2. In our experiments, we compared the tracking results of our proposed method with those of state-of-the-art methods: OAB [20], L1 [29], MIL [4], IVT [37], ASLA [27] and FCNT [41]. We implemented these trackers using publicly available source code or binaries provided by the authors, initialized with their default parameters.

Table 2 Running speed of the compared methods (fps)

4.3 Qualitative comparisons

The first test sequence is "Girl". We show some representative frames of the tracking results of the different trackers in Fig. 10. The six representative frame indices are 32, 104, 292, 396, 429, and 494. In this sequence a man's face passes in front of the woman's face. From the tracking results, we can see that the MIL tracker loses the target from frame 292 and finally drifts badly, as does the IVT tracker. The OAB tracker fails to track the target in frames 104 and 429 due to large pose changes and occlusion by the man. The l1 tracker cannot track the target in frames 292, 396, 429, and 494. Although the ASLA tracker can track the target, its location accuracy is lower than that of our proposed tracker (ASLA_DW), which obtains good tracking results and improves location accuracy. The location errors and the overlap metric of the six compared methods on the "Girl" sequence are shown in Tables 3 and 4, respectively. Figure 16 shows the location error curves of the compared methods on the Girl sequence, while Fig. 17 shows the overlap metric results.

Fig. 10 The tracking results of the Girl sequence

Table 3 Center Location Errors of Testing Sequences
Table 4 Overlap Metric of Testing Sequences

The second test sequence is "FaceOcc1". We show some representative frames of the tracking results of the different trackers in Fig. 11. The six representative frame indices are 60, 196, 356, 418, 558, and 774. The woman's face is partially occluded by a book, from the left part of the face to the right. From the tracking results, we can see that the tracking accuracy of the MIL tracker is lower than that of the other trackers. The OAB, L1, IVT and our proposed tracker can track the target well, although the location accuracy of the l1 tracker is slightly lower. The ASLA tracker cannot obtain good tracking results in frames 356, 418, 558 and 774, whereas our proposed tracker tracks the woman's face robustly and with better location accuracy. The location errors and the overlap metric of the six compared methods on the "FaceOcc1" sequence are shown in Tables 3 and 4, respectively. Figure 16 shows the location error curves of the compared methods on the FaceOcc1 sequence, while Fig. 17 shows the overlap metric results.

Fig. 11 The tracking results of the FaceOcc1 sequence

The third test sequence is "seq_sb". We show some representative frames of the tracking results of the different trackers in Fig. 12. The six representative frame indices are 15, 172, 295, 372, 412, and 480. The man's face is totally occluded by a book, with out-of-plane rotation. From the tracking results, we can see that the MIL tracker loses the target from frame 295 and finally drifts badly, as do the IVT, OAB, L1 and ASLA trackers. Compared with the other trackers, our proposed tracker can still obtain good tracking results on such a challenging video. The location errors and the overlap metric of the six compared methods on the "seq_sb" sequence are shown in Tables 3 and 4, respectively. Figure 16 shows the location error curves of the compared methods on the seq_sb sequence, while Fig. 17 shows the overlap metric results.

Fig. 12 The tracking results of the seq_sb sequence

The fourth test sequence is "Deer". We show some representative frames of the tracking results of the different trackers in Fig. 13. The six representative frame indices are 10, 20, 36, 45, 60, and 71. The challenging factors are motion blur, fast motion, in-plane rotation, background clutter and low resolution. From the tracking results, we can see that the MIL tracker can track the target in frames 20, 36, 45, 60, and 71, although its location accuracy is lower than that of the OAB tracker. The l1, IVT and ASLA trackers totally lose the target throughout the sequence. In comparison, the OAB tracker and our proposed tracker obtain good tracking results. The location errors and the overlap metric of the six compared methods on the "Deer" sequence are shown in Tables 3 and 4, respectively. Figure 16 shows the location error curves of the compared methods on the Deer sequence, while Fig. 17 shows the overlap metric results.

Fig. 13 The tracking results of the Deer sequence

The fifth test sequence is "Jogging". We show some representative frames of the tracking results of the different trackers in Fig. 14. The six representative frame indices are 40, 80, 132, 196, 264, and 302. The jogging woman is occluded by a tree. From the tracking results, we can see that the MIL, L1, IVT and ASLA trackers lose the target in frames 80, 132, 196, 264, and 302 because the jogging woman is totally occluded by the tree. The OAB tracker loses the target in frame 80 but tracks it in the remaining frames. Compared with the OAB tracker, our proposed tracker obtains good tracking results in spite of the target being totally occluded by the tree. The location errors and the overlap metric of the six compared methods on the "Jogging" sequence are shown in Tables 3 and 4, respectively. Figure 16 shows the location error curves of the compared methods on the Jogging sequence, while Fig. 17 shows the overlap metric results.

Fig. 14 The tracking results of the Jogging sequence

The sixth test sequence is "Skating2". We show some representative frames of the tracking results of the different trackers in Fig. 15. The six representative frame indices are 50, 120, 181, 240, 333, and 440. This video sequence undergoes fast motion, occlusion and deformation. From the tracking results, we can see that the MIL tracker loses the target in frame 120 and drifts badly in frames 181, 240 and 333, although it tracks the target correctly in frame 440. The L1 tracker loses the target in the representative frames, as do the OAB and IVT trackers; the IVT tracker cannot track the target because it is incapable of dealing with occlusion. The tracking accuracy of the ASLA tracker is not as good as that of our proposed tracker. The location errors and the overlap metric of the six compared methods on the "Skating2" sequence are shown in Tables 3 and 4, respectively. Figure 16 shows the location error curves of the compared methods on the Skating2 sequence, while Fig. 17 shows the overlap metric results. The overall performance of the different methods, summarized from Tables 3 and 4, is shown in Table 5.

Fig. 15 The tracking results of the Skating2 sequence

Fig. 16 The tracking error for each test sequence. The error is measured in the same way as in Table 3

Fig. 17 The overlap metric for each test sequence. The overlap metric is measured in the same way as in Table 4

Table 5 Overall performance for different methods

4.4 Parameters selection

Many parameters are involved in the formulations and, in principle, each of them may influence the overall results; they differ only in importance, so we report only the primary parameters. We select the optimal parameters empirically: we vary one parameter while keeping the others fixed to find its optimal value, and then determine the remaining parameters one by one in the same way. For the normalization size N_1 × N_2 of the image patch, we let N_1 = N_2 and vary N_1 over [16 20 24 28 32] to verify the sensitivity of these two parameters to the overall results. The results are shown in Fig. 18.

Fig. 18 The sensitivity of the two normalization sizes to the overall results

4.5 Analysis of the proposed method

Visual tracking is, in practice, a combination of approaches, and each component (such as the motion model and the appearance model) can affect the final result. Our method is designed to improve the appearance and motion models, so we have integrated several techniques. Experimental results on publicly available benchmark video sequences demonstrate the accuracy and effectiveness of our tracker.

5 Conclusion

In this paper, we propose a robust tracking algorithm that integrates a generative and a discriminative model. The object appearance model is composed of a generative target model and a discriminative classifier. For the generative target model, we adopt the weighted structural local sparse appearance model, combining patch-based gray values and Histogram of Oriented Gradients features to form the patch dictionary. By sampling positives and negatives, alignment-pooling features are obtained from the patch dictionary through local sparse coding, and a support vector machine is then used to train the discriminative classifier. A robust inter-frame matching based on optical flow and Delaunay triangulation, together with template matching, is adopted to improve the proposal distribution of the particle filter and thus enhance tracking performance. Our approach is shown to effectively improve tracking performance in challenging scenarios.