1 Introduction

Visual tracking is an important and active research topic in the computer vision community because of its wide range of applications, e.g., intelligent video surveillance, human-computer interaction and robotics [43, 44]. The purpose of visual tracking is to estimate the state of a tracked target in a video. It is usually formulated as a search task in which an appearance model first represents the target and a search strategy then infers the state of the target in the current frame. Although it has been extensively studied over the last two decades, tracking remains a challenging problem due to the many appearance variations caused by occlusion, pose, illumination, background clutter, and so on.

Broadly speaking, a tracking algorithm comprises two fundamental components: (1) a motion model (also called a dynamic model), which relates the states of an object over time and predicts its likely state by supplying the tracker with a number of candidate states (e.g., Kalman filter [26], particle filter [25]); and (2) an observation model (also called an appearance model) [30], which represents the tracked object and verifies predictions by evaluating the likelihood of each candidate state in the current frame.

According to the observation models they use, existing object tracking algorithms can be categorized into methods based on templates [1, 14], online classifiers [4, 28], and so on. In template-based algorithms, the tracked object is described by a single template [14] or by multiple templates, and tracking is cast as a search for the regions most similar to the tracked object. Trackers based on online classifiers aim to distinguish the tracked object from its surrounding background by treating tracking as a binary classification problem. Thus, both classic and recent machine learning algorithms have promoted the progress of tracking algorithms and systems, including boosting [20, 21], support vector machines [23, 39], naive Bayes [45], random forests [38], multiple instance learning [4], structured learning [22] and so on. Jia et al. [27] exploited both partial and spatial information of the target through a novel alignment-pooling method and proposed an efficient tracking algorithm based on a structural local sparse appearance model with an adaptive template update strategy. This algorithm made good use of the appearance and spatial structure of the target and reduced the influence of occluded target templates. However, when the target undergoes heavy occlusion (e.g., in long-term tracking videos where the target disappears or is totally occluded), appearance changes, or interference from similar objects and background, its robustness needs further improvement. Effective modeling of the object's appearance is therefore one of the key issues for the success of a visual tracker.

In this paper, we propose a robust tracking algorithm that integrates a generative and a discriminative model. The object appearance model consists of a generative target model and a discriminative classifier. For the generative target model, we adopt the weighted structural local sparse appearance model [27], combining patch-based gray values and Histogram of Oriented Gradients (HOG) features to form the patch dictionary. By sampling positives and negatives, alignment-pooling features are obtained from the patch dictionary through local sparse coding, and a support vector machine is then used to train the discriminative classifier. A robust inter-frame matching based on optical flow [24] and Delaunay triangulation [17, 18], together with template matching, is adopted to improve the proposal distribution of the particle filter and thus enhance tracking performance. The proposed method is embedded into a Bayesian inference framework for visual tracking. Alignment pooling across the local patches within a candidate region yields a similarity measure, while the trained discriminative classifier provides a classification score; the two are multiplied to obtain the particle confidence. Moreover, to adapt to appearance changes, the patch dictionary and the discriminative classifier are updated by incremental learning every five frames. Experimental results on publicly available benchmark video sequences demonstrate the accuracy and effectiveness of our tracker.

The main contributions of this paper are as follows:

  1) Integrating inter-frame matching into the framework of visual tracking; the two parts complement one another, thus improving tracking performance;

  2) A robust inter-frame matching based on optical flow and Delaunay triangulation is adopted to enhance the robustness and accuracy of matching;

  3) The structural local sparse appearance model is weighted by considering the spatial configuration of each local patch of the target;

  4) The object appearance model is enhanced by integrating a generative model and a discriminative classifier.

The remainder of this paper is organized as follows: Section 2 summarizes the previous work most related to ours. The proposed robust tracking method, ASLA_DW (adaptive structural local sparse appearance model with discriminative weighting), is described in Section 3. Experiments and results are provided and analyzed in Section 4. Finally, our work is summarized and conclusions are drawn in Section 5.

2 Related work

Many promising approaches have been proposed to tackle object tracking. These methods can be roughly classified into two categories: generative methods and discriminative methods.

Generative methods represent objects with appearance models and track targets by searching for the image region most similar to the model. Reference templates can be learned from a set of training data. Black et al. [8] learned a subspace model offline to represent targets and simultaneously used parametric optical flow estimation. Aside from static appearance models, online appearance models, which are updated as the appearance of the target changes, have also been presented. Wang et al. [40] constructed a dynamic multi-cue integration model within the particle filter framework. Ross et al. [37] proposed a tracking framework based on incremental image-as-vector subspace learning with a sample mean update. Li et al. [29] modeled target appearance changes by incremental image-as-matrix subspace learning, adaptively updating the sample mean and eigenbasis. Recently, sparse representation [12] has been successfully applied to visual tracking [32, 33]. In this case, the tracker represents each target candidate as a sparse linear combination of dictionary templates that can be dynamically updated to maintain an up-to-date target appearance model. This representation has been shown to be robust against partial occlusion, which leads to improved tracking performance. However, sparse coding based trackers perform computationally expensive l1 minimization at each frame.

Unlike generative methods, discriminative methods regard object tracking as a binary classification problem that distinguishes the target from the background. These methods exploit information from both the target and the background. Avidan [3] combined an optical flow representation with a support vector machine (SVM) classifier for object tracking. Ozuysal et al. [36] proposed a random forest classifier which learned binary features of the target. These methods use predefined, fixed feature sets for classifier learning. In addition, there are many other approaches that select good object features online. Collins et al. [13] presented a two-class variance ratio to select the best discriminative features online. Zhong et al. [46] proposed a weakly supervised online training data selection method for visual tracking. However, these methods may introduce errors from self-updating, which can cause tracking drift. To address this problem, Babenko et al. [5] proposed an online multiple instance learning (MIL) method for object tracking. In [47], Zhou et al. improved MIL by selecting the support instances adaptively and updating them by taking both previously and recently obtained image data into account. However, to remain accurate and robust, most discriminative methods avoid updating their classifiers far away from the initial setting; as a result, drift occurs when the target appearance changes heavily.

In recent years, deep convolutional neural networks (CNNs) have improved state-of-the-art performance in many computer vision applications, and existing methods have also explored the use of CNNs in online tracking. In [41], a three-layer CNN is trained online; without pre-training and with limited training samples obtained online, such a CNN fails to capture object semantics and is not robust to deformation. To improve the proposal distribution of the particle filter, we estimate the translation of the target with an optical flow tracker and Delaunay triangulation by matching detected corners. This is related to the problem of image search. Nie et al. [34] proposed a content-based approach to automatically predict the search performance of image search, and by fully exploring the information from simple visual concepts, Nie et al. [35] presented a scheme to enhance web image reranking for complex queries.

3 Proposed method

The flowchart of our proposed method is shown in Fig. 1. It consists of three components: 1) candidate sampling; 2) similarity calculation; 3) MAP estimation. Given the current frame at time t, and based on the patch dictionary and support vector machine classifier constructed in the first frame, we obtain the candidate samples by combining particle sampling with samples predicted by the optical flow tracker and SURF matching, and solve the local sparse coding to obtain the aligned pooling features. We then use the discriminative classifier to calculate the classification score and, at the same time, spatially weight the aligned pooling features to obtain the particle similarity. Finally, the overall confidence is obtained by multiplying the classification score and the particle similarity. The particle with the maximum confidence is chosen as the tracking result (shown as a red rectangle).

Fig. 1 Flowchart of the proposed method

The novelty of our proposed method is summarized below:

  1) A robust appearance model made up of a generative target model and a discriminative classifier;

  2) Integration of inter-frame matching into the framework of visual tracking;

  3) An improved proposal distribution for the particle filter based on image matching.

3.1 Object appearance model

Our observation appearance model is shown in Fig. 2. It is based on a generative and a discriminative model: the generative model is built from a patch dictionary, and the discriminative model is an SVM (support vector machine) classifier. Given the initial target region in the first frame, we divide the target region into N patches. For each patch, the gray vector and HOG features are extracted and combined to form the patch dictionary. We then sample positives and negatives, extract their feature vectors (combined gray vector and HOG features), and solve the local sparse coding with the patch dictionary to obtain the aligned positive and negative pooling features, which are used as training data for the support vector machine to obtain the discriminative classifier. In the following subsections, we describe each part of our appearance model in detail.

Fig. 2 Our observation appearance model

3.1.1 Histogram of oriented gradients (HOG)

The HOG representation is inspired by the SIFT descriptor proposed by Lowe [31]. It is constructed by dividing the tracking region into non-overlapping grids and then computing the orientation histograms of the image gradient within each grid cell (Fig. 3).

Fig. 3 Region of interest and the corresponding HOG feature

Let θ(x, y) and m(x, y) be the orientation and magnitude of the intensity gradient at image pixel (x, y). The image gradients can be computed via a finite difference mask [−1 0 1] and its transpose. The gradient orientation at each pixel is discretized into one of p values using a contrast-insensitive definition:

$$ B\left(x,y\right)=\operatorname{round}\left(\frac{p\,\theta \left(x,y\right)}{\pi}\right) \mod p $$
(1)

Let b ∈ {0, …, p − 1} range over the orientation bins. The feature vector at (x, y) is:

$$ F{\left(x,y\right)}_b=\begin{cases} m\left(x,y\right) & \text{if } b=B\left(x,y\right) \\ 0 & \text{otherwise} \end{cases} $$
(2)

Let F be a pixel-level feature map for a w × h image and k > 0 be a parameter indicating the side length of a square image region. A dense grid of rectangular "cells" [16] is defined and pixel-level features are aggregated to obtain a cell-based feature map C, with feature vectors C(i, j) for 0 ≤ i ≤ ⌊(w − 1)/k⌋ and 0 ≤ j ≤ ⌊(h − 1)/k⌋. This aggregation reduces the size of the feature map. After bilinear interpolation is used to aggregate the features, each feature is normalized. The resulting feature vector is the HOG descriptor of the image region. The parameters of the HOG descriptor are normally set to p = 9 and k = 8, with blocks of 2 × 2 cells, which leads to a 36-dimensional feature vector per block.
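As an illustration of the patch feature described above, the sketch below combines an l2-normalized gray vector with a HOG descriptor using p = 9 orientation bins, k = 8 pixel cells and 2 × 2-cell blocks. It assumes scikit-image's `hog` function as a stand-in for the descriptor; the original work is implemented in Matlab, so the library choice and the helper name `patch_feature` are illustrative only.

```python
# A minimal sketch of the combined gray + HOG patch feature (illustrative, not
# the original Matlab implementation).
import numpy as np
from skimage.feature import hog

def patch_feature(gray_patch):
    """Concatenate the raw gray vector and a HOG descriptor for one local patch."""
    gray_vec = gray_patch.astype(np.float64).ravel()
    gray_vec /= (np.linalg.norm(gray_vec) + 1e-12)   # l2 normalization, as in the dictionary
    hog_vec = hog(gray_patch,
                  orientations=9,                    # p = 9 orientation bins
                  pixels_per_cell=(8, 8),            # k = 8
                  cells_per_block=(2, 2),            # 2 x 2 cells -> 36-D per block
                  feature_vector=True)
    return np.concatenate([gray_vec, hog_vec])
```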

3.1.2 Structural local sparse representation

The structural local sparse appearance model was proposed in [27]. In this section, we review it briefly and derive the spatial weight of the candidate target used in our method. In [27], Jia et al. sampled a set of overlapping local image patches inside the target region with a fixed spatial layout and used these local patches as the dictionary to encode the local image patches inside the candidate regions, i.e., M = [m_1, m_2, …, m_N, m_{N+1}, …, m_{N×n}] ∈ R^{d×(N×n)}, where d is the dimension of the local image patch vector, n is the number of target templates T = [T_1, T_2, …, T_n], and N is the number of local patches sampled within one target region. Each column of M is obtained by l_2 normalization of the vectorized local image patches. For a target candidate, the local image patches within it are denoted by Y = [y_1, y_2, …, y_N] ∈ R^{d×N}. To save computation time, we use 8-bit gray-scale images for analysis.

During tracking, the local patches within the target candidate region can be sparsely represented as a linear combination of the local patch dictionary by solving

$$ \underset{{\mathrm{b}}_i}{\min}\ \left\Vert {\mathrm{y}}_i-\mathrm{M}{\mathrm{b}}_i\right\Vert_2^2+\lambda \left\Vert {\mathrm{b}}_i\right\Vert_1,\kern0.5em \mathrm{s}.\mathrm{t}.\ {\mathrm{b}}_i\ge 0 $$
(3)

where ||·||_2 and ||·||_1 denote the l_2 and l_1 norms respectively, λ is the regularization parameter, b_i ∈ R^{(N×n)×1} is the sparse code vector of the i-th local image patch, and b_i ≥ 0 means all the elements of b_i are nonnegative. Note that B = [b_1, b_2, …, b_N] represents the sparse coefficients of the local patches within one target candidate region. According to the target templates, the sparse coefficients of each local patch are divided into n (the number of target templates) segments, i.e., \( {\mathrm{b}}_i^T=\left[{\mathrm{b}}_i^{(1)T},{\mathrm{b}}_i^{(2)T},\dots, {\mathrm{b}}_i^{(n)T}\right] \), where

$$ {\mathrm{b}}_i^{(k)T}=\left[{\mathrm{b}}_{\left(k-1\right)\times N+1},{\mathrm{b}}_{\left(k-1\right)\times N+2},\dots, {\mathrm{b}}_{\left(k-1\right)\times N+N}\right]\in {R}^{1\times N} $$
(4)

denotes the k-th segment of b_i. These coefficients are weighted to obtain v_i for the i-th patch,

$$ {\mathrm{v}}_i=\frac{1}{C}\sum_{k=1}^n{\mathrm{b}}_i^{(k)}=\frac{1}{C}\left[\begin{array}{c} {v}_{i1} \\ {v}_{i2} \\ \vdots \\ {v}_{iN} \end{array}\right],\quad i=1,2,\ldots,N $$
(5)

where v_i ∈ R^{N×1} contains the pooled sparse coefficients of the i-th local patch and C is a normalization term, C = v_{i1} + v_{i2} + … + v_{iN}. Since one candidate target contains N local image patches, the vectors v_i form a square matrix V = [v_1, v_2, …, v_N]. According to the spatial layout of the target, each local patch is best described by the block at the same position in the template (i.e., by the sparse coefficients at the aligned positions). Therefore, we take the diagonal elements of the square matrix V as the final sparse coefficients of the local patches within the candidate region, i.e.,

$$ \mathrm{f}=\operatorname{diag}\left(\mathrm{V}\right)=\left[\begin{array}{c} {f}_1 \\ {f}_2 \\ \vdots \\ {f}_N \end{array}\right] $$
(6)

where f is the vector of sparse coefficients of all the local patches, i.e., f_1 is the aligned sparse code of the first patch and f_2 that of the second patch.
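The alignment-pooling step of Eqs. (3)-(6) can be sketched as follows, assuming scikit-learn's non-negative lasso as a stand-in for the sparse solver used in the original Matlab implementation; the regularization scale and the helper name `alignment_pooling` are illustrative.

```python
# A minimal sketch of alignment pooling for one candidate region.
# M: (d, N*n) patch dictionary with l2-normalized columns ordered template by
# template; Y: (d, N) local patches of the candidate.
import numpy as np
from sklearn.linear_model import Lasso

def alignment_pooling(Y, M, N, n, lam=0.01):
    """Return the N aligned sparse coefficients f = diag(V)."""
    # sklearn's Lasso scales the quadratic term, so lam is only proportional to
    # the lambda of Eq. (3).
    solver = Lasso(alpha=lam, positive=True, fit_intercept=False, max_iter=1000)
    V = np.zeros((N, N))
    for i in range(N):
        solver.fit(M, Y[:, i])                  # Eq. (3): y_i ~ M b_i, b_i >= 0
        b = solver.coef_                        # length N*n
        v = b.reshape(n, N).sum(axis=0)         # Eq. (5): sum the n template segments
        V[:, i] = v / (v.sum() + 1e-12)         # normalize by C
    return np.diag(V)                           # Eq. (6): aligned pooled coefficients
```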

3.1.3 Sampling positives and negatives

To initialize the classifier in the first frame, we draw positive and negative samples around the labeled target location. Suppose the location of the target object in the first frame is denoted by l_1 = (x_1, y_1). We use a Gaussian perturbation to draw positive samples in a circular area satisfying ||l_pos − l_1|| < γ, and draw negative samples in an annular area specified by γ < ||l_neg − l_1|| < η, where γ and η are the thresholds defining the circular and annular areas, respectively. The sets l_pos and l_neg denote the locations of the positive and negative candidates, respectively. Without loss of generality, we set the scales of the positive and negative candidates to be the same as that of the labeled target object. We then crop the image regions specified by l_pos and l_neg and compute the sparse code of each image patch to form the training data.
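A minimal sketch of this sampling scheme is given below; the sample counts and the uniform annular draw for negatives are illustrative assumptions, since only the radii γ and η are specified above.

```python
# A sketch of drawing positive and negative sample locations around the labeled
# target center l1 = (x1, y1); gamma and eta are the radii of Section 3.1.3.
import numpy as np

def sample_locations(x1, y1, gamma, eta, n_pos=50, n_neg=200, rng=np.random):
    pos, neg = [], []
    while len(pos) < n_pos:                          # Gaussian perturbation inside the circle
        x, y = x1 + rng.randn() * gamma / 2, y1 + rng.randn() * gamma / 2
        if np.hypot(x - x1, y - y1) < gamma:         # ||l_pos - l_1|| < gamma
            pos.append((x, y))
    while len(neg) < n_neg:                          # uniform draw inside the annulus
        r = rng.uniform(gamma, eta)                  # gamma < ||l_neg - l_1|| < eta
        t = rng.uniform(0.0, 2.0 * np.pi)
        neg.append((x1 + r * np.cos(t), y1 + r * np.sin(t)))
    return pos, neg
```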

3.1.4 Support vector machine classifier

Support vector machines (SVMs) are a family of classification algorithms developed within statistical learning theory and originally formulated for binary classification [11, 15]. SVMs optimize the generalization performance of a decision function inferred from a given set of training data. Given training data and the corresponding labels:

$$ \left({\mathrm{x}}_n,{y}_n\right),n=1,2,\ldots,N,{\mathrm{x}}_n\in {R}^D,{y}_n\in \left\{-1,+1\right\} $$
(7)

SVM learning consists of the following constrained optimization:

$$ \begin{array}{l}\underset{\mathrm{w},{\xi}_n}{\min}\ \frac{1}{2}{\mathrm{w}}^{\mathrm{T}}\mathrm{w}+C\sum_{n=1}^N{\xi}_n\\ {}\mathrm{s}.\mathrm{t}.\kern0.5em {\mathrm{w}}^{\mathrm{T}}{\mathrm{x}}_n{y}_n\ge 1-{\xi}_n\ \forall n\\ {}\kern2em {\xi}_n\ge 0\ \forall n\end{array} $$
(8)

The ξ_n (n = 1, 2, …, N) are slack variables which penalize data points that violate the margin requirements. Note that the bias can be included by augmenting all data vectors x_n with a scalar value of 1. The corresponding unconstrained optimization problem is:

$$ \underset{\mathrm{w}}{\min}\ \frac{1}{2}{\mathrm{w}}^{\mathrm{T}}\mathrm{w}+C\sum_{n=1}^N \max \left(1-{\mathrm{w}}^{\mathrm{T}}{\mathrm{x}}_n{y}_n,0\right) $$
(9)

The objective in Eq. (8) is known as the primal form of the L1-SVM with the standard hinge loss. This optimization problem can be reformulated with Lagrange multipliers and solved by applying quadratic programming to its dual form.
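For illustration, the classifier of Eqs. (8) and (9) can be trained on the pooled features with any hinge-loss linear SVM solver; the sketch below assumes scikit-learn's `LinearSVC` as a stand-in for the solver used in our implementation, with an illustrative value of C.

```python
# A minimal sketch of training the discriminative classifier on the pooled
# sparse features (hinge loss, L2-regularized weights, as in Eq. (8)/(9)).
import numpy as np
from sklearn.svm import LinearSVC

def train_classifier(pos_feats, neg_feats, C=1.0):
    """pos_feats, neg_feats: arrays of shape (n_samples, n_features)."""
    X = np.vstack([pos_feats, neg_feats])
    y = np.hstack([np.ones(len(pos_feats)), -np.ones(len(neg_feats))])
    clf = LinearSVC(C=C, loss='hinge', dual=True)    # primal hinge-loss SVM
    clf.fit(X, y)
    return clf

# During tracking, clf.decision_function(candidate_feats) gives the
# classification score that is multiplied with the generative similarity.
```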

3.2 Predicted sample candidates

To improve the proposal distribution of the particle filter, and to handle the case where the tracked target is totally lost or disappears and reappears after a while, we predict the target location with an optical flow tracker, template matching and SURF keypoint matching to obtain extra candidate samples. With these predictions we can recapture the target and track it robustly. This process is shown in Fig. 4. In the following subsections, we describe each part in detail.

Fig. 4 Predicted sample candidates obtained respectively by the optical flow tracker, template matching and SURF keypoint matching

3.2.1 Translation obtained by optical flow tracker and Delaunay triangulation

  1) Optical flow tracker

Optical flow (or optic flow) [10] is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer (an eye or a camera) and the scene. It is the displacement field of the pixels in an image sequence, i.e., the distribution of the apparent velocities of objects in the image. By estimating the optical flow between video frames, one can measure the velocities of objects in the video. In general, moving objects that are closer to the camera display more apparent motion than distant objects moving at the same speed. Optical flow estimation is used in computer vision to characterize and quantify the motion of objects in a video stream, often for motion-based object detection and tracking systems.

The observed brightness of any object point is assumed constant over time, and points close to one another in the image plane are assumed to move in a similar manner (the velocity smoothness constraint). Suppose we have a continuous image, where f(x, y, t) refers to the gray level of (x, y) at time t. Representing a dynamic image as a function of position and time permits the motion to be expressed as follows.

• Assume each pixel moves but does not change intensity.

• The pixel at location (x, y) in frame t − 1 is at (x + Δx, y + Δy) in frame t.

• Optical flow associates a displacement vector with each pixel.

The optical flow thus describes motion over time as a field of two-dimensional velocity vectors: a direction and a speed of motion are assigned to each pixel at a given position in the image. To make computation simpler and quicker, we reduce the real-world three-dimensional (3-D + time) scene to a (2-D + time) case and describe the image by the 2-D dynamic brightness function I(x, y, t). Provided that the brightness intensity in the neighborhood of a pixel does not change as it moves, we can use the following expression

$$ I\left(x,y,t\right)=I\left(x+\delta x,y+\delta y,t+\delta t\right) $$
(10)

Using a Taylor series expansion for the right-hand side of Eq. (10), we obtain

$$ I\left(x+\delta x,y+\delta y,t+\delta t\right)=I\left(x,y,t\right)+\frac{\partial I}{\partial x}\delta x+\frac{\partial I}{\partial y}\delta y+\frac{\partial I}{\partial t}\delta t+H.O.T $$
(11)

From Eqs. (10) and (11), neglecting the higher order terms (H.O.T.) and after rearrangement, we get

$$ {I}_x{v}_x+{I}_y{v}_y+{I}_t=0 $$
(12)

or in formal vector representation

$$ \nabla I\cdotp \overrightarrow{v}=-{I}_t $$
(13)

where ∇I is the so-called spatial gradient of the brightness intensity, \( \overrightarrow{v} \) is the optical flow (velocity vector) of the image pixel, and I_t is the time derivative of the brightness intensity [6]. Optical flow can thus give significant information about the spatial arrangement of the objects viewed and the rate of change of this arrangement.
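A sketch of the corner-based translation estimate is given below, assuming OpenCV's Shi-Tomasi corner detector and pyramidal Lucas-Kanade flow; the parameter values and the median aggregation are illustrative simplifications (the Delaunay-based refinement and the gravity center of corners described next replace the median in our pipeline).

```python
# A sketch of estimating the inter-frame translation of the target from tracked
# corners (prev_gray, curr_gray: uint8 grayscale frames; bbox: x, y, w, h).
import cv2
import numpy as np

def estimate_translation(prev_gray, curr_gray, bbox):
    x, y, w, h = [int(v) for v in bbox]
    mask = np.zeros_like(prev_gray)
    mask[y:y + h, x:x + w] = 255                     # detect corners inside the target box
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100,
                                 qualityLevel=0.01, minDistance=5, mask=mask)
    if p0 is None:
        return None
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p0, None,
                                             winSize=(15, 15), maxLevel=3)
    good = status.ravel() == 1
    if good.sum() < 3:
        return None
    shift = np.median(p1[good] - p0[good], axis=0).ravel()   # robust (dx, dy)
    return shift                                     # applied to the previous target center
```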

  2) Fine matching based on Delaunay triangulation [17, 18]

There may be some wrong correspondences among the initial matches. To filter out the wrong matches, we adopt the Delaunay triangulation to refine them. The Delaunay triangulation has many remarkable properties that make it the most widely used triangulation.

Let P = {p_1, ⋯, p_n} be a set of points in R^d. The Voronoi cell associated with a point p_i, denoted by V(p_i), is the region of space that is closer to p_i than to any other point in P [9]:

$$ V\left({p}_i\right)=\left\{p\in {R}^d:\forall j\ne i,\left\Vert p-{p}_i\right\Vert \le \left\Vert p-{p}_j\right\Vert \right\} $$
(14)

V(p_i) is the intersection of n − 1 half-spaces bounded by the bisector planes of the segments [p_i p_j], j ≠ i. V(p_i) is therefore a convex polytope, possibly unbounded. The Voronoi diagram of P, denoted by Vor(P), is the partition of space induced by the Voronoi cells V(p_i).

Triangulation is a process that takes a region of space and divides it into sub-regions. The space may be of any dimension; a 2D space is considered here since we are dealing with 2D points, in which case the sub-regions are simply triangles. The Euler formula for a triangulation is:

$$ f-e+v=1 $$
(15)

where f is the number of faces, e the number of edges, and v the number of vertices. A triangulation constructed from the n points of P has N_tri triangles and N_edge edges; in this case, e = N_edge.

$$ {N}_{tri}=2n-2-k $$
(16)
$$ {N}_{edge}=3n-3-k $$
(17)

where k is the number of points of P on the convex hull of P.

The Delaunay triangulation Del(P) of P is defined as the geometric dual of the Voronoi diagram: there is an edge between two points p_i and p_j in the Delaunay triangulation if and only if their Voronoi cells V(p_i) and V(p_j) have a non-empty intersection. It yields a triangulation of P, that is to say a partition of the convex hull of P into d-dimensional simplices (i.e., into triangles in 2D, tetrahedra in 3D, and so on). The formula for Del(P) is Eq. (18). Figure 5a displays an example of a Voronoi diagram and its associated Delaunay triangulation in the plane.

$$ Del(P)=\left\{T\left({p}_i,{p}_j,{p}_k\right)\mid {p}_i,{p}_j,{p}_k\in P,\ C\left({p}_i,{p}_j,{p}_k\right)\cap P\backslash \left\{{p}_i,{p}_j,{p}_k\right\}=\varnothing \right\} $$
(18)
Fig. 5 a The Voronoi diagram (gray edges) of a set of 2D points (red dots) and its associated Delaunay triangulation (black edges). b The Delaunay triangulation of 20 discrete points. c Triangulated meshes

where C(p_i, p_j, p_k) is the circle circumscribing the three vertices p_i, p_j, p_k, which form a Delaunay triangle T(p_i, p_j, p_k).

The algorithmic complexity of the Delaunay triangulation of n points is O(n log n) in 2D [2]. Figure 5b and c show the Delaunay triangulation created from 20 discrete points.
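The following sketch illustrates how a Delaunay triangulation can be used to flag inconsistent correspondences, using scipy's triangulation. The rejection rule (comparing how the lengths of incident Delaunay edges change between frames) is a simple illustrative criterion, not necessarily the exact test of [17, 18].

```python
# A sketch of Delaunay-based filtering of matched corners.
# pts_prev, pts_curr: (N, 2) arrays of matched corners; returns an inlier mask.
import numpy as np
from scipy.spatial import Delaunay

def filter_matches(pts_prev, pts_curr, tol=0.3):
    n = len(pts_prev)
    if n < 4:                                        # too few points to triangulate reliably
        return np.ones(n, dtype=bool)
    tri = Delaunay(pts_prev)                         # triangulate the previous-frame points
    edges = set()
    for simplex in tri.simplices:                    # collect unique triangulation edges
        for a, b in [(0, 1), (1, 2), (0, 2)]:
            edges.add(tuple(sorted((simplex[a], simplex[b]))))
    ratios = {i: [] for i in range(n)}
    for i, j in edges:                               # length change of each edge between frames
        d_prev = np.linalg.norm(pts_prev[i] - pts_prev[j]) + 1e-12
        d_curr = np.linalg.norm(pts_curr[i] - pts_curr[j])
        for k in (i, j):
            ratios[k].append(d_curr / d_prev)
    med = np.median([r for rs in ratios.values() for r in rs])
    per_point = np.array([np.median(ratios[i]) if ratios[i] else med for i in range(n)])
    return np.abs(per_point - med) < tol * med       # keep points whose local scale agrees
```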

3.2.2 Gravity Center of Corners

After inter-frame matching of corners by optical flow, Delaunay triangulation is adopted to refine the matched corners (i.e., to remove outliers). The center of gravity of the refined corners in the current frame is then used as a predicted sample candidate.

3.2.3 Predicted location by template matching

A normalized cross correlation based template matching technique has been used as the measurement scheme. The object’s template and the rectangular window in the image centered at the predicted position form the two inputs to the correlation system. The equations for normalized cross correlation are as follows.

$$ {R}_{ccoeff}\left(x,y\right)=\frac{\sum_{x^{\prime },{y}^{\prime }}\left[{T}^{\prime}\left({x}^{\prime },{y}^{\prime}\right)\cdot {I}^{\prime}\left(x+{x}^{\prime },y+{y}^{\prime}\right)\right]}{\sqrt{\sum_{x^{\prime },{y}^{\prime }}{\left[{T}^{\prime}\left({x}^{\prime },{y}^{\prime}\right)\right]}^2\cdot \sum_{x^{\prime },{y}^{\prime }}{\left[{I}^{\prime}\left(x+{x}^{\prime },y+{y}^{\prime}\right)\right]}^2}} $$
(19)
$$ {T}^{\prime}\left({x}^{\prime },{y}^{\prime}\right)=T\left({x}^{\prime },{y}^{\prime}\right)-\frac{1}{w\cdot h}\sum_{x^{\prime \prime },{y}^{\prime \prime }}T\left({x}^{\prime \prime },{y}^{\prime \prime}\right) $$
(20)
$$ {I}^{\prime}\left(x+{x}^{\prime },y+{y}^{\prime}\right)=I\left(x+{x}^{\prime },y+{y}^{\prime}\right)-\frac{1}{w\cdot h}\sum_{x^{\prime \prime },{y}^{\prime \prime }}I\left(x+{x}^{\prime \prime },y+{y}^{\prime \prime}\right) $$
(21)

where T is the template of the object to be matched, T′ the zero-mean template, I the search window in the image, I′ the zero-mean search window, and R the resulting correlation matrix.

The position of the maximum of the correlation output R_ccoeff is taken as the position of the object. The search window is a rectangular window centered at the predicted position of the object. Its dimensions are taken as twice those of the object to trade off computational cost against the probability of a correct measurement. Figure 6 shows the template image and the result of template matching.

Fig. 6 Template matching example. a Template image; b Template matching result
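A minimal sketch of this measurement is given below, assuming OpenCV's `TM_CCOEFF_NORMED` implementation of Eqs. (19)-(21) and the doubled search window described above; the helper name and the bounding-box convention are illustrative.

```python
# A sketch of the template-matching prediction (uint8 grayscale inputs).
import cv2
import numpy as np

def predict_by_template(frame_gray, template_gray, pred_center):
    th, tw = template_gray.shape
    cx, cy = int(pred_center[0]), int(pred_center[1])
    x0, y0 = max(cx - tw, 0), max(cy - th, 0)        # search window twice the template size
    search = frame_gray[y0:y0 + 2 * th, x0:x0 + 2 * tw]
    if search.shape[0] < th or search.shape[1] < tw:
        return None
    R = cv2.matchTemplate(search, template_gray, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(R)
    top_left = (x0 + max_loc[0], y0 + max_loc[1])    # location of the correlation peak
    return top_left, max_val
```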

3.2.4 Predicted location by SURF matching

SURF, also known as approximate SIFT, employs integral images and an efficient scale space construction to generate keypoints and descriptors very efficiently. SURF has two stages, keypoint detection and keypoint description [7]. In the first stage, rather than using DoGs as in SIFT, integral images allow the fast computation of approximate Laplacian-of-Gaussian images using a box filter. The computational cost of applying the box filter is independent of its size because of the integral image representation. Determinants of the Hessian matrix are then used to detect the keypoints. SURF therefore builds its scale space by keeping the image size the same and varying only the filter size.

The first stage provides invariance to scale and location. In the second stage, each detected keypoint is first assigned a reproducible orientation: Haar wavelet responses are calculated for a set of pixels within a radius of 6s, where s refers to the detected keypoint scale. The SURF descriptor is then computed by constructing a square window centered at the keypoint and oriented along the orientation obtained before. This window is divided into 4 × 4 regular sub-regions and Haar wavelets are calculated within each sub-region. Each sub-region contributes four values, resulting in 64-D descriptor vectors which are then normalized to unit length. The resulting SURF descriptor is invariant to rotation, scale and contrast, and partially invariant to other transformations. Shorter SURF descriptors can also be computed; however, the best results are reported with 64-D SURF descriptors [7]. The result of SURF matching is illustrated in Fig. 7, while Fig. 8 shows all the candidate samples obtained by the methods above, with different predictions shown in different colors.

Fig. 7 Matching result by SURF

Fig. 8 All the candidate samples obtained by the methods above. For the optical flow tracker, the predicted location in the current frame is denoted by a red rectangle; the gravity center of corners by a green rectangle; template matching by a blue rectangle; SURF matching by a magenta rectangle
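A sketch of the SURF prediction step is given below, assuming an OpenCV build that includes the contrib `xfeatures2d` module (SURF is a non-free module); the ratio-test threshold and the use of the mean of matched keypoints as the predicted center are illustrative choices.

```python
# A sketch of predicting the target center by SURF keypoint matching between
# the target template and the current frame (uint8 grayscale inputs).
import cv2
import numpy as np

def predict_by_surf(template_gray, frame_gray, ratio=0.75):
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)  # 64-D descriptors by default
    kp_t, des_t = surf.detectAndCompute(template_gray, None)
    kp_f, des_f = surf.detectAndCompute(frame_gray, None)
    if des_t is None or des_f is None:
        return None
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des_t, des_f, k=2)
    good = [p[0] for p in matches
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]  # Lowe ratio test
    if not good:
        return None
    pts = np.float32([kp_f[m.trainIdx].pt for m in good])
    return pts.mean(axis=0)                           # predicted target center in the frame
```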

3.3 Motion model

Particle filter [25] is a Bayesian sequential importance sampling technique that aims to estimate the posterior distribution of state variables for a given dynamic system. It uses a set of weighted particles to approximate the probability distribution of the state regardless of the underlying distribution, which is very effective for dealing with nonlinear and non-Gaussian systems. As a typical dynamic state inference problem, online visual tracking can be modeled by particle filter.

There exist two fundamental steps in the particle filter method: 1) prediction and 2) update. Let x t denote the state variable describing the affine motion parameters of an object and y t denote its corresponding observation vector (the subscript t indicates the frame index). The two steps recursively estimate the posterior probability based on the following two rules:

$$ p\left({\mathrm{x}}_t|{\mathrm{y}}_{1:t-1}\right)={\displaystyle \int p\left({\mathrm{x}}_t|{\mathrm{x}}_{t-1}\right)}p\left({\mathrm{x}}_{t-1}|{\mathrm{y}}_{1:t-1}\right)d{\mathrm{x}}_{t-1} $$
(22)
$$ p\left({\mathrm{x}}_t|{\mathrm{y}}_{1:t}\right)=\frac{p\left({\mathrm{y}}_t|{\mathrm{x}}_t\right)p\left({\mathrm{x}}_t|{\mathrm{y}}_{1:t-1}\right)}{p\left({\mathrm{y}}_t|{\mathrm{y}}_{1:t-1}\right)} $$
(23)

where x_{1:t} = {x_1, x_2, …, x_t} stands for all available state vectors up to time t and y_{1:t} = {y_1, y_2, …, y_t} denotes the corresponding observations. p(x_t | x_{t−1}) is called the motion model, which describes the state transition between consecutive frames, and p(y_t | x_t) denotes the observation model, which evaluates the likelihood of an observed image patch belonging to the object class.

In the particle filter framework, the posterior p(x_t | y_{1:t}) is approximated by N weighted particles \( {\left\{{\mathrm{x}}_t^i,{w}_t^i\right\}}_{i=1,2,\dots, N} \), which are drawn from an importance distribution q(x_t | x_{1:t−1}, y_{1:t}), and the weights of the particles are updated as

$$ {w}_t^i={w}_{t-1}^i\frac{p\left({\mathrm{y}}_t|{\mathrm{x}}_t^i\right)p\left({\mathrm{x}}_t^i|{\mathrm{x}}_{t-1}^i\right)}{q\left({\mathrm{x}}_t|{\mathrm{x}}_{1:t-1},{\mathrm{y}}_{1:t}\right)} $$
(24)

In this paper, we adopt q(x_t | x_{1:t−1}, y_{1:t}) = p(x_t | x_{t−1}), which is assumed to be a Gaussian distribution, similar to [37]. In detail, six parameters of the affine transform are used, i.e., x_t = {x_t, y_t, θ_t, s_t, α_t, ϕ_t}, where x_t, y_t, θ_t, s_t, α_t, ϕ_t denote the x, y translations, rotation angle, scale, aspect ratio, and skew, respectively. The state transition is formulated as a random walk, i.e., p(x_t | x_{t−1}) = N(x_t; x_{t−1}, ψ), where ψ is a diagonal covariance matrix. Finally, the state x_t is estimated as \( \widehat{{\mathrm{x}}_t}={\displaystyle {\sum}_{i=1}^N{w}_t^i{\mathrm{x}}_t^i} \). We note that the key to designing a practical tracking algorithm is to develop an effective and efficient observation likelihood p(y_t | x_t).
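A compact sketch of one prediction-update step of Eqs. (22)-(24) with the Gaussian random-walk transition above is given below; the resampling scheme, the `observe` callback (returning the product of the generative similarity and the classifier score) and the noise scales are illustrative assumptions.

```python
# A sketch of one particle filter step over the six affine parameters
# [x, y, theta, s, alpha, phi]; weights are assumed normalized on input.
import numpy as np

def particle_filter_step(particles, weights, observe, psi_diag, rng=np.random):
    N = len(particles)
    idx = rng.choice(N, size=N, p=weights)                 # resample by current weights
    particles = particles[idx]
    # predict: p(x_t | x_{t-1}) = N(x_t; x_{t-1}, psi), psi diagonal
    particles = particles + rng.randn(N, 6) * np.sqrt(psi_diag)
    # update: weights proportional to the observation likelihood p(y_t | x_t)
    weights = np.array([observe(p) for p in particles])
    weights = weights / (weights.sum() + 1e-12)
    state = (weights[:, None] * particles).sum(axis=0)     # weighted state estimate
    return particles, weights, state
```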

4 Experiment and results

4.1 Datasets and baselines

The VOT2013 dataset [42] includes various real-life visual phenomena, while containing a small number of sequences to keep the time for performing the experiments reasonably low. The VOT2013 challenge consists of 16 color image sequences with 172 to 770 frames: bicycle, bolt, car, cup, david, diving, face, gymnastics, hand, iceskater, juice, jump, singer, sunshade, torus, and woman. The sequences have been selected to make the tracking a challenging task: objects change aspect or are articulated, the scale and orientation vary, illumination changes and occlusions occur.

Some example frames are shown in Fig.9. The attributes of testing sequences used in our paper are shown in Table 1.

Fig. 9 Examples from the dataset: girl, faceocc1, jogging

Table 1 Tracking sequences used in our experiments

In addition, given the tracking result R_T and the ground truth R_G, we use the detection criterion of the PASCAL VOC challenge [19], i.e., Eq. (25), to evaluate the success rate.

$$ score=\frac{area\left({R}_T\cap {R}_G\right)}{area\left({R}_T\cup {R}_G\right)} $$
(25)
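For reference, a small sketch of the overlap score of Eq. (25) for axis-aligned boxes given as (x, y, w, h):

```python
# Overlap (intersection over union) between a tracking result rt and the
# ground truth rg, both given as (x, y, w, h).
def overlap_score(rt, rg):
    ax0, ay0, ax1, ay1 = rt[0], rt[1], rt[0] + rt[2], rt[1] + rt[3]
    bx0, by0, bx1, by1 = rg[0], rg[1], rg[0] + rg[2], rg[1] + rg[3]
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = rt[2] * rt[3] + rg[2] * rg[3] - inter
    return inter / union if union > 0 else 0.0
```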

4.2 Implementation details

The proposed method has been implemented in Matlab and tested on a 1.73 GHz PC with 2 GB memory. The number of particle samples used in the experiments is 100. During tracking, the pixel values of each frame are normalized into [0, 1]. Each particle is associated with an image patch. After image scaling, the image patch is normalized to N_1 × N_2 pixels; in the experiments, the parameters (N_1, N_2) are chosen as (32, 32). Due to the multiple sub-processes used in our model, our proposed method runs at about 1 fps. The running speeds of the compared methods are shown in Table 2. In our experiments, we compared the tracking results of our proposed method with those of state-of-the-art methods: OAB [20], L1 [29], MIL [4], IVT [37], ASLA [27] and FCNT [41]. We implemented these trackers using publicly available source code or binaries provided by the authors, initialized with their default parameters.

Table 2 Running speed of the compared methods (fps)

4.3 Qualitative comparisons

The first test sequence is "Girl". We show some representative frames of the tracking results of the different trackers in Fig. 10. The six representative frame indices are 32, 104, 292, 396, 429, and 494. In this sequence a man's face passes in front of the woman's face. From the tracking results, we can see that the MIL tracker loses the target from frame 292 and finally drifts badly, as does the IVT tracker. The OAB tracker fails to track the target in frames 104 and 429 due to large pose changes and occlusion by the man. The l1 tracker cannot track the target in frames 292, 396, 429, and 494. Although the ASLA tracker can track the target, its location accuracy is lower than that of our proposed tracker (ASLA_DW), which obtains good tracking results and improves location accuracy. The location errors and the overlap metric of the six compared methods on the "Girl" sequence are shown in Tables 3 and 4, respectively. Figure 16 shows the location error curves of the compared methods on the Girl sequence, while Fig. 17 shows the overlap metric results.

Fig. 10 The tracking results of the Girl sequence

Table 3 Center Location Errors of Testing Sequences
Table 4 Overlap Metric of Testing Sequences

The second test sequence is "FaceOcc1". We show some representative frames of the tracking results of the different trackers in Fig. 11. The six representative frame indices are 60, 196, 356, 418, 558, and 774. The woman's face is partially occluded by a book, from the left part of the face to the right. From the tracking results, we can see that the tracking accuracy of the MIL tracker is lower than that of the other trackers. The OAB, L1, IVT and our proposed tracker can track the target well, although the location accuracy of the l1 tracker is slightly lower. The ASLA tracker cannot obtain good tracking results in frames 356, 418, 558 and 774, whereas our proposed tracker tracks the woman's face robustly and with better location accuracy. The location errors and the overlap metric of the six compared methods on the "FaceOcc1" sequence are shown in Tables 3 and 4, respectively. Figure 16 shows the location error curves of the compared methods on the FaceOcc1 sequence, while Fig. 17 shows the overlap metric results.

Fig. 11 The tracking results of the FaceOcc1 sequence

The third test sequence is "seq_sb". We show some representative frames of the tracking results of the different trackers in Fig. 12. The six representative frame indices are 15, 172, 295, 372, 412, and 480. The man's face is totally occluded by a book, with out-of-plane rotation. From the tracking results, we can see that the MIL tracker loses the target from frame 295 and finally drifts badly, as do the IVT, OAB, L1 and ASLA trackers. Compared with the other trackers, our proposed tracker can still obtain good tracking results on such a challenging video. The location errors and the overlap metric of the six compared methods on the "seq_sb" sequence are shown in Tables 3 and 4, respectively. Figure 16 shows the location error curves of the compared methods on the seq_sb sequence, while Fig. 17 shows the overlap metric results.

Fig. 12 The tracking results of the seq_sb sequence

The fourth test sequence is "Deer". We show some representative frames of the tracking results of the different trackers in Fig. 13. The six representative frame indices are 10, 20, 36, 45, 60, and 71. The challenging factors are motion blur, fast motion, in-plane rotation, background clutter and low resolution. From the tracking results, we can see that the MIL tracker can track the target in frames 20, 36, 45, 60, and 71, although its location accuracy is lower than that of the OAB tracker. The l1, IVT and ASLA trackers totally lose the target throughout the sequence. In comparison, the OAB tracker and our proposed tracker obtain good tracking results. The location errors and the overlap metric of the six compared methods on the "Deer" sequence are shown in Tables 3 and 4, respectively. Figure 16 shows the location error curves of the compared methods on the Deer sequence, while Fig. 17 shows the overlap metric results.

Fig. 13 The tracking results of the Deer sequence

The fifth test sequence is "Jogging". We show some representative frames of the tracking results of the different trackers in Fig. 14. The six representative frame indices are 40, 80, 132, 196, 264, and 302. The jogging woman is occluded by a tree. From the tracking results, we can see that the MIL, L1, IVT and ASLA trackers lose the target in frames 80, 132, 196, 264, and 302 because the jogging woman is totally occluded by the tree. The OAB tracker loses the target in frame 80 but tracks it in the remaining frames. Compared with the OAB tracker, our proposed tracker obtains good tracking results in spite of the target being totally occluded by the tree. The location errors and the overlap metric of the six compared methods on the "Jogging" sequence are shown in Tables 3 and 4, respectively. Figure 16 shows the location error curves of the compared methods on the Jogging sequence, while Fig. 17 shows the overlap metric results.

Fig. 14 The tracking results of the Jogging sequence

The sixth test sequence is "Skating2". We show some representative frames of the tracking results of the different trackers in Fig. 15. The six representative frame indices are 50, 120, 181, 240, 333, and 440. This video sequence undergoes fast motion, occlusion and deformation. From the tracking results, we can see that the MIL tracker loses the target in frame 120 and drifts badly in frames 181, 240 and 333, although it tracks the target correctly in frame 440. The L1 tracker loses the target in the representative frames, as do the OAB and IVT trackers; the IVT tracker cannot track the target because it is incapable of dealing with occlusion. The tracking accuracy of the ASLA tracker is not as good as that of our proposed tracker. The location errors and the overlap metric of the six compared methods on the "Skating2" sequence are shown in Tables 3 and 4, respectively. Figure 16 shows the location error curves of the compared methods on the Skating2 sequence, while Fig. 17 shows the overlap metric results. The overall performance of the different methods, summarized from Tables 3 and 4, is shown in Table 5.

Fig. 15 The tracking results of the Skating2 sequence

Fig. 16 The tracking error for each test sequence. The error is measured in the same way as in Table 3

Fig. 17 The overlap metric for each test sequence. The overlap metric is measured in the same way as in Table 4

Table 5 Overall performance for different methods

4.4 Parameters selection

Many parameters are involved in the formulations and, in principle, each of them may influence the overall results; they differ only in importance, so we report only the primary parameters. We select the optimal parameters empirically: we vary one parameter while keeping the others fixed to find its optimal value, and then determine the remaining parameters one by one in the same way. For the normalization size N_1 × N_2 of the image patch, we let N_1 = N_2 and vary N_1 over [16 20 24 28 32] to verify the sensitivity of these two parameters to the overall results. The results are shown in Fig. 18.

Fig. 18 The sensitivity of the two normalization sizes to the overall results

4.5 Analysis of the proposed method

Visual tracking is, in practice, a combination of approaches, and each component (such as the motion model and the appearance model) can affect the final result. Our method is designed to improve the appearance and motion models, so we have integrated several techniques. Experimental results on publicly available benchmark video sequences demonstrate the accuracy and effectiveness of our tracker.

5 Conclusion

In this paper, we propose a robust tracking algorithm that integrates a generative and a discriminative model. The object appearance model is composed of a generative target model and a discriminative classifier. For the generative target model, we adopt the weighted structural local sparse appearance model, combining patch-based gray values and Histogram of Oriented Gradients features to form the patch dictionary. By sampling positives and negatives, alignment-pooling features are obtained from the patch dictionary through local sparse coding, and a support vector machine is then used to train the discriminative classifier. A robust inter-frame matching based on optical flow and Delaunay triangulation, together with template matching, is adopted to improve the proposal distribution of the particle filter and thus enhance tracking performance. Our approach is shown to effectively improve tracking performance in challenging scenarios.