Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

Visual tracking is one of the most important and challenging problems in computer vision field. Traditionally, the majority of existing methods always focus on modeling the holistic appearance of the target within a bounding box, and they have achieved good performance in some conditions [18]. Intuitively however, they ignore the local information of the target, which greatly limits their application in the scenario where the target is partially occluded or the global appearance of the target changes a lot.

Recently, methods [918] with part-based appearance representation are popular in visual tracking task to deal with some situations that the holistic appearance based methods fail. The part based tracker combines multiple parts with local information to achieve stronger representation ability, and is able to explicitly model the target structure variations. However, the obvious shortcomings of these previous part based trackers are: (1) the number of parts are assigned before empirically, which is hard to obtain better discriminative ability with suitable parts; (2) no relationships between neighboring generated parts are considered, which loses the structural information of the target. Therefore, their methods lack strong discriminative ability and fail to combine the parts into a whole to complete the tracking task.

In this paper, we propose a discriminative Hidden Structural Part Tracker (HSPT), which tracks arbitrary objects without any assumptions on the scenarios. The proposed method learns the discriminative parts automatically by integrating the structure and discriminative information. In the learning step, both the appearance of the parts and the relationships between them are considered. Since the discriminative parts are not located in the fixed location to the target center, we regard the state of them as hidden variables in the objective function. Then, the objective is optimized by the Metropolis-Hastings (MH) algorithm [19] integrated with the online Support Vector Machine (SVM) method [20] iteratively. In order to achieve more robust performance in complex environments, the bounding box based appearance of the target is also incorporated in our tracker. The contributions of this paper are concluded as follows:

  • We propose a hidden discriminative structural parts learning based tracker, which simultaneously learns multiple structural parts of the target to represent the target better to enhance its robustness.

  • The MH algorithm and the online SVM are interestingly combined to infer the optimal state of the discriminative parts, and this optimization method handles varying number of parts well.

  • The structural supporting between parts are naturally integrated through the dynamically constructed pair-wise MRF model.

  • Extensive tracking experiments are various publicly available challenge sequences demonstrate the favorable performance against the state-of-the-art methods.

2 Related Works

The part based model has been developed recently in visual tracking task. Shahed et al. proposed a part-based tracker HABT [10], which generated the target parts by manually labeling in the first frame. And the appearance model of the target is assumed to be fixed in the tracker, which limits its performance in the complex environment. Another tracker Frag [9] utilized the regularly partitioned parts to model the appearance of the target, but the appearance model of each part is also fixed. In [18], Yao et al. introduced an online part-based tracking with latent structured learning. However, the proposed method uses fix number of parts to represent the target object, which makes it insufficient to describe the appearance variations when large deformation happens.

There also exist some other part based appearance representation methods. Kwon et al. [11] proposed the BHMC tracker, which generates the parts based on SURF-like key points without global structure constraints and the appearance of the parts were updated roughly. Wang et al. [14] proposed a discriminative appearance model based on superpixels, called SPT, in which the probabilities of superpixels belonging to the foreground were utilized to discriminate the target from the background. However, the relation between superpixels was not incorporated, which makes the tracker easily affected by the similar backgrounds. Godec et al. [12] extended the hough forest to the online domain and integrated the voting method for tracking, regardless of the structure information. Cehovin et al. [13] proposed a coupled-layer visual tracker, which combined the global and local appearance of the target together in part generation. However, the ignorance of the relationships between parts makes it unstable when clutter backgrounds, occlusions or non-rigid motion happen. Cai et al. [17] designed a dynamic graph based method which works well in non-rigid motion, but the parts are generated only based on color feature and some background parts will be easily misclassified as foreground.

3 Overview of Proposed Method

Generally, the tracking task is formulated as a Markovian state transition process, where the current target state is determined by its previous state. Let \( p ( O _{t}| Z _{t})\) be the appearance model and \( p ( Z _{t}| Z _{t-1})\) be the motion model of the target at time \(t\). The state of the target is represented as \( Z _{t}=(\ell _{t}, s _{t})\). \(\ell _{t}\) is the position of the target in the \(2\)D image plane and \( s _{t}\) is the size of the target consisting of the width and height. The motion model \( p ( Z _{t}| Z _{t-1})\) and the appearance model \( p ( O _{t}| Z _{t})\) are described as follows.

Motion Model. Similar to [2], we assume the position and size of the target varies independently in the motion model, that is:

$$\begin{aligned} p ( Z _{t}| Z _{t-1})= p (\ell _{t}, s _{t}|\ell _{t-1}, s _{t-1})= p (\ell _{t}|\ell _{t-1}) p ( s _{t}| s _{t-1}), \end{aligned}$$
(1)

where the target position transition probability \( p (\ell _{t}|\ell _{t-1})=1\), if \(\Vert \ell _{t}-\ell _{t-1}\Vert _{2}< R _{s}\); otherwise, it equals to zero. \( R _{s}\) is the predefined searching radius. The target scale transition probability is similarly handled as [2].

Appearance Model. The appearance of our tracker consists of two parts, the learned discriminative structural parts model \(\mathcal {A}^{(0)}\) and the bounding box based appearance model \(\mathcal {A}^{(1)}\). The learned structural parts focus on the local variations and the bounding box based appearance focuses on the holistic variations of the target. Intuitively, the combination of them can achieve more robust performance. The appearance model of our tracker is formulated as follows:

$$\begin{aligned} p ( O _{t}| Z _{t})=\Big ( p ^{(0)} ( O _{t}| Z _{t})\Big )^{\lambda _{b}}\cdot \Big ( p ^{(1)} ( O _{t}| Z _{t})\Big )^{(1-\lambda _{b})}, \end{aligned}$$
(2)

where \(\lambda _{b}\) is a predefined balance parameter, and \( p ^{(0)}( O _{t}| Z _{t})\) and \( p ^{(1)}( O _{t}| Z _{t})\) are the probabilities of the target candidate given out by \(\mathcal {A}^{(0)}\) and \(\mathcal {A}^{(1)}\), respectively. To model the appearance of the target, the online SVM [20] is adopted for each part.

For the discriminative structural parts model \(\mathcal {A}^{(0)}\), the probability of the candidate, including the appearance likelihood and the deformation likelihood of the parts, is calculated as

$$\begin{aligned} p ^{(0)}( O _{t}| Z _{t})=\prod _{i,j\in \mathcal {N}}\phi ( Z _{i,t}, Z _{j,t})\cdot \prod _{i=1}^{n} p ( O _{i,t}| Z _{i,t}), \end{aligned}$$
(3)

where \(n\) is the number of parts, \( p ( O _{i,t}| Z _{i,t})\) is the probability of the part \(i\) applauding the candidate to be positive, \( O _{i,t}\) is the observation of part \(i\), \(\mathcal {N}\) is the neighboring system of the parts and \(\phi ( Z _{i,t}, Z _{j,t})\) is the pairwise interaction potentials between the learned part \(i\) and part \(j\).

In our model, the state of part \(i\) at time \(t\) is defined as \( Z _{i,t}=( x _{i,t}, y _{i,t}, w _{i,t}, h _{i,t}, {\varDelta x }_{i,t}, {\varDelta y }_{i,t})\), where \(( x _{i,t}, y _{i,t})\), \( w _{i,t}\) and \( h _{i,t}\) are the position, width and height of part \(i\) at time \(t\) respectively. \(({\varDelta x }_{i,t},{\varDelta y }_{i,t})\) is the spatial offset of the part relative to the target center. The interaction potential term is expressed by means of Gibbs distribution:

$$\begin{aligned} \phi ( Z _{i,t}, Z _{j,t})\propto \exp \bigg (-\lambda _{\phi }\Vert v _{t}(i,j)-\tilde{ v }(i,j)\Vert _{2}\bigg ), \end{aligned}$$
(4)

where \(\lambda _{\phi }=0.2\) is the scaler parameter in the experiments and \( v _{t}(i,j)=\ell _{t}(i)-\ell _{t}(j)\) represents the vector pointing from the position \(\ell _{t}(i)\) of part \(i\) to the location \(\ell _{t}(j)\) of part \(j\) at time \(t\), which encodes the supporting between neighboring parts (structural information of the target). \(\tilde{ v }(i,j)\) is the learnt relative position of the part \(i\) and \(j\). Here, we model the relationship between different parts, rather than modeling the exclusions between close targets as in [21].

Let \(\varPhi _{p}( O _{i,t})\) represent the HOG feature [22] of the part observation, and \(\omega _{i,p}^{t}\) is the SVM parameter corresponding to part \(i\) at time \(t\). The likelihood of its appearance is calculated as

$$\begin{aligned} p ( O _{i,t}| Z _{i,t})\propto \exp \big (\omega _{i,p}^{(t)}\cdot \varPhi _{p}( O _{i,t})\big ), \end{aligned}$$
(5)

In order to reduce the influence of some badly learned parts, we only utilize \(\eta \) percent high confident parts to score the candidates. Then the probability of the candidate (3) can be rewritten as

$$\begin{aligned} p ^{(0)}( O _{t}| Z _{t})\propto \prod _{i,j\in \mathcal {N}} \phi ( Z _{i,t}, Z _{j,t})\cdot \exp \Big (\sum _{i\in \mathcal {I}} \omega _{i,p}^{(t)}\cdot \varPhi _{p}( O _{i,t})\Big ), \end{aligned}$$
(6)

where \(\mathcal {I}\) is the index set of the selected \(\eta \) high confident parts.

For the bounding box based appearance model \(\mathcal {A}^{(1)}\), the probability of the candidate is presented as:

$$\begin{aligned} p ^{(1)}( O _{t}| Z _{t})\propto \exp (\omega _{ b }^{(t)}\cdot \varPhi _{b}( O _{t})), \end{aligned}$$
(7)

where the SVM parameter \(\omega _{b}^{(t)}\) and the HOG feature \(\varPhi _{b}( O _{t})\) are determined based on the whole target.

4 Learning the Discriminative Parts

In the tracking task, the appearance of the target changes dramatically. In order to adapt the part models to the target appearance variations, some of the target parts should be added or deleted, and their state should be determined to represent the target optimally. Therefore, we design a reasonable objective function to perform the parts learning, whose goal consists of three aspects: (1) maximize the margin between the target and the background; (2) retain the structure information of the target; (3) cover the most of the target foreground area. The state of each part is treated as a hidden variable, which will be inferred based on the acquired observation information.

As discussed above, our objective in terms of optimization is to find the optimized SVM parameter \(\omega _{i,p}\) and parts state \( Z _{i,t}\). In this section, the MH algorithm and the online SVM are integrated into an unified optimization framework to complete the inference task. The details will be presented as follows.

4.1 Objective

For the local parts to be learned, we expect that they acquire better representation ability to ensure robust tracking performance. The objective for the \(i\)-th part learning is formulated as

$$\begin{aligned} G (\omega _{i,p}, Z _{i};\mathcal {X})=\alpha \cdot \rho \cdot \omega _{i,p}\cdot \varPhi _{p}(\mathcal {X}_{ Z _{i}})+\beta \cdot \mathcal {R}(\mathcal {F}, Z _{i}), \end{aligned}$$
(8)

where the first term is to separate the target parts from the background parts by maximizing the margin between them, and the second term encourages the learned part to cover more target foreground area. \(\mathcal {F}\) is foreground area, \(\rho \in \{-1,1\}\) is the binary label of the updating sample \(\mathcal {X}\) indicating the foreground and background, and \(\mathcal {X}_{ Z _{i}}\) represents the part observations of the updating sample with the part state \( Z _{i}\). \(\mathcal {R}(\mathcal {F}, Z _{i})\) means the overlap ratio between \( Z _{i}\) and \(\mathcal {F}\). We set the balancing parameters \(\alpha =0.7\), \(\beta =0.2\) in all of our experiments.

Naturally, we infer the optimal state of each part jointly and integrate the structure information in the inference process. The objective for the target is proposed as

$$\begin{aligned} G (\omega _{p},\tilde{ Z };\mathcal {X}) =\alpha \cdot \rho \cdot \omega _{p}\cdot \varPhi _{p}(\mathcal {X}_{\tilde{ Z }}) +\beta \cdot \mathcal {R}(\mathcal {F},\tilde{ Z }), \end{aligned}$$
(9)

where \(\tilde{ Z }=( Z _{1},\cdots , Z _{n})\) is the combination of the parts state and \(\omega _{p}=(\omega _{1,p}^{(t)},\cdots ,\omega _{n,p}^{(t)})\) is the concatenated SVM weight of each part. \(\mathcal {R}(\mathcal {F},\tilde{ Z })=\sum _{i=1}^{n}\mathcal {R}(\mathcal {F}, Z _{i})\) represents the coverage ratio of the true target area.

4.2 Optimal Parts Inference

The MH algorithm [19] has been applied in the multiple target tracking task [21, 23], where the authors use it for particle filter sampling to identify the state of each target precisely. However, in our paper, we do not aim at identifying each part all the time. Instead, we utilize the MH algorithm to clean up useless parts and discover new parts adaptively according to the target appearance variation. Firstly, we need convert our objective (9) into the probability form:

$$\begin{aligned} p (\tilde{ Z },\omega _{p}|\mathcal {X})\propto \exp \Big (\zeta \cdot G (\omega _{p},\tilde{ Z };\mathcal {X})\Big ), \end{aligned}$$
(10)

where \(\zeta \) is the scale factor. Then maximizing the objective (9) is equivalent to solve the maximum posterior probability problem:

$$\begin{aligned} \{\tilde{ Z },\omega _{p}\}=\arg \max _{\tilde{ Z }, \omega _{p}} p (\tilde{ Z },\omega _{p}|\mathcal {X}). \end{aligned}$$
(11)

Due to the dependance between the hidden variable \(\tilde{ Z }\) and \(\omega _{p}\), it is difficult to optimize them simultaneously. Hence, we decompose the inference task of the two hidden variables in (10) into a two stage iterative optimization problem. In each pass \(r\), we solve the objective by dividing it into two steps to iteratively update \(\{\tilde{ Z },\omega _{p}\}\) using the following procedure.

Optimize \(\varvec{\omega _{p}}\) . Given the optimized parts state \(\tilde{ Z }^{(r)}\), (11) is equivalent to the following optimization problem:

$$\begin{aligned} \omega _{p}^{(r)}=\arg \max _{\omega _{p}} p (\tilde{ Z }^{(r)}, \omega _{p}|\mathcal {X}) =\arg \max _{\omega _{p}}\Big \{\alpha \cdot \rho \cdot \omega _{p}\cdot \varPhi _{p}(\mathcal {X}_{\tilde{ Z }^{(r)}})\Big \}. \end{aligned}$$
(12)

Then in analogy to classical SVM, we train the parameter \(\omega _{p}^{(r)}\) by solving the following optimization problem:

$$\begin{aligned} \omega _{p}^{(r)}=\arg \min _{\omega _{p}}\Big \{\frac{1}{2}\Vert \omega _{p}\Vert ^{2} +\gamma \sum _{i=1}^{m}\max \big (0,1-\rho ^{(i)}\cdot f _{\omega _{p}} (\mathcal {X}^{i})\big )\Big \}, \end{aligned}$$
(13)

where \(\{(\mathcal {X}^{1},\rho ^{1}),\cdots ,(\mathcal {X}^{m},\rho ^{m})\}\) is the collected sample pool, \(\rho ^{i}\in \{-1,1\}\) is the label of the \(i^{th}\) collected sample \(\mathcal {X}^{i}\), \(m\) is the number of samples. We set parameter \(\gamma =5\) in our experiments. The score of the sample is calculated as \( f _{\omega _{p}}(\mathcal {X}_{\tilde{ Z }^{(r)}}) =\omega _{p}\cdot \varPhi _{p}(\mathcal {X}_{\tilde{ Z }^{(r)}})\), where \(\varPhi _{p}(\mathcal {X}_{\tilde{ Z }^{(r)}})\) is the concatenated HOG feature of parts.

Optimize \(\varvec{\tilde{ Z }}\) . With the determined \(\omega _{p}^{(r)}\), we sample a proposal part state \(\tilde{ Z }^{(r)\prime }\) according to the previous parts state \(\tilde{ Z }^{(r)}\), and calculate the acceptance ratio in the MH algorithm based on the optimized model parameter \(\omega _{p}^{(r)}\) to get the optimized parts state \(\tilde{ Z }^{(r+1)}\). Therefore, the MAP solution of the parts state in the constructed Markov Chain of MH algorithm is utilized to get the optimized state \(\tilde{ Z }\).

To that end, five moves are defined for states change of each part. Birth, indicates the move of adding the candidate parts in the sampler; Death, indicates the move of removing the newly added candidate parts in the sampler. The reversible pair focuses on the newly added candidates generated by SLIC sampling [24] (i.e. the external rectangle region of the generated superpixel is used as the candidate) and the deleted candidates, if the target is changing the pose so that some old parts will disappear and some new parts will be generated. Stay, indicates the move of adding the disappeared parts in previous iterations in the sampler; Leave, indicates the move of removing the learned parts in previous iterations in the sampler. The reversible pair determine the state of the learned parts in previous sampling iterations when the target undergoes heavy occlusion so that some old parts are missed temporarily and appear again then. Update, indicates the move of updating the parts in the sampler, which deals with the dynamically updated appearance of the target as a self-reversible pair.

For easy description, we omit the iteration mark \(r\) in the following. We define two sets in the optimization process: (1) \( T ^{\star }=\{ T _{1}^{\star },\cdots , T _{n}^{\star }\}\) is the learned parts set and its corresponding state set is \( Z ^{\star }=\{ Z _{1}^{\star },\cdots , Z _{n}^{\star }\}\); (2) \( T ^{+}=\{ T _{1}^{+},\cdots , T _{m}^{+}\}\) is the birth candidate set and its corresponding state set is \( Z ^{+}=\{ Z _{1}^{+},\cdots , Z _{m}^{+}\}\). The notation \( Z _{i}^{(\cdot )}\) (\( Z _{i}^{\star }\) or \( Z _{i}^{+}\)) is the current part state and \( Z _{i}^{(\cdot )\prime }\) is the proposal state of the part.

Let \(\mathcal {N}_{i,t}\) be the neighbors of part \(i\) at time \(t\). Let \( C = \{ C _{b}, C _{d}, C _{s}, C _{l}, C _{u}\}\) represent the prior probability of each move type and we set \( C = \{0.3,0.1,0.1,0.01,3.0\}\) in the experiments empirically. The proposal distribution \(q = \{q_{b}, q_{d}, q_{s}, q_{l}, q_{u}\}\) and the acceptance ratio are calculated as follows.

Birth: Select the part \( T _{i}^{+}\) from the birth candidate parts set \( T ^{+}\) with the uniform distribution. The birth proposal distribution can be calculated as \( q _{b}( Z _{i}^{+\prime }; Z _{i}^{+})=\frac{ C _{b}}{m}\), if \(( T ^{\star \prime }, Z ^{\star \prime })=( T ^{\star } \cup \{ T _{i}^{+}\}, Z ^{\star }\cup \{ Z _{i}^{+}\})\), and otherwise it equals to zero. Then the acceptance ratio is presented as

$$\begin{aligned} \alpha _{b}=\min \bigg (1, p (\mathcal {X}| Z _{i}^{+\prime }, \omega _{p})\cdot p ( Z _{i}^{+\prime }) \cdot \frac{ q _{d}( Z _{i}^{+}; Z _{i}^{+\prime })}{ q _{b}( Z _{i}^{+\prime }; Z _{i}^{+})}\bigg ), \end{aligned}$$
(14)

where \( p ( Z _{i}^{+\prime })\) represents the birth transition probability.

Death: Select the part \( T _{i}^{\star }\) as the death part with the uniform distribution from the newly added candidate set \( T ^{+}\cap T ^{\star }\). The death proposal distribution is defined as \( q _{d}( Z _{i}^{\star \prime }; Z _{i}^{\star })=\frac{ C _{d}}{| T ^{\star }\cap T ^{+}|}\), if \(( T ^{\star \prime }, Z ^{\star \prime })=( T ^{\star }\backslash \{ T _{i}^{+}\}, Z ^{\star }\backslash \{ Z _{i}^{+}\})\), and otherwise it equals to zero. Then the acceptance ratio is presented as

$$\begin{aligned} \alpha _{d}=\min \bigg (1,\frac{1}{ p (\mathcal {X}| Z _{i}^{\star \prime },\omega _{p})} \cdot \frac{1}{ p ( Z _{i}^{\star \prime })} \cdot \frac{ q _{b}( Z _{i}^{\star }; Z _{i}^{\star \prime })}{ q _{d}( Z _{i}^{\star \prime }; Z _{i}^{\star })}\bigg ), \end{aligned}$$
(15)

Stay: Select the part \( T _{i}^{\star }\) to be the stay part. We introduce a set \( T ^{(d)}=\mathcal {T}^{\star }\backslash T _{i}^{\star }\), \(\mathcal {T}^{\star }\) is the union set of the parts set in the previous iterations. Then the stay proposal distribution is presented as \( q _{s}( Z _{i}^{\star \prime }; Z _{i}^{\star })=\frac{ C _{s}}{| T ^{(d)}|}\cdot J ( Z _{i}^{\star \prime })\), if \(| T ^{(d)}|\not =0\), and otherwise it equals to zero. The function \( J ( Z _{i}^{\star \prime })\) represents the probability of part \( T _{i}^{\star }\) staying at the state \( Z _{i}^{\star \prime }\), and it is modeled as a normal density centered at the disappearing point. \( Z _{i}^{ l }\) is the state of the part \( T _{i}^{\star }\) at the disappearing point in the previous iterations. Then the acceptance ratio is presented as

$$\begin{aligned} \alpha _{s}=\min \bigg (1, p (\mathcal {X}| Z _{i}^{\star \prime },\omega _{p}) \cdot \prod _{j\in \mathcal {N}_{t}}\phi ( Z _{i}^{\star \prime }, Z _{j,t}) \cdot p ( Z _{i}^{\star \prime }| Z _{i}^{\star }) \cdot \frac{ q _{l}( Z _{i}^{\star }; Z _{i}^{\star \prime })}{ q _{s}( Z _{i}^{\star \prime }; Z _{i}^{\star })}\bigg ), \end{aligned}$$
(16)

where \(\mathcal {N}_{t}\) is the part neighboring system at \(t\), \( p ( Z _{i}^{\star \prime }| Z _{i}^{ l })\) is the state transition probability.

Leave: Select the part \( T _{i}^{\star }\) to be the leave part. The leave proposal distribution is defined as \( q _{l}( Z _{i}^{\star \prime }; Z _{i}^{\star })=\frac{ C _{l}}{| T ^{\star }|}\), if \(( T ^{\star \prime }, Z ^{\star \prime })=( T ^{\star }\backslash \{ T _{i}^{\star }\}, Z ^{\star }\backslash \{ Z _{i}^{\star }\})\). Otherwise, it equals to zero. Then the acceptance ratio is presented as

$$\begin{aligned} \alpha _{l}=\min \bigg (1,\frac{1}{ p (\mathcal {X}| Z _{i}^{\star \prime },\omega _{p})} \cdot \frac{1}{\prod _{j\in \mathcal {N}_{t}}\phi ( Z _{i}^{\star \prime }, Z _{j,t})} \cdot \frac{1}{ p ( Z _{i}^{\star \prime }| Z _{i}^{\star })} \cdot \frac{ q _{s}( Z _{i}^{\star }; Z _{i}^{\star \prime })}{ q _{l}( Z _{i}^{\star \prime }; Z _{i}^{\star })}\bigg ), \end{aligned}$$
(17)

Update: Select the part \( T _{i}^{\star }\) to be the update part. The appearance of the part will be updated if the move is accepted. The proposal distribution of this move type is defined as \( q _{u}( Z _{i}^{\star \prime }; Z _{i}^{\star })=\frac{1}{| T ^{\star }|}\), if \( T _{i}^{\star }\in T ^{\star }\). Otherwise it equals to zero. Then the acceptance ratio is presented as

$$\begin{aligned} \alpha _{u}=\min \bigg (1,\frac{ p (\mathcal {X}| Z _{i}^{\star \prime }, \varvec{\omega }_{p})}{ p (\mathcal {X}| Z _{i}^{\star },\varvec{\omega }_{p})} \cdot \frac{\prod _{j\in \mathcal {N}_{t}}\phi ( Z _{i}^{\star \prime }, Z _{j,t})}{\prod _{j\in \mathcal {N}_{t}}\phi ( Z _{i}^{\star }, Z _{j,t})}\bigg ), \end{aligned}$$
(18)

In the above acceptance ratio calculation Eqs. (1418), the birth transition probability \( p ( Z _{i}^{(\cdot )}) = \exp {(-\lambda _{ o }\cdot \mathcal {R}( Z _{i}^{(\cdot )}, Z _{i}^{\star }))}\). This term penalizes the overlap ratio between the candidate part \( T _{i}^{+}\) and the existing learned parts to avoid adding redundant parts. We set \(\lambda _{ o }=2\) in our experiments. The probability \( p (\mathcal {X}| Z _{i}^{(\cdot )},\omega _{p})\propto \exp ( G (\mathcal {X};\omega _{p}, Z _{i}^{(\cdot )}))\), where \(\mathcal {X}\) is the current updating sample and \( Z _{i}^{(\cdot )}\) (\( Z _{i}^{\star }\), \( Z _{i}^{\star \prime }\), \( Z _{i}^{+}\) or \( Z _{i}^{+\prime }\)) is the part state.

In addition, the target parts interact with each other, especially for the neighboring ones, so it is inappropriate to assume the independence between target parts. We should integrate the relationships between parts in optimization process rather than optimize the objective in (8) for each part individually. Motivated by [21], we propose to utilize the dynamically constructed pairwise MRF to model the supporting between different parts in the part learning process. We set all pairs of parts as the neighbors in the graph. Thus, the part transition probability in (16) and (17) is presented as follows:

$$\begin{aligned} p (\tilde{ Z }^{\prime }|\tilde{ Z })\propto \prod _{i} p ( Z ^{\prime }_{i}| Z _{i}) \prod _{ij\in \mathcal {N}}\phi ( Z ^{\prime }_{i}, Z ^{\prime }_{j}), \end{aligned}$$
(19)

where \(\tilde{ Z }=( Z _{1},\cdots , Z _{n})\) is the combination state of multiple parts, \(\phi ( Z ^{\prime }_{i}, Z ^{\prime }_{j})\) is the pairwise interaction potentials between parts similarly defined as (4), \( p ( Z ^{\prime }_{i}| Z _{i})\) is the part transition model, and \(\mathcal {N}\) is the neighbor system of the parts. \( Z ^{\prime }_{i}\) is the proposal state of part \(i\) and \( Z _{i}\) is the state of part \(i\) currently. The part transition is modeled as a normal density centered at the previous state, which is presented as \( Z ^{\prime }_{i}| Z _{i}= Z _{i}+\varDelta Z _{i}\), where \(\varDelta Z _{i}\sim [\mathcal {N}(0,\sigma _{ x }^{2}),\mathcal {N}(0,\sigma _{ y }^{2}),\mathcal {N}(0,\sigma _{ w }^{2}) ,\mathcal {N}(0,\sigma _{ h }^{2}),\mathcal {N}(0,\sigma _{\varDelta x }^{2}), \mathcal {N}(0,\sigma _{\varDelta y }^{2})]\).

Fig. 1.
figure 1

The first row is the tracking results of our tracker in the sequence shirt and the second row presents the learned discriminative parts and the corresponding structure. The nodes in the graph represent each learned part and the lines represent the spatial relationships between neighboring parts. The green cross represents the center of the target (Colour figure online).

Figure 1 is an example to illustrate how the discriminative parts are automatically learned over time in shirt sequence. The shirt is crinkled in frame \(\sharp 0026\) and \(\sharp 0040\), where the target bounding box contains considerable background in the right bottom corner. In contrast, the target bounding box in frame \(\sharp 0038\) contains only foreground. In this case, the proposed part-based method adaptively generates a part \(261\) to cover the new foreground in frame \(\sharp 0038\), and deletes it in frame \(\sharp 0040\). In this way, our part based model can adapt to the variations of the target better than the bounding box based model. The final optimization scheme is summarized in Algorithm 1.

figure a

5 Experimental Results

5.1 Parameters

The parameters in our experiment are detailed in the following. In the learning phase, we run \( P _{ n }=400\) iterations to complete the part state inference task and \( P _{ n _{0}}=100\) of them are burn-in in the MH algorithm. Generally, the algorithm will converge after about \(300\) iterations. The motion model parameters we used are \(\sigma _{ x }^{2}=3\), \(\sigma _{ y }^{2}=3\), \(\sigma _{ w }^{2}=0.2\), \(\sigma _{ h }^{2}=0.2\), \(\sigma _{\varDelta x }^{2}=0.1\) and \(\sigma _{\varDelta y }^{2}=0.1\). The cell size of HOG is set as \(8\times 8\) pixels. A block consists of \(4\) cells and the strides of the cell are set \(4\) in both \(x\) and \(y\) directions. The linear kernel is exploited in the online learning SVM model. The target area is divided into about \(15\) or \(20\) superpixels in the SLIC algorithm. Meanwhile, in the tracking phase, the searching radius \( R _{s}\in [20,60]\). The balance parameter \(\lambda _{b}\) in (2) is set in the interval \([0,0.5]\).

5.2 Effectiveness of \(\mathcal {A}^{(0)}\) and \(\mathcal {A}^{(1)}\)

Firstly, we chose four representative sequences to demonstrate the behavior of \(\mathcal {A}^{(1)}\) and \(\mathcal {A}^{(0)}\). The results shown in Table 1 demonstrate the performance of HSPT is improved mainly due to the local discriminative parts learning rather than the features or the classifiers adopted.

Table 1. Comparison results of \(\mathcal {A}^{(1)}\), \(\mathcal {A}^{(0)}\), and HSPT.

As presented in Table 1, the combined HSPT outperforms the individual \(\mathcal {A}^{(1)}\) and \(\mathcal {A}^{(0)}\) in all tested sequences. \(\mathcal {A}^{(1)}\) focuses on holistic appearance and \(\mathcal {A}^{(0)}\) focuses on inner structure of the target. \(\mathcal {A}^{(0)}\) is superior over \(\mathcal {A}^{(1)}\) because the local discriminative structure model represents the target better than appearance only model. Especially in the sequences shirt and pedestrian where nonrigid deformations and illumination variations frequently happen, the targets were still well tracked by \(\mathcal {A}^{(0)}\) even when several parts undergo changes in location and appearance. In contrast, \(\mathcal {A}^{(1)}\) was affected more seriously by these challenges. \(\mathcal {A}^{(1)}\) focuses on the holistic appearance, which can enhance the stability of the tracker in the complex situations. HSPT inherits the advantages both from \(\mathcal {A}^{(1)}\) and \(\mathcal {A}^{(0)}\) and thus performed best against the other evaluated trackers on the evaluated sequences.

5.3 Comparison with Other Trackers

Then, we compare our tracker with some state-of-the-art methods, including bounding box based methods (MIL [2], VTD [7], \(\ell 1\) [4], TLD [5]), and part based methods (Frag [9], HABT [10], BHMC [11], SPT [14]). All the codes are provided by the authors on their websites. Ten challenging sequences (nine of them are publicly available [2, 5, 7, 25, 26] and the other one is collected by ourself) are utilized in the experiment. These sequences cover most of the challenging situations in tracking task: non-rigid motion, in-plane and out-of-plane rotation, large illumination changes, heavy occlusions and complex background (see Fig. 3).

Table 2. Comparison results based on the AECP metric.
Table 3. The successfully tracked frames based on the PASCAL VOC metric.

The proposed tracker is implemented in C++ and it runs about \(0.1\) fps on the Intel \(3.0\) GHz PC platform. We present the tracking results in this section and more results as well as demos can be found in supplementary materials. Two metrics are utilized to quantify the performance, namely the Average Error Center Location in Pixels (AECP) metric (\(\downarrow \)) and the PASCAL VOC object detection metric [27] (\(\uparrow \)). The symbol \(\uparrow \) means methods with higher scores perform better, and \(\downarrow \) indicate methods with lower scores perform better. The quantitative comparison results are shown in Tables 2 and 3.

Heavy Occlusion. The target in sequences carchase, stone, car, tiger1 and tiger2 undergoes heavy occlusion multiple times. As shown in Fig. 3, most of the trackers drift away when heavy occlusion happens, and some of them can not recapture the target after occlusion. In the sequence car, TLD with detection module works relatively well. Some other trackers such as Frag and \(\ell 1\) who have intuitive robustness to occlusion also track the car well. However, our tracker still outperforms other methods due to the consideration of the relationships between parts which alleviates the influence of some badly learned parts.

Large Illumination Variations. The frequent large illumination variations in tiger1, tiger2 and pedestrian sequences challenge the performance of the trackers. Since the appearance features are easily affected by illumination variations, most of the previously proposed trackers fail to track the target in these sequences. For example, when the light is shining in the sequence tiger1 and tiger2, Frag fails to track the tiger, and when the woman is under the shadow in the pedestrian sequence, VTD and SPT shrink to those parts that are not shadowed. In contrast, our tracker optimally partitions the target into several parts, and the target can be located with the help of those less affected parts. The combined appearance features of different parts, the structure information between parts, and the structure information between the part and the target center make our tracker outperforms other trackers.

Fig. 2.
figure 2

Tracking results of MIL [2], VTD [7], \(\ell 1\) [4], TLD [5], Frag [9], HABT [10], BHMC [11], SPT [14] and the proposed HSPT tracker. The results of five trackers with relatively better performance are displayed.

Fig. 3.
figure 3

Tracking results of different trackers. Only the trackers with relatively better performance are displayed.

Pose Changes. The inner structure changes caused by pose changes usually make the bounding box based trackers fail to track the target. Nevertheless, part based trackers including Frag and BHMC work relatively well because they focus on the parts instead of the holistic bounding box template, which can be demonstrated in the sequence portman and shirt in Fig. 3, Tables 2 and 3. Especially in the sequence shirt, the bounding box based trackers such as \(\ell 1\), VTD and MIL fail because of the error accumulation when the non-rigid motion happens. Since the combination of part appearance is less influenced than the bounding box based appearance under pose changes and several correctly learned parts are good enough to locate the target, our tracker still works very well in these sequences.

Complex Background. In the football and stone sequences, many similar objects confuse the trackers a lot. As shown in Figs. 2 and 3, TLD and HABT frequently skip to other similar objects. The similar appearance between the target and the background in the sequences football, stone, tiger1 and tiger2 makes it hard to precisely track the target. Through combining the target holistic appearance, the detailed appearance of parts and the structure information between different parts, our tracker can discriminate the target from the complex background. While some specific trackers outperform ours in some specific sequences, but comprehensively speaking, our tracker performs the best.

6 Conclusion

In this paper, a novel online learned discriminative part-based tracker is proposed. The appearance of the target is described by the combination of multiple learned discriminative structural parts. In the parts learning phase, we utilize the MH algorithm based optimization framework integrated with the online SVM to infer the optimal parts state. We introduce the dynamically constructed pair-wise MRF to model the interaction between neighboring parts. The experiments demonstrate the superiority of the proposed method. In the future, we will optimize the codes to make the tracker run in real-time.