
1 Introduction

Visual tracking is an active field of research in computer vision. While numerous tracking methods have been proposed with demonstrated success in recent years, designing a robust tracking method is still an open problem due to factors such as scale and pose changes, illumination variation, and occlusion. Occlusion, in particular, is a core issue. One of the main reasons is the lack of effective object appearance models, which play a significant role in visual tracking.

For designing a robust tracker, most tracking algorithms employ generative learning or discriminative learning based appearance models. Generative learning based appearance models mainly concentrate on how to accurately fit the data from the object class using generative methods. Among them, sparse representation is a widely used generative method. Jia et al. [3] developed a local appearance model by utilizing the sparse representation of overlapped patches. Zhang et al. [10] proposed a structural sparse tracking algorithm that exploits the relationship among the target candidates and the spatial layout of the patches inside each candidate. Zarezade et al. [9] presented a joint sparse tracker by assuming that the target and the previous best candidates share a common sparsity pattern. Although these methods achieve convincing performance, they either lack a description of the target's spatial layout or ignore the temporal consistency of successive frames. In this paper, we propose a spatio-temporal constraint-based sparse representation (STSR), which not only exploits the spatial layout of the local patches inside each candidate and the intrinsic relationship among the candidates and their local patches, but also preserves the temporal consistency of the sparsity pattern in consecutive frames.

In comparison, discriminative appearance models pose visual tracking as a binary classification problem, aiming to maximize the inter-class separability between the object and non-object regions via discriminative learning techniques. Babenko et al. [2] introduced the multiple-instance learning technique into online object tracking to cope with ambiguously labeled training samples. In [4], Kalal et al. proposed to train a binary classifier using the P-N learning algorithm with both labeled and unlabeled samples. Despite their convincing performance, most of these methods use a holistic representation of the object and hence do not handle occlusion well. In this paper, we utilize the patch-based discriminative appearance model proposed in [6] to separate the target from the background, in which a multiple-instance learning-based support vector machine (MIL&SVM) is used as the classifier; it can predict the occlusion state and alleviate the drifting problem. According to the occlusion state, we update the template set as described in [6], making the templates more effective for both the generative and the discriminative appearance model.

2 Patchwise Tracking via a Hybrid Generative-Discriminative Appearance Model

In our tracker, we use \(\mathbf {s}_t\) to denote the object state at time t, and construct our tracker in the particle filter (PF) framework. For the dynamic model of the PF, \(p(\mathbf {s}_t|\mathbf {s}_{t-1})\), we assume a Gaussian distribution. For the appearance model, \(p(\mathbf {y}_t|\mathbf {s}_t)\), we use our patch-based hybrid generative-discriminative appearance model, which is introduced below.
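
For concreteness, the particle propagation step can be sketched as follows; this is a minimal sketch in Python, and the state parametrization and the noise scales `sigma` are our assumptions, as the paper does not report them.

```python
import numpy as np

def propagate_particles(prev_states, sigma, rng):
    """Gaussian dynamic model p(s_t | s_{t-1}): each particle is the previous
    state plus zero-mean Gaussian noise, drawn independently per dimension.

    prev_states : (n, d) array of particle states at time t-1;
    sigma       : (d,) standard deviations, one per state dimension
                  (values assumed, e.g. position and scale components).
    """
    return prev_states + rng.normal(0.0, sigma, size=prev_states.shape)

# e.g. rng = np.random.default_rng(0)
#      states = propagate_particles(states, np.array([4.0, 4.0, 0.01]), rng)
```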

2.1 Generative Appearance Model Based on STSR

Fig. 1. Spatio-temporal constraint-based sparse representation

Given the image set of target templates \(\mathbf {T} = [\mathbf {T}_1, \mathbf {T}_2, ..., \mathbf {T}_m]\), where m is the number of target templates, we sample K overlapped local patches inside each target region. The sampled patches form a dictionary \(\mathbf {D} = [\mathbf {d}_1^{(1)}, ..., \mathbf {d}_m^{(1)}, ..., \mathbf {d}_1^{(K)}, ..., \mathbf {d}_m^{(K)}]\), where each column of \(\mathbf {D}\) is obtained by \(\ell _2\) normalization of the vectorized gray-scale image observation extracted from \(\mathbf {T}\).
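
The dictionary construction can be sketched as follows, assuming the 32 x 32 target size, 16 x 16 patch size, and sampling step of 8 pixels reported in Sect. 3 (which gives K = 9); the function name and interface are illustrative only.

```python
import numpy as np

def build_dictionary(templates, patch_size=16, step=8):
    """Sample K overlapped patches per template and stack them into D.

    templates : list of m gray-scale arrays, each resized to 32 x 32.
    Columns are grouped by patch index k, i.e. [d_1^(1) .. d_m^(1) .. d_m^(K)],
    and each column is l2-normalized as described above.
    """
    side = templates[0].shape[0]
    offsets = range(0, side - patch_size + 1, step)  # 0, 8, 16 -> 3x3 = 9 = K
    cols = []
    for i in offsets:
        for j in offsets:                      # patch position k = (i, j)
            for T in templates:                # all m templates inside each k
                v = T[i:i+patch_size, j:j+patch_size].ravel().astype(float)
                cols.append(v / (np.linalg.norm(v) + 1e-12))
    return np.stack(cols, axis=1)              # shape (patch_size**2, K*m)
```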

Let \(\{\mathbf {x}_{t-i}^*\}_{i=1}^N\) and \(\{\mathbf {x}_t^i\}_{i=1}^n\) denote the best candidates obtained from previous tracking and the particles of the current frame, respectively. For both, we sample K overlapped local patches as done for the template set, and write \(\mathbf {x}_{t-i}^* = [\mathbf {x}_{t-i}^{*(1)}, ..., \mathbf {x}_{t-i}^{*(K)}]\) and \(\mathbf {x}_t^i = [\mathbf {x}_t^{i(1)}, ..., \mathbf {x}_t^{i(K)}]\). Let \(\mathbf {X}_t^{(k)} = [\mathbf {x}_t^{1(k)}, ..., \mathbf {x}_t^{n(k)}]\) denote the k-th local patches of the n particles at time t. To represent this observation matrix \(\mathbf {X}_t^{(k)}\), we not only consider the spatial constraint among the particles and their local patches, but also utilize the temporal constraint across consecutive frames.

Spatio-Temporal Constraint. Since the n particles of the current frame are densely sampled at and around the target location of the previous frame, and the target's appearance changes smoothly, it is reasonable to assume that these particles are similar to each other and share a similar sparsity pattern with the tracking results over a recent period of time. Thus the k-th image patches of the n particles and of the previous tracking results are expected to be similar. In addition, for the patches extracted from a candidate particle or a previous tracking result, their spatial layout should be preserved.

Spatio-Temporal Constraint-Based Sparse Representation (STSR). Based on the above observations, we use \(\mathbf {X}^{(k)} = [\mathbf {x}_{t-N}^{*(k)}, ..., \mathbf {x}_{t-1}^{*(k)}, \mathbf {x}_t^{1(k)}, ..., \mathbf {x}_t^{n(k)}]\) to collect the k-th local patches of the N previous tracking results and the n particles of the current frame, \(\mathbf {D}^{(k)} = [\mathbf {d}_1^{(k)}, \mathbf {d}_2^{(k)}, ..., \mathbf {d}_m^{(k)}]\) to express the k-th patches of the m templates, and \(\mathbf {Z}^{(k)} = [\mathbf {z}_{t-N}^{*(k)}, ..., \mathbf {z}_{t-1}^{*(k)}, \mathbf {z}_t^{1(k)}, ..., \mathbf {z}_t^{n(k)}]\) to denote the representations of the k-th local patch observations \(\mathbf {X}^{(k)}\) with respect to \(\mathbf {D}^{(k)}\). The joint sparse appearance model for object tracking under the spatio-temporal constraint is then obtained with the \(\ell _{2,1}\) mixed norm as

$$\begin{aligned} \min _\mathbf {Z} \frac{1}{2} \sum _{k=1}^{K}||\mathbf {X}^{(k)} - \mathbf {D}^{(k)}\mathbf {Z}^{(k)}||_F^2 + \lambda ||\mathbf {Z}||_{2,1} \end{aligned}$$
(1)

where \(\mathbf {Z} = [\mathbf {Z}^{(1)}, \mathbf {Z}^{(2)}, ..., \mathbf {Z}^{(K)}]\), \(||\cdot ||_F\) denotes the Frobenius norm, \(\lambda \) is a regularization parameter that balances the reconstruction error against the model complexity, \(||\mathbf {Z}||_{2,1} = \sum _i(\sum _j|[\mathbf {Z}]_{ij}|^2)^{\frac{1}{2}}\), and \([\mathbf {Z}]_{ij}\) denotes the entry in the i-th row and j-th column of \(\mathbf {Z}\). The resulting \(\ell _{2,1}\) mixed norm regularized problem is optimized with an accelerated proximal gradient (APG) method. The spatio-temporal constraint-based sparse representation is illustrated in Fig. 1.
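
As a minimal sketch of how Eq. (1) can be solved with APG, a FISTA-style iteration is shown below; the step size derived from the dictionary spectral norms, the value of \(\lambda \), the iteration count, and the interface are our assumptions, not details from the paper.

```python
import numpy as np

def prox_l21(Z, tau):
    """Row-wise shrinkage: the proximal operator of tau * ||Z||_{2,1}."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return np.maximum(0.0, 1.0 - tau / (norms + 1e-12)) * Z

def stsr_apg(X, D, lam=0.1, n_iter=100):
    """FISTA-style solver sketch for Eq. (1).

    X, D : lists of K arrays; X[k] is the (p, N+n) patch observation matrix
           and D[k] the (p, m) k-th patch dictionary.
    Returns the K coefficient blocks Z^(k), each of shape (m, N+n).
    The rows of the concatenation [Z^(1) ... Z^(K)] are shrunk jointly,
    which enforces the shared sparsity pattern across patches, particles,
    and previous tracking results.
    """
    K, cols, m = len(D), X[0].shape[1], D[0].shape[1]
    Z = np.zeros((m, K * cols))                       # [Z^(1) ... Z^(K)]
    Y, t = Z.copy(), 1.0
    L = max(np.linalg.norm(Dk, 2) ** 2 for Dk in D)   # Lipschitz constant
    for _ in range(n_iter):
        # gradient of the smooth reconstruction term, block by block
        G = np.hstack([D[k].T @ (D[k] @ Y[:, k*cols:(k+1)*cols] - X[k])
                       for k in range(K)])
        Z_new = prox_l21(Y - G / L, lam / L)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        Y = Z_new + ((t - 1.0) / t_new) * (Z_new - Z)  # momentum step
        Z, t = Z_new, t_new
    return [Z[:, k*cols:(k+1)*cols] for k in range(K)]
```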

Generative Appearance Model Based on STSR. After learning \(\mathbf {Z}\), the observation likelihood of tracking candidate i is defined as

$$\begin{aligned} p_g(\mathbf {y}_t|\mathbf {s}_t) = \frac{1}{\beta } \exp \Big (-\alpha \sum _{k=1}^K||\mathbf {x}_t^{i(k)} - \mathbf {D}^{(k)}\mathbf {z}_t^{i(k)}||_2^2\Big ) \end{aligned}$$
(2)

where \(\mathbf {z}_t^{i(k)}\) is the coefficient vector of the k-th image patch of the i-th particle with respect to the target templates, and \(\alpha \) and \(\beta \) are normalization parameters.
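
A direct implementation of Eq. (2) is a straightforward sum of reconstruction errors; note that \(\beta \) cancels once the particle weights are normalized inside the particle filter. In this sketch the value of \(\alpha \) is an assumption.

```python
import numpy as np

def generative_likelihood(x_patches, D, z, alpha=1.0):
    """Eq. (2) up to the normalizer 1/beta, which cancels when the particle
    weights are normalized in the particle filter.

    x_patches : list of K l2-normalized patch vectors of candidate i;
    D         : list of K patch dictionaries D^(k);
    z         : list of K coefficient vectors z_t^{i(k)} from the STSR solution.
    """
    err = sum(np.sum((x_patches[k] - D[k] @ z[k]) ** 2) for k in range(len(D)))
    return np.exp(-alpha * err)
```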

Fig. 2. Illustration of the patch-based MIL&SVM

2.2 Discriminative Appearance Model Based on MIL&SVM

Despite the robust performance achieved by the generative appearance model, it is not effective in dealing with background distractions. Therefore, we introduce a discriminative appearance model based on MIL&SVM to improve the performance of our tracker.

We denote the overlapped image patches extracted from the target templates as positive patches \(p^+\), and the overlapped patches extracted from the background as negative patches \(p^-\); the background is an annular region whose outer edge lies at distance R from the center point of the target object. As is well known, some of the positive patches obtained this way may contain noisy background pixels, because the bounding box is rectangular whereas the shape of the target is usually not a perfect rectangle. To deal with this problem, we adopt the patch-based MIL&SVM to train a robust classifier. In the training procedure, a row of patches is defined as a positive bag \(b^+\) if it is extracted from the target templates, or as a negative bag \(b^-\) if it comes from the background. The training procedure is illustrated in Fig. 2.
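
The bag construction might be sketched as follows, under our reading of "a row of patches"; the bag size `rows_per_bag`, the function name, and the interface are assumptions rather than details from [6].

```python
import numpy as np

def make_bags(template_patches, background_patches, rows_per_bag=3):
    """Group patches into MIL bags: consecutive rows of template patches form
    positive bags b+, rows of background patches form negative bags b-.

    Returns (bags, labels) with one +1/-1 label per bag; in the MIL setting
    only bag labels are assumed reliable, individual patch labels may be noisy.
    """
    bags, labels = [], []
    for patches, label in ((template_patches, +1), (background_patches, -1)):
        for s in range(0, len(patches), rows_per_bag):
            bags.append(np.stack(patches[s:s + rows_per_bag]))
            labels.append(label)
    return bags, labels
```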

With this classifier, we can classify each patch of a candidate object at time t. For a candidate, we use \(r^+\) to denote the local patches classified as positive and \(r^-\) to denote the patches classified as negative. The probability of a candidate being the tracking result can then be defined as

$$\begin{aligned} p_d(\mathbf {y}_t|\mathbf {s}_t) = \frac{|r^+|}{|r^-|+|r^+|} \end{aligned}$$
(3)

where \(|r^+|\) and \(|r^-|\) are the numbers of positive and negative patches, respectively.

Furthermore, according to the classification result, the occlusion state of a candidate can be obtained as

$$\begin{aligned} O = \frac{|r^-|}{|r^-|+|r^+|} \end{aligned}$$
(4)
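
Equations (3) and (4) reduce to simple counting over the per-patch classifier decisions, as in this minimal sketch.

```python
def discriminative_score(patch_labels):
    """Eqs. (3) and (4): candidate probability p_d and occlusion state O
    from the per-patch MIL&SVM decisions (+1 positive, -1 negative)."""
    n_pos = sum(1 for label in patch_labels if label > 0)
    n_neg = len(patch_labels) - n_pos
    p_d = n_pos / (n_pos + n_neg)   # Eq. (3)
    O = n_neg / (n_pos + n_neg)     # Eq. (4)
    return p_d, O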

2.3 Adaptive Hybrid Generative-Discriminative Appearance Model

Based on the likelihood obtained from the spatio-temporal constraint-based sparse representation and the probability obtained via the multiple-instance learning-based SVM, we construct our final observation model as:

$$\begin{aligned} p(\mathbf {y}_t|\mathbf {s}_t) = \eta p_g(\mathbf {y}_t|\mathbf {s}_t) + (1-\eta )p_d(\mathbf {y}_t|\mathbf {s}_t) \end{aligned}$$
(5)

where \(\eta \in [0,1]\) is a control parameter that adjusts the weights of the two models according to the occlusion state; it is defined as \(\eta = \frac{1}{2}(1+O)\).
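
Combining the pieces, the final likelihood of a candidate follows directly from Eq. (5); note that with \(\eta = \frac{1}{2}(1+O)\), a heavier occlusion state shifts weight toward the generative term.

```python
def hybrid_likelihood(p_g, p_d, O):
    """Eq. (5) with the adaptive weight eta = (1 + O) / 2: as the occlusion
    state O grows, more weight is placed on the generative likelihood p_g."""
    eta = 0.5 * (1.0 + O)
    return eta * p_g + (1.0 - eta) * p_d
```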

To deal with appearance variation over time, we need to update our templates. We divide the templates \(\mathbf {T}\) into two groups according to the occlusion state. The unoccluded group is denoted as \(\mathbf {T}_{unocc} = [\mathbf {T}_1, ..., \mathbf {T}_{m_1}]\) and the occluded group as \(\mathbf {T}_{occ} = [\mathbf {T}_{m_1 +1}, ..., \mathbf {T}_m]\), where \(m_1\) is the number of unoccluded templates. The templates in \(\mathbf {T}_{unocc}\) are ordered by time and the templates in \(\mathbf {T}_{occ}\) are ordered reversely by time. We use two sequences of increasing intervals and a random number \(r \in (0,2]\) to determine the sequence number of the template to be deleted, as in Eq. (6).

$$\begin{aligned} f(r) = \left\{ \begin{aligned} i,&\qquad r\in [\frac{(i-1)^2 +(i-1)}{m_1^2+m_1}, \frac{i^2 +i}{m_1^2+m_1}],&\quad 0<r\le 1\\ j,&\qquad r\in [1+\frac{(j-1)^2 +(j-1)}{m_2^2+m_2}, 1+ \frac{j^2 +j}{m_2^2+m_2}],&\quad 1<r\le 2 \end{aligned} \right. \end{aligned}$$
(6)

where \(m_2 = m - m_1\).
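
Equation (6) can be read as sampling an index with probability proportional to the width of its interval, i.e. \(2i/(m^2+m)\) within each group, so templates later in each ordering are discarded more often. A sketch (the function name and return convention are ours):

```python
def template_to_delete(r, m1, m2):
    """Eq. (6): map a random r in (0, 2] to the group and sequence number of
    the template to discard. The interval for index i has width 2i / (m^2 + m),
    so later indices in each ordering are deleted with higher probability."""
    group, m, u = ("unocc", m1, r) if r <= 1.0 else ("occ", m2, r - 1.0)
    for i in range(1, m + 1):   # smallest i with u <= (i^2 + i) / (m^2 + m)
        if u <= (i * i + i) / (m * m + m):
            return group, i
    return group, m
```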

After selecting the template to discard, we use the method of [3] to update the template; for more details, please refer to [3]. After the template set \(\mathbf {T}\) is updated, we retrain the MIL&SVM classifier using only the templates with no or light occlusion.

3 Experiments

Fig. 3. Comparative tracking results of the 7 methods on six sequences; from top to bottom: basketball, DavidOutdoor, girl_move, woman_sequence, face_sequence, and girl_head

Fig. 4. Center error plots for the 7 methods on the six video sequences

We validate our tracker on six challenging, publicly available sequences and compare it with six state-of-the-art methods proposed in recent years. The challenges of these sequences include severe occlusion and drastic shape deformation. To test the effectiveness and robustness of our tracker, we compare it with FragT [1], VTD [5], PT [8], SCM [11], ASLA [3] and SPT [7]. For our tracker, we set the number of templates \(m = 10\), the number of local patches \(K = 9\), and the number of particles \(n = 400\), and we use \(N = 2\) previous tracking results in the STSR. We resize all targets and candidates to \(32 \times 32\) pixels; the sampling patch size is \(16 \times 16\) and the sampling step is 8 pixels.

Table 1. Location errors (in pixels; bold indicates the best performance)

Comparative tracking results on selected frames are shown in Fig. 3, from which we can see that our proposed tracker performs well on all of these challenging sequences. FragT is designed to deal with occlusion and performs well on face_sequence and girl_head, where the target is large enough, but it cannot obtain good results on the other sequences, where a small target undergoes severe occlusion. VTD adopts multiple trackers and achieves satisfactory results on face_sequence and basketball, but it is less effective when rigid shape deformation and occlusion occur together. PT is a part-based tracker and performs well under partial occlusion, but it fails when the target is fully occluded. Both SCM and ASLA adopt sparsity-based appearance models and handle occlusion well, as shown on face_sequence, but they cannot achieve satisfactory performance under rigid shape deformation. SPT achieves good results on DavidOutdoor and girl_move, as shown in Fig. 3, but cannot obtain stable performance in cluttered scenes or under severe and frequent occlusion, as shown in the screenshots of the basketball, woman_sequence and face_sequence sequences.

We also measure the quantitative tracking error, namely the Euclidean distance from the tracking center to the ground-truth center. The center error plots of the 7 methods on the 6 sequences are shown in Fig. 4, which demonstrates that our tracker is robust in handling occlusion and shape deformation even in complex scenes. Table 1 lists the location errors and shows that our tracker achieves the best tracking results on 4 sequences and the best result on average.

4 Conclusion

In this paper, we have proposed a novel patch-based tracking method that combines a spatio-temporal constraint-based sparse representation (STSR) with a multiple-instance learning-based SVM (MIL&SVM). By utilizing the STSR, our tracker effectively captures the structural cues of the target and the temporal similarity between consecutive frames. Furthermore, we utilize the MIL&SVM as our discriminative appearance model, which is robust to cluttered backgrounds and can predict the occlusion state. Based on the occlusion state, we update the template set separately, providing the generative method with more precise templates and keeping the discriminative classifier reliable. Qualitative and quantitative experimental results on several challenging sequences demonstrate that our tracker is very robust to occlusion.