1 Introduction

Visual tracking has attracted significant attention due to its wide variety of applications, such as terrorist detection, wearable computing, and self-driving cars. Much progress has been made over the last two decades; however, designing robust visual tracking methods remains an open issue. Challenges include non-rigid shape and appearance variations of the object, occlusions, illumination changes, cluttered scenes, etc. [1, 2].

To address these challenges, a popular approach is to learn a discriminative appearance model that copes with complicated appearance changes [3]. Typically, this assumes that the object/non-object discriminative information gathered from different frames during long-term tracking is generated by a temporally homogeneous source. This assumption may not hold in practice, as object appearance and environmental conditions vary dynamically over time. In the face of such challenging factors, fitting a single updating discriminative model that satisfies all cases is unlikely to optimally distinguish an object from its background in tracking-by-detection methods [4-8]. Tracking-by-detection requires training a classifier to detect the object in each frame. One common approach to detector training is a detector ensemble framework that linearly combines weak classifiers with different associated weights, e.g., [4, 6]. A larger weight implies that the corresponding weak classifier is more discriminative and thus more useful.

Although most previous online ensemble methods, derived from offline algorithms, have achieved success in online visual learning tasks, they have limitations in visual tracking. As noted by Bai et al. [9], the common assumption is that the observed data (examples and their labels) have an unknown but stationary joint distribution. This may not apply in tracking scenarios, where the appearance of an object can undergo significant changes. Because of the uncertainty in the appearance changes that may occur over time and the difficulty of directly estimating the non-stationary distribution of the observed data, they used Bayesian estimation theory to estimate a Dirichlet distribution over classifier weights. In contrast to their pre-defined non-stationary distribution and high computational complexity, we propose a simple and robust cumulative sum method to model how the weights of the different view predictors evolve; it represents the non-stationary distribution without assuming any specific parametric form and is efficient.

At the same time, Grabner and Bischof [6] noted that updating the weights of online self-learning classifiers from incoming data without annotation is difficult. Babenko et al. [5] treated tracking as a multiple instance learning problem. Bai et al. [9] estimated the ensemble weights with a Bayesian interpretation that ensures their update is smooth. Yu et al. [10] proposed a co-training based approach that continuously labels incoming data and updates a hybrid discriminative and generative model online. We consider three views of the object simultaneously: the object-part view, the parameter space view, and the object feature space view. They are robust to different cases: the object-part view covers occlusion, the discriminative parameter space view focuses on the difference between the object and the background, and the generative object feature space view handles variations of the object appearance itself.

Moreover, the tracking problem has a temporal dimension that is absent from the classification methods [11] and subspace learning methods [12] of previous works. We obtain temporal interval predictors through sequential clustering so as to directly exploit the temporally learned structural information in parameter space and object appearance space.

Our method models three views of predictors whose weights are combined under a non-stationary distribution, and whose information geometry is explored by sequential clustering. Our method focuses on estimating the state of the object with three diverse view predictors along the temporal dimension, rather than treating them as independent and identically distributed variables in a fixed weak classifier pool. In summary, our contributions are as follows:

  1. We first propose a clustering ensemble tracker with three diverse views of weak predictors: an object-part predictor, a parameter space predictor, and a feature space predictor. The different views have complementary properties for tracking.

  2. Sequential clustering is used to estimate the temporal non-stationary distributions of the weak structure predictor in parameter space and the appearance predictor in feature space. Based on sequential clustering theory, it provides a probabilistic interpretation of which interval structured predictors of the object are more discriminative.

  3. We propose a simple weighting strategy to combine the different weak predictors based on the prediction consistency between the weak predictors and the final ensemble tracker.

2 Related Work

A tracking-by-detection method usually has two major components: object representation and model update. Previous methods employ various object representations [5-8, 13-15]. Our approach is most related to the methods that use structured prediction [7, 8].

If the tracked objects are treated as labeled positive samples and the remaining training samples are weighted by a structured loss, the tracking problem can be viewed as a supervised learning problem in each frame. Supervised learning algorithms are commonly described as searching a hypothesis space for a hypothesis that makes good predictions for a particular problem. Even if the hypothesis space contains hypotheses that are well suited to object tracking, it may be very difficult to find one that locates the object precisely.

“Ensemble methods” is a machine learning paradigm in which multiple (homogeneous/heterogeneous) individual learners are trained for the same problem, e.g., neural network ensembles [16], bootstrap aggregating (bagging) [17], boosting [18], and Bayesian model averaging [19, 20]. Avidan [4], who was the first to explicitly apply ensemble methods to tracking-by-detection, extended the work of [21] by adopting the AdaBoost algorithm [18] to combine a set of weak classifiers maintained with an online update strategy. Along this thread, Grabner et al. [6] drew on the online boosting algorithm [22] and introduced feature selection from a pool of features for the weak classifiers. Several other extensions to online boosting followed, including the work of Babenko et al. [5], who adopted multiple instance learning to design the weak classifiers. In a different approach [23], random forests underwent online updates to grow or discard decision trees during tracking. Bai et al. [9] treated the weight vector as a random variable and estimated a Dirichlet distribution for the ensemble's weight vector. All of these are binary classifiers realized by ensemble methods, and they do not exploit the structured properties of the data, which can improve tracking performance significantly, as in [7, 24]. Moreover, online boosting based trackers [5, 6] only consider the parameter state in the current time period. In contrast, we explore the structure of the parameter state in parameter space over different time periods of the tracking process.

Zhong et al. [25] considered visual tracking in a weakly supervised learning scenario where (possibly noisy) labels, but no ground truth, are provided by multiple imperfect oracles (i.e., trackers). Kwon and Lee [26] proposed a visual tracker sampler that tracks a target by searching for the appropriate trackers in each frame. These are all ensemble methods applied to visual tracking. Unlike them, our method is not a heterogeneous method that operates in the space of trackers but a homogeneous approach with a single tracker. Because the weak trackers have already been trained on historical sequences, our method is more efficient than heterogeneous methods.

Our online ensemble method is most closely related to online bagging, in the sense that we adopt a random combination of weak classifiers. However, we characterize the temporal ensemble weight vector as a clustering center and evolve its distribution in a sequential clustering manner. As a result, the final strong classifier is the expectation of the ensemble with respect to the weight vector, approximated by an average over the ensemble clustering centers. To the best of our knowledge, in the context of tracking-by-detection, we are the first to present an online learning scheme that adopts clustering in parameter space and object appearance space to characterize the uncertainty of a self-learning algorithm.

3 Clustering Ensemble Tracking

In this section, we introduce our tracking algorithm, clustering ensemble tracking (CET), which is a clustering ensemble based appearance model. We begin with an overview of our tracking system, which includes a description of the structure learning based part models predictor. We then briefly review the concepts of sequential clustering and the ensemble of temporal weak structure predictors. Finally, we give our clustering ensemble based tracking algorithm.

Fig. 1. Framework of the proposed clustering ensemble algorithm. \(W_{i}\) represents the parameters of the part models in the \(i^{th}\) frame. \(C_{i,j}\) denotes a clustering center. \(d_{i,j}\) denotes the decisions related to \(C_{i,j}\).

3.1 Overview

We illustrate the framework of our tracking system in Fig. 1. At each frame, our method starts with a structure predictor \(h(x)\), several clustering centers based on the historical weight vectors \(W = \{w_1,w_2,...,w_N,...\}\) of \(h(x)\), and input data \(x\). Our method incrementally obtains the parameter cluster centers \(C_p = \{C_{p,1},...,C_{p,M}\}\) and object appearance cluster centers \(C_o = \{C_{o,1},...,C_{o,M}\}\) through the sequential clustering method: initially there is only one cluster, and the number of clusters grows as the input parameter vectors \(W\) or object feature vectors \(O = \{o_1,...,o_N,...\}\) change. Every parameter cluster center \(C_{p,i}\), together with the latest parameter vector \(w_N\), is treated as the parameters of a weak structure predictor \(h(x)\). Meanwhile, each appearance cluster center \(C_{o,i}\) evaluates the object candidates through a similarity measurement. The outputs of these weak structure predictors \(h(x)\) and the similarity scores, with their respective weights \(\lambda = \{\lambda _1, \lambda _2, \lambda _3\}\), are then combined to yield the final decision on where the object is. To reduce the computational complexity, a cascade method is adopted in the experiments, as sketched below: the most stable weak classifier or the latest classifier rejects most object candidates and retains only the few that are difficult for a single weak classifier to predict precisely, so that multiple weak predictors give a combined solution of higher quality than any individual one (empirically supported by [25, 26]).
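The following is a minimal Python sketch of this cascade filtering step. The function name, the keep ratio, and the NumPy-based interface are illustrative assumptions, not details from the paper.

```python
import numpy as np

def cascade_filter(candidates, scores, keep_ratio=0.1):
    # Score all candidates with the most stable (or latest) weak
    # classifier first, then keep only the top-scoring fraction; the
    # surviving hard candidates are re-scored by the full ensemble of
    # weak predictors. keep_ratio is an illustrative value.
    order = np.argsort(scores)[::-1]                   # highest score first
    n_keep = max(1, int(len(candidates) * keep_ratio))
    return [candidates[i] for i in order[:n_keep]]
```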

3.2 Sequential Clustering

In online visual object tracking, the appearance of the tracked object usually changes gradually. Under factors such as noise, occlusion, fast and abrupt object motion, illumination changes, or variations in pose and scale, however, the appearance extracted at the object location can change considerably. The weight vector trained on these changing object samples varies accordingly. By updating the object appearance model, the classifier can adapt to the variation of the object appearance. However, model update itself is not guaranteed to be correct without reliable supervised information. To alleviate the drift problem caused by degraded classifier updates, which stem from incorrectly labeled training samples, we exploit the structure of the parameter space of the trained weak trackers and of the predicted object appearance space along the historical temporal dimension, using sequential clustering to safeguard the accuracy of the final ensemble tracker's current decision. We introduce the sequential clustering algorithm below.

In its basic form, the parameter or weight vectors \(W = \{w_1,...,w_n\}\) are presented only once and the number of clusters \(C = \{C_1,...,C_m\}\) is not known a priori. The common approach is to define a dissimilarity \(d(x_i, C_j)\), a dissimilarity threshold \({\varTheta }\), and a maximum number of clusters \(q\). The idea is to assign each newly presented vector to an existing cluster or to create a new cluster for it, depending on its distance to the already defined clusters. In online tracking, the parameter vector changes gradually, so the threshold \({\varTheta }\) and the number \(q\) are difficult to set. To avoid this setting problem, we create new clusters using a simple heuristic. In pseudocode, the algorithm works as follows:

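A minimal Python sketch of this procedure under our reading of the text: a new cluster is opened every \(D\) frames instead of via a fixed dissimilarity threshold, and each incoming weight vector joins its nearest cluster under the cosine dissimilarity of Eq. (1) below. The periodic K-means re-clustering mentioned at the end of this section is omitted for brevity, and all class and variable names are illustrative.

```python
import numpy as np

def cosine_dissimilarity(w, center):
    # d(w, C_k) = 1 - <w, c_k> / (||w|| ||c_k||), cf. Eq. (1)
    return 1.0 - np.dot(w, center) / (np.linalg.norm(w) * np.linalg.norm(center))

class SequentialClustering:
    def __init__(self, interval=100):
        self.interval = interval   # D: a new cluster is opened every D frames
        self.members = []          # members[k] holds the vectors of cluster C_k
        self.frame = 0

    def update(self, w):
        self.frame += 1
        # heuristic: open a new cluster on the first frame and every D frames
        if not self.members or self.frame % self.interval == 0:
            self.members.append([np.asarray(w)])
            return
        # otherwise assign w to the nearest existing cluster center C_{k,c}
        centers = [np.mean(m, axis=0) for m in self.members]
        k = int(np.argmin([cosine_dissimilarity(w, c) for c in centers]))
        self.members[k].append(np.asarray(w))

    def centers(self):
        # C_{k,c}: the average of all vectors assigned to cluster C_k
        return [np.mean(m, axis=0) for m in self.members]
```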

As can be seen, the algorithm is simple but still quite efficient. Different choices of the distance function \(d(w_i, C_k)\) lead to different results. We define:

$$\begin{aligned} d(w_i, C_k) = 1 -\frac{<w_i,C_{k,c}>}{||w_i||||C_{k,c}||} \end{aligned}$$
(1)

where \(<A,B> = \sum _{i = 1}^n A_i \times B_i\) is the dot product of two vectors, \(||A|| = \sqrt{\sum _{i =1}^n (A_i)^2}\), and \(C_{k, c}\) is the average of all vectors in the set \(C_k\). Owing to the structured time series nature of online tracking, our method creates one new cluster every \(D\) frames and uses K-means [27] to re-cluster. This sequential clustering is used in Sect. 3.3.

3.3 Clustering Ensemble Tracker

We adopt a bagging-like method to obtain the final ensemble results. Bagging generates multiple versions of a predictor and uses them to form an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and takes a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Here, we use the structure predictor trained in each frame as the basic version of a predictor.
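For concreteness, a toy illustration of the two aggregation rules just described; this is background, not part of the paper's algorithm, and the names are ours.

```python
import numpy as np

def bagging_aggregate(predictions, regression=True):
    # predictions[k] is the output of the k-th version of the predictor
    preds = np.asarray(predictions)
    if regression:
        return preds.mean(axis=0)            # average for numerical outcomes
    vals, counts = np.unique(preds, return_counts=True)
    return vals[np.argmax(counts)]           # plurality vote for class labels
```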

Object-Part Predictor. Similar to [24], a structured part models predictor is trained in an online manner on the tracked object locations in previous frames. We represent the object bounding box \(B_i=\{\mathbf{x }_i,w_i,h_i\}\) by its center location \(\mathbf x _i=(x_i,y_i)\), width \(w_i\), and height \(h_i\). The HOG features of image \(\mathbf I \) at locations inside the bounding box \(B_i\) are extracted to obtain the feature vector \({\varPhi }{(\mathbf I ;B_i)}\). The part indicators are \(i\in V\), where \(V = \{V_0, V_1, ..., V_n\}\) is the set comprising the object and its parts; \(V_0\) denotes the object itself. We then define a graph \(G=(V,E)\) over all objects \(m\in {V}\) that we want to track, with edges \((m,n)\in {E}\) between objects. The edges of the graph can be viewed as springs that represent spatial constraints between the tracked objects. Next, we define the score of a configuration \(S=\{P_1,...,P_{|V|}\}\) of multiple tracked parts as the sum of two terms: (1) an appearance score that sums the similarities between the observed image features and the classifier weights for all objects, and (2) a deformation score that measures how much a configuration compresses or stretches the springs between the tracked objects. Unlike [8], the weak base predictor is not our focus; it is just one part of our method. Mathematically, the score of a configuration \(S_b\) is defined as:

$$\begin{aligned} S_b = \sum _{i\in {V}}^{}\mathbf {w_i^T}{\varPhi }(\mathbf {I};B_i)+ \lambda \sum _{(m,n)\in {E}}||(\mathbf {x_m}-\mathbf {x_n})-\mathbf {e_{mn}}||^2. \end{aligned}$$
(2)

where the parameters \(\mathbf {w}_i\) are linear weights on the HOG features of object \(i\), \(\mathbf {e}_{mn}\) is the vector representing the length and direction of the spring between objects \(m\) and \(n\), and the set of all parameters is denoted by \({\varTheta }=\{\mathbf {w_1},...,\mathbf {w_{\mid V\mid }}, \mathbf {e_1},...,\mathbf {e_{\mid E \mid }}\}\). We treat \(\lambda \) as a hyper-parameter that controls the trade-off between the appearance and deformation scores. To reduce the computational complexity, we set \(m=0\), i.e., in the deformation term we only compute the distances between the parts \(V_i\) and the root \(V_0\). We use a passive-aggressive algorithm to perform the parameter update [24, 28].
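A minimal sketch of evaluating the configuration score of Eq. (2), assuming pre-computed HOG features and spring parameters; all container names are illustrative, and the online passive-aggressive update of the weights is not shown.

```python
import numpy as np

def configuration_score(weights, feats, positions, springs, edges, lam):
    # appearance term: sum_i w_i^T Phi(I; B_i) over the object and its parts
    appearance = sum(np.dot(weights[i], feats[i]) for i in weights)
    # deformation term: sum over springs of ||(x_m - x_n) - e_mn||^2;
    # Eq. (2) as printed adds this term, while a penalty formulation
    # would subtract it -- we follow the printed equation here
    deformation = sum(
        np.sum(((positions[m] - positions[n]) - springs[(m, n)]) ** 2)
        for (m, n) in edges)
    return appearance + lam * deformation
```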

Parameter and Feature Clustered Predictor. In this paper, we recast the goal of tracking as finding the best state by not only using the currently trained classifier when the object is easy to identify (see Fig. 2(a)), but also exploiting the historically trained classifiers through the clustering ensemble method when the object is difficult to identify (see Fig. 2(b)). In Fig. 2(a), the object is easy to determine because the confidences of other regions are much lower than that of the lighter region, so the object is easily discriminated from the background. In Fig. 2(b), the background contains many regions with confidences similar to the object's, so if the decision of the currently trained classifier is wrong, the tracker will drift to the background, after which the classifier updates will also be wrong. To reduce these decision ambiguities, we adopt the clustering ensemble method (see Sect. 3.2) in the historical parameter space and object appearance space and use the clustering centers to decide where the object is. To improve computational efficiency and robustness, we take the extremal points of the confidence map as the object candidates. After obtaining the object candidates, we use the clustering centers as weak classifiers to vote for the best state.

Fig. 2. Two confidence maps used to decide where the object is. The lighter a region, the more likely it is to contain the object.

Each cluster center is treated as a weak clustered sub-predictor in either the discriminative parameter space or the generative object appearance space. The score of an object candidate \(B_c\) based on the predictor in parameter space is computed as:

$$\begin{aligned} S_p(B_c) = \sum _{i=1}^{N_p}C_{p,i}^T{\varPhi }(\mathbf {I};B_c) \end{aligned}$$
(3)

where \(N_p\) is the total number of clusters in parameter space by the end of the current frame, and \(C_{p,i}\) is the representation of the \(i^{th}\) cluster center in parameter space. The score of an object candidate under the \(j^{th}\) predictor in object appearance space is expressed as:

$$\begin{aligned} S_o^j(B_c) = \rho (Q(B_c), C_{o,j}), \end{aligned}$$
(4)

where \(\rho \) is the Euclidean metric, \(Q(B_c)\) is the object representation directly extracted from the candidate bounding box, and \(C_{o,j}\) is the \(j^{th}\) clustering center in object feature space. In our experiments, \(Q(B_c)\) is the vectorization of \(B_c\) after resizing it to a quarter of its size; the features in object space are extracted in the same way.
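A minimal sketch of the two clustered-predictor scores in Eqs. (3) and (4); the function names are illustrative. Note that Eq. (4) is a Euclidean distance, so in practice it would be converted to a similarity (e.g., negated or mapped through \(exp(-\rho )\)) before entering the weighted sum of Eq. (5); that conversion is our assumption.

```python
import numpy as np

def parameter_space_score(centers_p, feat):
    # Eq. (3): S_p(B_c) = sum_i C_{p,i}^T Phi(I; B_c)
    return sum(np.dot(c, feat) for c in centers_p)

def appearance_space_score(center_oj, q_bc):
    # Eq. (4): S_o^j(B_c) = rho(Q(B_c), C_{o,j}) with rho the Euclidean
    # metric; q_bc is the vectorized, quarter-resized candidate patch
    return np.linalg.norm(np.asarray(q_bc) - np.asarray(center_oj))
```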

Combining Eqs. (2)-(4), the final score of an object candidate is:

$$\begin{aligned} S = \lambda _1 S_b + \lambda _2 S_p + \lambda _3 S_o, \lambda _1+\lambda _2+\lambda _3 =1, \end{aligned}$$
(5)

where \(\{S_b, S_p, S_o\}\) are the scores of the weak part models predictor, the weak parameter predictor, and the weak object appearance predictor, and \(\{\lambda _1, \lambda _2, \lambda _3\}\) are their respective weights. How the \(\lambda \) values are learned is described in the next section. The final object location is inferred from Eq. (5):

$$\begin{aligned} B^* = \arg \max _{B_c} S(B_c), \end{aligned}$$
(6)

where the \(B_c\) are the candidate object bounding boxes sampled from the search region near the previous object location.
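Putting Eqs. (5) and (6) together, a minimal sketch of the final decision step; the array names, and the assumption that all three per-candidate scores have already been scaled comparably (e.g., the distance of Eq. (4) converted to a similarity), are ours.

```python
import numpy as np

def ensemble_decision(candidates, s_b, s_p, s_o, lambdas):
    # Eq. (5): S = l1*S_b + l2*S_p + l3*S_o with l1 + l2 + l3 = 1
    l1, l2, l3 = lambdas
    s = l1 * np.asarray(s_b) + l2 * np.asarray(s_p) + l3 * np.asarray(s_o)
    # Eq. (6): B* = argmax over the candidate boxes of the combined score
    return candidates[int(np.argmax(s))]
```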

3.4 Weight Update

Our model updates the weights of the three predictors after the decision stage at each step; frames that do not satisfy the update condition (e.g., heavily occluded ones) are skipped, so that the model can evolve safely. At each step, after making the decision, our method obtains the labels of the data predicted by the strong predictor and an observation of each weak view predictor's performance, namely the prediction consistency of the weak classifiers with respect to the strong classifier, similar to [9, 29].

The weight distribution depends on the accumulated normalized central-pixel error probability. The accumulation is a cumulative sum of observations of each predictor's relative reliability. The normalized central-pixel error probability is a normalized probability directly related to the distance between the object's center and the centers of the weak predictors' observations. Mathematically, we have

$$\begin{aligned} p(o_i^t|x^t) = \frac{1}{Z_t}exp(-(o_i^t - x^t)^2/\sigma ^2), \end{aligned}$$
(7)
$$\begin{aligned} Z_t = \sum _i^n p(o_i^t|x^t) \end{aligned}$$
(8)

where \(o_i^t\) is the observed state center location of the \(i^{th}\) weak predictor at step \(t\), \(x^t\) is the predicted object center location, \(Z_t\) is the normalization factor at step \(t\), and \(\sigma = 25\). Each predictor weight is defined as:

$$\begin{aligned} \lambda _i = \frac{\sum _{t=2}^T p(o_i^t|x^t)}{\sum _{t=2}^{T}Z_t} \end{aligned}$$
(9)

which computes the relative reliability of each view predictor.
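A minimal sketch of this cumulative update under our reading of Eqs. (7)-(9): we take \(Z_t\) to be the sum of the unnormalized exponentials, so that each \(p(o_i^t|x^t)\) is a proper probability, and we renormalize the accumulated mass so that the weights sum to one as required by Eq. (5). This reading and all names are our assumptions.

```python
import numpy as np

def update_weights(history_centers, history_x, sigma=25.0):
    # history_centers[t][i]: observation center o_i^t of weak predictor i
    # history_x[t]:          ensemble's predicted object center x^t
    n = len(history_centers[0])
    acc = np.zeros(n)                     # cumulative sum over steps t
    for centers, x in zip(history_centers, history_x):
        d2 = np.array([np.sum((np.asarray(o) - np.asarray(x)) ** 2)
                       for o in centers])
        e = np.exp(-d2 / sigma ** 2)      # unnormalized consistency, Eq. (7)
        acc += e / e.sum()                # normalize by Z_t and accumulate
    return acc / acc.sum()                # Eq. (9): relative reliabilities
```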

4 Experiments

For the experiments, publicly available video sequences from [5, 11, 30, 31] were used. On these sequences, the proposed method (CET) was analyzed and compared with \(7\) state-of-the-art tracking methods: Multiple Instance Learning (MIL) [5], Visual Tracking Decomposition (VTD) [30], Struck [7], Tracking-Learning-Detection (TLD) [32], Part Tracker (PT) [33], Structure Preserving Object Tracking (SPOT) [24], and Randomized Ensemble Tracking (RET) [9]. All algorithms are compared using the same initial positions in the first frame, as in [31].

4.1 Implementation Details

In all experiments, the parameters of our tracker are fixed. The results of MIL, VTD, Struck, and TLD come from the public benchmark in which the sequences' ground truth was re-annotated by Wu et al. [31] and several trackers' results are provided through third-party evaluation. For fairness, for the other trackers we use the code provided by the respective authors on their homepages. The binary code of PT is public; we only need to prepare a config file to obtain its results. The source code of SPOT is published on the website of Zhang and van der Maaten [24]. One limitation of SPOT is that part initialization for single object tracking is missing from the source code, because it is mainly designed for multiple object tracking. Since we use it as our base tracker, the parts must be initialized; for simplicity and robustness, we divide the object into four equal parts. The source code of RET is also provided by its authors. MIL and TLD use Haar-like features [34] or LBP-like features, which are sensitive to large illumination changes, whereas Struck, VTD, PT, SPOT, RET, and CET are based on edge information or HOG features [35], which are robust to illumination changes and minor misalignment. We use the parameters given in their code and obtain the results on all sequences. In our method, a new cluster is initialized every \(D = 100\) frames. The time complexity is mainly determined by the number of parts, the clustering computation, feature extraction, and the size of the search region for deciding where the object is.

Fig. 3. Plots of the overall performance comparison on the 22 videos in the benchmark [31]. The proposed method (CET) obtains better performance in both the precision (left) and success (right) plots.

4.2 Quantitative Analysis

The quantitative comparisons between several state-of-the-art trackers and our tracker (CET) are shown in Fig. 3 and Table 1. We follow the evaluation protocol proposed in [31]. Overall, our method consistently outperforms the others (see Fig. 3). In addition, Fig. 4 shows comparisons on different subsets, such as the occlusion and illumination subsets. As the quantitative results in Table 1 show, CET performs competitively against the other state-of-the-art algorithms on all tested sequences. As summarized in Table 1, our method (CET) tracks the targets most accurately in terms of center location error and success rate, even in the presence of several types of appearance change.

Fig. 4. Comparisons on different subsets, divided according to the main variation of the tracked object. For details of the subsets, see [31]. The proposed method (CET) obtains better or comparable performance on all subsets.

Fig. 5. Center location errors comparing CET with SPOT and RET. (a) compares CET with its base tracker SPOT; (b) compares CET with the recent state-of-the-art ensemble tracker RET.

Fig. 6. Success rate based on overlap rate, comparing CET with SPOT and RET. (a) compares CET with its base tracker SPOT; (b) compares CET with the recent state-of-the-art ensemble tracker RET.

Table 1. Comparison of tracking results. The numbers indicate the average center location errors in pixels. Bold, underlined, and italic entries mark the best, second best, and third best results, respectively. The numbers in parentheses indicate the percentage of successfully tracked frames, where tracking is a success when the overlap ratio between the predicted bounding box \(A_p\) and the ground truth bounding box \(A_g\) exceeds \(0.5\): \(\frac{A_p \cap A_g}{A_p \cup A_g} > 0.5\).

Comparison of Competing Tracking Algorithms. Although SPOT is our base tracker, we achieve better performance on most video sequences by introducing the hidden clustering information via the sequential clustering method (see Figs. 5(a) and 6(a)). RET exploits the non-stationary distribution of the weight vector in parameter space for its ensemble and achieves good performance; our tracker CET adopts the sequential clustering method to exploit the hidden non-stationary distribution of both the parameters and the object appearance. As Figs. 5(b) and 6(b) show, we also obtain better performance than RET. Comparing the proposed tracking algorithm with the seven state-of-the-art tracking algorithms, Table 1 summarizes the average center location error and success rate over the \(22\) sequences. From the experimental results, we see that our tracking algorithm obtains the best performance on ten sequences in terms of center location error or success rate, the second best on seven, and the third best on four. Figure 4 shows that our method handles occlusion, illumination change, and out-of-view well. The robustness of our CET tracker lies in the object-part structure, which is discriminatively trained online to account for the variations, and in the historical hidden structure information in the parameter space of the base tracker and in the appearance space of the historically predicted objects.

5 Conclusion

In this paper, we address the problem of decision ambiguities in tracking by fusing an object-part predictor, a parameter clustered predictor, and a feature clustered predictor. The object-part predictor exploits the structure between the object and its parts, which is effective against deformable appearance changes. The parameter clustered predictor exploits, to some extent, the temporal hidden group structure in the object parameter space. The feature clustered predictor protects the object from distracters and further improves performance. On this basis, we propose a tracker, clustering ensemble tracking (CET), built on a structure learning and sequential clustering framework that avoids the drifting problem. Extensive experiments show that our algorithm is robust to occlusion, illumination change, and out-of-view because the different predictors have complementary properties. The accuracy of CET is superior or competitive to several state-of-the-art tracking algorithms, achieved in an efficient way.