1 Introduction

Object tracking in image sequences has been a very active research area in computer vision for decades, arising in many applications such as human computer interaction, security and surveillance, traffic control and smart rooms. Tracking consists in estimating the target location in each image of a video sequence. The localization process is based on recognizing the object from a set of visual features such as color, shape, speed, etc. Many tracking algorithms have been proposed in the literature [4, 12, 13, 17, 28]. Among them, the Mean Shift algorithm [4] turns out to be robust and accurate compared to other algorithms [17, 28]. The method was originally proposed by Fukunaga et al. [5] and was first used in 1997 in the context of image segmentation. It was then adopted by Comaniciu et al. [4] for real-time tracking of deformable objects in an image sequence using the color density of the target. Tracking is carried out from the target's initial position in the first image. The target is modeled as an ellipse over which its color distribution is computed. The initial color distribution is taken as the reference model and is compared with that of the candidates to determine the most probable position in the next frame. The Mean Shift algorithm iteratively seeks to maximize a similarity measure between the target model and the target candidates.

Despite the great success achieved by the Mean Shift algorithm [4] in many situations (noise, illumination change, rotation, partial occlusion), its performance suffers from the use of a fixed-size window: a window size that works at one scale is not suitable for another as the target moves towards or away from the camera. If the kernel size is chosen too large, the tracking window contains many background pixels as well as the foreground object pixels, which often leads the tracking algorithm to converge to an area between multiple modes rather than to a single mode. If the kernel size is chosen too small, localization becomes poor, since some pixels on the object are not included in the search window and the similarity function often has many local maxima. Other problems are common to all tracking algorithms. The first is the background influence: the target cannot be represented accurately without any background information, and if the correlation between target and background is high, the localization accuracy decreases. The second is the appearance of distractor elements, which arises when the scene or the background contains objects similar to the target in appearance. These objects can be confused with the target and may distract the tracking algorithm, hence the name distractor. In summary, all these limitations represent a major challenge: they cause the tracking algorithm to drift and increase the difficulty of tracking.

The main objective of this work is to propose a new human tracking algorithm based on the Mean Shift technique [4]. This work aims to develop a technique that copes well with the difficulties discussed above, operating with minimal control over the object and its environment and without a priori knowledge of the appearance model. The proposed work is a different tracking algorithm from our previous work presented in [11].

The main contributions of this paper are as follows. First, we use the moment features of the weight image [2] to estimate the position, scale and orientation changes of the target, relying on the fact that the weight image derived from the target model and the target candidate represents the probability that a pixel belongs to the target. The target scale is computed using the zeroth order moment and the Bhattacharyya coefficient [10]; the width, height and orientation are then estimated from the second order moments and the estimated area. Second, a background representation model named Background-Weighted Histogram (BWH) [19] is incorporated into the target representation. The proposed approach combines the moment features of the weight image [2] with the BWH [19] to build a robust tracking algorithm named Scale and Orientation-based Background Weighted Histogram (SOBWH). SOBWH effectively exploits the main features of the background, so that the tracker is not confused by stable objects in the background. In addition, by learning the background with a weighting mechanism [19], we can suppress the intensity values related to the background and better discriminate between the target and its background, which improves the position, scale and orientation estimation in the presence of distractor elements. Our experimental results validate the performance of SOBWH under complex tracking scenarios, namely large background variation, scale and orientation changes, target and camera motion, distinction between the object and its background, partial occlusion, illumination change, appearance of distractor elements, rotation and deformation.

The rest of the paper is organized as follows: Sect. 2 reviews related work. Section 3 briefly introduces the traditional Mean Shift tracking algorithm. Section 4 presents Mean Shift tracking under scale and orientation changes of the target. Section 5 describes the proposed method. Experimental results and discussion follow in Sect. 6, and Sect. 7 concludes the paper.

2 Related work

The works presented in this section are among the tracking methods based on the appearance of an object, especially those addressing scale adaptation and orientation changes of the target. Research on these problems can be classified into three categories [29].

The first category contains incremental heuristic methods. In the work by Comaniciu et al. [4], the Mean Shift algorithm is performed at three different scales (the previous scale, a slightly larger scale and a slightly smaller scale) and, for each window size, the similarity measure based on the Bhattacharyya coefficient [10] is computed for comparison. The window size yielding the largest Bhattacharyya coefficient is chosen as the updated scale. This principle has been used by many authors [1, 21]. However, the approach does not cope well with an increase in object size, since smaller windows usually have higher similarity and the scale is therefore often underestimated. Collins [3] extended the Mean Shift algorithm by adapting Lindeberg's theory of feature scale selection based on local maxima of differential scale-space filters. Mean Shift is then applied in the spatial and scale dimensions to the weight image, using blob tracking and a scale kernel to accurately capture the target's variation in scale. However, this method requires recomputing Gaussian kernels at every Mean Shift iteration, which is computationally intensive, and it cannot handle rotation changes of the target. A similar work is the adaptive pyramid Mean Shift tracking proposed by Li et al. [16]. Different thresholds for the scale increment are applied, and this scale selection method can track objects of varying sizes while the complexity remains unchanged. However, the tracking process may be interrupted when the illumination changes severely.

The second category contains spatial moment methods. Bradski [2] modified the Mean Shift algorithm and produced the Continuously Adaptive Mean Shift (CAMSHIFT) algorithm for face tracking, which only considers the moments of the weight image determined by the target model to estimate the scale and orientation of the target. Zivkovic et al. [32] proposed a similar algorithm, Expectation–Maximisation Shift (EM-Shift), that estimates scale and orientation changes. It concurrently estimates the target position and the covariance matrix describing the target shape by computing the spatial moments of the weight image of the target candidate. Yang et al. [27] introduced a new similarity measure that estimates the scale by comparing the second moments of the target model and the target candidate. A scale and orientation adaptive mean shift tracking (SOAMST) algorithm was proposed by Ning et al. [20], in which the moment features are used to determine the scale and orientation changes of the target. The second moments are computed from an image of weights proportional to the probability that a pixel belongs to the target model. This method showed that the estimated area and the second order moments can adaptively estimate the changes of the target with high accuracy. In summary, all these methods rely heavily on the weight image to estimate the scale and orientation changes of the target, which is sensitive to illumination and cluttered backgrounds; thus, they fail when the object has a color similar to its background.

The third category contains feature mapping-based methods, where local features matched between two consecutive frames carry the information about scale and orientation changes. Peng et al. [22] proposed an automatic bandwidth selection that exploits a feature mapping strategy to estimate the scale variation based on backward tracking and object centroid registration. A scale invariant feature transform (SIFT) based Mean Shift algorithm for object tracking is presented in [31] to obtain a reliable estimate of the target location and scale. However, this method involves local key point detection and feature matching, which affects the tracking efficiency, and it is not well suited to non-rigid target tracking. Similarly, [14, 30] rely on support features for scale estimation after the Mean Shift algorithm solves for position. Liang et al. [14] search for the target boundary by correlating the image with four templates; the boundary positions directly determine the target scale. Zhao et al. [30] exploit affine structure to recover the relative target scale from feature point correspondences between consecutive frames. In summary, methods depending on feature matching are able to estimate the scale robustly, but they cannot be seamlessly integrated into the Mean Shift framework. Moreover, estimating scale from feature correspondences is time consuming, requires well-localized features that can be detected with high repeatability, and has difficulty dealing with non-rigid or deformable objects.

Other attempts study different representation methods. Yu et al. [29] proposed a three-dimensional Mean Shift tracking algorithm that combines a multi-scale model and a background weighted spatial histogram to directly estimate the target position and scale in three-dimensional image space. Hu et al. [9] proposed an enhanced Mean Shift tracking algorithm using a joint spatial-color feature and a similarity function to estimate the scale and orientation of the target. More recently, Vojir et al. [26] addressed the problem of scale adaptation and presented a theoretically justified scale estimation mechanism that relies solely on the Mean Shift procedure for the Hellinger distance. None of the methods presented above considers the influence of large background variation; they are therefore not discriminative enough when the object and its background have similar colors.

3 Traditional Mean Shift tracking algorithm

The Mean Shift tracking technique finds the position of the target model in the current image from its color distribution, which is usually approximated by a normalized histogram [4]. The object to be tracked is assumed to have a color density function q, and the target candidate centered at a point y has density p(y). The problem is to find the point y whose associated density p(y) is closest to q. In [4], the appearance of the target is represented by an m-bin RGB weighted histogram computed over an elliptical region containing the object to track. Let \(\{z_{i}^{*}\}_{i = 1 \ldots n}\) be the normalized pixel positions in the target region, which is assumed to be centered at the origin. The target model \(\hat{q}\) and the probability of the feature s = 1…m in this model, \(\hat{q}_{s}\), are computed as:

$$\left\{ {\begin{array}{*{20}l} {\hat{q} = \left\{ {\hat{q}_{s} } \right\}_{{s = 1 \ldots m}} } \\ {\hat{q}_{s} = F\sum\nolimits_{{i = 1}}^{n} k \left( {\left\| {z_{i}^{*} } \right\|^{2} } \right)\delta \left[ {b\left( {z_{i}^{*} } \right) - s} \right]} \\ \end{array} } \right.$$
(1)

Where m is the size of the histogram (i.e. the number of bins used) and δ is the Kronecker delta function [4, 15] (for each pixel, the sum of the delta functions over s = 1…m is 1). The function \(b: R^{2} \to \left\{ {1 \ldots m} \right\}\) associates with each pixel at position \(z_{i}^{*}\) the index \(b(z_{i}^{*})\) of its bin in the m-bin histogram. k(z) is an isotropic, monotonically decreasing kernel profile, assigning a lower weight to coordinates far from the center of the target model, and \(\left\| {z_{i}^{*} } \right\|\) is the distance between the pixel \(z_{i}^{*}\) and the window center. The normalization constant F is obtained by imposing \(\sum\nolimits_{s = 1}^{m} {\hat{q}_{s} = 1}\), where [4]:

$$F = \frac{1}{{\mathop \sum \nolimits_{i = 1}^{n} k\left( {\left\| {z_{i}^{*} } \right\|^{2} } \right)}}$$
(2)
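
To make Eqs. (1) and (2) concrete, the following Python sketch builds the kernel-weighted, normalized m-bin RGB histogram of a rectangular image patch. It is only an illustrative implementation under simplifying assumptions (an Epanechnikov profile and pixel coordinates normalized to the unit circle); the names `epanechnikov_profile` and `weighted_color_histogram` are ours, not from [4].

```python
import numpy as np

def epanechnikov_profile(x):
    # Profile of the Epanechnikov kernel (up to a constant): 1 - x for x in [0, 1], else 0.
    return np.where(x <= 1.0, 1.0 - x, 0.0)

def weighted_color_histogram(patch, bins_per_channel=16):
    """Kernel-weighted, normalized m-bin RGB histogram of an (H, W, 3) uint8 patch (Eqs. 1-2)."""
    h, w, _ = patch.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Normalized squared distance ||z_i*||^2 of every pixel from the patch center.
    r2 = ((ys - cy) / max(cy, 1.0)) ** 2 + ((xs - cx) / max(cx, 1.0)) ** 2
    k = epanechnikov_profile(r2)
    # b(z_i*): map each RGB pixel to one of m = bins_per_channel^3 bins.
    q = patch.astype(np.int64) // (256 // bins_per_channel)
    idx = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]
    hist = np.bincount(idx.ravel(), weights=k.ravel(),
                       minlength=bins_per_channel ** 3)
    return hist / hist.sum()  # the normalization constant F enforces sum(q_hat) = 1
```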

Similarly, let \(\left\{ {z_{i} } \right\}_{{i = 1 \ldots n_{h} }}\) be the normalized pixel locations of the target candidate centered at y in the current frame. Using the same kernel profile k(z), but with bandwidth h, the target candidate model \(\hat{p}(y)\) and the probability of the feature s = 1…m in this model, \(\hat{p}_{s}(y)\), are given by:

$$\left\{ {\begin{array}{*{20}l} {\hat{p}(y) = \left\{ {\hat{p}_{s} (y)} \right\}_{{s = 1 \ldots m}} } \\ {\hat{p}_{s} (y) = F_{h} \sum\nolimits_{{i = 1}}^{{n_{h} }} k \left( {\left\| {\frac{{y - z_{i} }}{h}} \right\|^{2} } \right)\delta \left[ {b\left( {z_{i} } \right) - s} \right]} \\ \end{array} } \right.$$
(3)

Where F h is the normalization constant defined by:

$$F_{h} = \frac{1}{{\mathop \sum \nolimits_{i = 1}^{{n_{h} }} k\left( {\left\| {\frac{{y - z_{i} }}{h}} \right\|^{2} } \right)}}$$
(4)

Where h is the scale (bandwidth) of the target candidate, which determines the number of pixels \(n_{h}\) considered in the localization process. The Bhattacharyya coefficient [10] is used as a measure of similarity between the two distributions \(\hat{q}_{s}\) and \(\hat{p}_{s} \left( y \right)\):

$$\rho \left[ {\hat{p}\left( y \right),\hat{q}} \right] = \mathop \sum \limits_{s = 1}^{m} \sqrt {\hat{p}_{s} \left( y \right)\hat{q}_{s} }$$
(5)

The Bhattacharyya distance [10] is then defined by:

$$d\left[ {\hat{p}\left( y \right),\hat{q}} \right] = \sqrt {1 - \rho \left[ {\hat{p}\left( y \right),\hat{q}} \right]}$$
(6)

To find the location corresponding to the target in the current image, the distance (6) should be minimized as a function of y, which is equivalent to maximizing the Bhattacharyya coefficient (5). This maximization can be performed efficiently using Mean Shift iterations. The search for the new position of the target in the current frame starts at the estimated target location \(\hat{y}_{0}\) in the previous frame. Thus, the probabilities \(\left\{ {\hat{p}_{s} (\hat{y}_{0} )} \right\}_{s = 1 \ldots m}\) of the target candidate at the position \(\hat{y}_{0}\) in the current image are calculated first. Using a Taylor expansion around \(\hat{p}_{s} (\hat{y}_{0})\), the linear approximation of the Bhattacharyya coefficient (5) is obtained as:

$$\rho \left[ {\hat{p}\left( y \right),\hat{q}} \right] \approx \frac{1}{2}\mathop \sum \limits_{s = 1}^{m} \sqrt {\hat{p}_{s} \left( {y_{0} } \right)\hat{q}_{s} } + \frac{1}{2}F_{h} \mathop \sum \limits_{i = 1}^{{n_{h} }} \omega_{i} k\left( {\left\| {\frac{{y - z_{i} }}{h}} \right\|^{2} } \right)$$
(7)

Where

$$\omega_{i} = \mathop \sum \limits_{s = 1}^{m} \sqrt {\frac{{\hat{q}_{s} }}{{\hat{p}_{s} \left( {\hat{y}_{0} } \right)}}} \delta \left[ {b\left( {z_{i} } \right) - s} \right]$$
(8)

Since the first term in (7) is independent of y, minimizing the distance in (6) is equivalent to maximizing the second term in (7). In this procedure, the estimated target moves from the current location \(\hat{y}_{0}\) to the new location \(\hat{y}_{1}\) according to the equation [4]:

$$\hat{y}_{1} = \frac{{\mathop \sum \nolimits_{i = 1}^{{n_{h} }} z_{i} \omega_{i} g\left( {\left\| {\frac{{\hat{y}_{0} - z_{i} }}{h}} \right\|^{2} } \right)}}{{\mathop \sum \nolimits_{i = 1}^{{n_{h} }} \omega_{i} g\left( {\left\| {\frac{{\hat{y}_{0} - z_{i} }}{h}} \right\|^{2} } \right)}}$$
(9)

When the kernel profile k has the Epanechnikov form [24], its derivative g is constant and Eq. (9) reduces to:

$$\hat{y}_{1} = \frac{{\mathop \sum \nolimits_{i = 1}^{{n_{h} }} z_{i} \omega_{i} }}{{\mathop \sum \nolimits_{i = 1}^{{n_{h} }} \omega_{i} }}$$
(10)

Where g(z) = −k′(z) [4]. By using Eq. (10), the Mean Shift algorithm finds in the new frame the region most similar to the object within a limited number of iterations, and the new location \(\hat{y}_{1}\) is the center of the target region in the current frame. From Eq. (10) it can be observed that the key parameters in the Mean Shift algorithm are the weights \(\omega_{i}\).
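
As a rough illustration of Eqs. (8) and (10), the sketch below performs a single Mean Shift step: it computes the per-pixel weights \(\omega_{i}\) from the bin ratio \(\sqrt{\hat{q}_{s}/\hat{p}_{s}(\hat{y}_{0})}\) and shifts the window center to the weighted mean of the pixel coordinates. It reuses the hypothetical `weighted_color_histogram` helper from the previous sketch and assumes the candidate window lies fully inside the frame.

```python
import numpy as np

def mean_shift_step(frame, y0, half_size, q_model, bins_per_channel=16):
    """One Mean Shift iteration (Eqs. 8 and 10); returns the shifted center and the weight image."""
    r0, c0 = int(round(y0[0])), int(round(y0[1]))
    hy, hx = half_size
    patch = frame[r0 - hy:r0 + hy + 1, c0 - hx:c0 + hx + 1]
    p_cand = weighted_color_histogram(patch, bins_per_channel)   # p_hat(y0), Eq. (3)
    # Per-bin ratio sqrt(q_s / p_s(y0)); zero where the candidate bin is empty.
    ratio = np.sqrt(np.divide(q_model, p_cand,
                              out=np.zeros_like(q_model, dtype=float),
                              where=p_cand > 0))
    q = patch.astype(np.int64) // (256 // bins_per_channel)
    idx = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]
    w = ratio[idx]                                   # weight image omega_i, Eq. (8)
    ys, xs = np.mgrid[r0 - hy:r0 + hy + 1, c0 - hx:c0 + hx + 1]
    # Eq. (10): with the Epanechnikov profile, g is constant and cancels out.
    y1 = np.array([np.sum(w * ys), np.sum(w * xs)]) / np.sum(w)
    return y1, w
```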

4 Mean Shift tracking for scale and orientation changes of the target

The enlarging or shrinking of the target is usually a gradual process over consecutive frames. In this section, we analyze how to adaptively calculate the scale and orientation changes of the target under the Mean Shift framework.

4.1 Estimating the target area

Based on the analysis presented in [2, 20], the weight image in the Mean Shift algorithm can be considered as a density distribution function of the target, where the weight value of a pixel reflects the probability that it belongs to the target. Thus, the scale and orientation of the target can be estimated using this density distribution function together with the moment features of the weight image. By definition, the zeroth order moment can be considered as the weighted area of the target in the target candidate region:

$$M_{00} = \mathop \sum \limits_{i = 1}^{{n_{h} }} \omega_{i}$$
(11)

To continuously track the object, some adjustment parameters are needed to adapt the tracking to scale changes. To keep the target within the searching region, we increase the size of the target candidate by enlarging the width and length of the searching area based on the values from previous frames. On the other hand, using Eq. (11), the Bhattacharyya coefficient ρ can be used to adjust \(M_{00}\) when estimating the target area, which is denoted by B and estimated as:

$$B = e\left( \rho \right)M_{00}$$
(12)

Where e(ρ) is a monotonically increasing Gaussian function with respect to the Bhattacharyya coefficient ρ (0 ≤ ρ ≤ 1). The authors in [20] demonstrated that \(M_{00}\) is always greater than the real target area and that it monotonically approaches the real target area as ρ increases.

Thus, we require that e(ρ) be monotonically increasing and reach its maximum of 1 when ρ is 1. The Gaussian function e(ρ) is defined as follows:

$$e\left( \rho \right) = exp\left( {\frac{\rho - 1}{\sigma }} \right)$$
(13)

From Eqs. (12) and (13) we can see that when the target model approaches the target candidate, \(e(\rho)\) gets close to 1, and in this case it is more reliable to use \(M_{00}\) as the estimate of the target area. When ρ decreases, the similarity between the target candidate and the target model diminishes; \(M_{00}\) becomes much larger than the target area, but e(ρ) is less than 1, so that B avoids being biased too far from the real target area. Note that when ρ approaches 0, the tracked target is lost and e(ρ) becomes very small, so that B is close to 0. In practice, the parameter σ is taken between 1 and 2 [20].
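
A compact sketch of the area estimation in Eqs. (11)-(13) is given below; `estimate_target_area` is an illustrative name of ours, and σ is simply fixed inside the suggested [1, 2] range.

```python
import numpy as np

def estimate_target_area(weight_image, q_model, p_candidate, sigma=1.5):
    """Target area B (Eqs. 11-13) from the weight image and the two histograms."""
    m00 = weight_image.sum()                         # zeroth order moment, Eq. (11)
    rho = np.sum(np.sqrt(q_model * p_candidate))     # Bhattacharyya coefficient, Eq. (5)
    e_rho = np.exp((rho - 1.0) / sigma)              # Gaussian correction e(rho), Eq. (13)
    return e_rho * m00                               # Eq. (12): B = e(rho) * M00
```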

4.2 Estimating the width, height and orientation of the target

Moment features describe numeric quantities of a distribution relative to a reference point or axis [23]; they are commonly used in image analysis and pattern recognition tasks [23], essentially because of their simplicity, their invariance properties and the geometric meaning of the low order moments. As in the CAMSHIFT algorithm [2], the first order moments of the weight image, \(\{M_{10}, M_{01}\}\), are used to locate the target center as follows:

$$M_{10} = \mathop \sum \limits_{i = 1}^{{n_{h} }} \omega_{i} z_{i,1}, \quad M_{01} = \mathop \sum \limits_{i = 1}^{{n_{h} }} \omega_{i} z_{i,2}$$
(14)

Similarly, the second order moments \(\{M_{20}, M_{02}, M_{11}\}\) [2] are calculated as follows:

$$M_{20} = \mathop \sum \limits_{i = 1}^{{n_{h} }} \omega_{i} z_{i,1}^{2}, \quad M_{02} = \mathop \sum \limits_{i = 1}^{{n_{h} }} \omega_{i} z_{i,2}^{2}, \quad M_{11} = \mathop \sum \limits_{i = 1}^{{n_{h} }} \omega_{i} z_{i,1} z_{i,2}$$
(15)

Where the pair \((z_{i,1}, z_{i,2})\) gives the coordinates of pixel i in the candidate region. Comparing Eq. (10) with Eqs. (11) and (14), the new position \(\hat{y}_{1}\) of the target found in Eq. (9) is actually the ratio of the first order moments to the zeroth order moment:

$$\hat{y}_{1} = \left( {\bar{z}_{1} ,\bar{z}_{2} } \right) = \left( {\frac{{M_{10} }}{{M_{00} }},\frac{{M_{01} }}{{M_{00} }}} \right)$$
(16)

Where \((\bar{z}_{1}, \bar{z}_{2})\) represents the centroid of the target candidate region. According to [2], the second order moments describe the shape and orientation (long axis) of an object. Using Eqs. (10), (11), (15) and (16), the second order central moments are computed as follows:

$$\mu_{20} = M_{20} /M_{00} - \bar{z}_{1}^{2}, \quad \mu_{02} = M_{02} /M_{00} - \bar{z}_{2}^{2}, \quad \mu_{11} = M_{11} /M_{00} - \bar{z}_{1} \bar{z}_{2}$$
(17)

Equation (17) can be rewritten as the following covariance matrix:

$$Cov = \left[ {\begin{array}{*{20}c} {\mu_{20} } & {\mu_{11} } \\ {\mu_{11} } & {\mu_{02} } \\ \end{array} } \right]$$
(18)

This covariance matrix can be decomposed by using the singular value decomposition (SVD) [7] as follows:

$$Cov = U \times S \times U^{T} = \left[ {\begin{array}{*{20}c} {u_{11} } & {u_{12} } \\ {u_{21} } & {u_{22} } \\ \end{array} } \right] \times \left[ {\begin{array}{*{20}c} {\lambda_{1}^{2}} & 0 \\ 0 & {\lambda_{2}^{2}} \\ \end{array} } \right] \times \left[ {\begin{array}{*{20}c} {u_{11} } & {u_{12} } \\ {u_{21} } & {u_{22} } \\ \end{array} } \right]^{T}$$
(19)

Where \(U = \left[ {\begin{array}{*{20}c} {u_{11} } & {u_{12} } \\ {u_{21} } & {u_{22} } \\ \end{array} } \right]\), \(S = \left[ {\begin{array}{*{20}c} {\lambda_{1}^{2} } & 0 \\ 0 & {\lambda_{2}^{2} } \\ \end{array} } \right]\), and \(\lambda_{1}^{2}\) and \(\lambda_{2}^{2}\) are the eigenvalues of the covariance matrix [7]. The vectors \((u_{11}, u_{21})^{T}\) and \((u_{12}, u_{22})^{T}\) represent the orientations of the two main axes of the target. In the CAMSHIFT algorithm [2], \(\lambda_{1}\) and \(\lambda_{2}\) are directly used as the width and height of the target [18]. However, in [20] the authors proposed a more accurate technique to estimate the width and height of the target. The target to be tracked is represented by an ellipse whose semi-major and semi-minor axis lengths are denoted by a and b, respectively. It has been shown in [18] that the ratio of \(\lambda_{1}\) to \(\lambda_{2}\) well approximates the ratio of a to b, i.e., \(\lambda_{1} /\lambda_{2} \approx a/b\). Therefore, a and b can be written as \(a = k\lambda_{1}\) and \(b = k\lambda_{2}\), where k is a scale factor. Using the geometric moments of the ellipse [18] and the target area B estimated by Eq. (12), we can write \(\pi ab = \pi (k\lambda_{1})(k\lambda_{2}) = B\). It then follows that:

$$k = \sqrt {B/\left( {\pi \lambda_{1} \lambda_{2} } \right)}$$
(20)
$$a = \sqrt {\lambda_{1} B/(\pi \lambda_{2} ) } \quad b = \sqrt {\lambda_{2} B/(\pi \lambda_{1} )}$$
(21)

Now the covariance matrix becomes:

$$Cov = \left[ {\begin{array}{*{20}c} {u_{11} } & {u_{12} } \\ {u_{21} } & {u_{22} } \\ \end{array} } \right] \times \left[ {\begin{array}{*{20}c} {a^{2} } & 0 \\ 0 & {b^{2} } \\ \end{array} } \right] \times \left[ {\begin{array}{*{20}c} {u_{11} } & {u_{12} } \\ {u_{21} } & {u_{22} } \\ \end{array} } \right]^{T}$$
(22)

Where \(a^{2}\) and \(b^{2}\) are the eigenvalues of the covariance matrix \(Cov\) in Eq. (22), so that a and b give the semi-axes (and hence the height and width) of the tracked elliptical region. Once the position, scale and orientation of the target are estimated in the current frame, the location of the target candidate region must be determined for the next frame. The covariance matrix representing the size of the target candidate region in the next frame is defined as follows [20]:

$$Cov_{2} = U \times \left[ {\begin{array}{*{20}c} {\left( {a + \Delta d} \right)^{2} } & 0 \\ 0 & {\left( {b + \Delta d} \right)^{2} } \\ \end{array} } \right] \times U^{T}$$
(23)

Where Δd is the increment of the target candidate region in the next frame. The position of the initial target candidate region is defined by the following ellipse region:

$$\left( {x - y_{1} } \right) \times Cov_{2}^{ - 1} \times \left( {x - y_{1} } \right)^{T} \le 1$$
(24)

Where x denotes the points inside the ellipse defining the search region in the next frame and \(\hat{y}_{1}\) is the estimated target position, which becomes the center of the candidate region in the next frame.
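
The whole estimation chain of this section (first and second order moments, SVD of the covariance matrix, recovery of the ellipse axes from the estimated area B, and enlargement of the candidate region) can be sketched as follows; the function name, the coordinate convention and the default Δd = 2 pixels are illustrative assumptions of ours rather than values prescribed by [20].

```python
import numpy as np

def estimate_scale_orientation(weight_image, area_B, delta_d=2.0):
    """Width, height, orientation and next-frame covariance of the target (Eqs. 14-23)."""
    h, w = weight_image.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    m00 = weight_image.sum()                                      # Eq. (11)
    xbar = (weight_image * xs).sum() / m00                        # Eqs. (14) and (16)
    ybar = (weight_image * ys).sum() / m00
    mu20 = (weight_image * xs ** 2).sum() / m00 - xbar ** 2       # Eq. (17)
    mu02 = (weight_image * ys ** 2).sum() / m00 - ybar ** 2
    mu11 = (weight_image * xs * ys).sum() / m00 - xbar * ybar
    cov = np.array([[mu20, mu11], [mu11, mu02]])                  # Eq. (18)
    U, S, _ = np.linalg.svd(cov)                                  # Eq. (19): S holds lam1^2, lam2^2
    lam1, lam2 = np.sqrt(S)
    # Semi-axes from pi * a * b = B with a / b ~ lam1 / lam2, Eqs. (20)-(21).
    a = np.sqrt(lam1 * area_B / (np.pi * lam2))
    b = np.sqrt(lam2 * area_B / (np.pi * lam1))
    # Candidate region for the next frame, enlarged by delta_d, Eq. (23).
    cov_next = U @ np.diag([(a + delta_d) ** 2, (b + delta_d) ** 2]) @ U.T
    angle = np.degrees(np.arctan2(U[1, 0], U[0, 0]))              # orientation of the main axis
    return (xbar, ybar), a, b, angle, cov_next
```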

5 Proposed method

Background information plays a very important role because the background is often present in the selected target region. There are two main reasons to use it. First, if some of the target's features are also present in the background, their relevance for target localization is reduced. Second, in many applications it is difficult to exactly delineate the target without any background information; thus, for a better target representation, a background model is used to improve target localization.

5.1 Background-Weighted Histogram

A background representation model named Background-Weighted Histogram (BWH) was proposed by Comaniciu et al. [4]. The background \(\hat{o}\) is described by the discrete representation (normalized histogram) \(\left\{ {\hat{o}_{s} } \right\}_{s = 1 \ldots m}\) (with \(\sum\nolimits_{s = 1}^{m} {\hat{o}_{s} = 1}\)) in the feature space as follows:

$$\left\{ {\begin{array}{*{20}l} {\hat{o} = \left\{ {\hat{o}_{s} } \right\}_{s = 1 \ldots m} } \\ {\hat{o}_{s} = F_{1} \mathop \sum \limits_{i = 1}^{n} k\left( {\left\| {z_{i}^{*} } \right\|^{2} } \right)\delta \left[ {b\left( {z_{i}^{*} } \right) - s} \right]} \\ \end{array} } \right.$$
(25)

Where \(\hat{o}_{s}\) represents the probability of color s = 1…m in this model and \(F_{1}\) is the normalization constant. The weight coefficients are then defined by the following formula [4]:

$$\left\{ {\vartheta_{s} = { \hbox{min} }\left( {\frac{{\hat{o}^{*} }}{{\hat{o}_{s} }},1} \right)} \right\}_{s = 1 \ldots m}$$
(26)

Where \(\hat{o}^{*}\) is the minimal non-zero value of \(\left\{ {\hat{o}_{s} } \right\}_{s = 1 \ldots m}\). These weights are used only to define a transformation of the representations of the target model and candidates. The transformation diminishes the importance of features with low \(\vartheta_{s}\), i.e., those that are prominent in the background and thus less important for the target representation. The background representation is computed over an elliptical region around the target whose area is fixed at three times the target area [4]. The transformed target model is then obtained by the following formula:

$$\left\{ {\begin{array}{*{20}l} {\hat{q}^{\prime} = \left\{ {\hat{q}^{\prime}_{s} } \right\}_{s = 1 \ldots m} } \\ {\hat{q}^{\prime}_{s} = F^{\prime}\vartheta_{s} \mathop \sum \limits_{i = 1}^{n} k\left( {\left\| {z_{i}^{*} } \right\|^{2} } \right)\delta \left[ {b\left( {z_{i}^{*} } \right) - s} \right]} \\ \end{array} } \right.$$
(27)

Where \(\hat{q}^{\prime}_{s}\) represents the probability of color s = 1…m in this model, and F′ is the normalization constant expressed as:

$$F^{\prime} = \frac{1}{{\mathop \sum \nolimits_{i = 1}^{n} k\left( {\left\| {z_{i}^{*} } \right\|^{2} } \right)\mathop \sum \nolimits_{s = 1}^{m} \vartheta_{s} \delta \left[ {b\left( {z_{i}^{*} } \right) - s} \right]}}$$
(28)

Similarly, the transformed target candidate model is:

$$\left\{ {\begin{array}{*{20}l} {\hat{p}^{\prime}(y) = \left\{ {\hat{p}^{\prime}_{s} (y)} \right\}_{s = 1 \ldots m} } \\ {\hat{p}^{\prime}_{s} (y) = F_{h}^{'} \vartheta_{s} \mathop \sum \limits_{i = 1}^{{n_{h} }} k\left( {\left\| {\frac{{y - z_{i} }}{h}} \right\|^{2} } \right)\delta \left[ {b\left( {z_{i} } \right) - s} \right]} \\ \end{array} } \right.$$
(29)

Where now \(F^{\prime}_{h}\) is given by:

$$F^{\prime}_{h} = \frac{1}{{\mathop \sum \nolimits_{i = 1}^{{n_{h} }} k\left( {\left\| {\frac{{y - z_{i} }}{h}} \right\|^{2} } \right)\mathop \sum \nolimits_{s = 1}^{m} \vartheta_{s} \delta \left[ {b\left( {z_{i} } \right) - s} \right]}}$$
(30)

The new weight formula \(\omega^{\prime}_{i}\) calculated by the BWH algorithm is:

$$\omega^{\prime}_{i} = \mathop \sum \limits_{s = 1}^{m} \sqrt {\frac{{\hat{q}^{\prime}_{s} }}{{\hat{p}^{\prime}_{s} \left( y \right)}}} \delta \left[ {b\left( {z_{i} } \right) - s} \right]$$
(31)
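
A minimal sketch of the BWH computation (Eqs. 25-28) is given below: `background_weights` derives the coefficients \(\vartheta_{s}\) from the background histogram, and `transform_model` applies them to a target (or candidate) histogram and renormalizes. Both names are illustrative choices of ours.

```python
import numpy as np

def background_weights(background_hist):
    """Weights v_s = min(o_star / o_s, 1) of Eq. (26); bins absent from the background keep weight 1."""
    v = np.ones_like(background_hist, dtype=float)
    nonzero = background_hist > 0
    o_star = background_hist[nonzero].min()          # minimal non-zero background probability
    v[nonzero] = np.minimum(o_star / background_hist[nonzero], 1.0)
    return v

def transform_model(hist, v):
    """BWH transformation (Eq. 27): down-weight bins that are prominent in the background."""
    transformed = v * hist
    return transformed / transformed.sum()           # renormalization constant F'
```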

5.2 Corrected Background-Weighted Histogram

A Corrected Background-Weighted Histogram (CBWH) algorithm was proposed by Ning et al. [19]. Rather than transforming both the target model and the target candidate, it transforms only the target model. The authors showed that the weights assigned by BWH to the pixels in the target candidate region are proportional to those obtained without any background information, so that BWH does not truly reduce the background interference. CBWH therefore reduces the prominent background features only in the target model, while the target candidate model keeps its original form:

$$\left\{ {\begin{array}{*{20}l} {\hat{p}(y) = \left\{ {\hat{p}_{s} (y)} \right\}_{s = 1 \ldots m} } \\ {\hat{p}_{s} (y) = F_{2} \mathop \sum \limits_{i = 1}^{{n_{h} }} k\left( {\left\| {\frac{{y - z_{i} }}{h}} \right\|^{2} } \right)\delta \left[ {b\left( {z_{i} } \right) - s} \right]} \\ \end{array} } \right.$$
(32)

Where \(\hat{p}_{s} (y)\) represents the probability of colors s = 1···m in this model and F 2 is the normalization constant defined by:

$$F_{2} = \frac{1}{{\mathop \sum \nolimits_{i = 1}^{{n_{h} }} k\left( {\left\| {\frac{{y - z_{i} }}{h}} \right\|^{2} } \right)\mathop \sum \nolimits_{s = 1}^{m} \delta \left[ {b\left( {z_{i} } \right) - s} \right]}}$$
(33)

The new weight formula \(\omega^{\prime\prime}_{i}\) computed by the CBWH in the target candidate region, is given by:

$$\omega^{\prime\prime}_{i} = \mathop \sum \limits_{s = 1}^{m} \sqrt {\frac{{q^{\prime}_{s} }}{{\hat{p}_{s} \left( y \right)}}} \delta \left[ {b\left( {z_{i} } \right) - s} \right]$$
(34)

Motivated by the significant advances made by the CBWH algorithm [19], we adopted this technique in this work in order to exploit the key features of the background.
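
Under the same assumptions as the previous sketches, the CBWH weight image of Eq. (34) can be computed as below; only the target model is transformed by the background weights, while the candidate histogram is left untouched.

```python
import numpy as np

def cbwh_weight_image(q_model, p_candidate, pixel_bins, v):
    """CBWH pixel weights (Eq. 34) for the candidate region.

    q_model, p_candidate : original target and candidate histograms
    pixel_bins           : per-pixel bin index b(z_i) over the candidate region
    v                    : background weights from Eq. (26)
    """
    q_prime = v * q_model
    q_prime = q_prime / q_prime.sum()                # transformed target model, Eq. (27)
    ratio = np.sqrt(np.divide(q_prime, p_candidate,
                              out=np.zeros_like(q_prime, dtype=float),
                              where=p_candidate > 0))
    return ratio[pixel_bins]                         # weight image omega''_i
```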

5.3 Proposed tracking algorithm

The appearance model of the object is a key element affecting the performance of many object tracking systems. Kernel density estimation techniques have been proposed in [4, 6] for constructing statistical representations of object appearance. Their advantage is to provide a relatively flexible and generic description of the appearance and to allow a fast, and under some conditions real-time, implementation. This work presents a new appearance model for human tracking based on the Mean Shift algorithm [4]. It aims to address the main tracking challenges related to scale and orientation changes of the target in the presence of distractor elements, where the only knowledge available about the target is its position in the first frame of the image sequence. Two ideas are developed. First, the moment features of the weight image [2] are used to estimate the position, scale and orientation changes of the target. In this context, we rely on the fact that the weight image derived from the target model and the target candidate represents the probability that a pixel belongs to the target; thus, the target scale is computed using the zeroth order moment and the Bhattacharyya coefficient [10], and the width, height and orientation are then estimated from the second order moments and the estimated area. Second, the Background-Weighted Histogram (BWH) [19] is incorporated into the target representation in order to derive a simple representation of the background and to reduce the interference of the main background features in target localization. Finally, these two concepts are combined into a robust tracking algorithm named Scale and Orientation-based Background Weighted Histogram (SOBWH).

The complete proposed SOBWH human tracking algorithm is summarized as follows.

The stopping threshold ε used in step 7 is obtained by constraining the vectors \(\hat{y}_{0}\) and \(\hat{y}_{1}\) to lie within the same pixel in the original image coordinates; the default value is ε = 0.1 [4]. Due to real-time constraints, we also limit the number of Mean Shift iterations to N max = 20 [4]. At each iteration, the Mean Shift vector is computed so that the similarity measure between the histograms of the target model and the target candidate increases. This process is repeated until convergence; in practice, the average number of iterations is much smaller, about 4.

The implementation of the proposed SOBWH human tracking algorithm is simpler than presented above. The role of step 6 is only to avoid potential numerical problems in the Mean Shift maximization. In practice, we only iterate by computing the weights in step 3, computing the new position in step 4, and testing the size of the kernel shift in step 7. The Bhattacharyya coefficient is computed only after the algorithm terminates, to evaluate the similarity between the target model and the chosen candidate. An obvious advantage of the Mean Shift tracking framework compared to standard template matching is that it avoids an exhaustive search: the object state is estimated in a limited number of iterations. The complexity of the classic Mean Shift algorithm is low because it requires only very simple operations; its time complexity is O(Tn²), where T is the number of iterations and n is the number of data points.
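
Putting the pieces together, a simplified version of the iteration loop described above might look as follows; it relies on the hypothetical `mean_shift_step` and `weighted_color_histogram` helpers sketched earlier and keeps only the stopping rules stated here (ε = 0.1 and N max = 20), so it is a sketch of the control flow rather than the full SOBWH algorithm (the scale, orientation and background updates are omitted).

```python
import numpy as np

def track_one_frame(frame, y_prev, half_size, q_model, eps=0.1, n_max=20):
    """Mean Shift iterations for a single frame with the stopping rules of Sect. 5.3."""
    y = np.asarray(y_prev, dtype=float)
    for _ in range(n_max):                       # at most N_max = 20 iterations
        y_new, w = mean_shift_step(frame, y, half_size, q_model)
        shift = np.linalg.norm(y_new - y)
        y = y_new
        if shift < eps:                          # step 7: centers within the same pixel
            break
    # Bhattacharyya coefficient of the final candidate, evaluated once after convergence.
    r0, c0 = int(round(y[0])), int(round(y[1]))
    hy, hx = half_size
    p_final = weighted_color_histogram(frame[r0 - hy:r0 + hy + 1, c0 - hx:c0 + hx + 1])
    rho = np.sum(np.sqrt(q_model * p_final))
    return y, rho, w
```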

6 Experimental results

In this section, we perform tests confirming the robustness of the proposed SOBWH method for human tracking based on the Mean Shift algorithm [4], which aims to solve the main tracking challenges related to scale and orientation changes of the target in the presence of distractor elements. The experiments are run on a PC with an Intel® Core™ 2 Duo T5800 CPU at 2.00 GHz and 3 GB RAM. Tracking is initialized by manually marking a bounding box in the first frame. Once the object to be tracked is localized, the proposed SOBWH approach iteratively finds the best matching object region in every subsequent frame. For quantitative performance evaluation of all the methods presented, we compute the localization errors with the Object Centroid Position Error (OCPE) formula, which measures the distance between the center of the tracking result and that of the ground truth, as introduced in [25].

$${\text{OCPE}} = \frac{1}{{N_{rg} }}\mathop \sum \limits_{i:\,\exists \,g(t_{i} ) \wedge r(t_{i} )} \sqrt {(xg_{i} - xr_{i} )^{2} + (yg_{i} - yr_{i} )^{2} }$$
(35)

Where \(N_{rg}\) is the total number of frames in each sequence, and \((xg_{i}, yg_{i})\) and \((xr_{i}, yr_{i})\) are the positions of the ground truth and of the result obtained by each considered method at frame i, respectively. Ideally, an optimal approach is expected to have a small error. The tracking results obtained by the proposed SOBWH approach are compared with existing tracking algorithms from the literature, namely the Adaptive Scale Mean Shift algorithm [4], the EM-Shift algorithm [32] and the SOAMST algorithm [20]. All these algorithms rely on an appearance-model representation for object tracking, and they aim to address scale and orientation changes of the target under the Mean Shift framework. For all methods, we select the RGB color space as the feature space and quantize it into m = 16 × 16 × 16 bins. Note that the shape of our tracker is approximated by an ellipse and its appearance by a normalized histogram built from a fusion of background information and moment features of the weight image. In order to better compare how the different algorithms perform, we draw the tracking results separately in different images. We use four benchmark sequences to validate the performance of the proposed SOBWH approach: Human9 (305 frames), Diving (231 frames), David (471 frames) and Walking Woman (597 frames). All datasets and ground truth are available in [8]. The mean position error of the target localization computed by (35), the average number of iterations and the average processing time for each approach are presented in Tables 1, 2 and 3, respectively.
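
For reference, the OCPE of Eq. (35) reduces to the mean Euclidean distance between corresponding centers; a minimal sketch, assuming ground truth and results are available for every frame, is:

```python
import numpy as np

def ocpe(ground_truth_centers, result_centers):
    """Object Centroid Position Error (Eq. 35) between two (N, 2) arrays of centers."""
    gt = np.asarray(ground_truth_centers, dtype=float)
    rs = np.asarray(result_centers, dtype=float)
    return float(np.mean(np.linalg.norm(gt - rs, axis=1)))
```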

Table 1 Average errors for target localization in four sequences and for four different methods considered
Table 2 Average number of iterations in four sequences and for four different methods considered
Table 3 Average processing time in four sequences and for four different methods considered

In the first experiment, we used the "Human9" sequence, 305 frames of 320 × 240 pixels, showing a person walking along a road. This sequence presents many critical challenges such as illumination change, gradual shrinking of the scale, deformation, motion blur and fast motion of the target. Under these conditions, the final qualitative tracking results for all the algorithms considered are shown in Fig. 1. Due to the fast motion and the drastic changes in scale and background, the Adaptive Scale, EM-Shift and SOAMST algorithms all lose the target very quickly after frame 72 (Fig. 1a–c), whereas the proposed SOBWH approach tracks the human successfully over the whole sequence (Fig. 1d). For quantitative evaluation, Tables 1, 2 and 3 summarize the results: the proposed SOBWH approach achieves higher target localization accuracy than the Adaptive Scale, EM-Shift and SOAMST algorithms; it also needs fewer iterations, which means that the proposed method converges faster and requires less computation time. The average error for target localization per frame is presented in Fig. 2. These experimental results show that the proposed SOBWH approach reduces background interference in target localization and helps discriminate between the target and its background, which improves the tracking accuracy in the presence of distractor elements and scale changes compared to the Adaptive Scale, EM-Shift and SOAMST algorithms.

Fig. 1

Tracking results from "Human9" sequence for frames 6, 72, 108, 154 and 287 are displayed. a Tracking results using Adaptive Scale Mean Shift algorithm, b tracking results using EM-Shift algorithm, c tracking results using SOAMST algorithm, d tracking results using our proposed method SOBWH

Fig. 2

Average error for target localization per image by “Human9” sequence

The second experiment uses the "Diving" sequence, 231 frames of 400 × 224 pixels. Its difficulties are the fast motion of the target and of the camera, different viewpoints of the target, scale variation and in-plane rotation. Qualitative tracking results are shown in Fig. 3. Because of the important deformation and rotation of the target, both the Adaptive Scale and EM-Shift algorithms lose the target from frame 168 (Fig. 3a, b). As shown in Fig. 3c, d, the SOAMST algorithm and our SOBWH approach manage to track the target over the whole sequence; however, the proposed SOBWH method estimates the target scale change more accurately, even when the target moves at different speeds during the sequence. According to the statistics presented in Tables 1, 2 and 3, the proposed SOBWH approach achieves higher target localization accuracy than the Adaptive Scale, EM-Shift and SOAMST algorithms. The average error for target localization per frame is presented in Fig. 4. These experimental results confirm the robustness of the proposed method, whose appearance model is invariant to rotation and scale variation and able to discriminate between the target and its background.

Fig. 3

Tracking results from "Diving" sequence for frames 2, 26, 120, 168 and 214 are displayed. a Tracking results using Adaptive Scale Mean Shift algorithm, b tracking results using EM-Shift algorithm, c tracking results using SOAMST algorithm, d tracking results using our proposed method SOBWH

Fig. 4

Average error for target localization per image by "Diving" sequence

The third experiment uses the "David" sequence, 471 frames of 320 × 240 pixels. The sequence shows a person entering and moving around a room, and the target to be tracked is his face. This sequence involves many critical challenges such as illumination and scale changes, occlusion, in-plane and out-of-plane rotation, motion blur and deformation. Figure 5 presents the final qualitative tracking results of all methods. Because of the fast motion and drastic background change, neither the Adaptive Scale algorithm nor the EM-Shift algorithm achieves good tracking results (Fig. 5a, b). Figure 5d shows that the proposed SOBWH method successfully tracks the face over the whole sequence and estimates the scale change more accurately than the SOAMST algorithm. From the statistics presented in Tables 1, 2 and 3, the proposed approach achieves higher target localization accuracy than the other referenced algorithms. The average error for target localization per frame is presented in Fig. 6. Based on these experimental results, the proposed SOBWH approach is less sensitive to major illumination and background changes than the Adaptive Scale, EM-Shift and SOAMST algorithms.

Fig. 5

Tracking results from "David" sequence for frames 2, 9, 296, 439 and 471 are displayed. a Tracking results using Adaptive Scale Mean Shift algorithm, b tracking results using EM-Shift algorithm, c tracking results using SOAMST algorithm, d tracking results using our proposed method SOBWH

Fig. 6

Average error for target localization per image by "David" sequence

In the fourth experiment, we used the long "Walking Woman" sequence, 597 frames of 352 × 288 pixels. It is complex and challenging since the woman moves quickly and undergoes obvious illumination variation, occlusion, scale and orientation changes, deformation, motion blur and rotation outside the image plane. The final qualitative tracking results are shown in Fig. 7. Because the woman is partly occluded by the first car, whose color is similar to that of her shirt, both the Adaptive Scale and EM-Shift algorithms drift away from the target at frame 135 (Fig. 7a, b) and never recover it. When the woman moves quickly, as in frames 369 and 587, the target scale and orientation estimated by SOAMST are not as accurate as those given by our proposed method. The proposed SOBWH algorithm is able to track the woman correctly over the whole sequence (Fig. 7d). Tables 1, 2 and 3 show that the proposed approach achieves higher target localization accuracy than the Adaptive Scale, EM-Shift and SOAMST algorithms. The average error for target localization per frame is presented in Fig. 8. Based on these experimental results, the proposed SOBWH approach takes the background features in the target area into account, while having an appearance model that is invariant to partial occlusion and scale change and able to discriminate between the target and its background.

Fig. 7

Tracking results from "Walking Woman" sequence for frames 8, 135, 194, 369 and 587 are displayed. a Tracking results using Adaptive Scale Mean Shift algorithm, b tracking results using EM-Shift algorithm, c tracking results using SOAMST algorithm, d tracking results using our proposed method SOBWH

Fig. 8

Average error for target localization per image by "Walking Woman" sequence

In summary, by computing the average error of each method (Table 1), we find that the proposed SOBWH approach yields the smallest localization error, followed by the SOAMST, Adaptive Scale and EM-Shift algorithms, respectively. It also requires fewer iterations (Table 2), which means that the proposed method converges faster and requires less computation time (Table 3). Overall, our qualitative and quantitative experiments show that combining background information [19] with the moment features of the weight image [2] makes it possible to track effectively the position, scale and orientation changes of the target despite the disruptive factors of the scene, while introducing new information into object tracking, namely the scale and orientation of the target.

7 Conclusion

In this paper, we proposed a human tracking algorithm based on the Mean Shift technique. The novelty of this work is to combine the moment features of the weight image with the background weighted histogram into a robust tracking algorithm, Scale and Orientation-based Background Weighted Histogram (SOBWH), which effectively estimates the position, scale and orientation changes of the target in the presence of distractor elements and exploits the main background features so that the object can be effectively discriminated from the background of the image. The qualitative results and quantitative analysis validate the effectiveness and robustness of the proposed SOBWH approach for human tracking: it presents considerable advantages over the other algorithms in terms of accuracy and computation time. Finally, the comparative evaluations confirm the pertinence of the ideas developed in this paper by demonstrating that the proposed method outperforms several recent algorithms in the literature, namely SOAMST, Adaptive Scale and EM-Shift, in various difficult tracking scenarios such as scale and orientation changes, partial occlusion, camera motion and zoom, complex background change, deformation, rotation and illumination variation. These advantages allow the tracker to operate freely in unconstrained environments.