1 Introduction

Tracking a moving object in a sequence of visual scenes is essential in a variety of applications such as image-based traffic surveillance systems, car safety alarms, and unmanned ground vehicles (UGV) [6, 8, 23, 37, 38, 40]. Video images contain measurement noise from a variety of sources, such as image sensor noise, vehicle movement, and other obstacles in the scene. Therefore, statistical approaches are usually adopted to solve object tracking problems. One popular method is the Kalman filter [5, 14, 20, 28, 30, 41]. It assumes that the object location evolves according to linear equations with additive, independent Gaussian noise. A state-space model is exploited to predict the object location in the current observation from the past ones [28], and the prediction error is used to adaptively update the model. The Kalman filter has been applied to stereo camera-based object tracking [5]. However, its critical drawback is that it fails to predict the observation when the change of movement is nonlinear or has a non-Gaussian distribution.

Particle filters based on sequential Monte Carlo (SMC) methods were shown to be quite efficient in tracking visual objects [3, 25, 26, 29, 32, 33, 42], even when the dynamics of the trajectories are nonlinear and non-Gaussian. In particle filtering, the distribution of the position change vector is modeled by an ensemble of particles, and the object location is predicted by maximum a posteriori (MAP) estimation. The particles are usually updated by importance sampling, where the importance is measured by the posterior probability given the observations [17, 18, 24]. Unlike Kalman filtering, the distribution is a non-parametric model stored in a set of particles, so any non-Gaussian dynamics can be properly approximated.

Another popular method is the mean-shift algorithm [7, 9, 11, 27, 36]. The vicinity of the mean target location from the previous state is explored to predict the most likely object location, and the observation is used to update the mean target location. A color histogram is usually employed to measure the distance between the prediction and the observation; a joint spatial-color histogram is used in mean-shift tracking [10, 12, 38]. The advantage of mean-shift is that the search space is reduced, yielding relatively fast tracking.

These statistical approaches are flexible and applicable to many realistic situations, but they sometimes fail to track the target object, especially when the target is occluded or cluttered. Situations where the target is completely occluded by other objects are handled by dynamic models such as a linear velocity model [5] or a nonlinear dynamic model [18]. In cases of partial occlusion, the occluded regions or parts of contours are actively detected, and only the reliable regions or contours are used in tracking. The shape priors of the target contours are either given ahead of time using principal component analysis (PCA) [13] or constructed online [39]. Other methods include hierarchical decomposition [1], using a priori object shape information [16, 31, 34, 35], and learning the similarity patterns of occluded objects [21]. Although these methods have been shown to be effective, they usually require complicated parametric models and training data.

This paper proposes a novel method for identifying the occluded parts of the target object in an instantaneous scene for particle filtering-based visual object tracking. Particles, if modeled well, correspond to candidates for the next movement of the target object; in the proposed method they are defined by rectangular windows [25]. The proposed method is composed of two stages: occlusion detection in particles and occlusion pattern reasoning. In the first stage, the rectangular window obtained from each sample of the particle ensemble is divided horizontally into several non-overlapping, equal-sized sub-windows. The histogram distance of each sub-window to the corresponding part of the target window is computed and used to determine whether the sub-window is totally occluded. In the second stage, the per-sample occlusion detection results are combined to derive the most likely occlusion pattern. For each pixel of the current image in the sequence, the probability that the pixel belongs to the occluded region is computed by accumulating the per-sample occlusion detection results. The computed pixel occlusion probabilities are then combined to identify which part of the target object is likely to be occluded by other objects. The occluded parts are excluded when computing the matching probability of each sample. The proposed method is well integrated into the particle filtering framework, so the tracking performance is not degraded even when there is no occlusion.

The paper is organized as follows. Section 2 describes the proposed method, Section 3 shows the experimental results in real car tracking examples, and Section 4 summarizes our findings and future research issues.

2 Method

2.1 Particle filter formulation

The particle filter is generally described by a standard state-space model that has a set of unknown, hidden states linked to an observation process, under the first-order Markov assumption that the hidden state at time t depends on the state at t − 1 only. Given the observation sequence from the initial time 0 to t, denoted by \(\mathbf {z}_{0:t} \doteq [ \mathbf {z}_{0} \ldots \mathbf {z}_{t} ] \), the target state \(\mathbf{s}_{t}\) is a random vector whose behavior can only be described by the posterior probability given the observations, \(p(\mathbf{s}_{t}|\mathbf{z}_{0:t})\), which is obtained by the following recursive probabilistic generation [25]:

$$\begin{array}{rll} \lefteqn{p(\mathbf{s}_{t}|\mathbf{z}_{0:t}) \propto } \\&& p(\mathbf{z}_{t}|\mathbf{s}_{t}) \int p(\mathbf{s}_{t}|\mathbf{s}_{t-1}) p(\mathbf{s}_{t-1}|\mathbf{z}_{0:t-1}) d \mathbf{s}_{t-1} \, . \end{array} $$
(1)

The conditional state density \(p(\mathbf{s}_{t}|\mathbf{z}_{0:t})\) is then approximated by a set of M samples, \(\left \{{\mathbf {s}_{t}^{m}} | m = 1 \ldots M \right \}\). The samples are called particles, and the recursive derivation process in (1) within a sequential Monte Carlo framework is called particle filtering. To ignore samples with very low probabilities, an additional sampling step based on some importance measure is generally employed [38].
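For concreteness, the following is a minimal sketch of one resample-predict-update cycle implementing (1). The `transition` and `likelihood` callables stand in for the motion and observation models, and the weighted-mean point estimate is an illustrative choice of this sketch rather than a prescription of the original formulation.

```python
import numpy as np

def particle_filter_step(particles, weights, z_t, transition, likelihood, rng=None):
    """One SMC step approximating (1).

    particles : (M, d) array of samples s_{t-1}^m
    weights   : (M,) normalized importance weights
    transition: transition(s, rng) draws from p(s_t | s_{t-1})
    likelihood: likelihood(z_t, s) evaluates p(z_t | s_t)
    """
    if rng is None:
        rng = np.random.default_rng()
    M = len(particles)
    # Importance resampling: samples with very low probability are discarded.
    idx = rng.choice(M, size=M, p=weights)
    particles = particles[idx]
    # Prediction: propagate each particle through the dynamics p(s_t | s_{t-1}).
    particles = np.array([transition(s, rng) for s in particles])
    # Update: re-weight by the observation likelihood p(z_t | s_t).
    w = np.array([likelihood(z_t, s) for s in particles])
    weights = w / w.sum()
    # Point estimate of the target state (weighted mean of the particles).
    estimate = weights @ particles
    return particles, weights, estimate
```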

2.2 Sample window segmentation for occlusion detection

In visual object tracking, a tracker is usually modeled by a rectangular region in the observed image, called a window. Assuming that the initial target position is known, the spatial information of the window is usually defined by a vector of the center coordinates and the scale relative to the initial window [25, 38]; these constitute a sample in the particle filtering. The similarity between the target and each sample of the particle filter can be measured by the resemblance of their color histograms. If an interfering object blocks the target object partly or entirely, it disrupts the histogram of the sample and lowers the similarity significantly.

To identify which part of the target is occluded, we horizontally divide each sample window into several sub-windows with equal width. As shown in Fig. 1, the candidate tracking window is horizontally divided into N non-overlapping sub-windows,

$$ \mathcal{W}({\mathbf{s}_{t}^{m}}) = \bigcup\limits^{N}_{i=1}\mathcal{W}_{i}({\mathbf{s}_{t}^{m}}). $$
(2)
Fig. 1 Division of a sample window into five horizontal sub-windows

The color information of each sub-window is extracted into 110 histogram bins using the hue-saturation-value (HSV) color space [19, 25], and then only 80 % of the 110 bins are chosen based on the probabilistic palette model [15]. Let \(q_{i}(k;\mathbf {s}^{m}_{t})\) and \(q_{i}^{*}(k)\) be functions returning the relative frequency at histogram bin k measured from \(\mathcal {W}_{i}(\mathbf {s}^{m}_{t})\) and from the same sub-window in the initial target image, respectively. The distance from the initial target is then calculated by the Bhattacharyya distance [11]:

$$ D_{i}({\mathbf{s}_{t}^{m}})= \left[1-\sum\limits_{k=1}^{K}\sqrt{q_{i}^{*}(k)q_{i}(k;{\mathbf{s}_{t}^{m}})}\right]^{1/2} , $$
(3)

where K is the total number of histogram bins. Because motor vehicles move on the ground, an interfering object intrudes from the side and passes by the target, so horizontal division is likely to capture the various patterns of partial occlusion.
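The computation in (2)–(3) can be sketched as follows. For brevity, the 110-bin HSV histogram of [19, 25] and the palette-based bin selection of [15] are replaced here with a plain normalized histogram, so `hsv_histogram` is an illustrative stand-in rather than the exact feature used in the paper.

```python
import numpy as np

def split_horizontally(patch, n=5):
    """Split an image patch (H, W[, C]) into n equal-width sub-windows, as in (2)."""
    w = patch.shape[1] // n
    return [patch[:, i * w:(i + 1) * w] for i in range(n)]

def hsv_histogram(patch, bins=110):
    """Simplified stand-in for the 110-bin HSV histogram: a normalized
    histogram of pixel values in [0, 1]."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

def subwindow_distances(patch, ref_patch, n=5, bins=110):
    """Bhattacharyya distance (3) between corresponding sub-windows of a
    candidate patch and the initial target patch."""
    dists = []
    for p, r in zip(split_horizontally(patch, n), split_horizontally(ref_patch, n)):
        q, q_ref = hsv_histogram(p, bins), hsv_histogram(r, bins)
        dists.append(np.sqrt(max(0.0, 1.0 - np.sum(np.sqrt(q * q_ref)))))
    return np.array(dists)
```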

Between every pair of adjacent sub-windows, the forward difference of the sub-window distances is computed:

$$ \Delta D_{i}=D_{i+1}-D_{i} \, , \ i = 1, \ldots, N-1 , $$
(4)

where the argument \({\mathbf {s}_{t}^{m}}\) in (3) is omitted for compact notation. The value of \(\Delta D_{i}\) represents the magnitude and direction of the local change in the histogram distance from \(\mathcal {W}_{i}\) to \(\mathcal {W}_{i+1}\). A large positive \(\Delta D_{i}\) is observed when there is a big distance leap between adjacent sub-windows, which is the case where \(\mathcal {W}_{i+1}\) is occluded and \(\mathcal {W}_{i}\) is not; a large negative \(\Delta D_{i}\) indicates the opposite situation. Using the forward difference instead of the absolute distance eliminates the effect of an overall distance elevation due to illumination change or other color-influencing factors.

Figure 2a–d illustrates the horizontal window segmentation for occlusion detection. The rectangles around the target vehicles are tracking windows divided into five sub-windows. The first bar graphs below the images represent the histogram distances of the sub-windows, and the second bar graphs display their forward differences. In Fig. 2a and b, the target is the white vehicle, and a motorcycle blocks the target in different regions. In Fig. 2a the sub-window distances \(D_{3} \sim D_{5}\) are much larger than \(D_{1}\) and \(D_{2}\), resulting in the largest \(\Delta D_{2}\). In this case, the indices of the occluded sub-windows are \(\{3,4,5\}\). In Fig. 2b, the occluded region is identified as \(\{1,2\}\) because \(\Delta D_{2}\) is sufficiently large in the negative direction. From these observations, the left occlusion boundary may be located at sub-window \(i+1\) if \(\Delta D_{i}\) is positively large enough, and the right boundary at sub-window i if \(\Delta D_{i}\) is negatively large enough. Figure 2c has both left and right boundaries, at i = 2 and 3, so the occluded sub-window indices are {2, 3}. Figure 2d also has both, at sub-windows 4 and 1; because the left index is larger than the right one, there are two interfering objects intruding from the outside.

Fig. 2 Various occlusion patterns found by the horizontal split. The indices of the occluded sub-windows are: a \(\{3,4,5\}\), b \(\{1,2\}\), c \(\{2,3\}\), d \(\{1,4,5\}\)

In Table 1, we propose an algorithm to classify the various occlusion patterns using the forward differences of the sub-window histogram distances. For each sample, the output of the algorithm is a set of integers in \([1, N]\) giving the indices of the occluded sub-windows. Out of \(\Delta D_{1}, \ldots, \Delta D_{N-1}\), only the maximum and the minimum are considered, to prevent unreliable occlusion boundaries. In lines 7–10, the left and right occlusion boundaries are found from \(i^{+}\) and \(i^{-}\). With an appropriate choice of \(\theta_{\Delta D}\), the occlusion patterns of individual samples are correctly identified. Lines 12 and 14 correspond to the cases of Fig. 2c and d, respectively. However, when \(i^{+} = 1\) and \(i^{-} = N\), the window can be either totally occluded or not occluded at all; the two cases are distinguished by comparing the average histogram distance with a threshold \(\theta_{D}\), as shown in line 17. Since the range of the Bhattacharyya distance is \([0, 1]\), 0.5 is a good starting value for both \(\theta_{\Delta D}\) and \(\theta_{D}\).

Table 1 Occlusion pattern reasoning for individual samples
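Since the pseudocode of Table 1 is not reproduced in this text, the sketch below reconstructs its decision logic from the description above (boundary detection from the extreme forward differences, the two-object case of Fig. 2d, and the total/no-occlusion test); the exact line structure of the authors' table may differ, and the 0-based indexing is a convenience of the sketch.

```python
import numpy as np

def occluded_indices(D, theta_dD=0.5, theta_D=0.5):
    """Occlusion pattern reasoning for one sample (cf. Table 1).

    D : array of sub-window distances D_1..D_N from (3).
    Returns the (0-based) indices of the occluded sub-windows.
    """
    N = len(D)
    dD = np.diff(D)                        # forward differences, (4)
    i_plus, i_minus = int(np.argmax(dD)), int(np.argmin(dD))
    # A large positive jump puts the left occlusion boundary at i_plus + 1;
    # a large negative jump puts the right boundary at i_minus.
    left = i_plus + 1 if dD[i_plus] > theta_dD else 0
    right = i_minus if dD[i_minus] < -theta_dD else N - 1
    if left == 0 and right == N - 1:
        # No reliable boundary: total occlusion vs. no occlusion, decided
        # by the average distance (cf. line 17 of Table 1).
        return set(range(N)) if D.mean() > theta_D else set()
    if left <= right:
        # Single interfering object: contiguous occluded block (cf. Fig. 2a-c).
        return set(range(left, right + 1))
    # left > right: two objects intruding from both sides (cf. Fig. 2d).
    return set(range(right + 1)) | set(range(left, N))
```

On the patterns of Fig. 2 this sketch reproduces the reported results; for example, `D = np.array([0.1, 0.1, 0.8, 0.8, 0.8])` yields `{2, 3, 4}`, i.e. sub-windows 3–5 in the paper's 1-based numbering, matching Fig. 2a.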

2.3 Finding a global occlusion pattern

Figure 3 illustrates the overall method for finding the global occlusion pattern. In Fig. 3A, the occluded sub-window indices are found by the algorithm in Table 1. Let \(T(x,y,\mathbf {s}^{m}_{t})\) be a function whose value is 1 when a pixel (x, y) belongs to the window defined by a sample \(\mathbf {s}^{m}_{t}\) in (2), such that

$$ T(x,y,{\mathbf{s}_{t}^{m}}) = \left\{ \begin{aligned} 1 & \textrm{if } (x,y) \in \mathcal{W}({\mathbf{s}_{t}^{m}}), \\ 0 & \qquad\text{otherwise} \end{aligned} \right. . $$
(5)
Fig. 3 Process of detecting the occlusion region using the occlusion map

The score that a pixel (x, y) belongs to the target region at time t is obtained by averaging \(T(x,y,{\mathbf {s}_{t}^{m}})\) over all the samples,

$$\begin{array}{@{}rcl@{}} T(x,y,t) = \frac{1}{M} \sum\limits_{m=1}^{M} T(x,y,{\mathbf{s}_{t}^{m}}) \, . \end{array} $$
(6)

Similarly, let \(O(x,y,{\mathbf {s}_{t}^{m}})\) be an indicator function that a pixel (x, y) belongs to any occluded sub-window defined by sample \(\mathbf {s}^{m}_{t}\):

$$ O(x,y,{\mathbf{s}_{t}^{m}}) = \left\{ \begin{aligned} 1 & \textrm{if } (x,y) \in \mathcal{W}_{i}({\mathbf{s}_{t}^{m}}) \, , \ \exists i \in I_{O}({\mathbf{s}_{t}^{m}}) \\ 0 & \qquad\quad\text{otherwise} \end{aligned} \right. , $$
(7)

where the set of indices of the occluded sub-windows, \(I_{O}({\mathbf {s}_{t}^{m}})\), is produced by the algorithm in Table 1.

The score that a pixel (x, y) is occluded is obtained by averaging \(O(x,y,{\mathbf {s}_{t}^{m}})\) over all the samples, such that

$$\begin{array}{@{}rcl@{}} O(x,y,t) = \frac{1}{M} \sum\limits_{m=1}^{M} O(x,y,{\mathbf{s}_{t}^{m}}) \, . \end{array} $$
(8)

Using the score functions, the target object region and the occluded parts can be found by simple thresholding. Let \(\mathcal {R}_{T}(t)\) be the target object region and \(\mathcal {R}_{O}(t)\) be the occlusion region, obtained by

$$\begin{array}{@{}rcl@{}} \mathcal{R}_{T}(t) = \left\{(x,y)\vert T(x,y,t) > \theta_{T} \right\} , \end{array} $$
(9)
$$\begin{array}{@{}rcl@{}} \mathcal{R}_{O}(t) = \left\{(x,y)\vert O(x,y,t) > \theta_{O} \right\} , \end{array} $$
(10)

where \(\theta_{T}, \theta_{O} \in [0, 1]\) are fixed threshold values. Note that \(O(x,y,{\mathbf {s}_{t}^{m}})\) is always less than or equal to \(T(x,y,\mathbf {s}^{m}_{t})\), so \(O(x,y,t) \leq T(x,y,t)\) for any pixel (x, y). Therefore, we enforce \(\theta_{O} > \theta_{T}\) to make \(\mathcal {R}_{O}(t) \subset \mathcal {R}_{T}(t)\).
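A compact sketch of (5)–(10) follows: the per-sample window and occlusion indicators are accumulated into the pixel score maps and then thresholded. The rectangle format `(x, y, w, h)` and the default threshold values are assumptions of this sketch; only the constraint \(\theta_{O} > \theta_{T}\) comes from the text.

```python
import numpy as np

def occlusion_maps(samples, occluded_sets, shape, n=5, theta_T=0.3, theta_O=0.5):
    """Accumulate T(x,y,t) and O(x,y,t) of (6) and (8) and threshold them
    into R_T(t) and R_O(t) as in (9)-(10).

    samples       : list of window rectangles (x, y, w, h), one per sample
    occluded_sets : per-sample 0-based occluded sub-window indices (Table 1)
    shape         : (H, W) of the image
    """
    T = np.zeros(shape)
    O = np.zeros(shape)
    M = len(samples)
    for (x, y, w, h), occ in zip(samples, occluded_sets):
        T[y:y + h, x:x + w] += 1.0 / M       # indicator (5) averaged as in (6)
        sw = w // n
        for i in occ:                        # occluded sub-windows only, (7)-(8)
            O[y:y + h, x + i * sw:x + (i + 1) * sw] += 1.0 / M
    return T > theta_T, O > theta_O          # R_T(t) and R_O(t), (9)-(10)
```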

The procedure in (2)–(10) is illustrated in Fig. 3A and B. The region surrounded by the outer contour is the target region, \(\mathcal {R}_{T}(t)\), and the lightly colored region inside the contour is the global occlusion region, \(\mathcal {R}_{O}(t)\). For practical reasons, the free-form occlusion pattern \(\mathcal {R}_{O}(t)\) is then converted to a sub-window-form occlusion pattern. In Fig. 3C, a window \(\mathcal {W}^{*}(t)\) is found as the smallest rectangle surrounding all of \(\mathcal {R}_{T}(t)\), and it is divided horizontally in the same way as (2), \(\mathcal {W}^{*}(t) = \bigcup _{i} \mathcal {W}^{*}_{i}(t)\).

Each sub-window \(\mathcal {W}^{*}_{i}(t)\) is individually determined to be occluded or not. The ratio of the number of pixels in the occlusion area to that in the target area is computed as

$$\begin{array}{@{}rcl@{}} \gamma(t,i) = \frac{ A(\mathcal{R}_{O}(t)\cap\mathcal{W}^{*}_{i}(t))}{A(\mathcal{R}_{T}(t)\cap\mathcal{W}^{*}_{i}(t))} \, , \end{array} $$
(11)

where A( · ) is the area function of image regions. Then, let \(I_{t}^{*}\) be the index set of sub-windows of the target region that are likely to be unoccluded, which is given by

$$\begin{array}{@{}rcl@{}} I^{*}_{t} = \{ i \vert \gamma(t,i) \leq {0.5} \} . \end{array} $$
(12)
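The selection in (11)–(12) amounts to an area ratio per sub-window of the bounding window \(\mathcal{W}^{*}(t)\). A sketch under the same boolean-mask representation as above (0-based indices; the handling of empty sub-windows is an assumption of the sketch):

```python
import numpy as np

def unoccluded_indices(R_T, R_O, n=5):
    """Return I*_t of (12): sub-windows of W*(t) whose occlusion ratio
    gamma(t, i) of (11) does not exceed 0.5."""
    ys, xs = np.nonzero(R_T)
    x0, x1 = xs.min(), xs.max() + 1          # smallest rectangle around R_T(t)
    y0, y1 = ys.min(), ys.max() + 1
    w = (x1 - x0) // n
    keep = set()
    for i in range(n):
        region = (slice(y0, y1), slice(x0 + i * w, x0 + (i + 1) * w))
        target_area = R_T[region].sum()      # A(R_T ∩ W*_i)
        occluded_area = R_O[region].sum()    # A(R_O ∩ W*_i)
        if target_area == 0 or occluded_area / target_area <= 0.5:
            keep.add(i)                      # sub-window judged unoccluded
    return keep
```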

The global occlusion pattern is the combined result of the occluded sub-windows. This process is shown at the bottom of Fig. 3C.

The distance of sample \({\mathbf {s}_{t}^{m}}\) from the initial target is updated using \(I^{*}_{t}\):

$$\begin{array}{@{}rcl@{}} D(I^{*}_{t},{\mathbf{s}_{t}^{m}}) = \frac{\sum_{i \in I^{*}_{t}}D_{i} ({\mathbf{s}_{t}^{m}})}{N}. \end{array} $$
(13)

Finally, the likelihood of a single sample in (1) is obtained by

$$\begin{array}{@{}rcl@{}} p(\mathbf{z}_{t}|{\mathbf{s}_{t}^{m}}) \propto \exp \left[ -\lambda \{ D^{2}(I^{*}_{t},{\mathbf{s}_{t}^{m}}) \} \right], \end{array} $$
(14)

where λ is a positive constant that is determined empirically.
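Equations (13)–(14) reduce to a masked average and an exponential weighting; a short sketch, where the value of `lam` is purely illustrative since λ is determined empirically:

```python
import numpy as np

def sample_likelihood(D, keep, n=5, lam=20.0):
    """Masked distance (13) and unnormalized likelihood (14) for one sample.

    D    : sub-window distances D_i of (3)
    keep : unoccluded index set I*_t from (12)
    """
    d = sum(D[i] for i in keep) / n          # occluded sub-windows excluded, (13)
    return np.exp(-lam * d ** 2)             # proportional to p(z_t | s_t^m), (14)
```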

3 Experimental results

The performance of the proposed method is compared with conventional particle filtering [26], the L1 tracker [4], and the mean-shift tracking algorithm [7, 11, 27, 36]. We test three video sequences capturing real ground vehicles with occlusions. The videos include various types of occlusions that commonly occur with motor vehicles and pedestrians.

3.1 Evaluation method

To compare the performances quantitatively, we used the normalized intersection ratio [22, 29] and the tracking error measure [29]. The normalized intersection ratio is computed as the ratio of the overlapping area between the hand-labeled ground truth and the tracking window from the tracker state, given by

$$ accuracy = \frac{|\mathcal{G} \cap \mathcal{T}|}{|\mathcal{G} \cup \mathcal{T}|} \, , $$
(15)

where \(\mathcal {G}\) and \(\mathcal {T}\) are sets of pixels within ground truth and tracker output regions, respectively, and the cardinality operator | · | returns the number of pixels in a set [22].

We added another performance measure for the general tracking error, computed as the Euclidean distance between the center points of the ground truth and the tracking window from the tracker state [29].
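Both measures are straightforward to compute from boolean pixel masks; a minimal sketch:

```python
import numpy as np

def intersection_ratio(G, T):
    """Normalized intersection ratio (15) between the ground-truth mask G
    and the tracker-output mask T."""
    union = np.logical_or(G, T).sum()
    return np.logical_and(G, T).sum() / union if union else 0.0

def centroid_error(G, T):
    """Tracking error of [29]: Euclidean distance between mask centroids."""
    cg = np.argwhere(G).mean(axis=0)
    ct = np.argwhere(T).mean(axis=0)
    return float(np.linalg.norm(cg - ct))
```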

3.2 Occlusion between cars and motorcycles

Figure 4 shows the tracking results of the proposed method and the conventional methods. The scene is from a downtown area, and the target is the white vehicle in the center, which is occluded by two motorcycles. The motorcycles move very fast and partly block the target vehicle in the video sequence while passing through the gap between the target and the observer. The first column, Fig. 4a, shows the tracker windows of the proposed method as yellow boxes. The other columns, Fig. 4b–d, show the results of the L1 tracker [4], basic particle filtering [26], and mean-shift tracking [11], respectively. In the first row (frame 10), there is no occlusion, and all four methods successfully track the target. In the second row (frame 40), a motorcycle driver blocks the target on the right. The proposed method and the L1 tracker keep track of the target including the occluded area. However, the particle filter window is shifted to the left due to the occlusion, and the mean-shift window is enlarged because of the uncertainty added by the occlusion. In the third row (frame 80), although there is no occlusion, all three conventional methods exhibit incorrect, enlarged tracking windows around the target because of the tracking error accumulated from the past frames. At frame 100, another motorcycle heavily occludes the target, and this time even the proposed method fails to track the target correctly. Unlike the other three methods, the conventional particle filter loses the target completely due to the occlusion. At frame 130, there is slight occlusion by another car at the bottom-right corner; the proposed method has recovered the correct position, and the L1 and mean-shift trackers follow the target with enlarged tracking windows, but the particle filter cannot recover from the error and has lost the target entirely. When occlusion occurs, the likelihood of the target decreases, so the uncertainty grows; the L1 and mean-shift trackers generally extend the tracking window to absorb this uncertainty. The proposed method instead actively finds the uncertain parts caused by the occlusion and excludes them from the likelihood calculation, so even the occluded part of the target can be tracked.

Fig. 4 Comparison of the tracking performances for the motorcycle occlusion sequence. a ours, b L1 tracker, c conventional particle filtering, and d mean-shift

A quantitative evaluation was carried out using the intersection ratio and the tracking error by center positions introduced in Section 3.1. The evaluation results are shown in Fig. 5. The top graph shows the intersection ratios, ranging from 0 (no intersection) to 1 (perfect match); the larger the number, the better the tracking result. A few frames before frame 40, occlusion occurs as shown in the second row of Fig. 4. Tracking errors accumulate until frame 50; the proposed method recovers the intersection ratio to above 0.8 around frame 60, while the others cannot recover and have intersection ratios below 0.4. Another heavy occlusion occurs around frame 100, where the conventional particle filter completely loses the target and its intersection ratio drops to 0. The L1 and mean-shift trackers keep enlarged tracking windows, and their ratio values stay almost the same. The proposed method is affected by the occlusion at frames 95–100, its ratio value falling below 0.6, but quickly recovers to 0.8. The same phenomena are observed in the centroid distance error measure. The quantitative comparison demonstrates that the proposed method remarkably improves the tracking performance when occlusion occurs.

Fig. 5 Comparison of the tracking performances for the motorcycle occlusion sequence. a intersection ratio, b tracking error by centroid distance between the ground truth and the tracker window

3.3 Occlusion by pedestrians

Two more vehicle tracking scenarios containing occlusions were also tried. The images in Fig. 6 are from the “TUD-Crossing” dataset [2], in which many pedestrians pass in front of the target vehicle. Figure 7a and b give the quantitative comparison by intersection ratios and centroid distances. In Fig. 6a, the proposed method successfully tracks the target vehicle even when pedestrians block it. In particular, at frames 90–120, three pedestrians pass in front of the target, and the tracking performance is not affected. In Fig. 6c, particle filtering without occlusion detection, the pedestrians strongly disturb the tracker, and the target is completely lost at frame 120. Figure 6b and d show the results of the L1 and mean-shift trackers; although they keep track of the target even under occlusion, their performance is greatly degraded by the pedestrian occlusion.

Fig. 6 Comparison of the tracking performances for the “TUD-Crossing” sequence. a ours, b L1 tracker, c conventional particle filtering, and d mean-shift

Fig. 7 Comparison of the tracking performances for the “TUD-Crossing” sequence. a intersection ratio, b tracking error by centroid distance between the ground truth and the tracker window

Figure 8 shows another pedestrian-vehicle case. The tracking windows are drawn as yellow rectangles, and the quantitative performance comparison is given in Fig. 9. Similarly to the previous results, the proposed method substantially outperformed the conventional methods.

Fig. 8 Comparison of the tracking performances for a single pedestrian crossing sequence. a ours, b L1 tracker, c conventional particle filtering, and d mean-shift

Fig. 9 Comparison of the tracking performances for a single pedestrian crossing sequence. a intersection ratio, b tracking error by centroid distance between the ground truth and the tracker window

4 Conclusions

This paper proposes a practical algorithm for detecting occlusions in color image sequences of ground vehicles based on matching color histograms of horizontally segmented rectangular windows. The proposed method horizontally divides the tracking window into a number of sub-windows, and each sub-window is determined to be occluded or not in order to find a unified occlusion pattern. The unified occlusion pattern is then used to update the likelihood of the current tracking window with respect to the target window, so that the target can be tracked including the occluded region, whereas other methods cannot include the occluded region in their tracking windows. Performance comparisons on three image sequences of vehicles and pedestrians show the validity of the proposed method. Future work includes applying other types of image descriptors based on the shape of the target to improve the performance of the proposed occlusion detection.