1 Introduction

Generic object tracking [1, 3, 5,6,7, 12,13,14], in which the tracker is not specialized to any specific category of objects, has been a popular research field in recent years. Because the tracker is category-agnostic, it is not possible to train a detector offline for a particular type of object, such as pedestrians or hands. Consequently, occlusion is the most challenging factor for generic object trackers [8], since they usually cannot discriminate the occluders from the targets.

Most work on handling occlusion adds a sub-module before the target model updater to monitor the tracking reliability. In [20], feedback from the tracking results is used to decide whether or not to update the target model. However, this strategy still cannot tell what is actually happening, occlusion or target appearance variation, since both decrease the tracking confidence.

COD (Context-based Occlusion Detection for Tracking) [15,16,17] is a framework that monitors the background-patches around the target and identifies which of them occlude the target. However, it has several drawbacks. First, the number of background-patches that COD monitors is constant, which limits the adaptability of the framework. Second, deciding whether occlusion occurs simply from the number of occluders over-simplifies the problem and is not guaranteed to be reasonable in all situations. To solve these issues, we present Adaptive COD, which adapts to differently sized targets and identifies what proportion of the target is affected by occlusion. The number of background-patches now depends on the perimeter of the target, so more background-patches are allocated to a larger target. After acquiring the positions of the background-patches that occlude the target, we calculate the proportion of the target that is under occlusion. If the proportion is greater than a threshold, the model updater takes no action, which prevents the model from being corrupted. The background-patches that occlude the target continue to be monitored, while the other background-patches are discarded and new ones are generated around the new target position. As a general framework, Adaptive COD can be integrated with any existing tracking algorithm to address the occlusion problem.

To better evaluate the performance of different trackers and promote the development of tracking algorithms, several benchmarks have been built. OTB [21], VOT [10], and ALOV [19] are the most widely used. In OTB [21], each sequence is tagged with 11 attributes, including occlusion, illumination variation and so on, which represent the challenging factors in visual tracking. A sequence is tagged with the attribute ‘occlusion’ if occlusion happens in any of its frames. In VOT [10], the attribute annotation is refined to the per-frame level. Later, in NUS-PRO [11], occlusion is classified into three levels: no occlusion, partial occlusion and full occlusion. Recently, attribute-specific benchmarks have appeared. In [18], a dataset for fast moving objects is collected. A higher frame rate video dataset is proposed in [4]. Although occlusion is one of the attributes in OTB [21] and VOT [10], the frames where occlusion happens make up only a small proportion of each sequence. Moreover, before the tracker reaches these frames, the tracking results have already drifted from the ground truth, so different trackers enter the occluded frames from different starting states, which makes their robustness to occlusion hard to compare. In this paper, we build an attribute-specific benchmark which contains sequences where the target undergoes occlusion. In our proposed dataset, we exclude other attributes and only preserve the frames relevant to occlusion. Each sequence contains three parts: before, during and after occlusion. We evaluate our model updating strategy by integrating it with several recent tracking algorithms, including KCF [7], SAMF [14], DSST [3] and Staple [1]. The experimental results show that Adaptive COD improves the robustness of these tracking algorithms.

In summary, the main contributions of this paper are as follows:

  1. We improve the occlusion detection framework in [17]. The number of background-patch trackers is adaptive to the size of target. A new model updating strategy is proposed.

  2. We establish a new dataset where the sequences contain occlusion for evaluating the robustness of tracking algorithms.

  3. Extensive experiments demonstrate the effectiveness of our occlusion detection framework and occlusion benchmark.

2 Occlusion Detection Framework

In this section, we first briefly review the Context-based Occlusion Detection for Tracking (COD) framework [17]. We then present the proposed Adaptive COD.

2.1 COD Review

Based on the assumption that both the target and the background-patches are involved in occlusion, COD [17] pays attention to the background around the target to actively detect occlusion. As shown in Algorithm 1, two kinds of trackers exist in the framework: the target tracker and the background-patch trackers. The target tracker estimates the bounding box of the target in the current frame, while the background-patch trackers provide the position and tracking reliability of every background-patch surrounding the target. Intuitively, if the bounding boxes of a background-patch and the target overlap, and the background-patch has high tracking reliability (hence it is not occluded by the target), then the target is occluded by the background-patch. Please refer to [17] for more details.
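The intuition above can be written as a small predicate. The following is only a minimal sketch of that rule, not the implementation of [17]: the function names, the `(x, y, w, h)` box convention with an upper-left corner, and the threshold name `r_th` are assumptions made for illustration.

```python
from typing import Tuple

# Assumed box convention: (x, y, w, h) with (x, y) the upper-left corner.
Box = Tuple[float, float, float, float]


def boxes_overlap(a: Box, b: Box) -> bool:
    """Return True if two axis-aligned boxes share a region of positive area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def is_occluder(target: Box, patch: Box, reliability: float, r_th: float) -> bool:
    """Hypothetical helper: a patch is treated as an occluder when it overlaps
    the target and is itself tracked reliably (so it is not hidden by the target)."""
    return boxes_overlap(target, patch) and reliability > r_th
```

A patch that overlaps the target but has low reliability is not flagged, since under this rule it is more likely to be the one hidden behind the target.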

figure a

However, COD has the following disadvantages. First, the number of background-patches \(N_{1}\) is constant for variously sized targets in different sequences. For small targets, \(N_{1}\) is too large, so many background-patches overlap each other, causing double counting and repeated calculation. For large targets, \(N_{1}\) is too small, so the background around the target is not fully monitored. Second, the decision on whether to update the target model online is made by comparing the number of background-patches that occlude the target, N, with a constant threshold \(N_{th}\). For targets of different sizes, N is merely a count and cannot properly measure the degree of occlusion.

2.2 Adaptive COD

We propose Adaptive COD to overcome the limitations of COD mentioned in Sect. 2.1. Adaptive COD inherits its structure from COD but differs in two important aspects: the initialization step and the criterion for identifying occlusion. Both are shown in Algorithm 1.

Fig. 1.
figure 1

Left: the number of background-patches for sequence Girl is 38, while for sequence David3 it is 83. Right: the curve shows the non-occluded proportion of the target for every frame in sequence Tiger2, along with the frames \(\#27\), \(\#107\), \(\#186\), \(\#238\), \(\#256\), \(\#355\), which correspond to local minima of the curve. The blue boxes show where the occlusion happens.

Denote the bounding box of the target in frame t as \((x_t,y_t,w_t,h_t)\) for \(t=1,...,T\), where \((x_t,y_t)\) are the coordinates of the upper-left corner and \((w_t,h_t)\) are the width and height. We then set \(N_1 = [ \ (w_1+h_1)/2 \ ]\), where [x] rounds x to its nearest integer. In this way, the number of background-patches depends on the size of the target. Unless the scale of the target varies heavily, we keep using \(N_1\) in the following frames. The results can be seen in Fig. 1.
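As a quick illustration of the rule \(N_1 = [(w_1+h_1)/2]\), the sketch below computes the patch count from the initial box size. The function name is hypothetical and the box sizes in the usage note are illustrative values, not figures taken from the benchmark annotations. It also assumes that ties are rounded upward, which the text does not specify.

```python
import math


def num_background_patches(w1: float, h1: float) -> int:
    """Number of background-patches N_1 = [(w_1 + h_1) / 2].

    (w_1 + h_1) / 2 is one quarter of the box perimeter, so N_1 grows
    linearly with the perimeter of the initial target box.
    """
    if w1 <= 0 or h1 <= 0:
        raise ValueError("target width and height must be positive")
    # floor(x + 0.5) rounds halves upward (an assumption); Python's built-in
    # round() would instead round halves to the nearest even integer.
    return math.floor((w1 + h1) / 2 + 0.5)
```

For example, an initial box of 31 by 45 pixels gives 38 patches and a box of 35 by 131 pixels gives 83, so a taller target is surrounded by proportionally more patches.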

We propose a new criterion for identifying occlusion. For a target with bounding box \((x_t,y_t,w_t,h_t)\), we build a mask \(M_t\) as follows:

$$ \begin{aligned} M_t(x,y)=\left\{ \begin{aligned}&1, \quad \text{if } x \in [x_t,x_t+w_t] \text{ and } y \in [y_t,y_t+h_t] \\&0, \quad \text{otherwise} \end{aligned} \right. \end{aligned}$$
(1)

That is, \(M_t\) has the same size as the frame, and the region representing the target is set to 1. The area of the target region is \(A_t=\sum {M_t}\). Similarly, for a background-patch with bounding box \((bx_t^i,by_t^i,bw_t^i,bh_t^i)\) for \(i=1,2,...,N_1\), we build a mask \(m_t^i\). Denoting the tracking reliability of background-patch i as \(r_t^i\), which is usually calculated as the Peak-to-Sidelobe Ratio [2], we update \(M_t\) as

$$\begin{aligned} M_t =\left\{ \begin{aligned}&M_t - m_t^i, \quad \text{if } r_t^i > r_{th} \\&M_t, \quad \text{otherwise} \end{aligned} \right. \end{aligned}$$
(2)

where \(r_{th}\) is the reliability threshold. After inspecting every background-patch and updating \(M_t\), the area of the target that is not occluded is \(S_t=\sum {M_t}\). We use the non-occluded proportion \(\gamma _t = S_t\ / \ A_t\) as the measurement of occlusion, as demonstrated in Fig. 1. Compared with using N as the indicator of occlusion in COD, the new area-based criterion remains meaningful for targets of any size.
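The computation of \(\gamma_t\) from Eqs. (1) and (2) can be sketched as follows. This is an illustrative reading rather than the authors' code, and it makes three assumptions that are not stated in the text: boxes use half-open integer pixel ranges so that a box of size w by h covers exactly w·h pixels, the subtraction in Eq. (2) is interpreted as clearing the covered pixels (so overlapping patches are not counted twice and mask values never become negative), and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h), upper-left corner plus size


def box_mask(box: Box, frame_w: int, frame_h: int) -> List[List[int]]:
    """Frame-sized binary mask that is 1 inside the box (half-open ranges)."""
    x0, y0, w, h = box
    return [
        [1 if x0 <= x < x0 + w and y0 <= y < y0 + h else 0 for x in range(frame_w)]
        for y in range(frame_h)
    ]


def non_occluded_ratio(
    target: Box,
    patches: Sequence[Box],
    reliabilities: Sequence[float],
    r_th: float,
    frame_w: int,
    frame_h: int,
) -> float:
    """Return gamma_t = S_t / A_t, the non-occluded proportion of the target."""
    mask = box_mask(target, frame_w, frame_h)
    area = sum(map(sum, mask))  # A_t
    if area == 0:
        raise ValueError("target box lies entirely outside the frame")
    for (px, py, pw, ph), r in zip(patches, reliabilities):
        if r <= r_th:
            continue  # unreliable patch: Eq. (2) leaves M_t unchanged
        # Clear every target pixel covered by this reliable patch.
        for y in range(max(py, 0), min(py + ph, frame_h)):
            for x in range(max(px, 0), min(px + pw, frame_w)):
                mask[y][x] = 0
    return sum(map(sum, mask)) / area  # S_t / A_t
```

With a 10 by 10 target at (5, 5) and one reliable patch covering its left half, the function returns 0.5, and adding a second reliable patch over the same region leaves the value unchanged.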

After identifying occlusion, the algorithm decides whether to update the target tracker. The background-patches that are identified as occluders continue to be monitored. Meanwhile, the other background-patches, which do not occlude the target, are discarded, and new background-patches around the target in the current frame are added to the monitoring set.
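The update step described above might be organized as in the following sketch. Several details are assumptions rather than statements from the paper: the model is updated when \(\gamma_t\) is at least a threshold (called `gamma_th` here), the monitored set is topped back up to \(N_1\) patches, and the way new patches are placed is left to a caller-supplied function `sample_new`, since the placement scheme is not described in this section.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h)


def should_update_model(gamma_t: float, gamma_th: float) -> bool:
    """Assumed rule: update only while enough of the target remains visible."""
    return gamma_t >= gamma_th


def refresh_patches(
    patches: Sequence[Box],
    occluder_flags: Sequence[bool],
    n1: int,
    sample_new: Callable[[int], List[Box]],
) -> List[Box]:
    """Keep the occluding patches, drop the rest, and add fresh patches.

    sample_new(k) is a hypothetical callback that returns k new patches
    placed around the target in the current frame.
    """
    kept = [p for p, occluding in zip(patches, occluder_flags) if occluding]
    missing = max(n1 - len(kept), 0)  # assumption: keep the set size at N_1
    return kept + sample_new(missing)
```

Keeping the occluders in the monitored set lets the framework follow them across frames, while the refreshed patches track the background around the target's new position.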

3 Occlusion Benchmark

In this section, we present a new specialized benchmark for evaluating the robustness of tracking algorithms to occlusion. The benchmark is available at https://pan.baidu.com/s/1qZ0KeoW.

Although occlusion is one of the attributes in OTB [21], VOT [10] and NUS-PRO [11], these benchmarks still cannot accurately reflect the robustness of tracking algorithms to occlusion, for the following reasons. Each sequence usually has multiple challenging factors. Consider a sequence s with frames (\(\#1\),...,\(\#t_1\),...,\(\#t_2\),...,\(\#T\)), where occlusion happens in the frames between \(\#t_1\) and \(\#t_2\). Since all the trackers start tracking in frame \(\#1\), they will have different tracking outputs before the occlusion occurs in frame \(\#t_1\), which means that the performance on the frames between \(\#t_1\) and \(\#t_2\) is heavily influenced by the previous frames. As a recent study [9] shows, performance measures computed on a sequence are significantly biased toward the dominant attribute of the sequence. Moreover, besides occlusion, other challenging factors may exist in the frames between \(\#t_1\) and \(\#t_2\), which makes the evaluation even less reliable.

Fig. 2.
figure 2

Sequences in our occlusion benchmark can be divided into three parts. The first column shows the first frames of sequences Coke_1 and fish2_1. The second and third columns show the targets being occluded. The last column shows targets after occlusion.

Table 1. Statistics about our occlusion benchmark.

Based on these observations, we propose an occlusion benchmark that has the following characteristics:

  1. Each sequence s with frames (\(\#1\),...,\(\#t_1\),...,\(\#t_2\),...,\(\#T\)) can be divided into 3 sub-sequences. In the first sub-sequence with frames (\(\#1\),...,\(\#t_1\)), neither occlusion nor any other challenging factor occurs, so the target model can be initialized. In the second sub-sequence with frames (\(\#t_1\),...,\(\#t_2\)), the target is occluded. In the last sub-sequence with frames (\(\#t_2\),...,\(\#T\)), the occlusion disappears, so we can identify whether the tracking succeeds. See Fig. 2 for an illustration.

  2. In frames (\(\#t_1\),...,\(\#t_2\)), we exclude other attributes such as deformation, so that the only difficulty for tracking is to handle occlusion. However, it is a common scenario that the occluders are of the same category as the targets and have similar appearance, so we keep these sequences in the benchmark.

  3. The sequences are selected from OTB [21], VOT [10] and NUS-PRO [11] with diversity and richness. The statistics are shown in Table 1.

In our occlusion benchmark, we propose a new metric called Normalized Center Location Error (NCLE) for evaluating performance. For a tracking result \((cx_1,cy_1,w_1,h_1)\) and ground truth \((cx,cy,w,h)\), where \((cx_1,cy_1)\) and \((cx,cy)\) are center locations, the traditional CLE adopted by OTB [21] is defined as

$$\begin{aligned} CLE = \sqrt{ (cx_1-cx)^2 + (cy_1-cy)^2 }. \end{aligned}$$
(3)

A fixed threshold of 20 pixels is used for ranking trackers. However, for differently shaped and sized targets, a 20-pixel deviation may have distinct meanings. For example, the width of a pedestrian target is usually smaller than its height, so a deviation is more serious if it is in the horizontal direction. In NCLE, we normalize the CLE by the width and height of the target:

$$\begin{aligned} NCLE = \min \left\{ \max \left\{ \frac{\left| cx_1-cx\right| }{w},\frac{\left| cy_1-cy\right| }{h} \right\} , \ 1 \right\} . \end{aligned}$$
(4)

NCLE = 1 means a tracking failure. We utilize NCLE-based Precision Plot and Success Plot [21] as performance measurements in our occlusion benchmark.
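To make the difference between Eqs. (3) and (4) concrete, the sketch below implements both errors together with the fraction of frames whose error falls below a threshold, as used for a precision plot. The function names and the strict "less than" comparison are assumptions, and the numbers in the usage note are illustrative rather than benchmark results.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def cle(pred_center: Point, gt_center: Point) -> float:
    """Center Location Error of Eq. (3): Euclidean distance in pixels."""
    return math.hypot(pred_center[0] - gt_center[0], pred_center[1] - gt_center[1])


def ncle(pred_center: Point, gt_center: Point, gt_size: Tuple[float, float]) -> float:
    """Normalized CLE of Eq. (4): per-axis error scaled by the ground-truth
    width and height, clipped to 1 (which marks a tracking failure)."""
    w, h = gt_size
    dx = abs(pred_center[0] - gt_center[0]) / w
    dy = abs(pred_center[1] - gt_center[1]) / h
    return min(max(dx, dy), 1.0)


def precision_at(errors: Sequence[float], threshold: float) -> float:
    """Fraction of frames whose error is strictly below the threshold."""
    if not errors:
        raise ValueError("errors must not be empty")
    return sum(e < threshold for e in errors) / len(errors)
```

For a pedestrian-like ground-truth box of 20 by 80 pixels, a 5-pixel horizontal offset and a 20-pixel vertical offset both give an NCLE of 0.25, although their CLE values are 5 and 20 pixels respectively.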

4 Experiments

In this section, we present the experimental results of several recent tracking algorithms evaluated on our occlusion benchmark, including KCF [7], SAMF [14], DSST [3] and Staple [1]. We also integrate these trackers into our Adaptive COD framework to validate its effectiveness. All the code is available at https://github.com/xgniu/Occlusion-Benchmark.

Fig. 3.
figure 3

The quantitative evaluation results. Left: NCLE-based Precision Plot. The numbers in brackets are the proportion of frames that have NCLE less than 0.5. Right: Success Plot.

Table 2. Different \(\gamma \) for different tracking algorithms. Our framework is not sensitive to the value of \(\gamma \).

4.1 Quantitative Evaluation

The quantitative evaluation results are shown in Fig. 3 in the form of a Precision Plot and a Success Plot. All four trackers gain improvements in performance after being integrated into our adaptive occlusion detection framework. Moreover, we find that although different tracking algorithms require different values of \(\gamma \) for best performance, a wide range of \(\gamma \) provides comparable results (Table 2). The other thresholds are the same as in COD [17].

4.2 Qualitative Evaluation

Fig. 4.
figure 4

The qualitative evaluation results. Red: SAMF. Blue: SAMF_OD. Green: Staple. Black: Staple_OD. The four sequences are Coke, fish, Tiger2 and Lemming (Color figure online).

Figure 4 visualizes several sequences from our occlusion benchmark along with the tracking results of different algorithms. Only the tracking results of SAMF, SAMF_OD, Staple and Staple_OD are shown for clarity, where the suffix ‘_OD’ stands for being integrated into our occlusion detection framework. As the figure shows, when occlusion occurs, SAMF_OD and Staple_OD outperform their baselines.

5 Conclusion

Based on COD [17], we propose an adaptive occlusion detection framework which calculates the proportion of the target that is not occluded. To better evaluate the robustness of tracking algorithms to occlusion, we propose an occlusion benchmark that excludes other challenging factors. In our benchmark, the normalized center location error is adopted as the performance measure. Much work is still needed in the future to solve the occlusion problem for robust visual object tracking.