1 Introduction

Visual object detection and tracking are two rapidly evolving research areas with dozens of new algorithms being published each year. To compare the performance of the many different approaches, a vast number of evaluation datasets and schemes are available. They include large detection datasets with multiple object categories, such as PASCAL VOC [8], smaller, more specific detection datasets with a single category, such as cars [10], and sequences with multiple frames that are commonly used to evaluate trackers, such as VOT2016 [11], OTB-2015 [24], or MOT16 [14]. Although very different in nature, all of these benchmarks use axis-aligned or oriented boxes as ground truth and estimate the accuracy with the Intersection over Union (IoU) criterion.

Nevertheless, boxes are very crude approximations of many objects and may introduce an unwanted bias in the evaluation process, as displayed in Fig. 1. Furthermore, approaches that are not restricted to oriented or axis-aligned boxes do not necessarily achieve higher accuracy scores in these benchmarks [3,4,5]. To address these problems, a number of densely segmented ground truth datasets have started to emerge [13, 16, 23].

Fig. 1.

In image (a), both oriented boxes have an identical IoU with the ground truth segmentation. Nevertheless, their mutual IoU is only 0.71. Restricting the ground truth to boxes may introduce an undesired bias in the evaluation. In image (b), the best possible IoU of an axis-aligned box is only 0.66. Hence, for segmented data, it is difficult to use the absolute value of the IoU as an accuracy measure, since it generally does not range from 0 to 1. Furthermore, although the object detection (green) in image (c) has an overlap of 0.62 with the ground truth segmentation, its IoU with the ground truth axis-aligned bounding box is only 0.45, and it would be considered a false detection in the standard procedure. The proposed rIoU is the same for both boxes in (a) and 1.0 for the green boxes in (b) and (c). (Color figure online)

Unfortunately, evaluating the accuracy of object detectors and trackers that are restricted to boxes on densely segmented data is not straightforward. For example, the VOT2016 benchmark [11] generates plausible oriented boxes from densely segmented objects, and the COCO 2014 detection challenge [13] uses axis-aligned bounding boxes of the segmentations to simplify the evaluation protocol. Hence, approaches may have a relatively low IoU with the ground truth box, although their IoU with the actual object segmentation is the same as (or even better than) that of the ground truth box (see Fig. 1(c)).

To enable a fair evaluation of algorithms restricted to axis-aligned or oriented boxes on densely segmented data, we introduce the relative Intersection over Union (rIoU) accuracy measure. The rIoU uses the best possible axis-aligned or oriented box of the segmentation to normalize the IoU score. The normalized IoU ranges from 0 to 1 for an arbitrary segmentation and makes it possible to determine the true accuracy of a scheme. For tracking scenarios, the optimal boxes have further advantages. By determining three different optimal boxes for each sequence (the optimal oriented box, the optimal axis-aligned box, and the optimal axis-aligned box for a fixed scale), it is possible to identify scale changes, rotations, and occlusions in a sequence without the need for per-frame labels.

The optimal boxes are obtained in an efficient optimization process. We validate the quality of the boxes in the experiments section by comparing them to exhaustively determined best boxes for various scenes.

The three main contributions of this paper are:

  1. The introduction of the relative Intersection over Union (rIoU) accuracy measure, which allows an accurate measurement of object detector and tracker accuracies on densely segmented data.

  2. The proposed evaluation removes the bias introduced by restricting the ground truth to boxes for densely segmented data (such as the COCO 2014 Detection Challenge [13] or VOT2016 [11]).

  3. A compact, easy-to-use, and efficient evaluation scheme for object trackers that offers good interpretability of a tracker's strengths and weaknesses.

The proposed measure and evaluation scheme are evaluated on a handful of state-of-the-art trackers for the DAVIS [16] and VOT2016 [11] datasets and made available to the community (Footnote 1).

2 Related Work

In the object detection community, the most commonly used accuracy measure is the Intersection over Union (IoU), also called the Pascal overlap or bounding box overlap [8]. It serves as the standard criterion for a correct detection: a detection is accepted when the IoU between the predicted box and the ground truth is at least 0.5 [13].

In the tracking community, many different accuracy measures have been proposed, most of them center-based or overlap-based [11, 12, 15, 18, 22, 24]. To unify the evaluation of trackers, Čehovin et al. [20, 22] provide a highly detailed theoretical and experimental analysis of the most popular performance measures and show that many of them are highly correlated. Nevertheless, the appealing property of the IoU measure is that it accounts for both the position and size of the prediction and the ground truth simultaneously. As a result, it has become the most commonly used accuracy measure in the tracking community in recent years [11, 24]. For example, the VOT2016 [11] evaluation framework uses the IoU as the sole accuracy measure and identifies a tracker failure when the IoU between the predicted detection and the ground truth drops to 0.0 [12].

Since bounding boxes are very crude approximations of objects [13] and cannot accurately capture an object's shape, location, or characteristics, numerous datasets with densely segmented ground truth have emerged. For example, the COCO 2014 dataset [13] includes more than 886,000 densely annotated instances of 80 object categories. Nevertheless, in the COCO detection challenge, the segmentations are approximated by axis-aligned bounding boxes to simplify the evaluation. As stated earlier, this introduces an unwanted bias into the evaluation. A further dataset with excellent pixel-accurate segmentations is the DAVIS dataset [16], which was released in 2017. It consists of 50 short sequences with manually segmented objects which, although originally intended for video object segmentation, can also be used for the evaluation of object trackers. Furthermore, the segmentations used to generate the VOT2016 ground truths have very recently been released [23].

In our work, we enable the evaluation of object detection and tracking algorithms that are restricted to output boxes on densely segmented ground truth data. The proposed approach is easy to add to existing evaluations and improves the precision of the standard IoU accuracy measure.

3 Relative Intersection over Union (rIoU)

Using segmentations to evaluate the accuracy of detectors or trackers removes the bias that a bounding-box abstraction induces. Nevertheless, the IoU of a box and an arbitrary segmentation generally does not range from 0 to 1; its maximum value depends strongly on the object's shape. For example, in Fig. 1(b) the best possible axis-aligned box only has an IoU of 0.66 with the segmentation.

To enable a more precise measurement of the accuracy, we introduce the relative Intersection over Union (rIoU) of a box \(\mathcal {B}\) and a dense segmentation \(\mathcal {S}\) as

$$\begin{aligned} \varPhi _{rIoU} \left( \mathcal {S}, \mathcal {B}\right) = \frac{\varPhi _{IoU}(\mathcal {S},\mathcal {B})}{\varPhi _{opt}(\mathcal {S})}, \end{aligned}$$
(1)

where \(\varPhi _{IoU}\) is the Intersection over Union (IoU),

$$\begin{aligned} \varPhi _{IoU} \left( \mathcal {S}, \mathcal {B}\right) = \frac{\left| \mathcal {S} \cap \mathcal {B} \right| }{\left| \mathcal {S} \cup \mathcal {B} \right| }, \end{aligned}$$
(2)

and \(\varPhi _{opt}\) is the best possible IoU a box can achieve for the segmentation \(\mathcal {S}\). In comparison to the usual IoU (\(\varPhi _{IoU}\)), the rIoU measure (\(\varPhi _{rIoU}\)) truly ranges from 0 to 1 for all possible segmentations. Furthermore, the measure makes it possible to interpret ground truth attributes such as scale change or occlusion, as is displayed later in Sect. 4.
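To make the definitions (1) and (2) concrete, the following minimal Python sketch computes both measures on boolean masks. The mask-based representation and function names are our own, and \(\varPhi _{opt}\) is taken as a precomputed value here; its computation is described in Sect. 3.1.

```python
import numpy as np

def iou(seg: np.ndarray, box: np.ndarray) -> float:
    """Phi_IoU of a dense segmentation and a rasterized box, Eq. (2).

    Both inputs are boolean masks of the same shape.
    """
    inter = np.logical_and(seg, box).sum()
    union = np.logical_or(seg, box).sum()
    return float(inter / union) if union > 0 else 0.0

def r_iou(seg: np.ndarray, box: np.ndarray, phi_opt: float) -> float:
    """Phi_rIoU, Eq. (1): the IoU normalized by the best achievable
    IoU phi_opt of any box for this segmentation (Sect. 3.1)."""
    return iou(seg, box) / phi_opt
```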

The calculation of \(\varPhi _{opt}\), required to obtain \(\varPhi _{rIoU}\), is described in the following section.

3.1 Optimization

An oriented box \(\mathcal {B}\) can be parameterized with 5 parameters

$$\begin{aligned} b = \left( r_c,c_c,w,h,\phi \right) , \end{aligned}$$
(3)

where \(r_c\) and \(c_c\) denote the row and column of the center, w and h denote the width and height, and \(\phi \) the orientation of the box with respect to the column-axis. An axis-aligned box can equally be parameterized with the above parameters by fixing the orientation to \(0^\circ \).
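For illustration, a pixel-discretized rasterization of this parameterization could look as follows. It suffices for evaluating \(\varPhi _{IoU}\) numerically, although, as noted in Sect. 3.2, sub-pixel precision matters for oriented boxes; the function name and sign conventions are our own.

```python
import numpy as np

def box_mask(b, shape):
    """Rasterize the oriented box b = (r_c, c_c, w, h, phi) of Eq. (3)
    into a boolean mask of the given image shape.

    phi is the orientation in degrees w.r.t. the column axis;
    phi = 0 yields an axis-aligned box.
    """
    r_c, c_c, w, h, phi = b
    rr, cc = np.mgrid[0:shape[0], 0:shape[1]]
    a = np.deg2rad(phi)
    dr, dc = rr - r_c, cc - c_c
    # Rotate the pixel coordinates into the box-aligned frame.
    u = dc * np.cos(a) + dr * np.sin(a)   # along the width axis
    v = -dc * np.sin(a) + dr * np.cos(a)  # along the height axis
    return (np.abs(u) <= w / 2) & (np.abs(v) <= h / 2)
```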

Fig. 2.

blackswan from DAVIS [16]. The initial values of the optimization process of (4) are displayed. We use the axis-aligned bounding box (green), the oriented bounding box (blue), the inner square of the largest inner circle (magenta), the largest inner axis-aligned box (black) and the oriented box with the same second order moments as the segmentation (orange). (Color figure online)

For a given segmentation \(\mathcal {S}\), the box with the best possible IoU is

$$\begin{aligned} \varPhi _{opt}(\mathcal {S}) = \max _b\,\,\, \varPhi _{IoU} (\mathcal {S},\mathcal {B}(b)) \qquad \quad \, s.t.\,\, b \in \mathbb {R}_{>0}^4 \times [0^\circ ,90^\circ ). \end{aligned}$$
(4)

For a convex segmentation, the above problem can be optimized efficiently with the method of steepest descent. To handle arbitrary, possibly unconnected, segmentations, we optimize (4) with a multi-start gradient descent with a backtracking line search. The gradient is approximated numerically by the symmetric difference quotient. We use the diverse set of initial values for the optimization process displayed in Fig. 2. The largest axis-aligned inner box (black) and the inner box of the largest inner circle (magenta) are completely within the segmentation. Hence, in the optimization process, they will gradually grow and include background if it improves \(\varPhi _{IoU}\). On the other hand, the bounding boxes (green and blue) include the complete segmentation and will gradually shrink in the optimization to include less of the segmentation. The oriented box with the same second order moments as the segmentation (orange) serves as an intermediate starting point [17]. Only if the initial values converge to different optima do we need to expend more effort. In these cases, we randomly sample further initial values from the interval spanned by the obtained optima with an added perturbation. In our experiments we used 50 random samples. Although this may lead to many different optimizations, the approach is still very efficient. A single evaluation of \(\varPhi _{IoU} (\mathcal {S},\mathcal {B})\) requires an average of only 0.04 ms for the segmentations within the DAVIS [16] dataset in HALCON (Footnote 2) on an Intel Core i7-4810 CPU @ 2.8 GHz with 16 GB of RAM running Windows 7 (x64). As a consequence, the optimization of \(\varPhi _{opt}\) requires an average of 1.3 s for the DAVIS [16] and 0.7 s for the VOT2016 [11] segmentations.
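The following sketch illustrates the core of such a multi-start scheme. The per-parameter step sizes, the initial step length, and the stopping thresholds are illustrative assumptions rather than the exact values of our implementation:

```python
import numpy as np

def numeric_grad(f, b, eps=(0.5, 0.5, 0.5, 0.5, 0.25)):
    """Approximate the gradient of f at b with the symmetric
    difference quotient, using one step size per parameter."""
    g = np.zeros(len(b))
    for i, e in enumerate(eps):
        d = np.zeros(len(b))
        d[i] = e
        g[i] = (f(b + d) - f(b - d)) / (2.0 * e)
    return g

def maximize_iou(f, b0, alpha0=8.0, beta=0.5, max_iter=100, tol=1e-5):
    """Gradient ascent on f (the IoU as a function of the box
    parameters) with a backtracking line search, started from one
    of the initial boxes of Fig. 2."""
    b = np.asarray(b0, dtype=float)
    fb = f(b)
    for _ in range(max_iter):
        g = numeric_grad(f, b)
        if np.linalg.norm(g) < tol:
            break
        # Backtrack until the step actually improves the objective.
        alpha = alpha0
        while alpha > 1e-3 and f(b + alpha * g) <= fb:
            alpha *= beta
        if alpha <= 1e-3:
            break
        b = b + alpha * g
        fb = f(b)
    return b, fb

# Usage sketch, tying in the iou and box_mask helpers from above,
# where initial_boxes holds the five starting boxes of Fig. 2:
#   f = lambda b: iou(seg, box_mask(b, seg.shape))
#   b_opt, phi_opt = max((maximize_iou(f, b0) for b0 in initial_boxes),
#                        key=lambda res: res[1])
```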

The optimization of the IoU for axis-aligned rectangles bears some similarity to the 2D maximum subarray problem [1]. This might make an alternative algorithmic approach to the optimization possible. However, a straightforward adaptation of methods is difficult, since these methods rely on the additive nature of the maximum subarray problem. In contrast, the IoU is inherently non-linear due to the quotient in its definition.

3.2 Validation

To validate the optimization process, we exhaustively searched for the best boxes in a collection of exemplary frames from each of the 50 sequences in the DAVIS dataset [16]. The validation set consists of frames that were challenging for the optimization process. In a first step, we validated the optimization for axis-aligned boxes. The results in Fig. 3 indicate that the optimized boxes are generally very close or identical to the exhaustively determined ones.
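For reference, such a brute-force search over axis-aligned boxes on a pixel grid can be written with an integral image for constant-time intersection queries; this is only feasible for validation purposes, and the stride parameter (our own addition) coarsens the grid to keep the search tractable:

```python
import numpy as np
from itertools import product

def exhaustive_axis_aligned(seg, stride=1):
    """Brute-force search for the best axis-aligned box (validation only).

    Enumerates all (top, left, bottom, right) tuples on a pixel grid.
    """
    rows = np.arange(0, seg.shape[0] + 1, stride)
    cols = np.arange(0, seg.shape[1] + 1, stride)
    seg_area = seg.sum()
    # Integral image: ii[r, c] = number of segmentation pixels in seg[:r, :c].
    ii = np.pad(np.cumsum(np.cumsum(seg, 0), 1), ((1, 0), (1, 0)))
    best, best_iou = None, 0.0
    for r0, r1 in product(rows, rows):
        if r1 <= r0:
            continue
        for c0, c1 in product(cols, cols):
            if c1 <= c0:
                continue
            inter = ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]
            area = (r1 - r0) * (c1 - c0)
            iou = inter / (seg_area + area - inter)
            if iou > best_iou:
                best, best_iou = (r0, c0, r1, c1), float(iou)
    return best, best_iou
```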

Fig. 3.

The absolute difference \(\varDelta _{\varPhi _{IoU}}\) of the exhaustively determined best axis-aligned box and the optimized axis-aligned box for a selected frame in each of the 50 DAVIS [16] sequences. Most boxes are identical; only a handful are marginally different (\({<}0.0001\)).

For the oriented boxes, one of the restrictions we can make is that the area of a candidate box must be at least as large as that of the largest inner box of the segmentation and may not be larger than that of the oriented bounding box. Nevertheless, even with further heuristics, the number of candidates to test runs into the billions for the sequences in the DAVIS dataset. Given a pixel-precise discretization of \(r_c,c_c,w,h\) and a \(0.5^\circ \) discretization of \(\phi \), it was impossible to find boxes with a better IoU than the optimized oriented boxes in the validation set. This is mostly because the sub-pixel precision of the parameterization (especially in the angle \(\phi \)) is of paramount importance for the IoU of oriented boxes.

4 Theoretical Trackers

The concept of theoretical trackers was first introduced by Čehovin et al. [22] as an “excellent interpretation guide in the graphical representation of results”. In their paper, they use perfectly robust or perfectly accurate theoretical trackers to create bounds for the comparison of the performance of different trackers. In our case, we use the boxes with an optimal IoU to create upper bounds for the accuracy of trackers that operate under the box-world assumption. We introduce three theoretical trackers that are obtained by optimizing (4) for a complete sequence. Given the segmentation \(\mathcal {S}\), the first tracker returns the best possible axis-aligned box (box-axis-aligned), the second tracker returns the optimal oriented box (box-rot), and the third tracker returns the optimal axis-aligned box with a fixed scale (box-no-scale). The scale is initialized in the first frame with the scale of the box determined by box-axis-aligned.
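In sketch form, the three trackers can be computed as follows, assuming a hypothetical wrapper maximize around the optimization of Sect. 3.1 that solves (4) over the indicated free parameters and returns the optimal box and its IoU:

```python
def theoretical_trackers(segmentations):
    """Per-frame optimal IoUs of box-axis-aligned, box-rot, and
    box-no-scale over a sequence of segmentation masks."""
    axis_aligned, rot, no_scale = [], [], []
    w0 = h0 = None
    for seg in segmentations:
        b_aa, iou_aa = maximize(seg, oriented=False)  # hypothetical solver for (4)
        _, iou_rot = maximize(seg, oriented=True)
        if w0 is None:
            # Fix the scale to the optimal axis-aligned box of the first frame.
            w0, h0 = b_aa[2], b_aa[3]
        _, iou_ns = maximize(seg, oriented=False, fixed_size=(w0, h0))
        axis_aligned.append(iou_aa)
        rot.append(iou_rot)
        no_scale.append(iou_ns)
    return axis_aligned, rot, no_scale
```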

Fig. 4.

motorbike from DAVIS [16]. The increasing gap between the box-no-scale and the other two theoretical trackers indicates a scale change of the motorbike. The drop in all three theoretical trackers around frame 25 indicates that the object is being occluded. The best possible IoU is never above 0.80 for the complete sequence.

Fig. 5.

dog from DAVIS [16]. The gaps between the box-axis-aligned and box-rot tracker indicate a rotation of the otherwise relatively compact segmentation of the dog. The best possible IoU is never above 0.80 for the complete sequence.

The theoretical trackers can be used to normalize a tracker's IoU for a complete sequence, which enables a fair interpretation of a tracker's accuracy and removes the bias from the box-world assumption. Furthermore, the three different theoretical trackers make it possible to interpret a tracking scene without the need for per-frame labels. As displayed in Fig. 4, the difference between the box-no-scale, box-axis-aligned, and box-rot trackers indicates that the object is undergoing a scale change. Furthermore, decreasing IoUs of all theoretical trackers indicate that the object is either being occluded or deforming to a shape that a box approximates less well. For compact objects, the difference between the box-rot tracker and the box-axis-aligned tracker indicates a rotation or change of perspective, as displayed in Fig. 5.
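Such interpretations can be automated with simple rules of thumb on the three per-frame curves. The following sketch and its thresholds are purely illustrative assumptions, not part of our evaluation protocol:

```python
import numpy as np

def interpret_sequence(iou_aa, iou_rot, iou_ns, gap=0.1, drop=0.2):
    """Illustrative per-frame flags derived from the three theoretical
    trackers; gap and drop are hypothetical thresholds."""
    aa, rot, ns = map(np.asarray, (iou_aa, iou_rot, iou_ns))
    scale_change = (aa - ns) > gap            # box-no-scale falls behind
    rotation = (rot - aa) > gap               # oriented box clearly better
    occlusion = rot < np.median(rot) - drop   # even the best box drops
    return scale_change, rotation, occlusion
```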

5 Experiments

We evaluate the accuracy of a handful of state-of-the-art trackers on the DAVIS [16] and VOT2016 [11] datasets with the new rIoU measure. We initialize the trackers with the best possible axis-aligned box for the given segmentation. Since we are primarily interested in the accuracy and not in the trackers' robustness, we do not reinitialize the trackers when they move off target. Note that using segmentations also improves the precision of the robustness measure: failure cases (\(\varPhi _{IoU} = 0\)) are identified earlier, since \(\varPhi _{IoU}\) only becomes zero when the tracker no longer overlaps the segmentation itself, rather than a bounding box abstraction of the object, which may contain a large amount of background (see, e.g., Fig. 1).
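A minimal sketch of this evaluation protocol is given below; the tracker interface (init/update) and the maximize wrapper are assumptions for illustration, and iou and box_mask are the helpers from Sect. 3:

```python
import numpy as np

def evaluate_sequence(tracker, frames, segmentations, phi_opts):
    """Mean IoU and mean rIoU of a tracker over one sequence.

    phi_opts holds the per-frame optimal IoUs of the theoretical
    tracker with the same abilities (Sect. 4).
    """
    # Initialize with the best possible axis-aligned box of the first frame.
    b0, _ = maximize(segmentations[0], oriented=False)  # hypothetical solver for (4)
    tracker.init(frames[0], b0)                         # assumed tracker interface
    ious = []
    for frame, seg in zip(frames[1:], segmentations[1:]):
        b = tracker.update(frame)   # predicted box; no reinitialization on failure
        ious.append(iou(seg, box_mask(b, seg.shape)))
    ious = np.asarray(ious)
    return float(ious.mean()), float((ious / np.asarray(phi_opts[1:])).mean())
```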

We restrict our evaluation to the handful of (open source) state-of-the-art trackers displayed in Table 1. A thorough evaluation and comparison of all top-ranking trackers is beyond the scope of this paper. The evaluation framework is made available and constructed such that it is easy to add new trackers from MATLAB (Footnote 3), Python (Footnote 4), or HALCON.

We include the Kernelized Correlation Filter (KCF) [9] tracker since it was a top-ranked tracker in the VOT2014 challenge, even though it assumes that the scale of the object stays constant. The Discriminative Scale Space Tracker (DSST) [6] is essentially an extension of KCF that can handle scale changes and outperformed the KCF by a small margin in the VOT2014 challenge. As further axis-aligned trackers, we include ANT [21], L1APG [2], and the best performing tracker of the VOT2016 challenge, the continuous convolution operator tracker (CCOT) of Danelljan et al. [7]. We include the LGT [19] as one of the few open source trackers that estimate the object position as an oriented box.

Table 1. Comparison of different tracking approaches and their average absolute (\(\varPhi _{IoU}\)) and relative IoU (\(\varPhi _{rIoU}\)) for the DAVIS [16] and the VOT2016 [11] segmentations
Fig. 6.

bmx-trees from DAVIS [16]. On the left, differences between box-no-scale and box-axis-aligned indicate that the object is changing scale and is occluded at frame 18 and around frames 60–70. In the middle plot, we compare the IoU of the axis-aligned box trackers and box-axis-aligned. The corresponding rIoU plot is shown on the right. It becomes evident that the ANT tracker fails when the object is occluded for the first time and the L1APG tracker at the second occlusion. The rIoU shows that DSST and CCOT perform well, while the IoU would imply they are weak.

In Table 1, we compare the average IoU with the average rIoU for the DAVIS and the VOT2016 datasets. Please note that we normalize each tracker with the IoU of the theoretical tracker that has the same abilities. Hence, the KCF tracker is normalized with the box-no-scale tracker, the LGT tracker with box-rot, and the others with box-axis-aligned. This makes it possible to observe how well each tracker is doing with respect to its abilities. For the DAVIS dataset, the KCF, ANT, L1APG, and LGT trackers all have the same absolute IoU, but when normalized by \(\varPhi _{opt}\), differences become visible. Hence, it is evident that the KCF is performing very well, given that it does not estimate the scale. On the other hand, the LGT tracker, which has three more degrees of freedom, is relatively weak. A more detailed analysis of the bmx-trees sequence from DAVIS [16] is displayed in Fig. 6. Note that the significantly larger difference between the IoU and the rIoU for KCF compared to the other trackers is due to the different normalization factors used in the rIoU measure: the optimal IoU value for a box of fixed size is usually considerably lower than that for a general axis-aligned box.

For the VOT2016 dataset, the overall accuracies are significantly worse than for DAVIS. On the one hand, this is due to the longer, more difficult sequences and, on the other hand, due to the less accurate and noisier segmentations (see Fig. 7). Nevertheless, the rIoU allows a more reliable comparison of different trackers. For example, ANT, LGT, and DSST have almost equal average IoU values, while ANT clearly outperforms LGT and DSST with respect to the rIoU. Again, we can see that the KCF tracker is quite strong considering that it cannot estimate the scale.

Fig. 7.

Examples from VOT2016 [11] where the segmentations are degenerate, either due to motion blur (e.g., (a) and (b)) or due to weak contrast between the object and its background (c). (Color figure online)

6 Conclusion

In this paper, we have proposed a new accuracy measure that closes the gap between densely segmented ground truth data and box-based detectors and trackers. We have presented an efficient optimization scheme to obtain the best possible detection boxes for arbitrary segmentations, which are required for the new measure. The optimization was validated on a diverse set of segmentations from the DAVIS dataset [16]. The new accuracy measure can be used to generate three very expressive theoretical trackers, which yield meaningful accuracies and help to interpret scenes without requiring per-frame labels. We have evaluated state-of-the-art trackers with the new accuracy measure on all segmentations within the DAVIS [16] and VOT2016 [11] datasets to demonstrate its advantages. The complete code and evaluation system will be made available to the community to encourage its use and make it easy to reproduce our results.