
1 Introduction

Recent advances in object detection have been driven mainly by the development of Deep Neural Networks (DNNs) [11, 12, 16, 24, 32, 33, 34]. In particular, one crucial component that allows DNNs to localize object bounding boxes precisely and flexibly is the Bounding Box Regressor (BBR), originally proposed in [12]. As a part of object detection networks, a BBR refines off-the-shelf object proposals [11, 12] or anchor boxes with fixed positions and aspect ratios [24, 32, 34] so that the refined boxes localize nearby objects more accurately. For this purpose, BBRs are tightly coupled with other components of object detection networks and trained to better localize predefined object classes. That is, they have typically been developed for supervised object detection, where ground-truth bounding boxes for target classes are given.

This paper studies BBR in a direction different from the conventional one. Specifically, we propose a BBR model that is class-agnostic, generalizes well even to unseen classes, and is transferable to multiple diverse tasks demanding accurate bounding box localization; we call such a model the Universal Bounding Box Regressor (UBBR). UBBR takes an image and arbitrary bounding boxes, and refines the boxes so that they tightly enclose their nearest objects, regardless of class. A model with such simple functionality can have a great impact on many applications since it is universal in terms of both object classes and tasks. One example application is weakly supervised object detection, where box annotations for target object classes are not given. In this setting, object bounding boxes tend to be badly localized due to the limited supervision [3, 20, 36], and UBBR can help to improve performance by refining the localization results. In this case, UBBR can be considered a knowledge transfer machine for bounding box localization. UBBR can also be used to generate object box proposals. Given boxes uniformly and densely sampled from image space, UBBR transforms them to approximate the boxes of their nearest objects, and the results are bounding boxes clustered around true object boxes. In this case, UBBR can be considered a learning-based object proposal method [28, 29, 38].

This paper introduces a DNN architecture for UBBR and its training strategy. Our UBBR takes the form of a Convolutional Neural Network (CNN) trained with randomly generated input boxes. It successfully generalizes to unseen classes and can be used to improve localization in various computer vision problems, especially when bounding box supervision is absent. We demonstrate its effectiveness on weakly supervised object detection, object proposal generation, and object discovery. The main contributions of this paper are three-fold:

  • We present a simple yet effective UBBR based on a CNN, which is versatile and easily generalizable to unseen classes. We also present a training strategy to learn such a universal model.

  • A single UBBR network achieves, or helps to achieve, competitive performance in three different applications: weakly supervised object detection, object proposals, and object discovery.

  • We provide an in-depth empirical analysis for demonstrating the generalizability of our UBBR for unseen classes.

The rest of this paper is organized as follows. Section 2 overviews previous approaches relevant to UBBR, and Sect. 3 presents technical details of UBBR and a strategy for training it. UBBR is then evaluated on three different localization tasks in Sect. 4, and we conclude in Sect. 5 with brief remarks.

2 Related Work

Conventional BBR in Object Detection: BBR has been widely incorporated into DNNs for object detection [11, 12, 24, 32, 33, 34] for precise localization of object bounding boxes. Initially, it was designed as a post-processing step to refine off-the-shelf object proposal boxes [11, 12]. More recently, it directly estimates bounding boxes of nearby objects from each cell of an image grid [33], or aims to transform a fixed set of anchor boxes to cover ground-truth object boxes accurately [24, 32, 34]. Here the anchor boxes, also known as default boxes, are pre-defined bounding boxes that are sampled on a regular grid with a few selected scales and aspect ratios [24, 33, 34] or estimated from ground-truth object boxes of training data [32]. Thus, those BBRs are trained to work in harmony with other components of object detection networks, and depend on a few pre-defined object classes and the characteristics of anchor boxes. In contrast, our UBBR is designed and trained to be class-agnostic, transferable to unseen classes, and free from anchor boxes. These properties allow us to apply UBBR to multiple diverse applications demanding accurate bounding box localization, beyond conventional object detection.

Object Proposal: Our UBBR is also closely related to object proposals since it naturally generates accurate object candidate boxes given uniformly sampled boxes as inputs. Well-known early approaches to object proposal are unsupervised techniques [18, 26]. Motivated by the fact that an object box typically includes a whole image segment rather than a part of it, they draw bounding boxes encompassing image segments obtained by hierarchical image segmentation methods. Since there is no supervision for object location and image segmentation results often fail to preserve object boundaries, the unsupervised techniques are limited in terms of recall and localization accuracy. Supervised approaches to object proposals have been actively studied as well, and have exhibited substantially better performance. Before the era of deep learning, techniques were proposed for generating object candidate boxes [38] and masks [2], trained with object boundary annotations. More recently, Pinheiro et al. [28, 29] introduced DNNs for generating and refining class-agnostic object candidate masks.

Learning-based proposals, including ours, require strong supervision in training. One may ask: if such bounding box annotations are given, why not directly learn an object detector instead of proposals? We argue that learning-based proposals are still valuable if they are class-agnostic, generalize well to unseen classes, and can be applied universally to various applications. Note that existing datasets provide a huge amount of readily available annotations, especially for bounding boxes; there is no reason to avoid them when localizing objects of unseen classes in the context of transfer learning.

Transfer Learning for Visual Recognition: Oquab et al. [27] demonstrated that low-level layers of a CNN trained for large-scale image classification can be transferred to classification in different domains or even to different visual recognition tasks. Since then, transferring low-level image representations has been a common technique to avoid overfitting in various visual recognition tasks such as object detection [11, 12, 16, 24, 32, 33, 34] and semantic segmentation [6, 25, 35]. While these approaches focus on transferring low-level image representations between different tasks, UBBR transfers the knowledge of how to draw bounding boxes that enclose an object. In that sense, UBBR also has a connection to TransferNet [15], which transfers segmentation knowledge to object classes whose segmentation annotations are not available.

Fig. 1. Illustration of UBBR’s architecture. At inference time, the network takes an image with roughly localized bounding boxes and refines them so that they tightly enclose nearby objects. N is the number of input boxes and K is the dimensionality of box features. At training time, the network takes bounding boxes randomly generated around ground-truth boxes, and learns to transform each input box so that the Intersection-over-Union between the box and its nearest ground-truth is maximized.

3 Universal Bounding Box Regressor

3.1 Architecture

The architecture of UBBR is similar to that of conventional object detectors (e.g., Fast R-CNN [11]), which consist of convolutional layers for feature representation, a region pooling layer for extracting region-wise features, and fully-connected layers for box classification and regression. Figure 1 illustrates the training and inference stages of the UBBR network. The architecture first computes a feature map of an input image with the convolutional layers, and a fixed-length feature vector is extracted for each input box through the RoI-Align layer [13]. Each of the extracted box features is then processed by three fully-connected layers to compute a 4-D real vector indicating the offset between the corresponding box and its nearest object. Note that, unlike most conventional object detection networks [11, 34], UBBR is designed to accept input boxes of arbitrary shapes and object classes. Hence, the UBBR network is trained in an anchor-free and class-agnostic manner, as described in the following.

3.2 Training

Dataset: Since UBBR predicts object boxes, it demands images with ground-truth object boxes during training, and any existing datasets for object detection can meet the need. Note that since UBBR is class-agnostic, class labels of the box annotations are disregarded in our case.

Fig. 2. Example of randomly generated bounding boxes for training UBBR. Black boxes are ground-truths and yellow ones are randomly generated boxes.

Random Box Generation: UBBR takes as its inputs not only an image but also (roughly localized) boxes that will be transformed to enclose nearby objects tightly. Thus, each training image has to be served together with such boxes. Furthermore, the boxes fed to the network during training should be diverse for the universality of UBBR but, at the same time, have to overlap with at least one ground-truth to some extent so that UBBR can observe enough evidence about the target object. To this end, at training time we generate input bounding boxes by applying random transformations to ground-truth boxes.

Let \(g = [x_g, y_g, w_g, h_g]^\top \) denote a ground-truth box represented by its center coordinate \((x_g, y_g)\), width \(w_g\), and height \(h_g\). Transformation parameters for the four values are sampled from uniform distributions independently as follows:

$$\begin{aligned} \begin{aligned}&t_x \sim \mathcal {U}(-\alpha ,\; \alpha ),\\&t_y \sim \mathcal {U}(-\alpha ,\; \alpha ),\\&t_w \sim \mathcal {U}\big (\ln (1 - \beta ),\; \ln (1 + \beta )\big ),\\&t_h \sim \mathcal {U}\big (\ln (1 - \beta ),\; \ln (1 + \beta )\big ). \end{aligned} \end{aligned}$$
(1)

Then a random input box \(b=[x_b, y_b, w_b, h_b]^\top \) is obtained by applying the sampled transformation to g:

$$\begin{aligned} \begin{aligned}&x_b = x_g + t_x \cdot w_g,\\&y_b = y_g + t_y \cdot h_g,\\&w_b = w_g \cdot \exp (t_w),\\&h_b = h_g \cdot \exp (t_h). \end{aligned} \end{aligned}$$
(2)

Also, if Intersection-over-Union (IoU) between b and g is less than a pre-defined threshold t, we simply discard b during training. \(\alpha \) and \(\beta \) are empirically set to 0.35 and 0.5 respectively. The effect of \(\alpha \), \(\beta \), and t on the performance of UBBR is analyzed in the next section. Figure 2 shows examples of random box generation.
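The sampling procedure of Eqs. (1) and (2), together with the IoU-based rejection, can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the function names `iou` and `random_boxes` are hypothetical, the default t = 0.3 is only one of the thresholds considered in the experiments, and we assume that rejected samples are simply dropped rather than redrawn.

```python
import math
import random


def iou(a, b):
    """IoU of two boxes given as (center_x, center_y, width, height)."""
    iw = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2)
             - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    ih = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2)
             - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def random_boxes(g, n=50, alpha=0.35, beta=0.5, t=0.3, rng=random):
    """Draw n boxes around ground-truth g and keep those with IoU >= t.

    Assumption: discarded samples are not redrawn, so fewer than n
    boxes may be returned.
    """
    x_g, y_g, w_g, h_g = g
    kept = []
    for _ in range(n):
        # Eq. (1): sample the four transformation parameters independently.
        t_x = rng.uniform(-alpha, alpha)
        t_y = rng.uniform(-alpha, alpha)
        t_w = rng.uniform(math.log(1 - beta), math.log(1 + beta))
        t_h = rng.uniform(math.log(1 - beta), math.log(1 + beta))
        # Eq. (2): apply the sampled transformation to g.
        b = (x_g + t_x * w_g, y_g + t_y * h_g,
             w_g * math.exp(t_w), h_g * math.exp(t_h))
        if iou(b, g) >= t:
            kept.append(b)
    return kept
```

With the default parameters, box centers move by at most 35% of the ground-truth width or height, and each side length is scaled by a factor between 0.5 and 1.5.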

Loss Function: For the regression criterion, IoU loss [37] is employed instead of conventional ones like \(L_2\) and smooth \(L_1\) losses. The drawback of the conventional losses in bounding box regression is that the bounding box transformation parameters \((t_x, t_y, t_w, t_h)\) are optimized independently [37] although they are in fact highly inter-correlated. IoU loss has been proposed to address this issue, and we observed in our experiments that IoU loss makes training more stable and leads to better performance than smooth \(L_1\) loss.

Algorithm 1. Computation of IoU loss between two bounding boxes u and v.

The procedure for computing IoU loss between two bounding boxes is described in Algorithm 1, where \(A_u\) and \(A_v\) are the areas of u and v, and \(I_w\) and \(I_h\) are the width and height of their intersection. Note that we add a tiny constant \(\epsilon \) to the IoU value before taking the logarithm for numerical stability. The image-level loss is then defined as the average of box-wise regression losses as follows:

$$\begin{aligned} \begin{aligned}&L_{\text {IoU}} = \frac{1}{N} \sum _{n=1}^{N} \text {IoU-loss}\Big (f\big (b_n, \text {UBBR}(b_n)\big ), g_n \Big ), \end{aligned} \end{aligned}$$
(3)

where \(b_n\) is an input box and \(g_n\) is the ground-truth bounding box that best overlaps with \(b_n\) in terms of IoU. Also, UBBR(\(b_n\)) denotes the offsets predicted by UBBR, and f is the transformation function that refines \(b_n\) with the predicted offsets.
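To make Algorithm 1 and Eq. (3) concrete, the following plain-Python sketch computes the box-wise IoU loss as the negative logarithm of the IoU plus \(\epsilon \), and averages it over the boxes of an image. Several details are assumptions on our part and are not stated in the text: the value of \(\epsilon \) (here `1e-6`), the function names, and in particular the form of f, which we take to mirror the parametrization of Eq. (2). The sketch operates on plain floats and does not show how gradients are propagated.

```python
import math

EPS = 1e-6  # assumed value; the text only specifies "a tiny constant"


def iou_loss(u, v, eps=EPS):
    """-ln(IoU(u, v) + eps) for boxes given as (center_x, center_y, w, h)."""
    a_u = u[2] * u[3]  # area of u
    a_v = v[2] * v[3]  # area of v
    # Width and height of the intersection (zero if the boxes are disjoint).
    i_w = max(0.0, min(u[0] + u[2] / 2, v[0] + v[2] / 2)
              - max(u[0] - u[2] / 2, v[0] - v[2] / 2))
    i_h = max(0.0, min(u[1] + u[3] / 2, v[1] + v[3] / 2)
              - max(u[1] - u[3] / 2, v[1] - v[3] / 2))
    inter = i_w * i_h
    union = a_u + a_v - inter
    iou = inter / union if union > 0 else 0.0
    return -math.log(iou + eps)


def apply_offsets(b, offsets):
    """Assumed form of f: refine b with offsets (t_x, t_y, t_w, t_h) as in Eq. (2)."""
    x, y, w, h = b
    t_x, t_y, t_w, t_h = offsets
    return (x + t_x * w, y + t_y * h, w * math.exp(t_w), h * math.exp(t_h))


def image_iou_loss(boxes, predicted_offsets, matched_gts):
    """Eq. (3): mean IoU loss over the N input boxes of one image.

    matched_gts[n] is the ground-truth box that best overlaps boxes[n].
    """
    losses = [iou_loss(apply_offsets(b, o), g)
              for b, o, g in zip(boxes, predicted_offsets, matched_gts)]
    return sum(losses) / len(losses)
```

Because the loss is a function of the refined box as a whole, all four offsets influence a single overlap term, which is the property that motivates the choice of IoU loss over smooth \(L_1\).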

4 Experiment

In this section, we first describe the datasets and implementation details, and then demonstrate the effectiveness of our approach empirically in three tasks: weakly supervised object detection, object proposal, and object discovery.

4.1 Datasets

To demonstrate the transferability of UBBR, we carefully define source and target domains. Basically, we employ COCO 2017 [23] as the source and PASCAL VOC [10] as the target. All images containing any of the 20 PASCAL VOC object categories are then removed from COCO 2017. As a result, 21,413 training images and 900 validation images of 60 object categories remain in the source domain dataset. Note that we train a single UBBR with the above dataset, and apply the model to all applications without task-specific finetuning.
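The construction of the source domain amounts to a simple filter over image-level category sets, sketched below. The sketch assumes a hypothetical in-memory mapping from image id to the category names annotated in that image; it does not use the COCO API, and the mapping of the 20 PASCAL VOC classes to COCO category names is our own assumption rather than something given in the text.

```python
# COCO names assumed to correspond to the 20 PASCAL VOC classes.
VOC_IN_COCO = frozenset({
    "airplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "dining table", "dog", "horse", "motorcycle", "person",
    "potted plant", "sheep", "couch", "train", "tv",
})


def filter_source_images(image_categories):
    """Keep only images that contain no object of a PASCAL VOC category.

    image_categories: dict mapping image id -> iterable of category names.
    """
    return {image_id: categories
            for image_id, categories in image_categories.items()
            if VOC_IN_COCO.isdisjoint(categories)}
```

Note that an image is dropped as soon as it contains a single object of a target category, which explains why the source set is much smaller than the full COCO 2017 train set.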

4.2 Implementation Details

Training is carried out using stochastic gradient descent with momentum and weight decay. The momentum and weight decay multiplier are set to 0.9 and 0.0005, respectively. The learning rate starts from \(10^{-3}\) and is divided by 10 whenever the validation loss stops improving. We stop the training when the learning rate becomes \(10^{-6}\). In all experiments, we employ ResNet101 [14] (up to conv4) pre-trained on ImageNet as the backbone convolutional layers. The fully-connected layers are composed of three linear layers with ReLU activations. The weight parameters of the fully-connected layers are randomly initialized from zero-mean Gaussian distributions with standard deviation 0.001, and their biases are initialized to 0. For both training and testing, input images are rescaled using bilinear interpolation such that their shorter side becomes 600 pixels. We generate 50 random bounding boxes for each ground-truth object.
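The learning-rate schedule and stopping rule can be illustrated with the following sketch, which replays a sequence of per-epoch validation losses and records the learning rate used in each epoch. The criterion for "stops improving" is not specified in the text; here we assume, purely for illustration, that a single epoch without a new best validation loss triggers the division by 10. The function name is hypothetical.

```python
def plateau_schedule(val_losses, lr=1e-3, stop_lr=1e-6, factor=10.0):
    """Return the learning rate used in each epoch before training stops.

    Assumption: one epoch without improvement of the best validation loss
    counts as a plateau. Training stops once lr reaches stop_lr.
    """
    best = float("inf")
    used = []
    for loss in val_losses:
        used.append(lr)
        if loss < best:
            best = loss
        else:
            lr /= factor
            # Small relative tolerance guards against floating-point error.
            if lr <= stop_lr * (1 + 1e-9):
                break
    return used
```

Under this rule the model is trained at three learning rates only, \(10^{-3}\), \(10^{-4}\), and \(10^{-5}\), since reaching \(10^{-6}\) ends the training.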

Table 1. Average precision (IoU > 0.5) for weakly supervised object detection on PASCAL VOC 2007 test set. For baseline model, we train OICR using published code and extract detection results from it. We refer to this model as OICR-ours. t is IoU threshold for random box generation. The models trained with smooth L1 and IoU losses are denoted by UBBR-sl1 and UBBR-iou, respectively.
Table 2. Performance improvement of iterative refinement.

4.3 Weakly Supervised Object Detection

To demonstrate the effectiveness of UBBR, we apply our model as a post-processing module for weakly supervised object detection. The goal of weakly supervised object detection is to learn object detectors with only image-level class labels as supervision. Due to the significantly limited supervision, models in this category often fail to localize the entire body of a target object and cover only a discriminative part of it. Thus, UBBR can help to improve localization by refining the bounding boxes estimated by a weakly supervised object detection model. This setting can also be considered transfer learning for weakly supervised object detection, where UBBR transfers the bounding box knowledge of the source domain to the target domain.

We use OICR [36] as a baseline model for weakly supervised object detection, and apply UBBR to the output of OICR. The quantitative analysis of the performance on PASCAL VOC 2007 is summarized in Table 1, in which one can see that UBBR improves the object localization quality substantially. We also validate the effect of the threshold t by applying UBBR models learned with two different values of t. In general, the model with a smaller t performs better than that with a larger t, since decreasing t during training exposes UBBR to more diverse and challenging box localization examples. We also report the performance of the models learned with the conventional smooth \(L_1\) loss. Figure 3 presents qualitative results of our approach.

Fig. 3. Qualitative results of (OICR + UBBR) on PASCAL VOC 2007 test set. Yellow boxes are detection results of OICR and blue boxes are refined bounding boxes. From top to bottom, the rows show the results after 1, 2, and 3 refinement iterations, respectively. (Color figure online)

Besides the above straightforward application of UBBR, we further explore ways to better utilize UBBR and provide more detailed analysis on its various aspects in the context of weakly supervised object detection as follows.

Fig. 4. Box refinement examples of the bike class. Yellow boxes are detection results of OICR and blue boxes are refined bounding boxes. From top to bottom, the rows show the results after 1, 2, and 3 refinement iterations, respectively. The left three examples are failure cases, and the right two examples are successful cases. (Color figure online)

Iterative Refinement: UBBR can also be applied multiple times iteratively so that localization is progressively improved. That is, at each iteration, the bounding boxes refined in the previous step are fed into the network again. Through this strategy, we can obtain better localization results. It is important to note that, for efficiency of the overall procedure, we reuse the convolutional feature map of the backbone network. As can be seen in Table 2, iterative refinement further improves the localization performance, and the effect was consistent up to the third iteration.
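The iterative procedure is a short loop around the regressor, as the following sketch shows. The callable `predict_offsets` is a hypothetical stand-in for the box-wise part of UBBR (RoI-Align plus fully-connected layers) evaluated on a feature map that is computed once and reused; `apply_offsets` assumes the parametrization of Eq. (2); and `make_toy_predictor` is only a toy substitute for a trained network, included so that the example runs on its own.

```python
import math


def apply_offsets(b, offsets):
    """Refine box (center_x, center_y, w, h); assumed to follow Eq. (2)."""
    x, y, w, h = b
    t_x, t_y, t_w, t_h = offsets
    return (x + t_x * w, y + t_y * h, w * math.exp(t_w), h * math.exp(t_h))


def refine_iteratively(boxes, predict_offsets, n_iters=3):
    """Feed the boxes refined in the previous step back into the regressor."""
    for _ in range(n_iters):
        offsets = predict_offsets(boxes)  # backbone features are reused here
        boxes = [apply_offsets(b, o) for b, o in zip(boxes, offsets)]
    return boxes


def make_toy_predictor(target):
    """Toy regressor: moves every box halfway toward a fixed target box."""
    def predict_offsets(boxes):
        return [(0.5 * (target[0] - x) / w,
                 0.5 * (target[1] - y) / h,
                 0.5 * math.log(target[2] / w),
                 0.5 * math.log(target[3] / h))
                for x, y, w, h in boxes]
    return predict_offsets
```

For example, `refine_iteratively([(0.0, 0.0, 10.0, 10.0)], make_toy_predictor((8.0, 0.0, 10.0, 10.0)))` moves the box center from x = 0 toward x = 8 in three successively smaller steps.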

Limitation: As Table 1 shows, the quality of refined localization for the bike class is worse than the baseline. Furthermore, iterative refinement makes the quality even worse, as shown in Table 2. This means that UBBR degrades localization of the bike class, and we found that this is a side effect of the class-agnostic nature of UBBR. Figure 4 shows box refinement examples of the bike class. The left three examples are failure cases, and the right two are successful cases. Most failure cases of the bike class occur when there is a person riding the bike. Because UBBR predicts class-agnostic bounding boxes, it does not distinguish between bike and person and recognizes them as a single object in these examples. As illustrated in the two rightmost columns, when there is no person on the bike, UBBR successfully localizes the bike.

Table 3. Average precision (IoU > 0.5) for weakly supervised object detection on PASCAL VOC 2007 test set. COCO-60 is our main dataset excluding 20 categories from original COCO 2017 dataset. COCO-21 and COCO-40 are more reduced datasets which contain 21 and 40 categories respectively. COCO-full is the original COCO 2017 train set which contains 80 categories.
Table 4. Effect of box generation parameters \(\alpha \) and \(\beta \) on the performance of weakly-supervised object detection. \(\alpha = 0.35\) and \(\beta = 0.5\) are used in all other experiments.

Generalizability: The previous experiments already validated that our approach generalizes to unseen object classes of the target domain. To further demonstrate the generalizability, we analyze the performance of UBBR models trained with an even smaller number of object classes. To this end, we build two additional training sets by reducing the number of object classes. COCO-40 is composed of 40 categories, excluding animal, accessory, electronic, and appliance classes from the original training data. COCO-21 consists of 21 classes and is obtained by further excluding furniture, indoor, and food classes from COCO-40. The original training dataset is denoted by COCO-60. Moreover, to eliminate the effect of dataset size, we make the sizes of COCO-40 and COCO-21 identical to that of COCO-60 by randomly sampling 21,413 images containing at least one object belonging to the categories of interest.

We report the performance of UBBRs learned with COCO-40 and COCO-21 in Table 3. Although the models trained with these datasets perform worse due to the lack of diversity in their training data, they still improve localization performance substantially. An interesting observation is that they improve localization of animals although their training datasets do not include animal classes. The results indicate that UBBR generalizes well to unseen and unfamiliar classes. We also report the performance of UBBR models learned with the full COCO 2017 train set, which is denoted by COCO-full and contains all PASCAL VOC classes. It is natural that UBBR trained with COCO-full outperforms the others, but the differences in performance are marginal.

Box Generation Parameters: The box generation parameters \(\alpha \) and \(\beta \) are chosen empirically to generate diverse and sufficiently overlapping boxes. Table 4 shows how these parameters affect the performance of weakly-supervised object detection when t is 0.3. As shown in the table, the performance is not very sensitive to either parameter. In all other experiments, \(\alpha = 0.35\) and \(\beta = 0.5\) are used. Note that we did not optimize those parameters using the evaluation results.

Fig. 5. Recall of box proposals on the PASCAL VOC 2007 test set. (left) recall@IoU = 0.5. (right) recall@IoU = 0.7.

4.4 Object Proposals

For the second application, we employ UBBR as a region proposal generator. Similarly to RPN [34], we generate seed bounding boxes of various scales and aspect ratios and place them uniformly over the image. We feed them into UBBR so that each seed bounding box is refined to enclose its nearest object. To select object proposals from the refined bounding boxes, we assign a score \(s_n\) to each bounding box \(b_n\). Under the assumption that the refined bounding boxes will be concentrated around real objects, \(s_n\) is initially set to the number of adjacent bounding boxes whose IoU with \(b_n\) is greater than 0.7. After that, we apply non-maximum suppression (NMS) with an IoU threshold of 0.6. In the NMS procedure, instead of removing adjacent bounding boxes, we divide their scores by 10, which is similar to Soft-NMS [4]. In Fig. 5, the performance of proposals generated by our method is quantified and compared with popular proposal techniques [1, 2, 5, 7, 9, 17, 18, 21, 26, 30, 31, 38]. UBBR clearly outperforms the previous methods in the comparison. Note that unlike many other methods (except SelectiveSearch [18]), UBBR does not use any images of PASCAL object classes for training. We also evaluate RPN [34] in the same transfer learning scenario as ours, where we train RPN on the COCO-60 dataset and evaluate it on the PASCAL VOC dataset. Note that we use the same backbone network for both RPN and UBBR. As shown in Fig. 5, UBBR outperforms RPN, in particular under a tighter IoU criterion. Note that the x axis of the figure starts from recall at \(10^0\) proposals rather than \(10^1\) proposals. Figure 6 presents qualitative examples of object proposals obtained by our method.
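The scoring and suppression steps can be sketched as follows. This is one possible reading of the description, not the authors' code: scores are initialized by counting neighbors with IoU above 0.7, and a greedy pass then repeatedly selects the highest-scoring remaining box and divides by 10 the scores of the remaining boxes whose IoU with it exceeds 0.6. The function names and the quadratic-time implementation are our own choices for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (center_x, center_y, width, height)."""
    iw = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2)
             - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    ih = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2)
             - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def rank_proposals(boxes, count_thr=0.7, nms_thr=0.6, decay=10.0):
    """Return (box, score) pairs ordered from best to worst proposal."""
    n = len(boxes)
    # Initial score s_n: number of other refined boxes with IoU > count_thr.
    scores = [float(sum(1 for j in range(n)
                        if j != i and iou(boxes[i], boxes[j]) > count_thr))
              for i in range(n)]
    remaining = list(range(n))
    ranked = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        ranked.append((boxes[best], scores[best]))
        # Soft suppression: decay neighbors' scores instead of removing them.
        for i in remaining:
            if iou(boxes[best], boxes[i]) > nms_thr:
                scores[i] /= decay
    return ranked
```

With two clusters of refined boxes, the top-ranked proposals come from different clusters, because near-duplicates of an already selected box are pushed down the ranking while still remaining available as lower-ranked proposals.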

Fig. 6. Visualization of top-10 region proposals generated by the proposed method.

Table 5. Object discovery accuracy in CorLoc on PASCAL VOC 2007 trainval set.

4.5 Object Discovery

For the last application, we choose the task of object discovery, which aims at localizing objects in images. Since most previous methods consider localization of a single foreground object per image, object discovery can be viewed as an extreme case of object proposal generation where only the top-1 proposal is used for evaluation. The correct localization (CorLoc) metric is an evaluation metric widely used in related work [8, 19, 22], and is defined as the percentage of images correctly localized according to the PASCAL criterion: \(\frac{area(b_p \cap b_{gt})}{ area(b_p \cup b_{gt})} > 0.5\), where \(b_p\) is the predicted box and \(b_{gt}\) is the ground-truth box. For evaluation on the PASCAL VOC 2007 dataset, we use all images in the PASCAL VOC 2007 trainval set, discarding images that contain only ‘difficult’ or ‘truncated’ objects. We report the performance in Table 5. UBBR significantly outperforms the previous approaches to object discovery [8, 22], which implies that generic object information can be effectively learned by UBBR and transferred to the task of object discovery.
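The CorLoc metric can be computed as in the sketch below. The data layout (dictionaries keyed by image id) and function names are hypothetical, and we assume that an image with several ground-truth boxes counts as correctly localized if the predicted box satisfies the PASCAL criterion for at least one of them.

```python
def iou(a, b):
    """IoU of two boxes given as (center_x, center_y, width, height)."""
    iw = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2)
             - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    ih = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2)
             - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def corloc(predictions, ground_truths, thr=0.5):
    """Percentage of images whose top-1 box has IoU > thr with some GT box.

    predictions: dict image id -> predicted box.
    ground_truths: dict image id -> list of ground-truth boxes.
    """
    hits = sum(1 for image_id, gts in ground_truths.items()
               if any(iou(predictions[image_id], g) > thr for g in gts))
    return 100.0 * hits / len(ground_truths)
```

Since only one box per image is evaluated, CorLoc rewards a method for ranking a well-localized box first rather than for covering all objects in the image.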

5 Conclusion

We have studied bounding box regression in a novel and interesting direction. Unlike the regressors commonly embedded in recent object detection networks, our model is class-agnostic and free from manually defined anchor boxes. These properties allow our model to be universal, to generalize well to unseen classes, and to transfer to multiple diverse tasks demanding accurate bounding box localization. Such advantages of our model have been verified empirically in various tasks including weakly supervised object detection, object proposal, and object discovery.