1 Introduction

Unlike fully supervised object detection (FSOD) [19, 39, 41, 42], which requires bounding-box-level annotations, weakly supervised object detection (WSOD) needs only image-level labels. Such relaxation significantly reduces the labelling cost and brings great flexibility to many real-world applications.

In a standard pipeline, state-of-the-art WSOD methods first extract region-proposal features from backbone networks using operations such as RoIPool [19]. Task-specific heads, i.e., WSOD heads, are then built on top of the backbones to localize object instances and learn proposal features jointly. Despite the promising progress made in recent years, there is still a large performance gap between WSOD and FSOD. Prevailing methods generally focus on designing WSOD heads and seldom touch the design of backbone networks; most state-of-the-art WSOD methods are still built on plain network architectures, e.g., VGG16 [52], VGG-F (M, S) [28] and AlexNet [1], leaving deep residual networks under-explored.

Table 1. Comparisons of different backbones for WSDDN [7] on VOC 2007 [15].

In contrast, it is well known that backbones are important to FSOD for both detection accuracy and inference speed. For accuracy, by simply replacing VGG16 with ResNet [26], Faster R-CNN [42] increases mAP@0.5 from 41.5%/75.9% (VGG16) to 48.4%/83.8% (ResNet-101) on COCO and PASCAL VOC 2012, respectively. For speed, light-weight backbones [44, 62] significantly reduce model size and computational complexity. Backbones proposed in [49, 77] also enable training detectors from scratch.

However, directly replacing plain networks with residual networks in WSOD leads to a significant performance drop. As an investigation, we first quantify the performance of deep residual networks for WSOD under various combination schemes, as shown in Fig. 1. We build the WSDDN [7] head on various plain and residual backbones and evaluate them on PASCAL VOC 2007 [15]. As shown in Table 1, ResNet [26] and DenseNet [23] deteriorate detection performance and are even inferior to AlexNet [1] in terms of mAP. Moreover, some state-of-the-art methods [27, 55, 59] are unable to converge, as shown in Table 4.

In this paper, we investigate what it takes to make deep residual networks workable in WSOD. The underlying problem is that WSOD heads are sensitive to model initialization [5, 9, 11, 32] and suffer from instability [38], which may back-propagate uncertain and erroneous gradients to backbones and deteriorate visual representation learning. Specifically, we propose a sequence of design principles to take full advantage of deep residual networks from three perspectives, i.e., adding redundancy, improving robustness and aligning features.

Fig. 1. Various schemes to adapt deep residual networks for WSOD.

1. Redundant adaptation neck. Directly employing ResNet backbones to train WSOD deteriorates the discriminability of proposal features and fails to localize object instances accurately. The shortcut connections in residual blocks also enlarge the uncertain and erroneous gradients, which overwhelm the direction of optimization steps. Therefore, our first principle is a redundant adaptation neck with high-dimensional proposal representations between deep residual backbones and WSOD heads, which serves as the key to localizing object instances and learning discriminative features jointly.

2. Robust information flow. We have also found that ResNet suffers from uncertainty around object boundaries and imperceptibility of small instances under weak supervision. This is mainly caused by the large-kernel (7 \(\times \) 7) convolution and non-maximum down-sampling, i.e., 2 \(\times \) 2 strided convolution and AveragePool, which lose highly informative features from the raw images. We show that small-kernel convolutions and MaxPool down-sampling provide finer object boundaries and preserve the information of small instances, which enhances the robustness of the information flow through the networks.

3. Proposal feature alignment. Modern residual networks commonly achieve large receptive fields by applying an overall stride of \(32 \times \) sub-sampling. However, such coarse feature maps lead to feature misalignment due to the quantizations in the RoIPool [19] layer, which introduces confusing context and lacks diversity. By exploiting dilated convolution to extract high-resolution feature maps for WSOD, we are able to support efficient alignment of proposal features and exploit diverse local information, as well as to detect small objects.

We implement two instantiations of the proposed principles: ResNet-WS and DenseNet-WS. Extensive experiments are conducted on PASCAL VOC [15] and MS COCO [37]. We show that the proposed principles enable deep residual networks to achieve significant improvements over plain networks for various WSOD methods, establishing new state-of-the-art results.

2 Related Work

2.1 Weakly Supervised Object Detection

Prevailing WSOD work generally focuses on two successive stages, object discovery and instance refinement.

The object discovery stage combines multiple instance learning (MIL) and CNNs to implicitly model latent object locations with image-level labels. Several strategies to train the MIL model have been proposed in the literature [6, 8, 17, 51, 61, 63]. Bilen et al. [7] selected proposals via parallel detection and classification branches. Contextual information [27], attention mechanisms [58], saliency maps [31, 46, 48] and semantic segmentation [64] have been leveraged to learn outstanding proposals. High-precision object proposals for WSOD are generated in [30, 57]. Some methods focus on proposal-free paradigms with deep feature maps [3, 4, 78], class activation maps [12, 21, 69, 70, 75] and generative adversarial learning [13, 45]. Others use additional information to improve performance, e.g., object-size estimation [51], instance-count annotations [16], video-motion cues [30, 53] and human verification [40]. Knowledge transfer has also been exploited for cross-domain adaptation w.r.t. data [50] and task [24].

The instance refinement stage aims at explicitly learning object locations by making use of the predictions from the object discovery stage. The top-scoring proposals generated from the object discovery stage are used as supervision to train the instance refinement classifier [16, 25, 32, 56, 65]. Other strategies [29, 43, 55, 71] have also been proposed to generate pseudo-ground-truth boxes and label proposals. Some methods improve the optimization of the overall framework by jointly learning the two-stage modules with a min-entropy prior [34, 60], multi-view learning [72] and continuation MIL [59]. Collaboration mechanisms between segmentation and detection have been proposed to take advantage of the complementary interpretations of weakly supervised tasks [33, 47].

With the output of the above two stages, a fully-supervised detector can also be trained. Many efforts [18, 74] have been made to mine high-quality bounding boxes. Zhang et al.  [73] proposed a self-directed optimization to propagate object priors of the reliable instances to unreliable ones.

2.2 Network Architectures for Object Detection

Significant efforts have been devoted to the design of network architectures for FSOD. DSOD [49] and Root-ResNet [77] train single-shot detectors, i.e., SSD [39], from scratch, whilst PeleeNet [62] is proposed to run SSD on mobile devices. Li et al. [35] proposed the DetNet backbone for FSOD. Fine feature maps are also useful for detecting small objects, as observed in FPN [36].

In conclusion, most traditional backbone networks are designed for image classification or FSOD; to our knowledge, no prior work explores backbone networks for WSOD. Moreover, cutting-edge WSOD methods follow the pipeline of ImageNet pre-trained plain networks, i.e., VGG-style networks. The advanced modules in recent deep residual architectures thus remain unexplored in WSOD.

Table 2. Result of freezing different number of stages in ResNet for WSDDN [7] on VOC 2007 [15]. “NAN” indicates that the training is non-convergent.
Fig. 2. Visualization of proposal features on PASCAL VOC using t-SNE [20].

Fig. 3. Optimization landscape analysis of WSDDN with different backbones.

3 Baseline WSOD

Without loss of generality, we consider building WSOD models on pre-trained backbones and fine-tuning their parameters on the target data. We use the popular WSDDN [7] method as the baseline WSOD head, which is also a basic module in many state-of-the-art approaches [27, 55, 56, 59, 65].

We first investigate several combination schemes common in FSOD to build WSOD heads on ResNet and DenseNet, which are widely used for Faster R-CNN [42], as illustrated in Fig. 1. The C4 [42] combination performs RoIPool [19] on the full-image feature maps from the first 4 stages. All layers in the conv5 stage and the WSOD heads are stacked sequentially on the RoIPooled features. The FPN [36] combination learns full-image feature pyramids from backbones. RoIPool is then performed to extract \(7 \times 7\) proposal features, followed by two hidden 1,024-d fully-connected (FC) layers before the WSOD heads. Besides, we also consider a solution, termed the C5 [19] combination, which computes full-image feature maps using all convolutional layers (all 5 stages), followed by a RoIPool layer and subsequent layers.

As shown in Table 1, directly employing ResNet and DenseNet for the WSOD task reduces performance dramatically in all combinations. The best performance of 31.5 mAP is obtained with the C4 combination, which is still inferior to the shallow AlexNet backbone in terms of mAP. Moreover, some state-of-the-art methods [27, 55, 59] are unable to converge according to further experiments in Table 4. We focus on the C5 combination in the rest of the paper, as the C4 and FPN combinations have drawbacks in the WSOD setting. The C4 combination computes the entire conv5 stage for each proposal; it thus costs about \(10\times \) training time and \(100\%\) additional memory compared with the C5 combination when each image has about 2,000 proposals. The FPN combination imposes an extra burden of learning top-down full-image feature pyramids with lateral connections.
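The cost argument above can be sketched with back-of-the-envelope arithmetic. All layer counts, channel widths and map sizes below are illustrative assumptions rather than measured values from the paper:

```python
# Rough multiply-add comparison of the C4 vs. C5 combinations.
# Numbers here are illustrative assumptions, not measurements.

def conv_flops(h, w, c_in, c_out, k=3):
    """Approximate multiply-adds of one k x k convolution on an h x w map."""
    return h * w * c_in * c_out * k * k

# Assume a conv5 stage with three residual blocks at 512 channels, applied
# either once to a 38 x 38 full-image map (C5) or to each of ~2,000 RoIPooled
# 7 x 7 proposal maps (C4).
blocks, c = 3, 512
c5_cost = blocks * conv_flops(38, 38, c, c)        # conv5 once per image
c4_cost = 2000 * blocks * conv_flops(7, 7, c, c)   # conv5 once per proposal

ratio = c4_cost / c5_cost
print(f"C4 / C5 conv5 cost ratio: {ratio:.1f}x")
```

Even under these crude assumptions, per-proposal conv5 computation dominates by more than an order of magnitude, which is why the paper focuses on C5.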

Different from FSOD, WSOD has insufficient supervision and is often formulated via multiple instance learning (MIL) [14], which is sensitive to model initialization [5, 9, 11, 32] and suffers from instability [38]. In this sense, WSOD heads may back-propagate uncertain and erroneous gradients to backbones, whilst deep residual networks enlarge the erroneous information and deteriorate visual representation learning, which results in dramatically reduced detection performance. To further verify this analysis, we freeze different numbers of stages in ResNet and show the results in Table 2. We summarize: 1) The detection performance (mAP) improves progressively as pre-trained layers are frozen up to 4 stages, because freezing prevents convolutional layers in backbones from receiving erroneous information from WSOD heads. 2) When the entire backbone is frozen, i.e., all 5 stages, the model does not have enough capacity for representation learning (mAP drops dramatically) and even fails to converge. 3) A larger learning rate, i.e., 0.01, improves the performance of models with 3 and 4 frozen stages. However, such a large learning rate also enlarges the erroneous information, which results in non-convergent models with 0 and 2 frozen stages. 4) In contrast to mAP on the test set, the localization performance (CorLoc), evaluated on the trainval set, becomes worse as more stages are frozen, mainly due to overfitting. In the following sections, we propose a sequence of design principles to take full advantage of deep residual learning for WSOD.

4 Redundant Adaptation Neck

We visualize the distribution of proposal features uniformly sampled from the PASCAL VOC 2007 trainval set [15] using t-SNE [20] in Fig. 2. We compare VGG16 (V-16) with ResNet of 18 (R-18), 50 (R-50) and 101 (R-101) layers. Proposal features from RoIPool and subsequent layers, i.e., conv5, fc6 and fc7 for VGG16 and conv5 for ResNet, are shown. In Fig. 2a, we observe that the proposal features of the FC layers from fine-tuned VGG16 are more discriminative than those of the pre-trained ones, whilst the distribution of the features from conv5 changes only slightly. However, Fig. 2b, 2c and 2d show that the proposal features of ResNet are not discriminative enough to distinguish different categories. Worse still, the proposal features of ResNet50 and ResNet101 deteriorate compared with their pre-trained counterparts. To further explore the training procedure, we also draw the optimization landscape curves for different backbones in Fig. 3. Generally, the optimization loss indicates how well the models reason about the relationships among proposals to satisfy the constraints imposed in WSOD. VGG16 demonstrates faster convergence and lower loss than the ResNet backbones, which converge to undesirable local minima.

In conclusion, we observe indiscriminative proposal representations and poor convergence when directly employing ResNet backbones in the WSOD task, which cause deteriorated detection performance. Since WSOD must localize object instances and learn proposal features jointly with only image-level labels, directly stacking WSOD heads on top of residual networks has a large negative impact on convolutional feature learning. Shortcut connections in residual blocks also enlarge the uncertain and erroneous gradients from WSOD heads throughout the backbones during back-propagation, which overwhelms the direction of optimization steps and fails to infer the proposal-level classifier.

Fig. 4. The first row shows input images. The remaining rows show gradient maps of VGG16, R-18-RAN, R-18-RAN-SK and R-18-RAN-SK-MP, respectively.

From the perspective of adding redundancy, we propose the first principle: a Redundant Adaptation Neck (RAN), which learns high-dimensional visual representations of proposals between deep residual backbones and WSOD heads, is the key to localizing object instances and learning discriminative features jointly. Our intuition is that the redundant feature representation satisfies the various WSOD constraints under weak supervision and decreases the negative impact of uncertain and erroneous gradients from WSOD heads, whilst the convolutional layers focus on full-image feature learning. We implement and visualize this principle for ResNet (ResNet-RAN). Instead of instantiating the RAN by stacking convolutional layers, we use multilayer perceptron layers, which are memory-feasible for extracting high-dimensional features for about 2,000 proposals. Specifically, the last global pooling layer in ResNet is replaced by two high-dimensional (2,048–4,096-d) FC layers before the WSDDN heads. We show the proposal features from conv5 and the two FC layers of the RAN, i.e., ran1 and ran2, in Fig. 2e and 2f. ResNet18-RAN and ResNet50-RAN obtain discriminative proposal features in the ran1 and ran2 layers. Figure 3 shows that ResNet-RAN also converges to a better minimum. It demonstrates that the entangled tasks of localizing object instances and learning proposal features are optimized jointly.
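As a concrete, heavily shrunken illustration of the neck, the following pure-Python sketch runs two FC layers with ReLU on a single RoIPooled feature vector. The tiny dimensions (8 and 16) stand in for the actual \(7 \times 7 \times C\) inputs and high-dimensional (e.g., 4,096-d) layers; they and the random initialization are assumptions for readability, not the trained model:

```python
import random

def linear(x, w, b):
    """Fully-connected layer: y[j] = sum_i x[i] * w[j][i] + b[j]."""
    return [sum(xi * wji for xi, wji in zip(x, row)) + bj
            for row, bj in zip(w, b)]

# Hypothetical miniature RAN. in_dim stands in for flattened RoIPool output,
# hid for the high-dimensional FC width; both sizes are illustrative.
in_dim, hid = 8, 16
random.seed(0)
w1 = [[random.gauss(0, 0.1) for _ in range(in_dim)] for _ in range(hid)]
b1 = [0.0] * hid
w2 = [[random.gauss(0, 0.1) for _ in range(hid)] for _ in range(hid)]
b2 = [0.0] * hid

def ran_forward(roi_feat):
    """ran1 -> ReLU -> ran2 -> ReLU, mirroring the proposed two-FC neck."""
    h = [max(0.0, v) for v in linear(roi_feat, w1, b1)]
    return [max(0.0, v) for v in linear(h, w2, b2)]

proposal = [random.gauss(0, 1) for _ in range(in_dim)]
out = ran_forward(proposal)
print(len(out), "dimensional representation fed to the WSOD head")
```

In the full model this forward pass is applied to every one of the ~2,000 proposals, which is why FC layers (rather than stacked convolutions) keep the neck memory-feasible.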

Fig. 5. The two rows show sampling locations of \(7^2\) discrete bins with maximum values along channels in RoIPool for R-18-RAN and R-18-RAN-DC, respectively.

To further explore the limit of the RAN, we freeze all convolutional layers in the backbones, which completely removes the effect of the WSOD heads on the convolutional layers. Figure 2g and 2h show that the proposal features become even more discriminative. Meanwhile, the optimization landscape in Fig. 3 also improves (ResNet-RAN F5). This interesting observation shows that the RAN has elastic capacity to accommodate the entangled tasks.

Table 3. Comparison of various proposal feature extractors for WSDDN [7] on PASCAL VOC 2007 [15] test in terms of mAP (%).

5 Robust Information Flow

Residual learning greatly alleviates the problem of vanishing gradients in deep networks by enhancing information flow with skip connections. However, two main drawbacks of deep residual networks, i.e., ResNet and DenseNet, still hinder the robustness of the information flow to uncertain and erroneous gradients under weak supervision. First, the large-kernel (7 \(\times \) 7) convolutions in the stem block weaken the information of object boundaries, resulting in uncertainty around the object boundaries. Second, non-maximum down-sampling, i.e., 2 \(\times \) 2 strided convolutions and AveragePool, may also hurt the flow of information and make small instances imperceptible, as non-maximum down-sampling may not preserve the informative activations and gradients flowing through the network under weak supervision.

From the perspective of improving robustness, we propose the principle of using small-kernel (SK) convolutions and MaxPool (MP) down-sampling in the backbones to improve the robustness of the information flow, which gives finer object boundaries and more sensitivity to small objects. Specifically, we replace the original stem block with three consecutive 3 \(\times \) 3 convolutions, with the first and third convolutions followed by 2 \(\times \) 2 MaxPool layers. For down-sampling, we replace the strided convolution or AveragePool operation with MaxPool, which is set to 2 \(\times \) 2 with 2 \(\times \) 2 stride to avoid overlap between input activations.
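The stem replacement can be checked with simple shape bookkeeping: both the original 7 \(\times \) 7 stride-2 stem and the modified small-kernel stem reduce a 224 \(\times \) 224 input to 56 \(\times \) 56 (overall stride 4), so the change affects how information is pooled, not the downstream feature-map sizes. The exact padding values below are assumptions following common ResNet practice:

```python
# Shape bookkeeping for the stem replacement. Both stems give overall stride 4
# on a 224 x 224 input; padding values are assumed (conv pad 3/1, pool pad 1/0).

def original_stem(size):
    size = (size + 2 * 3 - 7) // 2 + 1   # 7x7 conv, stride 2, padding 3
    size = (size + 2 * 1 - 3) // 2 + 1   # 3x3 MaxPool, stride 2, padding 1
    return size

def modified_stem(size):
    size //= 2   # 3x3 conv (stride 1, pad 1, size-preserving) + 2x2 MaxPool s2
    # second 3x3 conv (stride 1, pad 1) preserves the spatial size
    size //= 2   # third 3x3 conv + 2x2 MaxPool, stride 2
    return size

print("original stem: 224 ->", original_stem(224))
print("modified stem: 224 ->", modified_stem(224))
```

Because the spatial budget is unchanged, the modified stem can be swapped in without touching the later stages of the backbone.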

We utilize the gradient maps of input images to observe how information flows through the networks. In the second and third rows of Fig. 4, we observe that the gradients of object boundaries in R-18-RAN are blurrier than those of VGG16, and the gradients of some object parts and small instances are missing in R-18-RAN. In contrast, the gradient maps of R-18-RAN-SK provide finer object boundaries, and R-18-RAN-SK-MP responds to multiple small objects.

Table 4. Ablation study on PASCAL VOC 2007 test.

6 Proposal Feature Alignment

Modern deep residual networks commonly use 5 stages to extract full-image feature maps with \(32 \times \) sub-sampling. This brings large effective receptive fields, which are critical for high classification accuracy. However, the large stride may cause misalignment between region proposals and the features pooled by the RoIPool [19] layer. The feature misalignment is caused by two quantization operations: rounding coordinates after dividing them by the stride, and segmenting projected proposals into discrete bins. Although the misalignment has little negative impact on FSOD, it introduces serious feature ambiguity in WSOD, which further aggravates the instability problem.
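A small numeric sketch makes the first quantization concrete. The proposal coordinates below are arbitrary example values, not taken from the paper's experiments:

```python
# Demo of RoIPool's first quantization: proposal edges are rounded onto the
# feature grid, so a coarse stride shifts the pooled region in image space.

def snap_error(x1, x2, stride):
    """Pixel error introduced by rounding both edges to the feature grid."""
    fx1, fx2 = round(x1 / stride), round(x2 / stride)  # coordinate rounding
    return abs(fx1 * stride - x1) + abs(fx2 * stride - x2)

x1, x2 = 53, 201                 # a proposal spanning pixels [53, 201]
err32 = snap_error(x1, x2, 32)   # standard 32x sub-sampling
err8 = snap_error(x1, x2, 8)     # 8x sub-sampling, as with dilation
print(f"stride 32: {err32} px misalignment; stride 8: {err8} px")
```

With these example coordinates the stride-32 grid displaces the proposal edges by 20 px in total, versus 4 px at stride 8: the same rounding, applied to a finer grid, leaves far less ambiguity about which pixels a proposal covers.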

Table 5. Comparison with SotAs on PASCAL VOC 2007 test in terms of AP.

To address the misalignment of proposal features, we exploit dilated convolution (DC) [66, 76] to extract high-resolution full-image feature maps for WSOD. Specifically, we fix the spatial size after stage 3 and use dilated convolution with a rate of 2 in the subsequent stages, which results in only \(8 \times \) sub-sampling. We visualize the sampling locations of RoIPool in Fig. 5. In the first three columns, the sampling locations of RoIPool in R-18-RAN may exceed the borders of proposals due to the rounded coordinates, whilst R-18-RAN-DC constrains the sampling regions to lie inside the proposals. Quantizing proposals into discrete bins on low-resolution feature maps also reduces the diversity of sampling locations, as shown in the last five columns of Fig. 5, while the high-resolution feature maps from dilated convolution provide more diverse information.
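The loss of sampling diversity can also be illustrated numerically: for a small proposal, a 7-bin RoIPool on a stride-32 map collapses most bins onto the same feature cells, while a stride-8 map keeps them distinct. The proposal below is an arbitrary example, and the bin-edge formula is a simplified stand-in for the actual RoIPool implementation:

```python
# Count distinct (non-collapsed) RoIPool bins along one axis for a small
# proposal at output stride 32 vs. 8. Simplified integer-bin model.

def distinct_bins(x1, x2, stride, bins=7):
    fx1, fx2 = round(x1 / stride), round(x2 / stride)   # quantization 1
    width = max(fx2 - fx1, 1)
    edges = [fx1 + (i * width) // bins for i in range(bins + 1)]  # quantization 2
    # Zero-width bins collapse onto neighbouring cells and add no diversity.
    return len({(edges[i], edges[i + 1]) for i in range(bins)
                if edges[i + 1] > edges[i]})

small = (100, 148)   # a 48-px-wide object
print("stride 32:", distinct_bins(*small, 32), "distinct bins per axis")
print("stride  8:", distinct_bins(*small, 8), "distinct bins per axis")
```

Under this toy model the stride-32 map yields only 2 usable bins per axis for the 48-px proposal, while the stride-8 map yields 6, matching the qualitative observation in Fig. 5 that high-resolution maps sample more diverse locations.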

It is worth noting that RoIAlign [22] uses bilinear interpolation to compute exact values at sampled locations in discrete bins, aiming to address the quantization errors. However, RoIAlign samples activations at fixed positions, which results in inferior performance, as shown in Table 3.

7 Quantitative Results

Datasets. We evaluate the proposed design principles on PASCAL VOC 2007, 2012 [15] and MS COCO [37], which are widely-used benchmark datasets.

Table 6. Comparison with SotAs on VOC 2007 trainval in terms of CorLoc.

Evaluation Protocols. CorLoc indicates the percentage of images in which a method correctly localizes an object of the target category according to the PASCAL criterion. mAP follows the standard PASCAL VOC protocol, reported at \(50\%\) Intersection-over-Union (IoU) between detected boxes and ground-truth ones. For MS COCO, we report the standard COCO metrics, including AP at different IoU thresholds and instance scales.
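Both metrics rest on the same PASCAL IoU criterion; a minimal sketch of that computation follows, with made-up box coordinates:

```python
# Intersection-over-Union underlying both mAP@0.5 and CorLoc.
# Boxes are (x1, y1, x2, y2) in pixels; the example boxes are invented.

def iou(a, b):
    """IoU of two axis-aligned boxes."""
    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

pred, gt = (20, 20, 120, 120), (30, 30, 130, 130)
print(f"IoU = {iou(pred, gt):.3f}; counts as correct: {iou(pred, gt) >= 0.5}")
```

A detection (or localization, for CorLoc) counts as correct when this value reaches 0.5 against some ground-truth box of the target class.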

Implementation Details. All backbone networks are initialized with weights pre-trained on ImageNet ILSVRC [10]. We use synchronized SGD training on 4 GPUs. A mini-batch involves 1 image per GPU. In the multi-scale setting, we use scales of \(\{480, 576, 688, 864, 1200\}\). We set the maximum number of proposals in an image to 2,000. We freeze all pre-trained convolutional layers in backbones unless specified otherwise. The test scores are averaged over all scales and flips. Detection results are post-processed by non-maximum suppression with a threshold of 0.3.
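The non-maximum suppression step above can be sketched as the standard greedy procedure with the stated 0.3 IoU threshold; the detections below are invented example values:

```python
# Greedy NMS at IoU threshold 0.3. Detections are (score, x1, y1, x2, y2);
# the three example detections are made up for illustration.

def iou(a, b):
    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(dets, thresh=0.3):
    """Keep each detection only if it overlaps no higher-scoring kept box."""
    keep = []
    for d in sorted(dets, key=lambda d: d[0], reverse=True):
        if all(iou(d[1:], k[1:]) <= thresh for k in keep):
            keep.append(d)
    return keep

dets = [(0.9, 10, 10, 110, 110),    # kept: highest score
        (0.8, 20, 20, 120, 120),    # suppressed: heavy overlap with the first
        (0.7, 200, 200, 300, 300)]  # kept: disjoint object
print([d[0] for d in nms(dets)])
```

In the actual pipeline this runs per class on the score-averaged multi-scale detections before evaluation.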

Table 7. Comparison with SotAs on VOC 2012 in terms of mAP and CorLoc.

7.1 Ablation Study

We validate the contribution of each design principle on PASCAL VOC 2007 in Table 4. For rows (b–e), we report the results of applying each principle to ResNet18, which show consistent improvements over the original backbone (a). In particular, RAN (b) provides the largest performance gain among all principles, demonstrating that the RAN is key to localizing object instances and learning proposal features jointly. Rows (f–j) show that integrating different principles further improves detection performance. Compared with the baseline (a), the best performance improves significantly, by \(15.0\%\) mAP, demonstrating that the proposed principles are orthogonal to each other. Rows (k–r) show that state-of-the-art WSOD methods [27, 55, 56, 59] also receive significant performance boosts; the proposed principles for backbones are thus orthogonal to WSOD methods. Finally, rows (s–z) show that with different deep residual backbones, our models also outperform the corresponding baselines, with ResNet50 and ResNet101 gaining more than ResNet18 (\(15.0\%\) vs. \(17.5\%\) vs. \(18.4\%\) mAP).

7.2 Comparison with State of the Arts

To compare fully with other backbones, we separately report detection results for the two successive stages, i.e., object discovery and instance refinement. Table 5 and Table 6 show the results on VOC 2007 in terms of mAP and CorLoc, respectively. For object discovery methods, our models with ResNet-WS obtain 43.4–44.1% mAP and 63.1–64.0% CorLoc for WSDDN [7], significantly outperforming the previous results with VGG16 by 8.6–9.3% mAP and 9.6–11.5% CorLoc. The improvements of ResNet101-WS for ContextLocNet [27] are \(9.6\%\) mAP and \(10.6\%\) CorLoc. For the instance refinement methods, replacing the backbones of OICR [56], PCL [55] and C-MIL [59] with ResNet-WS sets new state-of-the-art results with improvements of 5.4–10.8% mAP.

Table 8. Comparison with the state-of-the-art methods on COCO minival set.

For CorLoc, our ResNet101-WS backbone surpasses all single-model detectors with improvements of \(8.3\%\), \(7.3\%\) and \(7.4\%\), respectively. Note that we freeze all convolutional layers in our backbones when fine-tuning on target data. When only the first two stages are frozen (ResNet18-WS F2) during training, PCL achieves further gains of \(1.7\%\) mAP and \(3.0\%\) CorLoc. Table 7 shows the results on VOC 2012: ResNet-WS models outperform all counterparts with different WSOD methods and achieve new state-of-the-art results. The superiority of ResNet-WS mainly comes from successfully optimizing the entangled tasks of jointly localizing object instances and learning discriminative features. Table 8 shows the results on MS COCO, where the ResNet18-WS backbone surpasses existing models on all metrics. For AP\(_{0.5:0.95}\), our models outperform compared methods by at least \(1.8\%\). Performance is significantly improved for small instances (\(44.8\%\) relative improvement for ContextLocNet [27]), which also indicates the effectiveness of improving robustness and aligning features.

8 Conclusion

In this paper, we propose a sequence of design principles to take full advantage of deep residual learning for the WSOD task. Extensive experiments show that the proposed principles enable deep residual networks to achieve significant performance improvements over plain networks for various WSOD methods, establishing new state-of-the-art results. Note that our contributions are not specific to ResNet or DenseNet: other backbones (e.g., GoogLeNet [54], WideResNet [67]) can also benefit from the proposed principles for the WSOD task.