1 Introduction

Convolutional Neural Networks (CNNs) have achieved breakthroughs in image classification [8] via supervised training on large-scale datasets, e.g., ImageNet [4]. However, when the dataset is small, over-parametrized CNNs tend to simply memorize it and cannot generalize well to unseen data [21]. To alleviate this over-fitting problem, several regularization techniques have been proposed, such as Dropout [15] and BatchNorm [11]. In addition, some works seek to combat over-fitting by re-designing the CNN building blocks to endow the model with encouraging properties (e.g., translation invariance [12] and shift-invariance [22]).

Recently, self-supervised learning has shown great potential for learning useful representations from data without external label information. In particular, contrastive learning methods [1, 7] have demonstrated advantages over other self-supervised methods in learning representations that transfer better to downstream tasks. Compared to supervised learning, representations learned by self-supervised learning are unbiased with respect to image labels, which effectively prevents the model from over-fitting the patterns of any particular object category. Furthermore, the data augmentation in modern contrastive learning [1] typically involves more diverse transformation strategies than those used in supervised learning. This also suggests that contrastive learning can better capture the diversity of the data than supervised learning.

In this paper, we go one step further by exploring the capability of contrastive learning under the data-deficient setting. Our key motivation is that the label-unbiased and highly expressive representations learned by self-supervised learning can largely prevent the model from over-fitting a small training dataset. Specifically, we design a new two-phase pipeline for data-deficient image classification. The first phase utilizes self-supervised contrastive learning as a proxy task for learning useful representations, which we regard as visual priors obtained before any image labels are used. The second phase uses the weights obtained from the first phase as the starting point and leverages the label information to fine-tune the model for classification.

In principle, self-supervised pre-training is an intuitive approach for preventing over-fitting when labeled data are scarce, yet constructing the pre-training and fine-tuning pipeline properly is critical for good results. Specifically, two problems need to be solved. First, the common practice in self-supervised learning is to maintain a memory bank for negative sampling. While MoCo [7] has demonstrated accuracy gains with increased bank size, the maximum bank size is limited in the data-deficient setting. To address this issue, we propose a margin loss that can reduce the bank size while maintaining the same performance; we hope this is also helpful for fast experimentation and evaluation. Second, directly fine-tuning the full model on a small dataset still risks over-fitting. Motivated by the observation that fine-tuning only a linear classifier on top of the pre-trained representation already yields good results, we propose to utilize a recently published feature distillation method [9] to perform self-distillation between the pre-trained teacher model and a student model. This self-distillation module regularizes the model against forgetting the visual priors learned in the contrastive learning phase, and thus further prevents over-fitting on the small dataset.

2 Related Work

Self-supervised learning focuses on obtaining good representations of data from heuristically designed proxy tasks, such as image colorization [23], tracking objects in videos [17], de-noising auto-encoders [16] and predicting image rotations [6]. Recent works using contrastive learning objectives [18] have achieved remarkable performance, among which MoCo [2, 7] is the first self-supervised method that outperforms supervised pre-training on multiple downstream tasks. In SimCLR [1], the authors show that the augmentation policy used by self-supervised methods is quite different from, and often harder than, that of supervised methods. This suggests that self-supervised representations can be richer and more diverse than their supervised counterparts.

Knowledge distillation aims to distill useful knowledge or representations from a teacher model to a student model [10]. The original formulation uses the predicted logits to transfer knowledge from teacher to student [10]. Later works found that transferring the knowledge conveyed by feature maps from teacher to student can lead to better performance [14, 20]. Heo et al. [9] provided a comprehensive overhaul of how to effectively distill knowledge from feature maps, which also inspires our distillation design. Self-distillation uses the same model architecture for both teacher and student [5] and has been shown to improve performance. We utilize self-distillation as a regularization term to prevent our model from over-fitting.

3 Method

Our method contains two phases. The first phase uses the recently published MoCo v2 [2] to pre-train the model on the given dataset to obtain good representations; the learned representations can be considered visual priors obtained before using the label information. The second phase initializes both the teacher and the student used in the self-distillation process with the pre-trained weights. The weights of the teacher are frozen, and the student is updated using a combination of the classification loss and the overhaul feature distillation (OFD) [9] loss computed against the teacher. As a result, the student model is regularized by the teacher's representation while performing the classification task. The two phases are visualized in Fig. 1.

Fig. 1. The two phases of our proposed method. The first phase is to construct a useful visual prior with self-supervised contrastive learning, and the second phase is to perform self-distillation on the pre-trained checkpoint. The student model is fine-tuned with a distillation loss and a classification loss, while the teacher model is frozen.

3.1 Phase-1: Pre-Train with Self-Supervised Learning

The original loss used by MoCo is as follows:

$$\begin{aligned} \mathcal {L}_{\text {moco}}=- \log \left[ \frac{\exp \left( \mathbf {q} \cdot \mathbf {k^{+}} / \tau \right) }{\exp \left( \mathbf {q} \cdot \mathbf {k^{+}} / \tau \right) + \sum _{\mathbf {k^{-}}} \exp \left( \mathbf {q} \cdot \mathbf {k^{-}} / \tau \right) } \right] \,, \end{aligned}$$
(1)

where \(\mathbf {q}\) and \(\mathbf {k^{+}}\) form a positive pair (different views of the same image) sampled from the given dataset \(\mathcal {D}\), and \(\mathbf {k^{-}}\) are negative examples (different images). As shown in Fig. 1, MoCo uses a momentum encoder \(\theta _{k}\) to encode all the keys \(\mathbf {k}\) and puts them in a queue for negative sampling. The momentum encoder is a momentum average of the encoder \(\theta _{q}\):

$$\begin{aligned} \theta _k \leftarrow \eta \theta _k+(1-\eta )\theta _q. \end{aligned}$$
(2)
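For concreteness, below is a minimal sketch of this momentum update, assuming PyTorch; the encoder names are illustrative rather than taken from the MoCo code base.

```python
# Minimal sketch of the momentum update in Eq. (2), assuming PyTorch and two
# encoders that share the same architecture; names are illustrative.
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, eta=0.999):
    # theta_k <- eta * theta_k + (1 - eta) * theta_q
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(eta).add_(p_q.data, alpha=1.0 - eta)
```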

As shown in MoCo [7], the size of the negative sampling queue is crucial to the quality of the learned representation. On a data-deficient dataset, the maximum size of the queue is limited. We therefore propose to add a margin to the original loss function, which encourages larger margins between data samples and helps the model reach similar results with fewer negative examples:

$$\begin{aligned} \mathcal {L}_{\text {margin}}=-\log \left[ \frac{\exp \left( \left( \mathbf {q} \cdot \mathbf {k^{+}} - m \right) / \tau \right) }{\exp \left( \left( \mathbf {q} \cdot \mathbf {k^{+}} - m \right) / \tau \right) + \sum _{\mathbf {k^{-}}} \exp \left( \mathbf {q} \cdot \mathbf {k^{-}} / \tau \right) } \right] \,. \end{aligned}$$
(3)
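To make the modification concrete, the sketch below implements Eq. (3) in PyTorch, following the usual MoCo convention of L2-normalized features and a key queue; the tensor shapes, names, and the temperature default are assumptions, and setting m = 0 recovers the original loss in Eq. (1).

```python
# A sketch of the margin loss in Eq. (3), assuming PyTorch and L2-normalized
# features. q: (N, C) queries, k_pos: (N, C) positive keys, queue: (C, K)
# negative keys. With m = 0 this reduces to the original MoCo loss in Eq. (1).
import torch
import torch.nn.functional as F

def margin_info_nce(q, k_pos, queue, m=0.4, tau=0.2):
    l_pos = torch.einsum("nc,nc->n", q, k_pos).unsqueeze(-1) - m  # (N, 1), margin on the positive logit
    l_neg = torch.einsum("nc,ck->nk", q, queue)                   # (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=q.device)  # positives sit at index 0
    return F.cross_entropy(logits, labels)
```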

3.2 Phase-2: Self-Distill on Labeled Dataset

The self-supervised checkpoint from phase-1 is then used to initialize both the teacher and the student for fine-tuning on the whole labeled dataset. We choose OFD [9] to distill the visual priors from teacher to student. The distillation process can be seen as a regularization that prevents the student from over-fitting the small training set and gives the student a more diverse representation for classification.

The distillation loss can be formulated as follows:

$$\begin{aligned} \mathcal {L}_{\text {distill}}=\sum _{\mathbf {F}}d_{p}\left( \text {StopGrad}\left( \mathbf {F}_{t}\right) , r(\mathbf {F}_{s})\right) \,, \end{aligned}$$
(4)

where \(\mathbf {F}_t\) and \(\mathbf {F}_s\) stand for the feature maps of the teacher and student models, respectively; \(\text {StopGrad}\) means the weights of the teacher are not updated by gradient descent; \(d_p\) is a distance metric; and r is a connector function that transforms the student's features into the teacher's feature space.
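The sketch below illustrates a simplified version of this loss in PyTorch: the teacher features are detached (StopGrad), r is a 1x1 convolution plus batch normalization mapping student channels to teacher channels, and \(d_p\) is the plain \(\ell _2\) distance. The full OFD method [9] additionally uses a margin-ReLU teacher transform and a partial \(\ell _2\) distance, which are omitted here for brevity.

```python
# Simplified sketch of the feature distillation loss in Eq. (4), assuming
# PyTorch. The connector (1x1 conv + BN) and the plain l2 distance are
# simplifications of the full OFD design [9].
import torch
import torch.nn as nn

class Connector(nn.Module):
    """Maps a student feature map to the teacher's channel dimension."""
    def __init__(self, s_channels, t_channels):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(s_channels, t_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(t_channels),
        )

    def forward(self, f_s):
        return self.transform(f_s)

def distill_loss(feats_t, feats_s, connectors):
    # Sum of l2 distances over the selected feature maps; detaching the
    # teacher side implements StopGrad in Eq. (4).
    loss = 0.0
    for f_t, f_s, r in zip(feats_t, feats_s, connectors):
        loss = loss + torch.mean((f_t.detach() - r(f_s)) ** 2)
    return loss
```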

Along with a cross-entropy loss for classification:

$$\begin{aligned} \mathcal {L}_{\text {ce}}=- \log p(y=i|\mathbf {x}) \,, \end{aligned}$$
(5)

the final loss function for the student model is:

$$\begin{aligned} \mathcal {L}_{\text {stu}}=\mathcal {L}_{\text {ce}} +\lambda \mathcal {L}_{\text {distill}} \,. \end{aligned}$$
(6)

The student model is then used for evaluation.
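As an illustration, one phase-2 training step could look like the sketch below, reusing the distill_loss and connectors from the previous sketch; the feature-extraction interfaces on the teacher and student are hypothetical placeholders for whatever hooks expose the intermediate feature maps.

```python
# Sketch of one phase-2 fine-tuning step implementing Eq. (5)-(6), assuming
# PyTorch. extract_features / extract_features_and_logits are hypothetical
# interfaces standing in for feature hooks on the backbone.
import torch
import torch.nn.functional as F

def student_step(images, labels, teacher, student, connectors, optimizer, lam=1e-4):
    with torch.no_grad():
        feats_t = teacher.extract_features(images)               # frozen teacher (StopGrad)
    feats_s, logits = student.extract_features_and_logits(images)
    loss = F.cross_entropy(logits, labels) + lam * distill_loss(feats_t, feats_s, connectors)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```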

4 Experiments

Dataset. Only the subset of the ImageNet [4] dataset provided by the VIPriors challenge is used in our experiments; no external data or pre-trained checkpoints are used. The VIPriors challenge dataset contains the same 1,000 classes as the original ImageNet [4] and is split into train, val and test splits. Each split has 50 images per class, resulting in a total of 150,000 images. For comparison, we train the model on the train split and evaluate it on the validation split.

Implementation Details. For phase-1, we set the momentum \(\eta \) to 0.999 in all experiments as it yields better performance, and the size of the queue is set to 4,096. The margin m in our proposed margin loss is set to 0.4. We train the model for 800 epochs in phase-1; the initial learning rate is 0.03 and is dropped by 10x at epoch 120 and epoch 160. Other hyperparameters are kept the same as in MoCo v2 [2].

For phase-2, the \(\lambda \) in Eq. 6 is set to \(10^{-4}\), and we use the \(\ell _2\) distance as the distance metric \(d_p\) in Eq. 4. We train the model for 100 epochs in phase-2; the initial learning rate is 0.1 and is dropped by 10x every 30 epochs.
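As a reference point, the phase-2 optimization schedule described above could be set up as in the sketch below; the use of SGD with momentum and weight decay is an assumption in line with common fine-tuning recipes, while the learning rate, step schedule and \(\lambda \) follow the values in the text.

```python
# Sketch of the phase-2 optimization setup; SGD with momentum 0.9 and weight
# decay 1e-4 is an assumed choice, the rest follows the reported values.
# `student` denotes the phase-2 model being fine-tuned.
import torch

optimizer = torch.optim.SGD(student.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)  # 10x drop every 30 epochs
lam = 1e-4  # weight of the distillation loss in Eq. (6)
```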

Ablation Results. We first present the overall performance of our proposed two-phase pipeline, then show some ablation results.

As shown in Table 1, supervised training of ResNet50 [8] leads to over-fitting on the train split, and thus low validation top-1 accuracy. By first pre-training the model with phase-1 of our pipeline and then fine-tuning a linear classifier on top of the obtained representation [18], we gain 6.6 points in top-1 accuracy. This indicates that the features learned by self-supervised learning contain more information and generalize well to the validation set. We also show that fine-tuning the full model from phase-1 reaches better performance than fine-tuning only a linear classifier, which indicates that the weights from phase-1 also serve as a good initialization, although the supervised fine-tuning may still cause the model to over-fit. Finally, by combining phase-1 and phase-2, our proposed pipeline achieves a 16.7-point gain in top-1 accuracy over the supervised baseline.

Table 1. Training and pre-training on the train split and evaluating on the validation split of the given dataset. 'finetune fc' stands for training a linear classifier on top of the pre-trained representation; 'finetune' stands for training the weights of the whole model. Our proposed pipeline (Phase-1 + Phase-2) yields a 16.7-point gain in top-1 validation accuracy.

The Effect of Our Margin Loss. Table 2 shows the effect of the number of negative samples in the contrastive learning loss. The original loss function used by MoCo v2 [7] is sensitive to the number of negatives: the fewer the negatives, the lower the linear classification accuracy. Our modified margin loss alleviates this issue by encouraging a larger margin between data points, leading to a more discriminative feature space. The experiments show that our margin loss is less sensitive to the number of negatives and is therefore suitable for the data-deficient setting.

Table 2. Val Acc is the linear classification accuracy obtained by fine-tuning a linear classifier on top of the learned representation. The original MoCo v2 is sensitive to the number of negatives; its performance drops drastically when the number of negatives is small. Our modified margin loss is less sensitive: even with 16x fewer negatives, the performance only drops by 0.9.
Table 3. The tricks used in the competition. Our final accuracy is 68.8, which is a competitive result in the challenge; our code will be made public. Results in this table are obtained by training the model on the combination of the train and validation splits.

Competition Tricks. For better performance in the competition, we combine the train and val splits to train the model that generates the submission. Several other tricks and stronger backbone models are used to further improve performance, such as Auto-Augment [3], ResNeXt [19], label smoothing [13], TenCrop and model ensembling. The detailed tricks are listed in Table 3.
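As an example of the test-time tricks, a hedged sketch of TenCrop evaluation with torchvision is shown below; the resize size and the absence of input normalization are simplifications, not the exact competition setup.

```python
# Sketch of TenCrop test-time augmentation with torchvision; sizes and the
# missing normalization are illustrative simplifications.
import torch
from torchvision import transforms

tencrop = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])

@torch.no_grad()
def predict_tencrop(model, pil_image):
    crops = tencrop(pil_image)      # (10, 3, 224, 224)
    logits = model(crops)           # (10, num_classes)
    return logits.mean(dim=0)       # average the predictions over the 10 crops
```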

5 Conclusion

This paper proposes a novel two-phase pipeline for image classification with CNNs under the data-deficient setting. The first phase learns a teacher model that obtains a rich visual representation from the dataset via self-supervised learning. The second phase transfers this representation to a student model in a self-distillation manner while the student is fine-tuned for the downstream classification task. Experiments show the effectiveness of our proposed method. Combined with additional tricks, it achieves a competitive result in the VIPriors Image Classification Challenge.