
1 Introduction

Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance in image classification, object detection, semantic segmentation and other vision tasks. With the appearance of AlexNet [14], VGG [18], Inception [12, 19, 20, 21], ResNet [9], EfficientNet [22] and ResNeSt [29], the top-1 accuracy on ImageNet has increased from 62.5% (AlexNet) to 84.5% (ResNeSt-269). Besides different network backbones, there are also many plug-and-play modules that can significantly improve accuracy, such as SE (Squeeze-and-Excitation) [11], CBAM (Convolutional Block Attention Module) [26] and ECA (Efficient Channel Attention) [24].

However, the performance of CNNs is greatly limited by the availability of labeled data. Pre-trained models are the most common solution, as they achieve good results by transferring prior knowledge, but only a few fixed architectures, such as Inception, ResNet and EfficientNet, are available pre-trained. When training from scratch on the VIPriors classification dataset, which has only 50 training samples per class, the effectiveness of learning plays a decisive role. Effective and sufficient augmentation strategies are necessary, such as random erasing [32], Mixup [30], CutMix [28], Cutout [7], AutoAugment [5] and RandAugment [6]. On the other hand, models overfit easily with little training data, so it is crucial to mitigate overfitting with appropriate regularization.

In this work, a novel network architecture, the Dual Selective Kernel network (DSK-net), is proposed to improve learning effectiveness on small-scale datasets. For more data-efficient learning, a positive class classification loss and an intra-class compactness loss are applied to enhance the discriminative power of the deeply learned features. An induced class hierarchy is also used, which makes the task easier to learn from scratch. The methods are evaluated on the VIPriors image classification dataset, which is derived from ImageNet and contains 50 images per class for training and testing. Experimental results show that our methods achieve the best performance on the VIPriors classification dataset.

2 Related Works

2.1 Data Augmentation

Augmentation is an effective way to improve CNNs' performance, especially in the case of insufficient data. Mixup [30] trains a model on convex combinations of pairs of examples and of their labels. Cutout [7] randomly erases square regions of the input images during training. CutMix [28] cuts and pastes patches among training images, where the training labels are also mixed proportionally to the area of the patches; it makes efficient use of training pixels while retaining the regularization effect of regional dropout (a minimal sketch is given below). GridMask [3] drops pixels on the input images with multiple squares at different ratios. Recently, with the emergence of AutoML, learning strategies can also be searched from data. AutoAugment [5] is a series of augmentation policies searched on ImageNet, which requires a huge search space. Hence RandAugment [6] proposes a simplified search space with less computational expense.
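
As a concrete illustration of the patch-and-label mixing described above, the following is a minimal sketch of CutMix, assuming a batch tensor of images and one-hot label vectors (the Beta(α, α) patch-ratio sampling follows the original paper; clipping and other details are simplified here).

```python
import torch

def cutmix(images, labels, alpha=1.0):
    """Minimal CutMix sketch: paste a random patch from a shuffled copy of
    the batch and mix the one-hot labels in proportion to the patch area."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(images.size(0))
    h, w = images.shape[2], images.shape[3]
    # Patch side lengths proportional to sqrt(1 - lam)
    cut_h, cut_w = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    images[:, :, y1:y2, x1:x2] = images[idx, :, y1:y2, x1:x2]
    # Recompute lam from the actual (clipped) patch area
    lam = 1 - (y2 - y1) * (x2 - x1) / (h * w)
    mixed_labels = lam * labels + (1 - lam) * labels[idx]
    return images, mixed_labels
```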

2.2 Translation Invariance in CNNs

It is generally known that CNNs are not shift-invariant: a small shift or translation of the input can produce a quite different output. To reduce the influence of translation, several augmentation operations are often used, such as scaling, rotation and reflection [2, 4, 8, 17, 27]. [31] integrates low-pass filtering, a common signal processing module, into CNNs for anti-aliasing. [13] proposes a fully convolutional architecture that removes spatial location as a feature, which improves the equivariance and invariance of the convolutional inductive prior.

2.3 Important Feature Learning

For image classification, locating and recognizing discriminative features is the key to better performance. Most discriminative feature extraction modules are based on attention mechanisms, which are inspired by human brain neural units. SE (Squeeze-and-Excitation) [11] and ECA [24] are channel attention architectures, while CBAM [26] applies both channel and spatial attention modules. Inspired by the adaptive receptive field sizes of neurons, [15] proposes Selective Kernel (SK) convolution, which uses a soft-attention mechanism to improve feature extraction efficiency. Besides attention architectures, the loss function can also help a model learn more discriminative features. Center loss [25] increases inter-class dispersion and intra-class compactness: it learns a center from the deep features of each class and then penalizes the distances between deep features and their corresponding class centers.

3 Proposed Method

To be more data-efficient, firstly, a 3-branch network called the Dual Selective Kernel (DSK) network is proposed, shown in Fig. 1. DSK combines discriminative feature extraction, translation invariance and regularization. Secondly, a composite loss function is designed to improve feature discrimination: it helps models not only classify correctly but also increase the diversity between different classes.

Fig. 1. Dual selective kernel residual block.

3.1 Dual Selective Kernel Network

Discriminative Feature Extraction. To adjust the receptive fields of neurons automatically, selective kernel convolution [15] is added into the residual block. A given feature map \(X_i\in \mathbb {R}^{H\times W\times C}\) is processed by convolutions with kernel sizes 3 and 5, i.e. two transforms are conducted: \(\mathcal {\widehat{F}}\): \(X_i\rightarrow \mathcal {\widehat{U}}\in \mathbb {R}^{H\times W\times C}\) and \(\mathcal {\widetilde{F}}\): \(X_i\rightarrow \mathcal {\widetilde{U}}\in \mathbb {R}^{H\times W\times C}\). Both \(\mathcal {\widehat{F}}\) and \(\mathcal {\widetilde{F}}\) are composed of a depthwise convolution, Batch Normalization and ReLU. The feature \(\mathcal {U}\) is the element-wise sum of \(\mathcal {\widehat{U}}\) and \(\mathcal {\widetilde{U}}\). Global average pooling is applied to \(\mathcal {U}\) for information embedding. Further, a compact feature \(s\in \mathbb {R}^C\) is created by passing the embedding through a fully connected squeeze layer, followed by Batch Normalization, ReLU and two further fully connected layers for the excitation of the two kernels. Finally, soft attention is used to select information at different spatial scales: the attention weights \(\widehat{\omega }\) and \(\widetilde{\omega }\) are calculated by a channel-wise softmax over the two excitation outputs. The final feature map is obtained by applying the attention weights to \(\mathcal {\widehat{U}}\) and \(\mathcal {\widetilde{U}}\):

$$\begin{aligned} \mathcal {V}=\widehat{\omega }\cdot \mathcal {\widehat{U}}+\widetilde{\omega }\cdot \mathcal {\widetilde{U}} \end{aligned}$$
(1)
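
The following is a minimal PyTorch sketch of the selective kernel block described above and in Eq. (1). The reduction ratio, the minimum squeeze width and the use of a dilated 3×3 depthwise convolution to realize the 5×5 branch are assumptions borrowed from the original SK design [15], not details given here.

```python
import torch
import torch.nn as nn

class SelectiveKernelUnit(nn.Module):
    """Sketch of the selective kernel block: two depthwise branches,
    element-wise fusion, squeeze, per-branch excitation and soft selection."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        mid = max(channels // reduction, 32)
        def branch(dilation):
            pad = dilation  # keeps spatial size for a 3x3 kernel
            return nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=pad,
                          dilation=dilation, groups=channels, bias=False),
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.branch3 = branch(1)   # effective 3x3 receptive field
        self.branch5 = branch(2)   # effective 5x5 via dilation (assumption)
        self.squeeze = nn.Sequential(
            nn.Linear(channels, mid), nn.BatchNorm1d(mid), nn.ReLU(inplace=True))
        self.excite3 = nn.Linear(mid, channels)
        self.excite5 = nn.Linear(mid, channels)

    def forward(self, x):
        u_hat, u_tilde = self.branch3(x), self.branch5(x)
        u = u_hat + u_tilde                          # element-wise fuse
        s = u.mean(dim=(2, 3))                       # global average pooling
        z = self.squeeze(s)                          # compact feature
        a, b = self.excite3(z), self.excite5(z)      # per-branch excitation
        w = torch.softmax(torch.stack([a, b], dim=1), dim=1)  # channel-wise select
        w_hat, w_tilde = w[:, 0, :, None, None], w[:, 1, :, None, None]
        return w_hat * u_hat + w_tilde * u_tilde     # Eq. (1)
```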

Translation Invariance. The operations that reduce spatial resolution in CNNs, including max pooling, average pooling and strided convolution, are harmful to shift-equivariance. Blur pool [31] is an anti-aliased architecture that is compatible with the above components. For example, max pooling with stride 2 is split into max pooling with stride 1 followed by blur pool with stride 2, and a strided convolution with an activation function is split into a convolution with stride 1, the activation function and a blur pool. The blur pool kernel can be chosen from several anti-aliasing filters, from size \(2\times 2\) to \(5\times 5\), with increasing smoothing. In DSK, a \(3\times 3\) filter is applied in max pooling and strided convolution.
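
A minimal sketch of the anti-aliased downsampling described above, assuming the fixed 3×3 binomial filter (outer product of [1, 2, 1] with itself) used by blur pool; the helper `antialiased_maxpool` is a hypothetical name showing how a stride-2 max pooling is split into a stride-1 max pooling followed by a stride-2 blur pool.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool2d(nn.Module):
    """Anti-aliased downsampling sketch: a fixed binomial low-pass filter
    applied depthwise before subsampling (3x3 filter, as used in DSK)."""
    def __init__(self, channels, stride=2):
        super().__init__()
        k = torch.tensor([1., 2., 1.])
        kernel = k[:, None] * k[None, :]
        kernel = kernel / kernel.sum()
        self.register_buffer("kernel", kernel.repeat(channels, 1, 1, 1))
        self.stride, self.channels = stride, channels

    def forward(self, x):
        return F.conv2d(x, self.kernel, stride=self.stride,
                        padding=1, groups=self.channels)

# Example: anti-aliased max pooling = max(stride 1) followed by blur(stride 2)
def antialiased_maxpool(channels):
    return nn.Sequential(nn.MaxPool2d(kernel_size=2, stride=1),
                         BlurPool2d(channels, stride=2))
```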

Regularization. Just as data augmentation is applied to the input data, it is reasonable to apply a corresponding technique to the representation branches in the residual block. Let \(X_i\) denote the input tensor of residual block i, let \(\mathcal {W}_i^1\) and \(\mathcal {W}_i^2\) denote the weights of the two residual units, let \(\mathcal {F}\) denote the residual function and let \(X_{i+1}\) denote the output of block i. The 3-branch architecture can be represented as:

$$\begin{aligned} X_{i+1}=X_i+\lambda _i \mathcal {F}(X_i,\mathcal {W}_i^1)+(1-\lambda _i)\mathcal {F}(X_i,\mathcal {W}_i^2) \end{aligned}$$
(2)

In the forward and backward passes during training, \(\lambda _i\) is a random value of 0 or 1, which means that only one of branch 1 and branch 2 is selected at random. At inference, \(\lambda _i\) is 0.5, so that half of each branch's output is used.
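
A minimal sketch of Eq. (2): a wrapper that takes the two residual branches as sub-modules, samples \(\lambda \in \{0, 1\}\) during training and uses \(\lambda = 0.5\) at inference. The module and argument names are ours, and for clarity both branches are evaluated even when one is dropped.

```python
import random
import torch.nn as nn

class DualBranchResidual(nn.Module):
    """Sketch of Eq. (2): identity plus two residual branches, one of which
    is randomly dropped in training and both averaged at inference."""
    def __init__(self, branch1: nn.Module, branch2: nn.Module):
        super().__init__()
        self.branch1, self.branch2 = branch1, branch2

    def forward(self, x):
        if self.training:
            lam = float(random.random() < 0.5)   # lambda_i is 0 or 1
        else:
            lam = 0.5                            # average both branches
        return x + lam * self.branch1(x) + (1 - lam) * self.branch2(x)
```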

3.2 Loss Function

Categorical cross-entropy (CE) loss after softmax is widely used in multi-class classification. However, CE is suboptimal for the VIPriors classification dataset, because it forces the model to focus only on the individual training image and ignores intra-class compactness. In this section, several loss functions are discussed, and a combined loss is proposed in Eq. 3 for better performance.

$$\begin{aligned} L=\alpha L_{PCL}+\beta L_{CL} +\gamma L_{TSL} \end{aligned}$$
(3)

Positive Class Loss. The CE loss is shown in Eq. 4, where p represents the output of the model and l represents the one-hot label. CE not only directs the model to classify the ground truth class correctly but also forces the predictions of all other classes to be as low as possible.

$$\begin{aligned} L_{CE}=-\frac{1}{N}\sum (l*log(p)+(1-l)*log(1-p)) \end{aligned}$$
(4)

But is such a loss suitable for a small dataset in which the number of classes is far greater than the number of samples per class? Additionally, [16] shows that there are many label errors in ImageNet, including images that are actually multi-label but are annotated with a single class label. We have reason to believe that the same issue exists in the VIPriors classification dataset. Based on the above, making the model focus only on the ground truth label may be more beneficial during learning. Consequently, the positive class loss (PCL) is proposed as:

$$\begin{aligned} L_{PCL}=\frac{1}{N}\sum ( -l*log(p)+(1-cos(l,p))) \end{aligned}$$
(5)

PCL has two parts: the former is the positive term of CE, the latter is the cosine loss [1].
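
A minimal sketch of the PCL in Eq. (5), assuming softmax probabilities and one-hot labels; the clamp on the logarithm is a numerical-stability detail added here, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def positive_class_loss(logits, targets, num_classes):
    """Sketch of Eq. (5): the positive term of cross-entropy plus a cosine
    loss between the softmax prediction and the one-hot label."""
    p = torch.softmax(logits, dim=1)
    l = F.one_hot(targets, num_classes).float()
    positive_term = -(l * torch.log(p.clamp_min(1e-8))).sum(dim=1)  # -l * log(p)
    cosine_term = 1 - F.cosine_similarity(l, p, dim=1)              # 1 - cos(l, p)
    return (positive_term + cosine_term).mean()
```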

Center Loss. Although PCL can guide the model toward better learning, it overfits easily with little data. Therefore, the center loss (CL) [25] in Eq. 6 is used for more discriminative feature extraction. Let \(x_i\in \mathbb {R}^d\) denote the ith deep feature, belonging to the \(y_i\)th class. The \(y_i\)th class center of the deep features, \(c_{y_i}\in \mathbb {R}^d\), is updated by averaging the features of the corresponding class in each iteration.

$$\begin{aligned} L_{CL}=\frac{1}{2}\sum \limits _i||x_i-c_{y_i}||_2^2 \end{aligned}$$
(6)
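
A minimal sketch of the center loss in Eq. (6). Here the class centers are kept as learnable parameters and optimized jointly, which is a common simplification; as described above, the paper updates the centers by averaging the class features in each iteration.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Sketch of Eq. (6): penalise the squared distance between each deep
    feature and its class center (centers are learnable in this sketch)."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, targets):
        centers = self.centers[targets]                       # c_{y_i} per sample
        return 0.5 * ((features - centers) ** 2).sum()        # often averaged in practice
```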

Tree Supervision Loss. The semantic relations of the classes in VIPriors can be induced as a hierarchical tree. The leaf nodes of the tree represent the 1000 classes in the dataset and the parent nodes represent superclasses such as animal and vehicle. The children of a parent node usually share some commonalities, which is helpful for classification. Inspired by Neural-Backed Decision Trees (NBDT) [23], a hierarchical architecture is defined according to the semantic relationships of the 1000 classes, and a tree supervision loss (TSL) is used for model training. Let \(x\in \mathbb {R}^d\) denote a featurized sample and \(w_{r\rightarrow n_i}\) denote the weights along the path from the root node r to leaf node \(n_i\). TSL can be represented as:

$$\begin{aligned} L_{TSL}=L_{CE}([\prod x*w_{r\rightarrow n_1},\prod x*w_{r\rightarrow n_2},...],l) \end{aligned}$$
(7)
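
A hedged sketch of Eq. (7), assuming a fixed induced hierarchy given as a list of root-to-leaf paths and one weight vector per tree node; the leaf score is the product of the inner products \(x\cdot w\) along the path, and CE is applied over the leaf scores. The bookkeeping here is illustrative and not the NBDT implementation.

```python
import torch
import torch.nn.functional as F

def tree_supervision_loss(features, targets, node_weights, paths):
    """Sketch of Eq. (7). `node_weights` maps a node id to a weight vector w
    of the same dimension as the features; `paths[leaf]` lists the nodes on
    the root-to-leaf path of that class (hypothetical data layout)."""
    scores = []
    for leaf, path in enumerate(paths):
        s = torch.ones(features.size(0), device=features.device)
        for node in path:
            s = s * (features @ node_weights[node])   # x . w_{r->n} along the path
        scores.append(s)
    logits = torch.stack(scores, dim=1)                # one score per leaf class
    return F.cross_entropy(logits, targets)
```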

4 Experiments

4.1 Implementation Details

The following data augmentation methods are used in our models: random resized crop, random horizontal flip and CutMix (with a probability of 0.5). All models are trained with 16 GPUs and 64 samples per GPU. In the training stage, warm-up with an initial lr of 0.0001 over 5 epochs, a cosine learning rate schedule [10] with an initial lr of 0.1, dropout with a probability of 0.2, weight decay of 0.0001 and label smoothing are used. For the coefficients in Eq. 3, \(\alpha \), \(\beta \) and \(\gamma \) are set to 1, 1 and 0.0005 respectively. In the early stage of the competition, we trained models on the training set to explore and verify methods, and in the final stage we trained models on both the training set and most of the validation set; only a small number of validation samples were reserved for validation. For the final prediction, Test Time Augmentation (TTA) with 10-crop was used. Additionally, experimental results show that increasing the number of training epochs from 90 to 360 improves model accuracy by 5.3%.
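
For reference, a sketch of the learning rate schedule described above: linear warm-up from 1e-4 to 0.1 over 5 epochs, followed by cosine decay. The exact shape of the warm-up curve is our assumption.

```python
import math

def learning_rate(epoch, total_epochs=360, warmup_epochs=5,
                  base_lr=0.1, warmup_lr=0.0001):
    """Per-epoch learning rate: linear warm-up, then cosine decay."""
    if epoch < warmup_epochs:
        return warmup_lr + (base_lr - warmup_lr) * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```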

4.2 Results

Table 1 shows the results for ResNeXt, D-ResNeXt, SK-ResNeXt, DSK-ResNeXt, PCL and CL on the validation set. Models are trained for 360 epochs.

Table 1. Performance of DSK-net, PCL and CL on the validation set.

Table 2 shows the results of TSL for EfficientNet and ResNeSt in the final stage. Models are trained for 720 epochs and tested on the partial validation set.

Table 2. Experimental results of TSL.

Table 3 shows the results of DSK-net in the final stage. Models are trained for 540 epochs and tested on the partial validation set. The best single-model accuracy we achieved is 69.59%.

Table 3. Experimental results of DSK-net.

4.3 Other Tricks

The results for CutMix shown in Table 4 indicate that global semantic information and local area features are equally important.

Table 4. Experimental results for CutMix on the validation set. The input size is \(320\times 320\) and models are trained for 90 epochs.

The results for label smoothing, dropout and dual pool are shown in Table 5:

Table 5. Experimental results of label smoothing, dropout and dual pool on the validation set. Models are trained for 360 epochs.

4.4 Ensembling

For better performance, we ensembled the predictions of the above methods across 16 models in total, including EfficientNet-b5, EfficientNet-b6, ResNeSt-101, ResNeSt-200, DSK-ResNeXt50 and DSK-ResNeXt101. A weighted score averaging method was used, in which the higher-performing models were given a weight of 3 and the rest a weight of 1. Finally, we obtained a score of 73.08% on the test set.
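
A minimal sketch of the weighted score averaging used for the ensemble, assuming per-model class-score tensors of identical shape; weights of 3 and 1 correspond to the higher- and lower-performing models respectively.

```python
import torch

def weighted_ensemble(score_list, weights):
    """Combine per-model class scores with per-model weights and return
    the final class predictions."""
    scores = torch.stack(score_list)                    # (num_models, N, num_classes)
    w = torch.tensor(weights, dtype=scores.dtype).view(-1, 1, 1)
    combined = (scores * w).sum(dim=0) / w.sum()
    return combined.argmax(dim=1)                       # final predicted classes
```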

Figure 2 shows an overview of the methods and their performance. No external image/video data or pre-trained models were used throughout the competition.

Fig. 2. Performance overview.

5 Conclusions

In this paper, we discuss and explore data-efficient learning, visual inductive priors and training from scratch. In VIPF, we propose a novel architecture called DSK-net, which is robust to translation. Extensive experimental results show that DSK-net learns efficiently from insufficient data and outperforms EfficientNet and ResNeSt on the VIPriors classification dataset. A loss based on the positive class is then applied as a model constraint, and an induced hierarchy is used to guide the model to learn discriminatively and more easily. Experimental results show that the proposed VIPF is effective. Finally, we won first place in the VIPriors image classification competition.