Big Transfer (BiT): General Visual Representation Learning

Kolesnikov, Alexander; Beyer, Lucas; Zhai, Xiaohua; Puigcerver, Joan; Yung, Jessica; Gelly, Sylvain; Houlsby, Neil

doi:10.1007/978-3-030-58558-7_29

Alexander Kolesnikov¹²,
Lucas Beyer¹²,
Xiaohua Zhai¹²,
Joan Puigcerver¹²,
Jessica Yung¹²,
Sylvain Gelly¹² &
…
Neil Houlsby¹²

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 12350))

Included in the following conference series:

European Conference on Computer Vision

6849 Accesses
331 Citations
6 Altmetric

Abstract

Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes—from 1 example per class to 1M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19 task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct detailed analysis of the main components that lead to high transfer performance.

A. Kolesnikov, L. Beyer and X. Zhai—Equal contribution.

Access provided by Autonomous University of Puebla. Download conference paper PDF

Deep Architectures in Visual Transfer Learning

Training Vision Transformers with only 2040 Images

Best Practices for Fine-Tuning Visual Classifiers to New Domains

1 Introduction

Strong performance using deep learning usually requires a large amount of task-specific data and compute. These per-task requirements can make new tasks prohibitively expensive. Transfer learning offers a solution: task-specific data and compute are replaced with a pre-training phase. A network is trained once on a large, generic dataset, and its weights are then used to initialize subsequent tasks which can be solved with fewer data points, and less compute [9, 34, 37].

We revisit a simple paradigm: pre-train on a large supervised source dataset, and fine-tune the weights on the target task. Numerous improvements to deep network training have recently been introduced, e.g. [1, 17, 21, 29, 46, 47, 52, 54, 56, 59]. We aim not to introduce a new component or complexity, but to provide a recipe that uses the minimal number of tricks yet attains excellent performance on many tasks. We call this recipe “Big Transfer” (BiT).

We train networks on three different scales of datasets. The largest, BiT-L is trained on the JFT-300M dataset [43], which contains 300M noisily labelled images. We transfer BiT to many diverse tasks; with training set sizes ranging from 1 example per class to 1M total examples. These tasks include ImageNet’s ILSVRC-2012 [6], CIFAR-10/100 [22], Oxford-IIIT Pet [35], Oxford Flowers-102 [33] (including few-shot variants), and the 1000-sample VTAB-1k benchmark [58], which consists of 19 diverse datasets. BiT-L attains state-of-the-art performance on many of these tasks, and is surprisingly effective when very little downstream data is available (Fig. 1). We also train BiT-M on the public ImageNet-21k dataset, and attain marked improvements over the popular ILSVRC-2012 pre-training.

Importantly, BiT only needs to be pre-trained once and subsequent fine-tuning to downstream tasks is cheap. By contrast, other state-of-the-art methods require extensive training on support data conditioned on the task at hand [32, 53, 55]. Not only does BiT require a short fine-tuning protocol for each new task, but BiT also does not require extensive hyperparameter tuning on new tasks. Instead, we present a heuristic for setting the hyperparameters for transfer, which works well on our diverse evaluation suite.

We highlight the most important components that make Big Transfer effective, and provide insight into the interplay between scale, architecture, and training hyperparameters. For practitioners, we will release the performant BiT-M model trained on ImageNet-21k.

2 Big Transfer

We review the components that we found necessary to build an effective network for transfer. Upstream components are those used during pre-training, and downstream are those used during fine-tuning to a new task.

2.1 Upstream Pre-training

The first component is scale. It is well-known in deep learning that larger networks perform better on their respective tasks [10, 40]. Further, it is recognized that larger datasets require larger architectures to realize benefits, and vice versa [20, 38]. We study the effectiveness of scale (during pre-training) in the context of transfer learning, including transfer to tasks with very few datapoints. We investigate the interplay between computational budget (training time), architecture size, and dataset size. For this, we train three BiT models on three large datasets: ILSVRC-2012 [39] which contains 1.3M images (BiT-S), ImageNet-21k [6] which contains 14M images (BiT-M), and JFT [43] which contains 300M images (BiT-L).

The second component is Group Normalization (GN) [52] and Weight Standardization (WS) [28]. Batch Normalization (BN) [16] is used in most state-of-the-art vision models to stabilize training. However, we find that BN is detrimental to Big Transfer for two reasons. First, when training large models with small per-device batches, BN performs poorly or incurs inter-device synchronization cost. Second, due to the requirement to update running statistics, BN is detrimental for transfer. GN, when combined with WS, has been shown to improve performance on small-batch training for ImageNet and COCO [28]. Here, we show that the combination of GN and WS is useful for training with large batch sizes, and has a significant impact on transfer learning.

2.2 Transfer to Downstream Tasks

We propose a cheap fine-tuning protocol that applies to many diverse downstream tasks. Importantly, we avoid expensive hyperparameter search for every new task and dataset size; we try only one hyperparameter per task. We use a heuristic rule—which we call BiT-HyperRule—to select the most important hyperparameters for tuning as a simple function of the task’s intrinsic image resolution and number of datapoints. We found it important to set the following hyperparameters per-task: training schedule length, resolution, and whether to use MixUp regularization [59]. We use BiT-HyperRule for over 20 tasks in this paper, with training sets ranging from 1 example per class to over 1M total examples. The exact settings for BiT-HyperRule are presented in Sect. 3.3.

During fine-tuning, we use the following standard data pre-processing: we resize the image to a square, crop out a smaller random square, and randomly horizontally flip the image at training time. At test time, we only resize the image to a fixed size. In some tasks horizontal flipping or cropping destroys the label semantics, making the task impossible. An example is if the label requires predicting object orientation or coordinates in pixel space. In these cases we omit flipping or cropping when appropriate.

Recent work has shown that existing augmentation methods introduce inconsistency between training and test resolutions for CNNs [49]. Therefore, it is common to scale up the resolution by a small factor at test time. As an alternative, one can add a step at which the trained model is fine-tuned to the test resolution [49]. The latter is well-suited for transfer learning; we include the resolution change during our fine-tuning step.

We found that MixUp [59] is not useful for pre-training BiT, likely due to the abundance of data. However, it is sometimes useful for transfer. Interestingly, it is most useful for mid-sized datasets, and not for few-shot transfer, see Sect. 3.3 for where we apply MixUp.

Surprisingly, we do not use any of the following forms of regularization during downstream tuning: weight decay to zero, weight decay to initial parameters [25], or dropout. Despite the fact that the network is very large—BiT has 928 million parameters—the performance is surprisingly good without these techniques and their respective hyperparameters, even when transferring to very small datasets. We find that setting an appropriate schedule length, i.e. training longer for larger datasets, provides sufficient regularization.

3 Experiments

We train three upstream models using three datasets at different scales: BiT-S, BiT-M, and BiT-L. We evaluate these models on many downstream tasks and attain very strong performance on high and low data regimes.

3.1 Data for Upstream Training

BiT-S is trained on the ILSVRC-2012 variant of ImageNet, which contains 1.28 million images and 1000 classes. Each image has a single label. BiT-M is trained on the full ImageNet-21k dataset [6], a public dataset containing 14.2 million images and 21k classes organized by the WordNet hierarchy. Images may contain multiple labels. BiT-L is trained on the JFT-300M dataset [32, 43, 53]. This dataset is a newer version of that used in [4, 13]. JFT-300M consists of around 300 million images with 1.26 labels per image on average. The labels are organized into a hierarchy of \(18\,291\) classes. Annotation is performed using an automatic pipeline, and are therefore imperfect; approximately 20% of the labels are noisy. We remove all images present in downstream test sets from JFT-300M. We provide details in supplementary material. Note: the “-S/M/L” suffix refers to the pre-training datasets size and schedule, not architecture. We train BiT with several architecture sizes, the default (largest) being ResNet152x4.

3.2 Downstream Tasks

We evaluate BiT on long-standing benchmarks: ILSVRC-2012 [6], CIFAR-10/100 [22], Oxford-IIIT Pet [35] and Oxford Flowers-102 [33]. These datasets differ in the total number of images, input resolution and nature of their categories, from general object categories in ImageNet and CIFAR to fine-grained ones in Pets and Flowers. We fine-tune BiT on the official training split and report results on the official test split if publicly available. Otherwise, we use the val split.

To further assess the generality of representations learned by BiT, we evaluate on the Visual Task Adaptation Benchmark (VTAB) [58]. VTAB consists of 19 diverse visual tasks, each of which has 1000 training samples (VTAB-1k variant). The tasks are organized into three groups: natural, specialized and structured. The VTAB-1k score is top-1 recognition performance averaged over these 19 tasks. The natural group of tasks contains classical datasets of natural images captured using standard cameras. The specialized group also contains images captured in the real world, but through specialist equipment, such as satellite or medical images. Finally, the structured tasks assess understanding of the structure of a scene, and are mostly generated from synthetic environments. Example structured tasks include object counting and 3D depth estimation.

Table 1. Top-1 accuracy for BiT-L on many datasets using a single model and single hyperparameter setting per task (BiT-HyperRule). The entries show median ± standard deviation across 3 fine-tuning runs. Specialist models are those that condition pre-training on each task, while generalist models, including BiT, perform task-independent pre-training. (\(^\star \)Concurrent work.)

Full size table

3.3 Hyperparameter Details

Upstream Pre-training. All of our BiT models use a vanilla ResNet-v2 architecture [11], except that we replace all Batch Normalization [16] layers with Group Normalization [52] and use Weight Standardization [36] in all convolutional layers. See Sect. 4.3 for analysis. We train ResNet-152 architectures in all datasets, with every hidden layer widened by a factor of four (ResNet152x4). We study different model sizes and the coupling with dataset size in Sect. 4.1.

We train all of our models upstream using SGD with momentum. We use an initial learning rate of 0.03, and momentum 0.9. During image preprocessing stage we use image cropping technique from [45] and random horizontal mirroring followed by \(224 \times 224\) image resize. We train both BiT-S and BiT-M for 90 epochs and decay the learning rate by a factor of 10 at 30, 60 and 80 epochs. For BiT-L, we train for 40 epochs and decay the learning rate after 10, 23, 30 and 37 epochs. We use a global batch size of 4096 and train on a Cloud TPUv3-512 [19], resulting in 8 images per chip. We use linear learning rate warm-up for 5000 optimization steps and multiply the learning rate by \(\frac{\text{ batch } \text{ size }}{256}\) following [7]. During pre-training we use a weight decay of 0.0001, but as discussed in Sect. 2, we do not use any weight decay during transfer.

Downstream Fine-Tuning. To attain a low per-task adaptation cost, we do not perform any hyperparameter sweeps downstream. Instead, we present BiT-HyperRule, a heuristic to determine all hyperparameters for fine-tuning. Most hyperparameters are fixed across all datasets, but schedule, resolution, and usage of MixUp depend on the task’s image resolution and training set size.

For all tasks, we use SGD with an initial learning rate of 0.003, momentum 0.9, and batch size 512. We resize input images with area smaller than \(96\times 96\) pixels to \(160\times 160\) pixels, and then take a random crop of \(128\times 128\) pixels. We resize larger images to \(448\times 448\) and take a \(384\times 384\)-sized crop.^{Footnote 1} We apply random crops and horizontal flips for all tasks, except those for which cropping or flipping destroys the label semantics, we provide details in supplementary material.

For schedule length, we define three scale regimes based on the number of examples: we call small tasks those with fewer than 20k labeled examples, medium those with fewer than 500k, and any larger dataset is a large task. We fine-tune BiT for 500 steps on small tasks, for 10k steps on medium tasks, and for 20k steps on large tasks. During fine-tuning, we decay the learning rate by a factor of 10 at 30%, 60% and 90% of the training steps. Finally, we use MixUp [59], with \(\alpha = 0.1\), for medium and large tasks.

Table 2. Improvement in accuracy when pre-training on the public ImageNet-21k dataset over the “standard” ILSVRC-2012. Both models are ResNet152x4.

Full size table

3.4 Standard Computer Vision Benchmarks

We evaluate BiT-L on standard benchmarks and compare its performance to the current state-of-the-art results (Table 1). We separate models that perform task-independent pre-training (“general” representations), from those that perform task-dependent auxiliary training (“specialist” representations). The specialist methods condition on a particular task, for example ILSVRC-2012, then train using a large support dataset, such as JFT-300M [32] or Instagram-1B [55]. See discussion in Sect. 5. Specialist representations are highly effective, but require a large training cost per task. By contrast, generalized representations require large-scale training only once, followed by a cheap adaptation phase.

BiT-L outperforms previously reported generalist SOTA models as well as, in many cases, the SOTA specialist models. Inspired by strong results of BiT-L trained on JFT-300M, we also train models on the public ImageNet-21k dataset. This dataset is more than 10 times bigger than ILSVRC-2012, but it is mostly overlooked by the research community. In Table 2 we demonstrate that BiT-M trained on ImageNet-21k leads to substantially improved visual representations compared to the same model trained on ILSVRC-2012 (BiT-S), as measured by all our benchmarks. In Sect. 4.2, we discuss pitfalls that may have hindered wide adoption of ImageNet-21k as a dataset model for pre-training and highlight crucial components of BiT that enabled success on this large dataset.

For completeness, we also report top-5 accuracy on ILSVRC-2012 with median ± standard deviation format across 3 runs: 98.46% ± 0.02% for BiT-L, 97.69% ± 0.02% for BiT-M and 95.65% ± 0.03% for BiT-S.

3.5 Tasks with Few Datapoints

We study the number of downstream labeled samples required to transfer BiT-L successfully. We transfer BiT-L using subsets of ILSVRC-2012, CIFAR-10, and CIFAR-100, down to 1 example per class. We also evaluate on a broader suite of 19 VTAB-1k tasks, each of which has 1000 training examples.

Figure 2 (left half) shows BiT-L using few-shots on ILSVRC-2012, CIFAR-10, and CIFAR-100. We run multiple random subsamples, and plot every trial. Surprisingly, even with very few samples per class, BiT-L demonstrates strong performance and quickly approaches performance of the full-data regime. In particular, with just 5 labeled samples per class it achieves top-1 accuracy of 72.0% on ILSVRC-2012 and with 100 samples the top-1 accuracy goes to 84.1%. On CIFAR-100, we achieve 82.6% with just 10 samples per class.

Semi-supervised learning also tackles learning with few labels. However, such approaches are not directly comparable to BiT. BiT uses extra labelled out-of-domain data, whereas semi-supervised learning uses extra unlabelled in-domain data. Nevertheless, it is interesting to observe the relative benefits of transfer from out-of-domain labelled data versus in-domain semi-supervised data. In Fig. 2 we show state-of-the-art results from the semi-supervised learning.

Figure 3 shows the performance of BiT-L on the 19 VTAB-1k tasks. BiT-L with BiT-HyperRule substantially outperforms the previously reported state-of-the-art. When looking into performance of VTAB-1k task subsets, BiT is the best on natural, specialized and structured tasks. The recently-proposed VIVI-Ex-100% [50] model that employs video data during upstream pre-training shows very similar performance on the structured tasks.

We investigate heavy per-task hyperparameter tuning in supplementary material and conclude that this further improves performance.

Table 3. Object detection performance on COCO-2017 [28] validation data of RetinaNet models with pre-trained BiT backbones and the literature baseline.

Full size table

3.6 Object Detection

Finally, we evaluate BiT on object detection. We use the COCO-2017 dataset [28] and train a top-performing object detector, RetinaNet [27], using pre-trained BiT models as backbones. Due to memory constraints, we use the ResNet-101x3 architecture for all of our BiT models. We fine-tune the detection models on the COCO-2017 train split and report results on the validation split using the standard metric [28] in Table 3. Here, we do not use BiT-HyperRule, but stick to the standard RetinaNet training protocol, we provide details in supplementary material. Table 3 demonstrates that BiT models outperform standard ImageNet pre-trained models. We can see clear benefits of pre-training on large data beyond ILSVRC-2012: pre-training on ImageNet-21k results in a 1.5 point improvement in Average Precision (AP), while pre-training on JFT-300M further improves AP by 0.6 points.

4 Analysis

We analyse various components of BiT: we demonstrate the importance of model capacity, discuss practical optimization caveats and choice of normalization layer.

4.1 Scaling Models and Datasets

The general consensus is that larger neural networks result in better performance. We investigate the interplay between model capacity and upstream dataset size on downstream performance. We evaluate the BiT models of different sizes (ResNet-50x1, ResNet-50x3, ResNet-101x1, ResNet-101x3, and ResNet-152x4) trained on ILSVRC-2012, ImageNet-21k, and JFT-300M on various downstream benchmarks. These results are summarized in Fig. 4.

When pre-training on ILSVRC-2012, the benefit from larger models diminishes. However, the benefits of larger models are more pronounced on the larger two datasets. A similar effect is observed when training on Instagram hashtags [30] and in language modelling [20].

Not only is there limited benefit of training a large model size on a small dataset, but there is also limited (or even negative) benefit from training a small model on a larger dataset. Perhaps surprisingly, the ResNet-50x1 model trained on the JFT-300M dataset can even performs worse than the same architecture trained on the smaller ImageNet-21k. Thus, if one uses only a ResNet50x1, one may conclude that scaling up the dataset does not bring any additional benefits. However, with larger architectures, models pre-trained on JFT-300M significantly outperform those pre-trained on ILSVRC-2012 or ImageNet-21k.

Figure 2 shows that BiT-L attains strong results even on tiny downstream datasets. Figure 5 ablates few-shot performance across different pre-training datasets and architectures. In the extreme case of one example per class, larger architectures outperform smaller ones when pre-trained on large upstream data. Interestingly, on ILSVRC-2012 with few shots, BiT-L trained on JFT-300M outperforms the models trained on the entire ILSVRC-2012 dataset itself. Note that for comparability, the classifier head is re-trained from scratch during fine-tuning, even when transferring ILSVRC-2012 full to ILSVRC-2012 few shot.

4.2 Optimization on Large Datasets

For standard computer vision datasets such as ILSVRC-2012, there are well-known training procedures that are robust and lead to good performance. Progress in high-performance computing has made it feasible to learn from much larger datasets, such as ImageNet-21k, which has 14.2M images compared to ILSVRC-2012’s 1.28M. However, there are no established procedures for training from such large datasets. In this section we provide some guidelines.

Sufficient computational budget is crucial for training performant models on large datasets. The standard ILSVRC-2012 training schedule processes roughly 100 million images (1.28M images \(\times \) 90 epochs). However, if the same computational budget is applied to ImageNet-21k, the resulting model performs worse on ILSVRC-2012, see Fig. 6, left. Nevertheless, as shown in the same figure, by increasing the computational budget, we not only recover ILSVRC-2012 performance, but significantly outperforms it. On JFT-300M the validation error may not improve over a long time—Fig. 6 middle plot, “8 GPU weeks” zoom-in—although the model is still improving as evidenced by the longer time window.

Another important aspect of pre-training with large datasets is the weight decay. Lower weight decay can result in an apparent acceleration of convergence, Fig. 6 rightmost plot. However, this setting eventually results in an under-performing final model. This counter-intuitive behavior stems from the interaction of weight decay and normalization layers [23, 26]. Low weight decay results in growing weight norms, which in turn results in a diminishing effective learning rate. Initially this effect creates an impression of faster convergence, but it eventually prevents further progress. A sufficiently large weight decay is required to avoid this effect, and throughout we use \(10^{-4}\).

Finally, we note that in all of our experiments we use stochastic gradient descent with momentum without any modifications. In our preliminary experiments we did not observe benefits from more involved adaptive gradient methods.

4.3 Large Batches, Group Normalization, Weight Standardization

Currently, training on large datasets is only feasible using many hardware accelerators. Data parallelism is the most popular distribution strategy, and this naturally entails large batch sizes. Many known algorithms for training with large batch sizes use Batch Normalization (BN) [16] as a component [7] or even highlight it as the key instrument required for large batch training [5].

Our larger models have a high memory requirement for any single accelerator chip, which necessitates small per-device batch sizes. However, BN performs worse when the number of images on each accelerator is too low [15]. An alternative strategy is to accumulate BN statistics across all of the accelerators. However, this has two major drawbacks. First, computing BN statistics across large batches has been shown to harm generalization [5]. Second, using global BN requires many aggregations across accelerators which incurs significant latency.

We investigated Group Normalization (GN) [52] and Weight Standardization (WS) [36] as alternatives to BN. We tested large batch training using 128 accelerator chips and a batch size of 4096. We found that GN alone does not scale to large batches; we observe a performance drop of \(5.4\%\) on ILSVRC-2012 top-1 accuracy compared to using BN in a ResNet-50x1, and less stable training. The addition of WS enables GN to scale to such large batches, stabilizes training, and even outperforms BN, see Table 4. We do not have theoretical understanding of this empirical finding.

We are not only interested in upstream performance, but also how models trained with GN and WS transfer. We thus transferred models with different combinations of BN, GN, and WS pre-trained on ILSVRC-2012 to the 19 tasks defined by VTAB. The results in Table 5 indicate that the GN/WS combination transfers better than BN, so we use GN/WS in all BiT models.

Table 4. Top-1 accuracy of ResNet-50 trained from scratch on ILSVRC-2012 with a batch-size of 4096.

Full size table

Table 5. Transfer performance of the corresponding models from Table 4 fine-tuned to the 19 VTAB-1k tasks.

Full size table

5 Related Work

Large-Scale Weakly Supervised Learning of Representations. A number of prior works use large supervised datasets for pre-training visual representations [18, 24, 30, 43]. In [18, 24] the authors use a dataset containing 100M Flickr images [48]. This dataset appears to transfer less well than JFT-300M. While studying the effect of dataset size, [43] show good transfer performance when training on JFT-300M, despite reporting a large degree of noise (20% precision errors) in the labels. An even larger, noisily labelled dataset of 3.5B Instagram images is used in [30]. This increase in dataset size and an improved model architecture [54] lead to better results when transferring to ILSVRC-2012. We show that we can attain even better performance with ResNet using JFT-300M with appropriate adjustments presented in Sect. 2. The aforementioned papers focus on transfer to ImageNet classification, and COCO or VOC detection and segmentation. We show that transfer is also highly effective in the low data regime, and works well on the broader set of 19 tasks in VTAB [58].

Specialized Representations. Rather than pre-train generic representations, recent works have shown strong performance by training task-specific representations [32, 53, 55]. These papers condition on a particular task when training on a large support dataset. [53, 55] train student networks on a large unlabelled support dataset using the predictions of a teacher network trained on the target task. [32] compute importance weights on the a labelled support dataset by conditioning on the target dataset. They then train the representations on the re-weighted source data. Even though these approaches may lead to superior results, they require knowing the downstream dataset in advance and substantial computational resources for each downstream dataset.

Unsupervised and Semi-Supervised Representation Learning. Self-supervised methods have shown the ability to leverage unsupervised datasets for downstream tasks. For example, [8] show that unsupervised representations trained on 1B unlabelled Instagram images transfer comparably or better than supervised ILSVRC-2012 features. Semi-supervised learning exploits unlabelled data drawn from the same domain as the labelled data. [2, 42] used semi-supervised learning to attain strong performance on CIFAR-10 and SVHN using only 40 or 250 labels. Recent works combine self-supervised and semi-supervised learning to attain good performance with fewer labels on ImageNet [12, 57]. [58] study many representation learning algorithms (unsupervised, semi-supervised, and supervised) and evaluate their representation’s ability to generalize to novel tasks, concluding that a combination of supervised and self-supervised signals works best. However, all models were trained on ILSVRC-2012. We show that supervised pre-training on larger datasets continues to be an effective strategy.

Few-Shot Learning. Many strategies have been proposed to attain good performance when faced with novel classes and only a few examples per class. Meta-learning or metric-learning techniques have been proposed to learn with few or no labels [41, 44, 51]. However, recent work has shown that a simple linear classifier on top of pre-trained representations or fine-tuning can attain similar or better performance [3, 31]. The upstream pre-training and downstream few-shot learning are usually performed on the same domain, with disjoint class labels. In contrast, our goal is to find a generalist representation which works well when transferring to many downstream tasks.

6 Discussion

We revisit classical transfer learning, where a large pre-trained generalist model is fine-tuned to downstream tasks of interest. We provide a simple recipe which exploits large scale pre-training to yield good performance on all of these tasks. BiT uses a clean training and fine-tuning setup, with a small number of carefully selected components, to balance complexity and performance.

In Fig. 7 and supplementary material, we take a closer look at the remaining mistakes that BiT-L makes. In many cases, we see that these label/prediction mismatches are not true ‘mistakes’: the prediction is valid, but it does not match the label. For example, the model may identify another prominent object when there are multiple objects in the image, or may provide an valid classification when the main entity has multiple attributes. There are also cases of label noise, where the model’s prediction is a better fit than the ground-truth label. In a quantitative study, we found that around half of the model’s mistakes on CIFAR-10 are due to ambiguity or label noise (see Fig. 7, left), and in only 19.21% of the ILSVRC-2012 mistakes do human raters clearly agree with the label over the prediction. Overall, by inspecting these mistakes, we observe that performance on the standard vision benchmarks seems to approach a saturation point.

We therefore explore the effectiveness of transfer to two classes of more challenging tasks: classical image recognition tasks, but with very few labelled examples to adapt to the new domain, and VTAB, which contains more diverse tasks, such as spatial localization in simulated environments, and medical and satellite imaging tasks. These benchmarks are much further from saturation; while BiT-L performs well on them, there is still substantial room for further progress.

Notes

1.
For our largest R152x4, we increase resolution to \(512\times 512\) and crop to \(480\times 480\).

References

Athiwaratkun, B., Finzi, M., Izmailov, P., Wilson, A.G.: There are many consistent explanations of unlabeled data: why you should average. In: ICLR (2019)
Google Scholar
Berthelot, D., et al.: ReMixMatch: Semi-supervised learning with distribution alignment and augmentation anchoring. arXiv preprint arXiv:1911.09785 (2019)
Chen, W., Liu, Y., Kira, Z., Wang, Y.F., Huang, J.: A closer look at few-shot classification. In: ICLR (2019)
Google Scholar
Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: CVPR (2017)
Google Scholar
De, S., Smith, S.L.: Batch normalization has multiple benefits: an empirical study on residual networks (2020). https://openreview.net/forum?id=BJeVklHtPr
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: CVPR (2009)
Google Scholar
Goyal, P., et al.: Accurate, large minibatch sgd: training imagenet in 1 h. arXiv preprint arXiv:1706.02677 (2017)
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722 (2019)
He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: ICCV (2019)
Google Scholar
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Google Scholar
He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 630–645. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_38
Hénaff, O.J., Razavi, A., Doersch, C., Eslami, S., van den Oord, A.: Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272 (2019)
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
Huang, Y., et al.: GPipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965 (2018)
Ioffe, S.: Batch renormalization: towards reducing minibatch dependence in batch-normalized models. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems 30, pp. 1945–1953. Curran Associates, Inc. (2017)
Google Scholar
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML (2015)
Google Scholar
Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., Wilson, A.G.: Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407 (2018)
Joulin, A., van der Maaten, L., Jabri, A., Vasilache, N.: Learning visual features from large weakly supervised data. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 67–84. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_5
Jouppi, N.P., et al.: In-datacenter performance analysis of a tensor processing unit. In: International Symposium on Computer Architecture (ISCA) (2017)
Google Scholar
Kaplan, J., et al.: Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Krizhevsky, A.: Learning multiple layers of features from tiny images. University of Toronto, Technical report (2009)
Google Scholar
van Laarhoven, T.: L2 regularization versus batch and weight normalization. CoRR (2017)
Google Scholar
Li, A., Jabri, A., Joulin, A., van der Maaten, L.: Learning visual n-grams from web data. In: ICCV (2017)
Google Scholar
Li, X., Grandvalet, Y., Davoine, F.: Explicit inductive bias for transfer learning with convolutional networks. In: ICML (2018)
Google Scholar
Li, Z., Arora, S.: An exponential learning rate schedule for deep learning. arXiv preprint arXiv:1910.07454 (2019)
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: ICCV (2017)
Google Scholar
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Loshchilov, I., Hutter, F.: Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
Mahajan, D., et al.: Exploring the limits of weakly supervised pretraining. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11206, pp. 185–201. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01216-8_12
Nakamura, A., Harada, T.: Revisiting fine-tuning for few-shot learning. arXiv preprint arXiv:1910.00216 (2019)
Ngiam, J., Peng, D., Vasudevan, V., Kornblith, S., Le, Q.V., Pang, R.: Domain adaptive transfer learning with specialist models. arXiv:1811.07056 (2018)
Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Indian Conference on Computer Vision, Graphics and Image Processing (2008)
Google Scholar
Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng (2009)
Google Scholar
Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.V.: Cats and dogs. In: CVPR (2012)
Google Scholar
Qiao, S., Wang, H., Liu, C., Shen, W., Yuille, A.: Weight standardization. arXiv preprint arXiv:1903.10520 (2019)
Raghu, M., Zhang, C., Kleinberg, J., Bengio, S.: Transfusion: Understanding transfer learning with applications to medical imaging. arXiv:1902.07208 (2019)
Rosenfeld, J.S., Rosenfeld, A., Belinkov, Y., Shavit, N.: A constructive prediction of the generalization error across scales. In: ICLR (2020)
Google Scholar
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NIPS (2017)
Google Scholar
Sohn, K., et al.: Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685 (2020)
Sun, C., Shrivastava, A., Singh, S., Gupta, A.: Revisiting unreasonable effectiveness of data in deep learning era. In: ICCV (2017)
Google Scholar
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: CVPR (2018)
Google Scholar
Szegedy, C., et al.: Going deeper with convolutions. In: CVPR (2015)
Google Scholar
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: CVPR (2016)
Google Scholar
Tan, M., Le, Q.: Efficientnet: rethinking model scaling for convolutional neural networks. In: ICML (2019)
Google Scholar
Thomee, B., et al.: Yfcc100m: The new data in multimedia research. arXiv preprint arXiv:1503.01817 (2015)
Touvron, H., Vedaldi, A., Douze, M., Jégou, H.: Fixing the train-test resolution discrepancy. In: NeurIPS (2019)
Google Scholar
Tschannen, M., et al.: Self-supervised learning of video-induced visual invariances. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020
Google Scholar
Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networks for one shot learning. In: NIPS (2016)
Google Scholar
Wu, Y., He, K.: Group normalization. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 3–19. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_1
Chapter Google Scholar
Xie, Q., Hovy, E., Luong, M.T., Le, Q.V.: Self-training with noisy student improves imagenet classification. arXiv preprint arXiv:1911.04252 (2019)
Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR (2017)
Google Scholar
Yalniz, I.Z., Jégou, H., Chen, K., Paluri, M., Mahajan, D.: Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546 (2019)
Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: Cutmix: Regularization strategy to train strong classifiers with localizable features. arXiv preprint arXiv:1905.04899 (2019)
Zhai, X., Oliver, A., Kolesnikov, A., Beyer, L.: S\(^\mathbf{4}\)L: self-supervised semi-supervised learning. In: ICCV (2019)
Google Scholar
Zhai, X., et al.: A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867 (2019)
Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: ICLR (2017)
Google Scholar

Download references

Author information

Authors and Affiliations

Google Research, Brain Team, Zürich, Switzerland
Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly & Neil Houlsby

Authors

Alexander Kolesnikov
View author publications
You can also search for this author in PubMed Google Scholar
Lucas Beyer
View author publications
You can also search for this author in PubMed Google Scholar
Xiaohua Zhai
View author publications
You can also search for this author in PubMed Google Scholar
Joan Puigcerver
View author publications
You can also search for this author in PubMed Google Scholar
Jessica Yung
View author publications
You can also search for this author in PubMed Google Scholar
Sylvain Gelly
View author publications
You can also search for this author in PubMed Google Scholar
Neil Houlsby
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Xiaohua Zhai .

Editor information

Editors and Affiliations

University of Oxford, Oxford, UK
Andrea Vedaldi
Graz University of Technology, Graz, Austria
Horst Bischof
University of Freiburg, Freiburg im Breisgau, Germany
Thomas Brox
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jan-Michael Frahm

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2178 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Kolesnikov, A. et al. (2020). Big Transfer (BiT): General Visual Representation Learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12350. Springer, Cham. https://doi.org/10.1007/978-3-030-58558-7_29

Download citation

DOI: https://doi.org/10.1007/978-3-030-58558-7_29
Published: 29 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58557-0
Online ISBN: 978-3-030-58558-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Big Transfer (BiT): General Visual Representation Learning

Abstract

Similar content being viewed by others

Deep Architectures in Visual Transfer Learning

Training Vision Transformers with only 2040 Images

Best Practices for Fine-Tuning Visual Classifiers to New Domains

1 Introduction