1 Introduction

One of the key factors that contributes to the impact of convolutional neural networks (CNNs) is the transferability of their internal deep representations for a variety of visual recognition tasks. Such deep representations, or ‘features,’ are empirically superior [1, 2] to traditional features when classifying datasets such as Caltech-101, Caltech-256, and SUN397 [3–5]. Recently, methods of fine-tuning pre-trained networks towards new target datasets have become very popular, as they are usually more effective than training deep networks from scratch. With fine-tuning, the first n layers from a pre-trained network are copied to the target network, while the remaining layers are randomly initialized and trained on the target dataset. One can either choose to adapt, or fine-tune, the copied layers or leave them unchanged (frozen) during training [6, 7]. Past work has even shown the superiority of fine-tuning over using generic CNN features for different visual tasks like detection [7–9].

When faced with a new dataset, however, there is little guidance on how many layers to copy and whether to fine-tune or freeze these layers. We argue that the best practice may vary depending on a few key factors of the transfer setup. As a result, we analyze the performance of fine-tuning methods across seven target datasets, comparing these methods along two intrinsic properties of the transfer setup: the difference from the source dataset and the amount of available target training data. Through our analysis, we uncover the best training methods in each scenario and use the results to provide two main recommendations:

  1. Copy all layers except the classification layer. This is often standard practice, though we are the first to provide comprehensive evidence across a variety of datasets and many different operating points of the amount of labeled data available in the target dataset.

  2. Fine-tune the copied layers. We find that even with very few examples, fine-tuning is possible and beneficial. The exception is when the distance to the source dataset is small and only a small amount of training data is available; in that case, freeze the copied layers.

2 Related Work

Although networks have been shown to increase in class and representation specificity from lower to higher layers [6, 8], for many datasets the best performance of generic AlexNet [10] features occurs at the third-to-last (fc6) [2], or second-to-last (fc7) fully-connected layer [1]. Moreover, Girshick et al.’s ablation studies found it is best to copy all layers from a network pre-trained on ImageNet and fine-tune these layers towards the PASCAL VOC detection task [7].

Yosinski et al. [6] varied the number of pre-trained layers copied and examined target-dataset accuracy for both fine-tuned and frozen (unchanged) networks. With the source dataset as one half of ImageNet’s classes and the target dataset as the remaining half, Yosinski et al. found that fine-tuning was the optimal technique, with performance improving slightly as more layers were copied. When layers were frozen, performance degraded as more layers were copied. However, Yosinski et al.’s work studied a target dataset that was virtually identical to the source dataset and had an extremely large number of samples (approximately 645,000). This directly motivates our broader analysis of target datasets at varying distances from the source dataset, with both scarce and plentiful training data.

The need to characterize distance between source and target is further motivated by Azizpour et al. and Zhou et al., who both demonstrated substantial variation in the performance of generic features depending on the source dataset and qualitative characteristics of the target dataset [9, 11]. Zhou et al. specifically demonstrated that a CNN pre-trained on a scene dataset is superior to an ImageNet model when fine-tuned towards other scene datasets.

3 Experiments

We follow the experimental setup used by Yosinski et al. [6]. Specifically, we use the 8-layer Caffe implementation of AlexNet (CaffeNet), pre-trained on the entire ImageNet training set [12].

We evaluate the performance of fine-tuning and freezing when adapting the pre-trained model to 6 target datasets. For each dataset, we define 3–4 dataset splits with a varying number of images per class, for a total of 23 dataset splits. We also define fixed validation and test sets which, due to different amounts of training data, differ in size between datasets. The splits are outlined in Table 1.

For each dataset split, we randomly initialize the top 1, 3, or 5 layers of the pre-trained model while copying the rest of the layers. Additionally, we either freeze the copied layers, setting their learning rate to 0, or fine-tune them, setting their learning rate to 0.2 times that of the randomly initialized layers. We follow this procedure for all target datasets, resulting in 138 experiments. For notation, T(a-b) denotes that layers a-b are copied and fine-tuned, whereas F(a-b) denotes that layers a-b are copied and frozen. R(a-b) denotes that layers a-b are randomly initialized. For example, T(1-7)R(8) denotes the experiment where we copy and fine-tune layers 1–7 and randomly initialize the final fully-connected layer, fc8.
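
For concreteness, the sketch below shows how such an experiment might be configured in a modern framework. It is not the original Caffe implementation: torchvision’s AlexNet stands in for CaffeNet, and the function name, base learning rate, and optimizer settings are illustrative assumptions; only the 0.2x learning-rate multiplier and the layer groupings follow the text above.

```python
# Illustrative sketch (not the authors' Caffe code): configuring T(1-7)R(8)
# versus F(1-7)R(8), with torchvision's AlexNet standing in for CaffeNet.
import torch
import torch.nn as nn
from torchvision import models

def build_model(num_classes, freeze_copied=False, base_lr=0.01):
    # Copy all layers from the ImageNet-pre-trained model.
    model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

    # R(8): randomly re-initialize the final fully-connected layer (fc8).
    model.classifier[6] = nn.Linear(4096, num_classes)

    copied, new = [], []
    for name, p in model.named_parameters():
        (new if name.startswith("classifier.6") else copied).append(p)

    if freeze_copied:
        # F(1-7): keep the copied layers fixed (equivalent to a learning rate of 0).
        for p in copied:
            p.requires_grad = False
        param_groups = [{"params": new, "lr": base_lr}]
    else:
        # T(1-7): fine-tune the copied layers at 0.2x the rate of the new layer.
        param_groups = [{"params": copied, "lr": 0.2 * base_lr},
                        {"params": new, "lr": base_lr}]

    optimizer = torch.optim.SGD(param_groups, lr=base_lr, momentum=0.9)
    return model, optimizer

# Example: T(1-7)R(8) for a 200-class target such as CUB.
model, optimizer = build_model(num_classes=200, freeze_copied=False)
```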

Table 1. Properties of datasets and dataset splits. \(*\) indicates we used the dataset’s provided validation or test set as our test set.

We evaluate on 6 target datasets: Caltech-256, SUN397, MIT Indoor Scene Recognition (MITIndoor) [13], Caltech-UCSD-Birds-200 (CUB) [14], Moorea Labeled Corals (Coral) [15], and Imaging Flow Cytobot Data Plankton (Plankton) [16]. Properties of these datasets are summarized in Table 1. We use only 8 of the Coral categories and 34 of the Plankton categories due to insufficient training data per class.

In addition to these 138 experiments, we include an additional artificial dataset from Yosinski et al. [6], denoted as Yosinski in Table 1. In their experiments, ImageNet is randomly split into two disjoint 500-category datasets (dataset 500A and dataset 500B) of roughly equal size (approximately 645,000 images each). CaffeNet is pre-trained on dataset 500A, with dataset 500B as the target. We use this setup to analyze the case when the datasets are essentially identical and also to compare our results to theirs. We use the same experimental setups as before, adding 40 more experiments.

3.1 Difference from Source

To measure the difference between the source and target datasets, we compute the cosine distance, \(1 - \frac{\mu _s^T \mu _t}{||\mu _s||_2||\mu _t||_2}\), between the mean fc7 responses of the source, \(\mu _s\), and target, \(\mu _t\), datasets. Although not a formal distance metric, since it violates the triangle inequality, the cosine distance effectively measures the similarity of two vectors, which in our case serves as a measure of the similarity of the two datasets. In Table 2, we compare the cosine distance to other metrics, such as the MMD between the source and target datasets in fc7 feature space, which, with a linear kernel, is equivalent to the Euclidean distance between the means of the fc7 responses. We also consider the accuracy of two classifiers trained to distinguish between the datasets: a linear SVM in fc7 feature space and the small CNN model used by Krizhevsky for CIFAR-10 classification in pixel space [17]. This approach was recently used to minimize domain difference for adaptation [18–20]. When generating fc7 responses, we use CaffeNet pre-trained on the source dataset: this source dataset is ImageNet for all dataset setups except Yosinski, where the source is 500A.
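
As an illustration, the following sketch computes the cosine distance and the linear-kernel MMD from precomputed fc7 feature matrices. The array names and shapes are assumptions for the example, not part of the original pipeline.

```python
# Illustrative sketch: dataset distances from mean fc7 responses.
# `fc7_source` and `fc7_target` are assumed to be (num_images x 4096) arrays
# of fc7 activations extracted with the network pre-trained on the source.
import numpy as np

def dataset_distances(fc7_source, fc7_target):
    mu_s = fc7_source.mean(axis=0)
    mu_t = fc7_target.mean(axis=0)

    # Cosine distance: 1 - (mu_s . mu_t) / (||mu_s|| ||mu_t||).
    # Bounded in [0, 1] here because post-ReLU fc7 responses are non-negative.
    cosine = 1.0 - mu_s.dot(mu_t) / (np.linalg.norm(mu_s) * np.linalg.norm(mu_t))

    # MMD with a linear kernel reduces to the Euclidean distance between means.
    linear_mmd = np.linalg.norm(mu_s - mu_t)

    return cosine, linear_mmd
```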

Table 2. Distance between source and target dataset. For Yosinski, source is dataset 500A and target is dataset 500B. For others, source is ImageNet and target is listed.

Although these metrics produce different numeric values, they yield the same ordering of the datasets. This suggests that the other metrics are viable substitutes for cosine distance, but we settle on cosine distance because it does not require training a classifier and because it is bounded between 0 and 1.

Fig. 1. Top: ratio between accuracy for R(8) and accuracy for R(6-8). Bottom: ratio between R(8) and R(4-8). Each point represents a pair of experiments with all other conditions fixed. Left: fine-tune (T) experiments only; right: freeze (F) experiments only. Values above 1.0 imply copying is better than random initialization; values below 1.0 imply copying is worse. Trend lines are averages.

4 Results

In this analysis we refer to a low source-target distance as a cosine distance between 0 and 0.2, a medium distance as 0.2–0.4, and a high distance as 0.4–1. We refer to a low amount of target data as 1–20 images per class, a medium amount as 21–99 images per class, and a large amount as 100 or more images per class. All raw experimental results can be found in Fig. 3 in the Appendix.

We begin by studying whether random initialization or initialization with copied parameters yields higher performance across a variety of dataset shifts. To do this, we hold all parameters of a particular experiment fixed except for whether certain layers are randomly initialized or copied. The results across all experiments are shown in Fig. 1, which plots the ratio of accuracy when copying layers to accuracy when randomly initializing them. We find that randomly initializing layers beyond the necessary fc8 layer almost always degrades performance (nearly all ratios across experiments are above 1). The notable exception is when the source and target datasets are highly different and a large number of labeled target examples is available for fine-tuning. In this setting, random initialization offers a stronger benefit when the lowest layers are frozen during the final training step, but marginal or no improvement over copying when all layers are fine-tuned. Therefore, we conclude that copying all but the last layer of the network is generally the best practice when fine-tuning to a new dataset.

Our finding that all but the last layer should be copied is in direct contrast with Yosinski et al., who showed that copying fewer layers is better when freezing [6]. The contrasting results are easily explained by the much larger amount of data used in that study, whereas we seek to analyze scenarios in which large amounts of data are not available: the largest amount of data we use for the Yosinski target dataset is 120 images/class (60,000 images). There is evidently an inflection point between 60,000 and approximately 645,000 training examples beyond which copying more layers ceases to be beneficial when freezing. For the remainder of our analysis, we follow our first recommendation and copy all parameters from the initial source network, choosing between F(1-7)R(8) and T(1-7)R(8). We present our recommendations in Table 3 and discuss evidence from our experiments next.

Table 3. Best practices for adapting pre-trained networks where all but the last layer are copied. We compare freezing (F(1-7)R(8)) against fine-tuning (T(1-7)R(8)).
Fig. 2. Ratio between accuracy when fine-tuning and accuracy when freezing, measured across pairs of experiments with all other conditions fixed. Only R(8) experiments are shown. Values above 1.0 imply fine-tuning is better; values below 1.0 imply freezing is better. Left: the effect of the amount of target data on this ratio; the dotted line denotes large amounts of data. Right: the effect of source-target distance on this ratio. Trend lines are averages.

Figure 2 (left) shows that with a large amount of target data, fine-tuning is always best, fitting our intuition that more data reduces overfitting. But Fig. 2 (right) shows that when there is a low or medium amount of training data, the distance between source and target plays a more important role. Broadly, as distance increases, fine-tuning improves relative to freezing, supporting the notion that learned features are less transferable to distant datasets. At one end, in the low data and low distance setting, freezing outperforms fine-tuning. At the other end, in the medium data and medium-to-high distance setting, fine-tuning outperforms freezing.

The inflection point occurs in settings where (1) target data is low and distance is medium or high, or (2) target data is medium and distance is low. Here there is no consistent winner between fine-tuning and freezing, so we recommend trying both fine-tuned and frozen networks. An additional consideration, though not shown in the figure, is that fine-tuned models generally take more time to train; one might therefore choose between the two techniques based on whether training speed or accuracy is the priority.
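
To make these recommendations concrete, they can be condensed into a simple decision rule. The sketch below is purely illustrative: the thresholds mirror the regime definitions at the start of this section, the T/F/R notation is from Sect. 3, and the fallback covers the settings where we observed no consistent winner.

```python
# Illustrative decision rule summarizing Table 3; thresholds follow the
# regimes defined in Sect. 4 (distance: low < 0.2, medium 0.2-0.4, high >= 0.4;
# data: low 1-20, medium 21-99, large >= 100 images per class).
def recommend(cosine_distance, images_per_class):
    # Recommendation 1 always applies: copy everything except fc8 (R(8) only).
    if images_per_class >= 100:
        return "T(1-7)R(8)"   # large target data: fine-tune the copied layers
    if images_per_class <= 20 and cosine_distance < 0.2:
        return "F(1-7)R(8)"   # low data, low distance: freeze the copied layers
    if images_per_class > 20 and cosine_distance >= 0.2:
        return "T(1-7)R(8)"   # medium data, medium/high distance: fine-tune
    # Low data with medium/high distance, or medium data with low distance:
    # no consistent winner; try both and pick by validation accuracy.
    return "try both F(1-7)R(8) and T(1-7)R(8)"
```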