
1 Introduction

Deep learning has been one of the key contributing technologies behind most recent advances in machine learning, across a wide variety of applications such as image recognition, natural language processing, and autonomous driving. The search for an optimal deep learning architecture is of great practical relevance, yet it is a tedious process that is often left to manual configuration. Neural Architecture Search (NAS) is the umbrella term for all methods that automate this search process. Common optimization methods use techniques from reinforcement learning [1, 3, 4, 34, 35, 36] or evolutionary algorithms [15, 22, 23], or are based on surrogate models [14, 19]. The search is computationally expensive since it requires training hundreds or thousands of models, each of which requires a few hours of training on a GPU [14, 15, 19, 22, 35, 36]. All of these optimization methods have in common that they treat every new problem independently, without leveraging previous experience. However, it is common knowledge that well-performing architectures for one task can be transferred to other tasks and still achieve good performance. Architectures discovered for CIFAR-10 have not only been transferred to CIFAR-100 and ImageNet [14, 19, 22, 36] but also from object recognition to object detection tasks [17, 24]. This suggests that the response functions of different tasks, i.e. their architecture-score mappings, share commonalities.

The central idea in this work lies in the development of a search method that uses knowledge acquired across previously explored tasks to speed up the search for a new task. For this, we assume that the response functions can be decomposed into two parts: a universal part, which is shared across all tasks, and a task-specific part (Fig. 1). We model these two functions with neural networks and determine their parameters from the knowledge collected about architectures on different tasks. This allows the search for a new task to start from the universal representation and only later learn and benefit from the task-specific representation, reducing the search time without negatively affecting the final solution.

Fig. 1. An example of the integration of the transfer network into RL-based (a) and surrogate model-based (b) optimizers. The transfer network (c) disentangles task-independent and task-specific influences.

The contributions in this paper are threefold:

  • First, we propose a general, minimally invasive framework that allows existing NAS optimizers to leverage knowledge from other data sets, e.g. obtained during previous searches.

  • Second, as an example, we apply the framework to NAO [19], a recent NAS optimizer, and derive XferNAS. This exemplifies the simplicity and elegance of extending existing NAS methods within our framework.

  • Finally, we demonstrate the utility of the adapted optimizer XferNAS by searching for an architecture on CIFAR-10 [12]. In only 6 GPU days (NAO needed 200 GPU days) we discover a new architecture with improved performance compared to the state-of-the-art methods. We confirm the transferability of the discovered architecture by applying it, unchanged, to CIFAR-100.

2 Transfer Neural Architecture Search

In this section, we introduce our general, minimally invasive framework for NAS optimizers to leverage knowledge from other data sets, e.g. obtained during previous searches. First, we formally define the NAS problem and introduce our notation. Then we motivate our approach and introduce the framework. Finally, using the example of NAO [19], we show what steps are required to integrate the framework with existing optimizers. In this step, we derive XferNAS, which we examine in more detail in Sect. 4.

2.1 Problem Definition

We define a general deep learning algorithm \(\mathbb {L}\) as a mapping from the space of architectures A and data sets D to the space of models M,

$$\begin{aligned} \mathbb {L}\ :\ A\times D\rightarrow M\ . \end{aligned}$$
(1)

For any given data set \(d\in D\) and architecture \(a\in A\), this mapping returns the solution of the standard machine learning problem, which consists of minimizing a regularized loss function \(\mathcal {L}\) with respect to the model parameters \(\theta \) of architecture a on the data d,

$$\begin{aligned} \mathbb {L}\left( a,d\right) = {\mathop {\hbox {arg min}}\limits _{m^{(a,\theta )}\in M^{(a)}}}\,\mathcal {L}\left( m^{(a,\theta )}, d^{(\text {train})}\right) + \mathcal {R}\left( \theta \right) \ . \end{aligned}$$
(2)

Neural Architecture Search solves the following nested optimization problem: given a data set d and the search space A, find the optimal architecture \(a^{\star }\in A\) which maximizes the objective function \(\mathcal {O}\) (defined by classification accuracy in the scope of this work) on the validation data,

$$\begin{aligned} a^\star = {\mathop {\hbox {arg max}}\limits _{a\in A}}\, \mathcal {O}\left( \mathbb {L}\left( a,d^{(\text {train})}\right) ,d^{(\text {valid})}\right) ={\mathop {\hbox {arg max}}\limits _{a\in A}}\, f\left( a\right) \ . \end{aligned}$$
(3)

Thus, Neural Architecture Search can be considered a global black-box optimization problem where the aim is to maximize the response function f. It is worth noting that the evaluation of f at any point a is computationally expensive since it involves training and evaluating a deep learning model.
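
To make this black-box view concrete, the following sketch shows the search loop implied by Eq. (3). The helpers `sample_architecture`, `train_model`, and `accuracy` are hypothetical placeholders, not functions from the paper's code; the sketch only illustrates why each evaluation of f is expensive.

```python
# A minimal sketch of NAS as black-box maximization of the response function f,
# assuming hypothetical helpers sample_architecture, train_model, and accuracy.

def response(a, d_train, d_valid):
    """f(a): train architecture a on d_train and score it on d_valid."""
    model = train_model(a, d_train)      # expensive: several GPU hours per call
    return accuracy(model, d_valid)      # objective O, here classification accuracy

def search(num_evaluations, d_train, d_valid):
    best_a, best_f = None, float("-inf")
    for _ in range(num_evaluations):
        a = sample_architecture()        # proposal mechanism (RL controller, SMBO, random, ...)
        f_a = response(a, d_train, d_valid)
        if f_a > best_f:
            best_a, best_f = a, f_a
    return best_a
```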

In this work we assume that we have access to knowledge about the response functions of some source tasks \(1,\ldots ,n\), referred to as source knowledge. The idea is to leverage the source knowledge to address the NAS problem defined in Eq. (3) for a new task, the target task \(n+1\). The source knowledge alone is by no means sufficient to yield an optimal architecture for the new task: sample architectures must be evaluated on the target task in order to gain knowledge about the target response function. We call the knowledge accumulated in this process the target knowledge and refer to the combined source and target knowledge as the observation history. In the context of this work, we refrain from transferring model weights.

Fig. 2. XferNAS: integration of the transfer network into NAO.

2.2 Transfer Network

The most common NAS optimizers are based on reinforcement learning (RL) [1, 3, 4, 34, 35, 36] or surrogate model-based optimization (SMBO) [14, 19]. Many RL-based methods rely on policy gradients and use a controller (a neural network) to provide a distribution over architectures. The controller is optimized to learn a policy \(\pi \) which maximizes the response function of the target task. SMBO methods, in contrast, use a surrogate model \(\hat{f}^{(n+1)}\) to approximate the target response function \(f^{(n+1)}\). Both approaches rely on the feedback gathered by evaluating the target response function for several architectures.

We propose a general framework to transfer the knowledge gathered in previous experiments in order to speed up the search on the target task. Current state-of-the-art NAS optimizers directly learn a task-dependent policy (RL) or a task-dependent surrogate model (SMBO) \(g^{(i)}\). The core idea in this work is to disentangle the contribution of the universal function \(g^{(u)}\) from the task-dependent function \(g^{(i)}\) by assuming

$$\begin{aligned} g^{(i)}=g^{(u)}+r^{(i)}\ , \end{aligned}$$
(4)

where \(r^{(i)}\) is a task-dependent residual. This disentanglement is achieved by learning all parameters jointly on the observation history, where the universal function is involved for all tasks while each task-dependent residual is included only for its corresponding task. The universal function can be interpreted as modeling a good average solution across tasks, whereas the task-dependent function is optimal for a particular task i. The advantage is that we can warmstart any NAS optimizer for the target task, where \(r^{(n+1)}\) is unknown, by using only \(g^{(u)}\). As soon as target knowledge is obtained, we can learn \(r^{(n+1)}\), which enables us to benefit from the warmstart in the initial phase of the search and subsequently from the original NAS optimizer. We sketch the idea of this transfer network in Fig. 1 and provide an example for both reinforcement learning and surrogate model-based optimizers. The functions \(g^{(u)}\) and \(r^{(i)}\) are each modeled by a neural network, referred to as the universal network and the residual task networks, respectively. In the following, we exemplify this integration for the case of NAO [19] to provide a deeper understanding.
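
As a concrete illustration, the following PyTorch sketch shows one way to realize the decomposition of Eq. (4) with a shared universal network and per-task residual networks. The layer sizes and the use of two-layer feed-forward heads are illustrative assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

# A sketch of the transfer network of Eq. (4): one universal network g^(u) shared by all
# tasks and one residual network r^(i) per task. Layer sizes are illustrative assumptions.

class TransferNetwork(nn.Module):
    def __init__(self, code_dim, num_tasks, hidden=64):
        super().__init__()
        self.universal = nn.Sequential(          # g^(u), trained on observations of all tasks
            nn.Linear(code_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        self.residuals = nn.ModuleList([         # r^(i), trained only on observations of task i
            nn.Sequential(nn.Linear(code_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(num_tasks)
        ])

    def forward(self, z, task_id=None):
        # g^(i)(z) = g^(u)(z) + r^(i)(z); with task_id=None (no target knowledge yet),
        # only the universal part is used to warmstart the search.
        out = self.universal(z)
        if task_id is not None:
            out = out + self.residuals[task_id](z)
        return out
```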

2.3 XferNAS

In principle, any of the existing network-based optimizers (RL or SMBO) can easily be extended with our proposed transfer network in order to leverage the source knowledge. We demonstrate the use of the transfer network using the example of NAO [19], one of the state-of-the-art optimizers. NAO is based on two components, an auto-encoder and a performance predictor. The auto-encoder first transforms the architecture encoding into a continuous architecture code by means of an encoder and then reconstructs the original encoding using a decoder. The performance predictor predicts the accuracy on the validation split for a given architecture code. XferNAS extends this architecture by integrating the transfer network into the performance predictor, which then predicts the accuracy not only for the target task but also for all source tasks (Fig. 2). Here, the prediction function of each task i is decomposed into a universal prediction function and a task-specific residual,

$$\begin{aligned} \hat{f}^{(i)}=\hat{f}^{(u)}+r^{(i)}\ . \end{aligned}$$
(5)

In this case, the universal prediction function can be interpreted as a prediction function that models the general architecture bias, regardless of the task. This is suitable for cases where there is no knowledge about the target task, as is the case at the beginning of a new search. The task-specific residual models the task-specific peculiarities. If knowledge about the task exists, it can be used to correct the prediction function for certain architecture codes.

The loss function to be optimized is identical to the original version of NAO and is a combination of the prediction loss function \(L_{pred}\) and the reconstruction loss function \(L_{rec}\),

$$\begin{aligned} L=\alpha L_{pred} + (1-\alpha ) L_{rec}\ . \end{aligned}$$
(6)

However, the prediction loss function now takes into account the observations of all tasks in order to train the model jointly,

$$\begin{aligned} L_{pred} = \sum _{i=1}^{n+1}\sum _{a\in H^{(i)}} \left( f^{(i)}\left( a\right) -\hat{f}^{(i)}\left( a\right) \right) ^2\ . \end{aligned}$$
(7)

\(H^{(i)}\) is the set of all architectures which have been evaluated on task i and for which the value of the response function is therefore known. The joint optimization of both loss functions guarantees that architectures which are close in the architecture code space exhibit similar behavior across tasks.
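
The following sketch shows how the prediction loss of Eq. (7) can be computed over the observation history. It assumes a surrogate with the interface of the transfer-network sketch above (a callable taking an architecture code and a task index) and is only an illustration, not the authors' implementation.

```python
import torch

# A sketch of the prediction loss of Eq. (7). history[i] holds the (code, accuracy)
# pairs H^(i) observed on task i (i = 0..n-1 are source tasks, i = n the target task);
# surrogate(z, task_id) is assumed to return \hat f^(i)(z) = \hat f^(u)(z) + r^(i)(z).

def prediction_loss(surrogate, history):
    loss = torch.tensor(0.0)
    for task_id, observations in enumerate(history):
        for z, f in observations:
            f_hat = surrogate(z, task_id)
            loss = loss + ((f - f_hat) ** 2).sum()   # squared error, summed over all tasks
    return loss

# The total training objective of Eq. (6) would then combine this term with the
# reconstruction loss: loss = alpha * L_pred + (1 - alpha) * L_rec.
```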

Once the model has been trained by minimizing the loss function, potentially better architectures can be determined for the target task. The architecture codes of models with satisfactory performance serve as starting points for the optimization process. Gradient-based optimization is used to modify the current architecture code to maximize the prediction function of the target task,

$$\begin{aligned} z\leftarrow z + \eta \frac{\partial \hat{f}^{(n+1)}}{\partial z}\ , \end{aligned}$$
(8)

where z is the current architecture code and \(\eta \) is the step size. The architecture encoding is reconstructed by applying the decoder to the final architecture code. The step size is chosen large enough to obtain a new architecture, which is then evaluated on the target task.
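
The gradient step of Eq. (8) on the continuous architecture code can be sketched as follows; `predictor` and `decoder` are hypothetical stand-ins for the target prediction function \(\hat{f}^{(n+1)}\) and the NAO decoder, not names from the released code.

```python
import torch

# A sketch of the code-space update of Eq. (8): move the architecture code z in the
# direction that increases the predicted target accuracy, then decode it back into an
# architecture encoding. predictor and decoder are hypothetical stand-ins.

def improve_architecture(z, predictor, decoder, eta=10.0):
    z = z.clone().detach().requires_grad_(True)
    f_hat = predictor(z)                 # \hat f^(n+1)(z), a scalar prediction
    f_hat.backward()                     # compute d f_hat / d z
    z_new = (z + eta * z.grad).detach()  # gradient ascent step with step size eta
    return decoder(z_new)                # reconstruct the architecture from the new code
```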

XferNAS operates in two phases. In the first phase, the system lacks any target knowledge and relies solely on the source knowledge. The architectures with the highest accuracy on the source tasks serve as starting points for the determination of new candidates, using the process described in Eq. (8) with \(\eta =10\). Once some target knowledge has been accumulated, the second phase selects as starting points the models with high accuracy on the target task. To keep the search time low, we evaluate only 33 architectures in total.

Implementation Details. The transfer network has been integrated into the publicly available code of NAO (see footnote 1), allowing a fair comparison to Luo et al. [19]. In our experiments we retain the prescribed architecture and its hyperparameters; for the sake of completeness, we repeat the description here. LSTM models are used for the encoder and the decoder. The encoder uses an embedding size of 32 and a hidden state size of 96, whereas the decoder uses a hidden state size of 96 in combination with an attention mechanism. Mean pooling is applied to the encoder LSTM's output to obtain a vector which serves as the input to the transfer network. The universal and the residual task networks are modeled as feed-forward neural networks. Adam [11] is used to minimize Eq. (6) with the learning rate set to \(10^{-3}\), the trade-off parameter \(\alpha \) to 0.8, and the weight decay to \(10^{-4}\).
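
Under the stated sizes, the surrogate could be assembled roughly as in the sketch below. The vocabulary size, the single-layer LSTM, and the widths of the feed-forward heads are assumptions for illustration and are not taken from the NAO code; the decoder and its attention mechanism are omitted.

```python
import torch.nn as nn
import torch.optim as optim

# A rough sketch of the surrogate configuration: token embedding of size 32, LSTM encoder
# with hidden size 96, mean pooling over the sequence, and feed-forward universal/residual
# heads. Vocabulary size and head widths are illustrative assumptions.

vocab_size, num_tasks = 20, 5                      # 4 source tasks + 1 target task
embedding = nn.Embedding(vocab_size, 32)
encoder = nn.LSTM(input_size=32, hidden_size=96, batch_first=True)
universal = nn.Sequential(nn.Linear(96, 96), nn.ReLU(), nn.Linear(96, 1))
residuals = nn.ModuleList([nn.Sequential(nn.Linear(96, 96), nn.ReLU(), nn.Linear(96, 1))
                           for _ in range(num_tasks)])

def encode(tokens):                                # tokens: [batch, seq_len] architecture encoding
    out, _ = encoder(embedding(tokens))            # [batch, seq_len, 96]
    return out.mean(dim=1)                         # mean pooling -> architecture code

params = (list(embedding.parameters()) + list(encoder.parameters())
          + list(universal.parameters()) + list(residuals.parameters()))
optimizer = optim.Adam(params, lr=1e-3, weight_decay=1e-4)
alpha = 0.8                                        # trade-off in L = alpha*L_pred + (1-alpha)*L_rec
```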

3 Related Work

Neural Architecture Search (NAS), the structural optimization of neural networks, is addressed with a variety of optimization techniques. These include reinforcement learning [1, 3, 4, 27, 34, 35, 36], evolutionary algorithms [15, 22, 23, 26], and surrogate model-based optimization [14, 19]. These techniques have made great advances with the idea of sharing weights across the different architectures sampled during the search process [2, 5, 16, 21, 33] instead of training each of them from scratch. However, recent work shows that this idea does not work better than a random search [25]. For a detailed overview we refer to a recent survey [29].

A new but promising idea is to transfer knowledge from previous search processes to new ones [10, 28, 30], which is analogous to the behavior of human experts. We briefly discuss the current work on NAS for convolutional neural networks (CNNs) in this context. TAPAS [10] is an algorithm that starts with a simple architecture and extends it based on a prediction model. For the predictions, a very simple network is first trained on the target data set. Subsequently, its validation error is used to determine the similarity to previously examined data sets. Based on this similarity, predictions of the validation error of different architectures on the target data set are obtained. By means of these, a set of promising architectures is determined and evaluated on the target data set. However, the prediction model is not able to leverage the additional information collected on the target data set. T-NAML [30] seeks to achieve a similar effect without searching for new architectures. Instead, it chooses a network which has been pre-trained on ImageNet and makes various decisions to adapt it to the target data set. For this purpose, it uses a reinforcement learning method which learns to optimize neural architectures across several data sets simultaneously.

4 Experiments

In this section we empirically evaluate XferNAS and compare the discovered architectures to the state-of-the-art. Furthermore, we investigate the transferability of the discovered architecture by training it on a different data set without introducing any further changes. In a final ablation study, we investigate the impact of the amount of source and target knowledge on the surrogate model's predictions.

4.1 Architecture Search Space

In our experiments, we use the widely adopted NASNet search space [36], which is also used by most of the optimizers we compare with. Architectures in this search space are based on two types of cells, normal cells and reduction cells. These cells are combined to form the network architecture, with repeating units comprising N normal cells followed by a reduction cell. This sequence is repeated several times, doubling the number of filters each time. The reduction cell differs from the normal cell in that it halves the spatial dimension of the feature maps. Each cell consists of B blocks, and the first sequence of N normal cells uses F filters. The output of each block is the sum of the results of two operations, where the choice of the operations and their inputs is the task of the optimizer. There are #op different operations; an input can be the output of any previous block in the same cell or the output of one of the previous two cells. The output of a cell is obtained by concatenating all block outputs that do not serve as input to any block. The 19 considered operations are: identity, convolution (\(1\times 1\), \(3\times 3\), \(1\times 3+3\times 1\), \(1\times 7+7\times 1\)), max/average pooling (\(2\times 2\), \(3\times 3\), \(5\times 5\)), max pooling (\(7\times 7\)), min pooling (\(2\times 2\)), and (dilated) separable convolution (\(3\times 3\), \(5\times 5\), \(7\times 7\)).
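
To make this structure concrete, the sketch below samples a random cell from such a search space. The operation list is only a subset of the 19 operations and the tuple-based encoding is an illustrative choice, not the encoding used by NAO or this paper.

```python
import random

# A sketch of sampling a cell: B blocks, each summing the results of two operations whose
# inputs are either the outputs of the two previous cells (indices 0 and 1) or of any
# earlier block in the same cell (indices 2, 3, ...).

OPS = ["identity", "conv_1x1", "conv_3x3", "sep_conv_3x3", "sep_conv_5x5",
       "max_pool_3x3", "avg_pool_3x3"]          # illustrative subset of the 19 operations

def sample_cell(num_blocks=5, ops=OPS):
    cell = []
    for b in range(num_blocks):
        candidates = list(range(2 + b))          # 2 cell inputs + outputs of blocks 0..b-1
        block = [(random.choice(ops), random.choice(candidates)) for _ in range(2)]
        cell.append(block)                       # block output = sum of the two operation results
    return cell
```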

4.2 Training Details for the Convolutional Neural Networks

During the search process, smaller architectures (B = 5, N = 3, F = 32) are trained for 100 epochs. The final architecture is trained for 600 epochs according to the specified settings. SGD with momentum 0.9 and a cosine learning rate schedule [18] with \(l_\text {max}=0.024\) and without warm restarts is used for training. Models are regularized by means of a weight decay of \(5\cdot 10^{-4}\) and a drop-path [13] probability of 0.3. We use a batch size of 128 but decrease it to 64 for computational reasons for architectures with \(F\ge 64\). All experiments use a single V100 graphics card; the only exception is networks with \(F=128\), for which we use two V100s to speed up training.
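
For reference, the optimizer and schedule described above could be configured in PyTorch roughly as follows. The model is a placeholder, and drop-path regularization is part of the searched network itself, so it is omitted here.

```python
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

# A sketch of the stated training configuration: SGD with momentum 0.9, weight decay 5e-4,
# and a cosine schedule starting at l_max = 0.024 without warm restarts. `model` is a
# stand-in; drop-path (probability 0.3) lives inside the searched network and is omitted.

model = nn.Linear(10, 10)                             # placeholder for the searched CNN
optimizer = SGD(model.parameters(), lr=0.024, momentum=0.9, weight_decay=5e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=600)   # 600 epochs for the final architecture

for epoch in range(600):
    # ... one training epoch over the training split would go here ...
    scheduler.step()
```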

Standard preprocessing (whitening) and data augmentation are used. Images are padded to a size of 40 \(\times \) 40 and then randomly cropped to 32 \(\times \) 32. Additionally, they are flipped horizontally at random during training. Whenever cutout [6] is applied, a size of 16 is used.
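
A possible torchvision realization of this pipeline is sketched below. The CIFAR normalization statistics are standard values assumed here, and cutout is approximated with RandomErasing, which differs slightly from cutout [6] in that the erased patch never extends beyond the image border.

```python
from torchvision import transforms

# A sketch of the described preprocessing: pad to 40x40, random 32x32 crop, random
# horizontal flip, whitening (standard CIFAR-10 channel statistics assumed), and a
# cutout-like 16x16 erasure via RandomErasing as an approximation.

CIFAR10_MEAN, CIFAR10_STD = (0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),            # pad each side by 4 (40x40), crop back to 32x32
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
    transforms.RandomErasing(p=1.0, scale=(0.25, 0.25), ratio=(1.0, 1.0), value=0.0),
])
```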

4.3 Source Tasks

Image recognition on Fashion-MNIST [31], Quickdraw [8], CIFAR-100 [12] and SVHN [20] forms our four source tasks. For each of these data sets, we evaluated 200 random architectures, giving us a total of 800 different architectures. Every architecture is trained for 100 epochs with the settings described in Sect. 4.2. The default train/test splits of CIFAR-10, CIFAR-100, Fashion-MNIST and SVHN are used, and the train split is further divided into 5,000 images for validation and the remainder for training. For computational reasons we refrain from using the entire Quickdraw data set (50 million drawings and 345 classes); instead, we select 100 classes at random and, per class, randomly select 450 drawings for training and 50 for validation. Each of these architectures was trained on exactly one data set, and none of them was evaluated on the target task before or during the search. Therefore, it is valid to conclude that the architecture found is new.

Fig. 3. Convolution and reduction cell of XferNASNet.

4.4 Image Recognition on CIFAR-10

We evaluate the proposed transfer framework on the CIFAR-10 benchmark data set and present the results in Table 1. The table is divided into four parts. In the first part we list the results achieved by manually designed architectures. In the second and third parts we tabulate the results achieved by traditional NAS methods and by methods which transfer knowledge across tasks, respectively. In the last part we list our results. In contrast to some of the other search methods, we refrained from additional hyperparameter optimization of our final architecture (XferNASNet) (Fig. 3).

Table 1. Classification error of discovered CNN models on CIFAR-10. We denote the total number of models trained during the search by M. B is the number of blocks, N the number of cells and F the number of filters. #op is the number of different operations considered in a cell, which is an indicator of the search space complexity. For more details on the hyperparameters #op, B, N and F we refer to Sect. 4.1. We extend the results collected by [19] with the most recent works.

XferNAS is the extended version of NAO which additionally uses the transfer network, so this comparison is of particular interest. We observe not only a significant drop in the search effort (the number of evaluated models is reduced from 1,000 to 33, the search time from 200 GPU days to 6), but also in the error obtained on the test set. The smallest version of NAONet performs slightly better than XferNASNet (3.18 vs. 3.37), but also uses twice as many parameters. If the data augmentation technique cutout [6] is used, this slight advantage is reversed (2.11 vs. 1.99).

The transfer method TAPAS achieves significantly worse results (6.33 vs. 3.92 for the next best method). The other transfer method, T-NAML, achieves an error rate of 3.5, which is not better than XferNASNet (3.37). It should also be noted that T-NAML finetunes architectures which have been pre-trained on ImageNet. Thus, not only is the number of parameters probably much higher, but more data is used and no new architectures are found; it therefore arguably solves a different task. A very simple baseline is to select the best architecture of the most similar source task (CIFAR-100). This baseline performs reasonably well (4.14) but is clearly outperformed by XferNASNet.

Furthermore, we compare to other search methods that consider the NASNet search space. XferNASNet performs very well compared to current gradient-based optimization methods such as DARTS and SNAS (2.70 versus 2.83 and 2.85, respectively). It also provides good results compared to architectures such as NASNet or AmoebaNet, which were discovered by time-consuming optimization methods (3.37 in 2,000 GPU days and 3.34 in 3,150 GPU days versus 3.41 in 6 GPU days). We want to highlight that the results in part 2 of Table 1 are intended to give the reader an idea of how well XferNASNet performs on the NASNet search space compared to other methods operating on the same search space. However, a direct comparison to all methods but NAO is less relevant, since that work is orthogonal to ours and can benefit from our transfer learning approach as well.

Table 2. Various architectures discovered on CIFAR-10 applied to CIFAR-100. Although the hyperparameters have not been optimized, XferNASNet achieves the best result.

4.5 Architecture Transfer to CIFAR-100

A standard procedure to test transferability is to apply the architectures discovered on CIFAR-10 to another data set, e.g. CIFAR-100 [19]. Although some of the popular architectures have been adapted to the new data set through additional hyperparameter optimization, we refrain from this for XferNASNet. We compare the transferability of XferNASNet to other automatically discovered architectures and list the results in Table 2. For these results, we rely on the numbers reported by Luo et al. [19]. The XferNASNet architecture achieves an error of 18.88 without and 16.29 with cutout. Thus, we achieve significantly better results than all other architectures except NAONet. However, when we increase the number of filters from 32 to 64, we achieve results comparable to NAONet with 128 filters and many more parameters. Furthermore, when we increase the number of filters to 128, the error drops to 14.06, which is significantly lower than that of NAONet (14.75), and notably with only about half the number of parameters. We also report the result obtained by the best architecture found during the random search (19.96) to reconfirm that XferNAS discovers better architectures than the ones available for the source tasks.

Fig. 4. Correlation coefficient between predicted and true validation accuracy with varying amount of source and target knowledge. The source knowledge significantly improves predictions when there is little target knowledge available.

4.6 Ablation Study of XferNet

At this point we examine the benefits of knowledge transfer more closely, especially the circumstances under which we observe a positive effect. For this we conduct an experiment in which we evaluate 600 random architectures on CIFAR-10 following the setup of Sect. 4.3 and hold out 50 of them to compute the correlation between predicted and true validation accuracy. The remaining 550 are candidates for the target knowledge available during the training of the surrogate model. In this experiment, the amount of source and target knowledge is varied. For every combination of source and target knowledge sizes, ten random splits are created which are used throughout the experiment. We train the surrogate model as described in Sect. 2.3 on all ten splits and test it on the held-out set. While the architectures within the observation history have each been evaluated on exactly one data set, the architectures used for evaluation are unknown to the model. In Fig. 4, we visualize the mean and standard deviation of the correlation between the surrogate model's predictions and the actual validation accuracy over the ten repetitions. The x-axis indicates the size of the target knowledge, and the four curves represent experiments corresponding to different sizes of source knowledge, ranging from 0 architectures per source task (no knowledge transferred, equivalent to NAO) to 150. We elaborate on four questions in this context.
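
This evaluation protocol can be summarized by the following sketch; `train_surrogate` and `predict` are hypothetical stand-ins for training and querying the XferNAS surrogate, and the Pearson correlation is computed with NumPy.

```python
import numpy as np

# A sketch of the ablation protocol: for each of the ten random splits, train the surrogate
# on the given source/target observations and compute the Pearson correlation between its
# predictions and the true validation accuracies on the 50 held-out architectures.
# train_surrogate and predict are hypothetical placeholders.

def correlation_over_splits(splits, held_out):
    corrs = []
    for source_obs, target_obs in splits:                  # ten splits per setting
        surrogate = train_surrogate(source_obs, target_obs)
        preds = [predict(surrogate, arch) for arch, _ in held_out]
        truth = [acc for _, acc in held_out]
        corrs.append(np.corrcoef(preds, truth)[0, 1])
    return float(np.mean(corrs)), float(np.std(corrs))     # mean and std as shown in Fig. 4
```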

How significant is the benefit of knowledge transfer for a new search (zero target knowledge)? This is the scenario in which any method that does not transfer knowledge cannot be better than random guessing (correlation of 0). If our hypothesis is correct and knowledge can be transferred, this should be the scenario in which our method achieves the best results. And indeed, the correlation is quite high and increases with the amount of source knowledge.

Does the transfer model benefit from the target knowledge? For any amount of source knowledge, additional target knowledge increases the correlation and, accordingly, improves the predictions. This effect becomes weaker the more source knowledge is available.

What amount of target knowledge is sufficient so that source knowledge no longer yields a positive effect? For target knowledge comprising 150 architectures (about 30 GPU days), the effect of the source knowledge seems to fade away; beyond this point, knowledge transfer does not contribute any further improvement.

Once this threshold is reached, does the knowledge transfer harm the model? We continued the experiment with larger amounts of target knowledge (up to 550 architectures, not shown in the plot) and empirically confirm that the additional source knowledge does not deteriorate the model performance. Moreover, the correlation keeps improving with an increasing amount of target knowledge in both cases, with and without knowledge transfer.

5 Conclusions and Future Work

In this paper, we present the idea of accelerating NAS by transferring knowledge about architectures across different tasks. We develop a transfer framework that extends existing NAS optimizers with this capability while requiring only minimal modifications. By integrating this framework into NAO, we demonstrate how simple and yet effective these changes are. We evaluate the resulting XferNAS optimizer (NAO + transfer network) on CIFAR-10 and CIFAR-100. In just six GPU days, we discover XferNASNet, which reaches a new record low for NAS optimizers on CIFAR-10 (2.11\(\rightarrow \)1.99) and CIFAR-100 (14.75\(\rightarrow \)14.06) with significantly fewer parameters. Thus, the addition of this component to NAO not only reduces the search time from 200 GPU days to only 6, it even improves the discovered architecture.

In the future we want to combine the transfer network with various other optimizers to evaluate its limitations. We are particularly interested in combining it with recent optimizers that use weight sharing, which promise to be even faster. Furthermore, we want to evaluate the discovered XferNASNet on ImageNet, an experiment we could not conduct so far due to resource limitations.