Keywords

1 Introduction

Deep Convolutional Neural Networks (CNNs) have achieved remarkable success in various computer vision tasks. One of the key reasons behind this success is the innovation on CNN architectures [9, 15, 20]. Despite the steady stream of promising improvements over state-of-the-art across a wide range of tasks and datasets, these early architectural advancements take years of efforts from researchers and practitioners with vast computational resources. Neural Architecture Search (NAS), on the other hand, aims to alleviate this painstaking process by automating the design of CNN architectures. However, existing NAS algorithms typically require enormous computation overheads to search. For example, early works [23, 37] require thousands of GPU days to complete one search, making them impractical for real-world deployments.

The main computation bottleneck of a NAS algorithm resides in the step of evaluating architectures’ performance. A thorough training of an architecture can take days, even weeks on a single GPU card depending on the complexity of architectures and the scale of the datasets. Hereby, advocating for substitute measurements becomes a common theme in existing NAS algorithms. For instances, there are works [19, 23, 37] using the performance measured on architectures with reduced sizes and training epochs; and there are works [17, 21] using performance measured on shared (instead of trained) weights. Conceptually, both of these two methods are restricted by the generality of search spaces, where the former requires search spaces to be modular and the later requires search spaces to be sequential [21, 25] (as oppose to multi-branched [17, 19, 37]).

To address the aforementioned issues, we propose the Random-Weight Evaluation (RWE), a flexible and effective method to accelerate the performance evaluation of architectures. By leveraging the expressive power of randomly initialized convolution filters [13, 26], RWE freezes the backboneFootnote 1 part of CNN architectures, and only trains the last classification layer. The subsequent performance becomes the indicator to select architectures. In short, our key contributions are summarized below:

  • In this work, we propose RWE to expedite the performance estimation of CNN architectures. RWE is conceptually flexible as it is independent of search spaces, and empirically effective as we show later in the paper that it significantly reduces the evaluation wall-clock time from days to less than a minute. At the same time, RWE reliably estimates the performance of CNN architectures, measured in rank-order correlations.

  • We further design a multi-objective evolutionary NAS algorithm to effectively utilize the proposed evaluation method. On CIFAR-10, the proposed algorithm achieves a set of efficient architectures with state-of-the-art performance, yielding 2.98% Top-1 error and 1.5M parameters with less than two hours on a single GPU card.

2 Related Works

In this section, we provide a brief overview on the topics that are closely related to technicalities of our approach.

Performance Estimation: A commonly used technique, in early NAS works [23, 37], to expedite the performance evaluations involves down-scaling the architecture sizes, by reducing the number of layers, channels, and training epochs. The main limitation of this method is that it is only applicable to modular search spaces, where a CNN architecture is constructed by repeatedly stacking modular blocks. Extending this method to the search space that allows non-modular architectures is not trivial. Concurrently, there is a study [34] showing that extensively reducing the number of training epochs leads to a poor rank-order correlation between predicted and true performance.

Instead of training every architecture from scratch, the weight sharing method attempts to speed up the peformance evaluations by sharing the weights among architectures sampled during search. NAS algorithms in this category [17, 21] typically construct a supernet (prior to the search), such that all searchable architectures become subsets (of the supernet; i.e. subnets) and weights are directly inherited instead of randomly initialized. Then the evaluation of an architecture becomes an inference on the validation set, which is much cheaper than training. Despite the efficiency gained during search, the supernet, required by the weight sharing method, can be more time consuming to train than a complete search [1], and may not be feasible for all search spaces, e.g. [16].

Expressive Power of Randomly Initialized Convolution Filters: Convolution filters, even with randomly initialized weights, are surprisingly powerful in extracting meaningful feature representations from visual inputs [12]. A number of existing works have shown that randomly initialized convolution filters can achieve comparable performance to CNNs with fully trained convolution filters on both vision and control tasks [6, 13]. Few trials have been attempted to estimate the performance of a neural network from randomly initialized weights in the literature. More specifically, Saxe et al. use randomly initialized weights to predict the ranking of shallow CNNs’ performance [26]; Rosenfeld and Tsotsos show that the performance ranking of widely used CNN architectures can be predicted by training a tiny fractions of the weights in the CNNs [24]. In this work, we train the last classification layer while freezing all other weights at initial values, and use this performance as the indicator to compare architectures. We demonstrate that our approach scales to modern CNN search spaces containing deep and complex CNNs.

Multi-objective NAS: Early NAS algorithms [17, 23, 37] are primarily driven by a single objective of maximizing predictive accuracy. However, real-world applications oftentimes require the CNN architectures to balance other completing objectives, such as power consumption, inference latency, memory footprint, to name a few. A portfolio of recently emerged NAS works scalarizes multiple objectives into a composite measurement that simultaneously promotes predictive performance and penalizes architecture complexity [2, 30]. In this work, we opt for evolutionary multi-objective optimization to approximate the entire efficient frontier in one run [3].

3 Proposed Approach

The problem of designing optimal architectures under multiple objectives for a target dataset \(\mathcal {D} = \{ \mathcal {D}_{trn}, \mathcal {D}_{vld}, \mathcal {D}_{tst} \}\) can be formulated as the following bilevel optimization problem [19],

figure a

where the upper lever variable \(\varvec{\alpha }\) defines a candidate architecture, and the lower level variable \(\varvec{w}(\varvec{\alpha })\) represents the weights with respect to it. \(\mathcal {L}(\varvec{w};\varvec{\alpha })\) is the cross-entropy loss on the \(\mathcal {D}_{trn}\) for a candidate architecture \(\varvec{\alpha }\) with weights \(\varvec{w}\). The first objective \(f_1\) represents the classification error on \( \mathcal {D}_{vld}\), which depends on both architectures and weights. Other objectives \(f_2,...,f_m\) only depend on architectures, such as the number of parameters, floating-point operations (FLOPs), latency, etc. In our approach, we use predictive performance to approximate the ground-truth one and take the predictive performance and FLOPs as two objectives to be optimized, which aims to balance between the performance and the complexity of architectures.

3.1 Random-Weight Evaluation

The core of our approach, Random-Weight Evaluation (RWE), is shown in Algorithm 1. First, the architecture \(\varvec{\alpha }\) to be evaluated would get decoded into a CNN net without the last classification layer, which is a linear classifier. Notice that, the number of channels and layers of the decoded architecture should be indicated in this step. Then, the weights of net would be randomly initialized and then kept fixed throughout the whole algorithm. We use a modified version of the Kaiming initialization [8] to initialize the net (default setting in PyTorch). After the initialization of net, a list of linear classifiers list_clsfr would be initialized. Those classifiers get trained on the features that are extracted by the net, and each classifier could only be exposed to a part of the features. Then all of the classifiers are combined as the last classification layer in the inference phrase via neural network ensemble technique [7]. More specifically, the length of the list_clsfr is set to five, and each classifier gets trained on 4/5 features. Compared with the case that only one linear classifier gets trained, a list of linear classifiers can help stabilizing RWE. Finally, the architecture is validated by taking the labels that get approved by most classifiers as a result. After calculating the classification error rate and FLOPs of the architecture, these two measurements of performance and complexity of architectures will be returned as two objectives to be optimized.

figure b

3.2 Search Space and Encoding

Our proposed RWE is conceptually flexible and can be applied to various search spaces. In this work, we adopt the NASNet [37] search space, which is widely used in various NAS algorithms [17, 22, 23]. A pictorial overview of this search space and encoding is shown in Fig. 1.

Network: The macro network architecture is prefixed to be a stack of several cells. We search for two cells: a normal cell and a reduction cell. All normal cells in an architecture share the same structure (but different weights), which is the same case for the reduction cell. The normal cell keeps the resolution and the number of channels for input tensors, while the reduction cell downsamples the resolution and double the number of channels for input tensors (Fig. 1a RIGHT).

Fig. 1.
figure 1

The search space [37] adopted in our approach. (a) RIGHT: the macro network architecture of CNN. MIDDLE: an example structure of normal cells and reduction cells. LEFT: the searching elements for a node. (b) TOP: a 40-integer vector defining an architecture. BOTTOM: candidate operations and their encoding.

Cell and Node: The structures of both normal and reduction cells are defined by five nodes (Node 2 to Node 6), each of which chooses two inputs from previous nodes and two operations applied on two inputs respectively. For a node i, we search for two inputs from previous nodes \(N_1^{cell,i}\) and \(N_2^{cell,i}\) and two operations \(O_1^{cell,i}\) and \(O_2^{cell,i}\) applied on them, where the cell could be n or r, representing normal cells and reduction cells. For those nodes that are not chosen to become the inputs of another node, their outputs would be concatenated to the output node (Node 7) (Fig. 1a MIDDLE and LEFT).

Encoding: We use an integer vector to encode an architecture, as shown in Fig. 1b. Each input and operation to be applied is encoded into an integer denoting a choice. A cell consists of five nodes, and each node is represented using four integers. The integer vectors for a normal and a reduction cell are concatenated into a 40-integer vector.

3.3 Search Strategy

In our approach, we adopt the classic multi-objective evolutionary algorithm NSGA-II [3] as the searching framework. The search process is briefly summarized below.

First, the population is randomly initialized. After individuals in it get evaluated by RWE, the binary tournament selection will be applied to select parents for offspring. Then through two-point crossover and polynomial mutation, offspring is created. Finally, the offspring gets evaluated, followed by the environment selection through the nondominated sorting and the crowding distance. The process is repeated until reaching the max generation.

4 Experimental Results

In this section, we first evaluate the effectiveness of RWE, followed by the results of our approach searching on CIFAR-10 [14] and transferring to ImageNet [4].

4.1 Effectiveness of Random-Weight Evaluation

Fig. 2.
figure 2

The predicted accuracy and ground-truth accuracy on CIFAR-10 of the architectures from final generation. The line represents a linear regression. Corr. is the Spearman rank-order correlation coefficient between the predicted and the ground-truth accuracy.

To demonstrate the effectiveness of RWE, we calculate the Spearman rank-order correlation coefficient between the predicted and the ground-truth accuracy on CIFAR-10, which is the larger the better, equal to 1.0 if the predicted rank is the same as the ground-truth one. The result is shown in Fig. 2. The ground-truth accuracy is obtained through the setting described in Sect. 4.2. Figure 2 LEFT shows the predictions by RWE, as described in Algorithm 1. Figure 2 RIGHT shows the predictions by low fidelity evaluation commonly adopted [22, 23, 37]. These predictions are obtained by a twenty-epoch training on the architecture with 8 layers and 16 initial channels. The correlation coefficients for these two methods are 0.942 and 0.909, respectively. The result shows that, in this circumstance, RWE predicts the accuracy slightly better than the low fidelity evaluation.

4.2 Searching on CIFAR-10

In this work, we search on CIFAR-10, a dataset that is widely used for benchmarking image classification. CIFAR-10 is a 10-category dataset, consisting of 60K images with resolution of \(32 \times 32\). The dataset is split into a training set and a testing set which consists of 50K and 10K images, respectively. Furthermore, we split the training set (80%–20%) to create the training and validation set for the searching stage.

In our searching stage, the population size N is set to 20 and we search for 30 generations. For RWE, every architecture is decoded into a CNN with 10 channels and 5 layers, where the second and fourth layers are reduction cells, and others are normal cells. The length of the linear classifiers list \(L = 5\). The preprocessing for input images only contains the normalization, without the standard data augment techniques that introduce the randomness. The training for the linear classifiers uses the SGD optimizer with a batch size set of 512 and the momentum of 0.9. Also, the learning rate is set to 0.25 initially and then decay to zero by the cosine annealing schedule [18]. We train each linear classifier for 30 epochs, and the search takes one hour and fifteen minutes with a single Nvidia 2080Ti. Figure 3 TOP shows the bi-objective Pareto fronts for different generations in the searching stage. Figure 3 BOTTOM shows that our approach saves orders of magnitude search cost compared with other NAS algorithms.

Fig. 3.
figure 3

TOP: progression of pareto fronts for different generations. BOTTOM: search cost comparison.

Table 1. Comparisons with other state-of-the-art architectures on CIFAR-10. \(^{\Updownarrow }\)denotes the work that shares the same setting with ours and the performance are reported in [22]. \(^{\dagger }\)denotes the work that adopts the cutout [5] technique.
Fig. 4.
figure 4

The visualization of Ours-l architecture.

For validations, we adopt the same training setting in [22] for a fair comparison. We train the architectures selected from the final Pareto Front with 20 layers and 34 initial channels from scratch. The number of epochs is set to 600 with a batch-size of 96. We also use the data augmentation technique cutout [5] with a length of 16, and the regularization technique scheduled path dropout introduced in [37] with a dropout rate 0.2.

The validation results and comparisons to other state-of-the-art architectures on CIFAR-10 are shown in Table 1. We select three representative architectures from the final generation to compare with other hand-crafted and search-generated architectures. The chosen architecture with the lowest error rate (Ours-l, as shown in Fig. 4) results in a 2.98% classification error and 340M FLOPs, which is competitive to other works in error rate but has fewer FLOPs. Also, compared with other NAS algorithms, our approach has much less computational cost measured in GPU days.

To further validate the effectiveness of RWE, we also apply our NAS algorithm on the macro search space [31] adopted in [22]. The results in Table 1 show that our chosen architecture has competitive performance but fewer FLOPs, similar to the case in micro search space.

Table 2. Comparisons with other state-of-the-art architectures on ImageNet. \(^{\Uparrow }\) denotes the architectures that are searched on CIFAR-10 and transferred to ImageNet.

4.3 Transferring to ImageNet

It is a common approach that architectures get searched on CIFAR-10 and then get transferred to other datasets or tasks [32, 37]. To test the transferability of our algorithm, we transfer our architecture with lowest error rate to ImageNet [4] dataset, which is one of the most challenging datasets for image classification. It is consisted of 1.28M images for the training set and 50K images for the validation set. The images are of various resolution and unevenly distributed in 1000 categories. We adopt some common data augmentation techniques, including the random resize and crop, the random horizontal flip, and the color jitter. We adjust our network architecture for ImageNet based on [17]. We train the model on 4 Nvidia Tesla V100 with the SGD optimizer for 250 epochs, with the batch size 1024 and resolution \(224 \times 224\). The learning rate is set to 0.5, with the momentum 0.9 and the weight decay \(3\times 10^{-5}\). The linear learning rate scheduler is used and as a result, the learning rate decays from 0.5 to \(1 \times 10 ^{-5}\) linearly during the training. Also, the warm-up strategy is adopted to increase the learning rate from 0 to 0.5 over the first five epochs. The label smooth [29] technique is also used with a smooth ratio of 0.1. The results of comparisons of our approach to other state-of-the-art architectures on ImageNet is shown in Table 2.

5 Conclusion

In this paper, we proposed a novel performance estimation strategy, Random-Weight Evaluation (RWE), to reduce the search cost of a NAS algorithm. RWE fixes majority of the weights at randomly initialized values, and only trains the last linear classifier layer. We integrated RWE in a multi-objective NAS algorithm, achieving state-of-the-art performance while reducing the computational cost drastically. In particular, RWE obtained a novel architecture on CIFAR-10, yielding 2.98% top-1 classification error and 1.5M parameters with less than two hours searching on a single GPU card. When transferred to ImageNet, the obtained architecture achieved 27.6% top-1 classification error.