Multi-objective Neural Architecture Search with Almost No Training

Hu, Shengran; Cheng, Ran; He, Cheng; Lu, Zhichao

doi:10.1007/978-3-030-72062-9_39

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 12654))

Included in the following conference series:

International Conference on Evolutionary Multi-Criterion Optimization

1856 Accesses
5 Citations

Abstract

In the recent past, neural architecture search (NAS) has attracted increasing attention from both academia and industries. Despite the steady stream of impressive empirical results, most existing NAS algorithms are computationally prohibitive to execute due to the costly iterations of stochastic gradient descent (SGD) training. In this work, we propose an effective alternative, dubbed Random-Weight Evaluation (RWE), to rapidly estimate the performance of network architectures. By just training the last linear classification layer, RWE reduces the computational cost of evaluating an architecture from hours to seconds. When integrated within an evolutionary multi-objective algorithm, RWE obtains a set of efficient architectures with state-of-the-art performance on CIFAR-10 with less than two hours’ searching on a single GPU card. Ablation studies on rank-order correlations and transfer learning experiments to ImageNet have further validated the effectiveness of RWE.

Access provided by Autonomous University of Puebla. Download conference paper PDF

Training-Free Multi-objective Evolutionary Neural Architecture Search via Neural Tangent Kernel and Number of Linear Regions

XferNAS: Transfer Neural Architecture Search

The effect of reduced training in neural architecture search

Article 20 April 2020

Keywords

1 Introduction

Deep Convolutional Neural Networks (CNNs) have achieved remarkable success in various computer vision tasks. One of the key reasons behind this success is the innovation on CNN architectures [9, 15, 20]. Despite the steady stream of promising improvements over state-of-the-art across a wide range of tasks and datasets, these early architectural advancements take years of efforts from researchers and practitioners with vast computational resources. Neural Architecture Search (NAS), on the other hand, aims to alleviate this painstaking process by automating the design of CNN architectures. However, existing NAS algorithms typically require enormous computation overheads to search. For example, early works [23, 37] require thousands of GPU days to complete one search, making them impractical for real-world deployments.

The main computation bottleneck of a NAS algorithm resides in the step of evaluating architectures’ performance. A thorough training of an architecture can take days, even weeks on a single GPU card depending on the complexity of architectures and the scale of the datasets. Hereby, advocating for substitute measurements becomes a common theme in existing NAS algorithms. For instances, there are works [19, 23, 37] using the performance measured on architectures with reduced sizes and training epochs; and there are works [17, 21] using performance measured on shared (instead of trained) weights. Conceptually, both of these two methods are restricted by the generality of search spaces, where the former requires search spaces to be modular and the later requires search spaces to be sequential [21, 25] (as oppose to multi-branched [17, 19, 37]).

To address the aforementioned issues, we propose the Random-Weight Evaluation (RWE), a flexible and effective method to accelerate the performance evaluation of architectures. By leveraging the expressive power of randomly initialized convolution filters [13, 26], RWE freezes the backbone^{Footnote 1} part of CNN architectures, and only trains the last classification layer. The subsequent performance becomes the indicator to select architectures. In short, our key contributions are summarized below:

In this work, we propose RWE to expedite the performance estimation of CNN architectures. RWE is conceptually flexible as it is independent of search spaces, and empirically effective as we show later in the paper that it significantly reduces the evaluation wall-clock time from days to less than a minute. At the same time, RWE reliably estimates the performance of CNN architectures, measured in rank-order correlations.
We further design a multi-objective evolutionary NAS algorithm to effectively utilize the proposed evaluation method. On CIFAR-10, the proposed algorithm achieves a set of efficient architectures with state-of-the-art performance, yielding 2.98% Top-1 error and 1.5M parameters with less than two hours on a single GPU card.

2 Related Works

In this section, we provide a brief overview on the topics that are closely related to technicalities of our approach.

Performance Estimation: A commonly used technique, in early NAS works [23, 37], to expedite the performance evaluations involves down-scaling the architecture sizes, by reducing the number of layers, channels, and training epochs. The main limitation of this method is that it is only applicable to modular search spaces, where a CNN architecture is constructed by repeatedly stacking modular blocks. Extending this method to the search space that allows non-modular architectures is not trivial. Concurrently, there is a study [34] showing that extensively reducing the number of training epochs leads to a poor rank-order correlation between predicted and true performance.

Instead of training every architecture from scratch, the weight sharing method attempts to speed up the peformance evaluations by sharing the weights among architectures sampled during search. NAS algorithms in this category [17, 21] typically construct a supernet (prior to the search), such that all searchable architectures become subsets (of the supernet; i.e. subnets) and weights are directly inherited instead of randomly initialized. Then the evaluation of an architecture becomes an inference on the validation set, which is much cheaper than training. Despite the efficiency gained during search, the supernet, required by the weight sharing method, can be more time consuming to train than a complete search [1], and may not be feasible for all search spaces, e.g. [16].

Expressive Power of Randomly Initialized Convolution Filters: Convolution filters, even with randomly initialized weights, are surprisingly powerful in extracting meaningful feature representations from visual inputs [12]. A number of existing works have shown that randomly initialized convolution filters can achieve comparable performance to CNNs with fully trained convolution filters on both vision and control tasks [6, 13]. Few trials have been attempted to estimate the performance of a neural network from randomly initialized weights in the literature. More specifically, Saxe et al. use randomly initialized weights to predict the ranking of shallow CNNs’ performance [26]; Rosenfeld and Tsotsos show that the performance ranking of widely used CNN architectures can be predicted by training a tiny fractions of the weights in the CNNs [24]. In this work, we train the last classification layer while freezing all other weights at initial values, and use this performance as the indicator to compare architectures. We demonstrate that our approach scales to modern CNN search spaces containing deep and complex CNNs.

Multi-objective NAS: Early NAS algorithms [17, 23, 37] are primarily driven by a single objective of maximizing predictive accuracy. However, real-world applications oftentimes require the CNN architectures to balance other completing objectives, such as power consumption, inference latency, memory footprint, to name a few. A portfolio of recently emerged NAS works scalarizes multiple objectives into a composite measurement that simultaneously promotes predictive performance and penalizes architecture complexity [2, 30]. In this work, we opt for evolutionary multi-objective optimization to approximate the entire efficient frontier in one run [3].

3 Proposed Approach

The problem of designing optimal architectures under multiple objectives for a target dataset \(\mathcal {D} = \{ \mathcal {D}_{trn}, \mathcal {D}_{vld}, \mathcal {D}_{tst} \}\) can be formulated as the following bilevel optimization problem [19],

where the upper lever variable \(\varvec{\alpha }\) defines a candidate architecture, and the lower level variable \(\varvec{w}(\varvec{\alpha })\) represents the weights with respect to it. \(\mathcal {L}(\varvec{w};\varvec{\alpha })\) is the cross-entropy loss on the \(\mathcal {D}_{trn}\) for a candidate architecture \(\varvec{\alpha }\) with weights \(\varvec{w}\). The first objective \(f_1\) represents the classification error on \( \mathcal {D}_{vld}\), which depends on both architectures and weights. Other objectives \(f_2,...,f_m\) only depend on architectures, such as the number of parameters, floating-point operations (FLOPs), latency, etc. In our approach, we use predictive performance to approximate the ground-truth one and take the predictive performance and FLOPs as two objectives to be optimized, which aims to balance between the performance and the complexity of architectures.

3.1 Random-Weight Evaluation

The core of our approach, Random-Weight Evaluation (RWE), is shown in Algorithm 1. First, the architecture \(\varvec{\alpha }\) to be evaluated would get decoded into a CNN net without the last classification layer, which is a linear classifier. Notice that, the number of channels and layers of the decoded architecture should be indicated in this step. Then, the weights of net would be randomly initialized and then kept fixed throughout the whole algorithm. We use a modified version of the Kaiming initialization [8] to initialize the net (default setting in PyTorch). After the initialization of net, a list of linear classifiers list_clsfr would be initialized. Those classifiers get trained on the features that are extracted by the net, and each classifier could only be exposed to a part of the features. Then all of the classifiers are combined as the last classification layer in the inference phrase via neural network ensemble technique [7]. More specifically, the length of the list_clsfr is set to five, and each classifier gets trained on 4/5 features. Compared with the case that only one linear classifier gets trained, a list of linear classifiers can help stabilizing RWE. Finally, the architecture is validated by taking the labels that get approved by most classifiers as a result. After calculating the classification error rate and FLOPs of the architecture, these two measurements of performance and complexity of architectures will be returned as two objectives to be optimized.

3.2 Search Space and Encoding

Our proposed RWE is conceptually flexible and can be applied to various search spaces. In this work, we adopt the NASNet [37] search space, which is widely used in various NAS algorithms [17, 22, 23]. A pictorial overview of this search space and encoding is shown in Fig. 1.

Network: The macro network architecture is prefixed to be a stack of several cells. We search for two cells: a normal cell and a reduction cell. All normal cells in an architecture share the same structure (but different weights), which is the same case for the reduction cell. The normal cell keeps the resolution and the number of channels for input tensors, while the reduction cell downsamples the resolution and double the number of channels for input tensors (Fig. 1a RIGHT).

Cell and Node: The structures of both normal and reduction cells are defined by five nodes (Node 2 to Node 6), each of which chooses two inputs from previous nodes and two operations applied on two inputs respectively. For a node i, we search for two inputs from previous nodes \(N_1^{cell,i}\) and \(N_2^{cell,i}\) and two operations \(O_1^{cell,i}\) and \(O_2^{cell,i}\) applied on them, where the cell could be n or r, representing normal cells and reduction cells. For those nodes that are not chosen to become the inputs of another node, their outputs would be concatenated to the output node (Node 7) (Fig. 1a MIDDLE and LEFT).

Encoding: We use an integer vector to encode an architecture, as shown in Fig. 1b. Each input and operation to be applied is encoded into an integer denoting a choice. A cell consists of five nodes, and each node is represented using four integers. The integer vectors for a normal and a reduction cell are concatenated into a 40-integer vector.

3.3 Search Strategy

In our approach, we adopt the classic multi-objective evolutionary algorithm NSGA-II [3] as the searching framework. The search process is briefly summarized below.

First, the population is randomly initialized. After individuals in it get evaluated by RWE, the binary tournament selection will be applied to select parents for offspring. Then through two-point crossover and polynomial mutation, offspring is created. Finally, the offspring gets evaluated, followed by the environment selection through the nondominated sorting and the crowding distance. The process is repeated until reaching the max generation.

4 Experimental Results

In this section, we first evaluate the effectiveness of RWE, followed by the results of our approach searching on CIFAR-10 [14] and transferring to ImageNet [4].

4.1 Effectiveness of Random-Weight Evaluation

To demonstrate the effectiveness of RWE, we calculate the Spearman rank-order correlation coefficient between the predicted and the ground-truth accuracy on CIFAR-10, which is the larger the better, equal to 1.0 if the predicted rank is the same as the ground-truth one. The result is shown in Fig. 2. The ground-truth accuracy is obtained through the setting described in Sect. 4.2. Figure 2 LEFT shows the predictions by RWE, as described in Algorithm 1. Figure 2 RIGHT shows the predictions by low fidelity evaluation commonly adopted [22, 23, 37]. These predictions are obtained by a twenty-epoch training on the architecture with 8 layers and 16 initial channels. The correlation coefficients for these two methods are 0.942 and 0.909, respectively. The result shows that, in this circumstance, RWE predicts the accuracy slightly better than the low fidelity evaluation.

4.2 Searching on CIFAR-10

In this work, we search on CIFAR-10, a dataset that is widely used for benchmarking image classification. CIFAR-10 is a 10-category dataset, consisting of 60K images with resolution of \(32 \times 32\). The dataset is split into a training set and a testing set which consists of 50K and 10K images, respectively. Furthermore, we split the training set (80%–20%) to create the training and validation set for the searching stage.

In our searching stage, the population size N is set to 20 and we search for 30 generations. For RWE, every architecture is decoded into a CNN with 10 channels and 5 layers, where the second and fourth layers are reduction cells, and others are normal cells. The length of the linear classifiers list \(L = 5\). The preprocessing for input images only contains the normalization, without the standard data augment techniques that introduce the randomness. The training for the linear classifiers uses the SGD optimizer with a batch size set of 512 and the momentum of 0.9. Also, the learning rate is set to 0.25 initially and then decay to zero by the cosine annealing schedule [18]. We train each linear classifier for 30 epochs, and the search takes one hour and fifteen minutes with a single Nvidia 2080Ti. Figure 3 TOP shows the bi-objective Pareto fronts for different generations in the searching stage. Figure 3 BOTTOM shows that our approach saves orders of magnitude search cost compared with other NAS algorithms.

Table 1. Comparisons with other state-of-the-art architectures on CIFAR-10. \(^{\Updownarrow }\)denotes the work that shares the same setting with ours and the performance are reported in [22]. \(^{\dagger }\)denotes the work that adopts the cutout [5] technique.

Full size table

For validations, we adopt the same training setting in [22] for a fair comparison. We train the architectures selected from the final Pareto Front with 20 layers and 34 initial channels from scratch. The number of epochs is set to 600 with a batch-size of 96. We also use the data augmentation technique cutout [5] with a length of 16, and the regularization technique scheduled path dropout introduced in [37] with a dropout rate 0.2.

The validation results and comparisons to other state-of-the-art architectures on CIFAR-10 are shown in Table 1. We select three representative architectures from the final generation to compare with other hand-crafted and search-generated architectures. The chosen architecture with the lowest error rate (Ours-l, as shown in Fig. 4) results in a 2.98% classification error and 340M FLOPs, which is competitive to other works in error rate but has fewer FLOPs. Also, compared with other NAS algorithms, our approach has much less computational cost measured in GPU days.

To further validate the effectiveness of RWE, we also apply our NAS algorithm on the macro search space [31] adopted in [22]. The results in Table 1 show that our chosen architecture has competitive performance but fewer FLOPs, similar to the case in micro search space.

Table 2. Comparisons with other state-of-the-art architectures on ImageNet. \(^{\Uparrow }\) denotes the architectures that are searched on CIFAR-10 and transferred to ImageNet.

Full size table

4.3 Transferring to ImageNet

It is a common approach that architectures get searched on CIFAR-10 and then get transferred to other datasets or tasks [32, 37]. To test the transferability of our algorithm, we transfer our architecture with lowest error rate to ImageNet [4] dataset, which is one of the most challenging datasets for image classification. It is consisted of 1.28M images for the training set and 50K images for the validation set. The images are of various resolution and unevenly distributed in 1000 categories. We adopt some common data augmentation techniques, including the random resize and crop, the random horizontal flip, and the color jitter. We adjust our network architecture for ImageNet based on [17]. We train the model on 4 Nvidia Tesla V100 with the SGD optimizer for 250 epochs, with the batch size 1024 and resolution \(224 \times 224\). The learning rate is set to 0.5, with the momentum 0.9 and the weight decay \(3\times 10^{-5}\). The linear learning rate scheduler is used and as a result, the learning rate decays from 0.5 to \(1 \times 10 ^{-5}\) linearly during the training. Also, the warm-up strategy is adopted to increase the learning rate from 0 to 0.5 over the first five epochs. The label smooth [29] technique is also used with a smooth ratio of 0.1. The results of comparisons of our approach to other state-of-the-art architectures on ImageNet is shown in Table 2.

5 Conclusion

In this paper, we proposed a novel performance estimation strategy, Random-Weight Evaluation (RWE), to reduce the search cost of a NAS algorithm. RWE fixes majority of the weights at randomly initialized values, and only trains the last linear classifier layer. We integrated RWE in a multi-objective NAS algorithm, achieving state-of-the-art performance while reducing the computational cost drastically. In particular, RWE obtained a novel architecture on CIFAR-10, yielding 2.98% top-1 classification error and 1.5M parameters with less than two hours searching on a single GPU card. When transferred to ImageNet, the obtained architecture achieved 27.6% top-1 classification error.

Notes

1.
All layers prior to the task-specific heads, e.g., the last linear layer in case of object classification.

References

Cai, H., Gan, C., Wang, T., Zhang, Z., Han, S.: Once for all: train one network and specialize it for efficient deployment. In: ICLR (2020)
Google Scholar
Cai, H., Zhu, L., Han, S.: ProxylessNAS: direct neural architecture search on target task and hardware. In: ICLR (2019)
Google Scholar
Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6(2), 182–197 (2002)
Article Google Scholar
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009)
Google Scholar
DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017)
Gaier, A., Ha, D.: Weight agnostic neural networks. In: NeurIPS (2019)
Google Scholar
Hansen, L.K., Salamon, P.: Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell. 12(10), 993–1001 (1990)
Article Google Scholar
He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In: ICCV (2015)
Google Scholar
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Google Scholar
Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR (2017)
Google Scholar
Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multi-stage architecture for object recognition? In: ICCV (2009)
Google Scholar
Juefei-Xu, F., Naresh Boddeti, V., Savvides, M.: Local binary convolutional neural networks. In: CVPR (2017)
Google Scholar
Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images. Technical report, Citeseer (2009)
Google Scholar
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (2012)
Google Scholar
Liu, H., Simonyan, K., Vinyals, O., Fernando, C., Kavukcuoglu, K.: Hierarchical representations for efficient architecture search. In: ICLR (2018)
Google Scholar
Liu, H., Simonyan, K., Yang, Y.: DARTS: differentiable architecture search. In: ICLR (2019)
Google Scholar
Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
Lu, Z., et al.: Multi-objective evolutionary design of deep convolutional neural networks for image classification. IEEE Trans. Evol. Comput. 1 (2020)
Google Scholar
Lu, Z., Deb, K., Boddeti, V.N.: MUXConv: information multiplexing in convolutional neural networks. In: CVPR (2020)
Google Scholar
Lu, Z., Deb, K., Goodman, E., Banzhaf, W., Boddeti, V.N.: NSGANetV2: evolutionary multi-objective surrogate-assisted neural architecture search. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 35–51. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_3
Chapter Google Scholar
Lu, Z., et al.: NSGA-Net: neural architecture search using multi-objective genetic algorithm. In: GECCO (2019)
Google Scholar
Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. In: AAAI (2019)
Google Scholar
Rosenfeld, A., Tsotsos, J.K.: Intriguing properties of randomly weighted networks: generalizing while learning next to nothing. In: CRV (2019)
Google Scholar
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: CVPR (2018)
Google Scholar
Saxe, A.M., Koh, P.W., Chen, Z., Bhand, M., Suresh, B., Ng, A.Y.: On random weights and unsupervised feature learning. In: ICML (2011)
Google Scholar
Sun, Y., Wang, H., Xue, B., Jin, Y., Yen, G.G., Zhang, M.: Surrogate-assisted evolutionary deep learning using an end-to-end random forest-based performance predictor. IEEE Trans. Evol. Comput. 24(2), 350–364 (2020)
Article Google Scholar
Szegedy, C., et al.: Going deeper with convolutions. In: CVPR (2015)
Google Scholar
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: CVPR (2016)
Google Scholar
Tan, M., et al.: MnasNet: platform-aware neural architecture search for mobile. In: CVPR (2019)
Google Scholar
Xie, L., Yuille, A.: Genetic CNN. In: ICCV (2017)
Google Scholar
Xie, S., Zheng, H., Liu, C., Lin, L.: SNAS: stochastic neural architecture search. In: ICLR (2019)
Google Scholar
Zagoruyko, S., Komodakis, N.: Wide residual networks. In: BMVC (2016)
Google Scholar
Zela, A., Klein, A., Falkner, S., Hutter, F.: Towards automated deep learning: efficient joint neural architecture and hyperparameter search. arXiv preprint arXiv:1807.06906 (2018)
Zhang, X., Zhou, X., Lin, M., Sun, J.: ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In: CVPR (2018)
Google Scholar
Zhong, Z., et al.: BlockQNN: efficient block-wise neural network architecture generation. IEEE Trans. Pattern Anal. Mach. Intell. 1 (2020)
Google Scholar
Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: CVPR (2018)
Google Scholar

Download references

Acknowledgement

This work was supported by the National Natural Science Foundation of China (Grant No. 61903178 and 61906081), the Program for Guangdong Introducing Innovative and Enterpreneurial Teams (Grant No. 2017ZT07X386), the Shenzhen Peacock Plan (Grant No. KQTD2016112514355531), and the Program for University Key Laboratory of Guangdong Province (Grant No. 2017KSYS008).

Author information

Authors and Affiliations

Guangdong Provincial Key Laboratory of Brain-inspired Intelligent Computation, Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, 518055, China
Shengran Hu, Ran Cheng, Cheng He & Zhichao Lu

Authors

Shengran Hu
View author publications
You can also search for this author in PubMed Google Scholar
Ran Cheng
View author publications
You can also search for this author in PubMed Google Scholar
Cheng He
View author publications
You can also search for this author in PubMed Google Scholar
Zhichao Lu
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Southern University of Science and Technology, Shenzhen, China
Hisao Ishibuchi
City University of Hong Kong, Kowloon Tong, China
Qingfu Zhang
Southern University of Science and Technology, Shenzhen, China
Ran Cheng
University of Exeter, Exeter, UK
Ke Li
Xi'an Jiaotong University, Xi'an, China
Hui Li
Xidian University, Xi'an, China
Handing Wang
East China Normal University, Shanghai, China
Aimin Zhou

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Hu, S., Cheng, R., He, C., Lu, Z. (2021). Multi-objective Neural Architecture Search with Almost No Training. In: Ishibuchi, H., et al. Evolutionary Multi-Criterion Optimization. EMO 2021. Lecture Notes in Computer Science(), vol 12654. Springer, Cham. https://doi.org/10.1007/978-3-030-72062-9_39

Download citation

DOI: https://doi.org/10.1007/978-3-030-72062-9_39
Published: 24 March 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-72061-2
Online ISBN: 978-3-030-72062-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics