Introduction

Convolutional neural networks (CNNs), inspired by the mechanisms of biological vision, have been successfully applied to many fields in recent years, such as object recognition [1, 2], image segmentation [3, 4] and information retrieval [5, 6]. Generally, CNNs are manually designed for specific applications. However, manual design depends heavily on the knowledge of domain experts, and the design process requires considerable time and effort.

To automatically design CNNs, Zoph and Le propose the first neural architecture search (NAS) algorithm [7]. Since then, research on NAS has attracted increasing attention [8, 9]. As one of the most popular NAS algorithms, differentiable architecture search (DARTS) [8] is usually taken as a benchmark framework. DARTS searches for CNNs based on a cell structure. Each cell can be seen as a directed acyclic graph (DAG) with N nodes and E edges. The searched cell is then stacked in sequence to form a complete deep CNN. In the search stage, DARTS assigns a learnable parameter \(\alpha\) to each edge and normalizes it via the softmax function, where the value of \(\alpha\) represents the contribution of the edge in the cell. Nevertheless, the DARTS algorithm has two shortcomings. On the one hand, the operations in the cell contain a large number of parameters and FLOPs. On the other hand, based on the architecture parameters, DARTS simply retains the two operations corresponding to the two largest \(\alpha\) at each node, which leads to relatively redundant connections in the cell. These two issues mean that DARTS can only search for CNNs with complex architectures. However, the number of model parameters and FLOPs is crucial for deploying CNNs on devices with limited memory and computing resources.

This paper puts forward a differentiable light-weight architecture search method, named DLW-NAS, to automatically build high-performance and light-weight CNNs. Specifically, the core ideas of this work include three aspects. First, as the search space directly affects the accuracy and complexity of the searched model, we rebuild a new search space involving several effective light-weight operations. Second, we design a novel differentiable architecture search strategy with computation complexity constraints. Last but not least, we propose a neural architecture optimization strategy that keeps as few operations as possible in the cell while maintaining model performance.

We have conducted extensive experiments on the CIFAR-10, CIFAR-100 and ImageNet datasets to evaluate DLW-NAS. DLW-NAS obtains a 2.73% test error rate on CIFAR-10 with only 2.3M parameters and 334M FLOPs. On CIFAR-100, it uses only 2.47M parameters and 376M FLOPs with an error rate of 17.12%. When transferred from CIFAR-10 to ImageNet in the mobile setting, DLW-NAS produces top-1 and top-5 error rates of 26.1% and 8.3% with 3.8M parameters and 397M FLOPs, matching state-of-the-art (SOTA) results while requiring fewer parameters and FLOPs than SOTA approaches.

Our major contributions can be summarized as below:

  • We present a differentiable light-weight neural architecture search algorithm called DLW-NAS. In particular, we rebuild a light-weight search space to limit the number of parameters and the amount of computation of the searched model at the source of NAS.

  • We design a differentiable NAS strategy with computation complexity constraints, which can be used to search for light-weight architectures.

  • We propose a novel and effective neural architecture optimization strategy to greatly sparsify the cell structure.

  • On standard image classification datasets, including CIFAR-10, CIFAR-100 and ImageNet, we obtain the SOTA results with the searched light-weight models.

This paper extends our conference version [10] with significant improvements. Firstly, we conduct a more comprehensive review of related approaches. Secondly, we design a new differentiable search strategy with computation complexity constraints to automatically search for light-weight CNNs. Thirdly, we add more comparisons on the CIFAR-10 and ImageNet datasets and perform more experiments on the CIFAR-100 dataset. Experiments show that the DLW-NAS proposed in this paper outperforms its previous version. Additionally, we conduct an ablation study to verify the effect of each component of DLW-NAS.

The rest of this paper is organized as follows. In “Related Work”, we briefly review some related work. In “The Proposed DLW-NAS”, we describe the proposed method, including the design of the search space, the new search strategy, the neural architecture optimization strategy and computational complexity analysis. The experimental results are reported and analyzed in “Experiments and Results”. The conclusion of this paper is presented in “Conclusion”.

Related Work

There are mainly two ways to construct light-weight CNNs, namely handcrafted and automatic design. We briefly review these two kinds of approaches below.

Handcrafted

Handcrafted methods usually modify traditional convolutional operations or model construction rules to reduce the number of parameters and FLOPs. For instance, SqueezeNet [11] adjusts the number of feature channels via expanding and squeezing convolutional layers. To reduce the computational complexity of convolutional operations, Howard et al. propose the depthwise separable convolution, which performs convolution channel by channel [12]. ShuffleNet [13] reduces computation complexity using group convolution and channel shuffle. Recently, based on a set of original feature maps, Han et al. generate numerous “ghost” feature maps with a series of linear transformations, which is effective in extracting the required information from the original features [1]. Although manually designed CNNs show remarkable performance, the design methods depend heavily on the knowledge of domain experts.

Automatic Design

The automatic design of convolutional networks usually refers to learning the optimal network structure with NAS algorithms. As one of the most popular NAS algorithms, DARTS converts the discrete operation search into a differentiable optimization problem through the softmax function, which effectively reduces the search time. Recently, several differentiable NAS methods have been proposed for automatically designing light-weight models on embedded devices [14, 15]. For example, Cai et al. propose a latency regularization loss to reduce inference latency during the search process [16]. Wu et al. present an algorithm for searching a continuous structure with sparsity constraints, known as the mixed-path NAS algorithm. These hardware-aware NAS methods are effective in reducing the inference delay of deep neural networks. However, they do not specifically explore how to limit model parameters and FLOPs. To search for CNNs with low computational complexity, quite a few methods have been proposed that either constrain the number of transformation operations or redesign the search space. For example, both BayesNAS [17] and DSO-NAS [18] use the \(\ell _{1}\)-norm to enforce sparse connections in the cell topology. In CNAS [19], Weng et al. build a new search space by exploring some light-weight operations. In this work, we consider the differentiable light-weight architecture search problem as a whole. In particular, we design a novel search space containing only light-weight operations and propose a new strategy to optimize the neural architectures.

The Proposed DLW-NAS

This section describes our proposed differentiable light-weight neural architecture search (DLW-NAS) method in detail. Specifically, the new light-weight search space is first presented in “Light-Weight Search Space”, followed by the neural architecture search algorithm in “The Architecture Search Algorithm”. Section “Neural Architecture Optimization” shows our proposed effective neural architecture optimization strategy. Finally, we quantitatively analyze the complexity of DLW-NAS in “Complexity Analysis”.

Light-Weight Search Space

The search space determines, to a great extent, the accuracy and complexity of the constructed CNNs; it is the foundation of neural architecture search. To rebuild a new light-weight search space, we comprehensively assess candidate operations in the cell topology from three aspects, namely accuracy, parameters and FLOPs. Our target is to construct deep architectures with accuracy as high as possible and with as few parameters and FLOPs as possible. Hence, we mainly improve the search space of DARTS [8]. In addition to five light-weight operations from the DARTS search space, i.e., max pooling, depthwise separable convolution, identity, dilated convolution and zero, we introduce three further effective light-weight operations (i.e., GhostConv, SKConv and ShuConv). Therefore, there are eight light-weight operations in the search space of DLW-NAS. In the following, we briefly introduce the three new light-weight operations.

  (i) GhostConv [1] is a plug-and-play light-weight operation. It generates the same number of feature maps as a traditional convolution through a series of simple linear transformations (e.g., depthwise separable convolution and 1\(\times\)1 convolution). However, different from conventional convolution, the number of parameters and FLOPs of GhostConv is only 1/r of that of a conventional convolution operation, where r represents the number of the convolutional kernels used (a minimal sketch of this idea is given after this list).

  (ii) SKConv [20] splits the input feature maps into several branches along the channel dimension and uses convolutional kernels to capture features with receptive fields of different sizes.

  (iii) ShuConv [13] adopts point-wise group convolution as well as channel shuffle, which significantly reduces the computational overhead.
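To make the Ghost idea above concrete, the following is a minimal PyTorch-style sketch of a GhostConv-like module. The class name, the ratio r = 2 and the 3\(\times\)3 depthwise “cheap” transform are illustrative assumptions rather than the exact implementation of [1] or of DLW-NAS.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Minimal sketch of the Ghost idea from [1]: a primary convolution produces
    1/r of the output channels, and the remaining channels are generated by
    inexpensive depthwise ("linear") transformations of the primary maps."""
    def __init__(self, c_in, c_out, kernel_size=3, stride=1, r=2):
        super().__init__()
        primary = c_out // r               # intrinsic feature maps
        cheap = c_out - primary            # "ghost" feature maps
        self.primary_conv = nn.Sequential(
            nn.Conv2d(c_in, primary, kernel_size, stride, kernel_size // 2, bias=False),
            nn.BatchNorm2d(primary), nn.ReLU(inplace=True))
        self.cheap_op = nn.Sequential(     # depthwise transform of the primary maps
            nn.Conv2d(primary, cheap, 3, 1, 1, groups=primary, bias=False),
            nn.BatchNorm2d(cheap), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary_conv(x)
        return torch.cat([y, self.cheap_op(y)], dim=1)

x = torch.randn(1, 32, 32, 32)
print(GhostConv(32, 32)(x).shape)          # torch.Size([1, 32, 32, 32])
```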

Besides, we constrain the light-weight search space of DLW-NAS from the following two aspects:

  1. Limiting the size of convolutional kernels to 3\(\times\)3. Previous methods usually include convolutional kernels of size 5\(\times\)5 or even 7\(\times\)7. However, such kernels obtain a larger receptive field at the cost of a large number of parameters and FLOPs. Therefore, we abandon convolutional operations with receptive fields larger than 3\(\times\)3. In addition, stacked convolutional layers can be replaced with dilated convolution to keep a large receptive field. For example, with a dilation rate of 2, a 3\(\times\)3 dilated convolution can be regarded as a 5\(\times\)5 convolutional operation.

  2. Dropping redundant operations. In [21], Li et al. prove the existence of the multi-collinearity problem among the candidate operations of previous differentiable NAS approaches [8]. That is, some operations (such as average pooling and max pooling) have high linear correlation, which may split the contributions of the candidate operations during the model search. To avoid this problem, we preserve only one operation from each group of multi-collinear operations in the search space of DLW-NAS. This guarantees that the candidate operations are not redundant.

The Architecture Search Algorithm

This subsection introduces the proposed differentiable neural architecture search strategy. Following previous work [8], we search for a cell and stack it to build the deep architecture. The structure of the initial cell is shown in Fig. 1a. It can be abstracted as a directed acyclic graph (DAG). In the cell, the nodes represent layers of the network, while each directional edge \(E_{(i,j)}\) represents the information flow from node i to node j. During the search process, the information flows entering a node are summed as:

$$\begin{aligned} x^{(j)}=\sum _{i<j}\overline{o}^{(i, j)}(x^{(i)}), \end{aligned}$$
(1)

where \(x^{(i)}\) represents the output of the i-th node, and \(\overline{o}^{(i, j)}\) is the mixed operation over the candidate operations performed between node i and node j (e.g., 3\(\times\)3 dilated convolution and max pooling). To convert the discrete search into a continuous and differentiable optimization problem, like DARTS [8], we utilize the softmax function to compute the contribution of each operation:

$$\begin{aligned} \overline{o}^{(i, j)}=\sum _{o\in O}\frac{\exp (\alpha _{o}^{(i, j)})}{\sum _{o^{'}\in O}\exp (\alpha _{o^{'}}^{(i, j)})}o(x), \quad i<j, \end{aligned}$$
(2)

where O denotes the set of candidate light-weight operations in the search space, and o represents an operation between node i and node j with architecture weight \(\alpha _{o}^{(i,j)}\). Specifically, we use a vector \(\alpha ^{(i,j)}\) of dimension |O| to parameterize the candidate operations between node i and node j. The value of \(\alpha _{o}^{(i,j)}\) measures how much a candidate operation contributes to the feature map transformation.
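For illustration, the continuous relaxation in Eqs. (1) and (2) can be sketched in a few lines of PyTorch-style code. The class and variable names below are our own illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Weighted sum of all candidate operations on one edge (i, j), Eq. (2)."""
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)

    def forward(self, x, alpha_edge):
        # alpha_edge: vector of |O| architecture weights for this edge
        weights = F.softmax(alpha_edge, dim=-1)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# Toy example: two candidate operations on a single edge.
C = 16
ops = [nn.Conv2d(C, C, 3, padding=1, bias=False), nn.Identity()]
alpha = nn.Parameter(torch.zeros(len(ops)))     # learnable architecture weights
edge = MixedOp(ops)
x = torch.randn(2, C, 8, 8)
out = edge(x, alpha)                            # mixed output of this edge
# A node's input is then the sum of the mixed outputs of its incoming edges, Eq. (1).
```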

Fig. 1

The optimization process of the neural architecture. a A cell consisting of nodes and directional edges. The two white boxes denote the input nodes, while the blue boxes denote the inner nodes. The directional edges between two nodes denote the candidate operations. b In accordance with the learned architecture weights, only the operation that contributes the most is retained. c The ultimate sparse architecture obtained with our proposed architecture optimization strategy

To learn light-weight architectures, we add computational complexity constraints to the differentiable optimization algorithm. We use \(P({o}^{(i,j)})\) and \(F({o}^{(i,j)})\) to denote the number of parameters and FLOPs of operation \({o}^{(i,j)}\), respectively. To maintain the differentiability of the objective function, the weighted parameter number and FLOPs between a pair of nodes are computed as:

$$\begin{aligned} \mathbb {E}_{[{params}^{(i,j)}]}=\sum _{o\in O}\alpha _{o}^{(i,j)} \times P({o}^{(i,j)}), \end{aligned}$$
(3)
$$\begin{aligned} \mathbb {E}_{[{flops}^{(i,j)}]}=\sum _{o\in O}\alpha _{o}^{(i,j)} \times F({o}^{(i,j)}). \end{aligned}$$
(4)

In this way, the architecture parameters can be updated using the gradients:

$$\begin{aligned} \left\{ \begin{array}{c} \frac{\partial {\mathbb {E}_{[{params}^{(i,j)}]}}}{\partial {\alpha _{o}^{(i,j)}}} = P({o}^{(i,j)}), \\ \frac{\partial {\mathbb {E}_{[{flops}^{(i,j)}]}}}{\partial {\alpha _{o}^{(i,j)}}} = F({o}^{(i,j)}). \\ \end{array} \right. \end{aligned}$$
(5)

The parameter number and FLOPs of a cell are obtained by summing up the parameter numbers and FLOPs of all the internal operations:

$$\begin{aligned} \mathbb {E}_{[{params}_{n}]}=\sum _{i\in n}\sum _{j > i}\mathbb {E}_{[{params}^{(i,j)}]}, \end{aligned}$$
(6)
$$\begin{aligned} \mathbb {E}_{[{flops}_{n}]}=\sum _{i\in n}\sum _{j > i}\mathbb {E}_{[{flops}^{(i,j)}]}. \end{aligned}$$
(7)

After the search stage, the constructed deep architecture contains several cells, and its parameter number and FLOPs are the sums over all the cells:

$$\begin{aligned} \mathbb {E}_{[{params}]}=\sum _{n}\mathbb {E}_{[{params}_{n}]}, \end{aligned}$$
(8)
$$\begin{aligned} \mathbb {E}_{[{flops}]}=\sum _{n}\mathbb {E}_{[{flops}_{n}]}. \end{aligned}$$
(9)

Based on the above analysis, we design the objective function as follows:

$$\begin{aligned} \mathcal {L} = CE + \lambda _{1}\mathbb {E}_{[{params}]} + \lambda _{2}\mathbb {E}_{[{flops}]}, \end{aligned}$$
(10)

where CE is the cross-entropy loss used to evaluate the deep architecture, and the latter two terms restrict the number of parameters and FLOPs, respectively. Specifically, \(\lambda _{1}\) and \(\lambda _{2}\) are trade-off factors among accuracy, parameter number and FLOPs.
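As a concrete illustration of Eqs. (3)-(10), the following PyTorch-style sketch treats P(o) and F(o) as pre-computed constant tables. The function names, toy sizes and numeric values are illustrative assumptions; the default \(\lambda _{1}\) and \(\lambda _{2}\) mirror the settings used later in our experiments.

```python
import torch
import torch.nn.functional as F

def expected_complexity(alpha, params_table, flops_table):
    """alpha:        [num_edges, |O|] architecture weights of one cell
    params_table: [|O|] parameter count P(o) of each candidate operation
    flops_table:  [|O|] FLOPs F(o) of each candidate operation
    Returns the weighted parameter count and FLOPs of the cell, Eqs. (3)-(7)."""
    return (alpha * params_table).sum(), (alpha * flops_table).sum()

def dlw_nas_loss(logits, targets, alphas, params_table, flops_table,
                 lam1=0.01, lam2=0.005):
    """Eq. (10): cross entropy plus parameter and FLOPs penalties summed over
    all cells; alphas is a list with one alpha matrix per cell, Eqs. (8)-(9)."""
    ce = F.cross_entropy(logits, targets)
    e_params = sum(expected_complexity(a, params_table, flops_table)[0] for a in alphas)
    e_flops = sum(expected_complexity(a, params_table, flops_table)[1] for a in alphas)
    return ce + lam1 * e_params + lam2 * e_flops

# Toy usage: one cell with 3 edges and 2 candidate operations, 4 samples, 10 classes.
alpha = torch.rand(3, 2, requires_grad=True)
P = torch.tensor([1312.0, 9216.0])           # P(o) for the two operations (toy values)
Fl = torch.tensor([1.3e6, 9.4e6])            # F(o) for the two operations (toy values)
logits, targets = torch.randn(4, 10), torch.randint(0, 10, (4,))
loss = dlw_nas_loss(logits, targets, [alpha], P, Fl)
loss.backward()   # grad w.r.t. alpha from the penalties is lam1*P + lam2*F, cf. Eq. (5)
```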

Fig. 2

Topological structure of normal cells and reduction cells. a Normal cell searched by DLW-NAS on CIFAR-10. b Reduction cell searched by DLW-NAS on CIFAR-10. c Normal cell searched by DLW-NAS on CIFAR-100. d Reduction cell searched by DLW-NAS on CIFAR-100. e Normal cell searched by DARTS [8] on CIFAR-10. f Normal cell searched by SparseNAS [22] on CIFAR-10

Neural Architecture Optimization

As shown in Fig. 1, the optimization process of DLW-NAS mainly includes two stages. In the first stage, the architecture weights between each pair of nodes in the cell (as shown in Fig. 1a) are learned and sorted, and only the operation corresponding to the maximum architecture weight is retained. The original cell then becomes a discrete structure (as shown in Fig. 1b). We emphasize that, although most of the operations in the original cell have been discarded, there is still a connection between every pair of nodes in the discrete structure, as shown in Fig. 1b.

In the second stage, while keeping the model accuracy, we further sparsify the discrete structure and reduce its parameter number and FLOPs. Specifically, the proposed strategy, based on an optimized spanning tree, enables us to obtain a sparse cell topology. Concretely, we first abstract the discrete structure into a weighted undirected graph, where the weight of each edge is set to the inverse of the corresponding architecture weight. Then, we find a minimum spanning tree T in the undirected graph. Since the construction of the minimum spanning tree T starts from the edges with the smallest weights, the edges contained in T correspond to the operations with relatively large architecture weights. Nevertheless, some inner nodes may end up with only output edges but no input edge, which would interrupt the information flow. To solve this problem, we traverse every node in the cell; whenever a node has this problem, the incoming operation with the largest architecture weight is added to T. With this step, we complete the transformation from the discrete structure to the sparse structure (as shown in Fig. 1c).
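A compact sketch of this spanning-tree step is given below, assuming each retained operation of the discrete cell is represented as a tuple (i, j, \(\alpha\)). The plain Kruskal construction and the edge format are illustrative choices, not the exact implementation.

```python
def sparsify_cell(edges):
    """edges: list of (i, j, alpha) for the discrete cell, i < j.
    Returns a minimum spanning tree over edge weights 1/alpha, plus, for any
    inner node left without an input edge, its strongest input edge."""
    nodes = {n for i, j, _ in edges for n in (i, j)}
    parent = {n: n for n in nodes}

    def find(n):                        # union-find used by Kruskal's algorithm
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    kept = []
    for i, j, a in sorted(edges, key=lambda e: 1.0 / e[2]):   # smallest 1/alpha first
        ri, rj = find(i), find(j)
        if ri != rj:                    # edge joins two components -> keep it
            parent[ri] = rj
            kept.append((i, j, a))

    # Repair inner nodes that lost all of their input edges.
    inputs = {0, 1}                     # the two input nodes of the cell
    for n in sorted(nodes - inputs):
        if not any(j == n for _, j, _ in kept):
            best = max((e for e in edges if e[1] == n), key=lambda e: e[2])
            kept.append(best)
    return kept

# Example: a small discrete cell where nodes 0 and 1 are the input nodes.
cell = [(0, 2, 0.9), (1, 2, 0.4), (0, 3, 0.2), (1, 3, 0.7), (2, 3, 0.8)]
print(sparsify_cell(cell))              # [(0, 2, 0.9), (2, 3, 0.8), (1, 3, 0.7)]
```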

Fig. 3

a The effect of the number of connections in the searched cell on model parameters and evaluation accuracy on CIFAR-10. b The effect of the number of connections in the searched cell on model parameters and evaluation accuracy on CIFAR-100

As shown in Fig. 2, through the above two stages, the proposed architecture optimization strategy yields relatively sparse cells compared to those of DARTS with 8 connections and SparseNAS with 12 connections. However, one issue remains: whether the reserved connections are suitable for the learning tasks. In Fig. 3, we show experimental results obtained on the CIFAR-10 and CIFAR-100 datasets. From Fig. 3, we can see that 6 is an elbow point of the validation error curves. This indicates that the object recognition accuracy of the evaluation model with only five connections in the cell is much lower than that with six connections. Moreover, although the evaluation models with 7 or 8 connections in the searched cell perform comparably to that with 6 connections, they need many more parameters. This motivates us to restrict the number of operations in the cell to a constant \(M = 6\). When the number of retained operations N is less than M, the \(M - N\) discarded operations with the largest architecture weights are added back. These M operations are then used to construct the final cell.
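Continuing the same illustrative edge representation, the edge-number rule can be sketched as follows; M = 6 comes from the analysis above, while the tuple format and function name are assumptions.

```python
M = 6  # target number of connections per cell

def regularize_edge_count(kept, all_edges, m=M):
    """kept:      edges retained by the spanning-tree step
    all_edges: all edges of the discrete cell
    If fewer than m edges are kept, add back the discarded edges with the
    largest architecture weights until the cell has m connections."""
    discarded = [e for e in all_edges if e not in kept]
    discarded.sort(key=lambda e: e[2], reverse=True)     # largest alpha first
    need = max(0, m - len(kept))
    return kept + discarded[:need]

kept = [(0, 2, 0.9), (2, 3, 0.8), (1, 3, 0.7)]
all_edges = kept + [(1, 2, 0.4), (0, 3, 0.2), (0, 4, 0.6), (3, 4, 0.5)]
print(len(regularize_edge_count(kept, all_edges)))       # 6
```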

Algorithm 1

Algorithm 1 describes the overall process of the proposed neural architecture optimization strategy. Its input is the trained architecture weights of the initial cell: an N × 8 matrix \(\alpha\), where 8 is the number of candidate operations in the search space. Lines 1 to 7 transform the initial cell into a discrete structure. Lines 8 to 18 describe the transformation from the discrete structure to a sparse structure. Lines 9 to 23 introduce the edge-number regularization rule used to balance the accuracy and complexity of the network. Finally, a sparse, light-weight cell with excellent performance is output.

Complexity Analysis

In this subsection, we analyze the computational complexity of DLW-NAS from two aspects, i.e., the search space and the whole deep architecture.

Existing differentiable NAS methods are mainly implemented based on the DARTS search space. Hence, we compare the proposed light-weight search space with that of DARTS. For convenience, we analyze the computational complexity of the search space by taking a pair of nodes as an example. We assume that the size of the input feature maps is 32\(\times\)32, both input and output have 32 channels (\(C_{in} = C_{out} = 32\)), and the kernel size and convolutional stride are 3\(\times\)3 and 1, respectively. Ignoring the bias, Table 1 compares the parameters and FLOPs of the search spaces of DLW-NAS and DARTS. For clarity, we compute the mean parameter number and FLOPs over all the operations between a pair of nodes as follows.

$$\begin{aligned} \begin{aligned} Params=\frac{1}{|O|}\sum _{o\in O}p_{o}, \quad p_{o}>0,\\ FLOPs=\frac{1}{|O|}\sum _{o\in O}m_{o}, \quad m_{o}>0, \end{aligned} \end{aligned}$$
(11)

where O denotes the set of candidate operations, |O| is the size of O, while \(p_{o}\) and \(m_{o}\) are the parameter number and FLOPs of operation o, respectively. Compared with DARTS, the average parameters and FLOPs of DLW-NAS are smaller, at only 0.126M and 0.84M, respectively. This demonstrates that the computational complexity of the search space of DLW-NAS is lower than that of DARTS.
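For reference, per-operation counts of the kind that feed Eq. (11) can be obtained as in the sketch below. The three modules instantiated here are simplified stand-ins chosen for illustration (the full DLW-NAS and DARTS operation sets are not reproduced), so the printed average does not correspond to the 0.126M figure above.

```python
import torch.nn as nn

C = 32
candidates = {
    'sep_conv_3x3': nn.Sequential(               # depthwise 3x3 + pointwise 1x1
        nn.Conv2d(C, C, 3, padding=1, groups=C, bias=False),
        nn.Conv2d(C, C, 1, bias=False)),
    'dil_conv_3x3': nn.Conv2d(C, C, 3, padding=2, dilation=2, bias=False),
    'conv_5x5':     nn.Conv2d(C, C, 5, padding=2, bias=False),   # the kind excluded by DLW-NAS
}

def param_count(module):
    # number of learnable parameters of one candidate operation (p_o in Eq. (11))
    return sum(p.numel() for p in module.parameters())

counts = {name: param_count(op) for name, op in candidates.items()}
print(counts)                                    # sep_conv_3x3: 1312, dil_conv_3x3: 9216, conv_5x5: 25600
avg_params = sum(counts.values()) / len(counts)  # the Params term of Eq. (11)
print(f'average parameters per operation: {avg_params / 1e6:.4f}M')
# FLOPs (m_o) are obtained analogously by multiplying the per-position
# multiply-adds of each operation by the 32x32 spatial size of the output.
```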

Table 1 Comparison with DARTS on candidate operations of the search space. The best results are highlighted with boldface

To analyze the complexity of the architecture, we take the evaluation model stacking 20 cells as an example. The architecture complexity is measured by the number of feature map transformations inside the cells. DLW-NAS and DARTS contain 6 and 8 connections per cell, respectively. Thus, an image needs only 20 × 6 = 120 transformations in DLW-NAS from input to output, while DARTS requires 20 × 8 = 160 transformations. Hence, our architecture optimization method reduces the number of operations of the evaluation model by 25%. This shows that the architecture learned by DLW-NAS is more light-weight than that learned by DARTS.

Experiments and Results

In this section, we report the experimental settings and results in detail.

Datasets and Implementation Issues

To evaluate the proposed DLW-NAS method, we conduct experimental comparisons on three standard image classification datasets: CIFAR-10, CIFAR-100 and ImageNet (mobile setting). We briefly introduce them below.

CIFAR-10 and CIFAR-100 [23]

These two datasets each consist of 50K training images and 10K testing images, with image size 32\(\times\)32\(\times\)3. The images belong to 10 and 100 classes, respectively. During the architecture search, half of the training data are used to train the architecture weights, and the remaining half are used to adjust the parameters of the searched architecture.

ImageNet [24]

It is composed of 1.3M images for training and 50K images for testing, belonging to 1,000 classes. The size of the images is 224\(\times\)224\(\times\)3. The mobile setting is used for evaluation.

Following the optimization algorithm in [25], we learn the architecture weights and network parameters. After the architecture search stage, we apply the proposed architecture optimization strategy to transform the searched cell into a sparse structure containing \(M = 6\) connections. Moreover, \(\lambda _{1}\) and \(\lambda _{2}\) are set to 0.01 and 0.005, respectively. When evaluating the architecture, we stack 20 cells and train the deep architecture on CIFAR-10 and CIFAR-100 from scratch, while on ImageNet we stack 14 cells to test the deep architecture. All experiments are conducted on two NVIDIA GTX 1080Ti GPUs.

The Searched Architectures

We conduct the architecture search on CIFAR-10 and CIFAR-100. Figure 2 presents the architectures searched by DLW-NAS. Compared with the SOTA methods DARTS [8] and SparseNAS [22], the normal cells searched by DLW-NAS are sparser, with only 3/4 of the internal connections of DARTS and 1/2 of those of SparseNAS. These results are attributed to the architecture optimization strategy of DLW-NAS. Furthermore, the up-to-date light-weight operations in the search space help deliver a highly efficient deep architecture.

Architecture Evaluation

In the following, we report the evaluation results on the searched architectures.

Experiments on CIFAR-10 and CIFAR-100

Following previous work, we construct the evaluation model with 20 searched cells, including 18 normal cells and 2 reduction cells. With a batch size of 96, we train the evaluation model for 600 epochs from scratch. The standard SGD optimizer is used; we set the initial learning rate to 0.025, the momentum to 0.9 and the weight decay to \(3\times 10^{-4}\) on CIFAR-10 and \(5\times 10^{-5}\) on CIFAR-100. An auxiliary tower [26] with weight 0.4 and cutout regularization [27] of length 16 are also applied.
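For clarity, the optimizer settings quoted above translate into the following minimal sketch; the tiny placeholder network stands in for the 20-cell evaluation model, and the learning-rate decay schedule is not specified in the text, so only the initial value is set here.

```python
import torch
import torch.nn as nn

# Placeholder network standing in for the 20-cell evaluation model (not reproduced here).
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(16, 10))

epochs, batch_size = 600, 96
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.025,             # initial learning rate
                            momentum=0.9,
                            weight_decay=3e-4)    # 5e-5 when training on CIFAR-100
# Auxiliary tower (weight 0.4) and cutout (length 16) are applied during training;
# their implementation and the learning-rate schedule are omitted from this sketch.
```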

In Table 2, we compare our method with other NAS methods, where “−” indicates that the corresponding results are not reported in the cited work. The second column shows the recognition error rate of the CNN models constructed by different methods on CIFAR-10; the smaller the value, the higher the recognition accuracy. The third and fourth columns report the number of model parameters and the amount of computation (in millions, “M”); smaller values mean lower complexity, i.e., a lighter model. The last two columns show the computing time required for the automatic construction of the CNNs and the construction method, where RL (reinforcement learning), EA (evolutionary algorithm) and SMBO (sequential model-based optimization) are all methods for automatic model construction. DLW-NAS achieves a 2.73% error rate with only 2.3M parameters and 334M FLOPs. Compared with SOTA NAS approaches such as BayesNAS [17] and DSO-NAS [18], DLW-NAS obtains higher classification accuracy with a greatly reduced number of model parameters. The test error of SparseNAS [22] is comparable with that of DLW-NAS; however, its searched cell contains more connections, so SparseNAS has more parameters and FLOPs. The parameters and FLOPs of DLW-NAS are about 37% fewer than those of SparseNAS.

Table 2 Comparison with state-of-the-art methods on CIFAR-10. The best results are highlighted with boldface
Table 3 Comparison with state-of-the-art methods on CIFAR-100. The best results are highlighted with boldface

Table 3 shows the performance of different methods on CIFAR-100; the evaluation indices are consistent with those in Table 2. As shown in Table 3, the architecture searched by DLW-NAS on CIFAR-100 uses only 2.47M parameters and 376M FLOPs and delivers an error rate of 17.12%. Compared with DARTS, DLW-NAS achieves better classification performance while using fewer parameters and FLOPs, which further demonstrates its advantages over SOTA approaches. In short, compared with manually designed networks and with evolutionary and reinforcement learning based methods, the network structures searched by our method have better performance, faster search speed, fewer parameters and less computation. In comparison with other differentiable neural architecture search methods, our method also has distinct advantages: it obtains results comparable to the SOTA methods with a faster search speed, fewer parameters and less computation.

Experiments on ImageNet

In this experiment, the architectures searched on CIFAR-10 are transferred to ImageNet for the mobile-setting test. Following the common practice on the ImageNet dataset, top-1 and top-5 error rates are used to evaluate the compared methods, and the complexity indices are the same as for the CIFAR-10 and CIFAR-100 datasets. Following convention [8], we construct the evaluation model with 14 cells, including 12 normal cells and 2 reduction cells. We train the evaluation model for 250 epochs with a batch size of 256. Besides, we set the initial learning rate of the SGD optimizer to 0.1.

It can be seen from Table 4 that the results obtained by DLW-NAS are clearly superior to those of manually designed networks and other differentiable methods on ImageNet. With only 3.8M parameters and 397M FLOPs, DLW-NAS achieves a 26.1% top-1 error rate and outperforms DARTS in both model complexity and classification accuracy. In addition, DLW-NAS performs slightly better than DSO-NAS [18], which is directly searched on ImageNet, with even fewer parameters and FLOPs. This demonstrates the effectiveness of DLW-NAS in architecture search. In terms of accuracy, SparseNAS performs slightly better than DLW-NAS, but it requires many more parameters.

Table 4 Comparison with state-of-the-art methods on ImageNet (mobile setting). The best results are highlighted with boldface

Ablation Study

DLW-NAS mainly includes three innovations: the light-weight search space (LWSS), the search strategy with complexity constraints (SSCC) and the architecture optimization strategy (AOS). We conduct an ablation study to evaluate the contribution of each component to the image classification performance of DLW-NAS.

Table 5 shows the experimental results on CIFAR-10. It can be seen that each component has a certain impact on the accuracy and complexity of the evaluation model. For example, the results in the fourth row are based on the DARTS search space, while those in the sixth row are based on the proposed LWSS. As we can see, using the proposed LWSS yields models with lower computational complexity and higher recognition accuracy than using the DARTS search space. Comparing the second and sixth rows, the proposed AOS greatly reduces the number of parameters and the computational complexity. Overall, applying all three components, DLW-NAS obtains SOTA accuracy with low computational overhead.

Table 5 Contribution of LWSS, SSCC and AOS to model performance and complexity. The best results are highlighted with boldface

Conclusion

In this work, we propose DLW-NAS, a differentiable light-weight neural architecture search method. To realize light-weight architecture search at the source, we establish a novel light-weight search space. Furthermore, we propose a new differentiable architecture search strategy with complexity constraints. In addition, we introduce an architecture optimization strategy to sparsify the connections in the searched architecture; this strategy reduces the parameter number and computational complexity while largely preserving model performance. To evaluate the proposed DLW-NAS method, we test it on the CIFAR-10, CIFAR-100 and ImageNet datasets. The results demonstrate its advantages over SOTA approaches.