1 Introduction

Recent advances in deep neural networks have resulted in growing interest in automated machine learning (AutoML), which aims to optimize hyper-parameters and to identify network architectures suited to specific datasets without much human intervention. The target of AutoML can be generally formalized as follows:

$$\begin{aligned} {x^* \in \mathop {\mathrm {arg\, min}}\limits _{x \in \mathcal {X}} f(x)}, \end{aligned}$$
(1)

where \(f: \mathcal {X}\rightarrow \mathbb {R}\) is a function defined over a search space \(\mathcal {X}\).

In the past few years, both white-box and black-box approaches have been developed for AutoML. In a white-box formulation [27, 47], the form of the function f is explicitly known, so that \(x \in \mathcal {X}\) can be optimized by differentiable programming. In comparison, under a black-box assumption [8, 37], the function f can only be evaluated at \(x \in \mathcal {X}\), yielding noisy observations.

One important factor for AutoML is the definition and construction of the search space \(\mathcal {X}\). Different search spaces \(\mathcal {X}\) give rise to different settings for AutoML and thus promote a variety of search strategies. If \(\mathcal {X}\subset \mathbb {R}^d\) is a convex set, Eq. (1) can be viewed as a hyper-parameter optimization (HPO) problem, and Bayesian Optimization (BO) [6, 37, 39] provides an elegant compromise: a surrogate model captures the likelihood of the function f, and an acquisition function is maximized to trade off exploration and exploitation. Given a tree-based search space \(\mathcal {X}\subset \mathcal {T}\), which is often used either for modeling the dependency among hyper-parameters or for searching a macro neural architecture whose layers are stacked sequentially, some works [38, 40] capture the dependency among layers and decide a proper exploration-exploitation balance using the Monte Carlo Tree Search (MCTS) strategy. Moreover, if the search space is extended to a Directed Acyclic Graph (DAG), i.e., \(\mathcal {X}\subset \mathcal {G}\), where \(\mathcal {G}\) is a DAG search space, then multi-branch and skip relationships among hyper-parameters can be established. This elastic expression leads to the research direction called Neural Architecture Search (NAS). In [12, 15, 27, 43, 49] the architecture is represented as a super-net, and a white-box approach (i.e., a differentiable programming based approach) is used to update the architecture importance and the network weights. However, due to the approximation in bi-level optimization, the differentiable methods (both one-shot and single-path) suffer from instability, i.e., during the search process the architecture collapses and the skip connection tends to dominate [10, 41]. In contrast, black-box BO methods have also been introduced to design NAS strategies. Nevertheless, because of the inconsistency between the vector space \(\mathbb {R}^d\) and the DAG space \(\mathcal {G}\), BO solutions require extra effort to encode the neural architecture [19, 30, 35, 45].

Another important factor for AutoML is the search strategy. Most existing NAS methods rely on reinforcement learning, genetic algorithms, Bayesian optimization, or differentiable programming. These methods decouple the search space from the search strategy, which ignores graph properties and lowers efficiency. To bridge the gap, we focus on designing an efficient and stable search strategy on a DAG search space in a fully graphic manner, akin to using MCTS in a tree-based search space or a Gaussian process (GP) in a Euclidean space. A few prior works [21, 46] have pioneered NAS based on graph theory. However, [21] is limited to a linear search space and lacks evaluation on real datasets, whereas [46] lacks feedback rewards, making it an open-loop strategy; rather than a NAS method, it is essentially a way to define the search space. Recently, GFlowNet [4, 5] views a Markov Decision Process (MDP) as a flow network and connects the flow-matching (conservation) conditions to a policy that generates samples according to the target reward function. Although GFlowNet targets sampling a diverse set of candidates, which is not necessarily the goal of NAS, it builds a solid foundation for black-box optimization with flow networks.

In this paper, by introducing the maximum-flow calculation on a flow network, we formulate the NAS task as a multi-graph maximum-flow problem defined directly on a DAG search space, with edge capacities indicating the contribution of the corresponding operations to the model performance. To be specific, our contributions are three-fold:

  1. We propose a Max-Flow based NAS approach, called MF-NAS, which defines both the search space and the search strategy in a fully graphic manner.

  2. To the best of our knowledge, we make the first attempt to address the NAS task from a multi-graph maximum-flow perspective.

  3. We conduct extensive experiments to evaluate the proposed search strategy across multiple datasets and multiple search spaces, and show competitive performance.

2 Preliminaries

2.1 Multi-graph Flow in Graph Theory

Consider a directed graph \(G := (V,E)\), where V is the vertex set and E is the edge set. A multi-graph is defined as a graph with multiple edges (i.e., parallel edges) between two vertices. A flow graph is a directed graph where each edge has a capacity and receives a flow that is limited by the edge capacity. The formal definition of the problem for finding a feasible and maximal flow on a multi-graph is given as follows.

Definition 1 (Maximum-Flow on Multi-Graph)

Given a directed graph \(G = (V,E)\) with a source node \(s \in V\), a sink node \(t \in V\), an edge set \(E=\{e_{u,v}^k | u,v \in V, k \in \mathcal {K}\}\), and an edge capacity function \(c: V \times V \times \mathcal {K} \rightarrow \mathbb {R}\), where \(\mathcal {K} := \{0,1,\ldots ,K-1\}\) and K is the number of parallel edges, the max-flow on the multi-graph G is defined as a feasible s-t-flow function \(f: V \times V \times \mathcal {K} \rightarrow \mathbb {R}\) that maximizes the flow value \(|f|:=\sum _{u\in V, k\in \mathcal {K}}f(e_{u,t}^k)\) on G, where \(f(e_{u,t}^k)\) denotes the flow on the k-th edge from node u to node t.

Flow graphs [2, 44] have been a useful tool for modeling network traffic, circulation, etc. In this paper, we demonstrate that NAS can be modeled as a maximum-flow selection problem on a multi-graph.
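For concreteness, the following Python sketch illustrates Definition 1 on a toy multi-graph. It is a minimal, illustrative implementation (a plain Edmonds-Karp augmenting-path loop with edges keyed by (u, v, k)) rather than an optimized max-flow solver.

```python
from collections import defaultdict, deque

def max_flow_multigraph(capacity, s, t):
    """capacity: dict {(u, v, k): c} with c >= 0; returns (|f|, flow dict)."""
    flow = defaultdict(float)                        # flow on each parallel edge (u, v, k)
    adj = defaultdict(list)                          # node -> [(neighbor, edge key, direction)]
    for (u, v, k) in capacity:
        adj[u].append((v, (u, v, k), +1))            # forward residual arc
        adj[v].append((u, (u, v, k), -1))            # backward residual arc

    def residual(edge, direction):
        return capacity[edge] - flow[edge] if direction > 0 else flow[edge]

    total = 0.0
    while True:
        parent = {s: None}                           # BFS for an augmenting s-t path
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, edge, direction in adj[u]:
                if v not in parent and residual(edge, direction) > 1e-12:
                    parent[v] = (u, edge, direction)
                    queue.append(v)
        if t not in parent:
            break                                    # no augmenting path left
        path, v = [], t
        while parent[v] is not None:
            u, edge, direction = parent[v]
            path.append((edge, direction))
            v = u
        bottleneck = min(residual(e, d) for e, d in path)
        for e, d in path:
            flow[e] += d * bottleneck
        total += bottleneck
    return total, dict(flow)

# toy multi-graph: two parallel edges ("conv" and "skip") between source s and sink t
caps = {("s", "t", "conv"): 0.9, ("s", "t", "skip"): 0.4}
print(max_flow_multigraph(caps, "s", "t"))           # max-flow value 1.3
```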

2.2 Hyperband and ENAS

Both Hyperband [24] and ENAS [32] are black-box methods. Hyperband [24] speeds up random search by using an early-stopping strategy to allocate resources adaptively. NAS and ENAS [32, 52] both train an RNN controller in a loop: the controller samples a child model (i.e., a candidate architecture) for training, and the achieved performance is fed back as a reward to train the LSTM controller. Specifically, in NAS the child models are trained from scratch to convergence, whereas in ENAS the child models share their weights to reduce the cost of architecture search. Our proposed MF-NAS follows the NAS pipeline: a controller generates candidate architectures (i.e., child models), and the parameters of the controller and the weights of the child models are updated alternately. Unlike NAS and ENAS, MF-NAS adopts a max-flow-based controller.
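The controller-child pipeline shared by NAS, ENAS, and MF-NAS can be summarized by the schematic loop below; the function and method names are hypothetical placeholders, not an actual API.

```python
# Schematic controller-child loop (hypothetical names, for illustration only).
def search_loop(controller, build_and_train, evaluate, num_steps):
    for _ in range(num_steps):
        child = controller.sample()         # propose a candidate architecture
        weights = build_and_train(child)    # from scratch (NAS) or with shared weights (ENAS)
        reward = evaluate(child, weights)   # validation accuracy
        controller.update(child, reward)    # policy gradient for an RNN controller;
                                            # capacity update for a max-flow controller
    return controller.best()
```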

Fig. 1.
figure 1

Illustration of NAS as a max-flow problem. a) A cell-based search space used in DARTS and NAS-Bench-201. b) Connections within a cell in NAS-Bench-201, where bold lines indicate a group of parallel edges (i.e., candidate operations). c) The flow moves through a node following the conservation law: the input flows are first collected by a gather function (e.g., summation) and then the converged flow is split into output flows. d) Parallel edges with limited capacities between a node pair. In each search step, only one path is selected to construct a child model, and the capacity of the selected edge may be updated with respect to the validation accuracy.

3 Methodology

In this section, we first formulate the NAS task as a max-flow problem on a multi-graph, and then show that MF-NAS can be interpreted from a non-parametric density estimation perspective. Finally, we integrate MF-NAS into classical AutoML pipelines, demonstrating that MF-NAS enjoys wide applicability.

Following NAS [52], ENAS [32], DARTS [27] and its subsequent works [7, 16, 41], we regard the architecture as a directed acyclic graph (DAG). For clarity, Fig. 1 sketches the search spaces of DARTS and NAS-Bench-201 from a multi-graph perspective. Specifically, operations are regarded as parallel edges between a pair of nodes, with learned capacities indicating their importance to the architecture, and the flow moving across nodes follows the conservation law [4].
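As an illustration, such a cell-based search space can be stored as a dictionary of edge capacities keyed by (node, node, operation). The sketch below assumes a NAS-Bench-201-style cell; the operation names and the initial capacity are illustrative choices.

```python
# A minimal sketch of a cell viewed as a multi-graph: every ordered node pair
# (u, v) with u < v carries one parallel edge per candidate operation.
OPS = ["none", "skip_connect", "conv_1x1", "conv_3x3", "avg_pool_3x3"]  # illustrative names

def build_cell_multigraph(num_nodes=4, init_capacity=0.1):
    capacity = {}
    for v in range(1, num_nodes):
        for u in range(v):                  # DAG: edges only flow from earlier nodes
            for op in OPS:                  # parallel edges between the pair (u, v)
                capacity[(u, v, op)] = init_capacity
    return capacity
```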

3.1 Neural Architecture Search with Maximum-Flow

Given a training dataset \(\mathcal {D}\) for a certain task, NAS aims to find an architecture \(m \in \mathcal {A}\) that maximizes a posterior probability, i.e.,

$$\begin{aligned} {m^* = \mathop {\mathrm {arg\, max}}\limits _{m\in \mathcal {A}} p(m|\mathcal {D})}. \end{aligned}$$
(2)

To formulate the task of NAS, we use a directed multi-graph to represent the cell-based search space, where parallel edges are the operations, including convolutions, skip connection, pooling, and no connection.

Instead of finding the optimal solution with maximum posterior probability as in Eq. (2), in this paper, we consider the flow value on the operation-induced multi-graph to be the candidate architecture’s fitness, which will be further interpreted in Sect. 3.2. Consequently, we convert the task of searching for the best architecture to a task of finding a feasible s-t-flow function f that achieves the maximum-flow value, that is:

$$\begin{aligned} f^* :=&\mathop {\mathrm {arg\, max}}\limits _{f}{\sum _{u\in V, k\in \mathcal {K}}f(e_{u,t}^k)}. \end{aligned}$$
(3)

Then, the optimal architecture \(m^*\) is selected as the set of edges whose flows are nonzero according to the optimal s-t-flow function \(f^*\):

$$\begin{aligned} m^* := \{e_{u,v}^k| u, v \in V ~\text {and}~ k\in \mathcal {K}, ~\text {where}~f^*(e_{u,v}^k) > 0\} . \end{aligned}$$
(4)

In this way, the optimization problem in (2) can be solved by addressing the optimal s-t-flow problem in (3) and (4).
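As a one-line sketch of Eq. (4), reusing the flow dictionary returned by the max-flow routine sketched in Sect. 2.1, the architecture keeps every parallel edge that carries nonzero flow.

```python
def select_architecture(flow, tol=1e-12):
    # Eq. (4): edges with (numerically) nonzero flow form the architecture m*.
    return {edge for edge, value in flow.items() if value > tol}
```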

Next, we give a proposition showing that the maximum-flow of the flow network in Fig. 1(a) can be obtained by calculating the maximum-flows of the normal cell and the reduce cell separately.

Proposition 1

Suppose that the architecture is formed by connecting normal cells \(m_{normal}\) and reduce cells \(m_{reduce}\) in a chain. Then the set of edges \(m^*\) that maximizes the flow value |f| as defined in Definition 1 can be obtained as the union of the edge set \(m_{normal}^*\) that maximizes the flow value of the normal cell and the edge set \(m_{reduce}^*\) that maximizes the flow value of the reduce cell, i.e., \(m^* = m_{normal}^* \cup m_{reduce}^*\)Footnote 1.

Proof

This can be proved by mathematical induction. Base case: if the architecture has only one layer (formed by one cell), the statement obviously holds. Inductive step: assume the statement holds for an n-layer architecture (i.e., \(m_n^* = m_{normal}^* \cup m_{reduce}^*\)); this is the induction hypothesis (IH). We then show that the statement holds for an \((n+1)\)-layer architecture as depicted in Fig. 2(c). Without loss of generality, we consider the \((n+1)\)-th cell to be a normal cell. Denote \(|f_{normal}|\), \(|f_n|\) and \(|f_{n+1}|\) as the flow value of the normal cell in Fig. 2(a), and the flow values of the n-layer and \((n+1)\)-layer networks in Fig. 2(c), respectively. According to the Max-Flow Min-Cut Theorem [44], the flow of the \((n+1)\)-layer network in the chain can be represented as \(\max ~|f_{n+1}| = \min \{ \max ~|f_{n}|,~\max ~ |f_{normal}|\}\). If \(\max ~|f_{n}| \ge \max ~ |f_{normal}|\), then maximizing the flow of the \((n+1)\)-layer network is equivalent to maximizing that of the normal cell, and thus \(m_{n+1}^* = m_n^* \cup m_{normal}^* \overset{IH}{=}\ m_{normal}^* \cup m_{reduce}^*\). If \(\max ~|f_{n}| < \max ~ |f_{normal}|\), then the maximum-flow problem of the \((n+1)\)-layer network is equivalent to maximizing the flow of the n-layer network. Any \(m_{normal}\) that satisfies \(\max ~ |f_{normal}| > \max ~|f_{n}|\) will not affect the result, hence \(m_{normal}^*\) can be chosen, giving the optimal solution \(m_{n+1}^* = m_n^* \cup m_{normal}^* \overset{IH}{=}\ m_{normal}^* \cup m_{reduce}^*\). This completes the proof.
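The composition rule used in the proof can be checked numerically with the toy max-flow routine from Sect. 2.1; the capacities below are synthetic and only illustrate that the chained max-flow equals the minimum of the per-cell max-flows.

```python
normal = {("s", "t", "conv"): 0.9, ("s", "t", "skip"): 0.4}   # per-cell max-flow: 1.3
reduce_ = {("s", "t", "pool"): 0.8}                           # per-cell max-flow: 0.8
chain = {("s", "m", "conv"): 0.9, ("s", "m", "skip"): 0.4,    # normal cell feeding
         ("m", "t", "pool"): 0.8}                             # the reduce cell
f_normal, _ = max_flow_multigraph(normal, "s", "t")
f_reduce, _ = max_flow_multigraph(reduce_, "s", "t")
f_chain, _ = max_flow_multigraph(chain, "s", "t")
assert abs(f_chain - min(f_normal, f_reduce)) < 1e-9          # 0.8 == min(1.3, 0.8)
```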

Fig. 2.
figure 2

Illustration of mathematical induction for flow network analysis (see Proposition 1): a) base case for normal cell; b) base case for reduce cell; c) inductive step.

Hence, Eq. (3) can be solved in a greedy manner by separately optimizing inside each cell:

$$\begin{aligned} \mathop {\textrm{arg max}}\limits _{f}&~~{\sum _{u\in V_{cell}, k\in \mathcal {K}}f(e_{u,t}^k)} \nonumber \\ \text {s.t.}&~~f(e_{u,v}^k) \le c(e_{u,v}^k), ~~\forall u,v \in V_{cell},~\forall k \in \mathcal {K} \nonumber \\&~\sum _{u\in V_{cell}} \sum _{k \ne K-1} \mathbb {I}(e_{u,v}^k) = M, ~~\forall v \in V_{cell}\nonumber \\&~~\mathbb {I}(e_{u,v}^{K-1}) = \prod _{k \ne K-1} (1-\mathbb {I}(e_{u,v}^k)),~~\forall u,v \in V_{cell},\end{aligned}$$
(5)

where \(V_{cell}\) is the node set of a normal cell or a reduce cell, t is the sink node of the cell (its index is determined by the number of nodes N inside the cell), M is the in-degree of each node, K and \(f(e_{u,v}^k)\) are defined in Definition 1, the operation with index \(K-1\) denotes no connectionFootnote 2, and \(\mathbb {I}(e_{u,v}^k)\) is an indicator which is 0 if \(f(e_{u,v}^k)=0\) and 1 otherwise. The capacities \(c(\cdot )\) of the parallel edges in the multi-graph flow network are updated according to the candidate model's reward (i.e., validation accuracy) as follows:

$$ \begin{aligned} c(e_{u,v}^k) = \left\{ \begin{array}{ll} e^{-(1-r)^2} &{} ~~ \text {if}~~ e_{u,v}^k \in m^* \& e^{-(1-r)^2} > c(e_{u,v}^k) \\ c(e_{u,v}^k) &{} ~~ \text {otherwise}, \end{array} \right. \end{aligned}$$
(6)

where \(m^*\) is the edge set defined in Eq. (4) and r is the accuracy on the validation set for image classification, so the capacity \(c(\cdot )\) is scaled to [0, 1].

MF-NAS benefits from this update rule in two aspects. On the one hand, removing less important operations has little influence on the final prediction accuracy, while deleting important operations can lead to a significant drop [3]. We use Eq. (6) to update the capacities, making important operations less prone to be removed. On the other hand, the update formula in Eq. (6) can be viewed as a contraction mapping of the accuracy, which promotes exploration.
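The capacity update of Eq. (6) amounts to a simple monotone bookkeeping step; the following sketch assumes the dictionary-based edge representation used in the earlier snippets.

```python
import math

def update_capacities(capacity, selected_edges, val_acc):
    """capacity: dict {(u, v, k): c}; selected_edges: edge set m*; val_acc: reward r."""
    new_cap = math.exp(-(1.0 - val_acc) ** 2)    # contraction of the accuracy into (0, 1]
    for edge in selected_edges:                  # only edges of the evaluated child model
        if new_cap > capacity.get(edge, 0.0):    # capacities never decrease (Eq. 6)
            capacity[edge] = new_cap
    return capacity
```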

We can use dynamic programming (DP) to solve Eq. (5). Take the DARTS search space as an example. In the DARTS search space, each cell has two input nodes whose initial flow values can be set to one (i.e., the upper bound of the accuracy), and the feature maps of the different nodes in one cell are concatenated as the output. As shown in Fig. 3, each node within the cell has two states: 1) there exists a link between the \((n-1)\)-th node and the n-th node; 2) there does not exist a link between the \((n-1)\)-th node and the n-th node. Denote by \(V_{cell}^{n}\) the sub-cell terminated by the n-th node (i.e., the nodes in the sub-cell are 1, 2, ..., n), whose max-flow value is defined as:

$$\begin{aligned} F_n^* := \max _{f}\sum _{u\in V_{cell}^{n-1}, k\in \mathcal {K}}f(e_{u, n}^k). \end{aligned}$$
(7)

We initialize \(F_{-1}^*=1\) and \(F_0^*=1\), and then recursively compute the flow value as follows:

$$\begin{aligned}&F_n^* = \max \{F_n^{(1)}, ~F_n^{(2)}\}, \end{aligned}$$
(8)

where

$$\begin{aligned} F_n^{(1)}&= 2F_{n-1}^* - F_{n-2}^* + \max _{u\in V_{cell}^{n-2}, k\in \mathcal {K}}c(e_{u, n}^k),\end{aligned}$$
(9)
$$\begin{aligned} F_n^{(2)}&= F_{n-1}^* + \max _{u\in V_{cell}^{n-2}, k\in \mathcal {K}}c(e_{u, n}^k) + \max _{u\in V_{cell}^{n-2} \backslash \{u^*\}, k\in \mathcal {K} \backslash \{k^*\}}c(e_{u, n}^k), \end{aligned}$$
(10)

in which \((u^*, k^*) := \arg \max _{u\in V_{cell}^{n-2}, k\in \mathcal {K}}c(e_{u, n}^k)\). Updating the two states corresponds to computing \(F_n^{(1)}\) and \(F_n^{(2)}\), respectively. By using the DP algorithm in Eq. (8) and the selection method in Eq. (4), we obtain the candidate architecture \(m^*\) for the next run. There is an equivalent representation of Eq. (8) in which the flow value is normalized before moving to the next nodes, making the factors slightly different. If the hypothesis space in Eq. (5) is small enough, random search is a practical alternative for solving Eq. (5).
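The recursion of Eqs. (8)-(10) can be written down directly. The sketch below assumes a DARTS-style cell whose two input nodes are indexed -1 and 0, with capacities stored as cap[(u, n, k)]; it is an illustration of the recursion rather than the exact implementation.

```python
def dp_max_flow(cap, num_inter_nodes, num_ops):
    F = {-1: 1.0, 0: 1.0}                      # F_{-1}^* = F_0^* = 1 (cell inputs)
    for n in range(1, num_inter_nodes + 1):
        # capacities of candidate edges reaching node n from nodes up to n-2
        pairs = {(u, k): cap.get((u, n, k), 0.0)
                 for u in range(-1, n - 1) for k in range(num_ops)}
        (u_star, k_star), best = max(pairs.items(), key=lambda it: it[1])
        rest = [c for (u, k), c in pairs.items() if u != u_star and k != k_star]
        second = max(rest) if rest else 0.0
        f1 = 2 * F[n - 1] - F[n - 2] + best    # Eq. (9): state with a link (n-1) -> n
        f2 = F[n - 1] + best + second          # Eq. (10): state without that link
        F[n] = max(f1, f2)                     # Eq. (8)
    return F
```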

Fig. 3.
figure 3

Illustration for the dynamic programming (DP) to solve Eq. (5).

3.2 MF-NAS: A Probabilistic Perspective

We now describe the motivation for the maximum-flow formulation as well as the relationship between the flow of a graph and the performance (i.e., classification accuracy) of a network architecture.

Let \(\varOmega _{1}\) be the set of available operations, \(\varOmega _{2}\) the set of feasible edges in a search space, \(\varOmega =\varOmega _{1} \times \varOmega _{2}\), and \(\mathcal {A}\) a class (i.e., a set of sets). Each element \(A \in \mathcal {A}\) represents a feasible architecture in the search space and is a subset of \(\varOmega \), i.e., \(A \subset \varOmega \). Let \(\phi : \mathcal {A} \rightarrow \mathbb {R}\) be a set function indicating the reward of an architecture. We define the probability measure of an architectureFootnote 3 achieving the best reward, i.e.,

$$\begin{aligned} p(A):=p(\phi (A)=r^*), \end{aligned}$$
(11)

where \(r^* = \max _{A' \in \mathcal {A}}{\phi (A')}\). Furthermore, the conditional probability over an architecture given a specific edge in it can be derived as \(p(A|a):=p(A|a \in A)=p(\phi (A)=r^*|a \in A)\). Here, a denotes the same object as \(e_{u,v}^k\) in Definition 1.

To make fewer assumptions about the conditional distribution, we estimate the density with a non-parametric approach. Choosing a Gaussian kernel gives rise to the following kernel density estimation (KDE) model:

$$\begin{aligned} p(\phi (A)=r^*|a \in A) = \frac{1}{N} \sum _{\begin{array}{c} i=1 \\ a \in A_i \end{array}}^{N} \frac{\exp {\{-\frac{||r^*-\phi (A_i)||^2}{2\sigma ^2}\}}}{\sqrt{2\pi }\sigma }, \end{aligned}$$
(12)

where N architectures containing edge a are sampled. For a small enough \(\sigma \), the exponential term with the smallest \(\Vert r^*-\phi (A_i)\Vert ^2\) approaches zero most slowly, and hence the sampled architecture with the best reward, \(A_s^* = \arg \max _{A_i} {\phi (A_i)}\), will dominate the density. That is, Eq. (12) can be approximated by:

$$\begin{aligned} p(\phi (A)=r^*|a \in A) \propto \exp {\{-(r^*-\phi (A_s^*))^2}\}. \end{aligned}$$
(13)

For image classification tasks, the reward is measured by the accuracy on the validation set, and the upper bound of the reward can be set to \(r^*=1\). Obviously, Eq. (13) has the same form as the capacity update in Eq. (6). As a consequence, the information flow on an edge can be interpreted as the approximate conditional probability in Eq. (13), which reflects how likely an architecture containing that edge is to achieve the best performance. We further show that the propagation of this conditional probability can be viewed as a flow moving from the input edges to the output edges. In Fig. 4, three basic types of network topology are illustrated: a) one-input/one-output, where an architecture A containing the output edge has only one input edge, so the conditional probabilities conditioned on the output and input edges are identical; b) many-input/one-output, where an architecture containing any input edge flows through the output edge, so the probability conditioned on the output edge is the sum of the probabilities conditioned on the input edges; c) one-input/many-output, where an architecture containing the input edge spreads to multiple outputs chosen uniformly a priori, so the probability is divided equally by the number of output edges. Note that all network topologies, including residual and multi-branch graphs, can be derived from these three basic types. In a nutshell, the probability defined in Eq. (11) can be propagated as a flow; therefore, finding an architecture with the highest validation accuracy is equivalent to finding an architecture with the maximum conditional probability on the output edge, which is in turn equivalent to finding a maximum flow on the DAG with the capacity defined in Eq. (6).
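A small numerical check of this approximation: for a small bandwidth \(\sigma\), the KDE of Eq. (12) is dominated by the best-reward sample, matching the form of Eq. (13) and of the capacity update in Eq. (6). The rewards below are synthetic values used for illustration only.

```python
import numpy as np

def kde_conditional(rewards, r_star=1.0, sigma=0.05):
    rewards = np.asarray(rewards)                        # phi(A_i) of sampled architectures containing edge a
    kernel = np.exp(-(r_star - rewards) ** 2 / (2 * sigma ** 2))
    return kernel.mean() / (np.sqrt(2 * np.pi) * sigma)  # Eq. (12)

rewards = [0.88, 0.91, 0.93]                             # synthetic validation accuracies
print(kde_conditional(rewards, sigma=0.05))              # dominated by the 0.93 sample
print(np.exp(-(1.0 - max(rewards)) ** 2))                # the form of Eq. (13) / Eq. (6)
```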

Fig. 4.
figure 4

Three types of network topology. a) one-to-one; b) many-to-one; c) one-to-many.

Algorithm 1

3.3 MF-NAS Pipeline

The proposed MF-NAS provides a candidate architecture generation method in a single-path NAS solution. To demonstrate its efficiency and stability, we apply MF-NAS to a multi-fidelity pipeline. For example, Hyperband [24] selects the top-k candidates from a fixed pool, whereas MF-NAS dynamically generates new promising candidates as the search progresses by maximizing the network flow. The reason is that the maximum-flow controller involves the edge capacity constraints, and the flow on an individual edge can differ while the net flow value of a cell stays the same; in other words, multiple architectures may yield the same net flow value. To enhance the exploration capability of MF-NAS, we choose the candidate architecture with an \(\epsilon \)-greedy strategy [28]: a random architecture is taken with probability \(\epsilon \), and the max-flow architecture is chosen with probability \(1-\epsilon \) (see the sketch below). We anneal \(\epsilon \) from 1 to 0 so that the controller starts in an exploration phase and gradually moves towards exploitation. We also keep a small portion of the top architectures across rounds to stabilize the search process. We refer to Algorithm 1 for an implementation of MF-NAS, which aims to find the architecture whose classification accuracy on the validation set is maximized over the operation-induced DAG search space.
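The candidate-generation step can be sketched as follows; random_architecture and max_flow_architecture are hypothetical helpers standing in for uniform sampling over the search space and for solving Eq. (5) with the DP of Eq. (8), respectively.

```python
import random

def propose_candidate(capacity, epsilon, random_architecture, max_flow_architecture):
    if random.random() < epsilon:             # explore: a uniformly random child model
        return random_architecture()
    return max_flow_architecture(capacity)    # exploit: the current max-flow architecture
```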

4 Experiments

To evaluate the effectiveness of our proposed MF-NAS approach, we conduct experiments with different DAG search spaces on several benchmark datasets, including CIFAR-10 [20], CIFAR-100 [20], STL-10 [13], FRUITS [29], FLOWER-102 [31], Caltech-256 [17] and ImageNet [34].

Settings in Experiments. Specifically, we evaluate the efficiency and stability of MF-NAS in three settings: a) the micro search space used in ENAS [32] and DARTS [27], b) the search space of NAS-Bench-201 [16], and c) ImageNet classification. We set the initial learning rate to 0.025 with a cosine scheduler and use SGD with momentum 0.9. The maximal number of search rounds is \(N=4\); the rounds evaluate n = [30, 20, 10, 5] candidates trained for r = [30, 60, 70, 80] epochs, respectively, and \(\epsilon \) = [1.0, 0.6, 0.5, 0.25] decays over the search rounds. Experiments are conducted on RTX 3090 and NVIDIA V100 GPUs.
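Written out as a plain configuration, the multi-fidelity schedule above reads as follows (field names are illustrative; the values are those reported in the text).

```python
schedule = {
    "search_rounds": 4,                          # N
    "candidates_per_round": [30, 20, 10, 5],     # n
    "epochs_per_round": [30, 60, 70, 80],        # r
    "epsilon_per_round": [1.0, 0.6, 0.5, 0.25],  # annealed exploration rate
}
```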

Table 1. Comparison with classification architectures on CIFAR-10. Similar to other NAS algorithms, the search cost of MF-NAS does not include the final evaluation cost.

4.1 Results on Micro Cells Based Search Space in ENAS/DARTS

Fig. 5.
figure 5

Stability of DARTS (and its amended versions) and MF-NAS on CIFAR-10.

We evaluate MF-NAS on the micro cell-based DAG search space used in ENAS [32] and DARTS [27]Footnote 4. The main difference between MF-NAS and other algorithms lies in the method for selecting candidate networks: MF-NAS chooses candidates with a graph-based optimization algorithm, whereas others generate candidates by Bayesian optimization, reinforcement learning, evolution, or gradient-based methods. To demonstrate the efficiency of MF-NAS, we implement it following the controller-child pipeline [32], but without weight sharing. As shown in Table 1, the max-flow based method obtains a \(2.63\%\) error rate with 3.3M parameters in about one day of search, which is more efficient than RL/evolution/random/BO based methods on the DAG search space. This comes from the fact that MF-NAS is coupled with the DAG search space and can thus exploit the graph structure, i.e., the relations among edges and nodes. Also, unlike gradient-based methods, MF-NAS does not suffer from the skip-connection domination issue or the discretization problem [41]. To validate the stability of MF-NAS, we evaluate the performance at different search steps in Fig. 5.

Figure 5 compares the stability of DARTS [27] and its amended versions [41, 50] with MF-NAS. Note that the search epoch is not related to the real search time. For DARTS+PT [41], we obtain our results on DARTS super-nets pre-trained from 25 to 250 epochs. Both vanilla DARTS and DARTS+PT suffer from the skip-connection domination problem [7, 50]; e.g., there are about six skip connections in the normal cell at 250 epochs. The ultimate performance of differentiable methods usually decreases over search epochs because they rely on approximations to solve the bi-level optimization. In contrast, thanks to the maximum-flow solution, MF-NAS steadily achieves improved performance.

Table 2. Top-1 test set accuracy with one GPU day budget.

Based on this search space, we evaluate the generalization ability of MF-NAS on four other datasets with different image resolutions: STL-10 [13], FRUITS [29], FLOWER-102 [31], and Caltech-256 [17]. ENAS [32], DARTS [27], and Hyperband [24] are evaluated as baselines, whose search protocols cover reinforcement learning, differentiable search, and closed-loop random search. Similar to ENAS, we extend Hyperband to support neural architecture search in the same search space as DARTS. The experiments are run under a budget constraint of one GPU day. As shown in Table 2, MF-NAS outperforms the other methods on the last dataset. We attribute this to Caltech-256 being a challenging dataset on which the early performance of an architecture does not align precisely with its ultimate performance, so Hyperband fails. Additionally, the limited GPU memory restricts the batch size for DARTS on large-scale images, which can cause a performance drop [36].

4.2 Results on Search Space of NAS-Bench-201

In NAS-Bench-201, the architectures are built by stacking 17 cells with five candidate operations, and three datasets are evaluated: CIFAR-10 [20], CIFAR-100 [20], and ImageNet-16-120 [16]. We follow the training settings and split protocol of the paper [16]. We run the MF-NAS search three times with different random seeds, and the results reported in Table 3 show that our method achieves competitive results with the state-of-the-art methods. We observe that ENAS can also benefit from the \(\epsilon \)-greedy strategy, indicating that ENAS may suffer from the weight-sharing problem: it is hard to tell whether an improvement under the weight-sharing strategy is attributed to a better-chosen architecture or to well-trained weights. Thus, there is still much room to improve the weight-sharing strategy in the NAS area.

Table 3. Top-1 accuracy (%) on NAS-Bench-201.

4.3 Results on ImageNet Classification

For ImageNet [34], the one-shot NAS approaches cost too much memory to build a super-net. Works [9, 27, 53] directly transfer the architecture searched on CIFAR-10 to ImageNet with little modification. Specifically, the search phase runs on CIFAR-10, while the evaluation phase uses the architecture transferred from the searched cells with deeper and wider scaling, as suggested in DARTS [27]. The learning rate decay policy differs between the last two blocks of Table 4. Table 4 evaluates the performance on ImageNet. In general, MF-NAS obtains competitive results in the transfer setting.

Table 4. Study on full ImageNet. The first block represents the direct search case while the last two blocks show the transfer cases. \(^\dagger \) and \(^\ddagger \) denote the learning rate decay strategies of DARTS and P-DARTS/PC-DARTS, respectively.

4.4 Ablation Study on \(\epsilon \)-greedy

The proposed MF-NAS follows the reinforcement learning based NAS method [1] in incorporating the \(\epsilon \)-greedy strategy into the algorithm pipeline. To isolate the influence of \(\epsilon \)-greedy, we report an ablation study in Table 5.

Table 5. Test accuracy (%) with different \(\epsilon \)-greedy strategies on CIFAR-10. \(\epsilon \equiv 0\) means the \(\epsilon \)-greedy strategy has no effect on the corresponding NAS method; note that \(\epsilon \equiv 1\) degenerates to random search.

The results show that both MF-NAS and ENAS [32] benefit from the \(\epsilon \)-greedy strategy. In fact, the \(\epsilon \)-greedy strategy enhances exploration, which may be more critical for a maximum-flow/RL strategy than for a random search solution. As a consequence, max-flow-based and reinforcement-learning-based NAS methods turn out to be more efficient than random search schemes, considering that pure random search only achieves 96.75 ± 0.2% accuracy.

5 Conclusion and Future Work

We have proposed a multi-graph maximum-flow approach for NAS, called MF-NAS, which casts the problem of optimal architecture search as finding a maximum flow on a DAG. In the search phase, the weights and the capacities of the parallel edges in the multi-graph are updated alternately. Extensive experiments on image classification demonstrate the effectiveness of our proposal. As future work, we will explore larger search spaces with non-linear gather functions in a general maximum-flow network, and test other vision tasks, e.g., segmentation and detection [42].