
1 Introduction

Deep learning has successfully shifted feature engineering from manual to automatic design, meaning that the mapping function from sample to feature can be optimized as part of training. Consequently, seeking effective neural networks has gradually become an important and practical direction, yet architecture design remains a challenging and time-consuming effort. Part of the research focuses on how depth [7, 17, 26], the type of convolution [3, 13], normalization [23, 43] and nonlinearities [24, 27] affect performance. In addition to these endeavors, another line of work attempts to simplify architecture design by stacking blocks/modules and wiring topological connections.

This strategy was arguably first popularized by VGGNet [32], which directly stacks a series of convolution layers in a plain topology. Due to vanishing and exploding gradients, extending such networks to greater depth for better representation is difficult. To better suit optimization by gradient descent, GoogLeNet [36] adopted parallel modules, and Highway Networks [33] utilized gating units to regulate the flow of information, resulting in more elastic topologies. Driven by the significance of depth, ResNet [10] introduced the residual block, consisting of a residual mapping and a shortcut. These topological changes successfully scaled neural networks up to hundreds or even thousands of layers. The residual connectivity was widely adopted and applied in subsequent works, e.g. MobileNet [12, 31] and ShuffleNet [45]. Diverging from the aforementioned relatively sparse topologies, DenseNet [14] wires densely within blocks to fully reuse features. Recent advances in computer vision also explore neural architecture search (NAS) methods [21, 37, 46] to search for convolutional blocks. To trade off efficiency and performance, most of them use hand-designed stacking patterns and constrain the search space to a limited set. These trends reflect the great impact of topology on the optimization of neural networks. To a certain degree, previous principles of modular design reduce the difficulty of building an effective architecture, but how to aggregate and distribute these blocks is still an open question. Echoing this perspective, we wonder: can connectivity in neural networks be learned, and what is a suitable route to do so?

Fig. 1. From the natural perspective to the topological perspective for networks with residual connectivity. Two networks, with intervals of 1 and 2, are shown. The red node denotes the input \(\mathbf {x}\), and the green node denotes the output feature \(\mathbf {y}\). Red arrows give an example of this mapping for a node with an in-degree of 3.

To answer these questions, we propose a topological perspective on neural networks, representing a network as a directed acyclic graph as shown in Fig. 1. Under this perspective, transformations (e.g. convolution, normalization and activation) are mapped to nodes, and connections between layers are projected to edges, which indicate the flow of information. We first unfold the residual connections into a complete graph. This gives another way to explain the effectiveness of the residual topology, and inspires us to define the search space as a complete graph. Instead of choosing a predefined rule-based topology, we assign learnable parameters to the edges, which determine the importance of the corresponding connections. To promote generalization and concentrate on critical connections, we attach an auxiliary sparsity constraint to the weights of the edges. In particular, we propose two methods for updating the weights of the topology: a uniform one that regularizes all edges equally, and an adaptive one whose strength is logarithmically related to the in-degree of a node. The connectivity is then learned simultaneously with the weights of the network by optimizing the task loss with a modified version of gradient descent.

We evaluate our optimization method on classical networks such as ResNets and MobileNet, demonstrating its compatibility with existing networks and its adaptability to larger search spaces. To exhibit the benefits of connectivity learning, we construct a larger search space in which different topologies can be compared strictly. We also evaluate our method on various tasks and datasets, namely image classification on CIFAR-100 and ImageNet and object detection on COCO. Our contributions are as follows:

  • The proposed topological perspective can represent most existing neural networks. For the residual topology, we reveal for the first time the properties of its dense connections, which can be used to define the search space.

  • The proposed optimization method is compatible with existing networks. Without introducing much additional computational burden, we achieve a \(2.23\%\) improvement using ResNet-110 on CIFAR-100, and \(0.75\%\) using a deepened MobileNet on ImageNet.

  • We design an architecture called TopoNet for larger search spaces and strict comparisons. Quantitative results prove that the learned connectivity is superior to random, residual and complete ones, and surpasses ResNet at a similar computation cost by \(2.10\%\) on ImageNet.

  • The method generalizes well. The optimized topology with learned connectivity surpasses the best rule-based one by \(0.95\%\) AP on COCO. Compared with an equal-sized ResNet backbone, the improvement is \(5.27\%\). We also explore the properties of the optimized topology for future work.

2 Related Work

We briefly review related work on neural network structure design and on relevant optimization methods.

Neural network design has been widely studied in the literature. From shallow to deep networks, the shortcut connection plays an important role. Before ResNet, an early practice [41] added a linear layer connecting the input to the output to train multi-layer perceptrons. Besides, the “Inception” layer proposed in [36] is composed of a shortcut branch and a few deeper branches. Beyond large networks, shortcuts have also proved effective in small networks, e.g. MobileNet [31], ShuffleNet [45] and MnasNet [37]. The existence of shortcuts eases vanishing/exploding gradients [10, 33]. In this paper, we explain from a topological perspective that shortcuts offer dense connections and benefit optimization. At the macro level, many networks with dense connections also exist. DenseNet [14] connects all preceding layers and passes the feature maps on to all subsequent layers within a block. HRNet [34] benefits from dense high-to-low connections for fine representations. Densely connected networks have also been shown to help localization [39]. Differently, we optimize the desired network from the complete graph in a differentiable way. This differs from MaskConnect [1], which is constrained to a fixed discrete in-degree K and uses binary connections. It also provides an extension to [44], where random graphs produced by different generators are employed to form a network.

In terms of the learning process, our method is differentiable, as is DARTS [21]. In contrast to DARTS, we do not adopt an alternating optimization strategy for weights and architecture. Joint training can replace the step of transferring from one task to another and yields task-related topology. Different from sample-based optimization methods [29], the connectivity is learned simultaneously with the weights of the network using our modified version of gradient descent. [2, 8] also explored this type of approach and utilized weight sharing across models to amortize the cost of training. Searching the full space has been evaluated in object detection by NAS-FPN [5], in which the feature pyramid is searched over all cross-scale connections. For semantic segmentation, Auto-DeepLab [20] forms a hierarchical architecture to enlarge the search space. Sparsity constraints can also be observed in other applications, e.g. path selection for a multi-branch network [15] and pruning unimportant channels for fast inference [9].

Fig. 2. Details of node operations and the adjacency matrix. For each node, features generated by preorder nodes are aggregated through the weights of edges. A transformation unit, consisting of convolutional layers, batch normalization and an activation function, then transforms the features. Next, the features are distributed to the postorder nodes where connections exist. For a stage, the weights of edges can be represented in an adjacency matrix, in which rows denote the weights of input edges and columns the weights of output edges.

3 Methodology

3.1 Topological Perspective of Neural Networks

We represent a neural network as a directed acyclic graph (DAG). Specifically, we map both combining operations (e.g., addition) and transformations (e.g., convolution, normalization and activation) to nodes, and connections between layers are represented as edges, which determine the flow of information. We thereby obtain a new representation of the architecture \(\mathcal {G}=(\mathcal {N}, \mathcal {E})\), where \(\mathcal {N}\) is the set of nodes and \(\mathcal {E}\) is the set of edges.

In the graph, each node \(n_i \in \mathcal {N}\) performs a transformation operation \(o_i\) parametrized by \(\mathbf {w}_i\), where i denotes the topological ordering of the node. An edge \(e_{ji} = (j,i,\alpha _{ji}) \in \mathcal {E}\) represents the flow of features from node j to node i, and the importance of the connection is determined by the weight \(\alpha _{ji}\). During forward computation, each node aggregates inputs from the preorder nodes to which it is connected, performs a feature transformation to obtain an output tensor \(\mathbf {x}_i\), and sends \(\mathbf {x}_i\) to its postorder nodes through the output edges, as shown on the left of Fig. 2. This can be formulated as:

$$\begin{aligned} \mathbf {x}_i=o_i(\mathbf {x}_i^\prime ; \mathbf {w}_i), \ \text{ where } \ \mathbf {x}_i^\prime = \textstyle \sum \limits _{(j<i) \wedge (e_{ji}\in \mathcal {E})} \ \alpha _{ji} \cdot \mathbf {x}_j. \end{aligned}$$
(1)

In each graph, the first node in the topological ordering is the input node, which only distributes features. The last node is the output node, which only generates the final output of the graph by gathering its preorder inputs. We use an adjacency matrix as the memory space to store the weights of edges. As shown on the right of Fig. 2, each row holds the weights of a node's input edges and each column the weights of its output edges. For pairs of nodes with no edge between them, the corresponding \(\alpha \) is 0. The number of input edges (with \(\alpha \ne 0\)) of a node is called its in-degree, and the number of output edges (with \(\alpha \ne 0\)) its out-degree.
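To make Eq. (1) and the adjacency-matrix bookkeeping concrete, here is a minimal PyTorch sketch of one stage, assuming a single learnable matrix whose entry (j, i) weights the edge from node j to node i; the class name StageDAG and the ReLU-conv-BN unit are illustrative choices rather than the paper's released code.

```python
import torch
import torch.nn as nn

class StageDAG(nn.Module):
    """One stage represented as a complete DAG with learnable edge weights."""

    def __init__(self, num_nodes: int, channels: int):
        super().__init__()
        self.num_nodes = num_nodes
        # Adjacency matrix of edge weights: alpha[j, i] weights edge j -> i (only j < i is used).
        self.alpha = nn.Parameter(torch.ones(num_nodes, num_nodes))
        # One transformation unit o_i per node (the input/output nodes never use theirs).
        self.units = nn.ModuleList([
            nn.Sequential(
                nn.ReLU(inplace=False),
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
            )
            for _ in range(num_nodes)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]  # node 0: input node, only distributes features
        for i in range(1, self.num_nodes):
            # Eq. (1): aggregate preorder features weighted by alpha_{ji}.
            agg = sum(self.alpha[j, i] * feats[j] for j in range(i))
            if i < self.num_nodes - 1:
                feats.append(self.units[i](agg))   # x_i = o_i(x_i'; w_i)
            else:
                feats.append(agg)                  # output node: gathering only
        return feats[-1]
```

Composing several such stages in series (e.g. with nn.Sequential) then realizes the stage-wise mapping from sample to feature described next.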

For a network with k stages, k DAGs are initialized and connected in series. Each graph is linked to its preceding or succeeding stage through its input or output node. We write the weights of nodes as \(\mathbf {w}_i^k\) and the weights of edges as \(\alpha _{ji}^k\). For the k-th stage, \(\mathcal {T}^k(\cdot )\) denotes the mapping function established by \(\mathcal {G}^k\) with parameters \(\mathbf {W}^k\) and \(\varvec{\alpha }^k\), where \(\mathbf {W}^k\) is the set of \(\{\mathbf {w}_{i}^{k}\}\) and \(\varvec{\alpha }^k\) the set of \(\{\alpha _{ji}^{k}\}\). Given an input \(\mathbf {x}\) and the corresponding label \(\mathbf {y}\), the mapping function from the sample to the feature representation can be written as:

$$\begin{aligned} \mathcal {F}(\mathbf {x}) = \mathcal {T}^k(\cdots \mathcal {T}^2 (\mathcal {T}^1(\mathbf {x}; \varvec{\alpha }^1, \mathbf {W}^1); \varvec{\alpha }^2, \mathbf {W}^2) \cdots ; \varvec{\alpha }^k, \mathbf {W}^k). \end{aligned}$$
(2)

3.2 Search Space

With the topological perspective defined, most previous networks can be reformulated from the natural perspective. For definiteness and without loss of generality, we select the widely used residual connection for analysis [10, 31, 45]. A block with a residual connection takes \(x+\varphi (x)\) as its basic component, in which x represents the identity shortcut and \(\varphi (x)\) the residual mapping. Normally, the residual component is composed of several repeated weighted layers; we call the number of repeats the interval, denoted l. Figure 1 presents two residual architectures with intervals of 1 and 2, respectively. Using Eq. (1), we map the architecture from the natural perspective to the topological perspective; the red lines give an example of this mapping. From the natural perspective, a layer acquires information through skip connections; in the topological perspective, a node obtains information through the corresponding edges. It should be pointed out that the two perspectives are completely equivalent in their results. It can also be seen that the residual connections are considerably denser than they appear in the original view, and form multiple feed-forward paths rather than a single deep network. Our topological view thus explains the effectiveness of residual connectivity from a new angle, different from [40].

If the interval degrades to 1, as shown on the right of Fig. 2, the topology evolves into a complete graph. Structurally, all nodes are directly connected to the input and output, giving each node direct access to the gradients and the original input. Different from stacking blocks with predefined connectivity, the complete graph provides all possible connections and is therefore suitable as the search space. For a complete graph with N nodes, the search space contains \(2^{N(N-1)/2}\) possible topological structures. For a network with k stages, the total search space is \(\prod \nolimits _{k} 2^{N^k(N^k-1)/2}\), which is much larger than that of cell-based or block-based approaches [8, 12, 37]. By assigning learnable parameters that reflect the magnitude of connections to the edges, the complete graph becomes a weighted graph. Within this search space, the connectivity can be optimized by learning the continuous weights of edges.
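As a quick sanity check of this count (assuming each of the \(N^k(N^k-1)/2\) possible edges of a stage's complete graph is independently kept or dropped), the node configuration used later in Sect. 4.2 yields a space on the order of \(10^{209}\):

```python
from math import prod

nodes_per_stage = [14, 20, 26, 14]          # TopoNet configuration from Sect. 4.2
space = prod(2 ** (n * (n - 1) // 2) for n in nodes_per_stage)
print(f"{space:.1e}")                        # ~6.6e+209, matching the 6x10^209 quoted later
```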

3.3 Optimization of Topological Connectivity

We put forward a differentiable approach to optimize the topological connectivity by learning a set of continuous edge weights \(\varvec{\alpha }\). They are learned simultaneously with all other weights in the network via the loss of the concurrent task, denoted \(\mathcal {L}_{t}(\cdot )\). Different from [1], we do not binarize the weights \(\varvec{\alpha }\); this allows us to assign discriminating weights to different feature inputs. Different from the selection of node type in [21], we do not select the maximum input edge with an \(\arg \max \) operation; instead, the continuous weights guarantee consistency between training and testing. The optimization objective is:

$$\begin{aligned} \min _{\mathbf {W}, \varvec{\alpha }} \mathcal {L}_{t} (\mathcal {F}(\mathbf {x}; \mathbf {W}, \varvec{\alpha }), \mathbf {y}) \end{aligned}$$
(3)

Let \(\frac{\partial \mathcal {L}_{t}}{\partial \mathbf {w}_i}\) be the gradient that the network back-propagates to \(\mathbf {w}_i\), and \(\frac{\partial \mathcal {L}_{t}}{\partial \mathbf {x}_i}\) the gradient with respect to \(\mathbf {x}_i\). Then the gradient updates to \(\mathbf {w}_i\) and \(\alpha _{ji}\) take the form:

$$\begin{aligned} \mathbf {w}_i \leftarrow \mathbf {w}_i - \eta \frac{\partial \mathcal {L}_{t}}{\partial \mathbf {w}_i} \end{aligned}$$
(4)
$$\begin{aligned} \alpha _{ji} \leftarrow \alpha _{ji} - \eta \sum \frac{\partial \mathcal {L}_{t}}{\partial \mathbf {x}_i} \odot \frac{\partial o_i}{\partial \mathbf {x}_i^\prime } \odot \mathbf {x}_j, \end{aligned}$$
(5)

where \(\eta \) is the learning rate and \(\odot \) denotes the entrywise product.

Since the features generated by different layers exhibit different semantic representations, they contribute differently to subsequent layers, resulting in diverse importance of connections. This is much like the mammalian brain [28], where synapses are created in the first few months of a child's development and are then gradually re-weighted through postnatal experience, growing into a typical adult brain with relatively sparse connections.

Algorithm 1. Optimization of topological connectivity.

To facilitate this process appropriately, we propose to attach a sparsity constraint as a regularizer on the distribution of edge weights. A similar idea has been verified for hashing representations [42], where sparsity brings effective gains by minimizing hash collisions. We choose L1 regularization, denoted \(\mathcal {L}_{1}(\cdot )\), to penalize non-zero edge parameters, pushing more parameters toward zero. This sparsity constraint promotes attention to the more critical connections. The loss function of our method can then be reformulated as:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{t} + \lambda \mathcal {L}_{1} = \mathcal {L}_{t}(\mathcal {F}(\mathbf {x}; \mathbf {W}, \varvec{\alpha }), \mathbf {y}) + \lambda \cdot \Vert \varvec{\alpha }\Vert _1, \end{aligned}$$
(6)

and \(\lambda \) is a hyper-parameter that balances the sparsity level. Due to the properties of a complete graph, we propose two ways to update \(\alpha _{ji}\). The first is uniform sparsity, which attaches the constraint to all edge weights uniformly. Letting \(\frac{\partial \mathcal {L}_{1}}{\partial \mathbf {x}_i}\) be the gradient with respect to \(\mathbf {x}_i\), we rewrite Eq. (5) as:

$$\begin{aligned} \alpha _{ji} \leftarrow \alpha _{ji} - \eta \sum (\frac{\partial \mathcal {L}_{t}}{\partial \mathbf {x}_i} + \lambda \frac{\partial \mathcal {L}_{1}}{\partial \mathbf {x}_i}) \odot \frac{\partial o_i}{\partial \mathbf {x}_i^\prime } \odot \mathbf {x}_j. \end{aligned}$$
(7)

The second is adaptive sparsity, which is logarithmically related to the in-degree \(\delta _i\) of node \(n_i\). It applies a larger constraint to densely connected inputs and a smaller one to sparsely connected inputs. For nodes with fewer input edges, this ensures a smooth flow of information and avoids blocking it. In this case, \(\alpha _{ji}\) is updated by:

$$\begin{aligned} \alpha _{ji} \leftarrow \alpha _{ji} - \eta \sum (\frac{\partial \mathcal {L}_{t}}{\partial \mathbf {x}_i} + \lambda \log (\delta _i) \frac{\partial \mathcal {L}_{1}}{\partial \mathbf {x}_i}) \odot \frac{\partial o_i}{\partial \mathbf {x}_i^\prime } \odot \mathbf {x}_j. \end{aligned}$$
(8)

These two types are further discussed in the experiments section. Algorithm 1 summarizes the optimization procedure in detail.
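As a hedged sketch in the spirit of Algorithm 1, the snippet below adds the L1 term of Eq. (6) to the task loss and, for the adaptive variant, scales each node's penalty by the logarithm of its in-degree as in Eq. (8). Here `model` is assumed to be a classification network exposing a single `alpha` adjacency matrix, and the in-degree of node i in a complete graph is taken to be i.

```python
import math
import torch
import torch.nn.functional as F

def sparsity_penalty(alpha: torch.Tensor, adaptive: bool = True) -> torch.Tensor:
    """L1 penalty on edge weights; alpha[j, i] weights edge j -> i (j < i)."""
    n = alpha.size(0)
    if not adaptive:
        mask = torch.triu(torch.ones_like(alpha), diagonal=1)
        return (alpha * mask).abs().sum()                 # uniform constraint, cf. Eq. (7)
    penalty = alpha.new_zeros(())
    for i in range(2, n):                                 # log(1) = 0, so in-degree 1 is unpenalized
        penalty = penalty + math.log(i) * alpha[:i, i].abs().sum()
    return penalty                                        # adaptive constraint, cf. Eq. (8)

def train_step(model, x, y, optimizer, lambda_sparse=1e-4, adaptive=True):
    logits = model(x)
    loss = F.cross_entropy(logits, y)                     # task loss L_t
    loss = loss + lambda_sparse * sparsity_penalty(model.alpha, adaptive)
    optimizer.zero_grad()
    loss.backward()                                       # W and alpha receive gradients jointly
    optimizer.step()
    return loss.item()
```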

4 Experiments and Analysis

4.1 Connectivity Optimization for Classical Networks

Our optimization method is compatible with classical networks. To investigate its applicability, we select ResNet-CIFAR [10], built from \(3\times 3\) convolutions, and MobileNetV2-1.0 [31], built from inverted bottlenecks. For the optimization of ResNets, we rewire the interval of 2 in the BasicBlock to 1 to form the complete graph. For MobileNetV2-1.0, each node involves a residual connection and can naturally be viewed as a complete graph. Since MobileNet has fewer layers in each stage, we also increase the depth by adding nodes in each stage, resulting in larger search spaces; expanding networks in this way is a common practice [38]. By assigning learnable parameters to their edges, the topologies can be optimized using Algorithm 1. It should be noted that the additional computation and parameters introduced by the edges are negligible compared with the convolutions.

Table 1. Optimization Top-1 Accuracy of ResNets on CIFAR-100.

First, we evaluate the optimization of connectivity with ResNets on CIFAR-100 [16]. The experiments are trained on 2 GPUs with a batch size of 128 and a weight decay of 5e-4. We follow the hyperparameter settings of [4], initializing \(\eta = 0.1\) and dividing it by 5 at the 60th, 120th and 160th epochs. The training and test size is \(32\times 32\). We report classification accuracy on the validation set over 5 repeated runs. The results are shown in Table 1. Under similar Params and FLOPs, the optimization brings a \(2.22\%\) improvement in Top-1 accuracy for ResNet-110, which reflects that larger search spaces lead to larger improvements.

Next, we extend our method to the ImageNet dataset [30] using MobileNets. We train MobileNetV2 on 16 GPUs for 200 epochs with a batch size of 1024. The initial learning rate is 0.4 with cosine learning rate decay [22]. Following [31], we use a weight decay of 4e-5 and dropout [11] of 0.2. Nesterov momentum of 0.9 without dampening is also used. The training and test size is \(224 \times 224\). The network with twice the number of layers is denoted 2N. Under the mobile setting, we achieve \(76.4\%\) Top-1 accuracy. In the larger optimization space of 6N, the optimization brings a \(0.75\%\) improvement. This further demonstrates the benefits of topology optimization for different networks.
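For reference, the ImageNet schedule described above corresponds roughly to the following PyTorch setup; `model` stands for the deepened MobileNetV2 with learnable edge weights, and the data pipeline and distributed-training details are omitted.

```python
import torch

# Placeholder: `model` is the deepened MobileNetV2 with learnable edge weights.
optimizer = torch.optim.SGD(model.parameters(), lr=0.4, momentum=0.9,
                            weight_decay=4e-5, nesterov=True)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

for epoch in range(200):
    # ... one epoch over ImageNet with batch size 1024 ...
    scheduler.step()
```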

Table 2. Optimization Top-1 Accuracy of Scaled MobileNets on ImageNet

4.2 Expanding to Larger Search Spaces by TopoNet

Because classical networks offer only a restricted set of optional topologies, the topology can only be optimized in small search spaces, which limits the representational effect of the topology and may affect the search for the optimal one. In this section, we propose a larger search space and fully illustrate the improvement brought by topology optimization. The properties of edges and nodes in the optimized topology are also analyzed.

Table 3. Architectures of TopoNets for ImageNet.

We design a series of architectures named TopoNets that can flexibly adjust the search space and the types of topology and node. As shown in Table 3, a TopoNet consists of four stages with \(\{N_1, N_2, N_3, N_4\}\) nodes. The topology in each stage is defined by a graph whose type can be chosen from {complete, random, residual}. The complete graph is used for topology optimization; for a stricter comparison, we also take the other two types as baselines. The residual one is a well-designed topology. In the random one, an edge between two nodes is added with probability p, independently of all other nodes and edges; the higher the probability, the denser the graph. We follow two simple design rules from [10]: (i) within each stage, the nodes have the same number of filters C; and (ii) if the feature map size is halved, the number of filters is doubled. The change in the number of filters is performed by the first computational node in each graph. For the head of the network, we use a single convolutional layer for simplicity. The network ends with a classifier composed of global average pooling (GAP), a 1000-dimensional fully-connected layer and a softmax function.

Setup for TopoNet.

To demonstrate the optimization capability in the larger search space and to compare with existing networks, we design a configuration with computation cost similar to ResNet-50. We select the separable depthwise convolution, consisting of a \(3\times 3\) depthwise convolution followed by a \(1\times 1\) pointwise convolution, and build a triplet unit ReLU-conv-BN as the node. The number of nodes in each stage is {14, 20, 26, 14}. In this setting, the number of possible discrete topologies is \(6\times 10^{209}\). The weights of \(\varvec{\alpha }\) are initialized to 1, and C is set to 64, resulting in \(23.23\,M\) Params and \(3.95\,G\) FLOPs (cf. ResNet-50 with \(25.57\,M\) Params and \(4.08\,G\) FLOPs).
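A plausible rendering of this node, under the assumption that the ReLU-conv-BN triplet wraps the \(3\times 3\) depthwise plus \(1\times 1\) pointwise separable convolution; the class name TopoNode is ours, not taken from released code.

```python
import torch.nn as nn

class TopoNode(nn.Module):
    """Triplet unit ReLU-conv-BN built on a separable depthwise convolution."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.unit = nn.Sequential(
            nn.ReLU(inplace=False),
            nn.Conv2d(channels, channels, 3, padding=1,
                      groups=channels, bias=False),        # 3x3 depthwise
            nn.Conv2d(channels, channels, 1, bias=False),   # 1x1 pointwise
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return self.unit(x)
```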

Strict Comparisons.

To demonstrate the effectiveness of our optimization method, we select graphs with random and residual connectivity as baselines. For comparison, we also reproduce the Erdős–Rényi (ER), Barabási–Albert (BA) and Watts–Strogatz (WS) graphs of [44] using NetworkX. Since the original paper did not release code, we compare against the best configurations of their method. We use these graphs to build networks under the same TopoNet setup. Both types of sparsity constraint are also evaluated. For a fair comparison, all experiments are conducted on ImageNet and trained for 100 epochs. We use a weight decay of 1e-4 and a Nesterov momentum of 0.9 without dampening. Dropout is not used. Label smoothing regularization [35] with a coefficient of 0.1 is used.
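The random baselines could be generated along the following lines with NetworkX, assuming each undirected graph is turned into a DAG by orienting every edge from the lower to the higher node index; the generator parameters shown are illustrative rather than the exact configurations of [44].

```python
import networkx as nx

def random_dag_edges(n: int, kind: str = "ws", p: float = 0.75, k: int = 4, m: int = 5):
    """Return DAG edges (j, i) with j < i for one stage of n nodes."""
    if kind == "er":
        g = nx.erdos_renyi_graph(n, p, seed=0)
    elif kind == "ba":
        g = nx.barabasi_albert_graph(n, m, seed=0)
    else:  # "ws"
        g = nx.connected_watts_strogatz_graph(n, k, p, seed=0)
    # Orient edges from the smaller to the larger index to obtain an acyclic graph.
    return sorted((min(u, v), max(u, v)) for u, v in g.edges())
```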

Table 4. Comparison with Different Topologies on ImageNet.

The validation results are shown in Table 4. Several conclusions can be drawn. (i) The topological connectivity of a network largely affects its representation performance. (ii) The performance is related to the density of connections, as seen across different values of p and l. (iii) For the complete graphs, direct optimization of \(\varvec{\alpha }\) yields a \(0.98\%\) improvement in Top-1. (iv) With sparsity constraints, performance improves further: the complete graph with the adaptive sparsity constraint attains the best Top-1 of \(78.60\%\), which demonstrates the benefit of sparseness for connectivity. (v) The connectivity can be optimized within neural networks and is superior to rule-based designs such as random, residual, BA and WS.

Fig. 3. The effect of the sparsity constraint on the distribution of \(\varvec{\alpha }\). The histogram on the left indicates that sparsity drives most of the weights toward zero. The adjacency matrices on the right show the difference between the uniform and adaptive constraints; rows correspond to the input edges of a particular node and columns to the output ones. Colors indicate the weights of edges.

To understand the effect of sparsity constraints on dense topological connections intuitively, we show the distributions of the learned \(\varvec{\alpha }\) in Fig. 3. Sparsity constraints push more parameters toward zero, focusing on critical connections; as the constraint strengthens, more connections disappear. Excessive constraints damage the feature representation, so we set the balancing weight \(\lambda \) to e-4 in all experiments. The result is robust in the range from e-5 to e-4, with similar effects. On the right of Fig. 3, we show the optimized results under the two types of constraint. The adaptive one penalizes denser connections more strongly and keeps the relatively sparse but critical connections, resulting in better performance.

4.3 Transferability on Different Tasks

To evaluate the generalization and transferability of both the optimization method and TopoNets, we conduct experiments on the COCO object detection task [19]. We adopt FPN [18] as the detection framework; the backbone is replaced with the corresponding pretrained model from Table 4 and fine-tuned on the COCO train2017 set. We test on the COCO val2017 set. Fine-tuning follows the \(1\times \) setting of the publicly available Detectron [6], and the training configurations of the different models are identical. Test performance is reported in Table 5. Compared with the comparable ResNet-50, TopoNets obtain significant gains in AP at lower computation cost. Compared with the elegant residual topology, our optimization method achieves an increase of \(0.95\%\). These results indicate the effectiveness of the proposed network and the optimization method.

Table 5. Transferability Results on COCO object detection.
Fig. 4. Impact of node (left) and edge (right) removal for the optimized topology.

4.4 Exploring Topological Properties by Graph Damage

We further explore the properties of the optimized topology. First, we remove individual nodes according to their topological ordering in the graph and evaluate the network without extra training. One might expect the network to break, since dropping a layer can drastically change the input distribution of all subsequent layers. Surprisingly, most removals do not lead to a noticeable change, as shown in Fig. 4 (left). This can be explained by the number of available paths being reduced only from \((n-1)!\) to \((n-2)!\), leaving sufficient paths. It suggests that the nodes in the complete graph do not strongly rely on each other, even though they are trained jointly. Direct links to the input/output nodes let each node contribute to the final feature representation and benefit the optimization process. Another observation is that nodes early in the topological ordering contribute more. This is because, for a node with ordering i, the generated \(\mathbf {x}_i\) can only be received by nodes j with \(j>i\), so features generated by earlier nodes participate in aggregation as inputs further downstream. This makes the earlier nodes contribute more, which could be used to reallocate computational resources in future work.

Second, we consider the impact of edge removal. All edges with \(\alpha \) below a threshold are pruned from the graph, retaining only the important connections. Accuracies before and after retraining are given in Fig. 4 (right). Without retraining, accuracy decreases as the degree of pruning deepens. Interestingly, we get a “free lunch”: removing up to \(40\%\) of the edges costs little accuracy. If we fix the \(\alpha \) of the remaining edges and retrain the weights, accuracy is maintained even with \(80\%\) of the edges removed. This shows that the optimization process has indeed found the important connections. After pruning edges, nodes with zero in-degree or zero out-degree may be safely removed, which can be used to reduce parameters and accelerate inference in practical applications.
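A hedged sketch of this pruning probe: edges whose \(|\alpha |\) falls below a threshold are dropped, after which nodes left with zero in-degree or zero out-degree can be discarded. Here `alpha` is the learned adjacency matrix of one stage, and the threshold value is illustrative.

```python
import torch

def prune_edges(alpha: torch.Tensor, threshold: float = 0.05):
    """Return the surviving-edge mask and the indices of removable nodes."""
    n = alpha.size(0)
    valid = torch.triu(torch.ones_like(alpha), diagonal=1).bool()
    keep = valid & (alpha.abs() >= threshold)        # surviving edges j -> i
    in_deg = keep.sum(dim=0)                         # remaining inputs per node
    out_deg = keep.sum(dim=1)                        # remaining outputs per node
    removable = [i for i in range(1, n - 1)          # never remove input/output nodes
                 if in_deg[i] == 0 or out_deg[i] == 0]
    return keep, removable
```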

4.5 Visualization of the Optimization Process

We visualize the optimization process in Fig. 5. During the initial phase, there are strong connections between all nodes. As optimization progresses, the connections become sparse, leaving only the critical ones. We sample topologies with different connectivity during the process and retrain them from scratch with \(\varvec{\alpha }\) frozen. This allows us to compare how the capability of the topology changes during optimization. Validation accuracies are given on the right of the figure. The representation ability of the connectivity increases over the course of training, not just the weights of the network.

Fig. 5. Changes in the connectivity and the corresponding accuracies after retraining. The representation capability of the topological connectivity improves along with the optimization process, demonstrating the effectiveness of the method.

5 Conclusion and Future Work

In this work, we proposed a feasible way to learn the topological connectivity of neural networks. Motivated by our topological perspective, the optimization space is defined as a complete graph. By assigning learnable continuous weights that reflect the importance of connections, the optimization becomes differentiable with little extra cost. The sparsity constraint further improves generalization and performance. The method is compatible with existing networks, and the optimized connectivity is superior to rule-based designs. Experiments on different tasks demonstrate its effectiveness and transferability. Moreover, the observed properties of the topology can inform future work and practical applications. Our work has wide applicability and is complementary to existing neural architecture search methods. We will consider verifying NAS-inspired networks in future work.