1 Introduction

Deep neural networks have emerged as one of the most prominent models for problems that require the learning of complex functions and that involve large amounts of training data. While deep learning has recently enabled dramatic performance improvements in many application domains, the design of deep architectures is still a challenging and time-consuming endeavor. The difficulty lies in the many architecture choices that impact—often significantly—the performance of the system. In the specific domain of image categorization, which is the focus of this paper, significant research effort has been invested in the empirical study of how depth, filter sizes, number of feature maps, and choice of nonlinearities affect performance [1,2,3,4,5,6]. Recently, several authors have proposed to simplify the architecture design by defining convolutional neural networks (CNNs) in terms of composition of topologically identical or similar building blocks or modules. This strategy was arguably first popularized by the VGG nets [7] which were built by stacking a series of convolutional layers having identical filter size (\(3 \times 3\)). Other examples are ResNets [8] which are constructed by stacking residual blocks of fixed topology, ResNeXt models [9] which use multi-branch residual block modules, DenseNets [10] which use dense blocks as building blocks, or Multi-Fiber networks [11] which use parallel branches (“fibers”) connected by routers (“transistors”).

While the principle of modularized design has greatly simplified the challenge of building effective architectures for image analysis, the choice of how to combine and aggregate the computations of these building blocks still rests on the shoulders of the human designer. To avoid a combinatorial explosion of options, prior work has relied on simple, uniform rules of aggregation and composition. For example, in ResNets and DenseNets each building block is connected only to the preceding one, via identity mapping, convolution or pooling. ResNeXt models [9] use a set of simplifying assumptions: the branching factor C (also referred to as cardinality) is fixed to the same constant in all layers of the network, all branches of a module are fed the same input, and the outputs of parallel branches are aggregated by a simple additive operation that provides the input to the next module. While these simple rules of connectivity render network design more manageable, they are unlikely to yield the optimal connectivity for the given problem.

In this paper we remove these predefined choices and propose an algorithm that learns to combine and aggregate building blocks of a neural network by directly optimizing connectivity of modules with respect to the given task. In this new regime, the network connectivity naturally arises as a result of training rather than being hand-defined by the human designer. While in principle this involves a search over an exponential number of connectivity configurations, our method can efficiently optimize the training loss with respect to connectivity using a variant of backpropagation. This is achieved by means of connectivity masks, i.e., learned binary parameters that act as “switches” determining the final connectivity in our network. The masks are learned together with the convolutional weights of the network, as part of a joint optimization with respect to the given loss function for the problem.

We evaluate our method on the problem of multi-class image classification using two popular modular architectures: ResNet and ResNeXt. We demonstrate that models with our learned connectivity consistently outperform the networks based on predefined rules of connectivity for the same budget of residual blocks (and parameters). An interesting byproduct of our approach is that, in certain settings, it can automatically identify modules that are superfluous, i.e., unnecessary or detrimental for the end objective. At the end of the optimization, these unused modules can be pruned away without impacting the learned hypothesis while reducing substantially the runtime and the number of parameters to store.

By recasting the training procedure as a joint optimization over weights and connectivity, our method effectively searches over a larger space of solutions. This yields networks achieving higher accuracy than those constrained to use predefined connectivities. The average training time overhead is moderate, ranging between \(13\%\) (for ResNet models) and \(39\%\) (for ResNeXt models) compared to learning using fixed connectivity which, however, yields lower accuracy. Finally we point out that, although our experiments are carried out using ResNet and ResNeXt models, our approach is general and applicable without major modifications to other forms of network architectures and other tasks beyond image categorization. In principle our method can also be used to learn connectivity among layers of a traditional (i.e., non-modular) neural network or CNN. However, modern networks typically include a very large number of layers (hundreds or even thousands [12]), which would make our approach very costly. Learning connectivity among modules is more manageable, as each module encapsulates many layers and thus the total number of modules remains small even for deep networks.

2 Related Work

Despite their wide adoption, deep networks often require laborious model search in order to yield good results. As a result, significant research effort has been devoted to the design of algorithms for automatic model selection. However, most of this prior work falls within the genre of hyper-parameter optimization [13,14,15] rather than architecture or connectivity learning. Evolutionary search has been proposed as an interesting framework to learn both the structure as well as the connections in a neural network [16,17,18,19,20,21,22,23,24]. Architecture search has also been recently formulated as a reinforcement learning problem with impressive results [25]. Several authors have proposed learning connectivity by pruning unimportant weights from the network [26,27,28,29,30]. However, these prior methods operate in stages where initially a network with full connectivity is learned and then connections are greedily removed according to an importance criterion. Compared to all these prior approaches, our work provides the advantage of learning the connectivity by direct global optimization of the loss function of the problem at hand rather than by greedy optimization of an auxiliary proxy criterion or by costly evolutionary search. Our technical approach shares similarities with the “Shake-Shake” regularization [31]. This procedure was demonstrated on two-branch ResNeXt models and consists of randomly scaling tensors produced by parallel branches during training, while at test time the network uses a uniform weighting of tensors. Conversely, our algorithm learns an optimal binary scaling of the parallel tensors with respect to the training objective and uses the resulting network with sparse connectivity at test time. While our algorithm is limited to optimizing the connectivity structure within a predefined architecture, Adams et al. [32] proposed a nonparametric Bayesian approach that searches over an infinite network using MCMC. Our approach can be viewed as a middle ground between two extremes: using hand-defined networks versus learning/searching the full architecture from scratch. The advantage is that our connectivity learning can be done without adding a significant training time overhead (only 13–39% depending on the architecture) compared to using fixed connectivity. The disadvantage is that the space of models considered by our approach is a lot more constrained than in the case of general architecture search. Saxena and Verbeek [33] introduced convolutional neural fabrics, which are learnable 3D trellises that locally connect response maps at different layers of a CNN. Similarly to our work, they enable optimization over an exponentially large family of connectivities, albeit different from those considered here. Finally, our approach is also related to conditional computation methods [34,35,36,37,38,39,40,41,42,43], which learn to drop out blocks of units. However, unlike these techniques, our algorithm learns a fixed, sparse connectivity that does not change with the input and thus it keeps the runtime cost and the number of used parameters constant.

3 Technical Approach

3.1 Modular Architecture

We begin by defining the modular architecture that will be used by our framework. In order to present our method in its full generality, we will describe it in the context of a general modular architecture, which we will then instantiate in the form of the two models used in our experiments (ResNet and ResNeXt).

We assume that the general modular architecture consists of a stack of L modules. (When using ResNet the modules will be residual blocks, while for ResNeXt each module will consist of multiple parallel branches.) We denote with \(\mathbf{x}_j\) the input to the j-th module for \(j=1,\ldots ,L\). The input of each module is an activation tensor computed from one of the previous modules. We assume that each module implements a function \(\mathcal {G}(.)\) parameterized by learnable weights \(\theta _j\). The weights may for example represent the coefficients of convolutional filters. Thus, the output \(\mathbf{y}_j\) computed by the j-th module is given by \(\mathbf{y}_j = \mathcal {G}(\mathbf{x}_j; \theta _j)\). In prior modular architectures, such as ResNet, ResNeXt and DenseNet, the connectivity between modules is hand-defined a priori according to a very simple rule: the input of a module is the output of the preceding module. In other words, \(\mathbf{x}_j \leftarrow \mathbf{y}_{j-1}\). While this makes network design straightforward, it greatly limits the topology of architectures considered for the given task. In the next subsection we describe how to parameterize the architecture to remove these constraints and to enable connectivity learning in modular networks.
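To make this baseline concrete, the following is a minimal PyTorch sketch of a stack of modules wired with the hand-defined rule \(\mathbf{x}_j \leftarrow \mathbf{y}_{j-1}\). The class name, layer sizes, and the choice of a small convolutional block for \(\mathcal {G}(.)\) are illustrative assumptions of ours, not taken from the paper.

```python
import torch
import torch.nn as nn

class FixedStack(nn.Module):
    """Stack of L modules with fixed connectivity: x_j <- y_{j-1}."""
    def __init__(self, num_modules=4, channels=16):
        super().__init__()
        # Each module G(.; theta_j) is a toy convolutional block here.
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.BatchNorm2d(channels),
                          nn.ReLU())
            for _ in range(num_modules)
        ])

    def forward(self, x):
        for g in self.blocks:
            x = g(x)  # the input of each module is the output of the preceding one
        return x
```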

3.2 Masked Architecture

We now introduce learnable masks defining the connectivity in the network. Specifically, we want to allow each module j to take input from one or more of the preceding modules \(k=1,\ldots ,j-1\). To achieve this we define for each module a binary mask vector that controls the input pathway of that module. The binary mask vectors are learned jointly with the weights of the network. Let \(\mathbf{m}_j = [m_{j,1}, m_{j,2}, \ldots , m_{j,j-1}] \in \{0,1\}^{j-1}\) be the binary mask vector defining the active input connections feeding the j-th module. If \(m_{j,k}=1\), then the activation volume produced by the k-th module is fed as input to the j-th module. If \(m_{j,k}=0\), then the output from the k-th module is ignored by the j-th module. The tensors from the active input connections are all added together (in an element-wise fashion) to form the input to the module. Thus, if we denote again with \(\mathbf{y}_k\) the output activation tensor computed by the k-th module, the input to the j-th module will be given by the following equation:

$$\begin{aligned} \mathbf{x}_j = \sum _{k=1}^{j-1} {m}_{j,k} \cdot \mathbf{y}_k \end{aligned}$$
(1)

Then, the output of this module will be obtained through the usual computation, i.e., \(\mathbf{y}_j = \mathcal {G}(\mathbf{x}_j; \theta _j)\). We note that under this model we no longer have predefined connectivity among modules. Instead, the mask \(\mathbf{m}_j\) now determines selectively for each module which outputs from the previous modules will be aggregated to form the input to the block. In this paper we constrain the aggregations of outputs from the active connections to be in the form of simple additions as this does not require new parameters. When different modules yield feature maps of different sizes, we use zero-padding shortcuts to increase the dimensions of feature tensors to the largest size (as in [8]). These shortcuts are parameter free. We leave to future work the investigation of more sophisticated, parameterized aggregation schemes.
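For illustration, the masked aggregation of Eq. 1 can be sketched as follows in PyTorch. The function and variable names are ours, and for simplicity all module outputs are assumed to have the same shape (the zero-padding shortcuts mentioned above are omitted).

```python
import torch

def masked_input(outputs, mask):
    """Compute x_j = sum_k m_{j,k} * y_k (Eq. 1).

    outputs: list of the j-1 activation tensors y_1, ..., y_{j-1}
    mask:    binary tensor of shape (j-1,) holding m_{j,1}, ..., m_{j,j-1}
    """
    x_j = torch.zeros_like(outputs[0])
    for k, y_k in enumerate(outputs):
        if mask[k] == 1:   # active connection: aggregate by element-wise addition
            x_j = x_j + y_k
    return x_j
```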

We point out that depending on the constraints defined over \(\mathbf{m}_j\), different interesting models can be realized. For example, by introducing the constraint that \(\sum _{k} m_{j,k} = 1\) for each module j, each module will receive input from only one of the preceding modules (since each \(m_{j,k}\) must be either 0 or 1). At the other end of the spectrum, if we set \(m_{j,k}=1\) for all modules j, k, then all connections would be active. In our experiments we will demonstrate that the best results are typically achieved for values in between these two extremes, i.e., by connecting each module to K previous modules where K is an integer-valued hyperparameter such that \(1<K<(j-1)\). We refer to this hyperparameter as the fan-in of a module. As discussed in the next section, the mask vector \(\mathbf{m}_j\) of each module is learned simultaneously with all the other weights in the network via backpropagation. Finally, we note that it is possible for a module in the network to become unused. This happens when, as a result of the optimization, module k is such that \(m_{j,k}=0\) for all j. In this case, at the end of the optimization, we prune the module in order to reduce the number of parameters to store and to speed up inference (note that this does not affect the function computed by the network). In the next subsection we discuss our method for jointly learning the weights and the masks in the network.

3.3 MaskConnect: Learning to Connect

We refer to our learning algorithm as MaskConnect. It performs joint optimization of a given learning objective \(\ell \) with respect to both the weights of the network (\(\theta \)) as well as the masks (\(\mathbf{m}\)). Since in this paper we apply our method to the problem of image categorization, we use the traditional multi-class cross-entropy objective for the loss \(\ell \). However, our approach can be applied without change to other loss functions and other tasks benefitting from connectivity learning.

In MaskConnect the weights have real values, as in traditional networks, while the masks have binary values. This renders the optimization challenging. To learn these binary parameters, we adopt a modified version of backpropagation, inspired by the algorithm proposed by Courbariaux et al. [44] to train neural networks with binary weights. During training we store and update a real-valued version \(\tilde{\mathbf{m}}_j\) of each binary mask \(\mathbf{m}_j\), with entries \(\tilde{m}_{j,k}\) clipped to lie between 0 and 1.

In general, training via backpropagation consists of three steps: (1) forward propagation, (2) backward propagation, and (3) parameter update. At each iteration, we stochastically binarize the real-valued masks \(\tilde{\mathbf{m}}_j\) into binary-valued vectors \(\mathbf{m}_j\), which are then used for the forward propagation and backward propagation (steps 1 and 2). Instead, during the parameter update (step 3), the method updates the real-valued masks \(\tilde{\mathbf{m}}_j\). The weights \(\theta \) of the convolutional and fully connected layers are optimized using standard backpropagation. We discuss below the details of our mask training procedure, under the constraint that at any time there can be only K active entries in the binary mask \(\mathbf{m}_j\), where K is a predefined integer hyperparameter with \(1\le K \le j-1\). In other words, we impose the following constraints:

$$\begin{aligned} m_{j,k} \in \{0,1\} ~ \forall j,k, ~~~\text {and}~~~\sum _{k=1}^{j-1} m_{j,k} = K ~ \forall j. \end{aligned}$$

These constraints imply that each module receives input from exactly K previous modules.

Forward Propagation. During the forward propagation, our algorithm first normalizes the real-valued entries in the mask of each module j to sum up to 1, i.e., \(\sum _{k=1}^{j-1} \tilde{m}_{j,k} = 1\). This is done so that \(\tilde{\mathbf{m}}_j\) defines a proper multinomial distribution over the \(j-1\) possible input connections into module j. Then, the binary mask \(\mathbf{m}_j\) is stochastically generated by drawing K distinct samples \(a_1,\ldots ,a_K \in \{1,\ldots ,j-1\}\) from this multinomial distribution over the connections. Finally, the entries corresponding to the K samples are activated in the binary mask vector, i.e., \(m_{j,a_k} = 1\), for \(k=1,\ldots ,K\). The input activation volume to module j is then computed according to Eq. 1 from the sampled binary masks. We note that sampling from the multinomial distribution ensures that the connections with the largest real-valued mask entries will be more likely to be chosen, while at the same time the stochasticity of this process allows different connectivities to be explored, particularly during early stages of the learning when the real-valued masks still have fairly uniform distributions.
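A minimal PyTorch sketch of this stochastic binarization step is given below (our own illustration under the notation above; the small clamp is an added numerical safeguard to keep the sampling well defined, not part of the procedure described in the paper).

```python
import torch

def sample_binary_mask(real_mask, K):
    """Stochastically binarize the real-valued mask of one module.

    real_mask: tensor of shape (j-1,) with entries in [0, 1]
    K:         fan-in, i.e., number of active input connections
    """
    probs = real_mask.clamp(min=1e-8)
    probs = probs / probs.sum()                            # normalize to a multinomial distribution
    idx = torch.multinomial(probs, K, replacement=False)   # draw K distinct connections
    binary_mask = torch.zeros_like(real_mask)
    binary_mask[idx] = 1.0                                 # activate the sampled connections
    return binary_mask
```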

Backward Propagation. In the backward propagation step, the gradient of the loss with respect to each output \(\mathbf{y}_k\) is obtained via backpropagation from the gradients with respect to the inputs \(\mathbf{x}_j\) of the subsequent modules and the binary masks \(m_{j,k}\).

Mask Update. In the parameter update step our algorithm computes the gradient of the loss with respect to the binary mask \(\mathbf{m}_j\) of each module. Then, using these computed gradients and the given learning rate, it updates the real-valued masks \(\tilde{\mathbf{m}}_j\) via gradient descent. At this time we clip the updated real-valued masks to constrain them to remain within the valid interval [0, 1] (as in [44]).
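The mask update can be sketched as follows (again an illustrative snippet with our own names): the gradient computed with respect to the binary mask is applied to the real-valued mask, which is then clipped to [0, 1].

```python
def update_real_mask(real_mask, binary_mask_grad, lr):
    """real_mask, binary_mask_grad: torch.Tensors of shape (j-1,); lr: learning rate."""
    real_mask = real_mask - lr * binary_mask_grad  # gradient-descent step on the real-valued mask
    return real_mask.clamp(0.0, 1.0)               # clip entries to the valid interval [0, 1]
```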

Pseudocode for our training procedure is given in the supplementary material. After joint training over \(\theta \) and \(\mathbf{m}\), we have found it beneficial to (1) freeze the binary masks to the top-K values of each mask (i.e., by setting as active in \(\mathbf{m}_j\) the connections corresponding to the K largest values in \(\tilde{\mathbf{m}}_j\)) and then (2) fine-tune the weights \(\theta \) of the network with respect to these fixed binary masks.
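The freezing step (1) amounts to a top-K selection on each real-valued mask, as in the sketch below (function name ours).

```python
import torch

def freeze_top_k(real_mask, K):
    """Fix the binary mask to the K connections with the largest real-valued entries."""
    binary_mask = torch.zeros_like(real_mask)
    top_idx = torch.topk(real_mask, K).indices  # indices of the K largest entries
    binary_mask[top_idx] = 1.0
    return binary_mask
```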

In the next subsections we discuss how we instantiated our general approach for the two architectures considered in our experiments: ResNet and ResNeXt.

3.4 MaskConnect Applied to ResNet

The application of our algorithm to ResNets is quite straightforward. ResNets are modular networks obtained by stacking residual blocks. A residual block implements a residual function \(\mathcal {F}(.)\) with reference to the layer input. Figure 1(a) (left) illustrates an example of these modular components where the 3 layers in the block implement the residual function \(\mathcal {F}(\mathbf{x}; \theta )\). A shortcut connection adds the residual block output \(\mathcal {F}(\mathbf{x})\) to its input \(\mathbf{x}\). Thus the complete function \(\mathcal {G}(.)\) implemented by a residual block computes \(\mathcal {G}(\mathbf{x}; \theta ) = \mathcal {F}(\mathbf{x}; \theta ) + \mathbf{x}\). The ResNets originally introduced in [8] use a hand-defined connectivity that passes the output of a block to the immediately subsequent block, i.e., \(\mathbf{x}_{j+1} \leftarrow \mathcal {F}(\mathbf{x}_j; \theta _j) + \mathbf{x}_j\). Here we propose to use MaskConnect to learn the input connections for each individual residual block in the network. This changes the input provided to block \(j+1\) in the network to be \({\small \mathbf{x}_{j+1} \leftarrow \sum _{k=1}^{j} m_{j+1,k} \left[ \mathcal {F}(\mathbf{x}_k; \theta _k) + \mathbf{x}_k\right] }\), where the binary parameters \(m_{j+1,k}\) are learned automatically by our approach simultaneously with the weights \(\theta \), subject to the constraint that \(\sum _{k=1}^{j} m_{j+1,k} = K\). This implies that under our model each residual block now receives input from exactly K of the preceding blocks. The output tensors from the K selected blocks are aggregated using element-wise addition and passed as input to the block. Our experiments present results for varying values of the fan-in hyperparameter K, which controls the density of connectivity.
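The following PyTorch sketch illustrates this masked ResNet connectivity. It is a simplified illustration under assumptions of ours (toy two-layer residual blocks, constant channel count, binary masks passed in as precomputed tensors), not the paper's implementation.

```python
import torch
import torch.nn as nn

class MaskedResNetStack(nn.Module):
    def __init__(self, num_blocks=8, channels=16):
        super().__init__()
        # Toy residual functions F(.; theta_j): two 3x3 convolutions each.
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.BatchNorm2d(channels), nn.ReLU(),
                          nn.Conv2d(channels, channels, 3, padding=1),
                          nn.BatchNorm2d(channels))
            for _ in range(num_blocks)
        ])

    def forward(self, x, binary_masks):
        # binary_masks[j] is a length-j binary vector selecting which of the first
        # j block outputs feed block j (0-based; block 0 always reads the input x).
        outputs = [self.blocks[0](x) + x]               # G(x) = F(x) + x for the first block
        for j in range(1, len(self.blocks)):
            x_j = torch.zeros_like(outputs[0])
            for k, y_k in enumerate(outputs):           # masked aggregation (Eq. 1)
                if binary_masks[j][k] == 1:
                    x_j = x_j + y_k
            outputs.append(self.blocks[j](x_j) + x_j)   # residual computation of block j
        return outputs[-1]
```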

Fig. 1.

Application of MaskConnect to two forms of modular network: (a) ResNet [45] and (b) multi-branch ResNeXt [9]. In traditional ResNet (a) (left) the connections between blocks are fixed (black links) so that each block receives input from only the preceding block. Our approach (a) (right) learns the optimal input connections (solid red links) for each individual block from a collection of potential connections (solid and dotted red links). Similarly, in traditional ResNeXt (b) (left) each module consists of C parallel residual blocks which are all aggregated and fed to the next module (black links). MaskConnect (b) (right) replaces the fixed aggregation points of ResNeXt with learnable masks \(\mathbf{m}\) defining the active input connections (solid red links) for each individual residual block. (Color figure online)

3.5 MaskConnect Applied to Multi-branch ResNeXt

The adaptation of MaskConnect to ResNeXt architectures is slightly more complex, as ResNeXt is based on a multi-branch topology. ResNeXt was motivated by the observation that it is beneficial to arrange residual blocks not only along the depth dimension but also in parallel, implementing multiple threads of computation feeding from the same input layer. The outputs of the parallel residual blocks are then summed up together with the original input and passed on to the next module. The resulting multi-branch module is illustrated in Fig. 1(b) (left). More formally, let \(\mathcal {F}(\mathbf{x}; \theta ^{(i)}_j)\) be the transformation implemented by the j-th residual block of the i-th module of the network, where \(j=1,\ldots ,C\) and \(i=1,\ldots ,L\), with L denoting the total number of modules stacked on top of each other to form the complete network. The hyperparameter C is called the cardinality of the module and defines the number of parallel branches within each module. The hyperparameter L controls the total depth of the network. Then, in traditional ResNeXt, the output of the i-th module is computed as:

$$\begin{aligned} \mathbf{y}_i = \mathbf{x}_i + \sum _{j=1}^C \mathcal {F}(\mathbf{x}_i; \theta ^{(i)}_j) \end{aligned}$$
(2)

In [9] it was experimentally shown that increasing the cardinality C is a more effective way of improving accuracy compared to increasing depth or the number of filters. In other words, given a fixed budget of parameters, ResNeXt nets were shown to consistently outperform single-branch ResNets.

However, in an attempt to ease network design, a couple of restrictive limitations were embedded in the architecture of ResNeXt modules: (1) the C parallel feature extractors in each module operate on the same input; (2) the number of active branches is constant at all depth levels of the network.

MaskConnect allows us to remove these restrictions without adding any significant burden on the process of manual network design, with the exception of a single additional integer hyperparameter (K) for the entire network. As in ResNeXt, our proposed architecture consists of a stack of L multi-branch modules, each containing C parallel feature extractors. However, differently from ResNeXt, each branch in a module can take a different input. The input pathway of each branch is controlled by a binary mask vector. Let \(\mathbf{m}^{(i)}_j = [m^{(i)}_{j,1}, \ldots , m^{(i)}_{j,C}] \in \{0,1\}^{C}\) be the binary mask vector defining the active input connections feeding the j-th residual block in module i. We note that under this model we no longer have fixed aggregation nodes summing up all outputs computed by a module. Instead, the mask \(\mathbf{m}^{(i)}_j\) now determines selectively for each block which branches from the previous module will be aggregated to form the input to the block. Under this new scheme, the parallel branches in a module receive different inputs and as such are likely to yield more diverse features.

As before, different constraints over \(\mathbf{m}^{(i)}_j\) will give rise to different forms of architecture. By introducing the constraint that \(\sum _{k} m^{(i)}_{j,k} = 1\) for all blocks j, each residual block will receive input from only one branch (since each \(m^{(i)}_{j,k}\) must be either 0 or 1). If instead we set \(m^{(i)}_{j,k} = 1\) for all blocks j, k in each module i, then all connections would be active and we would obtain again the fixed ResNeXt architecture. In our experiments we present results obtained by varying the fan-in hyperparameter K such that \(1< K < C\). We also note that it is possible for a residual block in the network to become unused as a result of the optimization over the mask values. Thus, at any point in the network the total number of active parallel threads can be any number smaller than or equal to C. This implies that a variable branching factor is learned adaptively for the different depths in the network.
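A minimal PyTorch sketch of one masked multi-branch module is shown below. It is an illustration under our own assumptions (toy bottleneck branches, masks passed in as a precomputed C-by-C binary matrix, identity shortcuts of Eq. 2 omitted), not the paper's implementation.

```python
import torch
import torch.nn as nn

class MaskedMultiBranchModule(nn.Module):
    def __init__(self, C=8, channels=64, bottleneck=4):
        super().__init__()
        # Toy bottleneck residual branches F(.; theta_j).
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, bottleneck, 1),
                          nn.Conv2d(bottleneck, bottleneck, 3, padding=1),
                          nn.Conv2d(bottleneck, channels, 1))
            for _ in range(C)
        ])

    def forward(self, prev_branch_outputs, binary_masks):
        # prev_branch_outputs: list of C tensors produced by the previous module.
        # binary_masks: C x C binary matrix; row j selects the inputs of branch j.
        outputs = []
        for j, branch in enumerate(self.branches):
            x_j = torch.zeros_like(prev_branch_outputs[0])
            for k, y_k in enumerate(prev_branch_outputs):
                if binary_masks[j][k] == 1:    # aggregate only the selected branch outputs
                    x_j = x_j + y_k
            outputs.append(branch(x_j))        # each branch now sees a (possibly) different input
        return outputs
```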

4 Experiments

We tested our approach on the task of image categorization using two different examples of modularized architecture: ResNet [8] and ResNeXt [9]. We used the following datasets for our evaluation: CIFAR-10 [46], CIFAR-100 [46], Mini-ImageNet [47], as well as the full ImageNet [48]. In this paper we include the results achieved on CIFAR-100 and ImageNet [48], while the results for CIFAR-10 [46] and Mini-ImageNet [47] (showing consistent improvements up to nearly 4% over fixed connectivity) can be found in the supplementary material.

4.1 CIFAR-100

CIFAR-100 contains images of size 32 \(\times \) 32. It consists of 50,000 training images and 10,000 test images. Each image is labeled as belonging to one of 100 possible classes.

Fig. 2.

Varying the fan-in (K), i.e., the number of learned active connections to each residual block. The plot reports accuracy achieved by MaskConnect on CIFAR-100 using a ResNet-38 architecture (\(L=18\) blocks). All models have the same number of parameters (0.57M). The best accuracy is achieved at \(K=10\).

CIFAR-100 Results Based on the ResNet Architecture

Effect of Fan-In (K). The fan-in hyperparameter (K) defines the number of active input connections feeding each residual block. We study the effect of the fan-in on the performance of models built and trained using our proposed approach. We use residual blocks consisting of two 3 \(\times \) 3 convolutional layers. We use a model obtained by stacking \(L=18\) residual blocks with total depth of \(D=2+2L=38\) layers. We trained and tested this architecture using different fan-in values: \(K=1,\ldots ,17\). All these models have the same learning capacity as varying K does not affect the number of parameters. The results are shown in Fig. 2. We notice that the best accuracy is achieved using \(K=10\). Using a very low or very high fan-in yields lower accuracy. However, the algorithm does not appear to be overly sensitive to the fan-in hyperparameter, as a wide range of values for K (from \(K=7\) to \(K=13\)) produce accuracy close to the best.

Table 1. CIFAR-100 accuracies achieved by models trained using the connectivity of ResNet [45] (Fixed-Prev), a fixed random connectivity (Fixed-Random), and the connectivity learned by our approach (Learned)

Varying the Model. We trained several ResNet models differing in depth, using both MaskConnect as well as the traditional predefined connectivity. For these experiments we use a stack of L residual blocks with two 3 \(\times \) 3 convolutional layers per block. We choose \(L \in \{18, 36, 54\}\) to build networks with depths \(D=2+2L\) equal to 38, 74, and 110 layers, respectively. We show the classification accuracy achieved by the different models in Table 1. We report the results achieved using MaskConnect with fan-in \(K=10\), \(K=15\), \(K=20\) for models of depth \(D=38\), \(D=74\), \(D=110\), respectively. Fixed-Prev denotes the performance of ResNet, where each block is connected only to the previous block (\(K=1\)). We also include the accuracy achieved by choosing a random connectivity (Fixed-Random) using the same fan-in values K as our approach and training the parameters while keeping the random connectivity fixed. This baseline is useful to show that our model achieves higher accuracy over traditional ResNet not because of the higher number of connections (i.e., \(K>1\)), but rather because it learns the connectivity. Indeed, the results in Table 1 show that learning the connectivity using MaskConnect yields consistently higher accuracy than using multiple random connections or a single connection to the previous block.

CIFAR-100 Results Based on the Multi-branch ResNeXt Architecture

Effect of Fan-In (K). As for ResNet, we start by studying the effect of the fan-in hyperparameter (K). For this experiment we use a model obtained by stacking \(L=6\) multi-branch residual modules, each having cardinality \(C=8\) (the number of branches in each module). We use residual blocks consisting of 3 convolutional layers with a bottleneck implementing dimensionality reduction on the number of feature channels, as shown in Fig. 1(b). The bottleneck width for this experiment was set to \(w=4\). Since each residual block consists of 3 layers, the total depth of the network in terms of learnable layers is \(D=2+3L=20\).

Fig. 3.

Varying the fan-in (K) of our model, i.e., the number of active input branches to each residual block. The plot reports accuracy achieved on CIFAR-100 using a network stack of \(L=6\) ResNeXt modules having cardinality \(C=8\) and bottleneck width \(w=4\). All models have the same number of parameters (0.28M).

Fig. 4.

A visualization of the fixed connectivity of ResNeXt (left) vs the connectivity learned by our method (right) using \(K=1\). Each green square is a residual block; each row of \(C=8\) squares is a multi-branch module. Arrows indicate pathways connecting residual blocks of adjacent modules. It can be noticed that MaskConnect learns sparse connections. The squares without in/out edges are those pruned at the end of learning. This gives rise to a branching factor that varies along the depth of the net. (Color figure online)

We trained and tested this architecture using different fan-in values: \(K=1,\ldots ,8\). Again, varying K does not alter the number of parameters. The results are shown in Fig. 3. We can see that the best accuracy is achieved by connecting each residual block to \(K=4\) branches out of the total \(C=8\) in each module. Note that when setting \(K=C\), there is no need to learn the masks. In this case each mask is simply replaced by an element-wise addition of the outputs from all the branches. This renders the model equivalent to ResNeXt [9], which has fixed connectivity. Based on the results of Fig. 3, in all our experiments below we use \(K=4\) (since it gives the best accuracy) but also \(K=1\), which yields high sparsity and, as we will see shortly, savings in the number of parameters.

Varying the Models. In Table 2 we show the classification accuracy achieved with ResNeXt models of different depth and cardinality (the details of each model are listed in the Supplementary Material). For each architecture we also include the accuracy achieved with full (as opposed to learned) connectivity, which corresponds to ResNeXt. These results show that learning the connectivity produces consistently higher accuracy than using fixed connectivity, with accuracy gains of up to \(2.2\%\) compared to the state-of-the-art ResNeXt model. Furthermore, we can notice that the accuracy of models based on random connectivity (Fixed-Random) is considerably lower compared to our approach, despite having the same connectivity density (\(K=4\)). This shows that the improvements of our approach over ResNeXt are not due to sparser connectivity but they are rather due to learned connectivity. We note that these improvements in accuracy come at little computational training cost: the average training time overhead for learning masks and weights is about \(39\%\) using our unoptimized implementation compared to learning only the weights given a fixed connectivity.

Parameter Savings. Our proposed approach provides the benefit of automatically identifying residual blocks that are unnecessary. At the end of training, the unused residual blocks can be pruned away. This yields savings in the number of parameters to store and in test-time computation. In Table 2, the columns Train and Test under Params show the original number of parameters (used during training) and the number of parameters after pruning (used at test time). Note that for the biggest architecture, our approach using \(K=1\) yields a parameter saving of 40% compared to ResNeXt with full connectivity (20.5M vs 34.4M), while achieving the same accuracy. Thus, in summary, using fan-in \(K=4\) gives models that have the same number of parameters as ResNeXt but yield higher accuracy; using fan-in \(K=1\) gives a significant saving in the number of parameters and accuracy on par with ResNeXt.

Visualization of the Learned Connectivity. Figure 4 provides an illustration of the connectivity learned by MaskConnect for \(K=1\) versus the fixed connectivity of ResNeXt for model \(\{D=29,w=8,C=8\}\). While ResNeXt feeds the same input to all blocks of a module, our algorithm learns different input pathways for each block and yields a branching factor that varies along depth.

Table 2. CIFAR-100 accuracies achieved by two ResNeXt architectures trained using predefined full connectivity (Fixed-Full) [9], random connectivity (Fixed-Random, \(K=4\)), and the connectivity learned by our algorithm (Learned, \(K=1\), \(K=4\)). Each model was trained 4 times, using different random initializations. We report the best test performance as well as the mean test performance computed from the 4 runs. We list the number of parameters used during training (Params-Train) and the number of parameters obtained after pruning the unused blocks (Params-Test). Our learned connectivity using \(K=4\) produces accuracy gains of up to 2.2% compared to the strong ResNeXt model, while using \(K=1\) yields results equivalent to ResNeXt but it induces a significant reduction in number of parameters at test time (e.g., a saving of 40% for model {29, 64, 8})

4.2 ImageNet

Finally, we evaluate our approach on the large-scale ImageNet 2012 dataset [48], which includes images of 1000 classes. We train our approach on the training set (1.28M images) and evaluate it on the validation set (50K images).

ImageNet Results Based on the ResNet Architecture. For this experiment we use a stack of \(L=16\) residual blocks, each consisting of 3 convolutional layers with a bottleneck architecture. Thus, the total number of layers is \(D=2+3L=50\). Compared to the traditional ResNet using fixed connectivity, the same network trained using MaskConnect with fan-in \(K=10\) yields a top-1 accuracy gain of \(1.94\%\) (\(78.09\%\) vs \(76.15\%\)).

ImageNet Results Based on Multi-branch ResNeXt. In Table 3, we report the top accuracies for three different ResNeXt architectures. For these experiments we set \(K=C/2\). We can observe that for all three architectures, our learned connectivity yields an improvement in accuracy over fixed full connectivity [9].

Table 3. ImageNet accuracies (single crop) achieved by different architectures using the predefined connectivity of ResNeXt (Fixed-Full) versus the connectivity learned by our algorithm (Learned)

5 Conclusions

In this paper we introduced an algorithm to learn the connectivity of deep modular networks. The problem is formulated as a single joint optimization over the weights and connections between modules in the model. We tested our approach on challenging image categorization benchmarks where it led to significant accuracy improvements over the state-of-the-art ResNet and ResNeXt models using fixed connectivity. An added benefit of our approach is that it can automatically identify superfluous blocks, which can be pruned after training without impact on accuracy for more efficient testing and for reducing the number of parameters to store.

While our experiments were carried out on two particular architectures (ResNet and ResNeXt) and a specific form of building block (residual block), we expect the benefits of our approach to extend to other modules and network structures. For example, it could be applied to learn the connectivity of skip-connections in DenseNets [10], which are currently based on predefined connectivity rules. In this paper, our masks perform non-parametric additive aggregation of the branch outputs. It would be interesting to experiment with learnable (parametric) aggregations of the outputs from the individual branches. Our approach is limited to learning connectivity within a given, fixed architecture. Future work will explore the use of learnable masks for full architecture discovery.