1 Introduction

Artificial neural networks are inspired by the simple yet powerful idea that predictive models can be produced by combining units that mimic biological neurons. Indeed, there is a rich discussion on what should constitute each unit and how the units should interact with one another. Units that work in parallel form a layer, whereas a sequence of layers transforming data unidirectionally defines a feedforward network. Deciding the number of such layers—the depth of the network—remains a topic of debate and a source of technical challenges.

A neural network is trained for a particular task by minimizing the loss function associated with a sample of data in order for the network to learn a function of interest. Although several universal approximation results show that mathematical functions can generally be approximated to arbitrary precision by single-layer feedforward networks, these results rely on using a very large number of units [12, 26, 43]. Moreover, simple functions such as XOR cannot be exactly represented with a single layer using the most typical units [45].

In fact, it is commonly agreed that depth is important in neural networks [7, 38]. In the popular case of feedforward networks in which each unit is a Rectified Linear Unit (ReLU) [18, 21, 38, 47], the neural network models a piecewise linear function [3]. Under the right conditions, the number of such “pieces”—the linear regions—may grow exponentially with the depth of the network [46, 48, 67]. Depending on the total number of units and the size of the input, the number of linear regions is maximized by networks with more or fewer layers [58]. Similarly, there is an active area of study on bounding the number of layers necessary to model any function that a given type of network can represent [3, 14, 20, 27, 45, 69].

Although shallow networks present competitive accuracy results in some cases [4], deep neural networks have been established as the state of the art over and again in areas such as computer vision and natural language processing [13, 24, 29, 30, 37, 39, 62, 70] thanks to the development and popularization of backpropagation [41, 54, 71]. However, Stochastic Gradient Descent (SGD) [53]—the training algorithm associated with backpropagation—may have difficulty converging to a good model due to exploding or vanishing gradients [6, 28, 35, 49].

Exploding and vanishing gradients are often attributed to excessive depth, inadequate choice of parameters for the learning algorithm, or inappropriate scaling between network parameters, inputs, and outputs [17, 32]. This issue has also inspired unit augmentations [25, 44, 60], additional connections across layers [23, 30], and output normalization [32, 52]. Indeed, it is somewhat intuitive that gradient updates, depth, and parameter scaling may affect one another.

In lieu of reducing depth, we may also increase the number of neurons per layer [22, 64,65,66, 72]. That leads to models that are considerably more complex, which are often trained with additional terms in the loss function, such as weight regularization, to induce simpler models that hopefully generalize better. In turn, that helps model compression techniques such as network pruning remove many parameters with only minor impact on model accuracy.

Nonetheless, vanishing gradients may also be caused by dead neurons when using ReLUs. If dead, a ReLU outputs zero for every input in the sample. Hence, it contributes neither to updates during training nor to the expressiveness of the model. To a lesser but still relevant extent, similar issues can be observed with a ReLU that never outputs zero, which we refer to as a linear neuron.

In this work, we aim to revert neurons that die or become linear during training. Our approach is based on satisfying certain constraints throughout training: for each unit, at least one input from the sample must lie above a given margin and another input must lie below it; and, for each layer and each input from the sample, at least one unit in the layer must have that input above such a margin and another unit must have it below. In order to use SGD for training, these constraints are dualized as part of the loss function and thus become a form of regularization that discourages convergence to spurious local minima of the original loss function.

2 Background

We consider a feedforward neural network modeling a function \(\hat{{\boldsymbol{y}}} = f_\theta ({\boldsymbol{x}})\) with an input layer \({\boldsymbol{x}}= {\boldsymbol{h}}^0 = [{h}_1^0 ~ {h}_2^0 \dots {h}_{n_0}^0]^T\), L hidden layers, and each layer \(\ell \in {\mathbb {L}}= \{1,2,\dots ,L\}\) having \(n_\ell \) units indexed by \(j \in {\mathbb {N}}_\ell = \{1, 2, \ldots , n_\ell \}\). For each layer \(\ell \in {\mathbb {L}}\), let \({\boldsymbol{W}}^\ell \) be the \(n_\ell \times n_{\ell -1}\) matrix in which the j-th row corresponds to the weights of neuron j in layer \(\ell \) and \({\boldsymbol{b}}^\ell \) be the vector of biases of layer \(\ell \). The preactivation output of unit j in layer \(\ell \) is \({g}_j^\ell = {\boldsymbol{W}}_{j}^\ell {\boldsymbol{h}}^{\ell -1} + {b}_j^\ell \) and the output is \({h}_j^\ell = \sigma ({g}_j^\ell )\) for an activation function \(\sigma \), which, if not nonlinear, would allow hidden layer \(\ell \) to be removed by directly connecting layers \(\ell -1\) and \(\ell +1\) [55]. We refer to \({\boldsymbol{g}}^\ell (\chi )\) and \({\boldsymbol{h}}^\ell (\chi )\) as the values of \({\boldsymbol{g}}^\ell \) and \({\boldsymbol{h}}^\ell \) when \({\boldsymbol{x}}= \chi \).
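
To make the notation concrete, the following is a minimal NumPy sketch of the forward pass just described; the names `W`, `b`, and `x` are illustrative and assume the weights, biases, and input are given as arrays.

```python
import numpy as np

def relu(u):
    return np.maximum(0.0, u)

def forward(x, W, b):
    """Forward pass: h^0 = x, g^l = W^l h^(l-1) + b^l, h^l = relu(g^l)."""
    h = x
    preactivations = []          # keep g^1, ..., g^L for later inspection
    for W_l, b_l in zip(W, b):   # one (matrix, vector) pair per layer
        g = W_l @ h + b_l        # preactivation g^l
        preactivations.append(g)
        h = relu(g)              # output h^l of the ReLU layer
    return h, preactivations
```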

For the scope of this work, we consider the ReLU activation function \(\sigma (u) = \max \{0, u\}\). Typically, the output of a feedforward neural network is produced by a softmax layer following the last hidden layer [10], \(\hat{{\boldsymbol{y}}} = \rho ({\boldsymbol{h}}^L)\) with \(\rho ({\boldsymbol{h}}^L)_j = e^{{h}^L_j}/\sum _{k=1}^{n_{L}} e^{{h}^L_k} ~ \forall j \in \{1, \ldots , n_{L}\}\), which is a peripheral aspect to our study.

The neural network is trained by minimizing a loss function \(\mathcal {L}\) over a parameter set \(\theta := \{ ({\boldsymbol{W}}^\ell , {\boldsymbol{b}}^\ell ) \}_{\ell =1}^L\) based on the N samples of a training set \({\mathbb {X}}:= \{ ({\boldsymbol{x}}^i) \}_{i=1}^N\) to yield predictions \(\{ \hat{{\boldsymbol{y}}}^i := f_\theta ({\boldsymbol{x}}^i) \}_{i=1}^N\) that approximate the sample labels \(\{ {\boldsymbol{y}}^i \}_{i=1}^N\) using metrics such as least squares or cross entropy [19, 59]:

$$\begin{aligned} \min _{\theta } ~~~&\mathcal {L}\left( \theta , \left\{ (\hat{{\boldsymbol{y}}}^i, {\boldsymbol{y}}^i) \right\} _{i=1}^N \right)&\end{aligned}$$
(1)
$$\begin{aligned} \text {s.t.} ~~~&\hat{{\boldsymbol{y}}}^i = f_\theta ({\boldsymbol{x}}^i) \qquad \qquad \forall i \in \{1, 2, \ldots , N\} \end{aligned}$$
(2)

Although a neural network is not typically trained through constrained optimization, we believe that our approach is more easily understood under such a mindset, which aligns with other work emerging from this community [8, 15, 31].

3 Death, Stagnation, and Jumpstarting

Every ReLU is either inactive, if \({g}^\ell _j \le 0\) and thus \({h}^\ell _j = 0\), or active, if \({g}^\ell _j > 0\) and thus \({h}^\ell _j = {g}^\ell _j > 0\). If a ReLU does not alternate between those states for different inputs, then the unit is considered stable [68] and thus the neural network models a less expressive function [56]. In certain cases, those units can be merged or removed without affecting the model [55, 57]. We consider in this work a superset of such units—those which do not change state at least on the training set:

Definition 1

For a training set \({\mathbb {X}}\), unit j in layer \(\ell \) is dead if \({h}^\ell _j({\boldsymbol{x}}^i) = 0 ~ \forall i \in \{1, 2, \ldots , N\}\), linear if \({h}^\ell _j({\boldsymbol{x}}^i) > 0 ~ \forall i \in \{1, 2, \ldots , N\}\), or nonlinear otherwise. Layer \(\ell \) is dead or linear if all of its units are dead or linear, respectively.
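
Since \({h}^\ell _j = \max \{0, {g}^\ell _j\}\), the classification can be read directly from the preactivations. A small NumPy sketch, assuming the preactivations of layer \(\ell \) over the training set are stacked in a hypothetical matrix G of shape \(N \times n_\ell \):

```python
import numpy as np

def classify_units(G):
    """G[i, j] = preactivation of unit j in the layer for sample i."""
    dead = np.all(G <= 0, axis=0)       # unit inactive on every training sample
    linear = np.all(G > 0, axis=0)      # unit active on every training sample
    nonlinear = ~(dead | linear)
    return dead, linear, nonlinear
```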

Figures 1a to 1c illustrate geometrically the classification of a unit based on the training set. If dead, a unit impairs the training of the neural network because it always outputs zero for the inputs in the training set. Unless the units preceding a dead unit are updated in such a way that the unit is no longer dead, the gradients of its output remain at zero and the parameters of the dead unit are no longer updated [42, 61], which effectively reduces the modeling capacity. If a layer dies, then training stops because the gradients are zero.

Fig. 1. A unit j in layer \(\ell \) separates the input space \({\boldsymbol{h}}^{\ell -1}\) into an open half-space \({\boldsymbol{W}}^\ell _j {\boldsymbol{h}}^{\ell -1} + {b}^\ell _j > 0\) in which the unit is active and a closed half-space \({\boldsymbol{W}}^\ell _j {\boldsymbol{h}}^{\ell -1} + {b}^\ell _j \le 0\) in which the unit is inactive. The arrow in each case points to the active side. The unit is dead if the inputs from training set \({\mathbb {X}}\) lie exclusively on the inactive side (a); linear if exclusively on the active side (b); and nonlinear otherwise (c). In turn, an input is considered a dead point if it lies in the closed half-space \({\boldsymbol{W}}^\ell _j {\boldsymbol{h}}^{\ell -1} + {b}^\ell _j\le 0\) in which each and every unit \(j \in {\mathbb {N}}_\ell \) is inactive (d); a linear point if it lies in the open half-space \({\boldsymbol{W}}^\ell _j {\boldsymbol{h}}^{\ell -1} + {b}^\ell _j > 0\) in which each and every unit \(j \in {\mathbb {N}}_\ell \) is active (e); and a nonlinear point otherwise (f).

For an intuitive and training-independent discussion, we consider the incidence of dead layers at random initialization. If the probability that a unit is dead upon initialization is p, as reasoned in [42], then layer \(\ell \) is dead with probability \(p^{n_\ell }\) and at least one layer is dead with probability \(1 - \prod _{\ell =1}^L (1-p^{n_\ell })\). If a layer is too thin or the network is too deep, then the network is more likely to be untrainable. We may discard initializations with dead units, but that ignores the impact on the training set:
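
As an illustration with hypothetical values of p and layer widths, the probability of encountering at least one dead layer can be computed as follows:

```python
import math

def prob_some_dead_layer(p, widths):
    """1 - prod_l (1 - p**n_l): probability that at least one layer is dead
    when each unit dies independently with probability p at initialization."""
    return 1.0 - math.prod(1.0 - p**n for n in widths)

# Hypothetical values for illustration only.
print(prob_some_dead_layer(0.1, [3] * 50))   # ~0.049: 50 layers of width 3
print(prob_some_dead_layer(0.1, [1] * 50))   # ~0.995: 50 layers of width 1
```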

Definition 2

For a hidden layer \(\ell \in {\mathbb {L}}\), an input x is considered a dead point if \({\boldsymbol{h}}^\ell (x) = 0\), a linear point if \({\boldsymbol{h}}^\ell (x) > 0\), and a nonlinear point otherwise.
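
With the same hypothetical preactivation matrix G as before, points are classified along the other axis; a sketch:

```python
import numpy as np

def classify_points(G):
    """G[i, j] = preactivation of unit j in the layer for sample i."""
    dead = np.all(G <= 0, axis=1)       # no unit of the layer is active for x^i
    linear = np.all(G > 0, axis=1)      # every unit of the layer is active for x^i
    nonlinear = ~(dead | linear)
    return dead, linear, nonlinear
```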

Figures 1d to 1f illustrate geometrically the classification of a point based on the activated units. If \(x^i \in {\mathbb {X}}\) is a dead point at layer \(\ell \), then there is no backpropagation associated with \(x^i\) to the hidden layers 1 to \(\ell -1\). Hence, its contribution to training is diminished unless a subsequent gradient update at a preceding unit reverts the death. If \(\ell = L\), then \(x^i\) is effectively not part of the training set. If all points die, regardless of the layer, then training halts.

If we also associate a probability q with \({\boldsymbol{x}}^i\) not activating a unit, then \({\boldsymbol{x}}^i\) is a dead point for layer \(\ell \) with probability \(q^{n_\ell }\) and for at least one layer of the neural network with probability \(1-\prod _{\ell =1}^L (1-q^{n_\ell })\). Unlike p, q is bound to be significant: with typical random initializations, a given input fails to activate a given unit roughly half of the time.

We may likewise regard linear units and linear points as less desirable than nonlinear units and nonlinear points. A linear unit limits the expressiveness of the model, since it always contributes the same linear transformation to every input in the training set. A linear point can be more difficult to discriminate from other inputs, in particular if those inputs are also linear points.

Inspired by the prior discussion, we formulate the following constraints:

$$\begin{aligned}&\max _{{\boldsymbol{x}}^i \in {\mathbb {X}}} g^\ell _j({\boldsymbol{x}}^i) \ge 1&\forall \ell \in {\mathbb {L}}, j \in {\mathbb {N}}_\ell \end{aligned}$$
(3)
$$\begin{aligned}&\min _{{\boldsymbol{x}}^i \in {\mathbb {X}}} g^\ell _j({\boldsymbol{x}}^i) \le -1&\forall \ell \in {\mathbb {L}}, j \in {\mathbb {N}}_\ell \end{aligned}$$
(4)
$$\begin{aligned}&\max _{j \in {\mathbb {N}}_\ell } {g}^\ell _j({\boldsymbol{x}}^i) \ge 1&\forall \ell \in {\mathbb {L}}, {\boldsymbol{x}}^i \in {\mathbb {X}} \end{aligned}$$
(5)
$$\begin{aligned}&\min _{j \in {\mathbb {N}}_\ell } {g}^\ell _j({\boldsymbol{x}}^i) \le -1&\forall \ell \in {\mathbb {L}}, {\boldsymbol{x}}^i \in {\mathbb {X}} \end{aligned}$$
(6)

Dead and linear units are respectively prevented by the constraints in (3) and (4). Dead and linear points are prevented by the constraints in (5) and (6). Then we dualize those constraints and induce their satisfaction through the objective:

$$\begin{aligned} \min _{\theta } ~~~&\mathcal {L}\left( \theta , \left\{ (\hat{{\boldsymbol{y}}}^i, {\boldsymbol{y}}^i) \right\} _{i=1}^N \right) + \lambda \mathcal {P}(\xi ^+, \xi ^-, \psi ^+, \psi ^-) \end{aligned}$$
(7)
$$\begin{aligned} \text {s.t.} ~~~&\hat{{\boldsymbol{y}}}^i = f_\theta ({\boldsymbol{x}}^i)&\forall i \in \{1, 2, \ldots , N\} \end{aligned}$$
(8)
$$\begin{aligned}&\xi ^+_{j \ell } = \max \left\{ 0, 1 - \max _{{\boldsymbol{x}}^i \in {\mathbb {X}}} g^\ell _j({\boldsymbol{x}}^i) \right\}&\forall \ell \in {\mathbb {L}}, j \in {\mathbb {N}}_\ell \end{aligned}$$
(9)
$$\begin{aligned}&\xi ^-_{j \ell } = \max \left\{ 0, 1 + \min _{{\boldsymbol{x}}^i \in {\mathbb {X}}} g^\ell _j({\boldsymbol{x}}^i) \right\}&\forall \ell \in {\mathbb {L}}, j \in {\mathbb {N}}_\ell \end{aligned}$$
(10)
$$\begin{aligned}&\psi ^+_{i \ell } = \max \left\{ 0, 1-\max _{j \in {\mathbb {N}}_\ell } {g}^\ell _j({\boldsymbol{x}}^i) \right\}&\forall \ell \in {\mathbb {L}}, {\boldsymbol{x}}^i \in {\mathbb {X}} \end{aligned}$$
(11)
$$\begin{aligned}&\psi ^-_{i \ell } = \max \left\{ 0, 1 + \min _{j \in {\mathbb {N}}_\ell } {g}^\ell _j({\boldsymbol{x}}^i) \right\}&\forall \ell \in {\mathbb {L}}, {\boldsymbol{x}}^i \in {\mathbb {X}} \end{aligned}$$
(12)

We denote by \(\xi ^+\), \(\xi ^-\), \(\psi ^+\), and \(\psi ^-\) the nonnegative deficits by which the corresponding constraints in (3)–(6) are not satisfied. These deficits are combined and weighted against the original loss function \(\mathcal {L}\) through a function \(\mathcal {P}\), for which we have considered the arithmetic mean as well as the 1-norm and the 2-norm.
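
As a concrete sketch of the penalty (not the authors' implementation), the deficits for one layer can be computed in PyTorch from a tensor of preactivations of shape \((N, n_\ell )\) and aggregated with the 1-norm:

```python
import torch

def jumpstart_deficits(preacts, margin=1.0):
    """Deficits (9)-(12) for one layer; preacts has shape (N, n_layer)."""
    # Unit-wise: every unit should reach at least +1 on some sample (9)
    # and at most -1 on some other sample (10).
    xi_plus = torch.clamp(margin - preacts.max(dim=0).values, min=0.0)
    xi_minus = torch.clamp(margin + preacts.min(dim=0).values, min=0.0)
    # Point-wise: every sample should push some unit above +1 (11)
    # and leave some unit below -1 (12).
    psi_plus = torch.clamp(margin - preacts.max(dim=1).values, min=0.0)
    psi_minus = torch.clamp(margin + preacts.min(dim=1).values, min=0.0)
    return xi_plus, xi_minus, psi_plus, psi_minus

def jumpstart_penalty(preacts_per_layer, lam=1e-4):
    """1-norm aggregation of all deficits, weighted by lambda as in (7)."""
    total = sum(sum(d.sum() for d in jumpstart_deficits(g))
                for g in preacts_per_layer)
    return lam * total
```

During training, this penalty would be added to the task loss at every step, so that gradients flow through the max and min operations to the specific units and samples that violate the constraints; in practice, the maxima and minima over \({\mathbb {X}}\) would presumably be taken over the current batch.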

We can apply this approach to convolutional neural networks [16, 39] with only minor changes, since they are equivalent to feedforward networks with parameter sharing that are not fully connected. The main difference when working with them directly is that the preactivation of a unit is a matrix instead of a scalar. We compute the margin through the maximum or minimum over those values.
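
One plausible way to implement this (a sketch of our reading, not the authors' code) is to reduce each preactivation feature map to its spatial maximum and minimum, and then reuse the deficits above on the reduced values:

```python
import torch

def conv_preacts_to_margins(preacts):
    """preacts: (N, C, H, W) preactivation maps of a convolutional layer.

    Returns per-sample, per-filter spatial maxima and minima of shape (N, C):
    the maxima feed the max-side deficits (9) and (11), and the minima feed
    the min-side deficits (10) and (12)."""
    g_max = preacts.amax(dim=(2, 3))
    g_min = preacts.amin(dim=(2, 3))
    return g_max, g_min
```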

4 Computational Experiments

Our first experiment (Fig. 2) is based on the MOONS dataset [51] with 85 points for training and 15 for validation. We test every width in \(\{1, 2, 3, 4, 5, 10, 15, 20, 25\}\) with every depth in \(\{1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, \ldots , 150 \}\). We chose a simple dataset to limit the interference of factors such as overfitting, underfitting, or batch size issues. The networks are implemented in Tensorflow [1] and Keras [11] with Glorot uniform initialization [17] and trained using Adam [34] for 5000 epochs, a learning rate of \(\epsilon = 0.01\), and a batch size of 85. For each depth-width pair, we train a baseline network and a network with jumpstart using the 1-norm as the aggregation function \(\mathcal {P}\) and loss coefficient \(\lambda = 10^{-4}\).
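
For reference, a minimal Keras sketch of the baseline configuration described above, assuming integer class labels; the jumpstart penalty would be added on top of the loss and is omitted here:

```python
import tensorflow as tf

def make_baseline_mlp(depth, width):
    """Fully connected ReLU network for MOONS (2-d input, 2 classes)."""
    model = tf.keras.Sequential()
    for _ in range(depth):
        model.add(tf.keras.layers.Dense(width, activation="relu",
                                        kernel_initializer="glorot_uniform"))
    model.add(tf.keras.layers.Dense(2, activation="softmax"))
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example: model = make_baseline_mlp(depth=10, width=3)
#          model.fit(x_train, y_train, epochs=5000, batch_size=85)
```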

With jumpstart, we successfully train networks of width 3 with a depth up to 60 instead of 10 for the baseline and width 25 with a depth of up to 100 instead of 30. Hence, there is an approximately 5-fold increase in trainable depth.

Fig. 2. Heatmap contrasting accuracy for neural networks trained on MOONS with depth between 1 and 150 and width between 1 and 25. The left plot is the baseline and the right plot shows the results when using jumpstart. The accuracy ranges from a low of 0.5 (black) to a high of 1.0 (beige), with the former corresponding to random guessing since the dataset has two balanced classes. (Color figure online)

Our second experiment (Table 1) evaluates convolutional neural networks trained on the MNIST dataset [40]. We test every depth from 2 to 68 in increments of 4 with every width in \(\{ 2, 4, 8 \}\), where the width refers to the number of filters per layer. The networks are implemented as before, but with a learning rate of 0.001 over 50 epochs, a batch size of 1024, kernel dimensions (3, 3), padding to produce an output of the same dimensions as the input, Glorot uniform initialization [17], and flattening before the output layer. For each depth-width pair, we train a baseline network and a jumpstart network with the 1-norm as the aggregation function \(\mathcal {P}\) and loss coefficient \(\lambda = 10^{-8}\).
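
The corresponding baseline convolutional architecture could be sketched as follows, again with the jumpstart penalty omitted:

```python
import tensorflow as tf

def make_baseline_cnn(depth, width):
    """Convolutional ReLU network for MNIST: `width` filters per layer,
    3x3 kernels, 'same' padding, flattening before the softmax output."""
    model = tf.keras.Sequential()
    for _ in range(depth):
        model.add(tf.keras.layers.Conv2D(width, (3, 3), padding="same",
                                         activation="relu",
                                         kernel_initializer="glorot_uniform"))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(10, activation="softmax"))
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```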

Table 1. Summary of the results for the convolutional neural networks trained on the MNIST dataset without jumpstart (baseline) and with jumpstart.

With jumpstart, we successfully train networks combining all widths and depths in comparison to only up to depth 12 for widths 2 and 4 and only up to depth 24 for width 8 in the baseline. In other words, only 18 baseline network trainings converge, which we denote as the successful models in Table 1.

Our third experiment (Figs. 3 and 4) evaluates convolutional networks trained on CIFAR-10 and CIFAR-100 [36]. For CIFAR-10, we test every depth in \(\{ 10, 20, 30 \}\) with every width in \(\{ 2, 8, 16, 32, 64, 96, 192 \}\). For CIFAR-100, we test depths in \(\{ 10, 20 \}\) with widths in \(\{ 8, 16, 32, 64 \}\). The networks are implemented in Pytorch [50] with learning rates \(\varepsilon \in \{0.001, 0.0001\}\) over 400 epochs, a batch size of 128, the same kernel dimensions and padding as before, Kaiming uniform initialization [24], global max-avg concat pooling before the output layer, and jumpstart with the 2-norm (\(\mathcal {P} = L^2\)) and \(\lambda \in \{0.001, 0.1\}\) or the mean (\(\mathcal {P} = \bar{x}\)) and \(\lambda \in \{0.1, 1\}\).
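
The global max-avg concat pooling mentioned above can be written as a small PyTorch module; this is our interpretation of that component:

```python
import torch
import torch.nn as nn

class GlobalMaxAvgConcatPool(nn.Module):
    """Concatenates global max pooling and global average pooling over the
    spatial dimensions, turning (N, C, H, W) feature maps into (N, 2C)."""

    def forward(self, x):
        max_pool = torch.amax(x, dim=(2, 3))
        avg_pool = torch.mean(x, dim=(2, 3))
        return torch.cat([max_pool, avg_pool], dim=1)
```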

With jumpstart, we successfully train networks for CIFAR-10 with depth up to 30, in comparison to no more than 20 in the baseline. The best performance—0.766 for jumpstart and 0.734 for the baseline—is observed for both with \(\varepsilon = 0.001\), where the validation accuracy of the jumpstart experiments exceeds the baseline in 18 out of 21 depth-width pairs in one case and 20 out of 21 in the other. The baseline is comparatively more competitive with \(\varepsilon = 0.0001\), but the overall validation accuracy drops significantly. For CIFAR-100, the jumpstart experiments exceed the baseline in 12 out of 16 combinations of depth, width, and learning rate. The accuracy improves by 1 point in networks with 10 layers and by 7.8 points in networks with 20 layers. The maximum accuracy attained is 0.37 for the baseline and 0.38 with jumpstart. The training time becomes 1.33 times greater on CIFAR-10 and 1.47 times greater on CIFAR-100. Keeping the preactivations from the forward pass entails a comparable memory overhead of around 50%.

The source code is at https://github.com/blauigris/jumpstart-cpaior.

Fig. 3. Scatter chart of the number of parameters by accuracy for training (top) and validation (bottom) of convolutional neural networks trained on CIFAR-10. Some depth-width pairs are shown above the plots for reference and the gridlines are solid for depth 30, dashed for 20, and dotted for 10. The results of this experiment are plotted in this format due to their greater variability in comparison to the second experiment, which permits evaluating parameter efficiency. With the same number of units but fewer parameters, the results for \(20 \times 8\) are better than those for \(10 \times 16\), and likewise for \(20 \times 32\) when compared with \(10 \times 64\).

Fig. 4. Scatter chart of the number of parameters by accuracy for training (top) and validation (bottom) of convolutional neural networks trained on CIFAR-100. Some depth-width pairs are shown above the plots for reference and the gridlines are dashed for depth 20 and dotted for 10. Once a certain capacity is reached at 640 units, we find that the performance for \(20 \times 32\) is competitive with that of \(10 \times 64\) while using fewer parameters.

5 Conclusion

We have presented a regularization technique for training thinner and deeper neural networks, which leads to a more efficient use of the dataset and to neural networks that are more parameter-efficient. Although massive models are currently widely popular in theory [33] and practice [2], their economic barriers and environmental footprint [63] as well as their societal impact [5] are known concerns. Hence, we present a potential alternative to lines of work such as model compression [9] by avoiding the need to operate with larger models in the first place. Whereas deeper networks are often pursued, trainable thinner networks are surprisingly not.