
1 Introduction

Neural networks have been extremely useful for learning complex tasks such as gesture recognition [1] and banknote recognition [2]. More recently, in contrast to shallow networks with a single layer of feature abstraction, there has been massive interest in deep networks, which compose many layers of feature abstraction. Earlier works [3, 4] established that, given a sufficiently large number of hidden units, a shallow network is a universal function approximator. Interestingly, many works addressing the benefit of depth in neural networks have also emerged. For example, using the concept of sum-product networks, Delalleau and Bengio [5] posited that deep networks can efficiently represent some families of functions with fewer hidden units than shallow networks. In addition, Mhaskar et al. [6] provided proofs that deep networks can operate with lower Vapnik-Chervonenkis (VC) dimensions. Bianchini and Scarselli [7], employing some architectural constraints, derived upper and lower bounds for some shallow and deep architectures; they concluded that, using the same resources (computation units), deep networks can represent more complex functions than shallow networks. In practice, the success of deep networks has corroborated the position that deep networks have better representational capability than shallow networks; many state-of-the-art results on benchmark datasets are currently held by deep networks [8,9,10].

In recent times, the aforementioned theoretical proofs, practical results and new works [11, 12] suggest that employing even deeper networks could be quite promising for learning more complex or highly varying functions. However, it has been observed that training models beyond a few layers results in optimization difficulty [13, 14]. In this work, for the sake of clear terms, we refer to models with 2–10 hidden layers as ‘deep networks’, models with more than 10 hidden layers as ‘very deep networks’, and use the term ‘deep architecture’ to refer interchangeably to a deep network or a very deep network. We consider the effective training of very deep networks; that is, simultaneously overcoming the optimization problems associated with increased model depth and, more importantly, improving generalization performance. We take inspiration from an earlier work which employed residual learning for training very deep networks [14]. However, training very deep models with millions of parameters comes at the price of over-fitting. On one hand, various explicit regularization schemes such as the \(L^1\)-norm, \(L^2\)-norm and max-norm can be employed for alleviating this problem. On the other hand, a more appealing approach is to explore some form of implicit regularization, such as reducing the co-adaptation of model units on one another for feature learning (or activations) [19] and encouraging stochasticity during optimization [8]. In this work, we advance in this direction by modifying residual learning so as to implicitly improve model regularization through emphasizing stochasticity during training. Our contribution is a modified residual learning scheme for training very deep networks in which we allow shortcut connections of identity mappings from the input to the hidden layers; such shortcut connections are stochastically removed during training. The proposed training scheme is shown to improve the implicit regularization of very deep networks as compared to conventional residual learning. We employ our proposed approach for extensive experiments on the USPS and MNIST datasets; the results obtained are quite promising and competitive with state-of-the-art results.

The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 provides background on very deep models and residual learning. Section 4 describes the proposed model. Section 5 presents experiments, results and discussion on benchmark datasets. In Sect. 6, we conclude the work with our key findings.

2 Related Work

The optimization difficulty observed in training very deep networks can be attributed to the fact that input features get diluted as they pass from the input layer through the many compositional hidden layers to the output layer; this is evident in that each layer in the model performs some transformation on the input received from the preceding layer. The many transformations accumulated with model depth may make features not reusable. Here, one can conjecture that the signals (data features) which reach the output layer for error computation may be significantly less informative for effective weight updates (or corrections). Many works have provided interesting approaches for alleviating the problem of training deep architectures. In [15, 16], carefully guided initializations were considered for specific activation functions; these initializations were found useful for improving model optimization and convergence rates. In another interesting work [17], batch normalization was proposed for tackling the problem of internal covariate shift which arises from non-zero-mean hidden activations. Nevertheless, the problem of training (optimizing) very deep networks commonly arises when the number of hidden layers exceeds 10; see Fig. 1. To address this, Srivastava et al. [13] employed transform gates for routing data through very deep networks; they refer to their model as a highway network. The concept is that the transform gates are either closed or open. When the transform gates are closed, input data are routed through the hidden layers without transformation; in fact, each hidden layer essentially copies the features from the preceding layer. When the transform gates are open, the hidden layers perform the conventional feature transformations using layer weights, biases and activation functions. Although the highway network was shown to allow the optimization of very deep networks and to improve classification accuracies on benchmark datasets, it comes at the price of learning additional model parameters for the transform gates. In another work, He et al. [14] addressed the problem of feature reuse by using residual learning to alleviate the dilution (or attenuation) of features during forward propagation through very deep networks; they refer to their model as a ResNet. The ResNet was also shown to alleviate the optimization difficulty in training very deep networks. In [33], identity shortcut connections were used for bypassing a subset of layers to facilitate training very deep networks.
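
To make the transform-gate idea concrete, the following is a minimal sketch of a single highway layer in PyTorch; the fully connected form, ReLU transformation and sigmoid gating are assumptions chosen for illustration rather than a faithful reproduction of the architecture in [13].

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """One highway layer: y = T(x) * H(x) + (1 - T(x)) * x."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)  # plain transformation H(x)
        self.gate = nn.Linear(dim, dim)       # transform gate T(x)

    def forward(self, x):
        h = torch.relu(self.transform(x))     # conventional feature transformation
        t = torch.sigmoid(self.gate(x))       # gate value in (0, 1)
        # Closed gate (t -> 0): the layer copies its input forward;
        # open gate (t -> 1): the layer applies the usual transformation.
        return t * h + (1.0 - t) * x
```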

3 Background: Very Deep Models and Residual Learning

3.1 Motivation

We emphasize the problem of training very deep networks using the USPS dataset. Figure 1-left shows the performance of plain deep architectures with different numbers of hidden layers. In particular, it can be seen that the performance of the models dips significantly beyond 10 hidden layers. We further emphasize this problem by going beyond the typical uniform initialization scheme (i.e. Unit_init in Fig. 1) for neural network models; we employ other initialization and training techniques which have been proposed for more effective training of deep models. These techniques include Glorot initialization [15], He initialization [16] and batch normalization [17], shown as Glorot_init, He_init and BN in Fig. 1.

In addition, we investigate this problem using the COIL-20 dataset, which comprises 1,440 samples of different objects from 20 classes. The rationale for using the COIL-20 dataset as a sanity check is twofold: (1) it is a small dataset, hence deep architectures would be expected to easily overfit such training data; (2) the dataset is of much higher dimensionality. Obviously, this training scenario can be seen as an extreme one which indeed favours deep models with enormous numbers of parameters for overfitting the training data. This follows directly from the notions of model complexity and the curse of dimensionality, given high-dimensional input data relative to the number of training data points. However, our experimental results do not support the overfitting intuition; instead, difficulty of model optimization is observed when the number of hidden layers is increased beyond 10; see Fig. 1-right. It can be seen that for both the USPS and COIL-20 datasets, training with batch normalization improved model optimization as depth increased. Nevertheless, model optimization remains a problem with depth increase. However, residual learning [14] has recently been employed for successfully training very deep networks. The idea is to scheme model training such that stacks of hidden layers learn residual mapping functions rather than the conventional transformation functions.
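
For reference, a sketch of how such a plain (non-residual) deep network with varying depth, initialization scheme and optional batch normalization could be built in PyTorch is given below; the hidden-layer width of 300 and the use of fully connected ReLU layers are our own assumptions for illustration, not the exact configuration used to produce Fig. 1.

```python
import torch.nn as nn

def plain_deep_net(in_dim, n_classes, n_hidden_layers,
                   width=300, init="he", batch_norm=False):
    """Plain fully connected network of a chosen depth (no shortcut connections)."""
    layers, dim = [], in_dim
    for _ in range(n_hidden_layers):
        linear = nn.Linear(dim, width)
        if init == "he":                      # He initialization [16]
            nn.init.kaiming_uniform_(linear.weight, nonlinearity="relu")
        elif init == "glorot":                # Glorot initialization [15]
            nn.init.xavier_uniform_(linear.weight)
        layers.append(linear)
        if batch_norm:                        # batch normalization [17]
            layers.append(nn.BatchNorm1d(width))
        layers.append(nn.ReLU())
        dim = width
    layers.append(nn.Linear(dim, n_classes))  # softmax is applied in the loss
    return nn.Sequential(*layers)

# Example: a 20-hidden-layer network for 16x16 USPS digits.
model = plain_deep_net(in_dim=256, n_classes=10, n_hidden_layers=20,
                       init="he", batch_norm=True)
```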

Fig. 1. Performance of deep architectures with depth. Left: Train error on USPS dataset. Right: Train error on COIL-20 dataset. It is seen that optimization becomes more difficult with depth.

3.2 Residual Learning: ResNet

In this subsection, we briefly discuss residual learning as a building block for the model that we propose in this paper. In [14], residual learning was achieved by employing shortcut connections from preceding hidden layers to higher ones. Given an input \(H(x)^{l-1}\) (in block form) from layer l−1 feeding into a stack of a specified number of hidden layers with output \(H(x)^{l}\), the conventional training scheme has the stack of hidden layers learn a mapping function of the form

$$\begin{aligned} H(x)^l = F^l(H(x)^{l-1}), \end{aligned}$$
(1)

whereas the residual learning proposed in [14] uses shortcut connections such that the stack of hidden layers learns a mapping function of the form

$$\begin{aligned} H(x)^l = F^l(H(x)^{l-1})+H(x)^{l-1}, \end{aligned}$$
(2)

where \(H(x)^{l-1}\) is the shortcut connection. The function actually learned by the stack of hidden layers can thus be written as follows

$$\begin{aligned} F^l(H(x)^{l-1})=H(x)^l-H(x)^{l-1}, \end{aligned}$$
(3)

where \(1\le l \le L\) and \(H(x)^0\) is the input data, x; L is the depth of the network.
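
As an illustration of Eq. (2), a minimal residual block might be sketched in PyTorch as follows; the fully connected layers, ReLU activations and the choice of two hidden layers per stack are assumptions for illustration (the original ResNet [14] uses convolutional stacks).

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Stack of two hidden layers with an identity shortcut, as in Eq. (2)."""
    def __init__(self, dim):
        super().__init__()
        self.stack = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
        )

    def forward(self, h_prev):
        # H^l = F^l(H^{l-1}) + H^{l-1}: the stack only has to learn the residual.
        return self.stack(h_prev) + h_prev
```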

This training setup was found to be very effective for training very deep networks, achieving state-of-the-art results on some benchmark datasets [14]. In a follow-up work [18], dropping out the shortcut connections from preceding hidden layers was experimented with; however, convergence problems and unpromising results were reported.

4 Proposed Model

For improving the training of very deep models, we take inspiration from residual learning. Our proposed model incorporates some simple modifications to further improve optimization and generalization capability as compared to the conventional ResNet. We refer to the proposed model as the stochastic residual network (S-ResNet). The proposed training scheme is described below:

(i) There are shortcut connections of identity mappings from the input to the hidden layers of the model; this is in addition to the shortcut connections from preceding hidden layers to higher ones as in conventional ResNets.

(ii) The identity shortcut connections from the input to the hidden layers are stochastically removed during training. Hence, hidden layer units do not always have access to the untransformed input data provided via shortcut connections.

(iii) At test time, all the shortcut connections are present. The shortcut connections are not parameterized and therefore do not require rescaling at test time as in [8, 33].

Fig. 2. (a) Proposed model with shortcut connections from the input to hidden layers. (b) Closer view of the proposed residual learning with a hypothetical stack of two hidden layers.

The proposed scheme for training very deep models is shown in Fig. 2(a); the conventional shortcut connections from preceding hidden layers are shown together with the shortcut connections from the input to the different hidden layers. With the modification that we propose in this work, the transformed output of a stack of hidden layers denoted l, with a shortcut connection from the preceding stack of hidden layers, \(H(x)^{l-1}\), and a shortcut connection from the input x, can be written as follows

$$\begin{aligned} H(x)^l = F^l(H(x)^{l-1})+H(x)^{l-1}+x, \end{aligned}$$
(4)

where \(1\le l \le L\), and \(x = 0\) for \(l=1\) since \(H(x)^0 = x\); \(H(x)^l\), \(F^l(H(x)^{l-1})\), \(H(x)^{l-1}\) and x are of the same dimension. In this work, every residual learning block composes a stack of two hidden layers. For a clearer conception of our proposed model, a single residual learning block of two hidden layers is shown in Fig. 2(b). From Fig. 2(b), assume that the underlying target function to be learned by a hypothetical residual learning block is \(F^l(H(x)^{l-1})\); then, using the aforementioned constraints on l, it learns a residual function of the form

$$\begin{aligned} F^l(H(x)^{l-1})=H(x)^l-H(x)^{l-1}-x. \end{aligned}$$
(5)

For dropout of shortcut connections from the input layer to the stack of hidden layers l, we can write

$$\begin{aligned} F^l(H(x)^{l-1})=H(x)^l-H(x)^{l-1}-D*x, \end{aligned}$$
(6)

where \(D \in \{0,1\}\) and \(D\sim Bernoulli(p_s)\) determines whether x (the shortcut connection from the input) is connected to the stack of hidden layers l; the connection is made with probability \(p_s\), that is, \(P(D=1)=p_s\) and \(P(D=0)=1-p_s\) for \(0 \le p_s \le 1\); and \(*\) denotes the operator that applies the shortcut connection, given the value of D. The conventional dropout probability for hidden units is denoted \(p_h\).
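
A minimal sketch of the proposed residual learning block of Eqs. (4)–(6) is given below, assuming fully connected layers and a stack of two hidden layers; the class and argument names are ours, and the code illustrates the training scheme rather than the exact implementation used in our experiments.

```python
import torch
import torch.nn as nn

class StochasticResidualBlock(nn.Module):
    """Residual block with an additional, stochastically dropped input shortcut."""
    def __init__(self, dim, p_s=0.8):
        super().__init__()
        self.p_s = p_s                         # P(D = 1): input shortcut is connected
        self.stack = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
        )

    def forward(self, h_prev, x):
        out = self.stack(h_prev) + h_prev      # conventional residual term, Eq. (2)
        if self.training:
            # D ~ Bernoulli(p_s), sampled independently for each forward pass.
            d = torch.bernoulli(torch.tensor(self.p_s, device=x.device))
            out = out + d * x                  # Eq. (4) with the stochastic shortcut of Eq. (6)
        else:
            out = out + x                      # all shortcuts present at test time, no rescaling
        return out
```

Following the constraint on Eq. (4), the first block (l = 1) would simply omit the input shortcut, since its input \(H(x)^0\) is already x; the conventional dropout of hidden units with probability \(p_h\) is omitted from the sketch for brevity.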

5 Experiments and Discussion

For demonstrating the effectiveness of our proposed model, we train very deep networks and observe their optimization characteristics over various training settings using the USPS and MNIST datasets. The USPS dataset comprises handwritten digits 0–9 (10 classes), with 7,291 training and 2,007 testing samples, while the MNIST dataset comprises handwritten digits 0–9, with 60,000 training and 10,000 testing samples. For the USPS dataset, we use 2 \(\times \) 2 convolutional filters, 2 \(\times \) 2 max pooling windows and 2 fully connected layers of 300 ReLUs. For the MNIST dataset, we use 3 \(\times \) 3 convolutional filters, 2 \(\times \) 2 max pooling windows and 2 fully connected layers of 500 ReLUs. For both datasets, the models have output layers of 10 softmax units. Our best model, the 54-hidden-layer S-ResNet, comprises 50 convolutional layers, 2 max pooling layers and 2 fully connected layers; we apply batch normalization only in the fully connected layers.
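
For reference, the per-dataset settings described above can be summarized as follows; these dictionaries merely restate the hyperparameters from the text (the variable names are ours) and are not the actual training script.

```python
# Hypothetical summary of the experimental settings described in the text;
# the layer counts correspond to the best, 54-hidden-layer S-ResNet.
CONFIGS = {
    "USPS":  dict(conv_filter=(2, 2), pool_window=(2, 2), fc_units=300,
                  n_conv_layers=50, n_pool_layers=2, n_fc_layers=2, n_classes=10),
    "MNIST": dict(conv_filter=(3, 3), pool_window=(2, 2), fc_units=500,
                  n_conv_layers=50, n_pool_layers=2, n_fc_layers=2, n_classes=10),
}
# Batch normalization is applied only in the fully connected layers.
```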

Figure 3-left shows the performance of our proposed model (S-ResNet) on the USPS dataset with different numbers of hidden layers, at a dropout probability of \(p_s=0.8\) for the input shortcut connections to the hidden layers; for the conventional dropout of hidden units, a dropout probability of \(p_h=0.6\) is used. It can be seen that with 54 hidden layers, our model achieves state-of-the-art performance, that is, an error rate of 2.69%, surpassing the conventional ResNet (baseline model). In addition, Fig. 3-right shows the performance of the best proposed model (the 54-hidden-layer S-ResNet) with different dropout probabilities for the input shortcut connections to the hidden layers. Table 1 shows the error rates obtained on the test data for the USPS dataset along with state-of-the-art results. We note that the models marked with an asterisk (\(*\)) employed some form of data augmentation (or manipulation). For example, [26, 27] extended the training dataset with 2,400 machine-printed digits, while [28] employed virtual data in addition to the original training data. Our proposed model employs no such data augmentation tricks. The result obtained with our proposed model, the 54-hidden-layer S-ResNet, surpasses many works which did not employ any form of data augmentation.

Fig. 3. Performance of deep architectures with depth on the USPS dataset. Left: Test error rate with depth. Right: Test error rate for different dropout probabilities of input shortcut connections.

Table 1. Error rate (%) on the USPS dataset
Fig. 4. Performance of deep architectures with depth on the MNIST dataset. Left: Test error rate with depth. Right: Test error rate for different dropout probabilities of input shortcut connections.

Table 2. Error rate (%) on the MNIST dataset

We repeat similar experiments on the MNIST dataset. Figure 4-left shows the error rates of the S-ResNets and the conventional ResNets with different numbers of hidden layers. It is observed that the S-ResNets are better regularized than the ResNets at all model depths. In particular, with 54 hidden layers, the S-ResNet achieves a result competitive with the state of the art; we reach an error rate of 0.52%. Figure 4-right shows the error rates of the 54-hidden-layer S-ResNet with different dropout probabilities for the input shortcut connections to the hidden layers. In Table 2, we report the error rates obtained in our experiments, along with the best results reported in recent works. For the MNIST dataset, we also found that dropping out input shortcut connections to the hidden layers with a probability of 0.8 yielded the best result, as given in Table 2. For both datasets, the S-ResNets employed no explicit regularization technique for improving generalization capability; we relied on the implicit regularization of the models via dropout of input shortcut connections and hidden units for the S-ResNet, and dropout of hidden units only for the ResNet. It is interesting to note that the proposed model does not suffer from the convergence problems reported in an earlier work which experimented with a similar training scheme [18]. In addition, the experimental results given in Tables 1 and 2 suggest that the proposed training scheme improves the implicit regularization of very deep networks; that is, lower test errors are achieved by the S-ResNets as compared to the ResNets. We conjecture that the simple modification employed in the proposed model helps to reduce the reliance of model units in one layer on others for feature learning. We observe that [8] also reported an error rate of 0.21%; however, [8] employed some form of data augmentation using an ensemble of 5 neural networks; without data augmentation, they obtained a test error rate of 0.52%. Conversely, we employ neither data augmentation nor model ensembles.

6 Conclusion

Very deep networks suffer optimization problems even in situations that would otherwise favour over-fitting. Furthermore, even when very deep networks can be optimized, over-fitting is almost always inevitable due to their large model capacity. We address the aforementioned problems by taking inspiration from residual learning. Our proposed model, the stochastic residual network (S-ResNet), employs stochastic shortcut connections from the input to the hidden layers to improve the implicit regularization of very deep models. Experimental results on benchmark datasets validate that the proposed approach improves implicit regularization of very deep networks as compared to conventional residual learning.