
1 Introduction

In the past five years, deep learning [1] has made gratifying achievements in various computer vision tasks [2, 3]. With the rapid development of deep learning and Convolutional Neural Networks (CNNs), image classification has moved beyond coarse, manually extracted features and entered a new stage. In particular, after AlexNet [4] won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 [5], CNNs have become deeper and deeper and continue to achieve better performance on various computer vision tasks.

To overcome the degradation problem, a residual learning framework named Residual Networks (ResNets) was developed [8] to ease network training, and it achieved excellent results on the ImageNet test set. Since then, state-of-the-art image classification systems have predominantly been variants of ResNets. Residual networks of Residual networks (RoR) [13] adds level-wise shortcut connections to the original residual networks to promote their learning capability. The rectified linear unit (ReLU) [16] has been adopted by most convolutional neural networks. ReLU outputs are non-negative, so their mean activation is larger than zero, which causes a bias shift for units in subsequent layers. Furthermore, the choice of activation function in current DCNN models does not take the differences between datasets into account: image datasets differ in the variety and quality of their images, and a single fixed activation function limits image classification performance.

To effectively solve the above problem, this paper proposes an RoR network optimization method. To begin with, we analyze the characteristics of three activation functions (ReLU, ELU and PELU) and construct RoR networks with each of them; on this basis, an RoR optimization method tailored to different datasets is proposed. In addition, analysis of the characteristics of RoR networks suggests two modest mechanisms, stochastic depth and RoR-WRN, to further increase image classification accuracy. Finally, through extensive experiments on the CIFAR datasets, our optimized RoR models achieve excellent results.

2 Related Work

Since AlexNet acquired a celebrated victory at the ImageNet competition in 2012, an increasing number of deeper and deeper Convolutional Neural Networks have emerged, such as the 19-layer VGG [6] and the 22-layer GoogLeNet [7]. However, very deep CNNs also introduce new challenges: the degradation problem, vanishing gradients in backward propagation, and overfitting [15].

In order to overcome the degradation problem, a residual learning framework known as ResNets [8] was presented at the 2015 ILSVRC & COCO competitions and achieved excellent results on the ImageNet test set. Since then, a series of optimized models based on ResNets has emerged, forming the Residual-Networks family. Huang et al. [10] proposed a drop-path method, stochastic depth residual networks (SD), which randomly drops a subset of layers and bypasses them with identity mappings for every mini-batch. To tackle the problem of diminishing feature reuse, wide residual networks (WRN) [11] were introduced, decreasing the depth and increasing the width of residual networks. Residual networks of Residual networks (RoR) [13] adds level-wise shortcut connections to the original residual networks to promote their learning capability, and once achieved state-of-the-art results on CIFAR-10 and CIFAR-100 [12]. Each layer of DenseNet [14] is directly connected to every other layer in a feed-forward fashion. PyramidNet [27] gradually increases the feature map dimension at all units to involve as many locations as possible. ResNeXt [26] exposes a new dimension called cardinality (the size of the set of transformations) as an essential factor in addition to the dimensions of depth and width.

Even though the non-saturating ReLU has interesting properties, such as sparsity and a non-contracting first-order derivative, its non-differentiability at the origin and zero gradient for negative arguments can hurt back-propagation [17]. Moreover, its non-negativity induces a bias shift, causing oscillations and impeded learning. Since the advent of the well-known ReLU, many have tried to further improve network performance with more elaborate functions. The exponential linear unit (ELU) [17], defined as the identity for positive arguments and \( \exp(x) - 1 \) for negative ones, deals with both the increased variance and the bias shift problem. Parametric ELU (PELU) [18] is an adaptive activation function that defines parameters controlling different aspects of the function and learns them with gradient descent during training.

3 Methodology

In this section, three activation functions (ReLU, ELU and PELU) are applied to RoR, providing more effective optimization methods for different datasets. We then add a drop-path method to alleviate overfitting and widen RoR by adding filters to mitigate vanishing gradients.

3.1 Comparative Analysis of ReLU, ELU and PELU

The characteristics and performance of several commonly used activation functions (ReLU, ELU and PELU) are compared and analyzed as follows.

ReLU is defined as:

$$ f(h) = \begin{cases} h & \text{if } h \ge 0 \\ 0 & \text{if } h < 0 \end{cases} $$
(1)

It can be seen that ReLU saturates for \( h < 0 \). Since its derivative is 1 for \( h \ge 0 \), ReLU keeps the gradient from attenuating when \( h > 0 \), thus alleviating the vanishing gradient problem. However, ReLU outputs are non-negative, so the mean of the outputs is greater than zero, which causes a bias shift for units in the next layer during learning. The more the units are correlated, the more serious the bias shift.

ELU is defined as:

$$ f(h) = \begin{cases} h & \text{if } h \ge 0 \\ \alpha\,(\exp(h) - 1) & \text{if } h < 0 \end{cases} $$
(2)

ELU combines properties of the sigmoid and ReLU, with soft saturation on the left. The hyperparameter \( \alpha \) controls the value to which an ELU saturates for negative net inputs. ELUs diminish the vanishing gradient effect just as ReLUs do. Because of the saturated negative part, the network can no longer produce arbitrarily large negative outputs, which reduces variance. Moreover, since ELU outputs negative values for negative arguments, the network can push the mean activation toward zero, which reduces the bias shift.

PELU can be defined as follows:

$$ f(h) = \begin{cases} \frac{a}{b}h & \text{if } h \ge 0 \\ a\,(\exp(h/b) - 1) & \text{if } h < 0 \end{cases}, \quad a, b > 0 $$
(3)

With this parameterization, \( a \) and \( b \) adjust the characteristics of the exponential function on the negative half-axis, controlling the exponential decay and the saturation point, and they also adjust the slope of the linear part so that the function remains differentiable at the origin. The parameters of PELU are updated during back-propagation at the same time as the parameters of the network's weight layers.
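To make the three activation choices concrete, the following PyTorch sketch implements PELU as in Eq. (3); ReLU and ELU are available directly as nn.ReLU and nn.ELU. The initialization of a and b and the positive-clamping floor are illustrative assumptions, not values taken from this paper.

```python
import torch
import torch.nn as nn

class PELU(nn.Module):
    """Parametric ELU, Eq. (3): (a/b)*h for h >= 0, a*(exp(h/b) - 1) for h < 0."""
    def __init__(self, init_a=1.0, init_b=1.0):
        super().__init__()
        # a and b are learned by gradient descent together with the weight layers.
        self.a = nn.Parameter(torch.tensor(init_a))
        self.b = nn.Parameter(torch.tensor(init_b))

    def forward(self, h):
        a = self.a.clamp(min=0.1)  # keep a, b strictly positive (illustrative floor)
        b = self.b.clamp(min=0.1)
        pos = (a / b) * torch.relu(h)                             # linear part, slope a/b
        neg = a * (torch.exp(torch.clamp(h, max=0.0) / b) - 1.0)  # saturating part, 0 for h >= 0
        return pos + neg

# ReLU and ELU need no extra learned parameters: nn.ReLU(), nn.ELU(alpha=1.0)
```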

3.2 RoR Networks with ReLU, ELU and PELU

RoR [13] is based on a hypothesis: the residual mapping of a residual mapping is easier to optimize than the original residual mapping. To enhance the optimization ability of residual networks, RoR optimizes the residual mapping of the residual mapping by adding shortcuts level by level on top of the residual networks.

Figure 2 in [13] shows the RoR (ReLU) architecture. The optimal model in [13] is the 3-level RoR, so we adopted the 3-level RoR (RoR-3) as the basic architecture in our experiments.

RoR-3 includes \( 3n \) final residual blocks, 3 middle-level residual blocks, and a root-level residual block. Each middle-level residual block is composed of \( n \) final residual blocks and a middle-level shortcut, and the root-level residual block is composed of the 3 middle-level residual blocks and a root-level shortcut. The projection shortcuts are implemented by 1 × 1 convolutions. RoR (ReLU) adopts a Conv-BN-ReLU order in its residual blocks.

To exploit the saturation advantage of ELU and PELU, we designed new RoR architectures adopting ELU and PELU, as shown in Fig. 1. The sequence of layers in each residual block is Conv-PELU/ELU-Conv-BN; the batch normalization (BN) layer reduces the exploding gradient problem. We use 16, 32, and 64 convolutional filters sequentially in the convolutional layers of the three residual block groups, as shown in Fig. 1. The other architectural details are the same as those of RoR (ReLU). A minimal code sketch of such a residual block follows Fig. 1.

Fig. 1. RoR (ELU/PELU) architecture. The 1-level shortcut is a root-level shortcut, and the remaining three 2-level shortcuts are middle-level shortcuts. The shortcuts in residual blocks are final-level shortcuts.
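Below is a minimal sketch of a final-level residual block in the Conv-ELU/PELU-Conv-BN order described above (using the illustrative PELU module from Sect. 3.1 if desired); the stride handling and the 1 × 1 projection shortcut are simplified assumptions and may differ in detail from the authors' Torch implementation.

```python
import torch.nn as nn

class RoRBlock(nn.Module):
    """Final-level residual block with the Conv-ELU/PELU-Conv-BN ordering."""
    def __init__(self, in_ch, out_ch, stride=1, act=None):
        super().__init__()
        self.act = act if act is not None else nn.ELU()  # or PELU() for RoR (PELU)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        # 1x1 projection shortcut when the shape changes, identity otherwise.
        self.shortcut = (nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
                         if stride != 1 or in_ch != out_ch else nn.Identity())

    def forward(self, x):
        out = self.bn(self.conv2(self.act(self.conv1(x))))
        return out + self.shortcut(x)  # final-level shortcut
```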

As shown in the previous section, the output of ReLU is greater than or equal to zero, which makes the RoR network exhibit a bias shift during training; this bias shift directly limits the image classification performance of the RoR network. ELU and PELU output negative values for negative arguments, which allows the RoR network to push the mean activation toward zero. This reduces the bias shift and thus improves the performance of RoR. Furthermore, because ELU and PELU saturate as the input becomes more negative, neurons cannot produce arbitrarily large negative outputs while still keeping a proper weighted sum, so variance is reduced. RoR (ELU/PELU) is therefore more robust to its inputs than RoR (ReLU).

3.3 RoR Optimization Method

PELU adopts a parameter updating mechanism that makes RoR (PELU) more flexible during training. Given sufficient images in the dataset, ELU in RoR (ELU) has a fixed saturation point and exponential decay for negative arguments and a fixed slope for positive arguments, so a deep RoR (ELU) can more easily get stuck in the saturated region during training, where the gradient is almost zero, than RoR (PELU) [18]. For an RoR (PELU) network containing one neuron in each of its \( L \) layers, the gradient of the loss with respect to a weight at layer \( l \) is expressed as (4):

$$ \frac{\partial E}{\partial w_{l}} = h_{l-1}\, f'(h_{l}) \left[ \prod\limits_{j = l+1}^{L} f'(h_{j})\, w_{j} \right] \frac{\partial E}{\partial z_{L}} $$
(4)

where \( h_{0} \) is the input of the network, the output of the \( l \)-th layer is \( z_{l} = f(h_{l}) \) with \( h_{l} = w_{l} h_{l-1} \), and the loss function \( E = \ell(z_{L}, y) \) measures the discrepancy between the network prediction \( z_{L} \) and the label \( y \). If \( \partial E/\partial w_{l} \) tends to zero, gradients vanish and the network becomes difficult to train. One way to overcome this is to ensure that

\( f'(h_{j})w_{j} \equiv f'(h_{j-1} w_{j})w_{j} \ge 1 \), which means \( f'(wh)w \ge 1 \). Substituting the PELU definition, we obtain:

$$ f'(wh)\,w = \begin{cases} \frac{a}{b}w & \text{if } h \ge 0 \\ \frac{a}{b}w\,\exp(wh/b) & \text{if } h < 0 \end{cases}, \quad a, b > 0 $$
(5)

The vanishing gradient problem is controlled when \( f'(wh)w \ge 1 \) is satisfied. If \( h \ge 0 \), then \( w \ge b/a \) must hold; if \( h < 0 \), then \( \frac{a}{b}w\,\exp(wh/b) \ge 1 \) must hold, which gives \( \left| h \right| \le l(w) = \left| \log(b/(aw)) \right|(b/w) \). The bound \( l(w) \) reaches its maximum value \( a\,\exp(-1) \) at \( w = \exp(1)\,b/a \), which means \( l(w) \le a\,\exp(-1) \).

For RoR (ReLU) and RoR (ELU), countering vanishing gradients is mostly possible with positive activations, which causes the bias shift. For RoR (PELU), however, \( a \) can be adjusted to increase \( a\,\exp(-1) \) and allow more negative activations \( h \) to counter vanishing gradients. Moreover, \( a \) and \( b \) can be adjusted so that \( w \ge b/a \) is satisfied, which makes RoR (PELU) more flexible in training and helps eliminate vanishing gradients.
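As a quick sanity check on this bound, the short Python sketch below (with arbitrary illustrative values of a and b, not taken from the paper) numerically confirms that \( l(w) \) peaks at \( w = \exp(1)b/a \) with maximum value \( a\exp(-1) \).

```python
import numpy as np

a, b = 1.5, 1.0                                  # illustrative PELU parameters
w = np.linspace(b / a + 1e-6, 10.0, 200_000)     # region w > b/a where the bound is positive
l_w = (b / w) * np.abs(np.log(b / (a * w)))      # l(w) = |log(b/(a*w))| * (b/w)

print(w[np.argmax(l_w)], np.exp(1) * b / a)      # empirical vs. analytic maximizer
print(l_w.max(), a * np.exp(-1))                 # maximum of l(w) vs. a*exp(-1)
```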

However, RoR (PELU) adds extra parameters to the activation function compared with RoR (ReLU) and RoR (ELU), and updating these parameters layer by layer makes the network model more complex. We believe that the parameters in the activation function have a greater impact on performance than the parameters of other weight layers. Thus, when the number of images in each category is relatively small, RoR (PELU) is more likely to overfit during training than RoR (ELU), and ELU performs better under such conditions. Based on the above analysis, RoR (PELU) and RoR (ELU) complement each other across different image classification datasets, so we propose an optimization method for RoR based on the activation function:

For RoR, datasets with more images per class (such as CIFAR-10) should be optimized with PELU, adopting the RoR (PELU) structure, while datasets with relatively fewer images per class (such as CIFAR-100) should be optimized with ELU, adopting the RoR (ELU) structure.

3.4 Stochastic Depth and Depth and Width Analysis

Overfitting and vanishing gradients are two challenging issues for RoR, and both have a strongly negative impact on image classification performance. In this paper, to alleviate overfitting, we trained RoR with a drop-path method and obtained an apparent performance boost, and we mitigated vanishing gradients by appropriately widening the network.

While adding extra shortcuts, RoR widens the network and adds more training parameters, which can lead to more serious overfitting. We therefore used the stochastic depth (SD) algorithm, commonly used in residual networks, to alleviate overfitting: during training, we randomly drop entire residual blocks and bypass their transformations through the shortcuts, performing no forward-backward computation or gradient update for the dropped blocks. Let \( p_{l} \) denote the probability that the residual mapping branch of the \( l \)-th residual block is kept, and let \( L \) be the number of residual blocks; (6) shows that \( p_{l} \) decreases linearly with the position of the residual block, where \( p_{L} \) is the survival probability of the last residual block. SD effectively prevents overfitting and reduces training time.

$$ p_{l} = 1 - \frac{l}{L}\left(1 - p_{L}\right) $$
(6)
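A minimal sketch of stochastic depth with this linear decay rule is given below; `block` is assumed to compute only the residual branch \( F(x) \) (the shortcut is added by the wrapper), and the rescaling by \( p_l \) at test time follows the stochastic depth approach [10].

```python
import torch
import torch.nn as nn

def survival_prob(l, L, p_L=0.5):
    """Eq. (6): p_l = 1 - (l / L) * (1 - p_L)."""
    return 1.0 - (l / L) * (1.0 - p_L)

class StochasticDepth(nn.Module):
    """Randomly drops the residual branch of a block during training."""
    def __init__(self, block, p_survive):
        super().__init__()
        self.block = block      # residual branch F(x) only
        self.p = p_survive

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.p:
                return x + self.block(x)   # keep the branch for this mini-batch
            return x                       # drop it: identity shortcut only
        return x + self.p * self.block(x)  # expected transformation at test time
```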

With the basic architecture fixed, the main way to improve network performance is to enlarge the model by deepening the network; however, blindly increasing the depth leads to worse vanishing gradients. WRN [11] instead increases the width of residual networks: compared with blindly deepened networks of the same order of magnitude, it achieves better performance while avoiding the vanishing gradient problem. Based on this idea, we increased the number of channels of the convolutional layers in the RoR residual blocks from {16, 32, 64} in the original network to \( \{16 \times k, 32 \times k, 64 \times k\} \). Increasing the dimension of the feature maps extracted by the residual blocks widens the network, keeps it from becoming too deep, and further controls the vanishing gradient problem. A widened RoR network is denoted RoR-WRN.

4 Experiment

In order to analyze the characteristics of the three kinds of networks (RoR (ReLU), RoR (ELU), and RoR (PELU)) and to verify the effectiveness of the optimization scheme, extensive experiments were conducted. The implementation and results follow.

4.1 Implementation

In this paper, we used RoR for image classification on two image datasets, CIFAR-10 and CIFAR-100. CIFAR-10 is a dataset of 60,000 32 × 32 color images in 10 classes of natural scene objects, with 6,000 images per class. CIFAR-100 is likewise a dataset of 60,000 32 × 32 color images, but with 100 classes of 600 images each. For both, the training set and test set contain 50,000 and 10,000 images, respectively. Our implementations were based on Torch 7 with a Titan X. We initialized the weights as in [19]. In both the CIFAR-10 and CIFAR-100 experiments, we used SGD with a mini-batch size of 128 for 500 epochs. The learning rate started from 0.1 and was reduced to 0.01 after epoch 250 and to 0.001 after epoch 375. For the SD drop-path method, we set \( p_{l} \) with the linear decay rule using \( p_{0} = 1 \) and \( p_{L} = 0.5 \). In the RoR-WRN experiments, we set the number of convolution kernels to \( \{16 \times k, 32 \times k, 64 \times k\} \) instead of {16, 32, 64} in the original networks. Other architectures and parameters were the same as RoR's. Because the amount of data is limited, two data augmentation techniques were adopted: random sampling and horizontal flipping.
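The optimizer and learning rate schedule above can be sketched as follows; the stand-in model, momentum, and weight decay values are illustrative assumptions (they are not stated in this section), and `train_one_epoch` is a hypothetical helper.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())  # stand-in for an RoR network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)       # momentum/decay assumed
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[250, 375], gamma=0.1)

for epoch in range(500):                  # 500 epochs; mini-batch size 128 set in the data loader
    # train_one_epoch(model, optimizer)   # hypothetical per-epoch training step
    scheduler.step()                      # lr: 0.1 -> 0.01 after 250 -> 0.001 after 375
```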

4.2 110-Layer RoR Experiments

Three types of 110-layer RoR were used for classification on CIFAR-10/100; the classification error rates with and without the SD algorithm are shown in Figs. 2 and 3. The results show that RoR+SD obtains better results than RoR without SD, indicating that SD can effectively alleviate overfitting and improve network performance. Therefore, in the subsequent experiments, we trained RoR with the SD drop-path method. The results of RoR (ELU) and RoR (PELU) on CIFAR-10/100 are better than those of RoR (ReLU): RoR (PELU) obtained the lowest test error on CIFAR-10, while RoR (ELU) obtained the lowest test error on CIFAR-100.

Fig. 2. Test Error (%) on 110-layer RoR

Fig. 3. Test Error (%) on 110-layer RoR

These experimental results validate the effectiveness of the proposed optimization method. We believe that during training some ReLU inputs fall into the hard saturation region, so the corresponding weights cannot be updated. In addition, the ReLU output exhibits the offset phenomenon: its mean value is greater than zero, which affects the convergence of the CNN. Using ELU and PELU, which soft-saturate on the left, makes the network more robust to its inputs, and we therefore obtained better results.

4.3 Depth and Width Experiments

In order to further optimize the model, we enlarged the network in both width and depth. The three types of RoR with 38 layers and k = 2 (RoR-WRN38-2+SD) and with 56 layers and k = 4 (RoR-WRN56-4+SD) were used for image classification experiments on CIFAR-10/100; the classification test errors are shown in Tables 1 and 2. The results show that, when widened and deepened appropriately, all three types of RoR improve in performance, and the relative comparison results remain essentially consistent with those of the 110-layer networks: RoR (PELU) obtained the lowest classification test error on CIFAR-10, while RoR (ELU) obtained the best classification results on CIFAR-100. These results further confirm the optimization method we developed.

Table 1. Test Error (%) on RoR-WRN38-2+SD
Table 2. Test Error (%) on RoR-WRN56-4+SD

4.4 Results Comparison of the Best Model Classification

Table 3 compares our optimized RoR models with state-of-the-art methods on CIFAR-10/100. It can be seen from Table 3 that the classification test errors of RoR-WRN56-4+SD (PELU) and RoR-WRN56-4+SD (ELU) on CIFAR-10/100 are better than those of the original RoR-WRN56-4+SD (ReLU), which proves the effectiveness of the proposed scheme. The experimental results also show that the optimized RoR adds almost no computational cost while achieving better classification results than networks of the same depth and width. In view of the good performance of the optimized model, we trained a deeper RoR-WRN74-4+SD model to obtain the optimal model, which achieves state-of-the-art results on CIFAR-10. On CIFAR-100, RoR-WRN74-4+SD likewise achieves state-of-the-art results among models with a comparable number of parameters. Although ResNeXt and PyramidNet obtained lower error rates on CIFAR-100, their numbers of parameters are much larger than that of our best model.

Table 3. Test Error (%) on CIFAR-10/100 for different methods

5 Conclusion

In this paper, we put forward an optimization method for Residual Networks of Residual Networks (RoR) by analyzing the performance of three activation functions, and we achieved excellent image classification results on CIFAR-10 and CIFAR-100. The experimental results show that the optimized RoR gives more control over bias shift and vanishing gradients and attains excellent image classification performance.