1 Introduction

Relying on ultra-large-scale parameters (Lecun et al. 2015), deep neural networks have become flexible function approximators and have been very successful in a wide range of tasks, such as industrial automation control (Miao and He 2017), computer-aided diagnosis (Shin et al. 2016; Samala et al. 2019) and financial data analysis (Jang and Lee 2018; Jie and Wang 2017). With the development of deep learning, the structures of deep neural networks have become more and more complex. Inevitably, deep neural networks require high computational costs in both the training and testing phases, which is one of the most important factors restricting their practical application in consumer electronics. For mobile and embedded devices (Gutierrez-Galan et al. 2018), the inference speed and file size of a model are critical. The depth, size and amount of computation, as well as the memory overhead at runtime, are increasing rapidly, which makes it difficult to deploy deep networks on mobile terminals or embedded devices with limited hardware resources and strict real-time requirements, such as autonomous vehicles (Li et al. 2018). Therefore, how to reduce network size and inference time while preserving performance, and thus promote applications in consumer electronics, is a hot topic of current research. There has been great interest in reducing the redundancy of deep neural networks to achieve model compression and acceleration (He et al. 2017; Frankle and Carbin 2019; Zoph et al. 2018). Lightweight neural networks (Xu et al. 2019; Kim et al. 2016; Zheng et al. 2017) are an important direction of future development, and network pruning (Huang and Wang 2018; Yu et al. 2018; Liu et al. 2017) is one of their key enabling technologies.

Pruning a neural network means reducing or controlling the number of non-zero parameters or feature maps used by the model. Pruning can be seen as a structural exploration that finds out how many parameters or feature maps are really needed in each layer to obtain the best performance. In fact, researchers found that only about 4% of the parameters need to be updated during back-propagation (Molchanov et al. 2017). By sparsifying deep neural networks, we can avoid unnecessary computation and resource usage, since irrelevant degrees of freedom are pruned away and do not need to be computed. Another benefit is that by reducing the number of parameters (i.e., the redundancy in the parameter space), the generalization ability of the neural network can even be improved. As shown in recent research on the generalization ability of deep neural networks (Painsky and Rosset 2016; Hu et al. 2017; Zheng et al. 2018), the raw number of parameters (the L0-norm) does not actually predict performance. In other words, experimental results show that pruning a network in a prudent manner can even help improve generalization. At the same time, new parameter-based measures, such as the Fisher-Rao norm (Goh et al. 2014), are being developed to predict and describe generalization ability. Interestingly, the Fisher pruning algorithm has been shown to correlate well with the Fisher–Rao norm (Tian et al. 2017), which suggests a deeper, not yet fully understood relationship between network pruning, parameter redundancy and generalization.

Pruning parameters is conceptually simple but practically challenging, because removing parameters in one layer can dramatically impair the input of the following layers. Our goal is to reduce the computational cost of network inference, especially in the transfer-learning setting: when a pre-trained model is fine-tuned, it inherits the large amount of computation that was used to solve the original task, which may be superfluous for the target task. On the other hand, small models gradually pruned from a dense neural network usually yield much better results than the same small models trained directly from scratch, which suggests that the value of automatic pruning methods may lie in identifying efficient structures and performing an implicit architecture search rather than merely selecting “important” weights. In practice, the denser network helps to avoid poor local minima and provides better initializations, which are critical for the sparser network to learn effective representations (Kim et al. 2016).

In the main part of this paper, to accelerate network inference, we propose a holistic pruning method named Drop-path to reduce the model parameters of 2D deep convolutional neural networks, exploiting the inter-parameter redundancy of each layer under the PAC-Bayesian framework. The whole process alternates between two steps: pruning and fine-tuning. Given a trained deep CNN, paths are pruned by ranking the influence of the neurons in each layer on the PAC-Bayesian boundary of the model. We believe that keeping the PAC-Bayesian boundary as invariant as possible is an important factor in preserving the generalization ability of the deep CNN while pruning as much as possible. To the best of our knowledge, this is the first work to reduce model size based on a generalization error boundary. In the pruning step, parameters in paths with different lengths are removed, as shown in Fig. 1. It is worth noting that most convolutional kernels become sparse, rather than some convolutional kernels being completely removed, and they act like the Atrous convolution operation (Yu and Koltun 2015). In the fine-tuning step, the new network, initialized by the previous model, is optimized to retain its classification capability. The two steps are alternated to achieve a trade-off between model size and performance. The pruning ratio in Drop-path is a pre-defined hyper-parameter that can be set according to the actual needs of specific applications, e.g., accuracy, memory, and floating-point operations per second (FLOPs). Drop-path is generic and generalizes well to multi-layer and multi-branch models, since parameter ranking can be applied to any kind of layer and the importance scores can still be propagated. Compared with layer-by-layer pruning and retraining, performed independently (Yang et al. 2017) or greedily (Li et al. 2017), the reasons for the cross-layer removal of connections in different paths are as follows:

  • For deep neural networks, pruning and fine-tuning the model as a whole greatly reduces training time.

  • Pruning parameters across layers gives a holistic view of the robustness of the model, resulting in a smaller network.

  • The overall consideration is necessary, as the pruning of bottom and top layers may affect each other. For example, for the residual network (ResNet), pruning the identity mapping layers or the second layer of each residual block results in additional pruning of subsequent layers.

Finally, we perform various experiments to demonstrate the effectiveness of the proposed Drop-path method. Eight popular deep CNNs [e.g., AlexNet (Krizhevsky et al. 2012), VGG-16 (Simonyan and Zisserman 2014), GoogLeNet (Szegedy et al. 2015) and ResNet (He et al. 2016)] trained on ImageNet (Russakovsky et al. 2015) and CIFAR-10 (Krizhevsky and Hinton 2009) achieve about a 2× inference speed-up with no more than a 1% increase in classification error. In this way, we show that Drop-path can accelerate network inference in practical implementations without seriously hurting the performance of deep CNNs.

Fig. 1 An illustration of the Drop-path method running in a simple neural network model. Paths with different lengths, e.g., the green and yellow paths, are gradually pruned during the training process. The dotted lines represent the connections that are removed

The rest of this paper is organized as follows. Section 2 gives a brief introduction to related work on network pruning methods. Section 3 introduces the PAC-Bayesian framework based Drop-path method. Extensive experiments and analyses are presented in Sect. 4. Some properties and conjectures of the Drop-path method are discussed in Sect. 5. Finally, we conclude our work and outline future directions in Sect. 6.

2 Related work

Heavy-duty deep neural networks are difficult to apply to mobile terminals and embedded devices, which require both intelligence and real-time performance in real-life settings. Compression and acceleration of deep neural networks have recently drawn much attention in the deep learning community. Recent studies (Kim et al. 2016; Srinivas and Babu 2016; Han et al. 2015) have investigated the significant redundancy (parameters that are unimportant or meaningless in the inference process of a deep neural network) in deep neural networks and reduced the number of neurons and filters by pruning the unimportant ones.

In general, there are two common understandings behind the pruning process. First, it is important to train a large and over-parameterized model, which provides strong representation and optimization capabilities, so that redundant parameters can be removed safely without significant damage to accuracy. Therefore, it is generally accepted that this approach is more effective than directly training a smaller network from scratch. Second, the pruned architecture and the associated weights are considered the key to efficient performance. Therefore, most existing pruning techniques choose to fine-tune the pruned model instead of training it from scratch. The weights retained after pruning are often considered critical (Wang et al. 2018), because it is difficult to accurately select an important set of weights from the structural space.

Measuring the importance of neurons or connections in a network exactly is very hard given the high degree of non-linearity. Some previous works (Molchanov et al. 2017; Theis et al. 2018; Luo et al. 2017) approximate it using a 2nd-order Taylor expansion. Our work is a different approximation based on the PAC-Bayesian framework. Bolukbasi et al. (2017) regularize runtime in convolutional layers to determine whether some kernels or weights can be bypassed. Sparsity regularization terms (Figurnov et al. 2016) have been used to learn sparse CNN structures. Han et al. (2016) propose to learn important connections and perform network pruning based on the weights of network connections. Denton et al. (2014) apply singular value decomposition to neural networks to preserve important connections. meProp (Sun et al. 2017) uses approximate gradients by keeping only the top-k elements based on their magnitudes. Torfi and Shirvani (2018) propose group sparsity to constrain the effective parameters, with an extra loss term that keeps some parameters from becoming sparse in order to preserve network performance. The above methods prune the “least important” neurons layer-by-layer, either independently or greedily, without jointly considering weights in different layers or the influence of error propagation through the deep network. In fact, one problem with such methods is that neurons deemed unimportant in an early layer can contribute significantly to the responses of important neurons in later layers. Therefore, the neurons in the network must be pruned as a whole according to a unified goal.

Huang and Wang (2018) add a regularization term to structural blocks or neural nodes to regularize the network structure and control the output; units are adjusted while the whole structure is processed, making the final neural network model cleaner. Yu et al. (2018) determine the importance of each channel in the back-propagation process, and then select the important parts based on the Inf-FS method. Sau and Balasubramanian (2016) use a noisy network to guide the training of another small network. Liu et al. (2017) add a scaling factor to each channel of the convolution kernel, which is regularized during training; controlling each channel is equivalent to controlling the convolution kernels of the corresponding cube. Sun et al. (2016) propose iteratively learned sparse ConvNets and retrain the entire model with the initial weights learned in previous iterations. Targeted Dropout (Gomez et al. 2018) incorporates the post hoc pruning strategy into the training process and has no significant influence on the potential performance of a particular architecture. Since the estimation of the importance of neurons (i.e., weights) is based on the state before Dropout, such estimation errors may accumulate during optimization and eventually lead to divergent results. In addition, there are specially designed small-scale networks that can be trained directly with outstanding performance, such as ShuffleNet (Zhang et al. 2018), Xception (Chollet 2017), SqueezeNet (Iandola et al. 2017), and MobileNet (Howard et al. 2017). The introduction of additional prior knowledge (or the manual removal of redundancy) makes these frameworks less applicable to a variety of visual tasks, so they are difficult to integrate into mobile-end visual processing systems.

3 Drop-path method based on PAC-Bayesian framework

In this section, we present the Drop-path method under PAC-Bayesian framework for network pruning and demonstrate the advantages of this approach in achieving more efficient computation and storage capacity savings. It consists of the following steps:

  1. Pretrain the baseline deep CNN until convergence on the image classification task;

  2. Prune and fine-tune the network alternately;

  3. Output the sparse deep CNN model after achieving the target trade-off between classification accuracy and the pruning objective, e.g., FLOPs or memory footprint.

The overall pruning and fine-tuning procedure using Drop-path is shown in Fig. 2.

Fig. 2 Network pruning and fine-tuning procedure using the Drop-path method

3.1 Network pruning and fine-tuning

The goal of network pruning is to remove redundant parameters or connections while minimizing accuracy loss. Our starting point is a well-trained, high-performance baseline deep CNN model \( {\mathbb{N}}_{0} :f({\mathbf{x}};\varvec{\theta}_{0} ), \) in which x and θ0 represent the inputs (i.e., sample images) and the initial network weights, respectively. Paths in the deep CNN are defined as connections formed by parameters in different layers; they do not necessarily span the entire model and therefore have varying lengths. We remove redundant paths from the path set \( P_{t} = \left\{ {p_{t}^{1} ,\;p_{t}^{2} ,\; \ldots ,\;p_{t}^{\tau } } \right\} \) of the deep CNN model in a global manner, covering all convolutional and fully or locally connected layers. In the path set, \( p_{t}^{i} = \left\{ {\theta_{j} ,\;\theta_{j + 1} ,\; \ldots ,\;\theta_{k} } \right\}_{1 \le j < k \le L} \) represents the i-th path from the j-th layer to the k-th layer, and L is a hyper-parameter giving the maximum length of a path; that is, a path contains at most L (≥2) parameters, each of which comes from a different layer. Suppose a deep CNN model consists of S layers with r (>1) parameters per layer; then the total number of paths τ can be formulated as

$$ \begin{aligned} \tau & = (S - 1)r^{2} + (S - 2)r^{3} + \cdots + (S - L + 1)r^{L} \\ & = \sum\limits_{i = 1}^{L - 1} {(S - i)r^{i + 1} } = \sum\limits_{i = 1}^{L - 1} {Sr^{i + 1} } - \sum\limits_{i = 1}^{L - 1} {ir^{i + 1} } \\ & = Sr^{2} \frac{{1 - r^{L - 1} }}{1 - r} - r^{2} \frac{{1 - r^{L - 1} }}{{(1 - r)^{2} }} + \frac{{(L - 1)r^{L + 1} }}{1 - r} \\ & = \frac{{(S - 1)r^{2} - Sr^{3} + (L - S)r^{L + 1} + (S - L + 1)r^{L + 2} }}{{(1 - r)^{2} }} \\ & = O(r^{L} ),\quad r > 1 \\ \end{aligned} $$
(1)

The above equation indicates that the total number of paths grows on the order of r^L. The number of parameters per layer is determined by the structure of the baseline model, so we constrain the maximum length of the paths to limit their number. Generally, the maximum length of paths in a deep CNN is set to less than 4 to keep the number of candidate paths tractable. Given the initial pruning rate ξ0 (0 ≤ ξ0 ≤ 1), a total of ξ0 τ paths, containing at most ξ0 τ L parameters, are pruned from the total number of parameters |θ| after Drop-path pruning. In the pruning process, the number of parameters removed at each iteration is controlled by the step size ζ, i.e., the pruning rate at the t-th pruning step is updated according to

$$ \xi_{t} = \xi_{t - 1} - \zeta $$
(2)

This means that a total number of (1 − ξ0 + ζ)τ paths are removed at each iteration.
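To make Eq. (1) concrete, the following minimal sketch (ours, not part of the original implementation) computes τ for a toy configuration both by the direct sum over path lengths and by the closed form, and confirms that the two agree; the values of S, r, and L are purely illustrative.

```python
# Sanity check of Eq. (1): total number of paths tau for a model with S layers,
# r parameters per layer, and maximum path length L (toy values, illustrative only).
def num_paths_direct(S, r, L):
    # A path of length l (2 <= l <= L) starts at one of (S - l + 1) layers and
    # picks one of the r parameters in each of the l layers it spans.
    return sum((S - l + 1) * r ** l for l in range(2, L + 1))

def num_paths_closed_form(S, r, L):
    # Closed form of Eq. (1), valid for r != 1.
    return ((S - 1) * r ** 2 - S * r ** 3
            + (L - S) * r ** (L + 1) + (S - L + 1) * r ** (L + 2)) / (1 - r) ** 2

if __name__ == "__main__":
    S, r, L = 8, 5, 3
    print(num_paths_direct(S, r, L))       # 925
    print(num_paths_closed_form(S, r, L))  # 925.0
```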

When the connections are sparsified, a new network \( {\mathbb{N}}_{t} {:}\;f({\mathbf{x}};\varvec{\theta}_{t} ) \) initialized by the previous model is trained by using mini-batch stochastic gradient descent (SGD) method, as given by

$$ \varvec{\theta}_{t}^{i} =\varvec{\theta}_{t}^{i - 1} - \alpha \cdot \frac{1}{M}\sum\limits_{m = 1}^{M} {\nabla_{{\varvec{\theta}_{t}^{i - 1} }} \left[ {{\mathcal{L}}\left( {{\mathbf{x}}_{m} ,{\mathbf{y}}_{m} } \right)} \right]} $$
(3)

where \( \varvec{\theta}_{t}^{\;i} \) represents the updated weights at the i-th iteration of mini-batch SGD in the t-th fine-tuning process, and \( \varvec{\theta}_{t}^{\;0} \) denotes the weights obtained after the t-th pruning. α and M denote the learning rate and batch size, respectively, and \( ({\mathbf{x}}_{m} ,{\mathbf{y}}_{m} ) \) is the m-th sample of the current mini-batch. \( {\mathcal{L}} \) is a loss function measuring the gap between the model output f (x) and its corresponding ground-truth label y, such as the negative log-likelihood function:

$$ {\mathcal{L}}\left( {{\mathbf{x}}_{i} ,{\mathbf{y}}_{i} } \right) = - \frac{1}{C}\sum\limits_{j = 1}^{C} {\left[ {y_{i}^{j} \log f({\mathbf{x}}_{i} ;\varvec{\theta})^{j} + (1 - y_{i}^{j} )\log (1 - f({\mathbf{x}}_{i} ;\varvec{\theta})^{j} )} \right]} + \lambda ||\varvec{\theta}||_{2} $$
(4)

where C denotes the total number of dataset categories, that is, the dimension of the output vector. \( y_{i}^{j} \) and \( f({\mathbf{x}}_{i} )^{j} \) represent the j-th component of the ground-truth label and the model prediction, respectively. λ is the weight-decay parameter that controls the L2-regularization strength. The choice of loss function is independent of pruning and depends only on the task solved by the original baseline model. The parameters transferred from the denser network \( {\mathbb{N}}_{t - 1} \) are a good initialization for the sparser network \( {\mathbb{N}}_{t} \) to be fine-tuned further. Finally, a sequence of network models \( \left\{ {{\mathbb{N}}_{0} ,\,{\mathbb{N}}_{1} ,\; \ldots ,\;{\mathbb{N}}_{T} } \right\} \) with fewer and fewer parameters is fine-tuned, and \( {\mathbb{N}}_{T} \) is the final sparse deep CNN model. Throughout the pruning and fine-tuning process, the previously trained network is used to calculate the importance of the paths in the current version of the deep CNN model and to guide the next parameter-removal step.
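As an aside, the loss of Eq. (4) is straightforward to evaluate; the following sketch (a hypothetical helper written by us, not the authors' code) computes the per-class binary cross-entropy averaged over the C outputs plus the L2 penalty, exactly as written in Eq. (4).

```python
# Loss of Eq. (4): per-class binary cross-entropy averaged over the C outputs,
# plus a weight-decay term lambda * ||theta||_2 (L2 norm, as written in the paper).
import numpy as np

def loss_eq4(y_true, y_pred, theta, lam=1e-4, eps=1e-12):
    """y_true, y_pred: arrays of shape (C,); theta: flattened weight vector."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    ce = -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return ce + lam * np.linalg.norm(theta, ord=2)
```

The same loss, evaluated before and after masking the weights, is what enters Eqs. (13)–(14) in the path ranking criterion of Sect. 3.2.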

During the alternating pruning and fine-tuning process, we use a binary matrix (referred to as the inhibition mask I) with the same size as the weight matrix of the whole CNN model to specify the reserved and removed parameters of each path. Before the next fine-tuning of the pruned network, the weight matrix is first updated by element-wise multiplication with the inhibition mask:

$$ \hat{\varvec{\theta }}_{t} =\varvec{\theta}_{t} \circ \varvec{I}_{t} $$
(5)

where \( \circ \) represents the element-wise operation for computing the Hadamard product. In this way, the redundant parameters are removed by multiplying with 0, while the remaining parameters are preserved by multiplying with 1. The subsequent training procedure can then be performed in the same way as for the baseline network, and the model gradually becomes sparse. All removed parameters that are updated during fine-tuning are truncated to zero again before the next pruning. When the pruned model is deployed for testing on mobile devices or embedded systems, the storage and computation requirements are greatly reduced, since the memory footprint of the inhibition mask (~10% at a 30% pruning rate) is much smaller than that of the removed real-valued parameters. Based on this, the sparse convolutional kernels of the pruned network can be transformed into a structured matrix for storage and computation. In other words, a matrix of size m × n needs fewer than m × n stored parameters to describe.
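A minimal sketch of Eq. (5) and the storage argument is given below (shapes and the 50% mask are assumed for illustration): the inhibition mask zeroes the pruned parameters via a Hadamard product, and the resulting sparse kernels can then be stored in a structured (here CSR) form with fewer than m × n stored values.

```python
# Eq. (5): apply the inhibition mask by an element-wise (Hadamard) product, then
# store the sparse kernels in a structured matrix (CSR) instead of a dense one.
import numpy as np
from scipy.sparse import csr_matrix

theta = np.random.randn(64, 27)                       # e.g., 64 flattened 3x3x3 kernels
mask = (np.random.rand(*theta.shape) > 0.5).astype(theta.dtype)  # I_t: 1 = keep, 0 = prune

theta_hat = theta * mask                              # Eq. (5)
sparse_theta = csr_matrix(theta_hat)                  # structured storage of the sparse kernels
print(sparse_theta.nnz, theta.size)                   # stored values vs. dense m*n entries

# after each fine-tuning step, pruned entries are truncated to zero again
theta = theta * mask
```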

The network pruning and fine-tuning process with Drop-path is summarized in Algorithm 1, in which the parameter ranking criterion will be introduced in Sect. 3.2.

3.2 Path ranking criterion

In this part, we introduce the path ranking criterion used to determine the importance of the parameters in each path based on the PAC-Bayesian framework. Our intuition is that the invariance of the generalization error boundary of the deep CNN model should play a key role in model pruning, since the remaining parameters ensure that the potential of the pruned model can be recovered by fine-tuning. In other words, the parameters that have the least impact on the generalization error boundary are the least important. For simpler deep CNN models such as AlexNet and VGG-16, we can easily prune any parameter in any trainable layer. However, for more complex network models like ResNet, pruning filters directly is usually difficult: the structure of the residual block imposes restrictions, which leads to interactions between the pruning of different layers. Therefore, we do not remove entire convolutional kernels, but prune the parameters on each path to form sparse filters.

Suppose the input space is represented by \( {\mathcal{X}} \) and a classifier f (·) drawn from the distribution \( {\mathcal{F}} \) is used to predict the labels of samples. In order to introduce the criterion for judging the influence of parameters on the generalization error boundary, we first give the following relevant definitions (Langford and Schapire 2015) of the PAC-Bayesian boundary:

Definition 1

The expected error of a classifier f (·) is defined as the probability Pr of misclassifying the randomly sampled datum (x, y), as given by

$$ C_{{\mathcal{X}}} \equiv { \Pr }_{{({\mathbf{x}},{\mathbf{y}}) \in {\mathcal{X}}}} \left( {f({\mathbf{x}}) \ne {\mathbf{y}}} \right) $$
(6)

Definition 2

The empirical error of a classifier f (·) is defined as the average probability of misclassifying dataset D with a total number N of samples, as given by

$$ \hat{C}_{D} \equiv \text{Pr}_{{({\mathbf{x}},{\mathbf{y}}) \in D}} \left( {f({\mathbf{x}}) \ne {\mathbf{y}}} \right) = \frac{1}{N}\sum\limits_{i = 1}^{N} {P\left( {f({\mathbf{x}}_{i} ) \ne {\mathbf{y}}_{i} } \right)} $$
(7)

Definition 3

The expected distribution error of the classifier \( f \in {\mathcal{F}} \) is defined as the probability of misclassifying the sample \( \left( {{\mathbf{x}},{\mathbf{y}}} \right) \in {\mathcal{X}}, \) as given by

$$ {\mathcal{F}}_{{\mathcal{X}}} \equiv {\mathbb{E}}_{{f \in {\mathcal{F}}}} C_{{\mathcal{X}}} $$
(8)

Definition 4

The empirical distribution error of classifier \( f \in {\mathcal{F}} \) is defined as the probability of misclassifying the sample \( \left( {{\mathbf{x}},{\mathbf{y}}} \right) \in D \), as given by

$$ {\hat{\mathcal{F}}}_{D} \equiv {\mathbb{E}}_{{f \in {\mathcal{F}}}} \hat{C}_{D} $$
(9)

Based on the above error measures, the property of the PAC-Bayes boundary can be summarized as follows:

Theorem 1

(PAC-Bayesian Boundary) For the entire input space \( {\mathcal{X}} \), all prior distributions P(f) of the classifier f (·) satisfy the following inequality for any \( \varepsilon \in (0,\;1] \):

$$ \text{Pr}_{{D \sim {\mathcal{X}}^{N} }} \left[ {\forall {\mathcal{F}}(f){:}\;KL({\hat{\mathcal{F}}}_{D} \;||\;{\mathcal{F}}_{{\mathcal{X}}} ) \le \frac{{KL\left[ {{\mathcal{F}}(f)\;||\;P(f)} \right] + \ln (\frac{N + 1}{\varepsilon })}}{N}} \right] \ge 1 - \varepsilon $$
(10)

where

$$ KL\left( {{\hat{\mathcal{F}}}_{D} | |{\mathcal{F}}_{{\mathcal{X}}} } \right) = {\mathcal{F}}_{{\mathcal{X}}} \ln \frac{{{\mathcal{F}}_{{\mathcal{X}}} }}{{{\hat{\mathcal{F}}}_{D} }} + (1 - {\mathcal{F}}_{{\mathcal{X}}} )\ln \frac{{1 - {\mathcal{F}}_{{\mathcal{X}}} }}{{1 - {\hat{\mathcal{F}}}_{D} }} $$
(11)
$$ KL\left[ {{\mathcal{F}}(f)\; | |\;P(f)} \right] = {\mathbb{E}}_{{f \in {\mathcal{F}}}} \ln (\frac{{{\mathcal{F}}(f)}}{P(f)}) $$
(12)

Theorem 1 states that, for any classifier f drawn from the distribution \( {\mathcal{F}}, \) the gap between the expected error and the empirical error is controlled by the divergence between \( {\mathcal{F}} \) and the prior P(f). Furthermore, if the prior distribution of a classifier is known, and the prior and empirical distributions are assumed to be of the same type, inequality (10) can be further simplified to make the classification boundary more compact, i.e., the bound becomes tighter. It gives an upper bound on the expected error in terms of the average empirical error, which can be used as an absolute measure to evaluate the generalization performance of the model. For a given learning algorithm and training set, the empirical error is fixed, and in this case the bound grows with the relative entropy [i.e., KL(\( {\mathcal{F}}||P \))]. In fact, the PAC-Bayesian boundary (Herbrich and Graepel 2002) provides the most compact generalization boundary for the classifier and can therefore be used as a favorable way to evaluate the generalization ability of a learning algorithm.
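For reference, a standard relaxation of Theorem 1 (common in the PAC-Bayes literature, though not stated explicitly here) applies the inequality \( KL(q||p) \ge 2(q - p)^{2} \) to obtain an explicit bound that holds with probability at least 1 − ε over the draw of D:

$$ {\mathcal{F}}_{{\mathcal{X}}} \le {\hat{\mathcal{F}}}_{D} + \sqrt {\frac{{KL\left[ {{\mathcal{F}}(f)\;||\;P(f)} \right] + \ln (\frac{N + 1}{\varepsilon })}}{2N}} $$

This form makes explicit that, for a fixed empirical error, the bound on the expected error grows with the KL term, which motivates monitoring a KL-type quantity when parameters are removed.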

In practice, the generalization error boundary of a deep CNN model is usually difficult to calculate exactly because of its large-scale parameters. We propose to use the cross-layer parameters in a path to observe locally their influence on the generalization error boundary, and thereby estimate the change of the generalization error boundary of the whole model. This estimate is then used as the ranking criterion for judging the importance of a path. Motivated by this, we tend to remove the paths (and the corresponding parameters) whose connected neurons have a negligible effect on the generalization error boundary of the model.

The influence of paths on generalization error boundary of deep CNN can be characterized by KL-divergence \( \psi_{i} \) between model losses before and after pruning i-th path, which can be formulated by

$$ \psi_{i} = KL\left( {{\hat{\mathcal{L}}}_{D} | |{\mathcal{L}}_{D} } \right) = {\mathcal{L}}_{D} \ln \frac{{{\mathcal{L}}_{D} }}{{{\hat{\mathcal{L}}}_{D} }} + (1 - {\mathcal{L}}_{D} )\ln \frac{{1 - {\mathcal{L}}_{D} }}{{1 - {\hat{\mathcal{L}}}_{D} }} $$
(13)

where \( {\mathcal{L}}_{D} \) and \( {\hat{\mathcal{L}}}_{D} \) represent the model loss on training set D before and after pruning, respectively. And \( {\hat{\mathcal{L}}}_{D} \) can be calculated by

$$ {\hat{\mathcal{L}}}\left( {{\mathbf{x}}_{i} ,{\mathbf{y}}_{i} } \right) = - \frac{1}{C}\sum\limits_{j = 1}^{C} {\left[ {y_{i}^{j} \log f({\mathbf{x}}_{i} ;\hat{\varvec{\theta }})^{j} + (1 - y_{i}^{j} )\log (1 - f({\mathbf{x}}_{i} ;\hat{\varvec{\theta }})^{j} )} \right]} + \lambda ||\hat{\varvec{\theta }}||_{2} $$
(14)

Then \( \psi_{i} \) is min–max normalized as the criterion to determine the importance of each path (i.e., the corresponding parameters), as calculated by

$$ \psi_{i} \leftarrow \frac{{\;\psi_{i} - \psi_{{{\rm{min}} }} }}{{\psi_{{{\rm{max}} }} - \psi_{{{\rm{min}} }} }} $$
(15)

Finally, we obtain an importance sequence sorted in ascending order, as shown below, which is used to remove redundant paths:

$$ \psi_{{{\rm{min}} }} \le \psi_{i} \le \psi_{j} \le \cdots \le \psi_{{{\rm{max}} }} $$
(16)

In other words, the paths are removed in ascending order of the normalized KL-divergence \( \psi_{i} \). In the end, \( \tau \xi_{t} \) paths \( P_{t} = \{ p_{t}^{1} ,p_{t}^{2} , \ldots ,p_{t}^{{\tau \xi_{t} }} \} \) are removed at the t-th pruning iteration, and the removed parameters are the union \( \{ p_{t}^{1} \cup p_{t}^{2} \cup \cdots \cup p_{t}^{{\tau \xi_{t} }} \} \). Finally, the inhibition mask It is created for network fine-tuning, in which 1 and 0 denote reserved and removed parameters, respectively. The path ranking procedure based on the PAC-Bayesian framework is summarized in Algorithm 2.
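The following self-contained toy sketch illustrates the ranking criterion of Eqs. (13)–(16); it is our reading of the procedure, not the authors' Algorithm 2. The "model" is a tiny two-layer logistic network, and each candidate path is a pair of weights from the two layers (i.e., L = 2), zeroed jointly to measure its effect on the loss.

```python
# Toy illustration of the path ranking criterion (Eqs. (13)-(16)).
import itertools
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))                   # toy inputs
y = (X[:, 0] > 0).astype(float)                # toy binary labels
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(3, 1))

def loss(W1, W2):
    h = np.tanh(X @ W1)
    p = np.clip((1.0 / (1.0 + np.exp(-(h @ W2)))).ravel(), 1e-6, 1 - 1e-6)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def kl_loss(l_hat, l):
    # Eq. (13): binary KL between losses, which implicitly treats them as rates in (0, 1)
    l_hat, l = np.clip([l_hat, l], 1e-6, 1 - 1e-6)
    return l * np.log(l / l_hat) + (1 - l) * np.log((1 - l) / (1 - l_hat))

base = loss(W1, W2)
paths = list(itertools.product(np.ndindex(W1.shape), np.ndindex(W2.shape)))
psi = []
for idx1, idx2 in paths:                       # zero the two weights of one path, Eq. (14)
    W1p, W2p = W1.copy(), W2.copy()
    W1p[idx1], W2p[idx2] = 0.0, 0.0
    psi.append(kl_loss(loss(W1p, W2p), base))
psi = np.array(psi)
psi = (psi - psi.min()) / (psi.max() - psi.min())   # Eq. (15): min-max normalization
order = np.argsort(psi)                             # Eq. (16): ascending importance
print([paths[i] for i in order[:3]])                # the three least important paths
```

In the actual method the losses are computed on the training set of the full deep CNN, and the lowest-ranked \( \tau \xi_{t} \) paths define the inhibition mask It.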

3.3 Reasons for pruning paths rather than individual parameters

A path is a collection of parameters spanning multiple layers of a model; it plays a role analogous to the “receptive field” of convolutional kernels, but in the structure space, and estimates the importance of local structures of the deep CNN during pruning. Compared with statistics based on a single parameter, the variation of the generalization error boundary computed from the multiple parameters of a path is more accurate and captures changes of the deep CNN in the structure space more comprehensively. In addition, the enforced cross-layer parameter removal mechanism can improve the multi-scale feature extraction ability of the model. In the fine-tuning process after Drop-path, the potential of the model can be fully exploited and its representational ability exerted through sparser convolutional kernels, playing a role of structure distillation similar to the “data distillation” proposed in Radosavovic et al. (2018). Perhaps the redundancy of a deep CNN is centralized, which means that the model needs only a small number of parameters to mine the implicit knowledge hidden in the sample set.

We observe the loss gains of two deep CNN models (AlexNet and VGG-16) by suppressing their parameters one by one. Four cases, namely the original networks and pruned networks with three different pruning rates, are considered. The results are min-max normalized to [0, 1] and shown in Fig. 3. We observe the effect of the parameters on the total loss of the test samples by zeroing the weights in the network one by one. Although the impact of a parameter on the classification results cannot directly reflect the importance of the weight, the loss gain can characterize its role in the model. It can be clearly seen that there are sharp rises in the pruned networks, which indicates that the parameters at the bottom layers play a more important role. On the other hand, the effect of a single parameter on the classification results gradually becomes consistent as pruning proceeds, while at the beginning a large number of parameters hardly affect the results at all. In other words, in the model structure space, structures containing a large amount of redundant information fill the whole space, while a small number of effective sparse structures are clustered together. This is inspiring for model pruning and structure exploration: we can remove model parameters in a concentrated, small range, i.e., find an appropriate structure along several directions in the structure space, which is exactly what Drop-path implements.

Fig. 3 Loss gains of AlexNet and VGG-16 obtained by suppressing the parameters one by one

Compared with traditional sorting and pruning algorithms based on a single parameter, Drop-path improves computational efficiency: the computational complexity of single-parameter based ranking is L/(L − 1) times that of path based ranking. Furthermore, the variations of the generalization error boundary caused by single parameters are similar and hard to distinguish. A ranking criterion based on a single parameter is therefore unstable, which usually leads to a sharp collapse of the deep CNN during the alternating pruning and fine-tuning process: the classification accuracy of the model drops drastically as pruning proceeds. In the Drop-path method, the parameters in one path span at most L layers at a time, avoiding excessive removal of parameters from any single layer. In fact, the parameters in the bottom layers are usually more important, and Drop-path removes as many parameters as possible from the top layers while retaining more of the underlying parameters.

4 Experiments

4.1 Experimental setup

4.1.1 Baseline models and benchmark datasets

We conduct experiments on two image classification benchmark datasets, namely ImageNet (Russakovsky et al. 2015) and CIFAR-10 (Krizhevsky and Hinton 2009). ImageNet is a large-scale dataset containing 1000 classes with 1.2 million natural images for training and 50,000 images for testing. CIFAR-10 is a popular benchmark for small-scale natural image classification, containing 10 classes with 50,000 images for training and 10,000 images for testing.

Then we evaluate the Drop-path method by testing eight popular deep CNN models: AlexNet (Krizhevsky et al. 2012), VGG-16 (Simonyan and Zisserman 2014), GoogLeNet (Szegedy et al. 2015), ResNet-34/50 (He et al. 2016) for ImageNet, and a modified VGG-16, ResNet-56/110 for CIFAR-10. The architectures for ImageNet, including the category of layers, the size and number of convolutional kernels, are shown in Table 1. In the modified VGG-16 for CIFAR-10, the first two fully connected layers fc1 and fc2 are removed, only the last fully connected layer fc3 with 10 neurons is used for classification. ResNet-56/110 have three stages of residual blocks for outputting feature maps with sizes of 32 × 32, 16 × 16, and 8 × 8. For the subsequent comparison, the number of parameters, FLOPs, and classification accuracies of above baseline models are summarized in Table 2.

Table 1 Network architectures of baseline models for ImageNet
Table 2 Parameter size, FLOPs, and classification accuracy of baseline network models

4.1.2 Pruning and fine-tuning details

In the pruning process, the pruning rate is set to 1 to ensure that each network can be pruned as much as possible. The step size is set to 0.01, and the intermediate states of the deep CNN are preserved to observe the effectiveness of the pruning method. In the fine-tuning process, hyper-parameters including batch size, learning rate, weight decay, momentum, and dropout rate for the two datasets are set by default, as shown in Table 3. A large batch size can help optimization, while a large momentum can help the model avoid local minima. The learning rates of the models trained on ImageNet and CIFAR-10 are reduced by a factor of 10 after 30 and 15 epochs, respectively. The maximum length of paths in AlexNet, VGG-16, GoogLeNet, and ResNet is set to 2, 3, 3, and 3, respectively. All training and testing of the various deep CNNs is carried out under the Caffe deep learning framework (Jia et al. 2014), on a workstation consisting of an Intel Core i9-9900k CPU, two NVIDIA GeForce GTX Titan XP GPUs, and 4 × 16 gigabytes of memory. In order to make a fair comparison with the baseline and state-of-the-art methods, no data augmentation is used in the training process.

Table 3 Hyper-parameters setting on two benchmark datasets

4.2 Model fine-tuning and baseline performance comparison

We first plot the learning curves of the final fine-tuning process of the eight deep CNNs on ImageNet and CIFAR-10 under different pruning rates, with 20%, 40%, 60%, and 80% of parameters removed, as shown in Fig. 4. For ImageNet, we fine-tune the networks for 150 epochs and report the log-likelihood loss. For CIFAR-10, 70 epochs are sufficient to optimize the network, because longer training leads to over-fitting. As the pruning rate increases, i.e., as the number of model parameters decreases, the final loss of the model increases slightly: from a 20% to an 80% pruning rate, the final loss increases by about 0.07 and 0.015 on ImageNet and CIFAR-10, respectively. Even when most of the parameters of the deep CNNs are removed, the optimization of the model is not significantly affected, thanks to the good initialization provided by the previous model.

Fig. 4 Learning curves of the final fine-tuning process of various network models using the Drop-path pruning method (20%, 40%, 60%, and 80% pruning rates) on ImageNet and CIFAR-10

We now evaluate the full iterative pruning procedure on the two image classification tasks. Figure 5 shows the classification results of AlexNet, VGG-16, GoogLeNet, and ResNet-34/56/110 after pruning and fine-tuning on ImageNet and CIFAR-10. Results for ResNet-50 and the modified VGG-16 for CIFAR-10 are only used for comparison with some state-of-the-art algorithms, as shown in Sect. 4.3. The figure depicts classification accuracy as a function of the pruning rate of the model parameters. Baseline methods, including random pruning, minimum-weight pruning, the minimum-activation criterion, L0-regularization and targeted Dropout (Gomez et al. 2018), are compared to illustrate the effectiveness of the Drop-path method. In random and minimum-weight pruning, the order of the parameters to be pruned is randomly permuted or arranged from small to large magnitude, respectively. Pruning based on the minimum-activation criterion sorts the activation values of neurons and removes parameters from small to large. The performance of Drop-path on all six models is much higher than that of random pruning, minimum-weight pruning, and the minimum-activation criterion. Even when more than 60% of the parameters are removed, the classification accuracy of the six models on ImageNet and CIFAR-10 remains above 70% and 80%, respectively. Furthermore, the Drop-path method maintains almost the highest classification accuracy on AlexNet, VGG-16, and ResNet-56 at all parameter sizes. The L0-regularization based criterion shows the second-best performance, closely followed by targeted Dropout. The classification accuracy obtained by removing parameters with larger L0-norms decreases quickly as the pruning rate increases, which indicates the importance of parameters with larger L0-norms. In fact, our pruning method can be combined with these baseline techniques. For example, a deep CNN may first be learned with L0-regularization; the parameters of the resulting small network can then be further pruned using Drop-path.
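For reference, the minimum-weight baseline in Fig. 5 amounts to global magnitude pruning; a minimal sketch (our simplified rendering, with NumPy arrays standing in for layer weights) is:

```python
# Minimum-weight baseline: rank all weights by absolute value and zero the
# smallest fraction xi (global magnitude pruning).
import numpy as np

def magnitude_prune(weights, xi):
    """weights: list of per-layer arrays; xi: fraction of weights to remove."""
    flat = np.concatenate([w.ravel() for w in weights])
    threshold = np.quantile(np.abs(flat), xi)
    return [w * (np.abs(w) >= threshold) for w in weights]

pruned = magnitude_prune([np.random.randn(4, 4), np.random.randn(4, 2)], xi=0.6)
```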

Fig. 5 Comparison results with baseline methods on six deep CNN models, including AlexNet, VGG-16, GoogLeNet, and ResNet-34 on ImageNet and ResNet-56/110 on CIFAR-10

4.3 Comparison with state-of-the-art methods

In this section, we compare Drop-path with existing state-of-the-art pruning methods on AlexNet, VGG-16, GoogLeNet and ResNet-34/50/56/110, and show the results in Table 4 for ImageNet and Table 5 for CIFAR-10. To compare model inference speed-ups, we report the accuracy loss and the reductions in the number of multiplications and the number of parameters, denoted as [Accuracy (↓%)], [FLOPs (↓%)], and [Parameters (↓%)] in the tables. FLOPs is a commonly used metric for comparing the computational complexities of various deep CNN models. It is easy to calculate and can be computed statically, independently of the underlying hardware and software implementations. The trade-off between accuracy loss and the reduction of FLOPs and parameters is used to illustrate the effectiveness of the proposed Drop-path method.
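As an illustration of how FLOPs can be counted statically from layer shapes alone, the following sketch uses the common convention that a multiply-add counts as two operations; the example shapes are illustrative, not taken from Table 1.

```python
# Static FLOPs counting for a convolutional layer and a fully connected layer,
# derived only from layer shapes (one multiply-add counted as two operations).
def conv_flops(c_in, c_out, k, h_out, w_out):
    return 2 * c_in * k * k * c_out * h_out * w_out

def fc_flops(n_in, n_out):
    return 2 * n_in * n_out

print(conv_flops(3, 64, 3, 224, 224))   # ~173.4 MFLOPs for a VGG-style first layer
print(fc_flops(4096, 1000))             # ~8.2 MFLOPs for a 4096->1000 classifier
```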

Table 4 ImageNet classification results based on various pruning methods
Table 5 CIFAR-10 classification results based on various pruning methods

4.3.1 ImageNet classification

In Table 4, on AlexNet, Drop-path-A1 and Drop-path-A2 represent models with 62% and 24% pruning rates, respectively. While achieving a smaller accuracy loss (1.04%), Drop-path-A1 reduces significantly more FLOPs (69.17%) than NISP-A (Yu et al. 2018). Drop-path-A2 achieves the lowest accuracy loss (0.01%) with 33.80% and 24% reductions in FLOPs and parameters, respectively. On VGG-16, Drop-path-V1 achieves a similar accuracy loss (0.98% vs 1.00%) with a larger FLOPs reduction (90.32%), and Drop-path-V2 achieves the lowest accuracy loss (0.01%) with 35.57% and 28% reductions in FLOPs and parameters. Compared with Tucker (Kim et al. 2016) and NISP (Yu et al. 2018), GoogLeNet with Drop-path-G improves the accuracy loss by 0.04% and 0.07%, respectively. On ResNet, our methods (Drop-path-34-R1 and Drop-path-50-R1) reduce more FLOPs and parameters with smaller accuracy loss, which demonstrates superior performance compared with the state-of-the-art methods (Frankle and Carbin 2019; Yu et al. 2018; Li et al. 2017). In addition, the test times for a single image on the pruned AlexNet, VGG-16, GoogLeNet, ResNet-34 and ResNet-50 are reduced from 24, 38, 32, 29 and 36 ms to 11, 14, 11, 13 and 16 ms, respectively. The test time of each pruned model is lower than that of the original model, which shows the effectiveness of the pruning method.

4.3.2 CIFAR-10 classification

On VGG-16 for CIFAR-10 classification, Drop-path achieves a 0.05% accuracy loss, while the parameter saving is up to 2× and the FLOPs reduction is typically around 50%. It achieves the lowest accuracy loss while reducing the largest number of FLOPs. On ResNet-56 and ResNet-110, the classification accuracy under Drop-path-56-R2 and Drop-path-110-R2 even rises slightly (by 0.03% and 0.04%) when a small fraction of the parameters (6% and 4%) is removed. We conjecture that this is because removing some parameters helps improve the generalization of the “bottleneck” structure. For compression and acceleration, Drop-path-56-R1 and Drop-path-110-R1 remove approximately half of the parameters and achieve almost the same accuracy loss as One-shot pruning (0.1) (Frankle and Carbin 2019) and NISP-110 (Yu et al. 2018). We also observe that, at the same accuracy loss, the parameter reduction rate on CIFAR-10 is typically slightly lower than on ImageNet, possibly because ImageNet contains more samples and categories and thus the deep CNN has more redundancy. The test times for a single image on the pruned VGG-16, ResNet-56 and ResNet-110 are reduced from 11, 22 and 24 ms to 7, 11 and 15 ms, respectively.

In theory, both the energy consumption and the inference delay of the network will decrease when the parameters and the amount of computation are reduced. But in practice, they are not simply proportional to these complexity measures. Due to the complexity of physical hardware, the inference time of the large network may be shorter than that of the small network in some cases.

4.4 Layer robustness to drop-path pruning

The more parameters are removed, the lower the classification accuracy, which is consistent with our understanding. Both convolutional and fully connected layers can be pruned, but with different sensitivities. To observe the pruning sensitivity of convolutional and fully connected layers under Drop-path, we plot the trade-off curves between classification accuracy and the number of parameters in each layer, as shown in Fig. 6. Preliminary results show that the classification accuracy of the deep CNN is related to the total number of parameters in each layer. Moreover, the length of a path is positively correlated with its probability of removal: the longer the path, the greater the probability of removal, which means that the bottom connections are more important. The majority of the parameters of the baseline networks reside in the fully or locally connected layers. We remove as many parameters as possible from these higher layers, while the underlying connections are kept largely untouched.

Fig. 6 Pruning sensitivity of the convolutional layers (left) and fully connected layers (right) of AlexNet and VGG-16

On AlexNet, even when the parameters in the last three convolutional layers (i.e., conv3 to conv5) and the fully connected layers (fc1 to fc3) are reduced to less than 20%, the accuracy loss is no more than 20%. The convolutional layers are more sensitive to pruning than the fully connected layers, especially the first two convolutional layers conv1 and conv2. The first convolutional layer conv1, which interacts with the input image directly, is undoubtedly the most sensitive to pruning. This sensitivity is due to the fact that the input has only 3 channels and therefore conv1 has less redundancy than the other convolutional layers.

On VGG-16 for ImageNet, the underlying convolutional layers become more important and more sensitive. The accuracy loss exceeds 20% even when more than 60% of the parameters of layer conv1 are retained. We believe that the deeper the network structure, the greater the importance of the underlying layers. To support this empirically, we plot the distribution histogram of the removed parameters under a 50% pruning rate in Fig. 7. It can be observed that the fully connected layers in AlexNet and VGG-16 have a large number of redundant parameters that can be pruned. Moreover, as the layers go deeper, more and more parameters are removed.

Fig. 7 Distribution of pruned parameters under a 50% pruning rate. Yellow and blue histograms represent the models trained on ImageNet and CIFAR-10, respectively (Color figure online)

On VGG-16 for CIFAR-10, the pruning sensitivity of the convolutional layers is generally higher. Four convolutional layers (conv1-conv4) suffer an accuracy loss of more than 20%, even though more than 40% of their parameters are preserved. Overall, the sensitivity of the fully connected layers and of most convolutional layers to Drop-path pruning is acceptable and usually does not produce devastating results.

5 Discussions

5.1 Difficulty in directly using small-sized networks

The classification performance of the pruned deep CNNs presented in Sect. 4 is surprising and also raises a question: since small-sized models have the potential to perform well, can we directly optimize sparse models without relying on dense models? To answer this question, we report the classification performance of the eight pruned sparse networks trained from scratch with random initializations, as shown in Fig. 8. Networks pruned by Drop-path with a 20% pruning rate are randomly re-initialized with the MSRA Gaussian distribution (i.e., \( \varvec{\theta}\,\sim\,G\left[ {0,\;\sqrt {2/n_{input} } } \right] \)) (He et al. 2015) or the Xavier uniform distribution (i.e., \( \varvec{\theta}\,\sim\,U\left[ {0,\;2/(n_{input} + n_{output} )} \right] \)) (Glorot and Bengio 2010), where ninput and noutput represent the number of inputs and outputs of the current layer, respectively. At the same pruning rates, the networks learned from MSRA or Xavier re-initialization perform significantly worse than the fine-tuned deep CNNs: the accuracy loss on ImageNet and CIFAR-10 rises by 10% and 5%, respectively. The results demonstrate that the weight initialization inherited from the previous network is crucial for continuing to learn sparse networks.

Fig. 8 Classification results of pruned deep CNNs with fine-tuning or re-initialization on the two benchmark datasets

This result is interesting and sheds light on the optimization and generalization properties of deep CNNs. Although the learning capacity of sparse networks is sufficient to fit the training images, they fall into local minima more easily, while the denser network with more parameters helps to explore better initial positions. Once a good initialization is found by the denser baseline model, fine-tuning the network can improve its classification performance. To see how initialization affects the performance of the network, we plot the loss landscapes of deep CNNs trained on ImageNet in Fig. 9. Noise is added to the parameters along the first two principal component analysis (PCA) directions μ and υ of the weight matrix to observe the loss of the model, which reflects the local optimization behaviour of the deep CNN. If the iterates of SGD are projected onto a plane defined by two random directions, almost none of the motion is captured. The PCA-based visualization technique (Li et al. 2018) provides insights into the consequences of a variety of choices facing the neural network practitioner, including network architecture, optimizer selection, and batch size. The parameters near the optimum are then computed as

$$ \tilde{\varvec{\theta }}_{{(d_{1} ,\;d_{2} )}} =\varvec{\theta}_{ * } + d_{1} \mu \varvec{E} + d_{2} \upsilon \varvec{E} $$
(17)

where d1 ∈ (− 1, 1) and d2 ∈ (− 1, 1) are the step sizes along the two PCA directions μ and υ. E is an all-one matrix with the same dimensions as the weight matrix. Finally, the loss of the deep CNN is computed according to

$$ {\tilde{\mathcal{L}}}\left( {{\mathbf{x}}_{i} ,{\mathbf{y}}_{i} } \right)_{{(d_{1} ,\;d_{2} )}} = - \frac{1}{C}\sum\limits_{j = 1}^{C} {\left[ {y_{i}^{j} \log f({\mathbf{x}}_{i} ;\tilde{\varvec{\theta }}_{{(d_{1} ,\;d_{2} )}} )^{j} + (1 - y_{i}^{j} )\log (1 - f({\mathbf{x}}_{i} ;\tilde{\varvec{\theta }}_{{(d_{1} ,\;d_{2} )}} )^{j} )} \right]} $$
(18)

It can be seen that the model after pruning and fine-tuning has a curvature and minimum region similar to those of the baseline model, while the loss surface of the re-initialized network is sharper and accompanied by more local minima, which means that it is difficult to find a good solution for sparse networks without any prior knowledge. Furthermore, the performance of pruned deep CNNs can even be comparable to that of specially designed sparse networks, as shown in Table 6. After fine-tuning, pruned deep CNNs with far fewer parameters outperform sparse networks such as SqueezeNet (Iandola et al. 2017) and ShuffleNet (Zhang et al. 2018), which illustrates that deep CNNs obtained by gradual pruning and fine-tuning are more suitable for mobile systems or embedded devices than small networks trained directly from scratch.
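A sketch of the 2-D loss scan in the spirit of Eqs. (17)–(18) is given below; it is our simplified reading, not the exact procedure used for Fig. 9. In particular, the paper takes the first two PCA directions of the weight matrix, whereas here two random unit directions are used for simplicity, and the model is a toy logistic regressor.

```python
# Scan the loss on a 2-D grid of perturbations around the trained weights,
# in the spirit of Eqs. (17)-(18) (toy model, random directions for simplicity).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
W_star = rng.normal(size=(5,)) * 0.1            # stands in for the trained weights theta_*

def loss(w):
    p = np.clip(1.0 / (1.0 + np.exp(-(X @ w))), 1e-6, 1 - 1e-6)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

mu = rng.normal(size=W_star.shape); mu /= np.linalg.norm(mu)   # perturbation directions
nu = rng.normal(size=W_star.shape); nu /= np.linalg.norm(nu)

grid = np.linspace(-1.0, 1.0, 21)               # d1, d2 in (-1, 1) as in Eq. (17)
landscape = np.array([[loss(W_star + d1 * mu + d2 * nu) for d2 in grid] for d1 in grid])
print(landscape.shape, landscape.min())         # a 21 x 21 loss surface, as in Fig. 9
```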

Fig. 9 Loss landscapes of various deep CNN models trained on ImageNet. a Original network; b MSRA re-initialization; c Xavier re-initialization; d fine-tuning

Table 6 Comparison of lightweight CNNs and pruned deep CNN models on two benchmark datasets

5.2 The choice of maximum length of paths in the pruning of deep CNNs

In fact, the choice of maximum length of paths during the pruning process of deep CNNs is critical. Intuitively, the long path containing a set of neurons spanning multiple layers can comprehensively describe the network coding and decoding process and assess the importance of these neurons as a whole. However, this does not mean that each of these neurons has the same level of importance. On the other hand, short paths can locally assess the importance of neurons at other scales and are therefore still necessary. Larger Lmax can lead to more paths with different sizes, which are beneficial for assessing the importance of network neurons or connections. Researchers believe that neurons have the ability to encode light, sound, taste, and other information that the body perceives (Carsen et al. 2016). In other words, the neural coding process attempts to establish a mapping from stimulus to response, focusing on understanding how neurons respond to different stimuli and building models to predict the response of neurons to specific stimuli. The corresponding neural decoding process studies the mapping in the opposite direction, that is, from the known responses to reconstruct features and estimate the external stimulus. Context relations play an important role in neural networks (Chen and Jahanshahi 2017; Zheng et al. 2018, 2019), and it is difficult for a single neuron to correctly reflect its importance in the inference process of neural networks.

We then present the classification performance of the eight baseline CNN models with different maximum path lengths to illustrate the validity of Lmax = 4, as shown in Fig. 10. The classification results of the eight networks are consistent: all models with Lmax = 4 achieve the highest classification accuracy under the same pruning rate (~50%). When Lmax equals 1, pruning is based on single neurons and results in a cliff-like decline in the accuracy of the fine-tuned networks. During the alternating pruning and fine-tuning steps, networks with larger Lmax require more local preparation time, which means that more information reflecting the importance of neurons is observed and utilized.

Fig. 10 Classification results of the eight baseline CNNs with various maximum path lengths, Lmax = 1, 2, 3 and 4

5.3 Sparse convolutional kernels are efficient

Intuitively, directly removing entire convolutional kernels might seem more appropriate than removing individual parameters, since it preserves the integrity of the convolution kernels. However, the redundancy of a deep CNN model is distributed across all filters, rather than being concentrated in a few specific redundant filters. We present 64 kernels of the first convolutional layer of the pruned AlexNet for ImageNet and the pruned VGG-16 for CIFAR-10 in Fig. 11. Even when 50% of the parameters are pruned, the number of remaining convolutional kernels is still larger than the number of raw input channels; the kernels are merely split into multiple small sawtooth-like blocks and become sparse. We believe that sparse convolutional kernels play a role similar to the Atrous convolution operation (Yu and Koltun 2015), removing redundancy while still guaranteeing the mining of potential features. At the same time, this enlarges the receptive field, so that the output of each convolutional layer covers a larger range of information. On the other hand, VGG-16 has more inhibited filters than AlexNet, as shown in the last two rows of the figure. This is plausible for a small-sized dataset such as CIFAR-10, on which the deep CNN does not need to learn as many meaningful convolutional kernels as on ImageNet.

Fig. 11 Visualization of filters in the first convolutional layer of AlexNet and VGG-16 for ImageNet. Pruned convolutional kernels are ranked by L2-norm. The left and right examples represent the original and pruned convolutional kernels, respectively

6 Conclusion

In this paper, we propose a generic approach named Drop-path for network compression and acceleration based on identifying the importance levels of connections in different paths under the PAC-Bayesian framework. To the best of our knowledge, this is the first work to reduce model size based on a generalization error boundary. Drop-path is generic and straightforward, and can easily be applied to any multi-layer and multi-branch model. Extensive experiments have demonstrated that Drop-path can effectively reduce the redundancy of deep CNNs and achieve network compression and acceleration with negligible accuracy loss. Eight popular deep CNNs, including AlexNet, VGG-16, GoogLeNet, and ResNet-34/50/56/110 trained on ImageNet and CIFAR-10, achieve state-of-the-art performance with about a 2× speed-up and no more than a 1% increase in error. This results in a smaller memory footprint and lower computational requirements for real-time image processing, making deep CNNs easier to deploy on mobile systems or embedded devices. Moreover, the proposed Drop-path method can be viewed as a tool to further explore the dependence of optimization and generalization of neural networks on the model structure. From the theoretical analysis and experimental verification, we conclude:

  1. By iteratively removing the least important parameters in different paths, deep CNNs can be successfully pruned while preserving the invariance of the network generalization error boundary as much as possible.

  2. Automatic structured pruning can usually find effective network architectures, which perform better than sparse models trained directly from scratch.

  3. Sparsifying convolutional kernels is more efficient than removing them entirely as a whole.

As for future work, we plan to apply our approach to more deep learning based applications, such as real-time object detection and instance segmentation. It is also interesting to use Drop-path method to accelerate more advanced neural networks, such as recurrent neural network (RNN) and 3D CNN. In these cases, the approximation method used to estimate the generalization error boundary of neural networks needs to be redesigned.