1 Introduction

Relying on ultra-large-scale parameters (Lecun et al. 2015), deep neural networks have become flexible function approximators and have been very successful in a wide range of tasks, such as industrial automation control (Miao and He 2017), computer-aided diagnosis (Shin et al. 2016; Samala et al. 2019) and financial data analysis (Jang and Lee 2018; Jie and Wang 2017). With the development of deep learning, the structures of deep neural networks have become more and more complex. Inevitably, deep neural networks require high computational costs in both the training and testing phases, which is one of the most important factors restricting their practical application in consumer electronics. For mobile and embedded devices (Gutierrez-Galan et al. 2018), the inference speed and file size of a model are critical. The depth, size and amount of computation, as well as the memory overhead at runtime, are increasing rapidly, which makes it difficult to deploy deep networks on mobile terminals or embedded devices with limited hardware resources and strict real-time requirements, such as autonomous vehicles (Li et al. 2018). Therefore, how to reduce network size and inference time while preserving performance, and thus promote applications in consumer electronics, is a hot topic of current research. There has been great interest in reducing the redundancy of deep neural networks to achieve model compression and acceleration (He et al. 2017; Frankle and Carbin 2019; Zoph et al. 2018). Lightweight neural networks (Xu et al. 2019; Kim et al. 2016; Zheng et al. 2017) are an important direction of future development, and network pruning (Huang and Wang 2018; Yu et al. 2018; Liu et al. 2017) is one of their key enabling technologies.

Pruning a neural network means reducing or controlling the number of non-zero parameters or feature maps used by the model. Pruning can be seen as a structural exploration that finds out how many parameters or feature maps are really needed in each layer to obtain the best performance. In fact, researchers found that only about 4% of the parameters need to be updated during back-propagation (Molchanov et al. 2017). By sparsifying deep neural networks, we can avoid unnecessary computation and resource usage, since irrelevant degrees of freedom are pruned away and do not need to be computed. Another benefit is that by reducing the number of parameters (i.e., the redundancy in the parameter space), the generalization ability of the neural network can even be improved. As shown in recent research on the generalization ability of deep neural networks (Painsky and Rosset 2016; Hu et al. 2017; Zheng et al. 2018), the raw number of parameters (the L0-norm) does not actually predict performance. In other words, experimental results show that pruning a network in a prudent manner can even help improve generalization. At the same time, new parameter-based measures, such as the Fisher-Rao norm (Goh et al. 2014), are being developed to predict and describe generalization ability. Interestingly, the Fisher pruning algorithm has been shown to correlate well with the Fisher–Rao norm (Tian et al. 2017), which suggests a deeper, not yet fully understood relationship between network pruning, parameter redundancy and generalization.

Pruning parameters is conceptually simple but practically challenging, because removing parameters in one layer can dramatically impair the input of the following layers. Our goal is to reduce the computational cost of network inference, especially in the transfer-learning setting: when a pre-trained model is fine-tuned, it inherits the large amount of computation that was used to solve the original task, which may be superfluous for the target task. On the other hand, small models gradually pruned from a dense neural network usually yield much better results than the same small models trained directly from scratch, which suggests that the value of automatic pruning methods may lie in identifying efficient structures and performing an implicit architecture search rather than merely selecting “important” weights. In practice, the denser network helps to avoid poor local minima and provides better initializations, which are critical for the sparser network to learn effective representations (Kim et al. 2016).

In the main part of this paper, to accelerate network inference, we propose a holistic pruning method named Drop-path to reduce the model parameters of 2D deep convolutional neural networks, exploiting the inter-parameter redundancy of each layer under the PAC-Bayesian framework. The whole process alternates between two steps: pruning and fine-tuning. Given a trained deep CNN, paths are pruned by ranking the influence of the neurons in each layer on the PAC-Bayesian boundary of the model. We believe that keeping the PAC-Bayesian boundary as invariant as possible is an important factor in preserving the generalization ability of the deep CNN while pruning as much as possible. To the best of our knowledge, this is the first work to reduce model size based on a generalization error boundary. In the pruning step, parameters in paths with different lengths are removed, as shown in Fig. 1. It is worth noting that most convolutional kernels become sparse, rather than some convolutional kernels being completely removed, and they act like the Atrous convolution operation (Yu and Koltun 2015). In the fine-tuning step, the new network, initialized by the previous model, is optimized to retain its classification capability. The two steps are alternated to achieve a trade-off between model size and performance. The pruning ratio in Drop-path is a pre-defined hyper-parameter that can be set according to the actual needs of specific applications, e.g., accuracy, memory, and floating-point operations per second (FLOPs). Drop-path is generic and generalizes well to multi-layer and multi-branch models, since parameter ranking can be applied to any kind of layer and the importance scores can still be propagated. Compared with layer-by-layer pruning and retraining, performed independently (Yang et al. 2017) or greedily (Li et al. 2017), the reasons for the cross-layer removal of connections in different paths are as follows:

  • For deep neural networks, pruning and fine-tuning the model as a whole greatly reduces training time.

  • Pruning parameters across layers gives a holistic view of the robustness of the model, resulting in a smaller network.

  • The overall consideration is necessary, as the pruning of bottom and top layers may affect each other. For example, for the residual network (ResNet), pruning the identity mapping layers or the second layer of each residual block results in additional pruning of subsequent layers.

Finally, we perform various experiments to demonstrate the effectiveness of the proposed Drop-path method. Eight popular deep CNNs [e.g., AlexNet (Krizhevsky et al. 2012), VGG-16 (Simonyan and Zisserman 2014), GoogLeNet (Szegedy et al. 2015) and ResNet (He et al. 2016)] trained on ImageNet (Russakovsky et al. 2015) and CIFAR-10 (Krizhevsky and Hinton 2009) achieve about a 2× inference speed-up with no more than a 1% increase in classification error. In this way, we show that Drop-path can accelerate network inference in practical implementations without seriously hurting the performance of deep CNNs.

Fig. 1 An illustration of the Drop-path method running in a simple neural network model. Paths with different lengths, e.g., the green and yellow paths, are gradually pruned during the training process. The dotted lines represent the connections that are removed

The rest of this paper is organized as follows. Section 2 gives a brief introduction to related work on network pruning methods. Section 3 introduces the PAC-Bayesian framework based Drop-path method. Extensive experiments and analyses are presented in Sect. 4. Some properties and conjectures of the Drop-path method are discussed in Sect. 5. Finally, we conclude our work and outline future directions in Sect. 6.

2 Related work

Heavy-duty deep neural networks are difficult to apply to mobile terminals and embedded devices, which require both intelligence and real-time performance in real-life settings. Compression and acceleration of deep neural networks have recently drawn much attention in the deep learning community. Recent studies (Kim et al. 2016; Srinivas and Babu 2016; Han et al. 2015) have investigated the significant redundancy (parameters that are unimportant or meaningless in the inference process of a deep neural network) in deep neural networks and reduced the number of neurons and filters by pruning the unimportant ones.

In general, there are two common understandings behind the pruning process. First, it is important to train a large and over-parameterized model, which provides strong representation and optimization capabilities, so that redundant parameters can be removed safely without significant damage to accuracy. Therefore, it is generally accepted that this approach is more effective than directly training a smaller network from scratch. Second, the pruned architecture and the associated weights are considered the key to efficient performance. Therefore, most existing pruning techniques choose to fine-tune the pruned model instead of training it from scratch. The weights retained after pruning are often considered critical (Wang et al. 2018), because it is difficult to accurately select an important set of weights from the structural space.

Measuring the importance of neurons or connections in a network exactly is very hard given the high degree of non-linearity. Some previous works (Molchanov et al. 2017; Theis et al. 2018; Luo et al. 2017) approximate it using a 2nd-order Taylor expansion. Our work is a different approximation based on the PAC-Bayesian framework. Bolukbasi et al. (2017) regularize runtime in convolutional layers to determine whether some kernels or weights can be bypassed. Sparsity regularization terms (Figurnov et al. 2016) have been used to learn sparse CNN structures. Han et al. (2016) propose to learn important connections and perform network pruning based on the weights of network connections. Denton et al. (2014) apply singular value decomposition to neural networks to preserve important connections. meProp (Sun et al. 2017) uses approximate gradients by keeping only the top-k elements based on their magnitudes. Torfi and Shirvani (2018) propose group sparsity to constrain the effective parameters, with an extra loss term that keeps some parameters from becoming sparse in order to preserve network performance. The above methods prune the “least important” neurons layer-by-layer, either independently or greedily, without jointly considering weights in different layers or the influence of error propagation through the deep network. In fact, one problem with such methods is that neurons deemed unimportant in an early layer can contribute significantly to the responses of important neurons in later layers. Therefore, the neurons in the network must be pruned as a whole according to a unified goal.

Huang and Wang (2018) add a regularization term to structural blocks or neural nodes to regularize the network structure and control the output; units are adjusted while the whole structure is processed, making the final neural network model cleaner. Yu et al. (2018) determine the importance of each channel in the back-propagation process, and then select the important parts based on the Inf-FS method. Sau and Balasubramanian (2016) use a noisy network to guide the training of another small network. Liu et al. (2017) add a scaling factor to each channel of the convolution kernel, which is regularized during training; controlling each channel is equivalent to controlling the convolution kernels of the corresponding cube. Sun et al. (2016) propose iteratively learned sparse ConvNets and retrain the entire model with the initial weights learned in previous iterations. Targeted Dropout (Gomez et al. 2018) incorporates the post hoc pruning strategy into the training process and has no significant influence on the potential performance of a particular architecture. Since the estimation of the importance of neurons (i.e., weights) is based on the state before Dropout, such estimation errors may accumulate during optimization and eventually lead to divergent results. In addition, there are specially designed small-scale networks that can be trained directly with outstanding performance, such as ShuffleNet (Zhang et al. 2018), Xception (Chollet 2017), SqueezeNet (Iandola et al. 2017), and MobileNet (Howard et al. 2017). The introduction of additional prior knowledge (or the manual removal of redundancy) makes these frameworks less applicable to a variety of visual tasks, so they are difficult to integrate into mobile-end visual processing systems.

3 Drop-path method based on PAC-Bayesian framework

In this section, we present the Drop-path method under PAC-Bayesian framework for network pruning and demonstrate the advantages of this approach in achieving more efficient computation and storage capacity savings. It consists of the following steps:

  1. Pretrain the baseline deep CNN until convergence on the image classification task;

  2. Prune and fine-tune the network alternately;

  3. Output the sparse deep CNN model after achieving the target trade-off between classification accuracy and the pruning objective, e.g., FLOPs or memory footprint.

The overall pruning and fine-tuning procedure using Drop-path is shown in Fig. 2.

Fig. 2 Network pruning and fine-tuning procedure using the Drop-path method

3.1 Network pruning and fine-tuning

The goal of network pruning is to remove redundant parameters or connections while minimizing accuracy loss. Our starting point is a well-trained, high-performance baseline deep CNN model \( {\mathbb{N}}_{0} :f({\mathbf{x}};\varvec{\theta}_{0} ), \) in which x and θ0 represent the inputs (i.e., sample images) and the initial network weights, respectively. Paths in the deep CNN are defined as connections formed by parameters in different layers; they do not necessarily span the entire model and therefore have varying lengths. We remove redundant paths from the path set \( P_{t} = \left\{ {p_{t}^{1} ,\;p_{t}^{2} ,\; \ldots ,\;p_{t}^{\tau } } \right\} \) of the deep CNN model in a global manner, covering all convolutional and fully or locally connected layers. In the path set, \( p_{t}^{i} = \left\{ {\theta_{j} ,\;\theta_{j + 1} ,\; \ldots ,\;\theta_{k} } \right\}_{1 \le j < k \le L} \) represents the i-th path from the j-th layer to the k-th layer, and L is a hyper-parameter giving the maximum length of a path; that is, a path contains at most L (≥2) parameters, each of which comes from a different layer. Suppose a deep CNN model consists of S layers with r (>1) parameters per layer; then the total number of paths τ can be formulated as

$$ \begin{aligned} \tau & = (S - 1)r^{2} + (S - 2)r^{3} + \cdots + (S - L + 1)r^{L} \\ & = \sum\limits_{i = 1}^{L - 1} {(S - i)r^{i + 1} } = \sum\limits_{i = 1}^{L - 1} {Sr^{i + 1} } - \sum\limits_{i = 1}^{L - 1} {ir^{i + 1} } \\ & = Sr^{2} \frac{{1 - r^{L - 1} }}{1 - r} - r^{2} \frac{{1 - r^{L - 1} }}{{(1 - r)^{2} }} + \frac{{(L - 1)r^{L + 1} }}{1 - r} \\ & = \frac{{(S - 1)r^{2} - Sr^{3} + (L - S)r^{L + 1} + (S - L + 1)r^{L + 2} }}{{(1 - r)^{2} }} \\ & = O(r^{L} ),\quad r > 1 \\ \end{aligned} $$
(1)

The above equation indicates that the total number of paths grows on the order of r^L. The number of parameters per layer is determined by the structure of the baseline model, so we constrain the maximum length of the paths to limit their number. Generally, the maximum length of paths in a deep CNN is set to less than 4 to keep the number of candidate paths tractable. Given the initial pruning rate ξ0 (0 ≤ ξ0 ≤ 1), a total of ξ0 τ paths, containing at most ξ0 τ L parameters, are pruned from the total number of parameters |θ| after Drop-path pruning. In the pruning process, the number of parameters removed at each iteration is controlled by the step size ζ, i.e., the pruning rate at the t-th pruning step is updated according to

$$ \xi_{t} = \xi_{t - 1} - \zeta $$
(2)

This means that a total number of (1 − ξ0 + ζ)τ paths are removed at each iteration.
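To make Eq. (1) concrete, the following minimal sketch (ours, not part of the original implementation) computes τ for a toy configuration both by the direct sum over path lengths and by the closed form, and confirms that the two agree; the values of S, r, and L are purely illustrative.

```python
# Sanity check of Eq. (1): total number of paths tau for a model with S layers,
# r parameters per layer, and maximum path length L (toy values, illustrative only).
def num_paths_direct(S, r, L):
    # A path of length l (2 <= l <= L) starts at one of (S - l + 1) layers and
    # picks one of the r parameters in each of the l layers it spans.
    return sum((S - l + 1) * r ** l for l in range(2, L + 1))

def num_paths_closed_form(S, r, L):
    # Closed form of Eq. (1), valid for r != 1.
    return ((S - 1) * r ** 2 - S * r ** 3
            + (L - S) * r ** (L + 1) + (S - L + 1) * r ** (L + 2)) / (1 - r) ** 2

if __name__ == "__main__":
    S, r, L = 8, 5, 3
    print(num_paths_direct(S, r, L))       # 925
    print(num_paths_closed_form(S, r, L))  # 925.0
```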

When the connections are sparsified, a new network \( {\mathbb{N}}_{t} {:}\;f({\mathbf{x}};\varvec{\theta}_{t} ) \) initialized by the previous model is trained by using mini-batch stochastic gradient descent (SGD) method, as given by

$$ \varvec{\theta}_{t}^{i} =\varvec{\theta}_{t}^{i - 1} - \alpha \cdot \frac{1}{M}\sum\limits_{m = 1}^{M} {\nabla_{{\varvec{\theta}_{t}^{i - 1} }} \left[ {{\mathcal{L}}\left( {{\mathbf{x}}_{m} ,{\mathbf{y}}_{m} } \right)} \right]} $$
(3)

where \( \varvec{\theta}_{t}^{\;i} \) represents the updated weights at the i-th iteration of mini-batch SGD in the t-th fine-tuning process, and \( \varvec{\theta}_{t}^{\;0} \) denotes the weights obtained after the t-th pruning. α and M denote the learning rate and batch size, respectively, and \( ({\mathbf{x}}_{m} ,{\mathbf{y}}_{m} ) \) is the m-th sample of the current mini-batch. \( {\mathcal{L}} \) is a loss function measuring the gap between the model output f (x) and its corresponding ground-truth label y, such as the negative log-likelihood function:

$$ {\mathcal{L}}\left( {{\mathbf{x}}_{i} ,{\mathbf{y}}_{i} } \right) = - \frac{1}{C}\sum\limits_{j = 1}^{C} {\left[ {y_{i}^{j} \log f({\mathbf{x}}_{i} ;\varvec{\theta})^{j} + (1 - y_{i}^{j} )\log (1 - f({\mathbf{x}}_{i} ;\varvec{\theta})^{j} )} \right]} + \lambda ||\varvec{\theta}||_{2} $$
(4)

where C denotes the total number of dataset categories, that is, the dimension of the output vector. \( y_{i}^{j} \) and \( f({\mathbf{x}}_{i} )^{j} \) represent the j-th component of the ground-truth label and the model prediction, respectively. λ is the weight-decay parameter that controls the L2-regularization strength. The choice of loss function is independent of pruning and depends only on the task solved by the original baseline model. The parameters transferred from the denser network \( {\mathbb{N}}_{t - 1} \) are a good initialization for the sparser network \( {\mathbb{N}}_{t} \) to be fine-tuned further. Finally, a sequence of network models \( \left\{ {{\mathbb{N}}_{0} ,\,{\mathbb{N}}_{1} ,\; \ldots ,\;{\mathbb{N}}_{T} } \right\} \) with fewer and fewer parameters is fine-tuned, and \( {\mathbb{N}}_{T} \) is the final sparse deep CNN model. Throughout the pruning and fine-tuning process, the previously trained network is used to calculate the importance of the paths in the current version of the deep CNN model and to guide the next parameter-removal step.
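As an aside, the loss of Eq. (4) is straightforward to evaluate; the following sketch (a hypothetical helper written by us, not the authors' code) computes the per-class binary cross-entropy averaged over the C outputs plus the L2 penalty, exactly as written in Eq. (4).

```python
# Loss of Eq. (4): per-class binary cross-entropy averaged over the C outputs,
# plus a weight-decay term lambda * ||theta||_2 (L2 norm, as written in the paper).
import numpy as np

def loss_eq4(y_true, y_pred, theta, lam=1e-4, eps=1e-12):
    """y_true, y_pred: arrays of shape (C,); theta: flattened weight vector."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    ce = -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return ce + lam * np.linalg.norm(theta, ord=2)
```

The same loss, evaluated before and after masking the weights, is what enters Eqs. (13)–(14) in the path ranking criterion of Sect. 3.2.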

During the alternating pruning and fine-tuning process, we use a binary matrix (referred to as the inhibition mask I) with the same size as the weight matrix of the whole CNN model to specify the reserved and removed parameters of each path. Before the next fine-tuning of the pruned network, the weight matrix is first updated by element-wise multiplication with the inhibition mask:

$$ \hat{\varvec{\theta }}_{t} =\varvec{\theta}_{t} \circ \varvec{I}_{t} $$
(5)

where \( \circ \) represents the element-wise operation for computing the Hadamard product. In this way, the redundant parameters are removed by multiplying with 0, while the remaining parameters are preserved by multiplying with 1. The subsequent training procedure can then be performed in the same way as for the baseline network, and the model gradually becomes sparse. All removed parameters that are updated during fine-tuning are truncated to zero again before the next pruning. When the pruned model is deployed for testing on mobile devices or embedded systems, the storage and computation requirements are greatly reduced, since the memory footprint of the inhibition mask (~10% at a 30% pruning rate) is much smaller than that of the removed real-valued parameters. Based on this, the sparse convolutional kernels of the pruned network can be transformed into a structured matrix for storage and computation. In other words, a matrix of size m × n needs fewer than m × n stored parameters to describe.
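A minimal sketch of Eq. (5) and the storage argument is given below (shapes and the 50% mask are assumed for illustration): the inhibition mask zeroes the pruned parameters via a Hadamard product, and the resulting sparse kernels can then be stored in a structured (here CSR) form with fewer than m × n stored values.

```python
# Eq. (5): apply the inhibition mask by an element-wise (Hadamard) product, then
# store the sparse kernels in a structured matrix (CSR) instead of a dense one.
import numpy as np
from scipy.sparse import csr_matrix

theta = np.random.randn(64, 27)                       # e.g., 64 flattened 3x3x3 kernels
mask = (np.random.rand(*theta.shape) > 0.5).astype(theta.dtype)  # I_t: 1 = keep, 0 = prune

theta_hat = theta * mask                              # Eq. (5)
sparse_theta = csr_matrix(theta_hat)                  # structured storage of the sparse kernels
print(sparse_theta.nnz, theta.size)                   # stored values vs. dense m*n entries

# after each fine-tuning step, pruned entries are truncated to zero again
theta = theta * mask
```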

The network pruning and fine-tuning process with Drop-path is summarized in Algorithm 1, in which the parameter ranking criterion will be introduced in Sect. 3.2.

3.2 Path ranking criterion

In this part, we introduce the path ranking criterion used to determine the importance of the parameters in each path based on the PAC-Bayesian framework. Our intuition is that the invariance of the generalization error boundary of the deep CNN model should play a key role in model pruning, since the remaining parameters ensure that the potential of the pruned model can be recovered by fine-tuning. In other words, the parameters that have the least impact on the generalization error boundary are the least important. For simpler deep CNN models such as AlexNet and VGG-16, we can easily prune any parameter in any trainable layer. However, for more complex network models like ResNet, pruning filters directly is usually difficult: the structure of the residual block imposes restrictions, which leads to interactions between the pruning of different layers. Therefore, we do not remove entire convolutional kernels, but prune the parameters on each path to form sparse filters.

Suppose the input space is represented by \( {\mathcal{X}} \) and a classifier f (·) drawn from the distribution \( {\mathcal{F}} \) is used to predict the labels of samples. In order to introduce the criterion for judging the influence of parameters on the generalization error boundary, we first give the following relevant definitions (Langford and Schapire 2015) of the PAC-Bayesian boundary:

Definition 1

The expected error of a classifier f (·) is defined as the probability Pr of misclassifying the randomly sampled datum (x, y), as given by

$$ C_{{\mathcal{X}}} \equiv { \Pr }_{{({\mathbf{x}},{\mathbf{y}}) \in {\mathcal{X}}}} \left( {f({\mathbf{x}}) \ne {\mathbf{y}}} \right) $$
(6)

Definition 2

The empirical error of a classifier f (·) is defined as the average probability of misclassifying dataset D with a total number N of samples, as given by

$$ \hat{C}_{D} \equiv \text{Pr}_{{({\mathbf{x}},{\mathbf{y}}) \in D}} \left( {f({\mathbf{x}}) \ne {\mathbf{y}}} \right) = \frac{1}{N}\sum\limits_{i = 1}^{N} {P\left( {f({\mathbf{x}}_{i} ) \ne {\mathbf{y}}_{i} } \right)} $$
(7)

Definition 3

The expected distribution error of the classifier \( f \in {\mathcal{F}} \) is defined as the probability of misclassifying the sample \( \left( {{\mathbf{x}},{\mathbf{y}}} \right) \in {\mathcal{X}}, \) as given by

$$ {\mathcal{F}}_{{\mathcal{X}}} \equiv {\mathbb{E}}_{{f \in {\mathcal{F}}}} C_{{\mathcal{X}}} $$
(8)

Definition 4

The empirical distribution error of classifier \( f \in {\mathcal{F}} \) is defined as the probability of misclassifying the sample \( \left( {{\mathbf{x}},{\mathbf{y}}} \right) \in D \), as given by

$$ {\hat{\mathcal{F}}}_{D} \equiv {\mathbb{E}}_{{f \in {\mathcal{F}}}} \hat{C}_{D} $$
(9)

Based on the above error measures, the property of the PAC-Bayes boundary can be summarized as follows:

Theorem 1

(PAC-Bayesian Boundary) For the entire input space \( {\mathcal{X}} \), all prior distributions P(f) of the classifier f (·) satisfy the following inequality for any \( \varepsilon \in (0,\;1] \):

$$ \text{Pr}_{{D \sim {\mathcal{X}}^{N} }} \left[ {\forall {\mathcal{F}}(f){:}\;KL({\hat{\mathcal{F}}}_{D} \;||\;{\mathcal{F}}_{{\mathcal{X}}} ) \le \frac{{KL\left[ {{\mathcal{F}}(f)\;||\;P(f)} \right] + \ln (\frac{N + 1}{\varepsilon })}}{N}} \right] \ge 1 - \varepsilon $$
(10)

where

$$ KL\left( {{\hat{\mathcal{F}}}_{D} | |{\mathcal{F}}_{{\mathcal{X}}} } \right) = {\mathcal{F}}_{{\mathcal{X}}} \ln \frac{{{\mathcal{F}}_{{\mathcal{X}}} }}{{{\hat{\mathcal{F}}}_{D} }} + (1 - {\mathcal{F}}_{{\mathcal{X}}} )\ln \frac{{1 - {\mathcal{F}}_{{\mathcal{X}}} }}{{1 - {\hat{\mathcal{F}}}_{D} }} $$
(11)
$$ KL\left[ {{\mathcal{F}}(f)\; | |\;P(f)} \right] = {\mathbb{E}}_{{f \in {\mathcal{F}}}} \ln (\frac{{{\mathcal{F}}(f)}}{P(f)}) $$
(12)

Theorem 1 states that, for any classifier f drawn from the distribution \( {\mathcal{F}}, \) the gap between the expected error and the empirical error is controlled by the divergence between \( {\mathcal{F}} \) and the prior P(f). Furthermore, if the prior distribution of a classifier is known, and the prior and empirical distributions are assumed to be of the same type, inequality (10) can be further simplified to make the classification boundary more compact, i.e., the bound becomes tighter. It gives an upper bound on the expected error in terms of the average empirical error, which can be used as an absolute measure to evaluate the generalization performance of the model. For a given learning algorithm and training set, the empirical error is fixed, and in this case the bound grows with the relative entropy [i.e., KL(\( {\mathcal{F}}||P \))]. In fact, the PAC-Bayesian boundary (Herbrich and Graepel 2002) provides the most compact generalization boundary for the classifier and can therefore be used as a favorable way to evaluate the generalization ability of a learning algorithm.
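For reference, a standard relaxation of Theorem 1 (common in the PAC-Bayes literature, though not stated explicitly here) applies the inequality \( KL(q||p) \ge 2(q - p)^{2} \) to obtain an explicit bound that holds with probability at least 1 − ε over the draw of D:

$$ {\mathcal{F}}_{{\mathcal{X}}} \le {\hat{\mathcal{F}}}_{D} + \sqrt {\frac{{KL\left[ {{\mathcal{F}}(f)\;||\;P(f)} \right] + \ln (\frac{N + 1}{\varepsilon })}}{2N}} $$

This form makes explicit that, for a fixed empirical error, the bound on the expected error grows with the KL term, which motivates monitoring a KL-type quantity when parameters are removed.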

In practice, the generalization error boundary of a deep CNN model is usually difficult to calculate exactly because of its large-scale parameters. We propose to use the cross-layer parameters in a path to observe locally their influence on the generalization error boundary, and thereby estimate the change of the generalization error boundary of the whole model. This estimate is then used as the ranking criterion for judging the importance of a path. Motivated by this, we tend to remove the paths (and the corresponding parameters) whose connected neurons have a negligible effect on the generalization error boundary of the model.

The influence of paths on generalization error boundary of deep CNN can be characterized by KL-divergence \( \psi_{i} \) between model losses before and after pruning i-th path, which can be formulated by

$$ \psi_{i} = KL\left( {{\hat{\mathcal{L}}}_{D} | |{\mathcal{L}}_{D} } \right) = {\mathcal{L}}_{D} \ln \frac{{{\mathcal{L}}_{D} }}{{{\hat{\mathcal{L}}}_{D} }} + (1 - {\mathcal{L}}_{D} )\ln \frac{{1 - {\mathcal{L}}_{D} }}{{1 - {\hat{\mathcal{L}}}_{D} }} $$
(13)

where \( {\mathcal{L}}_{D} \) and \( {\hat{\mathcal{L}}}_{D} \) represent the model loss on training set D before and after pruning, respectively. And \( {\hat{\mathcal{L}}}_{D} \) can be calculated by

$$ {\hat{\mathcal{L}}}\left( {{\mathbf{x}}_{i} ,{\mathbf{y}}_{i} } \right) = - \frac{1}{C}\sum\limits_{j = 1}^{C} {\left[ {y_{i}^{j} \log f({\mathbf{x}}_{i} ;\hat{\varvec{\theta }})^{j} + (1 - y_{i}^{j} )\log (1 - f({\mathbf{x}}_{i} ;\hat{\varvec{\theta }})^{j} )} \right]} + \lambda ||\hat{\varvec{\theta }}||_{2} $$
(14)

Then \( \psi_{i} \) is min–max normalized as the criterion to determine the importance of each path (i.e., the corresponding parameters), as calculated by

$$ \psi_{i} \leftarrow \frac{{\;\psi_{i} - \psi_{{{\rm{min}} }} }}{{\psi_{{{\rm{max}} }} - \psi_{{{\rm{min}} }} }} $$
(15)

Finally, we obtain an importance sequence sorted in ascending order, as shown below, which is used to remove redundant paths:

$$ \psi_{{{\rm{min}} }} \le \psi_{i} \le \psi_{j} \le \cdots \le \psi_{{{\rm{max}} }} $$
(16)

In other words, the paths are removed in ascending order of the normalized KL-divergence \( \psi_{i} \). In the end, \( \tau \xi_{t} \) paths \( P_{t} = \{ p_{t}^{1} ,p_{t}^{2} , \ldots ,p_{t}^{{\tau \xi_{t} }} \} \) are removed at the t-th pruning iteration, and the removed parameters are the union \( \{ p_{t}^{1} \cup p_{t}^{2} \cup \cdots \cup p_{t}^{{\tau \xi_{t} }} \} \). Finally, the inhibition mask It is created for network fine-tuning, in which 1 and 0 denote reserved and removed parameters, respectively. The path ranking procedure based on the PAC-Bayesian framework is summarized in Algorithm 2.
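The following self-contained toy sketch illustrates the ranking criterion of Eqs. (13)–(16); it is our reading of the procedure, not the authors' Algorithm 2. The "model" is a tiny two-layer logistic network, and each candidate path is a pair of weights from the two layers (i.e., L = 2), zeroed jointly to measure its effect on the loss.

```python
# Toy illustration of the path ranking criterion (Eqs. (13)-(16)).
import itertools
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))                   # toy inputs
y = (X[:, 0] > 0).astype(float)                # toy binary labels
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(3, 1))

def loss(W1, W2):
    h = np.tanh(X @ W1)
    p = np.clip((1.0 / (1.0 + np.exp(-(h @ W2)))).ravel(), 1e-6, 1 - 1e-6)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def kl_loss(l_hat, l):
    # Eq. (13): binary KL between losses, which implicitly treats them as rates in (0, 1)
    l_hat, l = np.clip([l_hat, l], 1e-6, 1 - 1e-6)
    return l * np.log(l / l_hat) + (1 - l) * np.log((1 - l) / (1 - l_hat))

base = loss(W1, W2)
paths = list(itertools.product(np.ndindex(W1.shape), np.ndindex(W2.shape)))
psi = []
for idx1, idx2 in paths:                       # zero the two weights of one path, Eq. (14)
    W1p, W2p = W1.copy(), W2.copy()
    W1p[idx1], W2p[idx2] = 0.0, 0.0
    psi.append(kl_loss(loss(W1p, W2p), base))
psi = np.array(psi)
psi = (psi - psi.min()) / (psi.max() - psi.min())   # Eq. (15): min-max normalization
order = np.argsort(psi)                             # Eq. (16): ascending importance
print([paths[i] for i in order[:3]])                # the three least important paths
```

In the actual method the losses are computed on the training set of the full deep CNN, and the lowest-ranked \( \tau \xi_{t} \) paths define the inhibition mask It.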

3.3 Reasons for pruning paths rather than individual parameters

A path is a collection of parameters spanning multiple layers of a model; it plays a role analogous to the “receptive field” of convolutional kernels, but in the structure space, and estimates the importance of local structures of the deep CNN during pruning. Compared with statistics based on a single parameter, the variation of the generalization error boundary computed from the multiple parameters of a path is more accurate and captures changes of the deep CNN in the structure space more comprehensively. In addition, the enforced cross-layer parameter removal mechanism can improve the multi-scale feature extraction ability of the model. In the fine-tuning process after Drop-path, the potential of the model can be fully exploited and its representational ability exerted through sparser convolutional kernels, playing a role of structure distillation similar to the “data distillation” proposed in Radosavovic et al. (2018). Perhaps the redundancy of a deep CNN is centralized, which means that the model needs only a small number of parameters to mine the implicit knowledge hidden in the sample set.

We observe the loss gains of two deep CNN models (AlexNet and VGG-16) by suppressing their parameters one by one. Four cases, namely the original networks and pruned networks with three different pruning rates, are considered. The results are min-max normalized to [0, 1] and shown in Fig. 3. We observe the effect of the parameters on the total loss of the test samples by zeroing the weights in the network one by one. Although the impact of a parameter on the classification results cannot directly reflect the importance of the weight, the loss gain can characterize its role in the model. It can be clearly seen that there are sharp rises in the pruned networks, which indicates that the parameters at the bottom layers play a more important role. On the other hand, the effect of a single parameter on the classification results gradually becomes consistent as pruning proceeds, while at the beginning a large number of parameters hardly affect the results at all. In other words, in the model structure space, structures containing a large amount of redundant information fill the whole space, while a small number of effective sparse structures are clustered together. This is inspiring for model pruning and structure exploration: we can remove model parameters in a concentrated, small range, i.e., find an appropriate structure along several directions in the structure space, which is exactly what Drop-path implements.

Fig. 3 Loss gains of AlexNet and VGG-16 obtained by suppressing the parameters one by one

Compared with traditional sorting and pruning algorithms based on a single parameter, Drop-path improves computational efficiency: the computational complexity of single-parameter based ranking is L/(L − 1) times that of path based ranking. Furthermore, the variations of the generalization error boundary caused by single parameters are similar and hard to distinguish. A ranking criterion based on a single parameter is therefore unstable, which usually leads to a sharp collapse of the deep CNN during the alternating pruning and fine-tuning process: the classification accuracy of the model drops drastically as pruning proceeds. In the Drop-path method, the parameters in one path span at most L layers at a time, avoiding excessive removal of parameters from any single layer. In fact, the parameters in the bottom layers are usually more important, and Drop-path removes as many parameters as possible from the top layers while retaining more of the underlying parameters.

4 Experiments

4.1 Experimental setup

4.1.1 Baseline models and benchmark datasets

We conduct experiments on two image classification benchmark datasets, namely ImageNet (Russakovsky et al. 2015) and CIFAR-10 (Krizhevsky and Hinton 2009). ImageNet is a large-scale dataset containing 1000 classes with 1.2 million natural images for training and 50,000 images for testing. CIFAR-10 is a popular benchmark for small-scale natural image classification, containing 10 classes with 50,000 images for training and 10,000 images for testing.

Then we evaluate the Drop-path method by testing eight popular deep CNN models: AlexNet (Krizhevsky et al. 2012), VGG-16 (Simonyan and Zisserman 2014), GoogLeNet (Szegedy et al. 2015), ResNet-34/50 (He et al. 2016) for ImageNet, and a modified VGG-16, ResNet-56/110 for CIFAR-10. The architectures for ImageNet, including the category of layers, the size and number of convolutional kernels, are shown in Table 1. In the modified VGG-16 for CIFAR-10, the first two fully connected layers fc1 and fc2 are removed, only the last fully connected layer fc3 with 10 neurons is used for classification. ResNet-56/110 have three stages of residual blocks for outputting feature maps with sizes of 32 × 32, 16 × 16, and 8 × 8. For the subsequent comparison, the number of parameters, FLOPs, and classification accuracies of above baseline models are summarized in Table 2.

Table 1 Network architectures of baseline models for ImageNet
Table 2 Parameter size, FLOPs, and classification accuracy of baseline network models

4.1.2 Pruning and fine-tuning details

In the pruning process, the pruning rate is set to 1 to ensure that each network can be pruned as much as possible. The step size is set to 0.01, and the intermediate states of the deep CNN are preserved to observe the effectiveness of the pruning method. In the fine-tuning process, hyper-parameters including batch size, learning rate, weight decay, momentum, and dropout rate for the two datasets are set by default, as shown in Table 3. A large batch size can help optimization, while a large momentum can help the model avoid local minima. The learning rates of the models trained on ImageNet and CIFAR-10 are reduced by a factor of 10 after 30 and 15 epochs, respectively. The maximum length of paths in AlexNet, VGG-16, GoogLeNet, and ResNet is set to 2, 3, 3, and 3, respectively. All training and testing of the various deep CNNs is carried out under the Caffe deep learning framework (Jia et al. 2014), on a workstation consisting of an Intel Core i9-9900k CPU, two NVIDIA GeForce GTX Titan XP GPUs, and 4 × 16 gigabytes of memory. In order to make a fair comparison with the baseline and state-of-the-art methods, no data augmentation is used in the training process.

Table 3 Hyper-parameters setting on two benchmark datasets

4.2 Model fine-tuning and baseline performance comparison

We first plot the learning curves of the final fine-tuning process of the eight deep CNNs on ImageNet and CIFAR-10 under different pruning rates, with 20%, 40%, 60%, and 80% of parameters removed, as shown in Fig. 4. For ImageNet, we fine-tune the networks for 150 epochs and report the log-likelihood loss. For CIFAR-10, 70 epochs are sufficient to optimize the network, because longer training leads to over-fitting. As the pruning rate increases, i.e., as the number of model parameters decreases, the final loss of the model increases slightly: from a 20% to an 80% pruning rate, the final loss increases by about 0.07 and 0.015 on ImageNet and CIFAR-10, respectively. Even when most of the parameters of the deep CNNs are removed, the optimization of the model is not significantly affected, thanks to the good initialization provided by the previous model.

Fig. 4 Learning curves of the final fine-tuning process of various network models using the Drop-path pruning method (20%, 40%, 60%, and 80% pruning rates) on ImageNet and CIFAR-10

We now evaluate the full iterative pruning procedure on the two image classification tasks. Figure 5 shows the classification results of AlexNet, VGG-16, GoogLeNet, and ResNet-34/56/110 after pruning and fine-tuning on ImageNet and CIFAR-10. Results for ResNet-50 and the modified VGG-16 for CIFAR-10 are only used for comparison with some state-of-the-art algorithms, as shown in Sect. 4.3. The figure depicts classification accuracy as a function of the pruning rate of the model parameters. Baseline methods, including random pruning, minimum-weight pruning, the minimum-activation criterion, L0-regularization and targeted Dropout (Gomez et al. 2018), are compared to illustrate the effectiveness of the Drop-path method. In random and minimum-weight pruning, the order of the parameters to be pruned is randomly permuted or arranged from small to large magnitude, respectively. Pruning based on the minimum-activation criterion sorts the activation values of neurons and removes parameters from small to large. The performance of Drop-path on all six models is much higher than that of random pruning, minimum-weight pruning, and the minimum-activation criterion. Even when more than 60% of the parameters are removed, the classification accuracy of the six models on ImageNet and CIFAR-10 remains above 70% and 80%, respectively. Furthermore, the Drop-path method maintains almost the highest classification accuracy on AlexNet, VGG-16, and ResNet-56 at all parameter sizes. The L0-regularization based criterion shows the second-best performance, closely followed by targeted Dropout. The classification accuracy obtained by removing parameters with larger L0-norms decreases quickly as the pruning rate increases, which indicates the importance of parameters with larger L0-norms. In fact, our pruning method can be combined with these baseline techniques. For example, a deep CNN may first be learned with L0-regularization; the parameters of the resulting small network can then be further pruned using Drop-path.
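For reference, the minimum-weight baseline in Fig. 5 amounts to global magnitude pruning; a minimal sketch (our simplified rendering, with NumPy arrays standing in for layer weights) is:

```python
# Minimum-weight baseline: rank all weights by absolute value and zero the
# smallest fraction xi (global magnitude pruning).
import numpy as np

def magnitude_prune(weights, xi):
    """weights: list of per-layer arrays; xi: fraction of weights to remove."""
    flat = np.concatenate([w.ravel() for w in weights])
    threshold = np.quantile(np.abs(flat), xi)
    return [w * (np.abs(w) >= threshold) for w in weights]

pruned = magnitude_prune([np.random.randn(4, 4), np.random.randn(4, 2)], xi=0.6)
```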

Fig. 5 Comparison results with baseline methods on six deep CNN models, including AlexNet, VGG-16, GoogLeNet, and ResNet-34 on ImageNet and ResNet-56/110 on CIFAR-10

4.3 Comparison with state-of-the-art methods

In this section, we compare Drop-path with existing state-of-the-art pruning methods on AlexNet, VGG-16, GoogLeNet and ResNet-34/50/56/110, and show the results in Table 4 for ImageNet and Table 5 for CIFAR-10. To compare model inference speed-ups, we report the accuracy loss and the reductions in the number of multiplications and the number of parameters, denoted as [Accuracy (↓%)], [FLOPs (↓%)], and [Parameters (↓%)] in the tables. FLOPs is a commonly used metric for comparing the computational complexities of various deep CNN models. It is easy to calculate and can be computed statically, independently of the underlying hardware and software implementations. The trade-off between accuracy loss and the reduction of FLOPs and parameters is used to illustrate the effectiveness of the proposed Drop-path method.
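As an illustration of how FLOPs can be counted statically from layer shapes alone, the following sketch uses the common convention that a multiply-add counts as two operations; the example shapes are illustrative, not taken from Table 1.

```python
# Static FLOPs counting for a convolutional layer and a fully connected layer,
# derived only from layer shapes (one multiply-add counted as two operations).
def conv_flops(c_in, c_out, k, h_out, w_out):
    return 2 * c_in * k * k * c_out * h_out * w_out

def fc_flops(n_in, n_out):
    return 2 * n_in * n_out

print(conv_flops(3, 64, 3, 224, 224))   # ~173.4 MFLOPs for a VGG-style first layer
print(fc_flops(4096, 1000))             # ~8.2 MFLOPs for a 4096->1000 classifier
```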

Table 4 ImageNet classification results based on various pruning methods
Table 5 CIFAR-10 classification results based on various pruning methods

4.3.1 ImageNet classification

In Table 4, on AlexNet, Drop-path-A1 and Drop-path-A2 represent models with 62% and 24% pruning rates, respectively. While achieving a smaller accuracy loss (1.04%), Drop-path-A1 reduces significantly more FLOPs (69.17%) than NISP-A (Yu et al. 2018). Drop-path-A2 achieves the lowest accuracy loss (0.01%) with 33.80% and 24% reductions in FLOPs and parameters, respectively. On VGG-16, Drop-path-V1 achieves a similar accuracy loss (0.98% vs 1.00%) with a larger FLOPs reduction (90.32%), and Drop-path-V2 achieves the lowest accuracy loss (0.01%) with 35.57% and 28% reductions in FLOPs and parameters. Compared with Tucker (Kim et al. 2016) and NISP (Yu et al. 2018), GoogLeNet with Drop-path-G improves the accuracy loss by 0.04% and 0.07%, respectively. On ResNet, our methods (Drop-path-34-R1 and Drop-path-50-R1) reduce more FLOPs and parameters with smaller accuracy loss, which demonstrates superior performance compared with the state-of-the-art methods (Frankle and Carbin 2019; Yu et al. 2018; Li et al. 2017). In addition, the test times for a single image on the pruned AlexNet, VGG-16, GoogLeNet, ResNet-34 and ResNet-50 are reduced from 24, 38, 32, 29 and 36 ms to 11, 14, 11, 13 and 16 ms, respectively. The test time of each pruned model is lower than that of the original model, which shows the effectiveness of the pruning method.

4.3.2 CIFAR-10 classification

On VGG-16 for CIFAR-10 classification, Drop-path achieves a 0.05% accuracy loss, while the parameter saving is up to 2× and the FLOPs reduction is typically around 50%. It achieves the lowest accuracy loss while reducing the largest number of FLOPs. On ResNet-56 and ResNet-110, the classification accuracy under Drop-path-56-R2 and Drop-path-110-R2 even rises slightly (by 0.03% and 0.04%) when a small fraction of the parameters (6% and 4%) is removed. We conjecture that this is because removing some parameters helps improve the generalization of the “bottleneck” structure. For compression and acceleration, Drop-path-56-R1 and Drop-path-110-R1 remove approximately half of the parameters and achieve almost the same accuracy loss as One-shot pruning (0.1) (Frankle and Carbin 2019) and NISP-110 (Yu et al. 2018). We also observe that, at the same accuracy loss, the parameter reduction rate on CIFAR-10 is typically slightly lower than on ImageNet, possibly because ImageNet contains more samples and categories and thus the deep CNN has more redundancy. The test times for a single image on the pruned VGG-16, ResNet-56 and ResNet-110 are reduced from 11, 22 and 24 ms to 7, 11 and 15 ms, respectively.

In theory, both the energy consumption and the inference delay of the network will decrease when the parameters and the amount of computation are reduced. But in practice, they are not simply proportional to these complexity measures. Due to the complexity of physical hardware, the inference time of the large network may be shorter than that of the small network in some cases.

4.4 Layer robustness to drop-path pruning

The more parameters are removed, the lower the classification accuracy, which is consistent with our understanding. Both convolutional and fully connected layers can be pruned, but with different sensitivities. To observe the pruning sensitivity of convolutional and fully connected layers under Drop-path, we plot the trade-off curves between classification accuracy and the number of parameters in each layer, as shown in Fig. 6. Preliminary results show that the classification accuracy of the deep CNN is related to the total number of parameters in each layer. Moreover, the length of a path is positively correlated with its probability of removal: the longer the path, the greater the probability of removal, which means that the bottom connections are more important. The majority of the parameters of the baseline networks reside in the fully or locally connected layers. We remove as many parameters as possible from these higher layers, while the underlying connections are kept largely untouched.

Fig. 6 Pruning sensitivity of the convolutional layers (left) and fully connected layers (right) of AlexNet and VGG-16

On AlexNet, even when the parameters in the last three convolutional layers (i.e., conv3 to conv5) and the fully connected layers (fc1 to fc3) are reduced to less than 20%, the accuracy loss is no more than 20%. The convolutional layers are more sensitive to pruning than the fully connected layers, especially the first two convolutional layers conv1 and conv2. The first convolutional layer conv1, which interacts with the input image directly, is undoubtedly the most sensitive to pruning. This sensitivity is due to the fact that the input has only 3 channels and therefore conv1 has less redundancy than the other convolutional layers.

On VGG-16 for ImageNet, the underlying convolutional layers become more important and more sensitive. The accuracy loss exceeds 20% even when more than 60% of the parameters of layer conv1 are retained. We believe that the deeper the network structure, the greater the importance of the underlying layers. To support this empirically, we plot the distribution histogram of the removed parameters under a 50% pruning rate in Fig. 7. It can be observed that the fully connected layers in AlexNet and VGG-16 have a large number of redundant parameters that can be pruned. Moreover, as the layers go deeper, more and more parameters are removed.

Fig. 7 Distribution of pruned parameters under a 50% pruning rate. Yellow and blue histograms represent the models trained on ImageNet and CIFAR-10, respectively (Color figure online)

On VGG-16 for CIFAR-10, the pruning sensitivity of the convolutional layers is generally higher. Four convolutional layers (conv1-conv4) suffer an accuracy loss of more than 20%, even though more than 40% of their parameters are preserved. Overall, the sensitivity of the fully connected layers and of most convolutional layers to Drop-path pruning is acceptable and usually does not produce devastating results.

5 Discussions

5.1 Difficulty in directly using small-sized networks

The classification performance of the pruned deep CNNs presented in Sect. 4 is surprising and also raises a question: since small-sized models have the potential to perform well, can we directly optimize sparse models without relying on dense models? To answer this question, we report the classification performance of the eight pruned sparse networks trained from scratch with random initializations, as shown in Fig. 8. Networks pruned by Drop-path with a 20% pruning rate are randomly re-initialized with the MSRA Gaussian distribution (i.e., \( \varvec{\theta}\,\sim\,G\left[ {0,\;\sqrt {2/n_{input} } } \right] \)) (He et al. 2015) or the Xavier uniform distribution (i.e., \( \varvec{\theta}\,\sim\,U\left[ {0,\;2/(n_{input} + n_{output} )} \right] \)) (Glorot and Bengio 2010), where ninput and noutput represent the number of inputs and outputs of the current layer, respectively. At the same pruning rates, the networks learned from MSRA or Xavier re-initialization perform significantly worse than the fine-tuned deep CNNs: the accuracy loss on ImageNet and CIFAR-10 rises by 10% and 5%, respectively. The results demonstrate that the weight initialization inherited from the previous network is crucial for continuing to learn sparse networks.

Fig. 8 Classification results of pruned deep CNNs with fine-tuning or re-initialization on the two benchmark datasets

This result is interesting and sheds light on the optimization and generalization properties of deep CNNs. Although the learning capacity of sparse networks is sufficient to fit the training images, they fall into local minima more easily, while the denser network with more parameters helps to explore better initial positions. Once a good initialization is found by the denser baseline model, fine-tuning the network can improve its classification performance. To see how initialization affects the performance of the network, we plot the loss landscapes of deep CNNs trained on ImageNet in Fig. 9. Noise is added to the parameters along the first two principal component analysis (PCA) directions μ and υ of the weight matrix to observe the loss of the model, which reflects the local optimization behaviour of the deep CNN. If the iterates of SGD are projected onto a plane defined by two random directions, almost none of the motion is captured. The PCA-based visualization technique (Li et al. 2018) provides insights into the consequences of a variety of choices facing the neural network practitioner, including network architecture, optimizer selection, and batch size. The parameters near the optimum are then computed as

$$ \tilde{\varvec{\theta }}_{{(d_{1} ,\;d_{2} )}} =\varvec{\theta}_{ * } + d_{1} \mu \varvec{E} + d_{2} \upsilon \varvec{E} $$
(17)

where d1 ∈ (− 1, 1) and d2 ∈ (− 1, 1) are the step sizes along the two PCA directions μ and υ. E is an all-one matrix with the same dimensions as the weight matrix. Finally, the loss of the deep CNN is computed according to

$$ {\tilde{\mathcal{L}}}\left( {{\mathbf{x}}_{i} ,{\mathbf{y}}_{i} } \right)_{{(d_{1} ,\;d_{2} )}} = - \frac{1}{C}\sum\limits_{j = 1}^{C} {\left[ {y_{i}^{j} \log f({\mathbf{x}}_{i} ;\tilde{\varvec{\theta }}_{{(d_{1} ,\;d_{2} )}} )^{j} + (1 - y_{i}^{j} )\log (1 - f({\mathbf{x}}_{i} ;\tilde{\varvec{\theta }}_{{(d_{1} ,\;d_{2} )}} )^{j} )} \right]} $$
(18)

It can be seen that the model after pruning and fine-tuning has a curvature and minimum region similar to those of the baseline model, while the loss surface of the re-initialized network is sharper and accompanied by more local minima, which means that it is difficult to find a good solution for sparse networks without any prior knowledge. Furthermore, the performance of pruned deep CNNs can even be comparable to that of specially designed sparse networks, as shown in Table 6. After fine-tuning, pruned deep CNNs with far fewer parameters outperform sparse networks such as SqueezeNet (Iandola et al. 2017) and ShuffleNet (Zhang et al. 2018), which illustrates that deep CNNs obtained by gradual pruning and fine-tuning are more suitable for mobile systems or embedded devices than small networks trained directly from scratch.
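A sketch of the 2-D loss scan in the spirit of Eqs. (17)–(18) is given below; it is our simplified reading, not the exact procedure used for Fig. 9. In particular, the paper takes the first two PCA directions of the weight matrix, whereas here two random unit directions are used for simplicity, and the model is a toy logistic regressor.

```python
# Scan the loss on a 2-D grid of perturbations around the trained weights,
# in the spirit of Eqs. (17)-(18) (toy model, random directions for simplicity).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
W_star = rng.normal(size=(5,)) * 0.1            # stands in for the trained weights theta_*

def loss(w):
    p = np.clip(1.0 / (1.0 + np.exp(-(X @ w))), 1e-6, 1 - 1e-6)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

mu = rng.normal(size=W_star.shape); mu /= np.linalg.norm(mu)   # perturbation directions
nu = rng.normal(size=W_star.shape); nu /= np.linalg.norm(nu)

grid = np.linspace(-1.0, 1.0, 21)               # d1, d2 in (-1, 1) as in Eq. (17)
landscape = np.array([[loss(W_star + d1 * mu + d2 * nu) for d2 in grid] for d1 in grid])
print(landscape.shape, landscape.min())         # a 21 x 21 loss surface, as in Fig. 9
```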

Fig. 9 Loss landscapes of various deep CNN models trained on ImageNet. a Original network; b MSRA re-initialization; c Xavier re-initialization; d fine-tuning

Table 6 Comparison of lightweight CNNs and pruned deep CNN models on two benchmark datasets

5.2 The choice of maximum length of paths in the pruning of deep CNNs

In fact, the choice of maximum length of paths during the pruning process of deep CNNs is critical. Intuitively, the long path containing a set of neurons spanning multiple layers can comprehensively describe the network coding and decoding process and assess the importance of these neurons as a whole. However, this does not mean that each of these neurons has the same level of importance. On the other hand, short paths can locally assess the importance of neurons at other scales and are therefore still necessary. Larger Lmax can lead to more paths with different sizes, which are beneficial for assessing the importance of network neurons or connections. Researchers believe that neurons have the ability to encode light, sound, taste, and other information that the body perceives (Carsen et al. 2016). In other words, the neural coding process attempts to establish a mapping from stimulus to response, focusing on understanding how neurons respond to different stimuli and building models to predict the response of neurons to specific stimuli. The corresponding neural decoding process studies the mapping in the opposite direction, that is, from the known responses to reconstruct features and estimate the external stimulus. Context relations play an important role in neural networks (Chen and Jahanshahi 2017; Zheng et al. 2018, 2019), and it is difficult for a single neuron to correctly reflect its importance in the inference process of neural networks.

We then present the classification performance of the eight baseline CNN models with different maximum path lengths to illustrate the validity of Lmax = 4, as shown in Fig. 10. The classification results of the eight networks are consistent: all models with Lmax = 4 achieve the highest classification accuracy under the same pruning rate (~50%). When Lmax equals 1, pruning is based on single neurons and results in a cliff-like decline in the accuracy of the fine-tuned networks. During the alternating pruning and fine-tuning steps, networks with larger Lmax require more local preparation time, which means that more information reflecting the importance of neurons is observed and utilized.

Fig. 10 Classification results of the eight baseline CNNs with various maximum path lengths, Lmax = 1, 2, 3 and 4

5.3 Sparse convolutional kernels are efficient

Intuitively, directly removing entire convolutional kernels might seem more appropriate than removing individual parameters, since it preserves the integrity of the convolution kernels. However, the redundancy of a deep CNN model is distributed across all filters, rather than being concentrated in a few specific redundant filters. We present 64 kernels of the first convolutional layer of the pruned AlexNet for ImageNet and the pruned VGG-16 for CIFAR-10 in Fig. 11. Even when 50% of the parameters are pruned, the number of remaining convolutional kernels is still larger than the number of raw input channels; the kernels are merely split into multiple small sawtooth-like blocks and become sparse. We believe that sparse convolutional kernels play a role similar to the Atrous convolution operation (Yu and Koltun 2015), removing redundancy while still guaranteeing the mining of potential features. At the same time, this enlarges the receptive field, so that the output of each convolutional layer covers a larger range of information. On the other hand, VGG-16 has more inhibited filters than AlexNet, as shown in the last two rows of the figure. This is plausible for a small-sized dataset such as CIFAR-10, on which the deep CNN does not need to learn as many meaningful convolutional kernels as on ImageNet.

Fig. 11 Visualization of filters in the first convolutional layer of AlexNet and VGG-16 for ImageNet. Pruned convolutional kernels are ranked by L2-norm. The left and right examples represent the original and pruned convolutional kernels, respectively

6 Conclusion

In this paper, we propose a generic approach named Drop-path for network compression and acceleration based on identifying the importance levels of connections in different paths under the PAC-Bayesian framework. To the best of our knowledge, this is the first work to reduce model size based on a generalization error boundary. Drop-path is generic and straightforward, and can easily be applied to any multi-layer and multi-branch model. Extensive experiments have demonstrated that Drop-path can effectively reduce the redundancy of deep CNNs and achieve network compression and acceleration with negligible accuracy loss. Eight popular deep CNNs, including AlexNet, VGG-16, GoogLeNet, and ResNet-34/50/56/110 trained on ImageNet and CIFAR-10, achieve state-of-the-art performance with about a 2× speed-up and no more than a 1% increase in error. This results in a smaller memory footprint and lower computational requirements for real-time image processing, making deep CNNs easier to deploy on mobile systems or embedded devices. Moreover, the proposed Drop-path method can be viewed as a tool to further explore the dependence of optimization and generalization of neural networks on the model structure. From the theoretical analysis and experimental verification, we conclude:

  1. By iteratively removing the least important parameters in different paths, deep CNNs can be successfully pruned while preserving the invariance of the network generalization error boundary as much as possible.

  2. Automatic structured pruning can usually find effective network architectures, which perform better than sparse models trained directly from scratch.

  3. Sparsifying convolutional kernels is more efficient than removing them entirely as a whole.

As for future work, we plan to apply our approach to more deep learning based applications, such as real-time object detection and instance segmentation. It is also interesting to use Drop-path method to accelerate more advanced neural networks, such as recurrent neural network (RNN) and 3D CNN. In these cases, the approximation method used to estimate the generalization error boundary of neural networks needs to be redesigned.