Network Pruning via Explicit Information Migration

Wu, Jincheng; Hu, Dongfang; Zheng, Zhitong

doi:10.1007/978-981-99-9119-8_35

Jincheng Wu^11,12,
Dongfang Hu¹¹ &
Zhitong Zheng¹¹

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 14474))

Included in the following conference series:

CAAI International Conference on Artificial Intelligence

599 Accesses

Abstract

Neural network pruning is a widely used approach for reducing the inference cost of deep models in order to deploy on resource-limited settings. However, current pruning works lack attention to information migration from pruned to the remaining part of the Deep Neural Network, and on balancing model performance and compression rate. On these two issues, in this paper, we propose a novel $\textbf{E}$xplicit $\textbf{I}$nformation $\textbf{M}$igration network $\textbf{P}$runing (EIMP) algorithm. Specifically (1) the constrained gradient update method transfers valid information from redundant networks to the preserved, and (2) the newly designed $\lambda $-decay regularization method learns the trade-off between the performance and penalty item. Experiments show that our EIMP algorithm achieves state-of-the-art performance on several datasets with various benchmark network architectures. Notably, EIMP achieves $+1.54\%$ better than SOTA on ImageNet.

Work done during the internship at OPPO.

Access provided by Autonomous University of Puebla. Download conference paper PDF

Tailored Pruning via Rollback Learning

Pruning Convolutional Neural Networks via Stochastic Gradient Hard Thresholding

Improved Network Pruning via Similarity-Based Regularization

Keywords

1 Introduction

With the advancement of deep neural networks (DNN) in recent years [20], accompanied by the effectiveness, there has been a rising amount of storage, memory, computation resources, and energy cost [25]. Therefore, neural network pruning has attracted lots of attention for its capability of reducing model complexity and speeding up inference which is achieved by discarding some redundant structures, such as channels, filters, layers, $ etc $ [20, 30]. Huge efforts have been devoted by researchers over years [3, 27], which can be roughly divided into pruning based on criteria [1, 2, 6, 16, 21] and learn-able pruning methods [3, 18].

The former focuses on directly proposing some theoretically feasible redundancy criteria, like magnitude [20], Hessian [8] and Group Fisher [21], to prune unimportant weights once and for all, followed by a fine-tuning of the pruned network. The latter learnable pruning usually automatically selects unimportant weights during training. In this line, a variety of approaches add a regularization term to the origin loss [20], and we also name them regularization-based methods. Regularization-based pruning is dedicated to impelling unimportant weights close to zero so that subsequent shearing operation will affect as little as possible performance even followed by a simple weight magnitude-based criteria [29]. And we conduct research on the latter path.

In pruning, [29] has pointed out that, high-redundancy parts in a network can have helpful information. Directly trimming is detrimental to the model. From this perspective, learnable regularization-based pruning attempts to automatically transfer useful information from the redundant part to that to be preserved during training. But there are still problematic hurdles: (1) the process of impelling unimportant weights close to zero is too subtle to be perceptible which deserves further research [3]. (2) previous works, such as $L_1$, $L_2$, $L_{2,1}$ regularization, $ etc $., ignore the balance between model convergence and pruning [18]. For the first concern, we present a novel constrained gradient update method to better transfer the model capability to the remaining part. In pruning applications, we find a practical situation: when the magnitude of the model parameter is less than a certain threshold $\tau $, the impact of pruning on the performance can be minimized. We utilize this phenomenon and repel the information of networks to be pruned moves towards the threshold $\tau $ during training. Different from the commonly used Stochastic Gradient Descent (SGD), the constrained gradient update method can proactively prompt redundant parameters to move below the threshold. The process undergoes information migration so that when pruning occurs, the model effectiveness suffers as little as possible.

Considering the second issue, we propose the adapted $\lambda $-decay regularization. Pruning in the training procedure causes the model to go from initial model overfitting to subnet underfitting and then to pruned net fitting again [24]. During the underfitting phase, the high-frequency information tends to be lost, so it is necessary to migrate it from the pruned part to the remaining. Our adapted $\lambda $-decay method regards regularization as a function. When the model has not yet been converged, the optimization is more performance-oriented. And when the model is close to the global optimum, the regularization can increase to speed up the convergence.

The commonality between constrained gradient update and adapted $\lambda $-decay methods is characteristic of adaptive learning. In constrained gradient update, both $S^{0}$ and $S^{1}$ are learning concurrently, but different gradient update methods allow the pruned conv kernels to gradually decay and knowledge progressively flows to the retained parts $S^{1}$. And meanwhile, the structures of the $S^{0}$ and $S^{1}$ sets are dynamically adjusting, so the pruned but potential structures have chances to be resurrected. Furthermore, the adapted $\lambda $-decay method allows the model to learn performance when it should improve it, and learn convergence when it should improve it, dynamically balancing this process.

To summarize, our main contributions are as follows:

We have further explored the explicit information migration for pruning and propose Explicit Information Migration network Pruning (EIMP) consists of a constrained gradient update and an adapted $\lambda $-decay method.
The proposed constrained gradient update method is designed to transfer the model expressivity to the remaining part during pruning. It ensures the model compression rate while enhancing its ability as much as possible.
The adapted $\lambda $-decay regularization scheme is simple yet effective, which can help improve model performance during pruning training.
Extensive experiments on various networks ($ i.e. $ ResNet-32, ResNet-50, ResNet-110, MobileNet-V1, $ etc $.) and datasets ($ i.e. $ CIFAR-10, ImageNet) verify the effectiveness of the two methods. Combining them has delivered the best performance against the state-of-the-art. Besides, we find some interesting phenomena and provide some experience for pruning.

2 Related Work

The pruning algorithm is a commonly used model compression algorithm that can effectively improve the efficiency of model operations [20]. It can be divided into three paradigms: pruning before training, pruning after training, and pruning during training. Pruning before training Some work [4, 22] direct randomization network, followed by randomization of the structure of the network and finally training of the model parameters. [22] think that fine-tuning a pruned model only gives comparable or worse performance than training that model with randomly initialized weights. Pruning after training The classic training paradigm is pre-trained a well-trained model, then prune the unimportant parts by human heuristic algorithm, and finally fine-tuning, it can be performed once or multiple times. The heuristic algorithms include based on magnitude [7], Hessian information [21], $ etc $. Some work introduces trainable structures that adaptively learn the importance of each computing unit. Pruning during training Pruning during training is another pruning paradigm that does not require fine-tuning, as it allows for pruning to be performed concurrently with training. It includes soft pruning strategy [3, 10, 25] and hard pruning [20, 32]. Hard pruning is performed by ADMM [36] algorithm and the projection operator [32] to directly prune unimportant conv kernels. And the Soft pruning strategy [10] inducing structure sparsity is most relevant to the algorithm proposed in this paper, pruned conv kernels continue to participate in training iterations, which means that they are not directly discarded.

3 Method

The overall of our proposed EIMP algorithm is shown in Fig. 1. In this section, we first introduce the widely-used formulations of channel pruning for deep neural networks (DNNs) and floating-point operations (FLOPs), which are used in our paper and presented in Sect. 3.1. Next, we describe the proposed constrained gradient update method in Sect. 3.2, and finally present our method of adapted $\lambda $-decay in Sect. 3.3.

3.1 Preliminaries

Formulation Denote the dataset with N samples as $\mathbb {X} =\{x_i\}_{i=1}^N$, and $\mathbb {Y}=\{y_i\}_{i=1}^N$ are the corresponding labels, a well-trained model with total L layers weights as $\mathbb {W}=\left\{ \textbf{W}_l \right\} _{l=1}^L$ where $\textbf{W}_{l} \in \mathbb {R}^{c_l\times c_{l-1}\times k_l \times k_l}$ represents the weight parameter of the l-th layer, $l\in \left\{ 1,2,...,L \right\} $. $c_l$ is the channel number of l-th layer and $k_l$ is the kernel size in l-th layer.

Some researchers often insert new layers with parameters $\mathbb {W}^{'}=\{ \textbf{W}^{'}_l\}_{l=1}^L$ after each convolution kernel in the original model [18, 20]. We denote this layer as the auxiliary pruning layer, which can be vectors obeying some distribution, crafted matrices, or even conv layers that are introduced to facilitate the pruning of the network architecture. That is, if the $L_2$ norm of the auxiliary pruning layer is approximately or equal to 0, the convolution kernel of the previous layer to which it is attached is equivalent to having no effect on the model.

$$\begin{aligned} \min \limits _{\mathbb {W},\mathbb {W}^{'}} \sum _{i=1}^N \mathcal {L}(x_{i},\mathbb {W},\mathbb {W}^{'};y_{i})&+\lambda \cdot \sum _{l=1}^L||\textbf{W}^{'}_l||_{2,1}\end{aligned}$$

(1)

$$\begin{aligned} \quad \text {s. t.} \quad R_{FLOPs}&\le R_{budget} \end{aligned}$$

(2)

The optimization of previous learn-able regularization-based pruning can be formulated as Eq. 1−2, where $\mathcal {L}$ is the original loss of model, for instance, Cross Entropy Loss. $\left\| \cdot \right\| _{2,1}$ represents $l_{2,1}$-norm and $\Vert \textbf{W}_{l}^{'}\Vert _{2,1}=\sum _{j=1}^{c_l}||\textbf{W}_{l,j,:,:,:}^{'}||_2$. $\lambda $ is a scalar coefficient, which balances the origin model loss and sparsification penalty term, and a larger $\lambda $ urges a sparser convolution kernel and thus obtains a more compact network, whereas a smaller $\lambda $ tends to improve model effectiveness. $R_{FLOPs}$ means FLOPs of sheared model during pruning training, and $R_{budget}$ is the upper limit of resource constraint.

We consider the stochastic gradient update algorithm, the one-step update formula for $\textbf{W}^{'l}$ as shown in Eq. 3, where $\eta ^{'}$ is the learning rate for auxiliary pruning layers training.

$$\begin{aligned} \textbf{W}^{'(t+1)}_{l}&=\textbf{W}^{'(t)}_{l}-\eta ^{'}\cdot \nabla _{\textbf{W}^{'}_{l}} \mathcal {L}-\eta ^{'}\cdot \lambda \cdot \frac{\textbf{W}^{'}_{l}}{||\textbf{W}^{'}_{l}||_{2,1}} \end{aligned}$$

(3)

FLOPs During Pruning. We find when the conv kernel weight is less than the threshold $\tau _1$, pruning it will affect little. Therefore, we only calculate the networks satisfying this condition.

3.2 Constrained Gradient Update

We define ${S}^{0}$ as the set of structures to be pruned of $\mathbb {W^{'}}$ and ${S}^{1}$ as the set of remaining structures. The selection process is not done all at once. Initially, all parameters of $\mathbb {W^{'}}$ belong to ${S}^{1}$ and ${S}^{0}$ is an empty set. As pruning progresses, every few epochs, sort $\mathbb {W^{'}}$ according to the $L_2$ norm, and when $Epoch/u ==0$, calculate $Q = m*Epoch/u$, where u represents the number of times selection is executed and m indicates how many convolutional kernels are selected in each selection. It should be noticed that the Q value will not increase until the FLOPs value of the ${S}^{1}$ set meets the predefined value. After that, put the top Q minimum structures from ${S}^{1}$ into ${S}^{0}$. Before each selection, the ${S}^{0}$ set is reset to an empty set, and all parameters of $\mathbb {W^{'}}$ are again placed in ${S}^{1}$. Then, the selected elements are removed from ${S}^{1}$ and added to ${S}^{0}$.

To update the parameter weights in the ${S}^1$ set, we use the SGD method, which is the same as Eq. 3. For the parameters of network structures to be pruned in ${S}^0$, we define the information learning intensity function $f(\textbf{W}^{'}_{l,j})$ as Eq. 4, which indicates how much the gradient weight for the loss function contributes to the update of the convolution kernel weight parameters in the ${S}^0$ set. Thus, the parameter weights in the ${S}^0$ set are updated according to Eq. 5.

$$\begin{aligned} f(\textbf{W}^{'l}_{j})&=\lambda _2 \cdot max(0, \tau _2-||\textbf{W}^{'l}_{j}||_2)\end{aligned}$$

(4)

$$\begin{aligned} \textbf{W}^{'l}_{j,t+1}&=\textbf{W}^{'}_{j,t}{^{l}}-\eta ^{'}\cdot \nabla _{\textbf{W}^{'l}_{j}} \mathcal {L}\cdot f(\textbf{W}^{'l}_{j})-\eta ^{'}\cdot \lambda \cdot \frac{\textbf{W}^{'l}_{j}}{||\textbf{W}^{'l}||_{2,1}} \end{aligned}$$

(5)

where $\lambda _2$ is the coefficient and $\tau _2$ is the threshold. By employing the proposed CGU algorithm, the weights of the ${S}^0$ set can continuously learn and transfer information until it reaches a stage where the information is either fully depleted or repartitioned into the ${S}^1$ space. The updated weight of a conv kernel in the ${S}^0$ set is monitored by Eq. 5. If the updated weight surpasses the decay rate of the gradient of the loss function, it implies that the conv kernel is gradually increasing its significance and will therefore be reclassified into the ${S}^1$ set.

3.3 Adapted $\lambda $-Decay Regularization

The convergence of the model is represented via gradients $\varphi $ of the last layer in the model. Set the number of each batch size is B, thus the average convergence of every batch is depicted as:

$$\begin{aligned} \varphi &= \nabla _{} \mathcal {L} = \frac{1}{B\times K} \sum _{j=1}^B\sum _{i=1}^K |\frac{\partial \mathcal {L}}{\partial z_{i}^j}| \end{aligned}$$

(6)

$$\begin{aligned} \lambda ^{'} &=\lambda \cdot e^{-\beta *\varphi } \end{aligned}$$

(7)

where $z_{i}^j$ represents the logit for the i-th class and j-th input data among K total classes. In our paper, the regularization coefficients $\lambda $ are re-parameterized as a function $\lambda ^{'}$ in Eq. 7, which can automatically balance the effect and convergence during different pruning training phases. Figure 2 shows that the parameter $\beta $ controls the elasticity of the regularization term.

$$\begin{aligned} \min \limits _{\textbf{W},\mathbf {W^{'}}}& \sum _{i=1}^N \mathcal {L}(x_i,\textbf{W},\mathbf {W^{'}};y_i)+\lambda ^{'}\cdot \sum _{l=1}^L||\textbf{W}_l^{'}||_{2,1} \end{aligned}$$

(8)

Therefore, with the constraint Eq. 2, the model optimization Eq. 1 can be transformed into Eq. 8. It can be seen a rapid decrease in the regularization term enhances the effectiveness of the model, whereas an increase in the norm term accelerates the model’s convergence.

Pseudo-code in Algorithm 1 shows the implementation procedure of our proposed pruning methods. In initialization, all structure parameters are regarded as belonging to ${S}^1$ set. And during the training process, in every u iterations, a few structure parameters with the least $L_2$ norm are chosen and added to the ${S}^0$ set, where they undergo constrained gradient update rule.

4 Experiments and Results

4.1 Datasets, Networks and Pruning Traning Settings

Datasets and Networks We evaluate the performance of the proposed algorithm on the CIFAR-10 [15] and ImageNet [26] datasets. (1) On the CIFAR-10 dataset, we have investigated three models ResNet-32, ResNet-56 and ResNet-110. All the models are trained from scratch with initialization, using Cosine Annealing LR with initial learning rate $lr=0.1$, and decayed between 120 epochs and 180 epochs, for a total of 240 epochs. (2) The ImageNet dataset is a widely used large dataset in the classification task. We have conducted experiments on the ImageNet dataset using MobileNet-V1 [14] and ResNet-50 models [9] with a total of 180 epochs, respectively. MobileNet-V1 i+s trained from scratch with a learning rate of 0.1, batch size of 256, and a decay scheme: cosine learning rate with initialization $lr=0.01$, $cosine minimum=0$, and before pruning, its Top-1 accuracy is 69.26$\%$. In addition, ResNet-50 is the official website implementation integrated into torchvision with Top-1 accuracy 76.15$\%$.

Pruning Traning Settings. As [3] has done, we insert architecture parameters in the basic block first (3 $\times $ 3) conv layers. For ResNet-50 and MobileNet-V1 pruning, we use architecture parameters in the first and second layers in the bottleneck layer. The pruning process uses consine learning rate with an initial value of 0.01 and cosine minimum set to 0. The optimizer selects the SGD algorithm and momentum is set to 0.99. $\eta ^{'}$ is optimized by Cosine learning rate and initialized with 0.01. Threshold $\tau _1=10^{-5}$, $\tau _2=10^{-2}$, $\lambda _1=10^{-4}$, $\lambda _2=1$, $m=4$ and set select interval $u=5$. So we obtain the ${S}^{0}$ set by selecting the Top-4 smallest parameters of every 5 epochs.

4.2 Pruning Results on CIFAR-10 and ImageNet

CIFAR-10. Experiments on benchmark models of ResNet-32, ResNet-56 and ResNet-110 and conducted on CIFAR-10 data and as the Table 1 shows our pruning obtains the best performance overall comparative experiments. Specifically speaking, for the ResNet-56 model, we compare our approach with several recent methods. As the pruning ratio descends to $53.87\%$, EIMP performs best, $+0.55\%$ Top-1 accuracy. In addition, for ResNet-110, our method gets $+0.24\%$ increase than baseline while the second best only obtains $+0.06\%$. In a word, all these results validate the effectiveness of EIMP on CIFAR-10 dataset.

ImageNet. On ImageNet data, benchmark networks ResNet-50 and MobileNet-V1 are tried on, the results of which are shown in Table 2 and verify the superiority of our pruning. As far as we know, with $54.5\%$ compression rate, we achieve an accuracy improvement of $0.07\%$, which surpasses the previous SOTA. Meanwhile, in MobileNet-V1 experiments, at the same compression rate, our EIMP surpasses previous by $1.54\%$.

Table 1. CIFAR-10 dataset experiments compared with state-of-the-art. We conduct experiments on three models: ResNet-32, ResNet-56 and ResNet-110 with 50–65$\%$ FLOPs drops. Our proposed pruning method achieves obvious gains compared with others. $\varDelta =(Pruned-Baseline)/Baseline$.

Full size table

Table 2. ImageNet dataset experiments compared with state-of-the-art. Results of our method on two models with 54$\%$ and 74$\%$ FLOPs drops respectively. Our proposed pruning method achieves obvious gains compared with others.

Full size table

4.3 Information Migration Between ${S}^0$ and ${S}^1$

As shown in Fig. 3, when the iteration increases, more and more channels are cut. It can be noticed that between $epochs=32-36$, some pruned channels are revived. That is to say, in our algorithm, information flows bidirectionally between the pruned and unpruned networks, which enables the proposed algorithm more flexible. And the results highlight the effectiveness of the CGU algorithm in preserving important convolution kernels while pruning.

Table 3. Effectiveness of EIMP. We train on CIFAR-10 with ResNet-56. $\lambda $-decay is short for adapted $\lambda $-decay regularization. The Top-1 accuracy of the pruned networks and the gaps from the base networks are reported.

Full size table

4.4 Ablation Studies

Overall of EIMP. As a baseline, we use the method in [18], and select Top-4 smallest $L_2$ norm elements of the auxiliary pruning layer from $S^{1}$ to $S^{0}$ set every 5 epochs and then prune them. For a fair comparison, we use the same base pruning framework and under the situation of pruning 60% of Flops, compare the pruned model accuracies of ablation experiments. The impacts of whether using CGU or adapted $\lambda $-decay is investigated in Table 3.

Effectiveness of CGU. To inspect the information migration effectiveness of CGU, we compare it with other learning-based pruning methods. For example, in [3], after its division of the conv kernel into $S^{0}$ and $S^{1}$ sets, only attenuation is performed on the $S^{0}$ set without information migration. Results in Fig. 4 left and middle demonstrate the superiority of our CGU method over Res in ResRep. We conduct the experiments on ResNet-56 on the CIFAR-10 and the CGU method consistently outperforms the Res method in terms of accuracy, because when the $L_2$ norm of the conv kernels in the $S^{0}$ set is less than the threshold $\tau _2$, they migrate information to the $S^{1}$ space to allow the model achieves better performance.

Robustness of Adapted $\lambda $-decay Method. To demonstrate the generalizability of the adapted $\lambda $-decay regularization method, we have transferred the method to the ResRep framework and verified its performance. Specifically, we trained the ResNet-32 benchmark network on the CIFAR-10 dataset for a total of 400 epochs, using the same optimizer settings and learning rate as in the Sect. 4.2. Figure 4 (right) shows the accuracy variation from the 160th epoch to the 480th epoch during the pruning training process, with FLOPs reduced by 52.94%. We can observe that after implementing the $\lambda $-decay method, the accuracy can be improved while reducing the same FLOPs value. This implies that $\lambda $-decay can help the model converge closer to the global optimum.

5 Conclusions

In this paper, we examine the problem of model pruning from the perspective of information migration and propose an EIMP schema. Our key contribution is the introduction of a new gradient update rule, referred to as CGU, and an adapted $\lambda $-decay regularization method. The CGU lets the pruned network transfer its valid information to the retained portion of the model, potentially offering a new approach for future pruning research. And the adapted $\lambda $-decay regularization enables the automatic trade-off between performance and regularization item, via assigning varying regularization intensities. Two methods of EIMP can be easily applied to other DNN frameworks.

References

Chen, T.Y., et al.: Only train once: a one-shot neural network training and pruning framework. In: Advances in Neural Information Processing Systems, vol. 34 (2021)
Google Scholar
Chin, T.W., Ding, R.Z., Zhang, C., Marculescu, D.: Towards efficient model compression via learned global ranking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1518–1528 (2020)
Google Scholar
Ding, X.H., Hao, T.X., et al.: Resrep: Lossless CNN pruning via decoupling remembering and forgetting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4510–4520 (2021)
Google Scholar
Frankle, J., Carbin, M.: The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635 (2018)
Gao, S.Q., Huang, F.H., Cai, W.D., Huang, H.: Network pruning via performance maximization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9270–9280 (2021)
Google Scholar
Guo, Y., Yuan, H., Tan, J.C., et al.: GDP: stabilized neural network pruning via gates with differentiable polarization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5239–5250 (2021)
Google Scholar
Han, S., Pool, J., et al.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
Google Scholar
Hassibi, B., et al.: Optimal brain surgeon and general network pruning. In: IEEE International Conference on Neural Networks, pp. 293–299. IEEE (1993)
Google Scholar
He, K.M., Zhang, X.Y., Ren, S.Q., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Google Scholar
He, Y., Kang, G.L., et al.: Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866 (2018)
He, Y., Liu, P., Wang, Z.W., Hu, Z.L., Yang, Y.: Filter pruning via geometric median for deep convolutional neural networks acceleration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4340–4349 (2019)
Google Scholar
He, Y., et al.: Learning filter pruning criteria for deep convolutional neural networks acceleration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2009–2018 (2020)
Google Scholar
Hou, Z.J., Qin, M.H., et al.: Chex: channel exploration for CNN model compression. arXiv preprint arXiv:2203.15794 (2022)
Howard, A.G., Zhu, M.L., et al.: Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical report 0, University of Toronto, Toronto, Ontario (2009)
Google Scholar
Li, B., Wu, B., Su, J., Wang, G.: EagleEye: fast sub-net evaluation for efficient neural network pruning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12347, pp. 639–654. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58536-5_38
Chapter Google Scholar
Li, Y.C., Lin, S.H., et al.: Towards compact CNNs via collaborative compression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6438–6447 (2021)
Google Scholar
Li, Y.W., Gu, S.H., et al.: Group sparsity: the hinge between filter pruning and decomposition for network compression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8018–8027 (2020)
Google Scholar
Lin, M.B., Ji, R.R., et al.: Hrank: filter pruning using high-rank feature map. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1529–1538 (2020)
Google Scholar
Liu, Z., Li, J.G., et al.: Learning efficient convolutional networks through network slimming. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744 (2017)
Google Scholar
Liu, L.Y., et al.: Group fisher pruning for practical network compression. In: International Conference on Machine Learning, pp. 7021–7032. PMLR (2021)
Google Scholar
Liu, Z., et al.: Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270 (2018)
Liu, Z.C., et al.: MetaPruning: meta learning for automatic neural network channel pruning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3296–3305 (2019)
Google Scholar
Miao, L., et al.: Learning pruning-friendly networks via frank-Wolfe: one-shot, any-sparsity, and no retraining. In: International Conference on Learning Representations (2021)
Google Scholar
Park, J.H., et al.: Dynamic structure pruning for compressing CNNs. arXiv:2303.09736 (2023)
Russakovsky, O., Deng, J., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015)
Article MathSciNet Google Scholar
Shen, M.Y., Yin, H.X., et al.: Structural pruning via latency-saliency knapsack. arXiv preprint arXiv:2210.06659 (2022)
Tang, Y.H., et al.: Scientific control for reliable neural network pruning. Scop. Adv. Neural. Inf. Process. Syst. 33, 10936–10947 (2020)
Google Scholar
Wang, H., Qin, C., et al.: Neural pruning via growing regularization. arXiv preprint arXiv:2012.09243 (2020)
Wang, N.G., Liu, C.C., et al.: Deep compression of pre-trained transformer models. Adv. Neural. Inf. Process. Syst. 35, 14140–14154 (2022)
Google Scholar
Wang, Z., Li, C.C., Wang, X.Y.: Convolutional neural network pruning with structural redundancy reduction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14913–14922 (2021)
Google Scholar
Yang, H.C., Gui, S.P., Zhu, Y.H., Liu, J.: Automatic neural network compression by sparsity-quantization joint learning: A constrained optimization-based approach. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2178–2188 (2020)
Google Scholar
Yu, S.X., Mazaheri, A., Jannesari, A.: Auto graph encoder-decoder for neural network pruning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6362–6372 (2021)
Google Scholar
Zhang, M., et al.: Weighted mutual learning with diversity-driven model compression. In: Advances in Neural Information Processing Systems (2022)
Google Scholar
Zhang, Y.F., et al.: Exploration and estimation for model compression. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 487–496 (2021)
Google Scholar
Zheng, S., et al.: Fast-and-light stochastic ADMM. In: IJCAI, pp. 2407–2613 (2016)
Google Scholar

Download references

Author information

Authors and Affiliations

OPPO, Dongguan, China
Jincheng Wu, Dongfang Hu & Zhitong Zheng
University of Science and Technology of China, Hefei, China
Jincheng Wu

Authors

Jincheng Wu
View author publications
You can also search for this author in PubMed Google Scholar
Dongfang Hu
View author publications
You can also search for this author in PubMed Google Scholar
Zhitong Zheng
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Dongfang Hu .

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Lu Fang
Duke University, Durham, NC, USA
Jian Pei
Shanghai Jiao Tong Univeristy, Shanghai, China
Guangtao Zhai
Chinese Academy of Sciences, Beijing, China
Ruiping Wang

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Wu, J., Hu, D., Zheng, Z. (2024). Network Pruning via Explicit Information Migration. In: Fang, L., Pei, J., Zhai, G., Wang, R. (eds) Artificial Intelligence. CICAI 2023. Lecture Notes in Computer Science(), vol 14474. Springer, Singapore. https://doi.org/10.1007/978-981-99-9119-8_35

Download citation

DOI: https://doi.org/10.1007/978-981-99-9119-8_35
Published: 03 February 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-9118-1
Online ISBN: 978-981-99-9119-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Network Pruning via Explicit Information Migration

Abstract