
1 Introduction

With the advancement of deep neural networks (DNNs) in recent years [20], their growing effectiveness has come with rising costs in storage, memory, computation, and energy [25]. Neural network pruning has therefore attracted wide attention for its ability to reduce model complexity and speed up inference by discarding redundant structures such as channels, filters, and layers [20, 30]. Researchers have devoted substantial effort to this topic over the years [3, 27]; existing work can be roughly divided into criterion-based pruning [1, 2, 6, 16, 21] and learnable pruning methods [3, 18].

The former proposes theoretically grounded redundancy criteria, such as magnitude [20], Hessian information [8], and Group Fisher [21], to prune unimportant weights once and for all, followed by fine-tuning of the pruned network. The latter, learnable pruning, automatically selects unimportant weights during training. Along this line, a variety of approaches add a regularization term to the original loss [20]; we also refer to them as regularization-based methods. Regularization-based pruning drives unimportant weights toward zero so that the subsequent shearing operation affects performance as little as possible, even when it relies on a simple weight-magnitude criterion [29]. Our work follows this latter path.

In pruning, [29] has pointed out that high-redundancy parts of a network can still carry useful information, so trimming them directly is detrimental to the model. From this perspective, learnable regularization-based pruning attempts to automatically transfer useful information from the redundant part to the part to be preserved during training. However, two hurdles remain: (1) the process of driving unimportant weights toward zero is too subtle to be perceptible and deserves further research [3]; (2) previous works, such as \(L_1\), \(L_2\), and \(L_{2,1}\) regularization, ignore the balance between model convergence and pruning [18]. For the first concern, we present a novel constrained gradient update method to better transfer the model capability to the remaining part. In pruning applications, we observe a practical phenomenon: when the magnitude of a model parameter is below a certain threshold \(\tau \), the impact of pruning it on performance is minimal. We exploit this phenomenon and push the information of the structures to be pruned toward the threshold \(\tau \) during training. Different from the commonly used Stochastic Gradient Descent (SGD), the constrained gradient update method proactively drives redundant parameters below the threshold. The information migrates during this process, so that when pruning occurs, the model's effectiveness suffers as little as possible.

Considering the second issue, we propose the adapted \(\lambda \)-decay regularization. Pruning during training takes the model from initial overfitting to subnet underfitting and then back to fitting of the pruned network [24]. During the underfitting phase, high-frequency information tends to be lost, so it is necessary to migrate it from the pruned part to the remaining part. Our adapted \(\lambda \)-decay method treats the regularization coefficient as a function: when the model has not yet converged, the optimization is more performance-oriented; when the model is close to the global optimum, the regularization increases to speed up convergence.

What constrained gradient update and adapted \(\lambda \)-decay have in common is adaptive learning. Under constrained gradient update, both \(S^{0}\) and \(S^{1}\) keep learning concurrently, but the different gradient update rules let the pruned conv kernels gradually decay while knowledge progressively flows to the retained part \(S^{1}\). Meanwhile, the membership of the \(S^{0}\) and \(S^{1}\) sets is adjusted dynamically, so pruned but potentially useful structures have a chance to be resurrected. Furthermore, the adapted \(\lambda \)-decay method lets the model focus on performance when performance should improve and on convergence when convergence should improve, dynamically balancing the two.

To summarize, our main contributions are as follows:

  • We further explore explicit information migration for pruning and propose Explicit Information Migration network Pruning (EIMP), which consists of a constrained gradient update and an adapted \(\lambda \)-decay method.

  • The proposed constrained gradient update method is designed to transfer the model expressivity to the remaining part during pruning. It meets the target compression rate while preserving the model's ability as much as possible.

  • The adapted \(\lambda \)-decay regularization scheme is simple yet effective, which can help improve model performance during pruning training.

  • Extensive experiments on various networks (\( i.e. \) ResNet-32, ResNet-50, ResNet-110, MobileNet-V1, \( etc \).) and datasets (\( i.e. \) CIFAR-10, ImageNet) verify the effectiveness of the two methods, and combining them delivers the best performance against the state-of-the-art. Besides, we observe some interesting phenomena and distill practical experience for pruning.

2 Related Work

The pruning algorithm is a commonly used model compression technique that can effectively improve the efficiency of model execution [20]. It can be divided into three paradigms: pruning before training, pruning after training, and pruning during training.

Pruning before training Some works [4, 22] start from a randomly initialized network, prune its structure before any training, and only then train the remaining parameters. [22] argues that fine-tuning a pruned model gives only comparable or worse performance than training that model from randomly initialized weights.

Pruning after training The classic paradigm first pre-trains a model, then prunes the unimportant parts with a human-designed heuristic, and finally fine-tunes; this procedure can be carried out once or repeated multiple times. The heuristics include magnitude-based [7] and Hessian-based [21] criteria, among others. Some works introduce trainable structures that adaptively learn the importance of each computing unit.

Pruning during training This paradigm does not require fine-tuning, since pruning is performed concurrently with training. It includes soft pruning strategies [3, 10, 25] and hard pruning [20, 32]. Hard pruning uses the ADMM algorithm [36] or a projection operator [32] to directly remove unimportant conv kernels. The soft pruning strategy [10], which induces structured sparsity, is most relevant to the algorithm proposed in this paper: pruned conv kernels continue to participate in training iterations rather than being discarded immediately.

3 Method

The overall workflow of our proposed EIMP algorithm is shown in Fig. 1. In this section, we first introduce the widely used formulations of channel pruning for deep neural networks (DNNs) and floating-point operations (FLOPs) used in this paper (Sect. 3.1). We then describe the proposed constrained gradient update method in Sect. 3.2 and finally present the adapted \(\lambda \)-decay method in Sect. 3.3.

Fig. 1. Overall workflow of the proposed approach EIMP. The entire algorithm is a pruning-during-training algorithm that automatically learns how to prune better through the two designed methods. (a) The constrained gradient update method helps information flow between networks \({S}^{0}\) and \({S}^{1}\) and pushes valid information into \({S}^{1}\). (b) The adapted \(\lambda \)-decay method automatically balances convergence and performance.

3.1 Preliminaries

Formulation Denote the dataset with N samples as \(\mathbb {X} =\{x_i\}_{i=1}^N\) and the corresponding labels as \(\mathbb {Y}=\{y_i\}_{i=1}^N\). A well-trained model with L layers has weights \(\mathbb {W}=\left\{ \textbf{W}_l \right\} _{l=1}^L\), where \(\textbf{W}_{l} \in \mathbb {R}^{c_l\times c_{l-1}\times k_l \times k_l}\) is the weight tensor of the l-th layer, \(l\in \left\{ 1,2,...,L \right\} \); \(c_l\) is the number of channels and \(k_l\) the kernel size of the l-th layer.

Some researchers insert new layers with parameters \(\mathbb {W}^{'}=\{ \textbf{W}^{'}_l\}_{l=1}^L\) after each convolution kernel of the original model [18, 20]. We call such a layer an auxiliary pruning layer; it can be a vector obeying some distribution, a crafted matrix, or even a conv layer, and it is introduced to facilitate pruning of the network architecture. If the \(L_2\) norm of an auxiliary pruning layer is approximately or exactly 0, the convolution kernel of the previous layer to which it is attached effectively has no influence on the model.
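As an illustration, here is a minimal sketch of one common instantiation of such an auxiliary pruning layer, a learnable per-channel scale; the class name and the choice of a scaling vector are our assumptions, since the paper leaves the concrete form open:

```python
import torch
import torch.nn as nn

class AuxiliaryPruningLayer(nn.Module):
    """Hypothetical auxiliary pruning layer: a learnable per-channel scale
    W'_l placed after a convolution. If the scale of a channel is driven
    (close) to zero, the corresponding conv kernel of the preceding layer
    effectively no longer influences the model."""

    def __init__(self, num_channels: int):
        super().__init__()
        # W'_l: one scalar per output channel of the preceding conv
        self.scale = nn.Parameter(torch.ones(num_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, height, width)
        return x * self.scale.view(1, -1, 1, 1)
```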

$$\begin{aligned} \min \limits _{\mathbb {W},\mathbb {W}^{'}} \sum _{i=1}^N \mathcal {L}(x_{i},\mathbb {W},\mathbb {W}^{'};y_{i})&+\lambda \cdot \sum _{l=1}^L||\textbf{W}^{'}_l||_{2,1}\end{aligned}$$
(1)
$$\begin{aligned} \quad \text {s. t.} \quad R_{FLOPs}&\le R_{budget} \end{aligned}$$
(2)

The optimization of previous learnable regularization-based pruning can be formulated as Eqs. 1-2, where \(\mathcal {L}\) is the original loss of the model, for instance, the cross-entropy loss. \(\left\| \cdot \right\| _{2,1}\) denotes the \(l_{2,1}\)-norm, with \(\Vert \textbf{W}_{l}^{'}\Vert _{2,1}=\sum _{j=1}^{c_l}||\textbf{W}_{l,j,:,:,:}^{'}||_2\). \(\lambda \) is a scalar coefficient that balances the original model loss and the sparsification penalty: a larger \(\lambda \) encourages sparser convolution kernels and thus a more compact network, whereas a smaller \(\lambda \) favors model effectiveness. \(R_{FLOPs}\) denotes the FLOPs of the sheared model during pruning training, and \(R_{budget}\) is the upper limit of the resource constraint.

We consider the stochastic gradient descent algorithm; the one-step update for \(\textbf{W}^{'}_{l}\) is given in Eq. 3, where \(\eta ^{'}\) is the learning rate for training the auxiliary pruning layers.

$$\begin{aligned} \textbf{W}^{'(t+1)}_{l}&=\textbf{W}^{'(t)}_{l}-\eta ^{'}\cdot \nabla _{\textbf{W}^{'}_{l}} \mathcal {L}-\eta ^{'}\cdot \lambda \cdot \frac{\textbf{W}^{'}_{l}}{||\textbf{W}^{'}_{l}||_{2,1}} \end{aligned}$$
(3)
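A minimal sketch of this one-step update, assuming the auxiliary layer is a per-channel scale as above; the function name and the small `eps` guard are ours:

```python
import torch

def sgd_step_aux_layer(w_aux: torch.Tensor, grad_loss: torch.Tensor,
                       lr: float, lam: float, eps: float = 1e-12) -> torch.Tensor:
    """One update of W'_l following Eq. 3: the data-loss gradient plus the
    penalty term lam * W'_l / ||W'_l||_{2,1}, written as in the equation
    (eps is added only to avoid division by zero)."""
    # ||W'_l||_{2,1}: sum of per-channel l2 norms (channel dim assumed first)
    norm_21 = w_aux.reshape(w_aux.shape[0], -1).norm(dim=1).sum()
    return w_aux - lr * grad_loss - lr * lam * w_aux / (norm_21 + eps)
```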

FLOPs During Pruning. We find that when the weight norm of a conv kernel is below the threshold \(\tau _1\), pruning it affects performance only marginally. Therefore, when computing \(R_{FLOPs}\) during training, only the kernels satisfying this condition are counted as pruned.
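A rough sketch of how such a FLOPs estimate could be computed for a plain chain of conv layers; the helper name, the argument layout, and the channel-chaining assumption are ours:

```python
import torch

def flops_of_sheared_model(aux_scales, conv_shapes, out_sizes, tau1=1e-5):
    """Estimate R_FLOPs of the sheared model: a channel counts as kept only
    if the l2 norm of its auxiliary scale is at least tau_1. Assumes a plain
    chain of convs, so the kept output channels of one layer become the
    input channels of the next.
    aux_scales: per-layer scale tensors W'_l (one scalar per channel)
    conv_shapes: per-layer (c_out, c_in, k, k)
    out_sizes: per-layer output spatial size (h, w)"""
    total, prev_kept = 0, None
    for scale, (c_out, c_in, k, _), (h, w) in zip(aux_scales, conv_shapes, out_sizes):
        kept_out = int((scale.abs() >= tau1).sum())
        kept_in = c_in if prev_kept is None else prev_kept
        total += kept_out * kept_in * k * k * h * w   # multiply-accumulates
        prev_kept = kept_out
    return total
```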

3.2 Constrained Gradient Update

We define \({S}^{0}\) as the set of structures of \(\mathbb {W^{'}}\) to be pruned and \({S}^{1}\) as the set of remaining structures. The selection is not done all at once. Initially, all parameters of \(\mathbb {W^{'}}\) belong to \({S}^{1}\) and \({S}^{0}\) is empty. As pruning progresses, \(\mathbb {W^{'}}\) is sorted by \(L_2\) norm every few epochs; when \(Epoch \bmod u = 0\), we compute \(Q = m \cdot Epoch/u\), where u is the selection interval and m indicates how many convolutional kernels are selected at each selection. Note that Q stops increasing once the FLOPs of the \({S}^{1}\) set meets the predefined budget. The Q structures with the smallest norms are then moved from \({S}^{1}\) into \({S}^{0}\): before each selection, \({S}^{0}\) is reset to the empty set and all parameters of \(\mathbb {W^{'}}\) are placed back in \({S}^{1}\); the selected elements are then removed from \({S}^{1}\) and added to \({S}^{0}\).
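A minimal sketch of this periodic selection step, assuming the auxiliary parameters are gathered in a list of per-layer tensors; the function name and the `q_cap` argument (used to stop Q from growing once the FLOPs budget is met) are our assumptions:

```python
import torch

def select_pruned_set(aux_layers, epoch: int, u: int, m: int, q_cap=None):
    """Re-partition the auxiliary parameters every u epochs: rank all channels
    by the l2 norm of their auxiliary weights and return the Q smallest as
    S^0 (as (layer_index, channel_index) pairs); everything else stays in S^1.
    Returns None between selections so the caller keeps the old partition."""
    if epoch == 0 or epoch % u != 0:
        return None
    q = m * (epoch // u)
    if q_cap is not None:
        q = min(q, q_cap)        # Q stops growing once the FLOPs budget is met
    scored = []
    for l, w in enumerate(aux_layers):
        norms = w.reshape(w.shape[0], -1).norm(dim=1)
        scored += [(float(n), l, j) for j, n in enumerate(norms)]
    scored.sort()                # smallest norms first
    return {(l, j) for _, l, j in scored[:q]}
```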

To update the parameter weights in the \({S}^1\) set, we use SGD, as in Eq. 3. For the parameters of the network structures to be pruned in \({S}^0\), we define the information learning intensity function \(f(\textbf{W}^{'}_{l,j})\) in Eq. 4, which indicates how much the gradient of the loss function contributes to updating the convolution kernel weights in \({S}^0\). The parameter weights in \({S}^0\) are then updated according to Eq. 5.

$$\begin{aligned} f(\textbf{W}^{'}_{l,j})&=\lambda _2 \cdot \max (0,\ \tau _2-||\textbf{W}^{'}_{l,j}||_2)\end{aligned}$$
(4)
$$\begin{aligned} \textbf{W}^{'(t+1)}_{l,j}&=\textbf{W}^{'(t)}_{l,j}-\eta ^{'}\cdot \nabla _{\textbf{W}^{'}_{l,j}} \mathcal {L}\cdot f(\textbf{W}^{'}_{l,j})-\eta ^{'}\cdot \lambda \cdot \frac{\textbf{W}^{'}_{l,j}}{||\textbf{W}^{'}_{l}||_{2,1}} \end{aligned}$$
(5)

where \(\lambda _2\) is a coefficient and \(\tau _2\) is a threshold. With the proposed CGU rule, the weights in the \({S}^0\) set can continue to learn and transfer information until the information is either fully depleted or the structure is repartitioned into the \({S}^1\) space. The updated weight of a conv kernel in \({S}^0\) is monitored through Eq. 5: if the update driven by the loss gradient outweighs the decay from the penalty term, the conv kernel is gradually regaining significance and will therefore be reclassified into the \({S}^1\) set.
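A minimal sketch of the constrained update in Eqs. 4-5 for a single channel; the function name and the precomputed `norm_21` argument are our conveniences:

```python
import torch

def cgu_step(w_ch: torch.Tensor, grad_ch: torch.Tensor, norm_21: float,
             lr: float, lam: float, lam2: float = 1.0, tau2: float = 1e-2,
             eps: float = 1e-12) -> torch.Tensor:
    """One constrained gradient update (Eqs. 4-5) for a channel W'_{l,j} in S^0.
    The loss gradient is scaled by f = lam2 * max(0, tau2 - ||W'_{l,j}||_2),
    so the closer the channel is to zero the less it learns from the data,
    while the l_{2,1} penalty keeps pushing it further down.
    norm_21 is ||W'_l||_{2,1} of the whole layer, precomputed by the caller."""
    f = lam2 * max(0.0, tau2 - float(w_ch.norm()))
    return w_ch - lr * grad_ch * f - lr * lam * w_ch / (norm_21 + eps)
```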

Fig. 2. The \(\lambda ^{'}\) function. Different lines in the graph represent the function under various values of \(\beta \).

3.3 Adapted \(\lambda \)-Decay Regularization

The convergence of the model is represented by the gradients \(\varphi \) of the last layer of the model. Let the batch size be B; the average convergence over a batch is then:

$$\begin{aligned} \varphi &= \nabla \mathcal {L} = \frac{1}{B\times K} \sum _{j=1}^B\sum _{i=1}^K \left| \frac{\partial \mathcal {L}}{\partial z_{i}^j}\right| \end{aligned}$$
(6)
$$\begin{aligned} \lambda ^{'} &=\lambda \cdot e^{-\beta *\varphi } \end{aligned}$$
(7)

where \(z_{i}^j\) denotes the logit of the i-th class (out of K total classes) for the j-th input. In our paper, the regularization coefficient \(\lambda \) is re-parameterized as the function \(\lambda ^{'}\) in Eq. 7, which automatically balances effectiveness and convergence across the different phases of pruning training. Figure 2 shows how the parameter \(\beta \) controls the elasticity of the regularization term.

$$\begin{aligned} \min \limits _{\textbf{W},\mathbf {W^{'}}}& \sum _{i=1}^N \mathcal {L}(x_i,\textbf{W},\mathbf {W^{'}};y_i)+\lambda ^{'}\cdot \sum _{l=1}^L||\textbf{W}_l^{'}||_{2,1} \end{aligned}$$
(8)

Therefore, with the constraint in Eq. 2, the model optimization in Eq. 1 can be transformed into Eq. 8. It can be seen that a rapid decrease in the regularization term enhances the effectiveness of the model, whereas an increase in the penalty term accelerates the model's convergence.
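A minimal sketch of Eqs. 6-7, computing \(\varphi \) from the gradients of the loss with respect to the last-layer logits and then the effective coefficient \(\lambda ^{'}\); the function name is ours, and `retain_graph=True` is used only so the usual backward pass can still follow:

```python
import torch

def adapted_lambda(logits: torch.Tensor, loss: torch.Tensor,
                   lam: float, beta: float) -> float:
    """Adapted lambda-decay (Eqs. 6-7): phi is the mean absolute gradient of
    the loss w.r.t. the last-layer logits over the batch and classes, and the
    effective coefficient is lam * exp(-beta * phi). Large gradients (model
    far from convergence) shrink the penalty; small gradients let it grow
    back toward lam, accelerating convergence."""
    grad_logits, = torch.autograd.grad(loss, logits, retain_graph=True)
    phi = grad_logits.abs().mean()      # average over B samples and K classes
    return float(lam * torch.exp(-beta * phi))
```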

The pseudo-code in Algorithm 1 shows the implementation of the proposed pruning method. At initialization, all structure parameters belong to the \({S}^1\) set. During training, every u epochs the structure parameters with the smallest \(L_2\) norms are selected and added to the \({S}^0\) set, where they follow the constrained gradient update rule.

4 Experiments and Results

4.1 Datasets, Networks and Pruning Training Settings

Datasets and Networks We evaluate the proposed algorithm on the CIFAR-10 [15] and ImageNet [26] datasets. (1) On CIFAR-10, we investigate three models: ResNet-32, ResNet-56 and ResNet-110. All models are trained from scratch with cosine annealing of the learning rate, starting from \(lr=0.1\) and decayed between epochs 120 and 180, for a total of 240 epochs. (2) ImageNet is a widely used large-scale classification dataset. We conduct experiments on ImageNet with the MobileNet-V1 [14] and ResNet-50 [9] models, each for a total of 180 epochs. MobileNet-V1 is trained from scratch with a learning rate of 0.1, a batch size of 256, and a cosine learning-rate decay scheme initialized at \(lr=0.01\) with a cosine minimum of 0; before pruning, its Top-1 accuracy is 69.26\(\%\). ResNet-50 is the official implementation integrated into torchvision, with a Top-1 accuracy of 76.15\(\%\).

Algorithm 1. Pseudo-code of the proposed EIMP pruning procedure.
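Since the pseudo-code figure does not reproduce here, the following is a hedged sketch of how the pieces above could fit together into one training loop. It reuses the helper functions sketched earlier (`select_pruned_set`, `cgu_step`, `adapted_lambda`) and assumes `model`, `aux_layers`, `criterion` and `train_loader` are already constructed; it is an illustration of the procedure, not the authors' reference implementation.

```python
import torch

lr, lam, beta = 0.01, 1e-4, 1.0    # eta' and lambda_1 from Sect. 4.1; beta is an assumption
u, m = 5, 4                        # selection interval and per-selection step
pruned_set, q_cap = set(), None    # S^0 starts empty; q_cap is set once the FLOPs budget is met

for epoch in range(240):
    sel = select_pruned_set(aux_layers, epoch, u, m, q_cap=q_cap)
    if sel is not None:
        pruned_set = sel
    for x, y in train_loader:
        logits = model(x)
        loss = criterion(logits, y)
        lam_eff = adapted_lambda(logits, loss, lam, beta)   # Eq. 7
        loss.backward()
        with torch.no_grad():
            for l, w in enumerate(aux_layers):
                norm_21 = float(w.reshape(w.shape[0], -1).norm(dim=1).sum())
                for j in range(w.shape[0]):
                    if (l, j) in pruned_set:                # S^0: constrained update (Eq. 5)
                        w[j] = cgu_step(w[j], w.grad[j], norm_21, lr, lam_eff)
                    else:                                   # S^1: regularized SGD (Eq. 3)
                        w[j] = w[j] - lr * w.grad[j] - lr * lam_eff * w[j] / (norm_21 + 1e-12)
                w.grad = None
        # the backbone weights W are updated by a standard SGD optimizer (omitted here)
```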

Pruning Training Settings. Following [3], we insert architecture parameters into the first (3 \(\times \) 3) conv layers of the basic blocks. For ResNet-50 and MobileNet-V1 pruning, we place architecture parameters in the first and second layers of each bottleneck layer. The pruning process uses a cosine learning rate with an initial value of 0.01 and a cosine minimum of 0. The optimizer is SGD with momentum 0.99. \(\eta ^{'}\) follows a cosine learning-rate schedule initialized at 0.01. We set the thresholds \(\tau _1=10^{-5}\) and \(\tau _2=10^{-2}\), the coefficients \(\lambda _1=10^{-4}\) and \(\lambda _2=1\), \(m=4\), and the selection interval \(u=5\). Thus we obtain the \({S}^{0}\) set by selecting the Top-4 smallest parameters every 5 epochs.

4.2 Pruning Results on CIFAR-10 and ImageNet

CIFAR-10. Experiments are conducted on CIFAR-10 with the benchmark models ResNet-32, ResNet-56 and ResNet-110; as Table 1 shows, our pruning achieves the best performance across all comparative experiments. Specifically, for ResNet-56 we compare our approach with several recent methods: at a pruning ratio of \(53.87\%\), EIMP performs best, with a \(+0.55\%\) Top-1 accuracy gain. For ResNet-110, our method achieves a \(+0.24\%\) improvement over the baseline, whereas the second best obtains only \(+0.06\%\). In short, these results validate the effectiveness of EIMP on the CIFAR-10 dataset.

ImageNet. On ImageNet, we evaluate the benchmark networks ResNet-50 and MobileNet-V1; the results in Table 2 verify the superiority of our pruning. As far as we know, at a \(54.5\%\) compression rate, we achieve an accuracy improvement of \(0.07\%\), surpassing the previous state of the art. Meanwhile, in the MobileNet-V1 experiments, at the same compression rate, our EIMP surpasses the previous best by \(1.54\%\).

Fig. 3. Number of pruned channels per layer.

Table 1. CIFAR-10 dataset experiments compared with state-of-the-art. We conduct experiments on three models: ResNet-32, ResNet-56 and ResNet-110 with 50–65\(\%\) FLOPs drops. Our proposed pruning method achieves obvious gains compared with others. \(\varDelta =(Pruned-Baseline)/Baseline\).
Table 2. ImageNet dataset experiments compared with state-of-the-art. Results of our method on two models with 54\(\%\) and 74\(\%\) FLOPs drops respectively. Our proposed pruning method achieves obvious gains compared with others.

4.3 Information Migration Between \({S}^0\) and \({S}^1\)

As shown in Fig. 3, as the iterations increase, more and more channels are cut. Notably, between epochs 32 and 36 some pruned channels are revived. That is, in our algorithm information flows bidirectionally between the pruned and unpruned networks, which makes the proposed algorithm more flexible. These results highlight the effectiveness of the CGU algorithm in preserving important convolution kernels while pruning.

Table 3. Effectiveness of EIMP. We train on CIFAR-10 with ResNet-56. \(\lambda \)-decay is short for adapted \(\lambda \)-decay regularization. The Top-1 accuracy of the pruned networks and the gaps from the base networks are reported.

4.4 Ablation Studies

Overall Effectiveness of EIMP. As a baseline, we use the method in [18] and move the Top-4 elements of the auxiliary pruning layer with the smallest \(L_2\) norms from \(S^{1}\) to \(S^{0}\) every 5 epochs, then prune them. For a fair comparison, we use the same base pruning framework and, with 60% of FLOPs pruned, compare the accuracies of the pruned models in the ablation experiments. The impact of using CGU and/or adapted \(\lambda \)-decay is investigated in Table 3.

Effectiveness of CGU. To inspect the information-migration effectiveness of CGU, we compare it with other learning-based pruning methods. For example, [3] divides the conv kernels into \(S^{0}\) and \(S^{1}\) sets and then only attenuates the \(S^{0}\) set, without information migration. The results in Fig. 4 (left and middle) demonstrate the superiority of our CGU method over the Res component of ResRep. We conduct the experiments with ResNet-56 on CIFAR-10, and CGU consistently outperforms Res in accuracy, because when the \(L_2\) norm of a conv kernel in \(S^{0}\) falls below the threshold \(\tau _2\), it migrates information into the \(S^{1}\) space, allowing the model to achieve better performance.

Fig. 4. Ablation study. Left and middle: the change in accuracy of the ResNet-56 model on CIFAR-10 using the Res or CGU algorithm, respectively, along with the number of channels retained in each layer after pruning; the FLOPs reduction in this case is 52.9%. Right: the change in accuracy with and without the adapted \(\lambda \)-decay for the ResNet-32 model on CIFAR-10 with FLOPs reduced by 52.94%.

Robustness of the Adapted \(\lambda \)-decay Method. To demonstrate the generalizability of the adapted \(\lambda \)-decay regularization, we transfer it to the ResRep framework and verify its performance. Specifically, we train the ResNet-32 benchmark network on CIFAR-10 for a total of 400 epochs, using the same optimizer settings and learning rate as in Sect. 4.2. Figure 4 (right) shows the accuracy variation from the 160th epoch to the 480th epoch during pruning training, with FLOPs reduced by 52.94%. We observe that after applying the \(\lambda \)-decay method, accuracy improves at the same FLOPs reduction, which implies that \(\lambda \)-decay helps the model converge closer to the global optimum.

5 Conclusions

In this paper, we examine the problem of model pruning from the perspective of information migration and propose the EIMP scheme. Our key contributions are a new gradient update rule, referred to as CGU, and an adapted \(\lambda \)-decay regularization method. CGU lets the pruned part of the network transfer its valid information to the retained portion of the model, potentially offering a new direction for future pruning research. The adapted \(\lambda \)-decay regularization enables an automatic trade-off between performance and the regularization term by assigning varying regularization intensities. Both components of EIMP can be easily applied to other DNN frameworks.