1 Introduction

When deep neural networks (DNNs) [1] first emerged, the scarcity of labeled data and the limited storage and computing power of hardware prevented them from realizing their full potential. With the rapid growth of labeled datasets and the development of high-performance hardware such as GPUs and TPUs, DNNs have since achieved great success in both scientific research and engineering. As their main building block, convolutional neural networks (CNNs) excel at extracting image features by virtue of parameter sharing and translation invariance. CNNs have therefore received extensive attention in computer vision tasks such as image classification, object detection, semantic segmentation, style transfer, and image super-resolution, where their performance is significantly better than that of traditional methods. However, as image and video tasks become more complex, the scale and variety of CNNs keep increasing. Although this improves accuracy, it also raises the hardware and computing cost of deployment, which limits the application of high-performance CNNs on resource-constrained devices. On the other hand, several works [2,3,4] have shown that existing CNNs contain a certain degree of parameter redundancy, which provides the background and theoretical basis for network compression.

Existing CNN compression methods mainly consist of optimizing the computation of convolutions and designing network compression algorithms. In a nutshell, network compression aims to reduce the number of parameters and FLOPs as much as possible while preserving network performance. Mainstream algorithms include network pruning, quantization, low-rank decomposition, and knowledge distillation. Among them, network pruning based on parameter importance is particularly convenient and effective. However, existing pruning methods of this type tend to produce widely different compression rates for parameters and FLOPs. Moreover, during iterative pruning, the performance loss after each pruning step gradually degrades the accuracy of the importance estimates used in later compression steps. In response to these issues, we design a strategy that efficiently optimizes the continuously compressed network during balanced pruning. In this way, the network can recover its accuracy within a few training epochs so that iterative pruning can proceed, and the final pruned network suffers almost no performance loss. The motivation of our method is twofold. Firstly, Komodakis and Zagoruyko [5] demonstrate through extensive experiments that the feature maps of a large network attend to the object more accurately than those of a small network. Secondly, Lin et al. [6] introduce GANs to optimize network compression, but they add a mask for pruning, which increases the cost of pruning and requires a separate optimization for this parameter.

In this paper, we propose a global balanced iterative pruning (GBIP) method. Firstly, we design a global balanced pruning scheme that eliminates unnecessary parameters by analyzing the magnitude distribution of channels. Considering that simple magnitude pruning, either across different layers or within the same layer, may lead to unbalanced pruning rates of parameters and FLOPs, we do not perform the magnitude analysis across different layers. Then, we introduce an efficient performance recovery policy, which defines the original network as the teacher and the pruned network as the student, and uses the intermediate feature maps and the output features of the teacher to transfer the knowledge learned by the original network to the student during fine-tuning. Moreover, we construct a shallow neural network as a platform on which the output features of the two networks play an adversarial game. These strategies act on the compact network after every pruning step, and iterative pruning is carried out during the training phase. In this way, the pruned network can recover its accuracy within a few training epochs after each pruning operation, which provides more accurate guidance for judging parameter importance in the next pruning step and shortens the whole pruning phase. The final compact network is retrained to restore the experimental accuracy.

To demonstrate the effectiveness of GBIP, we prune VGGNet [7], ResNet [8], and GoogLeNet [9] on the image classification datasets CIFAR-10, CIFAR-100 [10], and ILSVRC-2012 [11]. Moreover, we perform further experiments with SSD [12] on the object detection dataset PASCAL VOC [13]. The results show that the proposed pruning method can compress and accelerate CNNs without harming their overall performance. On CIFAR-10, when 97.22% of the parameters and 96.57% of the FLOPs of VGG-16 are removed, the classification accuracy still reaches 90.29%. In addition, when the compression rate of SSD exceeds 50%, the performance loss is still less than 1.00%.

The proposed GBIP can be applied to many convolutional networks in image classification tasks and also generalizes well to object detection. Existing network compression methods can be combined with our efficient performance recovery strategy to increase the accuracy of the compressed network. Furthermore, because no sparsity is introduced, GBIP does not require additional sparse matrix operations or acceleration libraries. The entire pruning process is controlled by a single parameter, which notably reduces manual intervention and enables automatic compression and acceleration. If a larger CNN is adopted as the teacher network, the efficiency of pruning and the performance of the compressed network can be further improved.

The main contributions of our work are as follows:

  • This paper proposes a global balanced pruning scheme for convolutional channels. We analyze the magnitude distribution of intermediate feature maps to eliminate the unimportant parameters and connections.

  • We design an efficient performance recovery method. The abundant knowledge learned in the training process of the original network is applied to guide the compact network to quickly recover the accuracy in the iterative pruning interval.

  • We demonstrate the effectiveness of the method with extensive experiments on CIFAR-10, CIFAR-100, and ILSVRC-2012. Moreover, the results on the object detection dataset PASCAL VOC further verify that GBIP generalizes well to CNN compression and acceleration. The ablation analysis shows that adjusting the hyperparameter can stably control the pruning rate.

2 Related works

At present, convolutional network compression has received widespread attention from both academia and industry, and many effective methods such as pruning, quantization, low-rank decomposition, and knowledge distillation have emerged. The works most related to our method are presented as follows:

2.1 Network pruning

Network pruning removes relatively redundant weights or filters, selected according to parameter importance, to compress and accelerate a CNN while preserving task accuracy. The key is to determine the evaluation criteria for parameter importance and then to design an effective pruning strategy. Some existing methods are based on the magnitude of parameters. Since 2016, pruning-based compression of deep neural networks has received wide attention from both academia and industry, and magnitude pruning remains one of the most efficient approaches in a large body of recent work. Han et al. [2] is a typical unstructured pruning method, which uses the value of each weight to measure the redundancy of connections and directly sets neurons smaller than a threshold to zero. Li et al. [14] delete the filters with the smaller L1 norm; the impact of pruning each layer on the accuracy drop must be analyzed before pruning, sensitive layers are pruned with a smaller pruning rate or skipped entirely, and a specific pruning rate is set manually for every pruned layer. Polyak and Wolf [15] apply the variance of channel activations to measure their contribution; the network is pruned sequentially from the lower layers to the top ones, and each pruning step is followed by fine-tuning before the next layer is pruned. In contrast, the pruning in this paper is carried out independently on all layers simultaneously. He et al. [16] prune networks according to the L2 norm of filters. Molchanov et al. [17] use the change in the loss after deleting a parameter as its importance criterion; pruning is formulated as an optimization problem approximated with a first-order Taylor expansion, so that parameter importance reduces to the product of the activation and its gradient, and traditional fine-tuning is used to recover accuracy between pruning iterations. In contrast, our method uses the maximum-regularized L1 norm of the activations as the evaluation criterion of parameter importance. Liu et al. [18] use the scaling factors of the batch normalization layers as the importance criterion. Lin et al. [19] utilize the rank of the feature map matrix to judge how much information it contains. Li et al. [20] slim CNNs through the diversity and similarity of feature maps. Tang et al. [21] fit the input complexity and feature similarity to the pruned network space to dynamically discard redundant filters. Wu et al. [22] use the product of filter sparsity and feature dispersion to measure filter importance. Chin et al. [23] learn a pair of parameters for each layer and apply an affine transformation to the L2 norms of its filters to obtain an importance ranking; the less important filters are then removed according to a preset budget of floating point operations. Other pruning strategies are based on the impact of the deleted parameters on the performance drop. For example, Yu et al. [24] measure the importance of neurons by minimizing the reconstruction error of the second-to-last layer before the final classification layer. Lee et al. [25] introduce connection sensitivity to evaluate the importance of structures, and pruning is performed at parameter initialization, before training. Guo et al. [26] model channel pruning as a Markov process optimized with the standard regularization loss plus a parameter or FLOPs budget regularizer. You et al. [27] multiply each intermediate feature map by a scale factor and estimate the accuracy loss caused by setting this factor to zero to determine the importance of the corresponding filters. Guo et al. [28] reconstruct the cropped features and observe the impact on the classification loss to carry out layer-by-layer channel pruning. Still other approaches combine existing advanced algorithms to compress the network: [29] uses reinforcement learning to search for better pruning strategies, Liu et al. [30] combine meta-learning to find compact networks with better performance, Lin et al. [31] formulate the search for the optimal pruned structure as an optimization problem and solve it automatically with the ABC algorithm, and Ding et al. [32] generate a global pruning strategy with a long short-term memory network. Compared with the aforementioned works, the proposed method iteratively prunes unnecessary channels and connections based on the importance magnitude distribution of feature maps with almost no decrease in accuracy. Moreover, our approach compresses the number of parameters and FLOPs in a balanced way.

2.2 Knowledge transfer

Knowledge transfer utilizes a pre-trained high-performance teacher network to guide a smaller student network, thereby improving the accuracy of the small network. Ba and Caruana [33] use the input of the final softmax layer to represent the knowledge learned by the teacher network and supervise the training of the student network with it. Hinton et al. [34] introduce a temperature T to the output of the softmax layer and train the student with the resulting soft labels together with the real targets. These two methods only consider the information contained in the output, which is relatively limited. The FitNet proposed in [35] applies not only the output of the teacher network but also its intermediate features to jointly optimize the training of the student network; this makes it possible to train a deep and narrow student network while enhancing its generalization ability. Komodakis and Zagoruyko [5] propose attention transfer, which uses the attention maps of the teacher network to deliver information about where the teacher attends to the student network and thereby improve its performance. Different from the aforementioned methods, we mainly use the information in the feature maps and output features of the original network to guide the pruned network to quickly eliminate the accuracy loss, thus making each pruning operation of the iterative compression more accurate.

3 Proposed method

3.1 The global balanced iterative pruning framework

Figure 1 shows the overall framework of our pruning method. Firstly, the labeled training images are fed into the original network. We analyze the magnitude distribution of the feature maps in each pruned layer and remove the unnecessary channels. Then we select three pairs of feature maps, as well as the outputs, from the original network and the pruned one to transfer attention and knowledge. As the figure shows, for the same input sample, the attention maps generated by the pruned student network attend to the classification object much more weakly than those of the original teacher network. Afterwards, a shallow neural network is constructed so that the two output features can play an adversarial game, which further improves the accuracy of the pruned network. Finally, after a few epochs of accuracy recovery, the next pruning step is performed. In this way, we achieve iterative pruning to compress and accelerate the original convolutional network.

Fig. 1
figure 1

The global balanced iterative pruning (GBIP) framework. (This figure is best viewed in color and zoomed in) (Color figure online)

3.2 Global balanced pruning strategy

Given a convolutional neural network with L layers, we refer to \(C=\left( {{C}_{1}},{{C}_{2}},\ldots ,{{C}_{L}} \right) \) as the original network structure, where \({C}_{l}\) is the number of channels in the lth layer. \(W\in {{{\mathbb {R}}}^{{{C}_\mathrm{out}}\times {{C}_\mathrm{in}}\times K\times K}}\) is the weight of the filter, where \({C}_\mathrm{out}\) is the number of output channels, \({C}_\mathrm{in}\) is the number of input channels, and \(K\times K\) is the size of the filter. The feature map generated in the lth layer is \(M\in {{\mathbb {R}}^{{{C}_{l}}\times w\times h}}\), where w and h are width and height of the feature map. With the input sample x, the output produced by the original network is defined as \({{f}_\mathrm{T}}\left( x,{{W}_\mathrm{T}} \right) \), and the output of the pruned network is defined as \({{f}_\mathrm{S}}\left( x,{{W}_\mathrm{S}} \right) \). \(G\left( f\left( x,W \right) ,{{W}_\mathrm{G}} \right) \) is the adversarial platform, where \(f\left( x,W \right) \) is the output of the teacher or the student network.

Here, we propose a global balanced pruning strategy for convolutional neural networks. To reduce the complexity of network pruning, an effective pruning method based on the magnitude of parameters should be designed. Filters with smaller weights tend to produce relatively weak activation feature maps compared with the other filters in the same layer, so such filters can be removed preferentially to reduce network redundancy during pruning. Most existing methods delete unimportant parameters directly based on the L1 norm, L2 norm, or another magnitude measure of the filters or feature maps. However, this relies on a relatively uniform distribution of feature-map magnitudes; otherwise, an unreasonable pruning threshold causes enormous differences in the pruning rate across layers, which seriously affects the final performance of the network. Here, we analyze the L1 norm of the feature maps of VGG-16 and ResNet-56 on CIFAR-10 and of ResNet-18 on ILSVRC-2012. Specifically, we first calculate the L1 norms of all feature maps in the layers to be pruned and then perform maximum regularization on the feature maps of each layer using Eq. 1 to obtain the importance score \(m_{l}^{c}\) of each feature map.

$$\begin{aligned} m_l^c = {{{{\left\| {M_l^c} \right\| }_1}} / {\hbox {max}\left\{ {{\left\| {M_l^1} \right\| }_1},{{\left\| {M_l^2} \right\| }_1}, \ldots ,{{\left\| {M_l^{{C_l}}} \right\| }_1}\right\} }} \end{aligned}$$
(1)

where \(c\in \left\{ 1,2,\ldots ,{{C}_{l}} \right\} \) is the index of each feature map in the lth layer and \({\left\| \cdot \right\| _1}\) denotes the L1 norm.
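For concreteness, the score of Eq. 1 can be computed per layer as in the following PyTorch-style sketch. This is our own illustration rather than the authors' released code, and averaging over the input batch is an assumption, since the paper does not state how the scores are aggregated over samples.

```python
import torch

def importance_scores(feature_maps: torch.Tensor) -> torch.Tensor:
    """Per-channel importance scores m_l^c of Eq. 1 for one pruned layer.

    `feature_maps` is assumed to hold the layer's activations for a batch,
    shaped (N, C_l, h, w); taking the batch average of the L1 norms is our choice.
    """
    l1 = feature_maps.abs().sum(dim=(2, 3)).mean(dim=0)  # L1 norm per channel, shape (C_l,)
    return l1 / l1.max()                                  # maximum regularization within the layer
```

The resulting score distributions are visualized in Fig. 2.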

Fig. 2
figure 2

The importance score distribution density of the feature map in the pruned layers

It can be seen from the figure that, for CIFAR-10, the importance scores of VGG-16 are generally concentrated between 0 and 0.5. The importance distribution of ResNet-56 is comparatively uniform, but the scores in the first few layers still lie almost entirely between 0 and 0.5. On ILSVRC-2012, the importance of the features of ResNet-18 differs significantly from layer to layer: the scores of the sixth layer lie almost entirely between 0 and 0.4, while those of the eighth layer mainly vary from 0.4 to 1. Therefore, directly setting a threshold on the L1 norm cannot achieve the ideal compression for all layers of the network. If the layer-wise pruning rate is preset and kept the same for all layers, the final pruning rates of parameters and FLOPs can indeed be made exactly equal, but when this per-layer compression rate is large, the performance of the pruned network is poor. If, instead, different pruning rates are preset for different layers, a balanced compression rate is not always obtained, i.e., the difference between the pruning rates of parameters and FLOPs can be large; for some methods the two rates differ by nearly 40%. Therefore, we apply a pruning factor k to the mean importance score of the feature maps to determine the final pruning threshold \(m_{l}^{p}\) of the lth layer.

$$\begin{aligned} m_{l}^{p}=k\cdot \frac{1}{{{C}_{l}}}\sum \limits _{c=1}^{{{C}_{l}}}{m_{l}^{c}} \end{aligned}$$
(2)

where \(k\in \left( 0,1 \right) \) is the pruning threshold factor used to control the network pruning rate, and it is the only adjustable parameter in our proposed method. The numbers of parameters and FLOPs of a convolutional layer are calculated as follows:

$$\begin{aligned} & \hbox {Params} = {C_\mathrm{out}} \times \left( {{C_\mathrm{in}} \times K \times K + 1} \right) \end{aligned}$$
(3)
$$\begin{aligned}&\hbox {FLOPs} = 2 \times w \times h \times {C_\mathrm{out}} \times \left( {{C_\mathrm{in}} \times K \times K + 1} \right) \end{aligned}$$
(4)

Pruning channels is equivalent to reducing \( C_\mathrm{out} \). The spatial dimension \(w \times h\) of the feature maps in the earlier layers is larger than in the later layers, so if the channel pruning rate of the earlier layers is considerably larger, the compression ratio of FLOPs can be much higher than that of the parameters. With the method in this paper, the channels in each layer are compressed in a relatively balanced way, so the pruning rates of parameters and FLOPs stay close to each other, which is more conducive to balanced compression and acceleration of the convolutional network. Moreover, the compression amplitude of each layer can be adaptively adjusted via k to achieve global balanced pruning at different strengths. With this strategy, redundant parameters can be deleted at each pruning step without unbalanced compression across layers, and the performance of the network does not suffer a great loss after pruning.
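As a sketch of how the threshold of Eq. 2 and the counts of Eqs. 3 and 4 fit together, consider the following illustrative code; the function names are ours, and the per-layer mask simply keeps the channels whose importance score exceeds the threshold.

```python
import torch

def keep_mask(scores: torch.Tensor, k: float) -> torch.Tensor:
    """Boolean mask over the C_l channels of one layer (Eq. 2)."""
    threshold = k * scores.mean()   # m_l^p = k times the mean importance score
    return scores > threshold       # channels to keep

def conv_params(c_out: int, c_in: int, kernel: int) -> int:
    """Parameter count of a convolutional layer (Eq. 3, weights plus bias)."""
    return c_out * (c_in * kernel * kernel + 1)

def conv_flops(c_out: int, c_in: int, kernel: int, w: int, h: int) -> int:
    """FLOPs of a convolutional layer (Eq. 4)."""
    return 2 * w * h * conv_params(c_out, c_in, kernel)
```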

3.3 Performance recovery scheme

3.3.1 Knowledge transfer

The intermediate feature maps of a CNN are the concrete or abstract representations extracted by the filters from the input images; they show which objects the network attends to for a specific task. For image classification, the feature maps highlight the target to be classified and suppress the background and irrelevant objects to obtain a more reliable classification result. Therefore, whether a feature map attends to the target precisely, and how strong that attention is, are especially important for the performance of the network. As Fig. 1 shows, the feature maps of the pruned network pay less attention to the target, which seriously affects both the correctness of pruning and the accuracy of the compressed network. For this reason, we introduce knowledge transfer into the pruning process.

We select three layers with feature maps of different dimensions and integrate the feature maps of each layer into an attention map that guides the pruned network to focus on the classification target. Specifically, for the \({{C}_{l}}\) intermediate feature maps of the lth layer, the attention map is constructed using Eq. 5:

$$\begin{aligned} {A^l}\left( {{M^{ab}}} \right) = \frac{1}{{{C_l}}}\sum \limits _{i = 1}^{{C_l}} {{{\left( {M_i^{ab}} \right) }^2}} \end{aligned}$$
(5)

where \( M_i^{ab} \) is the pixel at position \(\left( a,b \right) \) of the ith feature map in the lth layer, with \(a \in \left\{ 0,1, \ldots ,w - 1 \right\} \) and \(b \in \left\{ 0,1, \ldots ,h - 1 \right\} \). The attention map produced in this way captures the region of the input sample that the layer attends to, and it also represents the amount of information learned in that layer. Then, as shown in Eq. 6, after normalizing the three pairs of attention maps of the two networks, we use the L2 norm of their differences to construct the attention transfer loss \({{\mathcal {L}}_\mathrm{AT}}\) of the student network.

$$\begin{aligned} {{{\mathcal {L}}}_\mathrm{AT}} = \sum \limits _{l = 1}^3 \left\| {{\frac{{A_S^l}}{\left\| {A_S^l}\right\| _2}} - \frac{{A_T^l}}{\left\| {A_T^l}\right\| _2}} \right\| _2 \end{aligned}$$
(6)

where \(A_{S}^{l}\) is the attention map in the lth layer of the student network and \(A_{T}^{l}\) is the corresponding attention map of the teacher network. Through the \({{\mathcal {L}}}_\mathrm{AT}\) loss, the attention of the student is kept as close as possible to that of the teacher during inference. In the experiments, we prune VGGNet, ResNet, and GoogLeNet.
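A PyTorch-style sketch of Eqs. 5 and 6 is given below; it assumes that the three (student, teacher) feature-map pairs are collected at matching positions with matching spatial sizes, and all names are illustrative rather than taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def attention_map(feature_maps: torch.Tensor) -> torch.Tensor:
    """Channel-wise mean of squared activations (Eq. 5), flattened per sample."""
    att = feature_maps.pow(2).mean(dim=1)   # (N, h, w)
    return att.flatten(start_dim=1)         # (N, h*w)

def attention_transfer_loss(student_feats, teacher_feats) -> torch.Tensor:
    """Sum of L2 distances between L2-normalized attention maps (Eq. 6)."""
    loss = 0.0
    for fs, ft in zip(student_feats, teacher_feats):
        a_s = F.normalize(attention_map(fs), dim=1)   # A_S^l / ||A_S^l||_2
        a_t = F.normalize(attention_map(ft), dim=1)   # A_T^l / ||A_T^l||_2
        loss = loss + (a_s - a_t).norm(p=2, dim=1).mean()
    return loss
```

The specific positions at which attention maps are extracted in these three networks are plotted in Fig. 3.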

Fig. 3
figure 3

The positions of VGGNet, ResNet, and GoogLeNet for extracting attention maps

When dealing with image classification tasks, the output of the convolutional network is a probability for each category, so we also apply the output features of the original network to guide the compressed network; this allows more accurate pruning while improving the performance of the pruned network. For the outputs \({{f}_\mathrm{T}}\left( x \right) \) and \({{f}_\mathrm{S}}\left( x \right) \) of the teacher and the student network, this paper introduces a temperature \({{T}_\mathrm{emp}}\), following the idea of [34], to smooth the two outputs as shown in Eqs. 7 and 8. In this way, the classification probabilities of the student can be made as similar as possible to those of the teacher, the probabilities of the incorrect classes are prevented from all collapsing to zero, and a smoother category probability distribution is obtained.

$$\begin{aligned}&p\left( x \right) ={{F}_\mathrm{soft\max }}\left( {{{f}_\mathrm{S}}\left( x \right) }/{{{T}_\mathrm{emp}}}\; \right) \end{aligned}$$
(7)
$$\begin{aligned}&q\left( x \right) ={{F}_\mathrm{soft\max }}\left( {{{f}_\mathrm{T}}\left( x \right) }/{{{T}_\mathrm{emp}}}\; \right) \end{aligned}$$
(8)

where \({{F}_\mathrm{soft\max }}\left( \cdot \right) \) denotes the softmax function. The KL divergence between \(p\left( x \right) \) and \(q\left( x \right) \) is then calculated according to Eq. 9, and the accuracy of the student can be improved by reducing this divergence during training.

$$\begin{aligned} {{D}_{KL}}\left( p\parallel q \right) =\sum \limits _{i=1}^{n}{\left[ p\left( x \right) \hbox {log}\left( p\left( x \right) \right) -p\left( x\right) \hbox {log}\left( q\left( x \right) \right) \right] } \end{aligned}$$
(9)

At the same time, to better correct the output of the student network, the cross-entropy loss \({{\mathcal {L}}_\mathrm{CE}}\left( {{W}_\mathrm{S}} \right) \) between the output features of the student and the ground-truth labels is added to the above divergence; together they form the output transfer loss \({{\mathcal {L}}_\mathrm{OT}}\left( {{W}_\mathrm{S}} \right) \).

$$\begin{aligned} {{\mathcal {L}}_\mathrm{OT}}\left( {{W}_\mathrm{S}} \right) =\alpha \cdot {{\mathcal {L}}_\mathrm{KL}}\left( {{W}_\mathrm{S}} \right) +(1-\alpha ){{\mathcal {L}}_\mathrm{CE}}\left( {{W}_\mathrm{S}} \right) \end{aligned}$$
(10)

where \(\alpha \) is the weight balancing the KL divergence and cross-entropy losses, and \({{\mathcal {L}}_\mathrm{KL}}\left( {{W}_\mathrm{S}} \right) \) is formulated in Eq. 11. To keep the two losses at roughly the same magnitude, we multiply \({{D}_\mathrm{KL}}\) by \(T_\mathrm{emp}^{2}\).

$$\begin{aligned} {{\mathcal {L}}_\mathrm{KL}}\left( {{W}_\mathrm{S}} \right) =T_\mathrm{emp}^{2}\cdot {{D}_\mathrm{KL}}\left( p\parallel q \right) \end{aligned}$$
(11)
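The output transfer loss of Eqs. 7–11 can be sketched as follows. The KL direction follows Eq. 9 as written, the value of \(T_\mathrm{emp}\) is left as an argument because the paper does not report it, and \(\alpha =0.3\) is taken from the experimental settings in Sect. 4; everything else is our own illustrative reading, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def output_transfer_loss(student_logits, teacher_logits, labels,
                         temperature: float, alpha: float = 0.3):
    """L_OT of Eq. 10: temperature-scaled KL divergence plus cross entropy."""
    p = F.softmax(student_logits / temperature, dim=1)                   # Eq. 7
    log_q = F.log_softmax(teacher_logits / temperature, dim=1).detach()  # log of Eq. 8
    # D_KL(p || q) of Eq. 9, written out explicitly; the small constant avoids log(0).
    d_kl = (p * (torch.log(p + 1e-12) - log_q)).sum(dim=1).mean()
    l_kl = temperature ** 2 * d_kl                                       # Eq. 11
    l_ce = F.cross_entropy(student_logits, labels)
    return alpha * l_kl + (1 - alpha) * l_ce                             # Eq. 10
```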

Different from existing knowledge transfer methods, in which the structure of the student network is fixed, we keep compressing the network. In fact, knowledge transfer is only an auxiliary strategy of iterative pruning in our method; its purpose is to quickly mitigate the accuracy loss caused by each pruning step. To fully accelerate the optimization of the pruned network, in terms of both the training process and the final result, we apply the intermediate feature maps and the output simultaneously to guide the fine-tuning of the compact network. In this way, both the intermediate representations and the final classification information learned by the original network are thoroughly transmitted to the pruned student network. This not only ensures that the student accurately identifies the parameters that are unimportant for the task, but also restores the network's performance within a few training epochs after each pruning step, so that iterative pruning can proceed during training.

3.3.2 Adversarial game

The above knowledge transfer achieves an effective delivery of semantic information from the unpruned network to the pruned one. On this basis, we find that introducing an adversarial game strategy further improves both the final output performance of the student and the speed at which accuracy is recovered during iterative pruning. Hence, this paper constructs a shallow neural network as an adversarial platform and lets the outputs of the student and the teacher network play an adversarial game on it, so that the output of the student network approaches that of the teacher network more closely. The adversarial game loss of the student network is defined as follows:

$$\begin{aligned} {{\mathcal {L}}_\mathrm{AG}}({{W}_\mathrm{S}})={{E}_{{{f}_\mathrm{S}}(x)\sim {{p}_\mathrm{S}}(x)}}\left[ \hbox {log}\left( 1-G\left( {{f}_\mathrm{S}}\left( x,{{W}_\mathrm{S}} \right) ,{{W}_\mathrm{G}} \right) \right) \right] , \end{aligned}$$
(12)

where \({{p}_\mathrm{S}}(x)\) represents the feature distribution of the student network. Combined with the knowledge transfer loss in the previous section, the training loss of the student network in the proposed pruning method consists of the following three parts:

$$\begin{aligned} {{\mathcal {L}}_\mathrm{S}}\left( {{W}_\mathrm{S}} \right) ={{\mathcal {L}}_\mathrm{AG}}\left( {{W}_\mathrm{S}} \right) +{{\mathcal {L}}_\mathrm{AT}}\left( {{W}_\mathrm{S}} \right) +{{\mathcal {L}}_\mathrm{OT}}\left( {{W}_\mathrm{S}} \right) . \end{aligned}$$
(13)

The network used for the adversarial game must be trained continuously to distinguish whether its input comes from the teacher or from the pruned network. It should produce a positive response to output features from the teacher network, while treating output features generated by the student network as pseudo samples. To be specific, the loss of the adversarial platform during training is defined as follows:

$$\begin{aligned} \begin{aligned} {{\mathcal {L}}_\mathrm{G}}({{W}_\mathrm{G}})&={{E}_{{{f}_\mathrm{T}}(x)\sim {{p}_\mathrm{T}}(x)}}\left[ \hbox {log}\left( 1-G\left( {{f}_\mathrm{T}}\left( x,{{W}_\mathrm{T}} \right) ,{{W}_\mathrm{G}} \right) \right) \right] \\&\quad +{{E}_{{{f}_\mathrm{S}}(x)\sim {{p}_\mathrm{S}}(x)}}\left[ \hbox {log}\left( G\left( {{f}_\mathrm{S}}\left( x,{{W}_\mathrm{S}} \right) ,{{W}_\mathrm{G}} \right) \right) \right] , \end{aligned} \end{aligned}$$
(14)

where \({{p}_\mathrm{T}}(x)\) represents the feature distribution of the teacher network. The adversarial platform and the pruned network are optimized alternately in each training epoch to accelerate the performance improvement of the student network. Combined with the knowledge transfer described above, the accuracy of the compact network can be regained after only a few training epochs, after which the next pruning step of the iterative procedure is carried out. Accordingly, the entire pruning process becomes more compact and accurate, and our method significantly improves the accuracy of the network after pruning. Extensive experiments show that even at a considerable compression rate, the performance of the pruned network after retraining can still reach or exceed that of the original network; the experiments later in this paper demonstrate the effectiveness and accuracy of our channel pruning method for network compression and acceleration. Algorithm 1 shows the pseudocode of the global balanced iterative pruning method: given a pre-trained original convolutional network, a compact model \(S_\mathrm{pruned}^{*}\) is obtained after pruning with the GBIP scheme, and the pruned model is finally retrained from scratch to restore the experimental accuracy.

Algorithm 1 The global balanced iterative pruning (GBIP) method
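To make the interaction of the three losses and the alternating update concrete, the following condensed sketch shows one fine-tuning step between pruning operations. The 128-256-128 platform follows the description in Sect. 4; the optimizers, the `knowledge_loss` callable (assumed to return \({\mathcal {L}}_\mathrm{AT}+{\mathcal {L}}_\mathrm{OT}\) for the batch, e.g. via forward hooks at the attention positions), and all other names are our own illustrative choices rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def make_adversarial_platform(num_classes: int) -> nn.Module:
    """Shallow 128-256-128 fully-connected network used as the adversarial platform."""
    return nn.Sequential(
        nn.Linear(num_classes, 128), nn.ReLU(),
        nn.Linear(128, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, 1), nn.Sigmoid(),   # response G(.) in (0, 1)
    )

def train_step(student, teacher, platform, images, labels,
               opt_s, opt_g, knowledge_loss, eps: float = 1e-7):
    """One alternating update of the pruned student (Eq. 13) and the platform (Eq. 14)."""
    with torch.no_grad():
        t_out = teacher(images)
    s_out = student(images)
    # Student update: adversarial game loss (Eq. 12) plus attention and output transfer.
    l_ag = torch.log(1.0 - platform(s_out) + eps).mean()
    l_s = l_ag + knowledge_loss(s_out, t_out, labels)
    opt_s.zero_grad(); l_s.backward(); opt_s.step()
    # Platform update: drive its response toward 1 for the teacher and 0 for the student (Eq. 14).
    l_g = torch.log(1.0 - platform(t_out) + eps).mean() \
        + torch.log(platform(s_out.detach()) + eps).mean()
    opt_g.zero_grad(); l_g.backward(); opt_g.step()
```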

For most existing network pruning methods, the compressed network inherits the weights and biases of the original network and recovers as much performance as possible through fine-tuning. However, when the pruning rate is large, the accuracy recovered by fine-tuning is limited, and the true potential of the compact architecture is not fully revealed. Liu et al. [36] make the surprising observation in structured network pruning that fine-tuning a pruned model gives only comparable or worse performance than training that model with randomly initialized weights, and their results suggest that the pruned architecture itself, rather than the set of inherited important weights, is what matters most for the efficiency of the final model. Our results of pruning VGG-16 on CIFAR-10 further verify this observation. To fully demonstrate the performance of the compact network, we therefore retrain the pruned network from scratch with the performance recovery method in our experiments. Specifically, we keep the total training FLOPs consistent before and after pruning: the number of training epochs of the original network is multiplied by the FLOPs acceleration rate to obtain the retraining epochs of the compressed network. Finally, we compare the accuracy of the pruned network with that of the original network to draw a conclusion.
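Under this budget rule, the retraining schedule can be read as follows (our interpretation of the procedure above):

```python
def retrain_epochs(base_epochs: int, flops_before: float, flops_after: float) -> int:
    """Scale the training budget by the FLOPs acceleration rate so that the total
    retraining cost roughly matches the cost of training the original network."""
    return round(base_epochs * flops_before / flops_after)

# Example: a baseline trained for 160 epochs whose FLOPs are halved by pruning
# would be retrained for about 320 epochs.
```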

3.4 Pruning strategy for different CNNs

Since different CNNs have different structures, the pruning details must change accordingly. We perform experiments on VGGNet, ResNet, and GoogLeNet. Among them, VGGNet is a plain layer-by-layer convolutional network without any special architecture, so all layers can be pruned directly without affecting the integrity of the final structure. ResNet contains residual modules, so arbitrarily compressing each layer would break the dimension matching of the channels. A basic residual block is composed of two convolutional layers; we only discard output channels of the first layer, and the input channels of the second layer change accordingly. In this way, the overall dimensions of the ResNet still match and the network can be trained correctly after pruning. GoogLeNet is a more complex convolutional network with multiple Inception V3 modules, each containing four branches; we prune the branches containing two and three convolutional layers to perform the compression and acceleration.
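For the residual case described above, the dimension bookkeeping can be illustrated with the following sketch, which rebuilds the two convolutions of a basic block so that only the first convolution loses output channels; the helper name and the weight copying (so that fine-tuning between pruning steps continues from the current values) are our own assumptions.

```python
import torch
import torch.nn as nn

def prune_basic_block(conv1: nn.Conv2d, bn1: nn.BatchNorm2d,
                      conv2: nn.Conv2d, keep: torch.Tensor):
    """Restrict conv1's output channels and conv2's input channels to the boolean mask
    `keep`, leaving the block's external dimensions (and hence the shortcut) untouched."""
    n_keep = int(keep.sum())
    new_conv1 = nn.Conv2d(conv1.in_channels, n_keep, conv1.kernel_size,
                          stride=conv1.stride, padding=conv1.padding, bias=False)
    new_conv1.weight.data = conv1.weight.data[keep].clone()
    new_bn1 = nn.BatchNorm2d(n_keep)
    new_bn1.weight.data = bn1.weight.data[keep].clone()
    new_bn1.bias.data = bn1.bias.data[keep].clone()
    new_bn1.running_mean.data = bn1.running_mean[keep].clone()
    new_bn1.running_var.data = bn1.running_var[keep].clone()
    new_conv2 = nn.Conv2d(n_keep, conv2.out_channels, conv2.kernel_size,
                          stride=conv2.stride, padding=conv2.padding, bias=False)
    new_conv2.weight.data = conv2.weight.data[:, keep].clone()
    return new_conv1, new_bn1, new_conv2
```

The specific structures and pruning schemes of the three networks are plotted in Fig. 4.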

Fig. 4
figure 4

Illustration of pruning VGGNet, ResNet and GoogLeNet. The black font indicates the number of original channels, and the red font indicates that after pruning (Color figure online)

4 Experiments

We demonstrate the effectiveness of the proposed method by pruning VGGNet, ResNet, and GoogLeNet on CIFAR-10, CIFAR-100, and ILSVRC-2012. Moreover, we compress SSD via GBIP on PASCAL VOC to analyze its generalization to object detection. All experiments are implemented with PyTorch on NVIDIA TITAN X GPUs. For a fair comparison with existing pruning methods, network pre-training and parameter settings follow [8]. Specifically, CNNs are pre-trained for 160 epochs on CIFAR and 90 epochs on ILSVRC-2012. The learning rate is initially set to 0.1 and decreased by a factor of 10 at one half and three quarters of the total epochs. Stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 1e−4 is used for backpropagation. In the retraining stage, we adjust the learning rate with the cosine annealing schedule. The parameter settings during iterative pruning are as follows. The weight in the output transfer loss is \(\alpha =0.3\). On CIFAR, the total number of training epochs is \(N=30\) and the pruning interval is \({{s}_\mathrm{p}}=10\); on ILSVRC-2012, the training epochs for pruning are \(N=20\) with a pruning interval of \({{s}_\mathrm{p}}=10\). The pruning threshold factor k is the only parameter changed between pruning runs. In addition, following [6], we use a neural network composed of three fully-connected layers with 128-256-128 neurons as the adversarial platform in the adversarial game.
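For reference, the optimization settings above correspond roughly to the following PyTorch configuration; the retraining learning rate of 0.1 is our assumption, since only the initial pre-training rate is stated explicitly.

```python
import torch

def pretrain_optimizer(model, epochs: int):
    """SGD with momentum 0.9, weight decay 1e-4, and step decay at 1/2 and 3/4 of the epochs."""
    opt = torch.optim.SGD(model.parameters(), lr=0.1,
                          momentum=0.9, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.MultiStepLR(
        opt, milestones=[epochs // 2, epochs * 3 // 4], gamma=0.1)
    return opt, sched

def retrain_optimizer(model, epochs: int):
    """Same SGD settings with a cosine annealing learning-rate schedule for retraining."""
    opt = torch.optim.SGD(model.parameters(), lr=0.1,
                          momentum=0.9, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return opt, sched
```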

In this section, we compare the proposed method with existing pruning schemes, among which Li et al. [14], SFP [16], DCP [37], FPGM [38], EDP [39], CNN-FCF [40], CCP [41], Taylor-FO-BN [42], HRank [19], ManiDP [21], and NPPM [43] are state-of-the-art methods. Because of differences in experimental equipment and environments, the results reported by different papers also differ slightly. For as fair a comparison as possible, we therefore mainly compare the drop in accuracy after pruning, and the results of the competing methods are reported as in their original articles.

4.1 Results comparison on CIFAR-10

We first prune VGG-16 on CIFAR-10; the results are shown in Table 1. When \(k=0.3\), our method removes 47.86% of the parameters and 44.39% of the FLOPs of VGG-16, yet the accuracy of the network is even improved by 0.54% over the baseline. When the network compression ratio exceeds 80%, the compact network still shows a performance improvement of 0.17%. Although the parameter compression ratio of ABCPruner [31] is 5.65% higher than ours, its FLOPs pruning rate is lower and its final accuracy gain after pruning is also smaller (− 0.06% vs. − 0.17%). As VGG-16 is compressed further, the accuracy of the network gradually declines; when 97.22% of the parameters and 96.57% of the FLOPs are discarded with \(k=0.7\), the accuracy of the final network still reaches 90.29%.

Table 1 Performance comparison of VGG-16 on CIFAR-10
Table 2 Performance comparison of ResNet-56 on CIFAR-10

The experimental results show that, for CIFAR-10, VGG-16 indeed has a certain degree of parameter redundancy; compressing the network reduces the impact of overfitting and improves accuracy, which preliminarily verifies the effectiveness of the proposed pruning method. We then prune ResNet-56, with results tabulated in Table 2. When \(k=0.4\), the parameters and FLOPs of ResNet-56 are reduced by 41.18% and 47.81%, respectively, and the accuracy of the pruned network increases by 0.67%. When \(k=0.5\), the pruning rate exceeds 50.00%, but the network still shows a performance improvement of 0.36%, which is significantly better than the compared algorithms. Although the final accuracy improvement of SRR-GR [49] is 0.01% higher than ours, its FLOPs compression rate is 9.55% lower (53.80% vs. 63.35%). The accuracy of ResNet-56 drops by only 0.38% when 70.37% of the parameters and 73.41% of the FLOPs are deleted; in this case, the network has only 0.21M parameters. In addition, when \(k=0.4\), the classification accuracy of ResNet-56 is 94.09%, which is 0.32% higher than that of VGG-16 with \(k=0.5\), even though it has only about one fifth of the parameters of VGG-16. This also indirectly confirms that the residual module can effectively improve the performance of CNNs in image classification tasks.

Next, we compress ResNet-110. From Table 3 it can be seen that the baseline accuracy of ResNet-110 on CIFAR-10 is 93.53%. When 20.81% of the parameters and 22.95% of the FLOPs are discarded, the accuracy increases by 0.95%. HRank [19] compresses parameters and FLOPs by 41.20% and 39.40%, respectively, which is about 20% lower than our method at \(k=0.5\), and its performance drop is also 0.02% worse. Even when 71.10% of the parameters and 72.82% of the FLOPs are eliminated, the performance of the compressed network is still 0.52% higher than that of the original network. At this scale, the network is similar in size to that of CNN-FCF [40], but its final accuracy loss is 1.14% lower (− 0.52 vs. 0.62). This experiment shows that ResNet-110 has obvious parameter redundancy on CIFAR-10, which leads to overfitting during training and a lower accuracy of the original network, and that our global balanced iterative pruning method achieves better accuracy recovery while accurately compressing ResNet-110. It also shows that the compression rates of parameters and FLOPs and the accuracy drop of our pruning method are significantly better than those of all compared methods.

Table 3 Performance comparison of ResNet-110 on CIFAR-10

To further demonstrate the applicability of GBIP to various convolutional networks, we also prune GoogLeNet. The experimental results are depicted in Table 4. Thanks to the width added by the Inception modules, the baseline accuracy of GoogLeNet on CIFAR-10 reaches 94.72%, ahead of VGGNet and ResNet. When \(k=0.4\), after deleting 33.87% of the parameters and 37.95% of the FLOPs, the accuracy of the network increases by 0.52%. When the parameters and FLOPs are compressed by about 50%, the performance of the compact network increases by 0.41%. Even when 65.64% of the parameters and 69.34% of the FLOPs are removed, the classification accuracy still improves by 0.34%. This attests that GoogLeNet is also redundant on CIFAR-10, and that the iterative pruning method in this paper can effectively eliminate unimportant parameters and improve its experimental performance.

Table 4 Performance comparison of GoogLeNet on CIFAR-10

4.2 Results comparison on CIFAR-100

We continue by pruning VGG-19 and ResNet-56 on CIFAR-100, with results reported in Tables 5 and 6, respectively. CIFAR-100 has the same total numbers of training and test images as CIFAR-10, but the number of categories increases from 10 to 100. As the training data per class decreases, the performance of convolutional networks also drops significantly. Table 5 shows that the baseline accuracy of VGG-19 on CIFAR-100 is 73.58%. When 75.66% of the parameters and 68.54% of the FLOPs are pruned, the accuracy of the retrained compact network increases by 0.48%. Even at \(k=0.5\), the performance declines by only 1.76% while parameters and FLOPs are compressed by 85.17% and 89.82%, respectively, which is significantly better than Slimming [18], Liu et al. [36], and GReg-2 [51] in terms of both compression ratio and performance recovery. This shows that GBIP is also applicable to datasets with relatively few training samples per class. Table 6 shows that the baseline accuracy of ResNet-56 is 71.36%. Because ResNet-56 has significantly fewer parameters and FLOPs than VGG-19, its redundancy is also smaller; nevertheless, when 48.71% of the FLOPs are discarded, there is still a 0.52% improvement in performance. When \(k=0.5\), we remove 68.27% of the FLOPs with a 0.18% accuracy drop, which is still clearly superior to the compared methods.

Table 5 Performance comparison of VGG-19 on CIFAR-100
Table 6 Performance comparison of ResNet-56 on CIFAR-100

The experiments on the CIFAR datasets preliminarily verify the effectiveness and superior performance of the proposed method for image classification. GBIP compresses the parameters and FLOPs of VGGNet, ResNet, and GoogLeNet to a considerable degree with almost no accuracy drop. This also indicates that, across different tasks, existing CNNs contain a certain amount of parameter redundancy, and that removing these unimportant parameters compresses and accelerates the networks without affecting their performance, thereby reducing their computational cost remarkably.

4.3 Results comparison on ILSVRC-2012

Table 7 Performance comparison of ResNet-18 on ILSVRC-2012

To further assess the effectiveness of the proposed pruning method, we experiment on the large image classification dataset ILSVRC-2012, which has 1000 categories that are difficult to classify precisely; the parameters of CNNs are less redundant there, so pruning is more challenging. In this subsection, we select ResNet-18 and ResNet-50 for pruning, which highlights the power of our method. The original Top-1 and Top-5 accuracy of ResNet-18 on ILSVRC-2012 are 70.02% and 89.23%. As shown in Table 7, when less than 46.00% of the FLOPs are pruned with GBIP, the Top-1 and Top-5 accuracy losses are smaller than those of the other methods. Although the parameter compression rate of ABCPruner [31] is 6.77% higher than ours and its FLOPs pruning rate is 0.67% lower, its performance drop after pruning is significantly larger: the Top-1 accuracy in [31] loses 2.38% whereas ours drops only 0.66%, and its Top-5 accuracy decreases 0.89% more than GBIP. FBS [57] prunes 3.95% more FLOPs than GBIP, but its Top-1 and Top-5 accuracy losses are higher than ours by 0.93% (1.59% vs. 0.66%) and 0.34% (0.86% vs. 0.52%), respectively. When \(k=0.5\), the FLOPs reduction of GBIP is 5.45% lower than that of ManiDP [21] and its Top-5 accuracy drop is 0.20% higher (0.52% vs. 0.32%), but its Top-1 accuracy loss is 0.22% lower (0.66% vs. 0.88%). When \(k=0.6\), GBIP deletes 46.32% of the parameters and 51.65% of the FLOPs; in this scenario the compression is significantly higher than that of the compared pruning algorithms, and the accuracy loss is also higher. We conjecture that, as the network keeps being compressed, the number of remaining parameters becomes too small to adequately extract the target information in the images, which lowers the final classification performance.

We then conduct pruning experiments on ResNet-50; Table 8 depicts the performance comparison. The original Top-1 and Top-5 accuracy of ResNet-50 are 75.94% and 92.93%, respectively. When 55.36% of the parameters and 63.34% of the FLOPs are pruned, the Top-1 accuracy decreases by only 0.47% and the Top-5 accuracy drop is even smaller. Although the Top-1 accuracy loss of CNN-FCF [40] is the same as ours and its Top-5 accuracy loss is 0.04% smaller, the compression rates of parameters and FLOPs of GBIP are 12.95% (55.36% vs. 42.41%) and 17.29% (63.34% vs. 46.05%) higher than those of CNN-FCF, respectively.

Table 8 Performance comparison of ResNet-50 on ILSVRC-2012

All the image classification experiments reveal that our global balanced iterative pruning method achieves similar compression rates for the parameters and the FLOPs of convolutional networks. For simple tasks, pruning with GBIP removes overfitting, and after retraining the compact network maintains or even exceeds the accuracy of the original network. Complex classification tasks require more parameters to extract the semantic information in the images, and small convolutional networks contain almost no redundant parameters, so pruning is accompanied by a decrease in accuracy; even so, GBIP keeps the performance loss within a small range. This indicates that the iterative channel pruning method proposed in this paper effectively removes unimportant parameters from CNNs and reduces network redundancy, while also acting as a regularizer on network training.

4.4 Pruning SSD on PASCAL VOC

Existing network pruning algorithms are devoted almost entirely to single-object image classification tasks with obvious targets and rarely address other, more complex tasks. In the real world, object detection scenarios are more varied and impose stricter requirements on storage and real-time operation, yet under uncertain conditions such as occlusion and changes in scale and illumination, these tasks often need more complicated models. Therefore, compressing a detection model while maintaining its accuracy is particularly challenging. To show the generalization of the proposed method, we prune SSD on the PASCAL VOC object detection dataset. The backbone of the SSD adopts the VGG-16 trained on CIFAR-100. Here, we compare the compression rates of parameters and FLOPs and the loss in mean average precision (mAP); the results are depicted in Table 9. When \(k=0.3\), the pruning rates of the parameters and FLOPs are 47.66% and 30.25%, respectively, and compared with the baseline mAP of 76.10%, the detection accuracy of the compact SSD decreases by only 0.50%. When 57.66% of the parameters and 54.06% of the FLOPs of the SSD are pruned with \(k=0.4\), the mAP drops by 0.90%.

Table 9 The results of pruning SSD on PASCAL VOC

To visually display the results of the pruned SSD, we select five pictures from PASCAL VOC to visualize the experiments of Table 9; the results are depicted in Fig. 5. The first row shows the original images, the second row the detections obtained with the baseline SSD, and the last two rows the detections of the networks pruned via GBIP with \(k=0.3\) and \(k=0.4\). The compressed SSD can still detect the objects in the images correctly, although the position and size of the bounding boxes may change slightly within an acceptable range, and the confidence of some targets also fluctuates to a certain extent. For example, the baseline confidence of the tvmonitor in figure (d) is 0.97, while with \(k=0.3\) and \(k=0.4\) it is 0.92 and 0.95, respectively; the confidences of some targets in the other pictures also differ. We conjecture that this is because the detection accuracy of some categories improves after pruning even though the overall mAP is slightly lower. Indeed, the detection confidence of some targets even rises when the network is compressed: the baseline confidence of the bottle in figure (b) is 0.75, yet it reaches 0.85 with \(k=0.3\) and even 0.99 with \(k=0.4\). This suggests that pruning can improve the network's ability to recognize some target classes by reducing model redundancy. Moreover, with \(k=0.4\) the SSD is compressed by more than 50%, and yet more persons are detected correctly than with the baseline in figure (b), indicating that its ability to distinguish people has improved. These experiments further verify that the pruning method in this paper also generalizes well to object detection.

Fig. 5
figure 5

Visualization of pruning SSD on the PASCAL VOC

4.5 Ablation analysis

We now conduct an ablation analysis of the proposed GBIP method. This section consists of three parts: the influence of k on the pruning rate, the influence of k on the compression magnitude of different layers, and the influence of attention transfer, output transfer, and the adversarial game on performance recovery.

4.5.1 The influence of k on the pruning rate

The pruning threshold factor k is the parameter used to adjust the compression ratio in the proposed pruning algorithm: the larger k is, the higher the pruning threshold and thus the higher the degree of network compression. To reveal the influence of k, we run six groups of pruning experiments on VGG-16 on CIFAR-10 with different values of k; the results are shown in Fig. 6. As k increases, the pruning rates of parameters, FLOPs, and channels grow steadily, and most importantly, the compression rates of parameters and FLOPs remain balanced throughout. When k is small, the compression ratio rises quickly and the curves are relatively steep; as k keeps growing, the parameter and FLOPs curves gradually flatten, while the channel compression rate still rises almost linearly. The figure also shows that when the compression ratios of parameters and FLOPs are below 80% and the channel compression ratio is below 60%, the accuracy of the pruned network remains unchanged or is even slightly improved compared with the baseline. Beyond that point, the performance of the compact network begins to decline: pruning a moderate number of parameters reduces redundancy and the impact of overfitting, improving performance, but removing too many parameters leaves the network unable to cope with the classification task and degrades performance.

Fig. 6
figure 6

The influence of k on the accuracy and pruning rate of CNNs

4.5.2 The influence of k on the compression magnitude in different layers

To better show the compression amplitude of each layer under different pruning ratios, this section visualizes the number of channels of VGG-16, ResNet-56, and ResNet-110 on CIFAR-10 in Fig. 7, whose three rows from top to bottom correspond to VGG-16, ResNet-56, and ResNet-110. The first row shows that the last layer of VGG-16 has the most remarkable redundancy: more than 50.00% of its channels are already eliminated at \(k=0.3\), when the overall pruning ratio of the network is still small. As the pruning ratio increases, the 9th and 12th layers retain more channels than the other layers, which implies that these two layers are more pivotal for extracting target information. For ResNet-56, the 23rd layer retains more channels than the 22nd layer when compression is light, but at \(k=0.6\), when the pruning ratio rises, the 22nd layer retains significantly more channels than the other layers. This indicates that as the pruning rate changes, the relative importance of different layers also varies so as to preserve performance as much as possible. At the same time, it confirms that the pruning strategy proposed in this paper can adaptively adjust the pruning amplitude of each layer according to different compression rates to obtain a compact network that meets the performance requirements.

Fig. 7
figure 7

The influence of k for channels of VGG-16 (top), ResNet-56 (second row), and ResNet-110 (bottom) on CIFAR-10

4.5.3 The influence of three modules on the performance recovery

To analyze the influence of attention transfer, output transfer, and the adversarial game on the performance recovery of the compact network, we retrain the pruned VGG-16, ResNet-56, and ResNet-18 with different combinations of strategies on CIFAR-10, CIFAR-100, and ImageNet, respectively; the number of training epochs and the other hyperparameters remain the same. The results are tabulated in Table 10. The VGG-16 trained with all three strategies reaches the highest accuracy of 94.14%. The accuracy obtained with output transfer alone is 94.03%, 0.03% higher than with attention transfer alone, and the adversarial game alone yields 94.02%, 0.08% higher than retraining without any of the three strategies. Output transfer therefore plays the largest role among the three modules, and the results of ResNet-56 on CIFAR-100 and ResNet-18 on ImageNet confirm this conclusion. For the large-scale ImageNet task, the accuracy obtained with all three strategies is 69.36%, the best result and 1.24% higher than training without any strategy. The table also shows that the adversarial game alone performs better than attention transfer alone: the attention maps only act on the intermediate feature maps and do not constrain the final output, whereas the adversarial game directly optimizes the output features and can therefore better improve the performance of the pruned network. In addition, we compare our pruning strategy without knowledge transfer and the adversarial game against some existing pruning methods that likewise use no auxiliary optimization; the results in Table 11 show that, even without these strategies, the pruning method in this paper still achieves leading compression rates and accuracy.

Table 10 Performance comparison of retraining pruned networks
Table 11 Performance analysis of our pruning method without using knowledge transfer and adversarial game strategy

5 Conclusion and future work

In this paper, we propose a global balanced iterative pruning method that removes unimportant parameters and FLOPs in similar proportions based on the magnitude distribution of intermediate features. We further design a performance recovery scheme so that the compressed network regains its performance as quickly as possible after each pruning step, which allows continuous iterative pruning during training; the final compact network then restores its accuracy through retraining from scratch. We conduct extensive pruning experiments with VGGNet, ResNet, and GoogLeNet on the image classification datasets CIFAR-10, CIFAR-100, and ILSVRC-2012. The results show that GBIP is comparable with state-of-the-art network pruning methods in both performance and the pruning rates of parameters and FLOPs. On CIFAR-10, after compressing 75.29% of the parameters and 78.60% of the FLOPs of ResNet-56, the accuracy drops by only 0.39%. On the PASCAL VOC object detection task, when more than 50% of the parameters and FLOPs of the SSD are removed, the mAP is reduced by only 0.9%. The ablation analysis shows that the pruning factor enables flexible control of the compression rate. These experiments demonstrate that the proposed method can be widely applied to different CNNs, image datasets, and various computer vision tasks. In the future, we will integrate the channel pruning method with other compression schemes such as quantization, and we will consider applying existing approaches to accelerate other real-world vision tasks and even natural language processing.