1 Introduction

In this paper, we describe a filter pruning method for efficient CNNs that is based on the correlation among the feature maps generated by the corresponding filters. This kind of pruning benefits the model in several ways: it removes redundant feature maps, reduces the model size, lowers the computational cost, and saves FLOPs. Over the past few years, researchers have achieved impressive performance with CNNs in several demanding tasks across different areas, namely image recognition [1, 2], web search [3], speech recognition [4], and NLP. Nonetheless, CNNs are computationally intensive by nature, and as networks have grown deeper, their memory footprint, power consumption, and number of floating-point operations (FLOPs) have also increased dramatically. The reason behind this growth is the increase in parameter count and in convolution operations. Such high-capacity networks incur substantial inference costs, especially on resource-constrained devices such as mobile phones or embedded sensors, which have limited computational and power resources.

A substantial amount of research has been carried out to compress large CNNs or to learn more efficient CNN models directly. To mitigate the high resource requirements of CNNs, researchers have presented numerous approaches for compressing and accelerating different models without a noticeable loss in accuracy. Among these, pruning is the most widely used technique for network compression. It is further divided into two sub-categories, namely weight pruning [5,6,7] and filter pruning [8,9,10]. Weight pruning creates unstructured sparsity because it eliminates individual parameters within a filter [40]; this leads to irregular memory access, which hurts overall efficiency. Filter pruning, in contrast, eliminates entire filters and leaves a structured model. Although it is a simple technique, it is quite demanding, because eliminating filters from one layer considerably changes the input of the subsequent layer. Filter pruning is therefore more efficient than weight pruning [25]: it decreases the model size, minimizes the computational cost, and saves FLOPs, which is the key focus of our proposed work.

Additionally, filter pruning methods generally involve some measure of filter importance. A number of past studies [2, 13] estimate the value of the filters by scale factors or the L1-norm, neglecting the amount of information contained in the feature maps. Moreover, the authors of [7, 10] showed that a large number of the feature maps output by the middle layers of CNNs are mostly zeros or zero matrices, which reveals that not all the filters in the architecture are valuable.

Furthermore, we created a correlation-based feature selector (CFS), which boosts the efficiency of pruning by removing the need to set a pruning rate layer by layer. Pruning filters across multiple layers gives a general picture of the effectiveness of the model and results in a slimmer architecture. We examine the behavior of CFS under several conditions and show that CFS can recognize effective feature maps and eliminate less important ones. When two variables are highly correlated, they replicate the same information, so discarding one of them causes little loss of information.

Fig. 1

Strongly correlated feature maps are identified, and one of each pair of duplicate feature maps is pruned

Numerous statistics can be employed to analyze the relationship between two random variables, for instance correlation, range, standard deviation, and mean. However, feature selection algorithms based on statistical significance tests require the features to be mutually independent, whereas in practice there is always some correlation between the features [45]. Therefore, in this paper we use a Pearson correlation-based [11] technique to compute the correlation between two feature maps: the higher the correlation between two feature maps, the greater the replication of information between them. One of the two feature maps is then eliminated, together with its filter, as described in Fig. 1. Correlation plays a major role in data science for feature selection and helps to find effective features.

Fig. 2

Overview of our introduced technique, which prunes filters according to the correlation score between two feature maps. It computes the correlation between each pair of adjacent feature maps of the pre-trained model (left side). For the i-th layer, two output feature maps of the convolutional filters are extracted and fed into the correlation module to measure their correlation. Feature maps with high correlation scores duplicate information, so one of the feature maps, and its filter, is eliminated (right side). In the meantime, the corresponding channels of each filter in the (i+1)-th layer are also deleted to stay consistent with the input. Finally, the filters of the convolutional layers are pruned layer by layer

2 Related work

In general, the challenge of compressing CNNs has been considered extensively in the literature, although interpretability is not a driving aspect in most of these studies. In the conventional way to compress CNNs, weights, but not filters, are pruned and quantized, which is commonly termed "weight compression". A few of these approaches are optimal brain surgeon [14], deep compression [15], optimal brain damage [16], and, more recently, SqueezeNet [17].

The primary goal of filter pruning is to estimate the importance of filters and then eliminate the unimportant ones [2, 18, 19]. After every pruning step, re-training is essentially required to recover the drop in accuracy. The authors of [20] estimated the importance of a filter over a subset of the training data based on its output feature map. In [14], pruning was performed with a greedy technique that assessed filter importance by checking the accuracy of the model right after pruning the filter. Feature maps, in other words activation pruning, are used in [21] with the aim of obtaining faster CNNs. This technique can also be seen as eliminating filters from the input at a particular location, but those filters are mostly kept at other locations, which hardly ever results in any filter compression. In our work, CFS selects less important filters based on the correlation value between successive corresponding feature maps. We apply Pearson correlation to calculate the correlation value between two feature maps. The CFS approach differs from other approaches and can tackle their weaknesses, such as ignoring the spatial information in feature maps, since we flip the feature maps by row and measure their spatial information with correlation to evaluate the usefulness of the corresponding filters. We then create a feature selection module to extract the output of each filter and calculate their correlation weights. These modules are placed between each two successive feature maps of the filters of the pre-trained model, as illustrated in Fig. 2. For the filter pairs whose two corresponding feature maps have the higher correlation value, one of the two filters is pruned. After pruning, we fine-tune the pruned model to recover performance, and in several scenarios we can even obtain greater accuracy. Lastly, the pruning and fine-tuning procedure is iterated a few times to achieve a slimmer model. Moreover, we also study the relationship between the pruning ratio and the number of filters to highlight the sharing of information in every convolutional layer of a CNN model.

Pruning a fully trained network is fruitful in several ways compared with training a network containing fewer filters from scratch, because there are many uncertainties, such as the choice of architecture and how many filters to start with. Even though many hyper-parameter optimization approaches have been presented to tackle this issue [22,23,24], the large number of feasible architectures and filter counts leads to a high computational cost, in a combinatorial way, as in other model-selection problems [25]. Additionally, recent results suggest that for large-scale CNNs such as ResNet and VGG, the accuracy of a pruned network is slightly higher than that of the same network trained from scratch [10]. In contrast to large-scale CNNs, for small-scale CNNs training a network from scratch is likely to achieve an accuracy similar to the pruned one. In many transfer-learning applications based on well-trained networks, the algorithms can achieve much higher accuracy than training from scratch given the same number of filters and the same architecture [21].

Many studies have focused on reducing the number of multiplications as well as the number of parameters. In [41], the authors used the Strassen algorithm with Pan's modification to reduce the number of multiplications in CNNs. Li [42] presented structured channel weight sharing for compression (SCWC), which uses the distributive property to reduce the number of multiplications in CNNs. The authors of [43] proposed two densely connected CNNs, named DenseDsc and Dense2Net. In DenseDsc, depth-wise separable convolution is applied to enhance performance, while in Dense2Net group convolution is applied to improve parameter efficiency.

Besides, the model presented in [26] introduced sparsity into the model parameters and therefore required the support of sparse libraries to achieve the anticipated compression results. This technique provides an inadequate compression rate in terms of total run-time memory (TRM) and FLOPs, although it delivers a good compression rate for weight storage with limited FLOPs savings. The pruning approach for filter importance presented in [27] has specific constraints as prerequisites that are not generally met. Models compressed by these approaches therefore retain redundancy, since they do not consider filter redundancy during pruning.

Alternatively, approaches based on weight quantization [28, 29] have been used in previous research for model compression. The work in [30] performed compression based on quantization of float values for model storage, whereas [7] performed compression through binarization, in which every float value is quantized to a binary value. Researchers have also utilized Bayesian approaches for network quantization. In contrast, our proposed technique removes unimportant feature maps to decrease the size of the model, save FLOPs, and lower the computational cost.

We evaluate our approach on various well-known datasets and different CNN models. For VGG16 on CIFAR100, we obtain approximately 59.6% parameter pruning and a 46.4% reduction of FLOPs with 0.17% accuracy loss. For a model with less redundancy, such as ResNet-50, we also achieve approximately 44.6% parameter elimination without loss of accuracy. In the coming sections, we provide the details of our correlation-based filter pruning approach.

3 Methodology

We first present how to determine the correlation scores of two feature maps. Then our feature map selection and filter pruning schemes are given. Lastly, we discuss the compression of computational cost.

3.1 Calculate correlation scores of the two successive feature maps

A number of past studies [2, 10, 13] estimate the value of the filters through scale factors or the L1-norm and neglect the amount of information contained in the feature maps. Some past studies [7, 10] have shown that a large number of the feature maps output by the middle layers of CNNs are mostly zeros or zero matrices, which reveals that not all the filters in the architecture are valuable.

To find the usefulness of filters, we apply correlation to measure the information in feature maps. Correlation plays a major role in data science for feature selection and helps to find effective features according to the correlation score. Since the outputs of different convolutional layers differ substantially in the amount of information they carry, the feature map weights are calculated in every layer individually. In particular, to avoid the conditional outcome of a single image, we randomly choose a large number of images from the training set to compute the average correlation weights of the filters.

Suppose \(H_i\) is the height and \(W_i\) the width of the output feature maps, and \(m_i\) denotes the number of filters in the \(i\)-th convolutional layer, where each filter produces one feature map. Further, \(N\) is the number of randomly selected images input into the model. The value of \(N\) directly impacts the memory of the system: the greater the value of \(N\), the more memory it consumes. It is worth noting that the value of \(N\) is the same for all considered datasets. The lowest final accuracies are obtained with \(N\) set to 16, 32, 50, and 64 instances, the highest with \(N\) set to 256 and 512 instances, while \(N\) of 100, 128, 150, and 200 instances yields average final accuracies. Therefore, the higher the value of \(N\), the greater the final accuracy, although a much larger value can also affect the final accuracy, as shown in Fig. 3 (the figure only covers the ImageNet dataset for the VGG16 model). Let \(X_{i, j}^{(n)}\) be the \(j\)-th output feature map matrix of layer \(i\) for the \(n\)-th image, and \(Y_{i, k}^{(n)}\) the \(k\)-th output feature map matrix of the same layer \(i\) for the \(n\)-th image; both matrices are converted into the feature map vectors given below:

Fig. 3

Impact of N on VGG16 accuracy using ImageNet dataset

$$\begin{aligned} \hat{X}_{i, j}^{(n)}= & {} \left( x_{i, j, 1}^{(n)}, x_{i, j, 2}^{(n)}, \ldots , x_{i, j, L_{i}}^{(n)}\right) \end{aligned}$$
(1)
$$\begin{aligned} \hat{Y}_{i, k}^{(n)}= & {} \left( y_{i, k, 1}^{(n)}, y_{i, k, 2}^{(n)}, \ldots , y_{i, k, L_{i}}^{(n)}\right) \end{aligned}$$
(2)

where \(L_{i}=H_{i} \times W_{i}\). Finally, we normalize both \(\hat{X}_{i, j}^{(n)}\) and \(\hat{Y}_{i, k}^{(n)}\) using the following equations.

$$\begin{aligned} p_{i, j, l}^{(n)}= & {} \frac{x_{i, j, l}^{(n)}-\min _{l}\left\{ x_{i, j, l}^{(n)}\right\} }{\max _{l}\left\{ x_{i, j, l}^{(n)}\right\} -\min _{l}\left\{ x_{i, j, l}^{(n)}\right\} } \end{aligned}$$
(3)
$$\begin{aligned} P_{i, j}^{(n)}= & {} \left( p_{i, j, 1}^{(n)}, p_{i, j, 2}^{(n)}, \ldots , p_{i, j, L_{i}}^{(n)}\right) \end{aligned}$$
(4)

Similarly, for \(\hat{Y}_{i, k}^{(n)}\):

$$\begin{aligned} p_{i, k, l}^{(n)}= & {} \frac{y_{i, k, l}^{(n)}-\min _{l}\left\{ y_{i, k, l}^{(n)}\right\} }{\max _{l}\left\{ y_{i, k, l}^{(n)}\right\} -\min _{l}\left\{ y_{i, k, l}^{(n)}\right\} } \end{aligned}$$
(5)
$$\begin{aligned} P_{i, k}^{(n)}= & {} \left( p_{i, k, 1}^{(n)}, p_{i, k, 2}^{(n)}, \ldots , p_{i, k, L_{i}}^{(n)}\right) \end{aligned}$$
(6)

Then, for the \(n\)-th image and the \(i\)-th convolutional layer, the Pearson correlation between the \(j\)-th and \(k\)-th feature map vectors is given as:

$$\begin{aligned} \rho \left( P_{i, j}^{(n)}, P_{i, k}^{(n)}\right) =\frac{\sum _{l=1}^{L_{i}}\left( p_{i, j, l}^{(n)}-\overline{p_{i, j}^{(n)}}\right) \left( p_{i, k, l}^{(n)}-\overline{p_{i, k}^{(n)}}\right) }{\sqrt{\sum _{l=1}^{L_{i}}\left( p_{i, j, l}^{(n)}-\overline{p_{i, j}^{(n)}}\right) ^{2}}\sqrt{\sum _{l=1}^{L_{i}}\left( p_{i, k, l}^{(n)}-\overline{p_{i, k}^{(n)}}\right) ^{2}}} \end{aligned}$$
(7)

where \(\overline{p_{i, j}^{(n)}}\) is the mean of \(P_{i, j}^{(n)}\) and \(\overline{p_{i, k}^{(n)}}\) is the mean of \(P_{i, k}^{(n)}\). The value of \(\rho \left( P_{i, j}^{(n)}, P_{i, k}^{(n)}\right)\) ranges from \(-1\) to 1. A value of 0 means that \(P_{i, j}^{(n)}\) and \(P_{i, k}^{(n)}\) are uncorrelated, whereas a value close to 1 means they are strongly correlated. Feature maps with a greater correlation value are considered redundant; therefore, only those feature maps with low redundancy between successive feature maps are selected. The pairs with the lowest Pearson correlation scores among neighboring feature maps are forwarded to the selected feature map set. We thus keep the best feature maps and discard the redundant ones. This process continues layer by layer until we obtain all the effective feature maps from all layers. Algorithm 1 shows the steps of our introduced method.

Algorithm 1
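As a companion to Algorithm 1, the following is a minimal NumPy sketch (ours, not the authors' released code) of how the correlation scores of Eqs. (1)-(7) could be computed for one convolutional layer; the names `correlation_scores` and `min_max_normalize` are illustrative.

```python
import numpy as np

def min_max_normalize(v, eps=1e-8):
    """Min-max normalization of a flattened feature map (Eqs. 3 and 5)."""
    return (v - v.min()) / (v.max() - v.min() + eps)

def correlation_scores(feature_maps):
    """Average Pearson correlation of each pair of successive feature maps.

    feature_maps: array of shape (N, H, W, m) -- N sampled images and
    m feature maps of size H x W produced by one convolutional layer.
    Returns m-1 scores, one per adjacent pair (j, j+1).
    """
    n, h, w, m = feature_maps.shape
    flat = feature_maps.reshape(n, h * w, m)            # Eqs. (1)-(2)
    scores = np.zeros(m - 1)
    for j in range(m - 1):
        corrs = []
        for img in range(n):
            p = min_max_normalize(flat[img, :, j])      # Eq. (4)
            q = min_max_normalize(flat[img, :, j + 1])  # Eq. (6)
            corrs.append(np.corrcoef(p, q)[0, 1])       # Pearson, Eq. (7)
        scores[j] = np.nanmean(corrs)   # average over N images; skips constant (all-zero) maps
    return scores

# A pair (j, j+1) with a high score duplicates information, so one of the
# two feature maps (and its filter) becomes a pruning candidate.
```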

3.2 Strategies for filter pruning

To recognize the unimportant filters of a pre-trained network, the CFS is created and placed between every two successive convolutional layers of the network. As illustrated in Fig. 2, the output of the \(i\)-th convolutional layer is fed into the correlation weights module to calculate the correlation score of each two consecutive feature maps with the algorithm explained in the previous section. A high correlation score shows that these feature maps duplicate information, so one of them should be removed, and the corresponding filter in the \(i\)-th convolutional layer is less valuable. We then eliminate the feature maps with duplicate information by pruning all their incoming and outgoing connections. With that, all the unimportant filters of the \(i\)-th layer and the corresponding feature maps input to the next layer are eliminated, along with the corresponding channels of every filter in the next layer.
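A brief illustration of this step, assuming the TensorFlow/Keras weight layout (kernel height, kernel width, input channels, output channels); the helper name `prune_filter` is ours:

```python
import numpy as np

def prune_filter(w_i, b_i, w_next, filter_idx):
    """Drop filter `filter_idx` from layer i and the matching input
    channel of layer i+1, so the two layers stay dimensionally consistent."""
    w_i_pruned = np.delete(w_i, filter_idx, axis=3)        # remove the filter itself
    b_i_pruned = np.delete(b_i, filter_idx, axis=0)        # remove its bias term
    w_next_pruned = np.delete(w_next, filter_idx, axis=2)  # remove the channel it fed
    return w_i_pruned, b_i_pruned, w_next_pruned
```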

Fig. 4

Illustration of the method to prune cross-layer architectures. a Original ResNet bottleneck block, b CFS module with bottleneck block. A Batch Norm layer is inserted before every convolutional layer, and the ReLU is placed before the CFS module

3.2.1 Setting up pruning threshold

The pruning threshold is set according to the Symmetrical Uncertainty (SU) coefficient in each convolutional layer. First, the Symmetrical Uncertainty scores of the feature maps in each layer are placed in ascending order. They are then gathered from the largest value downwards until the set pruning ratio is surpassed. The last gathered value is applied as the threshold of the corresponding layer, and all feature maps and corresponding filters whose Symmetrical Uncertainty score is higher than the threshold are eliminated. Finally, we obtain a slimmer architecture with fewer parameters, less run-time memory, and less storage.
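A minimal sketch of this thresholding rule, assuming a one-dimensional array of per-feature-map scores for a single layer; the function name and the tie-handling are our choices:

```python
import numpy as np

def layer_threshold(scores, pruning_ratio):
    """Return the score value used as this layer's pruning threshold."""
    n_to_prune = int(np.ceil(pruning_ratio * len(scores)))
    if n_to_prune == 0:
        return np.inf                    # nothing is pruned in this layer
    ordered = np.sort(scores)[::-1]      # gather from the largest value downwards
    return ordered[n_to_prune - 1]       # maps scoring at or above this are pruning candidates
```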

3.2.2 Pruning by iteration and fine-tune process

After pruning, we might temporarily experience some loss of accuracy; however, it can largely be recovered through the subsequent fine-tuning process. In some cases, we can even obtain greater accuracy than the baseline model. For whole-architecture pruning, past work generally prunes and fine-tunes the filters layer by layer, or retrains the model after every pruning and fine-tuning pass. In contrast, we prune and fine-tune the model iteratively, and there is no need to retrain the model from scratch. Prototyping this algorithm on LeNet shows that it is efficient: after a few epochs we obtain greater compression and can even achieve better classification accuracy.
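The iterative schedule can be written as a short loop; `prune_step` and `fine_tune` below are hypothetical caller-supplied callables standing in for the routines sketched earlier, not functions from the paper:

```python
def iterative_pruning(model, prune_step, fine_tune, rounds=3, ratio=0.10):
    """Alternate pruning and fine-tuning; no retraining from scratch."""
    for _ in range(rounds):
        model = prune_step(model, ratio)  # one CFS pruning pass over all layers
        model = fine_tune(model)          # short fine-tune to recover accuracy
    return model
```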

Table 1 VGG16 Pruning for CIFAR10 dataset

3.2.3 Strategy to adjust residual models (ResNet)

The presented filter pruning approach can be used effortlessly on plain CNN models, e.g. AlexNet and VGGNet. However, some modifications are needed when it is applied to complex models with cross-layer connections, for example ResNet. In ResNet, the output of the last convolutional layer and the identity mapping have the same number of feature maps and the same size, which makes them harder to reduce. As illustrated in Fig. 4, our CFS modules are inserted between the first and second convolutional layers. For the third convolutional layer of the bottleneck block, we only eliminate the channels of every filter to make them compatible with the input feature maps and do not delete any filters, because its output has to align with the identity mapping, and 1 x 1 filters have few parameters anyway.
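As a small illustration of this adjustment, the sketch below trims only the input channels of the third 1 x 1 convolution of a bottleneck block, leaving its filters (and hence its output shape, which must match the identity branch) untouched; the Keras weight layout is assumed and the helper name is ours:

```python
def prune_bottleneck_third_conv(w3, kept_channels):
    """w3: weights of shape (1, 1, in_channels, out_channels);
    kept_channels: indices of feature maps kept after pruning the second conv."""
    return w3[:, :, kept_channels, :]   # input channels trimmed, filters untouched
```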

Table 2 Pruning results of the VGG-16 model performed on CIFAR-10/100

4 Computational complexity analysis and compression

As highlighted in Sect. 3.1, the input of the \(i\)-th convolutional layer is an \(H_{i-1} \times W_{i-1} \times m_{i-1}\) tensor of feature maps, from which it generates an \(H_{i} \times W_{i} \times m_{i}\) tensor, where \(m_{i}\) and \(m_{i-1}\) are the numbers of feature maps. Suppose the \(i\)-th convolutional layer is parameterized as \(T_{i} \times T_{i} \times m_{i} \times m_{i-1}\), where \(T_i\) denotes the spatial dimension of each filter. The convolutions then have a computational complexity of \(T_{i} \times T_{i} \times m_{i} \times m_{i-1} \times H_{i} \times W_{i}\). Let the pruning ratio of the \(i\)-th layer be \(p_i\) and the corresponding filter pruning rate be \(\hat{p}_{i}\). The number of filters of the \(i\)-th layer then decreases from \(m_{i}\) to \(m_{i}\left( 1-\hat{p}_{i}\right)\), and the number of channels of the filters in the same layer decreases from \(m_{i-1}\) to \(m_{i-1}\left( 1-\hat{p}_{i-1}\right)\). Finally, we obtain the compression rate in computational complexity for this pruned layer as follows:

$$\begin{aligned} \begin{aligned}&1-\frac{T_{i} \times T_{i} \times m_{i-1}\left( 1-\hat{p}_{i-1}\right) \times m_{i}\left( 1-\hat{p}_{i}\right) \times H_{i} \times W_{i}}{T_{i} \times T_{i} \times m_{i-1} \times m_{i} \times H_{i} \times W_{i}} \\&=1-\left( 1-\hat{p}_{i}\right) \left( 1-\hat{p}_{i-1}\right) \end{aligned} \end{aligned}$$
(8)
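As a quick worked example of Eq. (8) (our numbers, not a reported result), suppose both the current and the previous layer are pruned at a rate of 30%, i.e. \(\hat{p}_{i}=\hat{p}_{i-1}=0.3\):

$$\begin{aligned} 1-\left( 1-0.3\right) \left( 1-0.3\right) =1-0.49=0.51 \end{aligned}$$

so the multiply-accumulate operations of that layer shrink by roughly 51%, which is of the same order as the network-wide FLOPs reductions reported in Sect. 5.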

5 Experiments

We evaluate our introduced CFS on various state-of-the-art models and datasets: the CIFAR10 and CIFAR100 datasets are used with the VGG16 model, while the CIFAR10 dataset is used with the ResNet50 and ResNet56 models. All experiments are carried out with the TensorFlow and Keras frameworks on an NVIDIA Tesla P100 GPU using the Google Colab Pro version.

5.1 Implementation settings

Initially, the architectures are trained from scratch to compute the classification accuracies, which we take as the baselines. All images are pre-processed before being fed into the model: they are randomly cropped to 32 x 32, and horizontal flipping is used with the padding set to four. The mini-batch size is set to 100 for training and 1000 for testing in the VGGNet model, and to 64 for training and 256 for testing in the ResNet models. All networks are trained and fine-tuned using Stochastic Gradient Descent (SGD) for 160 iterations on both datasets. We set the initial learning rate to 0.1 and reduce it to 0.01 and 0.001 after 50% and 75% of the iterations, respectively. To avoid overfitting, we also apply weight initialization.
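A hedged TensorFlow/Keras sketch of this schedule is given below; the model and data pipeline are assumed to exist elsewhere, and we read the 160 training iterations as 160 epochs for the purpose of the learning-rate steps:

```python
import tensorflow as tf

EPOCHS = 160  # training rounds from the text

def lr_schedule(epoch):
    """0.1 initially, 0.01 after 50% and 0.001 after 75% of training."""
    if epoch < 0.5 * EPOCHS:
        return 0.1
    if epoch < 0.75 * EPOCHS:
        return 0.01
    return 0.001

optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
callbacks = [tf.keras.callbacks.LearningRateScheduler(lr_schedule)]
# model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, batch_size=100, epochs=EPOCHS, callbacks=callbacks)
```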

5.2 VGGNet pruning

According to [1], the convolutional layers of VGGNet differ in information concentration and robustness; therefore, we set a separate pruning rate for every convolutional layer. Based on the architecture of VGGNet, the layers of the network are divided into three groups: (1) layers with 512 filters, (2) layers with 256 filters, and (3) layers with at most 128 filters. We set a different pruning rate p for each group, namely 1.5p, p, and 0.5p respectively. Finally, we prune the VGG16 network iteratively with r = 10%. Our introduced CFS obtains better results than other filter pruning approaches. After only the second epoch, the CFS approach can prune approx. 50% of the parameters and even obtain an accuracy gain of 0.30%. After two more epochs, the parameter saving reaches approx. 90% and the FLOPs reduction 52.4%, with an accuracy gain of 0.05%. Table 1 shows the layer-wise pruning of parameters, the FLOPs saving, and the elimination of activation maps. A detailed illustration of the results is given in Table 2a.
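The group-wise rates can be expressed as a small helper (illustrative only; the grouping boundaries follow the text above):

```python
def vgg_layer_pruning_rate(num_filters, p):
    """1.5p for 512-filter layers, p for 256-filter layers, 0.5p otherwise."""
    if num_filters >= 512:
        return 1.5 * p
    if num_filters >= 256:
        return p
    return 0.5 * p
```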

We continue our training for VGG16 with the CIFAR100 dataset, using the same settings throughout the experiment. This time our model obtains a FLOPs reduction of approx. 47% in just three epochs, as shown in Table 2b. Compared to CIFAR10, the pruning ratio is not as large, because CIFAR100 has 10 times more classes and requires more features to perform the classification task. Further, we trained the VGG16 model with the ImageNet dataset, where our method achieves 35.67% parameter pruning with 34.2% of FLOPs saved in only the third epoch.

5.3 ResNet pruning

We use two architectures of the ResNet family, ResNet50 and ResNet56, with the bottleneck structure, to check the performance of the presented algorithm. As mentioned in the summary of the ResNet architectures, the ReLU and Batch-Norm layers are inserted before every convolutional layer in the bottleneck blocks. Since skip connections are present in the ResNet blocks and the useful information is distributed across the entire architecture, we apply the same pruning ratio to all layers. Furthermore, there are fewer redundant parameters in ResNet because the information is shared across the model via the skip connections; therefore, we prune all ResNet models in single-shot pruning.

First, we prune the ResNet50 model using the ImageNet dataset. With the pruning ratio set to 20%, we achieve nearly the same accuracy as the baseline architecture with approx. 40.5% of the parameters eliminated and a FLOPs reduction of approx. 48.2%. When the pruning ratio is set to 30%, our CFS can eliminate approx. 44.6% of the parameters with a 51.6% FLOPs reduction and a 0.56% loss of accuracy, as shown in Table 3a. It can be seen that ResNet with the bottleneck design contains less redundancy in computations and parameters, unlike VGGNet.

Table 3 Pruning results in ResNet models performed on CIFAR-10 and ImageNet
Fig. 5

Pruning results of ResNet50 on ImageNet regarding different correlation pruning ratios

Fig. 6

Pruning results of ResNet56 on CIFAR-10 regarding different correlation pruning ratios

We continue by applying our CFS approach to the deeper ResNet56 network on the CIFAR10 dataset, using the same settings as for ResNet50. Our algorithm outperforms the baseline model with a reduction of approx. 25.9% of the parameters and approx. 30.2% of the FLOPs, with an accuracy gain of 0.20%. When we increase the pruning ratio to 30%, our approach obtains about 52.2% parameter elimination and saves about 55.6% of the FLOPs with only a 0.31% accuracy degradation, as illustrated in Table 3b. Furthermore, we apply our CFS algorithm to ResNet18 on the ImageNet dataset, where it prunes approx. 42.31% of the parameters with 30.66% of the FLOPs saved, as given in Table 3c.

Fig. 7

Pruning comparison of FLOPs, parameters, and filters

6 Discussion

To analyze the effectiveness of our presented algorithm more broadly, we evaluate the accuracy of the architecture for different pruning rates and compression rates. As illustrated in Figs. 5, 6, and 7, at the start the classification accuracy of the pruned architecture increases over the baseline network and then decreases as the pruning rate grows. While the pruning rate is below 25%, nearly 50% of the parameters are eliminated, and our approach incurs no loss of accuracy and sometimes even achieves a marginal accuracy improvement. This shows that our CFS algorithm can eliminate unimportant information and strengthen the effective set of features. This discussion only covers the ResNet models.

7 Conclusion

In this article, we present a simple and efficient approach that estimates the effectiveness of convolutional filters on the basis of the duplication present in the two successive feature maps generated by these filters. Our approach uses correlation to compute the redundancy introduced by the feature maps and to measure the usefulness of the related convolutional filters. A correlation feature selector is created to design the pruning schemes. To overcome the dimensionality mismatch problem in the ResNet model during pruning, new pruning schemes are presented. Furthermore, this article also highlights the sharing of information in each convolutional layer, with results reflecting that the filters of many layers contribute little to the final accuracy of the model. Extensive experiments show the advantage of our method compared to the presently available approaches. Finally, for the deeper and more complex ResNet56 model on the CIFAR10 dataset, our presented approach can eliminate 52.2% of the parameters along with a 55.6% FLOPs reduction and a 0.09% accuracy gain, which shows the strength of the proposed algorithm.