1 Introduction

In the past decade, convolutional neural networks (CNNs) have achieved remarkable success in various computer vision tasks, such as image classification [1], object detection [2], and semantic segmentation [3], owing to their exceptional representation learning ability. However, these high-performance CNN models, such as VGG [4], ResNet [5], and DenseNet [6], require large amounts of memory and computational resources, which limits their deployment on edge computing devices, e.g., mobile devices.

In recent years, compression and acceleration techniques for deep models have become a hot research topic. Commonly used techniques include knowledge distillation, network pruning, and binary quantization. Network pruning reduces the storage and computational cost of deep convolutional neural networks by removing redundant components, thus achieving model compression and acceleration. Current research focuses mainly on structured pruning, which removes unimportant filters from the original network according to appropriate pruning strategies. Li et al. [7] evaluate convolution kernels according to the \(l_1\)-norm and remove unimportant kernels. Liu et al. [8] use \(\textrm{L}_1\) regularization of the channel scaling factors of the batch normalization (BN) layer to measure channel importance, and then prune the channels with smaller scaling factors. HRank [9] first introduces the rank of the feature map as the evaluation criterion, which makes the whole pruning process more efficient.

In the pruning process, the essential tasks are selecting an appropriate pruning strategy to identify redundant components and determining the optimal pruning ratio. However, existing pruning strategies are overly simple and neglect the structural integrity of the CNN model and the global correlation between its layers. These strategies prune layer by layer with a uniform pruning ratio, which may leave some layers overly pruned while others still exhibit structural redundancy, and thus fail to obtain the optimal network structure. In addition, setting the pruning ratio relies on manual experience and requires repeated tests to find the optimal overall pruning ratio.

To address the above-mentioned problems, we propose AFPruner, an automatic filter pruning algorithm based on feature map average similarity and a reverse search genetic algorithm (RSGA), which automatically searches for the optimal combination of pruning ratios for the convolutional layers of a CNN. AFPruner uses feature map average similarity to measure the similarity of filters, and then prunes similar filters at each layer according to the pruning ratio. In this way, we can obtain the optimal CNN model structure.

Our contributions are summarized as follows:

  1. (1)

    We propose a reverse search genetic algorithm (RSGA), which enhances the diversity of the population using a reverse search strategy. At the end of each iteration, the algorithm replaces the two worst individuals with their reverse individuals. Additionally, we use a clustering method to group individuals and select individuals from different clusters for evolutionary operations, which further avoids premature convergence.

  2. (2)

    We use the reverse search genetic algorithm to search for the optimal combination of pruning ratios for the convolutional layers of the CNN. The similarity of filters is then described by the feature map average similarity, computed as the average Euclidean distance between the feature maps of multiple samples, and filters are pruned according to this similarity. This avoids the effect of sample bias on the filter similarity measurement. According to the similarity of the filters, we retain one filter per group of similar filters at each layer of the network and prune the rest, achieving model compression. This approach reduces manual intervention and repeated testing, and realizes automatic pruning.

  3. (3)

    We design experiments on the CIFAR-10, CIFAR-100, and ILSVRC-2012 datasets to verify the performance of the algorithm and compare it with other methods. The results show that AFPruner achieves a good balance between performance and compression ratio and is superior to most current CNN pruning algorithms.

2 Related works

2.1 Pruning strategy

Because weight pruning leads to unstructured sparse filters, it is hard to accelerate the pruned model on general hardware. Consequently, existing pruning research has mainly focused on structured filter pruning. Current filter pruning strategies can be roughly divided into the following two types:

Importance metric Common pruning methods usually use the weight norm [7, 10, 11] to measure the importance of filters; filters with smaller norm values contain less feature information and should be pruned first. Azadeh et al. [12] propose a genetic-based joint dynamic pruning and learning method that prunes redundant filters adaptively during training, which considers not only the \(l_1\)-norm of the filter but also the \(l_1\)-norm of its gradient to jointly measure filter importance. NISP [13] obtains the importance score of each filter by optimizing the reconstruction error of the final response layer. Attention mechanisms [14] have also been introduced into model pruning. Cheng et al. [15] apply Squeeze-and-Excitation (SE) blocks to each convolutional layer to predict the importance factor of each filter.

Similarity metric FPGM [16] shows that "smaller-norm-less-important" does not always hold for filters. It instead calculates the geometric median of all filters in a layer to measure the similarity between filters, and then prunes the filters with the smallest sum of distances to the geometric median. OSFP [17] also applies the idea of filter similarity but, unlike FPGM [16], explores the distance relationships among all filters of each convolutional layer. Similarly, Li et al. [18] judge whether a filter is replaceable by exploring the similarity relationship between feature maps. CLR-RNF [19] uses the idea of k-reciprocal nearest neighbors to measure similar filters. Several recent works [20,21,22] introduce clustering to prune similar filters.

2.2 Pruning ratio

The current methods for setting the pruning ratio can be broadly divided into two types: predefined and automatic. Predefined methods manually set the same pruning ratio for all convolutional layers, and the pruning strategies mentioned above belong to this type. Previous studies [7, 23] have revealed that different convolutional layers have different sensitivities to pruning. As a result, recent research has focused on automatically determining the pruning ratio, i.e., setting a different pruning threshold for each convolutional layer.

Yang et al. [24] leverage second-order information to prune filters with low sensitivity. AMC [25] combines reinforcement learning to determine the pruning ratio for each layer automatically. DSA [26] proposes a differentiable pruning process that optimizes the sparsity distribution in a continuous space to find the pruning ratio of each layer. DMC [27] assigns a pruning ratio to each layer by applying a discrete gate to each channel.

Liu et al. [28] have shown that the essence of network pruning is finding the optimal structure. Based on this, recent works treat the pruning problem as an optimization problem. ABCPruner [29] first shrinks the combinations of channels to a specific space, and then automatically obtains the pruning ratio of each layer through the artificial bee colony algorithm (ABC). MetaPruning [30] proposes a two-stage pruning algorithm, which pretrains a PruningNet using meta-learning to predict the weight parameters of the pruned model, and then utilizes a genetic algorithm to search for the optimal combination of pruning ratios. ACP [31] first completes a preliminary pruning of channels through hierarchical channel clustering, and then uses particle swarm optimization to search for the optimal pruned model. SST [32] limits the channel pruning ratio of each layer to four thresholds using two magnitude parameters and uses particle swarm optimization to search for the combination of pruning ratios. IPSOPruner [33] removes the more sensitive layers during the pruning process and then uses an improved particle swarm optimization algorithm to search for the pruning ratios. AACP [34] uses hyperparameters to compress the space of pruning ratio combinations and improves the traditional differential evolution algorithm to obtain the optimal pruning ratio for each layer. To cope with the intractably large number of possible pruned structures for convolutional networks, the above methods artificially constrain the search space, which may cause them to miss the optimal pruning ratio combination.

In addition to directly searching for pruning ratios, some works sparsify different layers of the network from the perspective of searching over weight parameters and pruning strategies. Wang et al. [35] guide channel pruning by sparsifying the distribution of BN layers, dynamically adjust channel importance using a scale factor and a bias factor (\(\gamma \) and \(\beta \)), and combine a genetic algorithm to search for the optimal factor combination. LFPC [36] defines the search space as a set of pruning strategies to learn filter pruning criteria. According to the filter distribution of each layer, it adaptively selects a suitable pruning strategy for each layer to prune independently.

3 Methodology

This section introduces the proposed automatic filter pruning framework based on the reverse search genetic algorithm. In the convolutional layers of a CNN, the important filters are able to extract discriminative local information [37]. Without destroying the network structure, pruning unimportant filters can effectively decrease the number of parameters and computations of the model and accelerate inference. For a pre-trained CNN model with \(\textrm{L}\) convolutional layers, the model weights are \(\left\{ \textrm{W}_{i} \in \mathbb {R}^{C_{i+1} \times C_{i} \times k_{i} \times k_{i}}, 1 \le i \le \textrm{L}\right\} \), where \(\textrm{C}_{i}\), \(\textrm{C}_{i+1}\), and \(k_{i}\) are the number of input channels, the number of output channels, and the size of the convolution kernel of the i-th convolutional layer, respectively. The weights of the j-th filter of the i-th convolutional layer are denoted as \(\textrm{W}_{i}^{j} \in \mathbb {R}^{C_{i} \times k_{i} \times k_{i}}\). \(\textrm{F}_{i}^{j} \in \mathbb {R}^{h_{i} \times w_{i}}\) is the feature map generated by the filter \(\textrm{W}_{i}^{j}\) after convolution. \(h_{i}\) and \(w_{i}\) are the height and width of the feature maps of the i-th layer, respectively.
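
To make the notation concrete, the following minimal PyTorch sketch (the layer sizes are illustrative and not from the paper) shows how \(\textrm{W}_{i}\), \(\textrm{W}_{i}^{j}\), and \(\textrm{F}_{i}^{j}\) map onto a Conv2d layer's weight tensor and output channels:

```python
import torch
import torch.nn as nn

# One convolutional layer: the weight tensor has shape [C_{i+1}, C_i, k_i, k_i],
# i.e., one filter W_i^j of shape [C_i, k_i, k_i] per output channel.
conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)
print(conv.weight.shape)        # torch.Size([32, 16, 3, 3])
print(conv.weight[0].shape)     # one filter W_i^1: torch.Size([16, 3, 3])

# The feature maps F_i^j are the per-channel slices of the layer output.
x = torch.randn(1, 16, 28, 28)  # a single input sample
out = conv(x)                   # shape [1, 32, h_i, w_i]
print(out[0, 0].shape)          # feature map of the first filter: torch.Size([28, 28])
```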

3.1 Definition of filter pruning

Most existing pruning methods prune filters (channels) of all layers with a predefined global pruning ratio. However, the pruned model may not be optimal because filters (channels) in different layers have different pruning sensitivities. In this paper, determining the optimal combination of per-layer filter pruning ratios is treated as a combinatorial optimization problem. Specifically, we define the pruning problem as follows: given an unpruned model \(\textrm{M}\) with \(\textrm{L}\) convolutional layers, and assuming that each layer of the network has c filters, the filter pruning problem can be formulated as:

$$\begin{aligned} \begin{array}{l} \min \mathcal {L}_{\text{ test }}\left( \textrm{M}^{*}, \textrm{W}\right) \\ \text{ s.t. } \textrm{M}^{*}=\arg \min \mathcal {L}_{\text{ train }}(\textrm{R}^{*}, \textrm{W}) \end{array} \end{aligned}$$
(1)

where \(\textrm{M}^*\) is the pruned model, \(\textrm{R}^*=[r_1,r_2,\dots ,r_\textrm{L}]\) is the filter pruning ratio vector, with \(r_i\) denoting the filter pruning ratio of the i-th layer, \(i=1,\dots ,\textrm{L}\). \(\textrm{W}\) represents the weight parameters of the network, and \(\mathcal {L}\) represents the cross-entropy classification loss of the network. This paper aims to obtain the pruned model \(\textrm{M}^*\) with the lowest training loss by searching over the original network model. Furthermore, after determining the pruning ratio of each layer's filters, fine-tuning is performed to minimize the testing error of the pruned sub-network.

3.2 Filter similarity evaluation

Fig. 1
figure 1

For the evaluation of the pruned models, filters are selected by calculating the feature map average similarity over s samples. For the m-th layer of the network, we calculate the similarity matrix of all output feature maps and remove the filters corresponding to the feature maps with the highest similarity, according to the filter pruning ratio of this layer in the current structure vector

During the search phase, we propose a method for evaluating filter similarity based on feature map average similarity. According to the searched pruning ratios, AFPruner prunes similar filters at each layer.

To evaluate its performance, the pruned model inherits the weight parameters of the original network. Inspired by previous work [38], we use the average Euclidean distance (other distances can also be used here, such as the cosine distance, which is discussed in detail in Section 4.5) of the feature maps of s samples to measure the filter similarity in each convolutional layer. The similarity is calculated as follows:

$$\begin{aligned} \begin{array}{l} {dist}\left( \textrm{M}_{i}^{l}, \textrm{M}_{j}^{l}\right) =\frac{1}{{s}} \sum _{{t}=1}^{{s}} \sqrt{\sum _{{n}=1}^{\textrm{N}}\left( {m}_{{t}, {i}}^{{n}}-{m}_{{t}, {j}}^{{n}}\right) ^{2}} \\ \text{ s.t. } m_{{t}, i}^{n}= \text{ flatten } \left( \textrm{M}_{{t}, {i}}^{{l}}\right) , m_{{t}, {j}}^{n}= \text{ flatten } \left( \textrm{M}_{{t}, {j}}^{{l}}\right) \end{array} \end{aligned}$$
(2)

where \(\textrm{M}^{l}_{{t},{i}}\) represents the i-th feature map generated by the t-th image in the l-th layer, and \(\textrm{flatten}(\cdot )\) converts the feature map matrix into a vector. To obtain a more accurate similarity, we use the average Euclidean distance between the feature maps of s samples, which reduces the impact of sample bias. A smaller value of \(dist(\textrm{M}^l_i,\textrm{M}^l_j)\) means that the two feature maps are more similar; the specific process is shown in Fig. 1.
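
As a reference, a minimal PyTorch sketch of the computation in (2) is given below, assuming the feature maps of one layer for s samples are stacked into a tensor of shape [s, C, h, w] (the function name and tensor layout are ours):

```python
import torch

def feature_map_avg_similarity(feats: torch.Tensor) -> torch.Tensor:
    """Average pairwise Euclidean distance between feature maps, as in Eq. (2).

    feats: tensor of shape [s, C, h, w] holding the C feature maps of one layer
           for s samples. Returns a [C, C] distance matrix; smaller entries mean
           the corresponding filters produce more similar feature maps.
    """
    s, c = feats.shape[0], feats.shape[1]
    flat = feats.reshape(s, c, -1)        # flatten each feature map into a vector
    dist = torch.cdist(flat, flat, p=2)   # per-sample pairwise distances, [s, C, C]
    return dist.mean(dim=0)               # average over the s samples, [C, C]
```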

3.3 Framework of the automatic filter pruning algorithm (AFPruner)

Fig. 2
figure 2

AFPruner algorithm framework. The optimal pruned model is obtained through the genetic algorithm and then fine-tuned to recover the model accuracy

The proposed algorithm consists of four steps, as illustrated in Fig. 2. The entire algorithm flow is shown in Algorithm 1.

  • Population initialization: Initialize the set of pruning ratio vectors by reverse search; each vector represents an individual in the population (i.e., a pruned model \(\textrm{M}^*\)), and each element of the vector represents the filter pruning ratio of one layer.

  • Population evolution: Selection, crossover, mutation, and evaluation of individuals to obtain the next-generation population.

  • Reverse operation: At the end of each iteration, the two individuals with the lowest fitness values are selected to generate two reverse individuals, which replace them in the population. This process increases population diversity and avoids premature convergence.

  • Model fine-tuning: After the iterations of the reverse search genetic algorithm are completed, the individual with the highest fitness in the population is selected as the optimal pruning ratio combination \(\textrm{R}^*\), and the pruning method described in Section 3.2 is then used to obtain the pruned model \(\textrm{M}^*\). Finally, the pruned model is fine-tuned to restore accuracy.

Algorithm 1
figure e

Automatic Filter Pruning (AFPruner)
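
Since Algorithm 1 is presented as a figure, the following self-contained sketch paraphrases the four steps above in Python; the helper functions are placeholders for the components of Sections 3.2 and 3.4, and the fitness here is random so the sketch runs end to end:

```python
import random

def evaluate(ratios):
    # Placeholder: in AFPruner this prunes by feature-map similarity, briefly
    # fine-tunes, and returns the fitness of Eq. (3). Random here for the sketch.
    return random.random()

def crossover_mutate(population, Pc, Pm):
    # Placeholder for the selection / crossover / mutation operators of Section 3.4.
    return population

def afpruner_search(L, N=10, T=10, Pc=0.7, Pm=0.1, a=0.0, b=1.0):
    # Step 1: reverse initialization (Eq. 4) - half random, half reversed.
    half = [[random.uniform(a, b) for _ in range(L)] for _ in range(N // 2)]
    population = half + [[a + b - r for r in ind] for ind in half]

    for t in range(T):
        # Step 2: population evolution.
        population = crossover_mutate(population, Pc, Pm)
        # Step 3: reverse operation - replace the two lowest-fitness individuals
        # with their reverse individuals to preserve diversity.
        population.sort(key=evaluate)
        population[:2] = [[a + b - r for r in ind] for ind in population[:2]]

    # Step 4: return the best pruning-ratio vector R*; the corresponding pruned
    # model is then fine-tuned to restore accuracy.
    return max(population, key=evaluate)

best_ratios = afpruner_search(L=13)  # e.g., the 13 convolutional layers of VGG16
```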

3.4 The optimal combination search of pruning ratios

To address the optimization problem described in (1), we propose the reverse search genetic algorithm (RSGA) for searching the optimal combination of pruning ratios across the convolutional layers. In this section, we provide a detailed description of our algorithm.

Fitness Previous works [29, 31,32,33,34] use the classification accuracy of the network model to evaluate the fitness of individuals, without considering the relationship between the model compression ratio and classification accuracy. Our algorithm uses a combination of model classification accuracy and model compression ratio as the individual fitness function to achieve a balance between classification accuracy and compression ratio. Specifically, the fitness is defined as follows:

$$\begin{aligned} \begin{array}{l} \text{ Fitness } ={w}_{1} {\text {Acc}}+{w}_{2} \frac{\textrm{F}_{{o}}-\textrm{F}_{{k}}}{\textrm{F}_{{o}}}+{w}_{3} \frac{\textrm{P}_{{o}}-\textrm{P}_{{k}}}{\textrm{P}_{{o}}} \\ \text{ s.t. } \sum _{i=1}^{3} w_{i}=1 \end{array} \end{aligned}$$
(3)

where \({\text {Acc}}\) denotes the classification accuracy of the individual, \(\textrm{F}_{{o}}\) and \(\textrm{P}_{{o}}\) are the Flops and Params of the original network, respectively, and \(\textrm{F}_{{k}}\) and \(\textrm{P}_{{k}}\) are the Flops and Params retained by the individual, respectively. \(w_1\), \(w_2\), and \(w_3\) are dynamically adjustable parameters.
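
A direct transcription of (3), with accuracy expressed as a fraction in [0, 1] so the three terms are comparable (the default weights are the values selected in Section 4.2; variable names are ours):

```python
def fitness(acc, flops_kept, params_kept, flops_orig, params_orig,
            w1=0.7, w2=0.2, w3=0.1):
    """Fitness of Eq. (3): weighted accuracy plus Flops and Params reduction ratios."""
    assert abs(w1 + w2 + w3 - 1.0) < 1e-9, "the weights must sum to 1"
    return (w1 * acc
            + w2 * (flops_orig - flops_kept) / flops_orig
            + w3 * (params_orig - params_kept) / params_orig)
```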

Initialization Assuming the population size is \(\textrm{N}\), we first randomly generate \(\textrm{N}/2\) individuals to form a subpopulation \(\textrm{P}_1\), and then apply the reverse operation to \(\textrm{P}_1\) to generate a subpopulation \(\textrm{P}_2\). Subpopulations \(\textrm{P}_1\) and \(\textrm{P}_2\) are combined to form the population \(\textrm{P}\), as expressed in (4).

$$\begin{aligned} \left. \begin{array}{l} \textrm{R}_{i}=\left[ r_{1}, r_{2}, r_{3}, \ldots , r_{\textrm{L}}\right] \\ \textrm{R}_{i}^{\prime }=\left[ r_{1}^{\prime }, r_{2}^{\prime }, r_{3}^{\prime }, \ldots , r_{\textrm{L}}^{\prime }\right] \\ \text{ s.t } r_{i}^{\prime }=a+b-r_{i} \end{array}\right\} \rightarrow \begin{array}{l} \textrm{P}_{1}=\left[ \textrm{R}_{1}, \textrm{R}_{2}, \ldots \textrm{R}_{\textrm{N} / 2}\right] \\ \textrm{P}_{2}=\left[ \textrm{R}_{1}^{\prime }, \textrm{R}_{2}^{\prime }, \cdots \textrm{R}_{\textrm{N} / 2}^{\prime }\right] \end{array} \end{aligned}$$
(4)

where L is the number of convolutional layers in the network model. The pruning ratio \(r_i\) of the i-th layer is sampled uniformly from \(\left[ a,b\right) \); in this paper, the value range of \(r_i\) is [0, 1). \(R_i\) represents an individual, i.e., a filter pruning ratio vector. Using the aforementioned reverse initialization, the initial population is better distributed over the search space, and N structure vectors, as shown in Fig. 2(a), can be obtained. We use t to denote the generation of population evolution, so the initial population \(\textrm{P}\) can be expressed as:

$$\begin{aligned} \textrm{P}=\left\{ \textrm{P}_{1}^{0}, \textrm{P}_{2}^{0}\right\} = \left[ \textrm{R}_{1}, \textrm{R}_{2}, \ldots , \textrm{R}_{\textrm{N}/2},\textrm{R}_{1}^{\prime }, \textrm{R}_{2}^{\prime }, \cdots \textrm{R}_{\textrm{N} / 2}^{\prime }\right] \end{aligned}$$
(5)

Selection To maintain the diversity of individuals in the population, we first cluster the individuals into three subgroups using the K-means clustering algorithm with Euclidean distance. Then, in each generation, individuals are selected for crossover and mutation from different subgroups, and an elitist preservation strategy directly transfers the two individuals with the highest fitness to the next generation. The specific process consists of the following three steps; a brief code sketch follows (6).

Fig. 3
figure 3

Crossover process

Fig. 4
figure 4

Mutation process

  1. (1)

    The population \(\textrm{P}^t\) is clustered into three subgroups \(\textrm{P}^t_1, \textrm{P}^t_2\), and \(\textrm{P}_3^t\).

  2. (2)

    Two different subgroups are randomly selected.

  3. (3)

    The roulette wheel selection algorithm is used to select one individual from each of the two subgroups; the two selected individuals form a pair used for the subsequent evolutionary operations, and the probability of an individual being selected is proportional to its fitness value. Assuming the number of individuals in a subgroup is m, for each individual \(\textrm{R}_i\) in the subgroup \(\textrm{P}_j^t, j=1,2,3\), the selection probability \(s_i\) is calculated as follows:

$$\begin{aligned} s_i=\frac{{f}_{{i}}}{\sum _{{k}=1}^{{m}} {f}_{{k}}}, {i}=1,2, \ldots , {m} \end{aligned}$$
(6)
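
Putting the three steps together, a sketch of the selection procedure is shown below (scikit-learn's KMeans is used for step (1); the function name and library choice are ours):

```python
import random
import numpy as np
from sklearn.cluster import KMeans

def select_pair(population, fitnesses, n_clusters=3):
    """Select one parent from each of two different subgroups (steps (1)-(3))."""
    # Step (1): cluster the pruning-ratio vectors into three subgroups
    # by Euclidean distance.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.array(population))
    # Step (2): randomly pick two different subgroups.
    g1, g2 = random.sample(sorted(set(labels)), 2)
    parents = []
    for g in (g1, g2):
        idx = [i for i, lab in enumerate(labels) if lab == g]
        # Step (3): roulette-wheel selection inside the subgroup, Eq. (6) -
        # selection probability proportional to fitness.
        weights = [fitnesses[i] for i in idx]
        parents.append(population[random.choices(idx, weights=weights, k=1)[0]])
    return parents
```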

Crossover We utilize the two-point crossover operation. First, two crossover points are randomly selected based on the crossover probability. Then, the genes between the selected points of the two individuals are exchanged, generating two new individuals. The operation is illustrated in Fig. 3.

Mutation We employ the two-point mutation operation. After crossover, two genes are randomly selected for each individual according to the mutation probability and reinitialized. The mutation process is shown in Fig. 4.
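
The two operators correspond roughly to the following sketch (default probabilities are the values used in Section 4.1; the reinitialization range follows the [a, b) bounds of the initialization):

```python
import random

def two_point_crossover(p1, p2, Pc=0.7):
    """Swap the gene segment between two random cut points (Fig. 3)."""
    c1, c2 = p1[:], p2[:]
    if random.random() < Pc:
        i, j = sorted(random.sample(range(len(p1)), 2))
        c1[i:j + 1], c2[i:j + 1] = p2[i:j + 1], p1[i:j + 1]
    return c1, c2

def two_point_mutation(ind, Pm=0.1, a=0.0, b=1.0):
    """Reinitialize two randomly chosen genes (Fig. 4)."""
    m = ind[:]
    if random.random() < Pm:
        for i in random.sample(range(len(ind)), 2):
            m[i] = random.uniform(a, b)
    return m
```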

Reverse operation To maintain population diversity and avoid premature convergence, we introduce the reverse operation into the genetic algorithm. At the end of each iteration, the two individuals with the lowest fitness are selected for the reverse operation, which generates two new individuals that replace the original two in the population. By incorporating this reverse operation, the genetic algorithm maintains population diversity to a certain extent, which is beneficial for convergence.

4 Experiments

To validate the effectiveness of our proposed method, we conduct experiments using widely used network models (VGGNet, ResNet) and compare the results with other algorithms on multiple datasets.

4.1 Datasets and experimental setting

We conduct experiments on publicly available datasets, including CIFAR-10, CIFAR-100, and ILSVRC-2012. CIFAR-10 contains 60,000 (50,000 training and 10,000 testing) 32\(\times \)32 color images in 10 classes, with 6,000 images per class. CIFAR-100 is similar to CIFAR-10, except that it has 100 classes, each containing 600 images. ILSVRC-2012 is one of the most commonly used subsets of ImageNet; it contains 1,000 classes with 1.28 million training images and 50k validation images.

In the experiments, we use stochastic gradient descent (SGD) for fine-tuning with a momentum of 0.9. On CIFAR-10 and CIFAR-100, the initial learning rate is 0.01, the batch size is 128, and the weight decay is set to 5e-3. The final model is fine-tuned for 160 epochs, with the learning rate divided by 10 every 50 epochs. On ILSVRC-2012, the batch size is set to 256, the weight decay is set to 1e-4, and 100 epochs are used for fine-tuning. The initial learning rate is set to 0.1 and divided by 10 every 30 epochs. During the evolutionary search, the pruned models are fine-tuned for 2 epochs to obtain more accurate classification accuracy. In Algorithm 1, \(\textrm{N} = 10, \textrm{T} = 10, \textrm{P}_c= 0.7\), and \(\textrm{P}_m = 0.1\).
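
As a point of reference, the CIFAR fine-tuning schedule above corresponds roughly to the following PyTorch setup (the model is a stand-in and the training step itself is omitted):

```python
import torch
import torch.nn as nn

model = nn.Linear(3 * 32 * 32, 10)   # stand-in for the pruned CNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-3)
# Learning rate divided by 10 every 50 epochs over 160 fine-tuning epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

for epoch in range(160):
    # ... one epoch of cross-entropy training on CIFAR with batch size 128 ...
    scheduler.step()
```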

4.2 w parameter analysis

In our method, \(w_1\), \(w_2\), and \(w_3\) are dynamically adjustable hyperparameters that weight the accuracy, Flops compression ratio, and parameter compression ratio of the pruned model in the fitness function, respectively. On CIFAR-10, we analyze the effect of these hyperparameters on the pruning results using the pruning process of the ResNet56 model.

Table 1 The influence of different w values on the compression ratio and accuracy of ResNet56 on CIFAR-10
Fig. 5
figure 5

(a) shows the average fitness over all individuals with respect to the generation number. The bars indicate the highest and lowest fitness in the corresponding generation. (b) shows the Euclidean distances between individuals before and after the reversal operation in each generation, including the average, maximum, and minimum distances, with solid lines representing distances before the reversal operation and dashed lines after it

The results are shown in Table 1.

  1. (1)

    From the first 5 rows of the table, it is clear that when \(w_1\) is small, the accuracy of the pruned model is low but the compression ratio is high. As \(w_1\) increases, the accuracy of the pruned model increases while the compression ratio decreases. When \(w_1=0.7\), the accuracy and compression ratio both achieve good results, so we initially choose \(w_1=0.7\).

  2. (2)

    To seek better results, we analyze \(w_1\) in more detail within the range [0.7, 0.8]. The results are shown in the middle 4 rows of the table. As \(w_1\) increases, the accuracy of the pruned model improves slightly, but the Flops and Params compression ratios decrease considerably. The reason is that more attention is paid to model accuracy while the compression ratio is neglected. Therefore, we finally choose \(w_1 = 0.7\) for the subsequent analysis of \(w_2\) and \(w_3\).

  3. (3)

    More attention is paid to the Flops compression ratio in network pruning because it directly affects the computational speed of the model. From the last 5 rows of the table, it can be seen that \(w_1=0.7\), \(w_2=0.2\), and \(w_3=0.1\) obtain the best trade-off between model accuracy and compression ratio.

Table 2 Accuracy and pruning ratio on CIFAR-10
Table 3 Accuracy and pruning ratio on CIFAR-100

4.3 Algorithm analysis

4.3.1 Performance analysis

To validate the performance of the algorithm, we statistically analyze the individuals produced for different models on CIFAR-10. The reverse search genetic algorithm tends to retain good genes from parents, with crossover and mutation central to the evolution process, i.e., a better individual is more likely to produce a better offspring through mutation or crossover. Figure 5(a) supports this view: the average fitness of all individuals is generally higher than that of the previous generation, which suggests that the overall quality of individuals is improved through crossover and mutation. In addition, as seen in Fig. 5(b), the distance between individuals is larger after the reverse operation. This further shows that the reverse operation increases the diversity of individuals, which is beneficial for discovering better individuals.

4.3.2 Robustness analysis

In order to rule out randomness in the experimental results, we use statistical tests to record the results. Specifically, five experiments are conducted on different datasets for different models, and the results of the five experiments are averaged. The experimental results are shown in Tables 2, 3, and 4, where "Model" is the model used in the experiment, "-" is the result with the smallest pruning ratio, "+" is the result with the largest pruning ratio, and "\(*\)" is the result averaged over 5 trials. Four widely used metrics are reported, including accuracy, Flops, Params, and pruning ratio, where "Pruned" denotes the pruning ratio. For CIFAR-10 and CIFAR-100, the Top-1 accuracy of the pruned models is provided. For ILSVRC-2012, both Top-1 and Top-5 accuracies are reported.

From the tables, it can be seen that after compressing the original model with an appropriate pruning ratio, the accuracy of the pruned models is slightly better than the baseline. This is because common convolutional neural networks exhibit different degrees of redundancy, and removing the redundant parameters not only does not harm the original network but also improves its generalization performance. AFPruner achieves nearly half of the compression ratio on both Flops and Params for different models, with VGG16 maximally compressed by 67.75% in Flops and 82.56% in Params. This indicates that VGG16 has greater parameter redundancy. In addition, across multiple experiments, the accuracies of these pruned models are approximately the same, and the averaged results reflect the stability of our algorithm. Based on the above analysis, our pruning algorithm can effectively achieve the compression and acceleration of CNNs.

Table 4 Accuracy and pruning ratio on ILSVRC-2012
Table 5 Performance comparison of ResNet56/110 on CIFAR-10

4.4 Compared with other methods

Since few pruning methods report results on CIFAR-100, we mainly compare our pruning results with other filter pruning methods on CIFAR-10 and ILSVRC-2012. In Section 4.4.1, the comparisons of pruning ratio and accuracy are reported in Tables 5 and 6. Among the compared methods, ABCPruner [29], AACP [34], and SST [32] are all population-based intelligent optimization algorithms, while AMC [25], HRank [9], FPGM [16], CSHE [21], Li et al. [18], FPSC [22], and CLR-RNF [19] are state-of-the-art filter pruning methods. The results of these compared methods are taken from the original articles. It is worth noting that the average values of the statistical tests are used for comparison here, not the optimal values. In Section 4.4.2, we further compare the time complexity of the algorithms in Table 7.

Table 6 Performance comparison of ResNet18/34/50 on ILSVRC-2012
Table 7 Time complexity on CIFAR-10
Table 8 The influence of reverse operation

4.4.1 Accuracy and pruning ratio

Comparison on CIFAR-10 For ResNet56, it can be seen from Table 5 that most methods decrease the classification accuracy after pruning. In contrast, our method exceeds the baseline. Compared with ABCPruner [29], AMC [25], SST [32], and AACP [34], which also search for pruning ratios, our method outperforms these algorithms at similar pruning ratios. Compared with the newly released algorithms FPSC [22] and CSHE [21], our method improves accuracy by 0.19% at comparable pruning ratios of Flops (52.35% vs 52.60% vs 50.00%) and Params (54.42% vs 54.20% vs 42.40%), while the accuracy of FPSC [22] and CSHE [21] decreases by 0.45% and 0.15%, respectively.

Table 9 Cluster analysis of ResNet56 on CIFAR-10

For ResNet110, even with 54.27% of Params and 54.65% of Flops removed, the accuracy of the pruned model is 0.58% higher than the baseline, which is slightly better than other methods. Among the compared methods, AFPruner outperforms CSHE [21] in all respects; CSHE [21] has the lowest pruning ratio, and the classification accuracy of its pruned model decreases by 0.06%. Although ABCPruner [29] and CLR-RNF [19] still improve accuracy over the baseline at high compression ratios, the improvement is not significant, which further illustrates that AFPruner can obtain the optimal pruning ratios.

Comparison on ILSVRC-2012 Table 6 shows the comparison of ResNet18/34/50 results on ILSVRC-2012. For ResNet18, AFPruner achieves the best results in terms of Flops, Params, and Top-1 accuracy. In particular, compared with Azadeh [12], which also adopts a genetic algorithm, our method compresses 48.22% of Flops and achieves 68.01% Top-1 accuracy, while Azadeh [12] compresses 20.30% of Flops and achieves only 67.82% Top-1 accuracy. For both ResNet34 and ResNet50, AFPruner still slightly outperforms existing methods at similar pruning ratios. Although Azadeh [12] slightly outperforms our method in terms of accuracy, its Flops compression ratio is only half of ours.

Table 10 Comparison of different similarity calculations on CIFAR-10

The comparisons of the experimental results on different datasets show that AFPruner achieves better compression ratios in both Flops and Params after pruning. Even though its accuracy is slightly lower than that of individual pruning algorithms on some models, its pruning ratio is higher than these algorithms, which indicates that AFPruner better considers the structural integrity and global correlation of CNN layers during filter pruning and thus can better prune the redundant structure of the model. The experimental results show the effectiveness and superiority of AFPruner.

4.4.2 Time complexity

We further compare the effective running time of the algorithms, including the search time and fine-tuning time. Specifically, we focus on ABCPruner [29] and ACP [31], which also use population-based intelligent optimization to search for pruning ratios. The results are reported in Table 7, where ABCPruner [29] employs the artificial bee colony (ABC) algorithm and ACP [31] employs the particle swarm optimization (PSO) algorithm. To ensure the fairness of the comparison, we use the same baseline for the different algorithms on CIFAR-10. During the experiments, the same parameter configuration was set up for a total of 10 rounds of searching, with a final fine-tuning of 160 epochs, and all experiments are performed on a Tesla V100-SXM2-32GB. (Note: because the original papers did not report running time, the data in Table 7 are not from the original papers but are reproduced by ourselves using the same parameters.)

Fig. 6
figure 6

Some individuals generated during the iterative pruning process of ResNet56 on CIFAR-10. Red represents the number of filters in each layer of the original network, and purple represents the number of filters in each layer of the pruned sub-network. \(\textrm{A}\) represents the test classification accuracy after two rounds of fine-tuning, \(\textrm{P}\) represents the Params compression ratio, and \(\textrm{F}\) represents the Flops compression ratio. (a) is the worst individual in \(P_0\); (b) is the optimal individual in \(P_0\); (c) and (d) are intermediate individuals in \(P_3\) and \(P_6\), respectively; (e) is the worst individual in \(P_9\); (f) is the optimal individual in \(P_9\) (that is, the final optimal individual)

As can be seen in Table 7, for the same model, our algorithm takes less time to run, and the final fine-tuned accuracy is higher.

4.5 Ablation study

Reverse operation To demonstrate the effectiveness of the reverse operation during population iteration, we compare the improved genetic algorithm with the traditional genetic algorithm in Table 8. The improved genetic algorithm achieves better results because the reverse operation increases the diversity of the population, which is more conducive to convergence.

Cluster analysis We take ResNet56 on CIFAR-10 as an example and analyze the influence of introducing clustering in Table 9. The results with clustering are better, indicating that the clustering operation further improves the population diversity of the genetic algorithm and alleviates premature convergence, thus effectively improving convergence.

Similarity calculation To verify the effect of different similarity measures on the results, we use the cosine distance instead of the Euclidean distance to calculate the similarity between feature maps. On CIFAR-10, the same individual is trained with the different distances, and the other parameters are kept consistent during training. As can be seen from Table 10, the results with the Euclidean distance outperform those with the cosine distance on all three models. In addition, previous work FPSC [22] also shows that the Euclidean distance is better than the cosine distance.
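
For reference, switching to the cosine distance only changes the per-pair distance in (2); a sketch reusing the [s, C, h, w] layout from Section 3.2 (the function name is ours):

```python
import torch
import torch.nn.functional as F

def feature_map_avg_cosine_distance(feats: torch.Tensor) -> torch.Tensor:
    """Cosine-distance counterpart of Eq. (2); feats has shape [s, C, h, w]."""
    s, c = feats.shape[0], feats.shape[1]
    flat = feats.reshape(s, c, -1)                        # [s, C, h*w]
    cos = F.cosine_similarity(flat.unsqueeze(2),          # [s, C, 1, h*w]
                              flat.unsqueeze(1), dim=-1)  # broadcast -> [s, C, C]
    return (1.0 - cos).mean(dim=0)                        # average over the s samples
```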

Table 11 Comparison of some individuals generated during the search on CIFAR-10 for ResNet56

4.6 Visualization and analysis

We visualize some individuals generated during the search process of ResNet56 on CIFAR-10. Figure 6 shows the effect of hierarchical pruning on these individuals, and Table 11 shows the differences between them.

From Fig. 6, it can be found that the filter pruning ratio varies across layers for different individuals. This indicates that our pruning algorithm adjusts the pruning ratio according to the pruning sensitivity of each layer during the search process. Additionally, for residual networks such as ResNet, excessive pruning of filters in the early stages leads to a decrease in accuracy, and the accuracy of pruned models is usually low when the residual blocks at the beginning or end of each stage (the 9/10/18/19-th layers in the figure) are over-pruned. Specifically:

  1. (1)

    As shown in Table 11, the classification accuracy of individuals (a), (c), and (d) after fine-tuning is relatively low. Comparing with Fig. 6, it can be seen that individual (a) prunes too many filters in the early stages and in the 9-th and 10-th layers; the 10-th and 19-th layers of individual (c) are over-pruned; and the 19-th layer of individual (d) is over-pruned.

  2. (2)

    For the optimal individuals of their respective generations, (b) and (f), it can be seen that the 9/10/18/19-th layers are not over-pruned, which results in higher testing accuracy at an appropriate compression ratio.

  3. (3)

    Furthermore, compared with the worst individual (a) in the initial population, the worst individual (e) in the final generation has larger Flops and Params compression ratios, and its testing accuracy after fine-tuning is even better. This is because individual (e) preserves more filters in the 10-th and 19-th layers and evolves in a better direction, seeking a balance between compression ratio and accuracy.

The reason for the above phenomenon may be that the first residual block in each stage downsamples the features and requires more filters for feature extraction to avoid information loss. In subsequent pruning processes, specific treatment of these layers can guide network pruning more effectively.

5 Conclusion

This paper introduces a novel automatic filter pruning algorithm based on feature map average similarity and a reverse search genetic algorithm (AFPruner) for automatically searching the optimal combination of pruning ratios in convolutional neural networks. We use reverse search and clustering to enhance population diversity and accelerate convergence, and we use the average Euclidean distance between feature maps to calculate their similarity during pruning, thereby selecting filters. The experimental results show that AFPruner performs well, achieving a high pruning ratio and model accuracy, and outperforms most current pruning algorithms. In addition, we found that over-pruning the layers that downsample features has a significant impact on accuracy, which suggests that these layers can be treated specifically in subsequent pruning processes. In future research, we will consider the following directions: optimizing the search space for network pruning, leveraging faster optimization methods, and extending model pruning to additional tasks, such as object detection.