1 Introduction

In the past decade, convolutional neural networks (CNNs) have achieved remarkable success in various computer vision tasks, such as image classification [1], object detection [2], and semantic segmentation [3], owing to their exceptional representation learning ability. However, these high-performance CNN models, such as VGG [4], ResNet [5], and DenseNet [6], require large amounts of memory and computational resources, which limits their deployment on edge computing devices, e.g., mobile devices.

In recent years, compression and acceleration techniques for deep models have become a hot research topic. Commonly used techniques include knowledge distillation, network pruning, and binary quantization. Network pruning reduces the storage and computational cost of deep convolutional neural networks by removing redundant components, thus achieving model compression and acceleration. Current research focuses mainly on structured pruning, which removes unimportant filters from the original network according to appropriate pruning strategies. Li et al. [7] evaluate convolution kernels according to the \(l_1\)-norm and remove unimportant kernels. Liu et al. [8] use \(\textrm{L}_1\) regularization of the channel scaling factors of the batch normalization (BN) layer to measure channel importance, and then prune the channels with smaller scaling factors. HRank [9] first introduces the rank of the feature map as the evaluation criterion, which makes the whole pruning process more efficient.

In the pruning process, the essential tasks are selecting an appropriate pruning strategy to identify redundant components and determining the optimal pruning ratio. However, existing pruning strategies are overly simple and neglect the structural integrity of the CNN model and the global correlation between its layers. These strategies prune layer by layer with a uniform pruning ratio, which may leave some layers overly pruned while others still exhibit structural redundancy, and thus fail to obtain the optimal network structure. In addition, setting the pruning ratio relies on manual experience and requires repeated tests to find the optimal overall pruning ratio.

To address the above-mentioned problems, we propose AFPruner, an automatic filter pruning algorithm based on feature map average similarity and a reverse search genetic algorithm (RSGA), which automatically searches for the optimal combination of pruning ratios for the convolutional layers of a CNN. AFPruner uses feature map average similarity to measure the similarity of filters, and then prunes similar filters at each layer according to the pruning ratio. In this way, we can obtain the optimal CNN model structure.

Our contributions are summarized as follows:

  1. (1)

    We propose a reverse search genetic algorithm (RSGA), which enhances the diversity of the population using a reverse search strategy. At the end of each iteration, the algorithm replaces the two worst individuals with their reverse individuals. Additionally, we use a clustering method to group individuals and select individuals from different clusters for evolutionary operations, which further avoids premature convergence.

  2. (2)

    We use the reverse search genetic algorithm to search for the optimal combination of pruning ratios for the convolutional layers of the CNN. The similarity of filters is then described by the feature map average similarity, computed as the average Euclidean distance between the feature maps of multiple samples, and filters are pruned according to this similarity. This avoids the effect of sample bias on the filter similarity measurement. According to the similarity of the filters, we retain one filter per group of similar filters at each layer of the network and prune the rest, achieving model compression. This approach reduces manual intervention and repeated testing, and realizes automatic pruning.

  3. (3)

    We design experiments on the CIFAR-10, CIFAR-100, and ILSVRC-2012 datasets to verify the performance of the algorithm and compare it with other methods. The results show that AFPruner achieves a good balance between performance and compression ratio and is superior to most current CNN pruning algorithms.

2 Related works

2.1 Pruning strategy

Because weight pruning leads to unstructured sparse filters, it is hard to accelerate the pruned model on general hardware. Consequently, existing pruning research has mainly focused on structured filter pruning. Current filter pruning strategies can be roughly divided into the following two types:

Importance metric Common pruning methods usually use the weight norm [7, 10, 11] to measure the importance of filters; filters with smaller norm values contain less feature information and should be pruned first. Azadeh et al. [12] propose a genetic-based joint dynamic pruning and learning method that prunes redundant filters adaptively during training, which considers not only the \(l_1\)-norm of the filter but also the \(l_1\)-norm of its gradient to jointly measure filter importance. NISP [13] obtains the importance score of each filter by optimizing the reconstruction error of the final response layer. Attention mechanisms [14] have also been introduced into model pruning. Cheng et al. [15] apply Squeeze-and-Excitation (SE) blocks to each convolutional layer to predict the importance factor of each filter.

Similarity metric FPGM [16] shows that "smaller-norm-less-important" does not always hold for filters. It instead calculates the geometric median of all filters in a layer to measure the similarity between filters, and then prunes the filters with the smallest sum of distances to the geometric median. OSFP [17] also applies the idea of filter similarity but, unlike FPGM [16], explores the distance relationships among all filters of each convolutional layer. Similarly, Li et al. [18] judge whether a filter is replaceable by exploring the similarity relationship between feature maps. CLR-RNF [19] uses the idea of k-reciprocal nearest neighbors to measure similar filters. Several recent works [20,21,22] introduce clustering to prune similar filters.

2.2 Pruning ratio

The current methods for setting the pruning ratio can be broadly divided into two types: predefined and automatic. Predefined methods manually set the same pruning ratio for all convolutional layers, and the pruning strategies mentioned above belong to this type. Previous studies [7, 23] have revealed that different convolutional layers have different sensitivities to pruning. As a result, recent research has focused on automatically determining the pruning ratio, i.e., setting a different pruning threshold for each convolutional layer.

Yang et al. [24] leverage second-order information to prune filters with low sensitivity. AMC [25] combines reinforcement learning to determine the pruning ratio for each layer automatically. DSA [26] proposes a differentiable pruning process that optimizes the sparsity distribution in a continuous space to find the pruning ratio of each layer. DMC [27] assigns a pruning ratio to each layer by applying a discrete gate to each channel.

Liu et al. [28] have shown that the essence of network pruning is finding the optimal structure. Based on this, recent works treat the pruning problem as an optimization problem. ABCPruner [29] first shrinks the combinations of channels to a specific space, and then automatically obtains the pruning ratio of each layer through the artificial bee colony algorithm (ABC). MetaPruning [30] proposes a two-stage pruning algorithm, which pretrains a PruningNet using meta-learning to predict the weight parameters of the pruned model, and then utilizes a genetic algorithm to search for the optimal combination of pruning ratios. ACP [31] first completes a preliminary pruning of channels through hierarchical channel clustering, and then uses particle swarm optimization to search for the optimal pruned model. SST [32] limits the channel pruning ratio of each layer to four thresholds using two magnitude parameters and uses particle swarm optimization to search for the combination of pruning ratios. IPSOPruner [33] removes the more sensitive layers during the pruning process and then uses an improved particle swarm optimization algorithm to search for the pruning ratios. AACP [34] uses hyperparameters to compress the space of pruning ratio combinations and improves the traditional differential evolution algorithm to obtain the optimal pruning ratio for each layer. To cope with the intractably large number of possible pruned structures for convolutional networks, the above methods artificially constrain the search space, which may cause them to miss the optimal pruning ratio combination.

In addition to directly searching for pruning ratios, some works sparsify different layers of the network from the perspective of searching over weight parameters and pruning strategies. Wang et al. [35] guide channel pruning by sparsifying the distribution of BN layers, dynamically adjust channel importance using a scale factor and a bias factor (\(\gamma \) and \(\beta \)), and combine a genetic algorithm to search for the optimal factor combination. LFPC [36] defines the search space as a set of pruning strategies to learn filter pruning criteria. According to the filter distribution of each layer, it adaptively selects a suitable pruning strategy for each layer to prune independently.

3 Methodology

This section introduces the proposed automatic filter pruning framework based on the reverse search genetic algorithm. In the convolutional layers of a CNN, the important filters are able to extract discriminative local information [37]. Without destroying the network structure, pruning unimportant filters can effectively decrease the number of parameters and computations of the model and accelerate inference. For a pre-trained CNN model with \(\textrm{L}\) convolutional layers, the model weights are \(\left\{ \textrm{W}_{i} \in \mathbb {R}^{C_{i+1} \times C_{i} \times k_{i} \times k_{i}}, 1 \le i \le \textrm{L}\right\} \), where \(\textrm{C}_{i}\), \(\textrm{C}_{i+1}\), and \(k_{i}\) are the number of input channels, the number of output channels, and the size of the convolution kernel of the i-th convolutional layer, respectively. The weights of the j-th filter of the i-th convolutional layer are denoted as \(\textrm{W}_{i}^{j} \in \mathbb {R}^{C_{i} \times k_{i} \times k_{i}}\). \(\textrm{F}_{i}^{j} \in \mathbb {R}^{h_{i} \times w_{i}}\) is the feature map generated by the filter \(\textrm{W}_{i}^{j}\) after convolution. \(h_{i}\) and \(w_{i}\) are the height and width of the feature maps of the i-th layer, respectively.
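
To make the notation concrete, the following minimal PyTorch sketch (the layer sizes are illustrative and not from the paper) shows how \(\textrm{W}_{i}\), \(\textrm{W}_{i}^{j}\), and \(\textrm{F}_{i}^{j}\) map onto a Conv2d layer's weight tensor and output channels:

```python
import torch
import torch.nn as nn

# One convolutional layer: the weight tensor has shape [C_{i+1}, C_i, k_i, k_i],
# i.e., one filter W_i^j of shape [C_i, k_i, k_i] per output channel.
conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)
print(conv.weight.shape)        # torch.Size([32, 16, 3, 3])
print(conv.weight[0].shape)     # one filter W_i^1: torch.Size([16, 3, 3])

# The feature maps F_i^j are the per-channel slices of the layer output.
x = torch.randn(1, 16, 28, 28)  # a single input sample
out = conv(x)                   # shape [1, 32, h_i, w_i]
print(out[0, 0].shape)          # feature map of the first filter: torch.Size([28, 28])
```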

3.1 Definition of filter pruning

Most existing pruning methods prune filters (channels) of all layers with a predefined global pruning ratio. However, the pruned model may not be optimal because filters (channels) in different layers have different pruning sensitivities. In this paper, determining the optimal combination of per-layer filter pruning ratios is treated as a combinatorial optimization problem. Specifically, we define the pruning problem as follows: given an unpruned model \(\textrm{M}\) with \(\textrm{L}\) convolutional layers, and assuming that each layer of the network has c filters, the filter pruning problem can be formulated as:

$$\begin{aligned} \begin{array}{l} \min \mathcal {L}_{\text{ test }}\left( \textrm{M}^{*}, \textrm{W}\right) \\ \text{ s.t. } \textrm{M}^{*}=\arg \min \mathcal {L}_{\text{ train }}(\textrm{R}^{*}, \textrm{W}) \end{array} \end{aligned}$$
(1)

where \(\textrm{M}^*\) is the pruned model, \(\textrm{R}^*=[r_1,r_2,\dots ,r_\textrm{L}]\) is the filter pruning ratio vector, with \(r_i\) denoting the filter pruning ratio of the i-th layer, \(i=1,\dots ,\textrm{L}\). \(\textrm{W}\) represents the weight parameters of the network, and \(\mathcal {L}\) represents the cross-entropy classification loss of the network. This paper aims to obtain the pruned model \(\textrm{M}^*\) with the lowest training loss by searching over the original network model. Furthermore, after determining the pruning ratio of each layer's filters, fine-tuning is performed to minimize the testing error of the pruned sub-network.

3.2 Filter similarity evaluation

Fig. 1
figure 1

For the evaluation of the pruned models, filters are selected by calculating the feature map average similarity over s samples. For the m-th layer of the network, we calculate the similarity matrix of all output feature maps and remove the filters corresponding to the feature maps with the highest similarity, according to the filter pruning ratio of this layer in the current structure vector

During the search phase, we propose a method for evaluating filter similarity based on feature map average similarity. According to the searched pruning ratios, AFPruner prunes similar filters at each layer.

To evaluate its performance, the pruned model inherits the weight parameters of the original network. Inspired by previous work [38], we use the average Euclidean distance (other distances can also be used here, such as the cosine distance, which is discussed in detail in Section 4.5) of the feature maps of s samples to measure the filter similarity in each convolutional layer. The similarity is calculated as follows:

$$\begin{aligned} \begin{array}{l} {dist}\left( \textrm{M}_{i}^{l}, \textrm{M}_{j}^{l}\right) =\frac{1}{{s}} \sum _{{t}=1}^{{s}} \sqrt{\sum _{{n}=1}^{\textrm{N}}\left( {m}_{{t}, {i}}^{{n}}-{m}_{{t}, {j}}^{{n}}\right) ^{2}} \\ \text{ s.t. } m_{{t}, i}^{n}= \text{ flatten } \left( \textrm{M}_{{t}, {i}}^{{l}}\right) , m_{{t}, {j}}^{n}= \text{ flatten } \left( \textrm{M}_{{t}, {j}}^{{l}}\right) \end{array} \end{aligned}$$
(2)

where \(\textrm{M}^{l}_{{t},{i}}\) represents the i-th feature map generated by the t-th image in the l-th layer, and \(\textrm{flatten}(\cdot )\) converts the feature map matrix into a vector. To obtain a more accurate similarity, we use the average Euclidean distance between the feature maps of s samples, which reduces the impact of sample bias. A smaller value of \(dist(\textrm{M}^l_i,\textrm{M}^l_j)\) means that the two feature maps are more similar; the specific process is shown in Fig. 1.
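
As a reference, a minimal PyTorch sketch of the computation in (2) is given below, assuming the feature maps of one layer for s samples are stacked into a tensor of shape [s, C, h, w] (the function name and tensor layout are ours):

```python
import torch

def feature_map_avg_similarity(feats: torch.Tensor) -> torch.Tensor:
    """Average pairwise Euclidean distance between feature maps, as in Eq. (2).

    feats: tensor of shape [s, C, h, w] holding the C feature maps of one layer
           for s samples. Returns a [C, C] distance matrix; smaller entries mean
           the corresponding filters produce more similar feature maps.
    """
    s, c = feats.shape[0], feats.shape[1]
    flat = feats.reshape(s, c, -1)        # flatten each feature map into a vector
    dist = torch.cdist(flat, flat, p=2)   # per-sample pairwise distances, [s, C, C]
    return dist.mean(dim=0)               # average over the s samples, [C, C]
```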

3.3 Framework of the automatic filter pruning algorithm (AFPruner)

Fig. 2
figure 2

AFPruner algorithm framework. The optimal pruned model is obtained through the genetic algorithm and then fine-tuned to recover the model accuracy

The proposed algorithm consists of four steps, as illustrated in Fig. 2. The entire algorithm flow is shown in Algorithm 1.

  • Population initialization: Initialize the set of pruning ratio vectors by reverse search; each vector represents an individual in the population (i.e., a pruned model \(\textrm{M}^*\)), and each element of the vector represents the filter pruning ratio of one layer.

  • Population evolution: Selection, crossover, mutation, and evaluation of individuals to obtain the next-generation population.

  • Reverse operation: At the end of each iteration, the two individuals with the lowest fitness values are selected to generate two reverse individuals, which replace them in the population. This process increases population diversity and avoids premature convergence.

  • Model fine-tuning: After the iterations of the reverse search genetic algorithm are completed, the individual with the highest fitness in the population is selected as the optimal pruning ratio combination \(\textrm{R}^*\), and the pruning method described in Section 3.2 is then used to obtain the pruned model \(\textrm{M}^*\). Finally, the pruned model is fine-tuned to restore accuracy.

Algorithm 1
figure e

Automatic Filter Pruning (AFPruner)
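
Since Algorithm 1 is presented as a figure, the following self-contained sketch paraphrases the four steps above in Python; the helper functions are placeholders for the components of Sections 3.2 and 3.4, and the fitness here is random so the sketch runs end to end:

```python
import random

def evaluate(ratios):
    # Placeholder: in AFPruner this prunes by feature-map similarity, briefly
    # fine-tunes, and returns the fitness of Eq. (3). Random here for the sketch.
    return random.random()

def crossover_mutate(population, Pc, Pm):
    # Placeholder for the selection / crossover / mutation operators of Section 3.4.
    return population

def afpruner_search(L, N=10, T=10, Pc=0.7, Pm=0.1, a=0.0, b=1.0):
    # Step 1: reverse initialization (Eq. 4) - half random, half reversed.
    half = [[random.uniform(a, b) for _ in range(L)] for _ in range(N // 2)]
    population = half + [[a + b - r for r in ind] for ind in half]

    for t in range(T):
        # Step 2: population evolution.
        population = crossover_mutate(population, Pc, Pm)
        # Step 3: reverse operation - replace the two lowest-fitness individuals
        # with their reverse individuals to preserve diversity.
        population.sort(key=evaluate)
        population[:2] = [[a + b - r for r in ind] for ind in population[:2]]

    # Step 4: return the best pruning-ratio vector R*; the corresponding pruned
    # model is then fine-tuned to restore accuracy.
    return max(population, key=evaluate)

best_ratios = afpruner_search(L=13)  # e.g., the 13 convolutional layers of VGG16
```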

3.4 The optimal combination search of pruning ratios

To address the optimization problem described in (1), we propose the reverse search genetic algorithm (RSGA) for searching the optimal combination of pruning ratios across the convolutional layers. In this section, we provide a detailed description of our algorithm.

Fitness Previous works [29, 31,32,33,34] use the classification accuracy of the network model to evaluate the fitness of individuals, without considering the relationship between the model compression ratio and classification accuracy. Our algorithm uses a combination of model classification accuracy and model compression ratio as the individual fitness function to achieve a balance between classification accuracy and compression ratio. Specifically, the fitness is defined as follows:

$$\begin{aligned} \begin{array}{l} \text{ Fitness } ={w}_{1} {\text {Acc}}+{w}_{2} \frac{\textrm{F}_{{o}}-\textrm{F}_{{k}}}{\textrm{F}_{{o}}}+{w}_{3} \frac{\textrm{P}_{{o}}-\textrm{P}_{{k}}}{\textrm{P}_{{o}}} \\ \text{ s.t. } \sum _{i=1}^{3} w_{i}=1 \end{array} \end{aligned}$$
(3)

where \({\text {Acc}}\) denotes the classification accuracy of the individual, \(\textrm{F}_{{o}}\) and \(\textrm{P}_{{o}}\) are the Flops and Params of the original network, respectively, and \(\textrm{F}_{{k}}\) and \(\textrm{P}_{{k}}\) are the Flops and Params retained by the individual, respectively. \(w_1\), \(w_2\), and \(w_3\) are dynamically adjustable parameters.
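
A direct transcription of (3), with accuracy expressed as a fraction in [0, 1] so the three terms are comparable (the default weights are the values selected in Section 4.2; variable names are ours):

```python
def fitness(acc, flops_kept, params_kept, flops_orig, params_orig,
            w1=0.7, w2=0.2, w3=0.1):
    """Fitness of Eq. (3): weighted accuracy plus Flops and Params reduction ratios."""
    assert abs(w1 + w2 + w3 - 1.0) < 1e-9, "the weights must sum to 1"
    return (w1 * acc
            + w2 * (flops_orig - flops_kept) / flops_orig
            + w3 * (params_orig - params_kept) / params_orig)
```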

Initialization Assuming the population size is \(\textrm{N}\), we first randomly generate \(\textrm{N}/2\) individuals to form a subpopulation \(\textrm{P}_1\), and then apply the reverse operation to \(\textrm{P}_1\) to generate a subpopulation \(\textrm{P}_2\). Subpopulations \(\textrm{P}_1\) and \(\textrm{P}_2\) are combined to form the population \(\textrm{P}\), as expressed in (4).

$$\begin{aligned} \left. \begin{array}{l} \textrm{R}_{i}=\left[ r_{1}, r_{2}, r_{3}, \ldots , r_{\textrm{L}}\right] \\ \textrm{R}_{i}^{\prime }=\left[ r_{1}^{\prime }, r_{2}^{\prime }, r_{3}^{\prime }, \ldots , r_{\textrm{L}}^{\prime }\right] \\ \text{ s.t } r_{i}^{\prime }=a+b-r_{i} \end{array}\right\} \rightarrow \begin{array}{l} \textrm{P}_{1}=\left[ \textrm{R}_{1}, \textrm{R}_{2}, \ldots \textrm{R}_{\textrm{N} / 2}\right] \\ \textrm{P}_{2}=\left[ \textrm{R}_{1}^{\prime }, \textrm{R}_{2}^{\prime }, \cdots \textrm{R}_{\textrm{N} / 2}^{\prime }\right] \end{array} \end{aligned}$$
(4)

where L is the number of convolutional layers in the network model. The pruning ratio \(r_i\) of the i-th layer is sampled uniformly from \(\left[ a,b\right) \); in this paper, the value range of \(r_i\) is [0, 1). \(R_i\) represents an individual, i.e., a filter pruning ratio vector. Using the aforementioned reverse initialization, the initial population is better distributed over the search space, and N structure vectors, as shown in Fig. 2(a), can be obtained. We use t to denote the generation of population evolution, so the initial population \(\textrm{P}\) can be expressed as:

$$\begin{aligned} \textrm{P}=\left\{ \textrm{P}_{1}^{0}, \textrm{P}_{2}^{0}\right\} = \left[ \textrm{R}_{1}, \textrm{R}_{2}, \ldots , \textrm{R}_{\textrm{N}/2},\textrm{R}_{1}^{\prime }, \textrm{R}_{2}^{\prime }, \cdots \textrm{R}_{\textrm{N} / 2}^{\prime }\right] \end{aligned}$$
(5)

Selection To maintain the diversity of individuals in the population, we first cluster the individuals into three subgroups using the K-means clustering algorithm with Euclidean distance. Then, in each generation, individuals are selected for crossover and mutation from different subgroups, and an elitist preservation strategy directly transfers the two individuals with the highest fitness to the next generation. The specific process consists of the following three steps; a brief code sketch follows (6).

Fig. 3
figure 3

Crossover process

Fig. 4
figure 4

Mutation process

  1. (1)

    The population \(\textrm{P}^t\) is clustered into three subgroups \(\textrm{P}^t_1, \textrm{P}^t_2\), and \(\textrm{P}_3^t\).

  2. (2)

    Two different subgroups are randomly selected.

  3. (3)

    The roulette wheel selection algorithm is used to select one individual from each of the two subgroups; the two selected individuals form a pair used for the subsequent evolutionary operations, and the probability of an individual being selected is proportional to its fitness value. Assuming the number of individuals in a subgroup is m, for each individual \(\textrm{R}_i\) in the subgroup \(\textrm{P}_j^t, j=1,2,3\), the selection probability \(s_i\) is calculated as follows:

$$\begin{aligned} s_i=\frac{{f}_{{i}}}{\sum _{{k}=1}^{{m}} {f}_{{k}}}, {i}=1,2, \ldots , {m} \end{aligned}$$
(6)
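
Putting the three steps together, a sketch of the selection procedure is shown below (scikit-learn's KMeans is used for step (1); the function name and library choice are ours):

```python
import random
import numpy as np
from sklearn.cluster import KMeans

def select_pair(population, fitnesses, n_clusters=3):
    """Select one parent from each of two different subgroups (steps (1)-(3))."""
    # Step (1): cluster the pruning-ratio vectors into three subgroups
    # by Euclidean distance.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.array(population))
    # Step (2): randomly pick two different subgroups.
    g1, g2 = random.sample(sorted(set(labels)), 2)
    parents = []
    for g in (g1, g2):
        idx = [i for i, lab in enumerate(labels) if lab == g]
        # Step (3): roulette-wheel selection inside the subgroup, Eq. (6) -
        # selection probability proportional to fitness.
        weights = [fitnesses[i] for i in idx]
        parents.append(population[random.choices(idx, weights=weights, k=1)[0]])
    return parents
```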

Crossover We utilize the two-point crossover operation. First, two crossover points are randomly selected based on the crossover probability. Then, the genes between the selected points of the two individuals are exchanged, generating two new individuals. The operation is illustrated in Fig. 3.

Mutation We employ the two-point mutation operation. After crossover, two genes are randomly selected for each individual according to the mutation probability and reinitialized. The mutation process is shown in Fig. 4.
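
The two operators correspond roughly to the following sketch (default probabilities are the values used in Section 4.1; the reinitialization range follows the [a, b) bounds of the initialization):

```python
import random

def two_point_crossover(p1, p2, Pc=0.7):
    """Swap the gene segment between two random cut points (Fig. 3)."""
    c1, c2 = p1[:], p2[:]
    if random.random() < Pc:
        i, j = sorted(random.sample(range(len(p1)), 2))
        c1[i:j + 1], c2[i:j + 1] = p2[i:j + 1], p1[i:j + 1]
    return c1, c2

def two_point_mutation(ind, Pm=0.1, a=0.0, b=1.0):
    """Reinitialize two randomly chosen genes (Fig. 4)."""
    m = ind[:]
    if random.random() < Pm:
        for i in random.sample(range(len(ind)), 2):
            m[i] = random.uniform(a, b)
    return m
```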

Reverse operation To maintain population diversity and avoid premature convergence, we introduce the reverse operation into the genetic algorithm. At the end of each iteration, the two individuals with the lowest fitness are selected for the reverse operation, which generates two new individuals that replace the original two in the population. By incorporating this reverse operation, the genetic algorithm maintains population diversity to a certain extent, which is beneficial for convergence.

4 Experiments

To validate the effectiveness of our proposed method, we conduct experiments using widely used network models (VGGNet, ResNet) and compare the results with other algorithms on multiple datasets.

4.1 Datasets and experimental setting

We conduct experiments on publicly available datasets, including CIFAR-10, CIFAR-100, and ILSVRC-2012. CIFAR-10 contains 60,000 (50,000 training and 10,000 testing) 32\(\times \)32 color images in 10 classes, with 6,000 images per class. CIFAR-100 is similar to CIFAR-10, except that it has 100 classes, each containing 600 images. ILSVRC-2012 is one of the most commonly used subsets of ImageNet; it contains 1,000 classes with 1.28 million training images and 50k validation images.

In the experiments, we use stochastic gradient descent (SGD) for fine-tuning with a momentum of 0.9. On CIFAR-10 and CIFAR-100, the initial learning rate is 0.01, the batch size is 128, and the weight decay is set to 5e-3. The final model is fine-tuned for 160 epochs, with the learning rate divided by 10 every 50 epochs. On ILSVRC-2012, the batch size is set to 256, the weight decay is set to 1e-4, and 100 epochs are used for fine-tuning. The initial learning rate is set to 0.1 and divided by 10 every 30 epochs. During the evolutionary search, the pruned models are fine-tuned for 2 epochs to obtain more accurate classification accuracy. In Algorithm 1, \(\textrm{N} = 10, \textrm{T} = 10, \textrm{P}_c= 0.7\), and \(\textrm{P}_m = 0.1\).
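
As a point of reference, the CIFAR fine-tuning schedule above corresponds roughly to the following PyTorch setup (the model is a stand-in and the training step itself is omitted):

```python
import torch
import torch.nn as nn

model = nn.Linear(3 * 32 * 32, 10)   # stand-in for the pruned CNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-3)
# Learning rate divided by 10 every 50 epochs over 160 fine-tuning epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

for epoch in range(160):
    # ... one epoch of cross-entropy training on CIFAR with batch size 128 ...
    scheduler.step()
```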

4.2 w parameter analysis

In our method, \(w_1\), \(w_2\), and \(w_3\) are dynamically adjustable hyperparameters that weight the accuracy, Flops compression ratio, and parameter compression ratio of the pruned model in the fitness function, respectively. On CIFAR-10, we analyze the effect of these hyperparameters on the pruning results using the pruning process of the ResNet56 model.

Table 1 The influence of different w values on the compression ratio and accuracy of ResNet56 on CIFAR-10
Fig. 5
figure 5

(a) shows the average fitness over all individuals with respect to the generation number. The bars indicate the highest and lowest fitness in the corresponding generation. (b) shows the Euclidean distances between individuals before and after the reversal operation in each generation, including the average, maximum, and minimum distances, with solid lines representing distances before the reversal operation and dashed lines after it

The results are shown in Table 1.

  1. (1)

    From the first 5 rows of the table, it is clear that when \(w_1\) is small, the accuracy of the pruned model is low but the compression ratio is high. As \(w_1\) increases, the accuracy of the pruned model increases while the compression ratio decreases. When \(w_1=0.7\), the accuracy and compression ratio both achieve good results, so we initially choose \(w_1=0.7\).

  2. (2)

    To seek better results, we analyze \(w_1\) in more detail within the range [0.7, 0.8]. The results are shown in the middle 4 rows of the table. As \(w_1\) increases, the accuracy of the pruned model improves slightly, but the Flops and Params compression ratios decrease considerably. The reason is that more attention is paid to model accuracy while the compression ratio is neglected. Therefore, we finally choose \(w_1 = 0.7\) for the subsequent analysis of \(w_2\) and \(w_3\).

  3. (3)

    More attention is paid to the Flops compression ratio in network pruning because it directly affects the computational speed of the model. From the last 5 rows of the table, it can be seen that \(w_1=0.7\), \(w_2=0.2\), and \(w_3=0.1\) obtain the best trade-off between model accuracy and compression ratio.

Table 2 Accuracy and pruning ratio on CIFAR-10
Table 3 Accuracy and pruning ratio on CIFAR-100

4.3 Algorithm analysis

4.3.1 Performance analysis

To validate the performance of the algorithm, we statistically analyze the individuals produced for different models on CIFAR-10. The reverse search genetic algorithm tends to retain good genes from parents, with crossover and mutation central to the evolution process, i.e., a better individual is more likely to produce a better offspring through mutation or crossover. Figure 5(a) supports this view: the average fitness of all individuals is generally higher than that of the previous generation, which suggests that the overall quality of individuals is improved through crossover and mutation. In addition, as seen in Fig. 5(b), the distance between individuals is larger after the reverse operation. This further shows that the reverse operation increases the diversity of individuals, which is beneficial for discovering better individuals.

4.3.2 Robustness analysis

In order to rule out randomness in the experimental results, we use statistical tests to record the results. Specifically, five experiments are conducted on different datasets for different models, and the results of the five experiments are averaged. The experimental results are shown in Tables 2, 3, and 4, where "Model" is the model used in the experiment, "-" is the result with the smallest pruning ratio, "+" is the result with the largest pruning ratio, and "\(*\)" is the result averaged over 5 trials. Four widely used metrics are reported, including accuracy, Flops, Params, and pruning ratio, where "Pruned" denotes the pruning ratio. For CIFAR-10 and CIFAR-100, the Top-1 accuracy of the pruned models is provided. For ILSVRC-2012, both Top-1 and Top-5 accuracies are reported.

From the tables, it can be seen that after compressing the original model with an appropriate pruning ratio, the accuracy of the pruned models is slightly better than the baseline. This is because common convolutional neural networks exhibit different degrees of redundancy, and removing the redundant parameters not only does not harm the original network but also improves its generalization performance. AFPruner achieves nearly half of the compression ratio on both Flops and Params for different models, with VGG16 maximally compressed by 67.75% in Flops and 82.56% in Params. This indicates that VGG16 has greater parameter redundancy. In addition, across multiple experiments, the accuracies of these pruned models are approximately the same, and the averaged results reflect the stability of our algorithm. Based on the above analysis, our pruning algorithm can effectively achieve the compression and acceleration of CNNs.

Table 4 Accuracy and pruning ratio on ILSVRC-2012
Table 5 Performance comparison of ResNet56/110 on CIFAR-10

4.4 Compared with other methods

Since few pruning methods report results on CIFAR-100, we mainly compare our pruning results with other filter pruning methods on CIFAR-10 and ILSVRC-2012. In Section 4.4.1, the comparisons of pruning ratio and accuracy are reported in Tables 5 and 6. Among the compared methods, ABCPruner [29], AACP [34], and SST [32] are all population-based intelligent optimization algorithms, while AMC [25], HRank [9], FPGM [16], CSHE [21], Li et al. [18], FPSC [22], and CLR-RNF [19] are state-of-the-art filter pruning methods. The results of these compared methods are taken from the original articles. It is worth noting that the average values of the statistical tests are used for comparison here, not the optimal values. In Section 4.4.2, we further compare the time complexity of the algorithms in Table 7.

Table 6 Performance comparison of ResNet18/34/50 on ILSVRC-2012
Table 7 Time complexity on CIFAR-10
Table 8 The influence of reverse operation

4.4.1 Accuracy and pruning ratio

Comparison on CIFAR-10 For ResNet56, it can be seen from Table 5 that most methods decrease the classification accuracy after pruning. In contrast, our method exceeds the baseline. Compared with ABCPruner [29], AMC [25], SST [32], and AACP [34], which also search for pruning ratios, our method outperforms these algorithms at similar pruning ratios. Compared with the newly released algorithms FPSC [22] and CSHE [21], our method improves accuracy by 0.19% at comparable pruning ratios of Flops (52.35% vs 52.60% vs 50.00%) and Params (54.42% vs 54.20% vs 42.40%), while the accuracy of FPSC [22] and CSHE [21] decreases by 0.45% and 0.15%, respectively.

Table 9 Cluster analysis of ResNet56 on CIFAR-10

For ResNet110, even with 54.27% of Params and 54.65% of Flops removed, the accuracy of the pruned model is 0.58% higher than the baseline, which is slightly better than other methods. Among the compared methods, AFPruner outperforms CSHE [21] in all respects; CSHE [21] has the lowest pruning ratio, and the classification accuracy of its pruned model decreases by 0.06%. Although ABCPruner [29] and CLR-RNF [19] still improve accuracy over the baseline at high compression ratios, the improvement is not significant, which further illustrates that AFPruner can obtain the optimal pruning ratios.

Comparison on ILSVRC-2012 Table 6 shows the comparison of ResNet18/34/50 results on ILSVRC-2012. For ResNet18, AFPruner achieves the best results in terms of Flops, Params, and Top-1 accuracy. In particular, compared with Azadeh [12], which also adopts a genetic algorithm, our method compresses 48.22% of Flops and achieves 68.01% Top-1 accuracy, while Azadeh [12] compresses 20.30% of Flops and achieves only 67.82% Top-1 accuracy. For both ResNet34 and ResNet50, AFPruner still slightly outperforms existing methods at similar pruning ratios. Although Azadeh [12] slightly outperforms our method in terms of accuracy, its Flops compression ratio is only half of ours.

Table 10 Comparison of different similarity calculations on CIFAR-10

The comparisons of the experimental results on different datasets show that AFPruner achieves better compression ratios in both Flops and Params after pruning. Even though its accuracy is slightly lower than that of individual pruning algorithms on some models, its pruning ratio is higher than these algorithms, which indicates that AFPruner better considers the structural integrity and global correlation of CNN layers during filter pruning and thus can better prune the redundant structure of the model. The experimental results show the effectiveness and superiority of AFPruner.

4.4.2 Time complexity

We further compare the effective running time of the algorithms, including the search time and fine-tuning time. Specifically, we focus on ABCPruner [29] and ACP [31], which also use population-based intelligent optimization to search for pruning ratios. The results are reported in Table 7, where ABCPruner [29] employs the artificial bee colony (ABC) algorithm and ACP [31] employs the particle swarm optimization (PSO) algorithm. To ensure the fairness of the comparison, we use the same baseline for the different algorithms on CIFAR-10. During the experiments, the same parameter configuration was set up for a total of 10 rounds of searching, with a final fine-tuning of 160 epochs, and all experiments are performed on a Tesla V100-SXM2-32GB. (Note: because the original papers did not report running time, the data in Table 7 are not from the original papers but are reproduced by ourselves using the same parameters.)

Fig. 6
figure 6

Some individuals generated during the iterative pruning process of ResNet56 on CIFAR-10. Red represents the number of filters in each layer of the original network, and purple represents the number of filters in each layer of the pruned sub-network. \(\textrm{A}\) represents the test classification accuracy after two rounds of fine-tuning, \(\textrm{P}\) represents the Params compression ratio, and \(\textrm{F}\) represents the Flops compression ratio. (a) is the worst individual in \(P_0\); (b) is the optimal individual in \(P_0\); (c) and (d) are intermediate individuals in \(P_3\) and \(P_6\), respectively; (e) is the worst individual in \(P_9\); (f) is the optimal individual in \(P_9\) (that is, the final optimal individual)

As can be seen in Table 7, for the same model, our algorithm takes less time to run, and the final fine-tuned accuracy is higher.

4.5 Ablation study

Reverse operation To demonstrate the effectiveness of the reverse operation during population iteration, we compare the improved genetic algorithm with the traditional genetic algorithm in Table 8. The improved genetic algorithm achieves better results because the reverse operation increases the diversity of the population, which is more conducive to convergence.

Cluster analysis We take ResNet56 on CIFAR-10 as an example and analyze the influence of introducing clustering in Table 9. The results with clustering are better, indicating that the clustering operation further improves the population diversity of the genetic algorithm and alleviates premature convergence, thus effectively improving convergence.

Similarity calculation To verify the effect of different similarity measures on the results, we use the cosine distance instead of the Euclidean distance to calculate the similarity between feature maps. On CIFAR-10, the same individual is trained with the different distances, and the other parameters are kept consistent during training. As can be seen from Table 10, the results with the Euclidean distance outperform those with the cosine distance on all three models. In addition, previous work FPSC [22] also shows that the Euclidean distance is better than the cosine distance.
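
For reference, switching to the cosine distance only changes the per-pair distance in (2); a sketch reusing the [s, C, h, w] layout from Section 3.2 (the function name is ours):

```python
import torch
import torch.nn.functional as F

def feature_map_avg_cosine_distance(feats: torch.Tensor) -> torch.Tensor:
    """Cosine-distance counterpart of Eq. (2); feats has shape [s, C, h, w]."""
    s, c = feats.shape[0], feats.shape[1]
    flat = feats.reshape(s, c, -1)                        # [s, C, h*w]
    cos = F.cosine_similarity(flat.unsqueeze(2),          # [s, C, 1, h*w]
                              flat.unsqueeze(1), dim=-1)  # broadcast -> [s, C, C]
    return (1.0 - cos).mean(dim=0)                        # average over the s samples
```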

Table 11 Comparison of some individuals generated during the search on CIFAR-10 for ResNet56

4.6 Visualization and analysis

We visualize some individuals generated during the search process of ResNet56 on CIFAR-10. Figure 6 shows the effect of hierarchical pruning on these individuals, and Table 11 shows the differences between them.

From Fig. 6, it can be found that the filter pruning ratio varies across layers for different individuals. This indicates that our pruning algorithm adjusts the pruning ratio according to the pruning sensitivity of each layer during the search process. Additionally, for residual networks such as ResNet, excessive pruning of filters in the early stages leads to a decrease in accuracy, and the accuracy of pruned models is usually low when the residual blocks at the beginning or end of each stage (the 9/10/18/19-th layers in the figure) are over-pruned. Specifically:

  1. (1)

    As shown in Table 11, the classification accuracy of individuals (a), (c), and (d) after fine-tuning is relatively low. Comparing with Fig. 6, it can be seen that individual (a) prunes too many filters in the early stages and in the 9-th and 10-th layers; the 10-th and 19-th layers of individual (c) are over-pruned; and the 19-th layer of individual (d) is over-pruned.

  2. (2)

    For the optimal individuals of their respective generations, (b) and (f), it can be seen that the 9/10/18/19-th layers are not over-pruned, which results in higher testing accuracy at an appropriate compression ratio.

  3. (3)

    Furthermore, compared with the worst individual (a) in the initial population, the worst individual (e) in the final generation has larger Flops and Params compression ratios, and its testing accuracy after fine-tuning is even better. This is because individual (e) preserves more filters in the 10-th and 19-th layers and evolves in a better direction, seeking a balance between compression ratio and accuracy.

The reason for the above phenomenon may be that the first residual block in each stage downsamples the features and requires more filters for feature extraction to avoid information loss. In subsequent pruning processes, specific treatment of these layers can guide network pruning more effectively.

5 Conclusion

This paper introduces a novel automatic filter pruning algorithm based on feature map average similarity and a reverse search genetic algorithm (AFPruner) for automatically searching the optimal combination of pruning ratios in convolutional neural networks. We use reverse search and clustering to enhance population diversity and accelerate convergence, and we use the average Euclidean distance between feature maps to calculate their similarity during pruning, thereby selecting filters. The experimental results show that AFPruner performs well, achieving a high pruning ratio and model accuracy, and outperforms most current pruning algorithms. In addition, we found that over-pruning the layers that downsample features has a significant impact on accuracy, which suggests that these layers can be treated specifically in subsequent pruning processes. In future research, we will consider the following directions: optimizing the search space for network pruning, leveraging faster optimization methods, and extending model pruning to additional tasks, such as object detection.