1 Introduction

Convolutional Neural Networks (CNNs) have achieved tremendous success in various artificial intelligence applications, such as image classification, voice synthesis, and semantic segmentation (LeCun et al 2015; Krizhevsky et al 2017; Redmon et al 2016; Ronneberger et al 2015). Current CNN models have grown massive in pursuit of better accuracy, which introduces excessive computation and memory access and makes them challenging to deploy in low-power and real-time scenarios (Yang et al 2017; Li and Louri 2022; Yu et al 2017). Consequently, many customized accelerators have been designed to accelerate the General Matrix Multiplications (GEMM) of CNNs, which account for more than 95% of the operations during the inference stage (Lym and Erez 2020). Exploiting the high data reuse and computational parallelism in GEMM, the systolic array adopts a pipelined computation mode to improve performance and energy efficiency (Jouppi et al 2017) and has become the most popular CNN accelerator architecture. Both academia and industry have proposed systolic-array-based CNN accelerators, such as Google's Tensor Processing Unit (TPU) (Jouppi et al 2017) and MIT's Eyeriss (Chen et al 2016). However, given the trend of CNNs growing ever deeper for better accuracy, current custom CNN accelerators still struggle with the enormous computational and memory requirements of modern state-of-the-art CNNs (He et al 2016; Huang et al 2016; Rhu et al 2016).

Fig. 1 Mapping of pruned CNN in the systolic array and the performance optimization approach

Therefore, recent studies apply various software-level optimization techniques to make CNN inference more efficient on accelerators, such as data quantization, compact model design, and weight pruning (Sandler et al 2018; Yayla and Chen 2022; Han et al 2015b). Among them, weight pruning removes unimportant weights and reduces the corresponding parameter storage and computation, bringing significant energy efficiency and performance improvement (Liang et al 2021). Depending on the pruning granularity, pruning methods fall into two categories: unstructured pruning and structured pruning. As shown in Fig. 1b, unstructured pruning removes individual weights at a fine granularity and can bring significant parameter reduction (Han et al 2015a; Carreira-Perpinan and Idelbayev 2018). However, the pruned CNNs exhibit irregular sparse patterns, which are difficult for accelerators to exploit for performance improvement. By contrast, structured pruning usually prunes a set of weights at filter or channel granularity, which keeps the pruned CNNs well organized, so the parameter reduction can be conveniently converted into performance improvement by accelerators (Lin et al 2020; Li et al 2020, 2022). However, as shown in Fig. 1c, structured pruning suffers from a low pruning rate and can only provide limited performance improvement.

To improve the execution efficiency of pruned networks on CNN accelerators, recent work co-designs the accelerator architecture and the pruning method. Some studies focus on designing sparse accelerators that can exploit the irregular sparse patterns introduced by unstructured pruning (Zhang et al 2021; Chen et al 2020; Kung et al 2019). Aiming to improve performance and avoid extra energy consumption, these studies combine sparse columns/rows of weights to skip the unnecessary storage and computation of zero-valued weights. However, to maintain the input reuse and partial-sum accumulation of the systolic array, sparse CNN accelerators either introduce extra control units and data paths or extend the clock cycle. In contrast, other studies advocate customizing dedicated pruning algorithms to generate systolic-array-specific sparse patterns (Chitty-Venkata and Somani 2020; Asgari et al 2019; teja Vooturi et al 2019). However, due to strict sparsity pattern constraints, these approaches bring only limited performance/energy improvement.

To address the above challenges, in this work, we propose FASS-Pruner, a Fine-grained Accelerator-aware pruning framework via intra-filter Splitting and inter-filter Shuffling, which consists of FASplit-pruner and FAShuffle-pruner: (1) Considering the round-by-round execution characteristics of the CNN accelerator, FASplit-pruner splits filters into column-weight vectors and performs fine-grained pruning for a higher pruning rate; (2) Leveraging the filter-level calculation independence of the accelerator, FAShuffle-pruner performs inter-filter shuffling to group filters sharing the same sparse patterns and prunes the CNN at the granularity of row-weight vectors. In pursuit of higher pruning rates, we further explore the potential of combining FASplit-pruner and FAShuffle-pruner. Executing the column-wise FASplit-pruner before the row-wise FAShuffle-pruner yields the CR-wise pruner; swapping the execution order yields the RC-wise pruner. The experimental results show that, on average, FASplit-pruner and FAShuffle-pruner achieve 52.3% and 71.47% pruning rates, respectively. While preserving the original dataflow of CNN accelerators, FASplit-pruner and FAShuffle-pruner on average improve performance by 46.5% and 127.2% with 21.1% and 56.3% more energy saving than the baseline pruning approaches, respectively. Compared with SOTA accelerator-customized pruning methods, on average, the RC-wise and CR-wise pruners achieve 21.1% and 34.7% higher pruning rates and improve the inference performance of pruned models by 40.9% and 69.4% with 49.2% and 75.2% energy saving, respectively.

The main contributions of this paper are summarized as follows:

(1) Proposing FASplit-pruner, which splits filters according to the scale of the systolic array and prunes at the granularity of column-weight vectors.

(2) Proposing FAShuffle-pruner, which shuffles filters to prune at the granularity of row-weight vectors with a better accuracy-performance trade-off.

(3) Combining FASplit-pruner and FAShuffle-pruner into the CR-wise and RC-wise pruners to explore pruning opportunities at a finer granularity.

(4) Combining the sparse pattern of the pruned CNN with the dataflow of the systolic array, we modify the systolic-array-based accelerator to execute pruned sparse CNNs with better performance and lower energy consumption.

2 Background and motivation

Fig. 2 Illustration of the weight-stationary systolic array

2.1 CNN and systolic array

CNNs usually consist of multiple layers that extract and process features hierarchically, including convolutional (CONV) layers, pooling layers, and fully connected (FC) layers. Occupying the majority of a CNN's computation, CONV layers perform feature extraction from the input feature map (ifmap) with 3D filters. Pooling layers compress the output feature map (ofmap) by selecting representative values within specific regions, while FC layers process the extracted features and generate the final result. Since the CONV operation between filters and ifmap is generally converted into matrix-matrix multiplication by im2col, the systolic array, with simple operation logic and high parallelism provided by data reuse in two directions, is naturally advantageous for CNN acceleration.
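For concreteness, the following PyTorch sketch shows the standard im2col lowering of a CONV layer to GEMM via torch.nn.functional.unfold; the tensor shapes here are illustrative assumptions, not taken from the evaluated models:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)        # ifmap: N x C x H x W
w = torch.randn(16, 3, 3, 3)       # 16 filters of shape C x K x K

cols = F.unfold(x, kernel_size=3)  # im2col: (1, C*K*K, #patches) = (1, 27, 36)
gemm = w.view(16, -1) @ cols[0]    # GEMM: (16, 27) @ (27, 36) -> (16, 36)
ofmap = gemm.view(1, 16, 6, 6)     # reshape back to N x M x H' x W'

# The GEMM result matches direct convolution up to numerical error.
assert torch.allclose(ofmap, F.conv2d(x, w), atol=1e-4)
```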

Figure 2 shows the general architecture of systolic arrays, which comprise a set of connected processing elements (PEs) performing the same Multiply-and-Accumulate (MAC) operation. Taking the most widely applied weight-stationary (WS) dataflow as an example, all weights of a specific filter are loaded onto a column of PEs (Jouppi et al 2017). During the execution phase, the organized input streams into the systolic array from the left side and is reused by all filters, while the partial sum (Psum) streams and accumulates along each column of PEs to generate the final output. Since the size of the systolic array is fixed, the calculation of large-scale layers with a vast number of filters must be divided into several filter-group (i.e., inter-filter) execution rounds according to the array size (Kung et al 2019). Likewise, filters containing massive numbers of weights must be executed in multiple intra-filter rounds.
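As a first-order illustration of this tiling (a simplified sketch, not a cycle-accurate model such as Scale-sim), the round count of a layer under WS dataflow can be estimated as follows:

```python
import math

def execution_rounds(num_filters, weights_per_filter, array_h, array_w):
    """Illustrative round count for a WS systolic array (sketch)."""
    inter = math.ceil(num_filters / array_w)         # inter-filter rounds across columns
    intra = math.ceil(weights_per_filter / array_h)  # intra-filter rounds down the rows
    return inter * intra

# e.g., 256 filters of 3*3*64 = 576 weights on a 64x64 array:
print(execution_rounds(256, 576, 64, 64))  # ceil(256/64) * ceil(576/64) = 4 * 9 = 36
```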

2.2 Weight pruning for systolic array

Weight pruning seeks to reduce CNN size and computation by setting unimportant weights to zero. However, without considering the interplay between sparse weight patterns and the systolic array architecture, it is hard for CNN accelerators to benefit directly from the sparse pruned weight matrix. Studies that attempt to accelerate pruned CNNs on the systolic array mainly span two fields: sparse CNN accelerator design and customized accelerator-aware pruning:

(1) The first category aims to densify the unstructured sparse weights of pruned CNNs and change the original accelerator dataflow to skip the useless computation cycles of zero-valued weights for performance and energy saving. For instance, Kung et al (2019) proposed column combining in the systolic array to pack multiple filter-wise sparse weight columns and designed a conflict elimination mechanism to handle the situation where nonzeros from multiple columns occupy the same row positions. Similarly, Chen et al (2020) proposed an efficient column-combining framework named Tight Compression. By performing a simulated annealing (SA) algorithm during the weight permutation process, Tight Compression obtains an arrangement that avoids conflicts during column packing. All the above designs combine multiple filters into one and map it to a single column of PEs; consequently, extra control logic is introduced into the accelerator for correct partial-sum accumulation. Conversely, some studies aim to densify row-wise sparse weights (i.e., within a filter). For example, Zhang et al (2021) proposed row swapping to compact sparse weights within a filter and inserted multiplexers into the systolic array to map the corresponding inputs to the weights, thereby ensuring the execution correctness of the CNN.

(2) The other category advocates customizing weight sparsity patterns (e.g., filter sparsity (Lin et al 2020), channel sparsity (Liu et al 2017), and block sparsity (teja Vooturi et al 2019)) to match the dataflow of the systolic array, ensuring the systolic array fully benefits from the discarded unimportant weights of the pruned network. For example, Chitty-Venkata and Somani (2020) proposed array-aware pruning (AAP), which compresses the weight matrix of a CNN by filter-wise pruning to decrease the number of execution rounds on the systolic array for efficient model inference. Furthermore, recent research focuses on block sparsity, as the systolic array processes matrices in blocks. Asgari et al (2019) compress the weights of a CNN model into multiple weight blocks by sliding a fixed-size window over the weight matrix to select locally dense blocks. The pruned CNNs are organized as weight blocks matched to the systolic array's size, reducing the computation amount while avoiding the additional control overhead of organizing the dataflow. teja Vooturi et al (2019) further developed dynamic block sparsity reparameterization (DBSR) for CNNs, which generates block sparsity by using a set of trainable scaling parameters for the blocks and pushing them to zero during training. Compared with state-of-the-art structured sparsity approaches, DBSR tightly integrates structured sparsity generation with the training process and thus produces more efficient models on standard vision tasks such as image classification and semantic segmentation.
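As a toy illustration of the locally-dense-block idea (a simplification under our own assumptions, not the published sliding-window or DBSR procedures; the block sizes and keep ratio are hypothetical):

```python
import torch

def dense_block_mask(weight, bh, bw, keep_ratio=0.5):
    """Keep the locally densest bh x bw blocks of a 2D weight matrix (sketch)."""
    H, W = weight.shape
    scores = {}
    for r in range(0, H - bh + 1, bh):
        for c in range(0, W - bw + 1, bw):
            scores[(r, c)] = weight[r:r + bh, c:c + bw].abs().sum().item()
    # Retain the top keep_ratio fraction of blocks by L1 density.
    kept = sorted(scores, key=scores.get, reverse=True)[:int(len(scores) * keep_ratio)]
    mask = torch.zeros_like(weight, dtype=torch.bool)
    for r, c in kept:
        mask[r:r + bh, c:c + bw] = True
    return mask
```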

2.3 Motivation

Both lines of research have been demonstrated to be helpful in improving the energy and performance efficiency of pruned networks on CNN accelerators. However, some limitations still prevent them from being the ideal solution for co-acceleration between weight pruning and the CNN accelerator. Sparse CNN accelerator designs typically require extra control units and data paths to encode/decode the irregular sparse weight pattern and ensure the CNN's execution correctness, which may destroy the original fine-grained parallelism design philosophy of CNN accelerators (Ma et al 2022). Meanwhile, software-level accelerator-aware pruning struggles to deliver satisfactory pruning rates and performance/energy gains due to strict sparsity pattern constraints (Zhang et al 2021).

In customizing an efficient pruning framework for CNN accelerators, the critical factor is to design a flexible and fine-grained weight sparsity pattern that still matches the original high-data-reuse pattern of the systolic array. This is the focus of our study.

3 Software-hardware co-design framework

In this section, we propose FASS-Pruner, which contains two pruning methods: (1) FASplit-pruner splits filters according to the scale of the systolic array to perform pruning at the granularity of the weights in a column of the systolic array (column-weight vector). (2) FAShuffle-pruner shuffles the filters to perform pruning at the granularity of the weights in a row of the systolic array (row-weight vector). We further explore opportunities to combine these two approaches to flexibly customize systolic-array-specific sparsity patterns of a CNN. Moreover, to maintain the original workflow of the systolic array and maximize the execution efficiency of the pruned CNN, we modify the systolic-array-based accelerator to support pruned sparse CNNs.

3.1 FASplit-pruner

Fig. 3 Implementation of FASplit

As illustrated in Fig. 3a, assuming a convolution layer with nine filters, each containing a \(3\times 3\) kernel, the calculation of this layer is divided into nine rounds on the \(3\times 3\) systolic array. Intuitively, filter-wise pruning generates a structured sparsity pattern, enabling the condensed weights to be efficiently mapped onto the systolic array. However, coarse-granularity filter-wise pruning may suffer from a low pruning rate and thus bring only limited performance/energy gains. As mentioned in Sect. 2.1, the calculation of a filter is divided into several intra-filter rounds due to the systolic array's fixed size. Consequently, instead of coarse-grained filter-wise pruning, we refine the CNN's pruning granularity to array-column-wise by splitting filters into multiple column-weight vectors.

Fig. 4 Normalized execution time under different pruning rates

However, the sparsity generated by random column-wise pruning does not always bring an apparent performance gain in the WS systolic array architecture. Taking the unpruned 256 columns of weights as the baseline, Fig. 4 depicts the normalized execution time of pruned vectors on the \(64\times 64\) systolic array under different pruning rates. As can be seen, stepwise decreases in execution time occur only when an entire execution round is pruned away; otherwise, the performance improvement brought by an increasing pruning rate is feeble. Due to the high parallelism and data reuse of the systolic array, as long as a partially filled execution round remains, the number of pruned column-weight vectors has little effect on the execution cycles of the CNN accelerator. Meanwhile, retaining more parameters in a CNN generally means better accuracy and robustness. Consequently, to avoid whole columns of PE resources sitting idle after random column-wise pruning, the optimal accuracy-performance trade-off is achieved by constraining the remaining dense vectors to exactly fill the columns of PEs (except when the original number of column-weight vectors is smaller than the array width).

Inspired by the above observation, we propose a Fine-grained Accelerator-aware pruning framework named FASplit-pruner, which Splits filters into multiple column-weight vectors to perform flexible pruning. As shown in Fig. 3, the implementation of FASplit-pruner mainly consists of the following steps:

(A) To enforce a fine-grained weight sparsity pattern on the systolic array, FASplit-pruner first splits each filter into column-weight vectors according to the array height;

(B) During the column-weight vector pruning stage, FASplit-pruner uses the sum of the absolute weight values as the metric to evaluate the importance of each weight vector;

(C) Then, a batch of the least important vectors is pruned together to guarantee that the remaining column-weight vectors can be shrunk to exactly fill the systolic array (i.e., without any idle column of PEs);

(D) Finally, FASplit-pruner retrains the remaining weights to recover model accuracy;

(E) Steps (B)–(D) are repeated until the retrained CNN can no longer maintain the original accuracy.

As shown in Fig. 3, by considering the design size and intra-filter execution mode of the CNN accelerator, FASplit-pruner achieves fine-grained column-wise pruning without any underutilization of PE resources.
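A minimal PyTorch sketch of steps (B) and (C), assuming each layer's filters are flattened into a (num_filters, weights_per_filter) matrix; the function name and the target_rate parameter are illustrative, and retraining (step D) is left out:

```python
import torch
import torch.nn.functional as F

def fasplit_step(weight, array_h, array_w, target_rate):
    """One FASplit pruning step (sketch): returns a keep-mask over
    column-weight vectors, shaped (num_filters, vectors_per_filter)."""
    f, n = weight.shape
    pad = (-n) % array_h                                  # pad to a multiple of array_h
    vecs = F.pad(weight, (0, pad)).view(f, -1, array_h)   # step (A): split into vectors
    scores = vecs.abs().sum(dim=-1).flatten()             # step (B): L1 importance
    keep = scores.numel() - int(scores.numel() * target_rate)
    keep = max(array_w, keep - keep % array_w)            # step (C): fill whole columns
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[scores.argsort(descending=True)[:keep]] = True   # True = keep this vector
    return mask.view(f, -1)
```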

Fig. 5 Implementation of the proposed accelerator architecture

Compared to filter-wise pruning, FASplit-pruner creates fine-grained column-weight sparsity patterns for efficient inference on systolic arrays. In the original CNN accelerator execution model, each filter is strictly tied to a specific column of the systolic array until all execution rounds finish. As shown in Fig. 3, compacting column-weight vectors disturbs the original Psum accumulation among intra-filter rounds. As shown in Fig. 5, to ensure the correctness of inter-filter Psum accumulation, FASplit-pruner introduces a selector that sends each intra-filter round's calculation results (i.e., column-wise Psums) to the corresponding accumulator. To implement this logic, FASplit-pruner uses a few extra bits to encode the filter order of each column-weight vector during the weight-loading stage. Moreover, FASplit-pruner slightly extends the accumulator size to satisfy the higher-parallelism Psum accumulation requirements of the column-dense CNN. Compared to the dense systolic array itself, the selector and extra accumulators introduced by FASplit-pruner are negligible, incurring insignificant area and energy overhead.
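The selector's routing behavior can be pictured with the following simplified sketch; the data structures and function names are hypothetical and only mirror the logic described above, not the actual hardware:

```python
def route_psums(col_psums, filter_ids, accumulators):
    """Route each column's Psum to the accumulator of its filter, using the
    filter-order bits encoded at weight-loading time (behavioral sketch)."""
    for psum, fid in zip(col_psums, filter_ids):
        accumulators[fid] += psum
    return accumulators

# e.g., three compacted columns carrying vectors of filters 0, 0, and 2:
print(route_psums([1.5, 2.0, 0.5], [0, 0, 2], [0.0, 0.0, 0.0]))  # [3.5, 0.0, 0.5]
```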

3.2 FAShuffle-pruner

Fig. 6 Implementation of FAShuffle-pruner

As mentioned in Sect. 2.1, due to the fixed systolic array width, the calculation of large-scale layers with a massive number of filters is divided into multiple inter-filter execution rounds. Similar to FASplit-pruner, in this subsection we explore pruning in the other dimension (i.e., removing the weights at the same position of every filter in an execution round). As illustrated in Fig. 6, the compact row-weight vectors after retraining not only obviate the loading of unimportant weights but also save the expensive memory accesses of the corresponding inputs, bringing significant energy and performance gains for systolic arrays. However, to preserve the original ifmap reuse among columns of PEs, some high-magnitude weights within the row vectors may also be removed, restricting the final pruning rate and even incurring a severe accuracy drop. Fortunately, owing to the computational independence of each filter, this over-pruning problem can be alleviated by shuffling filters into several groups so that the unimportant weights are concentrated in the same rows of PEs.

Motivated by the above observations, we propose a Fine-grained Accelerator-aware pruning framework named FAShuffle-pruner, which leverages inter-filter shuffling to prune unimportant row-weight vectors on the systolic array. FAShuffle-pruner consists of two stages: (1) filter shuffling and (2) row-weight vector pruning.

3.2.1 Filter shuffling

Algorithm 1 Heuristic filter shuffling

The pruning rate of FAShuffle-pruner mainly depends on the effectiveness of the filter-shuffling algorithm, which divides the filters of a specific layer into several groups so that unimportant row-weight vectors are more likely to be concentrated in the same rows of PEs. The shuffling algorithm ensures that the filters in a group share the same sparse pattern after pruning, and FAShuffle-pruner prunes the CNN at the granularity of row-weight vectors. Compacting the row-weight vectors reduces the number of intra-filter rounds and thus boosts the performance of the CNN accelerator. Besides, to maintain the input reuse of the systolic array, the filter shuffling algorithm must make the number of filters in a group match the array size. Moreover, since the weight matrix is tuned during the iterations of model pruning and retraining, it is infeasible to verify candidate shuffling solutions one by one based on weight importance scores; we therefore design a heuristic algorithm that uses the irregular sparse pattern generated by unstructured pruning (regarded as the ideal sparse pattern) to guide the shuffling process.

We summarize the heuristic filter shuffling approach in Algorithm 1. First, iterative unstructured pruning is performed on the target model to obtain its irregular sparse pattern, and this ideal sparse weight structure is taken as the guidance for filter shuffling. Since the filters in a group share the same sparse pattern, the ideal sparse pattern can be clustered to produce grouping information, making the final result close to the ideal state. Considering that the number of filters in each group needs to equal the array size, balanced k-Means (Malinen and Fränti 2014) is selected for filter grouping. Suppose the i-th convolutional layer consists of \(K^i\) filters and each filter \(F_j^i\in \mathbf {R^{C\times K\times K}} \) after unstructured pruning generates a binary mask \(M_j^i\in \mathbf {B^{C\times K\times K}}(\textbf{B}=\{0,1\}) \). During clustering (iteration round t), each filter mask is a sample and each group \(C_x^t\) has a centroid location \(c_x^t\). The centroid locations are initially chosen randomly and, in subsequent iterations, recalculated as follows:

$$\begin{aligned} c_x^t = \frac{1}{|C_x^t|} \sum _{M\in C_x^t} M \end{aligned}$$
(1)

The distance between a sample and a centroid location is:

$$\begin{aligned} d_{j,x} = \sum |M_j^i - c_x^t|\end{aligned}$$
(2)

The regular k-Means method assigns each sample to the nearest centroid before recalculating the new centroid locations. To maintain the same number of samples in each group while minimizing clustering error, balanced k-Means instead performs the Hungarian algorithm to obtain the optimal balanced assignment (Burkard et al 2012). The iteration terminates when the grouping no longer changes, and the final clustering result is taken as the grouping of filters. The shuffling process is performed layer by layer to prepare the target CNN for subsequent row-weight vector pruning.
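A minimal NumPy/SciPy sketch of this balanced clustering (Eqs. 1 and 2), assuming the number of filters is divisible by the number of groups; scipy.optimize.linear_sum_assignment plays the role of the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def balanced_kmeans(masks, k, iters=50, seed=0):
    """Group binary filter masks into k equal-size clusters (sketch).
    masks: (num_filters, C*K*K) 0/1 array; num_filters must be divisible by k."""
    rng = np.random.default_rng(seed)
    n, _ = masks.shape
    g = n // k                                        # filters per group (array width)
    centroids = masks[rng.choice(n, k, replace=False)].astype(float)
    labels = None
    for _ in range(iters):
        dist = np.abs(masks[:, None, :] - centroids[None]).sum(-1)  # Eq. (2)
        cost = np.repeat(dist, g, axis=1)             # expand k groups into n slots
        rows, cols = linear_sum_assignment(cost)      # optimal balanced assignment
        new_labels = np.empty(n, dtype=int)
        new_labels[rows] = cols // g
        if labels is not None and (new_labels == labels).all():
            break                                     # grouping no longer changes
        labels = new_labels
        for x in range(k):                            # Eq. (1): recompute centroids
            centroids[x] = masks[labels == x].mean(axis=0)
    return labels
```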

3.2.2 Row-weight vector pruning

Based on the filter groups generated by the shuffling algorithm, the weight matrix is divided into row-weight vectors. FAShuffle-pruner then prunes the model at the granularity of row-weight vectors in the following stages:

(A) Select a certain percentage of unimportant row-weight vectors. Like FASplit-pruner, FAShuffle-pruner evaluates the importance of a row-weight vector by the sum of the absolute values of its weights (a scoring sketch follows this list);

(B) Prune the selected row-weight vectors;

(C) Retrain the model to recover accuracy;

(D) Repeat steps (A)–(C) until the model accuracy can no longer be maintained;

(E) Compress the pruned filters group by group and encode the sparse pattern.
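A minimal PyTorch sketch of steps (A) and (B) for one shuffled filter group; the prune_ratio parameter and function name are illustrative assumptions, and retraining (step C) is omitted:

```python
import torch

def prune_row_vectors(group_weights, prune_ratio=0.3):
    """Zero out the least important row-weight vectors of one filter group
    (sketch). group_weights: (array_w, weights_per_filter)."""
    scores = group_weights.abs().sum(dim=0)      # step (A): L1 score per row position
    n_prune = int(scores.numel() * prune_ratio)
    idx = scores.argsort()[:n_prune]             # least important positions
    pruned = group_weights.clone()
    pruned[:, idx] = 0.0                         # step (B): same positions in all filters
    return pruned
```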

Since FAShuffle-pruner removes several row-weight vectors and compresses the pruned filters, the corresponding inputs need to be skipped, requiring a finer-grained input skip/selection. As shown in Fig. 5, we design a data setup unit to match the inputs with their counterpart weights. Besides, shuffled filters generate shuffled outputs, which are mismatched with the computation of the next layer; the selector therefore sends each output directly to the corresponding accumulator to restore the outputs to their original order.

3.3 Combination and choice of pruning methods

Fig. 7 Implementation of RC-wise pruner and CR-wise pruner

Since FASplit-pruner and FAShuffle-pruner compress models in different dimensions (column-weight and row-weight), they can be combined. Executing the column-wise FASplit-pruner before the row-wise FAShuffle-pruner yields the CR-wise pruner; swapping the execution order yields the RC-wise pruner.

Figure 7 illustrates the implementation of the RC-wise and CR-wise pruners. As shown in Fig. 7a, for a specific layer, the RC-wise pruner first prunes the original model at the granularity of weights sharing the same position in all filters (row-weights) as a coarse-grained preprocessing step. Then, taking the average absolute weight value as the importance metric, FASplit-pruner is applied to further compress the CNN model.

As shown in Fig. 7b, the CR-wise pruner first applies regular filter-wise pruning to the original CNN to remove redundancy in the column-weight dimension. For better performance, the number of remaining filters in each layer should be a multiple of the systolic array width; in other words, a whole group of filters is pruned in each pruning-and-retraining iteration. FAShuffle-pruner is then applied to the pruned model for fine-grained pruning.
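A minimal sketch of this first CR-wise stage, again over flattened filters; the 50% target rate is an illustrative assumption:

```python
import torch

def cr_filter_stage(weight, array_w, target_rate=0.5):
    """Filter-wise pruning that keeps a multiple of array_w filters, so each
    inter-filter round still fills the array (sketch). Returns kept indices."""
    f = weight.shape[0]
    scores = weight.abs().sum(dim=1)              # L1 importance per filter
    keep = max(array_w, f - int(f * target_rate))
    keep -= keep % array_w                        # round down to a whole filter group
    return scores.argsort(descending=True)[:keep]
```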

Furthermore, the above pruning methods also apply to FC layers, whose weights naturally form a 2D matrix. Since the input of an FC layer is a 1D vector, each neuron calculates only one result. The difference between FASplit-pruner and FAShuffle-pruner in pruning dimension determines their application scenarios. For a CONV or FC layer, the number of filters/neurons and the number of weights per filter/neuron define the layer's size in the two pruning dimensions. We can therefore select a pruner by comparing these two quantities: when the number of filters/neurons is greater than the number of weights per filter/neuron, FASplit-pruner is suggested; otherwise, FAShuffle-pruner is suggested.
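The selection rule reduces to a single comparison per layer; the following sketch (with hypothetical naming) makes it concrete:

```python
import torch

def choose_pruner(weight):
    """Pick a pruner per layer from its weight tensor (sketch)."""
    num_out = weight.shape[0]              # filters (CONV) or neurons (FC)
    per_out = weight.numel() // num_out    # weights per filter/neuron
    return "FASplit-pruner" if num_out > per_out else "FAShuffle-pruner"

# e.g., a 512 x 256 FC weight has more neurons than weights per neuron:
print(choose_pruner(torch.empty(512, 256)))  # -> FASplit-pruner
```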

4 Experimental methodology

Datasets and models To demonstrate the effectiveness of FASS-Pruner, we conduct experiments on both small and large datasets (i.e., CIFAR-10 and ImageNet). We select two mainstream CNN backbones, VGG and ResNet. On CIFAR-10, we evaluate our method on VGG16 and ResNet56; on ImageNet, we adopt VGG16 and ResNet50. To accelerate training, simple data augmentation (random crop and random horizontal flip) and normalization are applied to the training sets of both datasets. For ImageNet, images are resized to \(256\times 256\) to regularize the input size.

Configurations Pruning of the above models is implemented in PyTorch on a 2.4 GHz Intel CPU with an Nvidia GeForce RTX 3090 graphics card. The learning rate for retraining is set to 0.001 for models on CIFAR-10 and 0.01 for models on ImageNet. The default size of the systolic array is \(64\times 64\). We adapt Scale-sim, a cycle-accurate architecture simulator for CNN accelerators, to estimate the relative execution time and memory access. The energy consumption is calculated with the metric used in Eyeriss.

Evaluation protocols We adopt the number of parameters (Params) and the required floating-point operations (FLOPs) to evaluate the pruned model size and computational requirements. The relative execution time (ReT) evaluates the speedup of pruned models on CNN accelerators, and the normalized energy consumption (Nec) represents the energy consumption. These four protocols are uniformly reported as relative change rates in the subsequent results.

5 Evaluation results and analysis

Table 1 Pruning results on CIFAR-10
Table 2 Pruning results on ImageNet

This section is arranged as follows: In Sect. 5.1, we compare FASplit-pruner and FAShuffle-pruner with state-of-the-art structured pruning methods, comprising two filter pruning methods (L1 and Hrank) and two channel pruning methods (CP and Slimming) (Han et al 2015b; Lin et al 2020; He et al 2017; Liu et al 2017). In Sect. 5.2, we compare the RC-wise and CR-wise pruners with three accelerator-customized pruning methods: AAP, BP, and DBSR (Chitty-Venkata and Somani 2020; Asgari et al 2019; teja Vooturi et al 2019).

5.1 FASS versus structured pruning

Table 3 Pruning result of VGG16 on CIFAR-10

To verify the effectiveness of FASplit-pruner and FAShuffle-pruner, we select several SOTA structured pruning methods for comparison. According to the similarity in pruning dimension, the filter pruning methods (L1 and Hrank) are compared with FASplit-pruner, and the channel pruning methods (CP and Slimming) with FAShuffle-pruner.

Tables 1 and 2 list the evaluation results on CIFAR-10 and ImageNet, respectively. Compared with L1 and Hrank, FASplit-pruner achieves significantly better parameter and FLOPs reductions (e.g., 66.0% vs. 27.1% and 36.9% in Params, and 63.1% vs. 31.0% and 33.3% in FLOPs on ResNet50), which demonstrates the superiority of fine-grained pruning. Moreover, FASplit-pruner usually obtains a larger performance improvement (70.3% vs. 31.6%) even under a parameter reduction similar to Hrank's on VGG16 (CIFAR-10) (88.6% vs. 82.9%), indicating that FASplit-pruner excels at performance improvement and is more practical for CNN accelerators. As shown in Table 1, both filter pruning methods and FASplit-pruner hardly compress ResNet56, and FASplit-pruner even performs worse than L1 and Hrank. Examining the structure of ResNet56, most CONV layers contain fewer than 64 filters, so they are compact and hard to compress by pruning filters.

Compared with CP and Slimming, FAShuffle-pruner yields better compression and acceleration on all test models, verifying the pruning flexibility gained from the fine pruning granularity. FAShuffle-pruner also obtains a more compact parameter count (50.8% vs. 46.3%) and faster acceleration (50.0% vs. 38.5%) than FASplit-pruner. Through analysis, we attribute FAShuffle-pruner's advantage over FASplit-pruner to two reasons: (1) FAShuffle-pruner is more flexible due to filter shuffling. (2) Mainstream CNN models increase the number of filters with depth to fully extract features, making the weight structure slender and increasing redundancy in the row-weight dimension, which suits FAShuffle-pruner. In addition, compared to the baseline approaches, the energy efficiency of FASplit-pruner and FAShuffle-pruner is improved by 1.27× and 2.29× on average, respectively, showing the effectiveness of FASS in energy saving.

5.2 FASS versus customizing pruning

We further validate the effectiveness of the RC-wise and CR-wise pruners in improving performance and energy efficiency by comparing them with three SOTA accelerator-customized pruning methods (AAP, BP, and DBSR). We conduct experiments with VGG16 on the CIFAR-10 dataset, aiming to simulate the common but critical scenario of large models on small datasets. The results are shown in Table 3. As can be seen, the models pruned by the RC-wise and CR-wise pruners are more efficient than those of the other accelerator-customized pruning methods. On average, the RC-wise and CR-wise pruners achieve 21.1% and 34.7% higher pruning rates and improve the inference performance of pruned models by 40.9% and 69.4% with 49.2% and 75.2% energy saving, respectively. All the accelerator-customized pruning methods have a larger impact on accuracy; even a small amount of pruning by BP (64.9%) causes an unacceptable accuracy decrease (3.79%). As discussed above, this is because these accelerator-customized pruning methods strictly constrain the sparse patterns of models, limiting their representation capability. By contrast, regardless of the combination order, the RC-wise and CR-wise pruners obtain significant compression and acceleration (89.3% and 90.5% in Params; 72.7% and 88.6% in ReT) with only a slight loss in accuracy (0.35% and 0.44%). This suggests that the RC-wise and CR-wise pruners preserve the flexibility of sparse patterns while compressing the model. The results also show that the CR-wise pruner outperforms the RC-wise pruner, demonstrating that the CR-wise pruner is more suitable for the mainstream slender weight structure. Moreover, compared with FASplit-pruner or FAShuffle-pruner alone, the two combinations lead to better results, even if the improvement is modest (e.g., 90.5% by CR-wise pruner vs. 89.1% by FAShuffle-pruner in Params of VGG16).

Overall, these experimental results have confirmed that FASS can effectively compress and accelerate model inference on CNN accelerators.

6 Conclusion

In this paper, we proposed FASS-Pruner, a fine-grained accelerator-aware pruning framework that effectively compresses CNNs while preserving the original dataflow of CNN accelerators through filter splitting and filter shuffling. We evaluated FASS-Pruner on mainstream CNN models by comparing it with state-of-the-art structured pruning methods and accelerator-customized pruning methods. The experimental results validate the effectiveness of FASS-Pruner, which achieves significant performance improvement and energy saving.