1 Introduction

Convolutional Neural Networks (CNNs) have achieved tremendous success in various artificial intelligence applications, such as image classification, voice synthesis, and semantic segmentation (LeCun et al 2015; Krizhevsky et al 2017; Redmon et al 2016; Ronneberger et al 2015). Current CNN models have grown massive in pursuit of better accuracy, which introduces excessive computation and memory access and makes them challenging to deploy in low-power and real-time scenarios (Yang et al 2017; Li and Louri 2022; Yu et al 2017). Consequently, many customized accelerators have been designed to accelerate the General Matrix Multiplications (GEMM) of CNNs, which account for more than 95% of the operations during the inference stage (Lym and Erez 2020). Exploiting the high data reuse and computational parallelism in GEMM, the systolic array adopts a pipelined computation mode to improve performance and energy efficiency (Jouppi et al 2017) and has become the most popular CNN accelerator architecture. Both academia and industry have proposed systolic-array-based CNN accelerators, such as Google's Tensor Processing Unit (TPU) (Jouppi et al 2017) and MIT's Eyeriss (Chen et al 2016). However, given the trend of CNNs growing ever deeper for better accuracy, current custom CNN accelerators still struggle with the enormous computational and memory requirements of modern state-of-the-art CNNs (He et al 2016; Huang et al 2016; Rhu et al 2016).

Fig. 1 Mapping of pruned CNN in the systolic array and the performance optimization approach

Therefore, recent studies apply various software-level optimization techniques to make CNN inference more efficient on accelerators, such as data quantization, compact model design, and weight pruning (Sandler et al 2018; Yayla and Chen 2022; Han et al 2015b). Among them, weight pruning removes unimportant weights and reduces the corresponding parameter storage and computation, bringing significant energy efficiency and performance improvement (Liang et al 2021). Depending on the pruning granularity, pruning methods fall into two categories: unstructured pruning and structured pruning. As shown in Fig. 1b, unstructured pruning removes individual weights at a fine granularity and can bring significant parameter reduction (Han et al 2015a; Carreira-Perpinan and Idelbayev 2018). However, the pruned CNNs exhibit irregular sparse patterns, which are difficult for accelerators to exploit for performance improvement. By contrast, structured pruning usually prunes a set of weights at filter or channel granularity, which keeps the pruned CNNs well organized, so the parameter reduction can be conveniently converted into performance improvement by accelerators (Lin et al 2020; Li et al 2020, 2022). However, as shown in Fig. 1c, structured pruning suffers from a low pruning rate and can only provide limited performance improvement.

To improve the execution efficiency of pruned networks on CNN accelerators, recent work co-designs the accelerator architecture and the pruning method. Some studies focus on designing sparse accelerators that can exploit the irregular sparse patterns introduced by unstructured pruning (Zhang et al 2021; Chen et al 2020; Kung et al 2019). Aiming to improve performance and avoid extra energy consumption, these studies combine sparse columns/rows of weights to skip the unnecessary storage and computation of zero-valued weights. However, to maintain the input reuse and partial-sum accumulation of the systolic array, sparse CNN accelerators either introduce extra control units and data paths or extend the clock cycle. In contrast, other studies advocate customizing dedicated pruning algorithms to generate systolic-array-specific sparse patterns (Chitty-Venkata and Somani 2020; Asgari et al 2019; teja Vooturi et al 2019). However, due to strict sparsity pattern constraints, these approaches bring only limited performance/energy improvement.

To address the above challenges, in this work, we propose FASS-Pruner, a Fine-grained Accelerator-aware pruning framework via intra-filter Splitting and inter-filter Shuffling, which consists of FASplit-pruner and FAShuffle-pruner: (1) Considering the round-by-round execution characteristics of the CNN accelerator, FASplit-pruner splits filters into column-weight vectors and performs fine-grained pruning for a higher pruning rate; (2) Leveraging the filter-level calculation independence of the accelerator, FAShuffle-pruner performs inter-filter shuffling to group filters sharing the same sparse patterns and prunes the CNN at the granularity of row-weight vectors. In pursuit of higher pruning rates, we further explore the potential of combining FASplit-pruner and FAShuffle-pruner. Executing the column-wise FASplit-pruner before the row-wise FAShuffle-pruner yields the CR-wise pruner; swapping the execution order yields the RC-wise pruner. The experimental results show that, on average, FASplit-pruner and FAShuffle-pruner achieve 52.3% and 71.47% pruning rates, respectively. While preserving the original dataflow of CNN accelerators, FASplit-pruner and FAShuffle-pruner on average improve performance by 46.5% and 127.2% with 21.1% and 56.3% more energy saving than the baseline pruning approaches, respectively. Compared with SOTA accelerator-customized pruning methods, on average, the RC-wise and CR-wise pruners achieve 21.1% and 34.7% higher pruning rates and improve the inference performance of pruned models by 40.9% and 69.4% with 49.2% and 75.2% energy saving, respectively.

The main contributions of this paper are summarized as follows:

(1) Proposing FASplit-pruner, which splits filters according to the scale of the systolic array and prunes at the granularity of column-weight vectors.

(2) Proposing FAShuffle-pruner, which shuffles filters to prune at the granularity of row-weight vectors with a better accuracy-performance trade-off.

(3) Combining FASplit-pruner and FAShuffle-pruner into the CR-wise and RC-wise pruners to explore pruning opportunities at a finer granularity.

(4) Combining the sparse pattern of the pruned CNN with the dataflow of the systolic array, we modify the systolic-array-based accelerator to execute pruned sparse CNNs with better performance and lower energy consumption.

2 Background and motivation

Fig. 2 Illustration of the weight-stationary systolic array

2.1 CNN and systolic array

CNNs usually consist of multiple layers that extract and process features hierarchically, including convolutional (CONV) layers, pooling layers, and fully connected (FC) layers. Occupying the majority of a CNN's computation, CONV layers perform feature extraction from the input feature map (ifmap) with 3D filters. Pooling layers compress the output feature map (ofmap) by selecting representative values within specific regions, while FC layers process the extracted features and generate the final result. Since the CONV operation between filters and ifmap is generally converted into matrix-matrix multiplication by im2col, the systolic array, with simple operation logic and high parallelism provided by data reuse in two directions, is naturally advantageous for CNN acceleration.
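For concreteness, the following PyTorch sketch shows the standard im2col lowering of a CONV layer to GEMM via torch.nn.functional.unfold; the tensor shapes here are illustrative assumptions, not taken from the evaluated models:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)        # ifmap: N x C x H x W
w = torch.randn(16, 3, 3, 3)       # 16 filters of shape C x K x K

cols = F.unfold(x, kernel_size=3)  # im2col: (1, C*K*K, #patches) = (1, 27, 36)
gemm = w.view(16, -1) @ cols[0]    # GEMM: (16, 27) @ (27, 36) -> (16, 36)
ofmap = gemm.view(1, 16, 6, 6)     # reshape back to N x M x H' x W'

# The GEMM result matches direct convolution up to numerical error.
assert torch.allclose(ofmap, F.conv2d(x, w), atol=1e-4)
```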

Figure 2 shows the general architecture of systolic arrays, which comprise a set of connected processing elements (PEs) performing the same Multiply-and-Accumulate (MAC) operation. Taking the most widely applied weight-stationary (WS) dataflow as an example, all weights of a specific filter are loaded onto a column of PEs (Jouppi et al 2017). During the execution phase, the organized input streams into the systolic array from the left side and is reused by all filters, while the partial sum (Psum) streams and accumulates along each column of PEs to generate the final output. Since the size of the systolic array is fixed, the calculation of large-scale layers with a vast number of filters must be divided into several filter-group (i.e., inter-filter) execution rounds according to the array size (Kung et al 2019). Likewise, filters containing massive numbers of weights must be executed in multiple intra-filter rounds.
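As a first-order illustration of this tiling (a simplified sketch, not a cycle-accurate model such as Scale-sim), the round count of a layer under WS dataflow can be estimated as follows:

```python
import math

def execution_rounds(num_filters, weights_per_filter, array_h, array_w):
    """Illustrative round count for a WS systolic array (sketch)."""
    inter = math.ceil(num_filters / array_w)         # inter-filter rounds across columns
    intra = math.ceil(weights_per_filter / array_h)  # intra-filter rounds down the rows
    return inter * intra

# e.g., 256 filters of 3*3*64 = 576 weights on a 64x64 array:
print(execution_rounds(256, 576, 64, 64))  # ceil(256/64) * ceil(576/64) = 4 * 9 = 36
```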

2.2 Weight pruning for systolic array

Weight pruning seeks to reduce CNN size and computation by setting unimportant weights to zero. However, without considering the interplay between sparse weight patterns and the systolic array architecture, it is hard for CNN accelerators to benefit directly from the sparse pruned weight matrix. Studies that attempt to accelerate pruned CNNs on the systolic array mainly span two fields: sparse CNN accelerator design and customized accelerator-aware pruning:

(1) The first category aims to densify the unstructured sparse weights of pruned CNNs and change the original accelerator dataflow to skip the useless computation cycles of zero-valued weights for performance and energy saving. For instance, Kung et al (2019) proposed column combining in the systolic array to pack multiple filter-wise sparse weight columns and designed a conflict elimination mechanism to handle the situation where nonzeros from multiple columns occupy the same row positions. Similarly, Chen et al (2020) proposed an efficient column-combining framework named Tight Compression. By performing a simulated annealing (SA) algorithm during the weight permutation process, Tight Compression obtains an arrangement that avoids conflicts during column packing. All the above designs combine multiple filters into one and map it to a single column of PEs; consequently, extra control logic is introduced into the accelerator for correct partial-sum accumulation. Conversely, some studies aim to densify row-wise sparse weights (i.e., within a filter). For example, Zhang et al (2021) proposed row swapping to compact sparse weights within a filter and inserted multiplexers into the systolic array to map the corresponding inputs to the weights, thereby ensuring the execution correctness of the CNN.

(2) The other category advocates customizing weight sparsity patterns (e.g., filter sparsity (Lin et al 2020), channel sparsity (Liu et al 2017), and block sparsity (teja Vooturi et al 2019)) to match the dataflow of the systolic array, ensuring the systolic array fully benefits from the discarded unimportant weights of the pruned network. For example, Chitty-Venkata and Somani (2020) proposed array-aware pruning (AAP), which compresses the weight matrix of a CNN by filter-wise pruning to decrease the number of execution rounds on the systolic array for efficient model inference. Furthermore, recent research focuses on block sparsity, as the systolic array processes matrices in blocks. Asgari et al (2019) compress the weights of a CNN model into multiple weight blocks by sliding a fixed-size window over the weight matrix to select locally dense blocks. The pruned CNNs are organized as weight blocks matched to the systolic array's size, reducing the computation amount while avoiding the additional control overhead of organizing the dataflow. teja Vooturi et al (2019) further developed dynamic block sparsity reparameterization (DBSR) for CNNs, which generates block sparsity by using a set of trainable scaling parameters for the blocks and pushing them to zero during training. Compared with state-of-the-art structured sparsity approaches, DBSR tightly integrates structured sparsity generation with the training process and thus produces more efficient models on standard vision tasks such as image classification and semantic segmentation.
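As a toy illustration of the locally-dense-block idea (a simplification under our own assumptions, not the published sliding-window or DBSR procedures; the block sizes and keep ratio are hypothetical):

```python
import torch

def dense_block_mask(weight, bh, bw, keep_ratio=0.5):
    """Keep the locally densest bh x bw blocks of a 2D weight matrix (sketch)."""
    H, W = weight.shape
    scores = {}
    for r in range(0, H - bh + 1, bh):
        for c in range(0, W - bw + 1, bw):
            scores[(r, c)] = weight[r:r + bh, c:c + bw].abs().sum().item()
    # Retain the top keep_ratio fraction of blocks by L1 density.
    kept = sorted(scores, key=scores.get, reverse=True)[:int(len(scores) * keep_ratio)]
    mask = torch.zeros_like(weight, dtype=torch.bool)
    for r, c in kept:
        mask[r:r + bh, c:c + bw] = True
    return mask
```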

2.3 Motivation

Both lines of research have been demonstrated to be helpful in improving the energy and performance efficiency of pruned networks on CNN accelerators. However, some limitations still prevent them from being the ideal solution for co-acceleration between weight pruning and the CNN accelerator. Sparse CNN accelerator designs typically require extra control units and data paths to encode/decode the irregular sparse weight pattern and ensure the CNN's execution correctness, which may destroy the original fine-grained parallelism design philosophy of CNN accelerators (Ma et al 2022). Meanwhile, software-level accelerator-aware pruning struggles to deliver satisfactory pruning rates and performance/energy gains due to strict sparsity pattern constraints (Zhang et al 2021).

In customizing an efficient pruning framework for CNN accelerators, the critical factor is to design a flexible and fine-grained weight sparsity pattern that still matches the original high-data-reuse pattern of the systolic array. This is the focus of our study.

3 Software-hardware co-design framework

In this section, we propose FASS-Pruner, which contains two pruning methods: (1) FASplit-pruner splits filters according to the scale of the systolic array to perform pruning at the granularity of the weights in a column of the systolic array (column-weight vector). (2) FAShuffle-pruner shuffles the filters to perform pruning at the granularity of the weights in a row of the systolic array (row-weight vector). We further explore opportunities to combine these two approaches to flexibly customize systolic-array-specific sparsity patterns of a CNN. Moreover, to maintain the original workflow of the systolic array and maximize the execution efficiency of the pruned CNN, we modify the systolic-array-based accelerator to support pruned sparse CNNs.

3.1 FASplit-pruner

Fig. 3 Implementation of FASplit

As illustrated in Fig. 3a, assuming a convolution layer with nine filters, each containing a \(3\times 3\) kernel, the calculation of this layer is divided into nine rounds on the \(3\times 3\) systolic array. Intuitively, filter-wise pruning generates a structured sparsity pattern, enabling the condensed weights to be efficiently mapped onto the systolic array. However, coarse-granularity filter-wise pruning may suffer from a low pruning rate and thus bring only limited performance/energy gains. As mentioned in Sect. 2.1, the calculation of a filter is divided into several intra-filter rounds due to the systolic array's fixed size. Consequently, instead of coarse-grained filter-wise pruning, we refine the CNN's pruning granularity to array-column-wise by splitting filters into multiple column-weight vectors.

Fig. 4 Normalized execution time under different pruning rates

However, the sparsity generated by random column-wise pruning does not always bring an apparent performance gain in the WS systolic array architecture. Taking the unpruned 256 columns of weights as the baseline, Fig. 4 depicts the normalized execution time of pruned vectors on the \(64\times 64\) systolic array under different pruning rates. As can be seen, stepwise decreases in execution time occur only when an entire execution round is pruned away; otherwise, the performance improvement brought by an increasing pruning rate is feeble. Due to the high parallelism and data reuse of the systolic array, as long as a partially filled execution round remains, the number of pruned column-weight vectors has little effect on the execution cycles of the CNN accelerator. Meanwhile, retaining more parameters in a CNN generally means better accuracy and robustness. Consequently, to avoid whole columns of PE resources sitting idle after random column-wise pruning, the optimal accuracy-performance trade-off is achieved by constraining the remaining dense vectors to exactly fill the columns of PEs (except when the original number of column-weight vectors is smaller than the array width).

Inspired by the above observation, we propose a Fine-grained Accelerator-aware pruning framework named FASplit-pruner, which Splits filters into multiple column-weight vectors to perform flexible pruning. As shown in Fig. 3, the implementation of FASplit-pruner mainly consists of the following steps:

(A) To enforce a fine-grained weight sparsity pattern on the systolic array, FASplit-pruner first splits each filter into column-weight vectors according to the array height;

(B) During the column-weight vector pruning stage, FASplit-pruner uses the sum of the absolute weight values as the metric to evaluate the importance of each weight vector;

(C) Then, a batch of the least important vectors is pruned together to guarantee that the remaining column-weight vectors can be shrunk to exactly fill the systolic array (i.e., without any idle column of PEs);

(D) Finally, FASplit-pruner retrains the remaining weights to recover model accuracy;

(E) Steps (B)–(D) are repeated until the retrained CNN can no longer maintain the original accuracy.

As shown in Fig. 3, by considering the design size and intra-filter execution mode of the CNN accelerator, FASplit-pruner achieves fine-grained column-wise pruning without any underutilization of PE resources.
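A minimal PyTorch sketch of steps (B) and (C), assuming each layer's filters are flattened into a (num_filters, weights_per_filter) matrix; the function name and the target_rate parameter are illustrative, and retraining (step D) is left out:

```python
import torch
import torch.nn.functional as F

def fasplit_step(weight, array_h, array_w, target_rate):
    """One FASplit pruning step (sketch): returns a keep-mask over
    column-weight vectors, shaped (num_filters, vectors_per_filter)."""
    f, n = weight.shape
    pad = (-n) % array_h                                  # pad to a multiple of array_h
    vecs = F.pad(weight, (0, pad)).view(f, -1, array_h)   # step (A): split into vectors
    scores = vecs.abs().sum(dim=-1).flatten()             # step (B): L1 importance
    keep = scores.numel() - int(scores.numel() * target_rate)
    keep = max(array_w, keep - keep % array_w)            # step (C): fill whole columns
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[scores.argsort(descending=True)[:keep]] = True   # True = keep this vector
    return mask.view(f, -1)
```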

Fig. 5 Implementation of the proposed accelerator architecture

Compared to filter-wise pruning, FASplit-pruner creates fine-grained column-weight sparsity patterns for efficient inference on systolic arrays. In the original CNN accelerator execution model, each filter is strictly tied to a specific column of the systolic array until all execution rounds finish. As shown in Fig. 3, compacting column-weight vectors disturbs the original Psum accumulation among intra-filter rounds. As shown in Fig. 5, to ensure the correctness of inter-filter Psum accumulation, FASplit-pruner introduces a selector that sends each intra-filter round's calculation results (i.e., column-wise Psums) to the corresponding accumulator. To implement this logic, FASplit-pruner uses a few extra bits to encode the filter order of each column-weight vector during the weight-loading stage. Moreover, FASplit-pruner slightly extends the accumulator size to satisfy the higher-parallelism Psum accumulation requirements of the column-dense CNN. Compared to the dense systolic array itself, the selector and extra accumulators introduced by FASplit-pruner are negligible, incurring insignificant area and energy overhead.
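The selector's routing behavior can be pictured with the following simplified sketch; the data structures and function names are hypothetical and only mirror the logic described above, not the actual hardware:

```python
def route_psums(col_psums, filter_ids, accumulators):
    """Route each column's Psum to the accumulator of its filter, using the
    filter-order bits encoded at weight-loading time (behavioral sketch)."""
    for psum, fid in zip(col_psums, filter_ids):
        accumulators[fid] += psum
    return accumulators

# e.g., three compacted columns carrying vectors of filters 0, 0, and 2:
print(route_psums([1.5, 2.0, 0.5], [0, 0, 2], [0.0, 0.0, 0.0]))  # [3.5, 0.0, 0.5]
```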

3.2 FAShuffle-pruner

Fig. 6 Implementation of FAShuffle-pruner

As mentioned in Sect. 2.1, due to the fixed systolic array width, the calculation of large-scale layers with a massive number of filters is divided into multiple inter-filter execution rounds. Similar to FASplit-pruner, in this subsection we explore pruning in the other dimension (i.e., removing the weights at the same position of every filter in an execution round). As illustrated in Fig. 6, the compact row-weight vectors after retraining not only obviate the loading of unimportant weights but also save the expensive memory accesses of the corresponding inputs, bringing significant energy and performance gains for systolic arrays. However, to preserve the original ifmap reuse among columns of PEs, some high-magnitude weights within the row vectors may also be removed, restricting the final pruning rate and even incurring a severe accuracy drop. Fortunately, owing to the computational independence of each filter, this over-pruning problem can be alleviated by shuffling filters into several groups so that the unimportant weights are concentrated in the same rows of PEs.

Motivated by the above observations, we propose a Fine-grained Accelerator-aware pruning framework named FAShuffle-pruner, which leverages inter-filter shuffling to prune unimportant row-weight vectors on the systolic array. FAShuffle-pruner consists of two stages: (1) filter shuffling and (2) row-weight vector pruning.

3.2.1 Filter shuffling

Algorithm 1 Heuristic filter shuffling

The pruning rate of FAShuffle-pruner mainly depends on the effectiveness of the filter-shuffling algorithm, which divides the filters of a specific layer into several groups so that unimportant row-weight vectors are more likely to be concentrated in the same rows of PEs. The shuffling algorithm ensures that the filters in a group share the same sparse pattern after pruning, and FAShuffle-pruner prunes the CNN at the granularity of row-weight vectors. Compacting the row-weight vectors reduces the number of intra-filter rounds and thus boosts the performance of the CNN accelerator. Besides, to maintain the input reuse of the systolic array, the filter shuffling algorithm must make the number of filters in a group match the array size. Moreover, since the weight matrix is tuned during the iterations of model pruning and retraining, it is infeasible to verify candidate shuffling solutions one by one based on weight importance scores; we therefore design a heuristic algorithm that uses the irregular sparse pattern generated by unstructured pruning (regarded as the ideal sparse pattern) to guide the shuffling process.

We summarize the heuristic filter shuffling approach in Algorithm 1. First, iterative unstructured pruning is performed on the target model to obtain its irregular sparse pattern, and this ideal sparse weight structure is taken as the guidance for filter shuffling. Since the filters in a group share the same sparse pattern, the ideal sparse pattern can be clustered to produce grouping information, making the final result close to the ideal state. Considering that the number of filters in each group needs to equal the array size, balanced k-Means (Malinen and Fränti 2014) is selected for filter grouping. Suppose the i-th convolutional layer consists of \(K^i\) filters and each filter \(F_j^i\in \mathbf {R^{C\times K\times K}} \) after unstructured pruning generates a binary mask \(M_j^i\in \mathbf {B^{C\times K\times K}}(\textbf{B}=\{0,1\}) \). During clustering (iteration round t), each filter mask is a sample and each group \(C_x^t\) has a centroid location \(c_x^t\). The centroid locations are initially chosen randomly and, in subsequent iterations, recalculated as follows:

$$\begin{aligned} c_x^t = \frac{1}{|C_x^t|} \sum _{M\in C_x^t} M \end{aligned}$$
(1)

The distance between a sample and a centroid location is:

$$\begin{aligned} d_{j,x} = \sum |M_j^i - c_x^t|\end{aligned}$$
(2)

The regular k-Means method assigns each sample to the nearest centroid before recalculating the new centroid locations. To maintain the same number of samples in each group while minimizing clustering error, balanced k-Means instead performs the Hungarian algorithm to obtain the optimal balanced assignment (Burkard et al 2012). The iteration terminates when the grouping no longer changes, and the final clustering result is taken as the grouping of filters. The shuffling process is performed layer by layer to prepare the target CNN for subsequent row-weight vector pruning.
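A minimal NumPy/SciPy sketch of this balanced clustering (Eqs. 1 and 2), assuming the number of filters is divisible by the number of groups; scipy.optimize.linear_sum_assignment plays the role of the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def balanced_kmeans(masks, k, iters=50, seed=0):
    """Group binary filter masks into k equal-size clusters (sketch).
    masks: (num_filters, C*K*K) 0/1 array; num_filters must be divisible by k."""
    rng = np.random.default_rng(seed)
    n, _ = masks.shape
    g = n // k                                        # filters per group (array width)
    centroids = masks[rng.choice(n, k, replace=False)].astype(float)
    labels = None
    for _ in range(iters):
        dist = np.abs(masks[:, None, :] - centroids[None]).sum(-1)  # Eq. (2)
        cost = np.repeat(dist, g, axis=1)             # expand k groups into n slots
        rows, cols = linear_sum_assignment(cost)      # optimal balanced assignment
        new_labels = np.empty(n, dtype=int)
        new_labels[rows] = cols // g
        if labels is not None and (new_labels == labels).all():
            break                                     # grouping no longer changes
        labels = new_labels
        for x in range(k):                            # Eq. (1): recompute centroids
            centroids[x] = masks[labels == x].mean(axis=0)
    return labels
```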

3.2.2 Row-weight vector pruning

Based on the filter groups generated by the shuffling algorithm, the weight matrix is divided into row-weight vectors. FAShuffle-pruner then prunes the model at the granularity of row-weight vectors in the following stages:

(A) Select a certain percentage of unimportant row-weight vectors. Like FASplit-pruner, FAShuffle-pruner evaluates the importance of a row-weight vector by the sum of the absolute values of its weights (a scoring sketch follows this list);

(B) Prune the selected row-weight vectors;

(C) Retrain the model to recover accuracy;

(D) Repeat steps (A)–(C) until the model accuracy can no longer be maintained;

(E) Compress the pruned filters group by group and encode the sparse pattern.
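A minimal PyTorch sketch of steps (A) and (B) for one shuffled filter group; the prune_ratio parameter and function name are illustrative assumptions, and retraining (step C) is omitted:

```python
import torch

def prune_row_vectors(group_weights, prune_ratio=0.3):
    """Zero out the least important row-weight vectors of one filter group
    (sketch). group_weights: (array_w, weights_per_filter)."""
    scores = group_weights.abs().sum(dim=0)      # step (A): L1 score per row position
    n_prune = int(scores.numel() * prune_ratio)
    idx = scores.argsort()[:n_prune]             # least important positions
    pruned = group_weights.clone()
    pruned[:, idx] = 0.0                         # step (B): same positions in all filters
    return pruned
```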

Since FAShuffle-pruner removes several row-weight vectors and compresses the pruned filters, the corresponding inputs need to be skipped, requiring a finer-grained input skip/selection. As shown in Fig. 5, we design a data setup unit to match the inputs with their counterpart weights. Besides, shuffled filters generate shuffled outputs, which are mismatched with the computation of the next layer; the selector therefore sends each output directly to the corresponding accumulator to restore the outputs to their original order.

3.3 Combination and choice of pruning methods

Fig. 7 Implementation of RC-wise pruner and CR-wise pruner

Since FASplit-pruner and FAShuffle-pruner compress models in different dimensions (column-weight and row-weight), they can be combined. Executing the column-wise FASplit-pruner before the row-wise FAShuffle-pruner yields the CR-wise pruner; swapping the execution order yields the RC-wise pruner.

Figure 7 illustrates the implementation of the RC-wise and CR-wise pruners. As shown in Fig. 7a, for a specific layer, the RC-wise pruner first prunes the original model at the granularity of weights sharing the same position in all filters (row-weights) as a coarse-grained preprocessing step. Then, taking the average absolute weight value as the importance metric, FASplit-pruner is applied to further compress the CNN model.

As shown in Fig. 7b, the CR-wise pruner first applies regular filter-wise pruning to the original CNN to remove redundancy in the column-weight dimension. For better performance, the number of remaining filters in each layer should be a multiple of the systolic array width; in other words, a whole group of filters is pruned in each pruning-and-retraining iteration. FAShuffle-pruner is then applied to the pruned model for fine-grained pruning.
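A minimal sketch of this first CR-wise stage, again over flattened filters; the 50% target rate is an illustrative assumption:

```python
import torch

def cr_filter_stage(weight, array_w, target_rate=0.5):
    """Filter-wise pruning that keeps a multiple of array_w filters, so each
    inter-filter round still fills the array (sketch). Returns kept indices."""
    f = weight.shape[0]
    scores = weight.abs().sum(dim=1)              # L1 importance per filter
    keep = max(array_w, f - int(f * target_rate))
    keep -= keep % array_w                        # round down to a whole filter group
    return scores.argsort(descending=True)[:keep]
```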

Furthermore, the above pruning methods also apply to FC layers, whose weights naturally form a 2D matrix. Since the input of an FC layer is a 1D vector, each neuron calculates only one result. The difference between FASplit-pruner and FAShuffle-pruner in pruning dimension determines their application scenarios. For a CONV or FC layer, the number of filters/neurons and the number of weights per filter/neuron define the layer's size in the two pruning dimensions. We can therefore select a pruner by comparing these two quantities: when the number of filters/neurons is greater than the number of weights per filter/neuron, FASplit-pruner is suggested; otherwise, FAShuffle-pruner is suggested.
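The selection rule reduces to a single comparison per layer; the following sketch (with hypothetical naming) makes it concrete:

```python
import torch

def choose_pruner(weight):
    """Pick a pruner per layer from its weight tensor (sketch)."""
    num_out = weight.shape[0]              # filters (CONV) or neurons (FC)
    per_out = weight.numel() // num_out    # weights per filter/neuron
    return "FASplit-pruner" if num_out > per_out else "FAShuffle-pruner"

# e.g., a 512 x 256 FC weight has more neurons than weights per neuron:
print(choose_pruner(torch.empty(512, 256)))  # -> FASplit-pruner
```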

4 Experimental methodology

Datasets and models To demonstrate the effectiveness of FASS-Pruner, we conduct experiments on both small and large datasets (i.e., CIFAR-10 and ImageNet). We select two mainstream CNN backbones, VGG and ResNet. On CIFAR-10, we evaluate our method on VGG16 and ResNet56; on ImageNet, we adopt VGG16 and ResNet50. To accelerate training, simple data augmentation (random crop and random horizontal flip) and normalization are applied to the training sets of both datasets. For ImageNet, images are resized to \(256\times 256\) to regularize the input size.

Configurations Pruning of the above models is implemented in PyTorch on a 2.4 GHz Intel CPU with an Nvidia GeForce RTX 3090 graphics card. The learning rate for retraining is set to 0.001 for models on CIFAR-10 and 0.01 for models on ImageNet. The default size of the systolic array is \(64\times 64\). We adapt Scale-sim, a cycle-accurate architecture simulator for CNN accelerators, to estimate the relative execution time and memory access. The energy consumption is calculated with the metric used in Eyeriss.

Evaluation protocols We adopt the number of parameters (Params) and the required floating-point operations (FLOPs) to evaluate the pruned model size and computational requirements. The relative execution time (ReT) evaluates the speedup of pruned models on CNN accelerators, and the normalized energy consumption (Nec) represents the energy consumption. These four protocols are uniformly reported as relative change rates in the subsequent results.

5 Evaluation results and analysis

Table 1 Pruning results on CIFAR-10
Table 2 Pruning results on ImageNet

This section is arranged as follows: In Sect. 5.1, we compare FASplit-pruner and FAShuffle-pruner with state-of-the-art structured pruning methods, comprising two filter pruning methods (L1 and Hrank) and two channel pruning methods (CP and Slimming) (Han et al 2015b; Lin et al 2020; He et al 2017; Liu et al 2017). In Sect. 5.2, we compare the RC-wise and CR-wise pruners with three accelerator-customized pruning methods: AAP, BP, and DBSR (Chitty-Venkata and Somani 2020; Asgari et al 2019; teja Vooturi et al 2019).

5.1 FASS versus structured pruning

Table 3 Pruning result of VGG16 on CIFAR-10

To verify the effectiveness of FASplit-pruner and FAShuffle-pruner, we select several SOTA structured pruning methods for comparison. According to the similarity in pruning dimension, the filter pruning methods (L1 and Hrank) are compared with FASplit-pruner, and the channel pruning methods (CP and Slimming) with FAShuffle-pruner.

Tables 1 and 2 list the evaluation results on CIFAR-10 and ImageNet, respectively. Compared with L1 and Hrank, FASplit-pruner achieves significantly better parameter and FLOPs reductions (e.g., 66.0% vs. 27.1% and 36.9% in Params, and 63.1% vs. 31.0% and 33.3% in FLOPs on ResNet50), which demonstrates the superiority of fine-grained pruning. Moreover, FASplit-pruner usually obtains a larger performance improvement (70.3% vs. 31.6%) even under a parameter reduction similar to Hrank's on VGG16 (CIFAR-10) (88.6% vs. 82.9%), indicating that FASplit-pruner excels at performance improvement and is more practical for CNN accelerators. As shown in Table 1, both filter pruning methods and FASplit-pruner hardly compress ResNet56, and FASplit-pruner even performs worse than L1 and Hrank. Examining the structure of ResNet56, most CONV layers contain fewer than 64 filters, so they are compact and hard to compress by pruning filters.

Compared with CP and Slimming, FAShuffle-pruner yields better compression and acceleration on all test models, verifying the pruning flexibility gained from the fine pruning granularity. FAShuffle-pruner also obtains a more compact parameter count (50.8% vs. 46.3%) and faster acceleration (50.0% vs. 38.5%) than FASplit-pruner. Through analysis, we attribute FAShuffle-pruner's advantage over FASplit-pruner to two reasons: (1) FAShuffle-pruner is more flexible due to filter shuffling. (2) Mainstream CNN models increase the number of filters with depth to fully extract features, making the weight structure slender and increasing redundancy in the row-weight dimension, which suits FAShuffle-pruner. In addition, compared to the baseline approaches, the energy efficiency of FASplit-pruner and FAShuffle-pruner is improved by 1.27× and 2.29× on average, respectively, showing the effectiveness of FASS in energy saving.

5.2 FASS versus customizing pruning

We further validate the effectiveness of the RC-wise and CR-wise pruners in improving performance and energy efficiency by comparing them with three SOTA accelerator-customized pruning methods (AAP, BP, and DBSR). We conduct experiments with VGG16 on the CIFAR-10 dataset, aiming to simulate the common but critical scenario of large models on small datasets. The results are shown in Table 3. As can be seen, the models pruned by the RC-wise and CR-wise pruners are more efficient than those of the other accelerator-customized pruning methods. On average, the RC-wise and CR-wise pruners achieve 21.1% and 34.7% higher pruning rates and improve the inference performance of pruned models by 40.9% and 69.4% with 49.2% and 75.2% energy saving, respectively. All the accelerator-customized pruning methods have a larger impact on accuracy; even a small amount of pruning by BP (64.9%) causes an unacceptable accuracy decrease (3.79%). As discussed above, this is because these accelerator-customized pruning methods strictly constrain the sparse patterns of models, limiting their representation capability. By contrast, regardless of the combination order, the RC-wise and CR-wise pruners obtain significant compression and acceleration (89.3% and 90.5% in Params; 72.7% and 88.6% in ReT) with only a slight loss in accuracy (0.35% and 0.44%). This suggests that the RC-wise and CR-wise pruners preserve the flexibility of sparse patterns while compressing the model. The results also show that the CR-wise pruner outperforms the RC-wise pruner, demonstrating that the CR-wise pruner is more suitable for the mainstream slender weight structure. Moreover, compared with FASplit-pruner or FAShuffle-pruner alone, the two combinations lead to better results, even if the improvement is modest (e.g., 90.5% by CR-wise pruner vs. 89.1% by FAShuffle-pruner in Params of VGG16).

Overall, these experimental results have confirmed that FASS can effectively compress and accelerate model inference on CNN accelerators.

6 Conclusion

In this paper, we proposed FASS-Pruner, a fine-grained accelerator-aware pruning framework that effectively compresses CNNs while preserving the original dataflow of CNN accelerators through filter splitting and filter shuffling. We evaluated FASS-Pruner on mainstream CNN models by comparing it with state-of-the-art structured pruning methods and accelerator-customized pruning methods. The experimental results validate the effectiveness of FASS-Pruner, which achieves significant performance improvement and energy saving.