
1 Introduction

In the Internet of Things (IoT) realm, sensors and actuators integrate seamlessly with the environment [1], enabling cross-platform information flow for environmental metrics. The massive data generated by numerous connected devices offers convenience but also introduces high latency. However, applications such as vehicle-to-vehicle communication, which enhances traffic safety through automobile collaboration, require low latency and high security. Edge computing is a promising technology with the potential to improve the performance and security of IoT applications [2].

Even though chip giants are integrating more and more AI accelerators into their designs for IoT devices, the massive number of parameters and the huge amount of computation of Deep Neural Networks (DNNs) still lead to a poor user experience when DNNs are deployed on such devices [3]. To alleviate this problem, researchers have pursued many directions, which can be broadly categorized into two types: unstructured and structured pruning.

Unstructured pruning methods prune individual weights based on their own importance. For example, using the second-order derivatives of the error function, Optimal Brain Damage and Optimal Brain Surgeon remove unimportant weights from a trained network [4]. Deep Compression compresses neural networks by pruning unimportant connections, quantizing the network, and applying Huffman coding [5]. Using a Taylor expansion that approximates the change in the cost function, [6] pruned convolutional kernels to enable efficient inference and handled transfer learning tasks effectively. A major downside of unstructured methods is the sparse matrices and the associated indices left after pruning, which lead to complexity and inefficiency on hardware [7].

Structured methods prune weights in a predictable way. Li et al. [8] pruned unimportant filters by their \(L_1\) norm. Luo et al. [9] pruned filters based on statistics computed from the next layer rather than the current layer. He et al. [10] pruned channels by LASSO regression. Using the scaling factors of batch normalization layers, [11] removed unimportant channels. Lebedev and Lempitsky [12] revisited the idea of brain damage and extended it to a group-wise setting, obtaining sparsity in the resulting network. To the best of our knowledge, one recent study [13] proposed a stripe-wise pruning (SWP) method that introduces a filter skeleton (FS) to learn the shape of filters and then prunes stripes according to the corresponding values of the filter skeleton.

However, setting an absolute threshold sometimes fails to capture the relative importance of each stripe within a filter. To resolve this problem, we put forward a new method that uses the statistical properties of the weights located on each stripe to learn the relative importance of the stripes in a filter. The intuition behind this method comes from the stripe-wise convolution process and the properties of normal distributions.

Our principal contributions in this paper can be summarized as follows: (1) We propose a new method for determining which weights in a neural network can be pruned without sacrificing accuracy. Our pruned VGG16 achieves results comparable to the existing model, with a fourfold reduction in parameters and only a 0.4% decrease in accuracy. (2) The proposed method is based on sound theoretical principles, making it more trustworthy and easier to understand and apply. (3) The effectiveness of the proposed approach is tested on different neural network architectures (VGG16 and ResNet56) and evaluated on edge devices with limited computational resources.

The paper is arranged as follows: In Sect. 2, we present our method as well as the theoretical framework behind it. In Sect. 3, we explain the experimental details and demonstrate comparisons between our method and the original method. Additionally, we showcase the performance of our method deployed on edge devices. Finally, concluding remarks are provided in Sect. 4.

2 The Proposed Method

2.1 Stripe Wise Convolution

In the l-th convolutional layer, suppose the 4-D weight tensor W is of size \(\mathbb {R}^{N\times C\times K\times K}\), where N, C and K are the number of filters, the channel dimension and the kernel size, respectively.

Let \(x_{c,h,w}^{l}\) be one point of the feature map in the l-th layer and \(x_{n,h,w}^{l+1}\) be the convolution result in the \((l+1)\)-th layer. We modify the calculation order of the standard convolution in the stripe-wise way (1), as illustrated in Fig. 1a.

$$\begin{aligned} x_{n,h,w}^{l+1}= \sum _i^K\sum _j^K\left( \sum _c^Cw_{n,c,i,j}^{l}\times x_{c,h+i-\frac{K+1}{2},w+j-\frac{K+1}{2}}^{l}\right) \end{aligned}$$
(1)

Here \(x_{c,p,q}^l = 0\) when \(p<1\), \(p>M_H\), \(q<1\) or \(q>M_W\), where \(M_H\) and \(M_W\) are the height and width of the feature map, respectively.

From Fig. 1a, we can see that in stripe-wise convolution, the output of an individual filter is the sum of the convolution results of the stripes belonging to that filter. One intuition is that if the convolution result of stripe 1 is much smaller than that of stripe 2, stripe 1 can be pruned and stripe 2 kept, as shown in Fig. 1b. A sketch of this decomposition follows, and the next subsection proves the claim in a theoretical manner.
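As a concrete check of this decomposition, the following PyTorch sketch (ours, not the authors' released code; the tensor sizes and zero-padding choice are assumptions) zeroes out all but one stripe per filter, convolves, and sums the per-stripe outputs, reproducing the standard convolution of Eq. (1).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, C, K = 4, 3, 3             # filters, channels, kernel size
H, W = 8, 8                   # feature-map height and width
x = torch.randn(1, C, H, W)   # input feature map of layer l
w = torch.randn(N, C, K, K)   # weight tensor of layer l

# Standard convolution (zero padding, so output positions match Eq. (1)).
y_full = F.conv2d(x, w, padding=K // 2)

# Stripe-wise accumulation: keep only stripe (i, j) of every filter,
# convolve, and sum the per-stripe results.
y_stripes = torch.zeros_like(y_full)
for i in range(K):
    for j in range(K):
        w_stripe = torch.zeros_like(w)
        w_stripe[:, :, i, j] = w[:, :, i, j]
        y_stripes += F.conv2d(x, w_stripe, padding=K // 2)

print(torch.allclose(y_full, y_stripes, atol=1e-4))  # True
```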

Fig. 1
Stripe-wise convolution. a The filter's kernel size is 3; the convolution result of an individual filter is the sum of the convolution results of its stripes. b The single-filter case

2.2 Theoretical Analysis

Batch normalization (BN) is widely used in neural networks and makes DNN training faster and more stable [14]. In one filter, suppose B is a mini-batch of size m, i.e., \(B=\{a_1,\ldots a_m\}\). The BN layer performs the following transformation steps: \(\mu _B = \frac{1}{m} \sum _{i=1}^m a_i\), \(\sigma _B^2 = \frac{1}{m} \sum _{i=1}^m (a_i-\mu _B)^2\), \(\hat{a}_{i} = \frac{a_i-\mu _B}{\sqrt{\sigma _B^{2}+\epsilon }}\), \(x_i = \gamma \hat{a}_{i} +\beta \equiv \text {BN}_{\gamma ,\beta }(a_i)\), where \(\mu _B\) and \(\sigma _B\) are the empirical mean and standard deviation of B. To restore the representation ability of the network, the scale \(\gamma \) and shift \(\beta \) are learned during training.

After the transformation in the BN layer, the input feature map in the c-th channel of the l-th layer follows \(X^{l}_c\sim \mathcal {N}(\beta _c^{l}, (\gamma ^{l}_c)^2)\). When \(M_H\) is large, each entry \((X^{l}_c)_{i,j}\sim \mathcal {N}(\beta _c^{l}, (\gamma ^{l}_c)^2)\). From (1), we get \(X_{n}^{l+1}=\sum _i^K\sum _j^K(\sum _c^Cw_{n,c,i,j}^{l}\times (X^{l}_c)_{i,j})\).

Assuming all data are independently and identically distributed, the properties of the normal distribution [15] give \(X_{n}^{l+1}\sim \mathcal {N}(\mu _{n}^{l+1}, (\sigma _{n}^{l+1})^2)\), where \(\mu _{n}^{l+1}=\sum _i^K\sum _j^K(\sum _c^Cw_{n,c,i,j}^{l}\beta _c^{l})\) and \((\sigma _{n}^{l+1})^2=\sum _i^K\sum _j^K(\sum _c^C(w_{n,c,i,j}^{l})^2(\gamma ^{l}_c)^2)\).
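The following Monte Carlo sketch (illustrative only; the shapes, seed, and sample count are arbitrary choices of ours) checks these two expressions empirically by sampling i.i.d. normal inputs for one filter and comparing the empirical moments of the stripe-wise sum with the analytical ones.

```python
import torch

torch.manual_seed(0)
C, K = 3, 3
w = torch.randn(C, K, K)             # weights of a single filter n
beta = torch.randn(C)                # per-channel BN shift beta_c
gamma = torch.rand(C) + 0.5          # per-channel BN scale gamma_c

# Analytical moments from the expressions above.
mu = (w * beta.view(C, 1, 1)).sum()
var = (w ** 2 * (gamma ** 2).view(C, 1, 1)).sum()

# Empirical moments from sampled i.i.d. normal inputs.
n_samples = 200_000
x = beta.view(1, C, 1, 1) + gamma.view(1, C, 1, 1) * torch.randn(n_samples, C, K, K)
y = (w * x).sum(dim=(1, 2, 3))
print(mu.item(), y.mean().item())    # empirical mean close to mu
print(var.item(), y.var().item())    # empirical variance close to var
```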

To reduce the number of parameters \(w_{n,c,i,j}^{l}\) while keeping the value of \(\mu _{n}^{l+1}\) unchanged, we introduce an importance indicator \(Q_{n,i,j}^l\) for the convolution output of each stripe and obtain the following loss function.

$$\begin{aligned} L_n=\textrm{loss}\left( \mu _{n}^{l+1},\sum _i^K\sum _j^KQ_{n,i,j}^l\left( \sum _c^Cw_{n,c,i,j}^{l}\beta _c^{l}\right) \right) +\alpha g_n(Q) \end{aligned}$$
(2)

where \(g_n(Q)=\sum _i^K\sum _j^K\left| Q_{n,i,j}^l\right| \) and \(Q_{n,i,j}^l\in \{0,1\}\).

Let \(s_{n,i,j}^{l}\triangleq \sum _c^Cw_{n,c,i,j}^{l}\). If we assume \(\beta _{1}^{l}=\beta _{2}^{l}=\cdots =\beta _{C}^{l}=\beta ^{l}\), (2) can be written as \( L_n=\textrm{loss}(\beta ^{l}\sum _a^K\sum _b^Ks_{n,a,b}^{l},\beta ^{l}\sum _i^K\sum _j^KQ_{n,i,j}^ls_{n,i,j}^{l})+\alpha g_n(Q)\), which can be further written as

$$\begin{aligned} L_n=\textrm{loss}\left( 1,\sum _i^K\sum _j^KQ_{n,i,j}^lT_{n,i,j}^l\right) +\alpha ^{\prime } g_n(Q) \end{aligned}$$
(3)

where

$$\begin{aligned} T_{n,i,j}^l=\dfrac{s_{n,i,j}^{l}}{\sum _a^K\sum _b^Ks_{n,a,b}^{l}} \end{aligned}$$
(4)

Obviously, \(\sum _i^K\sum _j^KT_{n,i,j}^l=1\) and \(0\le T_{n,i,j}^l<1\). To minimize (3), we can set \(Q_{n,i,j}^l=0\) for those \(T_{n,i,j}^l\) close to 0, which means the corresponding stripes will be pruned. Thus \(T_{n,i,j}^l\) describes the relative importance of \(\text {stripe}_{i,j}\) in \(\text {filter}_n\): when \(T_{n,i,j}^l\rightarrow 1\), \(\text {stripe}_{i,j}\) contributes more than the other stripes; when \(T_{n,i,j}^l\rightarrow 0\), it contributes less and can be pruned.
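A minimal sketch of how \(T_{n,i,j}^l\) and the resulting stripe mask could be computed from a weight tensor is shown below. The function names and the use of random weights are our illustration rather than the released implementation; the default threshold 0.005 is the value reported in Sect. 3.

```python
import torch

def stripe_importance(w: torch.Tensor) -> torch.Tensor:
    """T[n, i, j] = s[n, i, j] / sum_{a, b} s[n, a, b], where s[n, i, j]
    is the sum of the weights over the channel dimension (Eq. (4))."""
    s = w.sum(dim=1)                              # (N, K, K)
    return s / s.sum(dim=(1, 2), keepdim=True)    # (N, K, K)

def stripe_mask(w: torch.Tensor, threshold: float = 0.005) -> torch.Tensor:
    """Q[n, i, j] = 0 for stripes whose T falls below the threshold
    (stripes with T close to 0 are pruned), 1 otherwise."""
    return (stripe_importance(w) > threshold).float()

# Illustration on random weights; in practice w comes from a trained,
# regularized network so that many T values are driven towards zero.
w = torch.randn(8, 16, 3, 3)                      # (N, C, K, K)
Q = stripe_mask(w)
print(f"{(1 - Q.mean()).item():.2%} of stripes would be pruned")
```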

Before setting a threshold on \(T_{n,i,j}^l\) to prune stripes, we need to impose regularization on the whole neural network to achieve sparsity. This avoids the so-called “Train, Prune, Fine-tune” pipeline. The regularized training loss is

$$\begin{aligned} L= \sum _{(x,y)}\textrm{loss}(f(x,W),y)+\alpha g(W) \end{aligned}$$
(5)

where \(\alpha \) adjusts the degree of regularization and g(W) is the \(L_1\)-norm penalty on W, written as \(g(W)=\sum _{l=1}^L(\sum _{n=1}^N\sum _{c=1}^C\sum _{i=1}^K\sum _{j=1}^K\left| W_{n,c,i,j}^l\right| )\).

To avoid using sub-gradients at the non-smooth point, we deploy the smooth-\(L_1\) penalty [16] instead of the \(L_1\) penalty.
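A hedged sketch of such a smooth-\(L_1\) penalty follows; the quadratic-near-zero (Huber-style) form and the smoothing parameter `delta` are our assumptions about how the non-smooth point could be handled, not necessarily the exact formulation of [16].

```python
import torch

def smooth_l1_penalty(w: torch.Tensor, delta: float = 1e-3) -> torch.Tensor:
    """Quadratic within |w| < delta (smooth at w = 0, so no sub-gradient is
    needed) and linear with unit slope outside, matching |w| up to a constant."""
    absw = w.abs()
    quadratic = 0.5 * w ** 2 / delta
    linear = absw - 0.5 * delta
    return torch.where(absw < delta, quadratic, linear).sum()
```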

3 Experiments

In order to assess the performance of the proposed model and confirm its effectiveness, we carry out experiments on the CIFAR-10 dataset. Our method is implemented with the publicly available Torch framework.

Dataset and Model: CIFAR-10 [17] is one of the most popular image collections. It contains 60K color images from 10 classes; the training and testing sets contain 50K and 10K images, respectively. On CIFAR-10, we evaluate the proposed method mainly on VGG16 [18] and ResNet56 [19]. The inference time refers to the total time needed to classify 3270 image patches of size \(224\times 224\).

Baseline Setting: We train the models with a mini-batch size of 64 for 100 epochs. The initial learning rate is set to 0.1 and is divided by 10 at epoch 50. Random crop and random horizontal flip are used as data augmentation for the training images. Each image is scaled to \(256 \times 256\); a \(224 \times 224\) patch is then randomly cropped from the scaled image for training. For testing, the center \(224 \times 224\) crop is used.

Experiment Environment: An NVIDIA 1080 Ti and an Intel Core i5-8500B are selected as two computing platforms representative of the server and the edge device, respectively. The first is a GPU with high computational ability that, however, needs to communicate with sensors and actuators. The second is a CPU that represents the restricted computing power of an edge device.
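For reference, a minimal PyTorch sketch of this baseline configuration is given below; the SGD momentum value, the plain torchvision VGG16 definition, and the data directory are assumptions of ours, since the text only fixes the batch size, number of epochs, learning-rate schedule, crops, and flips.

```python
import torch
import torchvision
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
test_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

train_set = torchvision.datasets.CIFAR10("data", train=True, download=True,
                                         transform=train_tf)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

model = torchvision.models.vgg16(num_classes=10)   # stand-in for the backbone
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Initial learning rate 0.1, divided by 10 at epoch 50, 100 epochs in total.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50], gamma=0.1)
```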

3.1 Comparing with the Original SWP

To compare our method with the original SWP, we revisit the concept of the filter skeleton (FS) from [13]. Each value in the FS corresponds to a stripe in a filter. During training, the filters’ weights are multiplied by the FS. With I representing the FS, the stripe-wise convolution can be written as

$$\begin{aligned} x_{n,h,w}^{l+1}= \sum _i^K\sum _j^KI_{n,i,j}^l\left( \sum _c^Cw_{n,c,i,j}^{l}\times x_{c,h+i-\frac{K+1}{2},w+j-\frac{K+1}{2}}^{l}\right) \end{aligned}$$
(6)

where \(I_{n,i,j}^l\) is initialized with 1.

The regularization on the FS will be

$$\begin{aligned} L= \sum _{(x,y)}\textrm{loss}(f(x,W\odot I),y)+\alpha g(I) \end{aligned}$$
(7)

where \(\odot \) denotes the element-wise product and \(\alpha \) adjusts the degree of regularization. g(I) is written as \(g(I)=\sum _{l=1}^L(\sum _{n=1}^N\sum _{i=1}^K\sum _{j=1}^K\left| I_{n,i,j}^l\right| )\).
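A minimal reading of this objective as code is sketched below (ours, not the released SWP implementation); the cross-entropy data term and the broadcasting of I over the channel dimension are assumptions consistent with Eqs. (6) and (7).

```python
import torch
import torch.nn.functional as F

def swp_objective(logits, target, skeletons, alpha=1e-5):
    """Data term plus alpha * sum_l ||I^l||_1 over all filter-skeleton
    tensors, i.e. Eq. (7) with a cross-entropy loss as the data term."""
    data_loss = F.cross_entropy(logits, target)
    reg = sum(I.abs().sum() for I in skeletons)        # g(I)
    return data_loss + alpha * reg

# Inside layer l's forward pass, the effective weight would be the
# element-wise product of Eq. (6), broadcasting I (N, K, K) over channels:
#   w_eff = w * I[:, None, :, :]
# with I initialized to torch.ones(N, K, K) and learned jointly with w.
```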

For the comparison on CIFAR-10 in Table 1, both the original method and our method use the FS to train and prune the whole neural network, and both use the regularization coefficient \(\alpha \), set to \(1e-5\) and \(5e-5\). The difference is that the original method prunes based on the FS value corresponding to a stripe, whereas our method prunes based on \(T_{n,i,j}^l\), which combines the weights located in a stripe. Regarding the choice of T, we used the value corresponding to the highest accuracy.

Table 1 Comparison with the original SWP on CIFAR-10

From the table, we find that both methods reduce the number of parameters and the amount of computation (FLOPs) considerably without losing network performance. When the backbone is VGG16 and \(\alpha =1e-5\), the number of parameters and the amount of computation of our method are larger than those of the original approach. This is because our method keeps at least one stripe in each filter, while the original approach might prune a whole filter. However, when \(\alpha =5e-5\), the original approach does not converge, whereas our method reaches a high compression rate in both the number of parameters and the amount of computation. Our pruned VGG16 achieves a 95% reduction in memory demands.

When the backbone is ResNet56, we present our result for \(\alpha =5e-5\). Compared with the original approach’s result for \(\alpha =1e-5\), our method achieves a large reduction in the number of parameters and the amount of computation while sacrificing a bit of accuracy. Our pruned ResNet56 achieves a 75% reduction in memory demands.

Our method has two decisive hyper-parameters: the regularization coefficient \(\alpha \) in (7) and the weight combination threshold T in (4). Table 2 shows the effects of these hyper-parameters on the pruning results. It can be seen that \(\alpha =5e-5\) and \(T = 0.005\) yield an acceptable pruning ratio as well as test accuracy.

Table 2 Different regularization coefficients \(\alpha \) and weight combination thresholds T

3.2 Edge Device Performance

We further verify our approach on an edge device. Pruning is executed on the server, as training consumes computing resources to learn the importance of the stripes and requires several complete passes of the training dataset through the whole neural network. The pruned networks are then deployed on the two computing platforms to obtain test results and inference times. The comparison is shown in Fig. 2a, b. It should be noted that stripe-wise convolution is not yet optimized in CUDA. As the percentage of pruned parameters increases, the decline in inference time on the server is not very pronounced. However, the inference time on the edge device drops by half when 75–95% of the parameters are pruned.
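A simple timing harness in the spirit of this measurement is sketched below (our illustration, not the authors' benchmark script); the batch size, warm-up policy, and the model name `pruned_vgg16` are assumptions, and CUDA synchronization is included so that GPU timings are not under-reported.

```python
import time
import torch

@torch.no_grad()
def total_inference_time(model, device, n_patches=3270, batch_size=30):
    """Total time to classify n_patches 224x224 patches on `device`
    (batch_size=30 divides 3270 exactly)."""
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, 224, 224, device=device)
    if device.type == "cuda":
        model(x)                      # warm-up pass
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_patches // batch_size):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()      # wait for queued GPU work to finish
    return time.perf_counter() - start

# Example usage (hypothetical model object):
# t_gpu = total_inference_time(pruned_vgg16, torch.device("cuda"))
# t_cpu = total_inference_time(pruned_vgg16, torch.device("cpu"))
```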

Fig. 2
Required inference time for pruned models versus the percentage of parameters pruned, on the server and on the edge device. a Pruned VGG16. b Pruned ResNet56

4 Conclusion

In this work, we avoid using an absolute threshold in existing stripe-wise pruning by combining the weights located on each stripe. This allows us to learn the relative importance of the stripes in a filter and remove those with low importance. Our pruning method effectively reduces the parameters and inference time of the VGG16 model without significantly impacting accuracy. In future work, we will explore introducing regularizers to prune filters with single stripes, which may further compress deep neural networks and improve performance.