1 Introduction

We have entered the big data era. Big data provides great opportunities for both academia and industry: it is a reality that no one can ignore, and it shapes our environment whenever we need to make decisions [1]. The notion of big data covers both data science and applications such as business data mining or artificial intelligence [2, 3], and big data analysis, spanning data science and applications, is a subset of big data [4]. Facing the challenges of big data, intelligence remains its core, and deep learning is a focus within it [5].

Deep learning is a branch of machine learning. It automatically extracts data features layer by layer through a multi-layer network structure, and uses parameters and network structure design to give the network the ability to fit arbitrarily complex nonlinear mappings. So far, deep learning has made important progress in pattern recognition [6], computer vision [7], speech recognition and natural language processing [8], etc. Research on deep learning is in the ascendant, and the major technology companies are competing to occupy the commanding heights of deep learning technology [9]. The excellent performance of deep learning is inseparable from its powerful representation ability, which relies on the multi-layer structure and a large number of parameters. With the growth of data, the demand for higher accuracy and the increasing complexity of learning tasks, the number of network layers and the number of parameters keep increasing. Taking the famous ImageNet image recognition contest as an example, the champion algorithm AlexNet [7] in 2012 used only five convolutional layers and three fully connected layers, requiring more than 60 million parameters, while the champion algorithm ResNet [10] in 2015 achieved 96.43% accuracy with a 152-layer deep neural network, surpassing human visual ability (94.9%) in the field of visual recognition. In language modeling, BERT [11], proposed by Google in 2019, has more than 300 million parameters, and GPT-3, proposed by OpenAI in 2020, has 175 billion parameters. Despite the rapid development of deep learning and the great progress of state-of-the-art algorithms in a series of fields, deep networks are still restricted by time and space in practical applications. They involve a large amount of computation, and even with graphics processing unit (GPU) acceleration they often cannot meet the time requirements of many application scenarios. In addition, large-scale models occupy a lot of memory, which is not suitable for mobile devices such as mobile phones. Therefore, how to compress deep neural networks and improve the efficiency of deep learning is a key problem that must be solved.

There are many kinds of methods to compress networks, among which inducing sparsity is the main one. Iterative pruning [12] achieves sparsity by iteratively removing noncritical connections. The work [13] makes neurons sparse by employing sparsity-inducing priors on hidden units. The Structured Sparsity Learning (SSL) method [14] directly learns a compressed structure of DNNs via group Lasso regularization during training. The work [15] studies Variational Dropout [16] in the case where each weight of a model has its individual dropout rate and proposes Sparse Variational Dropout, which extends Variational Dropout to all possible values of dropout rates and leads to a sparse solution. The work [17] proposes a solution through a collection of non-negative stochastic gates for \(\ell _0\) norm regularization of neural networks. Many sparsity-based compression methods have been proposed; we classify them into three categories: pruning, dropout and sparse regularization based compression. Among these, this article focuses on sparse regularization based compression, for which both highly optimized implementations and approximate methods have been investigated [18]. Sparse regularization based compression has many advantages. On the one hand, it does not require many assumptions and is easy to apply. On the other hand, sparse regularization has a solid theoretical basis in statistical learning: it transforms maximum likelihood estimation into maximum a posteriori estimation and introduces a reasonable prior when data are insufficient, which can improve the model. We present the details of sparse regularization based compression, including the general model, sparsity-inducing regularizers and optimization methods, in the following sections.

The rest of our paper is organized as follows. In Sect. 2, we review the literature on network compression from a global perspective. In Sect. 3, we present the details of sparse regularization based compression, especially sparsity-inducing regularizers. In Sect. 4, commonly used optimization methods for sparse regularization based compression are introduced.

2 Compression for DNNs

Recently, deep neural networks have become the dominant models for a wide variety of tasks, including but not limited to pattern recognition [6], computer vision [7], speech recognition and natural language processing [8]. To deal with difficult tasks, one habitual strategy is to deepen the network to handle more complex data or to improve performance, which results in a large number of parameters. Such networks need a lot of storage space and computing resources, so they cannot be applied on devices with limited resources or real-time requirements. Therefore, the compression and acceleration of neural networks is a very important research direction. Compression requires joint knowledge from multiple disciplines, such as machine learning, optimization, computer architecture, signal processing and hardware design. The existing neural network compression methods can be divided into four categories: low-rank decomposition, compact convolution filter design, knowledge distillation and compression from a parametric perspective (which includes quantization and sparse-inducing methods). We review the literature of these categories in the following subsections (Fig. 1).

Fig. 1 Network compression

2.1 Low-Rank Decomposition

The low rank decomposition method mainly uses the decomposition of matrices or tensors to estimate the meaningful parameters in a neural network. In deep convolutional neural networks, most of the computation comes from convolution operations. For example, if a convolutional network has L convolution layers, the number of filters in layer l is \(n_l \), the size of the filter is \(s_l \) and the size of the output feature map is \(m_l \), then the total computational complexity of all the convolution layers can be denoted as \(O(\sum _{l=1}^L n_{l-1}\cdot s_l^2\cdot n_l\cdot m_l^2)\) [19]. Therefore, reducing the computation of the convolution layers speeds up the overall computation of the network. A convolution kernel can be regarded as a 4-dimensional tensor, which usually has great redundancy; tensor decomposition is an effective way to remove this redundancy. The fully connected layer can be regarded as a two-dimensional matrix, and matrix decomposition is also effective for it.
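As a simple illustration of matrix decomposition applied to a fully connected layer, the NumPy sketch below approximates a weight matrix by a truncated SVD, storing \(r(m+n)\) values instead of mn. The rank and layer sizes are illustrative, and this is a generic example under those assumptions rather than the procedure of any particular surveyed work.

```python
import numpy as np

def truncated_svd_compress(W, rank):
    """Approximate an m x n weight matrix W by a rank-r factorization
    W ~= U_r @ V_r, so only r*(m+n) values need to be stored."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]   # fold the singular values into the left factor
    V_r = Vt[:rank, :]
    return U_r, V_r

# toy example: a 512 x 1024 fully connected layer compressed to rank 32
W = np.random.randn(512, 1024)
U_r, V_r = truncated_svd_compress(W, rank=32)
print(np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W))  # relative approximation error
```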

Based on this methodology, many decomposition techniques for matrices and tensors have been applied to compress and accelerate neural networks, including SVD [20,21,22], Canonical Polyadic Decomposition [23], Tucker Decomposition [24], Block Component Decomposition [25, 26] and many other tensor decompositions [27, 28]. Sainath et al. [29] decompose the last weight layer into low rank matrices. Denton et al. [20] approximate tensors in convolutional layers by SVD (singular value decomposition). Jaderberg et al. [30] exploit cross-channel or filter redundancy and compress filters by constructing rank-1 filters. Ioannou et al. [31] train base filters and combine these base filters into complex ones. Wen et al. [32] compress CNNs using a low-rank regularization. Yu et al. [33] approximate the weight matrix as the sum of a low rank matrix and a sparse matrix.

The low rank decomposition based method is a very direct way to achieve model compression and acceleration. However, the decomposition itself requires a lot of computation and is not simple to implement, and the model needs fine-tuning and retraining to converge. In addition, it cannot perform global parameter compression, since different layers carry different information. Finally, it does not work well for networks with small convolution kernels.

2.2 Compact Filter Design

The compact filter design approach saves parameters by designing structured convolution filters, so as to economize storage. Taco [34] points out that convolution layers can be used effectively in deep networks because these layers are translation invariant. Based on translation invariance, some algorithms design specific transformations and apply them to layers or filters to compress the network. Zhai et al. [35] define the transformation as a set of transform functions applied to 2-dimensional filters. Li et al. [36] introduce a multi-bias non-linear activation to generate more patterns through the transformation without adding too much computation. Shang et al. [37] use a new activation called CReLU. The CReLU function keeps both the positive and negative linear responses after convolution, so that each filter can effectively represent its direction. Sander et al. [38] propose four operations to build rotation-invariant neural networks: slicing, pooling, rolling and stacking. However, this kind of method applies mainly to wide network structures such as VGGNet and cannot achieve comparable performance on narrow or special networks such as GoogLeNet and ResNet. At the same time, the assumptions of these methods make the results on some data sets unstable [39]. Therefore, some studies [40, 41] design compact convolution filters to replace loose filters with redundant parameters, which directly reduces the computational cost and improves the computational speed. Christian et al. [40] replace each \(3 \times 3 \) filter with two \(1\times 1\) filters. Wu et al. [41] propose SqueezeDet with two \(1\times 1\) filters.

2.3 Knowledge Distillation

The basic idea of knowledge distillation is to extract knowledge from a large network, called the teacher network, and to learn a small network, called the student network, so that the student network acquires the knowledge extracted from the teacher network. The seminal work [42] compresses networks by knowledge transfer, but the method can only be applied to shallow networks. Jimmy et al. [43] prove that a shallow network can also learn the complex functions learned by a deep network and obtain accuracy previously achievable only by deep networks; it adopts the idea of knowledge transfer to mimic deep networks with shallower ones, which is referred to as knowledge distillation. Hinton et al. [44] introduce a student-teacher paradigm as a knowledge distillation compression framework. The soft labels corresponding to the training samples are obtained from the teacher network and, combined with the real labels, are used to compute the loss function of the student network. Romero et al. [45] divide the training process into two stages: in the first stage, the intermediate-layer information of the teacher network is used to train the parameters of the student network; in the second stage, the final output of the teacher network is used to train all the parameters of the student network, yielding a deeper but narrower network. Luo et al. [46] compress models for face recognition tasks, using the characteristics of the data sets and exploiting the network output of the layer before the softmax as information to select neurons. Chen et al. [47] first train a small network and then transfer the knowledge to a larger network, using the trained weights of the small network to accelerate the training of the large network. Zagoruyko et al. [48] improve FitNets and merge an attention mechanism to transfer attention maps as knowledge from the teacher model to the student model. Yim et al. [49] represent the data flow relationship between two network layers as a flow of solution procedure (FSP) matrix and transfer it as knowledge; the knowledge is added to the optimization process to reduce the FSP difference between corresponding layers of the teacher and student networks. For object detection, Chen et al. [50] use FitNets combined with a new loss function to extract knowledge while dealing with sample imbalance and other problems. Huang et al. [51] utilize the consistency between the activation distributions of the student network and the teacher network to guide the training of the student network.
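A minimal PyTorch sketch of the soft-label loss in the student-teacher paradigm of [44] is given below. The temperature T, the weight alpha and the T² scaling follow the commonly used formulation, and the concrete values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Combine the soft-label loss (from the teacher) with the hard-label loss."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_probs = F.log_softmax(student_logits / T, dim=1)
    # KL divergence between the softened teacher and student distributions,
    # scaled by T^2 as in the usual distillation formulation
    soft_loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```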

2.4 Quantization

The model quantization method compresses the original network by reducing the number of bits required to represent each weight. Network parameters are generally represented by 32-bit floating-point numbers, but in fact it is not necessary to retain such a high bit-width; the bit-width required by the stochastic gradient descent method is only 6 to 8 bits. Quantized neural networks reduce the scale of the network and speed up training and inference by using a small number of bits, such as 8 bits or even 1 bit. The calculations in a quantized neural network are usually bit-wise operations performed by the arithmetic logic unit (ALU), which consumes much less energy than the floating-point unit (FPU). Therefore, for power-constrained mobile devices such as mobile phones, quantized neural networks are preferable to full precision networks.

In neural networks, there are three components that can be quantized: weights, activations and gradients. By quantizing the weights and activations, a low-bit network model can be obtained. In a distributed training environment, the communication cost can be reduced by quantizing the gradients. Generally speaking, quantizing the gradients is more difficult than quantizing the weights and activations, because a high-precision gradient is required for the optimization algorithm to converge.

Gong et al. [52] propose vector quantization to quantize neural networks. The basic idea of vector quantization is to cluster the weights into several groups and then use the cluster center to replace the other weights in each group. In this way, only indices and cluster centers need to be stored, which reduces the memory requirement. However, the accuracy loss caused by k-means clustering cannot be controlled, and compression ratio constraints cannot be imposed either. To solve these problems, Choi et al. [53] propose a k-means clustering method weighted by the Hessian matrix. The basic idea is to use the Hessian to measure the performance degradation caused by weight quantization; in this way, the weights that are important for the network performance are prevented from deviating excessively from their full-precision values. Wu et al. [54] also adopt k-means clustering to quantize the filters and the weights in fully connected layers. The responses of the convolution layer and the fully connected layer are estimated by inner product approximation, and performance is improved by reducing the estimation error of the responses of each layer during quantization. Park et al. [55] propose a weight clustering method based on weight entropy, in which there are more clusters around the important weights, so that a more automatic and flexible multi-bit quantization can be realized. These methods represent the parameters in a cluster by the cluster center, so as to reduce the required storage. However, one disadvantage of quantization using k-means clustering is that the computation is very expensive due to the large number of weights in deep neural networks.
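As a rough sketch of the vector quantization idea of [52], the snippet below clusters the weights of one layer with k-means (using scikit-learn, an assumption of this illustration) and replaces each weight by its cluster center, so that only the codebook and the index map need to be stored.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_quantize(weights, n_clusters=16):
    """Cluster the weights of one layer and replace each weight by its cluster
    center; only the centers (codebook) and the index map need to be stored."""
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(flat)
    centers = km.cluster_centers_.ravel()   # codebook of n_clusters values
    indices = km.labels_                    # per-weight index (log2(n_clusters) bits each)
    quantized = centers[indices].reshape(weights.shape)
    return quantized, centers, indices
```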

Some methods focus on reducing the number of bits used to represent each parameter, so as to reduce the memory requirements and computation of the model. Vanhoucke et al. [56] employ 8-bit quantization for the activations and limit the activations to the range of 0 to 1 by a sigmoid function. Gupta et al. [57] use a 16-bit representation for convolutional neural network training based on stochastic rounding, which significantly reduces the memory requirements and floating-point operations. Micikevicius et al. [58] still store the weights as 32 bits but truncate them to 16 bits for computation during training. Wu et al. [59] propose a heuristic rounding function to quantize full-precision floating-point weights into k-bit integers. Seide et al. [60] propose a subgradient representation with 1-bit quantization to speed up the computation. The extreme case of 1-bit quantization of weights is weight binarization. Courbariaux et al. [61] propose BinaryConnect, which trains the network with binary weights during forward and back propagation. In the forward propagation stage, the algorithm quantizes the real weights through the sign function and uses the quantized weights to generate the output. In the back propagation stage, the sign function can no longer be used to propagate the loss backwards, because its gradient is almost zero everywhere; the usual approach is to estimate the gradient with a heuristic method. Courbariaux et al. [62] further propose BinaryNet, based on BinaryConnect, which quantizes the network weights and limits them to \(-1 \) or \(+1 \). Storm et al. [63] propose a threshold quantization method for the gradient: with a fixed threshold selected in advance, gradients above the threshold are quantized to \(+ 1 \) and gradients below the threshold are quantized to 0. Rastegari et al. [64] propose the Binary-Weight-Network (BWN) and the XNOR-Network. BWN only binarizes the network weights, while the XNOR-Network binarizes both the network weights and the network inputs, and XNOR has a significantly larger acceleration effect than BWN. Hou et al. [65] consider the influence of binarization on the loss function and propose a loss-aware binarization. Kim et al. [66] propose a two-step binarization: in the first step, the weights are compressed to the range of \(-1\) to 1; in the second step, the compressed weights are used to initialize the parameters of a binary network. Hu et al. [67] train binary weight networks via hashing. Lin et al. [68] believe that a binary representation is not expressive enough to represent the weights of neural networks comprehensively and propose ternary weight networks that add the value 0. Li et al. [69] propose ternary weight networks with values \(-1\), 0 and 1. Zhu et al. [70] introduce two full-precision scaling coefficients for each layer and quantize the weights to one of the two coefficients or 0, rather than to \(+1 \), \(-1\) and 0. Mellempudi et al. [71] utilize multiple scaling factors to account for the asymmetry between positive and negative weights in ternary networks. Deng et al. [72] propose gated XNOR networks, which train deep neural networks with ternary weights and activations; a calculation trigger determined by the weights and activations turns on the computation as a control signal or control gate in the network. Wang et al. [73] propose a semidiscrete decomposition of the weight matrix.
It decomposes \(W\in {\mathbb {R}}^{m\times n}\) into the multiplication of \(X\in \{-1,0,+1\}^{m\times k}\), \(Y\in \{-1,0,+1\}^{n\times k}\) and a non-negative diagonal matrix \(D\in {\mathbb {R}}_+^{k\times k}\). By choosing a different k, a trade-off can be achieved between compression ratio and performance loss. Wen et al. [74] introduce the TernGrad method, which quantizes the gradient into \(\{-1,0,1 \} \).
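The following PyTorch sketch illustrates the BinaryConnect-style scheme [61] described above: the sign function binarizes the weights in the forward pass, and a heuristic straight-through estimator passes the gradient in the backward pass. Clipping the gradient to the region \(|w|\le 1\) is one common heuristic, assumed here for illustration rather than taken verbatim from the original work.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Binarize with sign() in the forward pass; pass the gradient straight
    through (zeroed where |w| > 1) in the backward pass."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # heuristic straight-through estimator
        return grad_output * (w.abs() <= 1).float()

def binary_linear(x, w_real, bias=None):
    w_bin = BinarizeSTE.apply(w_real)   # binary weights used in the forward pass
    return torch.nn.functional.linear(x, w_bin, bias)
```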

In addition, some other quantization methods have been proposed. Muller et al. [75] draw on ideas from integer programming and propose a stochastic rounding function that maps a real value to the nearest or second nearest discrete point based on the distance to the corresponding point. Polino et al. [76] propose a uniform quantization: given a parameter s in advance, uniform quantization generates \(s+1\) equally spaced points between 0 and 1 and quantizes each real value to the nearest quantization point. Koster et al. [77] quantize networks during training, share the exponent bits of the stored data and turn the floating-point operations in training into integer operations on the mantissa, thereby accelerating network training. Cai et al. [78] propose a low-precision ReLU via half-wave Gaussian quantization. Mishra et al. [79] propose wide reduced-precision networks (WRPN) to quantize the activations and weights. This work points out that the activations actually occupy more memory than the weights, so it adopts the strategy of increasing the number of filters in each layer to compensate for the accuracy drop caused by quantization.
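As a small illustration of the uniform quantizer described for [76] (\(s+1\) equally spaced levels on [0, 1]), optionally combined with stochastic rounding in the spirit of [57], consider the following PyTorch sketch; clipping the input to [0, 1] is an assumption made for simplicity.

```python
import torch

def uniform_quantize(x, s=255, stochastic=False):
    """Quantize values in [0, 1] onto s + 1 equally spaced levels."""
    x = x.clamp(0.0, 1.0) * s
    if stochastic:
        # round up with probability equal to the fractional part
        x = torch.floor(x + torch.rand_like(x))
    else:
        x = torch.round(x)
    return x / s
```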

2.5 Sparse-Inducing Methods

The sparse-inducing methods seek sparsity in dense networks. They train a sparse network with a smaller number of weights, connections, filters or neurons compared with the original network to reduce the memory demand and computation. Denil et al. [80] point out that most parameters in neural networks are redundant; therefore, it is feasible to train sparse networks with comparable performance. Neural networks with a sparse structure and fewer connections need less memory to store the weights, and the compressed networks require less computation when applied to new data. In this subsection, the existing network sparsification methods are divided into three categories: pruning, dropout and sparse optimization methods.

2.5.1 Pruning

The main idea of pruning is to remove, during network training, those unimportant connections that have little impact on the final output of the network. Assume the neural network architecture is \(f(x,\cdot )\), where x is the input sample. Such an architecture includes the parameters of the network and the operators it uses, such as convolution, pooling, batch normalization, etc. For a specific parameter W, the neural network is denoted as f(x, W). Pruning a neural network actually takes f(x, W) as input and generates a new model \(f(x,{\hat{W}})\) as output, where \({\hat{W}}\) is the parameter set after pruning. A pruning operation can be denoted as \(M\odot W\), where \(M \in \{0,1\}^{|W|}\) is a binary mask that sets some specific parameters to 0 and \(\odot \) is element-wise multiplication. In practice, instead of using a mask, the parameter is directly set to 0 or removed. Generally speaking, pruning methods follow the framework below:

  1. Initialize the parameters W, set \(M \in \{1\}^{|W|}\), and train f(x, W) to convergence.

  2. According to an evaluation criterion \(S(\cdot )\), score the existing parameter set and update the pruning mask based on the current score and pruning rule, that is, \(M\leftarrow P(M,S(W)) \).

  3. Fine-tune the network \(f(x,M\odot W)\).
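As a minimal illustration of this framework with the magnitude criterion \(S(W)=|W|\) used in [86], the following PyTorch sketch scores all weight tensors globally, builds the binary masks M and applies \(M\odot W\). The global percentile threshold and the choice to skip one-dimensional parameters (biases, normalization scales) are illustrative assumptions.

```python
import torch

def magnitude_prune(model, sparsity=0.9):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude
    (one global threshold over all weight tensors) and return the binary masks."""
    all_weights = torch.cat([p.detach().abs().flatten()
                             for p in model.parameters() if p.dim() > 1])
    threshold = torch.quantile(all_weights, sparsity)
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:                        # skip biases and normalization scales
            mask = (p.detach().abs() > threshold).float()
            p.data.mul_(mask)                  # apply M ⊙ W in place
            masks[name] = mask
    return masks

# after pruning, the network f(x, M ⊙ W) is fine-tuned while keeping the masks fixed
```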

Several early pruning methods were proposed in the last century. Biased weight decay [81] is an early work on pruning: a magnitude-based method that applies a weight decay related to the absolute value of each hidden unit in the network to reduce the number of hidden units. Subsequently, optimal brain damage (OBD) [82] and optimal brain surgeon (OBS) [83, 84] take the Hessian matrix of the loss function into consideration. These two methods achieve higher accuracy than biased weight decay. However, the complexity of calculating the Hessian matrix is \(O(|W|^2)\), which is not feasible for current networks with far more parameters and neurons.

In recent years, pruning methods have regained attention, and many works prune redundant network connections that do not provide useful information for the network output. Srinivas et al. [85] observe that similar neurons are redundant: they calculate a saliency value between two neurons to represent their similarity and remove one of the neurons in the pair with the smallest saliency. Han et al. [86] consider weights whose values are below a certain threshold to be redundant and remove them from the network to obtain a sparse network; the resulting network is then retrained and the weights are fine-tuned to ensure that performance is not affected by the removal of some connections. They also combine this pruning method with quantization and Huffman coding to compress the networks [87]. Molchanov et al. [88] propose a pruning criterion based on Taylor expansion, which approximates the change in loss caused by pruning through a first-order Taylor expansion. Anwar et al. [89] introduce the particle filter method into structural pruning: the importance of each particle is determined by its misclassification rate, and particles with lower importance are removed to obtain structurally pruned networks. It is also becoming increasingly popular to set different pruning criteria at the neuron level. Narang et al. [90] propose pruning with a monotonically increasing threshold, where the sparsity rate can be controlled by setting the shape of the threshold function. Lin et al. [91] propose a global and dynamic pruning strategy: the model uses a global discriminant function established from prior knowledge to globally prune insignificant filters, then dynamically updates the saliency values of the filters on the pruned sparse network, recovers suitable filters and retrains the model to improve its accuracy. Molchanov et al. [92] propose an adaptable structural pruning method that does not require per-layer sensitivity analysis.

2.5.2 Dropout Methods

Dropout methods randomly discard some neurons and connections of the network during training. It is worth noting that dropout was proposed not only as a sparsification technique for neural networks, but also as a regularization to reduce overfitting during network training.

Dropout was first proposed by Hinton et al. [93]. In each training pass, each hidden neuron is randomly removed from the network with a probability of 0.5, so that the remaining neurons cannot rely on the removed hidden neurons. Dropout can also be regarded as training a large number of different networks in a reasonable time: there is a different network for each presentation of each training case. Extensive stochastic methods are inspired by this original dropout, and we refer to them as dropout methods in general. DropConnect [94] extends Dropout [93] to the connections of the fully connected layers. Adaptive dropout [95] substitutes an adaptive rate for the constant dropout rate of Dropout [93]. Srivastava et al. [96] analyse dropout comprehensively and find that dropout promotes a sparse distribution of weights. Kingma et al. [16] propose variational dropout, which sets an independent inactivation rate for each layer, each neuron and even each weight. Molchanov et al. [15] extend variational dropout [16] and build a bridge between pruning and dropout via sparse distributions. Poernomo et al. [97] propose Biased Dropout and Crossmap Dropout for hidden layers and convolutional layers.
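The following PyTorch sketch shows the basic (inverted) dropout mechanism described above: each activation is dropped with probability p at training time and the survivors are rescaled, so no change is needed at test time. This is a generic illustration, not the implementation of any particular surveyed variant.

```python
import torch

def dropout(x, p=0.5, training=True):
    """Randomly zero each activation with probability p during training and
    rescale the survivors by 1 / (1 - p) (inverted dropout)."""
    if not training or p == 0.0:
        return x
    keep = (torch.rand_like(x) > p).float()
    return x * keep / (1.0 - p)
```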

In general, dropout methods were originally an algorithmic mechanism to avoid overfitting. On the one hand, their algorithmic nature makes them applicable to a wide range of neural network topologies. On the other hand, since they have good mathematical and statistical properties, they can be extended to compression applications. Dropout can be seen as inducing a sparse distribution of weights, which narrows the gap between pruning and sparse regularization.

2.5.3 Sparse Optimization

The method based on sparse optimization transforms the sparse learning of a neural network into an optimization problem by introducing sparse regularization terms. By solving an unconstrained sparse optimization problem, the weights of some network connections are set to zero. The training objective function can be denoted as,

$$\begin{aligned} \min \limits _{W} \ {\mathcal {L}} (f(\{W^{(l)}\}_{l=1}^{L}),D)+\lambda \sum \limits _{l=1}^{L}\varOmega (W^{(l)}) \end{aligned}$$
(1)

where \(W^{(l)}\) is the weight matrix of the l-th layer and \(\{W^{(l)}\}\) is the set of all weight matrices. \(D=\{x_i,y_i\}_{i=1}^N\) is the dataset containing N samples, \(x_i\) is the i-th input sample and \(y_i\in \{1,\dots ,K\}\) is the corresponding label. \(\lambda \) is the hyper-parameter that balances the effect of \({\mathcal {L}}\left( \{W^{(l)}\},\cdot \right) \) and the regularization term \(\varOmega (W^{(l)})\).

The method based on sparse optimization mainly designs different sparse regularization terms so as to maintain the prediction accuracy and sparsify the network at the same time. The work [98] studies the use of three sparse regularization terms to sparsify neural networks, including the \(\ell _1\) norm, the shrinkage operator and projection onto the \(\ell _0\) ball. Model training with these regularization functions can still be performed by stochastic gradient descent, which makes it easy to reuse existing code. Zhou et al. [99] propose a method in which a tensor low rank constraint and group sparsity are applied to the objective function of a deep neural network to remove redundant neurons. The tensor low rank constraint is realized by the trace norm of the tensor, which was proposed by Liu et al. [100]; the average rank of a tensor is approximated by averaging the trace norms of its different unfolding modes. The combined regularization term for the weights \(W^{(l)}\) of the l-th layer, with group sparsity over neurons and an \(\ell _1\) term, can be denoted as follows,

$$\begin{aligned} \varOmega (W^{(l)})=\left( (1-\alpha )\sqrt{P^{(l)}}\sum _{n=1}^{N^{(l)}}\Vert \omega _n^{(l)}\Vert _2+\alpha \Vert W^{(l)}\Vert _1\right) , \end{aligned}$$
(2)

where \(P^{(l)}\) is the neuron size in the l-th layer, \(N^{(l)}\) is the number of neurons in the l-th layer, \(\omega _n^{(l)}\) is the weight vector of the n-th neuron in the l-th layer and \(\alpha \) is a balance parameter. A majority of regularization based methods focus on utilizing various regularizers \(\varOmega (\cdot )\) [99, 101,102,103,104,105], which we further introduce in the next section.

Sparse regularization based compression has a strong mathematical and optimization background. It can be related to sparse prior distributions and therefore has a certain theoretical interpretability. Moreover, its capability to prevent overfitting allows regularization based compression to achieve comparable model accuracy. In addition, it is easy to apply in practice.

3 Sparse Regularizers

Sparse regularizer based compression approaches for DNNs obtain sparsity by turning the training process into an optimization problem. The compression is obtained by introducing sparse regularization into the objective function [99, 106,107,108,109]. In general, the optimization problem for network compression can be formulated as follows,

$$\begin{aligned} \min \limits _{W} \ {\mathcal {L}} (f(W),D)+\lambda \varOmega (W), \end{aligned}$$
(3)

where \(f(\cdot )\) is the neural network, D is the dataset, W is the weights, \({\mathcal {L}} (\cdot ,\cdot )\) is the data fidelity term, \(\varOmega (\cdot )\) is the regularization term and \(\lambda >0\) is the hyper-parameter.

Many studies apply sparse-regularization based compression by designing the regularization term \(\varOmega (\cdot )\) [99, 101,102,103,104,105]. The \(\ell _0 \) norm is the most intuitive sparse constraint: acting on a vector, it outputs the number of non-zero elements. Intuitively, the \(\ell _0 \) norm is the strictest sparsity-inducing constraint, and the sparsest solution can be obtained with it. However, minimizing an \(\ell _0 \)-constrained problem is usually NP-hard. The \(\ell _1\) norm is a convex relaxation of the \(\ell _0 \) norm; it is commonly used because it is convex and easy to handle. The \(\ell _1\) norm [108, 110, 111] of a vector \(x\in {\mathbb {R}}^N\) is defined by

$$\begin{aligned} \left\| x\right\| _1= \sum \limits _{i=1}^N |x_i|. \end{aligned}$$
(4)

In [98], sparse regularizers including the \(\ell _1\) regularizer are applied to both convolutional and fully-connected structures. Some methods [108, 111] use \(\ell _1\) to promote weight sparsity, that is, to remove connections of a well-trained network.
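As an illustration of how the penalty (4) enters the objective (3) during training, the following PyTorch sketch adds an \(\ell _1\) term over all weight matrices to a standard loss; the value of \(\lambda \) and the decision to skip bias vectors are illustrative assumptions.

```python
import torch

def l1_regularized_loss(model, criterion, inputs, targets, lam=1e-4):
    """Data fidelity term plus lam * sum_l ||W^(l)||_1 over the weight matrices."""
    outputs = model(inputs)
    data_loss = criterion(outputs, targets)
    l1_penalty = sum(p.abs().sum() for p in model.parameters() if p.dim() > 1)
    return data_loss + lam * l1_penalty
```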

Although the \(\ell _1\) norm has many advantages, it is sensitive to outliers and may cause serious bias in estimation [112]. This means these methods may need to sacrifice accuracy to achieve a comparable compression rate. Subsequent works therefore focus on improving the performance of regularization and can be categorized into two types. The first kind of method pays attention to the properties of the regularization terms, so as to select better ones. It is pointed out in [112] that a good regularization term should yield an estimator with the following three characteristics: unbiasedness, sparsity and continuity. Unbiasedness means that for variables with large magnitude the estimate should be asymptotically unbiased, so as to avoid excessive model bias. Sparsity means that small variables can be automatically estimated to zero to obtain sparse solutions, which reduces the complexity of the model. Continuity means that the obtained estimator should be continuous, so as to maintain the stability of the model. These characteristics provide a reference for the selection of regularization functions. Obviously, regularization functions with such characteristics must be nonconvex.

The smoothly clipped absolute deviation (SCAD) penalty [112] is the first regularization function proved to satisfy these characteristics; it was proposed to improve on the \(\ell _1 \) (soft threshold) and hard threshold functions, and effectively combines the two. Its definition on the vector \({\mathbf {x}}=\{x_1,x_2,\dots ,x_N\}\in {\mathbb {R}}^N\) is as follows,

$$\begin{aligned} {\mathbf {P}}({\mathbf {x}};\lambda ,\gamma )=\sum _{i=1}^{n}P(x_i;\lambda ,\gamma ), P(x_i;\lambda ,\gamma )= \left\{ \begin{array}{llll} \lambda \vert x_i\vert , &{}if \ \vert x_i\vert \le \lambda \\ \frac{2\gamma \lambda \vert x_i\vert -x_i^2-\lambda ^2}{2(\gamma -1)}, &{}if \ \lambda< \vert x_i\vert <\gamma \lambda \\ {\lambda ^2(\gamma +1)}/2, &{}if\ \vert x_i\vert \ge \gamma \lambda , \end{array} \right. \end{aligned}$$
(5)

where \(\lambda > 0\) and \(\gamma > 2 \) are two adjustment parameters, usually \(\gamma = 3.7 \). As can be seen from the above formula, for a scalar variable with \(\vert x\vert \le \lambda \), SCAD is equivalent to \(\ell _1\); it then smoothly transforms into a quadratic function until \(\vert x\vert =\gamma \lambda \); finally, for all \(\vert x\vert >\gamma \lambda \), it is equal to a constant, so as to provide an approximately unbiased estimate of large variables.

SCAD has been proved to have the following two properties:

  (a) it can select the correct subset of variables;

  (b) the parameter estimation is asymptotically normal, and the variable estimates can be made unbiased by controlling the parameters.

The minimax concave penalty (MCP) is proposed in [113] and is defined as follows,

$$\begin{aligned} {\mathbf {P}}({\mathbf {x}};\lambda ,\gamma )=\sum _{i=1}^{n}P(x_i;\lambda ,\gamma ), P(x_i;\lambda ,\gamma )= \left\{ \begin{array}{lllll} \lambda \vert x_i\vert -{x_i^2}/(2\gamma ), &{}if \ \vert x_i\vert \le \gamma \lambda \\ \gamma \lambda ^2/2, &{}if \ \vert x_i\vert >\gamma \lambda \end{array} \right. \end{aligned}$$
(6)

where \(\lambda > 0\) and \(\gamma > 1 \). It can be seen from (6) that when \(\gamma \rightarrow \infty \), the MCP function tends to \(\ell _1 \) regularization and the sparsity-inducing ability becomes weaker; when \(\gamma \rightarrow 1 \), the MCP function approaches \(\ell _0\) regularization and the sparsity-inducing ability becomes stronger. Similar to SCAD, when the magnitude of the variable exceeds a certain value, the value of MCP becomes a constant, and MCP can also yield unbiased, sparse and continuous variable estimates. Different from SCAD, MCP relaxes the penalty rate toward zero immediately, while the penalty rate of SCAD remains unchanged for a range before it decreases.

Although SCAD and MCP can obtain unbiased estimates, they are piecewise functions and relatively complex, so they undoubtedly increase the amount of computation in practice. Especially when the applied model itself is complex and computation-heavy, handling these two parameters adds a further computational burden.

The capped \(\ell _1\) function is another approximation of \(\ell _0\) designed to better reduce the influence of noise and outliers [114],

$$\begin{aligned} {\mathbf {P}}({\mathbf {x}};a)=\sum _{i=1}^{n}\min (\vert x_i\vert ,a), a>0. \end{aligned}$$
(7)

It consists of two segments. As seen in Eq. (7), when \(a\rightarrow 0 \), \(\sum _i\min (\vert x_i\vert ,a)/a\rightarrow \Vert {\mathbf {x}}\Vert _0\), so it approximates \(\ell _0\) better than \(\ell _1\) does. When the absolute value of the variable is less than the parameter a, the capped \(\ell _1\) function corresponds to \(\ell _1\); when the absolute value of the variable is greater than a, it corresponds to a constant, which means that noise terms with large errors are truncated. Thus capped \(\ell _1\) avoids the bias of variable estimation and is more robust to noise.

The elastic net is a combination of \(\ell _1\) regularization function and \(\ell _2\) regularization function [115],

$$\begin{aligned} {\mathbf {P}}({\mathbf {x}};\alpha )=\sum _{i=1}^{n}\left( (\alpha -1)x_i^2/2+(2-\alpha )\vert x_i\vert \right) , \ 1\le \alpha \le 2, \end{aligned}$$
(8)

where \(\alpha \in [1,2]\) is a balance parameter. When \(\alpha = 1 \), the elastic net function becomes the \(\ell _1 \) function, and when \(\alpha =2 \), it becomes the \(\ell _2 \) function. The logarithmic penalty function is a generalized form of the elastic net [116],

$$\begin{aligned} {\mathbf {P}}({\mathbf {x}};\gamma )=\sum _{i=1}^{n}P(x_i,\gamma ), P(x_i;\gamma )=\frac{\log (\gamma \vert x_i\vert +1)}{\log (\gamma +1)}, \end{aligned}$$
(9)

where the parameter \(\gamma > 0 \). By changing the value of \(\gamma \), the continuum from \(\ell _1 \ (\gamma \rightarrow 0_+)\) to \(\ell _0 \ (\gamma \rightarrow \infty ) \) can be obtained.

The \(\ell _1/\ell _2 \) regularization function has been applied as a sparsity-inducing regularizer in many fields, such as non-negative matrix factorization [117], blind deconvolution [118] and image deblurring [119]. Without the range constraint, this regularization adopts the following form in these works,

$$\begin{aligned} {\mathbf {P}}({\mathbf {x}})=\frac{\sum _{i=1}^{n} |x_i|}{\sqrt{\sum _{i=1}^{n} x_i^2}} \end{aligned}$$
(10)

It satisfies five desired heuristic criteria of sparsity measures [120]. It is differentiable almost everywhere and scale-invariant, so it has been utilized for learning sparser neural networks [121].

The \(\ell _{1-2}\) regularization [122] is the difference of the \(\ell _1\) and \(\ell _2\) norms and is widely used in many problems. Its relationship with the \(\ell _1/\ell _2 \) regularization function can be denoted as \(\Vert x\Vert _1-\Vert x\Vert _2=\Vert x\Vert _2(\frac{\Vert x\Vert _1}{\Vert x\Vert _2}-1)\). The \(\ell _{1-2}\) function is nonconvex and Lipschitz continuous, and the corresponding optimization algorithms are effective. It performs very well in spectral imaging [122], compressed sensing [123] and sparse signal recovery [124]. This regularization function has no hyper-parameters and is simple in form, but the absence of hyper-parameters also makes it less adaptable to different tasks.

The transformed \(\ell _1\) (T\(\ell _1 \)) regularization function [125, 126] is a smooth form of capped \(\ell _1\) with many good properties, such as unbiasedness, sparsity and Lipschitz continuity. For a single variable x, it is defined as follows,

$$\begin{aligned} \rho _a(x)=\frac{(a+1)\vert x\vert }{a+\vert x\vert } \end{aligned}$$
(11)

where \(a>0 \) is a parameter that controls the shape of the function. When \(a \rightarrow 0\), \(\rho _a(x)\) approaches the indicator function as follows,

$$\begin{aligned} I(x)= \left\{ \begin{array}{lllll} 1, &{}\quad if \ x \ne 0\\ 0, &{}\quad if \ otherwise. \end{array} \right. \end{aligned}$$
(12)

and when \(a \rightarrow \infty \), \(\rho _a(x)\) approaches the \(\ell _{1}\) function. Zhang extends T\(\ell _1 \) to vector space: for a vector \({\mathbf {x}}=\{x_1,x_2,\dots ,x_n\}\in {\mathbb {R}}^n\), T\(\ell _1 \) is defined as follows,

$$\begin{aligned} T\ell _1({\mathbf {x}})=\sum _{i=1}^{n}\rho _a(x_i). \end{aligned}$$
(13)

Note that

$$\begin{aligned} \begin{aligned} \lim _{a \rightarrow 0^+}T\ell _1({\mathbf {x}})=\sum _{i=1}^{N}I_{\left\{ x_i\ne 0\right\} }=\Vert {\mathbf {x}}\Vert _0, \ \lim _{a \rightarrow +\infty } T\ell _1({\mathbf {x}}) = \sum _{i=1}^N \left| x_i\right| = \Vert {\mathbf {x}}\Vert _1. \end{aligned} \end{aligned}$$
(14)

Its non-convexity and tunable hyper-parameter make it better at sparsifying networks [127, 128].

The \(\ell _p\) \((0< p< 1)\) quasi-norm has attracted a great deal of attention in recent years [129,130,131,132,133,134,135] and is defined by

$$\begin{aligned} \left\| x\right\| _p= \left( \sum \limits _{i=1}^N |x_i|^p\right) ^{\frac{1}{p}}. \end{aligned}$$
(15)

In theory, it fulfills the unbiasedness, sparsity and oracle properties [129]. As the value of p approaches 0, \(\ell _p\) interpolates \(\ell _0\),

$$\begin{aligned} \lim \limits _{p\rightarrow 0^+} \left\| x\right\| _p^p=\left\| x\right\| _0. \end{aligned}$$
(16)

The rate of change of the \(\ell _p\) function is unbounded at zero because it is non-Lipschitz continuous at \(x=0\). From these two perspectives, its abrupt transition between zero and non-zero values imitates the discreteness of \(\ell _0\).
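To make the piecewise definitions above concrete, the following NumPy sketch implements three of the penalties discussed, namely SCAD of Eq. (5), MCP of Eq. (6) and the transformed \(\ell _1\) of Eq. (13), applied element-wise and summed over a weight vector; the default values of \(\lambda \), \(\gamma \) and a are illustrative only.

```python
import numpy as np

def scad(x, lam=0.1, gamma=3.7):
    """SCAD penalty of Eq. (5), summed over the entries of x."""
    a = np.abs(x)
    inner = lam * a
    middle = (2 * gamma * lam * a - a**2 - lam**2) / (2 * (gamma - 1))
    outer = lam**2 * (gamma + 1) / 2
    return np.sum(np.where(a <= lam, inner,
                  np.where(a < gamma * lam, middle, outer)))

def mcp(x, lam=0.1, gamma=3.0):
    """MCP penalty of Eq. (6), summed over the entries of x."""
    a = np.abs(x)
    return np.sum(np.where(a <= gamma * lam,
                           lam * a - a**2 / (2 * gamma),
                           gamma * lam**2 / 2))

def transformed_l1(x, a=1.0):
    """Transformed l1 penalty of Eqs. (11) and (13)."""
    ax = np.abs(x)
    return np.sum((a + 1) * ax / (a + ax))
```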

Some commonly used sparsity-inducing regularizers are shown in Table 1.

Table 1 Commonly used sparse-inducing regularizers
Table 2 Commonly used composite sparse-inducing regularizers

The second type of regularizer-based approach is to use the composite regularizers [99, 101,102,103,104,105], as the following form,

$$\begin{aligned} \varOmega (W)=\mu \varOmega _1(W)+(1-\mu ) \varOmega _2(W), \end{aligned}$$
(17)

where \(\varOmega _1(\cdot )\) aims to compress networks at the connection level, \(\varOmega _2(\cdot )\) is used to enhance the compression capability at the neuron level, and \(\mu \in [0,1]\) is a balance parameter. For example, Zhou et al. [99] utilize two sparse constraints: group sparsity to induce sparsity at the neuron level and a tensor low rank constraint to promote competition between weights in each layer. The work [102] uses the \(\ell _{2,1}\) norm, which yields a sparse network at the neuron level, together with the \(\ell _1\) norm, which tends to remove connections. Yoon et al. [104] combine group sparsity and exclusive sparsity to promote sharing and competition between different features. In our previous work [101], to exploit the strengths of both methodologies, we proposed integrating the non-convex transformed \(\ell _1\) with group sparsity. Commonly used composite sparsity-inducing regularizers for DNNs are shown in Table 2.
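A minimal PyTorch sketch of a composite penalty in the form of Eq. (17) is given below, combining a connection-level \(\ell _1\) term with a neuron-level \(\ell _{2,1}\) (group) term in the spirit of [102]. Grouping the rows of the weight matrix by output neuron is an illustrative assumption; other grouping schemes are equally possible.

```python
import torch

def composite_penalty(W, mu=0.5):
    """Omega(W) = mu * ||W||_1 + (1 - mu) * ||W||_{2,1}, where each row of W
    (the fan-in of one neuron) forms a group for the l_{2,1} term."""
    l1_term = W.abs().sum()               # connection-level sparsity
    l21_term = W.norm(p=2, dim=1).sum()   # neuron-level (group) sparsity
    return mu * l1_term + (1.0 - mu) * l21_term
```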

Although the combination of two regularizers can achieve good results in terms of accuracy and compression ratio, this structure makes the optimization process alternate between the two terms. Such alternating optimization may lead to a zigzag phenomenon and reduce the training speed. To overcome the numerical difficulties caused by composite regularizers, some works design a single regularizer that compresses the network without losing accuracy. To ensure the compression effect, such a regularizer should be able to compress connections and neurons at the same time.

4 Optimization

In this section, we introduce optimization methods for solving the sparse regularization based compression model, which can be written layer-wise as follows,

$$\begin{aligned} \min \limits _{W} \ {\mathcal {L}} (f(\{W^{(l)}\}_{l=1}^{L}),D)+\lambda \sum \limits _{l=1}^{L} \mu _l\varOmega (W^{(l)}), \end{aligned}$$
(18)

where \(\varOmega (W^{(l)})\) is a regularization term, \(\lambda > 0\) is the regularization parameter, \(\mu _l > 0\) is the balance parameter used to control the sparsity level of each layer l, and D is a training set. \(W^{(l)}\in {\mathbb {R}}^{m_l \times m_{l-1}}\) represents the weights at the l-th layer, \(l\in \{1,2,\ldots ,L\}\), L is the number of layers, and \(m_l\) denotes the number of neurons in layer l.

By coordinating the value of \(\lambda \), we can compress the connections of the network during the process of solving problem (18). After obtaining the solution to (18), the neurons whose connection weights are all zero are removed from the network, and the sparse network is thus constructed.

4.1 Proximal Algorithm

In this subsection, we introduce the proximal algorithm. It is a very general method applicable in many circumstances, such as nonsmooth objective functions, and its speed can be very satisfactory in practice. The proximal method is widely used in many fields, such as nuclear norm problems [138], sparse problems [139], maximum a posteriori estimation in graphical models [140], empirical or structural risk minimization [141,142,143] and signal processing [144]. It is worth noting that the proximal algorithm is often very suitable for solving sparse optimization problems. In addition, it is simple in both mathematical form and operation, which makes it very easy to understand and implement. The algorithm relies on the proximal operator.

4.1.1 Proximal Operator

Firstly, we introduce the concept of the proximal operator. The proximal operator \(prox_f:{\mathbb {R}}^n\rightarrow {\mathbb {R}}^n\) of f is defined as follows,

$$\begin{aligned} prox_f(y)=\arg \min _x\left( f(x)+\frac{1}{2}\Vert x-y\Vert _2^2\right) , \end{aligned}$$
(19)

where \(\Vert \cdot \Vert _2\) is the \(\ell _{2}\) norm. Evaluating the proximal operator of a function amounts to solving an optimization problem. Generally, f is accompanied by a scaling coefficient \(\lambda \), and its proximal operator becomes

$$\begin{aligned} prox_{\lambda f}(y)=\arg \min _x\left( f(x)+\frac{1}{2\lambda }\Vert x-y\Vert _2^2\right) . \end{aligned}$$
(20)

From the perspective of the gradient, when f is differentiable and \(f(x)+\frac{1}{2\lambda }\Vert x-y\Vert _2^2\) has a minimum,

$$\begin{aligned} \nabla f(x) +\frac{x-y}{\lambda }=0 \Rightarrow x=y-\lambda \nabla f(x) \approx y-\lambda \nabla f(y) \end{aligned}$$
(21)

that is, \(prox_{\lambda f}(y)\) is approximately a gradient descent step taken at the point y.

There is also an illustrative example of the proximal operator when f is the indicator function \(I_{{\mathcal {C}}}(x)\),

$$\begin{aligned} I_{{\mathcal {C}}}(x)= \left\{ \begin{array}{llll} 0,&{}x\in {\mathcal {C}}\\ +\infty ,&{}x\notin {\mathcal {C}} \end{array} \right. \end{aligned}$$
(22)

where \({\mathcal {C}} \) is a closed nonempty convex set, and its proximal operator is

$$\begin{aligned} \arg \min _x\left( I_{{\mathcal {C}}}(x)+\frac{1}{2}\Vert x-y\Vert _2^2\right) . \end{aligned}$$
(23)

In fact, it is an optimization problem of projection under Euclidean norm,

$$\begin{aligned} \arg \min _{x\in {\mathcal {C}}}\Vert x-y\Vert _2^2. \end{aligned}$$
(24)

The solution of Eq. (24) is the projection of y onto the set \({\mathcal {C}}\); therefore, the proximal operator can be regarded as a generalized projection. Moreover, if f separates as \(f(x) = \sum _{i=1}^n f_i (x_i)\), then the proximal operator satisfies \((prox_f(v))_i = prox_{f_i}(v_i)\). This property can be exploited in parallel computing.
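As a concrete example, the proximal operator of \(\lambda \Vert \cdot \Vert _1\) is the element-wise soft-thresholding operator, which also illustrates the separability property just mentioned. The NumPy sketch below is a generic illustration under these assumptions.

```python
import numpy as np

def prox_l1(y, lam):
    """prox_{lam * ||.||_1}(y): element-wise soft thresholding.
    Because ||x||_1 separates over coordinates, the operator acts per entry;
    entries with |y_i| <= lam become exactly zero."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

print(prox_l1(np.array([0.3, -0.05, 1.2]), lam=0.1))
```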

4.1.2 Proximal Gradient Algorithm

Given an optimization problem as follows,

$$\begin{aligned} \min _x f(x)+g(x). \end{aligned}$$
(25)

where \(f:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) and \(g:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\cup \{+\infty \}\) are two closed proper functions and f is differentiable. The proximal gradient method solves the problem with the following iteration, where \(\lambda ^k>0\) is the step size,

$$\begin{aligned} x^{k+1}:=prox_{\lambda ^k g}(x^k-\lambda ^k \nabla f(x^k)). \end{aligned}$$
(26)

4.1.3 Applied in Compression for DNNs

The proximal gradient algorithm is a common and useful method to train the compression model (18). When training DNNs, a stochastic framework is adopted to deal with the computational cost brought by the extremely large training dataset, and the stochastic gradient method cannot be applied directly to a majority of regularizers. Thus, a stochastic proximal gradient algorithm is utilized in our previous work. In detail, the proximal operator is applied layer by layer as follows,

$$\begin{aligned} W_{t+1}^{(l)}=prox_{\lambda \gamma \varOmega }\left( W_t^{(l)}-\gamma \nabla {\mathcal {L}}\big (f(W_t^{l}),D\big )\right) , \end{aligned}$$
(27)

where \(W_{t}^{(l)}\) is the weight matrix of the l-th layer. Furthermore, (27) can be rewritten as the following optimization problem,

$$\begin{aligned} W_{t+1}^{(l)}= \arg \min \limits _W \frac{1}{2\lambda }\left\| W-(W_t^{(l)}-\gamma \nabla {\mathcal {L}}(f(W_t^{l}),D)) \right\| _F^2+\varOmega (W). \end{aligned}$$
(28)

Replacing the gradient in the above Eq. (28) with the stochastic gradient, we obtain,

$$\begin{aligned} W_{t+1}^{(l)}=\arg \min \limits _W \bigg \{\frac{1}{2\lambda }\Big \Vert W-\big (W_t^{(l)}-\frac{\gamma }{m_0}\sum \limits _{i=1}^{m_0} \nabla {\mathcal {L}} (f(W_t^{l}),\{x_i,y_i\})\big ) \Big \Vert _F^2+\varOmega (W)\bigg \}, \nonumber \\ \end{aligned}$$
(29)

where \(m_0\) is the mini-batch size in SGD.
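As a rough illustration of the update (29) with the \(\ell _1\) regularizer (whose proximal operator is soft thresholding), the following PyTorch sketch performs one stochastic proximal gradient step. The step size \(\gamma \), the regularization parameter \(\lambda \) and the choice to apply the proximal step only to weight matrices (skipping biases) are illustrative assumptions.

```python
import torch

def prox_sgd_step(model, loss_fn, batch, gamma=0.01, lam=1e-4):
    """One stochastic proximal gradient step: a gradient step on the mini-batch
    loss, then the proximal operator of lam * ||.||_1 applied layer by layer."""
    inputs, targets = batch
    model.zero_grad()
    loss_fn(model(inputs), targets).backward()
    with torch.no_grad():
        for p in model.parameters():
            p -= gamma * p.grad                       # gradient step on the loss
            if p.dim() > 1:                           # prox step on weight matrices
                p.copy_(torch.sign(p) * torch.clamp(p.abs() - gamma * lam, min=0.0))
    return model
```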

This is a general framework for solving the regularized compression model. However, there are a few differences in the pipeline for solving an objective function with a combined regularizer. As is well known, the successful application of proximal methods depends on whether the proximal operator of the regularization term can be evaluated, and it is difficult to directly calculate the proximal operator of such a combined regularization term. Under this circumstance, calculating the proximal operators of the two regularization functions separately is a pragmatic method [101, 104, 136]. For a regularized compression model of the form

$$\begin{aligned} \min \limits _{W} \ {\mathcal {L}} (f(\{W^{(l)}\}_{l=1}^{L}),D)+\lambda \sum \limits _{l=1}^{L} (\mu _l\varOmega _{1}(W^{(l)})+(1-\mu _l)\varOmega _{2}(W^{(l)})), \end{aligned}$$
(30)

the proximal operators of the two regularization terms are applied successively after the gradient step on the loss function at each iteration,

$$\begin{aligned} W^{(l)}_{t+1}=prox_{\lambda ^{(l)}\gamma (1-\mu _l)\varOmega _{2}}\left( prox_{\lambda ^{(l)}\gamma \mu _l \varOmega _{1}}(W^{(l)}_{t}-\gamma \sum _{i=1}^n\nabla {\mathcal {L}}(W^{(l)}_{t},\{x_i,y_i\})/n)\right) .\nonumber \\ \end{aligned}$$
(31)

As seen from Eq. (27) and Eq. (31), there are some differences between single and combined regularization from the perspective of the optimization process. For the two-term structure of these combined regularizers, the optimization alternates between the terms, which in practice slows down convergence. Meanwhile, each part of such a two-term regularizer can be seen as a single regularizer, and the alternating optimization over the terms may cause computational difficulties.

4.2 Subgradient Based Method

4.2.1 Subgradient

A vector g is called a subgradient of f at the point x, and belongs to the subdifferential \(\partial f(x)\), if it satisfies, for all y,

$$\begin{aligned} f(y)\ge f(x)+ g^{\top } (y-x). \end{aligned}$$
(32)

4.2.2 Subgradient Descent

Due to the non-smoothness of the commonly used sparsity-inducing regularizers, gradient descent cannot be applied directly to our model (18). To minimize (18), subgradient descent is also utilized [145], where stochastic gradient descent is applied to the first term and subgradient descent is applied to the regularization term. The mathematical form is as follows,

$$\begin{aligned} w_{t+1}^{(l)}= & {} w_t^{(l)}-\alpha _{t}\nabla {\mathcal {L}}(f(w_t^{l}),D) \end{aligned}$$
(33)
$$\begin{aligned} w_{t+2}^{(l)}= & {} w_{t+1}^{(l)}-\alpha _{t+1} g_{t+1}^{(l)} \end{aligned}$$
(34)

where \( w_{t}^{(l)}\) is the weight of the l-th layer at the t-th iteration, \(g_{t}^{(l)}\) is any subgradient of \(\varOmega \) at \(w_{t}^{(l)}\), and \(\alpha _t>0\) is the step size at the t-th iteration.
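For the \(\ell _1\) regularizer, a valid subgradient at w is sign(w) (with any value in \([-1,1]\) at zero; taking 0 is one admissible choice). The following PyTorch sketch shows the two half-steps (33)-(34) under this assumption, with illustrative step size and regularization parameter.

```python
import torch

def subgradient_step(w, grad_loss, alpha=0.01, lam=1e-4):
    """Gradient step on the loss followed by a subgradient step on lam * ||w||_1,
    using sign(w) as the subgradient (torch.sign returns 0 at w = 0)."""
    w = w - alpha * grad_loss             # half-step (33) on the data fidelity term
    w = w - alpha * lam * torch.sign(w)   # half-step (34) on the regularizer
    return w
```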

4.3 Discussion

We compare the subgradients and proximal operators of several sparse regularizers in Table 3. From these two optimization algorithms and Table 3, we find that compression by the proximal gradient algorithm usually uses a threshold-form proximal operator, through which the insignificant weights become exactly zero during the optimization process; the sparsity of the network results from this. In a network trained by the subgradient descent method, some weights may become very small during optimization, but it is difficult for them to become exactly zero. Compressing the network then needs post-processing, and the sparsity is obtained by pruning, which may require retraining to maintain accuracy.

Table 3 Subgradients and proximal operators of several sparse regularizers

We used to focus on the sparsity-inducing capability of various sparse regularizers, which is supposed to set weights in the network to zero directly, and on the thresholds that set parameters to zero. However, pruning a network without a threshold produced by the regularizer is actually possible, and it can also control the compression ratio. Compression with the subgradient algorithm is instructive here: in regularization based compression using subgradient descent, a post-operation, pruning, is needed; such methods do not spend much effort on the threshold, but they still obtain a compressed network with good accuracy. The sparsity brought by the regularizer can thus be seen as the anti-disturbance ability it brings to the model. In other words, the ability of a regularizer is the ability to select the optimal subnetwork of the original network: it picks out the more important weight positions and keeps them.

5 Avenues for Future Research

Considering the enormous interest in neural network compression, we briefly review and analyze sparse regularization based compression methods for DNNs in this work. After carefully studying the literature, we overview sparse regularizers and optimization methods for DNN compression, discuss their respective advantages and disadvantages, and provide some insights on how to make sparse regularization fit within the compression framework to help future research. In this section, we indicate some avenues for future research so that regularization will harmonize with network compression and this area can develop further.

First, regularizer design has attracted great attention from researchers. They design regularizers with different desired properties, such as unbiasedness, sparsity, scale invariance, continuity and Lipschitz continuity; a good regularizer with the desired properties should result in an estimator with the intended preference. Moreover, to obtain a structural compression effect, state-of-the-art works focus on removing neurons using the \(\ell _{2,1}\) group sparse regularizer. Although \(\ell _{2,1}\) performs well, it is essentially the only group regularizer that is widely used. Therefore, it is significant to design or utilize novel group sparse regularizers that better capture intra-group information or better promote group sparsity, so that the regularized networks can adapt to different tasks.

In addition, to obtain a good compression effect, composite regularizers are utilized to sparsify connections and neurons at the same time: one part of a composite regularizer removes redundant connections and the other part removes unnecessary neurons. Although composite regularizers can yield accurate compressed networks, they are usually difficult to optimize, because the algorithms need to alternately optimize the two regularized objectives. Such alternating optimization usually leads to a zigzag phenomenon and slows down training. Therefore, it is necessary to design simpler regularizers or algorithms to overcome this computational difficulty.

Furthermore, it will be very promising to study compression from a Bayesian perspective. Note that explicit regularization can be considered as introducing a prior into the posterior; for example, the commonly used Laplacian prior distribution corresponds to \(\ell _{1}\) regularization. In the future, it is promising to extend the commonly used priors to more general ones. From a compression point of view, sparse priors can be placed over connection-wise or neuron-wise structures: sparser prior distributions can be placed on the weights, and priors such as Bernoulli priors can be introduced over structures such as convolutional filters or ResNet units to promote structural sparsity.

Second, it is meaningful to introduce regularization based compression methods to other neural network architectures. With the increasing scale of the tasks to be solved, the networks used are becoming wider and deeper. Although there are various network architectures, a majority of methods verify their effect only on convolutional neural networks. In the future, it is meaningful to test the performance on other neural network architectures. Furthermore, it could be interesting to study the pipeline of compression, such as training models from scratch or in a student-teacher setting.

Last but not least, it is promising to explore the effect of explicit and implicit regularization. Generally speaking, existing works on DNN model compression include pruning, dropout, quantization and optimization with explicit regularization. Besides explicit regularization based methods, other methods may impose a regularization effect through the training algorithm rather than by introducing an explicit regularizer into the objective function; these algorithmic methods can be regarded as implicit regularization. For instance, pruning removes non-informative weights that are insensitive to the performance of a pretrained network, which may remove noise and improve the generalization or robustness of the network. Dropout was proposed to reduce overfitting and improve performance by randomly removing neurons, which is definitely a form of regularization. It could be interesting to explore the interplay between explicit and implicit regularization.