
1 Introduction

Deep neural networks (DNNs) have achieved unprecedented performance in a number of fields such as speech recognition [1], computer vision [2], and natural language processing [22]. However, these successes rely heavily on DNNs with a huge number of parameters and high computation capability [5]. For instance, the work by Krizhevsky et al. [2] achieved dramatic results in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) using a network containing 60 million parameters. The convolutional neural network VGG [27], which won ILSVRC 2014, consists of 15M neurons and 144M parameters. Such model sizes make the deployment of DNNs impractical on devices with limited memory and computing power. Moreover, a large number of parameters tends to degrade the generalization of the model [4, 5]. There is thus a growing interest in reducing the complexity of DNNs.

Existing work on model compression and acceleration of DNNs can be categorized into four types: parameter pruning and sparsity regularizers, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. Among these techniques, one class focuses on promoting sparsity in DNNs. DNNs contain many redundant weights, which occupy unnecessary computational resources and may cause overfitting and poor generalization. Network sparsity has been shown to be effective for reducing network complexity and addressing the overfitting problem [24, 25].

Sparsity for DNNs can be further classified into pruning and sharing, matrix design and factorization, randomly reducing the complexity, and sparse optimization. Pruning and sharing methods remove redundant, non-informative weights with a negligible drop in accuracy. However, the pruning criteria require manual setup for each layer, which demands fine-tuning of the parameters and can be cumbersome for some applications.

The second class of methods reduces memory costs through structured matrices. However, the structural constraint may introduce bias into the model, and finding a suitable structured matrix is difficult. Matrix factorization uses low-rank filters to accelerate convolution, with the low-rank approximation performed layer by layer. However, the implementation is computationally expensive and cannot perform global parameter compression.

The third class of methods randomly reduces the size of the network during training. A typical example is dropout, which randomly removes hidden neurons in the DNN. These methods reduce overfitting effectively but require more training time.

Recently, training compact CNNs with sparsity constraints has attracted increasing attention. Such sparsity constraints are typically introduced into the optimization problem as structured and sparse regularizers on the network weights. In [26], sparse updates such as the \(\ell _1\) regularizer, the shrinkage operator, and projection onto \(\ell _0\) balls are applied to each layer during training. Nevertheless, these methods often result in a heavy loss of accuracy. Group sparsity and the \(\ell _1\) norm are integrated in [3, 4] to obtain a sparse network with fewer parameters. Group sparsity and exclusive sparsity are combined as a regularization term in a recent work [5]. Experiments show that these methods can achieve better performance than the original network.

The key challenge of sparse optimization is the design of the regularization term. The \(\ell _0\) regularizer is the most intuitive form of sparse regularizer, but minimizing the \(\ell _0\)-regularized problem is NP-hard [15]. The \(\ell _1\) regularizer is a convex relaxation of \(\ell _0\), which is popular and easy to solve. Although \(\ell _1\) enjoys several good properties, it may cause bias in estimation [8]. Fan and Li [8] propose the smoothly clipped absolute deviation (SCAD) penalty function to ameliorate \(\ell _1\), which has been proven to be unbiased. Later, many other nonconvex regularizers were proposed, including the minimax concave penalty (MCP) [16], the \(\ell _p\) penalty with \(p\in (0,1)\) [9,10,11,12,13], \(\ell _{1-2}\) [17, 18] and the transformed \(\ell _1\) (TL1) [19,20,21].

Optimization methods play a central role in DNNs. Training such networks with sparse regularizers amounts to minimizing a high-dimensional, non-convex and non-smooth objective function, and is often tackled with simple first-order methods such as stochastic gradient descent. Since the proximal gradient method is an efficient method for non-smooth programming and is suitable for our model, we adopt it and borrow the strengths of stochastic methods, such as fast convergence, the ability to avoid overfitting, and suitability for high-dimensional models.

In this paper, we consider non-convex regularizers to sparsify the network weights so that non-essential weights are zeroed out with minimal loss of performance. We choose a simple regularization term, rather than a regularizer with multiple terms, to sparsify the weights, and combine it with dropout to remove neurons.

2 Related Work

2.1 Sparsity for DNNs

There are two methodologies for making networks sparse. One class focuses on inducing sparsity among connections [3, 4] to reinforce the competitiveness of features. The \(\ell _1\) regularizer is applied as part of the regularization term to remove redundant connections. An extension of the \(\ell _{1,2}\) norm adopted in [5] not only achieves the same effect but also balances the sparsity across groups.

Another class focuses on sparsity at the neuron level. Group sparsity is a typical example [3, 4, 6, 7], which is designed to drive all the variables in a group to zero. In DNNs, when each group is defined as all weights from one neuron, all outgoing weights of that neuron become zero simultaneously. Group sparsity can automatically decide how many neurons to use at each layer, force the network to have a redundant representation, and prevent the co-adaptation of features.

2.2 Non-convex Sparse Regularizers

Fan and Li [8] argued that a good penalty function should result in an estimator with three properties: sparsity, unbiasedness and continuity, and that regularization terms with these properties must be nonconvex. The smoothly clipped absolute deviation (SCAD) [8] and the minimax concave penalty (MCP) [16] are regularizers that fulfil these properties. In recent years, nonconvex metrics in concise forms have also been considered, such as \(\ell _{1-2}\) [17, 18, 31], the transformed \(\ell _1\) (TL1) [19,20,21] and \(\ell _p\) with \(p\in (0,1)\) [9,10,11,12,13, 32].

3 The Proposed Approach

We aim to obtain a sparse network whose test accuracy is comparable to, or even better than, that of the original model. The objective function is defined by

$$\begin{aligned} \min \limits _{W} \ \mathcal {L} (f(W),D)+\lambda \varOmega (W) \end{aligned}$$
(1)

where f is the prediction function parameterized by W, \(D=\{ x_i,y_i\}_{i=1}^N\) is a training set with N instances, \(x_i\in \mathbb {R}^p\) is a p-dimensional input sample, and \(y_i\in \{1,...,K\}\) is its corresponding class label. \(\mathcal {L}\) is the loss function, \(\varOmega \) is the regularizer, and \(\lambda \) is the parameter that balances the loss and the regularization term. In DNNs, W represents the set of weight matrices, and the regularization term can be written as the sum of the regularizers applied to the weight matrix of each layer.

\(\ell _p\) regularization \((0< p< 1)\) is studied in [9,10,11,12,13]. The \(\ell _p\) quasi-norm of a variable \(x\in \mathbb {R}^N\) is defined by

$$\begin{aligned} \left\| x\right\| _p= \Big (\sum \limits _{i=1}^N |x_i|^p\Big )^{\frac{1}{p}} \end{aligned}$$
(2)

which is nonconvex, nonsmooth and non-Lipschitz.
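As a quick illustration (not part of the original derivation), the quantity in (2), and the p-th power that actually appears in our regularizer, can be computed with a few lines of NumPy; the sketch below is purely expository.

import numpy as np

def lp_quasi_norm(x, p=0.5):
    # (2): ||x||_p = ( sum_i |x_i|^p )^(1/p); the regularizer later uses
    # the p-th power ||x||_p^p = sum_i |x_i|^p rather than the quasi-norm itself.
    x = np.abs(np.asarray(x, dtype=float).ravel())
    return np.sum(x ** p) ** (1.0 / p)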

We use an extension of \(\ell _{1,2}\), called the exclusive sparsity regularizer, to promote competition for features between different weights, making them suitable for disjoint feature sets. For \(X\in \mathbb {R}^{n \times m}\), the exclusive regularizer is defined by

$$\begin{aligned} EL(X)= \sum \limits _{i=1}^n \Big (\sum \limits _{j=1}^m |x_{ij}|\Big )^{2} \end{aligned}$$
(3)

The inner \(\ell _1\) norm promotes sparsity within each group, while the outer \(\ell _2\)-type combination balances the weights between groups, so that the sparsity of each group is relatively even and the number of non-zero weights per group is similar.
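For concreteness, a minimal NumPy sketch of (3) is given below, continuing the snippet above; treating each row of the weight matrix as one group is an assumption made only for this illustration.

def exclusive_sparsity(X):
    # (3): EL(X) = sum_i ( sum_j |x_ij| )^2, with rows taken as groups here.
    X = np.asarray(X, dtype=float)
    return float(np.sum(np.sum(np.abs(X), axis=1) ** 2))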

In this work, we define the regularizer as follows,

$$\begin{aligned} \varOmega (W)=(1-\mu )\sum \limits _g (\sum \limits _i |w_{g,i}|)^2+ \mu \left\| Vec(W) \right\| _{\frac{1}{2}}^{\frac{1}{2}} \end{aligned}$$
(4)

where Vec(W) denotes the vectorization of the weight matrix and \(\mu \) balances the two terms.
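Continuing the sketch above, the combined regularizer (4) for a single weight matrix can be evaluated as follows; mu is the mixing parameter between the two terms.

def omega(W, mu):
    # (4): Omega(W) = (1 - mu) * EL(W) + mu * ||Vec(W)||_{1/2}^{1/2}
    half_quasi = float(np.sum(np.abs(W) ** 0.5))   # ||Vec(W)||_{1/2}^{1/2}
    return (1.0 - mu) * exclusive_sparsity(W) + mu * half_quasi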

4 The Optimization Algorithm

4.1 Combined Exclusive and Half Thresholding Method

In our work, we use the proximal gradient method combined with a stochastic method to solve the regularized loss function; a closed-form solution can be obtained at each iteration of our model. Consider the minimization problem

$$\begin{aligned} \min \limits _{W} \ \mathcal {L} (f(W),D)+\lambda (1-\mu )\sum \limits _{l=1}^4 \Big (\sum \limits _i |w_{l,i}|\Big )^2+ \lambda \mu \sum \limits _{l=1}^4 \left\| Vec(W^l) \right\| _{\frac{1}{2}}^{\frac{1}{2}} \end{aligned}$$
(5)

When updating \(W_t\), since the regularizer consists of two terms, we first compute an intermediate solution by taking a gradient step using the gradient of the loss only, and then optimize the regularization term by performing a proximal (Euclidean projection) step onto the solution space. We select a batch of samples \(D_i\), with elements \(d_i\in D_i\), and update

$$\begin{aligned} W_{t}\leftarrow W_{t}-\frac{\eta _t}{|D_i|}\sum \limits _{d_i\in D_i} \nabla \mathcal {L} (f(W_t),d_i) \end{aligned}$$
(6)
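A minimal sketch of this gradient step is shown below; grad_fn, the per-sample gradient of the loss with respect to W, is assumed to be supplied by the training framework.

def sgd_step(W, batch, grad_fn, eta):
    # (6): W <- W - eta/|D_i| * sum_{d_i in D_i} grad L(f(W), d_i)
    g = sum(grad_fn(W, d) for d in batch) / len(batch)
    return W - eta * g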

We then apply the proximal mapping to the weights of the current iterate by solving the following optimization problem,

$$\begin{aligned} \begin{aligned} W_{t+\frac{1}{2}}&=prox_{(1-\mu )EL}(W_{t})\\&=\mathop {argmin}\limits _W \frac{1}{2\lambda \eta }\left\| W-W_{t} \right\| _2^2+(1-\mu )EL(W) \end{aligned} \end{aligned}$$
(7)

One attractive point of proximal methods for our problem is that the subproblem can often be solved in closed form, and the solution is usually a shrinkage operator that brings sparsity to the model. The proximal operator for the exclusive sparsity regularizer, \(prox_{(1-\mu )EL}(W)\), is obtained elementwise as follows:

$$\begin{aligned} \begin{aligned} \big [prox_{(1-\mu )EL}(W)\big ]_{g,i}&= \Big (1-\frac{\lambda (1-\mu ) \left\| W_g \right\| _1}{\left| w_{g,i}\right| }\Big )_{+} \ w_{g,i}\\&=\mathop {sign}(w_{g,i})\big (\left| w_{g,i} \right| -\lambda (1-\mu )\left\| W_g \right\| _1 \big )_{+} \end{aligned} \end{aligned}$$
(8)
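A direct transcription of (8) into the running NumPy sketch is given below; as before, the rows of W are treated as the groups, which is an assumption made for illustration.

def prox_exclusive(W, lam, mu):
    # (8): sign(w_gi) * ( |w_gi| - lam*(1-mu)*||W_g||_1 )_+ , elementwise.
    W = np.asarray(W, dtype=float)
    group_l1 = np.sum(np.abs(W), axis=1, keepdims=True)   # ||W_g||_1 per row
    return np.sign(W) * np.maximum(np.abs(W) - lam * (1.0 - mu) * group_l1, 0.0)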

Now we consider how to compute the proximal operator of the \(\ell _{1/2}\) regularization,

$$\begin{aligned} \begin{aligned} W_{t+1}&=prox_{\mu \ell _{1/2}}(W_{t+\frac{1}{2}})\\&=\mathop {argmin}\limits _W \frac{1}{2\lambda \eta }\left\| W-W_{t+\frac{1}{2}} \right\| _2^2+\mu \left\| Vec(W) \right\| _{\frac{1}{2}}^{\frac{1}{2}} \end{aligned} \end{aligned}$$
(9)
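Problem (9) separates over the entries of W, and each scalar subproblem admits the closed-form half-thresholding solution reported in the \(\ell _{1/2}\) literature [9,10,11,12,13]. The sketch below transcribes that formula; the constants, and the effective parameter (taken here as \(2\lambda \eta \mu \) to match the \(\frac{1}{2\lambda \eta }\) scaling in (9)), should be checked against those references.

def half_threshold(V, lam_eff):
    # Elementwise solution of min_x (x - v)^2 + lam_eff * |x|^(1/2):
    # zero below a threshold, otherwise a cosine-based shrinkage of v.
    V = np.asarray(V, dtype=float)
    out = np.zeros_like(V)
    thresh = (54.0 ** (1.0 / 3.0) / 4.0) * lam_eff ** (2.0 / 3.0)
    mask = np.abs(V) > thresh
    v = V[mask]
    phi = np.arccos((lam_eff / 8.0) * (np.abs(v) / 3.0) ** (-1.5))
    out[mask] = (2.0 / 3.0) * v * (1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0))
    return out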

The combined regularizer can thus be handled by applying the two proximal operators at each iteration, after the gradient update of the variable. The process is described in Algorithm 1. When the training process terminates, some weights have become zero; these connections are then removed, yielding a sparse architecture.

Algorithm 1 (pseudocode figure)
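For completeness, a hedged sketch of one epoch of the procedure in Algorithm 1 is given below, assembled from the helper functions sketched in the previous sections; the batching scheme, grad_fn, and the way the step size enters the thresholds are illustrative assumptions rather than the exact pseudocode of Algorithm 1.

def train_epoch(W, batches, grad_fn, eta, lam, mu):
    # One pass of the stochastic proximal gradient scheme: a gradient step
    # on the loss (6), then the two proximal mappings (7)-(8) and (9).
    for batch in batches:
        W = sgd_step(W, batch, grad_fn, eta)           # gradient step, eq. (6)
        W = prox_exclusive(W, lam * eta, mu)           # exclusive sparsity prox
        W = half_threshold(W, 2.0 * lam * eta * mu)    # l_1/2 half thresholding
    return W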

5 Experiments

5.1 Baselines

We compare our proposed method with several relevant baselines:

  • \(\ell _1\)

  • Sparse Group Lasso (SGL) [4]

  • Combined Group and Exclusive Sparsity (CGES) [5]

  • Combined Group and TL1 Sparsity (IGTL) [28]

5.2 Network Setup

We use the TensorFlow framework to implement and evaluate our models. In all cases, we choose the ReLU activation function for the network,

$$\begin{aligned} \sigma (x)=\max (0,x) \end{aligned}$$
(10)

One-hot encoding is used to encode the different classes. We apply the softmax function as the activation for the output layer, defined by

$$\begin{aligned} \rho (x_i)=\frac{e^{x_i}}{\sum _{j=1}^{n} e^{x_j}} \end{aligned}$$
(11)

where x denotes the input vector to the softmax and i denotes the component index. We initialize the weights of the network randomly according to a normal distribution. The batch size depends on the dimensionality of the problem. We train with the standard cross-entropy loss, which is defined as

$$\begin{aligned} \mathcal {L}=-\sum \limits _{i=1}^{n} y_i \log (f(x_i)) \end{aligned}$$
(12)
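For reference, (11) and (12) amount to the following few lines of NumPy (the max-shift and the small constant inside the logarithm are standard numerical-stability details not stated in the text).

def softmax(x):
    # (11): rho(x)_i = exp(x_i) / sum_j exp(x_j); shifting by max(x) avoids
    # overflow and does not change the result.
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def cross_entropy(y_onehot, probs):
    # (12): L = - sum_i y_i * log(f(x_i)), with one-hot labels y.
    return float(-np.sum(y_onehot * np.log(probs + 1e-12)))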

5.3 Measurement

We use accuracy to measure the performance of the model, floating-point operations (FLOPs) to represent the reduction in computational complexity, and the percentage of parameters used, relative to the fully connected network, to represent model size. The results of our experiments are reported in Table 1.

Table 1. Performance of each model on MNIST

6 Conclusion

We combine the exclusive sparsity regularization term with the \(\ell _{1/2}\) quasi-norm and use dropout to remove neurons. We apply \(\ell _{1/2}\) regularization within the neural network framework, fully exploiting the sparsity it induces and its suitability for large-scale problems. We also combine the stochastic method with the optimization algorithm and transform the problem into a half-thresholding problem via the proximal method, so that the corresponding sparse subproblem can be solved easily and the complexity of the solution is reduced.