
1 Introduction

Deep neural networks (DNNs) have achieved unprecedented performance in a number of fields such as speech recognition [1], computer vision [2], and natural language processing [22]. However, these successes rely heavily on DNNs with a huge number of parameters and high computation capability [5]. For instance, the work by Krizhevsky et al. [2] achieved dramatic results in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) using a network containing 60 million parameters. The convolutional neural network VGG [27], which won ILSVRC 2014, consists of 15M neurons and 144M parameters. Such model sizes make the deployment of DNNs impractical on devices with limited memory and computing power. Moreover, a large number of parameters tends to degrade the generalization of the model [4, 5]. There is thus a growing interest in reducing the complexity of DNNs.

Existing work on model compression and acceleration of DNNs can be categorized into four types: parameter pruning and sparsity regularizers, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. Among these techniques, one class focuses on promoting sparsity in DNNs. DNNs contain many redundant weights, which occupy unnecessary computational resources and may cause overfitting and poor generalization. Network sparsity has been shown to be effective for reducing network complexity and addressing the overfitting problem [24, 25].

Sparsity for DNNs can be further classified into pruning and sharing, matrix design and factorization, randomly reducing the complexity, and sparse optimization. Pruning and sharing methods remove redundant, non-informative weights with a negligible drop in accuracy. However, the pruning criteria require manual setup for each layer, which demands fine-tuning of the parameters and can be cumbersome for some applications.

The second class of methods reduces memory costs through structured matrices. However, the structural constraint may introduce bias into the model, and finding a suitable structured matrix is difficult. Matrix factorization uses low-rank filters to accelerate convolution, with the low-rank approximation performed layer by layer. However, the implementation is computationally expensive and cannot perform global parameter compression.

The third class of methods randomly reduces the size of the network during training. A typical example is dropout, which randomly removes hidden neurons in the DNN. These methods reduce overfitting effectively but require more training time.

Recently, training compact CNNs with sparsity constraints has attracted increasing attention. Such sparsity constraints are typically introduced into the optimization problem as structured and sparse regularizers on the network weights. In [26], sparse updates such as the \(\ell _1\) regularizer, the shrinkage operator, and projection onto \(\ell _0\) balls are applied to each layer during training. Nevertheless, these methods often result in a heavy loss of accuracy. Group sparsity and the \(\ell _1\) norm are integrated in [3, 4] to obtain a sparse network with fewer parameters. Group sparsity and exclusive sparsity are combined as a regularization term in a recent work [5]. Experiments show that these methods can achieve better performance than the original network.

The key challenge of sparse optimization is the design of the regularization term. The \(\ell _0\) regularizer is the most intuitive form of sparse regularizer, but minimizing the \(\ell _0\)-regularized problem is NP-hard [15]. The \(\ell _1\) regularizer is a convex relaxation of \(\ell _0\), which is popular and easy to solve. Although \(\ell _1\) enjoys several good properties, it may cause bias in estimation [8]. Fan and Li [8] propose the smoothly clipped absolute deviation (SCAD) penalty function to ameliorate \(\ell _1\), which has been proven to be unbiased. Later, many other nonconvex regularizers were proposed, including the minimax concave penalty (MCP) [16], the \(\ell _p\) penalty with \(p\in (0,1)\) [9,10,11,12,13], \(\ell _{1-2}\) [17, 18] and the transformed \(\ell _1\) (TL1) [19,20,21].

Optimization methods play a central role in DNNs. Training such networks with sparse regularizers amounts to minimizing a high-dimensional, non-convex and non-smooth objective function, and is often tackled with simple first-order methods such as stochastic gradient descent. Since the proximal gradient method is an efficient method for non-smooth programming and is suitable for our model, we adopt it and borrow the strengths of stochastic methods, such as fast convergence, the ability to avoid overfitting, and suitability for high-dimensional models.

In this paper, we consider non-convex regularizers to sparsify the network weights so that non-essential weights are zeroed out with minimal loss of performance. We choose a simple regularization term, rather than a regularizer with multiple terms, to sparsify the weights, and combine it with dropout to remove neurons.

2 Related Work

2.1 Sparsity for DNNs

There are two methodologies for making networks sparse. One class focuses on inducing sparsity among connections [3, 4] to reinforce the competitiveness of features. The \(\ell _1\) regularizer is applied as part of the regularization term to remove redundant connections. An extension of the \(\ell _{1,2}\) norm adopted in [5] not only achieves the same effect but also balances the sparsity across groups.

Another class focuses on sparsity at the neuron level. Group sparsity is a typical example [3, 4, 6, 7], which is designed to drive all the variables in a group to zero. In DNNs, when each group is defined as all weights from one neuron, all outgoing weights of that neuron become zero simultaneously. Group sparsity can automatically decide how many neurons to use at each layer, force the network to have a redundant representation, and prevent the co-adaptation of features.

2.2 Non-convex Sparse Regularizers

Fan and Li [8] argued that a good penalty function should result in an estimator with three properties: sparsity, unbiasedness and continuity, and that regularization terms with these properties must be nonconvex. The smoothly clipped absolute deviation (SCAD) [8] and the minimax concave penalty (MCP) [16] are regularizers that fulfil these properties. In recent years, nonconvex metrics in concise forms have also been considered, such as \(\ell _{1-2}\) [17, 18, 31], the transformed \(\ell _1\) (TL1) [19,20,21] and \(\ell _p\) with \(p\in (0,1)\) [9,10,11,12,13, 32].

3 The Proposed Approach

We aim to obtain a sparse network whose test accuracy is comparable to, or even better than, that of the original model. The objective function is defined by

$$\begin{aligned} \min \limits _{W} \ \mathcal {L} (f(W),D)+\lambda \varOmega (W) \end{aligned}$$
(1)

where f is the prediction function parameterized by W, \(D=\{ x_i,y_i\}_{i=1}^N\) is a training set with N instances, \(x_i\in \mathbb {R}^p\) is a p-dimensional input sample, and \(y_i\in \{1,...,K\}\) is its corresponding class label. \(\mathcal {L}\) is the loss function, \(\varOmega \) is the regularizer, and \(\lambda \) is the parameter that balances the loss and the regularization term. In DNNs, W represents the set of weight matrices, and the regularization term can be written as the sum of the regularizers applied to the weight matrix of each layer.

\(\ell _p\) regularization \((0< p< 1)\) is studied in [9,10,11,12,13]. The \(\ell _p\) quasi-norm of a variable \(x\in \mathbb {R}^N\) is defined by

$$\begin{aligned} \left\| x\right\| _p= \Big (\sum \limits _{i=1}^N |x_i|^p\Big )^{\frac{1}{p}} \end{aligned}$$
(2)

which is nonconvex, nonsmooth and non-Lipschitz.
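As a quick illustration (not part of the original derivation), the quantity in (2), and the p-th power that actually appears in our regularizer, can be computed with a few lines of NumPy; the sketch below is purely expository.

import numpy as np

def lp_quasi_norm(x, p=0.5):
    # (2): ||x||_p = ( sum_i |x_i|^p )^(1/p); the regularizer later uses
    # the p-th power ||x||_p^p = sum_i |x_i|^p rather than the quasi-norm itself.
    x = np.abs(np.asarray(x, dtype=float).ravel())
    return np.sum(x ** p) ** (1.0 / p)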

We use an extension of \(\ell _{1,2}\), called the exclusive sparsity regularizer, to promote competition for features between different weights, making them suitable for disjoint feature sets. For \(X\in \mathbb {R}^{n \times m}\), the exclusive regularizer is defined by

$$\begin{aligned} EL(X)= \sum \limits _{i=1}^n \Big (\sum \limits _{j=1}^m |x_{ij}|\Big )^{2} \end{aligned}$$
(3)

The inner \(\ell _1\) norm promotes sparsity within each group, while the outer \(\ell _2\)-type combination balances the weights between groups, so that the sparsity of each group is relatively even and the number of non-zero weights per group is similar.
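For concreteness, a minimal NumPy sketch of (3) is given below, continuing the snippet above; treating each row of the weight matrix as one group is an assumption made only for this illustration.

def exclusive_sparsity(X):
    # (3): EL(X) = sum_i ( sum_j |x_ij| )^2, with rows taken as groups here.
    X = np.asarray(X, dtype=float)
    return float(np.sum(np.sum(np.abs(X), axis=1) ** 2))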

In this work, we define the regularizer as follows,

$$\begin{aligned} \varOmega (W)=(1-\mu )\sum \limits _g (\sum \limits _i |w_{g,i}|)^2+ \mu \left\| Vec(W) \right\| _{\frac{1}{2}}^{\frac{1}{2}} \end{aligned}$$
(4)

where Vec(W) denotes the vectorization of the weight matrix and \(\mu \) balances the two terms.
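Continuing the sketch above, the combined regularizer (4) for a single weight matrix can be evaluated as follows; mu is the mixing parameter between the two terms.

def omega(W, mu):
    # (4): Omega(W) = (1 - mu) * EL(W) + mu * ||Vec(W)||_{1/2}^{1/2}
    half_quasi = float(np.sum(np.abs(W) ** 0.5))   # ||Vec(W)||_{1/2}^{1/2}
    return (1.0 - mu) * exclusive_sparsity(W) + mu * half_quasi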

4 The Optimization Algorithm

4.1 Combined Exclusive and Half Thresholding Method

In our work, we use the proximal gradient method combined with a stochastic method to solve the regularized loss function; a closed-form solution can be obtained at each iteration of our model. Consider the minimization problem

$$\begin{aligned} \min \limits _{W} \ \mathcal {L} (f(W),D)+\lambda (1-\mu )\sum \limits _{l=1}^4 \Big (\sum \limits _i |w_{l,i}|\Big )^2+ \lambda \mu \sum \limits _{l=1}^4 \left\| Vec(W^l) \right\| _{\frac{1}{2}}^{\frac{1}{2}} \end{aligned}$$
(5)

When updating \(W_t\), since the regularizer consists of two terms, we first compute an intermediate solution by taking a gradient step using the gradient of the loss only, and then optimize the regularization term by performing a proximal (Euclidean projection) step onto the solution space. We select a batch of samples \(D_i\), with elements \(d_i\in D_i\), and update

$$\begin{aligned} W_{t}\leftarrow W_{t}-\frac{\eta _t}{|D_i|}\sum \limits _{d_i\in D_i} \nabla \mathcal {L} (f(W_t),d_i) \end{aligned}$$
(6)
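A minimal sketch of this gradient step is shown below; grad_fn, the per-sample gradient of the loss with respect to W, is assumed to be supplied by the training framework.

def sgd_step(W, batch, grad_fn, eta):
    # (6): W <- W - eta/|D_i| * sum_{d_i in D_i} grad L(f(W), d_i)
    g = sum(grad_fn(W, d) for d in batch) / len(batch)
    return W - eta * g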

We then apply the proximal mapping to the weights of the current iterate by solving the following optimization problem,

$$\begin{aligned} \begin{aligned} W_{t+\frac{1}{2}}&=prox_{(1-\mu )EL}(W_{t})\\&=\mathop {argmin}\limits _W \frac{1}{2\lambda \eta }\left\| W-W_{t} \right\| _2^2+(1-\mu )EL(W) \end{aligned} \end{aligned}$$
(7)

One attractive point of proximal methods for our problem is that the subproblem can often be solved in closed form, and the solution is usually a shrinkage operator that brings sparsity to the model. The proximal operator for the exclusive sparsity regularizer, \(prox_{(1-\mu )EL}(W)\), is obtained elementwise as follows:

$$\begin{aligned} \begin{aligned} \big [prox_{(1-\mu )EL}(W)\big ]_{g,i}&= \Big (1-\frac{\lambda (1-\mu ) \left\| W_g \right\| _1}{\left| w_{g,i}\right| }\Big )_{+} \ w_{g,i}\\&=\mathop {sign}(w_{g,i})\big (\left| w_{g,i} \right| -\lambda (1-\mu )\left\| W_g \right\| _1 \big )_{+} \end{aligned} \end{aligned}$$
(8)
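A direct transcription of (8) into the running NumPy sketch is given below; as before, the rows of W are treated as the groups, which is an assumption made for illustration.

def prox_exclusive(W, lam, mu):
    # (8): sign(w_gi) * ( |w_gi| - lam*(1-mu)*||W_g||_1 )_+ , elementwise.
    W = np.asarray(W, dtype=float)
    group_l1 = np.sum(np.abs(W), axis=1, keepdims=True)   # ||W_g||_1 per row
    return np.sign(W) * np.maximum(np.abs(W) - lam * (1.0 - mu) * group_l1, 0.0)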

Now we consider how to compute the proximal operator of the \(\ell _{1/2}\) regularization,

$$\begin{aligned} \begin{aligned} W_{t+1}&=prox_{\mu \ell _{1/2}}(W_{t+\frac{1}{2}})\\&=\mathop {argmin}\limits _W \frac{1}{2\lambda \eta }\left\| W-W_{t+\frac{1}{2}} \right\| _2^2+\mu \left\| Vec(W) \right\| _{\frac{1}{2}}^{\frac{1}{2}} \end{aligned} \end{aligned}$$
(9)
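Problem (9) separates over the entries of W, and each scalar subproblem admits the closed-form half-thresholding solution reported in the \(\ell _{1/2}\) literature [9,10,11,12,13]. The sketch below transcribes that formula; the constants, and the effective parameter (taken here as \(2\lambda \eta \mu \) to match the \(\frac{1}{2\lambda \eta }\) scaling in (9)), should be checked against those references.

def half_threshold(V, lam_eff):
    # Elementwise solution of min_x (x - v)^2 + lam_eff * |x|^(1/2):
    # zero below a threshold, otherwise a cosine-based shrinkage of v.
    V = np.asarray(V, dtype=float)
    out = np.zeros_like(V)
    thresh = (54.0 ** (1.0 / 3.0) / 4.0) * lam_eff ** (2.0 / 3.0)
    mask = np.abs(V) > thresh
    v = V[mask]
    phi = np.arccos((lam_eff / 8.0) * (np.abs(v) / 3.0) ** (-1.5))
    out[mask] = (2.0 / 3.0) * v * (1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0))
    return out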

The combined regularizer can thus be handled by applying the two proximal operators at each iteration, after the gradient update of the variable. The process is described in Algorithm 1. When the training process terminates, some weights have become zero; these connections are then removed, yielding a sparse architecture.

Algorithm 1 (pseudocode figure)
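For completeness, a hedged sketch of one epoch of the procedure in Algorithm 1 is given below, assembled from the helper functions sketched in the previous sections; the batching scheme, grad_fn, and the way the step size enters the thresholds are illustrative assumptions rather than the exact pseudocode of Algorithm 1.

def train_epoch(W, batches, grad_fn, eta, lam, mu):
    # One pass of the stochastic proximal gradient scheme: a gradient step
    # on the loss (6), then the two proximal mappings (7)-(8) and (9).
    for batch in batches:
        W = sgd_step(W, batch, grad_fn, eta)           # gradient step, eq. (6)
        W = prox_exclusive(W, lam * eta, mu)           # exclusive sparsity prox
        W = half_threshold(W, 2.0 * lam * eta * mu)    # l_1/2 half thresholding
    return W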

5 Experiments

5.1 Baselines

We compare our proposed method with several relevant baselines:

  • \(\ell _1\)

  • Sparse Group Lasso (SGL) [4]

  • Combined Group and Exclusive Sparsity (CGES) [5]

  • Combined Group and TL1 Sparsity (IGTL) [28]

5.2 Network Setup

We use the TensorFlow framework to implement and evaluate our models. In all cases, we choose the ReLU activation function for the network,

$$\begin{aligned} \sigma (x)=\max (0,x) \end{aligned}$$
(10)

One-hot encoding is used to encode the different classes. We apply the softmax function as the activation for the output layer, defined by

$$\begin{aligned} \rho (x_i)=\frac{e^{x_i}}{\sum _{j=1}^{n} e^{x_j}} \end{aligned}$$
(11)

where x denotes the input vector to the softmax and i denotes the component index. We initialize the weights of the network randomly according to a normal distribution. The batch size depends on the dimensionality of the problem. We train with the standard cross-entropy loss, which is defined as

$$\begin{aligned} \mathcal {L}=-\sum \limits _{i=1}^{n} y_i \log (f(x_i)) \end{aligned}$$
(12)
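For reference, (11) and (12) amount to the following few lines of NumPy (the max-shift and the small constant inside the logarithm are standard numerical-stability details not stated in the text).

def softmax(x):
    # (11): rho(x)_i = exp(x_i) / sum_j exp(x_j); shifting by max(x) avoids
    # overflow and does not change the result.
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def cross_entropy(y_onehot, probs):
    # (12): L = - sum_i y_i * log(f(x_i)), with one-hot labels y.
    return float(-np.sum(y_onehot * np.log(probs + 1e-12)))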

5.3 Measurement

We use accuracy to measure the performance of the model, floating-point operations (FLOPs) to represent the reduction in computational complexity, and the percentage of parameters used, relative to the fully connected network, to represent model size. The results of our experiments are reported in Table 1.

Table 1. Performance of each model on MNIST

6 Conclusion

We combine the exclusive sparsity regularization term with the \(\ell _{1/2}\) quasi-norm and use dropout to remove neurons. We apply \(\ell _{1/2}\) regularization within the neural network framework, fully exploiting the sparsity it induces and its suitability for large-scale problems. We also combine the stochastic method with the optimization algorithm and transform the problem into a half-thresholding problem via the proximal method, so that the corresponding sparse subproblem can be solved easily and the complexity of the solution is reduced.