1 Introduction

The focus of this paper is on the following structured minimization:

$$\begin{aligned} \min _{x\in {\mathbb {R}}^n} \ F(x):=f(x)+\mu \Vert x\Vert _1, \end{aligned}$$
(1.1)

where \(f:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) is a continuously differentiable (possibly nonconvex) function that is bounded below, \(\Vert \cdot \Vert _1\) is the \(\ell _{1}\)-norm of a vector, and the parameter \(\mu >0\) trades off the two terms. Given its structure, problem (1.1) covers a wide range of formulations arising in different scientific fields, including linear and logistic regression [45], compressive sensing [18], sparse inverse covariance estimation [46], and sparse principal component analysis [27, 42].

1.1 Problem Formulations

A special case of model (1.1) is the \(\ell _{1}\)-norm regularized least squares problem shown below

$$\begin{aligned} \min _{x\in {\mathbb {R}}^n} \frac{1}{2}\Vert Ax - b\Vert _2^2 + \mu \Vert x\Vert _1, \end{aligned}$$
(1.2)

where \(A\in {\mathbb {R}}^{m\times n}\,(m\ll n)\) is a linear operator, and \(b\in {\mathbb {R}}^m\) is an observation. Model (1.2) appears mainly in compressive sensing, an emerging methodology in digital signal processing that has attracted intensive research activity over the past years [8–11, 18]. Compressive sensing is based on the fact that if the original signal is sparse or approximately sparse in some orthogonal basis, an exact restoration can be obtained by solving problem (1.2).

Another prevalent case of (1.1) that has attracted much interest in machine learning is linear and logistic regression. Assume the training data \(A=[a_1,\ldots ,a_m]^\top \in {\mathbb {R}}^{m\times n}\) and class labels \(y\in \{-1,+1\}^m\) are given. A linear classifier is a hyperplane \(\{a:x^\top a+b=0\}\), where \(x\in {\mathbb {R}}^n\) is a vector of weights, and \(b\in {\mathbb {R}}\) is the intercept. A frequently used model is the \(\ell _{2}\)-loss support vector machine, which is given by

$$\begin{aligned} \min _{x\in {\mathbb {R}}^n,b\in {\mathbb {R}}} \sum _{i=1}^m\max \left\{ 0,1-y_i(x^\top a_i+b)\right\} ^2+\mu \Vert x\Vert _1. \end{aligned}$$
(1.3)

The \(\ell _{2}\)-loss function is continuous but not differentiable because of the “max” operation. In the logistic model, the conditional probability of the class label \(y_i\) given \(a_i\) has the form

$$\begin{aligned} p(y_i|a_i)=\frac{1}{1+e^{-(x^\top a_i+b)y_i}}. \end{aligned}$$

The weights \(x\) and the intercept \(b\) can be found by minimizing the average loss as

$$\begin{aligned} \min _{x\in {\mathbb {R}}^n,b\in {\mathbb {R}}} \frac{1}{m}\sum _{i=1}^m\phi \left( \left( x^\top a_i+b\right) y_i\right) , \end{aligned}$$

where \(\phi (z)=\log (1+e^{-z})\) is the logistic loss function. Finding the maximum likelihood estimate of \(x\) and \(b\) is called logistic regression. To derive a sparse vector \(x\), the \(\ell _{1}\)-regularized logistic regression can be formulated as

$$\begin{aligned} \min _{x\in {\mathbb {R}}^n,b\in {\mathbb {R}}} \frac{1}{m}\sum _{i=1}^m\text {log}\Big (1+e^{-(x^\top a_i+b)y_i}\Big )+\mu \Vert x\Vert _1. \end{aligned}$$
(1.4)

Obviously, the logistic loss function is twice differentiable.
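To make the models above concrete, the following is a minimal sketch, assuming NumPy, of the \(\ell _{1}\)-regularized logistic objective (1.4) and the gradient of its smooth part; the function and variable names are ours and purely illustrative.

```python
import numpy as np

def l1_logreg_objective(x, b0, A, y, mu):
    """F(x, b0) = (1/m) * sum_i log(1 + exp(-(a_i^T x + b0) y_i)) + mu * ||x||_1."""
    z = y * (A @ x + b0)                       # margins (x^T a_i + b0) y_i
    loss = np.mean(np.logaddexp(0.0, -z))      # stable log(1 + exp(-z))
    return loss + mu * np.sum(np.abs(x))

def smooth_logreg_gradient(x, b0, A, y):
    """Gradient of the smooth logistic term with respect to (x, b0)."""
    z = y * (A @ x + b0)
    s = -y / (1.0 + np.exp(z))                 # -y_i * sigma(-z_i)
    return (A.T @ s) / len(y), np.mean(s)
```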

Although these models have similar structures, they can be very different from a real-data point of view. For example, in compressive sensing the number of measurements \(m\) is much smaller than the length \(n\) of the original signal \((m\ll n)\), and the encoding matrix \(A\) is dense. In machine learning, by contrast, the numbers of instances \(m\) and features \(n\) are both large, and the data matrix \(A\) is typically very sparse.

1.2 Existing Algorithms

As the \(\ell _{1}\)-regularization term is non-differentiable whenever \(x\) contains zero components, the use of standard unconstrained smooth optimization tools is generally precluded. Over the past decades, various approaches have been proposed, analyzed, and implemented in the compressive sensing and machine learning literature. One line of work designs algorithms for special cases in which \(f(x)\) has a specific functional form, such as the least squares loss in (1.2), the squared hinge loss in (1.3), and the logistic loss in (1.4). In the following, we briefly review some of these approaches.

The first popular class of approaches is coordinate descent. At the current iterate \(x_k\), the simple coordinate descent method updates one component at a time, generating \(x_k^j,\,j=1,\ldots ,n+1\), with \(x_k^1=x_k\) and \(x_k^{n+1}=x_{k+1}\), by solving the one-dimensional subproblem

$$\begin{aligned} \min _z\quad F\left( x_k^j+ze^j\right) -F\left( x_k^j\right) , \end{aligned}$$
(1.5)

where \(e^j\) denotes the \(j\)th column of the identity matrix. Clearly, the objective function has a single variable \(z\) and a single non-differentiable point, namely the value of \(z\) at which the \(j\)th component of \(x_k^j+ze^j\) vanishes. To solve the logistic regression model (1.4), Bayesian binary regression (BBR) [21] solves the subproblem approximately using a trust region method with a Newton step; coordinate descent with Newton step (CDN) [12] improves the performance of BBR by applying a one-dimensional Newton method and a line search technique. Instead of cyclically updating one component at a time, the stochastic coordinate descent method [35] randomly selects the working components to achieve enhanced performance; the block coordinate gradient descent (CGD) algorithm [37, 47] is based on an approximate convex quadratic model of \(f\) and selects the working variables according to certain rules.
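As an illustration of this class of methods (not of any particular solver cited above), the following sketch, assuming NumPy, performs cyclic coordinate descent on the least squares instance (1.2), where each one-dimensional subproblem (1.5) has a closed-form soft-thresholding solution:

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def cd_lasso(A, b, mu, n_sweeps=100):
    """Cyclic coordinate descent for 0.5*||Ax - b||_2^2 + mu*||x||_1."""
    m, n = A.shape
    x = np.zeros(n)
    col_sq = np.sum(A**2, axis=0)            # ||A_j||_2^2 for each column
    r = b - A @ x                            # running residual
    for _ in range(n_sweeps):
        for j in range(n):
            if col_sq[j] == 0.0:
                continue
            r_j = r + A[:, j] * x[j]         # residual with coordinate j removed
            x_new = soft_threshold(A[:, j] @ r_j / col_sq[j], mu / col_sq[j])
            r += A[:, j] * (x[j] - x_new)    # update residual
            x[j] = x_new
    return x
```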

The second type of approach is to transform model (1.1) into an equivalent box-constrained optimization problem by variable splitting. Let \(x=u-v\) with \(u_i=\max \{0,x_i\}\) and \(v_i=\max \{0,-x_i\}\). Then, model (1.1) can be reformulated equivalently as

$$\begin{aligned} \min _{u,v} f(u-v)+\mu \sum _{i=1}^n(u_i+v_i), \quad \text {s.t.} \ u\ge 0,\quad v\ge 0. \end{aligned}$$
(1.6)

The objective function and constraints are smooth, so the model can be solved by any standard box-constrained optimization technique. An obvious drawback of this approach, however, is that it doubles the number of variables. Gradient projection for sparse reconstruction (GPSR) [20] solves (1.2) via the reformulation (1.6), using the Barzilai–Borwein gradient method [2] with an efficient nonmonotone line search [22]; the method is in fact an application of the well-known spectral projected gradient method [5] to compressive sensing. The trust region Newton algorithm [26, 45] solves the logistic regression model (1.4) via the reformulation (1.6) and has proven highly competitive in a series of comparisons. To solve (1.3) and (1.4), the interior-point algorithm [24, 25] forms a sequence of unconstrained approximations by appending a “barrier” function to the objective of (1.6), thereby ensuring that \(u\) and \(v\) remain sufficiently positive. Moreover, truncated Newton steps and preconditioned conjugate gradient iterations are used to produce the search direction.
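For concreteness, the following is a minimal sketch, assuming NumPy and the least squares choice \(f(x)=\frac{1}{2}\Vert Ax-b\Vert _2^2\), of the ingredients of the split formulation (1.6): its smooth objective, its gradients with respect to \(u\) and \(v\), and the projection onto the box constraints (the function names are ours):

```python
import numpy as np

def split_objective(u, v, A, b, mu):
    """Objective of (1.6) with f(x) = 0.5*||Ax - b||^2 and x = u - v, u, v >= 0."""
    r = A @ (u - v) - b
    return 0.5 * r @ r + mu * (np.sum(u) + np.sum(v))

def split_gradients(u, v, A, b, mu):
    """Gradients of the (smooth) objective of (1.6) with respect to u and v."""
    g = A.T @ (A @ (u - v) - b)              # gradient of f at x = u - v
    return g + mu, -g + mu

def project_box(u, v):
    """Projection onto the feasible set u >= 0, v >= 0."""
    return np.maximum(u, 0.0), np.maximum(v, 0.0)
```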

The third type of method approximates the \(\ell _{1}\)-regularization term with a differentiable function. The simplest approach replaces the \(\ell _{1}\)-norm with a sum of multiquadric functions

$$\begin{aligned} l(x) \triangleq \sum _{i=1}^n \sqrt{x_i^2+\epsilon }, \end{aligned}$$

where \(\epsilon \) is a small positive scalar. This function is twice differentiable, and \(\lim _{\epsilon \rightarrow 0^+}l(x)=\Vert x\Vert _1\). Several smooth unconstrained optimization approaches can then be applied to the resulting approximation. However, the performance of these algorithms depends strongly on the parameter value, and the condition number of the corresponding Hessian matrix grows as \(\epsilon \) decreases. Nesterov’s smoothing technique [28] constructs smooth functions that approximate certain structured convex nonsmooth functions. Based on this technique, NESTA (Nesterov’s Algorithm) [4] solves problem (1.2) using only first-order gradient information.
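The smoothed surrogate and its gradient are straightforward to compute; a minimal sketch assuming NumPy (function names ours):

```python
import numpy as np

def smoothed_l1(x, eps=1e-6):
    """Twice-differentiable surrogate for ||x||_1: sum_i sqrt(x_i^2 + eps)."""
    return np.sum(np.sqrt(x**2 + eps))

def smoothed_l1_grad(x, eps=1e-6):
    """Gradient of the surrogate; approaches sign(x) as eps -> 0+."""
    return x / np.sqrt(x**2 + eps)
```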

The fourth class of approaches consists of subgradient-based Newton-type algorithms. An important attempt in this class is that of Andrew and Gao [1], who extend the well-known limited memory BFGS method [30] to the \(\ell _{1}\)-regularized logistic regression model (1.4) and propose an orthant-wise limited memory quasi-Newton method. At each iteration, this method computes a search direction over an orthant containing the previous point. The subspace BFGS method [44] uses an inner iteration to find a descent quasi-Newton direction and subgradient Wolfe conditions to determine a stepsize that ensures a decrease in the objective function. This method is globally convergent and capable of solving general nonsmooth convex minimization problems.

Aside from GPSR and NESTA, other specially designed solvers are available. Using an operator splitting technique, Hale, Yin, and Zhang derive the iterative shrinkage/thresholding (IST) fixed-point continuation algorithm (FPC) [23]. Combined with the interior-point algorithm [24], FPC has also been extended to solve large-scale \(\ell _{1}\)-regularized logistic regression [36]. A closely related algorithm is the fixed-point continuation and active-set method FPC_AS [39, 40], which solves a smooth subproblem to determine the magnitudes of the nonzero components of \(x\) based on an active set. The two-step IST algorithm (TwIST) [6] and the fast IST algorithm (FISTA) [3] speed up the IST algorithm, with virtually the same per-iteration complexity but better convergence properties. Another closely related method is the sparse reconstruction algorithm (SpaRSA) [41], which minimizes a nonsmooth convex problem with separable structure. The spectral projected gradient algorithm SPGL1 [38] solves the Lasso model (1.2) by the spectral gradient projection method with an efficient Euclidean projection onto the \(\ell _{1}\)-norm ball. The alternating directions method YALL1 [43] investigates \(\ell _{1}\)-norm problems from either the primal or the dual form and solves \(\ell _{1}\)-regularized problems of various types.

All the reviewed algorithms differ in various aspects, such as convergence speed, ease of implementation, and practical applicability. There is no evidence that any single algorithm outperforms the others in all scenarios.

1.3 Contributions and Organization

Although much progress has been made in solving problem (1.1), existing algorithms mainly deal with the case where \(f\) is convex, often simply a least squares term. In this study we propose a Barzilai–Borwein gradient algorithm to solve \(\ell _{1}\)-regularized nonsmooth minimization problems. At each iteration, we approximate \(f\) locally by a convex quadratic model in which the Hessian is replaced by a multiple of the identity matrix given by a spectral coefficient. The search direction is determined by minimizing this quadratic model while fully exploiting the structure of the \(\ell _{1}\)-norm. We show that the generated direction contains the one used in FPC_AS [40] as a special case and is a descent direction, which guarantees the existence of a positive stepsize along it. In our algorithm, we adopt the nonmonotone line search of Grippo, Lampariello, and Lucidi [22], which allows the function values to increase occasionally while still decreasing over the whole iterative process. The nonmonotone line search is attractive because it saves a considerable number of function evaluations, which dominate the computational cost on large datasets. The method is easy to implement, as only the objective function value and the gradient of the smooth term are needed at each iteration. We show that every cluster point of the iterates generated by this algorithm is a stationary point of \(F\). Although we mainly consider the \(\ell _{1}\)-regularizer in this study, \(\ell _{2}\)-norm regularization and matrix trace norm problems can also be readily handled within our framework, thereby broadening the capability of the algorithm. We implement the algorithm to solve problem (1.1) with \(f\) a nonconvex smooth function from the CUTEr library to show its efficiency. We also run the algorithm on \(\ell _{1}\)-regularized least squares problems and compare its performance with state-of-the-art algorithms, namely NESTA, CGD, TwIST, FPC_BB, FPC_AS, and GPSR. The results show that the proposed algorithm is effective, competitive, and promising.

We organize the rest of this paper as follows. In Sect. 2, we briefly recall some preliminary results from the optimization literature to motivate our work, construct the search direction, and present the steps of our algorithm along with some remarks. In Sect. 3, we establish the global convergence theorem under some mild conditions. In Sect. 4, we show how to extend the algorithm to solve \(\ell _{2}\)-norm and matrix trace norm minimization problems. In Sect. 5, we present experiments that illustrate the efficiency of the algorithm on \(\ell _{1}\)-regularized nonconvex problems and least squares problems. In Sect. 6, we conclude the paper.

2 Algorithm

2.1 Preliminary Results

First, consider the minimization of the smooth function without the \(\ell _{1}\)-norm regularization

$$\begin{aligned} \min _{x\in {\mathbb {R}}^n} f(x). \end{aligned}$$
(2.1)

The basic idea of Newton’s method for this problem is to iteratively use the quadratic approximation \(q_k\) to the objective function \(f(x)\) at the current iterate \(x_k\) and to minimize the approximation \(q_k\). Let \(f:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) be twice continuously differentiable, and let its Hessian \(G_k=\nabla ^2 f(x_k)\) be positive definite. Function \(f\) at the current \(x_k\) is modeled by the quadratic approximation \(q_k\) as

$$\begin{aligned} f(x_k+s)\approx q_k(s)=f(x_k)+\nabla f(x_k)^\top s+\frac{1}{2} s^\top G_k s, \end{aligned}$$

where \(s=x-x_k\). Minimizing \(q_k(s)\) yields

$$\begin{aligned} x_{k+1}=x_k-G_k^{-1} \nabla f(x_k), \end{aligned}$$

which is Newton’s formula, and \(s_k=x_{k+1}-x_k=-G_k^{-1}\nabla f(x_k)\) is the so-called Newton’s direction.

For a positive definite quadratic function, Newton’s method reaches the minimizer in one iteration. However, when the starting point is far from the solution, there is no guarantee that \(G_k\) is positive definite or that Newton’s direction is a descent direction. Let the quadratic model of \(f\) at \(x_{k+1}\) be

$$\begin{aligned} f(x)\approx f(x_{k+1})+\nabla f(x_{k+1})^\top (x-x_{k+1})+\frac{1}{2} (x-x_{k+1})^\top G_{k+1}(x-x_{k+1}). \end{aligned}$$

Differentiating yields

$$\begin{aligned} \nabla f(x)\approx \nabla f(x_{k+1})+G_{k+1}(x-x_{k+1}). \end{aligned}$$

Setting \(x=x_k,\,s_k=x_{k+1}-x_k\), and \(y_k=\nabla f(x_{k+1})-\nabla f(x_k)\), we obtain

$$\begin{aligned} G_{k+1}s_k\approx y_k. \end{aligned}$$
(2.2)

For many practical problems, computing the Hessian is very expensive, or the Hessian is difficult to evaluate or not even available analytically. These challenges lead to quasi-Newton methods, which generate a sequence of Hessian approximations using only gradient information while maintaining a fast rate of convergence. Instead of computing the Hessian \(G_{k}\), the quasi-Newton method constructs a Hessian approximation \(B_{k}\), where the sequence \(\{B_k\}\) is positive definite and satisfies

$$\begin{aligned} B_{k+1}s_k=y_k. \end{aligned}$$
(2.3)

In general, \(B_{k+1}\) is obtained by updating \(B_k\) with some typical and popular formulae, such as BFGS, DFP, and SR1.

Unfortunately, the standard quasi-Newton algorithm, and even its limited memory versions, does not scale well enough to train very large models involving millions of variables and training instances, which are commonly encountered, for example, in natural language processing. The main computational burden of Newton-type algorithms is the storage of a large matrix at each iteration, which may exceed the memory capacity of a personal computer (PC). Hence, a matrix-free algorithm that is particularly suited to large-scale problems is needed. For this purpose, the Hessian approximation \(B_k\) is further simplified to a positive multiple of the identity, i.e., \(B_k=\lambda _k I\) with the identity matrix \(I\) and \(\lambda _k>0\). Then, the quasi-Newton condition takes the form

$$\begin{aligned} \lambda _{k+1}Is_k=y_k. \end{aligned}$$

Multiplying both sides by \(s_k^\top \) yields

$$\begin{aligned} \lambda _{k+1}^{(1)}=\frac{s_k^\top y_k}{\Vert s_k\Vert _2^2}. \end{aligned}$$
(2.4)

Similarly, multiplying both sides by \(y_k^\top \) yields

$$\begin{aligned} \lambda _{k+1}^{(2)}=\frac{\Vert y_k\Vert _2^2}{s_k^\top y_k}. \end{aligned}$$
(2.5)

As indicated by both formulae, if \(s_k^\top y_k>0\), then the matrix \(\lambda _{k+1}I\) is positive definite, which ensures that the search direction \(-\lambda _{k}^{-1}\nabla f(x_k)\) is a descent direction at the current point.

The formulae (2.4) and (2.5) were first developed by Barzilai and Borwein [2] for the quadratic case of \(f\). The resulting method is essentially a steepest descent method that adopts either (2.4) or (2.5) as the stepsize along the negative gradient direction. Barzilai and Borwein [2] showed that the corresponding iterative algorithm is R-superlinearly convergent in the quadratic case. Raydan [32] presented a globalization strategy based on the nonmonotone line search [22] for the general non-quadratic case. Other developments of the Barzilai–Borwein gradient algorithm can be found in [5, 13, 15–17, 31, 49].
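For reference, the following is a minimal sketch, assuming NumPy, of the Barzilai–Borwein gradient iteration for the smooth problem (2.1) with the spectral coefficient (2.4); it omits the globalizing nonmonotone line search of Raydan [32], and the function names are ours:

```python
import numpy as np

def bb_gradient(grad, x0, max_iter=500, tol=1e-6, lam_min=1e-10, lam_max=1e10):
    """x_{k+1} = x_k - (1/lambda_k) * grad f(x_k), with lambda_k updated by (2.4)."""
    x, g, lam = x0.copy(), grad(x0), 1.0
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        x_new = x - g / lam
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        sy = s @ y
        if sy > 0:                            # keep lambda_{k+1} I positive definite
            lam = sy / (s @ s)                # formula (2.4); (2.5) would be (y @ y) / sy
            lam = min(lam_max, max(lam, lam_min))
        x, g = x_new, g_new
    return x
```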

2.2 Algorithm

Given its simplicity and numerical efficiency, the Barzilai–Borwein gradient method is very effective for large-scale smooth unconstrained minimization problems. However, applying the Barzilai–Borwein gradient algorithm to \(\ell _{1}\)-regularized nonsmooth optimization is problematic because the regularizer is non-differentiable. In this subsection, we construct an iterative algorithm to solve the \(\ell _{1}\)-regularized structured nonconvex optimization problem. The algorithm takes the iterative form

$$\begin{aligned} x_{k+1}=x_k+\alpha _k d_k, \end{aligned}$$

where \(\alpha _k\) is a stepsize, and \(d_k\) is a search direction defined by minimizing a quadratic-approximated model of \(F\).

Now, we turn our attention to the original problem with the \(\ell _{1}\)-regularizer. As the \(\ell _{1}\)-term is non-differentiable, at \(x_k+d_k\) we use the approximation

$$\begin{aligned} \Vert x_k+d_k\Vert _1\approx \Vert x_k\Vert _1+\frac{\Vert x_k+hd_k\Vert _1-\Vert x_k\Vert _1}{h}, \end{aligned}$$

where \(h\) is a small positive number, and the approximation becomes exact when \(h=1\). Subsequently, the objective function \(F\) is approximated by the quadratic model \(Q_k\),

$$\begin{aligned} F(x_k+d)&=f(x_k+d)+\mu \Vert x_k+d\Vert _1\nonumber \\&\approx f(x_k)+\nabla f(x_k)^\top d+\frac{\lambda _k}{2} \Vert d\Vert _2^2 \nonumber \\&\quad +\mu \Big [\Vert x_k\Vert _1 + \frac{\Vert x_k+hd\Vert _1-\Vert x_k\Vert _1}{h}\Big ]=: Q_k(d). \end{aligned}$$
(2.6)

Minimizing (2.6) yields

$$\begin{aligned}&\min _{d\in {\mathbb {R}}^n}Q_k(d)\nonumber \\&\quad \Leftrightarrow \min _{d\in {\mathbb {R}}^n}\nabla f(x_k)^\top d+\frac{\lambda _k}{2} \Vert d\Vert _2^2+\frac{\mu }{h}\Vert x_k+hd\Vert _1\nonumber \\&\quad \Leftrightarrow \min _{d\in {\mathbb {R}}^n}\frac{h^2}{\lambda _k}\Big (\nabla f(x_k)^\top d+\frac{\lambda _k}{2} \Vert d\Vert _2^2+\frac{\mu }{h}\Vert x_k+hd\Vert _1\Big )\nonumber \\&\quad \Leftrightarrow \min _{d\in {\mathbb {R}}^n} \frac{1}{2}\Big \Vert x_k+hd-\big (x_k-\frac{h}{\lambda _k}\nabla f(x_k)\big )\Big \Vert _2^2+\frac{\mu h}{\lambda _k}\Vert x_k+hd\Vert _1\nonumber \\&\quad \Leftrightarrow \min _{d\in {\mathbb {R}}^n}\sum _{i=1}^{n}\Big \{ \frac{1}{2}\Big (x_k^i+hd^i-\big (x_k^i-\frac{h}{\lambda _k}\nabla f^i(x_k)\big )\Big )^2+\frac{\mu h}{\lambda _k}|x_k^i+hd^i|\Big \}, \end{aligned}$$
(2.7)

where \(x_k^i,\,d^i\), and \(\nabla f^i(x_k)\) denote the \(i\)th components of \(x_k,\,d\), and \(\nabla f(x_k)\), respectively. The favorable structure of (2.7) admits the explicit solution

$$\begin{aligned} x_k^i+hd_k^i =\max \Big \{\Big |x_k^i-\frac{h}{\lambda _k}\nabla f^i(x_k)\Big |-\frac{\mu h}{\lambda _k},0\Big \}\frac{x_k^i-\frac{h}{\lambda _k}\nabla f^i(x_k)}{|x_k^i-\frac{h}{\lambda _k}\nabla f^i(x_k)|}. \end{aligned}$$

Hence, the search direction at the current point is

$$\begin{aligned} d_k= -\frac{1}{h}\Big [x_k-\max \Big \{\Big |x_k-\frac{h}{\lambda _k}\nabla f(x_k)\Big |-\frac{\mu h}{\lambda _k},0\Big \} \frac{x_k-\frac{h}{\lambda _k}\nabla f(x_k)}{|x_k-\frac{h}{\lambda _k}\nabla f(x_k)|}\Big ], \end{aligned}$$
(2.8)

where \(|\cdot |\) and “\(\max \)” are interpreted componentwise, and the convention \(0\cdot 0/0=0\) is adopted. When \(\mu =0\), (2.8) reduces to \(d_k=-\lambda _k^{-1}\nabla f(x_k)\), i.e., the traditional Barzilai–Borwein gradient direction in smooth optimization. The key motivation for this formulation is that the optimization problem (2.7) can be solved easily by exploiting the structure of the \(\ell _{1}\)-norm.
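Computationally, the direction (2.8) amounts to a componentwise shrinkage followed by a rescaling; a minimal sketch assuming NumPy (function names ours):

```python
import numpy as np

def soft_shrink(y, tau):
    """Componentwise S(y, tau) = max(|y| - tau, 0) * y / |y|, with 0 * 0/0 = 0."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def search_direction(x, grad_fx, lam, mu, h):
    """Direction (2.8): d = -(1/h) * [ x - S(x - (h/lam)*grad f(x), mu*h/lam) ]."""
    return -(x - soft_shrink(x - (h / lam) * grad_fx, mu * h / lam)) / h
```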

For \(y\in {\mathbb {R}}^n\) and \(\tau >0\), the unique minimizer of \(\tau \Vert x\Vert _1+\frac{1}{2}\Vert x-y\Vert _2^2\) is given explicitly by

$$\begin{aligned} x^*={\mathcal {S}}(y,\tau )=\max \left\{ |y|-\tau ,0\right\} \frac{y}{|y|}. \end{aligned}$$

As a result, in the case \(h=1\), (2.8) reduces to

$$\begin{aligned} d_k={\mathcal {S}}\left( x_k-\frac{1}{\lambda _k}\nabla f(x_k),\frac{\mu }{\lambda _k}\right) -x_k, \end{aligned}$$

which is essentially the search direction of the algorithm FPC_AS [39, 40]. For this reason, in the following we consider only the case \(h\in (0,1)\). This observation shows that our approach generalizes the search direction of FPC_AS through the scalar \(h\). We believe that the search direction should be generated by minimizing a quadratic approximation of both terms in \(F\), not just of the smooth function \(f\). Another advantage of our approach is that an appropriate value of \(h\) may yield improved numerical performance, since the approach is not restricted to the special case \(h=1\).

Lemma 2.1

For any real vectors \(a\in {\mathbb {R}}^n\) and \(b\in {\mathbb {R}}^n\), the following function \(L(x)\) is non-decreasing

$$\begin{aligned} L(x)=\frac{\Vert a+bx\Vert _1-\Vert a\Vert _1}{x},\quad x\in (0,\infty ). \end{aligned}$$
(2.9)

Proof

Note that

$$\begin{aligned} L(x)=\frac{\Vert a+bx\Vert _1-\Vert a\Vert _1}{x}=\sum _{i=1}^n \frac{|a^i+b^ix|-|a^i|}{x} \triangleq \sum _{i=1}^n l^i(x); \end{aligned}$$

hence, it suffices to prove that \(l^i(x)\) is non-decreasing for each \(i\).

(a) When \(a^i\ge 0\) and \(a^i+b^ix\ge 0\), \(l^i(x)=b^i\).

(b) When \(a^i\ge 0\) and \(a^i+b^ix\le 0\), we obtain

$$\begin{aligned} l^i(x)=\frac{-2a^i-b^ix}{x}=\frac{-2a^i}{x}-b^i. \end{aligned}$$

(c) When \(a^i\le 0\) and \(a^i+b^ix\ge 0\), we obtain

$$\begin{aligned} l^i(x)=\frac{2a^i+b^ix}{x}=\frac{2a^i}{x}+b^i. \end{aligned}$$

(d) When \(a^i\le 0\) and \(a^i+b^ix\le 0\), we have \(l^i(x)=-b^i\).

Clearly \(l^i(x)\) is non-decreasing for each case. Hence, \(L(x)\) is non-decreasing. \(\square \)

The following lemma shows that the direction defined by (2.8) is descent if \(d_k\ne 0\).

Lemma 2.2

Suppose that \(\lambda _k>0\) and \(d_k\) is determined by (2.8). Then,

$$\begin{aligned} F(x_k+\theta d_k)\le F(x_k)+\theta \Big [\nabla f(x_k)^\top d_k+\frac{\mu \Vert x_k+hd_k\Vert _1-\mu \Vert x_k\Vert _1}{h}\Big ]+o(\theta ) \quad \theta \in (0,h], \nonumber \\ \end{aligned}$$
(2.10)

and

$$\begin{aligned} \nabla f(x_k)^\top d_k+\frac{\mu \Vert x_k+hd_k\Vert _1-\mu \Vert x_k\Vert _1}{h}\le -\frac{\lambda _k}{2}\Vert d_k\Vert _2^2. \end{aligned}$$
(2.11)

Proof

By the differentiability of \(f\) and the convexity of \(\Vert x\Vert _1\), we show that for any \(\theta \in (0,h]\,(\theta /h\in (0,1])\),

$$\begin{aligned} F(x_k+\theta d_k)-F(x_k)&=f(x_k+\theta d_k)-f(x_k)+\mu \Vert x_k+\theta d_k\Vert _1-\mu \Vert x_k\Vert _1\\&=f(x_k\!+\!\theta d_k)\!-\!f(x_k)\!+\!\mu \Big \Vert \frac{\theta }{h}(x_k\!+\!hd_k)\!+\!\left( 1\!-\! \frac{\theta }{h}\right) x_k\Big \Vert _1-\mu \Vert x_k\Vert _1\\&\le f(x_k\!+\!\theta d_k)\!-\!f(x_k)\!+\!\frac{\theta \mu }{h}\Vert x_k\!+\!hd_k\Vert _1\!+\!\left( \!1\!-\! \frac{\theta }{h}\right) \mu \Vert x_k\Vert _1\!-\!\mu \Vert x_k\Vert _1\\&=\theta \nabla f(x_k)^\top d_k+o(\theta )+\theta \Big [\frac{\mu }{h}\Vert x_k+hd_k\Vert _1-\frac{\mu }{h} \Vert x_k\Vert _1\Big ], \end{aligned}$$

which is exactly (2.10).

Noting that \(d_k\) is the minimizer of (2.6) and \(\theta \in (0,h]\), from (2.6) and the convexity of \(\Vert x\Vert _1\) we have

$$\begin{aligned}&\nabla f(x_k)^\top d_k+\frac{\lambda _k}{2} \Vert d_k\Vert _2^2+\frac{\mu \Vert x_k+hd_k\Vert _1-\mu \Vert x_k\Vert _1}{h} \\&\quad \le \theta \nabla f(x_k)^\top d_k+\frac{\lambda _k}{2} \Vert \theta d_k\Vert _2^2+\frac{\mu }{h}\Vert x_k+\theta hd_k\Vert _1-\frac{\mu }{h} \Vert x_k\Vert _1\\&\quad \le \theta \nabla f(x_k)^\top d_k+\frac{\lambda _k\theta ^2}{2} \Vert d_k\Vert _2^2+\frac{\theta \mu }{h^2}\Vert x_k+h^2d_k\Vert _1+ \frac{\mu }{h}\left( 1-\frac{\theta }{h} \right) \Vert x_k\Vert _1-\frac{\mu }{h} \Vert x_k\Vert _1. \end{aligned}$$

Hence,

$$\begin{aligned}&(1-\theta )\nabla f(x_k)^\top d_k+\frac{\mu }{h}\Vert x_k+hd_k\Vert _1-\frac{\theta \mu }{h^2}\Vert x_k+h^2d_k\Vert _1 \nonumber \\&\quad -\frac{\mu }{h}\left( 1-\frac{\theta }{h}\right) \Vert x_k\Vert _1\le -\frac{\lambda _k}{2}\left( 1-\theta ^2\right) \Vert d_k\Vert _2^2. \end{aligned}$$
(2.12)

The last three terms of the left-hand side of (2.12) can be reorganized as

$$\begin{aligned}&\frac{\mu }{h}\Big \{\Vert x_k+hd_k\Vert _1-\frac{\theta }{h}\Vert x_k+h^2d_k\Vert _1 -\left( 1-\frac{\theta }{h}\right) \Vert x_k\Vert _1\Big \}\nonumber \\&\quad =\frac{\mu }{h}\Big \{\Vert x_k+hd_k\Vert _1-\Vert x_k\Vert _1- \theta \Big [\frac{\Vert x_k+h^2d_k\Vert _1 -\Vert x_k\Vert _1}{h}\Big ]\Big \}\nonumber \\&\quad = \frac{\mu }{h}\Big \{\Vert x_k+hd_k\Vert _1-\Vert x_k\Vert _1-\theta \Big [h\cdot \frac{\Vert x_k+h^2d_k\Vert _1 -\Vert x_k\Vert _1}{h^2}\Big ]\Big \}\nonumber \\&\quad \ge \frac{\mu }{h}\Big \{\Vert x_k+hd_k\Vert _1-\Vert x_k\Vert _1-\theta \Big [h\cdot \frac{\Vert x_k+hd_k\Vert _1 -\Vert x_k\Vert _1}{h}\Big ]\Big \}\nonumber \\&\quad = \frac{\mu }{h}(1-\theta )\left\{ \Vert x_k+hd_k\Vert _1-\Vert x_k\Vert _1 \right\} , \end{aligned}$$
(2.13)

where the inequality is from Lemma 2.1. Combining (2.12) with (2.13), we obtain

$$\begin{aligned} (1-\theta )\nabla f(x_k)^\top d_k+(1-\theta )\frac{\mu \Vert x_k+hd_k\Vert _1-\mu \Vert x_k\Vert _1}{h}\le -\frac{\lambda _k}{2}\left( 1-\theta ^2\right) \Vert d_k\Vert _2^2. \nonumber \\ \end{aligned}$$
(2.14)

Dividing both sides of (2.14) by \((1-\theta )\) and noting \(1-\theta >0\) when \(h\in (0,1)\), we arrive at the desired result (2.11). \(\square \)

When the search direction has been determined, a suitable stepsize along it must be found to obtain the next iterate. In this study we pay particular attention to a nonmonotone line search strategy, which differs from the traditional Armijo or Wolfe–Powell line searches. The traditional Armijo line search requires the function value to decrease monotonically at each iteration. This requirement may force the sequence of iterates to creep along the bottom of a curved narrow valley, which commonly occurs in difficult nonlinear problems. To overcome this difficulty, a possible alternative is to allow occasional increases in the objective function. To further clarify the proposed algorithm, we briefly recall the earliest nonmonotone line search technique of Grippo, Lampariello, and Lucidi [22]. Let \(\delta \in (0,1),\,\rho \in (0,1)\), and let \(\tilde{m}\) be a positive integer. The nonmonotone line search chooses the smallest nonnegative integer \(j_k\) such that the stepsize \(\alpha _k=\tilde{\alpha }\rho ^{j_k}\) satisfies

$$\begin{aligned} f(x_k+\alpha _kd_k)\le \max _{0\le j\le m(k)}f(x_{k-j})+\delta \alpha _k \nabla f(x_k)^\top d_k, \end{aligned}$$
(2.15)

where

$$\begin{aligned} m(0)=0\quad \text {and}\quad 0\le m(k)\le \min \left\{ m(k-1)+1,\tilde{m}\right\} . \end{aligned}$$

If \(m(k)=0\), the above nonmonotone line search reduces to the standard Armijo line search.

Based on Lemma 2.2, the inequality (2.15) should be modified as

$$\begin{aligned} F(x_k+\alpha _kd_k)\le \max _{0\le j\le m(k)}F(x_{k-j})+\delta \alpha _k \Delta _k, \end{aligned}$$
(2.16)

where

$$\begin{aligned} \Delta _k=\nabla f(x_k)^\top d_k+\frac{\mu \Vert x_k+hd_k\Vert _1-\mu \Vert x_k\Vert _1}{h}. \end{aligned}$$
(2.17)

As shown in (2.11), \(\Delta _k\le -\frac{\lambda _k}{2}\Vert d_k\Vert ^2_2<0\) whenever \(d_k\ne 0\). Hence, \(\alpha _k\) given by (2.16) is well-defined.

Given all the derivations above, we now describe the nonmonotone Barzilai–Borwein gradient algorithm (hereafter referred to as NBBL1) as follows.

Algorithm 1 (NBBL1)

Remark 1

If \(\lambda _k>0\), then the generated direction is descent. However, the condition \(\lambda _k>0\) may not always be fulfilled, in which case the descent property can no longer be guaranteed. To cope with this drawback, we keep the sequence \(\{\lambda _k\}\) uniformly bounded; that is, for a sufficiently small \(\lambda _{(\min )}>0\) and a sufficiently large \(\lambda _{(\max )}>0\), \(\lambda _k\) is forced as

$$\begin{aligned} \lambda _k=\min \left\{ \lambda _{(\max )},\max \left\{ \lambda _k,\lambda _{(\min )}\right\} \right\} . \end{aligned}$$

This approach ensures that \(\lambda _k\) is bounded away from zero and consequently that \(d_k\) is a descent direction at each iteration.

Remark 2

Lemma 2.2 demonstrates the existence of a constant \(\theta \in (0,h]\) such that \(x_k+\theta d_k\) decreases the objective function. Hence, choosing the initial stepsize \(\tilde{\alpha }=h\) is suggested in practical computation.
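Putting the pieces together, the following is a minimal sketch, assuming NumPy, of the NBBL1 iteration described above: the direction (2.8), the quantity \(\Delta _k\) in (2.17), the nonmonotone acceptance test (2.16) with initial stepsize \(\tilde{\alpha }=h\) (Remark 2), and the safeguarded spectral coefficient (2.4) (Remark 1). It is a sketch of the steps discussed in this subsection, not a transcription of the original algorithm box:

```python
import numpy as np

def nbbl1(f, grad, x0, mu, h=0.83, m_tilde=5, delta=1e-4, rho=0.35,
          lam_min=1e-20, lam_max=1e20, tol=1e-8, max_iter=10000):
    """Nonmonotone Barzilai-Borwein gradient sketch for F(x) = f(x) + mu*||x||_1."""
    def shrink(y, tau):
        return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

    def F(x):
        return f(x) + mu * np.sum(np.abs(x))

    x, g, lam = x0.copy(), grad(x0), 1.0
    F_hist = [F(x)]                                   # recent F values for (2.16)
    for _ in range(max_iter):
        d = -(x - shrink(x - (h / lam) * g, mu * h / lam)) / h       # direction (2.8)
        if np.linalg.norm(d) <= tol:
            break
        Delta = g @ d + mu * (np.sum(np.abs(x + h * d)) - np.sum(np.abs(x))) / h  # (2.17)
        alpha, F_max = h, max(F_hist)                 # initial stepsize alpha = h
        while F(x + alpha * d) > F_max + delta * alpha * Delta:      # test (2.16)
            alpha *= rho
        x_new = x + alpha * d
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        sy = s @ y
        lam = sy / (s @ s) if sy > 0 else 1.0         # coefficient (2.4); fallback is illustrative
        lam = min(lam_max, max(lam, lam_min))         # safeguard of Remark 1
        x, g = x_new, g_new
        F_hist = (F_hist + [F(x)])[-(m_tilde + 1):]   # keep at most m_tilde + 1 values
    return x
```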

3 Convergence Analysis

This section presents some favorable properties of the generated direction and establishes the global convergence of Algorithm 1. Our convergence result relies on the following assumption:

Assumption 1

The level set \(\Omega =\{x:f(x)\le f(x_0)\}\) is bounded.

To facilitate the understanding of the lemma below, we recall two frequently used definitions from convex analysis.

Definition 3.1

The directional derivative of a multivariate function \(f(x_1,\ldots ,x_n)\) along a given vector \(d=(d_1,\ldots ,d_n)\) at a given point \(x\) is defined by

$$\begin{aligned} f'(x;d)=\lim _{\alpha \downarrow 0}\frac{f(x+\alpha d)-f(x)}{\alpha }. \end{aligned}$$

Definition 3.2

A feasible point \(x\in {\mathbb {R}}^n\) is said to be a stationary point of \(f\) if \(f'(x;d)\ge 0\) for all \(d\in {\mathbb {R}}^n\).

Lemma 3.3

Suppose that \(\lambda _k>0\) and \(d_k\) is defined by (2.8) with \(h\in (0,1]\). Then, \(x_k\) is a stationary point of problem (1.1) if and only if \(d_k=0\).

Proof

If \(d_k\ne 0\), then Lemma 2.2 shows that \(d_k\) is a descent direction at \(x_k\), which implies that \(x_k\) is not a stationary point of \(F\). Conversely, if \(d_k=0\) is the solution of (2.7), then for any \(d\in {\mathbb {R}}^n\) and \(\alpha >0\) we obtain

$$\begin{aligned} \alpha \nabla f(x_k)^\top d+\frac{\lambda _k\alpha ^2}{2} \Vert d\Vert _2^2+\frac{\mu }{h}\Vert x_k+\alpha hd\Vert _1\ge \frac{\mu }{h}\Vert x_k\Vert _1. \end{aligned}$$
(3.1)

Combining \(f(x_k+\alpha d)-f(x_k)=\alpha \nabla f(x_k)^\top d+o(\alpha )\) with (3.1) yields

$$\begin{aligned} F'(x_k;d)&=\lim _{\alpha \downarrow 0} \frac{f(x_k+\alpha d)-f(x_k)+\mu \Vert x_k+\alpha d\Vert _1-\mu \Vert x_k\Vert _1}{\alpha } \\&= \lim _{\alpha \downarrow 0} \frac{\alpha \nabla f(x_k)^\top d+o(\alpha )+\mu \Vert x_k+\alpha d\Vert _1-\mu \Vert x_k\Vert _1}{\alpha } \\&\ge \lim _{\alpha \downarrow 0}\Big ( \frac{-\frac{\lambda _k\alpha ^2}{2} \Vert d\Vert _2^2+o(\alpha )}{\alpha }\\&\quad +\frac{\big [\mu \Vert x_k+\alpha d\Vert _1-\mu \Vert x_k\Vert _1\big ]-\big [\frac{\mu }{h}\Vert x_k+\alpha hd\Vert _1-\frac{\mu }{h}\Vert x_k\Vert _1\big ]}{\alpha }\Big ).\\&\ge \lim _{\alpha \downarrow 0} \frac{-\frac{\lambda _k\alpha ^2}{2} \Vert d\Vert _2^2+o(\alpha )}{\alpha }\\&=0, \end{aligned}$$

where the second inequality is from Lemma 2.1. Hence, \(x_k\) is a stationary point of \(F\). \(\square \)

The proof of the following lemma is similar to that of the theorem in [22].

Lemma 3.4

Let \(l(k)\) be an integer such that

$$\begin{aligned} k-m(k)\le l(k)\le k\quad \text {and} \quad F(x_{l(k)})= \max _{0\le j\le m(k)}F(x_{k-j}). \end{aligned}$$

Then, the sequence \(\{F(x_{l(k)})\}\) is non-increasing, and the search direction \(d_{l(k)}\) satisfies

$$\begin{aligned} \lim _{k\rightarrow \infty }\alpha _{l(k)}\Vert d_{l(k)}\Vert _2 =0. \end{aligned}$$
(3.2)

Proof

From the definition of \(m(k)\), we have \(m(k+1)\le m(k)+1\). Hence,

$$\begin{aligned} F(x_{l(k+1)})&= \max _{0\le j\le m(k+1)}F(x_{k+1-j})\\&\le \max _{0\le j\le m(k)+1}F(x_{k+1-j})\\&= \max \left\{ F(x_{l(k)}),F(x_{k+1})\right\} \\&= F(x_{l(k)}). \end{aligned}$$

Based on (2.16), we obtain the following for all \(k>\tilde{m}\):

$$\begin{aligned} F(x_{l(k)})&=F\left( x_{l(k)-1}+\alpha _{l(k)-1}d_{l(k)-1}\right) \\&\le \max _{0\le j\le m(l(k)-1)}F\left( x_{l(k)-1-j}\right) +\delta \alpha _{l(k)-1}\Delta _{l(k)-1}\\&= F\left( x_{l(l(k)-1)}\right) +\delta \alpha _{l(k)-1}\Delta _{l(k)-1}. \end{aligned}$$

By Assumption 1 and the fact that \(f\) is bounded below, the sequence \(\{F(x_{l(k)})\}\) admits a limit as \(k\rightarrow \infty \). Hence,

$$\begin{aligned} \lim _{k\rightarrow \infty }\alpha _{l(k)}\Delta _{l(k)} =0. \end{aligned}$$
(3.3)

Given the definition of \(\Delta _k\) in (2.17) and the inequality (2.11), we can deduce that

$$\begin{aligned} \Delta _{l(k)}\le -\frac{\lambda _{(\min )}}{2}\Vert d_{l(k)}\Vert _2^2<0. \end{aligned}$$

Combining this inequality with (3.3) yields

$$\begin{aligned} \lim _{k\rightarrow \infty }\alpha _{l(k)}\Vert d_{l(k)}\Vert _2^2 =0, \end{aligned}$$

which shows the desired result (3.2). \(\square \)

Theorem 3.5

Let the sequences \(\{x_k\}\) and \(\{d_k\}\) be generated by Algorithm 1. Then, a subsequence \({\mathcal {K}}\) exists such that

$$\begin{aligned} \lim _{k\rightarrow \infty ,k\in {\mathcal {K}}} \Vert d_k\Vert _2=0. \end{aligned}$$
(3.4)

Proof

As shown in [22], (3.2) also implies that

$$\begin{aligned} \lim _{k\rightarrow \infty } \alpha _k\Vert d_k\Vert _2=0. \end{aligned}$$
(3.5)

Now, let \(\bar{x}\) be a limit point of \(\{x_k\}\) and \(\{x_k\}_{{\mathcal {K}}_1}\) be a subsequence of \(\{x_k\}\) converging to \(\bar{x}\). Then, by (3.5), either \(\lim \limits _{k\rightarrow \infty ,k\in {\mathcal {K}}_1} \Vert d_k\Vert _2=0\), which implies \(\Vert \bar{d}\Vert _2=0\), or a subsequence \(\{x_k\}_{{\mathcal {K}}}\,({\mathcal {K}}\subset {\mathcal {K}}_1)\) exists such that

$$\begin{aligned} \lim _{k\rightarrow \infty ,k\in {\mathcal {K}}} d_k\ne 0\quad \text {and}\quad \lim _{k\rightarrow \infty ,k\in {\mathcal {K}}} \alpha _k=0. \end{aligned}$$
(3.6)

In this case, we assume that a constant \(\epsilon >0\) exists such that

$$\begin{aligned} \Vert d_k\Vert _2\ge \epsilon ,\quad \forall \ k\in {\mathcal {K}}. \end{aligned}$$
(3.7)

As \(\alpha _k\) is the first value that satisfies (2.16), we can assume based on Step 3 in Algorithm 1 that an index \(\bar{k}\) exists such that for all \(k\ge \bar{k}\) and \(k\in {\mathcal {K}}\),

$$\begin{aligned} F\left( x_k+\frac{\alpha _k}{\rho }d_k\right) > \max _{0\le j\le m(k)}F(x_{k-j})+\delta \frac{\alpha _k}{\rho } \Delta _k\ge F(x_{k})+\delta \frac{\alpha _k}{\rho } \Delta _k. \end{aligned}$$
(3.8)

As \(f\) is continuously differentiable, the mean-value theorem applied to \(f\) yields a constant \(\theta _k\in (0,1)\) such that

$$\begin{aligned} f\left( x_k+\frac{\alpha _k}{\rho }d_k\right) -f(x_{k})= \frac{\alpha _k}{\rho }\nabla f\left( x_k+\theta _k\frac{\alpha _k}{\rho }d_k\right) ^\top d_k. \end{aligned}$$

By combining the previous equation with (3.8), we obtain

$$\begin{aligned} \nabla f\left( x_k+\theta _k\frac{\alpha _k}{\rho }d_k\right) ^\top d_k+\frac{\mu \Vert x_k+\frac{\alpha _k}{\rho }d_k\Vert _1- \mu \Vert x_k\Vert _1}{\alpha _k/\rho }>\delta \Delta _k. \end{aligned}$$
(3.9)

As \(\tilde{\alpha }=h\) and \(\alpha _k\rightarrow 0\) in (3.6), we have \(\alpha _k<\rho h\) for all sufficiently large \(k\in {\mathcal {K}}\). Based on Lemma 2.1,

$$\begin{aligned} \frac{\mu \Vert x_k+\frac{\alpha _k}{\rho }d_k\Vert _1-\mu \Vert x_k\Vert _1}{\alpha _k/\rho } - \frac{\mu \Vert x_k+hd_k\Vert _1-\mu \Vert x_k\Vert _1}{h}\le 0. \end{aligned}$$

Subtracting \(\Delta _k\) from both sides of (3.9) and recalling the definition (2.17) of \(\Delta _k\), we obtain

$$\begin{aligned}&\nabla f\left( x_k+\theta _k\frac{\alpha _k}{\rho }d_k\right) ^\top d_k-\nabla f(x_k)^\top d_k \nonumber \\&\quad \ge \nabla f\left( x_k+\theta _k\frac{\alpha _k}{\rho }d_k\right) ^\top d_k-\nabla f(x_k)^\top d_k \nonumber \\&\qquad +\Big [\frac{\mu \Vert x_k+\frac{\alpha _k}{\rho }d_k\Vert _1- \mu \Vert x_k\Vert _1}{\alpha _k/\rho } - \frac{\mu \Vert x_k+hd_k\Vert _1-\mu \Vert x_k\Vert _1}{h}\Big ]\nonumber \\&\quad >-(1-\delta )\Delta _k\nonumber \\&\quad \ge (1-\delta )\frac{\lambda _{(\min )}}{2}\Vert d_k\Vert _2^2. \end{aligned}$$
(3.10)

Taking the limit as \(k\in {\mathcal {K}},\,k\rightarrow \infty \) on both sides of (3.10) and using the smoothness of \(f\), we obtain

$$\begin{aligned} 0=\nabla f(\bar{x})^\top \bar{d}-\nabla f(\bar{x})^\top \bar{d}\ge (1-\delta )\frac{\lambda _{(\min )}}{2}\Vert \bar{d}\Vert _2^2, \end{aligned}$$

which implies that \(\Vert d_k\Vert _2\rightarrow 0\) as \(k\in {\mathcal {K}},\,k\rightarrow \infty \). This result yields a contradiction because (3.7) indicates that \(\Vert d_k\Vert _2\) is bounded away from zero. \(\square \)

4 Some Extensions

In this section, we show that our algorithm can be readily extended to solve \(\ell _{2}\)-norm and matrix trace norm minimization problems in machine learning, thereby broadening the applicable range of our approach significantly.

First, we consider the \(\ell _{2}\)-regularization problem

$$\begin{aligned} \min _{x\in {\mathbb {R}}^n} \ F(x)=f(x)+\mu \Vert x\Vert _2. \end{aligned}$$

The search direction \(d_k\) is clearly determined by minimizing

$$\begin{aligned} \min _{d\in {\mathbb {R}}^n} \frac{1}{2}\Big \Vert x_k+hd-\big (x_k-\frac{h}{\lambda _k}\nabla f(x_k)\big )\Big \Vert _2^2+\frac{\mu h}{\lambda _k}\Vert x_k+hd\Vert _2. \end{aligned}$$

Based on [19], the explicit solution is

$$\begin{aligned} x_k+hd_k=\max \Big \{\Big \Vert x_k-\frac{h}{\lambda _k}\nabla f(x_k)\Big \Vert _2-\frac{\mu h}{\lambda _k},0\Big \}\frac{x_k-\frac{h}{\lambda _k}\nabla f(x_k)}{\Vert x_k-\frac{h}{\lambda _k}\nabla f(x_k)\Vert _2}, \end{aligned}$$

i.e.,

$$\begin{aligned} d_k= -\frac{1}{h}\Big [x_k-\max \Big \{\Big \Vert x_k-\frac{h}{\lambda _k}\nabla f(x_k)\Big \Vert _2-\frac{\mu h}{\lambda _k},0\Big \}\frac{x_k-\frac{h}{\lambda _k}\nabla f(x_k)}{\Vert x_k-\frac{h}{\lambda _k}\nabla f(x_k)\Vert _2}\Big ]. \end{aligned}$$
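In components, the only change with respect to (2.8) is that the shrinkage acts on the whole vector rather than componentwise; a minimal sketch assuming NumPy (function name ours):

```python
import numpy as np

def search_direction_l2(x, grad_fx, lam, mu, h):
    """Direction for the l2-regularized problem: group shrinkage of the full vector."""
    y = x - (h / lam) * grad_fx
    ny = np.linalg.norm(y)
    scale = max(ny - mu * h / lam, 0.0) / ny if ny > 0 else 0.0
    return -(x - scale * y) / h
```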

Now, we consider the matrix trace norm minimization problem

$$\begin{aligned} \min _{X\in {\mathbb {R}}^{m\times n}} \ F(X)=f(X)+\mu \Vert X\Vert _*, \end{aligned}$$
(4.1)

where \(\Vert X\Vert _*\) is the trace norm of the matrix \(X\), defined as the sum of its singular values. That is, if \(X\) has \(r\) positive singular values \(\sigma _1\ge \sigma _2\ge \cdots \ge \sigma _r> 0\), then \(\Vert X\Vert _*=\sum _{i=1}^r\sigma _i\). The matrix trace norm is also known as the Schatten \(\ell _{1}\)-norm, Ky Fan norm, and nuclear norm [33]. This problem has received much attention because it is closely related to the affine rank minimization problem, which appears in many control applications, including controller design, realization theory, and model reduction.

Similar to the steps undertaken in previous sections, we can readily reformulate (2.6) as the following quadratic model to determine the search direction:

$$\begin{aligned} \min _{D\in {\mathbb {R}}^{m\times n}} \frac{1}{2}\Big \Vert X_k+hD-\big (X_k-\frac{h}{\lambda _k}\nabla f(X_k)\big )\Big \Vert _F^2+\frac{\mu h}{\lambda _k}\Vert X_k+hD\Vert _*. \end{aligned}$$
(4.2)

To obtain the exact solution of (4.2), we now consider the singular value decomposition (SVD) of a matrix \(Y\in {\mathbb {R}}^{m\times n}\) with rank \(r\) as

$$\begin{aligned} Y=U\Sigma V^\top ,\quad \Sigma =\text {diag}(\{\sigma _i\}_{1\le i\le r}), \end{aligned}$$

where \(U\in {\mathbb {R}}^{m\times r}\) and \(V\in {\mathbb {R}}^{n\times r}\) have orthonormal columns, and the singular values \(\sigma _i\) are positive. For each \(\tau >0\), let

$$\begin{aligned} {\mathcal {D}}_{\tau }(Y)=U{\mathcal {D}}_{\tau }(\Sigma )V^\top ,\quad {\mathcal {D}}_{\tau }(\Sigma )=\text {diag}([\sigma _i-\tau ]_+), \end{aligned}$$

where \([\cdot ]_+=\max \{0,\cdot \}\). \({\mathcal {D}}_{\tau }(Y)\) solves the following nuclear norm minimization problem [7]:

$$\begin{aligned} {\mathcal {D}}_{\tau }(Y)=\text {arg}\min _X \tau \Vert X\Vert _*+\frac{1}{2}\Vert X-Y\Vert _F^2. \end{aligned}$$
(4.3)

Comparing (4.2) with (4.3) and letting \(U\Sigma V^\top \) be the SVD of \(X_k-\frac{h}{\lambda _k}\nabla f(X_k)\), we deduce that

$$\begin{aligned} X_k+hD_k = U{\mathcal {D}}_{\mu h/\lambda _k}(\Sigma )V^\top \quad \text {and}\quad {\mathcal {D}}_{\mu h/\lambda _k}(\Sigma )=\text {diag}\big ([\sigma _i-\frac{\mu h}{\lambda _k}]_+\big ), \end{aligned}$$

or, equivalently,

$$\begin{aligned} D_k = -\frac{1}{h}\Big [X_k-U{\mathcal {D}}_{\mu h/\lambda _k}(\Sigma )V^\top \Big ]. \end{aligned}$$

Subsequently, the NBBL1 framework to solve \(\ell _{2}\)-norm and matrix trace norm regularization problems is easily derived.
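For completeness, the following is a minimal sketch, assuming NumPy, of the singular value thresholding operator (4.3) and the resulting matrix search direction (the function names are ours):

```python
import numpy as np

def svt(Y, tau):
    """Singular value thresholding D_tau(Y) in (4.3)."""
    U, sigma, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(sigma - tau, 0.0)) @ Vt

def search_direction_trace(X, grad_fX, lam, mu, h):
    """Matrix analogue of (2.8): D = -(1/h)*[X - D_{mu*h/lam}(X - (h/lam)*grad f(X))]."""
    return -(X - svt(X - (h / lam) * grad_fX, mu * h / lam)) / h
```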

5 Numerical Experiments

In this section, we present numerical results to illustrate the feasibility and efficiency of NBBL1. We partition our experiments into three classes based on the type of \(f\). In the first class, we use our algorithm to solve \(\ell _{1}\)-regularized nonconvex problems. In the second class, we use our algorithm to solve \(\ell _{1}\)-regularized least squares problems, which mainly appear in compressive sensing. In the third class, we compare against some state-of-the-art algorithms in compressive sensing to show the efficiency of our algorithm. All experiments are performed under Windows XP and Matlab 7.8 (2009a) on a Lenovo laptop with an Intel Atom CPU at 1.6 GHz and 1 GB of memory.

5.1 Test on \(\ell _{1}\)-Regularized Nonconvex Problem

Our first test is performed on a set of nonconvex unconstrained problems from the CUTEr [14] library. The second-order derivatives of all the selected problems are available. Given our interest in large problems, we only consider problems with a size of at least 100. For such problems, we use the dimensions admissible in the “double large” installation of CUTEr. The algorithm stops if the norm of the search direction is sufficiently small; that is,

$$\begin{aligned} \Vert d_k\Vert _2\le tol_1. \end{aligned}$$
(5.1)

The iterative process is also stopped if the number of iterations exceeds 10,000 without achieving convergence.

In this experiment, we take \(tol_1=10^{-8},\,h=1,\,\lambda _{(\min )}=10^{-20}\), and \(\lambda _{(\max )}=10^{20}\). In the line search, we choose \(\tilde{\alpha }_0=1,\,\rho = 0.35,\,\delta =10^{-4}\), and \(\tilde{m}=5\). We test NBBL1 with different parameter values \(\mu \in \{0, 0.25, 1, 1.5, 2\}\). The numerical results are presented in Tables 1 and 2, which contain the name of the problem (Problem), the dimension of the problem (Dim), the number of iterations (Iter), the number of function evaluations (Nf), the CPU time in seconds (Time), the final objective function value (Fun), the norm of the final gradient of \(f\) (Normg) or the number of nonzero components of the solution (Nzero), and the norm of the final direction (Normd).

Table 1 Test result for NBBL1 with \(\mu =0\)
Table 2 Test result for NBBL1 with \(\mu =0.25,1,2\)

As shown in Tables 1 and 2, NBBL1 works successfully for all the test problems in each case. In particular, NBBL1 consistently produces highly accurate solutions within a short amount of time. The proposed algorithm requires a large number of iterations for a few special problems: FLETCHER, NONCVXU2, and BROYDN7D with \(\mu =0\); FLETCHER and BROYDN7D with \(\mu =0.25\); FLETCHER and CHAIWOO with \(\mu =1\); and VARDIM, FLETCHER, and CHAIWOO with \(\mu =1.5\) and \(\mu =2\). It can also be observed that the numbers of iterations and function evaluations both decrease as \(\mu \) increases from 1, except for problem VARDIM. As illustrated in the third column on the right, the number of nonzero components of the final solution decreases dramatically as \(\mu \) grows, reaching zero for a couple of problems when \(\mu =2\). Moreover, if a larger \(\mu \) were permitted, the number of nonzero components of the final solution would be zero for every test problem, which means that zero is the optimal solution. This phenomenon is not surprising once we note the separable structure of the original problem and the role of the balancing parameter \(\mu \). From Table 2 we also note that for problems COSINE and GENROSE, the proposed algorithm requires only two steps to achieve convergence. From CUTEr [14], it is easy to see that the underlying reason is that the starting point is close to zero, i.e., to the solution. Meanwhile, the norm of the final direction is exactly zero for problems COSINE and GENROSE, because at the solution \(x_k=0\) and \(|x_k-\frac{h}{\lambda _k}\nabla f(x_k)|<\frac{\mu h}{\lambda _k}\) in (2.8).

Table 1 presents the numerical results of NBBL1 when solving a smooth nonconvex minimization problem without any regularization. As shown in the second-to-last column of this table, the norm of the final gradient is sufficiently small. This observation verifies that the proposed algorithm is very efficient in solving unconstrained smooth minimization problems. The result is not surprising, because in this case the proposed algorithm reduces to the effective nonmonotone Barzilai–Borwein gradient method of Raydan [32].

5.2 Test on \(\ell _{1}\)-Regularized Least Squares

Let \(\bar{x}\) be a sparse or a nearly sparse original signal, \(A\in {\mathbb {R}}^{m\times n}\, (m\ll n)\) be a linear operator, \(\omega \in {\mathbb {R}}^m\) be a zero-mean Gaussian white noise, and \(b\in {\mathbb {R}}^m\) be an observation that satisfies the relationship

$$\begin{aligned} b=A\bar{x}+\omega . \end{aligned}$$

Recent compressive sensing results show that under some technical conditions, the desired signal can be reconstructed almost exactly by solving the \(\ell _{1}\)-regularized least squares (1.2). In this subsection, we perform two classes of numerical experiments to solve (1.2) using the Gaussian matrices as the encoder. In the first class, we demonstrate that our algorithm effectively decodes a sparse signal. In the second class, we carry out a series of experiments with different \(h\) to choose the best one. We measure the quality of restoration \(x^*\) through the relative error to the original signal \(\bar{x}\); that is,

$$\begin{aligned} \text {RelErr} = \frac{\Vert x^* - \bar{x}\Vert _2}{\Vert \bar{x}\Vert _2}. \end{aligned}$$
(5.2)

In the first test, we use a random matrix \(A\) with independent identically distributed Gaussian entries, and \(\omega \) is additive Gaussian noise with zero mean and standard deviation \(\sigma \). Given the storage limitations of the PC, we test a small signal with \(n=2^{11}\) and \(m=2^{9}\). The original signal contains \(p=2^6\) randomly placed nonzero elements, and we choose the noise level \(\sigma =10^{-3}\). The proposed algorithm starts at the zero point and terminates when the relative change between two successive points is sufficiently small, i.e.,

$$\begin{aligned} \frac{\Vert x_k-x_{k-1}\Vert _2}{\Vert x_{k-1}\Vert _2}<tol_2. \end{aligned}$$
(5.3)

In this experiment, we take \(tol_2=10^{-4},\,h=10^{-2},\,\lambda _{(\min )}=10^{-30}\), and \(\lambda _{(\max )}=10^{30}\). In the line search, we choose \(\tilde{\alpha }_0=10^{-2},\,\rho = 0.35,\,\delta =10^{-4}\), and \(\tilde{m}=5\). The original signal, the limited measurement, and the reconstructed signal are given in Fig. 1.

Fig. 1

Left Original signal with length 4,096 and 64 positive non-zero elements; Middle the noisy measurement with length 512; Right recovered signal by NBBL1 (red circle) versus original signal (blue peaks) (Color figure online)

Comparing the left plot with the right one in Fig. 1, we clearly see that the original sparse signal is restored almost completely. All the blue peaks are encircled by the red circles, illustrating that the original signal has been found almost exactly. Overall, this simple experiment shows that the proposed algorithm performs quite well and provides an efficient approach to the recovery of large sparse non-negative signals.

The last term in the approximate quadratic model (2.6) is exactly equivalent to \(\Vert x_k+d\Vert _1\) when \(h=1\). Next, we provide evidence that other values of \(h\) can perform dramatically better than \(h=1\). We conduct a series of experiments and compare the performance in each case. In these experiments, we set all the parameter values as in the previous test except for \(n=2^{10}\). Figure 2 presents the impact of the parameter \(h\) on the total number of iterations, the computing time, and the quality of the recovered signal. In each plot, the horizontal axis denotes the values of \(h\) from 0.01 to 1 on a log scale.

Fig. 2

Performance of NBBL1: number of iterations (left), computing time (middle) and final relative error (right). In each plot, the horizontal axis represents the value of \(h\) in log scale

As Fig. 2 shows, the number of iterations, the computing time, and the quality of the restorations are all greatly influenced by the value of \(h\). Generally, NBBL1 performs satisfactorily as \(h\) increases. The right plot clearly demonstrates that the relative error decreases dramatically at the very beginning and then decreases only slightly after 0.2; the quality of restoration cannot be improved further after 0.7. On the other hand, the left and middle plots show that the number of iterations and the computing time increase slightly after \(h=0.8\). These plots verify that the performance of NBBL1 is sensitive to the value of \(h\), and that values \(h\in [0.7,0.9]\) may be preferable.

5.3 Comparisons with NESTA_Ct, GPSR_BB, CGD, TwIST, FPC_BB, and FPC_AS

In the third class of experiments, we test the proposed algorithm against several state-of-the-art algorithms for solving \(\ell _{1}\)-regularized problems in compressive sensing or linear inverse problems. Comparing the algorithms in a completely fair way is difficult, because each code uses different parameter settings, such as termination criteria, starting points, or continuation techniques. Hence, in our performance comparisons, we run each code from the same initial point, use all the default parameter values, and only observe the convergence behavior of each algorithm toward a solution of similar accuracy. We provide below a brief review of each solver.

NESTA uses Nesterov’s smoothing technique [28] and gradient method [29] to solve the basis pursuit denoising problem. The current version can solve \(\ell _{1}\)-norm regularization problems of different types, including (1.2). In this experiment, we test NESTA with continuation (hereafter referred to as NESTA_Ct). This algorithm solves a sequence of problems (1.2) with a decreasing sequence of \(\mu \) values, using the intermediate solution of each subproblem as the starting point for the next one. In running NESTA, all the parameters are taken as default except TolVar, which is set to \(1\hbox {e}-5\) to obtain solutions of quality similar to the other solvers.

Gradient projection for sparse reconstruction (referred to as GPSR_BB [20]) reformulates the original problem (1.2) as the box-constrained quadratic programming problem (1.6) by splitting \(x=u-v\). Figueiredo, Nowak, and Wright use a gradient projection method with Barzilai–Borwein steplength [2] for its solution, whose performance is improved by a nonmonotone line search [22]. For the comparison with the proposed algorithm, we use the continuation variant of GPSR_BB and set all parameters as default.

The well-known CGD uses a gradient algorithm to solve (1.5) and obtain the search direction \(d_k^i=ze^i\) for \(i\in {\mathcal {J}}\), where \({\mathcal {J}}\) is a nonempty subset of \(\{1,\ldots ,n\}\) chosen according to the Gauss–Southwell rule. The iterative process \(x_{k+1}=x_k+\alpha _kd_k\), with \(d_k^i=0\) for \(i\notin {\mathcal {J}}\) and the stepsize \(\alpha _k\) determined by an Armijo rule, continues until some termination criterion is met. In running CGD, we use the code from its Matlab package and set all the parameters as default except for init=2, so that the iterative process starts at \(A^\top b\).

TwIST is a two-step IST algorithm for a class of linear inverse problems. Specifically, TwIST is designed to solve

$$\begin{aligned} \min _u {\mathcal {J}}(u)+\frac{\mu }{2}\Vert Au-f\Vert _2^2, \end{aligned}$$
(5.4)

where \(A\) is a linear operator, and \({\mathcal {J}}(\cdot )\) is a general regularizer, which can be either the \(\ell _{1}\)-norm or the total variation [34]. The iteration framework of TwIST is

$$\begin{aligned} u_{k+1} = (1-\alpha )u_{k-1}+(\alpha -\delta )u_k+\delta \Phi _\mu (\xi _k), \end{aligned}$$

where \(\alpha ,\delta >0\) are parameters, \(\xi _k=u_k+A^\top (f-Au_k)\), and

$$\begin{aligned} \Phi _\mu (\xi _k)=\text {arg}\min _u {\mathcal {J}}(u)+\frac{\mu }{2}\Vert u-\xi _k\Vert _2^2. \end{aligned}$$
(5.5)

We use the default parameters in TwIST and terminate the iteration process when the relative variation of function value falls below \(10^{-4}\).

FPC solves the general \(\ell _{1}\)-regularized minimization problem (1.1), where \(f\) is a continuously differentiable convex function. At the current \(x_k\) and for any scalar \(\tau >0\) (which might be \(1/\lambda _k\)), the next iterate is produced by the so-called fixed-point iteration

$$\begin{aligned} x_{k+1}={\mathcal {S}}(x_k-\tau \nabla f(x_k),\mu \tau ). \end{aligned}$$

To obtain good practical performance, FPC is augmented with a continuation approach. Moreover, FPC has been further modified with the Barzilai–Borwein stepsize (code FPC_BB in the Matlab package FPC_v2), resulting in significantly improved performance. In running FPC_BB, we use all the default parameter values except xtol = \(1\hbox {e}-5\), which stops the algorithm when the relative change between successive points falls below xtol, so as to obtain solutions of quality similar to the other solvers. The closely related algorithm FPC_AS searches along

$$\begin{aligned} d_k={\mathcal {S}}(x_k-\tau \nabla f(x_k),\mu \tau )-x_k \end{aligned}$$

to estimate the support of the solution, using the nonmonotone line search of Zhang and Hager [48]. Then, a smooth “subspace optimization” problem is solved to recover the magnitudes of the nonzero components of \(x\). In running FPC_AS, we use the latest version in its Matlab package and set all the parameters as default except for opts.init=2 to start at \(A^\top b\).

In this test, \(A\) is a partial discrete cosine transform (DCT) coefficient matrix whose \(m\) rows are chosen randomly from the \(n\times n\) DCT matrix. This encoding matrix does not require explicit storage and allows fast matrix-vector multiplications involving \(A\) and \(A^\top \). Therefore, it can be used to test larger problems than those tested with Gaussian matrices. In NBBL1, we take \(tol_2=10^{-4},\,h=0.83,\,\lambda _{(\min )}=10^{-30}\), and \(\lambda _{(\max )}=10^{30}\). In the line search, we choose \(\tilde{\alpha }_0=h,\,\rho = 0.35,\,\delta =10^{-5}\), and \(\tilde{m}=5\). In this comparison, we let \(n= 2^{14}\) and \(m = \mathtt{floor}(n/4)\), where “floor” is the Matlab command that rounds toward minus infinity. The original signal \(\bar{x}\) contains \(p=\mathtt{floor}(m/8)\) nonzero components. Moreover, the observation \(b\) is contaminated by Gaussian noise with level \(\sigma = 1\hbox {e}-\)3. The goal is to use each algorithm to reconstruct \(\bar{x}\) from the observation \(b\) by solving (1.2) with \(\mu =2^{-8}\). All the tested algorithms start at \(x_0=A^\top b\) and terminate with different stopping criteria so as to produce solutions of similar quality. Note that, because of the continuation technique used in some of the tested algorithms, the value of \(\mu \) may change during the iterations. For this reason, we only observe the convergence behavior of the algorithms with respect to the relative error as the number of iterations and the computing time increase. The results of NBBL1, NESTA_Ct, CGD, TwIST, and GPSR_BB are presented in Fig. 3.
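The experiments reported here were run in Matlab; purely as an illustration of how such an implicit partial DCT operator can be formed, the following sketch assumes NumPy and SciPy (the function names are ours):

```python
import numpy as np
from scipy.fft import dct, idct

def make_partial_dct(n, m, rng):
    """Return implicit A and A^T for m rows drawn at random from the n x n DCT matrix."""
    rows = rng.choice(n, size=m, replace=False)
    def A(x):                          # A x: selected orthonormal DCT coefficients
        return dct(x, norm='ortho')[rows]
    def At(y):                         # A^T y: zero-fill, then inverse DCT
        z = np.zeros(n)
        z[rows] = y
        return idct(z, norm='ortho')
    return A, At

# Example: rng = np.random.default_rng(0); A, At = make_partial_dct(2**14, 2**12, rng)
```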

Fig. 3 Comparison result of NBBL1, NESTA_Ct, CGD, TwIST, and GPSR_BB. The x-axes represent the CPU time in seconds (left) and the number of iterations (right). The y-axes represent the relative error

The left plot in Fig. 3 shows that NBBL1 decreases the relative error faster than NESTA_Ct, CGD, GPSR_BB, and TwIST throughout the entire iteration process. Meanwhile, the right plot shows that NBBL1 requires fewer iterations than NESTA_Ct, CGD, and GPSR_BB. Notably, TwIST requires nearly the same number of iterations as NBBL1 but more computing time, because it solves the de-noising subproblem (5.5) at each iteration. Among the compared solvers, CGD performs the weakest; however, if CGD starts at zero, its performance improves significantly (see [43]). In sum, this simple experiment verifies that NBBL1 is the fastest algorithm in this set of tests.

Now we test the algorithm against two other closely related solvers, namely, FPC_BB and FPC_AS, keeping the experiment settings the same as in the previous test. We conduct this particular experiment because all three algorithms use the Barzilai–Borwein coefficient and a nonmonotone line search strategy, although in different ways. The search direction of FPC_AS is clearly a special case of the one defined by (2.8). Nevertheless, we also test NBBL1 with \(h=1\) [abbreviated as NBBL1(1.0)] to demonstrate that the search direction (2.8) can greatly benefit the performance of the algorithm. The results of NBBL1, FPC_BB, and FPC_AS are presented in Fig. 4.

Fig. 4 Comparison result of NBBL1, FPC_AS and FPC_BB. The x-axes represent the CPU time in seconds (left) and the number of iterations (right). The y-axes represent the relative error

As shown in Fig. 4, all the tested algorithms eventually attain nearly equal relative errors. FPC_BB is the most competitive with NBBL1(0.83), whereas NBBL1(1.0) and FPC_AS are much slower. FPC_AS is the slowest in decreasing the relative error throughout the iteration process but requires the fewest iterations, because at each iteration it solves a subspace minimization problem. One fact that must be emphasized is that the performance of NBBL1 indeed improves dramatically with \(h=0.83\). In conclusion, the proposed algorithm is generally more efficient than FPC_AS and competes strongly with FPC_BB.

To further illustrate the benefit of NBBL1, we summarize the results of the previous two experiments for each algorithm in Table 3 at the end of this section. In this comparison, we report only the number of iterations (Iter), the computing time (Time), the relative error (RelErr), and the function value (Fun) at the final solution of each algorithm.

Table 3 Comparison results for each tested algorithm

Table 3 shows that each algorithm obtained solutions with comparable objective function values and comparable relative errors. From these results, we again see that FPC_BB is the most competitive with NBBL1(0.83), while the others are relatively slow, and that NBBL1(1.0) is appreciably slower than NBBL1(0.83). We also performed a series of tests with different experiment settings and observed consistent results; in particular, FPC_BB remained the most competitive with NBBL1(0.83). These results and observations demonstrate the efficiency and stability of NBBL1.

6 Conclusions

In this study, we proposed, analyzed, and tested a new practical algorithm to solve the separable nonsmooth minimization problem consisting of an \(\ell _{1}\)-norm regularization term and a continuously differentiable term. This type of problem mainly appears in signal/image processing, compressive sensing, machine learning, and linear inverse problems, and it is challenging because of the non-smoothness of the regularization term. Our approach minimizes an approximate local quadratic model to determine a search direction at each iteration. We showed that this search direction contains the one of FPC_AS as a special case and reduces to the classic Barzilai–Borwein gradient method when \(\mu =0\). We also proved that the search direction is a descent direction for the objective function provided that the initial stepsize in the nonmonotone line search does not exceed \(h\). We established the algorithm’s global convergence under the assumption that \(f\) is bounded below. Extensive experimental results illustrated that the proposed algorithm is an effective tool for solving \(\ell _{1}\)-regularized nonconvex problems from the CUTEr library. Moreover, we ran our algorithm to recover a large sparse signal from its noisy measurements. The performance comparisons with several state-of-the-art solvers verified that our algorithm is competitive with FPC_BB and faster than FPC_AS, NESTA_Ct, CGD, GPSR, and TwIST.

Unlike all the existing algorithms in the literature, our approach uses a linear model to approximate \(\Vert x_k+d\Vert _1\) for computing the search direction with a small scalar \(h\); that is,

$$\begin{aligned} \Vert x_k+d\Vert _1\approx \Vert x_k\Vert _1 + \frac{\Vert x_k+hd\Vert _1-\Vert x_k\Vert _1}{h}. \end{aligned}$$
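
In particular, when \(h=1\) the difference quotient telescopes and the model is exact:

$$\begin{aligned} \Vert x_k\Vert _1 + \frac{\Vert x_k+d\Vert _1-\Vert x_k\Vert _1}{1} = \Vert x_k+d\Vert _1. \end{aligned}$$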

Although the approximation holds exactly when \(h=1\), a series of numerical experiments in this paper showed that an appropriately chosen \(h\) can result in improved performance under suitable experiment settings. This approach is distinctive and novel and is therefore one of the important contributions of this study. A natural question is whether the performance of FPC_AS, or even its related variants, can be improved using the search direction defined in (2.8); this question deserves a thorough investigation. The nonmonotone Barzilai–Borwein gradient algorithm of Raydan [32] is known to be very effective for smooth unconstrained minimization, but its effectiveness for \(\ell _{1}\)-regularized signal reconstruction problems has not been clearly explored. Hence, our approach can be regarded as a modification or extension of the algorithm in [32]. Moreover, the numerical experiments illustrated that our approach is comparable with, or even better than, several state-of-the-art algorithms; this advantage serves as the numerical contribution of this study. Our algorithm readily extends to \(\ell _{1}\)-regularized logistic regression, as well as to \(\ell _{2}\)-norm, matrix trace norm, and \(\ell _{2,1}\)-norm regularized minimization problems in machine learning. However, tests on these problems were not conducted in this study, so further investigations are necessary.