1 Introduction

This paper aims to solve a composite convex optimization problem in the form

$$\begin{aligned} \mathop {\text {min}}\limits _{x\in \mathbb {R}^n}\quad f(x)+h(x), \end{aligned}$$
(1.1)

where f and h are extended real-valued convex functions. Problem (1.1) arises in a wide variety of applications, such as signal recovery, image processing, machine learning, and data analysis. For example, it becomes the sparse optimization problem when f or h equals the \(\ell _1\)-norm, which has attracted significant interest in signal and image processing in recent years. If f is a loss function associated with linear predictors and h is a regularization function, problem (1.1) is often referred to as the regularized risk minimization problem in machine learning and statistics. When f or h is an indicator function of a convex set, problem (1.1) represents a general convex constrained optimization problem.

Recently, a series of first-order methods, including the forward–backward splitting (FBS) (also known as proximal gradient) methods, Nesterov’s accelerated methods, the alternating direction method of multipliers (ADMM), and the Douglas–Rachford splitting (DRS) and Peaceman–Rachford splitting (PRS) methods, have been extensively studied and widely used for solving problem (1.1). The readers are referred to, for example, [3, 6] and the references therein for a review of some of these first-order methods. One main feature of these methods is that they first exploit the underlying problem structures and then construct subproblems that can be solved relatively efficiently. These algorithms are rather simple yet powerful since they are easy to implement in many interesting applications and they often converge quickly to a solution of moderate accuracy. However, a notorious drawback is that they may suffer from slow tail convergence, and hence a significantly large number of iterations is needed in order to achieve high accuracy.

A few Newton-type methods for some special instances of problem (1.1) have been investigated to alleviate this inherent weakness of first-order type methods. Most existing Newton-type methods for problem (1.1) with a differentiable function f and a simple function h whose proximal mapping can be evaluated cheaply are based, to some extent, on the FBS method. The proximal Newton method [18, 26] can be interpreted as a generalization of the proximal gradient method. In each iteration, it updates the iterate by composing the proximal mapping with a Newton or quasi-Newton step. The semi-smooth Newton methods proposed in [4, 16, 22] solve the nonsmooth formulation of the optimality conditions corresponding to the FBS method. In [43], the augmented Lagrangian method is applied to solve the dual formulation of general linear semidefinite programming problems, where each augmented Lagrangian function is minimized by the semi-smooth Newton-CG method. Similarly, a proximal point algorithm is developed to solve the dual problems of a class of matrix spectral norm approximation problems in [5], where the subproblems are again handled by the semi-smooth Newton-CG method.

In this paper, we study a few second-order type methods for problem (1.1) in a general setting, even when f is nonsmooth and h is an indicator function. Our key observations are that many first-order methods, such as the FBS and DRS methods, can be written as fixed-point iterations, and that the optimal solutions of (1.1) are also the solutions of a system of nonlinear equations defined by the corresponding fixed-point mapping. Consequently, the idea is to develop second-order type algorithms based on solving this system of nonlinear equations. These nonlinear equations are often non-differentiable, but they are semi-smooth due to the properties of the proximal mappings. Hence, we propose a regularized semi-smooth Newton method to solve the system of nonlinear equations. Although the semi-smooth Newton method was extensively studied in the 1990s [9, 25, 27, 28, 34, 37], our usage differs in several aspects. Firstly, our regularization parameter is updated by a self-adaptive strategy similar to that of trust region algorithms. The regularization term is important since the generalized Jacobian matrix corresponding to monotone equations may only be positive semidefinite. Secondly, our approach keeps using the semi-smooth Newton step as much as possible as long as the residual is reduced sufficiently compared with the last Newton step. Otherwise, a hyperplane projection technique proposed in [32, 33] is invoked to ensure global convergence to an optimal solution of problem (1.1). When certain conditions are satisfied, we prove that the semi-smooth Newton steps are always performed close to the optimal solutions. Consequently, a fast local convergence rate is established. Of course, the projection step may be replaced by other globalization techniques for certain problems. Thirdly, the computational cost can be further reduced if the Jacobian matrix is approximated by the limited memory BFGS (L-BFGS) method in some cases.

Our main contribution is the study of some relationships between first-order and second-order type methods. Our semi-smooth Newton methods are able to solve the general convex composite problem (1.1) as long as a fixed-point mapping is well defined. In particular, our methods are applicable to constrained convex programs, such as constrained \(\ell _1\)-minimization problems. In contrast, the Newton-type methods in [4, 16, 18, 22, 26] are designed for unconstrained problems. Unlike the methods in [5, 43], which apply the semi-smooth Newton method to a sequence of subproblems, our target is a single system of nonlinear equations. Although solving the Newton system is a major challenge, the computational cost usually can be controlled reasonably well when certain structures can be utilized. Since the hyperplane projection technique in [32, 33] often slows down the numerical convergence of the Newton steps, we only activate this step when it is necessary, so the numerical efficiency is still determined by the Newton steps. Our preliminary numerical results show that our proposed methods are able to reach superlinear or quadratic convergence rates on typical \(\ell _1\)-minimization problems.

The rest of this paper is organized as follows. In Sect. 2, we review a few popular operator splitting methods, derive their equivalent fixed-point iterations and discuss their semi-smooth and monotone properties. We propose a semi-smooth Newton method and establish its convergence results in Sect. 3. Numerical results on a number of applications are presented in Sect. 4. Finally, we conclude this paper in Sect. 5.

1.1 Notations

Let I be the identity operator or identity matrix of suitable size. Given a convex function \(f:\mathbb {R}^n\rightarrow (-\infty ,+\infty ]\) and a scalar \(t>0\), the proximal mapping of f is defined by

$$\begin{aligned} \mathbf {prox}_{tf}(x):=\mathop {{{\mathrm{arg\,min}}}}\limits _{u\in \mathbb {R}^n} f(u)+\frac{1}{2t}\Vert u-x\Vert _2^2. \end{aligned}$$
(1.2)
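As a concrete example (consistent with the soft-thresholding operator used later in Sect. 4), for \(f=\Vert \cdot \Vert _1\) the minimization in (1.2) separates across coordinates and gives the componentwise formula

$$\begin{aligned} \left( \mathbf {prox}_{t\Vert \cdot \Vert _1}(x)\right) _i=\mathrm {sign}(x_i)\max (|x_i|-t, 0). \end{aligned}$$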

The Fenchel conjugate function \(f^*\) of f is

$$\begin{aligned} f^*(y):=\sup \limits _{x\in \mathbb {R}^n}\left\{ x^Ty-f(x)\right\} . \end{aligned}$$
(1.3)

A function f is said to be closed if its epigraph is closed, or equivalently f is lower semicontinuous. A mapping \(F:\mathbb {R}^n\rightarrow \mathbb {R}^n\) is said to be monotone, if

$$\begin{aligned} \left\langle {x-y},{F(x)-F(y)}\right\rangle \ge 0,\quad \text{ for } \text{ all }\ x,y\in \mathbb {R}^n. \end{aligned}$$

The mapping F is called nonexpansive (contractive) if F is Lipschitz continuous with constant \(L\le 1\) (\(L<1\)), and firmly nonexpansive if

$$\begin{aligned} \Vert F(x)-F(y)\Vert _2^2\le \left\langle {x-y},{F(x)-F(y)}\right\rangle ,\quad \text{ for } \text{ all }\ x,y\in \mathbb {R}^n. \end{aligned}$$

The mapping F is called strongly monotone with modulus \(c>0\) if

$$\begin{aligned} \left\langle {x-y},{F(x)-F(y)}\right\rangle \ge c\Vert x-y\Vert _2^2,\quad \text{ for } \text{ all }\ x,y\in \mathbb {R}^n. \end{aligned}$$

It is said that F is cocoercive with modulus \(\beta >0\) if

$$\begin{aligned}\left\langle {x-y},{F(x)-F(y)}\right\rangle \ge \beta \Vert F(x)-F(y)\Vert _2^2,\quad \text{ for } \text{ all }\ x,y\in \mathbb {R}^n. \end{aligned}$$

The operator T is called \(\alpha \)-averaged \((\alpha \in (0,1])\) if there exists a nonexpansive operator R such that \(T = (1-\alpha )I + \alpha R\).

2 Preliminaries

In this section, we review the fixed-point characterizations of some operator splitting algorithms and discuss some important properties of the associated fixed-point mappings. This section consists of three subsections. Section 2.1 reviews some operator splitting algorithms for problem (1.1), including FBS, DRS, and ADMM. These algorithms are well studied in the literature, see [2, 6, 7, 12] for example. Most of the operator splitting algorithms can also be interpreted as fixed-point algorithms derived from certain optimality conditions. Then we show in Sects. 2.2 and 2.3, respectively, that the fixed-point mappings derived from these operator splitting algorithms are semi-smooth and monotone.

2.1 Operator Splitting and Fixed-Point Algorithms

In problem (1.1), let h be a continuously differentiable function. The FBS algorithm is the iteration

$$\begin{aligned} x^{k+1}=\mathbf {prox}_{tf}(x^k-t\nabla h(x^k)),\quad k=0,1,\ldots , \end{aligned}$$
(2.1)

where \(t>0\) is the step size. Define the following operator

$$\begin{aligned} T_{\text {FBS}}:=\mathbf {prox}_{tf}\circ (I-t\nabla h). \end{aligned}$$
(2.2)

Then FBS can be viewed as a fixed-point iteration

$$\begin{aligned} x^{k+1}=T_{\text {FBS}}(x^k). \end{aligned}$$
(2.3)
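To make the fixed-point viewpoint concrete, the following minimal Python sketch iterates (2.3) for the \(\ell _1\)-regularized least-squares instance \(f(x)=\mu \Vert x\Vert _1\) and \(h(x)=\frac{1}{2}\Vert Ax-b\Vert _2^2\) considered in Sect. 4.1; the random data and parameter values are placeholders chosen only for illustration.

```python
import numpy as np

def prox_l1(x, thresh):
    # soft-thresholding: proximal mapping of thresh * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

def T_FBS(x, A, b, mu, t):
    # forward (gradient) step on h(x) = 0.5*||Ax - b||^2, then backward (prox) step on mu*||x||_1
    grad_h = A.T @ (A @ x - b)
    return prox_l1(x - t * grad_h, t * mu)

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
b = A @ (rng.standard_normal(100) * (rng.random(100) < 0.1))
mu = 0.1
t = 1.0 / np.linalg.norm(A, 2) ** 2      # grad h is cocoercive with beta = 1/||A||^2, so t <= 2*beta

x = np.zeros(100)
for k in range(500):
    x_new = T_FBS(x, A, b, mu, t)
    if np.linalg.norm(x - x_new) <= 1e-10:   # residual of the fixed-point mapping, ||x - T_FBS(x)||
        break
    x = x_new
```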

The DRS algorithm solves (1.1) by the following update:

$$\begin{aligned}&x^{k+1} = \mathbf {prox}_{th}(z^k), \end{aligned}$$
(2.4)
$$\begin{aligned}&y^{k+1} = \mathbf {prox}_{tf}(2x^{k+1}-z^k), \end{aligned}$$
(2.5)
$$\begin{aligned}&z^{k+1} = z^k+y^{k+1}-x^{k+1}. \end{aligned}$$
(2.6)

The algorithm traces back to [10, 11, 20], where it was used to solve partial differential equations (PDEs). The DRS iteration can be written as the fixed-point iteration

$$\begin{aligned} z^{k+1}=T_{\text {DRS}}(z^k), \end{aligned}$$
(2.7)

where

$$\begin{aligned} T_{\text {DRS}}:=I+\mathbf {prox}_{tf}\circ (2\mathbf {prox}_{th}-I)-\mathbf {prox}_{th}. \end{aligned}$$
(2.8)
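The DRS update (2.4)–(2.6) can likewise be coded directly as the fixed-point iteration (2.7). The sketch below is generic in the two proximal mappings, which are assumed to be supplied as functions (for instance, the soft-thresholding and affine-projection operators used in Sect. 4.2).

```python
import numpy as np

def T_DRS(z, prox_tf, prox_th):
    # one Douglas-Rachford sweep (2.4)-(2.6), i.e., the mapping T_DRS of (2.8)
    x = prox_th(z)                 # (2.4)
    y = prox_tf(2.0 * x - z)       # (2.5)
    return z + y - x               # (2.6)

def drs(z0, prox_tf, prox_th, tol=1e-10, maxit=1000):
    z = z0.copy()
    for _ in range(maxit):
        z_new = T_DRS(z, prox_tf, prox_th)
        if np.linalg.norm(z - z_new) <= tol:   # ||F_DRS(z)|| = ||z - T_DRS(z)||
            break
        z = z_new
    return prox_th(z)              # x = prox_th(z) is the candidate solution
```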

Consider a linear constrained program:

$$\begin{aligned} \begin{array}{ll} \mathop {\text {min}}\limits _{x_1\in \mathbb {R}^{n_1},x_2\in \mathbb {R}^{n_2}}\quad &{}f_1(x_1)+f_2(x_2)\\ \,\text {subject to}\,\quad &{} A_1x_1+A_2x_2=b, \end{array} \end{aligned}$$
(2.9)

where \(A_1\in \mathbb {R}^{m\times n_1}\) and \(A_2\in \mathbb {R}^{m\times n_2}\). The dual problem of (2.9) is given by

$$\begin{aligned} \mathop {\text {min}}\limits _{w\in \mathbb {R}^m}\quad d_1(w)+d_2(w), \end{aligned}$$
(2.10)

where

$$\begin{aligned} d_1(w):=f_1^*(A_1^Tw),\quad d_2(w):=f_2^*(A_2^Tw)-b^Tw. \end{aligned}$$

Assume that \(f_1\) and \(f_2\) are convex. It is widely known that the DRS iteration applied to the dual problem (2.10) is the ADMM [14, 15]. The ADMM is regarded as a variant of the augmented Lagrangian method and has attracted much attention in numerous fields. A recent survey paper [3] describes applications of the ADMM in statistics and machine learning. The ADMM is equivalent to the following fixed-point iteration

$$\begin{aligned} z^{k+1}=T_{\text {DRS}}(z^k), \end{aligned}$$

where \(T_{\text {DRS}}\) is the DRS fixed-point mapping for problem (2.10).

2.2 Semi-Smoothness of Proximal Mapping

We now discuss the semi-smoothness of proximal mappings. This property often implies that the fixed-point mappings corresponding to operator splitting algorithms are semi-smooth or strongly semi-smooth.

Let \({\mathcal {O}}\subseteq \mathbb {R}^n\) be an open set and \(F:{\mathcal {O}}\rightarrow \mathbb {R}^m\) be a locally Lipschitz continuous function. Rademacher’s theorem says that F is almost everywhere differentiable. Let \(D_F\) be the set of differentiable points of F in \({\mathcal {O}}\). We next introduce the concept of the generalized differential.

Definition 2.1

Let \(F:{\mathcal {O}}\rightarrow \mathbb {R}^m\) be locally Lipschitz continuous at \(x\in {\mathcal {O}}\). The B-subdifferential of F at x is defined by

$$\begin{aligned} \partial _BF(x):=\left\{ \lim \limits _{k\rightarrow \infty }F'(x^k)| x^k\in D_F, x^k\rightarrow x\right\} . \end{aligned}$$

The set \(\partial F(x)=co (\partial _BF(x))\) is called Clarke’s generalized Jacobian, where \(co \) denotes the convex hull.

The notion of semi-smoothness plays a key role in establishing the locally superlinear convergence of nonsmooth Newton-type methods. Semi-smoothness was originally introduced by Mifflin [21] for real-valued functions and extended to vector-valued mappings by Qi and Sun [28].

Definition 2.2

Let \(F:{\mathcal {O}}\rightarrow \mathbb {R}^m\) be a locally Lipschitz continuous function. We say that F is semi-smooth at \(x\in {\mathcal {O}}\) if

  1. (a)

    F is directionally differentiable at x; and

  2. (b)

    for any \(d\in \mathbb {R}^n\) and \(J\in \partial F(x+d)\),

    $$\begin{aligned} \Vert F(x+d)-F(x)-Jd\Vert _2=o(\Vert d\Vert _2)\quad \text{ as }\ d\rightarrow 0. \end{aligned}$$

Furthermore, F is said to be strongly semi-smooth at \(x\in {\mathcal {O}}\) if F is semi-smooth and for any \(d\in \mathbb {R}^n\) and \(J\in \partial F(x+d)\),

$$\begin{aligned} \Vert F(x+d)-F(x)-Jd\Vert _2=O\left( \Vert d\Vert _2^2\right) \quad \text{ as }\ d\rightarrow 0. \end{aligned}$$

Examples of semi-smooth functions include smooth functions, all convex functions (and hence norms), and piecewise differentiable functions. Differentiable functions with Lipschitz continuous gradients are strongly semi-smooth. For every \(p\in [1,\infty ]\), the norm \(\Vert \cdot \Vert _{p}\) is strongly semi-smooth. Piecewise affine functions are strongly semi-smooth. Examples of semi-smooth functions are studied thoroughly in [12, 37].

The basic properties of the proximal mapping are well documented in textbooks such as [2, 29]. The proximal mapping \(\mathbf {prox}_{f}\), corresponding to a proper, closed and convex function \(f:\mathbb {R}^n\rightarrow \mathbb {R}\), is single-valued, maximal monotone and nonexpansive. Moreover, the proximal mappings of many interesting functions are (strongly) semi-smooth. It is worth mentioning that the semi-smoothness of the proximal mapping does not hold in general [31].

We next review some existing results on the semi-smoothness of the proximal mappings of various interesting functions. The proximal mapping of the \(\ell _1\)-norm \(\Vert x\Vert _1\), which is the well-known soft-thresholding operator, is component-wise separable and piecewise affine. Hence, the operator \(\mathbf {prox}_{\Vert \cdot \Vert _1}\) is strongly semi-smooth. According to Moreau’s decomposition, the proximal mapping of the \(\ell _{\infty }\)-norm (the conjugate of the \(\ell _1\)-norm) is also strongly semi-smooth. For \(k\in \mathbb {N}\), a function with k continuous derivatives is called a \(\mathcal {C}^k\) function. A function \(f:{\mathcal {O}}\rightarrow \mathbb {R}^m\) defined on the open set \({\mathcal {O}}\subseteq \mathbb {R}^n\) is called a piecewise \(\mathcal {C}^k\) function, \(k\in [1,\infty ]\), if f is continuous and if at every point \(\bar{x}\in {\mathcal {O}}\) there exist a neighborhood \(V\subset {\mathcal {O}}\) and a finite collection of \(\mathcal {C}^k\) functions \(f_i:V\rightarrow \mathbb {R}^m, i=1,\ldots ,N\), such that

$$\begin{aligned} f(x)\in \{f_1(x),\ldots ,f_N(x)\}\quad \text {for all}\ x\in V. \end{aligned}$$

For a comprehensive study of piecewise \(\mathcal {C}^k\) functions, the readers are referred to [30]. From [37, Proposition 2.26], if f is a piecewise \(\mathcal {C}^1\) (piecewise smooth) function, then f is semi-smooth; if f is a piecewise \(\mathcal {C}^2\) function, then f is strongly semi-smooth. As described in [26, Section 5], in many applications the proximal mappings are piecewise \(\mathcal {C}^1\) and thus semi-smooth. The metric projection, which is the proximal mapping of an indicator function, plays an important role in the analysis of constrained programs. The projection onto a polyhedral set is piecewise linear [29, Example 12.31] and hence strongly semi-smooth. The projections onto symmetric cones are proved to be strongly semi-smooth in [35].
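As a simple polyhedral example, the projection onto a box \([l,u]\subseteq \mathbb {R}^n\) is the componentwise clipping operator

$$\begin{aligned} \left( P_{[l,u]}(x)\right) _i=\min \{\max \{x_i,l_i\},u_i\}, \end{aligned}$$

which is piecewise affine and therefore strongly semi-smooth.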

2.3 Monotonicity of Fixed-Point Mappings

This subsection focuses on the monotonicity of the fixed-point mapping \(F:=I-T\), where \(T:\mathbb {R}^n\rightarrow \mathbb {R}^n\) is a fixed-point operator. Later, we will show that the monotonicity and nonexpansiveness of F play critical roles in our proposed method.

Proposition 2.3

  1. (i)

    Suppose that \(\nabla h\) is cocoercive with \(\beta >0\). Then \(F_{\text {FBS}}=I-T_{\text {FBS}}\) is monotone and \(\frac{2\beta }{4\beta -t}\)-averaged if \(0<t\le 2\beta \).

  2. (ii)

    Suppose that \(\nabla h\) is strongly monotone with \(c>0\) and Lipschitz with \(L>0\), then \(F_{\text {FBS}}\) is strongly monotone if \(0<t<2c/L^2\).

  3. (iii)

    Suppose that \(h\in C^2\), \(H(x):=\nabla ^2h(x)\) is positive semidefinite for any \(x\in \mathbb {R}^n\) and \(\bar{\lambda }=\max _x\lambda _{\text {max}}(H(x))<\infty \). Then, \(F_{\text {FBS}}\) is monotone if \(0<t\le 2/\bar{\lambda }\).

  4. (iv)

    The fixed-point mapping \(F_{\text {DRS}}:=I-T_{\text {DRS}}\) is monotone and firmly nonexpansive hence \(\frac{1}{2}\)-averaged.

  5. (v)

    If T is \(\alpha \)-averaged \((\alpha \in (0,1])\), then \(F = I - T\) is \(\frac{1}{2\alpha }\)-cocoercive.

Proof

Items (i) and (ii) are well known in the literature, see [8, 44] for example.

(iii) From the mean value theorem, for any \(x,y\in \mathbb {R}^n\), one has that

$$\begin{aligned} \nabla h(x)-\nabla h(y)=\bar{H}(x,y)(x-y), \end{aligned}$$

where \(\bar{H}(x,y)=\int _0^1H(y+s(x-y))\text{ d }s\) and thus \(\lambda _{max}(\bar{H}(x,y))\le \bar{\lambda }\). Hence, we obtain

$$\begin{aligned} \Vert \nabla h(x)-\nabla h(y)\Vert _2^2=\left\langle {\bar{H}(x,y)(x-y)},{\bar{H}(x,y)(x-y)}\right\rangle \le \bar{\lambda }\left\langle {x-y},{\nabla h(x)-\nabla h(y)}\right\rangle , \end{aligned}$$

which implies that \(\nabla h\) is cocoercive with modulus \(1/\bar{\lambda }\). Hence, the monotonicity follows from item (i).

(iv) It has been shown that the operator \(T_{\text {DRS}}\) is firmly nonexpansive, see [20]. Therefore, \(F_{\text {DRS}}\) is firmly nonexpansive and hence monotone [2, Proposition 4.2].

(v) The proof follows directly from the definition of the \(\alpha \)-averagedness of T. \(\square \)

Items (i) and (ii) demonstrate that \(F_{\text {FBS}}\) is monotone as long as the step size t is properly selected. It is also shown in [44] that, when f is an indicator function of a closed convex set, the step size interval in items (i) and (ii) can be enlarged to \((0,4\beta ]\) and \((0,4c/L^2)\), respectively. Finally, we introduce a useful lemma on the positive semidefiniteness of the subdifferential of a monotone mapping, see [17, Proposition 2.3].

Lemma 2.4

For a Lipschitz continuous mapping \(F:\mathbb {R}^n\rightarrow \mathbb {R}^n\), F is monotone if and only if each element of \(\partial _BF(x)\) is positive semidefinite for any \(x\in \mathbb {R}^n\).

3 Semi-Smooth Newton Method for Nonlinear Monotone Equations

The purpose of this section is to design a Newton-type method for solving the system of nonlinear equations

$$\begin{aligned} F(z)=0, \end{aligned}$$
(3.1)

where \(F:\mathbb {R}^n\rightarrow \mathbb {R}^n\) is semi-smooth and monotone. In particular, we are interested in \(F(z)=z-T(z)\), where T(z) is a fixed-point mapping corresponding to certain first-order type algorithms. Let \(Z^*:=\{z\in \mathbb {R}^n:F(z)=0\}\) denote the solution set to nonlinear equations (3.1). It is known that \(Z^*\) is a closed convex set if the mapping F is maximal monotone.

3.1 A Regularized Semi-Smooth Newton Method with Projection Steps

The system of monotone equations has various applications [1, 19, 24, 32, 46]. Inspired by the pioneering work [32], a class of iterative methods for solving nonlinear (smooth) monotone equations has been proposed in recent years [1, 19, 46]. In [32], the authors proposed a globally convergent Newton method by exploiting the monotone structure, whose primary advantage is that the whole sequence of distances from the iterates to the solution set is decreasing. The method was extended in [45] to solve monotone equations without a nonsingularity assumption.

The main idea of [32] is as follows. Starting from an initial point \(z^0\), for an iterate \(z^k\), let \(d^k\) be a direction such that

$$\begin{aligned} \left\langle {F(u^k)},{-d^k}\right\rangle >0, \end{aligned}$$

where \(u^k=z^k+d^k\) is an intermediate iterate. By monotonicity of F, for any \(z^*\in Z^*\) one has

$$\begin{aligned} \left\langle {F(u^k)},{z^*-u^k}\right\rangle \le 0. \end{aligned}$$

Therefore, the hyperplane

$$\begin{aligned} H_k:=\left\{ z\in \mathbb {R}^n|\left\langle {F(u^k)},{z-u^k}\right\rangle =0\right\} \end{aligned}$$

strictly separates \(z^k\) from the solution set \(Z^*\). Based on this fact, the next iterate in [32] is computed as

$$\begin{aligned} z^{k+1}=z^k-\frac{\left\langle {F(u^k)},{z^k-u^k}\right\rangle }{\Vert F(u^k)\Vert _2^2}F\left( u^k\right) . \end{aligned}$$

It is easy to show that the point \(z^{k+1}\) is the projection of \(z^k\) onto the hyperplane \(H_k\). The hyperplane projection step is critical to construct a globally convergent method for solving the system of nonlinear monotone equations. By applying the same technique, we develop a globally convergent method for solving semi-smooth monotone equations (3.1).
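In code, the projection of \(z^k\) onto \(H_k\) is a one-line formula; the following minimal sketch (with hypothetical names F, z and u for the mapping and the two iterates) illustrates it.

```python
import numpy as np

def hyperplane_projection(F, z, u):
    # project z onto H = { w : <F(u), w - u> = 0 }, which strictly separates z from the solution set
    Fu = F(u)
    return z - (np.dot(Fu, z - u) / np.dot(Fu, Fu)) * Fu
```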

It has been demonstrated in Lemma 2.4 that each element of the B-subdifferential of a monotone and semi-smooth mapping is positive semidefinite. Hence, for an iterate \(z^k\), by choosing an element \(J_k\in \partial _BF(z^k)\), it is natural to apply a regularized Newton method. It computes

$$\begin{aligned} (J_k+\mu _kI)d=-F^k, \end{aligned}$$
(3.2)

where \(F^k=F(z^k)\), \(\mu _k=\lambda _k\Vert F^k\Vert _2\) and \(\lambda _k>0\) is a regularization parameter. The regularization term \(\mu _k I\) is added so that \(J_k+\mu _k I\) is invertible. From a computational point of view, it is practical to solve the linear system (3.2) inexactly. Define

$$\begin{aligned} r^k:=(J_k+\mu _kI)d^k+F^k. \end{aligned}$$
(3.3)

At each iteration, we seek a step \(d^k\) by solving (3.2) approximately such that

$$\begin{aligned} \Vert r^k\Vert _2\le \tau \min \{1,\lambda _k\Vert F^k\Vert _2 \Vert d^k\Vert _2\}, \end{aligned}$$
(3.4)

where \(0<\tau <1\) is some positive constant. Then a trial point is obtained as

$$\begin{aligned} u^k=z^k+d^k. \end{aligned}$$

If \(\Vert F(u^k)\Vert _2\) is decreased sufficiently, we take a Newton step. Specifically, let \(\xi _0=\Vert F(z^0)\Vert _2\). When \(\Vert F(u^k)\Vert _2 \le \nu \xi _{k}\) with \(0<\nu <1\), we say that the Newton step is successful and set

$$\begin{aligned} z^{k+1} = u^k, \xi _{k+1}=\Vert F(u^k)\Vert _2 \text{ and } \lambda _{k+1}= \lambda _k. \quad \text{[Newton } \text{ step] } \end{aligned}$$
(3.5)

Otherwise, we take a safeguard step as follows. Define a ratio

$$\begin{aligned} \rho _k=\frac{-\left\langle {F(u^k)},{d^k}\right\rangle }{\Vert d^k\Vert _2^2}. \end{aligned}$$
(3.6)

Select parameters \(0<\eta _1\le \eta _2<1\) and \(1<\gamma _1 <\gamma _2\). If \(\rho _k\ge \eta _1\), the iteration is said to be successful; otherwise, it is unsuccessful. Moreover, for a successful iteration, we take a hyperplane projection step when the residual at the projected point does not increase, and a fixed-point step otherwise. In summary, we set

$$\begin{aligned} z^{k+1}={\left\{ \begin{array}{ll} v^k,\ &{}\text {if} \; \rho _k\ge \eta _1 \; \text{ and } \; \Vert F(v^k)\Vert _2 \le \Vert F(z^k)\Vert _2, \text{[projection } \text{ step] }\\ w^k, &{}\text {if}\; \rho _k\ge \eta _1 \; \text{ and } \; \Vert F(v^k)\Vert _2 > \Vert F(z^k)\Vert _2, \text{[fixed-point } \text{ step] }\\ z^k, &{}\text {if}\; \rho _k<\eta _1, \qquad \qquad \qquad \qquad \qquad \qquad \qquad \text{[unsuccessful } \text{ step] } \end{array}\right. } \end{aligned}$$
(3.7)

where

$$\begin{aligned} v^k=z^k-\frac{\left\langle {F(u^k)},{z^k-u^k}\right\rangle }{\Vert F(u^k)\Vert _2^2}F\left( u^k\right) , \ \ w^k = z^k - \beta F(z^k), \beta \in \left( 0,\frac{1}{\alpha }\right) . \end{aligned}$$
(3.8)

The parameters \(\xi _{k+1}\) and \(\lambda _{k+1}\) are updated as

$$\begin{aligned} \xi _{k+1}=\xi _k, \quad \lambda _{k+1} \in \left\{ \begin{array}{ll} (\underline{\lambda },\lambda _k), &{}\text {if}\ \rho _k\ge \eta _2,\\ {[}\lambda _k,\gamma _1\lambda _k], &{}\text {if}\ \eta _1\le \rho _k<\eta _2,\\ (\gamma _1\lambda _k,\gamma _2\lambda _k], &{}\text {otherwise,} \end{array}\right. \end{aligned}$$
(3.9)

where \(\underline{\lambda }>0\) is a small positive constant. These parameters determine how aggressively the regularization parameter is increased when an iteration is unsuccessful or decreased when \(\rho _k\ge \eta _2\). The complete approach for solving (3.1) is summarized in Algorithm 1.

[Algorithm 1: A regularized semi-smooth Newton method with projection steps (ASSN)]
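For readability, the following Python sketch outlines the outer loop described by (3.2)–(3.9). It is a minimal illustration rather than the exact implementation used in our experiments: the routine newton_direction is assumed to return an inexact solution of (3.2) satisfying (3.4), and the default parameter values are placeholders.

```python
import numpy as np

def assn(F, newton_direction, z0, alpha, nu=0.8, eta1=0.2, eta2=0.8,
         gamma1=2.0, gamma2=10.0, lam=1.0, lam_min=1e-8, tol=1e-8, maxit=500):
    """Sketch of Algorithm 1: F is the residual mapping, newton_direction(z, lam)
    is assumed to return d with (J + lam*||F(z)||*I) d ~= -F(z) in the sense of (3.4)."""
    beta = 0.9 / alpha                           # fixed-point step size beta in (0, 1/alpha), see (3.8)
    z = z0.copy()
    xi = np.linalg.norm(F(z))                    # xi_0 = ||F(z^0)||
    for _ in range(maxit):
        Fz = F(z)
        if np.linalg.norm(Fz) <= tol:
            break
        d = newton_direction(z, lam)             # inexact regularized semi-smooth Newton step (3.2)-(3.4)
        u = z + d
        Fu = F(u)
        if np.linalg.norm(Fu) <= nu * xi:        # Newton step (3.5)
            z, xi = u, np.linalg.norm(Fu)
            continue                             # xi and lam updated as in (3.5)
        rho = -np.dot(Fu, d) / np.dot(d, d)      # ratio (3.6)
        if rho >= eta1:                          # successful safeguard iteration (3.7)
            v = z - (np.dot(Fu, z - u) / np.dot(Fu, Fu)) * Fu   # projection step
            z = v if np.linalg.norm(F(v)) <= np.linalg.norm(Fz) else z - beta * Fz
        if rho >= eta2:                          # update lambda as in (3.9)
            lam = max(lam_min, 0.5 * lam)
        elif rho >= eta1:
            lam = gamma1 * lam
        else:
            lam = gamma2 * lam
    return z
```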

3.2 Global Convergence

We make the following assumption.

Assumption 3.1

The set \(Z^*\) of optimal solutions is nonempty. The function \(F:\mathbb {R}^n\rightarrow \mathbb {R}^n\) is monotone and the corresponding fixed point operator \(T = I -F\) is \(\alpha \)-averaged with \(\alpha \in (0,1]\).

The assumption that T is \(\alpha \)-averaged implies that F is Lipschitz continuous with modulus \(L\le 2\alpha \); indeed, writing \(T=(1-\alpha )I+\alpha R\) with R nonexpansive gives \(F=\alpha (I-R)\), and \(I-R\) is Lipschitz continuous with constant 2. Hence, we have \(\Vert J_k\Vert \le L\) for any \(k\ge 0\) and any \(J_k\in \partial _{B}F(z^k)\). Since the Lipschitz continuity of F is sufficient to obtain global convergence, the semi-smoothness of F is actually not required in this subsection.

The following lemma demonstrates that the distance from \(z^k\) to \(Z^*\) decreases in a projection step. The proof follows directly from [32, Lemma 2.1], and it is omitted.

Lemma 3.2

For any \(z^*\in Z^*\) and any projection step, indexed by say k, we have that

$$\begin{aligned} \Vert z^{k+1}-z^*\Vert _2^2\le \Vert z^k-z^*\Vert _2^2-\Vert z^{k+1}-z^k\Vert _2^2. \end{aligned}$$
(3.10)

Denote the index sets of Newton steps, projection steps, fixed-point steps and successful iterations, respectively, by

$$\begin{aligned} \mathcal {K}_{N}:= & {} \{k\ge 0\ |\ \ \Vert F(u^k)\Vert _2 \le \nu \xi _k\}, \\ \mathcal {K}_{P}:= & {} \{k\ge 0\ |\ \rho _k\ge \eta _1,\ \Vert F(u^k)\Vert _2> \nu \xi _k, \Vert F(v^k)\Vert _2 \le \Vert F(z^k)\Vert _2 \}, \\ \mathcal {K}_{F}:= & {} \{k\ge 0\ |\ \rho _k\ge \eta _1,\ \Vert F(u^k)\Vert _2> \nu \xi _k, \Vert F(v^k)\Vert _2 > \Vert F(z^k)\Vert _2\}, \end{aligned}$$

and

$$\begin{aligned} \mathcal {K}_{S}:=\{k\ge 0\ |\ \Vert F(u^k)\Vert _2 \le \nu \xi _k\ \mathrm {or}\ \rho _k\ge \eta _1\}. \end{aligned}$$

We next show that if there are only finitely many successful iterations, the later iterates are optimal solutions.

Lemma 3.3

Suppose that Assumption 3.1 holds and the index set \(\mathcal {K}_{S}\) is finite. Then \(z^k=z^*\) for all sufficiently large k and \(F(z^*)=0\).

Proof

Denote the index of the last successful iteration by \(k_0\). By the definitions of “successful” and “unsuccessful” iterations, we have \(\rho _k<\eta _1\) for all \(k>k_0\). Therefore, from (3.7) and (3.9), one has \(z^{k_0+i}=z^{k_0+1}:=z^*\) for all \(i\ge 1\) and, additionally, \(\lambda _k\rightarrow \infty \). We proceed by contradiction. Suppose that \(\phi :=\Vert F(z^*)\Vert _2>0\). For all \(k> k_0\), it follows from (3.3) that

$$\begin{aligned} d^k=(J_k+\lambda _k\Vert F^k\Vert _2I)^{-1}\left( r^k-F^k\right) , \end{aligned}$$

which, together with \(\lambda _k\rightarrow \infty \), \(\Vert r^k\Vert _2\le \tau \) and the fact that \(J_k\) is positive semidefinite, implies that \(d^k\rightarrow 0\) and hence \(u^k\rightarrow z^*\).

We now show that when \(\lambda _k\) is large enough, the ratio \(\rho _k\) is not smaller than \(\eta _2\). For this purpose, we consider an iteration with index \(k> k_0\) sufficiently large such that

$$\begin{aligned} \lambda _k\ge \frac{\eta _2+L}{(1-\tau )\phi }. \end{aligned}$$

Then it follows that

$$\begin{aligned} -\left\langle {F(z^k)},{d^k}\right\rangle= & {} \left\langle {(J_k+\lambda _k\Vert F^k\Vert _2I)d^k},{d^k}\right\rangle -\left\langle {r^k},{d^k}\right\rangle \nonumber \\\ge & {} \lambda _k\Vert F^k\Vert _2\Vert d^k\Vert _2^2-\tau \lambda _k\Vert F^k\Vert _2\Vert d^k\Vert _2^2\nonumber \\\ge & {} (\eta _2+L)\Vert d^k\Vert _2^2. \end{aligned}$$
(3.11)

The second line above follows from the positive semidefiniteness of \(J_k\) and the residual condition (3.4). The last inequality is obtained from the choice of \(\lambda _k\) together with \(\Vert F^k\Vert _2=\Vert F(z^*)\Vert _2=\phi \) for \(k>k_0\). Further, we obtain

$$\begin{aligned} -\left\langle {F(u^k)},{d^k}\right\rangle= & {} -\left\langle {F(z^k)},{d^k}\right\rangle -\left\langle {F(u^k)-F(z^k)},{d^k}\right\rangle \\\ge & {} (\eta _2+L)\Vert d^k\Vert _2^2-\Vert F(u^k)-F(z^k)\Vert _2\cdot \Vert d^k\Vert _2\\\ge & {} \eta _2\Vert d^k\Vert _2^2, \end{aligned}$$

where the first inequality is derived from (3.11) and the second inequality comes from the L-Lipschitz continuity of F and \(\Vert u^k-z^k\Vert _2=\Vert d^k\Vert _2\). Hence, we have \(\rho _k\ge \eta _2\), which gives a successful iteration and yields a contradiction. This completes the proof. \(\square \)

We are now ready to prove the main global convergence result.

Theorem 3.4

If Assumption 3.1 holds, then we have \(\lim _{k\rightarrow \infty } \Vert F(z^k)\Vert _2=0\).

Proof

If the index set \(\mathcal {K}_S\) is finite, the convergence follows directly from Lemma 3.3. Thus, we only need to consider the situation in which the index set \(\mathcal {K}_S\) is infinite. The main task is to prove that the full sequence \(\{\Vert F(z^k)\Vert _2\}_{k \ge 0}\) converges to zero in the following two cases.

Case (i): \(\mathcal {K}_N\) is finite.

Without loss of generality, we ignore \(\mathcal {K}_N\) and assume that \(\mathcal {K}_S=\mathcal {K}_P \cup \mathcal {K}_{F}\) in this case. Let \(z^*\) be any point in solution set \(Z^*\). For any \(k \in \mathcal {K}_{F}\), we have that \(z^{k+1}=z^k-\beta F(z^k)\) and

$$\begin{aligned} \Vert z^k-z^*\Vert _2^2 - \Vert z^{k+1}-z^*\Vert _2^2= & {} 2\beta \left\langle {F(z^k)},{z^k-z^*}\right\rangle - \beta ^2\Vert F(z^k)\Vert _2^2 \nonumber \\\ge & {} \beta \left( \frac{1}{\alpha }-\beta \right) ||F(z^k)||_2^2\ge 0, \end{aligned}$$
(3.12)

where the first inequality is due to the fact that F is \(\frac{1}{2\alpha }\)-cocoercive by item (v) of Proposition 2.3, and the last inequality follows from the choice of \(\beta \). By Lemma 3.2, for any \(k\in \mathcal {K}_{P}\), it holds that

$$\begin{aligned} \Vert z^k-z^*\Vert _2^2 - \Vert z^{k+1}-z^*\Vert _2^2 \ge \Vert z^{k+1}-z^k\Vert _2^2. \end{aligned}$$
(3.13)

Therefore, the whole sequence \(\{\Vert z^k-z^*\Vert _2\}_{k \ge 0}\) is non-increasing and convergent, and the sequence \(\{z^k\}_{k \ge 0}\) is bounded. By (3.3) and (3.4), it follows that

$$\begin{aligned} \Vert F^k\Vert _2\ge \Vert (J_k+\lambda _k\Vert F^k\Vert _2I)d^k\Vert _2-\Vert r^k\Vert _2\ge (1-\tau ) \lambda _k\Vert F^k\Vert _2\Vert d^k\Vert _2, \end{aligned}$$

which implies that \(\Vert d^k\Vert _2\le 1/[(1-\tau )\underline{\lambda }]\). This inequality shows that \(\{d^k\}_{k \ge 0}\) is bounded, and hence \(\{u^k\}_{k \ge 0}\) is also bounded. By the continuity of F, there exists a constant \(c_1>0\) such that

$$\begin{aligned} \Vert F(u^k)\Vert _2^{-1}\ge c_1,\quad \text {for any }\ k \ge 0. \end{aligned}$$

Using (3.7) and (3.8), for any \(k\in \mathcal {K}_{P}\), we obtain that

$$\begin{aligned} \Vert z^{k+1}-z^k\Vert _2=\frac{-\left\langle {F(u^k)},{d^k}\right\rangle }{\Vert F(u^k)\Vert _2}\ge c_1\rho _k\Vert d^k\Vert _2^2 \ge c_1\eta _1 \Vert d^k\Vert _2^2. \end{aligned}$$
(3.14)

We next consider either

$$\begin{aligned} \liminf \limits _{k\rightarrow \infty }\Vert F^k\Vert _2=0\quad \text {or}\quad \liminf \limits _{k\rightarrow \infty }\Vert F^k\Vert _2=c_2>0. \end{aligned}$$

If \(\liminf \nolimits _{k\rightarrow \infty }\Vert F^k\Vert _2=0\), the continuity of F and the boundedness of \(\{z^k\}_{k \ge 0}\) imply that the sequence \(\{z^k\}_{k \ge 0}\) has an accumulation point \(\hat{z}\) such that \(F(\hat{z})=0\). Since \(z^*\) is an arbitrary point in \(Z^*\), we can choose \(z^*=\hat{z}\) in (3.12) and (3.13). Then \(\{z^k\}_{k \ge 0}\) converges to \(\hat{z}\) and \(\{\Vert F(z^k)\Vert _2\}_{k \ge 0}\) converges to zero.

If \(\liminf \nolimits _{k\rightarrow \infty }\Vert F^k\Vert _2=c_2>0\), by using the continuity of F and the boundedness of \(\{z^k\}_{k \ge 0}\) again, there exist constants \(c_3>c_4>0\) such that

$$\begin{aligned} c_4\le \Vert F^k\Vert _2\le c_3,\quad \text {for all}\ k\ge 0. \end{aligned}$$

We now show that \(\{\lambda _k\}_{k \ge 0}\) is bounded above. If \(\lambda _k\) is large enough such that \(\Vert d^k\Vert _2\le 1\) and

$$\begin{aligned} \lambda _k\ge \frac{\eta _2+L}{(1-\tau ) c_4}, \end{aligned}$$

then by a proof similar to that of Lemma 3.3 we have \(\rho _k\ge \eta _2\) and consequently \(\lambda _{k+1}<\lambda _k\). Hence, \(\{\lambda _k\}_{k \ge 0}\) is bounded from above, say by \(\bar{\lambda }>0\). Using (3.3), (3.4), \(\Vert J_k\Vert \le L\) and the upper bound of \(\{\lambda _k\}\), we have

$$\begin{aligned} \Vert F^k\Vert _2\le \Vert (J_k+\lambda _k\Vert F^k\Vert _2I)d^k\Vert _2+\Vert r^k\Vert _2\le (L+(1+\tau ) c_3\bar{\lambda })\Vert d^k\Vert _2. \end{aligned}$$
(3.15)

Hence, it follows from (3.13), (3.14) and (3.15) that there exists a constant \(c_5>0\) such that for any \(k \in \mathcal {K}_{P}\),

$$\begin{aligned} \Vert z^k-z^*\Vert _2^2 - \Vert z^{k+1}-z^*\Vert _2^2 \ge c_5||F(z^k)||_2^4. \end{aligned}$$
(3.16)

Then, by (3.12) and (3.16), we conclude that \(\{\Vert F(z^k)\Vert _2\}_{k \ge 0}\) converges to zero, which contradicts the assumption \(c_2 > 0\). The convergence of \(\{z^k\}_{k\ge 0}\) and \(\{\Vert F(z^k)\Vert _2\}_{k \ge 0}\) is thus established in case (i).

Case (ii): \(\mathcal {K}_N\) is infinite.

Let \((k_i)_{i \ge 0}\) enumerate all elements of the set \(\{k + 1 : k \in \mathcal {K}_N \}\) in increasing order. Since \(\Vert F(z^{k_i})\Vert _2\le \nu \Vert F(z^{k_{i-1}})\Vert _2\) and \(0<\nu <1\) for any \(i\ge 1\), the sequence \(\{\Vert F(z^{k_i})\Vert _2\}\) converges to zero. For any \(k\in \mathcal {K}_{P}\), we have \(\Vert F(z^{k+1})\Vert _2\le \Vert F(z^{k})\Vert _2\) from the updating rule (3.7). For any \(k \in \mathcal {K}_F\), we have that \(z^{k+1}=(I-\beta F)(z^k)=[(1-\alpha \beta )I+\alpha \beta R](z^k)\) with \(\alpha \beta <1\). Since R is nonexpansive, from [7, Theorem 1], we have that \(\Vert (R-I)(z^{k+1})\Vert _2\le \Vert (R-I)(z^k)\Vert _2\). By noticing that \(F(z^k)=\alpha (I-R)(z^k)\), it follows that \(\Vert F(z^{k+1})\Vert _2 \le \Vert F(z^k)\Vert _2\) for any \(k \in \mathcal {K}_F\). Moreover, for any \(k\notin \mathcal {K}_{N}\), there exists an index i such that \(k_i<k+1<k_{i+1}\), and hence \(\Vert F(z^{k+1})\Vert _2 \le \Vert F(z^{k_i})\Vert _2\). Therefore, the whole sequence \(\{\Vert F(z^k)\Vert _2\}_{k \ge 0}\) converges to 0, which completes the proof. \(\square \)

Some comments on Theorem 3.4 are given as follows.

  1. (i)

    In our arguments, we do not assume that \(\{z^k\}\) is bounded. In fact, the subsequence \(\{z^k\}_{k \in \mathcal {K}_P \cup \mathcal {K}_F}\) is bounded because of (3.12) and (3.13). Hence, the existence of accumulation points depends on the boundedness of the subsequence \(\{z^k\}_{k \in \mathcal {K}_N}\). Moreover, any accumulation point \(\bar{z}\) of \(\{z^k\}\) satisfies \(F(\bar{z})=0\).

  2. (ii)

    If the global error bound condition is satisfied at \(\bar{z}\), i.e., there exists a constant \(c>0\) such that

    $$\begin{aligned} c\Vert z-\bar{z}\Vert \le \Vert F(z)\Vert _2,\quad \forall z\in \mathbb {R}^n, \end{aligned}$$
    (3.17)

    the whole sequence \(\{z^k\}\) converges to \(\bar{z}\). Consider the case where F is derived from the fixed-point mapping with respect to the FBS method and the function h in (2.2) is strongly convex. Then the optimal solution \(z^*\) is unique and the global error bound condition is satisfied at \(z^*\) [36, Theorem 4].

As shown above, the global convergence of our algorithm is essentially guaranteed by the projection step and the fixed-point step. However, noticing that (3.8) is of the form \(v^k=z^k-\alpha _kF(u^k)\) with \(\alpha _k=\left\langle {F(u^k)},{z^k-u^k}\right\rangle /\Vert F(u^k)\Vert _2^2>0\), the projection step is indeed an extragradient step [12]. Since the asymptotic convergence rate of the extragradient step is often not faster than that of the Newton step, slow convergence may be observed if the projection and fixed-point steps are always performed. Hence, our modification (3.7) is practically meaningful. Moreover, we will prove next that the projection and fixed-point steps are never performed when the iterate is close enough to a solution, under some generalized nonsingularity conditions.

3.3 Fast Local Convergence

By the global convergence analysis, we can assume that \(z^*\) is an accumulation point of the sequence \(\{z^k\}\) generated by Algorithm 1 such that \(F(z^*)=0\). We also make the following assumption.

Assumption 3.5

The mapping F defined in (3.1) is semi-smooth and BD-regular at \(z^*\), which means that all elements in \(\partial _BF(z^*)\) are nonsingular.

The semi-smoothness of the mapping F often holds, since the proximal mapping is (strongly) semi-smooth in many interesting applications. BD-regularity is a common assumption in the analysis of the local convergence of nonsmooth methods. When F is BD-regular at \(z^*\), from [27, Proposition 2.5] and [25, Proposition 3], there exist constants \(c_6>0,\kappa >0\) and a neighborhood \(N(z^*,\varepsilon _0)\) such that for any \(y\in N(z^*,\varepsilon _0)\) and \(J\in \partial _BF(y)\), it holds that:

  1. (1)

    \(z^*\) is an isolated solution;

  2. (2)

    J is nonsingular and \(\Vert J^{-1}\Vert \le c_6\);

  3. (3)

    the local error bound condition holds for F(z) on \(N(z^*,\varepsilon _0)\), that is \(\Vert y-z^*\Vert _2\le \kappa \Vert F(y)\Vert _2\).

Note that the conditions that \(z^*\) is isolated and that \(\Vert z^k-z^*\Vert _2\le \kappa \Vert F^k\Vert _2\) (for k large enough) imply that the whole sequence \(\{z^k\}\) converges to \(z^*\).

We next show that Algorithm 1 turns into a Newton-type method in a neighborhood of \(z^*\) and achieves fast local convergence.

Theorem 3.6

Suppose that Assumption 3.5 holds and the regularization parameter \(\lambda _k\) is bounded above by \(\bar{\lambda }\) for any \(z^k\in N(z^*,\varepsilon _1)\) with \(\varepsilon _1:=\min \{\varepsilon _0,1/(2Lc_6\tau \bar{\lambda })\}\). Then for sufficiently large k, we have \(\Vert F(u^{k})\Vert _2\le \nu \Vert F(z^k)\Vert _2\) and \(z^{k+1}=u^k\), and \(\{z^k\}\) converges superlinearly to \(z^*\). If F is strongly semi-smooth at \(z^*\), then \(\{z^k\}\) converges quadratically to \(z^*\).

Proof

It follows from the L-Lipschitz continuity of F that \(\Vert F^k\Vert _2\le L\Vert z^k-z^*\Vert _2\le L\varepsilon _1\) for any \(z^k\in N(z^*,\varepsilon _1)\). Hence the definition of \(\varepsilon _1\) implies that

$$\begin{aligned} c_6\tau \bar{\lambda }\Vert F^k\Vert _2\le 1/2. \end{aligned}$$
(3.18)

From (3.2), (3.4), (3.18) and \(\Vert J_k^{-1}\Vert \le c_6\), we obtain for a Newton step that

$$\begin{aligned} \Vert d^k\Vert _2\le & {} \Vert (J_k+\mu _k I)^{-1}F^k\Vert _2+\Vert (J_k+\mu _k I)^{-1}r^k\Vert _2 \nonumber \\\le & {} c_6L\Vert z^k-z^*\Vert _2+c_6\tau \bar{\lambda }\Vert F^k\Vert _2\Vert d^k\Vert _2 \nonumber \\\le & {} 2c_6L\Vert z^k-z^*\Vert _2, \end{aligned}$$
(3.19)

where the last inequality is derived by rearranging the second inequality and using (3.18). A direct calculation gives

$$\begin{aligned}&\Vert u^{k}-z^*\Vert _2=\Vert z^k+d^k-z^*\Vert _2\\&\quad =\Vert z^k+(J_k+\mu _kI)^{-1}(F^k+(J_k+\mu _kI)d^k-F^k)-z^*\Vert _2\\&\quad \le \Vert z^k-z^*-(J_k+\mu _kI)^{-1}F^k\Vert _2+\Vert (J_k+\mu _kI)^{-1}\Vert \cdot \Vert F^k+(J_k+\mu _kI)d^k\Vert _2\\&\quad \le \Vert (J_k+\mu _k I)^{-1}\Vert \cdot [\Vert F^k-F(z^*)-J_k(z^k-z^*)\Vert _2+\mu _k\Vert z^k-z^*\Vert _2+\Vert r^k\Vert _2]. \end{aligned}$$

Using \(\Vert (J_k+\mu _k I)^{-1}\Vert \le c_6\), \(\mu _k=\lambda _k\Vert F^k\Vert _2\) and \(\Vert r^k\Vert _2\le \tau \lambda _k\Vert F^k\Vert _2\Vert d^k\Vert _2\), we have

$$\begin{aligned} \Vert u^{k}-z^*\Vert _2\le & {} c_6(\Vert F^k-F(z^*)-J_k(z^k-z^*)\Vert _2\nonumber \\&+\,\bar{\lambda }\Vert F^k\Vert _2\Vert z^k-z^*\Vert _2+\tau \bar{\lambda }\Vert F^k\Vert _2\Vert d^k\Vert _2). \end{aligned}$$
(3.20)

By the L-Lipschitz continuity of F and (3.19), we obtain

$$\begin{aligned} \bar{\lambda }\Vert F^k\Vert _2\Vert z^k-z^*\Vert _2+\tau \bar{\lambda }\Vert F^k\Vert _2\Vert d^k\Vert _2\le L\bar{\lambda }(1+2c_6L\tau )\Vert z^k-z^*\Vert _2^2. \end{aligned}$$
(3.21)

The semi-smoothness of F at \(z^*\) implies

$$\begin{aligned} \Vert F^k-F(z^*)-J_k\left( z^k-z^*\right) \Vert _2=o\left( \Vert z^k-z^*\Vert _2\right) , \quad \text{ as }\ \Vert z^k-z^*\Vert \rightarrow 0, \end{aligned}$$

which, together with (3.20) and (3.21) shows that \(\Vert u^k-z^*\Vert _2=o(\Vert z^k-z^*\Vert _2)\). Hence, we have \(L\Vert u^{k}-z^*\Vert _2\le \frac{\nu }{\kappa }\Vert z^k-z^*\Vert _2\) for sufficiently large k. The local error bound condition implies that \(\Vert z^k-z^*\Vert _2\le \kappa \Vert F(z^k)\Vert _2\) for sufficiently large k. Together with the Lipschitz continuity of F, we obtain for sufficiently large k that

$$\begin{aligned} \Vert F(u^k)\Vert _2\le L\Vert u^{k}-z^*\Vert _2\le \frac{\nu }{\kappa }\Vert z^k-z^*\Vert _2\le \nu \Vert F(z^k)\Vert _2. \end{aligned}$$

Then the updating rule (3.5) yields \(\Vert F(u^{k})\Vert _2\le \nu \Vert F(z^k)\Vert _2\) and \(z^{k+1}=u^k\).

When F is strongly semi-smooth, the quadratic convergence \(\Vert u^{k}-z^*\Vert _2=O(\Vert z^k-z^*\Vert _2^2)\) is established because of the property \(\Vert F^k-F(z^*)-J_k(z^k-z^*)\Vert _2=O(\Vert z^k-z^*\Vert ^2_2)\). The proof is completed. \(\square \)

It is clear that the BD-regularity condition plays a key role in the above discussion. Although the BD-regularity condition is strong and may fail in some situations, there are possible ways to resolve this issue. As shown in [22, Section 4.2], suppose that there exists a nonsingular element in \(\partial _BF(z^*)\) while the other elements in \(\partial _BF(z^*)\) may be singular. By exploiting the structure of \(\partial _BF(z)\), one can carefully choose a nonsingular generalized Jacobian when z is close enough to \(z^*\). Hence, if \(z^*\) is isolated, one can still obtain the fast local convergence results by a proof similar to the one above. Another way is inspired by the literature on the Levenberg–Marquardt (LM) method. The LM method is a regularized Gauss–Newton method designed to deal with possibly singular systems. It has been shown in [13] that the LM method preserves a superlinear or quadratic local convergence rate under a certain local error bound condition, which is weaker than the nonsingularity condition. Therefore, it remains a future research topic to investigate the local convergence of our algorithm under the local error bound condition.

3.4 Regularized L-BFGS Method with Projection Steps

In this subsection, we propose a regularized L-BFGS method with projection steps by simply replacing the Newton step in Algorithm 1 with a regularized L-BFGS step, so as to avoid solving the linear system (3.2). The L-BFGS method is an adaptation of the classical BFGS method that uses limited storage. A globally convergent BFGS method with projection steps was proposed in [46] for solving smooth monotone equations. The convergence of our regularized L-BFGS method can be analyzed in a way similar to that of our regularized Newton method by combining the convergence analysis in [46]. We only describe the L-BFGS update in the following and omit the convergence analysis.

For an iterate \(z^k\), we compute the direction by

$$\begin{aligned} (H_k + \mu _k I)d^k = -F^k, \end{aligned}$$
(3.22)

where \(H_k\) is the L-BFGS approximation to the Jacobian matrix.

Choosing an initial matrix \(H_k^{0}\) and setting \(\delta F^k = F^{k+1} - F^{k}\), the Jacobian matrix can be approximated using the most recent m pairs \(\{\delta F^i, d^i\}, i = k-1, k -2, \ldots , k-m\), i.e., by the standard compact formula [23]

$$\begin{aligned} H_{k} = H_k^{0} - \left[ \begin{array}{cc} H_k^0\mathcal {D}_k&\mathcal {F}_k \end{array}\right] \left[ \begin{array}{cc} \mathcal {D}_k^TH_k^0\mathcal {D}_k &{} L_k \\ L^T_k &{} -S_k \end{array}\right] ^{-1} \left[ \begin{array}{c} \mathcal {D}_k^T (H_k^0)^T \\ \mathcal {F}_k^T \end{array}\right] , \end{aligned}$$
(3.23)

where \(\mathcal {D}_k = [d^{k-m}, \ldots , d^{k-1} ]\), \(\mathcal {F}_k = [\delta F^{k-m}, \ldots , \delta F^{k-1} ]\), \(L_k\) is a lower-triangular matrix with entries

$$\begin{aligned} (L_k)_{i,j} = {\left\{ \begin{array}{ll} (d^{k-m-1+i})^T(\delta F^{k-m-1+j})&{} \text {if} \quad i > j, \\ 0 &{}\text {otherwise}, \end{array}\right. } \end{aligned}$$

and \(S_k\) is a diagonal matrix with entries

$$\begin{aligned} (S_k)_{ii} = (d^{k-m-1+i})^T\delta F^{k-m-1+i}. \end{aligned}$$

Then the inverse of the regularized matrix can be computed as

$$\begin{aligned} (H_k + \mu _k I)^{-1} = \bar{H}_k^{-1} + \bar{H}_k^{-1}C_kR_k^{-1}C_k^T(\bar{H}_k^T)^{-1}, \end{aligned}$$

where \(\bar{H}_k = H_k^0 + \mu _k I\), \(C_k = [H_k^0\mathcal {D}_k \quad \mathcal {F}_k]\), \(R_k\) is defined by \(R_k = V_k - C_k^T \bar{H}_k^{-1} C_k\) and

$$\begin{aligned} V_k = \left[ \begin{array}{cc} \mathcal {D}_{k}^TH_k^0\mathcal {D}_{k} &{} L_{k} \\ L_{k}^T &{} -S_{k} \end{array}\right] . \end{aligned}$$

Specifically, if k is smaller than m, we use the classical BFGS method to approximate the inverse of the regularized Jacobian matrix, which amounts to setting \(d^j = \delta F^j= 0\) for \(j < 0\) in formula (3.23).
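To illustrate how the compact representation (3.23) and the inverse formula above can be used without forming \(H_k\) explicitly, the following sketch evaluates \(d^k=-(H_k+\mu _kI)^{-1}F^k\) under the simplifying assumption \(H_k^0=\sigma I\); the helper name and the dense solve of the small \(2m\times 2m\) middle system are illustrative choices, not the exact implementation used in our experiments.

```python
import numpy as np

def lbfgs_reg_direction(Fk, D, G, sigma, mu):
    """Return d = -(H_k + mu*I)^{-1} Fk for the compact L-BFGS matrix (3.23),
    with H_k^0 = sigma*I, D = [d^{k-m},...,d^{k-1}] and G = [dF^{k-m},...,dF^{k-1}]."""
    DtG = D.T @ G
    L = np.tril(DtG, k=-1)                      # (L_k)_{ij} = (d^i)^T dF^j for i > j
    S = np.diag(np.diag(DtG))                   # S_k = diag((d^i)^T dF^i)
    V = np.block([[sigma * (D.T @ D), L],
                  [L.T, -S]])                   # middle matrix V_k of (3.23)
    C = np.hstack([sigma * D, G])               # C_k = [H_k^0 D_k, F_k]
    hbar_inv = 1.0 / (sigma + mu)               # (H_k^0 + mu*I)^{-1} is a multiple of the identity
    R = V - hbar_inv * (C.T @ C)                # R_k = V_k - C_k^T Hbar_k^{-1} C_k
    w = hbar_inv * (-Fk)
    return w + hbar_inv * (C @ np.linalg.solve(R, C.T @ w))
```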

We should point out that the usage of L-BFGS may not be fully justified here because the Jacobian matrix may not be symmetric. It seems that no quasi-Newton method can provide a nonsymmetric but positive semidefinite approximation. Hence, we only take care of the positive semidefiniteness rather than the nonsymmetry. Fortunately, our numerical experiments show that this choice works well on the \(\ell _1\)-regularized optimization problem. Of course, one may also try nonsymmetric quasi-Newton methods.

4 Numerical Results

In this section, we conduct proof-of-concept numerical experiments on our proposed schemes for the fixed-point mappings induced by the FBS and DRS methods by applying them to \(\ell _1\)-norm minimization problems. All numerical experiments are performed in Matlab on a workstation with an Intel(R) Xeon(R) CPU E5-2680 v3 and 128GB of memory.

4.1 Applications to the FBS Method

Consider the \(\ell _1\)-regularized optimization problem of the form

$$\begin{aligned} \min \; \mu \Vert x\Vert _1 + h(x), \end{aligned}$$
(4.1)

where h is continuously differentiable. Let \(f(x)=\mu \Vert x\Vert _1\). The system of nonlinear equations corresponding to the FBS method is \( F(x) =x- \mathbf {prox}_{tf}(x- t \nabla h(x))=0.\) The generalized Jacobian matrix of F(x) is

$$\begin{aligned} J(x) =I- M(x)\left( I- t \partial ^2 h(x)\right) , \end{aligned}$$
(4.2)

where \(M(x)\in \partial \mathbf {prox}_{tf}( x - t \nabla h(x))\) and \(\partial ^2 h(x)\) is the generalized Hessian matrix of h(x). Specifically, the proximal mapping corresponding to f(x) is the so-called shrinkage operator defined as

$$\begin{aligned} \left( \mathbf {prox}_{tf}(x) \right) _i= \mathrm {sign}(x_i) \max (|x_i|-\mu t, 0). \end{aligned}$$

Hence, one can take M(x) to be a diagonal matrix with diagonal entries

$$\begin{aligned} \left( M(x) \right) _{ii}= {\left\{ \begin{array}{ll} 1, &{} \text{ if } |(x-t\nabla h(x))_i| >\mu t,\\ 0, &{} \text{ otherwise. } \end{array}\right. } \end{aligned}$$

Similar to [22], we introduce the index sets

$$\begin{aligned} \mathcal {I}(x):= & {} \{i : |(x - t \nabla h(x))_i| >t\mu \} = \{i: \left( M(x) \right) _{ii} = 1\}, \\ \mathcal {O}(x):= & {} \{i : |(x - t \nabla h(x))_i| \le t\mu \} = \{i: \left( M(x) \right) _{ii} = 0\}. \end{aligned}$$

The Jacobian matrix can be represented by

$$\begin{aligned} J(x) = \left( \begin{matrix} t(\partial ^2h(x))_{\mathcal {I}(x)\mathcal {I}(x)} &{} t(\partial ^2h(x))_{\mathcal {I}(x)\mathcal {O}(x)} \\ 0 &{} I \end{matrix} \right) . \end{aligned}$$
(4.3)

Using the above special structure of Jacobian matrix J(x), we can reduce the complexity of the regularized Newton step (3.2). Let \(\mathcal {I} = \mathcal {I}(x^k)\) and \({\mathcal {O}}= {\mathcal {O}}(x^k)\). Then, we have

$$\begin{aligned}&(1 + \mu _k)s^k_{{\mathcal {O}}} = - F_{k, {\mathcal {O}}} ,\\&(t(\partial ^2h(x))_{\mathcal {I}\mathcal {I}} + \mu _k I) s^k_{\mathcal {I}} + t(\partial ^2h(x))_{\mathcal {I}{\mathcal {O}}} s^k_{{\mathcal {O}}} = - F_{k,\mathcal {I}}, \end{aligned}$$

which yields

$$\begin{aligned}&s^k_{{\mathcal {O}}} = - \frac{1}{1 + \mu _k}F_{k, {\mathcal {O}}}, \\&(t(\partial ^2h(x))_{\mathcal {I}\mathcal {I}} + \mu _k I) s^k_{\mathcal {I}} = - F_{k,\mathcal {I}} - t(\partial ^2h(x))_{\mathcal {I}{\mathcal {O}}} s^k_{{\mathcal {O}}} . \end{aligned}$$

The second equation in the above linear system is then solved by the conjugate gradient (CG) method.
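For instance, with \(h(x)=\frac{1}{2}\Vert Ax-b\Vert _2^2\) we have \(\partial ^2h(x)=A^TA\), and the reduced system can be solved matrix-free. The sketch below uses a plain hand-written CG loop on the \(\mathcal {I}\)-block; the tolerance and iteration limit are arbitrary illustrative choices.

```python
import numpy as np

def reduced_newton_step(A, Fk, I_idx, O_idx, t, mu):
    """Solve (J + mu*I)s = -Fk using the block structure (4.3) with d^2h = A^T A."""
    s = np.zeros_like(Fk)
    s[O_idx] = -Fk[O_idx] / (1.0 + mu)                  # s_O = -F_O / (1 + mu)

    A_I, A_O = A[:, I_idx], A[:, O_idx]
    rhs = -Fk[I_idx] - t * (A_I.T @ (A_O @ s[O_idx]))   # right-hand side of the I-block equation

    def matvec(v):                                      # v -> (t*A_I^T A_I + mu*I) v
        return t * (A_I.T @ (A_I @ v)) + mu * v

    # plain conjugate gradient on the symmetric positive definite I-block
    x = np.zeros_like(rhs)
    r = rhs - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(100):
        Ap = matvec(p)
        a = rs / (p @ Ap)
        x += a * p
        r -= a * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= 1e-8:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    s[I_idx] = x
    return s
```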

4.1.1 Numerical Comparison

In this subsection, we compare our proposed methods with different solvers for solving problem (4.1) with \(h(x)=\frac{1}{2} \Vert Ax-b\Vert _2^2\). The solvers used for comparison include ASSN, SSNP, ASLB, FPC-AS [39], SpaRSA [40] and SNF [22]. ASSN is the proposed semi-smooth Newton method with projection steps (Algorithm 1) and SSNP is the method which only uses the projection steps. ASLB(i) is a variant of the line search based method combining the L-BFGS method with the hyperplane projection technique; the number in brackets is the memory size. FPC-AS is a first-order method that uses a fixed-point iteration with Barzilai–Borwein steps and a continuation strategy. SpaRSA resembles FPC-AS; it is also a first-order method using Barzilai–Borwein steps and a continuation strategy. SNF is a semi-smooth Newton type method which uses a filter strategy and is one of the state-of-the-art second-order methods for the \(\ell _1\)-regularized optimization problem (4.1), and SNF(aCG) is the SNF solver with an adaptive parameter strategy in the CG method. The parameters of FPC-AS, SpaRSA and SNF are the same as in [22].

The test problems are from [22] and are constructed as follows. Firstly, we randomly generate a sparse solution \(\bar{x} \in \mathbb {R}^n\) with k nonzero entries, where \(n = 512^2 = 262144\) and \(k = [n/40] = 6553\). The k different indices are uniformly chosen from \(\{1,2,\ldots ,n\}\) and we set the magnitude of each nonzero element by \(\bar{x}_i = \eta _1(i)10^{d \eta _2(i)/20}\), where \(\eta _1(i)\) is randomly chosen from \(\{-1,1\}\) with probability 1/2 each, \(\eta _2(i)\) is uniformly distributed in [0, 1] and d is a dynamic range which can influence the efficiency of the solvers. Then we choose \(m = n/8 = 32768\) random cosine measurements, i.e., \(Ax = (\mathrm {dct}(x))_J\), where J contains m different indices randomly chosen from \(\{1,2,\ldots ,n\}\) and \(\mathrm {dct}\) is the discrete cosine transform. Finally, the input data is specified by \(b = A\bar{x} + \epsilon \), where \(\epsilon \) is Gaussian noise with standard deviation \(\bar{\sigma } = 0.1\).

To compare fairly, we set a uniform stopping criterion. For a given tolerance \(\epsilon \), we obtain a solution \(x_{newt}\) using ASSN such that \(\Vert F(x_{newt})\Vert \le \epsilon \). Then we terminate all methods by the relative criterion

$$\begin{aligned} \frac{f(x^k) - f(x^{*})}{\max \{f(x^*),1\}} \le \frac{f(x_{newt}) - f(x^{*})}{\max \{f(x^*),1\}}, \end{aligned}$$

where f(x) is the objective function and \(x^{*}\) is a highly accurate solution obtained by ASSN under the criterion \(||F(x)|| \le 10^{-14}\).

Table 1 Total number of A- and \(A^T\)-calls \(N_A\) and CPU time (in seconds) averaged over 10 independent runs with dynamic range 20 dB
Table 2 Total number of A- and \(A^T\)-calls \(N_A\) and CPU time (in seconds) averaged over 10 independent runs with dynamic range 40 dB
Table 3 Total number of A- and \(A^T\)-calls \(N_A\) and CPU time (in seconds) averaged over 10 independent runs with dynamic range 60 dB
Table 4 Total number of A- and \(A^T\)-calls \(N_A\) and CPU time (in seconds) averaged over 10 independent runs with dynamic range 80 dB

We solve the test problems under different tolerances \(\epsilon \in \{10^{-0}, 10^{-1},10^{-2},10^{-4},10^{-6} \}\) and dynamic ranges \(d \in \{20,40,60,80 \}\). Since the evaluations of \(\mathrm {dct}\) dominate the overall computation, we mainly use the total number of A- and \(A^T\)-calls \(N_A\) to compare the efficiency of the different solvers. Tables 1, 2, 3, and 4 show the average \(N_A\) and CPU time over 10 independent trials. These tables show that ASSN and ASLB are competitive with the other methods. For low accuracy, SpaRSA and FPC-AS show a fast convergence rate. ASSN and ASLB are both faster than or close to FPC-AS and SpaRSA in terms of both \(N_A\) and CPU time in most cases. Meanwhile, ASSN and ASLB are competitive with the second-order methods under moderate accuracy. The CPU time and \(N_A\) of ASSN and ASLB are smaller than those of the Newton-type solver SNF in almost all cases, especially for large dynamic ranges. ASLB with memory size \(m = 1\) shows the fastest speed at low accuracy. It is worth emphasizing that L-BFGS with \(m = 1\) is equivalent to the Hestenes–Stiefel and Polak–Ribière conjugate gradient methods with exact line search [23]. Compared with ASSN, SSNP has a slower convergence rate, which implies that our adaptive strategy for switching between Newton and projection steps is helpful.

Fig. 1 Residual history with respect to the total number of A- and \(A^T\)-calls \(N_A\). a 20 dB, b 40 dB, c 60 dB, d 80 dB

Fig. 2 Residual history with respect to the total number of iterations. a 20 dB, b 40 dB, c 60 dB, d 80 dB

In particular, ASSN and ASLB have a better performance for high accuracy. Figures 1 and 2 illustrate the residual history with respect to the total number of A- and \(A^T\)-calls \(N_A\) and the total number of iterations, respectively. Since the two first-order methods have a similar performance and ASLB(1) performs better than ASLB(2), we omit the curves of FPC-AS and ASLB(2). These figures also show that ASSN and ASLB perform better than SNF and SNF(aCG) independently of the dynamic range. In particular, quadratic convergence of ASSN is observable in these examples.

4.2 Applications to the DRS Method

Consider the Basis-Pursuit (BP) problem

$$\begin{aligned} \min \; \Vert x\Vert _1, \,\text {subject to}\,Ax=b, \end{aligned}$$
(4.4)

where \(A\in \mathbb {R}^{m\times n}\) is of full row rank and \(b\in \mathbb {R}^m\). Let \(f(x)=\mathrm {1}_\Omega (Ax-b)\) and \(h(x)=\Vert x\Vert _1\), where the set \(\Omega =\{0\}\). The system of nonlinear equations corresponding to the DRS fixed-point mapping is

$$\begin{aligned} F(z) = \mathbf {prox}_{th}(z)-\mathbf {prox}_{tf}(2\mathbf {prox}_{th}(z)-z)=0. \end{aligned}$$
(4.5)

For simplicity of solving the subproblems in the DRS method, we assume that \(AA^\top =I\). Then it can be derived that the proximal mapping with respect to f(x) is

$$\begin{aligned} \mathbf {prox}_{tf}(z)= & {} \left( I-A^\top A\right) z + A^\top \left( \mathbf {prox}_{\mathrm {1}_\Omega }(Az-b)+b\right) \\= & {} z- A^\top (A z-b). \end{aligned}$$

A generalized Jacobian matrix \(D \in \partial \mathbf {prox}_{tf}((2\mathbf {prox}_{th}(z)-z))\) is taken as follows

$$\begin{aligned} D = I- A^\top A. \end{aligned}$$
(4.6)

The proximal mapping with respect to h(x) is

$$\begin{aligned} \left( \mathbf {prox}_{th}(z) \right) _i= \mathrm {sign}(z_i) \max (|z_i|-t, 0). \end{aligned}$$

One can take a generalized Jacobian matrix \(M(z)\in \partial \mathbf {prox}_{th}(z)\) as a diagonal matrix with diagonal entries

$$\begin{aligned} M_{ii}(z)=\left\{ \begin{array}{ll} 1,\quad &{}\quad |(z)_i|>t,\\ 0,\quad &{}\quad \text{ otherwise }.\\ \end{array} \right. \end{aligned}$$

Hence, a generalized Jacobian matrix of F(z) is in the form of

$$\begin{aligned} J(z)= M(z) + {D}(I-2M(z)). \end{aligned}$$
(4.7)
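
For concreteness, the following sketch (an illustration under our own naming, assuming an explicit A with \(AA^\top =I\)) evaluates the residual (4.5) and the generalized Jacobian–vector product \(J(z)v\) from (4.7) using only multiplications by A and \(A^\top \), which is the form needed by an iterative solver for the Newton system:

```python
import numpy as np

def drs_residual_and_jvp(z, v, A, b, t):
    """Residual F(z) of (4.5) and Jacobian-vector product J(z) v of (4.7)
    for the basis-pursuit problem, assuming A A^T = I."""
    m_diag = (np.abs(z) > t).astype(float)                # diagonal of M(z)
    prox_h = np.sign(z) * np.maximum(np.abs(z) - t, 0.0)  # prox_{th}(z)
    u = 2.0 * prox_h - z
    prox_f = u - A.T @ (A @ u - b)                        # prox_{tf}(u) = u - A^T(Au - b)
    F = prox_h - prox_f                                   # F(z) = prox_{th}(z) - prox_{tf}(u)
    w = v - 2.0 * m_diag * v                              # (I - 2M(z)) v
    Jv = m_diag * v + w - A.T @ (A @ w)                   # M(z) v + (I - A^T A)(I - 2M(z)) v
    return F, Jv
```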

Let \(W=I-2M(z)\) and \(H=W+M(z)+\mu I\), so that \(J(z)+\mu I = H - A^\top AW\). Using the binomial inverse theorem, we obtain the inverse matrix

$$\begin{aligned} (J(z)+\mu I)^{-1}= & {} \left( H - A^\top A W\right) ^{-1} \\= & {} H^{-1} + H^{-1} A^\top \left( I- AW H^{-1}A^\top \right) ^{-1} A WH^{-1}. \end{aligned}$$

For convenience, we write the diagonal entries of the matrices W and H as

$$\begin{aligned} W_{ii}(z)=\left\{ \begin{array}{ll} -1,&{} \quad |(z)_i|>t,\\ 1, &{} \quad \text{ otherwise }\\ \end{array} \right. \text {and} \quad H_{ii}(z)=\left\{ \begin{array}{ll} \mu , &{}\quad |(z)_i|>t,\\ 1+\mu , &{}\quad \text{ otherwise }.\\ \end{array} \right. \end{aligned}$$

Then \(WH^{-1} = \frac{1}{1+\mu }I - S\), where S is a diagonal matrix with diagonal entries

$$\begin{aligned} S_{ii}(z)=\left\{ \begin{array}{ll} \frac{1}{\mu } + \frac{1}{1+\mu }, &{} \quad |(z)_i|>t,\\ 0, &{}\quad \text{ otherwise }.\\ \end{array} \right. \end{aligned}$$

Hence, \(I- AW H^{-1}A^\top = (1-\frac{1}{1+\mu })I + ASA^\top \). Define the index sets

$$\begin{aligned} {\begin{matrix} &{}\mathcal {I}(z) := \{i : |(z)_i| >t \} = \{i: M_{ii}(z) = 1\}, \\ &{}\mathcal {O}(z) := \{i : |(z)_i| \le t \} = \{i: M_{ii}(z) = 0\} \end{matrix}} \end{aligned}$$

and let \(A_{\mathcal {I}(z)}\) denote the submatrix consisting of the columns of A indexed by \(\mathcal {I}(z)\); then we have

$$\begin{aligned} ASA^\top = \left( \frac{1}{\mu } + \frac{1}{1+\mu }\right) A_{\mathcal {I}(z)}A_{\mathcal {I}(z)}^\top . \end{aligned}$$
(4.8)

This property implies the positive definiteness of \(I- AW H^{-1}A^\top \), and it can be used to reduce the computational cost whenever the submatrix \(A_{\mathcal {I}(z)} A_{\mathcal {I}(z)}^\top \) is cheaply available.
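
The sketch below (again our own illustration, assuming A is an explicit \(m\times n\) matrix with \(AA^\top = I\)) applies \((J(z)+\mu I)^{-1}\) to a vector using the formulas above; only the small \(m\times m\) system involving \(A_{\mathcal {I}(z)}\) has to be solved:

```python
import numpy as np

def apply_inverse_regularized_jacobian(r, z, A, t, mu):
    """Compute (J(z) + mu*I)^{-1} r via the binomial inverse identity,
    assuming A A^T = I and A stored as an explicit m x n matrix."""
    in_I = np.abs(z) > t                       # index set I(z), i.e. M_ii(z) = 1
    Hd = np.where(in_I, mu, 1.0 + mu)          # diagonal of H = W + M + mu*I
    Wd = np.where(in_I, -1.0, 1.0)             # diagonal of W = I - 2M
    WHinv = Wd / Hd                            # diagonal of W H^{-1}

    m = A.shape[0]
    c = 1.0 / mu + 1.0 / (1.0 + mu)
    A_I = A[:, in_I]                           # columns of A indexed by I(z)
    # I - A W H^{-1} A^T = (mu/(1+mu)) I + c * A_I A_I^T, which is positive definite
    K = (mu / (1.0 + mu)) * np.eye(m) + c * (A_I @ A_I.T)

    y = np.linalg.solve(K, A @ (WHinv * r))    # (I - A W H^{-1} A^T)^{-1} A W H^{-1} r
    return r / Hd + (A.T @ y) / Hd             # H^{-1} r + H^{-1} A^T y
```

When A is only available as an operator, K is not formed explicitly and the inner system is solved iteratively instead; this is the situation in the experiments below, where property (4.8) cannot be exploited.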

4.2.1 Numerical Comparison

In this subsection, we compare our methods with two first-order solvers: ADMM [42] and SPGL1 [38]. The ASLB solver is not included since its performance is not comparable to that of the other approaches. Our test problems are almost the same as in the last subsection; the only difference is that we set \(b = A\bar{x}\) without adding noise. We use the residual criterion \(\Vert F(z)\Vert \le \epsilon \) as the stopping criterion for ADMM and ASSN. Because computing the residual for SPGL1 requires extra cost, we use its original stopping criterion and list the relative error "rerr" for the comparison with ADMM and ASSN. The relative error with respect to the true solution \(x^*\) is defined as

$$\begin{aligned} \text{ rerr } = \frac{||x^k - x^*||}{\max (||x^*||,1)}. \end{aligned}$$
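
For reference, a one-line sketch of this quantity (variable names are ours):

```python
import numpy as np

def relative_error(xk, x_star):
    """Relative error of the iterate xk with respect to the true solution x_star."""
    return np.linalg.norm(xk - x_star) / max(np.linalg.norm(x_star), 1.0)
```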

We revise the ADMM solver in yall1Footnote 1 by adjusting the rule for updating the penalty parameter and choosing the best parameters so that it can solve all examples in our numerical experiments. The parameters of SPGL1 are set to their default values. Since the matrix A is only available as an operator, the property (4.8) cannot be exploited in ASSN.

Table 5 Total number of A- and \(A^T\)-calls \(N_A\), CPU time (in seconds) and relative error with dynamic range 20 dB
Table 6 Total number of A- and \(A^T\)-calls \(N_A\), CPU time (in seconds) and relative error with dynamic range 40 dB
Table 7 Total number of A- and \(A^T\)-calls \(N_A\), CPU time (in seconds) and relative error with dynamic range 60 dB
Table 8 Total number of A- and \(A^T\)-calls \(N_A\), CPU time (in seconds) and relative error with dynamic range 80 dB
Fig. 3 Residual history with respect to the total number of A- and \(A^T\)-calls \(N_A\). a 20 dB, b 40 dB, c 60 dB, d 80 dB

Fig. 4 Residual history with respect to the total number of iterations. a 20 dB, b 40 dB, c 60 dB, d 80 dB

We solve the test problems under different tolerances \(\epsilon \in \{10^{-2},10^{-4},10^{-6} \}\) and dynamic ranges \(d \in \{20,40,60,80 \}\). As in the last subsection, we mainly use the total number of A- and \(A^T\)-calls \(N_A\) and the CPU time to compare the efficiency of the different solvers. We also list the relative error so that ADMM and ASSN can be compared with SPGL1. These numerical results are reported in Tables 5, 6, 7 and 8. The performance of ASSN is close to that of ADMM for the tolerance \(10^{-2}\) and is much better for the tolerances \(10^{-4}\) and \(10^{-6}\), independently of the dynamic range. For all test problems, SPGL1 only obtains a solution of low accuracy; this might be improved if its parameters were further tuned.

Figures 3 and 4 illustrate the residual history with respect to the total number of A- and \(A^T\)-calls \(N_A\) and the total number of iterations, respectively. SPGL1 is omitted since it does not converge to high accuracy. The figures show that ASSN has a convergence rate similar to that of ADMM in the initial stage but achieves a faster rate later on, in particular at high accuracy.

5 Conclusion

The purpose of this paper is to study second-order type methods for solving composite convex programs based on the fixed-point mappings induced by operator splitting approaches such as the FBS and DRS methods. The semi-smooth Newton method is theoretically guaranteed to converge to a global solution from an arbitrary initial point and to achieve a fast convergence rate by using an adaptive strategy for switching between projection steps and Newton steps. Our proposed algorithms are applicable to constrained convex programs whenever a fixed-point mapping is well defined, and they may help bridge the gap between first-order and second-order type methods. They are indeed promising in our preliminary numerical experiments on a number of applications. In particular, quadratic or superlinear convergence is attainable in some examples of Lasso regression and basis pursuit, although the BD-regular condition required for fast local convergence has not been established rigorously. The Jacobian of the \(\ell _1\)-regularized problem (4.1) is given by (4.3), and the term \((\partial ^2h(x))_{\mathcal {I}(x)\mathcal {I}(x)}\) behaves well in our numerical experiments. Similarly, the structure of \(ASA^\top \) in (4.8) for the Jacobian of the basis-pursuit problem (4.4) is also helpful. Moreover, the number of A- and \(A^T\)-calls of ASSN when solving the Newton systems shows that the CG method works well in general, which further indicates that the BD-regular condition may hold for these examples.

One of the key issues in solving the monotone nonsmooth equations is convergence to global optimality. Note that, if h(x) in problem (1.1) is the indicator function of a closed convex set C, the nonsmooth equations corresponding to FBS take the form \(F(z)=z-\Pi _{C}(z-t\nabla f(z))=0\), which is equivalent to a variational inequality (VI) problem. Various globally convergent projection-type or extragradient-type methods have been proposed to solve the VI; see the survey paper [41] and references therein. They may provide plenty of alternatives to the hyperplane projection step used in this paper.
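
As a small illustration of this residual (our own sketch, assuming C is a box so that \(\Pi _C\) reduces to a componentwise clip, with an illustrative quadratic f and step size t):

```python
import numpy as np

def fbs_projection_residual(z, grad_f, lower, upper, t=1.0):
    """Residual F(z) = z - Pi_C(z - t * grad_f(z)) for the box C = [lower, upper]."""
    return z - np.clip(z - t * grad_f(z), lower, upper)

# Example: f(z) = 0.5 * ||z - c||^2 over the box [0, 1]^3 (illustrative data only).
c = np.array([0.3, -0.2, 1.4])
residual = fbs_projection_residual(np.zeros(3), lambda z: z - c, 0.0, 1.0)
```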

There are a number of future directions worth pursuing, including theoretical analysis and a comprehensive implementation of these second-order algorithms. To improve practical performance, the second-order methods can be activated only after the first-order type methods have reached a good neighborhood of the global optimal solution. Since solving the corresponding system of linear equations dominates the computation, it is important to exploit the structure of the linear system and to design suitable preconditioners.