1 Introduction

Consider the following optimization problem with competing structure:

$$\begin{aligned} \min _u~f(u) + g(u), \end{aligned}$$
(1)

where f and g are proper closed possibly nonconvex functions. Optimization problems of this form arise in many important modern applications such as signal processing, machine learning and statistics [6, 10, 17, 32]. A typical application of (1) is to solve ill-posed inverse problems, where the function f represents the data fitting term and the function g is the regularization term. To solve problems with such competing structure, an important and powerful class of algorithms is the class of splitting methods. In these methods, the objective function is decomposed into simpler components which are then processed separately in the subproblems. Two classical splitting methods in the literature are the Douglas–Rachford (DR) splitting method [15, 16, 26] and the Peaceman–Rachford (PR) splitting method [26, 30].

The PR splitting method was originally introduced in [30] for solving linear heat flow equations, and was later generalized to deal with nonlinear equations in [26]. In the case when f and g are both convex, the PR splitting method can be described conveniently by the following update:

$$\begin{aligned} x^{t+1} = (2\mathrm{prox}_{\gamma g} - I)\circ (2\mathrm{prox}_{\gamma f} - I)(x^t), \end{aligned}$$
(2)

where I is the identity mapping, \(\gamma > 0\) and

$$\begin{aligned} \hbox {prox}_{\gamma h}(z) := \mathop {\hbox {Arg min}}_u\left\{ \gamma h(u) + \frac{1}{2} \Vert u - z\Vert ^2\right\} , \end{aligned}$$

i.e., the set of minimizers of the problem \(\min \limits _u \gamma h(u) + \frac{1}{2} \Vert u - z\Vert ^2\); we note that this set is a singleton when h is convex. Although the PR splitting method can be faster than the DR splitting method (see, for example, [18] and Example 1 in the Appendix), it has not been as popular as the DR splitting method. This is witnessed by the fact that the PR splitting method is neither discussed nor mentioned in the recent monograph [5] on operator splitting methods. One of the main reasons for this unpopularity is that, even in the convex setting, the PR splitting method is not convergent in general. To guarantee convergence, one typically requires either the operator \((2\mathrm{prox}_{\gamma f} - I)\) or \((2\mathrm{prox}_{\gamma g} - I)\) to be a contraction mapping. In applications where f and g are both convex, this requirement typically amounts to f or g being strongly convex, which largely limits the applicability of the PR splitting method; see, for example, [12, 26]. In contrast, under a commonly used and easily satisfied constraint qualification, the DR splitting method converges in the convex case [13, Theorem 20]. Moreover, it has recently been shown in [25] that the DR splitting method can be adapted to a nonconvex setting with global convergence guaranteed under some assumptions. This broadens the applicability of the DR splitting method to cover many nonconvex feasibility problems and many important nonconvex optimization problems arising in statistical machine learning, such as the \(\ell _{1/2}\) regularized least squares problem.
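To make the update (2) concrete, the following minimal sketch (in Python) iterates the two reflections for a toy instance in which f is a strongly convex quadratic and \(g = \lambda \Vert \cdot \Vert _1\); the instance, the step size and the closed-form proximal mappings used here are illustrative choices, not taken from the text.

```python
import numpy as np

def prox_quad(z, gamma, Q, c):
    """prox_{gamma f}(z) for f(u) = 0.5*u'Qu + c'u (Q positive definite):
    solves (I + gamma*Q) u = z - gamma*c."""
    n = len(z)
    return np.linalg.solve(np.eye(n) + gamma * Q, z - gamma * c)

def prox_l1(z, gamma, lam):
    """prox_{gamma g}(z) for g(u) = lam*||u||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)

def pr_classical(x, gamma, Q, c, lam, iters=200):
    """Classical PR fixed-point iteration (2):
    x <- (2 prox_{gamma g} - I)((2 prox_{gamma f} - I)(x))."""
    for _ in range(iters):
        y = prox_quad(x, gamma, Q, c)
        w = 2 * y - x                 # reflection through prox_{gamma f}
        z = prox_l1(w, gamma, lam)
        x = 2 * z - w                 # reflection through prox_{gamma g}
    return prox_quad(x, gamma, Q, c)  # y-iterate approximates a minimizer of f + g

# toy instance (purely illustrative)
rng = np.random.default_rng(0)
Q = np.diag([2.0, 5.0]); c = rng.standard_normal(2)
u = pr_classical(np.zeros(2), gamma=0.5, Q=Q, c=c, lam=0.1)
```

Because f in this toy instance is strongly convex with a Lipschitz gradient, the reflected operator \(2\mathrm{prox}_{\gamma f} - I\) is a contraction, which is precisely the situation mentioned above under which (2) is known to converge.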

In this paper, to broaden the applicability of the PR splitting method, we extend it to a nonconvex setting. By constructing a merit function which captures the progress of the PR splitting method, we extend the global convergence of the PR splitting method from the known convex setting to the case where the objective function can be decomposed as the sum of a strongly convex Lipschitz differentiable function and a nonconvex function, under suitable assumptions. As a by-product, this extension also allows us to establish the global convergence and iteration complexity of a new PR splitting method for convex optimization problems in the absence of strong convexity. The underlying idea is simple: one can decompose a convex but not strongly convex function \(F + G\) into the sum of a strongly convex function \(f=F+\gamma \Vert \cdot \Vert ^2\) and a possibly nonconvex function \(g=G-\gamma \Vert \cdot \Vert ^2\), since f is strongly convex for any choice of \(\gamma > 0\).

The contributions of this paper are two-fold. First, we establish that, for the sequence generated by the PR splitting method applied to minimizing the sum of a strongly convex Lipschitz differentiable function and a proper closed function, if the strongly convex function has a sufficiently large strong convexity modulus and the step-size parameter is chosen below a computable threshold, then any cluster point, if it exists, gives a stationary point of the optimization problem. We also provide sufficient conditions guaranteeing boundedness of the sequence generated. To our knowledge, this is the first work that studies the convergence of the PR splitting method for nonconvex optimization problems. Second, we demonstrate how the method can be suitably applied to minimizing a coercive function \(F+G\), where G is a proper closed function, and F is convex Lipschitz differentiable but not necessarily strongly convex. Even in the case when G is also convex, it was previously unknown in the literature how the PR splitting method could be suitably applied to solve such problems. Our study thus largely broadens the applicability of the PR splitting method. We also discuss the global iteration complexity of this new PR splitting method under the additional assumption that G is convex, and establish global linear convergence of the sequence generated if \(F+G\) is further assumed to be strongly convex.

The rest of the paper is organized as follows. In Sect. 1.1, we fix the notation and recall some basic definitions which will be used throughout this paper. In Sect. 2, we establish the convergence of the PR splitting method for nonconvex optimization problems where the objective function can be decomposed as the sum of a strongly convex function and a proper closed function, under suitable assumptions. In Sect. 3, we demonstrate how the PR splitting method can be applied in the absence of strong convexity. In Sect. 4, as applications, we illustrate how the PR splitting method can be applied to solving two important classes of nonconvex optimization problems that arise in statistics and machine learning: constrained least squares problems and feasibility problems. We also demonstrate our approach numerically. Our concluding remarks are given in Sect. 5. Finally, in the Appendix, we provide simple and concrete examples illustrating the different behaviors of the classical PR splitting method, the classical DR splitting method and our proposed PR splitting method.

1.1 Notation

In this paper, the n-dimensional Euclidean space is denoted by \(\mathrm{I}\!\mathrm{R}^n\), with the associated inner product denoted by \(\langle \cdot ,\cdot \rangle \) and the induced norm denoted by \(\Vert \cdot \Vert \). For an extended-real-valued function \(f:\mathrm{I}\!\mathrm{R}^n \rightarrow (-\infty ,\infty ]\), we say that f is proper if it is never \(-\infty \) and its domain, \(\mathrm{dom}f:=\{x \in \mathrm{I}\!\mathrm{R}^n: f(x)<+\infty \}\), is nonempty. Such a function is said to be closed if it is lower semicontinuous. For a proper function f, we let \(z\mathop {\rightarrow }\limits ^{f}x\) denote \(f(z)\rightarrow f(x)\) and \(z\rightarrow x\). The limiting subdifferential of f at \(x\in \mathrm {dom}~f\) is defined by [31]

$$\begin{aligned} \partial f(x) :=\left\{ v\in \mathrm{I}\!\mathrm{R}^n :\;\exists x^t\mathop {\rightarrow }\limits ^{f}x,\;v^t\rightarrow v \text{ with } \liminf _{z\rightarrow x^t}\frac{f(z)-f(x^t)-\langle v^t,z-x^t\rangle }{\Vert z-x^t\Vert }\ge 0 \text{ for each } t\right\} . \end{aligned}$$
(3)

From the above definition, one immediately obtains the following robustness property:

$$\begin{aligned} \left\{ v\in \mathrm{I}\!\mathrm{R}^n:\; \exists x^t \mathop {\rightarrow }\limits ^{f}x,\; v^t\rightarrow v,\; v^t\in \partial f(x^t)\right\} \subseteq \partial f(x). \end{aligned}$$
(4)

The subdifferential (3) reduces to the derivative of f (denoted by \(\nabla f\)) when f is continuously differentiable, and to the classical subdifferential in convex analysis when f is convex (see, for example, [31, Proposition 8.12]). For a function f having more than one group of variables, we let \(\partial _x f\) (resp., \(\nabla _x f\)) denote the subdifferential (resp., derivative) of f with respect to the variable x.

We say that a function f is a strongly convex function with modulus \(\sigma >0\) if \(f-\frac{\sigma }{2}\Vert \cdot \Vert ^2\) is a convex function. A function f is said to be coercive if \(\liminf \limits _{\Vert x\Vert \rightarrow \infty }f(x) = \infty \). For a nonempty closed set \(S \subseteq \mathrm{I}\!\mathrm{R}^n\), its indicator function \(\delta _S\) is defined by

$$\begin{aligned} \delta _S(x)=\left\{ \begin{array}{ll} 0 &{} \text{ if }\,x \in S, \\ +\infty &{} \text{ if }\,x \notin S. \end{array}\right. \end{aligned}$$

We use the notation \(d_S(x)\) or \(\mathrm{dist}(x,S)\) to denote the distance from a point \(x\in \mathrm{I}\!\mathrm{R}^n\) to S, i.e., \(d_S(x):= \inf _{y\in S}\Vert x - y\Vert \). Moreover, we use \(P_S(x)\) to denote the set of points in S that are closest to x; note that \(P_S(x)\) is a singleton if S is, in addition, convex.

Finally, for an optimization problem \(\displaystyle \min _{x \in \mathrm{I}\!\mathrm{R}^n} f(x)\), we use \(\displaystyle \mathop {\hbox {Arg min}}_{x} f(x)\) to denote the set consisting of all its minimizers. If \(\displaystyle \mathop {\hbox {Arg min}}_{x} f(x)\) turns out to be a singleton, we simply denote it as \(\displaystyle \mathop {\hbox {arg min}}_{x} f(x)\).

2 Peaceman–Rachford splitting for structured nonconvex problems

Recall that the class of problems we consider is

$$\begin{aligned} \min _u \ f(u) + g(u), \end{aligned}$$
(5)

where f and g are proper closed possibly nonconvex functions. As discussed in the introduction, even in the case when both f and g are convex, one typically needs f (or g) to be strongly convex to guarantee convergence of the PR splitting method. Moreover, we recall that the Lipschitz differentiability of f played an important role in the recent convergence analysis of the closely related DR splitting method in [25] for (5) in the nonconvex setting. Motivated by these, we make the following blanket assumption on f throughout this paper.

Assumption 1

(Blanket assumption on f ) The function f is strongly convex with a strong convexity modulus at least \(\sigma > 0\), and is Lipschitz differentiable so that \(\nabla f\) has a Lipschitz continuity modulus at most \(L > 0\).

Notice that the proximal mapping \(\mathrm{prox}_{\gamma f}(z)\) of a strongly convex function f is well defined for any \(\gamma > 0\) and any point z. Thus, in order for the iterates in (2) to be well defined, we only need the following additional blanket assumption on g.

Assumption 2

(Blanket assumption on g ) The function g is proper closed with a nonempty proximal mapping \(\mathrm{prox}_{\gamma g}(z)\) for any z and for the \(\gamma > 0\) we use in the algorithm.

Under the blanket assumptions, we consider the following adaptation of the PR splitting method to solve the possibly nonconvex problem (5), which can be easily shown to be equivalent to (2) in the case when f and g are convex (so that the proximal mappings are single-valued).

Given \(\gamma > 0\) and an initial point \(x^0\), the PR splitting method generates, for \(t = 0, 1, 2, \ldots \),

$$\begin{aligned} \left\{ \begin{array}{l} y^{t+1} = \mathop {\hbox {arg min}}\limits _y\left\{ f(y) + \frac{1}{2\gamma }\Vert y - x^t\Vert ^2\right\} ,\\ z^{t+1} \in \mathop {\hbox {Arg min}}\limits _z\left\{ g(z) + \frac{1}{2\gamma }\Vert z - (2y^{t+1} - x^t)\Vert ^2\right\} ,\\ x^{t+1} = x^t + 2(z^{t+1} - y^{t+1}). \end{array}\right. \end{aligned}$$
(6)
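In implementation terms, one pass of (6) is a proximal step on f, a proximal step on g at the reflected point \(2y^{t+1} - x^t\), and an over-relaxed update of x. A minimal generic sketch, where the callables prox_f and prox_g are assumed to be supplied by the user and to return an element of the (possibly set-valued) proximal mappings:

```python
import numpy as np

def pr_splitting(prox_f, prox_g, x0, gamma, iters=500):
    """A minimal sketch of the PR splitting method (6).
    prox_f(v, gamma) and prox_g(v, gamma) are user-supplied and should return
    an element of the (possibly set-valued) proximal mapping at v."""
    x = np.asarray(x0, dtype=float)
    y = z = x
    for _ in range(iters):
        y = prox_f(x, gamma)              # y-update
        z = prox_g(2 * y - x, gamma)      # z-update at the reflected point
        x = x + 2 * (z - y)               # x-update
    return y, z, x
```

When g is nonconvex (for instance, the indicator function of a nonconvex set), prox_g may be set-valued; the sketch simply uses whichever element the supplied callable returns.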

Our convergence analysis follows a line of arguments similar to that used for the Douglas–Rachford splitting method in our recent work [25] (with some intricate modifications), and makes extensive use of the following merit function:

$$\begin{aligned} {\mathfrak {P}}_\gamma (y,z,x):= & {} f(y) + g(z) - \frac{3}{2\gamma }\Vert y - z\Vert ^2 + \frac{1}{\gamma }\langle x-y,z-y\rangle \nonumber \\= & {} {\mathfrak {D}}_\gamma (y,z,x) - \frac{1}{\gamma }\Vert y - z\Vert ^2, \end{aligned}$$
(7)

where \({\mathfrak {D}}_\gamma \) is the so-called Douglas–Rachford merit function given by \({\mathfrak {D}}_\gamma (y,z,x)=f(y) + g(z) - \frac{1}{2\gamma }\Vert y - z\Vert ^2 + \frac{1}{\gamma }\langle x-y,z-y\rangle \) (see [25, Definition 2.1]), motivated by [29, Eq. 35].

Before proceeding, we make two important observations. First, it is not hard to see that the merit function \({\mathfrak {P}}_\gamma \) can alternatively be written as

$$\begin{aligned} {\mathfrak {P}}_\gamma (y,z,x)= & {} f(y) + g(z) + \frac{1}{2\gamma }\Vert 2y - z - x\Vert ^2 - \frac{1}{2\gamma }\Vert x-y\Vert ^2 - \frac{2}{\gamma }\Vert y - z\Vert ^2\nonumber \\= & {} f(y) + g(z) + \frac{1}{2\gamma }(\Vert x - y\Vert ^2 - \Vert x - z\Vert ^2-2\Vert y - z\Vert ^2), \end{aligned}$$
(8)

where the first relation follows from the elementary relation \(\langle u,v\rangle = \frac{1}{2} (\Vert u + v\Vert ^2 - \Vert u\Vert ^2 - \Vert v\Vert ^2)\) applied with \(u = x-y\) and \(v = z -y\) in (7), while the second relation is obtained by using the elementary relation \(\langle u,v\rangle = \frac{1}{2} (\Vert u\Vert ^2 + \Vert v\Vert ^2 - \Vert u - v\Vert ^2)\) in (7) with \(u = x-y\) and \(v = z -y\). We will make use of these equivalent formulations in the convergence analysis. Second, we note by using the optimality conditions for the y and z-updates in (6) that:

$$\begin{aligned} 0= & {} \nabla f(y^{t+1}) + \frac{1}{\gamma }(y^{t+1} - x^t),\nonumber \\ 0\in & {} \partial g(z^{t+1}) + \frac{1}{\gamma }(z^{t+1} - y^{t+1}) - \frac{1}{\gamma }(y^{t+1} - x^t), \end{aligned}$$
(9)

where we made use of the subdifferential calculus rule [31, Exercise 8.8]. Consequently, for all \(t\ge 1\),

$$\begin{aligned} 0 \in \nabla f(y^t) + \partial g(z^t) + \frac{1}{\gamma }(z^t - y^t). \end{aligned}$$
(10)

To establish convergence and characterize the cluster point of the sequence generated, we will subsequently show that \(\lim _{t\rightarrow \infty }\Vert z^{t} - y^{t}\Vert = 0\) and that g is “continuous” at the cluster point along the sequence generated.
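Since the three expressions in (7) and (8) share the terms \(f(y) + g(z)\) and differ only in their quadratic parts, the equivalence asserted in (8) is a purely algebraic identity that can be sanity-checked numerically; a small sketch with random vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
y, z, x = (rng.standard_normal(5) for _ in range(3))
gamma = 0.7
sq = lambda v: float(np.dot(v, v))

# quadratic part of (7) and of the two expressions in (8)
expr7  = -1.5 / gamma * sq(y - z) + np.dot(x - y, z - y) / gamma
expr8a = 0.5 / gamma * sq(2 * y - z - x) - 0.5 / gamma * sq(x - y) - 2.0 / gamma * sq(y - z)
expr8b = 0.5 / gamma * (sq(x - y) - sq(x - z) - 2.0 * sq(y - z))
assert np.allclose([expr7, expr7], [expr8a, expr8b])
```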

We are now ready to state and prove a convergence result for the PR splitting method (6). We would like to point out that our proof closely follows the line of arguments in [25, Theorem 1]. However, there are two crucial differences. First, we now make use of the merit function (7) in place of the Douglas–Rachford merit function. Second, as we will see in the upper estimate in (20), the factor of \(\gamma \) in the denominator is canceled, and thus the strong convexity modulus \(\sigma \) comes into play in establishing the nonincreasing property of the sequence \(\{{\mathfrak {P}}_\gamma (y^t,z^t,x^t)\}_{t\ge 1}\).

Theorem 1

(Global subsequential convergence) Suppose that \(3\sigma > 2L\) and the parameter \(\gamma \) is chosen so that

$$\begin{aligned} 0< \gamma < \frac{3\sigma -2L}{L^2}. \end{aligned}$$
(11)

Then the sequence \(\{{\mathfrak {P}}_\gamma (y^t,z^t,x^t)\}_{t\ge 1}\) is nonincreasing. Moreover, if a cluster point \((y^*,z^*,x^*)\) of the sequence exists, then we have

$$\begin{aligned} \lim _{t\rightarrow \infty }\Vert x^{t+1} - x^t\Vert = 2\lim _{t\rightarrow \infty }\Vert z^{t+1} - y^{t+1}\Vert = 0, \end{aligned}$$
(12)

the cluster point satisfies \(z^* = y^*\), and

$$\begin{aligned} 0\in \nabla f(z^*) + \partial g(z^*). \end{aligned}$$

Remark 1

We note that the condition \(3\sigma > 2L\) indicates that this convergence result can only be applied when f has a relatively large strong convexity modulus, i.e., when \(\sigma > \frac{2}{3} L\). This may seem restrictive at first glance, but we will demonstrate in the next section how the theorem can be applied to a wide range of problems whose objectives do not explicitly contain a strongly convex part. Specifically, we will show that the method can be suitably applied to minimizing a coercive function \(F+G\), where G is a proper closed function and F is convex Lipschitz differentiable but not necessarily strongly convex; see Corollary 1.

Proof

We study the behavior of \({\mathfrak {P}}_\gamma \) along the sequence generated from the PR splitting method. First, using (7) and the definition of the x-update, we see that

$$\begin{aligned} {\mathfrak {P}}_\gamma (y^{t+1},z^{t+1},x^{t+1}) - {\mathfrak {P}}_\gamma (y^{t+1},z^{t+1},x^t)= & {} \frac{1}{\gamma }\langle x^{t+1}-x^t,z^{t+1} - y^{t+1}\rangle \nonumber \\= & {} \frac{1}{2\gamma }\Vert x^{t+1} - x^t\Vert ^2. \end{aligned}$$
(13)

Second, making use of the first relation in (8) and the definition of \(z^{t+1}\) as a minimizer, we have

$$\begin{aligned}&{\mathfrak {P}}_\gamma (y^{t+1},z^{t+1},x^t) - {\mathfrak {P}}_\gamma (y^{t+1},z^t,x^t) \nonumber \\&\quad = g(z^{t+1}) + \frac{1}{2\gamma }\Vert 2y^{t+1} - z^{t+1} - x^t\Vert ^2 - \frac{2}{\gamma }\Vert y^{t+1} - z^{t+1}\Vert ^2\nonumber \\&\qquad - g(z^t) - \frac{1}{2\gamma }\Vert 2y^{t+1} - z^t - x^t\Vert ^2 + \frac{2}{\gamma }\Vert y^{t+1} - z^t\Vert ^2\nonumber \\&\quad \le \frac{2}{\gamma }\left( \Vert y^{t+1} - z^t\Vert ^2 - \Vert y^{t+1} - z^{t+1}\Vert ^2\right) \nonumber \\&\quad = \frac{2}{\gamma }\left( \Vert y^{t+1} - z^t\Vert ^2 - \frac{1}{4}\Vert x^{t+1} - x^t\Vert ^2\right) , \end{aligned}$$
(14)

where the last relation is due to the definition of \(x^{t+1}\). Consequently, summing (13) and (14), we have

$$\begin{aligned} {\mathfrak {P}}_\gamma (y^{t+1},z^{t+1},x^{t+1}) - {\mathfrak {P}}_\gamma (y^{t+1},z^t,x^t) \le \frac{2}{\gamma }\Vert y^{t+1} - z^t\Vert ^2. \end{aligned}$$
(15)

Next, making use of the second relation in (8), we see that

$$\begin{aligned}&{\mathfrak {P}}_\gamma (y^{t+1},z^t,x^t) - {\mathfrak {P}}_\gamma (y^t,z^t,x^t)\nonumber \\&\quad = f(y^{t+1}) + \frac{1}{2\gamma } \Vert x^t - y^{t+1}\Vert ^2 - f(y^t) - \frac{1}{2\gamma } \Vert x^t - y^t\Vert ^2 - \frac{1}{\gamma }\Vert y^{t+1} - z^t\Vert ^2 \nonumber \\&\qquad + \frac{1}{\gamma }\Vert y^t - z^t\Vert ^2\nonumber \\&\quad \le -\frac{1}{2}\left( \frac{1}{\gamma }+ \sigma \right) \Vert y^{t+1} - y^t\Vert ^2 - \frac{1}{\gamma }\Vert y^{t+1} - z^t\Vert ^2 + \frac{1}{\gamma }\Vert y^t - z^t\Vert ^2, \end{aligned}$$
(16)

where, in the last inequality, we used the definition of \(y^{t+1}\) as a minimizer and the strong convexity of the objective in the minimization problem that defines the y-update. Combining (16) with (15) gives further that

$$\begin{aligned} {\mathfrak {P}}_\gamma (y^{t+1},z^{t+1},x^{t+1}) - {\mathfrak {P}}_\gamma (y^t,z^t,x^t)\le & {} -\frac{1}{2}\left( \frac{1}{\gamma }+\sigma \right) \Vert y^{t+1} - y^t\Vert ^2\nonumber \\&+\frac{1}{\gamma }\Vert y^{t+1} - z^t\Vert ^2 + \frac{1}{\gamma }\Vert y^t - z^t\Vert ^2.\nonumber \\ \end{aligned}$$
(17)

To further upper estimate (17), observe from the first relation in (9) that

$$\begin{aligned} \nabla f(y^{t+1}) = \frac{1}{\gamma }(x^t - y^{t+1}). \end{aligned}$$

Since f is strongly convex with modulus \(\sigma > 0\) by assumption, we see that for all \(t\ge 1\),

$$\begin{aligned}&\left\langle \frac{1}{\gamma }(x^t - y^{t+1}) - \frac{1}{\gamma }(x^{t-1} - y^t),y^{t+1} - y^t\right\rangle \ge \sigma \Vert y^{t+1} - y^t\Vert ^2\\&\quad \Longrightarrow \langle x^{t} -x^{t-1},y^{t+1} - y^t\rangle \ge (1 +\gamma \sigma )\Vert y^{t+1} - y^t\Vert ^2. \end{aligned}$$

Thus, making use of the definition of \(x^t\) and the above relation, we obtain further that

$$\begin{aligned}&\Vert y^{t+1} - z^t\Vert ^2 = \Vert y^{t+1} - y^t + y^t - z^t\Vert ^2 = \left\| y^{t+1} - y^t - \frac{1}{2}(x^t - x^{t-1})\right\| ^2\nonumber \\&\quad = \Vert y^{t+1} - y^t\Vert ^2 - \langle y^{t+1} - y^t,x^t - x^{t-1}\rangle + \frac{1}{4}\Vert x^t - x^{t-1}\Vert ^2\nonumber \\&\quad \le -\gamma \sigma \Vert y^{t+1} - y^t\Vert ^2 + \frac{1}{4}\Vert x^t - x^{t-1}\Vert ^2. \end{aligned}$$
(18)

In addition, observe also from the definition of the x-update, the first relation in (9) and the Lipschitz continuity of \(\nabla f\) that for \(t\ge 1\)

$$\begin{aligned} 2\Vert y^t - z^t\Vert = \Vert x^t - x^{t-1}\Vert \le (1 + \gamma L)\Vert y^{t+1} - y^t\Vert . \end{aligned}$$
(19)

Combining (18), (19) with (17), we conclude that for any \(t\ge 1\)

$$\begin{aligned} {\mathfrak {P}}_\gamma (y^{t+1},z^{t+1},x^{t+1}) \!-\! {\mathfrak {P}}_\gamma (y^t,z^t,x^t)\le & {} \frac{1}{2\gamma }\left( (1 \!+\! \gamma L)^2 \!-\! 3\gamma \sigma \!-\! 1\right) \Vert y^{t+1} \!-\! y^t\Vert ^2\nonumber \\= & {} \frac{1}{2}\left( -3\sigma + 2L + \gamma L^2\right) \Vert y^{t+1} - y^t\Vert ^2.\nonumber \\ \end{aligned}$$
(20)

By our choice of \(\gamma \), \(-3\sigma + 2L + \gamma L^2 < 0\). From this we see immediately that \(\{{\mathfrak {P}}_\gamma (y^t,z^t,x^t)\}\) is nonincreasing. Summing (20) from \(t=1\) to \(N-1\ge 1\), we obtain that

$$\begin{aligned} {\mathfrak {P}}_\gamma (y^N,z^N,x^N) - {\mathfrak {P}}_\gamma (y^1,z^1,x^1) \le \frac{1}{2}\left( -3\sigma + 2L + \gamma L^2\right) \sum _{t=1}^{N-1}\Vert y^{t+1} - y^t\Vert ^2.\nonumber \\ \end{aligned}$$
(21)

Using this, the closedness of \({\mathfrak {P}}_\gamma \) and the existence of cluster points, we conclude immediately from (21) that \(\lim \limits _{t\rightarrow \infty }\Vert y^{t+1} - y^t\Vert =0\). Combining this with (19), we conclude that (12) holds. Furthermore, combining these with the third relation in (6), we obtain further that \(\lim \limits _{t\rightarrow \infty }\Vert z^{t+1} - z^t\Vert =0\).

Consequently, if \((y^*,z^*,x^*)\) is a cluster point of \(\{(y^t,z^t,x^t)\}\) with a convergent subsequence \(\{(y^{t_j},z^{t_j},x^{t_j})\}\) such that \(\lim \limits _{j\rightarrow \infty }(y^{t_j},z^{t_j},x^{t_j}) = (y^*,z^*,x^*)\), then we must have

$$\begin{aligned} \lim _{j\rightarrow \infty }(y^{t_j},z^{t_j},x^{t_j}) = \lim _{j\rightarrow \infty }(y^{t_j-1},z^{t_j-1},x^{t_j-1}) = (y^*,z^*,x^*). \end{aligned}$$
(22)

Since \(z^t\) is a minimizer of the subproblem,

$$\begin{aligned} g(z^t) + \frac{1}{2\gamma }\Vert 2y^t - z^t - x^{t-1}\Vert ^2 \le g(z^*) + \frac{1}{2\gamma }\Vert 2y^t - z^* - x^{t-1}\Vert ^2. \end{aligned}$$

Taking limit along the convergent subsequence and using (22) yields

$$\begin{aligned} \limsup _{j\rightarrow \infty }g(z^{t_j})\le g(z^*). \end{aligned}$$

On the other hand, we have \(\liminf \limits _{j\rightarrow \infty }g(z^{t_j})\ge g(z^*)\) by the lower semicontinuity of g. Thus,

$$\begin{aligned} \lim _{j\rightarrow \infty }g(z^{t_j})= g(z^*). \end{aligned}$$
(23)

Using (4), (12), (23) and passing to the limit in (10) along the convergent subsequence above, we conclude that the cluster point gives a stationary point of (5), i.e., \(y^* = z^*\) and

$$\begin{aligned} 0\in \nabla f(z^*) + \partial g(z^*). \end{aligned}$$

This completes the proof. \(\square \)

In the next theorem, we study sufficient conditions to guarantee boundedness of the sequence generated from the PR splitting method. Thus, a cluster point will necessarily exist under these conditions.

Theorem 2

(Boundedness of sequence) Suppose that \(3\sigma > 2L\) and the parameter \(\gamma \) is chosen to satisfy (11). Suppose in addition that \(f + g\) is coercive, i.e., \(\liminf _{\Vert u\Vert \rightarrow \infty } (f + g)(u) = \infty \). Then the sequence \(\{(y^t,z^t,x^t)\}\) generated from (6) is bounded.

Proof

Recall from Theorem 1 that the merit function is nonincreasing along the sequence generated from (6). In particular,

$$\begin{aligned} {\mathfrak {P}}_\gamma (y^t,z^t,x^t) \le {\mathfrak {P}}_\gamma (y^1,z^1,x^1) \end{aligned}$$
(24)

whenever \(t\ge 1\), where

$$\begin{aligned} {\mathfrak {P}}_\gamma (y^t,z^t,x^t)= f(y^t) + g(z^t) - \frac{1}{2\gamma }\Vert x^t - z^t\Vert ^2 + \frac{1}{2\gamma }\Vert x^t - y^t\Vert ^2 - \frac{1}{\gamma }\Vert y^t-z^t\Vert ^2\nonumber \\ \end{aligned}$$
(25)

from the second relation in (8). Next, recall from the definition of x-update that \(x^t = x^{t-1} + 2(z^t - y^t)\), which together with the first relation in (9) gives

$$\begin{aligned} \nabla f(y^t) = \frac{1}{\gamma }(x^{t-1}-y^t) = \frac{1}{\gamma }([x^t - z^t] - [z^t - y^t]). \end{aligned}$$
(26)

Moreover, for the function f whose gradient is Lipschitz continuous with modulus L, we have

$$\begin{aligned} f(z^t) \le f(y^t) + \langle \nabla f(y^t), z^t - y^t\rangle + \frac{L}{2}\Vert z^t - y^t\Vert ^2. \end{aligned}$$
(27)

Combining these with (25) and (24), we see further that

$$\begin{aligned}&{\mathfrak {P}}_\gamma (y^1,z^1,x^1) \ge f(y^t) + g(z^t) - \frac{1}{2\gamma }\Vert x^t - z^t\Vert ^2 + \frac{1}{2\gamma }\Vert x^t - y^t\Vert ^2 - \frac{1}{\gamma }\Vert y^t-z^t\Vert ^2\nonumber \\&\quad \ge f(z^t) + g(z^t) - \langle \nabla f(y^t), z^t - y^t\rangle - \frac{1}{2\gamma }\Vert x^t - z^t\Vert ^2\nonumber \\&\qquad +\,\frac{1}{2\gamma }\Vert x^t - y^t\Vert ^2 - \left( \frac{L}{2} + \frac{1}{\gamma }\right) \Vert y^t-z^t\Vert ^2\nonumber \\&\quad = f(z^t) + g(z^t) - \frac{1}{\gamma }\langle x^t - z^t, z^t - y^t\rangle - \frac{1}{2\gamma }\Vert x^t - z^t\Vert ^2 \nonumber \\&\qquad +\,\frac{1}{2\gamma }\Vert x^t - y^t\Vert ^2 - \frac{L}{2}\Vert y^t-z^t\Vert ^2\nonumber \\&\quad = f(z^t) + g(z^t) + \frac{1}{2}\left( \frac{1}{\gamma }- L\right) \Vert y^t - z^t\Vert ^2, \end{aligned}$$
(28)

where the second inequality follows from (27), the first equality follows from (26), while the last equality follows from the elementary relation \(\langle u,v\rangle = \frac{1}{2}(\Vert u + v\Vert ^2 - \Vert u\Vert ^2 - \Vert v\Vert ^2)\) applied to \(u = x^t - z^t\) and \(v = z^t - y^t\). From (28), the coerciveness of \(f+g\) and the fact that \(\gamma < \frac{3\sigma - 2L}{L^2} \le \frac{1}{L}\), we conclude that \(\{z^t\}\) and \(\{y^t\}\) are bounded. The boundedness of \(\{x^t\}\) now follows from these and the first relation in (9). This completes the proof. \(\square \)

Remark 2

(Comments on the proof of Theorem 2)

(i) The technique of using (27) for establishing (28) was also used previously in [21, Lemma 3.3] for showing that the augmented Lagrangian function is bounded below along the sequence generated from the alternating direction method of multipliers for a special class of problems. Here, we apply the technique to the new merit function \({\mathfrak {P}}_{\gamma }\).

(ii) The same technique can be applied to establish the boundedness of the sequence generated by the DR splitting method studied in [25] under a condition which is slightly weaker than the one used in [25]. In fact, one can show that the DR splitting method in [25] generates a bounded sequence under the blanket assumptions on f and g in [25, Section 3], the condition that \(f + g\) is coercive, and the choice of parameter specified in [25, Theorem 4]. To see this, recall that for the DR splitting method, we also have \(\nabla f(y^t) = \frac{1}{\gamma }(x^{t-1} - y^t)\) but have \(x^t = x^{t-1} + (z^{t} - y^t)\) in place of the third relation in (6). Thus, \(\nabla f(y^t) = \frac{1}{\gamma }(x^t - z^t)\) and, making use of (27), we have the following estimate for the DR merit function:

    $$\begin{aligned}&{\mathfrak {D}}_\gamma (y^t,z^t,x^t) = f(y^t) + g(z^t) - \frac{1}{2\gamma }\Vert x^t - z^t\Vert ^2 + \frac{1}{2\gamma }\Vert x^t - y^t\Vert ^2\\&\quad \ge f(z^t) + g(z^t) - \langle \nabla f(y^t), z^t - y^t\rangle - \frac{L}{2}\Vert z^t - y^t\Vert ^2\nonumber \\&\qquad - \frac{1}{2\gamma }\Vert x^t - z^t\Vert ^2 + \frac{1}{2\gamma }\Vert x^t - y^t\Vert ^2\\&\quad = f(z^t) + g(z^t) - \frac{1}{\gamma }\langle x^t - z^t, z^t - y^t\rangle - \frac{L}{2}\Vert z^t - y^t\Vert ^2 \nonumber \\&\qquad - \frac{1}{2\gamma }\Vert x^t - z^t\Vert ^2 + \frac{1}{2\gamma }\Vert x^t - y^t\Vert ^2\\&\quad = f(z^t) + g(z^t) + \frac{1}{2}\left( \frac{1}{\gamma }- L\right) \Vert y^t - z^t\Vert ^2, \end{aligned}$$

    where the last equality follows from the elementary relation \(\langle u,v\rangle = \frac{1}{2}(\Vert u + v\Vert ^2 - \Vert u\Vert ^2 - \Vert v\Vert ^2)\) applied to \(u = x^t - z^t\) and \(v = z^t - y^t\). The boundedness of the sequence can then be deduced under the choice of \(\gamma \) in [25, Theorem 4], which guarantees \(\gamma < \frac{1}{L}\), and the assumption that \(f + g\) is coercive.

As in [24, Theorem 4] and [25, Theorem 2], one can also show that the whole sequence generated is convergent under the additional assumption that \({\mathfrak {P}}_\gamma (y,z,x)\) is a Kurdyka–Łojasiewicz (KL) function. To this end, note that for any \(t\ge 1\), we have from (7) and the third relation in (6) that

$$\begin{aligned} \nabla _x{\mathfrak {P}}_\gamma (y^t,z^t,x^t) = \frac{1}{\gamma }(z^t - y^t) = \frac{1}{2\gamma }(x^t - x^{t-1}). \end{aligned}$$
(29)

Moreover, using the second relation in (8), one can obtain

$$\begin{aligned} \nabla _y{\mathfrak {P}}_\gamma (y^t,z^t,x^t)= & {} \nabla f(y^t) + \frac{1}{\gamma }(y^t - x^t) - \frac{2}{\gamma }(y^t - z^t)\nonumber \\= & {} \frac{1}{\gamma }(x^{t-1} - x^t) - \frac{2}{\gamma }(y^t - z^t) = 0 \end{aligned}$$
(30)

where the second equality follows from the first relation in (9), and the last equality follows again from the third relation in (6). Finally, using the second relation in (8), one can compute that

$$\begin{aligned}&\partial _z {\mathfrak {P}}_\gamma (y^t,z^t,x^t) = \partial g(z^t) - \frac{1}{\gamma }(z^t - x^t) - \frac{2}{\gamma }(z^t - y^t)\nonumber \\&\quad = \partial g(z^t) + \frac{1}{\gamma }(z^t - y^t) - \frac{1}{\gamma }(y^t - x^{t-1}) - \frac{1}{\gamma }(z^t - y^t) + \frac{1}{\gamma }(y^t - x^{t-1})\nonumber \\&\qquad -\frac{1}{\gamma }(z^t - x^t) - \frac{2}{\gamma }(z^t - y^t)\nonumber \\&\qquad \ni -\frac{4}{\gamma }(z^t - y^t) + \frac{1}{\gamma }(x^t - x^{t-1}) = -\frac{1}{\gamma }(x^t - x^{t-1}), \end{aligned}$$
(31)

where the inclusion follows from the second relation in (9) and the last equality follows from the third relation in (6). Consequently, by combining (29), (30), (31) and (19), we see the existence of \(\kappa > 0\) so that

$$\begin{aligned} \mathrm{dist}~(0,\partial {\mathfrak {P}}_\gamma (y^t,z^t,x^t))\le \kappa \Vert y^{t+1} - y^t\Vert . \end{aligned}$$

Using this, (20) and following the arguments as in the proof of [25, Theorem 2], it is not hard to prove the following result. We omit the detailed proof here.

Theorem 3

(Global convergence of the whole sequence) Suppose that \(3\sigma >2L\), the parameter \(\gamma > 0\) is chosen as in (11) and that the sequence \(\{(y^t,z^t,x^t)\}\) generated from (6) has a cluster point \((y^*,z^*,x^*)\). Suppose also that \({\mathfrak {P}}_\gamma \) is a KL function. Then the whole sequence \(\{(y^t,z^t,x^t)\}\) is convergent.

As we have seen from Theorems 1 and 2, our convergence analysis of the PR splitting method requires that the nonconvex objective function can be decomposed as \(f+g\) with f strongly convex. It should be noted that if the strong convexity assumption on f is dropped, then the sequence generated does not necessarily converge to, or cluster at, a stationary point, even when g is also convex. On the other hand, in the next section, we will demonstrate how the method can be suitably applied to minimizing a coercive function \(F+G\), where G is a proper closed function and F is convex Lipschitz differentiable but not necessarily strongly convex.

3 Peaceman–Rachford splitting methods for nonconvex problems with non-strongly convex decomposition

In many applications, the underlying optimization problem can be formulated as

$$\begin{aligned} \min \limits _u\quad F(u) + G(u) \end{aligned}$$
(32)

where \(F + G\) is coercive, F is a convex smooth function with a Lipschitz continuous gradient whose modulus is at most \(L_F > 0\), and G is a proper and closed function with a nonempty proximal mapping \(\mathrm{prox}_{\tau G}(z)\) for any z and any \(\tau > 0\). For example, when F is the least squares loss function for linear regression and G is the indicator function of the \(\ell _1\) norm ball, the problem (32) reduces to the LASSO [32]. This and various related (possibly nonconvex) models have been studied extensively in the statistical literature; see, for example, [2, 6, 11, 17, 22]. We will also provide more concrete examples and simulation results later in Sect. 4.

In view of the structure of (32), a natural way of applying a splitting method would be to set \(f(y) = F(y)\) and \(g(z) = G(z)\). However, since this choice of f is not strongly convex, our convergence theory in Sect. 2 cannot be applied to deduce convergence of the resulting PR splitting method.

Thus, we consider an alternative way of splitting the objective in order to obtain a strongly convex f. To this end, we start by fixing any \(\alpha > 0\) and defining \(f(y) = F(y) + \frac{\alpha }{2}\Vert y\Vert ^2\), \(g(z) = G(z) - \frac{\alpha }{2}\Vert z\Vert ^2\). Then \(\nabla f\) is Lipschitz continuous with a modulus at most \(L = L_F + \alpha \), and f is strongly convex with modulus at least \(\sigma = \alpha \). Thus, one only needs to pick \(\alpha > 2L_F\) so that \(3\sigma > 2L\). Let \(\alpha = \beta L_F\) for some \(\beta > 2\). Then the upper bound of \(\gamma \) in (11) is given by

$$\begin{aligned} \frac{\alpha -2L_F}{(L_F + \alpha )^2} = \frac{\beta -2}{(\beta + 1)^2L_F}. \end{aligned}$$

Consequently, if we set

$$\begin{aligned} f(y) = F(y) + \frac{\beta L_F}{2}\Vert y\Vert ^2 \ \mathrm{and}\ g(z) = G(z) - \frac{\beta L_F}{2}\Vert z\Vert ^2, \end{aligned}$$

then we can pick \(0< \gamma < \frac{\beta -2}{(\beta + 1)^2L_F}\). Moreover, for this choice of \(\gamma \), Assumption 2 is satisfied for the above choice of g: indeed, since \(\gamma \beta L_F < 1\), a direct computation shows that \(\mathrm{prox}_{\gamma g}(w) = \mathrm{prox}_{\tau G}\left( \frac{w}{1-\gamma \beta L_F}\right) \) with \(\tau = \frac{\gamma }{1-\gamma \beta L_F} > 0\), which is nonempty by assumption. Hence, it follows from Theorem 2 that the sequence generated by applying the PR splitting method to this pair of f and g is bounded, and any cluster point then gives a stationary point of (32), according to Theorem 1. For concreteness and easy reference in our subsequent discussion, we present this algorithm explicitly below:

Given \(\beta > 2\), \(0< \gamma < \frac{\beta -2}{(\beta + 1)^2L_F}\) and an initial point \(x^0\), generate, for \(t = 0, 1, 2, \ldots \),

$$\begin{aligned} \left\{ \begin{array}{l} y^{t+1} = \mathop {\hbox {arg min}}\limits _y\left\{ F(y) + \frac{\beta L_F}{2}\Vert y\Vert ^2 + \frac{1}{2\gamma }\Vert y - x^t\Vert ^2\right\} ,\\ z^{t+1} \in \mathop {\hbox {Arg min}}\limits _z\left\{ G(z) - \frac{\beta L_F}{2}\Vert z\Vert ^2 + \frac{1}{2\gamma }\Vert z - (2y^{t+1} - x^t)\Vert ^2\right\} ,\\ x^{t+1} = x^t + 2(z^{t+1} - y^{t+1}). \end{array}\right. \end{aligned}$$
(33)
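For illustration, the following sketch instantiates (33) on an \(\ell _1\)-regularized least squares problem with \(F(u) = \frac{1}{2}\Vert Au - b\Vert ^2\) and \(G(u) = \lambda \Vert u\Vert _1\) (this particular instance, as well as the choices \(\beta = 3\) and \(\gamma \) at 90% of its bound, are ours and purely illustrative). Both subproblems then have closed forms: the y-update is a linear solve, and the z-update reduces to a scaled soft-thresholding, in line with the reduction of \(\mathrm{prox}_{\gamma g}\) to \(\mathrm{prox}_{\tau G}\) noted above.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def pr_l1_least_squares(A, b, lam, beta=3.0, iters=1000):
    """PR splitting (33) applied to F(u) = 0.5*||Au-b||^2 and G(u) = lam*||u||_1
    (an illustrative l1-penalized instance; both prox steps are in closed form)."""
    LF = np.linalg.norm(A, 2) ** 2             # Lipschitz modulus of grad F
    gamma = 0.9 * (beta - 2.0) / ((beta + 1.0) ** 2 * LF)
    n = A.shape[1]
    # y-update solves (A^T A + (beta*LF + 1/gamma) I) y = A^T b + x/gamma
    M = A.T @ A + (beta * LF + 1.0 / gamma) * np.eye(n)
    x = np.zeros(n)
    for _ in range(iters):
        y = np.linalg.solve(M, A.T @ b + x / gamma)
        w = 2 * y - x
        # prox of gamma*(G - beta*LF/2*||.||^2) reduces to a scaled soft-threshold
        scale = 1.0 - gamma * beta * LF
        z = soft_threshold(w / scale, lam * gamma / scale)
        x = x + 2 * (z - y)
    return z
```

Incidentally, the step-size bound \(\frac{\beta -2}{(\beta + 1)^2L_F}\) is maximized at \(\beta = 5\), where it equals \(\frac{1}{12L_F}\); the value \(\beta = 3\) in the sketch is purely illustrative.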

To the best of our knowledge, the global convergence of the sequence generated from (33) is new; we summarize it below for concreteness.

Corollary 1

Consider optimization problem (32) and let \(\{(y^t,z^t,x^t)\}\) be the sequence generated from (33). Then the sequence is bounded, and any cluster point \((\bar{y},\bar{z},\bar{x})\) satisfies \(\bar{y} = \bar{z}\), with \(\bar{z}\) being a stationary point of (32), that is,

$$\begin{aligned} 0\in \nabla F(\bar{z}) + \partial G(\bar{z}). \end{aligned}$$

Proof

We first note that since (33) is just (6) applied to \(f(y) = F(y) + \frac{\beta L_F}{2}\Vert y\Vert ^2\) and \(g(z) = G(z) - \frac{\beta L_F}{2}\Vert z\Vert ^2\), we obtain immediately from the above discussion and Theorem 1 that \(\bar{y} = \bar{z}\) and \(\bar{z}\) is a stationary point of (32) for any cluster point \((\bar{y},\bar{z},\bar{x})\). In addition, the objective function \(f + g = F + G\) is coercive by assumption. The boundedness of the sequence \(\{(y^t,z^t,x^t)\}\) now follows from Theorem 2. This completes the proof. \(\square \)

3.1 Peaceman–Rachford splitting method for convex problems

In this subsection, we suppose in addition that the G in (32) is also convex. Hence, (32) is a convex problem. We first establish the following global (ergodic) complexity result for the sequence generated from (33). Similar kinds of complexity results have also been established for other primal-dual methods for convex optimization problems; see, for example, [33, Theorem 2]. We would like to emphasize that the PR splitting method we discuss here is different from the classical PR splitting method in the literature: we split the convex objective \(F+G\) into the sum of a strongly convex function f and a possibly nonconvex function g, while the classical PR splitting method only admits splitting into a sum of convex functions.

Theorem 4

(Global iteration complexity under convexity) Consider optimization problem (32) with G being convex. Let \(\{(y^t,z^t,x^t)\}\) be the sequence generated from (33) and \((\bar{y},\bar{z},\bar{x})\) be any cluster point of this sequence. Then, \(\bar{y} = \bar{z}\) and \(\bar{z}\) is a solution of (32). Moreover, for any \(N \ge 1\), we have

$$\begin{aligned} F(\bar{z}^N) + G(\bar{z}^N) - F(\bar{z}) - G(\bar{z}) \le \frac{1}{8\beta \gamma NL_F}\left( \frac{1}{\gamma } - \beta L_F\right) \Vert x^0 - \bar{x}\Vert ^2, \end{aligned}$$
(34)

where \(\bar{z}^N := \frac{1}{N}\sum _{t=1}^N z^t\) and

$$\begin{aligned} \min _{0 \le t \le N}\{\Vert x^{t+1} -x^{t}\Vert \}=o\left( \frac{1}{\sqrt{N}}\right) . \end{aligned}$$

Proof

Since (32) is convex and \(\bar{z}\) is a stationary point by Corollary 1, we conclude that \(\bar{z}\) is actually optimal. We now establish the inequality (34). First, from the first-order optimality conditions for the y- and z-updates in (33), we have

$$\begin{aligned}&-\left( \beta L_F + \frac{1}{\gamma }\right) y^{t+1} + \frac{1}{\gamma }x^t = \nabla F(y^{t+1}),\nonumber \\&\quad \left( \beta L_F - \frac{1}{\gamma }\right) z^{t+1} - \frac{1}{\gamma }x^t + \frac{2}{\gamma }y^{t+1} \in \partial G(z^{t+1}). \end{aligned}$$
(35)

Moreover, it is not hard to see from the definition of cluster point and (12) that (35) is also satisfied with \(\bar{x}\) in place of \(x^t\) and \((\bar{y},\bar{z})\) in place of \((y^{t+1},z^{t+1})\). Write \(w^t_e = w^t - \bar{w}\) for \(w = x\), y or z for notational simplicity. We have from (35) (and its counterpart at \((\bar{y},\bar{z},\bar{x})\)) and the monotonicity of convex subdifferentials that

$$\begin{aligned}&\left\langle -\left( \beta L_F + \frac{1}{\gamma }\right) y^{t+1}_e + \frac{1}{\gamma }x^t_e,y^{t+1}_e\right\rangle \ge 0,\\&\left\langle \left( \beta L_F - \frac{1}{\gamma }\right) z^{t+1}_e - \frac{1}{\gamma }x^t_e + \frac{2}{\gamma }y^{t+1}_e,z^{t+1}_e\right\rangle \ge 0. \end{aligned}$$

Summing these two relations and rearranging terms, we obtain that

$$\begin{aligned} \langle x^t_e,y^{t+1} - z^{t+1}\rangle + 2\langle y_e^{t+1},z^{t+1}_e\rangle \ge (1 + \beta \gamma L_F)\Vert y^{t+1}_e\Vert ^2 + (1 - \beta \gamma L_F)\Vert z^{t+1}_e\Vert ^2.\nonumber \\ \end{aligned}$$
(36)

Next, observe that

$$\begin{aligned} \langle x^t_e,y^{t+1} - z^{t+1}\rangle= & {} \frac{1}{2}\langle x^t_e,x^t - x^{t+1}\rangle = \frac{1}{4} \left( \Vert x^t_e\Vert ^2 + \Vert x^t - x^{t+1}\Vert ^2 - \Vert x^{t+1}_e\Vert ^2\right) \nonumber \\= & {} \frac{1}{4}\left( \Vert x^t_e\Vert ^2 - \Vert x^{t+1}_e\Vert ^2\right) + \Vert z^{t+1} - y^{t+1}\Vert ^2\nonumber \\= & {} \frac{1}{4}\left( \Vert x^t_e\Vert ^2 - \Vert x^{t+1}_e\Vert ^2\right) + \Vert z_e^{t+1}\Vert ^2 + \Vert y_e^{t+1}\Vert ^2 - 2\langle y_e^{t+1},z^{t+1}_e\rangle ,\nonumber \\ \end{aligned}$$
(37)

where the first and third equalities follow from the third relation in (33), the second equality follows from the elementary relation \(\langle u,v\rangle = \frac{1}{2}(\Vert u\Vert ^2 + \Vert v\Vert ^2 - \Vert u - v\Vert ^2)\) as applied to \(u = x_e^t\) and \(v = x^t - x^{t+1}\). Combining (37) with (36), we see further that

$$\begin{aligned} \frac{1}{4} \Vert x_e^t\Vert ^2 - \frac{1}{4} \Vert x_e^{t+1}\Vert ^2 \ge \beta \gamma L_F\left( \Vert y^{t+1}_e\Vert ^2 - \Vert z^{t+1}_e\Vert ^2\right) \end{aligned}$$
(38)

Next, using the fact that \(\nabla F\) is Lipschitz continuous with modulus at most \(L_F\), we have

$$\begin{aligned} F(z^{t+1}) \le F(y^{t+1}) + \langle \nabla F(y^{t+1}), z^{t+1} - y^{t+1}\rangle + \frac{L_F}{2}\Vert z^{t+1} - y^{t+1}\Vert ^2. \end{aligned}$$
(39)

From this we see further that

$$\begin{aligned}&F(z^{t+1}) + G(z^{t+1}) - F(\bar{z}) - G(\bar{z})\nonumber \\&\quad \le F(y^{t+1}) - F(\bar{y}) + G(z^{t+1}) - G(\bar{z}) + \langle \nabla F(y^{t+1}), z^{t+1} - y^{t+1}\rangle \nonumber \\&\qquad +\,\frac{L_F}{2}\Vert z^{t+1} - y^{t+1}\Vert ^2\nonumber \\&\quad \le \langle \nabla F(y^{t+1}),y^{t+1}_e\rangle + \left\langle \left( \beta L_F - \frac{1}{\gamma }\right) z^{t+1} - \frac{1}{\gamma }x^t + \frac{2}{\gamma }y^{t+1},z^{t+1}_e\right\rangle \nonumber \\&\quad +\,\langle \nabla F(y^{t+1}), z^{t+1} - y^{t+1}\rangle + \frac{L_F}{2}\Vert z^{t+1} - y^{t+1}\Vert ^2\nonumber \\&\quad = \left\langle \nabla F(y^{t+1})\! +\! \left( \beta L_F \!-\! \frac{1}{\gamma }\right) z^{t+1} \!-\! \frac{1}{\gamma }x^t + \frac{2}{\gamma }y^{t+1},z^{t+1}_e\right\rangle + \frac{L_F}{2}\Vert z^{t+1} - y^{t+1}\Vert ^2\nonumber \\&\quad = \left\langle -\left( \beta L_F - \frac{1}{\gamma }\right) y^{t+1} + \left( \beta L_F - \frac{1}{\gamma }\right) z^{t+1},z^{t+1}_e\right\rangle + \frac{L_F}{2}\Vert z^{t+1} - y^{t+1}\Vert ^2\nonumber \\&\quad = \left( \frac{1}{\gamma } - \beta L_F\right) \langle y^{t+1} - z^{t+1},z^{t+1}_e\rangle + \frac{L_F}{2}\Vert z^{t+1} - y^{t+1}\Vert ^2\nonumber \\&\quad = \frac{1}{2}\left( \frac{1}{\gamma } - \beta L_F\right) (\Vert y^{t+1}_e\Vert ^2 - \Vert z^{t+1}_e\Vert ^2) + \frac{1}{2}\left( (1 + \beta )L_F - \frac{1}{\gamma }\right) \Vert z^{t+1} - y^{t+1}\Vert ^2\nonumber \\&\quad \le \frac{1}{2}\left( \frac{1}{\gamma } - \beta L_F\right) (\Vert y^{t+1}_e\Vert ^2 - \Vert z^{t+1}_e\Vert ^2)\nonumber \\&\quad \le \frac{1}{8\beta \gamma L_F}\left( \frac{1}{\gamma } - \beta L_F\right) (\Vert x_e^t\Vert ^2 - \Vert x_e^{t+1}\Vert ^2), \end{aligned}$$
(40)

where: the first inequality follows from (39) and the fact that \(\bar{z} = \bar{y}\); the second inequality follows from the subdifferential inequalities applied to F and G at the points \(y^{t+1}\) and \(z^{t+1}\) respectively, and also the second relation in (35); the second equality follows from the first relation in (35); the fourth equality follows from the elementary relation \(\langle u,v\rangle = \frac{1}{2}(\Vert u + v\Vert ^2 - \Vert u\Vert ^2 - \Vert v\Vert ^2)\) as applied to \(u = z^{t+1}_e\) and \(v = y^{t+1} - z^{t+1}\); the second last inequality follows from the fact that \(0< \gamma < \frac{\beta -2}{(\beta + 1)^2L_F}\) so that \((1 + \beta )L_F - \frac{1}{\gamma } < 0\), while the last inequality follows from (38).

Summing both sides of (40) from \(t=0\) to \(N - 1 \ge 0\) and using the convexity of \(F + G\), we have

$$\begin{aligned} F(\bar{z}^N) + G(\bar{z}^N) - F(\bar{z}) - G(\bar{z})\le & {} \frac{1}{N}\sum _{t=0}^{N-1}(F(z^{t+1}) + G(z^{t+1}) - F(\bar{z}) - G(\bar{z}))\\\le & {} \frac{1}{8\beta \gamma NL_F}\left( \frac{1}{\gamma } - \beta L_F\right) \Vert x^0 - \bar{x}\Vert ^2, \end{aligned}$$

where \(\bar{z}^N\) is defined in the statement of the theorem. This proves (34).

Finally, observe from the last equality in (40) that for all \(t \ge 1\)

$$\begin{aligned} 0\le & {} F(z^{t+1}) + G(z^{t+1}) - F(\bar{z}) - G(\bar{z}) \\\le & {} \frac{1}{2}\left( \frac{1}{\gamma } - \beta L_F\right) (\Vert y^{t+1}_e\Vert ^2 - \Vert z^{t+1}_e\Vert ^2) + \frac{1}{2}\left( (1 + \beta )L_F - \frac{1}{\gamma }\right) \Vert z^{t+1} - y^{t+1}\Vert ^2, \end{aligned}$$

where the first inequality follows from the optimality of \(\bar{z}\). Rearranging terms in the above relation, we see further that

$$\begin{aligned} \left( \frac{1}{\gamma } - (1 + \beta )L_F\right) \Vert z^{t+1} - y^{t+1}\Vert ^2 \le \left( \frac{1}{\gamma } - \beta L_F\right) (\Vert y^{t+1}_e\Vert ^2 - \Vert z^{t+1}_e\Vert ^2). \end{aligned}$$

Using this relation and the definition of the x-update, we obtain

$$\begin{aligned} \frac{1}{4}\sum _{t=0}^{N-1}\Vert x^{t+1}-x^t\Vert ^2= & {} \sum _{t=0}^{N-1}\Vert z^{t+1} - y^{t+1}\Vert ^2\\\le & {} \frac{\gamma }{1 - (1 + \beta )\gamma L_F}\left( \frac{1}{\gamma } - \beta L_F\right) \sum _{t=0}^{N-1}(\Vert y^{t+1}_e\Vert ^2 - \Vert z^{t+1}_e\Vert ^2)\\\le & {} \frac{1}{4\beta L_F (1-(1 + \beta ) \gamma L_F)}\left( \frac{1}{\gamma } - \beta L_F\right) \Vert x^0 - \bar{x}\Vert ^2, \end{aligned}$$

where the last inequality is due to (38). Thus, \(\sum _{t=0}^{+\infty }\Vert x^{t+1}-x^t\Vert ^2<+\infty \) and so, \(\sum _{t=N}^{2N-1} \Vert x^{t+1}-x^t\Vert ^2 \rightarrow 0\) as \(N \rightarrow \infty \). Now consider \(\alpha _N:= \min _{0 \le t \le N}\{\Vert x^{t+1} -x^t\Vert ^2\}\) for all \(N \ge 0\). Then, we have \(\alpha _{N+1} \le \alpha _N\) for all \(N \ge 0\) and,

$$\begin{aligned} N\, \alpha _{2N} \le \alpha _N + \cdots + \alpha _{2N-1} \le \sum _{t=N}^{2N-1} \Vert x^{t+1}-x^t\Vert ^2 \rightarrow 0. \end{aligned}$$

This implies that \(\alpha _N = o(1/N)\). Therefore, the conclusion follows. This completes the proof. \(\square \)

Next, we show that the PR splitting method exhibits linear convergence when solving (32) if G is convex and \(F+G\) is strongly convex. We note that, for the classical PR splitting method, linear convergence under strong convexity is known; see [26, Remark 10 and Proposition 4]. As explained before, here we are considering a different PR splitting method.

Proposition 1

(Linear convergence under strong convexity) Consider optimization problem (32) with G being convex. Suppose in addition that \(F+G\) is strongly convex. Let \(\{(y^t,z^t,x^t)\}\) be the sequence generated from (33). Then \(\{(y^t,z^t,x^t)\}\) converges linearly to \((\bar{y}, \bar{z},\bar{x})\) with \(\bar{y} = \bar{z}\) and \(\bar{z}\) being the unique optimal solution of (32), i.e., there exist \(M>0\) and \(r \in (0,1)\) such that for all \(t \ge 1\),

$$\begin{aligned} \max \{\Vert y^{t}-\bar{y}\Vert ^2, \Vert z^{t}-\bar{z}\Vert ^2, \Vert x^{t}-\bar{x}\Vert ^2\} \le M ~ r^t. \end{aligned}$$

Proof

Let \((\bar{y},\bar{z},\bar{x})\) be any cluster point of the sequence \(\{(y^t,z^t,x^t)\}\). As before, we write \(w^t_e = w^t - \bar{w}\) for \(w = x\), y or z for notational simplicity. From the preceding theorem \(\bar{y} = \bar{z}\) and \(\bar{z}\) is optimal for (32). Note that \(F+G\) is strongly convex. Hence, the optimal solution of (32) exists and is unique. Consequently, the whole sequence \(\{(y^t,z^t)\}\) converges to the unique limit \((\bar{z},\bar{z})\), where \(\bar{z}\) is the unique solution of (32). From this and (35) one can deduce that \(\{x^t\}\) is also convergent, and hence, converges to \(\bar{x}\). We next establish linear convergence.

Denote the strong convexity modulus of \(F+G\) by \(\sigma _1\). From (40), the strong convexity of \(F+G\) and the fact that \(\bar{z}\) is the solution of (32), we see that for all \(t \ge 1\),

$$\begin{aligned} \frac{\sigma _1}{2}\Vert z_e^{t+1}\Vert ^2 \le F(z^{t+1}) + G(z^{t+1}) - F(\bar{z}) - G(\bar{z}) \le C(\Vert x_e^t\Vert ^2 - \Vert x_e^{t+1}\Vert ^2), \end{aligned}$$
(41)

where \(C:=\frac{1}{8\beta \gamma L_F}\left( \frac{1}{\gamma } - \beta L_F\right) \). Moreover, from the last inequality in (40), we have for all \(t \ge 1\),

$$\begin{aligned} C_1 (\Vert y_e^{t+1}\Vert ^2 - \Vert z_e^{t+1}\Vert ^2) \le C(\Vert x_e^t\Vert ^2 - \Vert x_e^{t+1}\Vert ^2), \end{aligned}$$

where \(C_1=\frac{1}{2}\left( \frac{1}{\gamma } - \beta L_F\right) \). It then follows that

$$\begin{aligned} \Vert y_e^{t+1}\Vert ^2 -\frac{C}{C_1}(\Vert x_e^t\Vert ^2 - \Vert x_e^{t+1}\Vert ^2)\le \Vert z_e^{t+1}\Vert ^2. \end{aligned}$$

This together with (41) gives us that for all \(t \ge 1\),

$$\begin{aligned} \Vert y_e^{t+1}\Vert ^2 \le \left( \frac{2C}{\sigma _1} + \frac{C}{C_1}\right) (\Vert x_e^{t}\Vert ^2 - \Vert x_e^{t+1}\Vert ^2). \end{aligned}$$
(42)

On the other hand, note from the first relation in (35) that

$$\begin{aligned} -\left( \beta L_F + \frac{1}{\gamma }\right) y_e^{t+1} + \frac{1}{\gamma }x_e^{t} = \nabla F(y^{t+1}) - \nabla F(\bar{y}). \end{aligned}$$

This together with the Lipschitz continuity of \(\nabla F\) implies that

$$\begin{aligned} -\left( \beta L_F + \frac{1}{\gamma }\right) \Vert y_e^{t+1}\Vert + \frac{1}{\gamma }\Vert x_e^{t}\Vert \le \Vert \nabla F(y^{t+1}) - \nabla F(\bar{y})\Vert \le L_F \Vert y_e^{t+1}\Vert \end{aligned}$$

and consequently, \(\Vert x_e^{t}\Vert \le ((1 + \beta ) \gamma L_F + 1) \Vert y_e^{t+1}\Vert \). Thus, we obtain that, for all \(t \ge 1\)

$$\begin{aligned} \frac{1}{((1 + \beta ) \gamma L_F + 1)^2}\Vert x_e^t\Vert ^2 \le \Vert y_e^{t+1}\Vert ^2 \le \left( \frac{2C}{\sigma _1} + \frac{C}{C_1}\right) (\Vert x_e^t\Vert ^2 - \Vert x_e^{t+1}\Vert ^2). \end{aligned}$$

This shows that there exists \(r \in (0,1)\) such that

$$\begin{aligned} \Vert x_e^{t+1}\Vert ^2 \le r \Vert x_e^t\Vert ^2 \text{ for } \text{ all } t \ge 1. \end{aligned}$$

It follows that

$$\begin{aligned} \Vert x_e^{t}\Vert ^2 \le \Vert x^0-\bar{x}\Vert ^2 ~ r^{t} \text{ for } \text{ all } t \ge 1. \end{aligned}$$

Moreover, from (41) and (42), this further yields that, for all \(t \ge 1\),

$$\begin{aligned} \Vert z_e^{t+1}\Vert ^2 \le \frac{2C }{\sigma _1} \Vert x_e^t\Vert ^2 \le \frac{2C \Vert x^0-\bar{x}\Vert ^2}{\sigma _1} r^{t} \end{aligned}$$

and

$$\begin{aligned} \Vert y_e^{t+1}\Vert ^2 \le \left( \frac{2C}{\sigma _1} + \frac{C}{C_1}\right) \Vert x_e^{t}\Vert ^2 \le \left( \frac{2C}{\sigma _1} + \frac{C}{C_1}\right) \Vert x^0-\bar{x}\Vert ^2 ~ r^{t}. \end{aligned}$$

Therefore, the conclusion follows. \(\square \)

4 Applications

In this section, we apply the PR splitting method (33) to solving two important classes of nonconvex optimization problems, constrained least squares problems and feasibility problems, based on our discussion in Sect. 3.

4.1 Constrained least squares problems

A common type of problems that arises in the area of statistics and machine learning is the following constrained least squares problem:

$$\begin{aligned} \min \limits _{u\in D}\quad \frac{1}{2} \Vert Au - b\Vert ^2, \end{aligned}$$
(43)

where A is a linear map, b is a vector of suitable dimension, and D is a nonempty compact set that is not necessarily convex. See [23, 32] for concrete examples of (43).

The classical PR splitting method applied to (43) does not have a convergence guarantee. As an alternative, as discussed in Sect. 3, we can set \(f(y) = \frac{1}{2}\Vert Ay - b\Vert ^2 + \frac{\beta \lambda _{\max }(A^TA)}{2}\Vert y\Vert ^2\) and \(g(z) = \delta _D(z) - \frac{\beta \lambda _{\max }(A^TA)}{2}\Vert z\Vert ^2\) and apply the PR splitting method accordingly.

We next discuss computation of the proximal mappings. We start with the proximal mapping of \(\gamma g\). From the definition, for each w, the proximal mapping gives the set of minimizers of

$$\begin{aligned} \min _{z\in D}\left\{ -\frac{\beta \lambda _{\max }(A^TA)}{2}\Vert z\Vert ^2 + \frac{1}{2\gamma }\Vert z - w\Vert ^2\right\} . \end{aligned}$$

It is clear that this set is given by \(P_D\left( \frac{w}{1 - \beta \lambda _{\max }(A^TA)\gamma }\right) \) since \(\gamma < \frac{1}{\beta \lambda _{\max }(A^TA)}\). On the other hand, to compute the proximal mapping for \(\gamma f\), we consider the following optimization problem for each w

$$\begin{aligned} \min _{y}\left\{ \frac{1}{2} \Vert Ay - b\Vert ^2 + \frac{\beta \lambda _{\max }(A^TA)}{2} \Vert y\Vert ^2 + \frac{1}{2\gamma }\Vert y - w\Vert ^2\right\} , \end{aligned}$$

whose unique minimizer is given by

$$\begin{aligned} y = [(\beta \gamma \lambda _{\max }(A^TA) + 1)I + \gamma A^TA]^{-1}(w + \gamma A^Tb). \end{aligned}$$

Thus, the PR splitting method for (43) can be stated as follows:

Given \(\beta > 2\), \(0< \gamma < \frac{\beta -2}{(\beta + 1)^2\lambda _{\max }(A^TA)}\) and an initial point \(x^0\), generate, for \(t = 0, 1, 2, \ldots \),

$$\begin{aligned} \left\{ \begin{array}{l} y^{t+1} = \left[ (\beta \gamma \lambda _{\max }(A^TA) + 1)I + \gamma A^TA\right] ^{-1}(x^t + \gamma A^Tb),\\ z^{t+1} \in P_D\left( \frac{2y^{t+1} - x^t}{1 - \beta \gamma \lambda _{\max }(A^TA)}\right) ,\\ x^{t+1} = x^t + 2(z^{t+1} - y^{t+1}). \end{array}\right. \end{aligned}$$
(44)
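A compact Python sketch of (44) is given below. The projection \(P_D\) onto \(D = \{x:\; \Vert x\Vert _0 \le r,\ \Vert x\Vert _\infty \le 10^6\}\) keeps the r largest-magnitude entries (clipped to the box) and zeroes the rest; the choices \(\beta = 2.2\) and \(\gamma \) at 90% of its bound are illustrative, and the \(\gamma \)-update heuristic used in the experiments below is not included.

```python
import numpy as np

def proj_D(w, r, bound=1e6):
    """Projection onto D = {x : ||x||_0 <= r, ||x||_inf <= bound}:
    keep the r largest-magnitude entries (clipped to [-bound, bound]), zero the rest."""
    z = np.zeros_like(w)
    idx = np.argsort(-np.abs(w))[:r]
    z[idx] = np.clip(w[idx], -bound, bound)
    return z

def pr_constrained_ls(A, b, r, beta=2.2, iters=2000):
    """A sketch of the PR splitting method (44) for min 0.5*||Au-b||^2 over u in D.
    The choices beta = 2.2 and gamma at 90% of its theoretical bound are illustrative."""
    lam = np.linalg.norm(A, 2) ** 2                          # lambda_max(A^T A)
    gamma = 0.9 * (beta - 2.0) / ((beta + 1.0) ** 2 * lam)
    n = A.shape[1]
    H = (beta * gamma * lam + 1.0) * np.eye(n) + gamma * (A.T @ A)
    Atb = A.T @ b
    x = np.zeros(n)
    for _ in range(iters):
        y = np.linalg.solve(H, x + gamma * Atb)                    # y-update
        z = proj_D((2 * y - x) / (1.0 - beta * gamma * lam), r)    # z-update
        x = x + 2 * (z - y)                                        # x-update
    return z
```

Since the matrix in the y-update does not change across iterations, in practice it can be factorized once (for example, by a Cholesky factorization) rather than solved from scratch at every step.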

As a consequence of Corollary 1, we see that Algorithm (44) generates a bounded sequence such that any of its cluster points gives a stationary point of (43). We note that this global convergence result for (44) is new even when D is convex.

To illustrate our proposed approach, we now test the PR splitting method (44) on solving (43). We compare our algorithm against the DR splitting method in [25]. Our initialization and termination criteria for both algorithms are the same as in [25, Section 5]; both algorithms are initialized at the origin and terminated when

$$\begin{aligned} \frac{\max \{\Vert x^t - x^{t-1}\Vert ,\Vert y^t - y^{t-1}\Vert ,\Vert z^t - z^{t-1}\Vert \}}{\max \{\Vert x^{t-1}\Vert ,\Vert y^{t-1}\Vert ,\Vert z^{t-1}\Vert ,1\}}< tol \end{aligned}$$
(45)

for some \(tol > 0\). Note that, in general, the upper bound of \(\gamma \) in algorithm (44) might be too small for practical computation. Thus, following a technique used in [25, Section 5] for the DR splitting method, we adopt a heuristic for the PR splitting method in our numerical simulations, which combines algorithm (44) with a specific update rule for the parameter \(\gamma \). In particular, we set \(\beta = 2.2\) and start with \(\gamma = 0.93/(\beta \lambda _{\max }(A^TA))\). We then update \(\gamma \) as \(\max \{\frac{\gamma }{2},0.9999\cdot \gamma _1\}\) whenever \(\gamma > \gamma _1 := \frac{\beta - 2}{(\beta + 1)^2\lambda _{\max }(A^TA)}\) and the sequence satisfies either \(\Vert y^t-y^{t-1}\Vert > \frac{1000}{t}\) or \(\Vert y^t\Vert > 10^{10}\). Following a similar discussion as in [25, Remark 4], one can show that this heuristic leads to a bounded sequence which clusters at a stationary point of (43). On the other hand, for the DR splitting method, we use the same heuristics described in [25, Section 5] for updating \(\gamma \), but we consider three different initial \(\gamma \)'s: \(k\cdot \gamma _0\) for \(k=10\), 30 and 50, with \(\gamma _0 = (\sqrt{\frac{3}{2}} - 1)/\lambda _{\max }(A^TA)\). These variants are denoted by \(\mathrm{DR}_{10}\), \(\mathrm{DR}_{30}\) and \(\mathrm{DR}_{50}\), respectively.
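In code, the safeguard just described amounts to a one-line update of \(\gamma \) whenever divergence is suspected; a sketch of the rule, with the thresholds quoted above (the function name is ours):

```python
import numpy as np

def update_gamma(gamma, gamma1, y, y_prev, t):
    """Halve gamma (never going below 0.9999*gamma1) when gamma still exceeds the
    theoretical threshold gamma1 and the y-iterates look divergent."""
    diverging = np.linalg.norm(y - y_prev) > 1000.0 / t or np.linalg.norm(y) > 1e10
    if gamma > gamma1 and diverging:
        gamma = max(gamma / 2.0, 0.9999 * gamma1)
    return gamma
```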

In our first numerical experiment, we first randomly generate an \(m\times n\) matrix A, a noise vector \(\epsilon \in \mathrm{I}\!\mathrm{R}^m\), and an \(\hat{x}\in \mathrm{I}\!\mathrm{R}^r\) with \(r = \lceil \frac{m}{10}\rceil \), all with i.i.d. standard Gaussian entries. We further scale each column of A to have unit norm. Next, we generate a random sparse vector \(\tilde{x}\in \mathrm{I}\!\mathrm{R}^n\) by first setting \(\tilde{x} = 0\) and then setting r randomly chosen entries of \(\tilde{x}\) to the entries of \(\hat{x}\). Finally, we set \(b = A\tilde{x} + 0.01\cdot \epsilon \) and \(D = \{x\in \mathrm{I}\!\mathrm{R}^n:\; \Vert x\Vert _0 \le r,\ \Vert x\Vert _\infty \le 10^6\}\); here \(\Vert x\Vert _0\) denotes the number of nonzero entries of x and \(\Vert x\Vert _\infty \) is the \(\ell _\infty \) norm of x.
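For reference, such a random instance can be generated in a few lines; the sketch below follows the description above (the function name and the seed are ours):

```python
import numpy as np

def make_instance(m, n, rng=None):
    """Generate a random constrained least squares instance of (43) as described above."""
    if rng is None:
        rng = np.random.default_rng(0)
    r = int(np.ceil(m / 10))
    A = rng.standard_normal((m, n))
    A /= np.linalg.norm(A, axis=0)                 # scale columns to unit norm
    x_true = np.zeros(n)
    support = rng.choice(n, size=r, replace=False)
    x_true[support] = rng.standard_normal(r)       # entries of \hat{x}
    b = A @ x_true + 0.01 * rng.standard_normal(m)
    return A, b, r
```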

We generate 50 random instances as described above for each pair of (m, n), where \(m\in \{100, 200, 300, 400,500\}\) and \(n\in \{4000,5000,6000\}\). Our results are reported in Table 1, where we present the number of iterations and the function value at termination averaged over the 50 instances. One can observe that the PR splitting method is faster than the DR splitting methods for larger m. Moreover, the function values obtained by the PR splitting method are usually comparable with \(\mathrm{DR}_{30}\), worse than \(\mathrm{DR}_{50}\) and better than \(\mathrm{DR}_{10}\).

Table 1 Comparing \(\mathrm{DR}_{10}\), \(\mathrm{DR}_{30}\), \(\mathrm{DR}_{50}\) and PR splitting for constrained least squares problem on random instances

We also perform experiments using real data. We consider four sets of real data for the A and b used in (43): leukemia data, lymph node status data, breast cancer prognosis data and colon tumor gene expression data. We use the leukemia data pre-processed in [34], which has 3501 genes and 72 samples. The lymph node status data we use are pre-processed in [14], with 4514 genes and 148 samples. The breast cancer prognosis data we use are pre-processed in [34], containing 4919 genes and 76 samples. Finally, for the colon tumor gene expression data, we use the data pre-processed in [19], with 2000 genes and 62 samples.

Similar to [27, Section 3.3], for all the data, we first standardize A and b to make each column have mean 0 and variance 1, and then scale the columns of A to have unit norm. For the A and b thus constructed, we solve (43) with \(D = \{x\in \mathrm{I}\!\mathrm{R}^n:\; \Vert x\Vert _0 \le r,~\Vert x\Vert _\infty \le 10^6\}\) for \(r=10\), 20, 30 by the PR splitting method (44) and compare it with \(\mathrm{DR}_{10}\), \(\mathrm{DR}_{30}\) and \(\mathrm{DR}_{50}\). Our numerical results are presented in Table 2, where one can see that PR is slower than \(\mathrm{DR}_{50}\) and faster than \(\mathrm{DR}_{10}\). Moreover, it usually outperforms \(\mathrm{DR}_{30}\) in terms of function values, and its speed is comparable with \(\mathrm{DR}_{30}\) for the Breast and the Colon data.

Table 2 Comparing \(\mathrm{DR}_{10}\), \(\mathrm{DR}_{30}\), \(\mathrm{DR}_{50}\) and PR splitting on real data

4.2 Feasibility problems

Another important problem in optimization is the feasibility problem [2, 3, 4, 9, 20]. We consider the following simple version: finding a point in the intersection of a nonempty closed convex set C and a nonempty compact set D. It is well known that this problem can be modeled via (32) by setting \(F(u) = \frac{1}{2} d_C^2(u)\) and \(G(u) = \delta _D(u)\); see, for example, [28]. For this choice of F, we have \(L_F = 1\).

As before, it can be shown that the proximal mapping of \(\gamma g\) is given by \(P_D\left( \frac{w}{1 - \beta \gamma }\right) \) since \(\gamma < \frac{1}{\beta }\). We next compute the proximal mapping for \(\gamma f\) in this case. From the definition, for each w, we consider the following optimization problem

$$\begin{aligned} v:= & {} \min _{y}\left\{ \frac{1}{2} d_C^2(y) + \frac{\beta }{2} \Vert y\Vert ^2 + \frac{1}{2\gamma }\Vert y - w\Vert ^2\right\} \nonumber \\= & {} \min _{u\in C}\min _{y}\left\{ \frac{1}{2} \Vert y - u\Vert ^2 + \frac{\beta }{2} \Vert y\Vert ^2 + \frac{1}{2\gamma }\Vert y - w\Vert ^2\right\} . \end{aligned}$$
(46)

Notice that the inner minimization on the right hand side is attained at

$$\begin{aligned} y = \frac{\gamma u + w}{(1 + \beta )\gamma + 1}. \end{aligned}$$
(47)

Plugging (47) back into (46), we see further that

$$\begin{aligned}&v = \frac{1}{((1 + \beta )\gamma + 1)^2} \nonumber \\&\qquad \times \min _{u\in C}\left\{ \frac{1}{2} \Vert (1 + \beta \gamma )u - w\Vert ^2 + \frac{\beta }{2} \Vert \gamma u + w\Vert ^2 + \frac{\gamma }{2}\Vert u -(1 + \beta ) w\Vert ^2\right\} .\nonumber \\ \end{aligned}$$
(48)

It is routine to show that the minimum in (48) is attained at

$$\begin{aligned} u = P_C\left( \frac{w}{1 + \beta \gamma }\right) . \end{aligned}$$

Combining this with (47), the proximal mapping of \(\gamma f\) at w is given by

$$\begin{aligned} \frac{\gamma P_C\left( \frac{w}{1 + \beta \gamma }\right) + w}{(1 + \beta )\gamma + 1}. \end{aligned}$$

Thus, the PR splitting method for (32) with \(F(u) = \frac{1}{2} d_C^2(u)\) and \(G(u) = \delta _D(u)\) can be described as follows:

Given \(\beta > 2\), \(0< \gamma < \frac{\beta -2}{(\beta + 1)^2}\) and an initial point \(x^0\), generate, for \(t = 0, 1, 2, \ldots \),

$$\begin{aligned} \left\{ \begin{array}{l} y^{t+1} = \frac{1}{(1 + \beta )\gamma + 1}\left( \gamma P_C\left( \frac{x^t}{1 + \beta \gamma }\right) + x^t\right) ,\\ z^{t+1} \in P_D\left( \frac{2y^{t+1} - x^t}{1 - \beta \gamma }\right) ,\\ x^{t+1} = x^t + 2(z^{t+1} - y^{t+1}). \end{array}\right. \end{aligned}$$
(49)
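A Python sketch of (49) for the sparse feasibility instances considered below, assuming \(C = \{x:\; Ax = b\}\) with A of full row rank (so that \(P_C\) has the familiar closed form \(w \mapsto w - A^T(AA^T)^{-1}(Aw - b)\)); the constants \(\beta = 2.2\) and \(\gamma \) at 90% of its bound are again illustrative:

```python
import numpy as np

def proj_D(w, r, bound=1e6):
    # keep the r largest-magnitude entries of w, clipped to [-bound, bound]
    z = np.zeros_like(w)
    idx = np.argsort(-np.abs(w))[:r]
    z[idx] = np.clip(w[idx], -bound, bound)
    return z

def pr_feasibility(A, b, r, beta=2.2, iters=5000):
    """A sketch of the PR splitting method (49) for finding a point in C ∩ D with
    C = {x : Ax = b} and D = {x : ||x||_0 <= r, ||x||_inf <= 1e6}.
    A is assumed to have full row rank; beta, gamma and the iteration cap are
    illustrative choices."""
    gamma = 0.9 * (beta - 2.0) / (beta + 1.0) ** 2       # L_F = 1 here
    AAt = A @ A.T
    proj_C = lambda w: w - A.T @ np.linalg.solve(AAt, A @ w - b)
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        y = (gamma * proj_C(x / (1.0 + beta * gamma)) + x) / ((1.0 + beta) * gamma + 1.0)
        z = proj_D((2 * y - x) / (1.0 - beta * gamma), r)
        x = x + 2 * (z - y)
    return z
```

As in the constrained least squares sketch, the \(AA^T\) system appearing in \(P_C\) does not change across iterations and can be factorized once.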

Similarly, as an immediate consequence of Corollary 1, we see that Algorithm (49) generates a bounded sequence such that any of its cluster points gives a stationary point of (32). We would like to point out that this global convergence result for (49) is new even when D is also convex.

As an illustration of our proposed approach, we now test the PR splitting method (49) on solving (32) with \(F(u) = \frac{1}{2} d_C^2(u)\) and \(G(u) = \delta _D(u)\) via MATLAB experiments. We again benchmark our algorithm against the DR splitting method in [25]. Both algorithms are initialized at the origin and terminated when (45) is satisfied with \(tol = 10^{-8}\). Also, as in the previous subsection, we adopt a heuristic for updating \(\gamma \) following the technique used in [25, Section 5]. Specifically, for the PR splitting method (49), we set \(\beta = 2.2\) and start with \(\gamma = 0.93/\beta \) and update \(\gamma \) as \(\max \{\frac{\gamma }{2},0.9999\cdot \gamma _1\}\) whenever \(\gamma > \gamma _1 := \frac{\beta - 2}{(\beta + 1)^2}\), and the sequence satisfies either \(\Vert y^t-y^{t-1}\Vert > \frac{1000}{t}\) or \(\Vert y^t\Vert > 10^{10}\). Following a similar discussion as in [25, Remark 4], this heuristic can be shown to give a bounded sequence that clusters at a stationary point of (32). On the other hand, for the DR splitting method, we adopt the same heuristics described in [25, Section 5] for updating \(\gamma \) but we consider three different initial \(\gamma \)’s: \(k\cdot \gamma _0\) for \(k=50\), 100 and 150, with \(\gamma _0 := \sqrt{\frac{3}{2}} - 1\). These variants are denoted by \(\mathrm DR_{50}\), \(\mathrm DR_{100}\) and \(\mathrm DR_{150}\), respectively.

Table 3 Comparing \(\mathrm{DR}_{150}\) and PR splitting on random instances
Table 4 Computational results for \(\mathrm{DR}_{50}\) and \(\mathrm{DR}_{100}\)

As in [25, Section 5], we consider the problem of finding an r-sparse solution of a randomly generated linear system \(Ax = b\). To be concrete, we set \(C = \{x\in \mathrm{I}\!\mathrm{R}^n:\; Ax = b\}\) and \(D = \{x\in \mathrm{I}\!\mathrm{R}^n:\; \Vert x\Vert _0 \le r,\ \Vert x\Vert _\infty \le 10^6\}\); here \(\Vert x\Vert _0\) denotes the number of nonzero entries of x and \(\Vert x\Vert _\infty \) is the \(\ell _\infty \) norm of x. For the set C, we first generate an \(m\times n\) matrix A and an \(\hat{x}\in \mathrm{I}\!\mathrm{R}^r\) with \(r = \lceil \frac{m}{5}\rceil \), both with i.i.d. standard Gaussian entries. We then set \(\tilde{x}\) to be the n-dimensional zero vector and set r randomly chosen entries of \(\tilde{x}\) to the entries of \(\hat{x}\). We further project this \(\tilde{x}\) onto \([-10^6,10^6]^n\) so that \(\tilde{x}\in D\). Finally, we set \(b = A\tilde{x}\). Consequently, the intersection \(C\cap D\) is nonempty for the instance generated because it contains \(\tilde{x}\). In particular, this means that the globally optimal value of \(\min _u\{\frac{1}{2}d_C^2(u): u\in D\}\) is zero.

In our experiments, we generate 50 random instances as described above for each pair of (m, n), where \(m\in \{100, 200, 300, 400,500\}\) and \(n\in \{4000,5000,6000\}\). We report our results in Tables 3 and 4, where we present the number of iterations averaged over the 50 instances, the largest and smallest function values at termination, and also the number of successes and failures in identifying a sparse solution of the linear system. We also present the average number of iterations for successful instances (\(\mathrm{iter_s}\)) and failed instances (\(\mathrm{iter_f}\)).

In Table 3, we compare our PR splitting method with \(\mathrm{DR}_{150}\). One can observe that this version of the DR splitting method outperforms the PR splitting method in terms of solution quality in this setting. However, the PR splitting method is consistently faster, and its performance becomes comparable with that of the DR splitting method for easier instances (larger m and smaller n/m).

We also present in Table 4 the numerical results for \(\mathrm{DR}_{50}\) and \(\mathrm{DR}_{100}\). One can see that the DR splitting method becomes faster (while still slower than the PR splitting method) for these two smaller initial \(\gamma \), at the price of fewer successful instances.

5 Concluding remarks

In this paper, we studied the applicability of the PR splitting method for solving nonconvex optimization problems. We established global convergence of the method when applied to minimizing the sum of a strongly convex Lipschitz differentiable function f and a proper closed function g, under suitable assumptions. Exploiting the possible nonconvexity of g, we showed how to suitably apply the PR splitting method to a large class of convex optimization problems whose objective function is not necessarily strongly convex. This significantly broadens the applicability of the PR splitting method to cover feasibility problems and many constrained least squares problems.