1 Introduction

In non-convex optimization, we often content ourselves with local properties of the objective function. Exploiting local information, such as smoothness or prox-regularity around the optimum, yields a local convergence theory: local convergence rates can be derived, or iterative optimization algorithms can be designed, based on properties that are available only locally around a local optimum. For such results, it is crucial that the generated sequence, once it has entered such a neighborhood of a local optimum, stays within this neighborhood and converges to a limit point in the same neighborhood.

As an illustrative example, suppose a point close to a local minimizer can be found by a global method, for example, by exhaustive search. In a neighborhood of the local minimizer, we can then switch to a more efficient local algorithm. The local attraction of the local minimum ensures that the generated sequence of iterates stays in this neighborhood, i.e., the sequence does not escape to a different local minimum, and there is no need to switch back to the (slow) global method.

An important example of local properties, which we are going to exploit in this paper, is the fact that the Moreau envelope of a prox-regular function is locally well defined and its gradient is Lipschitz continuous and expressible using the proximal mapping—a result that is well known for convex functions. Locally, this result can be applied to gradient-based iterative methods for minimizing objective functions that involve a Moreau envelope of a function. We pursue this idea for the Heavy-ball method [1, 2] and iPiano [3, 4] (inertial version of forward–backward splitting) and obtain new algorithms for non-convex optimization such as inertial alternating/averaged proximal minimization or projection methods. The convergence result of the Heavy-ball method and iPiano translates directly to these new methods in the non-convex setting. The fact that a wide class of functions is prox-regular extends the applicability of these inertial methods significantly.

Prox-regularity was introduced in [5] and comprises primal-lower-nice (introduced by Poliquin [6]), second-order subsmooth, strongly amenable (see for instance [7]), and proper lower semi-continuous convex functions. It is known that prox-regular functions (locally) share some favorable properties of convex functions, e.g., the formula for the gradient of a Moreau envelope. Indeed, a function is prox-regular if and only if there exists a (function value attentive) localization of the subgradient mapping that is monotone up to a multiple of the identity mapping [5]. In [8], prox-regularity is key to prove local convergence of the averaged projection method using the gradient descent method, which is a result that has motivated this paper.

The convergence proof of the gradient method in [8] follows a general paradigm that is currently actively used for the convergence theory in non-convex optimization. The key is the so-called Kurdyka–Łojasiewicz (KL) property [9,10,11,12,13], which is known to be satisfied by semi-algebraic [14], globally subanalytic functions [15], or, more generally, functions that are definable in an \(\mathrm {o}\)-minimal structure [13, 16]. Global convergence of the full sequence generated by an abstract algorithm to a stationary point is proved for functions with the KL property. The algorithm is abstract in the sense that the generated sequence is assumed to satisfy a sufficient descent condition, a relative error condition, and a continuity condition; however, no generation process is specified.

The following works have also shown global convergence using the KL property or earlier versions thereof. The gradient descent method is considered in [8, 17], the proximal algorithm is analyzed in [8, 18,19,20], and the non-smooth subgradient method in [21, 22]. Convergence of forward–backward splitting (proximal gradient algorithm) is proved in [8]. Extensions to a variable metric are studied in [23] and in [24] with line search. A block coordinate descent version is considered in [25] and a block coordinate variable metric method in [26]. A flexible relative error handling of forward–backward splitting and a non-smooth version of the Levenberg–Marquardt algorithm are explored in [27]. For proximal alternating minimization, we refer to [28] for an early convergence result of the iterates and to [29] for proximal alternating linearized minimization.

Inertial variants of these algorithms have also been examined. In [3], convergence of an inertial forward–backward splitting algorithm, called iPiano, is established under the assumption that the non-smooth part of the objective is convex, whereas [4] and [30] prove convergence in the full non-convex setting, i.e., when the algorithm is applied to minimizing the sum of a smooth non-convex function with Lipschitz gradient and a proper lower semi-continuous function. An extension to an inertial block coordinate variable metric version was studied in [31]. Bregman proximity functions are considered in [30]. A similar method was considered in [32] by the same authors. The convergence of a generic multi-step method is proved in [33] (see also [34]). A slightly weakened formulation of the popular accelerated proximal gradient algorithm from convex optimization was analyzed in [35]. Another fruitful concept from convex optimization is that of composite objective functions involving linear operators. This problem is approached in [36, 37]. Key to the convergence results is usually a decrease condition on the objective function or an upper bound of the objective. The Lyapunov-type idea is studied in [37,38,39]. Convergence of the abstract principle of majorization–minimization methods was also analyzed in a KL framework [40, 41].

The global convergence theory of an unbounded memory multi-step method was proposed in [33]. Local convergence was analyzed under the additional partial smoothness assumption. In particular, local linear convergence of the iterates is established. Although the fruitful concept of partial smoothness is very interesting, in this paper, we focus on convergence results that can be inferred directly from the KL property. In the general abstract setting, local convergence rates were analyzed in [27, 42] and for inertial methods in [34, 42]. More specific local convergence rates can be found in [18, 26, 28, 29, 43, 44].

While the abstract concept in [8] can be used to prove global convergence in the non-convex setting for the gradient descent method, forward–backward splitting, and several other algorithms, it seems to be limited to single-step methods. Therefore, [3] proved a slightly different result for abstract descent methods, which is applicable to multi-step methods, such as the Heavy-ball method and iPiano. In [31], an abstract convergence result is proved that unifies [3, 4, 8, 27].

Contribution In this paper, we develop the local convergence theory for the abstract setting in [3], in analogy to the local theory in [8]. Our local convergence result shows that, for multi-step methods such as the Heavy-ball method or iPiano, a sequence that is initialized close enough to a local minimizer

  • stays in a neighborhood of the local minimum and

  • converges to a local minimizer instead of a stationary point.

This result allows us to apply the formula for the gradient of the Moreau envelope of a prox-regular function to all iterates, which has far-reaching consequences and has not been explored algorithmically before. We obtain several new algorithms for non-convex optimization problems. Conceptually, the algorithms are known from the convex setting or from their non-inertial versions; however, so far there have been no guarantees for the inertial versions in the non-convex setting.

  • The Heavy-ball method applied to the sum of distance functions to prox-regular sets (resp. the sum of Moreau envelopes of prox-regular functions) coincides with the inertial averaged projection method (resp. the inertial averaged proximal minimization) for these prox-regular sets (resp. functions).

  • iPiano applied to the sum of the distance function to a prox-regular set (resp. the Moreau envelope of a prox-regular function) and a simple non-convex set (resp. function) leads to the inertial alternating projection method (resp. inertial alternating proximal minimization) for these two sets (resp. functions).

Of course, these algorithms are only efficient when the associated proximal mappings or projections are simple (efficient to evaluate). Beyond these local results, we provide global convergence guarantees (see Proposition 5.5(2)) for the following methods:

  • The (relaxed) alternating projection method for the feasibility problem of a convex set and a non-convex set.

  • An inertial version of the alternating projection method (iPiano applied to the distance function to a convex set over a non-convex constraint set).

  • An inertial version of alternating proximal minimization (iPiano applied to the sum of the Moreau envelope of a convex function and a non-convex function).

Moreover, we transfer local convergence rates depending on the KL exponent of the involved functions to the methods listed above. This result builds on a recent classification of local convergence rates depending on the KL exponent from [34, 42] (which extends results from [27]).

Outline Section 2 introduces the notation and definitions that are used in this paper. In Sect. 3.1, the conditions for global convergence of abstract descent methods [3, 4] are recapitulated. The main result for abstract descent methods, the attraction of local (or global) minima, is developed and proved in Sect. 3.2. Then, the abstract local convergence results are verified for iPiano (hence the Heavy-ball method) in Sect. 4. The equivalence to inertial averaged/alternating minimization/projection methods is analyzed in Sect. 5. Section 5.4 shows a numerical example of a feasibility problem.

1.1 Perspectives and Open Problems

Key to this paper are the formula for the gradient of the Moreau envelope of a function and the local capturing result that proves the existence of a neighborhood of a local minimizer containing all iterates. As an example, we use this formula for iPiano (a forward–backward splitting-like method). However, any scheme based on gradient steps, e.g., PALM [29], iPALM [45], variable metric iPiano [31], or the quasi-Newton method in [36], applied to objective functions involving Moreau envelopes could make use of the Moreau envelope gradient formula, which will reveal connections to other algorithms. Conversely, gradient-free schemes that involve proximal steps, e.g., Peaceman–Rachford [38], Douglas–Rachford [39], or ADMM [37], can be translated using the same formula. Since the Moreau envelope of a function is a special instance of an infimal convolution, the gradient formula may be generalized, leading, again, to new algorithms and relations. Possibly, this requires the investigation of functions beyond prox-regularity. Finally, using the capturing result, other local properties may be explored, possibly allowing for an algorithmic realization of non-convex duality results.

2 Preliminaries

Throughout this paper, we will always work in a finite-dimensional Euclidean vector space \({\mathbb {R}}^N\) of dimension \(N\in {\mathbb {N}}\), where \({\mathbb {N}}:=\{1,2,\ldots \}\). The vector space is equipped with the standard Euclidean norm \(\vert \cdot \vert _{}:= \sqrt{\left\langle \cdot ,\cdot \right\rangle }\), which is induced by the standard Euclidean inner product \(\left\langle \cdot ,\cdot \right\rangle \). We denote the ball of radius \(\varepsilon >0\) around \({\bar{x}}\in {\mathbb {R}}^N\) by \(B_{\varepsilon }({\bar{x}}):=\lbrace x\in {\mathbb {R}}^N\,:\,\vert x-{\bar{x}} \vert _{}\le \varepsilon \rbrace \).

As usual, we consider extended real-valued functions \(f:{\mathbb {R}}^N \rightarrow \overline{{\mathbb {R}}}\), where \(\overline{{\mathbb {R}}}:={\mathbb {R}}\cup \lbrace +\infty \rbrace \), which are defined on the whole space with domain given by \({\text {dom}}\,f := \lbrace x\in {\mathbb {R}}^N\,:\,f(x)< +\infty \rbrace \). A function is called proper, if \({\text {dom}}\,f \ne \emptyset \). We define the epigraph of f as \({\text {epi}}f:= \lbrace (x,\mu )\in {\mathbb {R}}^{N+1} \,:\,\mu \ge f(x) \rbrace \). The indicator function \(\delta _{C}\) of a set \(C\subset {\mathbb {R}}^N\) is defined by \(\delta _{C}(x):=0\), if \(x\in C\), and \(\delta _{C}(x):=+\infty \), otherwise. A set-valued mapping \(T:{\mathbb {R}}^N \rightrightarrows {\mathbb {R}}^M\), with \(M,N\in {\mathbb {N}}\), is defined by its graph \({\text {Graph}}T:=\lbrace (x,v)\in {\mathbb {R}}^N\times {\mathbb {R}}^M \,:\,v\in T(x) \rbrace \). The range of a set-valued mapping is defined as \(\mathrm {rge}\,T:= \bigcup _{x\in {\mathbb {R}}^N} T(x)\).

A key concept in optimization and variational analysis is Lipschitz continuity, also known as strict continuity, which is defined as in [7]:

Definition 2.1

(Strict continuity [7, Def. 9.1]) A single-valued mapping \(F:D \rightarrow {\mathbb {R}}^M\) defined on \(D\subset {\mathbb {R}}^N\) is strictly continuous at \({\bar{x}}\in D\), if

$$\begin{aligned} \mathrm {lip}F ({\bar{x}}) := \limsup _{\begin{array}{c} x,x^\prime \rightarrow {\bar{x}}\\ x\ne x^\prime \end{array}} \frac{\vert F(x^\prime ) - F(x) \vert }{\vert x^\prime - x \vert } \end{aligned}$$

is finite; \(\mathrm {lip}F({\bar{x}})\) is called the Lipschitz modulus of F at \({\bar{x}}\). This is the same as saying that F is locally Lipschitz continuous at \({\bar{x}}\) on D.

We introduce the term f-attentive convergence: A sequence \((x^k)_{k\in {\mathbb {N}}}\) is said to f-converge to \({\bar{x}}\), if \((x^k,f(x^k))\rightarrow ({\bar{x}}, f({\bar{x}}))\) as \(k\rightarrow \infty \), and we write \(x^k\overset{f}{\rightarrow }{\bar{x}}\).

Definition 2.2

(Subdifferentials [7, Def. 8.3]) The Fréchet subdifferential of f at \({\bar{x}} \in {\text {dom}}\,f\) is the set \({\widehat{\partial }}f({\bar{x}})\) of elements \(v \in {\mathbb {R}}^N\) such that

$$\begin{aligned} \liminf _{\begin{array}{c} x\rightarrow {\bar{x}}\\ x\ne {\bar{x}} \end{array}} \frac{f(x) - f({\bar{x}}) - \left\langle v,x-{\bar{x}} \right\rangle }{\vert x-{\bar{x}} \vert _{}} \ge 0 \,. \end{aligned}$$

For \({\bar{x}}\not \in {\text {dom}}\,f\), we set \({\widehat{\partial }}f({\bar{x}}) = \emptyset \). The so-called (limiting) subdifferential of f at \({\bar{x}}\in {\text {dom}}\,f\) is defined by

$$\begin{aligned} \partial f({\bar{x}}) := \lbrace v\in {\mathbb {R}}^N\,:\,\exists \, x^n \overset{f}{\rightarrow }{\bar{x}},\;v^n\in {\widehat{\partial }}f(x^n),\;v^n \rightarrow v \rbrace \,, \end{aligned}$$

and \(\partial f({\bar{x}}) := \emptyset \) for \({\bar{x}} \not \in {\text {dom}}\,f\).

A point \({\bar{x}}\in {\text {dom}}\,f\), for which \(0\in \partial f({\bar{x}})\) holds, is called a critical point. As a direct consequence of the definition of the limiting subdifferential, we have the following closedness property at any \({\bar{x}}\in {\text {dom}}\,f\):

$$\begin{aligned} x^k\overset{f}{\rightarrow }{\bar{x}},\ v^k\rightarrow {\bar{v}},\ \text {and for all } k\in {\mathbb {N}}:v^k\in \partial f(x^k)\quad \Longrightarrow \quad {\bar{v}}\in \partial f({\bar{x}}) \,. \end{aligned}$$

Definition 2.3

(Moreau envelope and proximal mapping [7, Def. 1.22]) For \(f:{\mathbb {R}}^N \rightarrow \overline{{\mathbb {R}}}\) and \(\lambda >0\), we define the Moreau envelope

$$\begin{aligned} e_{\lambda }{f} (x) := \inf _{w\in {\mathbb {R}}^N}\, f(w) + \frac{1}{2\lambda }\vert w-x \vert _{}^2 \,, \end{aligned}$$

and the proximal mapping

$$\begin{aligned} P_{\lambda }f (x) := \arg \min _{w\in {\mathbb {R}}^N}\, f(w) + \frac{1}{2\lambda }\vert w-x \vert _{}^2 \,. \end{aligned}$$

For a general function f, it might happen that \(e_{\lambda }{f}(x)\) takes the value \(- \infty \) and the proximal mapping is empty, i.e., \(P_{\lambda }f (x) = \emptyset \). Therefore, the analysis of the Moreau envelope is usually coupled with the following property.

Definition 2.4

(Prox-boundedness [7, Def. 1.23]) We call \(f:{\mathbb {R}}^N \rightarrow \overline{{\mathbb {R}}}\) prox-bounded, if there exists \(\lambda >0\) such that \(e_{\lambda }{f} (x)> -\infty \) for some \(x\in {\mathbb {R}}^N\). The supremum of the set of all such \(\lambda \) is the threshold \(\lambda _f\) of prox-boundedness.
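
For a first impression of these two objects, the following sketch evaluates \(e_{\lambda }{f}\) and \(P_{\lambda }f\) numerically for \(f=\vert \cdot \vert \) in one dimension, where the proximal mapping is soft-thresholding and the Moreau envelope is the Huber function; the grid-based minimization and the chosen numbers are for illustration only.

```python
import numpy as np

def moreau_envelope(f, x, lam, grid):
    """e_lam f(x) = inf_w f(w) + |w - x|^2/(2*lam), minimized over a grid (illustration only)."""
    vals = f(grid) + (grid - x) ** 2 / (2.0 * lam)
    i = np.argmin(vals)
    return vals[i], grid[i]          # envelope value and (approximate) proximal point

f = np.abs                           # proper lsc convex, hence prox-bounded for every lam > 0
lam, x = 0.5, 2.0
grid = np.linspace(-5.0, 5.0, 200001)
e_val, prox_val = moreau_envelope(f, x, lam, grid)

# Closed forms for f = |.|: P_lam f is soft-thresholding and e_lam f is the Huber function.
prox_exact = np.sign(x) * max(abs(x) - lam, 0.0)
e_exact = x**2 / (2 * lam) if abs(x) <= lam else abs(x) - lam / 2
print(e_val, e_exact)        # approximately 1.75  1.75
print(prox_val, prox_exact)  # approximately 1.5   1.5
```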

In this paper, we focus on so-called prox-regular functions. These functions have many favorable properties locally, which otherwise only convex functions exhibit.

Definition 2.5

(Prox-regularity of functions, [7, Def. 13.27]) A function \(f:{\mathbb {R}}^N \rightarrow \overline{{\mathbb {R}}}\) is prox-regular at \({\bar{x}}\) for \({\bar{v}}\), if f is finite and locally lsc at \({\bar{x}}\) with \({\bar{v}} \in \partial f({\bar{x}})\), and there exists \(\varepsilon >0\) and \(\lambda >0\) such that

$$\begin{aligned} f(x^\prime ) \ge f(x) + \left\langle v,x^\prime - x \right\rangle - \frac{1}{2\lambda } \vert x^\prime - x \vert _{}^2 \quad \forall x^\prime \in B_{\varepsilon }({\bar{x}}) \\ \text {when}\ v\in \partial f(x),\ \vert v-{\bar{v}} \vert _{}< \varepsilon ,\ \vert x-{\bar{x}} \vert _{}<\varepsilon ,\ f(x) < f({\bar{x}}) + \varepsilon \,. \end{aligned}$$

When this holds for all \({\bar{v}}\in \partial f({\bar{x}})\), f is said to be prox-regular at \({\bar{x}}\).

The largest value \(\lambda >0\) for which this property holds is called the modulus of prox-regularity at \({\bar{x}}\).

Definition 2.6

(Prox-regularity of sets, [7, Ex. 13.31]) A set C is prox-regular at \({\bar{x}}\) for \({\bar{v}}\) when the indicator function \(\delta _{C}\) of the set C is prox-regular at \({\bar{x}}\) for \({\bar{v}}\). It is called prox-regular at \({\bar{x}}\), when this is true for all \({\bar{v}}\in \partial \delta _{C}({\bar{x}})\).

We provide several examples to show that most functions in practice are prox-regular.

Example 2.1

A function \(f:{\mathbb {R}}^N \rightarrow \overline{{\mathbb {R}}}\) is prox-regular if, for example, f is

  • proper lower semi-continuous (lsc) convex [7, Ex. 13.30],

  • locally representable in the form \(f=g-\rho \vert \cdot \vert _{}^2\) with g being finite (\(g<+\infty \)) convex and \(\rho >0\) [7, Thm. 10.33],

  • strongly amenable [7, Def. 10.23, Prop. 13.32] (e.g., \({\mathcal {C}}^2\)-functions, functions of the form \(g\circ F\) with F being \({\mathcal {C}}^2\) and g being proper lsc convex, the pointwise maximum of finitely many \({\mathcal {C}}^2\)-functions),

  • lower-\({\mathcal {C}}^2\) [7, Def. 10.29, Prop. 13.33] (functions of the form \(\max _{t\in T} f(x,t)\), where the zeroth, first, and second derivatives of \(f:{\mathbb {R}}^N\times T \rightarrow {\mathbb {R}}\) w.r.t. the first block of coordinates are continuous and T is a compact space),

  • a \({\mathcal {C}}^2\)-perturbation of a prox-regular function [7, Ex. 13.35],

  • an indicator function of a closed convex set or of a strongly amenable set [7, Def. 10.23],

  • the Moreau envelope of a prox-regular prox-bounded function [7, Prop. 13.37 and 13.34] (e.g., the distance function of a prox-regular set), or

  • the indicator function of a closed set C such that the distance function w.r.t. C is continuously differentiable on \(U\smallsetminus C\) for some open neighborhood U [46, Thm. 1.3].

  • For examples of prox-regular spectral functions, we refer to [47].

Example 2.2

(Imaging problems) Several problems (image denoising, deblurring/deconvolution, zooming, depth map fusion, etc.) may be modeled as an optimization problem of the following form

$$\begin{aligned} \min _{x\in {\mathbb {R}}^N}\, \frac{1}{2}\vert Ax - b \vert _{}^2 + \sum _{i=1}^{N} \varphi \left( \sqrt{(Dx)_i^2 + (Dx)_{i+N}^2}\right) \,, \end{aligned}$$

where \(A\in {\mathbb {R}}^{M\times N}\) (e.g., a blur operator), \(b\in {\mathbb {R}}^M\) (e.g., the blurry input image), \(D\in {\mathbb {R}}^{2N\times N}\) (e.g., finite differences), and \(\varphi :{\mathbb {R}}_+ \rightarrow {\mathbb {R}}_+\) is a continuous non-decreasing function. The objective is prox-regular, for example, in the following cases: \(\varphi \) is convex and non-decreasing; \(\varphi (t)=t\) (TV regularization); \(\varphi (t) = \log (1+t^2)\) (Student-t regularization); \(\varphi (t) = \log (1+\vert t \vert )\) (at 0, where it is not \({\mathcal {C}}^2\), the power series of \(\log \) shows that \(\log (1+\vert t \vert )-\vert t \vert \in {\mathcal {C}}^2\); hence, \(\varphi \) is a \({\mathcal {C}}^2\)-perturbation of a convex function).
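
As a sketch of this model, the following code assembles the objective for a small image with forward differences and \(\varphi (t)=\log (1+t^2)\); the replicated boundary handling and all concrete sizes are illustrative assumptions, not specifications from the references.

```python
import numpy as np

def imaging_objective(x, A, b, phi, n1, n2):
    """0.5*|A x - b|^2 + sum_i phi( sqrt(dx_i^2 + dy_i^2) ) with forward differences."""
    data_term = 0.5 * np.sum((A @ x - b) ** 2)
    u = x.reshape(n1, n2)
    dx = np.diff(u, axis=0, append=u[-1:, :])   # vertical forward differences (replicated boundary)
    dy = np.diff(u, axis=1, append=u[:, -1:])   # horizontal forward differences
    return data_term + np.sum(phi(np.sqrt(dx**2 + dy**2)))

n1, n2 = 8, 8
N = n1 * n2
rng = np.random.default_rng(0)
A = np.eye(N)                        # denoising: A is the identity
b = rng.standard_normal(N)           # noisy input image (vectorized)
phi = lambda t: np.log(1.0 + t**2)   # non-convex, prox-regular regularizer
print(imaging_objective(b.copy(), A, b, phi, n1, n2))   # objective value at the noisy input
```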

Example 2.3

(Support vector machine) The goal of finding a linear decision function may be formulated as the following optimization problem:

$$\begin{aligned} \min _{w\in {\mathbb {R}}^N,\, b\in {\mathbb {R}}}\, \sum _{i=1}^M {\mathscr {L}}(\left\langle w,z_i \right\rangle +b, y_i) + \varphi (w) \,, \end{aligned}$$

where, for \(i=1,\ldots ,M\), \((z_i,y_i)\in {\mathbb {R}}^N\times \lbrace \pm 1 \rbrace \) is the training set, \({\mathscr {L}}\) is a loss function, and \(\varphi \) is a regularizer. Examples are the hinge loss \({\mathscr {L}}({\bar{y}}_i,y_i) = \max (0, 1- {\bar{y}}_iy_i)\) (a maximum of \({\mathcal {C}}^2\)-functions), the squared hinge loss, the logistic loss \({\mathscr {L}}({\bar{y}}_i,y_i) = \log (1+e^{-{\bar{y}}_i y_i})\) (a \({\mathcal {C}}^2\) function), etc. Prox-regular regularization functions \(\varphi \) are, for example, the squared \(\ell _2\)-norm \(\vert x \vert _{}^2\) or, more generally, the p-th powers \(\Vert x \Vert _{p}^p=\sum _{i=1}^N \vert x_i \vert ^p\) with \(p>0\) (\(x_i\mapsto \vert x_i \vert ^p\) is \({\mathcal {C}}^2\) on \({\mathbb {R}}\smallsetminus \lbrace 0 \rbrace \) and prox-regular at \({\bar{x}}=0\)).
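
A minimal sketch of this formulation with the (prox-regular) hinge loss and the squared \(\ell _2\)-norm as regularizer; the data and the weight \(\mu \) are illustrative.

```python
import numpy as np

def svm_objective(w, b, Z, y, mu=0.1):
    """Sum of hinge losses (a maximum of C^2 functions) plus mu * |w|^2."""
    margins = y * (Z @ w + b)
    return np.sum(np.maximum(0.0, 1.0 - margins)) + mu * np.dot(w, w)

rng = np.random.default_rng(1)
Z = rng.standard_normal((50, 3))                              # 50 training points in R^3
y = np.sign(Z @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50))
print(svm_objective(np.zeros(3), 0.0, Z, y))                  # at w = 0, b = 0 every hinge term equals 1
```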

For the proof of the Lipschitz property of the Moreau envelope, it will be helpful to consider a so-called localization. A localization of \(\partial f\) around \(({\bar{x}},{\bar{v}})\) is a mapping \(T:{\mathbb {R}}^N \rightrightarrows {\mathbb {R}}^N\) whose graph is obtained by intersecting \({\text {Graph}}\partial f\) with some neighborhood of \(({\bar{x}}, {\bar{v}})\), i.e., \({\text {Graph}}T = {\text {Graph}}\partial f \cap U\) for a neighborhood U of \(({\bar{x}}, {\bar{v}})\). We talk about an f-attentive localization when \({\text {Graph}}T = \lbrace (x,v)\in {\text {Graph}}\partial f\,:\,(x,v) \in U\ \text {and}\ f(x) \in V \rbrace \) for a neighborhood U of \(({\bar{x}}, {\bar{v}})\) and a neighborhood V of \(f({\bar{x}})\).

Finally, the convergence result we build on is only valid for functions that have the KL property at a certain point. This property is shared, for example, by semi-algebraic functions, globally subanalytic functions, or, more generally, functions definable in an \(\mathrm {o}\)-minimal structure. For details, we refer to [12, 13].

Definition 2.7

(Kurdyka–Łojasiewicz property/KL property [8]) Let \(f:{\mathbb {R}}^N \rightarrow \overline{{\mathbb {R}}}\) be an extended real-valued function and let \({\bar{x}}\in {\text {dom}}\,\partial f\). If there exist \(\eta \in ]0,+\infty ]\), a neighborhood U of \({\bar{x}}\), and a continuous concave function \(\phi :[0,\eta [ \rightarrow {\mathbb {R}}_+\) such that

$$\begin{aligned} \phi (0)=0,\quad \phi \in C^{1}(]0,\eta [),\quad \text {and}\quad \phi ^\prime (s)>0\text { for all }s\in ]0,\eta [\,, \end{aligned}$$

and for all \(x\in U\cap [f({\bar{x}})< f(x) < f({\bar{x}}) + \eta ]\) the Kurdyka–Łojasiewicz inequality

$$\begin{aligned} \phi ^\prime (f(x)-f({\bar{x}})) \Vert \partial f(x) \Vert _{-} \ge 1 \end{aligned}$$
(1)

holds; then, the function has the Kurdyka–Łojasiewicz property at \({\bar{x}}\), where \(\Vert \partial f(x) \Vert _{-}:= \inf _{v\in \partial f(x)}\vert v \vert _{}\) is the non-smooth slope (note: \(\inf \emptyset := +\infty \)).

If, additionally, the function is lsc and the property holds for each point in \({\text {dom}}\,\partial f\), then f is called Kurdyka–Łojasiewicz function.

If f is closed and semi-algebraic, it is well known [9, 13] that f has the KL property at any point in \({\text {dom}}\,\partial f\), and the desingularization function \(\phi \) in Definition 2.7 has the form \(\phi (s) = \frac{c}{\theta }s^\theta \) for \(\theta \in ]0,1]\) and some constant \(c>0\). The parameter \(\theta \) is known as the KL exponent.
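
As a small numerical sanity check of inequality (1), consider the semi-algebraic function \(f(x)=x^2\), which has the KL property at \({\bar{x}}=0\) with exponent \(\theta =\frac{1}{2}\); with \(\phi (s)=c\sqrt{s}\) and \(c=1\), the left-hand side of (1) equals one identically.

```python
import numpy as np

# Check of the KL inequality (1) at xbar = 0 for f(x) = x^2 with phi(s) = c*sqrt(s), c = 1.
f = lambda x: x**2
grad = lambda x: 2.0 * x
c = 1.0
phi_prime = lambda s: 0.5 * c / np.sqrt(s)

xs = np.linspace(-1.0, 1.0, 1001)
xs = xs[xs != 0.0]                                # (1) is only required where f(x) > f(xbar)
lhs = phi_prime(f(xs) - f(0.0)) * np.abs(grad(xs))
print(lhs.min(), lhs.max())                       # both approximately 1, so (1) holds with theta = 1/2
```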

3 Abstract Convergence Result for KL Functions

In this section, we establish a local convergence result for abstract descent methods, i.e., the method is characterized by properties (H1), (H2), (H3) (see below) instead of a specific update rule. The local convergence result is inspired by a global convergence result proved in [3] for KL functions (see Theorem 3.1), which itself is motivated by a slightly different result in [8]. The abstract setting in [8] can be used to prove global and local convergence of gradient descent, proximal gradient descent, and other (single-step) methods. However, it does not apply directly to inertial variants of these methods. Therefore, in this section, we prove the required adaptation of the framework in [8] to the one in [3]. We obtain a local convergence theory that also applies to the Heavy-ball method and iPiano (see Sect. 4).

3.1 Global Convergence Results

The convergence result in [3] is based on the following three abstract conditions for a sequence \((z^k)_{k\in {\mathbb {N}}}:=(x^k,x^{{k-1}})_{k\in {\mathbb {N}}}\) in \({\mathbb {R}}^{2N}\), \(x^k\in {\mathbb {R}}^N\), \(x^{-1}\in {\mathbb {R}}^N\). Fix two positive constants \(a>0\) and \(b>0\) and consider a proper lower semi-continuous (lsc) function \({\mathcal {F}}:{\mathbb {R}}^{2N} \rightarrow \overline{{\mathbb {R}}}\). Then, the conditions for \((z^k)_{k\in {\mathbb {N}}}\) are as follows:

  1. (H1)

    For each \(k\in {\mathbb {N}}\), it holds that

    $$\begin{aligned} {\mathcal {F}}(z^{{k+1}}) + a \vert x^k- x^{k-1} \vert _{}^2 \le {\mathcal {F}}(z^k)\,. \end{aligned}$$
  2. (H2)

    For each \(k\in {\mathbb {N}}\), there exists \(w^{{k+1}}\in \partial {\mathcal {F}}(z^{{k+1}})\) such that

    $$\begin{aligned} \vert w^{{k+1}} \vert _{} \le \frac{b}{2} (\vert x^k- x^{k-1} \vert _{} + \vert x^{k+1}- x^k \vert _{})\,. \end{aligned}$$
  3. (H3)

    There exists a subsequence \((z^{k_j})_{j\in {\mathbb {N}}}\) such that

    $$\begin{aligned} z^{k_j}\rightarrow {\tilde{z}} \quad \text {and}\quad {\mathcal {F}}(z^{k_j}) \rightarrow {\mathcal {F}}({\tilde{z}})\,,\qquad \text {as } j\rightarrow \infty \,. \end{aligned}$$
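
To make these conditions concrete, the following sketch generates a Heavy-ball-type sequence for \(f(x)=\frac{1}{2}x^2\) and determines, along the iterates, the largest a compatible with (H1) and the smallest b compatible with (H2) for the smooth function \({\mathcal {F}}(x,y)=f(x)+\kappa \vert x-y \vert ^2\) (the construction used later in Corollary 4.1); all parameter values are illustrative choices, and \(\partial {\mathcal {F}}=\lbrace \nabla {\mathcal {F}} \rbrace \) because \({\mathcal {F}}\) is smooth here.

```python
import numpy as np

# Heavy-ball iterates for f(x) = 0.5*x^2 (alpha, beta, kappa are illustrative choices).
alpha, beta, kappa = 0.5, 0.3, 1.0
grad_f = lambda x: x
F = lambda x, y: 0.5 * x**2 + kappa * (x - y) ** 2              # Lyapunov-type function
grad_F = lambda x, y: np.array([x + 2 * kappa * (x - y), -2 * kappa * (x - y)])

xs = [2.0, 2.0]                                                 # x^{-1} = x^0 = 2
for _ in range(30):
    xs.append(xs[-1] - alpha * grad_f(xs[-1]) + beta * (xs[-1] - xs[-2]))

# Empirical constants: the largest a compatible with (H1) and the smallest b compatible
# with (H2) along the generated sequence (skipping the first step, where x^0 = x^{-1}).
a_vals, b_vals = [], []
for k in range(2, len(xs) - 1):
    F_k, F_k1 = F(xs[k], xs[k - 1]), F(xs[k + 1], xs[k])
    d_prev, d_next = abs(xs[k] - xs[k - 1]), abs(xs[k + 1] - xs[k])
    a_vals.append((F_k - F_k1) / d_prev**2)                     # (H1): F_k1 + a*d_prev^2 <= F_k
    b_vals.append(2 * np.linalg.norm(grad_F(xs[k + 1], xs[k])) / (d_prev + d_next))

print(min(a_vals) > 0, max(b_vals))   # (H1) holds with a = min(a_vals) > 0; (H2) with b = max(b_vals)
```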

Theorem 3.1

(Abstract global convergence, [3, Thm. 3.7]) Let the sequence \((z^k)_{k\in {\mathbb {N}}} = (x^k,x^{{k-1}})_{k\in {\mathbb {N}}}\) satisfy (H1), (H2), and (H3) for a proper lsc function \({\mathcal {F}}:{\mathbb {R}}^{2N} \rightarrow \overline{{\mathbb {R}}}\) which has the KL property at the cluster point \({\tilde{z}}\) specified in (H3).

Then, the sequence \((x^k)_{k\in {\mathbb {N}}}\) has finite length, i.e.,

$$\begin{aligned} \sum _{k=1}^{\infty } \vert x^k-x^{k-1} \vert _{} < + \infty \,, \end{aligned}$$
(2)

and \((z^k)_{k\in {\mathbb {N}}}\) converges to \({\bar{z}} = {\tilde{z}}\), where \({\bar{z}}=({\bar{x}},{\bar{x}})\) is a critical point of \({\mathcal {F}}\); in particular, \(x^k\rightarrow {\bar{x}}\).

Remark 3.1

In view of the proof of this statement, obviously, the same result can be established when (H1) is replaced by \( {\mathcal {F}}(z^{{k+1}}) + a \vert x^{k+1}- x^k \vert _{}^2 \le {\mathcal {F}}(z^k) \,. \)

3.2 Local Convergence Results

The upcoming local convergence result shows that, once a region of attraction (around a local minimizer) has been entered, all iterates of a sequence \((z^k)_{k\in {\mathbb {N}}}\) satisfying (H1), (H2), and the following growth condition (H4) stay in a neighborhood of this minimum and converge to a minimizer in the same neighborhood (not just a stationary point). For the convergence to a global minimizer, the growth condition (H4) is not required.

In the following, for \(z\in {\mathbb {R}}^{2N}\) we denote by \(z_1,z_2 \in {\mathbb {R}}^N\) the first and second block of coordinates, i.e., \(z=(z_1,z_2)\). The same holds for other vectors in \({\mathbb {R}}^{2N}\).

  1. (H4)

    Fix \(z^*\in {\mathbb {R}}^{2N}\). For any \(\delta >0\), there exist \(0<\rho <\delta \) and \(\nu >0\) such that

    $$\begin{aligned} z\in B_{\rho }(z^*)\,,\ {\mathcal {F}}(z)< {\mathcal {F}}(z^*) + \nu \,,\ y_2\not \in B_{\delta }(z_2^*) \ \Rightarrow \ {\mathcal {F}}(z) < {\mathcal {F}}(y) + \frac{a}{4} \vert z_2 - y_2 \vert _{}^2 \end{aligned}$$

where a is the same constant as in (H1).

A simple condition that implies (H4) is provided by the following lemma:

Lemma 3.1

Let \({\mathcal {F}}:{\mathbb {R}}^{2N} \rightarrow \overline{{\mathbb {R}}}\) be proper lsc and \(z^*=(x^*,x^*)\in {\text {dom}}\,{\mathcal {F}}\) a local minimizer of \({\mathcal {F}}\). Suppose, for any \(\delta >0\), \({\mathcal {F}}\) satisfies the growth condition

$$\begin{aligned} {\mathcal {F}}(y) \ge {\mathcal {F}}(z^*) - \frac{a}{16} \vert y_2 - z^*_2 \vert ^2 \quad \forall y\in {\mathbb {R}}^{2N}, y_2\not \in B_{\delta }(z_2^*) \,. \end{aligned}$$

Then, \({\mathcal {F}}\) satisfies (H4).

Proof

Let \(\delta >0\) be given, and let \(\rho \in ]0,\delta [\) and \(\nu >0\) be positive numbers (to be specified at the end of the proof). For \(y=(y_1,y_2)\in {\mathbb {R}}^{2N}\) with \(y_2\not \in B_{\delta }(z_2^*)\) and \(z=(z_1,z_2)\in B_{\rho }(z^*)\) such that \({\mathcal {F}}(z) < {\mathcal {F}}(z^*) + \nu \), we make the following estimation:

$$\begin{aligned} \begin{aligned} {\mathcal {F}}(y) \ge&\ {\mathcal {F}}(z^*) - \frac{a}{16} \vert y_2 - z_2^* \vert _{}^2 \\ >&\ {\mathcal {F}}(z) - \nu - \frac{a}{8} \vert y_2 - z_2^* \vert _{}^2 + \frac{a}{16} \vert y_2 - z_2^* \vert _{}^2 \\ \ge&\ {\mathcal {F}}(z) - \nu - \frac{a}{4} \vert y_2 - z_2 \vert _{}^2 - \frac{a}{4} \vert z_2 - z_2^* \vert _{}^2 + \frac{a}{16} \vert y_2 - z_2^* \vert _{}^2 \\ \ge&\ {\mathcal {F}}(z) - \frac{a}{4} \vert y_2 - z_2 \vert _{}^2 + \left( -\nu - \frac{a}{4} \rho ^2 + \frac{a}{16} \delta ^2\right) \,. \end{aligned} \end{aligned}$$

For sufficiently small \(\nu \) and \(\rho \), the term in parentheses becomes positive, which implies (H4).\(\square \)

We need another preparatory lemma, which is proved in [3].

Lemma 3.2

( [3, Lem. 3.5]) Let \({\mathcal {F}}:{\mathbb {R}}^{2N} \rightarrow \overline{{\mathbb {R}}}\) be a proper lsc function which satisfies the Kurdyka–Łojasiewicz property at some point \(z^*=(z_1^*,z_2^*)\in {\mathbb {R}}^{2N}\). Denote by U, \(\eta \) and \(\phi :[0,\eta [ \rightarrow {\mathbb {R}}_+\) the objects appearing in Definition 2.7 of the KL property at \(z^*\). Let \(\sigma , \rho >0\) be such that \(B_{\sigma }(z^*)\subset U\) with \(\rho \in ]0,\sigma [\).

Furthermore, let \((z^k)_{k\in {\mathbb {N}}}=(x^k,x^{{k-1}})_{k\in {\mathbb {N}}}\) satisfy (H1), (H2), and

$$\begin{aligned} \forall k\in {\mathbb {N}}:\quad z^k\in B_{\rho }(z^*) \Rightarrow z^{{k+1}}\in B_{\sigma }(z^*)\text { with } {\mathcal {F}}(z^{{k+1}}),{\mathcal {F}}(z^{k+2}) \ge {\mathcal {F}}(z^*) \,, \end{aligned}$$
(3)

and, moreover, let \(z^0=(x^0,x^{-1})\) be such that \({\mathcal {F}}(z^*) \le {\mathcal {F}}(z^0) < {\mathcal {F}}(z^*) + \eta \) and

$$\begin{aligned} \vert x^*-x^0 \vert _{} + \sqrt{\frac{{\mathcal {F}}(z^0) - {\mathcal {F}}(z^*)}{a}} + \frac{b}{a} \phi ({\mathcal {F}}(z^0) - {\mathcal {F}}(z^*)) < \frac{\rho }{2} \,. \end{aligned}$$
(4)

Then, the sequence \((z^k)_{k\in {\mathbb {N}}}\) satisfies

$$\begin{aligned} \forall k\in {\mathbb {N}}: z^k\in B_{\rho }(z^*),\quad \sum _{k=0}^\infty \vert x^k- x^{k-1} \vert _{} < \infty ,\quad {\mathcal {F}}(z^k) \rightarrow {\mathcal {F}}(z^*), \text { as } k\rightarrow \infty \,, \end{aligned}$$
(5)

and \((z^k)_{k\in {\mathbb {N}}}\) converges to a point \({\bar{z}}=({\bar{x}},{\bar{x}})\in B_{\sigma }(z^*)\) such that \({\mathcal {F}}({\bar{z}}) \le {\mathcal {F}}(z^*)\). If, additionally, (H3) is satisfied, then \(0\in \partial {\mathcal {F}}({\bar{z}})\) and \({\mathcal {F}}({\bar{z}}) = {\mathcal {F}}(z^*)\).

Under Assumption (H4), the following theorem establishes the local convergence result. Note that, thanks to Lemma 3.1, a global minimizer automatically satisfies (H4).

Theorem 3.2

(Abstract local convergence) Let \({\mathcal {F}}:{\mathbb {R}}^{2N} \rightarrow \overline{{\mathbb {R}}}\) be a proper lsc function which has the KL property at some local (or global) minimizer \(z^*= (x^*,x^*)\) of \({\mathcal {F}}\). Assume (H4) holds at \(z^*\).

Then, for any \(r>0\), there exist \(u\in ]0,r[\) and \(\mu >0\) such that the conditions

$$\begin{aligned} z^0 \in B_{u} (z^*)\,,\qquad {\mathcal {F}}(z^*)< {\mathcal {F}}(z^0) < {\mathcal {F}}(z^*) + \mu \,, \end{aligned}$$
(6)

imply that any sequence \((z^k)_{k\in {\mathbb {N}}}\) that starts at \(z^0\) and satisfies (H1) and (H2) has the finite length property (2), remains in \(B_{r}(z^*)\), and converges to some \({\bar{z}} \in B_{r}(z^*)\), a critical point of \({\mathcal {F}}\) with \({\mathcal {F}}({\bar{z}}) = {\mathcal {F}}(z^*)\). For r sufficiently small, \({\bar{z}}\) is a local minimizer of \({\mathcal {F}}\).

Proof

Let \(r>0\). Since \({\mathcal {F}}\) satisfies the KL property at \(z^*\), there exist \(\eta _0 \in ]0,+\infty ]\), \(\delta \in ]0,r/\sqrt{2}[\), and a continuous concave function \(\phi :[0,\eta _0[ \rightarrow {\mathbb {R}}\) such that \(\phi (0)=0\), \(\phi \) is continuously differentiable and strictly increasing on \(]0,\eta _0[\), and for all

$$\begin{aligned} z\in B_{\sqrt{2}\delta } (z^*) \cap [{\mathcal {F}}(z^*)< {\mathcal {F}}(z) < {\mathcal {F}}(z^*) + \eta _0] \end{aligned}$$

the KL inequality holds. As \(z^*\) is a local minimizer, by choosing a smaller \(\delta \) if necessary, one can assume that

$$\begin{aligned} {\mathcal {F}}(z) \ge {\mathcal {F}}(z^*)\quad \text { for all }\quad z\in B_{\sqrt{2}\delta }(z^*) \,. \end{aligned}$$
(7)

Let \(0<\rho < \delta \) and \(\nu >0\) be the parameters appearing in (H4) with \(\delta \) as in (7). We want to verify the implication in (3) with \(\sigma =\sqrt{2}\delta \). Let \(\eta := \min (\eta _0,\nu )\) and \(k\in {\mathbb {N}}\). Assume \(z^{0}, \ldots , z^{k} \in B_{\rho }(z^*)\), with \(z^k=: (z_1^k,z_2^k) = (x^k,x^{k-1})\in {\mathbb {R}}^{2N}\) and w.l.o.g. \({\mathcal {F}}(z^*)< {\mathcal {F}}(z^0), \ldots , {\mathcal {F}}(z^k) < {\mathcal {F}}(z^*) + \eta \) (note that if \({\mathcal {F}}(z^k) = {\mathcal {F}}(z^*)\), the sequence is stationary (\(x^k=x^{k+1}=x^{k+2}=\ldots \)) by (H1) and the result follows directly).

See Fig. 1 for the idea of the following steps. First, note that \(x^k\in B_{\delta }(z_2^*)\) as \(z^k\in B_{\delta }(z^*)\). Suppose \(z_2^{k+2}=x^{k+1}\not \in B_{\delta }(z_2^*)\). Then by (H4) and (H1), we observe (use \((u+v)^2 \le 2 (u^2 + v^2)\))

$$\begin{aligned} {\mathcal {F}}(z^k)< & {} {\mathcal {F}}(z^{k+2}) + \frac{a}{4} \vert x^{k-1}- x^{k+1} \vert _{}^2 \\\le & {} {\mathcal {F}}(z^k) - a\left( \vert x^{k+1}- x^k \vert _{}^2 + \vert x^k- x^{k-1} \vert _{}^2\right) + \frac{a}{4} \vert x^{k-1}- x^{k+1} \vert _{}^2 \le {\mathcal {F}}(z^k) \,, \end{aligned}$$

which is a contradiction and therefore \(z_2^{k+2}\in B_{\delta }(z_2^*)\).

Fig. 1: An essential step of the proof of Theorem 3.2 is to show: \(z^k\in B_{\rho }(z^*)=B_{\rho }(x^*,x^*)\) implies \(x^{k+2}, x^{k+1}\in B_{\delta }(z^*_2)=B_{\delta }(x^*)\), which restricts \(z^{k+1}\) and \(z^{k+2}\) to the rectangle in the plot and thus to \(B_{\sqrt{2}\delta }(z^*)\)

Hence, we have \(z^{k+1}= (x^{k+1},x^k) \in B_{\sqrt{2} \delta }(z^*)\), since both blocks lie in \(B_{\delta }(x^*)\). Thanks to (7), we have \({\mathcal {F}}(z^{k+1})\ge {\mathcal {F}}(z^*)\). In order to verify (3), we also need \({\mathcal {F}}(z^{k+2}) \ge {\mathcal {F}}(z^*)\), which can be shown analogously; however, we need to consider three iteration steps (which is the reason for the factor \(\frac{a}{4}\) instead of \(\frac{a}{2}\) on the right-hand side of (H4)). The assumption that \(z^{k+3}_2 = x^{k+2}\not \in B_{\delta }(z_2^*)\) yields the following contradiction:

$$\begin{aligned} {\mathcal {F}}(z^k)< & {} {\mathcal {F}}(z^{k+3}) + \frac{a}{4} \vert x^{k-1}- x^{k+2} \vert _{}^2 \\\le & {} {\mathcal {F}}(z^k) - a\left( \vert x^{k+2}- x^{k+1} \vert _{}^2+\vert x^{k+1}- x^k \vert _{}^2 + \vert x^k- x^{k-1} \vert _{}^2\right) \\&+ \frac{a}{4} \vert x^{k-1}- x^{k+2} \vert _{}^2 \\\le & {} {\mathcal {F}}(z^k) - a\left( \vert x^{k+2}- x^{k+1} \vert _{}^2+\vert x^{k+1}- x^k \vert _{}^2 + \vert x^k- x^{k-1} \vert _{}^2\right) \\&+ \frac{a}{4} \left( 2\vert x^{k+2}- x^{k+1} \vert _{}^2+4\vert x^{k+1}- x^k \vert _{}^2 + 4\vert x^k- x^{k-1} \vert _{}^2\right) \le {\mathcal {F}}(z^k) \,. \end{aligned}$$

Thus, \({\mathcal {F}}(z^{k+1}),{\mathcal {F}}(z^{k+2}) \ge {\mathcal {F}}(z^*)\) holds, which is property (3) with \(\sigma = \sqrt{2} \delta \).

Now, choose \(u,\mu >0\) in (6) such that \( \mu< \eta \,,\ u< \frac{\rho }{6}\,,\ \sqrt{\frac{\mu }{a}} + \frac{b}{a} \phi (\mu ) < \frac{\rho }{3} \). If \(z^0\) satisfies (6), we have

$$\begin{aligned} \vert x^*-x^0 \vert _{} + \sqrt{\frac{{\mathcal {F}}(z^0) - {\mathcal {F}}(z^*)}{a}} + \frac{b}{a} \phi ({\mathcal {F}}(z^0) - {\mathcal {F}}(z^*)) < \frac{\rho }{2} \,, \end{aligned}$$

which is (4) with \(\mu \) in place of \(\eta \). Using Lemma 3.2, we conclude that the sequence has finite length, remains in \(B_{\rho }(z^*)\), converges to \({\bar{z}} \in B_{\sigma }(z^*)\), and \({\mathcal {F}}(z^k) \rightarrow {\mathcal {F}}(z^*)\) and \({\mathcal {F}}({\bar{z}}) \le {\mathcal {F}}(z^*)\), which is only allowed for \({\mathcal {F}}({\bar{z}}) = {\mathcal {F}}(z^*)\). Therefore, the sequence also has property (H3), and thus, \({\bar{z}}\) is a critical point of \({\mathcal {F}}\). Property (7) shows that \({\bar{z}}\) is a local minimizer for sufficiently small r.\(\square \)

Remark 3.2

The assumption in (H4) and Lemma 3.1 only restrict the behavior of the function along the second block of coordinates of \(z=(z_1,z_2)\in {\mathbb {R}}^{2N}\). This makes sense because, for the sequences that we consider, the first and second blocks depend on each other.

Remark 3.3

Unlike Theorem 3.1, the local convergence theorem (Theorem 3.2) does not require (H3) explicitly. If Theorem 3.1 were to assume the KL property at some \(z^*\) other than the cluster point \({\tilde{z}}\) of (H3), convergence to a point \({\bar{z}}\) in a neighborhood of \(z^*\) with \({\mathcal {F}}({\bar{z}}) \le {\mathcal {F}}(z^*)\) could still be shown. However, \({\mathcal {F}}({\bar{z}}) < {\mathcal {F}}(z^*)\) might happen, in which case the convergence \(z^k\rightarrow {\bar{z}}\) is not \({\mathcal {F}}\)-attentive; thus, \({\bar{z}}\) could not be concluded to be a critical point. Assuming \({\tilde{z}}= z^*\) in (H3) ensures the \({\mathcal {F}}\)-attentive convergence, and thus, \({\bar{z}}\) is a critical point. Because of the local minimality of \(z^*\) in Theorem 3.2, \({\mathcal {F}}({\bar{z}}) < {\mathcal {F}}(z^*)\) cannot occur, and therefore, (H3) is implied.

Before deriving the convergence rates, we apply Theorem 3.2 and Lemma 3.1 to a useful example of a feasibility problem.

Example 3.1

(Semi-algebraic feasibility problem) Let \(S_1,\ldots ,S_M\subset {\mathbb {R}}^N\) be semi-algebraic sets such that \(\bigcap _{i=1}^M S_i \ne \emptyset \) and let \(F:{\mathbb {R}}^N \rightarrow \overline{{\mathbb {R}}}\) be given by \(F(x) = \frac{1}{2} \sum _{i=1}^M {\text {dist}}(x,S_i)^2\). For a constant \(c\ge 0\), we consider the function \({\mathcal {F}}(z) = {\mathcal {F}}(z_1,z_2) = F(z_1) + c \vert z_1-z_2 \vert _{}^2\). Suppose \(z^*= (x^*,x^*)\) is a global minimizer of \({\mathcal {F}}\), i.e., \(x^* \in \bigcap _{i=1}^M S_i\). Then, for \(z^0=(x^0,x^{-1})\) sufficiently close to \(z^*\), any algorithm that satisfies (H1) and (H2) and starts at \(z^0\) generates a sequence that remains in a neighborhood of \(z^*\), has the finite length property, and converges to a point \({\bar{z}} =({\bar{x}}, {\bar{x}})\) with \({\bar{x}} \in \bigcap _{i=1}^M S_i\).
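
A small numerical sketch of such a feasibility problem, anticipating the averaged projection interpretation of Sect. 5: for two semi-algebraic sets, the unit circle (prox-regular but non-convex) and a line, gradient descent on F with step \(\frac{1}{M}\) reduces to averaging the projections, using \(\nabla \frac{1}{2}{\text {dist}}(\cdot ,S_i)^2 = I - P_{S_i}\) where the projections are single-valued; the sets and the starting point are illustrative choices.

```python
import numpy as np

# Feasibility for S_1 = unit circle, S_2 = {x : x_2 = 0.5}; intersection at (+-sqrt(0.75), 0.5).
# Gradient descent on F(x) = 0.5*(dist(x,S_1)^2 + dist(x,S_2)^2) with step 1/2
# is the averaged projection iteration x <- 0.5*(P_1(x) + P_2(x)).
proj_circle = lambda x: x / np.linalg.norm(x)      # single-valued for x != 0
proj_line = lambda x: np.array([x[0], 0.5])

x = np.array([1.3, 0.9])                           # start close to the intersection
for _ in range(100):
    x = 0.5 * (proj_circle(x) + proj_line(x))

print(x, np.sqrt(0.75))                            # x is approximately (0.8660, 0.5)
```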

Finally, we complement our local convergence result by the convergence rate estimates from [34, 42]. Assuming the objective function is semi-algebraic, a list of qualitative convergence rate estimates in terms of the KL exponent is proved in [34, Thms. 2 and 4], which build on [27, Thm. 3.4]. For estimates of the KL exponent, the interested reader is referred to [42, 48,49,50], which include estimates of the KL exponent for convex polynomials, for functions that can be expressed as the maximum of finitely many polynomials, for functions that can be expressed as the supremum of a collection of polynomials over a semi-algebraic compact set under suitable regularity assumptions, and relations to the Luo–Tseng error bound.

Theorem 3.3

(Convergence rates) Let \((z^k)_{k\in {\mathbb {N}}} = (x^k,x^{{k-1}})_{k\in {\mathbb {N}}}\) satisfy (H1), (H2), and (H3) for a proper lsc function \({\mathcal {F}}:{\mathbb {R}}^{2N} \rightarrow \overline{{\mathbb {R}}}\) with KL exponent \(\theta \), which has the KL property at the critical point \({\tilde{z}}=z^*\) specified in (H3).

  1. 1.

    If \(\theta =1\), then \(z^k\) converges to \(z^*\) in a finite number of iterations.

  2. 2.

    If \(\frac{1}{2} \le \theta <1\), then \({\mathcal {F}}(z^k) \rightarrow {\mathcal {F}}(z^*)\) and \(x^k\rightarrow x^*\) linearly.

  3. 3.

    If \(0<\theta < \frac{1}{2}\), then \({\mathcal {F}}(z^k)-{\mathcal {F}}(z^*) \in O(k^{\frac{1}{2\theta -1}})\) and \(\vert x^k-x^* \vert _{} \in O(k^{\frac{\theta }{2\theta -1}})\).

Proof

Using Theorem 3.1, \((z^k)_{k\in {\mathbb {N}}}\) converges to \(z^*\) and \({\mathcal {F}}(z^k) \rightarrow {\mathcal {F}}(z^*)\) as \(k\rightarrow \infty \). W.l.o.g. we can assume that \({\mathcal {F}}(z^k) > {\mathcal {F}}(z^*)\) for all \(k\in {\mathbb {N}}\). By convergence of \((z^k)_{k\in {\mathbb {N}}}\) and (H1), there exists \(k_0\) such that the KL inequality (1) with \(f={\mathcal {F}}\) holds for all \(k\ge k_0\). Let U, \(\phi \), \(\eta \) be the objects appearing in Definition 2.7. Now, using \((u+v)^2 \le 2(u^2+v^2)\) for \(u,v\in {\mathbb {R}}\) to bound the terms on the right-hand side of (H2) and substituting (H1) into the resulting terms, the squared KL inequality (1) at index \(k\) yields

$$\begin{aligned} \frac{b^2}{2a} \big ( \phi ^\prime ({\mathcal {F}}(z^k) - {\mathcal {F}}(z^*)) \big )^2 \big ({\mathcal {F}}(z^{k-1}) - {\mathcal {F}}(z^{k+1})\big ) \ge 1 \,. \end{aligned}$$

As \(\phi ^\prime (s)=cs^{\theta -1}\) is non-increasing for \(\theta \in ]0,1]\), we have \(\phi ^\prime ({\mathcal {F}}(z^k) - {\mathcal {F}}(z^*))\le \phi ^\prime ({\mathcal {F}}(z^{k+1}) - {\mathcal {F}}(z^*))\). The remainder of the proof is identical to [34] starting from [34, Inequality (7)], which yields the rates for \(({\mathcal {F}}(z^k))_{k\in {\mathbb {N}}}.\)

In the following, we prove the rates for \((x^k)_{k\in {\mathbb {N}}}\). We make use of an intermediate result from the proof of [3, Lem. 3.5] (cf. Lemma 3.2). The starting point is [3, Inequality (6)], restricted to terms with index \(k\ge K\), \(K\in {\mathbb {N}}\):

$$\begin{aligned} \sum _{k\ge K} \vert x^k- x^{k-1} \vert _{} \le \frac{1}{2} \vert x^K-x^{K-1} \vert _{} + \frac{b}{a} \phi ({\mathcal {F}}(z^K)-{\mathcal {F}}(z^*)) \,. \end{aligned}$$

The triangle inequality shows that the left-hand side is an upper bound for \(\vert x^K-x^* \vert _{}\). Using (H1) to bound the right-hand side yields:

$$\begin{aligned} \vert x^K-x^* \vert _{} \le \sum _{k\ge K} \vert x^k- x^{k-1} \vert _{} \le c^{\prime \prime } \left( \phi ({\mathcal {F}}(z^K)-{\mathcal {F}}(z^*)) + \sqrt{{\mathcal {F}}(z^K)-{\mathcal {F}}(z^*)} \right) \end{aligned}$$

for some constant \(c^{\prime \prime }>0\). If \(\theta \in [\frac{1}{2},1[\), then for \({\mathcal {F}}(z^K)-{\mathcal {F}}(z^*)<1\) the first term is bounded by a constant multiple of the second one; since \({\mathcal {F}}(z^K)\rightarrow {\mathcal {F}}(z^*)\) linearly, \(x^K\rightarrow x^*\) converges linearly as well. For \(\theta \in ]0,\frac{1}{2}[\), the first term dominates; hence, \(\vert x^K-x^* \vert _{} \in O(\phi ({\mathcal {F}}(z^K)-{\mathcal {F}}(z^*)))\).\(\square \)
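
For concreteness, the exponents of Theorem 3.3, item 3, evaluate as follows for a few sample values of \(\theta \); for instance, \(\theta =\frac{1}{3}\) gives \(O(k^{-3})\) for the function values and \(O(k^{-1})\) for the iterates.

```python
# Exponents from Theorem 3.3, item 3: function values decay as k^(1/(2*theta-1)),
# iterates as k^(theta/(2*theta-1)), for KL exponents theta in ]0, 1/2[.
for theta in (0.1, 0.25, 1.0 / 3.0, 0.4):
    print(theta, 1.0 / (2.0 * theta - 1.0), theta / (2.0 * theta - 1.0))
# theta = 1/3 -> exponents -3 and -1 (up to rounding)
```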

4 Local and Global Convergence of iPiano

In this section, we briefly review the method iPiano and verify that the abstract convergence results from Sect. 3 hold for this algorithm.

iPiano applies to structured non-smooth and non-convex optimization problems with a proper lower semi-continuous (lsc) function \(h:{\mathbb {R}}^N \rightarrow \overline{{\mathbb {R}}}\), \(N\ge 1\):

$$\begin{aligned} \min _{x \in {\mathbb {R}}^N} \; h(x)\,,\qquad h(x)= f(x) + g(x) \end{aligned}$$
(8)

that satisfies the following assumption.

Assumption 1

For \(U\subset {\mathbb {R}}^N\), the following properties hold:

  • The function \(f:U \rightarrow {\mathbb {R}}\) is assumed to be \(C^1\)-smooth (possibly non-convex) with L-Lipschitz continuous gradient on \({\text {dom}}\,g\cap U\), \(L>0\).

  • The function \(g:U \rightarrow \overline{{\mathbb {R}}}\) is proper, lsc, possibly non-smooth and non-convex, simple and prox-bounded.

  • The function h restricted to U is bounded from below by \({\underline{h}} > -\infty \) and coercive, i.e., \(\vert x \vert _{}\rightarrow \infty \) with \(x\in U\) implies that \(h(x)\rightarrow \infty \).

Remark 4.1

As we will use Assumption 1, either with \(U={\mathbb {R}}^N\) or with \(U=B_{r^\prime }(x^*)\) for some \(r^\prime >0\), the coercivity assumption reduces either to the usual definition (\(U={\mathbb {R}}^N\)) or is empty (since \(B_{r^\prime }(x^*)\) is bounded). The coercivity property could be replaced by the assumption that the sequence that is generated by the algorithm is bounded.

Remark 4.2

Simple refers to the fact that the minimization problem defining the associated proximal mapping can be solved efficiently for a global optimum.

iPiano is outlined in Algorithm 1. For \(g=0\), iPiano coincides with the Heavy-ball method (inertial gradient descent or gradient descent with momentum).

In [4], functions g that are semi-convex received special attention. The resulting step size restrictions for semi-convex functions g are similar to those of convex functions. A function is said to be semi-convex with modulus \(m\in {\mathbb {R}}\), if m is the largest value such that \(g(x) - \frac{m}{2} \vert x \vert _{}^2\) is convex. For convex functions, \(m=0\) holds, and for strongly convex functions \(m>0\). We assume \(m< L\). According to [7, Thm. 10.33], saying a function g is (locally) semi-convex on an open set \(V\subset {\text {dom}}\,g\) is the same as saying g is lower-\({\mathcal {C}}^2\) on V. Nevertheless, the function g does not need to be semi-convex. This property is just used to improve the bounds on the step size parameters.

Remark 4.3

For simplicity, we describe the constant step size version of iPiano. However, all results in this paper are also valid for the backtracking line-search version of iPiano.

Algorithm 1 (iPiano)
Table 1 Convergence of iPiano as stated in Corollaries 4.1, 4.2, and 4.3 is guaranteed for the parameter settings listed in this table (for g convex, see [3, Algorithm 2]; otherwise, see [4, Algorithm 3])
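
A minimal constant step size sketch of the iPiano update \(x^{k+1}\in {\text {prox}}_{\alpha g}\big (x^k-\alpha \nabla f(x^k)+\beta (x^k-x^{k-1})\big )\) [3, 4], applied to an illustrative problem with \(f(x)=\frac{1}{2}\vert Ax-b \vert ^2\) and the convex, simple \(g=\mu \Vert \cdot \Vert _1\); the data, \(\beta \), and the step size rule \(\alpha <2(1-\beta )/L\) for convex g (cf. Table 1) are used here only as an example.

```python
import numpy as np

def ipiano(prox_g, grad_f, x0, alpha, beta, iters=500):
    """x^{k+1} = prox_{alpha g}( x^k - alpha*grad f(x^k) + beta*(x^k - x^{k-1}) ), constant steps."""
    x_prev, x = x0.copy(), x0.copy()                 # x^{-1} = x^0
    for _ in range(iters):
        y = x - alpha * grad_f(x) + beta * (x - x_prev)
        x_prev, x = x, prox_g(y)
    return x

def soft_threshold(y, t):                            # proximal mapping of t*||.||_1
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

rng = np.random.default_rng(0)
A, b, mu = rng.standard_normal((20, 10)), rng.standard_normal(20), 0.1
L = np.linalg.norm(A, 2) ** 2                        # Lipschitz constant of grad f
grad_f = lambda x: A.T @ (A @ x - b)

beta = 0.5
alpha = 0.99 * 2.0 * (1.0 - beta) / L
x = ipiano(lambda y: soft_threshold(y, alpha * mu), grad_f, np.zeros(10), alpha, beta)
print(0.5 * np.sum((A @ x - b) ** 2) + mu * np.sum(np.abs(x)))   # objective h(x) at the final iterate
```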

The following convergence results hold for iPiano.

Corollary 4.1

(Global convergence of iPiano [4, Thm. 6.6]) Let \((x^k)_{k\in {\mathbb {N}}}\) be generated by Algorithm 1 with \(U={\mathbb {R}}^N\). Then, the sequence \((z^k)_{k\in {\mathbb {N}}}\) with \(z^k=(x^k, x^{k-1})\) satisfies (H1), (H2), (H3) for the function (for some \(\kappa >0\))

$$\begin{aligned} H_\kappa :{\mathbb {R}}^{2N} \rightarrow {\mathbb {R}}\cup \{\infty \}\,,\quad (x,y)\mapsto h(x) + \kappa \vert x-y \vert _{}^2 \,. \end{aligned}$$
(10)

Moreover, if \(H_\kappa (x,y)\) has the Kurdyka–Łojasiewicz property at a cluster point \(z^*=(x^*,x^*)\), then the sequence \((x^k)_{k\in {\mathbb {N}}}\) has the finite length property, \(x^k\rightarrow x^*\) as \(k\rightarrow \infty \), and \(z^*\) is a critical point of \(H_\kappa \); hence, \(x^*\) is a critical point of h.

Corollary 4.2

(Local convergence of iPiano) Let \((x^k)_{k\in {\mathbb {N}}}\) be generated by Algorithm 1 with \(U=B_{r^\prime }(x^*)\) for some \(r^\prime >0\), where \(x^*\) is a local (or global) minimizer of h. Then, \(z^*=(x^*,x^*)\) is a local (or global) minimizer of \(H_\kappa \) (defined in (10)). Suppose (H4) holds at \(z^*\) and \(H_\kappa \) has the KL property at \(z^*\).

Then, for any \(r>0\) (in particular for \(r=r^\prime \)), there exist \(u\in ]0,r[\) and \(\mu >0\) such that the conditions

$$\begin{aligned} x^0 \in B_{u} (x^*)\,,\qquad h(x^*)< h(x^0) < h(x^*) + \mu \,, \end{aligned}$$

imply that the sequence \((x^k)_{k\in {\mathbb {N}}}\) has the finite length property, remains in \(B_{r}(x^*)\), and converges to some \({\bar{x}} \in B_{r}(x^*)\), a critical point of h that satisfies \(h({\bar{x}}) = h(x^*)\). For r sufficiently small, \({\bar{x}}\) is a local minimizer of h.

Proof

Corollary 4.1 shows that Algorithm 1 generates a sequence that satisfies (H1), (H2), (H3) with \(H_\kappa \). Therefore, obviously, Theorem 3.2 can be applied. \(\square \)

Corollary 4.3

(Convergence rates for iPiano) Let \((x^k)_{k\in {\mathbb {N}}}\) be generated by Algorithm 1 and set \(z^k:=(x^k,x^{k-1})\). If \(H_\kappa \), defined in (10), has the KL property at \(z^*=(x^*,x^*)\) specified in (H3) with KL exponent \(\theta \), then the following rates of convergence hold:

  1. 1.

    If \(\theta =1\), then \(x^k\) converges to \(x^*\) in a finite number of iterations.

  2. 2.

    If \(\frac{1}{2} \le \theta <1\), then \(h(x^k) \rightarrow h(x^*)\) and \(x^k\rightarrow x^*\) linearly.

  3. 3.

    If \(0<\theta < \frac{1}{2}\), then \(h(x^k)-h(x^*) \in O(k^{\frac{1}{2\theta -1}})\) and \(\vert x^k-x^* \vert _{} \in O(k^{\frac{\theta }{2\theta -1}})\).

Proof

Corollary 4.1 shows that Algorithm 1 generates a sequence that satisfies (H1), (H2), (H3) for \(H_\kappa \). Therefore, the statement follows from Theorem 3.3 and the facts that \(H_\kappa (x^*,x^*)=h(x^*)\) and \(h(x^k) \le H_\kappa (x^k,x^{k-1})\).\(\square \)

Remark 4.4

In [42, Thm. 3.6], Li and Pong show that if h has the KL exponent \(\theta \in ]0,\frac{1}{2}]\) at \(x^*\), then \(H_\kappa \) has the same KL exponent at \(z^*=(x^*,x^*)\).

5 Inertial Averaged/Alternating Minimization

In this section, we transfer the convergence result developed for iPiano in Sect. 4 to various non-convex settings (Sects. 5.1, 5.2, 5.3). This yields inertial algorithms for non-convex problems that are known from the convex setting as averaged or alternating proximal minimization (or projection) methods. Key for the generalization to the non-convex and inertial setting are an explicit formula for the gradient of the Moreau envelope of a prox-regular function (Proposition 5.2), which is well known for convex functions (Proposition 5.1), and the local convergence results in Theorem 3.2. For completeness, we state the formula in the convex setting before we devote ourselves to the prox-regular setting.

Proposition 5.1

([51, Prop. 12.29]) Let \(f:{\mathbb {R}}^N \rightarrow \overline{{\mathbb {R}}}\) be a proper lower semi-continuous (lsc) convex function and \(\lambda >0\). Then, \(e_{\lambda }{f}\) is continuously differentiable and has the \(\lambda ^{-1}\)-Lipschitz continuous gradient

$$\begin{aligned} \nabla e_{\lambda }{f}(x) =\frac{1}{\lambda } (x - P_{\lambda }f(x)) \,. \end{aligned}$$
(11)
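
For \(f=\vert \cdot \vert \), formula (11) can be checked directly: \(P_{\lambda }f\) is soft-thresholding and \(e_{\lambda }{f}\) is the Huber function, whose derivative coincides with \(\lambda ^{-1}(x-P_{\lambda }f(x))\). The following sketch compares the formula with a finite-difference gradient; the remaining discrepancy is the finite-difference error near \(\vert x \vert =\lambda \).

```python
import numpy as np

lam = 0.5
prox = lambda x: np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)                     # P_lam f for f = |.|
env = lambda x: np.where(np.abs(x) <= lam, x**2 / (2 * lam), np.abs(x) - lam / 2)  # e_lam f (Huber)

x = np.linspace(-3.0, 3.0, 601)
h = 1e-6
grad_fd = (env(x + h) - env(x - h)) / (2 * h)       # finite-difference gradient of e_lam f
grad_formula = (x - prox(x)) / lam                  # right-hand side of formula (11)
print(np.max(np.abs(grad_fd - grad_formula)))       # small: (11) matches the numerical gradient
```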

Proposition 5.2

Suppose that \(f:{\mathbb {R}}^N \rightarrow \overline{{\mathbb {R}}}\) is prox-regular at \({\bar{x}}\) for \({\bar{v}} =0\) and that f is prox-bounded. Then for all \(\lambda >0\) sufficiently small, there is a neighborhood of \({\bar{x}}\) on which

  1. 1.

    \(P_{\lambda }f\) is monotone, single-valued and Lipschitz continuous and \(P_{\lambda }f ({\bar{x}}) = {\bar{x}}\).

  2. 2.

    \(e_{\lambda }{f}\) is differentiable with \(\nabla (e_{\lambda }{f} )({\bar{x}}) = 0\), in fact \(\nabla (e_{\lambda }{f} )\) is strictly continuous with

    $$\begin{aligned} \nabla e_{\lambda }{f} = \lambda ^{-1} (I - P_{\lambda }f) = (\lambda I + T^{-1})^{-1} \end{aligned}$$
    (12)

    for an f-attentive localization \(T\) of \(\partial f\) at \(({\bar{x}},0)\), where I denotes the identity mapping. This localization can be chosen so that the set \(U_\lambda := \mathrm {rge}\,(I+\lambda T)\) serves for all \(\lambda >0\) sufficiently small as a neighborhood of \({\bar{x}}\) on which these properties hold.

  3. 3.

    There is a neighborhood of \({\bar{x}}\) on which for small enough \(\lambda \) the local Lipschitz constant of \(\nabla e_{\lambda }{f}\) is \(\lambda ^{-1}\). If \(\lambda _0\) is the modulus of prox-regularity at \({\bar{x}}\), then \(\lambda \in ]0, \lambda _0/2[\) is a sufficient condition.

  4. 4.

    Any point \({\tilde{x}}\in U_\lambda \) with \(\nabla e_{\lambda }{f} ({\tilde{x}}) = 0\) is a fixed point of \(P_{\lambda }f\) and a critical point of f.

Proof

While Items 1 and 2 are proved in [7, Prop. 13.37], Items 3 (estimation of the local Lipschitz constant) and 4 are not explicitly verified. To prove Items 3 and 4, we develop the basic objects that are required in the same way as [7, Prop. 13.37]. Thus, the first part of the proof coincides with [7, Prop. 13.37].

Without loss of generality, we can take \({\bar{x}} =0\). As f is prox-bounded, the condition for prox-regularity may be taken to be global, cf. [7, Prop. 8.46(f)], i.e., there exists \(\varepsilon >0\) and \(\lambda _0>0\) such that

$$\begin{aligned}&f(x^\prime ) > f(x) + \left\langle v,x^\prime - x \right\rangle - \frac{1}{2\lambda _0} \vert x^\prime - x \vert _{}^2 \quad \forall x^\prime \ne x \end{aligned}$$
(13)
$$\begin{aligned}&\text {when}\ v\in \partial f(x),\ \vert v \vert _{}< \varepsilon ,\ \vert x \vert _{}<\varepsilon ,\ f(x) < f(0) + \varepsilon \,. \end{aligned}$$
(14)

Let \(T:{\mathbb {R}}^N \rightrightarrows {\mathbb {R}}^N\) be the f-attentive localization of \(\partial f\) specified in (14), i.e., defined by \({\text {Graph}}T= \lbrace (x,v)\,:\,v\in \partial f(x),\,\vert v \vert _{}< \varepsilon ,\,\vert x \vert _{}<\varepsilon ,\, f(x) < f(0) + \varepsilon \rbrace \). Inequality (13) is valid for any \(\lambda \in ]0,\lambda _0[\). Setting \(u=x+\lambda v\), inequality (13) (with \(\lambda \) instead of \(\lambda _0\)) implies \( f(x^\prime ) + \frac{1}{2\lambda } \vert x^\prime - u \vert _{}^2 > f(x) + \frac{1}{2\lambda }\vert x-u \vert _{}^2 \). Therefore, \(P_{\lambda }f(x+\lambda v)=\lbrace x \rbrace \) when \(v\in T(x)\). In general, for any u sufficiently close to 0, thanks to Fermat’s rule on the minimization problem of \(P_{\lambda }f(u)\), we have for any \(x\in P_{\lambda }f(u)\) that \(v=(u-x)/\lambda \in T(x)\) holds. Thus, \(U_\lambda =\mathrm {rge}\,(I+\lambda T)\) is a neighborhood of 0 on which \(P_{\lambda }f\) is single-valued and coincides with \((I+\lambda T)^{-1}\).

3. Let \(u=x+\lambda v\) and \(u^\prime =x^\prime + \lambda v^\prime \) be any two elements in \(U_\lambda \) such that \(x=P_{\lambda }f(u)\) and \(x^\prime = P_{\lambda }f(u^\prime )\). Then, \((x,v)\) and \((x^\prime ,v^\prime )\) belong to \({\text {Graph}}T\). Thus, we can add two copies of (13), where in the second copy the roles of \((x,v)\) and \((x^\prime ,v^\prime )\) are swapped. For any \(\lambda _1\in ]0,\lambda _0[\) in place of \(\lambda _0\) in (13), this sum yields:

$$\begin{aligned} 0 \ge \left\langle v-v^\prime ,x^\prime - x \right\rangle - \frac{1}{\lambda _1} \vert x^\prime - x \vert _{}^2 \,. \end{aligned}$$
(15)

We substitute v with \((u-x)/\lambda \) and \(v^\prime \) with \((u^\prime -x^\prime )/\lambda \) which yields

$$\begin{aligned} 0\le & {} \frac{1}{\lambda _1} \vert x^\prime - x \vert _{}^2 + \frac{1}{\lambda }\left\langle (u^\prime -x^\prime )-(u-x),x^\prime - x \right\rangle \\= & {} \frac{1}{\lambda }\left\langle u^\prime -u,x^\prime - x \right\rangle +\left( \frac{1}{\lambda _1} - \frac{1}{\lambda }\right) \vert x^\prime - x \vert _{}^2 \end{aligned}$$

or, equivalently, \(\left\langle u^\prime -u,x^\prime - x \right\rangle \ge (1-\tfrac{\lambda }{\lambda _1}) \vert x^\prime - x \vert _{}^2\).

This expression helps to estimate the local Lipschitz constant of the gradient of the Moreau envelope. Using the closed-form description of \(\nabla e_{\lambda }{f}\) on \(U_\lambda \), we verify the \(\lambda ^{-1}\)-Lipschitz continuity of \(\nabla e_{\lambda }{f}\) as follows, when \(\lambda \le \frac{1}{2} \lambda _1\):

$$\begin{aligned} \begin{aligned} \lambda ^2\vert \nabla e_{\lambda }{f}(u) - \nabla e_{\lambda }{f}(u^\prime ) \vert _{}^2 =&\ \vert (u- u^\prime ) - (P_{\lambda }f (u) - P_{\lambda }f(u^\prime )) \vert _{}^2 \\ =&\ \vert x - x^\prime \vert _{}^2 - 2 \left\langle u- u^\prime ,x-x^\prime \right\rangle +\vert u-u^\prime \vert _{}^2 \\ \le&\ (2\tfrac{\lambda }{\lambda _1}-1) \vert x-x^\prime \vert _{}^2 + \vert u-u^\prime \vert _{}^2 \le \vert u-u^\prime \vert _{}^2 \end{aligned} \end{aligned}$$

4. Now, let \({\tilde{x}}\in U_\lambda \) be a point for which \(\nabla e_{\lambda }{f} ({\tilde{x}}) = 0\) holds. Then, according to (12), we have \({\tilde{x}} = P_{\lambda }f({\tilde{x}})\) or \({\tilde{x}} = (I + \lambda T)^{-1}({\tilde{x}})\) for the localization selected above. Inverting the mapping shows that \({\tilde{x}} \in {\tilde{x}} + \lambda T({\tilde{x}})\), which implies that \(0\in T({\tilde{x}})\), thus \(0\in \partial f({\tilde{x}})\).\(\square \)
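As a quick numerical illustration of the statements just proved (this sketch is our addition and not part of the original argument), one can take the indicator function of the unit sphere in \({\mathbb {R}}^2\), a non-convex but prox-regular function, for which \(P_{\lambda }f\) is the projection onto the sphere and \(e_{\lambda }{f}(u) = (\vert u \vert -1)^2/(2\lambda )\) near the sphere. The following Python sketch compares the closed-form gradient \(\lambda ^{-1}(u-P_{\lambda }f(u))\) with finite differences and checks the \(\lambda ^{-1}\)-Lipschitz bound empirically; the test points, tolerances, and neighborhood size are our assumptions.

```python
import numpy as np

# Indicator of the unit sphere S = {x in R^2 : |x| = 1}: prox-regular but non-convex.
# Near S, the proximal mapping is the projection onto S and
#   e_lam(u) = (|u| - 1)^2 / (2*lam),   grad e_lam(u) = (u - u/|u|) / lam.
lam = 0.3
proj_S = lambda u: u / np.linalg.norm(u)      # P_lam f(u) for u != 0
env    = lambda u: (np.linalg.norm(u) - 1.0)**2 / (2 * lam)
grad   = lambda u: (u - proj_S(u)) / lam      # closed-form gradient, cf. (12)

rng = np.random.default_rng(0)
for _ in range(5):
    u = proj_S(rng.normal(size=2)) + 0.1 * rng.normal(size=2)   # point near S
    # central finite-difference check of the gradient formula
    h = 1e-6
    fd = np.array([(env(u + h * np.eye(2)[i]) - env(u - h * np.eye(2)[i])) / (2 * h)
                   for i in range(2)])
    assert np.allclose(fd, grad(u), atol=1e-4)

# empirical check of the 1/lam Lipschitz bound on random pairs near S
ratios = []
for _ in range(1000):
    u = proj_S(rng.normal(size=2)) + 0.1 * rng.normal(size=2)
    v = proj_S(rng.normal(size=2)) + 0.1 * rng.normal(size=2)
    ratios.append(np.linalg.norm(grad(u) - grad(v)) / np.linalg.norm(u - v))
print(max(ratios), "<=", 1 / lam)
```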

Remark 5.1

The proof of Item (iii) of Proposition 5.2 is motivated by a similar derivation for distance functions and projection operators in [52]. See [53] for a recent analysis of the differential properties of the Moreau envelope in the infinite-dimensional setting.

5.1 Heavy-Ball Method on the Moreau Envelope

Proposition 5.3

(Inertial proximal minimization method) Let the function \(f:{\mathbb {R}}^N \rightarrow \overline{{\mathbb {R}}}\) be prox-regular at \(x^*\) for \(v^*=0\) with modulus \(\lambda _0>0\) and prox-bounded with threshold \(\lambda _f>0\). Let \(0<\lambda < \min (\lambda _f, \lambda _0/2)\), \(\beta \in [0,1[\), and \(\alpha \in ]0,2(1-\beta )\lambda [\). Suppose that \(h = e_{\lambda }{f}\) has a local minimizer \(x^*\) and \(H_\kappa \), defined in (10), satisfies (H4) and the KL property at \((x^*,x^*)\). Let \(x^0=x^{-1}\) with \(x^0\in {\mathbb {R}}^N\) and \((x^k)_{k\in {\mathbb {N}}}\) be generated by

$$\begin{aligned} x^{k+1}\in (1-\alpha \lambda ^{-1}) x^k+ \alpha \lambda ^{-1} P_{\lambda }f (x^k) + \beta (x^k- x^{k-1}) \,. \end{aligned}$$

If \(x^0\) is sufficiently close to \(x^*\), then the sequence \((x^k)_{k\in {\mathbb {N}}}\)

  • is uniquely determined,

  • has the finite length property,

  • remains in a neighborhood of \(x^*\),

  • and converges to a critical point \({\tilde{x}}\) of f with \(f({\tilde{x}}) = f(x^*)\).

If f is convex, and \(\lambda >0\), \(\beta \in [0,1[\), and \(\alpha \in ]0,2(1-\beta )\lambda [\), then \((x^k)_{k\in {\mathbb {N}}}\) has finite length and converges to a global minimizer \({\tilde{x}}\) of f for any \(x^0\in {\mathbb {R}}^N\).

Proof

The statement is an application of the results for the Heavy-ball method (i.e., (8) with \(g\equiv 0\)) to the Moreau envelope \(e_{\lambda }{f}\) of the function f. Note that \(H_\kappa \) inherits the KL property from h (see Remark 4.4).

Since f is prox-bounded with threshold \(\lambda _f\), for \(\lambda < \lambda _f\) the function \(x^\prime \mapsto f(x^\prime ) + \frac{1}{2\lambda } \vert x^\prime - x \vert _{}^2\) is bounded from below and coercive, hence \(P_{\lambda }f\) and \(e_{\lambda }{f}\) are well defined. As \(\lambda < \lambda _0/2\), Proposition 5.2 can be used to conclude that there exists a neighborhood \(U_\lambda \) of \(x^*\) such that \(e_{\lambda }{f}\) is differentiable on \(U_\lambda \) and \(\nabla e_{\lambda }{f}\) is \(\lambda ^{-1}\)-Lipschitz continuous.

There exists a neighborhood \(U\subset U_\lambda \) of \(x^*\) that contains \(x^0\) and on which Corollary 4.2 can be applied. Therefore, the Heavy-ball method (Algorithm 1 with \(g\equiv 0\)) with \(0<\alpha < 2(1-\beta )\lambda \) and \(\beta \in [0,1[\) generates a sequence \((x^k)_{k\in {\mathbb {N}}}\) that lies in U. Using the formula in (12), the update step of the Heavy-ball method applied to \(e_{\lambda }{f}\) reads as follows:

$$\begin{aligned} \begin{aligned} x^{k+1}=&\ x^k- \alpha \nabla e_{\lambda }{f} (x^k) + \beta (x^k- x^{k-1}) \\ =&\ x^k- \alpha \lambda ^{-1}(x^k- P_{\lambda }f (x^k)) + \beta (x^k- x^{k-1}) \\ =&\ (1- \alpha \lambda ^{-1}) x^k+ \alpha \lambda ^{-1} P_{\lambda }f (x^k) + \beta (x^k- x^{k-1})\,. \end{aligned} \end{aligned}$$

By Proposition 5.2, \(P_{\lambda }f\) is single-valued on \(U_\lambda \), and, since the limit point \({\tilde{x}}\) satisfies \(\nabla e_{\lambda }{f}({\tilde{x}})=0\), Proposition 5.2 also yields \(0\in \partial f({\tilde{x}})\). The remaining statements follow from Corollary 4.2.

The statement about convex functions follows analogously by using Proposition 5.1 instead of Proposition 5.2 and Corollary 4.1 instead of Corollary 4.2. \(\square \)

Remark 5.2

Corollary 4.3 provides convergence rates for Proposition 5.3.

Remark 5.3

The question whether \(h = e_{\lambda }{f}\) has the KL property whenever f has the KL property has been analyzed for convex functions in [42]. For non-convex functions, this is a non-trivial open problem.
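In implementable form, the update of Proposition 5.3 only requires a selection of the proximal mapping \(P_{\lambda }f\). The following minimal Python sketch illustrates the iteration; the function `prox_f`, the stopping rule, the toy example (the sphere indicator reused from the sketch above), and the concrete parameter values are our assumptions, chosen to respect \(\alpha \in ]0,2(1-\beta )\lambda [\).

```python
import numpy as np

def inertial_prox_min(prox_f, x0, lam, alpha, beta, max_iter=500, tol=1e-10):
    """Inertial proximal minimization (Proposition 5.3):
       x^{k+1} = (1 - alpha/lam) x^k + (alpha/lam) P_lam f(x^k) + beta (x^k - x^{k-1}).
       prox_f(x) should return one element of P_lam f(x)."""
    x_prev, x = x0.copy(), x0.copy()          # x^{-1} = x^0
    for _ in range(max_iter):
        x_new = (1 - alpha / lam) * x + (alpha / lam) * prox_f(x) + beta * (x - x_prev)
        if np.linalg.norm(x_new - x) <= tol:
            return x_new
        x_prev, x = x, x_new
    return x

# Illustration (assumed): f = delta_S for the unit sphere in R^2, so that
# P_lam f is the projection onto the sphere; lam = 1.
proj_S = lambda u: u / np.linalg.norm(u)
lam, beta = 1.0, 0.5
alpha = 0.9 * 2 * (1 - beta) * lam            # alpha in ]0, 2(1-beta) lam[
x_tilde = inertial_prox_min(proj_S, np.array([0.3, 0.2]), lam, alpha, beta)
print(x_tilde, np.linalg.norm(x_tilde))       # approximately a point on the sphere
```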

5.2 Heavy-Ball Method on the Sum of Moreau Envelopes

Proposition 5.4

(Inertial averaged proximal minimization method) Suppose \(f_i:{\mathbb {R}}^N \rightarrow \overline{{\mathbb {R}}}\), \(i=1,\ldots ,M\) are prox-regular functions at \(x^*\) for \(v^*=0\) with modulus \(\lambda _0>0\) and prox-bounded with threshold \(\lambda _f>0\). Let \(0<\lambda < \min (\lambda _f, \lambda _0/2)\), \(\beta \in [0,1[\), and \(\alpha \in ]0,2(1-\beta )\lambda [\). Suppose that \(h = \sum _{i=1}^M e_{\lambda }{f}_i\) has a local minimizer \(x^*\) and \(H_\kappa \), defined in (10), satisfies (H4) and the KL property at \((x^*,x^*)\). Let \(x^0=x^{-1}\) with \(x^0\in {\mathbb {R}}^N\) and \((x^k)_{k\in {\mathbb {N}}}\) be generated by

$$\begin{aligned} x^{k+1}\in (1-\alpha \lambda ^{-1}) x^k+ \frac{\alpha }{M} \lambda ^{-1} \sum _{i=1}^M P_{\lambda }f_i (x^k) + \beta (x^k- x^{k-1}) \,. \end{aligned}$$

If \(x^0\) is sufficiently close to \(x^*\), then the sequence \((x^k)_{k\in {\mathbb {N}}}\)

  • is uniquely determined,

  • has the finite length property,

  • remains in a neighborhood of \(x^*\),

  • and converges to a critical point \({\tilde{x}}\) of h with \(h({\tilde{x}}) = h(x^*)\).

If all \(f_i\) are convex, and \(\lambda >0\), \(\beta \in [0,1[\), and \(\alpha \in ]0,2(1-\beta )\lambda [\), then \((x^k)_{k\in {\mathbb {N}}}\) has finite length and converges to a global minimizer \({\tilde{x}}\) of h for any \(x^0\in {\mathbb {R}}^N\).

Proof

The proof is analogous to that of Proposition 5.3, except that the Heavy-ball method is applied to \(\sum _{i=1}^M e_{\lambda }{f_i}\):

$$\begin{aligned} \begin{aligned} x^{k+1}=&\ x^k- \frac{\alpha }{M} \sum _{i=1}^M \nabla e_{\lambda }{f_i} (x^k) + \beta (x^k- x^{k-1}) \\ =&\ x^k- \frac{\alpha }{M} \lambda ^{-1}\sum _{i=1}^M(x^k- P_{\lambda }f_i (x^k)) + \beta (x^k- x^{k-1}) \\ =&\ (1- \alpha \lambda ^{-1}) x^k+ \frac{\alpha }{M} \lambda ^{-1} \sum _{i=1}^M P_{\lambda }f_i (x^k) + \beta (x^k- x^{k-1})\,. \end{aligned} \end{aligned}$$

Instead of rescaling the feasible range of the step size \(\alpha \), the factor \(\frac{1}{M}\) is included in the update formula. \(\square \)

Remark 5.4

Corollary 4.3 provides convergence rates for Proposition 5.4.

Remark 5.5

In contrast to Proposition 5.3, the sequence of iterates converges to a point \({\tilde{x}}\) for which \(\sum _{i=1}^M \nabla e_{\lambda }{f_i}({\tilde{x}}) =0\) holds. We cannot directly conclude that \(0\in \partial (\sum _i f_i)({\tilde{x}})\). However, if \(\nabla e_{\lambda }{f_i}({\tilde{x}})=0\) for all \(i=1,\ldots ,M\), then under suitable qualification and regularity conditions (see [7, Cor. 10.9]), we can conclude that \({\tilde{x}}\) is a critical point of \(\sum _{i=1}^M f_i\).

Example 5.1

(Inertial averaged projection method for the semi-algebraic feasibility problem) The algorithm described in Proposition 5.4 can be used to solve the semi-algebraic feasibility problem of Example 3.1. The conditions in Example 3.1 are satisfied.
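For the feasibility setting of Example 5.1, i.e., \(f_i = \delta _{S_i}\) and hence \(P_{\lambda }f_i = \mathrm {proj}_{S_i}\), the update of Proposition 5.4 becomes an inertial averaged projection step. The following Python sketch is our illustration; the two toy sets, the tolerance, and the parameter values are assumptions.

```python
import numpy as np

def inertial_averaged_projection(projs, x0, alpha, beta, lam=1.0, max_iter=1000, tol=1e-10):
    """Proposition 5.4 with f_i = delta_{S_i}, so that P_lam f_i = proj_{S_i}:
       x^{k+1} = (1 - alpha/lam) x^k + (alpha/(M lam)) sum_i proj_{S_i}(x^k) + beta (x^k - x^{k-1})."""
    M = len(projs)
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(max_iter):
        avg = sum(p(x) for p in projs) / M
        x_new = (1 - alpha / lam) * x + (alpha / lam) * avg + beta * (x - x_prev)
        if np.linalg.norm(x_new - x) <= tol:
            break
        x_prev, x = x, x_new
    return x

# Toy instance (assumed): unit sphere and the line {x_2 = 0.5} in R^2, which intersect.
proj_sphere = lambda u: u / np.linalg.norm(u)
proj_line   = lambda u: np.array([u[0], 0.5])
beta = 0.5
alpha = 0.9 * 2 * (1 - beta)                  # alpha in ]0, 2(1-beta) lam[ with lam = 1
x = inertial_averaged_projection([proj_sphere, proj_line], np.array([0.6, 0.8]), alpha, beta)
print(x)   # approximately a point in the intersection of the two sets
```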

5.3 iPiano on an Objective Involving a Moreau Envelope

Proposition 5.5

(Inertial alternating proximal minimization method) Suppose \(f:{\mathbb {R}}^N \rightarrow \overline{{\mathbb {R}}}\) is prox-regular at \(x^*\) for \(v^*=0\) with modulus \(\lambda _0>0\) and prox-bounded with threshold \(\lambda _f>0\). Let \(0<\lambda < \min (\lambda _f, \lambda _0/2)\). Moreover, suppose that \(g:{\mathbb {R}}^N \rightarrow \overline{{\mathbb {R}}}\) is proper, lsc, and simple. Let \(x^0=x^{-1}\) with \(x^0\in {\mathbb {R}}^N\) and let the sequence \((x^k)_{k\in {\mathbb {N}}}\) be generated by the following update rule

$$\begin{aligned} x^{k+1}\in P_{\alpha }g\left( (1-\alpha \lambda ^{-1}) x^k+ \alpha \lambda ^{-1} P_{\lambda }f (x^k) + \beta (x^k- x^{k-1})\right) \,. \end{aligned}$$

We obtain the following cases of convergence results:

  1.

    Assume that \(h = g+ e_{\lambda }{f}\) has a local minimizer \(x^*\) and \(H_\kappa \), defined in (10), satisfies (H4) and the KL property at \((x^*,x^*)\). If \(x^0\) is sufficiently close to \(x^*\), and \(\alpha \), \(\beta \) are selected according to the property of g in one of the last three rows of Table 1 with \(L=\lambda ^{-1}\), then the sequence \((x^k)_{k\in {\mathbb {N}}}\)

    • has the finite length property,

    • remains in a neighborhood of \(x^*\),

    • and converges to a critical point \({\tilde{x}}\) of h with \(h({\tilde{x}}) = h(x^*)\).

  2.

    Assume that f is convex, \(h = g+e_{\lambda }{f}\), and \(x^*\) is a cluster point of \((x^k)_{k\in {\mathbb {N}}}\). Suppose \(H_\kappa \), defined in (10), has the KL property at \((x^*,x^*)\). Then, for any \(x^0\in {\mathbb {R}}^N\), and \(\alpha \), \(\beta \) selected according to the property of g in one of the last three rows of Table 1 with \(L=\lambda ^{-1}\), the sequence \((x^k)_{k\in {\mathbb {N}}}\)

    • has the finite length property,

    • and converges to a critical point \({\tilde{x}}\) of h with \(h({\tilde{x}}) = h(x^*)\).

If g is convex, the sequence \((x^k)_{k\in {\mathbb {N}}}\) is uniquely determined.

Proof

The proof follows analogously to that of Proposition 5.3, by invoking either Proposition 5.2 and Corollary 4.2, or Proposition 5.1 and Corollary 4.1.\(\square \)

Remark 5.6

Corollary 4.3 provides convergence rates for Proposition 5.5.
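In implementable form, the update of Proposition 5.5 combines a relaxed proximal step with respect to f, an inertial term, and a proximal step with respect to g. A minimal generic sketch (our illustration; `prox_g` and `prox_f` are user-supplied selections of \(P_{\alpha }g\) and \(P_{\lambda }f\), and \(\alpha \), \(\beta \) must be chosen according to Table 1) reads as follows; Example 5.2 below is instantiated with this sketch.

```python
import numpy as np

def ipiano_moreau(prox_g, prox_f, x0, lam, alpha, beta, max_iter=1000, tol=1e-10):
    """Inertial alternating proximal minimization (Proposition 5.5):
       y^k     = (1 - alpha/lam) x^k + (alpha/lam) P_lam f(x^k) + beta (x^k - x^{k-1}),
       x^{k+1} = P_alpha g(y^k).
       prox_g(y, alpha) should return one element of P_alpha g(y);
       prox_f(x) one element of P_lam f(x)."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(max_iter):
        y = (1 - alpha / lam) * x + (alpha / lam) * prox_f(x) + beta * (x - x_prev)
        x_new = prox_g(y, alpha)
        if np.linalg.norm(x_new - x) <= tol:
            return x_new
        x_prev, x = x, x_new
    return x
```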

Example 5.2

(Inertial alternating projection for the semi-algebraic feasibility problem)

  • The algorithm described in Proposition 5.5 can be used to solve the semi-algebraic feasibility problem of Example 3.1 with \(M=2\). The conditions in Example 3.1 are satisfied.

  • If \(S_1\) is non-convex and \(S_2\) is convex, the second case of Proposition 5.5 yields a globally convergent relaxed alternating projection method with \(g=\delta _{S_1}\) and \(f=\delta _{S_2}\). Table 1 requires the step size conditions \(\beta \in [0,\frac{1}{2}[\) and \(\alpha \in ]0,1-2\beta [\) (note that \(\lambda = 1\)); for \(\beta =0\) this yields \(\alpha \in ]0,1[\) and leads to the following update step (a code sketch follows the displayed step):

    $$\begin{aligned} x^{k+1}\in \mathrm {proj}_{S_1}((1-\alpha ) x^k+ \alpha \,\mathrm {proj}_{S_2}(x^k) ) \end{aligned}$$
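Using the hypothetical `ipiano_moreau` sketch from above, this relaxed alternating projection step (with \(g=\delta _{S_1}\), \(f=\delta _{S_2}\), \(\lambda =1\), \(\beta =0\)) can be written as follows; the two toy sets reuse the ones from the earlier sketch and are our assumptions.

```python
import numpy as np

# Relaxed alternating projection (Example 5.2): P_alpha g = proj_{S_1} (independent of
# alpha for an indicator function) and P_1 f = proj_{S_2}; lam = 1, beta = 0, alpha in ]0,1[.
proj_S1 = lambda u: u / np.linalg.norm(u)     # non-convex set S_1: unit sphere in R^2
proj_S2 = lambda u: np.array([u[0], 0.5])     # convex set S_2: line {x_2 = 0.5}
x = ipiano_moreau(prox_g=lambda y, a: proj_S1(y), prox_f=proj_S2,
                  x0=np.array([0.6, 0.8]), lam=1.0, alpha=0.9, beta=0.0)
print(x)   # approximately a point in S_1 ∩ S_2
```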

Example 5.3

The algorithm described in Proposition 5.5 can be used to solve a relaxed version of the following problem:

$$\begin{aligned} \min _{x_1,\ldots ,x_M\in {\mathbb {R}}^N}\, \sum _{i=1}^M g_i(x_i)\,, \quad s.t.\ x_1=\ldots = x_M \,, \end{aligned}$$

where the convex constraint is replaced by the associated distance function. The functions \(g_i:{\mathbb {R}}^N \rightarrow \overline{{\mathbb {R}}}\), \(i=1,\ldots ,M\), \(M\in {\mathbb {N}}\), are assumed to be proper, lsc, simple, and \(x=(x_1,\ldots ,x_M)\in {\mathbb {R}}^{N\times M}\) is the optimization variable. This problem belongs to the second case of Proposition 5.5, i.e., the sequence generated by the inertial alternating proximal minimization method converges globally to a critical point \(x^*\) of the function \(\sum _{i=1}^M g_i(x_i) + \frac{1}{2} ({\text {dist}}(x,C))^2\) where \(C:=\lbrace (x_1,\ldots ,x_M)\in {\mathbb {R}}^{N\times M}\,:\,x_1=\ldots = x_M \rbrace \). The proximal mapping required in the update, i.e., the mapping \(P_{1}\delta _C\) associated with \(\frac{1}{2} ({\text {dist}}(x,C))^2 = e_{1}{\delta _C}(x)\), is the projection onto C, which simply averages \(x_1,\ldots ,x_M\) (see the sketch below).
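The projection onto the consensus set C replaces every block by the average of all blocks. A minimal sketch (our illustration; the blocks \(x_1,\ldots ,x_M\) are assumed to be stored as the columns of an \(N\times M\) array):

```python
import numpy as np

def proj_consensus(x):
    """Projection onto C = {(x_1,...,x_M) : x_1 = ... = x_M}, with x_i stored as the
       columns of x (shape (N, M)): every column is replaced by the column average."""
    avg = x.mean(axis=1, keepdims=True)
    return np.repeat(avg, x.shape[1], axis=1)

# e.g., three 2-dimensional blocks: every column of the result equals [1., 4.]
print(proj_consensus(np.array([[0., 1., 2.], [3., 4., 5.]])))
```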

5.4 Application: A Feasibility Problem

We consider the example from [54] that demonstrates (local) linear convergence of the alternating projection method. The goal is to find an \(N\times M\) matrix X of rank \(R\) that satisfies a linear system of equations \({\mathcal {A}}(X) = B\), i.e.,

$$\begin{aligned} \text {find}\quad X\quad \text {in}\quad \underbrace{\lbrace X\in {\mathbb {R}}^{N\times M}\,:\,{\mathcal {A}}(X) = B \rbrace }_{=:{\mathscr {A}}} \cap \underbrace{\lbrace X\in {\mathbb {R}}^{N\times M}\,:\,{\text {rank}}(X) = R \rbrace }_{=: {\mathscr {R}}} \,, \end{aligned}$$

where \({\mathcal {A}}:{\mathbb {R}}^{N\times M} \rightarrow {\mathbb {R}}^D\) is a linear mapping and \(B\in {\mathbb {R}}^D\). Such feasibility problems are well suited for split projection methods, as the projection onto each set might be easy to compute. The projections are given by

$$\begin{aligned} \mathrm {proj}_{{\mathscr {A}}}(X) = X - {\mathcal {A}}^*({\mathcal {A}}{\mathcal {A}}^*)^{-1} ({\mathcal {A}}(X) - B) \quad \text {and}\quad \mathrm {proj}_{{\mathscr {R}}}(X) = \sum _{i=1}^R\sigma _i u_i v_i^\top \,, \end{aligned}$$

where \(USV^\top \) is the singular value decomposition of X with \(U=(u_1,u_2,\ldots ,u_N)\), \(V=(v_1,v_2,\ldots , v_M)\) and singular values \(\sigma _1\ge \sigma _2\ge \ldots \ge \sigma _N\) sorted in decreasing order along the diagonal of S. Note that the set of rank-R matrices is a \(C^2\)-smooth manifold [55, Ex. 8.14], hence prox-regular [7, Prop. 13.33].
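A direct NumPy transcription of these two projections might look as follows. This is our sketch: the linear mapping \({\mathcal {A}}\) is represented by a matrix `A_mat` of shape \(D\times (N\cdot M)\) acting on the vectorized X, and `np.linalg.solve` replaces the explicit inverse \(({\mathcal {A}}{\mathcal {A}}^*)^{-1}\).

```python
import numpy as np

def proj_affine(X, A_mat, B):
    """proj_A(X) = X - A*(A A*)^{-1}(A(X) - B), with A represented by a matrix acting
       on vec(X); A_mat has shape (D, N*M), B has shape (D,)."""
    x = X.ravel()
    resid = A_mat @ x - B
    corr = A_mat.T @ np.linalg.solve(A_mat @ A_mat.T, resid)
    return (x - corr).reshape(X.shape)

def proj_rank(X, R):
    """proj_R(X): best rank-R approximation via the truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :R] * s[:R]) @ Vt[:R, :]
```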

We perform the same experiment as in [54]: we randomly generate an operator \({\mathcal {A}}\) by constructing random matrices \(A_1\), ..., \(A_D\) and setting \({\mathcal {A}}(X) = (\left\langle A_1,X \right\rangle ,\ldots ,\left\langle A_D,X \right\rangle )\) with \(\left\langle A_i,X \right\rangle :=\mathrm {trace}(A_i^\top X)\), we select B such that \({\mathcal {A}}(X)=B\) has a rank-\(R\) solution, and we choose the dimensions \(M=110\), \(N=100\), \(R=4\), \(D=450\). The performance is measured w.r.t. \(\vert {\mathcal {A}}(X)-B \vert _{}\), where X is the result of the projection onto \({\mathscr {R}}\) in the current iteration.
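For completeness, the experimental setup can be reproduced along the following lines, using the hypothetical `proj_affine` and `proj_rank` from the previous sketch; the random distributions and the way B is generated from a random rank-\(R\) matrix are our assumptions, consistent with the description above. The snippet runs the alternating projection baseline and reports the residual \(\vert {\mathcal {A}}(X)-B \vert \).

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, R, D = 100, 110, 4, 450

# A(X) = (<A_1, X>, ..., <A_D, X>) represented as a D x (N*M) matrix acting on vec(X)
A_mat = rng.normal(size=(D, N * M))
# choose B such that A(X) = B has a rank-R solution (assumed construction)
X_true = rng.normal(size=(N, R)) @ rng.normal(size=(R, M))
B = A_mat @ X_true.ravel()

X = rng.normal(size=(N, M))                       # random initialization
for k in range(1000):
    # alternating projection baseline; for speed one would precompute a
    # factorization of A_mat @ A_mat.T instead of re-solving inside proj_affine
    X = proj_rank(proj_affine(X, A_mat, B), R)
    residual = np.linalg.norm(A_mat @ X.ravel() - B)
    if residual <= 1e-10:
        break
print(k, residual)
```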

Table 2 Convergence results for 200 randomly generated feasibility problems as described in Sect. 5.4

Fig. 2 Convergence plots for the feasibility problem in Sect. 5.4. The inertial methods developed in this paper significantly outperform all other methods with respect to the number of iterations (left plot) and the actual computation time (right plot)

We consider the alternating projection method \(X^{k+1}= \mathrm {proj}_{{\mathscr {R}}}(\mathrm {proj}_{{\mathscr {A}}}(X^k))\), the averaged projection method \(X^{k+1}= \frac{1}{2}\left( \mathrm {proj}_{{\mathscr {A}}}(X^k) + \mathrm {proj}_{{\mathscr {R}}}(X^k)\right) \), the globally convergent relaxed alternating projection method from Example 5.2 (glob-altP, \(\alpha =0.99\)), and their inertial variants proposed in Sects. 5.2 and 5.3. For the Heavy-ball method/inertial averaged projection (loc-hb-avgP-bt, \(\beta =0.75\)) of Sect. 5.2 applied to the objective \({\text {dist}}(X,{\mathscr {A}})^2 + {\text {dist}}(X,{\mathscr {R}})^2\), we use the backtracking line-search version of iPiano [3, Algorithm 4] to estimate the Lipschitz constant automatically. For iPiano/inertial alternating projection (glob-ipiano-altP) of Sect. 5.3 applied to \(\min _{X\in {\mathscr {R}}}\,\frac{1}{2} ({\text {dist}}(X,{\mathscr {A}}))^2\) (i.e., g non-convex, f smooth and convex), we use \(\beta =0.45\in [0,\frac{1}{2}[\) and \(\alpha =0.99(1-2\beta )/L\) with \(L=1\), which guarantees global convergence to a stationary point, as well as a backtracking version (glob-ipiano-altP-bt) [4, Algorithm 5]. Moreover, for the same setting, we use a heuristic version (heur-ipiano-altP, \(\beta =0.75\), theoretically infeasible) with \(\alpha \) chosen such that \(\alpha \lambda ^{-1}=1\) in Proposition 5.5. Finally, we also consider the locally convergent version of iPiano from Proposition 5.5 (loc-ipiano-altP-bt, \(\beta =0.75\)) applied to the objective (see Footnote 2) \(\min _{X\in {\mathscr {A}}}\,\frac{1}{2} ({\text {dist}}(X,{\mathscr {R}}))^2\) (i.e., g convex, f prox-regular and non-convex) with backtracking. For the local convergence results, we assume that we start close enough to a feasible point. Experimentally, all algorithms converge to a feasible point. In theory, backtracking is not required; however, as the radius of the neighborhood of attraction is hard to quantify, the algorithm is more stable with backtracking.

We also compare our method against the recently proposed globally convergent Douglas–Rachford splitting for non-convex feasibility problems [39]. The algorithm depends on a parameter \(\gamma \), which in theory is required to be rather small: \(\gamma _0:=\sqrt{3/2}-1\). The basic Douglas–Rachford variant uses the maximal feasible value of this \(\gamma \)-parameter. Douglas–Rachford 75 is a heuristic version (see Footnote 3) proposed in [39].

Table 2 compares the methods on a set of 200 randomly generated problems with a maximum of 1000 iterations for each method. Also the local methods seem to reliably find a feasible point. This seems to be true even for the heuristic methods Douglas–Rachford 75 and heur-ipiano-altP, which shows that there is still a gap between theory and practice. The inertial algorithms that use backtracking significantly outperform the methods without backtracking or inertia. This observation is even more pronounced with respect to the actual computation time, even though the backtracking algorithms need to evaluate the objective value several times per iteration. Interestingly, the globally convergent version of iPiano converged the fastest to a feasible point. The convergence behavior of the methods is visualized in Fig. 2 for a representative example.

6 Conclusions

In this paper, we proved a local convergence result for abstract descent methods, which is similar to that of Attouch et al. [8]. This local convergence result is applicable to an inertial forward–backward splitting method, called iPiano [3]. For functions that satisfy the Kurdyka–Łojasiewicz inequality at a local optimum, under a certain growth condition, we verified that the sequence of iterates stays in a neighborhood of a local (or global) minimum and converges to the minimum. As a consequence, the properties that imply convergence of iPiano are required to hold only locally. Combined with a well-known expression for the gradient of Moreau envelopes in terms of the proximal mapping, relations of iPiano to an inertial averaged proximal minimization method and an inertial alternating proximal minimization method are uncovered. These considerations are conducted for prox-regular functions, instead of requiring the stronger assumption of convexity. Experimentally, for a non-convex feasibility problem, iPiano significantly outperforms the alternating projection method and a recently proposed non-convex variant of Douglas–Rachford splitting.