A gradient-type algorithm with backward inertial steps associated to a nonconvex minimization problem

Alecsa, Cristian Daniel; László, Szilárd Csaba; Viorel, Adrian

doi:10.1007/s11075-019-00765-z

A gradient-type algorithm with backward inertial steps associated to a nonconvex minimization problem

Original Paper
Published: 13 July 2019

Volume 84, pages 485–512, (2020)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Numerical Algorithms Aims and scope Submit manuscript

A gradient-type algorithm with backward inertial steps associated to a nonconvex minimization problem

Download PDF

Cristian Daniel Alecsa^1,2,
Szilárd Csaba László ORCID: orcid.org/0000-0001-5140-1144³ &
Adrian Viorel³

453 Accesses
7 Citations
Explore all metrics

Abstract

We investigate an algorithm of gradient type with a backward inertial step in connection with the minimization of a nonconvex differentiable function. We show that the generated sequences converge to a critical point of the objective function, if a regularization of the objective function satisfies the Kurdyka-Łojasiewicz property. Further, we provide convergence rates for the generated sequences and the objective function values formulated in terms of the Łojasiewicz exponent. Finally, some numerical experiments are presented in order to compare our numerical scheme with some algorithms well known in the literature.

Convergence rates for an inertial algorithm of gradient type associated to a smooth non-convex minimization

Article 01 July 2020

A Telescopic Bregmanian Proximal Gradient Method Without the Global Lipschitz Continuity Assumption

Article 25 March 2019

Gradient-Type Methods for Optimization Problems with Polyak-Łojasiewicz Condition: Early Stopping and Adaptivity to Inexactness Parameter

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction and Preliminaries

Gradient-type algorithms have a long history, going back at least to Cauchy (1847), and also a wealth of applications. Solving linear systems, Cauchy’s original motivation, is maybe the most obvious application, but many of today’s hot topics in machine learning or image processing also deal with optimization problems from an algorithmic perspective and rely on gradient-type algorithms.

The original gradient descent algorithm

$$ x_{n+1}=x_{n} - s {\nabla} g(x_{n}), $$

which is precisely an explicit Euler method applied to the gradient flow

$$ \dot{x}(t)=-{\nabla} g(x(t)), $$

does not achieve very good convergence rates and much research has been dedicated to accelerating convergence.

Based on the analogy to mechanical systems, e.g., to the movement, with friction, of a heavy ball in a potential well defined by the smooth convex objective function g, with L_g −Lipschitz continuous gradient, Polyak [26] was able to provide the seminal idea for achieving acceleration namely the addition of inertial (momentum) terms to the gradient algorithm. His two-step iterative method, the so-called heavy ball method, takes the following form: For the initial values $x_{0}=x_{-1}\in \mathbb {R}^{m}$ and $n\in \mathbb {N}$ let:

$$ \left\{\begin{array}{lll} y_{n}=x_{n}+\alpha_{n}(x_{n}-x_{n-1}),\\ \\ x_{n+1}=y_{n}-\beta_{n}{\nabla} g(x_{n}), \end{array}\right. $$

(1)

where α_n ∈ [0,1) and β_n > 0 is a step-size parameter. Recently, in [29], the convergence rate of order $\mathcal {O}\left (\frac {1}{n}\right )$ has been obtained for the heavy ball algorithm, provided g is coercive, the inertial parameter $(\alpha _{n})_{n\in \mathbb {N}}$ is a nonincreasing sequence, and the stepsize parameter satisfies $\beta _{n}=\frac {2(1-\alpha _{n})c}{L_{g}}$, for some fixed c ∈ (0,1) (see also [18] for an ergodic rate). More precisely, in [29], it is shown that under the previous assumption one has:

$$ g(x_{n})-\min g=\mathcal{O}\left( \frac{1}{n}\right). $$

It is worthwhile mentioning that the forward-backward algorithm studied in [12] in a full nonconvex setting reduces to Polyak’s heavy ball method if the nonsmooth term vanishes; hence, it can be viewed as an extension of the heavy ball method to the case when the objective function g is possible nonconvex, but still has Lipschitz continuous gradient with Lipschitz contant L_g. Indeed, Algorithm 1 from [12], in case f ≡ 0 and $F=\frac 12\|\cdot \|^{2}$ has the form: For the initial values $x_{0}=x_{-1}\in \mathbb {R}^{m}$ and $n\in \mathbb {N}$, let:

$$ x_{n+1}=x_{n}+\alpha_{n}(x_{n}-x_{n-1})-\beta_{n}{\nabla} g(x_{n}), $$

(2)

where $0<\underline {\beta }\le \beta _{n}\le \overline {\beta }<+\infty $ and α_n ∈ [0,α], α > 0 for all n ≥ 1. In this particular case, convergence of the generated sequence $(x_{n})_{n\in \mathbb {N}}$ to a critical point of the objective function g can be shown under the assumption that a regularization of the objective function satisfies the Kurdyka-Łojasiewicz property, further $\underline {\beta }, \overline {\beta }$ and α > 0 satisfy:

$$ 1 > \overline{\beta} L_{g} + 2\alpha\frac{\overline{\beta}}{\underline{\beta}}. $$

(3)

Note that (3) implies $\alpha <\frac 12$; hence, $\alpha _{n}\in \left [0,\frac 12\right )$ for all n ≥ 1. If $\underline {\beta }$ and α are positive numbers such that $1 > \underline {\beta } L_{g} + 2\alpha $, then by choosing $ \overline {\beta }\in \left [\underline {\beta },\frac {\underline {\beta }}{\underline {\beta } L_{g} + 2\alpha }\right )$, relation (3) is satisfied.

Probably the most acclaimed inertial algorithm is Nesterov’s accelerated gradient method, which in its particular form: For the initial values $x_{0}=x_{-1}\in \mathbb {R}^{m}$ and $n\in \mathbb {N}$ let:

$$ \left\{\begin{array}{lll} y_{n}=x_{n}+\frac{n}{n+3}(x_{n}-x_{n-1}),\\ \\ x_{n+1}=y_{n}-s{\nabla} g(y_{n}), \end{array}\right. $$

(4)

for a convex g with Lipschitz continuous gradient L_g and step size $s\le \frac {1}{L_{g}}$, exhibits an improved convergence rate of $\mathcal {O}(1/n^{2})$ (see [15, 23]), and which, as highlighted by Su, Boyd, and Candès [28], (see also [5]), can be seen as the discrete counterpart of the second-order differential equation:

$$ \ddot{x}(t)+\frac{3}{t}\dot{x}(t)-{\nabla} g(x(t))=0. $$

In this respect, it may be useful to recall that until recently little was known about the efficiency of Nesterov’s accelerated gradient method outside a convex setting. However, in [20], a Nesterov-like method, differing from the original only by a multiplicative coefficient, has been studied and convergence rates have been provided for the very general case when a regularization of the objective function g has the KL property. More precisely, in [20], the following algorithm was considered.

For the initial values $x_{0}=x_{-1}\in \mathbb {R}^{m}$ and $n\in \mathbb {N}$ let:

$$ \left\{\begin{array}{lll} y_{n}=x_{n}+\frac{\beta n}{n+\alpha}(x_{n}-x_{n-1}),\\ \\ x_{n+1}=y_{n}-s{\nabla} g(y_{n}), \end{array}\right. $$

(5)

where α > 0, β ∈ (0,1) and $0<s<\frac {2(1-\beta )}{L_{g}}$. Unfortunately, for technical reasons, one can not allow β = 1 therefore one does not have full equivalence between Algorithm (5) and Algorithm (4). However, what is lost at the inertial parameter is gained at the step size, since for $0<\beta \le \frac 12$ one may have $s\in \left [\frac {1}{L_{g}},\frac {2}{L_{g}}\right )$. Convergence of the sequences generated by Algorithm (5) was obtained under the assumption that the regularization of the objective function g, namely $H(x,y)=g(x)+\frac 12\|y-x\|^{2}$, is a KL function.

In this paper, we deal with the optimization problem:

$$ \inf_{x\in\mathbb{R}^{m}}g(x), $$

(P)

where $g:\mathbb {R}^{m}\longrightarrow \mathbb {R}$ is a Fréchet differentiable function with L_g-Lipschitz continuous gradient, i.e., there exists L_g ≥ 0 such that ∥∇g(x) −∇g(y)∥≤ L_g∥x − y∥ for all $x,y\in \mathbb {R}^{m}$, and we associate to (P) the following inertial algorithm of gradient type.

Consider the starting points $x_{0}=x_{-1}\in \mathbb {R}^{m}$, and for every $n\in \mathbb {N}$ let:

$$ \left\{\begin{array}{lll} y_{n}=x_{n}+\alpha_{n}(x_{n}-x_{n-1}),\\ \\ x_{n+1}=y_{n}-\beta_{n}{\nabla} g(y_{n}), \end{array}\right. $$

(6)

where we assume that

$$ \lim_{n\longrightarrow+\infty}\alpha_{n}=\alpha\in\left( \frac{-10+\sqrt{68}}{8},0\right) , \ \lim_{n\longrightarrow+\infty}\beta_{n}=\beta \text{ and } 0<\beta<\frac{4\alpha^{2}+10\alpha+2}{L_{g}(2\alpha+1)^{2}}. $$

Remark 1

Observe that the inertial parameter α_n becomes negative after a number of iterations and this can be viewed as taking a backward inertial step in our algorithm. Of course, this also shows that after a number of iteration y_n is a convex combination of x_n− 1 and x_n, (see [16] for similar constructions), that is:

$$ y_{n}=(1-(-\alpha_{n}))x_{n}+(-\alpha_{n})x_{n-1},{\kern1.7pt}-\alpha_{n}\in(0,1). $$

Another novelty of Algorithm (6) is that it allows variable step size. Moreover, it can easily be verified that whenever $\alpha >-\frac 16$ one may take $\beta >\frac {1}{L_{g}}$.

We emphasize that the analysis of the proposed algorithm (6) is intimately related to the properties of the following regularization of the objective function g (see also [11,12,13, 20, 21]), that is:

$$ H : \mathbb{R}^{m} \times \mathbb{R}^{m}\longrightarrow \mathbb{R},{\kern1.7pt}{\kern1.7pt} H(x,y) = g(x) + \frac12\| y-x \|^{2}. $$

(7)

In the remainder of this section, we introduce the necessary apparatus of notions and results that we will need in our forthcoming analysis.

For a differentiable function $f:\mathbb {R}^{m}\longrightarrow \mathbb {R}$, we denote by $\text {crit}(f)=\{x\in \mathbb {R}^{m}: \nabla f(x)=0\}$ the set of critical points of f. The following so-called descent lemma (see [24]) will play an essential role in our forthcoming analysis.

Lemma 2

Let$f:\mathbb {R}^{m}\longrightarrow \mathbb {R}$beFrèchet differentiable with L Lipschitz continuous gradient.Then:

$$ f(y)\le f(x)+\langle{\nabla} f(x),y-x\rangle+\frac{L}{2}\|y-x\|^{2},{\kern1.7pt}\forall x,y\in\mathbb{R}^{m}. $$

Furthermore, the set of cluster points of a given sequence $(x_{n})_{n\in \mathbb {N}}$ will be denoted by $\omega ((x_{n})_{n\in \mathbb {N}})$. At the same time, the distance function to a set is defined for $A\subseteq \mathbb {R}^{m}$ as

$$ \text{dist}(x,A)=\inf_{y\in A}\|x-y\|\text{ for all } x\in\mathbb{R}^{m}. $$

Our convergence result relies on the concept of a KL function. For η ∈ (0,+∞], we denote by Θ_η the class of concave and continuous functions φ : [0,η) → [0,+∞) such that φ(0) = 0, φ is continuously differentiable on (0,η), continuous at 0 and φ^′(s) > 0 for all s ∈ (0,η).

Definition 1

(Kurdyka-Łojasiewicz property) Let $f:\mathbb {R}^{m}\rightarrow \mathbb {R}$ be a differentiable function. We say that f satisfies the Kurdyka-Łojasiewicz (KL) property at $\overline x\in \mathbb {R}^{m}$ if there exists η ∈ (0,+∞], a neighborhood U of $\overline x$ and a function φ ∈Θ_η such that for all x in the intersection:

$$ U\cap \{x\in\mathbb{R}^{m}: f(\overline x)<f(x)<f(\overline x)+\eta\} $$

the following inequality holds:

$$ \varphi^{\prime}(f(x)-f(\overline x))\|{\nabla} f(x))\|\geq 1. $$

If f satisfies the KL property at each point in $\mathbb {R}^{m}$, then f is called a KL function.

The function φ is called a desingularizing function (see for instance [6]). The origins of this notion go back to the pioneering work of Łojasiewicz [22], where it is proved that for a real-analytic function $f:\mathbb {R}^{m}\rightarrow \mathbb {R}$ and a critical point $\overline x\in \mathbb {R}^{m}$ (that is $\nabla f(\overline x)=0$), there exists 𝜃 ∈ [1/2,1) such that the function $|f-f(\overline x)|^{\theta }\|\nabla f\|^{-1}$ is bounded around $\overline x$. This corresponds to the situation when φ(s) = C(1 − 𝜃)^− 1s^1−𝜃, where C > 0 is a given constant, and leads to the following definition.

Definition 2 (for which we refer to 2, 8, 22)

A differentiable function $f:\mathbb {R}^{m} \longrightarrow \mathbb {R}$ has the Łojasiewicz property with exponent 𝜃 ∈ [0,1) at $\overline {x}\in \text {crit}(f)$ if there exists K,ε > 0 such that:

$$ |f(x)-f(\overline{x})|^{\theta} \leq K \| {\nabla} f(x) \|, $$

(8)

for every $x\in \mathbb {R}^{m}$, with $\| x-\overline {x} \|<\varepsilon $.

In the above definition, for 𝜃 = 0, we adopt the convention 0⁰ = 0, such that if $|f(x)-f(\overline {x})|^{0}=0$, then $f(x)=f(\overline {x})$ (see [2]).

The result of Łojasiewicz allows the interpretation of the KL property as a re-parametrization of the function values in order to avoid flatness around the critical points. Kurdyka [19] extended this property to differentiable functions definable in an o-minimal structure. Further extensions to the nonsmooth setting can be found in [3, 8,9,10].

To the class of KL functions belong semi-algebraic, real sub-analytic, semiconvex, uniformly convex, and convex functions satisfying a growth condition. We refer the reader to [2,3,4, 7,8,9,10] and the references therein for more details regarding all the classes mentioned above and illustrating examples.

Finally, an important role in our convergence analysis will be played by the following uniformized KL property given in [7, Lemma 6].

Lemma 3

Let${\Omega }\subseteq \mathbb {R}^{m}$bea compact set and let$f:\mathbb {R}^{m}\rightarrow \mathbb {R}$bea differentiable function. Assume that f is constant onΩ and f satisfies the KL property at each pointofΩ.Then, there existε,η > 0 andφ ∈Θ_ηsuch that for all$\overline x\in {\Omega }$andfor all x in the intersection:

$$ \{x\in\mathbb{R}^{m}: \text{dist}(x,{\Omega})<\varepsilon\}\cap \{x\in\mathbb{R}^{m}: f(\overline x)<f(x)<f(\overline x)+\eta\} $$

(9)

the following inequality holds:

$$ \varphi^{\prime}(f(x)-f(\overline x))\|{\nabla} f(x)\|\geq 1. $$

(10)

The outline of the paper is the following. In Section 2, we give a sufficient condition that ensures the decrease property of the regularization H in the iterates, and which also ensures that the iterates gap belongs to l². Further, using these results, we show that the set of cluster points of the iterates is included in the set of critical points of the objective function. Finally, by means of the the KL property of H, we obtain that the iterates gap belongs to l¹. This implies the convergence of the iterates (see also [4, 7, 12, 20]). In Section 3, we obtain several convergence rates both for the sequences $(x_{n})_{n\in \mathbb {N}},{\kern 1.7pt}(y_{n})_{n\in \mathbb {N}}$ generated by the numerical scheme (6), as well as for the function values g(x_n), g(y_n) in the terms of the Łojasiewicz exponent of g and H, respectively (see [14, 17, 20] and also [1] for convergence rates under geometrical conditions). Finally, in Section 4, we present some numerical experiments that show that our algorithm, in many cases, has better properties than the algorithms used in the literature.

2 Convergence results

We start to investigate the convergence of the proposed algorithm by showing that H is decreasing along certain sequences built upon the iterates generated by (6).

Theorem 4

Let$(x_{n})_{n\in \mathbb {N}}$and$(y_{n})_{n\in \mathbb {N}}$bethe sequences generated by the numerical scheme (6) and forall$n\in \mathbb {N},{\kern 1.7pt}n\ge 1$consider thesequences:

$$ A_{n} =\frac{(2-\beta_{n} L_{g})(1+\alpha_{n+1})^{2}-2\alpha_{n+1}(1+\alpha_{n+1})}{2\beta_{n}}, $$

$$ C_{n} = \frac{(2-\beta_{n} L_{g})\alpha_{n}(1+\alpha_{n+1})-\alpha_{n}\alpha_{n+1}}{2\beta_{n}} $$

and

$$ \delta_{n}=A_{n-1}+C_{n-1}. $$

Then, there exists $N\in \mathbb {N}$ such that:

(i)
The sequence (g(y_n) + δ_n∥x_n − x_n− 1∥²)_n≥N is decreasing and δ_n > 0 for all n ≥ N.

Assume further that g is bounded from below. Then,

(ii)
The sequence (g(y_n) + δ_n∥x_n − x_n− 1∥²)_n≥N is convergent;
(iii)
${\sum }_{n\ge 1}\|x_{n}-x_{n-1}\|^{2}<+\infty $.

Proof

By applying the descent Lemma 2 to g, we have:

$$ g(y_{n+1})\le g(y_{n})+\langle{\nabla} g(y_{n}),y_{n+1}-y_{n}\rangle+\frac{L_{g}}{2}\|y_{n+1}-y_{n}\|^{2}. $$

However, after rewriting the first equation in (6) as ${\nabla } g(y_{n})=\frac {1}{\beta _{n}}(y_{n}-x_{n+1})$ and taking the inner product with y_n+ 1 − y_n to obtain:

$$ \langle{\nabla g(y_{n}),y_{n+1}-y_{n}}\rangle=\frac{1}{\beta_{n}}\langle{y_{n}-x_{n+1},y_{n+1}-y_{n}}\rangle, $$

the descent inequality becomes:

$$ g(y_{n+1})-\frac{L_{g}}{2}\|y_{n+1}-y_{n}\|^{2}\le g(y_{n})+ \frac{1}{\beta_{n}}\langle{y_{n}-x_{n+1},y_{n+1}-y_{n}}\rangle $$

(11)

$$=g(y_{n})+ \frac{1}{\beta_{n}}\left( -\|y_{n+1}-y_{n}\|^{2}+\langle{y_{n+1}-x_{n+1},y_{n+1}-y_{n}}\rangle\right).$$

Further,

$$y_{n+1}-y_{n}=(1+\alpha_{n+1})(x_{n+1}-x_{n})-\alpha_{n}(x_{n}-x_{n-1})$$

and

$$y_{n+1}-x_{n+1}=\alpha_{n+1}(x_{n+1}-x_{n}),$$

hence:

$$ g(y_{n+1})+\left( \frac{1}{\beta_{n}}-\frac{L_{g}}{2}\right)\|y_{n+1}-y_{n}\|^{2}\le g(y_{n})+ \frac{\alpha_{n+1}}{\beta_{n}}\langle{x_{n+1}-x_{n},y_{n+1}-y_{n}}\rangle. $$

(12)

Thus, we have:

$$\|y_{n+1}-y_{n}\|^{2}=\|(1+\alpha_{n+1})(x_{n+1}-x_{n})-\alpha_{n}(x_{n}-x_{n-1})\|^{2}$$

$$=(1+\alpha_{n+1})^{2}\|x_{n+1}-x_{n}\|^{2}+{\alpha_{n}^{2}}\|x_{n}-x_{n-1}\|^{2}-2\alpha_{n}(1+\alpha_{n+1})\langle{x_{n+1}-x_{n},x_{n}-x_{n-1}}\rangle,$$

and

$$\langle{x_{n+1}-x_{n},y_{n+1}-y_{n}}\rangle=\langle{x_{n+1}-x_{n},(1+\alpha_{n+1})(x_{n+1}-x_{n})-\alpha_{n}(x_{n}-x_{n-1})}\rangle$$

$$=(1+\alpha_{n+1})\|x_{n+1}-x_{n}\|^{2}-\alpha_{n}\langle{x_{n+1}-x_{n},x_{n}-x_{n-1}}\rangle.$$

Replacing the above equalities in (12) gives:

$$g(y_{n+1})+\frac{(2-\beta_{n} L_{g})(1+\alpha_{n+1})^{2}-2\alpha_{n+1}(1+\alpha_{n+1})}{2\beta_{n}}\|x_{n+1}-x_{n}\|^{2}\le$$

$$g(y_{n})-\frac{(2-\beta_{n} L_{g}){\alpha_{n}^{2}}}{2\beta_{n}}\|x_{n}-x_{n-1}\|^{2}+$$

$$\frac{(2-\beta_{n} L_{g})\alpha_{n}(1+\alpha_{n+1})-\alpha_{n}\alpha_{n+1}}{\beta_{n}}\langle{x_{n+1}-x_{n},x_{n}-x_{n-1}}\rangle.$$

The above inequality motivates the introduction of the following notations:

$$ \begin{array}{@{}rcl@{}} B_{n} &=& \frac{(2-\beta_{n} L_{g}){\alpha_{n}^{2}}}{2\beta_{n}} \end{array} $$

and

$$ {\Delta}_{n}=A_{n-1}+B_{n}+C_{n-1}+C_{n} $$

(13)

for all $n\in \mathbb {N},{\kern 1.7pt}n\ge 1$.

In terms of these notations, after using the equality:

$$2\langle{x_{n+1}-x_{n},x_{n}-x_{n-1}}\rangle=\|x_{n+1}-x_{n-1}\|^{2}-\|x_{n+1}-x_{n}\|^{2}-\|x_{n}-x_{n-1}\|^{2},$$

we can write:

$$ -C_{n}\|x_{n+1}-x_{n-1}\|^{2}+g(y_{n+1})+(A_{n}+C_{n})\|x_{n+1}-x_{n}\|^{2}\le g(y_{n})+(-C_{n}-B_{n})\|x_{n}-x_{n-1}\|^{2}. $$

(14)

Consequently, we have:

$$ \begin{array}{@{}rcl@{}} -C_{n}\|x_{n+1}-x_{n-1}\|^{2}+{\Delta}_{n}\|x_{n}-x_{n-1}\|^{2}\le (g(y_{n})+\delta_{n}\|x_{n}-x_{n-1}\|^{2})\\-(g(y_{n+1})+\delta_{n+1}\|x_{n+1}-x_{n}\|^{2}). \end{array} $$

(15)

Now, since α_n→α, β_n→β as n→ + ∞ and $\alpha \in \left (\frac {-10+\sqrt {68}}{8},0\right )$, $0<\beta <\frac {4\alpha ^{2}+10\alpha +2}{L_{g}(2\alpha +1)^{2}}$ we have:

$$ \begin{array}{@{}rcl@{}} \lim\limits_{n\longrightarrow+\infty}A_{n} &=& \frac{(2-\beta L_{g})(\alpha+1)^{2}+2\alpha-2\alpha^{2}}{2\beta},\\ \lim\limits_{n\longrightarrow+\infty}B_{n} &=& \frac{(2-\beta L_{g})\alpha^{2}}{2\beta},\\ \lim\limits_{n\longrightarrow+\infty}C_{n} &=& \frac{(2-\beta L_{g})\alpha(1+\alpha)-\alpha^{2}}{2\beta}<0,\\ \lim\limits_{n\longrightarrow+\infty}{\Delta}_{n} &=& \frac{(2-\beta L_{g})(2\alpha+1)^{2}+2\alpha-4\alpha^{2}}{2\beta}>0,\\ \lim\limits_{n\longrightarrow+\infty}\delta_{n} &=& \frac{(2-\beta L_{g})(2\alpha^{2}+3\alpha+1)+2\alpha-3\alpha^{2}}{2\beta}>0. \end{array} $$

Hence, there exists $N\in \mathbb {N}$ and C > 0, D > 0 such that for all n ≥ N one has:

$$C_{n}\le -C,{\kern1.7pt} {\Delta}_{n}\ge D\text{ and } \delta_{n}>0$$

which, in the view of (15), shows (i); that is, the sequence g(y_n) + δ_n∥x_n − x_n− 1∥² is decreasing for n ≥ N.

By using (15) again, we obtain:

$$ \begin{array}{@{}rcl@{}}0&<&C\|x_{n+1}-x_{n-1}\|^{2}+D\|x_{n}-x_{n-1}\|^{2}\le (g(y_{n})+\delta_{n}\|x_{n}-x_{n-1}\|^{2})\\&&- (g(y_{n+1})+\delta_{n+1}\|x_{n+1}-x_{n}\|^{2}), \end{array} $$

for all n ≥ N, or, more convenient, that:

$$ 0<D\|x_{n}-x_{n-1}\|^{2}\le \left( g(y_{n})+\delta_{n}\|x_{n}-x_{n-1}\|^{2}\right)- \left( g(y_{n+1})+\delta_{n+1}\|x_{n+1}-x_{n}\|^{2}\right), $$

(16)

for all n ≥ N. Let r > N. Summing up the latter relations gives:

$$ D\sum\limits_{n=N}^{r}\|x_{n}-x_{n-1}\|^{2}\le \left( g(y_{N})+\delta_{N}\|x_{N}-x_{N-1}\|^{2}\right)-\left( g(y_{r+1})+\delta_{r+1}\|x_{r+1}-x_{r}\|^{2}\right) $$

which leads to:

$$ g(y_{r+1})+D\sum\limits_{n=N}^{r}\|x_{n}-x_{n-1}\|^{2}\le g(y_{N})+\delta_{N}\|x_{N}-x_{N-1}\|^{2}. $$

(17)

Now, taking into account that g is bounded from below, after letting r→ + ∞ we obtain:

$$\sum\limits_{n=N}^{\infty}\|x_{n}-x_{n-1}\|^{2}<+\infty$$

which proves (iii).

This also shows that:

$$\lim_{n\longrightarrow+\infty}\|x_{n}-x_{n-1}\|^{2}=0,$$

hence

$$\lim_{n\longrightarrow+\infty}\delta_{n}\|x_{n}-x_{n-1}\|^{2}=0.$$

But then, using again the fact that g is bounded from below, we have that the sequence g(y_n) + δ_n∥x_n − x_n− 1∥² is bounded from below and also decreasing (see (i)) for n ≥ N; hence, there exists:

$$\lim_{n\longrightarrow+\infty} g(y_{n})+\delta_{n}\|x_{n}-x_{n-1}\|^{2}\in\mathbb{R}.$$

□

Remark 5

By introducing the sequence:

$$ u_{n} = \sqrt{2\delta_{n}} \cdot (x_{n} - x_{n-1}) + y_{n}, {\kern1.7pt}{\kern1.7pt}n\ge 1, $$

(18)

one can easily observe that the statements of Theorem 4 can be expressed in terms of the regularization of the objective function since H(y_n,u_n) = g(y_n) + δ_n∥x_n − x_n− 1∥² for all $n\in \mathbb {N},{\kern 1.7pt}n\ge 1$.

An interesting fact is that for the sequence (H(y_n,u_n))_n≥N to be decreasing one does not need the boundedness of the objective function g, but only its regularity, as can be seen in the proof of Theorem 4. The energy decay is thus a structural property of the algorithm (6) and only the existence of the limit requires the boundedness of the objective function.

Remark 6

Observe that conclusion (iii) in the hypotheses of Theorem 4 assures that the sequence $(x_{n}-x_{n-1})_{n\in \mathbb {N}}\in l^{2}$, in particular that:

$$ \lim_{n\longrightarrow+\infty}(x_{n}-x_{n-1})=0. $$

(19)

Lemma 7

In the framework of the optimization problem (P), assume thatthe objective function g is bounded from below and consider thesequences$(x_{n})_{n \in \mathbb {N}}$and$(y_{n})_{n \in \mathbb {N}}$generatedby the numerical algorithm (6) andlet$(u_{n})_{n \in \mathbb {N}}$bedefined by (18). Then, the following statements are valid:

(i)
The sets of cluster pointsof$(x_{n})_{n \in \mathbb {N}}$,$(y_{n})_{n \in \mathbb {N}}$and$(u_{n})_{n \in \mathbb {N}}$coincideand are contained in the set o critical points of g,i.e.:
$$\omega((x_{n})_{n \in \mathbb{N}}) = \omega((y_{n})_{n \in \mathbb{N}}) = \omega((u_{n})_{n \in \mathbb{N}}) \subseteq crit(g);$$
(ii)
$\quad \omega ((y_{n},u_{n})_{n \in \mathbb {N}}) \subseteq crit(H) = \lbrace (x,x) \in \mathbb {R}^{m} \times \mathbb {R}^{m} \ : \ x \in crit(g) \rbrace \ $.

Proof

(i) We start by proving $\omega ((x_{n})_{n \in \mathbb {N}}) \subseteq \omega ((u_{n})_{n \in \mathbb {N}})$ and $\omega ((x_{n})_{n \in \mathbb {N}}) \subseteq \omega ((y_{n})_{n \in \mathbb {N}})$. Bearing in mind that $\lim \limits _{n \longrightarrow \infty } (x_{n} - x_{n-1}) = 0$ and that the sequences $(\delta _{n})_{n \in \mathbb {N}}$, $(\alpha _{n})_{n \in \mathbb {N}}$, and $(\beta _{n})_{n \in \mathbb {N}}$ are convergent, the conclusion is quite straightforward. Indeed, if $\overline {x} \in \omega ((x_{n})_{n \in \mathbb {N}})$ and $(x_{n_{k}})_{k \in \mathbb {N}}$ is a subsequence such that $\lim \limits _{k \longrightarrow \infty } x_{n_{k}} = \overline {x}$, then:

$$ \lim\limits_{k \longrightarrow \infty} y_{n_{k}} = \lim\limits_{k \longrightarrow \infty} x_{n_{k}} + \lim\limits_{k \longrightarrow \infty} \alpha_{n_{k}} \cdot \lim\limits_{k \longrightarrow \infty} (x_{n_{k}} - x_{n_{k}-1})$$

and

$$ \lim\limits_{k \longrightarrow \infty} u_{n_{k}} = \lim\limits_{k \longrightarrow \infty} \delta_{n_{k}} \cdot \lim\limits_{k \longrightarrow \infty} (x_{n_{k}} - x_{n_{k}-1}) + \lim\limits_{k \longrightarrow \infty} y_{n_{k}}$$

imply that the sequences $(x_{n_{k}})_{k \in \mathbb {N}}$, $(y_{n_{k}})_{k \in \mathbb {N}}$ and $(u_{n_{k}})_{k \in \mathbb {N}}$ all converge to the same element $\overline {x}\in \mathbb {R}^{m}$. The reverse inclusions follow in a very similar manner from the definitions of u_n and y_n. In order to prove that $\omega ((x_{n})_{n \in \mathbb {N}}) \subseteq \text {crit}(g)$, we use the fact that ∇g is a continuous operator. So, passing to the limit in $\nabla g(y_{n_{k}}) = \frac {1}{\beta _{n_{k}}} \cdot (y_{n_{k}} - x_{n_{k}+1})$ and taking into account that $\lim _{k\longrightarrow +\infty }\beta _{n_{k}}=\beta >0$, we have:

$$ \begin{array}{@{}rcl@{}} \nabla g(\overline{x}) &=& \lim\limits_{k \longrightarrow \infty} \nabla g(y_{n_{k}})\\ &=& \frac{1}{\lim\limits_{k \longrightarrow \infty} \beta_{n_{k}}} \cdot \lim\limits_{k \longrightarrow \infty} (y_{n_{k}} - x_{n_{k}+1}) \end{array} $$

and finally, as $y_{n_{k}} - x_{n_{k}+1} = (x_{n_{k}} - x_{n_{k}+1}) + \alpha _{n_{k}} \cdot (x_{n_{k}} - x_{n_{k}-1})$, we obtain:

$$\nabla g(\overline{x}) = 0.$$

For proving the statement (ii), we rely on a direct computation yielding:

$$ \nabla H(x,y) = \left( \nabla g \left( x \right) + \left( x-y \right) \ , \ \left( y-x \right) \right), $$

(20)

which, in turn, gives

$$\text{crit}(H) = \left\lbrace (\overline{x},\overline{x}) \in \mathbb{R}^{m} \times \mathbb{R}^{m} \ : \ \overline{x} \in \text{crit}(g) \right\rbrace $$

and allows us to apply (i) to obtain the desired conclusion. □

Some direct consequences of Theorem 4 (ii) and Lemma 7 are the following.

Fact 8

In the setting of Lemma 7,let$(\overline {x},\overline {x})\in \omega ((y_{n},u_{n})_{n \in \mathbb {N}})$. It followsthat$\overline {x} \in crit(g)$and:

$$\lim\limits_{n \longrightarrow \infty} H(y_{n},u_{n}) = H(\overline{x},\overline{x}).$$

Consequently,

$$ \text{H is finite and constant on the set } \omega((y_{n},u_{n})_{n \in \mathbb{N}}). $$

The arguments behind the proofs of the following two facts are the same as those in Lemma 13 from [20].

Fact 9

If the assumptions from Lemma 7 hold true and if the sequence $(x_{n})_{n \in \mathbb {N}}$ is bounded, then the following conclusions hold up :

$$ \begin{array}{@{}rcl@{}} && (i) \ \omega((y_{n},u_{n})_{n \in \mathbb{N}}) \text{ is nonempty and compact} \ , \\ && (ii) \ \lim\limits_{n \longrightarrow +\infty} \text{dist}((y_{n},u_{n}),\omega((y_{n},u_{n})_{n \in \mathbb{N}})) = 0 \ . \end{array} $$

Remark 10

We emphasize that if g is coercive, that is $\lim _{\|x\|\rightarrow +\infty } g(x)=+\infty $, then g is bounded from below and $(x_{n})_{n\in \mathbb {N}},{\kern 1.7pt}(y_{n})_{n\in \mathbb {N}}$, the sequences generated by (6), are bounded.

Indeed, notice that g is bounded from below, being a continuous and coercive function (see [27]). Note that according to Theorem 4 the sequence $D{\sum }_{n=N}^{r}\|x_{n}-x_{n-1}\|^{2}$ is convergent hence is bounded. Consequently, from (17), it follows that y_r is contained for every r > N, (N is defined in the hypothesis of Theorem 4), in a lower level set of g, which is bounded. Since $(y_{n})_{n\in \mathbb {N}}$ is bounded, taking into account (19), it follows that $(x_{n})_{n\in \mathbb {N}}$ is also bounded.

Now, based on the conclusions of Lemma 7, we present a result which will be crucial later on. For our next result, ∥⋅∥₁ will denote the 1-norm and ∥⋅∥₂ will represent the 2-norm on the linear space $\mathbb {R}^{m} \times \mathbb {R}^{m}$.

Lemma 11

Let H,$(x_{n})_{n \in \mathbb {N}}$,$(y_{n})_{n \in \mathbb {N}}$,and$(u_{n})_{n \in \mathbb {N}}$beas in all the previous results, with the mapping g bounded from below. Then,the following gradient inequalities hold true:

$$ \| \nabla H(y_{n},u_{n}) \|_{2}\le \| \nabla H(y_{n},u_{n}) \|_{1} \leq \frac{1}{\beta_{n}} \cdot \| x_{n+1} - x_{n} \| + \left[ \left( \frac{\alpha_{n}}{\beta_{n}} \right) + 2 \sqrt{2\delta_{n}} \right] \cdot \| x_{n} - x_{n-1} \| \\ $$

(21)

and

$$ \| \nabla H(y_{n},u_{n}) \|_{2}^{2} \leq \frac{2}{{\beta_{n}^{2}}} \cdot \| x_{n+1} - x_{n} \|^{2} + 2 \left[ \left( \frac{\alpha_{n}}{\beta_{n}} - \sqrt{2\delta_{n}} \right)^{2} + \delta_{n} \right] \cdot \| x_{n} - x_{n-1} \|^{2} . $$

(22)

Proof

First of all note that from our numerical scheme (6) we have ${\nabla } g(y_{n})=\frac {1}{\beta _{n}}((x_{n}-x_{n+1})+\alpha _{n}(x_{n}-x_{n-1}))$. In terms of the ∥⋅∥₁ on $\mathbb {R}^{m} \times \mathbb {R}^{m}$, we have:

$$ \begin{array}{@{}rcl@{}} \| \nabla H(y_{n},u_{n}) \|_{1} &=& \| \left( \nabla g(y_{n}) + \left( y_{n} - u_{n} \right) \ , \ \left( u_{n} - y_{n} \right) \right) \|_{1} \\ &=& \| \nabla g(y_{n}) + \left( y_{n} - u_{n} \right) \| + \| u_{n} - y_{n} \| \\ & \leq& \frac{1}{\beta_{n}} \| x_{n+1} - x_{n} \| + \frac{\alpha_{n}}{\beta_{n}} \| x_{n} - x_{n-1} \| + 2 \sqrt{2\delta_{n}} \| x_{n} - x_{n-1} \|, \end{array} $$

which proves the desired inequality.

Now, with respect to the Euclidean norm, similar arguments yield:

$$ \begin{array}{@{}rcl@{}} \| \nabla H(y_{n},u_{n}) \|_{2}^{2} &=& \| \nabla g(y_{n}) + (y_{n} - u_{n}) \|^{2} + \| u_{n} - y_{n} \|^{2} \\ &=& \left\| \nabla g(y_{n}) - \sqrt{2\delta_{n}} (x_{n}-x_{n-1}) \right\|^{2} + \left( \sqrt{2\delta_{n}} \right)^{2} \cdot \| x_{n}-x_{n-1} \|^{2} \\ & \leq& \frac{2}{{\beta_{n}^{2}}} \| x_{n+1} - x_{n} \|^{2} + 2 \left\| \left( \frac{\alpha_{n}}{\beta_{n}} - \sqrt{2\delta_{n}} \right) \cdot (x_{n} - x_{n-1}) \right\|^{2} + 2 \delta_{n} \cdot \| x_{n} - x_{n-1} \|^{2}, \end{array} $$

completing the proof. □

Our main result concerning the convergence of the sequence $(x_{n})_{n \in \mathbb {N}}$ generated by the algorithm (6) to a critical point of the objective function g is the following.

Theorem 12

Consider the sequences$(x)_{n \in \mathbb {N}}$,$(y_{n})_{n \in \mathbb {N}}$generatedby the algorithm (6) and let the objectivefunction g be bounded from below. If thesequence$(x_{n})_{n \in \mathbb {N}}$isbounded and H is a KL function, then:

$$ \sum\limits_{n = 1}^{\infty} \| x_{n} - x_{n-1} \| < + \infty $$

(23)

and there exists an element $\overline {x}\in \text {crit}(g)$ such that $\lim \limits _{n \longrightarrow +\infty } x_{n} = \overline {x}$.

Proof

Consider (x̄,x̄) from the set $\omega ((y_{n},u_{n})_{n \in \mathbb {N}})$ under the assumptions of Lemma 7. It follows that x̄ ∈ crit(g). Also, using Fact 8, we get that $\lim \limits _{n \longrightarrow \infty } H(y_{n},u_{n}) = H(\Bar {x},\Bar {x})$. Furthermore, we consider two cases:

I. By using N from Theorem 4, assume that there exists n̄ ≥ N, with $\Bar {n} \in \mathbb {N}$, such that H(y_n̄,u_n̄) = H(x̄,x̄). Then, since (H(y_n,u_n))_n≥N is a decreasing sequence, it follows that:
$$ H(y_{n},u_{n}) = H(\overline{x},\overline{x}) , \text{ for every } n \geq \overline{n},$$
Now, using (16), we get that for every $n \geq \overline {n}$ we have the following inequality:
$$ 0 \leq D \| x_{n} - x_{n-1} \|^{2} \leq H(y_{n},u_{n}) - H(y_{n+1},u_{n+1}) = H(\Bar{x},\Bar{x}) - H(\Bar{x},\Bar{x}) = 0. $$
So, the sequence $(x_{n})_{n \geq \overline {n}}$ is constant and the conclusion holds true.
II. Now, we deal with the case when $H(y_{n},u_{n}) > H(\overline {x},\overline {x})$, for every n ≥ N.

So, consider the set ${\Omega } := \omega ((y_{n},u_{n})_{n \in \mathbb {N}})$. From Fact 9, we have that the set Ω is nonempty and compact. Also, Fact 8 assures that mapping H is constant on Ω. From the hypotheses of the theorem, we have that H is a KL function. So, according to Lemma 3, there exists ε > 0, η > 0 and a function φ ∈Θ_η, such that for all the points (z,w) from the set:

$$ \lbrace (z,w) \in \mathbb{R}^{m} \times \mathbb{R}^{m} \ : \ dist((z,w),{\Omega}) < \varepsilon \rbrace \cap \lbrace (z,w) \in \mathbb{R}^{m} \times \mathbb{R}^{m} \ : \ H(\Bar{x},\Bar{x}) < H(z,w) < \eta + H(\Bar{x},\Bar{x}) \rbrace \ $$

one has that:

$$ \varphi^{\prime} (H(z,w)-H(\Bar{x},\Bar{x})) \cdot \| \nabla H(z,w) \| \geq 1. $$

On the other hand, using Fact 9, we obtain that $\lim \limits _{n \longrightarrow +\infty } dist((y_{n},u_{n}),{\Omega }) = 0$. This means that there exists an index $n_{1} \in \mathbb {N}$, for which it is valid:

$$ \text{dist}((y_{n},u_{n}),{\Omega}) < \varepsilon \ , \ \text{ for all } n \geq n_{1}. $$

Let us introduce the notation:

$$r_{n}:=H(y_{n},u_{n})-H(\overline{x},\overline{x}).$$

Because

$$ \lim\limits_{n \longrightarrow +\infty} r_{n}=0 $$

and since

$$ r_{n}>0 \ , \ \text{ for all } n \geq N $$

then there exists another index n₂ ≥ N, such that:

$$0 < r_{n} < \eta \ , \ \text{ for every } n \geq n_{2}. $$

Taking n̄ := max(n₁,n₂) we get that for each n ≥n̄ it follows:

$$ \varphi^{\prime}(r_{n}) \cdot \| \nabla H(y_{n},u_{n}) \| \geq 1. $$

Since the function φ is concave, we have:

$$ \varphi(r_{n}) - \varphi(r_{n+1})\geq\varphi^{\prime}(r_{n}) \cdot (r_{n}-r_{n+1}). $$

Thus, the following relation takes place for each n ≥n̄:

$$ \varphi(r_{n}) - \varphi(r_{n+1}) \geq \frac{r_{n}-r_{n+1}}{\| \nabla H(y_{n},u_{n}) \|}.$$

On one hand, combining the inequality (16) and (21), it follows that for every n ≥n̄

$$ \varphi(r_{n}) - \varphi(r_{n+1}) \geq\frac{D \| x_{n} - x_{n-1} \|^{2}}{\frac{1}{\beta_{n}} \| x_{n} - x_{n+1} \| + \left[ \frac{\alpha_{n}}{\beta_{n}} + 2 \sqrt{2\delta_{n}} \right] \cdot \| x_{n} - x_{n-1} \|}. $$

(24)

On the other hand, we know that the sequences $(\alpha _{n})_{n \in \mathbb {N}}$, $(\beta _{n})_{n \in \mathbb {N}}$ and $(\delta _{n})_{n \in \mathbb {N}}$ are convergent, and $\lim _{n\longrightarrow +\infty }\beta _{n}=\beta >0$, hence $\left (\frac {1}{\beta _{n}} \right )_{n \in \mathbb {N}}$ and $\left (\frac {\alpha _{n}}{\beta _{n}} + 2 \sqrt {2\delta _{n}} \right )_{n \in \mathbb {N}}$ are bounded. This shows that there exists $\Bar {N} \in \mathbb {N},{\kern 1.7pt}\Bar {N} \geq \Bar {n}$ and there exists M > 0, such that:

$$ \sup_{n\ge\overline N} \left\lbrace \frac{1}{\beta_{n}} \ , \ \frac{\alpha_{n}}{\beta_{n}} + 2 \sqrt{2\delta_{n}} \right\rbrace \leq M. $$

Thus, the inequality (24) becomes:

$$ \varphi(r_{n}) - \varphi(r_{n+1}) \geq\frac{D \| x_{n} - x_{n-1} \|^{2}}{ M \left( \| x_{n} - x_{n+1} \| + \| x_{n} - x_{n-1} \| \right)} \ , $$

(25)

for every n ≥N̄. This implies that for each n ≥N̄, the following inequality holds:

$$ \| x_{n} - x_{n-1} \| \leq\sqrt{\frac{M}{D} \cdot \left( \varphi(r_{n}) - \varphi(r_{n+1}) \right) \cdot \left( \| x_{n} - x_{n+1} \| + \| x_{n} - x_{n-1} \| \right)}. $$

From the well-known arithmetical-geometrical inequality, it follows that:

$$ \begin{array}{@{}rcl@{}} &\sqrt{\frac{M}{D} \cdot \left( \varphi(r_{n}) - \varphi(r_{n+1}) \right) \cdot \left( \| x_{n} - x_{n+1} \| + \| x_{n} - x_{n-1} \| \right)} \\ & \leq\frac{\| x_{n+1}-x_{n} \| + \| x_{n} - x_{n-1} \|}{3} + \frac{3M}{4D} \cdot \left( \varphi(r_{n}) - \varphi(r_{n+1}) \right). \end{array} $$

Therefore, we obtain:

$$ \| x_{n} - x_{n-1} \| \leq\frac{\| x_{n+1}-x_{n} \| + \| x_{n} - x_{n-1} \|}{3} + \frac{3M}{4D} \cdot \left( \varphi(r_{n}) - \varphi(r_{n+1}) \right). $$

Consequently, we have:

$$ 2 \| x_{n} - x_{n-1} \| - \| x_{n} - x_{n+1} \| \leq\frac{9M}{4D} \cdot \left( \varphi(r_{n}) - \varphi(r_{n+1}) \right), $$

(26)

for every $n \in \mathbb {N}$, with n ≥N̄. Now, by summing up the latter inequality from $\overline N$ to $P\ge \overline N$, we get that:

$$ \sum\limits_{n=\Bar{N}}^{P} \| x_{n} - x_{n-1} \| \leq \| x_{P+1} - x_{P} \| - \| x_{\Bar{N}} - x_{\Bar{N}-1} \| + \frac{9M}{4D} \cdot \left( \varphi(r_{\Bar{N}}) - \varphi(r_{P+1}) \right). $$

Now, it is time to use the fact that φ(0) = 0. In this setting, by letting P→ + ∞ and by using (19) we obtain:

$$ \sum\limits_{n=\Bar{N}}^{\infty} \| x_{n} - x_{n-1} \| \leq - \| x_{\Bar{N}} - x_{\Bar{N}-1} \| + \frac{9M}{4D} \varphi(r_{\Bar{N}}) <+ \infty. $$

It implies that:

$$ \sum\limits_{n = 1}^{\infty} \| x_{n} - x_{n-1} \| < +\infty, $$

so the first part of the proof is done. At the same time, the sequence $(S_{n})_{n \in \mathbb {N}}$, defined by:

$$ S_{n} = \sum\limits_{i=1}^{n} \| x_{i} - x_{i-1} \| $$

is Cauchy. Thus, for every ε > 0, there exists a positive integer number N_ε, such that for each n ≥ N_ε and for all $p \in \mathbb {N}$, one has:

$$ S_{n+p} - S_{n} \leq \varepsilon.$$

Furthermore,

$$ S_{n+p} - S_{n} = \sum\limits_{i=n+1}^{n+p} \| x_{i} - x_{i-1} \| \geq \left\| \sum\limits_{i=n+1}^{n+p} \left( x_{i} - x_{i-1} \right) \right\| = \| x_{n+p} - x_{n} \|. $$

So, the sequence $(x_{n})_{n \in \mathbb {N}}$ is Cauchy hence is convergent, i.e., there exists $x \in \mathbb {R}^{m}$, such that:

$$ \lim\limits_{n \longrightarrow +\infty} x_{n} = x. $$

Thus, by using (i) of Lemma 7, it follows that:

$$ \lbrace x \rbrace = \omega((x_{n})_{n \in \mathbb{N}}) \subseteq \text{crit}(g) \ , $$

which leads to the conclusion of the second part of the present theorem. □

Remark 13

Since the class of semi-algebraic functions is closed under addition (see for example [7]), and $(x,y) \mapsto \frac 12\|x-y\|^{2}$ is semi-algebraic, the conclusion of the previous theorem holds if the condition that H is a KL function is replaced by the assumption that g is semi-algebraic.

Note that, according to Remark 10, the conclusion of Theorem 12 remains valid if we replace in its hypotheses that the conditions that g is bounded from below and $(x_{n})_{n\in \mathbb {N}}$ is bounded by the condition that g is coercive.

Finally, observe that under the assumptions of Theorem 12, we have $\lim _{n\longrightarrow +\infty }$y_n = x and

$$\lim_{n\longrightarrow+\infty}g(x_{n})=\lim_{n\longrightarrow+\infty}g(y_{n})=g(x).$$

3 Convergence rates

In the following theorem, we provide convergence rates for the sequence generated by (6), but also for the function values, in terms of the Łojasiewicz exponent of H (see also, [2, 8, 14, 20]). More precisely we obtain finite, linear and sublinear convergence rates, depending the Łojasiewicz exponent of H, 𝜃 is 0, or 𝜃 belongs to $\left (0,\frac 12\right ]$, or $\theta \in \left (\frac 12,1\right )$, respectively. Note that the forthcoming results remain valid if one replace in their hypotheses the conditions that g is bounded from below and $(x_{n})_{n\in \mathbb {N}}$ is bounded by the condition that g is coercive.

The following lemma was established in [14] and will be crucial in obtaining our convergence rates.

Lemma 14 (14 Lemma 15)

Let$(e_{n})_{n\ge \overline n},{\kern 1.7pt}\overline n\in \mathbb {N}$bea monotonically decreasing positive sequence converging to0. Assume further that there exist the natural numbersl₀ ≥ 1 and$n_{0}\ge \overline {n}+l_{0}$suchthat for everyn ≥ n₀one has:

$$ e_{n-l_{0}}-e_{n}\ge C_{0} e_{n}^{2\theta} $$

(27)

where C₀ > 0 is some constant and 𝜃 ∈ [0,1). Then, following statements are true:

(i)
If 𝜃 = 0, then $(e_{n})_{n\ge \overline n}$ converges in finite time;
(ii)
If $\theta \in \left (0,\frac 12\right ]$, then there exists C₁ > 0 and Q ∈ [0,1), such that for every n ≥ n₀
$$e_{n}\le C_{1} Q^{n};$$
(iii)
If $\theta \in \left [\frac 12,1\right )$, then there exists C₂ > 0, such that for every n ≥ n₀ + l₀
$$e_{n}\le C_{2}(n-l_{0}+1)^{-\frac{1}{2\theta-1}}.$$

In the proof of the following theorem, we use Lemma 14 (see also [2]).

Theorem 15

In the settings of problem (P) consider thesequences$(x_{n})_{n\in \mathbb {N}},{\kern 1.7pt}(y_{n})_{n\in \mathbb {N}}$generatedby Algorithm (6). Assume that g is bounded from below andthat$(x_{n})_{n\in \mathbb {N}}$is bounded. Supposethat

$$H:\mathbb{R}^{m} \times \mathbb{R}^{m} \longrightarrow \mathbb{R},{\kern1.7pt}H(x,y)=g(x)+\frac12\|x-y\|^{2}$$

fulfills the Łojasiewicz property with Łojasiewicz constant K and Łojasiewicz exponent 𝜃 ∈ [0,1) and let $\lim _{n\longrightarrow +\infty }x_{n}=\overline {x}$. Then, the following statements hold true:

If 𝜃 = 0, then the sequences

(a₀):: $(g(y_{n}))_{n\in \mathbb {N}},(g(x_{n}))_{n\in \mathbb {N}},(y_{n})_{n\in \mathbb {N}}$ and $(x_{n})_{n\in \mathbb {N}}$ converge in a finite number of steps;

If $\theta \in \left (0,\frac 12\right ]$, then there exist Q ∈ (0,1), a₁,a₂,a₃,a₄ > 0 and $\overline k\in \mathbb {N}$ such that:

(a₁):: $g(y_{n})-g(\overline x)\le a_{1} Q^{n}$ for every $n\ge \overline k$,
(a₂):: $g(x_{n})-g(\overline x)\le a_{2} Q^{n}$ for every $n\ge \overline k$,
(a₃):: $\|x_{n}-\overline x\|\le a_{3} Q^{\frac {n}{2}}$ for every $n\ge \overline k$,
(a₄):: $\|y_{n}-\overline x\|\le a_{4} Q^{\frac {n}{2}}$ for all $n\ge \overline k$;

If $\theta \in \left (\frac 12,1\right )$ then there exist b₁,b₂,b₃,b₄ > 0 and $\overline k\in \mathbb {N}$ such that:

(b₁):: $g(y_{n})-g(\overline x)\le b_{1} {n^{-\frac {1}{2\theta -1}}},\text { for all } n\ge \overline k$,
(b₂):: $g(x_{n})-g(\overline x)\le b_{2} {n^{-\frac {1}{2\theta -1}}},\text { for all } n\ge \overline k$,
(b₃):: $\|x_{n}-\overline x\|\le b_{3} {n^{\frac {\theta -1}{2\theta -1}}},\text { for all } n\ge \overline k$,
(b₄):: $\|y_{n}-\overline x\|\le b_{4} {n^{\frac {\theta -1}{2\theta -1}}},\text { for all } n\ge \overline k$.

Proof

We start by employing the ideas from the proof of Theorem 12, namely if there exists $\overline {n} \in \mathbb {N}$, with $\overline {n} \geq N$, for which one has that:

$$ H(y_{\overline{n}},u_{\overline{n}}) = H(\overline{x},\overline{x}), $$

then it follows that the sequence $(x_{n})_{n \ge \overline n}$ is constant. This leads to the fact that the sequence $(y_{n})_{n \ge \overline n}$ is also constant. Furthermore,

$$ H(y_{n},u_{n}) = H(\overline{x},\overline{x})\quad \text{for all}\quad n \geq \overline{n} \ . $$

That is, if the regularized energy is constant after a certain number of iterations, one can see that the conclusion follow in a straightforward way. Now, we can easily assume that:

$$H(y_{n},u_{n}) > H(\overline{x},\overline{x})\quad \text{for all}\quad n \geq N.$$

In order to simplify notations, we will use

$$H_{n}:=H(y_{n},u_{n}),\quad\overline{H}:=H(\overline{x},\overline{x})\quad\text{and}\quad {\nabla} H_{n}:={\nabla} H(y_{n},u_{n}).$$

Our analysis aims to apply Lemma 14 for

$$r_{n} :=H_{n} - \overline{H} = H(y_{n},u_{n})-H(\overline{x},\overline{x})>0$$

based on three previously established fundamental relations:

(a)
The energy decay relation (16), for every n ≥ N
$$H_{n}-H_{n+1} \ge D \| x_{n} - x_{n-1} \|^{2},$$
(b)
The energy-gradient estimate (22), for every $n\in \mathbb {N}$
$$\| {\nabla} H_{n} \|^{2} \leq \frac{2}{{\beta_{n}^{2}}}\| x_{n+1} - x_{n} \|^{2} + S_{n} \| x_{n} - x_{n-1} \|^{2}$$
where
$$S_{n}= 2 \cdot \left[ \left( \frac{\alpha_{n}}{\beta_{n}} - \sqrt{2\delta_{n}} \right)^{2} + \delta_{n} \right] ,$$
(c)
The Łojasiewicz inequality (8), for every $n\ge \overline N_{1}$
$$(H_{n} - \overline{H})^{2\theta} \leq K^{2}\| {\nabla} H_{n} \|^{2},$$
where $\overline N_{1}\ge N$ is defined by the Łojasiewicz property of H at $(\overline x,\overline x)$ such that for ε > 0 one has
$$\|(y_{n},u_{n})-(\overline x,\overline x)\|<\varepsilon$$
for all $n\ge \overline N_{1}$.

By combining these three inequalities, one reaches

$$(H_{n} - \overline{H})^{2\theta} \leq \frac{2 K^{2}}{{\beta^{2}_{n}} D}(H_{n+1}-H_{n+2}) + \frac{K^{2} S_{n}}{D}(H_{n}-H_{n+1})$$

and we are led to a nonlinear second-order difference inequality

$$ r_{n}^{2\theta} \leq \frac{2 K^{2}}{{\beta^{2}_{n}} D}(r_{n+1}-r_{n+2}) + \frac{K^{2} S_{n}}{D}(r_{n}-r_{n+1}), $$

(28)

for every $n\ge \overline N_{1}$.

Using the fact that the positive sequence (r_n)_n≥N is decreasing we have that $r_{n+2}^{2\theta } \leq r_{n}^{2\theta } $ for all n ≥ N. Further, since the sequences $\left (\frac {2 K^{2}}{{\beta ^{2}_{n}} D}\right )_{n\in \mathbb {N}}$ and $\left (\frac {K^{2} S_{n}}{D}\right )_{n\in \mathbb {N}}$ converge and have positive limit, there exists C > 0 such that for all $n\ge \overline N_{1}$ one has

$$ \frac{2 K^{2}}{{\beta^{2}_{n}} D}(r_{n+1}-r_{n+2}) + \frac{K^{2} S_{n}}{D}(r_{n}-r_{n+1})\le C(r_{n}-r_{n+2}).$$

In the view of these observations, (28) becomes

$$ C_{0} r_{n+2}^{2\theta} \leq r_{n}-r_{n+2}, $$

(29)

for every $n\ge \overline N_{1}$, where $C_{0}=\frac 1C$.

Now we can apply Lemma 14 by observing that (29) is nothing else that (27) in Lemma 14, with $e_{n}=r_{n+2},{\kern 1.7pt}\overline {n}=N-2,{\kern 1.7pt}l_{0}=2$ and $n_{0}=\overline {N}_{1}-2$. Hence, by taking into account that r_n > 0 for all n ≥ N, that is, in the conclusion of Lemma 14 (ii) one has Q≠ 0, we have:

(p0)
If 𝜃 = 0, then (r_n)_n≥N converges in finite time;
(p1)
If $\theta \in \left (0,\frac 12\right ]$, then there exists C₁ > 0 and Q ∈ (0,1), such that for every $n\ge \overline N_{1}$
$$r_{n}\le C_{1} Q^{n};$$
(p2)
If $\theta \in \left [\frac 12,1\right )$, then there exists C₂ > 0, such that for every $n\ge \overline N_{1}+2$
$$r_{n}\le C_{2}(n-3)^{-\frac{1}{2\theta-1}}.$$

(a ₀ ). :

We treat first the case 𝜃 = 0. Then, according to (p0), $r_{n} = g(y_{n}) - g(\bar {x}) + \delta _{n} \| x_{n} - x_{n-1} \|^{2}$ converges in finite time. But then r_n − r_n+ 1 = 0 for n big enough, hence (16) implies that x_n = x_n− 1 for n big enough, consequently y_n = x_n for n big enough, thus $(x_{n})_{n\in \mathbb {N}},{\kern 1.7pt} (y_{n})_{n\in \mathbb {N}}$ converge in finite time. The above results show immediately that $\left (g(x_{n})\right )_{n\in \mathbb {N}},{\kern 1.7pt}\left (g(y_{n})\right )_{n\in \mathbb {N}}$ converge in finite time.

Assume now that $\theta \in \left (0,\frac 12\right ]$.

(a ₁ ). :

According to (p1) there exists C₁ > 0 and Q ∈ (0,1), such that for every $n\ge \overline N_{1}$ one has:

$$ r_{n}=g(y_{n}) - g(\bar{x}) + \delta_{n} \| x_{n} - x_{n-1} \|^{2}\le C_{1} Q^{n}. $$

(30)

Hence,

$$ g(y_{n}) - g(\bar{x})\le a_{1} Q^{n}, $$

(31)

for all $n\ge \overline N_{1}$, where a₁ = C₁.

(a ₂ ). :

In order to give an upper bound for the difference $g(x_{n}) - g(\bar {x})$, we consider the following chain of inequalities based upon Lemma 2:

$$ \begin{array}{@{}rcl@{}} g(x_{n}) - g(y_{n}) & \leq& \langle{ \nabla g(y_{n}) , x_{n} - y_{n} }\rangle + \frac{L_{g}}{2} \| x_{n} - y_{n} \|^{2} \\ & =& \left\langle\frac{1}{\beta_{n}} (y_{n} - x_{n+1}) , -\alpha_{n} (x_{n} - x_{n-1}) \right\rangle + \frac{L_{g}}{2} \| x_{n} - y_{n} \|^{2} \\ &=& \frac{1}{\beta_{n}} \langle{ x_{n+1} - x_{n} , \alpha_{n} (x_{n}-x_{n-1}) }\rangle - {\alpha_{n}^{2}} \frac{2-\beta_{n} L_{g}}{2\beta_{n}} \| x_{n} - x_{n-1} \|^{2}. \end{array} $$

Here, using the inequality:

$$ \langle{ x_{n+1} - x_{n} , \alpha_{n} (x_{n} - x_{n-1}) }\rangle \leq \frac{1}{2} \left[ \frac{1}{2-\beta_{n} L_{g}} \| x_{n+1} - x_{n} \|^{2} + (2-\beta_{n} L_{g}) {\alpha_{n}^{2}} \| x_{n} - x_{n-1} \|^{2} \right], $$

leads, after some simplifications, to:

$$ g(x_{n}) - g(y_{n}) \leq \frac{1}{2 \beta_{n} (2-\beta_{n} L_{g})} \| x_{n+1} - x_{n} \|^{2},\text{ for all } n\in\mathbb{N} . $$

By combining the inequality (16) with the fact that the sequence (g(y_n) + δ_n∥x_n − x_n− 1∥²)_n≥N is decreasing and converges to $g(\overline {x})$, one obtains:

$$ g(x_{n}) - g(y_{n}) \leq \frac{1}{2D \beta_{n} (2-\beta_{n} L_{g})} r_{n+1} \ , \ \text{ for all } n \geq N. $$

(32)

From (30), we have r_n+ 1 ≤ C₁Q^n+ 1 ≤ C₁Qⁿ for all $n\ge \overline N_{1}$, hence:

$$ g(x_{n}) - g(y_{n}) \leq \frac{1}{2D \beta_{n} (2-\beta_{n} L_{g})} C_{1} Q^{n}, \text{ for all } n \geq \overline N_{1}.$$

This means that for every $n \ge \overline N_{1}$ one has:

$$ g(x_{n}) - g(\bar{x})=(g(x_{n})-g(y_{n}))+(g(y_{n})-g(\overline x)) \leq C_{1} \left[\frac{1}{2D \beta_{n} (2-\beta_{n} L_{g})} +1\right]Q^{n}. $$

Since the sequence $(\beta _{n})_{n \in \mathbb {N}}$ is convergent to β > 0, we can choose:

$$a_{2}=C_{1}\sup_{n\ge\overline N_{1}} \frac{1}{2D \beta_{n} (2-\beta_{n} L_{g})}+C_{1}$$

and we have

$$ g(x_{n}) - g(\bar{x}) \leq a_{2} Q^{n} \ , \text{ for every } n \geq \overline N_{1}. $$

(33)

(a ₃ ). :

We continue the proof by establishing an estimate for $\| x_{n} - \overline {x} \|$. By the triangle inequality and by summing up (26) from $n\ge \overline N\ge \overline N_{1}$ to P > n, (where $\overline N$ was defined in the proof of Theorem 12), one has:

$$ \begin{array}{@{}rcl@{}} \| x_{P} - x_{n-1} \| &\leq& \sum\limits_{k=n}^{P} \| x_{k}-x_{k-1} \| \\ &\leq& -\| x_{n} - x_{n-1} \| + \| x_{P+1} - x_{P} \| + \frac{9M}{4D} \left[\varphi (H_{n} - \overline{H}) - \varphi (H_{P+1} - \overline{H})\right], \end{array} $$

so, letting P→∞ gives:

$$ \| x_{n-1} - \bar{x} \| \leq \frac{9M}{4D} \varphi(H_{n} - \overline{H}). $$

Recall, however, that the desingularizing function is $\varphi (t) = \frac {K}{1-\theta } t^{1-\theta }$ hence,

$$ \| x_{n-1} - \bar{x} \| \leq M_{1} r_{n}^{1-\theta}, $$

(34)

for every $n\ge \overline N$, where $M_{1}=\frac {9M K}{4D(1-\theta )}$.

Further, since r_n converges to 0 one has 0 < r_n < 1 for n big enough, hence $r_{n}^{1-\theta }\leq \sqrt {r_{n}}$ holds for 𝜃 ∈ (0,1/2], if n is big enough. By using (30), we conclude that there exists $\overline N_{2}\ge \overline N$ such that:

$$ \| x_{n} - \bar{x} \| \leq M_{1} \sqrt{r_{n+1}}\le M_{1} \sqrt{r_{n}} \leq a_{3} Q^{\frac{n}{2}}, \text{ for every } n\ge \overline N_{2}, $$

(35)

where $a_{3} := \sqrt {C_{1}} M_{1}$.

(a ₄ ). :

Finally, we conclude this part of the proof by deducing an upper bound for $\| y_{n} - \overline {x} \|$. The following inequalities hold true:

$$ \begin{array}{@{}rcl@{}} \| y_{n} - \overline{x} \| &=& \left\| x_{n} + \alpha_{n}(x_{n} - x_{n-1}) - \bar{x} \right\| \leq |1+\alpha_{n}| \cdot \| x_{n} - \overline{x} \| + |\alpha_{n}| \cdot \| x_{n-1} - \overline{x} \| \\ &\leq& (1+ | \alpha_{n} |) a_{3} Q^{\frac{n}{2}} + | \alpha_{n} | a_{3} Q^{-\frac12}Q^{\frac{n}{2}} \leq (1+ | \alpha_{n} |+Q^{-\frac12}| \alpha_{n} |)a_{3}Q^{\frac{n}{2}}, \end{array} $$

for all $n\ge \overline N_{2}+1$. Let $a_{4} =\sup _{n\ge \overline N_{2}+1}(1+ | \alpha _{n} |+Q^{-\frac 12}| \alpha _{n} |)a_{3}> 0$. Then,

$$ \| y_{n} - \bar{x} \| \leq a_{4} Q^{\frac{n}{2}}, \text{ for all } n\ge\overline N_{2}+1. $$

(36)

Now, if we take $\overline k=\max \{\overline N_{1}, \overline N_{2}+1\}$ then (31), (33), (35), and (36) lead to the conclusions (a₁)-(a₄).

Finally, assume that $\theta \in \left (\frac 12,1\right )$.

(b ₁ ). :

According to (p2) there exists C₂ > 0, such that for every $n\ge \overline N_{1}+2$ one has

$$ r_{n}=g(y_{n}) - g(\bar{x}) + \delta_{n} \| x_{n} - x_{n-1} \|^{2}\le C_{2}(n-3)^{-\frac{1}{2\theta-1}}. $$

(37)

Consequently,

$$g(y_{n}) - g(\bar{x})\le C_{2}(n-3)^{-\frac{1}{2\theta-1}}=C_{2}\left( \frac{n}{n-3}\right)^{\frac{1}{2\theta-1}}n^{-\frac{1}{2\theta-1}}$$

for every $n\ge \overline N_{1}+2$. Hence, we have

$$ g(y_{n}) - g(\bar{x})\le b_{1} n^{-\frac{1}{2\theta-1}}, $$

(38)

for every $n\ge \overline N_{1}+2$, where $b_{1}=C_{2}\sup _{n\ge \overline N_{1}+2}\left (\frac {n}{n-3}\right )^{\frac {1}{2\theta -1}}$.

The other claims now follow quite easily.

(b ₂ ). :

Indeed, note that (32) holds for every $n\ge \overline N_{1}$, hence:

$$ g(x_{n}) - g(y_{n}) \leq \frac{1}{2 D \beta_{n} (2-\beta_{n} L_{g})} r_{n+1}\le\frac{1}{2 D \beta_{n} (2-\beta_{n} L_{g})}b_{1} (n+1)^{\frac{-1}{2\theta-1}}. $$

Therefore, one obtains:

$$ g(x_{n}) - g(\overline{x}) = (g(x_{n}) - g(y_{n})) + (g(y_{n})-g(\overline{x})) \leq \left( \frac{1}{2 D \beta_{n} (2-\beta_{n} L_{g})}b_{1}+b_{1}\right)n^{\frac{-1}{2\theta-1}}, $$

for every $n\ge \overline N_{1}+2$. Let $b_{2}=\sup _{n\ge \overline N_{1}+2}\left (\frac {1}{2 D \beta _{n} (2-\beta _{n} L_{g})}b_{1}+b_{1}\right )$. Then

$$ g(x_{n}) - g(\overline{x})\le b_{2} n^{\frac{-1}{2\theta-1}}, $$

(39)

for every $n\ge \overline N_{1}+2$.

(b ₃ ). :

For proving (b₃), we use (34) again, and we have that for all $n\ge \overline N\ge \overline N_{1}+2$ it holds

$$ \| x_{n} - \bar{x} \|\le M_{1} r_{n+1}^{1-\theta}\le M_{1} r_{n}^{1-\theta}\le M_{1} \left( b_{1} n^{\frac{-1}{2\theta-1}}\right)^{1-\theta}.$$

Let $b_{3}=M_{1}b_{1}^{1-\theta }$. Then,

$$ \| x_{n} - \bar{x} \|\le b_{3} n^{\frac{\theta-1}{2\theta-1}}, $$

(40)

for all $n\ge \overline N$.

(b ₄ ). :

The final estimate is a straightforward consequence of the definition of y_n and the above estimates. Indeed, for all $n\ge \overline N+1$ one has:

$$ \| y_{n} - \overline{x} \| = \left\| x_{n} + \alpha_{n}(x_{n} - x_{n-1}) - \bar{x} \right\| \leq |1+\alpha_{n}| \cdot \| x_{n} - \overline{x} \| + |\alpha_{n}| \cdot \| x_{n-1} - \overline{x} \|$$

$$ \leq (1+ | \alpha_{n} |) b_{3} n^{\frac{\theta-1}{2\theta-1}} + | \alpha_{n} | b_{3} (n-1)^{\frac{\theta-1}{2\theta-1}} \leq (1+ 2| \alpha_{n} |) b_{3} (n-1)^{\frac{\theta-1}{2\theta-1}}. $$

Let $b_{4} =\sup _{n\ge \overline N+1}(1+ 2| \alpha _{n} |)b_{3}\left (\frac {n}{n-1}\right )^{\frac {1-\theta }{2\theta -1}}> 0$. Then,

$$ \| y_{n} - \bar{x} \| \leq b_{4} n^{\frac{\theta-1}{2\theta-1}}, \text{ for all } n\ge\overline N+1. $$

(41)

Now, if we take $\overline k=\overline N+1$ then (38), (39), (40), and (41) lead to the conclusions (b₁)-(b₄).

□

Remark 16

According to [21], H has the Łojasiewicz property with Łojasiewicz exponent $\theta \in \left [\frac 12,1\right )$, whenever g has the Łojasiewicz property with Łojasiewicz exponent $\theta \in \left [\frac 12,1\right )$. Therefore, one obtains the same convergence rates if in the hypotheses of Theorem 15 one assumes that g has the Łojasiewicz property with Łojasiewicz exponent $\theta \in \left [\frac 12,1\right )$.

4 Numerical experiments

The aim of this section is to support the analytic results of the previous sections by numerical experiments and to highlight some interesting features of the generic algorithm (6).

4.1 Comparing Algorithm (6) with some algorithms from the literature by using different step sizes

In our first experiment, let us consider the convex function:

$$g:\mathbb{R}^{2}\longrightarrow\mathbb{R},{\kern1.7pt}g(x,y)=0.02x^{2} + 0.005y^{2}.$$

Based on the boundedness of its Hessian, we infer that the Lipschitz constant of its gradient is L_g = 0.2. Obviously, g is strongly convex and its global minimum is (0,0).

In order to give a better perspective on the advantages and disadvantages of algorithm (6) for different choices of step sizes and inertial coefficients, in our first numerical experiment, we compare the following:

(a)
The proposed algorithm (6) with inertial parameter $\alpha _{n} = -0.01 \cdot \frac {n}{n+3}$, which shows that $\alpha =-0.01\in \left (\frac {-10+\sqrt {68}}{8} \ , 0 \right )$, and constant step size $\beta _{n} = \beta = 9 \in \left (0 \ , \ \frac {4 \alpha ^{2} + 10 \alpha + 2}{L_{g} \left (2 \alpha + 1 \right )^{2}} \right )$;
(b)
The proposed algorithm (6) with inertial parameter $\alpha _{n} = -0.01 \cdot \frac {n}{n+3}$ and increasing step size $\beta _{n} = 9 \cdot \frac {n + 1}{n+2}$;
(c)
The proposed algorithm (6) with inertial parameter $\alpha _{n} = -0.01 \cdot \frac {n}{n+3}$ and decreasing step size $\beta _{n} = 9 \cdot \frac {n + 3}{n+2}$;
(d)
Polyak’s algorithm (1) with inertial parameter $\alpha _{n} = 0.6\cdot \frac {n}{n+2}$ and constant step size $\beta _{n} = \frac {2}{L_{g}}=10$;
(e)
Polyak’s algorithm (1) with decreasing inertial parameter $\alpha _{n} = 0.7 \cdot \frac {n+2}{n+1.5}$ and increasing step size $\beta _{n} = \frac {2(1-\alpha _{n})\cdot 0.9}{L_{g}}= 9 \cdot \frac {0.3n +0.1}{n+1.5}$;
(f)
Nesterov algorithm (4) with maximal admissible step size $s=\frac {1}{L_{g}}=5$;
(g)
The Nesterov-like algorithm (5) with inertial parameter $0.6 \cdot \frac {n}{n+3}$, and step size $s = \frac {2(1-0.6)}{L_{g}}=4$.

The choices of inertial coefficients and step sizes are motivated by theoretical results in [18, 29] and [20]. We consider the starting points x₀ = x_− 1 = (3,1) and run the simulations until the error |g(x_n+ 1) − g(x_n)| attains the value 10^− 15. These results are shown in Fig. 1, where the horizontal axis measures the number of iterations and the vertical axis shows the error |g(x_n+ 1) − g(x_n)|. The experiment depicted in Fig. 1 shows that Algorithm (6) has the best behavior when we choose a decreasing step size (red square in Fig. 1). This instance outperforms those obtained with the same Algorithm (6) but with constant step size (red star in Fig. 1) and even more so fo increasing step sizes (by red circle in Fig. 1). Further, it should be noted that the Algorithm (6), in all its instances, outperforms Algorithm (5) (green line in Fig. 1), Algorithm (1) with a constant step size (yellow line in Fig. 1) or variable step size (black line in Fig. 1) and also Nesterov’s Algorithm (4) (blue line in Fig. 1).

4.2 The analysis of the behavior of Algorithm (6) via different inertial values and step sizes

In the second set of numerical experiments, we analyze the behavior of our algorithm with different inertial values and different step sizes in a noncovex setting. Our experiments suggest that in order to obtain faster convergence one should use in Algorithm (6) decreasing step size β_n and one should have a sequence of inertial parameter whose limit is as close to 0 as possible (see Fig. 2).

First, consider the Rastrigin function (see [25]):

$$ g:\mathbb{R}^{2}\longrightarrow\mathbb{R},{\kern1.7pt} g(x,y) = 20 + x^{2}- 10\cos(2\pi x) + y^{2}-10\cos(2\pi y) $$

which is nonconvex. For the initial values x₀ = x_− 1 = (0.9,0.9), we run Algorithm (6), with the constant step size β_n = β = 0.001 (yellow circle in Fig. 2a), then with decreasing step size $\beta _{n}=0.001 \cdot \frac {n + 4} {n+ 3}$ (green arrow in Fig. 2a) and then with increasing step size $\beta _{n}=0.001 \cdot \frac {n + 2} {n+ 3}$ (red star in Fig. 2a). Meanwhile, the inertial parameter is taken to be $\alpha _{n}=-0.1 \cdot \frac {n}{n+3}$ with simulations running until the error|g(x_n+ 1) − g(x_n)| attains 10^− 15. The results are shown in Fig. 2a, where the horizontal axis measures the number of iterations and the vertical axis shows the error in terms of iterates.

Next, consider the convex quadratic function $g:\mathbb {R}^{2}\longrightarrow \mathbb {R},{\kern 1.7pt}g(x,y)=0.02x^{2} + 0.005y^{2}$ together with initial values x₀ = x_− 1 = (3,1) and an inertial parameter $\alpha _{n}=-0.01 \cdot \frac {n}{n+1}$. The three instances of our algorithm: with constant step size β_n = β = 8 (yellow circle in Fig. 2b), decreasing step size $\beta _{n}=8 \cdot \frac {n + 4} {n+ 3}$ (green arrow in Fig. 2b) and finally with nondecreasing step size $\beta _{n}=8 \cdot \frac {n + 2} {n+ 3}$ (red star in Fig. 2b), are compared and results are shown in Fig. 2b, where the horizontal axis measures the number of iterations and the vertical axis shows the error in terms of iterates.

We also consider different initial values for Algorithm (6), namely x₀ = x_− 1 = (0.9,0.9) together with a fixed step size β_n = β = 0.001 and the inertial parameters $\alpha _{n}=-0.2 \cdot \frac {n}{n+3}$ (red circle in Fig. 2c), $\alpha _{n}=-0.1 \cdot \frac {n}{n+3}$ (yellow star Fig. 2c), and $\alpha _{n}=-0.001 \cdot \frac {n}{n+3}$ (green arrow Fig. 2c). The result when the objective function g is the Rastrigin function is shown in Fig. 2c.

Finally, we consider the same inertial values as before for Algorithm (6), but we take the convex objective function $g:\mathbb {R}^{2}\longrightarrow \mathbb {R},{\kern 1.7pt}g(x,y)=0.02x^{2} + 0.005y^{2}$ and the fixed step size β_n = β = 8, see Fig. 2d.

4.3 Comparing Algorithm (6) with known algorithms by using some test functions for optimization

Since Algorithm (6) is new in the literature, it is worthwhile to compare with known algorithms using some so-called test functions for optimization (see [25]). In these experiments, we run the algorithms until the error |g(x_n+ 1) − g(x_n)| attains the value 10^− 15. These results are shown in Fig. 3a–d, where the horizontal axis measures the number of iterations and the vertical axis shows the error |g(x_n+ 1) − g(x_n)|.

At first consider Beale’s Function:

$$ g:\mathbb{R}^{2}\longrightarrow\mathbb{R},{\kern1.7pt}g(x,y) = (1.5-x+xy)^{2}+(2.25-x+xy^{2})^{2}+(2.625-x+xy^{3})^{2}. $$

We compare Algorithm (6) with inertial parameter $\alpha _{n}=-0.01 \cdot \frac {n}{n+3}$ (red star Fig. 3a), with Algorithm (2) with inertial parameter $\alpha _{n}=0.01 \cdot \frac {n}{n+3}$ (green square Fig. 3a), and Algorithm (4) (blue circle Fig. 3a), by taking the same step size β_n = s = 0.01, and initial value x₀ = x_− 1 = (0.1,0.5). Meanwhile Algorithm (6) and Algorithm (2) show a similar behavior for the Beale function, both outperform Algorithm (4) (see Fig. 3a).

Consider next the Rosenbrock Function:

$$ g:\mathbb{R}^{2}\longrightarrow\mathbb{R},{\kern1.7pt}g(x,y) = 100(x^{2} - y)^{2} + (x-1)^{2}. $$

We compare Algorithm (6) with inertial parameter $\alpha _{n}=-0.01 \cdot \frac {n}{n+3}$ (red star Fig. 3b), with Algorithm (2) with inertial parameter $\alpha _{n}=0.4 \cdot \frac {n}{n+3}$ (green square Fig. 3b), and Algorithm (4) (blue circle Fig. 3b), by taking the same step size β_n = s = 0.001, and initial value x₀ = x_− 1 = (2,2) (see Fig. 3b). Note that also for the Rosenbrock Function, Algorithm (6) and Algorithm (2) have a similar behavior; however, in contrast with the oscillations in the error terms of iterates |g(x_n+ 1) − g(x_n)| of Algorithm (2), Algorithm (6) shows an almost linear decrease trend.

We are also interested in the quadratic function of the form:

$$ g:\mathbb{R}^{2}\longrightarrow\mathbb{R},{\kern1.7pt}g(x,y) = -3803.84 - 138.08 x -232.92 y + 128.08 x^{2} + 203.64 y^{2} + 182.25 xy. $$

We compare Algorithm (6) with inertial parameter $\alpha _{n}=-0.2 \cdot \frac {n}{n+3}$ (red star Fig. 3c), with Algorithm (2) with inertial parameter $\alpha _{n}=0.49 \cdot \frac {n}{n+3}$ (green square Fig. 3c), and Algorithm (4) (blue circle Fig. 3c), by taking the same step size β_n = s = 0.0025, and initial value x₀ = x_− 1 = (2,2). As Fig. 3c shows that in this case Algorithm (6) clearly outperforms Algorithm (2) and Algorithm (4).

Finally, for the logistic regression with l₂-regularization, we consider the cost function:

$$ g:\mathbb{R}^{m}\longrightarrow\mathbb{R},{\kern1.7pt}g(w) = \frac{1}{k} \sum\limits_{i=1}^{k} \ln \left( 1 + e^{-y_{i} w^{T} x^{i}} \right) + \frac{1}{2} \| w \|^{2} \ , $$

with k = 200, m = 50. Further,

$$ y_{1} , \ldots, y_{k} \in \lbrace -1, +1 \rbrace $$

and

$$ x^{1}, \ldots, x^{k} \in \mathbb{R}^{m} \text{ are generated by a random normal distribution.} $$

We compare Algorithm (6) with inertial parameter $\alpha _{n}=-0.1 \cdot \frac {n}{n+3}$ (red star Fig. 3d), with Algorithm (2) with inertial parameter $\alpha _{n}=0.36 \cdot \frac {n}{n+3}$ (green square Fig. 3d), and Algorithm (4) (blue circle Fig. 3d), by taking the same step size β_n = s = 0.5, and the initial value x₀ = x_− 1 = (1,…,1)^T. Also here, Algorithm (6) outperforms Algorithm (2) and Algorithm (4) (see Fig. 3d).

4.4 Switching between positive and negative inertial parameter values

Finally, a set of numerical experiments is related to the minimization of the nonconvex, coercive function:

$$g:\mathbb{R} \longrightarrow \mathbb{R},{\kern1.7pt}g(x)=\ln(1+(x^{2}-1)^{2}).$$

Observe that this function has two global minima at x = − 1 and x = 1 and a local maximum at x = 0.

The experiments that we present in what follows emphasize the importance of the fact that the inertial parameter α_n in Algorithm (6), though having a strictly negative limit, may have a finite number of positive terms.

Indeed, by taking the same starting points x₀ = x_− 1 = 2 and constant step size β_n = 0.1, according to Fig. 4a, Algorithm (6), with inertial parameter $\alpha _{n}=-0.1\frac {n}{n+3}$ (red star Fig. 4a), seems to converge faster than algorithm (2) with inertial parameter $\alpha _{n}=0.1\frac {n}{n+3}$ (black circle Fig. 4a), after a certain number of iterates. Here, we ran the algorithms until the absolute value of the gradient of the objective function in iterates |∇g(x_n)| attained the value 10^− 15. These results are shown in Fig. 4a, where the horizontal axis measures the number of iterations and the vertical axis shows the energy error $|g(x_{n+1})-g(\overline {x})|$, where $\overline {x}$ in this case is the appropriate minimum 1. However, these algorithms show a similar behavior concerning the scaled error $h^{2}n^{2}|g(x_{n+1})-g(\overline {x})|$, where n is the number of iterations and h is the step size (see Fig. 4b).

Now, for the initial values x₀ = x_− 1 = 0.00001 (which is very closed to the local maximum of the objective function), Algorithm (2) (black circle, Fig. 4c, d), clearly outperforms Algorithm (6) (red star, Fig. 4c, d) both for the energy error $|g(x_{n+1})-g(\overline {x})|$, (Fig. 4c), and the scaled error $h^{2}n^{2}|g(x_{n+1})-g(\overline {x})|$ (Fig. 4d).

Nevertheless, the very general structure of the generic Algorithm (6) allows for much flexibility, as only the limit of the sequence (α_n) is prescribed. So, one can profit by taking the inertial parameter $\alpha _{n}=\frac {-0.1n+5}{n+3}$ in Algorithm (6). Then, α_n is positive for the first 50 iterates, and this helps Algorithm (6) to outperform Algorithm (5) with inertial parameter $\alpha _{n}=0.1\frac {n}{n+3}$, even for the initial values x₀ = x_− 1 = 0.00001 (see Fig. 4e for the energy error $|g(x_{n+1})-g(\overline {x})|$ and Fig. 4f for the scaled error $h^{2}n^{2}|g(x_{n+1})-g(\overline {x})|$, where the graphics corresponding to Algorithm (6) are depicted by red, the graphics corresponding to Algorithm (2) are depicted by black).

References

Aujol, J.-F., Dossal, C.H., Rondepierre, A.: Optimal convergence rates for Nesterov acceleration. arXiv:1805.05719
Attouch, H., Bolte, J.: On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math. Program. 116(1-2), 5–16 (2009)
MathSciNet MATH Google Scholar
Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res. 35(2), 438–457 (2010)
MathSciNet MATH Google Scholar
Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Math. Program. 137(1-2), 91–129 (2013)
MathSciNet MATH Google Scholar
Attouch, H., Chbani, Z., Peypouquet, J., Redont, P.: Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Math. Program. 168(1-2), 123–175 (2018)
MathSciNet MATH Google Scholar
Bégout, P., Bolte, J., Jendoubi, M.A.: On damped second-order gradient systems. J. Differ. Equ. 259, 3115–3143 (2015)
MathSciNet MATH Google Scholar
Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. Series A 146(1-2), 459–494 (2014)
MathSciNet MATH Google Scholar
Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17(4), 1205–1223 (2006)
MathSciNet MATH Google Scholar
Bolte, J., Daniilidis, A., Lewis, A., Shiota, M.: Clarke subgradients of stratifiable functions. SIAM J. Optim. 18(2), 556–572 (2007)
MathSciNet MATH Google Scholar
Bolte, J., Daniilidis, A., Ley, O., Mazet, L.: Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity. Trans. Am. Math. Soc. 362(6), 3319–3363 (2010)
MATH Google Scholar
Boţ, R. I., Csetnek, E.R., László, S.C.: Approaching nonsmooth nonconvex minimization through second-order proximal-gradient dynamical systems. J. Evol. Equ. 18(3), 1291–1318 (2018)
MathSciNet MATH Google Scholar
Boţ, R.I., Csetnek, E.R., László, S.C.: An inertial forward-backward algorithm for minimizing the sum of two non-convex functions. Euro J. Comput. Optim. 4(1), 3–25 (2016)
MathSciNet MATH Google Scholar
Boţ, R. I., Csetnek, E.R., László, S.C.: A second order dynamical approach with variable damping to nonconvex smooth minimization. Applicable Analysis. https://doi.org/10.1080/00036811.2018.1495330 (2018)
Boţ, R.I., Nguyen, D.-K.: The proximal alternating direction method of multipliers in the nonconvex setting: convergence analysis and rates. arXiv:1801.01994
Chambolle, A., Dossal, C. h.: On the convergence of the iterates of the fast iterative shrinkage/thresholding algorithm. J. Optim. Theory Appl. 166(3), 968–982 (2015)
MathSciNet MATH Google Scholar
Combettes, P.L., Glaudin, L.E.: Quasinonexpansive iterations on the affine hull of orbits: From Mann’s mean value algorithm to inertial methods. Siam Journal on Optimization 27(4), 2356–2380 (2017)
MathSciNet MATH Google Scholar
Frankel, P., Garrigos, G., Peypouquet, J.: Splitting methods with variable metric for Kurdyka–Łojasiewicz functions and general convergence rates. J. Optim. Theory Appl. 165(3), 874–900 (2015)
MathSciNet MATH Google Scholar
Ghadimi, E., Feyzmahdavian, H.R., Johansson, M.: Global convergence of the heavy-ball method for convex optimization. In: 2015 IEEE European Control Conference (ECC), pp. 310–315 (2015)
Kurdyka, K.: On gradients of functions definable in o-minimal structures. Annales de l’institut Fourier (Grenoble) 48(3), 769–783 (1998)
MathSciNet MATH Google Scholar
László, S.C.: Convergence rates for an inertial algorithm of gradient type associated to a smooth nonconvex minimization. arXiv:1811.09616
Li, G., Pong, T.K.: Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods. Found. Comput. Math. 2018, 1–34 (2018)
MATH Google Scholar
Łojasiewicz, S.: Une propriété topologique des sous-ensembles analytiques réels, Les É,quations aux Dérivées Partielles, Éditions du Centre National de la Recherche Scientifique Paris, pp. 87–89 (1963)
Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate O(1/k²). (Russian) Dokl. Akad. Nauk SSSR 269(3), 543–547 (1983)
MathSciNet Google Scholar
Nesterov, Y.: Introductory lectures on convex optimization: a basic course. Kluwer Academic Publishers, Dordrecht (2004)
MATH Google Scholar
Polheim, H.: Examples of objective functions, Documentation for Genetic and Evolutionary Algorithms for use with MATLAB : GEATbx version 3.7, http://www.geatbx.com
Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. U.S.S.R. Comput. Math. Math. Phys. 4(5), 1–17 (1964)
Google Scholar
Rockafellar, R.T., Wets, R. J. -B.: Variational analysis fundamental principles of mathematical sciences, vol. 317. Springer, Berlin (1998)
Google Scholar
Su, W., Boyd, S., Candes, E.J.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. J. Mach. Learn. Res. 17, 1–43 (2016)
MathSciNet MATH Google Scholar
Sun, T., Yin, P., Li, D., Huang, C., Guan, L., Jiang, H.: Non-ergodic convergence analysis of heavy-ball algorithms. arXiv:1811.01777

Download references

Acknowledgments

The authors are thankful to three anonymous reviewers for remarks and suggestions which helped us to improve the quality of the paper.

Author information

Authors and Affiliations

Tiberiu Popoviciu Institute of Numerical Analysis, Romanian Academy, Cluj-Napoca, Romania
Cristian Daniel Alecsa
Department of Mathematics, Babes-Bolyai University, Cluj-Napoca, Romania
Cristian Daniel Alecsa
Department of Mathematics, Technical University of Cluj-Napoca, Str. Memorandumului nr. 28, 400114, Cluj-Napoca, Romania
Szilárd Csaba László & Adrian Viorel

Authors

Cristian Daniel Alecsa
View author publications
You can also search for this author in PubMed Google Scholar
Szilárd Csaba László
View author publications
You can also search for this author in PubMed Google Scholar
Adrian Viorel
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Szilárd Csaba László.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by a grant of Ministry of Research and Innovation, CNCS -UEFISCDI, project number PN-III-P1-1.1-TE-2016-0266, within PNCDI III.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Alecsa, C.D., László, S.C. & Viorel, A. A gradient-type algorithm with backward inertial steps associated to a nonconvex minimization problem. Numer Algor 84, 485–512 (2020). https://doi.org/10.1007/s11075-019-00765-z

Download citation

Received: 22 November 2018
Accepted: 25 June 2019
Published: 13 July 2019
Issue Date: June 2020
DOI: https://doi.org/10.1007/s11075-019-00765-z

Keywords

Mathematics Subject Classification (2010)

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

A gradient-type algorithm with backward inertial steps associated to a nonconvex minimization problem

Abstract

Similar content being viewed by others

Convergence rates for an inertial algorithm of gradient type associated to a smooth non-convex minimization

A Telescopic Bregmanian Proximal Gradient Method Without the Global Lipschitz Continuity Assumption

Gradient-Type Methods for Optimization Problems with Polyak-Łojasiewicz Condition: Early Stopping and Adaptivity to Inexactness Parameter

1 Introduction and Preliminaries

Remark 1

Lemma 2

Definition 1

Definition 2 (for which we refer to 2, 8, 22)

Lemma 3

2 Convergence results

Theorem 4

Proof

Remark 5

Remark 6

Lemma 7

Proof

Fact 8

Fact 9

Remark 10

Lemma 11

Proof

Theorem 12

Proof

Remark 13

3 Convergence rates

Lemma 14 (14 Lemma 15)

Theorem 15

Proof

Remark 16

4 Numerical experiments

4.1 Comparing Algorithm (6) with some algorithms from the literature by using different step sizes

4.2 The analysis of the behavior of Algorithm (6) via different inertial values and step sizes

4.3 Comparing Algorithm (6) with known algorithms by using some test functions for optimization

4.4 Switching between positive and negative inertial parameter values

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification (2010)

Search

Navigation