Abstract
This paper reveals that a residual measure operator plays a common and central role in many error bound (EB) conditions and in a variety of gradient-type methods. On the one hand, by linking this operator with other optimality measures, we define a group of abstract EB conditions and then analyze the interplay between them; on the other hand, by using this operator as an ascent direction, we propose an abstract gradient-type method and then derive EB conditions that are necessary and sufficient for its linear convergence. The former provides a unified framework that not only allows us to find new connections between many existing EB conditions, but also paves the way for constructing new ones. The latter allows us to identify the weakest conditions guaranteeing linear convergence for a number of fundamental algorithms, including the gradient method, the proximal point algorithm, and the forward–backward splitting algorithm. In addition, we show linear convergence for the proximal alternating linearized minimization algorithm under a group of equivalent EB conditions, which are strictly weaker than the traditional strong convexity condition. Moreover, by defining a new EB condition, we show Q-linear convergence of Nesterov’s accelerated forward–backward algorithm without strong convexity. Finally, we verify EB conditions for a class of dual objective functions.
1 Introduction
A standard assumption for proving linear convergence of gradient-type methods is strong convexity [43]. In practice, however, strong convexity is too stringent. Moreover, various gradient-type methods for solving convex optimization problems have exhibited linear convergence in numerical experiments even when strong convexity is absent; see e.g. [24, 31, 59]. Hence, one may wonder whether this phenomenon can be explained theoretically, and whether there exist weaker alternatives to strong convexity that retain fast rates.
A very powerful idea to address these questions is to connect error bound (EB) conditions with convergence rate estimates of gradient-type methods. This idea has a long history dating back to 1963, when Polyak introduced an EB inequality as a sufficient condition for gradient descent to attain linear convergence [45]. In the same year, a wide class of inequalities, which include Polyak’s as a special case, was introduced by Łojasiewicz [36]. In recent manuscripts [10, 27], the EB condition of Polyak–Łojasiewicz type was further developed for analyzing the complexity of first-order descent methods. The second type of EB condition is due to Hoffman, who proposed an EB inequality for systems of linear inequalities [25] in 1952. Along this line, Luo and Tseng in the early 1990s made several contributions connecting EB conditions of Hoffman’s type with the convergence analysis of descent methods [38, 39]. Recently, global versions of EB conditions of Hoffman’s type have attracted much attention [52, 57, 69]. The third type of EB condition is the quadratic growth condition (also called the zero-order EB condition in [10]), which might go back to the work [70]. It was recently rediscovered in the special case of convex functions, and it too has been widely used to derive linear convergence for many gradient-type methods [23, 35, 42].
Moreover, a surge of interest has recently emerged in developing new EB conditions guaranteeing (global) linear convergence for various gradient-type methods. For example, the authors of [31, 64, 67] proposed a restricted secant inequality (RSI), and developed the restricted strongly convex (RSC) property with parameter \(\nu \) for analyzing linear convergence of (dual) gradient descent methods and Nesterov’s restart accelerated methods; the authors of [42] proposed several relaxations of strong convexity that are sufficient for obtaining linear convergence for (projected and accelerated) gradient-type methods.
Another line of recent work seeks connections between existing EB conditions. The authors of [20] discussed the relationship between the quadratic growth condition and the EB condition of Hoffman’s type (also called Luo–Tseng’s type in [32]). Parallel to and partially influenced by the work [20], the author of this paper also established several new types of equivalence between the RSC property, the quadratic growth condition, and the EB condition of Hoffman’s type in [63]. The authors of [10] showed the equivalence between the quadratic growth condition and the Kurdyka–Łojasiewicz inequality. We also note that the works [27, 42] discussed the relationships among many of these EB conditions.
Based on these two lines of recent developments, two natural questions arise. The first is whether there is a unified framework for defining different EB conditions and analyzing the connections between them. The second is whether these sufficient conditions guaranteeing linear convergence for gradient-type methods are also necessary. To answer these two questions, we will rely on a vital observation: a residual measure operator plays a common and key role in many EB conditions and in a variety of gradient-type methods. This observation immediately leads us to the following discoveries:
- 1.
By linking the residual measure operator with other optimality measures, we define a group of abstract EB conditions. Then, we comprehensively analyze the interplay between them by means of the technique developed in [10], which plays a fundamental role in the corresponding error bound equivalence. The definition of abstract EB conditions not only unifies many existing EB conditions, but also helps us to construct new ones. The interplay between the abstract EB conditions allows us to find new connections between many existing EB conditions.
- 2.
By viewing the residual measure operator as an ascent direction, we propose an abstract gradient-type method, and then derive EB conditions that are necessary and sufficient for its linear convergence. This allows us to claim the weakest (that is, necessary and sufficient) conditions guaranteeing linear convergence for a number of fundamental algorithms, including the gradient method (applied to possibly nonconvex optimization), the proximal point algorithm, and the forward–backward splitting algorithm. The sufficiency of these EB conditions for linear convergence is widely known; see e.g. [10]. In contrast, their necessity has received very little attention.
In addition, we make the following further contributions, concerning block coordinate gradient descent, Nesterov’s acceleration, and the verification of EB conditions, respectively:
- 3.
We show linear convergence for the proximal alternating linearized minimization (PALM) algorithm under a group of equivalent EB conditions. It has been recently shown [26, 33, 51] that PALM achieves sublinear convergence for convex problems and linear convergence for strongly convex problems. In this study, we show its linear convergence under strictly weaker conditions than strong convexity.
- 4.
By defining a new EB condition, we obtain Q-linear convergence of Nesterov’s accelerated forward–backward algorithm, which generalizes the Q-linear convergence of Nesterov’s accelerated gradient method, recently independently discovered in [29, 58]. The new EB condition can, in some special cases, be viewed as a strictly weaker relaxation of strong convexity. In this sense, we show Q-linear convergence of Nesterov’s accelerated method without strong convexity. Our proof idea is partially inspired by Attouch and Peypouquet [5] but might be of interest in its own right.
- 5.
We provide a new proof to show that a class of dual objective functions satisfy EB conditions, under slightly weaker assumptions, again by means of the technique developed in [10]. The authors of [31] gave the first proof for a special case of this class of functions, and the author of [50] gave the first general proof by contradiction.
The paper is organized as follows. In Sect. 2, we present the basic notation and some elementary preliminaries. In Sect. 3, we analyze necessary and sufficient conditions guaranteeing linear convergence for the gradient descent. In Sect. 4, we define a group of abstract EB conditions, and analyze the interplay between them. In Sect. 5, we define an abstract gradient-type method, and derive EB conditions that are necessary and sufficient for guaranteeing its linear convergence. In Sect. 6, we study linear convergence of the PALM algorithm. In Sect. 7, we study linear convergence of Nesterov’s accelerated forward–backward algorithm. In Sect. 8, we verify EB conditions for a class of dual objective functions. Finally, in Sect. 9, we give a short summary of this paper, along with some discussion for future work.
2 Notation and preliminaries
Throughout the paper, \({\mathbb {R}}^n\) will denote an n-dimensional Euclidean space associated with inner-product \(\langle \cdot , \cdot \rangle \) and induced norm \(\Vert \cdot \Vert \). For any nonempty \(Q\subset {\mathbb {R}}^n\), we define the distance function by \(d(x,Q):=\inf _{y\in Q}\Vert x-y\Vert .\) For a nonempty set \(Q\subset {\mathbb {R}}^n\), we define the indicator function of Q by
$$\begin{aligned} \iota _Q(x):={\left\{ \begin{array}{ll} 0, &{} x\in Q,\\ +\infty , &{} x\notin Q. \end{array}\right. } \end{aligned}$$
We say that f is gradient-Lipschitz-continuous with modulus \(L>0\) if
$$\begin{aligned} \Vert \nabla f(x)-\nabla f(y)\Vert \le L\Vert x-y\Vert ,\quad \forall x, y\in {\mathbb {R}}^n, \end{aligned}$$
and f is strongly convex with modulus \(\mu >0\) if for any \(\alpha \in [0,1]\),
$$\begin{aligned} f(\alpha x+(1-\alpha )y)\le \alpha f(x)+(1-\alpha )f(y)-\frac{\mu }{2}\alpha (1-\alpha )\Vert x-y\Vert ^2,\quad \forall x, y\in {\mathbb {R}}^n, \end{aligned}$$
or if (when it is differentiable)
$$\begin{aligned} f(y)\ge f(x)+\langle \nabla f(x), y-x\rangle +\frac{\mu }{2}\Vert y-x\Vert ^2,\quad \forall x, y\in {\mathbb {R}}^n. \end{aligned}$$
We will consider the following classes of functions.
\({{\mathcal {F}}}^1({\mathbb {R}}^n)\): the class of continuously differentiable convex functions from \({\mathbb {R}}^n\) to \({\mathbb {R}}\);
\({{\mathcal {F}}}^{1,1}_{L}({\mathbb {R}}^n)\): the class of gradient-Lipschitz-continuous convex functions from \({\mathbb {R}}^n\) to \({\mathbb {R}}\) with Lipschitz modulus L;
\({{\mathcal {S}}}^{1,1}_{\mu ,L}({\mathbb {R}}^n)\): the class of gradient-Lipschitz-continuous and strongly convex functions from \({\mathbb {R}}^n\) to \({\mathbb {R}}\) with Lipschitz modulus L and strongly convex modulus \(\mu \);
\(\Gamma ({\mathbb {R}}^n)\): the class of proper and lower semicontinuous functions from \({\mathbb {R}}^n\) to \((-\infty , +\infty ]\);
\(\Gamma _0({\mathbb {R}}^n)\): the class of proper and lower semicontinuous convex functions from \({\mathbb {R}}^n\) to \((-\infty , +\infty ]\).
Obviously, we have the following inclusions:
$$\begin{aligned} {{\mathcal {S}}}^{1,1}_{\mu ,L}({\mathbb {R}}^n)\subset {{\mathcal {F}}}^{1,1}_{L}({\mathbb {R}}^n)\subset {{\mathcal {F}}}^{1}({\mathbb {R}}^n)\subset \Gamma _0({\mathbb {R}}^n)\subset \Gamma ({\mathbb {R}}^n). \end{aligned}$$
It is convenient to denote by \(\mathrm {Arg}\min f\) the set of optimal solutions of minimizing f over \({\mathbb {R}}^n\), and to use “\(\arg \min f\)”, if the solution is unique, to denote that unique solution. If \(\mathrm {Arg}\min f\) is nonempty, we let \(\min f\) denote the minimum of f over \({\mathbb {R}}^n\).
The notion of subdifferential plays a central role in (non)convex optimization.
Definition 1
(subdifferentials, [48]) Let \(f\in \Gamma ({\mathbb {R}}^n)\). Its domain is defined by \({\mathrm {dom}}f:=\{x\in {\mathbb {R}}^n: f(x)<+\infty \}\).
- (a)
For a given \(x\in {\mathrm {dom}}f\), the Fréchet subdifferential of f at x, written \({\hat{\partial }}f(x)\), is the set of all vectors \(u\in {\mathbb {R}}^n\) which satisfy
$$\begin{aligned} \liminf _{y\rightarrow x,\, y\ne x}\frac{f(y)-f(x)-\langle u, y-x\rangle }{\Vert y-x\Vert }\ge 0. \end{aligned}$$When \(x\notin {\mathrm {dom}}f\), we set \({\hat{\partial }}f(x)=\emptyset \).
- (b)
The (limiting) subdifferential, of f at \(x\in {\mathbb {R}}^n\), written \(\partial f(x)\), is defined through the following closure process
$$\begin{aligned} \partial f(x):=\left\{ u\in {\mathbb {R}}^n: \exists x^k\rightarrow x, f\left( x^k\right) \rightarrow f(x)~\text {and}~ {\hat{\partial }}f\left( x^k\right) \ni u^k \rightarrow u~\text {as}~k\rightarrow \infty \right\} . \end{aligned}$$ - (c)
If we further assume that f is convex, then the subdifferential of f at \(x\in {\mathrm {dom}}f\) can also be defined by
$$\begin{aligned} \partial f(x):=\left\{ v\in {\mathbb {R}}^n: f(z)\ge f(x)+\langle v, z-x\rangle ,~~\forall ~z\in {\mathbb {R}}^n\right\} . \end{aligned}$$The elements of \(\partial f(x)\) are called subgradients of f at x.
Denote the domain of \(\partial f\) by \({\mathrm {dom}}\partial f:= \{x\in {\mathbb {R}}^n:\partial f(x)\ne \emptyset \}\). If \(f\in \Gamma ({\mathbb {R}}^n)\) and \(x\in {\mathrm {dom}}f\), then \(\partial f(x)\) is closed (see Theorem 8.6 in [48]); if \(f\in \Gamma _0({\mathbb {R}}^n)\), then \({\mathrm {dom}}\partial f\subset {\mathrm {dom}}f\) and, for every \(x\in {\mathrm {dom}}\partial f\), \(\partial f(x)\) is a nonempty closed convex set (see Proposition 16.3 in [7]). In the latter case, we denote by \(\partial ^0 f(x)\) the unique least-norm element of \(\partial f(x)\) for \(x\in {\mathrm {dom}}\partial f\), along with the convention that \(\Vert \partial ^0 f(x)\Vert =+\infty \) for \(x\notin {\mathrm {dom}}\partial f\). Points whose subdifferential contains 0 are called critical points. The set of critical points of f is denoted by \({\mathbf {crit}}f\). If \(f\in \Gamma _0({\mathbb {R}}^n)\), then \({\mathbf {crit}}f=\mathrm {Arg}\min f\).
Let \(f\in \Gamma _0({\mathbb {R}}^n)\); its Fenchel conjugate function \(f^*:{\mathbb {R}}^n\rightarrow (-\infty , +\infty ]\) is defined by
$$\begin{aligned} f^*(y):=\sup _{x\in {\mathbb {R}}^n}\{\langle y, x\rangle -f(x)\}, \end{aligned}$$
and the proximal mapping operator by
$$\begin{aligned} \mathrm {prox}_{f}(x):=\mathop {\arg \min }_{y\in {\mathbb {R}}^n}\left\{ f(y)+\frac{1}{2}\Vert y-x\Vert ^2\right\} . \end{aligned}$$
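For intuition (an illustration of ours, not from the paper): with \(f(y)=\lambda |y|\), the proximal mapping reduces to the well-known soft-thresholding operator, which a brute-force minimization over a fine grid confirms:

```python
import numpy as np

def prox_numeric(f, x, grid):
    """Numerically evaluate prox_f(x) = argmin_y { f(y) + 0.5*(y - x)**2 } over a grid."""
    vals = f(grid) + 0.5 * (grid - x) ** 2
    return grid[np.argmin(vals)]

def soft_threshold(x, lam):
    """Closed-form prox of f(y) = lam*|y| (soft-thresholding)."""
    return np.sign(x) * max(abs(x) - lam, 0.0)

lam = 0.5
grid = np.linspace(-5, 5, 200001)   # spacing 5e-5
for x in [-2.0, -0.3, 0.0, 0.7, 3.1]:
    num = prox_numeric(lambda y: lam * np.abs(y), x, grid)
    assert abs(num - soft_threshold(x, lam)) < 1e-3
```

The grid search only serves as an independent check of the closed form; in practice one would use the closed form directly.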
For each \(x\in \overline{{\mathrm {dom}}f} \), it is well-known [12] that there is a unique absolutely continuous curve \(\chi _x: [0, \infty )\rightarrow {\mathbb {R}}^n\) such that \(\chi _x(0)=x\) and for almost every \(t>0\),
$$\begin{aligned} {\dot{\chi }}_x(t)\in -\partial f(\chi _x(t)). \end{aligned}$$
We say that \(\Omega \subset {\mathbb {R}}^n\) is \(\partial f\)-invariant if
$$\begin{aligned} \chi _x(t)\in \Omega ~~\text {for all}~t\ge 0~\text {whenever}~x\in \Omega \cap \overline{{\mathrm {dom}}f}. \end{aligned}$$
This concept was proposed in [12] and recently used in [22]. There are several types of \(\Omega \) being \(\partial f\)-invariant; see Example 7.2 in [22] and Section IV.4 in [12]. In Sects. 5 and 8, we will use the fact that the sublevel set \(X_r:=\{x:f(x)\le r\}\) is always \(\partial f\)-invariant for any function \(f\in \Gamma _0({\mathbb {R}}^n)\).
At last, we present some variational analysis tools. Let \({\mathcal {T}}\), \({{\mathcal {E}}}\), and \({{\mathcal {E}}}_i,\)\(i=1,2\) be finite-dimensional Euclidean spaces. The closed ball around \(x\in {{\mathcal {E}}}\) with radius \(r>0\) is denoted by \({\mathbb {B}}_{{\mathcal {E}}}(x,r):=\{y\in {{\mathcal {E}}}:\Vert x-y\Vert \le r\}\). The unit ball is denoted by \({\mathbb {B}}_{{\mathcal {E}}}\) for simplicity, and the open unit ball around the origin in \({{\mathcal {E}}}\) is denoted by \({\mathbb {B}}_{{\mathcal {E}}}^o\). A multi-function \(S: {\mathcal {E}}_1\rightrightarrows {\mathcal {E}}_2\) is a mapping assigning each point in \({{\mathcal {E}}}_1\) to a subset of \({{\mathcal {E}}}_2\). The graph of S is defined by
$$\begin{aligned} \texttt {gph}(S):=\{(u,v)\in {\mathcal {E}}_1\times {\mathcal {E}}_2: v\in S(u)\}. \end{aligned}$$
The inverse map \(S^{-1}:{\mathcal {E}}_2\rightrightarrows {\mathcal {E}}_1\) is defined by setting
$$\begin{aligned} S^{-1}(v):=\{u\in {\mathcal {E}}_1: v\in S(u)\},\quad v\in {\mathcal {E}}_2. \end{aligned}$$
Calmness and metric subregularity have been considered in various contexts and under various names. Here, we follow the terminology of Dontchev and Rockafellar [17].
Definition 2
([17], Chapter 3H)
- (a)
A multi-function \(S: {\mathcal {E}}_1\rightrightarrows {\mathcal {E}}_2\) is said to be calm with constant \(\kappa >0\) around \({\bar{u}}\in {\mathcal {E}}_1\) for \({\bar{v}}\in {\mathcal {E}}_2\) if \(({\bar{u}},{\bar{v}})\in \texttt {gph}(S)\) and there exist constants \(\epsilon , \delta >0\) such that
$$\begin{aligned} S(u)\cap {\mathbb {B}}_{{\mathcal {E}}_2}({\bar{v}},\epsilon )\subseteq S({\bar{u}})+\kappa \cdot \Vert u-{\bar{u}}\Vert _2{\mathbb {B}}_{{\mathcal {E}}_2},\quad \forall u\in {\mathbb {B}}_{{\mathcal {E}}_1}({\bar{u}},\delta ), \end{aligned}$$(1)or equivalently,
$$\begin{aligned} S(u)\cap {\mathbb {B}}_{{\mathcal {E}}_2}({\bar{v}},\epsilon )\subseteq S({\bar{u}})+\kappa \cdot \Vert u-{\bar{u}}\Vert _2{\mathbb {B}}_{{\mathcal {E}}_2},\quad \forall u\in {\mathcal {E}}_1. \end{aligned}$$(2) - (b)
A multi-function \(S: {\mathcal {E}}_1\rightrightarrows {\mathcal {E}}_2\) is said to be metrically subregular with constant \(\kappa >0\) around \({\bar{u}}\in {\mathcal {E}}_1\) for \({\bar{v}}\in {\mathcal {E}}_2\) if \(({\bar{u}},{\bar{v}})\in \texttt {gph}(S)\) and there exists a constant \(\delta >0\) such that
$$\begin{aligned} d\left( u, S^{-1}({\bar{v}})\right) \le \kappa \cdot d({\bar{v}},S(u)),\quad \forall u\in {\mathbb {B}}_{{\mathcal {E}}_1}({\bar{u}},\delta ). \end{aligned}$$(3)
Note that the calmness defined above is weaker than the local upper Lipschitz-continuity property [46]:
$$\begin{aligned} S(u)\subseteq S({\bar{u}})+\kappa \cdot \Vert u-{\bar{u}}\Vert {\mathbb {B}}_{{\mathcal {E}}_2},\quad \forall u\in {\mathbb {B}}_{{\mathcal {E}}_1}({\bar{u}},\delta ), \end{aligned}$$(4)
which requires the multi-function S to be calm around \({\bar{u}}\in {{\mathcal {E}}}_1\) with constant \(\kappa >0\) for every \({\bar{v}}\in S({\bar{u}})\). Recently, the local upper Lipschitz-continuity property (4) was employed in [50] as a main assumption for verifying EB conditions for a class of dual objective functions.
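To make the metric subregularity inequality (3) concrete, here is a small numerical sketch of ours (the two multi-functions are illustrative choices, not from the paper): \(S(u)=[u,+\infty )\) is metrically subregular at \((0,0)\) with \(\kappa =1\), while \(S(u)=\{u^2\}\) is not, since the required constant \(|u|/u^2\) blows up near the origin.

```python
# Metric subregularity (3) at (u_bar, v_bar) = (0, 0): d(u, S^{-1}(0)) <= kappa * d(0, S(u)).

# Example 1: S(u) = [u, +inf). Then S^{-1}(0) = (-inf, 0], and (3) holds with kappa = 1.
def d_inv_1(u): return max(u, 0.0)   # distance from u to (-inf, 0]
def d_img_1(u): return max(u, 0.0)   # distance from 0 to [u, +inf)

# Example 2: S(u) = {u**2}. Then S^{-1}(0) = {0}, and (3) fails: no finite kappa works.
def d_inv_2(u): return abs(u)        # distance from u to {0}
def d_img_2(u): return u * u         # distance from 0 to {u**2}

for u in [-0.3, -0.01, 0.0, 0.01, 0.3]:
    assert d_inv_1(u) <= 1.0 * d_img_1(u) + 1e-12          # subregular with kappa = 1

ratios = [d_inv_2(u) / d_img_2(u) for u in [0.1, 0.01, 0.001]]
assert max(ratios) >= 999.0   # required kappa = 1/|u| is unbounded near 0
```

The second example is the classical obstruction: the inverse image \(S^{-1}(0)\) is a singleton, but the residual \(d(0,S(u))\) vanishes quadratically while the distance vanishes only linearly.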
3 The gradient descent: a necessary and sufficient condition for linear convergence
In this section, we first figure out the weakest condition that ensures gradient descent to converge linearly, and then we show that a number of existing linear convergence results can be recovered in a unified and transparent manner. This is a “warm-up” section for the forthcoming abstract theory in Sects. 4 and 5.
Now, we start by considering the following unconstrained optimization problem
$$\begin{aligned} \min _{x\in {\mathbb {R}}^n} f(x), \end{aligned}$$
where \(f: {\mathbb {R}}^n\rightarrow {\mathbb {R}}\) is a differentiable function achieving its minimum \(\min f\) so that \(\mathrm {Arg}\min f\ne \emptyset \). Note that \(\mathrm {Arg}\min f\) is closed since f is differentiable. For any \(x\in {\mathbb {R}}^n\), the set of its projection points onto \(\mathrm {Arg}\min f\), denoted by \({{\mathcal {Y}}}_f(x)\), is nonempty. Let \(\{x_k\}_{k\ge 0}\) be generated by the gradient descent method
$$\begin{aligned} x_{k+1}=x_k-h\nabla f(x_k),\quad k\ge 0, \end{aligned}$$(5)
where \(h>0\) is a constant step size. Observe that \(d(x_k,\mathrm {Arg}\min f)\) measures how close \(x_k\) is to \(\mathrm {Arg}\min f\), and the ratio of \(d(x_{k+1},\mathrm {Arg}\min f)\) to \(d(x_k,\mathrm {Arg}\min f)\) measures how fast \(x_k\) converges to \(\mathrm {Arg}\min f\). Now, we analyze this ratio as follows:
$$\begin{aligned} d^2(x_{k+1},\mathrm {Arg}\min f)=\Vert x_{k+1}-x_{k+1}^\prime \Vert ^2\le \Vert x_{k+1}-x_{k}^\prime \Vert ^2 =\Vert x_k-x_{k}^\prime \Vert ^2-2h\langle \nabla f(x_k), x_k-x_{k}^\prime \rangle +h^2\Vert \nabla f(x_k)\Vert ^2, \end{aligned}$$
where \(x_{k+1}^\prime \in {{\mathcal {Y}}}_f(x_{k+1})\) and \(x_{k}^\prime \in {{\mathcal {Y}}}_f(x_{k})\). To ensure that gradient descent converges linearly in the following sense:
$$\begin{aligned} d(x_{k+1},\mathrm {Arg}\min f)\le \sqrt{\tau }\cdot d(x_k,\mathrm {Arg}\min f),\quad \tau \in (0,1), \end{aligned}$$(6)
it suffices to require that for \(k\ge 0\) and \(x_k^\prime \in {{\mathcal {Y}}}_f(x_{k})\),
$$\begin{aligned} \Vert x_k-x_{k}^\prime \Vert ^2-2h\langle \nabla f(x_k), x_k-x_{k}^\prime \rangle +h^2\Vert \nabla f(x_k)\Vert ^2\le \tau \Vert x_k-x_{k}^\prime \Vert ^2, \end{aligned}$$
i.e.,
$$\begin{aligned} \langle \nabla f(x_k), x_k-x_{k}^\prime \rangle \ge \frac{h}{2}\Vert \nabla f(x_k)\Vert ^2+\frac{1-\tau }{2h}\cdot d^2(x_k,\mathrm {Arg}\min f). \end{aligned}$$(7)
It turns out that this sufficient condition is also necessary when the objective function f belongs to \({{\mathcal {F}}}_L^{1,1}({\mathbb {R}}^n)\) and the step size h lies in some interval.
Proposition 1
Let f be a differentiable function on \({\mathbb {R}}^n\) achieving its minimum \(\min f\) so that \(\mathrm {Arg}\min f\ne \emptyset \), and let \(h>0\) and \(\tau \in (0,1)\).
- (i)
If the condition (7) holds, then the sequence \(\{x_k\}_{k\ge 0}\) generated by the gradient descent method (5) must converge linearly in the sense of (6).
- (ii)
Let \(f\in {{\mathcal {F}}}_L^{1,1}({\mathbb {R}}^n)\). If the sequence \(\{x_k\}\) generated by the gradient descent method (5) with \(0< h\le \frac{1-\sqrt{\tau }}{L}\) converges linearly as (6), then the condition (7) must hold.
Proof
The proof of sufficiency has been done. We now show the necessity part. Pick \(u_{k+1}\in {{\mathcal {Y}}}_f(x_{k+1})\) to derive that
$$\begin{aligned} d(x_k,\mathrm {Arg}\min f)\le \Vert x_k-u_{k+1}\Vert \le \Vert x_k-x_{k+1}\Vert +\Vert x_{k+1}-u_{k+1}\Vert =h\Vert \nabla f(x_k)\Vert +d(x_{k+1},\mathrm {Arg}\min f). \end{aligned}$$(8)
Combine (8) and the fact of linear convergence
$$\begin{aligned} d(x_{k+1},\mathrm {Arg}\min f)\le \sqrt{\tau }\cdot d(x_k,\mathrm {Arg}\min f) \end{aligned}$$
to obtain
$$\begin{aligned} \Vert \nabla f(x_k)\Vert \ge \frac{1-\sqrt{\tau }}{h}\cdot d(x_k,\mathrm {Arg}\min f). \end{aligned}$$(9)
According to Theorem 2.1.5 in [43], we know that \(f\in {{\mathcal {F}}}_L^{1,1}({\mathbb {R}}^n)\) implies
$$\begin{aligned} \langle \nabla f(x)-\nabla f(y), x-y\rangle \ge \frac{1}{L}\Vert \nabla f(x)-\nabla f(y)\Vert ^2,\quad \forall x, y\in {\mathbb {R}}^n. \end{aligned}$$
By letting \(\alpha +\beta \le 1\) and \(\alpha ,\beta >0\), we have that for any \( v_k\in {{\mathcal {Y}}}_f(x_{k})\),
$$\begin{aligned} \langle \nabla f(x_k), x_k-v_k\rangle \ge \frac{1}{L}\Vert \nabla f(x_k)\Vert ^2\ge \frac{\alpha }{L}\Vert \nabla f(x_k)\Vert ^2+\frac{\beta }{L}\Vert \nabla f(x_k)\Vert ^2\ge \frac{\alpha }{L}\Vert \nabla f(x_k)\Vert ^2+\frac{\beta (1-\sqrt{\tau })^2}{Lh^2}\cdot d^2(x_k,\mathrm {Arg}\min f), \end{aligned}$$
where the last inequality follows by (9). Thus, by letting \(\frac{\alpha }{L}=\frac{h}{2}\) and \(\frac{\beta (1-\sqrt{\tau })^2}{Lh^2}=\frac{1-\tau }{2h}\), we get the condition (7). At last, we need
$$\begin{aligned} \alpha +\beta =\frac{hL}{2}+\frac{(1-\tau )Lh}{2(1-\sqrt{\tau })^2}=\frac{Lh}{1-\sqrt{\tau }}\le 1, \end{aligned}$$
which forces \(h\le \frac{1-\sqrt{\tau }}{L}\). This completes the proof. \(\square \)
The condition (7) means that if the steepest descent direction \(-\nabla f(x)\) is well correlated to any direction towards optimality \(u-x\), where \(u\in {{\mathcal {Y}}}_f(x)\), then a linear convergence rate of the gradient descent method can be ensured. Conversely, when \(f\in {{\mathcal {F}}}_L^{1,1}({\mathbb {R}}^n)\) and if the gradient descent converges linearly and the step size lies in the interval \((0, \frac{1-\sqrt{\tau }}{L}]\), then \(-\nabla f(x)\) must be well correlated to \(u-x\). Now, we list some direct applications of this basic observation.
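As a quick numerical illustration of this observation (a sketch of ours, not part of the original analysis), take \(f(x)=\frac{1}{2}\Vert Ax-b\Vert ^2\) with a rank-deficient matrix A and \(b\in \mathrm {range}(A)\): f is not strongly convex, yet gradient descent still contracts the distance to \(\mathrm {Arg}\min f\) linearly.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 8))  # rank 3, so f is NOT strongly convex on R^8
b = A @ rng.standard_normal(8)                                 # b in range(A), so Argmin f = {x : A x = b}
grad = lambda x: A.T @ (A @ x - b)                             # gradient of f(x) = 0.5*||A x - b||^2
L = np.linalg.norm(A, 2) ** 2                                  # Lipschitz modulus of the gradient
A_pinv = np.linalg.pinv(A)
dist = lambda x: np.linalg.norm(A_pinv @ (A @ x - b))          # d(x, Argmin f) for this quadratic

x = rng.standard_normal(8)
ratios = []
for _ in range(100):
    d0 = dist(x)
    x = x - (1.0 / L) * grad(x)                                # gradient descent step (5) with h = 1/L
    if d0 > 1e-12:
        ratios.append(dist(x) / d0)
assert max(ratios) < 1.0   # the distance to Argmin f decays linearly despite rank deficiency
```

Here the contraction factor is governed by the smallest positive singular value of A, which is exactly the flavor of error bound that replaces strong convexity.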
In our first illustrating example, we consider functions in \(S_{\mu , L}^{1,1}({\mathbb {R}}^n)\). First, we introduce an important property about this type of functions.
Lemma 1
([43]) If \(f\in S_{\mu , L}^{1,1}({\mathbb {R}}^n)\), then we have
$$\begin{aligned} \langle \nabla f(x)-\nabla f(y), x-y\rangle \ge \frac{\mu L}{\mu +L}\Vert x-y\Vert ^2+\frac{1}{\mu +L}\Vert \nabla f(x)-\nabla f(y)\Vert ^2,\quad \forall x, y\in {\mathbb {R}}^n. \end{aligned}$$
Let \(x^*\) be the unique minimizer of \(f\in S_{\mu , L}^{1,1}({\mathbb {R}}^n)\); then \(\mathrm {Arg}\min f=\{x^*\}\). Using the inequality above with \(x=x_k, y=x^*\) and noting that \(\nabla f(x^*)=0\) and \(\Vert x_k-x^*\Vert = d(x_k,\mathrm {Arg}\min f)\), we obtain
$$\begin{aligned} \langle \nabla f(x_k), x_k-x^*\rangle \ge \frac{\mu L}{\mu +L}\cdot d^2(x_k,\mathrm {Arg}\min f)+\frac{1}{\mu +L}\Vert \nabla f(x_k)\Vert ^2. \end{aligned}$$
To guarantee the condition (7), we only need
$$\begin{aligned} \frac{1}{\mu +L}\ge \frac{h}{2}\quad \text {and}\quad \frac{\mu L}{\mu +L}\ge \frac{1-\tau }{2h}, \end{aligned}$$
which implies that
$$\begin{aligned} h\le \frac{2}{\mu +L}\quad \text {and}\quad \tau \ge 1-\frac{2h\mu L}{\mu +L}. \end{aligned}$$
The optimal linear convergence rate \(\tau _0=\left( \frac{L-\mu }{L+\mu }\right) ^2\) can be obtained by setting \(h=\frac{2}{\mu +L}\). This gives the corresponding result in Nesterov’s book; see Theorem 2.1.15 in [43].
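The optimal rate can be sanity-checked numerically (a sketch of ours, using an assumed diagonal quadratic): with \(h=\frac{2}{\mu +L}\), each gradient step contracts the distance to the minimizer by at least the factor \(\frac{L-\mu }{L+\mu }=\sqrt{\tau _0}\).

```python
import numpy as np

mu, L = 1.0, 10.0
eigs = np.array([mu, 3.0, 7.0, L])   # spectrum of the diagonal quadratic f(x) = 0.5 * x @ (eigs * x)
h = 2.0 / (mu + L)                   # step size achieving the optimal rate
rate = (L - mu) / (L + mu)           # per-step contraction factor; tau_0 = rate**2

x = np.array([1.0, -2.0, 0.5, 3.0])  # the unique minimizer is x* = 0
for _ in range(50):
    x_next = x - h * (eigs * x)      # gradient step for the diagonal quadratic
    assert np.linalg.norm(x_next) <= rate * np.linalg.norm(x) + 1e-12
    x = x_next
```

Each coordinate is scaled by \(1-h\lambda _i\), and the choice \(h=\frac{2}{\mu +L}\) balances the extreme factors \(1-h\mu \) and \(h L-1\) at the common value \(\frac{L-\mu }{L+\mu }\).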
In our second illustrating example, we consider RSC functions [64, 67]. The following property can be viewed as a convex combination of the restricted strong convexity and the gradient-Lipschitz-continuity property; see Lemma 3 in [64].
Lemma 2
([64]) If \(f\in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\) and f is RSC with \(0<\nu < L\), then for every \(\theta \in [0,1]\) it holds:
$$\begin{aligned} \langle \nabla f(x), x-x^\prime \rangle \ge \theta \nu \Vert x-x^\prime \Vert ^2+\frac{1-\theta }{L}\Vert \nabla f(x)\Vert ^2, \end{aligned}$$
where \(x^\prime \) is the unique projection point of x onto \(\mathrm {Arg}\min f\) since \(\mathrm {Arg}\min f\) is a nonempty closed convex set.
Similarly, to guarantee the condition (7), we only need
$$\begin{aligned} \frac{1-\theta }{L}\ge \frac{h}{2}\quad \text {and}\quad \theta \nu \ge \frac{1-\tau }{2h}, \end{aligned}$$
which implies that
$$\begin{aligned} h\le \frac{2(1-\theta )}{L}\quad \text {and}\quad \tau \ge 1-2h\theta \nu . \end{aligned}$$
The optimal linear convergence rate \(1-\frac{\nu }{L}\) can be obtained at \(\theta =\frac{1}{2}\) and \(h=\frac{1}{L}\). This gives the corresponding result in [64]. The argument here is much simpler than that previously employed to derive the same result; see the proof of Theorem 2 in [64].
Our last example concerns nonconvex minimization. The following definition can be viewed as a local version of Lemma 2; it is therefore natural to expect local linear convergence under such a property.
Definition 3
(Regularity Condition, [14]) Let \({{\mathcal {N}}}\) be a neighborhood of \(\mathrm {Arg}\min f\) and let \(\alpha , \beta >0\). We say that f satisfies the regularity condition if
$$\begin{aligned} \langle \nabla f(x), x-x^\prime \rangle \ge \frac{1}{\alpha }\cdot d^2(x,\mathrm {Arg}\min f)+\frac{1}{\beta }\Vert \nabla f(x)\Vert ^2,\quad \forall x\in {{\mathcal {N}}},~x^\prime \in {{\mathcal {Y}}}_f(x). \end{aligned}$$
Again, to guarantee the condition (7) locally, we only need
$$\begin{aligned} \frac{1}{\beta }\ge \frac{h}{2}\quad \text {and}\quad \frac{1}{\alpha }\ge \frac{1-\tau }{2h}, \end{aligned}$$
which implies that
$$\begin{aligned} h\le \frac{2}{\beta }\quad \text {and}\quad \tau \ge 1-\frac{2h}{\alpha }. \end{aligned}$$
The optimal linear convergence rate \(\tau _0=1-\frac{4}{\alpha \beta }\) can be obtained by setting \(h=\frac{2}{\beta }\) and assuming \(\alpha \beta >4\). The latter can usually be guaranteed; see e.g. the argument below Lemma 7.10 in [14]. Therefore, we obtain the corresponding result in [14]. The regularity condition provably holds for nonconvex optimization problems arising in phase retrieval and low-rank matrix recovery; interested readers may refer to [14, 56] for details.
Observe that the right-hand side of (7) has two terms. In order to better analyze this condition, we decompose it into two parts:
$$\begin{aligned} \langle \nabla f(x_k), x_k-x_k^\prime \rangle \ge \theta _1\Vert \nabla f(x_k)\Vert ^2\quad \text {and}\quad \langle \nabla f(x_k), x_k-x_k^\prime \rangle \ge \theta _2\cdot d^2(x_k,\mathrm {Arg}\min f), \end{aligned}$$
where \(\theta _i, i=1,2\) are some positive parameters. This idea of separating the right-hand side of (7) partially inspires us to consider new and abstract error bound conditions, which are the main content of the next section.
4 Abstract EB conditions: definition and interplay
This section is divided into two parts. In the first part, we define a group of EB conditions in a unified and abstract way. In the second part, we discuss some interplay between them, along with new connections between many existing EB conditions.
4.1 Definition of abstract EB conditions
The concept of residual measure operator, given by the following definition, will play a key role in the forthcoming theory.
Definition 4
Let \(\varphi \in \Gamma ({\mathbb {R}}^n)\) and \(X\subset {\mathbb {R}}^n\). We say that \(G_\varphi : X \rightarrow {\mathbb {R}}^n\) is a residual measure operator related to \(\varphi \) and X, if it satisfies
$$\begin{aligned} \forall ~x\in X,\quad G_\varphi (x)=0~\Longleftrightarrow ~x\in {\mathbf {crit}}\,\varphi . \end{aligned}$$
Especially, if we further assume that \(\varphi \) is convex, the above condition can be written as
$$\begin{aligned} \forall ~x\in X,\quad G_\varphi (x)=0~\Longleftrightarrow ~x\in \mathrm {Arg}\min \varphi . \end{aligned}$$
Now, we define a group of abstract EB conditions.
Definition 5
Let \(\varphi \in \Gamma ({\mathbb {R}}^n)\) be such that it achieves its minimum \(\min \varphi \) and that its critical point set \({\mathbf {crit}}\varphi \) is nonempty and closed. Let \(X\subset {\mathbb {R}}^n\), \(\Omega \subset X\), and \(G_\varphi \) be a residual measure operator related to \(\varphi \) and X. Define the projection operator \({\mathcal {P}}_\varphi : {\mathbb {R}}^n\rightrightarrows {\mathbb {R}}^n\) onto \({\mathbf {crit}}\varphi \) by:
$$\begin{aligned} {\mathcal {P}}_\varphi (x):=\{y\in {\mathbf {crit}}\,\varphi : \Vert x-y\Vert =d(x,{\mathbf {crit}}\,\varphi )\}. \end{aligned}$$
We call \(d(x,{\mathbf {crit}}\varphi )\) point value error, \(\varphi (x)-\min \varphi \) objective value error, \(\Vert G_{\varphi }(x)\Vert \) residual value error, and \(\inf _{x_p\in {\mathcal {P}}_\varphi (x)}\langle G_{\varphi }(x), x-x_p\rangle \) least correlated error. With these optimality measures, we say that
- 1.
\(\varphi \) satisfies the residual-point value EB condition with operator \(G_\varphi \) and constant \(\kappa >0\) on \(\Omega \), abbreviated \((G_\varphi , \kappa , \Omega )\)-(res-EB) condition, if:
$$\begin{aligned} \forall ~ x\in \Omega \cap {\mathrm {dom}}\varphi ,~~\Vert G_{\varphi }(x)\Vert \ge \kappa \cdot d(x,{\mathbf {crit}}\varphi ); \end{aligned}$$(res-EB) - 2.
\(\varphi \) satisfies the correlated-point value EB condition with operator \(G_\varphi \) and constant \(\nu >0\) on \(\Omega \), abbreviated \((G_\varphi , \nu , \Omega )\)-(cor-EB) condition, if:
$$\begin{aligned} \forall ~ x\in \Omega \cap {\mathrm {dom}}\varphi ,~~\inf _{x_p\in {\mathcal {P}}_\varphi (x)}\langle G_{\varphi }(x), x-x_p\rangle \ge \nu \cdot d^2(x,{\mathbf {crit}}\varphi ); \end{aligned}$$(cor-EB) - 3.
\(\varphi \) satisfies the objective-point value EB condition with constant \(\alpha >0\) on \(\Omega \), abbreviated \((\varphi , \alpha , \Omega )\)-(obj-EB) condition, if:
$$\begin{aligned} \forall ~ x\in \Omega \cap {\mathrm {dom}}\varphi ,~~\varphi (x)- \min \varphi \ge \frac{\alpha }{2}\cdot d^2(x,{\mathbf {crit}}\varphi ); \end{aligned}$$(obj-EB) - 4.
\(\varphi \) satisfies the residual-objective value EB condition with operator \(G_\varphi \) and constant \(\eta >0\) on \(\Omega \), abbreviated \((G_\varphi , \eta , \Omega )\)-(res-obj-EB) condition, if:
$$\begin{aligned} \forall ~ x\in \Omega \cap {\mathrm {dom}}\varphi ,~~\Vert G_{\varphi }(x)\Vert \ge \eta \cdot \sqrt{\varphi (x)- \min \varphi }; \end{aligned}$$(res-obj-EB) - 5.
\(\varphi \) satisfies the correlated-residual value EB condition with operator \(G_\varphi \) and constant \(\beta >0\) on \(\Omega \), abbreviated \((G_\varphi , \beta , \Omega )\)-(cor-res-EB) condition, if:
$$\begin{aligned} \forall ~ x\in \Omega \cap {\mathrm {dom}}\varphi ,~~\inf _{x_p\in {\mathcal {P}}_\varphi (x)}\langle G_{\varphi }(x), x-x_p\rangle \ge \beta \cdot \Vert G_{\varphi }(x)\Vert ^2; \end{aligned}$$(cor-res-EB) - 6.
\(\varphi \) satisfies the correlated-objective value EB condition with operator \(G_\varphi \) and constant \(\omega >0\) on \(\Omega \), abbreviated \((G_\varphi , \omega , \Omega )\)-(cor-obj-EB) condition, if:
$$\begin{aligned} \forall ~ x\in \Omega \cap {\mathrm {dom}}\varphi ,~~\inf _{x_p\in {\mathcal {P}}_\varphi (x)}\langle G_{\varphi }(x), x-x_p\rangle \ge \omega \cdot (\varphi (x)- \min \varphi ). \end{aligned}$$(cor-obj-EB)
We will refer to these EB conditions as global if \(\Omega ={\mathbb {R}}^n\). For global EB conditions, we will omit \(\Omega \) for simplicity.
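These abstract quantities can be probed numerically. In the following sketch of ours (the choice \(\varphi (x)=\frac{1}{2}\Vert Ax-b\Vert ^2\) with \(G_\varphi =\nabla \varphi \) is an illustrative assumption, not from the paper), we sample random points and check (res-EB) and (obj-EB) with \(\kappa =\alpha =\sigma _{\min +}^2\), the square of the smallest positive singular value of A:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 6))  # rank 2: crit(phi) = Argmin(phi) is an affine set
b = A @ rng.standard_normal(6)                                 # b in range(A), so min(phi) = 0
s = np.linalg.svd(A, compute_uv=False)                         # singular values, descending
sig = s[1]                                                     # smallest positive singular value (rank 2)
A_pinv = np.linalg.pinv(A)

for _ in range(100):
    x = rng.standard_normal(6)
    d = np.linalg.norm(A_pinv @ (A @ x - b))                   # point value error d(x, crit(phi))
    res = np.linalg.norm(A.T @ (A @ x - b))                    # residual value error ||G_phi(x)||
    obj = 0.5 * np.linalg.norm(A @ x - b) ** 2                 # objective value error phi(x) - min(phi)
    assert res >= sig**2 * d - 1e-9                            # (res-EB) with kappa = sig**2
    assert obj >= 0.5 * sig**2 * d**2 - 1e-9                   # (obj-EB) with alpha = sig**2
```

Decomposing \(x-{\hat{x}}\) into row-space and null-space components of A shows that both constants are tight on the row-space direction associated with \(\sigma _{\min +}\).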
In order to gain some intuition for the abstract EB conditions, we point out their correspondences to existing notions: (res-EB) corresponds to the EB condition of Hoffman’s type [20, 38, 69], (res-obj-EB) to the Polyak–Łojasiewicz type [10, 27], (obj-EB) to the quadratic growth condition [10, 20], (cor-EB) to the RSI type [67], and (cor-obj-EB) to the subgradient inequality for convex functions. The (cor-res-EB) condition, which will be used in Sect. 5, is a relaxation of the following property:
$$\begin{aligned} \langle \nabla \varphi (x)-\nabla \varphi (y), x-y\rangle \ge \frac{1}{L}\Vert \nabla \varphi (x)-\nabla \varphi (y)\Vert ^2,\quad \forall x, y\in {\mathbb {R}}^n, \end{aligned}$$
which is equivalent to \(\varphi \in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\); see Theorem 2.1.5 in [43]. The reader can now see that we use the residual measure operator in place of concrete notions such as the gradient and the proximal gradient in some existing conditions. Compared to [10], however, we consider only the case where the desingularizing function (used in the definition of the Kurdyka–Łojasiewicz inequality) is the square root. Below, we briefly explain why we restrict ourselves to the square-root case.
- 1.
Our motivation is to study linear convergence without strong convexity, as described at the beginning of the introduction. Many clues indicate that linear convergence corresponds to the square-root case. In fact, we will show that this link is essential.
- 2.
Our framework includes six types of abstract EB conditions, while [10] considers only two of them. Currently, it is not an easy task to incorporate general desingularizing functions into this framework, but this certainly deserves further study.
In our earlier manuscript [62], we gave only the global versions of the EB conditions in Definition 5. The present form was obtained by incorporating the referee’s comments and was influenced by the recent work [22], resulting in a much more complete list than the previous one.
4.2 Interplay between the EB conditions
We first show the interplay between the abstract EB conditions. The proof of equivalence will rely heavily on a technical result developed in [10].
Theorem 1
Let \(\varphi \in \Gamma ({\mathbb {R}}^n)\) be such that it achieves its minimum \(\min \varphi \) and that \({\mathbf {crit}}\varphi \) is nonempty and closed. Let \(X\subset {\mathbb {R}}^n\), \(\Omega \subset X\), and \(G_\varphi \) be a residual measure operator related to \(\varphi \) and X. Assume that the \((G_\varphi , \omega , \Omega )\)-(cor-obj-EB) condition holds. Then, we have the following implications
$$\begin{aligned} ({\text {obj-EB}})\Longrightarrow ({\text {cor-EB}})\Longrightarrow ({\text {res-EB}})\Longrightarrow ({\text {res-obj-EB}}). \end{aligned}$$
One can take \(\nu =\frac{\alpha \omega }{2}, \kappa =\nu , \eta =\sqrt{\kappa \omega }\) to show the above implications. If we further assume that \(\varphi \in \Gamma _0({\mathbb {R}}^n)\), \(\Omega \) is \(\partial \varphi \)-invariant, and \(G_\varphi \) satisfies
$$\begin{aligned} \Vert G_\varphi (x)\Vert \le \Vert \partial ^0 \varphi (x)\Vert ,\quad \forall x\in \Omega \cap {\mathrm {dom}}\partial \varphi , \end{aligned}$$
then we have the following equivalent relationship
$$\begin{aligned} ({\text {obj-EB}})\Longleftrightarrow ({\text {cor-EB}})\Longleftrightarrow ({\text {res-EB}})\Longleftrightarrow ({\text {res-obj-EB}}). \end{aligned}$$
For \(({\text {res-obj-EB}})\Rightarrow ({\text {obj-EB}})\), one can take \(\alpha =\frac{1}{2}\eta ^2\).
Proof
We prove this theorem by showing the following implications
$$\begin{aligned} ({\text {obj-EB}})\Longrightarrow ({\text {cor-EB}})\Longrightarrow ({\text {res-EB}})\Longrightarrow ({\text {res-obj-EB}})\Longrightarrow ({\text {obj-EB}}). \end{aligned}$$
Firstly, the implication of \(({\text {obj-EB}})\Rightarrow ({\text {cor-EB}})\) follows from
$$\begin{aligned} \inf _{x_p\in {\mathcal {P}}_\varphi (x)}\langle G_{\varphi }(x), x-x_p\rangle \ge \omega \cdot (\varphi (x)-\min \varphi )\ge \frac{\omega \alpha }{2}\cdot d^2(x,{\mathbf {crit}}\,\varphi ), \end{aligned}$$
where the left inequality is (cor-obj-EB) and the right one is (obj-EB).
Secondly, the implication of \(({\text {cor-EB}})\Rightarrow ({\text {res-EB}})\) follows from a direct application of the Cauchy–Schwarz inequality to (cor-EB).
Thirdly, we show \(({\text {res-EB}})\Rightarrow ({\text {res-obj-EB}})\). By (cor-obj-EB) and (res-EB), we derive that for \(\forall ~ x\in \Omega \cap {\mathrm {dom}}\varphi \),
$$\begin{aligned} \omega \cdot (\varphi (x)-\min \varphi )\le \inf _{x_p\in {\mathcal {P}}_\varphi (x)}\langle G_{\varphi }(x), x-x_p\rangle \le \Vert G_{\varphi }(x)\Vert \cdot d(x,{\mathbf {crit}}\,\varphi )\le \frac{1}{\kappa }\Vert G_{\varphi }(x)\Vert ^2. \end{aligned}$$
Thus, it holds that \(\forall ~ x\in \Omega \cap {\mathrm {dom}}\varphi , ~~\Vert G_{\varphi }(x)\Vert \ge \sqrt{\kappa \omega }\cdot \sqrt{\varphi (x)-\min \varphi },\) which is just (res-obj-EB).
At last, we show \(({\text {res-obj-EB}}) \Rightarrow ({\text {obj-EB}})\). The following is based on an argument used for proving Theorem 27 in [10]. For the sake of completeness, we reproduce that proof in our particular case. First of all, take \(x\in \Omega \cap {\mathrm {dom}}\varphi \) and recall that we have additionally assumed \({\mathbf {crit}}\varphi =\mathrm {Arg}\min \varphi \). Without loss of generality, we assume that \( \min \varphi =0\) and \(x\notin \mathrm {Arg}\min \varphi \). According to the result about subgradient curves due to Brézis [12] and Bruck [13] and recently used in [10], we can find the unique absolutely continuous curve \(\chi _x:[0, +\infty )\rightarrow {\mathbb {R}}^n\) such that \(\chi _x(0)=x\) and
$$\begin{aligned} {\dot{\chi }}_x(t)\in -\partial \varphi (\chi _x(t)) \end{aligned}$$
for almost every \(t>0\). Moreover, \(\chi _x(t)\) converges to some point \({\hat{x}}\) in \(\mathrm {Arg}\min \varphi \) as \(t\rightarrow +\infty \) and the function \(t \mapsto \varphi (\chi _x(t))\) is nonincreasing and
By the \(\partial \varphi \)-invariance of \(\Omega \), we have \(\chi _x(t)\in \Omega \), and hence \(\chi _x(t)\in \Omega \cap {\mathrm {dom}}\varphi \) since \(\varphi (\chi _x(t))\) is nonincreasing. Let
We claim that \(T>0\). Otherwise, \(T=0\) and then, by the lower semicontinuity property of \(\varphi \), we can derive that
This contradicts \(x\notin \mathrm {Arg}\min \varphi \). Now, combining (10) and (res-obj-EB), we derive that
Observe that for \(p, q\in [0, T)\) with \(q\ge p\),
where \(\text {length}(\chi _x(t), p, q)\) stands for the length of the subgradient curve from p to q. By letting \(p=0\) and \(q\rightarrow +\infty \) if \(T=+\infty \), or \(q\rightarrow T\) if \(T<+\infty \), we obtain
Therefore, for all \(x\in \Omega \cap {\mathrm {dom}}\varphi \) we always have
which implies that (obj-EB) with \(\alpha =\frac{\eta ^2}{2}\) holds. This completes the proof. \(\square \)
As a direct consequence, we have the following corollary.
Corollary 1
Let \(\varphi \in \Gamma _0({\mathbb {R}}^n)\) be such that it achieves its minimum \(\min \varphi \), so that \(\mathrm {Arg}\min \varphi \ne \emptyset \). Let \(X\subset {\mathbb {R}}^n\), \(\Omega \subset X\) be \(\partial \varphi \)-invariant, and let \(G^i_\varphi , i=1, 2\) be two different residual measure operators related to the same function \(\varphi \) and the same subset X. We assume that \(G^i_\varphi , i=1, 2\) satisfy
and \((G^i_\varphi , \omega , \Omega )\)-(cor-obj-EB) conditions hold. Then, we have
Now, we list some cases where the equivalence between the EB conditions indeed holds.
Corollary 2
The EB conditions (cor-EB), (res-EB), (obj-EB), and (res-obj-EB) are equivalent under each of the following situations:
- case 1:
\(\varphi \in {{\mathcal {F}}}^{1}({\mathbb {R}}^n)\) achieves its minimum \(\min \varphi \), \(X={\mathbb {R}}^n\) and \(\Omega \subset X\) is \(\nabla \varphi \)-invariant, and \(G_{\varphi } =\nabla \varphi \);
- case 2:
\(\varphi \in \Gamma _0({\mathbb {R}}^n)\) achieves its minimum \(\min \varphi \), \(X={\mathrm {dom}}\partial \varphi \) and \(\Omega \subset X\) is \(\partial \varphi \)-invariant, and \(G_{\varphi } =\partial ^0\varphi \);
- case 3:
\(\varphi =f +g \), where \(f \in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\) and \(g\in \Gamma _0({\mathbb {R}}^n)\), achieves its minimum \(\min \varphi \), \(X={\mathbb {R}}^n\) and \(\Omega \subset X\) is \(\partial \varphi \)-invariant, and \(G_{\varphi }={\mathcal {R}}_t\), where \({\mathcal {R}}_t(x):=t^{-1}(x-x^+)\) with \(t\in (0,\frac{1}{L}]\) and \(x^+={\mathbf {prox}}_{tg}(x-t\nabla f(x))\). In addition, we assume that there exists a constant \(0<\epsilon \le \frac{2}{t}\) such that
$$\begin{aligned} \Vert G_{\varphi }(x)\Vert ^2\ge \epsilon \left( \varphi (x)-\varphi \left( x^+\right) \right) . \end{aligned}$$(12)
Proof
First of all, \({\mathbf {crit}}\varphi \) is nonempty since \({\mathbf {crit}}\varphi =\mathrm {Arg}\min \varphi \ne \emptyset \), and is closed since \(\varphi \) is proper and lower semicontinuous, in all the listed cases. Secondly, by optimality conditions, one can easily verify that \(G_\varphi \) is a residual measure operator in each of the listed cases. We only need to verify the remaining assumptions in Theorem 1.
For both cases 1 and 2, the convexity of \(\varphi \) implies the (cor-obj-EB) condition with \(\omega =1\). In case 1, assumption (10) holds trivially because \(\partial \varphi (x)=\{\nabla \varphi (x)\}\). In case 2, assumption (10) follows from the definition of \(\partial ^0\varphi (x)\).
Now, let us consider the case 3. Since \(f\in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\) and \(g\in \Gamma _0({\mathbb {R}}^n)\), we have that \(G_{\varphi }(x)\) satisfies the standard result
see e.g. Lemma 2.3 in [8] or Lemma 2 in the very recent work [4]. Since \(\varphi \) also belongs to \(\Gamma _0({\mathbb {R}}^n)\), we can conclude that \(\mathrm {Arg}\min \varphi \) is a nonempty closed convex set. Thus, by the projection theorem, there exists a unique projection point of x onto \(\mathrm {Arg}\min \varphi \), denoted by \(x_p\). Using the inequality above with \(y=x_p\) and the assumption (12), we derive that
from which the \((G_\varphi , \omega , \Omega )\)-(cor-obj-EB) condition with \(\omega =\frac{t\epsilon }{2}\) follows. The assumption (10) in this case was established in Theorem 3.5 in [20] and Lemma 4.1 in [32]. This completes the proof. \(\square \)
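To make case 3 concrete, the following Python sketch (our illustration, with made-up data, not part of the paper's analysis) computes the residual \({\mathcal {R}}_t\) for a least-squares-plus-\(\ell _1\) objective and checks numerically that forward–backward iterations drive it to zero, as a residual measure operator should at critical points.

```python
import numpy as np

# Sketch of the residual measure operator R_t from case 3, for phi = f + g
# with f(x) = 0.5*||Ax - b||^2 and g = lam*||x||_1.  The data A, b, lam are
# illustrative, not from the paper.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))          # wide matrix: f is not strongly convex
b = A @ rng.standard_normal(50)
lam = 0.2 * np.max(np.abs(A.T @ b))        # keeps the solution sparse
L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of grad f
t = 1.0 / L

def prox_g(y, step):                       # soft-thresholding = prox of step*lam*||.||_1
    return np.sign(y) * np.maximum(np.abs(y) - step * lam, 0.0)

def residual(x):                           # R_t(x) = (x - x^+)/t with x^+ as in case 3
    return (x - prox_g(x - t * (A.T @ (A @ x - b)), t)) / t

# Forward-backward iterations drive R_t toward zero, consistent with R_t
# being a residual measure operator: R_t vanishes exactly at critical points.
x = np.zeros(50)
r0 = np.linalg.norm(residual(x))
for _ in range(5000):
    x = prox_g(x - t * (A.T @ (A @ x - b)), t)

print(np.linalg.norm(residual(x)) / r0)    # tiny: near-criticality reached
```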
Remark 1
-
(i)
In cases 1 and 2, from Theorem 1 we can see that if one only needs the implication
$$\begin{aligned} ({\text {obj-EB}})\Rightarrow ({\text {cor-EB}})\Rightarrow ({\text {res-EB}})\Rightarrow ({\text {res-obj-EB}}), \end{aligned}$$then the assumption on \(\Omega \) can be removed.
-
(ii)
In case 2, if \(\Omega ={\mathrm {dom}}\partial \varphi \), then \((\varphi , \alpha , {\mathrm {dom}}\partial \varphi )\)-(obj-EB) is actually equivalent to
$$\begin{aligned} \forall ~ x\in {\mathbb {R}}^n,~~\varphi (x)- \min \varphi \ge \frac{\alpha }{2}\cdot d^2(x, \mathrm {Arg}\min \varphi ), \end{aligned}$$since \({\mathrm {dom}}\partial \varphi \) is a dense subset of \({\mathrm {dom}}\varphi \) according to Corollary 16.29 in [7] and \(\varphi (x)=+\infty \) for \(x\notin {\mathrm {dom}}\varphi \).
We note that while this work was under review, the authors of [27] independently obtained the equivalence between the EB conditions (cor-EB), (res-EB), (obj-EB), and (res-obj-EB) for functions in \({{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\). Similarly, the authors of [22] recently and independently obtained the equivalence between the EB conditions (res-EB), (obj-EB), and (res-obj-EB) for functions in \(\Gamma _0({\mathbb {R}}^n)\). The former is limited to \({{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\), while the latter mainly focuses on \(\Gamma _0({\mathbb {R}}^n)\) but does not consider (cor-EB).
Observe that the condition (12) is implied by the (res-obj-EB) condition since
And also, note that \(\varphi =f+g\in \Gamma _0({\mathbb {R}}^n)\) if \(f \in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\) and \(g\in \Gamma _0({\mathbb {R}}^n)\). With a little effort, we can get the following result.
Corollary 3
Let \(\varphi =f+g\) with \(f \in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\) and \(g\in \Gamma _0({\mathbb {R}}^n)\) achieve its minimum \(\min \varphi \), and let \(\Omega \subset {\mathbb {R}}^n\) be \(\partial \varphi \)-invariant and \(t\in (0, \frac{1}{L}]\). If the \(({\mathcal {R}}_t, \eta , \Omega )\)-(res-obj-EB) condition holds, then each of the following conditions holds and hence they are equivalent:
Based on the relationship established in Theorem 2 of [63], namely \((\varphi , \alpha , \Omega )\)-(obj-EB) \(\Leftrightarrow \) \(({\mathcal {R}}_t, \kappa , \Omega )\)-(res-EB) \(\Leftrightarrow \) \(({\mathcal {R}}_t, \nu , \Omega )\)-(cor-EB), together with case 2 of Corollary 2, we still obtain the following result even without assuming the \(({\mathcal {R}}_t, \eta , \Omega )\)-(res-obj-EB) condition.
Corollary 4
Let \(\varphi =f+g\) with \(f \in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\) and \(g\in \Gamma _0({\mathbb {R}}^n)\) achieve its minimum \(\min \varphi \), and let \(\Omega \subset {\mathrm {dom}}\partial \varphi \) be \(\partial \varphi \)-invariant and \(t\in (0, \frac{1}{L}]\). Then, we have
Note that the \(({\mathcal {R}}_t, \eta , \Omega )\)-(res-obj-EB) condition is not involved in the equivalence above. This might explain why one can avoid the condition (12) in existing related results.
In all corollaries above, parameters involved in different EB conditions can be set explicitly as in Theorem 1, but we omit the details here.
5 An abstract gradient-type method: linear convergence and applications
In this section, we define an abstract gradient-type method by viewing the negative of the residual measure operator as a descent direction, and then establish a necessary and sufficient condition for its linear convergence based on the abstract EB conditions defined before. The following main result generalizes Proposition 1.
Theorem 2
Let \(\varphi \in \Gamma ({\mathbb {R}}^n)\) be such that it achieves its minimum \(\min \varphi \) and that \({\mathbf {crit}}\varphi \) is nonempty and closed. Let \(X\subset {\mathbb {R}}^n\), \(\Omega \subset X\), and \(G_\varphi \) be a residual measure operator related to \(\varphi \) and X. Suppose that \(\varphi \) satisfies the \((G_\varphi , \beta , \Omega )\)-(cor-res-EB) condition. Define the abstract gradient-type method by
with step size \(h>0\) and arbitrary initial point \(x_0\in \Omega \). Assume that \(x_k\in \Omega , k\ge 0\). Let \(\tau , \theta \in (0, 1)\).
- (i)
If \(\varphi \) satisfies the \((G_\varphi , \nu , \Omega )\)-(cor-EB) condition with \(\nu <\frac{1}{\beta }\) and the following inequalities hold
$$\begin{aligned} \frac{1-\tau }{2\theta \nu }\le h\le 2(1-\theta )\beta ,~~\tau \ge 1-4\theta (1-\theta )\beta \nu , \end{aligned}$$(13)then the abstract gradient-type method converges linearly in the sense that
$$\begin{aligned} d^2(x_{k+1},{\mathbf {crit}}\varphi ) \le \tau \cdot d^2(x_k,{\mathbf {crit}}\varphi ),~~k\ge 0. \end{aligned}$$(14)The optimal rate \(\tau _0:=1-\beta \nu \) is obtained at \(h=\beta \) and \(\theta =\frac{1}{2}\).
- (ii)
Conversely, if the abstract gradient-type method converges linearly in the sense of (14), then \(\varphi \) satisfies the \((G_\varphi , \nu , \Omega )\)-(cor-EB) condition with \(\nu =\frac{\beta (1-\sqrt{\tau })^2}{h^2}\).
Proof
First, we repeat the argument before (6) to obtain that for \(v_k\in {\mathcal {P}}_\varphi (x_k)\),
Take \(\theta \in (0, 1)\) and then use a convex combination of the (cor-res-EB) and (cor-EB) conditions at \(x=x_k\) to obtain
Therefore, we can derive that
where the second inequality follows from the condition (13) on the step size. Obviously, the optimal linear convergence rate \(\tau _0=1-\beta \nu \) can be obtained at \(h=\beta , \theta =\frac{1}{2}\).
Conversely, pick \(u_{k+1}\in {\mathcal {P}}_\varphi (x_{k+1})\) to derive that
Combine (15) and the fact of linear convergence
to obtain
Thus, together with the (cor-res-EB) condition, we can derive that
Observe that the starting point \(x_0\in \Omega \) can be arbitrary. Therefore, the (cor-EB) condition with \(\nu =\frac{\beta (1-\sqrt{\tau })^2}{h^2}\) holds. This completes the proof. \(\square \)
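The scheme of Theorem 2 can be illustrated numerically. The sketch below (our own toy instance, not part of the proof) runs \(x_{k+1}=x_k-hG_\varphi (x_k)\) with \(G_\varphi =\nabla \varphi \) for a convex but not strongly convex quadratic, and observes the Q-linear contraction of the squared distance to \({\mathbf {crit}}\varphi \) as in (14).

```python
import numpy as np

# Toy illustration of the abstract scheme x_{k+1} = x_k - h*G_phi(x_k) with
# G_phi = grad(phi), phi(x) = 0.5*||Ax - b||^2 and A rank-deficient: phi is
# convex but not strongly convex, yet the squared distance to crit(phi)
# contracts Q-linearly as in (14).  Data are made up.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 12))           # 5 x 12: Argmin phi is an affine set
b = A @ rng.standard_normal(12)            # b in range(A), so min phi = 0
A_pinv = np.linalg.pinv(A)
L = np.linalg.norm(A, 2) ** 2
h = 1.0 / L

def dist_to_argmin(x):                     # d(x, Argmin phi), Argmin phi = {x: Ax = b}
    return np.linalg.norm(A_pinv @ (A @ x - b))

x = rng.standard_normal(12)
ratios = []
for _ in range(50):
    d_old = dist_to_argmin(x)
    x = x - h * (A.T @ (A @ x - b))        # abstract gradient-type step
    ratios.append(dist_to_argmin(x) ** 2 / d_old ** 2)

print(max(ratios))                         # a contraction factor tau < 1
```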
With Theorem 2 in hand, we now state the necessary and sufficient EB conditions guaranteeing linear convergence for the gradient method, the proximal point algorithm, and the forward–backward splitting algorithm. These conditions, previously known to be sufficient for linear convergence (see e.g. Section 4 in [10]), are actually also necessary. We start with the gradient method, applied to possibly nonconvex optimization problems.
Corollary 5
Let \(f:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) be a gradient-Lipschitz-continuous function with modulus \(L>0\). Assume that f achieves its minimum \(\min f\) and \({\mathbf {crit}}f=\mathrm {Arg}\min f \ne \emptyset \). Let \(\epsilon >0\) be a fixed constant and set \(\Omega =\{x: f(x)\le \min f+\epsilon \}\). Let \(\{x_k\}_{k\ge 0}\) be generated by the gradient descent method (5) with \(h=\frac{1}{L}\) and \(x_0\in \Omega \).
- (i)
If f satisfies the \((\nabla f, \nu , \Omega )\)-(cor-EB) condition with \(\nu <L\), then the gradient descent method (5) with \(h=\frac{1}{L}\) converges linearly in the sense that
$$\begin{aligned} f(x_{k+1})-\min f\le \left( 1-\left( \frac{\nu }{L}\right) ^2\right) (f(x_k)-\min f),\quad k\ge 0. \end{aligned}$$(16) - (ii)
If we further assume that f is convex, then the gradient descent method (5) with \(h=\frac{1}{L}\) attains the following linear convergence:
$$\begin{aligned} d^2(x_{k+1},\mathrm {Arg}\min f) \le \left( 1-\frac{\nu }{L}\right) \cdot d^2(x_k,\mathrm {Arg}\min f),\quad k\ge 0. \end{aligned}$$(17) - (iii)
Conversely, if f is convex and if starting from an arbitrary initial point \(x_0\in \Omega \), the gradient descent method (5) with \(h=\frac{1}{L}\) converges linearly like (17) but replacing \(1-\frac{\nu }{L}\) with \(\tau \), then f satisfies the \((\nabla f, \nu , \Omega )\)-(cor-EB) condition with \(\nu =L(1-\sqrt{\tau })^2\).
Proof
We first show (16) by modifying the argument due to Polyak [45] and recently highlighted in [27, 28]. The gradient-Lipschitz-continuity of f implies
Using this inequality with \(y=x_{k+1}\) and \(x=x_k\) and together with the update rule of gradient descent, we get
which implies \({x_k}\in \Omega , k\ge 0\). Using again inequality (18) with \(y=x_k\) and \(x=u_k\in {\mathcal {P}}_f(x_k)\), and noting that \(u_k\in {\mathbf {crit}}f= \mathrm {Arg}\min f\) and hence \(f(u_k)=\min f\) and \(\nabla f(u_k)=0\), we have
Applying the Cauchy–Schwarz inequality to the \((\nabla f, \nu , \Omega )\)-(cor-EB) condition, we obtain
Thus, combining inequalities (19) and (20), we have that
from which (16) follows.
Now, with the additional convexity assumption of f, we have \(f\in {{\mathcal {F}}}^{1,1}_{L}({\mathbb {R}}^n)\), which is equivalent to the following condition
see Theorem 2.1.5 in [43]. Using this inequality with \(y\in {\mathcal {P}}_f(x)\), we obtain
which is just the \((\nabla f, \beta , \Omega )\)-(cor-res-EB) condition with \(\beta =\frac{1}{L}\). Therefore, the remaining results follow from Theorem 2. This completes the proof. \(\square \)
Remark 2
In Example 2 in [64], we constructed a one-dimensional nonconvex function that satisfies all the conditions in Corollary 5 that ensure (16). In this sense, (16) is one of the few general results on global linear convergence for nonconvex problems. We note that a similar phenomenon was observed by the authors of [27] under the Polyak–Łojasiewicz condition.
While \({\mathbf {crit}}f=\mathrm {Arg}\min f\) is a strong assumption, it is not the same as convexity but implies the weaker condition of invexity, which says that a function f is invex if and only if every critical point of f is a global minimum. This assumption is satisfied by some nonconvex optimization problems that have recently appeared in machine/deep learning; see e.g. [61, 68].
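A quick numerical check of this phenomenon: the sketch below uses the well-known nonconvex Polyak–Łojasiewicz example \(f(x)=x^2+3\sin ^2(x)\) from the literature (e.g. [27]); this is our illustrative choice, not Example 2 of [64]. Every critical point of this f is a global minimizer, and gradient descent with \(h=\frac{1}{L}\) exhibits the Q-linear objective decrease of (16).

```python
import numpy as np

# Illustration for Remark 2 with a known nonconvex PL example from the
# literature (e.g. [27]): f(x) = x^2 + 3*sin(x)^2.  f is nonconvex, min f = 0
# is attained only at x = 0, and crit f = Argmin f.  Gradient descent with
# h = 1/L still decreases the objective gap Q-linearly, as in (16).
f = lambda x: x**2 + 3.0 * np.sin(x)**2
df = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)
Lip = 8.0                                   # |f''(x)| = |2 + 6*cos(2x)| <= 8

x = 2.0
ratios = []
for _ in range(100):
    fx = f(x)
    if fx < 1e-12:                          # stop once the gap hits the noise floor
        break
    x = x - df(x) / Lip                     # gradient step with h = 1/L
    ratios.append(f(x) / fx)                # since min f = 0, f(x) is the gap

print(max(ratios))                          # strictly below 1: Q-linear decrease
```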
Before discussing the linear convergence of the proximal point algorithm (PPA), we introduce the following result.
Lemma 3
([7, 49]) Let \(f\in \Gamma _0({\mathbb {R}}^n)\) and \(\lambda >0\). Let the Moreau–Yosida regularization of f be defined by
Then,
\(f_\lambda \) is real-valued, convex, and continuously differentiable and can be formulated as
$$\begin{aligned} f_\lambda (x)=f\left( {\mathbf {prox}}_{\lambda f}(x)\right) +\frac{1}{2\lambda }\left\| x-{\mathbf {prox}}_{\lambda f}(x)\right\| ^2; \end{aligned}$$Its gradient
$$\begin{aligned} \nabla f_\lambda (x)=\lambda ^{-1}\left( x-{\mathbf {prox}}_{\lambda f}(x)\right) \end{aligned}$$is \(\lambda ^{-1}\)-Lipschitz continuous.
\(\mathrm {Arg}\min f_\lambda =\mathrm {Arg}\min f\) and \(\min f=\min f_\lambda \).
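Lemma 3 can be verified numerically in a simple case of our choosing: for \(f=|\cdot |\), \({\mathbf {prox}}_{\lambda f}\) is soft-thresholding and \(f_\lambda \) is the Huber function. The sketch below checks the envelope formula, the gradient formula, and the \(\lambda ^{-1}\)-Lipschitz bound.

```python
import numpy as np

# Numerical check of Lemma 3 for f = |.| (our illustrative choice): prox is
# soft-thresholding and the Moreau-Yosida regularization is the Huber function.
lam = 0.5
prox = lambda x: np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def f_lam(x):                              # envelope formula from Lemma 3
    p = prox(x)
    return np.abs(p) + (x - p) ** 2 / (2.0 * lam)

def grad_f_lam(x):                         # nabla f_lam(x) = (x - prox(x))/lam
    return (x - prox(x)) / lam

xs = np.linspace(-3.0, 3.0, 601)
huber = np.where(np.abs(xs) <= lam, xs ** 2 / (2 * lam), np.abs(xs) - lam / 2)
print(np.max(np.abs(f_lam(xs) - huber)))   # ~0: the two formulas agree

g = grad_f_lam(xs)                         # gradient is (1/lam)-Lipschitz,
slopes = np.abs(np.diff(g) / np.diff(xs))  # so all difference quotients <= 2
print(slopes.max(), grad_f_lam(0.0))       # and it vanishes on Argmin f = {0}
```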
Now, we are ready to present the result of linear convergence for PPA.
Corollary 6
Let \(f\in \Gamma _0({\mathbb {R}}^n)\) achieve its minimum \(\min f\) and \(\lambda >0\). Let \(\epsilon >0\) be a fixed constant and set \(\Omega =\{x: f(x)\le \min f+\epsilon \}\cap {\mathrm {dom}}\partial f\). Starting from \(x_0\in \Omega \), the PPA can be defined by
- (i)
If f satisfies the \((f,\alpha , \Omega )\)-(obj-EB) condition, then \(f_\lambda \) satisfies the \((\nabla f_\lambda , \nu , \Omega )\)-(cor-EB) condition with \(\nu =\min \{\frac{\alpha }{4}, \frac{1}{4\lambda }\}\), and hence the PPA converges linearly in the sense that
$$\begin{aligned} d^2(x_{k+1},\mathrm {Arg}\min f) \le \left( 1-\min \left\{ \frac{\alpha \lambda }{4}, \frac{1}{4}\right\} \right) \cdot d^2(x_k,\mathrm {Arg}\min f),\quad k\ge 0.\nonumber \\ \end{aligned}$$(21) - (ii)
Conversely, if starting from an arbitrary initial point \(x_0\in \Omega \) the PPA converges linearly like (21) but replacing the rate \(1-\min \{\frac{\alpha \lambda }{4}, \frac{1}{4}\}\) with a constant \(\tau \in (0, 1)\), then f satisfies the \((f, \alpha , \Omega )\)-(obj-EB) condition with \(\alpha = \frac{(1-\sqrt{\tau })^2}{2\lambda }\).
Proof
First of all, we remark that
From Lemma 3, we have \(f_\lambda \in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\) with \(L=\lambda ^{-1}\) and hence the \((\nabla f_\lambda , \beta ,\Omega )\)-(cor-res-EB) condition with \(\beta =\lambda \) holds. Now, we first prove that the \((f,\alpha , \Omega )\)-(obj-EB) condition implies the \((f_\lambda ,c, \Omega )\)-(obj-EB) condition with \(c=\min \{\frac{\alpha }{2}, \frac{1}{2\lambda }\}\). Indeed, letting \(v={\mathbf {prox}}_{\lambda f}(x)\) and \(v^\prime \in {\mathcal {P}}_f(v)\), for any \(x\in \Omega \cap {\mathrm {dom}}f\) we can derive that
where the first inequality uses the fact that \(f(v)+\frac{1}{2\lambda }\Vert v-x\Vert ^2\le f(x)\), which implies \(v\in \Omega \cap {\mathrm {dom}}f\), and the last inequality follows from \(v^\prime \in {\mathcal {P}}_f(v)\subset {\mathbf {crit}}f ={\mathbf {crit}}f_\lambda \). From case 1 of Corollary 2 and (i) in Remark 1, the \((f_\lambda ,c, \Omega )\)-(obj-EB) condition implies the \((\nabla f_\lambda , \nu , \Omega )\)-(cor-EB) condition with \(\nu =\min \{\frac{\alpha }{4}, \frac{1}{4\lambda }\}\). Therefore, (21) follows from Theorem 2 and the fact in (22).
Now, we turn to the necessity part. Invoking Theorem 2 again, we conclude that \(f_\lambda \) satisfies the \((\nabla f_\lambda , \nu , \Omega )\)-(cor-EB) condition with \(\nu =\frac{(1-\sqrt{\tau })^2}{\lambda }\), that is
Together with the fact of \({\mathbf {crit}}f={\mathbf {crit}}f_\lambda \), we can get
On the other hand, using the definition of \(v={\mathbf {prox}}_{\lambda f}(x)\), which implies \(\frac{1}{\lambda }(x-v)\in \partial f(v)\), and the convexity of f, we obtain that
which further implies that
Thus, combining (24) and (26) and noting that \({\mathrm {dom}}f\subset {\mathrm {dom}}f_\lambda \) and \(\Vert \partial ^0 f(x)\Vert =+\infty \) for \(x\notin {\mathrm {dom}}\partial f\), we obtain
This is just the \((\partial ^0 f, \kappa , \Omega )\)-(res-EB) condition with \(\kappa =\nu \). Note that \(\Omega \) is \(\partial f\)-invariant. Therefore, the \((f, \alpha , \Omega )\)-(obj-EB) condition with \(\alpha = \frac{(1-\sqrt{\tau })^2}{2\lambda }\) holds by case 2 of Corollary 2. \(\square \)
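As a numerical companion to Corollary 6 (an illustration with made-up data, not part of the proof), the sketch below applies PPA to \(f(x)=\frac{1}{2}\Vert Ax-b\Vert ^2\) with rank-deficient A, a function satisfying (obj-EB) on sublevel sets without being strongly convex, and observes a Q-linear contraction in the spirit of (21).

```python
import numpy as np

# PPA sketch for f(x) = 0.5*||Ax - b||^2 with rank-deficient A (made-up data):
# f satisfies (obj-EB) on sublevel sets without being strongly convex.  The
# prox step solves a regularized linear system, and the distance to Argmin f
# contracts Q-linearly, in the spirit of (21).
rng = np.random.default_rng(2)
A = rng.standard_normal((4, 10))
b = A @ rng.standard_normal(10)            # min f = 0, Argmin f = {x: Ax = b}
lam = 1.0
A_pinv = np.linalg.pinv(A)
H = A.T @ A + np.eye(10) / lam             # Hessian of the prox subproblem

def prox_step(x):                          # x_{k+1} = prox_{lam f}(x_k)
    return np.linalg.solve(H, A.T @ b + x / lam)

def dist(x):                               # distance to the affine set Ax = b
    return np.linalg.norm(A_pinv @ (A @ x - b))

x = rng.standard_normal(10)
ratios = []
for _ in range(20):
    d_old = dist(x)
    x = prox_step(x)
    ratios.append(dist(x) ** 2 / d_old ** 2)

print(max(ratios))                         # < 1: Q-linear convergence of PPA
```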
Remark 3
Linear convergence of PPA was previously provided based on different EB conditions, such as the \({\L }\)ojasiewicz inequality [corresponding to (res-obj-EB)] in [2, 3, 10], the quadratic growth condition [corresponding to (obj-EB)] in Proposition 6.5.2 in [9], and the EB condition of Hoffman’s type [corresponding to (res-EB)] in Theorem 2.1 in [40]. Our novelty here mainly lies in the necessity part, i.e., conclusion (ii).
Finally, we discuss linear convergence of the forward–backward splitting (FBS) algorithm. Recall that \({\mathcal {R}}_{1/L}(x)=L\left( x-{\mathbf {prox}}_{\frac{1}{L}g}(x-\frac{1}{L}\nabla f(x))\right) \).
Corollary 7
Let \(\varphi =f +g \), where \(f \in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\) and \(g\in \Gamma _0({\mathbb {R}}^n)\), achieve its minimum \(\min \varphi \). Let \(\epsilon >0\) be a fixed constant and set \(\Omega =\{x: \varphi (x)\le \min \varphi +\epsilon \}\). Starting from \(x_0\in \Omega \), the FBS can be defined by
Denote \(S_k:=\sum _{i=0}^\infty \Vert {\mathcal {R}}_{1/L}(x_{k+i})\Vert ^2,\quad k\ge 0\).
- (i)
If \(\varphi \) satisfies the \(({\mathcal {R}}_{1/L}, \nu , \Omega )\)-(cor-EB) condition with \(\nu <2L\), then FBS converges linearly in the sense that
$$\begin{aligned}&\varphi (x_{k+1})-\min \varphi \le \left( 1-\frac{\nu }{2L}\right) \left( \varphi (x_k)-\min \varphi \right) ,\quad k\ge 0, \end{aligned}$$(28)$$\begin{aligned}&d^2(x_{k+1},\mathrm {Arg}\min \varphi ) \le \left( 1-\frac{\nu }{2L}\right) \cdot d^2(x_k,\mathrm {Arg}\min \varphi ),\quad k\ge 0, \end{aligned}$$(29)and
$$\begin{aligned} S_{k+1}\le \left( 1-\frac{\nu }{2L}\right) S_k ,\quad k\ge 0. \end{aligned}$$(30) - (ii)
Conversely, if starting from an arbitrary initial point \(x_0\in \Omega \), FBS converges linearly like (29) but replacing \(1-\frac{\nu }{2L}\) with \(\tau \), then \(\varphi \) satisfies the \(({\mathcal {R}}_{1/L}, \nu , \Omega )\)-(cor-EB) condition with \(\nu =\frac{L}{2}(1-\sqrt{\tau })^2\).
Proof
We rely on the following standard result (see again Lemma 2.3 in [8]):
Using this result successively at \(x=y=x_k\), and then at \(y=x_k, x=u_k\in {\mathcal {P}}_\varphi (x_k)\), together with the fact that \(x_{k+1}={\mathbf {prox}}_{\frac{1}{L}g}(x_k-\frac{1}{L}\nabla f(x_k))\), we obtain the following sufficient decrease property
and
Note that (32) implies \(x_k\in \Omega , k\ge 0\). Applying the Cauchy–Schwarz inequality to the \(({\mathcal {R}}_{1/L}, \nu , \Omega )\)-(cor-EB) condition, we obtain
from which the following inequality follows
Thus, we obtain
Combining (32) and (33), we get
from which the announced result (28) follows. The convergence result (30) can also be derived from (32) and (33). In fact, we first observe that for any integer \(N>0\), it holds
and hence the sufficient decrease property (32) yields
Together with (33), we derive that
Using (32) again, we obtain
i.e.,
from which the announced result (30) follows.
Now, using the standard result (31) with \(x=y_p\in {\mathcal {P}}_\varphi (y)\) yields
and noting that
we obtain
Thus, \(\varphi \) satisfies the \(({\mathcal {R}}_{1/L},\beta , \Omega )\)-(cor-res-EB) condition with \(\beta =\frac{1}{2L}\). Therefore, the remaining results follow from Theorem 2 and the fact that \({\mathbf {crit}}\varphi =\mathrm {Arg}\min \varphi \). \(\square \)
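The following sketch (our illustration; problem data are made up) runs FBS on a least-squares-plus-\(\ell _1\) objective, for which EB conditions of Luo–Tseng type are known to hold, and checks the Q-linear decrease of the objective gap as in (28).

```python
import numpy as np

# FBS sketch for phi = f + g with f(x) = 0.5*||Ax - b||^2 and g = lam*||x||_1,
# step size 1/L.  EB conditions are known to hold for such problems, and the
# objective gap contracts Q-linearly, consistent with (28).
rng = np.random.default_rng(3)
A = rng.standard_normal((15, 40))
b = rng.standard_normal(15)
lam = 0.5
L = np.linalg.norm(A, 2) ** 2

def phi(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

def fbs_step(x):
    y = x - (A.T @ (A @ x - b)) / L                           # forward step
    return np.sign(y) * np.maximum(np.abs(y) - lam / L, 0.0)  # backward (prox) step

x = np.zeros(40)                           # long run to estimate min phi
for _ in range(20000):
    x = fbs_step(x)
phi_min = phi(x)

x, ratios = np.zeros(40), []               # gap contraction along early iterations
for _ in range(50):
    gap = phi(x) - phi_min
    x = fbs_step(x)
    ratios.append((phi(x) - phi_min) / gap)

print(max(ratios))                         # < 1, consistent with (28)
```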
Remark 4
The results (28) and (29) were essentially shown in [20, 63], respectively, by different methods. We note that while this work was under review, the authors of [16] improved these results under error bound conditions and weakened assumptions on gradient Lipschitz continuity. Our novelty here lies in conclusion (ii), which was also recently and independently observed by the authors of [22]. In addition, the result (30) appears to be new.
6 Linear convergence of the PALM algorithm
The PALM algorithm was recently introduced by the authors of [11] for a class of composite optimization problems in the general nonconvex and nonsmooth setting. The authors developed a convergence analysis framework relying on the Kurdyka–Łojasiewicz (KL) inequality and proved that PALM converges globally to a critical point for problems with semi-algebraic data. A global non-asymptotic sublinear rate of convergence of PALM for convex problems was obtained independently in [26, 51]. Very recently, global linear convergence of PALM for strongly convex problems was obtained in [33]. Note that PALM is called the block coordinate proximal gradient algorithm in [26] and a cyclic block coordinate descent-type method in [33]. In this section, we show linear convergence of PALM under EB conditions that are strictly weaker than strong convexity.
Let \(x_{1:k}:=(x_1,x_2,\dots , x_k)\) and denote \(x_{1:(j-1)}^{(t+1)}:=(x_1^{(t+1)},\dots , x_{j-1}^{(t+1)})\), \(x_{(j+1):p}^{(t)}:=(x_{j+1}^{(t)},\dots , x_{p}^{(t)})\), \(\psi _j^{(t)}(x_j):=f(x_{1:(j-1)}^{(t+1)},x_j,x_{(j+1):p}^{(t)})\), and \(\varphi _j^{(t)}(x_j):=\psi _j^{(t)}(x_j) +g_j(x_j).\) Start with given initial points \(\{x_j^{(0)}\}_{j=1}^p\). PALM generates \(\{x_j^{(t+1)}\}_{j=1}^p\) via solving a collection of subproblems
The following is our main result in this section.
Theorem 3
Consider the following composite convex nonsmooth minimization problem
where \({\mathbb {R}}^d\ni x=(x_1,\dots , x_p)\) with the jth block \(x_j\in {\mathbb {R}}^{d_j}\), and \(d=\sum ^p_{j=1}d_j\). Set \(g(x):=\sum ^p_{j=1}g_j(x_j)\) so that \({\mathrm {dom}}g= \Pi _{j=1}^p{\mathrm {dom}}g_j\). With these notations, the objective function of (34) reads as \(\varphi =f+g\). Assume that
\(f \in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^d)\), \(g_j\in \Gamma _0({\mathbb {R}}^{d_j}), j=1,\dots , p\), and \(\Omega \subset {\mathrm {dom}}\partial \varphi \);
\(f(x_{1:(j-1)},x_j,x_{(j+1):p})\in {{\mathcal {F}}}^{1,1}_{L_j}({\mathbb {R}}^{d_j})\) for all \(x_{1:(j-1)}\) and \(x_{(j+1):p}\), \(j=1,\dots , p\);
\(\varphi =f+g\) is such that it achieves its minimum \(\min \varphi \);
\(\varphi \) satisfies the \((\partial ^0 \varphi , \eta , \Omega )\)-(res-obj-EB) condition (or its equivalent conditions from case 2 of Corollary 2), which is strictly weaker than strong convexity.
Here, \(L_j, j=1,\dots , p\) and L are positive constants. Let \(\{x^{(t)}\}\) be generated by PALM and assume that \(x^{(t)}\in \Omega , t\ge 0\). Then, PALM converges linearly in the sense that
where \(L_{\min }=\min _j L_j\) and \(L_{\max }=\max _j L_j\).
Proof
We divide the proof into three steps.
Step 1. We prove that
Let \(G^{(t)}_j=L_j(x_j^{(t)}-x_j^{(t+1)})\). By the definition of \(x_j^{(t+1)}\) and Lemma 2.3 in [8], we get
In addition, note that
and
Thus, we derive that for \(t\ge 0\),
from which (35) follows.
Step 2. The \((\partial ^0 \varphi , \eta , \Omega )\)-(res-obj-EB) condition at \(x=x^{(t+1)}\) reads as
At the \((t+1)\)th iteration, there exists \(\xi _j^{(t+1)}\in \partial g_j(x_j^{(t+1)})\) satisfying the optimality condition:
Here and below, we denote the partial gradient \(\nabla _{x_j}f(x)\) by \(\nabla _j f(x)\) for notational simplicity. Let \(\xi ^{(t+1)}=(\xi _1^{(t+1) },\dots , \xi _p^{(t+1)})\). Then,
and hence
Using the optimality condition and the fact that \(f\in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^d)\), we derive that
Therefore, we obtain
Step 3. Combining (35) and (36), we derive that
from which the claimed result follows. This completes the proof. \(\square \)
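Theorem 3 can be illustrated on a small two-block instance. The sketch below (our own toy problem, not from the paper) applies PALM to \(f(x_1,x_2)=\frac{1}{2}\Vert A_1x_1+A_2x_2-b\Vert ^2\) with \(g_j=\lambda \Vert \cdot \Vert _1\) and observes a Q-linear decrease of the objective gap, even though f is not jointly strongly convex.

```python
import numpy as np

# Two-block PALM sketch on a toy instance of (34): each block takes a
# proximal gradient step with its block constant L_j; the objective gap
# shrinks Q-linearly although f is not jointly strongly convex.
rng = np.random.default_rng(4)
A1 = rng.standard_normal((12, 8))
A2 = rng.standard_normal((12, 6))          # 14 variables, 12 equations
b = rng.standard_normal(12)
lam = 0.3
L1 = np.linalg.norm(A1, 2) ** 2
L2 = np.linalg.norm(A2, 2) ** 2

def obj(x1, x2):
    r = A1 @ x1 + A2 @ x2 - b
    return 0.5 * r @ r + lam * (np.abs(x1).sum() + np.abs(x2).sum())

def soft(y, s):
    return np.sign(y) * np.maximum(np.abs(y) - s, 0.0)

def palm_sweep(x1, x2):                    # one cyclic pass over the blocks
    x1 = soft(x1 - A1.T @ (A1 @ x1 + A2 @ x2 - b) / L1, lam / L1)
    x2 = soft(x2 - A2.T @ (A1 @ x1 + A2 @ x2 - b) / L2, lam / L2)
    return x1, x2

x1, x2 = np.zeros(8), np.zeros(6)          # long run to estimate min phi
for _ in range(20000):
    x1, x2 = palm_sweep(x1, x2)
phi_min = obj(x1, x2)

x1, x2, ratios = np.zeros(8), np.zeros(6), []
for _ in range(40):
    gap = obj(x1, x2) - phi_min
    x1, x2 = palm_sweep(x1, x2)
    ratios.append((obj(x1, x2) - phi_min) / gap)

print(max(ratios))                         # < 1: Q-linear decrease of the gap
```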
On one hand, the \((\varphi , \alpha , \Omega )\)-(obj-EB) condition is obviously weaker than strong convexity. On the other hand, we can easily construct functions that satisfy (obj-EB) but fail to be strongly convex. For example, the composition f(Ax), where \(f(\cdot )\) is strongly convex and A is rank deficient, is such a function. This explains why we say that the \((\partial ^0 \varphi , \eta , \Omega )\)-(res-obj-EB) condition, which is equivalent to the \((\varphi , \alpha , \Omega )\)-(obj-EB) condition, is strictly weaker than strong convexity.
We note that the authors of [6] very recently showed that the regularized Jacobi algorithm, a type of cyclic block coordinate descent method, achieves a linear convergence rate under conditions similar to those of Theorem 3.
7 Linear convergence of Nesterov’s accelerated forward–backward algorithm
This section is divided into two parts. In the first part, we first introduce a composite optimization problem, and then we give a new EB condition. In the second part, we introduce Nesterov’s accelerated forward–backward algorithm and show its Q-linear convergence.
7.1 Problem formulation and a new EB condition
Let \(\{r_k\}_{k\ge 0}\) be a nonnegative real sequence. Following the terminology of [44], we say that \(r_k\) converges:
Q-linearly if there exists a constant \(\tau \in (0,1)\) such that \(\forall k\ge 0\), \(r_{k+1}\le \tau \cdot r_k\),
R-linearly if there exists a sequence \(\{s_k\}_{k\ge 0}\) Q-linearly converging to zero such that \(\forall k\ge 0\), \(r_k\le s_k\).
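The distinction can be made concrete with a standard toy sequence (our illustration): \(r_k=(2+(-1)^k)\,0.5^k\) converges R-linearly, being dominated by the Q-linear sequence \(s_k=3\cdot 0.5^k\), but not Q-linearly, since the ratios \(r_{k+1}/r_k\) oscillate.

```python
# Toy sequences: r_k = (2 + (-1)^k) * 0.5^k is R-linear, dominated by the
# Q-linear sequence s_k = 3 * 0.5^k, but it is not Q-linear, since the ratio
# r_{k+1}/r_k oscillates between 1/6 and 3/2.
r = [(2 + (-1) ** k) * 0.5 ** k for k in range(20)]
s = [3 * 0.5 ** k for k in range(20)]

ratios = [r[k + 1] / r[k] for k in range(19)]
print(max(ratios))                            # 1.5 > 1: no tau in (0,1) works
print(all(rk <= sk for rk, sk in zip(r, s)))  # True: R-linear via s_k
```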
It is well-known that Nesterov’s accelerated gradient method with the following form
converges R-linearly for minimizing \(f\in {{\mathcal {S}}}_{\mu ,L}^{1,1}({\mathbb {R}}^n)\) in the sense that \(\{f(x_k)-\min f\}_{k\ge 0}\) converges R-linearly. Very recently, the following Q-linear convergence was independently discovered in [29, 58] by quite different methods:
where \(w_k=(1+\sqrt{\frac{L}{\mu }})y_k-\sqrt{\frac{L}{\mu }}x_k\). In Nesterov's book [48], by replacing the gradient with the gradient mapping, the accelerated scheme (37) was successfully extended to solve the following minimization problems:
and
where \(f, f_i\in {{\mathcal {S}}}_{\mu ,L}^{1,1}({\mathbb {R}}^n), i=1, \dots , m\) and Q is a nonempty closed convex set. Similarly, the accelerated scheme (37) can also be successfully extended to solve
where \(f \in {{\mathcal {S}}}_{\mu ,L}^{1,1}({\mathbb {R}}^n)\) and \(g\in \Gamma _0({\mathbb {R}}^{n})\). Nesterov's extended accelerated methods have been proved to achieve R-linear convergence. A natural question arises: does Q-linear convergence hold for Nesterov's accelerated method applied to problems (39)–(41) as well? In order to study problems (39)–(41) in a unified way, we consider the following composite optimization problem:
This formulation is very general, covering many optimization problems, including (39)–(41), as special cases; see [19, 20]. Now, we introduce a new EB condition, commonly satisfied by many concrete examples of the form (39)–(41); see Remark 6 below. Our forthcoming argument will rely heavily on this condition.
Definition 6
Let \(\varphi :=f\circ e +g \) be such that \(f:{\mathbb {R}}^m\rightarrow {\mathbb {R}}\) is a closed convex function, \(g\in \Gamma _0({\mathbb {R}}^{n})\), and \(e: {\mathbb {R}}^n\rightarrow {\mathbb {R}}^m\) is a smooth mapping with its Jacobian given by \(\nabla e(x)\). Let \(L>0\) and define
and
We say that \(\varphi \) satisfies the composite EB condition with positive constants \(\mu , L\) obeying \(\mu < L\) if
Let us give several comments on this definition.
Remark 5
-
1.
Both p(y) and G(y) are well defined due to the strong convexity of \(\ell (\cdot ;y)\) for any \(y\in {\mathbb {R}}^n\). Moreover, the operator G is a residual measure operator related to \(\varphi \) and \({\mathbb {R}}^n\). In fact, observe that the optimality conditions for the proximal subproblem \(\mathrm {Arg}\min _{x\in {\mathbb {R}}^n} \ell (x;y)\) read as
$$\begin{aligned} G(y)\in \partial g(p(y))+\nabla e(y)^T \partial f(e(y) + \nabla e(y)(p(y)-y)), \end{aligned}$$which implies \(y\in {\mathbf {crit}}\varphi \) if \(G(y)=0\). On the other hand, by the definition of p(y) and using the convexity of g and f, we derive that
$$\begin{aligned} \varphi (y)&= \ell (y;y)\ge \ell (p(y);y) \nonumber \\&= g(p(y))+f(e(y) + \nabla e(y)(p(y)-y)) +\frac{L}{2}\Vert p(y)-y\Vert ^2 \nonumber \\&\ge (g(y)+\langle z, p(y)-y\rangle ) +(f(e(y))+\langle w, \nabla e(y)(p(y)-y)\rangle )+\frac{L}{2}\Vert p(y)-y\Vert ^2 \nonumber \\&= \varphi (y)+\langle z+\nabla e(y)^T w, p(y)-y\rangle +\frac{L}{2}\Vert p(y)-y\Vert ^2, \end{aligned}$$(44)where \(z\in \partial g(y)\) and \(w\in \partial f(e(y))\), and hence \(z+\nabla e(y)^T w \in \partial \varphi (y)\). Thus, if \(0\in \partial \varphi (y)\), then we can take some \(z\in \partial g(y)\) and \(w\in \partial f(e(y))\) such that \(z+\nabla e(y)^T w =0\). Hence, the inequality (44) implies that \(G(y)=0\) if \(y\in {\mathbf {crit}}\varphi \). Therefore, we have \(\{y\in {\mathbb {R}}^n: G(y)=0\}={\mathbf {crit}}\varphi \), i.e., G is a residual measure operator related to \(\varphi \).
-
2.
The composite EB condition (43) can be viewed as a relaxation of strong convexity to some degree. This perspective is in the spirit of the work [42]. Indeed, in case of \(m=1\), \(g(x)\equiv 0\), \(f(t)\equiv t, t\in {\mathbb {R}}\), and \(e\in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\), (43) reads as
$$\begin{aligned}&~\forall x, y\in {\mathbb {R}}^n,~~e(x)\ge \left( e(y-\frac{1}{L}\nabla e(y))+\frac{1}{2L}\Vert \nabla e(y)\Vert ^2\right) \nonumber \\&\quad +\langle \nabla e(y),x-y\rangle +\frac{\mu }{2}\Vert x-y\Vert ^2. \end{aligned}$$(45)On the other hand, \(e\in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\) implies that
$$\begin{aligned} ~~\forall x, y\in {\mathbb {R}}^n,~~e(y)\ge e(y-\frac{1}{L}\nabla e(y))+\frac{1}{2L}\Vert \nabla e(y)\Vert ^2. \end{aligned}$$Therefore, (45) is a relaxation of strong convexity in the following form:
$$\begin{aligned} ~\forall x, y\in {\mathbb {R}}^n,~~e(x)\ge e(y)+\langle \nabla e(y),x-y\rangle +\frac{\mu }{2}\Vert x-y\Vert ^2. \end{aligned}$$In the case of \(f\circ e(x)\equiv 0\) and \(g\in \Gamma _0({\mathbb {R}}^n)\), (43) reads as
$$\begin{aligned} ~\forall x, y\in {\mathbb {R}}^n, ~~g(x)\ge g_\lambda (y)+\langle \nabla g_\lambda (y),x-y\rangle +\frac{\mu }{2}\Vert x-y\Vert ^2, \end{aligned}$$(46)where \(\lambda =\frac{1}{L}\). Recall that \(g_\lambda \) is the Moreau–Yosida regularization of g and note that \(g(x)\ge g_\lambda (x)\). We can see that (46) is a relaxation of strong convexity of \(g_\lambda \).
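For completeness, the inequality \(e(y)\ge e(y-\frac{1}{L}\nabla e(y))+\frac{1}{2L}\Vert \nabla e(y)\Vert ^2\) invoked above is just the descent lemma for \(e\in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\) applied at the gradient step of size \(\frac{1}{L}\):
$$\begin{aligned} e\left( y-\frac{1}{L}\nabla e(y)\right) \le e(y)-\frac{1}{L}\Vert \nabla e(y)\Vert ^2+\frac{L}{2}\cdot \frac{1}{L^2}\Vert \nabla e(y)\Vert ^2=e(y)-\frac{1}{2L}\Vert \nabla e(y)\Vert ^2. \end{aligned}$$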
- 3.
Although we have shown that (43) can be viewed as a relaxation of strong convexity, it is still a very strong property. We now construct an example showing that even strong convexity of f is not enough to ensure that (43) holds. The example is obtained by setting \(n=m=2\), \(x=(x_1, x_2)^T\), \(e(x)=(x_1, x_1)^T\), \(f(x)=\frac{1}{2}x_1^2+\frac{1}{2}x_2^2\), \(g(x)\equiv 0\); then \(\varphi (x)=f\circ e(x)=x_1^2\). Clearly, f is strongly convex. Let us show that in this special case (43) fails to hold. Indeed, after some simple calculations, we get
$$\begin{aligned} p(y)=\begin{pmatrix}\frac{L}{L+2}y_1 \\ y_2 \end{pmatrix},~~~~G(y)=\begin{pmatrix}\frac{2L}{L+2}y_1 \\ 0 \end{pmatrix}, \end{aligned}$$and therefore (43) reads as
$$\begin{aligned} \frac{2L}{L+2}y_1(y_1-x_1)&\ge \left( \frac{L}{L+2}y_1\right) ^2-x_1^2+\frac{2L}{(L+2)^2}y_1^2\\&\quad +\frac{\mu }{2}(x_1-y_1)^2+\frac{\mu }{2}(x_2-y_2)^2, \quad \forall x_i, y_i\in {\mathbb {R}}, i=1,2. \end{aligned}$$But, if we take \(x_1=y_1\equiv 0\), then we should have
$$\begin{aligned} 0\ge \frac{\mu }{2}(x_2-y_2)^2, \quad \forall x_2, y_2\in {\mathbb {R}}. \end{aligned}$$Obviously, this is impossible for any positive constant \(\mu \).
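The closed forms of p(y) and G(y) above can be cross-checked numerically. The sketch below is an illustration only; it assumes the identification \(G(y)=L(y-p(y))\), which is consistent with the update \(x_{k+1}=y_k-\frac{1}{L}G(y_k)\) used in Sect. 7.2, and computes p(y) exactly from the normal equations of the quadratic subproblem \(\min _x \ell (x;y)\):

```python
import numpy as np

L = 4.0                            # any L > 0; the closed forms hold for all L
J = np.array([[1.0, 0.0],
              [1.0, 0.0]])         # Jacobian of e(x) = (x1, x1)^T

def p(y):
    """p(y) = argmin_x 0.5*||e(y) + J(x - y)||^2 + (L/2)*||x - y||^2,
    the prox-linear point for f = 0.5*||.||^2 and g = 0; the minimizer
    solves the normal equations (J^T J + L*I)(x - y) = -J^T e(y)."""
    d = np.linalg.solve(J.T @ J + L * np.eye(2), -J.T @ (J @ y))
    return y + d

y = np.array([3.0, -1.5])
G = L * (y - p(y))                 # residual operator, assuming G(y) = L*(y - p(y))
# closed forms from the text: p(y) = (L*y1/(L+2), y2), G(y) = (2L*y1/(L+2), 0)
```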
- 4.
Let \(A\in {\mathbb {R}}^{m\times n}\) with \(m<n\) be a given matrix and \(b\in {\mathbb {R}}^m\) be a given vector. A well-known fact in the EB community is that the quadratic function \(\frac{1}{2}\Vert Ax-b\Vert ^2\) is not strongly convex but satisfies EB conditions. Unfortunately, this function fails to satisfy (45). We show this by contradiction. It is enough to consider the case of \(m=1\), \(g(x)\equiv 0\), \(f(t)\equiv t, t\in {\mathbb {R}}\), and \(e(x)=\frac{1}{2}x^Taa^Tx\) with \(\Vert a\Vert ^2=L\). In this case, (45) reads as
$$\begin{aligned} \frac{1}{2}\left( a^Tx-a^Ty\right) ^2 \ge \frac{\mu }{2}\Vert x-y\Vert ^2, \quad \forall x, y\in {\mathbb {R}}^n. \end{aligned}$$Let \(h\ne 0\) be a vector orthogonal to a, and take \(y-x=\lambda h, \lambda \in {\mathbb {R}}\). Then, we have
$$\begin{aligned} 0\ge \frac{\mu }{2}\lambda ^2\Vert h\Vert ^2, \quad \forall \lambda \in {\mathbb {R}}, \end{aligned}$$which is impossible for any positive constant \(\mu \).
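The contradiction can also be observed numerically: along any direction orthogonal to a the left-hand side vanishes identically while \(\Vert x-y\Vert \) grows. A minimal check, with the illustrative choice \(a=(2,0)^T\) (so \(L=\Vert a\Vert ^2=4\)):

```python
import numpy as np

a = np.array([2.0, 0.0])           # ||a||^2 = L = 4 (illustrative choice)
h = np.array([0.0, 1.0])           # h != 0 and orthogonal to a
x = np.array([1.0, 1.0])

for lam in [0.5, 1.0, 10.0]:
    y = x + lam * h                # take y - x = lam * h
    lhs = 0.5 * (a @ x - a @ y) ** 2         # left-hand side of the reduced (45)
    rhs_factor = 0.5 * np.sum((x - y) ** 2)  # coefficient multiplying mu
    # lhs == 0 while rhs_factor > 0, so no positive mu can satisfy (45)
```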
- 5.
In order to show that (46) can be strictly weaker than strong convexity, we now construct a one-dimensional example that satisfies (46) but fails to be strongly convex. Define the shrinkage operator by \({{\mathcal {S}}}(t):=\mathrm {sign}(t)\cdot \max \{|t|-1,0\}\) and the projection operator by \([x]_I^+:=\arg \min _{y\in I}\Vert x-y\Vert \), where I is some closed interval. Now, we take \(\lambda =1\), \(I=[-2,2]\), and \(g(x)=|x|+\delta _I(x)\). Obviously, such g(x) is convex but not strongly convex. Using formula (14) in [65] and Lemma 3, we have
$$\begin{aligned} g_\lambda (x)=\left| \left[ {{\mathcal {S}}}(x)\right] _I^+\right| +\frac{1}{2}\left( x-\left[ {{\mathcal {S}}}(x)\right] _I^+\right) ^2. \end{aligned}$$Here, \(g_\lambda \) is the Moreau–Yosida regularization of g. Denote \(\ell _{g_\lambda }(x;y):=g_\lambda (y)+\langle \nabla g_\lambda (y),x-y\rangle \). We have the following expression:
$$\begin{aligned} \ell _{g_\lambda }(x;y)=\left\{ \begin{array}{ll} (y+2)x-\frac{1}{2}y^2+4,&{}\quad y\le -3,\\ -x-\frac{1}{2}, &{}\quad -3\le y\le -1,\\ yx-\frac{1}{2}y^2, &{} \quad -1\le y\le 1,\\ x-\frac{1}{2},&{}\quad 1\le y\le 3,\\ (y-2)x-\frac{1}{2}y^2+4,&{}\quad y\ge 3 \end{array} \right. \end{aligned}$$(47)Then, one can verify case by case that for any \(\mu \in (0,\frac{1}{9}]\), (46) always holds. For example, in the case of \(y\le -3\), we only need to verify that
$$\begin{aligned} |x|\ge (y+2)x-\frac{1}{2}y^2+4+\frac{\mu }{2}(x-y)^2, \quad x\in [-2,2], \end{aligned}$$i.e., \(\frac{1-\mu }{2}(x-y)^2 \ge \frac{1}{2}x^2+2x-|x|+4, ~~x\in [-2,2].\) Thus, it is sufficient to require that
$$\begin{aligned} \mu \le 1-\max _{y\le -3, |x|\le 2}\frac{x^2+4x-2|x|+8}{(x-y)^2}. \end{aligned}$$After some simple calculations, we have \(\mu \le \frac{1}{9}\). The other cases can be similarly verified; we omit the details here. This example shows that the composite EB condition (43) indeed holds for some non-strongly convex functions.
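The case-by-case verification above can be cross-checked numerically. The sketch below evaluates \(g_\lambda \) and \(\nabla g_\lambda \) through the proximal mapping \([{{\mathcal {S}}}(\cdot )]_I^+\) and tests (46) with \(\mu =\frac{1}{9}\) on a grid; the grid ranges are illustrative choices:

```python
import numpy as np

lam, mu = 1.0, 1.0 / 9.0

def prox(y):
    """prox of g = |.| + indicator of I = [-2, 2]: shrinkage, then projection."""
    s = np.sign(y) * max(abs(y) - 1.0, 0.0)   # shrinkage S(y)
    return min(max(s, -2.0), 2.0)             # projection onto I = [-2, 2]

def g(x):
    return abs(x)                             # g on its domain [-2, 2]

def g_env(y):                                 # Moreau-Yosida regularization g_1
    q = prox(y)
    return g(q) + 0.5 * (y - q) ** 2

def g_env_grad(y):                            # gradient of the envelope
    return (y - prox(y)) / lam

# (46): g(x) >= g_env(y) + g_env'(y)*(x - y) + (mu/2)*(x - y)^2 on dom g
ok = all(
    g(x) + 1e-9 >= g_env(y) + g_env_grad(y) * (x - y) + 0.5 * mu * (x - y) ** 2
    for x in np.linspace(-2.0, 2.0, 201)
    for y in np.linspace(-8.0, 8.0, 401)
)
# equality is attained at x = 0, y = -3, so mu = 1/9 is sharp
```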
Now, we explain why we say that the condition (43) is commonly satisfied by problems (39)–(41), whose objective functions are clearly not in \({{\mathcal {S}}}^{1,1}_{\mu ,L}({\mathbb {R}}^n)\).
Remark 6
- (i)
The minimization problem (42) with \(m=1\), \(e\in {{\mathcal {S}}}^{1,1}_{\mu ,L}({\mathbb {R}}^n)\), \(f(t)\equiv t, t\in {\mathbb {R}}\), \(g(x)=\delta _Q(x)\), and Q being nonempty closed convex, corresponds to problem (39). The condition (43) holds in this setting; see Theorem 2.2.7 in [43].
- (ii)
The minimization problem (42) with \(f(y)=\max _{1\le i\le m}\{y_i\}\), \(f_i\in {{\mathcal {S}}}^{1,1}_{\mu ,L}({\mathbb {R}}^n)\), \(e(x)=(f_1(x), f_2(x),\dots , f_m(x))\), \(g(x)=\delta _Q(x)\), and Q being nonempty closed convex, corresponds to problem (40). The condition (43) holds in this setting; see Corollary 2.3.2 in [43].
- (iii)
The minimization problem (42) with \(m=1\), \(e\in {{\mathcal {S}}}^{1,1}_{\mu ,L}({\mathbb {R}}^n)\), \(f(t)\equiv t, t\in {\mathbb {R}}\), and \(g\in \Gamma _0({\mathbb {R}}^n)\), corresponds to problem (41). The condition (43) holds in this setting; see the inequality (4.36) in [15].
Interestingly, we note that while this work was under review, the authors of [41] utilized the exact form of (43) to construct underestimate sequences and proposed several first-order methods for minimizing strongly convex smooth functions and strongly convex composite functions. Based on the discussion in this section, one could expect to extend the corresponding results in [41] to the composite optimization problem (42).
In general, we have to admit that the composite EB condition (43) is difficult to verify, and it therefore deserves further study in the future.
7.2 Q-linear convergence of Nesterov’s acceleration
In this part, we show Q-linear convergence of Nesterov’s acceleration under the composite EB condition (43), which is more general than strong convexity. First, in light of Nesterov’s accelerated scheme (2.2.11) in [43], Nesterov’s accelerated forward–backward algorithm for solving the problem (42) reads as follows: choose \(x_{-1}=x_0\in {\mathbb {R}}^n\) and, for \(k\ge 0\),
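For the smooth unconstrained case (\(m=1\), \(f(t)\equiv t\), \(g\equiv 0\)), the scheme reduces to accelerated gradient descent with \(p(y)=y-\frac{1}{L}\nabla \varphi (y)\) and \(G(y)=\nabla \varphi (y)\). A minimal numerical sketch, assuming the classical constant momentum coefficient \(\alpha =\frac{\sqrt{L}-\sqrt{\mu }}{\sqrt{L}+\sqrt{\mu }}\) of scheme (2.2.11) in [43] (the remaining parameters of the general scheme are not fixed here):

```python
import numpy as np

mu_, L_ = 1.0, 100.0                        # strong convexity / smoothness
alpha = (np.sqrt(L_) - np.sqrt(mu_)) / (np.sqrt(L_) + np.sqrt(mu_))

Q = np.diag([mu_, L_])                      # phi(x) = 0.5 * x^T Q x, min phi = 0
grad = lambda x: Q @ x

x_prev = x = np.array([1.0, 1.0])           # x_{-1} = x_0
vals = []
for _ in range(200):
    y = x + alpha * (x - x_prev)            # extrapolation step
    x_prev, x = x, y - grad(y) / L_         # x_{k+1} = p(y_k) = y_k - G(y_k)/L
    vals.append(0.5 * x @ Q @ x)            # phi(x_{k+1}) - min phi
# vals decays geometrically, illustrating the linear convergence
```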
Let
Let
where \(x^*\in \mathrm {Arg}\min \varphi \) (assumed to be nonempty) and
Now, we are ready to present the main result of this section. The idea behind the proof is partially inspired by the argument in [5] but might be of interest in its own right.
Theorem 4
Let \(\varphi :=f\circ e +g \) be such that \(f:{\mathbb {R}}^m\rightarrow {\mathbb {R}}\) is a closed convex function, \(g\in \Gamma _0({\mathbb {R}}^{n})\), and \(e: {\mathbb {R}}^n\rightarrow {\mathbb {R}}^m\) is a smooth mapping with its Jacobian given by \(\nabla e(x)\). Let \(\varphi \) satisfy the composite EB condition (43) with positive constants \(\mu , L\) obeying \(\mu < L\). Assume that \(\varphi \) achieves its minimum \(\min \varphi \) so that \(\mathrm {Arg}\min \varphi \ne \emptyset \). Then, there exists a unique vector \(x^*\) such that \(\mathrm {Arg}\min \varphi =\{x^*\}\), and Nesterov’s accelerated forward–backward method converges Q-linearly in the sense that there exists a positive constant \(\theta _0<1\) such that for any \(\theta \in [\theta _0, 1)\) it holds
where \(\rho =\max \{\alpha , \theta \}<1\) and \(\tau =\frac{\theta \beta }{2\rho \gamma }\). In particular, by taking \(\theta = \max \{\theta _0, \alpha \}\), we have
Proof
We first show the uniqueness of the optimal solution \(x^*\) of \(\varphi \). In fact, by statement (i) in Remark 5 and the fact that \(\mathrm {Arg}\min \varphi \subset {\mathbf {crit}}\varphi \), we have that \(G(x^*)=0\) and \(p(x^*)=x^*\), and hence (43) at \(y=x^*\) reads as
which clearly implies that \(\mathrm {Arg}\min \varphi =\{x^*\}\).
Now, we analyze the rate of linear convergence. Using (43) successively, first at \(x=x_k\) and \(y=y_k\), and then at \(x=x^*\) and \(y=y_k\), together with the fact that \(x_{k+1}=p(y_k)\), we obtain
and
Multiplying the first inequality by \(\alpha \) and the second one by \(\beta \), and then adding the two resulting inequalities, we obtain
In order to estimate the right-hand side of the inequality above, we first write down:
Secondly, using the expression of \(y_{k+1}=x_{k+1}+\alpha (x_{k+1}-x_k)\), we get
Then, substitute \(x_{k+1}=y_k-\frac{1}{L}G(y_k)\) into formula (51) to obtain
Using equality (52), we derive that
Thus, we have
Combining formula (53) and formula (50), we derive that
where the term \(\Vert G(y_k)\Vert ^2\) is eliminated since \(\frac{\beta \gamma }{2}=\frac{1}{2L}\). Note that (50) can be written as
with which we further derive that
Denote \(\eta _1:=\min \left\{ \frac{\mu \alpha }{2},\frac{\mu \beta }{2}\right\} \) and \(\eta _2:=\max \left\{ 2, \frac{1}{2}(\sqrt{\frac{L}{\mu }}-1)^2\right\} \). Then, we have
Rearrange the terms to obtain
Thus, there exists a positive constant \(\theta _0<1\) such that for any \(\theta \in [\theta _0, 1)\) it holds
Since \(\rho =\max \{\alpha , \theta \}\), we have that \(\rho <1\) and \(\frac{\theta }{\rho }\le 1\). Thus, we obtain
i.e., \(\Phi _{k+1}(x^*;\tau )\le \rho \cdot \Phi _k(x^*;\tau )\) with \(\tau =\frac{\theta \beta }{2\rho \gamma }\). This is just the announced result (48).
It remains to show (49). In fact, if \(\theta = \max \{\theta _0, \alpha \}\), then \(\rho =\max \{\alpha , \theta \}=\max \{\theta _0, \alpha \}=\theta \) and hence
This completes the proof. \(\square \)
Remark 7
It should be noted that we have only shown the existence of a rate of linear convergence for Nesterov’s accelerated forward–backward method. It is not clear whether one can derive an exact rate of linear convergence, such as the rate \(1-\sqrt{\frac{\mu }{L}}\) obtained for Nesterov’s accelerated gradient method.
8 A class of dual functions satisfying EB conditions
Verifying EB conditions for functions with certain structure is a difficult topic. In this section, we consider a class of dual objective functions that have interesting applications in signal processing and compressive sensing [31, 66]. We first describe the problem, along with some direct results.
Proposition 2
Consider the linearly constrained optimization problem
where \(g: {\mathbb {R}}^m \rightarrow {\mathbb {R}}\) is a real-valued and strongly convex function with modulus \(c>0\), \(A\in {\mathbb {R}}^{n\times m}\) is a given matrix with \(m\le n\), and \(b\in R(A)\) is a given vector. Here, R(A) stands for the range of A. The dual problem is
Then, we have that
- (i) the primal problem (P) has a unique optimal solution \({\bar{y}}\),
- (ii) the dual objective function f belongs to \({{\mathcal {F}}}^{1,1}_{L}({\mathbb {R}}^n)\) with \(L=\frac{\Vert A\Vert ^2}{c}\), and
- (iii) the set of optimal solutions of the dual problem,
$$\begin{aligned} \mathrm {Arg}\min f:=\left\{ x\in {\mathbb {R}}^n: A\nabla g^*\left( A^Tx\right) =b\right\} , \end{aligned}$$is a nonempty closed convex set, and can be characterized by \(\{x\in {\mathbb {R}}^n: A^Tx\in \partial g({\bar{y}})\}\) or equivalently by \(\{x\in {\mathbb {R}}^n: \nabla g^*(A^Tx)={\bar{y}}\}\) .
Proof
The first two statements are standard results which can be found in textbooks on convex analysis, and no proof will be given here. Now, we prove the third statement. First, let the Lagrangian function be given by \(L(y, x)=g(y)-\langle Ay-b, x\rangle .\) By the assumption that \(b\in R(A)\) and the finiteness of the optimal value of the primal problem, according to Proposition 5.3.3 in [9], for any \({\bar{x}}\in \mathrm {Arg}\min f\) we have that \({\bar{y}}\in \mathrm {Arg}\min _y L(y, {\bar{x}}).\) Hence, \(A^T{\bar{x}}\in \partial g({\bar{y}})\) or equivalently \(\nabla g^*(A^T{\bar{x}})={\bar{y}}\) due to \((\partial g)^{-1}=\nabla g^*\), which holds by Corollary 23.5.1 in [47]. This implies that \(\mathrm {Arg}\min f\subseteq \{x\in {\mathbb {R}}^n: \nabla g^*(A^Tx)={\bar{y}}\}\). The reverse inclusion is obvious since \(A{\bar{y}}=b\). Thereby,
This completes the proof. \(\square \)
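To illustrate the second statement numerically, take the simplest strongly convex choice \(g(y)=\frac{c}{2}\Vert y\Vert ^2\) (an illustrative assumption), for which \(g^*(z)=\frac{1}{2c}\Vert z\Vert ^2\) and the dual objective becomes \(f(x)=\frac{1}{2c}\Vert A^Tx\Vert ^2-\langle b, x\rangle \) with Hessian \(AA^T/c\); the largest eigenvalue of this Hessian equals the claimed constant \(L=\frac{\Vert A\Vert ^2}{c}\):

```python
import numpy as np

rng = np.random.default_rng(0)
c, m, n = 2.0, 3, 5                         # g strongly convex with modulus c
A = rng.standard_normal((n, m))             # A in R^{n x m} with m <= n
b = A @ rng.standard_normal(m)              # guarantees b in R(A)

# f(x) = g*(A^T x) - <b, x> with g = (c/2)||.||^2 has Hessian A A^T / c
H = A @ A.T / c
L_f = np.linalg.norm(A, 2) ** 2 / c         # claimed constant ||A||^2 / c
lam_max = np.linalg.eigvalsh(H).max()       # actual Lipschitz constant of grad f
```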
Now, we state the main result of this section.
Theorem 5
Adopt the same setting as in Proposition 2. Denote \(X_r:=\{x\in {\mathbb {R}}^n: f(x)\le \min f +r\}\) with \(r\ge 0\) and \(V_r:={\mathrm {cl}}(A^TX_r)\), where \({\mathrm {cl}}(A^TX_r)\) stands for the closure of \(A^TX_r\). If the following assumptions hold:
- (a)
\(\partial g\) is calm around \({\bar{y}}\) for any \({\bar{z}}\in V_0\),
- (b)
the collection \(\{\partial g({\bar{y}}), R(A^T)\}\) is linearly regular with constant \(\gamma >0\), that is
$$\begin{aligned} d\left( A^Tx, \partial g({\bar{y}})\right) \ge \gamma \cdot d\left( A^Tx, \partial g({\bar{y}})\cap R\left( A^T\right) \right) ,~~\forall x\in {\mathbb {R}}^n, \end{aligned}$$
then we have that
- (i)
There exist positive constants \(r_0, \tau \) such that the \((f, \tau , X_{r_0})\)-(obj-EB) condition holds, that is
$$\begin{aligned} f(x)-\min f \ge \frac{\tau }{2}\cdot d^2(x, {\mathbf {crit}}f), ~~\forall x\in X_{r_0}. \end{aligned}$$(54)Specifically, if \(\partial g\) is calm with constant \(\kappa >0\) around \({\bar{y}}\) for any \({\bar{z}}\in V_0\), then (54) holds for all \(\tau \in (0, \kappa ^{-1})\).
- (ii)
For any sublevel set \(X_r\), pick \(r_1\in (0, r_0)\) and let \(c_r:=\sqrt{\frac{r_1}{r}}\) and
$$\begin{aligned} \rho _r:= \left\{ \begin{array}{lll} c_r, &{}\quad \text {when} ~~r\ge r_0, \\ 1, &{}\quad \text {when}~~r\le r_0. \end{array} \right. \end{aligned}$$Then, the \((\nabla f, \nu , X_{r})\)-(cor-EB) condition with \(\nu =\frac{\tau \rho _r^2}{8}\) holds.
Proof
We first prove that \(V_r\) is compact for any \(r\ge 0\). To this end, letting \(f_r=\min f+r\) and using the fact that \(b=A{\bar{y}}\), we write \(X_r\) in the following form:
Denote
Obviously, \(A^TX_r\subseteq Y_r\). Let \({\tilde{g}}(\cdot ):=g^*(\cdot )-\langle {\bar{y}}, \cdot \rangle \). Then, \({\tilde{g}}^*(y)=g(y+{\bar{y}})\). Thus, \({\mathrm {dom}}{\tilde{g}}^*={\mathrm {dom}}g={\mathbb {R}}^m\). This implies that \({\tilde{g}}\) is coercive (see Theorem 11.8 in [48]) and hence \(Y_r=\{y\in {\mathbb {R}}^m: {\tilde{g}}(y) \le f_r\}\) is bounded. Furthermore, since \({\tilde{g}}\) is continuous, \(Y_r\) is closed and hence compact. Thereby, \(V_r={\mathrm {cl}}(A^TX_r)\subseteq Y_r\) is bounded and hence also compact.
Second, we show that \(V_0 \subseteq \partial g({\bar{y}})\). Recall that we have shown in Proposition 2 that \(X_0=\{x\in {\mathbb {R}}^n: A^Tx\in \partial g({\bar{y}})\}\). Hence, \(A^TX_0\subseteq \partial g({\bar{y}})\). Since g is a real-valued convex function, \(\partial g({\bar{y}})\) must be nonempty, closed, and bounded according to Theorem 23.4 in [47] and Theorem 8.6 in [48]. Therefore, \(V_0={\mathrm {cl}}(A^TX_0)\subseteq \partial g({\bar{y}})\).
Now, since \(\partial g\) is calm at \({\bar{y}}\) for any \({\bar{z}}\in V_0\) and \(V_0 \subseteq \partial g({\bar{y}})\) is compact, by Proposition 2 in [69] we can conclude that there exist constants \(\kappa , \epsilon >0\) such that
where we denote \({\mathbb {R}}^m\) by \({\mathcal {E}}\) for simplicity. Pick \(z\in V_0+\epsilon {\mathbb {B}}_{\mathcal {E}}\) and let \(y=\nabla g^*(z)\). Then, \(z\in \partial g(y)\) due to \(\partial g =(\nabla g^*)^{-1}\) and hence \(z \in \partial g(y)\cap (V_0+\epsilon {\mathbb {B}}_{\mathcal {E}})\). By inclusion (55), we obtain
which can be rewritten as
This implies that \(\nabla g^*\) is always metrically subregular at each \({\bar{z}}\in V_0\) for \({\bar{y}}\). Thus, by Theorem 3.1 in [21], for each \({\bar{z}}\in V_0\) there exists a neighborhood \({\bar{z}}+\epsilon ({\bar{z}}){\mathbb {B}}_{\mathcal {E}}\) and a positive constant \(\alpha ({\bar{z}})\) such that
where the constant \(\alpha ({\bar{z}})\) can be chosen arbitrarily in \((0, \kappa ^{-1})\). Note that \(\{{\bar{z}}+\epsilon ({\bar{z}}){\mathbb {B}}^o_{\mathcal {E}}\}_{{\bar{z}}\in V_0}\) forms an open cover of the compact set \(V_0\). Hence, by the Heine–Borel theorem, there exist finitely many points \({\bar{z}}_1,\dots , {\bar{z}}_K\in V_0\) (\(K\ge 1\)) such that
Let \(\alpha =\min \{\alpha ({\bar{z}}_1), \dots , \alpha ({\bar{z}}_K)\}\), which can be chosen arbitrarily in \((0, \kappa ^{-1})\), and note that \(\min f = g^*({\bar{z}})-\langle {\bar{y}}, {\bar{z}}\rangle , ~\forall {\bar{z}}\in V_0\). From (58), we have
Letting \(r_0>0\) be small enough that \(V_{r_0}\subseteq U\) and using the fact that \((\nabla g^*)^{-1} = \partial g\), we obtain
and hence,
Using the linear regularity property of \(\{\partial g({\bar{y}}), R(A^T)\}\), we derive that
where such \({\hat{y}}\in V_0\) exists due to the compactness of \(V_0\). Now, we follow the argument in [50] to finish the proof of (i). Since \({\hat{y}}\in V_0={\mathrm {cl}}(A^TX_0)\), we can find a sequence \(\{x_n\}_{n=0}^{\infty }\subset X_0\) such that \(A^Tx_n \rightarrow {\hat{y}}\) as \(n\rightarrow +\infty \). Denote the null space of \(A^T\) by \(N(A^T)\) and the minimal positive singular value of A by \(\sigma (A)\). Using the fact of \(\mathrm {Arg}\min f+N(A^T) \subseteq \mathrm {Arg}\min f\), we can derive that
where \({\mathcal {P}}_{N(A^T)}\) stands for the orthogonal projection operator onto \(N(A^T)\). Thus, by letting \(n\rightarrow +\infty \), we obtain
Note that \(\mathrm {Arg}\min f={\mathbf {crit}}f\). Thereby, in view of (59) and (60), the (obj-EB) condition follows with \(\tau =\alpha \gamma ^2\sigma ^2(A)\).
Let us prove (ii). Without loss of generality, we assume that \(\min f=0\) and \(r\ge r_0\). Since for any \(r>0\) the sublevel set \(X_r\) is \(\nabla f\)-invariant, using (54) together with the equivalence established in Corollary 2, we can conclude that f satisfies the \((\nabla f, \eta , X_{r_0})\)-(res-obj-EB) condition with \(\eta =\sqrt{\frac{\tau }{2}}\), that is
Let \(\varphi (t):=2\eta ^{-1}t^{\frac{1}{2}}\). Then, inequality (61) can be written as
By applying Proposition 30 in [10], a globalization result for KL inequalities, to (62), we have that for the given \(r_1\in (0, r_0)\), the function given by
is desingularising for f on all of \({\mathbb {R}}^n\) and hence it holds
Thereby, we can get
where \(X^c_{r_1}\) is the complement of \(X_{r_1}\). By the definition of \(c_r\), we can further obtain
Finally, in view of the expression of \(\rho _r\) together with (61), for \(r>0\) we have
which is just the \((\nabla f, \eta \rho _r, X_r)\)-(res-obj-EB) condition. Thus, the \((\nabla f, \nu , X_{r})\)-(cor-EB) condition follows from Corollary 2. Using the relevant formulas in Theorem 1, we have
which completes the proof. \(\square \)
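When \(g(y)=\frac{c}{2}\Vert y\Vert ^2\) (again an illustrative assumption), the dual objective \(f(x)=\frac{1}{2c}\Vert A^Tx\Vert ^2-\langle b, x\rangle \) is a convex quadratic with Hessian \(H=AA^T/c\), \(\mathrm {Arg}\min f\) is an affine set parallel to \(N(A^T)\), and the (obj-EB) condition (54) can be checked directly; in this special case it even holds globally, with \(\tau \) equal to the smallest positive eigenvalue of H:

```python
import numpy as np

rng = np.random.default_rng(1)
c, m, n = 2.0, 3, 5
A = rng.standard_normal((n, m))
b = A @ rng.standard_normal(m)               # b in R(A) = R(H)

H = A @ A.T / c                              # Hessian of f(x) = (1/2)x^T H x - b^T x
f = lambda x: 0.5 * x @ H @ x - b @ x
x_bar = np.linalg.lstsq(H, b, rcond=None)[0] # one minimizer (H x = b is solvable)
min_f = f(x_bar)

w, V = np.linalg.eigh(H)
pos = w > 1e-10
tau = w[pos].min()                           # smallest positive eigenvalue of H

# f(x) - min f = (1/2)(x - x_bar)^T H (x - x_bar), while the squared distance
# to Argmin f = x_bar + N(A^T) is the range-space part of ||x - x_bar||^2
ok = True
for _ in range(100):
    x = x_bar + 3.0 * rng.standard_normal(n)
    d2 = np.sum((V[:, pos].T @ (x - x_bar)) ** 2)
    ok = ok and (f(x) - min_f + 1e-9 >= 0.5 * tau * d2)
```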
Remark 8
By directly invoking Corollary 4.3 in [1], we can derive (58) with the constant satisfying \(\alpha ({\bar{z}})\in (0, \frac{1}{4\kappa })\), which is slightly worse than \(\alpha ({\bar{z}})\in (0, \kappa ^{-1})\).
Remark 9
The author of [50], under slightly different assumptions, proved by contradiction that the dual objective function \(f(x)= g^*(A^Tx)-\langle b, x\rangle \) satisfies the \((\nabla f, \nu , X_{r})\)-(cor-EB) condition. Whereas the author of [50] requires that \(\partial g\) be calm around \({\bar{y}}\) for any \({\bar{z}}\in {\mathbb {R}}^m\), i.e., the local upper Lipschitz-continuity property (4), we only require that \(\partial g\) be calm around \({\bar{y}}\) for any \({\bar{z}}\in V_0\). Our proof relies on the KL inequality globalization technique developed in [10], and is hence quite different from that of [50].
Remark 10
Verifying EB conditions for more general functions with the form \(f(x):=h(Ax)+l(x)\) was studied recently in [20, 32, 69]. Specialized to the dual objective function \(f(x)= g^*(A^Tx)-\langle b, x\rangle \), the existing theory usually requires \(g^*\) to be strictly or strongly convex; see e.g., Corollary 4.3 in [20] and Assumption 1 in [69]. In contrast, our study, following the research line of work [50], relies on exploiting the primal-dual structure, and is thus quite different from that in [20, 32, 69].
9 Discussion
In this paper, we provide a new perspective for studying EB conditions and analyzing linear convergence of gradient-type methods. Under our theoretical framework, a number of new technical results are obtained. In particular, some EB conditions, previously known to be sufficient for linear convergence, are shown to be necessary as well; and Nesterov’s accelerated forward–backward algorithm, previously known to converge R-linearly, is shown to converge Q-linearly. Finally, we close this paper with the following possible directions for future work:
- 1.
We have defined a group of abstract EB conditions of “square type”. But we do not know whether the underlying idea can be extended to conditions of more general types by introducing so-called desingularizing functions [10], so that the other EB conditions discussed in [22] can be included in a more general framework.
- 2.
Although we have shown sufficient conditions guaranteeing linear convergence for PALM and Nesterov’s accelerated forward–backward algorithms, it is still unclear whether they are necessary. The very recent work [37] might shed light on this topic.
- 3.
Verifying EB conditions with high probability for non-convex functions has proven to be a very powerful approach for non-convex optimization; see e.g. [14, 34, 56]. Thus, seeking or verifying new classes of non-convex functions satisfying EB conditions with high probability deserves future study.
- 4.
What are the optimal rates of linear convergence (i.e., exact worst-case convergence rates) for gradient-type methods under general EB conditions? The method of performance estimation, originally proposed in [18] and further developed in [30, 54, 55], might be useful for this topic.
- 5.
Ordinary differential equation (ODE) approaches have recently been used to study (accelerated) gradient-type methods [53, 58]. With the exception of one paper [60], existing analyses only consider general convex and strongly convex settings, and do not address general EB conditions. It would be interesting to investigate whether the EB conditions presented in this paper can be embedded in the ODE approaches to study linear convergence of gradient-type methods.
References
Artacho, F.J.A., Geoffroy, M.H.: Metric subregularity of the convex subdifferential in Banach spaces. J. Nonlinear Convex Anal. 15(1), 35–47 (2015)
Attouch, H., Bolte, J.: On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math. Program. Ser. B 116, 5–16 (2009)
Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Math. Program. Ser. A 137(1–2), 91–129 (2013)
Attouch, H., Cabot, A.: Convergence rates of inertial forward–backward algorithms. SIAM J. Optim. 28(1), 849–874 (2018)
Attouch, H., Peypouquet, J.: The rate of convergence of Nesterov’s accelerated forward–backward method is actually \(o(k^{-2})\). SIAM J. Optim. 26(3), 1824–1834 (2016)
Banjac, G., Margellos, K., Goulart, P.J.: On the convergence of a regularized Jacobi algorithm for convex optimization. IEEE Trans. Autom. Control 63(4), 1113–1119 (2018)
Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, Berlin (2011)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Bertsekas, D.P.: Convex Optimization Theory. Athena Scientific and Tsinghua University Press, Belmont and Beijing (2011)
Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Program. Ser. A 165(2), 471–507 (2017)
Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. Ser. A 146(1–2), 459–494 (2014)
Brézis, H.: Opérateurs maximaux monotones et semi-groupes de contractions dans les espaces de Hilbert. North-Holland Mathematics Studies, vol. 5. North-Holland Publishing Co., Amsterdam-London (1973)
Bruck, R.E.: Asymptotic convergence of nonlinear contraction semigroups in Hilbert space. J. Funct. Anal. 18(1), 15–26 (1975)
Candès, E.J., Li, X., Soltanolkotabi, M.: Phase retrieval via Wirtinger flow: theory and algorithms. IEEE Trans. Inf. Theory 61(4), 1985–2007 (2015)
Chambolle, A., Pock, T.: An introduction to continuous optimization for imaging. Acta Numer. 25, 161–319 (2016)
Cruz, J., Li, G., Nghia, T.: On the Q-linear convergence of forward–backward splitting method and uniqueness of optimal solution to Lasso. arXiv:1806.06333v1 [math.OC] (2018)
Dontchev, A.L., Rockafellar, R.T.: Implicit Functions and Solution Mappings. Springer, Berlin (2009)
Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. Ser. A 145(1–2), 451–482 (2014)
Drusvyatskiy, D., Kempton, C.: An accelerated algorithm for minimizing convex compositions. arXiv:1605.00125v1 [math.OC] 30 Apr 2016
Drusvyatskiy, D., Lewis, A.S.: Error bounds, quadratic growth, and linear convergence of proximal methods. Math. Oper. Res. Published online: 15 Mar (2018)
Drusvyatskiy, D., Mordukhovich, B.S., Nghia, T.T.A.: Second-order growth, tilt stability, and metric regularity of the subdifferential. J. Convex Anal. 21(4), 1165–1192 (2014)
Garrigos, G., Rosasco, L., Villa, S.: Convergence of the forward–backward algorithm: beyond the worst-case with the help of geometry. arXiv:1703.09477v2 [math.OC] 29 Mar 2017
Gong, P., Ye, J.: Linear convergence of variance-reduced projected stochastic gradient without strong convexity. arXiv:1406.1102v2 [cs.NA] 10 July 2015
Hale, E.T., Yin, W.T., Zhang, Y.: Fixed point continuation for \(\ell _1\)-minimization: methodology and convergence. SIAM J. Optim. 19(3), 1107–1130 (2008)
Hoffman, A.: On approximate solutions of systems of linear inequalities. J. Res. Natl. Bur. Stand. 49, 263–265 (1952)
Hong, M., Wang, X., Razaviyayn, M., Luo, Z.-Q.: Iteration complexity analysis of block coordinate descent methods. Math. Program. Ser. A 163, 85–114 (2017)
Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of proximal-gradient methods under the Polyak–Łojasiewicz condition. arXiv:1608.04636v1 [cs.LG] 16 Aug 2016 (2016)
Karimi H., Nutini J., Schmidt M.: Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In: Frasconi P., Landwehr N., Manco G., Vreeken J. (eds.) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2016. Lecture Notes in Computer Science, vol 9851. Springer, Cham (2016)
Karimi, S., Vavasis, S.: A unified convergence bound for conjugate gradient and accelerated gradient. arXiv:1605.003200v1 [math.OC] 11 May 2016
Kim, D., Fessler, J.: Optimized first-order methods for smooth convex minimization. Math. Program. Ser. A 159(1), 81–107 (2016)
Lai, M.J., Yin, W.T.: Augmented \(\ell _1\) and nuclear-norm models with a globally linearly convergent algorithm. SIAM J. Imaging Sci. 6(2), 1059–1091 (2013)
Li, G., Pong, T.K.: Calculus of the exponent of Kurdyka–Łojasiewicz inequality and its applications to linear convergence of first-order methods. Found. Comput. Math. 1–34. Published online: 10 Aug (2017)
Li, X., Zhao, T., Arora, R., Liu, H., Hong, M.: An improved convergence analysis of cyclic block coordinate descent-type methods for strongly convex minimization. arXiv:1607.02793v1 [math.OC] 10 July 2016
Liu, H., Yue, M.-C., So, A.M.-C.: On the estimation performance and convergence rate of the generalized power method for phase synchronization. SIAM J. Optim. 27(4), 2426–2446 (2017)
Liu, J., Wright, S.J.: Asynchronous stochastic coordinate descent: parallelism and convergence properties. SIAM J. Optim. 25(1), 351–376 (2015)
Łojasiewicz, S.: Une propriété topologique des sous-ensembles analytiques réels. Colloques internationaux du CNRS 117, 87–89 (1963)
Luke, D.R., Thao, N.H., Teboulle, M.: Necessary conditions for linear convergence of Picard iterations and application to alternating projections. arXiv:1704.08926v1 [math.OC] 28 Apr 2017
Luo, Z.Q., Tseng, P.: On the linear convergence of descent methods for convex essentially smooth minimization. SIAM J. Control Optim. 30(2), 408–425 (1992)
Luo, Z.Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res. 46–47(1), 157–178 (1993)
Luque, F.: Asymptotic convergence analysis of the proximal point algorithm. SIAM J. Optim. 22(2), 277–293 (1984)
Ma, C.X., Gudapati, N.V., Jahani, M., Tappenden, R., Takáč, M.: Underestimate sequences via quadratic averaging. arXiv:1710.03695 (2017)
Necoara, I., Nesterov, Y., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Math. Program. Ser. A. Published online: 22 Jan (2018)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Dordrecht (2004)
Nocedal, J., Wright, S.J.: Numerical Optimization. Springer, New York (1997)
Polyak, B.T.: Gradient methods for the minimisation of functionals. USSR Comput. Math. Math. Phys. 3(4), 864–878 (1963)
Robinson, S.M.: Some continuity properties of polyhedral multifunctions. Math. Program. Study 14, 206–214 (1981)
Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
Rockafellar, R.T., Wets, R.J.B.: Variational Analysis. Springer, Berlin (1998)
Ruszczyński, A.P.: Nonlinear Optimization, vol. 13. Princeton University Press, Princeton (2006)
Schöpfer, F.: Linear convergence of descent methods for the unconstrained minimization of restricted strongly convex functions. SIAM J. Optim. 26(3), 1883–1911 (2016)
Shefi, R., Teboulle, M.: On the rate of convergence of the proximal alternating linearized minimization algorithm for convex problems. EURO J. Comput. Optim. 4, 27–46 (2016)
So, M.C.: Non-asymptotic convergence analysis of inexact gradient methods for machine learning without strong convexity. arXiv:1309.0113v1 [math.OC] 31 Aug 2013
Su, W., Boyd, S., Candès, E.J.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. J. Mach. Learn. Res. 17, 1–43 (2016)
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case convergence rates of the proximal gradient method for composite convex minimization. J Optim. Theory Appl. 178(2), 455–476 (2018)
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Program. Ser. A 161(1–2), 307–345 (2017)
Tu, S., Boczar, R., Simchowitz, M., Recht, B.: Low-rank solutions of linear matrix equations via Procrustes flow. arXiv:1507.03566v2 [math.OC] 5 Feb 2016
Wang, P.W., Lin, C.J.: Iteration complexity of feasible descent methods for convex optimization. J. Mach. Learn. Res. 15(2), 1523–1548 (2014)
Wilson, A.C., Recht, B., Jordan, M.I.: A Lyapunov analysis of momentum methods in optimization. arXiv:1611.02635v1 [math.OC] 8 Nov 2016
Xiao, L., Zhang, T.: A proximal-gradient homotopy method for the sparse least-squares problem. SIAM J. Optim. 23(2), 1062–1091 (2013)
Yang, L., Arora, R., Braverman, V., Zhao, T.: The physical systems behind optimization algorithms. arXiv:1612.02803v1 [cs.LG] 8 Dec 2016
Yun, C., Sra, S., Jadbabaie, A.: Global optimality conditions for deep neural networks. arXiv:1707.02444v1 [cs.LG] 8 July 2017
Zhang, H.: New analysis of linear convergence of gradient-type methods via unifying error bound conditions. arXiv:1606.00269v3 [math.OC] 17 Aug 2016
Zhang, H.: The restricted strong convexity revisited: analysis of equivalence to error bound and quadratic growth. Optim. Lett. 11(4), 817–833 (2017)
Zhang, H., Cheng, L.: Restricted strong convexity and its applications to convergence analysis of gradient-type methods. Optim. Lett. 9(5), 961–979 (2015)
Zhang, H., Cheng, L.: Projected shrinkage algorithm for box-constrained \(\ell _1\)-minimization. Optim. Lett. 11(1), 55–70 (2017)
Zhang, H., Cheng, L., Yin, W.T.: A dual algorithm for a class of augmented convex signal recovery models. Commun. Math. Sci. 13(1), 103–112 (2015)
Zhang, H., Yin, W.T.: Gradient methods for convex minimization: better rates under weaker conditions. Technical report, CAM Report 13-17, UCLA (2013)
Zhou, Y., Liang, Y.: Characterization of gradient dominance and regularity conditions for neural networks. arXiv:1710.06910 (2017)
Zhou, Z., So, A.M.-C.: A unified approach to error bounds for structured convex optimization problems. Math. Program. Ser. A 165(2), 689–728 (2017)
Zolezzi, T.: On equiwellset minimum problems. Appl. Math. Optim. 4, 203–223 (1978)
Acknowledgements
I am grateful to the anonymous referees, the associate editor, and the coeditor Prof. Adrian S. Lewis for many useful comments, which allowed me to significantly improve the original presentation. I would like to thank Prof. Zaiwen Wen for his invitation and hospitality during my visit to the Beijing International Center for Mathematical Research, and Prof. Dmitriy Drusvyatskiy for a careful reading of an early draft of this manuscript and for valuable comments and suggestions. I also thank Profs. Chao Ding, Bin Dong, Lei Guo, Yongjin Liu, Deren Han, Mark Schmidt, Anthony Man-Cho So, and Wotao Yin for their time and many helpful discussions. Further thanks are due to my cousin Boya Ouyang, who helped me with my English writing, and to the PhD students Ke Guo, Wei Peng, Ziyang Yuan, and Xiaoya Zhang, who looked over the manuscript and corrected several typos. While visiting the Chinese Academy of Sciences, I was particularly fortunate to become acquainted with Prof. Florian Jarre, who carefully read and polished this paper. This work was supported by the National Science Foundation of China (Nos. 11501569 and 61571008).
Zhang, H. New analysis of linear convergence of gradient-type methods via unifying error bound conditions. Math. Program. 180, 371–416 (2020). https://doi.org/10.1007/s10107-018-01360-1
Keywords
- Residual measure operator
- Error bound
- Gradient descent
- Linear convergence
- Proximal point algorithm
- Forward–backward splitting algorithm
- Proximal alternating linearized minimization
- Nesterov’s acceleration
- Dual objective function