1 Introduction

A standard assumption for proving linear convergence of gradient-type methods is strong convexity [43]. In practice, however, strong convexity is often too stringent. Moreover, various gradient-type methods for solving convex optimization problems have exhibited linear convergence in numerical experiments even when strong convexity is absent; see e.g. [24, 31, 59]. Thus, one may wonder whether such a phenomenon can be explained theoretically, and whether there exist weaker alternatives to strong convexity that retain fast rates.

A very powerful idea to address these questions is to connect error bound (EB) conditions with the convergence rate estimation of gradient-type methods. This idea has a long history dating back to 1963, when Polyak introduced an EB inequality as a sufficient condition for gradient descent to attain linear convergence [45]. In the same year, a wide class of inequalities, which includes Polyak’s as a special case, was introduced by Łojasiewicz [36]. In recent manuscripts [10, 27], the EB condition of Polyak–Łojasiewicz type was further developed for analyzing the complexity of first-order descent methods. The second type of EB condition is due to Hoffman, who proposed an EB inequality for systems of linear inequalities [25] in 1952. Along this line, Luo and Tseng contributed in several ways in the early 1990s to connecting EB conditions of Hoffman’s type with the convergence analysis of descent methods [38, 39]. Recently, global versions of EB conditions of Hoffman’s type have attracted much attention [52, 57, 69]. The third type of EB condition is the quadratic growth condition (also called the zero-order EB condition in [10]), which may go back to the work [70]. It was recently rediscovered in the special case of convex functions, and has been widely used to derive linear convergence for many gradient-type methods as well [23, 35, 42].

Moreover, a surge of interest has recently emerged in developing new EB conditions guaranteeing (global) linear convergence for various gradient-type methods. For example, the authors of [31, 64, 67] proposed a restricted secant inequality (RSI), and developed the restricted strongly convex (RSC) property with parameter \(\nu \) for analyzing linear convergence of (dual) gradient descent methods and Nesterov’s restart accelerated methods; the authors of [42] proposed several relaxations of strong convexity that are sufficient for obtaining linear convergence of (projected and accelerated) gradient-type methods.

Another line of recent work is to find connections between existing EB conditions. The authors of [20] discussed the relationship between the quadratic growth condition and the EB condition of Hoffman’s type (also called Luo–Tseng’s type in [32]). Parallel to and partially influenced by the work [20], the author of this paper also established several new types of equivalence between the RSC property, the quadratic growth condition, and the EB condition of Hoffman’s type in [63]. The authors of [10] showed the equivalence between the quadratic growth condition and the Kurdyka–Łojasiewicz inequality. In addition, we note that the works [27, 42] also discussed the relationships among many of these EB conditions.

Based on these two lines of recent developments, two natural questions arise. The first is whether there is a unified framework for defining different EB conditions and analyzing the connections between them. The second is whether these sufficient conditions guaranteeing linear convergence for gradient-type methods are also necessary. To answer these two questions, we rely on a vital observation: a residual measure operator plays a common and key role in many EB conditions and in a variety of gradient-type methods. This observation immediately leads us to the following discoveries:

  1.

    By linking the residual measure operator with other optimality measures, we define a group of abstract EB conditions. Then, we comprehensively analyze the interplay between them by means of the technique developed in [10], which plays a fundamental role in the corresponding error bound equivalence. The definition of abstract EB conditions not only unifies many existing EB conditions, but also helps us to construct new ones. The interplay between the abstract EB conditions allows us to find new connections between many existing EB conditions.

  2.

    By viewing the residual measure operator as an ascent direction, we propose an abstract gradient-type method, and then derive EB conditions that are necessary and sufficient for its linear convergence. The latter allows us to identify the weakest (that is, necessary and sufficient) conditions guaranteeing linear convergence for a number of fundamental algorithms, including the gradient method (applied to possibly nonconvex optimization), the proximal point algorithm, and the forward–backward splitting algorithm. The sufficiency of these EB conditions for linear convergence is widely known; see e.g. [10]. In contrast, their necessity has received very little attention.

In addition, we make the following contributions, concerning block coordinate gradient descent, Nesterov’s acceleration, and the verification of EB conditions, respectively:

  3.

    We show linear convergence for the proximal alternating linearized minimization (PALM) algorithm under a group of equivalent EB conditions. It has been recently shown [26, 33, 51] that PALM achieves sublinear convergence for convex problems and linear convergence for strongly convex problems. In this study, we show its linear convergence under strictly weaker conditions than strong convexity.

  4.

    By defining a new EB condition, we obtain Q-linear convergence of Nesterov’s accelerated forward–backward algorithm, which generalizes the Q-linear convergence of Nesterov’s accelerated gradient method recently and independently discovered in [29, 58]. In some special cases, the new EB condition can be viewed as a strict relaxation of strong convexity. In this sense, we show Q-linear convergence of Nesterov’s accelerated method without strong convexity. Our proof idea is partially inspired by Attouch and Peypouquet [5] but might be of interest in its own right.

  5.

    We provide a new proof, under slightly weaker assumptions and again by means of the technique developed in [10], that a class of dual objective functions satisfies EB conditions. The authors of [31] gave the first proof for a special case of this class of functions, and the author of [50] gave the first general proof, by contradiction.

The paper is organized as follows. In Sect. 2, we present the basic notation and some elementary preliminaries. In Sect. 3, we analyze necessary and sufficient conditions guaranteeing linear convergence of gradient descent. In Sect. 4, we define a group of abstract EB conditions and analyze the interplay between them. In Sect. 5, we define an abstract gradient-type method and derive EB conditions that are necessary and sufficient for guaranteeing its linear convergence. In Sect. 6, we study linear convergence of the PALM algorithm. In Sect. 7, we study linear convergence of Nesterov’s accelerated forward–backward algorithm. In Sect. 8, we verify EB conditions for a class of dual objective functions. Finally, in Sect. 9, we give a short summary of the paper, along with some discussion of future work.

2 Notation and preliminaries

Throughout the paper, \({\mathbb {R}}^n\) will denote an n-dimensional Euclidean space with inner product \(\langle \cdot , \cdot \rangle \) and induced norm \(\Vert \cdot \Vert \). For any nonempty set \(Q\subset {\mathbb {R}}^n\), we define the distance function by \(d(x,Q):=\inf _{y\in Q}\Vert x-y\Vert \) and the indicator function of Q by

$$\begin{aligned} \delta _Q(x):= \left\{ \begin{array}{ll} 0, &\quad \text {if} ~~x\in Q; \\ +\infty , &\quad \text {otherwise}. \end{array} \right. \end{aligned}$$

We say that f is gradient-Lipschitz-continuous with modulus \(L>0\) if

$$\begin{aligned} \quad \forall x, y\in {\mathbb {R}}^n,~~\Vert \nabla f(x)-\nabla f(y)\Vert \le L\Vert x-y\Vert , \end{aligned}$$

and f is strongly convex with modulus \(\mu >0\) if for any \(\alpha \in [0,1]\),

$$\begin{aligned} \quad \forall x, y\in {\mathbb {R}}^n,~~f(\alpha x+(1-\alpha )y)\le \alpha f(x)+(1-\alpha )f(y)-\frac{1}{2}\mu \alpha (1-\alpha )\Vert x-y\Vert ^2, \end{aligned}$$

or equivalently, when f is differentiable, if

$$\begin{aligned} \quad \forall x, y\in {\mathbb {R}}^n,~~\langle \nabla f(x)-\nabla f(y), x-y\rangle \ge \mu \Vert x-y\Vert ^2. \end{aligned}$$
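
As a quick numerical illustration of these two moduli (our own sketch; the quadratic test function, the matrix \(A\), and the sampling scheme below are assumptions made purely for illustration), one can probe the two defining inequalities on random pairs of points:

```python
import numpy as np

# Empirically probe the gradient-Lipschitz modulus L and the strong-convexity
# modulus mu of f(x) = 0.5 * x^T A x by sampling random pairs (x, y).
rng = np.random.default_rng(0)
A = np.diag([0.5, 1.0, 4.0])      # eigenvalues give mu = 0.5 and L = 4.0 exactly
grad = lambda x: A @ x

ratios_L, ratios_mu = [], []
for _ in range(10000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    dx, dg = x - y, grad(x) - grad(y)
    ratios_L.append(np.linalg.norm(dg) / np.linalg.norm(dx))   # should stay <= L
    ratios_mu.append((dg @ dx) / (dx @ dx))                    # should stay >= mu

print("largest  ||grad f(x) - grad f(y)|| / ||x - y||     :", max(ratios_L))   # ~ 4.0
print("smallest <grad f(x) - grad f(y), x - y> / ||x-y||^2:", min(ratios_mu))  # ~ 0.5
```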

We will consider the following classes of functions.

  • \({{\mathcal {F}}}^1({\mathbb {R}}^n)\): the class of continuously differentiable convex functions from \({\mathbb {R}}^n\) to \({\mathbb {R}}\);

  • \({{\mathcal {F}}}^{1,1}_{L}({\mathbb {R}}^n)\): the class of gradient-Lipschitz-continuous convex functions from \({\mathbb {R}}^n\) to \({\mathbb {R}}\) with Lipschitz modulus L;

  • \({{\mathcal {S}}}^{1,1}_{\mu ,L}({\mathbb {R}}^n)\): the class of gradient-Lipschitz-continuous and strongly convex functions from \({\mathbb {R}}^n\) to \({\mathbb {R}}\) with Lipschitz modulus L and strongly convex modulus \(\mu \);

  • \(\Gamma ({\mathbb {R}}^n)\): the class of proper and lower semicontinuous functions from \({\mathbb {R}}^n\) to \((-\infty , +\infty ]\);

  • \(\Gamma _0({\mathbb {R}}^n)\): the class of proper and lower semicontinuous convex functions from \({\mathbb {R}}^n\) to \((-\infty , +\infty ]\).

Obviously, we have the following inclusions:

$$\begin{aligned} {{\mathcal {S}}}^{1,1}_{\mu ,L}\left( {\mathbb {R}}^n\right) \subseteq {{\mathcal {F}}}^{1,1}_{L}\left( {\mathbb {R}}^n\right) \subseteq {{\mathcal {F}}}^1\left( {\mathbb {R}}^n\right) ,\quad \Gamma _0\left( {\mathbb {R}}^n\right) \subseteq \Gamma \left( {\mathbb {R}}^n\right) . \end{aligned}$$

It is convenient to denote by \(\mathrm {Arg}\min f\) the set of optimal solutions of minimizing f over \({\mathbb {R}}^n\), and to use “\(\arg \min f\)”, when the solution is unique, to stand for that unique solution. If \(\mathrm {Arg}\min f\) is nonempty, we let \(\min f\) denote the minimum of f over \({\mathbb {R}}^n\).

The notion of the subdifferential plays a central role in (non)convex optimization.

Definition 1

(subdifferentials, [48]) Let \(f\in \Gamma ({\mathbb {R}}^n)\). Its domain is defined by

$$\begin{aligned} {\mathrm {dom}}f:=\left\{ x\in {\mathbb {R}}^n: f(x)<+\infty \right\} . \end{aligned}$$
  1. (a)

    For a given \(x\in {\mathrm {dom}}f\), the Fréchet subdifferential of f at x, written \({\hat{\partial }}f(x)\), is the set of all vectors \(u\in {\mathbb {R}}^n\) which satisfy

    $$\begin{aligned} \liminf _{y\rightarrow x,\, y\ne x}\frac{f(y)-f(x)-\langle u, y-x\rangle }{\Vert y-x\Vert }\ge 0. \end{aligned}$$

    When \(x\notin {\mathrm {dom}}f\), we set \({\hat{\partial }}f(x)=\emptyset \).

  2. (b)

    The (limiting) subdifferential of f at \(x\in {\mathbb {R}}^n\), written \(\partial f(x)\), is defined through the following closure process:

    $$\begin{aligned} \partial f(x):=\left\{ u\in {\mathbb {R}}^n: \exists x^k\rightarrow x, f\left( x^k\right) \rightarrow f(x)~\text {and}~ {\hat{\partial }}f\left( x^k\right) \ni u^k \rightarrow u~\text {as}~k\rightarrow \infty \right\} . \end{aligned}$$
  3. (c)

    If we further assume that f is convex, then the subdifferential of f at \(x\in {\mathrm {dom}}f\) can also be defined by

    $$\begin{aligned} \partial f(x):=\left\{ v\in {\mathbb {R}}^n: f(z)\ge f(x)+\langle v, z-x\rangle ,~~\forall ~z\in {\mathbb {R}}^n\right\} . \end{aligned}$$

    The elements of \(\partial f(x)\) are called subgradients of f at x.

Denote the domain of \(\partial f\) by \({\mathrm {dom}}\partial f:= \{x\in {\mathbb {R}}^n:\partial f(x)\ne \emptyset \}\). If \(f\in \Gamma ({\mathbb {R}}^n)\) and \(x\in {\mathrm {dom}}f\), then \(\partial f(x)\) is closed (see Theorem 8.6 in [48]); if \(f\in \Gamma _0({\mathbb {R}}^n)\) and \(x\in {\mathrm {dom}}\partial f\), then \({\mathrm {dom}}\partial f\subset {\mathrm {dom}}f\) and \(\partial f(x)\) is a nonempty closed convex set (see Proposition 16.3 in [7]). In the latter case, we denote by \(\partial ^0 f(x)\) the unique least-norm element of \(\partial f(x)\) for \(x\in {\mathrm {dom}}\partial f\), along with the convention that \(\Vert \partial ^0 f(x)\Vert =+\infty \) for \(x\notin {\mathrm {dom}}\partial f\). Points whose subdifferential contains 0 are called critical points. The set of critical points of f is denoted by \({\mathbf {crit}}f\). If \(f\in \Gamma _0({\mathbb {R}}^n)\), then \({\mathbf {crit}}f=\mathrm {Arg}\min f\).

Let \(f\in \Gamma _0({\mathbb {R}}^n)\); its Fenchel conjugate function \(f^*:{\mathbb {R}}^n\rightarrow (-\infty , +\infty ]\) is defined by

$$\begin{aligned} f^*(x):=\sup _{y\in {\mathbb {R}}^n}\{\langle y, x\rangle -f(y)\}, \end{aligned}$$

and the proximal mapping operator by

$$\begin{aligned} {\mathbf {prox}}_{\lambda f}(x):=\arg \min _{y\in {\mathbb {R}}^n}\left\{ f(y)+\frac{1}{2\lambda }\Vert y-x\Vert ^2\right\} . \end{aligned}$$
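
For concreteness, here is a small sketch (ours, not taken from the paper) of these objects for \(g(x)=\Vert x\Vert _1\): the proximal mapping has the closed-form componentwise soft-thresholding expression, and the least-norm subgradient \(\partial ^0 g\) picks 0 out of \(\partial g(0)=[-1,1]\) at zero entries.

```python
import numpy as np

def prox_l1(x, lam):
    """prox_{lam * ||.||_1}(x): componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def least_norm_subgrad_l1(x):
    """Least-norm element of the subdifferential of ||.||_1 at x."""
    return np.sign(x)            # sign(0) = 0 is exactly the least-norm choice

x = np.array([1.5, -0.2, 0.0])
print(prox_l1(x, 0.5))           # [ 1. -0.  0.]
print(least_norm_subgrad_l1(x))  # [ 1. -1.  0.]
```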

For each \(x\in \overline{{\mathrm {dom}}f} \), it is well-known [12] that there is a unique absolutely continuous curve \(\chi _x: [0, \infty )\rightarrow {\mathbb {R}}^n\) such that \(\chi _x(0)=x\) and for almost every \(t>0\),

$$\begin{aligned} {\dot{\chi }}_x(t)\in -\partial f(\chi _x(t)). \end{aligned}$$

We say that \(\Omega \subset {\mathbb {R}}^n\) is \(\partial f\)-invariant if

$$\begin{aligned} (\forall x\in \Omega \cap {\mathrm {dom}}~\partial f)(\text {for a.e.}, t>0)~~\chi _x(t)\in \Omega . \end{aligned}$$

This concept was proposed in [12] and recently used in [22]. Several types of sets \(\Omega \) are known to be \(\partial f\)-invariant; see Example 7.2 in [22] and Section IV.4 in [12]. In Sects. 5 and 8, we will use the fact that the sublevel set \(X_r:=\{x:f(x)\le r\}\) is always \(\partial f\)-invariant for any function \(f\in \Gamma _0({\mathbb {R}}^n)\).

Finally, we present some variational analysis tools. Let \({\mathcal {T}}\), \({{\mathcal {E}}}\), and \({{\mathcal {E}}}_i,\)\(i=1,2\) be finite-dimensional Euclidean spaces. The closed ball around \(x\in {{\mathcal {E}}}\) with radius \(r>0\) is denoted by \({\mathbb {B}}_{{\mathcal {E}}}(x,r):=\{y\in {{\mathcal {E}}}:\Vert x-y\Vert \le r\}\). The unit ball is denoted by \({\mathbb {B}}_{{\mathcal {E}}}\) for simplicity, and the open unit ball around the origin in \({{\mathcal {E}}}\) is denoted by \({\mathbb {B}}_{{\mathcal {E}}}^o\). A multi-function \(S: {\mathcal {E}}_1\rightrightarrows {\mathcal {E}}_2\) is a mapping assigning to each point in \({{\mathcal {E}}}_1\) a subset of \({{\mathcal {E}}}_2\). The graph of S is defined by

$$\begin{aligned} \texttt {gph}(S):=\{(u,v)\in {{\mathcal {E}}}_1\times {{\mathcal {E}}}_2: v\in S(u)\}. \end{aligned}$$

The inverse map \(S^{-1}:{\mathcal {E}}_2\rightrightarrows {\mathcal {E}}_1\) is defined by setting

$$\begin{aligned} S^{-1}(v):=\{u\in {{\mathcal {E}}}_1: v\in S(u)\}. \end{aligned}$$

Calmness and metric subregularity have been considered in various contexts and under various names. Here, we follow the terminology of Dontchev and Rockafellar [17].

Definition 2

([17], Chapter 3H)

  1. (a)

    A multi-function \(S: {\mathcal {E}}_1\rightrightarrows {\mathcal {E}}_2\) is said to be calm with constant \(\kappa >0\) around \({\bar{u}}\in {\mathcal {E}}_1\) for \({\bar{v}}\in {\mathcal {E}}_2\) if \(({\bar{u}},{\bar{v}})\in \texttt {gph}(S)\) and there exist constants \(\epsilon , \delta >0\) such that

    $$\begin{aligned} S(u)\cap {\mathbb {B}}_{{\mathcal {E}}_2}({\bar{v}},\epsilon )\subseteq S({\bar{u}})+\kappa \cdot \Vert u-{\bar{u}}\Vert _2{\mathbb {B}}_{{\mathcal {E}}_2},\quad \forall u\in {\mathbb {B}}_{{\mathcal {E}}_1}({\bar{u}},\delta ), \end{aligned}$$
    (1)

    or equivalently,

    $$\begin{aligned} S(u)\cap {\mathbb {B}}_{{\mathcal {E}}_2}({\bar{v}},\epsilon )\subseteq S({\bar{u}})+\kappa \cdot \Vert u-{\bar{u}}\Vert _2{\mathbb {B}}_{{\mathcal {E}}_2},\quad \forall u\in {\mathcal {E}}_1. \end{aligned}$$
    (2)
  2. (b)

    A multi-function \(S: {\mathcal {E}}_1\rightrightarrows {\mathcal {E}}_2\) is said to be metrically sub-regular with constant \(\kappa >0\) around \({\bar{u}}\in {\mathcal {E}}_1\) for \({\bar{v}}\in {\mathcal {E}}_2\) if \(({\bar{u}},{\bar{v}})\in \texttt {gph}(S)\) and there exists a constant \(\delta >0\) such that

    $$\begin{aligned} d\left( u, S^{-1}({\bar{v}})\right) \le \kappa \cdot d({\bar{v}},S(u)),\quad \forall u\in {\mathbb {B}}_{{\mathcal {E}}_1}({\bar{u}},\delta ). \end{aligned}$$
    (3)

Note that the calmness defined above is weaker than the local upper Lipschitz-continuity property [46]:

$$\begin{aligned} S(u) \subseteq S({\bar{u}})+\kappa \cdot \Vert u-{\bar{u}}\Vert _2{\mathbb {B}}_{{\mathcal {E}}_2},\quad \forall u\in {\mathbb {B}}_{{\mathcal {E}}_1}({\bar{u}},\delta ), \end{aligned}$$
(4)

which requires the multi-function S to be calm around \({\bar{u}}\in {{\mathcal {E}}}_1\) with constant \(\kappa >0\) for every \({\bar{v}}\in S({\bar{u}})\). Recently, the local upper Lipschitz-continuity property (4) was employed in [50] as a main assumption for verifying EB conditions for a class of dual objective functions.

3 The gradient descent: a necessary and sufficient condition for linear convergence

In this section, we first identify the weakest condition that ensures that gradient descent converges linearly, and then we show that a number of existing linear convergence results can be recovered in a unified and transparent manner. This is a “warm-up” section for the forthcoming abstract theory in Sects. 4 and 5.

Now, we start by considering the following unconstrained optimization problem

$$\begin{aligned} {\mathop {\mathrm{minimize}}\limits _{x\in {\mathbb {R}}^n}}\,f(x), \end{aligned}$$

where \(f: {\mathbb {R}}^n\rightarrow {\mathbb {R}}\) is a differentiable function achieving its minimum \(\min f\) so that \(\mathrm {Arg}\min f\ne \emptyset \). Note that \(\mathrm {Arg}\min f\) is closed since f is differentiable. For any \(x\in {\mathbb {R}}^n\), the set of its projection points onto \(\mathrm {Arg}\min f\), denoted by \({{\mathcal {Y}}}_f(x)\), is nonempty. Let \(\{x_k\}_{k\ge 0}\) be generated by the gradient descent method

$$\begin{aligned} x_{k+1}=x_k-h\cdot \nabla f(x_k), \quad k\ge 0, \end{aligned}$$
(5)

where \(h>0\) is a constant step size. Observe that \(d(x_k,\mathrm {Arg}\min f)\) measures how close \(x_k\) is to \(\mathrm {Arg}\min f\), and the ratio of \(d(x_{k+1},\mathrm {Arg}\min f)\) to \(d(x_k,\mathrm {Arg}\min f)\) measures how fast \(x_k\) converges to \(\mathrm {Arg}\min f\). Now, we analyze the ratio of \(d(x_{k+1},\mathrm {Arg}\min f)\) to \(d(x_k,\mathrm {Arg}\min f)\) as follows

$$\begin{aligned} d^2\left( x_{k+1},\mathrm {Arg}\min f\right)&= \left\| x_{k+1}-x_{k+1}^\prime \right\| ^2\le \left\| x_{k+1}-x_k^\prime \right\| ^2 \\&= \left\| x_k-h\cdot \nabla f(x_k)-x_k^\prime \right\| ^2 \\&= d^2\left( x_k,\mathrm {Arg}\min f\right) -2h\left\langle \nabla f(x_k),x_k-x_k^\prime \right\rangle +h^2\Vert \nabla f(x_k)\Vert ^2, \end{aligned}$$

where \(x_{k+1}^\prime \in {{\mathcal {Y}}}_f(x_{k+1})\) and \(x_{k}^\prime \in {{\mathcal {Y}}}_f(x_{k})\). To ensure that gradient descent converges linearly in the following sense:

$$\begin{aligned} d^2(x_{k+1},\mathrm {Arg}\min f) \le \tau \cdot d^2(x_k,\mathrm {Arg}\min f),\quad k\ge 0, \end{aligned}$$
(6)

it suffices to require that for all \(k\ge 0\) and every \(x_k^\prime \in {{\mathcal {Y}}}_f(x_{k})\),

$$\begin{aligned} d^2(x_k,\mathrm {Arg}\min f) -2h\left\langle \nabla f(x_k),x_k-x_k^\prime \right\rangle +h^2\Vert \nabla f(x_k)\Vert ^2\le \tau \cdot d^2(x_k,\mathrm {Arg}\min f), \end{aligned}$$

i.e.,

$$\begin{aligned} \inf _{u\in {{\mathcal {Y}}}_f(x_{k})} \langle \nabla f(x_k),x_k-u\rangle \ge \frac{1-\tau }{2h}d^2(x_k,\mathrm {Arg}\min f)+\frac{h}{2}\Vert \nabla f(x_k)\Vert ^2, \quad k\ge 0. \end{aligned}$$
(7)

It turns out that this sufficient condition is also necessary when the objective function f belongs to \({{\mathcal {F}}}_L^{1,1}({\mathbb {R}}^n)\) and the step size h lies in some interval.

Proposition 1

Let f be a differentiable function on \({\mathbb {R}}^n\) achieving its minimum \(\min f\) so that \(\mathrm {Arg}\min f\ne \emptyset \), and let \(h>0\) and \(\tau \in (0,1)\).

  1. (i)

    If the condition (7) holds, then the sequence \(\{x_k\}_{k\ge 0}\) generated by the gradient descent method (5) must converge linearly in the sense of (6).

  2. (ii)

    Let \(f\in {{\mathcal {F}}}_L^{1,1}({\mathbb {R}}^n)\). If the sequence \(\{x_k\}\) generated by the gradient descent method (5) with \(0< h\le \frac{1-\sqrt{\tau }}{L}\) converges linearly in the sense of (6), then the condition (7) must hold.

Proof

The sufficiency part was already shown in the derivation above. We now show the necessity part. Pick \(u_{k+1}\in {{\mathcal {Y}}}_f(x_{k+1})\) to derive that

$$\begin{aligned} d(x_k, \mathrm {Arg}\min f)&\le \Vert x_k-u_{k+1}\Vert \le \Vert x_{k+1}-u_{k+1}\Vert +\Vert x_{k+1}-x_k\Vert \nonumber \\&= d(x_{k+1}, \mathrm {Arg}\min f)+ h\Vert \nabla f(x_k)\Vert , \quad k\ge 0. \end{aligned}$$
(8)

Combine (8) and the fact of linear convergence

$$\begin{aligned} d(x_{k+1},\mathrm {Arg}\min f) \le \sqrt{\tau } \cdot d(x_{k},\mathrm {Arg}\min f), \quad k\ge 0 \end{aligned}$$

to obtain

$$\begin{aligned} (1-\sqrt{\tau })d(x_k, \mathrm {Arg}\min f) \le h\Vert \nabla f(x_k)\Vert , \quad k\ge 0. \end{aligned}$$
(9)

According to Theorem 2.1.5 in [43], we know that \(f\in {{\mathcal {F}}}_L^{1,1}({\mathbb {R}}^n)\) implies

$$\begin{aligned} \langle \nabla f(x_k),x_k-v_k\rangle \ge \frac{1}{L}\Vert \nabla f(x_k)\Vert ^2, ~ v_k\in {{\mathcal {Y}}}_f(x_{k}), \quad k\ge 0. \end{aligned}$$

Taking any \(\alpha ,\beta >0\) with \(\alpha +\beta \le 1\), we have that for any \( v_k\in {{\mathcal {Y}}}_f(x_{k})\),

$$\begin{aligned} \langle \nabla f(x_k),x_k-v_k\rangle&\ge \frac{\alpha }{L}\Vert \nabla f(x_k)\Vert ^2+\frac{\beta }{L}\Vert \nabla f(x_k)\Vert ^2\\&\ge \frac{\alpha }{L}\Vert \nabla f(x_k)\Vert ^2+\frac{\beta \left( 1-\sqrt{\tau }\right) ^2}{Lh^2}d(x_k, \mathrm {Arg}\min f)^2, \quad k\ge 0, \end{aligned}$$

where the last inequality follows from (9). Thus, by letting \(\frac{\alpha }{L}=\frac{h}{2}\) and \(\frac{\beta (1-\sqrt{\tau })^2}{Lh^2}=\frac{1-\tau }{2h}\), we get the condition (7). Finally, we need

$$\begin{aligned} \alpha +\beta =\frac{Lh}{2}+\frac{Lh(1-\tau )}{2(1-\sqrt{\tau })^2}=\frac{hL}{1-\sqrt{\tau }}\le 1, \end{aligned}$$

which forces \(h\le \frac{1-\sqrt{\tau }}{L}\). This completes the proof. \(\square \)
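
As a small numerical companion to Proposition 1 (our own sketch; the quadratic below is a hypothetical test problem, not one taken from the paper), one can run the gradient descent (5) and monitor both sides of condition (7) along the iterates:

```python
import numpy as np

# Run gradient descent (5) on f(x) = 0.5 * x^T A x (so Argmin f = {0}) and check
# condition (7): <grad f(x_k), x_k - u>  >=  (1-tau)/(2h) * d^2(x_k, Argmin f)
#                                            + (h/2) * ||grad f(x_k)||^2.
A = np.diag([0.5, 1.0, 4.0])             # mu = 0.5, L = 4.0
grad = lambda x: A @ x
mu, L = 0.5, 4.0
h = 2.0 / (mu + L)                       # classical optimal step size
tau = ((L - mu) / (L + mu)) ** 2         # the rate this step size should achieve

x = np.array([1.0, -2.0, 0.5])
for k in range(5):
    lhs = grad(x) @ x                    # here the projection u of x_k is 0
    rhs = (1 - tau) / (2 * h) * (x @ x) + (h / 2) * (grad(x) @ grad(x))
    print(f"k={k}: lhs={lhs:.4f} >= rhs={rhs:.4f} : {lhs >= rhs}")
    x = x - h * grad(x)
```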

The condition (7) means that if the steepest descent direction \(-\nabla f(x)\) is well correlated with every direction towards optimality \(u-x\), where \(u\in {{\mathcal {Y}}}_f(x)\), then a linear convergence rate of the gradient descent method can be ensured. Conversely, when \(f\in {{\mathcal {F}}}_L^{1,1}({\mathbb {R}}^n)\), if gradient descent converges linearly and the step size lies in the interval \((0, \frac{1-\sqrt{\tau }}{L}]\), then \(-\nabla f(x)\) must be well correlated with \(u-x\). We now list some direct applications of this basic observation.

In our first illustrative example, we consider functions in \(S_{\mu , L}^{1,1}({\mathbb {R}}^n)\). First, we introduce an important property of this class of functions.

Lemma 1

([43]) If \(f\in S_{\mu , L}^{1,1}({\mathbb {R}}^n)\), then we have

$$\begin{aligned}&\forall x, y\in {\mathbb {R}}^n,~~ \langle \nabla f(x)-\nabla f(y), x-y\rangle \ge \frac{\mu L}{\mu +L} \Vert x-y\Vert ^2 \\&\quad +\frac{1}{\mu +L}\Vert \nabla f(x)-\nabla f(y)\Vert ^2. \end{aligned}$$

Let \(x^*\) be the unique minimizer of \(f\in S_{\mu , L}^{1,1}({\mathbb {R}}^n)\); then \(\mathrm {Arg}\min f=\{x^*\}\). Using the inequality above with \(x=x_k, y=x^*\) and noting that \(\nabla f(x^*)=0\) and \(\Vert x_k-x^*\Vert = d(x_k,\mathrm {Arg}\min f)\), we obtain

$$\begin{aligned} \left\langle \nabla f(x_k),x_k-x^*\right\rangle \ge \frac{\mu L}{\mu +L}d^2(x_k,\mathrm {Arg}\min f)+\frac{1}{\mu +L}\Vert \nabla f(x_k)\Vert ^2, ~k\ge 0. \end{aligned}$$

To guarantee the condition (7), we only need

$$\begin{aligned} \frac{\mu L}{\mu +L}\ge \frac{1-\tau }{2h}~~\text {and}~~\frac{1}{\mu +L}\ge \frac{h}{2}, \end{aligned}$$

which implies that

$$\begin{aligned} \frac{(1-\tau )(\mu +L)}{2\mu L}\le h\le \frac{2}{\mu +L},~~ \tau \ge \tau _0:=\left( \frac{L-\mu }{L+\mu }\right) ^2. \end{aligned}$$

The optimal linear convergence rate \(\tau _0\) can be obtained by setting \(h=\frac{2}{\mu +L}\). This gives the corresponding result in Nesterov’s book; see Theorem 2.1.15 in [43].
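
A short numerical check of this classical rate (ours; the diagonal matrix is an assumed test instance): for a quadratic with spectrum in \([\mu , L]\), the gradient map \(x\mapsto x-h\nabla f(x)=(I-hA)x\) with \(h=\frac{2}{\mu +L}\) contracts distances by exactly \(\frac{L-\mu }{L+\mu }\), i.e. squared distances by \(\tau _0\).

```python
import numpy as np

# Spectral radius of I - h*A at h = 2/(mu+L) equals (L - mu)/(L + mu),
# whose square is the optimal rate tau_0 discussed above.
mu, L = 0.5, 4.0
A = np.diag([mu, 1.0, L])
h = 2.0 / (mu + L)
rho = np.abs(np.linalg.eigvals(np.eye(3) - h * A)).max()
print(rho, (L - mu) / (L + mu), rho**2)   # 0.7777..., 0.7777..., tau_0 ~ 0.6049
```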

In our second illustrative example, we consider RSC functions [64, 67]. The following property can be viewed as a convex combination of the restricted strong convexity property and the gradient-Lipschitz-continuity property; see Lemma 3 in [64].

Lemma 2

([64]) If \(f\in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\) and f is RSC with \(0<\nu < L\), then for every \(\theta \in [0,1]\) it holds:

$$\begin{aligned} \quad \forall x \in {\mathbb {R}}^n,~~ \left\langle \nabla f(x), x-{x^\prime }\right\rangle \ge \frac{\theta }{L}\Vert \nabla f(x)\Vert ^2+ (1-\theta )\nu d^2(x,\mathrm {Arg}\min f), \end{aligned}$$

where \(x^\prime \) is the unique projection point of x onto \(\mathrm {Arg}\min f\) since \(\mathrm {Arg}\min f\) is a nonempty closed convex set.

Similarly, to guarantee the condition (7), we only need

$$\begin{aligned} (1-\theta )\nu \ge \frac{1-\tau }{2h}~~\text {and}~~\frac{\theta }{L}\ge \frac{h}{2}, \end{aligned}$$

which implies that

$$\begin{aligned} \frac{1-\tau }{2(1-\theta )\nu }\le h\le \frac{2\theta }{L},~~ \tau \ge 1- \frac{4\theta (1-\theta )\nu }{L}\ge 1-\frac{\nu }{L}. \end{aligned}$$

The optimal linear convergence rate \(1-\frac{\nu }{L}\) can be obtained at \(\theta =\frac{1}{2}\) and \(h=\frac{1}{L}\). This gives the corresponding result in [64]. The argument here is much simpler than that previously employed to derive the same result; see the proof of Theorem 2 in [64].
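
To see that this covers problems with no strong convexity at all, consider the standard example (our illustration with hypothetical random data) \(f(x)=\frac{1}{2}\Vert Ax-b\Vert ^2\) with a rank-deficient A and \(b\in \mathrm {range}(A)\): here \(\nu \) can be taken as the smallest nonzero eigenvalue of \(A^TA\) and L as its largest eigenvalue, and gradient descent with \(h=\frac{1}{L}\) contracts \(d^2(x_k,\mathrm {Arg}\min f)\) at least as fast as \(1-\frac{\nu }{L}\).

```python
import numpy as np

# f(x) = 0.5*||A x - b||^2 with rank-deficient A: not strongly convex, but RSC
# with nu = smallest nonzero eigenvalue of A^T A; L = largest eigenvalue.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5)) @ rng.standard_normal((5, 50))   # rank 5, n = 50
b = A @ rng.standard_normal(50)                                   # b in range(A), min f = 0
eigs = np.linalg.eigvalsh(A.T @ A)                                # ascending order
nz = eigs[eigs > 1e-8 * eigs[-1]]
nu, L = nz[0], eigs[-1]

Ap = np.linalg.pinv(A)
dist = lambda x: np.linalg.norm(Ap @ (A @ x - b))   # d(x, Argmin f); Argmin f = {x : A x = b}

x = rng.standard_normal(50)
for k in range(5):
    d0 = dist(x)
    x = x - (A.T @ (A @ x - b)) / L                 # gradient step with h = 1/L
    print(f"k={k}: d^2 ratio = {dist(x)**2 / d0**2:.4f}   (theory: <= {1 - nu/L:.4f})")
```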

The last example is a nonconvex minimization problem. The following definition can be viewed as a local version of Lemma 2. Therefore, it is not difficult to predict local linear convergence under such a property.

Definition 3

(Regularity Condition, [14]) Let \({{\mathcal {N}}}\) be a neighborhood of \(\mathrm {Arg}\min f\) and let \(\alpha , \beta >0\). We say that f satisfies the regularity condition if

$$\begin{aligned} \quad \forall x\in {{\mathcal {N}}},~~\inf _{u\in {{\mathcal {Y}}}_f(x)}\langle \nabla f(x), x-u\rangle \ge \frac{1}{\alpha } d^2(x,\mathrm {Arg}\min f)+ \frac{1}{\beta }\Vert \nabla f(x)\Vert ^2. \end{aligned}$$

Again, to guarantee the condition (7) locally, we only need

$$\begin{aligned} \frac{1}{\alpha }\ge \frac{1-\tau }{2h}~~\text {and}~~\frac{1}{\beta }\ge \frac{h}{2}, \end{aligned}$$

which implies that

$$\begin{aligned} \frac{(1-\tau )\alpha }{2}\le h\le \frac{2}{\beta },~~ \tau \ge \tau _0:=\left( 1-\frac{4}{\alpha \beta }\right) . \end{aligned}$$

The optimal linear convergence rate \(\tau _0\) can be obtained by setting \(h=\frac{2}{\beta }\) and assuming \(\alpha \beta >4\). The latter can usually be guaranteed; see e.g. the argument below Lemma 7.10 in [14]. Therefore, we obtain the corresponding result in [14]. The regularity condition provably holds for nonconvex optimization problems that appear in phase retrieval and low-rank matrix recovery; interested readers can refer to [14, 56] for details.

Observe that the right-hand side of (7) has two terms. In order to better analyze this condition, we decompose it into two parts:

$$\begin{aligned}&\inf _{u\in {{\mathcal {Y}}}_f(x_{k})} \langle \nabla f(x_k),x_k-u\rangle \ge \theta _1\cdot d^2(x_k,\mathrm {Arg}\min f), \\&\inf _{u\in {{\mathcal {Y}}}_f(x_{k})} \langle \nabla f(x_k),x_k-u\rangle \ge \theta _2\cdot \Vert \nabla f(x_k)\Vert ^2, \end{aligned}$$

where \(\theta _i, i=1,2\) are some positive parameters. This idea of separating the right-hand side of (7) partially inspires us to consider new and abstract error bound conditions, which are the main content of the next section.

4 Abstract EB conditions: definition and interplay

This section is divided into two parts. In the first part, we define a group of EB conditions in a unified and abstract way. In the second part, we discuss some interplay between them, along with new connections between many existing EB conditions.

4.1 Definition of abstract EB conditions

The concept of residual measure operator, given by the following definition, will play a key role in the forthcoming theory.

Definition 4

Let \(\varphi \in \Gamma ({\mathbb {R}}^n)\) and \(X\subset {\mathbb {R}}^n\). We say that \(G_\varphi : X \rightarrow {\mathbb {R}}^n\) is a residual measure operator related to \(\varphi \) and X, if it satisfies

$$\begin{aligned} \left\{ x\in X: G_{\varphi }(x)=0\right\} ={\mathbf {crit}}\varphi . \end{aligned}$$

In particular, if we further assume that \(\varphi \) is convex, the above condition can be written as

$$\begin{aligned} \left\{ x\in X: G_{\varphi }(x)=0\right\} =\mathrm {Arg}\min \varphi . \end{aligned}$$

Now, we define a group of abstract EB conditions.

Definition 5

Let \(\varphi \in \Gamma ({\mathbb {R}}^n)\) be such that it achieves its minimum \(\min \varphi \) and that its critical point set \({\mathbf {crit}}\varphi \) is nonempty and closed. Let \(X\subset {\mathbb {R}}^n\), \(\Omega \subset X\), and \(G_\varphi \) be a residual measure operator related to \(\varphi \) and X. Define the projection operator \({\mathcal {P}}_\varphi : {\mathbb {R}}^n\rightrightarrows {\mathbb {R}}^n\) onto \({\mathbf {crit}}\varphi \) by:

$$\begin{aligned} {\mathcal {P}}_\varphi (x):={\mathrm {Arg}}\,{\mathop {\mathrm{min}}\limits _{u\in {\mathbf {crit}}\varphi }}\Vert x-u\Vert . \end{aligned}$$

We call \(d(x,{\mathbf {crit}}\varphi )\) the point value error, \(\varphi (x)-\min \varphi \) the objective value error, \(\Vert G_{\varphi }(x)\Vert \) the residual value error, and \(\inf _{x_p\in {\mathcal {P}}_\varphi (x)}\langle G_{\varphi }(x), x-x_p\rangle \) the least correlated error. With these optimality measures, we say that

  1.

    \(\varphi \) satisfies the residual-point value EB condition with operator \(G_\varphi \) and constant \(\kappa >0\) on \(\Omega \), abbreviated \((G_\varphi , \kappa , \Omega )\)-(res-EB) condition, if:

    $$\begin{aligned} \forall ~ x\in \Omega \cap {\mathrm {dom}}\varphi ,~~\Vert G_{\varphi }(x)\Vert \ge \kappa \cdot d(x,{\mathbf {crit}}\varphi ); \end{aligned}$$
    (res-EB)
  2.

    \(\varphi \) satisfies the correlated-point value EB condition with operator \(G_\varphi \) and constant \(\nu >0\) on \(\Omega \), abbreviated \((G_\varphi , \nu , \Omega )\)-(cor-EB) condition, if:

    $$\begin{aligned} \forall ~ x\in \Omega \cap {\mathrm {dom}}\varphi ,~~\inf _{x_p\in {\mathcal {P}}_\varphi (x)}\langle G_{\varphi }(x), x-x_p\rangle \ge \nu \cdot d^2(x,{\mathbf {crit}}\varphi ); \end{aligned}$$
    (cor-EB)
  3.

    \(\varphi \) satisfies the objective-point value EB condition with constant \(\alpha >0\) on \(\Omega \), abbreviated \((\varphi , \alpha , \Omega )\)-(obj-EB) condition, if:

    $$\begin{aligned} \forall ~ x\in \Omega \cap {\mathrm {dom}}\varphi ,~~\varphi (x)- \min \varphi \ge \frac{\alpha }{2}\cdot d^2(x,{\mathbf {crit}}\varphi ); \end{aligned}$$
    (obj-EB)
  4.

    \(\varphi \) satisfies the residual-objective value EB condition with operator \(G_\varphi \) and constant \(\eta >0\) on \(\Omega \), abbreviated \((G_\varphi , \eta , \Omega )\)-(res-obj-EB) condition, if:

    $$\begin{aligned} \forall ~ x\in \Omega \cap {\mathrm {dom}}\varphi ,~~\Vert G_{\varphi }(x)\Vert \ge \eta \cdot \sqrt{\varphi (x)- \min \varphi }; \end{aligned}$$
    (res-obj-EB)
  5.

    \(\varphi \) satisfies the correlated-residual value EB condition with operator \(G_\varphi \) and constant \(\beta >0\) on \(\Omega \), abbreviated \((G_\varphi , \beta , \Omega )\)-(cor-res-EB) condition, if:

    $$\begin{aligned} \forall ~ x\in \Omega \cap {\mathrm {dom}}\varphi ,~~\inf _{x_p\in {\mathcal {P}}_\varphi (x)}\langle G_{\varphi }(x), x-x_p\rangle \ge \beta \cdot \Vert G_{\varphi }(x)\Vert ^2; \end{aligned}$$
    (cor-res-EB)
  6.

    \(\varphi \) satisfies the correlated-objective value EB condition with operator \(G_\varphi \) and constant \(\omega >0\) on \(\Omega \), abbreviated \((G_\varphi , \omega , \Omega )\)-(cor-obj-EB) condition, if:

    $$\begin{aligned} \forall ~ x\in \Omega \cap {\mathrm {dom}}\varphi ,~~\inf _{x_p\in {\mathcal {P}}_\varphi (x)}\langle G_{\varphi }(x), x-x_p\rangle \ge \omega \cdot (\varphi (x)- \min \varphi ). \end{aligned}$$
    (cor-obj-EB)

We will refer to these EB conditions as global if \(\Omega ={\mathbb {R}}^n\). For global EB conditions, we will omit \(\Omega \) for simplicity.

In order to gain some intuition about the abstract EB conditions, we point out their correspondences to existing notions: (res-EB) corresponds to the EB condition of Hoffman’s type [20, 38, 69], (res-obj-EB) to that of Polyak–Łojasiewicz type [10, 27], (obj-EB) to the quadratic growth condition [10, 20], (cor-EB) to the RSI type [67], and (cor-obj-EB) to the subgradient inequality for convex functions. The (cor-res-EB) condition, which will be used in Sect. 5, is a relaxation of the following property:

$$\begin{aligned} \quad \forall x, y\in {\mathbb {R}}^n,~~\langle \nabla \varphi (x)-\nabla \varphi (y), x-y\rangle \ge \frac{1}{L}\Vert \nabla \varphi (x)-\nabla \varphi (y)\Vert ^2, \end{aligned}$$

which is equivalent to \(\varphi \in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\); see Theorem 2.1.5 in [43]. The reader may now see that we use the residual measure operator in place of concrete notions, such as the gradient and the proximal gradient, appearing in some existing conditions. Compared to [10], however, we consider only the case where the desingularizing function (used in the definition of the Kurdyka–Łojasiewicz inequality) is taken to be the square root. Below, we briefly explain why we only consider the square-root case.

  1.

    Our motivation is to study linear convergence without strong convexity, as described at the beginning of the introduction. Many clues indicate that linear convergence corresponds to the square-root case. In fact, we will show that this link is essential.

  2.

    Our framework includes six types of abstract EB conditions, while [10] only considers two types of them. Currently, it is not an easy task to take general desingularizing functions into account in this framework. But this definitely deserves further study in the future.

In our earlier manuscript [62], we only gave rough global versions of the EB conditions in Definition 5. The list above was obtained by incorporating the referee’s comments and was influenced by the recent work [22], resulting in a much more complete list than the previous one.
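
To make the abstract definitions concrete, the following sketch (our own toy example; the particular function and all names in it are assumptions for illustration only) takes the one-dimensional function \(\varphi (x)=\frac{1}{2}(x-1)^2+|x|\), whose unique minimizer is \(x^*=0\) with \(\min \varphi =\frac{1}{2}\), evaluates the least-norm subgradient \(\partial ^0\varphi \) and the proximal-gradient residual \({\mathcal {R}}_t\), and reads off empirical constants for (res-EB), (obj-EB) and (res-obj-EB) on a grid of points.

```python
import numpy as np

# phi(x) = 0.5*(x-1)^2 + |x| = f(x) + g(x), with Argmin phi = {0}, min phi = 0.5.
f_grad = lambda x: x - 1.0
soft = lambda y, lam: np.sign(y) * max(abs(y) - lam, 0.0)   # prox of lam*|.|
phi = lambda x: 0.5 * (x - 1.0) ** 2 + abs(x)
min_phi, t = 0.5, 1.0                    # step t = 1/L with L = 1

def least_norm_subgrad(x):
    # partial phi(x) = {x} for x > 0, {x - 2} for x < 0, and [-2, 0] at x = 0.
    return x if x > 0 else (x - 2.0 if x < 0 else 0.0)

def R_t(x):
    # proximal-gradient residual: R_t(x) = (x - prox_{t*g}(x - t*grad f(x))) / t
    return (x - soft(x - t * f_grad(x), t)) / t

xs = np.linspace(-2.0, 2.0, 401)
xs = xs[np.abs(xs) > 1e-9]               # exclude the minimizer itself
kappa = min(abs(R_t(x)) / abs(x) for x in xs)                                   # (res-EB)
alpha = min(2.0 * (phi(x) - min_phi) / x ** 2 for x in xs)                      # (obj-EB)
eta = min(abs(least_norm_subgrad(x)) / np.sqrt(phi(x) - min_phi) for x in xs)   # (res-obj-EB)
print(f"kappa ~ {kappa:.3f}, alpha ~ {alpha:.3f}, eta ~ {eta:.3f}")   # ~1, ~1, ~1.414
```

Such probing only suggests plausible constants on the sampled region; how the constants of the different conditions are related to one another in general is the subject of the next subsection.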

4.2 Interplay between the EB conditions

We first show the interplay between the abstract EB conditions. The proof of equivalence will rely heavily on a technical result developed in [10].

Theorem 1

Let \(\varphi \in \Gamma ({\mathbb {R}}^n)\) be such that it achieves its minimum \(\min \varphi \) and that \({\mathbf {crit}}\varphi \) is nonempty and closed. Let \(X\subset {\mathbb {R}}^n\), \(\Omega \subset X\), and \(G_\varphi \) be a residual measure operator related to \(\varphi \) and X. Assume that the \((G_\varphi , \omega , \Omega )\)-(cor-obj-EB) condition holds. Then, we have the following implications

$$\begin{aligned} ({\text {obj-EB}})\Rightarrow ({\text {cor-EB}})\Rightarrow ({\text {res-EB}})\Rightarrow ({\text {res-obj-EB}}). \end{aligned}$$

One can take \(\nu =\frac{\alpha \omega }{2}, \kappa =\nu , \eta =\sqrt{\kappa \omega }\) to show the above implications. If we further assume that \(\varphi \in \Gamma _0({\mathbb {R}}^n)\), \(\Omega \) is \(\partial \varphi \)-invariant, and \(G_\varphi \) satisfies

$$\begin{aligned} \forall ~ x\in \Omega \cap {\mathrm {dom}}\varphi ,~~ \Vert G_{\varphi }(x)\Vert \le \inf _{g\in \partial \varphi (x)}\Vert g\Vert , \end{aligned}$$
(10)

then we have the following equivalences

$$\begin{aligned} ({\text {obj-EB}})\Leftrightarrow ({\text {cor-EB}})\Leftrightarrow ({\text {res-EB}})\Leftrightarrow ({\text {res-obj-EB}}). \end{aligned}$$

For \(({\text {res-obj-EB}})\Rightarrow ({\text {obj-EB}})\), one can take \(\alpha =\frac{1}{2}\eta ^2\).

Proof

We prove this theorem by showing the following implications

$$\begin{aligned} ({\text {obj-EB}})\Rightarrow ({\text {cor-EB}})\Rightarrow ({\text {res-EB}})\Rightarrow ({\text {res-obj-EB}}) \Rightarrow ({\text {obj-EB}}). \end{aligned}$$

Firstly, the implication of \(({\text {obj-EB}})\Rightarrow ({\text {cor-EB}})\) follows from

$$\begin{aligned} \inf _{x_p\in {\mathcal {P}}_\varphi (x)}\left\langle G_{\varphi }(x), x-x_p\right\rangle \ge \omega \cdot (\varphi (x)-\min {\varphi })\ge \frac{\alpha \omega }{2}\cdot d^2(x,{\mathbf {crit}}\varphi ), \end{aligned}$$

where the left inequality is (cor-obj-EB) and the right one is (obj-EB).

Secondly, the implication of \(({\text {cor-EB}})\Rightarrow ({\text {res-EB}})\) follows from a direct application of the Cauchy–Schwarz inequality to (cor-EB).

Thirdly, we show \(({\text {res-EB}})\Rightarrow ({\text {res-obj-EB}})\). By (cor-obj-EB) and (res-EB), we derive that for every \( x\in \Omega \cap {\mathrm {dom}}\varphi \),

$$\begin{aligned} \omega \cdot (\varphi (x)-\min \varphi )&\le \inf _{x_p\in {\mathcal {P}}_\varphi (x)}\langle G_{\varphi }(x), x-x_p\rangle \\&\le \inf _{x_p\in {\mathcal {P}}_\varphi (x)}\Vert G_{\varphi }(x)\Vert \Vert x-x_p\Vert =\Vert G_{\varphi }(x)\Vert \cdot d(x,{\mathbf {crit}}\varphi ) \\&\le \kappa ^{-1}\Vert G_{\varphi }(x)\Vert ^2. \end{aligned}$$

Thus, it holds that \(\forall ~ x\in \Omega \cap {\mathrm {dom}}\varphi , ~~\Vert G_{\varphi }(x)\Vert \ge \sqrt{\kappa \omega }\cdot \sqrt{\varphi (x)-\min \varphi },\) which is just (res-obj-EB).

Finally, we show \(({\text {res-obj-EB}}) \Rightarrow ({\text {obj-EB}})\). The following is based on an argument used to prove Theorem 27 in [10]. For the sake of completeness, we reproduce that proof in our particular case. First of all, take \(x\in \Omega \cap {\mathrm {dom}}\varphi \) and recall that the additional assumption \(\varphi \in \Gamma _0({\mathbb {R}}^n)\) gives \({\mathbf {crit}}\varphi =\mathrm {Arg}\min \varphi \). Without loss of generality, we assume that \( \min \varphi =0\) and \(x\notin \mathrm {Arg}\min \varphi \). According to the result about subgradient curves due to Brézis [12] and Bruck [13] and recently used in [10], we can find the unique absolutely continuous curve \(\chi _x:[0, +\infty )\rightarrow {\mathbb {R}}^n\) such that \(\chi _x(0)=x\) and

$$\begin{aligned} {\dot{\chi }}_x(t)\in -\partial \varphi (\chi _x(t)) \end{aligned}$$

for almost every \(t>0\). Moreover, \(\chi _x(t)\) converges to some point \({\hat{x}}\) in \(\mathrm {Arg}\min \varphi \) as \(t\rightarrow +\infty \) and the function \(t \mapsto \varphi (\chi _x(t))\) is nonincreasing and

$$\begin{aligned} \lim _{t\rightarrow +\infty }\varphi (\chi _x(t))=\min \varphi = 0. \end{aligned}$$

By the \(\partial \varphi \)-invariance of \(\Omega \), we have \(\chi _x(t)\in \Omega \), and hence \(\chi _x(t)\in \Omega \cap {\mathrm {dom}}\varphi \) since \(\varphi (\chi _x(t))\) is nonincreasing. Let

$$\begin{aligned} T:=\inf \{t\in [0, +\infty ): \varphi (\chi _x(t))=0\}. \end{aligned}$$

We claim that \(T>0\). Otherwise, \(T=0\) and then, by the lower semicontinuity property of \(\varphi \), we can derive that

$$\begin{aligned} \varphi (x)=\varphi (\chi _x(0))\le \liminf _{t\rightarrow 0^+}\varphi (\chi _x(t)) =0. \end{aligned}$$

This contradicts \(x\notin \mathrm {Arg}\min \varphi \). Now, combining (10) and (res-obj-EB), we derive that

$$\begin{aligned} \frac{\Vert {\dot{\chi }}_x(t)\Vert }{\sqrt{\varphi (\chi _x(t))}}\ge \frac{\inf _{g\in \partial \varphi (\chi _x(t))}\Vert g\Vert }{\sqrt{\varphi (\chi _x(t))}}\ge \frac{\Vert G_{\varphi }(\chi _x(t))\Vert }{\sqrt{\varphi (\chi _x(t))}}\ge \eta , \quad \text {for a.e.}~ t\in [0, T). \end{aligned}$$

Observe that for \(p, q\in [0, T)\) with \(q\ge p\),

$$\begin{aligned}&\sqrt{\varphi (\chi _x(p))}-\sqrt{\varphi (\chi _x(q))}=\int _q^p \frac{d\sqrt{\varphi (\chi _x(t))}}{dt} dt\\ =&\frac{1}{2}\int _p^q \left( \varphi (\chi _x(t)) \right) ^{-\frac{1}{2}}\Vert {\dot{\chi }}_x(t)\Vert ^2 dt=\frac{1}{2} \int _p^q \frac{\Vert {\dot{\chi }}_x(t)\Vert }{\sqrt{\varphi (\chi _x(t))}} \Vert {\dot{\chi }}_x(t)\Vert dt\\ \ge&\frac{1}{2} \int _p^q \eta \Vert {\dot{\chi }}_x(t)\Vert dt=\frac{\eta }{2}\cdot \text {length}(\chi _x(t), p, q)\ge \frac{\eta }{2}\cdot \Vert \chi _x(p)-\chi _x(q)\Vert , \end{aligned}$$

where \(\text {length}(\chi _x(t), p, q)\) stands for the length of subgradient curve from p to q. By letting \(p=0\) and \(q\rightarrow +\infty \) if \(T=+\infty \) and \(q\rightarrow T\) if \(T<+\infty \), we obtain

$$\begin{aligned} \sqrt{\varphi (\chi _x(0))}=\sqrt{\varphi (x)}\ge \frac{\eta }{2}\cdot \Vert x-{\hat{x}}\Vert . \end{aligned}$$

Therefore, for every \( x\in \Omega \cap {\mathrm {dom}}\varphi \) we always have

$$\begin{aligned} \varphi (x)- \min \varphi \ge \frac{\eta ^2}{4}\cdot \Vert x-{\hat{x}}\Vert ^2\ge \frac{\eta ^2}{4}\cdot d^2(x,\mathrm {Arg}\min \varphi )=\frac{\eta ^2}{4}\cdot d^2(x,{\mathbf {crit}}\varphi ), \end{aligned}$$

which implies that (obj-EB) with \(\alpha =\frac{\eta ^2}{2}\) holds. This completes the proof. \(\square \)

As a direct consequence, we have the following corollary.

Corollary 1

Let \(\varphi \in \Gamma _0({\mathbb {R}}^n)\) be such that it achieves its minimum \(\min \varphi \) so that \(\mathrm {Arg}\min \varphi \ne \emptyset \). Let \(X\subset {\mathbb {R}}^n\), \(\Omega \subset X\) be \(\partial \varphi \)-invariant, and \(G^i_\varphi , i=1, 2\), be two different residual measure operators related to the same function \(\varphi \) and the same subset X. We assume that \(G^i_\varphi , i=1, 2\), satisfy

$$\begin{aligned} \forall ~ x\in \Omega \cap {\mathrm {dom}}\varphi ,~~\left\| G^i_{\varphi }(x)\right\| \le \inf _{g\in \partial \varphi (x)}\Vert g\Vert , \end{aligned}$$
(11)

and \((G^i_\varphi , \omega , \Omega )\)-(cor-obj-EB) conditions hold. Then, we have

$$\begin{aligned}&\left( G^1_\varphi , \kappa , \Omega \right) {\text {-}}({\text {res-EB}})\Leftrightarrow \left( G^1_\varphi , \nu , \Omega \right) {\text {-}}({\text {cor-EB}})\Leftrightarrow \left( G^1_\varphi , \eta , \Omega \right) {\text {-}}\left( {\text {res-obj-EB}}\right) \\&\quad \Leftrightarrow \left( \varphi , \alpha , \Omega \right) {\text {-}}({\text {obj-EB}}) \Leftrightarrow \left( G^2_\varphi , \kappa , \Omega \right) {\text {-}}({\text {res-EB}})\Leftrightarrow \left( G^2_\varphi , \nu , \Omega \right) {\text {-}}({\text {cor-EB}})\\&\quad \Leftrightarrow \left( G^2_\varphi , \eta , \Omega \right) {\text {-}}({\text {res-obj-EB}}). \end{aligned}$$

Now, we list some cases where the equivalence between the EB conditions indeed holds.

Corollary 2

The EB conditions (cor-EB), (res-EB), (obj-EB), and (res-obj-EB) are equivalent under each of the following situations:

  1. case 1:

    \(\varphi \in {{\mathcal {F}}}^{1}({\mathbb {R}}^n)\) achieves its minimum \(\min \varphi \), \(X={\mathbb {R}}^n\) and \(\Omega \subset X\) is \(\nabla \varphi \)-invariant, and \(G_{\varphi } =\nabla \varphi \);

  2. case 2:

    \(\varphi \in \Gamma _0({\mathbb {R}}^n)\) achieves its minimum \(\min \varphi \), \(X={\mathrm {dom}}\partial \varphi \) and \(\Omega \subset X\) is \(\partial \varphi \)-invariant, and \(G_{\varphi } =\partial ^0\varphi \);

  3. case 3:

    \(\varphi =f +g \), where \(f \in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\) and \(g\in \Gamma _0({\mathbb {R}}^n)\), achieves its minimum \(\min \varphi \), \(X={\mathbb {R}}^n\) and \(\Omega \subset X\) is \(\partial \varphi \)-invariant, and \(G_{\varphi }={\mathcal {R}}_t\), where \({\mathcal {R}}_t(x):=t^{-1}(x-x^+)\) with \(t\in (0,\frac{1}{L}]\) and \(x^+={\mathbf {prox}}_{tg}(x-t\nabla f(x))\). In addition, we assume that there exists a constant \(0<\epsilon \le \frac{2}{t}\) such that

    $$\begin{aligned} \Vert G_{\varphi }(x)\Vert ^2\ge \epsilon \left( \varphi (x)-\varphi \left( x^+\right) \right) . \end{aligned}$$
    (12)

Proof

First of all, in all the listed cases \({\mathbf {crit}}\varphi \) is nonempty since \({\mathbf {crit}}\varphi =\mathrm {Arg}\min \varphi \ne \emptyset \), and it is closed since \(\varphi \) is a proper and lower semicontinuous function. Secondly, by the optimality conditions, one can easily verify that the operators \(G_\varphi \) in all the listed cases are residual measure operators. We only need to verify the remaining assumptions in Theorem 1.

For both cases 1 and 2, the convexity of \(\varphi \) implies the (cor-obj-EB) condition with \(\omega =1\). In case 1, the assumption (10) holds obviously because of \(\partial \varphi (x)=\{\nabla \varphi (x)\}\). In case 2, the assumption (10) follows from the definition of \(\partial ^0\varphi (x)\).

Now, let us consider case 3. Since \(f\in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\) and \(g\in \Gamma _0({\mathbb {R}}^n)\), we have that \(G_{\varphi }(x)\) satisfies the standard estimate

$$\begin{aligned} ~~\forall x, y\in {\mathbb {R}}^n,~~\varphi \left( x^+\right) \le \varphi (y)+\langle G_{\varphi }(x), x-y\rangle -\frac{t}{2}\Vert G_{\varphi }(x)\Vert ^2; \end{aligned}$$

see e.g. Lemma 2.3 in [8] or Lemma 2 in the very recent work [4]. Since \(\varphi \) also belongs to \(\Gamma _0({\mathbb {R}}^n)\), we can conclude that \(\mathrm {Arg}\min \varphi \) is a nonempty closed convex set. Thus, by the projection theorem, there exists a unique projection point of x onto \(\mathrm {Arg}\min \varphi \), denoted by \(x_p\). Using the inequality above with \(y=x_p\) and the assumption (12), we derive that

$$\begin{aligned} \langle G_{\varphi }(x), x-x_p \rangle&\ge \varphi \left( x^+\right) -\min \varphi +\frac{t}{2}\Vert G_{\varphi }(x)\Vert ^2\\&\ge \varphi \left( x^+\right) -\min \varphi +\frac{t\epsilon }{2}\left( \varphi (x)-\varphi \left( x^+\right) \right) \\&= \frac{t\epsilon }{2}(\varphi (x)-\min \varphi )+(1-\frac{t\epsilon }{2})\left( \varphi \left( x^+\right) -\min \varphi \right) \\&\ge \frac{t\epsilon }{2}(\varphi (x)-\min \varphi ), \end{aligned}$$

from which the \((G_\varphi , \omega , \Omega )\)-(cor-obj-EB) condition with \(\omega =\frac{t\epsilon }{2}\) follows. The assumption (10) in this case was established in Theorem 3.5 in [20] and Lemma 4.1 in [32]. This completes the proof. \(\square \)

Remark 1

  1. (i)

    In cases 1 and 2, from Theorem 1 we can see that if one only needs the implication

    $$\begin{aligned} ({\text {obj-EB}})\Rightarrow ({\text {cor-EB}})\Rightarrow ({\text {res-EB}})\Rightarrow ({\text {res-obj-EB}}), \end{aligned}$$

    then the assumption on \(\Omega \) can be removed.

  2. (ii)

    In case 2, if \(\Omega ={\mathrm {dom}}\partial \varphi \), then \((\varphi , \alpha , {\mathrm {dom}}\partial \varphi )\)-(obj-EB) is actually equivalent to

    $$\begin{aligned} \forall ~ x\in {\mathbb {R}}^n,~~\varphi (x)- \min \varphi \ge \frac{\alpha }{2}\cdot d^2(x, \mathrm {Arg}\min \varphi ), \end{aligned}$$

    since \({\mathrm {dom}}\partial \varphi \) is a dense subset of \({\mathrm {dom}}\varphi \) according to Corollary 16.29 in [7] and \(\varphi (x)=+\infty \) for \(x\notin {\mathrm {dom}}\varphi \).

We note that while this work was under review, the authors of [27] independently obtained the equivalence between the EB conditions (cor-EB), (res-EB), (obj-EB), and (res-obj-EB) for functions in \({{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\). We also note that the authors of [22] recently and independently obtained the equivalence between the EB conditions (res-EB), (obj-EB), and (res-obj-EB) for functions in \(\Gamma _0({\mathbb {R}}^n)\). The former is limited to \({{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\), while the latter mainly focuses on \(\Gamma _0({\mathbb {R}}^n)\) but does not consider (cor-EB).

Observe that the condition (12) is implied by the (res-obj-EB) condition since

$$\begin{aligned} \Vert G_{\varphi }(x)\Vert ^2\ge \eta ^2 (\varphi (x)-\min \varphi ) \ge \eta ^2 \left( \varphi (x)-\varphi \left( x^+\right) \right) . \end{aligned}$$

Note also that \(\varphi =f+g\in \Gamma _0({\mathbb {R}}^n)\) if \(f \in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\) and \(g\in \Gamma _0({\mathbb {R}}^n)\). With a little effort, we obtain the following result.

Corollary 3

Let \(\varphi =f+g\) with \(f \in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\) and \(g\in \Gamma _0({\mathbb {R}}^n)\) achieve its minimum \(\min \varphi \), and let \(\Omega \subset {\mathbb {R}}^n\) be \(\partial \varphi \)-invariant and \(t\in (0, \frac{1}{L}]\). If the \(({\mathcal {R}}_t, \eta , \Omega )\)-(res-obj-EB) condition holds, then each of the following conditions holds and hence they are equivalent:

$$\begin{aligned}&\left( \partial ^0\varphi , \kappa , \Omega \right) {\text {-}}({\text {res-EB}})\Leftrightarrow \left( \partial ^0\varphi , \nu , \Omega \right) {\text {-}}({\text {cor-EB}})\\&\quad \Leftrightarrow \left( \partial ^0\varphi , \eta , \Omega \right) {\text {-}}({\text {res-obj-EB}}) \Leftrightarrow \left( \varphi , \alpha , \Omega \right) {\text {-}}({\text {obj-EB}}) \Leftrightarrow \left( {\mathcal {R}}_t, \kappa , \Omega \right) {\text {-}}({\text {res-EB}})\\&\quad \Leftrightarrow \left( {\mathcal {R}}_t, \nu , \Omega \right) {\text {-}}({\text {cor-EB}}) \Leftrightarrow \left( {\mathcal {R}}_t, \eta , \Omega \right) {\text {-}}({\text {res-obj-EB}}). \end{aligned}$$

Based on the relationship established in Theorem 2 in [63], namely \((\varphi , \alpha , \Omega )\)-(obj-EB) \(\Leftrightarrow \) \(({\mathcal {R}}_t, \kappa , \Omega )\)-(res-EB) \(\Leftrightarrow \) \(({\mathcal {R}}_t, \nu , \Omega )\)-(cor-EB), together with case 2 of Corollary 2, we still have the following result even if we do not take the \(({\mathcal {R}}_t, \eta , \Omega )\)-(res-obj-EB) condition as an assumption.

Corollary 4

Let \(\varphi =f+g\) with \(f \in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\) and \(g\in \Gamma _0({\mathbb {R}}^n)\) achieve its minimum \(\min \varphi \), and let \(\Omega \subset {\mathrm {dom}}\partial \varphi \) be \(\partial \varphi \)-invariant and \(t\in (0, \frac{1}{L}]\). Then, we have

$$\begin{aligned}&\left( \partial ^0\varphi , \kappa , \Omega \right) {\text {-}}({\text {res-EB}}) \Leftrightarrow \left( \partial ^0\varphi , \nu , \Omega \right) {\text {-}}({\text {cor-EB}}) \Leftrightarrow \left( \partial ^0\varphi , \eta , \Omega \right) {\text {-}}({\text {res-obj-EB}}) \\&\quad \Leftrightarrow (\varphi , \alpha , \Omega ){\text {-}}({\text {obj-EB}}) \Leftrightarrow ({\mathcal {R}}_t, \kappa , \Omega ){\text {-}}({\text {res-EB}})\\&\quad \Leftrightarrow ({\mathcal {R}}_t, \nu , \Omega ){\text {-}}({\text {cor-EB}}). \end{aligned}$$

Note that the \(({\mathcal {R}}_t, \eta , \Omega )\)-(res-obj-EB) condition is not involved in the equivalence above. This might explain why one can avoid the condition (12) in existing related results.

In all the corollaries above, the parameters involved in the different EB conditions can be set explicitly as in Theorem 1, but we omit the details here.

5 An abstract gradient-type method: linear convergence and applications

In this section, we define an abstract gradient-type method by viewing the negative of the residual measure operator as a descent direction, and then identify a necessary and sufficient condition for its linear convergence based on the abstract EB conditions defined above. The following main result generalizes Proposition 1.

Theorem 2

Let \(\varphi \in \Gamma ({\mathbb {R}}^n)\) be such that it achieves its minimum \(\min \varphi \) and that \({\mathbf {crit}}\varphi \) is nonempty and closed. Let \(X\subset {\mathbb {R}}^n\), \(\Omega \subset X\), and \(G_\varphi \) be a residual measure operator related to \(\varphi \) and X. Suppose that \(\varphi \) satisfies the \((G_\varphi , \beta , \Omega )\)-(cor-res-EB) condition. Define the abstract gradient-type method by

$$\begin{aligned} x_{k+1}=x_k-h\cdot G_{\varphi }(x_k), ~k\ge 0, \end{aligned}$$

with step size \(h>0\) and arbitrary initial point \(x_0\in \Omega \). Assume that \(x_k\in \Omega , k\ge 0\). Let \(\tau , \theta \in (0, 1)\).

  1. (i)

    If \(\varphi \) satisfies the \((G_\varphi , \nu , \Omega )\)-(cor-EB) condition with \(\nu <\frac{1}{\beta }\) and the following inequalities hold

    $$\begin{aligned} \frac{1-\tau }{2\theta \nu }\le h\le 2(1-\theta )\beta ,~~\tau \ge 1-4\theta (1-\theta )\beta \nu , \end{aligned}$$
    (13)

    then the abstract gradient-type method converges linearly in the sense that

    $$\begin{aligned} d^2(x_{k+1},{\mathbf {crit}}\varphi ) \le \tau \cdot d^2(x_k,{\mathbf {crit}}\varphi ),~~k\ge 0. \end{aligned}$$
    (14)

    The optimal rate \(\tau _0:=1-\beta \nu \) is obtained at \(h=\beta \) and \(\theta =\frac{1}{2}\).

  2. (ii)

    Conversely, if the abstract gradient-type method converges linearly in the sense of (14), then \(\varphi \) satisfies the \((G_\varphi , \nu , \Omega )\)-(cor-EB) condition with \(\nu =\frac{\beta (1-\sqrt{\tau })^2}{h^2}\).

Proof

First, we repeat the argument before (6) to obtain that for \(v_k\in {\mathcal {P}}_\varphi (x_k)\),

$$\begin{aligned} d^2(x_{k+1},{\mathbf {crit}}\varphi ) \le d^2(x_k,{\mathbf {crit}}\varphi ) -2h\langle G_{\varphi }(x_k),x_k-v_k\rangle +h^2\Vert G_{\varphi }(x_k)\Vert ^2,\quad k\ge 0. \end{aligned}$$

Take \(\theta \in (0, 1)\) and then use a convex combination of the (cor-res-EB) and (cor-EB) conditions at \(x=x_k\) to obtain

$$\begin{aligned} \inf _{v_k\in {\mathcal {P}}_\varphi (x_k)}\langle G_{\varphi }(x_k), x_k-v_k\rangle \ge \theta \nu \cdot d^2(x_k,{\mathbf {crit}}\varphi )+(1-\theta )\beta \cdot \Vert G_{\varphi }(x_k)\Vert ^2, \quad k\ge 0. \end{aligned}$$

Therefore, we can derive that

$$\begin{aligned} d^2(x_{k+1},{\mathbf {crit}}\varphi )&\le (1-2\theta \nu h)d^2(x_k,{\mathbf {crit}}\varphi )+\left( h^2-2h(1-\theta )\beta \right) \Vert G_{\varphi }(x_k)\Vert ^2\\&\le \tau \cdot d^2(x_k,{\mathbf {crit}}\varphi ), \quad k\ge 0, \end{aligned}$$

where the second inequality follows from the condition (13) on the step size. Obviously, the optimal linear convergence rate \(\tau _0=1-\beta \nu \) can be obtained at \(h=\beta , \theta =\frac{1}{2}\).

Conversely, pick \(u_{k+1}\in {\mathcal {P}}_\varphi (x_{k+1})\) to derive that

$$\begin{aligned} d(x_k, {\mathbf {crit}}\varphi )&\le \Vert x_k-u_{k+1}\Vert \le \Vert x_{k+1}-u_{k+1}\Vert +\Vert x_{k+1}-x_k\Vert \nonumber \\&= d(x_{k+1}, {\mathbf {crit}}\varphi )+ h\Vert G_{\varphi }(x_k)\Vert , \quad k\ge 0. \end{aligned}$$
(15)

Combine (15) and the fact of linear convergence

$$\begin{aligned} d(x_{k+1},{\mathbf {crit}}\varphi ) \le \sqrt{\tau } \cdot d(x_{k},{\mathbf {crit}}\varphi ),\quad k\ge 0 \end{aligned}$$

to obtain

$$\begin{aligned} \left( 1-\sqrt{\tau }\right) ^2d^2(x_k,{\mathbf {crit}}\varphi )\le h^2\Vert G_{\varphi }(x_k)\Vert ^2, \quad k\ge 0. \end{aligned}$$

Thus, together with the (cor-res-EB) condition, we can derive that

$$\begin{aligned} \inf _{v_k\in {\mathcal {P}}_\varphi (x_k)}\langle G_{\varphi }(x_k), x_k-v_k\rangle \ge \beta \Vert G_{\varphi }(x_k)\Vert ^2\ge \frac{\beta \left( 1-\sqrt{\tau }\right) ^2}{h^2}d^2(x_k,{\mathbf {crit}}\varphi ), \quad k\ge 0. \end{aligned}$$

Observe that the starting point \(x_0\in \Omega \) can be arbitrary. Therefore, the (cor-EB) condition with \(\nu =\frac{\beta (1-\sqrt{\tau })^2}{h^2}\) holds. This completes the proof. \(\square \)
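
Before turning to the applications, here is a small numerical instance of Theorem 2 (our own sketch with hypothetical data): \(\varphi =f+\delta _C\) with \(f(x)=\frac{1}{2}x^TDx-c^Tx\), D diagonal and positive definite, and \(C=[0,1]^n\), so that \({\mathbf {crit}}\varphi =\mathrm {Arg}\min \varphi \) is available in closed form. The residual measure operator is the proximal-gradient residual \({\mathcal {R}}_t\) with \(t=\frac{1}{L}\), and with \(h=t\) the abstract gradient-type method is exactly the projected-gradient (forward–backward) step.

```python
import numpy as np

# phi = f + delta_C, f(x) = 0.5*x^T D x - c^T x with D = diag(diagD), C = [0,1]^n.
# Argmin phi = clip(c/diagD, 0, 1) componentwise (the problem is separable).
# Residual measure operator: the proximal-gradient residual
#   R_t(x) = (x - P_C(x - t*grad f(x))) / t,  t = 1/L,
# and the abstract gradient-type method reads x_{k+1} = x_k - h*R_t(x_k).
rng = np.random.default_rng(2)
n = 20
diagD = rng.uniform(0.5, 4.0, n)          # spectrum of D, so L = diagD.max()
c = rng.standard_normal(n)
t = 1.0 / diagD.max()
x_star = np.clip(c / diagD, 0.0, 1.0)     # crit phi = Argmin phi in closed form

grad_f = lambda x: diagD * x - c
proj_C = lambda x: np.clip(x, 0.0, 1.0)
R_t = lambda x: (x - proj_C(x - t * grad_f(x))) / t
dist = lambda x: np.linalg.norm(x - x_star)

x, h = rng.standard_normal(n), t          # h = t: exactly the projected-gradient step
for k in range(6):
    d0 = dist(x)
    x = x - h * R_t(x)
    print(f"k={k}: d^2(x_k+1, crit phi)/d^2(x_k, crit phi) = {dist(x)**2 / d0**2:.4f}")
```

The printed ratios stay below 1, in line with (14).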

With Theorem 2 in hand, we can now state the necessary and sufficient EB conditions guaranteeing linear convergence for the gradient method, the proximal point algorithm, and the forward–backward splitting algorithm. These conditions, previously known to be sufficient for linear convergence (see e.g. Section 4 in [10]), are actually also necessary. We start with the gradient method, applied to possibly nonconvex optimization.

Corollary 5

Let \(f:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) be a gradient-Lipschitz-continuous function with modulus \(L>0\). Assume that f achieves its minimum \(\min f\) and \({\mathbf {crit}}f=\mathrm {Arg}\min f \ne \emptyset \). Let \(\epsilon >0\) be a fixed constant and set \(\Omega =\{x: f(x)\le \min f+\epsilon \}\). Let \(\{x_k\}_{k\ge 0}\) be generated by the gradient descent method (5) with \(h=\frac{1}{L}\) and \(x_0\in \Omega \).

  1. (i)

    If f satisfies the \((\nabla f, \nu , \Omega )\)-(cor-EB) condition with \(\nu <L\), then the gradient descent method (5) with \(h=\frac{1}{L}\) converges linearly in the sense that

    $$\begin{aligned} f(x_{k+1})-\min f\le \left( 1-\left( \frac{\nu }{L}\right) ^2\right) (f(x_k)-\min f),\quad k\ge 0. \end{aligned}$$
    (16)
  2. (ii)

    If we further assume that f is convex, then the gradient descent method (5) with \(h=\frac{1}{L}\) attains the following linear convergence:

    $$\begin{aligned} d^2(x_{k+1},\mathrm {Arg}\min f) \le \left( 1-\frac{\nu }{L}\right) \cdot d^2(x_k,\mathrm {Arg}\min f),\quad k\ge 0. \end{aligned}$$
    (17)
  3. (iii)

    Conversely, if f is convex and if starting from an arbitrary initial point \(x_0\in \Omega \), the gradient descent method (5) with \(h=\frac{1}{L}\) converges linearly like (17) but replacing \(1-\frac{\nu }{L}\) with \(\tau \), then f satisfies the \((\nabla f, \nu , \Omega )\)-(cor-EB) condition with \(\nu =L(1-\sqrt{\tau })^2\).

Proof

We first show (16) by modifying the argument due to Polyak [45] and recently highlighted in [27, 28]. The gradient-Lipschitz-continuity of f implies

$$\begin{aligned} f(y)-f(x)-\langle \nabla f(x), y-x\rangle \le \frac{L}{2}\Vert y-x\Vert ^2, \quad \forall ~x, y\in {\mathbb {R}}^n. \end{aligned}$$
(18)

Using this inequality with \(y=x_{k+1}\) and \(x=x_k\) and together with the update rule of gradient descent, we get

$$\begin{aligned} f(x_{k+1})-f(x_k)\le -\frac{1}{2L}\Vert \nabla f(x_k)\Vert ^2, \quad k\ge 0, \end{aligned}$$
(19)

which implies \({x_k}\in \Omega , k\ge 0\). Using again inequality (18) with \(y=x_k\) and \(x=u_k\in {\mathcal {P}}_f(x_k)\), and noting that \(u_k\in {\mathbf {crit}}f= \mathrm {Arg}\min f\) and hence \(f(u_k)=\min f\) and \(\nabla f(u_k)=0\), we have

$$\begin{aligned} f(x_k)-\min f\le \frac{L}{2}d^2(x_k,{\mathbf {crit}}f), \quad k\ge 0. \end{aligned}$$
(20)

Applying the Cauchy-Schwarz inequality to the \((\nabla f, \nu , \Omega )\)-(cor-EB) condition, we obtain

$$\begin{aligned} \forall x\in \Omega \cap {\mathrm {dom}}f, ~~\Vert \nabla f(x)\Vert \ge \nu \cdot d(x,{\mathbf {crit}}f). \end{aligned}$$

Thus, combining inequalities (19) and (20), we have that

$$\begin{aligned} f(x_{k+1})-f(x_k)\le -\frac{1}{2L}\Vert \nabla f(x_k)\Vert ^2\le -\frac{\nu ^2}{L^2}(f(x_k)-\min f), \quad k\ge 0, \end{aligned}$$

from which (16) follows.

Now, with the additional convexity assumption of f, we have \(f\in {{\mathcal {F}}}^{1,1}_{L}({\mathbb {R}}^n)\), which is equivalent to the following condition

$$\begin{aligned} \langle \nabla f(x)-\nabla f(y), x-y\rangle \ge \frac{1}{L}\Vert \nabla f(x)-\nabla f(y)\Vert ^2,~~~x, y\in {\mathbb {R}}^n; \end{aligned}$$

see Theorem 2.1.5 in [43]. Using this inequality with \(y\in {\mathcal {P}}_f(x)\), we obtain

$$\begin{aligned} \inf _{y\in {\mathcal {P}}_f(x)} \langle \nabla f(x), x-y\rangle \ge \frac{1}{L}\Vert \nabla f(x)\Vert ^2,\quad x\in {\mathbb {R}}^n, \end{aligned}$$

which is just the \((\nabla f, \beta , \Omega )\)-(cor-res-EB) condition with \(\beta =\frac{1}{L}\). Therefore, the remaining results follow from Theorem 2. This completes the proof. \(\square \)

Remark 2

In Example 2 in [64], we constructed a one-dimensional nonconvex function that satisfies all the conditions in Corollary 5 ensuring (16). In this sense, (16) is one of the few general results on global linear convergence for nonconvex problems. We note that a similar phenomenon was observed by the authors of [27] under the Polyak–Łojasiewicz condition.

While \({\mathbf {crit}}f=\mathrm {Arg}\min f\) is a strong assumption, it is not the same as convexity; rather, it implies the weaker condition of invexity, which says that a function f is invex if and only if every critical point of f is a global minimizer. This assumption is satisfied by some nonconvex optimization problems that have recently appeared in machine/deep learning; see e.g. [61, 68].
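As a concrete numerical illustration of part (i) of Corollary 5 (this is not Example 2 of [64]; the test function, the constant \(L=8\) obtained from \(f''(x)=2+6\cos 2x\), and the grid-based estimate of \(\nu \) are illustrative choices), the following minimal sketch applies the gradient descent method (5) with \(h=\frac{1}{L}\) to the nonconvex function \(f(x)=x^2+3\sin ^2x\), whose only critical point is the global minimizer \(x=0\), and compares the observed per-step ratios with the rate \(1-(\nu /L)^2\) from (16).

```python
import numpy as np

# Illustrative nonconvex test function (not Example 2 of [64]):
# f(x) = x^2 + 3 sin^2(x) has crit f = Argmin f = {0} and min f = 0.
f = lambda x: x**2 + 3.0 * np.sin(x)**2
df = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)

L = 8.0        # |f''(x)| = |2 + 6 cos(2x)| <= 8, so grad f is 8-Lipschitz
h = 1.0 / L    # step size of the gradient method (5)

# grid estimate of the (cor-EB) constant: <f'(x), x - 0> >= nu * x^2
xs = np.linspace(-10, 10, 200001)
xs = xs[np.abs(xs) > 1e-8]
nu = np.min(df(xs) * xs / xs**2)          # roughly 0.70, and nu < L

x, ratios = 3.0, []
for _ in range(100):
    if f(x) < 1e-12:
        break
    x_new = x - h * df(x)
    ratios.append(f(x_new) / f(x))        # min f = 0 here
    x = x_new

print("estimated nu        :", nu)
print("rate 1 - (nu/L)^2   :", 1 - (nu / L)**2)
print("worst observed ratio:", max(ratios))   # stays below the rate predicted by (16)
```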

Before discussing the linear convergence of the proximal point algorithm (PPA), we introduce the following result.

Lemma 3

([7, 49]) Let \(f\in \Gamma _0({\mathbb {R}}^n)\) and \(\lambda >0\). Let the Moreau–Yosida regularization of f be defined by

$$\begin{aligned} f_\lambda (x):=\min _{u\in {\mathbb {R}}^n}\left\{ f(u)+\frac{1}{2\lambda }\Vert x-u\Vert ^2\right\} . \end{aligned}$$

Then,

  • \(f_\lambda \) is real-valued, convex, and continuously differentiable and can be formulated as

    $$\begin{aligned} f_\lambda (x)=f\left( {\mathbf {prox}}_{\lambda f}(x)\right) +\frac{1}{2\lambda }\left\| x-{\mathbf {prox}}_{\lambda f}(x)\right\| ^2; \end{aligned}$$
  • Its gradient

    $$\begin{aligned} \nabla f_\lambda (x)=\lambda ^{-1}\left( x-{\mathbf {prox}}_{\lambda f}(x)\right) \end{aligned}$$

    is \(\lambda ^{-1}\)-Lipschitz continuous.

  • \(\mathrm {Arg}\min f_\lambda =\mathrm {Arg}\min f\) and \(\min f=\min f_\lambda \).
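To make Lemma 3 concrete, the following minimal sketch (the one-dimensional choice \(f(x)=|x|\) and the value of \(\lambda \) are illustrative assumptions) evaluates the Moreau–Yosida regularization through \({\mathbf {prox}}_{\lambda f}\), which for \(f=|\cdot |\) is the soft-thresholding operator, and checks the gradient formula \(\nabla f_\lambda (x)=\lambda ^{-1}(x-{\mathbf {prox}}_{\lambda f}(x))\) against finite differences.

```python
import numpy as np

lam = 0.5
f = lambda u: np.abs(u)
prox = lambda x: np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)  # prox_{lam|.|}: soft-thresholding

def moreau(x):
    # f_lam(x) = f(prox_{lam f}(x)) + ||x - prox_{lam f}(x)||^2 / (2 lam), as in Lemma 3
    p = prox(x)
    return f(p) + (x - p)**2 / (2.0 * lam)

grad_moreau = lambda x: (x - prox(x)) / lam   # gradient formula from Lemma 3

xs = np.linspace(-3.0, 3.0, 7)
eps = 1e-6
fd = (moreau(xs + eps) - moreau(xs - eps)) / (2 * eps)   # central finite differences
print(np.max(np.abs(fd - grad_moreau(xs))))              # agrees up to finite-difference error
# Also, min f_lam = min f = 0 and Argmin f_lam = Argmin f = {0}, as stated in Lemma 3.
```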

Now, we are ready to present the result of linear convergence for PPA.

Corollary 6

Let \(f\in \Gamma _0({\mathbb {R}}^n)\) achieve its minimum \(\min f\) and \(\lambda >0\). Let \(\epsilon >0\) be a fixed constant and set \(\Omega =\{x: f(x)\le \min f+\epsilon \}\cap {\mathrm {dom}}\partial f\). Starting from \(x_0\in \Omega \), the PPA can be defined by

$$\begin{aligned} x_{k+1}={\mathbf {prox}}_{\lambda f}(x_k)=x_k-\lambda \cdot \nabla f_\lambda (x_k),\quad k\ge 0. \end{aligned}$$
  1. (i)

    If f satisfies the \((f,\alpha , \Omega )\)-(obj-EB) condition, then \(f_\lambda \) satisfies the \((\nabla f_\lambda , \nu , \Omega )\)-(cor-EB) condition with \(\nu =\min \{\frac{\alpha }{4}, \frac{1}{4\lambda }\}\), and hence the PPA converges linearly in the sense that

    $$\begin{aligned} d^2(x_{k+1},\mathrm {Arg}\min f) \le \left( 1-\min \left\{ \frac{\alpha \lambda }{4}, \frac{1}{4}\right\} \right) \cdot d^2(x_k,\mathrm {Arg}\min f),\quad k\ge 0.\nonumber \\ \end{aligned}$$
    (21)
  2. (ii)

    Conversely, if starting from an arbitrary initial point \(x_0\in \Omega \) the PPA converges linearly like (21) but replacing the rate \(1-\min \{\frac{\alpha \lambda }{4}, \frac{1}{4}\}\) with a constant \(\tau \in (0, 1)\), then f satisfies the \((f, \alpha , \Omega )\)-(obj-EB) condition with \(\alpha = \frac{(1-\sqrt{\tau })^2}{2\lambda }\).

Proof

First of all, we remark that

$$\begin{aligned} {\mathbf {crit}}f=\mathrm {Arg}\min f=\mathrm {Arg}\min f_\lambda ={\mathbf {crit}}f_\lambda . \end{aligned}$$
(22)

From Lemma 3, we have \(f_\lambda \in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\) with \(L=\lambda ^{-1}\) and hence the \((\nabla f_\lambda , \beta ,\Omega )\)-(cor-res-EB) condition with \(\beta =\lambda \) holds. Now, we first prove that the \((f,\alpha , \Omega )\)-(obj-EB) condition implies the \((f_\lambda ,c, \Omega )\)-(obj-EB) condition with \(c=\min \{\frac{\alpha }{2}, \frac{1}{2\lambda }\}\). Indeed, letting \(v={\mathbf {prox}}_{\lambda f}(x)\) and \(v^\prime \in {\mathcal {P}}_f(v)\), for any \(x\in \Omega \cap {\mathrm {dom}}f\) we can derive that

$$\begin{aligned} f_\lambda (x)-\min f_\lambda&= f\left( {\mathbf {prox}}_{\lambda f}(x)\right) +\frac{1}{2\lambda }\left\| x-{\mathbf {prox}}_{\lambda f}(x)\right\| ^2-\min f \\&\ge \frac{\alpha }{2}d^2\left( {\mathbf {prox}}_{\lambda f}(x), {\mathbf {crit}}f\right) +\frac{1}{2\lambda }\left\| x-{\mathbf {prox}}_{\lambda f}(x)\right\| ^2 \\&= \frac{\alpha }{2}\left\| v-v^\prime \right\| ^2 +\frac{1}{2\lambda }\Vert x-v\Vert ^2\ge c\cdot \left( \left\| v-v^\prime \right\| ^2+\Vert x-v\Vert ^2\right) \\&\ge \frac{c}{2}\left( \left\| v-v^\prime \right\| +\Vert x-v\Vert \right) ^2\ge \frac{c}{2} \left\| x-v^\prime \right\| ^2\ge \frac{c}{2} d^2(x,{\mathbf {crit}}f_\lambda ), \end{aligned}$$

where the first inequality applies the \((f,\alpha , \Omega )\)-(obj-EB) condition at v, which is legitimate since \(f(v)+\frac{1}{2\lambda }\Vert v-x\Vert ^2\le f(x)\) implies \(v\in \Omega \cap {\mathrm {dom}}f\), and the last inequality follows from \(v^\prime \in {\mathcal {P}}_f(v)\subset {\mathbf {crit}}f ={\mathbf {crit}}f_\lambda \). From case 1 of Corollary 2 and (i) in Remark 1, the \((f_\lambda ,c, \Omega )\)-(obj-EB) condition implies the \((\nabla f_\lambda , \nu , \Omega )\)-(cor-EB) condition with \(\nu =\min \{\frac{\alpha }{4}, \frac{1}{4\lambda }\}\). Therefore, (21) follows from Theorem 2 and the fact in (22).

Now, we turn to the necessity part. Invoking Theorem 2 again, we conclude that \(f_\lambda \) satisfies the \((\nabla f_\lambda , \nu , \Omega )\)-(cor-EB) condition with \(\nu =\frac{(1-\sqrt{\tau })^2}{\lambda }\), that is

$$\begin{aligned} \forall ~ x\in \Omega \cap {\mathrm {dom}}f_\lambda ,~~\inf _{x_p\in {\mathcal {P}}_{f_\lambda }(x)}\langle \nabla f_\lambda (x), x-x_p\rangle \ge \nu \cdot d^2(x,{\mathbf {crit}}f_\lambda ). \end{aligned}$$
(23)

Together with the fact of \({\mathbf {crit}}f={\mathbf {crit}}f_\lambda \), we can get

$$\begin{aligned} \forall ~ x\in \Omega \cap {\mathrm {dom}}f_\lambda ,~~\Vert \nabla f_\lambda (x)\Vert \ge \nu \cdot d (x,{\mathbf {crit}}f). \end{aligned}$$
(24)

On the other hand, using the definition of \(v={\mathbf {prox}}_{\lambda f}(x)\), which implies \(\frac{1}{\lambda }(x-v)\in \partial f(v)\), and the convexity of f, we obtain that

$$\begin{aligned} \forall ~ x\in {\mathrm {dom}}f,\forall ~g\in \partial f(x),~~ \left\langle \frac{1}{\lambda }(x-v)-g, v-x\right\rangle \ge 0, \end{aligned}$$
(25)

which further implies that

$$\begin{aligned} \forall ~ x\in {\mathrm {dom}}f,~~ \inf _{g\in \partial f(x)}\Vert g\Vert \ge \frac{1}{\lambda }\Vert x-v\Vert =\Vert \nabla f_\lambda (x)\Vert . \end{aligned}$$
(26)

Thus, combining (24) and (26) and noting that \({\mathrm {dom}}f\subset {\mathrm {dom}}f_\lambda \) and \(\Vert \partial ^0 f(x)\Vert =+\infty \) for \(x\notin {\mathrm {dom}}\partial f\), we obtain

$$\begin{aligned} \forall ~ x\in \Omega \cap {\mathrm {dom}}f,~~\left\| \partial ^0 f(x)\right\| =\inf _{g\in \partial f(x)}\Vert g\Vert \ge \nu \cdot d (x, {\mathbf {crit}}f). \end{aligned}$$
(27)

This is just the \((\partial ^0 f, \kappa , \Omega )\)-(res-EB) condition with \(\kappa =\nu \). Note that \(\Omega \) is \(\partial f\)-invariant. Therefore, the \((f, \alpha , \Omega )\)-(obj-EB) condition with \(\alpha = \frac{(1-\sqrt{\tau })^2}{2\lambda }\) holds by case 2 of Corollary 2. \(\square \)

Remark 3

Linear convergence of PPA was previously established under different EB conditions, such as the Łojasiewicz inequality [corresponding to (res-obj-EB)] in [2, 3, 10], the quadratic growth condition [corresponding to (obj-EB)] in Proposition 6.5.2 in [9], and the EB condition of Hoffman’s type [corresponding to (res-EB)] in Theorem 2.1 in [40]. Our novelty here mainly lies in the necessity part, i.e., conclusion (ii).
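For illustration, the following minimal sketch (the random data and the choice \(\lambda =1\) are illustrative assumptions) runs the PPA on the convex but non-strongly-convex function \(f(x)=\frac{1}{2}\Vert Ax-b\Vert ^2\) with a rank-deficient A, for which \({\mathbf {prox}}_{\lambda f}(x)=(I+\lambda A^TA)^{-1}(x+\lambda A^Tb)\), and monitors the empirical contraction of \(d^2(x_k,\mathrm {Arg}\min f)\) appearing in (21).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 8, 12, 5                          # rank(A) = r < n, so f is not strongly convex
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
b = A @ rng.standard_normal(n)              # b in R(A), hence min f = 0
lam = 1.0

# prox of lam*f for f(x) = 0.5*||Ax - b||^2 solves (I + lam A^T A) u = x + lam A^T b
M = np.eye(n) + lam * (A.T @ A)
prox = lambda x: np.linalg.solve(M, x + lam * A.T @ b)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]     # one particular minimizer
P_row = np.linalg.pinv(A) @ A                     # projector onto N(A)^perp
dist2 = lambda x: np.linalg.norm(P_row @ (x - x_star))**2   # d^2(x, Argmin f)

x = rng.standard_normal(n)
d_prev = dist2(x)
for k in range(20):
    x = prox(x)                                   # PPA step: x_{k+1} = prox_{lam f}(x_k)
    d = dist2(x)
    print(k, d / d_prev)                          # empirical contraction factor, cf. (21)
    d_prev = d
```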

Finally, we discuss linear convergence of the forward–backward splitting (FBS) algorithm. Recall that \({\mathcal {R}}_{1/L}(x)=L\left( x-{\mathbf {prox}}_{\frac{1}{L}g}(x-\frac{1}{L}\nabla f(x))\right) \).

Corollary 7

Let \(\varphi =f +g \), where \(f \in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\) and \(g\in \Gamma _0({\mathbb {R}}^n)\), achieve its minimum \(\min \varphi \). Let \(\epsilon >0\) be a fixed constant and set \(\Omega =\{x: \varphi (x)\le \min \varphi +\epsilon \}\). Starting from \(x_0\in \Omega \), the FBS can be defined by

$$\begin{aligned} x_{k+1}={\mathbf {prox}}_{\frac{1}{L}g}\left( x_k-\frac{1}{L}\nabla f(x_k)\right) =x_k-\frac{1}{L}\cdot {\mathcal {R}}_{1/L}(x_k), \quad k\ge 0. \end{aligned}$$

Denote \(S_k:=\sum _{i=0}^\infty \Vert {\mathcal {R}}_{1/L}(x_{k+i})\Vert ^2,\quad k\ge 0\).

  1. (i)

    If \(\varphi \) satisfies the \(({\mathcal {R}}_{1/L}, \nu , \Omega )\)-(cor-EB) condition with \(\nu <2L\), then FBS converges linearly in the sense that

    $$\begin{aligned}&\varphi (x_{k+1})-\min \varphi \le \left( 1-\frac{\nu }{2L}\right) \left( \varphi (x_k)-\min \varphi \right) ,\quad k\ge 0, \end{aligned}$$
    (28)
    $$\begin{aligned}&d^2(x_{k+1},\mathrm {Arg}\min \varphi ) \le \left( 1-\frac{\nu }{2L}\right) \cdot d^2(x_k,\mathrm {Arg}\min \varphi ),\quad k\ge 0, \end{aligned}$$
    (29)

    and

    $$\begin{aligned} S_{k+1}\le \left( 1-\frac{\nu }{2L}\right) S_k ,\quad k\ge 0. \end{aligned}$$
    (30)
  2. (ii)

    Conversely, if starting from an arbitrary initial point \(x_0\in \Omega \), FBS converges linearly like (29) but replacing \(1-\frac{\nu }{2L}\) with \(\tau \), then \(\varphi \) satisfies the \(({\mathcal {R}}_{1/L}, \nu , \Omega )\)-(cor-EB) condition with \(\nu =\frac{L}{2}(1-\sqrt{\tau })^2\).

Proof

We rely on the following standard result (see again Lemma 2.3 in [8]):

$$\begin{aligned}&\forall x, y\in {\mathbb {R}}^n,~~\left\langle {\mathcal {R}}_{1/L}(y), y-x\right\rangle \ge \varphi \left( {\mathbf {prox}}_{\frac{1}{L}g}\left( y-\frac{1}{L}\nabla f(y)\right) \right) -\varphi (x)\nonumber \\&\quad +\frac{1}{2L}\Vert {\mathcal {R}}_{1/L}(y)\Vert ^2. \end{aligned}$$
(31)

Using this result successively, first at \(x=y=x_k\) and then at \(y=x_k\), \(x=u_k\in {\mathcal {P}}_\varphi (x_k)\), together with the fact that \(x_{k+1}={\mathbf {prox}}_{\frac{1}{L}g}(x_k-\frac{1}{L}\nabla f(x_k))\), we obtain the following sufficient decrease property

$$\begin{aligned} \varphi (x_{k+1})-\varphi (x_k)\le -\frac{1}{2L}\Vert {\mathcal {R}}_{1/L}(x_k)\Vert ^2, \quad k\ge 0, \end{aligned}$$
(32)

and

$$\begin{aligned} \varphi (x_{k+1})-\min \varphi +\frac{1}{2L}\Vert {\mathcal {R}}_{1/L}(x_k)\Vert ^2 \le \langle {\mathcal {R}}_{1/L}(x_k), x_k-u_k\rangle ,\quad k\ge 0. \end{aligned}$$

Note that (32) implies \(x_k\in \Omega , k\ge 0\). Applying the Cauchy–Schwarz inequality to the \(({\mathcal {R}}_{1/L}, \nu , \Omega )\)-(cor-EB) condition, we obtain

$$\begin{aligned} \forall ~ x\in \Omega \cap {\mathrm {dom}}\varphi ,~~\Vert {\mathcal {R}}_{1/L}(x) \Vert \ge \nu \cdot d(x,{\mathbf {crit}}\varphi ), \end{aligned}$$

from which the following inequality follows

$$\begin{aligned} \langle {\mathcal {R}}_{1/L}(x_k), x_k-u_k\rangle \le \frac{1}{\nu }\Vert {\mathcal {R}}_{1/L}(x_k)\Vert ^2,\quad k\ge 0. \end{aligned}$$

Thus, we obtain

$$\begin{aligned} \varphi (x_{k+1})-\min \varphi \le \left( \frac{1}{\nu }-\frac{1}{2L}\right) \Vert {\mathcal {R}}_{1/L}(x_k)\Vert ^2,\quad k\ge 0. \end{aligned}$$
(33)

Combining (32) and (33), we get

$$\begin{aligned} \varphi (x_{k+1})-\varphi (x_k)\le -\frac{1}{2L}\left( \frac{1}{\nu }-\frac{1}{2L}\right) ^{-1}(\varphi (x_{k+1})-\min \varphi ), \quad k\ge 0, \end{aligned}$$

from which the announced result (28) follows. The convergence result (30) can also be derived from (32) and (33). In fact, we first observe that for any integer \(N>0\), it holds

$$\begin{aligned} \varphi (x_{k+1})-\min \varphi \ge \sum _{i=1}^N(\varphi (x_{k+i})-\varphi (x_{k+i+1})), \quad k\ge 0 \end{aligned}$$

and hence the sufficient decrease property (32) yields

$$\begin{aligned} \varphi (x_{k+1})-\min \varphi \ge \sum _{i=1}^\infty (\varphi (x_{k+i})-\varphi (x_{k+i+1}))\ge \frac{1}{2L}\sum _{i=1}^\infty \Vert {\mathcal {R}}_{1/L}(x_{k+i})\Vert ^2,\quad k\ge 0. \end{aligned}$$

Together with (33), we derive that

$$\begin{aligned} \varphi (x_k)-\varphi (x_{k+1})&=\varphi (x_k)-\min \varphi -(\varphi (x_{k+1})-\min \varphi ) \\&\le \left( \frac{1}{\nu }-\frac{1}{2L}\right) \Vert {\mathcal {R}}_{1/L}(x_{k-1})\Vert ^2-\frac{1}{2L} \sum _{i=1}^\infty \Vert {\mathcal {R}}_{1/L}(x_{k+i})\Vert ^2, \quad k\ge 1. \end{aligned}$$

Using (32) again, we obtain

$$\begin{aligned} \left( \frac{1}{\nu }-\frac{1}{2L}\right) \Vert {\mathcal {R}}_{1/L}(x_{k-1})\Vert ^2\ge \frac{1}{2L}\sum _{i=0}^\infty \Vert {\mathcal {R}}_{1/L}(x_{k+i})\Vert ^2,\quad k\ge 1, \end{aligned}$$

i.e.,

$$\begin{aligned} \left( \frac{1}{\nu }-\frac{1}{2L}\right) (S_{k-1}-S_k)\ge \frac{1}{2L}S_k, \quad k\ge 1, \end{aligned}$$

from which the announced result (30) follows.

Now, using the standard result (31) with \(x=y_p\in {\mathcal {P}}_\varphi (y)\), we have

$$\begin{aligned} \langle {\mathcal {R}}_{1/L}(y), y-y_p\rangle \ge \varphi \left( {\mathbf {prox}}_{\frac{1}{L}g}\left( y-\frac{1}{L}\nabla f(y)\right) \right) -\varphi (y_p)+\frac{1}{2L}\Vert {\mathcal {R}}_{1/L}(y)\Vert ^2, \end{aligned}$$

and noting that

$$\begin{aligned} \varphi \left( {\mathbf {prox}}_{\frac{1}{L}g}\left( y-\frac{1}{L}\nabla f(y)\right) \right) -\varphi (y_p)=\varphi \left( {\mathbf {prox}}_{\frac{1}{L}g}\left( y-\frac{1}{L}\nabla f(y)\right) \right) -\min \varphi \ge 0, \end{aligned}$$

we obtain

$$\begin{aligned} \forall y\in {\mathbb {R}}^n,~~\langle {\mathcal {R}}_{1/L}(y), y-y_p\rangle \ge \frac{1}{2L}\Vert {\mathcal {R}}_{1/L}(y)\Vert ^2. \end{aligned}$$

Thus, \(\varphi \) satisfies the \(({\mathcal {R}}_{1/L},\beta , \Omega )\)-(cor-res-EB) condition with \(\beta =\frac{1}{2L}\). Therefore, the remaining results follow from Theorem 2 and the fact that \({\mathbf {crit}}\varphi =\mathrm {Arg}\min \varphi \). \(\square \)

Remark 4

The results (28) and (29) were essentially shown in [20, 63], respectively, by different methods. We note that while this work was under review, the authors of [16] improved these results under error bound conditions and weakened assumptions on the gradient Lipschitz continuity. Our novelty here lies in conclusion (ii), which was also recently and independently observed by the authors of [22]. In addition, the result (30) appears to be new and of independent interest.
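To illustrate Corollary 7, the following minimal sketch (the LASSO-type data are illustrative assumptions, and \(\min \varphi \) is approximated by running many extra iterations) applies FBS to \(\varphi (x)=\frac{1}{2}\Vert Ax-b\Vert ^2+\mu \Vert x\Vert _1\) with a rank-deficient A, so that \(\varphi \) is not strongly convex, and prints the per-step ratios of \(\varphi (x_k)-\min \varphi \) together with \(\Vert {\mathcal {R}}_{1/L}(x_k)\Vert \).

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r, mu = 10, 20, 6, 0.1
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank-deficient A
b = rng.standard_normal(m)
L = np.linalg.norm(A, 2)**2                 # Lipschitz modulus of grad f, f = 0.5 ||Ax-b||^2

grad_f = lambda x: A.T @ (A @ x - b)
prox_g = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t * mu, 0.0)   # prox of t*mu*||.||_1
phi = lambda x: 0.5 * np.linalg.norm(A @ x - b)**2 + mu * np.linalg.norm(x, 1)
fbs_step = lambda x: prox_g(x - grad_f(x) / L, 1.0 / L)
R = lambda x: L * (x - fbs_step(x))         # residual measure operator R_{1/L}

z = np.zeros(n)                             # approximate min phi by running FBS for a long time
for _ in range(20000):
    z = fbs_step(z)
phi_min = phi(z)

x = rng.standard_normal(n)
for k in range(30):
    x_new = fbs_step(x)
    ratio = (phi(x_new) - phi_min) / (phi(x) - phi_min)
    print(k, ratio, np.linalg.norm(R(x)))   # ratios settle below 1, consistent with (28)
    x = x_new
```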

6 Linear convergence of the PALM algorithm

The PALM algorithm was recently introduced by the authors of [11] for a class of composite optimization problems in the general nonconvex and nonsmooth setting. The authors developed a convergence analysis framework relying on the Kurdyka–Łojasiewicz (KL) inequality and proved that PALM converges globally to a critical point for problems with semi-algebraic data. A global non-asymptotic sublinear rate of convergence of PALM for convex problems was obtained independently in [26, 51]. Very recently, global linear convergence of PALM for strongly convex problems was obtained in [33]. Note that PALM is called the block coordinate proximal gradient algorithm in [26] and a cyclic block coordinate descent-type method in [33]. In this section, we show linear convergence of PALM under EB conditions that are strictly weaker than strong convexity.

Let \(x_{1:k}:=(x_1,x_2,\dots , x_k)\) and denote \(x_{1:(j-1)}^{(t+1)}:=(x_1^{(t+1)},\dots , x_{j-1}^{(t+1)})\), \(x_{(j+1):p}^{(t)}:=(x_{j+1}^{(t)},\dots , x_{p}^{(t)})\), \(\psi _j^{(t)}(x_j):=f(x_{1:(j-1)}^{(t+1)},x_j,x_{(j+1):p}^{(t)})\), and \(\varphi _j^{(t)}(x_j):=\psi _j^{(t)}(x_j) +g_j(x_j).\) Start with given initial points \(\{x_j^{(0)}\}_{j=1}^p\). PALM generates \(\{x_j^{(t+1)}\}_{j=1}^p\) via solving a collection of subproblems

$$\begin{aligned} x_j^{(t+1)}= & {} \arg \min _{x_j}\left\{ \left\langle x_j-x_j^{(t)}, \nabla \psi _j^{(t)}\left( x_j^{(t)}\right) \right\rangle +\frac{L_j}{2}\left\| x_j-x_j^{(t)}\right\| ^2+g_j(x_j)\right\} ,\\ \quad j= & {} 1,\dots , p,~t\ge 0. \end{aligned}$$
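Completing the square shows that each subproblem above is a single proximal gradient step in the jth block, \(x_j^{(t+1)}={\mathbf {prox}}_{\frac{1}{L_j}g_j}\bigl (x_j^{(t)}-\frac{1}{L_j}\nabla \psi _j^{(t)}(x_j^{(t)})\bigr )\). The following minimal sketch (the data, block sizes, and regularization weight are illustrative assumptions) instantiates PALM with two blocks for \(f(x_1,x_2)=\frac{1}{2}\Vert A_1x_1+A_2x_2-b\Vert ^2\) and \(g_j=\mu \Vert \cdot \Vert _1\).

```python
import numpy as np

rng = np.random.default_rng(2)
m, d1, d2, mu = 15, 8, 6, 0.05
A1, A2 = rng.standard_normal((m, d1)), rng.standard_normal((m, d2))
b = rng.standard_normal(m)

L1 = np.linalg.norm(A1, 2)**2               # block Lipschitz constant of grad_{x1} f
L2 = np.linalg.norm(A2, 2)**2               # block Lipschitz constant of grad_{x2} f

soft = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)   # prox of t*||.||_1
phi = lambda x1, x2: (0.5 * np.linalg.norm(A1 @ x1 + A2 @ x2 - b)**2
                      + mu * (np.abs(x1).sum() + np.abs(x2).sum()))

x1, x2 = np.zeros(d1), np.zeros(d2)
for t in range(50):
    # block 1: proximal gradient step on x1 with x2 fixed
    g1 = A1.T @ (A1 @ x1 + A2 @ x2 - b)
    x1 = soft(x1 - g1 / L1, mu / L1)
    # block 2: proximal gradient step on x2 with the updated x1
    g2 = A2.T @ (A1 @ x1 + A2 @ x2 - b)
    x2 = soft(x2 - g2 / L2, mu / L2)
    print(t, phi(x1, x2))                   # the objective decreases monotonically along PALM
```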

The following is our main result in this section.

Theorem 3

Consider the following composite convex nonsmooth minimization problem

$$\begin{aligned} {\mathop {\mathrm{minimize}}\limits _{x\in {\mathbb {R}}^d}}\, \varphi (x):=f(x_1,\dots , x_p)+\sum ^p_{j=1}g_j(x_j), \end{aligned}$$
(34)

where \({\mathbb {R}}^d\ni x=(x_1,\dots , x_p)\) with the jth block \(x_j\in {\mathbb {R}}^{d_j}\), and \(d=\sum ^p_{j=1}d_j\). Set \(g(x):=\sum ^p_{j=1}g_j(x_j)\) so that \({\mathrm {dom}}g= \Pi _{j=1}^p{\mathrm {dom}}g_j\). With these notations, the objective function of (34) reads as \(\varphi =f+g\). Assume that

  • \(f \in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^d)\), \(g_j\in \Gamma _0({\mathbb {R}}^{d_j}), j=1,\dots , p\), and \(\Omega \subset {\mathrm {dom}}\partial \varphi \);

  • \(f(x_{1:(j-1)},x_j,x_{(j+1):p})\in {{\mathcal {F}}}^{1,1}_{L_j}({\mathbb {R}}^{d_j})\) for all \(x_{1:(j-1)}\) and \(x_{(j+1):p}\), \(j=1,\dots , p\);

  • \(\varphi =f+g\) is such that it achieves its minimum \(\min \varphi \);

  • \(\varphi \) satisfies the \((\partial ^0 \varphi , \eta , \Omega )\)-(res-obj-EB) condition (or its equivalent conditions from case 2 of Corollary 2), which is strictly weaker than strong convexity.

Here, \(L_j, j=1,\dots , p\) and L are positive constants. Let \(\{x^{(t)}\}\) be generated by PALM and assume that \(x^{(t)}\in \Omega , t\ge 0\). Then, PALM converges linearly in the sense that

$$\begin{aligned} \varphi \left( x^{(t+1)}\right) -\min \varphi \le \left( \frac{\eta ^2L_{\min }}{4pL^2+4L_{\max }^2}+1\right) ^{-1}\left( \varphi \left( x^{(t)}\right) -\min \varphi \right) , ~~t\ge 0, \end{aligned}$$

where \(L_{\min }=\min _j L_j\) and \(L_{\max }=\max _j L_j\).

Proof

We divide the proof into three steps.

Step 1. We prove that

$$\begin{aligned} \varphi \left( x^{(t)}\right) -\varphi \left( x^{(t+1)}\right) \ge \frac{L_{\min }}{2}\left\| x^{(t)}-x^{(t+1)}\right\| ^2, ~t\ge 0. \end{aligned}$$
(35)

Let \(G^{(t)}_j=L_j(x_j^{(t)}-x_j^{(t+1)})\). By the definition of \(x_j^{(t+1)}\) and Lemma 2.3 in [8], we get

$$\begin{aligned} \varphi _j^{(t)}\left( x_j^{(t)}\right) -\varphi _j^{(t)}\left( x_j^{(t+1)}\right) \ge \frac{1}{2L_j}\left\| G^{(t)}_j\right\| ^2 =\frac{L_j^2}{2L_j}\left\| x_j^{(t)}-x_j^{(t+1)}\right\| ^2= \frac{L_j}{2}\left\| x_j^{(t)}-x_j^{(t+1)}\right\| ^2. \end{aligned}$$

In addition, note that

$$\begin{aligned} \sum ^p_{j=1}\varphi _j^{(t)}\left( x_j^{(t)}\right) =\sum ^p_{j=1} \left( f\left( x^{(t+1)}_{1:(j-1)},x^{(t)}_{j:p}\right) +g_j\left( x_j^{(t)}\right) \right) \end{aligned}$$

and

$$\begin{aligned} \sum ^p_{j=1}\varphi _j^{(t)}\left( x_j^{(t+1)}\right) =\sum ^p_{j=1} \left( f\left( x^{(t+1)}_{1:j},x^{(t)}_{(j+1):p}\right) +g_j\left( x_j^{(t+1)}\right) \right) . \end{aligned}$$

Thus, we derive that for \(t\ge 0\),

$$\begin{aligned} \varphi \left( x^{(t)}\right) -\varphi \left( x^{(t+1)}\right) =\sum ^p_{j=1}\varphi _j^{(t)}\left( x_j^{(t)}\right) -\sum ^p_{j=1}\varphi _j^{(t)}\left( x_j^{(t+1)}\right) \ge \sum ^p_{j=1}\frac{L_j}{2}\left\| x_j^{(t)}-x_j^{(t+1)}\right\| ^2, \end{aligned}$$

from which (35) follows.

Step 2. The \((\partial ^0 \varphi , \eta , \Omega )\)-(res-obj-EB) condition at \(x=x^{(t+1)}\) reads as

$$\begin{aligned} \varphi \left( x^{(t+1)}\right) -\min \varphi \le \frac{\left\| \partial ^0 \varphi \left( x^{(t+1)}\right) \right\| ^2}{\eta ^2}. \end{aligned}$$

At the \((t+1)\)th iteration, there exists \(\xi _j^{(t+1)}\in \partial g_j(x_j^{(t+1)})\) satisfying the optimality condition:

$$\begin{aligned} \nabla _j f\left( x_{1:(j-1)}^{(t+1)},x_j^{(t)},x_{(j+1):p}^{(t)}\right) + L_j\left( x_j^{(t+1)}-x_j^{(t)}\right) +\xi _j^{(t+1)}=0. \end{aligned}$$

Here and below, we denote the partial gradient \(\nabla _{x_j}f(x)\) by \(\nabla _j f(x)\) for notational simplicity. Let \(\xi ^{(t+1)}=(\xi _1^{(t+1) },\dots , \xi _p^{(t+1)})\). Then,

$$\begin{aligned} \nabla f\left( x^{(t+1)}\right) +\xi ^{(t+1)}\in \partial \varphi \left( x^{(t+1)}\right) \end{aligned}$$

and hence

$$\begin{aligned} \varphi \left( x^{(t+1)}\right) - \min \varphi \le \frac{\left\| \partial ^0 \varphi \left( x^{(t+1)}\right) \right\| ^2}{\eta ^2}\le \frac{\left\| \nabla f\left( x^{(t+1)}\right) +\xi ^{(t+1)}\right\| ^2}{\eta ^2}. \end{aligned}$$

Using the optimality condition and the fact that \(f\in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^d)\), we derive that

$$\begin{aligned} \left\| \nabla f\left( x^{(t+1)}\right) +\xi ^{(t+1)}\right\| ^2&= \sum ^p_{j=1}\left\| \nabla _jf\left( x^{(t+1)}\right) -\nabla _j f\left( x_{1:(j-1)}^{(t+1)},x_j^{(t)},x_{(j+1):p}^{(t)}\right) \right. \\&\quad \left. -L_j\left( x_j^{(t+1)}-x_j^{(t)}\right) \right\| ^2\\&\le \sum ^p_{j=1}2\left\| \nabla _jf\left( x^{(t+1)}\right) -\nabla _j f\left( x_{1:(j-1)}^{(t+1)},x_j^{(t)},x_{(j+1):p}^{(t)}\right) \right\| ^2\\&+\sum ^p_{j=1}2L_j^2\left\| x_j^{(t+1)}-x_j^{(t)}\right\| ^2\\&\le \sum ^p_{j=1}2\left\| \nabla f\left( x^{(t+1)}\right) -\nabla f\left( x_{1:(j-1)}^{(t+1)},x_j^{(t)},x_{(j+1):p}^{(t)}\right) \right\| ^2\\&\quad + \sum ^p_{j=1}2L_j^2\left\| x_j^{(t+1)}-x_j^{(t)}\right\| ^2\\&\le \sum ^p_{j=1}2L^2\left\| x_{j:p}^{(t+1)}-x_{j:p}^{(t)}\right\| ^2+ \sum ^p_{j=1}2L_j^2\left\| x_j^{(t+1)}-x_j^{(t)}\right\| ^2\\&\le \left( 2pL^2+2L_{\max }^2\right) \left\| x^{(t+1)}-x^{(t)}\right\| ^2. \end{aligned}$$

Therefore, we obtain

$$\begin{aligned} \varphi \left( x^{(t+1)}\right) -\min \varphi \le \frac{\left( 2pL^2+2L_{\max }^2\right) }{\eta ^2}\left\| x^{(t+1)}-x^{(t)}\right\| ^2. \end{aligned}$$
(36)

Step 3. Combining (35) and (36), we derive that

$$\begin{aligned} \varphi \left( x^{(t)}\right) -\min \varphi&= \left( \varphi \left( x^{(t)}\right) - \varphi \left( x^{(t+1)}\right) \right) + \left( \varphi \left( x^{(t+1)}\right) -\min \varphi \right) \\&\ge \frac{L_{\min }}{2}\left\| x^{(t)}-x^{(t+1)}\right\| ^2 +\left( \varphi \left( x^{(t+1)}\right) -\min \varphi \right) \\&\ge \left( \frac{\eta ^2L_{\min }}{4pL^2+4L_{\max }^2}+1\right) \left( \varphi \left( x^{(t+1)}\right) -\min \varphi \right) , \end{aligned}$$

from which the claimed result follows. This completes the proof. \(\square \)

The \((\varphi , \alpha , \Omega )\)-(obj-EB) condition is obviously implied by strong convexity, and one can easily construct functions that satisfy (obj-EB) but fail to be strongly convex. For example, the composition f(Ax), where \(f(\cdot )\) is strongly convex and A is rank deficient, is such a function. This explains why we say that the \((\partial ^0 \varphi , \eta , \Omega )\)-(res-obj-EB) condition, which is equivalent to the \((\varphi , \alpha , \Omega )\)-(obj-EB) condition, is strictly weaker than strong convexity.
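This observation can be checked directly; the following short sketch (with illustrative random data) verifies for \(\varphi (x)=\frac{1}{2}\Vert Ax-b\Vert ^2\) with rank-deficient A that \(\lambda _{\min }(A^TA)=0\), so strong convexity fails, while \(\varphi (x)-\min \varphi =\frac{1}{2}\Vert A(x-x^*)\Vert ^2\ge \frac{\lambda _{\min }^{+}(A^TA)}{2}\,d^2(x,\mathrm {Arg}\min \varphi )\), i.e., (obj-EB) holds with \(\alpha =\lambda _{\min }^{+}(A^TA)\).

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 10, 4
A = rng.standard_normal((6, r)) @ rng.standard_normal((r, n))   # rank r < n
b = rng.standard_normal(6)

eig = np.linalg.eigvalsh(A.T @ A)
lam_min, lam_min_pos = eig[0], eig[eig > 1e-10][0]
print("smallest eigenvalue of A^T A:", lam_min)   # ~0: no strong convexity

x_star = np.linalg.lstsq(A, b, rcond=None)[0]     # a minimizer of phi
P_row = np.linalg.pinv(A) @ A                     # projector onto N(A)^perp
phi = lambda x: 0.5 * np.linalg.norm(A @ x - b)**2
for _ in range(5):
    x = rng.standard_normal(n)
    gap = phi(x) - phi(x_star)
    d2 = np.linalg.norm(P_row @ (x - x_star))**2  # d^2(x, Argmin phi)
    print(gap >= 0.5 * lam_min_pos * d2 - 1e-9)   # (obj-EB) with alpha = lam_min^+(A^T A)
```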

We note that the authors of [6] very recently showed that the regularized Jacobi algorithm, a type of cyclic block coordinate descent method, achieves a linear convergence rate under conditions similar to those of Theorem 3.

7 Linear convergence of Nesterov’s accelerated forward–backward algorithm

This section is divided into two parts. In the first part, we introduce a composite optimization problem and then give a new EB condition. In the second part, we introduce Nesterov’s accelerated forward–backward algorithm and show its Q-linear convergence.

7.1 Problem formulation and a new EB condition

Let \(\{r_k\}_{k\ge 0}\) be a nonnegative real sequence. Following the terminology from [44], we say that \(\{r_k\}\) converges:

  • Q-linearly if there exists a constant \(\tau \in (0,1)\) such that \(\forall k\ge 0\), \(r_{k+1}\le \tau \cdot r_k\),

  • R-linearly if there exists a sequence \(\{s_k\}_{k\ge 0}\) Q-linearly converging to zero such that \(\forall k\ge 0\), \(r_k\le s_k\).
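For instance, the sequence \(r_k=(1+(-1)^k)\,\tau ^k\) with \(\tau \in (0,1)\) converges R-linearly, since it is dominated by \(s_k=2\tau ^k\), but not Q-linearly, because \(r_{k+1}>0=\tau \cdot r_k\) whenever k is odd.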

It is well-known that Nesterov’s accelerated gradient method with the following form

$$\begin{aligned} \left\{ \begin{array}{l} y_k = x_k+\frac{\sqrt{L}-\sqrt{\mu }}{\sqrt{L}+\sqrt{\mu }}(x_k-x_{k-1}) \\ \\ x_{k+1}=y_k-\frac{1}{L}\nabla f(y_k), \end{array} \right. \end{aligned}$$
(37)

converges R-linearly for minimizing \(f\in {{\mathcal {S}}}_{\mu ,L}^{1,1}({\mathbb {R}}^n)\) in the sense that \(\{f(x_k)-\min f\}_{k\ge 0}\) converges R-linearly. Very recently, the following Q-linear convergence was independently discovered in [29, 58] by quite different methods:

$$\begin{aligned}&f(x_{k+1})-\min f +\frac{\mu }{2}\left\| w_{k+1}-x^*\right\| ^2\le \left( 1-\sqrt{\frac{\mu }{L}}\right) \left( f(x_k)-\min f \right. \nonumber \\&\left. \quad +\frac{\mu }{2}\left\| w_k-x^*\right\| ^2\right) , ~\forall k\ge 0, \end{aligned}$$
(38)

where \(w_k=(1+\sqrt{\frac{L}{\mu }})y_k-\sqrt{\frac{L}{\mu }}x_k\). In Nesterov’s book [43], by replacing the gradient with the gradient mapping, the accelerated scheme (37) was successfully extended to solve the following minimization problems:

$$\begin{aligned} {\mathop {\mathrm{minimize}}\limits _{x\in Q}}\, f(x), \end{aligned}$$
(39)

and

$$\begin{aligned} {\mathop {\mathrm{minimize}}\limits _{x\in Q}}\, f(x):=\max _{1\le i\le m} f_i(x), \end{aligned}$$
(40)

where \(f, f_i\in {{\mathcal {S}}}_{\mu ,L}^{1,1}({\mathbb {R}}^n), i=1, \dots , m\) and Q is a nonempty closed convex set. Similarly, the accelerated scheme (37) can also be successfully extended to solve

$$\begin{aligned} {\mathop {\mathrm{minimize}}\limits _{x\in {\mathbb {R}}^n}}\, \varphi (x):=f(x)+g(x), \end{aligned}$$
(41)

where \(f \in {{\mathcal {S}}}_{\mu ,L}^{1,1}({\mathbb {R}}^n)\) and \(g\in \Gamma _0({\mathbb {R}}^{n})\). Nesterov’s extended accelerated methods have been shown to achieve R-linear convergence. A natural question arises: does Q-linear convergence also hold for Nesterov’s accelerated method applied to problems (39)–(41)? In order to study problems (39)–(41) in a unified way, we consider the following composite optimization problem:

$$\begin{aligned} {\mathop {\mathrm{minimize}}\limits _{x}}\, \varphi (x):=f(e(x))+g(x). \end{aligned}$$
(42)

This is a very powerful formulation covering many optimization problems, including problems (39)–(41), as special cases; see [19, 20]. Now, we introduce a new EB condition, which is commonly satisfied by concrete examples of the form (39)–(41); see Remark 6 below. Our forthcoming argument will rely heavily on this condition.

Definition 6

Let \(\varphi :=f\circ e +g \) be such that \(f:{\mathbb {R}}^m\rightarrow {\mathbb {R}}\) is a closed convex function, \(g\in \Gamma _0({\mathbb {R}}^{n})\), and \(e: {\mathbb {R}}^n\rightarrow {\mathbb {R}}^m\) is a smooth mapping with its Jacobian given by \(\nabla e(x)\). Let \(L>0\) and define

$$\begin{aligned} \ell (x;y):=g(x)+f(e(y) + \nabla e(y)(x-y)) +\frac{L}{2}\Vert x-y\Vert ^2, \end{aligned}$$

and

$$\begin{aligned}&p(y):=\arg \min _{x\in {\mathbb {R}}^n} \ell (x;y),\\&G(y):=L(y-p(y)). \end{aligned}$$

We say that \(\varphi \) satisfies the composite EB condition with positive constants \(\mu , L\) obeying \(\mu < L\) if

$$\begin{aligned} ~~\forall x, y\in {\mathbb {R}}^n,~~\langle G(y), y-x\rangle \ge \varphi (p(y))-\varphi (x)+\frac{1}{2L}\Vert G(y)\Vert ^2+\frac{\mu }{2}\Vert x-y\Vert ^2. \end{aligned}$$
(43)

Let us give several comments on this definition.

Remark 5

  1. 1.

    Both p(y) and G(y) are well defined due to the strong convexity of \(\ell (\cdot ;y)\) for any \(y\in {\mathbb {R}}^n\). Moreover, the operator G is a residual measure operator related to \(\varphi \) and \({\mathbb {R}}^n\). In fact, observe that the optimality conditions for the proximal subproblem \(\mathrm {Arg}\min _{x\in {\mathbb {R}}^n} \ell (x;y)\) read as

    $$\begin{aligned} G(y)\in \partial g(p(y))+\nabla e(y)^T \partial f(e(y) + \nabla e(y)(p(y)-y)), \end{aligned}$$

    which implies \(y\in {\mathbf {crit}}\varphi \) if \(G(y)=0\). On the other hand, by the definition of p(y) and using the convexity of g and f, we derive that

    $$\begin{aligned} \varphi (y)&= \ell (y;y)\ge \ell (p(y);y) \nonumber \\&= g(p(y))+f(e(y) + \nabla e(y)(p(y)-y)) +\frac{L}{2}\Vert p(y)-y\Vert ^2 \nonumber \\&\ge (g(y)+\langle z, p(y)-y\rangle ) +(f(e(y))+\langle w, \nabla e(y)(p(y)-y)\rangle )+\frac{L}{2}\Vert p(y)-y\Vert ^2 \nonumber \\&= \varphi (y)+\langle z+\nabla e(y)^T w, p(y)-y\rangle +\frac{L}{2}\Vert p(y)-y\Vert ^2, \end{aligned}$$
    (44)

    where \(z\in \partial g(y)\) and \(w\in \partial f(e(y))\), and hence \(z+\nabla e(y)^T w \in \partial \varphi (y)\). Thus, if \(0\in \partial \varphi (y)\), then we can take some \(z\in \partial g(y)\) and \(w\in \partial f(e(y))\) such that \(z+\nabla e(y)^T w =0\). Hence, the inequality (44) implies that \(G(y)=0\) if \(y\in {\mathbf {crit}}\varphi \). Therefore, we have \(\{y\in {\mathbb {R}}^n: G(y)=0\}={\mathbf {crit}}\varphi \), i.e., G is a residual measure operator related to \(\varphi \).

  2. 2.

    The composite EB condition (43) can be viewed as a relaxation of strong convexity to some degree. This perspective is in the spirit of the work [42]. Indeed, in case of \(m=1\), \(g(x)\equiv 0\), \(f(t)\equiv t, t\in {\mathbb {R}}\), and \(e\in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\), (43) reads as

    $$\begin{aligned}&~\forall x, y\in {\mathbb {R}}^n,~~e(x)\ge \left( e(y-\frac{1}{L}\nabla e(y))+\frac{1}{2L}\Vert \nabla e(y)\Vert ^2\right) \nonumber \\&\quad +\langle \nabla e(y),x-y\rangle +\frac{\mu }{2}\Vert x-y\Vert ^2. \end{aligned}$$
    (45)

    On the other hand, \(e\in {{\mathcal {F}}}^{1,1}_L({\mathbb {R}}^n)\) implies that

    $$\begin{aligned} ~~\forall x, y\in {\mathbb {R}}^n,~~e(y)\ge e(y-\frac{1}{L}\nabla e(y))+\frac{1}{2L}\Vert \nabla e(y)\Vert ^2. \end{aligned}$$

    Therefore, (45) is a relaxation of strong convexity in the following form:

    $$\begin{aligned} ~\forall x, y\in {\mathbb {R}}^n,~~e(x)\ge e(y)+\langle \nabla e(y),x-y\rangle +\frac{\mu }{2}\Vert x-y\Vert ^2. \end{aligned}$$

    In the case of \(f\circ e(x)\equiv 0\) and \(g\in \Gamma _0({\mathbb {R}}^n)\), (43) reads as

    $$\begin{aligned} ~\forall x, y\in {\mathbb {R}}^n, ~~g(x)\ge g_\lambda (y)+\langle \nabla g_\lambda (y),x-y\rangle +\frac{\mu }{2}\Vert x-y\Vert ^2, \end{aligned}$$
    (46)

    where \(\lambda =\frac{1}{L}\). Recall that \(g_\lambda \) is the Moreau–Yosida regularization of g and note that \(g(x)\ge g_\lambda (x)\). We can see that (46) is a relaxation of strong convexity of \(g_\lambda \).

  3. 3.

    Although we have shown that (43) can be viewed as a relaxation of strong convexity, it is still a very strong property. We now construct an example to show that even strong convexity of f is not enough to ensure that (43) holds. This example is obtained by setting \(n=m=2\), \(x=(x_1, x_2)^T\), \(e(x)=(x_1, x_1)^T\), \(f(x)=\frac{1}{2}x_1^2+\frac{1}{2}x_2^2\), \(g(x)\equiv 0\); then \(\varphi (x)=f\circ e(x)=x_1^2\). It is obvious that f is strongly convex. Let us show that in this special case (43) fails to hold. Indeed, after some simple calculations, we get

    $$\begin{aligned} p(y)=\begin{pmatrix}\frac{L}{L+2}y_1 \\ y_2 \end{pmatrix},~~~~G(y)=\begin{pmatrix}\frac{2L}{L+2}y_1 \\ 0 \end{pmatrix}, \end{aligned}$$

    and therefore (43) reads as

    $$\begin{aligned} \frac{2L}{L+2}y_1(y_1-x_1)&\ge \left( \frac{L}{L+2}y_1\right) ^2-x_1^2+\frac{2L}{(L+2)^2}y_1^2\\&\quad +\frac{\mu }{2}(x_1-y_1)^2+\frac{\mu }{2}(x_2-y_2)^2, \quad \forall x_i, y_i\in {\mathbb {R}}, i=1,2. \end{aligned}$$

    But, if we take \(x_1=y_1\equiv 0\), then we should have

    $$\begin{aligned} 0\ge \frac{\mu }{2}(x_2-y_2)^2, \quad \forall x_2, y_2\in {\mathbb {R}}. \end{aligned}$$

    Obviously, this is impossible for any positive constant \(\mu \).

  4. 4.

    Let \(A\in {\mathbb {R}}^{m\times n}\) with \(m<n\) be a given matrix and \(b\in {\mathbb {R}}^m\) be a given vector. A well-known fact in the EB community is that the quadratic function \(\frac{1}{2}\Vert Ax-b\Vert ^2\) is not strongly convex but satisfies EB conditions. Unfortunately, this function fails to satisfy (45). We show this by contradiction. It is enough to consider the case \(m=1\), \(g(x)\equiv 0\), \(f(t)\equiv t, t\in {\mathbb {R}}\), and \(e(x)=\frac{1}{2}x^Taa^Tx\) with \(\Vert a\Vert ^2=L\). In this case, (45) reads as

    $$\begin{aligned} \frac{1}{2}\left( a^Tx-a^Ty\right) ^2 \ge \frac{\mu }{2}\Vert x-y\Vert ^2, \quad \forall x, y\in {\mathbb {R}}^n. \end{aligned}$$

    Let \(h\ne 0\) be an orthogonal vector of a. Now, take \(y-x=\lambda h, \lambda \in {\mathbb {R}}\). Then, we have

    $$\begin{aligned} 0\ge \frac{\mu }{2}\lambda ^2\Vert h\Vert ^2, \quad \forall \lambda \in {\mathbb {R}}, \end{aligned}$$

    which is impossible for any positive constant \(\mu \).

  5. 5.

    In order to show that (46) can be strictly weaker than strong convexity, we now construct a one-dimensional example that satisfies (46) but fails to be strongly convex. Define the shrinkage operator by \({{\mathcal {S}}}(t):=\mathrm {sign}(t)\cdot \max \{|t|-1,0\}\) and the projection operator by \([x]_I^+:=\arg \min _{y\in I}\Vert x-y\Vert \), where I is some closed interval. Now, we take \(\lambda =1\), \(I=[-2,2]\), and \(g(x)=|x|+\delta _I(x)\). Obviously, such g(x) is convex but not strongly convex. Using formula (14) in [65] and Lemma 3, we have

    $$\begin{aligned} g_\lambda (x)=\left| \left[ {{\mathcal {S}}}(x)\right] _I^+\right| +\frac{1}{2}\left( x-\left[ {{\mathcal {S}}}(x)\right] _I^+\right) ^2. \end{aligned}$$

    Here, \(g_\lambda \) is the Moreau–Yosida regularization of g. Denote \(\ell _{g_\lambda }(x;y):=g_\lambda (y)+\langle \nabla g_\lambda (y),x-y\rangle \). We have the following expression:

    $$\begin{aligned} \ell _{g_\lambda }(x;y)=\left\{ \begin{array}{ll} (y+2)x-\frac{1}{2}y^2+4,&{}\quad y\le -3,\\ -x-\frac{1}{2}, &{}\quad -3\le y\le -1,\\ yx-\frac{1}{2}y^2, &{} \quad -1\le y\le 1,\\ x-\frac{1}{2},&{}\quad 1\le y\le 3,\\ (y-2)x-\frac{1}{2}y^2+4,&{}\quad y\ge 3 \end{array} \right. \end{aligned}$$
    (47)

    Then, one can verify case by case that for any \(\mu \in (0,\frac{1}{9}]\), (46) always holds. For example, in the case of \(y\le -3\), we only need to verify that

    $$\begin{aligned} |x|\ge (y+2)x-\frac{1}{2}y^2+4+\frac{\mu }{2}(x-y)^2, \quad x\in [-2,2], \end{aligned}$$

    i.e., \(\frac{1-\mu }{2}(x-y)^2 \ge \frac{1}{2}x^2+2x-|x|+4, ~~x\in [-2,2].\) Thus, it is sufficient to require that

    $$\begin{aligned} \mu \le 1-\max _{y\le -3, |x|\le 2}\frac{x^2+4x-2|x|+8}{(x-y)^2}. \end{aligned}$$

    After some simple calculations, we have \(\mu \le \frac{1}{9}\). The other cases can be similarly verified; we omit the details here. This example shows that the composite EB condition (43) indeed holds for some non-strongly convex functions.
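The case-by-case verification above can also be confirmed numerically. The following minimal sketch (the grids and the tolerance are illustrative choices) evaluates \(g_\lambda \) and \(\nabla g_\lambda \) for \(g(x)=|x|+\delta _I(x)\) with \(\lambda =1\) via \({\mathbf {prox}}_{g}(x)=[{\mathcal {S}}(x)]_I^+\), and tests (46) with \(\mu =\frac{1}{9}\) on a grid.

```python
import numpy as np

mu, I_lo, I_hi = 1.0 / 9.0, -2.0, 2.0
shrink = lambda t: np.sign(t) * np.maximum(np.abs(t) - 1.0, 0.0)   # S(t)
prox_g = lambda x: np.clip(shrink(x), I_lo, I_hi)                  # prox_g(x) = [S(x)]_I^+

def g_lam(y):                         # Moreau-Yosida regularization of g with lambda = 1
    p = prox_g(y)
    return np.abs(p) + 0.5 * (y - p)**2

grad_g_lam = lambda y: y - prox_g(y)  # Lemma 3 with lambda = 1

ys = np.linspace(-8.0, 8.0, 1601)
xs = np.linspace(I_lo, I_hi, 401)     # g is finite only on I = [-2, 2]
worst = np.inf
for y in ys:
    lhs = np.abs(xs)                                               # g(x)
    rhs = g_lam(y) + grad_g_lam(y) * (xs - y) + 0.5 * mu * (xs - y)**2
    worst = min(worst, np.min(lhs - rhs))
print(worst)   # stays above -1e-12 (up to rounding) on the grid, i.e. (46) holds with mu = 1/9
```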

Now, we explain why we say that the condition (43) is commonly satisfied by problems (39)–(41), whose objective functions are clearly not in \({{\mathcal {S}}}^{1,1}_{\mu ,L}({\mathbb {R}}^n)\).

Remark 6

  1. (i)

    The minimization problem (42) with \(m=1\), \(e\in {{\mathcal {S}}}^{1,1}_{\mu ,L}({\mathbb {R}}^n)\), \(f(t)\equiv t, t\in {\mathbb {R}}\), \(g(x)=\delta _Q(x)\), and Q being nonempty closed convex, corresponds to problem (39). The condition (43) holds in this setting; see Theorem 2.2.7 in [43].

  2. (ii)

    The minimization problem (42) with \(f(y)=\max _{1\le i\le m}\{y_i\}\), \(f_i\in {{\mathcal {S}}}^{1,1}_{\mu ,L}({\mathbb {R}}^n)\), \(e(x)=(f_1(x), f_2(x),\dots , f_m(x))\), \(g(x)=\delta _Q(x)\), and Q being nonempty closed convex, corresponds to problem (40). The condition (43) holds in this setting; see Corollary 2.3.2 in [43].

  3. (iii)

    The minimization problem (42) with \(m=1\), \(e\in {{\mathcal {S}}}^{1,1}_{\mu ,L}({\mathbb {R}}^n)\), \(f(t)\equiv t, t\in {\mathbb {R}}\), and \(g\in \Gamma _0({\mathbb {R}}^n)\), corresponds to problem (41). The condition (43) holds in this setting; see the inequality (4.36) in [15].

Interestingly, we note that while this work was under review, the authors of [41] utilized exactly the form (43) to construct underestimate sequences and proposed several first-order methods for minimizing strongly convex smooth functions and strongly convex composite functions. Based on the discussion in this section, the corresponding results in [41] could be expected to extend to the composite optimization problem (42).

In general, we have to admit that it is difficult to verify the composite EB condition (43), which therefore deserves further study in the future.

7.2 Q-linear convergence of Nesterov’s acceleration

In this part, we show Q-linear convergence of Nesterov’s acceleration under the composite EB condition (43), which is more general than strong convexity. First, in light of Nesterov’s accelerated scheme (2.2.11) in [43], Nesterov’s accelerated forward–backward algorithm for solving the problem (42) reads as: choosing \(x_{-1}=x_0\in {\mathbb {R}}^n\), for \(k\ge 0\),

$$\begin{aligned} \left\{ \begin{array}{l} y_k =x_k+\frac{\sqrt{L}-\sqrt{\mu }}{\sqrt{L}+\sqrt{\mu }}(x_k-x_{k-1}) \\ \\ x_{k+1}=y_k-\frac{1}{L}G(y_k). \end{array} \right. \end{aligned}$$

Let

$$\begin{aligned} \alpha =\frac{\sqrt{L}-\sqrt{\mu }}{\sqrt{L}+\sqrt{\mu }},~\beta = \frac{2\sqrt{\mu }}{\sqrt{L}+\sqrt{\mu }}, ~\gamma =\frac{1}{2L}\left( 1+\sqrt{\frac{L}{\mu }}\right) . \end{aligned}$$

Let

$$\begin{aligned} \Phi _k(x^*;\tau ):=\varphi (x_k)- \min \varphi +\tau \cdot \Vert z_k-x^*\Vert ^2,~k\ge 0, \end{aligned}$$

where \(x^*\in \mathrm {Arg}\min \varphi \) (assumed to be nonempty) and

$$\begin{aligned} z_k=\frac{1}{2}\left( 1+\sqrt{\frac{L}{\mu }}\right) y_k+\frac{1}{2}\left( 1-\sqrt{\frac{L}{\mu }}\right) x_k, \quad k\ge 0. \end{aligned}$$

Now, we are ready to present the main result of this section. The underlying proof idea is partially inspired by the argument in [5] but might be of interest in its own right.

Theorem 4

Let \(\varphi :=f\circ e +g \) be such that \(f:{\mathbb {R}}^m\rightarrow {\mathbb {R}}\) is a closed convex function, \(g\in \Gamma _0({\mathbb {R}}^{n})\), and \(e: {\mathbb {R}}^n\rightarrow {\mathbb {R}}^m\) is a smooth mapping with its Jacobian given by \(\nabla e(x)\). Let \(\varphi \) satisfy the composite EB condition (43) with positive constants \(\mu , L\) obeying \(\mu < L\). Assume that \(\varphi \) achieves its minimum \(\min \varphi \) so that \(\mathrm {Arg}\min \varphi \ne \emptyset \). Then, there exists a unique vector \(x^*\) such that \(\mathrm {Arg}\min \varphi =\{x^*\}\), and Nesterov’s accelerated forward–backward method converges Q-linearly in the sense that there exists a positive constant \(\theta _0<1\) such that for any \(\theta \in [\theta _0, 1)\) it holds

$$\begin{aligned} \Phi _{k+1}\left( x^*;\tau \right) \le \rho \cdot \Phi _k\left( x^*;\tau \right) , \quad k\ge 0, \end{aligned}$$
(48)

where \(\rho =\max \{\alpha , \theta \}<1\) and \(\tau =\frac{\theta \beta }{2\rho \gamma }\). In particular, by taking \(\theta = \max \{\theta _0, \alpha \}\), we have

$$\begin{aligned} \Phi _{k+1}\left( x^*;\frac{2L\mu }{\left( \sqrt{L}+\sqrt{\mu }\right) ^2}\right) \le \max \{\theta _0, \alpha \} \cdot \Phi _k\left( x^*;\frac{2L\mu }{\left( \sqrt{L}+\sqrt{\mu }\right) ^2}\right) , \quad k\ge 0.\nonumber \\ \end{aligned}$$
(49)

Proof

We first show the uniqueness of the optimal solution \(x^*\) of \(\varphi \). In fact, by statement (i) in Remark 5 and the fact that \(\mathrm {Arg}\min \varphi \subset {\mathbf {crit}}\varphi \), we have that \(G(x^*)=0\) and \(p(x^*)=x^*\), and hence (43) at \(y=x^*\) reads as

$$\begin{aligned} \varphi (x)-\min \varphi \ge \frac{\mu }{2}\left\| x-x^*\right\| ^2, \quad \forall x\in {\mathbb {R}}^n, \end{aligned}$$

which clearly implies that \(\mathrm {Arg}\min \varphi =\{x^*\}\).

Now, we analyze the linear convergence rate. Using (43) successively, first at \(x=x_k\) and \(y=y_k\), and then at \(y=y_k\) and \(x=x^*\), together with the fact that \(x_{k+1}=p(y_k)\), we obtain

$$\begin{aligned} \varphi (x_{k+1})\le \varphi (x_k)+\langle G(y_k),y_k-x_k\rangle -\frac{1}{2L}\Vert G(y_k)\Vert ^2-\frac{\mu }{2}\Vert x_k-y_k\Vert ^2 \end{aligned}$$

and

$$\begin{aligned} \varphi (x_{k+1})\le \varphi \left( x^*\right) +\left\langle G(y_k),y_k-x^*\right\rangle -\frac{1}{2L}\Vert G(y_k)\Vert ^2-\frac{\mu }{2}\left\| x^*-y_k\right\| ^2. \end{aligned}$$

Multiplying the first inequality by \(\alpha \) and the second one by \(\beta \), and then adding the two resulting inequalities, we obtain

$$\begin{aligned} \varphi (x_{k+1})\le&\alpha \varphi (x_k)+\beta \varphi \left( x^*\right) + \left\langle G(y_k),\alpha (y_k-x_k)+\beta \left( y_k-x^*\right) \right\rangle \\&-\frac{1}{2L}\Vert G(y_k)\Vert ^2-\frac{\mu \alpha }{2}\Vert x_k-y_k\Vert ^2-\frac{\mu \beta }{2}\left\| x^*-y_k\right\| ^2. \end{aligned}$$

In order to estimate the right-hand side of the inequality above, we first write down:

$$\begin{aligned} \alpha (y_k-x_k)+\beta \left( y_k-x^*\right) =\beta \left( z_k-x^*\right) . \end{aligned}$$
(50)

Secondly, using the expression of \(y_{k+1}=x_{k+1}+\alpha (x_{k+1}-x_k)\), we get

$$\begin{aligned} z_{k+1}=\frac{1}{2}\left( 1+\sqrt{\frac{L}{\mu }}\right) x_{k+1}+\frac{1}{2}\left( 1-\sqrt{\frac{L}{\mu }}\right) x_k. \end{aligned}$$
(51)

Then, substitute \(x_{k+1}=y_k-\frac{1}{L}G(y_k)\) into formula (51) to obtain

$$\begin{aligned} z_{k+1}-x^*=z_k-x^*-\gamma \cdot G(y_k). \end{aligned}$$
(52)

Using equality (52), we derive that

$$\begin{aligned} \left\langle G(y_k),z_k-x^*\right\rangle&= \frac{1}{\gamma }\left\langle z_k-x^*-\left( z_{k+1}-x^*\right) ,z_k-x^*\right\rangle \\&= \frac{1}{\gamma }\left\| z_k-x^*\right\| ^2 -\frac{1}{\gamma }\left\langle z_{k+1}-x^*,z_k-x^*\right\rangle \\&= \frac{1}{\gamma }\left\| z_k-x^*\right\| ^2 -\frac{1}{\gamma }\left\langle z_{k+1}-x^*,z_{k+1}-x^*+\gamma \cdot G(y_k)\right\rangle \\&= \frac{1}{\gamma }\left\| z_k-x^*\right\| ^2 -\frac{1}{\gamma }\left\| z_{k+1}-x^*\right\| ^2- \left\langle z_{k+1}-x^*, G(y_k)\right\rangle \\&= \frac{1}{\gamma }\left\| z_k-x^*\right\| ^2 -\frac{1}{\gamma }\left\| z_{k+1}-x^*\right\| ^2-\left\langle G(y_k),z_k-x^*\right\rangle +\gamma \Vert G(y_k)\Vert ^2. \end{aligned}$$

Thus, we have

$$\begin{aligned} \left\langle G(y_k),z_k-x^*\right\rangle =\frac{1}{2\gamma }\left( \left\| z_k-x^*\right\| ^2-\left\| z_{k+1}-x^*\right\| ^2\right) + \frac{\gamma }{2}\Vert G(y_k)\Vert ^2. \end{aligned}$$
(53)

Combining formula (53) and formula (50), we derive that

$$\begin{aligned} \varphi (x_{k+1})&\le \alpha \varphi (x_k)+\beta \varphi \left( x^*\right) +\frac{\beta }{2\gamma }\left( \left\| z_k-x^*\right\| ^2-\left\| z_{k+1}-x^*\right\| ^2\right) \\&\quad +\left( \frac{\beta \gamma }{2}-\frac{1}{2L}\right) \Vert G(y_k)\Vert ^2-\frac{\mu \alpha }{2}\Vert x_k-y_k\Vert ^2-\frac{\mu \beta }{2}\left\| x^*-y_k\right\| ^2\\&=\alpha \varphi (x_k)+\beta \varphi \left( x^*\right) +\frac{\beta }{2\gamma }\left( \left\| z_k-x^*\right\| ^2-\left\| z_{k+1}-x^*\right\| ^2\right) \\&\quad -\frac{\mu \alpha }{2}\Vert x_k-y_k\Vert ^2-\frac{\mu \beta }{2}\left\| x^*-y_k\right\| ^2, \end{aligned}$$

where the term \(\Vert G(y_k)\Vert ^2\) is eliminated since \(\frac{\beta \gamma }{2}=\frac{1}{2L}\). Note that (50) can be written as

$$\begin{aligned} z_k-x^*=\left( y_k-x^*\right) +\frac{1}{2}\left( \sqrt{\frac{L}{\mu }}-1\right) (y_k-x_k), \end{aligned}$$

with which we further derive that

$$\begin{aligned} \left\| z_k-x^*\right\| ^2&\le 2\left\| x^*-y_k\right\| ^2+ \frac{1}{2}\left( \sqrt{\frac{L}{\mu }}-1\right) ^2\Vert y_k-x_k\Vert ^2\\&\le \max \left\{ 2, \frac{1}{2}\left( \sqrt{\frac{L}{\mu }}-1\right) ^2\right\} \left( \left\| x^*-y_k\right\| ^2+\Vert y_k-x_k\Vert ^2\right) . \end{aligned}$$

Denote \(\eta _1:=\min \left\{ \frac{\mu \alpha }{2},\frac{\mu \beta }{2}\right\} \) and \(\eta _2:=\max \left\{ 2, \frac{1}{2}(\sqrt{\frac{L}{\mu }}-1)^2\right\} \). Then, we have

$$\begin{aligned} \varphi (x_{k+1})&\le \alpha \varphi (x_k)+\beta \varphi \left( x^*\right) +\frac{\beta }{2\gamma }\left( \left\| z_k-x^*\right\| ^2-\left\| z_{k+1}-x^*\right\| ^2\right) \\&-\eta _1\left( \left\| x^*-y_k\right\| ^2+\Vert y_k-x_k\Vert ^2\right) \\&\le \alpha \varphi (x_k)+\beta \varphi \left( x^*\right) +\frac{\beta }{2\gamma }\left( \left\| z_k-x^*\right\| ^2-\left\| z_{k+1}-x^*\right\| ^2\right) -\frac{\eta _1}{\eta _2}\left\| z_k-x^*\right\| ^2. \end{aligned}$$

Rearrange the terms to obtain

$$\begin{aligned} \varphi (x_{k+1})-\varphi \left( x^*\right) +\frac{\beta }{2\gamma }\left\| z_{k+1}-x^*\right\| ^2\le \alpha \left( \varphi (x_k)-\varphi \left( x^*\right) \right) +\left( \frac{\beta }{2\gamma }-\frac{\eta _1}{\eta _2}\right) \left\| z_k-x^*\right\| ^2. \end{aligned}$$

Thus, there exists a positive constant \(\theta _0<1\) such that for any \(\theta \in [\theta _0, 1)\) it holds

$$\begin{aligned} \varphi (x_{k+1})-\varphi \left( x^*\right) +\frac{\beta }{2\gamma }\left\| z_{k+1}-x^*\right\| ^2\le \alpha \left( \varphi (x_k)-\varphi \left( x^*\right) \right) +\frac{\theta \beta }{2\gamma }\left\| z_k-x^*\right\| ^2. \end{aligned}$$

Since \(\rho =\max \{\alpha , \theta \}\), we have that \(\rho <1\) and \(\frac{\theta }{\rho }\le 1\). Thus, we obtain

$$\begin{aligned} \varphi (x_{k+1})-\varphi \left( x^*\right) +\frac{\theta \beta }{2\rho \gamma }\left\| z_{k+1}-x^*\right\| ^2&\le \alpha \left( \varphi (x_k)-\varphi \left( x^*\right) \right) +\frac{\theta \beta }{2\gamma }\left\| z_k-x^*\right\| ^2\\&\le \rho \left( \varphi (x_k)-\varphi \left( x^*\right) +\frac{\theta \beta }{2\rho \gamma }\left\| z_k-x^*\right\| ^2\right) , \end{aligned}$$

i.e., \(\Phi _{k+1}(x^*;\tau )\le \rho \cdot \Phi _k(x^*;\tau )\) with \(\tau =\frac{\theta \beta }{2\rho \gamma }\). This is just the announced result (48).

It remains to show (49). In fact, if \(\theta = \max \{\theta _0, \alpha \}\), then \(\rho =\max \{\alpha , \theta \}=\max \{\theta _0, \alpha \}=\theta \) and hence

$$\begin{aligned} \tau =\frac{\theta \beta }{2\rho \gamma }=\frac{\beta }{2\gamma }=\frac{2L \mu }{\left( \sqrt{L}+\sqrt{\mu }\right) ^2}. \end{aligned}$$

This completes the proof. \(\square \)

Remark 7

It should be noted that here we only establish the existence of a linear convergence rate for Nesterov’s accelerated forward–backward method. It is not clear, however, whether one can derive an explicit rate such as \(1-\sqrt{\frac{\mu }{L}}\), as obtained for Nesterov’s accelerated gradient method.
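To illustrate Theorem 4 numerically, the following minimal sketch (the quadratic data and the box are illustrative assumptions, and \(x^*\) is approximated by running plain projected gradient for many iterations) applies the accelerated forward–backward scheme above to a box-constrained strongly convex quadratic, an instance of problem (39) for which (43) holds by Remark 6(i), and prints the ratios \(\Phi _{k+1}/\Phi _k\) with \(\tau =\frac{2L\mu }{(\sqrt{L}+\sqrt{\mu })^2}\).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20
B = rng.standard_normal((n, n))
H = B.T @ B + 0.5 * np.eye(n)                 # Hessian of e: strongly convex quadratic
c = rng.standard_normal(n)
lo, hi = -1.0, 1.0                            # the box [-1, 1]^n, g = indicator of the box

e = lambda x: 0.5 * x @ H @ x + c @ x
grad = lambda x: H @ x + c
mu, L = np.linalg.eigvalsh(H)[[0, -1]]
proj = lambda x: np.clip(x, lo, hi)
p = lambda y: proj(y - grad(y) / L)           # p(y) of Definition 6 in this setting
G = lambda y: L * (y - p(y))                  # gradient mapping = residual measure operator

x_star = np.zeros(n)                          # reference solution via plain projected gradient
for _ in range(50000):
    x_star = proj(x_star - grad(x_star) / L)
phi_min = e(x_star)

tau = 2 * L * mu / (np.sqrt(L) + np.sqrt(mu))**2
alpha = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
Phi = lambda x, z: e(x) - phi_min + tau * np.linalg.norm(z - x_star)**2

x_prev = x = rng.uniform(lo, hi, n)
prev = None
for k in range(40):
    y = x + alpha * (x - x_prev)
    z = 0.5 * (1 + np.sqrt(L / mu)) * y + 0.5 * (1 - np.sqrt(L / mu)) * x
    val = Phi(x, z)
    if prev is not None:
        print(k, val / prev)                  # Q-linear decrease of Phi_k, cf. (48)-(49)
    prev = val
    x_prev, x = x, y - G(y) / L               # accelerated forward-backward step
```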

8 A class of dual functions satisfying EB conditions

Verifying EB conditions for functions with certain structure is a difficult topic. In this section, we consider a class of dual objective functions that have interesting applications in signal processing and compressive sensing [31, 66]. We first describe the problem, along with some direct results.

Proposition 2

Consider the linearly constrained optimization problem

$$\begin{aligned} {\mathop {\mathrm{minimize}}\limits _{y\in {\mathbb {R}}^m}}\, g(y),~~\text {subject to} ~~ Ay=b, \end{aligned}$$
(P)

where \(g: {\mathbb {R}}^m \rightarrow {\mathbb {R}}\) is a real-valued and strongly convex function with modulus \(c>0\), \(A\in {\mathbb {R}}^{n\times m}\) is a given matrix with \(m\le n\), and \(b\in R(A)\) is a given vector. Here, R(A) stands for the range of A. The dual problem is

$$\begin{aligned} {\mathop {\mathrm{minimize}}\limits _{x\in {\mathbb {R}}^n}}\, f(x):= g^*\left( A^Tx\right) -\langle b, x\rangle . \end{aligned}$$
(D)

Then, we have that

  • the primal problem (P) has a unique optimal solution \({\bar{y}}\),

  • the dual objective function f belongs to \({{\mathcal {F}}}^{1,1}_{L}({\mathbb {R}}^n)\) with \(L=\frac{\Vert A\Vert ^2}{c}\), and

  • the set of optimal solutions of the dual problem,

    $$\begin{aligned} \mathrm {Arg}\min f:=\left\{ x\in {\mathbb {R}}^n: A\nabla g^*\left( A^Tx\right) =b\right\} , \end{aligned}$$

    is a nonempty closed convex set, and can be characterized by \(\{x\in {\mathbb {R}}^n: A^Tx\in \partial g({\bar{y}})\}\) or equivalently by \(\{x\in {\mathbb {R}}^n: \nabla g^*(A^Tx)={\bar{y}}\}\).

Proof

The first two statements are standard results which can be found in textbooks on convex analysis, and no proof will be given here. Now, we prove the third statement. First, let the Lagrangian function be given by \(L(y, x)=g(y)-\langle Ay-b, x\rangle .\) By the assumption \(b\in R(A)\) and the finiteness of the optimal value of the primal problem, according to Proposition 5.3.3 in [9], for any \({\bar{x}}\in \mathrm {Arg}\min f\) we have that \({\bar{y}}\in \mathrm {Arg}\min L(y, {\bar{x}}).\) Hence, \(A^T{\bar{x}}\in \partial g({\bar{y}})\) or equivalently \(\nabla g^*(A^T{\bar{x}})={\bar{y}}\) due to \((\partial g)^{-1}=\nabla g^*\), which holds by Corollary 23.5.1 in [47]. This implies that \(\mathrm {Arg}\min f\subseteq \{x\in {\mathbb {R}}^n: \nabla g^*(A^Tx)={\bar{y}}\}\). The reverse inclusion is obvious since \(A{\bar{y}}=b\). Thereby,

$$\begin{aligned} \mathrm {Arg}\min f=\left\{ x\in {\mathbb {R}}^n: \nabla g^*\left( A^Tx\right) ={\bar{y}}\right\} =\left\{ x\in {\mathbb {R}}^n: A^Tx\in \partial g({\bar{y}})\right\} . \end{aligned}$$

This completes the proof. \(\square \)
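As a concrete instance of Proposition 2 (with illustrative data), take \(g(y)=\frac{c}{2}\Vert y\Vert ^2\), so that \(g^*(z)=\frac{1}{2c}\Vert z\Vert ^2\) and \(f(x)=\frac{1}{2c}\Vert A^Tx\Vert ^2-\langle b, x\rangle \). The sketch below checks numerically that \(\nabla f\) is Lipschitz with modulus \(L=\Vert A\Vert ^2/c\) and that a dual solution \({\hat{x}}\) satisfies \(A^T{\hat{x}}\in \partial g({\bar{y}})=\{c{\bar{y}}\}\), in accordance with the characterization of \(\mathrm {Arg}\min f\).

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, c = 12, 5, 2.0                        # A in R^{n x m} with m <= n; g has modulus c
A = rng.standard_normal((n, m))
b = A @ rng.standard_normal(m)              # ensures b in R(A)

y_bar = np.linalg.pinv(A) @ b               # unique primal solution of (P) for g = c/2 ||.||^2
grad_f = lambda x: A @ (A.T @ x) / c - b    # grad f(x) = A grad g*(A^T x) - b

L_claim = np.linalg.norm(A, 2)**2 / c       # Lipschitz modulus from Proposition 2
L_exact = np.linalg.eigvalsh(A @ A.T / c)[-1]
print(abs(L_claim - L_exact))               # ~0

x_hat = np.linalg.lstsq(A @ A.T, c * b, rcond=None)[0]   # solves A A^T x = c b
print(np.linalg.norm(grad_f(x_hat)))                     # ~0: x_hat lies in Argmin f
print(np.linalg.norm(A.T @ x_hat - c * y_bar))           # ~0: A^T x_hat is the element of dg(y_bar)
```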

Now, we state the main result of this section.

Theorem 5

Use the same setting as Proposition 2. Denote \(X_r:=\{x\in {\mathbb {R}}^n: f(x)\le \min f +r\}\) with \(r\ge 0\) and \(V_r:={\mathrm {cl}}(A^TX_r)\), where \({\mathrm {cl}}(A^TX_r)\) stands for the closure of \(A^TX_r\). If the following assumptions hold:

  1. (a)

    \(\partial g\) is calm around \({\bar{y}}\) for any \({\bar{z}}\in V_0\),

  2. (b)

    the collection \(\{\partial g({\bar{y}}), R(A^T)\}\) is linearly regular with constant \(\gamma >0\), that is

    $$\begin{aligned} d\left( A^Tx, \partial g({\bar{y}})\right) \ge \gamma \cdot d\left( A^Tx, \partial g({\bar{y}})\cap R\left( A^T\right) \right) ,~~\forall x\in {\mathbb {R}}^n, \end{aligned}$$

then we have that

  1. (i)

    There exist positive constants \(r_0, \tau \) such that the \((f, \tau , X_{r_0})\)-(obj-EB) condition holds, that is

    $$\begin{aligned} f(x)-\min f \ge \frac{\tau }{2}\cdot d^2(x, {\mathbf {crit}}f), ~~\forall x\in X_{r_0}. \end{aligned}$$
    (54)

    Specifically, if \(\partial g\) is calm with constant \(\kappa >0\) around \({\bar{y}}\) for any \({\bar{z}}\in V_0\), then (54) holds for all \(\tau \in (0, \kappa ^{-1})\).

  2. (ii)

    For any sublevel set \(X_r\), pick \(r_1\in (0, r_0)\) and let \(c_r:=\sqrt{\frac{r_1}{r}}\) and

    $$\begin{aligned} \rho _r:= \left\{ \begin{array}{lll} c_r, &{}\quad \text {when} ~~r\ge r_0, \\ 1, &{}\quad \text {when}~~r\le r_0. \end{array} \right. \end{aligned}$$

    Then, the \((\nabla f, \nu , X_{r})\)-(cor-EB) condition with \(\nu =\frac{\tau \rho _r^2}{8}\) holds.

Proof

We first prove that \(V_r\) is compact for any \(r\ge 0\). To this end, letting \(f_r=\min f+r\) and using the fact \(b=A{\bar{y}}\), we write \(X_r\) into the following form:

$$\begin{aligned} X_r=\left\{ x\in {\mathbb {R}}^n: g^*\left( A^Tx\right) -\left\langle {\bar{y}}, A^Tx\right\rangle \le f_r\right\} . \end{aligned}$$

Denote

$$\begin{aligned} Y_r:=\left\{ y\in {\mathbb {R}}^m: g^*(y)-\langle {\bar{y}}, y\rangle \le f_r\right\} . \end{aligned}$$

Obviously, \(A^TX_r\subseteq Y_r\). Let \({\tilde{g}}(\cdot ):=g^*(\cdot )-\langle {\bar{y}}, \cdot \rangle \). Then, \({\tilde{g}}^*(y)=g(y+{\bar{y}})\). Thus, \({\mathrm {dom}}{\tilde{g}}^*={\mathrm {dom}}g={\mathbb {R}}^m\). This implies that \({\tilde{g}}\) is coercive (see Theorem 11.8 in [48]) and hence \(Y_r=\{y\in {\mathbb {R}}^m: {\tilde{g}}(y) \le f_r\}\) is bounded. Furthermore, since \({\tilde{g}}\) is continuous, \(Y_r\) is closed and hence compact. Thereby, \(V_r={\mathrm {cl}}(A^TX_r)\subseteq Y_r\) is bounded and hence also compact.

Second, we show that \(V_0 \subseteq \partial g({\bar{y}})\). Recall that we have shown that \(X_0=\{x\in {\mathbb {R}}^n: A^Tx\in \partial g({\bar{y}})\}\) in Proposition 2. Hence, \(A^TX_0\subseteq \partial g({\bar{y}})\). Since g is a real-valued convex function, \(\partial g({\bar{y}})\) must be nonempty, closed, and bounded according to Theorem 23.4 in [47] and Theorem 8.6 in [48]. Therefore, \(V_0={\mathrm {cl}}(A^TX_0)\subseteq \partial g({\bar{y}})\).

Now, since \(\partial g\) is calm at \({\bar{y}}\) for any \({\bar{z}}\in V_0\) and \(V_0 \subseteq \partial g({\bar{y}})\) is compact, by Proposition 2 in [69] we can conclude that there exist constants \(\kappa , \epsilon >0\) such that

$$\begin{aligned} \partial g(y)\cap \left( V_0+\epsilon {\mathbb {B}}_{\mathcal {E}}\right) \subseteq \partial g({\bar{y}})+\kappa \cdot \Vert y-{\bar{y}}\Vert _2 {\mathbb {B}}_{\mathcal {E}}, \quad \forall y\in {\mathcal {E}}, \end{aligned}$$
(55)

where we denote \({\mathbb {R}}^m\) by \({\mathcal {E}}\) for simplicity. Pick \(z\in V_0+\epsilon {\mathbb {B}}_{\mathcal {E}}\) and let \(y=\nabla g^*(z)\). Then, \(z\in \partial g(y)\) due to \(\partial g =(\nabla g^*)^{-1}\) and hence \(z \in \partial g(y)\cap (V_0+\epsilon {\mathbb {B}}_{\mathcal {E}})\). By inclusion (55), we obtain

$$\begin{aligned} d(z, \partial g({\bar{y}}))\le \kappa \Vert y-{\bar{y}}\Vert _2=\kappa \cdot d\left( {\bar{y}}, \nabla g^*(z)\right) , \quad \forall z\in V_0+\epsilon {\mathbb {B}}_{\mathcal {E}}, \end{aligned}$$
(56)

which can be rewritten as

$$\begin{aligned} d\left( z, \left( \nabla g^*\right) ^{-1}({\bar{y}})\right) \le \kappa \cdot d\left( {\bar{y}}, \nabla g^*(z)\right) , \quad \forall z\in V_0+\epsilon {\mathbb {B}}_{\mathcal {E}}. \end{aligned}$$
(57)

This implies that \(\nabla g^*\) is always metrically subregular at each \({\bar{z}}\in V_0\) for \({\bar{y}}\). Thus, by Theorem 3.1 in [21], for each \({\bar{z}}\in V_0\) there exists a neighborhood \({\bar{z}}+\epsilon ({\bar{z}}){\mathbb {B}}_{\mathcal {E}}\) and a positive constant \(\alpha ({\bar{z}})\) such that

$$\begin{aligned}&g^*(z)\ge g^*({\bar{z}})-\langle {\bar{y}},{\bar{z}}-z\rangle +\frac{\alpha ({\bar{z}})}{2}\cdot d^2\left( z,\left( \nabla g^*\right) ^{-1}({\bar{y}})\right) , \nonumber \\&\quad \forall z\in {\mathcal {E}}~~ \text {with}~~ \Vert z-{\bar{z}}\Vert _2\le \epsilon ({\bar{z}}), \end{aligned}$$
(58)

where the constant \(\alpha ({\bar{z}})\) can be chosen arbitrarily in \((0, \kappa ^{-1})\). Note that \(\{{\bar{z}}+\epsilon ({\bar{z}}){\mathbb {B}}^o_{\mathcal {E}}\}_{{\bar{z}}\in V_0}\) forms an open cover of the compact set \(V_0\). Hence, by the Heine-Borel theorem, there exist K points (where \(K\ge 1\) is finite) \({\bar{z}}_1,\dots , {\bar{z}}_K\in V_0\) such that

$$\begin{aligned} V_0\subseteq U:= \bigcup _{i=1}^K\left( {\bar{z}}_i+\epsilon ({\bar{z}}_i){\mathbb {B}}^o_{\mathcal {E}}\right) . \end{aligned}$$

Let \(\alpha =\min \{\alpha ({\bar{z}}_1), \dots , \alpha ({\bar{z}}_K)\}\); since each \(\alpha ({\bar{z}}_i)\) can be chosen arbitrarily in \((0, \kappa ^{-1})\), so can \(\alpha \). Note also that \(\min f = g^*({\bar{z}})-\langle {\bar{y}}, {\bar{z}}\rangle \) for all \({\bar{z}}\in V_0\), and that every \(z\in U\) satisfies \(\Vert z-{\bar{z}}_i\Vert _2<\epsilon ({\bar{z}}_i)\) for some index i. Hence, from (58) applied with \({\bar{z}}={\bar{z}}_i\), we have

$$\begin{aligned} g^*(z)-\langle {\bar{y}}, z\rangle \ge \min f +\frac{\alpha }{2}\cdot d^2\left( z,\left( \nabla g^*\right) ^{-1}({\bar{y}})\right) , \quad \forall z\in U. \end{aligned}$$

Letting \(r_0>0\) be small enough such that \(V_{r_0}\subseteq U\) and using the fact that \((\nabla g^*)^{-1} = \partial g\), we obtain

$$\begin{aligned} g^*(z)-\langle {\bar{y}}, z\rangle \ge \min f +\frac{\alpha }{2}\cdot d^2(z, \partial g({\bar{y}})), \quad \forall z\in V_{r_0}, \end{aligned}$$

and hence,

$$\begin{aligned} f(x) - \min f \ge \frac{\alpha }{2}\cdot d^2\left( A^Tx, \partial g({\bar{y}})\right) , \quad \forall x\in X_{r_0}. \end{aligned}$$
(59)

Using the linear regularity property of \(\{\partial g({\bar{y}}), R(A^T)\}\), we derive that

$$\begin{aligned} d\left( A^Tx, \partial g({\bar{y}})\right)&\ge \gamma \cdot d\left( A^Tx, \partial g ({\bar{y}})\cap R\left( A^T\right) \right) = \gamma \cdot \min _{A^Tu\in \partial g({\bar{y}})}\left\| A^Tx-A^Tu\right\| \\&= \gamma \cdot \min _{y\in A^T X_0}\left\| A^Tx-y\right\| \ge \gamma \cdot \min _{y\in V_0}\left\| A^Tx-y\right\| = \gamma \cdot \left\| A^Tx- {\hat{y}}\right\| , \end{aligned}$$

where such \({\hat{y}}\in V_0\) exists due to the compactness of \(V_0\). Now, we follow the argument in [50] to finish the proof of (i). Since \({\hat{y}}\in V_0={\mathrm {cl}}(A^TX_0)\), we can find a sequence \(\{x_n\}_{n=0}^{\infty }\subset X_0\) such that \(A^Tx_n \rightarrow {\hat{y}}\) as \(n\rightarrow +\infty \). Denote the null space of \(A^T\) by \(N(A^T)\) and the minimal positive singular value of A by \(\sigma (A)\). Using the fact that \(\mathrm {Arg}\min f+N(A^T) \subseteq \mathrm {Arg}\min f\), we can derive that

$$\begin{aligned} d(x, \mathrm {Arg}\min f)\le \left\| x-\left( x_n+{\mathcal {P}}_{N\left( A^T\right) }(x-x_n)\right) \right\| \le \frac{1}{\sigma (A)}\left\| A^Tx-A^Tx_n\right\| ,\quad n\ge 0, \end{aligned}$$

where \({\mathcal {P}}_{N(A^T)}\) stands for the orthogonal projection operator onto \(N(A^T)\); the second inequality holds because \(x-\left( x_n+{\mathcal {P}}_{N(A^T)}(x-x_n)\right) ={\mathcal {P}}_{R(A)}(x-x_n)\) with \(R(A)=N(A^T)^{\perp }\), \(\Vert A^Tv\Vert \ge \sigma (A)\Vert v\Vert \) for every \(v\in R(A)\), and \(A^T{\mathcal {P}}_{R(A)}(x-x_n)=A^T(x-x_n)\). Thus, by letting \(n\rightarrow +\infty \), we obtain

$$\begin{aligned} d(x, \mathrm {Arg}\min f)\le \frac{1}{\sigma (A)}\left\| A^Tx-{\hat{y}}\right\| \le \frac{d\left( A^Tx, \partial g({\bar{y}})\right) }{\gamma \cdot \sigma (A)}. \end{aligned}$$
(60)

Note that \(\mathrm {Arg}\min f={\mathbf {crit}}f\). Thereby, in view of (59) and (60), the (obj-EB) condition follows with \(\tau =\alpha \gamma ^2\sigma ^2(A)\).
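Explicitly, combining (59) with (60) shows that, for every \(x\in X_{r_0}\),

$$\begin{aligned} f(x)-\min f\ge \frac{\alpha }{2}\cdot d^2\left( A^Tx, \partial g({\bar{y}})\right) \ge \frac{\alpha \gamma ^2\sigma ^2(A)}{2}\cdot d^2(x, \mathrm {Arg}\min f)=\frac{\tau }{2}\cdot d^2(x, {\mathbf {crit}}f). \end{aligned}$$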

Let us prove (ii). Without loss of generality, we assume that \(\min f=0\) and \(r\ge r_0\). Since for any \(r>0\) the sublevel set \(X_r\) is \(\nabla f\)-invariant, using (54) together with the equivalence established in Corollary 2, we can conclude that f satisfies the \((\nabla f, \eta , X_{r_0})\)-(res-obj-EB) condition with \(\eta =\sqrt{\frac{\tau }{2}}\), that is,

$$\begin{aligned} \forall ~ x\in X_{r_0},~~\Vert \nabla f(x)\Vert \ge \eta \cdot \sqrt{f(x) }. \end{aligned}$$
(61)

Let \(\varphi (t):=2\eta ^{-1}t^{\frac{1}{2}}\). Then, inequality (61) can be written as

$$\begin{aligned} \forall ~ x\in X_{r_0},~~\Vert \nabla f(x)\Vert \varphi ^\prime (f(x))\ge 1. \end{aligned}$$
(62)
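Indeed, since \(\varphi ^\prime (t)=\eta ^{-1}t^{-\frac{1}{2}}\) for \(t>0\), inequalities (61) and (62) coincide at every x with \(f(x)>0\):

$$\begin{aligned} \Vert \nabla f(x)\Vert \varphi ^\prime (f(x))=\frac{\Vert \nabla f(x)\Vert }{\eta \sqrt{f(x)}}\ge 1 \quad \Longleftrightarrow \quad \Vert \nabla f(x)\Vert \ge \eta \sqrt{f(x)}. \end{aligned}$$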

By applying Proposition 30 in [10], a globalization result for KL inequalities, to (62), we conclude that, for the given \(r_1\in (0, r_0)\), the function defined by

$$\begin{aligned} \phi (t):= \left\{ \begin{array}{lll} \varphi (t), &{}\text {when} ~~t\le r_1, \\ \varphi (r_1)+(t-r_1)\varphi ^\prime (r_1), &{}\text {when}~~t\ge r_1, \end{array} \right. \end{aligned}$$

is desingularizing for f on all of \({\mathbb {R}}^n\), and hence

$$\begin{aligned} \forall ~ x\in X_{r},~~\Vert \nabla f(x)\Vert \phi ^\prime (f(x))\ge 1. \end{aligned}$$
(63)
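Note that the derivative of \(\phi \) is given by

$$\begin{aligned} \phi ^\prime (t)= \left\{ \begin{array}{ll} \eta ^{-1}t^{-\frac{1}{2}}, &{}\text {when} ~~0<t\le r_1, \\ \eta ^{-1}r_1^{-\frac{1}{2}}, &{}\text {when}~~t\ge r_1. \end{array} \right. \end{aligned}$$

In particular, \(\phi ^\prime (f(x))=\eta ^{-1}r_1^{-\frac{1}{2}}\) whenever \(f(x)\ge r_1\).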

Thereby, we can get

$$\begin{aligned} \Vert \nabla f(x)\Vert \ge \eta \sqrt{r_1},~~~\forall x\in X_r\cap X^c_{r_1}, \end{aligned}$$

where \(X^c_{r_1}\) is the complement of \(X_{r_1}\). By the definition of \(c_r\), we can further obtain

$$\begin{aligned} \Vert \nabla f(x)\Vert \ge \eta c_r\sqrt{r}\ge \eta c_r \sqrt{f(x)},~~~\forall x\in X_r\cap X^c_{r_1}. \end{aligned}$$

Finally, recalling the expression of \(\rho _r\) and combining the above with (61), we have for \(r>0\)

$$\begin{aligned} \Vert \nabla f(x)\Vert \ge \eta \rho _r \sqrt{f(x)},~~~\forall x\in X_r, \end{aligned}$$

which is just the \((\nabla f, \eta \rho _r, X_r)\)-(res-obj-EB) condition. Thus, the \((\nabla f, \nu , X_{r})\)-(cor-EB) condition follows from Corollary 2. Using the relevant formulas in Theorem 1, we have

$$\begin{aligned} \nu =\frac{1}{4}(\eta \rho _r)^2=\frac{1}{4}\rho _r^2 \cdot \frac{\tau }{2}=\frac{1}{8}\rho _r^2\tau , \end{aligned}$$

which completes the proof. \(\square \)

Remark 8

By directly invoking Corollary 4.3 in [1], we can derive (58) with the constant satisfying \(\alpha ({\bar{z}})\in (0, \frac{1}{4\kappa })\), which is slightly worse than the range \(\alpha ({\bar{z}})\in (0, \kappa ^{-1})\) obtained above.

Remark 9

The author of [50], under slightly different assumptions, proved by contradiction that the dual objective function \(f(x)= g^*(A^Tx)-\langle b, x\rangle \) satisfies the \((\nabla f, \nu , X_{r})\)-(cor-EB) condition. While the author of [50] requires that \(\partial g\) is calm around \({\bar{y}}\) for any \({\bar{z}}\in {\mathbb {R}}^m\), i.e., the local upper Lipschitz continuity property (4), we only require that \(\partial g\) is calm around \({\bar{y}}\) for any \({\bar{z}}\in V_0\). Our proof is by means of the KL inequality globalization technique developed in [10], and is hence quite different from that of [50].

Remark 10

Verifying EB conditions for more general functions of the form \(f(x):=h(Ax)+l(x)\) has been studied recently in [20, 32, 69]. Specialized to the dual objective function \(f(x)= g^*(A^Tx)-\langle b, x\rangle \), the existing theory usually requires \(g^*\) to be strictly or strongly convex; see e.g., Corollary 4.3 in [20] and Assumption 1 in [69]. In contrast, our study, following the line of research in [50], relies on exploiting the primal–dual structure and is thus quite different from that in [20, 32, 69].
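To make this concrete, consider the toy choice \(g(y)=\frac{1}{2}\Vert y\Vert ^2\), so that \(g^*(z)=\frac{1}{2}\Vert z\Vert ^2\), together with a rank-deficient matrix A and \(b=A{\bar{y}}\) with \({\bar{y}}\in R(A^T)\). Then the dual objective \(f(x)= g^*(A^Tx)-\langle b, x\rangle \) is not strongly convex, yet one checks directly that \(f(x)-\min f=\frac{1}{2}\Vert A^T(x-x^*)\Vert ^2\ge \frac{\sigma ^2(A)}{2}\cdot d^2(x, \mathrm {Arg}\min f)\) for any minimizer \(x^*\), an inequality of the same quadratic-growth type as the one obtained by combining (59) and (60). The following Python sketch (not part of the original analysis; the toy data and all variable names are ours) verifies this bound numerically.

import numpy as np

# Toy instance: g(y) = 0.5*||y||^2, so g*(z) = 0.5*||z||^2, and a rank-deficient A,
# so that f(x) = g*(A^T x) - <b, x> = 0.5*||A^T x||^2 - <b, x> is NOT strongly convex.
rng = np.random.default_rng(0)
n, m = 6, 4
A = rng.standard_normal((n, m))
A[:, -1] = A[:, 0]                    # force rank deficiency of A (hence of A A^T)
ybar = A.T @ rng.standard_normal(n)   # choose ybar in R(A^T) so that Argmin f is nonempty
b = A @ ybar

def f(x):
    return 0.5 * np.linalg.norm(A.T @ x) ** 2 - b @ x

# One minimizer; the full solution set is Argmin f = x_star + N(A^T).
x_star = np.linalg.lstsq(A @ A.T, b, rcond=None)[0]
f_min = f(x_star)

# Orthogonal projector onto R(A) = N(A^T)^perp and the minimal positive singular value of A.
U, s, _ = np.linalg.svd(A, full_matrices=False)
r = int((s > 1e-10).sum())
proj_RA = U[:, :r] @ U[:, :r].T
sigma = s[r - 1]

# Empirically check f(x) - min f >= (sigma^2 / 2) * d(x, Argmin f)^2 at random points.
worst = np.inf
for _ in range(1000):
    x = x_star + rng.standard_normal(n)
    dist = np.linalg.norm(proj_RA @ (x - x_star))  # equals d(x, Argmin f)
    if dist > 1e-8:
        worst = min(worst, (f(x) - f_min) / dist ** 2)

print(f"empirical min ratio: {worst:.4f}  >=  predicted bound: {0.5 * sigma ** 2:.4f}")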

9 Discussion

In this paper, we provide a new perspective for studying EB conditions and for analyzing linear convergence of gradient-type methods. Under our theoretical framework, a group of new technical results is obtained. In particular, some EB conditions, previously known to be sufficient for linear convergence, are shown to be necessary as well; and Nesterov’s accelerated forward–backward algorithm, previously known to be R-linearly convergent, is shown to be Q-linearly convergent as well. Finally, we close this paper with the following possible directions for future work:

  1. We have defined a group of abstract EB conditions of “square type”. However, we do not know whether the underlying idea can be extended to EB conditions of general types by introducing so-called desingularizing functions [10], so that the other EB conditions discussed in [22] could be covered by a more general framework.

  2. Although we have given sufficient conditions guaranteeing linear convergence for PALM and Nesterov’s accelerated forward–backward algorithms, it is still unclear whether they are also necessary. The very recent work [37] might shed light on this topic.

  3. Verifying EB conditions that hold with high probability for non-convex functions has proven to be a very powerful approach in non-convex optimization; see e.g. [14, 34, 56]. Thus, seeking or verifying new classes of non-convex functions that satisfy EB conditions with high probability deserves future study.

  4. What are the optimal rates of linear convergence (that is, the exact worst-case convergence rates) for gradient-type methods under general EB conditions? The method of performance estimation, originally proposed in [18] and further developed in [30, 54, 55], might be useful for this topic.

  5. Ordinary differential equation (ODE) approaches have recently been used to study (accelerated) gradient-type methods [53, 58]. Except for one paper [60], existing analyses only consider general convex and strongly convex settings and do not cover general EB conditions. It would be interesting to investigate whether the EB conditions presented in this paper can be embedded in the ODE approaches to study linear convergence of gradient-type methods.