1 Introduction

To obtain globally linear convergence rates of gradient-type methods for minimizing convex (not necessarily strongly convex) differentiable functions, we recently proposed restricted strong convexity (RSC) [19, 20], which is a strictly weaker concept than strong convexity. Up to now, the RSC property has proved to be a very powerful tool for analyzing descent methods, including the (in)exact gradient method, restarted nonlinear CG, BFGS, and its damped limited-memory variant L-D-BFGS [17, 20]. Almost in parallel with the work [19], the authors of [11, 18] defined the global error bound (GEB) property in the spirit of Hoffman’s celebrated result on error bounds for systems of linear inequalities [7, 10]. They showed that the GEB property also guarantees globally linear convergence of descent methods. Moreover, they identified a class of convex programs, frequently appearing in machine learning, that obey the GEB property. Very recently, the authors of [6, 9, 13] proposed the quadratic growth (QG) property under different names; it was called the second order growth property in [13], optimal strong convexity in [9], and the semi-strongly convex property in [6]. They showed that the QG property can likewise guarantee globally linear convergence of descent methods. Since each of these three notions serves as a linear convergence guarantee, it is natural to ask how they are related. With the help of Ekeland’s variational principle, we show that they are actually equivalent.

To deal with convex minimization over a closed convex set and with structured convex optimization, we propose a group of modified versions and a group of extended versions of these three notions, obtained by replacing the gradient with the gradient mapping and the proximal gradient mapping [14], respectively. Similarly, the modified and extended versions can be used to derive globally linear convergence results for a large class of descent methods in convex minimization [3, 5, 6, 13]. If the objective function of the convex program has a Lipschitz-continuous gradient, we prove that the equivalence still holds for the modified and extended versions. Based on these equivalent notions, we establish new asymptotically linear convergence results for the proximal gradient method, which complement recently developed theory.

The equivalence results in this paper provide alternative ways to check whether a given convex minimization problem satisfies the RSC property; in some cases, checking one of the equivalent properties may be much easier than checking the others. As a case study, we investigate the problem of minimizing the composition of an affine mapping with a strongly convex differentiable function over a polyhedral set, which is very popular in machine learning. We prove that this problem enjoys a strengthened property of the RSC type, and hence the modified GEB property, without any compactness assumption on the polyhedral set.

At the time of writing this paper, the authors of [4] showed the equivalence between the error bound (corresponding to our growth property) and the Kurdyka-\({\L }\)ojasiewicz inequality for convex functions having a moderately flat profile near the set of minimizers. Since the convex functions considered here are differentiable, this paper may provide results complementary to those in [4]. When our paper was under review, the authors of [5] posted on arXiv their paper concerning the equivalence between the error bound and quadratic growth properties. They defined the error bound condition via the proximal gradient mapping, which generalizes our modified error bound property and the global error bound originally proposed in [18]. However, they did not discuss the equivalence to the RSC property. Besides, our convergence results for the proximal gradient method are new and might be of interest in themselves.

The rest of the paper is organized as follows. In Sect. 2, we present the basic notation and concepts discussed in this paper. In Sect. 3, we analyze the relationships among the different notions. In Sect. 4, we discuss the convergence of the proximal gradient method implied by the equivalent notions. Sect. 5 is devoted to a case study.

2 Notation and definitions

We denote by \(d(x, \mathcal {Y})\) the distance from a point x to a nonempty closed set \(\mathcal {Y}\); that is, \(d(x, \mathcal {Y})=\inf _{y\in \mathcal {Y}}\Vert x-y\Vert \). When \(\mathcal {Y}\) is a singleton, i.e., \(\mathcal {Y}=\{y\}\), we write \(d(x,y)\) instead of \(d(x, \mathcal {Y})\) for simplicity. We will consider functions that take values in the extended real line \(\overline{\mathbb {R}}:=\mathbb {R}\bigcup \{+\infty \}\). The projection of x onto a nonempty closed convex set \(\mathcal {Y}\) is denoted by \([x]_\mathcal {Y}^+\). The spectral norm of a matrix X is given by \(\Vert X\Vert \). The terminology below follows [14]. A convex differentiable function g is gradient-Lipschitz-continuous if there exists a positive scalar L such that

$$\begin{aligned} \Vert \nabla g(x)-\nabla g(y)\Vert \le L\Vert x-y\Vert , \quad \forall x, y\in \mathbb {R}^n, \end{aligned}$$
(1)

or equivalently,

$$\begin{aligned} 0\le g(y)- g(x)-\langle \nabla g(x), y-x\rangle \le \frac{L}{2}\Vert y-x\Vert ^2, \quad \forall x, y\in \mathbb {R}^n, \end{aligned}$$
(2)

and strongly convex if there exists a positive scalar \(\mu \) such that

$$\begin{aligned} \langle \nabla g(x)-\nabla g(y), x-y\rangle \ge \mu \Vert x-y\Vert ^2, \quad \forall x, y\in \mathbb {R}^n. \end{aligned}$$
(3)
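
As a quick sanity check of (1)-(3), the following sketch (an illustrative example of ours, not part of the text) considers a quadratic \(g(x)=\frac{1}{2}x^TQx\) with Q symmetric positive definite and verifies numerically that one may take L as the largest eigenvalue of Q and \(\mu \) as the smallest.

```python
import numpy as np

# Minimal numerical check of (1)-(3) for a quadratic g(x) = 0.5 * x^T Q x,
# where Q is symmetric positive definite (an assumption of this sketch).
# For such g one may take L = lambda_max(Q) and mu = lambda_min(Q).
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
Q = B.T @ B + np.eye(5)           # symmetric positive definite
eigs = np.linalg.eigvalsh(Q)      # eigenvalues in ascending order
mu, L = eigs[0], eigs[-1]

grad = lambda x: Q @ x            # gradient of g

for _ in range(100):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    # gradient-Lipschitz-continuity (1)
    assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9
    # strong convexity (3)
    assert (grad(x) - grad(y)) @ (x - y) >= mu * np.linalg.norm(x - y) ** 2 - 1e-9
```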

2.1 Original versions for unconstrained convex program

Definition 1

Let \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) be convex differentiable. Denote \(f^*=\min _{x\in \mathbb {R}^n} f(x)\) and \(\mathcal {X}=\arg \min _{x\in \mathbb {R}^n} f(x)\) and assume that \(\mathcal {X}\) is nonempty. Then the unconstrained convex program

$$\begin{aligned} \mathop {{{\mathrm{minimize}}}}\limits _{x\in \mathbb {R}^n} f(x) \end{aligned}$$

obeys

  1. (a)

    the restricted strongly convex property with constant \(\nu >0\) if it satisfies the restricted secant inequality

    $$\begin{aligned} \langle \nabla f(x), x-[x]_\mathcal {X}^+\rangle \ge \nu \cdot d^2(x,\mathcal {X}),\quad \forall x\in \mathbb {R}^n. \end{aligned}$$
    (4)
  2. (b)

    the global error bound property with constant \(\kappa >0\) if it satisfies the error upper bound inequality

    $$\begin{aligned} \Vert \nabla f(x)\Vert \ge \kappa \cdot d(x,\mathcal {X}),\quad \forall x\in \mathbb {R}^n. \end{aligned}$$
    (5)
  3. (c)

    the quadratic growth property with constant \(\tau >0\) if it satisfies the second order growth of the function value

    $$\begin{aligned} f(x)-f^*\ge \frac{\tau }{2}\cdot d^2(x,\mathcal {X}), \quad \forall x\in \mathbb {R}^n. \end{aligned}$$
    (6)

We use \(RSC(\nu )\), \(GEB(\kappa )\), and \(QG(\tau )\) to stand for the defined properties above respectively.

To ensure that \([x]_\mathcal {X}^+\) is well defined, we need \(\mathcal {X}\) to be closed; this is implied by the differentiability (and hence continuity) of f.
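
To make Definition 1 concrete, the following sketch (an illustrative instance of our own choosing, not taken from the text) numerically estimates the RSC, GEB, and QG ratios for \(f(x)=\frac{1}{2}\Vert Ex-c\Vert ^2\) with a rank-deficient E, a convex but not strongly convex function of the type studied in Sect. 5.

```python
import numpy as np

# Illustrative instance of Definition 1: f(x) = 0.5 * ||E x - c||^2 with a
# rank-deficient E, so f is convex but NOT strongly convex.  The solution set is
# X = {E^+ c + v : E v = 0}, hence d(x, X) = ||E^+ E (x - E^+ c)||, where E^+ is
# the Moore-Penrose pseudoinverse (standard facts assumed for this sketch).
rng = np.random.default_rng(0)
E = rng.standard_normal((3, 6))            # rank 3 < 6
c = rng.standard_normal(3)
E_pinv = np.linalg.pinv(E)
x_star = E_pinv @ c                        # one particular minimizer

f = lambda x: 0.5 * np.sum((E @ x - c) ** 2)
grad = lambda x: E.T @ (E @ x - c)
resid = lambda x: E_pinv @ E @ (x - x_star)    # x - [x]_X^+
dist = lambda x: np.linalg.norm(resid(x))      # d(x, X)

ratios = []
for _ in range(1000):
    x = x_star + 10 * rng.standard_normal(6)
    d = dist(x)
    ratios.append((grad(x) @ resid(x) / d**2,        # RSC ratio, cf. (4)
                   np.linalg.norm(grad(x)) / d,      # GEB ratio, cf. (5)
                   2 * (f(x) - f(x_star)) / d**2))   # QG ratio, cf. (6)
print(np.min(ratios, axis=0))   # all three minima stay bounded away from zero
```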

Remark 1

The RSC property first appeared in [8] as a restricted secant inequality. The authors of [19, 20] formally defined it and identified a class of non-trivial RSC functions.

The authors of [11, 18] proposed the GEB property in the spirit of Hoffman’s error bounds [7]. The local version of GEB only implies asymptotic linear convergence rates [10].

The QG property appeared in a couple of papers [6, 9, 13] with different names. The equivalence between the RSC and the QG of convex differentiable functions was shown in [17, 20].

2.2 Modified versions for constrained convex program

To deal with constrained convex programs, we first introduce the gradient mapping [14].

Definition 2

Let \( \gamma >0\) be a fixed constant and Q be a closed convex set, and let \(\bar{x}\in \mathbb {R}^n\). Denote

$$\begin{aligned} x_Q(\bar{x};\gamma )= & {} \arg \min _{x\in Q}\left[ f(\bar{x})+ \langle \nabla f(\bar{x}), x-\bar{x}\rangle +\frac{\gamma }{2}\Vert x-\bar{x}\Vert ^2\right] \\ G^f_Q(\bar{x};\gamma )= & {} \gamma (\bar{x}-x_Q(\bar{x};\gamma )). \end{aligned}$$

We call \(G^f_Q(\bar{x};\gamma )\) the gradient mapping of function f on Q.

When \(Q=\mathbb {R}^n\), we have that

$$\begin{aligned} x_Q(\bar{x};\gamma )=\bar{x}-\frac{1}{\gamma }\nabla f(\bar{x}) \quad \text {and} \quad G^f_Q(\bar{x};\gamma )=\nabla f(\bar{x}). \end{aligned}$$

The latter implies that the gradient mapping generalizes the gradient notion.
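
The following sketch evaluates the gradient mapping of Definition 2 for a box constraint Q. It relies on the standard identity \(x_Q(\bar{x};\gamma )=[\bar{x}-\frac{1}{\gamma }\nabla f(\bar{x})]_Q^+\), obtained by completing the square in the quadratic model; this identity is an assumption of the sketch, not spelled out in the text.

```python
import numpy as np

# A minimal sketch of Definition 2 for a box Q = [lo, hi]^n, using the identity
# x_Q(xbar; gamma) = [xbar - grad_f(xbar)/gamma]_Q^+ (projection of the unconstrained
# model minimizer onto Q).
def gradient_mapping(grad_f, xbar, gamma, lo=-1.0, hi=1.0):
    x_Q = np.clip(xbar - grad_f(xbar) / gamma, lo, hi)  # minimizer of the quadratic model over Q
    return gamma * (xbar - x_Q)                          # G^f_Q(xbar; gamma)

grad_f = lambda x: 2.0 * x                               # f(x) = ||x||^2
xbar = np.array([3.0, -0.2])
print(gradient_mapping(grad_f, xbar, gamma=4.0))                         # box-constrained case
print(gradient_mapping(grad_f, xbar, gamma=4.0, lo=-np.inf, hi=np.inf))  # Q = R^n: equals grad_f(xbar)
```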

Definition 3

Let \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) be a convex differentiable function and let Q be a nonempty closed convex set. Denote \(\mathcal {X}=\arg \min _{x\in Q} f(x)\) and \(f^*=\min _{x\in Q} f(x)\) and assume that \(\mathcal {X}\) is nonempty. Let \(\gamma >0\) be a fixed constant. Then the constrained convex program

$$\begin{aligned} \mathop {{{\mathrm{minimize}}}}\limits _{x\in Q} f(x) \end{aligned}$$

obeys

  1. (a)

    the modified restricted strongly convex property on Q with constant \(\nu >0\) if it satisfies the restricted secant inequality

    $$\begin{aligned} \langle G^f_Q(y;\gamma ), y-[y]_\mathcal {X}^+\rangle \ge \nu \cdot d^2(y,\mathcal {X}),\quad \forall y\in Q. \end{aligned}$$
    (7)
  2. (b)

    the modified global error bound property on Q with constant \(\kappa >0\) if it satisfies the error upper bound inequality

    $$\begin{aligned} \Vert G^f_Q(y;\gamma )\Vert \ge \kappa \cdot d(y,\mathcal {X}), \quad \forall y\in Q. \end{aligned}$$
    (8)
  3. (c)

    the modified quadratic growth property on Q with constant \(\tau >0\) if it satisfies the second order growth of the function value

    $$\begin{aligned} f(y)-f^*\ge \frac{\tau }{2}\cdot d^2(y,\mathcal {X}), \quad \forall y\in Q. \end{aligned}$$
    (9)

We use \(mRSC(\nu )\), \(mGEB(\kappa )\), and \(mQG(\tau )\) to stand for the defined properties above respectively.

Again, the closedness of \(\mathcal {X}\) is guaranteed by the differentiability of f, and hence \([y]_\mathcal {X}^+\) is well defined.

Remark 2

The author of [17] also proposed a modified RSC by considering a convex constraint set Q, but required that both \(X_f=\arg \min _{x\in \mathbb {R}^n} f(x)\) and \(Q\bigcap X_f\) be nonempty. These assumptions are rather strong, and many practical problems fail to satisfy them.

Remark 3

When \(Q=\mathbb {R}^n\), the modified versions reduce to the corresponding original versions since \(G^f_Q(\bar{x};\gamma )=\nabla f(\bar{x})\). Therefore, the modified Definition 3 can be viewed as a generalization of Definition 1.

2.3 Extended versions via proximal gradient mapping

To introduce extended versions of the previous notions for structured convex optimization, we need the concept of the proximal gradient mapping [2].

Definition 4

Let \( \gamma >0\) be a fixed constant, \(f: \mathbb {R}^n\rightarrow \mathbb {R}\) be a differentiable function, \(g:\mathbb {R}^n\rightarrow \overline{\mathbb {R}}\) be a closed convex function, and \(\bar{x}\in \mathbb {R}^n\). Denote

$$\begin{aligned} p^f_g(\bar{x};\gamma )= & {} \mathop {\arg \min }\limits _{x}\left[ f(\bar{x})+ \langle \nabla f(\bar{x}), x-\bar{x}\rangle +\frac{\gamma }{2}\Vert x-\bar{x}\Vert ^2+g(x)\right] \\ G^f_g(\bar{x};\gamma )= & {} \gamma (\bar{x}-p^f_g(\bar{x};\gamma )). \end{aligned}$$

We call \(G^f_g(\bar{x};\gamma )\) the proximal gradient mapping of functions f and g.

Let Q be a closed nonempty convex set. When g is the indicator function

$$\begin{aligned} I_Q(x)=\left\{ \begin{array}{ll} 0,&{} \quad x\in Q,\\ +\infty , &{} \quad x\notin Q, \end{array} \right. \end{aligned}$$

we have that

$$\begin{aligned} p^f_g(\bar{x};\gamma )=x_Q(\bar{x};\gamma ). \end{aligned}$$

This implies that the proximal gradient mapping generalizes the gradient mapping.
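
For illustration, the following sketch computes the proximal gradient mapping of Definition 4 for \(g(x)=\lambda \Vert x\Vert _1\). It relies on the standard identity \(p^f_g(\bar{x};\gamma )=\mathrm {prox}_{g/\gamma }(\bar{x}-\frac{1}{\gamma }\nabla f(\bar{x}))\), which for the \(\ell _1\)-norm reduces to soft-thresholding; this identity is an assumption of the sketch, not spelled out in the text.

```python
import numpy as np

# A minimal sketch of Definition 4 for g(x) = lam * ||x||_1, using
# p^f_g(xbar; gamma) = prox_{g/gamma}(xbar - grad_f(xbar)/gamma), i.e.
# soft-thresholding with threshold lam/gamma.
def prox_grad_mapping(grad_f, xbar, gamma, lam):
    z = xbar - grad_f(xbar) / gamma                               # forward (gradient) step
    p = np.sign(z) * np.maximum(np.abs(z) - lam / gamma, 0.0)     # backward (prox) step
    return gamma * (xbar - p)                                     # G^f_g(xbar; gamma)

grad_f = lambda x: x - np.array([1.0, -2.0])                      # f(x) = 0.5*||x - (1, -2)||^2
print(prox_grad_mapping(grad_f, np.zeros(2), gamma=1.0, lam=0.5))
print(prox_grad_mapping(grad_f, np.zeros(2), gamma=1.0, lam=0.0))  # g = 0: reduces to grad_f(0)
```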

Definition 5

Let \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) be a convex differentiable function and \(g:\mathbb {R}^n\rightarrow \overline{\mathbb {R}}\) be a closed convex function. Denote \(\mathcal {X}=\arg \min _{x} \varphi (x):= f(x)+g(x)\) and \(\varphi ^*=\min _{x} \varphi (x)\) and assume that \(\mathcal {X}\) is nonempty. Let \(\gamma >0\) be a fixed constant. Then the following convex program

$$\begin{aligned} \mathop {{{\mathrm{minimize}}}}\limits _{x} \varphi (x)=f(x)+g(x) \end{aligned}$$

obeys

  1. (a)

    the extended restricted strongly convex property with parameters \(\nu , \omega >0\) if it satisfies the restricted secant inequality

    $$\begin{aligned} \langle G^f_g(y;\gamma ), y-[y]_\mathcal {X}^+\rangle \ge \nu \cdot d^2(y,\mathcal {X}),\quad \forall y\in [\varphi \le \varphi ^*+\omega ]. \end{aligned}$$
    (10)
  2. (b)

    the extended global error bound property with parameters \(\kappa , \omega >0\) if it satisfies the error upper bound inequality

    $$\begin{aligned} \Vert G^f_g(y;\gamma )\Vert \ge \kappa \cdot d(y,\mathcal {X}), \quad \forall y\in [\varphi \le \varphi ^*+\omega ]. \end{aligned}$$
    (11)
  3. (c)

    the extended quadratic growth property with parameters \(\tau , \omega >0\) if it satisfies the second order growth of the function value

    $$\begin{aligned} \varphi (y)-\varphi ^*\ge \frac{\tau }{2}\cdot d^2(y,\mathcal {X}), \quad \forall y\in [\varphi \le \varphi ^*+\omega ]. \end{aligned}$$
    (12)

We use \(eRSC(\nu , \omega )\), \(eGEB(\kappa , \omega )\), and \(eQG(\tau , \omega )\) to stand for the defined properties above respectively.

It is easy to see that \(\varphi (x)\) is lower semicontinuous over \(\mathbb {R}^n\) and hence \(\mathcal {X}\) is closed; see e.g. Lemma 2.6.3 in [16]. This ensures that \([y]_\mathcal {X}^+\) is well defined.

Remark 4

The eQG appeared in [6] under the name of semi-strongly convex property. The authors of [21] proposed an analog of the eGEB property and exploited it by borrowing tools from set-valued analysis. The authors of [5] introduced the eGEB and proved that it is equivalent to the eQG. Our novelty here lies in the definition of eRSC.

3 Equivalence analysis

3.1 Equivalence among the original versions

In what follows, we prove that the notions defined in Definition 1 are actually equivalent. The idea of the proof is mainly inspired by the seminal paper [1] and relies heavily on the well-known Ekeland’s variational principle stated in Lemma 1.

Theorem 1

Under the setting of Definition 1, the restricted strongly convex property, the global error bound property, and the quadratic growth property are equivalent in the following sense:

$$\begin{aligned} QG(\nu )\Rightarrow RSC\left( \frac{\nu }{2}\right) \Rightarrow GEB\left( \frac{\nu }{2}\right) \Rightarrow QG\left( \frac{\nu }{4}\right) . \end{aligned}$$

Proof

The implication \(QG(\nu )\Rightarrow RSC(\frac{\nu }{2})\) has been shown in [17, 20]. The implication \(RSC(\frac{\nu }{2})\Rightarrow GEB(\frac{\nu }{2})\) is a direct consequence of the Cauchy–Schwarz inequality. It remains to prove \(GEB(\nu ) \Rightarrow QG(\frac{\nu }{2})\). Now assume that f has the GEB property with constant \(\nu >0\). It suffices to prove that for all \(0\le \alpha <\frac{1}{4}\), the following holds (letting \(\alpha \rightarrow \frac{1}{4}\) then yields \(f(z)-f^*\ge \frac{\nu }{4}\cdot d^2(z,\mathcal {X})\), i.e., the \(QG(\frac{\nu }{2})\) property):

$$\begin{aligned} f(z)-f^*\ge \alpha \nu \cdot d^2(z,\mathcal {X}), \quad \forall z\in \mathbb {R}^n. \end{aligned}$$

If this is not true, then there must exist \(z_0\in \mathbb {R}^n\) such that

$$\begin{aligned} f(z_0)<f^*+\alpha \nu \cdot d^2(z_0,\mathcal {X}). \end{aligned}$$

Clearly, \(z_0\notin \mathcal {X}\) and hence \(d(z_0, \mathcal {X})>0\) since \(\mathcal {X}\) is a nonempty closed set. Let \(\lambda =\frac{1}{2}d(z_0, \mathcal {X})\). By Ekeland’s variational principle with \(\epsilon =\alpha \nu \cdot d^2(z_0,\mathcal {X})=4\alpha \nu \lambda ^2\), there exists \(x_0\in \mathbb {R}^n\) such that \(d(x_0,z_0)\le \lambda \) and

$$\begin{aligned} f(x)\ge f(x_0)-\frac{\epsilon }{\lambda }d(x,x_0)=f(x_0)-4\alpha \nu \lambda \cdot d(x,x_0), \quad \forall x\in \mathbb {R}^n. \end{aligned}$$

Then, \(x_0\) minimizes the convex function \(f(x)+4\alpha \nu \lambda \cdot d(x,x_0)\). By the first-order optimality condition, we get

$$\begin{aligned} 0\in \nabla f(x_0)+4\alpha \nu \lambda \cdot \partial (\Vert \cdot -x_0\Vert )(x_0)=\nabla f(x_0)+4\alpha \nu \lambda \cdot \mathcal {Y}, \end{aligned}$$

where \(\mathcal {Y}=\{y\in \mathbb {R}^n: \Vert y\Vert \le 1\}\) [16]. Hence, we can find \(y_0\in \mathcal {Y}\) such that \(\nabla f(x_0)=-4\alpha \nu \lambda y_0\). Since

$$\begin{aligned} 2\lambda =d(z_0, \mathcal {X}) \le d(x_0, z_0)+d(x_0, \mathcal {X})\le \lambda +d(x_0, \mathcal {X}), \end{aligned}$$

we have \(d(x_0, \mathcal {X})\ge \lambda \). Therefore,

$$\begin{aligned} \Vert \nabla f(x_0)\Vert =4\alpha \nu \lambda \Vert y_0\Vert \le 4\alpha \nu \lambda \le 4\alpha \nu \cdot d(x_0, \mathcal {X})< \nu \cdot d(x_0, \mathcal {X}), \end{aligned}$$

which contradicts the GEB property. This completes the proof. \(\square \)

3.2 Equivalence among the modified and extended versions

In this part, we deduce the equivalence among the modified and extended versions.

Theorem 2

Let \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) be a convex differentiable function whose gradient is Lipschitz-continuous with positive scalar L and \(g:\mathbb {R}^n\rightarrow \overline{\mathbb {R}}\) be a closed convex function. Let \(\gamma \ge L\) be a fixed constant. Then the extended versions are equivalent in the following sense

$$\begin{aligned} eRSC(\nu _1, \omega )\Rightarrow eGEB(\kappa _1, \omega )\Rightarrow eQG(\tau _1,\omega ) \end{aligned}$$

and

$$\begin{aligned} eQG(\tau _ 2,\omega )\Rightarrow eGEB(\kappa _2, \omega )\Rightarrow eRSC(\nu _2, \omega ), \end{aligned}$$

where \(\omega \in (0, +\infty ]\) is a fixed constant and the other parameters satisfy \(\kappa _1=\tau _1=\nu _1\) and

$$\begin{aligned} \kappa _2=\frac{\tau _2 \gamma ^2}{(2\gamma +\tau _2)(\gamma +L)}, ~~\nu _2=\frac{\kappa _2^2}{2\gamma }. \end{aligned}$$

In particular, the equivalence among the modified versions also holds in the same way by letting \(\omega =+\infty \).

Proof

The equivalence of \(eGEB(\kappa , \omega )\) and \(eQG(\tau ,\omega )\) has been shown in [5] recently; see Lemma 2. Applying the Cauchy–Schwarz inequality to the left-hand side of the eRSC inequality gives the eGEB property. It remains to prove that eGEB implies eRSC. By using Lemma 3 with \(x=[y]_\mathcal {X}^+\) and \(\bar{x}=y\), we get

$$\begin{aligned} \langle G^f_g(y;\gamma ),y-[y]_\mathcal {X}^+\rangle \ge \frac{1}{2\gamma }\Vert G^f_g(y;\gamma )\Vert ^2 +\varphi (p^f_g(y;\gamma )) -\varphi ([y]_\mathcal {X}^+). \end{aligned}$$
(13)

Noticing that \(\varphi (p^f_g(y;\gamma )) - \varphi ([y]_\mathcal {X}^+)=\varphi (p^f_g(y;\gamma )) - \varphi ^*\ge 0\) and applying the eGEB property, we obtain

$$\begin{aligned} \langle G^f_g(y;\gamma ),y-[y]_\mathcal {X}^+\rangle \ge \frac{\kappa _2^2}{2\gamma }d^2(y,\mathcal {X}), \quad \forall y\in [\varphi \le \varphi ^*+\omega ], \end{aligned}$$
(14)

which completes the proof. \(\square \)

4 Convergence analysis

In this section, we discuss the asymptotically linear convergence of the proximal gradient method, based on the equivalent notions defined above.

The proximal gradient method, also known as the forward-backward splitting method, is fundamental for the following structured optimization problem

$$\begin{aligned} \mathop {{{\mathrm{minimize}}}}\limits _{x} \varphi (x):= f(x)+g(x), \end{aligned}$$

where f is a convex function with the gradient-Lipschitz-continuous property and g is a closed convex function. It can be stated as

$$\begin{aligned} x_{k+1}=x_k-\frac{1}{\gamma } G^f_g(x_k;\gamma ), \end{aligned}$$

where the constant \(\gamma >0\) is appropriately chosen. Based on the eGEB property, the authors of [5] observed that an asymptotically Q-linear convergence in function values can be guaranteed for this method and, moreover, that if the iterates \(x_k\) have a limit point \(x^*\), then they converge asymptotically R-linearly. Motivated by the arguments in [20], we establish below an asymptotically Q-linear convergence in the distance values \(d(x_{k},\mathcal {X})\), and prove that the iterates \(x_k\) themselves converge asymptotically R-linearly to a limit point \(x^*\in \mathcal {X}\) under a compactness assumption on the minimizer set \(\mathcal {X}\). The following result is complementary to that of [5] and might be of interest in itself.
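
For concreteness, the following sketch runs the proximal gradient iteration on a hypothetical \(\ell _1\)-regularized least-squares instance (an illustrative example of ours, not one from the paper); with \(\gamma \ge L\), each iteration is a gradient step on f followed by the proximal (soft-thresholding) step on g.

```python
import numpy as np

# Proximal gradient iteration x_{k+1} = x_k - (1/gamma) * G^f_g(x_k; gamma) on a
# hypothetical instance: f(x) = 0.5*||Ax - b||^2, g(x) = lam*||x||_1.
rng = np.random.default_rng(1)
A, b, lam = rng.standard_normal((20, 50)), rng.standard_normal(20), 0.1
gamma = np.linalg.norm(A, 2) ** 2              # gamma >= L = ||A||^2, as required above

phi = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

x = np.zeros(50)
for k in range(301):
    z = x - A.T @ (A @ x - b) / gamma                            # forward step on f
    x = np.sign(z) * np.maximum(np.abs(z) - lam / gamma, 0.0)    # backward (prox) step on g
    if k % 50 == 0:
        print(k, phi(x))                       # phi(x_k) decreases monotonically
```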

Theorem 3

Let \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) be a convex differentiable function whose gradient is Lipschitz-continuous with positive scalar L and let \(g:\mathbb {R}^n\rightarrow \overline{\mathbb {R}}\) be a closed convex function. Denote \(\mathcal {X}=\arg \min _{x} \varphi (x):= f(x)+g(x)\) and \(\varphi ^*=\min _{x} \varphi (x)\) and assume that \(\mathcal {X}\) is nonempty. Let \(\gamma \ge L\) be a fixed constant. Suppose that \({{\mathrm{minimize}}}_{x} \varphi (x)\) satisfies the \(eQG(\tau ,\omega )\) property (equivalently, the \(eGEB(\kappa , \omega )\) property). Then the iterates \(x_k\) generated by the proximal gradient method converge asymptotically Q-linearly in the distance values \(d(x_k,\mathcal {X})\); that is, there exists an index m such that the inequality

$$\begin{aligned} d(x_{k+1},\mathcal {X})\le \sqrt{\frac{\gamma }{\gamma + \tau }}d(x_k,\mathcal {X}) \end{aligned}$$

holds for all \(k\ge m\). Moreover, if \(\mathcal {X}\) is compact, then the iterates \(x_k\) asymptotically converge R-linearly to some limit point \(x^*\in \mathcal {X}\) in the sense that

$$\begin{aligned} d^2(x_{k+m},x^*)\le C\cdot \left( 1-\frac{\kappa }{2\gamma }\right) ^k \end{aligned}$$

holds for all \(k\ge 1\), where \(m>\frac{L}{2\omega }d^2(x_0,\mathcal {X})\) and \(C=\frac{2(\varphi (x_m)-\varphi ^*)}{\gamma }\left( \sum _{i=0}^\infty (1-\frac{\kappa }{2\gamma })^{\frac{i}{2}}\right) ^2\).

The proof idea below is partially inspired by [5, 20].

Proof

Let \(m= \lceil \frac{L}{2\omega }d^2(x_0,\mathcal {X})\rceil \) where \(\lceil x \rceil \) denotes the smallest integer larger than x. By using the standard sublinear estimate [2]

$$\begin{aligned} \varphi (x_k)-\varphi ^*\le \frac{L\cdot d^2(x_0,\mathcal {X})}{2k}, \end{aligned}$$

we deduce that \(\varphi (x_k)-\varphi ^*\le \omega \) holds for all \(k\ge m\). Denote the projection point of x onto \(\mathcal {X}\) by \(x^{\prime }\). By invoking Lemma 3 with \(x=x_k^{\prime }\) and \(\bar{x}=x_k\), we have that

$$\begin{aligned} \langle G^f_g(x_k;\gamma ),x_k-x_k^{\prime }\rangle \ge \frac{1}{2\gamma }\Vert G^f_g(x_k;\gamma )\Vert ^2 +\varphi (x_{k+1}) -\varphi ^* . \end{aligned}$$
(15)

Now, together with the \(eQG(\tau ,\omega )\) property, we derive that for all \(k\ge m\)

$$\begin{aligned} d^2(x_{k+1},\mathcal {X})&= \Vert x_{k+1}-x_{k+1}^{\prime }\Vert ^2\le \Vert x_{k+1}-x_k^{\prime }\Vert ^2 \end{aligned}$$
(16a)
$$\begin{aligned}&= \Vert x_k-x_k^{\prime } -\frac{1}{\gamma }G^f_g(x_k;\gamma )\Vert ^2 \end{aligned}$$
(16b)
$$\begin{aligned}&= \Vert x_k-x_k^{\prime }\Vert ^2 -\frac{2}{\gamma }\langle G^f_g(x_k;\gamma ),x_k-x_k^{\prime }\rangle + \frac{1}{\gamma ^2}\Vert G^f_g(x_k;\gamma )\Vert ^2\end{aligned}$$
(16c)
$$\begin{aligned}&\le \Vert x_k-x_k^{\prime }\Vert ^2-\frac{2}{\gamma }(\varphi (x_{k+1}) -\varphi ^* ) \end{aligned}$$
(16d)
$$\begin{aligned}&\le d^2(x_k,\mathcal {X})-\frac{\tau }{\gamma } d^2(x_{k+1},\mathcal {X}), \end{aligned}$$
(16e)

which, after rearranging, gives \(d^2(x_{k+1},\mathcal {X})\le \frac{\gamma }{\gamma +\tau }\, d^2(x_k,\mathcal {X})\) and hence the asymptotically Q-linear convergence of \(d(x_k,\mathcal {X})\).

To prove the convergence of \(\{x_k\}\) themselves, we first consider the sequence \(\{x^{\prime }_k\}\subseteq \mathcal {X}\), which must have a subsequence, denoted by \(\{x^{\prime }_{k_i}\}\), converging to some point \(x^*\in \mathcal {X}\) due to the compactness assumption on \(\mathcal {X}\). We claim that \(x^{\prime }_k\rightarrow x^*\) as \(k\rightarrow \infty \). Otherwise, there must exist another subsequence of \(\{x^{\prime }_k\}\), denoted by \(\{x^{\prime }_{k_j}\}\), converging to a different point \(\hat{x}\in \mathcal {X}\). Notice that

$$\begin{aligned} d(x^*,\hat{x})\le d(x^*,x^{\prime }_{k_i})+d(x^{\prime }_{k_i}, x^{\prime }_{k_j})+d(x^{\prime }_{k_j},\hat{x}) \end{aligned}$$

and \(d(x^{\prime }_{k_i}, x^{\prime }_{k_j})\le d(x_{k_i}, x_{k_j}),\) where the latter follows from the nonexpansiveness of the projection operator. We get that

$$\begin{aligned} d(x^*,\hat{x})\le d(x^*,x^{\prime }_{k_i})+d(x^{\prime }_{k_j},\hat{x})+d(x_{k_i}, x_{k_j}). \end{aligned}$$
(17)

Denote \(\ell _1=\min \{k_i, k_j\}\) and \(\ell _2=\max \{k_i,k_j\}\); then

$$\begin{aligned} d(x_{k_i}, x_{k_j})\le \sum _{i=\ell _1}^{\ell _2-1}d(x_i,x_{i+1})\le \sum _{i=\ell _1}^{\infty }d(x_i,x_{i+1}). \end{aligned}$$

By invoking Lemma 3 with \(x=\bar{x}=x_i\), we have that

$$\begin{aligned} \varphi (x_i)-\varphi (x_{i+1})\ge \frac{1}{2\gamma }\Vert G^f_g(x_i;\gamma )\Vert ^2=\frac{\gamma }{2}\Vert x_i-x_{i+1}\Vert ^2. \end{aligned}$$
(18)

On the other hand, inequality (15) implies

$$\begin{aligned} \varphi (x_{i+1}) -\varphi ^*\le \Vert G^f_g(x_i;\gamma )\Vert ^2\left( \frac{\Vert x_i-x^{\prime }_i\Vert }{\Vert G^f_g(x_i;\gamma )\Vert }-\frac{1}{2\gamma }\right) . \end{aligned}$$
(19)

Applying the \(eGEB(\kappa , \omega )\) property to (19) and combining with (18), we get that

$$\begin{aligned} \varphi (x_{i+1}) -\varphi ^*\le \left( 1-\frac{\kappa }{2\gamma }\right) (\varphi (x_i) -\varphi ^*),~~i\ge m . \end{aligned}$$
(20)

Let \(\ell _1\ge m\) and \(i\ge m\); then

$$\begin{aligned} d^2(x_i,x_{i+1})&= \Vert x_i-x_{i+1}\Vert ^2 \le \frac{2}{\gamma }(\varphi (x_i)-\varphi (x_{i+1}))\le \frac{2}{\gamma }(\varphi (x_i)-\varphi ^*)\end{aligned}$$
(21a)
$$\begin{aligned}&\le \frac{2}{\gamma }\left( 1-\frac{\kappa }{2\gamma }\right) ^{i-m}(\varphi (x_m)-\varphi ^*) \end{aligned}$$
(21b)

and hence

$$\begin{aligned} d(x_{k_i},x_{k_j})&\le \sum _{i=\ell _1}^{\infty }d(x_i,x_{i+1})\le \sqrt{\frac{2(\varphi (x_m)-\varphi ^*)}{\gamma }}\sum _{i=\ell _1}^\infty \left( 1-\frac{\kappa }{2\gamma }\right) ^{\frac{i-m}{2}}\end{aligned}$$
(22a)
$$\begin{aligned}&= D\cdot \left( 1-\frac{\kappa }{2\gamma }\right) ^{\frac{\ell _1-m}{2}}\rightarrow 0, ~~as~~\ell _1\rightarrow \infty , \end{aligned}$$
(22b)

where \(D=\sqrt{\frac{2(\varphi (x_m)-\varphi ^*)}{\gamma }}\sum _{i=0}^\infty \left( 1-\frac{\kappa }{2\gamma }\right) ^{\frac{i}{2}}<\infty \). Together with the fact that \(x^{\prime }_{k_i}\rightarrow x^*\) as \(k_i\rightarrow \infty \) and \(x^{\prime }_{k_j}\rightarrow \hat{x}\) as \(k_j\rightarrow \infty \), we immediately get \(d(x^*,\hat{x})=0\) by using (17) with \(k_i\rightarrow \infty \) and \(k_j\rightarrow \infty \), which contradicts \(x^*\ne \hat{x}\). Therefore, \(x^{\prime }_k\rightarrow x^*\) as \(k\rightarrow \infty \) indeed holds. Finally, in light of the asymptotically Q-linear convergence of \(x_k\), we have that

$$\begin{aligned} d(x_k,x^*)\le d(x_k,x^{\prime }_k)+ d(x^{\prime }_k,x^*)=d(x_k, \mathcal {X})+ d(x^{\prime }_k,x^*) \rightarrow 0, ~~as~~k\rightarrow \infty , \end{aligned}$$

which implies that the iterates \(x_k\) converge to \(x^*\in \mathcal {X}\). The asymptotically R-linear convergence of \(\{x_k\}\) follows by setting \(k_i=\ell _1=k+m\) and letting \(k_j\rightarrow +\infty \) in (22). This completes the proof. \(\square \)

Remark 5

Although (22) implies that \(\{x_k\}\) is a Cauchy sequence and hence converges, it cannot ensure convergence to a point belonging to \(\mathcal {X}\) without the compactness assumption on \(\mathcal {X}\).

5 A case study: composition minimization with constraints

The authors of [21] presented a unified framework for establishing error bounds for a class of structured convex optimization problems. By the equivalence between the error bound condition and quadratic growth, the authors of [5] streamlined and illuminated the arguments in [21] and also extended their results to a wider setting. Here, we derive in a transparent way a strengthened property of the RSC type for the following constrained convex program

$$\begin{aligned} {{\mathrm{minimize}}}\,\,f(x),\quad \quad \mathrm{subject\,\,to}\,\,~x\in Q, \end{aligned}$$
(23)

where Q is a polyhedral set in the n-dimensional Euclidean space, and f is the composition of an affine mapping with a strongly convex differentiable function. We assume that f and Q are of the special form:

$$\begin{aligned} f(x)=g(Ex), ~~~Q=\{x\in \mathbb {R}^n| Ax\le b\}, \end{aligned}$$

where E is some \(m\times n\) matrix, A is some \(k\times n\) matrix, \(b\in \mathbb {R}^k\) is some vector, and g is strongly convex and gradient-Lipschitz-continuous with positive scalars \(\mu \) and L respectively.

The convex program (23) has been extensively studied for its wide applications in data networks and machine learning.
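
As an illustration of the problem class (23), the following sketch (an instance of our own choosing, not an experiment from the paper) takes \(g(t)=\frac{1}{2}\Vert t-c\Vert ^2\) and \(Q=\{x\ge 0\}\) and runs the projected gradient iteration \(x_{k+1}=x_Q(x_k;\gamma )\) of Definition 2; the printed optimality gaps decay roughly geometrically, in line with Theorem 4 combined with Theorem 2.

```python
import numpy as np

# Illustrative instance of (23): g(t) = 0.5*||t - c||^2 (strongly convex with
# mu = L = 1) and Q = {x >= 0} (a polyhedral set with A = -I, b = 0), so that
# f(x) = g(Ex) is convex but not strongly convex.  We use gamma = ||E||^2,
# which equals L*||E E^T|| here, as required by Theorem 4.
rng = np.random.default_rng(2)
E = rng.standard_normal((10, 40))              # E^T E is rank deficient
c = rng.standard_normal(10)
gamma = np.linalg.norm(E, 2) ** 2

f = lambda x: 0.5 * np.sum((E @ x - c) ** 2)
grad = lambda x: E.T @ (E @ x - c)

x = np.ones(40)
vals = []
for k in range(2000):
    x = np.maximum(x - grad(x) / gamma, 0.0)   # x_Q(x_k; gamma): project onto Q = {x >= 0}
    vals.append(f(x))
f_star = vals[-1]                              # numerical proxy for f^*
print([round(vals[k] - f_star, 6) for k in (0, 50, 100, 150, 200)])
```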

Theorem 4

Under the setting of Definition 3 and letting \(\gamma \ge L\Vert EE^T\Vert \), we have that the constrained convex program (23) obeys the strengthened property of the RSC type:

$$\begin{aligned} \langle G^f_Q(y;\gamma ), y-[y]_\mathcal {X}^+\rangle \ge \frac{1}{2\gamma }\Vert G^f_Q(y;\gamma )\Vert ^2+ C_1 \cdot d^2(y,\mathcal {X}), ~~\forall y\in Q, \end{aligned}$$
(24)

and the modified quadratic growth property:

$$\begin{aligned} f(y)\ge f^*+ C_2\cdot d^2(y,\mathcal {X}), ~~\forall y\in Q, \end{aligned}$$
(25)

where \(C_i\), \(i=1,2\), are positive constants depending on the matrices E and A and on the positive scalar \(\mu \).

Proof

By using (29) in Lemma 5 with \(x=[y]_\mathcal {X}^+, \bar{x}=y\) and noticing that \(f(x_Q(y;\gamma ))\ge f([y]_\mathcal {X}^+)=f^*\), we get

$$\begin{aligned} \langle G^f_Q(y;\gamma ), y-[y]_\mathcal {X}^+\rangle \ge \frac{1}{2\gamma }\Vert G^f_Q(y;\gamma )\Vert ^2+ \frac{\mu }{2} \Vert Ey-E[y]_\mathcal {X}^+\Vert ^2. \end{aligned}$$
(26)

Since g is strongly convex, there must exist a unique vector \(t^*\) such that \( Ex=t^*\) for all \(x\in \mathcal {X}\); see [10, 18]. Thus,

$$\begin{aligned} \mathcal {X}=\{x| Ex=t^*\}\bigcap \{x|Ax\le b\}. \end{aligned}$$

Due to the result of Hoffman’s error bound in Lemma 6, there must exist a constant \(\theta >0\) depending on matrices E and A such that for any \(y\in Q=\{x|Ax\le b\}\) it holds

$$\begin{aligned} \Vert Ey-E[y]_\mathcal {X}^+\Vert ^2=\Vert Ey-t^*\Vert ^2\ge \theta \Vert y-[y]_\mathcal {X}^+\Vert ^2=\theta \cdot d^2(y,\mathcal {X}). \end{aligned}$$

Thus, the strengthened property of the RSC type follows from this and the inequality (26).

By Lemma 7, for any \(x^*\in \mathcal {X}\) we have \(G^f_Q(x^*;\gamma )=0\) and hence \(x^*=x_Q(x^*;\gamma )\). Therefore, using (29) in Lemma 5 with \(\bar{x}=[y]_\mathcal {X}^+\) and the fact that \(\Vert Ey-E[y]_\mathcal {X}^+\Vert ^2\ge \theta \cdot d^2(y,\mathcal {X})\), we get the modified quadratic growth property. This completes the proof. \(\square \)

Remark 6

By Lemma A.8 in [18], we get

$$\begin{aligned} \Vert G^f_Q(y;\gamma )\Vert =\Vert \gamma (y-[y-\frac{1}{\gamma }\nabla f(y)]_Q^+)\Vert \le \gamma \max (1,\gamma ^{-1})\Vert y-[y-\nabla f(y)]_Q^+\Vert . \end{aligned}$$

Thus, using this and applying the Cauchy–Schwarz inequality to (24), we get

$$\begin{aligned} \gamma \max (1,\gamma ^{-1})\Vert y-[y-\nabla f(y)]_Q^+\Vert \ge C_1\cdot d(y,\mathcal {X}). \end{aligned}$$

Or, equivalently there exists a constant \(C_3>0\) such that

$$\begin{aligned} C_3 \cdot d(y,\mathcal {X})\le \Vert y-[y-\nabla f(y)]_Q^+\Vert . \end{aligned}$$

In contrast, the authors of [18] derived the following error bound

$$\begin{aligned} d(y,\mathcal {X})\le \kappa _3\Vert y-[y-\nabla f(y)]_Q^+\Vert , ~~\forall y\in Q~~\text {and} ~~f(y)-f^*\le M \end{aligned}$$

for \(f(x)=g(Ex)+q^Tx\). Although this bound enables the incorporation of the additional linear term, the constant \(\kappa _3\) depends on the positive parameter M and hence on the variable y. To avoid such dependence, the authors of [3] derived the quadratic growth property

$$\begin{aligned} f(y)\ge f^*+ \kappa _4\cdot d^2(y,\mathcal {X}), ~~\forall y\in Q \end{aligned}$$

for \(f(x)=g(Ex)+q^Tx\) by assuming that Q is compact, where \(\kappa _4>0\) is a constant. Here, we can drop the compactness assumption, at the cost of neglecting the linear term \(q^Tx\).

Remark 7

There is an alternative way, following [13], to prove the modified QG property. Indeed, by the strong convexity of g, the first-order optimality condition \(\langle \nabla f([y]_\mathcal {X}^+), y-[y]_\mathcal {X}^+\rangle \ge 0\) for \(y\in Q\), and the fact that \(\Vert Ey-E[y]_\mathcal {X}^+\Vert ^2\ge \theta \cdot d^2(y,\mathcal {X})\), we derive, for all \(y\in Q\), that

$$\begin{aligned} f(y)-f^*= & {} f(y)-f([y]_\mathcal {X}^+)=g(Ey) -g(E[y]_\mathcal {X}^+) \ge \frac{\mu }{2}\Vert Ey-E[y]_\mathcal {X}^+\Vert ^2\\\ge & {} \frac{\mu \theta }{2}\cdot d^2(y,\mathcal {X}). \end{aligned}$$

Thus, by Theorem 2 we immediately get the mRSC property

$$\begin{aligned} \langle G^f_Q(y;\gamma ), y-[y]_\mathcal {X}^+\rangle \ge C_4\cdot d^2(y,\mathcal {X}), \end{aligned}$$

where \(C_4\) is some positive constant.