1 Introduction

In this paper, we show that one can skip gradient computations without slowing down the convergence of gradient descent type methods for solving certain structured convex programming (CP) problems. To motivate our study, let us first consider the following classic bilinear saddle point problem (SPP):

$$\begin{aligned} \psi ^*:=\min _{x\in X}\left\{ \psi (x):= f(x) + \max _{y\in Y}[\langle Kx, y\rangle - J(y)]\right\} . \end{aligned}$$
(1.1)

Here, \(X\subseteq \mathbb {R}^n\) and \(Y\subseteq \mathbb {R}^m\) are nonempty, closed, and convex sets, \(K: \mathbb {R}^n \rightarrow \mathbb {R}^m\) is a linear operator, J is a relatively simple convex function, and \(f:X\rightarrow \mathbb {R}\) is a continuously differentiable convex function satisfying

$$\begin{aligned} 0 \le f(x) - l_f(u,x) \le \frac{L}{2}\Vert x - u\Vert ^2,\ \forall x,u\in X, \end{aligned}$$
(1.2)

for some \(L>0\), where \(l_f(u,x):=f(u) + \langle \nabla f(u), x-u\rangle\) denotes the first-order Taylor expansion of f at u. Since \(\psi\) is a nonsmooth convex function, traditional nonsmooth optimization methods, e.g., the subgradient method, would require \(\mathcal{O} (1/\varepsilon ^2)\) iterations to find an \(\varepsilon\)-solution of (1.1), i.e., a point \({{\bar{x}}} \in X\) s.t. \(\psi ({{\bar{x}}}) - \psi ^* \le \varepsilon\). In a landmark work [30], Nesterov suggested approximating \(\psi\) by a smooth convex function

$$\begin{aligned} \psi ^*_\rho := \min _{x\in X} \left\{ \psi _\rho (x):=f(x) + h_\rho (x) \right\} , \end{aligned}$$
(1.3)

with

$$\begin{aligned} h_\rho (x):=\max _{y\in Y}\langle Kx,y\rangle - J(y) - \rho W(y_0,y) \end{aligned}$$
(1.4)

for some \(\rho > 0\), where \(y_0\in Y\) and \(W(y_0,\cdot )\) is a strongly convex function. By properly choosing \(\rho\) and applying the optimal gradient method to (1.3), he shows that one can compute an \(\varepsilon\)-solution of (1.1) in at most

$$\begin{aligned} \mathcal{O} \left( \sqrt{\frac{L}{\varepsilon }} + \frac{\Vert K\Vert }{\varepsilon }\right) \end{aligned}$$
(1.5)

iterations. Following [30], much research effort has been devoted to the development of first-order methods utilizing the saddle-point structure of (1.1) (see, e.g., the smoothing technique [2, 3, 11, 20, 22, 24, 29, 30, 36], the mirror-prox methods [9, 17, 21, 27], the primal-dual type methods [6, 10, 13, 19, 36, 37] and their equivalent form as the alternating direction method of multipliers [15, 16, 18, 26, 33, 34]). Some of these methods (e.g., [9, 10, 19, 22, 34]) can achieve exactly the same complexity bound as in (1.5). Recently, it was proved in [35] that the complexity bound in (1.5) is theoretically unimprovable. Specifically, for any first-order method that calls the oracle \(\mathcal{O} (x,y)\mapsto (\nabla f(x), Kx, K^Ty)\) at an inquiry point \((x,y)\) to access information about f and K in the saddle point problem (1.1), the number of oracle inquiries needed to compute an \(\varepsilon\)-solution is at least of the order in (1.5). In other words, if each iteration of a first-order method requires both the computation of \(\nabla f\) and the evaluation of the linear operators (K and \(K^T\)), the total numbers of gradient and linear operator evaluations will both be no less than \(\mathcal{O} (1/\varepsilon )\). Therefore, Nesterov’s smoothing scheme is an optimal method among all first-order methods that perform gradient and linear operator evaluations in each iteration.
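To make the smoothing (1.4) concrete, consider a classical example (added here purely for illustration): take \(J\equiv 0\), \(Y=\{y\in \mathbb {R}^m:\Vert y\Vert _\infty \le 1\}\), \(y_0=0\), and \(W(y_0,y)=\Vert y\Vert _2^2/2\), so that the nonsmooth term in (1.1) is \(\max _{y\in Y}\langle Kx,y\rangle =\Vert Kx\Vert _1\). The smooth approximation (1.4) then reduces to the componentwise Huber function

$$\begin{aligned} h_\rho (x)=\sum _{i=1}^m H_\rho ([Kx]_i),\quad \text {where } H_\rho (r)={\left\{ \begin{array}{ll} r^2/(2\rho ) &{} |r|\le \rho , \\ |r|-\rho /2 &{} |r|>\rho , \end{array}\right. } \end{aligned}$$

whose gradient is Lipschitz continuous with constant \(\Vert K\Vert ^2/\rho\) (w.r.t. the Euclidean norms). Since \(0\le \psi (x)-\psi _\rho (x)\le \rho \max _{y\in Y}W(y_0,y)\), choosing \(\rho =\mathcal{O} (\varepsilon )\) balances the approximation error against the Lipschitz constant \(\Vert K\Vert ^2/\rho\) of \(\nabla h_\rho\), which is the source of the \(\Vert K\Vert /\varepsilon\) term in (1.5).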

One problem associated with Nesterov’s smoothing scheme and the related methods mentioned above is that each iteration of these methods requires both the computation of \(\nabla f\) and the evaluation of the linear operators (K and \(K^T\)). As a result, the total number of gradient and linear operator evaluations will both be bounded by \({{\mathcal{O}}}(1/\varepsilon )\). However, in many applications the computation of \(\nabla f\) is often much more expensive than the evaluation of the linear operators K and \(K^T\). This happens, for example, when the linear operator K is sparse (e.g., total variation, overlapped group lasso and graph regularization), while f involves a more expensive data-fitting term (see Sect. 4 and [23] for some other examples). In [23], Lan considered a similar situation and proposed a gradient sliding (GS) algorithm to minimize a class of composite problems whose objective function is given by the sum of a smooth component and a general nonsmooth component. He shows that one can skip the computation of the gradient for the smooth component from time to time, while still maintaining the \({{\mathcal{O}}}(1/\varepsilon ^2)\) iteration complexity bound. More specifically, by applying the GS method in [23] to problem (1.1), we can show that the number of gradient evaluations of \(\nabla f\) will be bounded by

$$\begin{aligned} \mathcal{O} \left( \sqrt{\frac{L}{\varepsilon }}\right) , \end{aligned}$$

which is significantly better than (1.5). Unfortunately, the total number of evaluations for the linear operators K and \(K^T\) will be bounded by

$$\begin{aligned} \mathcal{O} \left( \sqrt{\frac{L}{\varepsilon }} + \frac{\Vert K\Vert ^2}{\varepsilon ^2}\right) , \end{aligned}$$

which is much worse than (1.5). An important yet unresolved research question is whether one can still preserve the optimal \({{\mathcal{O}}}(1/\varepsilon )\) complexity bound in (1.5) for solving (1.1) by utilizing only \({{\mathcal{O}}}(1/\sqrt{\varepsilon })\) gradient computations of \(\nabla f\) to find an \(\varepsilon\)-solution of (1.1). If so, we would be able to keep the total number of iterations relatively small, while significantly reducing the total number of required gradient computations.

In order to address the aforementioned issues associated with existing solution methods for solving (1.1), we pursue in this paper a different approach to exploit the structural information of (1.1). Firstly, instead of concentrating solely on nonsmooth optimization as in [23], we study the following smooth composite optimization problem:

$$\begin{aligned} \phi ^*:=\min _{x\in X}\left\{ \phi (x):=f(x) + h(x) \right\} . \end{aligned}$$
(1.6)

Here f and h are smooth convex functions satisfying (1.2) and

$$\begin{aligned} 0\le h(x)-l_h(u,x)\le \frac{M}{2}\Vert x-u\Vert ^2,\ \forall x,u\in X, \end{aligned}$$
(1.7)

respectively. It is worth noting that problem (1.6) can be viewed as a special case of either (1.1) or (1.3) (with \(J=h^*\) being a strongly convex function, \(Y=\mathbb {R}^n\), \(K=I\) and \(\rho =0\)). Under the assumption that \(M\ge L\), we present a novel accelerated gradient sliding (AGS) method which can skip the computation of \(\nabla f\) from time to time. We show that the total number of required gradient evaluations of \(\nabla f\) and \(\nabla h\), respectively, can be bounded by

$$\begin{aligned} \mathcal{O} \left( \sqrt{\frac{L}{\varepsilon }}\right) \ \ \ \text{ and } \ \ \mathcal{O} \left( \sqrt{\frac{M}{\varepsilon }}\right) \end{aligned}$$
(1.8)

to find an \(\varepsilon\)-solution of (1.6). Observe that the above complexity bounds are sharper than the complexity bound obtained by Nesterov’s optimal method for smooth convex optimization, which is given by

$$\begin{aligned} \mathcal{O} \left( \sqrt{\frac{L+M}{\varepsilon }}\right) . \end{aligned}$$

In particular, for the AGS method, the Lipschitz constant M associated with \(\nabla h\) does not affect at all the number of gradient evaluations of \(\nabla f\). Clearly, a higher ratio M/L will potentially result in more savings on the gradient computation of \(\nabla f\). Moreover, if f is strongly convex with modulus \(\mu\), then the above two complexity bounds in (1.8) can be significantly reduced to

$$\begin{aligned} \mathcal{O} \left( \sqrt{\frac{L}{\mu }} \log \frac{1}{\varepsilon }\right) \ \ \ \text{ and } \ \ \mathcal{O} \left( \sqrt{\frac{M}{\mu }} \log \frac{1}{\varepsilon }\right) , \end{aligned}$$
(1.9)

respectively, which also improves upon Nesterov’s optimal method applied to (1.6) in terms of the number of gradient evaluations of \(\nabla f\). Observe that in the classic black-box setting [28, 32] the complexity bounds in terms of gradient evaluations of \(\nabla f\) and \(\nabla h\) are intertwined, and a larger Lipschitz constant M will result in more gradient evaluations of \(\nabla f\), even though there is no explicit relationship between \(\nabla f\) and M. In our development, we depart from the black-box assumption by assuming that we have separate access to \(\nabla f\) and \(\nabla h\) rather than to \(\nabla \phi\) as a whole. To the best of our knowledge, these types of separate complexity bounds as in (1.8) and (1.9) have never been obtained before for smooth convex optimization.

Secondly, we apply the above AGS method to the smooth approximation problem (1.3) in order to solve the aforementioned bilinear SPP in (1.1). By choosing the smoothing parameter properly, we show that the total number of gradient evaluations of \(\nabla f\) and operator evaluations of K (and \(K^T\)) for finding an \(\varepsilon\)-solution of (1.1) can be bounded by

$$\begin{aligned} \mathcal{O} \left( \sqrt{\frac{L}{\varepsilon }}\right) \ \ \ \text{ and } \ \ \ \mathcal{O} \left( \frac{\Vert K\Vert }{\varepsilon }\right) , \end{aligned}$$

respectively. In comparison with Nesterov’s original smoothing scheme and other existing methods for solving (1.1), our method can provide significant savings on the number of gradient computations of \(\nabla f\) without increasing the complexity bound on the number of operator evaluations of K and \(K^T\). In comparison with the GS method in [23], our method can reduce the number of operator evaluations of K and \(K^T\) from \({{\mathcal{O}}}(1/ \varepsilon ^2)\) to \({{\mathcal{O}}}(1/\varepsilon )\). Moreover, if f is strongly convex with modulus \(\mu\), the above two bounds will be significantly reduced to

$$\begin{aligned} \mathcal{O} \left( \sqrt{\frac{L}{\mu }} \log \frac{1}{\varepsilon }\right) \ \ \ \text{ and } \ \ \ \mathcal{O} \left( \frac{\Vert K\Vert }{\sqrt{\varepsilon }}\right) , \end{aligned}$$

respectively. To the best of our knowledge, this is the first time that these tight complexity bounds have been obtained for solving the classic bilinear saddle point problem (1.1).

It should be noted that, even though the idea of skipping the computation of \(\nabla f\) is similar to [23], the AGS method presented in this paper significantly differs from the GS method in [23]. In particular, each iteration of the GS method consists of one accelerated gradient iteration together with a bounded number of subgradient iterations. On the other hand, each iteration of the AGS method is composed of an accelerated gradient iteration nested with a few other accelerated gradient iterations to solve a different subproblem. The development of the AGS method seems to be more technical than GS and its convergence analysis is also highly nontrivial.

This paper is organized as follows. We first present the AGS method and discuss its convergence properties for minimizing the summation of two smooth convex functions (1.6) in Sect. 2. Utilizing this new algorithm and its associated convergence results, we study the properties of the AGS method for minimizing the bilinear saddle point problem (1.1) in Sect. 3. We then demonstrate the effectiveness of the AGS method through preliminary numerical experiments for solving certain image reconstruction problems in Sect. 4. Some brief concluding remarks are made in Sect. 5.

1.1 Notation, assumption and terminology

We use \(\Vert \cdot \Vert\), \(\Vert \cdot \Vert _*\), and \(\langle \cdot ,\cdot \rangle\) to denote an arbitrary norm, the associated dual norm, and the inner product in a Euclidean space, respectively. It should be noted that there are two Euclidean spaces \(\mathbb {R}^n\) and \(\mathbb {R}^m\) in problem (1.1) that may be equipped with different norms. Nonetheless, since it is easy to distinguish the norms of \(x\in \mathbb {R}^n\) and \(y\in \mathbb {R}^m\) by noticing their respective spaces, we will slightly abuse the notation and use the same norm notation \(\Vert x\Vert\) and \(\Vert y\Vert\) to denote their norms in \(\mathbb {R}^n\) and \(\mathbb {R}^m\) respectively. We will also use \(\Vert K\Vert\) to denote the operator norm of an operator \(K:\mathbb {R}^n\rightarrow \mathbb {R}^m\) induced by the norms \(\Vert \cdot \Vert\) in \(\mathbb {R}^n\) and \(\mathbb {R}^m\).

We say that \(V(\cdot ,\cdot )\) is a prox-function associated with a set \(X \subseteq \mathbb {R}^n\) with modulus \(\nu\) if there exists a strongly convex function \(\zeta (\cdot )\) with strong convexity parameter \(\nu\) such that

$$\begin{aligned} \begin{aligned} V(x,u)=&\ \zeta (u) - \zeta (x) - \langle \nabla \zeta (x), u-x\rangle ,\ \forall x,u\in X. \end{aligned} \end{aligned}$$
(1.10)

The above prox-function is also known as the Bregman divergence [4] (see also [27, 30]), which generalizes the Euclidean distance \(\Vert x-u\Vert _2^2/2\). It can be easily seen from (1.10) and the strong convexity of \(\zeta\) that

$$\begin{aligned} V(x,u)\ge \frac{\nu }{2}\Vert x-u\Vert ^2 \ \ \forall x, u \in X. \end{aligned}$$
(1.11)

Moreover, we say that the prox-function grows quadratically if there exists a constant C such that \(V(x,u)\le C\Vert x-u\Vert ^2/2\). Without loss of generality, we assume that \(C=1\) whenever this happens, i.e.,

$$\begin{aligned} V(x,u)\le \frac{1}{2}\Vert x-u\Vert ^2. \end{aligned}$$
(1.12)

In this paper, we associate sets \(X\subseteq \mathbb {R}^n\) and \(Y \subseteq \mathbb {R}^m\) with prox-functions \(V(\cdot ,\cdot )\) and \(W(\cdot ,\cdot )\) with moduli \(\nu\) and \(\omega\) w.r.t. their respective norms in \(\mathbb {R}^n\) and \(\mathbb {R}^m\).
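To make the prox-function setup concrete, here is a small Python sketch (our own illustration, not part of the development) of two standard choices of (1.10): the Euclidean prox-function, for which (1.11) and (1.12) hold with \(\nu =1\) and with equality, and the entropy prox-function on the standard simplex, which has modulus \(\nu =1\) with respect to \(\Vert \cdot \Vert _1\) by Pinsker's inequality.

```python
import numpy as np

# Two prox-functions of the form (1.10): V(x,u) = zeta(u) - zeta(x) - <grad zeta(x), u - x>.

def v_euclidean(x, u):
    # zeta = ||.||_2^2 / 2, so V(x,u) = ||x - u||_2^2 / 2 and (1.11)-(1.12) hold with nu = 1
    return 0.5 * np.sum((u - x) ** 2)

def v_entropy(x, u):
    # zeta(x) = sum_i x_i log x_i restricted to the simplex; for x, u on the
    # simplex (with x > 0), V(x,u) reduces to the KL divergence of u from x
    return np.sum(u * np.log(u / x))

x = np.array([0.5, 0.3, 0.2])
u = np.array([0.4, 0.4, 0.2])
# check the lower bound (1.11) with nu = 1 in the respective norms
assert v_euclidean(x, u) >= 0.5 * np.sum((x - u) ** 2) - 1e-12
assert v_entropy(x, u) >= 0.5 * np.linalg.norm(x - u, 1) ** 2 - 1e-12
```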

For any real number r, \(\lceil r\rceil\) and \(\lfloor r\rfloor\) denote the nearest integers to r from above and below, respectively. We denote the sets of nonnegative and positive real numbers by \(\mathbb {R}_+\) and \(\mathbb {R}_{++}\), respectively.

2 Accelerated gradient sliding for composite smooth optimization

In this section, we present an accelerated gradient sliding (AGS) algorithm for solving the smooth composite optimization problem in (1.6) and discuss its convergence properties. Our main objective is to show that the AGS algorithm can skip the evaluation of \(\nabla f\) from time to time and achieve better complexity bounds in terms of gradient computations than the classical optimal first-order methods applied to (1.6) (e.g., Nesterov’s method in [31]). Without loss of generality, throughout this section we assume that \(M\ge L\) in (1.2) and (1.7).

2.1 The accelerated gradient sliding algorithm

The AGS method evolves from the gradient sliding (GS) algorithm in [23], which was designed to solve a class of composite convex optimization problems whose objective function is given by the summation of a smooth and a nonsmooth component. The basic idea of the GS method is to keep the nonsmooth term inside the projection (or proximal mapping) in the accelerated gradient method and then to apply a few subgradient descent iterations to solve the projection subproblem. Inspired by [23], we suggest keeping the smooth term h, whose gradient has the larger Lipschitz constant, inside the proximal mapping in the accelerated gradient method, and then applying a few accelerated gradient iterations to solve this smooth subproblem. As a consequence, the proposed AGS method involves two nested loops (i.e., outer and inner iterations), each of which consists of a set of modified accelerated gradient descent iterations (see Algorithm 1). At the k-th outer iteration, we first build a linear approximation \(l_f({\underline{x}}_k,u)\) of f at the search point \({\underline{x}}_k \in X\) and then call the ProxAG procedure in (2.5) to compute a new pair of search points \((x_k, {\tilde{x}}_k) \in X \times X\). The ProxAG procedure can be viewed as a subroutine that computes a pair of approximate solutions to

$$\begin{aligned} \min _{u\in X}g_k(u)+ h(u) + \beta V(x_{k-1},u),\text { or equivalently, }\min _{u\in X}l_f({\underline{x}}_k,u)+ h(u) + \beta V(x_{k-1},u), \end{aligned}$$
(2.1)

where \(g_k(\cdot )\) is defined in (2.4). Here \(x_{k-1}\) is called the prox-center at the k-th outer iteration. It is worth mentioning that the steps (2.3)–(2.6) differ from the standard Nesterov accelerated gradient iterations in two essential ways. Firstly, we use two different search points, \(x_k\) and \({\overline{x}}_k\): \(x_k\) is used to update \({\underline{x}}_k\), the point at which the linear approximation of f is computed, while \({\overline{x}}_k\) is the output solution updated in (2.6). Secondly, we employ two parameters, \(\gamma _k\) and \(\lambda _k\), to update \({\underline{x}}_k\) and \({\overline{x}}_k\), respectively, rather than a single parameter. To understand the intuition behind the proposed AGS method, observe that, with some plausible reasoning, it can be viewed as an implementation of Nesterov’s accelerated gradient method in which the proximal subproblem is solved approximately. Namely, iterations (2.3) through (2.6) can be interpreted as follows:

$$\begin{aligned} {\underline{x}}_k=&\ (1-\gamma _k){\overline{x}}_{k-1} + \gamma _k x_{k-1}, \nonumber \\ {\tilde{x}}_k\approx&\ \mathop {\mathrm{argmin}}_{u\in X}l_f({\underline{x}}_k,u)+ h(u) + \beta V(x_{k-1},u)\ \ \ (\text {also compute } x_k \text { from solving the subproblem}), \nonumber \\ {\overline{x}}_k =&\ (1-\lambda _k){\overline{x}}_{k-1} + \lambda _k {\tilde{x}}_k. \end{aligned}$$
(2.2)

Clearly, if the above proximal subproblem is solved exactly and \({\tilde{x}}_k = x_k\), then AGS becomes Nesterov’s accelerated gradient method. However, note that a caveat of the above interpretation is that \({\tilde{x}}_k\) and \(x_k\) need to be different in order to achieve proper convergence results.

Based on the above plausible interpretation, the ProxAG procedure in Algorithm 1 solves the proximal subproblem (2.2) approximately. Specifically, it performs \(T_k\) inner accelerated gradient iterations to solve (2.1) with certain properly chosen starting points \({\tilde{u}}_0\) and \(u_0\). It should be noted, however, that the accelerated gradient iterations in (2.7)–(2.9) also differ from the standard Nesterov accelerated gradient iterations in the sense that the definition of the search point \({\underline{u}}_t\) involves a fixed search point \({\overline{x}}\). Since each inner iteration of the ProxAG procedure requires one evaluation of \(\nabla h\) and no evaluation of \(\nabla f\), the number of gradient evaluations of \(\nabla h\) will be greater than that of \(\nabla f\) as long as \(T_k > 1\). On the other hand, if \(\lambda _k\equiv \gamma _k\) and \(T_k\equiv 1\) in the AGS method, and \(\alpha _t\equiv 1\) and \(p_t\equiv q_t\equiv 0\) in the ProxAG procedure, then (2.5) becomes

$$\begin{aligned} x_k={\tilde{x}}_k = \mathop {\mathrm{argmin}}_{u\in X}g_k(u) + l_h({\underline{x}}_k,u) + \beta _k V(x_{k-1}, u). \end{aligned}$$

In this case, the AGS method reduces to a variant of Nesterov’s optimal gradient method (see, e.g., [32, 36]). Note that the concept of Nesterov’s accelerated gradient method (with some modification) is applied twice: in the outer iterations \(k=1,\ldots ,N\) for solving the original problem, and in solving the proximal subproblem (2.2) approximately. This idea may lead to different combinations, such as non-accelerated + accelerated, accelerated + non-accelerated, and accelerated + accelerated. In this paper, we show that the total numbers of gradient evaluations of \(\nabla f\) and \(\nabla h\) can be bounded by \(\mathcal{O} (\sqrt{L/\varepsilon })\) and \(\mathcal{O} (\sqrt{M/\varepsilon })\), respectively. To the best of our knowledge, such complexity can only be achieved through the last combination. It is possible to adapt the convergence analysis to the former two; however, for these two combinations, the bound on one of the gradient evaluation counts would deteriorate to a worse \(\mathcal{O} (1/\varepsilon )\) order.

[Algorithm 1: The accelerated gradient sliding (AGS) method. Its outer iterations perform the updates (2.3)–(2.6), and its ProxAG procedure performs the inner updates (2.7)–(2.9).]
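Since the formal statement of Algorithm 1 appears in the figure above, we also include a minimal Python sketch of AGS for the unconstrained Euclidean setting, i.e., \(X=\mathbb {R}^n\) and \(V(x,u)=\Vert x-u\Vert _2^2/2\) (so \(\nu =1\)), in which the subproblem (2.8) admits a closed-form solution. The parameter policy is the one of Corollary 2.1 below, and all function names are ours.

```python
import numpy as np

def ags(grad_f, grad_h, x0, L, M, N):
    """Sketch of AGS (Algorithm 1) for min_x f(x) + h(x) over R^n, with the
    Euclidean prox-function V(x,u) = ||x - u||^2 / 2 (nu = 1) and the
    parameter policy of Corollary 2.1; assumes M >= L."""
    T = int(np.ceil(np.sqrt(M / L)))               # inner iteration count T_k = T
    x, xbar = x0.copy(), x0.copy()
    for k in range(1, N + 1):
        gamma = 2.0 / (k + 1)
        lam = 1.0 if k == 1 else gamma * (T + 1) * (T + 2) / (T * (T + 3))
        beta = 3.0 * L * gamma / (k * lam)         # beta_k in (2.34)
        xund = (1.0 - gamma) * xbar + gamma * x    # (2.3)
        gf = grad_f(xund)                          # the only f-gradient of outer step k
        utld, u = xbar.copy(), x.copy()            # ProxAG starting points u_tilde_0, u_0
        for t in range(1, T + 1):                  # ProxAG: solve (2.1), skipping grad_f
            if k == 1:                             # stepsizes (2.35)
                alpha, p, q = 2.0 / (t + 1), (t - 1) / 2.0, 6.0 * M / t
            else:                                  # stepsizes (2.36)
                alpha, p, q = 2.0 / (t + 2), t / 2.0, 6.0 * M / (k * (t + 1))
            # (2.7): search point for the inner accelerated step
            uund = (1.0 - lam) * xbar + lam * ((1.0 - alpha) * utld + alpha * u)
            # (2.8): over R^n with Euclidean V the minimizer has a closed form
            u = (beta * x + (beta * p + q) * u - gf - grad_h(uund)) / (beta * (1.0 + p) + q)
            utld = (1.0 - alpha) * utld + alpha * u    # (2.9)
        x, xtld = u, utld                          # ProxAG outputs (x_k, x_tilde_k)
        xbar = (1.0 - lam) * xbar + lam * xtld     # (2.6)
    return xbar
```

Note that each outer iteration calls `grad_f` once and `grad_h` exactly \(T=\lceil \sqrt{M/L}\rceil\) times, which is precisely the gradient-skipping effect quantified in (2.37) and (2.38) below.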

Our goal in the remaining part of this section is to establish the convergence of the AGS method and to provide theoretical guidance for specifying quite a few parameters, including \(\{\gamma _k\}\), \(\{\beta _k\}\), \(\{T_k\}\), \(\{\lambda _k\}\), \(\{\alpha _t\}\), \(\{p_t\}\), and \(\{q_t\}\), used in the generic statement of this algorithm. In particular, we will provide upper bounds on the number of outer and inner iterations, corresponding to the number of gradient evaluations of \(\nabla f\) and \(\nabla h\), respectively, performed by the AGS method to find an \(\varepsilon\)-solution to (1.6). It should be noted that, although we introduce notation for several parameters, we are not leaving the task of choosing them to the reader; the notation is only needed to describe the framework of the proposed AGS method in its generic form. The exact values of all the parameters will be provided in the sequel, and AGS requires no more than the knowledge of the two Lipschitz constants L and M.

2.2 Approximate error measure and technical lemmas

In our convergence analysis, we measure the quality of the output solution computed at the k-th call to the ProxAG procedure by the following (approximate) error measure:

$$\begin{aligned} Q_k(x,u):=&\ g_k(x) - g_k(u) + h(x) - h(u). \end{aligned}$$
(2.10)

Indeed, if \(x^*\) is an optimal solution to (1.6), then \(Q_k(x,x^*)\) provides a linear approximation for the functional optimality gap \(\phi (x) - \phi (x^*)=f(x)-f(x^*)+h(x)-h(x^*)\) obtained by replacing f with \(g_k\). The following result describes some relationship between \(\phi (x)\) and \(Q_k(\cdot ,\cdot )\).

Lemma 2.1

For any \(u\in X\), we have

$$\begin{aligned} \phi ({\overline{x}}_k) - \phi (u) \le&\ (1-\gamma _k)[\phi ({\overline{x}}_{k-1}) - \phi (u)] + Q_k({\overline{x}}_k, u) \nonumber \\&\quad - (1-\gamma _k)Q_k({\overline{x}}_{k-1}, u) + \frac{L}{2}\Vert {\overline{x}}_k - {\underline{x}}_k\Vert ^2. \end{aligned}$$
(2.11)

Proof

By the Lipschitz smoothness assumption (1.2), definition of \(\phi\) and \(g_k\) in (1.6) and (2.4) respectively, and the convexity of \(f(\cdot )\), we have

$$\begin{aligned}&\phi ({\overline{x}}_k) - (1-\gamma _k)\phi ({\overline{x}}_{k-1}) - \gamma _k\phi (u) \\&\quad \overset{(1.2)(1.6)}{\le } \ l_f({\underline{x}}_k,{\overline{x}}_k) + \frac{L}{2}\Vert {\overline{x}}_k - {\underline{x}}_k\Vert ^2 + h({\overline{x}}_k) \\&\qquad - (1-\gamma _k)l_f({\underline{x}}_k,{\overline{x}}_{k-1}) - (1-\gamma _k)h({\overline{x}}_{k-1}) - \gamma _kl_f({\underline{x}}_k,u) - \gamma _kh(u) \\&\quad \overset{(2.4)}{=} \ g_k({\overline{x}}_k) + \frac{L}{2}\Vert {\overline{x}}_k - {\underline{x}}_k\Vert ^2 + h({\overline{x}}_k) \\&\qquad \ - (1-\gamma _k)g_k({\overline{x}}_{k-1}) - (1-\gamma _k)h({\overline{x}}_{k-1}) - \gamma _kg_k(u) - \gamma _kh(u) \\&\quad = \ Q_k({\overline{x}}_k, u) - (1-\gamma _k)Q_k({\overline{x}}_{k-1}, u) + \frac{L}{2}\Vert {\overline{x}}_k - {\underline{x}}_k\Vert ^2. \end{aligned}$$

\(\square\)

For convergence analysis, we need the following two technical results. The first one below characterizes the solution of optimization problems involving prox-functions. The proof of this result can be found, for example, in Lemma 2 of [14].

Lemma 2.2

Suppose that a convex set \(Z\subseteq \mathbb {R}^n\), a convex function \(q:Z\rightarrow \mathbb {R}\), points \(z,z'\in Z\) and scalars \(\mu _1,\mu _2\in \mathbb {R}_+\) are given. Also let V(zu) be a prox-function. If

$$\begin{aligned} u^*\in \mathop {\mathrm{Argmin}}_{u\in Z} q(u) + \mu _1V(z, u) + \mu _2V(z', u), \end{aligned}$$

then for any \(u\in Z\), we have

$$\begin{aligned}&q(u^*) + \mu _1V(z, u^*) + \mu _2V(z', u^*) \\&\quad \le \ q(u) + \mu _1V(z, u) + \mu _2V(z', u) - (\mu _1+\mu _2)V(u^*,u). \end{aligned}$$

The second technical result slightly generalizes Lemma 3 of [22] to provide a convenient way to study sequences with sublinear rates of convergence.

Lemma 2.3

Let \(c_k\in (0,1)\), \(k=2,3,\ldots\) and \(C_1>0\) be given, and define

$$\begin{aligned} C_k:=(1-c_k)C_{k-1},\ k\ge 2. \end{aligned}$$

If the sequence \(\{\delta _k\}_{k\ge 0}\) satisfies

$$\begin{aligned} \delta _k\le (1-c_k) \delta _{k-1}+ B_k,\ k=1,2,\ldots , \end{aligned}$$
(2.12)

then for any \(k\ge 1\) we have

$$\begin{aligned} \delta _k\le C_k\left[ \frac{1-c_1}{C_1}\delta _0 + \sum _{i=1}^k\frac{B_i}{C_i} \right] . \end{aligned}$$
(2.13)

In particular, the above inequality becomes equality when the relations in (2.12) are all equality relations.

Proof

The result follows from dividing both sides of (2.12) by \(C_k\) and then summing up the resulting inequalities or equalities. \(\square\)

It should be noted that, although (2.12) and (2.13) are stated in the form of inequalities, we can derive some useful formulas by setting them to be equalities. For example, let \(\{\alpha _t\}\) be the parameters used in the ProxAG procedure (see (2.7) and (2.9)) and consider the sequence \(\{\varLambda _t\}_{t\ge 1}\) defined by

$$\begin{aligned} \varLambda _t = {\left\{ \begin{array}{ll} 1 &{} t=1, \\ (1-\alpha _t)\varLambda _{t-1}&{} t>1. \end{array}\right. } \end{aligned}$$
(2.14)

By Lemma 2.3 (with \(k=t\), \(C_k=\varLambda _t\), \(c_k=\alpha _t\), \(\delta _k\equiv 1\), and \(B_k=\alpha _t\)) and observing that \(\varLambda _1=1\), we have the following weighted sum result regarding the sum of \(\alpha _i/\varLambda _i\)’s:

$$\begin{aligned} 1 =&\varLambda _t\left[ \frac{1-\alpha _1}{\varLambda _1} + \sum _{i=1}^{t}\frac{\alpha _i}{\varLambda _i}\right] = \varLambda _t(1-\alpha _1) + \varLambda _t\sum _{i=1}^{t}\frac{\alpha _i}{\varLambda _i}\nonumber \\&\text { or equivalently, }\frac{\varLambda _t}{1-\varLambda _t(1-\alpha _1)}\sum _{i=1}^{t}\frac{\alpha _i}{\varLambda _i} = 1. \end{aligned}$$
(2.15)

Similarly, applying Lemma 2.3 to each component of the recursion \({\tilde{u}}_t = (1-\alpha _t){\tilde{u}}_{t-1}+ \alpha _tu_t\) in (2.9) (with \(k=t\), \(C_k=\varLambda _t\), \(c_k=\alpha _t\), \(\delta _k={\tilde{u}}_t\), and \(B_k=\alpha _tu_t\)), we have a weighted sum description of \({\tilde{u}}_t\) below:

$$\begin{aligned} {\tilde{u}}_t = \varLambda _t\left[ (1-\alpha _1){\tilde{u}}_0 + \sum _{i=1}^{t}\frac{\alpha _i}{\varLambda _i}u_i\right] . \end{aligned}$$
(2.16)

In view of (2.15) and the fact that \({\tilde{u}}_0 = {\overline{x}}\) in the description of the ProxAG procedure, the above relation indicates that \({\tilde{u}}_t\) is a convex combination of \({\overline{x}}\) and \(\{u_i\}_{i=1}^t\).
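As a quick numerical sanity check of (2.15) (ours, not part of the analysis): for \(\alpha _t=2/(t+1)\), the recursion (2.14) gives \(\varLambda _t=2/(t(t+1))\) and hence \(\alpha _i/\varLambda _i=i\), and the normalized weights indeed sum to one.

```python
import numpy as np

# Verify (2.15) for alpha_t = 2/(t+1), t = 1, ..., T (note alpha_1 = 1 here).
T = 10
alpha = np.array([2.0 / (t + 1) for t in range(1, T + 1)])
Lam = np.empty(T)
Lam[0] = 1.0                                   # Lambda_1 = 1 by (2.14)
for t in range(1, T):
    Lam[t] = (1.0 - alpha[t]) * Lam[t - 1]     # Lambda_t = (1 - alpha_t) Lambda_{t-1}
total = Lam[-1] / (1.0 - Lam[-1] * (1.0 - alpha[0])) * np.sum(alpha / Lam)
print(total)    # 1.0 up to rounding, as predicted by (2.15)
```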

2.3 Convergence properties of the ProxAG procedure

With the help of the technical results in the previous subsection, we are now ready to derive some important convergence properties for the ProxAG procedure in terms of the error measure \(Q_k(\cdot ,\cdot )\). For the sake of notational convenience, when we work on the k-th call to the ProxAG procedure, we drop the subscript k in (2.10) and just denote

$$\begin{aligned} Q(x,u):=g(x) - g(u) + h(x) - h(u). \end{aligned}$$
(2.17)

In a similar vein, we also define

$$\begin{aligned} {\underline{x}}:=(1-\gamma ){\overline{x}}+ \gamma x \text { and }{\overline{x}}^+:=(1-\lambda ){\overline{x}}+ \lambda {\tilde{x}}^+. \end{aligned}$$
(2.18)

Comparing the above notations with (2.3) and (2.6), we can observe that \({\underline{x}}\) and \({\overline{x}}^+\), respectively, represent \({\underline{x}}_k\) and \({\overline{x}}_k\) in the k-th call to the ProxAG procedure.

Lemma 2.4

Consider the k-th call to the ProxAG procedure in Algorithm 1 and let \(\varLambda _t\) and \({\overline{x}}^+\) be defined in (2.14) and (2.18) respectively. If the parameters satisfy

$$\begin{aligned} \lambda \le 1, \varLambda _T(1-\alpha _1) = 1 - \frac{\gamma }{\lambda },\text { and } \beta p_t + q_t \ge \frac{\lambda M\alpha _t}{\nu }, \end{aligned}$$
(2.19)

then

$$\begin{aligned} \begin{aligned}&Q({\overline{x}}^+, u) - (1-\gamma )Q({\overline{x}}, u) \le \varLambda _T\sum _{t=1}^T\frac{\varUpsilon _t(u)}{\varLambda _t},\ \forall u\in X, \end{aligned} \end{aligned}$$
(2.20)

where

$$\begin{aligned} \varUpsilon _t(u):=&{\lambda \beta \alpha _t}[V(x,u) - V(x,u_t) + p_tV(u_{t-1}, u) - (1+p_t)V(u_t,u)] \nonumber \\&+ {\lambda \alpha _tq_t}[V(u_{t-1}, u) - V(u_t,u)]. \end{aligned}$$
(2.21)

Proof

Let us fix any arbitrary \(u\in X\) and denote

$$\begin{aligned} v := (1-\lambda ){\overline{x}}+ \lambda u, \text { and } {\overline{u}}_t:=(1-\lambda ){\overline{x}}+ \lambda {\tilde{u}}_t. \end{aligned}$$
(2.22)

Our proof consists of two major parts. We first prove that

$$\begin{aligned} Q({\overline{x}}^+, u) - (1-\gamma )Q({\overline{x}}, u) \le Q({\overline{u}}_T, v) - \left( 1 - \frac{\gamma }{\lambda }\right) Q({\overline{u}}_0, v), \end{aligned}$$
(2.23)

and then estimate the right-hand-side of (2.23) through the following recurrence property:

$$\begin{aligned} Q({\overline{u}}_t, v) - (1-\alpha _t)Q({\overline{u}}_{t-1}, v)\le \varUpsilon _t(u). \end{aligned}$$
(2.24)

The result in (2.20) then follows as an immediate consequence of (2.23) and (2.24). Indeed, by Lemma 2.3 applied to (2.24) (with \(k=t\), \(C_k=\varLambda _t\), \(c_k=\alpha _t\), \(\delta _k=Q({\overline{u}}_t,v)\), and \(B_k=\varUpsilon _t(u)\)), we have

$$\begin{aligned} Q({\overline{u}}_T, v) \le&\ \varLambda _T\left[ \frac{1-\alpha _1}{\varLambda _1}Q({\overline{u}}_0,v) + \sum _{t=1}^T\frac{\varUpsilon _t(u)}{\varLambda _t}\right] = \left( 1 - \frac{\gamma }{\lambda }\right) Q({\overline{u}}_0,v) + \varLambda _T\sum _{t=1}^T\frac{\varUpsilon _t(u)}{\varLambda _t}, \end{aligned}$$

where the last equality follows from (2.19) and the fact that \(\varLambda _1=1\) in the definition of \(\varLambda _t\) in (2.14). The above relation together with (2.23) then clearly implies (2.20).

We start with the first part of the proof regarding (2.23). By the definition of Q in (2.17) and the linearity of \(g(\cdot )\), we have

$$\begin{aligned} \begin{aligned}&Q({\overline{x}}^+, u) - (1-\gamma )Q({\overline{x}}, u) \\&\quad = \ g({\overline{x}}^+) - (1-\gamma )g({\overline{x}}) - \gamma g(u) + h({\overline{x}}^+) - (1-\gamma )h({\overline{x}}) - \gamma h(u) \\&\quad = \ g({\overline{x}}^+ - (1-\gamma ){\overline{x}}- \gamma u) + h({\overline{x}}^+) - (1-\gamma )h({\overline{x}}) - \gamma h(u) \\&\quad = \ g({\overline{x}}^+ - {\overline{x}}+ \gamma ({\overline{x}}- u)) + h({\overline{x}}^+) - h({\overline{x}}) + \gamma (h({\overline{x}}) - h(u)). \end{aligned} \end{aligned}$$
(2.25)

Now, by the relation between u and v in (2.22), we have

$$\begin{aligned} \gamma ({\overline{x}}- u) = \frac{\gamma }{\lambda }(\lambda {\overline{x}}- \lambda u) = \frac{\gamma }{\lambda }({\overline{x}}-v). \end{aligned}$$
(2.26)

In addition, by (2.22) and the convexity of \(h(\cdot )\), we obtain

$$\begin{aligned} \frac{\gamma }{\lambda }[h(v) - (1-\lambda )h({\overline{x}}) - \lambda h(u)]\le 0, \end{aligned}$$

or equivalently,

$$\begin{aligned} \gamma (h({\overline{x}})-h(u))\le \frac{\gamma }{\lambda }(h({\overline{x}})-h(v)). \end{aligned}$$
(2.27)

Applying (2.26) and (2.27) to (2.25), and using the definition of Q in (2.17), we obtain

$$\begin{aligned}&Q({\overline{x}}^+, u) - (1-\gamma )Q({\overline{x}}, u) \\&\quad \le \ g\left( {\overline{x}}^+ - {\overline{x}}+ \frac{\gamma }{\lambda }({\overline{x}}-v)\right) + h({\overline{x}}^+) - h({\overline{x}}) + \frac{\gamma }{\lambda }(h({\overline{x}})-h(v)) \\&\quad = \ g({\overline{x}}^+) - \left( 1 - \frac{\gamma }{\lambda }\right) g({\overline{x}}) - \frac{\gamma }{\lambda }g(v) + h({\overline{x}}^+) - \left( 1 - \frac{\gamma }{\lambda }\right) h({\overline{x}}) - \frac{\gamma }{\lambda }h(v) \\&\quad = \ Q({\overline{x}}^+, v) - \left( 1 - \frac{\gamma }{\lambda }\right) Q({\overline{x}}, v). \end{aligned}$$

Noting that \({\tilde{u}}_0={\overline{x}}\) and \({\tilde{x}}^+ = {\tilde{u}}_T\) in the description of the ProxAG procedure, by the definitions of \({\overline{x}}^+\) and \({\overline{u}}_t\) in (2.18) and (2.22) we have \({\overline{x}}^+ = {\overline{u}}_T\) and \({\overline{u}}_0={\overline{x}}\) respectively. Therefore, the above relation is equivalent to (2.23), and we conclude the first part of the proof.

For the second part of the proof regarding (2.24), first note that by the definitions of \({\underline{u}}_t\), \({\tilde{u}}_t\), and v in (2.7), (2.9), and (2.22) respectively,

$$\begin{aligned}&{\overline{u}}_t - (1-\alpha _t){\overline{u}}_{t-1}- \alpha _t v = ({\overline{u}}_t - {\overline{u}}_{t-1}) + \alpha _t({\overline{u}}_{t-1}- v) \\&\quad = \lambda ({\tilde{u}}_t - {\tilde{u}}_{t-1}) + \lambda \alpha _t({\tilde{u}}_{t-1}- u) = \lambda ({\tilde{u}}_t - (1-\alpha _t){\tilde{u}}_{t-1}) - \lambda \alpha _t u \\&\quad = \lambda \alpha _t(u_t - u). \end{aligned}$$

By a similar argument as the above, we also have

$$\begin{aligned} {\overline{u}}_t - {\underline{u}}_t = \lambda ({\tilde{u}}_t - (1-\alpha _t){\tilde{u}}_{t-1}) - \lambda \alpha _tu_{t-1}= \lambda \alpha _t(u_t - u_{t-1}). \end{aligned}$$

Now observe that by the definition of Q in (2.17), the convexity of \(h(\cdot )\), and the smoothness inequality (1.7) regarding constant M,

$$\begin{aligned} \begin{aligned}&Q({\overline{u}}_t, v) - (1-\alpha _t)Q({\overline{u}}_{t-1}, v) \\&\quad \overset{(2.17)}{=} \ \lambda \alpha _t(g(u_t) - g(u)) + h({\overline{u}}_t) - (1-\alpha _t)h({\overline{u}}_{t-1}) - \alpha _th(v) \\&\quad \overset{(1.7)}{\le } \ \lambda \alpha _t(g(u_t) - g(u)) + l_h({\underline{u}}_t, {\overline{u}}_t) + \frac{M}{2}\Vert {\overline{u}}_t - {\underline{u}}_t\Vert ^2 \\&\qquad - (1-\alpha _t)l_h({\underline{u}}_t,{\overline{u}}_{t-1}) - \alpha _tl_h({\underline{u}}_t,v) \\&\quad = \ \lambda \alpha _t(g(u_t) - g(u)) + \langle \nabla h({\underline{u}}_t), {\overline{u}}_t-(1-\alpha _t){\overline{u}}_{t-1}- \alpha _tv\rangle + \frac{M}{2}\Vert {\overline{u}}_t - {\underline{u}}_t\Vert ^2. \end{aligned} \end{aligned}$$

Summarizing the above three relations, we have

$$\begin{aligned}&Q({\overline{u}}_t, v) - (1-\alpha _t)Q({\overline{u}}_{t-1}, v) \\&\quad \le \ \lambda \alpha _t(g(u_t) - g(u)) + \lambda \alpha _t\langle \nabla h({\underline{u}}_t),u_t - u\rangle + \frac{M\lambda ^2\alpha _t^2}{2}\Vert u_t - u_{t-1}\Vert ^2 \\&\quad = \ \lambda \alpha _t\left[ g(u_t)-g(u) + l_h({\underline{u}}_t, u_t) - l_h({\underline{u}}_t, u) + \frac{M\lambda \alpha _t}{2}\Vert u_t - u_{t-1}\Vert ^2\right] . \end{aligned}$$

Moreover, it follows from Lemma 2.2 applied to the optimization problem in the definition of \(u_t\) in (2.8) that

$$\begin{aligned} \begin{aligned}&g(u_t)-g(u) + l_h({\underline{u}}_t, u_t) - l_h({\underline{u}}_t, u) \\ \le&\ \beta (V(x,u) - V(u_t, u) - V(x, u_t)) + (\beta p_t+q_t)(V(u_{t-1}, u) - V(u_t, u) - V(u_{t-1}, u_t)). \end{aligned} \end{aligned}$$

Also by the relation between the prox-function V and norm in (1.11) and our assumption (2.19), we have

$$\begin{aligned} \frac{M\lambda \alpha _t}{2}\Vert u_t - u_{t-1}\Vert ^2 \le \frac{M\lambda \alpha _t}{\nu }V(u_{t-1}, u_t) \le (\beta p_t+q_t)V(u_{t-1}, u_t). \end{aligned}$$

Combining the above three relations, we conclude (2.24). \(\square\)

In the following proposition, we provide certain sufficient conditions under which the right-hand-side of (2.20) can be properly bounded. As a consequence, we obtain a recurrence relation for the ProxAG procedure in terms of \(Q({\overline{x}}_k,u)\).

Proposition 2.1

Consider the k-th call to the ProxAG procedure. If (2.19) holds, and

$$\begin{aligned} \frac{\alpha _tq_t}{\varLambda _t} = \frac{\alpha _{t+1}q_{t+1}}{\varLambda _{t+1}} \ \text { and } \ \frac{\alpha _t(1+p_t)}{\varLambda _t} = \frac{\alpha _{t+1}p_{t+1}}{\varLambda _{t+1}} \end{aligned}$$
(2.28)

for any \(1 \le t \le T-1\), then we have

$$\begin{aligned} \begin{aligned}&Q({\overline{x}}^+, u) - (1-\gamma )Q({\overline{x}}, u) \le \lambda \alpha _T[\beta (1+p_T)+q_T]\left[ V(x,u) - V(x^+,u)\right] \\&\quad - \frac{\nu \beta }{2\gamma }\Vert {\overline{x}}^+ - {\underline{x}}\Vert ^2, \end{aligned} \end{aligned}$$
(2.29)

where \({\overline{x}}^+\) and \({\underline{x}}\) are defined in (2.18).

Proof

To prove the proposition it suffices to estimate the right-hand-side of (2.20). We make three observations regarding the results (2.20) and (2.21) of Lemma 2.4. First, by the weighted sum result of the \(\alpha _i/\varLambda _i\)’s in (2.15),

$$\begin{aligned} \lambda \beta \varLambda _T\sum _{t=1}^{T}\frac{\alpha _t}{\varLambda _t}V(x,u) = \lambda \beta (1-\varLambda _T(1-\alpha _1))V(x,u). \end{aligned}$$

Second, by the prox-function and norm relation (1.11), the weighted sum results (2.15) and (2.16), the assumption of parameters in (2.19), the convexity of \(\Vert \cdot \Vert ^2\), and the fact that \({\tilde{u}}_0={\overline{x}}\) and \({\tilde{x}}^+={\tilde{u}}_T\) in the ProxAG procedure, we have

$$\begin{aligned} \lambda \beta \varLambda _T\sum _{t=1}^{T}\frac{\alpha _t}{\varLambda _t}V(x,u_t) \overset{(1.11)(2.19)}{\ge }&\ \frac{\nu \gamma \beta }{2}\cdot \frac{\varLambda _T}{(1-\varLambda _T(1-\alpha _1))}\sum _{t=1}^{T}\frac{\alpha _t}{\varLambda _t}\Vert x - u_t\Vert ^2 \\ \ge&\frac{\nu \gamma \beta }{2}\left\| x-\frac{\varLambda _T}{1-\varLambda _T(1-\alpha _1)} \sum _{t=1}^{T}\frac{\alpha _t}{\varLambda _t}u_t\right\| ^2 \\ \overset{(2.15)(2.16)}{=}&\ \frac{\nu \gamma \beta }{2}\left\| x-\frac{{\tilde{u}}_T - \varLambda _T(1-\alpha _1){\tilde{u}}_0}{1-\varLambda _T(1-\alpha _1)} \right\| ^2\\ =&\frac{\nu \gamma \beta }{2}\left\| x-\frac{\lambda }{\gamma }{\tilde{u}}_T - \left( 1-\frac{\lambda }{\gamma }\right) {\tilde{u}}_0 \right\| ^2 \\ =&\ \frac{\nu \beta }{2\gamma }\left\| \gamma x- \lambda {\tilde{x}}^+ - (\gamma -\lambda ){\overline{x}}\right\| ^2 \\ =&\ \frac{\nu \beta }{2\gamma }\Vert {\underline{x}}- {\overline{x}}^+\Vert ^2, \end{aligned}$$

where the last equality follows from the definitions of \({\underline{x}}\) and \({\overline{x}}^+\) in (2.18). Third, by the assumption of parameters in (2.28), the fact that \(\varLambda _1=1\) in (2.14), and the relations that \(u_0 = x\) and \(u_T=x^+\) in the ProxAG procedure, we have

$$\begin{aligned}&\lambda \beta \varLambda _T\sum _{t=1}^{T}\frac{\alpha _t}{\varLambda _t}[p_tV(u_{t-1}, u) - (1+p_t)V(u_t,u)] + \lambda \varLambda _T\sum _{t=1}^{T}\frac{\alpha _tq_t}{\varLambda _t}[V(u_{t-1}, u) - V(u_t,u)] \\&\quad =\ \lambda \beta \varLambda _T\left[ \alpha _1p_1V(u_0,u) - \sum _{t=1}^{T-1}\left( \frac{\alpha _t(1+p_t)}{\varLambda _t} - \frac{\alpha _{t+1}p_{t+1}}{\varLambda _{t+1}}\right) V(u_t,u) \right. \\&\qquad \left. - \frac{\alpha _T(1+p_T)}{\varLambda _T}V(u_T,u)\right] + \lambda \alpha _Tq_T[V(u_0, u) - V(u_T,u)] \\&\quad = \ \lambda \beta \left[ \varLambda _T\alpha _1p_1V(u_0,u) - \alpha _T(1+p_T)V(u_T,u)\right] + \lambda \alpha _Tq_T[V(u_0, u) - V(u_T,u)] \\&\quad = \ \lambda \beta \left[ \varLambda _T\alpha _1p_1V(x,u) - \alpha _T(1+p_T)V(x^+,u)\right] + \lambda \alpha _Tq_T[V(x, u) - V(x^+,u)] . \end{aligned}$$

Using the above three observations in the result of Lemma 2.4 in (2.20), we have

$$\begin{aligned} \begin{aligned}&Q({\overline{x}}^+, u) - (1-\gamma )Q({\overline{x}}, u) \\&\quad \le \ \lambda \beta \left[ (1-\varLambda _T(1-\alpha _1) + \varLambda _T\alpha _1p_1)V(x,u) - \alpha _T(1+p_T)V(x^+,u)\right] \\&\qquad + \lambda \alpha _Tq_T[V(x, u) - V(x^+,u)] - \frac{\nu \beta }{2\gamma }\Vert {\underline{x}}- {\overline{x}}^+\Vert ^2. \end{aligned} \end{aligned}$$

Comparing the above equation with our goal (2.29), it suffices to show that

$$\begin{aligned} \alpha _T(1+p_T) = \varLambda _T\alpha _1p_1 + 1-\varLambda _T(1-\alpha _1). \end{aligned}$$

By the last relation in our assumption (2.28), the weighted sum result of \(\alpha _i/\varLambda _i\)’s in (2.15), and the fact that \(\varLambda _1=1\), we have

$$\begin{aligned} \frac{\alpha _t(1+p_t)}{\varLambda _t} = \frac{\alpha _{t+1}p_{t+1}}{\varLambda _{t+1}} = \frac{\alpha _tp_t}{\varLambda _t} + \frac{\alpha _t}{\varLambda _t} = \ldots = \frac{\alpha _1p_1}{\varLambda _1} + \sum _{i=1}^{t}\frac{\alpha _i}{\varLambda _i} = \alpha _1p_1 + \frac{1-\varLambda _t(1-\alpha _1)}{\varLambda _t}. \end{aligned}$$

The above implies that \(\alpha _t(1+p_t) = \varLambda _t\alpha _1p_1 + 1-\varLambda _t(1-\alpha _1)\) for any \(1 \le t \le T\). \(\square\)

2.4 Main convergence results of the AGS method

With the help of the above proposition and Lemma 2.1, we are now ready to establish the convergence of the AGS method. The following sequence will be used in the analysis of the AGS method:

$$\begin{aligned} \varGamma _k = {\left\{ \begin{array}{ll} 1 &{} k=1 \\ (1-\gamma _k)\varGamma _{k-1}&{} k>1. \end{array}\right. } \end{aligned}$$
(2.30)

Theorem 2.1

Suppose that the parameters of the k-th call to the ProxAG procedure in Algorithm 1 satisfy

$$\begin{aligned} \lambda \le 1, \varLambda _T(1-\alpha _1) =&1 - \frac{\gamma }{\lambda }, \beta p_t + q_t \ge \frac{\lambda M\alpha _t}{\nu }, \frac{\alpha _tq_t}{\varLambda _t} = \frac{\alpha _{t+1}q_{t+1}}{\varLambda _{t+1}} \nonumber \\&\text { and } \ \frac{\alpha _t(1+p_t)}{\varLambda _t} = \frac{\alpha _{t+1}p_{t+1}}{\varLambda _{t+1}}. \end{aligned}$$
(2.31)

for any \(1 \le t \le T-1\). If

$$\begin{aligned} \gamma _1=1\text { and }\beta _k\ge \frac{L\gamma _k}{\nu }, \end{aligned}$$
(2.32)

then

$$\begin{aligned} \begin{aligned}&\phi ({\overline{x}}_k) - \phi (u)\le \varGamma _k\sum _{i=1}^{k}\frac{\lambda _i\alpha _{T_i}(\beta _i(1+p_{T_i})+q_{T_i})}{\varGamma _i}(V(x_{i-1},u) - V(x_i, u)), \end{aligned} \end{aligned}$$
(2.33)

where \(\varGamma _k\) is defined in (2.30).

Proof

Note that (2.31) is simply a summary of assumptions (2.19) and (2.28) for Proposition 2.1. It follows from Proposition 2.1 that for all \(u\in X\),

$$\begin{aligned}&Q_k({\overline{x}}_k, u) - (1-\gamma _k)Q_k({\overline{x}}_{k-1}, u) \le \lambda _k\alpha _{T_k}(\beta _k(1+p_{T_k})+q_{T_k})(V(x_{k-1},u)\\&\quad - V(x_k, u)) - \frac{\nu \beta _k}{2\gamma _k}\Vert {\overline{x}}_k - {\underline{x}}_k\Vert ^2. \end{aligned}$$

Substituting the above bound into the result (2.11) in Lemma 2.1, and using the assumption (2.32), we have

$$\begin{aligned} \begin{aligned}&\phi ({\overline{x}}_k) - \phi (u) \le (1-\gamma _k)[\phi ({\overline{x}}_{k-1}) - \phi (u)] + \lambda _k\alpha _{T_k}(\beta _k(1+p_{T_k})\\&\quad +q_{T_k})(V(x_{k-1},u) - V(x_k, u)), \end{aligned} \end{aligned}$$

which, in view of Lemma 2.3 (with \(c_k=\gamma _k\), \(C_k = \varGamma _k\), and \(\delta _k = \phi ({\overline{x}}_k) - \phi (u)\)), then implies that

$$\begin{aligned} \phi ({\overline{x}}_k) - \phi (u) \le&\ \varGamma _k\left[ \frac{1-\gamma _1}{\varGamma _1}(\phi ({\overline{x}}_0) - \phi (u)) \right. \\&\left. + \sum _{i=1}^{k}\frac{\lambda _i\alpha _{T_i}(\beta _i(1+p_{T_i})+q_{T_i})}{\varGamma _i}(V(x_{i-1},u) - V(x_i, u))\right] \\ =&\ \varGamma _k\sum _{i=1}^{k}\frac{\lambda _i\alpha _{T_i}(\beta _i(1+p_{T_i})+q_{T_i})}{\varGamma _i}(V(x_{i-1},u) - V(x_i, u)), \end{aligned}$$

where the last equality follows from the fact that \(\gamma _1=1\) in (2.32). \(\square\)

There are many possible selections of parameters that satisfy the assumptions of the above theorem. In the following corollaries we describe two different ways to specify the parameters of Algorithm 1 that lead to the optimal complexity bounds in terms of the number of gradient evaluations of \(\nabla f\) and \(\nabla h\).

Corollary 2.1

Consider problem (1.6) with the Lipschitz constants in (1.2) and (1.7) satisfying \(M\ge L\). Suppose that the parameters of Algorithm 1 are set to

$$\begin{aligned} \begin{aligned}&\gamma _k = \frac{2}{k+1},\ T_k\equiv T:=\left\lceil \sqrt{\frac{M}{L}}\right\rceil ,\ \lambda _k = {\left\{ \begin{array}{ll} 1 &{} k=1,\\ \dfrac{\gamma _k(T+1)(T+2)}{T(T+3)} &{} k>1, \end{array}\right. } \text { and } \beta _k = \frac{3L\gamma _k}{\nu k\lambda _k}. \end{aligned} \end{aligned}$$
(2.34)

Also assume that the parameters in the first call to the ProxAG procedure (\(k=1\)) are set to

$$\begin{aligned} \alpha _t = \frac{2}{t+1},\ \ p_t = \frac{t-1}{2}, \text { and }q_t = \frac{6M}{\nu t}, \end{aligned}$$
(2.35)

and the parameters in the remaining calls to the ProxAG procedure (\(k>1\)) are set to

$$\begin{aligned} \alpha _t = \frac{2}{t+2},\ \ p_t = \frac{t}{2}, \text { and }q_t = \frac{6M}{\nu k(t+1)}. \end{aligned}$$
(2.36)

Then the numbers of gradient evaluations of \(\nabla f\) and \(\nabla h\) performed by the AGS method to compute an \(\varepsilon\)-solution of (1.6) can be bounded by

$$\begin{aligned} N_{f}:=\sqrt{\frac{30LV(x_0, x^*)}{\nu \varepsilon }} \end{aligned}$$
(2.37)

and

$$\begin{aligned} N_{h} := \sqrt{\frac{30MV(x_0, x^*)}{\nu \varepsilon }} + \sqrt{\frac{30LV(x_0, x^*)}{\nu \varepsilon }} \end{aligned}$$
(2.38)

respectively, where \(x^*\) is a solution to (1.6).

Proof

Let us start with verification of (2.31) and (2.32) for the purpose of applying Theorem 2.1. We will consider the first call to the ProxAG procedure (\(k=1\)) and the remaining calls (\(k>1\)) separately.

When \(k=1\), by (2.34) we have \(\lambda _1=\gamma _1=1\), and \(\beta _1=3L/\nu\), hence (2.32) holds immediately. By (2.35) we can observe that \(\varLambda _t=2/(t(t+1))\) satisfies (2.14), and that

$$\begin{aligned} \frac{\alpha _tq_t}{\varLambda _t} \equiv \frac{6M}{\nu },\text { and }\frac{\alpha _t(1+p_t)}{\varLambda _t} = \frac{t(t+1)}{2}=\frac{\alpha _{t+1}p_{t+1}}{\varLambda _{t+1}}. \end{aligned}$$

In addition, by (2.34) and (2.35) we have \(\lambda =\gamma =1\) and \(\alpha _1=1\) in (2.31), and that

$$\begin{aligned} \beta p_t + q_t \ge q_t = \frac{6M}{\nu t} > \frac{2M}{\nu (t+1)} = \frac{\lambda M \alpha _t}{\nu }. \end{aligned}$$

Therefore (2.31) holds.

For the case when \(k>1\), from (2.34) and noting that \(k, T\ge 1\), we have

$$\begin{aligned} \frac{3}{k} > \frac{3\gamma _k}{2} = \frac{3\lambda _k}{2}\left( 1 - \frac{2}{(T+1)(T+2)}\right) \ge \frac{3\lambda _k}{2}\left( 1-\frac{2}{2\cdot 3}\right) = \lambda _k. \end{aligned}$$
(2.39)

Applying the above relation to the definition of \(\beta _k\) in (2.34) we have (2.32). It now suffices to verify (2.31) in order to apply Theorem 2.1. We can observe from (2.36) that \(\varLambda _t=6/((t+1)(t+2))\) satisfies (2.14), \({\alpha _tq_t}/{\varLambda _t}\equiv {2M}/{(\nu k)}\), and that

$$\begin{aligned} \frac{\alpha _t(1+p_t)}{\varLambda _t} = \frac{(t+1)(t+2)}{6}=\frac{\alpha _{t+1}p_{t+1}}{\varLambda _{t+1}}. \end{aligned}$$

Applying (2.34), (2.36), (2.39), and noting that \(k\ge 2\) and that \(\varLambda _T=6/((T+1)(T+2))\) with \(T\ge 1\), we can verify in (2.31) that

$$\begin{aligned}&\lambda = \frac{\gamma (T+1)(T+2)}{T(T+3)} = \frac{2}{k+1}\left( 1+\frac{2}{T(T+3)}\right) \le \frac{2}{3}\left( 1+\frac{2}{1\cdot 4}\right) =1, \\&\varLambda _T(1-\alpha _1) = \frac{2}{(T+1)(T+2)} = 1 - \frac{T(T+3)}{(T+1)(T+2)} = 1- \frac{\gamma }{\lambda }, \\&\beta p_t + q_t> q_t = \frac{2M}{\nu (t+1)}\cdot \frac{3}{k}>\frac{2\lambda M}{\nu (t+1)}\ge \frac{\lambda M\alpha _t}{\nu }. \end{aligned}$$

Therefore, the conditions in (2.31) are satisfied.

We are now ready to apply Theorem 2.1. In particular, noting that \(\alpha _t(1+p_t)\equiv 1\) from (2.35) and (2.36), we obtain from the result (2.33) of Theorem 2.1 (with \(u=x^*\)) that

$$\begin{aligned} \phi ({\overline{x}}_k) - \phi ^*\le \varGamma _k\sum _{i=1}^{k}\xi _i(V(x_{i-1},x^*) - V(x_i, x^*)), \end{aligned}$$
(2.40)

where

$$\begin{aligned} \xi _i:=\frac{\lambda _i(\beta _i+\alpha _{T_i}q_{T_i})}{\varGamma _i}. \end{aligned}$$
(2.41)

Substituting (2.34) and (2.35) into (2.41), and noting that \(\varGamma _i=2/(i(i+1))\) by (2.30), we have

$$\begin{aligned} \xi _1 =&\ \beta _1 + \alpha _{T} q_{T} = \frac{3L}{\nu } + \frac{12 M}{\nu T(T+1)},\text { and } \\ \xi _i =&\ \frac{\lambda _i\beta _i}{\varGamma _i} + \frac{\lambda _i\alpha _{T_i} q_{T_i}}{\varGamma _i} = \frac{3L\gamma _i}{\nu i\varGamma _i} + \frac{\gamma _i}{\varGamma _i}\frac{(T_i+1)(T_i+2)}{T_i(T_i+3)}\frac{2}{T_i+2}\frac{6M}{\nu i(T_i+1)} \\ \equiv&\ \frac{3L}{\nu } + \frac{12M}{\nu T(T+3)},\forall i>1. \end{aligned}$$

Applying the above two results regarding \(\xi _i\) to (2.40), and noting that \(\xi _1>\xi _2\), we have

$$\begin{aligned} \phi ({\overline{x}}_k) - \phi ^* \le&\ \varGamma _k\left[ \xi _1(V(x_0,x^*) - V(x_1, x^*)) + \sum _{i=2}^k\xi _i(V(x_{i-1},x^*) - V(x_i, x^*)) \right] \\ =&\ \varGamma _k\left[ \xi _1(V(x_0,x^*) - V(x_1, x^*)) + \xi _2(V(x_1,x^*) - V(x_k, x^*)) \right] \\ \le&\ \varGamma _k\xi _1V(x_0,x^*) \\ =&\ \frac{2}{k(k+1)}\left( \frac{3L}{\nu } + \frac{12M}{\nu T(T+1)}\right) V(x_0, x^*) \\ \le&\ \frac{30L}{\nu k(k+1)}V(x_0, x^*), \end{aligned}$$

where the last inequality is due to the fact that \(T\ge \sqrt{M/L}\).

From the above inequality, the number of calls to the ProxAG procedure for computing an \(\varepsilon\)-solution of (1.6) is bounded by \(N_{f}\) in (2.37). This is also the bound for the number of gradient evaluations of \(\nabla f\). Moreover, the number of gradient evaluations of \(\nabla h\) is bounded by

$$\begin{aligned} TN_{f} \le \left( \sqrt{\frac{M}{L}}+1\right) N_{ f} = \sqrt{\frac{30MV(x_0, x^*)}{\nu \varepsilon }}+\sqrt{\frac{30LV(x_0, x^*)}{\nu \varepsilon }} = N_{ h}. \end{aligned}$$

\(\square\)
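To see the bounds (2.37) and (2.38) at work, one can run the `ags` sketch from Sect. 2.1 on a toy instance of (1.6) and count the oracle calls (the instance and all names below are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, L, M = 50, 1.0, 1.0e4
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(rng.uniform(0.1, 1.0, n)) @ Q.T     # symmetric PSD with ||A|| <= 1
b = rng.standard_normal(n)

calls = {"f": 0, "h": 0}

def grad_f(x):                 # f(x) = ||x - b||^2 / 2, so (1.2) holds with L = 1
    calls["f"] += 1
    return x - b

def grad_h(x):                 # h(x) = (M/2) x'Ax, so (1.7) holds with constant M
    calls["h"] += 1
    return M * (A @ x)

xN = ags(grad_f, grad_h, np.zeros(n), L=L, M=M, N=200)
print(calls)                   # about sqrt(M/L) = 100 h-gradients per f-gradient
```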

In the above corollary, the constant factors in (2.37) and (2.38) are both \(\sqrt{30}\). In the following corollary, we provide a slightly different set of parameters for Algorithm 1 that results in a smaller constant factor for (2.37).

Corollary 2.2

Consider problem (1.6) with the Lipschitz constants in (1.2) and (1.7) satisfying \(M\ge L\). Suppose that the parameters in the first call to the ProxAG procedure (\(k=1\)) are set to

$$\begin{aligned} \alpha _t=\frac{2}{t+1},\ p_t = \frac{t-1}{2}, \text { and } q_t = \frac{7LT(T+1)}{4\nu t}, \end{aligned}$$
(2.42)

and that the parameters in the k-th call (\(k>1\)) are set to

$$\begin{aligned} p_t\equiv p:=\sqrt{\frac{M}{L}},\ \alpha _t\equiv \alpha :=\frac{1}{p+1},\text { and }q_t\equiv 0. \end{aligned}$$
(2.43)

If the other parameters in Algorithm 1 satisfy

$$\begin{aligned} \begin{aligned}&\gamma _k =\dfrac{2}{k+1}, \ T_k:=\left\{ \begin{array}{lll} \left\lceil \sqrt{\dfrac{8M}{7L}}\right\rceil , &{} k=1 \\ \left\lceil \dfrac{\ln (3)}{-\ln (1-\alpha )}\right\rceil , &{} k>1, \end{array}\right. \ \lambda _k:=\left\{ \begin{array}{lll} 1, &{} k=1 \\ \dfrac{\gamma _k}{1 - (1-\alpha )^{T_k}}, &{} k>1, \end{array}\right. \\&\text { and } \beta _k:=\left\{ \begin{array}{lll} \dfrac{L}{\nu }, &{} k=1 \\ \dfrac{9L\gamma _k}{2\nu k\lambda _k}, &{} k>1, \end{array}\right. \end{aligned} \end{aligned}$$
(2.44)

where \(\alpha\) is defined in (2.43), then the numbers of gradient evaluations of \(\nabla f\) and \(\nabla h\) performed by the AGS method to find an \(\varepsilon\)-solution to problem (1.6) can be bounded by

$$\begin{aligned} N_{f}:=3\sqrt{\frac{LV(x_0, x^*)}{\nu \varepsilon }} \end{aligned}$$
(2.45)

and

$$\begin{aligned} \begin{aligned} N_{h} :=&\ (1+\ln 3)N_{f}\left( \sqrt{\frac{M}{L}}+1\right) \le 7\left( \sqrt{\frac{MV(x_0, x^*)}{\nu \varepsilon }} + \sqrt{\frac{LV(x_0, x^*)}{\nu \varepsilon }}\right) , \end{aligned} \end{aligned}$$
(2.46)

respectively.

Proof

Let us verify (2.31) and (2.32), so that we can apply Theorem 2.1. We consider the case when \(k=1\) first. By the definitions of \(\gamma _k\) and \(\beta _k\) in (2.44), it is clear that (2.32) is satisfied when \(k=1\). Also, by (2.42) we have \(\varLambda _t=2/(t(t+1))\) in (2.14),

$$\begin{aligned} \frac{\alpha _tq_t}{\varLambda _t}\equiv \frac{7LT_1(T_1+1)}{4\nu },\text { and }\frac{\alpha _t(1+p_t)}{\varLambda _t}=\frac{t(t+1)}{2}=\frac{\alpha _{t+1}p_{t+1}}{\varLambda _{t+1}}. \end{aligned}$$

Moreover, by (2.42) and (2.44), we can verify in (2.31) that

$$\begin{aligned} \lambda = \gamma = 1, \varLambda _{T_1}(1-\alpha _1) = 0 = 1-\frac{\gamma }{\lambda },\text { and } \beta p_t + q_t\ge q_t> \frac{7LT^2}{4\nu t} \ge \frac{2M}{\nu t}> \frac{M\alpha _t}{\nu }. \end{aligned}$$

Therefore the relations in (2.31) are all satisfied.

Now we consider the case when \(k>1\). By the definition of \(\varLambda _t\) in (2.14) and our setting of parameters in (2.43), we observe that \(\varLambda _t = (1-\alpha )^{t-1}\) for all \(t\ge 1\). Moreover, from the definition of \(T_k\) in (2.44), we can also observe that

$$\begin{aligned} (1-\alpha )^{T_k}\le \frac{1}{3}. \end{aligned}$$

Four relations can be derived based on the aforementioned two observations, (2.43), and (2.44). First,

$$\begin{aligned} \beta _k = \frac{9L(1-(1-\alpha )^{T_k})}{2\nu k}\ge \frac{3L}{\nu k}>\frac{L\gamma _k}{\nu }, \end{aligned}$$

which leads to (2.32). Second,

$$\begin{aligned} \frac{\alpha _t q_t}{\varLambda _t}\equiv 0,\ \frac{\alpha _t(1+p_t)}{\varLambda _t} = \frac{1}{(1-\alpha )^{t-1}} = \frac{\alpha _{t+1}p_{t+1}}{\varLambda _{t+1}}. \end{aligned}$$

Third, noting that \(k\ge 2\), we have

$$\begin{aligned} \frac{\gamma _k}{1-\varLambda _{T_k}(1-\alpha )} = \lambda _k =\frac{\gamma _k}{1-(1-\alpha )^{T_k}}\le \frac{3\gamma _k}{2} = \frac{3}{k+1}\le 1. \end{aligned}$$

Fourth,

$$\begin{aligned} \frac{\nu \beta _kp}{\lambda _k M\alpha } =&\ \frac{9L\gamma _kp(p+1)}{2k\lambda _k^2M} = \frac{9Lp(p+1)\left( 1-(1-\alpha )^{T_k}\right) ^2}{2k\gamma _kM}\\ =&\frac{9(k+1)}{4k}\cdot \left( \frac{Lp(p+1)}{M}\right) \cdot \left( 1-(1-\alpha )^{T_k}\right) ^2 \\ >&\ \frac{9}{4} \cdot 1 \cdot \frac{4}{9} = 1. \end{aligned}$$

The last three relations imply that (2.31) holds.

Summarizing the above discussions regarding both the cases \(k=1\) and \(k>1\), applying Theorem 2.1, and noting that \(\alpha _t(1+p_t)\equiv 1\), we have

$$\begin{aligned} \phi ({\overline{x}}_k) - \phi (u)\le \varGamma _k\sum _{i=1}^{k}\xi _i(V(x_{i-1},u)-V(x_i,u)),\ \forall u\in X, \text { where } \xi _i:=\frac{\lambda _i(\beta _i+\alpha _{T_i}q_{T_i})}{\varGamma _i}. \end{aligned}$$
(2.47)

It should be observed from the definition of \(\gamma _k\) in (2.44) that \(\varGamma _i:=2/(i(i+1))\) satisfies (2.30). Using this observation, applying (2.42), (2.43), and (2.44) to the above equation we have

$$\begin{aligned} \xi _1 = \beta _1 + \alpha _{T_1}q_{T_1} = \frac{L}{\nu } + \frac{7L}{2\nu } = \frac{9L}{2\nu }\text { and } \xi _i=\frac{\lambda _i\beta _i}{\varGamma _i}\equiv \frac{9L}{2\nu }, \ \forall i>1. \end{aligned}$$

Therefore, (2.47) becomes

$$\begin{aligned} \begin{aligned} \phi ({\overline{x}}_k) - \phi (u) \le&\ \frac{9L}{\nu k(k+1)}(V(x_0,u) - V(x_k, u)) \le \frac{9L}{\nu k(k+1)}V(x_0, u). \end{aligned} \end{aligned}$$
(2.48)

Setting \(u=x^*\) in the above inequality, we observe that the number of calls to the ProxAG procedure for computing an \(\varepsilon\)-solution of (1.6) is bounded by \(N_{f}\) in (2.45). This is also the bound for the number of gradient evaluations of \(\nabla f\). Moreover, by (2.43), (2.44), and (2.45) we conclude that the number of gradient evaluations of \(\nabla h\) is bounded by

$$\begin{aligned} \sum _{k=1}^{N_{f}}T_k =&\ T_1 + \sum _{k=2}^{N_{f}}T_k \le \left( \sqrt{\frac{8M}{7L}}+1\right) + (N_{f}-1)\left( \frac{\ln {3}}{-\ln (1-\alpha )} +1\right) \\ \le&\ \left( \sqrt{\frac{8M}{7L}}+1\right) + (N_{f}-1)\left( \frac{\ln 3}{\alpha } + 1\right) \\ =&\ \left( \sqrt{\frac{8M}{7L}}+1\right) + (N_{f}-1)\left( \left( \sqrt{\frac{M}{L}}+1\right) \ln 3+1\right) \\<&\ (1+\ln 3)N_{f}\left( \sqrt{\frac{M}{L}}+1\right) \\ <&\ 7\left( \sqrt{\frac{MV(x_0, x^*)}{\nu \varepsilon }} + \sqrt{\frac{LV(x_0, x^*)}{\nu \varepsilon }}\right) . \end{aligned}$$

Here the second inequality follows from the property of the logarithm that \(-\ln (1-\alpha )\ge \alpha\) for \(\alpha \in [0,1)\). \(\square\)
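The parameter policy (2.43)–(2.44) for the calls with \(k>1\) is easy to compute; below is a small sketch (ours) that also checks the bound \((1-\alpha )^{T_k}\le 1/3\) used repeatedly in the proof above.

```python
import math

def cor22_inner_params(M, L):
    """Parameters (2.43)-(2.44) of Corollary 2.2 for the calls with k > 1."""
    p = math.sqrt(M / L)                                   # p_t = p
    alpha = 1.0 / (p + 1.0)                                # alpha_t = alpha
    T = math.ceil(math.log(3.0) / -math.log1p(-alpha))     # T_k for k > 1
    assert (1.0 - alpha) ** T <= 1.0 / 3.0 + 1e-12         # bound used in the proof
    return p, alpha, T

print(cor22_inner_params(M=1.0e4, L=1.0))    # p = 100, T a little above 100 * ln(3)
```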

The major difference between the convergence results of Corollaries 2.1 and 2.2 lies in the constants of their bounds on the numbers of gradient evaluations. In particular, Corollary 2.1 has a slightly better bound on \(N_h\) and Corollary 2.2 has a slightly better bound on \(N_f\), while both bounds are of the same order. Since \(M\ge L\) in (1.2) and (1.7), the results obtained in Corollaries 2.1 and 2.2 indicate that the numbers of gradient evaluations of \(\nabla f\) and \(\nabla h\) that Algorithm 1 requires for computing an \(\varepsilon\)-solution of (1.6) can be bounded by \(\mathcal{O} (\sqrt{L/\varepsilon })\) and \(\mathcal{O} (\sqrt{M/\varepsilon })\), respectively. Such a result is particularly useful when M is significantly larger than L, e.g., \(M=\mathcal{O} (L/\varepsilon )\), since the number of gradient evaluations of \(\nabla f\) would not be affected at all by the large Lipschitz constant of the whole problem. It is interesting to compare the above result with the best complexity bound known so far under the traditional black-box oracle assumption. If we treat problem (1.6) as a general smooth convex optimization problem and study its oracle complexity, i.e., under the assumption that there exists an oracle that outputs \(\nabla \phi (x)\) for any test point x (and \(\nabla \phi (x)\) only), it has been shown that the number of calls to the oracle cannot be smaller than \(\mathcal{O} (\sqrt{(L+M)/\varepsilon })\) for computing an \(\varepsilon\)-solution [28, 32]. Under such a “single oracle” assumption, the complexity bounds in terms of gradient evaluations of \(\nabla f\) and \(\nabla h\) are intertwined, and a larger Lipschitz constant M will result in more gradient evaluations of \(\nabla f\), even though there is no explicit relationship between \(\nabla f\) and M. However, the results in Corollaries 2.1 and 2.2 suggest that we can study the oracle complexity of problem (1.6) based on the assumption of two separate oracles: one oracle \(\mathcal{O} _{f}\) that computes \(\nabla f\) for any test point x, and another \(\mathcal{O} _{h}\) that computes \(\nabla h(y)\) for any test point y. In particular, these two oracles do not have to be called at the same time, and hence it is possible to obtain separate complexity bounds \(\mathcal{O} (\sqrt{L/\varepsilon })\) and \(\mathcal{O} (\sqrt{M/\varepsilon })\) on the numbers of calls to \(\mathcal{O} _{f}\) and \(\mathcal{O} _{h}\), respectively.

2.5 Strongly convex extensions

We now consider a special case of (1.6) where f is strongly convex. More specifically, we assume that there exists \(\mu >0\) such that

$$\begin{aligned} \frac{\mu }{2}\Vert x - u\Vert ^2 \le f(x) - l_f(u,x) \le \frac{L}{2}\Vert x - u\Vert ^2,\ \forall x,u\in X. \end{aligned}$$
(2.49)

Under the above assumption, we develop a multi-stage AGS algorithm that can skip computation of \(\nabla f\) from time to time, and compute an \(\varepsilon\)-solution of (1.6) with

$$\begin{aligned} \mathcal{O} \left( \sqrt{\frac{L}{\mu }}\log \frac{1}{\varepsilon }\right) \end{aligned}$$
(2.50)

gradient evaluations of \(\nabla f\) (see Algorithm 2). It should be noted that, under the traditional black-box setting [28, 32], where one can only access \(\nabla \phi (x)\) for each inquiry x, the number of evaluations of \(\nabla \phi (x)\) required to compute an \(\varepsilon\)-solution is bounded by

$$\begin{aligned} \mathcal{O} \left( \sqrt{\frac{L+M}{\mu }}\log \frac{1}{\varepsilon }\right) . \end{aligned}$$
(2.51)
[Algorithm 2: the multi-stage accelerated gradient sliding (M-AGS) method]
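Since Algorithm 2 is a restarting scheme wrapped around Algorithm 1, its structure admits a very short sketch. The following is a minimal Python sketch assuming a routine `ags(x, N)` that runs Algorithm 1 from x for N outer iterations with the parameters of Corollary 2.2; all names are illustrative rather than part of the algorithm's formal statement.

```python
import math

def m_ags(ags, x0, L, mu, nu, delta0, eps):
    """Sketch of the multi-stage AGS (M-AGS) restart scheme.

    ags(x, N) -- assumed routine: one run of Algorithm 1 started from x with
                 N outer iterations (parameters of Corollary 2.2), returning
                 the output bar{x}_N.
    delta0    -- an initial bound satisfying phi(x0) - phi^* <= delta0.
    """
    # Parameter choices from (2.52); N0 is rounded up to an integer here.
    N0 = math.ceil(3.0 * math.sqrt(2.0 * L / (nu * mu)))
    S = math.ceil(math.log2(max(delta0 / eps, 1.0)))
    v = x0
    for _ in range(S):
        # Each stage halves the optimality gap (see the proof of Theorem 2.2),
        # so after S stages phi(v_S) - phi^* <= delta0 / 2^S <= eps.
        v = ags(v, N0)
    return v
```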

Theorem 2.2 below describes the main convergence properties of the M-AGS algorithm.

Theorem 2.2

Suppose that \(M\ge L\) in (1.7) and (2.49), and that the prox-function \(V(\cdot ,\cdot )\) grows quadratically (i.e., (1.12) holds). If the parameters in Algorithm 2 are set to

$$\begin{aligned} N_0=3\sqrt{\frac{2L}{\nu \mu }}\text { and }S=\log _2\max \left\{ \frac{\varDelta _0}{\varepsilon }, 1\right\} , \end{aligned}$$
(2.52)

then its output \(v_S\) must be an \(\varepsilon\)-solution of (1.6). Moreover, the total number of gradient evaluations of \(\nabla f\) and \(\nabla h\) performed by Algorithm 2 can be bounded respectively by

$$\begin{aligned} N_{f}:=3\sqrt{\frac{2L}{\nu \mu }}\log _2\max \left\{ \frac{\varDelta _0}{\varepsilon }, 1\right\} \end{aligned}$$
(2.53)

and

$$\begin{aligned} \begin{aligned} N_{h}:=&\ (1+\ln 3)N_{f}\left( \sqrt{\frac{M}{L}}+1\right) < 9\left( \sqrt{\frac{L}{\nu \mu }}+\sqrt{\frac{M}{\nu \mu }}\right) \log _2\max \left\{ \frac{\varDelta _0}{\varepsilon }, 1\right\} . \end{aligned} \end{aligned}$$
(2.54)

Proof

With input \(x_0=v_{s-1}\) and \(N=N_0\), we conclude from (2.48) in the proof of Corollary 2.2 (with \(u=x^*\) a solution to problem (1.6)) that

$$\begin{aligned} \phi ({\overline{x}}_N)-\phi ^*\le \frac{9L}{\nu N_0(N_0+1)}V(x_0,x^*) \le \frac{\mu }{2}V(x_0,x^*), \end{aligned}$$

where the last inequality follows from (2.52). Using the facts that the input of the AGS algorithm is \(x_0=v_{s-1}\) and that the output is set to \(v_s={\overline{x}}_N\), and the relation (1.12), we conclude

$$\begin{aligned} \phi (v_s)-\phi ^*\le \frac{\mu }{4}\Vert v_{s-1} - x^*\Vert ^2\le \frac{1}{2}(\phi (v_{s-1})-\phi ^*), \end{aligned}$$

where the last inequality is due to the strong convexity of \(\phi (\cdot )\). It then follows from the above relation, the definition of \(\varDelta _0\) in Algorithm 2, and (2.52) that

$$\begin{aligned} \phi (v_S)-\phi ^*\le \frac{1}{2^S}(\phi (v_0)-\phi ^*) \le \frac{\varDelta _0}{2^S} \le \varepsilon . \end{aligned}$$

Comparing Algorithms 1 and 2, we can observe that the total number of gradient evaluations of \(\nabla f\) in Algorithm 2 is bounded by \(N_0S\), and hence we have (2.53). Moreover, comparing (2.45) and (2.46) in Corollary 2.2, we conclude (2.54). \(\square\)
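For completeness, the absolute constant in (2.54) can be checked directly: by (2.53),

$$\begin{aligned} (1+\ln 3)N_{f}\left( \sqrt{\frac{M}{L}}+1\right) = 3\sqrt{2}(1+\ln 3)\left( \sqrt{\frac{M}{\nu \mu }}+\sqrt{\frac{L}{\nu \mu }}\right) \log _2\max \left\{ \frac{\varDelta _0}{\varepsilon }, 1\right\} , \end{aligned}$$

and \(3\sqrt{2}(1+\ln 3)\approx 8.90<9\).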

In view of Theorem 2.2, the total number of gradient evaluations of \(\nabla h\) required by the M-AGS algorithm to compute an \(\varepsilon\)-solution of problem (1.6) is the same as the traditional result (2.51). However, by skipping the gradient evaluations of \(\nabla f\) from time to time in the M-AGS algorithm, the total number of gradient evaluations of \(\nabla f\) is improved from (2.51) to (2.50). Such an improvement becomes more significant as the ratio M/L increases.

3 Application to composite bilinear saddle point problems

Our goal in this section is to show the advantages of the AGS method when applied to our motivating problem, i.e., the composite bilinear saddle point problem in (1.1). In particular, we show in Sect. 3.1 that the AGS algorithm can be used to solve (1.1) by incorporating the smoothing technique in [30], and we derive new complexity bounds in terms of the number of gradient computations of \(\nabla f\) and operator evaluations of K and \(K^T\). Moreover, we demonstrate in Sect. 3.2 that even more significant savings on the gradient computation of \(\nabla f\) can be obtained when f is strongly convex in (1.1), by incorporating the multi-stage AGS method.

3.1 Saddle point problems

In this subsection, we extend the AGS algorithm from composite smooth optimization to nonsmooth optimization. By incorporating the smoothing technique in [30], we can apply AGS to solve the composite saddle point problem (1.1). Throughout this section, we assume that the dual feasible set Y in (1.1) is bounded, i.e., there exists \(y_0\in Y\) such that

$$\begin{aligned} \varOmega :=\max _{v\in Y}W(y_0,v) \end{aligned}$$

is finite, where \(W(\cdot ,\cdot )\) is the prox-function associated with Y with modulus \(\omega\).

Let \(\psi _\rho\) be the smooth approximation of \(\psi\) defined in (1.3). It can be easily shown (see [30]) that

$$\begin{aligned} \psi _\rho (x)\le \psi (x)\le \psi _\rho (x)+\rho \varOmega ,\ \forall x\in X. \end{aligned}$$
(3.1)

Therefore, if \(\rho =\varepsilon /(2\varOmega )\), then an \((\varepsilon /2)\)-solution to (1.3) is also an \(\varepsilon\)-solution to (1.1). Moreover, it follows from Theorem 1 in [30] that problem (1.3) is given in the form of (1.6) (with \(h(x)=h_{\rho }(x)\)) and satisfies (1.7) with \(M=\Vert K\Vert ^2/(\rho \omega )\). Using these observations, we are ready to summarize the convergence properties of the AGS algorithm for solving problem (1.1).

Proposition 3.1

Let \(\varepsilon > 0\) be given and assume that \(2\Vert K\Vert ^2\varOmega >\varepsilon \omega L\). If we apply the AGS method in Algorithm 1 to problem (1.3) (with \(h=h_{\rho }\) and \(\rho =\varepsilon /(2\varOmega )\)), in which the parameters are set to (2.42)–(2.44) with \(M=\Vert K\Vert ^2/(\rho \omega )\), then the total number of gradient evaluations of \(\nabla f\) and linear operator evaluations of K (and \(K^T\)) in order to find an \(\varepsilon\)-solution of (1.1) can be bounded by

$$\begin{aligned} N_{f}:=3\left( \sqrt{\frac{2LV(x_0,x^*)}{\nu \varepsilon }}\right) \end{aligned}$$
(3.2)

and

$$\begin{aligned} N_K:=14\left( \sqrt{\frac{2LV(x_0,x^*)}{\nu \varepsilon }} + \frac{2\Vert K\Vert \sqrt{V(x_0,x^*)\varOmega }}{\sqrt{\nu \omega }\varepsilon }\right) , \end{aligned}$$
(3.3)

respectively.

Proof

By (3.1) we have \(\psi _\rho ^*\le \psi ^*\) and \(\psi (x)\le \psi _\rho (x)+\rho \varOmega\) for all \(x\in X\), and hence

$$\begin{aligned} \psi (x) - \psi ^*\le \psi _{\rho }(x)-\psi ^*_{\rho } + \rho \varOmega ,\ \forall x\in X. \end{aligned}$$

Using the above relation and the fact that \(\rho =\varepsilon /(2\varOmega )\), we conclude that if \(\psi _\rho (x) - \psi _\rho ^*\le \varepsilon /2\), then x is an \(\varepsilon\)-solution to (1.1). To finish the proof, it suffices to consider the complexity of AGS for computing an \(\varepsilon /2\)-solution of (1.3). By Corollary 2.2, the total number of gradient evaluations of \(\nabla f\) is bounded by (3.2). By Theorem 1 in [30], each evaluation of \(\nabla h_{\rho }\) requires two evaluations of linear operators: one computation of the form Kx for computing the maximizer \(y^*(x)\) in (1.4), and one computation of the form \(K^Ty^*(x)\) for computing \(\nabla h_{\rho }(x)\). Using this observation and substituting \(M=\Vert K\Vert ^2/(\rho \omega )\) into (2.46), we conclude (3.3). \(\square\)
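To make the two operator evaluations explicit, the following minimal Python sketch evaluates \(\nabla h_{\rho }\) in the special case \(J\equiv 0\) with the Euclidean prox-function (the setting later used in Sect. 4); the routine `proj_Y` (the projection onto Y) and all other names are illustrative assumptions.

```python
import numpy as np

def grad_h_rho(x, K, y0, rho, proj_Y):
    """One evaluation of grad h_rho at x for the smoothed term (1.4),
    assuming J == 0 and the Euclidean prox-function W(y0, y) = ||y - y0||^2/2.

    Under these assumptions the maximizer in (1.4) is the Euclidean
    projection of y0 + Kx/rho onto Y, so each evaluation of grad h_rho
    costs exactly one application of K and one of K^T.
    """
    Kx = K @ x                      # operator evaluation 1: K x
    y_star = proj_Y(y0 + Kx / rho)  # maximizer y^*(x) of (1.4)
    return K.T @ y_star             # operator evaluation 2: K^T y^*(x)
```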

According to Proposition 3.1, the total numbers of gradient evaluations of \(\nabla f\) and linear operator evaluations of K and \(K^T\) are bounded by

$$\begin{aligned} \mathcal{O} \left( \sqrt{\frac{L}{\varepsilon }}\right) \end{aligned}$$
(3.4)

and

$$\begin{aligned} \mathcal{O} \left( \sqrt{\frac{L}{\varepsilon }}+\frac{\Vert K\Vert }{\varepsilon }\right) \end{aligned}$$
(3.5)

respectively, for computing an \(\varepsilon\)-solution of the saddle point problem (1.1). Therefore, if \(L\le \mathcal{O} (\Vert K\Vert ^2/\varepsilon )\), then the number of gradient evaluations of \(\nabla f\) will not be affected by the dominating term \(\mathcal{O} (\Vert K\Vert /\varepsilon )\). This result significantly improves the previously best-known complexity results for solving the bilinear saddle point problem (1.1) in [30] and [23]. Specifically, it improves the complexity regarding the number of gradient computations of \(\nabla f\) from \(\mathcal{O} (1/\varepsilon )\) in [30] to \(\mathcal{O} (1/\sqrt{\varepsilon })\), and also improves the complexity regarding operator evaluations involving K from \(\mathcal{O} (1/\varepsilon ^2)\) in [23] to \(\mathcal{O} (1/\varepsilon )\).

3.2 Strongly convex composite saddle point problems

In this subsection, we still consider the SPP in (1.1), but assume that f is strongly convex (i.e., (2.49) holds). In this case, it has been shown previously in the literature that \(\mathcal{O} (\Vert K\Vert /\sqrt{\varepsilon })\) first-order iterations, each involving the computation of \(\nabla f\) and the evaluation of K and \(K^T\), are needed in order to compute an \(\varepsilon\)-solution of (1.1) (see, e.g., [29]). However, we demonstrate in this subsection that the complexity with respect to gradient evaluations of \(\nabla f\) can be significantly improved from \(\mathcal{O} (1/\sqrt{\varepsilon })\) to \(\mathcal{O} (\log (1/\varepsilon ))\).

Such an improvement can be achieved by properly restarting the AGS method applied to a series of smooth optimization problems of the form (1.3), in which the smoothing parameter \(\rho\) changes over time. The proposed multi-stage AGS algorithm with dynamic smoothing is stated in Algorithm 3 and sketched in code below.

[Algorithm 3: the multi-stage AGS method with dynamic smoothing]
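Analogously to Algorithm 2, the structure of Algorithm 3 admits a short sketch. The following minimal Python sketch assumes a routine `ags_smoothed(x, N, rho)` that runs Algorithm 1 on the smoothed problem (1.3) with smoothing parameter \(\rho\); all names are illustrative. The only difference from the M-AGS sketch is that \(\rho\) shrinks by a factor of \(\sqrt{2}\) at each stage.

```python
import math

def m_ags_dynamic(ags_smoothed, x0, L, mu, nu, Omega, delta0, eps):
    """Sketch of Algorithm 3: multi-stage AGS with dynamic smoothing.

    ags_smoothed(x, N, rho) -- assumed routine: one run of Algorithm 1 on the
                               smoothed problem (1.3) with smoothing parameter
                               rho, started from x with N outer iterations.
    delta0                  -- an initial bound with psi(x0) - psi^* <= delta0.
    """
    # Parameter choices from (3.6); N0 and S are rounded up to integers here.
    N0 = math.ceil(3.0 * math.sqrt(2.0 * L / (nu * mu)))
    S = math.ceil(math.log2(max(15.0 * delta0 / eps, 1.0)))
    rho0 = 4.0 * delta0 / (Omega * 2.0 ** (S / 2.0))
    v = x0
    for s in range(1, S + 1):
        rho = rho0 * 2.0 ** (-s / 2.0)  # smoothing parameter shrinks each stage
        v = ags_smoothed(v, N0, rho)
    return v  # an eps-solution of (1.1) by Theorem 3.1
```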

Theorem 3.1 describes the main convergence properties of Algorithm 3.

Theorem 3.1

Let \(\varepsilon > 0\) be given and suppose that the Lipschitz constant L in (2.49) satisfies

$$\begin{aligned} \varOmega \Vert K\Vert ^2\max \left\{ \sqrt{\frac{15\varDelta _0}{\varepsilon }}, 1\right\} \ge 2\omega \varDelta _0L. \end{aligned}$$

Also assume that the prox-function \(V(\cdot ,\cdot )\) grows quadratically (i.e., (1.12) holds). If the parameters in Algorithm 3 are set to

$$\begin{aligned} N_0=3\sqrt{\frac{2L}{\nu \mu }},\ S=\log _2\max \left\{ \frac{15\varDelta _0}{\varepsilon }, 1\right\} ,\text { and } \rho _0=\frac{4\varDelta _0}{\varOmega 2^{S/2}}, \end{aligned}$$
(3.6)

then the output \(v_S\) of this algorithm must be an \(\varepsilon\)-solution of (1.1). Moreover, the total number of gradient evaluations of \(\nabla f\) and operator evaluations involving K and \(K^T\) performed by Algorithm 3 can be bounded by

$$\begin{aligned} N_{f}:=3\sqrt{\frac{2L}{\nu \mu }}\log _2\max \left\{ \frac{15\varDelta _0}{\varepsilon }, 1\right\} \end{aligned}$$
(3.7)

and

$$\begin{aligned} N_{K}:= 18\sqrt{\frac{L}{\nu \mu }}\log _2\max \left\{ \frac{15\varDelta _0}{\varepsilon }, 1\right\} + \frac{56\sqrt{\varOmega }\Vert K\Vert }{\sqrt{\mu \varDelta _0\nu \omega }}\cdot \max \left\{ \sqrt{\frac{15\varDelta _0}{\varepsilon }}, 1\right\} , \end{aligned}$$

respectively.

Proof

Suppose that \(x^*\) is an optimal solution to (1.1). By (2.48) in the proof of Corollary 2.2, in the s-th stage of Algorithm 3 (calling AGS with input \(x_0=v_{s-1}\), output \(v_{s}={\overline{x}}_N\), and iteration number \(N=N_0\)), we have

$$\begin{aligned}&\psi _{\rho }(v_s)-\psi _{\rho }(x^*)= \psi _{\rho }({\overline{x}}_N)-\psi _{\rho }(x^*) \\&\quad \le \ \frac{9L}{\nu N_0(N_0+1)}V(x_0, x^*)\le \frac{\mu }{2}V(x_0,x^*)\le \frac{\mu }{4}\Vert x_0-x^*\Vert ^2=\frac{\mu }{4}\Vert v_{s-1}-x^*\Vert ^2, \end{aligned}$$

where the last two inequalities follow from (3.6) and (1.12), respectively. Moreover, by (3.1) we have \(\psi (v_s)\le \psi _\rho (v_s)+\rho \varOmega\) and \(\psi ^*=\psi (x^*)\ge \psi _\rho (x^*)\), hence

$$\begin{aligned} \psi (v_s)-\psi ^*\le \psi _\rho (v_s)-\psi _\rho (x^*)+\rho \varOmega . \end{aligned}$$

Combining the above two relations and using the strong convexity of \(\psi (\cdot )\), we have

$$\begin{aligned}&\psi (v_s)-\psi ^* \le \frac{\mu }{4}\Vert v_{s-1}-x^*\Vert ^2+\rho \varOmega \le \frac{1}{2}[\psi (v_{s-1})-\psi ^*]+\rho \varOmega \\&\quad = \frac{1}{2}[\psi (v_{s-1})-\psi ^*]+2^{-s/2}\rho _0\varOmega , \end{aligned}$$

where the last equality is due to the selection of \(\rho\) in Algorithm 3. Reformulating the above relation as

$$\begin{aligned} 2^s[\psi (v_s)-\psi ^*]\le 2^{s-1}[\psi (v_{s-1})-\psi ^*]+2^{s/2}\rho _0\varOmega , \end{aligned}$$

and summing the above inequalities from \(s=1,\ldots ,S\), we have

$$\begin{aligned}&2^S(\psi (v_S)-\psi ^*)\le \varDelta _0 + \rho _0\varOmega \sum _{s=1}^S 2^{s/2}\\&\quad = \varDelta _0 + \rho _0\varOmega \frac{\sqrt{2}(2^{S/2}-1)}{\sqrt{2}-1}< \varDelta _0+\frac{7}{2}\rho _0\varOmega 2^{S/2}=15\varDelta _0, \end{aligned}$$

where the first inequality follows from the fact that \(\psi (v_0)-\psi ^*\le \varDelta _0\) and the last equality is due to (3.6) (indeed, \(\rho _0\varOmega 2^{S/2}=4\varDelta _0\) by the choice of \(\rho _0\), so \(\varDelta _0+\frac{7}{2}\rho _0\varOmega 2^{S/2}=15\varDelta _0\)). By (3.6) and the above result, we have \(\psi (v_S)-\psi ^*\le \varepsilon\). Comparing the descriptions of Algorithms 1 and 3, we can clearly see that the total number of gradient evaluations of \(\nabla f\) in Algorithm 3 is given by \(N_0S\), and hence we have (3.7).

To complete the proof, it suffices to estimate the total number of operator evaluations involving K and \(K^T\). By Theorem 1 in [30], in the s-th stage of Algorithm 3, the number of operator evaluations involving K is twice the number of evaluations of \(\nabla h_\rho\) in the AGS algorithm, which, in view of (2.46) in Corollary 2.2, is bounded by

$$\begin{aligned}&2(1+\ln 3)N\left( \sqrt{\frac{M}{L}}+1\right) = 2(1+\ln 3)N\left( \sqrt{\frac{\Vert K\Vert ^2}{\rho \omega L}}+1\right) \\&\quad = 2(1+\ln 3)N_0\left( \sqrt{\frac{2^{s/2}\Vert K\Vert ^2}{\rho _0\omega L}}+1\right) , \end{aligned}$$

where we used the relation \(M=\Vert K\Vert ^2/(\rho \omega )\) (see Sect. 3.1) in the first equality and relations \(\rho =2^{-s/2}\rho _0\) and \(N=N_0\) from Algorithm 3 in the last equality. It then follows from the above result and (3.6) that the total number of operator evaluations involving K in Algorithm 3 can be bounded by

$$\begin{aligned}&\sum _{s=1}^S 2(1+\ln 3)N_0\left( \sqrt{\frac{2^{s/2}\Vert K\Vert ^2}{\rho _0\omega L}}+1\right) \\&\quad = \ 2(1+\ln 3)N_0S + \frac{2(1+\ln 3)N_0\Vert K\Vert }{\sqrt{\rho _0\omega L}}\sum _{s=1}^S 2^{s/4} \\&\quad = \ 2(1+\ln 3)N_0S + \frac{3\sqrt{2}(1+\ln 3)\sqrt{\varOmega }\Vert K\Vert 2^{S/4}}{\sqrt{\mu \varDelta _0\nu \omega }}\cdot \frac{2^{1/4}(2^{S/4}-1)}{2^{1/4}-1} \\&\quad< \ 2(1+\ln 3)N_0S + \frac{56\sqrt{\varOmega }\Vert K\Vert }{\sqrt{\mu \varDelta _0\nu \omega }}\cdot 2^{S/2} \\&\quad < \ 18\sqrt{\frac{L}{\nu \mu }}\log _2\max \left\{ \frac{15\varDelta _0}{\varepsilon }, 1\right\} + \frac{56\sqrt{\varOmega }\Vert K\Vert }{\sqrt{\mu \varDelta _0\nu \omega }}\cdot \max \left\{ \sqrt{\frac{15\varDelta _0}{\varepsilon }}, 1\right\} . \end{aligned}$$
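Here the last two inequalities use the numerical estimates

$$\begin{aligned} 6\sqrt{2}(1+\ln 3)\approx 17.81<18 \quad \text {and}\quad 3\sqrt{2}(1+\ln 3)\cdot \frac{2^{1/4}}{2^{1/4}-1}\approx 55.97<56, \end{aligned}$$

together with the fact that \(2^{S/2}=\max \{\sqrt{15\varDelta _0/\varepsilon }, 1\}\) by (3.6).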

\(\square\)

By Theorem 3.1, the total number of operator evaluations involving K performed by Algorithm 3 to compute an \(\varepsilon\)-solution of (1.1) can be bounded by

$$\begin{aligned} \mathcal{O} \left( \sqrt{\frac{L}{\mu }}\log \frac{1}{\varepsilon }+\frac{\Vert K\Vert }{\sqrt{\varepsilon }}\right) , \end{aligned}$$

which matches the best-known complexity result (e.g., [29]). However, the total number of gradient evaluations of \(\nabla f\) is now bounded by

$$\begin{aligned} \mathcal{O} \left( \sqrt{\frac{L}{\mu }}\log \frac{1}{\varepsilon }\right) , \end{aligned}$$

which drastically improves existing results from \(\mathcal{O} (1/\sqrt{\varepsilon })\) to \(\mathcal{O} (\log (1/\varepsilon ))\).

4 Numerical experiments

For preliminary numerical experiments of the proposed AGS method, we consider the following total-variation (TV) regularized image reconstruction problem:

$$\begin{aligned} \min _{x\in \mathbb {R}^n} \psi (x):= \frac{1}{2}\Vert Ax-b\Vert ^2+ \eta \Vert Dx\Vert _{2,1}. \end{aligned}$$
(4.1)

Here \(x\in \mathbb {R}^n\) is the n-vector form of a two-dimensional image to be reconstructed, \(\Vert Dx\Vert _{2,1}\) is the discrete form of the TV semi-norm where D is the finite difference operator, A is a measurement matrix describing the physics of data acquisition, and b is the observed data. It should be noted that problem (4.1) is equivalent to

$$\begin{aligned} \min _{x\in \mathbb {R}^n}\frac{1}{2}\Vert Ax-b\Vert ^2+ \max _{y\in Y}\eta \langle Dx, y \rangle , \end{aligned}$$

where \(Y:=\{y\in \mathbb {R}^{2n}:\Vert y\Vert _{2,\infty }:=\max _{i=1,\ldots ,n}\Vert (y^{(2i-1)}, y^{(2i)})^T\Vert _2\le 1\}\). The above form can be viewed as a special case of the bilinear SPP (1.1) with

$$\begin{aligned} f(x):=\frac{1}{2}\Vert Ax-b\Vert ^2, K:=\eta D, \text { and }J(y) \equiv 0, \end{aligned}$$

and the associated constants are \(L=\lambda _{max}(A^TA)\) and \(\Vert K\Vert =\eta \sqrt{8}\) (see, e.g., [5]). Therefore, as discussed in Sect. 3.1, such a problem can be solved by AGS after incorporating the smoothing technique in [30] and approximating it by a problem of the form (1.3).
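As an illustration of the ingredients needed to run AGS on (4.1), the following Python sketch implements \(\nabla f\), a forward-difference version of the operator D (recall \(K=\eta D\)), and the projection onto Y that is used when evaluating \(\nabla h_{\rho }\) (cf. the sketch in Sect. 3.1). The boundary handling of D and all names are illustrative assumptions rather than the exact implementation used in the experiments.

```python
import numpy as np

def grad_f(x, A, b):
    """Gradient of the data-fitting term f(x) = ||Ax - b||^2 / 2."""
    return A.T @ (A @ x - b)

def D(x, h, w):
    """Forward-difference TV operator mapping R^n to R^{2n} (n = h * w):
    for each pixel it returns the interleaved pair of horizontal and
    vertical differences (zero at the image boundary)."""
    img = x.reshape(h, w)
    dx = np.zeros((h, w)); dx[:, :-1] = img[:, 1:] - img[:, :-1]
    dy = np.zeros((h, w)); dy[:-1, :] = img[1:, :] - img[:-1, :]
    return np.stack([dx.ravel(), dy.ravel()], axis=1).ravel()

def proj_Y(y):
    """Projection onto Y: rescale each pair (y_{2i-1}, y_{2i}) to the
    unit Euclidean ball."""
    pairs = y.reshape(-1, 2)
    norms = np.maximum(np.linalg.norm(pairs, axis=1, keepdims=True), 1.0)
    return (pairs / norms).ravel()
```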

In this experiment, the dimension of \(A\in \mathbb {R}^{m\times n}\) is set to \(m=\lceil n/3\rceil\). Each component of A is generated from a Bernoulli distribution, namely, it takes the values \(1/\sqrt{m}\) and \(-1/\sqrt{m}\) with equal probability. We generate b from a ground truth image \(x_{true}\) with \(b=Ax_{true}+\varepsilon\), where \(\varepsilon \sim N(0,0.001I_m)\). Two ground truth images \(x_{true}\) are used in the experiment, namely, the 256 by 256 (\(n=65536\)) image “Cameraman” and the 135 by 198 (\(n=26730\)) image “Onion”. Both of them are built-in test images in the MATLAB image processing toolbox.
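A minimal Python sketch of the data generation just described is given below (the ground-truth images are loaded from the MATLAB toolbox and are not reproduced here; the seed is arbitrary).

```python
import numpy as np

def generate_data(x_true, seed=0):
    """Bernoulli measurement matrix and noisy observations as in Sect. 4."""
    rng = np.random.default_rng(seed)
    n = x_true.size
    m = int(np.ceil(n / 3))
    # Each entry of A is +1/sqrt(m) or -1/sqrt(m) with equal probability.
    A = rng.choice(np.array([1.0, -1.0]), size=(m, n)) / np.sqrt(m)
    # b = A x_true + eps with eps ~ N(0, 0.001 I_m).
    b = A @ x_true + rng.normal(loc=0.0, scale=np.sqrt(0.001), size=m)
    return A, b
```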

The parameters of Algorithm 1 are set according to Proposition 3.1 for solving the above bilinear SPP through smoothing. In order to demonstrate the efficiency of the AGS algorithm, we compare it with Nesterov's accelerated gradient method (NEST) in [32]. Note that AGS and NEST are both applied to the smoothed problem (1.3). We compare the performance of AGS and NEST for each test image under different smoothing parameters \(\rho\) in (1.3) and TV regularization parameters \(\eta\) in (4.1). For both algorithms, the prox-functions V(x, u) and W(y, v) are set to the Euclidean distances \(\Vert x-u\Vert _2^2/2\) and \(\Vert y-v\Vert _2^2/2\), respectively. In order to perform a fair comparison, we first run NEST for 200 iterations, and then run AGS for the same amount of CPU time. The performances of AGS and NEST are compared through their respective relative errors \((\psi _{AGS}-\psi ^*)/{\psi ^*}\) and \((\psi _{NEST}-\psi ^*)/{\psi ^*}\), where \(\psi _{AGS}\) and \(\psi _{NEST}\) are the objective values of (4.1) corresponding to the approximate solutions computed by AGS and NEST, respectively. Here the optimal objective value \(\psi ^*\) is approximated by running AGS with 2000 evaluations of \(\nabla f\).

Tables 1 and 2 show the comparison between AGS and NEST in terms of gradient evaluations of \(\nabla f\), operator evaluations of K and \(K^T\), and objective values of (4.1). It should be noted that in 200 iterations of the NEST algorithm, the numbers of gradient evaluations of \(\nabla f\) and operator evaluations of K and \(K^T\) are 200 and 400, respectively. We can make a few observations about the results reported in these tables. First, by skipping gradient evaluations of \(\nabla f\), AGS is able to perform more operator evaluations of K and \(K^T\) within the same amount of CPU time. Noting the complexity bounds (3.4) and (3.5), we can observe that the extra operator evaluations of K and \(K^T\) can result in better approximate solutions obtained by AGS in terms of objective values. It should be noted that in problem (4.1), A is a dense matrix while D is a sparse matrix; therefore, a very large number of extra evaluations of K and \(K^T\) can be performed for each skipped gradient evaluation of \(\nabla f\). Second, for the smooth approximation problem (1.3), the Lipschitz constant M of \(h_{\rho }\) is given by \(M=\Vert K\Vert ^2/(\rho \omega )\). Therefore, when \(\rho\) is fixed, larger values of \(\eta\) result in a larger norm \(\Vert K\Vert\), and consequently a larger Lipschitz constant M. Moreover, when \(\eta\) is fixed, smaller values of \(\rho\) also lead to a larger Lipschitz constant M. In both cases, as the ratio M/L increases, we skip more and more gradient evaluations of \(\nabla f\) and allocate more CPU time to operator evaluations of K and \(K^T\), which results in a more significant performance improvement of AGS over NEST. These observations are consistent with our theoretical complexity analysis of AGS and NEST for solving composite bilinear saddle point problems.

Noting that problem (4.1) is a common problem in imaging science, we also compare AGS and NEST with the best result among a few commonly used primal-dual (PD) algorithms in the field. Specifically, we use \(\psi _{PD}\) to denote the best objective value computed by the following algorithms: a modified version of the primal-dual hybrid gradient (PDHG) method [1, 37] (see, e.g., [7] for an introduction to PDHG algorithms), an overrelaxed primal-dual method [12, 15], and an inertial primal-dual method [25]. Our implementation of the PD algorithms is based on [8], since it studies all the aforementioned algorithms (see Algorithms 1–3 therein). Since our main objective in this experiment is to demonstrate the effectiveness of AGS over NEST on evaluations of \(\nabla f\) and \(\nabla h\), we do not list all the results computed by the aforementioned PD algorithms, but simply use \(\psi _{PD}\) to denote the smallest objective value computed by any of the PD algorithms over all tested relaxation and inertial parameters, under either ergodic or non-ergodic iterates. We can observe that AGS outperforms the PD algorithms significantly in all the experiments.

Table 1 Numbers of gradient evaluations of \(\nabla f\) and \(\nabla h\) performed by the AGS method for solving (4.1) with ground truth image “Cameraman”, after running the same amount of CPU time as 200 iterations of NEST. Here \(\psi _{AGS}\) and \(\psi _{NEST}\) are the objective values of (4.1) corresponding to approximated solutions obtained by AGS and NEST, respectively
Table 2 Numbers of gradient evaluations of \(\nabla f\) and \(\nabla h\) performed by the AGS method for solving (4.1) with ground truth image “Onion”, after running the same amount of CPU time as 200 iterations of NEST. Here \(\psi _{AGS}\) and \(\psi _{NEST}\) are the objective values of (4.1) corresponding to approximated solutions obtained by AGS and NEST, respectively

5 Concluding remarks

We propose an accelerated gradient sliding (AGS) method for solving certain classes of structured convex optimization problems. The main feature of the proposed AGS method is that it can skip gradient computations of a smooth component in the objective function from time to time, while still maintaining the overall optimal rate of convergence for these problems. In particular, for minimizing the summation of two smooth convex functions, the AGS method can skip the gradient computation of the function with the smaller Lipschitz constant, resulting in sharper complexity results than the previously best-known complexity bound under the traditional black-box assumption. Moreover, for solving a class of bilinear saddle point problems, by applying the AGS algorithm to solve its smooth approximation, we show that the number of gradient evaluations of the smooth component may be reduced to \(\mathcal{O} (1/\sqrt{\varepsilon })\), which improves the previous \(\mathcal{O} (1/\varepsilon )\) complexity bound in the literature. More significant savings on gradient computations can be obtained when the objective function is strongly convex, with the number of gradient evaluations being reduced further to \(\mathcal{O} (\log (1/\varepsilon ))\). Numerical experiments further confirm the potential advantages of these new optimization schemes for solving structured convex optimization problems.

Several directions for future work remain. First, although the theoretical performance of the proposed AGS method is better than that of Nesterov's accelerated gradient method when \(M\gg L\), the complexities developed in Corollaries 2.1 and 2.2 have slightly worse universal constant factors than that of Nesterov's method. Specifically, for the non-strongly convex smooth case of problem (1.6), applying a version of Nesterov's method (e.g., Theorem 2.2.2 in [32]) we can obtain an \(\varepsilon\)-solution within \(2\sqrt{(M+L)V(x_0,x^*)/(\nu \varepsilon )}\) gradient evaluations of both f and h. Theoretically, the bound on gradient evaluations of f from Nesterov's method is worse than that in Corollary 2.1 when \(M\ge 8L\). However, when \(M\le 7L\), the bounds on gradient evaluations of both f and h from Nesterov's method are better than those of the proposed AGS method. Similar observations can be made for Corollary 2.2 and also later for Proposition 3.1. Since the main purpose of this paper is to establish the theoretical possibility of solving composite convex optimization using two separate gradient oracles instead of treating them jointly, we leave it as future work to study whether the proposed AGS method can be further improved so that its performance is similar to or better than that of Nesterov's method when \(M\le 7L\). Second, it should be noted that the proposed AGS method does not address the sliding of strongly convex smooth optimization problems in the most straightforward way. Specifically, the proposed M-AGS algorithm described in Algorithm 2 is a multi-stage restarting scheme that applies the AGS algorithm \(\mathcal{O} (\log (1/\varepsilon ))\) times. It is a potential future work to study whether one could exploit the strong convexity directly and develop an algorithm that incorporates the strong convexity parameter into the update rules of the parameters. Third, in Algorithm 3 we use an adaptive strategy for choosing the smoothing parameter \(\rho\). Such a strategy is a key factor contributing to the improved convergence properties of Algorithm 3 for strongly convex problems. It would be interesting to study whether such an adaptive smoothing parameter strategy can also be applied to non-strongly convex cases to improve practical performance while maintaining the same theoretical convergence properties.