1 Introduction

We consider the following classical convex–concave saddle-point problem:

$$\begin{aligned} \min _{x\in \mathbb {R}^p}\max _{y\in \mathbb {R}^n}\Big \{ \mathcal {L}(x, y) := f(x) + \langle Kx, y\rangle - g^{*}(y) \Big \}, \end{aligned}$$
(1)

where \(f : \mathbb {R}^p \rightarrow \mathbb {R}\cup \{+\infty \}\) and \(g : \mathbb {R}^n\rightarrow \mathbb {R}\cup \{+\infty \}\) are proper, closed, and convex, \(K : \mathbb {R}^p \rightarrow \mathbb {R}^n\) is a linear operator, and \(g^{*}(y) := \sup _{x}\left\{ \langle y, x\rangle - g(x)\right\} \) is the Fenchel conjugate of g.

The convex–concave saddle-point problem (1) can be written in the primal and dual forms. The primal problem is defined as

$$\begin{aligned} F^{\star } := \min _{x\in \mathbb {R}^p}\Big \{F(x) := f(x) + g(Kx) \Big \}. \end{aligned}$$
(2)

The corresponding dual problem is written as

$$\begin{aligned} D^{\star } := \min _{y\in \mathbb {R}^n}\Big \{ D(y) := f^{*}(-K^{\top }y) + g^{*}(y) \Big \}. \end{aligned}$$
(3)

Clearly, both the primal problem (2) and its dual form (3) are convex.

Motivation In [23], two accelerated smoothed gap reduction algorithms are proposed to solve (2) and a special constrained convex formulation of it. Both algorithms achieve optimal sublinear convergence rates (up to a constant factor) in the sense of black-box first-order oracles [15, 16], when f and g are convex, and when f is strongly convex and g is just convex, respectively. The first algorithm in [23] is called ASGARD (Accelerated Smoothed GAp ReDuction). To the best of our knowledge, except for a special case [24], ASGARD was the first primal–dual first-order algorithm that achieves a non-asymptotic optimal convergence rate on the last primal iterate. ASGARD also differs from the alternating direction method of multipliers (ADMM) and its variants in that it does not require solving complex subproblems; instead, it only uses the proximal operators of f and \(g^{*}\). However, ASGARD (i.e. [23, Algorithm 1]) only covers the general convex case, although it needs only one proximal operation of f and of \(g^{*}\) per iteration. To handle the strong convexity of f, a different variant, called ADSGARD, is developed in [23], but it requires two proximal operations of f per iteration. Therefore, the following natural question arises:

Can we develop a unified variant of ASGARD that covers three settings: general convexity, strong convexity, and strong convexity and smoothness?

Contribution In this paper, we affirmatively answer this question by developing a unified variant of ASGARD that covers the following three settings:

Case 1: Both f and \(g^{*}\) in (1) are only convex, but not necessarily strongly convex.

Case 2: Either f or \(g^{*}\) is strongly convex, but not both f and \(g^{*}\).

Case 3: Both f and \(g^{*}\) are strongly convex.

The new variant requires only one proximal operation of f and of \(g^{*}\) at each iteration, as in the original ASGARD and existing primal–dual methods, e.g., in [4, 5, 7, 8, 11, 12, 19, 28]. In Case 1, our algorithm reduces to ASGARD, but with a different update rule for \(\eta _k\) (see Step 5 of Algorithm 1). In Cases 2 and 3, our algorithm is new: it incorporates the strong convexity parameter \(\mu _f\) of f and/or \(\mu _{g^{*}}\) of \(g^{*}\) into the parameter update rules to achieve the optimal \(\mathcal {O}\left( 1/k^2\right) \) sublinear and \(\left( 1 - \mathcal {O}\left( 1/\sqrt{\kappa _F}\right) \right) \) linear rates, respectively, where k is the iteration counter and \(\kappa _F := \Vert K\Vert ^2/(\mu _f\mu _{g^{*}})\).

In terms of convergence guarantees, we establish that, in all cases, our algorithm achieves optimal rates on the last primal iterate sequence \(\{x^k\}\) and an averaging dual sequence \(\left\{ \tilde{y}^k\right\} \). Moreover, our convergence guarantees cover three different criteria: (i) the gap function for (1), (ii) the primal objective residual \(F(x^k) - F^{\star }\) for (2), and (iii) the dual objective residual \(D(\tilde{y}^k) - D^{\star }\) for (3). Our paper therefore provides a unified and complete analysis of the convergence rates of ASGARD for solving the three problems (1), (2), and (3) simultaneously.

We emphasize that primal–dual first-order methods for solving (1), (2), and (3), as well as their convergence analysis, have been well studied in the literature. To avoid overloading this paper, we refer to our recent works [23, 25] for a more thorough discussion and comparison of existing methods. Hitherto, numerous papers have studied convergence rates of primal–dual first-order methods, including [4,5,6,7, 9, 14, 27]. However, the best known and optimal rates are only achieved via averaging or weighted averaging sequences, also known as ergodic rates. The convergence rates on the last iterate sequence are often slower and suboptimal. Recently, optimal convergence rates of the last iterates have been studied for primal–dual first-order methods, including [23, 25, 27]. As pointed out in [5, 10, 21, 23, 25], a last-iterate convergence guarantee is very important in various applications to preserve desirable structures of the final solution such as sparsity, low-rankness, or sharp edges in images. This also motivates us to develop our new ASGARD variant.

Paper outline The rest of this paper is organized as follows. Section 2 recalls some basic concepts, states our assumptions, and characterizes the optimality condition of (1). Section 3 presents our main results on the algorithm and its convergence analysis. Section 4 provides a set of experiments to verify our theoretical results and compare our method with Nesterov’s smoothing scheme in [17]. Some technical proofs are deferred to the “Appendix”.

2 Basic concepts, assumptions, and optimality condition

We work with the Euclidean spaces \(\mathbb {R}^p\) and \(\mathbb {R}^n\), equipped with the standard inner product \(\langle \cdot ,\cdot \rangle \) and the Euclidean norm \(\Vert \cdot \Vert \), which we use throughout the paper. Given a proper, closed, and convex function f, we use \(\mathrm {dom}\left( f\right) \) and \(\partial {f}(x)\) to denote its domain and its subdifferential at x, respectively. We also use \(\nabla {f}(x)\) for a subgradient or the gradient (if f is differentiable) of f at x. We denote by \(f^{*}(y) := \sup \left\{ \left\langle y, x\right\rangle - f(x) : x\in \mathrm {dom}\left( f\right) \right\} \) the Fenchel conjugate of f. We denote by \( \mathrm {ri}(\mathcal {X})\) the relative interior of \(\mathcal {X}\).

A function f is called \(M_f\)-Lipschitz continuous if \(\vert f(x) - f(\tilde{x})\vert \le M_f\Vert x - \tilde{x}\Vert \) for all \(x, \tilde{x} \in \mathrm {dom}\left( f\right) \), where \(M_f \in [0, +\infty )\) is called a Lipschitz constant. A proper, closed, and convex function f is \(M_f\)-Lipschitz continuous if and only if \(\partial {f}(\cdot )\) is uniformly bounded by \(M_f\) on \(\mathrm {dom}\left( f\right) \). For a smooth function f, we say that f is \(L_f\)-smooth (or Lipschitz gradient) if for any \(x, \tilde{x}\in \mathrm {dom}\left( f\right) \), we have \(\Vert \nabla {f}(x) - \nabla {f}(\tilde{x})\Vert \le L_f\Vert x - \tilde{x}\Vert \), where \(L_f \in [0, +\infty )\) is a Lipschitz constant. A function f is called \(\mu _f\)-strongly convex with a strong convexity parameter \(\mu _f \ge 0\) if \(f(\cdot ) - \frac{\mu _f}{2}\Vert \cdot \Vert ^2\) remains convex. For a proper, closed, and convex function f, \(\text {prox}_{\gamma f}(x) := \mathrm {arg}\!\min \big \{f(\tilde{x}) + \tfrac{1}{2\gamma }\Vert \tilde{x} - x\Vert ^2 : \tilde{x}\in \mathrm {dom}\left( f\right) \big \}\) is called the proximal operator of \(\gamma f\), where \(\gamma > 0\).
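For instance, for \(f(x) = \lambda \Vert x\Vert _1\), the proximal operator has the well-known closed form of soft-thresholding. The following minimal NumPy sketch (illustrative only; the function name and default values are ours) evaluates it:

```python
import numpy as np

def prox_l1(x, gamma, lam=1.0):
    """Evaluate prox_{gamma*f}(x) for f = lam*||.||_1, i.e. soft-thresholding.

    Solves argmin_z { lam*||z||_1 + (1/(2*gamma))*||z - x||^2 }.
    """
    return np.sign(x) * np.maximum(np.abs(x) - gamma * lam, 0.0)

# Each entry is shrunk toward zero by gamma*lam and truncated at zero.
print(prox_l1(np.array([3.0, -0.2, 0.7]), gamma=0.5))  # -> [2.5, 0., 0.2]
```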

2.1 Basic assumptions and optimality condition

In order to show the relationship between (1), its primal form (2), and its dual form (3), we require the following assumptions.

Assumption 1

The following assumptions hold for (1).

\(\mathrm {(a)}\) Both functions f and g are proper, closed, and convex on their domains.

\(\mathrm {(b)}\) There exists a saddle-point \(z^{\star } := (x^{\star }, y^{\star })\) of \(\mathcal {L}\) defined in (1), i.e.:

$$\begin{aligned} \mathcal {L}(x^{\star }, y) \le \mathcal {L}(x^{\star }, y^{\star }) \le \mathcal {L}(x, y^{\star }), \quad \forall (x, y)\in \mathrm {dom}\left( f\right) \times \mathrm {dom}\left( g^{*}\right) . \end{aligned}$$
(4)
\(\mathrm {(c)}\) The Slater condition \(0 \in \mathrm {ri}(\mathrm {dom}\left( g\right) - K\mathrm {dom}\left( f\right) )\) holds.

Assumption 1 is standard in convex–concave saddle-point settings. Under Assumption 1, strong duality holds, i.e. \(F^{\star } = \mathcal {L}(x^{\star }, y^{\star }) = -D^{\star }\).

To characterize a saddle-point of (1), we define the following gap function:

$$\begin{aligned} \mathcal {G}_{\mathcal {X}\times \mathcal {Y}}(x, y) := \sup \big \{ \mathcal {L}(x, \tilde{y}) - \mathcal {L}(\tilde{x}, y) : \tilde{x}\in \mathcal {X}, \ \tilde{y}\in \mathcal {Y}\big \}, \end{aligned}$$
(5)

where \(\mathcal {X}\subseteq \mathrm {dom}\left( f\right) \) and \(\mathcal {Y}\subseteq \mathrm {dom}\left( g^{*}\right) \) are nonempty, closed, and convex subsets such that \(\mathcal {X}\times \mathcal {Y}\) contains a saddle-point \((x^{\star },y^{\star })\) of (1). Clearly, we have \(\mathcal {G}_{\mathcal {X}\times \mathcal {Y}}(x, y) \ge 0\) for all \((x, y)\in \mathcal {X}\times \mathcal {Y}\). Moreover, if \((x^{\star }, y^{\star })\) is a saddle-point of (1) in \(\mathcal {X}\times \mathcal {Y}\), then \(\mathcal {G}_{\mathcal {X}\times \mathcal {Y}}(x^{\star }, y^{\star }) = 0\).

2.2 Smoothing technique for g

We first smooth g in (2) using Nesterov’s smoothing technique [17] as

$$\begin{aligned} g_{\beta }(u, \dot{y}) := \max _{y\in \mathbb {R}^n}\left\{ \langle u, y\rangle - g^{*}(y) - \tfrac{\beta }{2}\Vert y - \dot{y}\Vert ^2\right\} , \end{aligned}$$
(6)

where \(g^{*}\) is the Fenchel conjugate of g, \(\beta > 0\) is a smoothness parameter, and \(\dot{y}\) is a given proximal center. We denote by \(\nabla _ug_{\beta }(u, \dot{y}) = \text {prox}_{g^{*}/\beta }(\dot{y} + \frac{1}{\beta }u)\) the gradient of \(g_{\beta }\) w.r.t. u.
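As an illustration of this gradient formula (not used in the analysis below), take \(g(u) = \Vert u\Vert _1\): then \(g^{*}\) is the indicator of the \(\ell _{\infty }\) unit ball, \(\text {prox}_{g^{*}/\beta }\) is a componentwise clipping onto \([-1,1]\), and \(\nabla _u g_{\beta }\) can be evaluated directly. A minimal sketch under these assumptions:

```python
import numpy as np

def grad_g_beta_l1(u, y_dot, beta):
    """Gradient in u of the smoothed l1-norm g_beta(u, y_dot) from (6), for g = ||.||_1.

    Since g* is the indicator of the l_inf unit ball, prox_{g*/beta}(z) is the
    projection of z onto [-1, 1]^n, i.e. componentwise clipping.
    """
    return np.clip(y_dot + u / beta, -1.0, 1.0)

u = np.array([2.0, -0.1, 0.5])
print(grad_g_beta_l1(u, y_dot=np.zeros(3), beta=1.0))  # -> [1.0, -0.1, 0.5]
```

As \(\beta \downarrow 0\) (with \(\dot{y} = 0\)), this clipping approaches \(\mathrm{sign}(u)\), a subgradient of \(\Vert u\Vert _1\), which illustrates how \(g_{\beta }\) approximates g.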

Given \(g_{\beta }\) defined by (6), we can approximate F in (2) by

$$\begin{aligned} F_{\beta }(x, \dot{y}) := f(x) + g_{\beta }(Kx, \dot{y}). \end{aligned}$$
(7)

The following lemma provides two key inequalities to link \(F_{\beta }\) to \(\mathcal {L}\) and D.

Lemma 1

Let \(F_{\beta }\) be defined by (7) and \((x^{\star }, y^{\star })\) be a saddle-point of (1). Then, for any \(x, \bar{x}\in \mathrm {dom}\left( f\right) \) and \(\dot{y}, \tilde{y}, y \in \mathrm {dom}\left( g^{*}\right) \), we have

$$\begin{aligned} \begin{array}{lcl} \mathcal {L}(\bar{x}, y) &{} \le &{} F_{\beta }(\bar{x}, \dot{y}) + \frac{\beta }{2}\Vert y - \dot{y}\Vert ^2, \vspace{1ex}\\ D(\tilde{y}) - D(y^{\star }) &{} \le &{} F_{\beta }(\bar{x}, \dot{y}) - \mathcal {L}(\tilde{x}, \tilde{y}) + \frac{\beta }{2}\Vert y^{\star } - \dot{y}\Vert ^2, \quad \forall \tilde{x} \in \partial {f^{*}}(-K^{\top }\tilde{y}). \end{array} \end{aligned}$$
(8)

Proof

Using the definition of \(\mathcal {L}\) in (1) and of \(F_{\beta }\) in (7), we have \(\mathcal {L}(\bar{x}, y) = f(\bar{x}) + \langle K\bar{x}, y\rangle - g^{*}(y) \le f(\bar{x}) + \sup _{y}\{\langle K\bar{x}, y\rangle - g^{*}(y) - \frac{\beta }{2}\Vert y - \dot{y}\Vert ^2\} + \frac{\beta }{2}\Vert y - \dot{y}\Vert ^2 = F_{\beta }(\bar{x}, \dot{y}) + \frac{\beta }{2}\Vert y - \dot{y}\Vert ^2\), which proves the first line of (8).

Next, for any \(\tilde{x} \in \partial {f^{*}}(-K^{\top }\tilde{y})\), we have \(f^{*}(-K^{\top }\tilde{y}) = \langle -K^{\top }\tilde{y}, \tilde{x}\rangle - f(\tilde{x})\) by Fenchel’s equality [1]. Hence, \(D(\tilde{y}) = f^{*}(-K^{\top }\tilde{y}) + g^{*}(\tilde{y}) = -\mathcal {L}(\tilde{x}, \tilde{y})\). On the other hand, by (4), we have \(\mathcal {L}(\bar{x}, y^{\star }) \ge \mathcal {L}(x^{\star }, y^{\star }) = -D(y^{\star })\). Combining these two expressions and the first line of (8), we obtain \(D(\tilde{y}) - D(y^{\star }) \le \mathcal {L}(\bar{x}, y^{\star }) - \mathcal {L}(\tilde{x},\tilde{y}) \le F_{\beta }(\bar{x}, \dot{y}) - \mathcal {L}(\tilde{x}, \tilde{y}) + \frac{\beta }{2}\Vert \dot{y} - y^{\star }\Vert ^2\). \(\square \)

3 New ASGARD variant and its convergence guarantees

In this section, we derive a new and unified variant of ASGARD in [23] and analyze its convergence rate guarantees for three settings.

3.1 Derivation of the algorithm and one-iteration analysis

Given \(\dot{y} \in \mathbb {R}^n\) and \(\hat{x}^k \in \mathbb {R}^p\), the main step of ASGARD consists of one dual and one primal update as follows:

$$\begin{aligned} \left\{ \begin{array}{lcl} y^{k+1} &{} := &{} \text {prox}_{g^{*}/\beta _k}\big (\dot{y} + \tfrac{1}{\beta _k}K\hat{x}^k \big ), \vspace{1ex}\\ x^{k+1} &{} := &{} \text {prox}_{f/L_k}\big (\hat{x}^k - \tfrac{1}{L_k}K^{\top }y^{k+1} \big ), \end{array}\right. \end{aligned}$$
(9)

where \(\beta _k > 0\) is the smoothness parameter of g, and \(L_k > 0\) is an estimate of the Lipschitz constant of \(\nabla {g_{\beta _k}}\). Here, (9) serves as the basic step of various primal–dual first-order methods in the literature, including [5, 14, 23]. However, instead of updating \(\dot{y}\), ASGARD keeps it fixed for all iterations.
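For concreteness, the step (9) can be sketched in a few lines of Python, with the proximal operators supplied as callables; this is an illustrative reconstruction, not the paper's implementation, and the signature prox(v, gamma) \(= \text {prox}_{\gamma \cdot }(v)\) is our convention:

```python
import numpy as np

def primal_dual_step(x_hat, y_dot, K, prox_f, prox_gstar, beta_k, L_k):
    """One step of (9): a dual prox step followed by a primal prox step.

    prox_f(v, gamma) and prox_gstar(v, gamma) must return prox_{gamma*f}(v)
    and prox_{gamma*g*}(v), respectively.
    """
    # y^{k+1} := prox_{g*/beta_k}(y_dot + (1/beta_k) * K x_hat)
    y_next = prox_gstar(y_dot + (K @ x_hat) / beta_k, 1.0 / beta_k)
    # x^{k+1} := prox_{f/L_k}(x_hat - (1/L_k) * K^T y^{k+1})
    x_next = prox_f(x_hat - (K.T @ y_next) / L_k, 1.0 / L_k)
    return x_next, y_next
```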

The following lemma serves as a key step for our analysis in the sequel. Since its statement and proof are rather different from [23, Lemma 2], we provide its proof in “Appendix B.1”.

Lemma 2

[23] Let \((x^{k+1}, y^{k+1})\) be generated by (9), \(F_{\beta }\) be defined by (7), and \(\mathcal {L}\) be given by (1). Then, for any \(x\in \mathrm {dom}\left( f\right) \) and \(\tau _k \in [0,1]\), we have

$$\begin{aligned} \begin{array}{lcl} F_{\beta _k}(x^{k+1},\dot{y}) &{} \le &{} (1 - \tau _k) F_{\beta _{k-1}}(x^k, \dot{y}) + \tau _k\mathcal {L}(x, y^{k+1}) \vspace{1ex}\\ &{}&{} + {~} \frac{L_k\tau _k^2}{2} \big \Vert \tfrac{1}{\tau _k}[\hat{x}^k - (1-\tau _k)x^k] - x \big \Vert ^2 - \frac{\mu _f(1-\tau _k)\tau _k}{2}\Vert x - x^k\Vert ^2 \vspace{1ex}\\ &{}&{} - {~} \frac{\tau _k^2}{2}\left( L_k + \mu _f \right) \big \Vert \tfrac{1}{\tau _k}[x^{k+1} - (1-\tau _k)x^k] - x \big \Vert ^2 \vspace{1ex}\\ &{}&{} - {~} \frac{L_k}{2}\Vert x^{k+1} - \hat{x}^k\Vert ^2 + \frac{1}{2(\mu _{g^{*}}+\beta _k)}\Vert K(x^{k+1} - \hat{x}^k)\Vert ^2 \vspace{1ex}\\ &{}&{} - {~} \frac{1-\tau _k}{2}\left[ \tau _k\beta _k - (\beta _{k-1} - \beta _k)\right] \Vert \nabla _ug_{\beta _k}(Kx^k,\dot{y}) - \dot{y}\Vert ^2. \end{array} \end{aligned}$$
(10)

Together with the primal–dual step (9), we also apply Nesterov’s accelerated step to \(\hat{x}^k\) and an averaging step to \(\tilde{y}^k\) as follows:

$$\begin{aligned} \left\{ \begin{array}{lcl} \hat{x}^{k+1} &{} := &{} x^{k+1} + \eta _{k+1}(x^{k+1} - x^k), \vspace{1ex}\\ \tilde{y}^{k+1} &{}:= &{} (1-\tau _k)\tilde{y}^k + \tau _ky^{k+1}, \end{array}\right. \end{aligned}$$
(11)

where \(\tau _k \in (0, 1)\) and \(\eta _{k+1} > 0\) will be determined in the sequel.

To analyze the convergence of the new ASGARD variant, we define the following Lyapunov function (also called a potential function):

$$\begin{aligned} \begin{array}{lcl} \mathcal {V}_k(x) &{} := &{} F_{\beta _{k-1}}(x^k, \dot{y}) - \mathcal {L}(x, \tilde{y}^k) \vspace{1ex}\\ &{}&{} + {~} \tfrac{(L_{k-1} + \mu _f)\tau _{k-1}^2}{2}\Vert \tfrac{1}{\tau _{k-1}}[x^k - (1 - \tau _{k-1})x^{k-1}] - x\Vert ^2. \end{array} \end{aligned}$$
(12)

The following lemma provides a key recursive estimate to analyze the convergence of (9) and (11), whose proof is given in “Appendix B.2”.

Lemma 3

Let \((x^{k}, \hat{x}^k, y^k, \tilde{y}^{k})\) be updated by (9) and (11). Given \(\beta _0 > 0\), \(\tau _k, \tau _{k+1}\in (0, 1]\), let \(\beta _k\), \(L_k\), and \(\eta _{k+1}\) be updated by

$$\begin{aligned} \beta _k := \frac{\beta _{k-1}}{1+\tau _k}, \quad L_k := \frac{\Vert K\Vert ^2}{\beta _k + \mu _{g^{*}}}, \quad \text {and}\quad \eta _{k+1} := \frac{(1-\tau _{k})\tau _{k}}{\tau _{k}^2 + m_{k+1}\tau _{k+1}}, \end{aligned}$$
(13)

where \(m_{k+1} := \frac{L_{k+1} + \mu _f}{L_{k} + \mu _f}\). Suppose further that \(\tau _k \in (0, 1]\) satisfies

$$\begin{aligned} \left\{ \begin{array}{llcl} &{}(L_{k-1} + \mu _f)(1-\tau _k)\tau _{k-1}^2 + \mu _f(1-\tau _k)\tau _k &{}\ge &{} L_k\tau _k^2, \vspace{1ex}\\ &{}(L_{k-1} + \mu _f)(L_k + \mu _f)\tau _k\tau _{k-1}^2 + (L_k + \mu _f)^2\tau _k^2 &{} \ge &{} (L_{k-1} + \mu _f)L_k\tau _{k-1}^2. \end{array}\right. \end{aligned}$$
(14)

Then, for any \(x\in \mathrm {dom}\left( f\right) \), the Lyapunov function \(\mathcal {V}_k\) defined by (12) satisfies

$$\begin{aligned} \mathcal {V}_{k+1}(x) \le (1-\tau _k)\mathcal {V}_{k}(x). \end{aligned}$$
(15)

The unified ASGARD algorithm Our next step is to combine (9), (11), and (13) into a complete method, which yields a new ASGARD variant (called ASGARD+) as presented in Algorithm 1.

Algorithm 1 (ASGARD+): the unified accelerated smoothed gap reduction variant built from the updates (9), (11), and (13); a sketch of its main loop is given below.
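To make the method concrete, the following Python sketch assembles the main loop of ASGARD+ from (9), (11), and (13). It is an illustrative reconstruction, not a verbatim transcription of Algorithm 1: the initialization \(\tilde{y}^0 := \dot{y}\), the callable signatures, and the case-dependent \(\tau \)-rule supplied as update_tau are our assumptions, and the step numbering referred to in the text is that of Algorithm 1 itself.

```python
import numpy as np

def asgard_plus(x0, y_dot, K, prox_f, prox_gstar, beta0, mu_f=0.0, mu_gstar=0.0,
                tau0=1.0, update_tau=lambda tau: tau, max_iter=1000):
    """Sketch of the ASGARD+ main loop built from (9), (11), and (13).

    prox_f(v, gamma) = prox_{gamma*f}(v), prox_gstar(v, gamma) = prox_{gamma*g*}(v).
    update_tau implements the tau-rule of Theorem 1, 2, or 3.
    """
    normK2 = np.linalg.norm(K, 2) ** 2            # ||K||^2 (squared spectral norm)
    x, x_hat, y_tilde = x0.copy(), x0.copy(), y_dot.copy()
    tau, beta = tau0, beta0                       # tau_0 := 1 in Theorems 1-2; constant tau in Theorem 3
    L = normK2 / (beta + mu_gstar)                # L_0 = ||K||^2 / (beta_0 + mu_{g*})
    for _ in range(max_iter):
        # Primal-dual step (9)
        y_next = prox_gstar(y_dot + (K @ x_hat) / beta, 1.0 / beta)
        x_next = prox_f(x_hat - (K.T @ y_next) / L, 1.0 / L)
        # Averaging dual step, second line of (11)
        y_tilde = (1.0 - tau) * y_tilde + tau * y_next
        # Parameter updates (13)
        tau_next = update_tau(tau)
        beta_next = beta / (1.0 + tau_next)
        L_next = normK2 / (beta_next + mu_gstar)
        m_next = (L_next + mu_f) / (L + mu_f)
        eta = (1.0 - tau) * tau / (tau ** 2 + m_next * tau_next)
        # Momentum step, first line of (11)
        x_hat = x_next + eta * (x_next - x)
        x, tau, beta, L = x_next, tau_next, beta_next, L_next
    return x, y_tilde
```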

Compared to the original ASGARD in [23], Algorithm 1 requires one additional averaging dual step on \(\tilde{y}^k\) at Step 6 to obtain the dual convergence. Note that Algorithm 1 also incorporates the strong convexity parameters \(\mu _f\) of f and \(\mu _{g^{*}}\) of \(g^{*}\) to cover three settings: general convexity (\(\mu _f = \mu _{g^{*}} = 0\)), strong convexity (\(\mu _f > 0\) and \(\mu _{g^{*}} = 0\)), and strong convexity and smoothness (\(\mu _f > 0\) and \(\mu _{g^{*}} > 0\)). More precisely, \(L_k\) and the momentum step-size \(\eta _{k+1}\) also differ from those in [23] by incorporating \(\mu _{g^{*}}\) and \(\mu _f\). The per-iteration complexity of ASGARD+ remains the same as that of ASGARD except for the averaging dual update of \(\tilde{y}^k\). However, this step is not required if we only solve (2). We highlight that if we apply a new approach from [25] to (1), then we can also update the proximal center \(\dot{y}\) at each iteration.

3.2 Case 1: Both f and \(g^{*}\) are just convex (\(\mu _f = \mu _{g^{*}} = 0\))

The following theorem establishes convergence rates of Algorithm 1 for the general convex case where both f and \(g^{*}\) are just convex.

Theorem 1

Suppose that Assumption 1 holds and both f and \(g^{*}\) are only convex, i.e. \(\mu _f = \mu _{g^{*}} = 0\). Let \(\{(x^k, \tilde{y}^k)\}\) be generated by Algorithm 1 for solving (1), where \(\tau _0 := 1\) and \(\tau _{k+1}\) is the unique solution of the cubic equation \(\tau ^3 + \tau ^2 + \tau _k^2\tau - \tau _k^2 = 0\) in \(\tau \), which always exists. Then, for all \(k \ge 1\), we have:

\(\mathrm {(a)}\) The gap function \(\mathcal {G}_{\mathcal {X}\times \mathcal {Y}}\) defined by (5) satisfies

$$\begin{aligned} \mathcal {G}_{\mathcal {X}\times \mathcal {Y}}(x^k, \tilde{y}^k) \le \frac{\Vert K\Vert ^2}{2\beta _0k} \sup _{x\in \mathcal {X}}\Vert x^0 - x\Vert ^2 + \frac{\beta _0}{k+1}\sup _{y\in \mathcal {Y}}\Vert y - \dot{y}\Vert ^2. \end{aligned}$$
(16)
\(\mathrm {(b)}\) If g is \(M_g\)-Lipschitz continuous on \(\mathrm {dom}\left( g\right) \), then for (2), it holds that

$$\begin{aligned} F(x^k) - F^{\star } \le \frac{\Vert K\Vert ^2}{2\beta _0 k}\Vert x^0 - x^{\star }\Vert ^2 + \frac{\beta _0}{k+1}(\Vert \dot{y}\Vert + M_g)^2. \end{aligned}$$
(17)
\(\mathrm {(c)}\) If \(f^{*}\) is \(M_{f^{*}}\)-Lipschitz continuous on \(\mathrm {dom}\left( f^{*}\right) \), then for (3), it holds that

$$\begin{aligned} D(\tilde{y}^k) - D^{\star } \le \frac{\Vert K\Vert ^2}{2\beta _0 k}(\Vert x^0\Vert + M_{f^{*}})^2 + \frac{\beta _0}{k+1}\Vert \dot{y} - y^{\star }\Vert ^2. \end{aligned}$$
(18)

Proof

First, since \(\mu _f = \mu _{g^{*}} = 0\), \(L_k = \frac{\Vert K\Vert ^2}{\beta _k}\), and \((1+\tau _k)\beta _k = \beta _{k-1}\), the two conditions of (14) respectively reduce to

$$\begin{aligned} (1-\tau _k)\tau _{k-1}^2 \ge (1+\tau _k)\tau _k^2 \quad \text {and}\quad (1 + \tau _k)\tau _k^2 \ge \tau _{k-1}^2(1-\tau _k). \end{aligned}$$

These conditions hold if \(\tau _k^3 + \tau _k^2 + \tau _{k-1}^2\tau _k - \tau _{k-1}^2 = 0\). We first choose \(\tau _0 := 1\), and update \(\tau _k\) by solving the cubic equation \(\tau ^3 + \tau ^2 + \tau _{k-1}^2\tau - \tau _{k-1}^2 = 0\) for \(k\ge 1\). Note that this equation has a unique positive real solution \(\tau _k \in (0, 1)\) due to Lemma 5(b). Moreover, we have \(\prod _{i=1}^k(1-\tau _i) \le \frac{1}{k+1}\) and \(\beta _k \le \frac{2\beta _0}{k+2}\).
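As a numerical illustration of these bounds (a sanity check only, not part of the proof), one can compute the \(\tau _k\) sequence by bisection and verify \(\prod _{i=1}^k(1-\tau _i) \le \frac{1}{k+1}\) and \(\beta _k \le \frac{2\beta _0}{k+2}\) for the first iterations:

```python
def next_tau(tau_prev):
    """Root in (0, 1) of t^3 + t^2 + tau_prev^2 * t - tau_prev^2 = 0, by bisection."""
    f = lambda t: t**3 + t**2 + tau_prev**2 * t - tau_prev**2
    lo, hi = 0.0, 1.0          # f(0) < 0 < f(1), so the root lies in (0, 1)
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0.0 else (lo, mid)
    return 0.5 * (lo + hi)

beta0 = 1.0
tau, beta, prod = 1.0, beta0, 1.0   # tau_0 = 1
for k in range(1, 101):
    tau = next_tau(tau)
    prod *= 1.0 - tau
    beta /= 1.0 + tau
    assert prod <= 1.0 / (k + 1) and beta <= 2.0 * beta0 / (k + 2)
print("bounds of Lemma 5(b) hold numerically for k = 1,...,100")
```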

Next, by induction, (15) leads to \(\mathcal {V}_k(x) \le \left[ \prod _{i=1}^{k-1}(1-\tau _i)\right] \mathcal {V}_1(x) \le \frac{1}{k}\mathcal {V}_1(x)\), where we have used \(\prod _{i=1}^{k-1}(1-\tau _i) \le \frac{1}{k}\) from Lemma 5(b). Moreover, from (45) in the proof of Lemma 3 and \(\tau _0 = 1\), we have \(\mathcal {V}_1(x) \le (1-\tau _0)\mathcal {V}_0(x) + \frac{L_0\tau _0^2}{2}\Vert \hat{x}^0 - x\Vert ^2 = \frac{\Vert K\Vert ^2}{2\beta _0}\Vert x^0 - x\Vert ^2\). Hence, we eventually obtain

$$\begin{aligned} \mathcal {V}_k(x) \le \frac{\Vert K\Vert ^2}{2\beta _0k}\Vert x^0 - x\Vert ^2. \end{aligned}$$
(19)

Using (8) from Lemma 1 and \(\beta _{k-1} \le \frac{2\beta _0}{k+1}\) from Lemma 5(b), we get

$$\begin{aligned} \begin{array}{lcl} \mathcal {L}(x^k, y) - \mathcal {L}(x, \tilde{y}^k) &{} \overset{(8)}{\le } &{} F_{\beta _{k-1}}(x^k, \dot{y}) - \mathcal {L}(x,\tilde{y}^k) + \frac{\beta _{k-1}}{2}\Vert y- \dot{y}\Vert ^2 \vspace{1ex}\\ &{} \overset{(12)}{\le } &{} \mathcal {V}_k(x) + \frac{\beta _{k-1}}{2}\Vert y - \dot{y}\Vert ^2 \vspace{1ex}\\ &{} \overset{(19)}{\le } &{} \frac{\Vert K\Vert ^2}{2\beta _0k}\Vert x^0 - x\Vert ^2 + \frac{\beta _0}{k+1}\Vert \dot{y} - y\Vert ^2. \end{array} \end{aligned}$$

Taking the supremum over \(\mathcal {X}\) and \(\mathcal {Y}\) on both sides of the last estimate and using (5), we obtain (16).

Now, since \(F_{\beta _{k-1}}(x^k,\dot{y}) - F^{\star } \overset{(4)}{\le } F_{\beta _{k-1}}(x^k,\dot{y}) - \mathcal {L}(x^{\star }, \tilde{y}^k) \overset{(12)}{\le } \mathcal {V}_k(x^{\star })\), combining this inequality and (19), we get

$$\begin{aligned} F_{\beta _{k-1}}(x^k,\dot{y}) - F^{\star } \le \frac{\Vert K\Vert ^2}{2\beta _0k}\Vert x^0 - x^{\star }\Vert ^2. \end{aligned}$$

On the other hand, since g is \(M_g\)-Lipschitz continuous, we have

$$\begin{aligned} \sup \left\{ \Vert y-\dot{y}\Vert : y\in \partial {g}(Kx^k)\right\} \le \Vert \dot{y}\Vert + \sup \{\Vert y\Vert : \Vert y\Vert \le M_g\} = \Vert \dot{y}\Vert + M_g. \end{aligned}$$

Hence, by (31) of Lemma 4, we have \(F(x^k) \le F_{\beta _{k-1}}(x^k, \dot{y}) + \frac{\beta _{k-1}}{2}(\Vert \dot{y}\Vert + M_g)^2 \le F_{\beta _{k-1}}(x^k, \dot{y}) + \frac{\beta _0}{k+1}(\Vert \dot{y}\Vert + M_g)^2\). Combining both estimates, we obtain (17).

Finally, using (8), we have

$$\begin{aligned} \begin{array}{lcl} D(\tilde{y}^k) - D^{\star } &{} \le &{} F_{\beta _{k-1}}(x^k, \dot{y}) - \mathcal {L}(\tilde{x}^k, \tilde{y}^k) + \frac{\beta _{k-1}}{2}\Vert \dot{y} - y^{\star }\Vert ^2 \vspace{1ex}\\ &{}\overset{(12)}{\le } &{} \mathcal {V}_k(\tilde{x}^k) + \frac{\beta _0}{k+1}\Vert \dot{y} - y^{\star }\Vert ^2 \vspace{1ex}\\ &{}\overset{(19)}{\le } &{} \frac{\Vert K\Vert ^2}{2\beta _0k} (\Vert x^0\Vert + M_{f^{*}})^2 + \frac{\beta _0}{k+1}\Vert \dot{y} - y^{\star }\Vert ^2, \end{array} \end{aligned}$$

which proves (18). Here, since \(\tilde{x}^k \in \partial {f^{*}}(-K^{\top }\tilde{y}^k)\), we have \(\Vert x^0 - \tilde{x}^k\Vert \le \Vert \tilde{x}^k\Vert + \Vert x^0\Vert \le M_{f^{*}} + \Vert x^0\Vert \), which has been used in the last inequality. \(\square \)

3.3 Case 2: f is strongly convex and \(g^{*}\) is convex (\(\mu _f > 0\) and \(\mu _{g^{*}} = 0\))

Next, we consider the case where only one of f and \(g^{*}\) is strongly convex. Without loss of generality, we assume that f is strongly convex with a strong convexity parameter \(\mu _f > 0\), while \(g^{*}\) is only convex (\(\mu _{g^{*}} = 0\)). Otherwise, we switch the roles of f and \(g^{*}\) in Algorithm 1.

The following theorem establishes an optimal \(\mathcal {O}\left( 1/k^2\right) \) convergence rate (up to a constant factor) of Algorithm 1 in this case (i.e. \(\mu _f > 0\) and \(\mu _{g^{*}} = 0\)).

Theorem 2

Suppose that Assumption 1 holds and that f is strongly convex with a strong convexity parameter \(\mu _f > 0\), but \(g^{*}\) is just convex (i.e. \(\mu _{g^{*}} = 0\)). Let us choose \(\tau _0 := 1\) and \(\beta _0 \ge \frac{0.382\Vert K\Vert ^2}{\mu _f}\). Let \(\{(x^k, \tilde{y}^k)\}\) be generated by Algorithm 1 using the update rule \(\tau _{k+1} := \frac{\tau _k}{2}\big (\sqrt{\tau _k^2 + 4} - \tau _k\big )\) for \(\tau _k\). Then, we have:

\(\mathrm {(a)}\) The gap function \(\mathcal {G}_{\mathcal {X}\times \mathcal {Y}}\) defined by (5) satisfies

$$\begin{aligned} \mathcal {G}_{\mathcal {X}\times \mathcal {Y}}(x^k, \tilde{y}^k) \le \frac{2\Vert K\Vert ^2}{\beta _0(k+1)^2}\sup _{x\in \mathcal {X}}\Vert x^0 -x\Vert ^2 + \frac{10\beta _0}{(k+3)^2} \sup _{y\in \mathcal {Y}}\Vert \dot{y} - y\Vert ^2. \end{aligned}$$
(20)
\(\mathrm {(b)}\) If g is \(M_g\)-Lipschitz continuous on \(\mathrm {dom}\left( g\right) \), then for (2), it holds that

$$\begin{aligned} F(x^k) - F^{\star } \le \frac{2\Vert K\Vert ^2\Vert x^0 - x^{\star }\Vert ^2}{\beta _0(k+1)^2} + \frac{10\beta _0(\Vert \dot{y}\Vert + M_g)^2}{(k+3)^2}. \end{aligned}$$
(21)
\(\mathrm {(c)}\) If \(f^{*}\) is \(M_{f^{*}}\)-Lipschitz continuous on \(\mathrm {dom}\left( f^{*}\right) \), then for (3), it holds that

$$\begin{aligned} D(\tilde{y}^k) - D^{\star } \le \frac{2\Vert K\Vert ^2(\Vert x^0\Vert + M_{f^{*}})^2}{\beta _0(k+1)^2} + \frac{10\beta _0\Vert \dot{y} - y^{\star }\Vert ^2}{(k+3)^2}. \end{aligned}$$
(22)

Proof

Since \(\mu _{g^{*}} = 0\) and \(\tau _k^2 = \tau _{k-1}^2(1 - \tau _k)\) by the update rule of \(\tau _k\), \(\beta _k = \frac{\beta _{k-1}}{1+\tau _k}\), and \(L_k = \frac{\Vert K\Vert ^2}{\beta _k}\), the first condition of (14) is equivalent to \(\beta _{k-1} \ge \frac{\Vert K\Vert ^2\tau _k^2}{\mu _f}\). However, since \(\beta _{k-1} \ge \frac{\beta _0\tau _{k}^2}{\tau _1^2}\) due to Lemma 5(a), and \(\tau _1 = 0.6180\), \(\beta _{k-1} \ge \frac{\Vert K\Vert ^2\tau _k^2}{\mu _f}\) holds if \(\beta _0 \ge \frac{\Vert K\Vert ^2\tau _1^2}{\mu _f} = \frac{0.382\Vert K\Vert ^2}{\mu _f}\). Thus we can choose \(\beta _0 \ge \frac{0.382\Vert K\Vert ^2}{\mu _f}\) to guarantee the first condition of (14).
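The following short numerical check illustrates this argument for a made-up pair \((\Vert K\Vert ^2, \mu _f)\); it is illustrative only and not part of the proof:

```python
import math

normK2, mu_f = 4.0, 0.5                  # illustrative values only
beta0 = 0.382 * normK2 / mu_f            # the choice suggested by Theorem 2
tau, beta_prev = 1.0, beta0              # tau_0 = 1; beta_prev plays the role of beta_{k-1}
for k in range(1, 51):
    tau = 0.5 * tau * (math.sqrt(tau**2 + 4.0) - tau)   # tau_k from Theorem 2
    assert beta_prev >= normK2 * tau**2 / mu_f           # first condition of (14)
    beta_prev /= 1.0 + tau                               # beta_k = beta_{k-1}/(1 + tau_k)
print("beta_{k-1} >= ||K||^2 * tau_k^2 / mu_f holds for k = 1,...,50")
```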

Similarly, using \(m_k = \frac{L_k + \mu _f}{L_{k-1} + \mu _f} \ge 1\), the second condition of (14) is equivalent to \(m_k^2\tau _k^2 + m_k\tau _k\tau _{k-1}^2 \ge \frac{L_k\tau _{k-1}^2}{L_{k-1} + \mu _f}\). Since \(\frac{L_k}{L_{k-1} + \mu _f} \le m_k\), the last condition holds if \(m_k^2\tau _k^2 + m_k\tau _k\tau _{k-1}^2 \ge m_k\tau _{k-1}^2\). Using again \(\tau _k^2 = \tau _{k-1}^2(1-\tau _k)\), this condition becomes \(m_k\tau _k^2 \ge \tau _{k-1}^2(1-\tau _k) = \tau _k^2\). This always holds true since \(m_k \ge 1\). Therefore, the second condition of (14) is satisfied.

As a result, we have the recursive estimate (15), i.e.:

$$\begin{aligned} \mathcal {V}_{k+1}(x) \le (1-\tau _k)\mathcal {V}_{k}(x). \end{aligned}$$
(23)

From (23), Lemma 5(a), (45), and noting that \(\hat{x}^0 := x^0\) and \(\tau _0 := 1\), we have

$$\begin{aligned} \mathcal {V}_k(x) \le \Big [\prod _{i=1}^{k-1}(1-\tau _i)\Big ]\mathcal {V}_1(x) = \varTheta _{1,k-1}\mathcal {V}_1(x) \le \frac{4}{(k+1)^2}\mathcal {R}_0(x), \end{aligned}$$

where \(\mathcal {R}_0(x) := \frac{\Vert K\Vert ^2}{2\beta _0}\Vert x^0 - x\Vert ^2\). Similar to the proof of Theorem 1, using the last inequality and \(\beta _{k-1} \le \frac{4\beta _0\tau _0^2}{\tau _2^2[\tau _0(k+1) + 2]^2} \le \frac{20\beta _0}{(k+3)^2}\) from Lemma 5(a), we obtain the bounds (20), (21), and (22), respectively. \(\square \)

Remark 1

The variant of Algorithm 1 in Theorem 2 is completely different from [23, Algorithm 2] and [25, Algorithm 2], in that it requires only one evaluation of \(\text {prox}_{f/L_k}(\cdot )\) per iteration, as opposed to two proximal operations of f as in [23, 25].

3.4 Case 3: Both f and \(g^{*}\) are strongly convex (\(\mu _f > 0\) and \(\mu _{g^{*}} > 0\))

Finally, we assume that both f and \(g^{*}\) are strongly convex with strong convexity parameters \(\mu _f > 0\) and \(\mu _{g^{*}} > 0\), respectively. Then, the following theorem proves the optimal linear rate (up to a constant factor) of Algorithm 1.

Theorem 3

Suppose that Assumption 1 holds and both f and \(g^{*}\) in (1) are strongly convex with \(\mu _f > 0\) and \(\mu _{g^{*}} > 0\), respectively. Let \(\{(x^k, \tilde{y}^k)\}\) be generated by Algorithm  1 using \(\tau _k := \tau = \frac{1}{\sqrt{1 + \kappa _F}}\in (0, 1)\) and \(\beta _0 > 0\), where \(\kappa _F := \frac{\Vert K\Vert ^2}{\mu _f\mu _{g^{*}}}\). Then, the following statements hold:

\(\mathrm {(a)}\) The gap function \(\mathcal {G}_{\mathcal {X}\times \mathcal {Y}}\) defined by (5) satisfies

$$\begin{aligned} \mathcal {G}_{\mathcal {X}\times \mathcal {Y}}(x^k,\tilde{y}^k) \le (1-\tau )^k\bar{\mathcal {R}}_p + \frac{\beta _0}{2(1+\tau )^k}\sup _{y\in \mathcal {Y}}\Vert \dot{y} - y\Vert ^2, \vspace{-0.5ex} \end{aligned}$$
(24)

where \(\bar{\mathcal {R}}_p := {\displaystyle \sup _{x\in \mathcal {X}}}\Big \{ (1-\tau )\big [\mathcal {F}_{\beta _0}(x^0) - \mathcal {L}(x, \tilde{y}^0)\big ] + \frac{\Vert K\Vert ^2\tau ^2}{2(\mu _{g^{*}} + \beta _{0})} \Vert x^0 - x\Vert ^2 \Big \}\).

\(\mathrm {(b)}\) If g is \(M_g\)-Lipschitz continuous on \(\mathrm {dom}\left( g\right) \), then for (2), it holds that

$$\begin{aligned} F(x^k) - F^{\star } \le (1-\tau )^k\bar{\mathcal {R}}_p^{\star } \ + \ \frac{\beta _0}{2(1+\tau )^k} \left( \Vert \dot{y}\Vert + M_g\right) ^2, \end{aligned}$$
(25)

where \(\bar{\mathcal {R}}_p^{\star } := (1-\tau )\big [ \mathcal {F}_{\beta _0}(x^0) - \mathcal {L}(x^{\star }, \tilde{y}^0) \big ] + \frac{\Vert K\Vert ^2\tau ^2}{2(\mu _{g^{*}} + \beta _{0})}\Vert x^0 - x^{\star }\Vert ^2\).

\(\mathrm {(c)}\) If \(f^{*}\) is \(M_{f^{*}}\)-Lipschitz continuous on \(\mathrm {dom}\left( f^{*}\right) \), then for (3), it holds that

$$\begin{aligned} D(\tilde{y}^k) - D^{\star } \le (1-\tau )^k\bar{\mathcal {R}}_d^{\star } \ + \ \frac{\beta _0\Vert \dot{y} - y^{\star }\Vert ^2}{2(1+\tau )^k}, \end{aligned}$$
(26)

where \(\bar{\mathcal {R}}_d^{\star } := (1-\tau )\big [\mathcal {F}_{\beta _0}(x^0) - D(\tilde{y}^0)\big ] + \frac{\Vert K\Vert ^2\tau ^2}{2(\mu _{g^{*}} + \beta _{0})}\left( \Vert x^0\Vert + M_{f^{*}}\right) ^2\).

Proof

Since \(\tau _k = \tau = \frac{1}{\sqrt{1+\kappa _F}} = \sqrt{\frac{\mu _f\mu _{g^{*}}}{\Vert K\Vert ^2 + \mu _f\mu _{g^{*}}}} \in (0, 1)\) and \(\beta _{k-1} = (1+\tau )\beta _k\), after a few elementary calculations, we can show that the first condition of (14) automatically holds. The second condition of (14) is equivalent to \(m_k\tau + m_k^2 \ge \frac{L_k}{L_{k-1} + \mu _f}\). Since \(m_k \ge \frac{L_k}{L_{k-1} + \mu _f}\), this condition holds if \(m_k\tau + m_k^2 \ge m_k\), which is equivalent to \(\tau + m_k \ge 1\). This obviously holds true since \(\tau > 0\) and \(m_k \ge 1\).

From (15) of Lemma 3, we have \(\mathcal {V}_{k+1}(x) \le (1-\tau )\mathcal {V}_{k}(x)\). Therefore, by induction and using again (45), we get

$$\begin{aligned} \mathcal {V}_k(x) \le (1-\tau )^k\mathcal {V}_1(x) \le (1-\tau )^k\bar{\mathcal {R}}_p(x), \end{aligned}$$
(27)

where \(\bar{\mathcal {R}}_p(x) := (1-\tau )\big [ \mathcal {F}_{\beta _0}(x^0) - \mathcal {L}(x,\tilde{y}^0)\big ] + \frac{\Vert K\Vert ^2\tau ^2}{2(\mu _{g^{*}} + \beta _{0})}\Vert x^0 - x\Vert ^2\).

Now, since \(\beta _{k-1} = \frac{\beta _0}{(1+\tau )^k}\) due to the update rule of \(\beta _k\), by (8), we have

$$\begin{aligned} \mathcal {L}(x^k, y) - \mathcal {L}(x, \tilde{y}^k) \le \mathcal {V}_k(x) + \frac{\beta _{k-1}}{2}\Vert \dot{y} - y\Vert ^2 \le (1-\tau )^k\bar{\mathcal {R}}_p(x) + \frac{\beta _0\Vert \dot{y} - y\Vert ^2}{2(1+\tau )^k}. \end{aligned}$$

This implies (24). The estimates (25) and (26) can be proved similarly as in Theorem 1, and we omit the details here. \(\square \)

Remark 2

Since \(g^{*}\) is \(\mu _{g^{*}}\)-strongly convex, it is well-known that \(g\circ K\) is \(\frac{\Vert K\Vert ^2}{\mu _{g^{*}}}\)-smooth. Hence, \(\kappa _F := \frac{\Vert K\Vert ^2}{\mu _f\mu _{g^{*}}}\) is the condition number of F in (2). Theorem 3 shows that Algorithm 1 can achieve a \(\big (1 - \frac{1}{(1+\sqrt{2})\sqrt{\kappa _F}}\big )^k\)-linear convergence rate. Consequently, it also achieves \(\mathcal {O}\left( \sqrt{\kappa _F}\log \left( \frac{1}{\varepsilon }\right) \right) \) oracle complexity to obtain an \(\varepsilon \)-primal–dual solution \((x^k,\tilde{y}^k)\). This linear rate and complexity are optimal (up to a constant factor) under the given assumptions in Theorem 3. However, Algorithm 1 is very different from existing accelerated proximal gradient methods, e.g., [2, 18, 26] for solving (2), since our method uses the proximal operator of \(g^{*}\) (and therefore the proximal operator of g) instead of the gradient of g as in [2, 18, 26].
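To make this complexity estimate concrete, the following snippet computes \(\kappa _F\), \(\tau \), and the resulting iteration count for a made-up instance (all numbers are hypothetical):

```python
import math

normK, mu_f, mu_gstar, eps = 10.0, 0.1, 0.1, 1e-6    # hypothetical problem data
kappa_F = normK**2 / (mu_f * mu_gstar)               # condition number of F in (2)
tau = 1.0 / math.sqrt(1.0 + kappa_F)                 # constant tau of Theorem 3
# (1 - tau)^k <= eps as soon as k >= log(1/eps)/(-log(1 - tau)) = O(sqrt(kappa_F) * log(1/eps))
k_eps = math.ceil(math.log(1.0 / eps) / (-math.log(1.0 - tau)))
print(f"kappa_F = {kappa_F:.0f}, tau = {tau:.4f}, iterations for eps = {eps:g}: {k_eps}")
```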

Remark 3

The \(\mathcal {O}\left( 1/k\right) \), \(\mathcal {O}\left( 1/k^2\right) \), and linear convergence rates in Theorems 1, 2, and 3, respectively, are optimal (up to a constant factor) under the given assumptions, as discussed, e.g., in [20, 25]. The primal convergence rate on \(\{F(x^k) - F^{\star }\}\) has been proved in [23, Theorem 4], but only for the \(\mathcal {O}\left( 1/k\right) \) case. The convergence rates on \(\{\mathcal {G}_{\mathcal {X}\times \mathcal {Y}}(x^k,\tilde{y}^k)\}\) and \(\{D(\tilde{y}^k) - D^{\star }\}\) are new. Moreover, the convergence of the primal sequence is on the last iterate \(x^k\), while the convergence of the dual sequence is on the averaging iterate \(\tilde{y}^k\).

4 Numerical experiments

In this section, we provide four numerical experiments to verify the theoretical convergence behavior and the performance of Algorithm 1. Our algorithm is implemented in Matlab R2019b running on a MacBook laptop with a 2.8GHz Quad-Core Intel Core i7 and 16 GB RAM. We also compare our method with Nesterov’s smoothing algorithm in [17] as a baseline. We emphasize that our experiments below follow exactly the parameter update rules stated in Theorems 1 and 2, without any parameter tuning. To further improve the practical performance of Algorithm 1, one can exploit the restarting strategy in [23], whose theoretical guarantee is established in [22].

The nonsmooth and convex optimization problem we use for our experiments is the following representative model:

$$\begin{aligned} \min _{x\in \mathbb {R}^p}\Big \{ F(x) := \Vert Kx - b\Vert _2 + \lambda \Vert x\Vert _1 + \frac{\rho }{2}\Vert x\Vert _2^2 \Big \}, \end{aligned}$$
(28)

where \(K\in \mathbb {R}^{n\times p}\) and \(b\in \mathbb {R}^n\) are given, and \(\lambda > 0\) and \(\rho \ge 0\) are two given regularization parameters. The norm \(\Vert \cdot \Vert _2\) is the \(\ell _2\)-norm (or Euclidean norm). If \(\rho = 0\), then (28) reduces to the square-root LASSO model proposed in [3]. If \(\rho > 0\), then (28) becomes a square-root regression problem with an elastic-net regularizer similar to [30]. Clearly, if we define \(g(y) := \Vert y - b\Vert _2\) and \(f(x) := \lambda \Vert x\Vert _1 + \frac{\rho }{2}\Vert x\Vert _2^2\), then (28) can be rewritten in the form (2).
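For completeness, the two proximal operators needed to run Algorithm 1 on (28) under this splitting admit simple closed forms; a minimal sketch (standard derivations, our naming) is given below. Such callables can be passed to a primal–dual loop like the one sketched in Section 3, e.g. via lambdas that fix \(\lambda \), \(\rho \), and b.

```python
import numpy as np

def prox_f(v, gamma, lam, rho):
    """prox_{gamma*f}(v) for f(x) = lam*||x||_1 + (rho/2)*||x||_2^2 (elastic net):
    soft-thresholding by gamma*lam followed by scaling by 1/(1 + gamma*rho)."""
    return np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0) / (1.0 + gamma * rho)

def prox_gstar(z, gamma, b):
    """prox_{gamma*g*}(z) for g(u) = ||u - b||_2, so that
    g*(y) = <b, y> + indicator(||y||_2 <= 1): project z - gamma*b onto the unit l2-ball."""
    w = z - gamma * b
    return w / max(1.0, np.linalg.norm(w))
```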

To generate the input data for our experiments, we first generate K with i.i.d. standard Gaussian entries and either uncorrelated or \(50\%\) correlated columns. Then, we generate an observed vector b as \(b := Kx^{\natural } + \mathcal {N}(0,\sigma )\), where \(x^{\natural }\) is a predefined sparse vector and \(\mathcal {N}(0, \sigma )\) stands for Gaussian noise with variance \(\sigma = 0.05\). The regularization parameter \(\lambda \) that promotes sparsity is chosen as suggested in [3], and the parameter \(\rho \) is set to \(\rho := 0.1\). We fix the problem size at \(p := 1000\) and \(n := 350\) and choose the number of nonzero entries of \(x^{\natural }\) to be \(s := 100\). Then, for each experiment, we generate 30 instances of the same size but with different input data (K, b).

For Nesterov’s smoothing method, following [17], we smooth g as

$$\begin{aligned} g_{\gamma }(y) := \max _{v\in \mathbb {R}^n}\Big \{ \langle y - b, v\rangle - \frac{\gamma }{2}\Vert v\Vert ^2: \Vert v\Vert _2 \le 1 \Big \}, \end{aligned}$$

where \(\gamma > 0\) is a smoothness parameter. In order to choose \(\gamma \) properly for Nesterov’s smoothing method, we first solve (28) with CVX [13] using Mosek at high precision to obtain a highly accurate solution \(x^{\star }\) of (28). Then, we set \(\gamma ^{*} := \frac{\sqrt{2}\Vert K\Vert \Vert x^0 - x^{\star }\Vert }{k_{\max }\sqrt{D_{\mathcal {V}}}}\) by minimizing its theoretical bound from [17] w.r.t. \(\gamma > 0\), where \(D_{\mathcal {V}} := \frac{1}{2}\) is the prox-diameter of the unit \(\ell _2\)-norm ball, and \(k_{\max }\) is the maximum number of iterations. For Algorithm 1, using (17) and \(\dot{y} := 0\), we can set \(\beta _0 = \beta ^{*} := \frac{\Vert K\Vert \Vert x^0 - x^{*}\Vert }{M_g}\) by minimizing the right-hand side of (17) w.r.t. \(\beta _0 > 0\), where \(M_g := 1\). We choose \(k_{\max } := 5000\) for all experiments. To see the effect of the smoothness parameters \(\gamma \) and \(\beta _0\) on the performance of both algorithms, we also consider two variants of each method obtained by increasing or decreasing these parameters by a factor of 10. More specifically, we set them as follows.

  • For Nesterov’s smoothing scheme, we consider two additional variants by setting \(\gamma := 10\gamma ^{*}\) and \(\gamma := 0.1\gamma ^{*}\), respectively.

  • For Algorithm 1, we also consider two other variants with \(\beta _0 := 10\beta ^{*}\) and \(\beta _0 := 0.1\beta ^{*}\), respectively.

We first conduct two different experiments for the square-root LASSO model (i.e. setting \(\rho := 0\) in (28)). In this case, the underlying optimization problem is non-strongly convex and fully nonsmooth.

  • Experiment 1: We test Algorithm 1 (abbreviated by Alg. 1) and Nesterov’s smoothing method (abbreviated by Nes. Alg.) on 30 problem instances with uncorrelated columns of K. Since both algorithms essentially have the same per-iteration complexity, we report the relative primal objective residual \(\frac{F(x^k) - F(x^{\star })}{\max \{1, \vert F(x^{\star })\vert \}}\) against the number of iterations.

  • Experiment 2: We conduct the same test on another set of 30 problem instances, but using \(50\%\) correlated columns in the input matrix K.

The results of both experiments are depicted in Fig. 1, where the left plot is for Experiment 1 and the right plot is for Experiment 2. The solid line of each curve shows the mean over 30 problem samples, and the corresponding shaded area represents the sample variance of 30 problem samples (i.e. the area between the lowest and the highest deviation from the mean).

Fig. 1 The convergence behavior of Algorithm 1 and Nesterov’s smoothing scheme on 30 problem instances of (28) (the non-strongly convex case). Left plot: uncorrelated columns in K; right plot: \(50\%\) correlated columns in K

From Fig. 1, we observe that, with the choice \(\beta _0 := \beta ^{*}\) and \(\gamma := \gamma ^{*}\) suggested by the theory, both algorithms perform best compared to smaller or larger values of these parameters. We also see that Algorithm 1 outperforms Nesterov’s smoothing scheme in both experiments. If \(\beta _0\) (respectively, \(\gamma \)) is large, then both algorithms make good progress in early iterations, but saturate at a certain objective value in later iterations. Conversely, if \(\beta _0\) (respectively, \(\gamma \)) is small, then both algorithms perform worse in early iterations, but continue to decrease the objective value as the number of iterations increases. This behavior confirms the theoretical results stated in Theorem 1 and in [17]. Indeed, if \(\beta _0\) (or \(\gamma \)) is small, then the algorithmic stepsize is small. Hence, the algorithm makes slow progress in early iterations, but it approximates the nonsmooth function g more tightly, leading to a more accurate approximation of \(F(x^{\star })\) by \(F(x^k)\). In contrast, if \(\beta _0\) (or \(\gamma \)) is large, then we have a large stepsize and therefore faster convergence in early iterations, but the smoothed approximation is less accurate.

In order to test the strongly convex case covered by Theorem 2, we conduct two additional experiments on (28) with \(\rho := 0.1\). In this case, problem (28) is strongly convex with \(\mu _f := 0.1\). Since [17] does not directly handle the strongly convex case, we only compare the two variants of Algorithm 1 stated in Theorem 1 (Alg. 1) and Theorem 2 (Alg. 1b), respectively. We set \(\beta _0 := \frac{0.382\Vert K\Vert ^2}{\mu _f}\) in Alg. 1b as suggested by Theorem 2. We consider two experiments as follows:

  • Experiment 3: Test two variants of Algorithm 1 on a collection of 30 problem instances with uncorrelated columns of K.

  • Experiment 4: Conduct the same test on another set of 30 problem instances, but using \(50\%\) correlated columns in K.

The results of both variants of Algorithm 1 are reported in Fig. 2, where the left plot is for Experiment 3 and the right plot is for Experiment 4.

Fig. 2 The convergence behavior of the two variants of Algorithm 1 on a collection of 30 problem instances of (28) (the strongly convex case). Left plot: uncorrelated columns in K; right plot: \(50\%\) correlated columns in K

Clearly, as shown in Fig. 2, Alg. 1b (corresponding to Theorem 2) significantly outperforms Alg. 1 (corresponding to Theorem 1). Alg. 1 matches well the \(\mathcal {O}\left( 1/k\right) \) convergence rate stated in Theorem 1, while Alg. 1b exhibits the \(\mathcal {O}\left( 1/k^2\right) \) convergence rate indicated by Theorem 2. Note that since \(g^{*}\) in (28) is not strongly convex, we omit testing the result of Theorem 3. This case is also rather well studied in the literature; see, e.g., [5].

5 Concluding remarks

We have developed Algorithm 1, a new variant of the ASGARD method introduced in [23, Algorithm 1], that unifies three different settings: general convexity, strong convexity, and strong convexity and smoothness. We have proved the convergence of Algorithm 1 in all three settings and for three convergence criteria: the gap function, the primal objective residual, and the dual objective residual. Our convergence rates in all cases are optimal up to a constant factor, and the convergence guarantees for the primal sequence are on the last iterate. Our preliminary numerical experiments have shown that the theoretical convergence rates of Algorithm 1 match well the actual rates observed in practice. The proposed algorithm can easily be extended to solve composite convex problems with three or more objective terms. It can also be customized to solve other models, including general linear and nonlinear constrained convex problems, as discussed in [21, 23, 25].