1 Introduction

Consider the minimization problem

$$\begin{aligned} \min _{x\in Q\cap \varOmega }\, f(x) := h(x) + g(x), \end{aligned}$$
(1)

where \(Q\subset \,{{\mathbb {R}}}^n\) is a simple closed convex subset, \(\varOmega : = \left\{ x\in \,{{\mathbb {R}}}^n:\,Ax = b\right\} \) is an affine set with \(A\in \,{{\mathbb {R}}}^{m\times n}\) and \(b\in \,{{\mathbb {R}}}^m\), and \(f:\,{{\mathbb {R}}}^n\rightarrow \,{{\mathbb {R}}}\cup \{+\infty \}\) is proper, closed, and convex, with smooth part h and nonsmooth (simple) part g. The model problem (1) arises in many practical applications, such as compressed sensing [6], image processing [8] and decentralized distributed optimization [4].

In the literature, existing algorithms for solving (1) mainly include Bregman iteration [5, 29, 67], quadratic penalty method [35], augmented Lagrangian method (ALM) [26,27,28, 33, 41, 56], and alternating direction method of multipliers [22, 36, 40, 50, 53, 57, 60, 61, 65]. Generally speaking, these methods have sublinear rate \(\mathcal O(1/k)\) for convex problems and can be further accelerated to \({\mathcal {O}}(1/k^2)\) for (partially) strongly convex objectives. We also note that primal–dual methods [7, 20, 25, 31, 58, 59, 62] and operator splitting schemes [13, 18, 19] can be applied to (1) with two-block structure.

However, among these works, it is rare to see the optimal mixed-type convergence rate, i.e., the lower complexity bound [51]

$$\begin{aligned} {\mathcal {O}}\left( \min \left\{ \frac{\left\Vert {A} \right\Vert }{\epsilon },\,\frac{\left\Vert {A} \right\Vert }{\sqrt{\mu \epsilon }}\right\} + \min \left\{ \sqrt{L/\epsilon },\,\sqrt{L/\mu }\cdot \left|{\ln \epsilon } \right|\right\} \right) , \end{aligned}$$
(2)

where \(\mu \ge 0\) is the convexity parameter of f and L is the Lipschitz constant of \(\nabla h\). Both Nesterov’s smoothing technique [46] and the accelerated primal–dual method in [12] achieve the lower bound for the convex case \(\mu =0\). The inexact ALM framework in [66] possesses the optimal complexity (2) but involves a subroutine for inexactly solving the subproblem.

We mention that the second part of (2) corresponding to L and \(\mu \) agrees with the well-known lower complexity bound of first-order methods for unconstrained convex problems with Lipschitz gradients, namely (the affine set \(\varOmega \) is the entire space \(\,{{\mathbb {R}}}^n\))

$$\begin{aligned} \min _{x\in Q}\, f(x) := h(x) + g(x). \end{aligned}$$
(3)

The intermediate non-Lipschitz case is also of interest [43, 45]. In particular, when \(\nabla h\) is Hölder continuous (cf.(11)) with exponent \(\nu \in [0,1)\), Nesterov [48] presented a universal fast gradient method (FGM) for solving (3) that does not require a priori knowledge of the smoothness parameter \(\nu \) and the Hölderian constant \(M_\nu (h)\). A key ingredient of FGM is that Hölderian gradients can be recast into the standard Lipschitz case at the price of inexact computations [15, 54, 55], and it achieves the optimal complexity [44]

$$\begin{aligned} {\mathcal {O}} \left( \frac{[M_\nu (h)]^\frac{2}{1+3\nu }}{\epsilon ^\frac{2}{1+3\nu }}\right) . \end{aligned}$$
(4)

More extensions of FGM can be found in [23, 24, 32].

The dual problem of (1) reads equivalently as

$$\begin{aligned} \min _{\lambda \in \,{{\mathbb {R}}}^{m}}\,\left\{ d(\lambda ): = \left\langle {b,\lambda } \right\rangle +\max _{x\in Q}\left\{ -f(x)- \left\langle {A^\top \lambda ,x} \right\rangle \right\} \right\} . \end{aligned}$$
(5)

If f is uniformly convex of degree \(p\ge 2\) (see [48, Definition 1]), then \(\nabla d\) is Hölder continuous with exponent \(\nu = 1/(p-1)\) (cf. [48, Lemma 1]). The methods in [16, 37] work for (5) with strongly convex objective f, i.e., the Lipschitzian case (\(\nu = 1\)). Yurtsever et al. [68] proposed an accelerated universal primal–dual gradient method (AccUniPDGrad) for the general Hölderian case (\(\nu <1\)) and established the complexity bound (4) for the objective residual and the feasibility violation, with \(M_\nu (h)\) replaced by \(M_\nu (d)\). In the spirit of FGM, their method exploits the “inexactness” property of \(\nabla d\) and applies FISTA [2] to (5) with line search.

In this work, we propose a universal accelerated primal–dual method (see Algorithm 1) for solving (1). Compared with existing works, the main contributions are highlighted as follows:

  • It is a first-order black-box method for both the Lipschitz and Hölder cases and does not require prior knowledge of the smoothness level.

  • It uses the Bregman divergence and can handle the non-Euclidean setting.

  • In the line search part, it adopts a dynamically decreasing tolerance, while FGM [48] and AccUniPDGrad [68] use the fixed desired accuracy.

  • By using a Lyapunov function together with tight decay estimates of some differential/difference inequalities, we prove a universal mixed-type estimate that achieves the optimal complexity (including (2),(4) as special cases).

We also provide some numerical tests to validate the practical performance. It is confirmed that: (i) a proper choice of Bregman distance is indeed crucial; (ii) our method outperforms FGM and AccUniPDGrad, especially for non-Lipschitz problems and for smooth problems with large Lipschitz constants, since the automatically decreasing tolerance yields approximate Lipschitz constants of moderate magnitude.

Our method is motivated by an implicit-explicit time discretization of a novel accelerated Bregman primal–dual dynamics (see (24)), which extends the accelerated primal–dual flow in [39] to the non-Euclidean case. For unconstrained problems, there are some existing continuous dynamics with Bregman distances [34, 63, 64]. For the linearly constrained case, there is an accelerated primal–dual mirror model [69], inspired by the accelerated mirror descent [34] and the primal–dual dynamical approach [21], but without time discretizations.

The rest of the paper is organized as follows. In Sect. 2, we provide some preliminaries including the Bregman divergence and the Hölder continuity. Then, the main algorithm together with its universal mixed-type estimate is presented in Sect. 3, and rigorous proofs of two technical lemmas are summarized in Sects. 4 and 5, respectively. Finally, some numerical results are reported in Sect. 6.

2 Preliminary

2.1 Notations

Let \(\left\langle {\cdot ,\cdot } \right\rangle \) be the usual inner product of vectors and \(\left\Vert {\cdot } \right\Vert \) be the standard Euclidean norm (of vectors and matrices). Given a proper function \(g:\,{{\mathbb {R}}}^n\rightarrow \,{{\mathbb {R}}}\cup \{+\infty \}\), the effective domain of g is denoted as usual by \( \mathbf{dom\,}g\), and the subdifferential of g at any \(x\in \mathbf{dom\,}g\) is the set of all subgradients:

$$\begin{aligned} \partial g(x): = \left\{ \xi \in \,{{\mathbb {R}}}^n:\,g(y)\ge g(x)+\left\langle {\xi ,y-x} \right\rangle \quad \forall \,y\in \,{{\mathbb {R}}}^n\right\} . \end{aligned}$$

Recall that \(Q\subset \,{{\mathbb {R}}}^n\) is a nonempty closed convex subset. Let \(\iota _Q(\cdot )\) be the indicator function of Q and \(N_Q(\cdot ): =\partial \iota _Q(\cdot )\) be its normal cone.

Introduce the Lagrangian for the model problem (1):

$$\begin{aligned} {\mathcal {L}}(x,\lambda ): = f(x) +\iota _Q(x)+ \left\langle {\lambda ,Ax-b} \right\rangle \quad \forall \,(x,\lambda )\in \,{{\mathbb {R}}}^n\times \,{{\mathbb {R}}}^m. \end{aligned}$$

We say \((x^*,\lambda ^*)\in Q\times \,{{\mathbb {R}}}^m\) is a saddle point of \({\mathcal {L}}\) if

$$\begin{aligned} {\mathcal {L}}(x^*,\lambda )\le {\mathcal {L}}(x^*,\lambda ^*)\le \mathcal L(x,\lambda ^*)\quad \forall \,(x,\lambda )\in \,{{\mathbb {R}}}^n\times \,{{\mathbb {R}}}^m, \end{aligned}$$
(6)

which also implies the optimality condition:

$$\begin{aligned} Ax^*-b=0,\quad \partial f(x^*) + N_Q(x^*)+A^\top \lambda ^*\ni 0. \end{aligned}$$
(7)

2.2 Bregman Divergence

Let \(\phi :Q\rightarrow \,{{\mathbb {R}}}\) be a smooth prox-function and define the Bregman divergence

$$\begin{aligned} D_\phi (x,y) := \phi (x)-\phi (y)-\left\langle {\nabla \phi (y),x-y} \right\rangle \quad \forall \,x,\,y\in Q. \end{aligned}$$

Throughout, suppose \(\phi \) is 1-strongly convex:

$$\begin{aligned} D_\phi (x,y)\ge \frac{1}{2}\left\Vert {x-y} \right\Vert ^2\quad \forall \,x,\,y\in Q. \end{aligned}$$
(8)

Particularly, \(\phi (x) = \frac{1}{2}\left\Vert {x} \right\Vert ^2\) leads to \(D_\phi (x,y)=D_\phi (y,x)=1/2\left\Vert {x-y} \right\Vert ^2\), which boils down to the standard Euclidean setting. In addition, we have the following three-term identity; see [9, Lemma 3.2] or [17, Lemma 3.3].

Lemma 2.1

( [9, 17]) For any \(x,y,z\in Q\), it holds that

$$\begin{aligned} \left\langle {\nabla \phi (x)-\nabla \phi (y),y-z} \right\rangle = D_\phi (z,x) - D_\phi (z,y)- D_\phi (y,x). \end{aligned}$$
(9)

If \(\phi (x) = \frac{1}{2}\left\Vert {x} \right\Vert ^2\), then

$$\begin{aligned} 2\left\langle {x-y,y-z} \right\rangle = \left\Vert {x-z} \right\Vert ^2 - \left\Vert {y-z} \right\Vert ^2- \left\Vert {x-y} \right\Vert ^2. \end{aligned}$$
(10)

2.3 Hölder Continuity

Let h be a differentiable function on Q. For \(0\le \nu \le 1\), define

$$\begin{aligned} M_\nu (h): = \sup _{\begin{array}{c} x,\,y\in Q\\ x\ne y \end{array}}\frac{\left\Vert {\nabla h(x)-\nabla h(y)} \right\Vert }{\left\Vert {x-y} \right\Vert ^\nu }. \end{aligned}$$

If \(M_\nu (h)<\infty \), then \(\nabla h\) is Hölder continuous with exponent \(\nu \):

$$\begin{aligned} \left\Vert {\nabla h(x)-\nabla h(y)} \right\Vert \le M_\nu (h)\left\Vert {x-y} \right\Vert ^\nu \quad \forall \,x,y\in Q, \end{aligned}$$
(11)

and this also implies that

$$\begin{aligned} h(x)\le h(y)+\left\langle {\nabla h(y),x-y} \right\rangle +\frac{M_\nu (h)}{1+\nu }\left\Vert {x-y} \right\Vert ^{1+\nu } \quad \forall \,x,y\in Q. \end{aligned}$$
(12)

When \(\nu = 1\), \(M_1(h)\) corresponds to the Lipschitz constant of \(\nabla h\), and we also use the conventional notation \(L_h = M_1(h)\).

According to [48, Lemma 2], the estimate (12) can be transferred into the usual gradient descent inequality, with “inexact computations”. Based on this, (accelerated) gradient methods can be used to minimize functions with Hölder continuous gradients [15, 54, 55].

Proposition 2.1

([48]) Assume \(M_\nu (h)<\infty \) and for \(\delta >0\), define

$$\begin{aligned} M(\nu ,\delta ): =\delta ^{\frac{\nu -1}{\nu +1}}[M_\nu (h)]^{\frac{2}{\nu +1}}. \end{aligned}$$
(13)

Then, for any \(M\ge M(\nu ,\delta )\), we have

$$\begin{aligned} h(x)\le h(y)+\left\langle {\nabla h(y),x-y} \right\rangle +\frac{M}{2}\left\Vert {x-y} \right\Vert ^2+ \frac{\delta }{2} \quad \forall \,x,y\in Q. \end{aligned}$$
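A minimal numeric sketch of (13): the smaller the slack \(\delta \), the larger the surrogate constant \(M(\nu ,\delta )\), except in the Lipschitz case \(\nu =1\), where it stays at \(M_1(h)\). The function name and test values below are illustrative only.

```python
def surrogate_constant(M_nu, nu, delta):
    """M(nu, delta) = delta^((nu-1)/(nu+1)) * [M_nu]^(2/(nu+1)), cf. (13)."""
    return delta ** ((nu - 1.0) / (nu + 1.0)) * M_nu ** (2.0 / (nu + 1.0))

for delta in (1e-1, 1e-3, 1e-5):
    # For nu = 0.5 the constant grows like delta^(-1/3); for nu = 1 it is constant.
    print(delta, surrogate_constant(1.0, 0.5, delta), surrogate_constant(1.0, 1.0, delta))
```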

3 Main Algorithm

Throughout, we make the following assumption on \(f = h+g\):

Assumption 3.1

The nonsmooth part g is proper, closed, and convex on Q. The smooth part h satisfies \( \inf _{0\le \nu \le 1}M_\nu (h)<\infty \) and is \(\mu \)-convex on Q with \(\mu \ge 0\), namely,

$$\begin{aligned} h(x)\ge h(y)+\left\langle {\nabla h(y),x-y} \right\rangle +\mu D_\phi (x,y) \quad \forall \,x,y\in Q. \end{aligned}$$

Remark 3.1

In some cases, h might not be globally smooth (i.e., \(C^1\)) but is at least Lipschitz continuous. By Rademacher’s theorem [30, Theorem 3.1], Lipschitz continuous functions are differentiable almost everywhere. Thus, the assumption \(M_\nu (h)<\infty \) holds true with \(\nu =0\). Besides, in practical computations, we expect that the proximal calculation (cf.(14)) of the nonsmooth part g with respect to a proper \(D_\phi (\cdot ,\cdot )\) is easy to compute (or has a closed-form solution); see the matrix game problem in Sect. 6.1. \(\square \)

Our main algorithm, called the universal accelerated primal–dual (UAPD) method, is summarized in Algorithm 1, where the subpart “\( \texttt{sub}\text {-}\texttt{UAPD}\)” in lines 3 and 6 is given by Algorithm 2. Note that we do not require a priori the exponent \(\nu \) and the smoothness constant \(M_\nu (h)\) but perform a line search procedure (see lines 4–7).

Algorithm 1

Universal Accelerated Primal–Dual (UAPD) Method

Algorithm 2

\(\{\widetilde{y}_k,\widetilde{x}_k,\widetilde{v}_k,\widetilde{\alpha }_k,\widetilde{\delta }_k,\widetilde{\varDelta }_k\} = \texttt{sub}\text {-}\texttt{UAPD}(k,S_k,\widetilde{M}_{k})\)

3.1 Line Search

In Algorithm 1, the parameter \(\rho _{\!u}>1\) enlarges the approximate Lipschitz constant \(M_{k,i}\) (cf. line 5) to meet the following descent condition (cf. line 4)

$$\begin{aligned} \begin{aligned} h(x_{k,i})\le h(y_{k,i})+\left\langle {\nabla h(y_{k,i}),x_{k,i}-y_{k,i}} \right\rangle + \frac{M_{k,i}}{2}\left\Vert {x_{k,i}-y_{k,i}} \right\Vert ^2+\frac{\delta _{k,i}}{2}. \end{aligned} \end{aligned}$$
(15)

For each \(k\in {\mathbb {N}}\), the line search part terminates with the smallest integer \(i_k\) satisfying (15) and calls the subroutine \(\texttt{sub}\)-\(\texttt{UAPD}\) \(i_k+1\) times in total. The extra parameter \(\rho _{_{\!d}}\ge 1\) reduces the output constant \(M_{k,i_k}\) and updates \(M_{k+1} = M_{k,i_k}/\rho _{_{\!d}}\) (cf. line 8).
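A minimal sketch of this backtracking loop is given below; the callback sub_uapd is a hypothetical stand-in for Algorithm 2 that, given a trial constant M, returns the candidates \((y,x,\alpha ,\delta )\) computed with the step size (16) and the tolerance (17).

```python
import numpy as np

def line_search_step(h, grad_h, sub_uapd, M_k, rho_u=2.0, rho_d=1.0, max_trials=60):
    """Find the smallest i such that the descent condition (15) holds; a sketch only."""
    M = M_k                                   # M_{k,0} = M_k
    for i in range(max_trials):
        y, x, alpha, delta = sub_uapd(M)      # candidates from Algorithm 2 (assumed)
        gap = h(x) - h(y) - np.dot(grad_h(y), x - y)
        if gap <= 0.5 * M * np.dot(x - y, x - y) + 0.5 * delta:   # condition (15)
            return x, alpha, M / rho_d        # accept; M_{k+1} = M_{k,i_k} / rho_d
        M *= rho_u                            # enlarge the trial constant and retry
    raise RuntimeError("line search did not terminate")
```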

Remark 3.2

We introduce the pair \((\rho _{\! u},\rho _{_{\!d}})\) in our method for generality. Practically, it is not easy to find the optimal choice. If \(\rho _{_{\!d}} = 1\), then \(M_k\) is nondecreasing. Otherwise, \(M_k\) can be nonmonotone. Consider two situations.

  • For the standard Lipschitz case \((\nu =1)\), as analyzed in [47, Section 3], if \(M_0\le L_h\), then the choice \(\rho _{_{\!d}}\ge \rho _{\! u}\) guarantees that \(M_k\le L_h\) for all \(k\ge 0\).

  • For the Hölder continuous case \((\nu <1)\), we have \(M_k\rightarrow \infty \) as \(k\rightarrow \infty \) (since \(M(\nu ,\delta )\rightarrow \infty \) as \(\delta \rightarrow 0\)). Taking \(\rho _{_{\!d}}>1\) may reduce \(M_k\) (locally for some k) but will increase the burden on the line search procedure. \(\square \)

Below, let us show that \(i_k\) is finite for \(k\in {\mathbb {N}}\). Indeed, we see \(M_{k,i} = \rho _{\! u}^{i}M_{k,0}\) increases as i does, and the step size (cf. line 1 of Algorithm 2)

$$\begin{aligned} \alpha _{k,i} =\sqrt{ \frac{\beta _k\gamma _{k}}{\beta _k M_{k,i} + \left\Vert {A} \right\Vert ^2}} \end{aligned}$$
(16)

has to be decreasing. Thus the tolerance (cf. line 2 of Algorithm 2)

$$\begin{aligned} \delta _{k,i} = \frac{1}{k+1}\cdot \frac{\beta _{k}}{1+\alpha _{k,i}} \end{aligned}$$
(17)

is increasing, and by (13), \(M(\nu ,\delta _{k,i})\) is decreasing. This, together with Proposition 2.1 and Assumption 3.1, implies that

$$\begin{aligned} \begin{aligned} {}&i_k = 0{} & {} \text { if } M(\nu ,\delta _{k,0})\le M_{k,0},\\ {}&i_k\le \lceil s^*\rceil{} & {} \text { else}, \end{aligned} \end{aligned}$$

where \( \lceil s^*\rceil \) denotes the ceiling function (the minimal integer that is no less than \(s^*\)), and \(s^*\in [0,\infty )\) solves the equation \(M_{k,s^*} = M(\nu ,\delta _{k,s^*})\); see Fig. 1. In particular, we have

$$\begin{aligned} M_{k,s^*}= \rho _{\!u}^{s^*}M_{k,0}\le M(\nu ,\delta _{k,0})\quad \Longrightarrow \quad s^*\le \log _{ \rho _{\!u}}\frac{M(\nu ,\delta _{k,0})}{M_{k,0}}. \end{aligned}$$

This eventually leads to

$$\begin{aligned} i_k\le \max \left\{ 0,\,\bigg \lceil \log _{ \rho _{\!u}}\frac{M(\nu ,\delta _{k,0})}{M_{k,0}}\bigg \rceil \right\} <\infty . \end{aligned}$$
Fig. 1

The illustration of \(M_{k,s}\) and \(M(\nu ,\delta _{k,s})\) as functions of \(s\in [0,\infty )\). Here \(\delta _{k,\infty } =\lim \limits _{s\rightarrow \infty }\delta _{k,s}= \beta _k/(k+1)\) since \(\alpha _{k,s}\rightarrow 0\) as \(s\rightarrow \infty \)
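The ceiling bound on \(i_k\) can also be checked numerically; the sketch below uses arbitrary illustrative values of \(\beta _k,\gamma _k,\left\Vert {A} \right\Vert ,\nu ,M_\nu (h)\) and \(M_{k,0}\), counts how many enlargements by \(\rho _{\!u}\) are needed until \(M_{k,i}\ge M(\nu ,\delta _{k,i})\), and compares with the predicted bound.

```python
import math

rho_u, k = 2.0, 10
beta_k, gamma_k, normA = 0.1, 0.5, 1.0        # illustrative values only
nu, M_nu, M_k0 = 0.5, 5.0, 1e-3

def delta_of(M):                              # tolerance (17) with step size (16)
    alpha = math.sqrt(beta_k * gamma_k / (beta_k * M + normA ** 2))
    return beta_k / ((k + 1) * (1.0 + alpha))

def M_of(d):                                  # surrogate constant (13)
    return d ** ((nu - 1) / (nu + 1)) * M_nu ** (2 / (nu + 1))

i = 0
while rho_u ** i * M_k0 < M_of(delta_of(rho_u ** i * M_k0)):
    i += 1
bound = max(0, math.ceil(math.log(M_of(delta_of(M_k0)) / M_k0, rho_u)))
print(i, bound)                               # i_k should not exceed the ceiling bound
```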

Remark 3.3

In the line search part, our Algorithm 1 adopts the dynamically decreasing tolerance (17). However, the methods in [48] and [68, Algorithm 2] choose \(\delta _k=\epsilon /k\), where \(\epsilon \) is the desired accuracy. Hence, by Proposition 2.1, the approximate smoothness constant \(M_k\) of our method is smaller than that of the other two methods, especially in the Hölderian case. This will be verified by the numerical experiments in Sect. 6. \(\square \)

Below, we give an upper bound of \(M_k\) and of the total number of line search steps. By Theorem 3.1, \(\beta _k\) corresponds to the convergence rate of Algorithm 1 and admits an explicit decay estimate (see Lemma 3.3). If the desired accuracy \(\beta _{k+1}={\mathcal {O}}(\epsilon )\) is given, then the term \(\left|{\log _{\rho _{\!u}}\beta _{k+1}} \right|\) in (19) can be replaced by \(\left|{\log _{\rho _{\!u}}\epsilon } \right|\).

Lemma 3.1

For any \(k\in {\mathbb {N}}\), we have

$$\begin{aligned} M_{k+1}\le \frac{1}{\rho _{_{\!d}}}\max \left\{ \frac{M_{0}}{\rho _{_{\!d}}^{k}},\,\sqrt{\rho _{\! u}} \rho _{\! u}\cdot M(\nu ,\delta _{k+1})\right\} , \end{aligned}$$
(18)

and consequently, it holds that

$$\begin{aligned} \small \sum _{j=0}^{k}i_j\le \frac{3}{2}+k\,\frac{\ln \rho _{_{\!d}}}{\ln \rho _{\!u}}+ \frac{2}{1+\nu }\left|{ \log _{\rho _{\!u}}\frac{M_\nu (h)}{M_0}} \right|+ \frac{1-\nu }{1+\nu }\Big [ \log _{\rho _{\!u}}(k+1)+\left|{\log _{\rho _{\!u}}\beta _{k+1}} \right|\Big ]. \end{aligned}$$
(19)

Proof

See Appendix A. \(\square \)

3.2 Time Discretization Interpretation

Below, we provide a time discretization interpretation of Algorithm 1. Given the k-th iterates \((x_k,v_k,\lambda _k)\) and the parameters \((\gamma _k,\beta _k,M_{k})\), the line search procedure returns \((y_k,x_{k+1},v_{k+1},\lambda _{k+1})\) that satisfy

$$\begin{aligned} \frac{y_k-x_k}{\alpha _k} = v_k - y_k, \end{aligned}$$
(20a)
$$\begin{aligned} \gamma _{k}\,\frac{\nabla \phi (v_{k+1})-\nabla \phi (v_k)}{\alpha _k} \in \mu \big [\nabla \phi (y_k)-\nabla \phi (v_{k+1})\big ]-{\mathcal {G}}(y_k,v_{k+1},\lambda _k), \end{aligned}$$
(20b)
$$\begin{aligned} \frac{x_{k+1}-x_k}{\alpha _k} = v_{k+1}-x_{k+1}, \end{aligned}$$
(20c)
$$\begin{aligned} \beta _k\,\frac{\lambda _{k+1}-\lambda _{k}}{\alpha _k} = Av_{k+1}-b, \end{aligned}$$
(20d)

where \({\mathcal {G}}(y_k,v_{k+1},\lambda _k):=\nabla h(y_k)+\partial g(v_{k+1})+N_Q(v_{k+1})+A^\top \widetilde{\lambda }_k\) with \(\widetilde{\lambda }_k= \lambda _k+\alpha _k/\beta _k(Av_k-b)\), and the step size \(\alpha _k\) solves (cf.(16))

$$\begin{aligned} \alpha _k^2(\rho _{_{\!d}}\beta _{k}M_{k+1} +\left\Vert {A} \right\Vert ^2)= \gamma _k\beta _k, \end{aligned}$$
(21)

where we used the relation \(M_{k,i_k} = \rho _{_{\!d}} M_{k+1}\). Besides, \(y_k\) and \(x_{k+1}\) fulfill (cf. (15))

$$\begin{aligned} h(x_{k+1})\le h(y_k)+\left\langle {\nabla h(y_k),x_{k+1}-y_k} \right\rangle +\frac{\rho _{_{\!d}}M_{k+1}}{2}\left\Vert {x_{k+1}-y_k} \right\Vert ^2+\frac{\delta _{k+1}}{2}, \end{aligned}$$
(22)

and the parameters \((\gamma _{k+1},\beta _{k+1})\) are governed by (cf. line 9 of Algorithm 1)

$$\begin{aligned} \frac{\gamma _{k+1}-\gamma _k}{\alpha _k} = {}\mu -\gamma _{k+1},\quad \frac{\beta _{k+1}-\beta _k}{\alpha _k} = -\beta _{k+1}, \end{aligned}$$
(23)

with \(\beta _0 = 1\) and \(\gamma _0>0\). For clarity, in Appendix B we give a detailed derivation of the reformulation (20) from Algorithms 1 and 2.
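In code, one step of (21) and (23) can be sketched as follows, where M_next plays the role of \(M_{k+1}=M_{k,i_k}/\rho _{_{\!d}}\); a sketch only, not the full algorithm.

```python
import math

def update_parameters(gamma_k, beta_k, M_next, normA, mu, rho_d=1.0):
    """Solve (21) for alpha_k, then update (gamma, beta) via (23)."""
    alpha = math.sqrt(gamma_k * beta_k / (rho_d * beta_k * M_next + normA ** 2))  # (21)
    gamma_next = (gamma_k + mu * alpha) / (1.0 + alpha)                           # (23)
    beta_next = beta_k / (1.0 + alpha)                                            # (23)
    return alpha, gamma_next, beta_next
```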

As one can see, \(y_k\) in (20a) is an intermediate which provides a “prediction”, and then the “correction” step (20c) is used to update \(x_{k+1}\). From (20a), (20b), and (20c), it is not hard to find that \(y_k,\,v_{k+1},\,x_{k+1}\in Q\), as long as \(x_k,v_k\in Q\). Therefore, with \(x_0,v_0\in Q\), it holds that \(\{x_k,y_k,v_k\}_{k\in {\mathbb {N}}}\subset Q\).
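For concreteness, one iteration of (20) in the Euclidean unconstrained setting \(\phi = \frac{1}{2}\left\Vert {\cdot } \right\Vert ^2\), \(g=0\), \(Q=\,{{\mathbb {R}}}^n\) can be written out explicitly; the sketch below is only valid under these simplifying assumptions and presumes \(\alpha _k\) has already been produced by the line search.

```python
import numpy as np

def step_20_euclidean(x, v, lam, gamma, beta, alpha, grad_h, A, b, mu):
    """One pass through (20a)-(20d) with phi = 1/2||.||^2, g = 0 and Q = R^n."""
    y = (x + alpha * v) / (1.0 + alpha)                   # (20a): prediction
    lam_tilde = lam + alpha / beta * (A @ v - b)
    # (20b) with nabla phi = id becomes a linear equation in v_{k+1}:
    v_new = (gamma * v + mu * alpha * y
             - alpha * (grad_h(y) + A.T @ lam_tilde)) / (gamma + mu * alpha)
    x_new = (x + alpha * v_new) / (1.0 + alpha)           # (20c): correction
    lam_new = lam + alpha / beta * (A @ v_new - b)        # (20d): dual update
    return x_new, v_new, lam_new
```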

Furthermore, we mention that the reformulation (20) itself is an implicit-explicit time discretization of the following primal–dual dynamics:

$$\begin{aligned} \begin{aligned} {}&x' = v-x,\\ {}&\gamma \frac{\,\textrm{d}}{\,\textrm{d}t} \nabla \phi (v){}\in \mu \big [\nabla \phi (x)-\nabla \phi (v)\big ]- \partial f(x)-N_Q(x)-A^\top \lambda ,\\ {}&\beta \lambda ' {}= Av-b, \end{aligned} \end{aligned}$$
(24)

where \(\gamma \) and \(\beta \) are governed by continuous analogues to (23):

$$\begin{aligned} \gamma ' = \mu -\gamma ,\quad \beta ' =-\beta . \end{aligned}$$
(25)

We call (24) the Accelerated Bregman Primal–Dual (ABPD) flow. For \(\phi (x) = 1/2\left\Vert {x} \right\Vert ^2\), it amounts to the accelerated primal–dual flow in [39].

3.3 A Universal Estimate

Let \(\{(x_k,v_k,\lambda _k,\gamma _k,\beta _k)\}_{k\in {\mathbb {N}}}\) be the sequence generated from Algorithm 1. We introduce the discrete Lyapunov function

$$\begin{aligned} {\mathcal {E}}_k: = {\mathcal {L}}\left( x_{k},\lambda ^*\right) -{\mathcal {L}}\left( x^*,\lambda _k\right) + \gamma _kD_\phi (x^*,v_k)+\frac{\beta _k}{2}\left\Vert {\lambda _k-\lambda ^*} \right\Vert ^2. \end{aligned}$$
(26)

A one-step estimate is presented below.

Lemma 3.2

Under Assumption 3.1, we have

$$\begin{aligned} {\mathcal {E}}_{k+1}- {\mathcal {E}}_{k}\le -\alpha _k \mathcal E_{k+1}+\frac{\delta _{k+1}}{2}(1+\alpha _k)\quad \forall \,k\in \mathbb N. \end{aligned}$$
(27)

Proof

See Sect. 4. \(\square \)

Using this lemma, we obtain the following theorem, which says the final rate is given by the sharp decay estimate of the sequence \(\{\beta _k\}_{k\in {\mathbb {N}}}\); see Lemma 3.3.

Theorem 3.1

Under Assumption 3.1, we have \(\{x_k,v_k\}_{k\in {\mathbb {N}}}\subset Q\) and

$$\begin{aligned} \left\Vert {Ax_k-b} \right\Vert \le {}&\beta _k{\mathcal {T}}_{k}, \end{aligned}$$
(28)
$$\begin{aligned} \left|{f(x_k)-f(x^*)} \right|\le {}&\beta _k{\mathcal {W}}_{k}, \end{aligned}$$
(29)
$$\begin{aligned} {\mathcal {L}}\left( x_{k},\lambda ^*\right) -{\mathcal {L}}\left( x^*,\lambda _k\right) \le {}&\beta _k{\mathcal {R}}_{k}, \end{aligned}$$
(30)

for all \(k\in {\mathbb {N}}\), where \({\mathcal {R}}_{k}:={\mathcal {E}}_0+\ln (k+1),\,{\mathcal {T}}_{k}:=\left\Vert {Ax_0-b} \right\Vert +2\sqrt{2{\mathcal {R}}_{k}}\) and \({\mathcal {W}}_{k}:={\mathcal {R}}_{k}+\left\Vert {\lambda ^*} \right\Vert {\mathcal {T}}_{k}\). Moreover, we have

$$\begin{aligned} \gamma _{\min }\left\Vert {v_{k}-x^*} \right\Vert ^2 + \mu \left\Vert {x_k-x^*} \right\Vert ^2\le 4\beta _k{\mathcal {R}}_{k}, \end{aligned}$$
(31)

where \(\gamma _{\min }:= \min \{\gamma _0,\mu \}\ge 0\).

Proof

From (23) and the contraction estimate (27), it follows immediately that

$$\begin{aligned} {\mathcal {E}}_{k+1}\le \frac{1}{1+\alpha _k}\mathcal E_k+\frac{\delta _{k+1}}{2}\quad \Longrightarrow \quad \mathcal E_k\le \beta _k\mathcal E_0+\frac{\beta _k}{2}\sum _{i=0}^{k-1}\frac{\delta _{i+1}}{\beta _{i+1}}. \end{aligned}$$

By (17,23), we have

$$\begin{aligned} \delta _{k+1} =\frac{1}{k+1}\cdot \frac{\beta _k}{1+\alpha _k}=\frac{\beta _{k+1}}{k+1}, \end{aligned}$$

which further implies

$$\begin{aligned} {\mathcal {E}}_k\le \beta _k\mathcal E_0+\frac{\beta _k}{2}\sum _{i=0}^{k-1}\frac{1}{i+1} \le \beta _k\left[ {\mathcal {E}}_0+\ln (k+1)\right] =\beta _k\mathcal R_k, \end{aligned}$$
(32)

which proves (30). In view of (23), it holds that \(\gamma _{k+1} = (\gamma _k + \mu \alpha _k)/(1+\alpha _k)\ge \gamma _{\min }\). This together with (32) gives

$$\begin{aligned} \gamma _{\min }D_\phi (x^*,v_k)\le \gamma _kD_\phi (x^*,v_k)\le {\mathcal {E}}_k\le \beta _k{\mathcal {R}}_k. \end{aligned}$$

It is clear that \({\mathcal {L}}(\cdot ,\lambda )\) is convex and by (7), (30) and Assumption 3.1,

$$\begin{aligned} \mu D_{\phi }(x_k,x^*)\le {\mathcal {L}}\left( x_{k},\lambda ^*\right) -{\mathcal {L}}\left( x^*,\lambda ^*\right) \le \beta _k{\mathcal {R}}_k. \end{aligned}$$

Hence, combining the above two estimates with (8) leads to (31).

Following [39, Theorem 3.1], we can establish (28,29). For the sake of completeness, we provide the details below. Thanks to (20c) and (20d), we have

$$\begin{aligned} \begin{aligned} \lambda _{k+1} - \lambda _{k} ={}&\frac{\alpha _k}{\beta _k}(Av_{k+1}-b) = \frac{\alpha _k}{\beta _k}\left[ A\left( x_{k+1}+\frac{x_{k+1}-x_{k}}{\alpha _k}\right) -b\right] \\ = {}&\frac{\alpha _k}{\beta _k}\left[ \frac{1+\alpha _k}{\alpha _k}\left( Ax_{k+1} - b\right) -\frac{1}{\alpha _k}\left( Ax_k-b\right) \right] \\ = {}&\frac{1}{\beta _{k+1}}(Ax_{k+1}-b) - \frac{1}{\beta _k}(Ax_k-b), \end{aligned} \end{aligned}$$

which yields that

$$\begin{aligned} \lambda _{k}-\lambda _0 = \frac{1}{\beta _{k}}(Ax_{k}-b) - (Ax_0-b). \end{aligned}$$

By (32), we obtain \(\left\Vert {\lambda _k-\lambda ^*} \right\Vert ^2\le 2{\mathcal {R}}_k\) and

$$\begin{aligned} \begin{aligned} {}&\left\Vert {Ax_k-b} \right\Vert = \beta _k\left\Vert {\lambda _k-\lambda _0+Ax_0-b} \right\Vert \\&\quad \le {}\beta _k\left\Vert {\lambda _k-\lambda ^*} \right\Vert +\beta _k\left\Vert {\lambda _0-\lambda ^*} \right\Vert +\beta _k\left\Vert {Ax_0-b} \right\Vert \\&\quad \le {}\beta _k\sqrt{2\mathcal R_{k}}+\beta _k\left\Vert {\lambda _0-\lambda ^*} \right\Vert +\beta _k\left\Vert {Ax_0-b} \right\Vert . \end{aligned} \end{aligned}$$

In view of \({\mathcal {R}}_k={\mathcal {E}}_0+\ln (k+1)\) and \(\beta _0=1\), we find that \(\left\Vert {\lambda _0-\lambda ^*} \right\Vert ^2\le 2{\mathcal {R}}_k\). Plugging this into the above estimate gives (28). Observing (30), it follows that

$$\begin{aligned} -\left\langle {\lambda ^*,Ax_k-b} \right\rangle \le f(x_k)-f(x^*)\le \beta _k\mathcal R_k-\left\langle {\lambda ^*,Ax_k-b} \right\rangle , \end{aligned}$$

which yields

$$\begin{aligned} \left|{ f(x_k)-f(x^*)} \right| \le \beta _k\mathcal R_k+\left\Vert {\lambda ^*} \right\Vert \left\Vert {Ax_k-b} \right\Vert \le \beta _k{\mathcal {W}}_k. \end{aligned}$$

Consequently, this proves (29) and concludes the proof of this theorem. \(\square \)

Remark 3.4

Note that the choice (17) can be replaced with

$$\begin{aligned} \delta _{k,i} = \frac{\delta }{k+1}\cdot \frac{\beta _{k}}{1+\alpha _{k,i}},\quad \delta >0. \end{aligned}$$

Then the quantity \({\mathcal {R}}_{k} ={\mathcal {E}}_0+\ln (k+1)\) in Theorem 3.1 becomes \( {\mathcal {R}}_{k}=\mathcal E_0+\delta \ln (k+1)\). Taking \(\delta = 1/\ln (K+1)\) cancels the logarithm factor, where \(K\in {\mathbb {N}}\) is the maximal number of iterations chosen in advance. \(\square \)

It remains to investigate the decay rate of \(\{\beta _k\}_{k\in {\mathbb {N}}}\). From (21,23), we obtain

$$\begin{aligned} \beta _{k+1}-\beta _{k} = -\frac{\sqrt{\gamma _{k}\beta _k}\beta _{k+1}}{\sqrt{\rho _{_{\!d}}\beta _kM_{k+1} +\left\Vert {A} \right\Vert ^2}}. \end{aligned}$$
(33)

A careful investigation into this difference equation gives the desired result, which involves the following quantity

$$\begin{aligned} \varDelta := \max \left\{ M_0,\,\sqrt{\rho _{\! u}} \rho _{\! u}\left[ M_\nu (h)\right] ^{\frac{2}{1+\nu }}\right\} . \end{aligned}$$
(34)

Lemma 3.3

Assume that \(\max \{\gamma _0,\mu \}\le \left\Vert {A} \right\Vert ^2\). Then

$$\begin{aligned} \beta _k\le \frac{4\left\Vert {A} \right\Vert }{\sqrt{\gamma _0}k} + \frac{(16\sqrt{2})^{1+\nu }\varDelta ^{\frac{1+\nu }{2}}}{\gamma _0^{\frac{1+\nu }{2}}k^{\frac{1+3\nu }{2}}}\quad \forall \,k\ge 1, \end{aligned}$$
(35)

and moreover, we have

$$\begin{aligned} \beta _k\le \left\{ \begin{aligned} {}&\frac{64\left\Vert {A} \right\Vert ^2}{\gamma _{\min } k^2+\left\Vert {A} \right\Vert ^2}+ \exp \left( -\frac{k}{8}\sqrt{\frac{\gamma _{\min }}{\varDelta }}\right){} & {} \text { if }\nu =1,\\ {}&\frac{64\left\Vert {A} \right\Vert ^2}{\gamma _{\min } k^2+\left\Vert {A} \right\Vert ^2} + \left( 1+\frac{1-\nu }{32}\sqrt{\frac{\gamma _{\min }}{2^ {\frac{1-\nu }{1+\nu }}\varDelta }}\,k^{\frac{1+3\nu }{2+2\nu }}\right) ^ {-\frac{2+2\nu }{1-\nu }}{} & {} \text { if }\nu <1, \end{aligned} \right. \end{aligned}$$
(36)

where \(\gamma _{\min } = \min \{\gamma _0,\mu \}\ge 0\).

Proof

Since \(M(\nu ,\delta _{k+1}) = \delta _{k+1}^{\frac{\nu -1}{\nu +1}}[M_\nu (h)]^{\frac{2}{\nu +1}}\) and \(\delta _{k+1} = \beta _{k+1}/(k+1)\le 1\), by Lemma 3.1, we have

$$\begin{aligned} \begin{aligned} \rho _{_{\!d}} M_{k+1}\le {}&\max \left\{ \frac{M_{0}}{\rho _{_{\!d}}^{k}},\,\sqrt{\rho _{\! u}} \rho _{\! u} M(\nu ,\delta _{k+1})\right\} \\ ={}&\max \left\{ \frac{M_{0}}{\rho _{_{\!d}}^{k}},\,\sqrt{\rho _{\! u}} \rho _{\! u}[M_\nu (h)]^{\frac{2}{\nu +1}} \delta _{k+1}^{\frac{\nu -1}{\nu +1}}\right\} \le {}\varDelta \cdot \delta _{k+1}^{\frac{\nu -1}{\nu +1}}. \end{aligned} \end{aligned}$$

Plugging this into (33) gives

$$\begin{aligned} \beta _{k+1}-\beta _{k} \le -\frac{\sqrt{\gamma _{k}\beta _k}\beta _{k+1}}{\sqrt{\varDelta \cdot \beta _k\delta _{k+1}^{\frac{\nu -1}{\nu +1}} +\left\Vert {A} \right\Vert ^2}}. \end{aligned}$$
(37)

From this we obtain (35,36). Missing proofs are provided in Sect. 5. \(\square \)

Remark 3.5

Both estimates (35,36) hold for \(\mu \ge 0\). In addition, for the limiting case \(\nu \rightarrow 1-\), we have

$$\begin{aligned} \lim \limits _{\nu \rightarrow 1-} \left( 1+\frac{1-\nu }{32}\sqrt{\frac{\gamma _{\min }}{2^{\frac{1-\nu }{1+\nu }}\varDelta }}\,k^{\frac{1+3\nu }{2+2\nu }}\right) ^{-\frac{2+2\nu }{1-\nu }}= \exp \left( -\frac{k}{8}\sqrt{\frac{\gamma _{\min }}{\varDelta }}\right) , \end{aligned}$$

which matches the case \(\nu = 1\). \(\square \)
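The decay predicted by (35) can be checked against a direct simulation of the recursion (48), solved here with a fixed-point iteration since \(\delta _{k+1}=\beta _{k+1}/(k+1)\) makes the update implicit. The constants below are arbitrary test values satisfying \(\gamma _0\le \left\Vert {A} \right\Vert ^2\); a verification sketch, not part of the method.

```python
import math

def simulate_beta(nu, Delta, normA, gamma0, K):
    """Iterate (48) with equality and delta_{k+1} = beta_{k+1}/(k+1); a sketch."""
    beta = [1.0]
    for k in range(K):
        b_k = beta[-1]
        b_next = b_k
        for _ in range(50):                  # fixed-point iteration in beta_{k+1}
            delta = b_next / (k + 1)
            denom = math.sqrt(Delta * b_k * delta ** ((nu - 1) / (nu + 1)) + normA ** 2)
            b_next = b_k / (1.0 + math.sqrt(gamma0 * b_k) / denom)
        beta.append(b_next)
    return beta

nu, Delta, normA, gamma0 = 0.5, 2.0, 1.0, 1.0
beta = simulate_beta(nu, Delta, normA, gamma0, K=1000)
for k in (10, 100, 1000):
    bound = 4 * normA / (math.sqrt(gamma0) * k) + (16 * math.sqrt(2)) ** (1 + nu) \
            * Delta ** ((1 + nu) / 2) / (gamma0 ** ((1 + nu) / 2) * k ** ((1 + 3 * nu) / 2))
    print(k, beta[k], bound)                 # compare the simulated beta_k with (35)
```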

By the universal mixed-type estimate in Lemma 3.3, our Algorithm 1 achieves the lower complexity bound for both the affinely constrained case (i.e., \(A \ne 0_{m\times n}\)) and the unconstrained case \(\varOmega =\,{{\mathbb {R}}}^n\) (i.e., \(A = 0_{m\times n}\) and \(b = 0_{m}\)), with Hölderian smoothness exponent \(\nu \in [0,1]\). Detailed comparisons with existing results are given in the following remarks.

Remark 3.6

Consider the unconstrained case \(\varOmega =\,{{\mathbb {R}}}^n\) (i.e., \(A = 0_{m\times n}\) and \(b = 0_{m}\)).

  • The Lipschitzian case \(\nu = 1\): From (29), it follows that (neglecting the logarithm factor \(\ln (k+1)\))

    $$\begin{aligned} f(x_k) - f^* \le \beta _k{\mathcal {E}}_0, \end{aligned}$$

    where by Lemma 3.3, we have (taking \(\gamma _0>\mu \))

    $$\begin{aligned} \beta _k\le \min \left\{ \frac{2\cdot 16^2\varDelta }{\gamma _0k^2},\,\exp \left( -\frac{k}{8}\sqrt{\frac{\mu }{\varDelta }}\right) \right\} . \end{aligned}$$

    Hence, to achieve the accuracy \(f(x_k)-f^*\le \epsilon \), the iteration complexity is no more than (recalling (34) and assuming \(\varDelta \sim L_h\))

    $$\begin{aligned} \mathcal O\left( \min \left\{ \sqrt{\frac{L_h}{\epsilon }},\, \sqrt{\frac{L_h}{\mu }}\cdot \left|{\ln \epsilon } \right|\right\} \right) . \end{aligned}$$

    This is the well-known lower bound (cf. [45, 49]) of first-order methods for smooth convex functions with Lipschitzian gradients; see [10, 11, 38, 42, 47].

  • The Hölderian case \(0\le \nu < 1\): Similarly to the above analysis, the iteration complexity for \(f(x_k)-f^*\le \epsilon \) is bounded by

    $$\begin{aligned} {\mathcal {O}}\left( \min \left\{ \left( \frac{M_\nu (h)}{\epsilon }\right) ^\frac{2}{1+3\nu },\quad \left( \frac{M_\nu (h)}{\mu }\right) ^{\frac{2}{1+3\nu }}\cdot \left( \frac{\mu }{\epsilon }\right) ^{\frac{1-\nu }{1+3\nu }}\right\} \right) . \end{aligned}$$
    (38)

    This matches the lower bound in [43, 44]. The convex case \(\mu = 0\) has been obtained by the methods in [32, 43, 48], and the restarted schemes in [32, 52] attained the complexity bound for \(\mu >0\). Besides, Guminov et al. [24] obtained (38) for nonconvex problems, with an extra one-dimensional line search step. \(\square \)

Remark 3.7

Now, let us focus on the affinely constrained case (i.e., \(A \ne 0_{m\times n}\)).

  • The Lipschitzian case \(\nu = 1\): By (28,29), to achieve \(\left|{f(x_k)-f^*} \right|\le \epsilon \) and \(\left\Vert {Ax_k-b} \right\Vert \le \epsilon \), the iteration complexity (neglecting the logarithm factor \(\ln (k+1)\)) is

    $$\begin{aligned} {\mathcal {O}}\left( \min \left\{ \frac{\left\Vert {A} \right\Vert }{\epsilon }+\sqrt{\frac{L_h}{\epsilon }},\quad \frac{\left\Vert {A} \right\Vert }{\sqrt{\mu \epsilon }}+\sqrt{\frac{L_h}{\mu }}\cdot \left|{\ln \epsilon } \right| \right\} \right) . \end{aligned}$$
    (39)

    This coincides with the lower complexity bound in [51]. The methods in [12, 46, 66] achieved the bound for convex case \(\mu =0\), and the strongly convex case \(\mu >0\) can be found in [66].

  • The Hölderian case \(0\le \nu < 1\):

    $$\begin{aligned} {\mathcal {O}}\left( \min \left\{ \frac{\left\Vert {A} \right\Vert }{\epsilon }+\left( \frac{M_\nu (h)}{\epsilon }\right) ^\frac{2}{1+3\nu },\quad \frac{\left\Vert {A} \right\Vert }{\sqrt{\mu \epsilon }}+ \left( \frac{M_\nu (h)}{\mu }\right) ^{\frac{2}{1+3\nu }}\cdot \left( \frac{\mu }{\epsilon }\right) ^{\frac{1-\nu }{1+3\nu }} \right\} \right) . \end{aligned}$$

    Similarly to (39), this universal mixed-type estimate has optimal dependence on \(\left\Vert {A} \right\Vert \) (corresponding to the affine constraint), and the remainder agrees with (38), which is optimal with respect to \(\mu \) and \(M_\nu (h)\) (related to the objective). \(\square \)

4 Proof of Lemma 3.2

Start from the difference \( {\mathcal {E}}_{k+1}-{\mathcal {E}}_{k} = {\mathbb {I}}_1+{\mathbb {I}}_2+{\mathbb {I}}_3\), where

$$\begin{aligned} \begin{aligned} {\mathbb {I}}_1:={}&{\mathcal {L}}\left( x_{k+1},\lambda ^*\right) -{\mathcal {L}}\left( x_{k},\lambda ^*\right) ,\\ {\mathbb {I}}_2:= {}&\gamma _{k+1} D_\phi (x^*,v_{k+1})- \gamma _{k} D_\phi (x^*,v_k),\\ {\mathbb {I}}_3:={}&\frac{\beta _{k+1}}{2} \left\Vert {\lambda _{k+1}-\lambda ^*} \right\Vert ^2 - \frac{\beta _k}{2} \left\Vert {\lambda _{k}-\lambda ^*} \right\Vert ^2. \end{aligned} \end{aligned}$$

Notice that \(x_k,\,x_{k+1}\in Q\) and the first term is easy to handle:

$$\begin{aligned} {\mathbb {I}}_1 = f(x_{k+1})-f(x_k)+\langle \lambda ^*,A(x_{k+1}-x_k) \rangle . \end{aligned}$$
(40)

We give the estimates of \({\mathbb {I}}_2\) and \({\mathbb {I}}_3\) in Sects. 4.1 and 4.2, respectively, and finish the proof of (27) in Sect. 4.3.

4.1 Estimate of \({\mathbb {I}}_2\)

Invoking the three-term identity (9) and the difference equation of \(\{\gamma _k\}_{k\in {\mathbb {N}}}\) in (23), we split the second term \({\mathbb {I}}_2\) as follows

$$\begin{aligned} \begin{aligned} \mathbb I_2={}&(\gamma _{k+1}-\gamma _{k})D_\phi (x^*,v_{k+1})+\gamma _{k}\left[ D_\phi (x^*,v_{k+1})- D_\phi (x^*,v_k) \right] \\ = {}&\alpha _k(\mu -\gamma _{k+1})D_\phi (x^*,v_{k+1}) -\gamma _{k}D_\phi (v_{k+1},v_k)\\ {}&\quad +\gamma _{k}\left\langle {\nabla \phi (v_{k+1})-\nabla \phi (v_k),v_{k+1}-x^*} \right\rangle . \end{aligned} \end{aligned}$$
(41)

Let us prove

$$\begin{aligned} \begin{aligned} {}&\mu \alpha _k D_\phi (x^*,v_{k+1}) +\gamma _{k}\left\langle {\nabla \phi (v_{k+1})-\nabla \phi (v_k),v_{k+1}-x^*} \right\rangle \\ \le {}&h(x_k)-h(y_k) - \alpha _k\left[ h(y_k)-h(x^*)+\big \langle \widetilde{\lambda }_k, Av_{k+1} -b\big \rangle \right] \\ {}&\quad -\alpha _k\left[ g(v_{k+1})-g(x^*)+\left\langle {\nabla h(y_k),v_{k+1} - v_k} \right\rangle \right] , \end{aligned} \end{aligned}$$
(42)

which leads to the desired estimate of \({\mathbb {I}}_2\):

$$\begin{aligned} \begin{aligned} {\mathbb {I}}_2\le&-\alpha _k \gamma _{k+1}D_\phi (x^*,v_{k+1}) -\gamma _{k}D_\phi (v_{k+1},v_k)-\alpha _k \big \langle \widetilde{\lambda }_k, Av_{k+1} -b\big \rangle \\&\quad -\alpha _k\left[ g(v_{k+1})-g(x^*)+h(y_k)-h(x^*)\right] \\&\quad \quad +h(x_k)-h(y_k) -\alpha _k\left\langle {\nabla h(y_k),v_{k+1} - v_k} \right\rangle . \end{aligned} \end{aligned}$$
(43)

To do this, define \(\zeta _{k+1}\) by

$$\begin{aligned} \begin{aligned} {}&\gamma _k\big [\nabla \phi (v_{k+1})-\nabla \phi (v_k)\big ] \\ ={}&\mu \alpha _k\big [\nabla \phi (y_k)-\nabla \phi (v_{k+1})\big ] -\alpha _k\big [\nabla h(y_k)+\zeta _{k+1}+A^\top \widetilde{\lambda }_k\big ]. \end{aligned} \end{aligned}$$
(44)

Observing (20b), it follows that \(\zeta _{k+1}\in \partial g(v_{k+1})+N_Q(v_{k+1})\) and

$$\begin{aligned} \begin{aligned} {}&-\alpha _k\big \langle \zeta _{k+1}, v_{k+1} - x^{*}\big \rangle \le -\alpha _k\left[ g(v_{k+1})-g(x^*)\right] . \end{aligned} \end{aligned}$$

Thanks to (9), we have the decomposition

$$\begin{aligned} \begin{aligned} {}&\mu \alpha _k \left\langle {\nabla \phi (y_k)-\nabla \phi (v_{k+1}), v_{k+1} - x^{*}} \right\rangle \\ ={}&\mu \alpha _k \big [D_\phi (x^*,y_k) -D_\phi (x^*,v_{k+1}) -D_\phi (v_{k+1},y_k)\big ], \end{aligned} \end{aligned}$$

and invoking (20a) leads to

$$\begin{aligned}&-\alpha _k \left\langle { \nabla h(y_k), v_{k+1} - x^{*}} \right\rangle \\ =&-\alpha _k\left\langle {\nabla h(y_k),v_{k+1} - v_k} \right\rangle - \left\langle { \nabla h(y_k), y_{k} - x_{k}} \right\rangle - \alpha _k\left\langle { \nabla h(y_k), y_{k} - x^{*}} \right\rangle . \end{aligned}$$

Since \(x_k,\,y_k\in Q\), by Assumption 3.1 we obtain

$$\begin{aligned} {}&- \left\langle { \nabla h(y_k), y_{k} - x_{k}} \right\rangle - \alpha _k\left\langle { \nabla h(y_k), y_{k} - x^{*}} \right\rangle \\ \le {}&h(x_k)-h(y_k) - \alpha _k\left[ h(y_k)-h(x^*)+\mu D_\phi (x^*,y_k)\right] . \end{aligned}$$

Hence, combining the above estimates with (44) proves (42).

4.2 Estimate of \({\mathbb {I}}_3\)

Similarly to (41), by (9), (20d) and (23), the third term \({\mathbb {I}}_3\) can be rearranged as

$$\begin{aligned} \begin{aligned} {\mathbb {I}}_3= {}&-\frac{\alpha _k\beta _{k+1}}{2}\left\Vert {\lambda _{k+1}-\lambda ^*} \right\Vert ^2 -\frac{\beta _{k}}{2}\big \Vert \lambda _{k+1}-\lambda _k\big \Vert ^2\\ {}&\qquad +\alpha _{k}\big \langle Av_{k+1}-b,\lambda _{k+1}-\lambda ^*\big \rangle . \end{aligned} \end{aligned}$$

To match the cross term \( -\alpha _k \big \langle \widetilde{\lambda }_k, Av_{k+1} -b\big \rangle \) in the estimate of \({\mathbb {I}}_2\) (cf.(43)), we rewrite the last term as follows:

$$\begin{aligned} \begin{aligned} {}&\alpha _{k}\big \langle Av_{k+1}-b,\lambda _{k+1}-\lambda ^*\big \rangle \\ ={}&\alpha _{k}\big \langle Av_{k+1}-b,\lambda _{k+1}-\widetilde{\lambda }_k\big \rangle +\alpha _{k}\big \langle Av_{k+1}-b,\widetilde{\lambda }_k-\lambda ^*\big \rangle . \end{aligned} \end{aligned}$$

In view of (10) and (20d), we get

$$\begin{aligned} \begin{aligned} {}&\alpha _{k}\big \langle Av_{k+1}-b,\lambda _{k+1}-\widetilde{\lambda }_k\big \rangle =\beta _k\left\langle {\lambda _{k+1}-\lambda _{k},\lambda _{k+1}-\widetilde{\lambda }_k} \right\rangle \\ ={}&\frac{\beta _{k}}{2}\big \Vert \lambda _{k+1}-\lambda _k\Vert ^2 + \frac{\beta _{k}}{2} \big \Vert \lambda _{k+1}-\widetilde{\lambda }_k\big \Vert ^2 -\frac{\beta _{k}}{2}\big \Vert \lambda _{k}-\widetilde{\lambda }_k\big \Vert ^2, \end{aligned} \end{aligned}$$

which gives

$$\begin{aligned} {\mathbb {I}}_3\le -\frac{\alpha _k\beta _{k+1}}{2}\left\Vert {\lambda _{k+1}-\lambda ^*} \right\Vert ^2 +\frac{\beta _{k}}{2}\big \Vert \lambda _{k+1}-\widetilde{\lambda }_k\big \Vert ^2 +\alpha _{k}\big \langle Av_{k+1}-b,\widetilde{\lambda }_k-\lambda ^*\big \rangle . \end{aligned}$$

4.3 Proof of (27)

Combining (43) and the estimate of \({\mathbb {I}}_3\), we obtain

$$\begin{aligned} \begin{aligned} {\mathbb {I}}_2+{\mathbb {I}}_3\le&-\frac{\alpha _k\beta _{k+1}}{2}\left\Vert {\lambda _{k+1}-\lambda ^*} \right\Vert ^2 +\frac{\beta _{k}}{2}\big \Vert \lambda _{k+1}-\widetilde{\lambda }_k\big \Vert ^2-\alpha _k \big \langle \lambda ^*, Av_{k+1} -b\big \rangle \\ {}&\quad -\alpha _k \gamma _{k+1}D_\phi (x^*,v_{k+1}) -\gamma _{k}D_\phi (v_{k+1},v_k)\\&\qquad -\alpha _k\left[ g(v_{k+1})-g(x^*)+h(y_k)-h(x^*)\right] \\&\qquad \quad +h(x_k)-h(y_k) -\alpha _k\left\langle {\nabla h(y_k),v_{k+1} - v_k} \right\rangle . \end{aligned} \end{aligned}$$

Inserting the identity (40) into the above inequality and observing that

$$\begin{aligned} \begin{aligned} {}&-\alpha _k \big \langle \lambda ^*, Av_{k+1} -b\big \rangle +\langle \lambda ^*,A(x_{k+1}-x_k) \rangle \\&\quad =-\alpha _k \left\langle { \lambda ^*, A\left( x_{k+1}+\frac{x_{k+1}-x_k}{\alpha _k}\right) -b} \right\rangle +\langle \lambda ^*,A(x_{k+1}-x_k) \rangle \,\left( \text {by (20c)}\right) \\&\quad =-\alpha _{k}\langle \lambda ^*,Ax_{k+1}-b \rangle \end{aligned} \end{aligned}$$

and

$$\begin{aligned} \begin{aligned}&-\alpha _k\left[ g(v_{k+1})-g(x^*)+h(y_k)-h(x^*)\right] \\&\qquad =-\alpha _k\left[ g(x_{k+1})-g(x^*)+h(x_{k+1})-h(x^*) \right] \\&\qquad \quad -\alpha _k\left[ g(x_{k})-g(x_{k+1})+h(x_{k})-h(x_{k+1}) \right] \\&\qquad \quad -\alpha _k\left[ g(v_{k+1})-g(x_{k})+h(y_{k})-h(x_{k}) \right] \\&\qquad = -\alpha _k\left[ f(x_{k+1})-f(x^*)\right] -\alpha _k\left[ f(x_{k})-f(x_{k+1})\right] \\&\qquad \quad -\alpha _k\left[ g(v_{k+1})-g(x_{k})+h(y_{k})-h(x_{k}) \right] , \end{aligned} \end{aligned}$$

we arrive at

$$\begin{aligned} \begin{aligned} {\mathcal {E}}_{k+1}-\mathcal E_{k}\le&-\alpha _k\left[ f(x_{k+1})-f(x^*)\right] -\alpha _{k}\langle \lambda ^*,Ax_{k+1}-b \rangle \\ {}&\quad -\alpha _k \gamma _{k+1}D_\phi (x^*,v_{k+1})-\frac{\alpha _k\beta _{k+1}}{2}\left\Vert {\lambda _{k+1}-\lambda ^*} \right\Vert ^2\\ {}&\qquad +(1+\alpha _k)\left[ f(x_{k+1})-f(x_{k})\right] +\frac{\beta _{k}}{2}\big \Vert \lambda _{k+1}-\widetilde{\lambda }_k\big \Vert ^2 \\ {}&\quad \qquad -\alpha _k\left[ g(v_{k+1})-g(x_{k})+h(y_{k})-h(x_{k}) \right] -\gamma _{k}D_\phi (v_{k+1},v_k)\\ {}&\qquad \qquad +h(x_k)-h(y_k) -\alpha _k\left\langle {\nabla h(y_k),v_{k+1} - v_k} \right\rangle . \end{aligned} \end{aligned}$$

The first two lines equal \( -\alpha _k{\mathcal {E}}_{k+1}\), and a careful collection of the remaining terms gives

$$\begin{aligned} \begin{aligned} {\mathcal {E}}_{k+1}-{\mathcal {E}}_{k}\le&-\alpha _k{\mathcal {E}}_{k+1} +\frac{\beta _{k}}{2}\big \Vert \lambda _{k+1}-\widetilde{\lambda }_k\big \Vert ^2 -\gamma _{k}D_\phi (v_{k+1},v_k) \\ {}&\quad +(1+\alpha _k)\left[ h(x_{k+1})- h(y_{k})\right] -\alpha _k\left\langle {\nabla h(y_k),v_{k+1} - v_k} \right\rangle \\ {}&\qquad +(1+\alpha _k)g(x_{k+1})-g(x_{k}) -\alpha _kg(v_{k+1}). \end{aligned} \end{aligned}$$
(45)

From (20c), we see \(x_{k+1}\) is a convex combination of \(x_k\) and \(v_{k+1}\), which implies

$$\begin{aligned} (1+\alpha _k)g(x_{k+1})\le g(x_{k}) +\alpha _kg(v_{k+1}). \end{aligned}$$

Thanks to (22) and the relation \(\alpha _k(v_{k+1}-v_k)=(1+\alpha _k) (x_{k+1}-y_k) \) (cf.(20a) and (20c)), we obtain the estimate

$$\begin{aligned} \begin{aligned}&(1+\alpha _k)\left[ h(x_{k+1})- h(y_{k})\right] -\alpha _k\left\langle {\nabla h(y_k),v_{k+1} - v_k} \right\rangle \\&\qquad \le \frac{\alpha _k^2M_{k+1}}{2+2\alpha _k}\left\Vert {v_{k+1}-v_k} \right\Vert ^2 +\frac{\delta _{k+1}}{2}(1+\alpha _k). \end{aligned} \end{aligned}$$

Consequently, plugging these two estimates into (45) and using (8) leads to

$$\begin{aligned} \begin{aligned} {\mathcal {E}}_{k+1}-{\mathcal {E}}_{k}\le&-\alpha _k{\mathcal {E}}_{k+1}+\frac{\delta _{k+1}}{2}(1+\alpha _k) +\frac{\beta _{k}}{2}\big \Vert \lambda _{k+1}-\widetilde{\lambda }_k\big \Vert ^2\\&\quad +\frac{\alpha _k^2M_{k+1}-\gamma _{k}(1+\alpha _k)}{1+\alpha _k}D_\phi (v_{k+1},v_k). \end{aligned} \end{aligned}$$

Recall that \(\widetilde{\lambda }_k= \lambda _k+\alpha _k/\beta _k(Av_k-b)\), which together with (20d) gives \(\lambda _{k+1}- \widetilde{\lambda }_k=\alpha _k/\beta _k A(v_{k+1}-v_k)\) and

$$\begin{aligned} \begin{aligned} {\mathcal {E}}_{k+1}-{\mathcal {E}}_{k}\le {}&\!-\alpha _k{\mathcal {E}}_{k+1}+\frac{\delta _{k+1}}{2}(1+\alpha _k)\\ {}&\quad +\frac{1}{\beta _{k}} \left[ \alpha _k^2(\beta _{k+1}M_{k+1} +\left\Vert {A} \right\Vert ^2)-\gamma _k\beta _k\right] D_\phi (v_{k+1},v_k). \end{aligned} \end{aligned}$$

Since \(\beta _{k+1}\le \beta _{k}\) (cf.(23)), the desired estimate (27) follows immediately from (21). This finishes the proof of Lemma 3.2.

5 Proof of Lemma 3.3

The key to complete the proof of Lemma 3.3 is the difference inequality (37). In Sect. 5.1, we shall introduce an auxiliary differential inequality (cf.(46)) that can be viewed as a continuous analogue to (37). Later in Sects. 5.2 and 5.3, we finish the proofs of (35),(36) by using the asymptotic estimate of (46).

5.1 A Differential Inequality

In the sequel, we need some function spaces in one dimension; see [14, Definitions 1.87 and 2.1]. Denote by \(C^1(I)\) the space of real-valued continuously differentiable functions on the interval \(I\subset \,{{\mathbb {R}}}\). Let \(L^\infty (I)\) be the space of essentially bounded measurable functions, which means that any \(\sigma \in L^\infty (I)\) is bounded almost everywhere. The space \(L^1(I)\) consists of measurable functions that are absolutely integrable. As usual, we denote by \(W^{1,\infty }(I)\) the set of all real-valued functions which, together with their generalized derivatives, belong to \(L^\infty (I)\).

Let \(\eta ,\,R\ge 0\) and \(\theta >1\) be constants such that \(\eta \le \theta -1\). Suppose \(y\in W^{1,\infty }(0,\infty )\) is positive and satisfies the differential inequality

$$\begin{aligned} y'(t) \le -\frac{\sigma (t) y^\theta (t)}{\sqrt{\varphi (t)y^{2\eta }(t) + R^2}},\quad y(0) = 1, \end{aligned}$$
(46)

where \(\sigma \in L^1(0,\infty )\) is nonnegative, and \(\varphi \in C^1[0,\infty )\) is positive and nondecreasing. The decay rate of y(t) is given below.

Lemma 5.1

Assume \(y\in W^{1,\infty }(0,\infty )\) is positive and satisfies (46). Then, for all \(t>0\), we have

$$\begin{aligned} y(t)\le \left\{ \begin{aligned} {}&\exp \left( -\frac{\varSigma (t)}{2\sqrt{\varphi (t)}}\right) + \left( 1+\frac{\theta -1}{2R}\varSigma (t)\right) ^{\frac{1}{1-\theta }}{} & {} \text {if } \eta =\theta -1,\\ {}&\left( 1+\frac{\theta -1}{2R}\varSigma (t)\right) ^{\frac{1}{1-\theta }}+ \left( 1+\frac{\theta -\eta -1}{2\sqrt{\varphi (t)}}\varSigma (t)\right) ^{\frac{1}{\eta +1-\theta }}{} & {} \text {if }\eta <\theta -1, \end{aligned} \right. \end{aligned}$$

where \(\varSigma (t):=\int _{0}^{t}\sigma (s)\,\textrm{d}s\).

Proof

Write (46) as follows

$$\begin{aligned} y'(t) \le -\frac{\sigma (t) }{\sqrt{\varphi (t)y^{2\eta -2\theta }(t) + R^2y^{-2\theta }(t)}}. \end{aligned}$$

Moving the denominator to the left-hand side and using the trivial estimate

$$\begin{aligned} \sqrt{\varphi (t)y^{2\eta -2\theta }(t) + R^2y^{-2\theta }(t)}\le \sqrt{\varphi (t)}y^{\eta -\theta }(t)+Ry^{-\theta }(t), \end{aligned}$$

we obtain that

$$\begin{aligned} \left( \frac{\sqrt{\varphi (t)}}{y^{\theta -\eta }(t)}+\frac{R}{y^{\theta }(t)}\right) y'(t)\le -\sigma (t). \end{aligned}$$
(47)

To this end, we discuss two cases: \(\eta =\theta -1\) and \(\eta <\theta -1\). Detailed proofs can be found in Appendix C. \(\square \)
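A quick numerical check of Lemma 5.1 in the case \(\eta <\theta -1\): integrate (46) with equality for constant data by explicit Euler and compare with the stated bound. All constants below are arbitrary test values; a verification sketch only.

```python
import math

theta, eta, R, sigma, varphi = 2.0, 0.5, 1.0, 1.0, 4.0      # eta < theta - 1
dt, T = 1e-4, 50.0
y, t = 1.0, 0.0
while t < T:                                                 # explicit Euler for (46)
    y += dt * (-sigma * y ** theta / math.sqrt(varphi * y ** (2 * eta) + R ** 2))
    t += dt

Sigma = sigma * T                                            # Sigma(T) for constant sigma
bound = (1 + (theta - 1) / (2 * R) * Sigma) ** (1 / (1 - theta)) \
      + (1 + (theta - eta - 1) / (2 * math.sqrt(varphi)) * Sigma) ** (1 / (eta + 1 - theta))
print(y, bound)                                              # y(T) should lie below the bound
```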

5.2 Proof of (35)

By (23), we have

$$\begin{aligned} \gamma _{k+1}-\gamma _{0}\beta _{k+1} = \frac{\mu \alpha _{k}+\gamma _{k}}{1+\alpha _k}-\frac{\gamma _{0}\beta _k}{1+\alpha _k}\ge \frac{\gamma _{k}-\gamma _{0}\beta _k}{1+\alpha _k}. \end{aligned}$$

Since \(\gamma _{0}=\gamma _{0}\beta _0\), it follows that \(\gamma _k \ge \gamma _0\beta _{k}\) and (37) becomes

$$\begin{aligned} \beta _{k+1}-\beta _{k} \le -\frac{\sqrt{\gamma _{0}}\beta _k\beta _{k+1}}{\sqrt{\varDelta \cdot \beta _k\delta _{k+1}^{\frac{\nu -1}{\nu +1}} +\left\Vert {A} \right\Vert ^2}}, \end{aligned}$$
(48)

where \(\delta _{k+1} = \beta _{k+1}/(k+1)\).

Define the continuous piecewise linear interpolation

$$\begin{aligned} y(t): ={}\beta _{k}(k+1-t)+\beta _{k+1}(t-k)\quad \forall \,t\in [k,k+1), \, k\in {\mathbb {N}}. \end{aligned}$$
(49)

Clearly, \(y\in W^{1,\infty }(0,\infty )\) is positive and \(0<y(t)\le y(0) = 1\). In particular, we have \(\beta _k = y(k)\) for all \(k\in {\mathbb {N}}\), and the decay estimate of \(\beta _k\) is transferred into the asymptotic behavior of y(t), which satisfies (the proof is given below)

$$\begin{aligned} y' (t)\le -\frac{\sqrt{\gamma _0}y^{2}(t)/2}{\sqrt{\varphi (t)[y(t)]^{\frac{2\nu }{1+\nu }} +\left\Vert {A} \right\Vert ^2 }}, \end{aligned}$$
(50)

where \(\varphi (t):=4\varDelta (t+1)^{\frac{1-\nu }{1+\nu }}\). Thus, utilizing Lemma 5.1 gives

$$\begin{aligned} \beta _k=y(k)\le \frac{4\left\Vert {A} \right\Vert }{\sqrt{\gamma _0}k} + \frac{(16\sqrt{2})^{1+\nu }\varDelta ^{\frac{1+\nu }{2}}}{\gamma _0^{\frac{1+\nu }{2}}k^{\frac{1+3\nu }{2}}} \quad \forall \,k\ge 1, \end{aligned}$$

which establishes (35).

Below, let us verify (50). Since \(\gamma _k\le \max \{\gamma _0,\mu \}\le \left\Vert {A} \right\Vert ^2\), from (21) we find that \(\alpha _k\le \sqrt{\gamma _k\beta _k}/\left\Vert {A} \right\Vert \le 1\). For any \(t\in (k,k+1)\), it is clear that

$$\begin{aligned} 1\ge \frac{\beta _{k+1}}{y(t)}\ge \frac{\beta _{k+1}}{\beta _{k}}=\frac{1}{1+\alpha _k}\ge \frac{1}{2},\quad \text {and}\quad 1\le \frac{\beta _k}{y(t)}\le \frac{\beta _{k}}{\beta _{k+1}}\le 2, \end{aligned}$$

which implies

$$\begin{aligned} \varDelta \cdot \beta _k\delta _{k+1}^{\frac{\nu -1}{\nu +1}}= \frac{\varphi (k)\beta _k\beta _{k+1}^{\frac{\nu -1}{\nu +1}}}{4} \le \varphi (t)[y(t)]^{\frac{2\nu }{1+\nu }}. \end{aligned}$$

Since \(y'(t) = \beta _{k+1}-\beta _k\), plugging the above estimate into (48) proves (50).

5.3 Proof of (36)

Again, by (23), we have \(\gamma _{k}\ge \gamma _{\min } = \min \{\gamma _0,\mu \}\), and (37) becomes

$$\begin{aligned} \beta _{k+1}-\beta _{k} \le -\frac{\sqrt{\gamma _{\min }\beta _k}\beta _{k+1}}{\sqrt{\varDelta \cdot \beta _k\delta _{k+1}^{\frac{\nu -1}{\nu +1}} +\left\Vert {A} \right\Vert ^2}}. \end{aligned}$$

Recall the interpolation y(t) defined by (49). Similarly to (50), we claim that

$$\begin{aligned} y' (t)\le -\frac{\sqrt{\gamma _{\min }}y^{3/2}(t)/2}{\sqrt{\varphi (t)[y(t)]^{\frac{2\nu }{1+\nu }} +\left\Vert {A} \right\Vert ^2 }}, \end{aligned}$$

and invoking Lemma 5.1 again gives

$$\begin{aligned} \beta _k\le \left\{ \begin{aligned} {}&\frac{64\left\Vert {A} \right\Vert ^2}{\gamma _{\min } k^2+\left\Vert {A} \right\Vert ^2}+\exp \left( -\frac{k}{8}\sqrt{\frac{\gamma _{\min }}{\varDelta }}\right){} & {} \text { if }\nu =1,\\ {}&\frac{64\left\Vert {A} \right\Vert ^2}{\gamma _{\min } k^2+\left\Vert {A} \right\Vert ^2} + \left( 1+\frac{1-\nu }{32}\sqrt{\frac{\gamma _{\min }}{2^ {\frac{1-\nu }{1+\nu }}\varDelta }}\,k^{\frac{1+3\nu }{2+2\nu }}\right) ^{-\frac{2+2\nu }{1-\nu }}{} & {} \text { if }\nu <1. \end{aligned} \right. \end{aligned}$$

This proves (36) and completes the proof of Lemma 3.3.

6 Numerical Examples

In this part, we provide several experiments to validate the performance of our Algorithm 1 (denoted by UAPD). It is compared with Nesterov’s FGM [48] and the AccUniPDGrad method [68], for solving unconstrained and affinely constrained problems.

Both UAPD and FGM involve the proximal mapping of the nonsmooth part g, and FGM performs one more proximal calculation for updating \(v_{k+1}\). In the line search part, FGM and AccUniPDGrad use the tolerance \(\delta _k = \epsilon \tau _k\) with \(\tau _k = \mathcal O(1/k)\), where \(\epsilon \) is the desired accuracy. Clearly, this is smaller than our choice \(\delta _k = \beta _k/k\). As discussed in Remark 3.3, this leads to an over-estimation issue, especially for the Hölderian case (cf. Sect. 6.1) and for smooth problems with large Lipschitz constants (cf. Sect. 6.2).

6.1 Matrix Game

The problem reads as

$$\begin{aligned} \min _{x\in \varDelta _n} \max _{y\in \varDelta _m}\left\langle {x,Py} \right\rangle = \min _{x\in \varDelta _n}\left\{ \varphi (x): = \max _{1\le j\le m}\left\langle {p_j,x} \right\rangle \right\} , \end{aligned}$$
(51)

where \(P =(p_1,p_2,\cdots ,p_m)\in \,{{\mathbb {R}}}^{n\times m}\) is the given payoff matrix and \(\varDelta _{\times }\subset \,{{\mathbb {R}}}^\times \) is the standard simplex with \(\times =m\) or n. By von Neumann’s minimax theorem [1, Corollary 15.30], we can exchange the min-max order in (51) and obtain

$$\begin{aligned} \begin{aligned} \max _{y\in \varDelta _m} \min _{x\in \varDelta _n} \left\langle {x,Py} \right\rangle = \max _{y\in \varDelta _m}\left\{ \psi (y):=\min _{1\le i\le n}\left\langle {q_i,y} \right\rangle \right\} , \end{aligned} \end{aligned}$$
(52)

where \(P^\top =(q_1,q_2,\cdots ,q_n)\) with \(q_i\in \,{{\mathbb {R}}}^m\). Moreover, there is no duality gap, i.e., \(\varphi ^* = \psi ^*\). According to [1, Proposition 5.12], we claim that \(\varphi (x^\#) = \psi (y^\#)\) for some \((x^\#,y^\#)\in \varDelta _n\times \varDelta _m\) if and only if \(\varphi (x^\#) = \varphi ^*\) and \(\psi (y^\#) = \psi ^*\).

As we do not know the optimal value, consider the unconstrained problem

$$\begin{aligned} \min _{x\in \varDelta _n,\,y\in \varDelta _m}\left\{ f(x,y): = \varphi (x)-\psi (y)\right\} . \end{aligned}$$
(53)

Clearly, the minimal value is zero:

$$\begin{aligned} f^* = \min _{x\in \varDelta _n}\varphi (x) - \max _{y\in \varDelta _m}\psi (y) = \varphi ^*-\psi ^* = 0. \end{aligned}$$

Moreover, \((x^\#,y^\#)\) is an optimal solution to (53) if and only if \(x^\#\) and \(y^\#\) are optimal solutions to (51,52), respectively.

Since \(\psi \) is concave, f is convex but nonsmooth (Lipschitz continuous). In view of Remark 3.1, f satisfies Assumption 3.1 without the simple part g. In addition, to work with the simplex constraint, a proper prox-function is the entropy \(\phi (x) = \left\langle {x,\ln x} \right\rangle \), which gives a closed-form solution of the proximal calculation (14).
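For reference, the kind of closed-form update meant here is the multiplicative (softmax-like) Bregman step on the simplex; a sketch of the general pattern, not the exact subproblem of Algorithm 2, and it assumes v has strictly positive entries.

```python
import numpy as np

def entropy_mirror_step(v, grad, step):
    """argmin over the simplex of <step * grad, x> + D_phi(x, v) with phi(x) = <x, ln x>,
    whose solution is x_i proportional to v_i * exp(-step * grad_i)."""
    logits = np.log(v) - step * grad
    logits -= logits.max()                   # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()
```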

Fig. 2

Numerical results of FGM and UAPD for solving (53) with \(m = 100,\,n = 400\). The desired accuracy of FGM is \(\epsilon = 1e\)-5

Numerical results are displayed in Fig. 2. We record (i) the decay behavior of the objective residual \(\left|{f(x_k,y_k)-f^*} \right|\) with respect to both the iteration number k and the running time t (in seconds), (ii) the total number \(\#i_k\) of line search steps \(i_k\) and its average \(\bar{i}_k\), and (iii) the approximate Lipschitz constant \(M_k\). The payoff matrix P is generated from the normal distribution. For FGM, the a priori accuracy is \(\epsilon = 1e\)-5. For our method, the line search parameters are \(\rho _{\!u} = 2\) and \(\rho _{_{\!d}} = 1\).

From Fig. 2, our UAPD outperforms FGM, with faster convergence, fewer line search steps and smaller Lipschitz constants. Within the same number of iterations \(k = 1e5\), UAPD achieves a ten times smaller error (1e-2 vs. 1e-1) while taking only about half the time (20 s vs. 40 s). It can be seen that FGM has many more line search steps and its average is 1, which means it performs at least one line search step in each iteration. Since \(\rho _{_{\!d}}=1\), according to Remark 3.2, the Lipschitz constant \(M_k\) of our UAPD is nondecreasing, which agrees with the numerical illustration. As mentioned in Remark 3.3, FGM suffers from over-estimated Lipschitz parameters with dramatic growth since it adopts the smaller tolerance \(\epsilon /k\) (compared with our dynamically decreasing choice \(\beta _k/k\)). Besides, in step 3 of FGM, it updates the Lipschitz constant by \(L_{k+1} = L_{k,i_k}/2=2^{i_k}L_k/2\) (here L agrees with our notation M), which corresponds to choosing \(\rho _{_{\!d}}=2\) in our method. Hence, the Lipschitz constant sequence of FGM is nonmonotone and changes with high oscillation. This verifies the claim in Remark 3.2 that a large \(\rho _{_{\!d}}>1\) reduces \(M_k\) locally but increases the burden on the line search procedure.

Fig. 3

Numerical performances of UAPD on the problem (53) with different prox-functions

In Fig. 3, we also compare the Euclidean distance \(\phi (x) = 1/2\left\Vert {x} \right\Vert ^2\) with the entropy \(\phi (x) = \left\langle {x,\ln x} \right\rangle \). It is observed that the two cases are very similar in the line search part, but the latter leads to a better convergence rate.

Fig. 4

Numerical performances of UAPD on the problem (53) with \(\rho _{\!u} = 2\) and different \(\rho _{_{\!d}}\)

Besides, we report the performance of our UAPD with different \(\rho _{_{\!d}}\). From Fig. 4, we see little improvement in the convergence rate with respect to the iteration number k, and a large \(\rho _{_{\!d}}\) does not outperform smaller choices: it takes more than triple the time (120.12 s vs. 39.62 s) yet does not even reduce the residual by a third (6.07e-3 vs. 4.37e-3). Moreover, the magnitude of \(M_k\) does not differ much, but a large \(\rho _{_{\!d}}\) calls for more line search steps, which increases the computational cost. In conclusion, for the matrix game problem, \(\rho _{_{\!d}} = 1\) seems a reasonable choice.

6.2 Regularized Matrix Game Problem

The objective of (51) admits an approximation (cf. [46])

$$\begin{aligned} \varphi _\sigma (x): = \sigma \ln \left( \sum _{j=1}^{m}e^{\left\langle {p_j,x} \right\rangle /\sigma }\right) , \end{aligned}$$
(54)

where \(\sigma >0\) denotes the smoothing parameter. This regularized objective is smoother than the original one. According to [46, Eq.(4.8)], we choose \(\sigma = \epsilon /(2\ln m)\), and the Lipschitz constant of \(\nabla \varphi _\sigma \) is \(L_\sigma =\max _{i,j}|P_{i,j}|^2/(4\sigma )\). To avoid the overflow issue, we adopt the shifting technique [3].
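The shifting trick amounts to subtracting the maximum before exponentiating; below is a sketch of the smoothed objective (54) and its gradient under this convention (the function name is ours, not from [3] or [46]).

```python
import numpy as np

def smoothed_objective(P, x, sigma):
    """phi_sigma(x) = sigma * log(sum_j exp(<p_j, x>/sigma)) and its gradient,
    evaluated with the standard log-sum-exp shift to avoid overflow."""
    z = P.T @ x / sigma                      # entries <p_j, x> / sigma
    zmax = z.max()
    w = np.exp(z - zmax)                     # shifted exponentials
    value = sigma * (zmax + np.log(w.sum()))
    grad = P @ (w / w.sum())                 # nabla phi_sigma(x) = P @ softmax(z)
    return value, grad
```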

We then apply UAPD and FGM (with \(\epsilon = 1e\)-6) to the smooth objective (54) and report the numerical outputs in Fig. 5. The optimal value \(\varphi _\sigma ^* \) is obtained by running UAPD with enough iterations. As before, our UAPD is superior to FGM in convergence, line search cost and approximate Lipschitz constant. We also plot the results of the original matrix game problem (see the top row in Fig. 5) and find that, with the smoothing technique, both methods perform better than before.

Fig. 5

Numerical performances of FGM and UAPD on minimizing the regularized objective (54) with \(m = 100,\,n = 400\). The desired accuracy of FGM is \(\epsilon =1e\)-6

6.3 Continuous Steiner Problem

Let us consider one more unconstrained problem

$$\begin{aligned} \min _{x\in \,{{\mathbb {R}}}^n_+}\,f(x) = \sum _{j=1}^{m}\left\Vert {x-a_j} \right\Vert , \end{aligned}$$
(55)

where \(a_j\in \,{{\mathbb {R}}}^n\) denotes a given location. Note that the objective is actually quite smooth far away from each \(a_j\). We generate \(a_j\) from the uniform distribution and run UAPD with enough iterations to obtain an approximate optimal value \(f^*\). Numerical results in Fig. 6 show that both FGM (with \(\epsilon = 1e\)-8) and UAPD work well and possess similar convergence behaviors. Moreover, as \(\nabla f\) is almost Lipschitz continuous and the magnitude of the Lipschitz constant is not large, the over-estimation issue of FGM is negligible, and its approximate constant \(M_k\) is the same as that of our UAPD.
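For completeness, the objective (55) and one of its subgradients can be evaluated as follows; a small helper sketch, where locations is assumed to stack the points \(a_j\) row-wise.

```python
import numpy as np

def steiner_objective_and_subgrad(x, locations, eps=1e-12):
    """f(x) = sum_j ||x - a_j||; a subgradient is sum_j (x - a_j)/||x - a_j||
    (any unit vector may be used at points where x = a_j)."""
    diffs = x[None, :] - locations                       # shape (m, n)
    norms = np.linalg.norm(diffs, axis=1)
    subgrad = (diffs / np.maximum(norms, eps)[:, None]).sum(axis=0)
    return norms.sum(), subgrad
```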

Fig. 6

Numerical results of FGM and UAPD for solving (55) with \(m = 800,\,n = 400\). The desired accuracy of FGM is \(\epsilon = 1e\)-8

6.4 Basis Pursuit Problem

In the last example, we look at the basis pursuit problem

$$\begin{aligned} \min _{x\in \,{{\mathbb {R}}}^n}\left\Vert {x} \right\Vert _1\quad \mathrm{s.t.}\,Ax=b, \end{aligned}$$

where \(A\in \,{{\mathbb {R}}}^{m\times n}\) and \(b\in \,{{\mathbb {R}}}^m\). To be compatible with the problem setting of AccUniPDGrad [68], consider an equivalent formulation

$$\begin{aligned} \min _{x\in \,{{\mathbb {R}}}^n}\,f(x) = \frac{1}{2}\left\Vert {x} \right\Vert _1^2\quad \mathrm{s.t.}\,Ax=b. \end{aligned}$$
(56)

The dual problem reads as

$$\begin{aligned} \min _{\lambda \in \,{{\mathbb {R}}}^{m}}\,\left\{ \varphi (\lambda ): = \left\langle {b,\lambda } \right\rangle +\frac{1}{2} \big \Vert A^\top \lambda \big \Vert _\infty ^2\right\} . \end{aligned}$$

Note that the existing accelerated methods in [29, 39, 65] can be applied to solving (56) with theoretical rate \(\mathcal O(1/k)\). However, we only focus on the comparison between UAPD and AccUniPDGrad, which are black-box type methods with a line search procedure. We mention that AccUniPDGrad also uses the smaller tolerance \(\epsilon /k\), as in FGM, where \(\epsilon >0\) denotes the desired accuracy, and it takes \(M_{k+1} = M_{k,i_k}=2^{i_k}M_k\), which coincides with our choice \(\rho _{_{\!d}}=1\).

To obtain a reasonable approximate minimal value \(f^*\), we run the accelerated primal–dual method proposed in [39] with enough iterations. Numerical results are shown in Fig. 7, which indicate that (i) our UAPD has a smaller objective residual and feasibility violation, (ii) the numbers of line search steps are very close and the Lipschitz constant sequences are nondecreasing (since \(\rho _{_{\!d}}=1\)), and (iii) AccUniPDGrad generates over-estimated \(M_k\) because of its small tolerance \(\epsilon /k\).

Fig. 7

Numerical performances of AccUniPDGrad and UAPD on the basis pursuit problem (56) with \(m = 100,\,n = 400\). The desired accuracy for AccUniPDGrad is \(\epsilon = 1e\)-3