A Universal Accelerated Primal–Dual Method for Convex Optimization Problems

Luo, Hao

doi:10.1007/s10957-024-02394-6

A Universal Accelerated Primal–Dual Method for Convex Optimization Problems

Published: 01 March 2024

Volume 201, pages 280–312, (2024)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Journal of Optimization Theory and Applications Aims and scope Submit manuscript

A Universal Accelerated Primal–Dual Method for Convex Optimization Problems

Download PDF

Hao Luo ORCID: orcid.org/0000-0001-9846-5399^1,2

305 Accesses
Explore all metrics

Abstract

This work presents a universal accelerated primal–dual method for affinely constrained convex optimization problems. It can handle both Lipschitz and Hölder gradients but does not need to know the smoothness level of the objective function. In line search part, it uses dynamically decreasing parameters and produces approximate Lipschitz constant with moderate magnitude. In addition, based on a suitable discrete Lyapunov function and tight decay estimates of some differential/difference inequalities, a universal optimal mixed-type convergence rate is established. Some numerical tests are provided to confirm the efficiency of the proposed method.

A fast dual proximal-gradient method for separable convex optimization with linear coupled constraints

Article 02 March 2016

Accelerated Primal-Dual Gradient Descent with Linesearch for Convex, Nonconvex, and Nonsmooth Optimization Problems

Article 01 March 2019

Two Convergent Primal–Dual Hybrid Gradient Type Methods for Convex Programming with Linear Constraints

Article 18 April 2023

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Consider the minimization problem

$$\begin{aligned} \min _{x\in Q\cap \varOmega }\, f(x) := h(x) + g(x), \end{aligned}$$

(1)

where $Q\subset \,{{\mathbb {R}}}^n$ is a simple closed convex subset, $\varOmega : = \left\{ x\in \,{{\mathbb {R}}}^n:\,Ax = b\right\} $ is an affine set with $A\in \,{{\mathbb {R}}}^{m\times n}$ and $b\in \,{{\mathbb {R}}}^m$, and $f:\,{{\mathbb {R}}}^n\rightarrow \,{{\mathbb {R}}}\cup \{+\infty \}$ is properly closed and convex, with smooth part h and nonsmooth (simple) part g. The model problem (1) arises from many practical applications, such as compressed sensing [6], image processing [8] and decentralized distributed optimization [4].

In the literature, existing algorithms for solving (1) mainly include Bregman iteration [5, 29, 67], quadratic penalty method [35], augmented Lagrangian method (ALM) [26,27,28, 33, 41, 56], and alternating direction method of multipliers [22, 36, 40, 50, 53, 57, 60, 61, 65]. Generally speaking, these methods have sublinear rate $\mathcal O(1/k)$ for convex problems and can be further accelerated to ${\mathcal {O}}(1/k^2)$ for (partially) strongly convex objectives. We also note that primal–dual methods [7, 20, 25, 31, 58, 59, 62] and operator splitting schemes [13, 18, 19] can be applied to (1) with two-block structure.

However, among these works, it is rare to see the optimal mixed-type convergence rate, i.e., the lower complexity bound [51]

$$\begin{aligned} {\mathcal {O}}\left( \min \left\{ \frac{\left\Vert {A} \right\Vert }{\epsilon },\,\frac{\left\Vert {A} \right\Vert }{\sqrt{\mu \epsilon }}\right\} + \min \left\{ \sqrt{L/\epsilon },\,\sqrt{L/\mu }\cdot \left|{\ln \epsilon } \right|\right\} \right) , \end{aligned}$$

(2)

where $\mu \ge 0$ is the convexity parameter of f and L is the Lipschitz constant of $\nabla h$. Both Nesterov’s smoothing technique [46] and the accelerated primal–dual method in [12] achieve the lower bound for convex case $\mu =0$. The inexact ALM framework in [66] possesses the optimal complexity (2) but involves a subroutine for inexactly solving the subproblem.

We mention that the second part of (2) corresponding to L and $\mu $ agrees with the well-known lower complexity bound of first-order methods for unconstrained convex problems with Lipschitz gradients, namely (the affine set $\varOmega $ is the entire space $\,{{\mathbb {R}}}^n$)

$$\begin{aligned} \min _{x\in Q}\, f(x) := h(x) + g(x). \end{aligned}$$

(3)

The intermediate non-Lipschitz case is also of interest to be considered [43, 45]. Particularly, when $\nabla h$ is Hölder continuous (cf.(11)) with exponent $\nu \in [0,1)$, Nesterov [48] presented a universal fast gradient method (FGM) for solving (3) that did not require à priori knowledge of the smoothness parameter $\nu $ and the Hölderian constant $M_\nu (h)$. A key ingredient of FGM is that Hölderian gradients can be recast into the standard Lipschitz case but with inexact computations [15, 54, 55], and it achieves the optimal complexity [44]

$$\begin{aligned} {\mathcal {O}} \left( \frac{[M_\nu (h)]^\frac{2}{1+3\nu }}{\epsilon ^\frac{2}{1+3\nu }}\right) . \end{aligned}$$

(4)

More extensions of FGM can be found in [23, 24, 32].

The dual problem of (1) reads equivalently as

$$\begin{aligned} \min _{\lambda \in \,{{\mathbb {R}}}^{m}}\,\left\{ d(\lambda ): = \left\langle {b,\lambda } \right\rangle +\max _{x\in Q}\left\{ -f(x)- \left\langle {A^\top \lambda ,x} \right\rangle \right\} \right\} . \end{aligned}$$

(5)

If f is uniformly convex of degree $p\ge 2$ (see [48, Definition 1]), then $\nabla d$ is Hölder continuous with exponent $\nu = 1/(p-1)$ (cf. [48, Lemma 1]). The methods in [16, 37] work for (5) with strongly convex objective f, i.e., the Lipschitzian case ($\nu = 1$). Yurtsever et al. [68] proposed an accelerated universal primal–dual gradient method (AccUniPDGrad) for general Hölderian case ($\nu <1$) and established the complexity bound (4) for the objective residual and the feasibility violation, with $M_\nu (h)$ being replaced by $M_\nu (d)$. Similarly with the spirit of FGM, the presented method utilizes the “inexactness” property of $\nabla d$ and applies FISTA [2] to (5) with line search.

In this work, we propose a universal accelerated primal–dual method (see Algorithm 1) for solving (1). Compared with existing works, the main contributions are highlighted as follows:

It is first-order black-box type for both Lipschitz and Hölder cases but does not need to know the smoothness level priorly.
It uses the Bregman divergence and can handle the non-Euclidean setting.
In line search part, it adopts dynamically decreasing tolerance while FGM [48] and AccUniPDGrad [68] use the desired fixed accuracy.
By using the tool of Lyapunov function and tight decay estimates of some differential/difference inequalities, we prove the universal mixed-type estimate that achieves the optimal complexity (including (2),(4) as special cases).

We also provide some numerical tests to validate the practical performance. It is confirmed that: (i) a proper choice of Bregman distance is crucial indeed; (ii) our method outperforms FGM and AccUniPDGrad especially for non-Lipschitz problems and smooth problems with large Lipschitz constants, as the automatically decreasing tolerance leads to approximate Lipschitz constants with moderate magnitude.

Our method here is motivated from an implicit-explicit time discretization of a novel accelerated Bregman primal–dual dynamics (see (24)), which is an extension of the accelerated primal–dual flow in [39] to the non-Euclidean case. For unconstrained problems, there are some existing continuous dynamics [34, 63, 64] with Bregman distances. For linearly constrained case, we see an accelerated primal–dual mirror model [69], which is inspired by the accelerated mirror descent [34] and the primal–dual dynamical approach [21] but without time discretizations.

The rest of the paper is organized as follows. In Sect. 2, we provide some preliminaries including the Bregman divergence and the Hölder continuity. Then, the main algorithm together with its universal mixed-type estimate is presented in Sect. 3, and rigorous proofs of two technical lemmas are summarized in Sects. 4 and 5, respectively. Finally, some numerical results are reported in Sect. 6.

2 Preliminary

2.1 Notations

Let $\left\langle {\cdot ,\cdot } \right\rangle $ be the usual inner product of vectors and $\left\Vert {\cdot } \right\Vert $ be the standard Euclidean norm (of vectors and matrices). Given a proper function $g:\,{{\mathbb {R}}}^n\rightarrow \,{{\mathbb {R}}}\cup \{+\infty \}$, the effective domain of g is denoted as usual by $ \mathbf{dom\,}g$, and the subdifferential of g at any $x\in \mathbf{dom\,}g$ is the set of all subgradients:

$$\begin{aligned} \partial g(x): = \left\{ \xi \in \,{{\mathbb {R}}}^n:\,g(y)\ge g(x)+\left\langle {\xi ,y-x} \right\rangle \quad \forall \,y\in \,{{\mathbb {R}}}^n\right\} . \end{aligned}$$

Recall that $Q\subset \,{{\mathbb {R}}}^n$ is a nonempty closed convex subset. Let $\iota _Q(\cdot )$ be the indicator function of Q and $N_Q(\cdot ): =\partial \iota _Q(\cdot )$ be its normal cone.

Introduce the Lagrangian for the model problem (1):

$$\begin{aligned} {\mathcal {L}}(x,\lambda ): = f(x) +\iota _Q(x)+ \left\langle {\lambda ,Ax-b} \right\rangle \quad \forall \,(x,\lambda )\in \,{{\mathbb {R}}}^n\times \,{{\mathbb {R}}}^m. \end{aligned}$$

We say $(x^*,\lambda ^*)\in Q\times \,{{\mathbb {R}}}^m$ is a saddle point of ${\mathcal {L}}$ if

$$\begin{aligned} {\mathcal {L}}(x^*,\lambda )\le {\mathcal {L}}(x^*,\lambda ^*)\le \mathcal L(x,\lambda ^*)\quad \forall \,(x,\lambda )\in \,{{\mathbb {R}}}^n\times \,{{\mathbb {R}}}^m, \end{aligned}$$

(6)

which also implies the optimality condition:

$$\begin{aligned} Ax^*-b=0,\quad \partial f(x^*) + N_Q(x^*)+A^\top \lambda ^*\ni 0. \end{aligned}$$

(7)

2.2 Bregman Divergence

Let $\phi :Q\rightarrow \,{{\mathbb {R}}}$ be a smooth prox-function and define the Bregman divergence

$$\begin{aligned} D_\phi (x,y) := \phi (x)-\phi (y)-\left\langle {\nabla \phi (y),x-y} \right\rangle \quad \forall \,x,\,y\in Q. \end{aligned}$$

Throughout, suppose $\phi $ is 1-strongly convex:

$$\begin{aligned} D_\phi (x,y)\ge \frac{1}{2}\left\Vert {x-y} \right\Vert ^2\quad \forall \,x,\,y\in Q. \end{aligned}$$

(8)

Particularly, $\phi (x) = \frac{1}{2}\left\Vert {x} \right\Vert ^2$ leads to $D_\phi (x,y)=D_\phi (y,x)=1/2\left\Vert {x-y} \right\Vert ^2$, which boils down to the standard Euclidean setting. In addition, we have the following three-term identity; see [9, Lemma 3.2] or [17, Lemma 3.3].

Lemma 2.1

( [9, 17]) For any $x,y,z\in Q$, it holds that

$$\begin{aligned} \left\langle {\nabla \phi (x)-\nabla \phi (y),y-z} \right\rangle = D_\phi (z,x) - D_\phi (z,y)- D_\phi (y,x). \end{aligned}$$

(9)

If $\phi (x) = \frac{1}{2}\left\Vert {x} \right\Vert ^2$, then

$$\begin{aligned} 2\left\langle {x-y,y-z} \right\rangle = \left\Vert {x-z} \right\Vert ^2 - \left\Vert {y-z} \right\Vert ^2- \left\Vert {x-y} \right\Vert ^2. \end{aligned}$$

(10)

2.3 Hölder Continuity

Let h be a differentiable function on Q. For $0\le \nu \le 1$, define

$$\begin{aligned} M_\nu (h): = \sup _{\begin{array}{c} x,\,y\in Q\\ x\ne y \end{array}}\frac{\left\Vert {\nabla h(x)-\nabla h(y)} \right\Vert }{\left\Vert {x-y} \right\Vert ^\nu }. \end{aligned}$$

If $M_\nu (h)<\infty $, then $\nabla h$ is Hölder continuous with exponent $\nu $:

$$\begin{aligned} \left\Vert {\nabla h(x)-\nabla h(y)} \right\Vert \le M_\nu (h)\left\Vert {x-y} \right\Vert ^\nu \quad \forall \,x,y\in Q, \end{aligned}$$

(11)

and this also implies that

$$\begin{aligned} h(x)\le h(y)+\left\langle {\nabla h(y),x-y} \right\rangle +\frac{M_\nu (h)}{1+\nu }\left\Vert {x-y} \right\Vert ^{1+\nu } \quad \forall \,x,y\in Q. \end{aligned}$$

(12)

When $\nu = 1$, $M_1(h)$ corresponds to the Lipschitz constant of $\nabla h$, and we also use the conventional notation $L_h = M_1(h)$.

According to [48, Lemma 2], the estimate (12) can be transferred into the usual gradient descent inequality, with “inexact computations”. Based on this, (accelerated) gradient methods can be used to minimize functions with Hölder continuous gradients [15, 54, 55].

Proposition 2.1

([48]) Assume $M_\nu (h)<\infty $ and for $\delta >0$, define

$$\begin{aligned} M(\nu ,\delta ): =\delta ^{\frac{\nu -1}{\nu +1}}[M_\nu (h)]^{\frac{2}{\nu +1}}. \end{aligned}$$

(13)

Then, for any $M\ge M(\nu ,\delta )$, we have

$$\begin{aligned} h(x)\le h(y)+\left\langle {\nabla h(y),x-y} \right\rangle +\frac{M}{2}\left\Vert {x-y} \right\Vert ^2+ \frac{\delta }{2} \quad \forall \,x,y\in Q. \end{aligned}$$

3 Main Algorithm

Throughout, we make the following assumption on $f = h+g$:

Assumption 3.1

The nonsmooth part g is properly closed convex on Q. The smooth part h satisfies $ \inf _{0\le \nu \le 1}M_\nu (h)<\infty $ and is $\mu $-convex on Q with $\mu \ge 0$, namely,

$$\begin{aligned} h(x)\ge h(y)+\left\langle {\nabla h(y),x-y} \right\rangle +\mu D_\phi (x,y) \quad \forall \,x,y\in Q. \end{aligned}$$

Remark 3.1

For some cases, h might not be smooth (in $C^1$ globally) enough but at least Lipschitz continuous. By Rademacher’s theorem [30, Theorem 3.1], Lipschitzian functions are differentiable almost everywhere. Thus, the assumption $M_\nu (h)<\infty $ holds true with $\nu =0$. Besides, in practical computations, we expect that the proximal calculation (cf.(14)) of the nonsmooth part g with respect to proper $D_\phi (\cdot ,\cdot )$ is easy to compute (or has closed solution); see the matrix game problem in Sect. 6.1. $\square $

Our main algorithm, called universal accelerated primal–dual (UAPD) method, is summarized in Algorithm 1, where the subpart “$ \texttt{sub}\text {-}\texttt{UAPD}$” in lines 3 and 6 has been given by Algorithm 2. Note that we do not require priorly the exponent $\nu $ and the smoothness constant $M_\nu (h)$ but perform a line search procedure (see lines 4–7).

3.1 Line Search

In Algorithm 1, the parameter $\rho _{\!u}>1$ enlarges the approximate Lipschitz constant $M_{k,i}$ (cf. line 5) to meet the following descent condition (cf. line 4)

$$\begin{aligned} \begin{aligned} h(x_{k,i})\le h(y_{k,i})+\left\langle {\nabla h(y_{k,i}),x_{k,i}-y_{k,i}} \right\rangle + \frac{M_{k,i}}{2}\left\Vert {x_{k,i}-y_{k,i}} \right\Vert ^2+\frac{\delta _{k,i}}{2}. \end{aligned} \end{aligned}$$

(15)

For each $k\in {\mathbb {N}}$, the line search part will end up with the smallest integer $i_k$ satisfying (15) and call the subpart $\texttt{sub}$-$\texttt{UAPD}$ $i_k+1$ times in total. The extra parameter $\rho _{_{\!d}}\ge 1$ reduces the output constant $M_{k,i_k}$ and updates $M_{k+1} = M_{k,i_k}/\rho _{_{\!d}}$ (cf. line 8).

Remark 3.2

We introduce the pair $(\rho _{\! u},\rho _{_{\!d}})$ in our method for generality. Practically, it is not easy to find the optimal choice. If $\rho _{_{\!d}} = 1$, then $M_k$ is nondecreasing. Otherwise, $M_k$ can be nonmonotone. Consider two situations.

For the standard Lipschitz case $(\nu =1)$, as analyzed in [47, Section 3], if $M_0\le L_h$, then the choice $\rho _{_{\!d}}\ge \rho _{\! u}$ promises that $M_k\le L_h$ for all $k\ge 0$.
For the Hölder continuous case $(\nu <1)$, we have $M_k\rightarrow \infty $ as $k\rightarrow \infty $ (since $M(\nu ,\delta )\rightarrow \infty $ as $\delta \rightarrow 0$). Taking $\rho _{_{\!d}}>1$ may reduce $M_k$ (locally for some k) but will increase the burden on the line search procedure. $\square $

Below, let us show that $i_k$ is finite for $k\in {\mathbb {N}}$. Indeed, we see $M_{k,i} = \rho _{\! u}^{i}M_{k,0}$ increases as i does, and the step size (cf. line 1 of Algorithm 2)

$$\begin{aligned} \alpha _{k,i} =\sqrt{ \frac{\beta _k\gamma _{k}}{\beta _k M_{k,i} + \left\Vert {A} \right\Vert ^2}} \end{aligned}$$

(16)

has to be decreasing. Thus the tolerance (cf. line 2 of Algorithm 2)

$$\begin{aligned} \delta _{k,i} = \frac{1}{k+1}\cdot \frac{\beta _{k}}{1+\alpha _{k,i}} \end{aligned}$$

(17)

is increasing and by (13), $M(\nu ,\delta _{k,i})$ is decreasing. This together with Proposition 2.1 and Assumption 3.1 concludes that

$$\begin{aligned} \begin{aligned} {}&i_k = 0{} & {} \text { if } M(\nu ,\delta _{k,0})\le M_{k,0},\\ {}&i_k\le \lceil s^*\rceil{} & {} \text { else}, \end{aligned} \end{aligned}$$

where $ \lceil s^*\rceil $ denotes the ceiling function (the minimal integer that is no less than $s^*$), and $s^*\in [0,\infty )$ solves the equation $M_{k,s^*} = M(\nu ,\delta _{k,s^*})$; see Fig. 1. In particular, we have

$$\begin{aligned} M_{k,s^*}= \rho _{\!u}^{s^*}M_{k,0}\le M(\nu ,\delta _{k,0})\quad \Longrightarrow \quad s^*\le \log _{ \rho _{\!u}}\frac{M(\nu ,\delta _{k,0})}{M_{k,0}}. \end{aligned}$$

This eventually leads to

$$\begin{aligned} i_k\le \max \left\{ 0,\,\bigg \lceil \log _{ \rho _{\!u}}\frac{M(\nu ,\delta _{k,0})}{M_{k,0}}\bigg \rceil \right\} <\infty . \end{aligned}$$

Remark 3.3

In the line search part, our Algorithm 1 adopts dynamically decreasing tolerance (17). However, the methods in [48] and [68, Algorithm 2] chose $\delta _k=\epsilon /k$, where $\epsilon $ is the desired accuracy. Hence, by Proposition 2.1, the approximate smoothness constant $M_k$ of our method is smaller than the other two methods, especially for the Hölderian case. This will be verified by numerical experiments in Sect. 6. $\square $

Below, we give an upper bound of $M_k$ and the total number of line search steps. By Theorem 3.1, $\beta _k$ corresponds to the convergence rate of Algorithm 1 and admits explicit decay estimate (see Lemma 3.3). If the desired accuracy $\beta _{k+1}={\mathcal {O}}(\epsilon )$ is given, then the term $\left|{\log _{\rho _{\!u}}\beta _{k+1}} \right|$ in (19) can be replaced by $\left|{\log _{\rho _{\!u}}\epsilon } \right|$.

Lemma 3.1

For any $k\in {\mathbb {N}}$, we have

$$\begin{aligned} M_{k+1}\le \frac{1}{\rho _{_{\!d}}}\max \left\{ \frac{M_{0}}{\rho _{_{\!d}}^{k}},\,\sqrt{\rho _{\! u}} \rho _{\! u}\cdot M(\nu ,\delta _{k+1})\right\} , \end{aligned}$$

(18)

and consequently, it holds that

$$\begin{aligned} \small \sum _{j=0}^{k}i_j\le \frac{3}{2}+k\,\frac{\ln \rho _{_{\!d}}}{\ln \rho _{\!u}}+ \frac{2}{1+\nu }\left|{ \log _{\rho _{\!u}}\frac{M_\nu (h)}{M_0}} \right|+ \frac{1-\nu }{1+\nu }\Big [ \log _{\rho _{\!u}}(k+1)+\left|{\log _{\rho _{\!u}}\beta _{k+1}} \right|\Big ]. \end{aligned}$$

(19)

Proof

See Sect. Appendix A. $\square $

3.2 Time Discretization Interpretation

Below, we provide a time discretization interpretation of Algorithm 1. Given the k-th iterations $(x_k,v_k,\lambda _k)$ and the parameters $(\gamma _k,\beta _k,M_{k})$, the line search procedure returns $(y_k,x_{k+1},v_{k+1},\lambda _{k+1})$ that satisfy

(20a)

(20b)

(20c)

(20d)

where ${\mathcal {G}}(y_k,v_{k+1},\lambda _k):=\nabla h(y_k)+\partial g(v_{k+1})+N_Q(v_{k+1})+A^\top \widetilde{\lambda }_k$ with $\widetilde{\lambda }_k= \lambda _k+\alpha _k/\beta _k(Av_k-b)$, and the step size $\alpha _k$ solves (cf.(16))

$$\begin{aligned} \alpha _k^2(\rho _{_{\!d}}\beta _{k}M_{k+1} +\left\Vert {A} \right\Vert ^2)= \gamma _k\beta _k, \end{aligned}$$

(21)

where we used the relation $M_{k,i_k} = \rho _{_{\!d}} M_{k+1}$. Besides, $y_k$ and $x_{k+1}$ fulfill (cf. (15))

$$\begin{aligned} h(x_{k+1})\le h(y_k)+\left\langle {\nabla h(y_k),x_{k+1}-y_k} \right\rangle +\frac{\rho _{_{\!d}}M_{k+1}}{2}\left\Vert {x_{k+1}-y_k} \right\Vert ^2+\frac{\delta _{k+1}}{2}, \end{aligned}$$

(22)

and the parameters $(\gamma _{k+1},\beta _{k+1})$ are governed by (cf. line 9 of Algorithm 1)

$$\begin{aligned} \frac{\gamma _{k+1}-\gamma _k}{\alpha _k} = {}\mu -\gamma _{k+1},\quad \frac{\beta _{k+1}-\beta _k}{\alpha _k} = -\beta _{k+1}, \end{aligned}$$

(23)

with $\beta _0 = 1$ and $\gamma _0>0$. For clarity, in Sect. Appendix B, we give a detailed derivation of the reformulation (20) from Algorithms 1 and 2.

As one can see, $y_k$ in (20a) is an intermediate which provides a “prediction”, and then the “correction” step (20c) is used to update $x_{k+1}$. From (20a), (20b), and (20c), it is not hard to find that $y_k,\,v_{k+1},\,x_{k+1}\in Q$, as long as $x_k,v_k\in Q$. Therefore, with $x_0,v_0\in Q$, it holds that $\{x_k,y_k,v_k\}_{k\in {\mathbb {N}}}\subset Q$.

Furthermore, we mention that the reformulation (20) itself admits an implicit-explicit time discretization for the following primal–dual dynamics:

$$\begin{aligned} \begin{aligned} {}&x' = v-x,\\ {}&\gamma \frac{\,\textrm{d}}{\,\textrm{d}t} \nabla \phi (v){}\in \mu \big [\nabla \phi (x)-\nabla \phi (v)\big ]- \partial f(x)-N_Q(x)-A^\top \lambda ,\\ {}&\beta \lambda ' {}= Av-b, \end{aligned} \end{aligned}$$

(24)

where $\gamma $ and $\beta $ are governed by continuous analogues to (23):

$$\begin{aligned} \gamma ' = \mu -\gamma ,\quad \beta ' =-\beta . \end{aligned}$$

(25)

We call (24) the Accelerated Bregman Primal–Dual (ABPD) flow. For $\phi (x) = 1/2\left\Vert {x} \right\Vert ^2$, it amounts to the accelerated primal–dual flow in [39].

3.3 A Universal Estimate

Let $\{(x_k,v_k,\lambda _k,\gamma _k,\beta _k)\}_{k\in {\mathbb {N}}}$ be the sequence generated from Algorithm 1. We introduce the discrete Lyapunov function

$$\begin{aligned} {\mathcal {E}}_k: = {\mathcal {L}}\left( x_{k},\lambda ^*\right) -{\mathcal {L}}\left( x^*,\lambda _k\right) + \gamma _kD_\phi (x^*,v_k)+\frac{\beta _k}{2}\left\Vert {\lambda _k-\lambda ^*} \right\Vert ^2. \end{aligned}$$

(26)

A one-step estimate is presented below.

Lemma 3.2

Under Assumption 3.1, we have

$$\begin{aligned} {\mathcal {E}}_{k+1}- {\mathcal {E}}_{k}\le -\alpha _k \mathcal E_{k+1}+\frac{\delta _{k+1}}{2}(1+\alpha _k)\quad \forall \,k\in \mathbb N. \end{aligned}$$

(27)

Proof

See Sect. 4. $\square $

Using this lemma, we obtain the following theorem, which says the final rate is given by the sharp decay estimate of the sequence $\{\beta _k\}_{k\in {\mathbb {N}}}$; see Lemma 3.3.

Theorem 3.1

Under Assumption 3.1, we have $\{x_k,v_k\}_{k\in {\mathbb {N}}}\subset Q$ and

$$\begin{aligned} \left\Vert {Ax_k-b} \right\Vert \le {}&\beta _k{\mathcal {T}}_{k}, \end{aligned}$$

(28)

$$\begin{aligned} \left|{f(x_k)-f(x^*)} \right|\le {}&\beta _k{\mathcal {W}}_{k}, \end{aligned}$$

(29)

$$\begin{aligned} {\mathcal {L}}\left( x_{k},\lambda ^*\right) -{\mathcal {L}}\left( x^*,\lambda _k\right) \le {}&\beta _k{\mathcal {R}}_{k}, \end{aligned}$$

(30)

for all $k\in {\mathbb {N}}$, where ${\mathcal {R}}_{k}:={\mathcal {E}}_0+\ln (k+1),\,{\mathcal {T}}_{k}:=\left\Vert {Ax_0-b} \right\Vert +2\sqrt{2{\mathcal {R}}_{k}}$ and ${\mathcal {W}}_{k}:={\mathcal {R}}_{k}+\left\Vert {\lambda ^*} \right\Vert {\mathcal {T}}_{k}$. Moreover, we have

$$\begin{aligned} \gamma _{\min }\left\Vert {v_{k}-x^*} \right\Vert ^2 + \mu \left\Vert {x_k-x^*} \right\Vert ^2\le 4\beta _k{\mathcal {R}}_{k}, \end{aligned}$$

(31)

where $\gamma _{\min }:= \min \{\gamma _0,\mu \}\ge 0$.

Proof

From (23) and the contraction estimate (27) follows immediately that

$$\begin{aligned} {\mathcal {E}}_{k+1}\le \frac{1}{1+\alpha _k}\mathcal E_k+\frac{\delta _{k+1}}{2}\quad \Longrightarrow \quad \mathcal E_k\le \beta _k\mathcal E_0+\frac{\beta _k}{2}\sum _{i=0}^{k-1}\frac{\delta _{i+1}}{\beta _{i+1}}. \end{aligned}$$

By (17,23), we have

$$\begin{aligned} \delta _{k+1} =\frac{1}{k+1}\cdot \frac{\beta _k}{1+\alpha _k}=\frac{\beta _{k+1}}{k+1}, \end{aligned}$$

which further implies

$$\begin{aligned} {\mathcal {E}}_k\le \beta _k\mathcal E_0+\frac{\beta _k}{2}\sum _{i=0}^{k-1}\frac{1}{i+1} \le \beta _k\left[ {\mathcal {E}}_0+\ln (k+1)\right] =\beta _k\mathcal R_k, \end{aligned}$$

(32)

which proves (30). In view of (23), it holds that $\gamma _{k+1} = (\gamma _k + \mu \alpha _k)/(1+\alpha _k)\ge \gamma _{\min }$. This together with (32) gives

$$\begin{aligned} \gamma _{\min }D_\phi (x^*,v_k)\le \gamma _kD_\phi (x^*,v_k)\le {\mathcal {E}}_k\le \beta _k{\mathcal {R}}_k. \end{aligned}$$

It is clear that ${\mathcal {L}}(\cdot ,\lambda )$ is convex and by (7,30,3.1),

$$\begin{aligned} \mu D_{\phi }(x_k,x^*)\le {\mathcal {L}}\left( x_{k},\lambda ^*\right) -{\mathcal {L}}\left( x^*,\lambda ^*\right) \le \beta _k{\mathcal {R}}_k. \end{aligned}$$

Hence, combining the above two estimates with (8) leads to (31).

Following [39, Theorem 3.1], we can establish (28,29). For the sake of completeness, we provide the details as below. Thanks to (20c) and (20d), we have

$$\begin{aligned} \begin{aligned} \lambda _{k+1} - \lambda _{k} ={}&\frac{\alpha _k}{\beta _k}(Av_{k+1}-b) = \frac{\alpha _k}{\beta _k}\left[ A\left( x_{k+1}+\frac{x_{k+1}-x_{k}}{\alpha _k}\right) -b\right] \\ = {}&\frac{\alpha _k}{\beta _k}\left[ \frac{1+\alpha _k}{\alpha _k}\left( Ax_{k+1} - b\right) -\frac{1}{\alpha _k}\left( Ax_k-b\right) \right] \\ = {}&\frac{1}{\beta _{k+1}}(Ax_{k+1}-b) - \frac{1}{\beta _k}(Ax_k-b), \end{aligned} \end{aligned}$$

which yields that

$$\begin{aligned} \lambda _{k}-\lambda _0 = \frac{1}{\beta _{k}}(Ax_{k}-b) - (Ax_0-b). \end{aligned}$$

By (32), we obtain $\left\Vert {\lambda _k-\lambda ^*} \right\Vert ^2\le 2{\mathcal {R}}_k$ and

$$\begin{aligned} \begin{aligned} {}&\left\Vert {Ax_k-b} \right\Vert = \beta _k\left\Vert {\lambda _k-\lambda _0+Ax_0-b} \right\Vert \\&\quad \le {}\beta _k\left\Vert {\lambda _k-\lambda ^*} \right\Vert +\beta _k\left\Vert {\lambda _0-\lambda ^*} \right\Vert +\beta _k\left\Vert {Ax_0-b} \right\Vert \\&\quad \le {}\beta _k\sqrt{2\mathcal R_{k}}+\beta _k\left\Vert {\lambda _0-\lambda ^*} \right\Vert +\beta _k\left\Vert {Ax_0-b} \right\Vert . \end{aligned} \end{aligned}$$

In view of ${\mathcal {R}}_k={\mathcal {E}}_0+\ln (k+1)$ and $\beta _0=1$, we find that $\left\Vert {\lambda _0-\lambda ^*} \right\Vert ^2\le 2{\mathcal {R}}_k$. Plugging this into the above estimate gives (28). Observing (30), it follows that

$$\begin{aligned} -\left\langle {\lambda ^*,Ax_k-b} \right\rangle \le f(x_k)-f(x^*)\le \beta _k\mathcal R_k-\left\langle {\lambda ^*,Ax_k-b} \right\rangle , \end{aligned}$$

which promises

$$\begin{aligned} \left|{ f(x_k)-f(x^*)} \right| \le \beta _k\mathcal R_k+\left\Vert {\lambda ^*} \right\Vert \left\Vert {Ax_k-b} \right\Vert \le \beta _k{\mathcal {W}}_k. \end{aligned}$$

Consequently, this proves (29) and concludes the proof of this theorem. $\square $

Remark 3.4

Note that the choice (17) can be replaced with

$$\begin{aligned} \delta _{k,i} = \frac{\delta }{k+1}\cdot \frac{\beta _{k}}{1+\alpha _{k,i}},\quad \delta >0. \end{aligned}$$

Then the quantity ${\mathcal {R}}_{k} ={\mathcal {E}}_0+\ln (k+1)$ in Theorem 3.1 becomes $ {\mathcal {R}}_{k}=\mathcal E_0+\delta \ln (k+1)$. Taking $\delta = 1/\ln (K+1)$ cancels the logarithm factor, where $K\in {\mathbb {N}}$ is the maximal number of iterations chosen in advance. $\square $

It remains to investigate the decay rate of $\{\beta _k\}_{k\in {\mathbb {N}}}$. From (21,23), we obtain

$$\begin{aligned} \beta _{k+1}-\beta _{k} = -\frac{\sqrt{\gamma _{k}\beta _k}\beta _{k+1}}{\sqrt{\rho _{_{\!d}}\beta _kM_{k+1} +\left\Vert {A} \right\Vert ^2}}. \end{aligned}$$

(33)

A careful investigation into this difference equation gives the desired result, which involves the following quantity

$$\begin{aligned} \varDelta := \max \left\{ M_0,\,\sqrt{\rho _{\! u}} \rho _{\! u}\left[ M_\nu (h)\right] ^{\frac{2}{1+\nu }}\right\} . \end{aligned}$$

(34)

Lemma 3.3

Assume that $\max \{\gamma _0,\mu \}\le \left\Vert {A} \right\Vert ^2$, then

$$\begin{aligned} \beta _k\le \frac{4\left\Vert {A} \right\Vert }{\sqrt{\gamma _0}k} + \frac{(16\sqrt{2})^{1+\nu }\varDelta ^{\frac{1+\nu }{2}}}{\gamma _0^{\frac{1+\nu }{2}}k^{\frac{1+3\nu }{2}}}\quad \forall \,k\ge 1, \end{aligned}$$

(35)

and moreover, we have

$$\begin{aligned} \beta _k\le \left\{ \begin{aligned} {}&\frac{64\left\Vert {A} \right\Vert ^2}{\gamma _{\min } k^2+\left\Vert {A} \right\Vert ^2}+ \exp \left( -\frac{k}{8}\sqrt{\frac{\gamma _{\min }}{\varDelta }}\right){} & {} \text { if }\nu =1,\\ {}&\frac{64\left\Vert {A} \right\Vert ^2}{\gamma _{\min } k^2+\left\Vert {A} \right\Vert ^2} + \left( 1+\frac{1-\nu }{32}\sqrt{\frac{\gamma _{\min }}{2^ {\frac{1-\nu }{1+\nu }}\varDelta }}\,k^{\frac{1+3\nu }{2+2\nu }}\right) ^ {-\frac{2+2\nu }{1-\nu }}{} & {} \text { if }\nu <1, \end{aligned} \right. \end{aligned}$$

(36)

where $\gamma _{\min } = \min \{\gamma _0,\mu \}\ge 0$.

Proof

Since $M(\nu ,\delta _{k+1}) = \delta _{k+1}^{\frac{\nu -1}{\nu +1}}[M_\nu (h)]^{\frac{2}{\nu +1}}$ and $\delta _{k+1} = \beta _{k+1}/(k+1)\le 1$, by Lemma 3.1, we have

$$\begin{aligned} \begin{aligned} \rho _{_{\!d}} M_{k+1}\le {}&\max \left\{ \frac{M_{0}}{\rho _{_{\!d}}^{k}},\,\sqrt{\rho _{\! u}} \rho _{\! u} M(\nu ,\delta _{k+1})\right\} \\ ={}&\max \left\{ \frac{M_{0}}{\rho _{_{\!d}}^{k}},\,\sqrt{\rho _{\! u}} \rho _{\! u}[M_\nu (h)]^{\frac{2}{\nu +1}} \delta _{k+1}^{\frac{\nu -1}{\nu +1}}\right\} \le {}\varDelta \cdot \delta _{k+1}^{\frac{\nu -1}{\nu +1}}. \end{aligned} \end{aligned}$$

Plugging this into (33) gives

$$\begin{aligned} \beta _{k+1}-\beta _{k} \le -\frac{\sqrt{\gamma _{k}\beta _k}\beta _{k+1}}{\sqrt{\varDelta \cdot \beta _k\delta _{k+1}^{\frac{\nu -1}{\nu +1}} +\left\Vert {A} \right\Vert ^2}}. \end{aligned}$$

(37)

From this we obtain (35,36). Missing proofs are provided in Sect. 5. $\square $

Remark 3.5

Both two estimates (35,36) hold for $\mu \ge 0$. In addition, for the limiting case $\nu \rightarrow 1-$, we have

$$\begin{aligned} \lim \limits _{\nu \rightarrow 1-} \left( 1+\frac{1-\nu }{32}\sqrt{\frac{\gamma _{\min }}{2^{\frac{1-\nu }{1+\nu }}\varDelta }}\,k^{\frac{1+3\nu }{2+2\nu }}\right) ^{-\frac{2+2\nu }{1-\nu }}= \exp \left( -\frac{k}{8}\sqrt{\frac{\gamma _{\min }}{\varDelta }}\right) , \end{aligned}$$

which matches the case $\nu = 1$. $\square $

By the universal mixed-type estimate in Lemma 3.3, our Algorithm 1 achieves the lower complexity bound for both the affinely constrained case (i.e., $A \ne 0_{m\times n}$) and the unconstrained case $\varOmega =\,{{\mathbb {R}}}^n$ (i.e., $A = 0_{m\times n}$ and $b = 0_{m}$), with Hölderian smoothness exponent $\nu \in [0,1]$. Detailed comparisons with existing results are summarized in order.

Remark 3.6

Consider the unconstrained case $\varOmega =\,{{\mathbb {R}}}^n$ (i.e.,$A = 0_{m\times n}$ and $b = 0_{m}$).

The Lipschitzian case $\nu = 1$: From (29), it follows that (neglecting the logarithm factor $\ln (k+1)$)
$$\begin{aligned} f(x_k) - f^* \le \beta _k{\mathcal {E}}_0, \end{aligned}$$
where by Lemma 3.3, we have (taking $\gamma _0>\mu $)
$$\begin{aligned} \beta _k\le \min \left\{ \frac{2\cdot 16^2\varDelta }{\gamma _0k^2},\,\exp \left( -\frac{k}{8}\sqrt{\frac{\mu }{\varDelta }}\right) \right\} . \end{aligned}$$
Hence, to achieve the accuracy $f(x_k)-f^*\le \epsilon $, the iteration complexity is no more than (recalling (34) and assuming $\varDelta \sim L_h$)
$$\begin{aligned} \mathcal O\left( \min \left\{ \sqrt{\frac{L_h}{\epsilon }},\, \sqrt{\frac{L_h}{\mu }}\cdot \left|{\ln \epsilon } \right|\right\} \right) . \end{aligned}$$
This is the well-known lower bound (cf. [45, 49]) of first-order methods for smooth convex functions with Lipschitzian gradients; see [10, 11, 38, 42, 47].
The Hölderian case $0\le \nu < 1$: Similarly with the above analysis, the iteration complexity for $f(x_k)-f^*\le \epsilon $ is bounded by
$$\begin{aligned} {\mathcal {O}}\left( \min \left\{ \left( \frac{M_\nu (h)}{\epsilon }\right) ^\frac{2}{1+3\nu },\quad \left( \frac{M_\nu (h)}{\mu }\right) ^{\frac{2}{1+3\nu }}\cdot \left( \frac{\mu }{\epsilon }\right) ^{\frac{1-\nu }{1+3\nu }}\right\} \right) . \end{aligned}$$
(38)
This matches the lower bound in [43, 44]. The convex case $\mu = 0$ has been obtained by the methods in [32, 43, 48], and the restarted schemes in [32, 52] attained the complexity bound for $\mu >0$. Besides, Guminov et al. [24] obtained (38) for nonconvex problems, with an extra one-dimensional line search step. $\square $

Remark 3.7

Then, let us focus on the affine constraint case (i.e., $A \ne 0_{m\times n}$).

The Lipschitzian case $\nu = 1$: By (28,29), to achieve $\left|{f(x_k)-f^*} \right|\le \epsilon $ and $\left\Vert {Ax_k-b} \right\Vert \le \epsilon $, the iteration complexity (neglecting the logarithm factor $\ln (k+1)$) is
$$\begin{aligned} {\mathcal {O}}\left( \min \left\{ \frac{\left\Vert {A} \right\Vert }{\epsilon }+\sqrt{\frac{L_h}{\epsilon }},\quad \frac{\left\Vert {A} \right\Vert }{\sqrt{\mu \epsilon }}+\sqrt{\frac{L_h}{\mu }}\cdot \left|{\ln \epsilon } \right| \right\} \right) . \end{aligned}$$
(39)
This coincides with the lower complexity bound in [51]. The methods in [12, 46, 66] achieved the bound for convex case $\mu =0$, and the strongly convex case $\mu >0$ can be found in [66].
The Hölderian case $0\le \nu < 1$:
$$\begin{aligned} {\mathcal {O}}\left( \min \left\{ \frac{\left\Vert {A} \right\Vert }{\epsilon }+\left( \frac{M_\nu (h)}{\epsilon }\right) ^\frac{2}{1+3\nu },\quad \frac{\left\Vert {A} \right\Vert }{\sqrt{\mu \epsilon }}+ \left( \frac{M_\nu (h)}{\mu }\right) ^{\frac{2}{1+3\nu }}\cdot \left( \frac{\mu }{\epsilon }\right) ^{\frac{1-\nu }{1+3\nu }} \right\} \right) . \end{aligned}$$
Similarly with (39), this universal mixed-type estimate has optimal dependence on $\left\Vert {A} \right\Vert $ (corresponding to the affine constraint), and the remainder agrees with (38), which is optimal with respect to $\mu $ and $M_\nu (h)$ (related to the objective). $\square $

4 Proof of Lemma 3.2

Start from the difference $ {\mathcal {E}}_{k+1}-{\mathcal {E}}_{k} = {\mathbb {I}}_1+{\mathbb {I}}_2+{\mathbb {I}}_3$, where

$$\begin{aligned} \begin{aligned} {\mathbb {I}}_1:={}&{\mathcal {L}}\left( x_{k+1},\lambda ^*\right) -{\mathcal {L}}\left( x_{k},\lambda ^*\right) ,\\ {\mathbb {I}}_2:= {}&\gamma _{k+1} D_\phi (x^*,v_{k+1})- \gamma _{k} D_\phi (x^*,v_k),\\ {\mathbb {I}}_3:={}&\frac{\beta _{k+1}}{2} \left\Vert {\lambda _{k+1}-\lambda ^*} \right\Vert ^2 - \frac{\beta _k}{2} \left\Vert {\lambda _{k}-\lambda ^*} \right\Vert ^2. \end{aligned} \end{aligned}$$

Notice that $x_k,\,x_{k+1}\in Q$ and the first term is easy to handle:

$$\begin{aligned} {\mathbb {I}}_1 = f(x_{k+1})-f(x_k)+\langle \lambda ^*,A(x_{k+1}-x_k) \rangle . \end{aligned}$$

(40)

We give the estimate of ${\mathbb {I}}_2$ in Sect. 4.1 and finish the proof of (27) in Sect. 4.2.

4.1 Estimate of ${\mathbb {I}}_2$

Invoking the three-term identity (9) and the difference equation of $\{\gamma _k\}_{k\in {\mathbb {N}}}$ in (23), we split the second term ${\mathbb {I}}_2$ as follows

$$\begin{aligned} \begin{aligned} \mathbb I_2={}&(\gamma _{k+1}-\gamma _{k})D_\phi (x^*,v_{k+1})+\gamma _{k}\left[ D_\phi (x^*,v_{k+1})- D_\phi (x^*,v_k) \right] \\ = {}&\alpha _k(\mu -\gamma _{k+1})D_\phi (x^*,v_{k+1}) -\gamma _{k}D_\phi (v_{k+1},v_k)\\ {}&\quad +\gamma _{k}\left\langle {\nabla \phi (v_{k+1})-\nabla \phi (v_k),v_{k+1}-x^*} \right\rangle . \end{aligned} \end{aligned}$$

(41)

Let us prove

$$\begin{aligned} \begin{aligned} {}&\mu \alpha _k D_\phi (x^*,v_{k+1}) +\gamma _{k}\left\langle {\nabla \phi (v_{k+1})-\nabla \phi (v_k),v_{k+1}-x^*} \right\rangle \\ \le {}&h(x_k)-h(y_k) - \alpha _k\left[ h(y_k)-h(x^*)+\big \langle \widetilde{\lambda }_k, Av_{k+1} -b\big \rangle \right] \\ {}&\quad -\alpha _k\left[ g(v_{k+1})-g(x^*)+\left\langle {\nabla h(y_k),v_{k+1} - v_k} \right\rangle \right] , \end{aligned} \end{aligned}$$

(42)

which leads to the desired estimate of ${\mathbb {I}}_2$:

$$\begin{aligned} \begin{aligned} {\mathbb {I}}_2\le&-\alpha _k \gamma _{k+1}D_\phi (x^*,v_{k+1}) -\gamma _{k}D_\phi (v_{k+1},v_k)-\alpha _k \big \langle \widetilde{\lambda }_k, Av_{k+1} -b\big \rangle \\&\quad -\alpha _k\left[ g(v_{k+1})-g(x^*)+h(y_k)-h(x^*)\right] \\&\quad \quad +h(x_k)-h(y_k) -\alpha _k\left\langle {\nabla h(y_k),v_{k+1} - v_k} \right\rangle . \end{aligned} \end{aligned}$$

(43)

To do this, define $\zeta _{k+1}$ by that

$$\begin{aligned} \begin{aligned} {}&\gamma _k\big [\nabla \phi (v_{k+1})-\nabla \phi (v_k)\big ] \\ ={}&\mu \alpha _k\big [\nabla \phi (y_k)-\nabla \phi (v_{k+1})\big ] -\alpha _k\big [\nabla h(y_k)+\zeta _{k+1}+A^\top \widetilde{\lambda }_k\big ]. \end{aligned} \end{aligned}$$

(44)

Observing (20b), it follows that $\zeta _{k+1}\in \partial g(v_{k+1})+N_Q(v_{k+1})$ and

$$\begin{aligned} \begin{aligned} {}&-\alpha _k\big \langle \zeta _{k+1}, v_{k+1} - x^{*}\big \rangle \le -\alpha _k\left[ g(v_{k+1})-g(x^*)\right] . \end{aligned} \end{aligned}$$

Thanks to (9), we have the decomposition

$$\begin{aligned} \begin{aligned} {}&\mu \alpha _k \left\langle {\nabla \phi (y_k)-\nabla \phi (v_{k+1}), v_{k+1} - x^{*}} \right\rangle \\ ={}&\mu \alpha _k \big [D_\phi (x^*,y_k) -D_\phi (x^*,v_{k+1}) -D_\phi (v_{k+1},y_k)\big ], \end{aligned} \end{aligned}$$

and invoking (20a) leads to

$$\begin{aligned}&-\alpha _k \left\langle { \nabla h(y_k), v_{k+1} - x^{*}} \right\rangle \\ =&-\alpha _k\left\langle {\nabla h(y_k),v_{k+1} - v_k} \right\rangle - \left\langle { \nabla h(y_k), y_{k} - x_{k}} \right\rangle - \alpha _k\left\langle { \nabla h(y_k), y_{k} - x^{*}} \right\rangle . \end{aligned}$$

Since $x_k,\,y_k\in Q$, by Assumption 3.1 we obtain

$$\begin{aligned} {}&- \left\langle { \nabla h(y_k), y_{k} - x_{k}} \right\rangle - \alpha _k\left\langle { \nabla h(y_k), y_{k} - x^{*}} \right\rangle \\ \le {}&h(x_k)-h(y_k) - \alpha _k\left[ h(y_k)-h(x^*)+\mu D_\phi (x^*,y_k)\right] . \end{aligned}$$

Hence, combining the above estimates with (44) proves (42).

4.2 Estimate of ${\mathbb {I}}_3$

Similarly with (41), by (9), (20d) and (23), the third term ${\mathbb {I}}_3$ is rearranged by that

$$\begin{aligned} \begin{aligned} {\mathbb {I}}_3= {}&-\frac{\alpha _k\beta _{k+1}}{2}\left\Vert {\lambda _{k+1}-\lambda ^*} \right\Vert ^2 -\frac{\beta _{k}}{2}\big \Vert \lambda _{k+1}-\lambda _k\big \Vert ^2\\ {}&\qquad +\alpha _{k}\big \langle Av_{k+1}-b,\lambda _{k+1}-\lambda ^*\big \rangle . \end{aligned} \end{aligned}$$

To match the cross term $ -\alpha _k \big \langle \widetilde{\lambda }_k, Av_{k+1} -b\big \rangle $ in the estimate of ${\mathbb {I}}_2$ (cf.(43)), we rewrite the last term as follows:

$$\begin{aligned} \begin{aligned} {}&\alpha _{k}\big \langle Av_{k+1}-b,\lambda _{k+1}-\lambda ^*\big \rangle \\ ={}&\alpha _{k}\big \langle Av_{k+1}-b,\lambda _{k+1}-\widetilde{\lambda }_k\big \rangle +\alpha _{k}\big \langle Av_{k+1}-b,\widetilde{\lambda }_k-\lambda ^*\big \rangle . \end{aligned} \end{aligned}$$

In view of (10) and (20d), we get

$$\begin{aligned} \begin{aligned} {}&\alpha _{k}\big \langle Av_{k+1}-b,\lambda _{k+1}-\widetilde{\lambda }_k\big \rangle =\beta _k\left\langle {\lambda _{k+1}-\lambda _{k},\lambda _{k+1}-\widetilde{\lambda }_k} \right\rangle \\ ={}&\frac{\beta _{k}}{2}\big \Vert \lambda _{k+1}-\lambda _k\Vert ^2 + \frac{\beta _{k}}{2} \big \Vert \lambda _{k+1}-\widetilde{\lambda }_k\big \Vert ^2 -\frac{\beta _{k}}{2}\big \Vert \lambda _{k}-\widetilde{\lambda }_k\big \Vert ^2, \end{aligned} \end{aligned}$$

which gives

$$\begin{aligned} {\mathbb {I}}_3\le -\frac{\alpha _k\beta _{k+1}}{2}\left\Vert {\lambda _{k+1}-\lambda ^*} \right\Vert ^2 +\frac{\beta _{k}}{2}\big \Vert \lambda _{k+1}-\widetilde{\lambda }_k\big \Vert ^2 +\alpha _{k}\big \langle Av_{k+1}-b,\widetilde{\lambda }_k-\lambda ^*\big \rangle . \end{aligned}$$

4.3 Proof of (27)

Combining (43) and the estimate of ${\mathbb {I}}_3$, we obtain

$$\begin{aligned} \begin{aligned} {\mathbb {I}}_2+{\mathbb {I}}_3\le&-\frac{\alpha _k\beta _{k+1}}{2}\left\Vert {\lambda _{k+1}-\lambda ^*} \right\Vert ^2 +\frac{\beta _{k}}{2}\big \Vert \lambda _{k+1}-\widetilde{\lambda }_k\big \Vert ^2-\alpha _k \big \langle \lambda ^*, Av_{k+1} -b\big \rangle \\ {}&\quad -\alpha _k \gamma _{k+1}D_\phi (x^*,v_{k+1}) -\gamma _{k}D_\phi (v_{k+1},v_k)\\&\qquad -\alpha _k\left[ g(v_{k+1})-g(x^*)+h(y_k)-h(x^*)\right] \\&\qquad \quad +h(x_k)-h(y_k) -\alpha _k\left\langle {\nabla h(y_k),v_{k+1} - v_k} \right\rangle . \end{aligned} \end{aligned}$$

Inserting the identity (40) into the above inequality and observing that

$$\begin{aligned} \begin{aligned} {}&-\alpha _k \big \langle \lambda ^*, Av_{k+1} -b\big \rangle +\langle \lambda ^*,A(x_{k+1}-x_k) \rangle \\&\quad =-\alpha _k \left\langle { \lambda ^*, A\left( x_{k+1}+\frac{x_{k+1}-x_k}{\alpha _k}\right) -b} \right\rangle +\langle \lambda ^*,A(x_{k+1}-x_k) \rangle \,\left( \text {by (20c)}\right) \\&\quad =-\alpha _{k}\langle \lambda ^*,Ax_{k+1}-b \rangle \end{aligned} \end{aligned}$$

and

$$\begin{aligned} \begin{aligned}&-\alpha _k\left[ g(v_{k+1})-g(x^*)+h(y_k)-h(x^*)\right] \\&\qquad =-\alpha _k\left[ g(x_{k+1})-g(x^*)+h(x_{k+1})-h(x^*) \right] \\&\qquad \quad -\alpha _k\left[ g(x_{k})-g(x_{k+1})+h(x_{k})-h(x_{k+1}) \right] \\&\qquad \quad -\alpha _k\left[ g(v_{k+1})-g(x_{k})+h(y_{k})-h(x_{k}) \right] \\&\qquad = -\alpha _k\left[ f(x_{k+1})-f(x^*)\right] -\alpha _k\left[ f(x_{k})-f(x_{k+1})\right] \\&\qquad \quad -\alpha _k\left[ g(v_{k+1})-g(x_{k})+h(y_{k})-h(x_{k}) \right] , \end{aligned} \end{aligned}$$

we arrive at

$$\begin{aligned} \begin{aligned} {\mathcal {E}}_{k+1}-\mathcal E_{k}\le&-\alpha _k\left[ f(x_{k+1})-f(x^*)\right] -\alpha _{k}\langle \lambda ^*,Ax_{k+1}-b \rangle \\ {}&\quad -\alpha _k \gamma _{k+1}D_\phi (x^*,v_{k+1})-\frac{\alpha _k\beta _{k+1}}{2}\left\Vert {\lambda _{k+1}-\lambda ^*} \right\Vert ^2\\ {}&\qquad +(1+\alpha _k)\left[ f(x_{k+1})-f(x_{k})\right] +\frac{\beta _{k}}{2}\big \Vert \lambda _{k+1}-\widetilde{\lambda }_k\big \Vert ^2 \\ {}&\quad \qquad -\alpha _k\left[ g(v_{k+1})-g(x_{k})+h(y_{k})-h(x_{k}) \right] -\gamma _{k}D_\phi (v_{k+1},v_k)\\ {}&\qquad \qquad +h(x_k)-h(y_k) -\alpha _k\left\langle {\nabla h(y_k),v_{k+1} - v_k} \right\rangle . \end{aligned} \end{aligned}$$

The first two lines equal to $ -\alpha _k{\mathcal {E}}_{k+1}$ and a careful collection of the rest terms gives

$$\begin{aligned} \begin{aligned} {\mathcal {E}}_{k+1}-{\mathcal {E}}_{k}\le&-\alpha _k{\mathcal {E}}_{k+1} +\frac{\beta _{k}}{2}\big \Vert \lambda _{k+1}-\widetilde{\lambda }_k\big \Vert ^2 -\gamma _{k}D_\phi (v_{k+1},v_k) \\ {}&\quad +(1+\alpha _k)\left[ h(x_{k+1})- h(y_{k})\right] -\alpha _k\left\langle {\nabla h(y_k),v_{k+1} - v_k} \right\rangle \\ {}&\qquad +(1+\alpha _k)g(x_{k+1})-g(x_{k}) -\alpha _kg(v_{k+1}). \end{aligned} \end{aligned}$$

(45)

From (20c), we see $x_{k+1}$ is a convex combination of $x_k$ and $v_{k+1}$, which implies

$$\begin{aligned} (1+\alpha _k)g(x_{k+1})\le g(x_{k}) +\alpha _kg(v_{k+1}). \end{aligned}$$

Thanks to (22) and the relation $\alpha _k(v_{k+1}-v_k)=(1+\alpha _k) (x_{k+1}-y_k) $ (cf.(20a) and (20c)), we obtain the estimate

$$\begin{aligned} \begin{aligned}&(1+\alpha _k)\left[ h(x_{k+1})- h(y_{k})\right] -\alpha _k\left\langle {\nabla h(y_k),v_{k+1} - v_k} \right\rangle \\&\qquad \le \frac{\alpha _k^2M_{k+1}}{2+2\alpha _k}\left\Vert {v_{k+1}-v_k} \right\Vert ^2 +\frac{\delta _{k+1}}{2}(1+\alpha _k). \end{aligned} \end{aligned}$$

Consequently, plugging these two estimates into (45) and using (8) leads to

$$\begin{aligned} \begin{aligned} {\mathcal {E}}_{k+1}-{\mathcal {E}}_{k}\le&-\alpha _k{\mathcal {E}}_{k+1}+\frac{\delta _{k+1}}{2}(1+\alpha _k) +\frac{\beta _{k}}{2}\big \Vert \lambda _{k+1}-\widetilde{\lambda }_k\big \Vert ^2\\&\quad +\frac{\alpha _k^2M_{k+1}-\gamma _{k}(1+\alpha _k)}{1+\alpha _k}D_\phi (v_{k+1},v_k). \end{aligned} \end{aligned}$$

Recall that $\widetilde{\lambda }_k= \lambda _k+\alpha _k/\beta _k(Av_k-b)$, which together with (20d) gives $\lambda _{k+1}- \widetilde{\lambda }_k=\alpha _k/\beta _k A(v_{k+1}-v_k)$ and

$$\begin{aligned} \begin{aligned} {\mathcal {E}}_{k+1}-{\mathcal {E}}_{k}\le {}&\!-\alpha _k{\mathcal {E}}_{k+1}+\frac{\delta _{k+1}}{2}(1+\alpha _k)\\ {}&\quad +\frac{1}{\beta _{k}} \left[ \alpha _k^2(\beta _{k+1}M_{k+1} +\left\Vert {A} \right\Vert ^2)-\gamma _k\beta _k\right] D_\phi (v_{k+1},v_k). \end{aligned} \end{aligned}$$

Since $\beta _{k+1}\le \beta _{k}$ (cf.(23)), the desired estimate (27) follows immediately from (21). This finishes the proof of Lemma 3.2.

5 Proof of Lemma 3.3

The key to complete the proof of Lemma 3.3 is the difference inequality (37). In Sect. 5.1, we shall introduce an auxiliary differential inequality (cf.(46)) that can be viewed as a continuous analogue to (37). Later in Sects. 5.2 and 5.3, we finish the proofs of (35),(36) by using the asymptotic estimate of (46).

5.1 A Differential Inequality

In the sequel, we need some functional spaces in one dimension; see [14, Definitions 1.87 and 2.1]. Denote by $C^1(I)$ the space of (real-valued) continuous functions on the interval $I\subset \,{{\mathbb {R}}}$ with continuous derivatives. Let $L^\infty (I)$ be the space of essentially bounded measurable functions, which means any $\sigma \in L^\infty (I)$ is bounded almost everywhere. The space $L^1(I)$ consists of measurable functions that are absolutely summable (integrable). As usual, we denote by $W^{1,\infty }(I)$ the set of all real-valued functions which, together with their generalized derivatives, belong to $L^\infty (I)$.

Let $\eta ,\,R\ge 0$ and $\theta >1$ be constants such that $\eta \le \theta -1$. Suppose $y\in W^{1,\infty }(0,\infty )$ is positive and satisfies the differential inequality

$$\begin{aligned} y'(t) \le -\frac{\sigma (t) y^\theta (t)}{\sqrt{\varphi (t)y^{2\eta }(t) + R^2}},\quad y(0) = 1, \end{aligned}$$

(46)

where $\sigma \in L^1(0,\infty )$ is nonnegative, and $\varphi \in C^1[0,\infty )$ is positive and nondecreasing. The decay rate of y(t) is given below.

Lemma 5.1

Assume $y\in W^{1,\infty }(0,\infty )$ is positive and satisfies (46). Then, for all $t>0$, we have

$$\begin{aligned} y(t)\le \left\{ \begin{aligned} {}&\exp \left( -\frac{\varSigma (t)}{2\sqrt{\varphi (t)}}\right) + \left( 1+\frac{\theta -1}{2R}\varSigma (t)\right) ^{\frac{1}{1-\theta }}{} & {} \text {if } \eta =\theta -1,\\ {}&\left( 1+\frac{\theta -1}{2R}\varSigma (t)\right) ^{\frac{1}{1-\theta }}+ \left( 1+\frac{\theta -\eta -1}{2\sqrt{\varphi (t)}}\varSigma (t)\right) ^{\frac{1}{\eta +1-\theta }}{} & {} \text {if }\eta <\theta -1, \end{aligned} \right. \end{aligned}$$

where $\varSigma (t):=\int _{0}^{t}\sigma (s)\,\textrm{d}s$.

Proof

Write (46) as follows

$$\begin{aligned} y'(t) \le -\frac{\sigma (t) }{\sqrt{\varphi (t)y^{2\eta -2\theta }(t) + R^2y^{-2\theta }(t)}}. \end{aligned}$$

Shifting the denominator from the right to the left and using the trivial estimate

$$\begin{aligned} \sqrt{\varphi (t)y^{2\eta -2\theta }(t) + R^2y^{-2\theta }(t)}\le \sqrt{\varphi (t)}y^{\eta -\theta }(t)+Ry^{-\theta }(t), \end{aligned}$$

we obtain that

$$\begin{aligned} \left( \frac{\sqrt{\varphi (t)}}{y^{\theta -\eta }(t)}+\frac{R}{y^{\theta }(t)}\right) y'(t)\le -\sigma (t). \end{aligned}$$

(47)

To the end, we shall discuss in two cases: $\eta =\theta -1$ and $\eta <\theta -1$. Detailed proofs can be found in Sect. Appendix C. $\square $

5.2 Proof of (35)

By (23), we have

$$\begin{aligned} \gamma _{k+1}-\gamma _{0}\beta _{k+1} = \frac{\mu \alpha _{k}+\gamma _{k}}{1+\alpha _k}-\frac{\gamma _{0}\beta _k}{1+\alpha _k}\ge \frac{\gamma _{k}-\gamma _{0}\beta _k}{1+\alpha _k}. \end{aligned}$$

Since $\gamma _{0}=\gamma _{0}\beta _0$, it follows that $\gamma _k \ge \gamma _0\beta _{k}$ and (37) becomes

$$\begin{aligned} \beta _{k+1}-\beta _{k} \le -\frac{\sqrt{\gamma _{0}}\beta _k\beta _{k+1}}{\sqrt{\varDelta \cdot \beta _k\delta _{k+1}^{\frac{\nu -1}{\nu +1}} +\left\Vert {A} \right\Vert ^2}}, \end{aligned}$$

(48)

where $\delta _{k+1} = \beta _{k+1}/(k+1)$.

Define a piecewise continuous linear interpolation

$$\begin{aligned} y(t): ={}\beta _{k}(k+1-t)+\beta _{k+1}(t-k)\quad \forall \,t\in [k,k+1), \, k\in {\mathbb {N}}. \end{aligned}$$

(49)

Clearly, $y\in W^{1,\infty }(0,\infty )$ is positive and $0<y(t)\le y(0) = 1$. In particular, we have $\beta _k = y(k)$ for all $k\in {\mathbb {N}}$, and the decay estimate of $\beta _k$ is transferred into the asymptotic behavior of y(t), which satisfies (the proof is given below)

$$\begin{aligned} y' (t)\le -\frac{\sqrt{\gamma _0}y^{2}(t)/2}{\sqrt{\varphi (t)[y(t)]^{\frac{2\nu }{1+\nu }} +\left\Vert {A} \right\Vert ^2 }}, \end{aligned}$$

(50)

where $\varphi (t):=4\varDelta (t+1)^{\frac{1-\nu }{1+\nu }}$. Thus, utilizing Lemma 5.1 gives

$$\begin{aligned} \beta _k=y(k)\le \frac{4\left\Vert {A} \right\Vert }{\sqrt{\gamma _0}k} + \frac{(16\sqrt{2})^{1+\nu }\varDelta ^{\frac{1+\nu }{2}}}{\gamma _0^{\frac{1+\nu }{2}}k^{\frac{1+3\nu }{2}}} \quad \forall \,k\ge 1, \end{aligned}$$

which establishes (35).

Below, let us verify (50). Since $\gamma _k\le \max \{\gamma _0,\mu \}\le \left\Vert {A} \right\Vert ^2$, from (21) we find that $\alpha _k\le \sqrt{\gamma _k\beta _k}/\left\Vert {A} \right\Vert \le 1$. For any $t\in (k,k+1)$, it is clear that

$$\begin{aligned} 1\ge \frac{\beta _{k+1}}{y(t)}\ge \frac{\beta _{k+1}}{\beta _{k}}=\frac{1}{1+\alpha _k}\ge \frac{1}{2},\quad \text {and}\quad 1\le \frac{\beta _k}{y(t)}\le \frac{\beta _{k}}{\beta _{k+1}}\le 2, \end{aligned}$$

which implies

$$\begin{aligned} \varDelta \cdot \beta _k\delta _{k+1}^{\frac{\nu -1}{\nu +1}}= \frac{\varphi (k)\beta _k\beta _{k+1}^{\frac{\nu -1}{\nu +1}}}{4} \le \varphi (t)[y(t)]^{\frac{2\nu }{1+\nu }}. \end{aligned}$$

Since $y'(t) = \beta _{k+1}-\beta _k$, plugging the above estimate into (48) proves (50).

5.3 Proof of (36)

Again, by (23), we have $\gamma _{k}\ge \gamma _{\min } = \min \{\gamma _0,\mu \}$, and (37) becomes

$$\begin{aligned} \beta _{k+1}-\beta _{k} \le -\frac{\sqrt{\gamma _{\min }\beta _k}\beta _{k+1}}{\sqrt{\varDelta \cdot \beta _k\delta _{k+1}^{\frac{\nu -1}{\nu +1}} +\left\Vert {A} \right\Vert ^2}}. \end{aligned}$$

Recall the interpolation y(t) defined by (49). Similarly with (50), we claim that

$$\begin{aligned} y' (t)\le -\frac{\sqrt{\gamma _{\min }}y^{3/2}(t)/2}{\sqrt{\varphi (t)[y(t)]^{\frac{2\nu }{1+\nu }} +\left\Vert {A} \right\Vert ^2 }}, \end{aligned}$$

and invoking Lemma 5.1 again gives

$$\begin{aligned} \beta _k\le \left\{ \begin{aligned} {}&\frac{64\left\Vert {A} \right\Vert ^2}{\gamma _{\min } k^2+\left\Vert {A} \right\Vert ^2}+\exp \left( -\frac{k}{8}\sqrt{\frac{\gamma _{\min }}{\varDelta }}\right){} & {} \text { if }\nu =1,\\ {}&\frac{64\left\Vert {A} \right\Vert ^2}{\gamma _{\min } k^2+\left\Vert {A} \right\Vert ^2} + \left( 1+\frac{1-\nu }{32}\sqrt{\frac{\gamma _{\min }}{2^ {\frac{1-\nu }{1+\nu }}\varDelta }}\,k^{\frac{1+3\nu }{2+2\nu }}\right) ^{-\frac{2+2\nu }{1-\nu }}{} & {} \text { if }\nu <1. \end{aligned} \right. \end{aligned}$$

This proves (36) and completes the proof of Lemma 3.3.

6 Numerical Examples

In this part, we provide several experiments to validate the performance of our Algorithm 1 (denoted by UAPD). It is compared with Nesterov’s FGM [48] and the AccUniPDGrad method [68], for solving unconstrained and affinely constrained problems.

Both UAPD and FGM involve the proximal mapping of the nonsmooth part g, and FGM performs one more proximal calculation for updating $v_{k+1}$. In line search part, FGM and AccUniPDGrad use the tolerance $\delta _k = \epsilon \tau _k$ with $\tau _k = \mathcal O(1/k)$, where $\epsilon $ is the desired accuracy. Clearly, this is smaller than ours $\delta _k = \beta _k/k$. As discussed in Remark 3.3, this will lead to over-estimate issue, especially for Hölderian case (cf. Sect. 6.1) and smooth problems with large Lipschitz constants (cf. Sect. 6.2).

6.1 Matrix Game

The problem reads as

$$\begin{aligned} \min _{x\in \varDelta _n} \max _{y\in \varDelta _m}\left\langle {x,Py} \right\rangle = \min _{x\in \varDelta _n}\left\{ \varphi (x): = \max _{1\le j\le m}\left\langle {p_j,x} \right\rangle \right\} , \end{aligned}$$

(51)

where $P =(p_1,p_2,\cdots ,p_m)\in \,{{\mathbb {R}}}^{n\times m}$ is the given payoff matrix and $\varDelta _{\times }\subset \,{{\mathbb {R}}}^\times $ is the simplex with $\times =m$ or n. By von Neumann’s minimax theorem [1, Corollary 15.30], we can change the min-max order of (51) and obtain

$$\begin{aligned} \begin{aligned} \max _{y\in \varDelta _m} \min _{x\in \varDelta _n} \left\langle {x,Py} \right\rangle = \max _{y\in \varDelta _m}\left\{ \psi (y):=\min _{1\le i\le n}\left\langle {q_i,y} \right\rangle \right\} , \end{aligned} \end{aligned}$$

(52)

where $P^\top =(q_1,q_2,\cdots ,q_n)$ with $q_i\in \,{{\mathbb {R}}}^m$. Moreover, there is no duality gap, i.e., $\varphi ^* = \psi ^*$. According to [1, Proposition 5.12], we claim that $\varphi (x^\#) = \psi (y^\#)$ for some $(x^\#,y^\#)\in \varDelta _n\times \varDelta _m$ if and only if $\varphi (x^\#) = \varphi ^*$ and $\psi (y^\#) = \psi ^*$.

As we do not know the optimal value, consider the unconstrained problem

$$\begin{aligned} \min _{x\in \varDelta _n,\,y\in \varDelta _m}\left\{ f(x,y): = \varphi (x)-\psi (y)\right\} . \end{aligned}$$

(53)

Clearly, the minimal value is zero:

$$\begin{aligned} f^* = \min _{x\in \varDelta _n}\varphi (x) - \max _{y\in \varDelta _m}\psi (y) = \varphi ^*-\psi ^* = 0. \end{aligned}$$

Moreover, $(x^\#,y^\#)$ is an optimal solution to (53) if and only if $x^\#$ and $y^\#$ are optimal solutions to (51,52), respectively.

Since $\psi $ is concave, f is convex but nonsmooth (Lipschitz continuous). In view of Remark 3.1, f satisfies Assumption 3.1 without the simple part g. In addition, to work with the simplex constraint, a proper prox-function is the entropy $\phi (x) = \left\langle {x,\ln x} \right\rangle $, which gives closed solution of the proximal calculation (14).

Numerical results are displayed in Fig. 2. We record (i) the decay behavior of the objective residual $\left|{f(x_k,y_k)-f^*} \right|$ with respect to both iteration number k and running time t (in seconds), (ii) the total number $\#i_k$ of the line search step $i_k$ and its average $\bar{i}_k$, and (iii) the approximate Lipschitz constant $M_k$. The pay off matrix P is generated from the normal distribution. For FGM, the priori accuracy is $\epsilon = 1e$-5. For our method, the line search parameters are $\rho _{\!u} = 2$ and $\rho _{_{\!d}} = 1$.

From Fig. 2, our UAPD outperforms FGM, with faster convergence, less number of line search steps and smaller Lipschitz constants. Within the same iteration number $k = 1e5$, UAPD achieves ten times smaller error (1e-2 v.s. 1e-1) but takes only about half less time (20 s v.s. 40 s). It can be seen that, FGM has much more line search steps and the average is 1, which means it performs at least one step of line search in each iteration. Since $\rho _{_{\!d}}=1$, according to Remark 3.2, the Lipschitz constant $M_k$ of our UAPD is nondecreasing, which agrees with the numerical illustration. As mentioned in Remark 3.3, FGM suffers from over-estimated Lipschitz parameters with dramatically growth behavior since it adopts smaller tolerance $\epsilon /k$ (comparable to our dynamically decreasing choice $\beta _k/k$). Besides, in step 3 of FGM, it updates the Lipschitz constant by $L_{k+1} = L_{k,i_k}/2=2^{i_k}L_k/2$ (here L agrees with our notation M), which corresponds to choosing $\rho _{_{\!d}}=2$ in our method. Hence, the Lipschitz constant sequence of FGM is nonmonotone and changes with high oscillation. This verifies the claim in Remark 3.2 that large $\rho _{_{\!d}}>1$ reduces $M_k$ locally but increases the burden on the line search procedure.

In Fig. 3, we check also the difference between the Euclidean distance $\phi (x) = 1/2\left\Vert {x} \right\Vert ^2$ and the entropy $\phi (x) = \left\langle {x,\ln x} \right\rangle $. It is observed that these two cases are very similar in line search part but the latter leads to better convergence rate.

Besides, we report the performance of our UAPD with different $\rho _{_{\!d}}$. From Fig. 4, we see little improvement on the convergence rate with respect to the iteration k, and large $\rho _{_{\!d}}$ does not win smaller choices because it runs with more than triple time (120.12 s v.s. 39.62 s) but has not reduced the residual by third (6.07e-3 v.s. 4.37e-3). Moreover, the magnitude of $M_k$ does not differ too much but large $\rho _{_{\!d}}$ call more line search steps, which increases the computational cost. In conclusion, for matrix game problem, $\rho _{_{\!d}} = 1$ seems a doable choice.

6.2 Regularized Matrix Game Problem

The objective of (51) admits an approximation (cf. [46])

$$\begin{aligned} \varphi _\sigma (x): = \sigma \ln \left( \sum _{j=1}^{m}e^{\left\langle {p_j,x} \right\rangle /\sigma }\right) , \end{aligned}$$

(54)

where $\sigma >0$ denotes the smoothing parameter. This regularized objective is smoother than the original one. According to [46, Eq.(4.8)], we choose $\sigma = \epsilon /(2\ln m)$, and the Lipschitz constant of $\nabla \varphi _\sigma $ is $L_\sigma =\max _{i,j}|P_{i,j}|^2/(4\sigma )$. To avoid the overflow issue, we adopt the shifting technique [3].

We then apply UAPD and FGM (with $\epsilon = 1e$-6) to the smooth objective (54) and report the numerical outputs in Fig. 5. The optimal value $\varphi _\sigma ^* $ is obtained by running UAPD with enough iterations. Similarly as before, our UAPD is superior to FGM in convergence, line search cost and approximate Lipschitz constant. Also, we plot the results of the original matrix game problem (see the top row in Fig. 5) and find that with smoothing technique both two methods perform better than before.

6.3 Continuous Steiner Problem

Let us consider one more unconstrained problem

$$\begin{aligned} \min _{x\in \,{{\mathbb {R}}}^n_+}\,f(x) = \sum _{j=1}^{m}\left\Vert {x-a_j} \right\Vert , \end{aligned}$$

(55)

where $a_j\in \,{{\mathbb {R}}}^n$ denotes the given location. Note that the objective is actually quite smooth far away from each $a_j$. We generate $a_j$ from the uniform distribution and run UAPD with enough times to obtain an approximated optimal value $f^*$. Numerical results in Fig. 6 show that both FGM (with $\epsilon = 1e$-8) and UAPD work well and possess similar convergence behaviors. Moreover, as $\nabla f$ is almost Lipschitz continuous and the magnitude of the Lipschitz constant is not large, the over-estimated issue of FGM is negligible, and the approximated constant $M_k$ is the same as that of our UAPD.

6.4 Basis Pursuit Problem

In the last example, we look at the basis pursuit problem

$$\begin{aligned} \min _{x\in \,{{\mathbb {R}}}^n}\left\Vert {x} \right\Vert _1\quad \mathrm{s.t.}\,Ax=b, \end{aligned}$$

where $A\in \,{{\mathbb {R}}}^{m\times n}$ and $b\in \,{{\mathbb {R}}}^m$. To be compatible with the problem setting of AccUniPDGrad [68], consider an equivalent formulation

$$\begin{aligned} \min _{x\in \,{{\mathbb {R}}}^n}\,f(x) = \frac{1}{2}\left\Vert {x} \right\Vert _1^2\quad \mathrm{s.t.}\,Ax=b. \end{aligned}$$

(56)

The dual problem reads as

$$\begin{aligned} \min _{\lambda \in \,{{\mathbb {R}}}^{m}}\,\left\{ \varphi (\lambda ): = \left\langle {b,\lambda } \right\rangle +\frac{1}{2} \big \Vert A^\top \lambda \big \Vert _\infty ^2\right\} . \end{aligned}$$

Note that existing accelerated methods in [29, 39, 65] can be applied to solving (56) with theoretical rate $\mathcal O(1/k)$. But we only focus on the comparison between UAPD and AccUniPDGrad, as black-box type methods with line search procedure. We mention that AccUniPDGrad also uses smaller tolerance $\epsilon /k$ as that in FGM, where $\epsilon >0$ denotes the desired accuracy, and it takes $M_{k+1} = M_{k,i_k}=2^{i_k}M_k$, which coincides with our choice $\rho _{_{\!d}}=1$.

To obtain a reasonable approximate minimal value $f^*$, we run the accelerated primal–dual method proposed in [39] with enough iterations. Numerical results are shown in Fig. 7, which indicate that (i) our UAPD has smaller objective residual and feasibility violation, (ii) the line search steps are very close and the Lipschitz constant sequences are nondecreasing (since $\rho _{_{\!d}}=1$), and (iii) AccUniPDGrad generates over-estimated $M_k$ because of its small tolerance $\epsilon /k$.

References

Bauschke, H., Combettes, P.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics. Springer Science+Business Media, New York (2011)
Google Scholar
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
MathSciNet Google Scholar
Blanchard, P., Higham, D.J., Higham, N.J.: Accurately computing the log-sum-exp and softmax functions. IMA J. Numer. Anal. 41(4), 2311–2330 (2021)
MathSciNet Google Scholar
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2010)
Google Scholar
Cai, J.-F., Osher, S., Shen, Z.: Linearized Bregman iterations for compressed sensing. Math. Comput. 78(267), 1515–1536 (2009)
MathSciNet Google Scholar
Candès, E., Romberg, J., Tao, T.: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory 52(2), 489–509 (2006)
MathSciNet Google Scholar
Chambolle, A., Pock, T.: A first-order primal–dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)
MathSciNet Google Scholar
Chambolle, A., Pock, T.: An introduction to continuous optimization for imaging. Acta Numer. 25, 161–319 (2016)
MathSciNet Google Scholar
Chen, G., Teboulle, M.: Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM J. Optim. 3(3), 538–543 (1993)
MathSciNet Google Scholar
Chen, L., Luo, H.: First order optimization methods based on Hessian-driven Nesterov accelerated gradient flow. arXiv:1912.09276 (2019)
Chen, L., Luo, H.: A unified convergence analysis of first order convex optimization methods via strong Lyapunov functions. arXiv: 2108.00132 (2021)
Chen, Y., Lan, G., Ouyang, Y.: Optimal primal–dual methods for a class of saddle point problems. SIAM J. Optim. 24(4), 1779–1814 (2014)
MathSciNet Google Scholar
Davis, D., Yin, W.: Convergence rate analysis of several splitting schemes. Splitting Methods in Communication, Imaging, Science, and Engineering, pages 115–163 (2016)
Demengel, F., Demengel, G., Erné, R.: Functional Spaces for the Theory of Elliptic Partial Differential Equations. Universitext. Springer, London (2012)
Google Scholar
Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146(1–2), 37–75 (2014)
MathSciNet Google Scholar
Dvurechensky, P, Gasnikov, A., Kroshnin, A.: Computational optimal transport: complexity by accelerated gradient descent is better than by Sinkhorn’s algorithm. In: Proceedings of the 35 th International Conference on Machine Learning, volume 80, Stockholm, Sweden (2018). PMLR
Dvurechensky, P., Staudigl, M., Shtern, S.: First-order methods for convex optimization. arXiv:2101.00935 (2021)
Eckstein, J.: Splitting Methods for Monotone Operators with Applications to Parallel Optimization. PhD Thesis, Massachusetts Institute of Technology (1989)
Eckstein, J., Bertsekas, D.P.: On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program. 55(1), 293–318 (1992)
MathSciNet Google Scholar
Esser, E., Zhang, X., Chan, T.F.: A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science. SIAM J. Imaging Sci. 3(4), 1015–1046 (2010)
MathSciNet Google Scholar
Feijer, D., Paganini, F.: Stability of primal–dual gradient dynamics and applications to network optimization. Automatica 46(12), 1974–1981 (2010)
MathSciNet Google Scholar
Goldstein, T., O’Donoghue, B., Setzer, S., Baraniuk, R.: Fast alternating direction optimization methods. SIAM J. Imaging Sci. 7(3), 1588–1623 (2014)
MathSciNet Google Scholar
Guminov, S., Gasnikov, A., Anikin, A., Gornov, A.: A universal modification of the linear coupling method. Optim. Methods Softw. 34(3), 560–577 (2019)
MathSciNet Google Scholar
Guminov, S.V., Nesterov, Y.E., Dvurechensky, P.E., Gasnikov, A.V.: Primal–dual accelerated gradient descent with line search for convex and nonconvex optimization problems. arXiv:1809.05895 (2018)
He, B., You, Y., Yuan, X.: On the convergence of primal-dual hybrid gradient algorithm. SIAM J. Imaging Sci. 7(4), 2526–2537 (2014)
MathSciNet Google Scholar
He, B., Yuan, X.: On the acceleration of augmented Lagrangian method for linearly constrained optimization. https://optimization-online.org/2010/10/2760/ (2010)
He, X., Hu, R., Fang, Y.-P.: Fast primal–dual algorithm via dynamical system for a linearly constrained convex optimization problem. Automatica 146, 110547 (2022)
MathSciNet Google Scholar
He, X., Hu, R., Fang, Y.-P.: Inertial accelerated primal–dual methods for linear equality constrained convex optimization problems. Numer. Algorithms 90(4), 1669–1690 (2022)
MathSciNet Google Scholar
Huang, B., Ma, S., Goldfarb, D.: Accelerated linearized Bregman method. J. Sci. Comput. 54, 428–453 (2013)
MathSciNet Google Scholar
Heinonen, J.: Lectures on Lipschitz analysis. Technical Report vol. 100, Rep. Univ. Jyväskylä Dept. Math. Stat., University of Jyväskylä (2005)
Jiang, F., Cai, X., Wu, Z., Han, D.: Approximate first-order primal-dual algorithms for saddle point problems. Math. Comput. 90(329), 1227–1262 (2021)
MathSciNet Google Scholar
Kamzolov, D., Dvurechensky, P., Gasnikov, A.: Universal intermediate gradient method for convex problems with inexact oracle. arXiv:1712.06036 (2019)
Kang, M., Kang, M., Jung, M.: Inexact accelerated augmented Lagrangian methods. Comput. Optim. Appl. 62(2), 373–404 (2015)
MathSciNet Google Scholar
Krichene, W., Bayen, A., Bartlett, P.: Accelerated mirror descent in continuous and discrete time. Adv. Neural Inf. Process. Syst. (NIPS) 28, 2845–2853 (2015)
Google Scholar
Li, H., Fang, C., Lin, Z.: Convergence rates analysis of the quadratic penalty method and its applications to decentralized distributed optimization. arXiv:1711.10802 (2017)
Li, H., Lin, Z.: Accelerated alternating direction method of multipliers: an optimal ${O}(1/{K})$ nonergodic analysis. J. Sci. Comput. 79(2), 671–699 (2019)
MathSciNet Google Scholar
Lin, T., Ho, N., Jordan, M.I.: On efficient optimal transport: An analysis of greedy and accelerated mirror descent algorithms. In International Conference on Machine Learning, pp. 3982–3991. PMLR (2019)
Luo, H.: Accelerated differential inclusion for convex optimization. Optimization (2021). https://doi.org/10.1080/02331934.2021.2002327
Article Google Scholar
Luo, H.: Accelerated primal-dual methods for linearly constrained convex optimization problems. arXiv:2109.12604 (2021)
Luo, H., Zhang, Z.-H.: A unified differential equation solver approach for separable convex optimization: splitting, acceleration and nonergodic rate. arXiv:2109.13467 (2023)
Luo, H.: A primal–dual flow for affine constrained convex optimization. ESAIM: Control Optim. Calc. Var. 28, 33 (2022)
Luo, H., Chen, L.: From differential equation solvers to accelerated first-order methods for convex optimization. Math. Program. (2021). https://doi.org/10.1007/s10107-021-01713-3
Article Google Scholar
Nbmirovskii, A.S., Nrsterov, Y.E.: Optimal methods of smooth convex minimization. USSR Comput. Math. Math. Phys. 25(2), 21–30 (1985)
MathSciNet Google Scholar
Nemirovsky, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. John Wiley & Sons, New York (1983)
Google Scholar
Nesterov, Y.: Introductory Lectures on Convex Optimization. Applied Optimization, vol. 87. Springer, US, Boston, MA (2004)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
MathSciNet Google Scholar
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. Ser. B 140(1), 125–161 (2013)
MathSciNet Google Scholar
Nesterov, Y.: Universal gradient methods for convex optimization problems. Math. Program. 152, 381–404 (2015)
MathSciNet Google Scholar
Nesterov, Y.: Lectures on Convex Optimization. Springer Optimization and Its Applications, vol. 137. Springer International Publishing, Cham (2018)
Ouyang, Y., Chen, Y., Lan, G., Pasiliao, E.: An accelerated linearized alternating direction method of multipliers. SIAM J. Imaging Sci. 8(1), 644–681 (2015)
MathSciNet Google Scholar
Ouyang, Y., Xu, Y.: Lower complexity bounds of first-order methods for convex–concave bilinear Saddle-point problems. Math. Program. 185(1–2), 1–35 (2021)
MathSciNet Google Scholar
Roulet, V., d’Aspremont, A.: Sharpness, restart, and acceleration. In: 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA (2017)
Sabach, S., Teboulle, M.: Faster Lagrangian-based methods in convex optimization. SIAM J. Optim. 32(1), 204–227 (2022)
MathSciNet Google Scholar
Stonyakin, F., Dvinskikh, D., Dvurechensky, P., Kroshnin, A., Kuznetsova, O., Agafonov, A., Gasnikov, A., Tyurin, A., Uribe, C.A., Pasechnyuk, D., Artamonov, S.: Gradient methods for problems with inexact model of the objective. arXiv:1902.09001 (2019)
Stonyakin, F., Gasnikov, A., Dvurechensky, P., Alkousa, M., Titov, A.: Generalized mirror prox for monotone variational inequalities: Universality and inexact oracle. arXiv:1806.05140 (2022)
Tao, M., Yuan, X.: Accelerated Uzawa methods for convex optimization. Math. Comput. 86(306), 1821–1845 (2016)
MathSciNet Google Scholar
Tian, W., Yuan, X.: An alternating direction method of multipliers with a worst-case ${O}(1/n^2)$ convergence rate. Math. Comput. 88(318), 1685–1713 (2018)
Google Scholar
Tran-Dinh, Q.: A unified convergence rate analysis of the accelerated smoothed gap reduction algorithm. Optim. Lett. (2021). https://doi.org/10.1007/s11590-021-01775-4
Article Google Scholar
Tran-Dinh, Q., Fercoq, O., Cevher, V.: A smooth primal-dual optimization framework for nonsmooth composite convex minimization. SIAM J. Optim. 28(1), 96–134 (2018)
MathSciNet Google Scholar
Tran-Dinh, Q., Zhu, Y.: Augmented Lagrangian-based decomposition methods with non-ergodic optimal rates. arXiv:1806.05280 (2018)
Tran-Dinh, Q., Zhu, Y.: Non-stationary first-order primal–dual algorithms with faster convergence rates. SIAM J. Optim. 30(4), 2866–2896 (2020)
MathSciNet Google Scholar
Valkonen, T.: Inertial, corrected, primal–dual proximal splitting. SIAM J. Optim. 30(2), 1391–1420 (2020)
MathSciNet Google Scholar
Wibisono, A., Wilson, A.C., Jordan, M.I.: A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. 113(47), E7351–E7358 (2016)
MathSciNet Google Scholar
Wilson, A., Recht, B., Jordan, M.: A Lyapunov analysis of momentum methods in optimization. arXiv:1611.02635 (2016)
Xu, Y.: Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. SIAM J. Optim. 27(3), 1459–1484 (2017)
MathSciNet Google Scholar
Xu, Y.: Iteration complexity of inexact augmented Lagrangian methods for constrained convex programming. Math. Program. 185(1–2), 199–244 (2021)
MathSciNet Google Scholar
Yin, W., Osher, S., Goldfarb, D., Darbon, J.: Bregman iterative algorithms for $\ell _1$-minimization with applications to compressed sensing. SIAM J. Imaging Sci. 1(1), 143–168 (2008)
MathSciNet Google Scholar
Yurtsever, A., Tran-Dinh, Q., Cevher, V.: A universal primal-dual convex optimization framework. arXiv: 1502.03123 (2015)
Zhao, Y., Liao, X., He, X., Li, C.: Accelerated primal-dual mirror dynamical approaches for constrained convex optimization. arXiv:2205.15983 (2022)

Download references

Acknowledgements

The author would like to thank the Editor and two anonymous referees, for their careful readings and valuable comments that improve significantly the early version of the paper.

Author information

Authors and Affiliations

National Center for Applied Mathematics in Chongqing, Chongqing Normal University, Chongqing, 401331, China
Hao Luo
Chongqing Research Institute of Big Data, Peking University, Chongqing, 401121, China
Hao Luo

Authors

Hao Luo
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Hao Luo.

Additional information

Communicated by Olivier Fercoq.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by the Foundation of Chongqing Normal University (Grant No. 202210000161) and the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant No. KJZD-K202300505)

Appendices

Proof of Lemma 3.1

Let us first prove (18). Recall that $i_k$ is the smallest nonnegative integer such that (cf. (15))

$$\begin{aligned} h(x_{k,i_k})-\varDelta _{k,i_k}\le \frac{\delta _{k,i_{k}}}{2}. \end{aligned}$$

If $i_k = 0$, then $M_{k+1} = M_{k,0} /\rho _{_{\!d}}= M_{k}/\rho _{_{\!d}}$. Otherwise (i.e., $i_k\ge 1$), we claim that

$$\begin{aligned} M_{k,i_k}\le \rho _{\!u}\cdot M(\nu ,\delta _{k,i_{k}-1}). \end{aligned}$$

(57)

If this is violated, then $M_{k,i_{k}-1} = M_{k,i_k}/\rho _{\!u}>M(\nu ,\delta _{k,i_{k}-1})$. According to Proposition 2.1, this implies immediately that

$$\begin{aligned} h(x_{k,i_{k}-1})-\varDelta _{k,i_{k}-1}\le \frac{\delta _{k,i_{k}-1}}{2}, \end{aligned}$$

which yields a contradiction and thus verifies (57). Additionally, by (16), we have $ \alpha _{k,i_{k}-1}\le \sqrt{\rho _{\! u}}\alpha _{k,i_{k}}$. Therefore, collecting (13,17,57) leads to

$$\begin{aligned} M_{k+1} = M_{k,i_k}/\rho _{_{\!d}}\le \frac{\sqrt{\rho _{\! u}} \rho _{\! u}}{\rho _{_{\!d}}}\cdot M(\nu ,\delta _{k,i_{k}}) =\frac{\sqrt{\rho _{\! u}} \rho _{\! u}}{\rho _{_{\!d}}}\cdot M(\nu ,\delta _{k+1}). \end{aligned}$$

(58)

This means that for all $k\in {\mathbb {N}}$, we have

$$\begin{aligned} M_{k+1}\le \frac{1}{\rho _{_{\!d}}} \max \left\{ M_{k},\,\sqrt{\rho _{\! u}} \rho _{\! u}\cdot M(\nu ,\delta _{k+1})\right\} . \end{aligned}$$

(59)

Since $\delta _{k+1} = \beta _{k+1}/{(k+1)}$ with $\{\beta _k\}_{k\in {\mathbb {N}}}$ being decreasing (cf.(17)), it follows that $\delta _{k+1}\le \delta _{\ell +1}$ and $M(\nu ,\delta _{\ell +1})\le M(\nu ,\delta _{k+1})$ for all $0\le \ell \le k$, which together with (59) indicates the estimate

$$\begin{aligned} \begin{aligned} M_{k+1}\le {}&\frac{1}{\rho _{_{\!d}}} \max \left\{ \frac{1}{\rho _{_{\!d}}}\max \left\{ M_{k-1},\,\sqrt{\rho _{\! u}} \rho _{\! u}\cdot M(\nu ,\delta _{k})\right\} ,\,\sqrt{\rho _{\! u}} \rho _{\! u}\cdot M(\nu ,\delta _{k+1})\right\} \\ \le {}&\frac{1}{\rho _{_{\!d}}}\max \left\{ \frac{M_{k-1}}{\rho _{_{\!d}}},\,\sqrt{\rho _{\! u}} \rho _{\! u}\cdot M(\nu ,\delta _{k+1})\right\} \\ \le {}&\cdots \le \frac{1}{\rho _{_{\!d}}}\max \left\{ \frac{M_{0}}{\rho _{_{\!d}}^{k}},\,\sqrt{\rho _{\! u}} \rho _{\! u}\cdot M(\nu ,\delta _{k+1})\right\} . \end{aligned} \end{aligned}$$

This proves the desired result (18).

Then, let us verify (19). Observing that $M_{k+1} = M_{k,i_k} /\rho _{_{\!d}}= \rho _{\! u}^{i_k}M_{k}/\rho _{_{\!d}}$, we have

$$\begin{aligned} i_k = \log _{\rho _{\!u}}\frac{\rho _{_{\!d}}M_{k+1}}{M_{k}} \quad \Longrightarrow \quad \sum _{j=0}^{k}i_j= \log _{\rho _{\! u}}\frac{\rho _{_{\!d}}^{k+1}M_{k+1}}{M_{0}}. \end{aligned}$$

Invoking the estimate of $M_{k+1}$ gives

$$\begin{aligned} \begin{aligned} \sum _{j=0}^{k}i_j \le {}&\max \left\{ 0, \frac{3}{2}+k\log _{\rho _{\!u}}\!\rho _{_{\!d}}+\log _{\rho _{\! u}}\frac{M(\nu ,\delta _{k+1})}{M_{0}}\right\} \\ \le {}&\frac{3}{2}+k\log _{\rho _{\!u}}\!\rho _{_{\!d}}+\left|{\log _{\rho _{\! u}}\frac{M(\nu ,\delta _{k+1})}{M_{0}}} \right|. \end{aligned} \end{aligned}$$

Since $M(\nu ,\delta _{k+1}) = \delta _{k+1}^{\frac{\nu -1}{\nu +1}}[M_\nu (h)]^{\frac{2}{\nu +1}}$, we obtain (19) and complete the proof of Lemma 3.1.

Derivation of the Reformulation (20)

Observing line 10 of Algorithm 1, we obtain (20d). Given $(x_k,v_k,\lambda _k)$ and $(\gamma _k,\beta _k,M_{k})$, $(y_k,x_{k+1},v_{k+1})$ are nothing but the output of Algorithm 2 with the input $(k,S_k,M_{k,i_k})$, where $S_k=\{x_k,v_k,\lambda _k,\beta _k,\gamma _k\}$. Hence, from lines 3 and 4, we obtain

$$\begin{aligned} \begin{aligned} {}&\frac{y_k-x_k}{\alpha _k} =v_k-y_k,\\ {}&\frac{x_{k+1}-x_k}{\alpha _k} ={} v_{k+1}-x_{k+1}, \end{aligned} \end{aligned}$$

which gives (20a) and (20c). Besides, we have

$$\begin{aligned} v_{k+1} = \mathop { {\mathop {\textrm{argmin}}\,}}\limits _{v\in Q}\left\{ g(v)+ \big \langle \nabla h(y_k)+A^\top \widetilde{\lambda }_k,v\big \rangle +\mu D_\phi (v,y_k)+ \frac{\gamma _k}{\alpha _k}D_\phi (v,v_k) \right\} , \end{aligned}$$

with $\widetilde{\lambda }_k= \lambda _k+\alpha _k/\beta _k(Av_k-b)$. The optimality condition reads as

$$\begin{aligned} \begin{aligned} 0\in {}&\nabla h(y_k)+\partial g(v_{k+1})+N_Q(v_{k+1})+A^\top \widetilde{\lambda }_k\\ {}&\quad +\mu \left[ \nabla \phi (v_{k+1})-\nabla \phi (y_k)\right] +\frac{\gamma _{k}}{\alpha _k}\left[ \nabla \phi (v_{k+1})-\nabla \phi (v_k)\right] . \end{aligned} \end{aligned}$$

After rearranging, we get (20b).

Proof of Lemma 5.1

1.1 The Case $\eta =\theta -1$

The estimate (47) becomes

$$\begin{aligned} \sqrt{\varphi (t)}\frac{y'(t)}{y(t)}+R\frac{y'(t)}{y^{\theta }(t)}\le -\sigma (t). \end{aligned}$$

(60)

Since $y(0) = 1$ and $y'(t)\le 0$, it holds that $0<y(t)\le 1$ for all $t\ge 0$. As $\varphi (t)$ is positive and nondecreasing, we obtain

$$\begin{aligned} \left( \sqrt{\varphi }\ln y\right) ' = \frac{\varphi '}{2\sqrt{\varphi }}\ln y + \sqrt{\varphi } \frac{y'}{y}\le \sqrt{\varphi } \frac{y'}{y}. \end{aligned}$$

Combining this with (60) gives

$$\begin{aligned} \begin{aligned} \left( \sqrt{\varphi (t)}\ln y(t)+\frac{R}{1-\theta }y^{1-\theta }(t)\right) '\le -\sigma (t), \end{aligned} \end{aligned}$$

and integrating over (0, t) leads to

$$\begin{aligned} \sqrt{\varphi (t)}\ln \frac{1}{y(t)}+\frac{R}{\theta -1}\left( y^{1-\theta }(t)-1\right) \ge \int _{0}^{t}\sigma (s)\,\textrm{d}s=\varSigma (t). \end{aligned}$$

(61)

Define

$$\begin{aligned} Y_1(t) := {}\exp \left( -\frac{\varSigma (t)}{2\sqrt{\varphi (t)}}\right) \quad \text {and}\quad Y_2(t) :={} \left( 1+\frac{\theta -1}{2R}\varSigma (t)\right) ^{\frac{1}{1-\theta }}. \end{aligned}$$

(62)

Then one finds that

$$\begin{aligned} \begin{aligned} {}&\sqrt{\varphi (t)}\ln \frac{1}{Y_1(t)}=\frac{1}{2}\varSigma (t),{} & {} Y_1(0) = 1,\\ {}&\frac{R}{\theta -1}\left( Y_2^{1-\theta }(t)-1\right) =\frac{1}{2}\varSigma (t),{} & {} Y_2(0) = 1. \end{aligned} \end{aligned}$$

This also implies

$$\begin{aligned} \sqrt{\varphi (t)}\ln \frac{1}{Y(t)}+\frac{R}{\theta -1}\left( Y^{1-\theta }(t)-1\right) \le \varSigma (t), \end{aligned}$$

(63)

where $Y(t):= Y_1(t)+Y_2(t)$. For fixed $t>0$, the function

$$\begin{aligned} v\longmapsto \sqrt{\varphi (t)}\ln \frac{1}{v}+\frac{R}{\theta -1}\left( v^{1-\theta }-1\right) \end{aligned}$$

is monotonously decreasing in terms of $v\in (0,\infty )$. Collecting (61,63) yields that

$$\begin{aligned} y(t)\le Y(t)=\exp \left( -\frac{\varSigma (t)}{2\sqrt{\varphi (t)}}\right) + \left( 1+\frac{\theta -1}{2R}\varSigma (t)\right) ^{\frac{1}{1-\theta }}. \end{aligned}$$

This completes the proof of Lemma 5.1 with $\eta =\theta -1$.

1.2 The Case $\eta <\theta -1$

The proof is in line with the previous case. With some elementary calculus computations, the estimate (61) now becomes

$$\begin{aligned} G(\varphi (t),y(t))\ge \varSigma (t), \end{aligned}$$

where $G:(0,\infty )\times (0,\infty )\rightarrow \,{{\mathbb {R}}}$ is defined by

$$\begin{aligned} G(w,v): = \frac{\sqrt{w}}{\theta -\eta -1}(v^{\eta +1-\theta }-1)+\frac{R}{\theta -1}(v^{1-\theta }-1), \end{aligned}$$

for all $w,\,v>0$. In addition to $Y_2(t)$ defined in (62), we need

$$\begin{aligned} Y_3(t):= \left( 1+\frac{\theta -\eta -1}{2\sqrt{\varphi (t)}}\varSigma (t)\right) ^{\frac{1}{\eta +1-\theta }}. \end{aligned}$$

Since $G(w,\cdot )$ is monotonously decreasing and

$$\begin{aligned} G(\varphi (t),Y_2(t)+Y_3(t))\le \varSigma (t)\le G(\varphi (t),y(t)), \end{aligned}$$

we obtain

$$\begin{aligned} y(t)\le \left( 1+\frac{\theta -1}{2R}\varSigma (t)\right) ^{\frac{1}{1-\theta }}+ \left( 1+\frac{\theta -\eta -1}{2\sqrt{\varphi (t)}}\varSigma (t)\right) ^{\frac{1}{\eta +1-\theta }}. \end{aligned}$$

This concludes the proof of Lemma 5.1 with $\eta <\theta -1$.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Luo, H. A Universal Accelerated Primal–Dual Method for Convex Optimization Problems. J Optim Theory Appl 201, 280–312 (2024). https://doi.org/10.1007/s10957-024-02394-6

Download citation

Received: 08 December 2022
Accepted: 22 January 2024
Published: 01 March 2024
Issue Date: April 2024
DOI: https://doi.org/10.1007/s10957-024-02394-6

Keywords

Mathematics Subject Classification

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

A Universal Accelerated Primal–Dual Method for Convex Optimization Problems

Abstract

Similar content being viewed by others

A fast dual proximal-gradient method for separable convex optimization with linear coupled constraints

Accelerated Primal-Dual Gradient Descent with Linesearch for Convex, Nonconvex, and Nonsmooth Optimization Problems

Two Convergent Primal–Dual Hybrid Gradient Type Methods for Convex Programming with Linear Constraints

1 Introduction

2 Preliminary

2.1 Notations

2.2 Bregman Divergence

Lemma 2.1

2.3 Hölder Continuity

Proposition 2.1

3 Main Algorithm

Assumption 3.1

Remark 3.1

3.1 Line Search

Remark 3.2

Remark 3.3

Lemma 3.1

Proof

3.2 Time Discretization Interpretation

3.3 A Universal Estimate

Lemma 3.2

Proof

Theorem 3.1

Proof

Remark 3.4

Lemma 3.3

Proof

Remark 3.5

Remark 3.6

Remark 3.7

4 Proof of Lemma 3.2

4.1 Estimate of \({\mathbb {I}}_2\)

4.2 Estimate of \({\mathbb {I}}_3\)

4.3 Proof of (27)

5 Proof of Lemma 3.3

5.1 A Differential Inequality

Lemma 5.1

Proof

5.2 Proof of (35)

5.3 Proof of (36)

6 Numerical Examples

6.1 Matrix Game

6.2 Regularized Matrix Game Problem

6.3 Continuous Steiner Problem

6.4 Basis Pursuit Problem

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendices

Proof of Lemma 3.1

Derivation of the Reformulation (20)

Proof of Lemma 5.1

1.1 The Case \(\eta =\theta -1\)

1.2 The Case \(\eta <\theta -1\)

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification

Search

Navigation