Abstract
In this paper, we study the lower iteration complexity bounds for finding the saddle point of a strongly convex and strongly concave saddle point problem: \(\min _x\max _yF(x,y)\). We restrict the classes of algorithms in our investigation to be either pure first-order methods or methods using proximal mappings. For problems with gradient Lipschitz constants (\(L_x, L_y\) and \(L_{xy}\)) and strong convexity/concavity constants (\(\mu _x\) and \(\mu _y\)), the class of pure first-order algorithms with the linear span assumption is shown to have a lower iteration complexity bound of \(\Omega \,\left( \sqrt{\frac{L_x}{\mu _x}+\frac{L_{xy}^2}{\mu _x\mu _y}+\frac{L_y}{\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \), where the term \(\frac{L_{xy}^2}{\mu _x\mu _y}\) explains how the coupling influences the iteration complexity. Under several special parameter regimes, this lower bound has been achieved by corresponding optimal algorithms. However, whether or not the bound under the general parameter regime is optimal remains open. Additionally, for the special case of bilinear coupling problems, given the availability of certain proximal operators, a lower bound of \(\Omega \left( \sqrt{\frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \ln (\frac{1}{\epsilon })\right) \) is established under the linear span assumption, and optimal algorithms have already been developed in the literature. By exploiting the orthogonal invariance technique, we extend both lower bounds to the general pure first-order algorithm class and the proximal algorithm class without the linear span assumption. As an application, we apply proper scaling to the worst-case instances, and we derive the lower bounds for the general convex concave problems with \(\mu _x = \mu _y = 0\). Several existing results in this case can be deduced from our results as special cases.
1 Introduction
In this paper, we establish a lower iteration complexity bound for the first-order methods to solve the following min-max saddle point problem
which is of fundamental importance in, e.g., game theory [31, 37], image deconvolution problems [9], parallel computing [39], adversarial training [4, 12], and statistical learning [1].
To proceed, let us introduce the following two problem classes.
Definition 1.1
(Problem class \(\mathcal {F}(L_{x},L_{y},L_{xy},\mu _x,\mu _y)\)) \(F(\cdot ,y)\) is \(\mu _x\)-strongly convex for any fixed y and \(F(x,\cdot )\) is \(\mu _y\)-strongly concave for any fixed x. Overall, the function F is smooth and \(\nabla F\) satisfies the following Lipschitz continuity condition
We shall remark here that the constants in (2) may also be understood as the bounds on the different blocks of the Hessian matrix \(\nabla ^2 F(x,y)\) if F is twice continuously differentiable. That is,
However, throughout this paper we do not assume either \(F(\cdot , y)\) or \(F(x,\cdot )\) is second-order differentiable.
The second problem class is the bilinear saddle point model:
Definition 1.2
(Bilinear class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\)) In this special class, the problems are written as
where f(x) and g(y) are both lower semi-continuous with f(x) being \(\mu _x\)-strongly convex and g(y) being \(\mu _y\)-strongly convex. The coupling matrix A satisfies \(\Vert A\Vert _2\le L_{xy}\).
For this special model class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\), we assume the availability of the following prox-operations:
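These prox-operations are the standard proximal mappings: for step sizes \(\gamma ,\sigma >0\),
\[
\mathbf {prox}_{\gamma f}(u) := \mathop {\mathrm {argmin}}_x \left\{ f(x) + \frac{1}{2\gamma }\Vert x-u\Vert ^2\right\} , \qquad \mathbf {prox}_{\sigma g}(v) := \mathop {\mathrm {argmin}}_y \left\{ g(y) + \frac{1}{2\sigma }\Vert y-v\Vert ^2\right\} ,
\]
which are well defined since f and g are strongly convex and lower semi-continuous.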
In this paper we shall establish the lower iteration complexity bound
and
with the proximal oracles (4). In particular, we first establish these lower bounds for the pure first-order and general proximal algorithm classes under the linear span assumption. Later on, we generalize the results to more general algorithm classes without the linear span assumption through the orthogonal invariance technique introduced in [25]. For more detailed applications of the orthogonal invariance technique in lower bound derivations, the interested readers are referred to [7, 8, 33].
As an application of the above bound, we apply proper scaling to the worst-case instances and show that the above result implies several existing lower bounds for general convex-concave problems with bounded saddle point solutions. Specifically, we have
and
For the above two lower bounds, we remark that under specific parameter regimes, the first bound is known; see [25] for the case with \(L_x = L_y = L_{xy}\) and [33] for the case with \(L_y = 0\). However, to the best of our knowledge, the first bound under a general set of parameters, as well as the second bound for the bilinear problem class, is not known. Similar reductions can also be performed for the problem classes with only one of \(\mu _x\) and \(\mu _y\) equal to 0, for which the lower bounds have already been discovered in [33].
Such lower iteration complexity results shed light on understanding the performance of the algorithms designed for min-max saddle point models. There are numerous results in the literature prior to ours. As a special case of (1), the lower bound results for the convex minimization problem with \(F(x,y) = f(x)\) have been well studied in the past decades. For convex problems, Nesterov’s accelerated gradient method achieves iteration complexities of \({\mathcal {O}}(\sqrt{L/\epsilon })\) for L-smooth convex problems and \({\mathcal {O}}\left( \sqrt{\frac{L}{\mu }}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \) for L-smooth and \(\mu \)-strongly convex problems, respectively, and both are shown to match the lower complexity bound for first-order methods; see [29].
However, for the min-max saddle-point models, the situation is more subtle. Due to the convex-concave nature of F, the vector field
is monotone. Hence the convex-concave saddle point problem is often studied as a subclass of the variational inequality problems (VIP); see e.g. [16, 22, 24, 28, 30, 36] and references therein. Although there have been plenty of studies on the variational inequality model, the roles played by the different Lipschitz constants on the different blocks of variables have not been fully explored in the literature. In other words, one would often denote by L an overall Lipschitz constant of the vector field G, which is of the order \(\Theta (\max \{L_x,L_y,L_{xy}\})\) in our case, and set \(\mu \) to be the strong monotonicity parameter of G, which is of the order \(\Theta (\min \{\mu _x,\mu _y\})\) in our case, with no further distinctions made among the parameters. Hence the considered problems are special instances in \(\mathcal {F}(L,L,L,\mu ,\mu )\). Under such settings, many algorithms, including the mirror-prox algorithm [24], the extra-gradient methods [17, 23], and the accelerated dual extrapolation [30], have all achieved the iteration complexity of \({\mathcal {O}}\left( \frac{L}{\mu }\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \), and this complexity is shown to be optimal for first-order methods in solving the problem class \(\mathcal {F}(L,L,L,\mu ,\mu )\); see [26]. However, under the more general parameter regime of \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\), these methods are not optimal.
For example, Nesterov’s accelerated dual extrapolation method [30] has a complexity of \({\mathcal {O}}\left( \frac{\max \{L_x,L_{xy},L_y\}}{\min \{\mu _x,\mu _y\}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \); even if the algorithm is modified carefully, one can only guarantee a complexity of \({\mathcal {O}}\left( \sqrt{\frac{L_x^2}{\mu _x^2} + \frac{L_{xy}^2}{\mu _x\mu _y} + \frac{L_y^2}{\mu _y^2}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \), neither of which matches the lower bound provided in this paper. More recently, tighter upper bounds have been derived. In [19], the authors consider the problem class \(\mathcal {F}(L,L,L,\mu _x,\mu _y)\) and achieve an upper bound of \({\mathcal {O}}\left( \sqrt{\frac{L^2}{\mu _x\mu _y}}\cdot \ln ^3\left( \frac{1}{\epsilon }\right) \right) \), which matches our lower bound when \(L_{x} = L_{y} = L_{xy}=L\) up to a logarithmic term. In [38], the authors consider the general problem class \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\); the proposed algorithm achieves an upper bound of \({\mathcal {O}}\left( \sqrt{\frac{L_x}{\mu _x} + \frac{L\cdot L_{xy}}{\mu _x\mu _y} + \frac{L_y}{\mu _y}}\cdot \ln ^3\left( \frac{L^2}{\mu _x\mu _y}\right) \cdot \ln \left( \frac{1}{\epsilon }\right) \right) \), where \(L = \max \{L_x,L_y,L_{xy}\}\), which almost matches our lower bound for the general problem class. Despite the gap for the general problem class \(\mathcal {F}(L_x,L_{xy},L_y,\mu _x,\mu _y)\), given the availability of proximal operators, the authors of [9, 10] have derived an algorithm for the problem class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\) with complexity \({\mathcal {O}}\left( \sqrt{\frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \). We will prove in this paper that this result matches the theoretical lower complexity bound for its problem and algorithm classes, and is hence optimal.
For the bilinear problem (3), when f is smooth and convex and \(g(y) = b^\top y\) is linear, the problem is equivalent to the following convex optimization problem
Without using the projection onto the hyperplane \(\{x:A^\top x=b\}\), which requires a matrix inversion, pure first-order methods achieve \({\mathcal {O}}(1/\epsilon )\) complexity even when f is strongly convex; see e.g. [11, 32, 40]. These iteration complexity bounds are shown to match the lower bound provided in [33]. For more details on the lower and upper bounds for this formulation, the interested readers are referred to [33]. Finally, for the bilinear coupling problem (3), the authors of [13] show that a lower bound of \({\mathcal {O}}\left( \sqrt{\frac{L_{xy}^2 + \mu _x\mu _y}{\mu _{xy}^2 + \mu _x\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \) can be derived, where \(\mu _{xy}\) stands for the minimum singular value of the coupling matrix A. It is interesting that this result covers the linear convergence phenomenon for the pure bilinear saddle point problem [5], where \(f(x)\equiv g(y)\equiv 0\). Another remark is that, due to the special construction of the worst-case instance and algorithm class, [13] cannot characterize the impact of \(L_x\) and \(L_y\), nor the lower bound for the proximal algorithm class.
Besides the studies on first-order algorithms, there are also studies on higher-order methods. For example, in [3] lower iteration complexity bounds for second-order methods are considered, and in [2, 27] lower iteration complexity bounds are presented for general higher-order (tensor) methods. For smooth nonconvex optimization, the iteration complexity lower bounds for first-order methods are considered in [8], while those for higher-order methods are considered in [7].
Another line of research concerns the nonconvex/nonconcave min-max saddle point problems; see [14, 15] and the references therein. To guarantee convergence, additional structures are often needed. For example, if one assumes that the solutions of the problem satisfy the Minty variational inequality [18], then convergent algorithms can be constructed. Another important situation is when F is concave in y. In that case, convergence to a stationary solution with an iteration complexity guarantee is possible; see e.g. [21]. For more literature on this type of problem, we refer the interested readers to [20] and the references therein.
Organization This paper is organized as follows. In Sect. 2, we introduce two different algorithm classes (with or without proximal-operators). In Sect. 3, we construct a worst-case example for problem class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\) and derive the corresponding lower iteration complexity bound for the algorithm class allowing proximal-operators. An optimal algorithm is discussed in this case. In Sect. 4, we construct the worst-case example for problem class \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\) and establish the corresponding lower complexity bound for the first-order method (without any proximal oracles). Optimal algorithms under several special parameter regimes are discussed. Finally, we conclude the paper in Sect. 6.
2 The first-order algorithm classes
In this section, we discuss some preliminaries for the strongly convex and strongly concave saddle point problem. Then, we shall introduce two algorithm classes to set the ground for our discussion, and we shall also note specific known algorithms as representative members in those algorithm classes.
2.1 Primal function, dual function, and the duality gap
First, we define \(\Phi (\cdot )\) to be the primal function and \(\Psi (\cdot )\) to be the dual function of the saddle point problem \(\min _x \max _y F(x,y)\), respectively, with the following definitions
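In standard form, these definitions read
\[
\Phi (x) := \max _y F(x,y), \qquad \Psi (y) := \min _x F(x,y).
\]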
As the maximum of a class of \(\mu _x\)-strongly convex functions, \(\Phi (x)\) is a \(\mu _x\)-strongly convex function. Similarly, \(\Psi (y)\) is a \(\mu _y\)-strongly concave function. We define the duality gap as
Suppose the unique solution of this min-max problem is \((x^*,y^*)\). By the strong duality theorem, we know for any x and y it holds that
Together with the \(\mu _x\)-strong convexity of \(\Phi \) and the \(\mu _y\)-strong concavity of \(\Psi \), we further have
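Writing the duality gap as \(\mathrm {Gap}(x,y) := \Phi (x) - \Psi (y)\), the resulting bound (inequality (6)) takes the standard form
\[
\mathrm {Gap}(x,y) \;\ge \; \frac{\mu _x}{2}\Vert x-x^*\Vert ^2 + \frac{\mu _y}{2}\Vert y-y^*\Vert ^2,
\]
which follows from \(\Phi (x)\ge \Phi (x^*) = \Psi (y^*)\ge \Psi (y)\) together with the strong convexity of \(\Phi \) and the strong concavity of \(\Psi \).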
Now, suppose that \((\tilde{x}_k,\tilde{y}_k)\) is the approximate solution generated after k iterations of an algorithm. Our aim is to lower bound the distance between \((\tilde{x}_k,\tilde{y}_k)\) and \((x^*,y^*)\). By (6), this also yields a lower iteration complexity bound in terms of the duality gap.
2.2 Proximal algorithm class
First, let us consider the bilinearly coupled problem class (3) as introduced in Definition 1.2:
For this special problem class, let us consider the lower iteration complexity bound for the algorithm class where the proximal oracles (4) are available.
Definition 2.1
(Proximal algorithm class) In each iteration, the iterate sequence \(\{(x^{k},y^{k})\}_{k=0,1,...}\) is generated so that \((x^{k},y^{k})\in \mathcal {H}_x^{k} \times \mathcal {H}_y^{k}\). These subspaces are generated with \(\mathcal {H}_x^{0} = \mathrm {Span}\{x^0\}, \mathcal {H}_y^{0} = \mathrm {Span}\{y^0\}\) and
We remark that when applying the proximal oracles, it is not necessary to use the most recent iterate \(x^k\) as the proximal center. Neither is it necessary to evaluate the gradients of the coupling term (namely the \(A^\top x\) and Ay terms) at the current iterate. Instead, the algorithm class allows the usage of any combination of points in the historical search space. We shall also remark that the algorithm class in Definition 2.1 does not necessarily update x and y at the same time, because setting \(x^{k+1}=x^k\) or \(y^{k+1} = y^k\) also satisfies Definition 2.1. Thus this algorithm class also includes the methods that alternately update x and y. Below is a sample algorithm in this class.
Example 2.1
(Algorithm 3 in [9]) Initialize with \(\gamma = \frac{1}{L_{xy}}\sqrt{\frac{\mu _y}{\mu _x}}\), \(\sigma = \frac{1}{L_{xy}}\sqrt{\frac{\mu _x}{\mu _y}}\), and \(\theta = \frac{L_{xy}}{2\sqrt{\mu _x\mu _y} + L_{xy}}.\) Set \(\tilde{x}^0 = x^0\). Then the algorithm proceeds as
It can be observed that this algorithm updates the blocks in alternating order; by slightly shifting the indices, it can be written in the form of (7) in Definition 2.1. The complexity of this method is \({\mathcal {O}}\left( \sqrt{\frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \).
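To make the recursion concrete, the following sketch runs primal-dual updates with the step sizes of Example 2.1 on a small, hypothetical quadratic instance of (3). The data A and b and the quadratic choices of f and g are for illustration only, and the update order follows the common presentation of [9], so the indexing may differ slightly from the displayed recursion.

```python
import numpy as np

# Hypothetical small instance of (3): f(x) = mu_x/2 ||x||^2 and
# g(y) = mu_y/2 ||y||^2 + b^T y, so F(x, y) = f(x) + x^T A y - g(y).
mu_x, mu_y = 1.0, 1.0
A = np.array([[0.5, 0.0], [0.0, 0.3]])
b = np.array([1.0, -1.0])
L_xy = np.linalg.norm(A, 2)

# Step sizes and extrapolation parameter from Example 2.1
gamma = np.sqrt(mu_y / mu_x) / L_xy
sigma = np.sqrt(mu_x / mu_y) / L_xy
theta = L_xy / (2.0 * np.sqrt(mu_x * mu_y) + L_xy)

# Closed-form proximal mappings for the quadratic f and g above
def prox_f(u, t):
    return u / (1.0 + t * mu_x)

def prox_g(v, t):
    return (v - t * b) / (1.0 + t * mu_y)

x = np.zeros(2); y = np.zeros(2); x_tilde = np.zeros(2)
for _ in range(2000):
    y = prox_g(y + sigma * (A.T @ x_tilde), sigma)  # prox-ascent step on y
    x_new = prox_f(x - gamma * (A @ y), gamma)      # prox-descent step on x
    x_tilde = x_new + theta * (x_new - x)           # extrapolation on the x block
    x = x_new

# Reference saddle point: mu_x x* + A y* = 0 and A^T x* - mu_y y* - b = 0
K = np.block([[mu_x * np.eye(2), A], [A.T, -mu_y * np.eye(2)]])
sol = np.linalg.solve(K, np.concatenate([np.zeros(2), b]))
x_star, y_star = sol[:2], sol[2:]
```

On this instance the iterates contract linearly toward \((x^*,y^*)\), consistent with the stated complexity.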
2.3 Pure first-order algorithm class
In contrast to the previous section, here we consider the more general problem class \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\):
For such problems, we refer to the algorithm class as the pure first-order methods, meaning that there is no proximal oracle in the design of algorithms in this class.
Definition 2.2
(Pure first-order algorithm class) In each iteration, the sequence \(\{(x^{k},y^{k})\}_{k=0,1,...}\) is generated so that \((x^{k},y^{k})\in \mathcal {H}_x^{k} \times \mathcal {H}_y^{k}\), with \(\mathcal {H}_x^{0} = \mathrm {Span}\{x^0\}\), \(\mathcal {H}_y^{0} = \mathrm {Span}\{y^0\}\), and
Similar to our earlier comments on the proximal algorithm class, algorithms in this class may take gradients at any combination of points in the historical search space. The algorithm class also includes the methods that alternately update x and y, and even the double loop algorithms that optimize one side until a certain accuracy is achieved before switching to the other side. At that level of generality, it indeed accommodates many updating schemes. To illustrate this point, let us present below some sample algorithms in this class.
The first example is a double loop scheme, in which the primal function \(\Phi (x)\) is optimized approximately. Specifically, let \(y^*(x) = \mathop {\hbox {argmax}}\limits _y F(x,y)\); then by Danskin’s theorem, \(\nabla \Phi (x) = \nabla _x F(x,y^*(x))\); see e.g. [6, 34]. Therefore, one can apply Nesterov’s accelerated gradient method to minimize \(\Phi (x)\). The double loop scheme performs this procedure approximately.
Example 2.2
(Double loop schemes, [35]) Denote \(\alpha _1 = \sqrt{\frac{\mu _x}{L_{\Phi ,x}}}\) and \(\alpha _2 = \sqrt{\frac{\mu _y}{L_y}}\), where \(L_{\Phi ,x} = L_x + \frac{L_{xy}^2}{\mu _y}\) is the Lipschitz constant of \(\nabla \Phi (x)\) (see [35]). Given \((x^0, y^0)\), define \(\bar{x}^0 = x^0\); the double loop scheme then works as follows:
where the point \(y^k\) is generated by an inner loop of accelerated gradient iterations
Then, set \(y^{k}:=w^{T_2+1}\) to be the last iterate of the inner loop.
For simplicity, we have applied a specific acceleration scheme [29] which does not work for non-strongly convex problems. In principle, the FISTA scheme can also be used. For this scheme, with properly chosen \(T_1\) and \(T_2\), the iteration complexity \({\mathcal {O}}\left( \sqrt{\frac{L_x}{\mu _x} + \frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \sqrt{\frac{L_y}{\mu _y}}\ln ^2\left( \frac{1}{\epsilon }\right) \right) \) is achievable.
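The nesting structure above can be sketched on a small, hypothetical quadratic instance. For transparency, plain gradient ascent replaces the accelerated inner loop and the outer momentum is dropped, so this illustrates only the double loop structure, not the stated complexity.

```python
import numpy as np

# Hypothetical instance: F(x, y) = mu_x/2 ||x||^2 + x^T A y - mu_y/2 ||y||^2 - b^T y
mu_x, mu_y = 1.0, 1.0
A = np.array([[0.8, 0.0], [0.0, 0.5]])
b = np.array([1.0, -1.0])

def inner_max(x_bar, y, steps=50):
    # Inner loop: gradient ascent on the mu_y-strongly concave problem
    # max_y F(x_bar, y); the step 1/mu_y is valid here because the inner
    # curvature is exactly mu_y for this instance (the scheme in [35] is accelerated)
    for _ in range(steps):
        y = y + (1.0 / mu_y) * (A.T @ x_bar - mu_y * y - b)
    return y

# Lipschitz constant of grad Phi for this instance: L_x + L_xy^2/mu_y with L_x = mu_x
L_phi = mu_x + np.linalg.norm(A, 2) ** 2 / mu_y

x, y = np.zeros(2), np.zeros(2)
for _ in range(200):
    y = inner_max(x, y)                          # approximate y*(x)
    x = x - (1.0 / L_phi) * (mu_x * x + A @ y)   # gradient step on Phi via Danskin

# Reference saddle point from mu_x x* + A y* = 0, A^T x* - mu_y y* - b = 0
K = np.block([[mu_x * np.eye(2), A], [A.T, -mu_y * np.eye(2)]])
sol = np.linalg.solve(K, np.concatenate([np.zeros(2), b]))
x_star, y_star = sol[:2], sol[2:]
```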
In the following, we also list examples of several single loop algorithms, including the gradient descent-ascent method (GDA), the extra-gradient (EG) method [17] (a special case of mirror-prox algorithm [24]), and the accelerated dual extrapolation (ADE) [30].
Example 2.3
(Single loop algorithms) Let \(L = \max \{L_x,L_y,L_{xy}\}, \mu = \min \{\mu _x,\mu _y\}\). Given the initial solution \((x^0,y^0)\), the algorithms proceed as follows:
where \(\eta _1 = {\mathcal {O}}\left( \frac{\mu }{L^2}\right) \), \(\eta _2={\mathcal {O}}\left( \frac{1}{L}\right) ,\eta _3 = {\mathcal {O}}\left( \frac{1}{L}\right) \). The iterative points \((\tilde{x}^{k+1},\tilde{y}^{k+1})\) in (ADE) are the same as that in (EG), except that \(\eta _2\) is replaced by \(\eta _3\).
The original update of the (ADE) algorithm is rather involved since it handles constraints. In the unconstrained case, it can be simplified to the current form, which is a mixture of (GDA) and (EG). The corresponding iteration complexity bounds are \({\mathcal {O}}\left( \frac{L^2}{\mu ^2}\ln \left( \frac{1}{\epsilon }\right) \right) \) for (GDA), and \({\mathcal {O}}\left( \frac{L}{\mu }\ln \left( \frac{1}{\epsilon }\right) \right) \) for both (EG) and (ADE).
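For concreteness, here is a minimal sketch of the (GDA) and (EG) updates on a hypothetical scalar instance \(F(x,y) = \frac{\mu }{2}x^2 + L_{xy}xy - \frac{\mu }{2}y^2 - by\); the step sizes follow the stated \({\mathcal {O}}(\cdot )\) scalings with illustrative constants, and the (ADE) variant is omitted since it only swaps \(\eta _2\) for \(\eta _3\).

```python
# Hypothetical scalar instance: F(x, y) = mu/2 x^2 + L_xy x y - mu/2 y^2 - b y
mu, L_xy, b = 1.0, 1.0, 1.0
grad_x = lambda x, y: mu * x + L_xy * y        # partial derivative in x
grad_y = lambda x, y: L_xy * x - mu * y - b    # partial derivative in y

# Saddle point from mu x* + L_xy y* = 0 and L_xy x* - mu y* - b = 0
x_star = L_xy * b / (mu**2 + L_xy**2)
y_star = -mu * b / (mu**2 + L_xy**2)

L = max(mu, L_xy)
eta1 = 0.5 * mu / L**2   # GDA step, following eta1 = O(mu / L^2)
eta2 = 0.5 / L           # EG step, following eta2 = O(1 / L)

# Gradient descent-ascent (GDA): simultaneous descent in x, ascent in y
x, y = 1.0, 1.0
for _ in range(500):
    x, y = x - eta1 * grad_x(x, y), y + eta1 * grad_y(x, y)
x_gda, y_gda = x, y

# Extra-gradient (EG): a half step to a midpoint, then a full step
# using the gradients evaluated at the midpoint
x, y = 1.0, 1.0
for _ in range(500):
    xm, ym = x - eta2 * grad_x(x, y), y + eta2 * grad_y(x, y)
    x, y = x - eta2 * grad_x(xm, ym), y + eta2 * grad_y(xm, ym)
x_eg, y_eg = x, y
```

Both runs converge linearly to the saddle point, with (EG) tolerating the larger step size, in line with the complexity comparison above.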
2.4 General deterministic algorithm classes without linear span structure
Although all the reviewed first-order methods satisfy the linear span property of the proximal algorithm class in Definition 2.1 or the pure first-order algorithm class in Definition 2.2, this does not exclude the possibility of deriving an algorithm that does not satisfy the linear span property. Therefore, we also define the general deterministic proximal algorithm class and the general deterministic pure first-order algorithm class as follows, whose iteration complexity lower bounds can be generalized from their linear span counterparts through the technique of adversary rotation.
Definition 2.3
(General proximal algorithm class) Consider the problem (3) in the problem class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\), and denote by \(\theta = (L_{xy},\mu _x,\mu _y)\) the corresponding problem parameters. Let algorithm \(\mathcal {A}\) belong to the general proximal algorithm class. Then \(\mathcal {A}\) consists of a sequence of deterministic mappings \(\{(\mathcal {A}_x^1, \mathcal {A}_y^1,\mathcal {A}_u^1,\mathcal {A}_v^1),(\mathcal {A}_x^2, \mathcal {A}_y^2,\mathcal {A}_u^2,\mathcal {A}_v^2),...\}\) such that the iterate sequence \(\{(x^{k},y^{k})\}_{k=0,1,...}\) and the output sequence \(\{(\tilde{x}^{k},\tilde{y}^{k})\}_{k=0,1,...}\) are generated by
where \(u^k = \mathcal {A}^k_u(\theta ; x^0,Ay^0,...,x^{k-1},Ay^{k-1})\), \(v^k = \mathcal {A}^k_v(\theta ; y^0,A^\top x^0,...,y^{k-1},A^\top x^{k-1})\), and \((x^0,y^0)\) is any given initial solution.
One remark is that the input of the proximal mapping \(\mathbf {prox}_{\gamma _kf}(\cdot )\) is constructed from the other inputs to \(\mathcal {A}_x^k\); i.e., there could be another deterministic mapping \(\mathcal {A}_u^k\) that generates a vector \(u^k = \mathcal {A}^k_u(\theta ; x^0,Ay^0,...,x^{k-1},Ay^{k-1})\), and then \(\mathbf {prox}_{\gamma _kf}(u^k)\) is passed to the mapping \(\mathcal {A}_x^k\). This \(\mathcal {A}^k_u\) does not need to be linear. The situation for \(v^k\) and \(\mathcal {A}^k_v\) is similar. Analogously to the general proximal algorithm class, the general pure first-order algorithm class is defined as follows.
Definition 2.4
(General pure first-order algorithm class) Consider the problem (1) in the problem class \(\mathcal {F}(L_x, L_y, L_{xy},\mu _x,\mu _y)\), and denote by \(\theta = (L_x, L_y, L_{xy},\mu _x,\mu _y)\) the corresponding problem parameters. Let algorithm \(\mathcal {A}\) belong to the general pure first-order algorithm class. Then \(\mathcal {A}\) consists of a sequence of deterministic mappings \(\{\mathcal {A}_x^1, \mathcal {A}_y^1, \mathcal {A}_x^2, \mathcal {A}_y^2,...\}\) such that the iterate sequence \(\{(x^{k},y^{k})\}_{k=0,1,...}\) and the output sequence \(\{(\tilde{x}^{k},\tilde{y}^{k})\}_{k=0,1,...}\) are generated by
given any initial solution \((x^0,y^0)\).
A remark is that the gradients \(\nabla _x F(\cdot ,\cdot )\) and \(\nabla _y F(\cdot ,\cdot )\) do not actually need to be taken at the previous iterates \(\{(x^0,y^0),(x^1,y^1),...,(x^{k-1},y^{k-1})\}\). Similar to the proximal case, they can also be taken at some other points \(\{(u^0_{k-1},v^0_{k-1}),(u^1_{k-1},v^1_{k-1}),...,(u^{k-1}_{k-1},v^{k-1}_{k-1})\}\) that are generated by some mappings \(\{(\mathcal {A}_u^{k-1,0},\mathcal {A}_v^{k-1,0}),(\mathcal {A}_u^{k-1,1},\mathcal {A}_v^{k-1,1}),...,(\mathcal {A}_u^{k-1,k-1},\mathcal {A}_v^{k-1,k-1})\}\). The reason that we do not consider this more general form is twofold. First, the simpler form in Definition 2.4 already covers all the discussed algorithms. Second, this more general form actually shares the same iteration complexity lower bound, despite the technical complications involved. Therefore, in this paper we shall only include the gradients at the past iterates as the input to the algorithm.
3 Lower bound for proximal algorithms
3.1 The worst-case instance
Let us construct the following bilinearly coupled min-max saddle point problem:
where b is a vector to be determined later, and the coupling matrix A (hence \(A^2\) and \(A^4\)) is defined as follows:
Note that \(A^\top =A\) and \(\Vert A\Vert _2 \le 2\). Therefore (12) is an instance in the problem class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\). It is worth noting that the example (12) is the same as that in Proposition 2 of [13], a parallel work focused on the pure first-order algorithm class. Here, we use the same example to elaborate the lower bound of the proximal methods over the general bilinear coupling class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\), as well as to provide a warmup for the discussion of the more complex problem class \(\mathcal {F}(L_{x},L_{y},L_{xy},\mu _x,\mu _y)\).
Denote \(e_i\) to be the i-th unit vector, which has 1 at the i-th component and 0 elsewhere. Then by direct calculation, one can check that \(A^2\) satisfies the following zero-chain property (see Chapter 2 of [29]).
Proposition 3.1
(Zero-chain property) For any vector \(v\in \mathbb {R}^n\), if \(v\in \mathrm {Span}\{e_i:i\le k\}\) for some \(1\le k\le n-1\), then \(A^2v\in \mathrm {Span}\{e_i:i\le k+1\}\).
This means that if v only has nonzero elements at the first k entries, then \(A^2v\) will have at most one more nonzero entry at the \((k+1)\)-th position.
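The mechanism behind Proposition 3.1 is easy to check numerically. The exact matrix A is the one defined in (13); the snippet below uses a generic symmetric tridiagonal matrix B as a stand-in for \(A^2\), which suffices to illustrate that multiplying a vector supported on the first k coordinates grows the support by at most one entry.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3

# Generic symmetric tridiagonal stand-in B for A^2 (the exact A is defined in (13))
d = rng.standard_normal(n)
o = rng.standard_normal(n - 1)
B = np.diag(d) + np.diag(o, 1) + np.diag(o, -1)

# v supported on the first k coordinates only
v = np.zeros(n)
v[:k] = rng.standard_normal(k)

# B v can have at most one extra nonzero entry, at position k+1
w = B @ v
```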
For problem (12), the proximal operators in (7) can be written explicitly:
Similarly, for the y block, we also have
Let us assume without loss of generality that the initial point is \(x^0 = y^0 = 0\) (so that \(\mathcal {H}_x^0 = \mathcal {H}_y^0 = \{0\}\)). Directly substituting (14) and (15) into Definition 2.1 yields
We formally summarize this observation below:
Lemma 3.2
For problem (12) and any \(k \in \mathbb {N}\), if the iterates are generated so that \((x^k,y^k)\in \mathcal {H}_x^k\times \mathcal {H}_y^k\), with \(\mathcal {H}_x^k\) and \(\mathcal {H}_y^k\) defined in Definition 2.1, then based on (14) and (15) the search subspaces satisfy
3.2 Lower bounding the duality gap
Let us lower bound the dual gap, which in turn lower bounds the whole duality gap. To this end, let us first write down the dual function of problem (12) as
For this \(\mu _y\)-strongly concave dual function, we can characterize the optimal solution \(y^*\) directly by its KKT condition \(\nabla \Psi (y^*) = 0\). However, the exact solution \(y^*\) does not have a simple and clear form, so we choose to characterize it by an approximate solution \(\hat{y}^*\).
Lemma 3.3
(Approximate optimal solution) Let us assign the value of b as \(b := -\frac{L_{xy}^2}{4\mu _x}e_1\). Denote \(\alpha := \frac{4\mu _x\mu _y}{L_{xy}^2}\), and let \(q = \frac{1}{2}\left( (2+\alpha ) - \sqrt{(2+\alpha )^2 - 4}\right) \in (0,1)\) be the smallest root of the quadratic equation \(1 - (2+\alpha )q + q^2 = 0\). Then, an approximate optimal solution \(\hat{y}^*\) can be constructed as
The approximation error can be bounded by
where \(\hat{y}^*_i\) is the i-th element of \(\hat{y}^*\). Note that \(q<1\) and the lower bound is dimension-independent, hence we are free to choose n to make the approximation error arbitrarily small.
Proof
First, let us substitute the value of b into the KKT system \(\nabla \Psi (y^*)=0\); after slightly rearranging and scaling the terms, we get
Using the definition of \(\alpha \) and b, the equation becomes
Substituting the formula of \(A^2\) in (13), we expand the above equation as
By direct calculation, we can check that \(\hat{y}^*\) satisfies the first \(n-1\) equations of the KKT system (19). The last equation, however, is violated, with a residual of size \(q^{n+1}/(1-q)\). In detail,
This indicates that \(\hat{y}^* - y^* = \frac{q^{n+1}}{1-q}\cdot (A^2 + \alpha I)^{-1} e_n\). Since \(\alpha ^{-1}I\succeq (A^2 + \alpha I)^{-1}\succ 0\), the approximation error is bounded as in (18). \(\square \)
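The algebra in Lemma 3.3 can be sanity-checked numerically: for any \(\alpha >0\) (the value below is arbitrary), the stated q is the smaller root of \(1-(2+\alpha )q+q^2=0\) in (0, 1), the geometric sequence \(q^i\) satisfies the associated three-term recurrence, and q agrees with the closed form that appears later in Theorem 3.5.

```python
import math

alpha = 0.3   # stands for 4 mu_x mu_y / L_xy^2; any positive value works
q = 0.5 * ((2 + alpha) - math.sqrt((2 + alpha) ** 2 - 4))

# q solves 1 - (2 + alpha) q + q^2 = 0 ...
residual = 1 - (2 + alpha) * q + q ** 2

# ... hence q^{i-1} - (2 + alpha) q^i + q^{i+1} = 0 for every i >= 1
i = 5
recurrence = q ** (i - 1) - (2 + alpha) * q ** i + q ** (i + 1)

# Equivalent closed form in terms of m = mu_x mu_y / L_xy^2 (as in Theorem 3.5)
m = alpha / 4
q_closed = 1 + 2 * m - 2 * math.sqrt(m * m + m)
```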
Note that in Lemma 3.3, we have chosen \(b\propto e_1\). By the zero-chain property in Proposition 3.1 and Lemma 3.2, we can verify that the subspaces \(\mathcal {H}_y^{2k-1}\) and \(\mathcal {H}_y^{2k}\) satisfy
This implies that for both \(y^{2k}\) and \(y^{2k-1}\), the only possibly nonzero elements are the first k ones, which in turn implies that the lower bounds of \(\Vert y^{2k}-y^*\Vert ^2\) and \(\Vert y^{2k-1}-y^*\Vert ^2\) will be similar. For simplicity, we only discuss this lower bound for \(y^{2k}\); the counterpart for \(y^{2k-1}\) can be obtained in a similar way. We thus have the following estimates.
Lemma 3.4
Assume \(k\le \frac{n}{2}\) and \(n\ge 2\log _q\left( \frac{\alpha }{4\sqrt{2}}\right) \). Then
where \(y^0 = 0\) is the initial solution.
Proof
By the subspace characterization (20), we have
where the last inequality is due to the fact that \(q< 1\), \(k\le \frac{n}{2}\), and \(y^0 = 0\). If we choose n to be large enough, then \(\hat{y}^*\) and \(y^*\) can be made arbitrarily close to each other. Hence we can transform the above inequality into (21). More details of this derivation can be found in Appendix A. \(\square \)
Using Lemma 3.4 and (6), it is then straightforward to lower bound the duality gap by
Summarizing, below we present our first main result.
Theorem 3.5
Let the positive parameters \(\mu _x,\mu _y>0\) and \(L_{xy}> 0\) be given. For any integer k, there exists a problem instance from \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\) of the form (12) with \(n\ge \max \left\{ 2\log _q\left( \frac{\mu _x\mu _y}{\sqrt{2}L_{xy}^2}\right) , 4k\right\} \), where \(A\in \mathbb {R}^{n\times n}\) is defined in (13) and \(b = -\frac{L_{xy}^2}{4\mu _x}e_1\). For such a problem instance, any approximate solution \((\tilde{x}^k, \tilde{y}^k)\in \mathcal {H}_x^k\times \mathcal {H}_y^k\) generated by the proximal algorithm class under the linear span assumption (7) satisfies
where \(q = 1+\frac{2\mu _x\mu _y}{L_{xy}^2} - 2\sqrt{\left( \frac{\mu _x\mu _y}{L_{xy}^2}\right) ^2 + \frac{\mu _x\mu _y}{L_{xy}^2}}\).
Proposition 3.6
Under the same set of assumptions of Theorem 3.5, if we require the duality gap to be bounded by \(\epsilon \), the number of iterations needed is at least
The proof of Proposition 3.6 is in Appendix B.
3.3 The general proximal algorithm class
Note that Theorem 3.5 is derived for the proximal algorithm class under the linear span assumption. In this section, we apply the orthogonal invariance technique introduced in [25] to generalize the result of Theorem 3.5 to the general proximal algorithm class without the linear span assumption.
Consider the bilinear problem class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\) and the corresponding worst case problem (12) with \(F(x,y) = \frac{\mu _x}{2}\Vert x\Vert ^2 + \frac{L_{xy}}{2}x^\top A y - \frac{\mu _y}{2}\Vert y\Vert ^2 - b^\top y\), where A and b are defined in accordance with Theorem 3.5. We define the orthogonally rotated problem as
where U, V are two orthogonal matrices. Clearly, \(F_{U,V}\in \mathcal {B}(L_{xy},\mu _x,\mu _y)\). Let \((x^*, y^*)\) be the saddle point of F(x, y); then the saddle point of \(F_{U,V}(x,y)\) is \((\bar{x}^*,\bar{y}^*) = (U^\top x^*,V^\top y^*)\). Consequently, the lower bound for the general proximal algorithm class is characterized by the following theorem.
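The saddle point relation under rotation can be verified directly. The sketch below uses a generic bilinear instance with a hypothetical coupling matrix C in place of \(\frac{L_{xy}}{2}A\), and checks that the saddle point of the rotated problem is exactly \((U^\top x^*, V^\top y^*)\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
mu_x, mu_y = 1.0, 2.0
C = rng.standard_normal((n, n))   # hypothetical coupling matrix, stands in for (L_xy/2) A
b = rng.standard_normal(n)

# Saddle point of F(x,y) = mu_x/2 ||x||^2 + x^T C y - mu_y/2 ||y||^2 - b^T y:
#   mu_x x* + C y* = 0,   C^T x* - mu_y y* - b = 0
K = np.block([[mu_x * np.eye(n), C], [C.T, -mu_y * np.eye(n)]])
sol = np.linalg.solve(K, np.concatenate([np.zeros(n), b]))
x_star, y_star = sol[:n], sol[n:]

# Random orthogonal U, V via QR factorization
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))

# F_{U,V}(x,y) = F(Ux, Vy): its coupling matrix is U^T C V and its linear term is V^T b,
# since orthogonal matrices preserve the squared norms
C_rot = U.T @ C @ V
K_rot = np.block([[mu_x * np.eye(n), C_rot], [C_rot.T, -mu_y * np.eye(n)]])
sol_rot = np.linalg.solve(K_rot, np.concatenate([np.zeros(n), V.T @ b]))
xbar_star, ybar_star = sol_rot[:n], sol_rot[n:]
```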
Theorem 3.7
Let \(\mathcal {A}\) be any algorithm from the general proximal algorithm class described in Definition 2.3, and assume for simplicity that the dimension n is sufficiently large. For any integer k, there exist orthogonal matrices U, V such that \(F_{U,V}\in \mathcal {B}(L_{xy},\mu _x,\mu _y)\), and when applying \(\mathcal {A}\) to \(F_{U,V}\) with the initial solution \((x^0,y^0) = (0,0)\), the iterates and output satisfy
where \(\mathcal {H}_x^i,\mathcal {H}_y^i\) are defined by Lemma 3.2. Consequently, by Theorem 3.5,
where q is given in Theorem 3.5. As a result, it takes \(\Omega \left( \sqrt{\frac{L^2_{xy}}{\mu _x\mu _y}}\cdot \log \left( \frac{1}{\epsilon }\right) \right) \) iterations to output a solution with \(O(\epsilon )\) duality gap.
To prove this theorem, we only need to construct the orthogonal matrices U, V such that when the algorithm \(\mathcal {A}\) is applied to \(F_{U,V}\), the subspace inclusions \(\{(x^0,y^0),...,(x^k,y^k)\}\subseteq U^\top \mathcal {H}_x^{4k-1}\times V^\top \mathcal {H}_y^{4k-1}\) and \((\tilde{x}^k,\tilde{y}^k)\in U^\top \mathcal {H}_x^{4k+1}\times V^\top \mathcal {H}_y^{4k+1}\) hold. As a result,
With this argument, the latter results follow directly from the discussion of Theorem 3.5. The proof of this theorem is presented in Appendix C.
3.4 Tightness of the bound
The following remark establishes the tightness of the derived lower bound.
Remark 3.8
(Tightness of the bound) Consider the algorithm defined in Example 2.1, from [9, 10]. The achieved upper complexity bound is \({\mathcal {O}}\left( \sqrt{\frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \), and it matches our lower bound. This means that our lower bound (23) is tight and that the algorithm defined in Example 2.1 is optimal within the proximal algorithm class of Definition 2.1.
4 Lower bound for pure first-order algorithms
4.1 The worst-case instance
In this section, we consider the lower complexity bound for pure first-order methods without any proximal oracle. In this case, only gradient information can be used to construct the iterates and produce the approximate solution output. As before, we consider bilinearly coupled problems:
where b is a vector whose value will be determined later. The coefficients are \(B_x:= \frac{L_x-\mu _x}{4}\), \(B_y:= \frac{L_y-\mu _y}{4}\), and the coupling matrix A is defined by (13). Since \(\Vert A\Vert _2\le 2\) and hence \(\Vert A\Vert _2^2\le 4\), one can check that problem (25) is an instance of the problem class \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\). This time the subspaces \(\mathcal {H}_x^k\) and \(\mathcal {H}_y^k\) are generated by the following gradients:
Following Definition 2.2, by letting \(x^0 = y^0 = 0\) we have
By induction, we get the general structure of these subspaces.
Lemma 4.1
For problem (25) and for any \(k \in \mathbb {N}\), if the iterates are generated so that \((x_k,y_k)\in \mathcal {H}_x^k\times \mathcal {H}_y^k\), with \(\mathcal {H}_x^k\) and \(\mathcal {H}_y^k\) defined by (2.2), then we have
Unlike in the last section, it is now more convenient to work with the primal function instead of the dual one. Partially maximizing over y, we have
which is \(\mu _x\)-strongly convex. Therefore, the primal optimal solution \(x^*\) is completely characterized by the optimality condition \(\nabla \Phi (x^*) = 0\). However, the solution of this system cannot be computed exactly. Instead, we shall construct an approximate solution \(\hat{x}^*\) to the exact solution \(x^*\).
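Since display (25) is not reproduced here, the partial maximization can be sketched for a generic bilinearly coupled quadratic; the symbols \(f, g, C, H, \tilde{b}\) below are generic placeholders rather than notation from the paper:

```latex
F(x,y) = f(x) + x^\top C y - g(y), \qquad
g(y) = \tfrac{1}{2}\, y^\top H y + \tilde{b}^\top y, \quad H \succeq \mu _y I .
```

Setting \(\nabla _y F = C^\top x - Hy - \tilde{b} = 0\) gives \(y^*(x) = H^{-1}(C^\top x - \tilde{b})\), hence

```latex
\Phi (x) = \max _y F(x,y)
 = f(x) + \tfrac{1}{2}\,\big(C^\top x - \tilde{b}\big)^\top H^{-1} \big(C^\top x - \tilde{b}\big).
```

The added term is convex in x, so \(\Phi \) inherits the \(\mu _x\)-strong convexity of f.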
Lemma 4.2
(Root estimation) Consider a quartic equation
where the constants are given by
As long as \(L_x>\mu _x>0\) and \(L_y>\mu _y>0\), the constants \(0<\alpha ,\beta <+\infty \) are well-defined positive real numbers. This quartic equation has a real root \(x=q\) satisfying
The proof of this lemma is presented in Appendix D. With this lemma, we can construct the approximate solution \(\hat{x}^*\) as follows.
Lemma 4.3
(Approximate optimal solution) Let \(\alpha ,\beta \) be defined according to (27), let q be a real root of quartic equation (26) satisfying (28). Let us define a vector \(\hat{b}\) with elements given by
and then assign \(b = \frac{2B_xB_y}{L_{xy}}A^{-1}\hat{b}\). Then an approximate solution \(\hat{x}^*\) is constructed as
The approximation error can be bounded by
Note that \(q<1\) and the lower bound is dimension-independent, hence we are free to choose n to make the approximation error arbitrarily small.
The proof of this lemma is parallel to that of Lemma 3.3, but is more involved; the detailed proof is in Appendix E.
Note that in this case the vector \(Ab\propto \hat{b}\in \mathrm {Span}\{e_1,e_2\}\). By the zero-chain property in Proposition 3.1, the subspace \(\mathcal {H}_x^k\) described in Lemma 4.1 can be calculated by induction
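The zero-chain mechanism behind this induction can be illustrated numerically. Since the matrix of (13) is not reproduced here, the snippet below uses the standard lower-bidiagonal "chain" matrix as an assumed stand-in; it checks the norm bound \(\Vert A\Vert _2\le 2\) and that each multiplication by A extends the support of a vector by at most one coordinate, which is why k oracle calls reach at most the first \(k+1\) coordinates.

```python
import numpy as np

n = 8
# Assumed stand-in for the coupling matrix of (13): ones on the diagonal,
# minus ones on the subdiagonal (the paper's exact A is not shown here).
A = np.eye(n) - np.eye(n, k=-1)

assert np.linalg.norm(A, 2) <= 2.0  # the norm bound used for class membership

# Zero-chain behaviour: each multiplication by A can extend the support of
# a vector by at most one coordinate.
support_end = lambda v: np.flatnonzero(np.abs(v) > 1e-12).max()
x = np.zeros(n)
x[0] = 1.0
for k in range(1, n):
    x = A @ x  # one "oracle call" grows the support by at most one
    assert support_end(x) <= k
print(support_end(x))  # -> 7: after n-1 products the whole space is reached
```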
Parallel to Lemma 3.4, we have the following lemma, whose proof is in Appendix F.
Lemma 4.4
Assume \(k\le \frac{n}{2}\) and \(n\ge 2\log _q\left( \frac{\beta }{4\sqrt{2}(7+\alpha )}\right) +2\). Then
where \(x^0 = 0\) is the initial solution.
Consequently, the duality gap is lower bounded by
Summarizing, we present our second main result in the following theorem.
Theorem 4.5
Let positive parameters \(\mu _x,\, \mu _y>0\) and \(L_x>\mu _x,\, L_y>\mu _y,\, L_{xy} > 0\) be given. For any integer k, there exists a problem instance in \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\) of form (25), with \(n\ge \max \left\{ 2\log _q\left( \frac{7+\alpha }{\beta }\right) , 2k\right\} \), the constants \(\alpha ,\beta \) as in (27), the matrix \(A\in \mathbb {R}^{n\times n}\) as in (13), and the vector \(b = \frac{2B_xB_y}{L_{xy}}A^{-1}\hat{b}\) with \(\hat{b}\) as in (29). For this problem, any approximate solution \((\tilde{x}^k, \tilde{y}^k)\in \mathcal {H}_x^k\times \mathcal {H}_y^k\) generated by the first-order algorithm class (9) satisfies
where q satisfying (28) is a root of the quartic equation (26).
Remark 4.6
As a result, if we require the duality gap to be bounded by \(\epsilon \), then the number of iterations needed is at least
4.2 The general pure first-order algorithm class
Consider the problem class \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\), we define the orthogonally rotated problem as
where \(A,b,B_x,B_y\) are defined in accordance with Theorem 4.5 and U, V are two orthogonal matrices. Therefore, it is clear that \(F_{U,V}\in \mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\). Let \((x^*,y^*)\) be the saddle point of F(x, y); then it is clear that the saddle point of \(F_{U,V}(x,y)\) is \((\bar{x}^*,\bar{y}^*) = (U^\top x^*,V^\top y^*).\) Consequently, the lower bound for the general pure first-order algorithm class is characterized by the following theorem.
Theorem 4.7
Let \(\mathcal {A}\) be any algorithm from the general pure first-order algorithm class described in Definition 2.4. For simplicity, we assume the dimension n is sufficiently large. For any integer k, there exist orthogonal matrices U, V s.t. \(F_{U,V}\in \mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\), and when applying \(\mathcal {A}\) to \(F_{U,V}\) with initial solution \((x^0,y^0) = (0,0)\), the iterates and output satisfy
where \(\mathcal {H}_x^i,\mathcal {H}_y^i\) are defined by Lemma 4.1. Consequently, by Theorem 4.5,
where q is defined in Theorem 4.5. As a result, it takes \(\Omega \left( \sqrt{\frac{L_x}{\mu _x} + \frac{L_{xy}^2}{\mu _x\mu _y} + \frac{L_y}{\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \) iterations to output a solution with \(O(\epsilon )\) duality gap.
The proof of this theorem is completely parallel to that of Theorem 3.7. We only need to construct orthogonal matrices U, V such that \(\{(x^0,y^0),...,(x^k,y^k)\}\subseteq U^\top \mathcal {H}_x^{2k}\times V^\top \mathcal {H}_y^{2k}\) and \((\tilde{x}^k,\tilde{y}^k)\in U^\top \mathcal {H}_x^{2k+1}\times V^\top \mathcal {H}_y^{2k+1}\) hold; the construction follows exactly the same procedure as in the proof of Theorem 3.7. It then remains to argue that
The remaining results follow from Theorem 4.5, and we omit the details.
4.3 Tightness of the bound
In this section, we discuss the tightness of this bound. Currently, to the best of our knowledge, there does not exist a pure first-order algorithm that can achieve the lower complexity bound provided in (35). Therefore, whether an optimal algorithm exists that can match this bound or the bound can be further improved remains an open problem. However, we shall see below that (35) under several special parameter regimes is indeed a tight bound.
Case 1 \(\mathcal {F}(L_x,L_y, L_{xy},\mu _x,\mu _y)\) For this general class, define \(L = \max \{L_x,L_y,L_{xy}\}\). A near optimal upper bound of
is obtained in [38], which almost matches our lower bound.
Case 2 \(\mathcal {F}(L_x,L_y, 0,\mu _x,\mu _y)\) In this case \(L_{xy} = 0\), meaning that variables x and y are decoupled. Problem (1) becomes two independent convex problems with condition numbers \(\frac{L_x}{\mu _x}\) and \(\frac{L_y}{\mu _y}\) respectively. In this case (35) is reduced to
This is matched by running two independent Nesterov’s accelerated gradient methods [29].
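For the decoupled case, the matching upper bound comes from running Nesterov's accelerated gradient on each side independently. Below is a minimal sketch of the constant-momentum variant for an L-smooth, \(\mu \)-strongly convex objective, applied to an illustrative diagonal quadratic (the instance and parameter values are ours, not from the paper):

```python
import numpy as np

def agd_strongly_convex(grad, L, mu, x0, iters):
    # Nesterov's accelerated gradient, constant-momentum variant for an
    # L-smooth and mu-strongly convex objective.
    beta = (np.sqrt(L / mu) - 1) / (np.sqrt(L / mu) + 1)
    x = y = x0.copy()
    for _ in range(iters):
        x_next = y - grad(y) / L          # gradient step at the extrapolated point
        y = x_next + beta * (x_next - x)  # momentum extrapolation
        x = x_next
    return x

# Decoupled instance (L_xy = 0): the saddle point problem splits into an
# x-minimization and a y-maximization, each solvable by AGD on its own.
# Illustrative quadratic f(x) = 0.5 * x^T diag(d) x with spectrum [mu, L].
rng = np.random.default_rng(0)
n, mu, L = 20, 1.0, 100.0
d = np.linspace(mu, L, n)
x = agd_strongly_convex(lambda z: d * z, L, mu, rng.standard_normal(n), 400)
assert np.linalg.norm(x) < 1e-5  # minimizer is the origin; linear convergence
```

The iteration count 400 comfortably exceeds the \({\mathcal {O}}(\sqrt{\kappa }\ln (1/\epsilon ))\) prediction for \(\kappa = 100\), so the final error is far below the tolerance.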
Case 3 \(\mathcal {F}(L,L,L,\mu ,\mu )\) In this case \(L_x = L_y = L_{xy} = L\), \(\mu _x = \mu _y = \mu \). Then (35) is reduced to
The extra-gradient algorithm (EG) and the accelerated dual extrapolation algorithm (ADE) introduced in Example 2.3 have achieved this bound; see e.g. [23, 30].
Case 4 \(\mathcal {F}(L_x,{\mathcal {O}}(1)\cdot \mu _y,L_{xy},\mu _x,\mu _y)\) In this case \(L_y = {\mathcal {O}}(1)\cdot \mu _y\), meaning that one side of the problem is easy to solve. Then, (35) is reduced to
For the double-loop algorithm defined in Example 2.2, we set the inner-loop iteration count to \(T_2 = {\mathcal {O}}\left( \sqrt{\frac{L_y}{\mu _y}}\ln \left( \frac{1}{\epsilon }\right) \right) = {\mathcal {O}}\left( \ln \left( \frac{1}{\epsilon }\right) \right) \) and the outer-loop iteration count to \(T_1 = {\mathcal {O}}\left( \sqrt{\frac{L_{\Phi ,x}}{\mu _x}}\ln \left( \frac{1}{\epsilon }\right) \right) = {\mathcal {O}}\left( \sqrt{\frac{L_x}{\mu _x} + \frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \). Then, an upper bound of
can be guaranteed. It is tight up to a logarithmic factor.
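The outer-loop count can be traced to the smoothness of the primal function \(\Phi (x) = \max _y F(x,y)\); the bound on \(L_{\Phi ,x}\) below is the standard one and is stated here as a sketch rather than quoted from the paper:

```latex
L_{\Phi ,x} \;\le\; L_x + \frac{L_{xy}^2}{\mu _y}
\quad\Longrightarrow\quad
\sqrt{\frac{L_{\Phi ,x}}{\mu _x}}
 = {\mathcal {O}}\!\left( \sqrt{\frac{L_x}{\mu _x} + \frac{L_{xy}^2}{\mu _x\mu _y}}\right),
\qquad
T_1\cdot T_2
 = {\mathcal {O}}\!\left( \sqrt{\frac{L_x}{\mu _x} + \frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \ln ^2\!\left(\frac{1}{\epsilon }\right)\right)
```

when \(L_y = {\mathcal {O}}(1)\cdot \mu _y\), i.e. one logarithmic factor above the lower bound.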
Case 5 \(\mathcal {F}(L,L,L,\mu _x,\mu _y)\) In this case \(L_x = L_y = L_{xy} = L\). Then (35) is reduced to
This bound has been achieved by [19] up to a logarithmic factor.
5 Reduction to lower bounds for convex-concave problems
Note that in the previous sections we consider the strongly-convex and strongly-concave problem classes \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\) and \(\mathcal {F}(L_{x},L_{y},L_{xy},\mu _x,\mu _y)\), where \(\mu _x>0\) and \(\mu _y>0\). In this section, we show how our iteration complexity lower bounds provided in Theorems 3.5 and 4.5 can be reduced to the problem classes with \(\mu _x = \mu _y = 0\). A similar reduction can also be done for the case \(\mu _x>0, \mu _y = 0\), but it is omitted in this paper.
5.1 Lower bound for pure first-order algorithm class
Unlike the strongly-convex and strongly-concave saddle point problems, the saddle point of the general convex-concave problem may not always exist. Therefore, we define a new problem class with bounded saddle point solution as follows.
Definition 5.1
(Problem class \(\mathcal {F}_0(L_x,L_y,L_{xy},R_x,R_y)\)) We say a function F belongs to the class \(\mathcal {F}_0(L_x,L_y,L_{xy},R_x,R_y)\) if: (i) \(F\in \mathcal {F}(L_x,L_y,L_{xy},0,0)\); (ii) the solution \((x^*,y^*) = \mathop {\hbox {argmin}}\limits _x\mathop {\hbox {argmax}}\limits _y F(x,y)\) exists, and \(\Vert x^*\Vert \le R_x\), \(\Vert y^*\Vert \le R_y\).
For this problem class, we have the following lower bound result, as a corollary of Theorem 4.7.
Corollary 5.2
Consider applying the general first-order algorithm class defined by (2.2) to the problem class \(\mathcal {F}_0(L_x,L_y,L_{xy},R_x,R_y)\). For any \(\epsilon >0\), there exists a problem instance \(F_\epsilon (x,y)\in \mathcal {F}_0(L_x,L_y,L_{xy},R_x,R_y)\), such that
iterations are required to reduce the duality gap to \(\epsilon \).
Proof
We start the reduction by the following scaling argument. First, for any \(\epsilon >0\), let \(\hat{F}_\epsilon \in \mathcal {F}(L_x, L_y, L_{xy},\mu _x,\mu _y)\) be the worst-case instance described by Theorem 4.7. For our purpose, we choose
Then by direct computation, we know that the following scaled problem satisfies
We skip the parameter b since it is already used in the construction of the worst-case instance \(\hat{F}_\epsilon \). Denote \((\hat{x}^*,\hat{y}^*) = \mathop {\hbox {argmin}}\limits _{x}\mathop {\hbox {argmax}}\limits _y \hat{F}_\epsilon (x,y)\) and \((x^*,y^*) = \mathop {\hbox {argmin}}\limits _{x}\mathop {\hbox {argmax}}\limits _y F_\epsilon (x,y)\). Let us set
Then we have \(x^* = \frac{R_x\hat{x}^*}{\Vert \hat{x}^*\Vert }\) and \(y^* = \frac{R_y\hat{y}^*}{\Vert \hat{y}^*\Vert }\), and the Lipschitz constants of \(F_\epsilon \) satisfy that \(ac^2 L_x\le L_x\), \(ad^2 L_y\le L_y\), and \(acd L_{xy}\le L_{xy}\). Therefore, we know
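The Lipschitz-constant claims follow from the chain rule applied to the scaling; a sketch, assuming the elided display defines the scaled problem as \(F_\epsilon (x,y) = a\,\hat{F}_\epsilon (cx, dy)\):

```latex
\nabla ^2_{xx} F_\epsilon (x,y) = ac^2\, \nabla ^2_{xx}\hat{F}_\epsilon (cx,dy),\qquad
\nabla ^2_{yy} F_\epsilon (x,y) = ad^2\, \nabla ^2_{yy}\hat{F}_\epsilon (cx,dy),\qquad
\nabla ^2_{xy} F_\epsilon (x,y) = acd\, \nabla ^2_{xy}\hat{F}_\epsilon (cx,dy),
```

so the constants map as \((L_x,L_y,L_{xy},\mu _x,\mu _y)\mapsto (ac^2L_x,\, ad^2L_y,\, acdL_{xy},\, ac^2\mu _x,\, ad^2\mu _y)\), and the saddle point maps as \((x^*,y^*) = (\hat{x}^*/c,\ \hat{y}^*/d)\).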
Note that purely scaling the variables and the function does not change the worst-case nature of this problem. That is, \(F_\epsilon \) is still a worst-case problem instance of the function class \(\mathcal {F}(ac^2 L_x,ad^2 L_y,acd L_{xy},ac^2 \mu _x,ad^2 \mu _y)\), and the lower bound of Theorem 4.5 is valid for this specific instance. Therefore, to bring the duality gap below \(\epsilon \), the number of iterations k is lower bounded by
where (i) is because \(x^0 = 0\), \(\Vert x^*\Vert = R_x\), and \(\mu _x = 64\epsilon /R_x^2\). Therefore, as long as we can show that \(\ln \left( 2ac^2\right) \ge \Omega (1)\), then the corollary is proved. However, since the details are rather technical, we shall provide a proof of \(\ln \left( 2ac^2\right) \ge \ln 2 \) in Appendix G. \(\square \)
As a remark, by setting \(L_y = 0\), the lower bound (37) implies the result derived in [33]. When \(L_x = L_y = L_{xy} = L\), the lower bound (37) implies the result derived in [25]. The reduction for the general pure first-order algorithm class defined by (2.4) without the linear span assumption can also be done in a similar manner and is omitted for succinctness.
5.2 Lower bound for proximal algorithm class
Like Definition 5.1, we define a new bilinear problem class with bounded saddle point solution as follows.
Definition 5.3
(Problem class \(\mathcal {B}_0(L_{xy},R_x,R_y)\)) We say a function F belongs to the function class \(\mathcal {B}_0(L_{xy},R_x,R_y)\) if: (i) \(F\in \mathcal {B}(L_{xy},0,0)\); (ii) the solution \((x^*,y^*) = \mathop {\hbox {argmin}}\limits _x\mathop {\hbox {argmax}}\limits _y F(x,y)\) exists, and \(\Vert x^*\Vert \le R_x\), \(\Vert y^*\Vert \le R_y\).
For this problem class, we have the following lower bound result, as a corollary of Theorem 3.7.
Corollary 5.4
Consider applying the general proximal algorithm class described in Definition 2.3 to the problem class \(\mathcal {B}_0(L_{xy},R_x,R_y)\). For any \(\epsilon >0\), there exists an instance \(F_\epsilon (x,y)\in \mathcal {B}_0(L_{xy},R_x,R_y)\), such that
iterations are required to reduce the duality gap to \(\epsilon \).
Remark 5.5
The lower bound in Corollary 5.4 is tight. An optimal algorithm is derived in [9, 10].
The reduction can be done in a similar way as in Corollary 5.2, but is much simpler. The details are omitted here.
6 Conclusion
In this paper, we establish lower complexity bounds for first-order methods in solving strongly convex and strongly concave saddle point problems. Different from existing results, we discuss the problem in the most general parameter regime. For the bilinear coupling problem class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\), and for both the proximal algorithm class (7) with the linear span assumption and the general proximal algorithm class (10) without the linear span assumption, a tight lower bound is established. For the general coupling problem class \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\), and for both the pure first-order algorithm class (9) with the linear span assumption and the general pure first-order algorithm class (11) without the linear span assumption, a lower bound has been established. Under various special parameter regimes, tight upper bounds have been developed. In the most general setting of the min-max framework, a near-optimal algorithm has been discovered, while an optimal algorithm that exactly matches the lower bound has yet to be found. Finally, we also show that our results imply several existing lower bounds for general convex-concave problems through proper scaling of the worst-case instances, which indicates the generality of our results.
Notes
In Nesterov’s original paper [30], the author did not give a name to his algorithm. For convenience of referencing, in this paper we shall call it accelerated dual extrapolation.
References
Abadeh, S.S., Esfahani, P.M., Kuhn, D.: Distributionally robust logistic regression. In: Advances in Neural Information Processing Systems, pp. 1576–1584, (2015)
Agarwal, N., Hazan, E.: Lower bounds for higher-order convex optimization. arXiv preprint arXiv:1710.10329, (2017)
Arjevani, Y., Shamir, O., Shiff, R.: Oracle complexity of second-order methods for smooth convex optimization. Math. Progr. 178(1–2), 327–360 (2019)
Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. arXiv preprint arXiv:1701.07875, (2017)
Azizian, W., Scieur, D., Mitliagkas, I., Lacoste-Julien, S., Gidel, G.: Accelerating smooth games by manipulating spectral shapes. arXiv preprint arXiv:2001.00602, (2020)
Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, Nashua (1997)
Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Lower bounds for finding stationary points I. Math. Progr. 184, 71–120 (2017)
Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Lower bounds for finding stationary points II: first-order methods. Math. Progr. 158, 1–2 (2019)
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)
Chambolle, A., Pock, T.: On the ergodic convergence rates of a first-order primal-dual algorithm. Math. Progr. 159(1–2), 253–287 (2016)
Gao, X., Zhang, S.: First-order algorithms for convex optimization with nonseparable objective and coupled constraints. J. Oper. Res. Soc. China 5(2), 131–159 (2017)
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680, (2014)
Ibrahim, A., Azizian, W., Gidel, G., Mitliagkas, I.: Linear lower bounds and conditioning of differentiable games. arXiv preprint arXiv:1906.07300, (2019)
Jin, C., Netrapalli, P., Jordan, M.I.: Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618, (2019)
Jin, C., Netrapalli, P., Jordan, M.I.: What is local optimality in nonconvex-nonconcave minimax optimization? arXiv preprint arXiv:1902.00618, (2019)
Juditsky, A., Nemirovski, A., Tauvel, C.: Solving variational inequalities with stochastic mirror-prox algorithm. Stoch. Syst. 1(1), 17–58 (2011)
Korpelevich, G.M.: The extragradient method for finding saddle points and other problems. Matecon 12, 747–756 (1976)
Lin, Q., Liu, M., Rafique, H., Yang, T.: Solving weakly-convex-weakly-concave saddle-point problems as weakly-monotone variational inequality. arXiv preprint arXiv:1810.10207, (2018)
Lin, T., Jin, C., Jordan, M.: Near-optimal algorithms for minimax optimization. In: Annual Conference on Learning Theory, (2020)
Lin, T., Jin, C., Jordan, M.I.: On gradient descent ascent for nonconvex-concave minimax problems. arXiv preprint arXiv:1906.00331, (2019)
Lu, S., Tsaknakis, I., Hong, M., Chen, Y.: Hybrid block successive approximation for one-sided non-convex min-max problems: algorithms and applications. arXiv preprint arXiv:1902.08294, (2019)
Marcotte, P., Dussault, J.-P.: A note on a globally convergent newton method for solving monotone variational inequalities. Oper. Res. Lett. 6(1), 35–42 (1987)
Mokhtari, A., Ozdaglar, A., Pattathil, S.: A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. arXiv preprint arXiv:1901.08511, (2019)
Nemirovski, A.: Prox-method with rate of convergence \(O(1/t)\) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15(1), 229–251 (2004)
Nemirovsky, A.: Information-based complexity of linear operator equations. J. Complex. 8(2), 153–175 (1992)
Nemirovsky, A., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)
Nesterov, Yu.: Implementable tensor methods in unconstrained convex optimization. CORE Discussion Paper, 2018/05
Nesterov, Yu.: Dual extrapolation and its applications to solving variational inequalities and related problems. Math. Progr. 109(2–3), 319–344 (2007)
Nesterov, Yu.: Lectures on convex optimization, vol. 137. Springer, Cham (2018)
Nesterov, Yu., Scrimali, L.: Solving strongly monotone variational and quasi-variational inequalities. Available at SSRN 970903 (2006)
Nisan, N., Roughgarden, T., Tardos, E., Vazirani, V.: Algorithmic game theory. Cambridge University Press, Cambridge (2007)
Ouyang, Y., Chen, Y., Lan, G., Pasiliao Jr., E.: An accelerated linearized alternating direction method of multipliers. SIAM J. Imaging Sci. 8(1), 644–681 (2015)
Ouyang, Y., Xu, Y.: Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. arXiv preprint arXiv:1808.02901, (2018)
Rockafellar, R.T.: Convex analysis. Princeton University Press, Princeton (1970)
Sanjabi, M., Razaviyayn, M., Lee, J.D.: Solving non-convex non-concave min-max games under Polyak-Lojasiewicz condition. arXiv preprint arXiv:1812.02878, (2018)
Taji, K., Fukushima, M., Ibaraki, T.: A globally convergent newton method for solving strongly monotone variational inequalities. Math. Progr. 58(1–3), 369–383 (1993)
von Neumann, J., Morgenstern, O., Kuhn, H.W.: Theory of games and economic behavior (commemorative edition). Princeton University Press, Princeton (2007)
Wang, Y., Li, J.: Improved algorithms for convex-concave minimax optimization. arXiv preprint arXiv:2006.06359, (2020)
Xiao, L., Yu, A., Lin, Q., Chen, W.: DSCOVR: randomized primal-dual block coordinate algorithms for asynchronous distributed optimization. J. Mach. Learn. Res. 20(43), 1–58 (2019)
Xu, Y.: Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. SIAM J. Optim. 27(3), 1459–1484 (2017)
Acknowledgements
We thank the two anonymous reviewers for their insightful suggestions on the orthogonal invariance argument for breaking the linear span assumption and on applying scaling to obtain lower bounds for general convex-concave problems.
Appendices
Proof of Lemma 3.4
By the subspace characterization (20), we have
where the last inequality is due to the fact that \(q\le 1, k\le \frac{n}{2}\) and \(y^0 = 0\). Note that by Lemma 3.3, if we require \(n\ge 2\log _q\left( \frac{\alpha }{4\sqrt{2}}\right) \), then we can guarantee that
where the last inequality is due to \(\frac{q^{\frac{n}{2}}}{\alpha }\le \frac{1}{4\sqrt{2}}\) and \(q/(1-q)\le \Vert y^0-\hat{y}^*\Vert \). Therefore, we have
where the fourth line is due to the fact that \(\frac{d}{dt}\left( t^2 - 2\Vert \hat{y}^*-y^*\Vert t\right) = 2(t-\Vert \hat{y}^*-y^*\Vert )\ge 0\) when \(t\ge \delta _k\); hence the quadratic function is monotonically increasing on the interval considered. In addition, we also have
where the third inequality is due to the fact that \(\Vert y^0-\hat{y}^*\Vert \ge \hat{y}^*_1 = q/(1-q)\). For the last inequality, if \(\alpha \ge 1\), then \(q^n/\alpha <1\); if \(\alpha \le 1\), then \(q^n/\alpha \le \alpha /32\le 1\) since \(n\ge 2\log _q\left( \frac{\alpha }{4\sqrt{2}}\right) \). Combining the above two inequalities, the desired bound (21) follows.
Proof of Proposition 3.6
Here we only prove the last inequality of (23). Due to the fact that \((\ln (1+z))^{-1}\ge 1/z\) for \(\forall z>0\), we know
which completes the proof.
Proof of Theorem 3.7
Before proceeding with the proof, let us first quote a lemma from [33].
Lemma C.1
(Lemma 3.1 in [33]) Let \(\mathcal {X}\subsetneqq \bar{\mathcal {X}}\subseteqq \mathbb {R}^{p}\) be two linear subspaces. Then for any \(\bar{x}\in \mathbb {R}^p\), there exists an orthogonal matrix \(\Gamma \in \mathbb {R}^{p\times p}\) s.t. \(\Gamma x = x, \forall x\in \mathcal {X}\) and \(\Gamma \bar{x}\in \bar{\mathcal {X}}\).
Note that for an orthogonal matrix \(\Gamma \), if \(\Gamma x = x\), then we also have \(\Gamma ^\top x = x\). Now let us start our proof of Theorem 3.7.
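The construction behind Lemma C.1 can be realized numerically with a single Householder reflection inside \(\mathcal {X}^\perp \). The sketch below is our own illustration of the idea (not the proof in [33]); `rotation_fixing_subspace` and the toy subspaces are hypothetical names:

```python
import numpy as np

def rotation_fixing_subspace(X_basis, Xbar_basis, xbar, tol=1e-10):
    """Orthogonal Gamma with Gamma x = x for all x in X = span(X_basis)
    and Gamma @ xbar in span(Xbar_basis), for nested subspaces X < Xbar.
    A numerical sketch of Lemma C.1 via one Householder reflection."""
    p = xbar.shape[0]
    Qx, _ = np.linalg.qr(X_basis)
    P = Qx @ Qx.T                      # orthogonal projector onto X
    u = xbar - P @ xbar                # component of xbar orthogonal to X
    nu = np.linalg.norm(u)
    if nu < tol:
        return np.eye(p)               # xbar already lies in X, hence in Xbar
    u = u / nu
    # unit vector v in Xbar intersected with X^perp (nonempty since X is a
    # proper subspace of Xbar): leading left singular vector of (I-P)·Xbar
    U, _, _ = np.linalg.svd(Xbar_basis - P @ Xbar_basis)
    v = U[:, 0]
    w = u - v
    if np.linalg.norm(w) < tol:
        return np.eye(p)
    w = w / np.linalg.norm(w)
    # reflection across the hyperplane orthogonal to w: maps u -> v and
    # fixes X pointwise because w lies in X^perp
    return np.eye(p) - 2.0 * np.outer(w, w)

# toy nested pair: X = span(e1, e2), Xbar = span(e1, ..., e4) in R^6
rng = np.random.default_rng(1)
p = 6
Xb = np.eye(p)[:, :2]
Xbarb = np.eye(p)[:, :4]
xbar = rng.standard_normal(p)
G = rotation_fixing_subspace(Xb, Xbarb, xbar)
Pbar = Xbarb @ Xbarb.T
assert np.allclose(G.T @ G, np.eye(p))            # G is orthogonal
assert np.allclose(G @ Xb, Xb)                    # G fixes X pointwise
assert np.allclose(Pbar @ (G @ xbar), G @ xbar)   # G xbar lies in Xbar
```

Since \(w\in \mathcal {X}^\perp \), every \(x\in \mathcal {X}\) satisfies \(\Gamma x = x\), and \(\Gamma \bar{x} = P\bar{x} + \Vert \bar{x} - P\bar{x}\Vert \, v \in \bar{\mathcal {X}}\), matching the statement of the lemma.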
Proof
To prove this theorem, we only need to show
We separate the proof into two parts.
Part I. There exist orthogonal matrices \(\hat{U}\), \(\hat{V}\) s.t. when \(\mathcal {A}\) is applied to the rotated instance \(F_{\hat{U},\hat{V}}\), \(\{(x^0,y^0),...,(x^k,y^k)\}\subseteq \hat{U}^\top \mathcal {H}_x^{4k-1}\times \hat{V}^\top \mathcal {H}_y^{4k-1}.\)
Let \(\theta = (L_{xy},\mu _x,\mu _y)\) be the set of algorithmic parameters. To prove the result, let us construct the worst-case function \(F_{U,V}\) in a recursive way.
Case \(k = 1\): Let us define \(U_0 = V_0 = I\). When \(\mathcal {A}\) is applied to the function \(F_{U_0,V_0}\in \mathcal {B}(L_{xy}, \mu _x, \mu _y)\), the iterate sequence is \((x_{0}^0, y_{0}^0) = (0, 0)\) and
By Lemma C.1, there exist orthogonal matrices \(\Gamma _x^0\) and \(\Gamma _y^0\) such that \(\Gamma _x^0x^1_0\in \mathcal {H}_x^3 = \mathrm {Span}\{Ab\}\), \(\Gamma _y^0y^1_0\in \mathcal {H}_y^3=\mathrm {Span}\{b, A^2b\}\), and \(\Gamma _y^0b = (\Gamma _y^0)^\top b = b.\) That is
where \(U_1 = U_0\Gamma _x^0\) and \(V_1 = V_0\Gamma _y^0\).
Now we prove that when we apply the algorithm \(\mathcal {A}\) to \(F_{U_1,V_1}\), the generated iterates \(\{(x^0_1, y^0_1), (x^1_1, y^1_1)\}\) satisfy \((x^0_1, y^0_1) = (0, 0)\) and \((x^1_1, y^1_1) = (x^1_0, y^1_0)\). That is, the first two iterates generated by \(\mathcal {A}\) are exactly the same for \(F_{U_0,V_0}\) and \(F_{U_1,V_1}\). This is because \(u^1_1 = \mathcal {A}_u^1(\theta ; x^0_1, U_1^\top AV_1y^0_1) = \mathcal {A}_u^1(\theta ; 0, 0) = \mathcal {A}_u^1(\theta ; x^0_0, U_0^\top AV_0y^0_0) = u^1_0\); therefore
Through similar argument, we know \((y^1_1,\tilde{y}^1_1) = (y^1_0,\tilde{y}^1_0)\). Therefore, (42) indicates that
Case \(k=2\). For ease of exposition, we present one extra step of the discussion for \(k=2\) before giving the construction for general k.
For the problem instance \(F_{U_1,V_1}\), the iterates generated by \(\mathcal {A}\) are \((x_{1}^0, y_{1}^0) = (0, 0)\) and
Note that \(x_1^1 \in U_1^\top \mathcal {H}_x^3 \subsetneqq U_1^\top \mathcal {H}_x^5\subsetneqq U_1^\top \mathcal {H}_x^7\) and \(\{y_1^1,b\}\subsetneqq V_1^\top \mathcal {H}_y^3\subsetneqq V_1^\top \mathcal {H}_y^5\subsetneqq V_1^\top \mathcal {H}_y^7\). Therefore, there exist orthogonal matrices \(\Gamma _x^1\) and \(\Gamma _y^1\) such that
Now, let us define
Now we prove that if \(\mathcal {A}\) is applied to \(F_{U_2,V_2}\), the generated iterates \(\{(x_{2}^0, y_{2}^0), (x_{2}^1, y_{2}^1), (x_{2}^2, y_{2}^2)\}\) satisfy \((x_{2}^0, y_{2}^0) = (0, 0)\), \((x_{2}^1, y_{2}^1) = (x_{1}^1, y_{1}^1)\), and \((x_{2}^2, y_{2}^2) = (x_{1}^2, y_{1}^2)\). The argument for \((x_{2}^1, y_{2}^1) = (x_{1}^1, y_{1}^1)\) is almost the same as that of the case \(k=1\). We only provide the proof for \((x_{2}^2, y_{2}^2) = (x_{1}^2, y_{1}^2)\).
Next, we need to show \(u_2^2 = u_1^2\), which can be proved by arguing that all the inputs to \(\mathcal {A}_u^2\) are the same for both \(u_2^2\) and \(u_1^2\). First, it is straightforward that \(x_1^0 = 0 = x^0_2, U_1^\top AV_1 y_1^0 = 0 = U_2^\top AV_2 y_2^0\). By previous argument \(x_2^1 = x_1^1\). Finally, consider the last input \(U_2^\top AV_2 y_2^1\), because \(y_2^1 = y_1^1\in V_1^\top \mathcal {H}_y^3\subsetneqq V_1^\top \mathcal {H}_y^5\), we have \(\Gamma _y^1 y_2^1 = y_2^1 = y_1^1\in V_1^\top \mathcal {H}_y^3.\) Then \(V_2y_2^1 = V_1\Gamma _y^1y_2^1\in V_1V_1^\top \mathcal {H}_y^3 = \mathcal {H}_y^3.\) Therefore \(U_1^\top AV_2y_2^1\in U_1^\top A\mathcal {H}_y^3 = U_1^\top \mathcal {H}_x^5\) and
Consequently,
and
Through a similar argument, we have \((y_2^2,\tilde{y}_2^2) = (y_1^2,\tilde{y}_1^2)\). By (43) and (44), we have
Case k. Suppose we already have orthogonal matrices \(U_{k-1}, V_{k-1}\), such that when \(\mathcal {A}\) is applied to \(F_{U_{k-1},V_{k-1}}\), we have
Again, by Lemma C.1, there exist orthogonal matrices \(\Gamma _x^{k-1}\) and \(\Gamma _y^{k-1}\), such that
Now we define
Therefore, similar to our previous discussion, we only need to argue that when \(\mathcal {A}\) is applied to \(F_{U_k,V_k}\), the generated iterates \(\{(x_k^0,y_k^0), (x_k^1,y_k^1),\cdots ,(x_k^k,y_k^k)\}\) satisfy \((x_k^i,y_k^i) = (x_{k-1}^i,y_{k-1}^i)\) for \(i = 0,1,...,k\). We prove this by induction. First, it is straightforward that \((x_k^0,y_k^0) = (0,0) = (x_{k-1}^0,y_{k-1}^0)\). Suppose \((x_k^i,y_k^i) = (x_{k-1}^i,y_{k-1}^i)\) holds for \(i = 0,1,...,j-1\) with \(j\le k\); we now prove \((x_k^j,y_k^j) = (x_{k-1}^j,y_{k-1}^j)\), which is almost identical to the case \(k=2\).
For any \(i\in \{0,1,...,j-1\}\), let us show \(U_{k-1}^\top AV_{k-1} y_{k-1}^i = U_k^\top AV_k y_k^i\). Because \(y_k^i = y_{k-1}^i\in V_{k-1}^\top \mathcal {H}_y^{4k-5}\subsetneqq V_{k-1}^\top \mathcal {H}_y^{4k-3}\), we have \(\Gamma _y^{k-1} y_k^i = y_k^i = y_{k-1}^i\in V_{k-1}^\top \mathcal {H}_y^{4k-5}.\) Then \(V_ky_k^i = V_{k-1}\Gamma _y^{k-1}y_k^i\in V_{k-1}V_{k-1}^\top \mathcal {H}_y^{4k-5} = \mathcal {H}_y^{4k-5}.\) Therefore \(U_{k-1}^\top AV_{k}y_k^i\in U_{k-1}^\top A\mathcal {H}_y^{4k-5} = U_{k-1}^\top \mathcal {H}_x^{4k-3}\) and
for \(0\le i\le j-1\). Consequently,
and
Through a similar argument, we have \((y_k^i,\tilde{y}_k^i) = (y_{k-1}^i,\tilde{y}_{k-1}^i)\). By induction, we know \((y_k^i,\tilde{y}_k^i) = (y_{k-1}^i,\tilde{y}_{k-1}^i)\) for \(i = 0,1,...,k\). Consequently, we have
By setting \(\hat{U} = U_k\) and \(\hat{V} = V_k\), we prove the result for Part I.
Part II. There exist orthogonal matrices U, V such that when \(\mathcal {A}\) is applied to the rotated instance \(F_{U,V}\), \(\{(x^0,y^0),...,(x^k,y^k)\}\subseteq U^\top \mathcal {H}_x^{4k-1}\times V^\top \mathcal {H}_y^{4k-1},\) and \((\tilde{x}^k,\tilde{y}^k)\in U^\top \mathcal {H}_x^{4k+1}\times V^\top \mathcal {H}_y^{4k+1}\).
Given the result of Part I, let \(\{(x_k^0, y_k^0),...,(x_k^k, y_k^k)\}\) and \((\tilde{x}_k^k, \tilde{y}_k^k)\) be generated by \(\mathcal {A}\) when applied to \(F_{\hat{U},\hat{V}} = F_{U_k,V_k}\). By Lemma C.1, there exist orthogonal matrices P, Q such that
Define \(U = U_kP\), and \(V = V_kQ\). Let \(\{(x^0,y^0),...,(x^k,y^k)\}\) and the output \((\tilde{x}^k,\tilde{y}^k)\) be generated by \(\mathcal {A}\) when applied to \(F_{{U,V}}\). Then following the same line of argument of Case k, Part I, we have
Therefore, combining with (49), we complete the proof of Part II. \(\square \)
Proof of Lemma 4.2
For the ease of analysis, let us perform a change of variable \(r:=(1-q)^{-1}\). Then the quartic equation (26) can be transformed to
Although the quartic equation admits a closed-form root formula, it is impractical to use it for the purpose of a lower iteration complexity bound. Instead, we shall provide a sufficiently accurate estimate of r, which corresponds to an estimate of q.
First, we let \(\bar{r} = \frac{1}{2}+\sqrt{\frac{\alpha }{\beta } + \frac{1}{4}}\). Then \(f(\bar{r}) = 1>0.\)
Second, we let \(\underline{r} = \frac{1}{2}+\sqrt{\frac{\alpha }{2\beta } + \frac{1}{4}}\). Then,
Together with the fact that \(f(\bar{r}) = 1>0\), by continuity we know there is a root r in the interval \(\left( \underline{r}, \bar{r}\right) \), where
and
This further implies
which proves this lemma.
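The root estimate can be checked numerically for illustrative values of \(\alpha ,\beta \) (the exact expressions (27) are elided above; the values below are chosen so that the palindromic quartic has a real root in (0, 1)):

```python
import numpy as np

# quartic 1 - (4+a)q + (6+2a+b)q^2 - (4+a)q^3 + q^4 = 0, cf. (26);
# illustrative alpha, beta, not the paper's expressions (27)
alpha, beta = 4.0, 1.0
coeffs = [1.0, -(4 + alpha), 6 + 2 * alpha + beta, -(4 + alpha), 1.0]
roots = np.roots(coeffs)
real_q = [r.real for r in roots if abs(r.imag) < 1e-9 and 0 < r.real < 1]

r_lo = 0.5 + np.sqrt(alpha / (2 * beta) + 0.25)   # underline{r}
r_hi = 0.5 + np.sqrt(alpha / beta + 0.25)         # bar{r}

# Lemma 4.2: some real root q in (0,1) has r = 1/(1-q) in (r_lo, r_hi)
assert any(r_lo < 1 / (1 - q) < r_hi for q in real_q)
print(sorted(real_q))
```

For these values the quartic has two real roots in (0, 1), and the larger one lands in the predicted interval for \(r = 1/(1-q)\).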
Proof of Lemma 4.3
First, by setting \(\nabla \Phi (x^*) = 0\), we get
Note that matrix A is invertible, with
Therefore, by the commutativity \(A(B_yA^2+\mu _yI) = (B_yA^2+\mu _yI)A\), we can take inverses and get \((B_yA^2+\mu _yI)^{-1}A^{-1} = A^{-1}(B_yA^2+\mu _yI)^{-1}\). Multiplying both sides by A on the left and on the right, we obtain the commutativity of
Applying this on equation (51) and multiplying both sides by \(\frac{1}{B_xB_y}(B_yA^2 + \mu _yI)\), we can equivalently write the optimality condition as
where
The values of matrices \(A^2\) and \(A^4\) can be found in (13). For the ease of discussion, we may also write equation (52) in an expanded form as:
Because \(q\in (0,1)\) is a root of the quartic equation \(1 -(4+\alpha )q + (6+2\alpha + \beta )q^2 - (4+\alpha )q^3 + q^4 = 0\), and our approximate solution \(\hat{x}^*\) is constructed as \(\hat{x}^*_i = q^i\), by direct calculation one can check that the first \(n-2\) equations are satisfied and the last two equations are violated with controllably small residuals. Indeed, for the \((n-1)\)-th equation the violation is of order \(q^{n+1}\), and for the n-th equation the violation is of order \(|-q^{n} + (4+\alpha )q^{n+1} - q^{n+2}|\). Similar to the arguments for (18), we have
That is, \(\Vert \hat{x}^*-x^*\Vert \le \frac{7+\alpha }{\beta }\cdot q^n\), which completes the proof.
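The claim that the geometric sequence satisfies the generic equations can also be checked numerically. The sketch below assumes the interior rows of the expanded system apply the symmetric stencil \(\left( 1,\,-(4+\alpha ),\,6+2\alpha +\beta ,\,-(4+\alpha ),\,1\right) \) to \(\hat{x}^*\); the boundary rows, which depend on the specific forms of \(A^2\) and \(A^4\) from (13), are not checked here. Parameters \(\alpha =4,\ \beta =1\) are illustrative only.

```python
import numpy as np

alpha, beta = 4.0, 1.0   # illustrative parameters only
n = 60                   # problem dimension

# q in (0,1): a root of the palindromic quartic from the text.
coeffs = [1.0, -(4 + alpha), 6 + 2 * alpha + beta, -(4 + alpha), 1.0]
q = max(z.real for z in np.roots(coeffs) if abs(z.imag) < 1e-9 and 0 < z.real < 1)

x_hat = q ** np.arange(1, n + 1)   # \hat x^*_i = q^i, i = 1, ..., n

# Each interior row applies the symmetric 5-point stencil to x_hat; on the
# geometric sequence it evaluates to q^{i-2} * (quartic at q) = 0 exactly.
stencil = np.array(coeffs)         # palindromic, so orientation is irrelevant
residuals = np.convolve(x_hat, stencil, mode="valid")
assert np.max(np.abs(residuals)) < 1e-10
```

Each entry of `residuals` factors as \(q^{i}\) times the quartic evaluated at q, which vanishes by the choice of q; this is precisely why only the two boundary equations pick up nonzero residuals.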
Proof of Lemma 4.4
By the subspace characterization (32), we have
When we set \(k\le \frac{n}{2}\) and \(n\ge 2\log _q\left( \frac{\beta }{4\sqrt{2}(7+\alpha )}\right) +2\), by (31) we also have
Therefore, similar to (41), we also have
which proves the lemma.
Proof of \(\ln (2ac^2) = \Omega (1)\)
Proof
Note that \(a = \min \{c^{-2},d^{-2}\}\). If \(c^{-2}\le d^{-2}\), then \(ac^2 = 1\). Consequently,
However, when \(c^{-2}\ge d^{-2}\), the situation is more complicated. In this case,
where \(\hat{x}^*\) and \(\hat{y}^*\) are the solutions to the unscaled worst-case instance \(\hat{F}_\epsilon \in \mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\). For ease of discussion, let us take the dimension n to be sufficiently large, so that we may view the approximate solution constructed in Lemma 4.3 as the exact solution. Therefore, we have
where q is defined in Theorem 4.5 and the second equality is due to the first-order stationarity condition. Note that equation (51) also provides that
Combining the above two relations, we have
Substituting the specific forms of A and \(A^{-1}\), we have
Therefore, we have
For ease of discussion, we make the following simplifications. First, we omit the \(q^{2n}\) term, since \(q<1\) and n is sufficiently large. Second, Lemma 4.2 indicates that \(1-q = \Theta (\epsilon )\), so the term \(\frac{2B_x}{L_{xy}} (1-q) = {\mathcal {O}}(\epsilon )\) while the term \(\frac{128\epsilon }{L_{xy}R_x^2(1-q)} = \Omega (1)\). Thus we also omit the \(\frac{2B_x}{L_{xy}} (1-q)\) term, which is of significantly smaller magnitude. Therefore, we can write
As a result,
In Lemma 4.2, we also obtained the following lower bound on \(1-q\):
where (i) is because we have omitted the terms of smaller magnitude. Therefore,
Thus we complete the proof. \(\square \)
Zhang, J., Hong, M. & Zhang, S. On lower iteration complexity bounds for the convex concave saddle point problems. Math. Program. 194, 901–935 (2022). https://doi.org/10.1007/s10107-021-01660-z