1 Introduction

In this paper, we establish lower iteration complexity bounds for first-order methods solving the following min-max saddle point problem

$$\begin{aligned} \min _x \max _y F(x,y), \end{aligned}$$
(1)

which is of fundamental importance in, e.g., game theory [31, 37], image deconvolution problems [9], parallel computing [39], adversarial training [4, 12], and statistical learning [1].

To proceed, let us introduce the following two problem classes.

Definition 1.1

(Problem class \(\mathcal {F}(L_{x},L_{y},L_{xy},\mu _x,\mu _y)\)) \(F(\cdot ,y)\) is \(\mu _x\)-strongly convex for any fixed y and \(F(x,\cdot )\) is \(\mu _y\)-strongly concave for any fixed x. Overall, the function F is smooth and \(\nabla F\) satisfies the following Lipschitz continuity condition

$$\begin{aligned} {\left\{ \begin{array}{ll} \Vert \nabla _x F(x_1,y) - \nabla _x F(x_2,y)\Vert \le L_x\Vert x_1-x_2\Vert ,&{}\forall x_1,x_2,y \\ \Vert \nabla _y F(x,y_1) - \nabla _y F(x,y_2)\Vert \le L_y\Vert y_1-y_2\Vert ,&{}\forall x,y_1,y_2 \\ \Vert \nabla _x F(x,y_1) - \nabla _x F(x,y_2)\Vert \le L_{xy}\Vert y_1-y_2\Vert ,&{}\forall x,y_1,y_2 \\ \Vert \nabla _y F(x_1,y) - \nabla _y F(x_2,y)\Vert \le L_{xy}\Vert x_1-x_2\Vert ,&{}\forall x_1,x_2,y. \\ \end{array}\right. } \end{aligned}$$
(2)

We shall remark here that the constants in (2) may also be understood as the bounds on the different blocks of the Hessian matrix \(\nabla ^2 F(x,y)\) if F is twice continuously differentiable. That is,

$$\begin{aligned} \sup _{x,y}\Vert \nabla _{xx}^2 F(x,y)\Vert _2\le L_x,\quad \sup _{x,y}\Vert \nabla _{yy}^2 F(x,y)\Vert _2\le L_y,\quad \sup _{x,y}\Vert \nabla _{xy}^2 F(x,y)\Vert _2\le L_{xy}. \end{aligned}$$

However, throughout this paper we do not assume that either \(F(\cdot , y)\) or \(F(x,\cdot )\) is twice differentiable.

The second problem class is the bilinear saddle point model:

Definition 1.2

(Bilinear class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\)) In this special class, the problems are written as

$$\begin{aligned} \min _x \max _y F(x,y):=f(x) + x^\top Ay - g(y), \end{aligned}$$
(3)

where f(x) and g(y) are both lower semi-continuous with f(x) being \(\mu _x\)-strongly convex and g(y) being \(\mu _y\)-strongly convex. The coupling matrix A satisfies \(\Vert A\Vert _2\le L_{xy}\).

For this special model class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\), we assume the availability of the following prox-operations:

$$\begin{aligned} \mathbf {prox}_{\gamma f}(v):=\mathop {\hbox {argmin}}\limits _x f(x) + \frac{1}{2\gamma }\Vert x-v\Vert ^2 \quad \text{ and } \quad \mathbf {prox}_{\sigma g}(u):=\mathop {\hbox {argmin}}\limits _y g(y) + \frac{1}{2\sigma }\Vert y-u\Vert ^2. \end{aligned}$$
(4)
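For intuition, here is a minimal Python sketch (our illustration; the quadratic choice of f is an assumption matching the worst-case instances constructed later) that evaluates such a proximal operator in closed form and verifies it against a brute-force grid search.

```python
import numpy as np

# prox_{gamma f}(v) for f(x) = (mu/2) * ||x||^2:
# argmin_x (mu/2)||x||^2 + (1/(2 gamma))||x - v||^2 = v / (1 + gamma * mu)
def prox_quadratic(v, gamma, mu):
    return v / (1.0 + gamma * mu)

# brute-force check of the closed form on a 1-D grid
mu, gamma, v = 2.0, 0.5, 3.0
grid = np.linspace(-5.0, 5.0, 100001)
objective = 0.5 * mu * grid**2 + (grid - v) ** 2 / (2.0 * gamma)
print(prox_quadratic(v, gamma, mu))     # 1.5
print(grid[np.argmin(objective)])       # ~1.5
```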

In this paper we shall establish the lower iteration complexity bound

$$\begin{aligned} \Omega \left( \sqrt{\frac{L_x}{\mu _x}+\frac{L_{xy}^2}{\mu _x\mu _y}+\frac{L_y}{\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \text{ for } \mathcal {F}(L_{x},L_{y},L_{xy},\mu _x,\mu _y), \end{aligned}$$

and

$$\begin{aligned} \Omega \left( \sqrt{\frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \text{ for } \mathcal {B}(L_{xy},\mu _x,\mu _y) \end{aligned}$$

with the proximal oracles (4). In particular, we first establish these lower bounds for the pure first-order and proximal algorithm classes under the linear span assumption. Later on, we generalize the results to more general algorithm classes without the linear span assumption through the orthogonal invariance technique introduced by [25]. For more detailed applications of the orthogonal invariance technique in lower bound derivations, the interested reader is referred to [7, 8, 33].

As an application of the above bounds, we apply proper scaling to the worst-case instances and show that the above results imply several existing lower bounds for general convex-concave problems with bounded saddle point solutions. Specifically, we have

$$\begin{aligned} \Omega \left( \sqrt{\frac{L_xR_x^2}{\epsilon }}+\frac{L_{xy}R_xR_y}{\epsilon }+\sqrt{\frac{L_yR_y^2}{\epsilon }}\,\,\right) \text{ for } \mathcal {F}(L_{x},L_{y},L_{xy},0,0), \,\,\text{ and }\,\,\Vert x^*\Vert \le R_x, \Vert y^*\Vert \le R_y, \end{aligned}$$

and

$$\begin{aligned} \Omega \left( \frac{L_{xy}R_xR_y}{\epsilon }\right) \text{ for } \mathcal {B}(L_{xy},0,0),\,\,\text{ and }\,\, \Vert x^*\Vert \le R_x, \Vert y^*\Vert \le R_y. \end{aligned}$$

For the above two lower bounds, we remark that under specific parameter regimes the first bound is known: see [25] for the case with \(L_x = L_y = L_{xy}\) and [33] for the case with \(L_y = 0\). However, to the best of our knowledge, the first bound under a general set of parameters, as well as the second bound for the bilinear problem class, was previously unknown. Similar reductions can also be performed for the problem classes with only one of \(\mu _x\) and \(\mu _y\) equal to 0, for which the lower bounds have already been discovered in [33].

Such lower iteration complexity results shed light on understanding the performance of the algorithms designed for min-max saddle point models. There are numerous results in the literature prior to ours. As a special case of (1), the lower bound for the convex minimization problem, where \(F(x,y) = f(x)\), has been well studied in the past decades. For convex problems, Nesterov's accelerated gradient method achieves iteration complexities of \({\mathcal {O}}(\sqrt{L/\epsilon })\) for L-smooth convex problems and \({\mathcal {O}}\left( \sqrt{\frac{L}{\mu }}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \) for L-smooth and \(\mu \)-strongly convex problems, respectively, and both complexities are shown to match the lower bound for first-order methods; see [29].

However, for the min-max saddle-point models, the situation is more subtle. Due to the convex-concave nature of F, the vector field

$$\begin{aligned} G(x,y) = \begin{pmatrix} \nabla _x F(x,y)\\ -\nabla _y F(x,y) \end{pmatrix}\end{aligned}$$

is monotone. Hence the convex-concave saddle point problem is often studied as a subclass of variational inequality problems (VIP); see e.g. [16, 22, 24, 28, 30, 36] and the references therein. Although there have been plenty of studies on the variational inequality model, the roles played by the different Lipschitz constants associated with the different blocks of variables have not been fully explored in the literature. In other words, one would often denote L to be an overall Lipschitz constant of the vector field G, which is of the order \(\Theta (\max \{L_x,L_y,L_{xy}\})\) in our case, set \(\mu \) to be the strong monotonicity parameter of G, which is of the order \(\Theta (\min \{\mu _x,\mu _y\})\) in our case, and make no further distinction among the parameters. Hence the considered problems are special instances of \(\mathcal {F}(L,L,L,\mu ,\mu )\). Under such settings, many algorithms, including the mirror-prox algorithm [24], the extra-gradient methods [17, 23], and the accelerated dual extrapolation [30], have achieved the iteration complexity of \({\mathcal {O}}\left( \frac{L}{\mu }\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \), and this complexity is shown to be optimal for first-order methods in solving the problem class \(\mathcal {F}(L,L,L,\mu ,\mu )\); see [26]. However, under the more general parameter regime of \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\), these methods are not optimal. For example, Nesterov's accelerated dual extrapolation method [30] has a complexity of \({\mathcal {O}}\left( \frac{\max \{L_x,L_{xy},L_y\}}{\min \{\mu _x,\mu _y\}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \); even if the algorithm is modified carefully, one can only guarantee a complexity of \({\mathcal {O}}\left( \sqrt{\frac{L_x^2}{\mu _x^2} + \frac{L_{xy}^2}{\mu _x\mu _y} + \frac{L_y^2}{\mu _y^2}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \), and neither complexity matches the lower bound provided in this paper. More recently, tighter upper bounds have been derived. In [19], the authors consider the problem class \(\mathcal {F}(L,L,L,\mu _x,\mu _y)\) and achieve an upper bound of \({\mathcal {O}}\left( \sqrt{\frac{L^2}{\mu _x\mu _y}}\cdot \ln ^3\left( \frac{1}{\epsilon }\right) \right) \), which matches our lower bound when \(L_{x} = L_{y} = L_{xy}=L\) up to logarithmic terms. In [38], the authors consider the general problem class \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\); the proposed algorithm achieves an upper bound of \({\mathcal {O}}\left( \sqrt{\frac{L_x}{\mu _x} + \frac{L\cdot L_{xy}}{\mu _x\mu _y} + \frac{L_y}{\mu _y}}\cdot \ln ^3\left( \frac{L^2}{\mu _x\mu _y}\right) \cdot \ln \left( \frac{1}{\epsilon }\right) \right) \), where \(L = \max \{L_x,L_y,L_{xy}\}\), which almost matches our lower bound for the general problem class. Despite the gap for the general problem class \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\), given the availability of proximal operators, the authors of [9, 10] have derived an algorithm for the problem class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\) with complexity \({\mathcal {O}}\left( \sqrt{\frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \). We will prove in this paper that this result matches the theoretical lower complexity bound for its problem and algorithm classes, and is hence optimal.

For the bilinear problem (3), when f is smooth and convex and \(g(y) = b^\top y\) is linear, the problem is equivalent to the following convex optimization problem

$$\begin{aligned}\min _x \{f(x): A^\top x-b = 0\}.\end{aligned}$$

Without using the projection onto the hyperplane \(\{x:A^\top x=b\}\), which requires a matrix inversion, pure first-order methods achieve \({\mathcal {O}}(1/\epsilon )\) complexity even when f is strongly convex; see e.g. [11, 32, 40]. These iteration complexity bounds are shown to match the lower bound provided in [33]. For more details on the lower and upper bounds for this formulation, the interested reader is referred to [33]. Finally, for the bilinearly coupled problem (3), the authors of [13] show that a lower bound of \(\Omega \left( \sqrt{\frac{L_{xy}^2 + \mu _x\mu _y}{\mu _{xy}^2 + \mu _x\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \) can be derived, where \(\mu _{xy}\) stands for the minimum singular value of the coupling matrix A. Interestingly, this result covers the linear convergence phenomenon for the pure bilinear saddle point problem [5], where \(f(x)\equiv g(y)\equiv 0\). Another remark is that, due to the special construction of the worst-case instance and algorithm class, [13] does not characterize the impact of \(L_x\) and \(L_y\), nor the lower bound for the proximal algorithm class.

Other than studies on first-order algorithms, there are also studies on higher-order methods. For example, [3] considers lower iteration complexity bounds for second-order methods, while [2, 27] present lower iteration complexity bounds for general higher-order (tensor) methods. For smooth nonconvex optimization, [8] considers iteration complexity lower bounds for first-order methods, while [7] considers those for higher-order methods.

Another line of research concerns nonconvex/nonconcave min-max saddle point problems; see [14, 15] and the references therein. To guarantee convergence, additional structure is often needed. For example, if one assumes that the solutions of the problem satisfy the Minty variational inequality [18], then convergent algorithms can be constructed. Another important situation is when F is concave in y. In that case, convergence and an iteration complexity bound to a stationary solution are possible; see e.g. [21]. For more literature on this type of problem, we refer the interested reader to [20] and the references therein.

Organization This paper is organized as follows. In Sect. 2, we introduce two different algorithm classes (with or without proximal oracles). In Sect. 3, we construct a worst-case example for the problem class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\) and derive the corresponding lower iteration complexity bound for the algorithm class allowing proximal operators. An optimal algorithm is discussed in this case. In Sect. 4, we construct the worst-case example for the problem class \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\) and establish the corresponding lower complexity bound for first-order methods (without any proximal oracle). Optimal algorithms under several special parameter regimes are discussed. In Sect. 5, we show how these results reduce to lower bounds for general convex-concave problems with bounded saddle point solutions. Finally, we conclude the paper in Sect. 6.

2 The first-order algorithm classes

In this section, we discuss some preliminaries for the strongly convex and strongly concave saddle point problem. Then, we introduce two algorithm classes to set the stage for our discussion, and we note specific known algorithms as representative members of those algorithm classes.

2.1 Primal function, dual function, and the duality gap

First, we define \(\Phi (\cdot )\) to be the primal function and \(\Psi (\cdot )\) to be the dual function of the saddle point problem \(\min _x \max _y F(x,y)\), respectively, with the following definitions

$$\begin{aligned} \Phi (x):=\max _y F(x,y)\qquad \text{ and } \qquad \Psi (y):=\min _x F(x,y). \end{aligned}$$
(5)

As the maximum of a family of \(\mu _x\)-strongly convex functions, \(\Phi (x)\) is a \(\mu _x\)-strongly convex function. Similarly, \(\Psi (y)\) is a \(\mu _y\)-strongly concave function. We define the duality gap as

$$\begin{aligned} \Delta (x,y) := \max _{y'} F(x,y')-\min _{x'} F(x',y) = \Phi (x)-\Psi (y).\end{aligned}$$

Suppose the unique saddle point of this min-max problem is \((x^*,y^*)\). By strong duality, for any x and y it holds that

$$\begin{aligned} \Phi (x)\ge \min _{x'}\Phi (x') = \Phi (x^*) =F(x^*,y^*) = \Psi (y^*) = \max _{y'}\Psi (y')\ge \Psi (y).\end{aligned}$$

Together with the \(\mu _x\)-strong convexity of \(\Phi \) and the \(\mu _y\)-strong concavity of \(\Psi \), we further have

$$\begin{aligned} \Delta (x,y)= \Phi (x)-\Phi (x^*) + \Psi (y^*) - \Psi (y)\ge \frac{\mu _x}{2}\Vert x-x^*\Vert ^2 + \frac{\mu _y}{2}\Vert y-y^*\Vert ^2. \end{aligned}$$
(6)

Now, suppose that \((\tilde{x}_k,\tilde{y}_k)\) is the approximate solution generated after k iterations of an algorithm. Our aim is to lower bound the distance between \((\tilde{x}_k,\tilde{y}_k)\) and \((x^*,y^*)\). By (6), this also yields a lower iteration complexity bound in terms of the duality gap.
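To make (6) concrete, here is a tiny Python check (our own scalar test instance, not from the paper) that evaluates the primal and dual functions in closed form and confirms the duality gap inequality.

```python
import numpy as np

# scalar instance F(x, y) = (mu_x/2) x^2 + x*y - (mu_y/2) y^2 (our choice);
# the unique saddle point is (0, 0)
mu_x, mu_y = 1.0, 2.0

def Phi(x):  # primal function: max_y F(x, y), attained at y = x / mu_y
    return 0.5 * mu_x * x**2 + x**2 / (2.0 * mu_y)

def Psi(y):  # dual function: min_x F(x, y), attained at x = -y / mu_x
    return -0.5 * mu_y * y**2 - y**2 / (2.0 * mu_x)

rng = np.random.default_rng(0)
for _ in range(5):
    x, y = rng.normal(size=2)
    gap = Phi(x) - Psi(y)                      # duality gap Delta(x, y)
    bound = 0.5 * mu_x * x**2 + 0.5 * mu_y * y**2
    print(f"gap = {gap:.4f} >= bound = {bound:.4f}: {gap >= bound}")
```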

2.2 Proximal algorithm class

First, let us consider the bilinearly coupled problem class (3) as introduced in Definition 1.2:

$$\begin{aligned} \min _x \max _y F(x,y):=f(x) + x^\top Ay - g(y). \end{aligned}$$

For this special problem class, let us consider the lower iteration complexity bound for the algorithm class where the proximal oracles (4) are available.

Definition 2.1

(Proximal algorithm class) In each iteration, the iterate sequence \(\{(x^{k},y^{k})\}_{k=0,1,...}\) is generated so that \((x^{k},y^{k})\in \mathcal {H}_x^{k} \times \mathcal {H}_y^{k}\). These subspaces are generated with \(\mathcal {H}_x^{0} = \mathrm {Span}\{x^0\}, \mathcal {H}_y^{0} = \mathrm {Span}\{y^0\}\) and

$$\begin{aligned} {\left\{ \begin{array}{ll} \mathcal {H}_x^{k+1} := \mathrm {Span}\{x^i,\mathbf {prox}_{\gamma _i f}(\hat{x}^i-\gamma _iA\tilde{y}^i)~\,: \forall \hat{x}^i\in \mathcal {H}_x^{i},~ \tilde{y}^i\in \mathcal {H}_y^i,~ 0\le i\le k\}\\ \mathcal {H}_y^{k+1} := \mathrm {Span}\{y^i,\mathbf {prox}_{\sigma _i g}(\hat{y}^i+\sigma _iA^\top \tilde{x}^i): \forall \tilde{x}^i\in \mathcal {H}_x^{i},~ \hat{y}^i\in \mathcal {H}_y^i,~ 0\le i\le k\}. \end{array}\right. } \end{aligned}$$
(7)

We remark that when applying the proximal oracles, it is not necessary to use the most recent iterate \(x^k\) as the proximal center. Neither is it necessary to evaluate the gradients of the coupling term (namely the \(A^\top x\) and Ay terms) at the current iterate. Instead, the algorithm class allows the usage of any combination of points in the historical search space. We also remark that the algorithm class in Definition 2.1 does not necessarily update x and y at the same time, because setting \(x^{k+1}=x^k\) or \(y^{k+1} = y^k\) also satisfies Definition 2.1. Thus this algorithm class also includes the methods that alternatingly update x and y. Below is a sample algorithm in this class.

Example 2.1

(Algorithm 3 in [9]) Initialize with \(\gamma = \frac{1}{L_{xy}}\sqrt{\frac{\mu _y}{\mu _x}}\), \(\sigma = \frac{1}{L_{xy}}\sqrt{\frac{\mu _x}{\mu _y}}\), and \(\theta = \frac{L_{xy}}{2\sqrt{\mu _x\mu _y} + L_{xy}}.\) Set \(\tilde{x}^0 = x^0\). Then the algorithm proceeds as

$$\begin{aligned} {\left\{ \begin{array}{ll} y^{k+1} = \mathbf {prox}_{\sigma g}(y^k + \sigma A^\top \tilde{x}^k)\\ x^{k+1} = \mathbf {prox}_{\gamma f}(x^k - \gamma Ay^{k+1})\\ \tilde{x}^{k+1} = x^{k+1} + \theta (x^{k+1}-x^k). \end{array}\right. } \end{aligned}$$
(8)

It can be observed that this algorithm updates x and y in an alternating order; by slightly shifting the indices, it can be written in the form of (7) in Definition 2.1. The complexity of this method is \({\mathcal {O}}\left( \sqrt{\frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \).
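For concreteness, the following Python sketch (our illustration) implements the update (8) on a randomly generated instance of (3) with quadratic \(f(x)=\frac{\mu _x}{2}\Vert x\Vert ^2\) and \(g(y)=\frac{\mu _y}{2}\Vert y\Vert ^2+b^\top y\), for which both proximal maps have closed forms; the test instance and iteration budget are our own choices, not those of [9].

```python
import numpy as np

def algorithm3(A, b, mu_x, mu_y, L_xy, iters=500):
    gamma = np.sqrt(mu_y / mu_x) / L_xy
    sigma = np.sqrt(mu_x / mu_y) / L_xy
    theta = L_xy / (2.0 * np.sqrt(mu_x * mu_y) + L_xy)
    n = A.shape[0]
    x, y, x_bar = np.zeros(n), np.zeros(n), np.zeros(n)
    for _ in range(iters):
        # prox_{sigma g}(y + sigma A^T x_bar) for g(y) = (mu_y/2)||y||^2 + b^T y
        y = (y + sigma * (A.T @ x_bar) - sigma * b) / (1.0 + sigma * mu_y)
        # prox_{gamma f}(x - gamma A y) for f(x) = (mu_x/2)||x||^2
        x_new = (x - gamma * (A @ y)) / (1.0 + gamma * mu_x)
        x_bar = x_new + theta * (x_new - x)    # extrapolation step
        x = x_new
    return x, y

rng = np.random.default_rng(1)
n, L_xy, mu_x, mu_y = 20, 5.0, 0.5, 0.5
M = rng.normal(size=(n, n))
A = L_xy * M / np.linalg.norm(M, 2)            # enforce ||A||_2 <= L_xy
b = rng.normal(size=n)
x, y = algorithm3(A, b, mu_x, mu_y, L_xy)
# saddle point optimality: mu_x x + A y = 0 and A^T x - mu_y y - b = 0
print(np.linalg.norm(mu_x * x + A @ y), np.linalg.norm(A.T @ x - mu_y * y - b))
```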

2.3 Pure first-order algorithm class

In contrast to the previous subsection, here we consider the more general problem class \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\):

$$\begin{aligned}\min _x\max _y F(x,y).\end{aligned}$$

For such problems, we refer to the algorithm class as the pure first-order methods, meaning that there is no proximal oracle in the design of algorithms in this class.

Definition 2.2

(Pure first-order algorithm class) In each iteration, the sequence \(\{(x^{k},y^{k})\}_{k=0,1,...}\) is generated so that \((x^{k},y^{k})\in \mathcal {H}_x^{k} \times \mathcal {H}_y^{k}\), with \(\mathcal {H}_x^{0} = \mathrm {Span}\{x^0\}\), \(\mathcal {H}_y^{0} = \mathrm {Span}\{y^0\}\), and

$$\begin{aligned} {\left\{ \begin{array}{ll} \mathcal {H}_x^{k+1} := \mathrm {Span}\{x^i,\nabla _xF(\tilde{x}^i,\tilde{y}^i): \forall \tilde{x}^i\in \mathcal {H}_x^{i}, \tilde{y}^i\in \mathcal {H}_y^i, 0\le i\le k\}\\ \mathcal {H}_y^{k+1} := \mathrm {Span}\{y^i,\nabla _yF(\tilde{x}^i,\tilde{y}^i): \forall \tilde{x}^i\in \mathcal {H}_x^{i}, \tilde{y}^i\in \mathcal {H}_y^i, 0\le i\le k\}. \end{array}\right. } \end{aligned}$$
(9)

Similar to our earlier comments on the proximal algorithm class, in this class of algorithms the gradients at any combination of points in the historical search space are allowed. The algorithm class also includes the methods that alternatingly update between x and y, or even the double loop algorithms that optimize one side until certain accuracy is achieved before switching to the other side. At that level of generality, it indeed accommodates many updating schemes. To illustrate this point, let us present below some sample algorithms in this class.

The first example is a double loop scheme, in which the primal function \(\Phi (x)\) is minimized approximately. Specifically, let \(y^*(x) = \mathop {\hbox {argmax}}\limits _y F(x,y)\); by Danskin's theorem, \(\nabla \Phi (x) = \nabla _x F(x,y^*(x))\); see e.g. [6, 34]. Therefore, one can apply Nesterov's accelerated gradient method to minimize \(\Phi (x)\), and the double loop scheme performs this procedure approximately.

Example 2.2

(Double loop schemes, [35]) Denote \(\alpha _1 = \frac{\mu _x}{L_{\Phi ,x}}\) and \(\alpha _2 = \frac{\mu _y}{L_y}\), where \(L_{\Phi ,x} = L_x + \frac{L_{xy}^2}{\mu _y}\) is the Lipschitz constant of \(\nabla \Phi (x)\) (see [35]). Given \((x^0, y^0)\), define \(\bar{x}^0 = x^0\). The double loop scheme works as follows:

$$\begin{aligned}{\left\{ \begin{array}{ll} x^{k+1} = \bar{x}^k - \frac{1}{L_{\Phi ,x}}\nabla _x F(\bar{x}^k, y^k)\\ \bar{x}^{k+1} = x^{k+1} + \frac{1-\sqrt{\alpha _1}}{1+\sqrt{\alpha _1}}(x^{k+1}-x^k) \end{array}\right. }\quad \text{ for }\quad k = 0,1,...,T_1,\end{aligned}$$

where the point \(y^k\) is generated by an inner loop of accelerated gradient iterations

$$\begin{aligned}{\left\{ \begin{array}{ll} w^{t+1} = \bar{w}^t + \frac{1}{L_{y}}\nabla _y F(\bar{x}^k, \bar{w}^t)\\ \bar{w}^{t+1} = w^{t+1} + \frac{1-\sqrt{\alpha _2}}{1+\sqrt{\alpha _2}}(w^{t+1}-w^t) \end{array}\right. }\,\,\text{ for }\quad \,\, t = 0,1,...,T_2\,\,\,\, \text{ and }\,\,\,\, w^0 = \bar{w}^0 = y^{k-1}.\end{aligned}$$

Then, set \(y^{k}:=w^{T_2+1}\) to be the last iterate of the inner loop.

For simplicity, we have applied a specific acceleration scheme [29] that does not work for non-strongly convex problems; in principle, the FISTA scheme can also be used. For this scheme, with properly chosen \(T_1\) and \(T_2\), the iteration complexity \({\mathcal {O}}\left( \sqrt{\frac{L_x}{\mu _x} + \frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \sqrt{\frac{L_y}{\mu _y}}\ln ^2\left( \frac{1}{\epsilon }\right) \right) \) is achievable.
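A compact Python sketch of this double loop scheme is given below (our illustration; the bilinearly coupled quadratic test instance, constants, and loop lengths are assumptions for the demonstration).

```python
import numpy as np

def double_loop(grad_x, grad_y, L_phi_x, mu_x, L_y, mu_y, x0, y0, T1, T2):
    beta_out = (1 - np.sqrt(mu_x / L_phi_x)) / (1 + np.sqrt(mu_x / L_phi_x))
    beta_in = (1 - np.sqrt(mu_y / L_y)) / (1 + np.sqrt(mu_y / L_y))
    x, x_bar, y = x0.copy(), x0.copy(), y0.copy()
    for _ in range(T1):
        w, w_bar = y.copy(), y.copy()          # warm start the inner loop at y^{k-1}
        for _ in range(T2):                    # inner accelerated ascent in y
            w_new = w_bar + grad_y(x_bar, w_bar) / L_y
            w_bar = w_new + beta_in * (w_new - w)
            w = w_new
        y = w                                  # y^k approximates argmax_y F(x_bar, y)
        x_new = x_bar - grad_x(x_bar, y) / L_phi_x   # approximate gradient step on Phi
        x_bar = x_new + beta_out * (x_new - x)
        x = x_new
    return x, y

# toy instance F(x, y) = (mu_x/2)||x||^2 + x^T A y - (mu_y/2)||y||^2, saddle at (0, 0)
rng = np.random.default_rng(2)
n, mu_x, mu_y, L_y = 10, 1.0, 1.0, 1.0
A = rng.normal(size=(n, n)); A *= 2.0 / np.linalg.norm(A, 2)   # L_xy = 2
L_phi_x = mu_x + 2.0**2 / mu_y                 # L_x + L_xy^2 / mu_y with L_x = mu_x
gx = lambda x, y: mu_x * x + A @ y
gy = lambda x, y: A.T @ x - mu_y * y
x, y = double_loop(gx, gy, L_phi_x, mu_x, L_y, mu_y, np.zeros(n), np.zeros(n), 100, 20)
print(np.linalg.norm(x), np.linalg.norm(y))    # both near 0
```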

In the following, we also list examples of several single loop algorithms, including the gradient descent-ascent method (GDA), the extra-gradient (EG) method [17] (a special case of mirror-prox algorithm [24]), and the accelerated dual extrapolation (ADE) [30].

Example 2.3

(Single loop algorithms) Let \(L = \max \{L_x,L_y,L_{xy}\}, \mu = \min \{\mu _x,\mu _y\}\). Given the initial solution \((x^0,y^0)\), the algorithms proceed as follows:

$$\begin{aligned}&(\mathrm{GDA})\qquad {\left\{ \begin{array}{ll} x^{k+1} = x^k - \eta _1 \nabla _x F(x^k,y^k)\\ y^{k+1} = y^k + \eta _1 \nabla _yF(x^k,y^k) \end{array}\right. }\\&(\mathrm{EG})\qquad {\left\{ \begin{array}{ll} \tilde{x}^{k+1} = x^k - \eta _2 \nabla _x F(x^k,y^k)\\ \tilde{y}^{k+1} = y^k + \eta _2 \nabla _yF(x^k,y^k) \end{array}\right. } \quad \text{ and } \quad {\left\{ \begin{array}{ll} x^{k+1} = x^k - \eta _2 \nabla _x F(\tilde{x}^{k+1},\tilde{y}^{k+1})\\ y^{k+1} = y^k + \eta _2 \nabla _yF(\tilde{x}^{k+1},\tilde{y}^{k+1}) \end{array}\right. }\\&(\mathrm{ADE})\qquad {\left\{ \begin{array}{ll} x^{k+1} = x^k - \eta _3 \left( \frac{\mu }{L+\mu }\nabla _x F(x^k,y^k)+\frac{L}{L+\mu }\nabla _x F(\tilde{x}^{k+1},\tilde{y}^{k+1})\right) \\ y^{k+1} = y^k + \eta _3 \left( \frac{\mu }{L+\mu }\nabla _yF(x^k,y^k)+\frac{L}{L+\mu }\nabla _y F(\tilde{x}^{k+1},\tilde{y}^{k+1})\right) \end{array}\right. } \end{aligned}$$

where \(\eta _1 = {\mathcal {O}}\left( \frac{\mu }{L^2}\right) \), \(\eta _2={\mathcal {O}}\left( \frac{1}{L}\right) \), and \(\eta _3 = {\mathcal {O}}\left( \frac{1}{L}\right) \). The intermediate points \((\tilde{x}^{k+1},\tilde{y}^{k+1})\) in (ADE) are the same as those in (EG), except that \(\eta _2\) is replaced by \(\eta _3\).

The original update of the (ADE) algorithm is rather more complex since it involves the handling of constraints. In the unconstrained case, it can be simplified to the current form, which is a mixture of (GDA) and (EG). The corresponding iteration complexity bounds are \({\mathcal {O}}\left( \frac{L^2}{\mu ^2}\ln \left( \frac{1}{\epsilon }\right) \right) \) for (GDA), and \({\mathcal {O}}\left( \frac{L}{\mu }\ln \left( \frac{1}{\epsilon }\right) \right) \) for both (EG) and (ADE).
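The following minimal sketch (ours) contrasts a (GDA) step with an (EG) step on the pure bilinear instance \(F(x,y)=xy\), a degenerate case with \(\mu =0\) chosen only to illustrate why the extra-gradient correction matters: GDA spirals outward while EG contracts toward the saddle point.

```python
def gda_step(x, y, gx, gy, eta):
    # simultaneous gradient descent-ascent
    return x - eta * gx(x, y), y + eta * gy(x, y)

def eg_step(x, y, gx, gy, eta):
    # extra-gradient: a prediction step, then a correction using the
    # gradients evaluated at the predicted point
    xt, yt = x - eta * gx(x, y), y + eta * gy(x, y)
    return x - eta * gx(xt, yt), y + eta * gy(xt, yt)

gx = lambda x, y: y          # grad_x of F(x, y) = x * y
gy = lambda x, y: x          # grad_y of F(x, y) = x * y
x1 = y1 = x2 = y2 = 1.0
for _ in range(200):
    x1, y1 = gda_step(x1, y1, gx, gy, 0.1)
    x2, y2 = eg_step(x2, y2, gx, gy, 0.1)
print((x1**2 + y1**2) ** 0.5)   # GDA: distance to the saddle (0, 0) grows
print((x2**2 + y2**2) ** 0.5)   # EG: distance shrinks
```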

2.4 General deterministic algorithm classes without linear span structure

Although all the first-order methods reviewed above satisfy the linear span property of the proximal algorithm class in Definition 2.1 or the pure first-order algorithm class in Definition 2.2, this does not exclude the possibility of deriving an algorithm that violates the linear span property. Therefore, we also define the general deterministic proximal algorithm class and the general deterministic pure first-order algorithm class as follows; their iteration complexity lower bounds can be generalized from their linear span counterparts through the orthogonal invariance (adversarial rotation) technique.

Definition 2.3

(General proximal algorithm class) Consider problem (3) in the problem class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\), and denote \(\theta = (L_{xy},\mu _x,\mu _y)\) as the corresponding problem parameters. Let algorithm \(\mathcal {A}\) belong to the general proximal algorithm class. Then \(\mathcal {A}\) consists of a sequence of deterministic mappings \(\{(\mathcal {A}_x^1, \mathcal {A}_y^1,\mathcal {A}_u^1,\mathcal {A}_v^1),(\mathcal {A}_x^2, \mathcal {A}_y^2,\mathcal {A}_u^2,\mathcal {A}_v^2),...\}\) such that the iterate sequence \(\{(x^{k},y^{k})\}_{k=0,1,...}\) and the output sequence \(\{(\tilde{x}^{k},\tilde{y}^{k})\}_{k=0,1,...}\) are generated by

$$\begin{aligned} {\left\{ \begin{array}{ll} (x^k,\tilde{x}^k) := \mathcal {A}_x^k\left( \theta ; x^0,Ay^0,...,x^{k-1},Ay^{k-1}; \mathbf {prox}_{\gamma _k f}(u^k) \right) ,\\ (y^k,\tilde{y}^k) := \mathcal {A}_y^k\left( \theta ; y^0,A^\top x^0,...,y^{k-1},A^\top x^{k-1}; \mathbf {prox}_{\sigma _k g}(v^k) \right) , \end{array}\right. } \end{aligned}$$
(10)

where \(u^k = \mathcal {A}^k_u(\theta ; x^0,Ay^0,...,x^{k-1},Ay^{k-1})\), \(v^k = \mathcal {A}^k_v(\theta ; y^0,A^\top x^0,...,y^{k-1},A^\top x^{k-1})\), and \((x^0,y^0)\) is any given initial solution.

One remark is that the input of the proximal mapping \(\mathbf {prox}_{\gamma _kf}(\cdot )\) is constructed from the same information available to \(\mathcal {A}_x^k\); that is, there is another deterministic mapping \(\mathcal {A}_u^k\) generating a vector \(u^k = \mathcal {A}^k_u(\theta ; x^0,Ay^0,...,x^{k-1},Ay^{k-1})\), and then \(\mathbf {prox}_{\gamma _kf}(u^k)\) is passed to the mapping \(\mathcal {A}_x^k\). This \(\mathcal {A}^k_u\) does not need to be linear. The situation for \(v^k\) and \(\mathcal {A}^k_v\) is similar. Analogous to the general proximal algorithm class, the general pure first-order algorithm class is defined as follows.

Definition 2.4

(General pure first-order algorithm class) Consider problem (1) in the problem class \(\mathcal {F}(L_x, L_y, L_{xy},\mu _x,\mu _y)\), and denote \(\theta = (L_x, L_y, L_{xy},\mu _x,\mu _y)\) as the corresponding problem parameters. Let algorithm \(\mathcal {A}\) belong to the general pure first-order algorithm class. Then \(\mathcal {A}\) consists of a sequence of deterministic mappings \(\{\mathcal {A}_x^1, \mathcal {A}_y^1, \mathcal {A}_x^2, \mathcal {A}_y^2,...\}\) such that the iterate sequence \(\{(x^{k},y^{k})\}_{k=0,1,...}\) and the output sequence \(\{(\tilde{x}^{k},\tilde{y}^{k})\}_{k=0,1,...}\) are generated by

$$\begin{aligned} {\left\{ \begin{array}{ll} (x^k,\tilde{x}^k) := \mathcal {A}_x^k\left( \theta ; x^0,\nabla _x F(x^0,y^0),...,x^{k-1},\nabla _x F(x^{k-1},y^{k-1}) \right) ,\\ (y^k,\tilde{y}^k) := \mathcal {A}_y^k\left( \theta ; y^0,\nabla _y F(x^0,y^0),...,y^{k-1},\nabla _y F(x^{k-1},y^{k-1})\right) , \end{array}\right. } \end{aligned}$$
(11)

given any initial solution \((x^0,y^0)\).

A remark is that the gradients \(\nabla _x F(\cdot ,\cdot )\) and \(\nabla _y F(\cdot ,\cdot )\) do not actually need to be taken at the previous iterates \(\{(x^0,y^0),(x^1,y^1),...,(x^{k-1},y^{k-1})\}\). Similar to the proximal case, they can also be taken at some other points \(\{(u^0_{k-1},v^0_{k-1}),(u^1_{k-1},v^1_{k-1}),...,(u^{k-1}_{k-1},v^{k-1}_{k-1})\}\) generated by some mappings \(\{(\mathcal {A}_u^{k-1,0},\mathcal {A}_v^{k-1,0}),(\mathcal {A}_u^{k-1,1},\mathcal {A}_v^{k-1,1}),...,(\mathcal {A}_u^{k-1,k-1},\mathcal {A}_v^{k-1,k-1})\}\). The reason that we do not consider this more general form is twofold. First, the simpler form in Definition 2.4 already covers all the algorithms discussed. Second, the more general form shares the same iteration complexity lower bound, despite the technical complications involved. Therefore, in this paper we only include the gradients at past iterates as the input to the algorithm.

3 Lower bound for proximal algorithms

3.1 The worst-case instance

Let us construct the following bilinearly coupled min-max saddle point problem:

$$\begin{aligned} \min _x\max _y F(x,y):=\frac{\mu _x}{2}\Vert x\Vert ^2 + \frac{L_{xy}}{2}x^\top Ay - \frac{\mu _y}{2}\Vert y\Vert ^2 - b^\top y \end{aligned}$$
(12)

where b is a vector to be determined later, and the coupling matrix A (hence \(A^2\) and \(A^4\)) is defined as follows:

$$\begin{aligned} A=\begin{pmatrix} &{} &{} &{} &{} 1 \\ &{} &{} &{} 1 &{} -1 \\ &{} &{} \iddots &{} \iddots &{} \\ &{} 1 &{} -1 &{} &{} \\ 1 &{} -1 &{} &{} &{} \end{pmatrix}, \qquad A^2=\begin{pmatrix} 1 &{} -1 &{} &{} &{} \\ -1 &{} 2 &{} -1 &{} &{} \\ &{} \ddots &{} \ddots &{} \ddots &{} \\ &{} &{} -1 &{} 2 &{} -1 \\ &{} &{} &{} -1 &{} 2 \end{pmatrix}, \qquad A^4 = (A^2)^2. \end{aligned}$$

(13)

Note that \(A^\top =A\) and \(\Vert A\Vert _2 \le 2\). Therefore (12) is an instance in the problem class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\). It is worth noting that the example (12) is the same as that in Proposition 2 of [13], which is a parallel work focused on pure first-order algorithm class. Here, we use the same example to elaborate the lower bound of the proximal methods over the general bilinear coupling class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\), as well as a warmup for the discussion of more complex problem class \(\mathcal {F}(L_{x},L_{y},L_{xy},\mu _x,\mu _y)\).

Denote \(e_i\) to be the i-th unit vector, which has 1 at the i-th component and 0 elsewhere. Then by direct calculation, one can check that \(A^2\) satisfies the following zero-chain property (see Chapter 2 of [29]).

Proposition 3.1

(Zero-chain property) For any vector \(v\in \mathbb {R}^n\), if \(v\in \mathrm {Span}\{e_i:i\le k\}\) for some \(1\le k\le n-1\), then \(A^2v\in \mathrm {Span}\{e_i:i\le k+1\}\).

This means that if v only has nonzero elements at the first k entries, then \(A^2v\) will have at most one more nonzero entry at the \((k+1)\)-th position.
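To make the zero-chain mechanism concrete, the following Python snippet (ours) builds the matrix A from (13) and numerically confirms the claimed properties: \(A^\top = A\), \(\Vert A\Vert _2\le 2\), \(A^2\) tridiagonal, and the zero-chain growth of Proposition 3.1.

```python
import numpy as np

def build_A(n):
    # the anti-bidiagonal matrix from (13) (0-based indices):
    # A[i, n-1-i] = 1 on the anti-diagonal, A[i, n-i] = -1 just right of it
    A = np.zeros((n, n))
    for i in range(n):
        A[i, n - 1 - i] = 1.0
        if i >= 1:
            A[i, n - i] = -1.0
    return A

n = 8
A = build_A(n)
assert np.allclose(A, A.T)                    # A is symmetric
print(np.linalg.norm(A, 2) <= 2.0)            # True: ||A||_2 <= 2
print((A @ A).astype(int))                    # tridiagonal: diag (1,2,...,2), off-diag -1
v = np.zeros(n); v[0] = 1.0                   # v in Span{e_1}
for _ in range(3):
    v = A @ A @ v                             # multiply by A^2
    print(np.nonzero(v)[0])                   # nonzeros grow by one index per step
```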

For problem (12), the proximal operators in (7) can be written explicitly:

$$\begin{aligned} \mathbf {prox}_{\gamma _i f}(\hat{x}^i-\gamma _iA\tilde{y}^i)= & {} \mathop {\hbox {argmin}}\limits _x \frac{\mu _x}{2}\Vert x\Vert ^2 + \frac{1}{2\gamma _i}\left\| x-\left( \hat{x}^i-\frac{\gamma _iL_{xy}}{2}A\tilde{y}^i\right) \right\| ^2\nonumber \\= & {} \frac{1}{1+\gamma _i\mu _x}\hat{x}^i - \frac{\gamma _iL_{xy}}{2(1+\gamma _i\mu _x)}A\tilde{y}^i\nonumber \\\in & {} \mathrm {Span}\{\hat{x}^i,A\tilde{y}^i\}. \end{aligned}$$
(14)

Similarly, for the y block, we also have

$$\begin{aligned} \mathbf {prox}_{\sigma _i g}(\hat{y}^i+\sigma _iA^\top \tilde{x}^i) = \frac{\hat{y}^i-\sigma _i b}{1+\sigma _i\mu _y} + \frac{\sigma _iL_{xy}}{2(1+\sigma _i\mu _y)}A\tilde{x}^i \in \mathrm {Span}\{\hat{y}^i,A\tilde{x}^i,b\}. \end{aligned}$$
(15)

Let us assume without loss of generality that the initial point is \(x^0 = y^0 = 0\) (so that \(\mathcal {H}_x^0 = \mathcal {H}_y^0 = \{0\}\)). Directly substituting (14) and (15) into Definition 2.1 yields

$$\begin{aligned} {\left\{ \begin{array}{ll} \mathcal {H}_x^1 \subseteq \mathrm {Span}\{0\}\\ \mathcal {H}_y^1 \subseteq \mathrm {Span}\{b\} \end{array}\right. } \quad {\left\{ \begin{array}{ll} \mathcal {H}_x^2 \subseteq \mathrm {Span}\{Ab\}\\ \mathcal {H}_y^2 \subseteq \mathrm {Span}\{b\} \end{array}\right. } \quad {\left\{ \begin{array}{ll} \mathcal {H}_x^3 \subseteq \mathrm {Span}\{Ab\}\\ \mathcal {H}_y^3 \subseteq \mathrm {Span}\{b,A^2b\} \end{array}\right. } \quad {\left\{ \begin{array}{ll} \mathcal {H}_x^4 \subseteq \mathrm {Span}\{Ab, A^3b\}\\ \mathcal {H}_y^4 \subseteq \mathrm {Span}\{b,A^2b\} \end{array}\right. } \ldots \end{aligned}$$

We formally summarize this observation below:

Lemma 3.2

For problem (12) and any \(k \in \mathbb {N}\), if the iterates are generated so that \((x^k,y^k)\in \mathcal {H}_x^k\times \mathcal {H}_y^k\), with \(\mathcal {H}_x^k\) and \(\mathcal {H}_y^k\) defined in Definition 2.1, then based on (14) and (15) the search subspaces satisfy

$$\begin{aligned}\mathcal {H}_x^k \subseteq {\left\{ \begin{array}{ll} \{0\}, &{} k = 1\\ \mathrm {Span}\left\{ A^{2i}(Ab): 0\le i\le \left\lfloor \frac{k}{2}\right\rfloor -1 \right\} , &{} k\ge 2 \end{array}\right. } \quad \text{ and } \quad \mathcal {H}_y^k \subseteq \mathrm {Span}\left\{ A^{2i}b: 0\le i\le \left\lceil \frac{k}{2}\right\rceil -1 \right\} .\end{aligned}$$

3.2 Lower bounding the duality gap

Let us lower bound the dual optimality gap, which is itself upper bounded by the duality gap; a lower bound on the former therefore yields a lower bound on the latter. To this end, let us first write down the dual function of problem (12) as

$$\begin{aligned} \Psi (y) = \min _{x} F(x,y) = -\frac{1}{2}y^\top \left( \frac{L_{xy}^2}{4\mu _x}\cdot A^2 + \mu _y\cdot I\right) y - b^\top y . \end{aligned}$$
(16)

For this \(\mu _y\)-strongly concave dual function, the optimal solution \(y^*\) is characterized directly by the first-order condition \(\nabla \Psi (y^*) = 0\). However, the exact solution \(y^*\) does not have a simple closed form, so we choose to characterize it through an approximate solution \(\hat{y}^*\).

Lemma 3.3

(Approximate optimal solution) Let us assign the value of b as \(b := -\frac{L_{xy}^2}{4\mu _x}e_1\). Denote \(\alpha := \frac{4\mu _x\mu _y}{L_{xy}^2}\), and let \(q = \frac{1}{2}\left( (2+\alpha ) - \sqrt{(2+\alpha )^2 - 4}\right) \in (0,1)\) be the smallest root of the quadratic equation \(1 - (2+\alpha )q + q^2 = 0\). Then, an approximate optimal solution \(\hat{y}^*\) can be constructed as

$$\begin{aligned} \hat{y}^*_{i} = \frac{q^i}{1-q} \quad \text{ for } \quad i = 1,2,...,n. \end{aligned}$$
(17)

The approximation error can be bounded by

$$\begin{aligned} \Vert \hat{y}^* - y^*\Vert \le \frac{q^{n+1}}{\alpha (1-q)}, \end{aligned}$$
(18)

where \(\hat{y}^*_i\) denotes the i-th element of \(\hat{y}^*\). Note that \(q<1\) and the bound (18) is dimension-independent; hence we are free to choose n to make the approximation error arbitrarily small.

Proof

First, let us substitute the value of b into the KKT system \(\nabla \Psi (y^*)=0\); after slightly rearranging and scaling the terms, we get

$$\begin{aligned} \left( A^2 + \frac{4\mu _x\mu _y}{L_{xy}^2} I\right) y^* = - \frac{4\mu _x}{L_{xy}^2}b. \end{aligned}$$

Using the definition of \(\alpha \) and b, the equation becomes

$$\begin{aligned} (A^2+\alpha I)y^* = e_1.\end{aligned}$$

Substituting the formula of \(A^2\) in (13), we expand the above equation as

$$\begin{aligned} {\left\{ \begin{array}{ll} (1+\alpha )y_1^* - y_2^* = 1\\ -y_1^* + (2+\alpha )y_2^* - y_3^* = 0\\ \qquad \qquad \quad \vdots \\ -y_{n-2}^* + (2+\alpha )y_{n-1}^* - y_n^* = 0\\ -y_{n-1}^* + (2+\alpha )y_n^* = 0. \end{array}\right. } \end{aligned}$$
(19)

By direct calculation, we can check that \(\hat{y}^*\) satisfies the first \(n-1\) equations of the KKT system (19). The last equation, however, is violated, with a residual of size \(q^{n+1}/(1-q)\). In detail,

$$\begin{aligned}{\left\{ \begin{array}{ll} (A^2 + \alpha \cdot I)\hat{y}^* = e_1 + \frac{q^{n+1}}{1-q}\cdot e_n\\ (A^2 + \alpha \cdot I)y^* = e_1. \end{array}\right. }\end{aligned}$$

This indicates that \(\hat{y}^* - y^* = \frac{q^{n+1}}{1-q}\cdot (A^2 + \alpha I)^{-1} e_n\). Since \(\alpha ^{-1}I\succeq (A^2 + \alpha I)^{-1}\succ 0\), the approximation error is bounded as in (18). \(\square \)
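As a quick numerical sanity check (our own, with arbitrary test parameters, and using the tridiagonal \(A^2\) from (13)), the following Python snippet verifies that \(\hat{y}^*\) violates only the last equation of (19), with exactly the stated residual.

```python
import numpy as np

n = 30
mu_x, mu_y, L_xy = 1.0, 1.0, 10.0              # arbitrary test parameters
alpha = 4.0 * mu_x * mu_y / L_xy**2
q = 0.5 * ((2.0 + alpha) - np.sqrt((2.0 + alpha) ** 2 - 4.0))
# A^2 from (13): tridiagonal with diagonal (1, 2, ..., 2) and off-diagonals -1
A2 = (np.diag([1.0] + [2.0] * (n - 1))
      - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1))
y_hat = q ** np.arange(1, n + 1) / (1.0 - q)   # the approximate solution (17)
residual = (A2 + alpha * np.eye(n)) @ y_hat
residual[0] -= 1.0                             # subtract the right-hand side e_1
print(int(np.argmax(np.abs(residual))))        # n-1: only the last equation is off
print(residual[-1], q ** (n + 1) / (1.0 - q))  # residual matches q^(n+1)/(1-q)
```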

Note that in Lemma 3.3, we have chosen \(b\propto e_1\). By the zero-chain property in Proposition 3.1 and Lemma 3.2, we can verify that the subspaces \(\mathcal {H}_y^{2k-1}\) and \(\mathcal {H}_y^{2k}\) satisfy

$$\begin{aligned} \mathcal {H}_y^{2k-1}, \mathcal {H}_y^{2k} \subseteq \mathrm {Span}\{b,A^2b,...,A^{2(k-1)}b\} =\mathrm {Span}\{e_1, e_2, ...,e_k\}. \end{aligned}$$
(20)

This implies that for both \(y^{2k}\) and \(y^{2k-1}\), the only possible nonzero elements are the first k ones, which again implies that the lower bounds on \(\Vert y^{2k}-y^*\Vert ^2\) and \(\Vert y^{2k-1}-y^*\Vert ^2\) will be similar. For simplicity, we only discuss the lower bound for \(y^{2k}\); the counterpart for \(y^{2k-1}\) can be obtained in a similar way. We have the following estimate.

Lemma 3.4

Assume \(k\le \frac{n}{2}\) and \(n\ge 2\log _q\left( \frac{\alpha }{4\sqrt{2}}\right) \). Then

$$\begin{aligned} \Vert y^{2k}-y^*\Vert ^2\ge \frac{q^{2k}}{16}\Vert y^0-y^*\Vert ^2 \end{aligned}$$
(21)

where \(y^0 = 0\) is the initial solution.

Proof

By the subspace characterization (20), we have

$$\begin{aligned}\Vert y^{2k}-\hat{y}^*\Vert \ge \sqrt{\sum ^n_{j = k+1} (\hat{y}^*_j)^2} = \frac{q^{k}}{1-q}\sqrt{q^2 + q^4 + \cdots + q^{2(n-k)}}\ge \frac{q^k}{\sqrt{2}}\Vert \hat{y}^*\Vert = \frac{q^k}{\sqrt{2}}\Vert y^0-\hat{y}^*\Vert ,\end{aligned}$$

where the last inequality is due to the fact that \(q< 1\), \(k\le \frac{n}{2}\), and \(y^0 = 0\). If we choose n to be large enough, then \(\hat{y}^*\) and \(y^*\) can be made arbitrarily close to each other. Hence we can transform the above inequality to (21). More details of this derivation can be found in Appendix A. \(\square \)

Using Lemma 3.4 and (6), it is then straightforward to lower bound the duality gap by

$$\begin{aligned}\Delta (x^{2k},y^{2k})\ge q^{2k}\cdot \frac{\mu _y\Vert y^*-y^0\Vert ^2}{32}.\end{aligned}$$

Summarizing, below we present our first main result.

Theorem 3.5

Let the positive parameters \(\mu _x,\mu _y>0\) and \(L_{xy}> 0\) be given. For any integer k, there exists a problem instance in \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\) of the form (12), with \(n\ge \max \left\{ 2\log _q\left( \frac{\mu _x\mu _y}{\sqrt{2}L_{xy}^2}\right) , 4k\right\} \), the matrix \(A\in \mathbb {R}^{n\times n}\) as defined in (13), and \(b = -\frac{L_{xy}^2}{4\mu _x}e_1\). For such a problem instance, any approximate solution \((\tilde{x}^k, \tilde{y}^k)\in \mathcal {H}_x^k\times \mathcal {H}_y^k\) generated by the proximal algorithm class under the linear span assumption (7) satisfies

$$\begin{aligned} \max _y F(\tilde{x}^{k},y) - \min _{x} F(x, \tilde{y}^k)\ge q^{k}\cdot \frac{\mu _y\Vert y^*-y^0\Vert ^2}{32} \quad \text{ and } \quad \Vert \tilde{y}^k-y^*\Vert ^2\ge q^{k}\cdot \frac{\Vert y^*-y^0\Vert ^2}{16},\nonumber \\ \end{aligned}$$
(22)

where \(q = 1+\frac{2\mu _x\mu _y}{L_{xy}^2} - 2\sqrt{\left( \frac{\mu _x\mu _y}{L_{xy}^2}\right) ^2 + \frac{\mu _x\mu _y}{L_{xy}^2}}\).

Proposition 3.6

Under the same assumptions as in Theorem 3.5, if we require the duality gap to be bounded by \(\epsilon \), then the number of iterations needed is at least

$$\begin{aligned} k\ge \ln \left( \frac{\mu _y\Vert y^*-y^0\Vert ^2}{32\epsilon }\right) /\ln (q^{-1}) = \Omega \left( \sqrt{\frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) . \end{aligned}$$
(23)

The proof of Proposition 3.6 is in Appendix B.

3.3 The general proximal algorithm class

Note that Theorem 3.5 is derived for the proximal algorithm class under the linear span assumption. In this section, we apply the orthogonal invariance technique introduced in [25] to generalize the result of Theorem 3.5 to the general proximal algorithm class without the linear span assumption.

Consider the bilinear problem class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\) and the corresponding worst case problem (12) with \(F(x,y) = \frac{\mu _x}{2}\Vert x\Vert ^2 + \frac{L_{xy}}{2}x^\top A y - \frac{\mu _y}{2}\Vert y\Vert ^2 - b^\top y\), where A and b are defined in accordance with Theorem 3.5. We define the orthogonally rotated problem as

$$\begin{aligned} \min _x\max _y F_{U,V}(x,y) : = F(Ux,Vy) = \frac{\mu _x}{2}\Vert x\Vert ^2 + \frac{L_{xy}}{2}x^\top U^\top AV y - \frac{\mu _y}{2}\Vert y\Vert ^2 - b^\top Vy, \end{aligned}$$
(24)

where U, V are two orthogonal matrices. It is then clear that \(F_{U,V}\in \mathcal {B}(L_{xy},\mu _x,\mu _y)\). Let \((x^*, y^*)\) be the saddle point of F(x, y); then the saddle point of \(F_{U,V}(x,y)\) is \((\bar{x}^*,\bar{y}^*) = (U^\top x^*,V^\top y^*)\). Consequently, the lower bound for the general proximal algorithm class is characterized by the following theorem.
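The heart of this invariance is that orthogonal rotations preserve all class parameters; the small Python check below (ours, with a stand-in coupling matrix) illustrates this for the rotation in (24), whose rotated coupling keeps the same spectral norm.

```python
import numpy as np

def random_orthogonal(n, rng):
    # Haar-like orthogonal matrix from the QR factorization of a Gaussian matrix
    Q, R = np.linalg.qr(rng.normal(size=(n, n)))
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(3)
n = 12
A = rng.normal(size=(n, n))                    # stand-in coupling matrix
U, V = random_orthogonal(n, rng), random_orthogonal(n, rng)
# the rotated coupling U^T A V in (24) keeps the spectral norm, so F_{U,V}
# remains in B(L_xy, mu_x, mu_y); the strongly convex/concave quadratic
# terms are rotation-invariant as well since ||Ux|| = ||x|| and ||Vy|| = ||y||
print(np.linalg.norm(A, 2), np.linalg.norm(U.T @ A @ V, 2))   # equal
```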

Theorem 3.7

Let \(\mathcal {A}\) be any algorithm from the general proximal algorithm class described in Definition 2.3, and assume for simplicity that the dimension n is sufficiently large. For any integer k, there exist orthogonal matrices U, V such that \(F_{U,V}\in \mathcal {B}(L_{xy},\mu _x,\mu _y)\) and, when \(\mathcal {A}\) is applied to \(F_{U,V}\) with initial solution \((x^0,y^0) = (0,0)\), the iterates and output satisfy

$$\begin{aligned}&\{(x^0,y^0),...,(x^k,y^k)\}\subseteq U^\top \mathcal {H}_x^{4k-1}\times V^\top \mathcal {H}_y^{4k-1}\qquad \text{ and }\\&\quad (\tilde{x}^k,\tilde{y}^k)\in U^\top \mathcal {H}_x^{4k+1}\times V^\top \mathcal {H}_y^{4k+1},\end{aligned}$$

where \(\mathcal {H}_x^i,\mathcal {H}_y^i\) are defined in Lemma 3.2. Consequently, by Theorem 3.5,

$$\begin{aligned} \Vert \tilde{y}^k-V^\top y^*\Vert ^2\ge \frac{q^{4k+2}}{16}\Vert y^*-y^0\Vert ^2 \end{aligned}$$

where q is given in Theorem 3.5. As a result, it takes \(\Omega \left( \sqrt{\frac{L^2_{xy}}{\mu _x\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \) iterations to output a solution with \(O(\epsilon )\) duality gap.

To prove this theorem, we only need to construct the orthogonal matrices U, V such that, when the algorithm \(\mathcal {A}\) is applied to \(F_{U,V}\), the subspace inclusions \(\{(x^0,y^0),...,(x^k,y^k)\}\subseteq U^\top \mathcal {H}_x^{4k-1}\times V^\top \mathcal {H}_y^{4k-1}\) and \((\tilde{x}^k,\tilde{y}^k)\in U^\top \mathcal {H}_x^{4k+1}\times V^\top \mathcal {H}_y^{4k+1}\) hold. As a result,

$$\begin{aligned} \Vert \tilde{y}^k-V^\top y^*\Vert ^2 = \Vert V\tilde{y}^k- y^*\Vert ^2\ge \min _{y\in \mathcal {H}_y^{4k+1}}\Vert y-y^*\Vert ^2\ge \frac{q^{4k+2}}{16}\Vert y^*-y^0\Vert ^2. \end{aligned}$$

With this argument, the latter results follow directly from the discussion of Theorem 3.5. The proof of this theorem is presented in Appendix C.

3.4 Tightness of the bound

The following remark establishes the tightness of the derived lower bound.

Remark 3.8

(Tightness of the bound) Consider the algorithm defined in Example 2.1, from [9, 10]. Its upper complexity bound is \({\mathcal {O}}\left( \sqrt{\frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \), which matches our lower bound. This means that our lower bound (23) is tight and that the algorithm defined in Example 2.1 is optimal within the proximal algorithm class of Definition 2.1.

4 Lower bound for pure first-order algorithms

4.1 The worst-case instance

In this section, we consider the lower complexity bound for pure first-order methods without any proximal oracle. In this case, only gradient information can be used to construct the iterates and produce the approximate solution output. As before, we still consider bilinearly coupled problems:

$$\begin{aligned} \min _x\max _y F(x,y):=\frac{1}{2}x^\top (B_xA^2 + \mu _xI)x + \frac{L_{xy}}{2}x^\top Ay - \frac{1}{2}y^\top (B_yA^2 + \mu _yI)y - b^\top y \end{aligned}$$
(25)

where b is a vector whose value will be determined later, the coefficients are \(B_x: = \frac{L_x-\mu _x}{4}\) and \(B_y:= \frac{L_y-\mu _y}{4}\), and the coupling matrix A is defined by (13). Since \(\Vert A\Vert _2\le 2\) and \(\Vert A\Vert _2^2\le 4\), one can check that problem (25) is an instance of the problem class \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\). This time the subspaces \(\mathcal {H}_x^k\) and \(\mathcal {H}_y^k\) are generated by the following gradients:

$$\begin{aligned} {\left\{ \begin{array}{ll} \nabla _x F(x,y) = (B_xA^2 + \mu _xI)x + \frac{L_{xy}}{2}Ay,\\ \nabla _yF(x,y) = -(B_yA^2 + \mu _yI)y + \frac{L_{xy}}{2}Ax - b. \end{array}\right. } \end{aligned}$$

Following Definition 2.2, by letting \(x^0 = y^0 = 0\) we have

$$\begin{aligned} {\left\{ \begin{array}{ll} \mathcal {H}_x^1 \subseteq \mathrm {Span}\{0\}\\ \mathcal {H}_y^1 \subseteq \mathrm {Span}\{b\} \end{array}\right. } \quad {\left\{ \begin{array}{ll} \mathcal {H}_x^2 \subseteq \mathrm {Span}\{Ab\}\\ \mathcal {H}_y^2 \subseteq \mathrm {Span}\{b, A^2b\} \end{array}\right. } \quad {\left\{ \begin{array}{ll} \mathcal {H}_x^3 \subseteq \mathrm {Span}\{Ab,A^2(Ab)\}\\ \mathcal {H}_y^3 \subseteq \mathrm {Span}\{b,A^2b,A^4b\} \end{array}\right. } \quad \ldots \end{aligned}$$

By induction, we get the general structure of these subspaces.

Lemma 4.1

For problem (25) and any \(k \in \mathbb {N}\), if the iterates are generated so that \((x^k,y^k)\in \mathcal {H}_x^k\times \mathcal {H}_y^k\), with \(\mathcal {H}_x^k\) and \(\mathcal {H}_y^k\) defined in Definition 2.2, then we have

$$\begin{aligned}\mathcal {H}_x^k \subseteq {\left\{ \begin{array}{ll} \{0\}, &{} k = 1\\ \mathrm {Span}\left\{ A^{2i}(Ab): 0\le i\le k-2 \right\} , &{} k\ge 2 \end{array}\right. } \quad \text{ and } \quad \mathcal {H}_y^k \subseteq \mathrm {Span}\left\{ A^{2i}b: 0\le i\le k -1 \right\} .\end{aligned}$$

Different from the discussion in the last section, this time it is more convenient to work with the primal function instead of the dual one. Partially maximizing over y, we have

$$\begin{aligned} \Phi (x):=\max _y F(x,y)&= \frac{1}{2}x^\top (B_xA^2+\mu _xI)x \\&\quad + \frac{L_{xy}^2}{8}\left( Ax-\frac{2b}{L_{xy}}\right) ^\top (B_yA^2+\mu _yI)^{-1}\left( Ax-\frac{2b}{L_{xy}}\right) , \end{aligned}$$

which is \(\mu _x\)-strongly convex. Therefore, the primal optimal solution \(x^*\) is completely characterized by the optimality condition \(\nabla \Phi (x^*) = 0\). However, the solution of this system cannot be computed in closed form. Instead, we shall construct an approximate solution \(\hat{x}^*\) to the exact solution \(x^*\).

Lemma 4.2

(Root estimation) Consider a quartic equation

$$\begin{aligned} 1 -(4+\alpha )x + (6+2\alpha + \beta )x^2 - (4+\alpha )x^3 + x^4 = 0, \end{aligned}$$
(26)

where the constants are given by

$$\begin{aligned} \alpha = \frac{L_{xy}^2}{4B_xB_y} + \frac{\mu _x}{B_x} + \frac{\mu _y}{B_y},\qquad \beta = \frac{\mu _x\mu _y}{B_xB_y}. \end{aligned}$$
(27)

As long as \(L_x>\mu _x>0\) and \(L_y>\mu _y>0\), the constants \(0<\alpha ,\beta <+\infty \) are well-defined positive real numbers, and the quartic equation has a real root \(x=q\) satisfying

$$\begin{aligned} 1-\left( \frac{1}{2}+\frac{1}{2\sqrt{2}}\sqrt{\frac{L_x}{\mu _x} + \frac{L_{xy}^2}{\mu _x\mu _y} + \frac{L_y}{\mu _y}}\right) ^{-1}<q<1 - \left( \frac{1}{2}+\frac{1}{2}\sqrt{\frac{L_{xy}^2}{\mu _x\mu _y} + \frac{L_x}{\mu _x} + \frac{L_y}{\mu _y}-1}\right) ^{-1} . \end{aligned}$$
(28)
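As a numerical sanity check of this lemma (our own sketch, with arbitrary test parameters), the following Python snippet solves the quartic (26) and verifies that a real root lies in the bracket (28).

```python
import numpy as np

L_x, L_y, L_xy, mu_x, mu_y = 10.0, 8.0, 5.0, 1.0, 0.5     # arbitrary test values
B_x, B_y = (L_x - mu_x) / 4.0, (L_y - mu_y) / 4.0
alpha = L_xy**2 / (4.0 * B_x * B_y) + mu_x / B_x + mu_y / B_y
beta = mu_x * mu_y / (B_x * B_y)
# coefficients of 1 - (4+alpha) x + (6+2 alpha+beta) x^2 - (4+alpha) x^3 + x^4,
# highest degree first
coeffs = [1.0, -(4.0 + alpha), 6.0 + 2.0 * alpha + beta, -(4.0 + alpha), 1.0]
roots = np.roots(coeffs)
kappa = L_x / mu_x + L_xy**2 / (mu_x * mu_y) + L_y / mu_y
lo = 1.0 - 1.0 / (0.5 + np.sqrt(kappa / 2.0) / 2.0)        # lower end of (28)
hi = 1.0 - 1.0 / (0.5 + np.sqrt(kappa - 1.0) / 2.0)        # upper end of (28)
q = max(r.real for r in roots if abs(r.imag) < 1e-9 and 0.0 < r.real < 1.0)
print(lo, q, hi, lo < q < hi)   # the bracket (28) holds: True
```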

The proof of this lemma is presented in Appendix D. With this lemma, we can construct the approximate solution \(\hat{x}^*\) as follows.

Lemma 4.3

(Approximate optimal solution) Let \(\alpha ,\beta \) be defined according to (27), and let q be a real root of the quartic equation (26) satisfying (28). Let us define a vector \(\hat{b}\) with elements given by

$$\begin{aligned} \hat{b}_1 := (2+\alpha + \beta )q - (3+\alpha )q^2+q^3,\qquad \hat{b}_2 :=q-1, \quad \text{ and } \quad \hat{b}_k = 0, \text{ for } 3\le k \le n, \end{aligned}$$
(29)

and then assign \(b = \frac{2B_xB_y}{L_{xy}}A^{-1}\hat{b}\). Then an approximate solution \(\hat{x}^*\) is constructed as

$$\begin{aligned} \hat{x}^*_{i} = q^i \quad \text{ for } \quad i = 1,2,...,n. \end{aligned}$$
(30)

The approximation error can be bounded by

$$\begin{aligned} \Vert \hat{x}^*-x^*\Vert \le \frac{7+\alpha }{\beta }\cdot q^n. \end{aligned}$$
(31)

Note that \(q<1\) and the bound (31) is dimension-independent; hence we are free to choose n to make the approximation error arbitrarily small.

The proof of this lemma is parallel to that of Lemma 3.3, but is more involved; the detailed proof is in Appendix E.

Note that in this case, the vector \(Ab\propto \hat{b}\in \mathrm {Span}\{e_1,e_2\}\). By the zero-chain property in Proposition 3.1, the subspace \(\mathcal {H}_x^k\) described in Lemma 4.1 satisfies, by induction,

$$\begin{aligned} \mathcal {H}_x^k\subset \mathrm {Span}\{e_1,e_2,...,e_k\} \quad \text{ for } \quad k\ge 2. \end{aligned}$$
(32)

Parallel to Lemma 3.4, we have the following lemma, whose proof is in Appendix F.

Lemma 4.4

Assume \(k\le \frac{n}{2}\) and \(n\ge 2\log _q\left( \frac{\beta }{4\sqrt{2}(7+\alpha )}\right) +2\). Then

$$\begin{aligned} \Vert x^k-x^*\Vert ^2\ge \frac{q^{2k}}{16}\Vert x^*-x^0\Vert ^2 \end{aligned}$$
(33)

where \(x^0 = 0\) is the initial solution.

Consequently, the duality gap is lower bounded by

$$\begin{aligned}\Delta (x^k,y^k)\ge \frac{\mu _x}{2}\Vert x^k-x^*\Vert ^2\ge q^{2k}\cdot \frac{\mu _x\Vert x^0-x^*\Vert ^2}{32}.\end{aligned}$$

Summarizing, we present our second main result in the following theorem.

Theorem 4.5

Let the positive parameters \(\mu _x,\, \mu _y>0\) and \(L_x>\mu _x,\, L_y>\mu _y,\, L_{xy} > 0\) be given. For any integer k, there exists a problem instance in \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\) of the form (25), with \(n\ge \max \left\{ 2\log _q\left( \frac{7+\alpha }{\beta }\right) , 2k\right\} \), the constants \(\alpha ,\beta \) as in (27), the matrix \(A\in \mathbb {R}^{n\times n}\) as in (13), and the vector \(b = \frac{2B_xB_y}{L_{xy}}A^{-1}\hat{b}\) with \(\hat{b}\) as in (29). For this problem, any approximate solution \((\tilde{x}^k, \tilde{y}^k)\in \mathcal {H}_x^k\times \mathcal {H}_y^k\) generated by the pure first-order algorithm class (9) satisfies

$$\begin{aligned}&\max _y F(\tilde{x}^{k},y) - \min _{x} F(x, \tilde{y}^k)\ge q^{2k}\cdot \frac{\mu _x\Vert x^*-x^0\Vert ^2}{32} \quad \text{ and } \nonumber \\&\quad \Vert \tilde{x}^k-x^*\Vert ^2\ge q^{2k}\cdot \frac{\Vert x^*-x^0\Vert ^2}{16}, \end{aligned}$$
(34)

where q is a root of the quartic equation (26) satisfying (28).

Remark 4.6

As a result, if we require the duality gap to be bounded by \(\epsilon \), then the number of iterations needed is at least

$$\begin{aligned} k\ge \frac{1}{2}\ln \left( \frac{\mu _x\Vert x^*-x^0\Vert ^2}{32\epsilon }\right) /\ln (q^{-1}) = \Omega \left( \sqrt{\frac{L_x}{\mu _x} + \frac{L_{xy}^2}{\mu _x\mu _y} + \frac{L_y}{\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) . \end{aligned}$$
(35)

4.2 The general pure first-order algorithm class

Consider the problem class \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\); we define the orthogonally rotated problem as

$$\begin{aligned} \min _x\max _y F_{U,V}(x,y) : = \frac{1}{2}x^\top (B_xU^\top \!\!A^2U + \mu _xI)x + \frac{L_{xy}}{2}x^\top U^\top \!\!AVy - \frac{1}{2}y^\top (B_yV^\top \!\!A^2V + \mu _yI)y - b^\top Vy \end{aligned}$$
(36)

where \(A,b,B_x,B_y\) are defined in accordance with Theorem 4.5 and U, V are two orthogonal matrices. It is then clear that \(F_{U,V}\in \mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\). Let \((x^*,y^*)\) be the saddle point of F(x, y); then the saddle point of \(F_{U,V}(x,y)\) is \((\bar{x}^*,\bar{y}^*) = (U^\top x^*,V^\top y^*)\). Consequently, the lower bound for the general pure first-order algorithm class is characterized by the following theorem.

Theorem 4.7

Let \(\mathcal {A}\) be any algorithm from the general pure first-order algorithm class described in Definition 2.4, and assume for simplicity that the dimension n is sufficiently large. For any integer k, there exist orthogonal matrices U, V such that \(F_{U,V}\in \mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\) and, when \(\mathcal {A}\) is applied to \(F_{U,V}\) with initial solution \((x^0,y^0) = (0,0)\), the iterates and output satisfy

$$\begin{aligned}&\{(x^0,y^0),...,(x^k,y^k)\}\subseteq U^\top \mathcal {H}_x^{2k}\times V^\top \mathcal {H}_y^{2k}\qquad \text{ and }\qquad \\&\quad (\tilde{x}^k,\tilde{y}^k)\in U^\top \mathcal {H}_x^{2k+1}\times V^\top \mathcal {H}_y^{2k+1},\end{aligned}$$

where \(\mathcal {H}_x^i,\mathcal {H}_y^i\) are defined in Lemma 4.1. Consequently, by Theorem 4.5,

$$\begin{aligned} \Vert \tilde{x}^k-U^\top x^*\Vert ^2 \ge \frac{q^{4k+2}}{16}\Vert x^*-x^0\Vert ^2 \end{aligned}$$

where q is defined in Theorem 4.5. As a result, it takes \(\Omega \left( \sqrt{\frac{L_x}{\mu _x} + \frac{L_{xy}^2}{\mu _x\mu _y} + \frac{L_y}{\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \) iterations to output a solution with \(O(\epsilon )\) duality gap.

The proof of this theorem is completely parallel to that of Theorem 3.7: we only need to construct orthogonal matrices U, V such that \(\{(x^0,y^0),...,(x^k,y^k)\}\subseteq U^\top \mathcal {H}_x^{2k}\times V^\top \mathcal {H}_y^{2k}\) and \((\tilde{x}^k,\tilde{y}^k)\in U^\top \mathcal {H}_x^{2k+1}\times V^\top \mathcal {H}_y^{2k+1}\) hold, which follows exactly the same procedure as the proof of Theorem 3.7. Then we argue that

$$\begin{aligned} \Vert \tilde{x}^k-U^\top x^*\Vert ^2 = \Vert U\tilde{x}^k- x^*\Vert ^2\ge \min _{x\in \mathcal {H}_x^{2k+1}}\Vert x-x^*\Vert ^2\ge \frac{q^{4k+2}}{16}\Vert x^*-x^0\Vert ^2. \end{aligned}$$

The remaining results follow from Theorem 4.5, and we omit the details.

4.3 Tightness of the bound

In this section, we discuss the tightness of this bound. To the best of our knowledge, there currently does not exist a pure first-order algorithm that achieves the lower complexity bound provided in (35). Therefore, whether there exists an optimal algorithm matching this bound, or whether the bound can be further improved, remains an open problem. However, we shall see below that (35) is indeed tight under several special parameter regimes.

Case 1 \(\mathcal {F}(L_x,L_y, L_{xy},\mu _x,\mu _y)\)     For this general class, define \(L = \max \{L_x,L_y,L_{xy}\}\). A near optimal upper bound of

$$\begin{aligned} {\mathcal {O}}\left( \sqrt{\frac{L_x}{\mu _x} + \frac{L\cdot L_{xy}}{\mu _x\mu _y} + \frac{L_y}{\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \end{aligned}$$

is obtained in [38], which almost matches our lower bound.

Case 2 \(\mathcal {F}(L_x,L_y, 0,\mu _x,\mu _y)\)     In this case \(L_{xy} = 0\), meaning that variables x and y are decoupled. Problem (1) becomes two independent convex problems with condition numbers \(\frac{L_x}{\mu _x}\) and \(\frac{L_y}{\mu _y}\) respectively. In this case (35) is reduced to

$$\begin{aligned} \Omega \left( \sqrt{\frac{L_x}{\mu _x}}\ln \left( \frac{1}{\epsilon }\right) + \sqrt{\frac{L_y}{\mu _y}}\ln \left( \frac{1}{\epsilon }\right) \right) .\end{aligned}$$

This is matched by running two independent Nesterov’s accelerated gradient methods [29].

Case 3 \(\mathcal {F}(L,L,L,\mu ,\mu )\)     In this case \(L_x = L_y = L_{xy} = L\), \(\mu _x = \mu _y = \mu \). Then (35) is reduced to

$$\begin{aligned} \Omega \left( \frac{L}{\mu }\ln \left( \frac{1}{\epsilon }\right) \right) . \end{aligned}$$

The extra-gradient algorithm (EG) and the accelerated dual extrapolation algorithm (ADE) introduced in Example 2.3 have achieved this bound; see e.g. [23, 30].

Case 4 \(\mathcal {F}(L_x,{\mathcal {O}}(1)\cdot \mu _y,L_{xy},\mu _x,\mu _y)\)     In this case \(L_y = {\mathcal {O}}(1)\cdot \mu _y\), meaning that one side of the problem is easy to solve. Then, (35) is reduced to

$$\begin{aligned} \Omega \left( \sqrt{\frac{L_x}{\mu _x} + \frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) .\end{aligned}$$

For the double loop algorithm defined in Example 2.2, we set the inner loop iteration count to \(T_2 = {\mathcal {O}}\left( \sqrt{\frac{L_y}{\mu _y}}\ln \left( \frac{1}{\epsilon }\right) \right) = {\mathcal {O}}\left( \ln \left( \frac{1}{\epsilon }\right) \right) \) and the outer loop iteration count to \(T_1 = {\mathcal {O}}\left( \sqrt{\frac{L_{\Phi ,x}}{\mu _x}}\ln \left( \frac{1}{\epsilon }\right) \right) = {\mathcal {O}}\left( \sqrt{\frac{L_x}{\mu _x} + \frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \). Then, an upper bound of

$$\begin{aligned} T_1T_2 = {\mathcal {O}}\left( \sqrt{\frac{L_x}{\mu _x} + \frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \ln ^2\left( \frac{1}{\epsilon }\right) \right) \end{aligned}$$

can be guaranteed. It is tight up to a logarithmic factor.

Case 5 \(\mathcal {F}(L,L,L,\mu _x,\mu _y)\)     In this case \(L_x = L_y = L_{xy} = L\). Then (35) is reduced to

$$\begin{aligned} \Omega \left( \sqrt{\frac{L^2}{\mu _x\mu _y}}\ln \left( \frac{1}{\epsilon }\right) \right) . \end{aligned}$$

This bound has been achieved by [19] up to a logarithmic factor.

5 Reduction to lower bounds for convex-concave problems

Note that in the previous sections, we considered the strongly-convex and strongly-concave problem classes \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\) and \(\mathcal {F}(L_{x},L_{y},L_{xy},\mu _x,\mu _y)\), where \(\mu _x>0\) and \(\mu _y>0\). In this section, we show how the iteration complexity lower bounds provided in Theorems 3.5 and 4.5 can be reduced to the problem classes with \(\mu _x = \mu _y = 0\). A similar reduction can also be done for the case where \(\mu _x>0, \mu _y = 0\), but is omitted in this paper.

5.1 Lower bound for pure first-order algorithm class

Unlike the strongly-convex and strongly-concave saddle point problems, the saddle point of the general convex-concave problem may not always exist. Therefore, we define a new problem class with bounded saddle point solution as follows.

Definition 5.1

(Problem class \(\mathcal {F}_0(L_x,L_y,L_{xy},R_x,R_y)\)) We say a function F belongs to the class \(\mathcal {F}_0(L_x,L_y,L_{xy},R_x,R_y)\) as long as: (i) \(F\in \mathcal {F}(L_x,L_y,L_{xy},0,0)\); (ii) the saddle point \((x^*,y^*) = \mathop {\hbox {argmin}}\limits _x\mathop {\hbox {argmax}}\limits _y F(x,y)\) exists and satisfies \(\Vert x^*\Vert \le R_x\), \(\Vert y^*\Vert \le R_y\).

For this problem class, we have the following lower bound result, as a corollary of Theorem 4.5.

Corollary 5.2

Consider applying the pure first-order algorithm class defined in Definition 2.2 to the problem class \(\mathcal {F}_0(L_x,L_y,L_{xy},R_x,R_y)\). For any \(\epsilon >0\), there exists a problem instance \(F_\epsilon (x,y)\in \mathcal {F}_0(L_x,L_y,L_{xy},R_x,R_y)\) such that

$$\begin{aligned} \Omega \left( \,\,\sqrt{\frac{L_xR_x^2}{\epsilon }}+\frac{L_{xy}R_xR_y}{\epsilon } + \sqrt{\frac{L_yR_y^2}{\epsilon }}\,\,\right) \end{aligned}$$
(37)

iterations are required to reduce the duality gap to \(\epsilon \).

Proof

We start the reduction with the following scaling argument. First, for any \(\epsilon >0\), let \(\hat{F}_\epsilon \in \mathcal {F}(L_x, L_y, L_{xy},\mu _x,\mu _y)\) be the worst-case instance described in Theorem 4.5. For our purpose, we choose

$$\begin{aligned} \mu _x = 64\epsilon /R_x^2\qquad \text{ and }\qquad \mu _y = 64\epsilon /R_y^2.\end{aligned}$$

Then by direct computation, we know that the following scaled problem satisfies

$$\begin{aligned} F_\epsilon (x,y): = a \hat{F}_\epsilon (cx,dy)\in \mathcal {F}(ac^2 L_x,ad^2 L_y,acd L_{xy},ac^2 \mu _x,ad^2 \mu _y).\end{aligned}$$

We skip the scaling parameter b since this letter is already used in the construction of the worst-case instance \(\hat{F}_\epsilon \). Denote \((\hat{x}^*,\hat{y}^*) = \mathop {\hbox {argmin}}\limits _{x}\mathop {\hbox {argmax}}\limits _y \hat{F}_\epsilon (x,y)\) and \((x^*,y^*) = \mathop {\hbox {argmin}}\limits _{x}\mathop {\hbox {argmax}}\limits _y F_\epsilon (x,y)\). Let us set

$$\begin{aligned} c = \frac{\Vert \hat{x}^*\Vert }{R_x}, \quad d = \frac{\Vert \hat{y}^*\Vert }{R_y},\quad \text{ and }\quad a = \min \{c^{-2},d^{-2}\}.\end{aligned}$$

Then we have \(x^* = \frac{R_x\hat{x}^*}{\Vert \hat{x}^*\Vert }\) and \(y^* = \frac{R_y\hat{y}^*}{\Vert \hat{y}^*\Vert }\), and the Lipschitz constants of \(F_\epsilon \) satisfy \(ac^2 L_x\le L_x\), \(ad^2 L_y\le L_y\), and \(acd L_{xy}\le L_{xy}\). Therefore, we know

$$\begin{aligned} F_\epsilon \in \mathcal {F}(ac^2 L_x,ad^2 L_y,acd L_{xy},ac^2 \mu _x,ad^2 \mu _y)\cap \mathcal {F}_0(L_x,L_y,L_{xy},R_x,R_y).\end{aligned}$$

Note that purely scaling the variables and the function does not change the worst-case nature of this problem. That is, \(F_\epsilon \) is still a worst-case instance of the function class \(\mathcal {F}(ac^2 L_x,ad^2 L_y,acd L_{xy},ac^2 \mu _x,ad^2 \mu _y)\), and the lower bound of Theorem 4.5 is valid for this specific instance. Therefore, to bring the duality gap below \(\epsilon \), the number of iterations k is lower bounded by

$$\begin{aligned} k&\ge \Omega \left( \sqrt{\frac{ac^2L_x}{ac^2\mu _x} + \frac{a^2c^2d^2L_{xy}^2}{ac^2\mu _x\cdot ad^2\mu _y} + \frac{ad^2L_y}{ad^2 \mu _y}}\cdot \ln \left( \frac{ac^2\mu _x\Vert x^*-x^0\Vert ^2}{32\epsilon }\right) \right) \nonumber \\&\overset{(i)}{=} \Omega \Bigg (\bigg (\sqrt{\frac{L_xR_x^2}{\epsilon }} + \frac{L_{xy}R_xR_y}{\epsilon }+\sqrt{\frac{L_yR_y^2}{\epsilon }}\bigg )\cdot \ln \left( 2ac^2\right) \Bigg ) \end{aligned}$$
(38)

where (i) holds because \(x^0 = 0\), \(\Vert x^*\Vert = R_x\), and \(\mu _x = 64\epsilon /R_x^2\). Therefore, as long as we can show that \(\ln \left( 2ac^2\right) = \Omega (1)\), the corollary is proved. Since the details are rather technical, we provide a proof that \(\ln \left( 2ac^2\right) \ge \ln 2 \) in Appendix G. \(\square \)

As a remark, setting \(L_y = 0\) in the lower bound (37) recovers the result derived in [33], while setting \(L_x = L_y = L_{xy} = L\) recovers the result derived in [25]. The reduction for the general pure first-order algorithm class of Definition 2.4, without the linear span assumption, can be done in a similar manner and is omitted for succinctness.

5.2 Lower bound for proximal algorithm class

Analogous to Definition 5.1, we define a bilinear problem class with bounded saddle point solutions as follows.

Definition 5.3

(Problem class \(\mathcal {B}_0(L_{xy},R_x,R_y)\)) We say a function F belongs to the function class \(\mathcal {B}_0(L_{xy},R_x,R_y)\) as long as: (i) \(F\in \mathcal {B}(L_{xy},0,0)\); (ii) the saddle point \((x^*,y^*) = \mathop {\hbox {argmin}}\limits _x\mathop {\hbox {argmax}}\limits _y F(x,y)\) exists and satisfies \(\Vert x^*\Vert \le R_x\), \(\Vert y^*\Vert \le R_y\).

For this problem class, we have the following lower bound result, as a corollary of Theorem 3.5.

Corollary 5.4

Consider applying the proximal algorithm class defined in Definition 2.1 to the problem class \(\mathcal {B}_0(L_{xy},R_x,R_y)\). For any \(\epsilon >0\), there exists an instance \(F_\epsilon (x,y)\in \mathcal {B}_0(L_{xy},R_x,R_y)\) such that

$$\begin{aligned} \Omega \left( \,\,\frac{L_{xy}R_xR_y}{\epsilon } \,\,\right) \end{aligned}$$
(39)

iterations are required to reduce the duality gap to \(\epsilon \).

Remark 5.5

The lower bound in Corollary 5.4 is tight. An optimal algorithm is derived in [9, 10].

The reduction can be done in a similar way as in Corollary 5.2, but is much simpler. The details are omitted here.

6 Conclusion

In this paper, we establish lower complexity bounds for first-order methods solving strongly convex and strongly concave saddle point problems. Different from existing results, we discuss the problem in the most general parameter regime. For the bilinearly coupled problem class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\), and for both the proximal algorithm class (7) with the linear span assumption and the general proximal algorithm class (10) without it, a tight lower bound is established. For the general coupling problem class \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\), and for both the pure first-order algorithm class (9) with the linear span assumption and the general pure first-order algorithm class (11) without it, a lower bound has been established. Under various special parameter regimes, tight upper bounds are available. In the most general setting of the min-max framework, a near-optimal algorithm has been discovered, while an algorithm that exactly matches the lower bound has yet to be found. Finally, we also show that our results imply several existing lower bounds for general convex-concave problems through proper scaling of the worst-case instances, which indicates the generality of our results.