Abstract
In this paper, we study the lower iteration complexity bounds for finding the saddle point of a strongly convex and strongly concave saddle point problem: \(\min _x\max _yF(x,y)\). We restrict the classes of algorithms in our investigation to be either pure first-order methods or methods using proximal mappings. For problems with gradient Lipschitz constants (\(L_x, L_y\) and \(L_{xy}\)) and strong convexity/concavity constants (\(\mu _x\) and \(\mu _y\)), the class of pure first-order algorithms with the linear span assumption is shown to have a lower iteration complexity bound of \(\Omega \,\left( \sqrt{\frac{L_x}{\mu _x}+\frac{L_{xy}^2}{\mu _x\mu _y}+\frac{L_y}{\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \), where the term \(\frac{L_{xy}^2}{\mu _x\mu _y}\) explains how the coupling influences the iteration complexity. Under several special parameter regimes, this lower bound has been achieved by corresponding optimal algorithms. However, whether or not the bound under the general parameter regime is optimal remains open. Additionally, for the special case of bilinear coupling problems, given the availability of certain proximal operators, a lower bound of \(\Omega \left( \sqrt{\frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \ln (\frac{1}{\epsilon })\right) \) is established under the linear span assumption, and optimal algorithms have already been developed in the literature. By exploiting the orthogonal invariance technique, we extend both lower bounds to the general pure first-order algorithm class and the proximal algorithm class without the linear span assumption. As an application, we apply proper scaling to the worst-case instances, and we derive the lower bounds for the general convex concave problems with \(\mu _x = \mu _y = 0\). Several existing results in this case can be deduced from our results as special cases.
1 Introduction
In this paper, we establish a lower iteration complexity bound for the first-order methods to solve the following min-max saddle point problem
which is of fundamental importance in, e.g., game theory [31, 37], image deconvolution problems [9], parallel computing [39], adversarial training [4, 12], and statistical learning [1].
To proceed, let us introduce the following two problem classes.
Definition 1.1
(Problem class \(\mathcal {F}(L_{x},L_{y},L_{xy},\mu _x,\mu _y)\)) \(F(\cdot ,y)\) is \(\mu _x\)-strongly convex for any fixed y and \(F(x,\cdot )\) is \(\mu _y\)-strongly concave for any fixed x. Overall, the function F is smooth and \(\nabla F\) satisfies the following Lipschitz continuity condition
We shall remark here that the constants in (2) may also be understood as the bounds on the different blocks of the Hessian matrix \(\nabla ^2 F(x,y)\) if F is twice continuously differentiable. That is,
However, throughout this paper we do not assume either \(F(\cdot , y)\) or \(F(x,\cdot )\) is second-order differentiable.
The second problem class is the bilinear saddle point model:
Definition 1.2
(Bilinear class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\)) In this special class, the problems are written as
where f(x) and g(y) are both lower semi-continuous with f(x) being \(\mu _x\)-strongly convex and g(y) being \(\mu _y\)-strongly convex. The coupling matrix A satisfies \(\Vert A\Vert _2\le L_{xy}\).
For this special model class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\), we assume the availability of the following prox-operations:
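These prox-operations are the standard proximal mappings: for step sizes \(\gamma ,\sigma >0\),
\[
\mathbf {prox}_{\gamma f}(u) := \mathop {\mathrm {argmin}}_x \left\{ f(x) + \frac{1}{2\gamma }\Vert x-u\Vert ^2\right\} , \qquad \mathbf {prox}_{\sigma g}(v) := \mathop {\mathrm {argmin}}_y \left\{ g(y) + \frac{1}{2\sigma }\Vert y-v\Vert ^2\right\} ,
\]
which are well defined since f and g are strongly convex and lower semi-continuous.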
In this paper we shall establish the lower iteration complexity bound
and
with the proximal oracles (4). In particular, we first establish these lower bounds for the pure first-order and general proximal algorithm classes under the linear span assumption. Later on, we generalize the results to more general algorithm classes without the linear span assumption through the orthogonal invariance technique introduced in [25]. For more detailed applications of the orthogonal invariance technique in lower bound derivations, the interested readers are referred to [7, 8, 33].
As an application of the above bound, we apply proper scaling to the worst-case instances and show that the above result implies several existing lower bounds for general convex-concave problems with bounded saddle point solutions. Specifically, we have
and
For the above two lower bounds, we remark that under specific parameter regimes, the first bound is known; see [25] for the case with \(L_x = L_y = L_{xy}\) and [33] for the case with \(L_y = 0\). However, to the best of our knowledge, the first bound under a general set of parameters, as well as the second bound for the bilinear problem class, is not known. Similar reductions can also be performed for the problem classes with only one of \(\mu _x\) and \(\mu _y\) equal to 0, for which the lower bounds have already been discovered in [33].
Such lower iteration complexity results shed light on understanding the performance of the algorithms designed for min-max saddle point models. There are numerous results in the literature prior to ours. As a special case of (1), the lower bound results for the convex minimization problem with \(F(x,y) = f(x)\) have been well studied in the past decades. For convex problems, Nesterov’s accelerated gradient method achieves iteration complexities of \({\mathcal {O}}(\sqrt{L/\epsilon })\) for L-smooth convex problems and \({\mathcal {O}}\left( \sqrt{\frac{L}{\mu }}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \) for L-smooth and \(\mu \)-strongly convex problems, respectively, and both are shown to match the lower complexity bound for first-order methods; see [29].
However, for the min-max saddle-point models, the situation is more subtle. Due to the convex-concave nature of F, the vector field
is monotone. Hence the convex-concave saddle point problem is often studied as a subclass of the variational inequality problems (VIP); see e.g. [16, 22, 24, 28, 30, 36] and references therein. Although there have been plenty of studies on the variational inequality model, the roles played by the different Lipschitz constants on the different blocks of variables have not been fully explored in the literature. In other words, one would often denote by L an overall Lipschitz constant of the vector field G, which is of the order \(\Theta (\max \{L_x,L_y,L_{xy}\})\) in our case, and set \(\mu \) to be the strong monotonicity parameter of G, which is of the order \(\Theta (\min \{\mu _x,\mu _y\})\) in our case, with no further distinctions made among the parameters. Hence the considered problems are special instances in \(\mathcal {F}(L,L,L,\mu ,\mu )\). Under such settings, many algorithms, including the mirror-prox algorithm [24], the extra-gradient methods [17, 23], and the accelerated dual extrapolation [30], have all achieved the iteration complexity of \({\mathcal {O}}\left( \frac{L}{\mu }\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \), and this complexity is shown to be optimal for first-order methods in solving the problem class \(\mathcal {F}(L,L,L,\mu ,\mu )\); see [26]. However, under the more general parameter regime of \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\), these methods are not optimal.
For example, Nesterov’s accelerated dual extrapolation method [30] has a complexity of \({\mathcal {O}}\left( \frac{\max \{L_x,L_{xy},L_y\}}{\min \{\mu _x,\mu _y\}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \); even if the algorithm is modified carefully, one can only guarantee a complexity of \({\mathcal {O}}\left( \sqrt{\frac{L_x^2}{\mu _x^2} + \frac{L_{xy}^2}{\mu _x\mu _y} + \frac{L_y^2}{\mu _y^2}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \), neither of which matches the lower bound provided in this paper. More recently, tighter upper bounds have been derived. In [19], the authors consider the problem class \(\mathcal {F}(L,L,L,\mu _x,\mu _y)\) and achieve an upper bound of \({\mathcal {O}}\left( \sqrt{\frac{L^2}{\mu _x\mu _y}}\cdot \ln ^3\left( \frac{1}{\epsilon }\right) \right) \), which matches our lower bound when \(L_{x} = L_{y} = L_{xy}=L\) up to a logarithmic term. In [38], the authors consider the general problem class \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\); the proposed algorithm achieves an upper bound of \({\mathcal {O}}\left( \sqrt{\frac{L_x}{\mu _x} + \frac{L\cdot L_{xy}}{\mu _x\mu _y} + \frac{L_y}{\mu _y}}\cdot \ln ^3\left( \frac{L^2}{\mu _x\mu _y}\right) \cdot \ln \left( \frac{1}{\epsilon }\right) \right) \), where \(L = \max \{L_x,L_y,L_{xy}\}\), which almost matches our lower bound for the general problem class. Despite the gap for the general problem class \(\mathcal {F}(L_x,L_{xy},L_y,\mu _x,\mu _y)\), given the availability of proximal operators, the authors of [9, 10] have derived an algorithm for the problem class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\) with complexity \({\mathcal {O}}\left( \sqrt{\frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \). We will prove in this paper that this result matches the theoretical lower complexity bound for its problem and algorithm classes, and is hence optimal.
For the bilinear problem (3), when f is smooth and convex and \(g(y) = b^\top y\) is linear, the problem is equivalent to the following convex optimization problem
Without using the projection onto the hyperplane \(\{x:A^\top x=b\}\), which requires a matrix inversion, pure first-order methods achieve \({\mathcal {O}}(1/\epsilon )\) complexity even when f is strongly convex; see e.g. [11, 32, 40]. These iteration complexity bounds are shown to match the lower bound provided in [33]. For more details on the lower and upper bounds for this formulation, the interested readers are referred to [33]. Finally, for the bilinear coupling problem (3), the authors of [13] show that a lower bound of \({\mathcal {O}}\left( \sqrt{\frac{L_{xy}^2 + \mu _x\mu _y}{\mu _{xy}^2 + \mu _x\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \) can be derived, where \(\mu _{xy}\) stands for the minimum singular value of the coupling matrix A. It is interesting that this result covers the linear convergence phenomenon for the pure bilinear saddle point problem [5], where \(f(x)\equiv g(y)\equiv 0\). Another remark is that, due to the special construction of the worst-case instance and algorithm class, [13] cannot characterize the impact of \(L_x\) and \(L_y\), nor the lower bound for the proximal algorithm class.
Besides the studies on first-order algorithms, there are also studies on higher-order methods. For example, in [3] lower iteration complexity bounds for second-order methods are considered, and in [2, 27] lower iteration complexity bounds are presented for general higher-order (tensor) methods. For smooth nonconvex optimization, the iteration complexity lower bounds for first-order methods are considered in [8], while those for higher-order methods are considered in [7].
Another line of research concerns the nonconvex/nonconcave min-max saddle point problems; see [14, 15] and the references therein. To guarantee convergence, additional structures are often needed. For example, if one assumes that the solutions of the problem satisfy the Minty variational inequality [18], then convergent algorithms can be constructed. Another important situation is when F is concave in y. In that case, convergence to a stationary solution with an iteration complexity guarantee is possible; see e.g. [21]. For more literature on this type of problem, we refer the interested readers to [20] and the references therein.
Organization This paper is organized as follows. In Sect. 2, we introduce two different algorithm classes (with or without proximal-operators). In Sect. 3, we construct a worst-case example for problem class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\) and derive the corresponding lower iteration complexity bound for the algorithm class allowing proximal-operators. An optimal algorithm is discussed in this case. In Sect. 4, we construct the worst-case example for problem class \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\) and establish the corresponding lower complexity bound for the first-order method (without any proximal oracles). Optimal algorithms under several special parameter regimes are discussed. Finally, we conclude the paper in Sect. 6.
2 The first-order algorithm classes
In this section, we discuss some preliminaries for the strongly convex and strongly concave saddle point problem. Then, we shall introduce two algorithm classes to set the ground for our discussion, and we shall also note specific known algorithms as representative members in those algorithm classes.
2.1 Primal function, dual function, and the duality gap
First, we define \(\Phi (\cdot )\) to be the primal function and \(\Psi (\cdot )\) to be the dual function of the saddle point problem \(\min _x \max _y F(x,y)\), respectively, with the following definitions
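In standard form, these definitions read
\[
\Phi (x) := \max _y F(x,y), \qquad \Psi (y) := \min _x F(x,y).
\]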
As the maximum of a class of \(\mu _x\)-strongly convex functions, \(\Phi (x)\) is a \(\mu _x\)-strongly convex function. Similarly, \(\Psi (y)\) is a \(\mu _y\)-strongly concave function. We define the duality gap as
Suppose the unique solution of this min-max problem is \((x^*,y^*)\). By the strong duality theorem, we know for any x and y it holds that
Together with the \(\mu _x\)-strong convexity of \(\Phi \) and the \(\mu _y\)-strong concavity of \(\Psi \), we further have
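Writing the duality gap as \(\mathrm {Gap}(x,y) := \Phi (x) - \Psi (y)\), the resulting bound (inequality (6)) takes the standard form
\[
\mathrm {Gap}(x,y) \;\ge \; \frac{\mu _x}{2}\Vert x-x^*\Vert ^2 + \frac{\mu _y}{2}\Vert y-y^*\Vert ^2,
\]
which follows from \(\Phi (x)\ge \Phi (x^*) = \Psi (y^*)\ge \Psi (y)\) together with the strong convexity of \(\Phi \) and the strong concavity of \(\Psi \).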
Now, suppose that \((\tilde{x}_k,\tilde{y}_k)\) is the approximate solution generated after k iterations of an algorithm. Our aim is to lower bound the distance between \((\tilde{x}_k,\tilde{y}_k)\) and \((x^*,y^*)\). By (6), this also yields a lower iteration complexity bound in terms of the duality gap.
2.2 Proximal algorithm class
First, let us consider the bilinearly coupled problem class (3) as introduced in Definition 1.2:
For this special problem class, let us consider the lower iteration complexity bound for the algorithm class where the proximal oracles (4) are available.
Definition 2.1
(Proximal algorithm class) In each iteration, the iterate sequence \(\{(x^{k},y^{k})\}_{k=0,1,...}\) is generated so that \((x^{k},y^{k})\in \mathcal {H}_x^{k} \times \mathcal {H}_y^{k}\). These subspaces are generated with \(\mathcal {H}_x^{0} = \mathrm {Span}\{x^0\}, \mathcal {H}_y^{0} = \mathrm {Span}\{y^0\}\) and
We remark that when applying the proximal oracles, it is not necessary to use the most recent iterate \(x^k\) as the proximal center. Neither is it necessary to evaluate the gradients of the coupling term (namely the \(A^\top x\) and Ay terms) at the current iterate. Instead, the algorithm class allows the usage of any combination of points in the historical search space. We shall also remark that the algorithm class in Definition 2.1 does not necessarily update x and y at the same time, because setting \(x^{k+1}=x^k\) or \(y^{k+1} = y^k\) also satisfies Definition 2.1. Thus this algorithm class also includes the methods that alternately update x and y. Below is a sample algorithm in this class.
Example 2.1
(Algorithm 3 in [9]) Initialize with \(\gamma = \frac{1}{L_{xy}}\sqrt{\frac{\mu _y}{\mu _x}}\), \(\sigma = \frac{1}{L_{xy}}\sqrt{\frac{\mu _x}{\mu _y}}\), and \(\theta = \frac{L_{xy}}{2\sqrt{\mu _x\mu _y} + L_{xy}}.\) Set \(\tilde{x}^0 = x^0\). Then the algorithm proceeds as
It can be observed that this algorithm updates the blocks in alternating order; by slightly shifting the indices, it can be written in the form of (7) in Definition 2.1. The complexity of this method is \({\mathcal {O}}\left( \sqrt{\frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \).
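To make the recursion concrete, the following sketch runs primal-dual updates with the step sizes of Example 2.1 on a small, hypothetical quadratic instance of (3). The data A and b and the quadratic choices of f and g are for illustration only, and the update order follows the common presentation of [9], so the indexing may differ slightly from the displayed recursion.

```python
import numpy as np

# Hypothetical small instance of (3): f(x) = mu_x/2 ||x||^2 and
# g(y) = mu_y/2 ||y||^2 + b^T y, so F(x, y) = f(x) + x^T A y - g(y).
mu_x, mu_y = 1.0, 1.0
A = np.array([[0.5, 0.0], [0.0, 0.3]])
b = np.array([1.0, -1.0])
L_xy = np.linalg.norm(A, 2)

# Step sizes and extrapolation parameter from Example 2.1
gamma = np.sqrt(mu_y / mu_x) / L_xy
sigma = np.sqrt(mu_x / mu_y) / L_xy
theta = L_xy / (2.0 * np.sqrt(mu_x * mu_y) + L_xy)

# Closed-form proximal mappings for the quadratic f and g above
def prox_f(u, t):
    return u / (1.0 + t * mu_x)

def prox_g(v, t):
    return (v - t * b) / (1.0 + t * mu_y)

x = np.zeros(2); y = np.zeros(2); x_tilde = np.zeros(2)
for _ in range(2000):
    y = prox_g(y + sigma * (A.T @ x_tilde), sigma)  # prox-ascent step on y
    x_new = prox_f(x - gamma * (A @ y), gamma)      # prox-descent step on x
    x_tilde = x_new + theta * (x_new - x)           # extrapolation on the x block
    x = x_new

# Reference saddle point: mu_x x* + A y* = 0 and A^T x* - mu_y y* - b = 0
K = np.block([[mu_x * np.eye(2), A], [A.T, -mu_y * np.eye(2)]])
sol = np.linalg.solve(K, np.concatenate([np.zeros(2), b]))
x_star, y_star = sol[:2], sol[2:]
```

On this instance the iterates contract linearly toward \((x^*,y^*)\), consistent with the stated complexity.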
2.3 Pure first-order algorithm class
In contrast to the previous section, here we consider the more general problem class \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\):
For such problems, we refer to the algorithm class as the pure first-order methods, meaning that there is no proximal oracle in the design of algorithms in this class.
Definition 2.2
(Pure first-order algorithm class) In each iteration, the sequence \(\{(x^{k},y^{k})\}_{k=0,1,...}\) is generated so that \((x^{k},y^{k})\in \mathcal {H}_x^{k} \times \mathcal {H}_y^{k}\), with \(\mathcal {H}_x^{0} = \mathrm {Span}\{x^0\}\), \(\mathcal {H}_y^{0} = \mathrm {Span}\{y^0\}\), and
Similar to our earlier comments on the proximal algorithm class, algorithms in this class may take gradients at any combination of points in the historical search space. The algorithm class also includes the methods that alternately update x and y, and even the double loop algorithms that optimize one side until a certain accuracy is achieved before switching to the other side. At that level of generality, it indeed accommodates many updating schemes. To illustrate this point, let us present below some sample algorithms in this class.
The first example is a double loop scheme, in which the primal function \(\Phi (x)\) is optimized approximately. Specifically, let \(y^*(x) = \mathop {\hbox {argmax}}\limits _y F(x,y)\); then by Danskin’s theorem, \(\nabla \Phi (x) = \nabla _x F(x,y^*(x))\); see e.g. [6, 34]. Therefore, one can apply Nesterov’s accelerated gradient method to minimize \(\Phi (x)\). The double loop scheme performs this procedure approximately.
Example 2.2
(Double loop schemes, [35]) Denote \(\alpha _1 = \sqrt{\frac{\mu _x}{L_{\Phi ,x}}}\) and \(\alpha _2 = \sqrt{\frac{\mu _y}{L_y}}\), where \(L_{\Phi ,x} = L_x + \frac{L_{xy}^2}{\mu _y}\) is the Lipschitz constant of \(\nabla \Phi (x)\) (see [35]). Given \((x^0, y^0)\), define \(\bar{x}^0 = x^0\); the double loop scheme then works as follows:
where the point \(y^k\) is generated by an inner loop of accelerated gradient iterations
Then, set \(y^{k}:=w^{T_2+1}\) to be the last iterate of the inner loop.
For simplicity, we have applied a specific acceleration scheme [29] which does not work for non-strongly convex problems. In principle, the FISTA scheme can also be used. For this scheme, with properly chosen \(T_1\) and \(T_2\), the iteration complexity \({\mathcal {O}}\left( \sqrt{\frac{L_x}{\mu _x} + \frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \sqrt{\frac{L_y}{\mu _y}}\ln ^2\left( \frac{1}{\epsilon }\right) \right) \) is achievable.
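The nesting structure above can be sketched on a small, hypothetical quadratic instance. For transparency, plain gradient ascent replaces the accelerated inner loop and the outer momentum is dropped, so this illustrates only the double loop structure, not the stated complexity.

```python
import numpy as np

# Hypothetical instance: F(x, y) = mu_x/2 ||x||^2 + x^T A y - mu_y/2 ||y||^2 - b^T y
mu_x, mu_y = 1.0, 1.0
A = np.array([[0.8, 0.0], [0.0, 0.5]])
b = np.array([1.0, -1.0])

def inner_max(x_bar, y, steps=50):
    # Inner loop: gradient ascent on the mu_y-strongly concave problem
    # max_y F(x_bar, y); the step 1/mu_y is valid here because the inner
    # curvature is exactly mu_y for this instance (the scheme in [35] is accelerated)
    for _ in range(steps):
        y = y + (1.0 / mu_y) * (A.T @ x_bar - mu_y * y - b)
    return y

# Lipschitz constant of grad Phi for this instance: L_x + L_xy^2/mu_y with L_x = mu_x
L_phi = mu_x + np.linalg.norm(A, 2) ** 2 / mu_y

x, y = np.zeros(2), np.zeros(2)
for _ in range(200):
    y = inner_max(x, y)                          # approximate y*(x)
    x = x - (1.0 / L_phi) * (mu_x * x + A @ y)   # gradient step on Phi via Danskin

# Reference saddle point from mu_x x* + A y* = 0, A^T x* - mu_y y* - b = 0
K = np.block([[mu_x * np.eye(2), A], [A.T, -mu_y * np.eye(2)]])
sol = np.linalg.solve(K, np.concatenate([np.zeros(2), b]))
x_star, y_star = sol[:2], sol[2:]
```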
In the following, we also list examples of several single loop algorithms, including the gradient descent-ascent method (GDA), the extra-gradient (EG) method [17] (a special case of mirror-prox algorithm [24]), and the accelerated dual extrapolation (ADE) [30].
Example 2.3
(Single loop algorithms) Let \(L = \max \{L_x,L_y,L_{xy}\}, \mu = \min \{\mu _x,\mu _y\}\). Given the initial solution \((x^0,y^0)\), the algorithms proceed as follows:
where \(\eta _1 = {\mathcal {O}}\left( \frac{\mu }{L^2}\right) \), \(\eta _2={\mathcal {O}}\left( \frac{1}{L}\right) ,\eta _3 = {\mathcal {O}}\left( \frac{1}{L}\right) \). The iterative points \((\tilde{x}^{k+1},\tilde{y}^{k+1})\) in (ADE) are the same as that in (EG), except that \(\eta _2\) is replaced by \(\eta _3\).
The original update of the (ADE) algorithm is rather involved since it handles constraints. In the unconstrained case, it can be simplified to the current form, which is a mixture of (GDA) and (EG). The corresponding iteration complexity bounds are \({\mathcal {O}}\left( \frac{L^2}{\mu ^2}\ln \left( \frac{1}{\epsilon }\right) \right) \) for (GDA), and \({\mathcal {O}}\left( \frac{L}{\mu }\ln \left( \frac{1}{\epsilon }\right) \right) \) for both (EG) and (ADE).
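For concreteness, here is a minimal sketch of the (GDA) and (EG) updates on a hypothetical scalar instance \(F(x,y) = \frac{\mu }{2}x^2 + L_{xy}xy - \frac{\mu }{2}y^2 - by\); the step sizes follow the stated \({\mathcal {O}}(\cdot )\) scalings with illustrative constants, and the (ADE) variant is omitted since it only swaps \(\eta _2\) for \(\eta _3\).

```python
# Hypothetical scalar instance: F(x, y) = mu/2 x^2 + L_xy x y - mu/2 y^2 - b y
mu, L_xy, b = 1.0, 1.0, 1.0
grad_x = lambda x, y: mu * x + L_xy * y        # partial derivative in x
grad_y = lambda x, y: L_xy * x - mu * y - b    # partial derivative in y

# Saddle point from mu x* + L_xy y* = 0 and L_xy x* - mu y* - b = 0
x_star = L_xy * b / (mu**2 + L_xy**2)
y_star = -mu * b / (mu**2 + L_xy**2)

L = max(mu, L_xy)
eta1 = 0.5 * mu / L**2   # GDA step, following eta1 = O(mu / L^2)
eta2 = 0.5 / L           # EG step, following eta2 = O(1 / L)

# Gradient descent-ascent (GDA): simultaneous descent in x, ascent in y
x, y = 1.0, 1.0
for _ in range(500):
    x, y = x - eta1 * grad_x(x, y), y + eta1 * grad_y(x, y)
x_gda, y_gda = x, y

# Extra-gradient (EG): a half step to a midpoint, then a full step
# using the gradients evaluated at the midpoint
x, y = 1.0, 1.0
for _ in range(500):
    xm, ym = x - eta2 * grad_x(x, y), y + eta2 * grad_y(x, y)
    x, y = x - eta2 * grad_x(xm, ym), y + eta2 * grad_y(xm, ym)
x_eg, y_eg = x, y
```

Both runs converge linearly to the saddle point, with (EG) tolerating the larger step size, in line with the complexity comparison above.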
2.4 General deterministic algorithm classes without linear span structure
Although all the reviewed first-order methods satisfy the linear span property of the proximal algorithm class in Definition 2.1 or the pure first-order algorithm class in Definition 2.2, this does not exclude the possibility of deriving an algorithm that does not satisfy the linear span property. Therefore, we also define the general deterministic proximal algorithm class and the general deterministic pure first-order algorithm class as follows, whose iteration complexity lower bounds can be generalized from their linear span counterparts through the technique of adversary rotation.
Definition 2.3
(General proximal algorithm class) Consider the problem (3) in the problem class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\), and denote by \(\theta = (L_{xy},\mu _x,\mu _y)\) the corresponding problem parameters. Let algorithm \(\mathcal {A}\) belong to the general proximal algorithm class. Then \(\mathcal {A}\) consists of a sequence of deterministic mappings \(\{(\mathcal {A}_x^1, \mathcal {A}_y^1,\mathcal {A}_u^1,\mathcal {A}_v^1),(\mathcal {A}_x^2, \mathcal {A}_y^2,\mathcal {A}_u^2,\mathcal {A}_v^2),...\}\) such that the iterate sequence \(\{(x^{k},y^{k})\}_{k=0,1,...}\) and the output sequence \(\{(\tilde{x}^{k},\tilde{y}^{k})\}_{k=0,1,...}\) are generated by
where \(u^k = \mathcal {A}^k_u(\theta ; x^0,Ay^0,...,x^{k-1},Ay^{k-1})\), \(v^k = \mathcal {A}^k_v(\theta ; y^0,A^\top x^0,...,y^{k-1},A^\top x^{k-1})\), and \((x^0,y^0)\) is any given initial solution.
One remark is that the input of the proximal mapping \(\mathbf {prox}_{\gamma _kf}(\cdot )\) is constructed from the other inputs to \(\mathcal {A}_x^k\); i.e., there could be another deterministic mapping \(\mathcal {A}_u^k\) that generates a vector \(u^k = \mathcal {A}^k_u(\theta ; x^0,Ay^0,...,x^{k-1},Ay^{k-1})\), and then \(\mathbf {prox}_{\gamma _kf}(u^k)\) is passed to the mapping \(\mathcal {A}_x^k\). This \(\mathcal {A}^k_u\) does not need to be linear. The situation for \(v^k\) and \(\mathcal {A}^k_v\) is similar. Analogously to the general proximal algorithm class, the general pure first-order algorithm class is defined as follows.
Definition 2.4
(General pure first-order algorithm class) Consider the problem (1) in the problem class \(\mathcal {F}(L_x, L_y, L_{xy},\mu _x,\mu _y)\), and denote by \(\theta = (L_x, L_y, L_{xy},\mu _x,\mu _y)\) the corresponding problem parameters. Let algorithm \(\mathcal {A}\) belong to the general pure first-order algorithm class. Then \(\mathcal {A}\) consists of a sequence of deterministic mappings \(\{\mathcal {A}_x^1, \mathcal {A}_y^1, \mathcal {A}_x^2, \mathcal {A}_y^2,...\}\) such that the iterate sequence \(\{(x^{k},y^{k})\}_{k=0,1,...}\) and the output sequence \(\{(\tilde{x}^{k},\tilde{y}^{k})\}_{k=0,1,...}\) are generated by
given any initial solution \((x^0,y^0)\).
A remark is that the gradients \(\nabla _x F(\cdot ,\cdot )\) and \(\nabla _y F(\cdot ,\cdot )\) do not actually need to be taken at the previous iterates \(\{(x^0,y^0),(x^1,y^1),...,(x^{k-1},y^{k-1})\}\). Similar to the proximal case, they can also be taken at some other points \(\{(u^0_{k-1},v^0_{k-1}),(u^1_{k-1},v^1_{k-1}),...,(u^{k-1}_{k-1},v^{k-1}_{k-1})\}\) that are generated by some mappings \(\{(\mathcal {A}_u^{k-1,0},\mathcal {A}_v^{k-1,0}),(\mathcal {A}_u^{k-1,1},\mathcal {A}_v^{k-1,1}),...,(\mathcal {A}_u^{k-1,k-1},\mathcal {A}_v^{k-1,k-1})\}\). The reason that we do not consider this more general form is twofold. First, the simpler form in Definition 2.4 already covers all the discussed algorithms. Second, this more general form actually shares the same iteration complexity lower bound, despite the technical complications involved. Therefore, in this paper we shall only include the gradients at the past iterates as the input to the algorithm.
3 Lower bound for proximal algorithms
3.1 The worst-case instance
Let us construct the following bilinearly coupled min-max saddle point problem:
where b is a vector to be determined later, and the coupling matrix A (hence \(A^2\) and \(A^4\)) is defined as follows:
Note that \(A^\top =A\) and \(\Vert A\Vert _2 \le 2\). Therefore (12) is an instance in the problem class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\). It is worth noting that the example (12) is the same as that in Proposition 2 of [13], a parallel work focused on the pure first-order algorithm class. Here, we use the same example to elaborate the lower bound of the proximal methods over the general bilinear coupling class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\), as well as to provide a warmup for the discussion of the more complex problem class \(\mathcal {F}(L_{x},L_{y},L_{xy},\mu _x,\mu _y)\).
Denote \(e_i\) to be the i-th unit vector, which has 1 at the i-th component and 0 elsewhere. Then by direct calculation, one can check that \(A^2\) satisfies the following zero-chain property (see Chapter 2 of [29]).
Proposition 3.1
(Zero-chain property) For any vector \(v\in \mathbb {R}^n\), if \(v\in \mathrm {Span}\{e_i:i\le k\}\) for some \(1\le k\le n-1\), then \(A^2v\in \mathrm {Span}\{e_i:i\le k+1\}\).
This means that if v only has nonzero elements at the first k entries, then \(A^2v\) will have at most one more nonzero entry at the \((k+1)\)-th position.
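The mechanism behind Proposition 3.1 is easy to check numerically. The exact matrix A is the one defined in (13); the snippet below uses a generic symmetric tridiagonal matrix B as a stand-in for \(A^2\), which suffices to illustrate that multiplying a vector supported on the first k coordinates grows the support by at most one entry.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3

# Generic symmetric tridiagonal stand-in B for A^2 (the exact A is defined in (13))
d = rng.standard_normal(n)
o = rng.standard_normal(n - 1)
B = np.diag(d) + np.diag(o, 1) + np.diag(o, -1)

# v supported on the first k coordinates only
v = np.zeros(n)
v[:k] = rng.standard_normal(k)

# B v can have at most one extra nonzero entry, at position k+1
w = B @ v
```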
For problem (12), the proximal operators in (7) can be written explicitly:
Similarly, for the y block, we also have
Let us assume without loss of generality that the initial point is \(x^0 = y^0 = 0\) (so that \(\mathcal {H}_x^0 = \mathcal {H}_y^0 = \{0\}\)). Directly substituting (14) and (15) into Definition 2.1 yields
We formally summarize this observation below:
Lemma 3.2
For problem (12) and any \(k \in \mathbb {N}\), if the iterates are generated so that \((x^k,y^k)\in \mathcal {H}_x^k\times \mathcal {H}_y^k\), with \(\mathcal {H}_x^k\) and \(\mathcal {H}_y^k\) defined in Definition 2.1, then based on (14) and (15) the search subspaces satisfy
3.2 Lower bounding the duality gap
Let us lower bound the dual gap, which in turn lower bounds the whole duality gap. To this end, let us first write down the dual function of problem (12) as
For this \(\mu _y\)-strongly concave dual function, we can characterize the optimal solution \(y^*\) directly by its KKT condition \(\nabla \Psi (y^*) = 0\). However, the exact solution \(y^*\) does not have a simple and clear form, so we choose to characterize it by an approximate solution \(\hat{y}^*\).
Lemma 3.3
(Approximate optimal solution) Let us assign the value of b as \(b := -\frac{L_{xy}^2}{4\mu _x}e_1\). Denote \(\alpha := \frac{4\mu _x\mu _y}{L_{xy}^2}\), and let \(q = \frac{1}{2}\left( (2+\alpha ) - \sqrt{(2+\alpha )^2 - 4}\right) \in (0,1)\) be the smallest root of the quadratic equation \(1 - (2+\alpha )q + q^2 = 0\). Then, an approximate optimal solution \(\hat{y}^*\) can be constructed as
The approximation error can be bounded by
where \(\hat{y}^*_i\) is the i-th element of \(\hat{y}^*\). Note that \(q<1\) and the lower bound is dimension-independent, hence we are free to choose n to make the approximation error arbitrarily small.
Proof
First, let us substitute the value of b into the KKT system \(\nabla \Psi (y^*)=0\); after slightly rearranging and scaling the terms, we get
Using the definition of \(\alpha \) and b, the equation becomes
Substituting the formula of \(A^2\) in (13), we expand the above equation as
By direct calculation, we can check that \(\hat{y}^*\) satisfies the first \(n-1\) equations of the KKT system (19). The last equation, however, is violated, with a residual of size \(q^{n+1}/(1-q)\). In detail,
This indicates that \(\hat{y}^* - y^* = \frac{q^{n+1}}{1-q}\cdot (A^2 + \alpha I)^{-1} e_n\). Since \(\alpha ^{-1}I\succeq (A^2 + \alpha I)^{-1}\succ 0\), the approximation error is bounded as in (18). \(\square \)
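The algebra in Lemma 3.3 can be sanity-checked numerically: for any \(\alpha >0\) (the value below is arbitrary), the stated q is the smaller root of \(1-(2+\alpha )q+q^2=0\) in (0, 1), the geometric sequence \(q^i\) satisfies the associated three-term recurrence, and q agrees with the closed form that appears later in Theorem 3.5.

```python
import math

alpha = 0.3   # stands for 4 mu_x mu_y / L_xy^2; any positive value works
q = 0.5 * ((2 + alpha) - math.sqrt((2 + alpha) ** 2 - 4))

# q solves 1 - (2 + alpha) q + q^2 = 0 ...
residual = 1 - (2 + alpha) * q + q ** 2

# ... hence q^{i-1} - (2 + alpha) q^i + q^{i+1} = 0 for every i >= 1
i = 5
recurrence = q ** (i - 1) - (2 + alpha) * q ** i + q ** (i + 1)

# Equivalent closed form in terms of m = mu_x mu_y / L_xy^2 (as in Theorem 3.5)
m = alpha / 4
q_closed = 1 + 2 * m - 2 * math.sqrt(m * m + m)
```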
Note that in Lemma 3.3, we have chosen \(b\propto e_1\). By the zero-chain property in Proposition 3.1 and Lemma 3.2, we can verify that the subspaces \(\mathcal {H}_y^{2k-1}\) and \(\mathcal {H}_y^{2k}\) satisfy
This implies that for both \(y^{2k}\) and \(y^{2k-1}\), the only possibly nonzero elements are the first k ones, which in turn implies that the lower bounds of \(\Vert y^{2k}-y^*\Vert ^2\) and \(\Vert y^{2k-1}-y^*\Vert ^2\) will be similar. For simplicity, we only discuss this lower bound for \(y^{2k}\); the counterpart for \(y^{2k-1}\) can be obtained in a similar way. We thus have the following estimates.
Lemma 3.4
Assume \(k\le \frac{n}{2}\) and \(n\ge 2\log _q\left( \frac{\alpha }{4\sqrt{2}}\right) \). Then
where \(y^0 = 0\) is the initial solution.
Proof
By the subspace characterization (20), we have
where the last inequality is due to the fact that \(q< 1\), \(k\le \frac{n}{2}\), and \(y^0 = 0\). If we choose n to be large enough, then \(\hat{y}^*\) and \(y^*\) can be made arbitrarily close to each other. Hence we can transform the above inequality into (21). More details of this derivation can be found in Appendix A. \(\square \)
Using Lemma 3.4 and (6), it is then straightforward to lower bound the duality gap by
Summarizing, below we present our first main result.
Theorem 3.5
Let the positive parameters \(\mu _x,\mu _y>0\) and \(L_{xy}> 0\) be given. For any integer k, there exists a problem instance from \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\) of the form (12) with \(n\ge \max \left\{ 2\log _q\left( \frac{\mu _x\mu _y}{\sqrt{2}L_{xy}^2}\right) , 4k\right\} \), where \(A\in \mathbb {R}^{n\times n}\) is defined in (13) and \(b = -\frac{L_{xy}^2}{4\mu _x}e_1\). For such a problem instance, any approximate solution \((\tilde{x}^k, \tilde{y}^k)\in \mathcal {H}_x^k\times \mathcal {H}_y^k\) generated by the proximal algorithm class under the linear span assumption (7) satisfies
where \(q = 1+\frac{2\mu _x\mu _y}{L_{xy}^2} - 2\sqrt{\left( \frac{\mu _x\mu _y}{L_{xy}^2}\right) ^2 + \frac{\mu _x\mu _y}{L_{xy}^2}}\).
Proposition 3.6
Under the same set of assumptions of Theorem 3.5, if we require the duality gap to be bounded by \(\epsilon \), the number of iterations needed is at least
The proof of Proposition 3.6 is in Appendix B.
3.3 The general proximal algorithm class
Note that Theorem 3.5 is derived for the proximal algorithm class under the linear span assumption. In this section, we apply the orthogonal invariance technique introduced in [25] to generalize the result of Theorem 3.5 to the general proximal algorithm class without the linear span assumption.
Consider the bilinear problem class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\) and the corresponding worst case problem (12) with \(F(x,y) = \frac{\mu _x}{2}\Vert x\Vert ^2 + \frac{L_{xy}}{2}x^\top A y - \frac{\mu _y}{2}\Vert y\Vert ^2 - b^\top y\), where A and b are defined in accordance with Theorem 3.5. We define the orthogonally rotated problem as
where U, V are two orthogonal matrices. Clearly, \(F_{U,V}\in \mathcal {B}(L_{xy},\mu _x,\mu _y)\). Let \((x^*, y^*)\) be the saddle point of F(x, y); then the saddle point of \(F_{U,V}(x,y)\) is \((\bar{x}^*,\bar{y}^*) = (U^\top x^*,V^\top y^*)\). Consequently, the lower bound for the general proximal algorithm class is characterized by the following theorem.
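The saddle point relation under rotation can be verified directly. The sketch below uses a generic bilinear instance with a hypothetical coupling matrix C in place of \(\frac{L_{xy}}{2}A\), and checks that the saddle point of the rotated problem is exactly \((U^\top x^*, V^\top y^*)\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
mu_x, mu_y = 1.0, 2.0
C = rng.standard_normal((n, n))   # hypothetical coupling matrix, stands in for (L_xy/2) A
b = rng.standard_normal(n)

# Saddle point of F(x,y) = mu_x/2 ||x||^2 + x^T C y - mu_y/2 ||y||^2 - b^T y:
#   mu_x x* + C y* = 0,   C^T x* - mu_y y* - b = 0
K = np.block([[mu_x * np.eye(n), C], [C.T, -mu_y * np.eye(n)]])
sol = np.linalg.solve(K, np.concatenate([np.zeros(n), b]))
x_star, y_star = sol[:n], sol[n:]

# Random orthogonal U, V via QR factorization
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))

# F_{U,V}(x,y) = F(Ux, Vy): its coupling matrix is U^T C V and its linear term is V^T b,
# since orthogonal matrices preserve the squared norms
C_rot = U.T @ C @ V
K_rot = np.block([[mu_x * np.eye(n), C_rot], [C_rot.T, -mu_y * np.eye(n)]])
sol_rot = np.linalg.solve(K_rot, np.concatenate([np.zeros(n), V.T @ b]))
xbar_star, ybar_star = sol_rot[:n], sol_rot[n:]
```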
Theorem 3.7
Let \(\mathcal {A}\) be any algorithm from the general proximal algorithm class described in Definition 2.3, and assume for simplicity that the dimension n is sufficiently large. For any integer k, there exist orthogonal matrices U, V such that \(F_{U,V}\in \mathcal {B}(L_{xy},\mu _x,\mu _y)\), and when applying \(\mathcal {A}\) to \(F_{U,V}\) with the initial solution \((x^0,y^0) = (0,0)\), the iterates and output satisfy
where \(\mathcal {H}_x^i,\mathcal {H}_y^i\) are defined by Lemma 3.2. Consequently, by Theorem 3.5,
where q is given in Theorem 3.5. As a result, it takes \(\Omega \left( \sqrt{\frac{L^2_{xy}}{\mu _x\mu _y}}\cdot \log \left( \frac{1}{\epsilon }\right) \right) \) iterations to output a solution with \(O(\epsilon )\) duality gap.
To prove this theorem, we only need to construct the orthogonal matrices U, V such that when the algorithm \(\mathcal {A}\) is applied to \(F_{U,V}\), the subspace inclusions \(\{(x^0,y^0),...,(x^k,y^k)\}\subseteq U^\top \mathcal {H}_x^{4k-1}\times V^\top \mathcal {H}_y^{4k-1}\) and \((\tilde{x}^k,\tilde{y}^k)\in U^\top \mathcal {H}_x^{4k+1}\times V^\top \mathcal {H}_y^{4k+1}\) hold. As a result,
With this argument, the latter results follow directly from the discussion of Theorem 3.5. The proof of this theorem is presented in Appendix C.
3.4 Tightness of the bound
The following remark establishes the tightness of the derived lower bound.
Remark 3.8
(Tightness of the bound) Consider the algorithm defined in Example 2.1, from [9, 10]. The achieved upper complexity bound is \({\mathcal {O}}\left( \sqrt{\frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \), and it matches our lower bound. This means that our lower bound (23) is tight and that the algorithm defined in Example 2.1 is optimal within the proximal algorithm class of Definition 2.1.
4 Lower bound for pure first-order algorithms
4.1 The worst-case instance
In this section, we consider the lower complexity bound for pure first-order methods without any proximal oracle. In this case, only gradient information can be used to construct the iterates and produce the approximate solution output. As before, we consider bilinearly coupled problems:
where b is a vector whose value will be determined later. The coefficients are \(B_x:= \frac{L_x-\mu _x}{4}\), \(B_y:= \frac{L_y-\mu _y}{4}\), and the coupling matrix A is defined by (13). Since \(\Vert A\Vert _2\le 2\) and hence \(\Vert A\Vert _2^2\le 4\), one can check that problem (25) is an instance of the problem class \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\). This time the subspaces \(\mathcal {H}_x^k\) and \(\mathcal {H}_y^k\) are generated by the following gradients:
Following Definition 2.2, by letting \(x^0 = y^0 = 0\) we have
By induction, we get the general structure of these subspaces.
Lemma 4.1
For problem (25) and for any \(k \in \mathbb {N}\), if the iterates are generated so that \((x_k,y_k)\in \mathcal {H}_x^k\times \mathcal {H}_y^k\), with \(\mathcal {H}_x^k\) and \(\mathcal {H}_y^k\) defined by (2.2), then we have
Unlike in the last section, it is now more convenient to work with the primal function instead of the dual one. Partially maximizing over y, we have
which is \(\mu _x\)-strongly convex. Therefore, the primal optimal solution \(x^*\) is completely characterized by the optimality condition \(\nabla \Phi (x^*) = 0\). However, the solution of this system cannot be computed exactly. Instead, we shall construct an approximate solution \(\hat{x}^*\) to the exact solution \(x^*\).
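Since display (25) is not reproduced here, the partial maximization can be sketched for a generic bilinearly coupled quadratic; the symbols \(f, g, C, H, \tilde{b}\) below are generic placeholders rather than notation from the paper:

```latex
F(x,y) = f(x) + x^\top C y - g(y), \qquad
g(y) = \tfrac{1}{2}\, y^\top H y + \tilde{b}^\top y, \quad H \succeq \mu _y I .
```

Setting \(\nabla _y F = C^\top x - Hy - \tilde{b} = 0\) gives \(y^*(x) = H^{-1}(C^\top x - \tilde{b})\), hence

```latex
\Phi (x) = \max _y F(x,y)
 = f(x) + \tfrac{1}{2}\,\big(C^\top x - \tilde{b}\big)^\top H^{-1} \big(C^\top x - \tilde{b}\big).
```

The added term is convex in x, so \(\Phi \) inherits the \(\mu _x\)-strong convexity of f.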
Lemma 4.2
(Root estimation) Consider a quartic equation
where the constants are given by
As long as \(L_x>\mu _x>0\) and \(L_y>\mu _y>0\), the constants \(0<\alpha ,\beta <+\infty \) are well-defined positive real numbers. This quartic equation has a real root \(x=q\) satisfying
The proof of this lemma is presented in Appendix D. With this lemma, we can construct the approximate solution \(\hat{x}^*\) as follows.
Lemma 4.3
(Approximate optimal solution) Let \(\alpha ,\beta \) be defined according to (27), let q be a real root of quartic equation (26) satisfying (28). Let us define a vector \(\hat{b}\) with elements given by
and then assign \(b = \frac{2B_xB_y}{L_{xy}}A^{-1}\hat{b}\). Then an approximate solution \(\hat{x}^*\) is constructed as
The approximation error can be bounded by
Note that \(q<1\) and the lower bound is dimension-independent, hence we are free to choose n to make the approximation error arbitrarily small.
The proof of this lemma is parallel to that of Lemma 3.3, but is more involved; the detailed proof is in Appendix E.
Note that in this case the vector \(Ab\propto \hat{b}\in \mathrm {Span}\{e_1,e_2\}\). By the zero-chain property in Proposition 3.1, the subspace \(\mathcal {H}_x^k\) described in Lemma 4.1 can be calculated by induction
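The zero-chain mechanism behind this induction can be illustrated numerically. Since the matrix of (13) is not reproduced here, the snippet below uses the standard lower-bidiagonal "chain" matrix as an assumed stand-in; it checks the norm bound \(\Vert A\Vert _2\le 2\) and that each multiplication by A extends the support of a vector by at most one coordinate, which is why k oracle calls reach at most the first \(k+1\) coordinates.

```python
import numpy as np

n = 8
# Assumed stand-in for the coupling matrix of (13): ones on the diagonal,
# minus ones on the subdiagonal (the paper's exact A is not shown here).
A = np.eye(n) - np.eye(n, k=-1)

assert np.linalg.norm(A, 2) <= 2.0  # the norm bound used for class membership

# Zero-chain behaviour: each multiplication by A can extend the support of
# a vector by at most one coordinate.
support_end = lambda v: np.flatnonzero(np.abs(v) > 1e-12).max()
x = np.zeros(n)
x[0] = 1.0
for k in range(1, n):
    x = A @ x  # one "oracle call" grows the support by at most one
    assert support_end(x) <= k
print(support_end(x))  # -> 7: after n-1 products the whole space is reached
```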
Parallel to Lemma 3.4, we have the following lemma, whose proof is in Appendix F.
Lemma 4.4
Assume \(k\le \frac{n}{2}\) and \(n\ge 2\log _q\left( \frac{\beta }{4\sqrt{2}(7+\alpha )}\right) +2\). Then
where \(x^0 = 0\) is the initial solution.
Consequently, the duality gap is lower bounded by
Summarizing, we present our second main result in the following theorem.
Theorem 4.5
Let positive parameters \(\mu _x,\, \mu _y>0\) and \(L_x>\mu _x,\, L_y>\mu _y,\, L_{xy} > 0\) be given. For any integer k, there exists a problem instance in \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\) of form (25), with \(n\ge \max \left\{ 2\log _q\left( \frac{7+\alpha }{\beta }\right) , 2k\right\} \), the constants \(\alpha ,\beta \) as in (27), the matrix \(A\in \mathbb {R}^{n\times n}\) as in (13), and the vector \(b = \frac{2B_xB_y}{L_{xy}}A^{-1}\hat{b}\) with \(\hat{b}\) as in (29). For this problem, any approximate solution \((\tilde{x}^k, \tilde{y}^k)\in \mathcal {H}_x^k\times \mathcal {H}_y^k\) generated by the first-order algorithm class (9) satisfies
where q satisfying (28) is a root of the quartic equation (26).
Remark 4.6
As a result, if we require the duality gap to be bounded by \(\epsilon \), then the number of iterations needed is at least
4.2 The general pure first-order algorithm class
Consider the problem class \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\), we define the orthogonally rotated problem as
where \(A,b,B_x,B_y\) are defined in accordance with Theorem 4.5 and U, V are two orthogonal matrices. Therefore, it is clear that \(F_{U,V}\in \mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\). Let \((x^*,y^*)\) be the saddle point of F(x, y); then it is clear that the saddle point of \(F_{U,V}(x,y)\) is \((\bar{x}^*,\bar{y}^*) = (U^\top x^*,V^\top y^*).\) Consequently, the lower bound for the general pure first-order algorithm class is characterized by the following theorem.
Theorem 4.7
Let \(\mathcal {A}\) be any algorithm from the general pure first-order algorithm class described in Definition 2.4. For simplicity, we assume the dimension n is sufficiently large. For any integer k, there exist orthogonal matrices U, V s.t. \(F_{U,V}\in \mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\), and when applying \(\mathcal {A}\) to \(F_{U,V}\) with initial solution \((x^0,y^0) = (0,0)\), the iterates and output satisfy
where \(\mathcal {H}_x^i,\mathcal {H}_y^i\) are defined by Lemma 4.1. Consequently, by Theorem 4.5,
where q is defined in Theorem 4.5. As a result, it takes \(\Omega \left( \sqrt{\frac{L_x}{\mu _x} + \frac{L_{xy}^2}{\mu _x\mu _y} + \frac{L_y}{\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \) iterations to output a solution with \(O(\epsilon )\) duality gap.
The proof of this theorem is completely parallel to that of Theorem 3.7. We only need to construct orthogonal matrices U, V such that \(\{(x^0,y^0),...,(x^k,y^k)\}\subseteq U^\top \mathcal {H}_x^{2k}\times V^\top \mathcal {H}_y^{2k}\) and \((\tilde{x}^k,\tilde{y}^k)\in U^\top \mathcal {H}_x^{2k+1}\times V^\top \mathcal {H}_y^{2k+1}\) hold; the construction follows exactly the same procedure as in the proof of Theorem 3.7. It then remains to argue that
The remaining results follow from Theorem 4.5, and we omit the details.
4.3 Tightness of the bound
In this section, we discuss the tightness of this bound. Currently, to the best of our knowledge, there does not exist a pure first-order algorithm that can achieve the lower complexity bound provided in (35). Therefore, whether an optimal algorithm exists that can match this bound or the bound can be further improved remains an open problem. However, we shall see below that (35) under several special parameter regimes is indeed a tight bound.
Case 1 \(\mathcal {F}(L_x,L_y, L_{xy},\mu _x,\mu _y)\) For this general class, define \(L = \max \{L_x,L_y,L_{xy}\}\). A near optimal upper bound of
is obtained in [38], which almost matches our lower bound.
Case 2 \(\mathcal {F}(L_x,L_y, 0,\mu _x,\mu _y)\) In this case \(L_{xy} = 0\), meaning that variables x and y are decoupled. Problem (1) becomes two independent convex problems with condition numbers \(\frac{L_x}{\mu _x}\) and \(\frac{L_y}{\mu _y}\) respectively. In this case (35) is reduced to
This is matched by running two independent Nesterov’s accelerated gradient methods [29].
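For the decoupled case, the matching upper bound comes from running Nesterov's accelerated gradient on each side independently. Below is a minimal sketch of the constant-momentum variant for an L-smooth, \(\mu \)-strongly convex objective, applied to an illustrative diagonal quadratic (the instance and parameter values are ours, not from the paper):

```python
import numpy as np

def agd_strongly_convex(grad, L, mu, x0, iters):
    # Nesterov's accelerated gradient, constant-momentum variant for an
    # L-smooth and mu-strongly convex objective.
    beta = (np.sqrt(L / mu) - 1) / (np.sqrt(L / mu) + 1)
    x = y = x0.copy()
    for _ in range(iters):
        x_next = y - grad(y) / L          # gradient step at the extrapolated point
        y = x_next + beta * (x_next - x)  # momentum extrapolation
        x = x_next
    return x

# Decoupled instance (L_xy = 0): the saddle point problem splits into an
# x-minimization and a y-maximization, each solvable by AGD on its own.
# Illustrative quadratic f(x) = 0.5 * x^T diag(d) x with spectrum [mu, L].
rng = np.random.default_rng(0)
n, mu, L = 20, 1.0, 100.0
d = np.linspace(mu, L, n)
x = agd_strongly_convex(lambda z: d * z, L, mu, rng.standard_normal(n), 400)
assert np.linalg.norm(x) < 1e-5  # minimizer is the origin; linear convergence
```

The iteration count 400 comfortably exceeds the \({\mathcal {O}}(\sqrt{\kappa }\ln (1/\epsilon ))\) prediction for \(\kappa = 100\), so the final error is far below the tolerance.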
Case 3 \(\mathcal {F}(L,L,L,\mu ,\mu )\) In this case \(L_x = L_y = L_{xy} = L\), \(\mu _x = \mu _y = \mu \). Then (35) is reduced to
The extra-gradient algorithm (EG) and the accelerated dual extrapolation algorithm (ADE) introduced in Example 2.3 have achieved this bound; see e.g. [23, 30].
Case 4 \(\mathcal {F}(L_x,{\mathcal {O}}(1)\cdot \mu _y,L_{xy},\mu _x,\mu _y)\) In this case \(L_y = {\mathcal {O}}(1)\cdot \mu _y\), meaning that one side of the problem is easy to solve. Then, (35) is reduced to
For the double-loop algorithm defined in Example 2.2, we set the inner-loop iteration count to \(T_2 = {\mathcal {O}}\left( \sqrt{\frac{L_y}{\mu _y}}\ln \left( \frac{1}{\epsilon }\right) \right) = {\mathcal {O}}\left( \ln \left( \frac{1}{\epsilon }\right) \right) \) and the outer-loop iteration count to \(T_1 = {\mathcal {O}}\left( \sqrt{\frac{L_{\Phi ,x}}{\mu _x}}\ln \left( \frac{1}{\epsilon }\right) \right) = {\mathcal {O}}\left( \sqrt{\frac{L_x}{\mu _x} + \frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \ln \left( \frac{1}{\epsilon }\right) \right) \). Then, an upper bound of
can be guaranteed. It is tight up to a logarithmic factor.
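The outer-loop count can be traced to the smoothness of the primal function \(\Phi (x) = \max _y F(x,y)\); the bound on \(L_{\Phi ,x}\) below is the standard one and is stated here as a sketch rather than quoted from the paper:

```latex
L_{\Phi ,x} \;\le\; L_x + \frac{L_{xy}^2}{\mu _y}
\quad\Longrightarrow\quad
\sqrt{\frac{L_{\Phi ,x}}{\mu _x}}
 = {\mathcal {O}}\!\left( \sqrt{\frac{L_x}{\mu _x} + \frac{L_{xy}^2}{\mu _x\mu _y}}\right),
\qquad
T_1\cdot T_2
 = {\mathcal {O}}\!\left( \sqrt{\frac{L_x}{\mu _x} + \frac{L_{xy}^2}{\mu _x\mu _y}}\cdot \ln ^2\!\left(\frac{1}{\epsilon }\right)\right)
```

when \(L_y = {\mathcal {O}}(1)\cdot \mu _y\), i.e. one logarithmic factor above the lower bound.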
Case 5 \(\mathcal {F}(L,L,L,\mu _x,\mu _y)\) In this case \(L_x = L_y = L_{xy} = L\). Then (35) is reduced to
This bound has been achieved by [19] up to a logarithmic factor.
5 Reduction to lower bounds for convex-concave problems
Note that in the previous sections we consider the strongly-convex and strongly-concave problem classes \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\) and \(\mathcal {F}(L_{x},L_{y},L_{xy},\mu _x,\mu _y)\), where \(\mu _x>0\) and \(\mu _y>0\). In this section, we show how our iteration complexity lower bounds provided in Theorems 3.5 and 4.5 can be reduced to the problem classes with \(\mu _x = \mu _y = 0\). A similar reduction can also be done for the case \(\mu _x>0, \mu _y = 0\), but it is omitted in this paper.
5.1 Lower bound for pure first-order algorithm class
Unlike the strongly-convex and strongly-concave saddle point problems, the saddle point of the general convex-concave problem may not always exist. Therefore, we define a new problem class with bounded saddle point solution as follows.
Definition 5.1
(Problem class \(\mathcal {F}_0(L_x,L_y,L_{xy},R_x,R_y)\)) We say a function F belongs to the class \(\mathcal {F}_0(L_x,L_y,L_{xy},R_x,R_y)\) if: (i) \(F\in \mathcal {F}(L_x,L_y,L_{xy},0,0)\); (ii) the solution \((x^*,y^*) = \mathop {\hbox {argmin}}\limits _x\mathop {\hbox {argmax}}\limits _y F(x,y)\) exists, and \(\Vert x^*\Vert \le R_x\), \(\Vert y^*\Vert \le R_y\).
For this problem class, we have the following lower bound result, as a corollary of Theorem 4.7.
Corollary 5.2
Consider applying the general first-order algorithm class defined by (2.2) to the problem class \(\mathcal {F}_0(L_x,L_y,L_{xy},R_x,R_y)\). For any \(\epsilon >0\), there exists a problem instance \(F_\epsilon (x,y)\in \mathcal {F}_0(L_x,L_y,L_{xy},R_x,R_y)\), such that
iterations are required to reduce the duality gap to \(\epsilon \).
Proof
We start the reduction by the following scaling argument. First, for any \(\epsilon >0\), let \(\hat{F}_\epsilon \in \mathcal {F}(L_x, L_y, L_{xy},\mu _x,\mu _y)\) be the worst-case instance described by Theorem 4.7. For our purpose, we choose
Then by direct computation, we know that the following scaled problem satisfies
We skip the parameter b since it is already used in the construction of the worst-case instance \(\hat{F}_\epsilon \). Denote \((\hat{x}^*,\hat{y}^*) = \mathop {\hbox {argmin}}\limits _{x}\mathop {\hbox {argmax}}\limits _y \hat{F}_\epsilon (x,y)\) and \((x^*,y^*) = \mathop {\hbox {argmin}}\limits _{x}\mathop {\hbox {argmax}}\limits _y F_\epsilon (x,y)\). Let us set
Then we have \(x^* = \frac{R_x\hat{x}^*}{\Vert \hat{x}^*\Vert }\) and \(y^* = \frac{R_y\hat{y}^*}{\Vert \hat{y}^*\Vert }\), and the Lipschitz constants of \(F_\epsilon \) satisfy that \(ac^2 L_x\le L_x\), \(ad^2 L_y\le L_y\), and \(acd L_{xy}\le L_{xy}\). Therefore, we know
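The Lipschitz-constant claims follow from the chain rule applied to the scaling; a sketch, assuming the elided display defines the scaled problem as \(F_\epsilon (x,y) = a\,\hat{F}_\epsilon (cx, dy)\):

```latex
\nabla ^2_{xx} F_\epsilon (x,y) = ac^2\, \nabla ^2_{xx}\hat{F}_\epsilon (cx,dy),\qquad
\nabla ^2_{yy} F_\epsilon (x,y) = ad^2\, \nabla ^2_{yy}\hat{F}_\epsilon (cx,dy),\qquad
\nabla ^2_{xy} F_\epsilon (x,y) = acd\, \nabla ^2_{xy}\hat{F}_\epsilon (cx,dy),
```

so the constants map as \((L_x,L_y,L_{xy},\mu _x,\mu _y)\mapsto (ac^2L_x,\, ad^2L_y,\, acdL_{xy},\, ac^2\mu _x,\, ad^2\mu _y)\), and the saddle point maps as \((x^*,y^*) = (\hat{x}^*/c,\ \hat{y}^*/d)\).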
Note that purely scaling the variables and the function does not change the worst-case nature of this problem. That is, \(F_\epsilon \) is still a worst-case problem instance of the function class \(\mathcal {F}(ac^2 L_x,ad^2 L_y,acd L_{xy},ac^2 \mu _x,ad^2 \mu _y)\), and the lower bound of Theorem 4.5 is valid for this specific instance. Therefore, to bring the duality gap below \(\epsilon \), the number of iterations k is lower bounded by
where (i) is because \(x^0 = 0\), \(\Vert x^*\Vert = R_x\), and \(\mu _x = 64\epsilon /R_x^2\). Therefore, as long as we can show that \(\ln \left( 2ac^2\right) \ge \Omega (1)\), then the corollary is proved. However, since the details are rather technical, we shall provide a proof of \(\ln \left( 2ac^2\right) \ge \ln 2 \) in Appendix G. \(\square \)
As a remark, by setting \(L_y = 0\), the lower bound (37) implies the result derived in [33]. When \(L_x = L_y = L_{xy} = L\), the lower bound (37) implies the result derived in [25]. The reduction for the general pure first-order algorithm class defined by (2.4) without the linear span assumption can also be done in a similar manner and is omitted for succinctness.
5.2 Lower bound for proximal algorithm class
Like Definition 5.1, we define a new bilinear problem class with bounded saddle point solution as follows.
Definition 5.3
(Problem class \(\mathcal {B}_0(L_{xy},R_x,R_y)\)) We say a function F belongs to the function class \(\mathcal {B}_0(L_{xy},R_x,R_y)\) if: (i) \(F\in \mathcal {B}(L_{xy},0,0)\); (ii) the solution \((x^*,y^*) = \mathop {\hbox {argmin}}\limits _x\mathop {\hbox {argmax}}\limits _y F(x,y)\) exists, and \(\Vert x^*\Vert \le R_x\), \(\Vert y^*\Vert \le R_y\).
For this problem class, we have the following lower bound result, as a corollary of Theorem 3.7.
Corollary 5.4
Consider applying the general proximal algorithm class described in Definition 2.3 to the problem class \(\mathcal {B}_0(L_{xy},R_x,R_y)\). For any \(\epsilon >0\), there exists an instance \(F_\epsilon (x,y)\in \mathcal {B}_0(L_{xy},R_x,R_y)\), such that
iterations are required to reduce the duality gap to \(\epsilon \).
Remark 5.5
The lower bound in Corollary 5.4 is tight. An optimal algorithm is derived in [9, 10].
The reduction can be done in a similar way as in Corollary 5.2, but is much simpler. The details are omitted here.
6 Conclusion
In this paper, we establish lower complexity bounds for first-order methods in solving strongly convex and strongly concave saddle point problems. Different from existing results, we discuss the problem in the most general parameter regime. For the bilinear coupling problem class \(\mathcal {B}(L_{xy},\mu _x,\mu _y)\), and for both the proximal algorithm class (7) with the linear span assumption and the general proximal algorithm class (10) without the linear span assumption, a tight lower bound is established. For the general coupling problem class \(\mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\), and for both the pure first-order algorithm class (9) with the linear span assumption and the general pure first-order algorithm class (11) without the linear span assumption, a lower bound has been established. Under various special parameter regimes, tight upper bounds have been developed. In the most general setting of the min-max framework, a near-optimal algorithm has been discovered, while an optimal algorithm that exactly matches the lower bound has yet to be found. Finally, we also show that our results imply several existing lower bounds for general convex-concave problems through proper scaling of the worst-case instances, which indicates the generality of our results.
Notes
In Nesterov’s original paper [30], the author did not give a name to his algorithm. For convenience of referencing, in this paper we shall call it accelerated dual extrapolation.
References
Abadeh, S.S., Esfahani, P.M., Kuhn, D.: Distributionally robust logistic regression. In: Advances in Neural Information Processing Systems, pp. 1576–1584, (2015)
Agarwal, N., Hazan, E.: Lower bounds for higher-order convex optimization. arXiv preprint arXiv:1710.10329, (2017)
Arjevani, Y., Shamir, O., Shiff, R.: Oracle complexity of second-order methods for smooth convex optimization. Math. Progr. 178(1–2), 327–360 (2019)
Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. arXiv preprint arXiv:1701.07875, (2017)
Azizian, W., Scieur, D., Mitliagkas, I., Lacoste-Julien, S., Gidel, G.: Accelerating smooth games by manipulating spectral shapes. arXiv preprint arXiv:2001.00602, (2020)
Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, Nashua (1997)
Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Lower bounds for finding stationary points I. Math. Progr. 184, 71–120 (2017)
Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Lower bounds for finding stationary points II: first-order methods. Math. Progr. 158, 1–2 (2019)
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)
Chambolle, A., Pock, T.: On the ergodic convergence rates of a first-order primal-dual algorithm. Math. Progr. 159(1–2), 253–287 (2016)
Gao, X., Zhang, S.: First-order algorithms for convex optimization with nonseparable objective and coupled constraints. J. Oper. Res. Soc. China 5(2), 131–159 (2017)
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680, (2014)
Ibrahim, A., Azizian, W., Gidel, G., Mitliagkas, I.: Linear lower bounds and conditioning of differentiable games. arXiv preprint arXiv:1906.07300, (2019)
Jin, C., Netrapalli, P., Jordan, M.I.: Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618, (2019)
Jin, C., Netrapalli, P., Jordan, M.I.: What is local optimality in nonconvex-nonconcave minimax optimization? arXiv preprint arXiv:1902.00618, (2019)
Juditsky, A., Nemirovski, A., Tauvel, C.: Solving variational inequalities with stochastic mirror-prox algorithm. Stoch. Syst. 1(1), 17–58 (2011)
Korpelevich, G.M.: The extragradient method for finding saddle points and other problems. Matecon 12, 747–756 (1976)
Lin, Q., Liu, M., Rafique, H., Yang, T.: Solving weakly-convex-weakly-concave saddle-point problems as weakly-monotone variational inequality. arXiv preprint arXiv:1810.10207, (2018)
Lin, T., Jin, C., Jordan, M.: Near-optimal algorithms for minimax optimization. In: Annual Conference on Learning Theory, (2020)
Lin, T., Jin, C., Jordan, M.I.: On gradient descent ascent for nonconvex-concave minimax problems. arXiv preprint arXiv:1906.00331, (2019)
Lu, S., Tsaknakis, I., Hong, M., Chen, Y.: Hybrid block successive approximation for one-sided non-convex min-max problems: algorithms and applications. arXiv preprint arXiv:1902.08294, (2019)
Marcotte, P., Dussault, J.-P.: A note on a globally convergent newton method for solving monotone variational inequalities. Oper. Res. Lett. 6(1), 35–42 (1987)
Mokhtari, A., Ozdaglar, A., Pattathil, S.: A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. arXiv preprint arXiv:1901.08511, (2019)
Nemirovski, A.: Prox-method with rate of convergence \(O(1/t)\) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15(1), 229–251 (2004)
Nemirovsky, A.: Information-based complexity of linear operator equations. J. Complex. 8(2), 153–175 (1992)
Nemirovsky, A., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)
Nesterov, Yu.: Implementable tensor methods in unconstrained convex optimization. CORE Discussion Paper, 2018/05
Nesterov, Yu.: Dual extrapolation and its applications to solving variational inequalities and related problems. Math. Progr. 109(2–3), 319–344 (2007)
Nesterov, Yu.: Lectures on convex optimization, vol. 137. Springer, Cham (2018)
Nesterov, Yu., Scrimali, L.: Solving strongly monotone variational and quasi-variational inequalities. Available at SSRN 970903 (2006)
Nisan, N., Roughgarden, T., Tardos, E., Vazirani, V.: Algorithmic game theory. Cambridge University Press, Cambridge (2007)
Ouyang, Y., Chen, Y., Lan, G., Pasiliao Jr., E.: An accelerated linearized alternating direction method of multipliers. SIAM J. Imaging Sci. 8(1), 644–681 (2015)
Ouyang, Y., Xu, Y.: Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. arXiv preprint arXiv:1808.02901, (2018)
Rockafellar, R.T.: Convex analysis. Princeton University Press, Princeton (1970)
Sanjabi, M., Razaviyayn, M., Lee, J.D.: Solving non-convex non-concave min-max games under Polyak-Lojasiewicz condition. arXiv preprint arXiv:1812.02878, (2018)
Taji, K., Fukushima, M., Ibaraki, T.: A globally convergent newton method for solving strongly monotone variational inequalities. Math. Progr. 58(1–3), 369–383 (1993)
von Neumann, J., Morgenstern, O., Kuhn, H.W.: Theory of games and economic behavior (commemorative edition). Princeton University Press, Princeton (2007)
Wang, Y., Li, J.: Improved algorithms for convex-concave minimax optimization. arXiv preprint arXiv:2006.06359, (2020)
Xiao, L., Yu, A., Lin, Q., Chen, W.: DSCOVR: randomized primal-dual block coordinate algorithms for asynchronous distributed optimization. J. Mach. Learn. Res. 20(43), 1–58 (2019)
Xu, Y.: Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. SIAM J. Optim. 27(3), 1459–1484 (2017)
Acknowledgements
We thank the two anonymous reviewers for their insightful suggestions on the orthogonal invariance argument for breaking the linear span assumption and on applying scaling to obtain lower bounds for general convex-concave problems.
Appendices
Proof of Lemma 3.4
By the subspace characterization (20), we have
where the last inequality is due to the fact that \(q\le 1, k\le \frac{n}{2}\) and \(y^0 = 0\). Note that by Lemma 3.3, if we require \(n\ge 2\log _q\left( \frac{\alpha }{4\sqrt{2}}\right) \), then we can guarantee that
where the last inequality is due to \(\frac{q^{\frac{n}{2}}}{\alpha }\le \frac{1}{4\sqrt{2}}\) and \(q/(1-q)\le \Vert y^0-\hat{y}^*\Vert \). Therefore, we have
where the fourth line is due to the fact that \(\frac{d}{dt}\left( t^2 - 2\Vert \hat{y}^*-y^*\Vert t\right) = 2(t-\Vert \hat{y}^*-y^*\Vert )\ge 0\) when \(t\ge \delta _k\); hence the quadratic function is monotonically increasing on the interval considered. In addition, we also have
where the third inequality is due to the fact that \(\Vert y^0-\hat{y}^*\Vert \ge \hat{y}^*_1 = q/(1-q)\). For the last inequality, if \(\alpha \ge 1\), then \(q^n/\alpha <1\); if \(\alpha \le 1\), then \(q^n/\alpha \le \alpha /32\le 1\) since \(n\ge 2\log _q\left( \frac{\alpha }{4\sqrt{2}}\right) \). Combining the above two inequalities, the desired bound (21) follows.
Proof of Proposition 3.6
Here we only prove the last inequality of (23). Due to the fact that \((\ln (1+z))^{-1}\ge 1/z\) for \(\forall z>0\), we know
which completes the proof.
Proof of Theorem 3.7
Before proceeding with the proof, let us first quote a lemma from [33].
Lemma C.1
(Lemma 3.1 in [33]) Let \(\mathcal {X}\subsetneqq \bar{\mathcal {X}}\subseteqq \mathbb {R}^{p}\) be two linear subspaces. Then for any \(\bar{x}\in \mathbb {R}^p\), there exists an orthogonal matrix \(\Gamma \in \mathbb {R}^{p\times p}\) s.t. \(\Gamma x = x, \forall x\in \mathcal {X}\) and \(\Gamma \bar{x}\in \bar{\mathcal {X}}\).
Note that for an orthogonal matrix \(\Gamma \), if \(\Gamma x = x\), then we also have \(\Gamma ^\top x = x\). Now let us start our proof of Theorem 3.7.
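The construction behind Lemma C.1 can be realized numerically with a single Householder reflection inside \(\mathcal {X}^\perp \). The sketch below is our own illustration of the idea (not the proof in [33]); `rotation_fixing_subspace` and the toy subspaces are hypothetical names:

```python
import numpy as np

def rotation_fixing_subspace(X_basis, Xbar_basis, xbar, tol=1e-10):
    """Orthogonal Gamma with Gamma x = x for all x in X = span(X_basis)
    and Gamma @ xbar in span(Xbar_basis), for nested subspaces X < Xbar.
    A numerical sketch of Lemma C.1 via one Householder reflection."""
    p = xbar.shape[0]
    Qx, _ = np.linalg.qr(X_basis)
    P = Qx @ Qx.T                      # orthogonal projector onto X
    u = xbar - P @ xbar                # component of xbar orthogonal to X
    nu = np.linalg.norm(u)
    if nu < tol:
        return np.eye(p)               # xbar already lies in X, hence in Xbar
    u = u / nu
    # unit vector v in Xbar intersected with X^perp (nonempty since X is a
    # proper subspace of Xbar): leading left singular vector of (I-P)·Xbar
    U, _, _ = np.linalg.svd(Xbar_basis - P @ Xbar_basis)
    v = U[:, 0]
    w = u - v
    if np.linalg.norm(w) < tol:
        return np.eye(p)
    w = w / np.linalg.norm(w)
    # reflection across the hyperplane orthogonal to w: maps u -> v and
    # fixes X pointwise because w lies in X^perp
    return np.eye(p) - 2.0 * np.outer(w, w)

# toy nested pair: X = span(e1, e2), Xbar = span(e1, ..., e4) in R^6
rng = np.random.default_rng(1)
p = 6
Xb = np.eye(p)[:, :2]
Xbarb = np.eye(p)[:, :4]
xbar = rng.standard_normal(p)
G = rotation_fixing_subspace(Xb, Xbarb, xbar)
Pbar = Xbarb @ Xbarb.T
assert np.allclose(G.T @ G, np.eye(p))            # G is orthogonal
assert np.allclose(G @ Xb, Xb)                    # G fixes X pointwise
assert np.allclose(Pbar @ (G @ xbar), G @ xbar)   # G xbar lies in Xbar
```

Since \(w\in \mathcal {X}^\perp \), every \(x\in \mathcal {X}\) satisfies \(\Gamma x = x\), and \(\Gamma \bar{x} = P\bar{x} + \Vert \bar{x} - P\bar{x}\Vert \, v \in \bar{\mathcal {X}}\), matching the statement of the lemma.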
Proof
To prove this theorem, we only need to show
We separate the proof into two parts.
Part I. There exist orthogonal matrices \(\hat{U}\), \(\hat{V}\) s.t. when \(\mathcal {A}\) is applied to the rotated instance \(F_{\hat{U},\hat{V}}\), \(\{(x^0,y^0),...,(x^k,y^k)\}\subseteq \hat{U}^\top \mathcal {H}_x^{4k-1}\times \hat{V}^\top \mathcal {H}_y^{4k-1}.\)
Let \(\theta = (L_{xy},\mu _x,\mu _y)\) be the set of algorithmic parameters. To prove the result, let us construct the worst-case function \(F_{U,V}\) in a recursive way.
Case \(k = 1\): Let us define \(U_0 = V_0 = I\). When \(\mathcal {A}\) is applied to the function \(F_{U_0,V_0}\in \mathcal {B}(L_{xy}, \mu _x, \mu _y)\), the iterate sequence is \((x_{0}^0, y_{0}^0) = (0, 0)\) and
By Lemma C.1, there exist orthogonal matrices \(\Gamma _x^0\) and \(\Gamma _y^0\) such that \(\Gamma _x^0x^1_0\in \mathcal {H}_x^3 = \mathrm {Span}\{Ab\}\), \(\Gamma _y^0y^1_0\in \mathcal {H}_y^3=\mathrm {Span}\{b, A^2b\}\), and \(\Gamma _y^0b = (\Gamma _y^0)^\top b = b.\) That is
where \(U_1 = U_0\Gamma _x^0\) and \(V_1 = V_0\Gamma _y^0\).
Now we prove that when we apply the algorithm \(\mathcal {A}\) to \(F_{U_1,V_1}\), the generated iterates \(\{(x^0_1, y^0_1), (x^1_1, y^1_1)\}\) satisfy \((x^0_1, y^0_1) = (0, 0)\) and \((x^1_1, y^1_1) = (x^1_0, y^1_0)\). That is, the first two iterates generated by \(\mathcal {A}\) are exactly the same for \(F_{U_0,V_0}\) and \(F_{U_1,V_1}\). This is because \(u^1_1 = \mathcal {A}_u^1(\theta ; x^0_1, U_1^\top AV_1y^0_1) = \mathcal {A}_u^1(\theta ; 0, 0) = \mathcal {A}_u^1(\theta ; x^0_0, U_0^\top AV_0y^0_0) = u^1_0\); therefore
Through similar argument, we know \((y^1_1,\tilde{y}^1_1) = (y^1_0,\tilde{y}^1_0)\). Therefore, (42) indicates that
Case \(k=2\). For ease of exposition, we present one extra step of the discussion for \(k=2\) before giving the construction for general k.
For the problem instance \(F_{U_1,V_1}\), the iterates generated by \(\mathcal {A}\) are \((x_{1}^0, y_{1}^0) = (0, 0)\) and
Note that \(x_1^1 \in U_1^\top \mathcal {H}_x^3 \subsetneqq U_1^\top \mathcal {H}_x^5\subsetneqq U_1^\top \mathcal {H}_x^7\) and \(\{y_1^1,b\}\subsetneqq V_1^\top \mathcal {H}_y^3\subsetneqq V_1^\top \mathcal {H}_y^5\subsetneqq V_1^\top \mathcal {H}_y^7\). Therefore, there exist orthogonal matrices \(\Gamma _x^1\) and \(\Gamma _y^1\) such that
Now, let us define
Now we prove that if \(\mathcal {A}\) is applied to \(F_{U_2,V_2}\), the generated iterates \(\{(x_{2}^0, y_{2}^0), (x_{2}^1, y_{2}^1), (x_{2}^2, y_{2}^2)\}\) satisfy \((x_{2}^0, y_{2}^0) = (0, 0)\), \((x_{2}^1, y_{2}^1) = (x_{1}^1, y_{1}^1)\), and \((x_{2}^2, y_{2}^2) = (x_{1}^2, y_{1}^2)\). The argument for \((x_{2}^1, y_{2}^1) = (x_{1}^1, y_{1}^1)\) is almost the same as that of the case \(k=1\). We only provide the proof for \((x_{2}^2, y_{2}^2) = (x_{1}^2, y_{1}^2)\).
Next, we need to show \(u_2^2 = u_1^2\), which can be proved by arguing that all the inputs to \(\mathcal {A}_u^2\) are the same for both \(u_2^2\) and \(u_1^2\). First, it is straightforward that \(x_1^0 = 0 = x^0_2, U_1^\top AV_1 y_1^0 = 0 = U_2^\top AV_2 y_2^0\). By previous argument \(x_2^1 = x_1^1\). Finally, consider the last input \(U_2^\top AV_2 y_2^1\), because \(y_2^1 = y_1^1\in V_1^\top \mathcal {H}_y^3\subsetneqq V_1^\top \mathcal {H}_y^5\), we have \(\Gamma _y^1 y_2^1 = y_2^1 = y_1^1\in V_1^\top \mathcal {H}_y^3.\) Then \(V_2y_2^1 = V_1\Gamma _y^1y_2^1\in V_1V_1^\top \mathcal {H}_y^3 = \mathcal {H}_y^3.\) Therefore \(U_1^\top AV_2y_2^1\in U_1^\top A\mathcal {H}_y^3 = U_1^\top \mathcal {H}_x^5\) and
Consequently,
and
Through a similar argument, we have \((y_2^2,\tilde{y}_2^2) = (y_1^2,\tilde{y}_1^2)\). By (43) and (44), we have
Case k. Suppose we already have orthogonal matrices \(U_{k-1}, V_{k-1}\), such that when \(\mathcal {A}\) is applied to \(F_{U_{k-1},V_{k-1}}\), we have
Again, by Lemma C.1, there exist orthogonal matrices \(\Gamma _x^{k-1}\) and \(\Gamma _y^{k-1}\), such that
Now we define
Therefore, similar to our previous discussion, we only need to argue that when \(\mathcal {A}\) is applied to \(F_{U_k,V_k}\), the generated iterates \(\{(x_k^0,y_k^0), (x_k^1,y_k^1),\cdots ,(x_k^k,y_k^k)\}\) satisfy \((x_k^i,y_k^i) = (x_{k-1}^i,y_{k-1}^i)\) for \(i = 0,1,...,k\). We prove this by induction. First, it is straightforward that \((x_k^0,y_k^0) = (0,0) = (x_{k-1}^0,y_{k-1}^0)\). Suppose \((x_k^i,y_k^i) = (x_{k-1}^i,y_{k-1}^i)\) holds for \(i = 0,1,...,j-1\) with \(j\le k\); we now prove \((x_k^j,y_k^j) = (x_{k-1}^j,y_{k-1}^j)\), which is almost identical to the case \(k=2\).
For any \(i\in \{0,1,...,j-1\}\), let us show \(U_{k-1}^\top AV_{k-1} y_{k-1}^i = U_k^\top AV_k y_k^i\). Because \(y_k^i = y_{k-1}^i\in V_{k-1}^\top \mathcal {H}_y^{4k-5}\subsetneqq V_{k-1}^\top \mathcal {H}_y^{4k-3}\), we have \(\Gamma _y^{k-1} y_k^i = y_k^i = y_{k-1}^i\in V_{k-1}^\top \mathcal {H}_y^{4k-5}.\) Then \(V_ky_k^i = V_{k-1}\Gamma _y^{k-1}y_k^i\in V_{k-1}V_{k-1}^\top \mathcal {H}_y^{4k-5} = \mathcal {H}_y^{4k-5}.\) Therefore \(U_{k-1}^\top AV_{k}y_k^i\in U_{k-1}^\top A\mathcal {H}_y^{4k-5} = U_{k-1}^\top \mathcal {H}_x^{4k-3}\) and
for \(0\le i\le j-1\). Consequently,
and
Through a similar argument, we have \((y_k^i,\tilde{y}_k^i) = (y_{k-1}^i,\tilde{y}_{k-1}^i)\). By induction, we know \((y_k^i,\tilde{y}_k^i) = (y_{k-1}^i,\tilde{y}_{k-1}^i)\) for \(i = 0,1,...,k\). Consequently, we have
By setting \(\hat{U} = U_k\) and \(\hat{V} = V_k\), we prove the result for Part I.
Part II. There exist orthogonal matrices U, V such that when \(\mathcal {A}\) is applied to the rotated instance \(F_{U,V}\), \(\{(x^0,y^0),...,(x^k,y^k)\}\subseteq U^\top \mathcal {H}_x^{4k-1}\times V^\top \mathcal {H}_y^{4k-1},\) and \((\tilde{x}^k,\tilde{y}^k)\in U^\top \mathcal {H}_x^{4k+1}\times V^\top \mathcal {H}_y^{4k+1}\).
Given the result of Part I, let \(\{(x_k^0, y_k^0),...,(x_k^k, y_k^k)\}\) and \((\tilde{x}_k^k, \tilde{y}_k^k)\) be generated by \(\mathcal {A}\) when applied to \(F_{\hat{U},\hat{V}} = F_{U_k,V_k}\). By Lemma C.1, there exist orthogonal matrices P, Q such that
Define \(U = U_kP\), and \(V = V_kQ\). Let \(\{(x^0,y^0),...,(x^k,y^k)\}\) and the output \((\tilde{x}^k,\tilde{y}^k)\) be generated by \(\mathcal {A}\) when applied to \(F_{{U,V}}\). Then following the same line of argument of Case k, Part I, we have
Therefore, combining with (49), we complete the proof of Part II. \(\square \)
Proof of Lemma 4.2
For the ease of analysis, let us perform a change of variable \(r:=(1-q)^{-1}\). Then the quartic equation (26) can be transformed to
Although the quartic equation admits a closed-form root formula, it is impractical to use it for the purpose of a lower iteration complexity bound. Instead, we shall provide a sufficiently accurate estimate of r, which corresponds to an estimate of q.
First, we let \(\bar{r} = \frac{1}{2}+\sqrt{\frac{\alpha }{\beta } + \frac{1}{4}}\). Then \(f(\bar{r}) = 1>0.\)
Second, we let \(\underline{r} = \frac{1}{2}+\sqrt{\frac{\alpha }{2\beta } + \frac{1}{4}}\). Then,
Together with the fact that \(f(\bar{r}) = 1>0\), by continuity we know there is a root r in the interval \(\left( \underline{r}, \bar{r}\right) \), where
and
This further implies
which proves this lemma.
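The root estimate can be checked numerically for illustrative values of \(\alpha ,\beta \) (the exact expressions (27) are elided above; the values below are chosen so that the palindromic quartic has a real root in (0, 1)):

```python
import numpy as np

# quartic 1 - (4+a)q + (6+2a+b)q^2 - (4+a)q^3 + q^4 = 0, cf. (26);
# illustrative alpha, beta, not the paper's expressions (27)
alpha, beta = 4.0, 1.0
coeffs = [1.0, -(4 + alpha), 6 + 2 * alpha + beta, -(4 + alpha), 1.0]
roots = np.roots(coeffs)
real_q = [r.real for r in roots if abs(r.imag) < 1e-9 and 0 < r.real < 1]

r_lo = 0.5 + np.sqrt(alpha / (2 * beta) + 0.25)   # underline{r}
r_hi = 0.5 + np.sqrt(alpha / beta + 0.25)         # bar{r}

# Lemma 4.2: some real root q in (0,1) has r = 1/(1-q) in (r_lo, r_hi)
assert any(r_lo < 1 / (1 - q) < r_hi for q in real_q)
print(sorted(real_q))
```

For these values the quartic has two real roots in (0, 1), and the larger one lands in the predicted interval for \(r = 1/(1-q)\).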
Proof of Lemma 4.3
First, by setting \(\nabla \Phi (x^*) = 0\), we get
Note that matrix A is invertible, with
Therefore, by the commutativity \(A(B_yA^2+\mu _yI) = (B_yA^2+\mu _yI)A\), we can take inverses and get \((B_yA^2+\mu _yI)^{-1}A^{-1} = A^{-1}(B_yA^2+\mu _yI)^{-1}\). Multiplying both sides by A on the left and on the right, we obtain the commutativity of
Applying this on equation (51) and multiplying both sides by \(\frac{1}{B_xB_y}(B_yA^2 + \mu _yI)\), we can equivalently write the optimality condition as
where
The values of matrices \(A^2\) and \(A^4\) can be found in (13). For the ease of discussion, we may also write equation (52) in an expanded form as:
Because \(q\in (0,1)\) is a root of the quartic equation \(1 -(4+\alpha )q + (6+2\alpha + \beta )q^2 - (4+\alpha )q^3 + q^4 = 0\), and our approximate solution \(\hat{x}^*\) is constructed as \(\hat{x}^*_i = q^i\), by direct calculation one can check that the first \(n-2\) equations are satisfied and the last two equations are violated with controllably small residuals. Indeed, for the \((n-1)\)-th equation the violation is of order \(q^{n+1}\), and for the n-th equation the violation is of order \(|-q^{n} + (4+\alpha )q^{n+1} - q^{n+2}|\). Similar to the arguments for (18), we have
That is, \(\Vert \hat{x}^*-x^*\Vert \le \frac{7+\alpha }{\beta }\cdot q^n\), which completes the proof.
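The claim that the geometric sequence satisfies the generic equations can also be checked numerically. The sketch below assumes the interior rows of the expanded system apply the symmetric stencil \(\left( 1,\,-(4+\alpha ),\,6+2\alpha +\beta ,\,-(4+\alpha ),\,1\right) \) to \(\hat{x}^*\); the boundary rows, which depend on the specific forms of \(A^2\) and \(A^4\) from (13), are not checked here. Parameters \(\alpha =4,\ \beta =1\) are illustrative only.

```python
import numpy as np

alpha, beta = 4.0, 1.0   # illustrative parameters only
n = 60                   # problem dimension

# q in (0,1): a root of the palindromic quartic from the text.
coeffs = [1.0, -(4 + alpha), 6 + 2 * alpha + beta, -(4 + alpha), 1.0]
q = max(z.real for z in np.roots(coeffs) if abs(z.imag) < 1e-9 and 0 < z.real < 1)

x_hat = q ** np.arange(1, n + 1)   # \hat x^*_i = q^i, i = 1, ..., n

# Each interior row applies the symmetric 5-point stencil to x_hat; on the
# geometric sequence it evaluates to q^{i-2} * (quartic at q) = 0 exactly.
stencil = np.array(coeffs)         # palindromic, so orientation is irrelevant
residuals = np.convolve(x_hat, stencil, mode="valid")
assert np.max(np.abs(residuals)) < 1e-10
```

Each entry of `residuals` factors as \(q^{i}\) times the quartic evaluated at q, which vanishes by the choice of q; this is precisely why only the two boundary equations pick up nonzero residuals.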
Proof of Lemma 4.4
By the subspace characterization (32), we have
When we set \(k\le \frac{n}{2}\) and \(n\ge 2\log _q\left( \frac{\beta }{4\sqrt{2}(7+\alpha )}\right) +2\), by (31) we also have
Therefore, similar to (41), we also have
which proves the lemma.
Proof of \(\ln (2ac^2) = \Omega (1)\)
Proof
Note that \(a = \min \{c^{-2},d^{-2}\}\). If \(c^{-2}\le d^{-2}\), then \(ac^2 = 1\). Consequently,
However, when \(c^{-2}\ge d^{-2}\), the situation is more complicated. In this case,
where \(\hat{x}^*\) and \(\hat{y}^*\) are the solutions to the unscaled worst-case instance \(\hat{F}_\epsilon \in \mathcal {F}(L_x,L_y,L_{xy},\mu _x,\mu _y)\). For ease of discussion, let us take the dimension n to be sufficiently large, so that we may view the approximate solution constructed in Lemma 4.3 as the exact solution. Therefore, we have
where q is defined in Theorem 4.5 and the second equality is due to the first-order stationarity condition. Note that equation (51) also provides that
Combining the above two relations, we have
Substituting the specific forms of A and \(A^{-1}\), we have
Therefore, we have
For ease of discussion, we make the following simplifications. First, we omit the \(q^{2n}\) term, since \(q<1\) and n is sufficiently large. Second, Lemma 4.2 indicates that \(1-q = \Theta (\epsilon )\), so the term \(\frac{2B_x}{L_{xy}} (1-q) = {\mathcal {O}}(\epsilon )\) while the term \(\frac{128\epsilon }{L_{xy}R_x^2(1-q)} = \Omega (1)\). Thus we also omit the \(\frac{2B_x}{L_{xy}} (1-q)\) term, which is of significantly smaller magnitude. Therefore, we can write
As a result,
In Lemma 4.2, we also obtained the following lower bound on \(1-q\):
where (i) is because we have omitted the terms of smaller magnitude. Therefore,
Thus we complete the proof. \(\square \)
Zhang, J., Hong, M. & Zhang, S. On lower iteration complexity bounds for the convex concave saddle point problems. Math. Program. 194, 901–935 (2022). https://doi.org/10.1007/s10107-021-01660-z