1 Introduction

Let \(\mathcal{E}\) be a finite dimensional vector space with inner product \(\langle \cdot ,\cdot \rangle \) and norm \(\Vert \cdot \Vert \) (not necessarily induced by \(\langle \cdot , \cdot \rangle \)), and let Z be a non-empty closed convex set in \({\mathcal {E}}\). Our problem of interest is to find a \(u^*\in Z\) that solves the following monotone stochastic variational inequality (SVI) problem:

$$\begin{aligned} \langle \mathbb {E}_{\xi ,\zeta }[{\mathcal {F}}(u;\xi ,\zeta )], u^* - u\rangle \le 0,\quad \forall u\in Z. \end{aligned}$$
(1)

Here, the expectation is taken with respect to the random vectors \(\xi \) and \(\zeta \) whose distributions are supported on \({\varXi }\subseteq \mathbb {R}^{d}\) and \({\varXi }'\subseteq \mathbb {R}^{d'}\), respectively, and \({\mathcal {F}}\) is given by the summation of three components with different structural properties, i.e.,

$$\begin{aligned} {\mathcal {F}}(u;\xi ,\zeta ) = {\mathcal {G}}(u;\xi ) + {\mathcal {H}}(u;\zeta ) + J'(u),\quad \forall u\in Z. \end{aligned}$$
(2)

In particular, we assume that \(J'(u)\in \partial J(u)\) is a subgradient of a relatively simple and convex function J (see (10) below), \({\mathcal {H}}(u; \zeta )\) is an unbiased estimator of a monotone and Lipschitz continuous operator H such that \(\mathbb {E}_{\zeta }[{\mathcal {H}}(u;\zeta )]=H(u)\),

$$\begin{aligned}&\langle H(w) - H(v), w - v\rangle \ge 0,\text { and } \Vert H(w)-H(v)\Vert _*\le M\Vert w-v\Vert , \quad \forall w, v \in Z, \end{aligned}$$
(3)

where \(\Vert \cdot \Vert _*\) denotes the conjugate norm of \(\Vert \cdot \Vert \). Moreover, we assume that \({\mathcal {G}}(u; \xi )\) is an unbiased estimator of the gradient for a convex and continuously differentiable function G such that \(\mathbb {E}_{\xi }[{\mathcal {G}}(u;\xi )] = \nabla G(u)\) and

$$\begin{aligned} 0\le G(w)-G(v)-\langle \nabla G(v), w-v\rangle \le \frac{L}{2} \Vert w-v\Vert ^2, \quad \forall w,\;v\in Z. \end{aligned}$$
(4)

Observe that the \(u^*\) given by (1) is often called a weak solution of the SVI. A related notion is that of a strong SVI solution. More specifically, letting

$$\begin{aligned} F(u):=\mathbb {E}_{\xi ,\zeta }[{\mathcal {F}}(u;\xi ,\zeta )] = \nabla G(u) + H(u) + J'(u), \end{aligned}$$
(5)

we say that \(u^*\) is a strong SVI solution if it satisfies

$$\begin{aligned} \langle F(u^*), u^* - u\rangle \le 0,\quad \forall u\in Z. \end{aligned}$$
(6)

It should be noted that the operator F above might not be continuous. Problems (1) and (6) are also known as the Minty variational inequality and the Stampacchia variational inequality, respectively, due to their origins [16, 27]. For any monotone operator F, it is well known that strong solutions defined in (6) are also weak solutions in (1), and the reverse is also true under mild assumptions (e.g., when F is continuous). For example, for F in (5), if \(J=0\), then the weak and strong solutions in (1) and (6) are equivalent. For the sake of notational convenience, we use SVI(Z; G, H, J), or simply SVI(Z; F), to denote problem (1).

SVIs have recently found many applications, especially in data analysis. To motivate our discussion, let us mention one widespread machine learning model which helps to represent massive data in a compact way [17]. Consider a set of observed data \(S = \{ (x_i, y_i)\}_{i=1}^m\), drawn at random from an unknown distribution \({\mathcal {D}}\) on \(X \times Y\). We would like to find a function \(y = \mathcal{Y}(x, \theta )\), parameterized by \(\theta \in {\varTheta }\), to describe the relation between x and y. To this end, we can solve different problems of the form (e.g., [4, 45, 49, 52])

$$\begin{aligned} \min _{\theta \in {\varTheta }} \mathbb {E}[\mathcal{L}(\mathcal{Y}(x, \theta ), y)] + r(\theta ), \end{aligned}$$
(7)

where \(\mathcal{L}\) denotes a loss function, r is a regularization term that enforces certain structures of the generated solutions, and the expectation is taken w.r.t. the random vector (x, y). While one can directly solve (7) as a stochastic optimization problem, the SVI in (1) provides a unified model to study different subclasses of problems given in the form of (7), which include, but are not limited to, the following cases: (a) \(\mathcal{L}\) is a smooth convex function (see [24]); (b) \(\mathcal{L}\) or r is nonsmooth but admits a saddle point reformulation (see [10]); (c) the feasible set contains linear or nonlinear functional constraints; and (d) \(\theta \) has to satisfy the optimality condition of another optimization problem. Moreover, the SVI in (1) can potentially be used to solve a wider class of stochastic equilibrium and complementarity problems whose operators are given in the form of expectations (see, for instance, the survey [26] and the references therein).
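To make case (b) concrete, consider the following illustration (our own, with hypothetical notation): suppose the expected loss in (7) admits the conjugate representation \(\mathbb {E}[\mathcal{L}(\mathcal{Y}(x, \theta ), y)] = \max _{p\in P}\ \mathbb {E}[\langle A(x,y)\theta , p\rangle ] - \phi (p)\) for some convex compact set P, convex function \(\phi \), and random matrix A(x, y). Then (7) becomes a stochastic saddle point problem whose optimality condition is an SVI of the form (1)–(2) in the variable \(u=(\theta ,p)\in Z={\varTheta }\times P\) with

$$\begin{aligned} G(u) = 0,\quad H(u) = \big (\mathbb {E}[A(x,y)]^{\top } p,\ -\mathbb {E}[A(x,y)]\,\theta \big ),\quad J(u) = r(\theta ) + \phi (p). \end{aligned}$$

Here H is linear and skew-symmetric, hence monotone and Lipschitz continuous as required in (3), and samples of A(x, y) yield an unbiased estimator \({\mathcal {H}}(u;\zeta )\) as in (2).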

In spite of the modeling power of SVIs and the many different algorithms that have been developed for solving deterministic VIs [9, 20, 22, 28, 32, 35, 37, 40, 43, 44], computing the solutions of SVIs still seems to be challenging. A basic difficulty in solving (1) is that the expectation in (1) cannot be computed efficiently to high accuracy, especially when the dimension of the random vector \((\xi , \zeta )\) is large. Hence, the algorithms for solving deterministic VIs are not directly applicable to SVIs in general. This paper focuses on Monte Carlo sampling (or scenario generation) based approaches for solving SVIs. In particular, we assume that there exist stochastic oracles \(\mathcal{SO}_G\) and \(\mathcal{SO}_H\) that provide random samples of \({\mathcal {G}}(u;\xi )\) and \({\mathcal {H}}(u;\zeta )\) for any test point \(u\in Z\). At the i-th call of \(\mathcal{SO}_G\) and \(\mathcal{SO}_H\) with input \(u\in Z\), the oracles output stochastic information \({\mathcal {G}}(u; \xi _i)\) and \({\mathcal {H}}(u; \zeta _i)\), respectively, such that

A1

$$\begin{aligned} \mathbb {E}\left[ \left\| \mathcal{G}{(u; \xi _i)} - \nabla G(u)\right\| _*^2\right] \le \sigma _G^2,\ \mathbb {E}\left[ \left\| \mathcal{H}{(u; \zeta _i)} - H(u)\right\| _*^2\right] \le \sigma _H^2, \end{aligned}$$

for some \(\sigma _G, \sigma _H \ge 0\), where \(\xi _i\) and \(\zeta _i\) are independently distributed random samples. For the sake of notational convenience, throughout this paper we also denote

$$\begin{aligned} \sigma := \sqrt{\sigma _{G}^2 + \sigma _{H}^2}. \end{aligned}$$
(8)

Assumption A1 basically implies that the variances associated with \({\mathcal {G}}(u,\xi _i)\) and \({\mathcal {H}}(u,\zeta _i)\) are bounded. It should be noted that deterministic VIs, denoted by VI(Z; G, H, J), are special cases of SVIs with \(\sigma _G=\sigma _H=0\). The above setting covers, as a special case, the regular SVIs whose operators \(\nabla G(u)\) or H(u) are given in the form of expectations, as shown in (1). Moreover, it provides a framework to study randomized algorithms for solving deterministic VI or saddle point problems [33] (see Sect. 4.2 for an example).
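To fix ideas, the following Python sketch illustrates what the oracles \(\mathcal{SO}_G\) and \(\mathcal{SO}_H\) in Assumption A1 may look like; the choices of G, H and of the noise model below are purely hypothetical (ours, not taken from the text), and the Gaussian noise is scaled so that its expected squared norm is bounded by \(\sigma _G^2\) and \(\sigma _H^2\), respectively.

```python
import numpy as np

def make_oracles(A, b, B, sigma_G=0.1, sigma_H=0.1, seed=0):
    """Toy stochastic oracles satisfying Assumption A1 (illustration only).

    Hypothetical instance: G(u) = 0.5*||A u - b||^2 (convex and smooth) and
    H(u) = B u with B skew-symmetric (monotone and Lipschitz).  Each call
    returns the exact vector plus zero-mean Gaussian noise whose expected
    squared Euclidean norm equals sigma_G**2 (resp. sigma_H**2).
    """
    rng = np.random.default_rng(seed)
    n = A.shape[1]

    def SO_G(u):
        noise = rng.standard_normal(n) * (sigma_G / np.sqrt(n))
        return A.T @ (A @ u - b) + noise   # unbiased estimate of grad G(u)

    def SO_H(u):
        noise = rng.standard_normal(n) * (sigma_H / np.sqrt(n))
        return B @ u + noise               # unbiased estimate of H(u)

    return SO_G, SO_H
```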

One popular approach based on Monte Carlo sampling for solving SVIs is the sample average approximation (SAA) method [8, 41, 42, 48, 50]. In this approach one first approximates the expected operator \(F(u) = \mathbb {E}[{\mathcal {F}}(u;\xi ,\zeta )]\) by the sample average \(\tilde{{\mathcal {F}}}(u) \equiv \sum _{i=1}^N {\mathcal {F}}(u;\xi _i,\zeta _i) /N\) for some \(N > 0\), and then solves a deterministic counterpart of (1) with F replaced by \(\tilde{{\mathcal {F}}}\). However, the resulting deterministic VI problem is still often difficult to solve when the dimension of u or the sample size N is large. Moreover, this approach is not applicable in the online setting, where the decision vector needs to be updated as new samples arrive. Recently, there has been a resurgence of interest in stochastic approximation (SA) type algorithms, which aim at solving SVIs directly based on the noisy estimates of the operators returned by the stochastic oracles (see, e.g., [19, 20, 23, 24, 33, 51]). These more recent studies focused on analyzing the convergence behavior of SA type methods within a finite number of iterations (i.e., their complexity) and on exploring whether these performance bounds are tight. However, to the best of our knowledge, none of the existing algorithms can attain the theoretically optimal rate of convergence for the SVI problem in (1), which possesses rich structural properties (e.g., the gradient field \(\nabla G\) and the Lipschitz continuity of H). More specifically, we can see that the total number of gradient and operator evaluations for solving SVI cannot be smaller than

$$\begin{aligned} {\mathcal {O}}\left( \sqrt{\frac{L}{\epsilon }} + \frac{M}{\epsilon } + \frac{\sigma ^2}{\epsilon ^2}\right) . \end{aligned}$$
(9)

This is a lower complexity bound derived based on the following three observations:

  1. If \(H=0\) and \(\sigma =0\), SVI(Z; G, 0, 0) is equivalent to the smooth optimization problem \(\min _{u\in Z}G(u)\), and the complexity for minimizing G(u) cannot be better than \({\mathcal {O}}(\sqrt{L/\epsilon })\) [34, 38];

  2. If \(G=0\) and \(\sigma =0\), the complexity for solving SVI(Z; 0, H, 0) cannot be better than \({\mathcal {O}}(M/\epsilon )\) [31] (see also the discussions in Section 5 of [32]);

  3. If \(H=0\), SVI(Z; G, 0, 0) is equivalent to a stochastic smooth optimization problem, and the complexity cannot be better than \({\mathcal {O}}(\sigma ^2/\epsilon ^2)\) [24].

However, there exist significant gaps between the above lower complexity bound and the complexity of existing algorithms, especially in terms of their dependence on the Lipschitz constants L and M, to which SA-type methods are known to be sensitive. It is worth noting that the above lower complexity bound has not been attained even for deterministic VIs with \(\sigma = 0\).

The lower complexity bound in (9) and the three observations stated above provide important guidelines for the design of efficient algorithms for solving the SVI problem with the operator given in (5). It might seem natural to consider the more general problem (6) by combining \(\nabla G(u)\) and H(u) in (5) into a single monotone operator, instead of keeping them separate. Such a treatment is reasonable from the point of view of generality, noting that the convexity of G(u) is equivalent to the monotonicity of \(\nabla G(u)\), and that the Lipschitz conditions (3) and (4) are equivalent to a Lipschitz condition on F(u) in (6) with \(\Vert F(w) - F(v)\Vert _*\le (L+M)\Vert w-v\Vert \). However, from the algorithmic point of view, treating \(\nabla G\) separately from H is crucial for the design of accelerated algorithms. By observations 2 and 3 above, if we consider \(F:=\nabla G+H\) as a single monotone operator, the complexity for solving SVI(Z; 0, F, 0) cannot be smaller than

$$\begin{aligned} {\mathcal {O}}\left( \frac{L+M}{\epsilon } + \frac{\sigma ^2}{\epsilon ^2}\right) . \end{aligned}$$

This is worse than (9) in terms of the dependence on L. The identification and specialized treatment of the gradient field \(\nabla G\) allows us to use its Lipschitz condition (4) to achieve the optimal iteration complexity (see observation 1 above) for solving SVIs.

In order to achieve the complexity bound in (9) for SVIs, we incorporate a multi-step acceleration scheme into the stochastic mirror-prox method in [20], and introduce a stochastic accelerated mirror-prox (SAMP) method that fully exploits the structural properties of (1). We show that SAMP exhibits the complexity bound given by (9). To the best of our knowledge, this is the first time in the literature that the lower complexity bound in (9) has been achieved for SVIs. Table 1 shows in more detail how our results improve upon the best-known complexity results for solving deterministic and stochastic VIs. In particular, for deterministic VIs, the Lipschitz constant L can be as large as \({\varOmega }(1/\epsilon )\) without affecting the rate of convergence. Moreover, for SVIs, the Lipschitz constant L can be as large as \({\varOmega }(1/\epsilon ^{3/2})\) without significantly slowing down the rate of convergence. We demonstrate the advantages of these accelerated algorithms over some existing algorithms through numerical experiments in Sect. 4.

Table 1 Comparison of complexity results

In addition to improving existing complexity bounds for solving VI problems, we incorporate into SAMP the termination criterion employed by Monteiro and Svaiter [28, 29] for solving variational and hemivariational inequalities posed as monotone inclusion problems. As a result, SAMP can deal with the case when Z is unbounded, as long as a strong solution to problem (6) exists, and its iteration complexity then depends on the distance from the initial point to the set of strong solutions. It is worth noting that no such complexity results have previously been presented in the literature for solving unbounded SVIs.

The remaining part of this paper is organized as follows. We propose the SAMP algorithm and discuss its main convergence results for solving SVIs in Sect. 2. To facilitate reading, we defer the proofs of the main convergence results to Sect. 3. Some preliminary numerical experiments are provided in Sect. 4 to demonstrate the efficiency of the SAMP algorithm. Finally, we make some concluding remarks in Sect. 5.

2 The stochastic accelerated mirror-prox method

In this section, we develop the stochastic accelerated mirror-prox (SAMP) method for solving SVI(Z; F) and demonstrate that it can achieve the optimal rate of convergence in (9).

Throughout this paper, we assume that the following prox-mapping can be solved efficiently:

$$\begin{aligned} P_z^J(\eta ):=\mathop {\text {argmin}}_{u\in Z}\langle \eta , u-z\rangle + V(z, u) + J(u). \end{aligned}$$
(10)

In (10), the function \(V(\cdot ,\cdot )\) is defined by

$$\begin{aligned} V(z,u):=\omega (u)-\omega (z)-\langle \nabla \omega (z),u-z\rangle ,\quad \forall u,\;z\in Z, \end{aligned}$$
(11)

where \(\omega (\cdot )\) is a strongly convex function with convexity parameter \(\mu >0\), and is called the distance generating function. The function \(V(\cdot ,\cdot )\) is known as a prox-function, or Bregman divergence [6] (see, e.g., [2, 3, 32, 39] for the properties of prox-functions and prox-mappings and their applications in convex optimization). Using the aforementioned definition of the prox-mapping, we describe the SAMP method in Algorithm 1.
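Before turning to Algorithm 1, here is a concrete illustration of (10) (ours, for one particular Euclidean setup): with \(\omega (u)=\Vert u\Vert _2^2/2\), so that \(V(z,u)=\Vert u-z\Vert _2^2/2\) and \(\mu =1\), with \(Z=\mathbb {R}^n\) and \(J(u)=\lambda \Vert u\Vert _1\), the prox-mapping has the closed form of a soft-thresholding step, as sketched below.

```python
import numpy as np

def prox_map(z, eta, lam):
    """Prox-mapping P_z^J(eta) of (10), sketched for one Euclidean case.

    Assumptions (ours): omega(u) = 0.5*||u||_2^2, hence V(z, u) = 0.5*||u - z||^2,
    Z = R^n and J(u) = lam*||u||_1.  Then
        argmin_u  <eta, u - z> + 0.5*||u - z||^2 + lam*||u||_1
    equals soft-thresholding of z - eta at level lam.
    """
    v = z - eta
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)
```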

[Algorithm 1: The stochastic accelerated mirror-prox (SAMP) method, defined by the updates (12)–(15) discussed below]

Observe that in the SAMP algorithm we introduce two sequences, namely \(\{w_{t}^{md}\}\) and \(\{w_{t}^{ag}\}\) (here “md” stands for “middle” and “ag” stands for “aggregated”), that are convex combinations of the iterates \(\{w_t\}\) and \(\{r_t\}\) as long as \(\alpha _t\in [0,1]\). If \(\alpha _t\equiv 1\), \(G=0\) and \(J=0\), then Algorithm 1 for solving SVI(Z; 0, H, 0) is equivalent to the stochastic mirror-prox method in [20]. If, in addition, \(\sigma = 0\), then it reduces to the mirror-prox method in [32]. Moreover, if the distance generating function is \(\omega (\cdot )=\Vert \cdot \Vert ^2/2\), then iterations (13) and (14) become

$$\begin{aligned} w_{t+1}&= \mathop {\text {argmin}}_{u\in Z}\langle \gamma _tH(r_t), u-r_t\rangle + \frac{1}{2}\Vert u-r_t\Vert ^2,\\ r_{t+1}&= \mathop {\text {argmin}}_{u\in Z}\langle \gamma _tH(w_{t+1}), u-r_t\rangle + \frac{1}{2}\Vert u-r_t\Vert ^2, \end{aligned}$$

which are exactly the iterates of the extragradient method in [22]. On the other hand, if \(H=0\), then (13) and (14) produce the same optimizer \(w_{t+1}= r_{t+1}\), and Algorithm 1 is equivalent to the accelerated stochastic approximation method in [24]. If, in addition, \(\sigma = 0\), then it reduces to a version of Nesterov’s accelerated gradient method for solving \(\min _{u\in Z}G(u)+J(u)\) (see, for example, Algorithm 1 in [46]). Therefore, Algorithm 1 can be viewed as a hybrid of the stochastic mirror-prox method and the accelerated stochastic approximation method, hence its name, the stochastic accelerated mirror-prox method. It is interesting to note that, for any t, there are two calls of \(\mathcal{SO}_H\) but just one call of \(\mathcal{SO}_G\). In contrast, if we assume that \(J=0\) and use the stochastic mirror-prox method in [20] to solve SVI(Z; G, H, 0), then for any t there would be two calls of \(\mathcal{SO}_H\) and two calls of \(\mathcal{SO}_G\). Therefore, the cost per iteration of SAMP is less than that of the stochastic mirror-prox method.
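The following Python sketch gives one plausible reading of the updates (12)–(15) in the Euclidean setting, reconstructed from the way they are referenced above and in the proof of Lemma 4, and using the stepsizes (24); it is a sketch under these assumptions, not a verbatim transcription of Algorithm 1.

```python
import numpy as np

def samp(SO_G, SO_H, r1, L, M, sigma, Omega, T, mu=1.0, beta=None,
         prox=lambda z, eta, gamma: z - eta):
    """Sketch of the SAMP updates (12)-(15) in the Euclidean case (assumptions ours).

    `prox(z, eta, gamma)` should return the prox-mapping P_z^{gamma J}(eta) of (10);
    the default corresponds to J = 0 and Z = R^n.  Stepsizes follow (24), with the
    robust choice beta = sigma/Omega_Z when `beta` is not supplied.
    """
    if beta is None:
        beta = sigma / max(Omega, 1e-12)
    w_ag = w = r = np.asarray(r1, dtype=float)
    for t in range(1, T + 1):
        alpha = 2.0 / (t + 1.0)                                        # alpha_t in (24)
        gamma = mu * t / (4*L + 3*M*t + beta*(t + 1)*np.sqrt(mu*t))    # gamma_t in (24)
        w_md = (1 - alpha) * w_ag + alpha * r                          # (12): middle iterate
        g_md = SO_G(w_md)                                              # single call of SO_G
        w = prox(r, gamma * (SO_H(r) + g_md), gamma)                   # (13): extragradient step
        r = prox(r, gamma * (SO_H(w) + g_md), gamma)                   # (14): new prox-center
        w_ag = (1 - alpha) * w_ag + alpha * w                          # (15): aggregate iterate
    return w_ag
```

With \(H=0\) the two prox steps coincide, and with \(\alpha _t\equiv 1\), \(G=0\) and \(J=0\) the loop reduces to the mirror-prox update, in line with the discussion above.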

In order to analyze the convergence of Algorithm 1, we introduce a notion to characterize the weak solutions of SVI(Z; G, H, J). For all \(\tilde{u}, u\in Z\), we define

$$\begin{aligned} Q(\tilde{u}, u):=G(\tilde{u}) - G(u) + \langle H(u), \tilde{u} - u\rangle + J(\tilde{u}) - J(u). \end{aligned}$$
(16)

Clearly, for F defined in (5), we have \(\langle F(u),\tilde{u}-u\rangle \le Q(\tilde{u}, u)\). Therefore, if \(Q(\tilde{u}, u)\le 0\) for all \(u\in Z\), then \(\tilde{u}\) is a weak solution of SVI(Z; G, H, J). Hence, when Z is bounded, it is natural to use the gap function

$$\begin{aligned} g(\tilde{u}):= \sup _{u\in Z}Q(\tilde{u}, u) \end{aligned}$$
(17)

to evaluate the accuracy of a feasible solution \(\tilde{u}\in Z\). However, if Z is unbounded, then \(g(\tilde{u})\) may not be well-defined, even when \(\tilde{u}\in Z\) is a nearly optimal solution. Therefore, we need to employ a slightly modified gap function in order to measure the accuracy of candidate solutions when Z is unbounded. In the sequel, we will consider the cases of bounded and unbounded Z separately. For both cases we establish the rate of convergence of the gap functions in terms of their expectation, i.e., the “average” rate of convergence over many runs of the algorithm. Furthermore, we demonstrate that if Z is bounded, then we can also establish the rate of convergence of \(g(\cdot )\) in probability, under the following “light-tail” assumption:

A2 For any i-th call of the oracles \(\mathcal{SO}_G\) and \(\mathcal{SO}_H\) with input \(u\in Z\),

$$\begin{aligned} \mathbb {E}[\exp \{\Vert \nabla G(u) - {\mathcal {G}}(u; \xi _i)\Vert _*^2/\sigma _G^2 \}]\le \exp \{1\}, \end{aligned}$$

and

$$\begin{aligned} \mathbb {E}[\exp \{\Vert H(u) - {\mathcal {H}}(u; \zeta _i)\Vert _*^2/\sigma _H^2 \}]\le \exp \{1\}. \end{aligned}$$

Assumption A2 is sometimes called the sub-Gaussian assumption. Many random variables, such as Gaussian, uniform, and any random variable with bounded support, satisfy this assumption. It should be noted that Assumption A2 implies Assumption A1 by Jensen’s inequality.
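For example (a simple check of ours, not from the text), if the oracle noise is almost surely bounded, say \(\Vert \nabla G(u) - {\mathcal {G}}(u; \xi _i)\Vert _*\le \sigma _G\) with probability one, then the random variable inside the exponential is at most 1, so

$$\begin{aligned} \mathbb {E}[\exp \{\Vert \nabla G(u) - {\mathcal {G}}(u; \xi _i)\Vert _*^2/\sigma _G^2 \}]\le \exp \{1\}, \end{aligned}$$

and the first inequality in A2 holds; the same argument applies to \(\mathcal{SO}_H\).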

We start with establishing some convergence properties of Algorithm 1 when Z is bounded. It should be noted that the following quantity will be used throughout the convergence analysis of this paper:

$$\begin{aligned} {\varGamma }_t = \left\{ \begin{aligned}&1,&\text { when } t=1\\&(1-\alpha _t){\varGamma }_{t-1},&\text { when } t>1.\\ \end{aligned} \right. \end{aligned}$$
(18)

Theorem 1

Suppose that

$$\begin{aligned} \displaystyle \sup _{z_1, z_2\in Z}V(z_1, z_2) \le {\varOmega }_Z^2. \end{aligned}$$
(19)

Also assume that the parameters \(\{\alpha _t\}\) and \(\{\gamma _t\}\) in Algorithm 1 satisfy \(\alpha _1=1\),

$$\begin{aligned} q{\mu }-{L\alpha _t\gamma _t}- \frac{3M^2\gamma _t^2}{\mu }\ge 0 \quad \text { for some }q\in (0,1) , \text { and } \frac{\alpha _t}{{\varGamma }_t\gamma _t} \le \frac{\alpha _{t+1}}{{\varGamma }_{t+1}\gamma _{t+1}}, \quad \forall t\ge 1, \end{aligned}$$
(20)

where \({\varGamma }_t\) is defined in (18). Then,

  (a)

    Under Assumption A1, for all \(t\ge 1\),

    $$\begin{aligned}&\mathbb {E}\left[ g(w_{t+1}^{ag})\right] \le {\mathcal {Q}}_0(t):=\frac{2\alpha _t}{\gamma _t}{\varOmega }_Z^2 + \left[ 4\sigma _H^2 + \left( 1 + \frac{1}{2(1-q)} \right) \sigma _G^2\right] {\varGamma }_t \sum _{i=1}^{t}\frac{\alpha _i\gamma _i}{\mu {\varGamma }_i}. \end{aligned}$$
    (21)
  (b)

    Under Assumption A2, for all \(\lambda >0\) and \(t\ge 1\),

    $$\begin{aligned}&Prob\{g(w_{t+1}^{ag})>{\mathcal {Q}}_0(t) + \lambda {\mathcal {Q}}_1(t) \}\le 2\exp \{-\lambda ^2/3 \} + 3\exp \{-\lambda \}, \end{aligned}$$
    (22)

    where

    $$\begin{aligned} \begin{aligned} {\mathcal {Q}}_1(t):=&\ {\varGamma }_t(\sigma _G+\sigma _H) {\varOmega }_Z\sqrt{\frac{2}{\mu }\sum _{i=1}^t\left( \frac{\alpha _i}{{\varGamma }_i} \right) ^2 }\\&\ + \left[ 4\sigma _H^2 + \left( 1 + \frac{1}{2(1-q)} \right) \sigma _G^2\right] {\varGamma }_t\sum _{i=1}^{t}\frac{\alpha _i\gamma _i}{\mu {\varGamma }_i}. \end{aligned} \end{aligned}$$
    (23)

There are various options for choosing the parameters \(\{\alpha _t\}\) and \(\{\gamma _t\}\) that satisfy (20). In the following corollary, we give one example of such parameter settings.

Corollary 1

Suppose that (19) holds. If the stepsizes \(\{\alpha _t\}\) and \(\{\gamma _t\}\) in Algorithm 1 are set to:

$$\begin{aligned} \alpha _t = \frac{2}{t+1}\text { and } \gamma _t = \frac{\mu t}{4L+ 3Mt + \beta (t+1)\sqrt{\mu t}}, \end{aligned}$$
(24)

where \(\beta >0\) is a parameter, then under Assumption A1,

$$\begin{aligned} \mathbb {E}\left[ g(w_{t+1}^{ag})\right]&\le \frac{16L {\varOmega }_Z^2}{\mu t(t+1)} + \frac{12M {\varOmega }_Z^2}{\mu (t+1)}\nonumber \\&+ \frac{\sigma {\varOmega }_Z}{\sqrt{\mu (t-1)}}\left( \frac{4\beta {\varOmega }_Z}{\sigma } + \frac{16\sigma }{3\beta {\varOmega }_Z}\right) =:{\mathcal {C}}_0(t), \end{aligned}$$
(25)

where \(\sigma \) and \({\varOmega }_Z\) are defined in (8) and (19), respectively. Furthermore, under Assumption A2,

$$\begin{aligned} Prob\{g(w_{t+1}^{ag})>{\mathcal {C}}_0(t) + \lambda {\mathcal {C}}_1(t) \}\le 2\exp \{-\lambda ^2/3 \} + 3\exp \{-\lambda \}, \quad \forall \lambda >0, \end{aligned}$$

where

$$\begin{aligned} {\mathcal {C}}_1(t) := \frac{\sigma {\varOmega }_Z}{\sqrt{\mu (t-1)}}\left( \frac{4\sqrt{3}}{3} + \frac{16\sigma }{3\beta {\varOmega }_Z} \right) . \end{aligned}$$
(26)

Proof

It is easy to check that

$$\begin{aligned} {\varGamma }_t=\frac{2}{t(t+1)} \text { and } \frac{\alpha _t}{{\varGamma }_t\gamma _t} \le \frac{\alpha _{t+1}}{{\varGamma }_{t+1}\gamma _{t+1}}. \end{aligned}$$

In addition, in view of (24), we have \(\gamma _t\le \mu t/(4L)\) and \(\gamma _t^2 \le \mu ^2/(9M^2)\), which imply

$$\begin{aligned} \frac{5\mu }{6} - L\alpha _t\gamma _t - \frac{3M^2\gamma _t^2}{\mu } \ge \frac{5\mu }{6} - \frac{\mu t}{4}\cdot \frac{2}{t+1} - \frac{\mu }{3} \ge 0. \end{aligned}$$

Therefore the first relation in (20) holds with constant \(q=5/6\). In view of Theorem 1, it now suffices to show that \({\mathcal {Q}}_0(t)\le {\mathcal {C}}_0(t)\) and \({\mathcal {Q}}_1(t)\le {\mathcal {C}}_1(t)\). Observing that \(\alpha _t/{\varGamma }_t=t\), and \(\gamma _t\le \sqrt{\mu }/(\beta \sqrt{t})\), we obtain

$$\begin{aligned} \ \sum _{i=1}^{t}\frac{\alpha _i\gamma _i}{{\varGamma }_i} \le \frac{\sqrt{\mu }}{\beta }\sum _{i=1}^t \sqrt{i} \le \frac{\sqrt{\mu }}{\beta }\int _{0}^{t+1}\sqrt{s}\,ds = {\frac{\sqrt{\mu }}{\beta }}\cdot \frac{2(t+1)^{3/2}}{3} = \frac{2\sqrt{\mu }(t+1)^{3/2}}{{3}\beta }. \end{aligned}$$

Using the above relation, (19), (21), (23), (24), and the fact that \(\sqrt{t+1}/t\le 1/\sqrt{t-1}\) and \(\sum _{i=1}^t i^2\le t(t+1)^2/3\), we have

$$\begin{aligned} {\mathcal {Q}}_0(t)&= \frac{4{\varOmega }_Z^2}{\mu t(t+1)}\left( 4L+ 3Mt + \beta (t+1)\sqrt{\mu t}\right) + \frac{8\sigma ^2}{ \mu t(t+1)}\sum _{i=1}^{t}\frac{\alpha _i\gamma _i}{{\varGamma }_i}\\&\le \ \frac{16L {\varOmega }_Z^2}{\mu t(t+1)} + \frac{12M {\varOmega }_Z^2}{\mu (t+1)} + \frac{4\beta {\varOmega }_Z^2}{\sqrt{\mu t}} + \frac{16\sigma ^2 \sqrt{t+1}}{3\sqrt{\mu }\beta t}\\&\le \ {\mathcal {C}}_0(t), \end{aligned}$$

and

$$\begin{aligned} {\mathcal {Q}}_1(t)&= \ \frac{2(\sigma _G + \sigma _H)}{t(t+1)} {\varOmega }_Z\sqrt{\frac{2}{\mu }\sum _{i=1}^ti^2} + \frac{8\sigma ^2}{ \mu t(t+1)}\sum _{i=1}^{t}\frac{\alpha _i\gamma _i}{{\varGamma }_i}\\&\le \ \frac{2\sqrt{2}(\sigma _G + \sigma _H){\varOmega }_Z}{\sqrt{3\mu t}} + \frac{16\sigma ^2 \sqrt{t+1}}{3\sqrt{\mu }\beta t}\\&\le \ {\mathcal {C}}_1(t). \end{aligned}$$

We now add a few remarks about the results obtained in Corollary 1. Firstly, in view of (9), (25) and (26), we can clearly see that the SAMP method is robust with respect to the estimates of \(\sigma \) and \({\varOmega }_Z\). Indeed, the SAMP method achieves the optimal iteration complexity for solving the SVI problem as long as \(\beta ={\mathcal {O}}(\sigma /{\varOmega }_Z)\). In particular, in this case, the number of iterations performed by the SAMP method to find an \(\epsilon \)-solution of (1), i.e., a point \(\bar{w} \in Z\) such that \(\mathbb {E}[g(\bar{w})] \le \epsilon \), can be bounded by

$$\begin{aligned} \mathcal{O} \left( \sqrt{\frac{L}{\epsilon }} + \frac{M}{\epsilon } + \frac{\sigma ^2}{\epsilon ^2} \right) , \end{aligned}$$
(27)

which implies that this algorithm allows L to be as large as \({\mathcal {O}}(\epsilon ^{-3/2})\) and M to be as large as \({\mathcal {O}}(\epsilon ^{-1})\) without significantly affecting its convergence properties. Secondly, for the deterministic case when \(\sigma = 0\), the complexity bound in (27), which matches the lower bound in (9), significantly improves the best-known complexity for solving problem (1) in terms of its dependence on the Lipschitz constant L.
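As a usage illustration of the robust choice \(\beta ={\mathcal {O}}(\sigma /{\varOmega }_Z)\) (our own construction, combining the oracle and SAMP sketches given above): for a hypothetical instance with \(G(u)=\Vert Au-b\Vert _2^2/2\), a skew-symmetric bilinear H, \(J=0\), and Z a Euclidean ball of radius R (so that \({\varOmega }_Z^2\le 2R^2\)), one may run

```python
import numpy as np

# Hypothetical toy instance (ours); relies on make_oracles and samp defined in
# the earlier sketches.  Z is the Euclidean ball of radius R, so the prox-mapping
# with J = 0 is a projected step.
rng = np.random.default_rng(1)
n, R = 50, 5.0
A = rng.standard_normal((n, n)) / np.sqrt(n)
b = rng.standard_normal(n)
C = rng.standard_normal((n, n))
B = C - C.T                                      # skew-symmetric => H monotone

def ball_prox(z, eta, gamma, R=R):
    """P_z^{gamma J}(eta) with J = 0 and Z = {u : ||u|| <= R}: project z - eta onto Z."""
    v = z - eta
    nrm = np.linalg.norm(v)
    return v if nrm <= R else (R / nrm) * v

sigma_G = sigma_H = 0.1
SO_G, SO_H = make_oracles(A, b, B, sigma_G, sigma_H)
sigma = np.sqrt(sigma_G**2 + sigma_H**2)         # sigma as in (8)
Omega = np.sqrt(2.0) * R                         # sup_{z1,z2 in Z} V(z1,z2) <= 2 R^2

u_hat = samp(SO_G, SO_H, r1=np.zeros(n),
             L=np.linalg.norm(A.T @ A, 2), M=np.linalg.norm(B, 2),
             sigma=sigma, Omega=Omega, T=2000, prox=ball_prox)
```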

In the following theorem, we demonstrate some convergence properties of Algorithm 1 for solving the stochastic problem SVI(Z; G, H, J) when Z is unbounded. It seems that this case has not been well studied previously in the literature. To study the convergence properties of SAMP in this case, we use a perturbation-based termination criterion recently employed by Monteiro and Svaiter [28, 29], which is based on the enlargement of a maximal monotone operator first introduced in [7]. More specifically, we say that the pair \((\tilde{v}, \tilde{u})\in {\mathcal {E}}\times Z\) is a \((\rho , \varepsilon )\)-approximate solution of SVI(Z; G, H, J) if \(\Vert \tilde{v}\Vert \le \rho \) and \(\tilde{g}(\tilde{u}, \tilde{v})\le \varepsilon \), where the gap function \(\tilde{g}(\cdot , \cdot )\) is defined by

$$\begin{aligned} \tilde{g}(\tilde{u}, \tilde{v}):=\sup _{u\in Z}\left\{ Q(\tilde{u}, u) - \langle \tilde{v}, \tilde{u} - u\rangle \right\} . \end{aligned}$$
(28)

We call \(\tilde{v}\) the perturbation vector associated with \(\tilde{u}\). One advantage of employing this termination criterion is that the convergence analysis does not depend on the boundedness of Z.
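In particular (a simple observation which we note for clarity), taking \(\tilde{v}=0\) recovers the gap function (17):

$$\begin{aligned} \tilde{g}(\tilde{u}, 0) = \sup _{u\in Z}Q(\tilde{u}, u) = g(\tilde{u}), \end{aligned}$$

so the perturbation vector \(\tilde{v}\) measures by how much the unperturbed criterion has to be relaxed when Z is unbounded.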

Theorem 2 below describes the convergence properties of SAMP for solving SVIs with unbounded feasible sets, under the assumption that a strong solution of (6) exists. It should be noted that this assumption does not significantly limit the applicability of the SAMP method. For example, when \(J\equiv 0\) in (1), conditions for the existence of strong solutions are described in Section 2.2 of the seminal book [12]; indeed, any weak solution to SVI(Z; F) is also a strong solution in that case. For the general case in which \(J\not \equiv 0\) and F in (5) is a point-to-set map, there have also been several studies on the theory of strong solutions to (6) (see, e.g., [13, 21]). For example, when J is a finite-valued closed convex function, some conditions for the existence of strong solutions are proven in Theorem 2.3 of [21].

Theorem 2

Suppose that \(V(r, z):= \Vert z - r\Vert ^2/2\) for any \(r\in Z\) and \(z\in Z\). If the parameters \(\{\alpha _t\}\) and \(\{\gamma _t\}\) in Algorithm 1 are chosen such that \(\alpha _1=1\), and for all \(t> 1\),

$$\begin{aligned} 0\le \alpha _t<1, \quad {L\alpha _t\gamma _t}+{3M^2\gamma _t^2}\le c^2 < q \ \text { for some }c,q\in (0,1), \ \text { and } \ \frac{\alpha _t}{{\varGamma }_t\gamma _t} = \frac{\alpha _{t+1}}{{\varGamma }_{t+1}\gamma _{t+1}}, \end{aligned}$$
(29)

where \({\varGamma }_t\) is defined in (18). Then for all \(t\ge 1\) there exists a perturbation vector \(v_{t+1}\) and a residual \(\varepsilon _{t+1}\ge 0\) such that \(\tilde{g}(w_{t+1}^{ag}, v_{t+1}) \le \varepsilon _{t+1}\). Moreover, for all \(t\ge 1\), we have

$$\begin{aligned} \mathbb {E}[\Vert v_{t+1}\Vert ]&\le \ \frac{\alpha _t}{\gamma _t}\left( 2D + 2\sqrt{D^2 + C_t^2}\right) , \end{aligned}$$
(30)
$$\begin{aligned} \mathbb {E}[\varepsilon _{t+1}]&\le \ \frac{\alpha _t}{\gamma _t}\left[ (3+6\theta )D^2 + (1+6\theta )C_t^2\right] + \frac{18\alpha _t^2\sigma _H^2}{\gamma _t^2}\sum _{i=1}^t\gamma _i^3, \end{aligned}$$
(31)

where

$$\begin{aligned} D: = \Vert r_1 - u^*\Vert , \end{aligned}$$
(32)

\(u^*\) is a strong solution of SVI(ZGHJ),

$$\begin{aligned} \theta = \max \left\{ 1, \frac{c^2}{q-c^2} \right\} \text { and } C_t = \sqrt{\left[ 4\sigma _H^2 + \left( 1 + \frac{1}{2(1-q)} \right) \sigma _G^2\right] \sum _{i=1}^{t}\gamma _i^2}. \end{aligned}$$
(33)

Below we give an example of parameters \(\alpha _t\) and \(\gamma _t\) that satisfies (29).

Corollary 2

Suppose that there exists a strong solution of (6). If the maximum number of iterations N is given, and the stepsizes \(\{\alpha _t\}\) and \(\{\gamma _t\}\) in Algorithm 1 are set to

$$\begin{aligned} \alpha _t = \frac{2}{t+1}\,\text { and } \gamma _t = \frac{t}{5L + 3MN + \beta N\sqrt{N-1}}, \end{aligned}$$
(34)

where \(\beta >0\) is a parameter and \(\sigma \) is defined in (8), then there exist \(v_N\in {\mathcal {E}}\) and \(\varepsilon _N\ge 0\) such that \(\tilde{g}(w_{N}^{ag},v_{N})\le \varepsilon _N\),

$$\begin{aligned} \mathbb {E}[\Vert v_N\Vert ]\le&\ \frac{40LD}{N(N-1)} + \frac{24MD}{N-1} + \frac{\sigma }{\sqrt{N-1}}\left( \frac{8\beta D}{\sigma } + 5\right) , \end{aligned}$$
(35)

and

$$\begin{aligned} \mathbb {E}[\varepsilon _N]\le&\ \frac{90LD^2}{N(N-1)} + \frac{54MD^2}{N-1} + \frac{\sigma D}{\sqrt{N-1}}\left( \frac{18\beta D}{\sigma } + \frac{56\sigma }{3\beta D}+\frac{18\sigma }{\beta D N}\right) . \end{aligned}$$
(36)

Proof

Clearly, we have \({\varGamma }_t ={2}/[{t(t+1)}]\), and since (34) gives \(\gamma _t = t\gamma _1\), the last relation in (29) is satisfied. Moreover, in view of (34), we have

$$\begin{aligned} L\alpha _t\gamma _t + 3M^2\gamma _t^2 \le&\ \frac{2L}{5L+3MN} + \frac{3M^2N^2}{(5L+3MN)^2}\\ =&\ \frac{10L^2 + 6LMN + 3M^2N^2}{(5L+3MN)^2}< \frac{5}{12} < \frac{5}{6}, \end{aligned}$$

which implies that (29) is satisfied with \(c^2=5/12\) and \(q=5/6\). Observing from (34) that \(\gamma _t = t\gamma _1\), setting \(t=N-1\) in (33) and (34), we obtain

$$\begin{aligned} \frac{\alpha _{N-1}}{\gamma _{N-1}} = \frac{2}{\gamma _1N(N-1)}\text { and } C_{N-1}^2 = 4\sigma ^2\sum _{i=1}^{N-1} \gamma _1^2 i^2 \le \frac{4\sigma ^2\gamma _1^2N^2(N-1)}{3}, \end{aligned}$$
(37)

where \(C_{N-1}\) is defined in (33). Applying (37) to (30) we have

$$\begin{aligned} \mathbb {E}[\Vert v_N\Vert ] \le&\ \frac{2}{\gamma _1N(N-1)}(4D + 2C_{N-1}) \le \frac{8D}{\gamma _1N(N-1)} + \frac{8\sigma }{\sqrt{3(N-1)}}\\ \le&\ \frac{40LD}{N(N-1)} + \frac{24MD}{N-1} + \frac{\sigma }{\sqrt{N-1}}\left( \frac{8\beta D}{\sigma } + 5\right) . \end{aligned}$$

In addition, using (31), (37), and the facts that \(\theta =1\) in (33) and

$$\begin{aligned} \sum _{i=1}^{N-1}\gamma _i^3 = {\gamma _1^3} N^2(N-1)^2/4, \end{aligned}$$

we have

$$\begin{aligned} \mathbb {E}[\varepsilon _{N}]&\le \frac{2}{\gamma _1N(N-1)}(9D^2 + 7C_{N-1}^2) + \frac{72\sigma _H^2}{\gamma _1^2N^2(N-1)^2}\cdot \frac{\gamma _1^3N^2(N-1)^2}{4}\\&\le \frac{18D^2}{\gamma _1 N(N-1)} + \frac{56\sigma ^2\gamma _1 N}{3} + 18\sigma _H^2\gamma _1\\&\le \ \frac{90LD^2}{N(N-1)} + \frac{54MD^2}{N-1} + \frac{18\beta D^2}{\sqrt{N-1}} + \frac{56\sigma ^2}{3\beta \sqrt{N-1}} + \frac{18\sigma _H^2}{\beta N\sqrt{N-1}}\\&\le \frac{90LD^2}{N(N-1)} + \frac{54MD^2}{N-1} + \frac{\sigma D}{\sqrt{N-1}}\left( \frac{18\beta D}{\sigma } + \frac{56\sigma }{3\beta D}+\frac{18\sigma }{\beta D N}\right) . \end{aligned}$$

Several remarks are in order regarding the results obtained in Theorem 2 and Corollary 2. Firstly, similarly to the bounded case (see the remark after Corollary 1), one may want to choose \(\beta \) so that the right-hand side of (35) or (36) is minimized, e.g., \(\beta ={\mathcal {O}}(\sigma /D)\). However, since the value of D is very difficult to estimate in the unbounded case, one often has to resort to a suboptimal choice of \(\beta \). For example, if \(\beta = \sigma \), then the right-hand sides of (35) and (36) become \(\mathcal{O}(L D/N^2 + MD/N + \sigma D / \sqrt{N})\) and \(\mathcal{O}(LD^2/N^2 + MD^2/N + \sigma D^2 / \sqrt{N})\), respectively. Secondly, both residuals \(\Vert v_N\Vert \) and \(\varepsilon _N\) in (35) and (36) converge to 0 at the same rate (up to a constant factor). Finally, it is only for simplicity that we assume \(V(r,z)=\Vert z-r\Vert ^2/2\); similar results can be obtained under the assumption that \(\nabla \omega \) is Lipschitz continuous.
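For completeness, the following small sketch (ours) computes the fixed-horizon schedule (34) used in Corollary 2; it only requires the iteration budget N, the constants L and M, and a guess for \(\beta \) (e.g., \(\beta =\sigma \) as in the remark above).

```python
import numpy as np

def samp_stepsizes_unbounded(L, M, beta, N):
    """Stepsizes alpha_t, gamma_t of (34) for the unbounded case, t = 1, ..., N.

    Assumes V(r, z) = 0.5*||z - r||^2 as in Theorem 2; beta > 0 is the parameter
    of Corollary 2 (a suboptimal but safe choice is beta = sigma).
    """
    denom = 5.0*L + 3.0*M*N + beta*N*np.sqrt(N - 1.0)
    alphas = [2.0 / (t + 1.0) for t in range(1, N + 1)]
    gammas = [t / denom for t in range(1, N + 1)]
    return alphas, gammas
```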

3 Convergence analysis

In this section, we focus on proving the main convergence results in Sect. 2, namely, Theorems 1 and 2.

To prove the convergence of the SAMP algorithm, we first present some technical results. Lemmas 1 and 2 describe some important properties of the prox-mapping \(P_r^J(\eta )\) used in (13) and (14) of Algorithm 1. Lemma 3 provides a recursion related to the function \(Q(\cdot ,\cdot )\) defined in (16). With the help of Lemmas 1, 2 and 3, we estimate a bound on \(Q(\cdot ,\cdot )\) in Lemma 4.

Lemma 1

For all \(r\in Z\) and \(\zeta \in {\mathcal {E}}\), if \(w=P_r^J(\zeta )\), then for all \(u\in Z\), we have

$$\begin{aligned} \langle \zeta , w-u\rangle + J(w) - J(u)\le V(r, u)-V(r, w)-V(w,u). \end{aligned}$$

Proof

See Lemma 2 in [14] for the proof.

The following lemma is a slight extension of Lemma 6.3 in [20]. In particular, when \(J(\cdot )\equiv 0\), we can obtain (41) and (42) directly by applying (40) to (6.8) in [20], and the results for \(J(\cdot )\not \equiv 0\) can easily be obtained from the proof of Lemma 6.3 in [20]. We provide the proof here only for the sake of completeness.

Lemma 2

Given \(r, w, y \in Z\) and \(\eta , \vartheta \in {\mathcal {E}}\) that satisfy

$$\begin{aligned} w&= P_r^J(\eta ), \end{aligned}$$
(38)
$$\begin{aligned} y&= P_r^J(\vartheta ), \end{aligned}$$
(39)

and

$$\begin{aligned} \Vert \vartheta - \eta \Vert ^2_* \le L^2\Vert w - r\Vert ^2 + M^2. \end{aligned}$$
(40)

Then, for all \(u\in Z\),

$$\begin{aligned} \begin{aligned}&\ \langle \vartheta , w-u\rangle + J(w)-J(u) \le V(r, u)-V(y, u)-\left( \frac{\mu }{2}- \frac{L^2}{2\mu }\right) \Vert r-w\Vert ^2 + \frac{M^2}{2\mu }, \end{aligned} \end{aligned}$$
(41)

and

$$\begin{aligned} V(y, w) \le \frac{L^2}{\mu ^2}V(r, w) + \frac{M^2}{2\mu }. \end{aligned}$$
(42)

Proof

Applying Lemma 1 to (38) and (39), for all \(u\in Z\) we have

$$\begin{aligned} \langle \eta , w-u\rangle + J(w)-J(u)&\le V(r, u)-V(r, w)-V(w, u), \end{aligned}$$
(43)
$$\begin{aligned} \langle \vartheta , y-u\rangle + J(y)-J(u)&\le V(r, u)-V(r, y) - V(y, u). \end{aligned}$$
(44)

In particular, letting \(u=y\) in (43) we have

$$\begin{aligned} \langle \eta , w-y\rangle + J(w)-J(y)\le V(r, y)-V(r, w)-V(w, y). \end{aligned}$$
(45)

Adding inequalities (44) and (45), we obtain

$$\begin{aligned}&\langle \vartheta , y-u\rangle + \langle \eta , w-y\rangle + J(w)-J(u)\\&\quad \le V(r, u)-V(y, u)-V(r,w)-V(w, y), \end{aligned}$$

which is equivalent to

$$\begin{aligned}&\langle \vartheta , w-u\rangle + J(w)-J(u) \le \langle \vartheta - \eta , w - y\rangle + V(r, u)\\&\quad -V(y, u)-V(r,w)-V(w, y). \end{aligned}$$

Applying the Cauchy–Schwarz inequality and Young’s inequality to the above inequality, and using the fact that

$$\begin{aligned} \frac{\mu }{2}\Vert z- u\Vert ^2\le V(u, z), \forall u, z\in Z, \end{aligned}$$
(46)

due to the strong convexity of \(\omega (\cdot )\) in (11), we obtain

$$\begin{aligned} \begin{aligned}&\ \langle \vartheta , w-u\rangle + J(w)-J(u)\\&\quad \le \ \Vert \vartheta - \eta \Vert _* \Vert w - y\Vert + V(r, u)-V(y, u) -V(r, w)-\frac{\mu }{2}\Vert w-y\Vert ^2\\&\quad \le \ \frac{1}{2\mu }\Vert \vartheta - \eta \Vert _*^2 + \frac{\mu }{2}\Vert w - y\Vert ^2 + V(r, u)-V(y, u) - V(r, w)-\frac{\mu }{2}\Vert w-y\Vert ^2\\&\quad = \ \frac{1}{2\mu }\Vert \vartheta - \eta \Vert _*^2 + V(r, u)-V(y, u) - V(r, w). \end{aligned} \end{aligned}$$
(47)

The result in (41) then follows immediately from the above relation, (40) and (46).

Moreover, observe that by setting \(u=w\) and \(u=y\) in (44) and (47), respectively, we have

$$\begin{aligned} \langle \vartheta , y- w\rangle + J(y) - J(w)&\le V(r, w) - V(r, y) - V(y, w),\\ \langle \vartheta , w-y\rangle + J(w)-J(y)&\le \frac{1}{2\mu }\Vert \vartheta - \eta \Vert _*^2 + V(r, y) - V(r, w). \end{aligned}$$

Adding the above two inequalities, and using (40) and (46), we have

$$\begin{aligned} 0\le & {} \frac{1}{2\mu }\Vert \vartheta - \eta \Vert _*^2-V(y,w)\le \frac{L^2}{2\mu }\Vert r - w\Vert ^2 + \frac{M^2}{2\mu } - V(y, w)\\\le & {} \frac{L^2}{\mu ^2}V(r,w) + \frac{M^2}{2\mu } - V(y, w), \end{aligned}$$

and thus (42) holds.

Lemma 3

For any sequences \(\{r_t\}_{t\ge 1}\) and \(\{w_t\}_{t\ge 1}\subset Z\), if the sequences \(\{w_{t}^{ag}\}\) and \(\{w_{t}^{md}\}\) are generated by (12) and (15), then for all \(u\in Z\),

$$\begin{aligned} \begin{aligned}&Q(w_{t+1}^{ag}, u) - (1-\alpha _t)Q(w_{t}^{ag}, u) \le \alpha _t\langle \nabla G(w_{t}^{md}) + H(w_{t+1}),w_{t+1}-u\rangle \\&\quad +\frac{L\alpha _t^2}{2}\Vert w_{t+1}-r_t\Vert ^2 {+ \alpha _tJ(w_{t+1})}-\alpha _tJ(u). \end{aligned} \end{aligned}$$
(48)

Proof

Observe from (12) and (15) that \(w_{t+1}^{ag}-w_{t}^{md}=\alpha _t(w_{t+1}-r_t)\). This observation, together with (4) and the convexity of \(G(\cdot )\), implies that for all \(u\in Z\),

$$\begin{aligned} G(w_{t+1}^{ag})&\le \ G(w_{t}^{md})+\langle \nabla G(w_{t}^{md}), w_{t+1}^{ag}-w_{t}^{md}\rangle + \frac{L}{2}\Vert w_{t+1}^{ag}-w_{t}^{md}\Vert ^2\\&=\ (1-\alpha _t)\left[ G(w_{t}^{md})+\langle \nabla G(w_{t}^{md}),w_{t}^{ag}-w_{t}^{md}\rangle \right] \\&\quad +\alpha _t\left[ G(w_{t}^{md})+\langle \nabla G(w_{t}^{md}),u-w_{t}^{md}\rangle \right] \\&\quad +\alpha _t\langle \nabla G(w_{t}^{md}),w_{t+1}-u\rangle +\frac{L\alpha _t^2}{2}\Vert w_{t+1}-r_t\Vert ^2\\&\le (1-\alpha _t)G(w_{t}^{ag})+\alpha _tG(u)+\alpha _t\langle \nabla G(w_{t}^{md}),w_{t+1}-u\rangle \\&\quad +\frac{L\alpha _t^2}{2}\Vert w_{t+1}-r_t\Vert ^2. \end{aligned}$$

Using the above inequality, (15), (16), the convexity of \(J(\cdot )\), and the monotonicity of \(H(\cdot )\), we have

$$\begin{aligned} \begin{aligned}&\ Q(w_{t+1}^{ag}, u) - (1-\alpha _t) Q(w_{t}^{ag}, u)\\&=\ G(w_{t+1}^{ag})-(1-\alpha _t)G(w_{t}^{ag})-\alpha _tG(u)\\&\qquad \ + \langle H(u), w_{t+1}^{ag}-u\rangle -(1-\alpha _t)\langle H(u), w_{t}^{ag}-u\rangle \\&\qquad + J(w_{t+1}^{ag}) - (1-\alpha _t)J(w_{t}^{ag}) -\alpha _tJ(u)\\&\le G(w_{t+1}^{ag})-(1-\alpha _t)G(w_{t}^{ag})-\alpha _tG(u) + \alpha _t\langle H(u), w_{t+1}-u\rangle \\&\qquad + \alpha _tJ(w_{t+1}) -\alpha _tJ(u)\\&\le \alpha _t\langle \nabla G(w_{t}^{md}),w_{t+1}-u\rangle +\frac{L\alpha _t^2}{2}\Vert w_{t+1}-r_t\Vert ^2 + \alpha _t\langle H(w_{t+1}), w_{t+1}-u\rangle \\&\qquad + \alpha _tJ(w_{t+1}) -\alpha _tJ(u). \end{aligned} \end{aligned}$$

In the sequel, we will use the following notations to describe the inexactness of the first order information from \(\mathcal{SO}_H\) and \(\mathcal{SO}_G\). At the t-th iteration, letting \({\mathcal {H}}(r_t; \zeta _{2t-1})\), \({\mathcal {H}}(w_{t+1}; \zeta _{2t})\) and \({\mathcal {G}}(w_{t}^{md}; \xi _t)\) be the output of the stochastic oracles, we denote

$$\begin{aligned} \begin{aligned} {\varDelta }_H^{2t-1}&:= {\mathcal {H}}(r_t; \zeta _{2t-1})- H(r_t),\\ {\varDelta }_H^{2t}&:= {\mathcal {H}}(w_{t+1}; \zeta _{2t})- H(w_{t+1}),\text { and }\\ {\varDelta }_G^{t}&:= {\mathcal {G}}(w_{t}^{md}; \xi _t)- \nabla G(w_{t}^{md}). \end{aligned} \end{aligned}$$
(49)

Lemma 4 below provides a bound on \(Q(w_{t+1}^{ag}, u)\) for all \(u\in Z\).

Lemma 4

Suppose that the parameters \(\{\alpha _t\}\) in Algorithm 1 satisfy \(\alpha _1=1\) and \(0\le \alpha _t<1\) for all \(t>1\). Then the iterates \(\{r_t\}\), \(\{w_t\}\) and \(\{w_{t}^{ag}\}\) satisfy

$$\begin{aligned} \begin{aligned} \ \frac{1}{{\varGamma }_t}Q(w_{t+1}^{ag}, u)&\le \mathcal{B}_t(u, r_{[t]}) -\sum _{i=1}^t\frac{\alpha _i}{2{\varGamma }_i\gamma _i}\left( q{\mu }-{L\alpha _i\gamma _i}- \frac{3M^2\gamma _i^2}{\mu }\right) \Vert r_i-w_{i+1}\Vert ^2 \\&\quad + \sum _{i=1}^t{\varLambda }_i(u), \ \forall u\in Z, \end{aligned} \end{aligned}$$
(50)

where \({\varGamma }_t\) is defined in (18),

$$\begin{aligned}&\mathcal{B}_t(u, r_{[t]}):=\sum _{i=1}^t\frac{\alpha _i}{{\varGamma }_i\gamma _i}(V(r_i, u) - V(r_{i+1},u)), \end{aligned}$$
(51)

and

$$\begin{aligned} \begin{aligned} {\varLambda }_i(u) :=&\ \frac{3\alpha _i\gamma _i}{2\mu {\varGamma }_i}\left( \Vert {\varDelta }_H^{2i}\Vert _*^2 + \Vert {\varDelta }_H^{2i-1}\Vert _*^2 \right) - \frac{(1-q)\mu \alpha _i}{2{\varGamma }_i\gamma _i}\Vert r_i - w_{i+1}\Vert ^2\\&\ - \frac{\alpha _i}{{\varGamma }_i}\langle {\varDelta }_H^{2i}+{\varDelta }_G^i, w_{i+1} - u\rangle . \end{aligned} \end{aligned}$$
(52)

Proof

Observe from (49) that

$$\begin{aligned} \begin{aligned}&\ \Vert {\mathcal {H}}(w_{t+1}; \zeta _{2t})- {\mathcal {H}}(r_t; \zeta _{2t-1})\Vert _*^2\\&\quad \le \left( \Vert H(w_{t+1}) - H(r_t)\Vert _* + \Vert {\varDelta }_H^{2t}\Vert _* + \Vert {\varDelta }_H^{2t-1}\Vert _*\right) ^2\\&\quad \le 3 \left( \Vert H(w_{t+1}) - H(r_t)\Vert _*^2 + \Vert {\varDelta }_H^{2t}\Vert _*^2 + \Vert {\varDelta }_H^{2t-1}\Vert _*^2\right) \\&\quad \le 3 \left( M^2\Vert w_{t+1}- r_t\Vert ^2 + \Vert {\varDelta }_H^{2t}\Vert _*^2 + \Vert {\varDelta }_H^{2t-1}\Vert _*^2\right) . \end{aligned} \end{aligned}$$
(53)

Applying Lemma 2 to (13) and (14) (with \(r=r_t, w=w_{t+1}, y=r_{t+1}, {\eta } = \gamma _t{\mathcal {H}}(r_t; \zeta _{2t-1})+ \gamma _t{\mathcal {G}}(w_{t}^{md}; \xi _t),\) \({\vartheta }=\gamma _t{\mathcal {H}}(w_{t+1}; \zeta _{2t})+ \gamma _t{\mathcal {G}}(w_{t}^{md}; \xi _t), J=\gamma _tJ\), \(L^2=3M^2\gamma _t^2\) and \(M^2 = 3\gamma _t^2(\Vert {\varDelta }_H^{2t}\Vert _*^2 + \Vert {\varDelta }_H^{2t-1}\Vert _*^2)\)), and using (53), we have for any \(u\in Z\),

$$\begin{aligned} \begin{aligned}&\ \gamma _t\langle {\mathcal {H}}(w_{t+1}; \zeta _{2t})+ {\mathcal {G}}(w_{t}^{md}; \xi _t), w_{t+1}-u\rangle + \gamma _t J({w_{t+1}})-\gamma _t J(u)\\&\quad \le V(r_t, u)-V(r_{t+1}, u) - \left( \frac{\mu }{2} - \frac{3M^2\gamma _t^2}{2\mu }\right) \Vert r_t - w_{t+1}\Vert ^2 \\&\qquad + \frac{3\gamma _t^2}{2\mu } (\Vert {\varDelta }_H^{2t}\Vert _*^2 + \Vert {\varDelta }_H^{2t-1}\Vert _*^2). \end{aligned} \end{aligned}$$

Applying (49) and the above inequality to (48), we have

$$\begin{aligned} \begin{aligned}&\ Q(w_{t+1}^{ag}, u) - (1-\alpha _t)Q(w_{t}^{ag}, u)\\&\quad \le \alpha _t\langle {\mathcal {H}}(w_{t+1}; \zeta _{2t})+ {\mathcal {G}}(w_{t}^{md}; \xi _t), w_{t+1}-u\rangle + \alpha _tJ(w_{t+1}) -\alpha _tJ(u) \\&\qquad \ +\frac{L\alpha _t^2}{2}\Vert w_{t+1}-r_t\Vert ^2- \alpha _t\langle {\varDelta }_H^{2t} + {\varDelta }_G^t, w_{t+1}- u\rangle \\&\quad \le \frac{\alpha _t}{\gamma _t}\left( V(r_t, u) - V(r_{t+1}, u) \right) - \frac{\alpha _t}{2\gamma _t}\left( \mu - {L\alpha _t\gamma _t} - \frac{3M^2\gamma _t^2}{\mu } \right) \Vert r_t - w_{t+1}\Vert ^2\\&\qquad + \frac{3\alpha _t\gamma _t}{2\mu }\left( \Vert {\varDelta }_H^{2t}\Vert _*^2 + \Vert {\varDelta }_H^{2t-1}\Vert _*^2 \right) - \alpha _t\langle {\varDelta }_H^{2t}+{\varDelta }_G^t, w_{t+1}- u\rangle . \end{aligned} \end{aligned}$$

Dividing the above inequality by \({\varGamma }_t\) and using the definition of \({\varLambda }_t(u)\) in (52), we obtain

$$\begin{aligned} \begin{aligned}&\ \frac{1}{{\varGamma }_t}Q(w_{t+1}^{ag}, u) - \frac{1-\alpha _t}{{\varGamma }_t}Q(w_{t}^{ag}, u)\\&\quad \le \frac{\alpha _t}{{\varGamma }_t\gamma _t}\left( V(r_t, u) - V(r_{t+1}, u) \right) \\&\qquad - \frac{\alpha _t}{2{\varGamma }_t\gamma _t}\left( q\mu - {L\alpha _t\gamma _t} - \frac{3M^2\gamma _t^2}{\mu } \right) \Vert r_t - w_{t+1}\Vert ^2 + {\varLambda }_t(u). \end{aligned} \end{aligned}$$

Noting the fact that \(\alpha _1=1\) and \(({1-\alpha _t})/{{\varGamma }_t} = 1/{{\varGamma }_{t-1}}\), \(t>1\), due to (18), applying the above inequality recursively and using the definition of \({\mathcal {B}}_t(\cdot ,\cdot )\) in (51), we conclude (50).

We still need the following technical result that helps to provide a bound on the last stochastic term in (50) before proving Theorems 1 and 2.

Lemma 5

Let \(\theta _t,\gamma _t>0\), \(t=1,2,\ldots ,\) be given. For any \(w_1\in Z\) and any sequence \(\{{\varDelta }^t \}\subset {\mathcal {E}}\), if we define \(w^v_1=w_1\) and

$$\begin{aligned} w_{i+1}^v = \mathop {\text {argmin}}_{u\in Z}-\gamma _i\langle {\varDelta }^i, u\rangle + V(w_i^v, u), \quad \forall i\ge 1, \end{aligned}$$
(54)

then

$$\begin{aligned} \sum _{i=1}^t\theta _i\langle -{\varDelta }^i, w_i^v - u\rangle \le \sum _{i=1}^{t}\frac{\theta _i}{\gamma _i}(V(w_i^v, u) - V(w_{i+1}^v, u)) + \sum _{i=1}^t\frac{\theta _i\gamma _i}{2\mu }\Vert {\varDelta }^i\Vert _*^2, \quad \forall u\in Z. \end{aligned}$$
(55)

Proof

Applying Lemma 1 to (54) (with \(r=w_i^v\), \(w=w_{i+1}^v\), \(\zeta =-\gamma _i{\varDelta }^i\) and \(J=0\)), we have

$$\begin{aligned} -\gamma _i\langle {\varDelta }^i, w_{i+1}^v - u\rangle \le V(w_i^v, u) -V(w_i^v, w_{i+1}^v) - V(w_{i+1}^v, u), \quad \forall u\in Z. \end{aligned}$$

Moreover, by the Cauchy–Schwarz inequality, Young’s inequality and (46) we have

$$\begin{aligned}&\ -\gamma _i\langle {\varDelta }^i, w_i^v - w_{i+1}^v\rangle \le \gamma _i\Vert {\varDelta }^i{\Vert _*}\Vert w_i^v - w_{i+1}^v\Vert \le \frac{\gamma _i^2}{2\mu }\Vert {\varDelta }^i\Vert _*^2 + \frac{\mu }{2}\Vert w_i^v - w_{i+1}^v\Vert ^2 \\&\quad \le \frac{\gamma _i^2}{2\mu }\Vert {\varDelta }^i\Vert _*^2 + V(w_i^v, w_{i+1}^v). \end{aligned}$$

Adding the above two inequalities and multiplying the resulting inequality by \(\theta _i/\gamma _i\), we obtain

$$\begin{aligned} -\theta _i\langle {\varDelta }^i, w_i^v-u\rangle \le \frac{\theta _i\gamma _i}{2\mu }\Vert {\varDelta }^i\Vert _*^2 + \frac{\theta _i}{\gamma _i}(V(w_i^v, u) - V(w_{i+1}^v, u)). \end{aligned}$$

Summing the above inequalities from \(i=1\) to t, we conclude (55).

With the help of Lemmas 4 and 5, we are now ready to prove Theorem 1, which provides an estimate of the gap function of SAMP in both expectation and probability.

Proof of Theorem 1

We first provide a bound on \( \mathcal{B}_t(u, r_{[t]})\). Since the sequence \(\{r_i \}_{i=1}^{t+1}\) is in the bounded set Z, applying (19) and (20) to (51) we have

$$\begin{aligned} \begin{aligned}&\ \mathcal{B}_t(u, r_{[t]})\\&\quad = \frac{\alpha _1}{{\varGamma }_1\gamma _1}V(r_1, u) - \sum _{i=1}^{t-1}\left[ \frac{\alpha _i}{{\varGamma }_i\gamma _i} - \frac{\alpha _{i+1}}{{\varGamma }_{i+1}\gamma _{i+1}} \right] V(r_{i+1}, u) -\frac{\alpha _t}{{\varGamma }_t\gamma _t}V(r_{t+1}, u)\\&\quad \le \frac{\alpha _1}{{\varGamma }_1\gamma _1}{\varOmega }_Z^2 - \sum _{i=1}^{t-1}\left[ \frac{\alpha _i}{{\varGamma }_i\gamma _i} - \frac{\alpha _{i+1}}{{\varGamma }_{i+1}\gamma _{i+1}} \right] {\varOmega }_Z^2 = \frac{\alpha _t}{{\varGamma }_t\gamma _t}{\varOmega }_Z^2,\ \forall u\in Z. \end{aligned} \end{aligned}$$
(56)

Applying (20) and the above inequality to (50) in Lemma 4, we have

$$\begin{aligned} \frac{1}{{\varGamma }_t}Q(w_{t+1}^{ag}, u) \le \frac{\alpha _t}{{\varGamma }_t\gamma _t}{\varOmega }_Z^2 + \sum _{i=1}^{t}{\varLambda }_i(u), \quad \forall u\in Z. \end{aligned}$$
(57)

Letting \(w_1^v = w_1\) and defining \(w_{i+1}^v\) as in (54) with \({\varDelta }^i = {\varDelta }^{2i}_H + {\varDelta }^i_G\) for all \(i\ge 1\), we conclude from (51) and Lemma 5 (with \(\theta _i=\alpha _i/{\varGamma }_i\)) that

$$\begin{aligned} -\sum _{i=1}^{t}\frac{\alpha _i}{{\varGamma }_i}\langle {\varDelta }^{2i}_H + {\varDelta }^i_G, w^v_i-u\rangle \le \mathcal{B}_t(u, w^v_{[t]}) + \sum _{i=1}^t\frac{\alpha _i\gamma _i}{2\mu {\varGamma }_i}\Vert {\varDelta }_H^{2i} + {\varDelta }_G^i\Vert _*^2,\quad \forall u\in Z. \end{aligned}$$
(58)

The above inequality, together with (52) and Young’s inequality, yields

$$\begin{aligned} \begin{aligned} \sum _{i=1}^{t}{\varLambda }_i(u)&= - \sum _{i=1}^{t}\frac{\alpha _i}{{\varGamma }_i}\langle {\varDelta }_H^{2i} + {\varDelta }_G^i, w_{i}^v - u\rangle + \sum _{i=1}^{t}\frac{3\alpha _i\gamma _i}{2\mu {\varGamma }_i}\left( \Vert {\varDelta }_H^{2i}\Vert _*^2+ \Vert {\varDelta }_H^{2i-1}\Vert _*^2 \right) \\&\quad + \sum _{i=1}^t\frac{\alpha _i}{{\varGamma }_i}\left[ -\frac{(1-q)\mu }{2\gamma _i}\Vert r_i - w_{i+1}\Vert ^2 - \langle {\varDelta }_G^i, w_{i+1} - r_i\rangle \right] \\&\quad - \sum _{i=1}^{t}\frac{\alpha _i}{{\varGamma }_i}\langle {\varDelta }_G^i, r_i - w_i^v\rangle - \sum _{i=1}^{t}\frac{\alpha _i}{{\varGamma }_i}\langle {\varDelta }_H^{2i}, w_{i+1} - w_{i}^v\rangle \\&\le \mathcal{B}_t(u, w^v_{[t]}) + U_t, \end{aligned} \end{aligned}$$
(59)

where

$$\begin{aligned} \begin{aligned} U_t&:= \sum _{i=1}^t\frac{\alpha _i\gamma _i}{2\mu {\varGamma }_i}\Vert {\varDelta }_H^{2i} + {\varDelta }_G^i\Vert _*^2 + \sum _{i=1}^t\frac{\alpha _i\gamma _i}{2(1-q)\mu {\varGamma }_i}\Vert {\varDelta }_G^i\Vert _*^2\\&\quad + \sum _{i=1}^{t}\frac{3\alpha _i\gamma _i}{2\mu {\varGamma }_i}\left( \Vert {\varDelta }_H^{2i}\Vert _*^2 + \Vert {\varDelta }_H^{2i-1}\Vert _*^2 \right) \\&\quad - \sum _{i=1}^{t}\frac{\alpha _i}{{\varGamma }_i}\langle {\varDelta }_G^i, r_i - w_i^v\rangle - \sum _{i=1}^{t}\frac{\alpha _i}{{\varGamma }_i}\langle {\varDelta }_H^{2i}, w_{i+1} - w_{i}^v\rangle . \end{aligned} \end{aligned}$$
(60)

Applying (56) and (59) to (57), we have

$$\begin{aligned} \frac{1}{{\varGamma }_t}Q(w_{t+1}^{ag}, u) \le \frac{2\alpha _t}{\gamma _t{\varGamma }_t}{\varOmega }_Z^2 + U_t, \quad \forall u\in Z, \end{aligned}$$

or equivalently,

$$\begin{aligned} g({w_{t+1}^{ag}})\le \frac{2\alpha _t}{\gamma _t}{\varOmega }_Z^2 + {\varGamma }_tU_t. \end{aligned}$$
(61)

Now it suffices to bound \(U_t\), in both expectation and probability.

We prove part (a) first. By our assumptions on \({\mathcal {SO}}_G\) and \({\mathcal {SO}}_H\) and in view of (13), (14) and (54), during the i-th iteration of Algorithm 1, the random noise \({\varDelta }^{2i}_H\) is independent of \(w_{i+1}\) and \(w_i^v\), and \({\varDelta }^{i}_G\) is independent of \(r_i\) and \(w_i^v\), hence \(\mathbb {E}[\langle {\varDelta }_G^i, r_i - w_i^v\rangle ] = \mathbb {E}[\langle {\varDelta }_H^{2i}, w_{i+1} - w_{i}^v\rangle ] = 0 \). In addition, Assumption A1 implies that \(\mathbb {E}[\Vert {\varDelta }_G^i\Vert _*^2]\le \sigma _{G}^2\), \(\mathbb {E}[\Vert {\varDelta }_H^{2i-1}\Vert _*^2]\le \sigma _H^2\) and \(\mathbb {E}[\Vert {\varDelta }_H^{2i}\Vert _*^2]\le \sigma _H^2\), where \({\varDelta }_G^i\), \({\varDelta }_H^{2i-1}\) and \({\varDelta }_H^{2i}\) are independent. Therefore, taking expectation on (60) we have

$$\begin{aligned} \begin{aligned} \mathbb {E}[U_t] \le&\ \mathbb {E}\left[ \sum _{i=1}^t\frac{\alpha _i\gamma _i}{\mu {\varGamma }_i}\left( \Vert {\varDelta }_H^{2i}\Vert _*^2 + \Vert {\varDelta }_G^i\Vert _*^2\right) + \sum _{i=1}^t\frac{\alpha _i\gamma _i}{2(1-q)\mu {\varGamma }_i}\Vert {\varDelta }_G^i\Vert _*^2 \right. \\&\quad + \left. \sum _{i=1}^{t}\frac{3\alpha _i\gamma _i}{2\mu {\varGamma }_i}\left( \Vert {\varDelta }_H^{2i}\Vert _*^2 + \Vert {\varDelta }_H^{2i-1}\Vert _*^2 \right) \right] \\&= \sum _{i=1}^{t}\frac{\alpha _i\gamma _i}{\mu {\varGamma }_i}\left[ 4\sigma _H^2 + \left( 1 + \frac{1}{2(1-q)} \right) \sigma _G^2\right] . \end{aligned} \end{aligned}$$
(62)

Taking expectation on both sides of (61), and using (62), we obtain (21).

Next we prove part (b). Observe that the sequence \(\{\langle {\varDelta }_G^i, r_i - w_i^v\rangle \}_{i\ge 1}\) is a martingale-difference sequence and hence satisfies the conditions of the large-deviation theorem (see, e.g., Lemma 2 of [25]). Therefore, using Assumption A2 and the fact that

$$\begin{aligned}&\ \mathbb {E}\left[ \exp \left\{ \frac{\mu (\alpha _i{\varGamma }_i^{-1}\langle {\varDelta }_G^i, r_i - w_i^v\rangle )^2}{2(\sigma _G\alpha _i{\varGamma }_i^{-1}{\varOmega }_Z)^2} \right\} \right] \\&\quad \le \ \mathbb {E}\left[ \exp \left\{ \frac{\mu \Vert {\varDelta }_G^i\Vert _*^2\Vert r_i-w_i^v\Vert ^2}{2\sigma _G^2{\varOmega }_Z^2} \right\} \right] \le \mathbb {E}\left[ \exp \left\{ {\Vert {\varDelta }_G^i\Vert _*^2}/{\sigma _G^2} \right\} \right] \le \exp \{ 1\}, \end{aligned}$$

we conclude from the large-deviation theorem that

$$\begin{aligned} Prob\left\{ {-}\sum _{i=1}^t\frac{\alpha _i}{{\varGamma }_i}\langle {\varDelta }_G^i, r_i - w_i^v\rangle > \lambda \sigma _G{\varOmega }_Z\sqrt{\frac{2}{\mu }\sum _{i=1}^t\left( \frac{\alpha _i}{{\varGamma }_i} \right) ^2 } \right\} \le \exp \{-\lambda ^2/3\}. \end{aligned}$$
(63)

By using a similar argument we have

$$\begin{aligned} Prob\left\{ {-}\sum _{i=1}^t\frac{\alpha _i}{{\varGamma }_i}\langle {\varDelta }_H^{2i}, w_{i+1} - w_i^v\rangle > \lambda \sigma _H{\varOmega }_Z\sqrt{\frac{2}{\mu }\sum _{i=1}^t\left( \frac{\alpha _i}{{\varGamma }_i} \right) ^2 } \right\} \le \exp \{-\lambda ^2/3\}. \end{aligned}$$
(64)

In addition, letting \(S_i = \alpha _i\gamma _i/(\mu {\varGamma }_i)\) and \(S = \sum _{i=1}^tS_i\), by Assumption A2 and the convexity of the exponential function, we have

$$\begin{aligned} \begin{aligned} \mathbb {E}\left[ \exp \left\{ \frac{1}{S}\sum _{i=1}^{t}S_i{\Vert {\varDelta }_{G}^i\Vert _*^2}/{\sigma _{G}^2} \right\} \right] \le \mathbb {E}\left[ \frac{1}{S}\sum _{i=1}^{t}S_i\exp \left\{ \Vert {\varDelta }_{G}^i\Vert _*^2/\sigma _{G}^2 \right\} \right] \le \exp \{1\}. \end{aligned} \end{aligned}$$

Noting by Markov’s inequality that \(P(X>a)\le \mathbb {E}[X]/a\) for all nonnegative random variables X and constants \(a>0\), the above inequality implies that

$$\begin{aligned}&Prob\left[ \sum _{i=1}^{t}S_i{\Vert {\varDelta }_{G}^i\Vert _*^2}>(1+\lambda ){\sigma _{G}^2}S\right] \\&\quad = Prob\left[ \exp \left\{ \frac{1}{S}\sum _{i=1}^{t}S_i{\Vert {\varDelta }_{G}^i\Vert _*^2}/{\sigma _{G}^2} \right\} >\exp \{1+\lambda \}\right] \\&\quad \le \mathbb {E}\left[ \exp \left\{ \frac{1}{S}\sum _{i=1}^{t}S_i{\Vert {\varDelta }_{G}^i\Vert _*^2}/{\sigma _{G}^2} \right\} \right] \Big / \exp \{1+\lambda \}\\&\quad \le \exp \{-\lambda \}. \end{aligned}$$

Recalling that \(S_i = \alpha _i\gamma _i/(\mu {\varGamma }_i)\) and \(S = \sum _{i=1}^tS_i\), the above relation is equivalent to

$$\begin{aligned} \begin{aligned}&\ Prob\left\{ \left( 1 + \frac{1}{2(1-q)} \right) \sum _{i=1}^t \frac{\alpha _i\gamma _i}{\mu {\varGamma }_i}\Vert {\varDelta }^i_G\Vert _*^2 > (1+\lambda )\sigma _G^2\left( 1 + \frac{1}{2(1-q)} \right) \sum _{i=1}^t \frac{\alpha _i\gamma _i}{\mu {\varGamma }_i} \right\} \\&\quad \le \exp \{-\lambda \}. \end{aligned} \end{aligned}$$
(65)

Using similar arguments, we also have

$$\begin{aligned}&Prob\left\{ \sum _{i=1}^t \frac{3\alpha _i\gamma _i}{2\mu {\varGamma }_i}\Vert {\varDelta }^{2i-1}_H\Vert _*^2 > (1+\lambda )\frac{3\sigma _H^2}{2}\sum _{i=1}^t \frac{\alpha _i\gamma _i}{\mu {\varGamma }_i} \right\} \le \exp \{-\lambda \}, \end{aligned}$$
(66)
$$\begin{aligned}&Prob\left\{ \sum _{i=1}^t \frac{5\alpha _i\gamma _i}{2\mu {\varGamma }_i}\Vert {\varDelta }^{2i}_H\Vert _*^2 > (1+\lambda )\frac{5\sigma _H^2}{2}\sum _{i=1}^t \frac{\alpha _i\gamma _i}{\mu {\varGamma }_i} \right\} \le \exp \{-\lambda \}. \end{aligned}$$
(67)

Using the fact that \(\Vert {\varDelta }_H^{2i} + {\varDelta }_G^{i}\Vert _{*}^2\le 2\Vert {\varDelta }_H^{2i}\Vert _{*}^2 + 2\Vert {\varDelta }_G^{i}\Vert _{*}^2\), we conclude from (61)–(67) that (22) holds.

In the remaining part of this section, we will focus on proving Theorem 2, which describes the rate of convergence of Algorithm 1 for solving SVI(Z; G, H, J) when Z is unbounded.

Proof of Theorem 2

Let \(U_t\) be defined in (60). Firstly, applying (29) and (59) to (50) in Lemma 4, and noting that \(\mu =1\), we have

$$\begin{aligned} \frac{1}{{\varGamma }_t}Q(w_{t+1}^{ag}, u) \le \mathcal{B}_t(u, r_{[t]}) -\frac{\alpha _t}{2{\varGamma }_t\gamma _t}\sum _{i=1}^t\left( q-c^2\right) \Vert r_i-w_{i+1}\Vert ^2 + \mathcal{B}_t(u, w^v_{[t]}) + U_t, \quad \forall u\in Z. \end{aligned}$$
(68)

In addition, applying (29) to the definition of \({\mathcal {B}}_t(\cdot ,\cdot )\) in (51), we obtain

$$\begin{aligned} \mathcal{B}_t(u, r_{[t]})&= \ \frac{\alpha _t}{2{\varGamma }_t\gamma _t}(\Vert r_1 - u\Vert ^2 - \Vert r_{t+1}-u\Vert ^2) \end{aligned}$$
(70)
$$\begin{aligned}&= \ \frac{\alpha _t}{2{\varGamma }_t\gamma _t}(\Vert r_1 - w_{t+1}^{ag}\Vert ^2 - \Vert r_{t+1}-w_{t+1}^{ag}\Vert ^2 + 2\langle r_1 - r_{t+1}, w_{t+1}^{ag}- u\rangle ). \end{aligned}$$
(71)

By using a similar argument and the fact that \(w^v_1=w_1=r_1\), we have

$$\begin{aligned} \mathcal{B}_t(u, w^v_{[t]})&=\ \frac{\alpha _t}{2{\varGamma }_t\gamma _t}(\Vert r_1 - u\Vert ^2 - \Vert w^v_{t+1}-u\Vert ^2) \end{aligned}$$
(72)
$$\begin{aligned}&=\ \frac{\alpha _t}{2{\varGamma }_t\gamma _t}(\Vert r_1 - w_{t+1}^{ag}\Vert ^2 - \Vert w^v_{t+1}-w_{t+1}^{ag}\Vert ^2 + 2\langle r_1 - w_{t+1}^v, w_{t+1}^{ag}- u\rangle ). \end{aligned}$$
(73)

We then conclude from (68)–(69), (71), and (73) that

$$\begin{aligned} Q(w_{t+1}^{ag}, u) - \langle v_{t+1}, w_{t+1}^{ag}- u\rangle \le \varepsilon _{t+1},\quad \forall u\in Z, \end{aligned}$$
(74)

where

$$\begin{aligned} v_{t+1}: = \frac{\alpha _t}{\gamma _t}(2r_1 - r_{t+1}- w_{t+1}^v) \end{aligned}$$
(75)

and

$$\begin{aligned} \begin{aligned} \varepsilon _{t+1}:&=\ \frac{\alpha _t}{2\gamma _t}\left( 2\Vert r_1 - w_{t+1}^{ag}\Vert ^2 - \Vert r_{t+1}- w_{t+1}^{ag}\Vert ^2 - \Vert w_{t+1}^v - w_{t+1}^{ag}\Vert ^2 \frac{}{}\right. \\&\quad -\left. \sum _{i=1}^t\left( q-c^2\right) \Vert r_i-w_{i+1}\Vert ^2 \right) +{\varGamma }_tU_t. \end{aligned} \end{aligned}$$
(76)

Setting \(u=w_{t+1}^{ag}\) in (74) and noting that its left-hand side then vanishes, we see that the residual \(\varepsilon _{t+1}\) is nonnegative. Hence \(\tilde{g}(w_{t+1}^{ag}, v_{t+1}) \le \varepsilon _{t+1}\). To finish the proof, it suffices to bound \(\mathbb {E}[\Vert v_{t+1}\Vert ]\) and \(\mathbb {E}[\varepsilon _{t+1}]\). Observe that by (2), (6), (16) and the convexity of G and J, we have

$$\begin{aligned} Q(w_{t+1}^{ag}, u^*)\ge \langle {F}(u^*),w_{t+1}^{ag}-u^*\rangle \ge 0, \end{aligned}$$
(77)

where the last inequality follows from the assumption that \(u^*\) is a strong solution of SVI(ZGHJ). Using the above inequality and letting \(u=u^*\) in (68)–(69), we conclude from (70) and (72) that

$$\begin{aligned}&2\Vert r_1 - u^*\Vert ^2 - \Vert r_{t+1}- u^*\Vert ^2 - \Vert w^v_{t+1} - u^*\Vert ^2 - \sum _{i=1}^t\left( q-c^2\right) \Vert r_i-w_{i+1}\Vert ^2 \\&\quad + \frac{2{\varGamma }_t\gamma _t}{\alpha _t}U_t \ge {\frac{2\gamma _t}{\alpha _t}}Q(w_{t+1}^{ag}, u^*) \ge 0. \end{aligned}$$

By the above inequality and the definition of D in (32), we have

$$\begin{aligned} \Vert r_{t+1}- u^*\Vert ^2 + \Vert w^v_{t+1} - u^*\Vert ^2 + \sum _{i=1}^t\left( q-c^2\right) \Vert r_i-w_{i+1}\Vert ^2 \le 2D^2 + \frac{2{\varGamma }_t\gamma _t}{\alpha _t}U_t. \end{aligned}$$
(78)

In addition, applying (29) and the definition of \(C_t\) in (33) to (62), we have

$$\begin{aligned} \mathbb {E}[U_t]\le \sum _{i=1}^{t}\frac{\alpha _t\gamma _i^2}{{\varGamma }_t\gamma _t}\left[ 4\sigma _H^2 + \left( 1 + \frac{1}{2(1-q)} \right) \sigma _G^2\right] = \frac{\alpha _t}{{\varGamma }_t\gamma _t}C_t^2. \end{aligned}$$
(79)

Combining (78) and (79), we have

$$\begin{aligned} \mathbb {E}[\Vert r_{t+1}- u^*\Vert ^2] + \mathbb {E}[\Vert w^v_{t+1} - u^*\Vert ^2] + \sum _{i=1}^t\left( q-c^2\right) \mathbb {E}[\Vert r_i-w_{i+1}\Vert ^2] \le 2D^2 + 2C_t^2. \end{aligned}$$
(80)

We are now ready to prove (30). Observe from the definition of \(v_{t+1}\) in (75) and the definition of D in (32) that \(\Vert v_{t+1}\Vert \le {\alpha _t}(2D + \Vert w_{t+1}^v - u^*\Vert + \Vert r_{t+1}- u^*\Vert )/{\gamma _t}\). Using this inequality together with Jensen’s inequality and (80), we obtain

$$\begin{aligned} \mathbb {E}[\Vert v_{t+1}\Vert ]&\le \frac{\alpha _t}{\gamma _t}\left( 2D + \sqrt{\mathbb {E}[(\Vert r_{t+1}- u^*\Vert + \Vert w_{t+1}^v - u^*\Vert )^2]}\right) \\&\le \ \frac{\alpha _t}{\gamma _t}\left( 2D + \sqrt{2\mathbb {E}[\Vert r_{t+1}- u^*\Vert ^2 + \Vert w_{t+1}^v - u^*\Vert ^2] }\right) \\&\le \frac{\alpha _t}{\gamma _t}\left( 2D + 2\sqrt{D^2 + C_t^2}\right) . \end{aligned}$$

Our remaining goal is to prove (31). By (15) and (18), we have

$$\begin{aligned} \frac{1}{{\varGamma }_t}w_{t+1}^{ag}= \frac{1}{{\varGamma }_{t-1}} w_{t}^{ag}+ \frac{\alpha _t}{{\varGamma }_t}w_{t+1}, \quad \forall {t > 1}. \end{aligned}$$

Using the assumption that \(w_1^{ag} = w_1\), we obtain

$$\begin{aligned} w_{t+1}^{ag}= {\varGamma }_t\sum _{i=1}^t\frac{\alpha _i}{{\varGamma }_i}w_{i+1}, \end{aligned}$$
(81)

where by (18) we have

$$\begin{aligned} {\varGamma }_t\sum _{i=1}^t\frac{\alpha _i}{{\varGamma }_i} = 1. \end{aligned}$$
(82)

Therefore, \(w_{t+1}^{ag}\) is a convex combination of the iterates \(w_2, \ldots , w_{t+1}\). Also, by an argument similar to the one in the proof of Lemma 4, applying Lemma 2 to (13) and (14) (with \(r=r_t, w=w_{t+1}, y=r_{t+1}, {\eta } = \gamma _t{\mathcal {H}}(r_t; \zeta _{2t-1})+ \gamma _t{\mathcal {G}}(w_{t}^{md}; \xi _t), {\vartheta }=\gamma _t{\mathcal {H}}(w_{t+1}; \zeta _{2t})+ \gamma _t{\mathcal {G}}(w_{t}^{md}; \xi _t), J=\gamma _tJ\), \(L=3M^2\gamma _t^2\) and \(M^2 = 3\gamma _t^2(\Vert {\varDelta }_H^{2t}\Vert _*^2 + \Vert {\varDelta }_H^{2t-1}\Vert _*^2)\)), and using (42) and (53), we have

$$\begin{aligned} \frac{1}{2}\Vert r_{t+1}- w_{t+1}\Vert ^2&\le \frac{3M^2\gamma _t^2}{2}\Vert r_t - w_{t+1}\Vert ^2 + \frac{3\gamma _t^2}{2}(\Vert {\varDelta }_H^{2t}\Vert _*^2 + \Vert {\varDelta }_H^{2t-1}\Vert _*^2)\\&\le \ \frac{c^2}{2}\Vert r_t - w_{t+1}\Vert ^2 + \frac{3\gamma _t^2}{2}(\Vert {\varDelta }_H^{2t}\Vert _*^2 + \Vert {\varDelta }_H^{2t-1}\Vert _*^2), \end{aligned}$$

where the last inequality follows from (29).

Now, using (76), (81), (82), and the above inequality, and applying Jensen’s inequality, we have

$$\begin{aligned} \begin{aligned} \varepsilon _{t+1}- {\varGamma }_tU_t&\le \ \frac{\alpha _t}{\gamma _t}\Vert r_1 - w_{t+1}^{ag}\Vert ^2 =\ \frac{\alpha _t}{\gamma _t}\left\| r_1 - u^* + {{\varGamma }_t}\sum _{i=1}^{t}\frac{\alpha _i}{{\varGamma }_i}(u^*-r_{i+1}) \right. \\&\quad \left. +\, {{\varGamma }_t}\sum _{i=1}^{t}\frac{\alpha _i}{{\varGamma }_i}(r_{i+1} - w_{i+1})\right\| ^{2}\\&\le \ \frac{3\alpha _t}{\gamma _t}\left[ D^2 + {\varGamma }_t\sum _{i=1}^t\frac{\alpha _i}{{\varGamma }_i}\left( \Vert r_{i+1} - u^*\Vert ^2 + \Vert w_{i+1} - r_{i+1}\Vert ^2\right) \right] \\&\le \ \frac{3\alpha _t}{\gamma _t}\Bigg [D^2 + {\varGamma }_t\sum _{i=1}^t\frac{\alpha _i}{{\varGamma }_i}\Big (\Vert r_{i+1} - u^*\Vert ^2 + c^2\Vert w_{i+1} - r_{i}\Vert ^2\\&\qquad + 3\gamma _i^2(\Vert {\varDelta }_H^{2i}\Vert _*^2 + \Vert {\varDelta }_H^{2i-1}\Vert _*^2)\Big )\Bigg ]. \end{aligned} \end{aligned}$$
(83)

Noting that by (33) and (78),

$$\begin{aligned}&\ {\varGamma }_t\sum _{i=1}^{t}\frac{\alpha _i}{{\varGamma }_i}(\Vert r_{i+1} - u^*\Vert ^2 + c^2\Vert w_{i+1} - r_{i}\Vert ^2)\\&\quad \le \ {\varGamma }_t\sum _{i=1}^{t}\frac{\alpha _i\theta }{{\varGamma }_i}(\Vert r_{i+1} - u^*\Vert ^2 + (q-c^2)\Vert w_{i+1} - r_{i}\Vert ^2)\\&\quad \le \ {\varGamma }_t\sum _{i=1}^{t}\frac{\alpha _i\theta }{{\varGamma }_i}(2D^2 + \frac{2{\varGamma }_i\gamma _i}{\alpha _i}U_i) = 2\theta D^2 + 2\theta {\varGamma }_t\sum _{i=1}^{t}{\gamma _i}U_i, \end{aligned}$$

and that by (29),

$$\begin{aligned} {\varGamma }_t\sum _{i=1}^{t}\frac{3\alpha _i\gamma _i^2}{{\varGamma }_i}(\Vert {\varDelta }_H^{2i}\Vert _*^2 + \Vert {\varDelta }_H^{2i-1}\Vert _*^2)&=\ {\varGamma }_t\sum _{i=1}^{t}\frac{3\alpha _t\gamma _i^3}{{\varGamma }_t\gamma _t}(\Vert {\varDelta }_H^{2i}\Vert _*^2 + \Vert {\varDelta }_H^{2i-1}\Vert _*^2)\\&=\frac{3\alpha _t}{\gamma _t}\sum _{i=1}^{t}{\gamma _i^3}(\Vert {\varDelta }_H^{2i}\Vert _*^2 + \Vert {\varDelta }_H^{2i-1}\Vert _*^2), \end{aligned}$$

we conclude from (79), (83) and Assumption A1 that

$$\begin{aligned} \mathbb {E}[\varepsilon _{t+1}]&\le \ {\varGamma }_t\mathbb {E}[U_t] + \frac{3\alpha _t}{\gamma _t}\left[ D^2+2\theta D^2 + 2\theta {\varGamma }_t\sum _{i=1}^{t}{\gamma _i}\mathbb {E}[U_i] + \frac{6\alpha _t\sigma _H^2}{\gamma _t}\sum _{i=1}^{t}{\gamma _i^3}\right] \\&\le \ \frac{\alpha _t}{\gamma _t}C_t^2 +\frac{3\alpha _t}{\gamma _t}\left[ (1+2\theta )D^2 + 2\theta {\varGamma }_t\sum _{i=1}^{t}\frac{\alpha _i}{{\varGamma }_i} C_i^2 + \frac{6\alpha _t\sigma _H^2}{\gamma _t}\sum _{i=1}^{t}{\gamma _i^3}\right] . \end{aligned}$$

Finally, observing from (33) and (82) that

$$\begin{aligned} {\varGamma }_t\sum _{i=1}^{t}\frac{\alpha _i}{{\varGamma }_i}C_i^2\le C_t^2{\varGamma }_t\sum _{i=1}^{t}\frac{\alpha _i}{{\varGamma }_i} = C_t^2, \end{aligned}$$

we conclude (31) from the above inequality.

4 Numerical experiments

In this section, we present some preliminary experimental results on solving deterministic and stochastic variational inequality problems using the SAMP algorithm. Comparisons with the mirror-prox method in [32] and the stochastic mirror-prox method in [20] are provided to better examine the performance of the SAMP algorithm.

4.1 Overlapped group lasso

Our first numerical experiment is on a problem of form (7). Specifically, we consider the following overlapped group lasso problem:

$$\begin{aligned} {\min _{x\in X}\frac{1}{2}\mathbb {E}_{a,f}[(\langle a,x\rangle - f)^2] + \lambda \sum _{g\in {\mathcal {S}}}\Vert x_g\Vert .} \end{aligned}$$
(84)

Here, the feasible set X is the Euclidean ball \(X:=\left\{ x\in \mathbb {R}^n\mid \Vert x\Vert \le D\right\} \), \(\Vert \cdot \Vert \) is the Euclidean norm, the random variable pair \((a,f)\) represents a dataset of interest, and x is the sparse feature of the dataset to be extracted. The sparsity structure of x is represented by a collection of groups \({\mathcal {S}}\subseteq 2^{\{1,\ldots ,n\}}\): for any \(g\subseteq \{1,\ldots ,n\}\), \(x_g\) is the sub-vector formed by the components of x whose indices are in g, i.e., \(x_g:=(x_i)_{i\in g}\), and each \(x_g\) has very few non-zero components. Problems utilizing such a sparsity structure are known as the overlapped group lasso [18]. In this experiment, we assume that each group \(g\in {\mathcal {S}}\) consists of k elements. The first term in (84) describes the fidelity of the relation between the dataset and its underlying feature, and the second term is a regularization term that enforces group sparsity. Problem (84) can be formulated as an SVI problem (1) with \(\xi =(a,f)\) and

$$\begin{aligned} {\mathcal {F}}(u;\xi ) = \begin{pmatrix} (\langle a,x\rangle -f)a + K^Ty\\ -K^Tx \end{pmatrix},\quad \forall u=(x,y)\in Z:=X\times Y. \end{aligned}$$
(85)

Here the linear operator K is defined by \(Kx = \lambda (x_{g_1}^T, x_{g_2}^T,\ldots , x_{g_l}^T)^T\), where \(g_i\in {\mathcal {S}}\) and \({\mathcal {S}}= \{g_i\}_{i=1}^{l}\). The set X is as defined above, and the set Y is the set of vectors \(y\in \mathbb {R}^{kn}\) composed of n sub-vectors in the k-dimensional unit ball:

$$\begin{aligned} Y:=\left\{ y\in \mathbb {R}^{kn}\,\Big |\,\left\| \left( y^{(ki-k+1)},y^{(ki-k+2)},\ldots ,y^{(ki)}\right) ^T\right\| \le 1,\ i=1,\ldots ,n\right\} . \end{aligned}$$
(86)

We can see that \({\mathcal {F}}\) has the form of (2), where

$$\begin{aligned} {\mathcal {G}}(u;\xi ) = \begin{pmatrix} (\langle a,x\rangle -f)a\\ 0 \end{pmatrix},\ {\mathcal {H}}(u;\xi )=H(u)=\begin{pmatrix} 0 & K^T \\ -K & 0 \end{pmatrix}u,\text { and } J(u)\equiv 0. \end{aligned}$$
(87)
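To make the group structure and the operator K concrete, the following minimal sketch (in Python with NumPy; the function names are ours and purely illustrative) applies the stacking operator K of (85) and evaluates the regularizer \(\lambda \sum _{g\in {\mathcal {S}}}\Vert x_g\Vert \) of (84) for a toy collection of overlapping groups of size \(k=3\).

```python
import numpy as np

def apply_K(x, groups, lam):
    """Apply the linear operator K of (85): stack lambda * x_g over all groups g in S."""
    return np.concatenate([lam * x[np.asarray(g)] for g in groups])

def group_lasso_penalty(x, groups, lam):
    """The regularizer lambda * sum_{g in S} ||x_g|| in (84); it equals
    max_{y in Y} <Kx, y>, where Y is the product of unit balls in (86)."""
    blocks = np.split(apply_K(x, groups, lam), len(groups))  # one block per group
    return sum(np.linalg.norm(block) for block in blocks)

x = np.array([1.0, 0.0, 0.0, 2.0, 0.0, 0.0])
S = [[0, 1, 2], [2, 3, 4], [3, 4, 5]]   # overlapping groups, each of size k = 3
print(apply_K(x, S, lam=1e-4).shape, group_lasso_penalty(x, S, lam=1e-4))
```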

In this experiment, we assume that the components of a satisfy \(a^{(k)}\sim N(0,1)\) for all k, and that \(f=\langle a,x_{true}\rangle + \varepsilon \) for some underlying ground truth \(x_{true}\) and noise \(\varepsilon \sim N(0,\sigma _{noise}^2)\). The goal of solving (84) is to recover a feature x that is as close to \(x_{true}\) as possible, with only the knowledge of the norm \(D:=\Vert x_{true}\Vert \) of the underlying ground truth. Under these assumptions, it can be computed that \(G(u)=(\Vert x-x_{true}\Vert ^2 + \sigma _{noise}^2)/2\). The stochastic gradient \({\mathcal {G}}(u;\xi )\) is computed in a mini-batch fashion with batch size b, namely, \({\mathcal {G}}(u;\xi ) = (\sum _{i=1}^{b}(\langle a_i,x\rangle -f_i)a_i)/b\) with b independently generated samples \((a_i,f_i)\), where \(a_i^{(k)}\sim N(0,1)\) for all components k, \(f_i = \langle a_i, x_{true}\rangle + \varepsilon _i\), and \(\varepsilon _i\sim N(0,\sigma _{noise}^2)\). Fixing any \(x\in X\) and denoting \(d:=x - x_{true}\) and \(\kappa _i:=(\langle a_i, d\rangle - \varepsilon _i)a_i - d\), we have \(\mathbb {E}[\kappa _i] = 0\) and \(\Vert d\Vert \le 2D\), hence

$$\begin{aligned} \mathbb {E}\left[ \Vert \kappa _i\Vert ^2\right]&= \mathbb {E}\left[ \Vert (\langle a_i, d\rangle - \varepsilon _i)a_i\Vert ^2\right] - \Vert d\Vert ^2\\&= \mathbb {E}\left[ \Vert \langle a_i, d\rangle a_i\Vert ^2\right] + \mathbb {E}\left[ \varepsilon _i^2\Vert a_i\Vert ^2\right] - \Vert d\Vert ^2\\&= \mathbb {E}\left[ \sum _{j=1}^n\left( a_i^{(j)}d^{(j)}+\sum _{\begin{array}{c} k=1\\ k\not =j \end{array}}^n a_i^{(k)}d^{(k)}\right) ^2\left( a_i^{(j)}\right) ^2\right] + n\sigma _{noise}^2 - \Vert d\Vert ^2\\&= \mathbb {E}\left[ \sum _{j=1}^n\left( \left( a_i^{(j)}\right) ^2d^{(j)}+\sum _{\begin{array}{c} k=1\\ k\not =j \end{array}}^n a_i^{(j)}a_i^{(k)}d^{(k)}\right) ^2\right] + n\sigma _{noise}^2 - \Vert d\Vert ^2\\&= \mathbb {E}\left[ \sum _{j=1}^n\left( \left( a_i^{(j)}\right) ^4\left( d^{(j)}\right) ^2+\sum _{\begin{array}{c} k=1\\ k\not =j \end{array}}^n \left( a_i^{(j)}a_i^{(k)}\right) ^2\left( d^{(k)}\right) ^2\right) \right] + n\sigma _{noise}^2 - \Vert d\Vert ^2\\&= \sum _{j=1}^n \left( 3\left( d^{(j)}\right) ^2+\sum _{\begin{array}{c} k=1\\ k\not =j \end{array}}^n \left( d^{(k)}\right) ^2\right) + n\sigma _{noise}^2 - \Vert d\Vert ^2\\&= (n+1)\Vert d\Vert ^2 + n\sigma _{noise}^2\le 4(n+1)D^2 + n\sigma _{noise}^2. \end{aligned}$$

Therefore, noting that \(\kappa _i\) are i.i.d., we have

$$\begin{aligned} \mathbb {E}[{\mathcal {G}}(u;\xi )] = \mathbb {E}\left[ \frac{1}{b}\sum _{i=1}^{b}\kappa _i + d\right] = x-x_{true} = \nabla G(u), \end{aligned}$$

and

$$\begin{aligned} \mathbb {E}\left[ \left\| {\mathcal {G}}(u;\xi ) - \nabla G(u)\right\| _*^2\right]&= \mathbb {E}\left[ \left\| \frac{1}{b}\sum _{i=1}^{b}\kappa _i\right\| ^2\right] = \frac{1}{b^2}\sum _{i=1}^{b}\mathbb {E}[\Vert \kappa _i\Vert ^2] \\&\le \frac{4(n+1)D^2+\sigma _{noise}^2n}{b}. \end{aligned}$$

By the above analysis, it can be computed that \(L=1\) in (4), \(M=\Vert K\Vert \) in (3), \(\sigma _G^2 =(4(n+1)D^2+\sigma _{noise}^2n)/b\) and \(\sigma _H^2=0\) in Assumption A1. The true feature \(x_{true}\) is the n-vector form of a \(64 \times 64\) two-dimensional signal whose intensities are shown in Fig. 1. Within its support, the nonzero intensities of \(x_{true}\) are generated independently from the standard normal distribution. The group sparsity structure we enforce is a grid structure as described in [18] with all the 4-cycles, so that each pixel in the support of \(x_{true}\) is connected to the pixels above, below, left, and right of itself.
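As an illustration of the mini-batch oracle \({\mathcal {G}}(u;\xi )\) described above, the sketch below (hypothetical function name; NumPy assumed) draws b samples \((a_i,f_i)\) and returns the averaged stochastic gradient, whose expectation is \(x-x_{true}=\nabla G(u)\).

```python
import numpy as np

def stochastic_gradient(x, x_true, b, sigma_noise, rng):
    """Mini-batch estimator of grad G in (84): draw b pairs (a_i, f_i) with
    a_i ~ N(0, I) and f_i = <a_i, x_true> + eps_i, and return
    (1/b) * sum_i (<a_i, x> - f_i) a_i, an unbiased estimate of x - x_true."""
    n = x.size
    A = rng.standard_normal((b, n))            # rows are the samples a_i
    f = A @ x_true + sigma_noise * rng.standard_normal(b)
    residual = A @ x - f                       # <a_i, x> - f_i
    return (A.T @ residual) / b

rng = np.random.default_rng(0)
x_true = rng.standard_normal(16)
x = np.zeros_like(x_true)
g = stochastic_gradient(x, x_true, b=50, sigma_noise=0.1, rng=rng)
print(np.linalg.norm(g - (x - x_true)))        # stochastic error of order 1/sqrt(b)
```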

Fig. 1 True feature \(x_{true}\) in the experiment of overlapped group LASSO. From left to right: problem instances 1 through 3

We present in Table 2 the comparison results between the SAMP algorithm and the stochastic mirror-prox (SMP) algorithm in [20]. Three problem instances are generated; in all instances we set \(\lambda =10^{-4}\) in (84), and \(\sigma _{noise}=0.1\) and \(b=50\) in the sampling process. The accuracy of the feature x extracted by each algorithm is evaluated by its relative error to the ground truth, defined by

$$\begin{aligned} \frac{\Vert x - x_{true}\Vert }{\Vert x_{true}\Vert }. \end{aligned}$$
(88)

For both the SAMP and SMP algorithms, the parameters are selected based on the recommended optimal settings. In particular, for the SAMP algorithm, we use the parameters described in Corollary 1 (in which \(\beta =\sigma /D\)). For the SMP algorithm, we use the parameters described in Corollary 1 of [20] (with \(L=1+\Vert K\Vert \), \(M=0\), \({\varOmega }=D\), and \(\sigma =\sigma _G\)). We can observe that the SAMP algorithm significantly outperforms the SMP algorithm in extracting the underlying feature of the datasets.

Table 2 The comparison of the SAMP and SMP algorithms in extracting overlapped group sparse feature of datasets, in terms of the relative error (88) to the ground truth

4.2 Randomized algorithm for solving two-player game

The goal of this subsection is to demonstrate the efficiency of the SAMP algorithm in computing the equilibrium of a two-player game. In particular, we consider the SVI problem

$$\begin{aligned} \left\langle \mathbb {E}\left[ \begin{pmatrix} {\mathcal {P}}(x;\xi _x) + {\mathcal {K}}_y(y;\zeta _y)\\ -{\mathcal {K}}_x(x;\zeta _x) + {\mathcal {Q}}(y;\xi _y) \end{pmatrix} \right] , \begin{pmatrix} x^* - x \\ y^* - y \end{pmatrix}\right\rangle \le 0,\quad \forall x,y\in {\varDelta }^n, \end{aligned}$$
(89)

where \(\xi _x,\xi _y,\zeta _x,\zeta _y\) are random variables such that

$$\begin{aligned}&{\text {Prob}}({\mathcal {P}}(x;\xi _x)=P_j) = x_i^{(j)}, \ {\text {Prob}}({\mathcal {Q}}(y;\xi _y)=Q_k) = y_i^{(k)}, \end{aligned}$$
(90)
$$\begin{aligned}&{\text {Prob}}({\mathcal {K}}_x(x;\zeta _x)=K_l) = x_i^{(l)}, \text { and }{\text {Prob}}({\mathcal {K}}_y(y;\zeta _y)=K^m) = y_i^{(m)}. \end{aligned}$$
(91)

Here \({\varDelta }^n\) is the standard simplex, \(P_j\) and \(Q_k\) are the j-th and k-th columns of positive semidefinite matrices P and Q, and \(K_l\) and \((K^m)^T\) are the l-th column and m-th row of a matrix K, respectively. Letting \(Z:={\varDelta }^n\times {\varDelta }^n\), \(\xi :=(\xi _x,\xi _y)\), \(\zeta :=(\zeta _x,\zeta _y)\), and using the notation \(u=(x,y)\in Z\), problem (89) can be formulated as (1) where

$$\begin{aligned} {\mathcal {G}}(u;\xi ) = \begin{pmatrix} {\mathcal {P}}(x;\xi _x)\\ {\mathcal {Q}}(y;\xi _y) \end{pmatrix},\ {\mathcal {H}}(u;\zeta )=\begin{pmatrix} {\mathcal {K}}_y(y;\zeta _y)\\ -{\mathcal {K}}_x(x;\zeta _x) \end{pmatrix},\text { and }J(u)\equiv 0. \end{aligned}$$
(92)

Also, it can be checked from (90)–(91) that \(\mathbb {E}[{\mathcal {P}}(x;\xi _x)]=Px\), \(\mathbb {E}[{\mathcal {Q}}(y;\xi _y)] = Qy\), \(\mathbb {E}[{\mathcal {K}}_x(x;\zeta _x)]=Kx\) and \(\mathbb {E}[{\mathcal {K}}_y(y;\zeta _y)] = K^Ty\); hence problem (89) also has the equivalent form (5) where

$$\begin{aligned} F(u) = \left( \begin{array}{cc} P &{} K^T\\ -K &{} Q \end{array} \right) u,\ G(u)=\frac{1}{2}\left\langle Px, x\right\rangle + \frac{1}{2}\langle Qy, y\rangle ,\text { and } H(u)=\left( \begin{array}{c} K^Ty\\ -Kx \end{array}\right) , \end{aligned}$$

or, equivalently, the saddle point problem

$$\begin{aligned} \min _{x\in {\varDelta }^n}\max _{y\in {\varDelta }^n}\frac{1}{2}\langle Px,x\rangle + \langle Kx, y\rangle - \frac{1}{2}\langle Qy, y\rangle . \end{aligned}$$
(93)

The above saddle point problem describes the equilibrium of a two-player game.

It should be noted that, although problems (89) and (93) are equivalent, from the algorithm design point of view the stochastic formulation (89) may have some advantages over its deterministic counterpart (93), as pointed out in [33]. This is because, when P, Q, and K are dense and n is large, the matrix-vector multiplications Px, Qy, \(K^Ty\) and Kx may be relatively expensive. Consequently, the stochastic formulation (89) becomes more favorable than its deterministic counterpart (93), since each sampling of \({\mathcal {P}}(x;\xi _x)\), \({\mathcal {Q}}(y;\xi _y)\), \({\mathcal {K}}_y(y;\zeta _y)\) and \({\mathcal {K}}_x(x;\zeta _x)\) is just a random selection of a column or row, rather than a matrix-vector computation. Indeed, designing a stochastic approximation algorithm for solving the SVI (89) is equivalent to designing a randomized algorithm for solving (93).
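The sampling oracles in (90)–(91) can be sketched as follows (a minimal illustration with hypothetical function names; NumPy assumed). Each call reads a single column or row of the corresponding matrix, and the sample average recovers the associated matrix-vector product.

```python
import numpy as np

def sample_P(P, x, rng):
    """Return column P_j with probability x^{(j)}, cf. (90); unbiased estimator of Px."""
    j = rng.choice(P.shape[1], p=x)
    return P[:, j]

def sample_K_transpose(K, y, rng):
    """Return the m-th row of K with probability y^{(m)}, cf. (91);
    unbiased estimator of K^T y."""
    m = rng.choice(K.shape[0], p=y)
    return K[m, :]

rng = np.random.default_rng(0)
n = 4
K = rng.random((n, n))
y = np.full(n, 1.0 / n)                         # a point on the simplex
est = np.mean([sample_K_transpose(K, y, rng) for _ in range(2000)], axis=0)
print(np.linalg.norm(est - K.T @ y))            # small for large sample sizes
```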

To compute a solution of (1), we consider an entropy setting for the prox-function used in the AMP algorithm. For simplicity, we only consider in this experiment the case when \(\max _{i,j}|P^{(i,j)}|=\max _{i,j}|Q^{(i,j)}|\). For all \(z=(\tilde{x},\tilde{y})\in Z\), \(u=(x,y)\in Z\) and \(\eta =(\eta _x,\eta _y)\in \mathcal {E}\), we define

$$\begin{aligned} \begin{aligned}&\Vert u\Vert :=\sqrt{\Vert x\Vert _1^2+\Vert y\Vert _1^2},\ \Vert \eta \Vert _*:=\sqrt{\Vert \eta _x\Vert _{\infty }^2+\Vert \eta _y\Vert _{\infty }^2},\ \text {and }\\&V(z,u) := \sum _{j=1}^{n}(x_i^{(j)} + \nu /n)\ln \frac{x_i^{(j)}+\nu /n}{\tilde{x}_i^{(j)}+\nu /n} + \sum _{j=1}^{n}(y_i^{(j)} + \nu /n)\ln \frac{y_i^{(j)}+\nu /n}{\tilde{y}_i^{(j)}+\nu /n}. \end{aligned} \end{aligned}$$
(94)

Here, \(y_i^{(j)}\) denotes the j-th entry of the strategy \(y_i\), and \(\nu \) is arbitrarily small (e.g., \(\nu =10^{-16}\)). With the above setting, the optimization problem in the prox-mapping (10) can be efficiently solved within machine accuracy, and the strong convexity parameter of the prox-function V(zu) is \(\mu =1+\nu \) (See [3] for details on the entropy prox-functions). It is easy to check that
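To illustrate why the prox-mapping is inexpensive in this setting, the sketch below implements the entropy prox-mapping on a single simplex block in the \(\nu \rightarrow 0\) limit of (94), where the minimizer has the familiar closed form \(u^{(j)} \propto z^{(j)}e^{-\eta ^{(j)}}\); the function name is ours, and the treatment of the \(\nu /n\) shift is omitted.

```python
import numpy as np

def entropy_prox_step(z, eta):
    """argmin_{u in simplex} <eta, u> + sum_j u^(j) log(u^(j) / z^(j)),
    i.e., the nu -> 0 limit of the prox-mapping with V in (94); the
    minimizer is u^(j) = z^(j) exp(-eta^(j)) / sum_k z^(k) exp(-eta^(k))."""
    w = np.log(z) - eta
    w -= w.max()                  # shift for numerical stability
    u = np.exp(w)
    return u / u.sum()

z = np.array([0.4, 0.3, 0.2, 0.1])
eta = np.array([1.0, -0.5, 0.0, 2.0])
print(entropy_prox_step(z, eta))  # remains a probability vector
```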

$$\begin{aligned}&\ L \le \max \left\{ \max _{k,j}|P^{(k,j)}|, \max _{k,j}|Q^{(k,j)}| \right\} ,\ M\le \max _{k,j}|K^{(k,j)}|,\ {\varOmega }_Z^2 = 2\left( 1+\frac{\nu }{n}\right) \ln \left( \frac{n}{\nu }+1\right) ,\\&\ \mathbb {E}_\xi \left[ \left\| {\mathcal {G}}(u;\xi ) - \nabla G(u)\right\| _*^2 \right] \le 4\left( \max _{k,j}|P^{(k,j)}|^2+\max _{k,j}|Q^{(k,j)}|^2\right) ,\text { and }\\&\ \mathbb {E}_\zeta \left[ \left\| {\mathcal {H}}(u;\zeta ) - H(u)\right\| _*^2 \right] \le 8\max _{k,j}|K^{(k,j)}|^2. \end{aligned}$$

Therefore, we set

$$\begin{aligned} \sigma _{G}=2\sqrt{\max _{k,j}|P^{(k,j)}|^2+\max _{k,j}|Q^{(k,j)}|^2},\ \displaystyle \sigma _{H}=2\sqrt{2\max _{k,j}|K^{(k,j)}|^2}, \end{aligned}$$

and \(\sigma \) by (8).

In this experiment, we first generate random matrices \(B, C\in \mathbb {R}^{100\times n}\) and \(K\in \mathbb {R}^{n\times n}\), where each entry of these matrices is independently and uniformly distributed over [0, 1]. The matrices P and Q are then generated by \(P=B^TB\) and \(Q=C^TC\), and rescaled so that P and Q are both positive semidefinite and \(L=\max _{k,j}|P^{(k,j)}|=\max _{k,j}|Q^{(k,j)}|\). For the SAMP algorithm, we use the scheme in Algorithm 1 with the parameters described in (94) and Corollary 1 (in which \(\beta =\sigma /{\varOmega }_Z\)). As a comparison, we also implement the stochastic mirror-prox (SMP) method with parameters set by Corollary 1 in [20] (in which \(L=\sqrt{2\max _{k,j}|P^{(k,j)}|^2+2\max _{k,j}|K^{(k,j)}|^2}\), \({\sigma } = 4\sqrt{\max _{k,j}|P^{(k,j)}|^2+\max _{k,j}|K^{(k,j)}|^2}\) and \({\varOmega }={\varOmega }_Z^2\)). The performance of the SAMP and SMP algorithms is compared in terms of the average of the gap function values (17) (computed by MOSEK [30]) over 100 runs.
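A minimal sketch of the instance generation just described is given below (NumPy assumed; function name ours). The exact rescaling used to equalize the largest entries of P and Q is not spelled out above, so here we simply scale Q to match P, which is one natural choice.

```python
import numpy as np

def generate_instance(n, rng):
    """Generate P = B^T B and Q = C^T C (positive semidefinite) and K with
    uniform [0, 1] entries, rescaling Q so that max|P| = max|Q| = L."""
    B = rng.random((100, n))
    C = rng.random((100, n))
    K = rng.random((n, n))
    P, Q = B.T @ B, C.T @ C
    Q *= np.abs(P).max() / np.abs(Q).max()      # enforce max|P^{(k,j)}| = max|Q^{(k,j)}|
    return P, Q, K

P, Q, K = generate_instance(200, np.random.default_rng(0))
print(np.abs(P).max(), np.abs(Q).max(), np.abs(K).max())
```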

The comparison between the SAMP and SMP algorithms in terms of the performance on computing approximate solutions of (93) is described in Table 3. We can see that the SAMP algorithm outperforms the SMP algorithm, which is consistent with our theoretical observation on the iteration complexities of the SAMP and SMP algorithms. In particular, as L increases, the advantage of SAMP over SMP in terms of \(\mathbb {E}[g(u)]\) becomes more evident.

Table 3 The comparison of the SAMP and SMP algorithms in solving SVI (89), in terms of the expectation of the gap function value g(u) for any approximate solution u

4.3 Two-player game with nonlinear payoff

Our goal in this subsection is to demonstrate the advantages of the SAMP algorithm over the mirror-prox method (or extragradient method) even for solving certain classes of deterministic VIs. We consider the problem of computing the equilibrium of a convex-concave two-player game of the form

$$\begin{aligned} \min _{x\in {\varDelta }^n}\max _{y\in {\varDelta }^n}\sum _{i=1}^{m}\log \left( 1+e^{\langle a_i,x\rangle }\right) + \sum _{i=1}^{n}\log \left( 1+\frac{y^{(i)}}{c^{(i)}+x^{(i)}}\right) - \sum _{i=1}^{m}\log \left( 1+e^{\langle b_i,y\rangle }\right) , \end{aligned}$$
(95)

where \({\varDelta }^n\) is the standard simplex. When \(a_i=b_i=0\) for all \(i=1,\ldots ,m\), the above becomes the water filling problem (see, e.g., [5]). Letting \(Z:={\varDelta }^n\times {\varDelta }^n\) and using the notation \(u=(x,y)\), the above problem is equivalent to (5) with \(J(u)\equiv 0\) and

$$\begin{aligned} G(u)&= \sum _{i=1}^{m}\log \left( 1+e^{\langle a_i,x\rangle }\right) + \sum _{i=1}^{m}\log \left( 1+e^{\langle b_i,y\rangle }\right) , \end{aligned}$$
(96)
$$\begin{aligned} H(u)&= \begin{pmatrix} -\dfrac{y^{(i)}}{(c^{(i)}+x^{(i)})(c^{(i)}+x^{(i)}+y^{(i)})}\\ -\dfrac{1}{c^{(i)}+x^{(i)}+y^{(i)}} \end{pmatrix}. \end{aligned}$$
(97)
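For concreteness, the operators \(\nabla G\) and H in (96)–(97) can be evaluated as in the following sketch (hypothetical names; NumPy assumed), where the rows of A and B store the vectors \(a_i\) and \(b_i\), respectively.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_G(x, y, A, B):
    """Gradient of G in (96): d/dx sum_i log(1+exp(<a_i,x>)) = A^T sigmoid(Ax),
    and similarly for the y-block with B."""
    return np.concatenate([A.T @ sigmoid(A @ x), B.T @ sigmoid(B @ y)])

def H_op(x, y, c):
    """The monotone operator H in (97), evaluated componentwise."""
    hx = -y / ((c + x) * (c + x + y))
    hy = -1.0 / (c + x + y)
    return np.concatenate([hx, hy])

rng = np.random.default_rng(0)
m, n = 5, 8
A, B = rng.standard_normal((m, n)), rng.standard_normal((m, n))
x = np.full(n, 1.0 / n)
y = np.full(n, 1.0 / n)
c = np.ones(n)
F = grad_G(x, y, A, B) + H_op(x, y, c)   # one evaluation of F(u) = grad G(u) + H(u)
print(F.shape)
```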

In this experiment, we generate \(a_i\) and \(b_i\) randomly from the standard normal distribution, and set \(c^{(i)}=1\) for all i. We apply the entropy setting in (94). For the SAMP algorithm, we incorporate a backtracking linesearch technique in order to determine the best values of L and M that guarantee the convergence of Algorithm 1 (see, e.g., the backtracking technique in [36]). We compare the SAMP algorithm with the mirror-prox (MP) algorithm with adaptive stepsize described in [32]. The comparison between the SAMP algorithm and the MP algorithm is illustrated through the convergence of the gap function g(u) in (17) (computed by IPOPT [47]). It can be observed from Fig. 2 that the SAMP algorithm outperforms the MP algorithm in terms of both iterations versus gap value and CPU time versus gap value. This is consistent with our theoretical observation that SAMP has a better iteration complexity bound than the MP algorithm for solving deterministic VIs.

Fig. 2 Performance of SAMP and MP on solving problem (95). Left: iteration versus gap value g(u). Right: CPU time versus gap value g(u)

4.4 Variational inequality on the Lorentz cone

In this subsection, we study the performance of SAMP when the feasible set is unbounded. In particular, we consider the SVI problem of finding \(u^*\in Z\) such that

$$\begin{aligned} \langle \mathbb {E}[Au+\zeta ], u^*-u\rangle \le 0,\quad \forall u\in Z, \end{aligned}$$
(98)

where \(A\in \mathbb {R}^{(n+1)\times (n+1)}\) is a linear monotone operator, \(\zeta \) is a random vector whose expectation \(b = \mathbb {E}[\zeta ]\) is unknown a priori, and Z is the Lorentz cone:

$$\begin{aligned} Z:=\{(x,t)\in \mathbb {R}^{(n+1)}\mid \Vert x\Vert \le t\}. \end{aligned}$$

To solve (98), we can decompose the linear monotone operator A into the sum of a symmetric positive semidefinite matrix \((A+A^T)/2\) and a skew-symmetric matrix \((A-A^T)/2\); hence the SVI problem (98) can be viewed as an instance of (1) and (5) with

$$\begin{aligned} \begin{aligned}&F(u) = Au+b,\ G(u) = \frac{1}{4}\langle (A+A^T)u, u\rangle ,\ H(u) = \frac{1}{2}(A-A^T)u + b, J(u) = 0,\\&\quad {\mathcal {G}}(u;\xi )\equiv \nabla G(u), \text { and } {\mathcal {H}}(u;\zeta ) = \frac{1}{2}(A-A^T)u + \zeta . \end{aligned} \end{aligned}$$
(99)

In this experiment, we generate the linear monotone operator A randomly by \(A=B^TB + (C - C^T)\), where \(B\in \mathbb {R}^{\lceil (n+1)/2\rceil \times (n+1)}\) (so that A is monotone but not strictly monotone), \(C\in \mathbb {R}^{(n+1)\times (n+1)}\), and the entries of B and C are generated independently from the uniform [0, 1] distribution. We generate \(\zeta \) by \(\zeta \sim N(b, \frac{1}{n}I)\), where the entries of the mean vector b are also randomly distributed between 0 and 1. Therefore, in Assumption A1 we have \(\sigma _G=0\) and \(\sigma _H=1\). By setting \(V(z,u)=\Vert z-u\Vert ^2/2\), the prox-mapping \(P_z^J(u)\) in (10) becomes the projection of \(z-\eta \) onto the Lorentz cone Z, which can be calculated efficiently. For the SAMP algorithm, we use the parameter settings in Corollary 2 with \(L=\Vert A+A^T\Vert /2\), \(M=\Vert A-A^T\Vert /2\), and \(\beta =1\). Since the study of the stochastic mirror-prox method in [20] only considers compact feasible sets, we could not find recommended parameter settings of the stochastic mirror-prox method for solving unbounded SVIs. Therefore, we compare the SAMP algorithm with the deterministic mirror-prox (MP) method in [32]. In each iteration, the SAMP algorithm is supplied with the stochastic information \({\mathcal {F}}(u;\xi ,\zeta )\), and the MP method is supplied with the deterministic information F(u). We choose the parameters of MP according to (3.2) of [32], in which \(L=\Vert A\Vert \). Noting that SAMP computes relatively more matrix-vector multiplications due to the aforementioned decomposition, we set the total number of iterations of the MP method to be twice that of the SAMP method. The performance of the SAMP and MP algorithms is compared in terms of the gap function (28), which is computed using MOSEK [30]. In particular, for any approximate solution w and perturbation vector v, we compute the value of \(\tilde{g}(w,v)\) in (28) and the norm \(\Vert v\Vert \) of the perturbation vector. The comparison between the SAMP and MP algorithms is described in Table 4.
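The decomposition in (99) and the projection step just described can be sketched as follows (hypothetical names; NumPy assumed); the projection onto the Lorentz cone uses the standard closed form for the second-order cone.

```python
import numpy as np

def split_operator(A):
    """Split A into the gradient-field part (A + A^T)/2 and the
    skew-symmetric part (A - A^T)/2 used in (99)."""
    return (A + A.T) / 2.0, (A - A.T) / 2.0

def project_lorentz(v):
    """Euclidean projection of v = (x, t) onto Z = {(x, t) : ||x|| <= t}."""
    x, t = v[:-1], v[-1]
    nx = np.linalg.norm(x)
    if nx <= t:                   # already in the cone
        return v
    if nx <= -t:                  # in the polar cone: project to the origin
        return np.zeros_like(v)
    alpha = (nx + t) / 2.0        # otherwise project onto the cone boundary
    return np.concatenate([alpha * x / nx, [alpha]])

rng = np.random.default_rng(0)
A = rng.random((6, 6))
S, W = split_operator(A)
print(project_lorentz(np.array([3.0, 4.0, 2.0])))   # lands on the boundary ||x|| = t
```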

Table 4 The comparison of the SAMP and MP algorithms in solving the SVI problem (98)

Two remarks on the performance of the SAMP and MP algorithms are in order. Firstly, it is interesting to observe that the practical convergence of the perturbation vector norm \(\Vert v\Vert \) is slower than that of the gap function value \(\tilde{g}(w,v)\), although they have the same rate of convergence (see Corollary 2). Secondly, the SAMP algorithm outperforms the MP algorithm for solving (98). Such a performance comparison indicates the importance of the special treatment of the gradient field \(\nabla G\) in the SVI (98).

5 Concluding remarks

We present in this paper a novel stochastic accelerated mirror-prox method for solving a class of stochastic variational inequality problems. The basic idea of this algorithm is to incorporate a multi-step acceleration scheme into the stochastic mirror-prox method in [20]. The SAMP method achieves the optimal iteration complexity, not only in terms of its dependence on the number of iterations, but also in terms of a variety of problem parameters. As a byproduct, the SAMP method also significantly improves the iteration complexity for solving a class of deterministic variational inequalities. Moreover, the iteration cost of SAMP is comparable to, or even less than, that of the stochastic mirror-prox method, since it saves one computation of the stochastic gradient of the smooth component. To the best of our knowledge, this is the first algorithm with the optimal iteration complexity bounds for solving SVIs of type (2). Furthermore, we show that the developed SAMP scheme can deal with the situation when the feasible region is unbounded, as long as a strong solution of the VI exists. In the unbounded case, we adopt the modified termination criterion employed by Monteiro and Svaiter in solving monotone inclusion problems, and demonstrate that the rate of convergence of SAMP depends on the distance from the initial point to the set of strong solutions. Our preliminary numerical results show that the proposed SAMP algorithm is promising for solving large-scale variational inequality problems.

It should be noted that in this paper we focus on the algorithm design for computing weak solutions to monotone SVIs. In view of some recent developments on the unified analysis of convex and nonconvex stochastic optimization algorithms (see, e.g., [15]), it will be interesting to study a unified SAMP method that deals with both monotone and non-monotone SVIs. Also, considering that the problem of interest is an SVI with a deterministic feasible set, in the future it will be interesting to study different stochastic approximation type algorithms for solving SVIs involving expectation or probabilistic constraints. Moreover, it has been shown in [11, 28, 29] that for SVI(Z; 0, H, 0) with \(\sigma =0\) and \(G=0\), the mirror-prox method in [32] indeed converges to a strong solution with complexity \({\mathcal {O}}(M^2/\varepsilon ^2)\). Since the main interest of this paper is the specialized treatment of the gradient field \(\nabla G\) in the SVI, we focus only on how to achieve the lower complexity bound for computing weak solutions. However, by incorporating the analysis in [11], it will be interesting to see if Algorithm 1 could also compute an approximate strong solution of the SVI problem (1) with complexity \({\mathcal {O}}(\sqrt{L/\varepsilon }+(M^2+\sigma ^2)/\varepsilon ^2)\).