1 Introduction

In this paper, we are interested in solving the following composite optimization problem:

$$\begin{aligned} \min _{x\in \mathbb R^{l} ,y\in \mathbb R^{m}} \Phi (x,y)=f(x)+H(x,y)+g(y), \end{aligned}$$
(1.1)

where \(f:\mathbb {R}^l\rightarrow {(-\infty ,+\infty ]}\) and \(g:\mathbb {R}^m\rightarrow {(-\infty ,+\infty ]}\) are proper lower semicontinuous. \(H(x,y)=\frac{1}{n} \sum _{i=1}^{n} H_i(x,y)\) has a finite-sum structure, \(H_i:\mathbb {R}^l\times \mathbb {R}^m \rightarrow \mathbb {R}\) is continuously differentiable, and \(\nabla H_i\) is Lipschitz continuous on bounded subsets. Note that here and throughout the paper, no convexity is imposed on \(\Phi \). In practical applications, numerous problems can be formulated in the form of (1.1), such as signal and image processing [1, 2], nonnegative matrix factorization [3,4,5], blind image-deblurring [5, 6], sparse principal component analysis [7, 8], and compressed sensing [9, 10]. Here, we list two applications of (1.1), which will also be used in the numerical experiments.

(1) Sparse nonnegative matrix factorization (S-NMF). The S-NMF has important applications in image processing (face recognition) and bioinformatics (clustering of gene expressions) (see [4] for details). Given a matrix \(A\in \mathbb {R} ^{ l\times m}\) and an integer \(r>0\), we seek a factorization \(A \approx XY\), where \(X \in \mathbb {R} ^{ l\times r}\) and \(Y \in \mathbb {R} ^{r\times m}\) are nonnegative with \(r \le \min \left\{ l,m\right\} \) and X is sparse. One way to solve this problem is via the nonnegative least squares model given by

$$\begin{aligned} \underset{X,Y}{\min }\ \left\{ \frac{\eta }{2}\left\| A-XY \right\| _{F}^{2} : \ X,Y\ge 0,\ \left\| X_i \right\| _0\le s,\ i=1,2,\dots ,r\right\} , \end{aligned}$$
(1.2)

where \(\eta >0\), \(X_i\) denotes the ith column of X, and \(\left\| X_i \right\| _0\) denotes the number of nonzero elements of the ith column of X. In this formulation, the sparsity on X is strictly enforced using the nonconvex \(l_ 0\) constraint. Let \(H(X,Y)=\frac{\eta }{2}\left\| A-XY \right\| _{F}^{2}=\sum _{i=1}^l\frac{\eta }{2}\left\| A^i-X^iY \right\| _{F}^{2}\), \(f(X)=\iota _{X\ge 0}(X)+\iota _{\left\| X_1\right\| _0\le s}(X)+\cdots +\iota _{\left\| X_r \right\| _0\le s}(X)\), \(g(Y)=\iota _{Y\ge 0}(Y)\), where \(A^i\) and \(X^i\) denote the ith rows of A and X, respectively, and \(\iota _C\) is the indicator function on the set C. Then, model (1.2) can be converted into the form (1.1).
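To make the nonsmooth part concrete, the proximal (here, projection) step associated with f amounts to clipping negative entries to zero and then keeping only the s largest entries of each column. The following NumPy sketch is our own illustration (the function name and interface are not taken from the cited references):

```python
import numpy as np

def prox_nonneg_sparse_columns(X, s):
    """Columnwise projection onto {x >= 0, ||x||_0 <= s}: clip negative
    entries to zero, then keep only the s largest entries of each column."""
    Z = np.maximum(X, 0.0)                      # enforce nonnegativity
    if s < Z.shape[0]:
        # indices of the (#rows - s) smallest entries in each column
        idx = np.argpartition(Z, -s, axis=0)[:-s, :]
        np.put_along_axis(Z, idx, 0.0, axis=0)  # zero all but the s largest
    return Z

# toy usage: keep at most 2 nonzeros per column of a 5 x 3 matrix
X_proj = prox_nonneg_sparse_columns(np.random.randn(5, 3), s=2)
```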

(2) Blind image deconvolution (BID). Let A be the observed blurred image, and let X be the unknown sharp image of the same size. Furthermore, let Y denote a small unknown blur kernel. A typical variational formulation of the blind deconvolution problem is then given by the following:

$$\begin{aligned} \underset{X,Y}{\min }\ \left\{ \frac{1}{2} \left\| A-\!X\odot Y \right\| _{F}^{2}\!+\eta \sum _{r=1}^{2d} R([D(X)]_r) : \ 0\!\le X\!\le 1,\ 0\le \! Y\le \! 1,\ \left\| Y \right\| _1\le \! 1\right\} , \end{aligned}$$
(1.3)

where \(\eta >0\), \(\odot \) is the two-dimensional convolution operator, X is the image to recover, and Y is the blur kernel to estimate. Here, \(R(\cdot )\) is an image regularization term that imposes sparsity on the image gradients and hence favors sharp images. \(D(\cdot )\) is the differential operator, which computes the horizontal and vertical gradients at each pixel. This model (1.3) can be converted to (1.1), where \(H(X,Y)=\frac{1}{2}\left\| A-X\odot Y \right\| _{F}^{2}+\eta \sum _{r=1}^{2d} R([D(X)]_r)\), \(f(X)=\iota _{0\le X\le 1}(X)\), \(g(Y)=\iota _{\left\| Y \right\| _1\le 1}(Y)+\iota _{0\le Y\le 1}(Y)\). See [6] for details.
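As an illustration of the smooth part of (1.3), the sketch below computes forward-difference image gradients as one possible realization of \(D(\cdot )\); the zero boundary handling and the smooth penalty \(R(t)=\log (1+\theta t^2)\) are assumptions made here for illustration only and are not necessarily the choices of [6]:

```python
import numpy as np

def image_gradients(X):
    """Forward-difference horizontal and vertical gradients of an image X
    (one possible realization of the operator D; zero boundary assumed)."""
    dh = np.zeros_like(X)
    dv = np.zeros_like(X)
    dh[:, :-1] = X[:, 1:] - X[:, :-1]   # horizontal differences
    dv[:-1, :] = X[1:, :] - X[:-1, :]   # vertical differences
    return dh, dv

def smooth_sparsity_penalty(X, theta=1.0):
    """Illustrative smooth choice R(t) = log(1 + theta * t^2) applied to
    every gradient entry, so that H in (1.1) stays differentiable."""
    dh, dv = image_gradients(X)
    return np.log1p(theta * dh**2).sum() + np.log1p(theta * dv**2).sum()
```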

For solving problem (1.1), a frequently applied algorithm is the following proximal alternating linearized minimization algorithm (PALM) by Bolte et al. [11] based on results in [12, 13]:

$$\begin{aligned} {\left\{ \begin{array}{ll} x_{k+1}\in \arg \min _{ x\in \mathbb {R}^l}\{f(x)+\langle x,\nabla _xH(x_k,y_k)\rangle +\frac{1}{2\lambda _k}\Vert x-x_k\Vert ^2_2\},\\ y_{k+1}\in \arg \min _{y\in \mathbb {R}^m}\{g(y)+\langle y,\nabla _yH(x_{k+1},y_k)\rangle +\frac{1}{2\mu _k}\Vert y-y_k\Vert ^2_2\},\\ \end{array}\right. } \end{aligned}$$
(1.4)

where \(\{\lambda _k\}_{k\in \mathbb {N}}\) and \(\{\mu _k\}_{k\in \mathbb {N}}\) are positive sequences. To further improve the performance of PALM, Pock and Sabach [6] introduced an inertial step to PALM and proposed the following inertial proximal alternating linearized minimization (iPALM) algorithm:

$$\begin{aligned} {\left\{ \begin{array}{ll} u_{1k}=x_k+\alpha _{1k}(x_k-x_{k-1}), v_{1k}=x_k+\beta _{1k}(x_k-x_{k-1}),\\ x_{k+1}\in \arg \min _{ x\in \mathbb {R}^l}\{f(x)+\langle x,\nabla _xH(v_{1k},y_k)\rangle +\frac{1}{2\lambda _k}\Vert x-u_{1k}\Vert ^2_2\},\\ u_{2k}=y_k+\alpha _{2k}(y_k-y_{k-1}), v_{2k}=y_k+\beta _{2k}(y_k-y_{k-1}),\\ y_{k+1}\in \arg \min _{y\in \mathbb {R}^m}\{g(y)+\langle y,\nabla _yH(x_{k+1},v_{2k})\rangle +\frac{1}{2\mu _k}\Vert y-u_{2k}\Vert ^2_2\},\\ \end{array}\right. } \end{aligned}$$
(1.5)

where \(\alpha _{1k},\alpha _{2k},\beta _{1k},\beta _{2k}\in \left[ 0,1 \right] \). Then, Gao et al. [14] presented a Gauss–Seidel type inertial proximal alternating linearized minimization (GiPALM) algorithm, in which the inertial step is performed whenever the x- or y-subproblem is updated. In order to use the existing information as much as possible and further improve the numerical performance, Wang et al. [15] proposed a new inertial proximal alternating linearized minimization (NiPALM) algorithm, which inherits the advantages of both iPALM and GiPALM.
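To make (1.4) concrete, note that its x-subproblem is exactly a proximal-gradient step, \(x_{k+1}=\textrm{prox}_{\lambda _kf}(x_k-\lambda _k\nabla _xH(x_k,y_k))\), and similarly for the y-subproblem. A minimal Python sketch of one PALM iteration with user-supplied callables (placeholder names, not code from [11]) reads:

```python
def palm_step(x, y, grad_x_H, grad_y_H, prox_f, prox_g, lam, mu):
    """One PALM iteration (1.4) written as two proximal-gradient steps.
    grad_x_H(x, y), grad_y_H(x, y) evaluate the partial gradients of H;
    prox_f(v, t), prox_g(v, t) evaluate prox_{t*f}(v) and prox_{t*g}(v)."""
    x_new = prox_f(x - lam * grad_x_H(x, y), lam)    # x-update of (1.4)
    y_new = prox_g(y - mu * grad_y_H(x_new, y), mu)  # y-update of (1.4)
    return x_new, y_new
```

The inertial variant (1.5) differs only in that the gradient is evaluated at the extrapolated point \(v_{1k}\) and the proximal term is centered at \(u_{1k}\) (and analogously for the y-update).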

Bregman distance regularization is an effective way to improve the numerical performance of such algorithms. In [16], the authors constructed the following two-step inertial Bregman alternating minimization (TiBAM) algorithm using the information of the previous three iterates:

$$\begin{aligned} {\left\{ \begin{array}{ll} x_{k+1}\!\in \! \arg \min _{ x\in \mathbb {R}^l}\{\Phi (x,y_k)\!+\!D_{\phi _1}(x,x_k)\!+\!\alpha _{1k} \langle x,x_{k-1}\!-\!x_k\rangle \!+\!\alpha _{2k} \langle x,x_{k-2}\!-\!x_{k-1}\rangle \},\\ y_{k+1}\!\in \! \arg \min _{y\in \mathbb {R}^m}\{\Phi (x_{k+1},y)\!+\!D_{\phi _2}(y,y_k)\!+\!\beta _{1k} \langle y,y_{k-1}\!-\!y_k\rangle \!+\!\beta _{2k} \langle y,y_{k-2}\!-\!y_{k-1}\rangle \}, \end{array}\right. } \end{aligned}$$
(1.6)

where \(D_{\phi _i}\) (\(i=1,2\)) denotes the Bregman distance with respect to \(\phi _i\) (\(i=1,2\)). By linearizing \(H(x,y)\) in the TiBAM algorithm, the authors of [17] proposed the following two-step inertial Bregman proximal alternating linearized minimization (TiBPALM) algorithm:

$$\begin{aligned} {\left\{ \begin{array}{ll} x_{k+1}\in \arg \min _{ x\in \mathbb {R}^l}\{f(x)+\langle x,\nabla _xH(x_k,y_k)\rangle +D_{\phi _1}(x,x_k)+\alpha _{1k} \langle x,x_{k-1}-x_k\rangle \\ \qquad \qquad \qquad \qquad \qquad +\alpha _{2k} \langle x,x_{k-2}-x_{k-1}\rangle \},\\ y_{k+1}\in \arg \min _{y\in \mathbb {R}^m}\{g(y)\!+\!\langle y,\nabla _yH(x_{k+1},y_k)\rangle \!+\!D_{\phi _2}(y,y_k)\!+\!\beta _{1k} \langle y,y_{k-1}\!-\!y_k\rangle \\ \qquad \qquad \qquad \qquad \qquad +\beta _{2k} \langle y,y_{k-2}-y_{k-1}\rangle \}. \end{array}\right. } \end{aligned}$$
(1.7)

If we take \(\phi _1(x)=\frac{1}{2\lambda }\Vert x\Vert ^2_2\) and \(\phi _2(y)=\frac{1}{2\mu }\Vert y\Vert ^2_2\) for all \(x\in \mathbb {R}^l\) and \(y\in \mathbb {R}^m\), then (1.7) becomes the two-step inertial proximal alternating linearized minimization (TiPALM) algorithm. Moreover, based on the alternating minimization algorithm, Chao et al. [18] proposed the inertial alternating minimization with Bregman distance (BIAM) algorithm. Other related work can be found in [19, 20] and the references therein.
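Indeed, with \(\phi _1(x)=\frac{1}{2\lambda }\Vert x\Vert ^2_2\) one has \(\nabla \phi _1(x_k)=\frac{1}{\lambda }x_k\), so a one-line computation gives

$$D_{\phi _1}(x,x_k)=\frac{1}{2\lambda }\Vert x\Vert ^2_2-\frac{1}{2\lambda }\Vert x_k\Vert ^2_2-\left\langle \frac{1}{\lambda }x_k,x-x_k\right\rangle =\frac{1}{2\lambda }\Vert x-x_k\Vert ^2_2,$$

that is, the Bregman term in (1.7) reduces to the usual quadratic proximal term of (1.4), which is exactly why TiPALM is recovered as a special case.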

It should be noted that all these works concern deterministic methods, i.e., no randomness is involved. But when the dimension of the data is very large, the cost of computing the full gradient of the function \(H(x,y)\) is often prohibitively expensive. In order to overcome this difficulty, stochastic gradient approximations have been applied (see, e.g., [21] and the references therein). A block stochastic gradient iteration combining a simple stochastic gradient descent (SGD) estimator with PALM was first proposed by Xu and Yin [22]. To weaken the assumptions on the objective function in [22] and to improve the estimates on the convergence rate of a stochastic PALM algorithm, Driggs et al. [23] used more sophisticated, so-called variance-reduced gradient estimators instead of the simple SGD estimator and proposed the following stochastic proximal alternating linearized minimization (SPRING) algorithm:

$$\begin{aligned} {\left\{ \begin{array}{ll} x_{k+1}\in \arg \min _{ x\in \mathbb {R}^l}\{f(x)+\langle x,\widetilde{\nabla }_x(x_k,y_k)\rangle +\frac{1}{2\lambda _k}\Vert x-x_k\Vert ^2_2\},\\ y_{k+1}\in \arg \min _{y\in \mathbb {R}^m}\{g(y)+\langle y,\widetilde{\nabla }_y(x_{k+1},y_k)\rangle +\frac{1}{2\mu _k}\Vert y-y_k\Vert ^2_2\}.\\ \end{array}\right. } \end{aligned}$$
(1.8)

The key of the SPRING algorithm is to replace the full gradient computations \(\nabla _x H(x_k,y_k)\) and \(\nabla _yH(x_{k+1},y_k)\) with stochastic estimates \(\widetilde{\nabla }_x(x_k,y_k)\) and \(\widetilde{\nabla }_y(x_{k+1},y_k)\), respectively. Then, Hertrich et al. [24] introduced the following inertial variant of a stochastic PALM algorithm with a variance-reduced gradient estimator, called SiPALM:

$$\begin{aligned} {\left\{ \begin{array}{ll} u_{1k}=x_k+\alpha _{1k}(x_k-x_{k-1}), v_{1k}=x_k+\beta _{1k}(x_k-x_{k-1}),\\ x_{k+1}\in \arg \min _{ x\in \mathbb {R}^l}\{f(x)+\langle x,\widetilde{\nabla }_x(v_{1k},y_k)\rangle +\frac{1}{2\lambda _k}\Vert x-u_{1k}\Vert ^2_2\},\\ u_{2k}=y_k+\alpha _{2k}(y_k-y_{k-1}), v_{2k}=y_k+\beta _{2k}(y_k-y_{k-1}),\\ y_{k+1}\in \arg \min _{y\in \mathbb {R}^m}\{g(y)+\langle y,\widetilde{\nabla }_y(x_{k+1},v_{2k})\rangle +\frac{1}{2\mu _k}\Vert y-u_{2k}\Vert ^2_2\},\\ \end{array}\right. } \end{aligned}$$
(1.9)

where \(\alpha _{1k},\alpha _{2k},\beta _{1k},\beta _{2k}\in \left[ 0,1 \right] \). In addition, several variance-reduced gradient estimators have been proposed for nonconvex optimization problems; the classical stochastic gradient direction is modified in various ways so as to drive the variance of the gradient estimator towards zero, as in SAG [25], SVRG [26, 27], SAGA [28], and SARAH [29, 30].

In this paper, we combine the inertial technique, Bregman distance, and stochastic gradient estimators to develop a stochastic two-step inertial Bregman proximal alternating linearized minimization (STiBPALM) algorithm to solve the nonconvex optimization problem (1.1). Our contributions are listed as follows:

(1):

We propose the STiBPALM algorithm with variance-reduced stochastic gradient estimators to solve the nonconvex optimization problem (1.1). We also show in the appendix that the SAGA and SARAH estimators are variance-reduced gradient estimators in the sense of Definition 3.4.

(2):

We provide a theoretical analysis showing that the proposed algorithm with a variance-reduced stochastic gradient estimator is globally convergent in expectation. Under the expectation version of the Kurdyka–Łojasiewicz (KŁ) property, the sequence generated by the proposed algorithm converges to a critical point, and a general convergence rate is also obtained.

(3):

We use several well-studied stochastic gradient estimators (e.g., SGD, SAGA, and SARAH) to test the performance of STiBPALM on sparse nonnegative matrix factorization and blind image-deblurring problems. Compared with some existing algorithms in the literature (e.g., PALM, iPALM, SPRING, and SiPALM), we report some preliminary numerical results to demonstrate the effectiveness of the proposed algorithm.

This paper is organized as follows. In Sect. 2, we recall some concepts and important lemmas which will be used in the proof of the main results. Section 3 introduces our STiBPALM algorithm in detail. We discuss the convergence behavior of STiBPALM in Sect. 4. In Sect. 5, we perform some numerical experiments and compare the results with other algorithms. In the appendix, we give the detailed theoretical analysis showing that the SAGA and SARAH stochastic gradient estimators are variance-reduced.

2 Preliminaries

In this section, we summarize some useful definitions and lemmas.

Definition 2.1

Let \(F: \mathbb {R}^d \rightarrow (-\infty ,+\infty ]\) be a proper and lower semicontinuous function. For \(x\in \) domF, the Fréchet subdifferential of F at x, written \(\hat{\partial }F(x)\), is the set of vectors \(v\in \mathbb {R}^d\) which satisfy

$$\liminf _{y\rightarrow x}\frac{1}{\Vert x-y\Vert _2}[F(y)-F(x)-\langle v,y-x\rangle ]\ge 0.$$

If \(x\not \in \) domF, then \(\hat{\partial }F(x)=\emptyset \). The limiting-subdifferential, or simply the subdifferential for short, of F at \(x\in \) domF, written \(\partial F(x)\), is defined as follows:

$$\partial F(x):=\{v\in \mathbb {R}^d: \exists x_k\rightarrow x, F(x_k)\rightarrow F(x), v_k\in \hat{\partial }F(x_k), v_k\rightarrow v\}.$$

Remark 2.1

  1. (a)

    The above definition implies that \(\hat{\partial }F(x)\subseteq \partial F(x)\) for each \(x\in \mathbb {R}^d\), where the first set is convex and closed while the second one is closed (see [31]).

  2. (b)

    (Closedness of \(\partial F\)) Let \(\{x_k\}_{k\in \mathbb {N}}\) and \(\{v_k\}_{k\in \mathbb {N}}\) be sequences in \(\mathbb {R}^d\) such that \(v_k \in \partial F(x_k)\) for all \(k\in \mathbb {N}\). If \((x_k,v_k)\rightarrow (x,v)\) and \(F(x_k)\rightarrow F(x)\) as \(k\rightarrow \infty \), then \(v \in \partial F (x)\).

  3. (c)

    If \(F: \mathbb {R}^d \rightarrow (-\infty ,+\infty ]\) is proper and lower semicontinuous and \(H: \mathbb {R}^d \rightarrow \mathbb {R}\) is a continuously differentiable function, then \(\partial (F+H)(x) = \partial F(x)+\nabla H(x)\) for all \(x \in \mathbb {R}^d\).

  4. (d)

    A necessary (but not sufficient) condition for \(x\in \mathbb {R}^d\) to be a minimizer of F is

    $$0\in \partial F(x).$$

    A point satisfying \(0\in \partial F(x)\) is called limiting-critical or simply critical. The set of critical points of F is denoted by critF.

Definition 2.2

(Kurdyka–Łojasiewicz property [12]) Let \(F: \mathbb {R}^d \rightarrow (-\infty ,+\infty ]\) be a proper and lower semicontinuous function.

  1. (i)

    The function \(F: \mathbb {R}^d \rightarrow (-\infty ,+\infty ]\) is said to have the Kurdyka–Łojasiewicz (KŁ) property at \(x^*\in \)domF if there exist \(\eta \in (0,+\infty ]\), a neighborhood U of \(x^*\), and a continuous concave function \(\varphi :[0,\eta )\rightarrow \mathbb {R}_{+}\) such that \(\varphi (0)=0\), \(\varphi \) is \(C^1\) on \((0,\eta )\), \(\varphi '(s)>0\) for all \(s\in (0,\eta )\), and for all x in \(U\cap [F(x^*)<F<F(x^*)+\eta ]\), the Kurdyka–Łojasiewicz inequality holds:

    $$\varphi '(F(x)-F(x^*))\textrm{dist}(0,\partial F(x))\ge 1.$$
  2. (ii)

    Proper lower semicontinuous functions which satisfy the Kurdyka–Łojasiewicz inequality at each point of the domain of their subdifferential are called Kurdyka–Łojasiewicz (KŁ) functions.

Roughly speaking, KŁ functions become sharp up to a reparameterization via \(\varphi \), a desingularizing function for F. Typical KŁ functions include the class of semialgebraic functions [32, 33]; for instance, the \(l _0\) pseudonorm and the rank function are KŁ. Semialgebraic functions admit desingularizing functions of the form \(\varphi (r)=ar^{1-\vartheta } \) with \(a > 0\), where \(\vartheta \in [0, 1)\) is known as the KŁ exponent of the function [11, 32]. For these functions, the KŁ inequality reads

$$\begin{aligned} (F(x)-F(x^*))^\vartheta \le C\left\| \xi \right\| ,\ \ \forall \xi \in \partial F(x) \end{aligned}$$
(2.1)

for some \(C>0\).
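As a simple illustration (our own example), consider \(F(x)=\frac{1}{2}\Vert x-x^*\Vert ^2\): then \(\nabla F(x)=x-x^*\) and

$$\left( F(x)-F(x^*)\right) ^{\frac{1}{2}}=\frac{1}{\sqrt{2}}\Vert x-x^*\Vert =\frac{1}{\sqrt{2}}\Vert \nabla F(x)\Vert ,$$

so (2.1) holds with \(\vartheta =\frac{1}{2}\) and \(C=\frac{1}{\sqrt{2}}\).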

Definition 2.3

A function F is said to be convex if domF is a convex set and if, for all x, \(y\in \)domF and \(\alpha \in [0,1]\),

$$F(\alpha x+(1-\alpha )y)\le \alpha F(x)+(1-\alpha )F(y).$$

F is said to be \(\theta \)-strongly convex with \(\theta > 0\) if \(F-\frac{\theta }{2}\Vert \cdot \Vert ^2\) is convex, i.e.,

$$F(\alpha x+(1-\alpha )y)\le \alpha F(x)+(1-\alpha )F(y)-\frac{1}{2}\theta \alpha (1-\alpha )\Vert x-y\Vert ^2$$

for all x, \(y\in \)domF and \(\alpha \in [0,1]\).

Suppose that the function F is differentiable. Then, F is convex if and only if domF is a convex set and

$$F(x)\ge F(y)+\langle \nabla F(y),x-y\rangle $$

holds for all x, \(y\in \)domF. Moreover, F is \(\theta \)-strongly convex with \(\theta > 0\) if and only if

$$F(x)\ge F(y)+\langle \nabla F(y),x-y\rangle +\frac{\theta }{2}\Vert x-y\Vert ^2$$

for all x, \(y\in \)domF.

Definition 2.4

Let \(\phi :\mathbb {R}^d \rightarrow (-\infty ,+\infty ]\) be a convex and Gâteaux differentiable function. The function \(D_\phi :\) dom\(\phi \,\,\times \) intdom\(\phi \rightarrow [0,+\infty )\), defined by

$$D_\phi (x,y)=\phi (x)-\phi (y)-\langle \nabla \phi (y),x-y\rangle ,$$

is called the Bregman distance with respect to \(\phi \).

From the above definition, it follows that

$$\begin{aligned} D_\phi (x,y)\ge \frac{\theta }{2}\Vert x-y\Vert ^2, \end{aligned}$$
(2.2)

if \(\phi \) is \(\theta \)-strongly convex.
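Indeed, combining Definition 2.4 with the gradient characterization of \(\theta \)-strong convexity in Definition 2.3 gives

$$D_\phi (x,y)=\phi (x)-\phi (y)-\langle \nabla \phi (y),x-y\rangle \ge \frac{\theta }{2}\Vert x-y\Vert ^2.$$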

Lemma 2.1

(Descent lemma [34]) Let \(F: \mathbb {R}^{d}\rightarrow \mathbb {R}\) be a continuously differentiable function with gradient \(\nabla F\) assumed L-Lipschitz continuous. Then,

$$\begin{aligned} \left| F(y)-F(x)-\left\langle y-x,\nabla F(x) \right\rangle \right| \le \frac{L }{2}\left\| x-y \right\| ^{2},\ \forall x,y\in \mathbb R^{d}. \end{aligned}$$
(2.3)

Lemma 2.2

Let \(F:\mathbb R^{d}\rightarrow \mathbb R\) be a function with L-Lipschitz continuous gradient, \(G:\mathbb R^{d}\rightarrow (-\infty ,+\infty ]\) a proper lower semicontinuous function, and \(z\in \arg \min _{v\in \mathbb {R}^d}\{G(v)+\langle d,v-x\rangle +D_{\phi }(v,x)+\gamma \langle v,u\rangle +\mu \langle v,w\rangle \}\), where \(D_{\phi }\) denotes the Bregman distance with respect to \(\phi \), and x, d, u, \(w\in \mathbb R^{d}\). Then, for all \(y\in \mathbb R^{d}\),

$$\begin{aligned} F(z)+G(z)\le&F(y)+G(y)+\left\langle \nabla F(x)-d,z-y \right\rangle +\frac{L}{2} \left\| x-y \right\| ^2+D_{\phi }(y,x)\nonumber \\&+\frac{L}{2} \left\| z-x \right\| ^2-D_{\phi }(z,x)+\gamma \langle y-z,u\rangle +\mu \langle y-z,w\rangle . \end{aligned}$$
(2.4)

Proof

By Lemma 2.1, we have the inequalities

$$\begin{aligned}&F(x)-F(y)\le \left\langle \nabla F(x), x-y\right\rangle +\frac{L}{2} \left\| x-y \right\| ^2,\\&F(z)-F(x)\le \left\langle \nabla F(x), z-x\right\rangle +\frac{L}{2} \left\| z-x \right\| ^2, \end{aligned}$$

which implies that

$$\begin{aligned} F(z)\le F(y)+\left\langle \nabla F(x), z-y\right\rangle +\frac{L}{2} \left\| x-y \right\| ^2+\frac{L}{2} \left\| z-x \right\| ^2. \end{aligned}$$
(2.5)

Furthermore, by the definition of z, taking \(v=y\), we obtain

$$\begin{aligned}&G(z)+\langle d,z-x\rangle +D_{\phi }(z,x)+\gamma \langle z,u\rangle +\mu \langle z,w\rangle \\ \le&G(y)+\langle d,y-x\rangle +D_{\phi }(y,x)+\gamma \langle y,u\rangle +\mu \langle y,w\rangle , \end{aligned}$$

which implies that

$$\begin{aligned} G(z)\le G(y)+\langle d,y-z\rangle +D_{\phi }(y,x)-D_{\phi }(z,x)+\gamma \langle y-z,u\rangle +\mu \langle y-z,w\rangle . \end{aligned}$$
(2.6)

Adding (2.5) and (2.6) completes the proof. \(\square \)

Lemma 2.3

(Sufficient decrease property) Let F, G, and z be defined as in Lemma 2.2, where x, d, u, \(w\in \mathbb R^{d}\). Assume that \(\phi \) is \(\theta \)-strongly convex. Then, the following inequality holds for any \(\lambda >0\):

$$\begin{aligned} F(z)+G(z)\le&F(x)+G(x)+\frac{1}{2L\lambda }\left\| d-\nabla F(x) \right\| ^2 +\frac{L(\lambda +1) -\theta }{2} \left\| x-z \right\| ^2\nonumber \\&+\gamma \langle x-z,u\rangle +\mu \langle x-z,w\rangle . \end{aligned}$$
(2.7)

Proof

From Lemma 2.2 with \(y=x\), we have

$$\begin{aligned} F(z)+G(z)\le&F(x)+G(x)+\left\langle \nabla F(x)-d,z-x \right\rangle +\frac{L}{2} \left\| x-z \right\| ^2\\&-D_{\phi }(z,x)+\gamma \langle x-z,u\rangle +\mu \langle x-z,w\rangle . \end{aligned}$$

Using Young’s inequality \(\left\langle \nabla F(x)-d,z-x \right\rangle \le \frac{1}{2L\lambda }\left\| d-\nabla F(x) \right\| ^2 +\frac{L\lambda }{2} \left\| x-z \right\| ^2\) and (2.2), we can obtain

$$\begin{aligned} F(z)+G(z)\le&F(x)+G(x)+\frac{1}{2L\lambda }\left\| d-\nabla F(x) \right\| ^2 +\frac{L\lambda }{2} \left\| x-z \right\| ^2+\frac{L}{2} \left\| x-z \right\| ^2\\&-\frac{\theta }{2} \left\| z-x \right\| ^2+\gamma \langle x-z,u\rangle +\mu \langle x-z,w\rangle , \end{aligned}$$

which simplifies to the desired result. \(\square \)

3 Stochastic two-step inertial Bregman proximal alternating linearized minimization algorithm

Throughout this paper, we impose the following assumptions.

Assumption 3.1

  1. (i)

    The function \(\Phi \) is bounded from below, i.e., there exists \(\underline{\Phi }\in \mathbb {R}\) such that \(\Phi (x,y)\ge \underline{\Phi }\) for all \((x,y)\in \mathbb R^{l}\times \mathbb R^{m}\).

  2. (ii)

    For any fixed y, the partial gradient \(\nabla _{x} H_i(\cdot ,y)\) is globally Lipschitz with modulus \(L_y\) for all \(i\in \left\{ 1,\dots ,n \right\} \), that is,

    $$\left\| \nabla _{x} H_i\left( x_{1},y \right) - \nabla _{x} H_i\left( x_{2},y \right) \right\| \le L_y\left\| x_{1}-x_{2} \right\| , \ \forall x_{1},x_{2} \in \mathbb R^{l}. $$

    Likewise, for any fixed x, the partial gradient \(\nabla _{y} H_i(x,\cdot )\) is globally Lipschitz with modulus \(L_x\), that is,

    $$\left\| \nabla _{y} H_i\left( x,y_{1} \right) - \nabla _{y} H_i\left( x,y_{2} \right) \right\| \le L_x\left\| y_{1}-y_{2} \right\| , \ \forall y_{1},y_{2} \in \mathbb R^{m}. $$
  3. (iii)

    \(\nabla H\) is Lipschitz continuous on bounded subsets of \(\mathbb R^{l}\times \mathbb R^{m}\). In other words, for each bounded subset \(B_1\times B_2\) of \(\mathbb R^{l}\times \mathbb R^{m}\), there exists \(M_{B_1\times B_2} > 0\) such that

    $$\begin{aligned} \left\| \left( \nabla _{x} H\left( x_{1} ,y_1 \right) \!-\! \nabla _{x} H\left( x_{2},y_2 \right) ,\nabla _{y} H\left( x_1,y_{1} \right) \!-\! \nabla _{y} H\left( x_2,y_{2} \right) \right) \right\| \!\le \! M_{B_1\times B_2}\left\| \left( x_{1}\!-\!x_{2},y_{1}\!-\!y_{2} \right) \right\| \end{aligned}$$

    for all \(( x_{1},y_1), ( x_{2},y_2)\in B_1\times B_2\).

  4. (iv)

    \(\phi _i\) (\(i=1,2\)) is a \(\theta _i\)-strongly convex and differentiable function, and its gradient \(\nabla \phi _i\) is \(\eta _i\)-Lipschitz continuous, i.e.,

    $$\begin{aligned}&\left\| \nabla {\phi _1}(x_1) -\nabla {\phi _1}(x_2)\right\| \le \eta _1 \Vert x_{1}-x_{2}\Vert ,\ \forall x_{1} ,x_{2}\in \mathbb R^{l},\\&\left\| \nabla {\phi _2}(y_1) -\nabla {\phi _2}(y_2)\right\| \le \eta _2 \Vert y_{1} -y_{2}\Vert ,\ \ \forall y_{1} ,y_{2}\in \mathbb R^{m}. \end{aligned}$$

We now introduce a stochastic version of the two-step inertial Bregman proximal alternating linearized minimization algorithm. The key of our algorithm is to replace the full gradient computations \(\nabla _x H(u_k,y_k)\) and \(\nabla _yH(x_{k+1},v_k)\) with stochastic estimates \(\widetilde{\nabla }_x(u_k,y_k)\) and \(\widetilde{\nabla }_y(x_{k+1},v_k)\), respectively. We describe the resulting algorithm as follows.

Algorithm 3.1

Choose \((x_0,y_0)\in \)dom\(\Phi \) and set \((x_{-i},y_{-i})=(x_0,y_0)\), \(i=1, 2\). Take the sequences \(\{\gamma _{1k}\}\), \(\{\mu _{1k}\}\subseteq [0,\gamma _1]\), \(\{\gamma _{2k}\}\), \(\{\mu _{2k}\}\subseteq [0,\gamma _2]\), \(\{\alpha _{1k}\}\), \(\{\beta _{1k}\}\subseteq [0,\alpha _1]\) and \(\{\alpha _{2k}\}\), \(\{\beta _{2k}\}\subseteq [0,\alpha _2]\), where \(\gamma _1\ge 0\), \(\gamma _2\ge 0\), \(\alpha _1\ge 0\) and \(\alpha _2\ge 0\). For \(k\ge 0\), let

$$\begin{aligned} {\left\{ \begin{array}{ll} u_k=x_k+\gamma _{1k}(x_k-x_{k-1})+\gamma _{2k}(x_{k-1}-x_{k-2}),\\ x_{k+1}\in \arg \min _{ x\in \mathbb {R}^l}\{f(x)+\langle x,\widetilde{\nabla }_x(u_k,y_k)\rangle +D_{\phi _1}(x,x_k)+\alpha _{1k} \langle x,x_{k-1}-x_k\rangle \\ \qquad \qquad \qquad \qquad \qquad +\alpha _{2k} \langle x,x_{k-2}-x_{k-1}\rangle \},\\ v_k=y_k+\mu _{1k}(y_k-y_{k-1})+\mu _{2k}(y_{k-1}-y_{k-2}),\\ y_{k+1}\in \arg \min _{y\in \mathbb {R}^m}\{g(y)+\langle y,\widetilde{\nabla }_y(x_{k+1},v_k)\rangle +D_{\phi _2}(y,y_k)+\beta _{1k} \langle y,y_{k-1}-y_k\rangle \\ \qquad \qquad \qquad \qquad \qquad +\beta _{2k} \langle y,y_{k-2}-y_{k-1}\rangle \}, \end{array}\right. } \end{aligned}$$
(3.1)

where \(D_{\phi _1}\) and \(D_{\phi _2}\) denote the Bregman distance with respect to \(\phi _1\) and \(\phi _2\), respectively.

Stochastic gradients \(\widetilde{\nabla }_x(u_k,y_k)\) and \(\widetilde{\nabla }_y(x_{k+1},v_k)\) use the gradients of only a few indices, \(\nabla _xH_i(u_k,y_k)\) and \(\nabla _yH_i(x_{k+1},v_k)\) for \(i \in B_k \subset \left\{ 1,2,\dots , n \right\} \). The minibatch \(B_k\) is chosen uniformly at random from all subsets of \(\left\{ 1,2,\dots , n \right\} \) with cardinality b. The simplest such estimator is the stochastic gradient descent (SGD) estimator [35]. While the SGD estimator is not variance-reduced, many popular gradient estimators, such as the SAGA [28] and SARAH [29, 30] estimators, have this property. In this paper, we mainly consider the SAGA (Appendix A) and SARAH (Appendix B) gradient estimators.
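For the quadratic choices \(\phi _1=\frac{1}{2\lambda }\Vert \cdot \Vert ^2_2\) and \(\phi _2=\frac{1}{2\mu }\Vert \cdot \Vert ^2_2\), each subproblem in (3.1) reduces to a proximal step and the linear inertial terms fold into the gradient direction. The following Python sketch of one iteration of Algorithm 3.1 is our own illustration under this assumption; est_x, est_y, prox_f, prox_g are user-supplied placeholders for the gradient estimators and the proximal maps of f and g:

```python
def stibpalm_step(xs, ys, est_x, est_y, prox_f, prox_g, lam, mu,
                  g1, g2, m1, m2, a1, a2, b1, b2):
    """One iteration of Algorithm 3.1 for phi_1 = ||.||^2/(2*lam) and
    phi_2 = ||.||^2/(2*mu).  xs = (x_k, x_{k-1}, x_{k-2}) and
    ys = (y_k, y_{k-1}, y_{k-2}) hold the three most recent iterates."""
    xk, xk1, xk2 = xs
    yk, yk1, yk2 = ys
    u = xk + g1 * (xk - xk1) + g2 * (xk1 - xk2)               # u_k in (3.1)
    dx = est_x(u, yk) + a1 * (xk1 - xk) + a2 * (xk2 - xk1)    # linearized part
    x_new = prox_f(xk - lam * dx, lam)                        # x-update
    v = yk + m1 * (yk - yk1) + m2 * (yk1 - yk2)               # v_k in (3.1)
    dy = est_y(x_new, v) + b1 * (yk1 - yk) + b2 * (yk2 - yk1)
    y_new = prox_g(yk - mu * dy, mu)                          # y-update
    return x_new, y_new
```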

Definition 3.1

(SGD [35]) The SGD gradient estimator \(\widetilde{\nabla }_x^{SGD}(x_k,y_k)\) is defined as follows:

$$\begin{aligned} \widetilde{\nabla }_x^{SGD}(x_k,y_k)=\frac{1}{b}\sum _{i\in B_k} \nabla _xH_i(x_k,y_k), \end{aligned}$$

where \(B_k\) are mini-batches containing b indices.

The SGD gradient estimator uses the gradient of a randomly sampled batch to represent the full gradient.
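A minimal NumPy sketch of this estimator (the list-of-callables interface grads_x is an assumption made only for illustration) is:

```python
import numpy as np

def sgd_estimator(grads_x, x, y, b, rng=np.random.default_rng()):
    """SGD estimator of Definition 3.1: average the partial gradients
    grad_x H_i over a uniformly drawn mini-batch B_k of size b.
    grads_x is a list of callables [grad_x H_1, ..., grad_x H_n]."""
    batch = rng.choice(len(grads_x), size=b, replace=False)
    return sum(grads_x[i](x, y) for i in batch) / b
```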

Definition 3.2

(SAGA [28]) The SAGA gradient estimator \(\widetilde{\nabla }_x^{SAGA}(x_k,y_k)\) is defined as follows:

$$\begin{aligned} \widetilde{\nabla }_x^{SAGA}(x_k,y_k)=\frac{1}{b}\sum _{i\in B_k}\left( \nabla _xH_i(x_k,y_k)- \nabla _xH_i(\varphi _{k}^{i},y_{k}) \right) + \frac{1}{n}\sum _{j=1}^n\nabla _xH_j(\varphi _{k}^{j},y_{k}), \end{aligned}$$

where \(B_k\) are mini-batches containing b indices. The variables \(\varphi _{k}^{i}\) follow the update rules \(\varphi _{k+1}^{i}=x_k\) if \(i\in B_k\) and \(\varphi _{k+1}^{i}=\varphi _{k}^{i}\) otherwise.
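The sketch below is a literal (not cost-optimized) transcription of Definition 3.2 for the x-block, again with an assumed list-of-callables interface; storing the points \(\varphi _{k}^{i}\) and re-evaluating their gradients at the current \(y_k\) follows the definition verbatim and is intended only as an illustration:

```python
import numpy as np

class SagaEstimatorX:
    """SAGA estimator of Definition 3.2 for the x-block."""

    def __init__(self, grads_x, x0, b, rng=np.random.default_rng()):
        self.grads_x = grads_x                      # [grad_x H_1, ..., grad_x H_n]
        self.phi = [np.copy(x0) for _ in grads_x]   # phi_0^i = x_0
        self.b = b
        self.rng = rng

    def __call__(self, x, y):
        n = len(self.grads_x)
        batch = self.rng.choice(n, size=self.b, replace=False)
        correction = sum(self.grads_x[i](x, y) - self.grads_x[i](self.phi[i], y)
                         for i in batch) / self.b
        anchor = sum(self.grads_x[j](self.phi[j], y) for j in range(n)) / n
        for i in batch:                             # phi_{k+1}^i = x_k for i in B_k
            self.phi[i] = np.copy(x)
        return correction + anchor
```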

Definition 3.3

(SARAH [29, 30]) The SARAH gradient estimator reads for \(k = 0\) as

$$\widetilde{\nabla }_x^{SARAH}(x_0,y_0)=\nabla _xH(x_0,y_0).$$

For \(k = 1, 2,\dots \), we define random variables \(p_k\in \left\{ 0,1 \right\} \) with \(P(p_k=0)=\frac{1}{p}\) and \(P(p_k=1)=1-\frac{1}{p}\), where \(p \in (1,\infty )\) is a fixed chosen parameter. Let \(B_k\) be a random subset uniformly drawn from \(\left\{ 1,\dots , n \right\} \) of fixed batch size b. Then, for \(k= 1, 2,\dots \), the SARAH gradient estimator reads as

$$\begin{aligned}&\widetilde{\nabla }_x^{SARAH}(x_{k},y_{k})\\ =&{\left\{ \begin{array}{ll} \nabla _xH(x_k,y_k),&{}\!\!\!\!\text { if } p_k\!=\!0, \\ \frac{1}{b}\sum _{i\in B_k}\left( \nabla _xH_i(x_k,y_k)\!-\! \nabla _xH_i(x_{k-1},y_{k-1}) \right) \!+\!\widetilde{\nabla }_x^{SARAH}(x_{k-1},y_{k-1}),&{}\!\!\!\! \text { if } p_k\!=\!1. \end{array}\right. } \end{aligned}$$
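A corresponding sketch for the x-block of the SARAH estimator, under the same assumed interface, is given below; the previous point and the previous estimate are kept internally, and the restart branch corresponds to \(p_k=0\):

```python
import numpy as np

class SarahEstimatorX:
    """SARAH estimator of Definition 3.3 for the x-block."""

    def __init__(self, grads_x, b, p, rng=np.random.default_rng()):
        self.grads_x, self.b, self.p, self.rng = grads_x, b, p, rng
        self.prev_point = None    # (x_{k-1}, y_{k-1})
        self.prev_est = None      # previous SARAH estimate

    def full_grad(self, x, y):
        return sum(g(x, y) for g in self.grads_x) / len(self.grads_x)

    def __call__(self, x, y):
        if self.prev_est is None or self.rng.random() < 1.0 / self.p:
            est = self.full_grad(x, y)              # k = 0 or p_k = 0: restart
        else:                                       # p_k = 1: recursive update
            xp, yp = self.prev_point
            batch = self.rng.choice(len(self.grads_x), size=self.b,
                                    replace=False)
            est = self.prev_est + sum(self.grads_x[i](x, y)
                                      - self.grads_x[i](xp, yp)
                                      for i in batch) / self.b
        self.prev_point, self.prev_est = (np.copy(x), np.copy(y)), est
        return est
```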

In our analysis, we assume that the stochastic gradient estimator used in Algorithm 3.1 is variance-reduced, which is a quite general assumption in stochastic gradient algorithms [23, 24]. The following definition is analogous to Definition 2.1 in [23].

Definition 3.4

(Variance-reduced gradient estimator) Let \(\left\{ z_k \right\} _{k\in \mathbb {N} }=\left\{ (x_k,y_k)\right\} _{k\in \mathbb {N} }\) be the sequence generated by Algorithm 3.1 with some gradient estimator \(\widetilde{\nabla }\). This gradient estimator is called variance-reduced with constants \(V_1,V_2,V_\Upsilon \ge 0\), and \(\rho \in (0,1]\) if it satisfies the following conditions:

  1. (i)

    (MSE bound) There exists a sequence of random variables \(\left\{ \Upsilon _k \right\} _{k\in \mathbb {N} }\) of the form \(\Upsilon _k=\sum _{i=1}^{s} (v_{k}^{i} )^2\) for some nonnegative random variables \(v_{k}^{i}\in \mathbb {R} \) such that

    $$\begin{aligned}&\mathbb {E}_k\left[ \left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| ^2+\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| ^2\right] \nonumber \\ \le&\Upsilon _k\!+\!V_1\left( \mathbb {E}_k\left\| z_{k+1}\!-\!z_{k} \right\| ^{2}\!+\!\left\| z_{k}\!-\!z_{k-1} \right\| ^{2} \!+\!\left\| z_{k-1}\!-\!z_{k-2} \right\| ^{2}\!+\!\left\| z_{k-2}\!-\!z_{k-3} \right\| ^{2}\right) , \end{aligned}$$
    (3.2)

    and, with \(\Gamma _k=\sum _{i=1}^{s} v_{k}^{i} \),

    $$\begin{aligned}&\mathbb {E}_k\left[ \left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| +\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| \right] \nonumber \\ \le&\Gamma _k+V_2\left( \mathbb {E}_k\left\| z_{k+1}-z_{k} \right\| +\left\| z_{k}-z_{k-1} \right\| +\left\| z_{k-1}-z_{k-2} \right\| +\left\| z_{k-2}-z_{k-3} \right\| \right) . \end{aligned}$$
    (3.3)
  2. (ii)

    (Geometric decay) The sequence \(\left\{ \Upsilon _k \right\} _{k\in \mathbb {N} }\) decays geometrically:

    $$\begin{aligned} \mathbb {E}_k\Upsilon _{k+1}\le&(1-\rho )\Upsilon _k+V_\Upsilon \left( \mathbb {E}_k\left\| z_{k+1}-z_{k} \right\| ^{2}+\left\| z_{k}-z_{k-1} \right\| ^{2}+\left\| z_{k-1}-z_{k-2} \right\| ^{2}\right. \nonumber \\&\left. +\left\| z_{k-2}-z_{k-3} \right\| ^{2}\right) . \end{aligned}$$
    (3.4)
  3. (iii)

    (Convergence of estimator) If \(\left\{ z_k \right\} _{k\in \mathbb {N} }\) satisfies \(\lim _{k \rightarrow \infty } \mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^{2}=0\), then \( \mathbb {E}\Upsilon _k\rightarrow 0\) and \(\mathbb {E}\Gamma _k\rightarrow 0\).

In the following, if \(\left\{ z_k \right\} _{k\in \mathbb {N} }=\left\{ (x_k,y_k)\right\} _{k\in \mathbb {N} }\) is the bounded sequence generated by Algorithm 3.1, we assume \(\nabla H\) is M-Lipschitz continuous on \(\left\{ (x_k,y_k)\right\} _{k\in \mathbb {N} }\).

Assumption 3.2

For the sequences \(\left\{ x_k \right\} _{k\in \mathbb {N} } \) and \(\left\{ y_k \right\} _{k\in \mathbb {N} } \) generated by Algorithm 3.1, there exists \(L> 0\) such that

$$\sup \left\{ L _{y_k}:k\in \mathbb N \right\} \le L\ \ \textrm{and}\ \sup \left\{ L _{x_k}:k\in \mathbb N \right\} \le L, $$

where \(L _{y_k}\) and \(L _{x_k}\) are the Lipschitz constants for \(\nabla _{x} H_i(\cdot ,y_k)\) and \(\nabla _{y} H_i(x_k,\cdot )\), respectively.

Proposition 3.1

Let \(\left\{ z_k \right\} _{k\in \mathbb {N} }=\left\{ (x_k,y_k)\right\} _{k\in \mathbb {N} }\) be the bounded sequence generated by Algorithm 3.1. Then, the SAGA gradient estimator is variance-reduced with parameters \(V_{1}=\frac{16N^2\gamma ^2}{b}\), \(V_{2}=\frac{4N\gamma }{\sqrt{b}}\), \(V_{\Upsilon }=\frac{408nN^2(1+2\gamma _1^2+\gamma _2^2)}{b^2}\) and \(\rho =\frac{b}{2n}\), where \(N=\max \left\{ M,L \right\} \), \(\gamma =\max \left\{ \gamma _1,\gamma _2 \right\} \). The SARAH estimator is variance-reduced with parameters \(V_{1}=6\left( 1-\frac{1}{p} \right) M^2(1+2\gamma _{1}^2+\gamma _{2}^2)\), \(V_{2}=M\sqrt{6(1-\frac{1}{p})(1 +2\gamma _{1}^2+\gamma _{2}^2) }\), \(V_{\Upsilon }=6\left( 1-\frac{1}{p} \right) M^2(1+2\gamma _{1}^2+\gamma _{2}^2)\) and \(\rho = \frac{1}{p}\).

See the detailed proof of Proposition 3.1 in Appendices A and B. The conclusion that the SVRG gradient estimator is also variance-reduced can be obtained similarly.

Below, we give the supermartingale convergence theorem that will be applied to obtain almost sure convergence of sequences generated by STiBPALM (Algorithm 3.1).

Lemma 3.1

(Supermartingale convergence) Let \(\left\{ X_k \right\} _{k\in \mathbb {N} } \) and \(\left\{ Y_k \right\} _{k\in \mathbb {N} } \) be sequences of bounded nonnegative random variables such that \(X_k\) and \(Y_k\) depend only on the first k iterations of Algorithm 3.1. If

$$\begin{aligned} \mathbb {E} _kX_{k+1}+Y_k\le X_k \end{aligned}$$
(3.5)

for all k, then \(\sum _{k=0}^{\infty } Y_k<+\infty \) a.s. and \(\left\{ X_k \right\} \) converges a.s.

4 Convergence analysis under the KŁ property

In this section, under Assumptions 3.1 and 3.2, we prove convergence of the sequence and extend the convergence rates of SPRING to Algorithm 3.1 for a semialgebraic function \(\Phi \). Given \(k\in \mathbb {N}\), define the quantity

$$\begin{aligned} \Psi _k=&\Phi (z_{k})+ \frac{1}{L\lambda \rho } \Upsilon _k+\left( \frac{V_1+V_\Upsilon /\rho }{L\lambda }+\frac{\alpha _1+\alpha _2}{2}+\frac{2L(\gamma _{1}^2+\gamma _{2}^2)}{\lambda }+3Z \right) \Vert z_{k}-z_{k-1}\Vert ^2 \nonumber \\&+\!\left( \frac{V_1\!+\!V_\Upsilon /\rho }{L\lambda }\!+\!\frac{\alpha _2}{2}\!+\!\frac{2L\gamma _{2}^2}{\lambda }\!+\!2Z \right) \Vert z_{k-1}\!-\!z_{k-2}\Vert ^2\!+\!\left( \frac{V_1+V_\Upsilon /\rho }{L\lambda } \!+\!Z \right) \Vert z_{k-2}\!-\!z_{k-3}\Vert ^2, \end{aligned}$$
(4.1)

where \(\lambda =\sqrt{\frac{10(V_1+V_\Upsilon /\rho )+4L^2(\gamma _{1}^2+\gamma _{2}^2)}{L^2}}\), \(Z=\frac{V_1+V_\Upsilon /\rho }{\sqrt{10(V_1+V_\Upsilon /\rho )+4L^2(\gamma _{1}^2+\gamma _{2}^2)} }+\epsilon >0\), and \(\epsilon >0\) is sufficiently small. Our first result guarantees that \(\Psi _k\) is decreasing in expectation.

Lemma 4.1

(\( l _2\) summability) Suppose Assumptions 3.1 and 3.2 hold. Let \(\left\{ z_k \right\} _{k\in \mathbb {N} } \) be the sequence generated by Algorithm 3.1 with a variance-reduced gradient estimator, and assume that

$$\begin{aligned} \theta \overset{\bigtriangleup }{=}\min \left\{ \theta _1,\theta _2 \right\} > L+2\alpha _1+2\alpha _2+2\sqrt{10(V_1+V_\Upsilon /\rho )+4L^2(\gamma _{1}^2+\gamma _{2}^2)}+6\epsilon . \end{aligned}$$

Then, the following conclusions hold.

(i):

\(\Psi _k\) satisfies

$$\begin{aligned} \mathbb {E}_k\left[ \Psi _{k+1}\! +\!\kappa \left\| z_{k+1}\!-\!z_{k} \right\| ^2\!+\!\epsilon \left\| z_{k}\!-\!z_{k-1} \right\| ^2\!+\! \epsilon \left\| z_{k-1}\!-\!z_{k-2} \right\| ^2\!+\! Z \left\| z_{k-2}\!-\!z_{k-3} \right\| ^2 \right] \!\le \! \Psi _k, \end{aligned}$$
(4.2)

where \(\kappa =-\frac{L -\theta }{2}-\alpha _1-\alpha _2-\sqrt{10(V_1+V_\Upsilon /\rho )+4L^2(\gamma _{1}^2+\gamma _{2}^2)}-3\epsilon >0\).

(ii):

The expectation of the squared distance between the iterates is summable:

$$\sum _{k=0}^{\infty } \mathbb {E} [\left\| x_{k+1}-x_{k} \right\| ^2+\left\| y_{k+1}-y_{k} \right\| ^2]=\sum _{k=0}^{\infty } \mathbb {E}\left\| z_{k+1}-z_{k} \right\| ^2<\infty .$$

Proof

(i) Applying Lemma 2.3 with \(F(\cdot )=H(\cdot ,y_k)\), \(G(\cdot )=f(\cdot )\), \(z=x_{k+1}\), \(x= x_k\), \(d =\widetilde{\nabla }_x(u_k,y_k)\), \(u = x_{k-1}-x_{k}\) and \(w = x_{k-2}-x_{k-1}\), for any \(\lambda >0\), we have

$$\begin{aligned}&H(x_{k+1},y_k)+f(x_{k+1})\nonumber \\ \le&H(x_{k},y_k)+f(x_{k})+\frac{1}{2L\lambda }\left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _x H(x_k,y_k) \right\| ^2+\frac{L(\lambda +1) -\theta _1 }{2} \left\| x_{k+1}-x_k \right\| ^2 \nonumber \\&+\alpha _{1k} \langle x_{k+1}-x_k,x_{k}-x_{k-1}\rangle +\alpha _{2k} \langle x_{k+1}-x_k,x_{k-1}-x_{k-2}\rangle \nonumber \\ \overset{(1)}{\le }\&H(x_{k},y_k)\!+\!f(x_{k})\!+\!\frac{1}{L\lambda }\left\| \widetilde{\nabla }_x(u_k,y_k)\!-\!\nabla _x H(u_k,y_k) \right\| ^2\!+\!\frac{1}{L\lambda }\left\| \nabla _x H(u_k,y_k)\!-\!\nabla _x H(x_k,y_k) \right\| ^2\nonumber \\&+\frac{L(\lambda +1) -\theta _1 }{2} \left\| x_{k+1}-x_k \right\| ^2 +\frac{\alpha _{1k}}{2} (\Vert x_{k+1}-x_k\Vert ^2+\Vert x_k-x_{k-1}\Vert ^2)\nonumber \\&+\frac{\alpha _{2k}}{2}(\Vert x_{k+1}-x_k\Vert ^2+\Vert x_{k-1}-x_{k-2}\Vert ^2)\nonumber \\ \overset{(2)}{\le }\&H(x_{k},y_k)+f(x_{k})+\frac{1}{L\lambda }\left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _x H(u_k,y_k) \right\| ^2+\frac{L}{\lambda }\left\| u_k-x_k \right\| ^2\nonumber \\&+\left( \frac{L(\lambda +1) -\theta _1 }{2} +\frac{\alpha _{1}+\alpha _{2}}{2} \right) \left\| x_{k+1}-x_k \right\| ^2+\frac{\alpha _{1}}{2} \Vert x_k-x_{k-1}\Vert ^2+\frac{\alpha _{2}}{2}\Vert x_{k-1}-x_{k-2}\Vert ^2\nonumber \\ \le&H(x_{k},y_k)+f(x_{k})+\frac{1}{L\lambda }\left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _x H(u_k,y_k) \right\| ^2+\left( \frac{2L\gamma _{1k}^2}{\lambda }+\frac{\alpha _{1}}{2} \right) \left\| x_k-x_{k-1} \right\| ^2 \nonumber \\&+\left( \frac{2L\gamma _{2k}^2}{\lambda }+\frac{\alpha _{2}}{2} \right) \left\| x_{k-1}-x_{k-2} \right\| ^2 +\left( \frac{L(\lambda +1) -\theta _1 }{2}+\frac{\alpha _{1}+\alpha _{2}}{2} \right) \left\| x_{k+1}-x_k \right\| ^2. \end{aligned}$$
(4.3)

Inequality (1) uses the standard inequality \(\left\| a-c\right\| ^2\le 2\left\| a-b\right\| ^2+2\left\| b-c\right\| ^2\) together with Young’s inequality \(\langle a,b\rangle \le \frac{1}{2}(\Vert a\Vert ^2+\Vert b\Vert ^2)\), and (2) uses Assumption 3.1 (ii) and Assumption 3.2. Analogously, for the update in \(y_k\), applying Lemma 2.3 with \(F(\cdot )=H( x_{k+1},\cdot )\), \(G(\cdot )=g(\cdot )\), \(z=y_{k+1}\), \(x= y_k\), \(d =\widetilde{\nabla }_y(x_{k+1},v_k)\), \(u = y_{k-1}-y_{k}\) and \(w = y_{k-2}-y_{k-1}\), we have

$$\begin{aligned}&H(x_{k+1},y_{k+1})+g(y_{k+1})\nonumber \\ \le&H(x_{k+1},y_k)\!+\!g(y_{k})\!+\!\frac{1}{L\lambda }\left\| \widetilde{\nabla }_y(x_{k+1},v_k)\!-\!\nabla _y H(x_{k+1},v_k) \right\| ^2\!+\!\left( \frac{2L\mu _{1k}^2}{\lambda }\!+\!\frac{\alpha _{1}}{2} \right) \left\| y_k\!-\!y_{k-1} \right\| ^2 \nonumber \\&+\left( \frac{2L\mu _{2k}^2}{\lambda }+\frac{\alpha _{2}}{2} \right) \left\| y_{k-1}-y_{k-2} \right\| ^2+\left( \frac{L(\lambda +1) -\theta _2 }{2} +\frac{\alpha _{1}+\alpha _{2}}{2} \right) \left\| y_{k+1}-y_k \right\| ^2. \end{aligned}$$
(4.4)

Adding (4.3) and (4.4), we have

$$\begin{aligned}&\Phi (x_{k+1},y_{k+1})\\ \le&\Phi (x_{k},y_k)\!+\!\frac{1}{L\lambda }\left( \left\| \widetilde{\nabla }_x(u_k,y_k)\!-\!\nabla _x H(u_k,y_k) \right\| ^2 \!+\!\left\| \widetilde{\nabla }_y(x_{k+1},v_k)\!-\!\nabla _y H(x_{k+1},v_k) \right\| ^2 \right) \\&+\left( \frac{L(\lambda +1) -\theta }{2}+\frac{\alpha _1+\alpha _2}{2} \right) \Vert z_{k+1}-z_k\Vert ^2+\left( \frac{2L\gamma _{1}^2}{\lambda }+\frac{\alpha _1}{2}\right) \Vert z_{k}-z_{k-1}\Vert ^2\\&+\left( \frac{2L\gamma _{2}^2}{\lambda }+\frac{\alpha _2}{2}\right) \Vert z_{k-1}-z_{k-2}\Vert ^2, \end{aligned}$$

where \(\theta =\min \left\{ \theta _1,\theta _2 \right\} \). Applying the conditional expectation operator \(\mathbb {E} _k\), we can bound the MSE terms using (3.2). This gives

$$\begin{aligned}&\mathbb {E} _k\left[ \Phi (z_{k+1})+\left( -\frac{L(\lambda +1) -\theta }{2}-\frac{\alpha _1+\alpha _2}{2}-\frac{V_1}{L\lambda } \right) \Vert z_{k+1}-z_k\Vert ^2\right] \nonumber \\ \le&\Phi (z_{k})\!+\! \frac{1}{L\lambda } \Upsilon _k\!+\!\left( \frac{V_1}{L\lambda }\!+\!\frac{2L\gamma _{1}^2}{\lambda }\!+\!\frac{\alpha _1}{2} \right) \Vert z_{k}-z_{k-1}\Vert ^2\!+\!\left( \frac{V_1}{L\lambda }\!+\! \frac{2L\gamma _{2}^2}{\lambda }\!+\!\frac{\alpha _2}{2}\right) \Vert z_{k-1}\!-\!z_{k-2}\Vert ^2\nonumber \\&+\frac{V_1}{L\lambda }\Vert z_{k-2}-z_{k-3}\Vert ^2. \end{aligned}$$
(4.5)

Next, we use (3.4) to obtain

$$\begin{aligned} \frac{1}{L\lambda } \Upsilon _k&\le \frac{1}{L\lambda \rho } \left( -\mathbb {E}_k\Upsilon _{k+1}+\Upsilon _{k}+V_\Upsilon \left( \mathbb {E}_k\left\| z_{k+1}-z_{k} \right\| ^{2}+\left\| z_{k}-z_{k-1} \right\| ^{2} \right. \right. \\&\left. \left. +\left\| z_{k-1}-z_{k-2} \right\| ^{2} +\left\| z_{k-2}-z_{k-3} \right\| ^{2}\right) \right) . \end{aligned}$$

Combining these inequalities, we have

$$\begin{aligned}&\mathbb {E} _k\left[ \! \Phi (z_{k+1})\!+\! \frac{1}{L\lambda \rho } \Upsilon _{k+1} \!+\!\left( \!-\!\frac{L(\lambda \!+\!1) \!-\!\theta }{2}\!-\!\frac{\alpha _1\!+\!\alpha _2}{2} \!-\!\frac{V_1\!+\!V_\Upsilon /\rho }{L\lambda } \right) \Vert z_{k+1}\!-\!z_k\Vert ^2\right] \\ \le&\Phi (z_{k})+ \frac{1}{L\lambda \rho } \Upsilon _k+\left( \frac{V_1+V_\Upsilon /\rho }{L\lambda }+\frac{2L\gamma _{1}^2}{\lambda }+\frac{\alpha _1}{2} \right) \Vert z_{k}-z_{k-1}\Vert ^2\\&+\left( \frac{V_1+V_\Upsilon /\rho }{L\lambda }+ \frac{2L\gamma _{2}^2}{\lambda }+\frac{\alpha _2}{2}\right) \Vert z_{k-1}-z_{k-2}\Vert ^2+\frac{V_1+V_\Upsilon /\rho }{L\lambda }\Vert z_{k-2}-z_{k-3}\Vert ^2. \end{aligned}$$

This is equivalent to

$$\begin{aligned}&\mathbb {E} _k\left[ \Phi (z_{k+1})+ \frac{1}{L\lambda \rho } \Upsilon _{k+1}+\left( \frac{V_1+V_\Upsilon /\rho }{L\lambda }+\frac{\alpha _1+\alpha _2}{2}+\frac{2L(\gamma _{1}^2+\gamma _{2}^2)}{\lambda }+3Z \right) \Vert z_{k+1}-z_k\Vert ^2 \right. \nonumber \\&\left. +\left( \frac{V_1+V_\Upsilon /\rho }{L\lambda } +\frac{\alpha _2}{2}+\frac{2L\gamma _{2}^2}{\lambda }+2Z \right) \Vert z_{k}-z_{k-1}\Vert ^2+\left( \frac{V_1+V_\Upsilon /\rho }{L\lambda } +Z \right) \Vert z_{k-1}-z_{k-2}\Vert ^2 \right. \nonumber \\&\left. +\left( -\frac{L(\lambda +1) -\theta }{2}- \frac{2(V_1+V_\Upsilon /\rho ) }{L\lambda }-\alpha _1-\alpha _2-\frac{2L(\gamma _{1}^2+\gamma _{2}^2)}{\lambda }-3Z\right) \Vert z_{k+1}-z_k\Vert ^2\right] \nonumber \\ \le&\Phi (z_{k})+ \frac{1}{L\lambda \rho } \Upsilon _k+\left( \frac{V_1+V_\Upsilon /\rho }{L\lambda }+\frac{\alpha _1+\alpha _2}{2}+\frac{2L(\gamma _{1}^2+\gamma _{2}^2)}{\lambda }+3Z \right) \Vert z_{k}-z_{k-1}\Vert ^2 \nonumber \\&+\!\left( \frac{V_1\!+\!V_\Upsilon /\rho }{L\lambda }\!+\!\frac{\alpha _2}{2}\!+\!\frac{2L\gamma _{2}^2}{\lambda }\!+\!2Z \right) \Vert z_{k-1}\!-\!z_{k-2}\Vert ^2\!+\!\left( \frac{V_1\!+\!V_\Upsilon /\rho }{L\lambda } \!+\!Z \right) \Vert z_{k-2}\!-\!z_{k-3}\Vert ^2\nonumber \\&-\!\left( Z-\frac{V_1\!+\!V_\Upsilon /\rho }{L\lambda }\right) \Vert z_{k}\!-\!z_{k-1}\Vert ^2\!-\!\left( Z\!-\!\frac{V_1+V_\Upsilon /\rho }{L\lambda }\right) \Vert z_{k-1}\!-\!z_{k-2}\Vert ^2\!-\!Z\Vert z_{k-2}\!-\!z_{k-3}\Vert ^2. \end{aligned}$$
(4.6)

Recalling the definition of \(\Psi _k\) in (4.1), we have

$$\begin{aligned}&\mathbb {E} _k\left[ \!\Psi _{k+1}\!+\!\left( \!-\!\frac{L(\lambda \!+\!1) \!-\!\theta }{2}\!-\! \frac{2(V_1\!+\!V_\Upsilon /\rho ) }{L\lambda }\!-\!\alpha _1\!-\!\alpha _2\!-\!\frac{2L(\gamma _{1}^2\!+\!\gamma _{2}^2)}{\lambda }\!-\!3Z\right) \Vert z_{k+1}\!-\!z_k\Vert ^2\right] \nonumber \\ \le&\Psi _k\!-\!\left( Z\!-\!\frac{V_1\!+\!V_\Upsilon /\rho }{L\lambda }\right) \Vert z_{k}\!-\!z_{k-1}\Vert ^2\!-\!\left( Z\!-\!\frac{V_1\!+\!V_\Upsilon /\rho }{L\lambda }\right) \Vert z_{k-1}\!-\!z_{k-2}\Vert ^2-Z\Vert z_{k-2}\!-\!z_{k-3}\Vert ^2. \end{aligned}$$
(4.7)

By \(\lambda = \sqrt{\frac{10(V_1+V_\Upsilon /\rho )+4L^2(\gamma _{1}^2+\gamma _{2}^2)}{L^2}}\), we have \(-\frac{L(\lambda +1) -\theta }{2}- \frac{2(V_1+V_\Upsilon /\rho ) }{L\lambda }-\alpha _1-\alpha _2-\frac{2L(\gamma _{1}^2+\gamma _{2}^2)}{\lambda }-3Z=-\frac{L -\theta }{2}-\alpha _1-\alpha _2-\sqrt{10(V_1+V_\Upsilon /\rho )+4L^2(\gamma _{1}^2+\gamma _{2}^2)}-3\epsilon =\kappa \). Hence, (4.7) becomes

$$\begin{aligned} \mathbb {E}_k\left[ \!\Psi _{k+1} \!+\!\kappa \left\| z_{k+1}\!-\!z_{k} \right\| ^2\!+\!\epsilon \left\| z_{k}\!-\!z_{k-1} \right\| ^2\!+\! \epsilon \left\| z_{k-1}\!-\!z_{k-2} \right\| ^2 \!+\! Z \left\| z_{k-2}\!-\!z_{k-3} \right\| ^2\right] \!\le \! \Psi _k. \end{aligned}$$
(4.8)

According to \(\theta > L+2\alpha _1+2\alpha _2+2\sqrt{10(V_1+V_\Upsilon /\rho )+4L^2(\gamma _{1}^2+\gamma _{2}^2)}+6\epsilon \), we have \(\kappa >0\). This proves the first claim.

(ii) We apply the full expectation operator to (4.8) and sum the resulting inequality from \(k=0\) to \(k=T-1\),

$$\begin{aligned}&\mathbb {E}\Psi _{T}+\kappa \sum _{k=0}^{T-1} \mathbb {E}\left\| z_{k+1}-z_{k} \right\| ^2+\epsilon \sum _{k=0}^{T-1}\mathbb {E} \left\| z_{k}-z_{k-1} \right\| ^2+ \epsilon \sum _{k=0}^{T-1}\mathbb {E}\left\| z_{k-1}-z_{k-2} \right\| ^2\\&+ Z \sum _{k=0}^{T-1}\mathbb {E}\left\| z_{k-2}-z_{k-3} \right\| ^2\\ \le&\Psi _0. \end{aligned}$$

Using the fact that \(\underline{\Phi } \le \Psi _T\),

$$\begin{aligned}&\kappa \sum _{k=0}^{T-1} \mathbb {E}\left\| z_{k+1}-z_{k} \right\| ^2+\epsilon \sum _{k=0}^{T-1}\mathbb {E} \left\| z_{k}-z_{k-1} \right\| ^2+ \epsilon \sum _{k=0}^{T-1}\mathbb {E}\left\| z_{k-1}-z_{k-2} \right\| ^2\nonumber \\&+ Z \sum _{k=0}^{T-1}\mathbb {E}\left\| z_{k-2}-z_{k-3} \right\| ^2\nonumber \\ \le&\Psi _0-\underline{\Phi }. \end{aligned}$$
(4.9)

Taking the limit \(T \rightarrow +\infty \), we conclude that the sequence \(\left\{ \mathbb {E}\left\| z_{k+1}-z_{k} \right\| ^2 \right\} \) is summable. \(\square \)

The next lemma establishes a bound on the norm of the subgradients of \(\Phi (z_k)\).

Lemma 4.2

(Subgradient bound) Suppose Assumptions 3.1 and 3.2 hold. Let \(\{z_k\}_{k\in \mathbb {N}}\) be a bounded sequence generated by Algorithm 3.1 with a variance-reduced gradient estimator. For \(k\ge 0\), define

$$\begin{aligned} A_{x}^{k} =&\nabla _xH(x_{k},y_{k})\!-\!\widetilde{\nabla } _x(u_{k-1},y_{k-1})\!+\!\nabla \phi _1(x_{k-1})\!-\! \nabla \phi _1(x_{k})\!+\!\alpha _{1,k-1}(x_{k-1}\!-\!x_{k-2})\\&+\alpha _{2,k-1}(x_{k-2}-x_{k-3}),\\ A_{y}^{k} =&\nabla _yH(x_{k},y_{k})-\widetilde{\nabla } _y(x_{k},v_{k-1})+\nabla \phi _2(y_{k-1})- \nabla \phi _2(y_{k})+\beta _{1,k-1}(y_{k-1}-y_{k-2})\\&+\beta _{2,k-1}(y_{k-2}-y_{k-3}). \end{aligned}$$

Then, \((A_{x}^{k},A_{y}^{k} )\in \partial \Phi (x_k,y_k)\) and

$$\begin{aligned}{} & {} \mathbb {E}_{k-1}\left\| (A_{x}^{k},A_{y}^{k} ) \right\| \\\le & {} p\left( \mathbb {E}_{k-1}\left\| z_{k}\!-\!z_{k-1} \right\| \!+\!\left\| z_{k-1}\!-\!z_{k-2} \right\| \!+\!\left\| z_{k-2}\!-\!z_{k-3} \right\| \!+\!\left\| z_{k-3}\!-\!z_{k-4} \right\| \right) \!+\!\Gamma _{k-1},\nonumber \end{aligned}$$
(4.10)

where \(p=2(2N+\eta +N\gamma _1+N\gamma _2+\alpha _{1}+\alpha _{2})+V_2\), \(N=\max \left\{ M,L \right\} \), \(\eta =\max \left\{ \eta _1,\eta _2 \right\} \).

Proof

By the definition of \(x_{k}\), we have that 0 must lie in the subdifferential at point \(x_{k}\) of the function

$$\begin{aligned} x\longmapsto f(x)\!+\!\langle x,\widetilde{\nabla }_x(u_{k-1},y_{k-1})\rangle \!+\!D_{\phi _1}(x,x_{k-1})\!+\!\alpha _{1,k-1} \langle x,x_{k-2}\!-\!x_{k-1}\rangle \!+\!\alpha _{2,k-1} \langle x,x_{k-3}\!-\!x_{k-2}\rangle . \end{aligned}$$

Since \(\phi _1\) is differentiable, we have

$$\begin{aligned} 0\in & {} \partial f(x_{k})+\widetilde{\nabla }_x(u_{k-1},y_{k-1})+\nabla \phi _1(x_{k})- \nabla \phi _1(x_{k-1})+\alpha _{1,k-1}(x_{k-2}-x_{k-1})\\{} & {} +\alpha _{2,k-1}(x_{k-3}-x_{k-2}), \end{aligned}$$

which implies that

$$\begin{aligned}&\nabla _xH(x_{k},y_{k})-\widetilde{\nabla }_x(u_{k-1},y_{k-1})+\nabla \phi _1(x_{k-1})-\nabla \phi _1(x_{k}) \nonumber \\&+\alpha _{1,k-1}(x_{k-1}-x_{k-2})+\alpha _{2,k-1}(x_{k-2}-x_{k-3})\nonumber \\&\in \nabla _xH(x_{k},y_{k})+\partial f(x_{k}). \end{aligned}$$
(4.11)

Similarly, we have

$$\begin{aligned}&\nabla _yH(x_{k},y_{k})-\widetilde{\nabla } _y(x_{k},v_{k-1})+\nabla \phi _2(y_{k-1})- \nabla \phi _2(y_{k})\nonumber \\&+\beta _{1,k-1}(y_{k-1}-y_{k-2})+\beta _{2,k-1}(y_{k-2}-y_{k-3})\nonumber \\&\in \nabla _yH(x_{k},y_{k})+\partial g(y_{k}). \end{aligned}$$
(4.12)

Because of the structure of \(\Phi \), from (4.11) and (4.12), we have \((A_{x}^{k},A_{y}^{k} )\in \partial \Phi (x_k,y_k).\) All that remains is to bound the norms of \(A_{x}^{k}\) and \(A_{y}^{k}\). Because \(\nabla H\) is M-Lipschitz continuous on bounded sets, from Assumption 3.1 (iii) and (iv) we have

$$\begin{aligned}&\left\| A_{x}^{k} \right\| \nonumber \\ \le&\left\| \nabla _xH(x_{k},y_{k})-\widetilde{\nabla }_x(u_{k-1},y_{k-1}) \right\| +\left\| \nabla \phi _1(x_{k-1})-\nabla \phi _1(x_{k})\right\| \nonumber \\&+\alpha _{1,k-1}\left\| x_{k-1}-x_{k-2}\right\| +\alpha _{2,k-1}\left\| x_{k-2}-x_{k-3}\right\| \nonumber \\ \le&\left\| \nabla _xH(x_{k},y_{k})-\nabla _xH(u_{k-1},y_{k-1}) \right\| +\left\| \nabla _xH(u_{k-1},y_{k-1})-\widetilde{\nabla }_x(u_{k-1},y_{k-1}) \right\| \nonumber \\&+\eta _1\left\| x_{k-1}-x_{k}\right\| +\alpha _{1,k-1}\left\| x_{k-1}-x_{k-2}\right\| +\alpha _{2,k-1}\left\| x_{k-2}-x_{k-3}\right\| \nonumber \\ \le&\left\| \nabla _xH(u_{k-1},y_{k-1})-\widetilde{\nabla }_x(u_{k-1},y_{k-1}) \right\| +M\left\| x_{k}-u_{k-1}\right\| +M\left\| y_{k}-y_{k-1}\right\| \nonumber \\&+\eta _1\left\| x_{k-1}-x_{k}\right\| +\alpha _{1,k-1}\left\| x_{k-1}-x_{k-2}\right\| +\alpha _{2,k-1}\left\| x_{k-2}-x_{k-3}\right\| \nonumber \\ \le&\left\| \nabla _xH(u_{k-1},y_{k-1})\!-\!\widetilde{\nabla }_x(u_{k-1},y_{k-1}) \right\| \!+\!(M\!+\!\eta _1)\left\| x_{k}\!-\!x_{k-1}\right\| \!+\!M\left\| y_{k}\!-\!y_{k-1}\right\| \nonumber \\&+(M\gamma _{1}+\alpha _{1})\left\| x_{k-1}-x_{k-2}\right\| +(M\gamma _{2}+\alpha _{2})\left\| x_{k-2}-x_{k-3}\right\| . \end{aligned}$$
(4.13)

A similar argument holds for \(A_{y}^{k}\):

$$\begin{aligned}&\left\| A_{y}^{k} \right\| \nonumber \\ \le&\left\| \nabla _yH(x_{k},y_{k})-\nabla _yH(x_{k},v_{k-1}) \right\| +\left\| \nabla _yH(x_{k},v_{k-1})-\widetilde{\nabla }_y(x_{k},v_{k-1}) \right\| \nonumber \\&+\eta _2\left\| y_{k-1}-y_{k}\right\| +\beta _{1,k-1}\left\| y_{k-1}-y_{k-2}\right\| +\beta _{2,k-1}\left\| y_{k-2}-y_{k-3}\right\| \nonumber \\ \le&\left\| \nabla _yH(x_{k},v_{k-1})-\widetilde{\nabla }_y(x_{k},v_{k-1}) \right\| +(L+\eta _2)\left\| y_{k}-y_{k-1}\right\| \nonumber \\&+(L\gamma _{1}+\alpha _{1})\left\| y_{k-1}-y_{k-2}\right\| +(L\gamma _{2}+\alpha _{2})\left\| y_{k-2}-y_{k-3}\right\| . \end{aligned}$$
(4.14)

Adding (4.13) and (4.14), we get

$$\begin{aligned}&\left\| A_{x}^{k}\right\| +\left\| A_{y}^{k}\right\| \\ \le&\left\| \nabla _xH(u_{k-1},y_{k-1})-\widetilde{\nabla }_x(u_{k-1},y_{k-1}) \right\| +\left\| \nabla _yH(x_{k},v_{k-1})-\widetilde{\nabla }_y(x_{k},v_{k-1}) \right\| \\&\!\!\!+\!2( 2N\!+\!\eta )\left\| z_{k}\!-\!z_{k-1}\right\| \!+\!2(N\gamma _1\!+\!\alpha _{1})\left\| z_{k-1}\!-\!z_{k-2}\right\| \!+\!2(N\gamma _2\!+\!\alpha _{2})\left\| z_{k-2}\!-\!z_{k-3}\right\| , \end{aligned}$$

where \(N=\max \left\{ M,L \right\} \), \(\eta =\max \left\{ \eta _1,\eta _2 \right\} \). Applying the conditional expectation operator and using (3.3) to bound the MSE terms, we can obtain

$$\begin{aligned}&\mathbb {E}_{k-1}\left\| (A_{x}^{k},A_{y}^{k} ) \right\| \le \mathbb {E}_{k-1}\left[ \left\| A_{x}^{k}\right\| +\left\| A_{y}^{k}\right\| \right] \\ \le&(4N+2\eta +V_2)\mathbb {E} _{k-1}\left\| z_{k}-z_{k-1}\right\| +(2N\gamma _1+2\alpha _{1}+V_2)\left\| z_{k-1}-z_{k-2}\right\| \\&+(2N\gamma _2+2\alpha _{2}+V_2)\left\| z_{k-2}-z_{k-3}\right\| +V_2\left\| z_{k-3}-z_{k-4}\right\| +\Gamma _{k-1}\\ \le&p\left( \mathbb {E}_{k-1}\left\| z_{k}\!-\!z_{k-1} \right\| \!+\!\left\| z_{k-1}\!-\!z_{k-2} \right\| \!+\!\left\| z_{k-2}\!-\!z_{k-3} \right\| \!+\!\left\| z_{k-3}\!-\!z_{k-4}\right\| \right) \!+\!\Gamma _{k-1}, \end{aligned}$$

where \(p=2(2N+\eta +N\gamma _1+N\gamma _2+\alpha _{1}+\alpha _{2})+V_2\). \(\square \)

Define the set of limit points of \(\left\{ z_k\right\} _{k\in \mathbb {N} }\) as

$$\Omega :=\{\hat{ z}: \mathrm{\ there\ exists\ a\ subsequence \ }\left\{ z_{k_l}\right\} \mathrm{\ of}\ \left\{ z_{k}\right\} \mathrm{\ such\ that\ } z_{k_l}\rightarrow \hat{ z} \mathrm{\ as}\ l\rightarrow \infty \}.$$

The following lemma describes properties of \(\Omega \).

Lemma 4.3

(Limit points of \(\left\{ z_k\right\} _{k\in \mathbb {N} }\)) Suppose Assumptions 3.1 and 3.2 hold. Let \(\{z_k\}_{k\in \mathbb {N}}\) be a bounded sequence generated by Algorithm 3.1 with a variance-reduced gradient estimator, and assume that

$$\theta > L+2\alpha _1+2\alpha _2+2\sqrt{10(V_1+V_\Upsilon /\rho )+4L^2(\gamma _{1}^2+\gamma _{2}^2)}+6\epsilon .$$

where \(\epsilon >0\) is small enough. Then,

(1):

\(\sum _{k=1}^{\infty } \left\| z_{k}-z_{k-1} \right\| ^2<\infty \) a.s., and \(\left\| z_{k}-z_{k-1} \right\| \rightarrow 0\) a.s.;

(2):

\(\mathbb {E} \Phi (z_k)\rightarrow \Phi ^*\), where \(\Phi ^*\in [\underline{\Phi },\infty )\);

(3):

\(\mathbb {E} \textrm{dist}(0,\partial \Phi (z_k)) \rightarrow 0\);

(4):

the set \(\Omega \) is nonempty, and for all \(z^*\in \Omega \), \(\mathbb {E} \textrm{dist}(0,\partial \Phi (z^*)) = 0\);

(5):

\(\textrm{dist}(z_k,\Omega )\rightarrow 0\) a.s.;

(6):

\(\Omega \) is a.s. compact and connected;

(7):

\(\mathbb {E} \Phi (z^*)= \Phi ^*\) for all \(z^*\in \Omega \).

Proof

By Lemma 4.1, claim (1) holds.

According to (4.2), the supermartingale convergence theorem ensures that \(\left\{ \Psi _{k}\right\} \) converges to a finite, positive random variable. Because \(\left\| z_{k}-z_{k-1} \right\| \rightarrow 0\) a.s., \(\left\| z_{k-1}-z_{k-2} \right\| \rightarrow 0\) a.s., \(\left\| z_{k-2}-z_{k-3} \right\| \rightarrow 0\) a.s., and \(\widetilde{\nabla }\) is variance-reduced (so that \(\mathbb {E} \Upsilon _k \rightarrow 0\)), we can say

$$\lim _{k \rightarrow \infty } \mathbb {E}\Psi _{k}=\lim _{k \rightarrow \infty } \mathbb {E}\Phi (z_{k}) \in [\underline{\Phi },\infty ),$$

which implies claim (2).

Claim (3) holds because, by Lemma 4.2,

$$\begin{aligned}&\mathbb {E}\left\| (A_{x}^{k},A_{y}^{k} ) \right\| \\ \le&p \mathbb {E}\left( \left\| z_{k}-z_{k-1} \right\| +\left\| z_{k-1}-z_{k-2} \right\| +\left\| z_{k-2}-z_{k-3} \right\| +\left\| z_{k-3}-z_{k-4} \right\| \right) +\mathbb {E}\Gamma _{k-1}. \end{aligned}$$

We have that \(\mathbb {E}\left\| z_{k}-z_{k-1} \right\| \rightarrow 0\) and \(\mathbb {E} \Gamma _{k-1} \rightarrow 0\). This ensures that \(\mathbb {E}\left\| (A_{x}^{k},A_{y}^{k} )\right\| \rightarrow 0\). Since \((A_{x}^{k},A_{y}^{k} )\) is one element of \(\partial \Phi (z_k)\), we obtain \(\mathbb {E} \textrm{dist}(0,\partial \Phi (z_k))\le \mathbb {E}\left\| (A_{x}^{k},A_{y}^{k} )\right\| \rightarrow 0\).

To prove claim (4), suppose \(z^*=(x^*,y^*)\) is a limit point of the sequence \(\left\{ z_k\right\} _{k\in \mathbb {N} }\) (a limit point must exist because we suppose the sequence \(\left\{ z_k\right\} _{k\in \mathbb {N} }\) is bounded). This means there exists a subsequence \(\left\{ z_{k_j}\right\} \) satisfying \(\lim _{j\rightarrow \infty } z_{k_j}= z^*\). Furthermore, by the variance-reduced property of \(\widetilde{\nabla }(u_{k_j-1},y_{k_j-1})\), we have \(\mathbb {E} \left\| \widetilde{\nabla }_x(u_{k_j-1},y_{k_j-1})-\nabla _x H(u_{k_j-1},y_{k_j-1}) \right\| ^2\rightarrow 0\).

Because f and g are lower semicontinuous, we have

$$\begin{aligned}&\liminf _{j\rightarrow \infty }f(x_{k_j})\ge f(x^*),\nonumber \\&\liminf _{j\rightarrow \infty }g(y_{k_j})\ge g(y^*). \end{aligned}$$
(4.15)

By the update rule for \(x_{k_j}\), letting \(x=x^*\), we have

$$\begin{aligned}&f(x_{k_j})\!+\!\langle x _{k_j},\widetilde{\nabla }_x(u_{k_j-1},y_{k_j-1})\rangle \!+\!D_{\phi _1}(x_{k_j},x_{k_j-1})\!+\!\alpha _{1,k_j-1} \langle x_{k_j},x_{k_j-2}\!-\!x_{k_j-1}\rangle \\&+\alpha _{2,k_j-1} \langle x_{k_j},x_{k_j-3}-x_{k_j-2}\rangle \\ \le&f(x^*)+\langle x^*,\widetilde{\nabla }_x(u_{k_j-1},y_{k_j-1})\rangle +D_{\phi _1}(x^*,x_{k_j-1})+\alpha _{1,k_j-1} \langle x^*,x_{k_j-2}-x_{k_j-1}\rangle \\&+\alpha _{2,k_j-1} \langle x^*,x_{k_j-3}-x_{k_j-2}\rangle . \end{aligned}$$

Taking the limit \(j \rightarrow \infty \),

$$\begin{aligned}&\limsup _{j\rightarrow \infty }f(x_{k_j})\\ \le&\limsup _{j\rightarrow \infty }f(x^*)+\langle x^*-x_{k_j},\nabla _xH(u_{k_j-1},y_{k_j-1})\rangle +\langle x^*-x_{k_j},\widetilde{\nabla }_x(u_{k_j-1},y_{k_j-1})\\&-\nabla _xH(u_{k_j-1},y_{k_j-1})\rangle +\phi _1(x^*)-\phi _1(x_{k_j})+\left\langle \nabla \phi _1(x_{k_j-1}),x^*-x_{k_j-1} \right\rangle \\&+\alpha _{1,k_j-1} \langle x^*-x_{k_j},x_{k_j-2}-x_{k_j-1}\rangle +\alpha _{2,k_j-1} \langle x^*-x_{k_j},x_{k_j-3}-x_{k_j-2}\rangle . \end{aligned}$$

The second term on the right goes to zero because \(x_{k_j}\rightarrow x^*\) and \(\left\{ \nabla _xH(u_{k_j-1},y_{k_j-1})\right\} \) is bounded. The third term is zero almost surely because it is bounded above by \(\left\| x^*-x_{k_j} \right\| ^2\), and \(\widetilde{\nabla }_x(u_{k_j-1},y_{k_j-1})-\nabla _xH(u_{k_j-1},y_{k_j-1})\) \(\rightarrow 0\) a.s. Since \(\phi _1\) is differentiable and hence continuous, \(\limsup _{j\rightarrow \infty }f(x_{k_j})\le f(x^*)\) a.s., which, together with (4.15), implies that \(\lim _{j\rightarrow \infty }f(x_{k_j})= f(x^*)\) a.s. Similarly, we have \(\lim _{j\rightarrow \infty }g(y_{k_j})= g(y^*)\) a.s., and hence

$$\begin{aligned} \lim _{j\rightarrow \infty }\Phi (x_{k_j},y_{k_j})=\Phi (x^*,y^*)\ \ \mathrm{a.s}. \end{aligned}$$
(4.16)

Claim (3) ensures that \(\mathbb {E} \textrm{dist}(0,\partial \Phi (z_k)) \rightarrow 0\). Combining (4.16) and the fact that the subdifferential of \(\Phi \) is closed, we have \(\mathbb {E} \textrm{dist}(0,\partial \Phi (z^*)) = 0\).

Claims (5) and (6) hold for any sequence satisfying \(\left\| z_{k}-z_{k-1} \right\| \rightarrow 0\) a.s. (this fact is used in the same context in [11, 36]).

Finally, we must show that \(\Phi \) has constant expectation over \(\Omega \). From claim (2), we have \(\mathbb {E} \Phi (z_k)\rightarrow \Phi ^*\), which implies \(\mathbb {E} \Phi (z_{k_j})\rightarrow \Phi ^*\) for every subsequence \(\left\{ z_{k_j}\right\} _{j\in \mathbb {N} }\) converging to some \(z^*\in \Omega \). In the proof of claim (4), we showed that \(\Phi (z_{k_j})\rightarrow \Phi (z ^*)\) a.s., so \(\mathbb {E} \Phi (z^*)= \Phi ^*\) for all \(z^*\in \Omega \). \(\square \)

The following lemma is analogous to the uniformized Kurdyka–Łojasiewicz property [11]. It is a slight generalization of the KŁ property, showing that \(z_k\) eventually enters a neighborhood of some \(\tilde{z} \) satisfying \(\Phi (\tilde{z} )= \Phi (z ^*)\), and that in this region the KŁ inequality holds.

Lemma 4.4

Assume that the conditions of Lemma 4.3 hold and that \(z_k\) is not a critical point of \(\Phi \) after a finite number of iterations. Let \(\Phi \) be a semialgebraic function with KŁ exponent \(\vartheta \). Then, there exists an index m and a desingularizing function \(\varphi \) so that the following bound holds:

$$\varphi '(\mathbb {E} [\Phi (z_k)-\Phi _k ^*])\mathbb {E}\textrm{dist}(0,\partial \Phi (z_k))\ge 1,\ \ \forall k>m,$$

where \(\Phi _k ^*\) is a nondecreasing sequence converging to \(\mathbb {E} \Phi (z^*)\) for all \(z^*\in \Omega \).

The proof is almost the same as that of Lemma 4.5 in [23] and is therefore omitted. We now show that the iterates of Algorithm 3.1 have finite length in expectation.

Theorem 4.1

(Finite length) Assume that the conditions of Lemma 4.3 hold and \(\Phi \) is a semialgebraic function with KŁ exponent \(\vartheta \in [0,1)\). Let \(\{z_k\}_{k\in \mathbb {N}}\) be a bounded sequence generated by Algorithm 3.1 with a variance-reduced gradient estimator.

(i):

Either \(z_k\) is a critical point after a finite number of iterations or \(\left\{ z_k\right\} _{k\in \mathbb {N} }\) satisfies the finite length property in expectation:

$$\sum _{k=0}^{\infty } \mathbb {E}\left\| z_{k+1}-z_{k} \right\| <\infty , $$

and there exists an integer m so that, for all \(i > m\),

$$\begin{aligned}&\sum _{k=m}^{i}\mathbb {E}\left\| z_{k+1}\!-\!z_{k} \right\| \!+\!\sum _{k=m}^{i} \mathbb {E}\left\| z_{k}\!-\!z_{k-1} \right\| \!+\! \sum _{k=m}^{i}\mathbb {E}\left\| z_{k-1}\!-\!z_{k-2} \right\| \!+\! \sum _{k=m}^{i}\mathbb {E}\left\| z_{k-2}\!-\!z_{k-3} \right\| \nonumber \\ \le&\sqrt{\mathbb {E}\left\| z_{m}-z_{m-1} \right\| ^2} +\sqrt{\mathbb {E}\left\| z_{m-1}-z_{m-2} \right\| ^2}+ \sqrt{\mathbb {E}\left\| z_{m-2}-z_{m-3} \right\| ^2}\nonumber \\&+ \sqrt{\mathbb {E}\left\| z_{m-3}-z_{m-4} \right\| ^2} +\frac{2\sqrt{s} }{K_1\rho } \sqrt{\mathbb {E}\Upsilon _{m-1}}+K_3\triangle _{m,i+1}, \end{aligned}$$
(4.17)

where

$$K_1=p+\frac{2\sqrt{sV_\Upsilon } }{\rho }, \ K_3=\frac{4K_1 }{K_2},\ K_2=\min \left\{ \kappa ,\epsilon ,Z \right\} , $$

p is as in Lemma 4.2, and \(\triangle _{p,q}=\varphi (\mathbb {E}[\Psi _p-\Phi _{p}^{*} ])-\varphi (\mathbb {E}[\Psi _q-\Phi _{q}^{*} ])\), with \(\varphi \) the desingularizing function from Lemma 4.4.

(ii):

The sequence \(\left\{ z_k\right\} _{k\in \mathbb {N} }\) generated by Algorithm 3.1 converges to a critical point of \(\Phi \) in expectation.

Proof

(i) If \(\vartheta \in (0,\frac{1}{2})\), then \(\Phi \) satisfies the KŁ property with exponent \(\frac{1}{2}\), so we consider only the case \(\vartheta \in [ \frac{1}{2},1)\). By Lemma 4.4, there exists a function \(\varphi _0(r)=ar^{1-\vartheta }\) such that

$$\varphi _0'(\mathbb {E}[ \Phi (z_k)-\Phi _k ^*])\mathbb {E}\textrm{dist}(0,\partial \Phi (z_k))\ge 1,\ \ \forall k>m.$$

Lemma 4.2 provides a bound on \(\mathbb {E}\textrm{dist}(0,\partial \Phi (z_k))\).

$$\begin{aligned}&\mathbb {E}\textrm{dist}(0,\partial \Phi (z_k)) \le \mathbb {E}\left\| (A_{x}^{k},A_{y}^{k} ) \right\| \nonumber \\ \le&p\mathbb {E}\left( \left\| z_{k}-z_{k-1} \right\| +\left\| z_{k-1}-z_{k-2} \right\| +\left\| z_{k-2}-z_{k-3} \right\| +\left\| z_{k-3}-z_{k-4} \right\| \right) +\mathbb {E}\Gamma _{k-1}\nonumber \\ \le&p\Big ( \sqrt{\mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{k-1}-z_{k-2} \right\| ^2} +\sqrt{\mathbb {E}\left\| z_{k-2}-z_{k-3} \right\| ^2}\nonumber \\&+\sqrt{\mathbb {E}\left\| z_{k-3}-z_{k-4} \right\| ^2}\Big ) +\sqrt{s\mathbb {E}\Upsilon _{k-1} } . \end{aligned}$$
(4.18)

The final inequality is Jensen’s inequality. Because \(\Gamma _k=\sum _{i=1}^{s} v_{k}^{i} \) for some nonnegative random variables \(v_{k}^{i} \), we can say \(\mathbb {E}\Gamma _k=\mathbb {E}\sum _{i=1}^{s} v_{k}^{i} \le \mathbb {E}\sqrt{s\sum _{i=1}^{s} (v_{k}^{i} )^2} \le \sqrt{s\mathbb {E}\Upsilon _{k} } \). We can bound the term \(\sqrt{\mathbb {E}\Upsilon _{k} } \) using (3.4):

$$\begin{aligned}&\sqrt{\mathbb {E}\Upsilon _{k}}\nonumber \\ \le&\sqrt{(1\!-\!\rho )\mathbb {E}\Upsilon _{k-1}\!+\!V_\Upsilon \mathbb {E}\left( \left\| z_{k}\!-\!z_{k-1} \right\| ^{2}\!+\!\left\| z_{k-1}\!-\!z_{k-2} \right\| ^{2} \!+\!\left\| z_{k-2}\!-\!z_{k-3} \right\| ^{2}+\left\| z_{k-3}\!-\!z_{k-4} \right\| ^{2}\right) }\nonumber \\ \le&\sqrt{(1-\rho )}\sqrt{\mathbb {E}\Upsilon _{k-1}} +\sqrt{V_\Upsilon } \left( \sqrt{ \mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^{2}} +\sqrt{\mathbb {E}\left\| z_{k-1}-z_{k-2} \right\| ^{2} } +\sqrt{\mathbb {E}\left\| z_{k-2}-z_{k-3} \right\| ^{2}}\right. \nonumber \\&\left. +\sqrt{\mathbb {E}\left\| z_{k-3}-z_{k-4} \right\| ^{2}} \right) \nonumber \\ \le&(1-\frac{\rho }{2} )\sqrt{\mathbb {E}\Upsilon _{k-1}} +\sqrt{V_\Upsilon } \left( \sqrt{ \mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^{2}} +\sqrt{\mathbb {E}\left\| z_{k-1}-z_{k-2} \right\| ^{2} } +\sqrt{\mathbb {E}\left\| z_{k-2}-z_{k-3} \right\| ^{2}}\right. \nonumber \\&\left. +\sqrt{\mathbb {E}\left\| z_{k-3}-z_{k-4} \right\| ^{2}} \right) . \end{aligned}$$
(4.19)

The final inequality uses the expansion \(\sqrt{1-\rho } =1-\frac{\rho }{2}- \frac{\rho ^2 }{8}-\cdots \le 1-\frac{\rho }{2}\). This implies that

$$\begin{aligned}&\sqrt{s\mathbb {E}\Upsilon _{k-1}}\nonumber \\ \le&\frac{2\sqrt{s} }{\rho } \left( \sqrt{\mathbb {E}\Upsilon _{k-1}}-\sqrt{\mathbb {E}\Upsilon _{k}} \right) +\frac{2\sqrt{sV_\Upsilon }}{\rho } \left( \sqrt{ \mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^{2}} +\sqrt{\mathbb {E}\left\| z_{k-1}-z_{k-2} \right\| ^{2} } \right. \nonumber \\&\left. +\sqrt{\mathbb {E}\left\| z_{k-2}-z_{k-3} \right\| ^{2}} +\sqrt{\mathbb {E}\left\| z_{k-3}-z_{k-4} \right\| ^{2}}\right) . \end{aligned}$$
(4.20)
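For clarity, (4.20) follows from (4.19) by a one-line rearrangement: moving \(\sqrt{\mathbb {E}\Upsilon _{k}}\) to the left-hand side of (4.19) gives

$$\begin{aligned} \frac{\rho }{2}\sqrt{\mathbb {E}\Upsilon _{k-1}}\le&\sqrt{\mathbb {E}\Upsilon _{k-1}}-\sqrt{\mathbb {E}\Upsilon _{k}}+\sqrt{V_\Upsilon } \left( \sqrt{ \mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^{2}} +\sqrt{\mathbb {E}\left\| z_{k-1}-z_{k-2} \right\| ^{2} }\right. \\&\left. +\sqrt{\mathbb {E}\left\| z_{k-2}-z_{k-3} \right\| ^{2}} +\sqrt{\mathbb {E}\left\| z_{k-3}-z_{k-4} \right\| ^{2}} \right) , \end{aligned}$$

and multiplying both sides by \(\frac{2\sqrt{s} }{\rho }\) yields (4.20).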

Then, from (4.18) and (4.20), we have

$$\begin{aligned}&\mathbb {E}\textrm{dist}(0,\partial \Phi (z_k))\\ \le&\left( p+\frac{2\sqrt{sV_\Upsilon } }{\rho } \right) \left( \sqrt{ \mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^{2}} +\sqrt{\mathbb {E}\left\| z_{k-1}-z_{k-2} \right\| ^{2} } +\sqrt{\mathbb {E}\left\| z_{k-2}-z_{k-3} \right\| ^{2}} \right. \\&\left. +\sqrt{\mathbb {E}\left\| z_{k-3}-z_{k-4} \right\| ^{2}} \right) +\frac{2\sqrt{s} }{\rho } \left( \sqrt{\mathbb {E}\Upsilon _{k-1} }-\sqrt{\mathbb {E}\Upsilon _{k} } \right) \\ =&K_1\left( \sqrt{ \mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^{2}} +\sqrt{\mathbb {E}\left\| z_{k-1}-z_{k-2} \right\| ^{2} } +\sqrt{\mathbb {E}\left\| z_{k-2}-z_{k-3} \right\| ^{2}}\right. \\&\left. +\sqrt{\mathbb {E}\left\| z_{k-3}-z_{k-4} \right\| ^{2}} \right) +\frac{2\sqrt{s} }{\rho } \left( \sqrt{\mathbb {E}\Upsilon _{k-1} }-\sqrt{\mathbb {E}\Upsilon _{k} } \right) , \end{aligned}$$

where \(K_1=p+\frac{2\sqrt{sV_\Upsilon } }{\rho }\). Define \(C_k\) to be the right side of this inequality:

$$\begin{aligned} C_k=&K_1\sqrt{\mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^2}+ K_1\sqrt{\mathbb {E}\left\| z_{k-1}-z_{k-2} \right\| ^2} + K_1\sqrt{\mathbb {E}\left\| z_{k-2}-z_{k-3} \right\| ^2}\\&+ K_1\sqrt{\mathbb {E}\left\| z_{k-3}-z_{k-4} \right\| ^2}+\frac{2\sqrt{s} }{\rho } \left( \sqrt{\mathbb {E}\Upsilon _{k-1} }-\sqrt{\mathbb {E}\Upsilon _{k} } \right) . \end{aligned}$$

We then have

$$\begin{aligned} \varphi _0'(\mathbb {E} [\Phi (z_k)-\Phi _k ^*])C_k\ge 1,\ \ \forall k>m. \end{aligned}$$
(4.21)

By the definition of \(\varphi _0\), this is equivalent to

$$\begin{aligned} \frac{a(1-\vartheta )C_k}{(\mathbb {E} [\Phi (z_k)-\Phi _k ^*])^\vartheta }\ge 1,\ \ \forall k>m. \end{aligned}$$
(4.22)

We would like the inequality above to hold with \(\Psi _k\) in place of \(\Phi (z_k)\). We replace \(\mathbb {E} \Phi (z_k)\) with \(\mathbb {E}\Psi _k\) by introducing a term of \(\mathcal {O}\left( \left( \mathbb {E}\left[ \left\| z_{k}-z_{k-1} \right\| ^2+\left\| z_{k-1}-z_{k-2} \right\| ^2 +\left\| z_{k-2}-z_{k-3} \right\| ^2+\Upsilon _k \right] \right) ^\vartheta \right) \) in the denominator. We show that inequality (4.22) still holds after this adjustment because these terms are small compared to \(C_k\). Indeed, we have

$$\begin{aligned} C_k\ge&c_1\left( \sqrt{\mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^2}+ \sqrt{\mathbb {E}\left\| z_{k-1}-z_{k-2} \right\| ^2} +\sqrt{\mathbb {E}\left\| z_{k-2}-z_{k-3} \right\| ^2}\right. \\&\left. +\sqrt{\mathbb {E}\left\| z_{k-3}-z_{k-4} \right\| ^2}+\sqrt{\mathbb {E}\Upsilon _{k-1} } \right) \end{aligned}$$

for some constant \(c_1>0\). Because \(\mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^2 \rightarrow 0\), \(\mathbb {E} \Upsilon _k \rightarrow 0\), and \(\vartheta \ge \frac{1}{2} \), there exist an index m and constants \(c_2,c_3>0\) such that

$$\begin{aligned}&\left( \mathbb {E}[\Psi _k-\Phi (z_k) ]\right) ^\vartheta \\ =&\left( \mathbb {E}\left[ \frac{1}{L\lambda \rho } \Upsilon _k\!+\!\left( \frac{V_1\!+\!V_\Upsilon /\rho }{L\lambda }\!+\!\frac{\alpha _1\!+\!\alpha _2}{2}+\frac{2L(\gamma _{1}^2+\gamma _{2}^2)}{\lambda }+3Z \right) \Vert z_{k}-z_{k-1}\Vert ^2\!+\!\left( \frac{V_1+V_\Upsilon /\rho }{L\lambda }\right. \right. \right. \\&\left. \left. \left. +\frac{\alpha _2}{2}+\frac{2L\gamma _{2}^2}{\lambda }+2Z \right) \Vert z_{k-1}-z_{k-2}\Vert ^2+\left( \frac{V_1+V_\Upsilon /\rho }{L\lambda } +Z \right) \Vert z_{k-2}-z_{k-3}\Vert ^2\right] \right) ^\vartheta \\ \le&c_2\left( \left( \mathbb {E}\left[ \Upsilon _{k-1}+\left\| z_{k}-z_{k-1} \right\| ^{2} +\left\| z_{k-1}-z_{k-2} \right\| ^{2}+\left\| z_{k-2}-z_{k-3} \right\| ^{2}+\left\| z_{k-3}-z_{k-4} \right\| ^{2}\right] \right) ^\vartheta \right) \\ \le&c_3C_k,\ \ \forall k>m. \end{aligned}$$

The first inequality uses (3.4). Because the terms above are small compared to \(C_k\), there exists a constant d such that \(c_3<d<+\infty \) and

$$\begin{aligned} \frac{ad(1-\vartheta )C_k}{(\mathbb {E}[\Phi (z_k)-\Phi _k ^*])^\vartheta +\left( \mathbb {E}[\Psi _k-\Phi (z_k) ]\right) ^\vartheta }\ge 1,\ \ \forall k>m. \end{aligned}$$

For \(\vartheta \in [ \frac{1}{2},1)\), using the fact that \((a+b)^\vartheta \le a^\vartheta +b^\vartheta \) for all \(a, b \ge 0\), we have

$$\begin{aligned} \frac{ad(1-\vartheta )C_k}{\left( \mathbb {E}[\Psi _k-\Phi _k ^*]\right) ^\vartheta }&=\frac{ad(1-\vartheta )C_k}{\left( \mathbb {E}[\Phi (z_k)-\Phi _k ^*+\Psi _k-\Phi (z_k) ]\right) ^\vartheta }\\&\ge \frac{ad(1-\vartheta )C_k}{\left( \mathbb {E}[\Phi (z_k)-\Phi _k ^*]\right) ^\vartheta +\left( \mathbb {E}[\Psi _k-\Phi (z_k) ]\right) ^\vartheta } \\&\ge 1,\ \ \forall k>m. \end{aligned}$$

Therefore, with \(\varphi (r)=adr^{1-\vartheta }\),

$$\begin{aligned} \varphi '(\mathbb {E}[\Psi _k-\Phi _k ^*])C_k\ge 1,\ \ \forall k>m. \end{aligned}$$
(4.23)

By the concavity of \(\varphi \),

$$\begin{aligned} \varphi (\mathbb {E}[\Psi _k\!-\!\Phi _k ^*])\!-\!\varphi (\mathbb {E}[\Psi _{k+1}\!-\!\Phi _{k+1} ^*])&\ge \! \varphi '(\mathbb {E}[\Psi _k\!-\!\Phi _k ^*])(\mathbb {E}[\Psi _k\!-\!\Phi _k ^*\!+\!\Phi _{k+1} ^*\!-\!\Psi _{k+1}])\\&\ge \varphi '(\mathbb {E}[\Psi _k-\Phi _k ^*])(\mathbb {E}[\Psi _k-\Psi _{k+1}]), \end{aligned}$$

where the last inequality follows from the fact that \(\Phi _k ^*\) is nondecreasing. With \(\triangle _{p,q}=\varphi (\mathbb {E}[\Psi _p-\Phi _{p}^{*} ])-\varphi (\mathbb {E}[\Psi _q-\Phi _{q}^{*} ])\), we have shown

$$\triangle _{k,k+1}C_k\ge \mathbb {E}[\Psi _k-\Psi _{k+1}],\ \forall k>m.$$

Using Lemma 4.1, we can bound \(\mathbb {E}[\Psi _k-\Psi _{k+1}]\) from below in terms of \(\mathbb {E}\left\| z_{k+1}-z_{k} \right\| ^2\), \(\mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^2\), \(\mathbb {E}\left\| z_{k-1}-z_{k-2} \right\| ^2\), and \(\mathbb {E}\left\| z_{k-2}-z_{k-3} \right\| ^2\). Specifically,

$$\begin{aligned} \triangle _{k,k+1}C_k&\ge \kappa \mathbb {E}\left\| z_{k+1}-z_{k}\right\| ^2+\epsilon \mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^2+\epsilon \mathbb {E}\left\| z_{k-1}-z_{k-2}\right\| ^2+Z\mathbb {E}\left\| z_{k-2}-z_{k-3}\right\| ^2\nonumber \\&\ge \! K_2\mathbb {E}\left\| z_{k+1}\!-\!z_{k}\right\| ^2\!+\!K_2\mathbb {E}\left\| z_{k}\!-\!z_{k-1} \right\| ^2\!+\!K_2\mathbb {E}\left\| z_{k-1}\!-\!z_{k-2}\right\| ^2\!+\!K_2\mathbb {E}\left\| z_{k-2}\!-\!z_{k-3}\right\| ^2, \end{aligned}$$
(4.24)

where \(K_2=\min \left\{ \kappa ,\epsilon ,Z \right\} >0\), and \(\kappa \), \(\lambda \), \(\epsilon \), and Z are as in Lemma 4.1. We begin with the first of these inequalities. Applying Young’s inequality to (4.24) yields

$$\begin{aligned}&\sqrt{\mathbb {E}\left\| z_{k+1}\!-\!z_{k} \right\| ^2} \!+\!\sqrt{\mathbb {E}\left\| z_{k}\!-\!z_{k-1} \right\| ^2} \!+\!\sqrt{\mathbb {E}\left\| z_{k-1}\!-\!z_{k-2} \right\| ^2}\!+\!\sqrt{\mathbb {E}\left\| z_{k-2}\!-\!z_{k-3} \right\| ^2}\nonumber \\ \le&2\sqrt{\mathbb {E}\left\| z_{k+1}-z_{k} \right\| ^2+\mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^2+\mathbb {E}\left\| z_{k-1}-z_{k-2} \right\| ^2+\mathbb {E}\left\| z_{k-2}-z_{k-3} \right\| ^2}\nonumber \\ \le&2\sqrt{K_2^{-1}C_k\triangle _{k,k+1}} \le \frac{C_k}{2K_1}+\frac{2K_1\triangle _{k,k+1}}{K_2}\nonumber \\ \le&\frac{ 1}{2}\sqrt{\mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^2} +\frac{1}{2}\sqrt{\mathbb {E}\left\| z_{k-1}-z_{k-2} \right\| ^2} +\frac{1}{2}\sqrt{\mathbb {E}\left\| z_{k-2}-z_{k-3} \right\| ^2}\nonumber \\&+\frac{1}{2}\sqrt{\mathbb {E}\left\| z_{k-3}-z_{k-4} \right\| ^2}+\frac{\sqrt{s} }{K_1\rho } \left( \sqrt{\mathbb {E}\Upsilon _{k-1} }-\sqrt{\mathbb {E}\Upsilon _{k} } \right) +\frac{2K_1\triangle _{k,k+1}}{K_2}. \end{aligned}$$
(4.25)
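Here, Young’s inequality is applied with the splitting below; the final inequality of (4.25) then follows by dividing the definition of \(C_k\) by \(2K_1\):

$$\begin{aligned} 2\sqrt{K_2^{-1}C_k\triangle _{k,k+1}}=2\sqrt{\frac{C_k}{2K_1}\cdot \frac{2K_1\triangle _{k,k+1}}{K_2}} \le \frac{C_k}{2K_1}+\frac{2K_1\triangle _{k,k+1}}{K_2}. \end{aligned}$$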

To sum inequality (4.25) from \(k=m\) to \(k=i\), set

$$\begin{aligned} T_m^{i}=&\sum _{k=m}^{i} \sqrt{\mathbb {E}\left\| z_{k+1}-z_{k} \right\| ^2}+\sum _{k=m}^{i}\sqrt{\mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^2} +\sum _{k=m}^{i}\sqrt{\mathbb {E}\left\| z_{k-1}-z_{k-2} \right\| ^2}\nonumber \\&+\sum _{k=m}^{i}\sqrt{\mathbb {E}\left\| z_{k-2}-z_{k-3} \right\| ^2}. \end{aligned}$$
(4.26)

Then,

$$\begin{aligned} T_m^{i}\le \frac{1}{2}T_{m-1}^{i-1}+\frac{\sqrt{s} }{K_1\rho }\left( \sqrt{\mathbb {E}\Upsilon _{m-1} }-\sqrt{\mathbb {E}\Upsilon _{i} } \right) +\frac{2K_1}{K_2}\triangle _{m,i+1}, \end{aligned}$$

which implies that

$$\begin{aligned} \frac{1}{2}T_m^{i}\le&\frac{1}{2} \sqrt{\mathbb {E}\left\| z_{m}-z_{m-1} \right\| ^2}+\frac{1}{2}\sqrt{\mathbb {E}\left\| z_{m-1}-z_{m-2} \right\| ^2}+\frac{1}{2} \sqrt{\mathbb {E}\left\| z_{m-2}-z_{m-3} \right\| ^2} \\&+\frac{1}{2} \sqrt{\mathbb {E}\left\| z_{m-3}-z_{m-4} \right\| ^2}+\frac{\sqrt{s} }{K_1\rho } \left( \sqrt{\mathbb {E}\Upsilon _{m-1} }-\sqrt{\mathbb {E}\Upsilon _{i} } \right) +\frac{2K_1}{K_2}\triangle _{m,i+1}. \end{aligned}$$

Dropping the nonpositive term \(-\sqrt{\mathbb {E}\Upsilon _{i} }\), this shows that

$$\begin{aligned} T_m^{i} \le&\sqrt{\mathbb {E}\left\| z_{m}-z_{m-1} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{m-1}-z_{m-2} \right\| ^2}+ \sqrt{\mathbb {E}\left\| z_{m-2}-z_{m-3} \right\| ^2} \nonumber \\&+ \sqrt{\mathbb {E}\left\| z_{m-3}-z_{m-4} \right\| ^2}+\frac{2\sqrt{s} }{K_1\rho } \sqrt{\mathbb {E}\Upsilon _{m-1} }+K_3\triangle _{m,i+1}. \end{aligned}$$
(4.27)

Here, \(K_3=\frac{4K_1}{K_2}\). Applying Jensen’s inequality to the terms on the left gives

$$\begin{aligned}&\sum _{k=m}^{i}\mathbb {E}\left\| z_{k+1}\!-\!z_{k} \right\| \!+\!\sum _{k=m}^{i} \mathbb {E}\left\| z_{k}\!-z_{k-1} \right\| \!+ \sum _{k=m}^{i}\mathbb {E}\left\| z_{k-1}-z_{k-2} \right\| +\! \sum _{k=m}^{i}\mathbb {E}\left\| z_{k-2}-z_{k-3} \right\| \le \! T_m^{i}\nonumber \\ \le&\sqrt{\mathbb {E}\left\| z_{m}-z_{m-1} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{m-1}-z_{m-2} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{m-2}-z_{m-3} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{m-3}-z_{m-4} \right\| ^2} \nonumber \\&+\frac{2\sqrt{s} }{K_1\rho } \sqrt{\mathbb {E}\Upsilon _{m-1} }+K_3\triangle _{m,i+1}. \end{aligned}$$

The term \(\lim _{i \rightarrow \infty } \triangle _{m,i+1}\) is bounded because \(\mathbb {E}\Psi _k\) is bounded due to Lemma 4.1. Letting \(i \rightarrow \infty \), we prove the assertion.

(ii) An immediate consequence of claim (i) is that the sequence \(\left\{ z_k\right\} _{k\in \mathbb {N} }\) converges in expectation to a critical point. This is because, for any \(p,q \in \mathbb {N}\) with \(p \ge q\), \(\mathbb {E}\left\| z_{p}-z_{q} \right\| =\mathbb {E}\left\| \sum _{k=q}^{p-1}( z_{k+1}-z_{k}) \right\| \le \sum _{k=q}^{p-1} \mathbb {E}\left\| z_{k+1}-z_{k} \right\| \), and the finite length property implies this final sum converges to zero. This proves claim (ii). \(\square \)

Theorem 4.2

Assume that the conditions of Lemma 4.3 hold and \(\Phi \) is a semialgebraic function with KŁ exponent \(\vartheta \in [0, 1)\). Let \(\{z_k\}_{k\in \mathbb {N}}\) be a bounded sequence generated by Algorithm 3.1 with a variance-reduced gradient estimator. The following convergence rates hold:

(i):

If \(\vartheta \in (0, \frac{1}{2} ]\), then there exist \(d_1 > 0\) and \(\tau \in [1 - \rho ,1)\) such that \(\mathbb {E} \left\| z_k-z^*\right\| \le d_1\tau ^k\).

(ii):

If \(\vartheta \in (\frac{1}{2},1)\), then there exists a constant \(d_2 > 0\) such that \(\mathbb {E} \left\| z_k-z^*\right\| \le d_2k ^{-\frac{1-\vartheta }{2\vartheta -1} }\).

(iii):

If \(\vartheta = 0\), then there exists an \(m \in \mathbb {N}\) such that \(\mathbb {E} \Phi (z_k)=\mathbb {E} \Phi (z^*)\) for all \(k \ge m\).

Proof

As in the proof of Theorem 4.1, if \(\vartheta \in (0, \frac{1}{2} )\), then \(\Phi \) satisfies the KŁ property with exponent \(\frac{1}{2} \), so we consider only the case \(\vartheta \in [\frac{1}{2},1)\).

Let

$$\begin{aligned} T_m=&\sum _{k=m}^{\infty } \sqrt{\mathbb {E}\left\| z_{k+1}-z_{k} \right\| ^2}+\sum _{k=m}^{\infty }\sqrt{\mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^2} +\sum _{k=m}^{\infty }\sqrt{\mathbb {E}\left\| z_{k-1}-z_{k-2} \right\| ^2}\\&+\sum _{k=m}^{\infty }\sqrt{\mathbb {E}\left\| z_{k-2}-z_{k-3} \right\| ^2}. \end{aligned}$$

Substituting the desingularizing function \(\varphi (r)=ar^{1-\vartheta }\) into (4.27) and letting \(i\rightarrow \infty \), we have

$$\begin{aligned} T_m\le&\sqrt{\mathbb {E}\left\| z_{m}-z_{m-1} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{m-1}-z_{m-2} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{m-2}-z_{m-3} \right\| ^2} \nonumber \\&+\sqrt{\mathbb {E}\left\| z_{m-3}-z_{m-4} \right\| ^2} +\frac{2\sqrt{s} }{K_1\rho } \sqrt{\mathbb {E}\Upsilon _{m-1} }+aK_3(\mathbb {E}[\Psi _m-\Phi _{m}^{*} ])^{1-\vartheta }. \end{aligned}$$
(4.28)

Because \(\Psi _m=\Phi (z_m)+\mathcal {O}(\left\| z_{m}-z_{m-1} \right\| ^2+\left\| z_{m-1}-z_{m-2} \right\| ^2+\left\| z_{m-2}-z_{m-3} \right\| ^2+\Upsilon _m) \), we can rewrite the final term in terms of \(\Phi (z_m)-\Phi _{m}^{*}\):

$$\begin{aligned}&(\mathbb {E}[\Psi _m-\Phi _{m}^{*} ])^{1-\vartheta }\nonumber \\ =&\left( \mathbb {E}\left[ \Phi (z_m)\!-\!\Phi _{m}^{*}\!+\! \frac{1}{L\lambda \rho } \Upsilon _m\!+\!\left( \frac{V_1+V_\Upsilon /\rho }{L\lambda }\!+\!\frac{\alpha _1\!+\alpha _2}{2}+\!\frac{2L(\gamma _{1}^2+\!\gamma _{2}^2)}{\lambda }\!+3Z \right) \Vert z_{m}\!-z_{m-1}\Vert ^2 \right. \right. \nonumber \\&\left. \left. +\left( \frac{V_1+V_\Upsilon /\rho }{L\lambda }+\frac{\alpha _2}{2}+\frac{2L\gamma _{2}^2}{\lambda }+2Z \right) \Vert z_{m-1}-z_{m-2}\Vert ^2+\left( \frac{V_1+V_\Upsilon /\rho }{L\lambda } +Z \right) \right. \right. \nonumber \\&\left. \left. \Vert z_{m-2}-z_{m-3}\Vert ^2\right] \right) ^{1-\vartheta }\nonumber \\ \overset{(1)}{\le }\ {}&\left( \mathbb {E}[\Phi (z_m)\!-\!\Phi _{m}^{*}]\right) ^{1-\vartheta }\!+\!\left( \frac{1}{L\lambda \rho } \mathbb {E}\Upsilon _m\right) ^{1-\vartheta }\!+\!\left( \left( \frac{V_1\!+\!V_\Upsilon /\rho }{L\lambda }\!+\!\frac{\alpha _1\!+\!\alpha _2}{2}\!+\!\frac{2L(\gamma _{1}^2\!+\!\gamma _{2}^2)}{\lambda }+3Z \right) \right. \nonumber \\&\left. \mathbb {E}\Vert z_{m}-z_{m-1}\Vert ^2\right) ^{1-\vartheta }+\left( \left( \frac{V_1+V_\Upsilon /\rho }{L\lambda }+\frac{\alpha _2}{2}+\frac{2L\gamma _{2}^2}{\lambda }+2Z \right) \mathbb {E}\Vert z_{m-1}-z_{m-2}\Vert ^2\right) ^{1-\vartheta }\nonumber \\&+\left( \left( \frac{V_1+V_\Upsilon /\rho }{L\lambda } +Z \right) \mathbb {E}\Vert z_{m-2}-z_{m-3}\Vert ^2\right) ^{1-\vartheta }. \end{aligned}$$
(4.29)

Inequality (1) is due to the fact that \((a+b) ^{1-\vartheta }\le a^{1-\vartheta }+b^{1-\vartheta }\) for all \(a,b\ge 0\). Applying the KŁ inequality (2.1),

$$\begin{aligned} aK_3\left( \mathbb {E}[\Phi (z_m)-\Phi _{m}^{*}]\right) ^{1-\vartheta }\le aK_4\left( \mathbb {E}\left\| \xi _m \right\| \right) ^{\frac{1-\vartheta }{\vartheta } } \end{aligned}$$
(4.30)

for all \(\xi _m\in \partial \Phi (z_m)\), where we have absorbed the constant C into \(K_4\). Inequality (4.18) provides a bound on the norm of the subgradient:

$$\begin{aligned} \left( \mathbb {E}\left\| \xi _m \right\| \right) ^{\frac{1-\vartheta }{\vartheta } }\! \le&\left( \! p\left( \! \sqrt{\mathbb {E}\left\| z_{m}\!-\!z_{m-1} \right\| ^2}\!+\!\sqrt{\mathbb {E}\left\| z_{m-1}\!-\!z_{m-2} \right\| ^2} \!+\!\sqrt{\mathbb {E}\left\| z_{m-2}\!-\!z_{m-3} \right\| ^2}\right. \right. \\&\left. \left. +\sqrt{\mathbb {E}\left\| z_{m-3}-z_{m-4} \right\| ^2}\right) +\sqrt{s\mathbb {E}\Upsilon _{m-1} } \right) ^{\frac{1-\vartheta }{\vartheta } }. \end{aligned}$$

Let

$$\begin{aligned} \Theta _{m}=&p\left( \sqrt{\mathbb {E}\left\| z_{m}-z_{m-1} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{m-1}-z_{m-2} \right\| ^2} +\sqrt{\mathbb {E}\left\| z_{m-2}-z_{m-3} \right\| ^2}\right. \\&\left. +\sqrt{\mathbb {E}\left\| z_{m-3}-z_{m-4} \right\| ^2}\right) +\sqrt{s\mathbb {E}\Upsilon _{m-1} }. \end{aligned}$$

Therefore, it follows from (4.28)–(4.30) that

$$\begin{aligned} T_m\le&\sqrt{\mathbb {E}\left\| z_{m}-z_{m-1} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{m-1}-z_{m-2} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{m-2}-z_{m-3} \right\| ^2}\nonumber \\&+\!\sqrt{\mathbb {E}\left\| z_{m-3}\!-\!z_{m-4} \right\| ^2} \!+\!\frac{2\sqrt{s} }{K_1\rho } \sqrt{\mathbb {E}\Upsilon _{m-1} }\!+\!aK_4\Theta _{m}^{\frac{1-\vartheta }{\vartheta } }\!+\!aK_3\left( \frac{1}{L\lambda \rho } \mathbb {E}\Upsilon _m\right) ^{1-\vartheta }\nonumber \\&+\!aK_3\left( \left( \frac{V_1\!+\!V_\Upsilon /\rho }{L\lambda }\!+\!\frac{\alpha _1+\alpha _2}{2}\!+\!\frac{2L(\gamma _{1}^2+\gamma _{2}^2)}{\lambda }\!+\!3Z \right) \mathbb {E}\Vert z_{m}\!-\!z_{m-1}\Vert ^2\right) ^{1-\vartheta }\nonumber \\&+aK_3\left( \left( \frac{V_1+V_\Upsilon /\rho }{L\lambda }+\frac{\alpha _2}{2}+\frac{2L\gamma _{2}^2}{\lambda }+2Z\right) \mathbb {E}\Vert z_{m-1}-z_{m-2}\Vert ^2\right) ^{1-\vartheta }\nonumber \\&+aK_3\left( \left( \frac{V_1+V_\Upsilon /\rho }{L\lambda } +Z\right) \mathbb {E}\Vert z_{m-2}-z_{m-3}\Vert ^2\right) ^{1-\vartheta }. \end{aligned}$$
(4.31)

(i) If \(\vartheta = \frac{1}{2}\), then \(\left( \mathbb {E}\left\| \xi _m \right\| \right) ^{\frac{1-\vartheta }{\vartheta } }=\mathbb {E}\left\| \xi _m \right\| \). Equation (4.31) then gives

$$\begin{aligned} T_m\le&\sqrt{\mathbb {E}\left\| z_{m}-z_{m-1} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{m-1}-z_{m-2} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{m-2}-z_{m-3} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{m-3}-z_{m-4} \right\| ^2} \nonumber \\&+\frac{2\sqrt{s} }{K_1\rho } \sqrt{\mathbb {E}\Upsilon _{m-1} }+aK_4\left( p\left( \sqrt{\mathbb {E}\left\| z_{m}-z_{m-1} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{m-1}-z_{m-2} \right\| ^2} \right. \right. \nonumber \\&\left. \left. +\sqrt{\mathbb {E}\left\| z_{m-2}-z_{m-3} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{m-3}-z_{m-4} \right\| ^2} \right) +\sqrt{s\mathbb {E}\Upsilon _{m-1} } \right) +aK_3\sqrt{\frac{1}{L\lambda \rho }}\sqrt{ \mathbb {E}\Upsilon _m}\nonumber \\&+\left( aK_3\sqrt{ \frac{V_1+V_\Upsilon /\rho }{L\lambda }+\frac{\alpha _1+\alpha _2}{2}+\frac{2L(\gamma _{1}^2+\gamma _{2}^2)}{\lambda }+3Z }\right) \sqrt{ \mathbb {E}\Vert z_{m}-z_{m-1}\Vert ^2}\nonumber \\&+\left( aK_3\sqrt{ \frac{V_1+V_\Upsilon /\rho }{L\lambda }+\frac{\alpha _2}{2}+\frac{2L\gamma _{2}^2}{\lambda }+2Z }\right) \sqrt{ \mathbb {E}\Vert z_{m-1}-z_{m-2}\Vert ^2}\nonumber \\&+\left( aK_3\sqrt{ \frac{V_1+V_\Upsilon /\rho }{L\lambda } +Z }\right) \sqrt{ \mathbb {E}\Vert z_{m-2}-z_{m-3}\Vert ^2}\nonumber \\ \le&\left( 1\!+aK_5\left( p\!+\sqrt{ \frac{V_1\!+V_\Upsilon /\rho }{L\lambda }\!+\!\frac{\alpha _1\!+\!\alpha _2}{2}+\frac{2L(\gamma _{1}^2\!+\!\gamma _{2}^2)}{\lambda }\!+\!3Z } \right) \right) \left( \sqrt{\mathbb {E}\left\| z_{m}\!-\!z_{m-1} \right\| ^2}\right. \nonumber \\&\left. +\sqrt{\mathbb {E}\left\| z_{m-1}-z_{m-2} \right\| ^2} +\sqrt{\mathbb {E}\left\| z_{m-2}-z_{m-3} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{m-3}-z_{m-4} \right\| ^2}\right) \nonumber \\&+\left( \frac{2\sqrt{s} }{K_1\rho }+aK_5 \sqrt{s}\right) \sqrt{\mathbb {E}\Upsilon _{m-1} } +aK_5\sqrt{\frac{1}{L\lambda \rho }}\sqrt{ \mathbb {E}\Upsilon _m}, \end{aligned}$$
(4.32)

where \(K_5=\max \left\{ K_3,K_4 \right\} \). Using (4.19), we have that, for any constant \(c > 0\),

$$\begin{aligned} 0\le&-\!c\sqrt{\mathbb {E}\Upsilon _{k}}\!+\!c(1\!-\!\frac{\rho }{2} )\sqrt{\mathbb {E}\Upsilon _{k-1}} \!+\!c\sqrt{V_\Upsilon } \left( \sqrt{ \mathbb {E}\left\| z_{k}\!-\!z_{k-1} \right\| ^{2}} \!+\!\sqrt{\mathbb {E}\left\| z_{k-1}\!-\!z_{k-2} \right\| ^{2} } \right. \\&\left. +\sqrt{\mathbb {E}\left\| z_{k-2}-z_{k-3} \right\| ^{2}} +\sqrt{\mathbb {E}\left\| z_{k-3}-z_{k-4} \right\| ^{2}} \right) . \end{aligned}$$

Combining this inequality with (4.32),

$$\begin{aligned} T_m\le&\left( \! 1\!+\!aK_5\left( p\!+\!\sqrt{ \frac{V_1\!+\!V_\Upsilon /\rho }{L\lambda }\!+\!\frac{\alpha _1\!+\!\alpha _2}{2}\!+\!\frac{2L(\gamma _{1}^2\!+\!\gamma _{2}^2)}{\lambda }\!+\!3Z }\!+\!c\sqrt{V_\Upsilon } \right) \! \right) \left( \sqrt{\mathbb {E}\left\| z_{m}\!-\!z_{m-1} \right\| ^2} \right. \\&\left. +\sqrt{\mathbb {E}\left\| z_{m-1}-z_{m-2} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{m-2}-z_{m-3} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{m-3}-z_{m-4} \right\| ^2}\right) \\&+c\left( 1-\frac{\rho }{2}+\frac{2\sqrt{s} }{K_1\rho c }+\frac{aK_5 \sqrt{s}}{c}\right) \sqrt{\mathbb {E}\Upsilon _{m-1} } -c\left( 1-\frac{aK_5 }{c}\sqrt{\frac{1}{L\lambda \rho }}\right) \sqrt{ \mathbb {E}\Upsilon _m}. \end{aligned}$$

Defining \(A=1+aK_5\left( p+\sqrt{ \frac{V_1+V_\Upsilon /\rho }{L\lambda }+\frac{\alpha _1+\alpha _2}{2}+\frac{2L(\gamma _{1}^2+\gamma _{2}^2)}{\lambda }+3Z }+c\sqrt{V_\Upsilon }\right) \), we have shown

$$\begin{aligned}&T_m+c\left( 1-\frac{aK_5 }{c}\sqrt{\frac{1}{L\lambda \rho }}\right) \sqrt{ \mathbb {E}\Upsilon _m}\\ \le&A \left( T_{m-1}-T_m\right) +c\left( 1-\frac{\rho }{2}+\frac{2\sqrt{s} }{K_1\rho c }+\frac{aK_5 \sqrt{s}}{c}\right) \sqrt{\mathbb {E}\Upsilon _{m-1} }. \end{aligned}$$

Then, we get

$$\begin{aligned}&(1+A)T_m+c\left( 1-\frac{aK_5 }{c}\sqrt{\frac{1}{L\lambda \rho }}\right) \sqrt{ \mathbb {E}\Upsilon _m}\\ \le&AT_{m-1}+c\left( 1-\frac{\rho }{2}+\frac{2\sqrt{s} }{K_1\rho c }+\frac{aK_5 \sqrt{s}}{c}\right) \sqrt{\mathbb {E}\Upsilon _{m-1} }. \end{aligned}$$

This implies

$$\begin{aligned}&T_m+\sqrt{ \mathbb {E}\Upsilon _m}\\ \le&\max \left\{ \frac{A}{1\!+\!A},\left( 1\!-\!\frac{\rho }{2}\!+\!\frac{2\sqrt{s} }{K_1\rho c }\!+\!\frac{aK_5 \sqrt{s}}{c}\right) \left( 1\!-\!\frac{aK_5 }{c}\sqrt{\frac{1}{L\lambda \rho }}\right) ^{-1} \right\} \left( T_{m-1}\!+\!\sqrt{\mathbb {E}\Upsilon _{m-1} }\right) . \end{aligned}$$

For large c, the second coefficient in the above expression approaches \(1-\frac{\rho }{2}\). So there exists \(\tau \in [1 - \rho ,1)\) such that

$$\sum _{k=m}^{\infty }\sqrt{\mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^2}\le \tau ^m\left( T_{0}+\sqrt{\mathbb {E}\Upsilon _{0} }\right) \le d_1\tau ^m$$

for some constant \(d_1\). Then, using the fact that \(\mathbb {E}\left\| z_{m}\!-\!z^{*} \right\| \!=\!\mathbb {E}\left\| \sum _{k=m+1}^{\infty } (z_{k}\!-\!z_{k-1}) \right\| \)\(\le \sum _{k=m}^{\infty }\mathbb {E}\left\| z_{k}-z_{k-1} \right\| \), we prove claim (i).

(ii) Suppose \(\vartheta \in (\frac{1}{2},1)\). Each term on the right side of (4.31) converges to zero, but at different rates. Because

$$\begin{aligned} \Theta _m =&\mathcal {O}\left( \sqrt{\mathbb {E}\left\| z_{m}-z_{m-1} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{m-1}-z_{m-2} \right\| ^2} +\sqrt{\mathbb {E}\left\| z_{m-2}-z_{m-3} \right\| ^2}\right. \\&\left. +\sqrt{\mathbb {E}\left\| z_{m-3}-z_{m-4} \right\| ^2}+\sqrt{s\mathbb {E}\Upsilon _{m-1} } \right) , \end{aligned}$$

and \(\vartheta \) satisfies \(\frac{1-\vartheta }{\vartheta }< 1\), the term \(\Theta _{m}^{\frac{1-\vartheta }{\vartheta } }\) dominates the first five terms on the right side of (4.31) for large m. Also, because \(\frac{1-\vartheta }{2\vartheta }< 1-\vartheta \), \(\Theta _{m}^{\frac{1-\vartheta }{\vartheta } }\) dominates the final four terms as well. Combining these facts, there exists a natural number \(M_1\) such that for all \(m \ge M_1\),

$$\begin{aligned} T_m\le P\Theta _m \end{aligned}$$
(4.33)

for some constant \(P>(aK_3)^{\frac{\vartheta }{1-\vartheta } }\). The bound of (4.20) implies

$$\begin{aligned}&2\sqrt{s\mathbb {E}\Upsilon _{m-1}}\\ \le&\frac{4\sqrt{s} }{\rho } \left( \sqrt{\mathbb {E}\Upsilon _{m-1}}-\sqrt{\mathbb {E}\Upsilon _{m}} +\sqrt{V_\Upsilon }\left( \sqrt{ \mathbb {E}\left\| z_{m}-z_{m-1} \right\| ^{2}} +\sqrt{\mathbb {E}\left\| z_{m-1}-z_{m-2} \right\| ^{2} }\right. \right. \\&\left. \left. +\sqrt{\mathbb {E}\left\| z_{m-2}-z_{m-3} \right\| ^{2}} +\sqrt{\mathbb {E}\left\| z_{m-3}-z_{m-4} \right\| ^{2}} \right) \right) . \end{aligned}$$

Therefore,

$$\begin{aligned} \Theta _m =&p\left( \sqrt{\mathbb {E}\left\| z_{m}-z_{m-1} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{m-1}-z_{m-2} \right\| ^2} +\sqrt{\mathbb {E}\left\| z_{m-2}-z_{m-3} \right\| ^2} \right. \nonumber \\&\left. +\sqrt{\mathbb {E}\left\| z_{m-3}-z_{m-4} \right\| ^2} \right) +\left( 2\sqrt{s\mathbb {E}\Upsilon _{m-1} }-\sqrt{s\mathbb {E}\Upsilon _{m-1} } \right) \nonumber \\ \le&\left( p\!+\! \frac{4\sqrt{sV_\Upsilon } }{\rho }\right) \left( \sqrt{\mathbb {E}\left\| z_{m}\!-\!z_{m-1} \right\| ^2}\!+\!\sqrt{\mathbb {E}\left\| z_{m-1}\!-\!z_{m-2} \right\| ^2} \!+\!\sqrt{\mathbb {E}\left\| z_{m-2}\!-\!z_{m-3} \right\| ^2} \right. \nonumber \\&\left. +\sqrt{\mathbb {E}\left\| z_{m-3}-z_{m-4} \right\| ^2} \right) +\frac{4\sqrt{s} }{\rho }\left( \sqrt{\mathbb {E}\Upsilon _{m-1} }-\sqrt{\mathbb {E}\Upsilon _{m} } \right) -\sqrt{s\mathbb {E}\Upsilon _{m-1} }. \end{aligned}$$
(4.34)

Furthermore, because \({\frac{\vartheta }{1-\vartheta } }>1\) and \(\mathbb {E}\Upsilon _m\rightarrow 0\), for large enough m, we have \(\left( \sqrt{\mathbb {E}\Upsilon _m} \right) ^{\frac{\vartheta }{1-\vartheta } } \ll \sqrt{\mathbb {E}\Upsilon _m} \). This ensures that there exists a natural number \(M_2\) such that for every \(m \ge M_2\),

$$\begin{aligned} \left( \frac{4\sqrt{s} (1-\rho /4)}{\rho (p+4\sqrt{sV_\Upsilon } /\rho )} \sqrt{\mathbb {E}\Upsilon _m} \right) ^{\frac{\vartheta }{1-\vartheta } } \le P\sqrt{s\mathbb {E}\Upsilon _m} . \end{aligned}$$
(4.35)

The constant appearing on the left was chosen to simplify later arguments. Therefore, (4.33) implies

$$\begin{aligned}&\left( T_m+\frac{4\sqrt{s} (1-\rho /4)}{\rho (p+4\sqrt{sV_\Upsilon } /\rho )} \sqrt{\mathbb {E}\Upsilon _m}\right) ^{\frac{\vartheta }{ 1-\vartheta } }\\ \overset{(1)}{\le }\&\frac{2^{{\frac{\vartheta }{1-\vartheta } }}}{2}\left( T_m\right) ^{\frac{\vartheta }{ 1-\vartheta } }\!+\!\frac{2^{{\frac{\vartheta }{1-\vartheta } }}}{2}\left( \frac{4\sqrt{s} (1-\rho /4)}{\rho (p+4\sqrt{sV_\Upsilon } /\rho )} \sqrt{\mathbb {E}\Upsilon _m}\right) ^{\frac{\vartheta }{ 1-\vartheta } } \overset{(2)}{\le }\ \frac{2^{{\frac{\vartheta }{1-\vartheta } }}}{2}\left( T_m\right) ^{\frac{\vartheta }{ 1-\vartheta } }+\frac{2^{{\frac{\vartheta }{1-\vartheta } }}}{2}\left( P\sqrt{s\mathbb {E}\Upsilon _m}\right) \\ \overset{(3)}{\le }\&\frac{2^{{\frac{\vartheta }{1-\vartheta } }}}{2}\left( P\left( p+ \frac{4\sqrt{sV_\Upsilon } }{\rho }\right) \left( \sqrt{\mathbb {E}\left\| z_{m}-z_{m-1} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{m-1}-z_{m-2} \right\| ^2} +\sqrt{\mathbb {E}\left\| z_{m-2}-z_{m-3} \right\| ^2} \right. \right. \\&\left. \left. +\sqrt{\mathbb {E}\left\| z_{m-3}\!-\!z_{m-4} \right\| ^2} \right) \!+\!\frac{4\sqrt{s}P }{\rho }\left( \sqrt{\mathbb {E}\Upsilon _{m-1} }\!-\!\sqrt{\mathbb {E}\Upsilon _{m} } \right) \!-\!P\sqrt{s\mathbb {E}\Upsilon _{m-1} } \right) \!+\!\frac{2^{{\frac{\vartheta }{1-\vartheta } }}}{2}\left( P\sqrt{s\mathbb {E}\Upsilon _m}\right) \\ \le&\frac{2^{{\frac{\vartheta }{1-\vartheta } }}}{2}\left( P\left( p+ \frac{4\sqrt{sV_\Upsilon } }{\rho }\right) \left( \sqrt{\mathbb {E}\left\| z_{m}-z_{m-1} \right\| ^2}+\sqrt{\mathbb {E}\left\| z_{m-1}-z_{m-2} \right\| ^2} +\sqrt{\mathbb {E}\left\| z_{m-2}-z_{m-3} \right\| ^2} \right. \right. \\&\left. \left. +\sqrt{\mathbb {E}\left\| z_{m-3}-z_{m-4} \right\| ^2} \right) +\frac{4\sqrt{s}P(1-\rho /4) }{\rho }\left( \sqrt{\mathbb {E}\Upsilon _{m-1} }-\sqrt{\mathbb {E}\Upsilon _{m} } \right) \right) . \end{aligned}$$

Here, (1) follows from the convexity of \(x^{\frac{\vartheta }{1-\vartheta }}\) on \(x \ge 0\) for \(\vartheta \in [1/2, 1)\), which gives \((a+b)^{\frac{\vartheta }{1-\vartheta }}\le \frac{2^{\frac{\vartheta }{1-\vartheta }}}{2}\left( a^{\frac{\vartheta }{1-\vartheta }}+b^{\frac{\vartheta }{1-\vartheta }}\right) \) for \(a,b\ge 0\); (2) is (4.35); and (3) is (4.33) combined with (4.34). We absorb the constant \(\frac{2^{{\frac{\vartheta }{1-\vartheta } }}}{2}\) into P. Define

$$\begin{aligned} S_m=T_m+\frac{4\sqrt{s} (1-\rho /4)}{\rho (p+4\sqrt{sV_\Upsilon } /\rho )} \sqrt{\mathbb {E}\Upsilon _m}. \end{aligned}$$

\(S_m\) is bounded for all m because \(\sum _{k=m}^{\infty } \sqrt{\mathbb {E}\left\| z_{k+1}-z_{k} \right\| ^2}\) is bounded by (4.28). Hence, we have shown

$$\begin{aligned} S_{m}^{\frac{\vartheta }{1-\vartheta }} \le P\left( p+ \frac{4\sqrt{sV_\Upsilon } }{\rho }\right) (S_{m-1}-S_m). \end{aligned}$$
(4.36)

The rest of the proof is almost the same as that in [23, 37] and is omitted here.

(iii) When \(\vartheta = 0\), the KŁ property (2.1) implies that exactly one of the following two scenarios holds: either \(\mathbb {E} \Phi (z_k)\ne \Phi _{k}^{*}\) and

$$\begin{aligned} 0<C\le \mathbb {E}\left\| \xi _k \right\| ,\ \ \forall \xi _k\in \partial \Phi (z_k) \end{aligned}$$
(4.37)

or \(\mathbb {E} \Phi (z_k)= \Phi _{k}^{*}\). We show that the above inequality can hold only for a finite number of iterations.

Using the subgradient bound (4.10), the first scenario implies

$$\begin{aligned} C^2\le&\left( \mathbb {E}\left\| \xi _k \right\| \right) ^2 \\ \le&\left( p\left( \mathbb {E}\left\| z_{k}\!-\!z_{k-1} \right\| \!+\!\mathbb {E}\left\| z_{k-1}\!-\!z_{k-2} \right\| \!+\!\mathbb {E}\left\| z_{k-2}\!-\!z_{k-3} \right\| \!+\!\mathbb {E}\left\| z_{k-3}\!-\!z_{k-4} \right\| \right) \!+\!\mathbb {E}\Gamma _{k-1} \right) ^2\\ \le&5p^2 \left( \mathbb {E}\left\| z_{k}-z_{k-1} \right\| \right) ^2+5p^2 \left( \mathbb {E}\left\| z_{k-1}-z_{k-2} \right\| \right) ^2+5p^2 \left( \mathbb {E}\left\| z_{k-2}-z_{k-3} \right\| \right) ^2\\&+5p^2 \left( \mathbb {E}\left\| z_{k-3}-z_{k-4} \right\| \right) ^2+5(\mathbb {E} \Gamma _{k-1})^2\\ \le&5p^2 \left( \mathbb {E}\left\| z_{k}-z_{k-1} \right\| \right) ^2+5p^2 \left( \mathbb {E}\left\| z_{k-1}-z_{k-2} \right\| \right) ^2+5p^2 \left( \mathbb {E}\left\| z_{k-2}-z_{k-3} \right\| \right) ^2\\&+5p^2 \left( \mathbb {E}\left\| z_{k-3}-z_{k-4} \right\| \right) ^2+5s\mathbb {E} \Upsilon _{k-1}, \end{aligned}$$

where we have used the inequality \((a_1+a_2+\cdots +a_s)^2\le s (a_1^2+a_2^2+\cdots +a_s^2)\) and Jensen’s inequality. Applying this inequality to the decrease of \(\Psi _ k\) (4.2), we obtain

$$\begin{aligned}&\mathbb {E}_k\Psi _{k} \\ \le&\mathbb {E}_k\Psi _{k-1}\!-\!\kappa \left\| z_{k+1}\!-\!z_{k} \right\| ^2\!-\!\epsilon \left\| z_{k}\!-\!z_{k-1} \right\| ^2\!-\! \epsilon \left\| z_{k-1}\!-\!z_{k-2} \right\| ^2\!-\! Z \left\| z_{k-2}\!-\!z_{k-3} \right\| ^2\\ \le&\mathbb {E}_k\Psi _{k-1}-C^2+\mathcal {O}\left( \left\| z_{k+1}-z_{k} \right\| ^2 \right) +\mathcal {O} \left( \left\| z_{k}-z_{k-1} \right\| ^2 \right) +\mathcal {O} \left( \left\| z_{k-1}-z_{k-2} \right\| ^2 \right) \\&+\mathcal {O} \left( \left\| z_{k-2}-z_{k-3} \right\| ^2 \right) +\mathcal {O} \left( \mathbb {E} \Upsilon _{k-1} \right) \end{aligned}$$

for some constant \(C^2\). Because the final five terms go to zero as \(k \rightarrow \infty \), there exists an index \(M_4\) so that the sum of these five terms is bounded above by \(\frac{C^2}{2}\) for all \(k \ge M_4\). Therefore,

$$\mathbb {E}_k\Psi _{k}\le \mathbb {E}_k\Psi _{k-1}-\frac{C^2}{2},\ \ \forall k\ge M_4.$$

Because \(\Psi _k\) is bounded below for all k, this inequality can only hold for a finite number \(N < \infty \) of steps. After N steps, it is no longer possible for the bound (4.37) to hold, so it must be that \(\mathbb {E} \Phi (z_k)= \Phi _{k}^{*}\). Because \(\Phi _{k}^{*}\le \mathbb {E}\Phi (z^{*})\), \(\Phi _{k}^{*}\le \mathbb {E} \Phi (z_k)\), and both \(\mathbb {E} \Phi (z_k)\) and \(\Phi _{k}^{*}\) converge to \(\mathbb {E}\Phi (z^{*})\), we must have \(\Phi _{k}^{*}=\mathbb {E} \Phi (z_k)=\mathbb {E}\Phi (z^{*})\). \(\square \)

5 Numerical experiments

In this section, to demonstrate the advantages of STiBPALM (Algorithm 3.1), we study the practical performance of the proposed method with three different stochastic gradient estimators, namely the SGD estimator [35] (STiBPALM-SGD), the SAGA estimator [28] (STiBPALM-SAGA), and the SARAH estimator [29] (STiBPALM-SARAH), and compare it with the PALM [11], iPALM [6], TiPALM [17], SPRING [23], and SiPALM [24] algorithms. We refer to SPRING with the SGD, SAGA, and SARAH gradient estimators as SPRING-SGD, SPRING-SAGA, and SPRING-SARAH, and to SiPALM with the SGD, SAGA, and SARAH gradient estimators as SiPALM-SGD, SiPALM-SAGA, and SiPALM-SARAH, respectively. Two applications are considered for comparison: sparse nonnegative matrix factorization (S-NMF) and blind image-deblurring (BID).

Since the proposed algorithm relies on stochastic gradient estimators, we report the objective values averaged over 10 independent runs for all algorithms. The initial point is the same for all algorithms. In addition, we use the step sizes suggested in [11] for PALM and in [6] for iPALM, and, for simplicity, the same step size based on [23] for all stochastic algorithms.
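For readers who want a concrete picture of the gradient estimators being compared, the following is a minimal Python sketch (our own illustration, not the code used in the experiments) of a minibatch SAGA estimator and a loopless SARAH estimator for a finite-sum gradient \(\frac{1}{n}\sum _{i=1}^{n}\nabla H_i\); the function name grad_i, the refresh probability p, and the class names are assumptions made for illustration only. Roughly speaking, the plain SGD estimator would return just the minibatch average without any correction term.

```python
import numpy as np

# Illustrative sketches of variance-reduced gradient estimators.
# grad_i(z, i) returns the gradient of the i-th component at the point z.

class SAGA:
    def __init__(self, grad_i, n, z0):
        self.grad_i = grad_i
        self.table = np.array([grad_i(z0, i) for i in range(n)])  # stored component gradients
        self.avg = self.table.mean(axis=0)                        # running average of the table
        self.n = n

    def estimate(self, z, batch):
        old_avg = self.avg.copy()
        correction = np.zeros_like(old_avg)
        for i in batch:
            gi = self.grad_i(z, i)
            correction += gi - self.table[i]
            self.avg += (gi - self.table[i]) / self.n             # keep the average up to date
            self.table[i] = gi
        return correction / len(batch) + old_avg                  # SAGA estimate of the full gradient


class SARAH:
    """Loopless SARAH: the full gradient is recomputed with probability p."""

    def __init__(self, grad_i, n, p, rng=None):
        self.grad_i, self.n, self.p = grad_i, n, p
        self.rng = rng or np.random.default_rng(0)
        self.v = None          # current recursive estimate
        self.z_prev = None     # previous iterate

    def estimate(self, z, batch):
        if self.v is None or self.rng.random() < self.p:
            # full-gradient refresh
            self.v = np.mean([self.grad_i(z, i) for i in range(self.n)], axis=0)
        else:
            diff = np.mean([self.grad_i(z, i) - self.grad_i(self.z_prev, i)
                            for i in batch], axis=0)
            self.v = self.v + diff
        self.z_prev = np.copy(z)
        return self.v
```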

5.1 Sparse nonnegative matrix factorization

Given a matrix A, the sparse nonnegative matrix factorization (S-NMF) problem [38,39,40] can be formulated as the following model:

$$\begin{aligned} \underset{X,Y}{\min }\ \left\{ \frac{\eta }{2}\left\| A-XY \right\| _{F}^{2} : \ X,Y\ge 0,\ \left\| X_i \right\| _0\le s,\ i=1,2,\dots ,r\right\} . \end{aligned}$$
(5.1)

In dictionary learning and sparse coding, X is called the learned dictionary and Y the coefficients. In this formulation, the sparsity constraint on X restricts \(75\%\) of its entries to be zero.

Fig. 1 ORL face database, which includes 400 normalized cropped frontal faces, used in our S-NMF example

We use the extended Yale-B dataset and the ORL dataset, which are standard facial recognition benchmarks consisting of human face images. For solving the S-NMF problem (5.1), [6, 14] give details on how to solve the X- and Y-subproblems. The extended Yale-B dataset contains 2414 cropped images of size \(32 \times 32\), while the ORL dataset contains 400 images of size \(64 \times 64\) (see Fig. 1). For the Yale-B dataset, we extract 49 sparse basis images; for the ORL dataset, we extract 25 sparse basis images. In each iteration of the stochastic algorithms, we randomly subsample \(5\%\) of the full batch as a minibatch. For the SARAH gradient estimator, we set \(p=\frac{1}{20}\).
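As a rough illustration of how the X-constraint in (5.1) can be handled (a sketch of our own, not necessarily the procedure in [6, 14]), the projection onto the set \(\left\{ X\ge 0,\ \left\| X_i \right\| _0\le s \right\} \) can be computed by clipping to the nonnegative orthant and keeping only the s largest entries of each column; here s denotes the integer number of nonzeros allowed per column, and the helper name below is hypothetical.

```python
import numpy as np

# Illustrative projection for the X-subproblem in (5.1): enforce X >= 0 and
# keep only the s largest entries of each column (the l0 constraint).
def project_nonneg_sparse(X, s):
    Z = np.maximum(X, 0.0)                 # nonnegativity
    out = np.zeros_like(Z)
    for j in range(Z.shape[1]):
        keep = np.argsort(Z[:, j])[-s:]    # indices of the s largest entries in column j
        out[keep, j] = Z[keep, j]          # zero out the remaining entries
    return out

# The Y-subproblem only involves Y >= 0, so its projection is simply np.maximum(Y, 0.0).
```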

In STiBPALM, let \(\phi _1(X)=\frac{\theta _1 }{2} \left\| X\right\| ^{2}\) and \(\phi _2(Y)=\frac{\theta _2 }{2} \left\| Y \right\| ^{2}\). In the numerical experiments, we choose \(\eta =3\) and compute \(\theta _1\) and \(\theta _2\) as the largest eigenvalues of \(\eta YY^T\) and \(\eta X^TX\) at the k-th iteration, respectively. We choose \(\alpha _{1k}=\beta _{1k}=\gamma _{1k}=\mu _{1k}=\frac{k-1}{k+2}\) and \(\alpha _{2k}=\beta _{2k}=\gamma _{2k}=\mu _{2k}=\frac{k-1}{k+2}\) in TiPALM and STiBPALM, and \(\alpha _{1k}=\beta _{1k}=\gamma _{1k}=\mu _{1k}=\frac{k-1}{k+2}\) in iPALM and SiPALM. We use BTiPALM and BSTiPALM to denote TiPALM and STiBPALM with \(\phi _1(X)=\frac{\theta _1^2 }{4} \left\| X\right\| ^{4}\) and \(\phi _2(Y)=\frac{\theta _2 }{2} \left\| Y \right\| ^{2}\), respectively. We refer to BSTiPALM with the SGD, SAGA, and SARAH gradient estimators as BSTiPALM-SGD, BSTiPALM-SAGA, and BSTiPALM-SARAH, respectively.
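As a small illustration of the parameter choice just described (with placeholder data and variable names of our own, sized loosely after the Yale-B setup):

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 3.0
X = rng.random((1024, 49))    # placeholder dictionary: 32*32 pixels, 49 atoms
Y = rng.random((49, 2414))    # placeholder coefficients for 2414 images

# theta_1 and theta_2 as the largest eigenvalues of eta*Y*Y^T and eta*X^T*X;
# with phi(x) = (theta/2)*||x||^2 the Bregman distance D_phi(x, y) reduces to
# (theta/2)*||x - y||^2.
theta1 = eta * np.linalg.eigvalsh(Y @ Y.T).max()
theta2 = eta * np.linalg.eigvalsh(X.T @ X).max()
```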

Fig. 2 Objective decrease comparison of S-NMF with \(s = 25\%\) on Yale dataset. From the left column to the right column are the results of SGD, SAGA, and SARAH, respectively

Fig. 3 Objective decrease comparison of S-NMF with \(s = 25\%\) on Yale dataset

In Figs. 2 and 3, we report the numerical results for the Yale-B dataset. A similar result for the ORL dataset is plotted in Figs. 4 and 5. One can observe from these four figures that STiBPALM attains slightly lower objective values than the other algorithms within almost the same computation time. In addition, STiBPALM performs better than the SPRING and SiPALM stochastic algorithms as the epochs progress. The stochastic algorithms improve the numerical results compared with the corresponding deterministic methods. Furthermore, compared with the stochastic gradient algorithm without variance reduction (SGD), the variance-reduced stochastic gradient algorithms (SAGA, SARAH) obtain better numerical results.

Fig. 4 Objective decrease comparison of S-NMF with \(s = 25\%\) on ORL dataset. From the left column to the right column are the results of SGD, SAGA, and SARAH, respectively

Fig. 5 Objective decrease comparison of S-NMF with \(s = 25\%\) on ORL dataset

The numerical results applying different Bregman distances on the Yale-B dataset and the ORL dataset are reported in Figs. 6 and 7, respectively. We observe that the BSTiPALM algorithm obtains better numerical results than the STiBPALM algorithm, and the SARAH gradient estimator gives the best performance as the epochs progress.

Fig. 6 Objective decrease comparison of S-NMF with \(s = 25\%\) on Yale dataset with different Bregman distances

Fig. 7 Objective decrease comparison of S-NMF with \(s = 25\%\) on ORL dataset with different Bregman distances

Fig. 8 The results for 25 basis faces using different sparsity settings. From the left column to the right column are the results of TiPALM, STiBPALM-SGD, STiBPALM-SAGA, and STiBPALM-SARAH, respectively. From the top row to the bottom row are the results of \(s = 25\%\) and \(s = 50\%\), respectively

We also compare STiBPALM with SGD, SAGA, and SARAH for different sparsity settings (the value of s). The results of the basis images are shown in Fig. 8. One can observe from Fig. 8 that for smaller values of s, the four algorithms lead to more compact representations. This might improve the generalization capabilities of the representation.

Fig. 9 Objective decrease comparison (epoch counts) of blind image-deconvolution experiment on Kodim08 image using an \(11 \times 11\) motion blur kernel

Fig. 10 Objective decrease comparison (epoch counts) of blind image-deconvolution experiment on Kodim15 image using an \(11 \times 11\) motion blur kernel

Fig. 11 Image and kernel reconstructions from the blind image-deconvolution experiment on the Kodim08 image using an \(11 \times 11\) motion blur kernel

Fig. 12 Image and kernel reconstructions from the blind image-deconvolution experiment on the Kodim08 image using an \(11 \times 11\) motion blur kernel

5.2 Blind image-deblurring

Let A be a blurred image; the blind deconvolution problem is given by

$$\begin{aligned} \underset{X,Y}{\min }\ \left\{ \frac{1}{2} \left\| A\!-\!X\odot Y \right\| _{F}^{2}\!+\!\eta \sum _{r=1}^{2d} R([D(X)]_r) : \ 0\!\le \! X\!\le \! 1,\ 0\!\le \! Y\!\le \! 1,\ \left\| Y \right\| _1\!\le \! 1\right\} . \end{aligned}$$
(5.2)

In the numerical experiments, we choose \(R(v)= \log (1 + \sigma v^2)\) as in [6], where \(\sigma =10^3\) and \(\eta =5\times 10^{-5}\).
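To make the smooth part of the objective in (5.2) concrete, the following is a small Python sketch (our own illustration, with hypothetical names) of \(H(X,Y)\), using 2D convolution for the data-fit term and horizontal/vertical finite differences for \(D(\cdot )\):

```python
import numpy as np
from scipy.signal import convolve2d

sigma, eta = 1e3, 5e-5  # parameters as chosen above

def smooth_part(A, X, Y):
    """H(X, Y) = 0.5*||A - X conv Y||_F^2 + eta * sum_r R([D(X)]_r), R(v) = log(1 + sigma*v^2)."""
    residual = A - convolve2d(X, Y, mode="same")   # data-fit term with 2D convolution
    dx = np.diff(X, axis=1)                        # horizontal image gradients
    dy = np.diff(X, axis=0)                        # vertical image gradients
    reg = np.log1p(sigma * dx ** 2).sum() + np.log1p(sigma * dy ** 2).sum()
    return 0.5 * np.linalg.norm(residual, "fro") ** 2 + eta * reg
```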

We consider two images, Kodim08 and Kodim15, of size \(256 \times 256\) for testing. For each image, two blur kernels, linear motion blur and out-of-focus blur, are considered with additional additive Gaussian noise. In this numerical experiment, we mainly use the SARAH gradient estimator and set \(p=\frac{1}{64}\). We take \(\alpha _{1k}=\beta _{1k}=\gamma _{1k}=\mu _{1k}=\frac{k-1}{k+2}\) and \(\alpha _{2k}=\beta _{2k}=\gamma _{2k}=\mu _{2k}=\frac{k-1}{k+2}\) in TiPALM and STiBPALM, and \(\alpha _{1k}=\beta _{1k}=\gamma _{1k}=\mu _{1k}=\frac{k-1}{k+2}\) in iPALM.

The convergence comparisons of the algorithms for both images with motion blur are provided in Figs. 9 and 10, from which we observe that STiBPALM-SARAH is faster than the other methods. Figures 11 and 12 provide comparisons of the recovered image and blur kernel. We observe superior performance of the stochastic algorithms over the deterministic algorithms in these figures as well. In particular, when comparing the estimated blur kernels of STiBPALM-SARAH and TiPALM every 20 epochs, we clearly see that STiBPALM-SARAH recovers more accurate solutions more quickly.

6 Conclusion

In this paper, we propose a stochastic two-step inertial Bregman proximal alternating linearized minimization (STiBPALM) algorithm with variance-reduced gradient estimators to solve a class of nonconvex nonsmooth optimization problems. Under mild conditions, we analyze the convergence properties of STiBPALM when using a variety of variance-reduced gradient estimators and prove specific convergence rates for the SAGA and SARAH estimators. We also apply the STiBPALM algorithm to sparse nonnegative matrix factorization and blind image-deblurring problems and perform numerical experiments to demonstrate the effectiveness of the proposed algorithm.