Abstract
In this paper, we study the problem of distributed multi-agent optimization over a network, where each agent possesses a local cost function that is smooth and strongly convex. The global objective is to find a common solution that minimizes the average of all cost functions. Assuming agents only have access to unbiased estimates of the gradients of their local cost functions, we consider a distributed stochastic gradient tracking method (DSGT) and a gossip-like stochastic gradient tracking method (GSGT). We show that, in expectation, the iterates generated by each agent are attracted to a neighborhood of the optimal solution, which they reach exponentially fast under a constant stepsize choice. Under DSGT, the limiting (expected) error bounds on the distance of the iterates from the optimal solution decrease with the network size n, a performance comparable to that of a centralized stochastic gradient algorithm. Moreover, we show that when the network is well-connected, GSGT incurs a lower communication cost than DSGT while maintaining a similar computational cost. A numerical example further demonstrates the effectiveness of the proposed methods.
1 Introduction
Consider a set of agents \({\mathcal {N}}=\{1,2,\ldots ,n\}\) connected over a network. Each agent has a local smooth and strongly convex cost function \(f_i:{\mathbb {R}}^{p}\rightarrow {\mathbb {R}}\). The global objective is to locate \(x\in {\mathbb {R}}^p\) that minimizes the average of all cost functions:
\(\min _{x\in {\mathbb {R}}^p} f(x):=\frac{1}{n}\sum _{i=1}^{n}f_i(x).\)  (1)
Scenarios in which problem (1) is considered include distributed machine learning [12, 15, 35], multi-agent target seeking [8, 42], and wireless networks [2, 11, 30], among many others.
To solve problem (1), we assume each agent i queries a stochastic oracle (\({{\mathcal {S}}}{{\mathcal {O}}}\)) to obtain noisy gradient samples of the form \(g_i(x,\xi _i)\) that satisfy the following condition:
Assumption 1
For all \(i\in {\mathcal {N}}\) and all \(x\in {\mathbb {R}}^p\), each random vector \(\xi _i\in {\mathbb {R}}^m\) is independent, and
\({\mathbb {E}}[g_i(x,\xi _i)\mid x]=\nabla f_i(x),\quad {\mathbb {E}}[\Vert g_i(x,\xi _i)-\nabla f_i(x)\Vert ^2\mid x]\le \sigma ^2\) for some \(\sigma >0\).
The above assumption on the stochastic gradients holds for many online distributed learning problems, where \(f_i(x)={\mathbb {E}}_{\xi _i}[F_i(x,\xi _i)]\) denotes the expected loss function agent i wishes to minimize, while independent samples \(\xi _i\) are gathered continuously over time. As another example, in simulation-based optimization, gradient estimation often incurs noise from various sources, such as modeling and discretization errors, incomplete convergence, and finite sample sizes for Monte Carlo methods [22].
Distributed algorithms dealing with problem (1) have been studied extensively in the literature [13, 19, 20, 28, 34, 36, 37, 45, 46, 52, 56]. Recently, there has been considerable interest in distributed implementations of stochastic gradient algorithms [3, 5,6,7, 9, 10, 14, 18, 24, 26, 32, 40, 41, 48, 51, 54, 55]. The literature has shown that distributed methods may compete with, or even outperform, their centralized counterparts under certain conditions [9, 10, 26, 40, 41]. For instance, in our recent work [40], we proposed a flocking-based approach for distributed stochastic optimization that beats a centralized gradient method in real time, assuming that all \(f_i\) are identical. However, to the best of our knowledge, no distributed stochastic gradient method addressing problem (1) has shown performance comparable to that of a centralized approach for optimizing the sum of smooth and strongly convex objective functions under Assumption 1 only. In particular, under constant stepsize policies, none of the existing algorithms achieves an error bound that decreases with the network size n.
A distributed gradient tracking method was proposed in [13, 34, 46], where agent-based auxiliary variables \(y_i\) were introduced to track the average gradients of \(f_i\), assuming accurate gradient information is available. It was shown that the method, with a constant stepsize, generates iterates that converge linearly to the optimal solution. Inspired by this approach, in this paper we consider a distributed stochastic gradient tracking method (DSGT). By comparison, in our proposed algorithm the \(y_i\) track the stochastic gradient averages of the \(f_i\). We are able to show that the iterates generated by each agent reach, in expectation, a neighborhood of the optimal point exponentially fast under a constant stepsize. Interestingly, with a sufficiently small stepsize, the limiting error bounds on the distance between the agent iterates and the optimal solution decrease with the network size, which is comparable to the performance of a centralized stochastic gradient algorithm.
Gossip-based communication protocols are popular choices for distributed computation due to their low communication costs [4, 25, 29, 31]. In the second part of this paper, we consider a gossip-like stochastic gradient tracking algorithm (GSGT) where, at each iteration, an agent wakes up uniformly at random and either communicates with one of her neighbors or updates by herself. Similar to DSGT, the method produces iterates that converge to a neighborhood of the optimal point exponentially fast under a sufficiently small constant stepsize. When the network of agents is well-connected (e.g., complete networks, almost all regular graphs), GSGT is shown to incur a lower communication burden and a similar computational cost compared to DSGT.
1.1 Related work
We now briefly review the literature on (distributed) stochastic optimization. First of all, our work is related to the extensive literature on stochastic approximation (SA) methods dating back to the seminal works [21, 49]. These works include the analysis of convergence (conditions for convergence, rates of convergence, suitable choices of stepsize) in the context of diverse noise models [23]. Assuming the objective function f is strongly convex with Lipschitz continuous gradients, the optimal rate of convergence for solving problem (1) has been shown to be \({\mathcal {O}}\left( 1/k\right) \) under a diminishing SA stepsize, where k denotes the iteration number [38]. With a constant stepsize \(\alpha >0\) that is sufficiently small, the iterates generated by a stochastic gradient method are attracted to an \({\mathcal {O}}(\alpha )\)-neighborhood of the optimal solution exponentially fast (in expectation).
Distributed implementations of stochastic gradient methods have become increasingly popular in recent years. In [48], the authors considered minimizing a sum of (possibly nonsmooth) convex objective functions subject to a common convex constraint set. It was shown that when the means of the stochastic subgradient errors diminish, there is mean consensus among the agents and mean convergence to the optimum function value under SA stepsizes. The work [54] used two diminishing stepsizes to deal with communication noises and subgradient errors, respectively. Asymptotic convergence to the optimal set was established; for constant stepsizes asymptotic error bounds were derived. In [14], a distributed dual averaging method was proposed for minimizing (possibly nonsmooth) convex functions. Under a carefully chosen SA stepsize sequence, the method exhibits the convergence rate \({\mathcal {O}}(\frac{n\log (k)}{(1-\lambda _2({\mathbf {W}}))\sqrt{k}})\), in which \(\lambda _2({\mathbf {W}})\) denotes the second largest singular value of the doubly stochastic mixing matrix \({\mathbf {W}}\). Paper [3] considered a projected stochastic gradient algorithm for solving non-convex optimization problems by combining a local stochastic gradient update and a gossip step. It was proved that consensus is asymptotically achieved in the network and the solutions converge to the set of KKT points with SA stepsizes. A distributed online algorithm was devised and analyzed in [5] for solving dynamic optimization problems in noisy communication environments. Sufficient conditions were provided for almost sure convergence of the algorithm. In [55], the authors proposed an adaptive diffusion algorithm based on penalty methods. Under a constant stepsize \(\alpha \), it was shown that the expected distance between the optimal solution and that obtained at each node is bounded by \({\mathcal {O}}(\alpha )\). 
The work [9, 10] further showed that under a sufficiently small stepsize and certain conditions on the stochastic gradients, distributed methods are able to achieve the same performance level as that of a centralized approach. Paper [7] considered the problem of distributed constrained convex optimization subject to multiple noise terms in both computation and communication stages. The authors utilized an augmented Lagrangian framework and established the almost sure convergence of the algorithm under a diminishing stepsize policy. In [32], a subgradient-push method was investigated for distributed optimization over time-varying directed graphs. When the objective function is strongly convex, the scheme exhibits the \({\mathcal {O}}(\frac{\ln k}{k})\) rate of convergence.
In a recent work [24], a class of decentralized first-order methods for nonsmooth and stochastic optimization was presented. The class was shown to exhibit the \({\mathcal {O}}(\frac{1}{k})\) (respectively, \({\mathcal {O}}(\frac{1}{\sqrt{k}})\)) rate of convergence for minimizing the sum of strongly convex functions (respectively, general convex functions). The work in [1] considered a composite convex optimization problem with noisy gradient information and showed the \({\mathcal {O}}(\frac{1}{\sqrt{k}})\) convergence rate using an ADMM-based approach. Paper [26] considered a decentralized stochastic gradient algorithm that achieves the \({\mathcal {O}}(\frac{1}{k}+\frac{1}{\sqrt{nk}})\) rate of convergence for minimizing the sum of non-convex functions. The rate is comparable to that of a centralized algorithm when k is large enough. At the same time, the communication cost for the decentralized approach is lower. Papers [40, 41] also demonstrate the advantage of distributively implementing a stochastic gradient method, assuming that all \(f_i\) are identical and sampling times are random and non-negligible. The work [51] utilized a time-dependent weighted mixing of stochastic subgradient updates to achieve the convergence rate of \({\mathcal {O}}(\frac{n\sqrt{n}}{(1-\lambda _2({\mathbf {W}}))k})\) for minimizing the sum of (possibly nonsmooth) strongly convex functions. In [53], the authors considered a decentralized consensus-based algorithm with delayed gradient information. The method was shown to achieve the optimal \({\mathcal {O}}(\frac{1}{\sqrt{k}})\) rate of convergence for general convex functions. In [18], the \({\mathcal {O}}(\frac{1}{k})\) convergence rate was established for strongly convex costs and random networks.
1.2 Main contribution
Our main contribution is summarized as follows. Firstly, we propose a novel distributed stochastic gradient tracking method (DSGT) for optimizing the sum of smooth and strongly convex objective functions. We employ an auxiliary variable \(y_i\) for each agent that tracks the average stochastic gradients of the cost functions. We show that, under a constant stepsize choice, the algorithm is comparable to a centralized stochastic gradient scheme in terms of their convergence speeds and the ultimate error bounds. In particular, the obtained error bound under DSGT decreases with the network size n, which has not been shown in the literature to the best of our knowledge. Moreover, assuming the gradient estimates are accurate, DSGT recovers the linear rate of convergence to the optimal solution [34, 46], which is also a unique feature among other distributed stochastic gradient algorithms.
Secondly, with an SA stepsize \(\alpha _k\rightarrow 0\), DSGT enjoys the optimal \({\mathcal {O}}(\frac{1}{k})\) rate of convergence to the optimal point. In addition, we characterize the dependency of the constant factors in the stepsize and the convergence rate on the properties of the mixing matrix as well as the characteristics of the objective functions, such as the strong convexity factor and the Lipschitz constant.
Thirdly, we introduce a gossip-like stochastic gradient tracking method that is efficient in communication. We show that, under a sufficiently small constant stepsize, GSGT also produces iterates that converge to a neighborhood of the optimal point exponentially fast. Again, when the gradient estimates are accurate, GSGT recovers the linear rate of convergence to the optimal solution. Compared to DSGT, we show that when the network is well-connected (e.g., complete network, almost all regular graphs), GSGT incurs lower communication cost than DSGT by a factor of \({\mathcal {O}}(\frac{|{\mathcal {E}}|}{n})\) (\(|{\mathcal {E}}|\) denoting the number of edges in the network) while maintaining a similar computational cost.
Finally, we provide a numerical example that demonstrates the effectiveness of the proposed methods when contrasted with the centralized stochastic gradient algorithm and some existing variants of distributed stochastic gradient methods.
1.3 Notation and assumptions
Throughout the paper, vectors are column vectors unless otherwise specified. Let each agent i hold a local copy \(x_i\in {\mathbb {R}}^p\) of the decision variable and an auxiliary variable \(y_i\in {\mathbb {R}}^p\). Their values at iteration/time k are denoted by \(x_{i,k}\) and \(y_{i,k}\), respectively. We let
\({\mathbf {x}}:=[x_1, x_2, \ldots , x_n]^{\intercal }\in {\mathbb {R}}^{n\times p},\quad {\mathbf {y}}:=[y_1, y_2, \ldots , y_n]^{\intercal }\in {\mathbb {R}}^{n\times p},\)
and
\({\overline{x}}:=\frac{1}{n}{\mathbf {1}}^{\intercal }{\mathbf {x}}\in {\mathbb {R}}^{1\times p},\quad {\overline{y}}:=\frac{1}{n}{\mathbf {1}}^{\intercal }{\mathbf {y}}\in {\mathbb {R}}^{1\times p},\)
where \({\mathbf {1}}\) denotes the vector with all entries equal to 1. We define an aggregate objective function of the local variables:
\(F({\mathbf {x}}):=\sum _{i=1}^{n}f_i(x_i),\)
and let
\(\nabla F({\mathbf {x}}):=[\nabla f_1(x_1), \nabla f_2(x_2), \ldots , \nabla f_n(x_n)]^{\intercal }\in {\mathbb {R}}^{n\times p}.\)
In addition, denote
\(h({\mathbf {x}}):=\frac{1}{n}{\mathbf {1}}^{\intercal }\nabla F({\mathbf {x}}).\)
The inner product of two vectors a, b of the same dimension is denoted by \(\langle a,b\rangle \). For two matrices \({\mathbf {A}},{\mathbf {B}}\in {\mathbb {R}}^{n\times p}\), we let \(\langle {\mathbf {A}},{\mathbf {B}}\rangle \) be the Frobenius inner product. We use \(\Vert \cdot \Vert \) to denote the 2-norm of vectors; for matrices, \(\Vert \cdot \Vert \) represents the Frobenius norm. The spectral radius of a square matrix \({\mathbf {M}}\) is denoted by \(\rho ({\mathbf {M}})\).
We make the following standing assumption on the individual objective functions \(f_i\).
Assumption 2
Each \(f_i:{\mathbb {R}}^p\rightarrow {\mathbb {R}}\) is \(\mu \)-strongly convex with L-Lipschitz continuous gradients, i.e., for any \(x,x'\in {\mathbb {R}}^p\),
\(\langle \nabla f_i(x)-\nabla f_i(x'), x-x'\rangle \ge \mu \Vert x-x'\Vert ^2,\quad \Vert \nabla f_i(x)-\nabla f_i(x')\Vert \le L\Vert x-x'\Vert .\)
We note that, under Assumption 2, problem (1) has a unique solution denoted by \(x^*\in {\mathbb {R}}^{1\times p}\).
A graph is a pair \({\mathcal {G}}=({\mathcal {V}},{\mathcal {E}})\) where \({\mathcal {V}}=\{1,2,\ldots ,n\}\) is the set of vertices (nodes) and \({\mathcal {E}}\subseteq {\mathcal {V}}\times {\mathcal {V}}\) represents the set of edges connecting vertices. We assume agents communicate in an undirected graph, i.e., \((i,j)\in {\mathcal {E}}\) iff (if and only if) \((j,i)\in {\mathcal {E}}\). For each agent i, let \({\mathcal {N}}_i=\{j\mid j\ne i, (i,j)\in {\mathcal {E}}\}\) be its set of neighbors. The cardinality of \({\mathcal {N}}_i\), denoted by \(\deg (i)\), is referred to as agent i’s degree. We consider the following condition regarding the interaction graph of agents.
Assumption 3
The graph \({\mathcal {G}}\) corresponding to the network of agents is undirected and connected, i.e., there exists a path between any two agents.
1.4 Organization of the paper
The paper is organized as follows. In Sect. 2, we introduce the distributed stochastic gradient tracking method and present its main convergence results. We perform analysis in Sect. 3. In Sect. 4 we propose the gossip-like stochastic gradient tracking method. A numerical example is provided in Sect. 5 to illustrate our theoretical findings. Section 6 concludes the paper.
2 A distributed stochastic gradient tracking method (DSGT)
We consider the following distributed stochastic gradient tracking method: at each step \(k\in {\mathbb {N}}\), every agent i independently implements the following two updates:
\(x_{i,k+1}=\sum _{j=1}^{n}w_{ij}\left( x_{j,k}-\alpha y_{j,k}\right) ,\)
\(y_{i,k+1}=\sum _{j=1}^{n}w_{ij}y_{j,k}+g_i(x_{i,k+1},\xi _{i,k+1})-g_i(x_{i,k},\xi _{i,k}),\)  (4)
where \(w_{ij}\) are nonnegative weights and \(\alpha >0\) is a constant stepsize. Agents i and j are connected iff \(w_{ij}, w_{ji}>0\). The iterates are initialized with an arbitrary \(x_{i,0}\) and \(y_{i,0}= g_i(x_{i,0},\xi _{i,0})\) for all \(i\in {{{\mathcal {N}}}}\). We can also write (4) in the following compact form:
\({\mathbf {x}}_{k+1}={\mathbf {W}}\left( {\mathbf {x}}_k-\alpha {\mathbf {y}}_k\right) ,\quad {\mathbf {y}}_{k+1}={\mathbf {W}}{\mathbf {y}}_k+{\mathbf {g}}({\mathbf {x}}_{k+1},\varvec{\xi }_{k+1})-{\mathbf {g}}({\mathbf {x}}_k,\varvec{\xi }_k),\)  (5)
with \({\mathbf {g}}({\mathbf {x}}_k,\varvec{\xi }_k):=[g_1(x_{1,k},\xi _{1,k}), g_2(x_{2,k},\xi _{2,k}), \ldots , g_n(x_{n,k},\xi _{n,k})]^{\intercal }\),
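To make the update concrete, the following Python sketch simulates a DSGT-style iteration in the compact form \({\mathbf {x}}_{k+1}={\mathbf {W}}({\mathbf {x}}_k-\alpha {\mathbf {y}}_k)\), \({\mathbf {y}}_{k+1}={\mathbf {W}}{\mathbf {y}}_k+{\mathbf {g}}({\mathbf {x}}_{k+1},\varvec{\xi }_{k+1})-{\mathbf {g}}({\mathbf {x}}_k,\varvec{\xi }_k)\). The quadratic objectives \(f_i(x)=\frac{1}{2}\Vert x-b_i\Vert ^2\), the ring topology, the noise level, and all numerical constants are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, alpha, sigma = 8, 3, 0.05, 0.01

# Illustrative local objectives: f_i(x) = 0.5*||x - b_i||^2, so the
# minimizer of the average is x* = mean(b_i).
b = rng.normal(size=(n, p))
x_star = b.mean(axis=0)

def grad(x, i):
    """Noisy gradient oracle g_i(x, xi): exact gradient plus i.i.d. noise."""
    return (x - b[i]) + sigma * rng.normal(size=p)

# Doubly stochastic mixing matrix for a ring graph (1/3 to each neighbor).
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    W[i, i] = 1.0 - W[i].sum()

# DSGT iterations: x <- W(x - alpha*y), y <- W y + g(x_new) - g(x_old),
# initialized with y_0 equal to the first stochastic gradients.
x = rng.normal(size=(n, p))
g_old = np.array([grad(x[i], i) for i in range(n)])
y = g_old.copy()
for k in range(2000):
    x = W @ (x - alpha * y)
    g_new = np.array([grad(x[i], i) for i in range(n)])
    y = W @ y + g_new - g_old
    g_old = g_new

err = np.linalg.norm(x.mean(axis=0) - x_star)
print(err)  # small: the averaged iterate settles near x*
```

Note that, since \({\mathbf {W}}\) is doubly stochastic, the simulation preserves the exact averaging identity \({\overline{y}}_k=\frac{1}{n}\sum _i g_i(x_{i,k},\xi _{i,k})\) discussed in the text, which can be checked by comparing the column means of `y` and `g_old`.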
where \({\mathbf {W}}=[w_{ij}]\in {\mathbb {R}}^{n\times n}\) denotes the coupling matrix of agents. We assume that \({\mathbf {W}}\) satisfies the following condition.
Assumption 4
Nonnegative coupling matrix \({\mathbf {W}}\) is doubly stochastic, i.e., \({\mathbf {W}}{\mathbf {1}}={\mathbf {1}}\) and \({\mathbf {1}}^{\intercal }{\mathbf {W}}={\mathbf {1}}^{\intercal }\). In addition, \(w_{ii}>0\) for some \(i\in {\mathcal {N}}\).
In the subsequent analysis, we will frequently use the following result, which is a direct implication of Assumptions 3 and 4 (see [46] Section II-B).
Lemma 1
Let Assumptions 3 and 4 hold, and let \(\rho _w\) denote the spectral norm of the matrix \({\mathbf {W}}-\frac{1}{n}{\mathbf {1}}{\mathbf {1}}^{\intercal }\). Then, \(\rho _w<1\) and
for all \(\omega \in {\mathbb {R}}^{n\times p}\), where \({\overline{\omega }}=\frac{1}{n}{\mathbf {1}}^{\intercal }\omega \).
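As a quick sanity check, the contraction in Lemma 1 can be verified numerically; the ring-graph mixing matrix below is an illustrative choice satisfying Assumptions 3 and 4.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 4

# Illustrative doubly stochastic mixing matrix for a ring graph.
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25
    W[i, i] = 0.5

# rho_w: spectral norm of W - (1/n) 1 1^T.
ones = np.ones((n, 1))
rho_w = np.linalg.norm(W - ones @ ones.T / n, ord=2)
assert rho_w < 1  # Lemma 1: holds for a connected graph with w_ii > 0

# Check ||W omega - 1 omega_bar|| <= rho_w * ||omega - 1 omega_bar||,
# with Frobenius norms; broadcasting subtracts the row-average omega_bar.
omega = rng.normal(size=(n, p))
omega_bar = omega.mean(axis=0)
lhs = np.linalg.norm(W @ omega - omega_bar)
rhs = rho_w * np.linalg.norm(omega - omega_bar)
print(lhs <= rhs + 1e-12)  # True
```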
Algorithm (4) is closely related to the schemes considered in [13, 34, 46], in which auxiliary variables \(y_{i,k}\) were introduced to track the average \(\frac{1}{n}\sum _{i=1}^{n}\nabla f_i(x_{i,k})\). This design ensures that the algorithms achieve linear convergence under a constant stepsize choice. Correspondingly, under our approach \(y_{i,k}\) are (approximately) tracking \(\frac{1}{n}\sum _{i=1}^{n}g_i(x_{i,k},\xi _{i,k})\). To see why this is the case, note that \({\overline{y}}_k = \frac{1}{n}{\mathbf {1}}^{\intercal } {\mathbf {y}}_k\). Since \(y_{i,0}=g_i(x_{i,0},\xi _{i,0}),\forall i\), by induction we have
\({\overline{y}}_k=\frac{1}{n}\sum _{i=1}^{n}g_i(x_{i,k},\xi _{i,k}),\quad \forall k.\)
It will be shown that \({\mathbf {y}}_k\) is close to \({\mathbf {1}}{\overline{y}}_k\) in expectation when k is sufficiently large. As a result, \(y_{i,k}\) are (approximately) tracking \(\frac{1}{n}\sum _{i=1}^{n}g_i(x_{i,k},\xi _{i,k})\).
It is worth noting that, compared to standard distributed subgradient methods [36], DSGT incurs twice the communication and storage costs per iteration.
2.1 Main results
Main convergence properties of DSGT are covered in the following theorem.
Theorem 1
Let \(\varGamma >1\) be arbitrarily chosen. Suppose Assumptions 1 and 2 hold and the stepsize \(\alpha \) satisfies
Then \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\overline{x}}_l-x^*\Vert ^2]\) and \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\mathbf {x}}_{l}-{\mathbf {1}}{\overline{x}}_{l}\Vert ^2]\), respectively, converge to \(\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2]\) and \(\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2]\) at the linear rate \({\mathcal {O}}(\rho ({\mathbf {A}})^k)\), where \(\rho ({\mathbf {A}})<1\) is the spectral radius of the matrix \({\mathbf {A}}\) given by
where \(\beta =\frac{1-\rho _w^2}{2\rho _w^2}-4\alpha L-2\alpha ^2 L^2\). Furthermore,
and
where
Remark 1
The first term on the right-hand side of (8) can be interpreted as the error caused by the stochastic gradients only, since it does not depend on the network topology. The second term on the right-hand side of (8) and the bound in (9) are network dependent, and they increase with \(\rho _w\) (a larger \(\rho _w\) indicates worse network connectivity).
In light of (8) and (9), we have
and
Let \(\frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}x^*\Vert ^2]\) measure the average quality of solutions obtained by all the agents. We have
which is decreasing in the network size n when \(\alpha \) is sufficiently small, i.e., when
The spectral gap \(1-\rho _w\) depends on the graph topology as well as the choice of weights \([w_{ij}]\). For example, suppose the Lazy Metropolis rule is adopted (see [33]). In this case, if the communication graph is a 1) path or ring graph, then \(1-\rho _w={\mathcal {O}}(1/n^2)\); 2) lattice, then \(1-\rho _w={\mathcal {O}}(1/n)\); 3) geometric random graph, then \(1-\rho _w={\mathcal {O}}(1/(n\log n))\); 4) Erdős–Rényi random graph, then \(1-\rho _w={\mathcal {O}}(1)\); 5) complete graph, then \(1-\rho _w=1\); 6) any connected undirected graph, then \(1-\rho _w={\mathcal {O}}(1/n^2)\).
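The dependence of the spectral gap on topology is easy to probe numerically. The sketch below compares a ring and a complete graph, taking Lazy Metropolis weights to mean \(w_{ij}=1/(2\max \{\deg (i),\deg (j)\})\) for neighbors (our reading of the rule in [33]); the graph size is an illustrative choice.

```python
import numpy as np

def lazy_metropolis(adj):
    """Lazy Metropolis weights: w_ij = 1/(2*max(deg_i, deg_j)) for
    neighbors, with w_ii chosen so each row sums to one."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (2.0 * max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

def spectral_gap(W):
    """1 - rho_w, where rho_w is the spectral norm of W - (1/n) 1 1^T."""
    n = W.shape[0]
    return 1.0 - np.linalg.norm(W - np.ones((n, n)) / n, ord=2)

n = 16
ring = np.zeros((n, n), dtype=int)
for i in range(n):
    ring[i, (i - 1) % n] = ring[i, (i + 1) % n] = 1
complete = 1 - np.eye(n, dtype=int)

gap_ring = spectral_gap(lazy_metropolis(ring))
gap_complete = spectral_gap(lazy_metropolis(complete))
print(gap_ring < gap_complete)  # True: the ring gap shrinks like 1/n^2
```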
From the above argument, the condition \(\alpha \sim \frac{\mu }{L^2}\frac{(1-\rho _w)^3}{n}\) is in general stricter than (7) in Theorem 1, which requires \(\alpha \sim \frac{(1-\rho _w)^2}{L}\) (when \(1-\rho _w\le (\frac{\mu }{L})^{2/3}\)). This difference suggests that, when implementing DSGT in practice, it can be advantageous to use a larger stepsize in the early stage to achieve faster convergence (see Corollary 1 below) and then switch to a smaller stepsize for better accuracy of the final solutions.\(\square \)
Remark 2
Under a centralized stochastic gradient (CSG) algorithm in the form of
we would obtain
It can be observed that DSGT is comparable with CSG in their ultimate error bounds (up to constant factors) with sufficiently small stepsizes.\(\square \)
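For reference, here is a minimal sketch of the CSG baseline, assuming the standard update \(x_{k+1}=x_k-\frac{\alpha }{n}\sum _{i=1}^{n}g_i(x_k,\xi _{i,k})\); the quadratic objectives \(f_i(x)=\frac{1}{2}\Vert x-b_i\Vert ^2\) and all numerical constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, alpha, sigma = 8, 3, 0.05, 0.01

# Illustrative setup: f_i(x) = 0.5*||x - b_i||^2, so x* = mean(b_i).
b = rng.normal(size=(n, p))
x_star = b.mean(axis=0)

# CSG: a single central iterate averaging all n noisy gradients per step,
# so the effective gradient-noise variance is sigma^2 / n.
x = rng.normal(size=p)
for k in range(2000):
    g = (x - b) + sigma * rng.normal(size=(n, p))  # all n noisy gradients
    x = x - alpha * g.mean(axis=0)

err = np.linalg.norm(x - x_star)
print(err)  # small: the iterate settles in a noise-dominated neighborhood
```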
As shown in Theorem 1, the convergence rate of DSGT is determined by the spectral radius \(\rho ({\mathbf {A}})<1\). In the corollary below we provide an upper bound of \(\rho ({\mathbf {A}})\).
Corollary 1
Under the conditions in Theorem 1, assuming in addition that the stepsize \(\alpha \) also satisfies
we have
Corollary 1 implies that, for sufficiently small stepsizes, the distributed gradient tracking method has a convergence speed comparable to that of a centralized scheme (in which case the linear rate is \({\mathcal {O}}((1-2\alpha \mu )^k)\)).
In the next theorem, we show DSGT achieves the \({\mathcal {O}}(\frac{1}{k})\) rate of convergence under a diminishing stepsize policy.
Theorem 2
Let Assumptions 1 and 2 hold. Consider the method in (4) where \(\alpha \) is replaced with the time-varying stepsize \(\alpha _k\) given by \(\alpha _k:=\theta /(m+k)\), where \(\theta >1/\mu \) and m satisfies
with \(C=\left[ \left( \frac{1-\rho _w^2}{2\rho _w^2}-\frac{4\theta L}{m}-\frac{2\theta ^2 L^2}{m^2}\right) ^{-1}+2\right] \Vert {\mathbf {W}}-{\mathbf {I}}\Vert ^2 L^2+\frac{3\theta L^3}{m}.\) Then for all \(k\ge 0\), we have
where \({\mathcal {O}}_k(1)\) denotes some constant that does not depend on k.
Remark 3
From (15), noting that \(\theta \mu >1\), we have
In particular, \(\frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_{k}-{\mathbf {1}}x^*\Vert ^2]\) asymptotically converges to 0 at the rate \(\frac{2\theta ^2 \sigma ^2}{n(\theta \mu -1)k}\), which does not depend on the spectral norm \(\rho _w\) and matches the convergence rate (up to constant factors) for centralized stochastic gradient methods (see [38, 47]).\(\square \)
Remark 4
Under a well connected network (\(\rho _w\simeq 0\)), \(m>\frac{\theta }{2}(\mu +L)\) suffices. When \(\rho _w\simeq 1\), the lower bound of m is in the order of \({\mathcal {O}}((1-\rho _w)^{-2})\).\(\square \)
Theorem 2 implies the following result if we choose \(\alpha _k=\theta /(k+1)\).
Corollary 2
Under Assumptions 1 and 2 and stepsize policy \(\alpha _k:=\theta /(k+1)\) for some \(\theta >1/\mu \), we have for all \(k\ge m\) where m satisfies condition (14),
where \({\tilde{U}}\), \({\tilde{X}}\), and \({\tilde{Y}}\) are some positive constants.
3 Analysis
In this section, we prove Theorem 1 by studying the evolution of \({\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2]\), \({\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2]\) and \({\mathbb {E}}[\Vert {\mathbf {y}}_k-{\mathbf {1}}{\overline{y}}_k\Vert ^2]\). Our strategy is to bound each of the three quantities by a linear combination of their past values, thereby establishing a linear system of inequalities. This approach differs from those employed in [34, 46], where the analyses examine \(\Vert {\overline{x}}_k-x^*\Vert \), \(\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert \) and \(\Vert {\mathbf {y}}_k-{\mathbf {1}}{\overline{y}}_k\Vert \). The distinction is due to the stochastic gradients \(g_i(x_{i,k},\xi _{i,k})\), whose variances play a crucial role in deriving the main inequalities.
We first introduce some lemmas that will be used later in the analysis. Denote by \({\mathcal {F}}_k\) the \(\sigma \)-algebra generated by \(\{\varvec{\xi }_0,\ldots ,\varvec{\xi }_{k-1}\}\), and define \({\mathbb {E}}[\cdot \mid {\mathcal {F}}_k]\) as the conditional expectation given \({\mathcal {F}}_k\).
Lemma 2
Under Assumption 1, recalling that \(h({\mathbf {x}})= \frac{1}{n}{\mathbf {1}}^{\intercal }\nabla F({\mathbf {x}})\), we have for all \(k\ge 0\),
\({\mathbb {E}}[{\overline{y}}_k\mid {\mathcal {F}}_k]=h({\mathbf {x}}_k),\quad {\mathbb {E}}[\Vert {\overline{y}}_k-h({\mathbf {x}}_k)\Vert ^2\mid {\mathcal {F}}_k]\le \frac{\sigma ^2}{n}.\)
Proof
By the definitions of \({\overline{y}}_k\) and \(h({\mathbf {x}}_k)\),
\(\square \)
Lemma 3
Under Assumption 2, we have for all \(k\ge 0\),
If in addition \(\alpha <2/(\mu +L)\), then
Proof
See [46] Lemma 10 for reference. \(\square \)
In the following lemma, we establish bounds on \(\Vert {\mathbf {x}}_{k+1}-{\mathbf {1}}{\overline{x}}_{k+1}\Vert ^2\) and on the conditional expectations of \(\Vert {\overline{x}}_{k+1}-x^*\Vert ^2\) and \(\Vert {\mathbf {y}}_{k+1}-{\mathbf {1}}{\overline{y}}_{k+1}\Vert ^2\), respectively.
Lemma 4
Suppose Assumptions 1 and 2 hold and \(\alpha <2/(\mu +L)\). We have the following inequalities:
and for any \(\beta >0\),
Proof
See Appendix 7.1. \(\square \)
3.1 Proof of Theorem 1
Taking full expectation on both sides of (18)–(20), we obtain the following linear system of inequalities
where the inequality is to be taken component-wise, and the entries of the matrix \({\mathbf {A}}=[a_{ij}]\) are given by
and \(M_{\sigma }\) is given in (10). Therefore, by induction we have
If the spectral radius of \({\mathbf {A}}\) satisfies \(\rho ({\mathbf {A}})<1\), then \({\mathbf {A}}^k\) converges to \({\mathbf {0}}\) at the linear rate \({\mathcal {O}}(\rho ({\mathbf {A}})^k)\) (see [17]), in which case \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\overline{x}}_l-x^*\Vert ^2]\), \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\mathbf {x}}_l-{\mathbf {1}}{\overline{x}}_l\Vert ^2]\) and \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\mathbf {y}}_l-{\mathbf {1}}{\overline{y}}_l\Vert ^2]\) all converge to a neighborhood of 0 at the linear rate \({\mathcal {O}}(\rho ({\mathbf {A}})^k)\). The next lemma provides conditions that ensure \(\rho ({\mathbf {A}})<1\).
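The mechanism behind this step can be illustrated numerically: for any nonnegative matrix with spectral radius below one, iterating the worst-case recursion \(v_{k+1}={\mathbf {A}}v_k+B\) drives \(v_k\) to the fixed point \(({\mathbf {I}}-{\mathbf {A}})^{-1}B\) at the linear rate \({\mathcal {O}}(\rho ({\mathbf {A}})^k)\). The matrix and vector below are illustrative stand-ins, not the \({\mathbf {A}}\) and \(B\) of the proof.

```python
import numpy as np

# Illustrative nonnegative matrix with spectral radius < 1 and a
# nonnegative perturbation vector, mimicking v_{k+1} <= A v_k + B.
A = np.array([[0.4, 0.3, 0.1],
              [0.2, 0.4, 0.3],
              [0.1, 0.2, 0.4]])
B = np.array([0.1, 0.0, 0.2])
assert max(abs(np.linalg.eigvals(A))) < 1

# Iterating the worst-case equality v_{k+1} = A v_k + B converges to the
# fixed point (I - A)^{-1} B at the linear rate O(rho(A)^k).
v = np.array([5.0, 3.0, 4.0])
for _ in range(400):
    v = A @ v + B
limit = np.linalg.solve(np.eye(3) - A, B)
print(np.allclose(v, limit))  # True
```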
Lemma 5
Let \({\mathbf {S}}=[s_{ij}]\in {\mathbb {R}}^{3\times 3}\) be a nonnegative, irreducible matrix with \(s_{ii}<\lambda ^*\) for some \(\lambda ^*>0\) for all \(i=1,2,3\). Then \(\rho ({\mathbf {S}})<\lambda ^*\) iff \(\mathrm {det}(\lambda ^* {\mathbf {I}}-{\mathbf {S}})>0\).
Proof
See Appendix 7.2. \(\square \)
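Lemma 5 can be checked numerically on illustrative matrices (here with \(\lambda ^*=1\)); both examples below are nonnegative and irreducible with diagonal entries below 1.

```python
import numpy as np

# Two illustrative nonnegative, irreducible 3x3 matrices with all
# diagonal entries below lambda* = 1.
S1 = np.array([[0.2, 0.3, 0.1],
               [0.1, 0.2, 0.3],
               [0.2, 0.1, 0.2]])
S2 = 2.0 * S1  # same sparsity pattern, spectral radius above 1

for S in (S1, S2):
    rho = max(abs(np.linalg.eigvals(S)))
    # Lemma 5 with lambda* = 1: rho(S) < 1 iff det(I - S) > 0.
    print((rho < 1) == (np.linalg.det(np.eye(3) - S) > 0))  # True
```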
We now derive the conditions such that \(\rho ({\mathbf {A}})<1\). Suppose \(\alpha \) and \(\beta \) meet the following relations:
and
Then,
Given that \(a_{11},a_{22},a_{33}<1\), in light of Lemma 5, we have \(\rho ({\mathbf {A}})<1\). In addition, by denoting \(B:=[\frac{\alpha ^2\sigma ^2}{n}, 0, M_{\sigma }]^{\intercal }\) and \([({\mathbf {I}}-{\mathbf {A}})^{-1}B]_j\) the j-th element of the vector \([({\mathbf {I}}-{\mathbf {A}})^{-1}B]\), we obtain from (22) that
and
It remains to show that (23)–(25) are satisfied under condition (7). By (23), we need
Since \(\alpha \le \frac{1-\rho _w^2}{12\rho _w L}\) by (7), we know that
Condition (24) leads to the inequality below:
By (27), we only need
The preceding inequality is equivalent to
implying that it is sufficient to have
To see that relation (25) holds, consider a stronger condition
or equivalently,
It suffices that
3.2 Proof of Corollary 1
We derive an upper bound of \(\rho ({\mathbf {A}})\) under conditions (7) and (13). Note that the characteristic function of \({\mathbf {A}}\) is given by
Since \(\text {det}({\mathbf {I}}-{\mathbf {A}})> 0\) and \(\text {det}(\max \{a_{11},a_{22},a_{33}\} {\mathbf {I}}-{\mathbf {A}})=\text {det}(a_{11}{\mathbf {I}}-{\mathbf {A}})<0\), we have \(\rho ({\mathbf {A}})\in (a_{11},1)\). By (24) and (25),
Suppose \(\lambda =1-\epsilon \) for some \(\epsilon \in (0,\alpha \mu )\), satisfying
or equivalently,
It suffices that
To see why this is the case, note that \(\frac{(\alpha \mu -\epsilon )}{\alpha \mu }\ge \frac{2}{\varGamma +1}\), and under (13),
As a result,
We then have
Denote
Then \(\text {det}({\tilde{\lambda }} {\mathbf {I}}-{\mathbf {A}})\ge 0\) so that \(\rho ({\mathbf {A}})\le {\tilde{\lambda }}\).
3.3 Proof of Theorem 2
Similar to (21), under stepsize policy \(\alpha _k=\theta /(m+k)\) where \(m>\frac{\theta }{2}(\mu +L)\), we have
where
\(\beta _k=\frac{1-\rho _w^2}{2\rho _w^2}-4\alpha _k L-2\alpha _k^2 L^2>0\), and
The condition \(\beta _k>0\) is satisfied when
or equivalently,
We first show that \({\mathbb {E}}[\Vert {\overline{x}}_{k}-x^*\Vert ^2]\le {\hat{U}}/(m+k)\) and \({\mathbb {E}}[\Vert {\mathbf {x}}_{k}-{\mathbf {1}}{\overline{x}}_{k}\Vert ^2]\le {\hat{X}}/(m+k)^2\) for some \({\hat{U}}\), \({\hat{X}}>0\) by induction. Denote \(U_k:={\mathbb {E}}[\Vert {\overline{x}}_{k}-x^*\Vert ^2]\), \(X_k:={\mathbb {E}}[\Vert {\mathbf {x}}_{k}-{\mathbf {1}}{\overline{x}}_{k}\Vert ^2]\), and \(Y_k:={\mathbb {E}}[\Vert {\mathbf {y}}_{k}-{\mathbf {1}}{\overline{y}}_{k}\Vert ^2]\). Suppose for some \(k\ge 0\),
We want to prove that
where \(C=(\frac{1}{\beta _0}+2)\Vert {\mathbf {W}}-{\mathbf {I}}\Vert ^2 L^2+3\alpha _0 L^3\). Plugging in \(\alpha _k=\theta /(m+k)\), it suffices to show
Let
condition (34) admits a solution iff
in which case \({\hat{X}}\) and \({\hat{Y}}\) are lower bounded under constraints {(34b), (34c), \({\hat{X}}\ge m^2 X_0\), \({\hat{Y}}\ge Y_0\)}. Specifically, \({\hat{X}}\) can be chosen as follows:
where
Since relation (32) holds trivially when \(k=0\), the induction is complete.
We further improve the bound on \(U_k\) (inspired by [39, 44]). From (29),
Since \({X_k}\le {\hat{X}}/(m+k)^2\),
Hence
In light of Lemma 4.1 in [44],
Then,
Note that
We conclude that
4 A gossip-like stochastic gradient tracking method (GSGT)
In this section, we consider a gossip-like stochastic gradient tracking method (GSGT): Initialize with an arbitrary \(x_{i,0}\) and \(y_{i,0}=g_i(x_{i,0},\xi _{i,0})\) for all \(i\in {\mathcal {N}}\). At each round \(k\in {\mathbb {N}}\), agent \(i_k\in {\mathcal {N}}\) wakes up with probability 1/n. Then \(i_k\) either communicates with one of its neighbors \(j_k\) (with probability \(\pi _{i_k j_k}\)) or not (with probability \(\pi _{i_k i_k}=1-\sum _{j\in {\mathcal {N}}_{i_k}}\pi _{i_k j}\)). In the former situation, the update rule for \(i\in \{i_k,j_k\}\) is as follows,
and for \(i\notin \{i_k,j_k\}\), \(x_{i,k+1}=x_{i,k}\), \(y_{i,k+1}=y_{i,k}\), and \(\xi _{i,k+1}=\xi _{i,k}\). In the latter situation, agent \(i_k\) performs an update based on its own information:
while no action is taken by agent \(i\ne i_k\). For ease of analysis, we denote \(j_k=i_k\) in this case, and let \(\mathbb {1}_k\) be the indicator function for the event \(\{j_k\ne i_k\}\), i.e., \(\mathbb {1}_k=1\) iff \(j_k\ne i_k\).
The use of stepsize \(2\alpha \) instead of \(\alpha \) in (38a) can be understood as follows. At each iteration, GSGT performs two gradient updates within the network. This can be achieved either by two different agents respectively updating their solutions, or by one agent using a doubled stepsize. The method is different from a standard gossip algorithm where exactly two agents update at each round. This difference allows us to design the probabilities \(\pi _{ij}\) with more flexibility. In particular, it is possible to construct a doubly stochastic probability matrix \({{\varvec{\Pi }}}=[\pi _{ij}]\) for any graph \({\mathcal {G}}\) under Assumption 3.
We can present GSGT in the following compact matrix form, using the notation introduced previously:
where the random coupling matrix \({\mathbf {W}}_k\) is defined as
in which \(e_i=[0\, \cdots \, 0\, 1\, 0\,\cdots ]^{\intercal }\in {\mathbb {R}}^{n\times 1}\) is a unit vector with the ith component equal to 1. By definition, each \({\mathbf {W}}_k\) is symmetric and doubly stochastic. The matrices \({\mathbf {D}}_k\) and \(\tilde{{\mathbf {D}}}_k\) are diagonal with their \(i_k\)th and \(j_k\)th diagonal entries equal to 1 and all other entries equal to 0 if \(j_k\ne i_k\), otherwise the \(i_k\)th entry of \({\mathbf {D}}_k\) (respectively, \(\tilde{{\mathbf {D}}}_k\)) equals 2 (respectively, 1) while all other entries equal 0.
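The coupling matrix can be written down explicitly if one assumes the standard gossip form \({\mathbf {W}}_k={\mathbf {I}}-\frac{1}{2}(e_{i_k}-e_{j_k})(e_{i_k}-e_{j_k})^{\intercal }\) [4] (an assumption on our part, since the displayed definition is not reproduced above). Under that assumption, the sketch below checks the stated symmetry and double stochasticity, and verifies by exact enumeration over \((i_k,j_k)\) that the expected coupling matrix equals \((1-\frac{1}{n}){\mathbf {I}}+\frac{{{\varvec{\Pi }}}+{{\varvec{\Pi }}}^{\intercal }}{2n}\), which is consistent with relation (42) used later.

```python
import numpy as np

def gossip_matrix(n, i, j):
    # Assumed standard gossip form [4]: W = I - (e_i - e_j)(e_i - e_j)^T / 2
    # averages entries i and j, and reduces to the identity when j == i.
    d = np.eye(n)[i] - np.eye(n)[j]
    return np.eye(n) - 0.5 * np.outer(d, d)

def expected_W(Pi):
    # Exact expectation over i_k uniform on {1,...,n} and j_k ~ Pi[i_k, :].
    n = len(Pi)
    return sum((Pi[i, j] / n) * gossip_matrix(n, i, j)
               for i in range(n) for j in range(n))
```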
We assume the following condition on the probability matrix \({{\varvec{\Pi }}}\):
Assumption 5
Nonnegative matrix \({{\varvec{\Pi }}}\) is doubly stochastic.
Let
It can be shown that (see [4])
which is doubly stochastic.
Lemma 6
Let Assumptions 3 and 5 hold, and let \(\rho _{{\bar{w}}}\) denote the spectral norm of the matrix \({\bar{{\mathbf {W}}}}-\frac{1}{n}{\mathbf {1}}{\mathbf {1}}^{\intercal }\). Then, \(\rho _{{\bar{w}}}\in [1-2/n,1)\).
Proof
Since \({\bar{{\mathbf {W}}}}\) is doubly stochastic, \(\rho _{{\bar{w}}}<1\) follows from Lemma 1. To see \(\rho _{{\bar{w}}}\ge 1-2/n\), note that \(\rho (\frac{{{\varvec{\Pi }}}+{{\varvec{\Pi }}}^{\intercal }}{2n})=\frac{1}{n}\), and 1 is an algebraically simple eigenvalue of \(\frac{{{\varvec{\Pi }}}+{{\varvec{\Pi }}}^{\intercal }}{2}\) and \(\frac{1}{n}{\mathbf {1}}{\mathbf {1}}^{\intercal }\). We have
where \(\lambda _2(\frac{{{\varvec{\Pi }}}+{{\varvec{\Pi }}}^{\intercal }}{2})\) is the second largest eigenvalue of \(\frac{{{\varvec{\Pi }}}+{{\varvec{\Pi }}}^{\intercal }}{2}\).
Since \(\lambda _2(\frac{{{\varvec{\Pi }}}+{{\varvec{\Pi }}}^{\intercal }}{2})\in [-1,1]\), we conclude that \(\rho _{{\bar{w}}}\ge 1-2/n\). \(\square \)
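Lemma 6 can be spot-checked numerically. Assuming the closed form \({\bar{{\mathbf {W}}}}=(1-\frac{1}{n}){\mathbf {I}}+({{\varvec{\Pi }}}+{{\varvec{\Pi }}}^{\intercal })/(2n)\) for the expected coupling matrix (our reading of (41)), the sketch below computes \(\rho _{{\bar{w}}}\) for a ring of six agents with uniform weights and verifies \(1-2/n\le \rho _{{\bar{w}}}<1\).

```python
import numpy as np

def rho_wbar(Pi):
    # Spectral norm of Wbar - (1/n) 1 1^T, with the closed form of Wbar
    # assumed as stated in the lead-in.
    n = len(Pi)
    Wbar = (1 - 1.0 / n) * np.eye(n) + (Pi + Pi.T) / (2 * n)
    return np.linalg.norm(Wbar - np.ones((n, n)) / n, 2)

# Ring of n = 6 agents: weight 1/3 on itself and on each of its two neighbors.
n = 6
Pi = (np.eye(n) + np.roll(np.eye(n), 1, axis=1)
      + np.roll(np.eye(n), -1, axis=1)) / 3.0
```

Here \(\lambda _2\) of the symmetrized \({{\varvec{\Pi }}}\) equals \(2/3\), so \(\rho _{{\bar{w}}}=1-(1-2/3)/6=17/18\), which lies inside the interval \([1-2/n,1)\) asserted by the lemma.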
Before proceeding, it is worth noting that for GSGT, we still have the following relation:
4.1 Main results
We present the main convergence results of GSGT in the following theorem.
Theorem 3
Let \(\varGamma >1\) be arbitrarily chosen. Suppose Assumptions 1–5 hold, and assume that the stepsize \(\alpha \) satisfies
where \(\eta =\frac{1}{n(1-\rho _{{\bar{w}}})}\) and \(Q=L/\mu \). Then \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\overline{x}}_l-x^*\Vert ^2]\) and \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\mathbf {x}}_{l}-{\mathbf {1}}{\overline{x}}_{l}\Vert ^2]\), respectively, converge to \(\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2]\) and \(\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2]\) at the linear rate \({\mathcal {O}}(\rho ({\mathbf {A}}_g)^k)\), where \(\rho ({\mathbf {A}}_g)<1\) is the spectral radius of the matrix \({\mathbf {A}}_g\) given by
in which \(\beta _1=\frac{n(1-\rho _{{\bar{w}}})}{4\alpha }-4\alpha L^2\) and \(\beta _2=\frac{n(1-\rho _{{\bar{w}}})}{4}-2\alpha L-2\alpha ^2 L^2\). In addition,
and
Remark 5
Notice that \(\eta =\frac{1}{n(1-\rho _{{\bar{w}}})}\) and \(1-\rho _{{\bar{w}}}\le \frac{2}{n}\) from Lemma 6. We can see from (45) and (46) that
and
Since
and by (44),
the second term on the right-hand side of (47) dominates the two terms on the right-hand side of (48). Thus, we have
\(\square \)
The corollary below provides an upper bound for \(\rho ({\mathbf {A}}_g)\).
Corollary 3
Under the conditions of Theorem 3, with \(\varGamma >3/2\), we have
Remark 6
The convergence of GSGT is slower than that of DSGT under the same stepsize \(\alpha \) (see Corollary 1). This is because in GSGT, at most two agents update their iterates at each iteration.\(\square \)
4.2 Performance comparison between DSGT and GSGT
In this section, we compare the performances of the two proposed algorithms in terms of their required computation and communication efforts for achieving an \(\epsilon \)-solution (with constant stepsizes), that is, we compute the number of stochastic gradient computations and communications needed to obtain \(\frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}x^*\Vert ^2]\le \epsilon \). Without loss of generality, for each method we first choose stepsize \(\alpha \) such that \(\frac{1}{n}\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}x^*\Vert ^2]\le \epsilon /2\) and then compute the number of iterations K such that \(\frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_K-{\mathbf {1}}x^*\Vert ^2]\le \epsilon \).
For DSGT, when \(\epsilon \) is small enough, we have \(\alpha ={\mathcal {O}}(\frac{n\mu \epsilon }{\sigma ^2})\) from (11). Then, noting that \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\mathbf {x}}_l-{\mathbf {1}}x^*\Vert ^2]\) converges linearly at the rate \({\mathcal {O}}(\rho ({\mathbf {A}})^k)\) where \(\rho ({\mathbf {A}})=1-{\mathcal {O}}(\alpha \mu )\), we obtain the number of required iterations:
In \(K_d\) iterations, the number of stochastic gradient computations is \(N_d=nK_d={\mathcal {O}}\left( \frac{\sigma ^2}{\mu ^2} \frac{\ln (\frac{1}{\epsilon })}{\epsilon }\right) \) and the number of communications is \(N_d^c=2|{\mathcal {E}}|K_d={\mathcal {O}}\left( \frac{|{\mathcal {E}}|}{n} \frac{\sigma ^2}{\mu ^2}\frac{\ln (\frac{1}{\epsilon })}{\epsilon }\right) \) where \(|{\mathcal {E}}|\) stands for the number of edges in the graph.
For GSGT we need \(\alpha ={\mathcal {O}}(\frac{n^2\mu \epsilon (1-\rho _{{\bar{w}}})}{\sigma ^2})\) from (49). Given that \(\rho ({\mathbf {A}}_g)=1-{\mathcal {O}}(\frac{\alpha \mu }{n})\), the number of required iterations \(K_g\) can be calculated as follows:
In \(K_g\) iterations, the number of gradient computations and communications are both bounded by \(N_g=N_g^c=2K_g={\mathcal {O}}(\frac{1}{n(1-\rho _{{\bar{w}}})}\frac{\sigma ^2}{\mu ^2}\frac{\ln (\frac{1}{\epsilon })}{\epsilon })\).
Suppose the Metropolis rule is applied to define the weights \(\pi _{ij}\) [50]. We first compare the numbers of stochastic gradient computations for DSGT and GSGT. Noticing that \(1-\rho _{{\bar{w}}}\le \frac{2}{n}\) by Lemma 6, \(N_g\) is at best of the same order as \(N_d={\mathcal {O}}(\frac{\sigma ^2}{\mu ^2}\frac{\ln (\frac{1}{\epsilon })}{\epsilon })\), which happens when \(1-\rho _{{\bar{w}}}\) is of order \(\frac{1}{n}\). Given that \(1-\rho _{{\bar{w}}}=(1-\lambda _2({{\varvec{\Pi }}}))/n\) from (42), we have \(1-\rho _{{\bar{w}}}={\mathcal {O}}(\frac{1}{n})\) for complete networks and almost all regular graphs [16], among others.
We then compare the number of required communications \(N_d^c={\mathcal {O}}(\frac{|{\mathcal {E}}|}{n}\frac{\sigma ^2}{\mu ^2}\frac{\ln (\frac{1}{\epsilon })}{\epsilon })\) and \(N_g^c={\mathcal {O}}(\frac{1}{n(1-\rho _{{\bar{w}}})}\frac{\sigma ^2}{\mu ^2}\frac{\ln (\frac{1}{\epsilon })}{\epsilon })\). When \(1-\rho _{{\bar{w}}}={\mathcal {O}}(\frac{1}{n})\), we have \(N_g^c={\mathcal {O}}(\frac{\sigma ^2}{\mu ^2}\frac{\ln (\frac{1}{\epsilon })}{\epsilon })\). By contrast, \(N_d^c\) is \({\mathcal {O}}(\frac{|{\mathcal {E}}|}{n})\) times larger than \(N_g^c\). In particular when \(|{\mathcal {E}}|={\mathcal {O}}(n^2)\) (e.g., complete network), the number of communications for GSGT is \({\mathcal {O}}(n)\) times smaller than that of DSGT.
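As a quick arithmetic check of the comparison above, suppressing constants and the common factor \(\frac{\sigma ^2}{\mu ^2}\frac{\ln (1/\epsilon )}{\epsilon }\), the communication ratio is \(N_d^c/N_g^c\sim |{\mathcal {E}}|(1-\rho _{{\bar{w}}})\) with \(1-\rho _{{\bar{w}}}=(1-\lambda _2({{\varvec{\Pi }}}))/n\) from (42). The snippet below evaluates this for a complete graph with Metropolis weights (\(\lambda _2=0\), \(|{\mathcal {E}}|=n(n-1)/2\)), recovering the \({\mathcal {O}}(n)\) saving; it is an order-of-magnitude sketch only.

```python
def comm_ratio(num_edges, n, lam2):
    # Order-level ratio N_d^c / N_g^c = |E| * (1 - rho_wbar), using
    # 1 - rho_wbar = (1 - lambda_2) / n from (42); constants suppressed.
    return num_edges * (1.0 - lam2) / n

# Complete graph on n = 10 nodes with Metropolis weights: Pi = J/n, so
# lambda_2 = 0 and |E| = 45, giving the ratio (n - 1)/2 = 4.5 = O(n).
```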
4.3 Proof of Theorem 3
We first derive a linear system of inequalities relating \({\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2]\), \({\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2]\) and \({\mathbb {E}}[\Vert {\mathbf {y}}_k-{\mathbf {1}}{\overline{y}}_k\Vert ^2]\) to their values at the previous iteration.
Lemma 7
Suppose Assumptions 1–5 hold and the stepsize satisfies \(\alpha <n/(\mu +L)\). Then, we have the following inequalities:
For any \(\beta _1,\beta _2>0\),
Proof
See Appendix 7.3. \(\square \)
In light of Lemma 7, we have the following linear system of inequalities:
where
and \(M_g=\frac{(4\alpha ^2L^2+2\alpha L)\sigma ^2}{n}+4(\alpha L+1)\sigma ^2\). Suppose \(\alpha \), \(\beta _1,\beta _2>0\) satisfy
and
Then, by Lemma 5, the spectral radius of \({\mathbf {A}}_g\) is smaller than 1, and we have
Hence, \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\overline{x}}_l-x^*\Vert ^2]\), \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\mathbf {x}}_l-{\mathbf {1}}{\overline{x}}_l\Vert ^2]\) and \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\mathbf {y}}_l-{\mathbf {1}}{\overline{y}}_l\Vert ^2]\) all converge to a neighborhood of 0 at the linear rate \({\mathcal {O}}(\rho ({\mathbf {A}}_g)^k)\). Moreover,
We now show (53), (54), and (57) are satisfied under condition (44). First, relation (44) implies that
Therefore, from (53) and (54) we have
By (58)–(61) and the fact that \(\rho _{{\bar{w}}}\ge 1-2/n\) obtained from (41), we have
Then, for relation (55) to hold, it is sufficient that
In light of (58), \(\alpha L\le n(1-\rho _{{\bar{w}}})/24\). We only need
which gives
We now derive the bounds for \(\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\overline{x}}_{k+1}-x^*\Vert ^2]\) and \(\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\mathbf {x}}_{k+1}-{\mathbf {1}}{\overline{x}}_{k+1}\Vert ^2]\). By (57) and (62),
4.4 Proof of Corollary 3
The characteristic function of \({\mathbf {A}}_g\) is
By (55),
Let \(\lambda =1-\epsilon \) for some \(\epsilon \in (0,2\alpha \mu /n)\) that satisfies
Under condition (44), it suffices that
Denote \({\tilde{\lambda }}=1-\frac{(2\varGamma -3)}{\varGamma }\frac{\alpha \mu }{n}\). We have \(\text {det}({\tilde{\lambda }}{\mathbf {I}}-{\mathbf {A}}_g)\ge 0\), and therefore \(\rho ({\mathbf {A}}_g)\le {\tilde{\lambda }}\).
5 Numerical example
In this section, we provide a numerical example to illustrate our theoretical findings. Consider the online ridge regression problem, i.e.,
where \(\rho >0\) is a penalty parameter. For each agent i, samples in the form of \((u_i,v_i)\) are gathered continuously with \(u_i\in {\mathbb {R}}^p\) representing the features and \(v_i\in {\mathbb {R}}\) being the observed outputs. We assume that each \(u_i\in [0.3,0.4]^p\) is uniformly distributed, and \(v_i\) is drawn according to \(v_i=u_i^{\intercal } {\tilde{x}}_i+\varepsilon _i\), where \({\tilde{x}}_i\) are predefined parameters evenly located in \([0,10]^p\), and \(\varepsilon _i\) are independent Gaussian noises with mean 0 and variance 1. Given a pair \((u_i,v_i)\), agent i can compute an estimated gradient of \(f_i(x)\): \(g_i(x,u_i,v_i)=2(u_i^{\intercal }x -v_i)u_i+2\rho x\), which is unbiased. Problem (65) has a unique solution \(x^*\) given by \(x^*=(\sum _{i=1}^n{\mathbb {E}}_{u_i}[u_iu_i^{\intercal }]+n\rho {\mathbf {I}})^{-1}\sum _{i=1}^n{\mathbb {E}}_{u_i}[u_iu_i^{\intercal }]{\tilde{x}}_i\).
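For the data model just described, both the stochastic gradient and the closed-form optimum can be computed directly: for \(u_i\) uniform on \([0.3,0.4]^p\), \({\mathbb {E}}[u_iu_i^{\intercal }]\) has diagonal entries \(0.35^2+0.1^2/12\) and off-diagonal entries \(0.35^2\). The sketch below (an illustration, not the experiment code) implements \(g_i\) and \(x^*\).

```python
import numpy as np

def second_moment(p):
    # E[u u^T] for u uniform on [0.3, 0.4]^p: mean 0.35, variance 0.1^2 / 12.
    S = np.full((p, p), 0.35 ** 2)
    np.fill_diagonal(S, 0.35 ** 2 + 0.1 ** 2 / 12.0)
    return S

def ridge_optimum(x_tilde, rho):
    # x* = (sum_i E[u u^T] + n rho I)^{-1} sum_i E[u u^T] x_tilde_i from the text.
    n, p = x_tilde.shape
    S = second_moment(p)
    return np.linalg.solve(n * S + n * rho * np.eye(p), S @ x_tilde.sum(axis=0))

def stochastic_grad(x, u, v, rho):
    # Unbiased sample g_i(x, u_i, v_i) = 2 (u^T x - v) u + 2 rho x from the text.
    return 2.0 * (u @ x - v) * u + 2.0 * rho * x
```

A quick consistency check: summing the expected gradients \(2\,{\mathbb {E}}[u_iu_i^{\intercal }](x-{\tilde{x}}_i)+2\rho x\) over all agents and evaluating at \(x^*\) should give zero.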
In addition to DSGT, GSGT and CSG, we consider the following distributed stochastic gradient (DSG) algorithm, which is similar to the ones studied in [18, 26]:
Since some existing algorithms for deterministic distributed optimization can also be adapted to the stochastic gradient setting, e.g., EXTRA [52] and the decentralized linearized ADMM (DLM) [27], we include them in our experiments for comparison.
In the experiments, we consider 3 instances with \(p=20\) and \(n\in \{10,25,100\}\), respectively. Under each instance, we let \({\mathbf {x}}_0={\mathbf {0}}\) and the penalty parameter \(\rho =0.1\). For the distributed methods, we assume that the n agents constitute a random network, in which every pair of agents is linked with probability 0.4. The Metropolis rule is applied to define the weights \(w_{ij}\) (and \(\pi _{ij}\)) where applicable [50]:
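The Metropolis rule referenced above takes the standard form \(w_{ij}=1/(1+\max (d_i,d_j))\) for \(\{i,j\}\in {\mathcal {E}}\), \(w_{ij}=0\) for other \(i\ne j\), and \(w_{ii}=1-\sum _{j\ne i}w_{ij}\) [50]; the display itself is not reproduced here, so the sketch below assumes that form.

```python
import numpy as np

def metropolis_weights(adj):
    """Metropolis rule [50] (standard form, assumed here): for an edge
    {i, j}, w_ij = 1 / (1 + max(d_i, d_j)); w_ij = 0 for non-adjacent
    i != j; and w_ii = 1 - sum_{j != i} w_ij.  The resulting matrix is
    symmetric and doubly stochastic."""
    n = len(adj)
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()   # off-diagonal entries of row i are set
    return W
```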
For EXTRA, we choose \({\tilde{{\mathbf {W}}}}=\frac{{\mathbf {I}}+{\mathbf {W}}}{2}\) as recommended by [52]. For DLM, we tune the free parameters to make its convergence speed comparable to the other algorithms. In each instance, we use two different stepsizes \(\alpha =5\times 10^{-3}\) and \(\alpha =5\times 10^{-2}\), respectively. We run the simulations 50 times for DSGT, CSG, DSG, EXTRA and DLM and 100 times for GSGT and average the results to approximate the expected errors.
In Fig. 1a–f, we compare the average performances of DSGT, GSGT, CSG, DSG, EXTRA and DLM with the same parameters. It can be seen that DSGT and CSG are comparable in their convergence speeds as well as their ultimate error bounds (almost indistinguishable). EXTRA and DLM are worse than DSGT and CSG in their final error bounds. The performance gap increases with the network size and the stepsize. GSGT is slower as expected but still reaches a comparable error level under the small stepsize \(\alpha =5\times 10^{-3}\). In addition, the error bounds for DSGT, GSGT and CSG decrease in n as expected from our theoretical analysis. The performance of DSG is the least favorable, as it exhibits the largest final errors.Footnote 5
In Fig. 1g–i (respectively, j–l), we further compare the solutions obtained under DSGT and GSGT with the same number of stochastic gradient evaluations (respectively, inter-node communications) under the small stepsize \(\alpha =5\times 10^{-3}\). We see that the two methods are comparable in their speeds of convergence w.r.t. the number of gradient evaluations. However, GSGT is much faster than DSGT assuming the same number of communications. These numerical results verify our arguments in Sect. 4.2.
6 Conclusions and future work
This paper considers distributed multi-agent optimization over a network, where each agent only has access to inexact gradients of its local cost function. We propose a distributed stochastic gradient tracking method (DSGT) and show that the iterates obtained by each agent, using a constant stepsize value, reach a neighborhood of the optimum (in expectation) exponentially fast. More importantly, in the limit, the error bounds for the distances between the iterates and the optimal solution decrease in the network size, which is comparable to the performance of a centralized stochastic gradient algorithm. With a diminishing stepsize, the method exhibits the optimal \({\mathcal {O}}(1/k)\) rate of convergence. In the second part of this paper, we discuss a gossip-like stochastic gradient tracking method (GSGT) that is communication-efficient. Under a well-connected interaction graph, we show GSGT requires fewer communications than DSGT to reach an \(\epsilon \) error level. Finally, we provide a numerical example that demonstrates the effectiveness of both algorithms. In our future work, we will deal with directed and/or time-varying interaction graphs among agents. We also plan to explore other more flexible randomized algorithms, such as broadcast-based protocols with possible transmission failures.
Notes
This condition is weaker than (7) when \(\rho _w^2\ge \frac{\varGamma }{\varGamma +1}\frac{2\mu }{3L}\).
It is worth noting that the transient time for \(\frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_{k}-{\mathbf {1}}x^*\Vert ^2]\) to approach the asymptotic convergence rate depends on the network topology, and its dependence on n can be significant.
Matrix \({\mathbf {A}}\) in Theorem 1 corresponds to such a choice of \(\alpha \) and \(\beta \).
In practice, this means agent i holds vectors \(x_i\), \(y_i\) and \(g_i(x_i,\xi _i)\) if it does not wake up.
DSG still holds the advantage over DSGT in the early stage for achieving a similar convergence speed with lower communication and storage costs.
References
Aybat, N.S., Wang, Z., Lin, T., Ma, S.: Distributed linearized alternating direction method of multipliers for composite convex consensus optimization. IEEE Trans. Autom. Control 63(1), 5–20 (2017)
Baingana, B., Mateos, G., Giannakis, G.B.: Proximal-gradient algorithms for tracking cascades over social networks. IEEE J. Sel. Top. Signal Process. 8(4), 563–575 (2014)
Bianchi, P., Jakubowicz, J.: Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization. IEEE Trans. Autom. Control 58(2), 391–405 (2013)
Boyd, S., Ghosh, A., Prabhakar, B., Shah, D.: Randomized gossip algorithms. IEEE Trans. Inf. Theory 52(6), 2508–2530 (2006)
Cavalcante, R.L., Stanczak, S.: A distributed subgradient method for dynamic convex optimization problems under noisy information exchange. IEEE J. Sel. Top. Signal Process. 7(2), 243–256 (2013)
Chatzipanagiotis, N., Dentcheva, D., Zavlanos, M.M.: An augmented Lagrangian method for distributed optimization. Math. Program. 152(1–2), 405–434 (2015)
Chatzipanagiotis, N., Zavlanos, M.M.: A distributed algorithm for convex constrained optimization under noise. IEEE Trans. Autom. Control 61(9), 2496–2511 (2016)
Chen, J., Sayed, A.H.: Diffusion adaptation strategies for distributed optimization and learning over networks. IEEE Trans. Signal Process. 60(8), 4289–4305 (2012)
Chen, J., Sayed, A.H.: On the learning behavior of adaptive networks—part I: transient analysis. IEEE Trans. Inf. Theory 61(6), 3487–3517 (2015)
Chen, J., Sayed, A.H.: On the learning behavior of adaptive networks—part II: performance analysis. IEEE Trans. Inf. Theory 61(6), 3518–3548 (2015)
Cohen, K., Nedić, A., Srikant, R.: Distributed learning algorithms for spectrum sharing in spatial random access wireless networks. IEEE Trans. Autom. Control 62(6), 2854–2869 (2017)
Cohen, K., Nedić, A., Srikant, R.: On projected stochastic gradient descent algorithm with weighted averaging for least squares regression. IEEE Trans. Autom. Control 62(11), 5974–5981 (2017)
Di Lorenzo, P., Scutari, G.: Next: in-network nonconvex optimization. IEEE Trans. Signal Inf. Process. Over Netw. 2(2), 120–136 (2016)
Duchi, J.C., Agarwal, A., Wainwright, M.J.: Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Trans. Autom. Control 57(3), 592–606 (2012)
Forrester, A.I., Sóbester, A., Keane, A.J.: Multi-fidelity optimization via surrogate modelling. In: Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences. The Royal Society, vol. 463, pp. 3251–3269 (2007)
Friedman, J., Kahn, J., Szemeredi, E.: On the second eigenvalue of random regular graphs. In: Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing. ACM, pp. 587–598 (1989)
Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (1990)
Jakovetic, D., Bajovic, D., Sahu, A.K., Kar, S.: Convergence rates for distributed stochastic optimization over random networks. In: 2018 IEEE Conference on Decision and Control (CDC), pp. 4238–4245. IEEE (2018)
Jakovetić, D., Xavier, J., Moura, J.M.: Fast distributed gradient methods. IEEE Trans. Autom. Control 59(5), 1131–1146 (2014)
Kia, S.S., Cortés, J., Martínez, S.: Distributed convex optimization via continuous-time coordination algorithms with discrete-time communication. Automatica 55, 254–264 (2015)
Kiefer, J., Wolfowitz, J., et al.: Stochastic estimation of the maximum of a regression function. Ann. Math. Stat. 23(3), 462–466 (1952)
Kleijnen, J.P.: Design and Analysis of Simulation Experiments, vol. 20. Springer, Berlin (2008)
Kushner, H., Yin, G.G.: Stochastic Approximation and Recursive Algorithms and Applications, vol. 35. Springer, Berlin (2003)
Lan, G., Lee, S., Zhou, Y.: Communication-efficient algorithms for decentralized and stochastic optimization. Math. Program. 180, 237–284 (2020). https://doi.org/10.1007/s10107-018-1355-4
Lee, S., Nedić, A.: Asynchronous gossip-based random projection algorithms over networks. IEEE Trans. Autom. Control 61(4), 953–968 (2016)
Lian, X., Zhang, C., Zhang, H., Hsieh, C.J., Zhang, W., Liu, J.: Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 5336–5346 (2017)
Ling, Q., Shi, W., Wu, G., Ribeiro, A.: DLM: decentralized linearized alternating direction method of multipliers. IEEE Trans. Signal Process. 63(15), 4051–4064 (2015)
Lobel, I., Ozdaglar, A., Feijer, D.: Distributed multi-agent optimization with state-dependent communication. Math. Program. 129(2), 255–284 (2011)
Lu, J., Tang, C.Y., Regier, P.R., Bow, T.D.: Gossip algorithms for convex consensus optimization over networks. IEEE Trans. Autom. Control 56(12), 2917–2923 (2011)
Mateos, G., Giannakis, G.B.: Distributed recursive least-squares: stability and performance analysis. IEEE Trans. Signal Process. 60(7), 3740–3754 (2012)
Mathkar, A.S., Borkar, V.S.: Nonlinear gossip. SIAM J. Control Optim. 54(3), 1535–1557 (2016)
Nedić, A., Olshevsky, A.: Stochastic gradient-push for strongly convex functions on time-varying directed graphs. IEEE Trans. Autom. Control 61(12), 3936–3947 (2016)
Nedić, A., Olshevsky, A., Rabbat, M.G.: Network topology and communication–computation tradeoffs in decentralized optimization. Proc. IEEE 106(5), 953–976 (2018)
Nedic, A., Olshevsky, A., Shi, W.: Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
Nedić, A., Olshevsky, A., Uribe, C.A.: Fast convergence rates for distributed non-bayesian learning. IEEE Trans. Autom. Control 62(11), 5538–5553 (2017)
Nedić, A., Ozdaglar, A.: Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control 54(1), 48–61 (2009)
Nedić, A., Ozdaglar, A., Parrilo, P.A.: Constrained consensus and optimization in multi-agent networks. IEEE Trans. Autom. Control 55(4), 922–938 (2010)
Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)
Olshevsky, A., Paschalidis, I.C., Spiridonoff, A.: Robust asynchronous stochastic gradient-push: asymptotically optimal and network-independent performance for strongly convex functions. arXiv preprint arXiv:1811.03982 (2018)
Pu, S., Garcia, A.: A flocking-based approach for distributed stochastic optimization. Oper. Res. 1, 267–281 (2018)
Pu, S., Garcia, A.: Swarming for faster convergence in stochastic optimization. SIAM J. Control Optim. 56(4), 2997–3020 (2018)
Pu, S., Garcia, A., Lin, Z.: Noise reduction by swarming in social foraging. IEEE Trans. Autom. Control 61(12), 4007–4013 (2016)
Pu, S., Nedić, A.: A distributed stochastic gradient tracking method. In: 2018 IEEE Conference on Decision and Control (CDC). IEEE, pp. 963–968 (2018)
Pu, S., Olshevsky, A., Paschalidis, I.C.: A sharp estimate on the transient time of distributed stochastic gradient descent. arXiv preprint arXiv:1906.02702 (2019)
Pu, S., Shi, W., Xu, J., Nedic, A.: Push–pull gradient methods for distributed optimization in networks. IEEE Trans. Autom. Control (2020). https://doi.org/10.1109/TAC.2020.2972824
Qu, G., Li, N.: Harnessing smoothness to accelerate distributed optimization. IEEE Trans. Control Netw. Syst. 5(3), 1245–1260 (2017)
Rakhlin, A., Shamir, O., Sridharan, K.: Making gradient descent optimal for strongly convex stochastic optimization. In: Proceedings of the 29th International Conference on International Conference on Machine Learning. Omnipress, pp. 1571–1578 (2012)
Ram, S.S., Nedić, A., Veeravalli, V.V.: Distributed stochastic subgradient projection algorithms for convex optimization. J. Optim. Theory Appl. 147(3), 516–545 (2010)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)
Sayed, A.H.: Adaptive networks. Proc. IEEE 102(4), 460–497 (2014)
Sayin, M.O., Vanli, N.D., Kozat, S.S., Başar, T.: Stochastic subgradient algorithms for strongly convex optimization over distributed networks. IEEE Trans. Netw. Sci. Eng. 4(4), 248–260 (2017)
Shi, W., Ling, Q., Wu, G., Yin, W.: EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM J. Optim. 25(2), 944–966 (2015)
Sirb, B., Ye, X.: Decentralized consensus algorithm with delayed and stochastic gradients. SIAM J. Optim. 28(2), 1232–1254 (2018)
Srivastava, K., Nedic, A.: Distributed asynchronous constrained stochastic optimization. IEEE J. Sel. Top. Signal Process. 5(4), 772–790 (2011)
Towfic, Z.J., Sayed, A.H.: Adaptive penalty-based distributed stochastic convex optimization. IEEE Trans. Signal Process. 62(15), 3924–3938 (2014)
Tsitsiklis, J., Bertsekas, D., Athans, M.: Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Trans. Autom. Control 31(9), 803–812 (1986)
Parts of the results appeared in the 57th IEEE Conference on Decision and Control [43].
This work was supported in parts by the NSF Grant CCF-1717391, the ONR Grant No. N00014-16-1-2245, and the Shenzhen Research Institute of Big Data (SRIBD) Startup Fund JCYJ-SP2019090001.
Appendix
Appendix
1.1 Proof of Lemma 4
By (4),
It follows that
Notice that \({\mathbb {E}}[{\overline{y}}_k\mid {\mathcal {F}}_k]=h({\mathbf {x}}_k)\), and
We have
where the inequality follows from Lemma 2. Denote \(\lambda =1-\alpha \mu \). In light of Lemma 3,
Relation (19) follows from the following argument:
where we used Lemma 1.
To prove (20), we need some preparations first. For ease of exposition we will write \(G_k:=G({\mathbf {x}}_k,\varvec{\xi }_k)\) and \(\nabla _k:=\nabla F({\mathbf {x}}_k)\) for short. From (5) and Lemma 1,
Notice that
by Assumption 1, and
We have
We will make use of two additional lemmas.
Lemma 8
Proof
From (4),
In light of Assumption 2,
Then,
The desired result then follows. \(\square \)
Lemma 9
Proof
By (4), we have
On one hand,
which gives
On the other hand,
We have
\(\square \)
By (70), Lemma 8 and Lemma 9, we obtain
Now we bound \(\Vert \nabla _{k+1}-\nabla _k \Vert ^2\) and \(\langle W{\mathbf {y}}_k-{\mathbf {1}}{\overline{y}}_k, \nabla _{k+1}-\nabla _k \rangle \). First, by Assumption 2 and Lemma 1,
Second,
Notice that
We have
and
By (75) and the above relations,
for any \(\beta >0\).
1.2 Proof of Lemma 5
The characteristic function of \({\mathbf {S}}\) is given by
Necessity is trivial since \(\text {det}(\lambda ^* {\mathbf {I}}-{\mathbf {S}})\le 0\) implies \(g(\lambda )=0\) for some \(\lambda \ge \lambda ^*\). We now show \(\text {det}(\lambda ^* {\mathbf {I}}-{\mathbf {S}})>0\) is also a sufficient condition. Given that \(g(\lambda ^*)=\text {det}(\lambda ^* {\mathbf {I}}-{\mathbf {S}})>0\),
It follows that
for some \(\gamma _1,\gamma _2,\gamma _3>0\) with \(\gamma _1+\gamma _2+\gamma _3\le 1\). Consider
We have \(g'(\lambda )>0\) for \(\lambda \in (-\infty ,-\lambda ^*]\cup [\lambda ^*,+\infty )\). Notice that
All the real roots of \(g(\lambda )=0\) lie in the interval \((-\lambda ^*,\lambda ^*)\). By the Perron-Frobenius theorem, \(\rho ({\mathbf {S}})\in {\mathbb {R}}\) is an eigenvalue of \({\mathbf {S}}\). We conclude that \(\rho ({\mathbf {S}})<\lambda ^*\).
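The sufficiency direction proved above can be spot-checked numerically: for a sample nonnegative matrix \({\mathbf {S}}\) (entries chosen here for illustration) with \(\det (\lambda ^*{\mathbf {I}}-{\mathbf {S}})>0\), the spectral radius indeed falls below \(\lambda ^*\). This is a sanity check of the implication, not a proof.

```python
import numpy as np

# Sample nonnegative matrix of the type handled by Lemma 5, with
# threshold lambda* = 1.
S = np.array([[0.5, 0.1, 0.0],
              [0.2, 0.4, 0.1],
              [0.1, 0.0, 0.6]])
l_star = 1.0
det_val = np.linalg.det(l_star * np.eye(3) - S)   # = 0.111 > 0
rho_S = max(abs(np.linalg.eigvals(S)))            # row sums <= 0.7, so < 1
```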
1.3 Proof of Lemma 7
First we prove relation (50). In light of (39a), we have
Then,
Taking conditional expectations of \((y_{i_k,k}+y_{j_k,k})\) and \(\Vert y_{i_k,k}+y_{j_k,k}\Vert ^2\) w.r.t. the random selections of \(i_k\) and \(j_k\), we get
and
where \(\rho _{\pi }<1\) denotes the spectral norm of \({{\varvec{\Pi }}}-\frac{1}{n}{\mathbf {1}}{\mathbf {1}}^{\intercal }\). It follows that
Noticing that \({\mathbb {E}}[{\overline{y}}_k]={\mathbb {E}}[h({\mathbf {x}}_k)]\), from Lemma 2 we have
Then from (82) we obtain
Since \(\alpha <n/(\mu +L)\), we know from Lemma 3 that
and
It follows that
To bound the consensus error \({\mathbb {E}}[\Vert {\mathbf {x}}_{k+1}-{\mathbf {1}}{\overline{x}}_{k+1}\Vert ^2]\), note that from (39a) and (79) we have
By Lemma 6, the conditional expectation of \(\mathbf {tr}\left[ ({\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k)^{\intercal }{\mathbf {W}}_k^{\intercal } {\mathbf {W}}_k({\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k)\right] \) can be bounded as follows:
In view of the structure of \({\mathbf {D}}_k\), we rewrite \(2\langle {\mathbf {W}}_k {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k, {\mathbf {D}}_k {\mathbf {y}}_k\rangle \) as follows:
Note that
and similarly,
The following inequality holds:
The last term in Eq. (84) can be bounded in the following way:
In view of inequalities (85)–(87), from (84) we obtain
where \(\beta _1>0\) is arbitrary. Notice that by relation (83) and Lemma 3,
We obtain
which is exactly inequality (51).
Finally we prove inequality (52). From the update rule (39b),
Denote \(g_{i,k}:=g_i(x_{i,k},\xi _{i,k})\) for short. Notice that
and
We have
Hence relation (89) leads to
Now we analyze the three terms on the right-hand side of (90), respectively. First,
Second,
In light of Assumptions 1 and 2, we can bound \({\mathbb {E}}[\Vert g_{i_k,k+1}-g_{i_k,k}\Vert ^2]\) and \({\mathbb {E}}[\mathbb {1}_k\Vert g_{j_k,k+1}-g_{j_k,k}\Vert ^2]\):
Similarly,
To further bound \({\mathbb {E}}[\Vert g_{i_k,k+1}-g_{i_k,k}\Vert ^2+\mathbb {1}_k\Vert g_{j_k,k+1}-g_{j_k,k}\Vert ^2]\), we introduce the following lemma.
Lemma 10
The proof is similar to that of Lemma 8 and is omitted here. Equation (92) together with (93), (94) and Lemma 10 leads to the following inequality:
Note that from (37a) and (38a), we have
and
Then by (95),
For the last term on the right-hand side of (90), note that
By Assumption 1, we have
The following lemma is useful.
Lemma 11
Proof
See Appendix 7.4. \(\square \)
In light of (97) and Lemma 11,
Notice that from (37a), (38a) and Assumption 2,
and
We have from (98) that
for any \(\beta _2>0\). In light of (91), (96) and (99), we obtain by (90) that
Since by (88),
We conclude that
1.4 Proof of Lemma 11
The following relation holds:
From the updating rules (37b) and (38b), we have
and
Hence
Pu, S., Nedić, A. Distributed stochastic gradient tracking methods. Math. Program. 187, 409–457 (2021). https://doi.org/10.1007/s10107-020-01487-0