1 Introduction

Consider a set of agents \({\mathcal {N}}=\{1,2,\ldots ,n\}\) connected over a network. Each agent has a local smooth and strongly convex cost function \(f_i:{\mathbb {R}}^{p}\rightarrow {\mathbb {R}}\). The global objective is to locate \(x\in {\mathbb {R}}^p\) that minimizes the average of all cost functions:

$$\begin{aligned} \min _{x\in {\mathbb {R}}^{p}}f(x)\left( =\frac{1}{n}\sum _{i=1}^{n}f_i(x)\right) . \end{aligned}$$
(1)

Scenarios in which problem (1) is considered include distributed machine learning [12, 15, 35], multi-agent target seeking [8, 42], and wireless networks [2, 11, 30], among many others.

To solve problem (1), we assume each agent i queries a stochastic oracle (\({{\mathcal {S}}}{{\mathcal {O}}}\)) to obtain noisy gradient samples of the form \(g_i(x,\xi _i)\) that satisfy the following condition:

Assumption 1

For all \(i\in {\mathcal {N}}\) and all \(x\in {\mathbb {R}}^p\), each random vector \(\xi _i\in {\mathbb {R}}^m\) is independent, and

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{\xi _i}[g_i(x,\xi _i)\mid x] = \nabla f_i(x),\\&{\mathbb {E}}_{\xi _i}[\Vert g_i(x,\xi _i)-\nabla f_i(x)\Vert ^2\mid x] \le \sigma ^2\quad \hbox { for some }\sigma >0. \end{aligned} \end{aligned}$$
(2)

The above assumption on the stochastic gradients holds for many online distributed learning problems, where \(f_i(x)={\mathbb {E}}_{\xi _i}[F_i(x,\xi _i)]\) denotes the expected loss function agent i wishes to minimize, while independent samples \(\xi _i\) are gathered continuously over time. As another example, in simulation-based optimization, the gradient estimation often incurs noise from various sources, such as modeling and discretization errors, incomplete convergence, and finite sample sizes for Monte-Carlo methods [22].
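
For instance (an illustrative construction of ours, not taken from the paper), if \(g_i(x,\xi _i)=\nabla f_i(x)+\xi _i\) with \(\xi _i\) zero-mean and \({\mathbb {E}}[\Vert \xi _i\Vert ^2]\le \sigma ^2\), then (2) holds by construction. A minimal numerical sketch of such an oracle, with hypothetical names and a quadratic \(f_i\), is the following:

import numpy as np

rng = np.random.default_rng(0)
p, sigma = 5, 0.1
c_i = rng.normal(size=p)                 # hypothetical data defining f_i(x) = 0.5*||x - c_i||^2

def grad_fi(x):
    return x - c_i                       # exact gradient of f_i

def stochastic_gradient(x):
    """g_i(x, xi_i) = grad f_i(x) + xi_i, with zero-mean noise of second moment sigma^2."""
    xi = sigma / np.sqrt(p) * rng.normal(size=p)
    return grad_fi(x) + xi

# Empirical check of (2) at a fixed point x: the sample mean approaches grad f_i(x),
# and the mean squared deviation stays close to sigma^2 (the bound in (2)).
x = rng.normal(size=p)
samples = np.array([stochastic_gradient(x) for _ in range(20000)])
print("bias:", np.linalg.norm(samples.mean(axis=0) - grad_fi(x)))
print("mean squared deviation:", ((samples - grad_fi(x)) ** 2).sum(axis=1).mean(),
      "  sigma^2 =", sigma ** 2)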

Distributed algorithms dealing with problem (1) have been studied extensively in the literature [13, 19, 20, 28, 34, 36, 37, 45, 46, 52, 56]. Recently, there has been considerable interest in distributed implementation of stochastic gradient algorithms [3, 5, 6, 7, 9, 10, 14, 18, 24, 26, 32, 40, 41, 48, 51, 54, 55]. The literature has shown that distributed methods may compete with, or even outperform, their centralized counterparts under certain conditions [9, 10, 26, 40, 41]. For instance, in our recent work [40], we proposed a flocking-based approach for distributed stochastic optimization which beats a centralized gradient method in real time, assuming that all \(f_i\) are identical. However, to the best of our knowledge, there is no distributed stochastic gradient method addressing problem (1) that shows performance comparable to that of a centralized approach for optimizing the sum of smooth and strongly convex objective functions under Assumption 1 only. In particular, under constant stepsize policies, none of the existing algorithms achieves an error bound that is decreasing in the network size n.

A distributed gradient tracking method was proposed in [13, 34, 46], where the agent-based auxiliary variables \(y_i\) were introduced to track the average gradients of \(f_i\), assuming accurate gradient information is available. It was shown that the method, with constant stepsize, generates iterates that converge linearly to the optimal solution. Inspired by this approach, in this paper we consider a distributed stochastic gradient tracking method (DSGT). By comparison, in our proposed algorithm the variables \(y_i\) track the average of the stochastic gradients of the \(f_i\). We are able to show that the iterates generated by each agent reach, in expectation, a neighborhood of the optimal point exponentially fast under a constant stepsize. Interestingly, with a sufficiently small stepsize, the limiting error bounds on the distance between the agent iterates and the optimal solution decrease in the network size, which is comparable to the performance of a centralized stochastic gradient algorithm.

Gossip-based communication protocols are popular choices for distributed computation due to their low communication costs [4, 25, 29, 31]. In the second part of this paper, we consider a gossip-like stochastic gradient tracking algorithm (GSGT) where, at each iteration, an agent wakes up uniformly at random and either communicates with one of her neighbors or updates by herself. Similar to DSGT, the method produces iterates that converge to a neighborhood of the optimal point exponentially fast under a sufficiently small constant stepsize. When the network of agents is well-connected (e.g., complete network, almost all regular graphs), GSGT is shown to incur a lower communication burden and a similar computational cost when compared to DSGT.

1.1 Related work

We now briefly review the literature on (distributed) stochastic optimization. First of all, our work is related to the extensive literature on stochastic approximation (SA) methods dating back to the seminal works [21, 49]. These works include the analysis of convergence (conditions for convergence, rates of convergence, suitable choice of stepsize) in the context of diverse noise models [23]. Assuming the objective function f is strongly convex with Lipschitz continuous gradients, the optimal rate of convergence for solving problem (1) has been shown to be \({\mathcal {O}}\left( 1/k\right) \) under a diminishing SA stepsize, where k denotes the iteration number [38]. With a constant stepsize \(\alpha >0\) that is sufficiently small, the iterates generated by a stochastic gradient method are attracted to an \({\mathcal {O}}(\alpha )\)-neighborhood of the optimal solution exponentially fast (in expectation).

Distributed implementations of stochastic gradient methods have become increasingly popular in recent years. In [48], the authors considered minimizing a sum of (possibly nonsmooth) convex objective functions subject to a common convex constraint set. It was shown that when the means of the stochastic subgradient errors diminish, there is mean consensus among the agents and mean convergence to the optimum function value under SA stepsizes. The work [54] used two diminishing stepsizes to deal with communication noises and subgradient errors, respectively. Asymptotic convergence to the optimal set was established; for constant stepsizes asymptotic error bounds were derived. In [14], a distributed dual averaging method was proposed for minimizing (possibly nonsmooth) convex functions. Under a carefully chosen SA stepsize sequence, the method exhibits the convergence rate \({\mathcal {O}}(\frac{n\log (k)}{(1-\lambda _2({\mathbf {W}}))\sqrt{k}})\), in which \(\lambda _2({\mathbf {W}})\) denotes the second largest singular value of the doubly stochastic mixing matrix \({\mathbf {W}}\). Paper [3] considered a projected stochastic gradient algorithm for solving non-convex optimization problems by combining a local stochastic gradient update and a gossip step. It was proved that consensus is asymptotically achieved in the network and the solutions converge to the set of KKT points with SA stepsizes. A distributed online algorithm was devised and analyzed in [5] for solving dynamic optimization problems in noisy communication environments. Sufficient conditions were provided for almost sure convergence of the algorithm. In [55], the authors proposed an adaptive diffusion algorithm based on penalty methods. Under a constant stepsize \(\alpha \), it was shown that the expected distance between the optimal solution and that obtained at each node is bounded by \({\mathcal {O}}(\alpha )\). The work [9, 10] further showed that under a sufficiently small stepsize and certain conditions on the stochastic gradients, distributed methods are able to achieve the same performance level as that of a centralized approach. Paper [7] considered the problem of distributed constrained convex optimization subject to multiple noise terms in both computation and communication stages. The authors utilized an augmented Lagrangian framework and established the almost sure convergence of the algorithm under a diminishing stepsize policy. In [32], a subgradient-push method was investigated for distributed optimization over time-varying directed graphs. When the objective function is strongly convex, the scheme exhibits the \({\mathcal {O}}(\frac{\ln k}{k})\) rate of convergence.

In a recent work [24], a class of decentralized first-order methods for nonsmooth and stochastic optimization was presented. The class was shown to exhibit the \({\mathcal {O}}(\frac{1}{k})\) (respectively, \({\mathcal {O}}(\frac{1}{\sqrt{k}})\)) rate of convergence for minimizing the sum of strongly convex functions (respectively, general convex functions). The work in [1] considered a composite convex optimization problem with noisy gradient information and showed the \({\mathcal {O}}(\frac{1}{\sqrt{k}})\) convergence rate using an ADMM-based approach. Paper [26] considered a decentralized stochastic gradient algorithm that achieves the \({\mathcal {O}}(\frac{1}{k}+\frac{1}{\sqrt{nk}})\) rate of convergence for minimizing the sum of non-convex functions. The rate is comparable to that of a centralized algorithm when k is large enough. At the same time, the communication cost for the decentralized approach is lower. Papers [40, 41] also demonstrate the advantage of distributively implementing a stochastic gradient method, assuming that all \(f_i\) are identical and sampling times are random and non-negligible. The work [51] utilized a time-dependent weighted mixing of stochastic subgradient updates to achieve the convergence rate of \({\mathcal {O}}(\frac{n\sqrt{n}}{(1-\lambda _2({\mathbf {W}}))k})\) for minimizing the sum of (possibly nonsmooth) strongly convex functions. In [53], the authors considered a decentralized consensus-based algorithm with delayed gradient information. The method was shown to achieve the optimal \({\mathcal {O}}(\frac{1}{\sqrt{k}})\) rate of convergence for general convex functions. In [18], the \({\mathcal {O}}(\frac{1}{k})\) convergence rate was established for strongly convex costs and random networks.

1.2 Main contribution

Our main contribution is summarized as follows. Firstly, we propose a novel distributed stochastic gradient tracking method (DSGT) for optimizing the sum of smooth and strongly convex objective functions. We employ an auxiliary variable \(y_i\) for each agent that tracks the average stochastic gradients of the cost functions. We show that, under a constant stepsize choice, the algorithm is comparable to a centralized stochastic gradient scheme in terms of their convergence speeds and the ultimate error bounds. In particular, the obtained error bound under DSGT decreases with the network size n, which has not been shown in the literature to the best of our knowledge. Moreover, assuming the gradient estimates are accurate, DSGT recovers the linear rate of convergence to the optimal solution [34, 46], which is also a unique feature among other distributed stochastic gradient algorithms.

Secondly, with an SA stepsize \(\alpha _k\rightarrow 0\), DSGT enjoys the optimal \({\mathcal {O}}(\frac{1}{k})\) rate of convergence to the optimal point. In addition, we characterize the dependency of the constant factors in the stepsize and the convergence rate on the properties of the mixing matrix as well as the characteristics of the objective functions, such as the strong convexity factor and the Lipschitz constant.

Thirdly, we introduce a gossip-like stochastic gradient tracking method that is efficient in communication. We show that, under a sufficiently small constant stepsize, GSGT also produces iterates that converge to a neighborhood of the optimal point exponentially fast. Again, when the gradient estimates are accurate, GSGT recovers the linear rate of convergence to the optimal solution. Compared to DSGT, we show that when the network is well-connected (e.g., complete network, almost all regular graphs), GSGT incurs lower communication cost than DSGT by a factor of \({\mathcal {O}}(\frac{|{\mathcal {E}}|}{n})\) (\(|{\mathcal {E}}|\) denoting the number of edges in the network) while maintaining a similar computational cost.

Finally, we provide a numerical example that demonstrates the effectiveness of the proposed methods when contrasted with the centralized stochastic gradient algorithm and some existing variants of distributed stochastic gradient methods.

1.3 Notation and assumptions

Throughout the paper, vectors default to columns if not otherwise specified. Let each agent i hold a local copy \(x_i\in {\mathbb {R}}^p\) of the decision variable and an auxiliary variable \(y_i\in {\mathbb {R}}^p\). Their values at iteration/time k are denoted by \(x_{i,k}\) and \(y_{i,k}\), respectively. We let

$$\begin{aligned} {\mathbf {x}}:= [x_1, x_2, \ldots , x_n]^{\intercal }\in {\mathbb {R}}^{n\times p},\ \ {\mathbf {y}}:= [y_1, y_2, \ldots , y_n]^{\intercal }\in {\mathbb {R}}^{n\times p}, \end{aligned}$$

and

$$\begin{aligned} {\overline{x}}:= \frac{1}{n}{\mathbf {1}}^{\intercal } {\mathbf {x}}\in {\mathbb {R}}^{1\times p},\ \ {\overline{y}}:= \frac{1}{n}{\mathbf {1}}^{\intercal }{\mathbf {y}}\in {\mathbb {R}}^{1\times p}, \end{aligned}$$

where \({\mathbf {1}}\) denotes the vector with all entries equal to 1. We define an aggregate objective function of the local variables:

$$\begin{aligned} F({\mathbf {x}}):=\sum _{i=1}^nf_i(x_i), \end{aligned}$$
(3)

and let

$$\begin{aligned} \nabla F({\mathbf {x}}):=\left[ \nabla f_1(x_1), \nabla f_2(x_2), \ldots , \nabla f_n(x_n)\right] ^{\intercal }\in {\mathbb {R}}^{n\times p}. \end{aligned}$$

In addition, denote

$$\begin{aligned} \begin{aligned}&h({\mathbf {x}}):= \frac{1}{n}{\mathbf {1}}^{\intercal }\nabla F({\mathbf {x}})\in {\mathbb {R}}^{1\times p},\\&\varvec{\xi } := [\xi _1, \xi _2, \ldots , \xi _n]^{\intercal }\in {\mathbb {R}}^{n\times m},\\&G({\mathbf {x}},\varvec{\xi }):= [g_1(x_1,\xi _1), g_2(x_2,\xi _2), \ldots , g_n(x_n,\xi _n)]^{\intercal }\in {\mathbb {R}}^{n\times p}. \end{aligned} \end{aligned}$$

The inner product of two vectors \(a,b\) of the same dimension is denoted by \(\langle a,b\rangle \). For two matrices \({\mathbf {A}},{\mathbf {B}}\in {\mathbb {R}}^{n\times p}\), we let \(\langle {\mathbf {A}},{\mathbf {B}}\rangle \) be the Frobenius inner product. We use \(\Vert \cdot \Vert \) to denote the 2-norm of vectors; for matrices, \(\Vert \cdot \Vert \) represents the Frobenius norm. The spectral radius of a square matrix \({\mathbf {M}}\) is denoted by \(\rho ({\mathbf {M}})\).
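
Since all subsequent algebra is carried out with these \(n\times p\) stackings, a short numerical sketch may help fix the conventions. It is illustrative only; the data c and the helper grad_fi are hypothetical:

import numpy as np

n, p = 4, 3
rng = np.random.default_rng(1)

# Hypothetical smooth local costs f_i(x) = 0.5*||x - c_i||^2, so grad f_i(x) = x - c_i.
c = rng.normal(size=(n, p))
def grad_fi(i, x):
    return x - c[i]

# Row i of bold x holds agent i's local copy x_i, so bold x is an n-by-p array.
x = rng.normal(size=(n, p))
x_bar = x.mean(axis=0, keepdims=True)                     # (1/n) 1^T x, a 1-by-p row vector

grad_F = np.stack([grad_fi(i, x[i]) for i in range(n)])   # nabla F(x), n-by-p
h = grad_F.mean(axis=0, keepdims=True)                    # h(x) = (1/n) 1^T nabla F(x)

print(x_bar.shape, grad_F.shape, h.shape)                 # (1, 3) (4, 3) (1, 3)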

We make the following standing assumption on the individual objective functions \(f_i\).

Assumption 2

Each \(f_i:{\mathbb {R}}^p\rightarrow {\mathbb {R}}\) is \(\mu \)-strongly convex with L-Lipschitz continuous gradients, i.e., for any \(x,x'\in {\mathbb {R}}^p\),

$$\begin{aligned} \begin{aligned}&\langle \nabla f_i(x)-\nabla f_i(x'),x-x'\rangle \ge \mu \Vert x-x'\Vert ^2,\\&\Vert \nabla f_i(x)-\nabla f_i(x')\Vert \le L \Vert x-x'\Vert . \end{aligned} \end{aligned}$$

We note that, under Assumption 2, problem (1) has a unique solution denoted by \(x^*\in {\mathbb {R}}^{1\times p}\).
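
For instance (a standard example, not taken from the paper), quadratic costs with uniformly bounded Hessians satisfy Assumption 2:

$$\begin{aligned} f_i(x)=\frac{1}{2}x^{\intercal }Q_ix+b_i^{\intercal }x,\qquad \mu {\mathbf {I}}\preceq Q_i\preceq L{\mathbf {I}}, \end{aligned}$$

in which case \(\nabla f_i(x)-\nabla f_i(x')=Q_i(x-x')\), so that \(\langle \nabla f_i(x)-\nabla f_i(x'),x-x'\rangle =(x-x')^{\intercal }Q_i(x-x')\ge \mu \Vert x-x'\Vert ^2\) and \(\Vert Q_i(x-x')\Vert \le L\Vert x-x'\Vert \).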

A graph is a pair \({\mathcal {G}}=({\mathcal {V}},{\mathcal {E}})\) where \({\mathcal {V}}=\{1,2,\ldots ,n\}\) is the set of vertices (nodes) and \({\mathcal {E}}\subseteq {\mathcal {V}}\times {\mathcal {V}}\) represents the set of edges connecting vertices. We assume agents communicate in an undirected graph, i.e., \((i,j)\in {\mathcal {E}}\) iff (if and only if) \((j,i)\in {\mathcal {E}}\). For each agent i, let \({\mathcal {N}}_i=\{j\mid j\ne i, (i,j)\in {\mathcal {E}}\}\) be its set of neighbors. The cardinality of \({\mathcal {N}}_i\), denoted by \(\deg (i)\), is referred to as agent i’s degree. We consider the following condition regarding the interaction graph of agents.

Assumption 3

The graph \({\mathcal {G}}\) corresponding to the network of agents is undirected and connected, i.e., there exists a path between any two agents.

1.4 Organization of the paper

The paper is organized as follows. In Sect. 2, we introduce the distributed stochastic gradient tracking method and present its main convergence results. We perform analysis in Sect. 3. In Sect. 4 we propose the gossip-like stochastic gradient tracking method. A numerical example is provided in Sect. 5 to illustrate our theoretical findings. Section 6 concludes the paper.

2 A distributed stochastic gradient tracking method (DSGT)

We consider the following distributed stochastic gradient tracking method: At each step \(k\in {\mathbb {N}}\), every agent i independently implements the following two steps:

$$\begin{aligned} \begin{aligned} x_{i,k+1}&= \sum _{j=1}^{n}w_{ij}(x_{j,k}-\alpha y_{j,k}), \\ y_{i,k+1}&= \sum _{j=1}^{n}w_{ij}y_{j,k}+g_i(x_{i,k+1},\xi _{i,k+1})-g_i(x_{i,k},\xi _{i,k}), \end{aligned} \end{aligned}$$
(4)

where \(w_{ij}\) are nonnegative weights and \(\alpha >0\) is a constant stepsize. Agents i and j are connected iff \(w_{ij}, w_{ji}>0\). The iterates are initialized with an arbitrary \(x_{i,0}\) and \(y_{i,0}= g_i(x_{i,0},\xi _{i,0})\) for all \(i\in {{{\mathcal {N}}}}\). We can also write (4) in the following compact form:

$$\begin{aligned} \begin{aligned} {\mathbf {x}}_{k+1}&= {\mathbf {W}}({\mathbf {x}}_k-\alpha {\mathbf {y}}_k), \\ {\mathbf {y}}_{k+1}&= {\mathbf {W}}{\mathbf {y}}_k+G({\mathbf {x}}_{k+1},\varvec{\xi }_{k+1})-G({\mathbf {x}}_k,\varvec{\xi }_k), \end{aligned} \end{aligned}$$
(5)

where \({\mathbf {W}}=[w_{ij}]\in {\mathbb {R}}^{n\times n}\) denotes the coupling matrix of agents. We assume that \({\mathbf {W}}\) satisfies the following condition.

Assumption 4

Nonnegative coupling matrix \({\mathbf {W}}\) is doubly stochastic, i.e., \({\mathbf {W}}{\mathbf {1}}={\mathbf {1}}\) and \({\mathbf {1}}^{\intercal }{\mathbf {W}}={\mathbf {1}}^{\intercal }\). In addition, \(w_{ii}>0\) for some \(i\in {\mathcal {N}}\).
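
A minimal simulation sketch of iteration (5) is given below; it is illustrative only and not part of the paper's development. The weights \({\mathbf {W}}\) are built with the lazy Metropolis rule on a ring graph (one common construction, among others, satisfying Assumption 4 on a connected undirected graph), the local costs are hypothetical quadratics, and the stochastic oracle adds zero-mean noise to the exact gradients:

import numpy as np

rng = np.random.default_rng(2)
n, p, alpha, sigma = 8, 3, 0.05, 0.1

# Ring graph and lazy Metropolis weights: w_ij = 1/(2*max(deg_i, deg_j)) on edges,
# w_ii = 1 - sum_{j != i} w_ij.  The resulting W is symmetric and doubly stochastic.
adj = np.zeros((n, n), dtype=bool)
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = True
deg = adj.sum(axis=1)
W = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if adj[i, j]:
            W[i, j] = 1.0 / (2 * max(deg[i], deg[j]))
W += np.diag(1.0 - W.sum(axis=1))

# Hypothetical local costs f_i(x) = 0.5*||x - c_i||^2 queried through a noisy oracle.
c = rng.normal(size=(n, p))
def G(x):
    return (x - c) + sigma * rng.normal(size=(n, p))       # rows are g_i(x_i, xi_i)

# DSGT, eq. (5): x_{k+1} = W(x_k - alpha*y_k),  y_{k+1} = W y_k + G(x_{k+1}) - G(x_k).
x = rng.normal(size=(n, p))
g_old = G(x)
y = g_old.copy()                                           # y_{i,0} = g_i(x_{i,0}, xi_{i,0})
x_star = c.mean(axis=0)                                    # unique minimizer of f
for k in range(3000):
    x = W @ (x - alpha * y)
    g_new = G(x)
    y = W @ y + g_new - g_old
    g_old = g_new

print("distance of the network average to x*:", np.linalg.norm(x.mean(axis=0) - x_star))

With the parameters above one should observe the network average settling near \(x^*\), with a residual error that shrinks as \(\alpha \) decreases, consistent with Theorem 1 below.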

In the subsequent analysis, we will frequently use the following result, which is a direct implication of Assumptions 3 and 4 (see [46] Section II-B).

Lemma 1

Let Assumptions 3 and 4 hold, and let \(\rho _w\) denote the spectral norm of the matrix \({\mathbf {W}}-\frac{1}{n}{\mathbf {1}}{\mathbf {1}}^{\intercal }\). Then, \(\rho _w<1\) and

$$\begin{aligned} \Vert {\mathbf {W}}\omega -{\mathbf {1}}{\overline{\omega }}\Vert \le \rho _w\Vert \omega -{\mathbf {1}}{\overline{\omega }}\Vert \end{aligned}$$

for all \(\omega \in {\mathbb {R}}^{n\times p}\), where \({\overline{\omega }}=\frac{1}{n}{\mathbf {1}}^{\intercal }\omega \).

Algorithm (4) is closely related to the schemes considered in [13, 34, 46], in which auxiliary variables \(y_{i,k}\) were introduced to track the average \(\frac{1}{n}\sum _{i=1}^{n}\nabla f_i(x_{i,k})\). This design ensures that the algorithms achieve linear convergence under a constant stepsize choice. Correspondingly, under our approach \(y_{i,k}\) are (approximately) tracking \(\frac{1}{n}\sum _{i=1}^{n}g_i(x_{i,k},\xi _{i,k})\). To see why this is the case, note that \({\overline{y}}_k = \frac{1}{n}{\mathbf {1}}^{\intercal } {\mathbf {y}}_k\). Since \(y_{i,0}=g_i(x_{i,0},\xi _{i,0}),\forall i\), by induction we have

$$\begin{aligned} {\overline{y}}_k=\frac{1}{n}{\mathbf {1}}^{\intercal }G({\mathbf {x}}_k,\varvec{\xi }_k) =\frac{1}{n}\sum _{i=1}^{n}g_i(x_{i,k},\xi _{i,k}),\forall k. \end{aligned}$$
(6)

It will be shown that \({\mathbf {y}}_k\) is close to \({\mathbf {1}}{\overline{y}}_k\) in expectation when k is sufficiently large. As a result, \(y_{i,k}\) are (approximately) tracking \(\frac{1}{n}\sum _{i=1}^{n}g_i(x_{i,k},\xi _{i,k})\).
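
For completeness, the induction step behind (6) amounts to multiplying the \({\mathbf {y}}\)-update in (5) by \(\frac{1}{n}{\mathbf {1}}^{\intercal }\) and using the double stochasticity of \({\mathbf {W}}\) (Assumption 4):

$$\begin{aligned} {\overline{y}}_{k+1}=\frac{1}{n}{\mathbf {1}}^{\intercal }{\mathbf {W}}{\mathbf {y}}_k+\frac{1}{n}{\mathbf {1}}^{\intercal }\left[ G({\mathbf {x}}_{k+1},\varvec{\xi }_{k+1})-G({\mathbf {x}}_k,\varvec{\xi }_k)\right] ={\overline{y}}_k+\frac{1}{n}{\mathbf {1}}^{\intercal }G({\mathbf {x}}_{k+1},\varvec{\xi }_{k+1})-\frac{1}{n}{\mathbf {1}}^{\intercal }G({\mathbf {x}}_k,\varvec{\xi }_k), \end{aligned}$$

so that \({\overline{y}}_k=\frac{1}{n}{\mathbf {1}}^{\intercal }G({\mathbf {x}}_k,\varvec{\xi }_k)\) implies \({\overline{y}}_{k+1}=\frac{1}{n}{\mathbf {1}}^{\intercal }G({\mathbf {x}}_{k+1},\varvec{\xi }_{k+1})\).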

It is worth noting that, compared to the standard distributed subgradient methods [36], DSGT incurs twice the communication and storage costs per iteration.

2.1 Main results

Main convergence properties of DSGT are covered in the following theorem.

Theorem 1

Let \(\varGamma >1\) be arbitrarily chosen. Suppose Assumptions 1 and 2 hold and the stepsize \(\alpha \) satisfies

$$\begin{aligned} \alpha \le \min \left\{ \frac{(1-\rho _w^2)}{12\rho _w L},\frac{(1-\rho _w^2)^2}{2\sqrt{\varGamma }L\max \{6\rho _w\Vert {\mathbf {W}}-{\mathbf {I}}\Vert ,1-\rho _w^2\}}, \frac{(1-\rho _w^2)}{3\rho _w^{2/3}L}\left[ \frac{\mu ^2}{L^2}\frac{(\varGamma -1)}{\varGamma (\varGamma +1)}\right] ^{1/3}\right\} .\nonumber \\ \end{aligned}$$
(7)

Then \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\overline{x}}_l-x^*\Vert ^2]\) and \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\mathbf {x}}_{l}-{\mathbf {1}}{\overline{x}}_{l}\Vert ^2]\), respectively, converge to \(\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2]\) and \(\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2]\) at the linear rate \({\mathcal {O}}(\rho ({\mathbf {A}})^k)\), where \(\rho ({\mathbf {A}})<1\) is the spectral radius of the matrix \({\mathbf {A}}\) given by

$$\begin{aligned} {\mathbf {A}}=\begin{bmatrix} 1-\alpha \mu &{}\quad \frac{\alpha L^2}{\mu n}(1+\alpha \mu ) &{}\quad 0\\ 0 &{}\quad \frac{1}{2}(1+\rho _w^2) &{}\quad \alpha ^2\frac{(1+\rho _w^2)\rho _w^2}{(1-\rho _w^2)}\\ 2\alpha nL^3 &{}\quad \left( \frac{1}{\beta }+2\right) \Vert {\mathbf {W}}-{\mathbf {I}}\Vert ^2 L^2+3\alpha L^3 &{}\quad \frac{1}{2}(1+\rho _w^2) \end{bmatrix}, \end{aligned}$$

where \(\beta =\frac{1-\rho _w^2}{2\rho _w^2}-4\alpha L-2\alpha ^2 L^2\). Furthermore,

$$\begin{aligned} \limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2]{\le } \frac{(\varGamma +1)}{\varGamma }\frac{\alpha \sigma ^2}{\mu n} {+}\left( \frac{\varGamma +1}{\varGamma -1}\right) \frac{4\alpha ^2 L^2(1+\alpha \mu )(1+\rho _w^2)\rho _w^2}{\mu ^2 n(1-\rho _w^2)^3}M_{\sigma }, \end{aligned}$$
(8)

and

$$\begin{aligned} \limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2] \le \left( \frac{\varGamma +1}{\varGamma -1}\right) \frac{4\alpha ^2 (1+\rho _w^2)\rho _w^2(2\alpha ^2L^3\sigma ^2+\mu M_{\sigma })}{\mu (1-\rho _w^2)^3}, \end{aligned}$$
(9)

where

$$\begin{aligned} M_{\sigma }:=\left[ 3\alpha ^2L^2+2(\alpha L+1)(n+1)\right] \sigma ^2. \end{aligned}$$
(10)
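
The quantities in Theorem 1 are straightforward to evaluate numerically. The sketch below (illustrative only; the problem constants are assumed, not from the paper) simply transcribes the stepsize condition (7), the matrix \({\mathbf {A}}\), and the bounds (8)-(10), and can be used to sanity-check that \(\rho ({\mathbf {A}})<1\) for an admissible stepsize:

import numpy as np

# Illustrative problem constants (assumed for the example).
mu, L, sigma, n, Gamma = 1.0, 5.0, 0.1, 8, 2.0
rho_w, normWI = 0.6, 1.2        # spectral norm of W - (1/n)11^T and ||W - I|| for some W

# Stepsize condition (7).
alpha = min((1 - rho_w**2) / (12 * rho_w * L),
            (1 - rho_w**2)**2 / (2 * np.sqrt(Gamma) * L
                                 * max(6 * rho_w * normWI, 1 - rho_w**2)),
            (1 - rho_w**2) / (3 * rho_w**(2/3) * L)
            * (mu**2 / L**2 * (Gamma - 1) / (Gamma * (Gamma + 1)))**(1/3))

beta = (1 - rho_w**2) / (2 * rho_w**2) - 4 * alpha * L - 2 * alpha**2 * L**2
A = np.array([
    [1 - alpha * mu, alpha * L**2 / (mu * n) * (1 + alpha * mu), 0.0],
    [0.0, 0.5 * (1 + rho_w**2),
     alpha**2 * (1 + rho_w**2) * rho_w**2 / (1 - rho_w**2)],
    [2 * alpha * n * L**3,
     (1 / beta + 2) * normWI**2 * L**2 + 3 * alpha * L**3, 0.5 * (1 + rho_w**2)],
])
print("alpha =", alpha, "  spectral radius rho(A) =", max(abs(np.linalg.eigvals(A))))

# Limiting error bound (8), with M_sigma from (10).
M_sigma = (3 * alpha**2 * L**2 + 2 * (alpha * L + 1) * (n + 1)) * sigma**2
bound_opt = ((Gamma + 1) / Gamma * alpha * sigma**2 / (mu * n)
             + (Gamma + 1) / (Gamma - 1) * 4 * alpha**2 * L**2 * (1 + alpha * mu)
             * (1 + rho_w**2) * rho_w**2 / (mu**2 * n * (1 - rho_w**2)**3) * M_sigma)
print("limsup E||x_bar - x*||^2 <=", bound_opt)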

Remark 1

The first term on the right-hand side of (8) can be interpreted as the error caused by stochastic gradients only, since it does not depend on the network topology. The second term on the right-hand side of (8) and the bound in (9) are network dependent, and they increase with \(\rho _w\) (larger \(\rho _w\) indicates worse network connectivity).

In light of (8) and (9), we have

$$\begin{aligned} \limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2]=\alpha {\mathcal {O}}\left( \frac{\sigma ^2}{\mu n}\right) +\frac{\alpha ^2}{(1-\rho _w)^3}{\mathcal {O}}\left( \frac{ L^2\sigma ^2}{\mu ^2}\right) , \end{aligned}$$

and

$$\begin{aligned} \limsup _{k\rightarrow \infty }\frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2]=\frac{\alpha ^2}{(1-\rho _w)^3}{\mathcal {O}}\left( \sigma ^2\right) +\frac{\alpha ^4}{(1-\rho _w)^3}{\mathcal {O}}\left( \frac{ L^3\sigma ^2}{\mu n}\right) . \end{aligned}$$

Let \(\frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}x^*\Vert ^2]\) measure the average quality of solutions obtained by all the agents. We have

$$\begin{aligned} \limsup _{k\rightarrow \infty }\frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}x^*\Vert ^2]&=\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2]+\limsup _{k\rightarrow \infty } \frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2]\nonumber \\&=\alpha {\mathcal {O}}\left( \frac{\sigma ^2}{\mu n}\right) +\frac{\alpha ^2}{(1-\rho _w)^3}{\mathcal {O}}\left( \frac{ L^2\sigma ^2}{\mu ^2}\right) , \end{aligned}$$
(11)

which is decreasing in the network size n when \(\alpha \) is sufficiently small, i.e., when

$$\begin{aligned} \alpha \frac{\sigma ^2}{\mu n}\sim \frac{\alpha ^2}{(1-\rho _w)^3}\frac{ L^2\sigma ^2}{\mu ^2},\quad \text {or}\quad \alpha \sim \frac{\mu }{L^2}\frac{(1-\rho _w)^3}{n}. \end{aligned}$$

The spectral gap \(1-\rho _w\) depends on the graph topology as well as the choices of weights \([w_{ij}]\). For example, suppose the Lazy Metropolis rule is adopted (see [33]). In this case, if the communication graph is a 1) path or ring graph, then \(1-\rho _w={\mathcal {O}}(1/n^2)\); 2) lattice, then \(1-\rho _w={\mathcal {O}}(1/n)\); 3) geometric random graph, then \(1-\rho _w={\mathcal {O}}(1/n\log n)\); 4) Erdős-Rényi random graph, then \(1-\rho _w={\mathcal {O}}(1)\); 5) complete graph, then \(1-\rho _w=1\); 6) any connected undirected graph, then \(1-\rho _w={\mathcal {O}}(1/n^2)\).

From the above argument, the condition \(\alpha \sim \frac{\mu }{L^2}\frac{(1-\rho _w)^3}{n}\) is in general stricter than (7) in Theorem 1, which requires \(\alpha \sim \frac{(1-\rho _w)^2}{L}\) (when \(1-\rho _w\le (\frac{\mu }{L})^{2/3}\)). Such a difference suggests that, when implementing DSGT in practice, it can be advantageous to use a larger stepsize in the early stage to achieve faster convergence (see Corollary 1 below) and then switch to smaller stepsizes for higher accuracy in the final solutions.\(\square \)
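
The scalings of the spectral gap \(1-\rho _w\) quoted in Remark 1 can also be checked numerically. The sketch below (illustrative; lazy_metropolis and spectral_gap are our own helper names) builds lazy Metropolis weights for ring graphs of increasing size, where the gap decays like \(1/n^2\), and uses uniform averaging weights \({\mathbf {W}}=\frac{1}{n}{\mathbf {1}}{\mathbf {1}}^{\intercal }\) for the complete graph, where \(1-\rho _w=1\) exactly:

import numpy as np

def lazy_metropolis(adj):
    """Lazy Metropolis weights for an undirected graph given by a boolean adjacency matrix."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (2 * max(deg[i], deg[j]))
    return W + np.diag(1.0 - W.sum(axis=1))

def spectral_gap(W):
    n = W.shape[0]
    return 1.0 - np.linalg.norm(W - np.ones((n, n)) / n, 2)

for n in (10, 20, 40, 80):
    ring = np.zeros((n, n), dtype=bool)
    for i in range(n):
        ring[i, (i + 1) % n] = ring[(i + 1) % n, i] = True
    print(f"ring, n={n:3d}: 1 - rho_w = {spectral_gap(lazy_metropolis(ring)):.5f}")

# Complete graph with uniform averaging W = (1/n) 11^T gives 1 - rho_w = 1 exactly.
n = 20
print("complete graph (uniform weights):", spectral_gap(np.ones((n, n)) / n))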

Remark 2

Under a centralized stochastic gradient (CSG) algorithm in the form of

$$\begin{aligned} x_{k+1}=x_k-\alpha \frac{1}{n}\sum _{i=1}^n g_i(x_k,\xi _{i,k}),\ \ k\in {\mathbb {N}}, \end{aligned}$$
(12)

we would obtain

$$\begin{aligned} \limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert x_k-x^*\Vert ^2]=\alpha {\mathcal {O}} \left( \frac{\sigma ^2}{\mu n}\right) . \end{aligned}$$

It can be observed that DSGT is comparable with CSG in their ultimate error bounds (up to constant factors) with sufficiently small stepsizes.\(\square \)

As shown in Theorem 1, the convergence rate of DSGT is determined by the spectral radius \(\rho ({\mathbf {A}})<1\). In the corollary below we provide an upper bound of \(\rho ({\mathbf {A}})\).

Corollary 1

Under the conditions in Theorem 1, assuming in addition that the stepsize \(\alpha \) also satisfies

$$\begin{aligned} \alpha \le \frac{(\varGamma +1)}{\varGamma }\frac{(1-\rho _w^2)}{8\mu }, \end{aligned}$$
(13)

we have

$$\begin{aligned} \rho ({\mathbf {A}})\le 1-\left( \frac{\varGamma -1}{\varGamma +1}\right) \alpha \mu . \end{aligned}$$

Corollary 1 implies that, for sufficiently small stepsizes, the distributed gradient tracking method has a comparable convergence speed to that of a centralized scheme (in which case the linear rate is \({\mathcal {O}}((1-2\alpha \mu )^k)\)).

In the next theorem, we show DSGT achieves the \({\mathcal {O}}(\frac{1}{k})\) rate of convergence under a diminishing stepsize policy.

Theorem 2

Let Assumptions 1 and 2 hold. Consider the method in (4) where \(\alpha \) is replaced with the time-varying stepsize \(\alpha _k\) given by \(\alpha _k:=\theta /(m+k)\), where \(\theta >1/\mu \) and m satisfies

$$\begin{aligned} {\left\{ \begin{array}{ll} m>\max \left\{ \frac{\theta }{2}(\mu +L), \frac{4\theta L\rho _w^2+2\theta L\rho _w\sqrt{1+3\rho _w^2}}{1-\rho _w^2}\right\} ,\\ \frac{(1-\rho _w^2)^2}{\theta ^2(1+\rho _w^2)\rho _w^2} \left[ \frac{(1-\rho _w^2)}{2}-\frac{2m+1}{(m+1)^2}\right] > \frac{1}{(\theta \mu -1)}\left( \frac{1}{\mu }+\frac{\theta }{m}\right) \frac{4\theta ^2 L^5}{m^3}+\frac{2C}{m^2}, \end{array}\right. } \end{aligned}$$
(14)

with \(C=\left[ \left( \frac{1-\rho _w^2}{2\rho _w^2}-\frac{4\theta L}{m}-\frac{2\theta ^2 L^2}{m^2}\right) ^{-1}+2\right] \Vert {\mathbf {W}}-{\mathbf {I}}\Vert ^2 L^2+\frac{3\theta L^3}{m}.\) Then for all \(k\ge 0\), we have

$$\begin{aligned}&{\mathbb {E}}[\Vert {\overline{x}}_{k}-x^*\Vert ^2] \le \frac{2\theta ^2 \sigma ^2}{n(\theta \mu -1)(m+k)}+\frac{{\mathcal {O}}_k(1)}{(m+k)^{\theta \mu }} +\frac{{\mathcal {O}}_k(1)}{(m+k)^2}, \end{aligned}$$
(15a)
$$\begin{aligned}&{\mathbb {E}}[\Vert {\mathbf {x}}_{k}-{\mathbf {1}}{\overline{x}}_{k}\Vert ^2]\le \frac{{\mathcal {O}}_k(1)}{(m+k)^2}, \end{aligned}$$
(15b)

where \({\mathcal {O}}_k(1)\) denotes some constant that does not depend on k.
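
Condition (14) can be checked numerically for given problem constants. The helper below is illustrative only (the function name and the constants are ours); it scans for the smallest integer m satisfying both parts of (14), after which the stepsize schedule is simply \(\alpha _k=\theta /(m+k)\):

import numpy as np

def smallest_valid_m(theta, mu, L, rho_w, normWI, m_max=10**6):
    """Return the smallest integer m satisfying condition (14) of Theorem 2 (illustrative)."""
    for m in range(1, m_max):
        # First part of (14): m large enough for the stepsize and so that beta_0 > 0.
        if m <= max(theta * (mu + L) / 2,
                    (4 * theta * L * rho_w**2
                     + 2 * theta * L * rho_w * np.sqrt(1 + 3 * rho_w**2)) / (1 - rho_w**2)):
            continue
        beta0 = (1 - rho_w**2) / (2 * rho_w**2) - 4 * theta * L / m - 2 * theta**2 * L**2 / m**2
        C = (1 / beta0 + 2) * normWI**2 * L**2 + 3 * theta * L**3 / m
        lhs = ((1 - rho_w**2)**2 / (theta**2 * (1 + rho_w**2) * rho_w**2)
               * ((1 - rho_w**2) / 2 - (2 * m + 1) / (m + 1)**2))
        rhs = (1 / (theta * mu - 1) * (1 / mu + theta / m) * 4 * theta**2 * L**5 / m**3
               + 2 * C / m**2)
        if lhs > rhs:
            return m
    return None

# Example with theta > 1/mu, as required by the theorem (all constants assumed).
print(smallest_valid_m(theta=2.0, mu=1.0, L=5.0, rho_w=0.6, normWI=1.2))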

Remark 3

From (15), noting that \(\theta \mu >1\), we have

$$\begin{aligned} \frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_{k}-{\mathbf {1}}x^*\Vert ^2]={\mathbb {E}}[\Vert {\overline{x}}_{k}-x^*\Vert ^2]+ \frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_{k}-{\mathbf {1}}{\overline{x}}_{k}\Vert ^2]={{\mathcal {O}}_k}\left( \frac{1}{k}\right) . \end{aligned}$$

In particular, \(\frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_{k}-{\mathbf {1}}x^*\Vert ^2]\) asymptotically converges to 0 at the rate \(\frac{2\theta ^2 \sigma ^2}{n(\theta \mu -1)k}\), which does not depend on the spectral norm \(\rho _w\) and matches the convergence rate (up to constant factors) for centralized stochastic gradient methods (see [38, 47]).\(\square \)

Remark 4

Under a well connected network (\(\rho _w\simeq 0\)), \(m>\frac{\theta }{2}(\mu +L)\) suffices. When \(\rho _w\simeq 1\), the lower bound of m is in the order of \({\mathcal {O}}((1-\rho _w)^{-2})\).\(\square \)

Theorem 2 implies the following result if we choose \(\alpha _k=\theta /(k+1)\).

Corollary 2

Under Assumptions 1 and 2 and stepsize policy \(\alpha _k:=\theta /(k+1)\) for some \(\theta >1/\mu \), we have for all \(k\ge m\) where m satisfies condition (14),

$$\begin{aligned} {\mathbb {E}}[\Vert {\overline{x}}_{k}-x^*\Vert ^2]\le \frac{{\tilde{U}}}{k}, \ \ {\mathbb {E}}[\Vert {\mathbf {x}}_{k}-{\mathbf {1}}{\overline{x}}_{k}\Vert ^2]\le \frac{{\tilde{X}}}{k^2},\ \ {\mathbb {E}}[\Vert {\mathbf {y}}_{k}-{\mathbf {1}}{\overline{y}}_{k}\Vert ^2]\le {\tilde{Y}}, \end{aligned}$$

where \({\tilde{U}}\), \({\tilde{X}}\), and \({\tilde{Y}}\) are some positive constants.

3 Analysis

In this section, we prove Theorem 1 by studying the evolution of \({\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2]\), \({\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2]\) and \({\mathbb {E}}[\Vert {\mathbf {y}}_k-{\mathbf {1}}{\overline{y}}_k\Vert ^2]\). Our strategy is to bound the three expressions in terms of linear combinations of their past values, thereby establishing a linear system of inequalities. This approach is different from those employed in [34, 46], where the analyses examine \(\Vert {\overline{x}}_k-x^*\Vert \), \(\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert \) and \(\Vert {\mathbf {y}}_k-{\mathbf {1}}{\overline{y}}_k\Vert \). This distinction is due to the stochastic gradients \(g_i(x_{i,k},\xi _{i,k})\), whose variances play a crucial role in deriving the main inequalities.

We first introduce some lemmas that will be used later in the analysis. Denote by \({\mathcal {F}}_k\) the \(\sigma \)-algebra generated by \(\{\varvec{\xi }_0,\ldots ,\varvec{\xi }_{k-1}\}\), and define \({\mathbb {E}}[\cdot \mid {\mathcal {F}}_k]\) as the conditional expectation given \({\mathcal {F}}_k\).

Lemma 2

Under Assumption 1, recalling that \(h({\mathbf {x}})= \frac{1}{n}{\mathbf {1}}^{\intercal }\nabla F({\mathbf {x}})\), we have for all \(k\ge 0\),

$$\begin{aligned} {\mathbb {E}}\left[ \Vert {\overline{y}}_k-h({\mathbf {x}}_k)\Vert ^2\mid {\mathcal {F}}_k\right] \le \frac{\sigma ^2}{n}. \end{aligned}$$
(16)

\(\square \)

Proof

By the definitions of \({\overline{y}}_k\) and \(h({\mathbf {x}}_k)\), and since the random vectors \(\xi _{i,k}\) are independent across agents (so that the cross terms vanish in conditional expectation),

$$\begin{aligned} {\mathbb {E}}\left[ \Vert {\overline{y}}_k-h({\mathbf {x}}_k)\Vert ^2\mid {\mathcal {F}}_k\right] = \frac{1}{n^2}\sum _{i=1}^n{\mathbb {E}}\left[ \Vert g_i(x_{i,k},\xi _{i,k})-\nabla f_i(x_{i,k})\Vert ^2\vert {\mathcal {F}}_k\right] \le \frac{\sigma ^2}{n}. \end{aligned}$$

\(\square \)

Lemma 3

Under Assumption 2, we have for all \(k\ge 0\),

$$\begin{aligned} \Vert \nabla f({\overline{x}}_k)-h({\mathbf {x}}_k)\Vert \le \frac{L}{\sqrt{n}}\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert . \end{aligned}$$
(17)

If in addition \(\alpha <2/(\mu +L)\), then

$$\begin{aligned} \Vert x-\alpha \nabla f(x)-x^*\Vert \le (1-\alpha \mu )\Vert x-x^*\Vert ,\, \forall x\in {\mathbb {R}}^p. \end{aligned}$$

Proof

See [46] Lemma 10 for reference. \(\square \)

In the following lemma, we establish bounds on \(\Vert {\mathbf {x}}_{k+1}-{\mathbf {1}}{\overline{x}}_{k+1}\Vert ^2\) and on the conditional expectations of \(\Vert {\overline{x}}_{k+1}-x^*\Vert ^2\) and \(\Vert {\mathbf {y}}_{k+1}-{\mathbf {1}}{\overline{y}}_{k+1}\Vert ^2\), respectively.

Lemma 4

Suppose Assumptions 1 and 2 hold and \(\alpha <2/(\mu +L)\). We have the following inequalities:

$$\begin{aligned}&{\mathbb {E}}[\Vert {\overline{x}}_{k+1}-x^*\Vert ^2\mid {\mathcal {F}}_k] \le \left( 1-\alpha \mu \right) \Vert {\overline{x}}_k-x^*\Vert ^2\nonumber \\&\quad +\,\frac{\alpha L^2}{\mu n}\left( 1+\alpha \mu \right) \Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2 +\frac{\alpha ^2\sigma ^2}{n}, \end{aligned}$$
(18)
$$\begin{aligned}&\Vert {\mathbf {x}}_{k+1}-{\mathbf {1}}{\overline{x}}_{k+1}\Vert ^2 \le \frac{(1+\rho _w^2)}{2}\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2\nonumber \\&\quad +\,\alpha ^2\frac{(1+\rho _w^2)\rho _w^2}{(1-\rho _w^2)}\Vert {\mathbf {y}}_{k}-{\mathbf {1}}{\overline{y}}_k\Vert ^2, \end{aligned}$$
(19)

and for any \(\beta >0\),

$$\begin{aligned}&{\mathbb {E}}[\Vert {\mathbf {y}}_{k+1}-{\mathbf {1}}{\overline{y}}_{k+1}\Vert ^2\mid {\mathcal {F}}_k] \le \left( 1+4\alpha L+2\alpha ^2 L^2+\beta \right) \rho _w^2{\mathbb {E}}[\Vert {\mathbf {y}}_{k}-{\mathbf {1}}{\overline{y}}_{k}\Vert ^2\mid {\mathcal {F}}_k]\nonumber \\&\quad +\left( \frac{1}{\beta }\Vert {\mathbf {W}}{-}{\mathbf {I}}\Vert ^2 L^2{+}2\Vert {\mathbf {W}}{-}{\mathbf {I}}\Vert ^2 L^2{+}3\alpha L^3\right) \Vert {\mathbf {x}}_k{-}{\mathbf {1}}{\overline{x}}_k\Vert ^2{+}2\alpha nL^3\Vert {\overline{x}}_k-x^*\Vert ^2+M_{\sigma }. \end{aligned}$$
(20)

Proof

See Appendix 7.1. \(\square \)

3.1 Proof of Theorem 1

Taking full expectation on both sides of (18)–(20), we obtain the following linear system of inequalities

$$\begin{aligned} \begin{bmatrix} {\mathbb {E}}[\Vert {\overline{x}}_{k+1}-x^*\Vert ^2]\\ {\mathbb {E}}[\Vert {\mathbf {x}}_{k+1}-{\mathbf {1}}{\overline{x}}_{k+1}\Vert ^2]\\ {\mathbb {E}}[\Vert {\mathbf {y}}_{k+1}-{\mathbf {1}}{\overline{y}}_{k+1}\Vert ^2] \end{bmatrix} \le {\mathbf {A}}\begin{bmatrix} {\mathbb {E}}[\Vert {\overline{x}}_{k}-x^*\Vert ^2]\\ {\mathbb {E}}[\Vert {\mathbf {x}}_{k}-{\mathbf {1}}{\overline{x}}_{k}\Vert ^2]\\ {\mathbb {E}}[\Vert {\mathbf {y}}_{k}-{\mathbf {1}}{\overline{y}}_{k}\Vert ^2] \end{bmatrix}+\begin{bmatrix} \frac{\alpha ^2\sigma ^2}{n}\\ 0\\ M_{\sigma } \end{bmatrix}, \end{aligned}$$
(21)

where the inequality is to be taken component-wise, and the entries of the matrix \({\mathbf {A}}=[a_{ij}]\) are given by

$$\begin{aligned}&\begin{bmatrix} a_{11}\\ a_{21}\\ a_{31} \end{bmatrix} = \begin{bmatrix} 1-\alpha \mu \\ 0\\ 2\alpha nL^3 \end{bmatrix}, \begin{bmatrix} a_{12}\\ a_{22}\\ a_{32} \end{bmatrix} = \begin{bmatrix} \frac{\alpha L^2}{\mu n}(1+\alpha \mu )\\ \frac{1}{2}(1+\rho _w^2)\\ \left( \frac{1}{\beta }+2\right) \Vert {\mathbf {W}}-{\mathbf {I}}\Vert ^2 L^2+3\alpha L^3 \end{bmatrix},\\&\begin{bmatrix} a_{13}\\ a_{23}\\ a_{33} \end{bmatrix} = \begin{bmatrix} 0 \\ \alpha ^2\frac{(1+\rho _w^2)\rho _w^2}{(1-\rho _w^2)}\\ \left( 1+4\alpha L+2\alpha ^2 L^2+\beta \right) \rho _w^2 \end{bmatrix}, \end{aligned}$$

and \(M_{\sigma }\) is given in (10). Therefore, by induction we have

$$\begin{aligned} \begin{bmatrix} {\mathbb {E}}[\Vert {\overline{x}}_{k}-x^*\Vert ^2]\\ {\mathbb {E}}[\Vert {\mathbf {x}}_{k}-{\mathbf {1}}{\overline{x}}_{k}\Vert ^2]\\ {\mathbb {E}}[\Vert {\mathbf {y}}_{k}-{\mathbf {1}}{\overline{y}}_{k}\Vert ^2] \end{bmatrix} \le {\mathbf {A}}^k\begin{bmatrix} {\mathbb {E}}[\Vert {\overline{x}}_{0}-x^*\Vert ^2]\\ {\mathbb {E}}[\Vert {\mathbf {x}}_{0}-{\mathbf {1}}{\overline{x}}_{0}\Vert ^2]\\ {\mathbb {E}}[\Vert {\mathbf {y}}_{0}-{\mathbf {1}}{\overline{y}}_{0}\Vert ^2] \end{bmatrix}+\sum _{l=0}^{k-1}{\mathbf {A}}^l\begin{bmatrix} \frac{\alpha ^2\sigma ^2}{n}\\ 0\\ M_{\sigma } \end{bmatrix}. \end{aligned}$$
(22)

If the spectral radius of \({\mathbf {A}}\) satisfies \(\rho ({\mathbf {A}})<1\), then \({\mathbf {A}}^k\) converges to \({\mathbf {0}}\) at the linear rate \({\mathcal {O}}(\rho ({\mathbf {A}})^k)\) (see [17]), in which case \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\overline{x}}_l-x^*\Vert ^2]\), \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\mathbf {x}}_l-{\mathbf {1}}{\overline{x}}_l\Vert ^2]\) and \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\mathbf {y}}_l-{\mathbf {1}}{\overline{y}}_l\Vert ^2]\) all converge to a neighborhood of 0 at the linear rate \({\mathcal {O}}(\rho ({\mathbf {A}})^k)\). The next lemma provides conditions that ensure \(\rho ({\mathbf {A}})<1\).

Lemma 5

Let \({\mathbf {S}}=[s_{ij}]\in {\mathbb {R}}^{3\times 3}\) be a nonnegative, irreducible matrix with \(s_{ii}<\lambda ^*\) for some \(\lambda ^*>0\) for all \(i=1,2,3\). Then \(\rho ({\mathbf {S}})<\lambda ^*\) iff \(\mathrm {det}(\lambda ^* {\mathbf {I}}-{\mathbf {S}})>0\).

Proof

See Appendix 7.2. \(\square \)

We now derive the conditions such that \(\rho ({\mathbf {A}})<1\). Suppose \(\alpha \) and \(\beta \) meet the following relations:

$$\begin{aligned} a_{33}&=\left( 1+4\alpha L+2\alpha ^2 L^2+\beta \right) \rho _w^2=\frac{1+\rho _w^2}{2}<1, \end{aligned}$$
(23)
$$\begin{aligned} a_{23}a_{32}&=\alpha ^2\frac{(1+\rho _w^2)\rho _w^2}{(1-\rho _w^2)} \left[ \left( \frac{1}{\beta }+2\right) \Vert {\mathbf {W}}-{\mathbf {I}}\Vert ^2 L^2+3\alpha L^3\right] \le \frac{1}{\varGamma }(1-a_{22})(1-a_{33}), \end{aligned}$$
(24)

and

$$\begin{aligned} a_{12}a_{23}a_{31}{=}\frac{2\alpha ^4 L^5(1+\alpha \mu )}{\mu }\frac{(1+\rho _w^2)}{(1-\rho _w^2)}\rho _w^2 {\le } \frac{1}{\varGamma +1}(1{-}a_{11})[(1{-}a_{22})(1{-}a_{33}){-}a_{23}a_{32}]. \end{aligned}$$
(25)

Then,

$$\begin{aligned} \text {det}({\mathbf {I}}-{\mathbf {A}})&=(1-a_{11})(1-a_{22})(1-a_{33})-(1-a_{11})a_{23}a_{32} -a_{12}a_{23}a_{31}\\&\ge \frac{\varGamma }{(\varGamma +1)}(1-a_{11})[(1-a_{22})(1-a_{33})-a_{23}a_{32}]\\&\ge \left( \frac{\varGamma -1}{\varGamma +1}\right) (1-a_{11})(1-a_{22})(1-a_{33})>0. \end{aligned}$$

Given that \(a_{11},a_{22},a_{33}<1\), in light of Lemma 5, we have \(\rho ({\mathbf {A}})<1\). In addition, by denoting \(B:=[\frac{\alpha ^2\sigma ^2}{n}, 0, M_{\sigma }]^{\intercal }\) and \([({\mathbf {I}}-{\mathbf {A}})^{-1}B]_j\) the j-th element of the vector \([({\mathbf {I}}-{\mathbf {A}})^{-1}B]\), we obtain from (22) that

$$\begin{aligned}&\lim \sup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2]\le [({\mathbf {I}}-{\mathbf {A}})^{-1}B]_1 \nonumber \\&\quad = \frac{1}{\text {det}({\mathbf {I}}-{\mathbf {A}})}\left\{ \left[ (1-a_{22})(1-a_{33})-a_{23}a_{32}\right] \frac{\alpha ^2\sigma ^2}{n}+a_{12}a_{23}M_{\sigma }\right\} \nonumber \\&\quad \le \frac{(\varGamma +1)}{\varGamma }\frac{\alpha \sigma ^2}{\mu n}+\left( \frac{\varGamma +1}{\varGamma -1}\right) \frac{ a_{12}a_{23}M_{\sigma }}{(1-a_{11})(1-a_{22})(1-a_{33})}\nonumber \\&\quad = \frac{(\varGamma +1)}{\varGamma }\frac{\alpha \sigma ^2}{\mu n}+ \left( \frac{\varGamma +1}{\varGamma -1}\right) \frac{\alpha ^3 L^2(1+\alpha \mu )(1+\rho _w^2)\rho _w^2M_{\sigma }}{\mu n(1-\rho _w^2)(1-a_{11})(1-a_{22})(1-a_{33})}\nonumber \\&\quad = \frac{(\varGamma +1)}{\varGamma }\frac{\alpha \sigma ^2}{\mu n}+\left( \frac{\varGamma +1}{\varGamma -1}\right) \frac{4\alpha ^2 L^2(1+\alpha \mu )(1+\rho _w^2)\rho _w^2}{\mu ^2 n(1-\rho _w^2)^3}M_{\sigma }, \end{aligned}$$
(26)

and

$$\begin{aligned}&\lim \sup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2] \le [({\mathbf {I}}-{\mathbf {A}})^{-1}B]_2 =\frac{1}{\text {det}({\mathbf {I}}-{\mathbf {A}})}\left[ a_{23}a_{31}\frac{\alpha ^2\sigma ^2}{n} +a_{23}(1-a_{11})M_{\sigma }\right] \\&\quad \le \left( \frac{\varGamma +1}{\varGamma -1}\right) \frac{a_{23}}{(1-a_{11})(1-a_{22}) (1-a_{33})}\left( 2\alpha nL^3\frac{\alpha ^2 \sigma ^2}{n}+\alpha \mu M_{\sigma }\right) \\&\quad = \frac{4(\varGamma +1)\alpha ^2 (1+\rho _w^2)\rho _w^2(2\alpha ^2L^3\sigma ^2+\mu M_{\sigma })}{(\varGamma -1)\mu (1-\rho _w^2)^3}. \end{aligned}$$

It remains to show that (23)–(25) are satisfied under condition (7). By (23), we need

$$\begin{aligned} {\beta =\frac{1-\rho _w^2}{2\rho _w^2}-4\alpha L-2\alpha ^2 L^2}. \end{aligned}$$

Since \(\alpha \le \frac{1-\rho _w^2}{12\rho _w L}\) by (7), we know that

$$\begin{aligned} \beta \ge \frac{1-\rho _w^2}{2\rho _w^2}-\frac{1-\rho _w^2}{3\rho _w} -\frac{(1-\rho _w^2)^2}{72\rho _w^2}\ge \frac{11(1-\rho _w^2)}{72\rho _w^2}\ge \frac{1-\rho _w^2}{8\rho _w^2}>0. \end{aligned}$$
(27)

Condition (24) leads to the inequality below:

$$\begin{aligned} \alpha ^2\frac{(1+\rho _w^2)\rho _w^2}{(1-\rho _w^2)}\left[ \left( \frac{1}{\beta } +2\right) \Vert {\mathbf {W}}-{\mathbf {I}}\Vert ^2 L^2+3\alpha L^3\right] \le \frac{(1-\rho _w^2)^2}{4\varGamma }. \end{aligned}$$

By (27), we only need

$$\begin{aligned} \alpha ^2\left[ \frac{(2+6\rho _w^2)}{(1-\rho _w^2)}\Vert {\mathbf {W}}-{\mathbf {I}}\Vert ^2 L^2+\frac{(1-\rho _w^2)}{4\rho _w^2}L^2\right] \le \frac{(1-\rho _w^2)^3}{4\varGamma (1+\rho _w^2)\rho _w^2}. \end{aligned}$$

The preceding inequality is equivalent to

$$\begin{aligned} \alpha \le \frac{(1-\rho _w^2)^2}{L\sqrt{\varGamma (1+\rho _w^2)}\sqrt{4\rho _w^2(2+6\rho _w^2) \Vert {\mathbf {W}}-{\mathbf {I}}\Vert ^2+(1-\rho _w^2)^2}}, \end{aligned}$$

implying that it is sufficient to have

$$\begin{aligned} \alpha \le \frac{(1-\rho _w^2)^2}{2\sqrt{\varGamma }L\max (6\rho _w\Vert {\mathbf {W}}-{\mathbf {I}}\Vert ,1-\rho _w^2)}. \end{aligned}$$

To see that relation (25) holds, consider a stronger condition

$$\begin{aligned} \frac{2\alpha ^4 L^5(1+\alpha \mu )}{\mu }\frac{(1+\rho _w^2)}{(1-\rho _w^2)}\rho _w^2 \le \frac{(\varGamma -1)}{\varGamma (\varGamma +1)}(1-a_{11})(1-a_{22})(1-a_{33}), \end{aligned}$$

or equivalently,

$$\begin{aligned} \frac{2\alpha ^3 L^5(1+\alpha \mu )}{\mu ^2}\frac{(1+\rho _w^2)}{(1-\rho _w^2)}\rho _w^2 \le \frac{(\varGamma -1)}{4\varGamma (\varGamma +1)}(1-\rho _w^2)^2. \end{aligned}$$

It suffices that

$$\begin{aligned} \alpha \le \frac{(1-\rho _w^2)}{3\rho _w^{2/3}L}\left[ \frac{\mu ^2}{L^2} \frac{(\varGamma -1)}{\varGamma (\varGamma +1)}\right] ^{1/3}. \end{aligned}$$
(28)

3.2 Proof of Corollary 1

We derive an upper bound of \(\rho ({\mathbf {A}})\) under conditions (7) and (13). Note that the characteristic function of \({\mathbf {A}}\) is given by

$$\begin{aligned} \text {det}(\lambda {\mathbf {I}}-{\mathbf {A}})=(\lambda -a_{11})(\lambda -a_{22})(\lambda -a_{33}) -(\lambda -a_{11})a_{23}a_{32}-a_{12}a_{23}a_{31}. \end{aligned}$$

Since \(\text {det}({\mathbf {I}}-{\mathbf {A}})> 0\) and \(\text {det}(\max \{a_{11},a_{22},a_{33}\} {\mathbf {I}}-{\mathbf {A}})=\text {det}(a_{11}{\mathbf {I}}-{\mathbf {A}})<0\), we have \(\rho ({\mathbf {A}})\in (a_{11},1)\). By (24) and (25),

$$\begin{aligned}&\text {det}(\lambda {\mathbf {I}}-{\mathbf {A}}) \ge (\lambda -a_{11})(\lambda -a_{22})(\lambda -a_{33})-(\lambda -a_{11}) a_{23}a_{32}\\&\qquad -\frac{1}{\varGamma +1}(1-a_{11})[(1-a_{22})(1-a_{33})-a_{23}a_{32}]\\&\quad \ge (\lambda -a_{11})(\lambda -a_{22})(\lambda -a_{33})-\frac{1}{\varGamma }(\lambda -a_{11}) (1-a_{22})(1-a_{33})\\&\qquad -\frac{(\varGamma -1)}{\varGamma (\varGamma +1)}(1-a_{11})(1-a_{22})(1-a_{33}). \end{aligned}$$

Suppose \(\lambda =1-\epsilon \) for some \(\epsilon \in (0,\alpha \mu )\) satisfying

$$\begin{aligned} \text {det}(\lambda {\mathbf {I}}-{\mathbf {A}})&\ge \frac{1}{4}(\alpha \mu -\epsilon )\left( 1-\rho _w^2-2\epsilon \right) ^2\\&\quad -\frac{1}{4\varGamma }(\alpha \mu -\epsilon )(1-\rho _w^2)^2-\frac{(\varGamma -1) \alpha \mu }{4\varGamma (\varGamma +1)}(1-\rho _w^2)^2\ge 0, \end{aligned}$$

or equivalently,

$$\begin{aligned} \frac{(\alpha \mu -\epsilon )}{\alpha \mu }\left[ \frac{\left( 1-\rho _w^2-2\epsilon \right) ^2}{(1-\rho _w^2)^2} -\frac{1}{\varGamma }\right] \ge \frac{(\varGamma -1)}{\varGamma (\varGamma +1)}. \end{aligned}$$

It suffices that

$$\begin{aligned} \epsilon \le \left( \frac{\varGamma -1}{\varGamma +1}\right) \alpha \mu . \end{aligned}$$

To see why this is the case, note that \(\frac{(\alpha \mu -\epsilon )}{\alpha \mu }\ge \frac{2}{\varGamma +1}\), and under (13),

$$\begin{aligned} \epsilon \le \left( \frac{\varGamma -1}{\varGamma +1}\right) \frac{(\varGamma +1)}{\varGamma } \frac{(1-\rho _w^2)}{8\mu }\mu =\frac{(\varGamma -1)(1-\rho _w^2)}{8\varGamma }. \end{aligned}$$

As a result,

$$\begin{aligned} 1-\rho _w^2-2\epsilon \ge 1-\rho _w^2-\frac{(\varGamma -1)(1-\rho _w^2)}{4\varGamma }=\frac{(3\varGamma +1)(1-\rho _w^2)}{4\varGamma }. \end{aligned}$$

We then have

$$\begin{aligned}&\frac{(\alpha \mu -\epsilon )}{\alpha \mu }\left[ \frac{\left( 1-\rho _w^2-2\epsilon \right) ^2}{(1-\rho _w^2)^2} -\frac{1}{\varGamma }\right] \ge \frac{2}{(\varGamma +1)} \left[ \frac{(3\varGamma +1)^2}{16 \varGamma ^2}{-}\frac{1}{\varGamma }\right] {=}\frac{1}{\varGamma (\varGamma +1)} \left[ \frac{(3\varGamma +1)^2}{8\varGamma }-2\right] \\&\quad =\frac{1}{\varGamma (\varGamma +1)} \left( \varGamma -1+\frac{\varGamma }{8}+\frac{1}{8\varGamma } -\frac{1}{4}\right) \ge \frac{(\varGamma -1)}{\varGamma (\varGamma +1)}. \end{aligned}$$

Denote

$$\begin{aligned} {\tilde{\lambda }}=1-\left( \frac{\varGamma -1}{\varGamma +1}\right) \alpha \mu . \end{aligned}$$

Then \(\text {det}({\tilde{\lambda }} {\mathbf {I}}-{\mathbf {A}})\ge 0\) so that \(\rho ({\mathbf {A}})\le {\tilde{\lambda }}\).

3.3 Proof of Theorem 2

Similar to (21), under stepsize policy \(\alpha _k=\theta /(m+k)\) where \(m>\frac{\theta }{2}(\mu +L)\), we have

$$\begin{aligned} \begin{bmatrix} {\mathbb {E}}[\Vert {\overline{x}}_{k+1}-x^*\Vert ^2]\\ {\mathbb {E}}[\Vert {\mathbf {x}}_{k+1}-{\mathbf {1}}{\overline{x}}_{k+1}\Vert ^2]\\ {\mathbb {E}}[\Vert {\mathbf {y}}_{k+1}-{\mathbf {1}}{\overline{y}}_{k+1}\Vert ^2] \end{bmatrix} \le {\mathbf {A}}_k\begin{bmatrix} {\mathbb {E}}[\Vert {\overline{x}}_{k}-x^*\Vert ^2]\\ {\mathbb {E}}[\Vert {\mathbf {x}}_{k}-{\mathbf {1}}{\overline{x}}_{k}\Vert ^2]\\ {\mathbb {E}}[\Vert {\mathbf {y}}_{k}-{\mathbf {1}}{\overline{y}}_{k}\Vert ^2] \end{bmatrix}+\begin{bmatrix} \frac{\alpha _k^2\sigma ^2}{n}\\ 0\\ M_k \end{bmatrix}, \end{aligned}$$
(29)

where

$$\begin{aligned} {\mathbf {A}}_k=\begin{bmatrix} 1-\alpha _k\mu &{}\quad \frac{\alpha _k L^2}{\mu n}(1+\alpha _k\mu ) &{}\quad 0\\ 0 &{}\quad \frac{1}{2}(1+\rho _w^2) &{}\quad \alpha _k^2\frac{(1+\rho _w^2)\rho _w^2}{(1-\rho _w^2)}\\ 2\alpha _k nL^3 &{}\quad \left( \frac{1}{\beta _k}+2\right) \Vert {\mathbf {W}}-{\mathbf {I}}\Vert ^2 L^2+3\alpha _k L^3 &{}\quad \frac{1}{2}(1+\rho _w^2) \end{bmatrix}, \end{aligned}$$

\(\beta _k=\frac{1-\rho _w^2}{2\rho _w^2}-4\alpha _k L-2\alpha _k^2 L^2>0\), and

$$\begin{aligned} M_k:=\left[ 3\alpha _k^2L^2+2(\alpha _k L+1)(n+1)\right] \sigma ^2. \end{aligned}$$
(30)

The condition \(\beta _k>0\) is satisfied when

$$\begin{aligned} \beta _0=\frac{1-\rho _w^2}{2\rho _w^2}-\frac{4\theta L}{m}-\frac{2\theta ^2 L^2}{m^2}>0, \end{aligned}$$

or equivalently,

$$\begin{aligned} m>\frac{4\theta L\rho _w^2+2\theta L\rho _w\sqrt{1+3\rho _w^2}}{1-\rho _w^2}. \end{aligned}$$
(31)

We first show that \({\mathbb {E}}[\Vert {\overline{x}}_{k}-x^*\Vert ^2]\le {\hat{U}}/(m+k)\) and \({\mathbb {E}}[\Vert {\mathbf {x}}_{k}-{\mathbf {1}}{\overline{x}}_{k}\Vert ^2]\le {\hat{X}}/(m+k)^2\) for some \({\hat{U}}\), \({\hat{X}}>0\) by induction. Denote \(U_k:={\mathbb {E}}[\Vert {\overline{x}}_{k}-x^*\Vert ^2]\), \(X_k:={\mathbb {E}}[\Vert {\mathbf {x}}_{k}-{\mathbf {1}}{\overline{x}}_{k}\Vert ^2]\), and \(Y_k:={\mathbb {E}}[\Vert {\mathbf {y}}_{k}-{\mathbf {1}}{\overline{y}}_{k}\Vert ^2]\). Suppose for some \(k\ge 0\),

$$\begin{aligned} U_k\le {\hat{U}}/(m+k), \ \ X_k\le {\hat{X}}/(m+k)^2,\ \ Y_k\le {\hat{Y}}. \end{aligned}$$
(32)

We want to prove that

$$\begin{aligned}&U_{k+1}\le \frac{(1-\alpha _k\mu ){\hat{U}}}{(m+k)}+\frac{\alpha _k L^2}{\mu n}\frac{(1+\alpha _k\mu ){\hat{X}}}{(m+k)^2}+\frac{\alpha _k^2\sigma ^2}{n}\le \frac{{\hat{U}}}{(m+k+1)}, \end{aligned}$$
(33a)
$$\begin{aligned}&X_{k+1}\le \frac{(1+\rho _w^2){\hat{X}}}{2(m+k)^2}+\alpha _k^2 \frac{(1+\rho _w^2)\rho _w^2 {\hat{Y}}}{(1-\rho _w^2)}\le \frac{{\hat{X}}}{(m+k+1)^2}, \end{aligned}$$
(33b)
$$\begin{aligned}&Y_{k+1}\le \frac{2\alpha _k nL^3{\hat{U}}}{(m+k)}+\frac{C{\hat{X}}}{(m+k)^2} +\frac{(1+\rho _w^2){\hat{Y}}}{2}+M_0\le {\hat{Y}}, \end{aligned}$$
(33c)

where \(C=(\frac{1}{\beta _0}+2)\Vert {\mathbf {W}}-{\mathbf {I}}\Vert ^2 L^2+3\alpha _0 L^3\). Plugging in \(\alpha _k=\theta /(m+k)\), it suffices to show

$$\begin{aligned}&{\hat{U}}\ge \frac{1}{n(\theta \mu -1)}\left[ \left( \frac{\theta L^2}{\mu m}+\frac{\theta ^2 L^2}{m^2}\right) {\hat{X}}+\theta ^2\sigma ^2\right] , \end{aligned}$$
(34a)
$$\begin{aligned}&{\hat{Y}}\le \frac{(1-\rho _w^2)}{\theta ^2(1+\rho _w^2)\rho _w^2}\left[ \frac{(1-\rho _w^2)}{2} -\frac{2m+1}{(m+1)^2}\right] {\hat{X}}, \end{aligned}$$
(34b)
$$\begin{aligned}&{\hat{Y}}\ge \frac{2}{(1-\rho _w^2)}\left( \frac{2\theta nL^3{\hat{U}}}{m^2}+\frac{C{\hat{X}}}{m^2}+\left[ \frac{3\theta ^2L^2}{m^2} +2\left( \frac{\theta L}{m}+1\right) (n+1)\right] \sigma ^2\right) . \end{aligned}$$
(34c)

With

$$\begin{aligned} {\hat{U}}:= \max \left\{ \frac{1}{n(\theta \mu -1)}\left[ \left( \frac{\theta L^2}{\mu m}+\frac{\theta ^2 L^2}{m^2}\right) {\hat{X}}+\theta ^2\sigma ^2\right] ,m\Vert {\overline{x}}_0-x^*\Vert ^2\right\} , \end{aligned}$$

condition (34) admits a solution iff

$$\begin{aligned} \frac{(1-\rho _w^2)^2}{\theta ^2(1+\rho _w^2)\rho _w^2} \left[ \frac{(1-\rho _w^2)}{2}-\frac{2m+1}{(m+1)^2}\right] > \frac{1}{(\theta \mu -1)}\left( \frac{1}{\mu }+\frac{\theta }{m}\right) \frac{4\theta ^2 L^5}{m^3}+\frac{2C}{m^2},\nonumber \\ \end{aligned}$$
(35)

in which case \({\hat{X}}\) and \({\hat{Y}}\) are lower bounded under constraints {(34b), (34c), \({\hat{X}}\ge m^2 X_0\), \({\hat{Y}}\ge Y_0\)}. Specifically, \({\hat{X}}\) can be chosen as follows:

$$\begin{aligned} {\hat{X}}:=\max \left\{ \frac{C_3}{C_1-C_2},\frac{C_5}{C_1-C_4},m^2 X_0,\frac{Y_0}{C_1}\right\} , \end{aligned}$$

where

$$\begin{aligned} C_1&=\frac{(1-\rho _w^2)}{\theta ^2(1+\rho _w^2)\rho _w^2} \left[ \frac{(1-\rho _w^2)}{2}-\frac{2m+1}{(m+1)^2}\right] , \end{aligned}$$
(36a)
$$\begin{aligned} C_2&=\frac{2C}{(1-\rho _w^2)m^2},\end{aligned}$$
(36b)
$$\begin{aligned} C_3&= \frac{2}{(1-\rho _w^2)}\left\{ \frac{2\theta nL^3\Vert {\overline{x}}_{0}-x^*\Vert ^2}{m}+\left[ \frac{3\theta ^2L^2}{m^2}+2\left( \frac{\theta L}{m}+1\right) (n+1)\right] \sigma ^2\right\} ,\end{aligned}$$
(36c)
$$\begin{aligned} C_4&= \frac{2}{(1-\rho _w^2)}\left[ \frac{2\theta ^2 L^5}{m^3(\theta \mu -1)}\left( \frac{1}{\mu }+\frac{\theta }{m}\right) +\frac{C}{m^2}\right] ,\end{aligned}$$
(36d)
$$\begin{aligned} C_5&= \frac{2}{(1-\rho _w^2)}\left[ \frac{2\theta ^3 L^3}{m^2(\theta \mu -1)}+\frac{3\theta ^2L^2}{m^2}+2\left( \frac{\theta L}{m}+1\right) (n+1)\right] \sigma ^2. \end{aligned}$$
(36e)

Since relation (32) holds trivially when \(k=0\), the induction is complete.

We further improve the bound on \(U_k\) (inspired by [39, 44]). From (29),

$$\begin{aligned} U_{k+1}\le (1-\alpha _k\mu )U_k+\frac{2\alpha _k L^2}{\mu n}{X_k}+\frac{\alpha _k^2 \sigma ^2}{n}. \end{aligned}$$

Since \({X_k}\le {\hat{X}}/(m+k)^2\),

$$\begin{aligned} U_{k+1}\le \left( 1-\frac{\theta \mu }{m+k}\right) U_k+\frac{2\theta L^2{\hat{X}}}{\mu n(m+k)^3}+\frac{\theta ^2 \sigma ^2}{n(m+k)^2}. \end{aligned}$$

Hence

$$\begin{aligned} U_k{\le } \prod _{t=0}^{k-1}\left( 1{-}\frac{\theta \mu }{m{+}t}\right) U_0{+}\sum _{t=0}^{k-1}\left( \prod _{j=t+1}^{k-1}\left( 1-\frac{\theta \mu }{m+j}\right) \right) \left( \frac{2\theta L^2{\hat{X}}}{\mu n(m+t)^3}{+}\frac{\theta ^2 \sigma ^2}{n(m+t)^2}\right) . \end{aligned}$$

In light of Lemma 4.1 in [44],

$$\begin{aligned} \prod _{t=0}^{k-1}\left( 1-\frac{\theta \mu }{m+t}\right) \le \frac{m^{\theta \mu }}{(m+k)^{\theta \mu }},\quad \prod _{j=t+1}^{k-1}\left( 1-\frac{\theta \mu }{m+j}\right) \le \frac{(m+t+1)^{\theta \mu }}{(m+k)^{\theta \mu }}. \end{aligned}$$

Then,

$$\begin{aligned} U_k&\le \frac{m^{\theta \mu }}{(m+k)^{\theta \mu }}U_0+\sum _{t=0}^{k-1} \frac{(m+t+1)^{\theta \mu }}{(m+k)^{\theta \mu }}\left( \frac{2\theta L^2{\hat{X}}}{\mu n(m+t)^3}+\frac{\theta ^2 \sigma ^2}{n(m+t)^2}\right) \\&= \frac{m^{\theta \mu }}{(m+k)^{\theta \mu }}U_0+\frac{1}{(m+k)^{\theta \mu }} \left( \frac{2\theta L^2{\hat{X}}}{\mu n}\sum _{t=0}^{k-1} \frac{(m+t+1)^{\theta \mu }}{(m+t)^3}+\frac{\theta ^2 \sigma ^2}{n} \sum _{t=0}^{k-1}\frac{(m+t+1)^{\theta \mu }}{(m+t)^2}\right) \\&\le \frac{m^{\theta \mu }}{(m+k)^{\theta \mu }}U_0+\frac{1}{(m+k)^{\theta \mu }} \left( \frac{4\theta L^2{\hat{X}}}{\mu n}\sum _{t=0}^{k-1}(m+t)^{\theta \mu -3}+\frac{2\theta ^2 \sigma ^2}{n}\sum _{t=0}^{k-1}(m+t)^{\theta \mu -2}\right) . \end{aligned}$$

Note that

$$\begin{aligned}&\sum _{t=0}^{k-1}(m+t)^{\theta \mu -3}\le \int _{t=-1}^{k}(m+t)^{\theta \mu -3}dt\le \max \left\{ \frac{1}{\theta \mu -2}(m+k)^{\theta \mu -2},\frac{1}{2-\theta \mu }(m-1)^{\theta \mu -2}\right\} ,\\&\sum _{t=0}^{k-1}(m+t)^{\theta \mu -2}\le \int _{t=-1}^{k}(m+t)^{\theta \mu -2}dt\le \frac{1}{\theta \mu -1}(m+k)^{\theta \mu -1}. \end{aligned}$$

We conclude that

$$\begin{aligned} U_k&\le \frac{2\theta ^2 \sigma ^2}{n(\theta \mu -1)(m+k)}+\frac{m^{\theta \mu }}{(m+k)^{\theta \mu }}U_0\\&\quad +\frac{4\theta L^2{\hat{X}}}{\mu n}\max \left\{ \frac{1}{(\theta \mu -2)(m+k)^2},\frac{(m-1)^{\theta \mu -2}}{(2-\theta \mu )(m+k)^{\theta \mu }}\right\} \\&\le {\frac{2\theta ^2 \sigma ^2}{n(\theta \mu -1)(m+k)} +\frac{{\mathcal {O}}_k(1)}{(m+k)^{\theta \mu }}+\frac{{\mathcal {O}}_k(1)}{(m+k)^2}}. \end{aligned}$$

4 A gossip-like stochastic gradient tracking method (GSGT)

In this section, we consider a gossip-like stochastic gradient tracking method (GSGT): Initialize with an arbitrary \(x_{i,0}\) and \(y_{i,0}=g_i(x_{i,0},\xi _{i,0})\) for all \(i\in {\mathcal {N}}\). At each round \(k\in {\mathbb {N}}\), agent \(i_k\in {\mathcal {N}}\) wakes up with probability 1/n. Then \(i_k\) either communicates with one of its neighbors \(j_k\) (with probability \(\pi _{i_k j_k}\)) or not (with probability \(\pi _{i_k i_k}=1-\sum _{j\in {\mathcal {N}}_{i_k}}\pi _{i_k j}\)). In the former situation, the update rule for \(i\in \{i_k,j_k\}\) is as follows,

$$\begin{aligned} x_{i,k+1}&= \frac{1}{2}(x_{i_k,k}+x_{j_k,k})-\alpha y_{i,k}, \end{aligned}$$
(37a)
$$\begin{aligned} y_{i,k+1}&= \frac{1}{2}(y_{i_k,k}+y_{j_k,k})+g_i(x_{i,k+1},\xi _{i,k+1})-g_i(x_{i,k},\xi _{i,k}), \end{aligned}$$
(37b)

and for \(i\notin \{i_k,j_k\}\), \(x_{i,k+1}=x_{i,k}\), \(y_{i,k+1}=y_{i,k}\), and \(\xi _{i,k+1}=\xi _{i,k}\). In the latter situation, agent \(i_k\) performs an update based on its own information:

$$\begin{aligned} x_{i_k,k+1}&= x_{i_k,k}-2\alpha y_{i_k,k}, \end{aligned}$$
(38a)
$$\begin{aligned} y_{i_k,k+1}&= y_{i_k,k}+g_{i_k}(x_{i_k,k+1},\xi _{i_k,k+1})-g_{i_k}(x_{i_k,k},\xi _{i_k,k}), \end{aligned}$$
(38b)

while no action is taken by agent \(i\ne i_k\). For ease of analysis, we denote \(j_k=i_k\) in this case, and let \(\mathbb {1}_k\) be the indicator function for the event \(\{j_k\ne i_k\}\), i.e., \(\mathbb {1}_k=1\) iff \(j_k\ne i_k\).

The use of stepsize \(2\alpha \) instead of \(\alpha \) in (38a) can be understood as follows. At each iteration, GSGT performs two gradient updates within the network. This can be achieved either by two different agents respectively updating their solutions, or by one agent using a doubled stepsize. The method is different from a standard gossip algorithm where exactly two agents update at each round. This difference allows us to design the probabilities \(\pi _{ij}\) with more flexibility. In particular, it is possible to construct a doubly stochastic probability matrix \({{\varvec{\Pi }}}=[\pi _{ij}]\) for any graph \({\mathcal {G}}\) under Assumption 3.
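
One GSGT round, following (37)-(38), can be sketched as below. The sketch is illustrative only: gsgt_round, G_i and the gossip matrix Pi are hypothetical names, and the example uses a ring with uniform gossip probabilities, which makes \({{\varvec{\Pi }}}\) doubly stochastic as required by Assumption 5 below.

import numpy as np

rng = np.random.default_rng(3)

def gsgt_round(x, y, last_grad, Pi, alpha, G_i):
    """One round of GSGT, eqs. (37)-(38).  x, y, last_grad are n-by-p arrays whose rows hold
    x_{i,k}, y_{i,k} and the most recent gradient samples g_i(x_{i,k}, xi_{i,k})."""
    n = x.shape[0]
    i = rng.integers(n)                          # agent i_k wakes up (each with probability 1/n)
    j = rng.choice(n, p=Pi[i])                   # neighbor j_k; j == i means a self-update
    if j != i:                                   # gossip update (37) for agents i_k and j_k
        x_mix = 0.5 * (x[i] + x[j])
        y_mix = 0.5 * (y[i] + y[j])
        for a in (i, j):
            x[a] = x_mix - alpha * y[a]
            g_new = G_i(a, x[a])
            y[a] = y_mix + g_new - last_grad[a]
            last_grad[a] = g_new
    else:                                        # self-update (38) with the doubled stepsize
        x[i] = x[i] - 2 * alpha * y[i]
        g_new = G_i(i, x[i])
        y[i] = y[i] + g_new - last_grad[i]
        last_grad[i] = g_new
    return x, y, last_grad

# Example usage on a ring of n = 6 agents with hypothetical quadratic costs.
n, p, alpha = 6, 2, 0.02
Pi = np.zeros((n, n))
for i in range(n):
    Pi[i, i], Pi[i, (i + 1) % n], Pi[i, (i - 1) % n] = 0.5, 0.25, 0.25
c = rng.normal(size=(n, p))
G_i = lambda i, z: (z - c[i]) + 0.05 * rng.normal(size=p)  # noisy gradient of 0.5*||z - c_i||^2
x = rng.normal(size=(n, p))
last_grad = np.stack([G_i(i, x[i]) for i in range(n)])
y = last_grad.copy()
for k in range(20000):
    x, y, last_grad = gsgt_round(x, y, last_grad, Pi, alpha, G_i)
print("distance of the network average to x*:", np.linalg.norm(x.mean(axis=0) - c.mean(axis=0)))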

We can present GSGT in the following compact matrix form, in which we adopt the notation previously used.

$$\begin{aligned} {\mathbf {x}}_{k+1}&= {\mathbf {W}}_k{\mathbf {x}}_k-\alpha {\mathbf {D}}_k{\mathbf {y}}_k, \end{aligned}$$
(39a)
$$\begin{aligned} {\mathbf {y}}_{k+1}&= {\mathbf {W}}_k{\mathbf {y}}_k+\tilde{{\mathbf {D}}}_k (G({\mathbf {x}}_{k+1},\xi _{k+1})-G({\mathbf {x}}_k,\xi _k)), \end{aligned}$$
(39b)

where the random coupling matrix \({\mathbf {W}}_k\) is defined as

$$\begin{aligned} {\mathbf {W}}_k:={\mathbf {I}}-\frac{(e_{i_k}-e_{j_k})(e_{i_k}-e_{j_k})^{\intercal }}{2}, \end{aligned}$$

in which \(e_i=[0\, \cdots \, 0\, 1\, 0\,\cdots ]^{\intercal }\in {\mathbb {R}}^{n\times 1}\) is a unit vector with the ith component equal to 1. By definition, each \({\mathbf {W}}_k\) is symmetric and doubly stochastic. The matrices \({\mathbf {D}}_k\) and \(\tilde{{\mathbf {D}}}_k\) are diagonal with their \(i_k\)th and \(j_k\)th diagonal entries equal to 1 and all other entries equal to 0 if \(j_k\ne i_k\), otherwise the \(i_k\)th entry of \({\mathbf {D}}_k\) (respectively, \(\tilde{{\mathbf {D}}}_k\)) equals 2 (respectively, 1) while all other entries equal 0.

We assume the following condition on the probability matrix \({{\varvec{\Pi }}}\):

Assumption 5

Nonnegative matrix \({{\varvec{\Pi }}}\) is doubly stochastic.

Let

$$\begin{aligned} {\bar{{\mathbf {W}}}}:={\mathbb {E}}[{\mathbf {W}}_k^{\intercal } {\mathbf {W}}_k]. \end{aligned}$$
(40)

It can be shown that (see [4])

$$\begin{aligned} {\bar{{\mathbf {W}}}}=\left( 1-\frac{1}{n}\right) {\mathbf {I}}+\frac{{{\varvec{\Pi }}}+{{\varvec{\Pi }}}^{\intercal }}{2n}, \end{aligned}$$
(41)

which is doubly stochastic.

Lemma 6

Let Assumptions 3 and 5 hold, and let \(\rho _{{\bar{w}}}\) denote the spectral norm of the matrix \({\bar{{\mathbf {W}}}}-\frac{1}{n}{\mathbf {1}}{\mathbf {1}}^{\intercal }\). Then, \(\rho _{{\bar{w}}}\in [1-2/n,1)\).

Proof

Since \({\bar{{\mathbf {W}}}}\) is doubly stochastic, \(\rho _{{\bar{w}}}<1\) follows from Lemma 1. To see \(\rho _{{\bar{w}}}\ge 1-2/n\), note that \(\rho (\frac{{{\varvec{\Pi }}}+{{\varvec{\Pi }}}^{\intercal }}{2n})=\frac{1}{n}\), and 1 is an algebraically simple eigenvalue of \(\frac{{{\varvec{\Pi }}}+{{\varvec{\Pi }}}^{\intercal }}{2}\) and \(\frac{1}{n}{\mathbf {1}}{\mathbf {1}}^{\intercal }\). We have

$$\begin{aligned} \rho _{{\bar{w}}}=1-\frac{1}{n}+\frac{1}{n}\lambda _2\left( \frac{{{\varvec{\Pi }}}+{{\varvec{\Pi }}}^{\intercal }}{2}\right) , \end{aligned}$$
(42)

where \(\lambda _2(\frac{{{\varvec{\Pi }}}+{{\varvec{\Pi }}}^{\intercal }}{2})\) is the second largest eigenvalue of \(\frac{{{\varvec{\Pi }}}+{{\varvec{\Pi }}}^{\intercal }}{2}\).

Since \(\lambda _2(\frac{{{\varvec{\Pi }}}+{{\varvec{\Pi }}}^{\intercal }}{2})\in [-1,1]\), we conclude that \(\rho _{{\bar{w}}}\ge 1-2/n\). \(\square \)
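As a numerical sanity check of (41) and Lemma 6, the sketch below builds \({\bar{{\mathbf {W}}}}\) from a hypothetical doubly stochastic \({{\varvec{\Pi }}}\) on a ring (an assumption made only for this illustration), estimates \({\mathbb {E}}[{\mathbf {W}}_k^{\intercal }{\mathbf {W}}_k]\) by Monte Carlo, and verifies that \(\rho _{{\bar{w}}}\in [1-2/n,1)\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
# Hypothetical doubly stochastic Pi on a ring: stay w.p. 1/2, move to each neighbor w.p. 1/4.
Pi = 0.5 * np.eye(n)
for i in range(n):
    Pi[i, (i - 1) % n] = Pi[i, (i + 1) % n] = 0.25

W_bar = (1 - 1 / n) * np.eye(n) + (Pi + Pi.T) / (2 * n)          # formula (41)

# Monte Carlo estimate of E[W_k^T W_k] in (40).
W_bar_mc = np.zeros((n, n))
trials = 100000
for _ in range(trials):
    i_k = rng.integers(n)
    j_k = rng.choice(n, p=Pi[i_k])
    d = np.zeros(n)
    d[i_k] += 1.0
    d[j_k] -= 1.0
    W_k = np.eye(n) - 0.5 * np.outer(d, d)
    W_bar_mc += W_k.T @ W_k / trials

rho_w = np.linalg.norm(W_bar - np.ones((n, n)) / n, 2)           # spectral norm in Lemma 6
print(np.max(np.abs(W_bar - W_bar_mc)))                          # small (Monte Carlo error only)
print(1 - 2 / n <= rho_w < 1)                                    # True, as claimed in Lemma 6
```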

Before proceeding, it is worth noting that for GSGT, we still have the following relation:

$$\begin{aligned} {\overline{y}}_k=\frac{1}{n}{\mathbf {1}}^{\intercal }G({\mathbf {x}}_k,\varvec{\xi }_k),\forall k. \end{aligned}$$
(43)

4.1 Main results

We present the main convergence results of GSGT in the following theorem.

Theorem 3

Let \(\varGamma >1\) be arbitrarily chosen. Suppose Assumptions 1–5 hold, and assume that the stepsize \(\alpha \) satisfies

$$\begin{aligned} \alpha\le & {} \frac{2n(1-\rho _{{\bar{w}}})}{\sqrt{\varGamma }L}\left\{ \left[ 27(2\eta +3) Q n+16(8\eta +9)\right] Q(1-\rho _{{\bar{w}}})\right. \nonumber \\&\left. +\,48(6\eta +1)(8\eta +3) +96Q(1-\rho _{{\bar{w}}})\right\} ^{-1/2}, \end{aligned}$$
(44)

where \(\eta =\frac{1}{n(1-\rho _{{\bar{w}}})}\) and \(Q=L/\mu \). Then \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\overline{x}}_l-x^*\Vert ^2]\) and \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\mathbf {x}}_{l}-{\mathbf {1}}{\overline{x}}_{l}\Vert ^2]\), respectively, converge to \(\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2]\) and \(\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2]\) at the linear rate \({\mathcal {O}}(\rho ({\mathbf {A}}_g)^k)\), where \(\rho ({\mathbf {A}}_g)<1\) is the spectral radius of the matrix \({\mathbf {A}}_g\) given by

$$\begin{aligned} {\mathbf {A}}_g=\begin{bmatrix} 1-\frac{2\alpha \mu }{n} &{}\quad \frac{2\alpha L^2}{\mu n^2}\left( 1+\frac{2\alpha \mu }{n}\right) &{}\quad \frac{4\alpha ^2}{n^3}\\ 8\alpha ^2 L^2 &{}\quad \frac{1}{2}(1+\rho _{{\bar{w}}}) &{}\quad \frac{2\alpha }{n}\left( \frac{1}{\beta _1}+\alpha \right) \\ 8\alpha ^2L^4+4\alpha L^3 &{}\quad \frac{L^2}{n}\left( 4+\frac{2}{\beta _2}+8\alpha ^2 L^2+4\alpha L\right) &{}\quad \frac{1}{2}(1+\rho _{{\bar{w}}}) \end{bmatrix}, \end{aligned}$$

in which \(\beta _1=\frac{n(1-\rho _{{\bar{w}}})}{4\alpha }-4\alpha L^2\) and \(\beta _2=\frac{n(1-\rho _{{\bar{w}}})}{4}-2\alpha L-2\alpha ^2 L^2\). In addition,

$$\begin{aligned} \limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2]\le \frac{\varGamma }{(\varGamma -1)}\frac{\sigma ^2}{n^2}\left[ \frac{20\alpha }{\mu (1-\rho _{{\bar{w}}})}+\frac{42(6\eta +1)\alpha ^2 L^2}{\mu ^2 (1-\rho _{{\bar{w}}})^2}\right] , \end{aligned}$$
(45)

and

$$\begin{aligned} \limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2] \le \frac{4\varGamma \sigma ^2}{(\varGamma -1)(1-\rho _{{\bar{w}}})^2}\left[ \frac{9(6\eta +1)\alpha ^2}{n}+\frac{72\alpha ^3 L^2}{\mu n^2}\right] . \end{aligned}$$
(46)

Remark 5

Notice that \(\eta =\frac{1}{n(1-\rho _{{\bar{w}}})}\) and \(1-\rho _{{\bar{w}}}\le \frac{2}{n}\) from Lemma 6. We can see from (45) and (46) that

$$\begin{aligned} \limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2]=\frac{\alpha }{(1-\rho _{{\bar{w}}})}{\mathcal {O}}\left( \frac{\sigma ^2}{\mu n^2}\right) +{\frac{\alpha ^2}{(1-\rho _{{\bar{w}}})^3}{\mathcal {O}}\left( \frac{ L^2\sigma ^2}{\mu ^2 n^3}\right) }, \end{aligned}$$
(47)

and

$$\begin{aligned} \limsup _{k\rightarrow \infty }\frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2]=\frac{\alpha ^2}{(1-\rho _{{\bar{w}}})^3}{\mathcal {O}}\left( \frac{\sigma ^2}{n^3}\right) +{\frac{\alpha ^3}{(1-\rho _{{\bar{w}}})^2}{\mathcal {O}}\left( \frac{L^2\sigma ^2}{\mu n^3}\right) }. \end{aligned}$$
(48)

Since

$$\begin{aligned} {\frac{ L^2\sigma ^2}{\mu ^2 n^3}\ge \frac{\sigma ^2}{n^3},} \end{aligned}$$

and by (44),

$$\begin{aligned} {\frac{ L^2\sigma ^2}{\mu ^2 n^3}\ge \alpha (1-\rho _{{\bar{w}}})\frac{L^2\sigma ^2}{\mu n^3},} \end{aligned}$$

the second term on the right-hand side of (47) dominates the two terms on the right-hand side of (48). Thus, we have

$$\begin{aligned} \limsup _{k\rightarrow \infty }\frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}x^*\Vert ^2]&=\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2]+\limsup _{k\rightarrow \infty } \frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2]\nonumber \\&=\frac{\alpha }{(1-\rho _{{\bar{w}}})}{\mathcal {O}}\left( \frac{\sigma ^2}{\mu n^2}\right) +{\frac{\alpha ^2}{(1-\rho _{{\bar{w}}})^3}{\mathcal {O}} \left( \frac{ L^2\sigma ^2}{\mu ^2 n^3}\right) }. \end{aligned}$$
(49)

\(\square \)

The corollary below provides an upper bound for \(\rho ({\mathbf {A}}_g)\).

Corollary 3

Under the conditions in Theorem 3, where \(\varGamma >3/2\), we have

$$\begin{aligned} \rho ({\mathbf {A}}_g)\le 1-\frac{(2\varGamma -3)}{\varGamma }\frac{\alpha \mu }{n}. \end{aligned}$$
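This bound can be checked numerically by forming \({\mathbf {A}}_g\) as in Theorem 3 and computing its spectral radius directly; the parameter values below (n, \(\mu \), L, \(\varGamma \) and \(\rho _{{\bar{w}}}\)) are illustrative assumptions, with \(\alpha \) set to the right-hand side of (44):

```python
import numpy as np

# Illustrative parameters (assumptions, not the values used in the paper's experiments).
n, mu, L, Gamma = 20, 1.0, 4.0, 2.0
rho_w = 1 - 1.5 / n                         # some value in [1 - 2/n, 1), cf. Lemma 6
eta, Q = 1 / (n * (1 - rho_w)), L / mu

# Stepsize set to the right-hand side of (44).
alpha = (2 * n * (1 - rho_w) / (np.sqrt(Gamma) * L)) * (
    (27 * (2 * eta + 3) * Q * n + 16 * (8 * eta + 9)) * Q * (1 - rho_w)
    + 48 * (6 * eta + 1) * (8 * eta + 3) + 96 * Q * (1 - rho_w)) ** (-0.5)

beta1 = n * (1 - rho_w) / (4 * alpha) - 4 * alpha * L**2
beta2 = n * (1 - rho_w) / 4 - 2 * alpha * L - 2 * alpha**2 * L**2

A_g = np.array([
    [1 - 2 * alpha * mu / n,
     2 * alpha * L**2 / (mu * n**2) * (1 + 2 * alpha * mu / n),
     4 * alpha**2 / n**3],
    [8 * alpha**2 * L**2, (1 + rho_w) / 2, 2 * alpha / n * (1 / beta1 + alpha)],
    [8 * alpha**2 * L**4 + 4 * alpha * L**3,
     L**2 / n * (4 + 2 / beta2 + 8 * alpha**2 * L**2 + 4 * alpha * L),
     (1 + rho_w) / 2],
])

rho_Ag = np.max(np.abs(np.linalg.eigvals(A_g)))
bound = 1 - (2 * Gamma - 3) / Gamma * alpha * mu / n
print(rho_Ag <= bound)      # expected: True when alpha satisfies (44) and Gamma > 3/2
```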

Remark 6

Under the same stepsize \(\alpha \), the convergence speed of GSGT is slower than that of DSGT (see Corollary 1). This is due to the fact that in GSGT, at most two agents update their iterates at each iteration. \(\square \)

4.2 Performance comparison between DSGT and GSGT

In this section, we compare the performance of the two proposed algorithms in terms of the computation and communication effort required to achieve an \(\epsilon \)-solution (with constant stepsizes); that is, we compute the number of stochastic gradient evaluations and communications needed to obtain \(\frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}x^*\Vert ^2]\le \epsilon \). Without loss of generality, for each method we first choose the stepsize \(\alpha \) such that \(\frac{1}{n}\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}x^*\Vert ^2]\le \epsilon /2\) and then compute the number of iterations K such that \(\frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_K-{\mathbf {1}}x^*\Vert ^2]\le \epsilon \).

For DSGT, when \(\epsilon \) is small enough, we have \(\alpha ={\mathcal {O}}(\frac{n\mu \epsilon }{\sigma ^2})\) from (11). Then, noting that \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\mathbf {x}}_l-{\mathbf {1}}x^*\Vert ^2]\) converges linearly at the rate \({\mathcal {O}}(\rho ({\mathbf {A}})^k)\) where \(\rho ({\mathbf {A}})=1-{\mathcal {O}}(\alpha \mu )\), we obtain the number of required iterations:

$$\begin{aligned} K_d={\mathcal {O}}\left( \frac{\ln (\frac{1}{\epsilon })}{\alpha \mu }\right) ={\mathcal {O}}\left( \frac{\sigma ^2}{n\mu ^2}\frac{\ln (\frac{1}{\epsilon })}{\epsilon } \right) . \end{aligned}$$

In \(K_d\) iterations, the number of stochastic gradient computations is \(N_d=nK_d={\mathcal {O}}\left( \frac{\sigma ^2}{\mu ^2} \frac{\ln (\frac{1}{\epsilon })}{\epsilon }\right) \) and the number of communications is \(N_d^c=2|{\mathcal {E}}|K_d={\mathcal {O}}\left( \frac{|{\mathcal {E}}|}{n} \frac{\sigma ^2}{\mu ^2}\frac{\ln (\frac{1}{\epsilon })}{\epsilon }\right) \) where \(|{\mathcal {E}}|\) stands for the number of edges in the graph.

For GSGT we need \(\alpha ={\mathcal {O}}(\frac{n^2\mu \epsilon (1-\rho _{{\bar{w}}})}{\sigma ^2})\) from (49). Given that \(\rho ({\mathbf {A}}_g)=1-{\mathcal {O}}(\frac{\alpha \mu }{n})\), the number of required iterations \(K_g\) can be calculated as follows:

$$\begin{aligned} K_g={\mathcal {O}}\left( \frac{n\ln (\frac{1}{\epsilon })}{\alpha \mu }\right) ={\mathcal {O}}\left( \frac{\sigma ^2}{n(1-\rho _{{\bar{w}}})\mu ^2} \frac{\ln (\frac{1}{\epsilon })}{\epsilon }\right) . \end{aligned}$$

In \(K_g\) iterations, the numbers of gradient computations and communications are both bounded by \(N_g=N_g^c=2K_g={\mathcal {O}}(\frac{1}{n(1-\rho _{{\bar{w}}})}\frac{\sigma ^2}{\mu ^2}\frac{\ln (\frac{1}{\epsilon })}{\epsilon })\).

Suppose the Metropolis rule is applied to define the weights \(\pi _{ij}\) [50]. We first compare the numbers of stochastic gradient computations for DSGT and GSGT. Since \(1-\rho _{{\bar{w}}}\le \frac{2}{n}\) by Lemma 6, \(N_g\) is at best of the same order as \(N_d={\mathcal {O}}(\frac{\sigma ^2}{\mu ^2}\frac{\ln (\frac{1}{\epsilon })}{\epsilon })\), and this best case occurs when \(1-\rho _{{\bar{w}}}=\varTheta (\frac{1}{n})\). Given that \(1-\rho _{{\bar{w}}}=(1-\lambda _2({{\varvec{\Pi }}}))/n\) from (42), we have \(1-\rho _{{\bar{w}}}=\varTheta (\frac{1}{n})\) for complete networks, almost all regular graphs [16], among others.

We then compare the number of required communications \(N_d^c={\mathcal {O}}(\frac{|{\mathcal {E}}|}{n}\frac{\sigma ^2}{\mu ^2}\frac{\ln (\frac{1}{\epsilon })}{\epsilon })\) and \(N_g^c={\mathcal {O}}(\frac{1}{n(1-\rho _{{\bar{w}}})}\frac{\sigma ^2}{\mu ^2}\frac{\ln (\frac{1}{\epsilon })}{\epsilon })\). When \(1-\rho _{{\bar{w}}}={\mathcal {O}}(\frac{1}{n})\), we have \(N_g^c={\mathcal {O}}(\frac{\sigma ^2}{\mu ^2}\frac{\ln (\frac{1}{\epsilon })}{\epsilon })\). By contrast, \(N_d^c\) is \({\mathcal {O}}(\frac{|{\mathcal {E}}|}{n})\) times larger than \(N_g^c\). In particular when \(|{\mathcal {E}}|={\mathcal {O}}(n^2)\) (e.g., complete network), the number of communications for GSGT is \({\mathcal {O}}(n)\) times smaller than that of DSGT.
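The leading-order comparison above can be reproduced with a few lines of code; the sketch below evaluates the \({\mathcal {O}}(\cdot )\) expressions of this subsection (constants dropped) for illustrative parameter values and a complete network, where \(1-\rho _{{\bar{w}}}=\varTheta (1/n)\):

```python
import numpy as np

def complexity_estimates(n, num_edges, mu, sigma2, eps, gap):
    """Leading-order counts from Sect. 4.2; gap = 1 - rho_w, constants in the O(.) terms dropped."""
    log_term = np.log(1 / eps) / eps
    K_d = sigma2 / (n * mu**2) * log_term            # DSGT iterations
    N_d, N_d_comm = n * K_d, 2 * num_edges * K_d     # DSGT gradient evaluations and communications
    K_g = sigma2 / (n * gap * mu**2) * log_term      # GSGT iterations
    N_g = N_g_comm = 2 * K_g                         # GSGT gradient evaluations = communications
    return N_d, N_d_comm, N_g, N_g_comm

# Complete network: |E| = n(n-1)/2 and 1 - rho_w = Theta(1/n).
n = 100
N_d, N_d_comm, N_g, N_g_comm = complexity_estimates(
    n=n, num_edges=n * (n - 1) // 2, mu=1.0, sigma2=1.0, eps=1e-3, gap=1.0 / n)
print(N_g / N_d)            # O(1): comparable numbers of gradient computations
print(N_d_comm / N_g_comm)  # O(|E|/n) = O(n): GSGT needs far fewer communications
```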

4.3 Proof of Theorem 3

We first derive a linear system of inequalities relating \({\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2]\), \({\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2]\), and \({\mathbb {E}}[\Vert {\mathbf {y}}_k-{\mathbf {1}}{\overline{y}}_k\Vert ^2]\) to their values at the preceding iteration.

Lemma 7

Suppose Assumptions 1–5 hold and the stepsize satisfies \(\alpha <n/(\mu +L)\). Then, we have the following inequalities:

$$\begin{aligned} {\mathbb {E}}[\Vert {\overline{x}}_{k+1}{-}x^*\Vert ^2]&\le \left( 1-\frac{2\alpha \mu }{n}\right) {\mathbb {E}}[\Vert {\overline{x}}_k{-}x^*\Vert ^2]{+}\frac{2\alpha L^2}{\mu n^2}\left( 1+\frac{2\alpha \mu }{n}\right) {\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2]\nonumber \\&\quad +\frac{4\alpha ^2}{n^3}{\mathbb {E}}[\Vert {\mathbf {y}}_k-{\mathbf {1}}{\overline{y}}_k\Vert ^2] +\frac{4\alpha ^2\sigma ^2}{n^3}. \end{aligned}$$
(50)

For any \(\beta _1,\beta _2>0\),

$$\begin{aligned}&{\mathbb {E}}[\Vert {\mathbf {x}}_{k+1}-{\mathbf {1}}{\overline{x}}_{k+1}\Vert ^2] \le \left( \rho _{{\bar{w}}}+\frac{2\alpha }{n}\beta _1+\frac{8\alpha ^2 L^2}{n}\right) {\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2]\nonumber \\&\quad +\frac{2\alpha }{n}\left( \frac{1}{\beta _1}+\alpha \right) {\mathbb {E}}[\Vert {\mathbf {y}}_k-{\mathbf {1}}{\overline{y}}_k\Vert ^2]\nonumber \\&\quad +8\alpha ^2 L^2{\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2]+\frac{4\alpha ^2 \sigma ^2}{n}, \end{aligned}$$
(51)
$$\begin{aligned}&{\mathbb {E}}[\Vert {\mathbf {y}}_{k+1}-{\mathbf {1}}{\overline{y}}_{k+1}\Vert ^2]\le \left( \rho _{{\bar{w}}} +\frac{2}{n}\beta _2+\frac{4\alpha L}{n}+\frac{4\alpha ^2L^2}{n}\right) {\mathbb {E}}[\Vert {\mathbf {y}}_k-{\mathbf {1}}{\overline{y}}_k\Vert ^2]\nonumber \\&\quad +\frac{L^2}{n}\left( 4+\frac{2}{\beta _2}+8\alpha ^2 L^2+4\alpha L\right) {\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2]\nonumber \\&\quad +(8\alpha ^2L^4+4\alpha L^3){\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2] +\frac{(4\alpha ^2L^2+2\alpha L)\sigma ^2}{n}\nonumber \\&\quad +4(\alpha L+1)\sigma ^2. \end{aligned}$$
(52)

Proof

See Appendix 7.3. \(\square \)

In light of Lemma 7, we have the following linear system of inequalities:

$$\begin{aligned} \begin{bmatrix} {\mathbb {E}}[\Vert {\overline{x}}_{k+1}-x^*\Vert ^2]\\ {\mathbb {E}}[\Vert {\mathbf {x}}_{k+1}-{\mathbf {1}}{\overline{x}}_{k+1}\Vert ^2]\\ {\mathbb {E}}[\Vert {\mathbf {y}}_{k+1}-{\mathbf {1}}{\overline{y}}_{k+1}\Vert ^2] \end{bmatrix} \le {\mathbf {A}}_g \begin{bmatrix} {\mathbb {E}}[\Vert {\overline{x}}_{k}-x^*\Vert ^2]\\ {\mathbb {E}}[\Vert {\mathbf {x}}_{k}-{\mathbf {1}}{\overline{x}}_{k}\Vert ^2]\\ {\mathbb {E}}[\Vert {\mathbf {y}}_{k}-{\mathbf {1}}{\overline{y}}_{k}\Vert ^2] \end{bmatrix} +\begin{bmatrix} \frac{4\alpha ^2\sigma ^2}{n^3}\\ \frac{4\alpha ^2 \sigma ^2}{n}\\ M_g \end{bmatrix}, \end{aligned}$$

where

$$\begin{aligned} {\mathbf {A}}_g=[b_{ij}]=\begin{bmatrix} 1-\frac{2\alpha \mu }{n} &{} \frac{2\alpha L^2}{\mu n^2}\left( 1+\frac{2\alpha \mu }{n}\right) &{} \frac{4\alpha ^2}{n^3}\\ 8\alpha ^2 L^2 &{} \rho _{{\bar{w}}}+\frac{2\alpha }{n}\beta _1+\frac{8\alpha ^2 L^2}{n} &{} \frac{2\alpha }{n}\left( \frac{1}{\beta _1}+\alpha \right) \\ 8\alpha ^2L^4+4\alpha L^3 &{}\quad \frac{L^2}{n}\left( 4+\frac{2}{\beta _2}+8\alpha ^2 L^2+4\alpha L\right) &{}\quad \rho _{{\bar{w}}}+\frac{2}{n}\beta _2+\frac{4\alpha L}{n}+\frac{4\alpha ^2L^2}{n} \end{bmatrix}, \end{aligned}$$

and \(M_g=\frac{(4\alpha ^2L^2+2\alpha L)\sigma ^2}{n}+4(\alpha L+1)\sigma ^2\). Suppose \(\alpha \), \(\beta _1,\beta _2>0\) satisfy

$$\begin{aligned}&b_{22}=\rho _{{\bar{w}}}+\frac{2\alpha }{n}\beta _1+\frac{8\alpha ^2 L^2}{n} = \frac{1+\rho _{{\bar{w}}}}{2}, \end{aligned}$$
(53)
$$\begin{aligned}&b_{33}=\rho _{{\bar{w}}}+\frac{2}{n}\beta _2+\frac{4\alpha L}{n}+\frac{4\alpha ^2L^2}{n} = \frac{1+\rho _{{\bar{w}}}}{2}, \end{aligned}$$
(54)

and

$$\begin{aligned} \text {det}({\mathbf {I}}-{\mathbf {A}}_g)&=(1-b_{11})(1-b_{22})(1-b_{33})-b_{12}b_{23}b_{31} -b_{13}b_{21}b_{32}\nonumber \\&\quad -(1-b_{11})b_{23}b_{32}-(1-b_{22})b_{13}b_{31}\nonumber \\&\quad -(1-b_{33})b_{12}b_{21}\ge (1-1/\varGamma )(1-b_{11})(1-b_{22})(1-b_{33})>0. \end{aligned}$$
(55)

Then, by Lemma 5, the spectral radius of \({\mathbf {A}}_g\) is smaller than 1, and we have

$$\begin{aligned} \begin{bmatrix} {\mathbb {E}}[\Vert {\overline{x}}_{k}-x^*\Vert ^2]\\ {\mathbb {E}}[\Vert {\mathbf {x}}_{k}-{\mathbf {1}}{\overline{x}}_{k}\Vert ^2]\\ {\mathbb {E}}[\Vert {\mathbf {y}}_{k}-{\mathbf {1}}{\overline{y}}_{k}\Vert ^2] \end{bmatrix} \le {\mathbf {A}}_g^k\begin{bmatrix} {\mathbb {E}}[\Vert {\overline{x}}_{0}-x^*\Vert ^2]\\ {\mathbb {E}}[\Vert {\mathbf {x}}_{0}-{\mathbf {1}}{\overline{x}}_{0}\Vert ^2]\\ {\mathbb {E}}[\Vert {\mathbf {y}}_{0}-{\mathbf {1}}{\overline{y}}_{0}\Vert ^2] \end{bmatrix}+\sum _{l=0}^{k-1}{\mathbf {A}}_g^l \begin{bmatrix} \frac{4\alpha ^2\sigma ^2}{n^3}\\ \frac{4\alpha ^2 \sigma ^2}{n}\\ M_g \end{bmatrix}. \end{aligned}$$
(56)

Hence, \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\overline{x}}_l-x^*\Vert ^2]\), \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\mathbf {x}}_l-{\mathbf {1}}{\overline{x}}_l\Vert ^2]\) and \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\mathbf {y}}_l-{\mathbf {1}}{\overline{y}}_l\Vert ^2]\) all converge to a neighborhood of 0 at the linear rate \({\mathcal {O}}(\rho ({\mathbf {A}}_g)^k)\). Moreover,

$$\begin{aligned}&\begin{bmatrix} \limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2]\\ \limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_{k}\Vert ^2]\\ \limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\mathbf {y}}_k-{\mathbf {1}}{\overline{y}}_{k}\Vert ^2] \end{bmatrix} \le ({\mathbf {I}}-{\mathbf {A}}_g)^{-1} \begin{bmatrix} \frac{4\alpha ^2\sigma ^2}{n^3}\\ \frac{2\alpha ^2 \sigma ^2}{n}\\ M_g \end{bmatrix} =\frac{1}{\text {det}({\mathbf {I}}-{\mathbf {A}}_g)} \nonumber \\&\cdot \begin{bmatrix} (1-b_{22})(1-b_{33})-b_{23}b_{32} &{} b_{13}b_{32}+b_{12}(1-b_{33}) &{} b_{12}b_{23}+b_{13}(1-b_{22})\\ b_{23}b_{31}+b_{21}(1-b_{33}) &{} \quad (1-b_{11})(1-b_{33})-b_{13}b_{31} &{} b_{13}b_{21}+b_{23}(1-b_{11})\\ b_{21}b_{32}+b_{31}(1-b_{22}) &{} b_{12}b_{31}+b_{32}(1-b_{11}) &{}\quad (1-b_{11})(1-b_{22})-b_{12}b_{21} \end{bmatrix}\nonumber \\&\begin{bmatrix} \frac{4\alpha ^2\sigma ^2}{n^3}\\ \frac{2\alpha ^2 \sigma ^2}{n}\\ M_g \end{bmatrix}. \end{aligned}$$
(57)

We now show that (53), (54), and (55) are satisfied under condition (44). First, relation (44) implies that

$$\begin{aligned}&4\alpha ^2 L^2\le \frac{n(1-\rho _{{\bar{w}}})}{12}, \end{aligned}$$
(58)
$$\begin{aligned}&2\alpha L+2\alpha ^2 L^2\le \frac{n(1-\rho _{{\bar{w}}})}{12}. \end{aligned}$$
(59)

Therefore, from (53) and (54) we have

$$\begin{aligned}&\beta _1=\frac{n(1-\rho _{{\bar{w}}})}{4\alpha }-4\alpha L^2{\ge } \frac{n(1-\rho _{{\bar{w}}})}{6\alpha }>0, \end{aligned}$$
(60)
$$\begin{aligned}&\beta _2=\frac{n(1-\rho _{{\bar{w}}})}{4}-2\alpha L-2\alpha ^2 L^2\ge \frac{n(1-\rho _{{\bar{w}}})}{6}>0. \end{aligned}$$
(61)

By (58)–(61) and the fact that \(\rho _{{\bar{w}}}\ge 1-2/n\) obtained from (41), we have

$$\begin{aligned} b_{12}&\le \frac{2\alpha L^2}{\mu n^2}\left( 1+\frac{1}{8}\right) \le \frac{9\alpha L^2}{4\mu n^2}, \end{aligned}$$
(62a)
$$\begin{aligned} b_{23}&\le \frac{2\alpha ^2}{n}\left( 6\eta +1\right) ,\end{aligned}$$
(62b)
$$\begin{aligned} b_{31}&= \alpha L^3(8\alpha L+4)\le 6\alpha L^3,\end{aligned}$$
(62c)
$$\begin{aligned} b_{32}&\le \frac{L^2}{n}\left( 12\eta +\frac{9}{2}\right) . \end{aligned}$$
(62d)

Then, for relation (55) to hold, it is sufficient that

$$\begin{aligned} \frac{1}{\varGamma }\frac{\alpha \mu }{2n}(1{-}\rho _{{\bar{w}}})^2&\ge \frac{27(6\eta {+}1)\alpha ^4L^5}{\mu n^3}+\frac{48(8\eta +3)\alpha ^4 L^4}{n^4}+\frac{6(6\eta +1)(8\eta +3)\alpha ^3\mu L^2}{n^3}\\&\quad +\frac{12\alpha ^3 L^3}{n^3}(1-\rho _{{\bar{w}}}) +\frac{9\alpha ^3L^4}{\mu n^2}(1-\rho _{{\bar{w}}}). \end{aligned}$$

In light of (59), \(\alpha L\le n(1-\rho _{{\bar{w}}})/24\). We only need

$$\begin{aligned} \frac{1}{\varGamma }\frac{\mu n^2}{2}(1-\rho _{{\bar{w}}})^2&\ge \frac{9(6\eta +1)\alpha ^2L^4 n}{8\mu }(1-\rho _{{\bar{w}}})+2(8\eta +3)\alpha ^2 L^3(1-\rho _{{\bar{w}}})\\&\quad +6(6\eta +1)(8\eta +3)\alpha ^2\mu L^2 +12\alpha ^2 L^3(1-\rho _{{\bar{w}}})\\&\quad +\frac{9\alpha ^2L^4 n}{\mu }(1-\rho _{{\bar{w}}}), \end{aligned}$$

which gives

$$\begin{aligned}&\alpha \le \frac{2n(1-\rho _{{\bar{w}}})}{\sqrt{\varGamma }L}\left\{ \left[ 27(2\eta +3) Q n+16(8\eta +9)\right] Q(1-\rho _{{\bar{w}}})\right. \\&\qquad \left. +\,48(6\eta +1)(8\eta +3) +96Q(1-\rho _{{\bar{w}}})\right\} ^{-1/2}. \end{aligned}$$

We now derive the bounds for \(\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\overline{x}}_{k+1}-x^*\Vert ^2]\) and \(\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\mathbf {x}}_{k+1}-{\mathbf {1}}{\overline{x}}_{k+1}\Vert ^2]\). By (57) and (62),

$$\begin{aligned}&\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\overline{x}}_{k+1}-x^*\Vert ^2] \le \frac{\varGamma }{(\varGamma -1)(1-b_{11})(1-b_{22})(1-b_{33})}\\&\qquad \cdot \left\{ [(1-b_{22})(1-b_{33})-b_{23}b_{32}]\frac{4\alpha ^2\sigma ^2}{n^3} +[b_{13}b_{32}+b_{12}(1-b_{33})]\frac{2\alpha ^2 \sigma ^2}{n}\right. \\&\qquad \left. +\,[b_{12}b_{23}+b_{13}(1-b_{22})]M_g\right\} \\&\quad \le \frac{2\varGamma n}{(\varGamma -1)\alpha \mu (1-\rho _{{\bar{w}}})^2} \Bigg \{\frac{\alpha ^2\sigma ^2(1-\rho _{{\bar{w}}})^2}{n^3} +\left[ \frac{6\alpha ^2L^2}{n^4}(8\eta +3)+\frac{9\alpha L^2}{8\mu n^2}(1-\rho _{{\bar{w}}})\right] \frac{2\alpha ^2 \sigma ^2}{n}\\&\qquad +\left[ \frac{9(6\eta +1)\alpha ^3 L^2}{2\mu n^3}+\frac{2\alpha ^2}{n^3} (1-\rho _{{\bar{w}}})\right] \frac{9}{2}\sigma ^2\Bigg \}\\&\quad \le \frac{2\varGamma n\sigma ^2}{(\varGamma -1)\alpha \mu (1-\rho _{{\bar{w}}})^2} \left[ \frac{10\alpha ^2(1-\rho _{{\bar{w}}})}{n^3}+\frac{21(6\eta +1)\alpha ^3 L^2}{\mu n^3}\right] \\&\quad =\frac{\varGamma }{(\varGamma -1)}\frac{\sigma ^2}{n^2}\left[ \frac{20\alpha }{\mu (1-\rho _{{\bar{w}}})}+\frac{42(6\eta +1)\alpha ^2 L^2}{\mu ^2 (1-\rho _{{\bar{w}}})^2}\right] .\\&\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2]\le \frac{\varGamma }{(\varGamma -1)(1-b_{11})(1-b_{22})(1-b_{33})}\\&\qquad \cdot \left\{ [b_{23}b_{31}+b_{21}(1-b_{33})]\frac{4\alpha ^2\sigma ^2}{n^3}+[(1-b_{11})(1-b_{33})-b_{13}b_{31}]\frac{2\alpha ^2 \sigma ^2}{n}\right. \\&\qquad \left. +\,[b_{13}b_{21}+b_{23}(1-b_{11})]M_g\right\} \\&\quad \le \frac{2\varGamma n}{(\varGamma -1)\alpha \mu (1-\rho _{{\bar{w}}})^2}\Bigg \{\left[ \frac{12(6\eta +1)\alpha ^3 L^3}{n}+4\alpha ^2 L^2(1-\rho _{{\bar{w}}})\right] \frac{4\alpha ^2\sigma ^2}{n^3}\\&\qquad +\left[ \frac{\alpha \mu (1-\rho _{{\bar{w}}})}{n}-\frac{24\alpha ^3L^3}{n^3}\right] \frac{2\alpha ^2 \sigma ^2}{n}+\left[ \frac{32\alpha ^4 L^2}{n^3}+\frac{4(6\eta +1)\alpha ^3\mu }{n^2}\right] \frac{17}{4}\sigma ^2\Bigg \}\\&\quad \le \frac{2\varGamma n\sigma ^2}{(\varGamma -1)\alpha \mu (1-\rho _{{\bar{w}}})^2}\left[ \frac{18(6\eta +1) \alpha ^3\mu }{n^2}+\frac{136\alpha ^4 L^2}{n^3}\right] \\&\quad =\frac{4\varGamma \sigma ^2}{(\varGamma -1)(1-\rho _{{\bar{w}}})^2}\left[ \frac{9(6\eta +1)\alpha ^2}{n} +\frac{72\alpha ^3 L^2}{\mu n^2}\right] . \end{aligned}$$

4.4 Proof of Corollary 3

The characteristic polynomial of \({\mathbf {A}}_g\) is

$$\begin{aligned} \text {det}(\lambda {\mathbf {I}}-{\mathbf {A}}_g)&=(\lambda {-}b_{11})(\lambda {-}b_{22})(\lambda {-}b_{33}) -b_{12}b_{23}b_{31}-b_{13}b_{21}b_{32}-(\lambda -b_{11})b_{23}b_{32}\nonumber \\&\quad -(\lambda -b_{22})b_{13}b_{31} -(\lambda -b_{33})b_{12}b_{21}. \end{aligned}$$
(63)

By (55),

$$\begin{aligned} \text {det}(\lambda {\mathbf {I}}-{\mathbf {A}}_g)&\ge (\lambda -b_{11})(\lambda -b_{22})(\lambda -b_{33}) +(1-\lambda )b_{23}b_{32}\nonumber \\&\quad +(1-\lambda )b_{13}b_{31}+(1-\lambda )b_{12}b_{21}\nonumber \\&\quad -\frac{1}{\varGamma }(1-b_{11})(1-b_{22})(1-b_{33})\nonumber \\&\ge (\lambda -b_{11}) (\lambda -b_{22}) (\lambda -b_{33})\nonumber \\&\quad -\frac{1}{\varGamma }(1-b_{11})(1-b_{22})(1-b_{33}). \end{aligned}$$
(64)

Let \(\lambda =1-\epsilon \) for some \(\epsilon \in (0,2\alpha \mu /n)\) that satisfies

$$\begin{aligned} \text {det}(\lambda {\mathbf {I}}-{\mathbf {A}}_g)\ge \left( \frac{2\alpha \mu }{n}-\epsilon \right) \left[ \frac{1-\rho _{{\bar{w}}}}{2}-\epsilon \right] ^2- \frac{1}{\varGamma }\frac{2\alpha \mu }{n}\frac{(1-\rho _{{\bar{w}}})^2}{4}\ge 0. \end{aligned}$$

Under condition (44), it suffices that

$$\begin{aligned} \epsilon \le \frac{(2\varGamma -3)}{\varGamma }\frac{\alpha \mu }{n}. \end{aligned}$$

Denote \({\tilde{\lambda }}=1-\frac{(2\varGamma -3)}{\varGamma }\frac{\alpha \mu }{n}\). We have \(\text {det}({\tilde{\lambda }}{\mathbf {I}}-{\mathbf {A}}_g)\ge 0\), and therefore \(\rho ({\mathbf {A}}_g)\le {\tilde{\lambda }}\).

5 Numerical example

In this section, we provide a numerical example to illustrate our theoretical findings. Consider the on-line Ridge regression problem, i.e.,

$$\begin{aligned} \min _{x\in {\mathbb {R}}^{p}}f(x)=\frac{1}{n}\sum _{i=1}^nf_i(x)\left( ={\mathbb {E}}_{u_i,v_i} \left[ \left( u_i^{\intercal } x-v_i\right) ^2+\rho \Vert x\Vert ^2\right] \right) , \end{aligned}$$
(65)

where \(\rho >0\) is a penalty parameter. For each agent i, samples in the form of \((u_i,v_i)\) are gathered continuously with \(u_i\in {\mathbb {R}}^p\) representing the features and \(v_i\in {\mathbb {R}}\) being the observed outputs. We assume that each \(u_i\in [0.3,0.4]^p\) is uniformly distributed, and \(v_i\) is drawn according to \(v_i=u_i^{\intercal } {\tilde{x}}_i+\varepsilon _i\), where \({\tilde{x}}_i\) are predefined parameters evenly located in \([0,10]^p\), and \(\varepsilon _i\) are independent Gaussian noises with mean 0 and variance 1. Given a pair \((u_i,v_i)\), agent i can compute an estimated gradient of \(f_i(x)\): \(g_i(x,u_i,v_i)=2(u_i^{\intercal }x -v_i)u_i+2\rho x\), which is unbiased. Problem (65) has a unique solution \(x^*\) given by \(x^*=(\sum _{i=1}^n{\mathbb {E}}_{u_i}[u_iu_i^{\intercal }]+n\rho {\mathbf {I}})^{-1}\sum _{i=1}^n{\mathbb {E}}_{u_i}[u_iu_i^{\intercal }]{\tilde{x}}_i\).
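A minimal sketch of this data model and the stochastic gradient oracle is given below; the specific placement of the \({\tilde{x}}_i\) and the random seed are illustrative assumptions, while the closed form of \({\mathbb {E}}_{u_i}[u_iu_i^{\intercal }]\) follows from the uniform distribution on \([0.3,0.4]^p\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, pen = 10, 20, 0.1                                   # agents, dimension, penalty rho

# tilde{x}_i "evenly located in [0, 10]^p" (interpreted here as n points on the diagonal).
x_tilde = np.linspace(0, 10, n)[:, None] * np.ones(p)

def sample(i):
    """Draw one data pair (u_i, v_i) from agent i's model."""
    u = rng.uniform(0.3, 0.4, size=p)
    v = u @ x_tilde[i] + rng.normal()
    return u, v

def stoch_grad(i, x):
    """Unbiased stochastic gradient g_i(x, u_i, v_i) of f_i."""
    u, v = sample(i)
    return 2 * (u @ x - v) * u + 2 * pen * x

# Closed-form x*: E[u_i u_i^T] = 0.35^2 * 1 1^T + (0.1^2 / 12) * I for u_i ~ U[0.3, 0.4]^p.
Euu = 0.35**2 * np.ones((p, p)) + (0.1**2 / 12) * np.eye(p)
x_star = np.linalg.solve(n * Euu + n * pen * np.eye(p), Euu @ x_tilde.sum(axis=0))
```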

In addition to DSGT, GSGT and CSG, we consider the following distributed stochastic gradient (DSG) algorithm, which is similar to the ones studied in [18, 26]:

$$\begin{aligned} {\mathbf {x}}_{k+1} = {\mathbf {W}}{\mathbf {x}}_k-\alpha G({\mathbf {x}}_k,\varvec{\xi }_k). \end{aligned}$$
(66)

Noticing that some existing algorithms for deterministic distributed optimization can also be adapted to the stochastic gradient setting, e.g., EXTRA [52] and decentralized ADMM (DLM) [27], we also include them in our experiments for comparison.

In the experiments, we consider 3 instances with \(p=20\) and \(n\in \{10,25,100\}\), respectively. In each instance, we let \({\mathbf {x}}_0={\mathbf {0}}\) and set the penalty parameter \(\rho =0.1\). For the distributed methods, we assume that the n agents constitute a random network, in which any two agents are linked with probability 0.4. The Metropolis rule is applied to define the weights \(w_{ij}\) (and \(\pi _{ij}\)) where applicable [50]:

$$\begin{aligned} w_{ij}={\left\{ \begin{array}{ll} 1/\max \{\mathrm {deg}(i),\mathrm {deg}(j)\} &{} \text {if }\, j\in {\mathcal {N}}_i, \\ 1- \sum _{j\in {\mathcal {N}}_i}w_{ij} &{} \text {if }\, i=j,\\ 0 &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
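The sketch below generates such a random network and the corresponding Metropolis weight matrix as defined above; the degree convention (number of neighbors of a node) and the connectivity of the sampled graph (Assumption 3) are assumptions not checked here:

```python
import numpy as np

def metropolis_weights(n, link_prob=0.4, seed=0):
    """Random network (any two agents linked w.p. link_prob) with the Metropolis weights above."""
    rng = np.random.default_rng(seed)
    upper = np.triu(rng.random((n, n)) < link_prob, k=1)   # sample undirected edges
    adj = upper | upper.T
    deg = adj.sum(axis=1)                                  # deg(i) = number of neighbors of i
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / max(deg[i], deg[j])
        W[i, i] = 1.0 - W[i].sum()                         # nonnegative, rows sum to one
    return W

W = metropolis_weights(10)
# Doubly stochastic by symmetry of the weights; connectivity is assumed, not verified.
assert np.allclose(W.sum(axis=0), 1.0) and np.allclose(W.sum(axis=1), 1.0)
```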

For EXTRA, we choose \({\tilde{{\mathbf {W}}}}=\frac{{\mathbf {I}}+{\mathbf {W}}}{2}\) as recommended in [52]. For DLM, we tune the free parameters to make its convergence speed comparable to that of the other algorithms. In each instance, we use two different stepsizes, \(\alpha =5\times 10^{-3}\) and \(\alpha =5\times 10^{-2}\). We run the simulations 50 times for DSGT, CSG, DSG, EXTRA and DLM, and 100 times for GSGT, and average the results to approximate the expected errors.

In Fig. 1a–f, we compare the average performance of DSGT, GSGT, CSG, DSG, EXTRA and DLM with the same parameters. It can be seen that DSGT and CSG are comparable in their convergence speeds as well as in their ultimate error bounds (almost indistinguishable). EXTRA and DLM are worse than DSGT and CSG in their final error bounds, and the performance gap increases with the network size and the stepsize. GSGT is slower, as expected, but still reaches a comparable error level under the small stepsize \(\alpha =5\times 10^{-3}\). In addition, the error bounds for DSGT, GSGT and CSG decrease in n, as expected from our theoretical analysis. The performance of DSG is the least favorable, as it exhibits the largest final errors (see Footnote 5).

In Fig. 1g–i (respectively, j–l), we further compare the solutions obtained under DSGT and GSGT with the same number of stochastic gradient evaluations (respectively, inter-node communications) under the small stepsize \(\alpha =5\times 10^{-3}\). We see that the two methods are comparable in their speeds of convergence w.r.t. the number of gradient evaluations. However, GSGT is much faster than DSGT given the same number of communications. These numerical results verify our arguments in Sect. 4.2.

Fig. 1 Performance comparison between DSGT, GSGT, CSG, DSG, EXTRA [52] and DLM [27] for on-line Ridge regression. For CSG, the plots show \(\Vert x_k-x^*\Vert ^2\). For the other methods, the plots show \(\frac{1}{n}\sum _{i=1}^n\Vert x_{i,k}-x^*\Vert ^2\).

6 Conclusions and future work

This paper considers distributed multi-agent optimization over a network, where each agent only has access to inexact gradients of its local cost function. We propose a distributed stochastic gradient tracking method (DSGT) and show that the iterates obtained by each agent, using a constant stepsize, reach a neighborhood of the optimum (in expectation) exponentially fast. More importantly, in the limit, the error bounds for the distances between the iterates and the optimal solution decrease in the network size, which is comparable to the performance of a centralized stochastic gradient algorithm. With a diminishing stepsize, the method exhibits the optimal \({\mathcal {O}}(1/k)\) rate of convergence. In the second part of this paper, we discuss a gossip-like stochastic gradient tracking method (GSGT) that is communication-efficient. Under a well-connected interaction graph, we show that GSGT requires fewer communications than DSGT to reach an \(\epsilon \) error level. Finally, we provide a numerical example that demonstrates the effectiveness of both algorithms. In future work, we will deal with directed and/or time-varying interaction graphs among agents. We also plan to explore other, more flexible randomized algorithms, such as broadcast-based protocols with possible transmission failures.