Abstract
In this paper, we study the problem of distributed multi-agent optimization over a network, where each agent possesses a local cost function that is smooth and strongly convex. The global objective is to find a common solution that minimizes the average of all cost functions. Assuming agents only have access to unbiased estimates of the gradients of their local cost functions, we consider a distributed stochastic gradient tracking method (DSGT) and a gossip-like stochastic gradient tracking method (GSGT). We show that, in expectation, the iterates generated by each agent are attracted to a neighborhood of the optimal solution, which they reach exponentially fast under a constant stepsize choice. Under DSGT, the limiting (expected) error bounds on the distance of the iterates from the optimal solution decrease with the network size n, a performance comparable to that of a centralized stochastic gradient algorithm. Moreover, we show that when the network is well-connected, GSGT incurs a lower communication cost than DSGT while maintaining a similar computational cost. A numerical example further demonstrates the effectiveness of the proposed methods.
1 Introduction
Consider a set of agents \({\mathcal {N}}=\{1,2,\ldots ,n\}\) connected over a network. Each agent has a local smooth and strongly convex cost function \(f_i:{\mathbb {R}}^{p}\rightarrow {\mathbb {R}}\). The global objective is to locate \(x\in {\mathbb {R}}^p\) that minimizes the average of all cost functions:
\(\min _{x\in {\mathbb {R}}^p} f(x):=\frac{1}{n}\sum _{i=1}^{n}f_i(x).\)  (1)
Scenarios in which problem (1) is considered include distributed machine learning [12, 15, 35], multi-agent target seeking [8, 42], and wireless networks [2, 11, 30], among many others.
To solve problem (1), we assume each agent i queries a stochastic oracle (\({{\mathcal {S}}}{{\mathcal {O}}}\)) to obtain noisy gradient samples of the form \(g_i(x,\xi _i)\) that satisfy the following condition:
Assumption 1
For all \(i\in {\mathcal {N}}\) and all \(x\in {\mathbb {R}}^p\), each random vector \(\xi _i\in {\mathbb {R}}^m\) is independent, and
\({\mathbb {E}}[g_i(x,\xi _i)\mid x]=\nabla f_i(x),\quad {\mathbb {E}}[\Vert g_i(x,\xi _i)-\nabla f_i(x)\Vert ^2\mid x]\le \sigma ^2\) for some \(\sigma >0\).
The above assumption on the stochastic gradients holds for many online distributed learning problems, where \(f_i(x)={\mathbb {E}}_{\xi _i}[F_i(x,\xi _i)]\) denotes the expected loss function agent i wishes to minimize, while independent samples \(\xi _i\) are gathered continuously over time. As another example, in simulation-based optimization, gradient estimation often incurs noise from various sources, such as modeling and discretization errors, incomplete convergence, and finite sample sizes for Monte Carlo methods [22].
Distributed algorithms dealing with problem (1) have been studied extensively in the literature [13, 19, 20, 28, 34, 36, 37, 45, 46, 52, 56]. Recently, there has been considerable interest in distributed implementations of stochastic gradient algorithms [3, 5,6,7, 9, 10, 14, 18, 24, 26, 32, 40, 41, 48, 51, 54, 55]. The literature has shown that distributed methods may compete with, or even outperform, their centralized counterparts under certain conditions [9, 10, 26, 40, 41]. For instance, in our recent work [40], we proposed a flocking-based approach for distributed stochastic optimization that beats a centralized gradient method in real time, assuming that all \(f_i\) are identical. However, to the best of our knowledge, no distributed stochastic gradient method addressing problem (1) has shown performance comparable to that of a centralized approach for optimizing the sum of smooth and strongly convex objective functions under Assumption 1 only. In particular, under constant stepsize policies, none of the existing algorithms achieves an error bound that decreases with the network size n.
A distributed gradient tracking method was proposed in [13, 34, 46], where agent-based auxiliary variables \(y_i\) were introduced to track the average gradients of \(f_i\), assuming accurate gradient information is available. It was shown that the method, with a constant stepsize, generates iterates that converge linearly to the optimal solution. Inspired by this approach, in this paper we consider a distributed stochastic gradient tracking method (DSGT). By comparison, in our proposed algorithm the \(y_i\) track the stochastic gradient averages of the \(f_i\). We are able to show that the iterates generated by each agent reach, in expectation, a neighborhood of the optimal point exponentially fast under a constant stepsize. Interestingly, with a sufficiently small stepsize, the limiting error bounds on the distance between the agent iterates and the optimal solution decrease with the network size, which is comparable to the performance of a centralized stochastic gradient algorithm.
Gossip-based communication protocols are popular choices for distributed computation due to their low communication costs [4, 25, 29, 31]. In the second part of this paper, we consider a gossip-like stochastic gradient tracking algorithm (GSGT) where, at each iteration, an agent wakes up uniformly at random and either communicates with one of her neighbors or updates by herself. Similar to DSGT, the method produces iterates that converge to a neighborhood of the optimal point exponentially fast under a sufficiently small constant stepsize. When the network of agents is well-connected (e.g., complete networks, almost all regular graphs), GSGT is shown to incur a lower communication burden and a similar computational cost compared to DSGT.
1.1 Related work
We now briefly review the literature on (distributed) stochastic optimization. First of all, our work is related to the extensive literature on stochastic approximation (SA) methods dating back to the seminal works [21, 49]. These works include the analysis of convergence (conditions for convergence, rates of convergence, suitable choices of stepsize) in the context of diverse noise models [23]. Assuming the objective function f is strongly convex with Lipschitz continuous gradients, the optimal rate of convergence for solving problem (1) has been shown to be \({\mathcal {O}}\left( 1/k\right) \) under a diminishing SA stepsize, where k denotes the iteration number [38]. With a constant stepsize \(\alpha >0\) that is sufficiently small, the iterates generated by a stochastic gradient method are attracted to an \({\mathcal {O}}(\alpha )\)-neighborhood of the optimal solution exponentially fast (in expectation).
Distributed implementations of stochastic gradient methods have become increasingly popular in recent years. In [48], the authors considered minimizing a sum of (possibly nonsmooth) convex objective functions subject to a common convex constraint set. It was shown that when the means of the stochastic subgradient errors diminish, there is mean consensus among the agents and mean convergence to the optimum function value under SA stepsizes. The work [54] used two diminishing stepsizes to deal with communication noises and subgradient errors, respectively. Asymptotic convergence to the optimal set was established; for constant stepsizes asymptotic error bounds were derived. In [14], a distributed dual averaging method was proposed for minimizing (possibly nonsmooth) convex functions. Under a carefully chosen SA stepsize sequence, the method exhibits the convergence rate \({\mathcal {O}}(\frac{n\log (k)}{(1-\lambda _2({\mathbf {W}}))\sqrt{k}})\), in which \(\lambda _2({\mathbf {W}})\) denotes the second largest singular value of the doubly stochastic mixing matrix \({\mathbf {W}}\). Paper [3] considered a projected stochastic gradient algorithm for solving non-convex optimization problems by combining a local stochastic gradient update and a gossip step. It was proved that consensus is asymptotically achieved in the network and the solutions converge to the set of KKT points with SA stepsizes. A distributed online algorithm was devised and analyzed in [5] for solving dynamic optimization problems in noisy communication environments. Sufficient conditions were provided for almost sure convergence of the algorithm. In [55], the authors proposed an adaptive diffusion algorithm based on penalty methods. Under a constant stepsize \(\alpha \), it was shown that the expected distance between the optimal solution and that obtained at each node is bounded by \({\mathcal {O}}(\alpha )\). 
The work [9, 10] further showed that under a sufficiently small stepsize and certain conditions on the stochastic gradients, distributed methods are able to achieve the same performance level as that of a centralized approach. Paper [7] considered the problem of distributed constrained convex optimization subject to multiple noise terms in both computation and communication stages. The authors utilized an augmented Lagrangian framework and established the almost sure convergence of the algorithm under a diminishing stepsize policy. In [32], a subgradient-push method was investigated for distributed optimization over time-varying directed graphs. When the objective function is strongly convex, the scheme exhibits the \({\mathcal {O}}(\frac{\ln k}{k})\) rate of convergence.
In a recent work [24], a class of decentralized first-order methods for nonsmooth and stochastic optimization was presented. The class was shown to exhibit the \({\mathcal {O}}(\frac{1}{k})\) (respectively, \({\mathcal {O}}(\frac{1}{\sqrt{k}})\)) rate of convergence for minimizing the sum of strongly convex functions (respectively, general convex functions). The work in [1] considered a composite convex optimization problem with noisy gradient information and showed the \({\mathcal {O}}(\frac{1}{\sqrt{k}})\) convergence rate using an ADMM-based approach. Paper [26] considered a decentralized stochastic gradient algorithm that achieves the \({\mathcal {O}}(\frac{1}{k}+\frac{1}{\sqrt{nk}})\) rate of convergence for minimizing the sum of non-convex functions. The rate is comparable to that of a centralized algorithm when k is large enough. At the same time, the communication cost for the decentralized approach is lower. Papers [40, 41] also demonstrate the advantage of distributively implementing a stochastic gradient method, assuming that all \(f_i\) are identical and sampling times are random and non-negligible. The work [51] utilized a time-dependent weighted mixing of stochastic subgradient updates to achieve the convergence rate of \({\mathcal {O}}(\frac{n\sqrt{n}}{(1-\lambda _2({\mathbf {W}}))k})\) for minimizing the sum of (possibly nonsmooth) strongly convex functions. In [53], the authors considered a decentralized consensus-based algorithm with delayed gradient information. The method was shown to achieve the optimal \({\mathcal {O}}(\frac{1}{\sqrt{k}})\) rate of convergence for general convex functions. In [18], the \({\mathcal {O}}(\frac{1}{k})\) convergence rate was established for strongly convex costs and random networks.
1.2 Main contribution
Our main contribution is summarized as follows. Firstly, we propose a novel distributed stochastic gradient tracking method (DSGT) for optimizing the sum of smooth and strongly convex objective functions. We employ an auxiliary variable \(y_i\) for each agent that tracks the average stochastic gradients of the cost functions. We show that, under a constant stepsize choice, the algorithm is comparable to a centralized stochastic gradient scheme in terms of their convergence speeds and the ultimate error bounds. In particular, the obtained error bound under DSGT decreases with the network size n, which has not been shown in the literature to the best of our knowledge. Moreover, assuming the gradient estimates are accurate, DSGT recovers the linear rate of convergence to the optimal solution [34, 46], which is also a unique feature among other distributed stochastic gradient algorithms.
Secondly, with an SA stepsize \(\alpha _k\rightarrow 0\), DSGT enjoys the optimal \({\mathcal {O}}(\frac{1}{k})\) rate of convergence to the optimal point. In addition, we characterize the dependency of the constant factors in the stepsize and the convergence rate on the properties of the mixing matrix as well as the characteristics of the objective functions, such as the strong convexity factor and the Lipschitz constant.
Thirdly, we introduce a gossip-like stochastic gradient tracking method that is efficient in communication. We show that, under a sufficiently small constant stepsize, GSGT also produces iterates that converge to a neighborhood of the optimal point exponentially fast. Again, when the gradient estimates are accurate, GSGT recovers the linear rate of convergence to the optimal solution. Compared to DSGT, we show that when the network is well-connected (e.g., complete network, almost all regular graphs), GSGT incurs lower communication cost than DSGT by a factor of \({\mathcal {O}}(\frac{|{\mathcal {E}}|}{n})\) (\(|{\mathcal {E}}|\) denoting the number of edges in the network) while maintaining a similar computational cost.
Finally, we provide a numerical example that demonstrates the effectiveness of the proposed methods when contrasted with the centralized stochastic gradient algorithm and some existing variants of distributed stochastic gradient methods.
1.3 Notation and assumptions
Throughout the paper, vectors are column vectors unless otherwise specified. Let each agent i hold a local copy \(x_i\in {\mathbb {R}}^p\) of the decision variable and an auxiliary variable \(y_i\in {\mathbb {R}}^p\). Their values at iteration/time k are denoted by \(x_{i,k}\) and \(y_{i,k}\), respectively. We let
\({\mathbf {x}}:=[x_1, x_2, \ldots , x_n]^{\intercal }\in {\mathbb {R}}^{n\times p},\quad {\mathbf {y}}:=[y_1, y_2, \ldots , y_n]^{\intercal }\in {\mathbb {R}}^{n\times p},\)
and
\({\overline{x}}:=\frac{1}{n}{\mathbf {1}}^{\intercal }{\mathbf {x}}\in {\mathbb {R}}^{1\times p},\quad {\overline{y}}:=\frac{1}{n}{\mathbf {1}}^{\intercal }{\mathbf {y}}\in {\mathbb {R}}^{1\times p},\)
where \({\mathbf {1}}\) denotes the vector with all entries equal to 1. We define an aggregate objective function of the local variables:
\(F({\mathbf {x}}):=\sum _{i=1}^{n}f_i(x_i),\)
and let
\(\nabla F({\mathbf {x}}):=[\nabla f_1(x_1), \nabla f_2(x_2), \ldots , \nabla f_n(x_n)]^{\intercal }\in {\mathbb {R}}^{n\times p}.\)
In addition, denote
\(h({\mathbf {x}}):=\frac{1}{n}{\mathbf {1}}^{\intercal }\nabla F({\mathbf {x}}).\)
The inner product of two vectors a, b of the same dimension is denoted by \(\langle a,b\rangle \). For two matrices \({\mathbf {A}},{\mathbf {B}}\in {\mathbb {R}}^{n\times p}\), we let \(\langle {\mathbf {A}},{\mathbf {B}}\rangle \) be the Frobenius inner product. We use \(\Vert \cdot \Vert \) to denote the 2-norm of vectors; for matrices, \(\Vert \cdot \Vert \) represents the Frobenius norm. The spectral radius of a square matrix \({\mathbf {M}}\) is denoted by \(\rho ({\mathbf {M}})\).
We make the following standing assumption on the individual objective functions \(f_i\).
Assumption 2
Each \(f_i:{\mathbb {R}}^p\rightarrow {\mathbb {R}}\) is \(\mu \)-strongly convex with L-Lipschitz continuous gradients, i.e., for any \(x,x'\in {\mathbb {R}}^p\),
\(\langle \nabla f_i(x)-\nabla f_i(x'), x-x'\rangle \ge \mu \Vert x-x'\Vert ^2,\quad \Vert \nabla f_i(x)-\nabla f_i(x')\Vert \le L\Vert x-x'\Vert .\)
We note that, under Assumption 2, problem (1) has a unique solution denoted by \(x^*\in {\mathbb {R}}^{1\times p}\).
A graph is a pair \({\mathcal {G}}=({\mathcal {V}},{\mathcal {E}})\) where \({\mathcal {V}}=\{1,2,\ldots ,n\}\) is the set of vertices (nodes) and \({\mathcal {E}}\subseteq {\mathcal {V}}\times {\mathcal {V}}\) represents the set of edges connecting vertices. We assume agents communicate in an undirected graph, i.e., \((i,j)\in {\mathcal {E}}\) iff (if and only if) \((j,i)\in {\mathcal {E}}\). For each agent i, let \({\mathcal {N}}_i=\{j\mid j\ne i, (i,j)\in {\mathcal {E}}\}\) be its set of neighbors. The cardinality of \({\mathcal {N}}_i\), denoted by \(\deg (i)\), is referred to as agent i’s degree. We consider the following condition regarding the interaction graph of agents.
Assumption 3
The graph \({\mathcal {G}}\) corresponding to the network of agents is undirected and connected, i.e., there exists a path between any two agents.
1.4 Organization of the paper
The paper is organized as follows. In Sect. 2, we introduce the distributed stochastic gradient tracking method and present its main convergence results. We perform analysis in Sect. 3. In Sect. 4 we propose the gossip-like stochastic gradient tracking method. A numerical example is provided in Sect. 5 to illustrate our theoretical findings. Section 6 concludes the paper.
2 A distributed stochastic gradient tracking method (DSGT)
We consider the following distributed stochastic gradient tracking method: at each step \(k\in {\mathbb {N}}\), every agent i independently implements the following two updates:
\(x_{i,k+1}=\sum _{j=1}^{n}w_{ij}\left( x_{j,k}-\alpha y_{j,k}\right) ,\)
\(y_{i,k+1}=\sum _{j=1}^{n}w_{ij}y_{j,k}+g_i(x_{i,k+1},\xi _{i,k+1})-g_i(x_{i,k},\xi _{i,k}),\)  (4)
where \(w_{ij}\) are nonnegative weights and \(\alpha >0\) is a constant stepsize. Agents i and j are connected iff \(w_{ij}, w_{ji}>0\). The iterates are initialized with an arbitrary \(x_{i,0}\) and \(y_{i,0}= g_i(x_{i,0},\xi _{i,0})\) for all \(i\in {{{\mathcal {N}}}}\). We can also write (4) in the following compact form:
\({\mathbf {x}}_{k+1}={\mathbf {W}}\left( {\mathbf {x}}_k-\alpha {\mathbf {y}}_k\right) ,\quad {\mathbf {y}}_{k+1}={\mathbf {W}}{\mathbf {y}}_k+{\mathbf {g}}({\mathbf {x}}_{k+1},\varvec{\xi }_{k+1})-{\mathbf {g}}({\mathbf {x}}_k,\varvec{\xi }_k),\)  (5)
with \({\mathbf {g}}({\mathbf {x}}_k,\varvec{\xi }_k):=[g_1(x_{1,k},\xi _{1,k}), g_2(x_{2,k},\xi _{2,k}), \ldots , g_n(x_{n,k},\xi _{n,k})]^{\intercal }\),
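To make the update concrete, the following Python sketch simulates a DSGT-style iteration in the compact form \({\mathbf {x}}_{k+1}={\mathbf {W}}({\mathbf {x}}_k-\alpha {\mathbf {y}}_k)\), \({\mathbf {y}}_{k+1}={\mathbf {W}}{\mathbf {y}}_k+{\mathbf {g}}({\mathbf {x}}_{k+1},\varvec{\xi }_{k+1})-{\mathbf {g}}({\mathbf {x}}_k,\varvec{\xi }_k)\). The quadratic objectives \(f_i(x)=\frac{1}{2}\Vert x-b_i\Vert ^2\), the ring topology, the noise level, and all numerical constants are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, alpha, sigma = 8, 3, 0.05, 0.01

# Illustrative local objectives: f_i(x) = 0.5*||x - b_i||^2, so the
# minimizer of the average is x* = mean(b_i).
b = rng.normal(size=(n, p))
x_star = b.mean(axis=0)

def grad(x, i):
    """Noisy gradient oracle g_i(x, xi): exact gradient plus i.i.d. noise."""
    return (x - b[i]) + sigma * rng.normal(size=p)

# Doubly stochastic mixing matrix for a ring graph (1/3 to each neighbor).
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    W[i, i] = 1.0 - W[i].sum()

# DSGT iterations: x <- W(x - alpha*y), y <- W y + g(x_new) - g(x_old),
# initialized with y_0 equal to the first stochastic gradients.
x = rng.normal(size=(n, p))
g_old = np.array([grad(x[i], i) for i in range(n)])
y = g_old.copy()
for k in range(2000):
    x = W @ (x - alpha * y)
    g_new = np.array([grad(x[i], i) for i in range(n)])
    y = W @ y + g_new - g_old
    g_old = g_new

err = np.linalg.norm(x.mean(axis=0) - x_star)
print(err)  # small: the averaged iterate settles near x*
```

Note that, since \({\mathbf {W}}\) is doubly stochastic, the simulation preserves the exact averaging identity \({\overline{y}}_k=\frac{1}{n}\sum _i g_i(x_{i,k},\xi _{i,k})\) discussed in the text, which can be checked by comparing the column means of `y` and `g_old`.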
where \({\mathbf {W}}=[w_{ij}]\in {\mathbb {R}}^{n\times n}\) denotes the coupling matrix of agents. We assume that \({\mathbf {W}}\) satisfies the following condition.
Assumption 4
Nonnegative coupling matrix \({\mathbf {W}}\) is doubly stochastic, i.e., \({\mathbf {W}}{\mathbf {1}}={\mathbf {1}}\) and \({\mathbf {1}}^{\intercal }{\mathbf {W}}={\mathbf {1}}^{\intercal }\). In addition, \(w_{ii}>0\) for some \(i\in {\mathcal {N}}\).
In the subsequent analysis, we will frequently use the following result, which is a direct implication of Assumptions 3 and 4 (see [46] Section II-B).
Lemma 1
Let Assumptions 3 and 4 hold, and let \(\rho _w\) denote the spectral norm of the matrix \({\mathbf {W}}-\frac{1}{n}{\mathbf {1}}{\mathbf {1}}^{\intercal }\). Then, \(\rho _w<1\) and
for all \(\omega \in {\mathbb {R}}^{n\times p}\), where \({\overline{\omega }}=\frac{1}{n}{\mathbf {1}}^{\intercal }\omega \).
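As a quick sanity check, the contraction in Lemma 1 can be verified numerically; the ring-graph mixing matrix below is an illustrative choice satisfying Assumptions 3 and 4.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 4

# Illustrative doubly stochastic mixing matrix for a ring graph.
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25
    W[i, i] = 0.5

# rho_w: spectral norm of W - (1/n) 1 1^T.
ones = np.ones((n, 1))
rho_w = np.linalg.norm(W - ones @ ones.T / n, ord=2)
assert rho_w < 1  # Lemma 1: holds for a connected graph with w_ii > 0

# Check ||W omega - 1 omega_bar|| <= rho_w * ||omega - 1 omega_bar||,
# with Frobenius norms; broadcasting subtracts the row-average omega_bar.
omega = rng.normal(size=(n, p))
omega_bar = omega.mean(axis=0)
lhs = np.linalg.norm(W @ omega - omega_bar)
rhs = rho_w * np.linalg.norm(omega - omega_bar)
print(lhs <= rhs + 1e-12)  # True
```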
Algorithm (4) is closely related to the schemes considered in [13, 34, 46], in which auxiliary variables \(y_{i,k}\) were introduced to track the average \(\frac{1}{n}\sum _{i=1}^{n}\nabla f_i(x_{i,k})\). This design ensures that the algorithms achieve linear convergence under a constant stepsize choice. Correspondingly, under our approach \(y_{i,k}\) are (approximately) tracking \(\frac{1}{n}\sum _{i=1}^{n}g_i(x_{i,k},\xi _{i,k})\). To see why this is the case, note that \({\overline{y}}_k = \frac{1}{n}{\mathbf {1}}^{\intercal } {\mathbf {y}}_k\). Since \(y_{i,0}=g_i(x_{i,0},\xi _{i,0}),\forall i\), by induction we have
\({\overline{y}}_k=\frac{1}{n}\sum _{i=1}^{n}g_i(x_{i,k},\xi _{i,k}),\quad \forall k.\)
It will be shown that \({\mathbf {y}}_k\) is close to \({\mathbf {1}}{\overline{y}}_k\) in expectation when k is sufficiently large. As a result, \(y_{i,k}\) are (approximately) tracking \(\frac{1}{n}\sum _{i=1}^{n}g_i(x_{i,k},\xi _{i,k})\).
It is worth noting that, compared to standard distributed subgradient methods [36], DSGT incurs twice the communication and storage costs per iteration.
2.1 Main results
Main convergence properties of DSGT are covered in the following theorem.
Theorem 1
Let \(\varGamma >1\) be arbitrarily chosen. Suppose Assumptions 1 and 2 hold and the stepsize \(\alpha \) satisfies
Then \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\overline{x}}_l-x^*\Vert ^2]\) and \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\mathbf {x}}_{l}-{\mathbf {1}}{\overline{x}}_{l}\Vert ^2]\), respectively, converge to \(\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2]\) and \(\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2]\) at the linear rate \({\mathcal {O}}(\rho ({\mathbf {A}})^k)\), where \(\rho ({\mathbf {A}})<1\) is the spectral radius of the matrix \({\mathbf {A}}\) given by
where \(\beta =\frac{1-\rho _w^2}{2\rho _w^2}-4\alpha L-2\alpha ^2 L^2\). Furthermore,
and
where
Remark 1
The first term on the right-hand side of (8) can be interpreted as the error caused by the stochastic gradients only, since it does not depend on the network topology. The second term on the right-hand side of (8) and the bound in (9) are network dependent, and they increase with \(\rho _w\) (a larger \(\rho _w\) indicates worse network connectivity).
In light of (8) and (9), we have
and
Let \(\frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}x^*\Vert ^2]\) measure the average quality of solutions obtained by all the agents. We have
which is decreasing in the network size n when \(\alpha \) is sufficiently small, i.e., when
The spectral gap \(1-\rho _w\) depends on the graph topology as well as the choice of weights \([w_{ij}]\). For example, suppose the Lazy Metropolis rule is adopted (see [33]). In this case, if the communication graph is a 1) path or ring graph, then \(1-\rho _w={\mathcal {O}}(1/n^2)\); 2) lattice, then \(1-\rho _w={\mathcal {O}}(1/n)\); 3) geometric random graph, then \(1-\rho _w={\mathcal {O}}(1/(n\log n))\); 4) Erdős–Rényi random graph, then \(1-\rho _w={\mathcal {O}}(1)\); 5) complete graph, then \(1-\rho _w=1\); 6) any connected undirected graph, then \(1-\rho _w={\mathcal {O}}(1/n^2)\).
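The dependence of the spectral gap on topology is easy to probe numerically. The sketch below compares a ring and a complete graph, taking Lazy Metropolis weights to mean \(w_{ij}=1/(2\max \{\deg (i),\deg (j)\})\) for neighbors (our reading of the rule in [33]); the graph size is an illustrative choice.

```python
import numpy as np

def lazy_metropolis(adj):
    """Lazy Metropolis weights: w_ij = 1/(2*max(deg_i, deg_j)) for
    neighbors, with w_ii chosen so each row sums to one."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (2.0 * max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

def spectral_gap(W):
    """1 - rho_w, where rho_w is the spectral norm of W - (1/n) 1 1^T."""
    n = W.shape[0]
    return 1.0 - np.linalg.norm(W - np.ones((n, n)) / n, ord=2)

n = 16
ring = np.zeros((n, n), dtype=int)
for i in range(n):
    ring[i, (i - 1) % n] = ring[i, (i + 1) % n] = 1
complete = 1 - np.eye(n, dtype=int)

gap_ring = spectral_gap(lazy_metropolis(ring))
gap_complete = spectral_gap(lazy_metropolis(complete))
print(gap_ring < gap_complete)  # True: the ring gap shrinks like 1/n^2
```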
From the above argument, the condition \(\alpha \sim \frac{\mu }{L^2}\frac{(1-\rho _w)^3}{n}\) is in general stricter than (7) in Theorem 1, which requires \(\alpha \sim \frac{(1-\rho _w)^2}{L}\) (when \(1-\rho _w\le (\frac{\mu }{L})^{2/3}\)). This difference suggests that, when implementing DSGT in practice, it can be advantageous to use a larger stepsize in the early stage to achieve faster convergence (see Corollary 1 below) and then switch to a smaller stepsize for better accuracy of the final solutions.\(\square \)
Remark 2
Under a centralized stochastic gradient (CSG) algorithm in the form of
we would obtain
It can be observed that DSGT is comparable with CSG in their ultimate error bounds (up to constant factors) with sufficiently small stepsizes.\(\square \)
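For reference, here is a minimal sketch of the CSG baseline, assuming the standard update \(x_{k+1}=x_k-\frac{\alpha }{n}\sum _{i=1}^{n}g_i(x_k,\xi _{i,k})\); the quadratic objectives \(f_i(x)=\frac{1}{2}\Vert x-b_i\Vert ^2\) and all numerical constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, alpha, sigma = 8, 3, 0.05, 0.01

# Illustrative setup: f_i(x) = 0.5*||x - b_i||^2, so x* = mean(b_i).
b = rng.normal(size=(n, p))
x_star = b.mean(axis=0)

# CSG: a single central iterate averaging all n noisy gradients per step,
# so the effective gradient-noise variance is sigma^2 / n.
x = rng.normal(size=p)
for k in range(2000):
    g = (x - b) + sigma * rng.normal(size=(n, p))  # all n noisy gradients
    x = x - alpha * g.mean(axis=0)

err = np.linalg.norm(x - x_star)
print(err)  # small: the iterate settles in a noise-dominated neighborhood
```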
As shown in Theorem 1, the convergence rate of DSGT is determined by the spectral radius \(\rho ({\mathbf {A}})<1\). In the corollary below we provide an upper bound of \(\rho ({\mathbf {A}})\).
Corollary 1
Under the conditions in Theorem 1, assuming in addition that the stepsize \(\alpha \) also satisfies
we have
Corollary 1 implies that, for sufficiently small stepsizes, the distributed gradient tracking method has a convergence speed comparable to that of a centralized scheme (in which case the linear rate is \({\mathcal {O}}((1-2\alpha \mu )^k)\)).
In the next theorem, we show DSGT achieves the \({\mathcal {O}}(\frac{1}{k})\) rate of convergence under a diminishing stepsize policy.
Theorem 2
Let Assumptions 1 and 2 hold. Consider the method in (4) where \(\alpha \) is replaced with the time-varying stepsize \(\alpha _k\) given by \(\alpha _k:=\theta /(m+k)\), where \(\theta >1/\mu \) and m satisfies
with \(C=\left[ \left( \frac{1-\rho _w^2}{2\rho _w^2}-\frac{4\theta L}{m}-\frac{2\theta ^2 L^2}{m^2}\right) ^{-1}+2\right] \Vert {\mathbf {W}}-{\mathbf {I}}\Vert ^2 L^2+\frac{3\theta L^3}{m}.\) Then for all \(k\ge 0\), we have
where \({\mathcal {O}}_k(1)\) denotes some constant that does not depend on k.
Remark 3
From (15), noting that \(\theta \mu >1\), we have
In particular, \(\frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_{k}-{\mathbf {1}}x^*\Vert ^2]\) asymptotically converges to 0 at the rate \(\frac{2\theta ^2 \sigma ^2}{n(\theta \mu -1)k}\), which does not depend on the spectral norm \(\rho _w\) and matches the convergence rate (up to constant factors) for centralized stochastic gradient methods (see [38, 47]).\(\square \)
Remark 4
Under a well connected network (\(\rho _w\simeq 0\)), \(m>\frac{\theta }{2}(\mu +L)\) suffices. When \(\rho _w\simeq 1\), the lower bound of m is in the order of \({\mathcal {O}}((1-\rho _w)^{-2})\).\(\square \)
Theorem 2 implies the following result if we choose \(\alpha _k=\theta /(k+1)\).
Corollary 2
Under Assumptions 1 and 2 and stepsize policy \(\alpha _k:=\theta /(k+1)\) for some \(\theta >1/\mu \), we have for all \(k\ge m\) where m satisfies condition (14),
where \({\tilde{U}}\), \({\tilde{X}}\), and \({\tilde{Y}}\) are some positive constants.
3 Analysis
In this section, we prove Theorem 1 by studying the evolution of \({\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2]\), \({\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2]\) and \({\mathbb {E}}[\Vert {\mathbf {y}}_k-{\mathbf {1}}{\overline{y}}_k\Vert ^2]\). Our strategy is to bound each of the three quantities by a linear combination of their past values, thereby establishing a linear system of inequalities. This approach differs from those employed in [34, 46], where the analyses examine \(\Vert {\overline{x}}_k-x^*\Vert \), \(\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert \) and \(\Vert {\mathbf {y}}_k-{\mathbf {1}}{\overline{y}}_k\Vert \). The distinction is due to the stochastic gradients \(g_i(x_{i,k},\xi _{i,k})\), whose variances play a crucial role in deriving the main inequalities.
We first introduce some lemmas that will be used later in the analysis. Denote by \({\mathcal {F}}_k\) the \(\sigma \)-algebra generated by \(\{\varvec{\xi }_0,\ldots ,\varvec{\xi }_{k-1}\}\), and define \({\mathbb {E}}[\cdot \mid {\mathcal {F}}_k]\) as the conditional expectation given \({\mathcal {F}}_k\).
Lemma 2
Under Assumption 1, recalling that \(h({\mathbf {x}})= \frac{1}{n}{\mathbf {1}}^{\intercal }\nabla F({\mathbf {x}})\), we have for all \(k\ge 0\),
\({\mathbb {E}}[{\overline{y}}_k\mid {\mathcal {F}}_k]=h({\mathbf {x}}_k),\quad {\mathbb {E}}[\Vert {\overline{y}}_k-h({\mathbf {x}}_k)\Vert ^2\mid {\mathcal {F}}_k]\le \frac{\sigma ^2}{n}.\)
Proof
By the definitions of \({\overline{y}}_k\) and \(h({\mathbf {x}}_k)\),
\(\square \)
Lemma 3
Under Assumption 2, we have for all \(k\ge 0\),
If in addition \(\alpha <2/(\mu +L)\), then
Proof
See [46] Lemma 10 for reference. \(\square \)
In the following lemma, we establish bounds on \(\Vert {\mathbf {x}}_{k+1}-{\mathbf {1}}{\overline{x}}_{k+1}\Vert ^2\) and on the conditional expectations of \(\Vert {\overline{x}}_{k+1}-x^*\Vert ^2\) and \(\Vert {\mathbf {y}}_{k+1}-{\mathbf {1}}{\overline{y}}_{k+1}\Vert ^2\), respectively.
Lemma 4
Suppose Assumptions 1 and 2 hold and \(\alpha <2/(\mu +L)\). We have the following inequalities:
and for any \(\beta >0\),
Proof
See Appendix 7.1. \(\square \)
3.1 Proof of Theorem 1
Taking full expectation on both sides of (18)–(20), we obtain the following linear system of inequalities
where the inequality is to be taken component-wise, and the entries of the matrix \({\mathbf {A}}=[a_{ij}]\) are given by
and \(M_{\sigma }\) is given in (10). Therefore, by induction we have
If the spectral radius of \({\mathbf {A}}\) satisfies \(\rho ({\mathbf {A}})<1\), then \({\mathbf {A}}^k\) converges to \({\mathbf {0}}\) at the linear rate \({\mathcal {O}}(\rho ({\mathbf {A}})^k)\) (see [17]), in which case \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\overline{x}}_l-x^*\Vert ^2]\), \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\mathbf {x}}_l-{\mathbf {1}}{\overline{x}}_l\Vert ^2]\) and \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\mathbf {y}}_l-{\mathbf {1}}{\overline{y}}_l\Vert ^2]\) all converge to a neighborhood of 0 at the linear rate \({\mathcal {O}}(\rho ({\mathbf {A}})^k)\). The next lemma provides conditions that ensure \(\rho ({\mathbf {A}})<1\).
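The mechanism behind this step can be illustrated numerically: for any nonnegative matrix with spectral radius below one, iterating the worst-case recursion \(v_{k+1}={\mathbf {A}}v_k+B\) drives \(v_k\) to the fixed point \(({\mathbf {I}}-{\mathbf {A}})^{-1}B\) at the linear rate \({\mathcal {O}}(\rho ({\mathbf {A}})^k)\). The matrix and vector below are illustrative stand-ins, not the \({\mathbf {A}}\) and \(B\) of the proof.

```python
import numpy as np

# Illustrative nonnegative matrix with spectral radius < 1 and a
# nonnegative perturbation vector, mimicking v_{k+1} <= A v_k + B.
A = np.array([[0.4, 0.3, 0.1],
              [0.2, 0.4, 0.3],
              [0.1, 0.2, 0.4]])
B = np.array([0.1, 0.0, 0.2])
assert max(abs(np.linalg.eigvals(A))) < 1

# Iterating the worst-case equality v_{k+1} = A v_k + B converges to the
# fixed point (I - A)^{-1} B at the linear rate O(rho(A)^k).
v = np.array([5.0, 3.0, 4.0])
for _ in range(400):
    v = A @ v + B
limit = np.linalg.solve(np.eye(3) - A, B)
print(np.allclose(v, limit))  # True
```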
Lemma 5
Let \({\mathbf {S}}=[s_{ij}]\in {\mathbb {R}}^{3\times 3}\) be a nonnegative, irreducible matrix with \(s_{ii}<\lambda ^*\) for some \(\lambda ^*>0\) for all \(i=1,2,3\). Then \(\rho ({\mathbf {S}})<\lambda ^*\) iff \(\mathrm {det}(\lambda ^* {\mathbf {I}}-{\mathbf {S}})>0\).
Proof
See Appendix 7.2. \(\square \)
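Lemma 5 can be checked numerically on illustrative matrices (here with \(\lambda ^*=1\)); both examples below are nonnegative and irreducible with diagonal entries below 1.

```python
import numpy as np

# Two illustrative nonnegative, irreducible 3x3 matrices with all
# diagonal entries below lambda* = 1.
S1 = np.array([[0.2, 0.3, 0.1],
               [0.1, 0.2, 0.3],
               [0.2, 0.1, 0.2]])
S2 = 2.0 * S1  # same sparsity pattern, spectral radius above 1

for S in (S1, S2):
    rho = max(abs(np.linalg.eigvals(S)))
    # Lemma 5 with lambda* = 1: rho(S) < 1 iff det(I - S) > 0.
    print((rho < 1) == (np.linalg.det(np.eye(3) - S) > 0))  # True
```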
We now derive the conditions such that \(\rho ({\mathbf {A}})<1\). Suppose \(\alpha \) and \(\beta \) meet the following relations:
and
Then,
Given that \(a_{11},a_{22},a_{33}<1\), in light of Lemma 5, we have \(\rho ({\mathbf {A}})<1\). In addition, by denoting \(B:=[\frac{\alpha ^2\sigma ^2}{n}, 0, M_{\sigma }]^{\intercal }\) and \([({\mathbf {I}}-{\mathbf {A}})^{-1}B]_j\) the j-th element of the vector \([({\mathbf {I}}-{\mathbf {A}})^{-1}B]\), we obtain from (22) that
and
It remains to show that (23)–(25) are satisfied under condition (7). By (23), we need
Since \(\alpha \le \frac{1-\rho _w^2}{12\rho _w L}\) by (7), we know that
Condition (24) leads to the inequality below:
By (27), we only need
The preceding inequality is equivalent to
implying that it is sufficient to have
To see that relation (25) holds, consider a stronger condition
or equivalently,
It suffices that
3.2 Proof of Corollary 1
We derive an upper bound of \(\rho ({\mathbf {A}})\) under conditions (7) and (13). Note that the characteristic function of \({\mathbf {A}}\) is given by
Since \(\text {det}({\mathbf {I}}-{\mathbf {A}})> 0\) and \(\text {det}(\max \{a_{11},a_{22},a_{33}\} {\mathbf {I}}-{\mathbf {A}})=\text {det}(a_{11}{\mathbf {I}}-{\mathbf {A}})<0\), we have \(\rho ({\mathbf {A}})\in (a_{11},1)\). By (24) and (25),
Suppose \(\lambda =1-\epsilon \) for some \(\epsilon \in (0,\alpha \mu )\), satisfying
or equivalently,
It suffices that
To see why this is the case, note that \(\frac{(\alpha \mu -\epsilon )}{\alpha \mu }\ge \frac{2}{\varGamma +1}\), and under (13),
As a result,
We then have
Denote
Then \(\text {det}({\tilde{\lambda }} {\mathbf {I}}-{\mathbf {A}})\ge 0\) so that \(\rho ({\mathbf {A}})\le {\tilde{\lambda }}\).
3.3 Proof of Theorem 2
Similar to (21), under stepsize policy \(\alpha _k=\theta /(m+k)\) where \(m>\frac{\theta }{2}(\mu +L)\), we have
where
\(\beta _k=\frac{1-\rho _w^2}{2\rho _w^2}-4\alpha _k L-2\alpha _k^2 L^2>0\), and
The condition \(\beta _k>0\) is satisfied when
or equivalently,
We first show that \({\mathbb {E}}[\Vert {\overline{x}}_{k}-x^*\Vert ^2]\le {\hat{U}}/(m+k)\) and \({\mathbb {E}}[\Vert {\mathbf {x}}_{k}-{\mathbf {1}}{\overline{x}}_{k}\Vert ^2]\le {\hat{X}}/(m+k)^2\) for some \({\hat{U}}\), \({\hat{X}}>0\) by induction. Denote \(U_k:={\mathbb {E}}[\Vert {\overline{x}}_{k}-x^*\Vert ^2]\), \(X_k:={\mathbb {E}}[\Vert {\mathbf {x}}_{k}-{\mathbf {1}}{\overline{x}}_{k}\Vert ^2]\), and \(Y_k:={\mathbb {E}}[\Vert {\mathbf {y}}_{k}-{\mathbf {1}}{\overline{y}}_{k}\Vert ^2]\). Suppose for some \(k\ge 0\),
We want to prove that
where \(C=(\frac{1}{\beta _0}+2)\Vert {\mathbf {W}}-{\mathbf {I}}\Vert ^2 L^2+3\alpha _0 L^3\). Plugging in \(\alpha _k=\theta /(m+k)\), it suffices to show
Let
condition (34) admits a solution iff
in which case \({\hat{X}}\) and \({\hat{Y}}\) are lower bounded under constraints {(34b), (34c), \({\hat{X}}\ge m^2 X_0\), \({\hat{Y}}\ge Y_0\)}. Specifically, \({\hat{X}}\) can be chosen as follows:
where
Since relation (32) holds trivially when \(k=0\), the induction is complete.
We further improve the bound on \(U_k\) (inspired by [39, 44]). From (29),
Since \({X_k}\le {\hat{X}}/(m+k)^2\),
Hence
In light of Lemma 4.1 in [44],
Then,
Note that
We conclude that
4 A gossip-like stochastic gradient tracking method (GSGT)
In this section, we consider a gossip-like stochastic gradient tracking method (GSGT): Initialize with an arbitrary \(x_{i,0}\) and \(y_{i,0}=g_i(x_{i,0},\xi _{i,0})\) for all \(i\in {\mathcal {N}}\). At each round \(k\in {\mathbb {N}}\), agent \(i_k\in {\mathcal {N}}\) wakes up with probability 1/n. Then \(i_k\) either communicates with one of its neighbors \(j_k\) (with probability \(\pi _{i_k j_k}\)) or not (with probability \(\pi _{i_k i_k}=1-\sum _{j\in {\mathcal {N}}_{i_k}}\pi _{i_k j}\)). In the former situation, the update rule for \(i\in \{i_k,j_k\}\) is as follows,
and for \(i\notin \{i_k,j_k\}\), \(x_{i,k+1}=x_{i,k}\), \(y_{i,k+1}=y_{i,k}\), and \(\xi _{i,k+1}=\xi _{i,k}\). In the latter situation, agent \(i_k\) performs an update based on its own information:
while no action is taken by agent \(i\ne i_k\). For ease of analysis, we denote \(j_k=i_k\) in this case, and let \(\mathbb {1}_k\) be the indicator function for the event \(\{j_k\ne i_k\}\), i.e., \(\mathbb {1}_k=1\) iff \(j_k\ne i_k\).
The use of stepsize \(2\alpha \) instead of \(\alpha \) in (38a) can be understood as follows. At each iteration, GSGT performs two gradient updates within the network. This can be achieved either by two different agents respectively updating their solutions, or by one agent using a doubled stepsize. The method is different from a standard gossip algorithm where exactly two agents update at each round. This difference allows us to design the probabilities \(\pi _{ij}\) with more flexibility. In particular, it is possible to construct a doubly stochastic probability matrix \({{\varvec{\Pi }}}=[\pi _{ij}]\) for any graph \({\mathcal {G}}\) under Assumption 3.
We can present GSGT in the following compact matrix form, using the notation introduced previously:
where the random coupling matrix \({\mathbf {W}}_k\) is defined as
in which \(e_i=[0\, \cdots \, 0\, 1\, 0\,\cdots ]^{\intercal }\in {\mathbb {R}}^{n\times 1}\) is a unit vector with the ith component equal to 1. By definition, each \({\mathbf {W}}_k\) is symmetric and doubly stochastic. The matrices \({\mathbf {D}}_k\) and \(\tilde{{\mathbf {D}}}_k\) are diagonal with their \(i_k\)th and \(j_k\)th diagonal entries equal to 1 and all other entries equal to 0 if \(j_k\ne i_k\), otherwise the \(i_k\)th entry of \({\mathbf {D}}_k\) (respectively, \(\tilde{{\mathbf {D}}}_k\)) equals 2 (respectively, 1) while all other entries equal 0.
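The coupling matrix can be written down explicitly if one assumes the standard gossip form \({\mathbf {W}}_k={\mathbf {I}}-\frac{1}{2}(e_{i_k}-e_{j_k})(e_{i_k}-e_{j_k})^{\intercal }\) [4] (an assumption on our part, since the displayed definition is not reproduced above). Under that assumption, the sketch below checks the stated symmetry and double stochasticity, and verifies by exact enumeration over \((i_k,j_k)\) that the expected coupling matrix equals \((1-\frac{1}{n}){\mathbf {I}}+\frac{{{\varvec{\Pi }}}+{{\varvec{\Pi }}}^{\intercal }}{2n}\), which is consistent with relation (42) used later.

```python
import numpy as np

def gossip_matrix(n, i, j):
    # Assumed standard gossip form [4]: W = I - (e_i - e_j)(e_i - e_j)^T / 2
    # averages entries i and j, and reduces to the identity when j == i.
    d = np.eye(n)[i] - np.eye(n)[j]
    return np.eye(n) - 0.5 * np.outer(d, d)

def expected_W(Pi):
    # Exact expectation over i_k uniform on {1,...,n} and j_k ~ Pi[i_k, :].
    n = len(Pi)
    return sum((Pi[i, j] / n) * gossip_matrix(n, i, j)
               for i in range(n) for j in range(n))
```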
We assume the following condition on the probability matrix \({{\varvec{\Pi }}}\):
Assumption 5
Nonnegative matrix \({{\varvec{\Pi }}}\) is doubly stochastic.
Let
It can be shown that (see [4])
which is doubly stochastic.
Lemma 6
Let Assumptions 3 and 5 hold, and let \(\rho _{{\bar{w}}}\) denote the spectral norm of the matrix \({\bar{{\mathbf {W}}}}-\frac{1}{n}{\mathbf {1}}{\mathbf {1}}^{\intercal }\). Then, \(\rho _{{\bar{w}}}\in [1-2/n,1)\).
Proof
Since \({\bar{{\mathbf {W}}}}\) is doubly stochastic, \(\rho _{{\bar{w}}}<1\) follows from Lemma 1. To see \(\rho _{{\bar{w}}}\ge 1-2/n\), note that \(\rho (\frac{{{\varvec{\Pi }}}+{{\varvec{\Pi }}}^{\intercal }}{2n})=\frac{1}{n}\), and 1 is an algebraically simple eigenvalue of \(\frac{{{\varvec{\Pi }}}+{{\varvec{\Pi }}}^{\intercal }}{2}\) and \(\frac{1}{n}{\mathbf {1}}{\mathbf {1}}^{\intercal }\). We have
where \(\lambda _2(\frac{{{\varvec{\Pi }}}+{{\varvec{\Pi }}}^{\intercal }}{2})\) is the second largest eigenvalue of \(\frac{{{\varvec{\Pi }}}+{{\varvec{\Pi }}}^{\intercal }}{2}\).
Since \(\lambda _2(\frac{{{\varvec{\Pi }}}+{{\varvec{\Pi }}}^{\intercal }}{2})\in [-1,1]\), we conclude that \(\rho _{{\bar{w}}}\ge 1-2/n\). \(\square \)
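Lemma 6 can be spot-checked numerically. Assuming the closed form \({\bar{{\mathbf {W}}}}=(1-\frac{1}{n}){\mathbf {I}}+({{\varvec{\Pi }}}+{{\varvec{\Pi }}}^{\intercal })/(2n)\) for the expected coupling matrix (our reading of (41)), the sketch below computes \(\rho _{{\bar{w}}}\) for a ring of six agents with uniform weights and verifies \(1-2/n\le \rho _{{\bar{w}}}<1\).

```python
import numpy as np

def rho_wbar(Pi):
    # Spectral norm of Wbar - (1/n) 1 1^T, with the closed form of Wbar
    # assumed as stated in the lead-in.
    n = len(Pi)
    Wbar = (1 - 1.0 / n) * np.eye(n) + (Pi + Pi.T) / (2 * n)
    return np.linalg.norm(Wbar - np.ones((n, n)) / n, 2)

# Ring of n = 6 agents: weight 1/3 on itself and on each of its two neighbors.
n = 6
Pi = (np.eye(n) + np.roll(np.eye(n), 1, axis=1)
      + np.roll(np.eye(n), -1, axis=1)) / 3.0
```

Here \(\lambda _2\) of the symmetrized \({{\varvec{\Pi }}}\) equals \(2/3\), so \(\rho _{{\bar{w}}}=1-(1-2/3)/6=17/18\), which lies inside the interval \([1-2/n,1)\) asserted by the lemma.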
Before proceeding, it is worth noting that for GSGT, we still have the following relation:
4.1 Main results
We present the main convergence results of GSGT in the following theorem.
Theorem 3
Let \(\varGamma >1\) be arbitrarily chosen. Suppose Assumptions 1–5 hold, and assume that the stepsize \(\alpha \) satisfies
where \(\eta =\frac{1}{n(1-\rho _{{\bar{w}}})}\) and \(Q=L/\mu \). Then \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\overline{x}}_l-x^*\Vert ^2]\) and \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\mathbf {x}}_{l}-{\mathbf {1}}{\overline{x}}_{l}\Vert ^2]\), respectively, converge to \(\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2]\) and \(\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2]\) at the linear rate \({\mathcal {O}}(\rho ({\mathbf {A}}_g)^k)\), where \(\rho ({\mathbf {A}}_g)<1\) is the spectral radius of the matrix \({\mathbf {A}}_g\) given by
in which \(\beta _1=\frac{n(1-\rho _{{\bar{w}}})}{4\alpha }-4\alpha L^2\) and \(\beta _2=\frac{n(1-\rho _{{\bar{w}}})}{4}-2\alpha L-2\alpha ^2 L^2\). In addition,
and
Remark 5
Notice that \(\eta =\frac{1}{n(1-\rho _{{\bar{w}}})}\) and \(1-\rho _{{\bar{w}}}\le \frac{2}{n}\) from Lemma 6. We can see from (45) and (46) that
and
Since
and by (44),
the second term on the right-hand side of (47) dominates the two terms on the right-hand side of (48). Thus, we have
\(\square \)
The corollary below provides an upper bound for \(\rho ({\mathbf {A}}_g)\).
Corollary 3
Under the conditions of Theorem 3, with \(\varGamma >3/2\), we have
Remark 6
The convergence of GSGT is slower than that of DSGT under the same stepsize \(\alpha \) (see Corollary 1). This is because in GSGT, at most two agents update their iterates at each iteration.\(\square \)
4.2 Performance comparison between DSGT and GSGT
In this section, we compare the performances of the two proposed algorithms in terms of their required computation and communication efforts for achieving an \(\epsilon \)-solution (with constant stepsizes), that is, we compute the number of stochastic gradient computations and communications needed to obtain \(\frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}x^*\Vert ^2]\le \epsilon \). Without loss of generality, for each method we first choose stepsize \(\alpha \) such that \(\frac{1}{n}\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}x^*\Vert ^2]\le \epsilon /2\) and then compute the number of iterations K such that \(\frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_K-{\mathbf {1}}x^*\Vert ^2]\le \epsilon \).
For DSGT, when \(\epsilon \) is small enough, we have \(\alpha ={\mathcal {O}}(\frac{n\mu \epsilon }{\sigma ^2})\) from (11). Then, noting that \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\mathbf {x}}_l-{\mathbf {1}}x^*\Vert ^2]\) converges linearly at the rate \({\mathcal {O}}(\rho ({\mathbf {A}})^k)\) where \(\rho ({\mathbf {A}})=1-{\mathcal {O}}(\alpha \mu )\), we obtain the number of required iterations:
In \(K_d\) iterations, the number of stochastic gradient computations is \(N_d=nK_d={\mathcal {O}}\left( \frac{\sigma ^2}{\mu ^2} \frac{\ln (\frac{1}{\epsilon })}{\epsilon }\right) \) and the number of communications is \(N_d^c=2|{\mathcal {E}}|K_d={\mathcal {O}}\left( \frac{|{\mathcal {E}}|}{n} \frac{\sigma ^2}{\mu ^2}\frac{\ln (\frac{1}{\epsilon })}{\epsilon }\right) \) where \(|{\mathcal {E}}|\) stands for the number of edges in the graph.
For GSGT we need \(\alpha ={\mathcal {O}}(\frac{n^2\mu \epsilon (1-\rho _{{\bar{w}}})}{\sigma ^2})\) from (49). Given that \(\rho ({\mathbf {A}}_g)=1-{\mathcal {O}}(\frac{\alpha \mu }{n})\), the number of required iterations \(K_g\) can be calculated as follows:
In \(K_g\) iterations, the number of gradient computations and communications are both bounded by \(N_g=N_g^c=2K_g={\mathcal {O}}(\frac{1}{n(1-\rho _{{\bar{w}}})}\frac{\sigma ^2}{\mu ^2}\frac{\ln (\frac{1}{\epsilon })}{\epsilon })\).
Suppose the Metropolis rule is applied to define the weights \(\pi _{ij}\) [50]. We first compare the numbers of stochastic gradient computations for DSGT and GSGT. Noticing that \(1-\rho _{{\bar{w}}}\le \frac{2}{n}\) by Lemma 6, \(N_g\) is at best of the same order as \(N_d={\mathcal {O}}(\frac{\sigma ^2}{\mu ^2}\frac{\ln (\frac{1}{\epsilon })}{\epsilon })\), which happens when \(1-\rho _{{\bar{w}}}\) is of order \(\frac{1}{n}\). Given that \(1-\rho _{{\bar{w}}}=(1-\lambda _2({{\varvec{\Pi }}}))/n\) from (42), we have \(1-\rho _{{\bar{w}}}={\mathcal {O}}(\frac{1}{n})\) for complete networks and almost all regular graphs [16], among others.
We then compare the number of required communications \(N_d^c={\mathcal {O}}(\frac{|{\mathcal {E}}|}{n}\frac{\sigma ^2}{\mu ^2}\frac{\ln (\frac{1}{\epsilon })}{\epsilon })\) and \(N_g^c={\mathcal {O}}(\frac{1}{n(1-\rho _{{\bar{w}}})}\frac{\sigma ^2}{\mu ^2}\frac{\ln (\frac{1}{\epsilon })}{\epsilon })\). When \(1-\rho _{{\bar{w}}}={\mathcal {O}}(\frac{1}{n})\), we have \(N_g^c={\mathcal {O}}(\frac{\sigma ^2}{\mu ^2}\frac{\ln (\frac{1}{\epsilon })}{\epsilon })\). By contrast, \(N_d^c\) is \({\mathcal {O}}(\frac{|{\mathcal {E}}|}{n})\) times larger than \(N_g^c\). In particular when \(|{\mathcal {E}}|={\mathcal {O}}(n^2)\) (e.g., complete network), the number of communications for GSGT is \({\mathcal {O}}(n)\) times smaller than that of DSGT.
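As a quick arithmetic check of the comparison above, suppressing constants and the common factor \(\frac{\sigma ^2}{\mu ^2}\frac{\ln (1/\epsilon )}{\epsilon }\), the communication ratio is \(N_d^c/N_g^c\sim |{\mathcal {E}}|(1-\rho _{{\bar{w}}})\) with \(1-\rho _{{\bar{w}}}=(1-\lambda _2({{\varvec{\Pi }}}))/n\) from (42). The snippet below evaluates this for a complete graph with Metropolis weights (\(\lambda _2=0\), \(|{\mathcal {E}}|=n(n-1)/2\)), recovering the \({\mathcal {O}}(n)\) saving; it is an order-of-magnitude sketch only.

```python
def comm_ratio(num_edges, n, lam2):
    # Order-level ratio N_d^c / N_g^c = |E| * (1 - rho_wbar), using
    # 1 - rho_wbar = (1 - lambda_2) / n from (42); constants suppressed.
    return num_edges * (1.0 - lam2) / n

# Complete graph on n = 10 nodes with Metropolis weights: Pi = J/n, so
# lambda_2 = 0 and |E| = 45, giving the ratio (n - 1)/2 = 4.5 = O(n).
```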
4.3 Proof of Theorem 3
We first derive a linear system of inequalities relating \({\mathbb {E}}[\Vert {\overline{x}}_k-x^*\Vert ^2]\), \({\mathbb {E}}[\Vert {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k\Vert ^2]\) and \({\mathbb {E}}[\Vert {\mathbf {y}}_k-{\mathbf {1}}{\overline{y}}_k\Vert ^2]\) to their values at the previous iteration.
Lemma 7
Suppose Assumptions 1–5 hold and the stepsize satisfies \(\alpha <n/(\mu +L)\). Then, we have the following inequalities:
For any \(\beta _1,\beta _2>0\),
Proof
See Appendix 7.3. \(\square \)
In light of Lemma 7, we have the following linear system of inequalities:
where
and \(M_g=\frac{(4\alpha ^2L^2+2\alpha L)\sigma ^2}{n}+4(\alpha L+1)\sigma ^2\). Suppose \(\alpha \), \(\beta _1,\beta _2>0\) satisfy
and
Then, by Lemma 5, the spectral radius of \({\mathbf {A}}_g\) is smaller than 1, and we have
Hence, \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\overline{x}}_l-x^*\Vert ^2]\), \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\mathbf {x}}_l-{\mathbf {1}}{\overline{x}}_l\Vert ^2]\) and \(\sup _{l\ge k}{\mathbb {E}}[\Vert {\mathbf {y}}_l-{\mathbf {1}}{\overline{y}}_l\Vert ^2]\) all converge to a neighborhood of 0 at the linear rate \({\mathcal {O}}(\rho ({\mathbf {A}}_g)^k)\). Moreover,
We now show (53), (54), and (57) are satisfied under condition (44). First, relation (44) implies that
Therefore, from (53) and (54) we have
By (58)–(61) and the fact that \(\rho _{{\bar{w}}}\ge 1-2/n\) obtained from (41), we have
Then, for relation (55) to hold, it is sufficient that
In light of (58), \(\alpha L\le n(1-\rho _{{\bar{w}}})/24\). We only need
which gives
We now derive the bounds for \(\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\overline{x}}_{k+1}-x^*\Vert ^2]\) and \(\limsup _{k\rightarrow \infty }{\mathbb {E}}[\Vert {\mathbf {x}}_{k+1}-{\mathbf {1}}{\overline{x}}_{k+1}\Vert ^2]\). By (57) and (62),
4.4 Proof of Corollary 3
The characteristic function of \({\mathbf {A}}_g\) is
By (55),
Let \(\lambda =1-\epsilon \) for some \(\epsilon \in (0,2\alpha \mu /n)\) that satisfies
Under condition (44), it suffices that
Denote \({\tilde{\lambda }}=1-\frac{(2\varGamma -3)}{\varGamma }\frac{\alpha \mu }{n}\). We have \(\text {det}({\tilde{\lambda }}{\mathbf {I}}-{\mathbf {A}}_g)\ge 0\), and therefore \(\rho ({\mathbf {A}}_g)\le {\tilde{\lambda }}\).
5 Numerical example
In this section, we provide a numerical example to illustrate our theoretical findings. Consider the online ridge regression problem, i.e.,
where \(\rho >0\) is a penalty parameter. For each agent i, samples in the form of \((u_i,v_i)\) are gathered continuously with \(u_i\in {\mathbb {R}}^p\) representing the features and \(v_i\in {\mathbb {R}}\) being the observed outputs. We assume that each \(u_i\in [0.3,0.4]^p\) is uniformly distributed, and \(v_i\) is drawn according to \(v_i=u_i^{\intercal } {\tilde{x}}_i+\varepsilon _i\), where \({\tilde{x}}_i\) are predefined parameters evenly located in \([0,10]^p\), and \(\varepsilon _i\) are independent Gaussian noises with mean 0 and variance 1. Given a pair \((u_i,v_i)\), agent i can compute an estimated gradient of \(f_i(x)\): \(g_i(x,u_i,v_i)=2(u_i^{\intercal }x -v_i)u_i+2\rho x\), which is unbiased. Problem (65) has a unique solution \(x^*\) given by \(x^*=(\sum _{i=1}^n{\mathbb {E}}_{u_i}[u_iu_i^{\intercal }]+n\rho {\mathbf {I}})^{-1}\sum _{i=1}^n{\mathbb {E}}_{u_i}[u_iu_i^{\intercal }]{\tilde{x}}_i\).
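For the data model just described, both the stochastic gradient and the closed-form optimum can be computed directly: for \(u_i\) uniform on \([0.3,0.4]^p\), \({\mathbb {E}}[u_iu_i^{\intercal }]\) has diagonal entries \(0.35^2+0.1^2/12\) and off-diagonal entries \(0.35^2\). The sketch below (an illustration, not the experiment code) implements \(g_i\) and \(x^*\).

```python
import numpy as np

def second_moment(p):
    # E[u u^T] for u uniform on [0.3, 0.4]^p: mean 0.35, variance 0.1^2 / 12.
    S = np.full((p, p), 0.35 ** 2)
    np.fill_diagonal(S, 0.35 ** 2 + 0.1 ** 2 / 12.0)
    return S

def ridge_optimum(x_tilde, rho):
    # x* = (sum_i E[u u^T] + n rho I)^{-1} sum_i E[u u^T] x_tilde_i from the text.
    n, p = x_tilde.shape
    S = second_moment(p)
    return np.linalg.solve(n * S + n * rho * np.eye(p), S @ x_tilde.sum(axis=0))

def stochastic_grad(x, u, v, rho):
    # Unbiased sample g_i(x, u_i, v_i) = 2 (u^T x - v) u + 2 rho x from the text.
    return 2.0 * (u @ x - v) * u + 2.0 * rho * x
```

A quick consistency check: summing the expected gradients \(2\,{\mathbb {E}}[u_iu_i^{\intercal }](x-{\tilde{x}}_i)+2\rho x\) over all agents and evaluating at \(x^*\) should give zero.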
In addition to DSGT, GSGT and CSG, we consider the following distributed stochastic gradient (DSG) algorithm, which is similar to the ones studied in [18, 26]:
Since some existing algorithms for deterministic distributed optimization can also be adapted to the stochastic gradient setting, e.g., EXTRA [52] and the decentralized linearized ADMM (DLM) [27], we include them in our experiments for comparison.
In the experiments, we consider 3 instances with \(p=20\) and \(n\in \{10,25,100\}\), respectively. Under each instance, we let \({\mathbf {x}}_0={\mathbf {0}}\) and the penalty parameter \(\rho =0.1\). For the distributed methods, we assume that the n agents constitute a random network, in which every pair of agents is linked with probability 0.4. The Metropolis rule is applied to define the weights \(w_{ij}\) (and \(\pi _{ij}\)) where applicable [50]:
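The Metropolis rule referenced above takes the standard form \(w_{ij}=1/(1+\max (d_i,d_j))\) for \(\{i,j\}\in {\mathcal {E}}\), \(w_{ij}=0\) for other \(i\ne j\), and \(w_{ii}=1-\sum _{j\ne i}w_{ij}\) [50]; the display itself is not reproduced here, so the sketch below assumes that form.

```python
import numpy as np

def metropolis_weights(adj):
    """Metropolis rule [50] (standard form, assumed here): for an edge
    {i, j}, w_ij = 1 / (1 + max(d_i, d_j)); w_ij = 0 for non-adjacent
    i != j; and w_ii = 1 - sum_{j != i} w_ij.  The resulting matrix is
    symmetric and doubly stochastic."""
    n = len(adj)
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()   # off-diagonal entries of row i are set
    return W
```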
For EXTRA, we choose \({\tilde{{\mathbf {W}}}}=\frac{{\mathbf {I}}+{\mathbf {W}}}{2}\) as recommended by [52]. For DLM, we tune the free parameters to make its convergence speed comparable to the other algorithms. In each instance, we use two different stepsizes \(\alpha =5\times 10^{-3}\) and \(\alpha =5\times 10^{-2}\), respectively. We run the simulations 50 times for DSGT, CSG, DSG, EXTRA and DLM and 100 times for GSGT and average the results to approximate the expected errors.
In Fig. 1a–f, we compare the average performances of DSGT, GSGT, CSG, DSG, EXTRA and DLM with the same parameters. It can be seen that DSGT and CSG are comparable in their convergence speeds as well as their ultimate error bounds (almost indistinguishable). EXTRA and DLM are worse than DSGT and CSG in their final error bounds. The performance gap increases with the network size and the stepsize. GSGT is slower as expected but still reaches a comparable error level under the small stepsize \(\alpha =5\times 10^{-3}\). In addition, the error bounds for DSGT, GSGT and CSG decrease in n as expected from our theoretical analysis. The performance of DSG is the least favorable, as it exhibits the largest final errors.Footnote 5
In Fig. 1g–i (respectively, j–l), we further compare the solutions obtained under DSGT and GSGT with the same number of stochastic gradient evaluations (respectively, inter-node communications) under the small stepsize \(\alpha =5\times 10^{-3}\). We see that the two methods are comparable in their speeds of convergence w.r.t. the number of gradient evaluations. However, GSGT is much faster than DSGT assuming the same number of communications. These numerical results verify our arguments in Sect. 4.2.
6 Conclusions and future work
This paper considers distributed multi-agent optimization over a network, where each agent only has access to inexact gradients of its local cost function. We propose a distributed stochastic gradient tracking method (DSGT) and show that the iterates obtained by each agent, using a constant stepsize value, reach a neighborhood of the optimum (in expectation) exponentially fast. More importantly, in the limit, the error bounds for the distances between the iterates and the optimal solution decrease in the network size, which is comparable to the performance of a centralized stochastic gradient algorithm. With a diminishing stepsize, the method exhibits the optimal \({\mathcal {O}}(1/k)\) rate of convergence. In the second part of this paper, we discuss a gossip-like stochastic gradient tracking method (GSGT) that is communication-efficient. Under a well-connected interaction graph, we show GSGT requires fewer communications than DSGT to reach an \(\epsilon \) error level. Finally, we provide a numerical example that demonstrates the effectiveness of both algorithms. In our future work, we will deal with directed and/or time-varying interaction graphs among agents. We also plan to explore other more flexible randomized algorithms, such as broadcast-based protocols with possible transmission failures.
Notes
This condition is weaker than (7) when \(\rho _w^2\ge \frac{\varGamma }{\varGamma +1}\frac{2\mu }{3L}\).
It is worth noting that the transient time for \(\frac{1}{n}{\mathbb {E}}[\Vert {\mathbf {x}}_{k}-{\mathbf {1}}x^*\Vert ^2]\) to approach the asymptotic convergence rate depends on the network topology, and its dependence on n can be significant.
Matrix \({\mathbf {A}}\) in Theorem 1 corresponds to such a choice of \(\alpha \) and \(\beta \).
In practice, this means agent i holds vectors \(x_i\), \(y_i\) and \(g_i(x_i,\xi _i)\) if it does not wake up.
DSG still holds the advantage over DSGT in the early stage for achieving a similar convergence speed with lower communication and storage costs.
References
Aybat, N.S., Wang, Z., Lin, T., Ma, S.: Distributed linearized alternating direction method of multipliers for composite convex consensus optimization. IEEE Trans. Autom. Control 63(1), 5–20 (2017)
Baingana, B., Mateos, G., Giannakis, G.B.: Proximal-gradient algorithms for tracking cascades over social networks. IEEE J. Sel. Top. Signal Process. 8(4), 563–575 (2014)
Bianchi, P., Jakubowicz, J.: Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization. IEEE Trans. Autom. Control 58(2), 391–405 (2013)
Boyd, S., Ghosh, A., Prabhakar, B., Shah, D.: Randomized gossip algorithms. IEEE Trans. Inf. Theory 52(6), 2508–2530 (2006)
Cavalcante, R.L., Stanczak, S.: A distributed subgradient method for dynamic convex optimization problems under noisy information exchange. IEEE J. Sel. Top. Signal Process. 7(2), 243–256 (2013)
Chatzipanagiotis, N., Dentcheva, D., Zavlanos, M.M.: An augmented Lagrangian method for distributed optimization. Math. Program. 152(1–2), 405–434 (2015)
Chatzipanagiotis, N., Zavlanos, M.M.: A distributed algorithm for convex constrained optimization under noise. IEEE Trans. Autom. Control 61(9), 2496–2511 (2016)
Chen, J., Sayed, A.H.: Diffusion adaptation strategies for distributed optimization and learning over networks. IEEE Trans. Signal Process. 60(8), 4289–4305 (2012)
Chen, J., Sayed, A.H.: On the learning behavior of adaptive networks—part I: transient analysis. IEEE Trans. Inf. Theory 61(6), 3487–3517 (2015)
Chen, J., Sayed, A.H.: On the learning behavior of adaptive networks—part II: performance analysis. IEEE Trans. Inf. Theory 61(6), 3518–3548 (2015)
Cohen, K., Nedić, A., Srikant, R.: Distributed learning algorithms for spectrum sharing in spatial random access wireless networks. IEEE Trans. Autom. Control 62(6), 2854–2869 (2017)
Cohen, K., Nedić, A., Srikant, R.: On projected stochastic gradient descent algorithm with weighted averaging for least squares regression. IEEE Trans. Autom. Control 62(11), 5974–5981 (2017)
Di Lorenzo, P., Scutari, G.: Next: in-network nonconvex optimization. IEEE Trans. Signal Inf. Process. Over Netw. 2(2), 120–136 (2016)
Duchi, J.C., Agarwal, A., Wainwright, M.J.: Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Trans. Autom. Control 57(3), 592–606 (2012)
Forrester, A.I., Sóbester, A., Keane, A.J.: Multi-fidelity optimization via surrogate modelling. In: Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences. The Royal Society, vol. 463, pp. 3251–3269 (2007)
Friedman, J., Kahn, J., Szemeredi, E.: On the second eigenvalue of random regular graphs. In: Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing. ACM, pp. 587–598 (1989)
Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (1990)
Jakovetic, D., Bajovic, D., Sahu, A.K., Kar, S.: Convergence rates for distributed stochastic optimization over random networks. In: 2018 IEEE Conference on Decision and Control (CDC), pp. 4238–4245. IEEE (2018)
Jakovetić, D., Xavier, J., Moura, J.M.: Fast distributed gradient methods. IEEE Trans. Autom. Control 59(5), 1131–1146 (2014)
Kia, S.S., Cortés, J., Martínez, S.: Distributed convex optimization via continuous-time coordination algorithms with discrete-time communication. Automatica 55, 254–264 (2015)
Kiefer, J., Wolfowitz, J., et al.: Stochastic estimation of the maximum of a regression function. Ann. Math. Stat. 23(3), 462–466 (1952)
Kleijnen, J.P.: Design and Analysis of Simulation Experiments, vol. 20. Springer, Berlin (2008)
Kushner, H., Yin, G.G.: Stochastic Approximation and Recursive Algorithms and Applications, vol. 35. Springer, Berlin (2003)
Lan, G., Lee, S., Zhou, Y.: Communication-efficient algorithms for decentralized and stochastic optimization. Math. Program. 180, 237–284 (2020). https://doi.org/10.1007/s10107-018-1355-4
Lee, S., Nedić, A.: Asynchronous gossip-based random projection algorithms over networks. IEEE Trans. Autom. Control 61(4), 953–968 (2016)
Lian, X., Zhang, C., Zhang, H., Hsieh, C.J., Zhang, W., Liu, J.: Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 5336–5346 (2017)
Ling, Q., Shi, W., Wu, G., Ribeiro, A.: DLM: decentralized linearized alternating direction method of multipliers. IEEE Trans. Signal Process. 63(15), 4051–4064 (2015)
Lobel, I., Ozdaglar, A., Feijer, D.: Distributed multi-agent optimization with state-dependent communication. Math. Program. 129(2), 255–284 (2011)
Lu, J., Tang, C.Y., Regier, P.R., Bow, T.D.: Gossip algorithms for convex consensus optimization over networks. IEEE Trans. Autom. Control 56(12), 2917–2923 (2011)
Mateos, G., Giannakis, G.B.: Distributed recursive least-squares: stability and performance analysis. IEEE Trans. Signal Process. 60(7), 3740–3754 (2012)
Mathkar, A.S., Borkar, V.S.: Nonlinear gossip. SIAM J. Control Optim. 54(3), 1535–1557 (2016)
Nedić, A., Olshevsky, A.: Stochastic gradient-push for strongly convex functions on time-varying directed graphs. IEEE Trans. Autom. Control 61(12), 3936–3947 (2016)
Nedić, A., Olshevsky, A., Rabbat, M.G.: Network topology and communication–computation tradeoffs in decentralized optimization. Proc. IEEE 106(5), 953–976 (2018)
Nedic, A., Olshevsky, A., Shi, W.: Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
Nedić, A., Olshevsky, A., Uribe, C.A.: Fast convergence rates for distributed non-bayesian learning. IEEE Trans. Autom. Control 62(11), 5538–5553 (2017)
Nedić, A., Ozdaglar, A.: Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control 54(1), 48–61 (2009)
Nedić, A., Ozdaglar, A., Parrilo, P.A.: Constrained consensus and optimization in multi-agent networks. IEEE Trans. Autom. Control 55(4), 922–938 (2010)
Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)
Olshevsky, A., Paschalidis, I.C., Spiridonoff, A.: Robust asynchronous stochastic gradient-push: asymptotically optimal and network-independent performance for strongly convex functions. arXiv preprint arXiv:1811.03982 (2018)
Pu, S., Garcia, A.: A flocking-based approach for distributed stochastic optimization. Oper. Res. 1, 267–281 (2018)
Pu, S., Garcia, A.: Swarming for faster convergence in stochastic optimization. SIAM J. Control Optim. 56(4), 2997–3020 (2018)
Pu, S., Garcia, A., Lin, Z.: Noise reduction by swarming in social foraging. IEEE Trans. Autom. Control 61(12), 4007–4013 (2016)
Pu, S., Nedić, A.: A distributed stochastic gradient tracking method. In: 2018 IEEE Conference on Decision and Control (CDC). IEEE, pp. 963–968 (2018)
Pu, S., Olshevsky, A., Paschalidis, I.C.: A sharp estimate on the transient time of distributed stochastic gradient descent. arXiv preprint arXiv:1906.02702 (2019)
Pu, S., Shi, W., Xu, J., Nedic, A.: Push–pull gradient methods for distributed optimization in networks. IEEE Trans. Autom. Control (2020). https://doi.org/10.1109/TAC.2020.2972824
Qu, G., Li, N.: Harnessing smoothness to accelerate distributed optimization. IEEE Trans. Control Netw. Syst. 5(3), 1245–1260 (2017)
Rakhlin, A., Shamir, O., Sridharan, K.: Making gradient descent optimal for strongly convex stochastic optimization. In: Proceedings of the 29th International Conference on International Conference on Machine Learning. Omnipress, pp. 1571–1578 (2012)
Ram, S.S., Nedić, A., Veeravalli, V.V.: Distributed stochastic subgradient projection algorithms for convex optimization. J. Optim. Theory Appl. 147(3), 516–545 (2010)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)
Sayed, A.H.: Adaptive networks. Proc. IEEE 102(4), 460–497 (2014)
Sayin, M.O., Vanli, N.D., Kozat, S.S., Başar, T.: Stochastic subgradient algorithms for strongly convex optimization over distributed networks. IEEE Trans. Netw. Sci. Eng. 4(4), 248–260 (2017)
Shi, W., Ling, Q., Wu, G., Yin, W.: EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM J. Optim. 25(2), 944–966 (2015)
Sirb, B., Ye, X.: Decentralized consensus algorithm with delayed and stochastic gradients. SIAM J. Optim. 28(2), 1232–1254 (2018)
Srivastava, K., Nedic, A.: Distributed asynchronous constrained stochastic optimization. IEEE J. Sel. Top. Signal Process. 5(4), 772–790 (2011)
Towfic, Z.J., Sayed, A.H.: Adaptive penalty-based distributed stochastic convex optimization. IEEE Trans. Signal Process. 62(15), 3924–3938 (2014)
Tsitsiklis, J., Bertsekas, D., Athans, M.: Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Trans. Autom. Control 31(9), 803–812 (1986)
Parts of the results appeared in the 57th IEEE Conference on Decision and Control [43].
This work was supported in parts by the NSF Grant CCF-1717391, the ONR Grant No. N00014-16-1-2245, and the Shenzhen Research Institute of Big Data (SRIBD) Startup Fund JCYJ-SP2019090001.
Appendix
Appendix
1.1 Proof of Lemma 4
By (4),
It follows that
Notice that \({\mathbb {E}}[{\overline{y}}_k\mid {\mathcal {F}}_k]=h({\mathbf {x}}_k)\), and
We have
where the inequality follows from Lemma 2. Denote \(\lambda =1-\alpha \mu \). In light of Lemma 3,
Relation (19) follows from the following argument:
where we used Lemma 1.
To prove (20), we need some preparations first. For ease of exposition we will write \(G_k:=G({\mathbf {x}}_k,\varvec{\xi }_k)\) and \(\nabla _k:=\nabla F({\mathbf {x}}_k)\) for short. From (5) and Lemma 1,
Notice that
by Assumption 1, and
We have
We will make use of two additional lemmas.
Lemma 8
Proof
From (4),
In light of Assumption 2,
Then,
The desired result then follows. \(\square \)
Lemma 9
Proof
By (4), we have
On one hand,
which gives
On the other hand,
We have
\(\square \)
By (70), Lemma 8 and Lemma 9, we obtain
Now we bound \(\Vert \nabla _{k+1}-\nabla _k \Vert ^2\) and \(\langle W{\mathbf {y}}_k-{\mathbf {1}}{\overline{y}}_k, \nabla _{k+1}-\nabla _k \rangle \). First, by Assumption 2 and Lemma 1,
Second,
Notice that
We have
and
By (75) and the above relations,
for any \(\beta >0\).
1.2 Proof of Lemma 5
The characteristic function of \({\mathbf {S}}\) is given by
Necessity is trivial since \(\text {det}(\lambda ^* {\mathbf {I}}-{\mathbf {S}})\le 0\) implies \(g(\lambda )=0\) for some \(\lambda \ge \lambda ^*\). We now show \(\text {det}(\lambda ^* {\mathbf {I}}-{\mathbf {S}})>0\) is also a sufficient condition. Given that \(g(\lambda ^*)=\text {det}(\lambda ^* {\mathbf {I}}-{\mathbf {S}})>0\),
It follows that
for some \(\gamma _1,\gamma _2,\gamma _3>0\) with \(\gamma _1+\gamma _2+\gamma _3\le 1\). Consider
We have \(g'(\lambda )>0\) for \(\lambda \in (-\infty ,-\lambda ^*]\cup [\lambda ^*,+\infty )\). Notice that
All the real roots of \(g(\lambda )=0\) lie in the interval \((-\lambda ^*,\lambda ^*)\). By the Perron-Frobenius theorem, \(\rho ({\mathbf {S}})\in {\mathbb {R}}\) is an eigenvalue of \({\mathbf {S}}\). We conclude that \(\rho ({\mathbf {S}})<\lambda ^*\).
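The sufficiency direction proved above can be spot-checked numerically: for a sample nonnegative matrix \({\mathbf {S}}\) (entries chosen here for illustration) with \(\det (\lambda ^*{\mathbf {I}}-{\mathbf {S}})>0\), the spectral radius indeed falls below \(\lambda ^*\). This is a sanity check of the implication, not a proof.

```python
import numpy as np

# Sample nonnegative matrix of the type handled by Lemma 5, with
# threshold lambda* = 1.
S = np.array([[0.5, 0.1, 0.0],
              [0.2, 0.4, 0.1],
              [0.1, 0.0, 0.6]])
l_star = 1.0
det_val = np.linalg.det(l_star * np.eye(3) - S)   # = 0.111 > 0
rho_S = max(abs(np.linalg.eigvals(S)))            # row sums <= 0.7, so < 1
```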
1.3 Proof of Lemma 7
First we prove relation (50). In light of (39a), we have
Then,
Taking conditional expectations of \((y_{i_k,k}+y_{j_k,k})\) and \(\Vert y_{i_k,k}+y_{j_k,k}\Vert ^2\) w.r.t. the random selections of \(i_k\) and \(j_k\), we get
and
where \(\rho _{\pi }<1\) denotes the spectral norm of \({{\varvec{\Pi }}}-\frac{1}{n}{\mathbf {1}}{\mathbf {1}}^{\intercal }\). It follows that
Noticing that \({\mathbb {E}}[{\overline{y}}_k]={\mathbb {E}}[h({\mathbf {x}}_k)]\), from Lemma 2 we have
Then from (82) we obtain
Since \(\alpha <n/(\mu +L)\), we know from Lemma 3 that
and
It follows that
To bound the consensus error \({\mathbb {E}}[\Vert {\mathbf {x}}_{k+1}-{\mathbf {1}}{\overline{x}}_{k+1}\Vert ^2]\), note that from (39a) and (79) we have
By Lemma 6, the conditional expectation of \(\mathbf {tr}\left[ ({\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k)^{\intercal }{\mathbf {W}}_k^{\intercal } {\mathbf {W}}_k({\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k)\right] \) can be bounded as follows:
In view of the structure of \({\mathbf {D}}_k\), we rewrite \(2\langle {\mathbf {W}}_k {\mathbf {x}}_k-{\mathbf {1}}{\overline{x}}_k, {\mathbf {D}}_k {\mathbf {y}}_k\rangle \) as follows:
Note that
and similarly,
The following inequality holds:
The last term in Eq. (84) can be bounded in the following way:
In view of inequalities (85)–(87), from (84) we obtain
where \(\beta _1>0\) is arbitrary. Notice that by relation (83) and Lemma 3,
We obtain
which is exactly inequality (51).
Finally we prove inequality (52). From the update rule (39b),
Denote \(g_{i,k}:=g_i(x_{i,k},\xi _{i,k})\) for short. Notice that
and
We have
Hence relation (89) leads to
Now we analyze the three terms on the right-hand side of (90), respectively. First,
Second,
In light of Assumptions 1 and 2, we can bound \({\mathbb {E}}[\Vert g_{i_k,k+1}-g_{i_k,k}\Vert ^2]\) and \({\mathbb {E}}[\mathbb {1}_k\Vert g_{j_k,k+1}-g_{j_k,k}\Vert ^2]\):
Similarly,
To further bound \({\mathbb {E}}[\Vert g_{i_k,k+1}-g_{i_k,k}\Vert ^2+\mathbb {1}_k\Vert g_{j_k,k+1}-g_{j_k,k}\Vert ^2]\), we introduce the following lemma.
Lemma 10
The proof is similar to that of Lemma 8 and is omitted here. Equation (92) together with (93), (94) and Lemma 10 leads to the following inequality:
Note that from (37a) and (38a), we have
and
Then by (95),
For the last term on the right-hand side of (90), note that
By Assumption 1, we have
The following lemma is useful.
Lemma 11
Proof
See Appendix 7.4. \(\square \)
In light of (97) and Lemma 11,
Notice that from (37a), (38a) and Assumption 2,
and
We have from (98) that
for any \(\beta _2>0\). In light of (91), (96) and (99), we obtain by (90) that
Since by (88),
We conclude that
1.4 Proof of Lemma 11
The following relation holds:
From the updating rules (37b) and (38b), we have
and
Hence
Pu, S., Nedić, A. Distributed stochastic gradient tracking methods. Math. Program. 187, 409–457 (2021). https://doi.org/10.1007/s10107-020-01487-0