1 Introduction

In recent years, distributed optimization problems in multi-agent systems have attracted increasing attention. Distributed optimization requires all agents to cooperatively minimize a sum of local objective functions over a graph. The key feature of such a problem is that each agent knows only a local component of the objective function and thus must cooperate with its neighbors to compute the optimal value. The interaction between nodes is modeled by an algebraic graph. Motivating examples for distributed computation include AUV formation control [24], large-scale machine learning [4, 17, 25], and distributed quantile regression over sensor networks [21].

To solve the distributed optimization problem, the majority of algorithms (see, e.g., [11, 15, 16, 21] and the references therein) generally comprise two parts. One drives all agents to consensus, building on the well-known consensus algorithm [18]. The other pushes the consensus value toward an optimal point by using the local (sub)gradient at each node; for this purpose, subgradient-based algorithms have been widely used. To achieve consensus of the multi-agent network, most existing methods require each agent to access the state values of its neighbors at each time, either exactly [15, 18] or in a quantized form [19, 23]. However, in some situations, an agent may only roughly know the relative state measurements of its neighbors. For example, consider several robots working in a plane, where each robot can only tell, via cheap sensors, which quadrant a neighbor lies in, but not the neighbor's accurate relative position. The accessible information is thus restricted to a single bit. Note that this is different from the quantized setting in [19], which studied the effects of exchanging a quantized rather than an exact state between neighbors. It is also different from previous studies on exchanging quantized gradients [13], since we use only quantized relative state information. Therefore, most algorithms in the literature, particularly the ones in the references cited above, cannot handle the case of one-bit information. It is worth noting that another advantage of our algorithm, in addition to using only one bit of relative information, is that it does not require the interaction matrix of the agents to be doubly stochastic. A doubly stochastic adjacency matrix is a common assumption in many existing algorithms [14, 16, 20], but it is restrictive in the distributed setting.
For example, the Metropolis method [20] to construct a doubly-stochastic matrix requires each node to know its neighbors’ degrees, which may be impractical in applications, especially when the graph is time-varying.

Designing an algorithm using one bit of information often involves nonlinear systems analysis, which is substantially different from the graph Laplacian theory commonly applied in the aforementioned works. There are, however, some exceptions [5, 9, 12]. In [5], the authors designed a consensus algorithm using only sign information of the relative state. A similar algorithm was also proposed in [9] to distributedly compute a sample median. The algorithm in [12] is the most relevant to the one in this chapter, except that it is a continuous-time algorithm and adopts a completely different analysis tool from ours. We will return to this point and discuss it more extensively later.

In fact, all the aforementioned works that use one bit of information focus on continuous-time algorithms. However, a discrete-time algorithm is worth studying because many distributed optimization applications involve communication between agents and control of agents, which are typically discrete in nature. Besides, a discrete-time algorithm is easier to implement. Moreover, a continuous-time algorithm cannot be easily extended to the discrete-time case, since the methods used to analyze continuous-time algorithms in the above works are often based on Lyapunov theory. Some general stepsize rules (e.g., constant, diminishing) in discrete-time gradient-based algorithms cannot guarantee the monotonic decrease of a candidate Lyapunov function, while some special stepsize rules (e.g., the line minimization rule) often fail to meet the requirements of distributed computation, which renders the Lyapunov analysis difficult to extend to the discrete-time case. Therefore, an alternative method is needed, which is what this chapter provides.

More precisely, in this chapter we propose a distributed optimization algorithm that uses only one bit of information in the discrete-time case. Different from most previous works, our analysis is based on optimization theory rather than algebraic graph theory or Lyapunov theory. This has two underlying advantages. First, compared to many existing approaches, which first propose an algorithm and then find a Lyapunov function to prove its convergence, the intuition behind our algorithm appears more natural, as it aims to minimize a well-designed objective function. Second, the wealth of research in convex optimization theory makes our algorithm more easily extensible to more general cases. For example, our algorithm over time-varying graphs is a direct extension of that over static graphs. Specifically, we extend our algorithm to both deterministically time-varying graphs and randomly time-varying graphs. The former can model the time-varying topology of agents in applications [17, 22], while the latter can describe gossip networks [10], random packet losses in communication networks, etc. Based on optimization theory, our methods for analyzing the cases of deterministically time-varying graphs and randomly time-varying graphs exploit incremental gradient methods and stochastic gradient descent methods, respectively.

For a connected static graph, every node running the distributed optimization algorithm is shown to converge asymptotically to a common optimal point of the optimization problem, without any reduction in the convergence rate. For deterministically time-varying graphs, the convergence of the distributed optimization algorithm is established if the graphs are uniformly jointly connected. For randomly time-varying graphs, we show the convergence of the distributed optimization algorithm in the almost sure sense under the so-called randomly activated connected graph assumption.

The rest of the chapter is organized as follows. Section 2 provides some preliminaries and introduces the distributed optimization problem. In Sect. 3, we present our discrete-time distributed optimization algorithm using one bit of information. Section 4 includes our main results on convergence and convergence rate of the algorithm over static graphs. Section 5 provides the convergence results over uniformly jointly connected graphs and randomly activated graphs. Finally, we perform simulations to validate the theoretical results in Sect. 6, and draw some concluding remarks in Sect. 7.

Notation: We use \(a,\mathbf {a},A\), and \(\mathscr {A}\) to denote a scalar, vector, matrix, and set, respectively. \(\mathbf {a}^\mathsf {T}\) and \(A^\mathsf {T}\) denote the transposes of \(\mathbf {a}\) and A, respectively. \([A]_{ij}\) denotes the element in row i and column j of A. \(\mathbb {R}\) denotes the set of real numbers and \(\mathbb {R}^n\) denotes the set of all n-dimensional real vectors. \({\mathbbm {1}}\) denotes the vector with all ones, the dimension of which depends on the context. We let \(\Vert \cdot \Vert _1,\Vert \cdot \Vert \) and \(\Vert \cdot \Vert _\infty \) denote the \(l_1\)-norm, \(l_2\)-norm and \(l_{\infty }\)-norm of a vector or matrix, respectively. We define

$$ \text {sgn}(x)=\left\{ \begin{array}{cl} 1, &amp; \text {if }x\ge 0, \\ -1, &amp; \text {otherwise}. \end{array}\right. $$

With a slight abuse of notation, \(\nabla f(x)\) denotes any subgradient of f(x) at x, i.e., \(\nabla f(x)\) satisfies

$$\begin{aligned} f(y)\ge f(x)+(y-x)^\mathsf {T}\nabla f(x),\ \forall y\in \mathbb {R}. \end{aligned}$$
(1)

The subdifferential \(\partial f(x)\) is the set of all subgradients of f(x) at x. If f(x) is differentiable at x, then \(\partial f(x)\) includes only the gradient of f(x) at x.

We call \(\inf _{\mathbf {x} \in \mathbb {R}^{n}} f(\mathbf {x})\) the optimal value of \(f(\mathbf {x})\). Any element from the set \(\arg \inf _{\mathbf {x} \in \mathbb {R}^{n}} f(\mathbf {x})\) is called an optimal solution or optimal point of \(f(\mathbf {x})\).

Superscripts are usually used to represent sequence indices, i.e., \(x^k\) represents the value of the sequence \(\{x^k\}\) at time k.

2 Problem Formulation

This section introduces some basics of graph theory, and presents the distributed optimization problem in multi-agent networks.

2.1 Basics of Graph Theory

A graph (network) is represented as \(\mathscr {G}=(\mathscr {V},\mathscr {E})\), where \(\mathscr {V}=\{1,...,n\}\) is the set of nodes and \(\mathscr {E}\subseteq \mathscr {V}\times \mathscr {V}\) is the set of edges. Let \(\mathscr {N}_i=\{j\in \mathscr {V}|(i,j)\in \mathscr {E}\}\) be the set of neighbors of node i, and \(A=[a_{ij}]\) be the weighted adjacency matrix of \(\mathscr {G}\), where \(a_{ij}>0\) if and only if there exists an edge connecting nodes i and j, and \(a_{ij}=0\) otherwise. If \(A=A^\mathsf {T}\), the associated graph is undirected. This chapter focuses only on undirected graphs.

In the case of time-varying graphs, we use \(\mathscr {G}^k=(\mathscr {V},\mathscr {E}^k,A^k)\) to represent the graph at time k. Let \(\mathscr {G}^{k_1}\cup \mathscr {G}^{k_2}\) denote the graph \((\mathscr {V},\mathscr {E}^{k_1}\cup \mathscr {E}^{k_2},A^{k_1}+A^{k_2})\). Let \(\mathscr {N}_i^k=\{j\in \mathscr {V}|(i,j)\in \mathscr {E}^k\}\) denote the set of neighbors of node i at time k. The incidence matrix \(B\in \mathbb {R}^{n\times m}\) of \(\mathscr {G}\), where m is the number of edges, is defined by

$$ B_{ie}=\left\{ \begin{aligned}1,&\text { if node}\, i\, \text {is the source node of edge}\, e,\\ -1,&\text { if node}\, i\, \text {is the sink node of edge}\, e,\\ 0,&\text { otherwise}.\end{aligned}\right. $$

For any \(\mathbf {x}=[x_1,...,x_n]^\mathsf {T}\), we have that

$$ \mathbf {b}_e^\mathsf {T}\mathbf {x}=x_i-x_j $$

where \(\mathbf {b}_e,e\in \mathscr {E}\) is the e-th column of B, and i and j are the source and the sink nodes of edge e, respectively.
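As a quick illustration of the incidence matrix and the relation \(\mathbf {b}_e^\mathsf {T}\mathbf {x}=x_i-x_j\), the following sketch builds B for a hypothetical 4-node path graph (the edge orientations are an arbitrary choice, since the text only requires each edge to have a source and a sink):

```python
import numpy as np

# Hypothetical example: a path graph 1-2-3-4 (n = 4 nodes, m = 3 edges).
# Each edge e = (i, j) is oriented from its source i to its sink j.
edges = [(0, 1), (1, 2), (2, 3)]
n, m = 4, len(edges)

B = np.zeros((n, m))
for e, (i, j) in enumerate(edges):
    B[i, e] = 1.0   # source node of edge e
    B[j, e] = -1.0  # sink node of edge e

x = np.array([5.0, 2.0, -1.0, 3.0])

# b_e^T x equals the relative state x_i - x_j across edge e.
for e, (i, j) in enumerate(edges):
    assert B[:, e] @ x == x[i] - x[j]
```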

A path is a sequence of consecutive edges connecting a set of distinct nodes. A graph is connected if there exists a path between any pair of nodes. To quantify the strength of the graph's connectivity, we introduce the important concept of an l-connected graph below.

Definition 1

(l-connected graph) A connected graph is l-connected (\(l\ge 1\)) if it remains connected whenever fewer than l edges are removed.

Clearly, a connected graph is at least 1-connected and each node of an l-connected graph has at least l neighbors.
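Definition 1 can be checked by brute force on small graphs: a graph is l-connected exactly when no set of fewer than l edges disconnects it. The sketch below (a minimal illustration; the 4-node cycle is our own test case, not taken from the text) computes the largest such l:

```python
from itertools import combinations

def is_connected(nodes, edges):
    """BFS-based connectivity test on an undirected edge list."""
    if not nodes:
        return True
    adj = {v: set() for v in nodes}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        v = stack.pop()
        for w in adj[v] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == len(nodes)

def edge_connectivity(nodes, edges):
    """Largest l such that the graph stays connected whenever fewer
    than l edges are removed (Definition 1)."""
    if not is_connected(nodes, edges):
        return 0
    for l in range(1, len(edges) + 1):
        # Try every way of removing exactly l edges.
        for removed in combinations(edges, l):
            kept = [e for e in edges if e not in removed]
            if not is_connected(nodes, kept):
                return l
    return len(edges)

# A 4-node cycle is 2-connected: removing any single edge keeps it
# connected, but removing two suitable edges disconnects it.
cycle = [(1, 2), (2, 3), (3, 4), (4, 1)]
print(edge_connectivity([1, 2, 3, 4], cycle))  # 2
```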

2.2 Distributed Optimization Problem

With only the sign of relative state, our objective is to distributedly solve the multi-agent optimization problem

$$\begin{aligned} \mathop {\text {minimize}}_{x\in \mathbb {R}}\ f(x):=\sum _{i=1}^n f_i(x) \end{aligned}$$
(2)

where for each \(i\in \mathscr {V}\), the local objective function \(f_i(x)\) is convex (hence continuous) but not necessarily differentiable, and is known only by node i. The number of nodes satisfies \(n>1\). We first make a mild assumption.

Assumption 1

(Nonempty optimal set and bounded subgradients)

  1. (a)

    The set \(\mathscr {X}^\star \) of optimal solutions of problem (2) is nonempty, i.e., for any \(x^\star \in \mathscr {X}^\star \), it holds that \(f^\star :=f(x^\star )=\inf _{x\in \mathbb {R}}f(x)\).

  2. (b)

    There exists a constant \(c>0\) such that

    $$\begin{aligned} |\nabla f_i(x)|\le c,\ \forall i\in \mathscr {V},x\in \mathbb {R}. \end{aligned}$$
    (3)

Assumption 1 is common in the literature, see e.g., [16, 25]. In particular, the second part is often made to guarantee the convergence of a subgradient method [16], and obviously holds if the decision variable x is restricted to a compact set.

3 The Distributed Optimization Algorithm Over Static Graphs

In this section, we provide the discrete-time distributed optimization algorithm that uses only the sign information of the relative state of the neighboring nodes (hence one-bit information), and then interpret it via the penalty method in optimization theory.

This section focuses only on static graphs, which are important to the analysis of the time-varying cases in the following sections.

3.1 The Distributed Optimization Algorithm

The discrete-time distributed algorithm to solve (2) over a static network \(\mathscr {G}\) is given in Algorithm 1.

Algorithm 1 (One-bit distributed optimization) Each node i initializes \(x_i^0\in \mathbb {R}\) and, at every time k, updates its state by

$$\begin{aligned} x_i^{k+1}=x_i^{k}-\rho ^k\Big (\lambda \sum _{j\in \mathscr {N}_i}a_{ij}\text {sgn}(x_i^k-x_j^k)+\nabla f_i(x_i^k)\Big ),\ i\in \mathscr {V}, \end{aligned}$$

where \(\lambda >0\) is a penalty parameter and \(\rho ^k>0\) is the stepsize at time k.

The continuous-time version of Algorithm 1 is also given in (4) of [12] and is proved to be convergent using non-smooth analysis tools [6]. To obtain a valid algorithm, it is important to choose both \(\lambda \) and \(\rho ^k\) appropriately, which, in the discrete-time case, requires a completely different approach from that of [12], as will become evident in Sect. 3.2.
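Since Algorithm 1 is exactly the subgradient iteration of the penalized problem (7) (see Sect. 3.2), its local update is easy to simulate. The sketch below is a minimal illustration, not the authors' code: the 4-node cycle, the objectives \(f_i(x)=|x-s_i|\), the value \(\lambda =1.5\), and the stepsize \(\rho ^k=1/k\) are all illustrative choices consistent with Theorem 1 and Assumption 2.

```python
import numpy as np

def sgn(x):
    """Sign function as defined in the Notation: sgn(0) = 1.
    (Note that np.sign(0) = 0, so np.sign is not used here.)"""
    return np.where(x >= 0, 1.0, -1.0)

# Illustrative setup: 4-node cycle with unit weights (2-connected),
# f_i(x) = |x - s_i| as in Example 1, so c = 1 and the set of
# optimal solutions of f = sum_i f_i is the interval [2, 4].
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
s = np.array([0.0, 2.0, 4.0, 6.0])
grad_f = lambda x: sgn(x - s)          # a subgradient of |x_i - s_i|
lam = 1.5                              # > nc/(2*a_min^(2)) = 1 (Theorem 1)

x = np.array([10.0, -3.0, 7.0, 0.0])   # arbitrary initial states
for k in range(1, 20001):
    rho = 1.0 / k                      # diminishing stepsize (Assumption 2)
    # One-bit update: node i only uses sgn(x_i - x_j) for neighbors j.
    consensus = (A * sgn(x[:, None] - x[None, :])).sum(axis=1)
    x = x - rho * (lam * consensus + grad_f(x))

print(np.round(x, 2))  # all states near a common point in [2, 4]
```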

Compared with the celebrated distributed gradient descent (DGD) algorithm (see, e.g., [16]),

$$\begin{aligned} x_i^{k+1}=x_i^{k}+\sum _{j\in \mathscr {N}_i}\tilde{a}_{ij}(x_j^k-x_i^k)-\rho ^k \nabla f_i(x_i^k). \end{aligned}$$
(4)

Algorithm 1 has at least two advantages. First, each node i in Algorithm 1 uses only the binary information of the relative state \((x_j^k-x_i^k)\), instead of the exact relative state from each of its neighbors j, which is essential in cases where \(\text {sgn}(x_j^k-x_i^k)\) is the only available information. Second, Algorithm 1 does not require the adjacency matrix A to be doubly stochastic, whereas the associated adjacency matrix \(\tilde{A}\), where \([\tilde{A}]_{ij}:=\tilde{a}_{ij}\), must be doubly stochastic in DGD [16]. This is very restrictive in the distributed setting.

Remark 1

Algorithm 1 also works if x is a vector by applying \(\text {sgn}(\cdot )\) to each element of the relative state vector. All the results on the scalar case continue to hold with such an adjustment.

3.2 Penalty Method Interpretation of Algorithm 1

In this subsection, we interpret Algorithm 1 via the penalty method and show that it is the subgradient iteration of a penalized optimization problem.

Notice that problem (2) can be essentially reformulated as follows:

$$\begin{aligned} \mathop {\text {minimize}}_{\mathbf {x}\in \mathbb {R}^n}\ g(\mathbf {x}):=\sum _{i=1}^n f_i(x_i)\quad \text {subject to}\ x_1=x_2=\cdots =x_n \end{aligned}$$
(5)

where \(\mathbf {x}=[x_1,...,x_n]^\mathsf {T}\). It is easy to see that the optimal value of problem (5) is also \(f^\star \), and the set of optimal solutions is \(\{x^\star {\mathbbm {1}}|x^\star \in \mathscr {X}^\star \}\).

Define a penalty function by

$$\begin{aligned} h(\mathbf {x})=\frac{1}{2}\sum _{i=1}^n\sum _{j\in \mathscr {N}_i}a_{ij}|x_i-x_j|. \end{aligned}$$
(6)

If the associated network \(\mathscr {G}\) is connected, then \(h(\mathbf {x})=0\) if and only if \(x_i=x_j\) for all \(i,j\in \{1,...,n\}\). Thus, a penalized optimization problem of (5) can be given as

$$\begin{aligned} \mathop {\text {minimize}}_{\mathbf {x}\in \mathbb {R}^n}\ \widetilde{f}_\lambda (\mathbf {x}):=g(\mathbf {x})+\lambda h(\mathbf {x}) \end{aligned}$$
(7)

where \(\lambda >0\) is the penalty factor.

We show below that Algorithm 1 is just the subgradient iteration of the penalized problem (7) with stepsizes \(\rho ^k\). Recall that \(\text {sgn}(x)\) is a subgradient of |x| for any \(x\in \mathbb {R}\). It follows from (6) that a subgradient \(\nabla h(\mathbf {x})=[\nabla h(\mathbf {x})_1,...,\nabla h(\mathbf {x})_n]^\mathsf {T}\) of \(h(\mathbf {x})\) is given element-wise by

$$\begin{aligned} \nabla h(\mathbf {x})_i&=\sum _{j\in \mathscr {N}_i}a_{ij}\text {sgn}(x_i-x_j),\ i\in \mathscr {V}. \end{aligned}$$

Similarly, a subgradient \(\nabla g(\mathbf {x})=[\nabla g(\mathbf {x})_1,...,\nabla g(\mathbf {x})_n]^\mathsf {T}\) of \(g(\mathbf {x})\) is given element-wise by \( \nabla g(\mathbf {x})_i=\nabla f_i(x_i). \) Then, the i-th element of a subgradient of \(\widetilde{f}_\lambda (\mathbf {x})\) is given as

$$\begin{aligned} \nabla \widetilde{f}_\lambda (\mathbf {x})_i=\lambda \sum _{j\in \mathscr {N}_i}a_{ij}\text {sgn}(x_i-x_j)+\nabla f_i(x_i), i\in \mathscr {V}. \end{aligned}$$

Finally, the subgradient method for solving (7) is given as

$$\begin{aligned} \mathbf {x}^{k+1}=\mathbf {x}^k-\rho ^k\nabla \widetilde{f}_\lambda (\mathbf {x}^k), \end{aligned}$$

which is exactly the vector form of the local update in Algorithm 1. By [2], it follows that the subgradient method converges to an optimal solution of problem (7) if \(\rho ^k\) is appropriately chosen.

For a finite \(\lambda >0\), the optimization problems (5) and (7) are generally not equivalent. Under mild conditions, our main result shows that they actually become equivalent if the penalty factor \(\lambda \) is strictly greater than an explicit lower bound. To this end, we define

$$\begin{aligned} v(\mathbf {x})=\max _{i\in \mathscr {V}}x_i-\min _{i\in \mathscr {V}}x_i,\quad \bar{x}=\frac{1}{n}{\mathbbm {1}}^\mathsf {T}\mathbf {x}, \end{aligned}$$
(8)

and let \(a_\text {min}^{(l)}\) be the sum of the l smallest edges’ weights, i.e.,

$$\begin{aligned} a_\text {min}^{(l)}=\sum _{e=1}^l a_{(e)} \end{aligned}$$
(9)

where \(a_{(1)},a_{(2)}, \ldots \) are the positive weights \(a_{ij}\) of the edges \((i,j)\in \mathscr {E}\), sorted in ascending order.

Theorem 1

(Lower bound for the penalty factor, [28]) Suppose that Assumption 1 holds, and that the multi-agent network is l-connected. If the penalty factor is selected as

$$\begin{aligned} \lambda >\underline{\lambda }:=\frac{n c}{2a_\text {min}^{(l)}}, \end{aligned}$$
(10)

where c and \(a_\text {min}^{(l)}\) are defined in (3) and (9), then

  1. (a)

    The optimization problems (2) and (7) are equivalent in the sense that the set of optimal solutions and optimal value of (7) are given by \( \widetilde{\mathscr {X}}^\star =\{x^\star {\mathbbm {1}}|x^\star \in \mathscr {X}^\star \}\) and \(f^\star \), respectively.

  2. (b)

    For any \(\mathbf {x} \notin \{\alpha {\mathbbm {1}}| \alpha \in \mathbb {R}\}\), it holds that

    $$\begin{aligned} \Vert \nabla \widetilde{f}_\lambda (\mathbf {x})\Vert _\infty \ge \frac{2\lambda a^{(l)}_\text {min}}{n}- c. \end{aligned}$$

Proof

(of part (a))  Consider the inequalities below

$$\begin{aligned} \widetilde{f}_\lambda (\mathbf {x})&amp;=g(\mathbf {x})+\lambda h(\mathbf {x}) \\&amp;\ge g(\bar{x}{\mathbbm {1}})+(\mathbf {x}-\bar{x}{\mathbbm {1}})^\mathsf {T}\nabla g(\bar{x}{\mathbbm {1}})+\lambda h(\mathbf {x})\nonumber \\&amp;\ge f(\bar{x})-\Vert \mathbf {x}-\bar{x}{\mathbbm {1}}\Vert \Vert \nabla g(\bar{x}{\mathbbm {1}})\Vert +\lambda h(\mathbf {x})\nonumber \end{aligned}$$
(11)

where the equality follows from the definition of \(\widetilde{f}_\lambda (\mathbf {x})\), the first inequality is from (1), and the second inequality results from the Cauchy–Schwarz inequality [2] as well as the fact that \(g(a{\mathbbm {1}})=f(a)\).

Then, we can show that

$$\begin{aligned} h(\mathbf {x})\ge a_\text {min}^{(l)}v(\mathbf {x}). \end{aligned}$$
(12)

Since the multi-agent network is l-connected, it follows from Menger’s theorem [8] that there exist at least l disjoint paths (two paths are disjoint if they have no common edge) between any two nodes of the graph. Therefore, letting \(x_\text {max}\) and \(x_\text {min}\) be two nodes associated with the maximum element and the minimum element of \(\mathbf {x}\), respectively, we can find l disjoint paths from \(x_\text {max}\) to \(x_\text {min}\).

Let \(x_{(p,1)},...,x_{(p,n_p)}\) denote the nodes of path p in order, where \(n_p\) is the number of nodes in path p, and \(x_{(p,1)}=x_\text {max}\), \(x_{(p,n_p)}=x_\text {min}\) for all \(p\in \{1,...,l\}\). Since these l paths are disjoint, it follows that

$$\begin{aligned} h(\mathbf {x})&\ge \sum _{p=1}^{l}\sum _{i=1}^{n_p-1}a_{(p,i,i+1)}|x_{(p,i)}-x_{(p,i+1)}| \\&\ge \sum _{p=1}^{l}\sum _{i=1}^{n_p-1}\min _{i}a_{(p,i,i+1)}|x_{(p,i)}-x_{(p,i+1)}| \nonumber \\&\ge \sum _{p=1}^{l}\min _{i}a_{(p,i,i+1)}\sum _{i=1}^{n_p-1}(x_{(p,i)}-x_{(p,i+1)}) \nonumber \\&\ge \sum _{p=1}^{l}\min _{i}a_{(p,i,i+1)}(x_\text {max}-x_\text {min})\ge a_\text {min}^{(l)}v(\mathbf {x})\nonumber \end{aligned}$$
(13)

where \(a_{(p,i,i+1)}\) is the weight of the edge connecting nodes \(x_{(p,i)}\) and \(x_{(p,i+1)}\).

Letting \(\tilde{x}=\frac{1}{2}(\max _i(x_i)+\min _i(x_i))\), we have

$$\begin{aligned} \Vert \mathbf {x}-\bar{x}{\mathbbm {1}}\Vert \Vert \nabla g(\bar{x}{\mathbbm {1}})\Vert&\le \Vert \mathbf {x}-\tilde{x}{\mathbbm {1}}\Vert \Vert \nabla g(\bar{x}{\mathbbm {1}})\Vert \\&\le \sqrt{n}\Vert \mathbf {x}-\tilde{x}{\mathbbm {1}}\Vert _\infty \cdot \sqrt{n}\Vert \nabla g(\bar{x}{\mathbbm {1}})\Vert _\infty \nonumber \\&\le \frac{n c}{2}v(\mathbf {x}).\nonumber \end{aligned}$$
(14)

where the first inequality follows from the fact that \(\bar{x}\) minimizes \(\Vert \mathbf {x}-\alpha {\mathbbm {1}}\Vert \) with respect to (w.r.t.) \(\alpha \) for all \(\mathbf {x}\). Equations (11), (12) and (14) jointly imply the following inequality

$$\begin{aligned}&\widetilde{f}_\lambda (\mathbf {x})-f^\star \ge f(\bar{x})-f^\star +(\lambda a_\text {min}^{(l)}-\frac{ cn}{2})v(\mathbf {x}). \end{aligned}$$
(15)

Since \(\lambda >{n c}/({2a_\text {min}^{(l)}})\), \(v(\mathbf {x})\ge 0\) for all \(\mathbf {x}\in \mathbb {R}^n\), and \(f(\bar{x})\ge f^\star \) for all \(\bar{x}\in \mathbb {R}\), the right-hand side of (15) is nonnegative. That is, \(\widetilde{f}_\lambda (\mathbf {x})\ge f^\star \) for all \(\mathbf {x} \in \mathbb {R}^n\).

Moreover, it follows from (7) that \(\widetilde{f}_\lambda (x^\star {\mathbbm {1}})=f^\star \) for any \(x^\star \in \mathscr {X}^\star \), i.e., \(\widetilde{f}_\lambda (\mathbf {x})=f^\star \) for any \(\mathbf {x} \in \widetilde{\mathscr {X}}^\star \). What remains to be shown is that \(\widetilde{f}_\lambda (\mathbf {x})> f^\star \) for all \(\mathbf {x}\notin \widetilde{\mathscr {X}}^\star \), which includes

Case (a): \(\mathbf {x}\ne \alpha {\mathbbm {1}}\) for any \(\alpha \in \mathbb {R}\);

Case (b): \(\mathbf {x}= \alpha {\mathbbm {1}}\) for some \(\alpha \notin \mathscr {X}^\star \).

For Case (a), \(v(\mathbf {x})\) is strictly positive, and hence we know that \(\widetilde{f}_\lambda (\mathbf {x})> f^\star \) from (15). For Case (b), we have \(v(\mathbf {x})=0\). By (15) we have that \(\widetilde{f}_\lambda (\mathbf {x})\ge f(\bar{x})=f(\alpha )>f^\star \). Thus, \(\widetilde{f}_\lambda (\mathbf {x})> f^\star \) for all \(\mathbf {x}\notin \widetilde{\mathscr {X}}^\star \), which completes the proof of part (a).

The proof of part (b) is more involved; the interested reader is referred to [28] for details. \(\square \)

Algorithm 1 can also be modified to deal with objective functions with unbounded subgradients, e.g., quadratic functions; see [28] for details. Theorem 1 provides a sufficient condition for the equivalence between problems (5) and (7), and allows us to focus only on problem (7). Notice that this result is nontrivial even though the penalty method has been widely studied in optimization theory [2]. For example, a well-known result is that the gap between the optimal values of the penalized problem (7) and problem (5) shrinks as \(\lambda \) grows, which however cannot always guarantee the existence of a finite penalty factor \(\lambda \) that eliminates the gap. Moreover, an excessively large \(\lambda \) may degrade the transient performance of Algorithm 1.

Remark 2

It is worth mentioning that (10) in Theorem 1 also holds for the multidimensional case if Assumption 1(b) is replaced with \(\Vert \nabla f_i({\mathbf {x}})\Vert \le c\) for all i and \(\mathbf {x}\).

In view of duality theory [2], a potential lower bound for \(\lambda \) could be the absolute value of the associated Lagrange multiplier. However, a Lagrange multiplier usually cannot be obtained before solving the dual problem. Theorem 1 gives an explicit lower bound for \(\lambda \) in terms of the network size and its connectivity, and it is tighter than the bounds in [9] and [12].

In fact, the lower bound in (10) can be tight in some cases, as shown in the following example. Note also that [9] does not consider a generic optimization problem.

Example 1

([28]) Consider the graph in Fig. 1b with unit edge weights, i.e., \(a_{ij}=1\) for all \((i,j)\in \mathscr {E}\). Let \(f_1(x)=|x|,f_2(x)=|x-2|,f_3(x)=|x-4|,f_4(x)=|x-6|\) and \(f(x)=\sum _{i=1}^4f_i(x)\). It is not difficult to compute that the optimal value of f(x) is 8 and the set of optimal solutions is the closed interval [2, 4]. By (7), the corresponding penalized problem is given as

$$ \begin{aligned} \widetilde{f}_\lambda (\mathbf {x})&=|x_1|+|x_2-2|+|x_3-4|+|x_4-6|+ \\&~~~\lambda (|x_1-x_2|+|x_2-x_3|+|x_3-x_4|+|x_4-x_1|). \end{aligned} $$

Theorem 1 implies that \(\widetilde{f}_\lambda (\mathbf {x})\) has the same optimal value as f(x) and the set of optimal solutions is \(\widetilde{\mathscr {X}}^\star =\{x^{\star } {\mathbbm {1}}|x^{\star }\in [2, 4]\}\), provided that \(\lambda >{4\cdot 1}/({2\cdot 2})=1\).

Given any \(\lambda \le 1\), consider \(\mathbf {x} =[2,2,4,4]^\mathsf {T}\notin \widetilde{\mathscr {X}}^\star \). Clearly,

$$\widetilde{f}_\lambda (\mathbf {x})=4+4\lambda \le f^\star =8,$$

which implies that the set of optimal solutions of the penalized problem is not \(\widetilde{\mathscr {X}}^\star \). Thus for any \(\lambda \le 1\), the original problem f(x) cannot be solved via the penalized problem \(\widetilde{f}_\lambda (\mathbf {x})\), and the lower bound in (10) is tight in this example. \(\square \)
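Example 1 is easy to verify numerically; the sketch below evaluates \(\widetilde{f}_\lambda \) at the non-consensus point \([2,2,4,4]^\mathsf {T}\) for values of \(\lambda \) on both sides of the lower bound \(\underline{\lambda }=1\):

```python
def f_tilde(x, lam):
    """Penalized objective of Example 1 on the 4-cycle with unit weights."""
    x1, x2, x3, x4 = x
    g = abs(x1) + abs(x2 - 2) + abs(x3 - 4) + abs(x4 - 6)
    h = abs(x1 - x2) + abs(x2 - x3) + abs(x3 - x4) + abs(x4 - x1)
    return g + lam * h

f_star = 8.0                     # optimal value of f = sum_i f_i
x_bad = [2.0, 2.0, 4.0, 4.0]     # a non-consensus point

# lam = 1 (the lower bound itself): x_bad already matches f_star, so the
# penalized problem has non-consensus optimal points and is NOT
# equivalent to (2).
print(f_tilde(x_bad, 1.0))       # 8.0
# lam = 1.5 (> lower bound): x_bad is strictly worse than f_star,
# consistent with Theorem 1.
print(f_tilde(x_bad, 1.5))       # 10.0
print(f_tilde([3.0] * 4, 1.5))   # 8.0, a consensus optimal point
```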

Fig. 1 Some graphs

The lower bound in (10) has a simple form, and \(a_{\min }^{(l)}\) cannot easily be replaced. One may consider using the minimum degree of the network, i.e., \(d_{m}=\min _{i\in \mathscr {V}} \sum _{j=1}^n a_{ij}\), but this fails in some cases. Consider the 1-connected graph in Fig. 1c with unit edge weights. Then, \(a_{\min }^{(1)}=1\) and \(d_{m}=2\). Let \([s_1,...,s_6]=[1,2,3,4,5,6]\) and \(f_i(x)=|x-s_i|,\ \forall i\in \{1,...,6\}\). Set

$$\mathbf {x} =[x_1,...,x_6]^\mathsf {T}=[3,3,3,4,4,4]^\mathsf {T}.$$

Then, using similar arguments as in Example 1, one can infer that the lower bound \(\underline{\lambda }\) in (10) cannot be reduced to \({n c}/(2d_{m})=3/2\).

A similar penalty method interpretation of (4) with constant \(\rho ^k\) is provided in [14], where the penalty function is chosen as

$$\mathbf {x}^\mathsf {T}L\mathbf {x}=\frac{1}{2}\sum _{i,j} a_{ij}(x_i-x_j)^2$$

and L is the graph Laplacian matrix. However, such a quadratic penalty function cannot always guarantee the existence of a finite \(\lambda \) for the equivalence of the two problems. We provide a concrete example to illustrate this.

Example 2

Consider the graph in Fig. 1a with unit edge weights. Let \(f_1(x)=(x-1)^2\) and \(f_2(x)=(x-3)^2\). Clearly, the optimal solution of \(f(x)=f_1(x)+f_2(x)\) is \(x^\star =2\). Then a corresponding penalized problem using \(x^\mathsf {T}L x\) is

$$\begin{aligned} \mathop {\text {minimize}}_{x_1, x_2\in \mathbb {R}}\ f_1(x_1)+f_2(x_2)+\lambda (x_1-x_2)^2. \end{aligned}$$
(16)

The optimal solution of (16) is \(x_1^\star =(1+4\lambda )/(1+2\lambda )\) and \(x_2^\star =(3+4\lambda )/(1+2\lambda )\), and there does not exist a finite value of \(\lambda \) which makes both of them equal to \(x^\star =2\). \(\square \)
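A quick numeric check of Example 2 (the chosen \(\lambda \) values are arbitrary) confirms that the consensus gap \(x_2^\star -x_1^\star =2/(1+2\lambda )\) is positive for every finite \(\lambda \):

```python
def quad_penalty_solution(lam):
    """Stationary point of f1(x1) + f2(x2) + lam*(x1 - x2)^2 from
    Example 2, obtained by setting both partial derivatives to zero."""
    x1 = (1 + 4 * lam) / (1 + 2 * lam)
    x2 = (3 + 4 * lam) / (1 + 2 * lam)
    return x1, x2

for lam in [1.0, 10.0, 1000.0]:
    x1, x2 = quad_penalty_solution(lam)
    # The gap x2 - x1 = 2/(1 + 2*lam) is positive for every finite lam,
    # so (16) is never exactly equivalent to minimizing f.
    assert abs((x2 - x1) - 2 / (1 + 2 * lam)) < 1e-12
    assert x2 > x1

print(quad_penalty_solution(1.0))  # x1 = 5/3, x2 = 7/3: neither equals 2
```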

By [2], \({\mathbf {x}^\star }\) is an optimal solution of (7) if and only if \(0\in \partial \widetilde{f}_\lambda ({\mathbf {x}^\star })\). Part (b) of Theorem 1 shows that for any \(\mathbf {x} \notin \{\alpha {\mathbbm {1}}| \alpha \in \mathbb {R}\}\), the norm of the corresponding subgradient is uniformly greater than a positive lower bound, which clearly shows non-optimality of \({\mathbf {x}}\).

4 Convergence Analysis of Algorithm 1 Over Static Graphs

In this section, we examine the convergence behavior of Algorithm 1 over static graphs. If \(\rho ^k\) is diminishing, all agents converge to the same optimal solution of problem (2) under Algorithm 1. With a constant stepsize, all agents eventually converge to a neighborhood of an optimal solution, where the error size is proportional to the stepsize. For both cases, we perform the non-asymptotic analysis to quantify their convergence rates.

Before providing the convergence results of \(\{\mathbf {x}^k\}\), we recall from Proposition A.4.6 in [2] a well-known result on the convergence of a sequence of vectors.

Lemma 1

([2]) Let \(\mathscr {X}^\star \) be a nonempty subset of \(\mathbb {R}^n\), and let \(\{\mathbf {x}^k\}\subset \mathbb {R}^n\) be a sequence satisfying, for some \(p>0\) and for all k,

$$\begin{aligned} \Vert \mathbf {x}^{k+1}-\mathbf {x}^\star \Vert ^p\le \Vert \mathbf {x}^k-\mathbf {x}^\star \Vert ^p-\gamma ^k\phi (\mathbf {x}^k)+\delta ^k,\ \forall \mathbf {x}^\star \in \mathscr {X}^\star , \end{aligned}$$

where \(\{\gamma ^k\}\) and \(\{\delta ^k\}\) are nonnegative sequences satisfying

$$\begin{aligned} \sum _{k=0}^{\infty }\gamma ^k=\infty ,\ \sum _{k=0}^{\infty }\delta ^k<\infty . \end{aligned}$$

Suppose that \(\phi (\cdot )\) is continuous, nonnegative, and satisfies \(\phi (\mathbf {x})=0\) if and only if \(\mathbf {x}\in \mathscr {X}^\star \). Then \(\{\mathbf {x}^k\}\) converges to an optimal point in \(\mathscr {X}^\star \).

The first result in this section concerns the convergence of Algorithm 1 under a diminishing stepsize, which satisfies the following assumption.

Assumption 2

The sequence \(\{\rho ^k\}\) satisfies

$$ \sum _{k=0}^\infty \rho ^k=\infty , \text { and}\quad \sum _{k=0}^\infty (\rho ^k)^2<\infty . $$

The proof of convergence of Algorithm 1 under Assumption 2 is given below.

Theorem 2

(Convergence, [28]) Suppose that the conditions in Theorem 1 and Assumption 2 hold. Let \(\{\mathbf {x}^k\}\) be generated by Algorithm 1. Then, there exists some optimal point \(x^\star \in \mathscr {X}^\star \) such that \(\lim _{k\rightarrow \infty }\mathbf {x}^k=x^\star {\mathbbm {1}}.\)

Proof

Under Assumption 1, we have that

$$\begin{aligned} \Vert \nabla \widetilde{f}_\lambda (\mathbf {x})\Vert \le c_a, \forall \mathbf {x}\in \mathbb {R}^n \end{aligned}$$
(17)

where \(c_a= \sqrt{n}( c+\lambda \Vert A\Vert _\infty )\). Since Algorithm 1 is the exact iteration of the subgradient method of problem (7), this implies that

$$\begin{aligned} \Vert \mathbf {x}^{k+1}-x^\star {\mathbbm {1}}\Vert ^2&amp;=\Vert \mathbf {x}^{k}-x^\star {\mathbbm {1}}\Vert ^2-2\rho ^k(\mathbf {x}^k-x^\star {\mathbbm {1}})^\mathsf {T}\nabla \widetilde{f}_\lambda (\mathbf {x}^k)+(\rho ^k)^2\Vert \nabla \widetilde{f}_\lambda (\mathbf {x}^k)\Vert ^2 \\&amp;\le \Vert \mathbf {x}^{k}-x^\star {\mathbbm {1}}\Vert ^2-2\rho ^k\big (\widetilde{f}_\lambda (\mathbf {x}^k)-\widetilde{f}_\lambda (x^\star {\mathbbm {1}})\big )+(\rho ^k)^2c_a^2\nonumber \\&amp;\le \Vert \mathbf {x}^{k}-x^\star {\mathbbm {1}}\Vert ^2-2\rho ^k\big (\widetilde{f}_\lambda (\mathbf {x}^k)-f^\star \big )+(\rho ^k)^2c_a^2\nonumber \end{aligned}$$
(18)

where the first inequality follows from (1) and (17), and the second inequality is from Theorem 1.

By virtue of Lemma 1 and Theorem 1, the result follows immediately. \(\square \)

Our next theorem provides a non-asymptotic bound on the convergence rate for \(\rho ^k=k^{-\alpha },\ \alpha \in (0.5,1]\). To this end, we first define

$$\begin{aligned} d(\mathbf {x})=\min _{x^\star \in \mathscr {X}^\star }\Vert \mathbf {x}-x^\star {\mathbbm {1}}\Vert . \end{aligned}$$
(19)

Then, it follows from (8) that

$$\begin{aligned} v^k:=v(\mathbf {x}^k)=\max _{i\in \mathscr {V}}x_i^k-\min _{i\in \mathscr {V}}x_i^k,\quad \bar{x}^k:=\frac{1}{n}{\mathbbm {1}}^\mathsf {T}\mathbf {x}^k. \end{aligned}$$

Clearly, \(d(\mathbf {x})\) is the distance between \(\mathbf {x}\) and the set of optimal solutions, \(v^k\) is the maximum divergence between agents' states at time k, and \(\bar{x}^k\) is the mean of all agents' states at time k. Intuitively, we can use the rates at which \(f(\bar{x}^k)\) approaches \(f^\star \) and \(v^k\) decreases to 0 to evaluate the convergence rate of Algorithm 1.

Theorem 3

Suppose that the conditions in Theorem 1 hold, and let \(\{\mathbf {x}^k\}\) be generated by Algorithm 1. If \(\rho ^k={k^{-\alpha }}, \alpha \in (0.5,1]\), then

$$\begin{aligned} \min _{1< k\le {\bar{k}}}f(\bar{x}^k)-f^\star&\le \frac{(2\alpha -1)d(\mathbf {x}^0)^2+2\alpha c_a^2}{2(2\alpha -1)s(\bar{k})}, \\ \min _{1< k\le {\bar{k}}}v(\mathbf {x}^k)&\le \frac{(2\alpha -1)d(\mathbf {x}^0)^2+2\alpha c_a^2}{(2\lambda a_\text {min}^{(l)}- cn)(2\alpha -1)s(\bar{k})},\nonumber \end{aligned}$$
(20)

where \(\mathbf {x}^0\) is the initial point, \(\bar{x}^k\) and \(v(\mathbf {x}^k)\) are defined in (8), and

$$\begin{aligned} s(\bar{k})=\int _1^{\bar{k}}x^{-\alpha }dx=\left\{ \begin{array}{ll} \frac{\bar{k}^{1-\alpha }-1}{1-\alpha }, &amp; \text {if }\alpha \in (0.5,1), \\ \text {ln}(\bar{k}), &amp; \text {if }\alpha =1. \end{array}\right. \end{aligned}$$

Proof

By Theorem 2, \(\{\mathbf {x}^k\}\) is a convergent sequence. For any \(x^\star \in \mathscr {X}^\star \), it follows from (18) that

$$\nonumber 2\rho ^k(\widetilde{f}_\lambda (\mathbf {x}^k)-f^\star )\le \Vert \mathbf {x}^{k}-x^\star {\mathbbm {1}}\Vert ^2-\Vert \mathbf {x}^{k+1}-x^\star {\mathbbm {1}}\Vert ^2+(\rho ^k)^2c_a^2. $$

Summing the above relation over \(k\in \{1,...,\bar{k}\}\) yields

$$\begin{aligned} 2\sum _{k=1}^{\bar{k}}\rho ^k(\widetilde{f}_\lambda (\mathbf {x}^k)-f^\star )&\le \Vert \mathbf {x}^{0}-x^\star {\mathbbm {1}}\Vert ^2-\Vert \mathbf {x}^{{\bar{k}}+1}-x^\star {\mathbbm {1}}\Vert ^2+\sum _{k=1}^{\bar{k}}(\rho ^k)^2c_a^2 \\&\le d(\mathbf {x}^0)^2+\sum _{k=1}^{\bar{k}}(\rho ^k)^2c_a^2 \end{aligned}$$

where the last inequality holds by choosing \(x^\star =\text {argmin}_{x\in \mathscr {X}^\star }\Vert \mathbf {x}^0-x{\mathbbm {1}}\Vert \). Then, we arrive at

$$\begin{aligned} \min _{0\le k\le {\bar{k}}}\widetilde{f}_\lambda (\mathbf {x}^k)-f^\star \le \frac{d(\mathbf {x}^0)^2+\sum _{k=1}^{\bar{k}}(\rho ^k)^2c_a^2}{2\sum _{k=1}^{\bar{k}}\rho ^k}. \end{aligned}$$
(21)

Since \(\int _1^{\bar{k}}x^{-\alpha }dx<\sum _{k=1}^{\bar{k}} k^{-\alpha }<\int _1^{\bar{k}}x^{-\alpha }dx+1,\) we have that

$$\sum _{k=1}^{\bar{k}}(\rho ^k)^2<\int _1^{\bar{k}}x^{-2\alpha }dx+1=\frac{1-\bar{k}^{1-2\alpha }}{2\alpha -1}+1<\frac{2\alpha }{2\alpha -1},$$

and \(\sum _{k=1}^{\bar{k}}\rho ^k>\int _1^{\bar{k}}x^{-\alpha }dx=s(\bar{k})\). Using the above and (21) leads to

$$\begin{aligned} \min _{0\le k\le {\bar{k}}}\widetilde{f}_\lambda (\mathbf {x}^k)-f^\star \le \frac{(2\alpha -1)d(\mathbf {x}^0)^2+2\alpha c_a^2}{2(2\alpha -1)s(\bar{k})}. \end{aligned}$$
(22)

Since \(f(\bar{x}^k)-f^\star \ge 0\) and \(\lambda a_\text {min}^{(l)}-\frac{1}{2} cn>0\), it follows from (15) and (22) that (20) holds. \(\square \)

The first inequality in (20) quantifies the decreasing rate of the gap between \(f(\bar{x}^k)\) and the optimal value \(f^\star \), while the second one shows that the largest difference between agents’ states is reduced at a comparable rate. Thus, Theorem 3 reveals that the convergence rate lies between \(O(1/\text {ln}(k))\) and \(O(\text {ln}(k)/\sqrt{k})\), depending on the choice of \(\rho ^k\).

We also provide an alternative evaluation of the convergence rate, in a robust form that bounds the worst objective value over all agents; it is presented in the following Corollary 1.

Corollary 1

(Non-asymptotic convergence, [28]) Suppose that the conditions in Theorem 3 hold. Then

$$\begin{aligned} \min _{1< k\le {\bar{k}}}\max _{i\in \mathscr {V}}f(x_i^k)-f^\star&\le \frac{(2\alpha -1)d(\mathbf {x}^0)^2+2\alpha c_a^2}{2(2\alpha -1)s(\bar{k})} \end{aligned}$$

where all notations are the same as those in Theorem 3.

Proof

For all k and any \(x_m\in [\min _{i\in \mathscr {V}}x_i^k,\max _{i\in \mathscr {V}}x_i^k]\), it follows from (11) that

$$\begin{aligned} f({x_m})&\le \widetilde{f}_\lambda (\mathbf {x}^k)-\lambda h(\mathbf {x}^k)+\Vert \mathbf {x}^k-x_m{\mathbbm {1}}\Vert \Vert \nabla g(x_m{\mathbbm {1}})\Vert \end{aligned}$$

which together with

$$\begin{aligned} \Vert \mathbf {x}^k-x_m{\mathbbm {1}}\Vert \Vert \nabla g(x_m{\mathbbm {1}})\Vert&\le \sqrt{n}\Vert \mathbf {x}^k-x_m{\mathbbm {1}}\Vert _\infty \cdot \sqrt{n}\Vert \nabla g(x_m{\mathbbm {1}})\Vert _\infty \le ncv(\mathbf {x}^k) \end{aligned}$$

and (13) yields that

$$\begin{aligned} f({x_m})&\le \widetilde{f}_\lambda (\mathbf {x}^k)-\lambda h(\mathbf {x}^k)+\frac{nc}{a_\text {min}^{(l)}}h(\mathbf {x}^k) \\&=g(\mathbf {x}^k)+\frac{nc}{a_\text {min}^{(l)}}h(\mathbf {x}^k) \\&\le \widetilde{f}_{2\lambda }(\mathbf {x}^k) \end{aligned}$$

where the last inequality follows from \(\lambda >nc/(2a_{\min }^{(l)})\).

Noting that (22) implies

$$\begin{aligned} \min _{0\le k\le {\bar{k}}}\widetilde{f}_{2\lambda }(\mathbf {x}^k)-f^\star \le \frac{(2\alpha -1)d(\mathbf {x}^0)^2+2\alpha c_a^2}{2(2\alpha -1)s(\bar{k})} \end{aligned}$$

the result follows immediately. \(\square \)

If f(x) is non-differentiable, the objective function of the classical distributed algorithm (4) converges at a rate of \(O(\text {ln}(k)/\sqrt{k})\) when \(\rho ^k={1}/{\sqrt{k}}\) [20], which is comparable to Algorithm 1 as \(\alpha \) approaches 0.5. Thus, using only the sign of the relative state essentially does not reduce the convergence rate. However, if f(x) is smoother, e.g., differentiable or strongly convex, Algorithm 1 may converge more slowly than (4).

For a constant stepsize, Algorithm 1 approaches a neighborhood of an optimal solution at a rate of \(O(1/k)\), with an error size proportional to the stepsize. These results are formally stated in Theorems 4 and 5.

Theorem 4

(Constant Stepsize, [28]) Suppose that the conditions in Theorem 1 hold, and let \(\{\mathbf {x}^k\}\) be generated by Algorithm 1. If \(\rho ^k=\rho \), then

$$\begin{aligned}&\limsup _{k\rightarrow \infty }d(\mathbf {x}^k)\le 2\sqrt{n}\max \left\{ \widetilde{d}(\rho ),\frac{\rho c_a^2}{2\lambda a_\text {min}^{(l)}- cn}\right\} +\rho c_a \end{aligned}$$

where \(\widetilde{\mathscr {X}}(\rho )=\{x|f(x)\le f^\star +{\rho c_a^2}/{2}\}\) and \(\widetilde{d}(\rho )=\max _{x\in \widetilde{\mathscr {X}}(\rho )}d(x)<\infty \).

Proof

See the Appendix. \(\square \)

In Theorem 4, \(\widetilde{d}(0)=0\) and \(\widetilde{d}(\rho )\) is increasing in \(\rho \). Thus, Algorithm 1 under a constant stepsize finally approaches a neighborhood of \(x^\star {\mathbbm {1}}\) for some \(x^\star \in \mathscr {X}^\star \), the size of which decreases to zero as \(\rho \) tends to zero. If the order of growth of f near the set of optimal solutions is available, then \(\widetilde{d}(\rho )\) can even be determined explicitly, which is illustrated in Corollary 2.

Corollary 2

([28]) Suppose that the conditions in Theorem 4 hold, and that f(x) satisfies

$$\begin{aligned} f(x)-f^\star \ge \gamma (d(x))^\alpha \end{aligned}$$

where \(\gamma >0\) and \(\alpha \ge 1\). Then, it holds that

$$\begin{aligned} \limsup _{k\rightarrow \infty }d(\mathbf {x}^k)\le 2\sqrt{n}\max \left\{ \left( \frac{\rho c_a^2}{2\gamma }\right) ^\frac{1}{\alpha },\frac{\rho c_a^2}{2\lambda a_\text {min}^{(l)}- cn}\right\} +\rho c_a \end{aligned}$$

Proof

Noting that \(\widetilde{d}(\rho )\le ({\rho c_a^2}/{2\gamma })^\frac{1}{\alpha }\), the result follows directly from Theorem 4. \(\square \)

The following theorem evaluates the convergence rate when the stepsize is set to be constant.

Theorem 5

([28]) Suppose that the conditions in Theorem 4 hold. Then

$$\begin{aligned} \min _{0\le k\le \bar{k}}f(\bar{x}^k)-f^\star&\le \frac{\rho c_a^2}{2}+\frac{d(\mathbf {x}^0)^2}{2\rho \bar{k}}, \\ \min _{0\le k\le \bar{k}}v(\mathbf {x}^k)&\le \frac{\rho c_a^2}{2\lambda a_\text {min}^{(l)}- cn}+\frac{d(\mathbf {x}^0)^2}{\rho \bar{k}(2\lambda a_\text {min}^{(l)}- cn)}.\nonumber \end{aligned}$$
(23)

Proof

From (21) we know that

$$\begin{aligned} \min _{0\le k\le {\bar{k}}}\widetilde{f}_\lambda (\mathbf {x}^k)-f^\star \le \frac{d(\mathbf {x}^0)^2+\bar{k}\rho ^2c_a^2}{2\rho \bar{k}}, \end{aligned}$$

which together with (15) implies the result. \(\square \)

Remark 3

The following conclusions can be easily drawn from Theorem 5.

(a)

    \(\min _{0\le k\le \bar{k}}f(\bar{x}^k)\) approaches the interval \([f^\star ,f^\star +\frac{\rho c_a^2}{2}]\) at a rate of \(O(1/\bar{k})\).

(b)

    Given \(\bar{k}\) iterations, let \(\rho =\frac{1}{c_a}\frac{d(\mathbf {x}^0)}{\sqrt{\bar{k}}}\), which minimizes the right-hand side of (23). Then

    $$ \begin{aligned} \min _{0\le k\le \bar{k}}f(\bar{x}^k)-f^\star&\le c_a\frac{d(\mathbf {x}^0)}{\sqrt{\bar{k}}}, \\ \min _{0\le k\le \bar{k}}v(\mathbf {x}^k)&\le \frac{c_a}{2\lambda a_\text {min}^{(l)}- cn}\frac{d(\mathbf {x}^0)}{\sqrt{\bar{k}}}. \end{aligned} $$

    In other words, given \(\bar{k}\) iterations, the multi-agent network only converges to a point close to an optimal solution, with an error of size \(O(\bar{k}^{-1/2})\).
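As a quick numerical check of item (b), a small sketch with hypothetical values for \(d(\mathbf {x}^0)\), \(c_a\) and \(\bar{k}\) confirms that the stepsize \(\rho =\frac{1}{c_a}\frac{d(\mathbf {x}^0)}{\sqrt{\bar{k}}}\) balances the two terms on the right-hand side of (23) and yields the bound \(c_a d(\mathbf {x}^0)/\sqrt{\bar{k}}\):

```python
import math

def optimal_constant_stepsize(d0, c_a, k_bar):
    """Stepsize rho = d0 / (c_a * sqrt(k_bar)) minimizing the bound in (23),
    together with the resulting bound rho*c_a^2/2 + d0^2/(2*rho*k_bar)."""
    rho = d0 / (c_a * math.sqrt(k_bar))
    bound = rho * c_a ** 2 / 2 + d0 ** 2 / (2 * rho * k_bar)
    return rho, bound

# Hypothetical values: d(x^0) = 10, c_a = 4, and k_bar = 10000 iterations.
rho, bound = optimal_constant_stepsize(10.0, 4.0, 10_000)
# Both terms of (23) equal 0.2 here, so the bound is c_a*d0/sqrt(k_bar) = 0.4.
```

With this choice the two terms of (23) coincide, which is exactly why the right-hand side is minimized.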


5 The Distributed Optimization Algorithm over Time-varying Graphs

When the graphs are time-varying, Algorithm 1 is revised and we provide the details in Algorithm 2. In this section, we study the convergence of Algorithm 2 over two types of time-varying graphs: uniformly jointly connected time-varying graphs and randomly activated graphs.
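Algorithm 2 is essentially Algorithm 1 with time-varying weights \(a_{ij}^k\): its compact form is \(\mathbf {x}^{k+1}=\mathbf {x}^k-\rho ^k\mathbf {d}^k\) with \(d_i^k=-\lambda \sum _{j\in \mathscr {N}_i^k}a_{ij}^k\text {sgn}(x_j^k-x_i^k)+\nabla f_i(x_i^k)\). A minimal sketch of this update follows; the toy objective \(f_i(x)=|x-c_i|\), the two alternating subgraphs, the stepsize \(\rho ^k=(k+20)^{-0.8}\), and \(\lambda =10\) are all illustrative assumptions, not part of the algorithm.

```python
import numpy as np

def sign_step(x, A, grads, lam, rho):
    """One iteration x^{k+1} = x^k - rho * d^k, where
    d_i = -lam * sum_j a_ij * sgn(x_j - x_i) + grad f_i(x_i)."""
    d = np.array([-lam * np.dot(A[i], np.sign(x - x[i])) + grads[i]
                  for i in range(len(x))])
    return x - rho * d

# Toy problem: f_i(x) = |x - c_i|, so sum_i f_i is minimized at the median of c.
c = np.array([0.0, 1.0, 2.0])
x = np.array([-4.0, 3.0, 5.0])            # arbitrary initial states
# Two symmetric subgraphs, each disconnected alone but jointly connected (b = 2).
A1 = np.array([[0., 1., 0.], [1., 0., 0.], [0., 0., 0.]])
A2 = np.array([[0., 0., 1.], [0., 0., 1.], [1., 1., 0.]])
lam = 10.0                                 # penalty weight, large per Theorem 6
for k in range(1, 5001):
    A = A1 if k % 2 else A2                # alternate the two subgraphs
    x = sign_step(x, A, np.sign(x - c), lam, (k + 20) ** -0.8)
# x is now close to the consensus optimum (1, 1, 1), the median of c
```

Although each subgraph is disconnected, the agents reach consensus near the minimizer of \(\sum _i|x-c_i|\) because the union graph over every window of two steps is connected.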

5.1 Uniformly Jointly Connected Time-varying Graphs

Now we introduce the concept of uniformly jointly connected time-varying graphs. First we define the union of the graphs \(\mathscr {G}^{(k,b)}\) for integers \(k\ge 0\) and \(b>0\) below

$$\begin{aligned} \mathscr {G}^{(k,b)}=(\mathscr {V},\mathscr {E}^{(k,b)},A^{(k,b)}):=\mathscr {G}^k\cup \mathscr {G}^{k+1}\cup \cdots \cup \mathscr {G}^{k+b-1} \end{aligned}$$

and \(A^{(k,b)}\) is the associated adjacency matrix of \(\mathscr {G}^{(k,b)}\). We make the following assumption.

Assumption 3

Assume that

  1. (a)

    For some \(\eta >0\), it holds that

    $$\begin{aligned} \left\{ \begin{aligned}a_{ij}^k\ge \eta ,&\text { if }(i,j)\in \mathscr {E}^k, \\ a_{ij}^k=0,&\text { otherwise.} \end{aligned}\right. \end{aligned}$$
    (24)
  2. (b)

    There exists an integer \(b\ge 1\) such that \(A^{(tb,b)}\) is l-connected for each \(t=0,1,2,...\)

Assumption 3 is commonly made in dealing with deterministically time-varying graphs. The first part requires every edge that is present at time k to have a weight bounded below by the fixed constant \(\eta \). The second part requires the union graph over every time interval of length b to be connected. We call time-varying graphs satisfying Assumption 3 uniformly jointly connected graphs; they are also sometimes referred to as b-connected graphs [15, 17].
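To make this assumption concrete, the following sketch checks the b-connectivity of a graph sequence for the common case \(l=1\) (plain connectivity of each union graph); the two example subgraphs are hypothetical.

```python
import numpy as np

def is_connected(A):
    """Connectivity of an undirected graph by a depth-first search from node 0."""
    n = len(A)
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(n):
            if A[i][j] and j not in seen:
                seen.add(j)
                stack.append(j)
    return len(seen) == n

def uniformly_jointly_connected(graphs, b):
    """Check that the union graph over every window [tb, (t+1)b) is connected."""
    return all(is_connected(np.sum(graphs[t * b:(t + 1) * b], axis=0))
               for t in range(len(graphs) // b))

# Two subgraphs of a 3-node path: disconnected alone, connected jointly.
A1 = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]])
A2 = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0]])
seq = [A1, A2, A1, A2]
# uniformly_jointly_connected(seq, 2) holds, while b = 1 fails: neither
# subgraph is connected on its own.
```

For general \(l>1\), one would instead check l-connectivity of each union graph; the windowed-union construction is the same.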

We are now ready to present the convergence result of Algorithm 2 over uniformly jointly connected graphs.

Theorem 6

(Convergence, [26]) Suppose that Assumptions 1-3 hold, and that there exists a constant \(c_\rho >0\) such that for all \(k>0\),

$$\begin{aligned} \max _{t\in [k,k+b)} \rho ^t\le c_\rho \min _{t\in [k, k+b)} \rho ^t. \end{aligned}$$
(25)

Select

$$\begin{aligned} \lambda >\frac{nbcc_\rho }{2l\eta }, \end{aligned}$$

where n is the number of agents, c is given in Assumption 1, \(c_\rho \) is given in (25), and \(b, l,\eta \) are given in Assumption 3. Let \(\{\mathbf {x}^k\}\) be generated by Algorithm 2. Then, \(\lim _{k\rightarrow \infty }\mathbf {x}^k=x^\star {\mathbbm {1}}\) for some \(x^\star \in \mathscr {X}^\star \).

Proof

We first consider the subsequence \(\{\mathbf {x}^{tb},t=0,1,2,...\}\), i.e., we let \(k=tb,t\in \{0,1,2,...\}\). Define

$$\begin{aligned} \widetilde{f}_\lambda ^k(\mathbf {x}):=\frac{\lambda }{2}\sum _{i,j\in \mathscr {V}}a_{ij}^k|x_i-x_j|+\sum _{i=1}^n f_i(x_i) \end{aligned}$$

and

$$\begin{aligned} \widetilde{f}_\lambda ^{(k,b)}(\mathbf {x})&:=\frac{1}{ \rho ^k}\sum _{t=k}^{b+k-1} \rho ^t\widetilde{f}_\lambda ^t(\mathbf {x}) \\&=\frac{\lambda }{2 \rho ^k}\sum _{i,j\in \mathscr {V}}\sum _{t=k}^{b+k-1} \rho ^ta_{ij}^{t}|x_i-x_j|+\frac{1}{ \rho ^k}\sum _{t=k}^{b+k-1} \rho ^t\sum _{i=1}^n f_i(x_i) \\&=\bar{ \rho }^k\left[ \frac{\lambda }{2}\sum _{i,j\in \mathscr {V}}\bar{a}_{ij}^{k}|x_i-x_j|+\sum _{i=1}^n f_i(x_i)\right] \end{aligned}$$

where

$$ \bar{\rho }^k=\sum _{t=k}^{b+k-1}\frac{ \rho ^t}{ \rho ^k},~\text {and}~ \bar{a}_{ij}^{k}=\frac{\sum _{t=k}^{b+k-1} \rho ^ta_{ij}^{t}}{\sum _{t=k}^{b+k-1} \rho ^t}. $$

Let \([\bar{A}^k]_{ij}:=\bar{a}_{ij}^k\) and \(\bar{a}_{\min }^{k,(l)}\) be the sum of the l smallest nonzero elements of \(\bar{A}^k\). Note that \(\bar{a}_{\min }^{k,(l)}\) is well defined because for any (ij), if \([A^{(k,b)}]_{ij}\) is nonzero, then \([\bar{A}^k]_{ij}\) is also nonzero, and \(A^{(k,b)}\) has at least l nonzero elements by Assumption 3.

Then, we obtain from (25) that

$$\begin{aligned} \bar{a}_{ij}^k\ge \frac{\min _{t\in [k,k+b)} \rho ^t\sum _{t=k}^{b+k-1}a_{ij}^t}{b\max _{t\in [k,k+b)} \rho ^t}\ge \frac{\sum _{t=k}^{b+k-1}a_{ij}^t}{bc_\rho }. \end{aligned}$$

Thus, if \(\bar{a}_{ij}^k\ne 0\), then it follows from (24) that \(\bar{a}_{ij}^k\) is at least \(\eta /(bc_\rho )\). Hence every nonzero element of \(\bar{A}^k\) is at least \(\eta /(bc_\rho )\), and therefore \(\bar{a}_{\min }^{k,(l)}\ge l\eta /(bc_\rho )\).

By virtue of \(\lambda >nbcc_\rho /(2l\eta )\) and Theorem 1, we know that the problem

$$ \mathop {\text {minimize}}_{\mathbf {x}\in \mathbb {R}^n}\ \frac{1}{\bar{ \rho }^k}\widetilde{f}_\lambda ^{(k,b)}(\mathbf {x}) $$

is equivalent to the original problem for all \(k=tb,t\in \{0,1,2,...\}\). That is, we have \(\widetilde{f}_\lambda ^{(k,b)}(\mathbf {x})\ge \bar{\rho }^kf^\star \) for all \(\mathbf {x}\in \mathbb {R}^n\), and \(\widetilde{f}_\lambda ^{(k,b)}(\mathbf {x})=\bar{\rho }^kf^\star \) if and only if \(\mathbf {x} \in \{a{\mathbbm {1}}|a\in \mathscr {X}^\star \}\).

Let \(\mathbf {d}^k=[d_1^k,...,d_n^k]^\mathsf {T}\), where

$$\begin{aligned} d_i^k=-\lambda \sum _{j\in \mathscr {N}_i^k}a_{ij}^k\text {sgn}(x_j^k-x_i^k)+\nabla f_i(x_i^k). \end{aligned}$$

Then, Algorithm 2 can be written in a compact form as

$$\begin{aligned} \mathbf {x}^{k+1}=\mathbf {x}^k- \rho ^k\mathbf {d}^k. \end{aligned}$$

Note that \(\mathbf {d}^k\) is a subgradient of \(\widetilde{f}_\lambda ^k(\mathbf {x})\) at \(\mathbf {x}^k\), and \(\Vert \nabla \widetilde{f}_\lambda ^k(\mathbf {x})\Vert \le c_a\) for any \(\mathbf {x}\in \mathbb {R}^n\) by (17). Hence \(\Vert \mathbf {d}^k\Vert \le c_a\) for any k. Let \(x^\star \) be an arbitrary element of \(\mathscr {X}^\star \). We have the following relation

(26)

Consider the second term on the right-hand side of (26); then

(27)

Combining (27) with (26) yields that

$$\begin{aligned} \Vert \mathbf {x}^{k+b}-x^\star {\mathbbm {1}}\Vert ^2\le \Vert \mathbf {x}^{k}-x^\star {\mathbbm {1}}\Vert ^2+ 2\rho ^k(f^\star -\widetilde{f}_\lambda ^{(k,b)}(\mathbf {x}^k))+5bc_a^2\sum _{t=k}^{b+k-1}( \rho ^t)^2. \end{aligned}$$
(28)

Noting that \(k=tb,t\in \{0,1,...\}\), the above relation becomes

$$\begin{aligned}&\Vert \mathbf {x}^{(t+1)b}-x^\star {\mathbbm {1}}\Vert ^2 \\&\le \Vert \mathbf {x}^{tb}-x^\star {\mathbbm {1}}\Vert ^2+ 2\rho ^{tb}(f^\star -\widetilde{f}_\lambda ^{(tb,b)}(\mathbf {x}^{tb}))+5bc_a^2\sum _{u=tb}^{(t+1)b-1}( \rho ^u)^2. \end{aligned}$$

Note that \(\widetilde{f}_\lambda ^{(tb,b)}(\mathbf {x})-\bar{\rho }^{tb}f^\star \) is nonnegative and equals zero if and only if \(\mathbf {x}\in \{a{\mathbbm {1}}|a\in \mathscr {X}^\star \}\), and that \(\sum _{t=1}^{\infty } \rho ^{tb}=\infty \), \(\sum _{t=1}^{\infty }(\rho ^{tb})^2<\infty \). It follows from Lemma 1 that there exists \(\bar{x}\in \mathscr {X}^\star \) such that the subsequence \(\{\mathbf {x}^{tb}\},t\in \{0,1,2,...\}\) must converge to \(\bar{x}{\mathbbm {1}}\). This, combined with \(\lim _{k\rightarrow \infty } \rho ^k=0\), implies that \(\{\mathbf {x}^k\}\) converges to \(\bar{x}{\mathbbm {1}}\). \(\square \)

Compared with the convergence result on static graphs (Theorem 2), the major difference on uniformly jointly connected graphs is that \(\lambda \) should be \(bc_\rho \) times larger than that in the case of static graphs.

Next, we evaluate the convergence rate of Algorithm 2 over uniformly jointly connected graphs when \(\rho ^k=k^{-\alpha },\alpha \in (0.5,1]\). As in Theorem 3, we evaluate the rates that \(f(\bar{x}^k)\) approaches \(f^\star \) and \(v(\mathbf {x}^k)\) tends to 0 to quantify the convergence rate.

Theorem 7

(Non-asymptotic result, [26]) Let the assumptions in Theorem 6 hold, and further assume that \(\lambda >nbc/(l\eta )\). Let \(\{\mathbf {x}^k\}\) be generated by Algorithm 2. If \(\rho ^k=k^{-\alpha }\) for some \(\alpha \in (0.5,1]\), then for any \(k_0> 2b\),

$$\begin{aligned} \min _{1< k\le {k_0}}f(\bar{x}^k)-f^\star&\le \frac{(2\alpha -1)(d(\mathbf {x}^0))^2+10\alpha bc_a^2}{b(2\alpha -1)s(k_0)} \\ \min _{1< k\le {k_0}}v(\mathbf {x}^k)&\le \frac{2(2\alpha -1)(d(\mathbf {x}^0))^2+12\alpha bc_a^2}{(\lambda l\eta -nbc)(2\alpha -1)s(k_0)}\nonumber \end{aligned}$$
(29)

where \(\mathbf {x}^0\) is the initial point, and

$$\begin{aligned} s(k_0)=\left\{ \begin{aligned}&\frac{(k_0/b-1)^{1-\alpha }-1}{b^{\alpha }(1-\alpha )},&\text {if } \alpha \in (0.5,1), \\&\frac{\text {ln}(k_0/b-1)}{b},&\text {if } \alpha =1. \end{aligned}\right. \end{aligned}$$

Proof

Note that \(\lambda \) and {\(\rho ^k\)} satisfy the conditions in Theorem 6 with \(c_\rho =2\), and \(\Vert \nabla \widetilde{f}_\lambda ^k(\mathbf {x})\Vert \le c_a\) for any \(\mathbf {x}\) and k. Let \(x^\star \) be an arbitrary optimal solution of problem (2) and \(t_0=\lfloor k_0/b\rfloor \), where \(\lfloor x\rfloor \) denotes the largest integer not exceeding x. It then follows from (28) that

$$\begin{aligned}&2\rho ^{tb}(\widetilde{f}_\lambda ^{(tb,b)}(\mathbf {x}^{tb})-f^\star )\le \Vert \mathbf {x}^{tb}-x^\star {\mathbbm {1}}\Vert ^2-\Vert \mathbf {x}^{(t+1)b}-x^\star {\mathbbm {1}}\Vert ^2+5bc_a^2\sum _{u=tb}^{tb+b-1}( \rho ^u)^2. \end{aligned}$$

Summing the above relation over \(t=0,1,...,t_0\) yields

$$\begin{aligned}&2\sum _{t=0}^{t_0}\rho ^{tb}(\widetilde{f}_\lambda ^{(tb,b)}(\mathbf {x}^{tb})-f^\star ) \\&\le \Vert \mathbf {x}^{0}-x^\star {\mathbbm {1}}\Vert ^2-\Vert \mathbf {x}^{(t_0+1)b}-x^\star {\mathbbm {1}}\Vert ^2+5bc_a^2\sum _{t=0}^{t_0}\sum _{u=tb}^{(t+1)b-1}( \rho ^u)^2 \\&\le d(\mathbf {x}^0)^2+5bc_a^2\sum _{k=1}^{k_0}(\rho ^k)^2. \end{aligned}$$

Therefore, we have

$$\begin{aligned} \min _{0\le k\le k_0}\widetilde{f}_\lambda ^{(k,b)}(\mathbf {x}^k)-f^\star \le \frac{d(\mathbf {x}^0)^2+5bc_a^2\sum _{k=1}^{k_0}(\rho ^k)^2}{2\sum _{t=0}^{t_0}\rho ^{tb}}. \end{aligned}$$
(30)

Since

$$\begin{aligned} \int _1^{k_0}\frac{1}{x^\alpha }dx<\sum _{k=1}^{k_0} \frac{1}{k^\alpha }<\int _1^{k_0}\frac{1}{x^\alpha }dx+1, \end{aligned}$$

we obtain that

$$\begin{aligned} \sum _{k=1}^{k_0}(\rho ^k)^2<\int _1^{k_0}\frac{1}{x^{2\alpha }}dx+1=\frac{1-k_0^{1-2\alpha }}{2\alpha -1}+1<\frac{2\alpha }{2\alpha -1} \end{aligned}$$

and for \(\alpha \in (0.5,1)\),

$$\begin{aligned} \sum _{t=0}^{t_0}\rho ^{tb}&>b^{-\alpha }\sum _{t=0}^{t_0}\rho ^t>b^{-\alpha }\int _1^{t_0}\frac{1}{x^{\alpha }}dx=\frac{t_0^{1-\alpha }-1}{b^{\alpha }(1-\alpha )} \\&>\frac{(k_0/b-1)^{1-\alpha }-1}{b^{\alpha }(1-\alpha )}=s(k_0). \end{aligned}$$

For \(\alpha =1\), we obtain \(\sum _{t=0}^{t_0}\rho ^{tb}>s(k_0)\) using similar arguments. Substituting these two inequalities into (30) yields

$$\begin{aligned} \min _{0\le k\le {k_0}}\widetilde{f}_\lambda ^{(k,b)}(\mathbf {x}^k)-f^\star \le \frac{(2\alpha -1)d(\mathbf {x}^0)^2+10\alpha bc_a^2}{2(2\alpha -1)s(k_0)}. \end{aligned}$$
(31)

Noticing that \(\bar{ \rho }^k\ge b/c_\rho = b/2\) for all k, we have

(32)

where the first equality follows from the definition of \(\widetilde{f}_\lambda ^{(k,b)}(\mathbf {x})\), the second inequality is from the definition of a subgradient, and the last inequality is the result of the Cauchy–Schwarz inequality as well as the fact that \(g(a{\mathbbm {1}})=f(a)\).

Recall from (13) and (14) that

$$ h(\mathbf {x}^k)\ge \frac{l\eta }{2b} v(\mathbf {x}^k),~\text {and}~\Vert \mathbf {x}^k-\bar{x}^k{\mathbbm {1}}\Vert \Vert \nabla g(\bar{x}^k{\mathbbm {1}})\Vert \le \frac{nc}{2}v(\mathbf {x}^k). $$

These two relations together with (32) yield

$$\begin{aligned}&\widetilde{f}_\lambda ^{(k,b)}(\mathbf {x}^k)-f^\star \ge \frac{b}{2}\left[ f(\bar{x}^k)-f^\star +(\frac{\lambda l\eta }{2b}-\frac{nc}{2})v(\mathbf {x}^k)\right] . \end{aligned}$$

Since \(f(\bar{x}^k)-f^\star \ge 0\) and \(\lambda l \eta -nbc>0\), the above inequality combined with (31) implies (29) immediately. \(\square \)

Theorem 7 reveals that, from a worst-case point of view, the convergence rate over uniformly jointly connected time-varying graphs is about b times slower than that over a static graph (Theorem 3), which is reasonable.

5.2 Randomly Activated Graphs

This subsection studies the convergence of Algorithm 2 over randomly activated graphs, which can model many situations such as gossip-based social networks and random measurement losses in networks. The definition is given as follows.

Definition 2

(Randomly Activated Graphs) The sequence of graphs \(\{\mathscr {G}^k\}\) is randomly activated if for all \(i,j\in \mathscr {V},i\ne j\), \(\{a_{ij}^k\}\) is an i.i.d. Bernoulli process with \(\mathbb {P}\{a_{ij}^k=1\}=p_{ij}\), where \(\mathbb {P}(\cdot )\) denotes the probability of an event and \(0\le p_{ij}\le 1,\ \forall i,j\in \mathscr {V}\).

Remark 4

For brevity, we assume here that the weight of each edge \(a_{ij}^k\) is taken to be either zero or one at each time k in randomly activated graphs.

We call \(P=[p_{ij}]\) the activation matrix of \(\mathscr {G}^k\), and the graph associated with P is denoted as \(\mathscr {G}_P\), which is also the mean graph of \(\mathscr {G}^k\), i.e.,

$$\begin{aligned} \mathscr {G}_P:=\mathbb {E}(\mathscr {G}^k). \end{aligned}$$
(34)
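Definition 2 and the mean graph in (34) can be illustrated numerically: averaging many sampled graphs recovers the activation matrix P. The matrix P below is a made-up example.

```python
import numpy as np

def sample_graph(P, rng):
    """Sample a symmetric 0/1 adjacency matrix with P(a_ij^k = 1) = p_ij."""
    n = len(P)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            A[i, j] = A[j, i] = float(rng.random() < P[i, j])
    return A

rng = np.random.default_rng(1)
P = np.array([[0.0, 0.5, 0.9],
              [0.5, 0.0, 0.2],
              [0.9, 0.2, 0.0]])
# Empirical mean of the activated graphs approximates P, i.e., E(G^k) = G_P.
mean_A = np.mean([sample_graph(P, rng) for _ in range(4000)], axis=0)
```

Each entry of `mean_A` concentrates around the corresponding \(p_{ij}\) by the law of large numbers, which is exactly the sense in which \(\mathscr {G}_P\) is the mean graph.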

Recall that Algorithm 1 is the subgradient iteration for (7). Similarly, Algorithm 2 is simply the stochastic subgradient iteration for the following optimization problem

$$\begin{aligned} \mathop {\text {minimize}}_{\mathbf {x}\in \mathbb {R}^n}\ \widehat{f}_\lambda (\mathbf {x}):=g(\mathbf {x})+\lambda \widehat{h}(\mathbf {x}) \end{aligned}$$
(35)

where g(x) is given in (5) and

$$\begin{aligned} \widehat{h}(\mathbf {x})=\frac{1}{2}\sum _{i,j=1}^n p_{ij}|x_i-x_j|. \end{aligned}$$

To see this, notice that \(\mathbb {E}(a_{ij}^k)=p_{ij}\), and thus a stochastic subgradient \(\nabla _s \widehat{h}(\mathbf {x})=[\nabla _s \widehat{h}(\mathbf {x})_1,...,\nabla _s \widehat{h}(\mathbf {x})_n]^\mathsf {T}\) of \(\widehat{h}(\mathbf {x})\) is given element-wise by

$$\begin{aligned} \nabla _s \widehat{h}(\mathbf {x})_i&=\sum _{j=1}^na_{ij}^k\text {sgn}(x_i-x_j)=\sum _{j\in \mathscr {N}_i^k}\text {sgn}(x_i-x_j). \end{aligned}$$

Since \(\mathbb {E}\{\nabla _s \widehat{h}(\mathbf {x})_i\}=\sum _{j}p_{ij}\text {sgn}(x_i-x_j)\), \(\mathbb {E}\{\nabla _s \widehat{h}(\mathbf {x})\}\) is a subgradient of \(\widehat{h}(\mathbf {x})\). Hence, the almost sure convergence of Algorithm 2 follows from the following lemma.

Lemma 2

(Convergence of Stochastic Subgradient Method, [3]) Consider the optimization problem

$$\begin{aligned} \mathop {\text {minimize}}_{\mathbf {x}\in \mathbb {R}^n}\ \mathbb {E}\{F(\mathbf {x},w)\} \end{aligned}$$
(36)

where w is a random variable and \(F(\mathbf {x},w):\mathbb {R}^n\times \mathbb {R}\rightarrow \mathbb {R}\) is continuous and convex w.r.t. \(\mathbf {x}\) for any given w. Let \(\mathscr {X}^\star \) be the set of optimal solutions and assume that \(\mathscr {X}^\star \) is not empty.

The stochastic subgradient method for (36) is given by

$$\begin{aligned} \mathbf {x}^{k+1}=\mathbf {x}^k-\rho ^kr(\mathbf {x}^k,w^k) \end{aligned}$$

where \(r(\mathbf {x},w^k)\) is bounded and \(\mathbb {E}(r(\mathbf {x},w^k))\) is a subgradient of \(\mathbb {E}\{F(\mathbf {x},w^k)\}\) for all \(\mathbf {x}\in \mathbb {R}^n\). If \(\{\rho ^k\}\) is chosen such that

$$\begin{aligned} \sum _{k=0}^\infty \rho ^k=\infty ,\quad \sum _{k=0}^\infty (\rho ^k)^2<\infty , \end{aligned}$$

then it holds almost surely that \(\lim _{k\rightarrow \infty }\mathbf {x}^k=\mathbf {x}^\star \) for some \(\mathbf {x}^\star \in \mathscr {X}^\star \).

The following theorem summarizes the above analysis, and is the main result of this subsection.

Theorem 8

([28]) Suppose that Assumptions 1 and 2 hold, and that the mean graph \(\mathscr {G}_P\) is l-connected. Select

$$\begin{aligned} \lambda >\frac{n c}{2p_{\min }^{(l)}}, \end{aligned}$$

where \(\mathscr {G}_P\) is given in (34), and \(p_{\min }^{(l)}\) denotes the sum of the l smallest nonzero elements of P. Let \(\{\mathbf {x}^k\}\) be generated by Algorithm 2. Then, it holds almost surely that \(\lim _{k\rightarrow \infty }\mathbf {x}^k=x^\star {\mathbbm {1}}\) for some \(x^\star \in \mathscr {X}^\star \).

Proof

By Theorem 1, problem (35) has the same set of optimal solutions and the same optimal value as problem (2). Combining this with Lemma 2 completes the proof. \(\square \)

6 Numerical Examples

In this section, we apply our algorithms to find, in a distributed manner, the geometric median of a set of points in a two-dimensional plane. The geometric median of n points is defined as the point that minimizes the sum of Euclidean distances to these points [7]. In other words, it is the optimal solution of the following convex optimization problem:

$$\begin{aligned} \mathop {\text {minimize}}_{\mathbf {x}\in \mathbb {R}^2}\ f(\mathbf {x}):=\sum _{i=1}^n f_i(\mathbf {x})= \sum _{i=1}^n \Vert \mathbf {x}-\mathbf {x}_i\Vert . \end{aligned}$$
(37)

The local function \(f_i(\mathbf {x}):=\Vert \mathbf {x}-\mathbf {x}_i\Vert \) is convex but non-differentiable, the subdifferential of which is given as

$$\begin{aligned} \partial f_i(\mathbf {x})=\left\{ \begin{aligned}&\left\{ \frac{\mathbf {x}-\mathbf {x}_i}{\Vert \mathbf {x}-\mathbf {x}_i\Vert }\right\} ,&\text {if } \mathbf {x}\ne \mathbf {x}_i, \\&\{\mathbf {g}\in \mathbb {R}^2\mid \Vert \mathbf {g}\Vert \le 1\},&\text {if } \mathbf {x}=\mathbf {x}_i. \end{aligned}\right. \end{aligned}$$

Clearly, problem (37) satisfies Assumption 1, and hence it can be solved by Algorithms 1 and 2. Notice that \(\mathbf {x}\) in (37) is two-dimensional, so Algorithms 1 and 2 should be modified accordingly, as stated in Remark 1.

The geometric median problem is a special case of the Weber problem in location theory and is closely related to least squares problems in statistics. Here we describe a possible application in a distributed setting. Consider n base stations under the sea, and suppose we want to choose a site for a communication center that minimizes the total distance to these stations, so as to save cabling costs. Since global positioning is very difficult underwater, a feasible distributed approach is for each station to send an agent that can only measure the distance to its own station and knows rough relative positions of its neighboring agents. Clearly, the proposed algorithms can achieve this goal.

In this example, we consider five stations (hence five agents) whose positions are randomly generated in a \(100\times 100\) square area. We run three simulations: over a static graph, over uniformly jointly connected graphs, and over randomly activated graphs. We choose the stepsize \(\rho ^k=5/(k+10)\) in all simulations. The topology of the five agents is a ring graph, as shown in Fig. 2a. The \(\lambda \) in Algorithm 1 used in the static-graph case is chosen to be 2, which satisfies the condition in Remark 2. For the case of uniformly jointly connected graphs, we let only one edge of the graph in Fig. 2a be connected at each time, and each edge is connected exactly once per cycle, in an order determined by a random permutation of \(\{1,...,5\}\) at the beginning of each cycle. The \(\lambda \) in Algorithm 2 used in this case is chosen to be 6. We generate randomly activated graphs by letting each edge of the graph in Fig. 2a be connected with probability 0.5 at each time, and we choose \(\lambda \) to be 4.

Figure 2b, c, d depict the trajectories of the five agents from \(k=1\) to 1500 over the static graph, the uniformly jointly connected graphs, and the randomly activated graphs, respectively, where the filled circles are the initial positions of the agents and the black triangle is the geometric median of these circles, computed by Weiszfeld's method [1]. As the figures show, the agents converge to the geometric median in all cases, albeit with slightly different transient performances.
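For reference, the centralized baseline, Weiszfeld's method [1], repeatedly replaces the current estimate by a distance-weighted average of the points. A minimal sketch (omitting the standard safeguard for iterates that land exactly on a data point, and using made-up coordinates) is:

```python
import numpy as np

def weiszfeld(points, iters=1000):
    """Weiszfeld iteration y <- sum_i(x_i/||y - x_i||) / sum_i(1/||y - x_i||)."""
    y = np.mean(points, axis=0) + 1e-3   # start slightly off the centroid
    for _ in range(iters):
        w = 1.0 / np.linalg.norm(points - y, axis=1)
        y = w @ points / np.sum(w)
    return y

# Four symmetric points: by symmetry the geometric median is the center (1, 1).
pts = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]])
median = weiszfeld(pts)
```

By symmetry, the geometric median of the four corner points is the center \((1,1)\), which the iteration recovers.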

Fig. 2
figure 2

a The topology of the agents. b The trajectories of the agents in a static graph, where the filled circles are the initial positions of the agents and the black triangle is the geometric median of these circles. c The trajectories of the agents in uniformly jointly connected graphs. d The trajectories of the agents in randomly activated graphs

Fig. 3
figure 3

The trajectories of the agents with smaller \(\lambda \) over a static graph, uniformly jointly connected graphs, and randomly activated graphs, respectively

If \(\lambda \) is smaller than the lower bound provided in Theorem 1, consensus may not be achieved among the agents. Figure 3 shows the trajectories of the agents with \(\lambda =0.8,2,1.5\) over a static graph, uniformly jointly connected graphs, and randomly activated graphs, respectively. Other settings remain the same, except that we increase the number of iterations to 5000. Clearly, the agents fail to converge to the geometric median.

7 Conclusions

In this chapter, we have proposed a distributed optimization algorithm to solve the additive cost optimization problem in multi-agent networks. Each agent in the algorithm uses only the sign of the relative state with respect to each of its neighbors. The network is allowed to be static or time-varying. For the former case, we first provided a penalty-method interpretation of our algorithm and then studied its convergence under diminishing stepsizes as well as a constant stepsize. We showed that the convergence rate varies from \(O(1/\text {ln}(k))\) to \(O(1/\sqrt{k})\), depending on the stepsize. For the latter case, we studied the performance of our algorithm over the so-called uniformly jointly connected graphs and over randomly activated graphs, and established its convergence in both cases. Finally, we applied our algorithm to solve a geometric median problem. All theoretical results have been corroborated via simulations.