1 Introduction

Distributed algorithms decompose an optimization problem into smaller and more manageable subproblems that can be solved in parallel by a group of agents or processors. Such algorithms are used for large-scale problems in machine learning, wireless sensor networks, social networks, smart grids, etc.

To date, a large number of distributed algorithms have been developed for distributed unconstrained optimization problems. In general, these algorithms fall into two categories: dual-decomposition-based algorithms and dynamic-average-consensus-based algorithms. Dual-decomposition-based algorithms use a consensus equality constraint to reformulate unconstrained optimization problems as constrained ones in which the agents' decision variables are decoupled, and then integrate ideas from centralized primal-dual methods; representative algorithms include EXTRA [40], PG-EXTRA [41], NIDS [18], ABC [48], and DCPA [5]. Dynamic-average-consensus-based algorithms, on the other hand, introduce auxiliary variables to track the average of quantities on which the agents need to reach a consensus, such as the average gradient of all agents' local objectives; examples include Aug-DGM [49], DIGing [26], Push–Pull Gradient [35], Harnessing Smoothness [36], and TV-\(\mathcal{A}\mathcal{B}\) [39]. The ideas behind these methods have been generalized to design algorithms for distributed constrained optimization problems arising in practical applications.

In this paper, we consider the following constrained consensus optimization problem over a multi-agent network consisting of m agents,

$$\begin{aligned} \min _{x\in \mathcal {X}}&\hspace{1.1em} f(x)=\sum _{i=1}^mf_i(x)\\ \mathrm{s.t.}&\quad h(x)=\sum _{i=1}^mh_i(x)\in \mathcal {K},\\ \end{aligned}$$
(DCOP)

where \(\mathcal {K}=\Re ^p_-\times \{0_q\}\), so the constraint \(h(x)\in \mathcal {K}\) comprises both inequality and equality constraints, \(f_i:\Re ^n\rightarrow \Re \), \(h_i:\Re ^n\rightarrow \Re ^{p+q}\), \(i=1,2,\dots ,m\), and \(\mathcal {X}\) is a subset of \(\Re ^n\). For each \(i=1,2,\dots ,m\), the local objective \(f_i\) and the constraint function \(h_i\) are known only to agent i and are not available to the other agents. All agents cooperate to find an optimal solution to (DCOP) through peer-to-peer communication and local computation.

1.1 Motivations

The optimization problem (DCOP) is motivated by its applications in stochastic programming, distributed optimization, optimal control, etc. Due to space limitations, we present two representative applications.

Example 1

Consider the following stochastic program with expectation constraints,

$$\begin{aligned} \begin{aligned} \min _{x\in \mathcal {X}}\hspace{1em}&f(x)= \mathbb {E}_\xi [F(x,\xi )]\\ \mathrm {s.t.}\hspace{1em}&g(x)=\mathbb {E}_\xi [G(x,\xi )]\le 0, \end{aligned} \end{aligned}$$
(1.1)

where \(\xi \) is a random vector supported on \(\mathcal {P}\subseteq \Re ^n\) and independent of x, \(F(x,\xi ): \Re ^p\times \mathcal {P}\rightarrow \Re \), and \(G(x,\xi ):\Re ^p\times \mathcal {P}\rightarrow \Re ^q\). The problem (1.1) has many applications in machine learning, finance, data analysis, and operations research, e.g., Neyman–Pearson classification [44] and asset allocation problems [15, 38].

The sample average approximation (SAA) method is widely used for stochastic optimization problems. Applying SAA to (1.1), one first generates N independent and identically distributed (i.i.d.) random samples \(\{\xi _i\}_{i=1}^N\) and then uses the averages \(\frac{1}{N}\sum _{i=1}^NF(x,\xi _i)\) and \(\frac{1}{N}\sum _{i=1}^NG(x,\xi _i)\) to approximate f(x) and g(x), respectively. The resulting approximation subproblem is

$$\begin{aligned} \begin{aligned} \min _{x\in \mathcal {X}}\hspace{1em}&\frac{1}{N}\sum _{i=1}^NF(x,\xi _i)\\ \mathrm {s.t.}\hspace{1em}&\frac{1}{N}\sum _{i=1}^NG(x,\xi _i)\le 0. \end{aligned} \end{aligned}$$
(1.2)

It is well known that optimal solutions of (1.2) asymptotically solve (1.1) as the sample size N tends to infinity. A natural follow-up question is how to solve such subproblems. Note that (1.2) is an instance of (DCOP) with \(m=N\), \(f_i(x)=\frac{1}{N}F(x,\xi _i)\), and \(h_i(x)=\frac{1}{N}G(x,\xi _i)\) (inequality constraints only), so distributed algorithms for (DCOP) apply directly.
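As a small illustration (the oracles `F` and `G` and the sample list are hypothetical placeholders), the SAA surrogate functions in (1.2) can be assembled as follows:

```python
import numpy as np

def saa_surrogates(samples, F, G):
    """Build the sample-average surrogates of f and g in (1.2).

    samples : a list of i.i.d. realizations {xi_1, ..., xi_N}
    F, G    : oracles with F(x, xi) -> float and G(x, xi) -> array of shape (q,)
    """
    N = len(samples)
    f_hat = lambda x: sum(F(x, xi) for xi in samples) / N
    g_hat = lambda x: sum(np.asarray(G(x, xi)) for xi in samples) / N
    return f_hat, g_hat
```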

Example 2

If the objective f(x) and constraint function h(x) of (DCOP) both have separable structures, namely, \(x=(x_1,x_2,\dots ,x_m)\), \(f_i(x)=\tilde{f}_i(x_i)\), and \(h_i(x)=\tilde{h}_i(x_i)\), (DCOP) reduces to the following constraint-coupled optimization problem,

$$\begin{aligned} \min _{x\in \mathcal {X}}&\hspace{1.1em} \sum _{i=1}^m\tilde{f}_i(x_i)\\ \mathrm{s.t.}&\quad \sum _{i=1}^m\tilde{h}_i(x_i)\in \mathcal {K}.\\ \end{aligned}$$
(CCOP)

The problem (CCOP) has been widely investigated in smart grids, optimal control, machine learning, etc. For instance, in multi-micro energy grid systems [51], the coupled constraints express that the sum of users' power consumption equals the total power generation of the grids and that the total carbon emissions do not exceed a given upper bound. In distributed resource allocation problems [27, 33], the allocation of a given resource among agents is formulated as coupled constraints. Besides, (CCOP) also appears in distributed model predictive control [6], network utility maximization [20], and distributed estimation [13], among others.

Theoretically, algorithms for (CCOP) can be adapted to solve (DCOP) at the cost of adding \(m-1\) equality constraints, namely, \(x_1=x_2=\dots =x_m\). In practical computations, however, this causes difficulties when the network size m and the variable dimension n are relatively large, e.g., \(m=1000\), \(n=100\). Moreover, the consensus constraints \(x_i=x_j\), for each \(j\in \mathcal {N}_i\) (the set of neighbors of agent i), play an important role in applications such as distributed regression problems [50] and optimal control [24, 45], which motivates us to consider the lifted consensus optimization problem (DCOP).

1.2 Related Work

To accommodate the demands of practical applications, researchers aim to design distributed algorithms that converge faster and apply to more general multi-agent networks.

Earlier studies of distributed constrained consensus optimization problems focused on the case where agents' decisions are subject only to local constraint sets. The distributed projection (sub)gradient method [30] is a seminal work in this direction. Afterward, a large number of dual-decomposition-based [16, 52] and dynamic-average-consensus-based [3, 21, 22] distributed algorithms with fast convergence rates have been proposed to solve such problems.

The first work to address distributed consensus optimization with equality and inequality constraints may be [53], which presented a distributed primal-dual algorithm with vanishing step-sizes. Afterward, the community aimed to design faster-converging distributed algorithms with a fixed step-size for such problems, e.g., see [12, 14, 23]. Under various assumptions such as smoothness and strong convexity of the functions, these algorithms achieve sublinear or linear convergence rates, although the equality and inequality constraints they consider are common to all agents.

The problem (CCOP) has attracted increasing research interest, as coupled constraints frequently appear in many fields. Approaches aiming directly at (CCOP) typically leverage Lagrangian duality to deal with the coupled constraints, because the Lagrangian has a separable structure in the primal decision variables and the dual problem is a special case of (DCOP). Works based on the dual-decomposition method include [2, 10, 27, 42], to mention a few. There are also consensus-based distributed algorithms for such problems, e.g., see [1, 8, 9, 22, 43]. However, these papers consider only linearly coupled constraints, i.e., \(\mathcal {K}=\{0_q\}\) and each \(\tilde{h}_i\) is affine. The recent works [4, 19, 47] investigated (CCOP) with coupled equality and inequality constraints and proposed the distributed primal-dual gradient method [19], the integrated primal-dual proximal method [47], and the augmented Lagrangian tracking method [4]. Among them, the integrated primal-dual proximal method has an O(1/k) convergence rate in terms of optimality and feasibility, while the other two algorithms asymptotically solve (CCOP) with constant step-sizes but without rate analysis.

In general, dual-decomposition-based algorithms enjoy good convergence guarantees and rates but do not extend easily to directed and time-varying networks. Dynamic-average-consensus-based algorithms are inexact gradient methods: if the objective and constraint functions lack good properties, such as strong convexity or Lipschitz-continuous gradients, their convergence analysis is difficult. On the other hand, such algorithms extend readily to directed and time-varying multi-agent communication networks, although these more complex networks also bring challenges to the algorithm analysis.

It is worth noting that the distributed algorithms mentioned so far for the problems (CCOP) or (DCOP) are all executed over a fixed (static) multi-agent network. To our knowledge, very little work investigates (DCOP) over time-varying networks, except for [17], whose authors proposed a distributed proximal primal-dual algorithm with an \(O(1/\sqrt{k})\) convergence rate for solving (DCOP).

In this paper, we integrate ideas from the algorithm TV-\(\mathcal{A}\mathcal{B}\) [39] and the primal-dual optimization method to develop a distributed algorithm for solving (DCOP) over time-varying and directed multi-agent networks. The main contributions of the article are threefold. First, this paper considers distributed consensus optimization problems with coupled affine equality constraints and inequality constraints that are not necessarily affine. The objective and constraint functions are only convex and differentiable; such assumptions are fundamental settings for consensus-based algorithms. The considered problem (DCOP) is thus more general and challenging than those in many papers in the literature, e.g., [2, 10, 27]. Second, the investigated multi-agent communication network is more general than the undirected and static ones considered by [2, 10, 19, 27, 42]. Third, the foundational assumptions on the objective and constraint functions, as well as the time-varying multi-agent communication network, bring challenges to the analysis of the proposed inexact gradient algorithm; nevertheless, we establish its convergence and iteration complexity.

The rest of this paper is organized as follows. Section 2 provides necessary preliminaries and constructs the saddle point problem of (DCOP). Section 3 develops a consensus-based distributed primal-dual algorithm. Section 4 presents the main results on the convergence properties of the proposed algorithm. Section 5 tests the presented algorithm on numerical examples and compares it with a related method. Finally, Sect. 6 concludes the paper.

2 Preliminaries

In this section, we will present some preliminaries in graph theory and assumptions on the considered optimization problem and multi-agent network.

2.1 Notions and Notation

At each time slot \(k\ge 0\), a multi-agent system consisting of m agents is modeled as a directed graph \(\mathcal {G}(k)=(\mathcal {V},\mathcal {E}(k))\), where \(\mathcal {V}=\{1,2,\dots ,m\}\) and \(\mathcal {E}(k)\) is the set of directed edges. Let \(A(k)=[a_{ij}(k)]\) and \(B(k)=[b_{ij}(k)]\) be two matrices that are compatible with the graph \(\mathcal {G}(k)\), which means that \((i,j)\in \mathcal {E}(k)\) if and only if \(a_{ij}(k)>0\) and \(b_{ij}(k)>0\). We write \(\mathcal {N}_i^\mathrm{{out}}(k)=\{j\in \mathcal {V}: (i,j)\in \mathcal {E}(k)\}\) and \(\mathcal {N}_i^\mathrm{{in}}(k)=\{j\in \mathcal {V}: (j,i)\in \mathcal {E}(k)\}\) as the sets of out-neighbors and in-neighbors of agent i, respectively. The symbol \(\vert \mathcal {N}_i^\mathrm{{out}}(k)\vert \) (resp. \(\vert \mathcal {N}_i^\mathrm{{in}}(k)\vert \)) represents the out-degree (resp. in-degree) of agent i.

The properties of adjacency matrices of connected (or strongly connected) graphs, such as their eigenvalues and eigenvectors, are closely related to stochastic vectors. The Lagrangian function and its saddle points are also fundamental concepts in constrained optimization. To clarify the presentation of the considered problem and the algorithm analysis, we introduce the following three notions.

Definition 2.1

(Stochastic vector [29]) A vector \(\pi \in \Re ^m\) is said to be stochastic if its components \(\pi _i\) satisfy that

$$\begin{aligned} \sum _{i=1}^m\pi _i=1,\quad \pi _i\ge 0,\quad \forall \, i=1,2,\dots ,m. \end{aligned}$$

Definition 2.2

(Absolute probability sequence [39]) For row-stochastic matrices \(\{\mathcal {R}_k\}\), an absolute probability sequence is a sequence \(\{\pi _k\}\) of stochastic vectors such that

$$\begin{aligned} \pi _k^\top =\pi _{k+1}^\top \mathcal {R}_k,\quad \forall \, k\ge 0. \end{aligned}$$

Definition 2.3

(Saddle point [37]) Consider a function \(\mathcal {L}:X\times S\rightarrow \Re \), where X and S are non-empty subsets of \(\Re ^n\) and \(\Re ^{p+q}\), respectively. A pair of vectors \((x^*,\lambda ^*)\in X\times S\) is called a saddle point of \(\mathcal {L}\) over \(X\times S\) if, for all \((x,\lambda )\in X\times S\), it holds that

$$\begin{aligned} \mathcal {L}(x^*,\lambda )\le \mathcal {L}(x^*,\lambda ^*)\le \mathcal {L}(x,\lambda ^*). \end{aligned}$$

In this paper, the gradient of a function f at x is denoted by \(\nabla f(x)\), \(\Vert \cdot \Vert \) represents the Frobenius norm of a matrix with suitable dimensions, and \(\langle \cdot ,\cdot \rangle \) is the inner product matched with the norm \(\Vert \cdot \Vert \). The projection of a point x onto a subset \(\varOmega \subseteq \Re ^n\) is written as \(P_\varOmega (x)\). For a non-empty subset \(\varOmega \), we denote its relative interior by \(\textrm{ri}(\varOmega )\). \(1_m\) denotes the all-ones column vector in \(\Re ^m\).

2.2 Assumptions and Saddle Point Problem

This paper focuses on time-varying multi-agent communication networks. We make the following assumptions on the network communication graphs, which are standard for consensus-based algorithms, e.g., see [29, 30, 39].

Assumption 2.1

(Periodic strong connectivity) There exists a positive integer B such that the directed graph \((\mathcal {V},\bigcup _{t=0}^{B-1}\mathcal {E}(t+k))\) is strongly connected for all \(k\ge 0\).

Assumption 2.2

(Weights) Let \(A(k)=[a_{ij}(k)]_{m\times m}\) and \(B(k)=[b_{ij}(k)]_{m\times m}\) be two matrices that are compatible with the graph \(\mathcal {G}(k)\).

(a) Stochasticity: The matrices \(\{A(k)\}\) and \(\{B(k)\}\) are row-stochastic and column-stochastic, respectively.

(b) Self-loops: The graph \(\mathcal {G}(k)\) has self-loops, i.e., \(a_{ii}(k)>0\), \(b_{ii}(k)>0\), \(\forall \, i\in \mathcal {V}\), \(k\ge 0\).

(c) Uniform positivity: There is a scalar \(\eta \in (0,1)\) such that \(a_{ij}(k)\ge \eta \) and \(b_{ij}(k)\ge \eta \), \(\forall \, (i,j)\in \mathcal {E}(k)\), \(k\ge 0\).

The matrices A(k) and B(k) are only singly stochastic (row- or column-stochastic, not doubly stochastic). Agents can easily construct these two matrices using their out-degrees and in-degrees, which avoids the difficulty of generating symmetric doubly stochastic adjacency matrices over directed and time-varying networks. Assumption 2.2 (b) implies that each agent performs local computation using information held by itself and received from its in-neighbors. Item (c) indicates that the influence of each agent on the network does not vanish.

Smooth convex optimization problems are within the scope of our consideration, so we make the following standard assumption on (DCOP).

Assumption 2.3

(Convexity and smoothness)

(a) The local function \(f_i\) of agent i is convex and continuously differentiable over the subset \(\mathcal {X}\), \(\forall \, i=1,2,\ldots ,m\).

(b) The function \(h_i^j\) (the j-th coordinate component of \(h_i\)), corresponding to the inequality constraints, is convex and continuously differentiable for \(j=1,\ldots ,p\), and \(h_i^l\), corresponding to the equality constraints, is affine for \(l=p+1,\ldots ,p+q\), \(\forall \, i=1,\ldots ,m\).

(c) The non-empty subset \(\mathcal {X}\) of \(\Re ^n\) is compact and convex.

To construct a primal-dual method, a fundamental requirement is that strong duality holds, which is guaranteed by the following Slater's condition.

Assumption 2.4

(Slater’s condition) Suppose that there exists a point \(\bar{x}\in \textrm{ri}(\mathcal {X})\), such that \(\sum _{i=1}^mh_i(\bar{x})\in \textrm{ri}(\mathcal {K})\).

Let \(\mathcal {K}^\circ \) represent the polar of \(\mathcal {K}\), namely, \(\mathcal {K}^\circ =\Re ^p_+\times \Re ^q\). The saddle point problem of (DCOP) can be formulated as

$$\begin{aligned} \max _{\lambda \in \mathcal {K}^\circ }\min _{x\in \mathcal {X}}\mathcal {L}(x,\lambda )=\sum _{i=1}^m\mathcal {L}_i(x,\lambda ), \end{aligned}$$
(2.1)

where \(\mathcal {L}_i(x,\lambda )=f_i(x)+\lambda ^\top h_i(x)\), \(\lambda ^{\textrm{I}}\) and \(\lambda ^{\textrm{E}}\) are the multipliers corresponding to the inequality and equality constraints, respectively, and \(\lambda =(\lambda ^{\textrm{I}},\lambda ^{\textrm{E}})\in \mathcal {K}^\circ \).

Under Slater's condition, there is no duality gap between the primal problem (DCOP) and its dual. We therefore aim to develop a distributed algorithm that solves the max-min problem (2.1) and thereby finds an optimal solution to (DCOP).

For any \(\bar{\lambda }\in \mathcal {K}^\circ \), denote

$$\begin{aligned} q(\lambda )&=\inf _{x\in \mathcal {X}}\mathcal {L}(x,\lambda ),\\ Q(\bar{\lambda })&=\left\{ \lambda =(\lambda ^{\textrm{I}},\lambda ^{\textrm{E}})\in \mathcal {K}^\circ :q(\lambda )\ge q(\bar{\lambda })\right\} . \end{aligned}$$

The boundedness of the multiplier \(\lambda ^{\textrm{I}}\) corresponding to the inequality constraints is indispensable in the design and analysis of our algorithm; it is guaranteed by the following lemma.

Lemma 2.1

([28], Lemma 1) For any \(\bar{\lambda }\in \mathcal {K}^\circ \), it holds that

$$\begin{aligned} \max _{\lambda \in Q(\bar{\lambda })}\Vert \lambda ^{\textrm{I}}\Vert \le \frac{1}{\gamma (\bar{x})}\left( f(\bar{x})-q(\bar{\lambda })\right) , \end{aligned}$$
(2.2)

where \(\gamma (\bar{x})=\min _{1\le j\le p}\{-\sum _{i=1}^mh_i^j(\bar{x})\}\) and \(\bar{x}\) satisfies Assumption 2.4.

Let \(q^*\) denote the dual optimal value and \(Q^*\) the set of dual optimal solutions, namely,

$$\begin{aligned} Q^*=\left\{ \lambda =(\lambda ^{\textrm{I}},\lambda ^{\textrm{E}})\in \mathcal {K}^\circ :q(\lambda )\ge q^*\right\} \subseteq Q(\bar{\lambda }). \end{aligned}$$

According to Lemma 2.1, the optimal multipliers \(\lambda ^{\textrm{I}}\) are bounded. Select an arbitrary vector \(\bar{\lambda }\in \mathcal {K}^\circ \) and define

$$\begin{aligned} Q_I(\alpha )&=\left\{ \lambda ^{\textrm{I}}\in \Re ^p_+: \Vert \lambda ^{\textrm{I}}\Vert \le \alpha \right\} , \,\,\forall \, \alpha \ge \frac{1}{\gamma (\bar{x})}\left( f(\bar{x})-q(\bar{\lambda })\right) ,\\ Q&=Q_I(\alpha )\times \Re ^q. \end{aligned}$$

It follows that \(Q^*\subseteq Q(\bar{\lambda })\subseteq Q\). The set Q will be used in the design of the distributed algorithm in Sect. 3. Note that Q is the Cartesian product of the intersection of \(\Re ^p_+\) with a ball and the whole space \(\Re ^q\), so the projection onto Q has a simple closed-form solution, as sketched below.
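For concreteness, here is a minimal sketch of \(P_Q\) (assuming \(\lambda \) is stored as a vector whose first p entries form \(\lambda ^{\textrm{I}}\)); it relies on the fact that, for the intersection of a closed convex cone with a centered ball, projecting onto the cone first and then onto the ball yields the exact projection.

```python
import numpy as np

def project_Q(lam, p, alpha):
    """Project lam = (lam_I, lam_E) onto Q = {u in R^p_+ : ||u|| <= alpha} x R^q."""
    lam_I = np.maximum(lam[:p], 0.0)          # project onto the nonnegative orthant
    norm = np.linalg.norm(lam_I)
    if norm > alpha:                          # then rescale into the ball of radius alpha
        lam_I = (alpha / norm) * lam_I
    return np.concatenate([lam_I, lam[p:]])   # equality multipliers are unconstrained
```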

3 Design of Distributed Algorithm

3.1 Primal-Dual Projected Gradient Method

Consider the following centralized optimization problem with equality and inequality constraints,

$$\begin{aligned} \min _{x\in \mathcal {X}}\tilde{f}(x)\quad \mathrm{s.t.} \quad \tilde{h}(x)\in \mathcal {K}, \end{aligned}$$
(3.1)

where \(\mathcal {K}=\Re ^p_-\times \{0_q\}\). The saddle point problem associated with (3.1) is

$$\begin{aligned} \max _{\lambda \in \mathcal {K}^\circ }\min _{x\in \mathcal {X}}\tilde{L}(x,\lambda ), \end{aligned}$$
(3.2)

where \(\tilde{L}(x,\lambda )=\tilde{f}(x)+\lambda ^\top \tilde{h}(x)\). Given a pair \((x^k,\lambda ^k)\), the primal-dual projected gradient method [7] for solving (3.2) obeys the following rules to generate the successive primal-dual pair \((x^{k+1},\lambda ^{k+1})\),

$$\begin{aligned} x^{k+1}&=P_{\mathcal {X}}\left( x^k-\alpha _k\nabla _x\tilde{L}\big (x^k,\lambda ^k\big )\right) , \end{aligned}$$
(3.3a)
$$\begin{aligned} \lambda ^{k+1}&=P_{\mathcal {K}^\circ }\left( \lambda ^k+\alpha _k\nabla _\lambda \tilde{L}\big (x^k,\lambda ^k\big )\right) , \end{aligned}$$
(3.3b)

where \(\alpha _k>0\) is a given step-size. Recalling the first-order optimality conditions [37, Theorem 36.6] for (3.1), \(x\in \mathcal {X}\) is an optimal solution to (3.1) if and only if there is a multiplier \(\lambda \) such that the primal-dual pair \((x,\lambda )\) satisfies

$$\begin{aligned} x=&P_{\mathcal {X}}\left( x-\alpha \nabla _x\tilde{L}(x,\lambda )\right) ,\\ \lambda =&P_{\mathcal {K}^\circ }\left( \lambda +\alpha \nabla _\lambda \tilde{L}(x,\lambda )\right) , \end{aligned}$$

for any scalar \(\alpha >0\). Therefore, the primal-dual projected gradient method can also be viewed as a fixed-point iteration.
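As a minimal sketch (the gradient oracles and projections are user-supplied placeholders), one iteration of (3.3) can be written as:

```python
import numpy as np

def primal_dual_step(x, lam, grad_x_L, grad_lam_L, P_X, P_Kpolar, alpha):
    """One iteration of the primal-dual projected gradient method (3.3).

    grad_x_L(x, lam)   -> gradient of L~ in x: grad f~(x) + Jacobian(h~)(x)^T lam
    grad_lam_L(x, lam) -> gradient of L~ in lam, i.e., h~(x)
    P_X, P_Kpolar      -> projections onto X and K° (user-supplied)
    """
    x_new = P_X(x - alpha * grad_x_L(x, lam))             # (3.3a): projected descent
    lam_new = P_Kpolar(lam + alpha * grad_lam_L(x, lam))  # (3.3b): projected ascent
    return x_new, lam_new
```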

3.2 TV-\(\mathcal{A}\mathcal{B}\) Algorithm

Over directed time-varying multi-agent networks consisting of m agents, consider an unconstrained distributed optimization problem:

$$\begin{aligned} \min _{x\in \Re ^n}f(x)=\sum _{i=1}^mf_i(x), \end{aligned}$$

where \(f_i\) is the local objective function of agent i. The authors of [39] proposed a distributed algorithm for this problem, referred to as the time-varying \(\underline{\mathcal{A}\mathcal{B}}\) (TV-\(\mathcal{A}\mathcal{B}\)) gradient method. Specifically, the updating rules are

$$\begin{aligned} x_i^{k+1}&=\sum _{j=1}^ma_{ij}(k)x_j^k-\alpha y_i^k,\\ y_i^{k+1}&=\sum _{j=1}^mb_{ij}(k)y_j^k+\nabla f_i\big (x_i^{k+1}\big )-\nabla f_i\big (x_i^k\big ), \end{aligned}$$

where A(k) and B(k) are row-stochastic and column-stochastic matrices, respectively. The auxiliary variables \(\{y_i^k\}_{i=1}^m\) track the average of \(\{\nabla f_i(x_i^k)\}_{i=1}^m\); precisely, \(\frac{1}{m}\sum _{i=1}^my_i^k=\frac{1}{m}\sum _{i=1}^m\nabla f_i(x_i^k)\) holds for all \(k\ge 0\). Each agent i then performs an inexact gradient descent step along the direction \(-y_i^k\). Following these distributed strategies, all agents adjust their decision vectors \(\{x_i^k\}_{i=1}^m\) to become consensual (i.e., \(\Vert x_i^k-x_j^k\Vert \rightarrow 0\) as \(k\rightarrow \infty \)) and optimal (i.e., \(\Vert x_i^k-x^*\Vert \rightarrow 0\) as \(k\rightarrow \infty \) for some optimal solution \(x^*\)).

In a multi-agent system, if the agents adopt the primal-dual gradient method (3.3) to solve the saddle point problem (2.1) associated with (DCOP), they must access the global information \(\sum _{i=1}^m\nabla \mathcal {L}_i\). In distributed settings, however, this information is not available to any single agent. To overcome this obstacle, we strategically integrate the ideas behind TV-\(\mathcal{A}\mathcal{B}\) and the primal-dual gradient method to propose the following distributed primal-dual algorithm for solving (DCOP),

$$\begin{aligned} x_i^{k+1}&=P_{\mathcal {X}}\left( \sum _{j=1}^ma_{ij}(k)x_j^k-\alpha _kz_i^k\right) , \end{aligned}$$
(3.4a)
$$\begin{aligned} \lambda _i^{k+1}&=P_Q\left( \sum _{j=1}^ma_{ij}(k)\lambda _j^k+\alpha _ky_i^k\right) , \end{aligned}$$
(3.4b)
$$\begin{aligned} z_i^{k+1}&=\sum _{j=1}^mb_{ij}(k)z_j^k+\nabla _x\mathcal {L}_i\left( x_i^{k+1},\lambda _i^{k+1}\right) -\nabla _x\mathcal {L}_i\left( x_i^k,\lambda _i^k\right) , \end{aligned}$$
(3.4c)
$$\begin{aligned} y_i^{k+1}&=\sum _{j=1}^mb_{ij}(k)y_j^k+h_i\left( x_i^{k+1}\right) -h_i\left( x_i^k\right) , \end{aligned}$$
(3.4d)

where \(z_i^0=\nabla _x\mathcal {L}_i(x_i^0,\lambda _i^0)\) and \( y_i^0=h_i(x_i^0)\), for each \( i=1,2,\ldots ,m\).

In the presented algorithm (3.4), each agent i generates a primal-dual pair \((x_i^k,\lambda _i^k)\) to estimate a saddle point \((x^*,\lambda ^*)\) of the max-min problem (2.1), and employs the sequences \(\{z_i^k\}_{k\ge 0}\) and \(\{y_i^k\}_{k\ge 0}\) to track the average gradients \(\frac{1}{m}\sum _{i=1}^m\nabla _x\mathcal {L}_i(x_i^k,\lambda _i^k)\) and \(\frac{1}{m}\sum _{i=1}^m\nabla _\lambda \mathcal {L}_i(x_i^k,\lambda _i^k)\), respectively. After one round of communication, each agent i mixes the information received from its in-neighbors, then uses the mixed information and local data to output the new estimate \((x_i^{k+1},\lambda _i^{k+1})\). The local computation of each agent is independent and can be performed in parallel; a minimal sketch of one round is given below.
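The sketch below implements one round of (3.4a)–(3.4d); the oracles `grad_x_L` and `h` and the projections `P_X`, `P_Q` are user-supplied placeholders, the loop over agents is sequential here only for clarity, and initialization should set \(z_i^0=\nabla _x\mathcal {L}_i(x_i^0,\lambda _i^0)\) and \(y_i^0=h_i(x_i^0)\).

```python
import numpy as np

def distributed_pd_round(x, lam, z, y, A, B, grad_x_L, h, P_X, P_Q, alpha):
    """One synchronous round of (3.4) for all m agents.

    x, z : arrays of shape (m, n);  lam, y : arrays of shape (m, p + q)
    A, B : the row-/column-stochastic mixing matrices A(k), B(k)
    grad_x_L(i, xi, li) -> gradient of L_i in x;  h(i, xi) -> h_i(xi)
    """
    m = x.shape[0]
    x_new = np.array([P_X(A[i] @ x - alpha * z[i]) for i in range(m)])        # (3.4a)
    lam_new = np.array([P_Q(A[i] @ lam + alpha * y[i]) for i in range(m)])    # (3.4b)
    z_new = np.array([B[i] @ z + grad_x_L(i, x_new[i], lam_new[i])
                      - grad_x_L(i, x[i], lam[i]) for i in range(m)])         # (3.4c)
    y_new = np.array([B[i] @ y + h(i, x_new[i]) - h(i, x[i])
                      for i in range(m)])                                     # (3.4d)
    return x_new, lam_new, z_new, y_new
```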

Remark 3.1

For the updating formulas of the multipliers \(\lambda _i^k\), we adopt the projection operator \(P_Q\) instead of \(P_{\mathcal {K}^\circ }\). The reason is that we consider the saddle point problem

$$\begin{aligned} \max _{\lambda \in Q}\min _{x\in \mathcal {X}}\sum _{i=1}^m\mathcal {L}_i(x,\lambda ) \end{aligned}$$
(3.5)

instead of (2.1). The problems (2.1) and (3.5) have the same saddle points, e.g., see [53, Lemma 3.1]. The boundedness of the sequence \(\{(\lambda _i^k)^{\textrm{I}}\}\) is guaranteed by the bounded set \(Q_I(\alpha )\), where \(\lambda _i^k=\left( (\lambda _i^k)^{\textrm{I}},(\lambda _i^k)^{\textrm{E}}\right) \).

Algorithm 1: Distributed Primal-Dual Algorithm.

In what follows, we assume that the step-size sequence \(\{\alpha _k\}\) is non-summable but square-summable, as also considered in [2, 10, 32], among others. These step-size conditions are easy to satisfy; e.g., choose \(\alpha _k = c/k^\beta \) with \(\beta \in (1/2,1]\) and a constant \(c>0\).

Assumption 3.1

Suppose that the sequence \(\{\alpha _k\}_{k=0}^\infty \) of step-sizes satisfies that

$$\begin{aligned} \alpha _k\ge 0, \quad \sum _{k=0}^\infty \alpha _k=\infty ,\quad \sum _{k=0}^\infty \alpha _k^2<\infty . \end{aligned}$$

Under Assumption 2.3, the subset \(\mathcal {X}\) is compact. According to the continuity and differentiability of \(f_i\) and \(h_i\), the gradients of \(f_i\) and \(h_i\) are Lipschitz continuous over \(\mathcal {X}\). Without loss of generality, assume that there exists a constant \(L>0\) such that, for any \(x, y\in \mathcal {X}\) and \(i\in \mathcal {V}\),

$$\begin{aligned} \begin{aligned} \Vert f_i(x)\Vert&\le L,\quad \Vert h_i(x)\Vert \le L,\quad \Vert h_i(x)-h_i(y)\Vert \le L\Vert x-y\Vert ,\\&\quad \Vert \nabla _x\mathcal {L}_i(x,\lambda )-\nabla _x\mathcal {L}_i(y,\lambda )\Vert \le L\Vert x-y\Vert . \end{aligned} \end{aligned}$$
(3.6)

4 Convergence Analysis

In this section, we analyze the proposed Algorithm 1 and present the theoretical results on its convergence and rate. To ease the analysis, we collect some supporting lemmas in “Appendix C” and define the following notation,

$$\begin{aligned}&\omega _i^k=\left( x_i^k,\lambda _i^k\right) , \quad \eta _i^k=\left( z_i^k,-y_i^k\right) ,\\&\nabla \mathcal {L}_i(\omega _i^k)=\left( \nabla _x\mathcal {L}_i\left( x_i^k,\lambda _i^k\right) ,-h_i(x_i^k)\right) ,\\&\omega ^k=\left( \omega _1^k,\omega _2^k,\cdots ,\omega _m^k\right) ^\top , \eta ^k=\left( \eta _1^k,\eta _2^k,\cdots ,\eta _m^k\right) ^\top ,\\&\nabla \textrm{L}(\omega ^k)=\left( \nabla \mathcal {L}_1\left( \omega _1^k\right) ,\cdots ,\nabla \mathcal {L}_m\left( \omega _m^k\right) \right) ^\top ,\\&\mathcal {A}_k=\left( \begin{array}{cc} A(k) &{}\quad 0_{m\times m}\\ 0_{m\times m} &{}\quad A(k) \end{array} \right) , \ \mathcal {B}_k=\left( \begin{array}{cc} B(k) &{}\quad 0_{m\times m}\\ 0_{m\times m} &{}\quad B(k) \end{array} \right) ,\\&\varOmega =\underbrace{\mathcal {X}\times Q\times \mathcal {X}\times Q\times \dots \times \mathcal {X}\times Q}_{m}. \end{aligned}$$

We then rewrite the iterative schemes of Algorithm 1 into the following compact form,

$$\begin{aligned} \omega ^{k+1}&=P_\varOmega \left( \mathcal {A}_k\omega ^k-\alpha _k\eta ^k\right) , \end{aligned}$$
(4.1)
$$\begin{aligned} \eta ^{k+1}&=\mathcal {B}_k\eta ^k+\nabla \textrm{L}\big (\omega ^{k+1}\big )-\nabla \textrm{L}\big (\omega ^k\big ), \end{aligned}$$
(4.2)

where the matrices \(\{\mathcal {A}_k\}\) and \(\{\mathcal {B}_k\}\) are also row-stochastic and column-stochastic, respectively. Furthermore, we make a state transformation: \(s^k=V_k^{-1}\eta ^k\), where \(V_k=\textrm{diag}(v_k)\) and \(v_k\) are given by

$$\begin{aligned} v_{k+1}=\mathcal {B}_kv_k,\quad v_0=1_{2m}. \end{aligned}$$
(4.3)

Consequently, (4.1) and (4.2) can be equivalently rewritten as

$$\begin{aligned} \omega ^{k+1}&=P_\varOmega \left( \mathcal {A}_k\omega ^k-\alpha _kV_ks^k\right) , \end{aligned}$$
(4.4)
$$\begin{aligned} s^{k+1}&=\mathcal {R}_ks^k+V_{k+1}^{-1}\left( \nabla \textrm{L}\big (\omega ^{k+1}\big )-\nabla \textrm{L}\big (\omega ^k\big )\right) , \end{aligned}$$
(4.5)

where \(\mathcal {R}_k=V_{k+1}^{-1}\mathcal {B}_kV_k\). It can be verified that \(\{\mathcal {R}_k\}\) is a sequence of row-stochastic matrices such that \(\{\frac{1}{2m}v_k\}\) is an absolute probability sequence. For any \(k\ge s\ge 0\), define

$$\begin{aligned} \varPhi _\mathcal {A}(k,s)&=\mathcal {A}_k\times \mathcal {A}_{k-1}\times \cdots \times \mathcal {A}_s,\\ \varPhi _\mathcal {R}(k,s)&=\mathcal {R}_k\times \mathcal {R}_{k-1}\times \cdots \times \mathcal {R}_s. \end{aligned}$$

As shown in Lemma C.1, for any index \(s\ge 0\), the row-stochastic matrix sequence \(\{\varPhi _\mathcal {A}(k,s)\}\) converges linearly, as \(k\rightarrow \infty \), to a matrix \(\varPhi _\mathcal {A}(s)\) whose columns are stochastic vectors. The asymptotic properties of the matrix sequences \(\{\varPhi _\mathcal {A}(k,s)\}\) and \(\{\varPhi _\mathcal {R}(k,s)\}\) are used to estimate consensus errors of the sequence \(\{\omega ^k\}\).

From (4.2), it holds that

$$\begin{aligned} 1_{2m}^\top \eta ^{k+1}&=1_{2m}^\top \mathcal {B}_k\eta ^k+1_{2m}^\top \left( \nabla \textrm{L}\big (\omega ^{k+1}\big )-\nabla \textrm{L}\big (\omega ^k\big )\right) \\&=1_{2m}^\top \eta ^k+1_{2m}^\top \left( \nabla \textrm{L}\big (\omega ^{k+1}\big )-\nabla \textrm{L}\big (\omega ^k\big )\right) , \end{aligned}$$

which is equivalent to

$$\begin{aligned} 1_{2m}^\top \eta ^{k+1}-1_{2m}^\top \nabla \textrm{L}\big (\omega ^{k+1}\big )=1_{2m}^\top \eta ^{k}-1_{2m}^\top \nabla \textrm{L}\big (\omega ^{k}\big ), \quad \forall \,k\ge 0. \end{aligned}$$

Since the initialization is \(\eta ^0=\nabla \textrm{L}(\omega ^0)\), the following tracking equation holds,

$$\begin{aligned} 1_{2m}^\top \eta ^k=1_{2m}^\top \nabla \textrm{L}\big (\omega ^k\big ), \quad \forall \, k\ge 0. \end{aligned}$$

Define

$$\begin{aligned} \bar{\omega }^k=\left( \bar{x}^k,\bar{\lambda }^k\right) =\phi _k^\top \omega ^k,\quad \bar{s}^k=\frac{1}{2m}v_k^\top s^k, \end{aligned}$$
(4.6)

where the sequence \(\{\phi _k\}\) is given by Lemma C.2. Therefore,

$$\begin{aligned} \bar{s}^k=\frac{1}{2m}v_k^\top V_k^{-1}\eta ^k =\frac{1}{2m}1_{2m}^\top \eta ^k=\frac{1}{2m}1_{2m}^\top \nabla \textrm{L}\big (\omega ^k\big ), \end{aligned}$$

which means that \(\bar{s}^k\) tracks the average gradient of the Lagrange function \(\mathcal {L}(x,\lambda )\). Since the subset \(\mathcal {X}\) is compact and convex, \(\nabla _x\mathcal {L}(x,\lambda )\) and \(\nabla _\lambda \mathcal {L}(x,\lambda )\) are bounded over \(\mathcal {X}\), which implies that the sequence \(\{\bar{s}^k\}\) is bounded. Furthermore, the following lemma claims that the sequence \(\{s^k\}\) is also bounded.

Lemma 4.1

There is a constant \(M>0\) such that

$$\begin{aligned} \Vert s^k\Vert \le M,\quad \forall \, k\ge 0. \end{aligned}$$

Proof

See “Appendix A”. \(\square \)

According to Algorithm 1, each agent i generates the local estimate \(\omega _i^k=(x_i^k,\lambda _i^k)\) of the saddle point \(\omega ^*=(x^*,\lambda ^*)\). It is necessary to evaluate the error between any two vectors \(\omega _i^k\) and \(\omega _j^k\), or equivalently, the consensus error \(\Vert \omega ^k-1_{2m}\bar{\omega }^k\Vert \), which measures the violation of the consensus constraints \(x_i=x_j\) and \(\lambda _i=\lambda _j\) for all \((i,j)\in \mathcal {E}(k)\), \(k\ge 0\). The following lemma characterizes the consensus errors of the sequences \(\{\omega ^k\}\) and \(\{s^k\}\).

Lemma 4.2

Under Assumptions 2.1–2.4 and 3.1, the sequences generated by Algorithm 1 satisfy

$$\begin{aligned} \lim _{k\rightarrow \infty }\Vert \omega ^k-1_{2m}\bar{\omega }^k\Vert&=0,\quad \sum _{k=0}^\infty \alpha _k\Vert \omega ^k-1_{2m}\bar{\omega }^k\Vert<\infty ,\\ \lim _{k\rightarrow \infty }\Vert s^k-1_{2m}\bar{s}^k\Vert&=0,\quad \sum _{k=0}^\infty \alpha _k\Vert s^k-1_{2m}\bar{s}^k\Vert <\infty . \end{aligned}$$

Proof

See “Appendix B”. \(\square \)

The rest of the convergence analysis evaluates the optimality error \(\Vert \bar{\omega }^k-\omega ^*\Vert \), which requires strategically applying the results of Lemmas 4.1 and 4.2 and the supporting lemmas in “Appendix C”.

Theorem 4.1

Under Assumptions 2.1–2.4 and 3.1, for any \(i\in \mathcal {V}\), the sequence \(\{(x_i^k,\lambda _i^k)\}\) generated by Algorithm 1 converges to a saddle point \((x^*,\lambda ^*)\) of the Lagrange function \(\mathcal {L}(x,\lambda )\), i.e.,

$$\begin{aligned} \lim _{k\rightarrow \infty }\left\| \left( x_i^k,\lambda _i^k\right) -\left( x^*,\lambda ^*\right) \right\| =0,\quad \forall \, i=1,2,\dots ,m. \end{aligned}$$
(4.7)

Proof

  For any \(\omega ^*=(x^*,\lambda ^*)\) \(\in \mathcal {X}\times \mathcal {K}^\circ \), define

$$\begin{aligned} I_1^k&=2\alpha _k\left\langle 1_{2m}(\nabla \textrm{L}(\omega ^k)-\nabla \textrm{L}(1_{2m}\bar{\omega }^k)),\bar{\omega }^k-\omega ^*\right\rangle ,\\ I_2^k&=2\alpha _k\left\langle V_ks^k-1_{2m}\bar{s}^k,\mathcal {A}_k\omega ^k-1_{2m}\omega ^*\right\rangle ,\\ I_3^k&=2\alpha _k\left\langle 1_{2m}(\bar{\omega }^k-\omega ^*),\mathcal {A}_k\omega ^k-1_{2m}\bar{\omega }^k\right\rangle . \end{aligned}$$

We then split the proof into the following three steps.

Step 1: Verify the following inequality,

$$\begin{aligned} \begin{aligned}&\Vert \omega ^{k+1}-1_{2m}\omega ^*\Vert ^2\\&\quad \le \Vert \omega ^k-1_{2m}\omega ^*\Vert ^2+I_1^k+I_2^k+I_3^k+4m^2M^2\alpha _k^2\\&\qquad -2\alpha _k\left( \mathcal {L}(x^*,\lambda ^*)-\mathcal {L}(x^*,\bar{\lambda }^k)\right) -2\alpha _k\left( \mathcal {L}(\bar{x}^k,\lambda ^*)-\mathcal {L}(x^*,\lambda ^*)\right) . \end{aligned} \end{aligned}$$
(4.8)

From (4.4) and the non-expansiveness of the projection operator \(P_\varOmega \), one has

$$\begin{aligned} \Vert \omega ^{k+1}-1_{2m}\omega ^*\Vert ^2&\le \Vert \mathcal {A}_k\omega ^k-\alpha _kV_ks^k-1_{2m}\omega ^*\Vert ^2\\&=\Vert \mathcal {A}_k\omega ^k-1_{2m}\omega ^*\Vert ^2+\alpha _k^2\Vert V_ks^k\Vert ^2 -2\alpha _k\left\langle V_ks^k,\mathcal {A}_k\omega ^k-1_{2m}\omega ^*\right\rangle . \end{aligned}$$

Since \(\{\frac{1}{2m}v_k\}\) is a stochastic vector sequence and \(V_k=\textrm{diag}(v_k)\), it follows that

$$\begin{aligned} \Vert V_ks^k\Vert \le \Vert V_k\Vert \Vert s^k\Vert \le 2m\Vert s^k\Vert \le 2mM, \end{aligned}$$

where the last inequality follows from the boundedness of \(\{s^k\}\) (Lemma 4.1). Since the matrix \(\mathcal {A}_k\) is row-stochastic, it holds that

$$\begin{aligned} \Vert \mathcal {A}_k\omega ^k-1_{2m}\omega ^*\Vert ^2\le \Vert \omega ^k-1_{2m}\omega ^*\Vert ^2. \end{aligned}$$

As to the term \(2\alpha _k\left\langle V_ks^k,\mathcal {A}_k\omega ^k-1_{2m}\omega ^*\right\rangle \), one has

$$\begin{aligned}&2\alpha _k\left\langle V_ks^k,\mathcal {A}_k\omega ^k-1_{2m}\omega ^*\right\rangle \\&\quad =2\alpha _k\left\langle 1_{2m}\bar{s}^k,1_{2m}\bar{\omega }^k-1_{2m}\omega ^*\right\rangle +2\alpha _k\left\langle V_ks^k-1_{2m}\bar{s}^k,\mathcal {A}_k\omega ^k-1_{2m}\omega ^*\right\rangle \\&\qquad +2\alpha _k\left\langle 1_{2m}(\bar{\omega }^k-\omega ^*),\mathcal {A}_k\omega ^k-1_{2m}\bar{\omega }^k\right\rangle \\&\quad =2\alpha _k\left\langle 1_{2m}^\top \nabla \textrm{L}(1_{2m}\bar{\omega }^k),\bar{\omega }^k-\omega ^*\right\rangle +I_1^k+I_2^k+I_3^k. \end{aligned}$$

Note that the Lagrange function \(\mathcal {L}(x,\lambda )\) is convex-concave with respect to x and \(\lambda \); applying the subgradient inequality for convex functions yields

$$\begin{aligned} \left\langle 1_{2m}^\top \nabla \textrm{L}(1_{2m}\bar{\omega }^k),\bar{\omega }^k-\omega ^*\right\rangle \ge \left( \mathcal {L}\big (\bar{x}^k,\lambda ^*\big )-\mathcal {L}(x^*,\bar{\lambda }^k)\right) . \end{aligned}$$

Combining the above relations yields (4.8).

Step 2: Verify that

$$\begin{aligned} \sum _{k=0}^\infty \vert I_1^k\vert<\infty ,\quad \sum _{k=0}^\infty \vert I_2^k\vert<\infty ,\quad \sum _{k=0}^\infty \vert I_3^k\vert <\infty . \end{aligned}$$

According to the Cauchy–Schwarz inequality and the Lipschitz continuity of \(\nabla \mathcal {L}\), there exist positive constants \(L_1, L_2\), and \(L_3\) such that

$$\begin{aligned} \vert I_1^k\vert&\le 2L_1\alpha _k\Vert \omega ^k-1_{2m}\bar{\omega }^k\Vert ,\\ \vert I_2^k\vert&\le 2L_2\alpha _k\Vert s^k-1_{2m}\bar{s}^k\Vert ,\\ \vert I_3^k\vert&\le 2L_3\alpha _k\Vert \omega ^k-1_{2m}\bar{\omega }^k\Vert . \end{aligned}$$

By Lemma 4.2, the three scalar sequences \(\{\vert I_1^k\vert \}\), \(\{\vert I_2^k\vert \}\), and \(\{\vert I_3^k\vert \}\) are summable.

Step 3: Prove the convergence. In the inequality (4.8), choose \(\omega ^*=(x^*,\lambda ^*)\) as an arbitrary saddle point of the Lagrange function \(\mathcal {L}(x,\lambda )\). The definition of saddle points indicates that

$$\begin{aligned} \mathcal {L}(\bar{x}^k,\lambda ^*)-\mathcal {L}(x^*,\lambda ^*)&\ge 0,\\ \mathcal {L}(x^*,\lambda ^*)-\mathcal {L}(x^*,\bar{\lambda }^k)&\ge 0. \end{aligned}$$

Consequently, it follows from Lemma C.3 that the limit \(\lim _{k\rightarrow \infty }\Vert \omega ^k-1_{2m}\omega ^*\Vert \) exists and

$$\begin{aligned} \sum _{k=0}^\infty \alpha _k\left( \mathcal {L}(\bar{x}^k,\lambda ^*)-\mathcal {L}(x^*,\lambda ^*)\right)&<\infty ,\\ \sum _{k=0}^\infty \alpha _k\left( \mathcal {L}(x^*,\lambda ^*)-\mathcal {L}(x^*,\bar{\lambda }^k)\right)&<\infty . \end{aligned}$$

In view of \(\sum _{k=0}^\infty \alpha _k=\infty \), it must hold that

$$\begin{aligned} \liminf _{k\rightarrow \infty }\mathcal {L}(\bar{x}^k,\lambda ^*)&=\mathcal {L}(x^*,\lambda ^*),\\ \limsup _{k\rightarrow \infty }\mathcal {L}(x^*,\bar{\lambda }^k)&=\mathcal {L}(x^*,\lambda ^*). \end{aligned}$$

Furthermore, the Lagrange function \(\mathcal {L}(x,\lambda )\) is continuously differentiable and convex-concave; hence,

$$\begin{aligned} \lim _{k\rightarrow \infty }\Vert \bar{x}^k-x^*\Vert =0,\quad \lim _{k\rightarrow \infty }\Vert \bar{\lambda }^k-\lambda ^*\Vert =0, \end{aligned}$$

or equivalently, \(\lim _{k\rightarrow \infty }\Vert \bar{\omega }^k-\omega ^*\Vert =0\). From Lemma 4.2, it is clear that

$$\begin{aligned} \lim _{k\rightarrow \infty }\Vert \omega ^k-1_{2m}\omega ^*\Vert =0. \end{aligned}$$

The proof is completed. \(\square \)

The convergence results of Theorem 4.1 show that, for each agent \(i\in \{1,2,\dots ,m\}\), the decision variable sequence \(\{x_i^k\}_{k=0}^\infty \) converges to an optimal solution \(x^*\) of (DCOP), and the Lagrange multiplier estimate sequence \(\{\lambda _i^k\}_{k=0}^\infty \) converges to a dual optimal solution \(\lambda ^*\). In other words, by communicating with neighbors and computing locally, all agents cooperatively solve (DCOP).

To present results about the convergence rate of Algorithm 1, we define the following weighted sequences:

$$\begin{aligned} \tilde{x}_s^n=\frac{\sum _{k=s}^n\alpha _k\bar{x}^k}{\sum _{k=s}^n\alpha _k},\quad \tilde{\lambda }_s^n=\frac{\sum _{k=s}^n\alpha _k\bar{\lambda }^k}{\sum _{k=s}^n\alpha _k}, \quad \forall \, 0\le s\le n. \end{aligned}$$
(4.9)

where \(\bar{x}^k\) and \(\bar{\lambda }^k\) are given by (4.6).
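As a small illustration (the histories `xs`, `lams`, and `alphas` are hypothetical logs of the averaged iterates and step-sizes), the weighted averages in (4.9) can be computed as:

```python
def ergodic_average(xs, lams, alphas, s, n):
    """Weighted (ergodic) averages (4.9) over iterations s..n, inclusive."""
    den = sum(alphas[k] for k in range(s, n + 1))
    x_tilde = sum(alphas[k] * xs[k] for k in range(s, n + 1)) / den
    lam_tilde = sum(alphas[k] * lams[k] for k in range(s, n + 1)) / den
    return x_tilde, lam_tilde
```

The following theorem characterizes a sublinear convergence rate of Algorithm 1 in terms of the primal-dual gap.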

Theorem 4.2

Let \((x^*,\lambda ^*)\) be an arbitrary saddle point of \(\mathcal {L}(x,\lambda )\). Under Assumptions 2.1–2.4 and 3.1, there exists a constant \(M_1>0\) such that

$$\begin{aligned} 0\le \mathcal {L}\left( \tilde{x}_s^n,\lambda ^*\right) -\mathcal {L}\left( x^*,\tilde{\lambda }_s^n\right) \le \frac{M_1}{\sum _{k=s}^n\alpha _k}, \quad \forall \, 0\le s\le n. \end{aligned}$$
(4.10)

Furthermore, if one chooses the step-size sequence as \(\alpha _k=\frac{c}{k^{1/2+\beta }}\) with \(\beta \in (0,1/2)\) and a constant \(c>0\), the primal-dual gap satisfies

$$\begin{aligned} 0\le \mathcal {L}\left( \tilde{x}_{\lfloor n/2\rfloor }^n,\lambda ^*\right) -\mathcal {L}\left( x^*,\tilde{\lambda }_{\lfloor n/2\rfloor }^n\right) \le \frac{2M_1}{c(n+1)^{1/2-\beta }}, \end{aligned}$$
(4.11)

where \(\lfloor \cdot \rfloor \) denotes the integer part of a positive real number.

Proof

By the definition of saddle points, it is clear that

$$\begin{aligned} \mathcal {L}\left( \tilde{x}_s^n,\lambda ^*\right) -\mathcal {L}\left( x^*,\tilde{\lambda }_s^n\right) =\mathcal {L}\left( \tilde{x}_s^n,\lambda ^*\right) -\mathcal {L}(x^*,\lambda ^*)+\mathcal {L}(x^*,\lambda ^*)-\mathcal {L}\big (x^*,\tilde{\lambda }_s^n\big )\ge 0. \end{aligned}$$

In view of the convexity of \(\mathcal {L}(\cdot ,\lambda ^*)\) and the concavity of \(\mathcal {L}(x^*,\cdot )\), one has

$$\begin{aligned}&\mathcal {L}\left( \tilde{x}_s^n,\lambda ^*\right) -\mathcal {L}\left( x^*,\tilde{\lambda }_s^n\right) \\&\quad \le \,\left( \sum _{k=s}^n\alpha _k\right) ^{-1}\sum _{k=s}^n\alpha _k\left[ \mathcal {L}(\bar{x}^k,\lambda ^*)-\mathcal {L}(x^*,\bar{\lambda }^k)\right] \\&\quad =\,\left( \sum _{k=s}^n\alpha _k\right) ^{-1} \sum _{k=s}^n\alpha _k \left[ \mathcal {L}(\bar{x}^k,\lambda ^*)-\mathcal {L}(x^*,\lambda ^*)+\mathcal {L}(x^*,\lambda ^*)-\mathcal {L}(x^*,\bar{\lambda }^k)\right] . \end{aligned}$$

From the proof of Theorem 4.1, it holds that

$$\begin{aligned}&0<\sum _{k=0}^\infty \alpha _k\left( \mathcal {L}(\bar{x}^k,\lambda ^*)-\mathcal {L}(x^*,\lambda ^*)\right) =M^\prime<\infty ,\\&0<\sum _{k=0}^\infty \alpha _k\left( \mathcal {L}(x^*,\lambda ^*)-\mathcal {L}(x^*,\bar{\lambda }^k)\right) =M^{\prime \prime }<\infty . \end{aligned}$$

Let \(M_1:=M^\prime +M^{\prime \prime }\); then (4.10) clearly holds. Taking \(\alpha _k=\frac{c}{k^{1/2+\beta }}\), one obtains the following estimate,

$$\begin{aligned} \sum _{k=\lfloor n/2\rfloor }^n\frac{1}{k^{1/2+\beta }}\ge \int _{\lfloor n/2\rfloor }^{n+1}\frac{{\textrm{d}}t}{t^{1/2+\beta }} =\frac{2}{1-2\beta }\left( (n+1)^{\frac{1}{2}-\beta }-\lfloor n/2\rfloor ^{\frac{1}{2}-\beta }\right) . \end{aligned}$$

Let \(\phi (x)=x^{1/2-\beta }\) with \(\beta \in (0,1/2)\); it is easy to verify that \(\phi \) is concave, hence,

$$\begin{aligned} \phi (n+1)-\phi (\lfloor n/2\rfloor ) \ge&\nabla \phi (n+1)(n+1-\lfloor n/2\rfloor )\\ \ge&\left( \frac{1}{2}-\beta \right) \frac{(n+1)-(n+1)/2}{(n+1)^{1/2+\beta }}\\ =&\left( \frac{1}{4}-\frac{\beta }{2}\right) (n+1)^{\frac{1}{2}-\beta }. \end{aligned}$$

Choosing \(s=\lfloor n/2\rfloor \) in (4.10), it follows that

$$\begin{aligned} \mathcal {L}\left( \tilde{x}_{\lfloor n/2\rfloor }^n,\lambda ^*\right) -\mathcal {L}\left( x^*,\tilde{\lambda }_{\lfloor n/2\rfloor }^n\right) \le \,\frac{M_1}{\sum _{k=\lfloor n/2\rfloor }^n\frac{c}{k^{1/2+\beta }}} \le \frac{2M_1}{c(n+1)^{1/2-\beta }}. \end{aligned}$$

The proof is completed. \(\square \)

Theorem 4.2 shows that the primal-dual gap \(\mathcal {L}(\tilde{x}_{\lfloor n/2\rfloor }^n,\lambda ^*)-\mathcal {L}(x^*,\tilde{\lambda }_{\lfloor n/2\rfloor }^n)\) decays to zero at a rate of \(O(\frac{1}{n^{1/2-\beta }})\). If one chooses \(\beta \) small enough, the rate approaches \(O(1/\sqrt{n})\), where n is the number of iterations.

5 Numerical Simulations

In this section, we test the proposed algorithm on two examples. The first is motivated by applications in wireless networks and is used to verify the theoretical results of the paper. The second is an empirical risk minimization problem arising in Neyman–Pearson (NP) classification [44], and is used to compare the performance of Algorithm 1 with the distributed proximal primal-dual algorithm of [17].

Fig. 1: Network topology of 100 agents.

5.1 Example in Wireless Networks

Consider the following distributed optimization problem with nonlinear constraints,

$$\begin{aligned} \begin{aligned}&\min _{x\in [0,1]}\hspace{1em}\sum _{i=1}^{100} c_ix\\&\mathrm {s.t.}\hspace{1em}\sum _{i=1}^{100}-d_i\log (1+x)\le b, \end{aligned} \end{aligned}$$
(5.1)

where constraints of this form arise, for instance, in wireless networks to ensure quality of service. This problem is also considered in [17, 25]. We set \(b=5\), \(c_i=i/m\), and \(d_i=i/(m+1)\) for each \(i=1,2,\dots ,100\), where the network size (number of agents) is \(m=100\). The optimal value \(f^*\) is computed by the convex optimization toolbox CVX [11] at its best precision; we then use \(f^*\) as a benchmark for our algorithm.
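The following is a minimal sketch of this benchmark computation, assuming the Python package cvxpy as a stand-in for the MATLAB toolbox CVX:

```python
import cvxpy as cp
import numpy as np

m, b = 100, 5.0
c = np.arange(1, m + 1) / m        # c_i = i / m
d = np.arange(1, m + 1) / (m + 1)  # d_i = i / (m + 1)

x = cp.Variable()
objective = cp.Minimize(float(c.sum()) * x)            # sum_i c_i * x
constraints = [x >= 0, x <= 1,
               -float(d.sum()) * cp.log(1 + x) <= b]   # sum_i -d_i log(1 + x) <= b
f_star = cp.Problem(objective, constraints).solve()
```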

Fig. 2: Evolution of agents' decision vectors \(x_i^k\), \(i=1,2,\ldots ,100\), where the red dashed line represents the value of the optimal solution \(x^*\).

Fig. 3: Convergence of Algorithm 1.

As shown in Fig. 1, the network topology of the 100 agents is selected in turn, with period 4, among the line, star, circle, and Erdős–Rényi random graphs. The corresponding adjacency matrices A(k) and B(k) are constructed by

$$\begin{aligned} a_{ij}(k)&= {\left\{ \begin{array}{ll} \frac{1}{\vert \mathcal {N}_i^\textrm{in}(k)\vert },&{}\quad {\text {if}}\ (i,j)\in \mathcal {E}(k),\\ 0, &{}\quad {\text {otherwise,}} \end{array}\right. } \\ b_{ij}(k)&= {\left\{ \begin{array}{ll} \frac{1}{\vert \mathcal {N}_j^\textrm{out}(k)\vert },&{}\quad {\text {if}}\ (i,j)\in \mathcal {E}(k),\\ 0, &{}\quad {\text {otherwise,}} \end{array}\right. } \end{aligned}$$

where \(\vert \mathcal {N}_i^\mathrm{{in}}(k)\vert \) and \(\vert \mathcal {N}_j^\mathrm{{out}}(k)\vert \) are the in-degree of agent i and the out-degree of agent j at time k, respectively.
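A minimal sketch of this construction (assuming edges are stored as (sender, receiver) pairs and every node has a self-loop, consistent with Assumption 2.2 (b)):

```python
import numpy as np

def mixing_matrices(edges, m):
    """Build A(k) (row-stochastic) and B(k) (column-stochastic) from the
    edge set of G(k); edges are (j, i) pairs meaning agent j sends to agent i,
    and self-loops (i, i) are assumed to be included for every i."""
    A = np.zeros((m, m))
    B = np.zeros((m, m))
    in_deg = np.zeros(m, dtype=int)
    out_deg = np.zeros(m, dtype=int)
    for (j, i) in edges:
        in_deg[i] += 1
        out_deg[j] += 1
    for (j, i) in edges:
        A[i, j] = 1.0 / in_deg[i]    # receiver i averages over its in-neighbors
        B[i, j] = 1.0 / out_deg[j]   # sender j splits its mass over out-neighbors
    return A, B
```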

We execute the proposed Algorithm 1 with step-size \(\alpha _k=1/k^{0.51}\) to solve (5.1) on a laptop computer. The procedure stops when the maximum of the objective error and the constraint violation falls below the given tolerance \(10^{-4}\), where the objective error is \(\vert f(\bar{x}^k)-f^*\vert /\vert f^*\vert \), and the constraint violation is the sum of \(\max \left( \sum _{i=1}^{100}-d_i\log (1+\bar{x}^k)-b,0\right) \) (feasibility error) and \(\frac{1}{100}\sum _{i=1}^{100}\Vert x_i^k-\bar{x}^k\Vert \) (consistency error). Here, \(\bar{x}^k=\frac{1}{100}\sum _{i=1}^{100}x_i^k\).

At the beginning of the procedure, each agent i individually chooses its local decision vector \(x_i^0\). As Algorithm 1 runs, Fig. 2 shows that all agents reach a consensus decision through local computations and information exchange with their neighbors. Furthermore, Fig. 3 shows that this consensus decision is an optimal solution to (5.1) with precision \(10^{-4}\), which verifies the theoretical results for Algorithm 1. In summary, Algorithm 1 guides all agents to adjust their decision vectors to be feasible, consistent, and optimal.

5.2 Neyman–Pearson Classification Problem

Suppose that h is a classifier predicting the labels 1 and \(-1\). The type I error (misclassifying class \(-1\) as 1) and the type II error (misclassifying class 1 as \(-1\)) are defined, respectively, by

$$\begin{aligned} \text {type I error}:=\mathbb {E}[\psi (-bh(a))|b=-1],\quad \text {type II error}:=\mathbb {E}[\psi (-bh(a))|b=1], \end{aligned}$$

where \(\psi \) is some merit function. Unlike conventional binary classification in machine learning, the NP classification paradigm learns a classifier by minimizing the type II error subject to the type I error being below a user-specified level \(\tau >0\); see [44] and the references therein. Specifically, let \(\mathcal {H}\) be a given class of classifiers. NP classification solves the following problem,

$$\begin{aligned} \min _{h\in \mathcal {H}} \quad&\mathbb {E}[\psi (-bh(a))|b=1]\\ \mathrm {s.t.}\quad&\mathbb {E}[\psi (-bh(a))|b=-1]\le \tau . \end{aligned}$$

In what follows, we consider its empirical risk minimization counterpart. Given a labeled training dataset \(\{a_i\}_{i=1}^N\) consisting of a positive set \(\{a_i^+\}_{i=1}^{N_+}\) and a negative set \(\{a_j^-\}_{j=1}^{N_-}\), the associated empirical NP classification problem is

$$\begin{aligned} \begin{aligned} \min _x \quad&f(x)=\frac{1}{N_+}\sum _{i=1}^{N_+}\ell \left( x^\top a_i^+\right) \\ \mathrm {s.t.}\quad&g(x):=\frac{1}{N_-}\sum _{i=1}^{N_-}\ell \left( x^\top a_i^-\right) -\tau \le 0, \end{aligned} \end{aligned}$$
(5.2)

where \(\ell (\cdot )\) is a loss function and \(\tau \) is a user-specified level, usually a small value. In this numerical test, we choose \(\ell (\cdot )\) as the \(\ell _2\)-norm regularized logistic loss, i.e., \(\ell (x^\top a):=\log \left( 1+\exp (-x^\top a)\right) +\theta \Vert x\Vert _2^2\), with \(\theta = 10^{-2}\) and \(\tau =5\%\). The data \(a_i^+\)'s and \(a_i^-\)'s are randomly sampled from the dataset CINA [46]. The sample sizes are \(N_+=N_-=1000\), and the variable is \(x\in \Re ^{132}\).
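A small sketch of the local loss and its gradient under these choices (the helper names are ours; `logaddexp` is used for numerical stability):

```python
import numpy as np

THETA = 1e-2  # regularization weight theta

def reg_logistic_loss(x, a):
    """ell(x^T a) = log(1 + exp(-x^T a)) + theta * ||x||_2^2."""
    u = a @ x
    return np.logaddexp(0.0, -u) + THETA * np.dot(x, x)

def reg_logistic_grad(x, a):
    """Gradient in x: d/du log(1 + e^{-u}) = -1 / (1 + e^{u}), chained with u = a^T x."""
    u = a @ x
    return -a / (1.0 + np.exp(u)) + 2.0 * THETA * x
```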

We run Algorithm 1 and the distributed proximal primal-dual (DPPD) method of [17] to solve (5.2). The step-size sequences of Algorithm 1 and DPPD are chosen as \(0.1/k^{0.505}\) and \(4/k^{0.5}\), respectively. The subproblems of DPPD are solved by Nesterov's accelerated gradient method [31] with a tolerance of \(10^{-6}\). We start both algorithms from the same initial point and plot the evolution of their objective errors and constraint violations over 1000 iterations, where the objective error is \(\vert f(\bar{x}^k)-f^*\vert /\vert f^*\vert \), and the constraint violation is the sum of \(\max (g(\bar{x}^k),0)\) (feasibility error) and \(\frac{1}{1000}\sum _{i=1}^{1000}\Vert x_i^k-\bar{x}^k\Vert \) (consistency error). Here, \(\bar{x}^k=\frac{1}{1000}\sum _{i=1}^{1000}x_i^k\).

Fig. 4: Comparison of Algorithm 1 and the distributed proximal primal-dual (DPPD) method [17].

In Fig. 4a, the objective error of DPPD decays to the level of \(10^{-3}\) after about 220 iterations, then suddenly climbs to near \(10^{-1}\) and stays there, while the objective error of Algorithm 1 keeps decreasing (except for the first few iterations) and reaches \(10^{-4}\) after 1000 iterations. As shown in Fig. 4b, the constraint violation of Algorithm 1 decreases faster than that of DPPD, indicating that the decision variables generated by Algorithm 1 approach feasibility more quickly. Moreover, the ratio of CPU time between Algorithm 1 and DPPD is 1:1.25. In summary, Fig. 4 illustrates that Algorithm 1 is efficient and converges faster than DPPD.

6 Conclusions

This article focused on distributed convex optimization problems with coupled equality and inequality constraints over directed and time-varying multi-agent networks. We proposed a distributed primal-dual algorithm for such problems and established its convergence and iteration complexity. Numerical simulations showed that the algorithm is effective for the problems considered in this paper. However, some aspects still need improvement: the presented algorithm adopts diminishing step-sizes, which leads to slow convergence. We therefore expect to replace the diminishing step-sizes of Algorithm 1 with a fixed step-size in future studies.