1 Introduction

In this paper, we consider the following multi-block structured convex optimization model

$$\begin{aligned}&\min _{x,y}\;\; f(x_1,\cdots ,x_N)+\sum _{i=1}^N u_i(x_i) + g(y_1,\cdots ,y_M)+\sum _{j=1}^M v_j(y_j) \nonumber \\&\text{ s.t. } \sum _{i=1}^N A_{i}x_i +\sum _{j=1}^M B_jy_j = b,\\&\qquad \,x_i\in \mathcal {X}_i,\, i=1,\cdots ,N;\ y_j\in \mathcal {Y}_j,\, j=1,\cdots ,M,\nonumber \end{aligned}$$
(1.1)

where the variables \(x=(x_1;\cdots ;x_N)\) and \(y=(y_1;\cdots ;y_M)\) are naturally partitioned into N and M blocks, respectively, \(A=(A_1,\cdots ,A_{N})\) and \(B=(B_1,\cdots ,B_M)\) are block matrices, \(\mathcal {X}_i\) and \(\mathcal {Y}_j\) are some closed convex sets, f and g are smooth convex functions, and \(u_i\) and \(v_j\) are proper closed convex (possibly nonsmooth) functions.

1.1 Motivating Examples

Optimization problems in the form of (1.1) arise in many emerging applications across various fields. For example, the constrained lasso (classo) problem, first studied by James et al. [1] as a generalization of the lasso problem, can be formulated as

$$\begin{aligned}&\min \limits _{x} \frac{1}{2}\Vert A x-b\Vert _2^2+\tau \Vert x\Vert _1 \nonumber \\&\text{ s.t. } Cx\leqslant d, \end{aligned}$$
(1.2)

where \(A\in \mathbb {R}^{m\times p}\), \(b\in \mathbb {R}^m\) are the observed data, and \(C\in \mathbb {R}^{n\times p}\), \(d\in \mathbb {R}^n\) are the predefined data matrix and vector. Many widely used statistical models can be viewed as special cases of (1.2), including monotone curve estimation, the fused lasso, the generalized lasso, and so on [1]. By partitioning the variable \(x\) into blocks as \(x=(x_1;\cdots ;x_K)\) with \(x_i\in \mathbb {R}^{p_i}\), partitioning the matrices and vectors in (1.2) accordingly, and introducing a slack variable y, the classo problem can be transformed to

$$\begin{aligned}&\min \limits _{x,y} \frac{1}{2}\left\| \sum \limits _{i=1}^K A_ix_i-b\right\| _2^2+\tau \sum \limits _{i=1}^K\Vert x_i\Vert _1\nonumber \\&\text{ s.t. } \sum \limits _{i=1}^{K}C_ix_i+y=d, \,\, y \geqslant 0, \end{aligned}$$
(1.3)

which is in the form of (1.1).
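
To make this reformulation concrete, the following sketch (in Python/NumPy, with randomly generated stand-ins for A, b, C, d and an equal-size block partition, both of which are our own illustrative choices) builds the block-partitioned equality constraint of (1.3) by appending the slack variable y.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p, K = 20, 15, 12, 3              # illustrative sizes only
A = rng.standard_normal((m, p))         # data matrix in (1.2)
b = rng.standard_normal(m)
C = rng.standard_normal((n, p))         # constraint matrix in (1.2)
d = rng.standard_normal(n)

# Partition x into K blocks (uneven splits work as well) and split A, C accordingly.
blocks = np.array_split(np.arange(p), K)
A_blocks = [A[:, idx] for idx in blocks]    # A_i in (1.3)
C_blocks = [C[:, idx] for idx in blocks]    # C_i in (1.3)

# Equality constraint of (1.3): sum_i C_i x_i + y = d with y >= 0.
# Stacking the blocks and an identity for the slack gives [C_1, ..., C_K, I][x; y] = d.
C_aug = np.hstack(C_blocks + [np.eye(n)])
x = rng.standard_normal(p)
y = d - C @ x                           # the slack that makes (x, y) satisfy the equality
assert np.allclose(C_aug @ np.concatenate([x, y]), d)
```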

Another interesting example is extended linear-quadratic programming [2], which can be formulated as

$$\begin{aligned}&\min \limits _x \frac{1}{2} x^\top P x+a^\top x+\max \limits _{s\in \mathcal {S}}\left\{ (d-Cx)^\top s-\frac{1}{2}s^\top Qs\right\} \nonumber \\&\text{ s.t. } Ax\leqslant b \end{aligned}$$
(1.4)

where P and Q are symmetric positive semidefinite matrices, and \(\mathcal {S}\) is a polyhedral set. Clearly, (1.4) includes quadratic programming as a special case. In general, its objective is a piecewise linear-quadratic convex function. Let \(g(s)=\frac{1}{2}s^\top Qs+\iota _{{\mathcal {S}}}(s)\), where \(\iota _{\mathcal {S}}\) denotes the indicator function of \({\mathcal {S}}\). Then,

$$\begin{aligned} \max \limits _{s\in {\mathcal {S}}}\left\{ (d-Cx)^\top s-\frac{1}{2}s^\top Qs\right\} =g^*(d-Cx), \end{aligned}$$

where \(g^*\) denotes the convex conjugate of g. Replacing \(d-Cx\) by y and introducing a slack variable z, we can equivalently write (1.4) in the form of (1.1):

$$\begin{aligned} \begin{array}{ll} \min \limits _{x,y,z} &{}\frac{1}{2} x^\top P x+a^\top x+g^*(y)\\ \text{ s.t. } &{} Ax+z= b,\ z\geqslant 0,\ Cx+y=d, \end{array} \end{aligned}$$
(1.5)

for which one can further partition the x-variable into a number of disjoint blocks.

Many other interesting applications in various areas can be formulated as optimization problems in the form of (1.1), including those arising from signal processing, image processing, machine learning, and statistical learning; see [3,4,5,6] and the references therein.

Finally, we mention that computing a point on the central path for a generic convex program in block variables \((x_1;\cdots ;x_N)\):

$$\begin{aligned} \begin{array}{ll} \min \limits _{x} &{} f(x_1,\cdots ,x_N) \\ \text{ s.t. } &{} \sum \limits _{i=1}^N A_i x_i \leqslant b,\, x_i \geqslant 0, \, i=1,2,\cdots ,N \end{array} \end{aligned}$$

boils down to

$$\begin{aligned} \begin{array}{ll} \min \limits _{x,y} &{} f(x_1,\cdots ,x_N) - \mu e^\top \ln x - \mu e^\top \ln y \\ \text{ s.t. } &{} \sum \nolimits _{i=1}^N A_i x_i + y = b, \end{array} \end{aligned}$$

where \(\mu >0\) and \(e^\top \ln v\) denotes the sum of the logarithms of all the components of v. This model is again in the form of (1.1).

1.2 Related Works in the Literature

Our work relates to two topics that have recently attracted much attention: the alternating direction method of multipliers (ADMM) for multi-block structured problems and the first-order primal–dual method for bilinear saddle-point problems. Below, we review the two methods and their convergence results. A more complete discussion of their connections to our method will be provided after presenting our algorithm.

1.2.1 Multi-block ADMM and its Variants

One well-known approach for solving a linearly constrained problem in the form of (1.1) is the augmented Lagrangian method, which iteratively updates the primal variable \((x,y)\) by minimizing the augmented Lagrangian function in (1.7) and then updates the multiplier \(\lambda \) through dual gradient ascent. However, since the linear constraint couples \(x_1,\cdots ,x_N\) and \(y_1,\cdots ,y_M\) all together, it can be very expensive to minimize the augmented Lagrangian function simultaneously with respect to all block variables. Utilizing the multi-block structure of the problem, the multi-block ADMM updates the block variables sequentially, one at a time with the others fixed to their most recent values, followed by an update of the multiplier. Specifically, it performs the following updates iteratively (assuming the absence of the coupled functions f and g):

$$\begin{aligned} \left\{ \begin{array}{rcl} x_1^{k+1} &{}=&{} {\mathop {\arg \min }\nolimits _{x_1 \in \mathcal {X}_1}} \mathcal {L}_\rho (x_1,x_2^k,\cdots ,x_N^k, y^k, \lambda ^k),\\ &{}\vdots &{} \\ x_N^{k+1} &{}=&{} {\mathop {\arg \min }\nolimits _{x_N \in \mathcal {X}_N}} \mathcal {L}_\rho (x_1^{k+1},\cdots ,x_{N-1}^{k+1},x_N, y^k,\lambda ^k),\\ y_1^{k+1} &{}=&{} {\mathop {\arg \min }\nolimits _{y_1\in {\mathcal {Y}}_1}}\mathcal {L}_\rho (x^{k+1},y_1,y_2^k,\cdots ,y_M^k,\lambda ^k),\\ &{}\vdots &{} \\ y_M^{k+1} &{}=&{} {\mathop {{{\,\mathrm{arg\,min}\,}}}\nolimits _{y_M\in {\mathcal {Y}}_M}}\mathcal {L}_\rho (x^{k+1},y_1^{k+1},\cdots ,y_{M-1}^{k+1}, y_M, \lambda ^k),\\ \lambda ^{k+1} &{}=&{} \lambda ^k-\rho (Ax^{k+1}+By^{k+1}-b).\\ \end{array}\right. \end{aligned}$$
(1.6)

In the above updates, \(\mathcal {L}_\rho \) is the augmented Lagrangian function, defined as:

$$\begin{aligned} \mathcal {L}_{\rho }(x,y,\lambda )=\sum _{i=1}^N u_i(x_i)+\sum _{j=1}^M v_j(y_j)-\lambda ^{\top }\left( Ax+By-b\right) +\frac{\rho }{2}\left\| Ax+By-b\right\| ^2, \end{aligned}$$
(1.7)

where \(\rho \) is the penalty parameter and \(\lambda \) is the Lagrangian multiplier.
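
To make the update order in (1.6) explicit, the following Python/NumPy sketch runs the scheme on a toy instance of our own choosing (no y-blocks, \(u_i(x_i)=\frac{1}{2}\Vert x_i\Vert ^2\), unconstrained blocks), so that every block minimization of (1.7) has a closed form; it only illustrates the Gauss–Seidel sweep and is not meant as an efficient or general implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
m, p, N = 30, 40, 4                       # constraint rows, x-dimension, number of x-blocks
blocks = np.array_split(np.arange(p), N)
A = rng.standard_normal((m, p)) / np.sqrt(p)
b = rng.standard_normal(m)
rho = 0.1                                 # illustrative penalty parameter

# Toy separable objective u_i(x_i) = 0.5*||x_i||^2 (no coupled f, no y-blocks), so each
# block step of (1.6) minimizes an unconstrained quadratic and has a closed form.
x = np.zeros(p)
lam = np.zeros(m)
for k in range(1000):
    for idx in blocks:                    # Gauss-Seidel sweep over the x-blocks
        Ai = A[:, idx]
        s = A @ x - Ai @ x[idx]           # contribution of the other (most recent) blocks
        rhs = Ai.T @ (lam - rho * (s - b))
        x[idx] = np.linalg.solve(np.eye(len(idx)) + rho * Ai.T @ Ai, rhs)
    lam -= rho * (A @ x - b)              # multiplier update in (1.6)

print("feasibility violation:", np.linalg.norm(A @ x - b))
```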

When there are only two blocks, i.e., \(N=M=1\), the update scheme in (1.6) reduces to the classic 2-block ADMM [7, 8]. The convergence properties of the ADMM for solving 2-block separable convex problems have been studied extensively. Since the 2-block ADMM can be viewed as a manifestation of a certain operator splitting, its convergence follows from that of the so-called Douglas–Rachford operator splitting method; see [9, 10]. Moreover, the convergence rate of the 2-block ADMM has recently been established by many authors; see, e.g., [11,12,13,14,15,16].

Although the multi-block ADMM scheme in (1.6) performs very well on many instances encountered in practice (e.g., [17, 18]), it may fail to converge for some instances if there are more than two block variables, i.e., \(N+M \geqslant 3\). In particular, an example was presented in [19] showing that the ADMM may even diverge with 3 blocks of variables when solving a linear system of equations. Thus, some additional assumptions or modifications have to be in place to ensure convergence of the multi-block ADMM. In fact, by incorporating extra correction steps or changing the Gauss–Seidel updating rule, [20,21,22,23,24] showed that convergence can still be achieved for the multi-block ADMM. Moreover, if some parts of the objective function are strongly convex or the objective has a certain regularity property, then convergence can be shown to hold under various conditions; see [14, 25,26,27,28,29,30]. Using other conditions, such as an error bound condition together with small dual stepsizes, or by adding perturbations to the original problem, the authors of [15, 31] established rate-of-convergence results even without strong convexity. Beyond linearly constrained problems, the multi-block ADMM was extended in [32,33,34] to solve convex linear/quadratic conic programming problems. In a very recent work [35], Sun, Luo and Ye proposed a randomly permuted ADMM (RP-ADMM) that chooses a random permutation of the block indices and performs the ADMM updates according to the order of indices in that permutation, and they showed that the RP-ADMM converges in expectation when solving a nonsingular square linear system of equations.

In [3], the authors proposed a block successive upper-bound minimization method of multipliers (BSUMM) to solve problem (1.1) without the y-variable. Essentially, at every iteration, the BSUMM replaces the nonseparable part f(x) by an upper-bound function and works on that modified function in an ADMM manner. Under some error bound conditions and a diminishing dual stepsize assumption, the authors are able to show that the iterates produced by the BSUMM algorithm converge to the set of primal–dual optimal solutions. Along a similar direction, Cui et al. [4] introduced a quadratic upper-bound function for the nonseparable function f to solve 2-block problems; they showed that their algorithm has an O(1/t) convergence rate, where t is the number of total iterations. Very recently, [6] proposed a set of variants of the ADMM by adding proximal terms into the algorithm; the authors proved an O(1/t) convergence rate for the 2-block case, and the same result applies to the general multi-block case under some strong convexity assumptions. Moreover, [5] showed the convergence of the ADMM for 2-block problems by imposing a quadratic structure on the coupled function f(x), as well as the convergence of the RP-ADMM for the multi-block case where all separable functions vanish (i.e., \(u_i(x_i)=0,\,\forall i\)).

1.2.2 Primal–Dual Method for Bilinear Saddle-Point Problems

Recently, the work [36] generalized the first-order primal–dual method in [37] to a randomized method for solving a class of saddle-point problems in the following form:

$$\begin{aligned} \min _{z\in {\mathcal {Z}}}\left\{ h(z)+\max _{x\in {\mathcal {X}}} \left\langle z, \sum _{i=1}^N A_ix_i\right\rangle -\sum _{i=1}^N u_i(x_i)\right\} , \end{aligned}$$
(1.8)

where \(x=(x_1;\cdots ;x_N)\) and \({\mathcal {X}}={\mathcal {X}}_1\times \cdots \times {\mathcal {X}}_N\). Let \({\mathcal {Z}}=\mathbb {R}^p\) and \(h(z)=-b^\top z\). Then, it is easy to see that (1.8) is a saddle-point reformulation of the multi-block structured optimization problem

$$\begin{aligned} \min _{x\in {\mathcal {X}}} \sum _{i=1}^N u_i(x_i), \text{ s.t. } \sum _{i=1}^N A_ix_i=b, \end{aligned}$$

which is a special case of (1.1) without y-variable or the coupled function f.

At each iteration, the algorithm in [36] chooses one block of the x-variable uniformly at random and performs a proximal update to it, followed by another proximal update to the z-variable. More precisely, it iteratively performs the updates:

$$\begin{aligned}&x_i^{k+1}=\left\{ \begin{array}{ll} {{\,\mathrm{arg\,min}\,}}_{x_i\in {\mathcal {X}}_i} \langle -\bar{z}^k, A_ix_i\rangle + u_i(x_i)+\frac{\tau }{2}\Vert x_i-x_i^k\Vert _2^2, &{} \text { if }i=i_k, \\ x_i^k, &{} \text { if } i\ne i_k, \end{array}\right. \end{aligned}$$
(1.9a)
$$\begin{aligned}&z^{k+1}={\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{z\in {\mathcal {Z}}}}\, h(z)+\langle z, A x^{k+1}\rangle +\frac{\eta }{2}\Vert z-z^k\Vert _2^2, \end{aligned}$$
(1.9b)
$$\begin{aligned}&\bar{z}^{k+1}=q(z^{k+1}-z^k)+z^{k+1}, \end{aligned}$$
(1.9c)

where \(i_k\) is a randomly selected block, and \(\tau ,\eta \) and q are certain parameters. When there is only one block of the x-variable, i.e., \(N=1\), the scheme in (1.9) becomes exactly the primal–dual method in [37]. Assuming boundedness of the constraint sets \({\mathcal {X}}\) and \({\mathcal {Z}}\), [36] showed that under convexity, an O(1/t) convergence rate of the scheme can be established by choosing appropriate parameters, and that if the \(u_i\) are all strongly convex, the scheme can be accelerated to an \(O(1/t^2)\) convergence rate by adapting the parameters.
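
As a concrete instance of (1.9), the Python/NumPy sketch below applies the scheme to \(\min _x \sum _i u_i(x_i)\) s.t. \(Ax=b\) with \(u_i(x_i)=\frac{1}{2}\Vert x_i\Vert ^2\), \({\mathcal {X}}_i=\mathbb {R}^{p_i}\), \(h(z)=-b^\top z\), and \({\mathcal {Z}}\) the whole space; the values of \(\tau ,\eta ,q\) below are ad hoc choices for illustration only (the theoretically valid choices are prescribed in [36]).

```python
import numpy as np

rng = np.random.default_rng(2)
m, p, N = 20, 30, 5
blocks = np.array_split(np.arange(p), N)
A = rng.standard_normal((m, p)) / np.sqrt(p)
b = rng.standard_normal(m)

L = np.linalg.norm(A, 2)
tau, eta, q = 2 * N * L, 2 * L, N        # placeholders; see [36] for valid parameter rules

x = np.zeros(p)
z = np.zeros(m)
z_bar = z.copy()
for k in range(5000):
    i = rng.integers(N)                  # choose one x-block uniformly at random
    idx = blocks[i]
    Ai = A[:, idx]
    # (1.9a) with u_i(x_i) = 0.5*||x_i||^2: the proximal step has a closed form
    x[idx] = (Ai.T @ z_bar + tau * x[idx]) / (1.0 + tau)
    # (1.9b) with h(z) = -b^T z and Z the whole space: a gradient-type step in z
    z_new = z - (A @ x - b) / eta
    # (1.9c): extrapolation
    z_bar = q * (z_new - z) + z_new
    z = z_new

print("residual ||Ax - b|| =", np.linalg.norm(A @ x - b))
```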

1.3 Contributions and Organization of this Paper

  • We propose a randomized primal–dual coordinate update algorithm to solve problems in the form of (1.1). The key feature is to introduce randomization, as done in (1.9), into the multi-block ADMM framework (1.6). Unlike the random permutation scheme previously investigated in [5, 35], we simply choose a subset of blocks of variables according to the uniform distribution. In addition, we perform a proximal update to the selected subset of variables. With appropriate proximal terms [e.g., the setting in (2.15)], the selected block variables can be decoupled, and thus the updates can be done in parallel.

  • More generally than (1.6), our algorithm can accommodate coupled terms in the objective function by linearizing such terms. By imposing a Lipschitz continuity condition on the partial gradients of the coupled functions f and g and using proximal terms, we show that our method has an expected O(1/t) convergence rate for solving problem (1.1) under a mere convexity assumption.

  • We show that our algorithm includes several existing methods as special cases, such as the scheme in (1.9) and the proximal Jacobian ADMM in [20]. Our result indicates that the O(1/t) convergence rate of the scheme in (1.9) can be shown without assuming boundedness of the constraint sets. In addition, the same order of convergence rate of the proximal Jacobian ADMM can be established in terms of a better measure.

  • Furthermore, the linearization scheme allows us to deal with a stochastic objective function, for instance, when the function f is given in the form of an expectation \(f=E_\xi [f_\xi (x)]\), where \(\xi \) is a random vector. As long as an unbiased estimator of the subgradient of f is available, we can extend our method to the stochastic problem, and an expected \(O(1/\sqrt{t})\) convergence rate is achievable.

The rest of the paper is organized as follows. In Sect. 2, we introduce our algorithm and present some preliminary results. In Sect. 3, we present the sublinear convergence rate results of the proposed algorithm. Depending on the multi-block structure of \(y\), different conditions and parameter settings are presented in Sects. 3.1, 3.2 and 3.3, respectively. In Sect. 4, we present an extension of our algorithm where we assume the availability of only first-order stochastic approximation of the objective function. The convergence analysis is extended to such settings accordingly. Numerical results are shown in Sect. 5. In Sect. 6, we discuss the connections of our algorithm to other well-known methods in the literature. Finally, we conclude the paper in Sect. 7. The proofs for the technical lemmas are presented in Appendix A, and the proofs for the main theorems are in Appendix B.

2 Randomized Primal–Dual Block Coordinate Update Algorithm

In this section, we first present some notations and then introduce our algorithm as well as some preliminary lemmas.

2.1 Notations

We denote \({\mathcal {X}}={\mathcal {X}}_1\times \cdots \times {\mathcal {X}}_N\) and \({\mathcal {Y}}={\mathcal {Y}}_1\times \cdots \times {\mathcal {Y}}_M\). For any symmetric positive semidefinite matrix W, we define \(\Vert z\Vert _W=\sqrt{z^\top W z}\). Given an integer \(\ell >0\), \([\ell ]\) denotes the set \(\{1,2,\cdots ,\ell \}\). We use I and J as index sets, while I is also used to denote the identity matrix; the intended meaning will be evident from the context. Given \(I=\{i_1,i_2,\cdots ,i_n\}\), we denote the following (a small code sketch of these objects is given after the list):

  • Block-indexed variable: \(x_I=(x_{i_1};x_{i_2};\cdots ;x_{i_n})\);

  • Block-indexed set: \({\mathcal {X}}_I={\mathcal {X}}_{i_1}\times \cdots \times {\mathcal {X}}_{i_n}\);

  • Block-indexed function: \(u_I(x_I)=u_{i_1}(x_{i_1})+u_{i_2}(x_{i_2})+\cdots +u_{i_n}(x_{i_n})\);

  • Block-indexed gradient: \(\nabla _{I}f(x)=(\nabla _{i_1}f(x);\nabla _{i_2}f(x);\cdots ;\nabla _{i_n}f(x))\);

  • Block-indexed matrix: \(A_I=\big [A_{i_1}, A_{i_2},\cdots , A_{i_n} \big ]\).
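
The block-indexed objects above can be realized programmatically in a straightforward way; the short Python/NumPy sketch below (with made-up block sizes) is only meant to fix ideas.

```python
import numpy as np

# Toy setup: x has N = 4 blocks of sizes (2, 3, 2, 3); A is partitioned column-wise.
sizes = [2, 3, 2, 3]
offsets = np.cumsum([0] + sizes)            # block i occupies offsets[i]:offsets[i+1]
rng = np.random.default_rng(3)
x = rng.standard_normal(offsets[-1])
A = rng.standard_normal((5, offsets[-1]))

def block_slice(i):
    return slice(offsets[i], offsets[i + 1])

I_set = [0, 2]                              # an index set I = {i_1, ..., i_n}
x_I = np.concatenate([x[block_slice(i)] for i in I_set])    # block-indexed variable x_I
A_I = np.hstack([A[:, block_slice(i)] for i in I_set])      # block-indexed matrix A_I

# Block-indexed gradient for a smooth f; here f(x) = 0.5*||A x||^2, so grad f(x) = A^T A x.
grad = A.T @ (A @ x)
grad_I = np.concatenate([grad[block_slice(i)] for i in I_set])
```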

2.2 Algorithm

Our algorithm is rather general. Its major ingredients are randomization in selecting block variables, linearization of the coupled functions f and g, and adding proximal terms. Specifically, at each iteration k, it first randomly samples a subset \(I_k\) of blocks of \(x\), and then a subset \(J_k\) of blocks of \(y\) according to the uniform distribution over the indices. The randomized sampling rule is as follows:

Randomization Rule (U): For the given integers \(n\leqslant N\) and \(m\leqslant M\), it randomly chooses index sets \(I_k\subset [N]\) with \(|I_k|=n\) and \(J_k\subset [M]\) with \(|J_k|=m\) uniformly; i.e., for any subsets \(\{i_1,i_2,\cdots ,i_n\}\subset [N]\) and \(\{j_1,j_2,\cdots ,j_m\}\subset [M]\), the following holds

$$\begin{aligned} {\mathrm {Prob}}\big [I_k=\{i_1,i_2,\cdots ,i_n\}\big ]=&1/\left( \begin{array}{c}N\\ n\end{array}\right) ,\\ {\mathrm {Prob}}\big [J_k=\{j_1,j_2,\cdots ,j_m\}\big ]=&1/\left( \begin{array}{c}M\\ m\end{array}\right) . \end{aligned}$$

After those subsets have been selected, it performs a prox-linear update to those selected blocks based on the augmented Lagrangian function, followed by an update of the Lagrangian multiplier. The details of the method are given in Algorithm 1.

[Algorithm 1: Randomized primal–dual block coordinate update]

In Algorithm 1, \(P^k\) and \(Q^k\) are predetermined positive semidefinite matrices of appropriate dimensions. For the selected blocks in \(I_k\) and \(J_k\), instead of exactly minimizing the augmented Lagrangian function, we perform a block proximal gradient update. In particular, before the minimization, we first linearize the coupled functions f and g and add some proximal terms. Note that one can always select all blocks, i.e., \(I_k=[N]\) and \(J_k=[M]\). Empirically, however, the block coordinate update method usually outperforms the full coordinate update method if the problem possesses certain structures; see [38] for an example. To decouple the selected x blocks and also the y blocks, we choose the matrices \(P^k\) and \(Q^k\) in Algorithm 1 as follows:

$$\begin{aligned} P^k=\hat{P}_{I_k}-\rho _xA_{I_k}^\top A_{I_k}, \qquad Q^k=\hat{Q}_{J_k}-\rho _yB_{J_k}^\top B_{J_k}, \end{aligned}$$
(2.15)

where \(\hat{P}\) and \(\hat{Q}\) are symmetric positive semidefinite block diagonal matrices, and \(\hat{P}_{I_k}\) denotes the diagonal blocks of \(\hat{P}\) indexed by \(I_k\) (similarly for \(\hat{Q}_{J_k}\)). With this setting of \(P^k\) and \(Q^k\), (2.1) and (2.3), respectively, become

$$\begin{aligned} x_I^{k+1}&= {\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{x_I\in \mathcal {X}_I}}\big \langle \nabla _I f(x^k)-A_I^\top (\lambda ^k-\rho _x r^k), x_I\big \rangle +u_I(x_I)+\frac{1}{2}\Vert x_I-x_I^k\Vert _{\hat{P}_I}^2, \end{aligned}$$
(2.16)
$$\begin{aligned} y_J^{k+1}&= {\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{y_J\in \mathcal {Y}_J}}\big \langle \nabla _J g(y^k)-B_J^\top (\lambda ^k-\rho _y r^{k+\frac{1}{2}}), y_J\big \rangle +v_J(y_J)+\frac{1}{2}\Vert y_J-y_J^k\Vert _{\hat{Q}_J}^2. \end{aligned}$$
(2.17)

Due to the block diagonal structure of \(\hat{P}\) and \(\hat{Q}\), both the x-update and the y-update can be computed in parallel.
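
As a concrete illustration of one pass of Algorithm 1 with the choice (2.15), the Python/NumPy sketch below specializes to \(f=g=0\), \(u_i(x_i)=\frac{1}{2}\Vert x_i\Vert ^2\), \(v_j(y_j)=\frac{1}{2}\Vert y_j\Vert ^2\), unconstrained blocks, and scalar \(\hat{P}=\hat{p}I\), \(\hat{Q}=\hat{q}I\), so that (2.16) and (2.17) have closed forms. We read \(r^k=Ax^k+By^k-b\) and \(r^{k+\frac{1}{2}}=Ax^{k+1}+By^k-b\) for the residuals appearing in (2.16)–(2.17) and use the multiplier step \(\lambda ^{k+1}=\lambda ^k-\rho (Ax^{k+1}+By^{k+1}-b)\), which is consistent with (3.4); the parameter values are illustrative only and are not tuned to the conditions of Sect. 3.

```python
import numpy as np

rng = np.random.default_rng(4)
m, px, py = 25, 30, 20
Nx, My = 6, 4                                      # numbers of x- and y-blocks
bx = np.array_split(np.arange(px), Nx)
by = np.array_split(np.arange(py), My)
A = rng.standard_normal((m, px)) / np.sqrt(px)
B = rng.standard_normal((m, py)) / np.sqrt(py)
b = rng.standard_normal(m)

n_sel, m_sel = 3, 2                                # sampled blocks per iteration
rho_x = 1.0
rho_y = rho_x * (n_sel / Nx) * (My / m_sel)        # keeps n*rho_x/N = m*rho_y/M, cf. (3.17)
rho = rho_x * n_sel / Nx
p_hat, q_hat = 30.0, 30.0                          # illustrative weights for \hat P, \hat Q

x, y, lam = np.zeros(px), np.zeros(py), np.zeros(m)
for k in range(5000):
    I = rng.choice(Nx, size=n_sel, replace=False)      # rule (U), x-side
    J = rng.choice(My, size=m_sel, replace=False)      # rule (U), y-side
    r = A @ x + B @ y - b                              # our reading of r^k
    for i in I:                                        # (2.16): blocks decouple, parallelizable
        gi = -A[:, bx[i]].T @ (lam - rho_x * r)        # linearized term (f = 0 here)
        x[bx[i]] = (p_hat * x[bx[i]] - gi) / (1.0 + p_hat)
    r_half = A @ x + B @ y - b                         # our reading of r^{k+1/2}
    for j in J:                                        # (2.17): blocks decouple, parallelizable
        gj = -B[:, by[j]].T @ (lam - rho_y * r_half)   # linearized term (g = 0 here)
        y[by[j]] = (q_hat * y[by[j]] - gj) / (1.0 + q_hat)
    lam -= rho * (A @ x + B @ y - b)                   # multiplier step, consistent with (3.4)

print("residual ||Ax + By - b|| =", np.linalg.norm(A @ x + B @ y - b))
```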

2.3 Preliminaries

Let \(w\) be the aggregated primal–dual variables and H(w) the primal–dual linear mapping; namely

$$\begin{aligned} w=\left( \begin{array}{c} x\\ y\\ \lambda \end{array}\right) ,\quad H(w)=\left( \begin{array}{c}-A^\top \lambda \\ -B^\top \lambda \\ Ax+By-b\end{array}\right) , \end{aligned}$$
(2.18)

and also let

$$\begin{aligned}&u(x)=\sum _{i=1}^N u_i(x_i),\quad v(y)=\sum _{j=1}^M v_j(y_j),\\&F(x)=f(x)+u(x),\quad G(y)=g(y)+v(y),\quad \varPhi (x,y)=F(x)+G(y). \end{aligned}$$

The point \((x^*,y^*)\) is a solution to (1.1) if and only if there exists \(\lambda ^*\) such that

$$\begin{aligned}&\varPhi (x,y)-\varPhi (x^*,y^*)+(w-w^*)^\top H(w^*)\geqslant 0,\,\forall (x,y) \in \mathcal {X}\times \mathcal {Y}, \,\forall \lambda , \end{aligned}$$
(2.19a)
$$\begin{aligned}&Ax^*+By^*=b, \end{aligned}$$
(2.19b)
$$\begin{aligned}&x^*\in \mathcal {X}, \quad y^*\in \mathcal {Y}. \end{aligned}$$
(2.19c)

The following lemmas, whose proofs are elementary and thus omitted here, will be used in our subsequent analysis.

Lemma 2.1

For any two vectors \(w\) and \(\tilde{w}\), it holds

$$\begin{aligned} (w-\tilde{w})^\top H(w)=(w-\tilde{w})^\top H(\tilde{w}). \end{aligned}$$
(2.20)

Lemma 2.2

For any two vectors \(u, v\) and a positive semidefinite matrix \(W\),

$$\begin{aligned} u^\top Wv= \frac{1}{2}\big (\Vert u\Vert _W^2+\Vert v\Vert _W^2-\Vert u-v\Vert _W^2\big ). \end{aligned}$$
(2.21)

Lemma 2.3

For any nonzero positive semidefinite matrix \(W\), it holds for any \(z\) and \(\hat{z}\) of appropriate size that

$$\begin{aligned} \Vert z-\hat{z}\Vert ^2\geqslant \frac{1}{\Vert W\Vert _2}\Vert z-\hat{z}\Vert _{W}^2, \end{aligned}$$
(2.22)

where \(\Vert W\Vert _2\) denotes the matrix operator norm of W.

The following lemma presents a useful property of \(H(w)\), which essentially follows from (2.20).

Lemma 2.4

For any vectors \(w^0,w^1,\cdots ,w^{t}\), and sequence of positive numbers \(\beta ^0,\beta ^1,\cdots ,\beta ^t\), it holds that

$$\begin{aligned} \left( \frac{\sum \nolimits _{k=0}^t\beta ^kw^k}{\sum \nolimits _{k=0}^t\beta ^k}-w\right) ^{\top }H\left( \frac{\sum \nolimits _{k=0}^t\beta ^kw^k}{\sum \nolimits _{k=0}^t\beta ^k}\right) =\frac{1}{\sum \nolimits _{k=0}^t\beta ^k}\sum \limits _{k=0}^t\beta ^k(w^k-w)^{\top }H(w^k). \end{aligned}$$
(2.23)

3 Convergence Rate Results

In this section, we establish sublinear convergence rate results of Algorithm 1 for three different cases. We differentiate the cases based on whether or not \(y\) in problem (1.1) also has a multi-block structure. The first case assumes no y-variable at all. Although it can be viewed as a special case of the second and third cases, we discuss it separately since it requires weaker conditions to guarantee the same convergence rate. The second case is that \(y\) is treated as a single-block variable, which can be realized in our algorithm by simply selecting all \(y\)-blocks every time, i.e., \(m=M\). In the third case, where \(y\) is a multi-block variable, we require \(\frac{n}{N}=\frac{m}{M}\), where n and m are the cardinalities of the subsets of \(x\)- and \(y\)-blocks selected in our algorithm, respectively. Since the analysis only requires convexity, we can ensure that this condition holds by adding zero component functions if necessary, so that \(N=M\), and then choosing \(n=m\). In particular, we make the following assumptions:

Assumption 3.1

(Convexity) For (1.1), \(\mathcal {X}_i\) and \(\mathcal {Y}_j\) are closed convex sets, f and g are smooth convex functions, and \(u_i\) and \(v_j\) are proper closed convex functions.

Assumption 3.2

(Existence of a solution) There is at least one point \(w^*=(x^*,y^*,\lambda ^*)\) satisfying the conditions in (2.19).

Assumption 3.3

(Lipschitz continuous partial gradient) There exist constants \(L_f\) and \(L_g\) such that for any subset I of [N] with \(|I|=n\) and any subset J of [M] with \(|J|=m\), it holds that

$$\begin{aligned}&\Vert \nabla _I f(x+U_I\tilde{x})-\nabla _I f(x)\Vert \leqslant L_f\Vert \tilde{x}_I\Vert ,\quad \forall x, \tilde{x}, \end{aligned}$$
(3.1a)
$$\begin{aligned}&\Vert \nabla _J g(y+U_J\tilde{y})-\nabla _J g(y)\Vert \leqslant L_g\Vert \tilde{y}_J\Vert ,\quad \forall y, \tilde{y}, \end{aligned}$$
(3.1b)

where \(U_I \tilde{x}\) keeps the blocks of \(\tilde{x}\) that are indexed by I and is zero elsewhere.

Before presenting the main convergence rate result, we first establish a few key lemmas.

Lemma 3.1

(One-step analysis) Let \(\{(x^k,y^k, r^k,\lambda ^k)\}\) be the sequence generated from Algorithm 1 with matrices \(P^k\) and \(Q^k\) defined as in (2.15). Then, the following inequalities hold

$$\begin{aligned}&{E}_{I_k}\left[ F(x^{k+1})-F(x)+(x^{k+1}-x)^\top (-A^\top \lambda ^{k+1})\right. \nonumber \\&\qquad +\left. (\rho _x-\rho )(x^{k+1}-x)^\top A^\top r^{k+1} -\rho _x(x^{k+1}-x)^\top A^\top B(y^{k+1}-y^k)\right] \nonumber \\&\qquad +\,{E}_{I_k}(x^{k+1}-x)^\top (\hat{P}-\rho _x A^\top A)(x^{k+1}-x^k)-\frac{L_f}{2}{E}_{I_k}\Vert x^k-x^{k+1}\Vert ^2\nonumber \\\leqslant & {} \left( 1-\frac{n}{N}\right) \big [F(x^k)-F(x)+(x^{k}-x)^\top (-A^\top \lambda ^k)+ \rho _x(x^{k}-x)^\top A^\top r^{k}\big ], \qquad \qquad \end{aligned}$$
(3.2)

and

$$\begin{aligned}&{E}_{J_k}\big [G(y^{k+1}) - G(y)+(y^{k+1}-y)^\top (-B^\top \lambda ^{k+1})+ (\rho _y-\rho )(y^{k+1}-y)^\top B^\top r^{k+1}\big ]\nonumber \\&\qquad +\,{E}_{J_k}(y^{k+1}-y)^\top (\hat{Q}-\rho _y B^\top B)(y^{k+1}-y^k)-\frac{L_g}{2}{E}_{J_k}\Vert y^k-y^{k+1}\Vert ^2\nonumber \\&\qquad - \left( 1-\frac{m}{M}\right) \rho _y(y^{k}-y)^\top B^\top A(x^{k+1}-x^k)\nonumber \\\leqslant & {} \left( 1-\frac{m}{M}\right) \left[ G(y^k)-G(y)+(y^{k}-y)^\top (-B^\top \lambda ^k)+ \rho _y(y^{k}-y)^\top B^\top r^{k}\right] , \end{aligned}$$
(3.3)

where \({E}_{I_k}\) denotes the expectation over \(I_k\) conditional on all previous history.

Note that for any feasible point \((x,y)\) (namely, \(x\in {\mathcal {X}}, y\in {\mathcal {Y}}\) and \(Ax+By=b\)),

$$\begin{aligned} Ax^{k+1}-Ax= & {} \frac{1}{\rho }(\lambda ^k-\lambda ^{k+1})-(By^{k+1}-b)+(By-b)\nonumber \\= & {} \frac{1}{\rho }(\lambda ^k-\lambda ^{k+1})-B(y^{k+1}-y) \end{aligned}$$
(3.4)

and

$$\begin{aligned} By^{k}-By= & {} \frac{1}{\rho }(\lambda ^{k-1}-\lambda ^{k})-(Ax^{k}-b)+(Ax-b)\nonumber \\= & {} \frac{1}{\rho }(\lambda ^{k-1}-\lambda ^{k})-A(x^{k}-x). \end{aligned}$$
(3.5)

Hence,

$$\begin{aligned} (x^{k+1}-x)^\top A^\top B(y^{k+1}-y^k)= & {} \frac{1}{\rho }(\lambda ^k-\lambda ^{k+1})^\top B(y^{k+1}-y^k)\\&-(y^{k+1}-y)^\top B^\top B(y^{k+1}-y^k), \end{aligned}$$

and

$$\begin{aligned} (y^{k}-y)^\top B^\top A(x^{k+1}-x^k)= \frac{1}{\rho }(\lambda ^{k-1}-\lambda ^{k})^\top A(x^{k+1}-x^k)-(x^{k}-x)^\top A^\top A(x^{k+1}-x^k). \end{aligned}$$

Then, using (2.21) we have the following result.

Lemma 3.2

For any feasible point \((x,y)\) and integer t, it holds

$$\begin{aligned}&\sum _{k=0}^t (x^{k+1}-x)^\top A^\top B(y^{k+1}-y^k)\nonumber \\= & {} \frac{1}{\rho }\sum _{k=0}^t (\lambda ^k-\lambda ^{k+1})^\top B(y^{k+1}-y^k)\nonumber \\&\quad -\frac{1}{2}\left( \Vert y^{t+1}-y\Vert _{B^\top B}^2-\Vert y^{0}-y\Vert _{B^\top B}^2 +\sum _{k=0}^t\Vert y^{k+1}-y^k\Vert _{B^\top B}^2\right) \end{aligned}$$
(3.6)

and

$$\begin{aligned}&\sum _{k=0}^t(y^{k}-y)^\top B^\top A(x^{k+1}-x^k)\nonumber \\= & {} \frac{1}{\rho }\sum _{k=0}^t(\lambda ^{k-1}-\lambda ^{k})^\top A(x^{k+1}-x^k)\nonumber \\&\quad +\frac{1}{2}\left( \Vert x^0-x\Vert _{A^\top A}^2-\Vert x^{t+1}-x\Vert _{A^\top A}^2 +\sum _{k=0}^t\Vert x^{k+1}-x^k\Vert _{A^\top A}^2\right) . \end{aligned}$$
(3.7)

Lemma 3.3

Given a continuous function h, for a random vector \(\hat{w}=(\hat{x},\hat{y},\hat{\lambda })\), if for any feasible point \(w=(x,y,\lambda )\) that may depend on \(\hat{w}\), we have

$$\begin{aligned} E\big [\varPhi (\hat{x},\hat{y})-\varPhi (x,y)+(\hat{w}-w)^\top H(w)\big ]\leqslant E [h(w)], \end{aligned}$$
(3.8)

then for any \(\gamma >0\) and any optimal solution \((x^*,y^*)\) to (1.1) we also have

$$\begin{aligned} E\big [\varPhi (\hat{x},\hat{y})-\varPhi (x^*,y^*)+\gamma \Vert A\hat{x}+B\hat{y}-b\Vert \big ]\leqslant \sup _{\Vert \lambda \Vert \leqslant \gamma }h(x^*,y^*,\lambda ). \end{aligned}$$

Noting

$$\begin{aligned}&\varPhi (x,y)-\varPhi (x^*,y^*)+(w-w^*)^\top H(w^*)\\= & {} \varPhi (x,y)-\varPhi (x^*,y^*)-(\lambda ^*)^\top (Ax+By-b), \end{aligned}$$

we can easily show the following lemma by the optimality of \((x^*,y^*,\lambda ^*)\) and the Cauchy–Schwarz inequality.

Lemma 3.4

Assume \((x^*,y^*,\lambda ^*)\) satisfies (2.19). Then, for any point \((\hat{x},\hat{y})\in {\mathcal {X}}\times {\mathcal {Y}}\), we have

$$\begin{aligned} \varPhi (\hat{x},\hat{y})-\varPhi (x^*,y^*)\geqslant -\Vert \lambda ^*\Vert \cdot \Vert A\hat{x}+B\hat{y}-b\Vert . \end{aligned}$$
(3.9)

The following lemma shows a connection between different convergence measures, and it can be proved simply by using (3.9). If both w and \(\hat{w}\) are deterministic, it implies Lemma 2.4 in [39].

Lemma 3.5

Assume that \((x^*,y^*,\lambda ^*)\) satisfies the optimality conditions in (2.19). Let \(\gamma > \Vert \lambda ^*\Vert \) and \(\varepsilon \geqslant 0\). If a random vector \((\hat{x},\hat{y})\) satisfies

$$\begin{aligned} E\big [\varPhi (\hat{x},\hat{y})-\varPhi (x^*,y^*)+\gamma \Vert A\hat{x}+B\hat{y}-b\Vert \big ] \leqslant \varepsilon , \end{aligned}$$

then

$$\begin{aligned} E\Vert A\hat{x}+B\hat{y}-b\Vert\leqslant & {} \frac{\varepsilon }{\gamma -\Vert \lambda ^*\Vert }, \\ E\big |\varPhi (\hat{x},\hat{y})-\varPhi (x^*,y^*)\big |\leqslant & {} \left( \frac{2\Vert \lambda ^*\Vert }{\gamma -\Vert \lambda ^*\Vert } + 1\right) \varepsilon . \end{aligned}$$

The convergence analysis for Algorithm 1 requires slightly different parameter settings under different structures; in fact, the underlying analyses and results also differ. To account for these differences, we present the corresponding convergence results in the next three subsections. The first one assumes that there is no y part at all; the second assumes a single block on the y side; the last deals with the general case where the ratio n/N is assumed to equal m/M.

3.1 Multiple \(x\) Blocks and No y-Variable

We first consider the special case with no y-variable, namely \(g=v=0\) and \(B=0\) in (1.1). This case has its own importance: it gives a parallel block coordinate update version of the linearized augmented Lagrangian method (ALM).

Theorem 3.6

(Sublinear ergodic convergence I) Assume \(g(y)=0, v_j(y_j)=0,\,\forall j\) and \(B=0\) in (1.1). Let \(\{(x^k,y^k, \lambda ^k)\}\) be the sequence generated from Algorithm 1 with \(y^k\equiv y^0\). Assume \(\frac{n}{N}=\theta ,\,\rho =\theta \rho _x,\) and

$$\begin{aligned} \hat{P}_{I_k}\succcurlyeq L_fI+ \rho _xA_{I_k}^\top A_{I_k},\,\forall k. \end{aligned}$$
(3.10)

Let

$$\begin{aligned} \hat{x}^{t}=\frac{x^{t+1}+\theta \sum _{k=1}^tx^k}{1+\theta t}. \end{aligned}$$
(3.11)

Then, under Assumptions 3.1, 3.2, and 3.3, we have

$$\begin{aligned}&\max \left\{ E\left| F(\hat{x}^t)-F(x^*)\right| ,\, E\Vert A\hat{x}^t-b\Vert \right\} \nonumber \\\leqslant & {} \frac{2}{1+\theta t}\left[ (1-\theta )\left( F(x^0)-F(x^*)+\frac{\rho _x}{2}\Vert r^0\Vert ^2\right) \right. \nonumber \\&\qquad \left. +\frac{1}{2}\Vert x^0-x^*\Vert _{\hat{P}}^2+\frac{\max \{(0.5+\Vert \lambda ^*\Vert )^2, 9\Vert \lambda ^*\Vert ^2\}}{2\rho _x}\right] , \end{aligned}$$
(3.12)

where \((x^*,\lambda ^*)\) is an arbitrary primal–dual solution.

Our result recovers the convergence of the proximal Jacobian ADMM introduced in [20]. In fact, the above theorem strengthens the convergence result in [20] by establishing an O(1/t) rate of convergence in terms of the feasibility measure and the objective value. If strong convexity is assumed on the objective function, the algorithm can be accelerated to achieve an \(O(1/t^2)\) rate, as shown in [40].

3.2 Multiple \(x\) Blocks and a Single \(y\) Block

When the y-variable is simple to update, it can be beneficial to renew all of it at every iteration, as in problem (1.3). In this subsection, we consider the case where there are multiple \(x\)-blocks but a single \(y\)-block (or equivalently, \(m=M\)), and we establish a sublinear convergence rate result with a different technique for handling the y-variable.

Theorem 3.7

(Sublinear ergodic convergence II) Let \(\{(x^k,y^k, \lambda ^k)\}\) be the sequence generated from Algorithm 1 with \(m=M\) and \(\rho =\rho _y=\theta \rho _x\), where \(\theta =\frac{n}{N}.\) Assume

$$\begin{aligned} \hat{P}\succcurlyeq L_fI+\rho _x A^\top A, \qquad \hat{Q}\succcurlyeq \frac{L_g}{\theta }I+\left( \frac{\rho }{\theta ^4}-\frac{\rho }{\theta ^2}+\rho _y\right) B^\top B. \end{aligned}$$
(3.13)

Let

$$\begin{aligned} \hat{x}^{t}=\frac{x^{t+1}+\theta \sum _{k=1}^tx^k}{1+\theta t}, \quad \hat{y}^{t}=\frac{\tilde{y}^{t+1}+\theta \sum _{k=1}^ty^k}{1+\theta t}, \end{aligned}$$
(3.14)

where

$$\begin{aligned} \tilde{y}^{t+1}= & {} {\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{y\in \mathcal {Y}}}\langle \nabla g(y^t)-B^\top \lambda ^t, y\rangle +v(y)+\frac{\rho _x}{2}\Vert Ax^{t+1}+By-b\Vert ^2\nonumber \\&+\frac{\theta }{2}\Vert y-y^t\Vert _{\hat{Q}-\rho _y B^\top B}^2. \end{aligned}$$
(3.15)

Then, under Assumptions 3.1, 3.2, and 3.3, we have

$$\begin{aligned}&\max \left\{ E\big |\varPhi (\hat{x}^t,\hat{y}^t)-\varPhi (x^*,y^*)\big |,\,E\Vert A\hat{x}^t+B\hat{y}^t-b\Vert \right\} \nonumber \\\leqslant & {} \frac{2}{1+\theta t}\left[ (1-\theta )\left( \varPhi (x^0,y^0)-\varPhi (x^*,y^*)+\frac{\rho _x}{2}\Vert r^{0}\Vert ^2\right) +\frac{1}{2}\Vert x^0-x^*\Vert _{\hat{P}}^2\right. \nonumber \\&\quad +\frac{1}{2}\Vert y^0-y^*\Vert _{\theta \hat{Q}+(\rho _x-\theta \rho _y) B^\top B}^2+\left. \frac{\max \{(0.5+\Vert \lambda ^*\Vert )^2, 9\Vert \lambda ^*\Vert ^2\}}{2\rho _x}\right] , \end{aligned}$$
(3.16)

where \((x^*,y^*,\lambda ^*)\) is an arbitrary primal–dual solution.

Remark 3.1

It is easy to see that if \(\theta =1\), the result in Theorem 3.7 becomes exactly the same as that in Theorem 3.8. In general, they are different because the conditions in (3.13) on \(\hat{P}\) and \(\hat{Q}\) are different from those in (3.18a).

3.3 Multiple \(x\) and \(y\) Blocks

In this subsection, we consider the most general case where both \(x\) and \(y\) have a multi-block structure. Assuming \(\frac{n}{N}=\frac{m}{M}\), we can still obtain the O(1/t) convergence rate. This assumption can be made without loss of generality, e.g., by adding zero components if necessary (which is essentially equivalent to varying the probabilities of the variable selection).

Theorem 3.8

(Sublinear ergodic convergence III) Let \(\{(x^k,y^k, \lambda ^k)\}\) be the sequence generated from Algorithm 1 with the parameters satisfying

$$\begin{aligned}&\rho =\frac{n\rho _x}{N}=\frac{m\rho _y}{M}>0. \end{aligned}$$
(3.17)

Assume \(\frac{n}{N}=\frac{m}{M}=\theta ,\) and \(\hat{P}, \hat{Q}\) satisfy one of the following conditions:

$$\begin{aligned}&\hat{P}\succcurlyeq (2-\theta )\left( \frac{1-\theta }{\theta ^2}+1\right) \rho _xA^\top A+L_fI, \qquad \hat{Q}\succcurlyeq \frac{(2-\theta )}{\theta ^2}\rho _yB^\top B+L_gI, \end{aligned}$$
(3.18a)
$$\begin{aligned}&\hat{P}_i\succcurlyeq (2-\theta )\left( \frac{1-\theta }{\theta ^2}+1\right) n\rho _xA_i^\top A_i+L_fI,\,\forall i,\nonumber \\&\hat{Q}_j\succcurlyeq \frac{(2-\theta )}{\theta ^2}m\rho _yB_j^\top B_j+L_gI,\,\forall j. \end{aligned}$$
(3.18b)

Let

$$\begin{aligned} \hat{x}^{t}=\frac{x^{t+1}+\theta \sum _{k=1}^tx^k}{1+\theta t},\quad \hat{y}^{t}=\frac{y^{t+1}+\theta \sum _{k=1}^ty^k}{1+\theta t}. \end{aligned}$$
(3.19)

Then, under Assumptions 3.1, 3.2, and 3.3, we have

$$\begin{aligned}&\max \left\{ E\left| \varPhi (\hat{x}^t,\hat{y}^t)-\varPhi (x^*,y^*)\right| ,\,E\Vert A\hat{x}^t+B\hat{y}^t-b\Vert \right\} \nonumber \\\leqslant & {} \frac{2}{1+\theta t}\left[ (1-\theta )\left( \varPhi (x^0,y^0)-\varPhi (x^*,y^*)+\rho _x\Vert r^0\Vert ^2\right) \right. \nonumber \\&\left. +\frac{1}{2}\Vert x^0-x^*\Vert _{\hat{P}-\theta \rho _x A^\top A}^2\right. \nonumber \\&\left. +\frac{1}{2}\Vert y^0-y^*\Vert _{\hat{Q}}^2+\frac{\max \{(0.5+\Vert \lambda ^*\Vert )^2, 9\Vert \lambda ^*\Vert ^2\}}{2\rho _x}\right] , \end{aligned}$$
(3.20)

where \((x^*,y^*,\lambda ^*)\) is an arbitrary primal–dual solution.

Remark 3.2

When \(N=M=1\), the two conditions in (3.18) become the same. However, in general, neither of the two conditions in (3.18) implies the other. Roughly speaking, when \(n\approx N\) and \(m\approx M\), the condition in (3.18a) can be weaker, and when \(n\ll N\) and \(m\ll M\), the one in (3.18b) is more likely to be weaker. In addition, (3.18b) provides an explicit way to choose block diagonal \(\hat{P}\) and \(\hat{Q}\) by simply setting \(\hat{P}_i\) and \(\hat{Q}_j\) to the lower bounds there.

4 Randomized Primal–Dual Coordinate Approach for Stochastic Programming

In this section, we extend our method to solve a stochastic optimization problem where the objective function involves an expectation. Specifically, we assume the coupled function to be in the form of \(f(x)={E}_\xi f_\xi (x)\) where \(\xi \) is a random vector. For simplicity, we assume \(g=v=0\), namely, we consider the following problem:

$$\begin{aligned} \begin{aligned} \min _x\;\;&{E}_\xi f_\xi (x)+\sum _{i=1}^N u_i(x_i) \\ \text{ s.t. }&\sum _{i=1}^N A_ix_i=b,\ x_i\in \mathcal {X}_i,\, i=1,2,\cdots ,N. \end{aligned} \end{aligned}$$
(4.1)

One can easily extend our analysis to the case where \(g\ne 0, v\ne 0\) and g is also stochastic. An example of (4.1) is the penalized and constrained regression problem that includes (1.2) as a special case.

Due to the expectation form of f, its exact gradient is typically unavailable or very expensive to compute. Instead, we assume that its stochastic gradient is readily accessible. By a slight abuse of notation, we denote

$$\begin{aligned} w=\left[ \begin{array}{c}x\\ \lambda \end{array}\right] ,\quad H(w)=\left[ \begin{array}{c}-A^\top \lambda \\ Ax-b\end{array}\right] , \quad F(x) := {E}_\xi f_\xi (x)+\sum _{i=1}^N u_i(x_i). \end{aligned}$$
(4.2)

A point \(x^*\) is a solution to (4.1) if and only if there exists \(\lambda ^*\) such that

$$\begin{aligned}&F(x)-F(x^*)+(w-w^*)^\top H(w^*) \geqslant 0,\,\forall w, \end{aligned}$$
(4.3a)
$$\begin{aligned}&Ax^*=b, \ x^*\in \mathcal {X}. \end{aligned}$$
(4.3b)

Adapting Algorithm 1 to (4.1), we present the stochastic primal–dual coordinate update method of multipliers, summarized in Algorithm 2, where \(G^k\) is a stochastic approximation of \(\nabla f(x^k)\). The strategy of block coordinate update with stochastic gradient information was first proposed in [41, 42], which considered problems without linear constraints.

[Algorithm 2: Stochastic primal–dual coordinate update method of multipliers]

We make the following assumption on the stochastic gradient \(G^k\).

Assumption 4.1

Let \(\delta ^k=G^k-\nabla f(x^k)\). There exists a constant \(\sigma \) such that for all k,

$$\begin{aligned}&{E}[\delta ^k\,|\,x^k]=0, \end{aligned}$$
(4.6a)
$$\begin{aligned}&{E}\Vert \delta ^k\Vert ^2\leqslant \sigma ^2. \end{aligned}$$
(4.6b)
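
As a simple example of an estimator satisfying Assumption 4.1, suppose \(f(x)={E}_\xi \big [\frac{1}{2}(a_\xi ^\top x-b_\xi )^2\big ]\) with \(\xi \) uniform over a finite data set (the data, dimensions, and batch size in the Python/NumPy sketch below are made up for illustration). A mini-batch average of per-sample gradients is then unbiased, and its variance is bounded since the data set is finite.

```python
import numpy as np

rng = np.random.default_rng(5)
n_samples, dim = 1000, 50
data_A = rng.standard_normal((n_samples, dim))      # rows a_xi
data_b = rng.standard_normal(n_samples)             # scalars b_xi

def stochastic_grad(x, batch_size=32):
    """Unbiased estimator G^k of grad f(x) for f(x) = E_xi[0.5*(a_xi^T x - b_xi)^2],
    with xi uniform over the rows of (data_A, data_b); this satisfies (4.6a)-(4.6b)."""
    idx = rng.integers(n_samples, size=batch_size)   # sample a mini-batch with replacement
    Ab, bb = data_A[idx], data_b[idx]
    return Ab.T @ (Ab @ x - bb) / batch_size

x = rng.standard_normal(dim)
full_grad = data_A.T @ (data_A @ x - data_b) / n_samples
avg_estimate = np.mean([stochastic_grad(x) for _ in range(2000)], axis=0)
print("empirical bias (should be small):", np.linalg.norm(avg_estimate - full_grad))
```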

Following the proof of Lemma 3.1 and also noting

$$\begin{aligned} {E}_{I_k}\big [(x_{I_k}-x_{I_k}^{k+1})^\top \delta _{I_k}^k\,|\,x^k\big ]={E}_{I_k}(x^k-x^{k+1})^\top \delta ^k, \end{aligned}$$
(4.7)

we immediately have the following result.

Lemma 4.1

(One-step analysis) Let \(\{(x^k,r^k,\lambda ^k)\}\) be the sequence generated from Algorithm 2 where \(P^k\) is given in (2.15) with \(\rho _x=\rho \). Then,

$$\begin{aligned}&{E}_{I_k}\big [F(x^{k+1})-F(x)+(x^{k+1}-x)^\top (-A^\top \lambda ^k)+ \rho (x^{k+1}-x)^\top A^\top r^{k+1}\big ]\nonumber \\&+{E}_{I_k}(x^{k+1}-x)^\top \left( \hat{P}-\rho A^\top A+\frac{I}{\alpha _k}\right) (x^{k+1}-x^k)\nonumber \\&-\frac{L_f}{2}{E}_{I_k}\Vert x^k-x^{k+1}\Vert ^2+{E}_{I_k}(x^{k+1}-x^k)^\top \delta ^k\nonumber \\\leqslant & {} \left( 1-\frac{n}{N}\right) \big [F(x^k)-F(x)+(x^{k}-x)^\top (-A^\top \lambda ^k)+ \rho (x^{k}-x)^\top A^\top r^{k}\big ]. \end{aligned}$$
(4.8)

The following theorem is a key result, from which we can choose appropriate \(\alpha _k\) to obtain the \(O(1/\sqrt{t})\) convergence rate.

Theorem 4.2

Let \(\{(x^k,\lambda ^k)\}\) be the sequence generated from Algorithm 2. Let \(\theta =\frac{n}{N}\) and denote

$$\begin{aligned} \beta _k=\frac{\alpha _k}{\left( 1-\frac{\alpha _{k}(1-\theta )}{\alpha _{k-1}}\right) \rho },\forall k. \end{aligned}$$

Assume \(\alpha _k>0\) is nonincreasing, and

$$\begin{aligned}&Ax^0=b, \quad \lambda ^0=0, \end{aligned}$$
(4.9a)
$$\begin{aligned}&\hat{P}\succcurlyeq L_fI+\rho A^\top A,\end{aligned}$$
(4.9b)
$$\begin{aligned}&\frac{\alpha _{k-1}\beta _k}{2\alpha _k}+\frac{(1-\theta )\beta _{k+1}}{2}-\frac{\alpha _k\beta _{k+1}}{2\alpha _{k+1}}-\frac{(1-\theta )\beta _k}{2} \geqslant 0,\,\forall k,\end{aligned}$$
(4.9c)
$$\begin{aligned}&\frac{\alpha _t}{2\rho }\geqslant \left| \frac{\alpha _{t-1}\beta _t}{\alpha _t}-(1-\theta )\beta _t-\frac{\alpha _t}{\rho }\right| ,\text { for some }t. \end{aligned}$$
(4.9d)

Let

$$\begin{aligned} \hat{x}^{t}=\frac{\alpha _{t+1}x^{t+1}+\theta \sum _{k=1}^t\alpha _kx^k}{\alpha _{t+1}+\theta \sum \limits _{k=1}^t\alpha _k}. \quad \end{aligned}$$
(4.10)

Then, under Assumptions 3.1, 3.2, 3.3, and 4.1, we have

$$\begin{aligned}&(\alpha _{t+1}+\theta \sum \limits _{k=1}^t\alpha _k)E\left[ F(\hat{x}^{t})-F(x^*)+\gamma \Vert A\hat{x}^t-b\Vert \right] \nonumber \\&\quad \leqslant (1-\theta )\alpha _0\left[ F(x^0)-F(x^*)\right] +\frac{\alpha _0}{2}\Vert x^{0}-x^*\Vert _{\hat{P}-\rho A^\top A}^2+\frac{1}{2}\Vert x^0-x^*\Vert ^2\nonumber \\&\qquad +\left| \frac{\alpha _0\beta _1}{2\alpha _1}-\frac{(1-\theta )\beta _1}{2}\right| \gamma ^2+\sum _{k=0}^t\frac{\alpha _k^2}{2}E\Vert \delta ^k\Vert ^2. \end{aligned}$$
(4.11)

The following proposition gives the sublinear convergence rate of Algorithm 2 by specifying the values of its parameters. The choice of \(\alpha _k\) depends on whether the total number of iterations is fixed in advance.

Proposition 4.3

Let \(\{(x^k,\lambda ^k)\}\) be the sequence generated from Algorithm 2 with \(P^k\) given in (2.15), \(\hat{P}\) satisfying (4.9b), and the initial point satisfying \(Ax^0=b\) and \(\lambda ^0=0\). Let \(C_0\) be

$$\begin{aligned} C_0= & {} (1-\theta )\alpha _0\big [F(x^0)-F(x^*)\big ]+\frac{1}{2}\Vert x^0-x^*\Vert _{D_x}^2\nonumber \\&+\frac{\alpha _0}{2\rho }\max \{(0.5+\Vert \lambda ^*\Vert )^2, 9\Vert \lambda ^*\Vert ^2\}, \end{aligned}$$
(4.12)

where \((x^*,\lambda ^*)\) is a primal–dual solution, and \(D_x:=\alpha _0(\hat{P}-\rho A^\top A)+I\).

  1.

    If \(\alpha _k=\frac{\alpha _0}{\sqrt{k}},\forall k\geqslant 1\) for a certain \(\alpha _0>0\), then for \(t\geqslant 2\),

    $$\begin{aligned} \max \left\{ E\left| F(\hat{x}^{t})-F(x^*)\right| ,\, E\Vert A\hat{x}^t-b\Vert \right\} \leqslant \frac{2C_0}{\theta \alpha _0\sqrt{t}}+\frac{\alpha _0(\log t+2)\sigma ^2}{\theta \sqrt{t}}.\qquad \quad \end{aligned}$$
    (4.13)
  2.

    If the maximum number of iterations is fixed a priori, then by choosing \(\alpha _k=\frac{\alpha _0}{\sqrt{t}},\,\forall k\geqslant 1\) with any given \(\alpha _0>0\), we have

    $$\begin{aligned} \max \left\{ E\left| F(\hat{x}^{t})-F(x^*)\right| ,\, E\Vert A\hat{x}^t-b\Vert \right\} \leqslant \frac{2C_0}{\theta \alpha _0\sqrt{t}}+\frac{2\alpha _0\sigma ^2}{\theta \sqrt{t}}. \end{aligned}$$
    (4.14)

Proof

When \(\alpha _k=\frac{\alpha _0}{\sqrt{k}}\), we can show that (4.9c) and (4.9d) hold for \(t\geqslant 2\); see Appendix A.4. Hence, the result in (4.13) follows from (4.11), the convexity of F, Lemma 3.5 with \(\gamma =\max \{1+\Vert \lambda ^*\Vert , 2\Vert \lambda ^*\Vert \}\), and the inequalities

$$\begin{aligned} \sum _{k=1}^t\frac{1}{\sqrt{k}}\geqslant \sqrt{t},\quad \sum _{k=1}^t\frac{1}{k}\leqslant \log t +1. \end{aligned}$$

When \(\alpha _k\) is a constant, the terms on the left-hand side of (4.9c) and on the right-hand side of (4.9d) are both zero, so they are satisfied. Hence, the result in (4.14) immediately follows by noting \(\sum _{k=1}^t \alpha _k=\alpha _0\sqrt{t}\) and \(\sum _{k=0}^t \alpha _k^2\leqslant 2\alpha _0^2\).

The sublinear convergence rate of Algorithm 2 can also be shown if f is nondifferentiable, convex, and Lipschitz continuous. Indeed, if f is Lipschitz continuous with constant \(L_c\), i.e.,

$$\begin{aligned} |f(x)-f(y)| \leqslant L_c\Vert x-y\Vert ,\,\forall x, y, \end{aligned}$$

then \(\Vert \tilde{\nabla } f(x)\Vert \leqslant L_c,\,\forall x\), where \(\tilde{\nabla }f(x)\) is a subgradient of f at \(x\). Hence,

$$\begin{aligned}&{E}_{I_k}(x_{I_k}-x_{I_k}^{k+1})^\top \tilde{\nabla }_{I_k}f(x^k)\\= & {} {E}_{I_k}(x_{I_k}-x_{I_k}^k)^\top \tilde{\nabla }_{I_k}f(x^k) + {E}_{I_k}(x_{I_k}^k-x_{I_k}^{k+1})^\top \tilde{\nabla }_{I_k}f(x^k)\\= & {} \frac{n}{N}(x-x^k)^\top \tilde{\nabla } f(x^k) + {E}_{I_k}(x^k-x^{k+1})^\top \tilde{\nabla } f(x^{k+1})\\&+\,{E}_{I_k}(x^k-x^{k+1})^\top \big (\tilde{\nabla }f(x^k)-\tilde{\nabla } f(x^{k+1})\big )\\\leqslant & {} \frac{n}{N}(f(x)-f(x^k)) + {E}_{I_k}[f(x^k)-f(x^{k+1})]\\&+\,{E}_{I_k}(x^k-x^{k+1})^\top \big (\tilde{\nabla }f(x^k)-\tilde{\nabla } f(x^{k+1})\big )\\= & {} \frac{n-N}{N}(f(x)-f(x^k))+{E}_{I_k}[f(x)-f(x^{k+1})]\\&+\,{E}_{I_k}(x^k-x^{k+1})^\top \big (\tilde{\nabla }f(x^k)-\tilde{\nabla } f(x^{k+1})\big ). \end{aligned}$$

Now, following the proof of Lemma 3.1, we can obtain a result similar to (4.8), and then, through the same arguments as in the proof of Theorem 4.2, we can establish a sublinear convergence rate of \(O(1/\sqrt{t})\).

5 Numerical Experiments

In this section, we test the proposed randomized primal–dual method on three problems. The first one is the nonnegativity constrained quadratic programming (NCQP) problem:

$$\begin{aligned} \min _{x\in \mathbb {R}^n} F(x)\equiv \frac{1}{2}x^\top Q x + c^\top x, \text{ s.t. } Ax=b, x_i\geqslant 0, i=1,\cdots , n, \end{aligned}$$
(5.1)

where \(A\in \mathbb {R}^{m\times n}\), and \(Q\in \mathbb {R}^{n\times n}\) is a symmetric positive semidefinite (PSD) matrix. The second problem is the constrained Lasso:

$$\begin{aligned} \min _{x\in \mathbb {R}^p} F(x)\equiv \frac{1}{2}\Vert Ax-b\Vert ^2+\tau \Vert x\Vert _1 \text{ s.t. } Cx=d, \end{aligned}$$
(5.2)

where \(A\in \mathbb {R}^{m\times p}\) and \(C\in \mathbb {R}^{n\times p}\), and the third one is the log-barrier penalized problem of a linear program:

$$\begin{aligned} \min _{x,y} F(x,y)\equiv c^\top x - \mu e^\top \log x - \mu e^\top \log y, \text{ s.t. } Ax + y = b,\, x_i \leqslant u_i, \, i = 1,\cdots , n, \end{aligned}$$
(5.3)

where \(A\in \mathbb {R}^{m\times n}\). The first two problems fall into the case in Theorem 3.6, i.e., multiple x blocks and no y, and the third one falls into the case in Theorem 3.7, i.e., multiple x blocks and a single y block. We perform all experiments on a MacBook Pro with 4 cores. The first experiment demonstrates the parallelization performance of the proposed method, and the remaining ones compare it to other methods.

5.1 Parallelization

This test illustrates the flexibility of our new method in terms of parallel and distributed computing. We test the algorithm on NCQP (5.1) with \(m = 200\) and \(n=2\,000\), and we generate \(Q=HH^\top \), where the components of \(H\in \mathbb {R}^{n\times n}\) follow the standard Gaussian distribution. The matrix A and vectors b, c are also randomly generated. We treat every component of x as one block, and at every iteration we select and update p blocks, where p is the number of cores used. Figure 1 shows the running time using 1, 2, and 4 cores, where the optimal value \(F(x^*)\) is obtained by calling the Matlab function quadprog with tolerance \(10^{-16}\). From the figure, we see that our proposed method achieves nearly linear speedup.
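
The instance generation of this test, together with the coordinate-wise form that the update (2.16) takes for (5.1) (single-coordinate blocks, \(u_i=0\), \(\mathcal {X}_i=\{x_i\geqslant 0\}\), \(\hat{P}=\hat{p}I\)), can be sketched in Python/NumPy as follows; the dimensions are much smaller than in the actual experiment, and \(\hat{p}\) below is a crude bound in the spirit of (3.10) rather than a tuned choice.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 20, 200                        # much smaller than m = 200, n = 2000 used in the test
H = rng.standard_normal((n, n)) / np.sqrt(n)
Q = H @ H.T                           # symmetric PSD Hessian, as in the experiment
c = rng.standard_normal(n)
A = rng.standard_normal((m, n)) / np.sqrt(n)
b = A @ np.abs(rng.standard_normal(n))    # guarantees a feasible nonnegative point

p_sel = 4                             # coordinates (blocks) updated per iteration
theta = p_sel / n
rho_x = 1.0
rho = theta * rho_x                   # rho = theta*rho_x, as in Theorem 3.6
p_hat = np.linalg.norm(Q, 2) + rho_x * np.linalg.norm(A, 2) ** 2   # crude bound, cf. (3.10)

x, lam = np.zeros(n), np.zeros(m)
for k in range(20000):
    I = rng.choice(n, size=p_sel, replace=False)       # rule (U): p_sel coordinates
    r = A @ x - b
    g = (Q @ x + c)[I] - A[:, I].T @ (lam - rho_x * r) # linearized term in (2.16)
    x[I] = np.maximum(0.0, x[I] - g / p_hat)           # projected proximal step onto x_i >= 0
    lam -= rho * (A @ x - b)

print("objective:", 0.5 * x @ Q @ x + c @ x, " feasibility:", np.linalg.norm(A @ x - b))
```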

Fig. 1

Nearly linear speedup performance of the proposed primal–dual method for solving (5.1) on a 4-core machine. Left: distance of objective to optimal value \(|F(x^k)-F(x^*)|\); Right: violation of feasibility \(\Vert Ax^k-b\Vert \)

5.2 Comparison to Other Methods

In this set of experiments, we compare the proposed method to the linearized ALM and the cyclic linearized ADMM. Note that the linearized ALM is a fully Jacobian update method, while the cyclic linearized ADMM is fully Gauss–Seidel. Without linearization, the former method reduces to the proximal Jacobian ADMM (see Sect. 6.4) and the latter to the direct multi-block ADMM (see Sect. 6.3). All three methods are tested on NCQP (5.1), the constrained Lasso (5.2), and the log-barrier penalized problem (5.3).

For the NCQP (5.1), we set \(m = 1\,000\) and \(n= 5\,000\) and generate \(Q=HH^\top \), where the components of \(H\in \mathbb {R}^{n\times (n-50)}\) follow the standard Gaussian distribution. Note that Q is singular, and thus (5.1) is not strongly convex. We evenly partition the variable into 100 blocks. At each iteration of our method, we uniformly at random select one block variable to update. Figure 2 shows the performance of the three compared methods, where one epoch is equivalent to updating all blocks once. The optimal value is computed via the Matlab built-in function quadprog. From the figure, we see that our proposed method is comparable to the cyclic linearized ADMM (L-ADMM) and significantly better than the linearized ALM (L-ALM). Although the cyclic ADMM performs well on this instance, in general it can diverge if the problem has more than two blocks; see [19].

Fig. 2

Comparison of the proposed RPDBU to the L-ALM and the cyclic L-ADMM on solving the nonnegativity constrained quadratic programming (5.1). Left: distance of objective to optimal value \(|F(x^k)-F(x^*)|\); Right: violation of feasibility \(\Vert Ax^k-b\Vert \)

Fig. 3

Comparison of the proposed RPDBU to the L-ALM and the cyclic L-ADMM on solving the constrained Lasso (5.2). Left: distance of objective to optimal value \(|F(x^k)-F(x^*)|\); right: violation of feasibility \(\Vert Cx^k-d\Vert \)

Fig. 4

Comparison of the proposed RPDBU to the L-ALM and the cyclic L-ADMM on solving the log-barrier penalized problem (5.3). Left: distance of objective to optimal value \(|F(x^k,y^k)-F(x^*,y^*)|\); right: violation of feasibility \(\Vert Ax^k+y^k-b\Vert \)

For the constrained Lasso (5.2), we set \(m=n=200\) and \(p=2\,000\). A and C are generated according to the standard Gaussian distribution. A sparse vector \(x^o\) is generated with 20 nonzero entries, whose locations are chosen uniformly at random and whose values follow the standard Gaussian distribution. Then, we let \(b=Ax^o+ 0.01\xi \) and \(d=Cx^o\), where \(\xi \) is generated from the standard Gaussian distribution. We evenly partition the variable into 400 blocks. As in the previous test, at each iteration of our method, we uniformly at random select one block variable to update. Figure 3 plots the results given by the three compared methods. The optimal value is computed via CVX [43] with high-precision tolerance. From the figure, we see that our proposed method is significantly better than the linearized ALM but slightly worse than the cyclic linearized ADMM. However, the convergence of the cyclic ADMM is not guaranteed. In addition, our method can be parallelized to achieve nearly linear speedup, while the cyclic ADMM is intrinsically serial.

For the log-barrier penalized problem (5.3), we set \(m=500\) and \(n=2\,000\). \(\mu \) is set to one, and \(u_i=10\) for all i. The entries of A follow the standard Gaussian distribution, and each entry of b is set to a positive number to ensure the existence of a feasible solution. We evenly partition the x-variable into 400 blocks and treat y as a single block. At each iteration, we uniformly at random choose one x block to update, then perform an update to y, and finally renew the multipliers. The results of all three methods are shown in Fig. 4, where the optimal solution is computed via CVX with high-precision tolerance. The figure again demonstrates the significantly better performance of our method over the linearized ALM. The cyclic ADMM is slightly better. However, note again that our proposed method is capable of nearly linear speedup by parallelization, while the cyclic ADMM can only be implemented in a serial way.

6 Connections to Existing Methods

In this section, we discuss how Algorithms 1 and 2 relate to several existing methods in the literature, and we also compare their convergence results. It turns out that the proposed algorithms specialize to several known methods, or variants thereof, under various specific conditions. Therefore, our convergence analysis recovers some existing results as special cases and also provides new convergence results for certain existing algorithms, such as the proximal Jacobian parallel ADMM and the primal–dual scheme in (1.9).

6.1 Randomized Proximal Coordinate Descent

Randomized proximal coordinate descent (RPCD) was proposed in [44] for smooth convex optimization problems. It was then extended in [45, 46] to deal with nonsmooth problems that can be formulated as

$$\begin{aligned} \min _x f(x_1,\cdots ,x_N)+\sum _{i=1}^N u_i(x_i), \end{aligned}$$
(6.1)

where \(x=(x_1;\cdots ;x_N)\). Toward solving (6.1), at each iteration k, the RPCD method first randomly selects one block \(i_k\) and then performs the update:

$$\begin{aligned} x_i^{k+1}=\left\{ \begin{array}{ll} {{\,\mathrm{arg\,min}\,}}_{x_i} \langle \nabla _i f(x^k), x_i\rangle + \frac{L_i}{2}\Vert x_i-x_i^k\Vert _2^2+u_i(x_i), &{} \text { if }i=i_k,\\ x_i^k, &{} \text { if } i\ne i_k, \end{array}\right. \end{aligned}$$
(6.2)

where \(L_i\) is the Lipschitz constant of the partial gradient \(\nabla _i f(x)\). With more than one block selected at a time, (6.2) has been further extended to parallel coordinate descent in [47].
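
As an illustration, the Python/NumPy sketch below runs (6.2) on an unconstrained lasso instance, i.e., \(f(x)=\frac{1}{2}\Vert Ax-b\Vert ^2\) and \(u_i(x_i)=\tau |x_i|\) with single-coordinate blocks (the data and parameters are made up); the proximal step of the \(\ell _1\) term is soft-thresholding, and \(L_i=\Vert a_i\Vert ^2\) for this f.

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, tau = 100, 300, 0.1
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
L = np.sum(A * A, axis=0)            # L_i = ||a_i||^2, Lipschitz constant of grad_i f

def soft(v, t):                      # proximal operator of t*|.|
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(n)
res = A @ x - b                      # maintain the residual A x - b incrementally
for k in range(20 * n):
    i = rng.integers(n)              # one coordinate chosen uniformly at random
    grad_i = A[:, i] @ res           # grad_i f(x) for f(x) = 0.5*||Ax - b||^2
    new_xi = soft(x[i] - grad_i / L[i], tau / L[i])
    res += A[:, i] * (new_xi - x[i])
    x[i] = new_xi

print("objective:", 0.5 * np.linalg.norm(A @ x - b) ** 2 + tau * np.linalg.norm(x, 1))
```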

When there is no linear constraint and no y-variable in (1.1), Algorithm 1 reduces to the scheme in (6.2) if \(I_k=\{i_k\}\), i.e., only one block is chosen, and \(P^k= L_{i_k}I, \lambda ^k=0,\,\forall k\); it reduces to the parallel coordinate descent in [47] if \(I_k=\{i_k^1,\cdots ,i_k^n\}\) and \(P^k=\text {blkdiag}(L_{i_k^1}I,\cdots , L_{i_k^n}I), \lambda ^k=0,\,\forall k\). Although the convergence rate results in [45,46,47] are non-ergodic, we can easily strengthen our result to a non-ergodic one by noticing that (3.2) implies the nonincreasing monotonicity of the objective when Algorithm 1 is applied to (6.1).

6.2 Stochastic Block Proximal Gradient

For solving problem (6.1) with a stochastic f, [41] proposed a stochastic block proximal gradient (SBPG) method, which iteratively performs the update in (6.2) with \(\nabla _i f(x^k)\) replaced by a stochastic approximation. If f is Lipschitz differentiable, an ergodic \(O(1/\sqrt{t})\) convergence rate was shown. Setting \(I_k=\{i_k\},\forall k\), we reduce Algorithm 2 to the SBPG method, and thus our convergence results in Proposition 4.3 recover those in [41].

6.3 Multi-block ADMM

Without coupled functions or proximal terms, Algorithm 1 can be regarded as a randomized variant of the multi-block ADMM scheme in (1.6). While the multi-block ADMM can diverge if the problem has three or more blocks, our result in Theorem 3.6 shows that an O(1/t) convergence rate is guaranteed if, at each iteration, one randomly selected block is updated, followed by an update of the multiplier. Note that in the case of no coupled function and \(n=1\), (3.10) indicates that we can choose \(P^k=0\), i.e., no proximal term is needed. Hence, randomization is a key to convergence.

When there are only two blocks, the ADMM has been shown (e.g., [14]) to have an ergodic O(1/t) convergence rate. If there are no coupled functions, (3.13) and (3.18a) both indicate that we can choose \(\hat{P}=\rho _x A^\top A\) and \(\hat{Q}=\rho _y B^\top B\) if \(\theta =1\), i.e., all x and y blocks are selected. Thus, according to (2.15), we can set \(P^k=0, Q^k=0, \,\forall k\), in which case Algorithm 1 reduces to the classic 2-block ADMM. Hence, our results in Theorems 3.7 and 3.8 both recover the ergodic O(1/t) convergence rate of the ADMM for 2-block convex optimization problems.

6.4 Proximal Jacobian Parallel ADMM

In [20], the proximal Jacobian parallel ADMM (Prox-JADMM) was proposed to solve the linearly constrained multi-block separable convex optimization model

$$\begin{aligned} \min _x \sum _{i=1}^N u_i(x_i), \text{ s.t. } \sum _{i=1}^N A_i x_i=b. \end{aligned}$$
(6.3)

At each iteration, the Prox-JADMM performs the following updates for \(i=1,\cdots ,N\) in parallel:

$$\begin{aligned} x_i^{k+1}={\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{x_i}}\, u_i(x_i)-\langle \lambda ^k, A_ix_i\rangle +\frac{\rho }{2}\big \Vert A_ix_i+\sum _{j\ne i}A_jx_j^k-b\big \Vert _2^2+\frac{1}{2}\Vert x_i-x_i^k\Vert _{P_i}^2, \end{aligned}$$
(6.4)

and then updates the multiplier by

$$\begin{aligned} \lambda ^{k+1}=\lambda ^k-\gamma \rho \left( \sum _{i=1}^N A_ix_i^{k+1}-b\right) , \end{aligned}$$
(6.5)

where \(P_i\succ 0,\forall i\), and \(\gamma >0\) is a damping parameter. By choosing appropriate parameters, [20] established a convergence rate of order 1/t based on the squared norm of the difference of two consecutive iterates.
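
For a toy separable instance with \(u_i(x_i)=\frac{1}{2}\Vert x_i\Vert ^2\) (our own choice so that (6.4) has a closed form), the parallel update (6.4) and the damped multiplier step (6.5) can be sketched in Python/NumPy as follows; the scalar proximal weight and the value of \(\gamma \) are conservative placeholders, and the precise admissible choices are given in [20].

```python
import numpy as np

rng = np.random.default_rng(8)
m, p, N = 20, 40, 4
blocks = np.array_split(np.arange(p), N)
A = rng.standard_normal((m, p)) / np.sqrt(p)
b = rng.standard_normal(m)
rho, gamma = 1.0, 0.5                               # illustrative values only
w = rho * (N - 1) * np.linalg.norm(A, 2) ** 2       # conservative scalar weight for P_i = w*I

x, lam = np.zeros(p), np.zeros(m)
for k in range(2000):
    Ax_old = A @ x
    x_new = x.copy()
    for idx in blocks:                              # all blocks updated in parallel from x^k
        Ai = A[:, idx]
        s = Ax_old - Ai @ x[idx]                    # sum_{j != i} A_j x_j^k
        rhs = Ai.T @ (lam - rho * (s - b)) + w * x[idx]
        x_new[idx] = np.linalg.solve((1 + w) * np.eye(len(idx)) + rho * Ai.T @ Ai, rhs)
    x = x_new
    lam -= gamma * rho * (A @ x - b)                # damped multiplier update (6.5)

print("feasibility violation:", np.linalg.norm(A @ x - b))
```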

If there is no y-variable or coupled function f in (1.1), then by setting \(I_k=[N]\) and \(P^k=\mathrm {blkdiag}(\rho _x A_1^\top A_1+P_1,\cdots ,\rho _x A_N^\top A_N+P_N)-\rho _x A^\top A\succcurlyeq 0,\,\forall k\), where the \(P_i\) are the same as those in (6.4), Algorithm 1 reduces to the Prox-JADMM with \(\gamma =1\), and Theorem 3.6 provides a new convergence result in terms of the objective value and the feasibility measure.

6.5 Randomized Primal–Dual Scheme in (1.9)

In this subsection, we show that the scheme in (1.9) is a special case of Algorithm 1. Let g be the convex conjugate of \(g^*:=h+\iota _{{\mathcal {Z}}}\), namely, \(g(y)=\sup _{z} \langle y, z\rangle -h(z)-\iota _{\mathcal {Z}}(z)\). Then, (1.8) is equivalent to the optimization problem:

$$\begin{aligned} \min _{x\in {\mathcal {X}}} \sum _{i=1}^N u_i(x_i)+g(-Ax), \end{aligned}$$

which can be further written as

$$\begin{aligned} \min _{x\in {\mathcal {X}}, y} \sum _{i=1}^N u_i(x_i)+g(y), \text{ s.t. } Ax+y=0. \end{aligned}$$
(6.6)

Proposition 6.1

The scheme in (1.9) is equivalent to the following updates:

$$\begin{aligned}&x_i^{k+1}=\left\{ \begin{array}{l} \underset{x_i\in {\mathcal {X}}_i}{{{\,\mathrm{arg\,min}\,}}}\langle -z^k, A_i x_i\rangle +u_i(x_i)+\frac{q}{2\eta }\Vert A_i (x_i-x_i^k)+r^k\Vert ^2+\frac{1}{2}\Vert x_i-x_i^k\Vert _{\tau I-\frac{q}{\eta }A_i^\top A_i}^{2}, \quad ~i=i_k,\\ x_i^k, \quad ~i\ne i_k, \end{array}\right. \end{aligned}$$
(6.7a)
$$\begin{aligned}&y^{k+1}={\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{y}}\, g(y)-\langle y, z^k\rangle +\frac{1}{2\eta }\Vert y+ A x^{k+1}\Vert ^2, \end{aligned}$$
(6.7b)
$$\begin{aligned}&z^{k+1}=z^k-\frac{1}{\eta }(Ax^{k+1}+y^{k+1}), \end{aligned}$$
(6.7c)

where \(r^k=Ax^k+y^k\). Therefore, it is a special case of Algorithm 1 applied to (6.6) with the setting of \(I_k=\{i_k\}, \rho _x=\frac{q}{\eta }, \rho _y=\rho =\frac{1}{\eta }\) and \(P^k=\tau I-\frac{q}{\eta }A_{i_k}^\top A_{i_k}, Q^k=0,\forall k\).

While the sublinear convergence rate result in [36] requires the boundedness of \({\mathcal {X}}\) and \({\mathcal {Z}}\), the result in Theorem 3.7 indicates that the boundedness assumption can be removed if we add a proximal term to the y-update in (6.7b).

7 Concluding Remarks

We have proposed a randomized primal–dual coordinate update algorithm, called RPDBU, for solving linearly constrained convex optimization problems with multi-block decision variables and coupled terms in the objective. By using a randomization scheme and proximal gradient mappings, we show a sublinear convergence rate of the RPDBU method. In particular, without any assumption other than convexity on the objective function and without imposing any restrictions on the constraint matrices, an O(1/t) convergence rate is established. We have also extended the RPDBU to solve problems where the objective is stochastic. If a stochastic subgradient estimator is available, then we show that by adaptively choosing the parameter \(\alpha _k\) in the added proximal term, an \(O(1/\sqrt{t})\) convergence rate can be established. Furthermore, if there is no coupled function f, then we can remove the proximal term, and the algorithm reduces to a randomized multi-block ADMM. Hence, the convergence of the original randomized multi-block ADMM follows as a consequence of our analysis. Note also that by taking the sampling set \(I_k\) as the whole set and \(P^k\) as some special matrices, our algorithm specializes to the proximal Jacobian ADMM. Finally, we pose an open problem: does a deterministic counterpart of the RPDBU exist that retains similar convergence properties for solving problem (1.1)? For instance, it would be interesting to know whether the algorithm would still converge if a deterministic cyclic update rule were applied while a proper proximal term is incorporated.