Accelerated primal–dual proximal block coordinate updating methods for constrained convex optimization

Xu, Yangyang; Zhang, Shuzhong

doi:10.1007/s10589-017-9972-z

Accelerated primal–dual proximal block coordinate updating methods for constrained convex optimization

Published: 28 November 2017

Volume 70, pages 91–128, (2018)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Computational Optimization and Applications Aims and scope Submit manuscript

Accelerated primal–dual proximal block coordinate updating methods for constrained convex optimization

Download PDF

Yangyang Xu¹ &
Shuzhong Zhang²

We’re sorry, something doesn't seem to be working properly.

Please try refreshing the page. If that doesn't work, please contact support so we can address the problem.

Abstract

Block coordinate update (BCU) methods enjoy low per-update computational complexity because every time only one or a few block variables would need to be updated among possibly a large number of blocks. They are also easily parallelized and thus have been particularly popular for solving problems involving large-scale dataset and/or variables. In this paper, we propose a primal–dual BCU method for solving linearly constrained convex program with multi-block variables. The method is an accelerated version of a primal–dual algorithm proposed by the authors, which applies randomization in selecting block variables to update and establishes an O(1 / t) convergence rate under convexity assumption. We show that the rate can be accelerated to $O(1/t^2)$ if the objective is strongly convex. In addition, if one block variable is independent of the others in the objective, we then show that the algorithm can be modified to achieve a linear rate of convergence. The numerical experiments show that the accelerated method performs stably with a single set of parameters while the original method needs to tune the parameters for different datasets in order to achieve a comparable level of performance.

Randomized Primal–Dual Proximal Block Coordinate Updates

Article 19 December 2018

Block coordinate proximal gradient methods with variable Bregman functions for nonsmooth separable optimization

Article 27 January 2016

Efficiency of Coordinate Descent Methods for Structured Nonconvex Optimization

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Motivated by the need to solve large-scale optimization problems and increasing capabilities in parallel computing, block coordinate update (BCU) methods have become particularly popular in recent years due to their low per-update computational complexity, low memory requirements, and their potentials in a distributive computing environment. In the context of optimization, BCU first appeared in the form of block coordinate descent (BCD) type of algorithms which can be applied to solve unconstrained smooth problems or those with separable nonsmooth terms in the objective (possibly with separable constraints). More recently, it has been developed for solving problems with nonseparable nonsmooth terms and/or constraint in a primal–dual framework.

In this paper, we consider the following linearly constrained multi-block structured optimization model:

$$\begin{aligned} \min _x f(x)+\sum _{i=1}^M g_i(x_i), \text{ s.t. } \sum _{i=1}^M A_i x_i =b, \end{aligned}$$

(1)

where x is partitioned into disjoint blocks $(x_1,x_2,\ldots ,x_M)$, f is a smooth convex function with Lipschitz continuous gradient, and each $g_i$ is proper closed convex and possibly non-differentiable. Note that $g_i$ can include an indicator function of a convex set ${\mathcal {X}}_i$, and thus (1) can implicitly include certain separable block constraints in addition to the nonseparable linear constraint.

Many applications arising in statistical and machine learning, image processing, and finance can be formulated in the form of (1) including the basis pursuit [7], constrained regression [23], support vector machine in its dual form [10], portfolio optimization [28], just to name a few.

Towards finding a solution for (1), we will first present an accelerated proximal Jacobian alternating direction method of multipliers (Algorithm 1), and then we generalize it to an accelerated randomized primal–dual block coordinate update method (Algorithm 2). Assuming strong convexity on the objective function, we will establish $O(1/t^2)$ convergence rate results of the proposed algorithms by adaptively setting the parameters, where t is the total number of iterations. In addition, if further assuming smoothness and the full-rankness we then obtain linear convergence of a modified method (Algorithm 3).

1.1 Related methods

Our algorithms are closely related to randomized coordinate descent methods, primal–dual coordinate update methods, and accelerated primal–dual methods. In this subsection, let us briefly review the three classes of methods and discuss their relations to our algorithms.

1.1.1 Randomized coordinate descent methods

In the absence of linear constraint, Algorithm 2 specializes to randomized coordinate descent (RCD), which was first proposed in [31] for smooth problems and later generalized in [27, 38] to nonsmooth problems. It was shown that RCD converges sublinearly with rate O(1 / t), which can be accelerated to $O(1/t^2)$ for convex problems and achieves a linear rate for strongly convex problems. By choosing multiple block variables at each iteration, [37] proposed to parallelize the RCD method and showed the same convergence results for parallelized RCD. This is similar to setting $m>1$ in Algorithm 2, allowing parallel updates on the selected x-blocks.

1.1.2 Primal–dual coordinate update methods

In the presence of linear constraints, coordinate descent methods may fail to converge to a solution of the problem because fixing all but one block, the selected block variable may be uniquely determined by the linear constraint. To perform coordinate update to the linearly constrained problem (1), one effective approach is to update both primal and dual variables. Under this framework, the alternating direction method of multipliers (ADMM) is one popular choice. Originally, ADMM [14, 17] was proposed for solving two-block structured problems with separable objective (by setting $f=0$ and $M=2$ in (1)), for which its convergence and also convergence rate have been well-established (see e.g. [2, 13, 22, 29]). However, directly extending ADMM to the multi-block setting such as (1) may fail to converge; see [6] for a divergence example of the ADMM even for solving a linear system of equations. Lots of efforts have been spent on establishing the convergence of multi-block ADMM under stronger assumptions (see e.g. [4, 6, 16, 25, 26]) such as strong convexity or orthogonality conditions on the linear constraint. Without additional assumptions, modification is necessary for the ADMM applied to multi-block problems to be convergent; see [12, 19, 20, 39] for example. Very recently, [15] proposed a randomized primal–dual coordinate (RPDC) update method, whose asynchronous parallel version was then studied in [41]. Applied to (1), RPDC is a special case of Algorithm 2 with fixed parameters. It was shown that RPDC converges with rate O(1 / t) under convexity assumption. More general than solving an optimization problem, primal–dual coordinate (PDC) update methods have also appeared in solving fixed-point or monotone inclusion problems [9, 34,35,36]. However, for these problems, the PDC methods are only shown to converge but no convergence rate estimates are known unless additional assumptions are made such as the strong monotonicity condition.

1.1.3 Accelerated primal–dual methods

It is possible to accelerate the rate of convergence from O(1 / t) to $O(1/t^2)$ for gradient type methods. The first acceleration result was shown by Nesterov [30] for solving smooth unconstrained problems. The technique has been generalized to accelerate gradient-type methods on possibly nonsmooth convex programs [1, 32]. Primal–dual methods on solving linearly constrained problems can also be accelerated by similar techniques. Under convexity assumption, the augmented Lagrangian method (ALM) is accelerated in [21] from O(1 / t) convergence rate to $O(1/t^2)$ by using a similar technique as that in [1] to the multiplier update, and [40] accelerates the linearized ALM using a technique similar to that in [32]. Assuming strong convexity on the objective, [18] accelerates the ADMM method, and the assumption is weakened in [40] to assuming the strong convexity for one component of the objective function. On solving bilinear saddle-point problems, various primal–dual methods can be accelerated if either primal or dual problem is strongly convex [3, 5, 11]. Without strong convexity, partial acceleration is still possible in terms of the rate depending on some other quantities; see e.g. [8, 33].

1.2 Contributions of this paper

We accelerate the proximal Jacobian ADMM [12] and also generalize it to an accelerated primal–dual coordinate updating method for linearly constrained multi-block structured convex program, where in the objective there is a nonseparable smooth function. With parameters fixed during all iterations, the generalized method reduces to that in [15] and enjoys O(1 / t) convergence rate under mere convexity assumption. By adaptively setting the parameters at different iterations, we show that the accelerated method has $O(1/t^2)$ convergence rate if the objective is strongly convex. In addition, if there is one block variable that is independent of all others in the objective (but coupled in the linear constraint) and also the corresponding component function is smooth, we modify the algorithm by treating that independent variable in a different way and establish a linear convergence result. Numerically, we test the accelerated method on quadratic programming and compare it to the (nonaccelerated) RPDC method in [15]. The results demonstrate that the accelerated method performs efficiently and stably with the parameters automatically set in accordance of the analysis, while the RPDC method needs to tune its parameters for different data in order to have a comparable performance.

1.3 Nomenclature and basic facts

Notations For a positive integer M, we denote [M] as $\{1,\ldots ,M\}$. We let $x_S$ denote the subvector of x with blocks indexed by S. Namely, if $S=\{i_1, \ldots , i_m\}$, then $x_S=(x_{i_1},\ldots , x_{i_m})$. Similarly, $A_S$ denotes the submatrix of A with columns indexed by S, and $g_S$ denotes the sum of component functions indicated by S. We use $\nabla _i f(x)$ for the partial gradient of f with respect to $x_i$ at x and $\nabla _S f(x)$ with respect to $x_S$. For a nondifferentiable function g, $\tilde{\nabla } g(x)$ denotes a subgradient of g at x. We reserve I for the identity matrix and use $\Vert \cdot \Vert $ for Euclidean norm. Given a symmetric positive semidefinite (PSD) matrix W, for any vector v of appropriate size, we define $\Vert v\Vert _W^2=v^\top W v$, and

$$\begin{aligned} \Delta _W(v^+,v^o,v)=\frac{1}{2}\big [\Vert v^+-v\Vert _W^2-\Vert v^o-v\Vert _W^2+\Vert v^+-v^o\Vert _W^2\big ]. \end{aligned}$$

(2)

If $W=I$, we simply use $\Delta (v^+,v^o,v)$. Also, we denote

$$\begin{aligned} { g(x)=\sum _{i=1}^m g_i(x_i),}\quad F(x)=f(x)+g(x),\quad \Phi (\hat{x},x,\lambda )=F(\hat{x})-F(x)-\langle \lambda , A\hat{x}-b\rangle . \end{aligned}$$

(3)

Preparations A point $(x^*,\lambda ^*)$ is called a Karush–Kuhn–Tucker (KKT) point of (1) if

$$\begin{aligned} 0\in \partial F(x^*)-A^\top \lambda ^*,\quad Ax^*-b=0. \end{aligned}$$

(4)

For convex programs, the conditions in (4) are sufficient for $x^*$ to be an optimal solution of (1), and they are also necessary if a certain qualification condition holds (e.g., the Slater condition: there is x in the interior of the domain of F such that $Ax=b$). Together with the convexity of F, (4) implies

$$\begin{aligned} \Phi (x,x^*,\lambda ^*)\ge 0,\,\forall x. \end{aligned}$$

(5)

We will use the following lemmas as basic facts. The first lemma is straightforward to verify from the definition of $\Vert \cdot \Vert _W$; the second one is similar to Lemma 3.3 in [15]; the third one is from Lemma 3.5 in [15].

Lemma 1.1

For any vectors u, v and symmetric PSD matrix W of appropriate sizes, it holds that

$$\begin{aligned} u^\top W v = \frac{1}{2}\left[ \Vert u\Vert _W^2-\Vert u-v\Vert _W^2+\Vert v\Vert _W^2\right] . \end{aligned}$$

(6)

Lemma 1.2

Given a function $\phi $, for a given x and a random vector $\hat{x}$, if for any $\lambda $ (that may depend on $\hat{x}$) it holds $\mathbb {E}\Phi (\hat{x},x,\lambda )\le \mathbb {E}\phi (\lambda ),$ then for any $\gamma >0$, we have

$$\begin{aligned} \mathbb {E}\big [F(\hat{x})-F(x)+\gamma \Vert A\hat{x}-b\Vert \big ]\le \sup _{\Vert \lambda \Vert \le \gamma }\phi (\lambda ). \end{aligned}$$

Proof

Let $\hat{\lambda }=-\frac{\gamma (A\hat{x}-b)}{\Vert A\hat{x}-b\Vert }$ if $A\hat{x}-b\ne 0$, and $\hat{\lambda }=0$ otherwise. Then

$$\begin{aligned} \Phi (\hat{x},x,\hat{\lambda })=F(\hat{x})-F(x)+\gamma \Vert A\hat{x}-b\Vert . \end{aligned}$$

In addition, since $\Vert \hat{\lambda }\Vert \le \gamma $, we have $\phi (\hat{\lambda })\le \sup _{\Vert \lambda \Vert \le \gamma }\phi (\lambda )$ and thus $\mathbb {E}\phi (\hat{\lambda })\le \sup _{\Vert \lambda \Vert \le \gamma }\phi (\lambda )$. Hence, we have the desired result from $\mathbb {E}\Phi (\hat{x},x,\hat{\lambda })\le \mathbb {E}\phi (\hat{\lambda })$. $\square $

Lemma 1.3

Suppose $\mathbb {E}\big [F(\hat{x})-F(x^*)+\gamma \Vert A\hat{x}-b\Vert \big ] \le \epsilon .$ Then,

$$\begin{aligned} \mathbb {E}\Vert A\hat{x}-b\Vert \le \frac{\epsilon }{\gamma -\Vert \lambda ^*\Vert }, \text{ and } -\frac{\epsilon \Vert \lambda ^*\Vert }{\gamma -\Vert \lambda ^*\Vert }\le \mathbb {E}\big [F(\hat{x})-F(x^*)\big ] \le \epsilon , \end{aligned}$$

where $(x^*,\lambda ^*)$ satisfies the optimality conditions in (4), and we assume $\Vert \lambda ^*\Vert < \gamma $.

Outline The rest of the paper is organized as follows. Sect. 2 presents the accelerated proximal Jacobian ADMM and its convergence results. In Sect. 3, we propose an accelerated primal–dual block coordinate update method with convergence analysis. Section 4 assumes more structure on the problem (1) and modifies the algorithm in Sect. 3 to have linear convergence. Numerical results are provided in Sect. 5. Finally, Sect. 6 concludes the paper.

2 Accelerated proximal Jacobian ADMM

In this section, we propose an accelerated proximal Jacobian ADMM for solving (1). At each iteration, the algorithm updates all M block variables in parallel by minimizing a linearized proximal approximation of the augmented Lagrangian function, and then it renews the multiplier. Specifically, it iteratively performs the following updates:

$$\begin{aligned} x_i^{k+1}=&{{\mathrm{arg\,min}}}_{x_i} \left\langle \nabla _i f(x^k)-A_i^\top (\lambda ^k-\beta _k r^k), x_i\right\rangle + g_i(x_i)\nonumber \\&+ \frac{1}{2}\Vert x_i-x_i^k\Vert _{P_i^k},\, i=1,\ldots , M, \end{aligned}$$

(7a)

$$\begin{aligned} \lambda ^{k+1}=&\lambda ^k-\rho _k r^{k+1}, \end{aligned}$$

(7b)

where $\beta _k$ and $\rho _k$ are scalar parameters, $P^k$ is an $M\times M$ block diagonal matrix with $P_i^k$ as its i-th diagonal block for $i=1,\ldots ,M$, and $r^k=Ax^k-b$ denotes the residual. Note that (7a) consists of M independent subproblems, and they can be solved in parallel.

Algorithm 1 summarizes the proposed method. It reduces to the proximal Jacobian ADMM in [12] if $\beta _k,\rho _k$ and $P^k$ are fixed for all k and there is no nonseparable function f. We will show that adapting the parameters as the iteration progresses can accelerate the convergence of the algorithm.

2.1 Technical assumptions

Throughout the analysis in this section, we make the following assumptions.

Assumption 1

There exists $(x^*,\lambda ^*)$ satisfying the KKT conditions in (4).

Assumption 2

$\nabla f$ is Lipschitz continuous with modulus $L_f$.

Assumption 3

The function g is strongly convex with modulus $\mu >0$.

The first two assumptions are standard, and the third one is for showing convergence rate of $O(1/t^2)$, where t is the number of iterations. Note that if f is strongly convex with modulus $\mu _f>0$, we can let $f\leftarrow f-\frac{\mu _f}{2}\Vert \cdot \Vert ^2$ and $g\leftarrow g+\frac{\mu _f}{2}\Vert \cdot \Vert ^2$. This way, we have a convex function f and a strongly convex function g. Hence, Assumption 3 is without loss of generality. With only convexity, Algorithm 1 can be shown to converge at the rate O(1 / t) with parameters fixed for all iterations, and the order 1 / t is optimal as shown in the very recent work [24].

2.2 Convergence results

In this subsection, we show the $O(1/t^2)$ convergence rate result of Algorithm 1. First, we establish a result of running one iteration of Algorithm 1.

Lemma 2.1

(One-iteration analysis) Under Assumptions 2 and 3, let $\{(x^k,\lambda ^k)\}$ be the sequence generated from Algorithm 1. Then for any k and $(x,\lambda )$ such that $Ax=b$, it holds that

$$\begin{aligned} \begin{aligned}&~\Phi (x^{k+1},x,\lambda ) \\&\quad \le \frac{1}{2\rho _k}\left[ \Vert \lambda -\lambda ^k\Vert ^2-\Vert \lambda -\lambda ^{k+1}\Vert ^2+\Vert \lambda ^k-\lambda ^{k+1}\Vert ^2\right] -\beta _k\Vert r^{k+1}\Vert ^2 \\&\quad \quad -\,\frac{1}{2}\left[ \Vert x^{k+1}-x\Vert _{P^k-\beta _k A^\top A+\mu I}^2-\Vert x^k-x\Vert _{P^k-\beta _k A^\top A}^2\right. \\&\qquad \left. +\,\Vert x^{k+1}-x^k\Vert _{P^k-\beta _k A^\top A - L_f I}^2\right] . \end{aligned} \end{aligned}$$

(8)

Using the above lemma, we are able to prove the following theorem.

Theorem 2.2

Under Assumptions 2 and 3, let $\{(x^k,\lambda ^k)\}$ be the sequence generated by Algorithm 1. Suppose that the parameters are set to satisfy

$$\begin{aligned} 0<\rho _k\le 2\beta _k,\quad P^k\succeq \beta _k A^\top A+ L_f I,\,\forall k\ge 1, \end{aligned}$$

(9)

and there exists a number $k_0$ such that for all $k\ge 2$,

$$\begin{aligned} \frac{k+k_0+1}{\rho _k}\le & {} \frac{k+k_0}{\rho _{k-1}}, \end{aligned}$$

(10)

$$\begin{aligned} (k+k_0+1)(P^k-\beta _k A^\top A)\preceq & {} (k+k_0)(P^{k-1}-\beta _{k-1} A^\top A+\mu I). \end{aligned}$$

(11)

Then, for any $(x,\lambda )$ satisfying $Ax=b$, we have

$$\begin{aligned}&\sum _{k=1}^t(k+k_0+1)\Phi (x^{k+1},x,\lambda ) +\sum _{k=1}^t\frac{k+k_0+1}{2}(2\beta _k-\rho _k)\Vert r^{k+1}\Vert ^2 \nonumber \\&\qquad +\frac{t+k_0+1}{2}\Vert x^{t+1}-x\Vert ^2_{P^t-\beta _t A^\top A+\mu I} \le \phi _1(x,\lambda ), \end{aligned}$$

(12)

where

$$\begin{aligned} \phi _1(x,\lambda )=\frac{k_0+2}{2\rho _1}\Vert \lambda -\lambda ^1\Vert ^2 + \frac{k_0+2}{2}\Vert x^1-x\Vert ^2_{P^1-\beta _1 A^\top A}. \end{aligned}$$

(13)

In the next theorem, we provide a set of parameters that satisfy the conditions in Theorem 2.2 and establish the $O(1/t^2)$ convergence rate result.

Theorem 2.3

(Convergence rate of order $1/t^2$) Under Assumptions 1 through 3, let $\{(x^k,\lambda ^k)\}$ be the sequence generated by Algorithm 1 with parameters set to:

$$\begin{aligned} \beta _k=\rho _k=k\beta ,\quad P^k=kP+L_f I,\,\forall k\ge 1, \end{aligned}$$

(14)

where P is a block diagonal matrix satisfying $0\prec P-\beta A^\top A\preceq \frac{\mu }{2}I$. Then,

$$\begin{aligned} \max \left\{ \beta \Vert r^{t+1}\Vert ^2,\ \Vert x^{t+1}-x^*\Vert ^2_{P-\beta A^\top A}\right\} \le \frac{2}{t(t+k_0+1)}\phi _1(x^*,\lambda ^*), \end{aligned}$$

(15)

where $k_0=\frac{2L_f}{\mu }$, and $\phi _1$ is defined in (13). In addition, letting $\gamma =\max \Big \{ 2\Vert \lambda ^*\Vert ,1+\Vert \lambda ^*\Vert \Big \}$ and

$$\begin{aligned} T=\frac{t(t+2k_0+3)}{2},\quad \bar{x}^{t+1}=\frac{\sum _{k=1}^t (k+k_0+1)x^k}{T}, \end{aligned}$$

we have

$$\begin{aligned}&|F(\bar{x}^{t+1})-F(x^*)|\le \frac{1}{T}\max _{|\Vert \lambda \Vert \le \gamma }\phi _1(x^*,\lambda ),\end{aligned}$$

(16a)

$$\begin{aligned}&\Vert A\bar{x}^{t+1}-b\Vert \le \frac{1}{T\max \{1,\Vert \lambda ^*\Vert \} }\max _{\Vert \lambda \Vert \le \gamma }\phi _1(x^*,\lambda ). \end{aligned}$$

(16b)

3 Accelerating randomized primal–dual block coordinate updates

In this section, we generalize Algorithm 1 to a randomized setting where the user may choose to update a subset of blocks at each iteration. Instead of updating all M block variables, we randomly choose a subset of them to renew at each iteration. Depending on the number of processors (nodes, or cores), we can choose a single or multiple block variables for each update.

3.1 The algorithm

Our algorithm is an accelerated version of the randomized primal–dual coordinate update method recently proposed in [15], for which we shall use RPDC as its acronym.^{Footnote 1} At each iteration, it performs a block proximal gradient update to a subset of randomly selected primal variables while keeping the remaining ones fixed, followed by an update to the multipliers. Specifically, at iteration k, it selects an index set $S_k\subset \{1,\ldots ,M\}$ with cardinality m and performs the following updates:

$$\begin{aligned}&x_i^{k+1}=\left\{ \begin{array}{ll}\mathop {{{\mathrm{arg\,min}}}}\limits _{x_i}\langle \nabla _i f(x^k)-A_i^\top (\lambda ^k-\beta _k r^k), x_i\rangle +g_i(x_i)+\frac{\eta _k}{2}\Vert x_i-x_i^k\Vert ^2, &{}\text { if } i\in S_k,\\ x_i^k,&{} \text { if }i\not \in S_k\end{array}\right. \end{aligned}$$

(17a)

$$\begin{aligned}&r^{k+1}=r^k+\sum _{i\in S_k}A_i(x_i^{k+1}-x_i^k),\end{aligned}$$

(17b)

$$\begin{aligned}&\lambda ^{k+1}=\lambda ^k - \rho _k r^{k+1}, \end{aligned}$$

(17c)

where $\beta _k, \rho _k$ and $\eta _k$ are algorithm parameters, and their values will be determined later. Note that we use $\frac{\eta _k}{2}\Vert x_i-x_i^k\Vert ^2$ in (17a) for simplicity. It can be replaced by a PSD matrix weighted norm square term as in (7a), and our convergence results still hold.

Algorithm 2 summarizes the above method. If the parameters $\beta _k, \rho _k$ and $\eta _k$ are fixed during all the iterations, i.e., constant parameters, the algorithm reduces to a special case of the RPDC method in [15]. Adapting these parameters to the iterations, we will show that Algorithm 2 enjoys faster convergence rate than RPDC if the problem is strongly convex.

3.2 Convergence results

In this subsection, we establish convergence results of Algorithm 2 under Assumptions 1 and 3, and also the following partial gradient Lipschitz continuity assumption.

Assumption 4

For any $S\subset \{1,\ldots ,M\}$ with $|S|=m$, $\nabla _S f$ is Lipschitz continuous with a uniform constant $L_m$.

Note that if $\nabla f$ is Lipschitz continuous with constant $L_f$, then $L_m\le L_f$ and $L_M = L_f$. In addition, if $x^+$ and x only differ on a set $S\subset [M]$ with cardinality m, then

$$\begin{aligned} f(x^+) \le f(x) + \langle \nabla f(x), x^+-x\rangle + \frac{L_m}{2}\Vert x^+-x\Vert ^2. \end{aligned}$$

(18)

Similar to the analysis in Sect. 2, we first establish a result of running one iteration of Algorithm 2. Throughout this section, we denote $\theta =\frac{m}{M}$.

Lemma 3.1

(One iteration analysis) Under Assumptions 3 and 4, let $\{(x^k,\lambda ^k)\}$ be the sequence generated from Algorithm 2. Then for any x such that $Ax=b$, it holds

$$\begin{aligned}&~\mathbb {E}\left[ \Phi (x^{k+1},x,\lambda ^{k+1})+(\beta _k-\rho _k)\Vert r^{k+1}\Vert ^2+ \frac{\mu }{2}\Vert x^{k+1}-x\Vert ^2\right] \\&\quad \le ~(1-\theta )\mathbb {E}\left[ \Phi (x^k,x,\lambda ^k)+\beta _k\Vert r^k\Vert ^2+\frac{\mu }{2}\Vert x^k-x\Vert ^2\right] \nonumber \\&\qquad - \mathbb {E}\left[ \Delta _{\eta _k I-\beta _k A^\top A}(x^{k+1},x^k,x) - \frac{L_m}{2} \Vert x^{k+1}-x^k\Vert ^2\right] . \nonumber \end{aligned}$$

(19)

When $\mu =0$ (i.e., (1) is convex), Algorithm 2 has O(1 / t) convergence rate with fixed $\beta _k,\rho _k,\eta _k$. This can be shown from (19), and a similar result in slightly different form has been established in [15, Theorem 3.6]. For completeness, we provide its proof in the appendix.

Theorem 3.2

(Un-accelerated convergence) Under Assumptions 1 and 4, let $\{(x^k,\lambda ^k)\}$ be the sequence generated from Algorithm 2 with $\beta _k=\beta ,\rho _k=\rho ,\eta _k=\eta $ for all k, satisfying

$$\begin{aligned} 0<\rho \le \theta \beta ,\quad \eta \ge L_m+\beta \Vert A\Vert _2^2, \end{aligned}$$

where $\Vert A\Vert _2$ denotes the spectral norm of A. Then

$$\begin{aligned}&\big |\mathbb {E}[F(\bar{x}^t)-F(x^*)]\big |\le \frac{1}{1+\theta (t-1)}\max _{\Vert \lambda \Vert \le \gamma }\phi _2(x^*,\lambda ), \end{aligned}$$

(20a)

$$\begin{aligned}&\mathbb {E}\Vert A\bar{x}^t-b\Vert \le \frac{1}{(1+\theta (t-1))\max \{1, \Vert \lambda ^*\Vert \} }\max _{\Vert \lambda \Vert \le \gamma }\phi _2(x^*,\lambda ), \end{aligned}$$

(20b)

where $(x^*,\lambda ^*)$ satisfies the KKT conditions in (4), $\gamma =\max \{\Vert 2\lambda ^*\Vert , 1+\Vert \lambda ^*\Vert \}$, and

$$\begin{aligned} \bar{x}^t= & {} \frac{x^{t+1}+\theta \sum _{k=2}^t x^k}{1+\theta (t-1)},\\ \phi _2(x,\lambda )= & {} (1-\theta )\left( F(x^1)-F(x)\right) +\frac{\eta }{2}\Vert x^1-x\Vert ^2+\frac{\theta \Vert \lambda \Vert ^2}{2\rho }. \end{aligned}$$

When F is strongly convex, the above O(1 / t) convergence rate can be accelerated to $O(1/t^2)$ by adaptively changing the parameters at each iteration. The following theorem is our main result. It shows an $O(1/t^2)$ convergence result under certain conditions on the parameters. Based on this theorem, we will give a set of parameters that satisfy these conditions, thus providing a specific scheme to choose the parameters.

Theorem 3.3

Under Assumptions 3 and 4, let $\{(x^k,\lambda ^k)\}$ be the sequence generated from Algorithm 2 with parameters satisfying the following conditions for a certain number $k_0$:

$$\begin{aligned} \theta (k+k_0+1)\ge & {} 1,\,\forall k\ge 2, \end{aligned}$$

(21a)

$$\begin{aligned} (\beta _{k-1}-\rho _{k-1})(k+k_0)\ge & {} (1-\theta )(k+k_0+1)\beta _k,\,\forall 2\le k \le t, \nonumber \\\end{aligned}$$

(21b)

$$\begin{aligned} \frac{\theta (k+k_0+1)-1}{\rho _{k-1}}\ge & {} \frac{\theta (k+k_0+2)-1}{\rho _k},\,\forall \, 2\le k\le t-1, \nonumber \\\end{aligned}$$

(21c)

$$\begin{aligned} \frac{\theta (t+k_0+1)-1}{\rho _{t-1}}\ge & {} \frac{t+k_0+1}{\rho _t},\, \end{aligned}$$

(21d)

$$\begin{aligned} \beta _k(k+k_0+1)\ge & {} \beta _{k-1}(k+k_0),\,\forall k\ge 2, \end{aligned}$$

(21e)

$$\begin{aligned} (k+k_0+1)(\eta _k-L_m)I\succeq & {} \beta _k(k+k_0+1)A^\top A,\,\forall k\ge 1, \end{aligned}$$

(21f)

$$\begin{aligned} (k+k_0)\eta _{k-1}+\mu \big (\theta (k+k_0+1)-1\big )\ge & {} (k+k_0+1)\eta _k,\,\forall k\ge 2. \end{aligned}$$

(21g)

Then for any $(x,\lambda )$ such that $Ax=b$, we have

$$\begin{aligned}&(t+k_0+1) \mathbb {E}\Phi (x^{t+1},x,\lambda )+\sum _{k=2}^t\big (\theta (k+k_0+1)-1\big )\mathbb {E}\Phi (x^k,x,\lambda )\nonumber \\&\quad \le (1-\theta )(k_0+2)\mathbb {E}\left[ \Phi (x^1,x,\lambda ^1)+\beta _1\Vert r^1\Vert ^2+\frac{\mu }{2}\Vert x^1-x\Vert ^2\right] \nonumber \\&\qquad +\frac{\eta _1(k_0+2)}{2}\mathbb {E}\Vert x^1-x\Vert ^2 \nonumber \\&\qquad +\frac{\theta (k_0+3)-1}{2\rho _1}\mathbb {E}\Vert \lambda ^1-\lambda \Vert ^2-\frac{t+k_0+1}{2}\mathbb {E}\Vert x^{t+1}-x\Vert _{(\mu +\eta _t) I-\beta _t A^\top A}^2 .\nonumber \\ \end{aligned}$$

(22)

Specifying the parameters that satisfy (21), we show $O(1/t^2)$ convergence rate of Algorithm 2.

Proposition 3.4

The following parameters satisfy all conditions in (21):

$$\begin{aligned}&\beta _k= \frac{\mu (\theta k+2+\theta )}{2\rho \Vert A\Vert _2^2},\,\forall k\ge 1,\end{aligned}$$

(23a)

$$\begin{aligned}&\rho _k=\left\{ \begin{array}{ll} \frac{\theta \beta _k}{(6-5\theta )}, &{} \text { for }1\le k\le t-1,\\ \frac{(t+k_0+1)\rho _{t-1}}{\theta (t+k_0+1)-1}, &{}\text { for }k=t \end{array} \right. \end{aligned}$$

(23b)

$$\begin{aligned}&\eta _k=\rho \beta _k\Vert A\Vert _2^2+L_m,\,\forall k\ge 1, \end{aligned}$$

(23c)

where $\rho \ge 1$ and

$$\begin{aligned} k_0=\frac{4}{\theta }+\frac{2L_m}{\theta \mu }. \end{aligned}$$

(24)

Theorem 3.5

(Accelerated convergence) Under Assumptions 1, 3 and 4, let $\{(x^k,\lambda ^k)\}$ be the sequence generated from Algorithm 2 with parameters taken as in (23). Then

$$\begin{aligned} \big |\mathbb {E}[F(\bar{x}^{t+1})-F(x^*)]\big |\le & {} \frac{1}{T}\max _{\Vert \lambda \Vert \le \gamma }\phi _3(x^*,\lambda ),\nonumber \\ \mathbb {E}\Vert A\bar{x}^{t+1}-b\Vert\le & {} \frac{1}{T\max \{1,\Vert \lambda ^*\Vert \}}\max _{\Vert \lambda \Vert \le \gamma }\phi _3(x^*,\lambda ), \end{aligned}$$

(25)

where $\gamma =\max \{2\Vert \lambda ^*\Vert , 1+\Vert \lambda ^*\Vert \}$,

$$\begin{aligned} \bar{x}^{t+1}= & {} \frac{(t+k_0+1)x^{t+1}+\sum _{k=2}^t\big (\theta (k+k_0+1)-1\big )x^k}{T}, \\ \phi _3(x, \lambda )= & {} (1-\theta )(k_0+2)\left[ F(x^1)-F(x)+\beta _1\Vert r^1\Vert ^2+\frac{\mu }{2}\Vert x^1-x\Vert ^2\right] \\&+\frac{\eta _1(k_0+2)}{2}\Vert x^1-x\Vert ^2+\frac{\theta (k_0+3)-1}{2\rho _1}\Vert \lambda \Vert ^2 \end{aligned}$$

and

$$\begin{aligned} T= (t+k_0+1)+\sum _{k=2}^t\big (\theta (k+k_0+1)-1\big ). \end{aligned}$$

In addition,

$$\begin{aligned} \mathbb {E}\Vert x^{t+1}-x^*\Vert ^2\le \frac{2\phi _3(x^*,\lambda ^*)}{(t+k_0+1)\left( \frac{(\rho -1)\mu }{2\rho }(\theta t + \theta + 2)+2\mu +L_m\right) }. \end{aligned}$$

4 Linearly convergent primal–dual method

In this section, we assume some more structure on (1) and show that a linear rate of convergence is possible. If there is no linear constraint, Algorithm 2 reduces to the RCD method proposed in [31]. It is well-known that RCD converges linearly if the objective is strongly convex. However, with the presence of linear constraints, mere strong convexity of the objective of the primal problem only ensures the smoothness of its Lagrangian dual function, but not its strong concavity. Hence, in general, we do not expect linear convergence by only assuming strong convexity on the primal objective function. To ensure linear convergence on both the primal and dual variables, we need additional assumptions.

Throughout this section, we suppose that there is at least one block variable being absent in the nonseparable part of the objective, namely f. For convenience, we rename this block variable to be y, and the corresponding component function and constraint coefficient matrix as h and B. Specifically, we consider the following problem

$$\begin{aligned} \min _{x,y} f(x_1,\ldots ,x_M)+\sum _{i=1}^M g_i(x_i)+h(y), \text{ s.t. } \sum _{i=1}^M A_ix_i+By=b. \end{aligned}$$

(26)

One example of (26) is the problem that appears while computing a point on the central path of a convex program. Suppose we are interested in solving

$$\begin{aligned} \min _x f(x_1,\ldots ,x_M), \text{ s.t. } \sum _{i=1}^M A_ix_i \le b, \, x_i\ge 0, i = 1,\ldots , M. \end{aligned}$$

(27)

Let $y= b - \sum _{i=1}^M A_ix_i$ and use the log-barrier function. We have the log-barrier approximation of (27) as follows:

$$\begin{aligned} \min _{x, y} f(x_1,\ldots ,x_M) - \mu \sum _{i=1}^M e^\top \log x_i - \mu e^\top \log y, \text{ s.t. } \sum _{i=1}^M A_ix_i + y = b, \end{aligned}$$

(28)

where e is the all-one vector. As $\mu $ decreases, the approximation becomes more accurate.

Towards a solution to (26), we modify Algorithm 2 by updating y-variable after the x-update. Since there is only a single y-block, to balance x and y updates, we do not renew y in every iteration but instead update it in probability $\theta =\frac{m}{M}$. Hence, roughly speaking, x and y variables are updated in the same frequency. The method is summarized in Algorithm 3.

4.1 Technical assumptions

In this section, we denote $z=(x,y,\lambda )$. Assume h is differentiable. Similar to (4), a point $z^*=( x^*, y^*,\lambda ^*)$ is called a KKT point of (26) if

$$\begin{aligned}&0\in \partial F( x^*)- A^\top \lambda ^*, \end{aligned}$$

(32a)

$$\begin{aligned}&\nabla h( y^*)- B^\top \lambda ^*=0,\end{aligned}$$

(32b)

$$\begin{aligned}&A x^*+ B y^*- b=0. \end{aligned}$$

(32c)

Besides Assumptions 3 and 4, we make two additional assumptions as follows.

Assumption 5

There exists $z^*=(x^*,y^*,\lambda ^*)$ satisfying the KKT conditions in (32).

Assumption 6

The function h is strongly convex with modulus $\nu $, and its gradient $\nabla h$ is Lipschitz continuous with constant $L_h$.

The strong convexity of F and h implies

$$\begin{aligned} F(x^{k+1})-F(x^*)-\langle \tilde{\nabla }F(x^*),x^{k+1}-x^*\rangle\ge & {} \, \frac{\mu }{2}\Vert x^{k+1}-x^*\Vert ^2,\end{aligned}$$

(33a)

$$\begin{aligned} \langle y^{k+1}-y^*, \nabla h(y^{k+1})-\nabla h(y^*)\rangle\ge & {} \, \nu \Vert y^{k+1}-y^*\Vert ^2. \end{aligned}$$

(33b)

4.2 Convergence analysis

Similar to Lemma 3.1, we first establish a result of running one iteration of Algorithm 3. It can be proven by similar arguments to those showing Lemma 3.1.

Lemma 4.1

(One iteration analysis) Under Assumptions 3, 4, and 6, let $\{( x^k, y^k,\lambda ^k)\}$ be the sequence generated from Algorithm 3. Then for any k and $(x,y,\lambda )$ such that $Ax+By=b$, it holds

$$\begin{aligned}&\mathbb {E}\varphi (z^{k+1},z)+( \beta -\rho )\mathbb {E}\Vert r^{k+1}\Vert ^2+\frac{1}{\rho }\mathbb {E}\Delta (\lambda ^{k+1},\lambda ^k,\lambda )\nonumber \\&\qquad +\,\mathbb {E}\left[ \Delta _P(x^{k+1},x^k,x)-\frac{L_m}{2}\Vert x^{k+1}-x^k\Vert ^2\right] +\mathbb {E}\Delta _Q(y^{k+1},y^k,y)\nonumber \\&\quad \le (1-\theta )\mathbb {E}\varphi (z^k,z)+\beta (1 -\theta )\mathbb {E}\Vert r^k\Vert ^2+\frac{1-\theta }{\rho }\mathbb {E}\Delta (\lambda ^k,\lambda ^{k-1},\lambda ) \nonumber \\&\qquad +\, \beta \mathbb {E}\langle A(x^{k+1}- x), B( y^{k+1}- y^k)\rangle +\beta (1 -\theta )\mathbb {E}\langle B(y^k- y),A(x^{k+1}-x^k)\rangle . \nonumber \\ \end{aligned}$$

(34)

where $P=\eta _x I-\beta A^\top A$, $Q=\eta _y I-\beta B^\top B,$ and

$$\begin{aligned} \varphi (z^k,z)=F(x^k)-F(x)+\frac{\mu }{2}\Vert x^k-x\Vert ^2+\big \langle y^k- y,{\nabla } h(y^k)\big \rangle -\big \langle \lambda , Ax^k+By^k-b\big \rangle . \end{aligned}$$

(35)

In the following, we let

$$\begin{aligned} \Psi (z^k,z^*)=F(x^k)-F(x^*)-\langle \tilde{\nabla }F(x^*),x^k-x^*\rangle +\big \langle y^k- y^*,\nabla h(y^k)-\nabla h(y^*)\big \rangle , \end{aligned}$$

(36)

and

$$\begin{aligned}&\psi (z^k,z^*;P,Q,\beta ,\rho ,c,\tau ) \nonumber \\&\quad = (1-\theta )\mathbb {E}\Psi (z^k,z^*)+\frac{\beta (1 -\theta )}{2}\mathbb {E}\Vert r^k\Vert ^2\nonumber \\&\qquad +\frac{1}{2}\mathbb {E}\Vert x^{k}-x^*\Vert ^2_{P+\mu (1-\theta )I}+\frac{1}{2}\mathbb {E}\Vert y^{k}- y^*\Vert ^2_{Q+\frac{\beta (1 -\theta )}{\tau }B^\top B}\nonumber \\&\qquad +\frac{1}{2\rho }\mathbb {E}\left[ \Vert \lambda ^{k}-\lambda ^*\Vert ^2-(1-\theta )\Vert \lambda ^{k-1}-\lambda ^*\Vert ^2+\frac{1}{\theta }\Vert \lambda ^{k}-\lambda ^{k-1}\Vert ^2\right] . \end{aligned}$$

(37)

The following theorem is key to establishing linear convergence of Algorithm 3.

Theorem 4.2

Under Assumptions 3 through 6, let $\{( x^k, y^k,\lambda ^k)\}$ be the sequence generated from Algorithm 3 with $\rho =\theta \beta $. Let $0<\alpha <\theta $ and $\gamma =\max \left\{ \frac{8\Vert A\Vert _2^2}{\alpha \mu },\frac{8\Vert B\Vert _2^2}{\alpha \nu }\right\} $. Choose $\delta ,\kappa \ge 0$ such that

$$\begin{aligned} 2\left[ \begin{array}{cc}1-(1-\theta )(1+\delta ) &{}\quad (1-\theta )(1+\delta )\\ (1-\theta )(1+\delta )&{}\quad \kappa -(1-\theta )(1+\delta )\end{array}\right] \succeq \left[ \begin{array}{cc}\theta &{}\quad 1-\theta \\ 1-\theta &{} \quad \frac{1}{\theta }-(1-\theta )\end{array}\right] , \end{aligned}$$

(38)

and positive numbers $\eta _x,\eta _y,c,\tau _1,\tau _2,\beta $ such that

$$\begin{aligned} P\succeq & {} \beta (1-\theta )\tau _2 A^\top A+L_m I \end{aligned}$$

(39a)

$$\begin{aligned} Q\succeq & {} 8c Q^\top Q + 4c\rho ^2(1-\theta )\left( 1+\frac{1}{\delta }\right) B^\top B B^\top B+\beta \tau _1 B^\top B. \end{aligned}$$

(39b)

Then it holds that

$$\begin{aligned}&(1-\alpha )\mathbb {E}\Psi (z^{k+1},z^*)+\frac{1}{2}\mathbb {E}\Vert x^{k+1}- x^*\Vert ^2_{P+(\frac{\alpha \mu }{2}+\mu )I-\frac{\beta }{\tau _1}A^\top A}\nonumber \\&\qquad +\frac{1}{2}\mathbb {E}\Vert y^{k+1}- y^*\Vert ^2_{Q+\left( \frac{3\alpha \nu }{2}-8cL_h^2\right) I}\nonumber \\&\qquad +\left( \frac{\beta -\rho }{2}+\frac{1}{\gamma }\right) \mathbb {E}\Vert r^{k+1}\Vert ^2\nonumber \\&\qquad -\left( c\rho ^2\left( \kappa +2(1-\theta )\left( 1+\frac{1}{\delta }\right) \right) +2c(\beta -\rho )^2\right) \mathbb {E}\Vert B^\top r^{k+1}\Vert ^2\nonumber \\&\qquad +\left( \frac{1}{2\rho }+\frac{c}{2}\sigma _{\min }(BB^\top )\right) \nonumber \\&\qquad \times \,\mathbb {E}\left[ \Vert \lambda ^{k+1}-\lambda ^*\Vert ^2-(1-\theta )\Vert \lambda ^{k}-\lambda ^*\Vert ^2+\frac{1}{\theta }\Vert \lambda ^{k+1}-\lambda ^k\Vert ^2\right] \nonumber \\&\quad \le \psi (z^k,z^*;P,Q,\beta ,\rho ,c,\tau _2). \end{aligned}$$

(40)

Using Theorem 4.2, a linear convergence rate of Algorithm 3 follows.

Theorem 4.3

Under Assumptions 3 through 6, let $\{( x^k, y^k,\lambda ^k)\}$ be the sequence generated from Algorithm 3 with $\rho =\theta \beta $. Let $0<\alpha <\theta $ and $\gamma =\max \left\{ \frac{8\Vert A\Vert _2^2}{\alpha \mu },\frac{8\Vert B\Vert _2^2}{\alpha \nu }\right\} $. Assume that B is full row-rank and $\max \{\Vert A\Vert _2,\Vert B\Vert _2\}\le 1$. Choose $\delta ,\kappa ,\eta _x,\eta _y,c,\beta ,\tau _1,\tau _2$ satisfying (38) and (39), and in addition,

$$\begin{aligned} \frac{\alpha }{2}\mu +\theta \mu> & {} \frac{\beta }{\tau _1} \end{aligned}$$

(41a)

$$\begin{aligned} \frac{3\alpha \nu }{4}> & {} 4cL_h^2+\frac{\beta (1-\theta )}{2\tau _2}\end{aligned}$$

(41b)

$$\begin{aligned} \frac{1}{\gamma }> & {} c\rho ^2\left( \kappa +2(1-\theta )\left( 1+\frac{1}{\delta }\right) \right) +2c(\beta -\rho )^2. \end{aligned}$$

(41c)

Then

$$\begin{aligned} \psi (z^{k+1},z^*;P,Q,\beta ,\rho ,c,\tau _2)\le \frac{1}{\eta } \psi (z^k,z^*;P,Q,\beta ,\rho ,c,\tau _2), \end{aligned}$$

(42)

where

$$\begin{aligned} \eta= & {} \min \left\{ \frac{1-\alpha }{1-\theta },\, 1+\frac{\frac{\alpha }{2}\mu +\theta \mu -\frac{\beta }{\tau _1}}{\eta _x+\mu (1-\theta )},1+\frac{\frac{3\alpha \nu }{4}-4cL_h^2-\frac{\beta (1-\theta )}{2\tau _2}}{\frac{\eta _y}{2}+\frac{\beta (1-\theta )}{2\tau _2}},\,\right. \\&\quad \left. 1+\frac{\frac{2}{\gamma }-2c\rho ^2\left( \kappa +2(1-\theta )(1+\frac{1}{\delta })\right) -4c(\beta -\rho )^2}{\beta (1-\theta )},\,1+c\rho \sigma _{\min }(BB^\top )\right\} {>}1. \end{aligned}$$

We finish this section by making a few remarks.

Remark 4.1

We can always rescale A, B and b without essentially altering the linear constraints. Hence, the assumption $\max \{\Vert A\Vert _2,\Vert B\Vert _2\}\le 1$ can be made without losing generality. From (42), it is easy to see that when $P\succ 0$ and $Q\succ 0$, $(x^k,y^k)$ converges to $(x^*,y^*)$ R-linearly in expectation. In addition, note that

$$\begin{aligned}&\Vert \lambda ^{k+1}-\lambda ^*\Vert ^2-(1-\theta )\Vert \lambda ^k-\lambda ^*\Vert ^2+\frac{1}{\theta }\Vert \lambda ^{k+1}-\lambda ^k\Vert ^2= \theta \Vert \lambda ^{k+1}-\lambda ^*\Vert ^2\\&\qquad +\,2(1-\theta )\langle \lambda ^{k+1} -\lambda ^*,\lambda ^{k+1}-\lambda ^k\rangle + \left( \frac{1}{\theta }-1+\theta \right) \Vert \lambda ^{k+1}-\lambda ^k\Vert ^2\\&\quad \ge \left( \theta -\frac{(1-\theta )^2}{\frac{1}{\theta }-1+\theta }\right) \Vert \lambda ^{k+1}-\lambda ^*\Vert ^2 \\&\quad = \frac{\theta }{\frac{1}{\theta }-1+\theta }\Vert \lambda ^{k+1}-\lambda ^*\Vert ^2. \end{aligned}$$

Hence, (42) also implies an R-linear convergence of $\lambda ^k$ to $\lambda ^*$ in expectation.

Remark 4.2

We give examples of parameters that satisfy the conditions required in Theorem 4.3. First consider the case of $\theta =1$, i.e., all blocks are updated at each iteration. In this case, we can choose $\delta =0,\kappa =\frac{1}{2}$ to satisfy (38) and $\eta _x=\beta \Vert A\Vert _2^2+L_f$ to satisfy (39a) and let $\alpha =\frac{1}{2}$ and $\tau _1=\frac{\beta }{\mu }$ to ensure that (41a) holds. Finally, choose $\eta _y>\big (\beta +\frac{\beta ^2}{\mu }\big )\Vert B\Vert _2^2$ and c sufficiently small, and all other conditions in Theorem 4.3 are satisfied. Next consider the case of $\theta <1$. We can choose $\delta =\frac{\theta }{4(1-\theta )}$ and $\kappa =\frac{3}{\theta }+\frac{3\theta }{4}-2$ to satisfy (38), and let $\alpha =\frac{\theta }{2}$, $\tau _1=\frac{\beta }{\theta \mu }$, $\tau _2=\frac{2\beta (1-\theta )}{\nu }$, $\eta _x=\beta (1+(1-\theta )\tau _2)\Vert A\Vert _2^2+L_m$, and $\eta _y>\beta (1+\tau _1)\Vert B\Vert _2^2$. With such choices, all other conditions required in Theorem 4.3 hold when c is sufficiently small.

Remark 4.3

If there is only one x-block and there is no f function, then Algorithm 3 reduces to the so-called linearized ADMM. To show the linear convergence of the linearized ADMM, one scenario in [13, Theorem 3.1] assumes^{Footnote 2} the strong convexity of g and h, the smoothness of h, and the full row-rankness of B. In Theorem 4.3, we make the same assumptions, and so our result can be considered as a generalization.

5 Numerical experiments

The aim of this section is to test the practical performance of the proposed algorithms. We test Algorithm 2 on quadratic programming

$$\begin{aligned} \min _x F(x)=\frac{1}{2} x^\top Q x + c^\top x, \text{ s.t. } Ax=b,\, x\ge 0, \end{aligned}$$

(43)

and Algorithm 3 on the log-barrier approximation of linear programming

$$\begin{aligned} \min _{x,y} c^\top x - e^\top \log x - e^\top \log y, \text{ s.t. } Ax+y = b,\, x_i\le u_i, \forall i. \end{aligned}$$

(44)

Quadratic programming Two types of randomized implementations are considered: one with fixed parameters and the newly introduced one with adaptive parameters, which shall be called nonadaptive RPDC and adaptive RPDC respectively. Note that the former reduces to the method proposed in [15] when applied to (43). The purpose of the experiment is to test the effect of acceleration for the latter approach.

The data was generated randomly as follows. We let $Q=HDH^\top \in \mathbb {R}^{n\times n}$, where H is Gaussian randomly generated orthogonal matrix and D is a diagonal matrix with $d_{ii}=1+(i-1)\frac{L-1}{n-1},\,i=1,\ldots ,n$. Hence, the smallest and largest singular values of Q are 1 and L respectively, and the objective of (43) is strongly convex with modulus 1. The components of c follow standard Gaussian distribution, and those of b follow uniform distribution on [0, 1]. We let $A=[B, I]\in \mathbb {R}^{p\times n}$ to guarantee the existence of feasible solutions, where B was generated according to standard Gaussian distribution. In addition, we normalized A so that it has a unit spectral norm.

In the test, we fixed $n=2000, p=200$ and varied L among $\{10, 100, 1000\}$. For both nonadaptive and adaptive RPDC, we evenly partitioned x into 40 blocks, i.e., each block consists of 50 coordinates, and we set $m=40$, i.e., all blocks are updated at each iteration. For the adaptive RPDC, we set the values of its parameters according to (23) with $\rho =1$, and those for the nonadaptive RPDC were set based on Theorem 3.2 with $\rho =\beta ,\, \eta =100+\beta ,\,\forall k$ where $\beta $ varied among $\{1,10,100,1000\}$. Figures 1, 2 and 3 plot the objective values and feasibility violations by Algorithm 2 under these two different settings. From these results, we see that adaptive RPDC performed well for all three datasets with a single set of parameters while the performance of the nonadaptive one was severely affected by the penalty parameter.

Linear programming In this test, we apply Algorithm 3 to the problem (44), where we let $f(x)=c^\top x, g(x)=-e^\top \log x$ and $h(y)= -e^\top \log y$. The purpose of this experiment is to demonstrate the linear convergence of Algorithm 3.

We generated $A\in \mathbb {R}^{200\times 2000}$ and c according to the standard Gaussian distribution and b by the uniform distribution on $[\frac{1}{2},\frac{3}{2}]$. The upper bound was set to $u_i=10,\forall i$. We treated x as a single block and set the algorithm parameters to $\beta =0.1$, $\eta _x=\beta \Vert A\Vert _2^2$, and $\eta _y=\beta \big (1+\frac{2.001\beta }{3\mu }\big )$. This setting satisfies the conditions required in Theorem 4.3 if $\alpha $ is sufficiently close to 1. Note that g and h do not have uniform strong convexity constants but they are both strongly convex on a bounded set. Figure 4 shows the convergence behavior of Algorithm 3. From the figure, we can clearly see that the algorithm linearly converges to an optimal solution.

6 Conclusions

In this paper we propose an accelerated proximal Jacobian ADMM method and generalize it to an accelerated randomized primal–dual coordinate updating method for solving linearly constrained multi-block structured convex programs. We show that if the objective is strongly convex then the methods achieve $O(1/t^2)$ convergence rate where t is the total number of iterations. In addition, if one block variable is independent of others in the objective and its part of the objective function is smooth, we have modified the primal–dual coordinate updating method to achieve linear convergence. Numerical experiments on quadratic programming and log-barrier approximation of linear programming have shown the efficacy of the newly proposed methods.

Notes

In fact, [15] presents a more general algorithmic framework. It assumes two groups of variables, and each has multi-block structure. Our method in Algorithm 2 is an accelerated version of one special case of Algorithm 1 in [15].
Besides the scenario that g and h are strongly convex, h is smooth, and B is of full row-rank, [13, Theorem 3.1] also shows linear convergence of the linearized ADMM under three other different scenarios.

References

Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Article MathSciNet MATH Google Scholar
Boley, D.: Local linear convergence of the alternating direction method of multipliers on quadratic or linear programs. SIAM J. Optim. 23(4), 2183–2207 (2013)
Article MathSciNet MATH Google Scholar
Bredies, K., Sun, H.: Accelerated douglas-rachford methods for the solution of convex-concave saddle-point problems. arXiv preprint arXiv:1604.06282 (2016)
Cai, X., Han, D., Yuan, X.: The direct extension of admm for three-block separable convex minimization models is convergent when one function is strongly convex. Optim. Online (2014)
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)
Article MathSciNet MATH Google Scholar
Chen, C., He, B., Ye, Y., Yuan, X.: The direct extension of admm for multi-block convex minimization problems is not necessarily convergent. Math. Program. 155(1–2), 57–79 (2016)
Article MathSciNet MATH Google Scholar
Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM Rev. 43(1), 129–159 (2001)
Article MathSciNet MATH Google Scholar
Chen, Y., Lan, G., Ouyang, Y.: Optimal primal-dual methods for a class of saddle point problems. SIAM J. Optim. 24(4), 1779–1814 (2014)
Article MathSciNet MATH Google Scholar
Combettes, P.L., Pesquet, J.-C.: Stochastic quasi-fejér block-coordinate fixed point iterations with random sweeping. SIAM J. Optim. 25(2), 1221–1248 (2015)
Article MathSciNet MATH Google Scholar
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
MATH Google Scholar
Dang, C., Lan, G.: Randomized methods for saddle point computation. arXiv preprint arXiv:1409.8625 (2014)
Deng, W., Lai, M.-J., Peng, Z., Yin, W.: Parallel multi-block admm with $o (1/k)$ convergence. J. Sci. Comput. 71(2), 712–736 (2017)
Deng, W., Yin, W.: On the global and linear convergence of the generalized alternating direction method of multipliers. J. Sci. Comput. 66(3), 889–916 (2015)
Article MathSciNet MATH Google Scholar
Gabay, D., Mercier, B.: A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comput. Math. Appl. 2(1), 17–40 (1976)
Article MATH Google Scholar
Gao, X., Xu, Y., Zhang, S.: Randomized primal-dual proximal block coordinate updates. arXiv preprint arXiv:1605.05969 (2016)
Gao, X., Zhang, S.-Z.: First-order algorithms for convex optimization with nonseparable objective and coupled onstraints. J. Oper. Res. Soc. China 5(2), 131–159 (2017)
Glowinski, R., Marrocco, A.: Sur l’approximation, par eléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de dirichlet non linéaires. ESAIM. Math. Modell. Numer. Anal. 9(R2), 41–76 (1975)
MATH Google Scholar
Goldstein, T., O’Donoghue, B., Setzer, S., Baraniuk, R.: Fast alternating direction optimization methods. SIAM J. Imaging Sci. 7(3), 1588–1623 (2014)
Article MathSciNet MATH Google Scholar
He, B., Hou, L., Yuan, X.: On full Jacobian decomposition of the augmented Lagrangian method for separable convex programming. SIAM J. Optim. 25(4), 2274–2312 (2015)
Article MathSciNet MATH Google Scholar
He, B., Tao, M., Yuan, X.: Alternating direction method with gaussian back substitution for separable convex programming. SIAM J. Optim. 22(2), 313–340 (2012)
Article MathSciNet MATH Google Scholar
He, B., Yuan, X.: On the acceleration of augmented lagrangian method for linearly constrained optimization. Optim. Online (2010)
He, B., Yuan, X.: On the $O(1/n)$ convergence rate of the Douglas-Rachford alternating direction method. SIAM J. Numer. Anal. 50(2), 700–709 (2012)
Article MathSciNet MATH Google Scholar
James, G. M., Paulson, C., Rusmevichientong, P.: Penalized and constrained regression. Technical report (2013)
Li, H., Lin, Z.: Optimal nonergodic $ o (1/k) $ convergence rate: when linearized adm meets nesterov’s extrapolation. arXiv preprint arXiv:1608.06366 (2016)
Li, M., Sun, D., Toh, K.-C.: A convergent 3-block semi-proximal ADMM for convex minimization problems with one strongly convex block. Asia-Pac. J. Oper. Res. 32(04), 1550024 (2015)
Article MathSciNet MATH Google Scholar
Lin, T., Ma, S., Zhang, S.: On the global linear convergence of the admm with multiblock variables. SIAM J. Optim. 25(3), 1478–1497 (2015)
Article MathSciNet MATH Google Scholar
Lu, Z., Xiao, L.: On the complexity analysis of randomized block-coordinate descent methods. Math. Program. 152(1–2), 615–642 (2015)
Article MathSciNet MATH Google Scholar
Markowitz, H.: Portfolio selection. J. Finance 7(1), 77–91 (1952)
Google Scholar
Monteiro, R.D., Svaiter, B.F.: Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM J. Optim. 23(1), 475–507 (2013)
Article MathSciNet MATH Google Scholar
Nesterov, Y.: A method of solving a convex programming problem with convergence rate ${O}(1/k^2)$. Soviet Math. Doklady 27(2), 372–376 (1983)
MATH Google Scholar
Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
Article MathSciNet MATH Google Scholar
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
Article MathSciNet MATH Google Scholar
Ouyang, Y., Chen, Y., Lan, G., Pasiliao Jr., E.: An accelerated linearized alternating direction method of multipliers. SIAM J. Imaging Sci. 8(1), 644–681 (2015)
Article MathSciNet MATH Google Scholar
Peng, Z., Wu, T., Xu, Y., Yan, M., Yin, W.: Coordinate friendly structures, algorithms and applications. Ann. Math. Sci. Appl. 1(1), 57–119 (2016)
MATH Google Scholar
Peng, Z., Xu, Y., Yan, M., Yin, W.: Arock: an algorithmic framework for asynchronous parallel coordinate updates. SIAM J. Sci. Comput. 38(5), A2851–A2879 (2016)
Article MathSciNet MATH Google Scholar
Pesquet, J.-C., Repetti, A.: A class of randomized primal-dual algorithms for distributed optimization. arXiv preprint arXiv:1406.6404 (2014)
Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math. Program. 156(1–2), 433–484 (2016)
Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144(1–2), 1–38 (2014)
Article MathSciNet MATH Google Scholar
Xu, Y.: Hybrid jacobian and gauss-seidel proximal block coordinate update methods for linearly constrained convex programming. arXiv preprint arXiv:1608.03928 (2016)
Xu, Y.: Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. SIAM J. Optim. 27(3), 1459–1484 (2017)
Article MathSciNet MATH Google Scholar
Xu, Y.: Asynchronous parallel primal-dual block update methods. arXiv preprint arXiv:1705.06391 (2017)

Download references

Author information

Authors and Affiliations

Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY, USA
Yangyang Xu
Department of Industrial and Systems Engineering, University of Minnesota, Minneapolis, MN, USA
Shuzhong Zhang

Authors

Yangyang Xu
View author publications
You can also search for this author in PubMed Google Scholar
Shuzhong Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yangyang Xu.

Additional information

This work is partly supported by NSF Grant DMS-1719549 and CMMI-1462408.

Appendices

Appendix A: Technical proofs: Sect. 2

In this section, we give the detailed proofs of the lemmas and theorems in Sect. 2. The following lemma will be used a few times. Note that when $S=[M]$, the result is deterministic.

Lemma 7.1

Let S be a uniformly selected subset of [M] with cardinality m and $x^o$ be a vector independent of S. Suppose $x^+$ is a random vector dependent on S and its coordinates out of S are the same as $x^o$. Let $\beta \in \mathbb {R}$, $\lambda ^o$ and $r^o$ be vectors independent of S, and W a positive semidefinite $M\times M$ block diagonal matrix. If

$$\begin{aligned} \nabla _{S} f(x^o)+\tilde{\nabla }g_S(x_S^+)-A_S^\top (\lambda ^o- \beta r^o)+W_S(x_S^+-x_S^o)=0, \end{aligned}$$

then for any x, it holds that

$$\begin{aligned} \begin{aligned}&~ \mathbb {E}_S\left[ F(x^+)-F(x)+\frac{\mu }{2}\Vert x^+-x\Vert ^2-\left\langle A(x^+-x), \lambda ^o-\beta r^o\right\rangle \right] \\&\quad \le ~ (1- \theta )\left[ F(x^o)-F(x)+\frac{\mu }{2}\Vert x^o-x\Vert ^2-\big \langle A(x^o-x), \lambda ^o-\beta r^o\big \rangle \right] \\&\qquad ~-\frac{1}{2}\mathbb {E}_S\left[ \Vert x^+-x\Vert _W^2-\Vert x^o-x\Vert _W^2+\Vert x^+-x^o\Vert _{W-L_m I}^2\right] , \end{aligned} \end{aligned}$$

(45)

where $\theta =\frac{m}{M}$, $L_m$ is given in Assumption 4, and the expectation is taken on S.

Proof

For any x, we have

$$\begin{aligned} \left\langle x_S^+-x_S, \nabla _{S} f(x^o)+\tilde{\nabla }g_S(x_S^+)-A_S^\top (\lambda ^o- \beta r^o)+W_S(x_S^+-x_S^o)\right\rangle =0. \end{aligned}$$

We split the left hand side of the above equation into four terms and bound each of them as below. First, we have

$$\begin{aligned}&~\mathbb {E}_S\left\langle x^+_S-x_S, \nabla _{S} f(x^o)\right\rangle \\&\quad =~\mathbb {E}_S\left\langle x^+-x^o, \nabla f(x^o)\right\rangle + \mathbb {E}_S\left\langle x^o_S-x_S, \nabla _S f(x^o)\right\rangle \\&\quad \ge ~\mathbb {E}_S \left[ f(x^+)-f(x^o)-\frac{L_m}{2}\Vert x^+-x^o\Vert ^2\right] + \theta [f(x^o)-f(x)]\\&\quad =~ \mathbb {E}_S\left[ f(x^+)-f(x)-\frac{L_m}{2}\Vert x^+-x^o\Vert ^2\right] - (1- \theta )[f(x^o)-f(x)], \end{aligned}$$

(46)

where the first equality uses the fact $x_i^+=x_i^o,\,\forall i\not \in S$, and the inequality follows from the uniform distribution of S, the convexity of f, and also the inequality (18).

Secondly, it follows from the strong convexity of g that

$$\begin{aligned} \left\langle x^+_S-x_S, \tilde{\nabla } g_S(x_S^+)\right\rangle \ge g_S(x_S^+) - g_S(x_S) + \sum _{i\in S}\frac{\mu }{2}\Vert x_i^+-x_i\Vert ^2. \end{aligned}$$

(47)

Since $g_S(x_S^+) - g_S(x_S)=g(x^+) - g(x^o) + g_S(x_S^o)- g_S(x_S)$ and $\mathbb {E}_S[g_S(x_S^o)- g_S(x_S)]=\theta [g(x^o)-g(x)]$, we have

$$\begin{aligned} \mathbb {E}_S[g_S(x_S^+) - g_S(x_S)]=&~\mathbb {E}_S[g(x^+) - g(x^o)]+\theta [g(x^o)-g(x)] \\=&~\mathbb {E}_S[g(x^+)-g(x)]-(1-\theta ) [g(x^o)-g(x)]. \end{aligned}$$

(48)

Similarly, it holds $\mathbb {E}_S\sum _{i\in S}\frac{\mu }{2}\Vert x_i^+-x_i\Vert ^2=\frac{\mu }{2}\left( \mathbb {E}_S\Vert x^+-x\Vert ^2-(1-\theta )\Vert x^o-x\Vert ^2\right) .$ Hence, taking expectation on both sides of (47) yields

$$\begin{aligned}&~\mathbb {E}_S\left\langle x^+_S-x_S, \tilde{\nabla } g_S(x_S^+)\right\rangle \nonumber \\&\quad \ge ~\mathbb {E}_S \left[ g(x^+) - g(x)+\frac{\mu }{2}\Vert x^+-x\Vert ^2\right] \nonumber \\&\qquad -(1-\theta )\left[ g(x^o) - g(x)+\frac{\mu }{2}\Vert x^o-x\Vert ^2\right] . \end{aligned}$$

(49)

Thirdly, by essentially the same arguments on showing (48), we have

$$\begin{aligned} \mathbb {E}_S \left\langle x^+_S-x_S, -A_S^\top (\lambda ^o-\beta r^o)\right\rangle= & {} -\,\mathbb {E}_S \left\langle A(x^+-x), \lambda ^o-\beta r^o\right\rangle \nonumber \\&+\, (1-\theta ) \big \langle A(x^o-x), \lambda ^o-\beta r^o\big \rangle . \end{aligned}$$

(50)

Fourth, note $\left\langle x^+_S-x_S, W_S(x_S^+-x_S^o)\right\rangle =\left\langle x^+-x, W(x^+-x^o)\right\rangle $, and thus by (6),

$$\begin{aligned} \mathbb {E}_S\left\langle x^+_S-x_S, W_S(x_S^+-x_S^o)\right\rangle =\frac{1}{2}\mathbb {E}_S\left[ \Vert x^+-x\Vert _W^2-\Vert x^o-x\Vert _W^2+\Vert x^+-x^o\Vert _W^2\right] . \end{aligned}$$

(51)

The desired result is obtained by adding (46), (49), (50), and (51), and recalling $F=f+g$. $\square $

1.1 Proof of Lemma 2.1

From (7a), we have the optimality condition

$$\begin{aligned} \nabla f(x^k)-A^\top (\lambda ^k-\beta _kr^k)+\tilde{\nabla } g(x^{k+1})+P^k(x^{k+1}-x^k)=0. \end{aligned}$$

Hence, for any x such that $Ax=b$, it follows from the definition of $\Phi $ in (3) and Lemma 7.1 with $S=[M]$, $x^o=x^k$, $\lambda ^o=\lambda ^k$, $\beta =\beta _k$, $x^+=x^{k+1}$, and $W=P^k$ that

$$\begin{aligned} \Phi (x^{k+1},x,\lambda ) \le&~ \left\langle Ax^{k+1}-b,\lambda ^k-\beta _kr^k\right\rangle - \left\langle Ax^{k+1}-b, \lambda \right\rangle \\&-\frac{1}{2}\mathbb {E}_S\left[ \Vert x^{k+1}-x\Vert _{P^k+\mu I}^2-\Vert x^k-x\Vert _{P^k}^2+\Vert x^{k+1}-x^k\Vert _{P^k-L_f I}^2\right] .\nonumber \\ \end{aligned}$$

(52)

Using the fact $\lambda ^{k+1}=\lambda ^k-\rho _k(Ax^{k+1}-b)$, we have

$$\begin{aligned} \left\langle Ax^{k+1}-b,\lambda ^k-\lambda \right\rangle =&~\frac{1}{\rho _k}\left\langle \lambda ^k-\lambda ^{k+1},\lambda ^k-\lambda \right\rangle \\\overset{(6)}{=}&~\frac{1}{2\rho _k}\left[ \Vert \lambda -\lambda ^k\Vert ^2-\Vert \lambda -\lambda ^{k+1}\Vert ^2+\Vert \lambda ^k-\lambda ^{k+1}\Vert ^2\right] . \end{aligned}$$

(53)

In addition, we write $r^k=r^k-r^{k+1}+r^{k+1}=r^{k+1}-A(x^{k+1}-x^k)$ and have

$$\begin{aligned}&~\left\langle Ax^{k+1}-b,-\beta _k r^k\right\rangle \\&\quad =~-\beta _k\Vert r^{k+1}\Vert ^2+\beta _k\left\langle A(x^{k+1}-x), A(x^{k+1}-x^k)\right\rangle \\&\quad \overset{(6)}{=}~-\beta _k\Vert r^{k+1}\Vert ^2+\frac{\beta _k}{2}\left[ \Vert A(x^{k+1}-x)\Vert ^2-\Vert A(x^k-x)\Vert ^2+\Vert A(x^{k+1}-x^k)\Vert ^2\right] \nonumber \\ \end{aligned}$$

(54)

Substituting (53) and (54) into (52) gives the inequality in (8).

1.2 Proof of Theorem 2.2

First, we have

$$\begin{aligned}&\sum _{k=1}^t\frac{k+k_0+1}{2\rho _k}\left[ \Vert \lambda -\lambda ^k\Vert ^2-\Vert \lambda -\lambda ^{k+1}\Vert ^2\right] \nonumber \\&\quad = \frac{k_0+2}{2\rho _1}\Vert \lambda -\lambda ^1\Vert ^2-\frac{t+k_0+1}{2\rho _t}\Vert \lambda -\lambda ^{t+1}\Vert ^2\nonumber \\&\qquad +\sum _{k=2}^t \left( \frac{k+k_0+1}{2\rho _k}-\frac{k+k_0}{2\rho _{k-1}}\right) \Vert \lambda -\lambda ^k\Vert ^2 \nonumber \\&\quad \overset{(10)}{\le }\frac{k_0+2}{2\rho _1}\Vert \lambda -\lambda ^1\Vert ^2 . \end{aligned}$$

(55)

In addition,

$$\begin{aligned}&-\sum _{k=1}^t\frac{k+k_0+1}{2}\left( \Vert x^{k+1}-x\Vert _{P^k-\beta _k A^\top A+\mu I}^2-\Vert x^k-x\Vert _{P^k-\beta _k A^\top A}^2\right) \nonumber \\&\quad = \frac{k_0+2}{2}\Vert x^1-x\Vert _{P^1-\beta _1 A^\top A}^2-\frac{t+k_0+1}{2}\Vert x^{t+1}-x\Vert _{P^t-\beta _t A^\top A+\mu I}^2\nonumber \\&\qquad +\frac{1}{2}\sum _{k=2}^t\left( (k+k_0+1)\Vert x^k-x\Vert _{P^k-\beta _k A^\top A}^2-(k+k_0)\Vert x^k-x\Vert _{P^{k-1}-\beta _{k-1}A^\top A+\mu I}^2\right) \nonumber \\&\quad \overset{(11)}{\le }~\frac{k_0+2}{2}\Vert x^1-x\Vert _{P^1-\beta _1 A^\top A}^2-\frac{t+k_0+1}{2}\Vert x^{t+1}-x\Vert _{P^t-\beta _t A^\top A+\mu I}^2. \end{aligned}$$

(56)

Now multiplying $k+k_0+1$ to both sides of (8) and adding it over k, we obtain (12) by using (55) and (56), and noting $\Vert \lambda ^k-\lambda ^{k+1}\Vert ^2=\rho _k^2\Vert r^{k+1}\Vert ^2$ and $\Vert x^{k+1}-x^k\Vert _{P^k-\beta _k A^\top A - L_f I}^2 \ge 0$.

1.3 Proof of Theorem 2.3

From the choice of $k_0$ and the condition $P-\beta A^\top A \preceq \frac{\mu }{2} I$, it is not difficult to verify

$$\begin{aligned}&(k+k_0+1)\left[ kP-k\beta A^\top A+L_f I\right] \\&\quad \preceq (k+k_0)\left[ (k-1)P-(k-1)\beta A^\top A+(L_f+\mu )I\right] ,\,\forall k\ge 1. \end{aligned}$$

Hence, the condition in (11) holds. In addition, it is easy to see that all conditions in (9) and (10) also hold. Therefore, we have (12), which, by taking parameters in (14) and $x=x^*$, reduces to

$$\begin{aligned}&\sum _{k=1}^t(k+k_0+1)\Phi (x^{k+1},x^*,\lambda )+\sum _{k=1}^t\frac{k(k+k_0+1)}{2}\beta \Vert r^{k+1}\Vert ^2\nonumber \\&\quad +\frac{t+k_0+1}{2}\Vert x^{t+1}-x^*\Vert ^2_{t(P-\beta A^\top A)+(L_f+\mu ) I} \le \phi _1(x^*,\lambda ), \end{aligned}$$

(57)

where we have used the fact $\lambda ^1=0$.

Letting $\lambda =\lambda ^*$, we have from (5) and (57) that (by dropping nonnegative $\Phi (x^{k+1},x^*,\lambda ^*)$’s):

$$\begin{aligned} \frac{t(t+k_0+1)}{2}\beta \Vert r^{t+1}\Vert ^2+\frac{t+k_0+1}{2}\Vert x^{t+1}-x^*\Vert ^2_{t(P-\beta A^\top A)+(L_f+\mu ) I} \le \phi _1(x^*,\lambda ^*), \end{aligned}$$

which indicates (15). In addition, from the convexity of F and (57), we have that for any $\lambda $, it holds $\frac{t(t+2k_0+3)}{2}\Phi (\bar{x}^{t+1},x^*,\lambda )\le \phi _1(x^*,\lambda ),$ which together with Lemmas 1.2 and 1.3 implies (16).

Appendix B: Technical proofs: Sect. 3

In this section, we give the proofs of the lemmas and theorems in Sect. 3.

1.1 Proof of Lemma 3.1

From the update in (17a), we have the optimality condition:

$$\begin{aligned} \nabla _{S_k} f(x^k)-A_{S_k}^\top (\lambda ^k-\beta _k r^k)+\tilde{\nabla } g_{S_k}(x_{S_k}^{k+1})+\eta _k (x_{S_k}^{k+1}-x_{S_k}^k) = 0. \end{aligned}$$

(58)

It follows from the update rule of $\lambda $ that

$$\begin{aligned} -\langle Ax^{k+1}-b, \lambda ^k\rangle = - \langle Ax^{k+1}-b, \lambda ^{k+1}\rangle - \rho _k\Vert r^{k+1}\Vert ^2. \end{aligned}$$

Plugging (54) and the above equation into (45) with $S=S_k, \lambda ^o=\lambda ^k, \beta =\beta _k, x^o=x^k$, $x^+=x^{k+1}$, $W=\eta _k I$, and x satisfying $Ax=b$, we have the desired result by taking expectation and recalling the definition of $\Delta $ in (2) and $\Phi $ in (3).

1.2 Proof of Theorem 3.2

Let $\beta _k=\beta , \rho _k=\rho $ and $\eta _k=\eta $ in (19), and also note $\mu =0$ and $\eta \ge L_m+\beta \Vert A\Vert ^2$. We have

$$\begin{aligned}&~\mathbb {E}\left[ \Phi (x^{k+1},x,\lambda ^{k+1})+(\beta -\rho )\Vert r^{k+1}\Vert ^2\right] \\&\quad \le ~(1-\theta )\mathbb {E}\left[ \Phi (x^k,x,\lambda ^k)+\beta \Vert r^k\Vert ^2\right] \\&\qquad - \frac{1}{2}\mathbb {E}\left[ \Vert x^{k+1}-x\Vert ^2_{\eta I-\beta A^\top A}-\Vert x^{k}-x\Vert ^2_{\eta I-\beta A^\top A}\right] . \nonumber \end{aligned}$$

Summing the above inequality over $k=1$ through t and noting $\rho \le \theta \beta $ give

$$\begin{aligned}&~\mathbb {E}\left[ \Phi (x^{t+1},x,\lambda ^{t+1})+(\beta -\rho )\Vert r^{t+1}\Vert ^2\right] + \theta \sum _{k=1}^{t-1}\mathbb {E}\Phi (x^{k+1},x,\lambda ^{k+1}) \\&\quad \le ~(1-\theta )\mathbb {E}\left[ \Phi (x^1,x,\lambda ^1)+\beta \Vert r^1\Vert ^2\right] + \frac{1}{2}\Vert x^1-x\Vert ^2_{\eta I-\beta A^\top A}. \nonumber \end{aligned}$$

(59)

By the update of $\lambda $, it follows that

$$\begin{aligned}&\theta \sum _{k=1}^{t-1} \Phi (x^{k+1},x,\lambda ^{k+1})=~\theta \sum _{k=1}^{t-1} \left[ \Phi (x^{k+1},x,\lambda )+\frac{1}{\rho } \langle \lambda ^{k+1}-\lambda , \lambda ^{k+1}-\lambda ^k\rangle \right] \\&\quad =~\theta \sum _{k=1}^{t-1} \Phi (x^{k+1},x,\lambda )+\frac{\theta }{2\rho }\sum _{k=1}^{t-1} \left[ \Vert \lambda ^{k+1}-\lambda \Vert ^2-\Vert \lambda ^{k}-\lambda \Vert ^2+\Vert \lambda ^{k+1}-\lambda ^k\Vert ^2\right] \\&\quad =~\theta \sum _{k=1}^{t-1} \Phi (x^{k+1},x,\lambda )+\frac{\theta }{2\rho }\left[ \Vert \lambda ^{t}-\lambda \Vert ^2-\lambda ^1-\lambda \Vert ^2+\sum _{k=1}^{t-1}\Vert \lambda ^{k+1}-\lambda ^k\Vert ^2\right] \end{aligned}$$

(60)

and

$$\begin{aligned} \Phi (x^{t+1},x,\lambda ^{t+1}) =&~\Phi (x^{t+1},x,\lambda ) - \langle \lambda ^t - \lambda - \rho r^{t+1}, r^{t+1}\rangle \\=&~\Phi (x^{t+1},x,\lambda )- \langle \lambda ^t - \lambda , r^{t+1}\rangle +\rho \Vert r^{t+1}\Vert ^2. \end{aligned}$$

(61)

Since $\rho \le \theta \beta $, by Young’s inequality, it holds

$$\begin{aligned} \beta \Vert r^{t+1}\Vert ^2 - \langle \lambda ^t - \lambda , r^{t+1}\rangle + \frac{\theta }{2\rho }\Vert \lambda ^{t}-\lambda \Vert ^2 \ge 0. \end{aligned}$$

Then plugging (60) and (61) into (59), we have

$$\begin{aligned}&~\mathbb {E}\Phi (x^{t+1},x,\lambda ) +\theta \sum _{k=1}^{t-1} \mathbb {E}\Phi (x^{k+1},x,\lambda )\\&\quad \le ~(1-\theta )\mathbb {E}\left[ \Phi (x^1,x,\lambda ^1)+\beta \Vert r^1\Vert ^2\right] + \frac{1}{2}\Vert x^1-x\Vert ^2_{\eta I-\beta A^\top A} + \frac{\theta }{2\rho }\mathbb {E}\Vert \lambda ^1-\lambda \Vert ^2\\&\quad \le ~ \mathbb {E}\phi _2(x,\lambda ), \end{aligned}$$

(62)

where in the last inequality we have used $\lambda ^1=0$, $\theta >0$ and $\Vert r^1\Vert ^2=\Vert x^1-x\Vert ^2_{\beta A^\top A}$.

Therefore, from the convexity of F, it follows that $\mathbb {E}\Phi (\bar{x}^{t},x^*,\lambda ) \le \frac{1}{1+\theta (t-1)}\mathbb {E}\phi _2(x^*,\lambda ),\,\forall \lambda $, and we obtain the desired result from Lemmas 1.2 and 1.3.

1.3 Proof of Theorem 3.3

We first establish a few inequalities below.

Proposition 8.1

If (21e), (21f) and (21g) hold, then

$$\begin{aligned}&-\sum _{k=1}^t(k+k_0+1) \mathbb {E}\left[ \Delta _{\eta _k I-\beta _k A^\top A}(x^{k+1},x^k,x)-\frac{L_m}{2}\Vert x^{k+1}-x^k\Vert ^2\right] \\&\qquad -\frac{\mu (t+k_0+1)}{2}\mathbb {E}\Vert x^{t+1}-x\Vert ^2-\sum _{k=2}^t\frac{\mu \big (\theta (k+k_0+1)-1\big )}{2}\mathbb {E}\Vert x^k-x\Vert ^2\\&\quad \le \frac{\eta _1(k_0+2)}{2}\mathbb {E}\Vert x^1-x\Vert ^2-\frac{(t+k_0+1)}{2}\mathbb {E}\Vert x^{t+1}-x\Vert ^2_{(\mu +\eta _t)I-\beta _t A^\top A}. \end{aligned}$$

(63)

Proof

This inequality can be easily shown by noting that for any $1\le k\le t$, the weight matrix of $\frac{1}{2}\Vert x^{k+1}-x^k\Vert ^2$ is $ \beta _k(k+k_0+1)A^\top A-(k+k_0+1)(\eta _k-L_m)I$, which is negative semidefinite, and for any $2\le k\le t$, the weight matrix of $\frac{1}{2}\Vert x^{k}-x\Vert ^2$ is

$$\begin{aligned}&\big [\beta _{k-1}(k+k_0)-\beta _k(k+k_0+1)\big ]A^\top A\\&\quad +\left[ (k+k_0+1)\eta _k-(k+k_0)\eta _{k-1}-\mu \big (\theta (k+k_0+1)-1\big )\right] I, \end{aligned}$$

which is also negative semidefinite.$\square $

Proposition 8.2

If (21a), (21c) and (21d) hold, then

$$\begin{aligned}&-\frac{t+k_0+1}{\rho _t}\mathbb {E}\Delta (\lambda ^{t+1},\lambda ^t,\lambda )-\sum _{k=2}^t\frac{\theta (k+k_0+1)-1}{\rho _{k-1}}\mathbb {E}\Delta (\lambda ^{k},\lambda ^{k-1},\lambda )\\&\quad \le \frac{\theta (k_0+3)-1}{2\rho _1}\mathbb {E}\Vert \lambda ^1-\lambda \Vert ^2. \end{aligned}$$

(64)

Proof

On the left hand side of (64), the coefficient of each $\frac{1}{2}\Vert \lambda ^{k+1}-\lambda ^k\Vert ^2$ is negative. For $2\le k\le t-1$, the coefficient of $\frac{1}{2}\Vert \lambda ^k-\lambda \Vert ^2$ is $\frac{\theta (k+k_0+2)-1}{\rho _k}-\frac{\theta (k+k_0+1)-1}{\rho _{k-1}}$, which is nonpositive; the coefficient of $\frac{1}{2}\Vert \lambda ^t-\lambda \Vert ^2$ is $\frac{t+k_0+1}{\rho _t}-\frac{\theta (t+k_0+1)-1}{\rho _{t-1}}$, which is nonpositive; the coefficient of $\frac{1}{2}\Vert \lambda ^{t+1}-\lambda \Vert ^2$ is also nonpositive. Hence, dropping these nonpositive terms, we have the desired result.$\square $

Now we are ready to prove Theorem 3.3.

Proof of Theorem 3.3

Multiplying $k+k_0+1$ to both sides of (19), summing it up from $k=1$ through t, and moving the terms about $\Phi (x^k,x,\lambda ^k)+\frac{\mu }{2}\Vert x^k-x\Vert ^2$ and $\Vert r^k\Vert ^2$ to the left hand side for $2\le k\le t$ give

$$\begin{aligned}&(t+k_0+1) \mathbb {E}\left[ \Phi (x^{t+1},x,\lambda ^{t+1})+(\beta _t-\rho _t)\Vert r^{t+1}\Vert ^2+ \frac{\mu }{2}\Vert x^{t+1}-x\Vert ^2\right] \\&\qquad +\sum _{k=2}^t\big (\theta (k+k_0+1)-1\big )\mathbb {E}\left[ \Phi (x^k,x,\lambda ^k)+\frac{\mu }{2}\Vert x^k-x\Vert ^2\right] \\&\qquad +\sum _{k=2}^t\big ((\beta _{k-1}-\rho _{k-1})(k+k_0)-(1-\theta )(k+k_0+1)\beta _k\big )\mathbb {E}\Vert r^k\Vert ^2\\&\quad \le (1-\theta )(k_0+2)\mathbb {E}\left[ \Phi (x^1,x,\lambda ^1)+\beta _1\Vert r^1\Vert ^2+\frac{\mu }{2}\Vert x^1-x\Vert ^2\right] \\&\qquad -\sum _{k=1}^t(k+k_0+1) \mathbb {E}\left[ \Delta _{\eta _k I-\beta _k A^\top A}(x^{k+1},x^k,x)-\frac{L_m}{2}\Vert x^{k+1}-x^k\Vert ^2\right] . \nonumber \end{aligned}$$

(65)

Hence, from (21b) and (63), it follows that

$$\begin{aligned} \begin{aligned}&~(t+k_0+1) \mathbb {E}\Phi (x^{t+1},x,\lambda ^{t+1})+\sum _{k=2}^t\big (\theta (k+k_0+1)-1\big )\mathbb {E}\Phi (x^k,x,\lambda ^k)\\&\quad \le ~ (1-\theta )(k_0+2)\mathbb {E}\left[ \Phi (x^1,x,\lambda ^1)+\beta _1\Vert r^1\Vert ^2+\frac{\mu }{2}\Vert x^1-x\Vert ^2\right] \\&\qquad ~+\frac{\eta _1(k_0+2)}{2}\mathbb {E}\Vert x^1-x\Vert ^2-\frac{t+k_0+1}{2}\mathbb {E}\Vert x^{t+1}-x\Vert ^2_{(\mu +\eta _t)I-\beta _t A^\top A}. \end{aligned} \end{aligned}$$

(66)

In addition, from the update of $\lambda $ in (17c), we have

$$\begin{aligned} \langle \lambda ^{k+1}-\lambda , Ax^{k+1}-b\rangle =-\frac{1}{\rho _k}\langle \lambda ^{k+1}-\lambda ,\lambda ^{k+1}-\lambda ^k\rangle =-\frac{1}{\rho _k}\Delta (\lambda ^{k+1},\lambda ^k,\lambda ), \end{aligned}$$

(67)

and thus

$$\begin{aligned}&(t+k_0+1)\mathbb {E}\langle \lambda ^{t+1}-\lambda , Ax^{t+1}-b\rangle +\!\!\sum _{k=2}^t\big (\theta (k+k_0+1)-1\big )\mathbb {E}\langle \lambda ^k-\lambda , Ax^k-b\rangle \\&\quad =-\frac{t+k_0+1}{\rho _t}\mathbb {E}\Delta (\lambda ^{t+1},\lambda ^t,\lambda )-\sum _{k=2}^t\frac{\theta (k+k_0+1)-1}{\rho _{k-1}}\mathbb {E}\Delta (\lambda ^{k},\lambda ^{k-1},\lambda )\\&\quad \overset{(64)}{\le }\frac{\theta (k_0+3)-1}{2\rho _1}\mathbb {E}\Vert \lambda ^1-\lambda \Vert ^2. \end{aligned}$$

Since $\Phi (x^k,x,\lambda )=\Phi (x^k,x,\lambda ^k)+\langle \lambda ^k-\lambda , Ax^k-b\rangle ,$ we obtain the desired result by adding the above inequality to (66).$\square $

1.4 Proof of Proposition 3.4

Note that (24) implies $k_0\ge \frac{4}{\theta }$, and thus (21a) must hold. Also, it is easy to see that (21d) holds with equality from the second equation of (23b). Since $I\succeq \frac{A^\top A}{\Vert A\Vert _2^2}$, we can easily have (21f) by plugging in $\beta _k$ and $\eta _k$ defined in (23a) and (23c) respectively.

To verify (21c), we plug in $\rho _k$ defined in the first equation of (23b), and it is equivalent to requiring that for any $2\le k\le t-1$

$$\begin{aligned} \frac{\theta (k+k_0+1)-1}{\theta (k-1)+2+\theta }\ge & {} \frac{\theta (k+k_0+2)-1}{\theta k+2+\theta }\Longleftrightarrow 1+\frac{\theta (k_0+1)-3}{\theta k+2}\\\ge & {} 1+\frac{\theta (k_0+1)-3}{\theta k+2+\theta }. \end{aligned}$$

The inequality on the right hand side obviously holds, and thus we have (21c).

Plugging in the formula of $\beta _k$, (21e) is equivalent to

$$\begin{aligned} (\theta k+2+\theta )(k+k_0+1)\ge (\theta k+2)(k+k_0), \end{aligned}$$

which holds trivially, and thus (21e) follows.

With the given $\beta _k$ and $\rho _k$, (21b) becomes $\frac{6}{6-5\theta }(\theta k+2)(k+k_0)\ge (k+k_0+1)(\theta k+2+\theta ),\,\forall 2\le k\le t,$ which is equivalent to $\frac{6}{6-5\theta }\ge \frac{(k_0+3)(3\theta +2)}{(k_0+2)(2\theta +2)}$. Note that $\frac{k_0+3}{k_0+2}$ is decreasing with respect to $k_0\ge 0$ and also $\frac{6}{6-5\theta }\ge \frac{(\frac{3}{\theta }+3)(3\theta +2)}{(\frac{3}{\theta }+2)(2\theta +2)}$. Hence, (21b) is satisfied from the fact $k_0\ge \frac{4}{\theta }$.

Finally, we show (21g). Plugging in $\eta _k$, we have that (21g) becomes

$$\begin{aligned}&(k+k_0)\left( \frac{\mu }{2}\left( \theta k+2\right) +L_m\right) +\mu \big (\theta (k+k_0+1)-1\big )\\&\quad \ge (k+k_0+1)\left( \frac{\mu }{2}\left( \theta k+2+\theta \right) +L_m\right) ,\,\forall k\ge 2, \end{aligned}$$

which is equivalent to $k_0+1\ge \frac{4}{\theta }+\frac{2L_m}{\theta \mu }$. Hence, for $k_0$ given in (24), (21g) must hold. Therefore, we have verified all conditions in (21).

1.5 Proof of Theorem 3.5

From Proposition 3.4, we have the inequality in (22) that, as $\lambda ^1=0$, reduces to

$$\begin{aligned}&(t+k_0+1) \mathbb {E}\Phi (x^{t+1},x,\lambda )+\sum _{k=2}^t\big (\theta (k+k_0+1)-1\big )\mathbb {E}\Phi (x^k,x,\lambda )\nonumber \\&\quad \le \phi _3(x,\lambda )-\frac{t+k_0+1}{2}\mathbb {E}\Vert x^{t+1}-x\Vert _{(\mu +\eta _t) I-\beta _t A^\top A}^2. \end{aligned}$$

(68)

For $\rho \ge 1$, we have

$$\begin{aligned} (\mu +\eta _t) I-\beta _t A^\top A\succeq \left( \frac{(\rho -1)\mu }{2\rho }(\theta t + \theta + 2) + \mu +L_m\right) I. \end{aligned}$$

(69)

Letting $x=x^*$ and using the convexity of F, we have from (68) and the above inequality that

$$\begin{aligned} \mathbb {E}\left[ F(\bar{x}^{t+1})-F(x^*)-\big \langle \lambda ,A\bar{x}^{t+1}-b\big \rangle \right] \le \frac{1}{T}\mathbb {E}\phi _3(x^*,\lambda ),\,\forall \lambda , \end{aligned}$$

(70)

which together with Lemmas 1.2 and 1.3 with $\gamma =\max (2\Vert \lambda ^*\Vert , 1+\Vert \lambda ^*\Vert )$ indicates (25).

In addition, note

$$\begin{aligned} \Phi (x^{t+1},x^*,\lambda ^*)\ge \frac{\mu }{2}\Vert x^{t+1}-x^*\Vert ^2. \end{aligned}$$

Hence, letting $(x,\lambda )=(x^*,\lambda ^*)$ in (68) and using (5), we have from (69) that

$$\begin{aligned} \frac{t+k_0+1}{2}\left( \frac{(\rho -1)\mu }{2\rho }(\theta t + \theta + 2)+2\mu +L_m\right) \mathbb {E}\Vert x^{t+1}-x^*\Vert ^2 \le \phi _3(x^*,\lambda ^*), \end{aligned}$$

(71)

and the proof is completed.

Appendix C: Technical proofs: Sect. 4

In this section, we provide the proofs of the lemmas and theorems in Sect. 4.

1.1 Proof of Lemma 4.1

Note $r^{k+1}-r^k=A(x^{k+1}-x^k)+B(y^{k+1}-y^k)$. Hence by (6), we have

$$\begin{aligned} \begin{aligned} \left\langle A(x^{k+1}-x),-\beta r^k\right\rangle =&~-\beta \left\langle A(x^{k+1}-x), r^{k+1}\right\rangle +\beta \left\langle A(x^{k+1}-x), B(y^{k+1}-y^k)\right\rangle \\&~+\frac{\beta }{2}\left[ \Vert A(x^{k+1}-x)\Vert ^2-\Vert A(x^k-x)\Vert ^2+\Vert A(x^{k+1}-x^k)\Vert ^2\right] . \end{aligned} \end{aligned}$$

(72)

In addition, $\langle A(x^{k+1}-x), \lambda ^k\rangle = \langle A(x^{k+1}-x), \lambda ^{k+1}+\rho r^{k+1}\rangle $. Plugging this equation and (72) into (45) with $x^o=x^k, \lambda ^o=\lambda ^k, x^+=x^{k+1}, W=\eta _x I$ and taking expectation yield

$$\begin{aligned}&~ \mathbb {E}\left[ F(x^{k+1})-F(x)+\frac{\mu }{2}\Vert x^{k+1}-x\Vert ^2-\big \langle A(x^{k+1}-x), \lambda ^{k+1}\big \rangle \right. \nonumber \\&\qquad \left. +(\beta -\rho )\big \langle A(x^{k+1}-x), r^{k+1}\big \rangle \right] \nonumber \\&\qquad ~+\frac{1}{2}\mathbb {E}\left[ \Vert x^{k+1}-x\Vert _P^2-\Vert x^k-x\Vert _P^2+\Vert x^{k+1}-x^k\Vert ^2_{P-L_mI}\right] \nonumber \\&\quad \le ~ (1- \theta )\mathbb {E}\left[ F(x^k)-F(x)+\frac{\mu }{2}\Vert x^k-x\Vert ^2-\big \langle A(x^k-x), \lambda ^k-\beta r^k\big \rangle \right] \\&\qquad ~+\beta \mathbb {E}\left\langle A(x^{k+1}-x), B(y^{k+1}-y^k)\right\rangle ,\nonumber \end{aligned}$$

(73)

where $P=\eta _x I-\beta A^\top A$.

From (30), the optimality condition for $ \tilde{y}^{k+1}$ is

$$\begin{aligned} \nabla h( \tilde{y}^{k+1})- B^\top \lambda ^k+ \beta B^\top r^{k+\frac{1}{2}} +\eta _y (\tilde{y}^{k+1}-y^k)=0. \end{aligned}$$

(74)

Since ${\mathrm {Prob}}(y^{k+1}=\tilde{y}^{k+1})=\theta ,\, {\mathrm {Prob}}(y^{k+1}=y^k)=1-\theta ,$ we have

$$\begin{aligned}&\mathbb {E}\left\langle y^{k+1}- y, \nabla h( y^{k+1})- B^\top \lambda ^{k} + \beta B^\top r^{k+\frac{1}{2}}+\eta _y(y^{k+1}-y^k)\right\rangle \\&\quad =(1-\theta )\mathbb {E}\left\langle y^k- y, \nabla h( y^k)- B^\top \lambda ^k + \beta B^\top r^{k+\frac{1}{2}}\right\rangle , \end{aligned}$$

or equivalently,

$$\begin{aligned}&\mathbb {E}\left\langle y^{k+1}- y, \nabla h( y^{k+1})- B^\top \lambda ^{k+1} +( \beta -\rho ) B^\top r^{k+1}\right. \nonumber \\&\qquad \left. - \beta B^\top B(y^{k+1}-y^k)+\eta _y(y^{k+1}-y^k)\right\rangle \\&\quad =(1-\theta )\mathbb {E}\left\langle y^k- y, \nabla h( y^k)- B^\top \lambda ^k + \beta B^\top r^k\right\rangle \nonumber \\&\qquad +\,\beta (1-\theta ) \mathbb {E}\left\langle B(y^k- y),A(x^{k+1}-x^k)\right\rangle . \end{aligned}$$

(75)

Recall $Q=\eta _y I -\beta B^\top B$. We have

$$\begin{aligned}&\left\langle y^{k+1}- y, - \beta B^\top B(y^{k+1}-y^k)+\eta _y(y^{k+1}-y^k)\right\rangle \\&\quad&= \frac{1}{2}\left[ \Vert y^{k+1}- y\Vert _Q^2-\Vert y^k- y\Vert _Q^2+\Vert y^{k+1}- y^k\Vert _Q^2\right] . \end{aligned}$$

Therefore adding (75) to (73), noting $Ax+By=b$, and plugging (67) with $\rho _k=\rho $, we have the desired result.

1.2 Proof of Theorem 4.2

Before proving Theorem 4.2, we establish a few inequalities. First, using Young’s inequality, we have the following results.

Lemma 9.1

For any $\tau _1,\tau _2>0$, it holds that

$$\begin{aligned}&\langle A(x^{k+1}- x^*), B( y^{k+1}- y^k)\rangle \le \frac{1}{2\tau _1}\Vert A(x^{k+1}-x^*)\Vert ^2+\frac{\tau _1}{2}\Vert B(y^{k+1}-y^k)\Vert ^2, \end{aligned}$$

(76)

$$\begin{aligned}&\langle B(y^k- y^*),A(x^{k+1}-x^k)\rangle \le \frac{1}{2\tau _2}\Vert B(y^k- y^*)\Vert ^2+\frac{\tau _2}{2}\Vert A(x^{k+1}-x^k)\Vert ^2. \end{aligned}$$

(77)

In addition, we are able to bound the $\lambda $-term by y-term and the residual r. The proofs are given in Appendix C.4 and C.5.

Lemma 9.2

For any $\delta >0$, we have

$$\begin{aligned}&\mathbb {E}\Vert B^\top (\lambda ^{k+1}-\lambda ^*)\Vert ^2-(1-\theta )(1+\delta )\mathbb {E}\Vert B^\top (\lambda ^k-\lambda ^*)\Vert ^2\\&\quad \le 4\mathbb {E}\big [L_h^2\Vert y^{k+1}-y^*\Vert ^2+\Vert Q(y^{k+1}-y^k)\Vert ^2\big ]+2(\beta -\rho )^2\mathbb {E}\Vert B^\top r^{k+1}\Vert ^2\nonumber \\&\qquad +\,2\rho ^2(1-\theta )\left( 1+\frac{1}{\delta }\right) \mathbb {E}\big [\Vert B^\top r^{k+1}\Vert ^2+\Vert B^\top B(y^{k+1}-y^k)\Vert ^2\big ]. \end{aligned}$$

(78)

Lemma 9.3

Assume (38). Then

$$\begin{aligned}&\frac{\sigma _{\min }(BB^\top )}{2}\big [\Vert \lambda ^{k+1}-\lambda ^*\Vert ^2-(1-\theta )\Vert \lambda ^{k}-\lambda ^*\Vert ^2+\frac{1}{\theta }\Vert \lambda ^{k+1}-\lambda ^k\Vert ^2\big ]\\&\quad \le \Vert B^\top (\lambda ^{k+1}-\lambda ^*)\Vert ^2-(1-\theta )(1+\delta )\Vert B^\top (\lambda ^k-\lambda ^*)\Vert ^2+\kappa \Vert B^\top (\lambda ^{k+1}-\lambda ^k)\Vert ^2,\nonumber \\ \end{aligned}$$

(79)

where $\sigma _{\min }(BB^\top )$ denotes the smallest singular value of $BB^\top $.

Lemma 9.4

Let $c,\delta ,\tau _1,\tau _2$ and $\kappa $ be constants satisfying the conditions in Theorem 4.2. Then

$$\begin{aligned}&~\beta \mathbb {E}\big \langle A(x^{k+1}- x^*), B( y^{k+1}- y^k)\big \rangle +\beta (1 -\theta )\mathbb {E}\big \langle B(y^k- y^*),A(x^{k+1}-x^k)\big \rangle \\&\qquad ~+\frac{c}{2}\sigma _{\min }(BB^\top )\mathbb {E}\big [\Vert \lambda ^{k+1}-\lambda ^*\Vert ^2-(1-\theta )\Vert \lambda ^{k}-\lambda ^*\Vert ^2+\frac{1}{\theta }\Vert \lambda ^{k+1}-\lambda ^k\Vert ^2\big ]\\&\quad \le ~ \frac{1}{2}\mathbb {E}\Vert x^{k+1}- x^k\Vert _{P-L_mI}^2+\frac{\beta }{2\tau _1}\mathbb {E}\Vert A(x^{k+1}-x^*)\Vert ^2\\&\qquad ~+\frac{1}{2}\mathbb {E}\Vert y^{k+1}- y^k\Vert _Q^2+\frac{\beta (1-\theta )}{2\tau _2}\mathbb {E}\Vert B(y^k-y^*)\Vert ^2+4cL_h^2\mathbb {E}\Vert y^{k+1}-y^*\Vert ^2\\&\qquad ~+\left[ c\rho ^2\left( \kappa +2(1-\theta )\left( 1+\frac{1}{\delta }\right) \right) +2c(\beta -\rho )^2\right] \mathbb {E}\Vert B^\top r^{k+1}\Vert ^2.\nonumber \end{aligned}$$

(80)

Now we are ready to show Theorem 4.2.

Proof of Theorem 4.2

Letting $(x,y,\lambda )=(x^*,y^*,\lambda ^*)$ in (34), plugging (32) into it, and noting $Ax^*+By^*=b$, we have

$$\begin{aligned}&\mathbb {E}\Psi (z^{k+1},z^*)+( \beta -\rho )\mathbb {E}\Vert r^{k+1}\Vert ^2+\mathbb {E}\left[ \Delta _P(x^{k+1},x^k, x^*)-\frac{L_m}{2}\Vert x^{k+1}- x^k\Vert ^2\right] \nonumber \\&\qquad +\,\mathbb {E}\Delta _Q(y^{k+1},y^k, y^*)+\frac{\mu }{2}\mathbb {E}\Vert x^{k+1}-x^*\Vert ^2+\frac{1}{\rho }\mathbb {E}\Delta (\lambda ^{k+1},\lambda ^{k},\lambda ^*)\nonumber \\&\quad \le (1-\theta )\mathbb {E}\Psi (z^k,z^*)+\beta (1 -\theta )\mathbb {E}\Vert r^k\Vert ^2+\frac{1-\theta }{\rho }\mathbb {E}\Delta (\lambda ^{k},\lambda ^{k-1},\lambda ^*)\nonumber \\&\qquad +\,\frac{\mu (1-\theta )}{2}\mathbb {E}\Vert x^k-x^*\Vert ^2 +\, \beta \mathbb {E}\big \langle A(x^{k+1}- x^*), B( y^{k+1}- y^k)\big \rangle +\beta (1 -\theta )\nonumber \\&\qquad \mathbb {E}\big \langle B(y^k- y^*),A(x^{k+1}-x^k)\big \rangle , \nonumber \\ \end{aligned}$$

(81)

where $\Psi $ is defined in (36). Note

$$\begin{aligned}&~\frac{1}{\rho }\Delta (\lambda ^{k+1},\lambda ^{k},\lambda ^*)\\&\quad =~\frac{1}{2\rho }\big [\Vert \lambda ^{k+1}-\lambda ^*\Vert ^2-(1-\theta )\Vert \lambda ^{k}-\lambda ^*\Vert ^2+\frac{1}{\theta }\Vert \lambda ^{k+1}-\lambda ^k\Vert ^2\big ]\\&\qquad -\frac{\rho }{2}\left( \frac{1}{\theta }-1\right) \Vert r^{k+1}\Vert ^2 - \frac{\theta }{2\rho }\Vert \lambda ^{k}-\lambda ^*\Vert ^2, \end{aligned}$$

and

$$\begin{aligned}&~\frac{1-\theta }{\rho }\Delta (\lambda ^{k},\lambda ^{k-1},\lambda ^*)\\&\quad =~\frac{1}{2\rho }\big [\Vert \lambda ^{k}-\lambda ^*\Vert ^2-(1-\theta )\Vert \lambda ^{k-1}-\lambda ^*\Vert ^2 +\frac{1}{\theta }\Vert \lambda ^{k}-\lambda ^{k-1}\Vert ^2\big ]\\&\qquad -\frac{\rho }{2}\left( \frac{1}{\theta }-(1-\theta )\right) \Vert r^k\Vert ^2- \frac{\theta }{2\rho }\Vert \lambda ^{k}-\lambda ^*\Vert ^2. \end{aligned}$$

Adding (80) to (81) and plugging the above two equations yield

$$\begin{aligned}&\mathbb {E}\Psi (z^{k+1},z^*)+( \beta -\rho )\mathbb {E}\Vert r^{k+1}\Vert ^2+\mathbb {E}\left[ \Delta _P(x^{k+1},x^k, x^*)-\frac{L_m}{2}\Vert x^{k+1}- x^k\Vert ^2\right] \nonumber \\&\qquad +\mathbb {E}\Delta _Q(y^{k+1},y^k, y^*)+\frac{\mu }{2}\mathbb {E}\Vert x^{k+1}-x^*\Vert ^2-\frac{\rho }{2}\left( \frac{1}{\theta }-1\right) \mathbb {E}\Vert r^{k+1}\Vert ^2 - \frac{\theta }{2\rho }\mathbb {E}\Vert \lambda ^{k}-\lambda ^*\Vert ^2\nonumber \\&\qquad +\left( \frac{1}{2\rho }+\frac{c}{2}\sigma _{\min }(BB^\top )\right) \mathbb {E}\big [\Vert \lambda ^{k+1}-\lambda ^*\Vert ^2-(1-\theta )\Vert \lambda ^{k}-\lambda ^*\Vert ^2+\frac{1}{\theta }\Vert \lambda ^{k+1}-\lambda ^k\Vert ^2\big ]\nonumber \\&\quad \le (1-\theta )\mathbb {E}\Psi (z^k,z^*)+\beta (1 -\theta )\mathbb {E}\Vert r^k\Vert ^2 -\frac{\rho }{2}\left( \frac{1}{\theta }-(1-\theta )\right) \mathbb {E}\Vert r^k\Vert ^2- \frac{\theta }{2\rho }\mathbb {E}\Vert \lambda ^{k}-\lambda ^*\Vert ^2 \nonumber \\&\qquad +\frac{1}{2\rho }\mathbb {E}\big [\Vert \lambda ^{k}-\lambda ^*\Vert ^2-(1-\theta )\Vert \lambda ^{k-1}-\lambda ^*\Vert ^2 +\frac{1}{\theta }\Vert \lambda ^{k}-\lambda ^{k-1}\Vert ^2\big ]\nonumber \\&\qquad +\frac{\mu (1-\theta )}{2}\mathbb {E}\Vert x^k-x^*\Vert ^2+\frac{1}{2}\mathbb {E}\Vert x^{k+1}- x^k\Vert _{P-L_mI}^2+\frac{\beta }{2\tau _1}\mathbb {E}\Vert A(x^{k+1}-x^*)\Vert ^2\nonumber \\&\qquad +\frac{1}{2}\mathbb {E}\Vert y^{k+1}- y^k\Vert _Q^2+\frac{\beta (1-\theta )}{2\tau _2}\mathbb {E}\Vert B(y^k-y^*)\Vert ^2+4cL_h^2\mathbb {E}\Vert y^{k+1}-y^*\Vert ^2\\&\qquad +\left[ c\rho ^2\left( \kappa +2(1-\theta )\left( 1+\frac{1}{\delta }\right) \right) +2c(\beta -\rho )^2\right] \mathbb {E}\Vert B^\top r^{k+1}\Vert ^2. \nonumber \end{aligned}$$

Using the definition in (2) to expand $\Delta _P(x^{k+1},x^k, x^*)$ and $\Delta _Q(y^{k+1},y^k, y^*)$ in the above inequality, and then rearranging terms, we have

$$\begin{aligned}&\mathbb {E}\Psi (z^{k+1},z^*)+\left( ( \beta -\rho )-\frac{\rho }{2}\left( \frac{1}{\theta }-1\right) \right) \mathbb {E}\Vert r^{k+1}\Vert ^2\\&\qquad -\left[ c\rho ^2\left( \kappa +2(1-\theta )\left( 1+\frac{1}{\delta }\right) \right) +2c(\beta -\rho )^2\right] \mathbb {E}\Vert B^\top r^{k+1}\Vert ^2\\&\qquad +\,\mathbb {E}\left[ \frac{1}{2}\Vert x^{k+1}-x^*\Vert _P^2+\frac{\mu }{2}\Vert x^{k+1}-x^*\Vert ^2-\frac{\beta }{2\tau _1}\Vert A(x^{k+1}-x^*)\Vert ^2\right] \nonumber \\&\qquad +\,\mathbb {E}\left[ \frac{1}{2}\Vert y^{k+1}-y^*\Vert _Q^2-4cL_h^2\Vert y^{k+1}-y^*\Vert ^2\right] \nonumber \\&\qquad +\,\left( \frac{1}{2\rho }+\frac{c}{2}\sigma _{\min }(BB^\top )\right) \mathbb {E}\left[ \Vert \lambda ^{k+1}-\lambda ^*\Vert ^2-(1-\theta )\Vert \lambda ^{k}-\lambda ^*\Vert ^2+\frac{1}{\theta }\Vert \lambda ^{k+1}-\lambda ^k\Vert ^2\right] \nonumber \\&\quad \le (1-\theta )\mathbb {E}\Psi (z^k,z^*)+\beta (1 -\theta )\mathbb {E}\Vert r^k\Vert ^2 -\frac{\rho }{2}\left( \frac{1}{\theta }-(1-\theta )\right) \mathbb {E}\Vert r^k\Vert ^2 + \frac{1}{2}\mathbb {E}\Vert x^{k}-x^*\Vert _P^2 \nonumber \\&\qquad +\frac{\mu (1-\theta )}{2}\mathbb {E}\Vert x^k-x^*\Vert ^2+\frac{1}{2}\mathbb {E}\Vert y^{k}- y^*\Vert _Q^2+\frac{\beta (1-\theta )}{2\tau _2}\mathbb {E}\Vert B(y^k-y^*)\Vert ^2\nonumber \\&\qquad +\frac{1}{2\rho }\mathbb {E}\left[ \Vert \lambda ^{k}-\lambda ^*\Vert ^2-(1-\theta )\Vert \lambda ^{k-1}-\lambda ^*\Vert ^2 +\frac{1}{\theta }\Vert \lambda ^{k}-\lambda ^{k-1}\Vert ^2\right] . \end{aligned}$$

(82)

Since $\rho = \theta \beta $, it holds

$$\begin{aligned} ( \beta -\rho )-\frac{\rho }{2}\left( \frac{1}{\theta }-1\right) = \frac{\beta -\rho }{2}, \quad \beta (1 -\theta )-\frac{\rho }{2}\left( \frac{1}{\theta }-(1-\theta )\right) \le \frac{\beta (1-\theta )}{2}, \end{aligned}$$

and thus the inequality (82) implies

$$\begin{aligned}&\mathbb {E}\Psi (z^{k+1},z^*)+\frac{ \beta -\rho }{2}\mathbb {E}\Vert r^{k+1}\Vert ^2 \nonumber \\&\qquad -\left[ c\rho ^2\left( \kappa +2(1-\theta )\left( 1+\frac{1}{\delta }\right) \right) +2c(\beta -\rho )^2\right] \mathbb {E}\Vert B^\top r^{k+1}\Vert ^2\\&\qquad +\,\mathbb {E}\left[ \frac{1}{2}\Vert x^{k+1}-x^*\Vert _P^2+\frac{\mu }{2}\Vert x^{k+1}-x^*\Vert ^2-\frac{\beta }{2\tau _1}\Vert A(x^{k+1}-x^*)\Vert ^2\right] \nonumber \\&\qquad +\,\mathbb {E}\left[ \frac{1}{2}\Vert y^{k+1}-y^*\Vert _Q^2-4cL_h^2\Vert y^{k+1}-y^*\Vert ^2\right] \nonumber \\&\qquad +\left( \frac{1}{2\rho }+\frac{c}{2}\sigma _{\min }(BB^\top )\right) \mathbb {E}\big [\Vert \lambda ^{k+1}-\lambda ^*\Vert ^2-(1-\theta )\Vert \lambda ^{k}-\lambda ^*\Vert ^2+\frac{1}{\theta }\Vert \lambda ^{k+1}-\lambda ^k\Vert ^2\big ]\nonumber \\&\quad \le \psi (z^k,z^*; P,Q,\beta ,\rho ,c,\tau _2), \end{aligned}$$

(83)

where $\psi $ is defined in (37).

From (33), it follows that

$$\begin{aligned} (1-\alpha )\Psi (z^{k+1},z^*)+\frac{\alpha \mu }{2}\Vert x^{k+1}-x^*\Vert ^2+\alpha \nu \Vert y^{k+1}-y^*\Vert ^2\le \Psi (z^{k+1},z^*) . \end{aligned}$$

(84)

In addition, note that

$$\begin{aligned} \Vert r^{k+1}\Vert ^2= & {} \Vert Ax^{k+1}+By^{k+1}-(Ax^*+By^*)\Vert ^2\\\le & {} 2\Vert A\Vert _2^2\Vert x^{k+1}-x^*\Vert ^2+2\Vert B\Vert _2^2\Vert y^{k+1}-y^*\Vert ^2\\\le & {} \gamma \left( \frac{\alpha \mu }{4}\Vert x^{k+1}-x^*\Vert ^2+\frac{\alpha \nu }{4}\Vert y^{k+1}-y^*\Vert ^2\right) , \end{aligned}$$

and thus

$$\begin{aligned} \frac{1}{\gamma }\Vert r^{k+1}\Vert ^2\le \frac{\alpha \mu }{4}\Vert x^{k+1}-x^*\Vert ^2+\frac{\alpha \nu }{4}\Vert y^{k+1}-y^*\Vert ^2. \end{aligned}$$

(85)

Adding (84) and (85) to (83) gives the desired result.$\square $

1.3 Proof of Theorem 4.3

From $0<\alpha <\theta $, the full row-rankness of B, and the conditions in (41), it is easy to see that $\eta >1$. Next we find lower bounds of the terms on the left hand of (40). Since $\eta \le \frac{1-\alpha }{1-\theta }$, we have

$$\begin{aligned} \eta (1-\theta )\Psi (z^{k+1},z^*)\le (1-\alpha )\Psi (z^{k+1},z^*). \end{aligned}$$

(86)

Note $\Vert A\Vert _2\le 1$ and

$$\begin{aligned} \left( \frac{\alpha \mu }{2}+\mu -\frac{\beta }{\tau _1}\right) I\succeq & {} \frac{\frac{\alpha \mu }{2}+\theta \mu -\frac{\beta }{\tau _1}}{\eta _x+\mu (1-\theta )}(\eta _x I-\beta A^\top A)\\&+ \frac{\frac{\alpha \mu }{2}+\theta \mu -\frac{\beta }{\tau _1}}{\eta _x+\mu (1-\theta )}\mu (1-\theta )I +\mu (1-\theta )I. \end{aligned}$$

Hence, from $\eta \le 1+\frac{\frac{\alpha \mu }{2}+\theta \mu -\frac{\beta }{\tau _1}}{\eta _x+\mu (1-\theta )}$ and $P=\eta _x I-\beta A^\top A$, it follows that

$$\begin{aligned} \eta \Vert x^{k+1}- x^*\Vert ^2_{P+\mu (1-\theta )I} \le \Vert x^{k+1}- x^*\Vert ^2_{P+(\frac{\alpha \mu }{2}+\mu )I-\frac{\beta }{\tau _1}A^\top A}. \end{aligned}$$

(87)

Similarly, since

$$\begin{aligned} \left( \frac{3\alpha \nu }{2}-8c L_h^2\right) I\succeq & {} \frac{\frac{3\alpha \nu }{2}-8c L_h^2-\frac{\beta (1-\theta )}{\tau _2}}{\eta _y+\frac{\beta (1-\theta )}{\tau _2}}(\eta _y I-\beta B^\top B) \\&+\frac{\frac{3\alpha \nu }{2}-8c L_h^2-\frac{\beta (1-\theta )}{\tau _2}}{\eta _y+\frac{\beta (1-\theta )}{\tau _2}}\frac{\beta (1-\theta )}{\tau _2}I+\frac{\beta (1-\theta )}{\tau _2}I, \end{aligned}$$

$Q=\eta _y I-\beta B^\top B$, and $B^\top B \preceq I$, we have

$$\begin{aligned} \eta \Vert y^{k+1}- y^*\Vert ^2_{Q+\frac{\beta (1 -\theta )}{\tau _2}B^\top B} \le \Vert y^{k+1}- y^*\Vert ^2_{Q+(\frac{3\alpha \nu }{2}-8cL_h^2)I}. \end{aligned}$$

(88)

For the r-term, we note from the definition of $\eta $ that

$$\begin{aligned} \eta \frac{ \beta (1-\theta )}{2}\le \left( \frac{ \beta (1-\theta )}{2}+\frac{1}{\gamma }\right) -\left( c\rho ^2\left( \kappa +2(1-\theta )\left( 1+\frac{1}{\delta }\right) \right) +2c(\beta -\rho )^2\right) . \end{aligned}$$

In addition, since $\Vert B\Vert _2\le 1$, it holds $\Vert B^\top r^{k+1}\Vert \le \Vert r^{k+1}\Vert $, and thus

$$\begin{aligned} \eta \frac{ \beta (1-\theta )}{2}\Vert r^{k+1}\Vert ^2\le & {} \left( \frac{ \beta (1-\theta )}{2}+\frac{1}{\gamma }\right) \Vert r^{k+1}\Vert ^2\nonumber \\&-\left( c\rho ^2\left( \kappa +2(1-\theta )\left( 1+\frac{1}{\delta }\right) \right) +2c(\beta -\rho )^2\right) \Vert B^\top r^{k+1}\Vert ^2.\nonumber \\ \end{aligned}$$

(89)

Finally, it is obvious to have

$$\begin{aligned} \begin{aligned}&~ \frac{\eta }{2\rho }\left[ \Vert \lambda ^{k+1}-\lambda ^*\Vert ^2-(1-\theta )\Vert \lambda ^{k}-\lambda ^*\Vert ^2+\frac{1}{\theta }\Vert \lambda ^{k+1}-\lambda ^k\Vert ^2\right] \\&\quad \le ~\left( \frac{1}{2\rho }+\frac{c}{2}\sigma _{\min }(BB^\top )\right) \left[ \Vert \lambda ^{k+1}-\lambda ^*\Vert ^2-(1-\theta )\Vert \lambda ^{k}-\lambda ^*\Vert ^2+\frac{1}{\theta }\Vert \lambda ^{k+1}-\lambda ^k\Vert ^2\right] . \end{aligned} \end{aligned}$$

(90)

Therefore, we obtain (42) by the definition of $\psi $ and adding (86) through (90).

1.4 Proof of Lemma 9.2

Let $ \tilde{\lambda }^{k+1}=\lambda ^k-\rho (Ax^{k+1}+B\tilde{y}^{k+1}-b). $ Then from the update of y, we have

$$\begin{aligned} \begin{aligned}&~\mathbb {E}\Vert B^\top (\lambda ^{k+1}-\lambda ^*)\Vert ^2\\&\quad =~\theta \mathbb {E}\Vert B^\top (\tilde{\lambda }^{k+1}-\lambda ^*)\Vert ^2+(1-\theta )\mathbb {E}\Vert B^\top (\lambda ^k-\lambda ^*-\rho (Ax^{k+1}+By^k-b))\Vert ^2. \end{aligned} \end{aligned}$$

(91)

Below we bound the two terms on the right hand side of (91). First, the definition of $\tilde{\lambda }^{k+1}$ together with (74) implies

$$\begin{aligned} B^\top \tilde{\lambda }^{k+1}=\nabla h(\tilde{y}^{k+1})+Q(\tilde{y}^{k+1}-y^k)+(\beta -\rho )B^\top (Ax^{k+1}+B\tilde{y}^{k+1}-b). \end{aligned}$$

(92)

Hence, by the Young’s inequality and the condition in (32b), we have

$$\begin{aligned} \begin{aligned}&~\theta \mathbb {E}\Vert B^\top (\tilde{\lambda }^{k+1}-\lambda ^*)\Vert ^2\\&\quad \le ~2\theta \mathbb {E}\Vert \nabla h(\tilde{y}^{k+1})-\nabla h(y^*)+Q(\tilde{y}^{k+1}-y^k)\Vert ^2\\&\qquad +2\theta (\beta -\rho )^2\mathbb {E}\Vert B^\top (Ax^{k+1}+B\tilde{y}^{k+1}-b)\Vert ^2. \end{aligned} \end{aligned}$$

(93)

Since ${\mathrm {Prob}}(y^{k+1}=\tilde{y}^{k+1})=\theta $ and ${\mathrm {Prob}}(y^{k+1}=y^k)=1-\theta $, it follows that

$$\begin{aligned} \begin{aligned}&~\mathbb {E}\Vert \nabla h(y^{k+1})-\nabla h(y^*)+Q(y^{k+1}-y^k)\Vert ^2\\&\quad =~\theta \mathbb {E}\Vert \nabla h(\tilde{y}^{k+1}){-}\nabla h(y^*)+Q(\tilde{y}^{k+1}{-}y^k)\Vert ^2+(1-\theta )\mathbb {E}\Vert \nabla h(y^k)-\nabla h(y^*)\Vert ^2, \end{aligned} \end{aligned}$$

and thus

$$\begin{aligned}&\theta \mathbb {E}\Vert \nabla h(\tilde{y}^{k+1})-\nabla h(y^*)+Q(\tilde{y}^{k+1}-y^k)\Vert ^2\\&\quad \le \mathbb {E}\Vert \nabla h(y^{k+1})-\nabla h(y^*)+Q(y^{k+1}-y^k)\Vert ^2. \end{aligned}$$

Similarly,

$$\begin{aligned} \theta (\beta -\rho )^2\mathbb {E}\Vert B^\top (Ax^{k+1}+B\tilde{y}^{k+1}-b)\Vert ^2\le (\beta -\rho )^2\mathbb {E}\Vert B^\top (Ax^{k+1}+B y^{k+1}-b)\Vert ^2. \end{aligned}$$

Plugging the above two equations into (93) and applying the Young’s inequality and also the Lipschitz continuity of $\nabla h$ give

$$\begin{aligned} \theta \mathbb {E}\Vert B^\top (\tilde{\lambda }^{k+1}-\lambda ^*)\Vert ^2 \le 4\mathbb {E}\big [L_h^2\Vert y^{k+1}-y^*\Vert ^2+\Vert Q(y^{k+1}-y^k)\Vert ^2\big ]+2(\beta -\rho )^2\mathbb {E}\Vert B^\top r^{k+1}\Vert ^2. \end{aligned}$$

(94)

In addition, from the Young’s inequality, it follows for any $\delta >0$ that

$$\begin{aligned}&\Vert B^\top (\lambda ^k-\lambda ^*-\rho (Ax^{k+1}+By^k-b))\Vert ^2\\&\quad \le (1+\delta )\Vert B^\top (\lambda ^k-\lambda ^*)\Vert ^2+\rho ^2\left( 1+\frac{1}{\delta }\right) \Vert B^\top (Ax^{k+1}+By^k-b)\Vert ^2. \end{aligned}$$

Note $\Vert B^\top (Ax^{k+1}+By^k-b)\Vert ^2\le 2\Vert B^\top r^{k+1}\Vert ^2+2\Vert B^\top B(y^{k+1}-y^k)\Vert ^2$. Therefore, plugging (94) and the above two inequalites into (91), we complete the proof.

1.5 Proof of Lemma 9.3

It is straightforward to verify

$$\begin{aligned}&~\Vert B^\top (\lambda ^{k+1}-\lambda ^*)\Vert ^2-(1-\theta )(1+\delta )\Vert B^\top (\lambda ^k-\lambda ^*)\Vert ^2+\kappa \Vert B^\top (\lambda ^{k+1}-\lambda ^k)\Vert ^2\\&\quad =~\left[ \begin{array}{c}\lambda ^{k+1}-\lambda ^*\\ \lambda ^{k+1}-\lambda ^k \end{array}\right] ^\top \left[ \begin{array}{cc}(1-(1-\theta )(1+\delta )) &{}\quad (1-\theta )(1+\delta )\\ (1-\theta )(1+\delta )&{}\quad (\kappa -(1-\theta )(1+\delta ))\end{array}\right] \otimes BB^\top \left[ \begin{array}{c}(\lambda ^{k+1}-\lambda ^*)\\ (\lambda ^{k+1}-\lambda ^k) \end{array}\right] , \end{aligned}$$

and

$$\begin{aligned}&~\left[ \begin{array}{c}\lambda ^{k+1}-\lambda ^*\\ \lambda ^{k+1}-\lambda ^k \end{array}\right] ^\top \left[ \begin{array}{cc}\theta &{}\quad (1-\theta )\\ (1-\theta ) &{}\quad (\frac{1}{\theta }-(1-\theta ))\end{array}\right] \otimes I\left[ \begin{array}{c}\lambda ^{k+1}-\lambda ^*\\ \lambda ^{k+1}-\lambda ^k \end{array}\right] \\&\quad = ~\left[ \Vert \lambda ^{k+1}-\lambda ^*\Vert ^2-(1-\theta )\Vert \lambda ^{k}-\lambda ^*\Vert ^2+\frac{1}{\theta }\Vert \lambda ^{k+1}-\lambda ^k\Vert ^2\right] . \end{aligned}$$

Hence, we have the desired result from (38) and the inequality $U\otimes V\succeq \sigma _{\min }(V) U\otimes I$ for any PSD matrices U and V.

1.6 Proof of Lemma 9.4

From (39a) and (39b), we have

$$\begin{aligned} \beta (1 -\theta )\frac{\tau _2}{2}\Vert A(x^{k+1}-x^k)\Vert ^2\le \frac{1}{2}\Vert x^{k+1}- x^k\Vert _{P-L_mI}^2, \end{aligned}$$

and

$$\begin{aligned}&4c\Vert Q(y^{k+1}-y^k)\Vert ^2+2c\rho ^2(1-\theta )\left( 1+\frac{1}{\delta }\right) \Vert B^\top B(y^{k+1}-y^k)\Vert ^2\\&\qquad +\frac{\beta \tau _1}{2}\Vert B(y^{k+1}-y^k)\Vert ^2\nonumber \\&\quad \le \frac{1}{2}\Vert y^{k+1}- y^k\Vert _Q^2. \end{aligned}$$

The desired result is then obtained by adding the above two inequalities together with $\beta $ times of (76), $\beta (1-\theta )$ times of (77), c times of both (78) and (79), and also noting $\lambda ^{k+1}-\lambda ^k=-\rho r^{k+1}$.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Xu, Y., Zhang, S. Accelerated primal–dual proximal block coordinate updating methods for constrained convex optimization. Comput Optim Appl 70, 91–128 (2018). https://doi.org/10.1007/s10589-017-9972-z

Download citation

Received: 07 April 2017
Published: 28 November 2017
Issue Date: May 2018
DOI: https://doi.org/10.1007/s10589-017-9972-z

Keywords

Mathematics Subject Classification

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Accelerated primal–dual proximal block coordinate updating methods for constrained convex optimization

Abstract

Similar content being viewed by others

Randomized Primal–Dual Proximal Block Coordinate Updates

Block coordinate proximal gradient methods with variable Bregman functions for nonsmooth separable optimization

Efficiency of Coordinate Descent Methods for Structured Nonconvex Optimization

1 Introduction

1.1 Related methods

1.1.1 Randomized coordinate descent methods

1.1.2 Primal–dual coordinate update methods

1.1.3 Accelerated primal–dual methods

1.2 Contributions of this paper

1.3 Nomenclature and basic facts

Lemma 1.1

Lemma 1.2

Proof

Lemma 1.3

2 Accelerated proximal Jacobian ADMM

2.1 Technical assumptions

Assumption 1

Assumption 2

Assumption 3

2.2 Convergence results

Lemma 2.1

Theorem 2.2

Theorem 2.3

3 Accelerating randomized primal–dual block coordinate updates

3.1 The algorithm

3.2 Convergence results

Assumption 4

Lemma 3.1

Theorem 3.2

Theorem 3.3

Proposition 3.4

Theorem 3.5

4 Linearly convergent primal–dual method

4.1 Technical assumptions

Assumption 5

Assumption 6

4.2 Convergence analysis

Lemma 4.1

Theorem 4.2

Theorem 4.3

Remark 4.1

Remark 4.2

Remark 4.3

5 Numerical experiments

6 Conclusions

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Appendices

Appendix A: Technical proofs: Sect. 2

Lemma 7.1

Proof

1.1 Proof of Lemma 2.1

1.2 Proof of Theorem 2.2

1.3 Proof of Theorem 2.3

Appendix B: Technical proofs: Sect. 3

1.1 Proof of Lemma 3.1

1.2 Proof of Theorem 3.2

1.3 Proof of Theorem 3.3

Proposition 8.1

Proof

Proposition 8.2

Proof

Proof of Theorem 3.3

1.4 Proof of Proposition 3.4

1.5 Proof of Theorem 3.5

Appendix C: Technical proofs: Sect. 4

1.1 Proof of Lemma 4.1

1.2 Proof of Theorem 4.2

Lemma 9.1

Lemma 9.2

Lemma 9.3

Lemma 9.4

Proof of Theorem 4.2