1 Introduction

The problem considered in this paper is the following convex minimization model

$$\begin{aligned} \min \{\theta (x) \; | \; Ax =b, \, x\in {{\mathcal {X}}} \}, \end{aligned}$$
(1.1)

where \(A\in \mathfrak {R}^{m\times n}\), \(b\in \mathfrak {R}^m\), \({{\mathcal {X}}}\subset \mathfrak {R}^{n}\) is a closed convex set, and \(\theta : \mathfrak {R}^{n} \rightarrow \mathfrak {R}\) is a generic convex but not necessarily smooth function. Throughout, the solution set of (1.1) is assumed to be nonempty. Problems of this form arise in many applications such as compressed sensing, statistical estimation, machine learning, and image processing; see, e.g., [2, 8, 17, 33] for some examples.

We first reformulate (1.1) in variational inequality form, which allows us to take a unified look at the algorithms for solving this problem. The Lagrangian function of the optimization problem (1.1) is

$$\begin{aligned} L(x,\lambda )= \theta (x) - \lambda ^T(Ax-b), \end{aligned}$$

which is defined on \({{\mathcal {X}}}\times \mathfrak {R}^m\). Let \((x^*,\lambda ^*)\) be a saddle point of the Lagrange function. Then we have

$$\begin{aligned} L(x^*,\lambda ) \le L(x^*,\lambda ^*) \le L(x,\lambda ^*), \quad \forall \, \lambda \in \mathfrak {R}^m, \; \forall \, x\in {{\mathcal {X}}}. \end{aligned}$$

By rearranging the above saddle-point inequalities, we obtain the following optimality conditions:

$$\begin{aligned} \left\{ \begin{array}{lll} x^*\in {{\mathcal {X}}}, &{}\quad \theta (x) - \theta (x^*) + (x-x^*)^T(- A^T\lambda ^*) \ge 0, &{}\quad \forall \, x\in {{\mathcal {X}}},\\ \lambda ^*\in \mathfrak {R}^m, &{}\quad (\lambda -\lambda ^*)^T(Ax^*-b)\ge 0, &{}\quad \forall \; \lambda \in \mathfrak {R}^m. \end{array} \right. \end{aligned}$$
(1.2)

More compactly, the inequalities (1.2) can be characterized as a monotone variational inequality (VI):

$$\begin{aligned} \hbox {VI}(\Omega ,F,\theta )\quad w^*\in \Omega , \quad \theta (x) -\theta (x^*) + (w-w^*)^T F(w^*) \ge 0, \quad \forall \, w\in \Omega ,\nonumber \\ \end{aligned}$$
(1.3a)

where

$$\begin{aligned} w= \left( \begin{array}{c} x\\ \lambda \end{array} \right) , \quad F(w) =\left( \begin{array}{c} - A^T\lambda \\ Ax-b \end{array} \right) \quad \hbox {and} \quad \Omega = \mathcal{X} \times \mathfrak {R}^m. \end{aligned}$$
(1.3b)

It is easily verified that \( (w_1-w_2)^T\big (F(w_1)-F(w_2)\big )\ge 0\) for any \(w_1, w_2\in \Omega \), thus the operator \(F(\cdot )\) defined in (1.3b) is monotone. Since the solution set of (1.1) is assumed to be nonempty, the solution set of VI (1.3), denoted by \(\Omega ^*\), is also nonempty.
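In fact, the monotonicity inequality above holds with equality, since the linear part of the affine mapping \(F(\cdot )\) is skew-symmetric: for any \(w_1, w_2\in \Omega \),

$$\begin{aligned} (w_1-w_2)^T\big (F(w_1)-F(w_2)\big ) = -(x_1-x_2)^TA^T(\lambda _1-\lambda _2) + (\lambda _1-\lambda _2)^TA(x_1-x_2) = 0. \end{aligned}$$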

As we have mentioned, our analysis will be conducted in the variational inequality context, so the VI reformulation (1.3) serves as the starting point for further analysis. The algorithms used to solve the model (1.1) amount to finding a solution point of VI (1.3). We now briefly revisit some relevant algorithms in VI form.

The classical proximal point algorithm (PPA) [29, 31] is a popular method to solve the VI. Starting from an initial point \(w^0\), the PPA iterates as follows

$$\begin{aligned}&w^{k+1} \in \Omega , \quad \theta (x) -\theta (x^{k+1}) + (w- w^{k+1})^T \{ F(w^{k+1}) \nonumber \\&\quad +\, r(w^{k+1}-w^k)\} \ge 0, \quad \forall \; w\in \Omega , \end{aligned}$$
(1.4)

where \(r> 0\) is a regularization parameter. The PPA is fundamental, both theoretically and algorithmically, in the area of optimization. Indeed, the augmented Lagrangian method (ALM) [27, 30] can be interpreted as applying the PPA to the dual of (1.1) [31]. We now verify this interpretation by revisiting the ALM (1.5) in the VI context. For (1.1), the ALM reads as

$$\begin{aligned} x^{k+1} = \mathrm{argmin}\left\{ \theta (x) - (\lambda ^k)^T(Ax -b) + \frac{\beta }{2}\Vert Ax-b\Vert ^2 \; \Big | \; x\in {{\mathcal {X}}} \right\} , \end{aligned}$$

(1.5a)

$$\begin{aligned} \lambda ^{k+1} = \lambda ^k - \beta (Ax^{k+1}-b), \end{aligned}$$

(1.5b)

where \(\lambda \) is the Lagrange multiplier and \(\beta >0\) is a penalty parameter for the linear constraints. Note that the first-order optimality condition of the x-subproblem (1.5a) is

$$\begin{aligned} x^{k+1}\in {{\mathcal {X}}}, \quad \theta (x)-\theta (x^{k+1})+(x-x^{k+1})^T\big \{-A^T\big [\lambda ^k-\beta (Ax^{k+1}-b)\big ]\big \}\ge 0,\quad \forall \, x\in {{\mathcal {X}}}. \end{aligned}$$

Substituting (1.5b) into the above inequality yields

$$\begin{aligned} \theta (x)-\theta (x^{k+1})+(x-x^{k+1})^T\{-A^T\lambda ^{k+1} \}\ge 0,\ \forall x\in {\mathcal {X}}. \end{aligned}$$
(1.6)

Next, (1.5b) can be written as

$$\begin{aligned} (\lambda -\lambda ^{k+1})^T\{Ax^{k+1}-b+\frac{1}{\beta }(\lambda ^{k+1}-\lambda ^k)\}\ge 0, \ \forall \lambda \in \mathfrak {R}^m. \end{aligned}$$
(1.7)

Combining (1.6) and (1.7), we have

$$\begin{aligned}&\theta (x) - \theta ({x}^{k+1}) + \left( \begin{array}{c} x- {x}^{k+1} \\ \lambda - \lambda ^{k+1} \end{array}\right) ^T \left\{ \left( \begin{array}{c} - A^T\lambda ^{k+1} \\ Ax^{k+1} -b \end{array}\right) \right. \nonumber \\&\left. + \left( \begin{array}{c} 0 \\ \frac{1}{\beta }(\lambda ^{k+1} -\lambda ^k) \end{array}\right) \right\} \ge 0, \quad \forall \; (x,\lambda )\in \Omega . \end{aligned}$$
(1.8)

By using the notations in (1.3), the compact form is

$$\begin{aligned}&w^{k+1}\in \Omega , \quad \theta (x) -\theta (x^{k+1}) + (w-w^{k+1})^T \{ F(w^{k+1}) \nonumber \\&\quad + Q(w^{k+1}-w^k)\} \ge 0, \quad \forall w\in \Omega , \end{aligned}$$
(1.9a)

with

$$\begin{aligned} Q= \left( \begin{array}{cc} 0&{}\quad \\ &{}\quad \frac{1}{\beta }I_m \end{array} \right) . \end{aligned}$$
(1.9b)

Compared with the PPA scheme (1.4), it is easy to see that in (1.9) only the dual variable is proximally regularized, with coefficient \(\frac{1}{\beta }\). Thus, the ALM scheme (1.5) is obtained by applying the PPA to the dual of (1.1).

ALM is a benchmark method that has been successfully used in many important areas (e.g., linear inverse problems in image processing [1, 10]). At each iteration, the computational effort of ALM is dominated by evaluating the operator \((A^TA+\frac{1}{\beta }\partial \theta )^{-1}(\cdot )\) over \({\mathcal {X}}\). When \(A^TA\) is not an identity matrix, or is even large and dense, this evaluation usually admits no easy or closed-form solution.
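To see this cost concretely, consider the illustrative (and deliberately simple) case \(\theta (x)=\frac{1}{2}\Vert x-c\Vert ^2\) with \({{\mathcal {X}}}=\mathfrak {R}^n\), which is an assumption made only for this sketch: even then, every ALM x-update amounts to solving a linear system whose coefficient matrix involves \(A^TA\); for nonsmooth \(\theta \) such as the \(\ell _1\) norm, the subproblem generally requires an inner iterative solver.

```python
import numpy as np

def alm_x_step(A, b, c, lam, beta):
    """One ALM x-update (1.5a) for the illustrative choice theta(x) = 0.5*||x - c||^2, X = R^n.

    Minimizing 0.5*||x - c||^2 - lam^T(Ax - b) + (beta/2)*||Ax - b||^2
    amounts to solving (I + beta*A^T A) x = c + A^T(lam + beta*b).
    """
    n = A.shape[1]
    lhs = np.eye(n) + beta * (A.T @ A)
    rhs = c + A.T @ (lam + beta * b)
    return np.linalg.solve(lhs, rhs)
```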

On the other hand, many concrete applications arising in sparse or low-rank optimization can be modeled by (1.1), and in these applications the convex objective possesses special structure that can be exploited. An important structure often encountered in practice is that the proximal mapping of the objective \(\theta (\cdot )\), defined by

$$\begin{aligned} \text{ Prox }_{\tau \theta }(a) :=\mathrm{argmin}_x \left\{ \theta (x)+\frac{1}{2\tau }\Vert x-a\Vert ^2 \right\} , \end{aligned}$$
(1.10)

with parameter \(\tau >0\), can be computed easily. Here “easy” means that a closed-form solution exists or that efficient solvers are available. Examples of such structure arise in linear and quadratic programming, basis pursuit, nuclear norm minimization, and model fitting problems, among others. Let us briefly review two typical algorithms developed to exploit this structure.
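Before turning to these algorithms, the following NumPy sketch (an illustration, not the implementation used later in Sect. 6) shows two such “easy” proximal mappings: soft-thresholding for \(\theta (x)=\Vert x\Vert _1\) and singular-value thresholding for \(\theta (X)=\Vert X\Vert _*\).

```python
import numpy as np

def prox_l1(a, tau):
    """Prox of tau*||x||_1: componentwise soft-thresholding (closed form)."""
    return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

def prox_nuclear(A, tau):
    """Prox of tau*||X||_*: soft-threshold the singular values (one SVD)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```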

The first is the primal-dual hybrid gradient (PDHG) algorithm, which was proposed for total variation image restoration problems in [39] and subsequently improved by Chambolle and Pock in [5, 6]. For solving (1.1), the PDHG iterates as

$$\begin{aligned} x^{k+1} = \mathrm{argmin}\left\{ \theta (x) - x^TA^T\lambda ^k + \frac{r}{2}\Vert x- x^k\Vert ^2 \; \Big | \; x\in {{\mathcal {X}}} \right\} , \end{aligned}$$

(1.11a)

$$\begin{aligned} \lambda ^{k+1} = \lambda ^k - \frac{1}{s}\big [ A(2 x^{k+1} - x^k) -b \big ], \end{aligned}$$

(1.11b)

where r and s are required to satisfy \(r>0,s>0, rs>\Vert A^TA\Vert \).
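For illustration, a minimal NumPy sketch of the scheme (1.11) could look as follows; it assumes \({{\mathcal {X}}}=\mathfrak {R}^n\), so that the x-subproblem reduces to the proximal mapping (1.10) with \(\tau =1/r\). The oracle prox_theta and the fixed iteration count are placeholders; this is not the code used in Sect. 6.

```python
import numpy as np

def pdhg(prox_theta, A, b, r, s, x0, lam0, iters=500):
    """Sketch of the PDHG scheme (1.11) for problem (1.1), assuming X = R^n.

    prox_theta(a, tau) returns argmin_x { theta(x) + ||x - a||^2 / (2*tau) }.
    The step parameters must satisfy r > 0, s > 0 and r*s > ||A^T A||.
    """
    x, lam = x0.copy(), lam0.copy()
    for _ in range(iters):
        x_old = x
        # primal step (1.11a): proximal step at x^k + (1/r) A^T lam^k
        x = prox_theta(x + A.T @ lam / r, 1.0 / r)
        # dual step (1.11b): ascent step using the extrapolated point 2x^{k+1} - x^k
        lam = lam - (A @ (2 * x - x_old) - b) / s
    return x, lam
```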

Similar to the derivation of ALM, by using the notations in (1.3), the VI form of PDHG is

$$\begin{aligned} w^{k+1}\in \Omega , \quad \theta (x) -\theta (x^{k+1}) + (w-w^{k+1})^T \{ F(w^{k+1}) + Q(w^{k+1}-w^k)\} \ge 0, \quad \forall \, w\in \Omega , \end{aligned}$$

(1.12a)

where

$$\begin{aligned} Q= \left( \begin{array}{cc} r I_n &{}\quad A^T\\ A &{}\quad sI_m \end{array} \right) . \end{aligned}$$
(1.12b)

Since \(rs>\Vert A^TA\Vert \), Q is symmetric positive definite and can be used to define the Q-norm \(\Vert w\Vert _Q:=\sqrt{w^TQw}\). With the VI characterization (1.12), we can now easily derive the contraction property of the PDHG, which we briefly discuss below.

Setting \(w=w^*\) in (1.12), we have

$$\begin{aligned} (w^{k+1}-w^*)^T Q(w^k-w^{k+1}) \ge \theta (x^{k+1}) -\theta (x^*) + (w^{k+1} -w^*)^T F(w^{k+1}). \end{aligned}$$

Recalling that \(F(\cdot )\) is monotone (see (1.3b)), we have

$$\begin{aligned}&\theta (x^{k+1}) -\theta (x^*) + (w^{k+1} -w^*)^T F(w^{k+1}) \ge \theta (x^{k+1}) -\theta (x^*) \\&\quad + (w^{k+1} -w^*)^T F(w^*) \ge 0. \end{aligned}$$

Then, we obtain

$$\begin{aligned} (w^{k+1}-w^*)^T Q(w^k-w^{k+1}) \ge 0. \end{aligned}$$

Using the above inequality, we get

$$\begin{aligned}&\Vert w^k-w^*\Vert _Q^2 = \Vert (w^{k+1}- w^*) +(w^k-w^{k+1})\Vert _Q^2 \\&\quad \ge \Vert w^{k+1}-w^*\Vert _Q^2 + \Vert w^k-w^{k+1}\Vert _Q^2, \end{aligned}$$

and thus the sequence generated by PDHG (1.12) obeys

$$\begin{aligned} \Vert w^{k+1} -w^*\Vert _Q^2 \le \Vert w^k - w^*\Vert _Q^2 - \Vert w^k - w^{k+1} \Vert _Q^2 . \end{aligned}$$
(1.13)

The above inequality implies that the sequence \(\{ w^k\}\) is strictly contractive with respect to the solution set in the Q-norm. Recall that the proximal parameter in the original PPA (1.4) is a scalar r, which means that all the coordinates of w are proximally regularized with equal weights. For the VI form of the PDHG (1.12), in contrast, the proximal parameter is a positive definite matrix Q, and the coordinates of w are proximally regularized with different weights. With this judicious choice of Q, the resulting subproblems are simple under the assumption (1.10), and the convergence is also guaranteed, since the distance to the solution point decreases at each step (see (1.13)). Further, we observe that the scheme (1.12) shares the same structure as the PPA (1.4); in this sense, the PDHG is essentially a customized application of the PPA. This VI revisit of the PDHG and its PPA interpretation were first shown in [24] and used in [7] to simplify the convergence analysis.

Since the PDHG (1.11) has an algorithmic structure analogous to that of the generic PPA, the existing relaxation strategies for the PPA in the literature can naturally be used for these customized PPA algorithms. One popular relaxation scheme is to combine a computationally trivial relaxation step with the PPA [19]. Specifically, let \(\tilde{w}^k = (\tilde{x}^k, \tilde{\lambda }^k)\) be the output of the customized PPA. The new iterate \(w^{k+1}\) generated by the relaxation step in [19, 20] is given by

$$\begin{aligned} \tilde{w}^k\in \Omega , \quad \theta (x) -\theta (\tilde{x}^k) + (w-\tilde{w}^k)^T \{ F(\tilde{w}^k) + Q(\tilde{w}^k-w^k)\} \ge 0, \quad \forall \, w\in \Omega , \end{aligned}$$

(1.14a)

$$\begin{aligned} w^{k+1} = w^k - \gamma (w^k-\tilde{w}^k). \end{aligned}$$

(1.14b)

As demonstrated in many cases, the relaxation step (1.14b) with \(\gamma \in (1 ,2)\) can speed up the convergence of the customized PPA; see, e.g., [23, 24]. However, this relaxation strategy has some issues that are easily overlooked: the point \(w^{k+1}\) is a combination of \(w^k\) and \(\tilde{w}^k\), so it may lie outside the set \(\Omega \) or lose the structure that \(\tilde{w}^k\) possesses. To address these limitations, one can use the relaxation step (1.14b) within the iteration loop and, once the specified stopping criterion is met, perform one extra iteration without the relaxation step.

On the other hand, observing that each customized PPA possesses a special structure and has popular applications, it is interesting to ask whether there exists another simple relaxation strategy that is based on these structures and also yields good improvements. In this paper, we treat these customized PPA algorithms uniformly via a mixed variational inequality approach and propose a new relaxation strategy for them. We now specify our relaxed customized PPA as follows:

$$\begin{aligned} \tilde{w}^k\in \Omega , \quad \theta (x) -\theta (\tilde{x}^k) + (w-\tilde{w}^k)^T \{ F(\tilde{w}^k) + Q(\tilde{w}^k-w^k)\} \ge 0, \quad \forall \, w\in \Omega , \end{aligned}$$

(1.15a)

$$\begin{aligned} w^{k+1} = w^k - M(w^k-\tilde{w}^k), \end{aligned}$$

(1.15b)

where Q is the customized proximal matrix of the underlying algorithm and the correction matrix M keeps the primal variable unchanged (\(x^{k+1}=\tilde{x}^k\)) while relaxing only the dual variable with a factor \(\gamma \in (0,2)\); the concrete choices of Q and M are specified in Sects. 2 and 3.

The paper is organized as follows. In Sect. 2, we introduce our relaxation steps for some customized PPA algorithms. In Sect. 3, we discuss the relaxation strategy for the ADMM. We then analyze the convergence of these methods in Sect. 4, and further establish their convergence rate in both the non-ergodic and ergodic sense in Sect. 5. After that, we present some numerical results to demonstrate the improved performance of the relaxation strategy in Sect. 6. Finally, some conclusions are made in Sect. 7.

2 Relaxed customized PPA

In this section, we show how to relax the customized PPA by correcting the dual variable. To make our presentation clearer, the output variable \(w^{k+1}\) of the customized PPA is relabeled as \(\tilde{w}^k\) below.

Note that the relaxation falls into the category of prediction-correction methods. We use the prototype algorithm framework proposed in [22] to unify our analysis. This algorithm framework is also used in [26] for analyzing the PDHG.

(Prediction step) With a given \(w^k\), find \(\tilde{w}^k\in \Omega \) such that

$$\begin{aligned} \theta (x) -\theta (\tilde{x}^k) + (w-\tilde{w}^k)^T F(\tilde{w}^k) \ge (w-\tilde{w}^k)^T Q(w^k-\tilde{w}^k), \quad \forall \, w\in \Omega . \end{aligned}$$

(2.1a)

(Correction step) Generate the new iterate by

$$\begin{aligned} w^{k+1} = w^k - M(w^k-\tilde{w}^k). \end{aligned}$$

(2.1b)

Then, the convergence of the prototype algorithm (2.1) can be guaranteed if the following conditions are fulfilled.

$$\begin{aligned} H:= QM^{-1} \ \hbox {is symmetric positive definite}, \end{aligned}$$

(2.2a)

$$\begin{aligned} G:= Q^T+Q-M^THM \ \hbox {is positive definite}. \end{aligned}$$

(2.2b)

In the following, we show that each customized PPA equipped with its correction step falls into the prototype algorithm (2.1), and we then demonstrate that the conditions specified in (2.2) are fulfilled.

2.1 Relaxed PDHG

Recall that the PDHG in variational form is (1.15a) with metric

$$\begin{aligned} Q= \left( \begin{array}{cc} r I_n &{}\quad A^T\\ A &{}\quad sI_m \end{array} \right) . \end{aligned}$$
(2.3)

Then we judiciously correct the dual variable via

$$\begin{aligned} \lambda ^{k+1}=\lambda ^k-\frac{\gamma -1}{s} A(x^k-\tilde{x}^k)-\gamma (\lambda ^k-\tilde{\lambda }^k), \end{aligned}$$
(2.4)

where the relaxation factor \(\gamma \in (0,2)\). Since we always have \(x^{k+1}=\tilde{x}^k\), using the notations in (1.3), we get the correction step as

$$\begin{aligned} w^{k+1} =w^k - M (w^k-\tilde{w}^k), \end{aligned}$$
(2.5a)

where

$$\begin{aligned} M = \left( \begin{array}{cc} I_n &{}\quad 0 \\ \frac{\gamma -1}{s}A &{}\quad \gamma I_m \end{array} \right) . \end{aligned}$$
(2.5b)

Now, we check the positive definiteness of H and G to verify whether the conditions in (2.2) are fulfilled. First, we need to deduce the expressions of H and G. Recalling M (see (2.5b)), we have

$$\begin{aligned} M^{-1}=\left( \begin{array}{cc} I_n &{}\quad 0 \\ -\frac{\gamma -1}{s\gamma }A &{}\quad \frac{1}{\gamma } I_m \end{array} \right) . \end{aligned}$$

Then

$$\begin{aligned} \begin{aligned} H&=QM^{-1}=\left( \begin{array}{cc} rI_n &{}\quad A^T \\ A &{}\quad sI_m \end{array} \right) \left( \begin{array}{cc} I_n &{}\quad 0 \\ -\frac{\gamma -1}{s\gamma }A &{}\quad \frac{1}{\gamma } I_m \end{array} \right) \\&=\left( \begin{array}{cc} rI_n-\frac{\gamma -1}{s\gamma }A^TA &{}\quad \frac{1}{\gamma }A^T \\ \frac{1}{\gamma }A &{}\quad \frac{s}{\gamma } I_m \end{array} \right) . \end{aligned} \end{aligned}$$
(2.6)

According to the Schur complement, we have

$$\begin{aligned} H=\left( \begin{array}{cc} I &{}\quad \frac{1}{s}A^T \\ 0 &{}\quad I \end{array} \right) \left( \begin{array}{cc} rI -\frac{ 1}{s }A^TA &{}\quad 0 \\ 0 &{}\quad \frac{s}{\gamma } I \end{array} \right) \left( \begin{array}{cc} I &{}\quad 0 \\ \frac{1}{s}A &{}\quad I \end{array} \right) . \end{aligned}$$

Since \(rs>\Vert A^TA\Vert \), H is positive definite.

Since \(Q^T=Q\) and \(M^THM\) is symmetric, the matrix G is symmetric. We next deduce its expression:

$$\begin{aligned} \begin{aligned} G&=Q^T+Q-M^THM=2Q-M^TQ\\&=2\cdot \left( \begin{array}{cc} rI_n &{}\quad A^T \\ A &{}\quad sI_m \end{array} \right) -\left( \begin{array}{cc} I_n &{}\quad \frac{\gamma -1}{s}A^T \\ 0 &{}\quad \gamma I_m \end{array} \right) \left( \begin{array}{cc} rI_n &{}\quad A^T \\ A &{}\quad sI_m \end{array} \right) \\&=\left( \begin{array}{cc} 2rI_n &{}\quad 2A^T \\ 2 A &{}\quad 2sI_m \end{array} \right) -\left( \begin{array}{cc} rI_n+\frac{\gamma -1}{s}A^TA &{}\quad \gamma A^T \\ \gamma A &{}\quad \gamma s I_m \end{array} \right) \\&=\left( \begin{array}{cc} rI_n-\frac{\gamma -1}{s}A^TA &{}\quad (2-\gamma )A^T \\ (2-\gamma )A &{}\quad (2-\gamma )sI_m \end{array} \right) . \end{aligned} \end{aligned}$$
(2.7)

According to the Schur complement,

$$\begin{aligned} G=\left( \begin{array}{cc} I &{}\quad \frac{1}{s}A^T \\ 0 &{}\quad I \end{array} \right) \left( \begin{array}{cc} rI -\frac{ 1}{s }A^TA &{}\quad 0 \\ 0 &{}\quad (2-\gamma )s I \end{array} \right) \left( \begin{array}{cc} I &{}\quad 0 \\ \frac{1}{s}A &{}\quad I \end{array} \right) . \end{aligned}$$
(2.8)

Since \(rI\succ \frac{1}{s}A^TA\) and \(0< \gamma <2\), G is positive definite. The relaxed PDHG scheme is therefore a special case of the prototype algorithm (2.1).
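As a quick numerical sanity check of these claims (an illustration only; the matrix sizes and parameter values below are arbitrary), one can build Q and M from (2.3) and (2.5b) for a random instance and inspect the smallest eigenvalues of \(H=QM^{-1}\) and \(G=Q^T+Q-M^THM\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 5
A = rng.standard_normal((m, n))
s = 2.0
r = 1.01 * np.linalg.norm(A.T @ A, 2) / s    # ensure r*s > ||A^T A||
gamma = 1.7                                   # relaxation factor in (0, 2)

Q = np.block([[r * np.eye(n), A.T], [A, s * np.eye(m)]])             # (2.3)
M = np.block([[np.eye(n), np.zeros((n, m))],
              [(gamma - 1) / s * A, gamma * np.eye(m)]])             # (2.5b)
H = Q @ np.linalg.inv(M)                                             # (2.6)
G = Q.T + Q - M.T @ H @ M                                            # (2.7)

# both minimal eigenvalues should be (numerically) positive
print(np.linalg.eigvalsh((H + H.T) / 2).min(),
      np.linalg.eigvalsh((G + G.T) / 2).min())
```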

The relaxation step (2.4) looks somewhat complicated. However, combining the PDHG with (2.4), the update steps of the relaxed PDHG remain simple. By straightforward manipulations, the PDHG can be rewritten as a prediction step, which gives

$$\begin{aligned} \tilde{x}^{k} = \mathrm{argmin}\left\{ \theta (x) - x^TA^T\lambda ^k + \frac{r}{2}\Vert x- x^k\Vert ^2 \; \Big | \; x\in {{\mathcal {X}}} \right\} , \end{aligned}$$

(2.9a)

$$\begin{aligned} \tilde{\lambda }^{k} = \lambda ^k - \frac{1}{s}\big [ A(2\tilde{x}^k - x^k) -b \big ]. \end{aligned}$$

(2.9b)

Together with (2.4) and noting \(x^{k+1}=\tilde{x}^k\), we obtain:

$$\begin{aligned} x^{k+1} = \mathrm{argmin}\left\{ \theta (x) - x^TA^T\lambda ^k + \frac{r}{2}\Vert x- x^k\Vert ^2 \; \Big | \; x\in {{\mathcal {X}}} \right\} , \end{aligned}$$

(2.10a)

$$\begin{aligned} \lambda ^{k+1} = \lambda ^k - \frac{1}{s}\big \{ A\big [(1+\gamma ) x^{k+1} - x^k\big ] -\gamma b \big \}. \end{aligned}$$

(2.10b)
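In NumPy-style pseudocode, one possible realization of the relaxed PDHG (the prediction step (2.9) followed by the dual correction (2.4)) reads as follows; as before, \({{\mathcal {X}}}=\mathfrak {R}^n\) is assumed, prox_theta is the proximal oracle (1.10), and this sketch is not the code used in Sect. 6.

```python
import numpy as np

def relaxed_pdhg(prox_theta, A, b, r, s, gamma, x0, lam0, iters=500):
    """Sketch of the relaxed PDHG: PDHG prediction (2.9) plus dual correction (2.4).

    Requires r*s > ||A^T A|| and a relaxation factor gamma in (0, 2).
    """
    x, lam = x0.copy(), lam0.copy()
    for _ in range(iters):
        x_old = x
        # prediction (2.9): one PDHG step, output (x~^k, lam~^k); note x^{k+1} = x~^k
        x = prox_theta(x + A.T @ lam / r, 1.0 / r)
        lam_tilde = lam - (A @ (2 * x - x_old) - b) / s
        # correction (2.4): relax only the dual variable
        lam = lam - (gamma - 1) / s * (A @ (x_old - x)) - gamma * (lam - lam_tilde)
    return x, lam
```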

2.2 Relaxed linearized ALM

In this subsection, we show that the linearized ALM (L-ALM) [35, 38] is also a special case of the customized PPA, and that it can be relaxed by (1.15b). The main idea of the L-ALM is to linearize the augmented Lagrangian function (1.5a) in the ALM by adding a quadratic term \(\frac{1}{2} \Vert x-x^k\Vert _{(rI-\beta A^TA)}^2\). Ignoring some constant terms in the x-subproblem, the iterative scheme of the L-ALM is

$$\begin{aligned} \tilde{x}^{k} = \mathrm{argmin}\left\{ \theta (x) + \frac{r}{2}\Big \Vert x- x^k - \frac{1}{r}A^T\big [\lambda ^k - \beta (Ax^k-b)\big ]\Big \Vert ^2 \; \Big | \; x\in {{\mathcal {X}}} \right\} , \end{aligned}$$

(2.11a)

$$\begin{aligned} \tilde{\lambda }^{k} = \lambda ^k - \beta (A\tilde{x}^k-b). \end{aligned}$$

(2.11b)

Note that the proximal matrix \(rI-\beta A^TA\) in the quadratic term should be positive definite, so we require \(r>\beta \Vert A^TA\Vert \). Now, we revisit the L-ALM (2.11) in VI form.

The first-order optimality condition of the x-subproblem (2.11a) is

$$\begin{aligned} \tilde{x}^k\in {{\mathcal {X}}},\quad \theta (x)-\theta (\tilde{x}^k)+({x}-\tilde{x}^k)^T\big \{-A^T\lambda ^k+\beta A^T(A\tilde{x}^k-b)+(rI-\beta A^TA) (\tilde{x}^k-{x}^k)\big \}\ge 0,\quad \forall \, x\in {{\mathcal {X}}}, \end{aligned}$$

and, using (2.11b), it can be written as

$$\begin{aligned} \theta (x)-\theta (\tilde{x}^k)+({x}-\tilde{x}^k)^T\{-A^T\tilde{\lambda }^k+(rI-\beta A^TA) (\tilde{x}^k-{x}^k) \}\ge 0,\ \forall x\in {\mathcal {X}}. \end{aligned}$$
(2.12)

Next, we rewrite \( \tilde{\lambda }^k = \lambda ^k-\beta (A\tilde{x}^k-b)\) (see (2.11b)) as

$$\begin{aligned} (\lambda -\tilde{\lambda }^k)^T\left\{ A\tilde{x}^k-b+\frac{1}{\beta }(\tilde{\lambda }^{k}-\lambda ^k)\right\} \ge 0, \ \forall \lambda \in \mathfrak {R}^m. \end{aligned}$$
(2.13)

Combining (2.12) and (2.13), we have

$$\begin{aligned} \begin{aligned}&\forall w\in \Omega ,\quad \theta (x)-\theta (\tilde{x}^k)+\\&\left( \begin{array}{c} x-\tilde{x}^k \\ \lambda -\tilde{\lambda }^{k}\\ \end{array} \right) ^T\left\{ \left( \begin{array}{c} -A^T\tilde{\lambda }^{k} \\ A\tilde{x}^k-b\\ \end{array} \right) +\left( \begin{array}{c} (rI_n-\beta A^TA) (\tilde{x}^k-{x}^k) \\ \frac{1}{\beta }(\tilde{\lambda }^{k}-\lambda ^k)\\ \end{array}\right) \right\} \ge 0. \end{aligned} \end{aligned}$$
(2.14)

Recalling the notation in (1.3), the L-ALM (2.11) can thus be interpreted as a customized PPA (1.15a) with Q given by

$$\begin{aligned} Q= \left( \begin{array}{cc} rI_n-\beta A^TA &{}\quad 0\\ 0 &{}\quad \frac{1}{\beta } I_m \end{array} \right) . \end{aligned}$$
(2.15)

To accelerate the L-ALM, we suggest relaxing the dual variable by

$$\begin{aligned} \lambda ^{k+1} = \lambda ^k - \gamma (\lambda ^k-\tilde{\lambda }^k), \end{aligned}$$
(2.16)

where the relaxation factor \(\gamma \in (0,2)\). Since \(x^{k+1}\) always equals \({{\tilde{x}}}^k\), we have

$$\begin{aligned} w^{k+1} = w^k - M (w^k-\tilde{w}^k), \end{aligned}$$
(2.17a)

where

$$\begin{aligned} M = \left( \begin{array}{cc} I_n &{}\quad 0 \\ 0 &{}\quad \gamma I_m \end{array} \right) . \end{aligned}$$
(2.17b)

Next, we show that the condition (2.2) is fulfilled by the relaxed L-ALM scheme. We have

$$\begin{aligned} \begin{aligned} H&=QM^{-1}=\left( \begin{array}{cc} rI_n-\beta A^TA &{}\quad 0\\ 0 &{}\quad \frac{1}{\beta } I_m \end{array} \right) \left( \begin{array}{cc} I_n &{}\quad 0 \\ 0 &{}\quad \frac{1}{\gamma } I_m \end{array} \right) \\&=\left( \begin{array}{cc} rI_n-\beta A^TA &{}\quad 0\\ 0 &{}\quad \frac{1}{\gamma \beta } I_m \end{array} \right) , \end{aligned} \end{aligned}$$
(2.18)

and

$$\begin{aligned} \begin{aligned} G&=Q^T+Q-M^THM=2Q-M^TQ\\&=\left( \begin{array}{cc} 2 (rI_n-\beta A^TA) &{}\quad 0\\ 0 &{}\quad \frac{2}{\beta } I_m \end{array} \right) - \left( \begin{array}{cc} I_n &{}\quad 0 \\ 0 &{}\quad \gamma I_m \end{array} \right) \left( \begin{array}{cc} rI_n-\beta A^TA &{}\quad 0\\ 0 &{}\quad \frac{1}{\beta } I_m \end{array} \right) \\&=\left( \begin{array}{cc} 2 (rI_n-\beta A^TA) &{}\quad 0\\ 0 &{}\quad \frac{2}{\beta } I_m \end{array} \right) -\left( \begin{array}{cc} rI_n-\beta A^TA &{}\quad 0\\ 0 &{}\quad \frac{\gamma }{\beta } I_m \end{array} \right) \\&=\left( \begin{array}{cc} rI_n-\beta A^TA &{}\quad 0\\ 0 &{}\quad \frac{2-\gamma }{\beta } I_m \end{array} \right) . \end{aligned} \end{aligned}$$
(2.19)

Under the conditions \(r>0\), \(\beta >0\), \(r>\beta \Vert A^TA\Vert \) and the relaxation factor \(\gamma \in (0,2)\), the matrices H and G are both symmetric and positive definite. Thus, the relaxed L-ALM scheme is a special case of the prototype algorithm (2.1) and the conditions in (2.2) are fulfilled. We refer to [32] for the relaxed L-ALM with the relaxation step (1.14b).

Now, the relaxed L-ALM scheme is:

$$\begin{aligned} x^{k+1} = \mathrm{argmin}\left\{ \theta (x) + \frac{r}{2}\Big \Vert x- x^k - \frac{1}{r}A^T\big [\lambda ^k - \beta (Ax^k-b)\big ]\Big \Vert ^2 \; \Big | \; x\in {{\mathcal {X}}} \right\} , \end{aligned}$$

(2.20a)

$$\begin{aligned} \lambda ^{k+1} = \lambda ^k - \gamma \beta (Ax^{k+1}-b). \end{aligned}$$

(2.20b)
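A minimal NumPy sketch of this relaxed L-ALM (assuming \({{\mathcal {X}}}=\mathfrak {R}^n\) and a proximal oracle for \(\theta \); illustrative only, not the code of Sect. 6):

```python
import numpy as np

def relaxed_lalm(prox_theta, A, b, r, beta, gamma, x0, lam0, iters=500):
    """Sketch of the relaxed L-ALM: prediction (2.11) plus the dual relaxation (2.16).

    Requires r > beta*||A^T A|| and gamma in (0, 2); assumes X = R^n.
    """
    x, lam = x0.copy(), lam0.copy()
    for _ in range(iters):
        # prediction (2.11a): proximal step on the linearized augmented Lagrangian
        x = prox_theta(x + A.T @ (lam - beta * (A @ x - b)) / r, 1.0 / r)
        # relaxed dual update: lam^{k+1} = lam^k - gamma*beta*(A x^{k+1} - b)
        lam = lam - gamma * beta * (A @ x - b)
    return x, lam
```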

3 Relaxed ADMM for separable case

Motivated by recent popular applications, we also discuss a separable case of (1.1), in which the objective function \(\theta (x)\) can be decomposed as a sum of two individual convex functions without coupled variables:

$$\begin{aligned} \min \{\theta _1(x_1)+\theta _2(x_2) \; | \; A_1x_1+A_2x_2 =b, \, x_1\in {{\mathcal {X}}}_1,\, x_2\in {{\mathcal {X}}}_2 \}, \end{aligned}$$

(3.1)

where \(\theta _1: \mathfrak {R}^{n_1} \rightarrow \mathfrak {R}\), \(\theta _2: \mathfrak {R}^{n_2} \rightarrow \mathfrak {R}\) are both closed convex (not necessarily smooth) functions; \(A_1\in \mathfrak {R}^{m\times n_1}\), \(A_2\in \mathfrak {R}^{m\times n_2}\); \({{\mathcal {X}}}_1\subset \mathfrak {R}^{n_1}\) , \({{\mathcal {X}}}_2\subset \mathfrak {R}^{n_2}\) are closed convex sets.

The optimality condition of the problem (3.1) can be analogously written as the following variational inequality:

$$\begin{aligned} \hbox {VI}(\Omega ,F,\theta )\quad w^*\in \Omega , \quad \theta (x)-\theta (x^*)+ (w-w^*)^T F(w^*) \ge 0, \quad \forall \, w\in \Omega ,\nonumber \\ \end{aligned}$$
(3.2a)

where

$$\begin{aligned} \begin{array}{ccc} x= \left( \begin{array}{c} x_1\\ x_2 \end{array} \right) , \quad w = \left( \begin{array}{c} x_1\\ x_2\\ \lambda \end{array} \right) , F(w) = \left( \begin{array}{c} - A_1^T\lambda \\ -A_2^T\lambda \\ A_1x_1 + A_2x_2 -b \end{array} \right) , \\ \quad \theta (x)=\theta _1(x_1) + \theta _2(x_2), \quad \hbox {and}\quad \Omega =\mathcal{X}_1 \times {{\mathcal {X}}}_2\times \mathfrak {R}^m. \end{array}\nonumber \\ \end{aligned}$$
(3.2b)

Note that the mapping F(w) defined in (3.2b) also satisfies \( (w_1-w_2)^T\big (F(w_1)-F(w_2)\big )\ge 0\) for any \(w_1, w_2\in \Omega \); thus it is monotone.

The classical alternating direction method of multipliers (ADMM), which originates in [9, 16], is suitable for solving (3.1). Starting with an initial iterate \((x_2^0, \lambda ^0) \in {{\mathcal {X}}}_2 \times \mathfrak {R}^m\), the ADMM iterates via the scheme

$$\begin{aligned} x_1^{k+1} = \mathrm{argmin}\left\{ \theta _1(x_1) - (\lambda ^k)^T A_1x_1 + \frac{\beta }{2}\Vert A_1x_1 + A_2x_2^k -b\Vert ^2 \; \Big | \; x_1\in {{\mathcal {X}}}_1 \right\} , \end{aligned}$$

(3.3a)

$$\begin{aligned} x_2^{k+1} = \mathrm{argmin}\left\{ \theta _2(x_2) - (\lambda ^k)^T A_2x_2 + \frac{\beta }{2}\Vert A_1x_1^{k+1} + A_2x_2 -b\Vert ^2 \; \Big | \; x_2\in {{\mathcal {X}}}_2 \right\} , \end{aligned}$$

(3.3b)

$$\begin{aligned} \lambda ^{k+1} = \lambda ^k - \beta (A_1x_1^{k+1} + A_2x_2^{k+1} -b). \end{aligned}$$

(3.3c)

In [13], it was shown that the ADMM is actually equivalent to the Douglas-Rachford splitting method (DRSM) applied to the dual of (3.1); we also refer to [21] for the primal application of the DRSM. In this section, to make our analysis more general, we consider the proximal ADMM, in which each subproblem is regularized with an additional quadratic proximal term. Note that the original ADMM produces the new iterate in the forward order \(x_1 \rightarrow x_2\rightarrow \lambda \). In a cyclic sense, it can also be written in the \(x_1 \rightarrow \lambda \rightarrow x_2\) order, which gives

$$\begin{aligned} \tilde{x}_1^{k} = \mathrm{argmin}\left\{ \theta _1(x_1) - (\lambda ^k)^T A_1x_1 + \frac{\beta }{2}\Vert A_1x_1 + A_2x_2^k -b\Vert ^2 + \frac{1}{2}\Vert x_1-x_1^k\Vert _{D_1}^2 \; \Big | \; x_1\in {{\mathcal {X}}}_1 \right\} , \end{aligned}$$

(3.4a)

$$\begin{aligned} \tilde{\lambda }^{k} = \lambda ^k - \beta (A_1\tilde{x}_1^{k} + A_2x_2^k -b), \end{aligned}$$

(3.4b)

$$\begin{aligned} \tilde{x}_2^{k} = \mathrm{argmin}\left\{ \theta _2(x_2) - (\tilde{\lambda }^k)^T A_2x_2 + \frac{\beta }{2}\Vert A_1\tilde{x}_1^{k} + A_2x_2 -b\Vert ^2 + \frac{1}{2}\Vert x_2-x_2^k\Vert _{D_2}^2 \; \Big | \; x_2\in {{\mathcal {X}}}_2 \right\} , \end{aligned}$$

(3.4c)

where \(D_1\) and \(D_2\) are two positive definite matrices. Obviously, the proximal ADMM contains the ADMM as a special case. When \(D_1=rI-\beta A_1^TA_1\) (\(r>\beta \Vert A_1^TA_1\Vert \)) or \(D_2=sI-\beta A_2^TA_2\) (\(s>\beta \Vert A_2^TA_2\Vert \)), the so-called split inexact Uzawa method [37, 38] is recovered. In this case, the resulting subproblem is in the form of (1.10) (when \(\mathcal{X}_i=\mathfrak {R}^{n_i}, i =1,2\)), which is simple under the assumption (1.10). We refer the reader to [11] for more choices of \(D_1\) and \(D_2\).

In our relaxation, the output of (3.4) serves as a predictor, so, as before, we denote it by \( \tilde{w}^k= (\tilde{x}_1^k,\tilde{x}_2^k,{\tilde{\lambda }^k})\). Now, we first follow [4] to revisit the proximal ADMM (3.4) in VI form.

Lemma 3.1

For given \(w^k=(x_1^k,x_2^k,\lambda ^k)\), let \({{\tilde{w}}}^k\) be generated by (3.4). Then, we have

$$\begin{aligned} \tilde{w}^k\in \Omega , \quad \theta (x) -\theta (\tilde{x}^k) + (w-\tilde{w}^k)^T \{ F(\tilde{w}^k) + Q(\tilde{w}^k-w^k)\} \ge 0, \quad \forall \, w\in \Omega , \end{aligned}$$

(3.5a)

where

$$\begin{aligned} Q= \left( \begin{array}{ccc} D_1&{}\quad 0&{}\quad 0\\ 0&{}\quad D_2+ \beta A_2^TA_2&{}\quad - A_2^T\\ 0&{}\quad -A_2 &{}\quad \frac{1}{\beta } I_m \end{array} \right) . \end{aligned}$$
(3.5b)

Proof

The optimality condition of the \(x_1\)-subproblem in (3.4) is

$$\begin{aligned}&{\tilde{x}_1^{k}}\in \mathcal{X}_1, \quad \theta _1(x_1) - \theta _1(\tilde{x}_1^{k}) + (x_1-\tilde{x}_1^{k})^T\{-A_1^T\lambda ^k + \beta A_1^T(A_1\tilde{x}_1^{k}\\&\quad + A_2x_2^k - b)+D_1( \tilde{x}_1^k-x_1^k) \} \ge 0, \quad \forall \; x_1\in {{\mathcal {X}}}_1, \end{aligned}$$

and it can be written as (using \(\tilde{\lambda }^k\) generated by (3.4b))

$$\begin{aligned}&\tilde{x}_1^{k}\in {\mathcal{X}_1}, \quad \theta _1(x_1) - \theta _1(\tilde{x}_1^k) + (x_1 - \tilde{x}_1^k)^T(-A_1^T\tilde{\lambda }^k\nonumber \\&\quad +D_1( \tilde{x}_1^k-x_1^k)) \ge 0, \quad \forall \; x_1\in {{\mathcal {X}}}_1. \end{aligned}$$
(3.6a)

The optimality condition of the \(x_2\)-subproblem of (3.4) is

$$\begin{aligned}&\tilde{x}_2^k\in {{\mathcal {X}}}_2,\; \theta _2(x_2) - \theta _2(\tilde{x}_2^k) + (x_2 - \tilde{x}_2^k)^T\bigl \{-A_2^T\tilde{\lambda }^k + \beta A_2^T(A_1\tilde{x}_1^k \\&\quad + A_2\tilde{x}_2^k-b)+D_2( \tilde{x}_2^k-x_2^k)\bigr \} \ge 0, \; \forall x_2\in {{\mathcal {X}}}_2. \end{aligned}$$

Now, we treat the \(\{ \cdot \}\) term in the last inequality. Using \( \beta (A_1\tilde{x}_1^k + A_2x_2^k-b)= - (\tilde{\lambda }^k -\lambda ^k)\) (see (3.4b)), we obtain

$$\begin{aligned}&-A_2^T\tilde{\lambda }^k + \beta A_2^T(A_1\tilde{x}_1^k + A_2\tilde{x}_2^k-b)+D_2( \tilde{x}_2^k-x_2^k) \\&= -A_2^T \tilde{\lambda }^k+ \beta A_2^TA_2( \tilde{x}_2^k-x_2^k)-A_2^T(\tilde{\lambda }^k-\lambda ^k) +D_2( \tilde{x}_2^k-x_2^k)\\&= -A_2^T \tilde{\lambda }^k+ (D_2+ \beta A_2^TA_2)( \tilde{x}_2^k-x_2^k)-A_2^T(\tilde{\lambda }^k-\lambda ^k). \end{aligned}$$

Then the optimality condition of the \(x_2\)-subproblem can be written as

$$\begin{aligned}&\tilde{x}_2^k\in {{\mathcal {X}}}_2,\quad \theta _2(x_2) - \theta _2(\tilde{x}_2^k) + ({x_2} - \tilde{x}_2^k)^T \bigl \{ -A_2^T \tilde{\lambda }^k\nonumber \\&\quad + ( D_2+\beta A_2^TA_2)( \tilde{x}_2^k-x_2^k)-A_2^T(\tilde{\lambda }^k-\lambda ^k) \bigr \}\ge 0, \quad \forall x_2\in {{\mathcal {X}}}_2. \end{aligned}$$
(3.6b)

From (3.4b), we have

$$\begin{aligned} (A_1\tilde{x}_1^k + A_2\tilde{x}_2^k-b) -A_2(\tilde{x}_2^k-x_2^k) + (1/\beta ) (\tilde{\lambda }^k-\lambda ^k) = 0, \end{aligned}$$

and it can be written as

$$\begin{aligned}&\tilde{\lambda }^k\in \mathfrak {R}^m, \quad (\lambda -\tilde{\lambda }^k)^T\bigl \{ (A_1\tilde{x}_1^k + A_2\tilde{x}_2^k-b) -A_2(\tilde{x}_2^k-x_2^k) \nonumber \\&\quad + (1/\beta ) (\tilde{\lambda }^k-\lambda ^k)\bigr \} \ge 0, \quad \forall \, \lambda \in \mathfrak {R}^m. \end{aligned}$$
(3.6c)

Combining (3.6a), (3.6b) and (3.6c), and using the notations of (3.2b), the assertion of this lemma is proved. \(\square \)

Remark 3.2

The matrix Q (3.5b) is symmetric positive definite. Thus, the proximal ADMM (3.4) is a customized PPA.

To accelerate the ADMM (3.4), we suggest relaxing the dual variable by

$$\begin{aligned} \lambda ^{k+1} = \lambda ^k + (\gamma -1)\beta A_2(x_2^k-\tilde{x}_2^k) - \gamma (\lambda ^k-\tilde{\lambda }^k), \end{aligned}$$
(3.7)

where the relaxation factor \(\gamma \in (0,2)\). Since \(x_1^{k+1}\) always equals \({{\tilde{x}}}_1^k\) and \(x_2^{k+1}\) always equals \({{\tilde{x}}}_2^k\), we have

$$\begin{aligned} w^{k+1} = w^k - M (w^k-\tilde{w}^k), \end{aligned}$$
(3.8a)

where

$$\begin{aligned} M = \left( \begin{array}{ccc} I_{n_1} &{}\quad 0&{}\quad 0 \\ 0&{}\quad I_{n_2} &{}\quad 0 \\ 0&{}\quad - (\gamma -1)\beta A_2 &{}\quad \gamma I_m \end{array} \right) . \end{aligned}$$
(3.8b)

Then, we check the positive definiteness of H and G to verify whether the conditions (2.2) are fulfilled for the relaxed ADMM scheme. First, recalling M (see (3.8b)), we have

$$\begin{aligned} M^{-1}=\left( \begin{array}{ccc} I_{n_1} &{}\quad 0&{}\quad 0 \\ 0&{}\quad I_{n_2} &{}\quad 0 \\ 0&{}\quad \frac{\gamma -1}{\gamma }\beta A_2&{}\quad \frac{1}{\gamma } I_m \end{array} \right) , \end{aligned}$$

and

$$\begin{aligned} \begin{aligned} H&=QM^{-1}=\left( \begin{array}{ccc} D_1&{}\quad 0&{}\quad 0\\ 0&{}\quad D_2+ \beta A_2^TA_2&{}\quad - A_2^T\\ 0&{}\quad -A_2 &{}\quad \frac{1}{\beta } I_m \end{array} \right) \left( \begin{array}{ccc} I_{n_1} &{}\quad 0&{}\quad 0 \\ 0&{}\quad I_{n_2} &{}\quad 0 \\ 0&{}\quad \frac{\gamma -1}{\gamma }\beta A_2&{}\quad \frac{1}{\gamma } I_m \end{array} \right) \\&=\left( \begin{array}{ccc} D_1&{}\quad 0&{}\quad 0\\ 0&{}\quad D_2+\frac{1}{\gamma }\beta A_2^TA_2 &{}\quad -\frac{1}{\gamma }A_2^T \\ 0&{}\quad - \frac{1}{\gamma }A_2 &{}\quad \frac{1}{\gamma \beta } I_m \end{array} \right) . \end{aligned} \end{aligned}$$
(3.9)

Thus H is positive definite. Next, we check the positive definiteness of G. We first deduce the expression of G:

$$\begin{aligned} \begin{aligned} G&=Q^T+Q-M^THM=2Q-M^TQ\\&=\left( \begin{array}{ccc} 2 D_1&{}\quad 0&{}\quad 0\\ 0&{}\quad 2(D_2+ \beta A_2^TA_2)&{}\quad - 2A_2^T\\ 0&{}\quad -2A_2 &{}\quad \frac{2}{\beta } I_m \end{array} \right) \\&- \left( \begin{array}{ccc} I_{n_1} &{}\quad 0&{}\quad 0 \\ 0&{}\quad I_{n_2} &{}\quad - (\gamma -1){\beta A_2^T } \\ 0&{}\quad 0 &{}\quad \gamma I_m \end{array} \right) \left( \begin{array}{ccc} D_1&{}\quad 0&{}\quad 0\\ 0&{}\quad D_2+ \beta A_2^TA_2&{}\quad - A_2^T\\ 0&{}\quad -A_2 &{}\quad \frac{1}{\beta } I_m \end{array} \right) \\&=\left( \begin{array}{ccc} 2 D_1&{}\quad 0&{}\quad 0\\ 0&{}\quad 2(D_2+ \beta A_2^TA_2)&{}\quad - 2A_2^T\\ 0&{}\quad -2A_2 &{}\quad \frac{2}{\beta } I_m \end{array} \right) -\left( \begin{array}{ccc} D_1&{}\quad 0&{}\quad 0\\ 0&{}\quad D_2+ \gamma \beta A_2^TA_2 &{}\quad -\gamma A_2^T\\ 0&{}\quad -\gamma A_2 &{}\quad \frac{\gamma }{\beta } I_m \end{array} \right) \\&=\left( \begin{array}{ccc} D_1&{}\quad 0&{}\quad 0\\ 0&{}\quad D_2+(2-\gamma )\beta A_2^TA_2 &{}\quad - (2-\gamma )A_2^T\\ 0&{}\quad - (2-\gamma )A_2 &{}\quad \frac{2-\gamma }{\beta } I_m \end{array} \right) . \end{aligned} \end{aligned}$$
(3.10)

From (3.10), G is positive definite for \(0< \gamma <2\). The relaxed ADMM scheme is thus a special case of the prototype algorithm (2.1). Note that

$$\begin{aligned} \lambda ^{k+1}= & {} \tilde{\lambda }^k+ (\gamma -1)\beta A_2(x_2^k-\tilde{x}_2^k) - (\gamma -1) (\lambda ^k-\tilde{\lambda }^k)\\= & {} \tilde{\lambda }^k+ (\gamma -1)\beta A_2(x_2^k-\tilde{x}_2^k) - (\gamma -1) \beta (A_1 \tilde{x}_1^k+A_2x_2^k-b) \\= & {} \tilde{\lambda }^k - (\gamma -1)\beta (A_1 \tilde{x}_1^k+A_2\tilde{x}_2^k-b). \end{aligned}$$

Writing (3.4) and (3.7) together in a compact form and recalling \(x_1^{k+1} = \tilde{x}_1^k\) and \(x_2^{k+1} = \tilde{x}_2^k\), the relaxed ADMM is

$$\begin{aligned} x_1^{k+1} = \mathrm{argmin}\left\{ \theta _1(x_1) - (\lambda ^k)^T A_1x_1 + \frac{\beta }{2}\Vert A_1x_1 + A_2x_2^k -b\Vert ^2 + \frac{1}{2}\Vert x_1-x_1^k\Vert _{D_1}^2 \; \Big | \; x_1\in {{\mathcal {X}}}_1 \right\} , \end{aligned}$$

(3.11a)

$$\begin{aligned} \tilde{\lambda }^{k} = \lambda ^k - \beta (A_1x_1^{k+1} + A_2x_2^k -b), \end{aligned}$$

(3.11b)

$$\begin{aligned} x_2^{k+1} = \mathrm{argmin}\left\{ \theta _2(x_2) - (\tilde{\lambda }^k)^T A_2x_2 + \frac{\beta }{2}\Vert A_1x_1^{k+1} + A_2x_2 -b\Vert ^2 + \frac{1}{2}\Vert x_2-x_2^k\Vert _{D_2}^2 \; \Big | \; x_2\in {{\mathcal {X}}}_2 \right\} , \end{aligned}$$

(3.11c)

$$\begin{aligned} \lambda ^{k+1} = \tilde{\lambda }^{k} - (\gamma -1)\beta (A_1x_1^{k+1} + A_2x_2^{k+1} -b). \end{aligned}$$

(3.11d)
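The following NumPy sketch illustrates one possible implementation of this relaxed ADMM with the split inexact Uzawa choice \(D_i=r_iI-\beta A_i^TA_i\), so that both subproblems reduce to the proximal mappings of \(\theta _1\) and \(\theta _2\); the oracles prox1 and prox2, the parameters, and the assumption \({{\mathcal {X}}}_i=\mathfrak {R}^{n_i}\) are placeholders for illustration, not the tested code.

```python
import numpy as np

def relaxed_prox_admm(prox1, prox2, A1, A2, b, beta, r1, r2, gamma,
                      x1, x2, lam, iters=500):
    """Sketch of the relaxed proximal ADMM (order x1 -> lambda -> x2, then relaxation).

    Uses D_i = r_i*I - beta*A_i^T A_i (split inexact Uzawa), so each subproblem
    is a proximal step; requires r_i > beta*||A_i^T A_i|| and gamma in (0, 2).
    """
    for _ in range(iters):
        # x1 prediction: linearized subproblem -> proximal step with stepsize 1/r1
        res = A1 @ x1 + A2 @ x2 - b
        x1 = prox1(x1 - A1.T @ (beta * res - lam) / r1, 1.0 / r1)
        # intermediate multiplier (3.4b)
        lam_tilde = lam - beta * (A1 @ x1 + A2 @ x2 - b)
        # x2 prediction: linearized subproblem with the intermediate multiplier
        res = A1 @ x1 + A2 @ x2 - b
        x2 = prox2(x2 - A2.T @ (beta * res - lam_tilde) / r2, 1.0 / r2)
        # relaxed dual update (3.7): lam^{k+1} = lam~^k - (gamma-1)*beta*(A1 x1 + A2 x2 - b)
        lam = lam_tilde - (gamma - 1) * beta * (A1 @ x1 + A2 @ x2 - b)
    return x1, x2, lam
```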

4 Global convergence

We have characterized the relaxed PDHG, the relaxed L-ALM and the relaxed ADMM as special cases of the prototype algorithm (2.1). Using this framework, we can easily analyze their convergence properties in a unified way.

Under the condition that H and G are positive definite, we now characterize the right hand side of (2.1a) in terms of \(\Vert w-w^k\Vert _H\) and \(\Vert w-w^{k+1}\Vert _H\). This is given in the following lemma.

Lemma 4.1

Let \({{\tilde{w}}}^k\) be generated by the step (2.1a) and \(\{w^{k}\}\) be generated by (2.1b). We have

$$\begin{aligned} \theta (x)-\theta (\tilde{x}^k)+(w-\tilde{w}^k)^TF(\tilde{w}^k)\ge \frac{1}{2}\big (\Vert w-w^{k+1}\Vert _H^2-\Vert w-w^k\Vert _H^2\big )+\frac{1}{2}\Vert w^k-\tilde{w}^k\Vert _G^2,\quad \forall \, w\in \Omega , \end{aligned}$$

(4.1)

where H and G are defined in (2.2a) and (2.2b), respectively.

Proof

Observing that \(M(w^k-\tilde{w}^k)=(w^k-w^{k+1})\) (see (2.1b)) and that \(Q=HM\), we see that (2.1a) can be written as

$$\begin{aligned} \theta (x)-\theta (\tilde{x}^k)+(w-\tilde{w}^k)^TF(\tilde{w}^k)\ge (w-\tilde{w}^k)^TH(w^k-w^{k+1}),\ \forall w\in \Omega . \end{aligned}$$
(4.2)

Applying the identity

$$\begin{aligned} (a-b)^TH(c-d)=\frac{1}{2}(\Vert a-d\Vert _H^2-\Vert a-c\Vert _H^2)+\frac{1}{2}(\Vert c-b\Vert _H^2-\Vert d-b\Vert _H^2), \end{aligned}$$

with \(a=w, b=\tilde{w}^k, c=w^k, d=w^{k+1}\) to the right hand side of (4.2), we have

$$\begin{aligned}&(w-\tilde{w}^k)^TH(w^k-w^{k+1})=\frac{1}{2}(\Vert w-w^{k+1}\Vert _H^2-\Vert w-w^k\Vert _H^2)\nonumber \\&\quad +\frac{1}{2}(\Vert w^k-\tilde{w}^k\Vert _H^2-\Vert w^{k+1}-\tilde{w}^k\Vert _H^2). \end{aligned}$$
(4.3)

For the last term of (4.3), we have

$$\begin{aligned} \begin{aligned} \Vert w^k&-\tilde{w}^k\Vert _H^2-\Vert w^{k+1}-\tilde{w}^k\Vert _H^2\\&=\Vert w^k-\tilde{w}^k\Vert _H^2-\Vert (w^k-\tilde{w}^k)-(w^k-w^{k+1})\Vert _H^2\\&{\mathop {=}\limits ^{(2.1b)}}\Vert w^k-\tilde{w}^k\Vert _H^2-\Vert (w^k-\tilde{w}^k)-M(w^k-\tilde{w}^k)\Vert _H^2\\&=2(w^k-\tilde{w}^k)^THM(w^k-\tilde{w}^k)-(w^k-\tilde{w}^k)^TM^THM(w^k-\tilde{w}^k)\\&{\mathop {=}\limits ^{(2.2)}}(w^k-\tilde{w}^k)^T(Q^T+Q-M^THM)(w^k-\tilde{w}^k)\\&{\mathop {=}\limits ^{(2.2b)}}(w^k-\tilde{w}^k)^TG(w^k-\tilde{w}^k). \end{aligned} \end{aligned}$$
(4.4)

Combining (4.2), (4.3) and (4.4), the lemma is proved. \(\square \)

Theorem 4.2

Let \({{\tilde{w}}}^k\) be generated by the step (2.1a) and \(\{w^{k}\}\) be generated by (2.1b). Then we have

$$\begin{aligned} \Vert w^{k+1}-w^*\Vert _H^2\le \Vert w^k-w^*\Vert _H^2-\Vert w^k-\tilde{w}^k\Vert _G^2,\ \forall w^*\in \Omega ^*. \end{aligned}$$
(4.5)

Proof

Setting \(w = w^*\) in (4.1), we get

$$\begin{aligned}&\Vert w^k-w^*\Vert _H^2-\Vert w^{k+1}-w^*\Vert _H^2\ge \Vert w^k-\tilde{w}^k\Vert _G^2\\&\quad + 2\{\theta (\tilde{x}^k)-\theta (x^*)+(\tilde{w}^k-w^*)^TF(\tilde{w}^k)\}. \end{aligned}$$

By using the optimality of \(w^*\) and the monotonicity of F(w), we have

$$\begin{aligned} \theta (\tilde{x}^k)-\theta (x^*)+(\tilde{w}^k-w^*)^TF(\tilde{w}^k)\ge \theta (\tilde{x}^k)-\theta (x^*)+(\tilde{w}^k-w^*)^TF(w^*)\ge 0, \end{aligned}$$

and thus

$$\begin{aligned} \Vert w^k-w^*\Vert _H^2-\Vert w^{k+1}-w^*\Vert _H^2\ge \Vert w^k-\tilde{w}^k\Vert _G^2. \end{aligned}$$

The assertion (4.5) follows directly. \(\square \)

Since both H and G are positive definite, Theorem 4.2 implies that the sequence \(\{w^k\}\) generated by (2.1) is strictly contractive with respect to the solution set \(\Omega ^*\). Thus the global convergence of these relaxed PPA algorithms can be easily established. This is given by the following theorem.

Theorem 4.3

Let \(\{w^{k}\}\) be a sequence generated by (2.1). Then it globally converges to a point \(w^\infty \) which belongs to \(\Omega ^*\).

Proof

From (4.5), we can see that \(\{w^k\}\) is bounded, and

$$\begin{aligned} \lim _{k\rightarrow \infty }\Vert w^k-\tilde{w}^k\Vert =0. \end{aligned}$$
(4.6)

Thus, the sequence \(\{\tilde{w}^k\}\) is also bounded; let \(w^\infty \) be one of its cluster points, and let \(\{{\tilde{w}^{k_j}}\}\) be a subsequence converging to \(w^\infty \). From (2.1a), we have

$$\begin{aligned} {\tilde{w}^{k_j}}\in \Omega ,\quad \theta (x)-\theta ({\tilde{x}^{k_j}})+(w-{\tilde{w}^{k_j}})^TF({\tilde{w}^{k_j}})\ge (w-{\tilde{w}^{k_j}})^TQ({w^{k_j}}-{\tilde{w}^{k_j}}),\quad \forall w\in \Omega . \end{aligned}$$

Letting \(j\rightarrow \infty \), and using (4.6) together with the continuity of \(\theta (x)\) and F(w), we have

$$\begin{aligned} w^\infty \in \Omega ,\quad \theta (x)-\theta (x^\infty )+(w-w^\infty )^TF(w^\infty )\ge 0,\quad \forall w\in \Omega . \end{aligned}$$

The above variational inequality implies that \(w^\infty \) is a solution point of VI\((\Omega ; F; \theta )\). By using (4.6) and \(\lim _{{j\rightarrow \infty }}{\tilde{w}^{k_j}} =w^\infty \), we get \(\lim _{{j\rightarrow \infty }} {w^{k_j}}=w^\infty \). Recall (4.5), we have

$$\begin{aligned} \Vert w^{k+1}-w^\infty \Vert _H\le \Vert w^k-w^\infty \Vert _H. \end{aligned}$$

Thus the sequence \(\{w^k\}\) converges to \(w^\infty \). The proof is completed. \(\square \)

5 Convergence rate

In this section, we analyze the iteration complexity, in both the non-ergodic and the ergodic sense, of the sequences generated by these relaxed customized PPA algorithms. Note that, by letting the matrix \(M=I\), our analysis is also applicable to the original customized PPA.

5.1 Worst-case convergence rate in the non-ergodic sense

In this subsection, we establish the \(\mathcal {O}(1/\sqrt{t})\) worst-case convergence rate in the non-ergodic sense for these relaxed PPA algorithms.

Lemma 5.1

Let \({{\tilde{w}}}^k\) be generated by the step (2.1a) and \(\{w^{k}\}\) be generated by (2.1b). Then we have

$$\begin{aligned} (\tilde{w}^k-\tilde{w}^{k+1})^TQ\{(w^k-\tilde{w}^{k})-(w^{k+1}-\tilde{w}^{k+1})\}\ge 0. \end{aligned}$$
(5.1)

Proof

Setting \(w=\tilde{w}^{k+1}\) in (2.1a), we get

$$\begin{aligned} \theta (\tilde{x}^{k+1})-\theta (\tilde{x}^k)+(\tilde{w}^{k+1}-\tilde{w}^k)^TF(\tilde{w}^k)\ge (\tilde{w}^{k+1}-\tilde{w}^k)^TQ(w^k-\tilde{w}^k). \end{aligned}$$
(5.2)

Notice that (2.1a) also holds with k replaced by \(k+1\); thus

$$\begin{aligned} \theta (x)-\theta (\tilde{x}^{k+1})+(w-\tilde{w}^{k+1})^TF(\tilde{w}^{k+1})\ge (w-\tilde{w}^{k+1})^TQ(w^{k+1}-\tilde{w}^{k+1}),\quad \forall w\in \Omega . \end{aligned}$$
(5.3)

Let \(w=\tilde{w}^{k}\) in the above inequality, we get

$$\begin{aligned} \theta (\tilde{x}^k)-\theta (\tilde{x}^{k+1})+(\tilde{w}^{k}-\tilde{w}^{k+1})^TF(\tilde{w}^{k+1})\ge (\tilde{w}^{k}-\tilde{w}^{k+1})^TQ(w^{k+1}-\tilde{w}^{k+1}). \end{aligned}$$
(5.4)

Summing (5.2) and (5.4) together and using the monotonicity of F, the assertion (5.1) is proved. \(\square \)

Lemma 5.2

Let \({{\tilde{w}}}^k\) be generated by the step (2.1a) and \(\{w^{k}\}\) be generated by (2.1b). Then we have

$$\begin{aligned} (w^k-\tilde{w}^{k})^TM^THM\{(w^k-\tilde{w}^{k})-(w^{k+1}-\tilde{w}^{k+1})\}\ge \Vert (w^k-\tilde{w}^{k})-(w^{k+1}-\tilde{w}^{k+1})\Vert _Q^2. \end{aligned}$$
(5.5)

Proof

Note that Q is positive definite. Adding the identity

$$\begin{aligned}&\{(w^k-\tilde{w}^{k})-(w^{k+1}-\tilde{w}^{k+1})\}^TQ\{(w^k-\tilde{w}^{k})-(w^{k+1}-\tilde{w}^{k+1})\}\\&=\Vert (w^k-\tilde{w}^{k})-(w^{k+1}-\tilde{w}^{k+1})\Vert _Q^2 \end{aligned}$$

to both sides of (5.1), we get

$$\begin{aligned} (w^k-w^{k+1})^TQ\{(w^k-\tilde{w}^{k})-(w^{k+1}-\tilde{w}^{k+1})\}\ge \Vert (w^k-\tilde{w}^{k})-(w^{k+1}-\tilde{w}^{k+1})\Vert _Q^2. \end{aligned}$$
(5.6)

Substituting the terms

$$\begin{aligned} M(w^k-\tilde{w}^k)=w^k-w^{k+1}\ \text{ and } \ Q=HM \end{aligned}$$

into (5.6), we obtain

$$\begin{aligned}&(w^k-\tilde{w}^{k})^TM^THM\{(w^k-\tilde{w}^{k})-(w^{k+1}-\tilde{w}^{k+1})\}\\&\quad \ge \Vert (w^k-\tilde{w}^{k})-(w^{k+1}-\tilde{w}^{k+1})\Vert _Q^2. \end{aligned}$$

This completes the proof. \(\square \)

Theorem 5.3

Let \(\{w^{k}\}\) be a sequence generated by (2.1). Then we have

$$\begin{aligned} \Vert w^{k+1}-w^{k+2}\Vert _H^2\le \Vert w^{k}-w^{k+1}\Vert _H^2, \end{aligned}$$
(5.7)

where H is defined in (2.2a).

Proof

Setting \(a=M(w^k-\tilde{w}^{k})\) and \(b=M(w^{k+1}-\tilde{w}^{k+1})\) in the following identity

$$\begin{aligned} \Vert a\Vert _H^2-\Vert b\Vert _H^2=2a^TH(a-b)-\Vert a-b\Vert _H^2, \end{aligned}$$

we get

$$\begin{aligned} \begin{aligned}&\Vert M(w^k-\tilde{w}^{k})\Vert _{{H}}^2-\Vert M(w^{k+1}-\tilde{w}^{k+1})\Vert _{{H}}^2\\&=2(w^k-\tilde{w}^{k})^TM^THM\{(w^k-\tilde{w}^{k})-(w^{k+1}-\tilde{w}^{k+1})\}-\Vert (w^k-\tilde{w}^{k})\\&-(w^{k+1}-\tilde{w}^{k+1})\Vert _{M^THM}^2. \end{aligned} \end{aligned}$$
(5.8)

Substituting (5.5) into the right-hand side of the above equality, we get

$$\begin{aligned}&\Vert M(w^k-\tilde{w}^{k})\Vert _{{H}}^2-\Vert M(w^{k+1}-\tilde{w}^{k+1})\Vert _{{H}}^2\\&\quad \ge \Vert (w^k-\tilde{w}^{k})-(w^{k+1}-\tilde{w}^{k+1})\Vert _{{(}Q^T+Q-M^THM{)}}^2. \end{aligned}$$

Note that

$$\begin{aligned} M(w^k-\tilde{w}^k)=w^k-w^{k+1}\ \text{ and }\ M(w^{k+1}-\tilde{w}^{k+1})=w^{k+1}-w^{k+2}. \end{aligned}$$

Combining with \(G=Q^T+Q-M^THM\succ 0\) (see (2.2b)), the assertion is proved. \(\square \)

Theorem 5.3 implies that the sequence \(\{\Vert w^{k}-{{w}^{k+1}}\Vert _H^2\}\) is monotonically nonincreasing, which is crucial for our proof. Next, we need to establish the connection between \(\Vert w^k-\tilde{w}^k\Vert _G^2\) and \(\Vert w^k-w^{k+1}\Vert _H^2\).

Lemma 5.4

For each relaxed customized PPA, there exists a constant \(c: = \frac{2-\gamma }{4}\), such that

$$\begin{aligned} c\Vert w^k-w^{k+1}\Vert _H^2\le \Vert w^k-\tilde{w}^k\Vert _G^2. \end{aligned}$$
(5.9)

Proof

From \(w^k-w^{k+1}=M(w^k-\tilde{w}^k)\) (see (2.1b)), we have

$$\begin{aligned} \Vert w^k-w^{k+1}\Vert _H^2=\Vert w^k-\tilde{w}^k\Vert _{M^THM}^2. \end{aligned}$$

The remaining task is to find a constant c, such that

$$\begin{aligned} G\succ c M^THM, \end{aligned}$$

for each pair (G, H) appearing in these algorithms. Note that from (2.2b), we have \(M^THM\prec 2Q\). From (2.3) and (2.7), we have \(G\succcurlyeq \frac{2-\gamma }{2}Q\) for the relaxed PDHG. From (2.15) and (2.19), \(G\succcurlyeq \frac{2-\gamma }{2}Q\) for the relaxed L-ALM. From (3.5b) and (3.10), \(G\succcurlyeq \frac{2-\gamma }{2}Q\) for the relaxed ADMM. Hence \(\frac{2-\gamma }{4}M^THM\prec \frac{2-\gamma }{2}Q\preccurlyeq G\), and the assertion is proved. \(\square \)

Theorem 5.5

Let \(\{w^{k}\}\) be a sequence generated by (2.1). Then we have

$$\begin{aligned} \Vert w^{k}-w^{k+1}\Vert _H^2\le {\frac{1}{c(k+1)}}\Vert w^{0}-w^*\Vert _H^2,\ \forall w^*\in \Omega ^*, \end{aligned}$$
(5.10)

where the matrix H is defined as (2.2a).

Proof

Inserting (5.9) into (4.5), we get

$$\begin{aligned} \Vert w^{k+1}-w^*\Vert _H^2\le \Vert w^k-w^*\Vert _H^2-c\Vert w^k-w^{k+1}\Vert _H^2,\ \forall w^*\in \Omega ^*. \end{aligned}$$

Summing this inequality over all iterations, we have

$$\begin{aligned} {c}\sum _{t=0}^{\infty }\Vert w^t-w^{t+1}\Vert _H^2\le \Vert w^0-w^*\Vert _H^2,\ \forall w^*\in \Omega ^*. \end{aligned}$$
(5.11)

Now, observing from Theorem 5.3 that the sequence \(\{\Vert w^{k}-w^{k+1}\Vert _H^2\}\) is monotonically nonincreasing, we have

$$\begin{aligned} (k+1)\Vert w^k-w^{k+1}\Vert _H^2\le \sum _{{t=0}}^{k}\Vert {w^t-w^{t+1}}\Vert _H^2. \end{aligned}$$
(5.12)

Using the above two inequalities (5.11) and (5.12), we obtain (5.10). \(\square \)

From (1.15a), we know that if \(\Vert w^k-w^{k+1}\Vert _H=0\), then \(w^k\) is a solution point of VI\((\Omega ; F; \theta )\). This allows us to use \(\Vert w^k-w^{k+1}\Vert _H\) as an error measure, in terms of the distance to the solution set of VI\((\Omega ; F; \theta )\), for the k-th iteration of the algorithm. Let \(d :=\inf \{\Vert w^0-w^*\Vert _H \ |\ w^*\in \Omega ^*\}\). Then, for any given \(\varepsilon >0\), Theorem 5.5 implies that these relaxed PPA algorithms need at most \({\lceil \frac{d^2}{c\varepsilon ^2}\rceil }\) iterations to ensure that \(\Vert w^k-w^{k+1}\Vert _H\le \varepsilon \). Therefore, a worst-case \(\mathcal {O}(1/\sqrt{t})\) convergence rate in the non-ergodic sense for these relaxed PPA algorithms is established.

5.2 Worst-case convergence rate in the ergodic sense

In this section, we establish a worst-case \(\mathcal {O}(1/t)\) convergence rate in the ergodic sense for these relaxed customized PPA algorithms. First, we introduce the following result, which is convenient for analyzing the ergodic iteration complexity of the algorithms. Its proof can be found in [24] or in Theorem 2.3.5 of [15].

Theorem 5.6

The solution set of \(\text{ VI }(\Omega ; F; \theta )\) ((1.3), (3.2)) is closed and convex, and can be characterized via

$$\begin{aligned} \Omega ^* :=\bigcap _{w\in \Omega }\{\tilde{w}\in \Omega :\ \theta (x)-\theta (\tilde{x})+(w-\tilde{w})^T F(w)\ge 0\}. \end{aligned}$$

Computing a primal-dual solution \(w^*\) exactly is impractical, but we can find an approximate solution \(\tilde{w}\in \Omega \). With Theorem 5.6, we define an \(\varepsilon \)-approximate solution of \(\text{ VI }(\Omega ; F; \theta )\) as follows: given an accuracy \(\varepsilon >0\), \(\tilde{w}\) is said to be an \(\varepsilon \)-approximate solution of \(\text{ VI }(\Omega ; F; \theta )\) if it satisfies

$$\begin{aligned} \sup _{w\in {\mathcal {D}}} \{\theta (\tilde{x})-\theta (x)+(\tilde{w}-w)^T F(w)\}\le \varepsilon , \end{aligned}$$

where \(\mathcal {D}\subset \Omega \) is a prescribed compact set, e.g., \(\mathcal {D}=\{w\ |\ \Vert w-\tilde{w}\Vert \le 1\}\). Next, our purpose is to show that, after t iterations, the prototype algorithm (2.1) provides an approximate solution with accuracy \(\mathcal {O}(1/t)\).

Theorem 5.7

Let \({{\tilde{w}}}^k\) be generated by the step (2.1a) and \(\{w^{k}\}\) be generated by (2.1b). For any integer \(t>0\), define

$$\begin{aligned} \tilde{w}_t=\frac{1}{t+1}\sum _{k=0}^{t}\tilde{w}^k, \end{aligned}$$
(5.13)

then we have \(\tilde{w}_t\in \Omega \) and

$$\begin{aligned} \theta (\tilde{x}_t)-\theta (x)+(\tilde{w}_t-w)^TF(w)\le \frac{1}{2(t+1)}\Vert w-w^0\Vert _H^2,\ \forall w\in \Omega , \end{aligned}$$
(5.14)

where H is defined in (2.2a).

Proof

First, since \(\tilde{w}^k\in \Omega \) for all k and \(\Omega \) is convex, it holds that \(\tilde{w}_t\in \Omega \) for every integer t.

Next, by using the definitions of F (see (1.3b) and (3.2b)), we obtain,

$$\begin{aligned} (w-\tilde{w}^k)^TF(w)\ge (w-\tilde{w}^k)^TF(\tilde{w}^k). \end{aligned}$$

Also, using the positive definiteness of G and (4.1), we get

$$\begin{aligned} \theta (x)-\theta (\tilde{x}^k)+(w-\tilde{w}^k)^TF(w)+\frac{1}{2}\Vert w-w^k\Vert _H^2\ge \frac{1}{2}\Vert w-w^{k+1}\Vert _H^2,\ \forall w\in \Omega . \end{aligned}$$
(5.15)

Summing inequality (5.15) from \(k=0\) to \(k=t\), we derive

$$\begin{aligned}&(t+1)\theta (x)-\sum _{k=0}^t\theta (\tilde{x}^k)\\&\quad +\big ((t+1)w-\sum _{k=0}^t\tilde{w}^k\big )^TF(w)+\frac{1}{2}\Vert w-w^0\Vert _H^2 \ge 0,\ \forall w\in \Omega . \end{aligned}$$

From the definition of \(\tilde{w}_t\) in (5.13), we can write the above inequality as

$$\begin{aligned} \frac{1}{t+1}\sum _{k=0}^t\theta (\tilde{x}^k)-\theta (x)+(\tilde{w}_t-w)^TF(w)\le \frac{1}{2(t+1)}\Vert w-w^0\Vert _H^2, \ \forall w\in \Omega . \end{aligned}$$
(5.16)

Noticing that \(\theta (x)\) is convex, we have

$$\begin{aligned} \theta (\tilde{x}_t)\le \frac{1}{t+1}\sum _{k=0}^t\theta (\tilde{x}^k). \end{aligned}$$
(5.17)

Substituting (5.17) into (5.16), we obtain (5.14). \(\square \)

Let \(d :=\sup \{\Vert w-w^0\Vert _H \ |\ w\in \mathcal {D}\}\). Theorem 5.7 implies that, after t iterations of (2.1), \(\tilde{w}_t\) defined in (5.13) satisfies

$$\begin{aligned} \sup _{w\in \mathcal {D}}\{\theta (\tilde{x}_t)-\theta (x)+(\tilde{w}_t-w)^T F(w)\}\le {\frac{d^2}{2(t+1)}}. \end{aligned}$$

Thus, the average \(\tilde{w}_t\) of the predictors \(\tilde{w}^k\) is an approximate solution of \(\text{ VI }(\Omega , F, \theta )\) with accuracy \(\mathcal {O}(1/t)\); that is, a worst-case \(\mathcal {O}(1/t)\) convergence rate in the ergodic sense is established.

6 Numerical results

In this section, we report some numerical results for the proposed algorithms. Note that numerical comparisons between the customized PPA and related methods have already been conducted in the literature; see, e.g., [6, 35, 38]. Thus they are omitted in our tests, and we focus on showing the advantages of the relaxed methods over the original customized PPA algorithms and on verifying the improvements in the computational sense. All codes were written in Matlab 2015b, and all experiments were conducted on a laptop with an Intel Core (TM) CPU at 2.40GHz and 8G memory.

6.1 Matrix completion problem

The matrix completion problem consists of reconstructing an unknown low-rank matrix from a given subset of observed entries [3]. Specifically, let \(M\in \mathfrak {R}^{m\times n}\) be a low-rank matrix and \(\Omega \) be a subset of the index set of entries \(\{1,\cdots , {m}\}\times \{1,\cdots , n\}\). The convex model is

$$\begin{aligned} \text{ min }\{ \Vert X\Vert _*\ |\ X_{ij}=M_{ij}, \forall \{{i, j}\}\in \Omega \}, \end{aligned}$$
(6.1)

where \(\Vert \cdot \Vert _*\) denotes the nuclear norm of a matrix. Note that the constraints can also be viewed as a projection equation \(P_\Omega (X)=P_\Omega (M)\), thus we have \(\Vert A^TA\Vert =1\).

In our implementations, for the SVD computation, we resort to the widely recommended PROPACK package [28]. We generate the rank-\(r_a\) matrix M as a product \(M_LM_R^T\), where \(M_L\) and \(M_R\) are independent \(n\times r_a\) matrices with i.i.d. Gaussian entries. The set of observations \(\Omega \) is sampled uniformly at random among all sets of cardinality \(|\Omega |\). The number of degrees of freedom of the matrix is \(d_{r_a}=r_a(2{n}-r_a)\), and the oversampling factor \(|\Omega |/d_{r_a}\) (denoted OS) is the ratio of the number of samples to the degrees of freedom. The primal variable \(X^0\) and the dual variable \(\Lambda ^0\) were both initialized as zeros(n). The stopping criterion is set to

$$\begin{aligned} \frac{\Vert X^k-M\Vert _F}{\Vert M\Vert _F}\le 10^{-4}. \end{aligned}$$
(6.2)

For each instance, we randomly generated ten cases, so the results reported were averaged over ten runs.
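For reference, a NumPy sketch of this data generation (illustrative only; the reported experiments use the authors' MATLAB code with PROPACK for the SVDs, and the function and parameter names below are placeholders):

```python
import numpy as np

def make_mc_instance(n, rank, oversampling, seed=0):
    """Generate a rank-`rank` matrix M and a random observation mask Omega.

    The number of observed entries is oversampling * rank * (2*n - rank).
    """
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((n, rank)) @ rng.standard_normal((rank, n))
    dof = rank * (2 * n - rank)                       # degrees of freedom
    p = min(int(round(oversampling * dof)), n * n)
    idx = rng.choice(n * n, size=p, replace=False)    # uniform sampling of entries
    mask = np.zeros(n * n, dtype=bool)
    mask[idx] = True
    return M, mask.reshape(n, n)
```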

Fig. 1 Comparison results of PDHG and relaxed PDHG with different r (\(r_a=8, OS=5\))

Table 1 Performance comparison of PDHG, Relaxed PDHG with \(r= 0.008\)
Fig. 2 Comparison results of L-ALM and relaxed L-ALM with different r (\(r_a=8, OS=5\))

Table 2 Performance comparison of L-ALM, Relaxed L-ALM with \(r= 0.008\)

Now we show how to choose the parameters. Note that the PDHG requires multiple stepsize parameters, and it seems impossible to give a universally best choice of step sizes. We already know from (1.11) that \(1/r\) and \(1/s\) essentially play the role of proximal step sizes, and, in view of the convergence condition \(rs>\Vert A^TA\Vert \), it is preferable to choose r and s such that \(r\cdot s\) is close to the lower bound \(\Vert A^TA\Vert \). Thus, in our experiments, we always set \(s=1.01/r\). However, the choice of r still lacks theoretical guidance and, as far as we know, it is usually selected by numerical experiments. Since the best choice varies with the problem size, and even with the iteration count, adjusting the PDHG parameters adaptively at each iteration seems to be a better way to achieve faster convergence; see, e.g., [18] for a simple adaptive strategy. In this paper, our main goal is to illustrate the advantages of the relaxation steps, so the parameters are kept fixed in all our experiments. To choose a suitable value of r, we first run the PDHG and the relaxed PDHG (\(\gamma = 1.7\)) for several different values of r and record the resulting iteration numbers; in particular, r varies from 0.002 to 1.024. The results are reported in Fig. 1. We can see that the relaxed PDHG with \(\gamma = 1.7\) took fewer iterations than the PDHG in almost all cases. We also observe that both algorithms are sensitive to the choice of r. As a result, the proper choice of r differs for different dimensions n, and there do not exist fixed, universally best parameters for the PDHG throughout the implementation.

We further compare the two algorithms with specified values of the parameters. From Fig. 1, we see that when r is chosen in the interval (0.008, 0.032), the performance is efficient and robust. Here, we set the parameters r and s to \(r=0.008\) and \(s=1.01/r\), respectively. We point out that the values of r and s are determined based on the experiments and may be far from optimal. For each setting, we report the number of iterations (“No. it”), the relative error (“RelErr”) and the computing time in Table 1. As shown in Table 1, the relaxed PDHG took much fewer iterations and is much faster than the PDHG for the cases \(n<300\). When n becomes larger, the relaxed PDHG still needs fewer iterations than the PDHG, although both perform well. We remark that the performance of the relaxed PDHG could be further improved by judiciously choosing the value of \(\gamma \) in the interval (1, 2), but such tuning is beyond the scope of this paper.

Fig. 3 Comparison results of ADMM and relaxed ADMM with different \(\beta =k*\tau \)

Table 3 Performance comparison of ADMM and relaxed ADMM with \(\beta ={ 0.3*\tau }\)

Next, we test the performance of the L-ALM. Similarly to the PDHG, the parameters r and \(\beta \) should be chosen so that \(r/\beta \) is close to the lower bound \(\Vert A^TA\Vert \). Thus, in our experiments, we always set \(\beta =r/1.01\). Since the situation is similar to that of the PDHG, we illustrate the performance with different values of r in Fig. 2. We can see that the relaxed L-ALM with \(\gamma = 1.7\) took fewer iterations than the L-ALM in almost all cases. When \(n<500\) and \(r\in (0.008, 0.032)\), the two algorithms require fewer iterations and perform robustly and efficiently. When \(r>0.032\), the two algorithms require more iterations to reach the same stopping criterion, and the relaxation step seems less effective at reducing the iteration numbers. We note that the performance is remarkably similar to that of the PDHG. This phenomenon is not difficult to understand: the L-ALM can be explained as a linearized ADMM algorithm and is closely related to the PDHG [14]. Here, we also set the parameters to \(r=0.008\) and \({\beta }=r/1.01\), respectively. In Table 2, we report the performance of the relaxed L-ALM algorithm for different settings of the dimensions. We can see that the relaxed L-ALM is more effective than the L-ALM when \(n<300\). For larger problem dimensions, the results also demonstrate that the relaxation requires fewer iterations, reducing the iteration numbers by about \(10\%\).

6.2 Robust principal component analysis

In this subsection, we test the robust principal component analysis (RPCA) problem, which has the form

$$\begin{aligned} \begin{array}{ll} \min _{L,S}&{}\quad \Vert L\Vert _* + \tau \Vert S\Vert _1,\\ \text{ s.t. } &{}\quad L+S=M. \end{array} \end{aligned}$$
(6.3)

where M is a known matrix. Note that the problem (6.3) is a standard instance of the model (3.1) and thus can be solved by the ADMM. We refer to [36] for strategies to solve the subproblems. We now demonstrate the efficiency of the relaxed ADMM for solving (6.3). In the experiments, we generated the data similarly to [36]. More specifically, we generate the rank-\(r_a\) matrix \(L^*\) as a product \(L_LL_R^T\), where \(L_L\) and \(L_R\) are independent \(n\times r_a\) matrices with i.i.d. Gaussian entries. For \(S^*\), we first generate an \(n\times n\) zero matrix S and let the number of nonzero entries be the nearest integer to \(sr\times n\times n\), where sr denotes the sparsity ratio. We then choose the locations of the nonzero entries of the underlying sparse matrix \(S^*\) randomly, and the nonzero elements are drawn from the standard Gaussian distribution. We use \(L^*+S^*\) to generate the matrix M. The model parameter \(\tau \) is always set to \(1/\sqrt{n}\), and the stopping criterion is set to

$$\begin{aligned} \max \left\{ \frac{\Vert L-L^*\Vert _F}{\Vert L^*\Vert _F},\frac{\Vert S-S^*\Vert _F}{\Vert S^*\Vert _F}\right\} \le 10^{-6}. \end{aligned}$$
(6.4)

We also randomly generated ten cases for each instance, and the results reported were averaged over ten runs.
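A NumPy sketch of the data generation just described (illustrative only; the function and parameter names are placeholders, not the paper's MATLAB code):

```python
import numpy as np

def make_rpca_instance(n, rank, sr, seed=0):
    """Generate M = L* + S* with a rank-`rank` L* and a sparse S* of sparsity ratio sr."""
    rng = np.random.default_rng(seed)
    L = rng.standard_normal((n, rank)) @ rng.standard_normal((rank, n))
    S = np.zeros((n, n))
    nnz = int(round(sr * n * n))
    idx = rng.choice(n * n, size=nnz, replace=False)
    S.flat[idx] = rng.standard_normal(nnz)            # standard Gaussian nonzeros
    return L + S, L, S
```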

It is well known that the numerical performance of the ADMM depends heavily on the value of \(\beta \), and how to determine the optimal value of \(\beta \) is a theoretically open question. To select a reasonably good value of \(\beta \), we first conduct some experiments to evaluate the performance of the ADMM with different values of \(\beta \). In the experiments, we test a set of values \(\beta = k *\tau \) with \(k=0.02, 0.08,\ldots ,5.12\) on different dimensions from \(n=100\) to \(n=500\). The initial point is set to zero. For simplicity, we always set \(r_a=0.1*n\) and the sparsity ratio \(sr=0.1\). For the relaxed ADMM, the relaxation factor is set to \(\gamma =1.3\). The numbers of iterations of the ADMM and the relaxed ADMM with respect to different values of \(\beta \) (\(\beta = k *\tau \)) are plotted in Fig. 3. According to the plots, the ADMM methods are very sensitive to the value of \(\beta \). For various problem dimensions, we can see that the optimal value of \(\beta \) can be different, and there does not exist an optimal choice of \(\beta \) that is independent of the data. We can also see that the relaxed ADMM performs better than the ADMM in most cases. In Fig. 3, we find that when \(k\in ({0.08},0.32)\), i.e., \(\beta \in ({0.08}\tau ,0.32\tau )\), the ADMM performs more robustly and efficiently, and both algorithms require fewer iterations than with other choices. When \(\beta >0.32\tau \), both algorithms need more iterations for the same problem, and their performances are almost the same.

Next, we report the experimental results of the ADMM and the relaxed ADMM for more cases. For the comparison, we simply set \(\beta =0.3*\tau \) for both methods, since this choice demonstrates good performance and is robust for different dimensions. The results of these algorithms are reported in Table 3. From these tests, we can see that the relaxed ADMM requires fewer iterations and less computing time than the ADMM to produce solutions of the same accuracy. For these test instances, the performance of the relaxed ADMM could be further improved by choosing the best relaxation factor; however, to simplify the presentation, we used a fixed factor for all cases.

7 Conclusions

In the literature, it is well known that the proximal point algorithm (PPA) can be accelerated by using the relaxation strategy proposed in [19]. In many cases, the relaxed PPA with \(\gamma \in [1.5, 1.8]\) achieves a notable improvement in convergence. In this paper, different from most previous works, we suggest relaxing the PPA by correcting the dual variable individually. For the PDHG proposed in [5, 6], the linearized augmented Lagrangian method (L-ALM) and the alternating direction method of multipliers (ADMM), which are special cases of the customized PPA, we studied their relaxed schemes and observed the corresponding speedup. With the guidance of the variational inequality framework, we established the global convergence of these relaxed algorithms and their worst-case convergence rates in both the ergodic and the non-ergodic sense. In conclusion, our results provide a new, theoretically grounded way of relaxing PPA-type algorithms.

It should be mentioned that this paper only discussed customized PPA algorithms for convex problems with at most two blocks. In many real applications, the problems may be modeled in the form of (1.1) with an objective function consisting of multiple block components, and it is often more challenging to handle them directly. For the multi-block case, we can also formulate the problems as variational inequalities and then design customized PPA algorithms to solve them. In fact, the multi-block Jacobian ADMM [12, 34] and its linearized version, the partial parallel ADMM [25], are all designed for this purpose and can be viewed as PPA algorithms. Consequently, the relaxation strategy proposed in this paper can be further extended to these customized PPA algorithms and therefore gives rise to their relaxed versions. On the other hand, the stepsize parameters of the PPA algorithms discussed in this paper can all be tuned at each iteration by self-adaptive rules, and the resulting performance would likely be further improved. It would thus be interesting to investigate self-adaptive rules for the relaxation factor when these algorithms are dynamically adjusted. We leave these as potential directions of future work.