1 Introduction

In this paper, we consider the following optimization problem:

$$\begin{aligned} \mathop {\mathrm {minimize}}_{x\in {\mathbb {R}}^n} f(x)+g(Ax), \end{aligned}$$
(1)

together with its dual problem:

$$\begin{aligned} \mathop {\mathrm {minimize}}_{z\in {\mathbb {R}}^m} f^*(-A^Tz)+g^*(z), \end{aligned}$$
(2)

where \(f:{{\,\mathrm{{\mathbb {R}}^n}\,}}\rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}\cup \{+\infty \}\) and \(g: {{\,\mathrm{{\mathbb {R}}^m}\,}}\rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}\cup \{+\infty \}\) are closed proper convex, \(A\in {\mathbb {R}}^{m\times n}\), and \(f^*,g^*\) are the convex conjugates of \(f\) and \(g\), respectively.

Formulations (1) and (2) abstract many application problems, including image restoration [59], magnetic resonance imaging [51], network optimization [23], computer vision [45], and earth mover’s distance [35]. They can be solved by primal–dual algorithms such as the primal–dual hybrid gradient (PDHG) method and the alternating direction method of multipliers (ADMM).

However, as first-order algorithms, PDHG and ADMM suffer from slow (tail) convergence. They may take thousands of iterations and still struggle to reach even four digits of accuracy. While they have many advantages, such as being easy to implement and friendly to parallelization, their sensitivity to problem conditioning is their main disadvantage.

To improve the performance of PDHG and ADMM, researchers have tried using preconditioners, which have been widely applied to forward–backward type methods [10, 16, 53], as well as to other methods [9, 17, 30, 52]. Depending on the application and how one applies splitting, preconditioned PDHG and ADMM may or may not have subproblems with closed-form solutions. When they do not, researchers have studied approximate subproblem solutions to reduce the total running time. In this work, we propose a new way of applying preconditioning that outperforms the existing state of the art.

1.1 Proposed Approach

Simply put, we find a way to use non-diagonal preconditioners (and thus far fewer iterations) while keeping very simple subproblem procedures (thus maintaining the advantages of PDHG and ADMM).

First, we present Preconditioned PDHG (PrePDHG) along with its convergence condition and a performance bound. We propose to choose preconditioners to optimize the bound. In the special case where one preconditioner is trivially fixed as an identity matrix, optimizing the bound gives us the optimal choice of the other preconditioner, which actually reduces PrePDHG to ADMM. This explains why ADMM often takes fewer iterations than PDHG.

Next, we study how to solve PrePDHG subproblems. In all applications we are aware of, only one of the two subproblems is ill-conditioned. (After all, we can always apply splitting to gather the ill-conditioned components into one subproblem.) Therefore, we choose a non-diagonal preconditioner for the ill-conditioned subproblem and a trivial or diagonal preconditioner for the other subproblem. Again, the pair of preconditioners should be chosen to (nearly) optimize the performance bound. Since the non-diagonal preconditioner introduces dependence between different coordinates, its subproblem generally does not have a closed-form solution. In particular, if the subproblem has an \(\ell _1\)-norm, which is often the reason why PDHG or ADMM is used, it typically loses its closed-form solution due to the preconditioner. Therefore, we propose to solve it approximately so that an accuracy condition is satisfied. Remarkably, there is no need to dynamically stop the subproblem procedure to honor the condition. Instead, the condition is automatically satisfied as long as one applies warm start and a common iterative procedure for some fixed number of iterations, which is new in the literature. Common choices of the procedure include proximal gradient descent, FISTA with restart, proximal block coordinate descent, and accelerated block-coordinate-gradient-descent (BCGD) methods (e.g., [1, 28, 37]). We call this method iPrePDHG (i for “inexact”).

Next, we establish the overall convergence of iPrePDHG. To handle the inexact subproblem, we first transform iPrePDHG into an equivalent form and then analyze a Lyapunov function to establish convergence. The technique in our proof appears to be new in the PDHG and ADMM literature.

Finally, we apply our approach to a few applications including image denoising, graph cut, optimal transport, and CT reconstruction. For CT reconstruction, we use a diagonal preconditioner in one subproblem and a non-diagonal preconditioner in the other, which we solve approximately. In each of the other applications, one subproblem uses no preconditioner, and the other uses a non-diagonal preconditioner. Using these preconditioners, we observed that iPrePDHG was 4–95 times faster than the existing state of the art.

Since ADMM is a special PrePDHG with one trivial preconditioner, our approach also applies to ADMM. In fact, for three of the above four applications, one of the two preconditioners is trivial, so their iPrePDHG is an inexact preconditioned ADMM.

1.2 Related Literature

The main references for PDHG are [11, 22, 59]. Many problems to which we apply PDHG have separable functions f or g, or both, so the resulting PDHG subproblems often (though not always) have closed-form solutions. When subproblems are simple, we care mainly about the convergence rate of PDHG, which depends on the problem conditioning. To accelerate PDHG, diagonal preconditioning [44] was proposed since its diagonal structure maintains closed-form solutions for the subproblems and, therefore, reduces iteration complexity without making each iteration more difficult. In comparison, non-diagonal preconditioners are much more effective at reducing iteration complexity, but their off-diagonal entries couple different components in the subproblems, causing the loss of closed-form subproblem solutions.

When a PDHG subproblem has no closed-form solution, one often uses an iterative algorithm to approximately solve it. We call this Inexact PDHG. Under certain conditions, Inexact PDHG still converges to the exact solution. Specifically, Rasch and Chambolle [46] use three different types of conditions to skillfully control the errors of the subproblems; all those errors need to be summable over the iterations, thereby requiring them to diminish asymptotically. In an interesting method from [6, 8], one subproblem computes a proximal operator of a convex quadratic function, which can include a preconditioner and still has a closed-form solution involving matrix inversion. This proximal operator is successively applied n times in each iteration, for \(n\ge 1\).

ADMM has different subproblems. One of its subproblems minimizes the sum of f(x) and a squared term involving Ax. Only when A has special structures does this subproblem have a closed-form solution. Inexact ADMM refers to ADMM with at least one of its subproblems solved inexactly. An absolute error criterion was introduced in [19], where the subproblem errors are controlled by a summable (thus diminishing) sequence of error tolerances. To simplify the choice of the sequence, a relative error criterion was adopted in several later works, where the subproblem errors are controlled by a single parameter multiplying certain quantities that one can compute during the iterations. In [40], the parameters need to be square summable. In [34], the parameters are constants when both objective functions are Lipschitz differentiable. In [20, 21], two possible outcomes of the algorithm are described: (i) infinite outer loops and finite inner loops, and (ii) finite outer loops with the last inner loop infinite; both guarantee convergence to a solution. On the other hand, it is unclear how to distinguish between them. Since there is no bound on the number of inner loops in case (i), one may mistake it for case (ii) and stop the algorithm before it converges.

There are works that apply certain kinds of preconditioning to accelerate ADMM. Paper [24] uses diagonal preconditioning and observes improved performance. After that, non-diagonal preconditioning was analyzed in [6, 8], which present effective preconditioners for specific applications. One of their preconditioners needs to be inverted (an inversion not needed in our method). Recently, preconditioning for problems with linear convergence has also been studied, with promising numerical performance [25].

Finally, there is another line of work that combines Nesterov-type acceleration techniques with primal–dual methods to obtain an accelerated convergence rate of \({\mathcal {O}}(1/k^2)\). This idea has been successfully applied to PDHG [11, 13, 31, 39], ADMM [7, 26, 27, 32], and linearized ADMM [41, 56]. We would like to point out that their contributions are orthogonal to this paper. For example, in order to have simple subproblems in accelerated PDHG, preconditioning is not applied. For accelerated linearized ADMM, certain proximal terms have to be added to the subproblems to guarantee a closed-form solution.

1.3 Organization

The rest of this paper is organized as follows: Sect. 2 establishes notation and reviews basics. In the first part of Sect. 3, we provide a criterion for choosing preconditioners. In its second part, we introduce the condition for inexact subproblems, which can be automatically satisfied by iterating a fixed number of certain inner loops. This method is called iPrePDHG. In the last part of Sect. 3, we establish the convergence of iPrePDHG. Section 4 describes specific preconditioners and reports numerical results. Finally, Sect. 5 concludes the paper.

2 Preliminaries

We use \(\Vert \cdot \Vert \) for \(\ell _2-\)norm and \(\langle \cdot , \cdot \rangle \) for dot product. We use \(I_{n}\) to denote the identity matrix of size \(n\times n\). \(M\succ 0\) means M is a symmetric, positive definite matrix, and \(M\succeq 0\) means M is a symmetric, positive semidefinite matrix.

We write \(\lambda _{\text {min}}(M)\) and \(\lambda _{\text {max}}(M)\) as the smallest and the largest eigenvalues of M, respectively, and \(\kappa (M)=\frac{\lambda _{\text {max}}(M)}{\lambda _{\text {min}}(M)}\) as the condition number of M. For \(M\succeq 0\), let \(\Vert \cdot \Vert _M\) and \(\langle \cdot , \cdot \rangle _M\) denote the semi-norm and inner product induced by M, respectively. If \(M\succ 0\), \(\Vert \cdot \Vert _M\) is a norm.

For a proper closed convex function \(\phi :{{\,\mathrm{{\mathbb {R}}^n}\,}}\rightarrow {{\,\mathrm{{\mathbb {R}}\cup \{+\infty \}}\,}}\), its subdifferential at \(x\in \mathbf {dom}\phi \) is written as

$$\begin{aligned} \partial \phi (x)=\{v\in {{\,\mathrm{{\mathbb {R}}^n}\,}}\,|\, \phi (z)\ge \phi (x)+\langle v, z-x\rangle , \,\,\forall z\in {{\,\mathrm{{\mathbb {R}}^n}\,}}\}, \end{aligned}$$

and its convex conjugate as

$$\begin{aligned} \phi ^*(y)=\sup _{x\in {{\,\mathrm{{\mathbb {R}}}\,}}^n}\{ \langle y, x\rangle -\phi (x) \}. \end{aligned}$$

We have \(y\in \partial \phi (x)\) if and only if \(x\in \partial \phi ^*(y)\).

For any \(M\succ 0\), we define the extended proximal operator of \(\phi \) as

$$\begin{aligned} {{\,\mathrm{Prox}\,}}^M_{\phi }(x):=\mathop {\mathrm{arg\,min}}\limits _{y\in {{\,\mathrm{{\mathbb {R}}^n}\,}}}\left\{ \phi (y)+\frac{1}{2}\Vert y-x\Vert ^2_{M}\right\} . \end{aligned}$$
(3)

If \(M=\gamma ^{-1} I\) for \(\gamma >0\), it reduces to a classic proximal operator.

We also have the following generalization of Moreau’s Identity:

Lemma 1

([15], Theorem 3.1(ii)) For any proper closed convex function \(\phi \) and \(M\succ 0\), we have

$$\begin{aligned} x={{\,\mathrm{Prox}\,}}^M_{\phi }(x)+M^{-1}{{\,\mathrm{Prox}\,}}^{M^{-1}}_{\phi ^*}(Mx). \end{aligned}$$
(4)

We say a proper closed function \(\phi \) is a Kurdyka–Lojasiewicz (KL) function if, for each \(x_0\in \mathbf{dom} \phi \), there exist \(\eta \in (0,\infty ]\), a neighborhood U of \(x_0,\) and a continuous concave function \(\varphi : [0, \eta )\rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}_+\) such that:

  1. \(\varphi (0) = 0\),

  2. \(\varphi \) is \(C^1\) on \((0,\eta )\),

  3. for all \(s\in (0,\eta )\), \(\varphi '(s)>0\),

  4. for all \(x\in U\cap \{x\,|\,\phi (x_0)<\phi (x)<\phi (x_0)+\eta \}\), the KL inequality holds:

    $$\begin{aligned} \varphi '(\phi (x)-\phi (x_0))\text {dist}(0,\partial \phi (x))\ge 1. \end{aligned}$$

3 Main Results

This section presents the key results of our paper. In Sect. 3.1 we demonstrate how to apply preconditioning to PDHG. Then, we establish rules of preconditioner selection in Sect. 3.2. In Sect. 3.3, we present the proposed method iPrePDHG. Finally, we establish the convergence of iPrePDHG in Sect. 3.4.

Throughout this section, we assume the following regularity assumptions:

Assumption 1

  1. \(f:{{\,\mathrm{{\mathbb {R}}^n}\,}}\rightarrow {{\,\mathrm{{\mathbb {R}}\cup \{+\infty \}}\,}}\), \(g:{{\,\mathrm{{\mathbb {R}}^m}\,}}\rightarrow {{\,\mathrm{{\mathbb {R}}\cup \{+\infty \}}\,}}\) are proper closed convex.

  2. A primal–dual solution pair \((x^{\star }, z^{\star })\) of (1) and (2) exists, i.e.,

    $$\begin{aligned} {\mathbf {0}}\in \partial f(x^{\star })+A^Tz^{\star },\quad {\mathbf {0}}\in \partial g(Ax^{\star })-z^{\star }. \end{aligned}$$

Problem (1) also has the following convex-concave saddle-point formulation:

$$\begin{aligned} \min _{x\in {\mathbb {R}}^n}\max _{z\in {\mathbb {R}}^m} \varphi (x,z):=f(x)+\langle Ax, z\rangle -g^*(z). \end{aligned}$$
(5)

A primal–dual solution pair \((x^{\star }, z^{\star })\) is a solution of (5).

3.1 Preconditioned PDHG

The method of primal–dual hybrid gradient or PDHG [11, 59] for solving (1) refers to the iteration

$$\begin{aligned} \begin{aligned} {x}^{k+1}&={{\,\mathrm{Prox}\,}}_{\tau f}(x^{k}-\tau A^Tz^{k}),\\ {z}^{k+1}&={{\,\mathrm{Prox}\,}}_{\sigma g^*}(z^{k}+\sigma A(2{x}^{k+1}-x^{k})). \end{aligned} \end{aligned}$$
(6)
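For concreteness, the iteration (6) can be sketched in a few lines of NumPy; the handles prox_tau_f and prox_sigma_gstar (the proximal operators of \(\tau f\) and \(\sigma g^*\)) and the dense matrix A below are illustrative placeholders, not part of the original text.

```python
import numpy as np

def pdhg(prox_tau_f, prox_sigma_gstar, A, x0, z0, tau, sigma, num_iters=1000):
    """Plain PDHG iteration (6); prox_tau_f and prox_sigma_gstar are assumed to be
    closed-form proximal operators of tau*f and sigma*g^*, respectively."""
    x, z = x0.copy(), z0.copy()
    for _ in range(num_iters):
        x_new = prox_tau_f(x - tau * (A.T @ z))                   # primal update
        z = prox_sigma_gstar(z + sigma * (A @ (2 * x_new - x)))   # dual update at the extrapolated point
        x = x_new
    return x, z
```

For example, when \(g=\Vert \cdot \Vert _1\), prox_sigma_gstar is simply the projection onto the \(\ell _\infty \) unit ball, independently of \(\sigma \).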

When \(\frac{1}{\tau \sigma }\ge \Vert A\Vert ^2\), the iterates of (6) converge [11] to a primal–dual solution pair of (1). We can generalize (6) by applying preconditioners \(M_1, M_2\succ 0\) (their choices are discussed below) to obtain Preconditioned PDHG or PrePDHG:

$$\begin{aligned} \begin{aligned} {x}^{k+1}&={{\,\mathrm{Prox}\,}}^{M_1}_{ f}\left( x^{k}-M_1^{-1} A^Tz^{k}\right) ,\\ {z}^{k+1}&={{\,\mathrm{Prox}\,}}^{M_2}_{ g^*}\left( z^{k}+M_2^{-1} A(2{x}^{k+1}-x^{k})\right) , \end{aligned} \end{aligned}$$
(7)

where the extended proximal operators \({{\,\mathrm{Prox}\,}}^{M_1}_{f}\) and \({{\,\mathrm{Prox}\,}}^{M_2}_{g^*}\) are defined in (3). We can obtain the convergence of PrePDHG using the analysis in [12].

There is no need to compute \(M_1^{-1}\) and \(M_2^{-1}\) since (7) is equivalent to

$$\begin{aligned} \begin{aligned} {x}^{k+1}&=\mathop {\mathrm{arg\,min}}\limits _{x\in {{\,\mathrm{{\mathbb {R}}^n}\,}}}\left\{ f(x)+\langle x-x^k, A^Tz^k\rangle + \frac{1}{2}\Vert x-x^k\Vert ^2_{M_1}\right\} ,\\ {z}^{k+1}&=\mathop {\mathrm{arg\,min}}\limits _{z\in {{\,\mathrm{{\mathbb {R}}^m}\,}}}\left\{ g^*(z)-\langle z-z^k, A(2x^{k+1}-x^k)\rangle +\frac{1}{2}\Vert z-z^k\Vert ^2_{M_2}\right\} . \end{aligned} \end{aligned}$$
(8)

3.2 Choice of Preconditioners

In this section, we discuss how to select appropriate preconditioners \(M_1\) and \(M_2\). As a by-product, we show that ADMM corresponds to choosing \(M_1=\frac{1}{\tau } I_{n}\) and optimally choosing \(M_2=\tau AA^T\), thereby explaining why ADMM appears to be faster than PDHG.

The following well-known lemma characterizes primal–dual solution pairs of (1) and (2). For completeness, we include its proof in “Appendix A”.

Lemma 2

Under Assumption 1, \((X, Z)\) is a primal–dual solution pair of (1) if and only if \(\varphi (X, z)-\varphi (x,Z)\le 0\) for any \((x,z)\in {\mathbb {R}}^{n+m}\), where \(\varphi \) is given in the saddle-point formulation (5).

We present the following ergodic convergence result, adapted from [12, Theorem 1].

Theorem 1

Let \((x^k, z^k), k=0,1,\dots ,N\) be a sequence generated by PrePDHG (7). Under Assumption 1, if in addition

$$\begin{aligned} {\tilde{M}}:=\begin{pmatrix} M_1 &{}\quad -A^T\\ -A &{}\quad M_2 \end{pmatrix}\succeq 0, \end{aligned}$$
(9)

then, for any \(x\in {{\,\mathrm{{\mathbb {R}}^n}\,}}\) and \(z\in {{\,\mathrm{{\mathbb {R}}^m}\,}}\), it holds that

$$\begin{aligned} \varphi (X^N, z)-\varphi (x, Z^N)\le \frac{1}{2N}(x-x^0, z-z^0)\begin{pmatrix} M_1 &{}\quad -A^T\\ -A &{}\quad M_2 \end{pmatrix} \begin{pmatrix} x-x^0\\ z-z^0 \end{pmatrix}, \end{aligned}$$
(10)

where \(X^N=\frac{1}{N}\sum _{i=1}^{N}x^i\) and \( Z^N=\frac{1}{N}\sum _{i=1}^{N}z^i\).

Proof

This follows from Theorem 1 of [12] by setting \(L_f=0\), \(\frac{1}{\tau }D_x(x,x_0)=\frac{1}{2}\Vert x-x^0\Vert ^2_{M_1}\), \(\frac{1}{\sigma }D_z(z,z_0)=\frac{1}{2}\Vert z-z^0\Vert ^2_{M_2}\), and \(K=A\). Note that in Remark 1 of [12], \(D_x\) and \(D_z\) need to be \(1-\)strongly convex only to ensure their inequality (13), which is exactly our (9). Since we assume (9) directly, we do not need \(D_x\) and \(D_z\) to be strongly convex. \(\square \)

Based on the above results, one approach to accelerate convergence is to choose preconditioners \(M_1\) and \(M_2\) to obey (9) and make the right-hand side of (10) as small as possible for all \(x, z, x_0, z_0\). When a pair of preconditioner matrices attains this minimum, we say they are optimal. When one of them is fixed, the other that attains the minimum is also called optimal.

By the Schur complement, condition (9) is equivalent to \(M_2\succeq AM_1^{-1}A^T\). Hence, for any given \(M_1\succ 0\), the optimal \(M_2\) is \(AM_1^{-1}A^T\).
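As a quick numerical sanity check (not from the original text), one can verify (9) for candidate dense preconditioners by inspecting the smallest eigenvalue of \({\tilde{M}}\); with the optimal choice \(M_2=AM_1^{-1}A^T\), the Schur complement \(M_2-AM_1^{-1}A^T\) vanishes and (9) holds with equality.

```python
import numpy as np

def satisfies_condition_9(M1, M2, A, tol=1e-10):
    """Check condition (9): the block matrix [[M1, -A^T], [-A, M2]] must be positive semidefinite."""
    Mtilde = np.block([[M1, -A.T], [-A, M2]])
    return np.linalg.eigvalsh(Mtilde).min() >= -tol
```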

Original PDHG (6) corresponds to \(M_1=\frac{1}{\tau }I_{n}\), \(M_2=\frac{1}{\sigma }I_{m}\) with \(\tau \) and \(\sigma \) obeying \(\frac{1}{\tau \sigma }\ge \Vert A\Vert ^2\) for convergence. In “Appendix B”, we show that ADMM for problem (1) corresponds to setting \(M_1=\frac{1}{\tau } I_{n}\) and \(M_2={\tau } AA^T\); this \(M_2\) is optimal since \(AM_1^{-1}A^T=\tau AA^T=M_2\). (This is related to, but different from, the result in [11, Sect. 4.3] stating that PDHG is equivalent to a preconditioned ADMM.) In the next section, we show that when the \(z-\)subproblem is solved inexactly, the choice \(M_1=\frac{1}{\tau } I_{n}\), \(M_2={\tau } AA^T+\theta I_{m}\) with a small \(\theta >0\) guarantees convergence (see Proposition 2).

By using more general pairs of \(M_1,M_2\), we can potentially have even fewer iterations of PrePDHG than ADMM.

3.3 PrePDHG with Fixed Inner Iterations

It wastes time to solve the subproblems in (8) very accurately. It is more efficient to develop a proper condition and stop the subproblem procedure, i.e., inner iterations, once the condition is satisfied. It is even better if we can simply fix the number of inner iterations and still guarantee global convergence.

In this subsection, we describe the “bounded relative error” condition for the z-subproblem in (7) and then show that it can be satisfied by running a fixed number of inner iterations with warm start, uniformly over all outer iterations, which is new in the literature.

Definition 1

(Bounded relative error condition) Given \(x^k\), \(x^{k+1}\) and \(z^k\), we say that the z-subproblem in PrePDHG (7) is solved to a bounded relative error by some iterator S, if there is a constant \(c>0\) such that

$$\begin{aligned}&{\mathbf {0}}\in \partial g^*(z^{k+1})+M_2\left( z^{k+1}-z^{k}-M_2^{-1}A(2x^{k+1}-x^k)\right) +\varepsilon ^{k+1}, \end{aligned}$$
(11)
$$\begin{aligned}&\Vert \varepsilon ^{k+1}\Vert \le c\Vert z^{k+1}-z^{k}\Vert . \end{aligned}$$
(12)

Remarkably, this condition does not need to be checked at run time. For a fixed \(c>0\), the condition can be satisfied by applying warm start and a fixed number of inner iterations using, for example, S being the proximal gradient iteration (Theorem 2). One can also use faster solvers, e.g., FISTA with restart [42], and solvers that suit the subproblem structure, e.g., cyclic proximal BCD (Theorem 3). Although the error in solving the z-subproblems appears to be neither summable nor square summable, convergence can still be established. But first, we summarize this method in Algorithm 1.

Algorithm 1: iPrePDHG (pseudocode figure)
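To make Algorithm 1 concrete, the following NumPy sketch assumes \(M_1=\frac{1}{\tau }I_n\) (so the x-update is a standard proximal step) and takes S to be the proximal-gradient step of Theorem 2; prox_tau_f, prox_gamma_gstar, A, and M2 are user-supplied placeholders.

```python
import numpy as np

def iprepdhg(prox_tau_f, prox_gamma_gstar, A, M2, x0, z0, tau, gamma, p=2, outer_iters=500):
    """Sketch of iPrePDHG with M1 = (1/tau)*I and p warm-started proximal-gradient
    inner steps applied to the z-subproblem in (8)."""
    x, z = x0.copy(), z0.copy()
    for _ in range(outer_iters):
        x_new = prox_tau_f(x - tau * (A.T @ z))   # exact x-update
        r = A @ (2 * x_new - x)                   # A(2x^{k+1} - x^k)
        w = z.copy()                              # warm start: z^{k+1}_0 = z^k
        for _ in range(p):                        # fixed number of inner iterations
            grad_h2 = M2 @ (w - z) - r            # gradient of h2(w) = 0.5*||w - z - M2^{-1} r||_{M2}^2
            w = prox_gamma_gstar(w - gamma * grad_h2)
        x, z = x_new, w
    return x, z
```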

Theorem 2

Take Assumption 1. Suppose in iPrePDHG, or Algorithm 1, we choose S as the proximal-gradient step with stepsize \(\gamma \in (0, \frac{2\lambda _{\mathrm {min}}(M_2)}{\lambda _{\mathrm {max}}^2(M_2)})\) and repeat it p times, where \(p \ge 1\). Then, \(z^{k+1}=z^{k+1}_p\) is an approximate solution to the z-subproblem up to a bounded relative error in Definition 1 for

$$\begin{aligned} c=c(p)=\frac{\frac{1}{\gamma }+\lambda _{\mathrm {max}}(M_2)}{1-\rho ^p}(\rho ^p+\rho ^{p-1}), \end{aligned}$$
(13)

where \(\rho =\sqrt{1-\gamma (2\lambda _{\mathrm {min}}(M_2)-\gamma \lambda _{\mathrm {max}}^2(M_2))}<1\).

Proof

The z-subproblem in (8) is of the form

$$\begin{aligned} \mathop {\mathrm {minimize}}_{z\in {{\,\mathrm{{\mathbb {R}}^m}\,}}} h_1(z)+h_2(z), \end{aligned}$$
(14)

for \(h_1(z)=g^*(z)\) and \(h_2(z)=\frac{1}{2}\Vert z-z^k-M_2^{-1}A(2x^{k+1}-x^k)\Vert ^2_{M_2}.\) With our choice of S as the proximal-gradient descent step, the inner iterations are

$$\begin{aligned} z^{k+1}_0&=z^{k},\nonumber \\ z^{k+1}_{i+1}&={{\,\mathrm{Prox}\,}}_{\gamma h_1}\left( z^{k+1}_{i}-\gamma \nabla h_2\left( z^{k+1}_{i}\right) \right) , \quad i=0, 1,\dots ,p-1, \end{aligned}$$
(15)

Concerning the last iterate \(z^{k+1}=z^{k+1}_p\), we have from the definition of \({{\,\mathrm{Prox}\,}}_{\gamma h_1}\) that

$$\begin{aligned} {\mathbf {0}}\in \partial h_1\left( z^{k+1}_p\right) +\nabla h_2\left( z^{k+1}_{p-1}\right) +\frac{1}{\gamma }\left( z^{k+1}_{p}-z^{k+1}_{p-1}\right) . \end{aligned}$$

Compare this with (11) and use \(z^{k+1}=z^{k+1}_p\) to get

$$\begin{aligned} \varepsilon ^{k+1}=\frac{1}{\gamma }\left( z^{k+1}_p-z^{k+1}_{p-1}\right) +\nabla h_2\left( z^{k+1}_{p-1}\right) -\nabla h_2\left( z^{k+1}_{p}\right) . \end{aligned}$$

It remains to show that \(\varepsilon ^{k+1}\) satisfies (12).

Let \(z^{k+1}_{\star }\) be the solution of (14), \(\alpha =\lambda _{\text {min}}(M_2)\), and \(\beta =\lambda _{\text {max}}(M_2)\). Then \(h_1(z)\) is convex and \(h_2(z)\) is \(\alpha \)-strongly convex and \(\beta \)-Lipschitz differentiable. Consequently, [3, Prop. 26.16(ii)] gives

$$\begin{aligned} \left\| z^{k+1}_{i}-z^{k+1}_{\star }\right\| \le \rho ^{i} \left\| z^{k+1}_0-z^{k+1}_{\star }\right\| , \quad \forall i= 0, 1, \dots , p, \end{aligned}$$

where \(\rho =\sqrt{1-\gamma (2\alpha -\gamma \beta ^2)}\).

Let \(a_i = \Vert z^{k+1}_i-z^{k+1}_{\star }\Vert \). Then, \(a_i\le \rho ^i a_0\). We can derive

$$\begin{aligned} \begin{aligned} \Vert \varepsilon ^{k+1}\Vert&\le \left( \frac{1}{\gamma }+\beta \right) \left\| z^{k+1}_p-z^{k+1}_{p-1}\right\| \le \left( \frac{1}{\gamma }+\beta \right) (a_p+a_{p-1})\\&\le \left( \frac{1}{\gamma }+\beta \right) (\rho ^p+\rho ^{p-1})a_0. \end{aligned} \end{aligned}$$
(16)

On the other hand, we have

$$\begin{aligned} \Vert z^{k+1}-z^k\Vert&\ge a_0-a_p\ge (1-\rho ^p)a_0. \end{aligned}$$
(17)

Combining (16) and (17) yields

$$\begin{aligned} \Vert \varepsilon ^{k+1}\Vert \le c\Vert z^{k+1}-z^k\Vert , \end{aligned}$$

where c is given in (13). \(\square \)

Theorem 2 uses the iterator S that is the proximal-gradient step. It is straightforward to extend its proof to S being the FISTA step with restart since it is also linearly convergent [42]. We omit the proof.

In our next theorem, we let S be the iterator of one epoch of the cyclic proximal BCD method. A BCD method updates one block of coordinates at a time while fixing the remaining blocks. In one epoch of cyclic BCD, all the blocks of coordinates are sequentially updated, and every block is updated once. In cyclic proximal BCD, each block of coordinates is updated by a proximal-gradient step, just like (15) except that only the chosen block is updated each time. When \(h_1\) is block separable, each update costs only a fraction of updating all the blocks together. When different blocks are updated one after another, the Gauss–Seidel effect brings more progress. In addition, since the Lipschitz constant of each block gradient of \(h_2\) is typically smaller than that of \(\nabla h_2\), one can use a larger stepsize \(\gamma \) and make potentially even faster progress. Therefore, the iterator of cyclic proximal BCD is a better choice for S.

In summary, with \(h_1(z)=g^*(z)\) and \(h_2(z)=\frac{1}{2}\Vert z-z^k-M_2^{-1}A(2x^{k+1}-x^k)\Vert ^2_{M_2}\), an epoch of cyclic proximal BCD for the \(z-\)subproblem is written as

$$\begin{aligned} z^{k+1}_0&=z^{k},\\ z^{k+1}_{i+1}&=S\left( z^{k+1}_i, x^{k+1}, x^k\right) , \quad i=0,1,\dots ,p-1,\\ z^{k+1}&=z^{k+1}_p. \end{aligned}$$

where S is the iterator of cyclic proximal BCD. Define

$$\begin{aligned} T(z):={{\,\mathrm{Prox}\,}}_{\gamma h_1}(z-\gamma \nabla h_2(z)),\qquad B(z):= \frac{1}{\gamma }(z-T(z)), \end{aligned}$$

and the jth coordinate operator of B:

$$\begin{aligned} B_j(z)=(0,\dots ,(B(z))_j,\dots ,0), \quad j = 1,2,\dots ,l. \end{aligned}$$

Then, we have

$$\begin{aligned} z^{k+1}_{i+1}=S\left( z^{k+1}_i, x^{k+1}, x^k\right) =(I-\gamma B_l)\cdots (I-\gamma B_2)(I-\gamma B_1)z^{k+1}_i. \end{aligned}$$
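A minimal (unoptimized) sketch of one such epoch, assuming the blocks are given as index arrays and the proximal operators of \(\gamma g_j^*\) are available in closed form; all names below are illustrative.

```python
import numpy as np

def cyclic_prox_bcd_epoch(w, z_prev, r, M2, gamma, blocks, block_prox):
    """One epoch of cyclic proximal BCD (the iterator S) for the z-subproblem, where
    h2(w) = 0.5*||w - z_prev - M2^{-1} r||_{M2}^2 and r = A(2x^{k+1} - x^k)."""
    w = w.copy()
    for j, idx in enumerate(blocks):
        grad_block = (M2 @ (w - z_prev) - r)[idx]             # block gradient of h2 at the current w
        w[idx] = block_prox[j](w[idx] - gamma * grad_block)   # proximal step on block j only
    return w
```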

Theorem 3

Let Assumption 1 hold and g be block separable, i.e., \(z=(z_1,z_2,\dots ,z_l)\) and \(g(z)= \sum _{j=1}^l g_j(z_j)\). Suppose in iPrePDHG, or Algorithm 1, we choose S as the iterator of cyclic proximal BCD with stepsize \(\gamma \) satisfying

$$\begin{aligned} 0<\gamma&\le \min \left\{ \frac{2\lambda _{\mathrm {min}}(M_2)}{\lambda _{\mathrm {max}}^2(M_2)},\frac{1-\sqrt{1-\gamma \left( 2\lambda _{\mathrm {min}}(M_2)-\gamma \lambda _{\mathrm {max}}^2(M_2)\right) }}{4\sqrt{2}\gamma l\lambda _{\mathrm {max}}(M_2)},\right. \\&\left. \quad \frac{1}{4l\lambda _{\mathrm {max}}(M_2)}, \frac{2l\lambda _{\mathrm {max}}(M_2)}{17l\lambda _{\mathrm {max}} (M_2)+2\bigg (\frac{1-\sqrt{1-\gamma \left( 2\lambda _{\mathrm {min}}(M_2)-\gamma \lambda _{\mathrm {max}}^2(M_2)\right) }}{\gamma }\bigg )^2}\right\} , \end{aligned}$$

and we set \(p \ge 1\). Then, \(z^{k+1}=z_p^{k+1}\) is an approximate solution to the z-subproblem up to a bounded relative error in Definition 1 for

$$\begin{aligned} c=c(p)=\frac{\left( l\lambda _{\mathrm {max}}(M_2)+\frac{1}{\gamma }\right) (\rho ^p+\rho ^{p-1})}{1-\rho ^p}, \end{aligned}$$
(18)

where \(\rho =1-\frac{\big (1-\sqrt{1-\gamma (2\lambda _{\mathrm {min}}(M_2)-\gamma \lambda _{\mathrm {max}}^2(M_2))}\big )^2}{2\gamma }<1\).

Proof

See “Appendix C”. \(\square \)

3.4 Global Convergence of iPrePDHG

In this subsection, we show the convergence of Algorithm 1. Our approach first transforms Algorithm 1 into an equivalent algorithm in Proposition 1 below and then proves its convergence in Theorems 5 and 6 below.

First, let us show that PrePDHG (7) is equivalent to an algorithm applied to the dual problem (2). This equivalence is analogous to the equivalence between PDHG (6) and Linearized ADMM applied to the dual problem (2), shown in [22]. Specifically, PrePDHG is equivalent to

$$\begin{aligned} \begin{aligned} z^{k+1}&={{\,\mathrm{Prox}\,}}^{M_2}_{g^*}\left( z^{k}+M_2^{-1}AM_1^{-1}\left( -A^Tz^{k}-y^{k}+u^{k}\right) \right) ,\\ y^{k+1}&={{\,\mathrm{Prox}\,}}^{M_1^{-1}}_{f^*}(u^{k}-A^Tz^{k+1}),\\ u^{k+1}&=u^{k}-A^Tz^{k+1}-y^{k+1}. \end{aligned} \end{aligned}$$
(19)

When \(M_1=\frac{1}{\tau }I, M_2=\lambda I\), (19) reduces to Linearized ADMM, also known as Split Inexact Uzawa [58].

Furthermore, iPrePDHG in Algorithm 1 is equivalent to (19) with inexact subproblems, which we present in Algorithm 2.

Algorithm 2 (pseudocode figure)

Proposition 1

Under Assumption 1 and the transforms \(u^{k}=M_1x^{k}\), \(y^{k+1}=u^{k}-A^T{z}^{k}-u^{k+1}\), PrePDHG (7) is equivalent to (19), and iPrePDHG in Algorithm 1 is equivalent to Algorithm 2.

Proof

Set \(u^{k}=M_1x^{k}\), \(y^{k+1}=u^{k}-A^T{z}^{k}-u^{k+1}\). Then (4) and (7) yield

$$\begin{aligned} y^{k+1}=M_1x^k-A^Tz^k-M_1x^{k+1}={{\,\mathrm{Prox}\,}}^{M_1^{-1}}_{f^*}(u^{k}-A^T{z}^{k}), \end{aligned}$$

and

$$\begin{aligned} u^{k+1}&=u^{k}-A^T{z}^{k}-y^{k+1},\\ {z}^{k+1}&={{\,\mathrm{Prox}\,}}^{M_2}_{g^*}\left( {z}^{k}+M_2^{-1}AM_1^{-1}(-A^T{z}^{k}-y^{k+1}+u^{k+1})\right) . \end{aligned}$$

If the z-update is performed first, then we arrive at (19).

In iPrePDHG or Algorithm 1, we are solving the z-subproblem of PrePDHG (7) approximately to the relative error in Definition 1. This is equivalent to doing the same to the z-subproblem of (19), which yields Algorithm 2. \(\square \)

Let us define the following generalized augmented Lagrangian:

$$\begin{aligned} L(z,y,u)=g^*(z)+f^*(y)+\left\langle -A^Tz-y, M_1^{-1}u\right\rangle +\frac{1}{2}\Vert A^Tz+y\Vert ^2_{M_1^{-1}}. \end{aligned}$$
(20)

Inspired by [54], we use (20) as the Lyapunov function to establish convergence of Algorithm 2 and, equivalently, the convergence of Algorithm 1. This appears to be a new proof technique for inexact PDHG and inexact ADMM.

We first establish subsequential convergence of iPrePDHG in Algorithm 1 under the following additional assumptions.

Assumption 2

  1. f(x) is \(\mu _f-\)strongly convex.

  2. \(g^*(z)+f^*(-A^Tz)\) is coercive, i.e., \( \lim _{\Vert z\Vert \rightarrow \infty }g^*(z)+f^*(-A^Tz)=\infty . \)

To establish convergence of iPrePDHG in Algorithm 1, we also need the following assumption.

Assumption 3

L(zyu) is a KL function.

Assumption 3 is true when both \(g^*(z)\) and \(f^*(y)\) are semi-algebraic or, more generally, definable in an o-minimal structure (see Sect. 2.2 of [2], Sect. 2.2 of [56], and the references therein for more details).

Note that under Assumptions 2 and 3, it is not necessarily true that PDHG is linearly convergent.

Theorem 4

Take Assumptions 1 and 2. Choose any preconditioners \(M_1, M_2\) and inner iteration number p such that

$$\begin{aligned} C_1&=\frac{1}{2}M_1^{-1}-\frac{\lambda _{\text {max}}(M_1)}{\mu _f^2}I_{n}\succ 0, \end{aligned}$$
(21)
$$\begin{aligned} C_2&=M_2-\frac{1}{2}AM_1^{-1}A^T-c(p)I_{m}\succ 0, \end{aligned}$$
(22)

where c(p) depends on the z-subproblem iterator S and \(M_2\) (e.g., (13) and (18)). Define \(L^k:=L(z^{k},y^{k},u^{k})\). Then, Algorithm 2 satisfies the following sufficient descent and lower boundedness properties, respectively:

$$\begin{aligned} L^k-L^{k+1}&\ge \Vert y^k-y^{k+1}\Vert ^2_{C_1}+\Vert z^{k}-z^{k+1}\Vert ^2_{C_2}, \end{aligned}$$
(23)
$$\begin{aligned} L^k&\ge g^*(z^{\star })+f^*(-A^Tz^{\star })>-\infty . \end{aligned}$$
(24)

Proof

Since the z-subproblem of Algorithm 2 is solved to the bounded relative error in Def. 1, we have

$$\begin{aligned} {\mathbf {0}}&\in \partial g^*(z^{k+1})+M_2\left( z^{k+1}-z^k-M_2^{-1}AM_1^{-1}(-A^Tz^k-y^k+u^k)\right) +\varepsilon ^{k+1}, \end{aligned}$$
(25)

where \(\varepsilon ^{k+1}\) satisfies (12):

$$\begin{aligned} \Vert \varepsilon ^{k+1}\Vert \le c(p)\Vert z^{k+1}-z^k\Vert . \end{aligned}$$
(26)

The y and u updates produce

$$\begin{aligned} {\mathbf {0}}&=\nabla f^*(y^{k+1})+M_1^{-1}(y^{k+1}-u^k+A^Tz^{k+1})=\nabla f^*(y^{k+1})-M_1^{-1}u^{k+1}, \end{aligned}$$
(27)
$$\begin{aligned} u^{k+1}&=u^k-A^Tz^{k+1}-y^{k+1}. \end{aligned}$$
(28)

In order to show (23), let us write

$$\begin{aligned} g^*(z^k)&\ge g^*(z^{k+1})\\&\quad +\left\langle M_2(z^k-z^{k+1})+AM_1^{-1}(-A^Tz^k-y^k+u^k)-\varepsilon ^{k+1}, z^k-z^{k+1}\right\rangle ,\\ f^*(y^k)&\ge f^*(y^{k+1})+\left\langle M_1^{-1}u^{k+1}, y^k-y^{k+1}\right\rangle . \end{aligned}$$

Assembling these inequalities with (26) gives us

$$\begin{aligned} L^k-L^{k+1}&\ge \Vert z^k-z^{k+1}\Vert ^2_{M_2-c(p)I_{m}}\nonumber \\&\quad +\left\langle AM_1^{-1}(-A^Tz^k-y^k+u^k), z^k-z^{k+1}\right\rangle +\left\langle M_1^{-1}u^{k+1}, y^k-y^{k+1}\right\rangle \nonumber \\&\quad +\left\langle -A^Tz^k-y^k, M_1^{-1}u^k\right\rangle \nonumber \\&\quad -\left\langle A^Tz^{k+1}-y^{k+1}, M_1^{-1}(u^k-A^Tz^{k+1}-y^{k+1})\right\rangle \nonumber \\&\quad +\frac{1}{2}\Vert A^Tz^k+y^k\Vert ^2_{M_1^{-1}} -\frac{1}{2}\Vert A^Tz^{k+1}+y^{k+1}\Vert ^2_{M_1^{-1}} \end{aligned}$$
(29)
$$\begin{aligned}&=\Vert z^k-z^{k+1}\Vert ^2_{M_2-c(p)I_{m}}\\&\quad +\left\langle AM_1^{-1}(-A^Tz^k-y^k), z^k-z^{k+1}\right\rangle +\left\langle M_1^{-1}u^{k+1}, y^k-y^{k+1}\right\rangle \end{aligned}$$
(A)
$$\begin{aligned}&+\left\langle -y^k, M_1^{-1}u^k\right\rangle -\left\langle -y^{k+1}, M_1^{-1}u^k\right\rangle \\&+\frac{1}{2}\Vert A^Tz^k+y^k\Vert ^2_{M_1^{-1}}-\frac{3}{2}\Vert A^Tz^{k+1} +y^{k+1}\Vert ^2_{M_1^{-1}}, \end{aligned}$$
(B)

where the terms in (A) and (B) simplify to

$$\begin{aligned} \left\langle AM_1^{-1}(-A^Tz^k-y^k), z^k-z^{k+1}\right\rangle +\left\langle M_1^{-1}(-A^Tz^{k+1}-y^{k+1}), y^k-y^{k+1}\right\rangle . \end{aligned}$$
(30)

Apply the following cosine rule on the two inner products above:

$$\begin{aligned} \langle a-b, a-c \rangle _{M_1^{-1}}=\frac{1}{2}\Vert a-b\Vert ^2_{M_1^{-1}} +\frac{1}{2}\Vert a-c\Vert ^2_{M_1^{-1}}-\frac{1}{2}\Vert b-c\Vert ^2_{M_1^{-1}}. \end{aligned}$$

Set \(a=A^Tz^k, c=A^Tz^{k+1}\), and \(b=-y^k\) to obtain

$$\begin{aligned}&\left\langle AM_1^{-1}(-A^Tz^k-y^k), z^k-z^{k+1}\right\rangle \end{aligned}$$
(31)
$$\begin{aligned}&\quad =-\frac{1}{2}\Vert A^Tz^k+y^k\Vert ^2_{M_1^{-1}}-\frac{1}{2} \Vert A^Tz^k-A^Tz^{k+1}\Vert ^2_{M_1^{-1}}\nonumber \\&\qquad +\frac{1}{2}\Vert y^k+A^Tz^{k+1}\Vert ^2_{M_1^{-1}}. \end{aligned}$$
(32)

Set \(a=y^{k+1}, c=y^k\), and \(b=-A^Tz^{k+1}\) to obtain

$$\begin{aligned}&\left\langle M_1^{-1}(-A^Tz^{k+1}-y^{k+1}), y^k-y^{k+1}\right\rangle \end{aligned}$$
(33)
$$\begin{aligned}&\quad =\frac{1}{2}\Vert A^Tz^{k+1}+y^{k+1}\Vert ^2_{M_1^{-1}} +\frac{1}{2}\Vert y^k-y^{k+1}\Vert ^2_{M_1^{-1}}\nonumber \\&\qquad -\frac{1}{2}\Vert A^Tz^{k+1}+y^{k}\Vert ^2_{M_1^{-1}}. \end{aligned}$$
(34)

Combining (30), (32), and (34) yields

$$\begin{aligned} L^k-L^{k+1}&\ge \Vert z^k-z^{k+1}\Vert ^2_{M_2-\frac{1}{2}AM_1^{-1}A^T-c(p)I_{m}} +\Vert y^k-y^{k+1}\Vert ^2_{\frac{1}{2}M_1^{-1}}\nonumber \\&\quad -\Vert A^Tz^{k+1}+y^{k+1}\Vert ^2_{M_1^{-1}}. \end{aligned}$$
(35)

Since f is \(\mu _f\)-strongly convex, we know that \(\nabla f^*\) is \(\frac{1}{\mu _f}-\)Lipschitz continuous. Consequently,

$$\begin{aligned} \Vert A^Tz^{k+1}+y^{k+1}\Vert ^2_{M_1^{-1}}&=\Vert u^k-u^{k+1}\Vert ^2_{M_1^{-1}} \le \frac{1}{\lambda _{\mathrm {min}}(M_1^{-1})}\left\| M_1^{-1}(u^k-u^{k+1})\right\| ^2\nonumber \\&\overset{(27)}{\le } \frac{\lambda _{\mathrm {max}}(M_1)}{\mu ^2_f}\Vert y^k-y^{k+1}\Vert ^2. \end{aligned}$$
(36)

Combining (35) and (36) gives us (23).

Now, to show (24), we use (27) and smoothness of \(f^*\) to get

$$\begin{aligned} f^*(y^k)\ge f^*(-A^Tz^k)+\left\langle M_1^{-1}u^{k}, y^k+A^Tz^k\right\rangle - \frac{1}{2\mu _f} \Vert A^Tz^k+y^k\Vert ^2. \end{aligned}$$

Hence, we arrive at

$$\begin{aligned} L^k&=g^*(z^k)+f^*(y^k)+\left\langle -A^Tz^k-y^k, M_1^{-1}u^k\right\rangle +\frac{1}{2}\Vert A^Tz^k+y^k\Vert ^2_{M_1^{-1}}\nonumber \\&\ge g^*(z^k)+f^*(-A^Tz^k)+\frac{1}{2}\Vert A^Tz^k+y^k\Vert ^2_{M_1^{-1}} -\frac{1}{2\mu _f}\Vert A^Tz^k+y^k\Vert ^2. \end{aligned}$$
(37)

Since \(C_1\succ 0\) if and only if \(\mu _f> \sqrt{2}\lambda _{\mathrm {max}}(M_1)\), (24) follows. \(\square \)

Remark 1

In Theorem 4, we require \(C_2=M_2-\frac{1}{2}AM_1^{-1}A^T-c(p)I_{m}\succ 0\). Recall that p is the number of inner loops applied to solve the \(z-\)subproblem and c(p) converges linearly to 0. Therefore, if we apply a smaller p, then \(M_2\) needs to be larger. This means that the dual update needs to use a smaller effective stepsize.

Next, we provide a simple choice of \(M_1, M_2, \) and p that ensures the positive definiteness of \(C_1\) and \(C_2\) in Theorem 4.

Proposition 2

In order to ensure (21) and (22), it suffices to set \(M_1=\frac{1}{\tau } I_{n}\) with \(\tau >\frac{\sqrt{2}}{\mu _f}\), \(M_2=\tau AA^T+\theta I_{m}\) with \(\theta >0\), and p large enough that \(c(p)<\theta \).

Proof

Since \(M_1=\frac{1}{\tau }I_{n}\), we have \(\lambda _{\mathrm {max}}(M_1)=\frac{1}{\tau }\), so \(C_1\succ 0\) if and only if \(\tau >\frac{\sqrt{2}}{\mu _f}\) (equivalently, \(\mu _f>\sqrt{2}\lambda _{\mathrm {max}}(M_1)\)). With \(M_1=\frac{1}{\tau }I_{n}\) and \(M_2 = \tau AA^T +\theta I_{m}\), we have

$$\begin{aligned} C_2 = \frac{1}{2}\tau AA^T +(\theta -c(p))I_{m}. \end{aligned}$$

As we have seen in Theorems 2 and 3, \(c(p) = {\mathcal {O}}(\rho ^p)\) for some \(\rho \in [0,1)\) when S is the proximal gradient or cyclic proximal BCD iterator. Therefore, there exists \(p_0\) such that \(C_2\succ 0\) for any \(p\ge p_0\). \(\square \)

Remark 2

With a small \(\theta >0\), the choices of \(M_1\) and \(M_2\) given in Proposition 2 are close to the “ADMM choice” \(M_1=\frac{1}{\tau }I_{n}\) and \(M_2 = \tau AA^T\), where \(M_2\) is optimal (see Sect. 3.2).
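For concreteness, constructing the preconditioners of Proposition 2 for a small dense A could look as follows (a sketch only; in practice \(AA^T\) is kept sparse or applied implicitly).

```python
import numpy as np

def admm_like_preconditioners(A, tau, theta=0.0):
    """M1 = (1/tau)*I_n and M2 = tau*A*A^T + theta*I_m as in Proposition 2;
    theta = 0 recovers the exact 'ADMM choice' of Sect. 3.2."""
    m, n = A.shape
    M1 = (1.0 / tau) * np.eye(n)
    M2 = tau * (A @ A.T) + theta * np.eye(m)
    return M1, M2
```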

We are now ready to show convergence of Algorithm 1.

Theorem 5

Take Assumptions 1 and 2. Then, \((x^k, z^k)\) in Algorithm 1 are bounded, and any cluster point is a primal–dual solution pair of (1) and (2).

Proof

According to Proposition 1, it is sufficient to show that \(\{M_1^{-1}u^k, z^k\}\) is bounded and that its cluster points are primal–dual solution pairs of (1).

Since \(L^k\) is nonincreasing, (37) tells us that

$$\begin{aligned} g^*(z^k)+f^*(-A^Tz^k)+\frac{1}{2}\Vert A^Tz^k+y^k\Vert ^2_{M_1^{-1}}\le L^0< +\infty . \end{aligned}$$

Since \(g^*(z)+f^*(-A^Tz)\) is coercive, \(\{z^k\}\) is bounded, and, by the boundedness of \(\{A^Tz^k+y^k\}\), \(\{y^k\}\) is also bounded. Furthermore, (27) gives us

$$\begin{aligned} \left\| M_1^{-1}(u^k-u^0)\right\| \le \frac{1}{\mu _f}\Vert y^k-y^0\Vert . \end{aligned}$$

Therefore, \(\{M_1^{-1}u^k\}\) is bounded, too.

Let \((z^c, y^c, u^c)\) be a cluster point of \(\{z^k, y^k, u^k\}\). We shall show \((z^c, y^c, u^c)\) is a saddle point of L(z, y, u), i.e.,

$$\begin{aligned} {\mathbf {0}}\in \partial L(z^c, y^c, u^c), \end{aligned}$$
(38)

or equivalently,

$$\begin{aligned} {\mathbf {0}}\in \partial g^*(z^c)-AM_1^{-1}u^c,\quad {\mathbf {0}}= \nabla f^*(y^c)-M_1^{-1}u^c, \quad {\mathbf {0}}= A^Tz^c+y^c, \end{aligned}$$

which ensures \((M_1^{-1}u^c, z^c)\) to be a primal–dual solution pair of (1).

In order to show (38), we first notice that (20) gives

$$\begin{aligned} \partial _z L(z^{k+1}, y^{k+1}, u^{k+1})&=\partial g^*(z^{k+1})-AM_1^{-1}u^{k+1}+AM_1^{-1}(A^Tz^{k+1}+y^{k+1}),\\ \nabla _y L(z^{k+1}, y^{k+1}, u^{k+1})&=\nabla f^*(y^{k+1})-M_1^{-1}u^{k+1}+M_1^{-1}(A^Tz^{k+1}+y^{k+1}),\\ \nabla _u L(z^{k+1}, y^{k+1}, u^{k+1})&=M_1^{-1}(-A^Tz^{k+1}-y^{k+1}). \end{aligned}$$

Comparing these with the optimality conditions (25), (27), and (28), we have

$$\begin{aligned} d^{k+1}=\left( d^{k+1}_z, d^{k+1}_y, d^{k+1}_u\right) \in \partial L\left( z^{k+1}, y^{k+1}, u^{k+1}\right) , \end{aligned}$$
(39)

where

$$\begin{aligned}&d^{k+1}_z=M_2(z^k-z^{k+1})+2AM_1^{-1}(u^k-u^{k+1})-AM_1^{-1}(u^{k-1}-u^k)-\varepsilon ^{k+1}, \nonumber \\&d^{k+1}_y=M_1^{-1}(u^k-u^{k+1}), \nonumber \\&d^{k+1}_u=M_1^{-1}(u^{k+1}-u^k). \end{aligned}$$
(40)

Since (23) and (24) imply \(z^k-z^{k+1}, y^k-y^{k+1}\rightarrow {\mathbf {0}}\), (27) gives \(u^k-u^{k+1}\rightarrow {\mathbf {0}}\). Combining these with (12), we have \(d^k\rightarrow {\mathbf {0}}\).

Finally, let us take a subsequence \(\{z^{k_s}, y^{k_s}, u^{k_s}\}\rightarrow (z^c, y^c, u^c)\). Since \(d^{k_s}\rightarrow {\mathbf {0}}\) as \(s\rightarrow +\infty \), [48, Def. 8.3] and [48, Prop. 8.12] yield (38), which tells us that \((M_1^{-1}u^c, z^c)\) is a primal–dual solution pair of (1). \(\square \)

Following the axiomatic approach developed in [2] for descent algorithms on KL functions, we can show that the whole sequence \((x^k, z^k)\) in Algorithm 1 converges to a primal–dual solution pair. This approach has also been applied in [5] for KL-based Lagrangian optimization.

Theorem 6

Take Assumptions 12, and 3. Then, \(\{x^k, z^k\}\) in Algorithm 1 converges to a primal–dual solution pair of (1).

Proof

By Theorem 5, we can take \(\{z^{k_s}, y^{k_s}, u^{k_s}\}\rightarrow (z^c, y^c, u^c)\) as \(s\rightarrow \infty \). Since L is a KL function, we can prove the convergence of \(\{z^k, y^k, u^k\}\) to \(\{z^c, y^c, u^c\}\) following [2]. Specifically, let us first verify that conditions H1, H2, and H3 of [2] are satisfied for \(v^k:=(z^k,y^k,u^k)\) and \(L(v^k)\).

First, (23) gives

$$\begin{aligned} L(v^{k+1})+&\lambda _{\text {min}}(C_1)\Vert y^k-y^{k+1}\Vert ^2+\lambda _{\text {min}}(C_2)\Vert z^{k}-z^{k+1}\Vert ^2\le L(v^k). \end{aligned}$$
(41)

By (27) and the \(\frac{1}{\mu _f}-\)Lipschitz differentiability of \(f^*\), we know that

$$\begin{aligned} \frac{1}{2}\Vert y^k-y^{k+1}\Vert ^2\ge \frac{\mu _f^2}{2}\left\| M_1^{-1}u^k-M_1^{-1}u^{k+1}\right\| ^2. \end{aligned}$$
(42)

Combining (41) with (42), we see that there exists \(a>0\) such that

$$\begin{aligned} L(v^{k+1})+a\Vert v^{k+1}-v^k\Vert ^2\le L(v^k). \end{aligned}$$

which satisfies condition H1 of [2].

From (39) and (40), we know that \(d^{k+1}\in \partial L(v^{k+1})\) satisfies

$$\begin{aligned} \Vert d^{k+1}\Vert \le b \Vert v^{k+1}-v^k\Vert \end{aligned}$$

for some \(b>0\), which satisfies condition H2 of [2].

Next, let us verify that condition H3 of [2] also holds true.

Recall that we have taken \(\{z^{k_s}, y^{k_s}, u^{k_s}\}\rightarrow (z^c, y^c, u^c)\) as \(s\rightarrow \infty \). Note that \(L(z^{k_s}, y^{k_s}, u^{k_s})\) is monotonically nonincreasing and lower bounded due to Theorem 4, which implies the convergence of \(L(z^{k_s}, y^{k_s}, u^{k_s})\). Since L is lower semicontinuous, we have

$$\begin{aligned} L(z^c, y^c, u^c)\le \lim _{s\rightarrow \infty } L(z^{k_s}, y^{k_s}, u^{k_s}). \end{aligned}$$
(43)

Since the only potentially discontinuous term in L is \(g^*\), we have

$$\begin{aligned} \lim _{s\rightarrow \infty } L(z^{k_s}, y^{k_s}, u^{k_s})-L(z^c, y^c, u^c)\le \limsup _{s\rightarrow \infty } g^*(z^{k_s})-g^*(z^c). \end{aligned}$$
(44)

By (25), we know that

$$\begin{aligned} g^*(z^c)&\ge g^*(z^{k_s})+\left\langle M_2(z^{k_s-1}-z^{k_s})\right. \\&\left. \quad +AM_1^{-1}(-A^Tz^{k_s-1}-y^{k_s-1}+u^{k_s-1})-\varepsilon ^{k_s}, z^c-z^{k_s}\right\rangle , \end{aligned}$$

Then, by Theorem 4, we further get \(z^{k_s-1}-z^{k_s}\rightarrow {\mathbf {0}}\). Since \(z^{k_s}\rightarrow z^c\) and \(\{z^{k}, y^{k}, u^{k}\}\) is bounded, we obtain

$$\begin{aligned} \limsup _{s\rightarrow \infty } g^*(z^{k_s})-g^*(z^c)\le 0. \end{aligned}$$

Combining this with (43) and (44), we conclude that

$$\begin{aligned} \lim _{s\rightarrow \infty } L(z^{k_s}, y^{k_s}, u^{k_s})=L(z^c, y^c, u^c), \end{aligned}$$

which satisfies condition H3 of [2].

Finally, since the conditions H1, H2, and H3 are satisfied, we can follow the proof of Theorem 2.9 of [2] to establish the convergence of \(v^k=(z^k,y^k,u^k)\) to \((z^c,y^c,u^c)\), which is a critical point of L(z, y, u). By (40), we further know that \(\{M_1^{-1}u^k, z^k\}\) converges to a primal–dual solution pair of (1), which is exactly \(\{x^k,z^k\}\) in Algorithm 1 according to Proposition 1. \(\square \)

Remark 3

In order to remove the strong convexity assumption in Assumption 2, we also establish the ergodic convergence of iPrePDHG in the case where \(g^* = 0\), and gradient descent is applied to solve the z-subproblem inexactly. See “Appendix D” for details.

4 Numerical Experiments

In this section, we compare our iPrePDHG (Algorithm 1) with (original) PDHG (6), diagonally-preconditioned PDHG (DP-PDHG) [44], accelerated PDHG (APDHG) [11], and accelerated linearized ADMM (ALADMM) [56]. We consider four popular applications of PDHG: TV-\(\text {L}^1\) denoising, graph cuts, estimation of earth mover’s distance, and CT reconstruction.

For the preconditioners \(M_1\) and \(M_2\) in iPrePDHG, we choose \(M_1=\frac{1}{\tau }I_{n}\) and \(M_2 = \tau AA^T+\theta I\) as suggested in Proposition 2, which corresponds to ADMM; \(M_2\) is nearly optimal for small \(\theta \) (see Sect. 3.2). Although f may not be strongly convex in our experiments, we still observe significant speedups compared to other algorithms.

When we write these examples in the form of (1), the matrix A (or a part of A) is one of the following operators:

Case 1:

2D discrete gradient operator \(D: {{\,\mathrm{{\mathbb {R}}}\,}}^{M\times N}\rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}^{2M\times N}\):

For images of size \(M\times N\) and grid stepsize h, we have

$$\begin{aligned} (D u)_{i,j}=\begin{pmatrix}(D u)^1_{i,j} \\ (D u)_{i,j}^2\end{pmatrix}, \end{aligned}$$

where

$$\begin{aligned} (D u)_{i,j}^1&= {\left\{ \begin{array}{ll} \frac{1}{h}(u_{i+1,j}-u_{i,j}) &{}\quad \text{ if } \,i<M,\\ 0 &{}\quad \text{ if } \,i=M, \end{array}\right. }\\ (D u)_{i,j}^2&= {\left\{ \begin{array}{ll} \frac{1}{h}(u_{i,j+1}-u_{i,j}) &{}\quad \text{ if } \,j<N,\\ 0 &{}\quad \text{ if } \,j=N. \end{array}\right. } \end{aligned}$$
Case 2:

2D discrete divergence operator: div\(:{{\,\mathrm{{\mathbb {R}}}\,}}^{2M\times N}\rightarrow {{\,\mathrm{{\mathbb {R}}}\,}}^{ M\times N}\) given by

$$\begin{aligned} \text {div}(p)_{i,j} = h\left( p_{i,j}^1 - p_{i-1,j}^1+p_{i,j}^2-p_{i,j-1}^2\right) , \end{aligned}$$

where \(p=(p^1, p^2)^T\in {{\,\mathrm{{\mathbb {R}}}\,}}^{2M\times N}\), \(p^1_{0,j} = p^1_{M,j} = 0\) and \(p^2_{i,0} = p^2_{i,N} = 0\) for \(i = 1,\dots ,M\), \(j = 1,\dots ,N\).

In Algorithm 1, we can take S as the iterator of FISTA. To take advantage of the finite-difference structure of these operators, we also let S be the iterator of cyclic proximal BCD. We split \(\{1,2,\dots, m\}\) into 2 blocks (for Case 2) or 4 blocks (for Case 1), a partition inspired by the popular red–black ordering [49] for solving sparse linear systems.

According to Theorem 3, running finitely many epochs of cyclic proximal BCD gives us a bounded relative error in Definition 1. We expect that this solver brings fast overall convergence. Specifically, when \(g^*=0\), the z-subproblem in PrePDHG reduces to a linear system with a structured sparse matrix \(AA^T\). Therefore, proximal gradient descent amounts to the Richardson method [47, 49], and cyclic proximal BCD amounts to the Gauss–Seidel method and the Successive Overrelaxation (SOR) method [49, 55], which are typically faster.

The following two claims tell us that with specific block partitions, the cyclic proximal BCD steps have closed forms, so Algorithm 1 is easy to implement. Furthermore, each BCD step can be computed in parallel.

Fig. 1 Two-block ordering in Claim 4

Fig. 2 Four-block ordering in Claim 4

Claim

When \(A=\mathrm {div}\) (i.e., \(A^T=-D\)) and \(M_2=\tau AA^T\), for \(z\in {{\mathbb {R}}}^{M\times N}\), we separate z into two blocks \(z_b\), \(z_r\), where

$$\begin{aligned} z_b:=\{ z_{i,j}\,|\,i+j\text { is even}\}, ~ z_r:=\{ z_{i,j}\,|\,i+j\text { is odd}\}, \end{aligned}$$

for \(1\le i\le M\), \(1\le j\le N\). If \(g(z)=\varSigma _{i,j}g_{i,j}(z_{i,j})\) and \({{\,\mathrm{Prox}\,}}_{\gamma g^*_{i,j}}\) has a closed-form solution for all \(1\le i\le M\), \(1\le j\le N\), and \(\gamma > 0\), then the iterator S of cyclic proximal BCD in Algorithm 1 has a closed form, and computing S is parallelizable.

Proof

As illustrated in Fig. 1, every black node is connected to its neighbor red nodes, so we can update all the coordinates corresponding to the black nodes in parallel, while those corresponding to the red nodes are fixed, and vice versa. See “Appendix E” for a complete explanation. \(\square \)
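A small sketch (not from the original text) of generating this two-block partition for an \(M\times N\) grid stored in row-major order; with 0-based indices the two colors swap relative to the 1-based convention of the claim, which does not affect the partition.

```python
import numpy as np

def red_black_blocks(M, N):
    """Indices of the 'black' (i + j even) and 'red' (i + j odd) nodes of an M x N grid,
    flattened in row-major (C) order, as illustrated in Fig. 1."""
    i, j = np.meshgrid(np.arange(M), np.arange(N), indexing="ij")
    black = ((i + j) % 2 == 0).ravel()
    return np.where(black)[0], np.where(~black)[0]
```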

Claim

When \(A= D\) (i.e., \(A^T=-\mathrm {div}\)) and \(M_2=\tau AA^T\), for \(z=(z^1, z^2)^T \in {{\mathbb {R}}}^{2M\times N}\), we separate z into four blocks \(z_b\), \(z_r\), \(z_y\), and \(z_g\), where

$$\begin{aligned} z_b&=\left\{ z^1_{i,j}\,|\,i \text { is odd}\right\} , \quad z_r=\left\{ z^1_{i,j}\,|\,i \text { is even}\right\} ,\\ z_y&=\left\{ z^2_{i,j}\,|\,j \text { is odd}\right\} , \quad z_g=\left\{ z^2_{i,j}\,|\,j \text { is even}\right\} , \end{aligned}$$

for \(1\le i\le M\), \(1\le j\le N\). If \(g(z)=\varSigma _{i,j}g_{i,j}(z_{i,j})\) and \({{\,\mathrm{Prox}\,}}_{\gamma g^*_{i,j}}\) has a closed-form solution for all \(1\le i\le M\), \(1\le j\le N\), and \(\gamma > 0\), then the iterator S of cyclic proximal BCD in Algorithm 1 has a closed form, and computing S is parallelizable.

Proof

In Fig. 2, the 4 blocks are in 4 different colors. The coordinates corresponding to nodes of the same color can be updated in parallel, while the rest are fixed. See “Appendix E” for details. \(\square \)

In the following sections, PDHG denotes the original PDHG in (6) without any preconditioning; DP-PDHG denotes the diagonally-preconditioned PDHG in [44]; APDHG is the accelerated PDHG in [11]; ALADMM is the accelerated linearized ADMM proposed in [56]. PrePDHG denotes Preconditioned PDHG in (7), where the \((k+1)\)th z-subproblem is solved until \(\frac{\Vert z^k-z^{k+1}\Vert _2}{\max \{1,\Vert z^{k+1}\Vert _2\}}< 10^{-5}\) using the TFOCS [4] implementation of FISTA; iPrePDHG (Inner: BCD) and iPrePDHG (Inner: FISTA) denote our iPrePDHG in Algorithm 1 with the iterator S being cyclic proximal BCD or FISTA, respectively. All the experiments were performed in MATLAB R2018a on a MacBook Pro with a 2.5 GHz Intel i7 processor and 16 GB of 2133 MHz LPDDR3 memory.

A comparison between PDHG and DP-PDHG is presented in [44] on TV-\(\text {L}^1\) denoising and graph cuts, and in [50] on CT reconstruction. A PDHG algorithm is proposed to estimate earth mover’s distance (or optimal transport) in [35]. In order to provide a direct comparison, we use their problem formulations.

4.1 Graph Cuts

The total-variation-based graph cut model involves minimizing a weighted TV energy:

$$\begin{aligned} \begin{array}{ll} \text{ minimize }&{} \quad \Vert D_w u\Vert _{1}+\langle u,w^u\rangle \\ \text{ subject } \text{ to }&{}\quad 0\le u\le 1, \\ \end{array} \end{aligned}$$

where \(w^u\in {{\,\mathrm{{\mathbb {R}}}\,}}^{M\times N}\) is a vector of unary weights, \(w^b\in {{\,\mathrm{{\mathbb {R}}}\,}}^{2MN}\) is a vector of binary weights, and \(D_w = \text {diag}(w^b)D\) for D being the 2D discrete gradient operator with \(h=1\). Specifically, we have \(w_{i,j}^u=\alpha (\Vert I_{i,j}-\mu _f\Vert ^2-\Vert I_{i,j}-\mu _b\Vert ^2)\), \(w_{i,j}^{b,1}=\text {exp}(-\beta |I_{i+1,j}-I_{i,j}|)\), and \(w_{i,j}^{b,2}=\text {exp}(-\beta |I_{i,j+1}-I_{i,j}|)\).

To formulate this problem as (1), we take \(f(u) = \langle u, w^u \rangle +\delta _{[0,1]}(u)\), \(A=D_w\), and g as the \(\ell _1-\)norm \(g(z) = \sum _{i=1}^{2MN} |z_i|.\)
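With this splitting, both proximal operators used by the tested algorithms have simple closed forms; a sketch (function names are illustrative):

```python
import numpy as np

def prox_tau_f(v, tau, w_u):
    """Prox of tau*f with f(u) = <u, w^u> + indicator of [0,1]^n: shift by tau*w^u, then clip."""
    return np.clip(v - tau * w_u, 0.0, 1.0)

def prox_sigma_gstar(v, sigma=None):
    """Prox of sigma*g^* for g = ||.||_1: projection onto the l_inf unit ball (independent of sigma)."""
    return np.clip(v, -1.0, 1.0)
```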

In our experiment, the input image has size \(660\times 720\). We set \(\alpha =1/2\), \(\beta =10\), \(\mu _f=[0;0;1]\) (for the blue foreground) and \(\mu _b=[0;1;0]\) (for the green background). We run all algorithms until \(\delta ^k:=\frac{|\varPhi ^k-\varPhi ^{\star }|}{|\varPhi ^{\star }|}< 10^{-8}\), where \(\varPhi ^k\) is the objective value at the kth iteration and \(\varPhi ^*\) is the optimal objective value obtained by running CVX [18].

We summarize the test results in Table 1. For APDHG and ALADMM, the best results over \(\mu \in \{10,1,0.1,0.01,0.001\}\) are presented, and the rest of their parameters are set as suggested in [11, 56], respectively. For iPrePDHG, the best results over \(\tau \in \{10,1,0.1,0.01,0.001\}\) and \(p\in \{1,2,3, 10, 20, 30\}\) are presented, where the step size of cyclic proximal BCD was chosen as \(\gamma =\frac{1}{\Vert M_2\Vert }\). Thanks to the efficiency of cyclic proximal BCD on the subproblems, we can simply apply 2 inner iterations to achieve superior performance. It is also worth mentioning that its number of outer iterations is close to that of PrePDHG, which solves the z-subproblem much more accurately. In the last row of Table 1, we take \(M_2=\tau D_w D_w^T+\theta I_m\) with \(\theta >0\) as suggested in Proposition 2; the performance is similar to that of \(\theta = 0\). In practice, we recommend simply taking \(\theta =0\). Finally, we would like to mention that the number of inner iterations is not exactly proportional to the runtime; this is because MATLAB handles sparse matrix operations very efficiently, and the runtime of the other parts of the tested algorithms is not negligible.

The input image can be found in Fig. 3. For all the tested algorithms, the output images look similar; therefore, we only present the output image of iPrePDHG in Fig. 4. This is also the case for the other tests in this paper.

Table 1 Performance of PDHG, DP-PDHG, ADMM, and iPrePDHG on the graph cut model
Fig. 3 Input image of graph cut with size \(660\times 720\)

Fig. 4 Output of graph cut by our iPrePDHG (Inner: BCD), where the flower part has been extracted

4.2 Total Variation Based Image Denoising

The following problem is known as the (discrete) TV-\(\text {L}^1\) model for image denoising:

$$\begin{aligned} \begin{array}{ll} \mathop {\mathrm {minimize}}_u &{}\quad \varPhi (u) = \Vert D u\Vert _{1}+\lambda \Vert u-b\Vert _{1},\\ \end{array} \end{aligned}$$

where D is the 2D discrete gradient operator with \(h=1\), \(b\in {{\,\mathrm{{\mathbb {R}}}\,}}^{1024\times 1024}\) is a noisy input image with noise level 0.15 (see Fig. 5), and \(\lambda =1\) is a regularization parameter. We run the algorithms until \(\frac{|\varPhi ^k-\varPhi ^{\star }|}{|\varPhi ^{\star }|}< 10^{-6}\), where \(\varPhi ^k\) is the kth objective value and \(\varPhi ^*\) is the optimal objective value obtained by CVX [18].

To formulate as (1), we take \(f(u) = \lambda \Vert u-b\Vert _1\), \(g(z) = \Vert z\Vert _1\), and \(A=D\).
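Here the x-subproblem prox again has a closed form, namely soft-thresholding centered at the noisy image b; a sketch (names are illustrative):

```python
import numpy as np

def prox_tau_f(v, tau, b, lam=1.0):
    """Prox of tau*f with f(u) = lam*||u - b||_1: soft-threshold v - b with threshold tau*lam."""
    d = v - b
    return b + np.sign(d) * np.maximum(np.abs(d) - tau * lam, 0.0)
```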

Observed performance is summarized in Table 2, where the best results for \(\mu , \tau \in \{10,1,0.1,0.01,0.001\}\) and \(p\in \{1,2,3, 5, 10, 20\}\) are presented (Again, the step size of cyclic proximal BCD has been chosen as \(\gamma =\frac{1}{\Vert M_2\Vert }\)). Our iPrePDHG (Inner: BCD) is significantly faster than the other algorithms. Finally, the denoised image can be found in Fig. 6.

When taking \(\theta = 0.1\), we get nearly identical results. This is because \(\theta >0\) adds a proximal term \(\frac{\theta }{2}\Vert z-z^k\Vert ^2\) in the z-subproblem (see Eq. (8)), whose gradient at \(z^k\) is 0. Since \(p=1\) and cyclic proximal BCD is initialized exactly at \(z^k\), we get the same iterates as that of \(\theta = 0\). In practice, we recommend simply taking \(\theta =0\).

Remarkably, our algorithm uses fewer outer iterations than PrePDHG under the stopping criterion \(\frac{\Vert z^k-z^{k+1}\Vert _2}{\max \{1,\Vert z^{k+1}\Vert _2\}}< 10^{-5}\), as this kind of stopping criterion may become looser as \(z^k\) approaches \(z^{\star }\). In this example, \(\frac{\Vert z^k-z^{k+1}\Vert _2}{\max \{1,\Vert z^{k+1}\Vert _2\}}< 10^{-5}\) requires only 1 inner iteration of FISTA when Outer Iter \(\ge 368\), but as many as 228 inner iterations on average during the first 100 outer iterations. In comparison, our algorithm uses fewer outer iterations, each of which also costs less.

In addition, the diagonal preconditioner given in [44] appears to help very little when \(A=D\). In fact, \(M_1=\text {diag}(\varSigma _{i}|A_{i,j}|)\) will be \(4I_n\) and \(M_2=\text {diag}(\varSigma _{j}|A_{i,j}|)\) will be \(2I_m\) if we ignore the Neumann boundary condition. Therefore, DP-PDHG performs even worse than PDHG.

Fig. 5 Noisy input image for the TV-\(L^1\) denoising model, of size \(1024\times 1024\) and noise level 0.15

Table 2 Test of PDHG, DP-PDHG, ADMM, and iPrePDHG on the TV-\(L^1\) denoising model.
Fig. 6 Denoised image by iPrePDHG (Inner: BCD)

4.3 Earth Mover’s Distance

Earth mover’s distance is useful in image processing, computer vision, and statistics [33, 38, 43]. A recent method [35] to compute earth mover’s distance is based on

$$\begin{aligned} \begin{array}{ll} \text{ minimize }&{}\quad \Vert m\Vert _{1,2}\\ \text{ subject } \text{ to }&{}\quad \text {div}(m)+\rho ^1-\rho ^0=0, \end{array} \end{aligned}$$

where \(m\in {{\,\mathrm{{\mathbb {R}}}\,}}^{2M\times N}\) is the sought flux vector on the \(M\times N\) grid, and \(\rho ^0, \rho ^1\) represent two mass distributions on the \(M\times N\) grid. The setting in our experiment is the same as that in [35], i.e., \(M=N=256\), \(h=\frac{N-1}{4}\); for \(\rho ^0\) and \(\rho ^1\), see Fig. 8.

To formulate as (1), we take \(f(m)=\Vert m\Vert _{1,2}\), \(g(z) = \delta _{\{\rho ^0-\rho ^1\}}(z)\), and \(A = \text {div}\).
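Both terms admit simple closed-form proximal operators; a sketch, with m stored as an array of shape (2, M, N) (names are illustrative):

```python
import numpy as np

def prox_tau_f(m, tau):
    """Prox of tau*||m||_{1,2}: group soft-thresholding of the 2-vector at every grid point."""
    norms = np.linalg.norm(m, axis=0, keepdims=True)
    return np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0) * m

def prox_sigma_gstar(z, sigma, rho0, rho1):
    """Prox of sigma*g^* with g the indicator of {rho^0 - rho^1}, so g^*(z) = <z, rho^0 - rho^1>: a shift."""
    return z - sigma * (rho0 - rho1)
```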

Since the iterates \(m^k\) may not satisfy the linear constraint, the objective \(\varPhi (m)=\delta _{\{m\,|\,\text {div}(m)=\rho ^0-\rho ^1\}}(m)+\Vert m\Vert _{1,2}\) is not suitable for comparison. Instead, we compare \(\Vert m^k\Vert _{1,2}\) and the constraint violation up to \(k=100{,}000\) outer iterations in Fig. 7, where we set \(\tau =3\times 10^{-6}\) as in [35], and \(\sigma =\frac{1}{\tau \Vert \text {div}\Vert ^2}\). For iPrePDHG (Inner: BCD), we set \(M_1=\tau ^{-1}I_n\), \(M_2=\tau \,\mathrm {div}\,\mathrm {div}^T\), BCD stepsize \(\gamma =\frac{1}{\Vert M_2\Vert }\), and number of BCD epochs \(p=2\).

In Fig. 7, we can see that our iPrePDHG provides much lower constraint violation and much more faithful earth mover’s distance \(\Vert m\Vert _{1,2}\) at any given runtime. Figure 8 shows the solution obtained by our iPrePDHG (Inner: BCD), where m is the flux that moves the standing cat \(\rho ^1\) into the crouching cat \(\rho ^0\). For our iPrePDHG with \(M_2=\tau \text {div div}^T+\theta I_m\), the performance is very similar when a small \(\theta \) is applied. In practice, we recommend simply taking \(\theta =0\).

DP-PDHG, ALADMM, and PrePDHG are extremely slow in this example and are not reported in Fig. 7. Similar to Sect. 4.2, when \(A=\text {div}\), the diagonal preconditioners proposed in [44] are approximately equivalent to fixed constant parameters \(\tau =\frac{1}{2h}\), \(\sigma =\frac{1}{4h}\) and they lead to extremely slow convergence. As for PrePDHG, it suffers from the high cost per outer iteration.

It is worth mentioning that, unlike [35], the algorithms in our experiments are not parallelized. On the other hand, in our iPrePDHG (Inner: BCD), the iterator S can be parallelized (which we did not implement). Therefore, one can expect a further speedup from a parallel implementation.

Fig. 7 Comparison of PDHG and iPrePDHG on the EMD estimation problem over 100,000 outer iterations

Fig. 8 Mass distributions \(\rho ^0\), \(\rho ^1\) for EMD estimation. \(\rho ^0\) is the white standing cat, and \(\rho ^1\) is the black crouching cat. Both images are \(256\times 256\), and the earth mover’s distance between \(\rho ^0\) and \(\rho ^1\) is 0.6718

4.4 CT Reconstruction

We test solving the following optimization problem for CT image reconstruction:

$$\begin{aligned} \begin{array}{ll} \text{ minimize }&\quad \varPhi (u)=\frac{1}{2}\Vert Ru-b\Vert ^2_{2}+\lambda \Vert Du\Vert _{1}, \end{array} \end{aligned}$$
(45)

where \(R\in \ {{\,\mathrm{{\mathbb {R}}}\,}}^{13032\times 65536}\) is a system matrix for 2D fan-beam CT with a curved detector, \(b=Ru_{\text {true}}\in {{\,\mathrm{{\mathbb {R}}}\,}}^{13032}\) is a vector of line-integration values, and we want to reconstruct \(u_{\text {true}}\in {{\,\mathrm{{\mathbb {R}}}\,}}^{MN}\), where \(M=N=256\). D is the 2D discrete gradient operator with \(h=1\), and \(\lambda =1\) is a regularization parameter. By using the fancurvedtomo function from the AIR Tools II package [29], we generate a test problem where the projection angles are \(0{^{\circ }}, 10{^{\circ }}, \dots , 350{^{\circ }}\), and for all the other input parameters we use the default values.

Following [50], we formulate the problem (45) in the form of (1) by taking

$$\begin{aligned} g\begin{pmatrix} p\\ q\end{pmatrix}= \frac{1}{2}\Vert p-b\Vert _2^2+\lambda \Vert q\Vert _1,\quad f(u)=0, \quad A=\begin{pmatrix} R\\ D\end{pmatrix}, \end{aligned}$$
(46)

By using this formulation, we avoid inverting the matrices R and D.
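For this splitting, the conjugate prox of each block of g has a simple closed form that the inner solver can exploit; a sketch (names are illustrative):

```python
import numpy as np

def prox_sigma_gstar_data(v, sigma, b):
    """Prox of sigma*h^* for h(p) = 0.5*||p - b||^2, i.e., h^*(y) = 0.5*||y||^2 + <y, b>."""
    return (v - sigma * b) / (1.0 + sigma)

def prox_sigma_gstar_tv(v, lam=1.0):
    """Prox of sigma*(lam*||.||_1)^*: projection onto the l_inf ball of radius lam (independent of sigma)."""
    return np.clip(v, -lam, lam)
```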

Since the block structure of \(AA^T\) is rather complicated, if we naively choose \(M_1=\frac{1}{\tau }I_n\) and \(M_2=\tau AA^T\) as in the previous three experiments, it becomes hard to find a fast solver for the z-subproblem. In Table 3, we report results using a TFOCS implementation of FISTA to solve the z-subproblem; the overall convergence is very slow.

Table 3 Performance of PDHG, DP-PDHG, ADMM, and iPrePDHG on CT reconstruction

Instead, we propose to choose

$$\begin{aligned} M_1=\frac{2}{\tau }I_{n},\quad M_2=\begin{pmatrix} \tau \Vert R\Vert ^2I_{m-2n} &{}\quad 0\\ 0 &{}\quad \tau DD^T+\theta I_{2n} \end{pmatrix} \end{aligned}$$
(47)

or

$$\begin{aligned} M_1=\text {diag}(\varSigma _i |{R}_{i,j}|)+\frac{1}{\tau }I_{n}, \quad M_2 = \begin{pmatrix} \text {diag}(\varSigma _j|{R}_{i,j}|) &{}\quad 0\\ 0 &{} \quad \tau DD^T+\theta I_{2n} \end{pmatrix} \end{aligned}$$
(48)

for some small \(\theta \ge 0\). These choices satisfy (9) and have simple block structures, so a fixed number of epochs of S, the cyclic proximal BCD iterator, gives fast overall convergence. Note that (48) is a little slower but avoids the need to estimate \(\Vert R\Vert \).

We summarize the numerical results in Table 3. All the algorithms are executed until \(\delta ^k:=\frac{|\varPhi ^k-\varPhi ^{\star }|}{|\varPhi ^{\star }|}< 10^{-4}\), where \(\varPhi ^k\) is the objective value at the kth iteration and \(\varPhi ^*\) is the optimal objective value obtained by calling CVX. The best results of \(\mu , \tau \in \{10,1,0.1,0.01,0.001\}\) and \(p\in \{1,2,3\}\) are summarized in Table 3. As in the previous experiments, \(\theta =0.1\) gives similar performances for iPrePDHG (Inner: BCD). In practice, we recommend simply taking \(\theta =0\). For iPrePDHG (Inner: FISTA) with \(M_2=\tau AA^T\), the result for \(p=100\) is also reported (here we use the TFOCS implementation of FISTA).

5 Conclusions

We have developed an approach to improve the performance of PDHG and ADMM in this paper. Our approach uses effective preconditioners to significantly reduce the number of iterations. In general, the most effective preconditioners are non-diagonal and cause very difficult subproblems in PDHG and ADMM, so prior works restrict themselves to less effective diagonal preconditioners. In contrast, we deal with those difficult subproblems by “solving” them highly inexactly, running just a few warm-started epochs of proximal BCD iterations. In all of our numerical tests, our algorithm needs relatively few outer iterations (due to effective preconditioners) and has the shortest total running time, achieving a 4–95 times speedup over the state of the art.

Theoretically, we show that a fixed number of inner iterations suffices for global convergence through a new relative error condition. The number depends on various factors but is easy to choose in all of our numerical experiments.

There are still open questions left for us to address in the future: (a) Depending on problem structures, there are choices of preconditioners that are better than \(M_1=\frac{1}{\tau }I_n, M_2=\tau AA^T\) (the ones that lead to ADMM if the subproblems are solved exactly). For example, in CT reconstruction, our choices of \(M_1\) and \(M_2\) have much faster overall convergence. (b) Is it possible to show Algorithm 1 converges even with S chosen as the iterator of faster solvers like APCG [36], NU_ACDM [1], and A2BCD [28]? (c) In general, how to accelerate a broader class of algorithms by integrating effective preconditioning and cheap inner loops while still ensuring global convergence?