
4.1 Introduction

We consider the convex optimization problem

$$\begin{aligned} \begin{aligned}&\text {minimize}&f(y) + g(x) \\&\text {subject to}&y = A x, \end{aligned} \end{aligned}$$
(4.1)

where \(x\in \mathbf{R}^n\) and \(y \in \mathbf{R}^{m}\) are the variables, and the (extended real-valued) functions \(f: \mathbf{R}^{m} \rightarrow \mathbf{R}\cup \{\infty \}\) and \(g: \mathbf{R}^{n} \rightarrow \mathbf{R}\cup \{\infty \}\) are convex, closed, and proper. The matrix \(A \in \mathbf{R}^{m \times n}\) and the functions f and g are the problem data. Infinite values of f and g allow us to encode convex constraints on x and y, since any feasible point \((x, y)\) must satisfy

$$ x \in \{x \mid g(x)< \infty \}, \qquad y \in \{y \mid f(y) < \infty \}. $$

We will be interested in the case when f and g have simple proximal operators, but for now we do not make this assumption. The problem form (4.1) is known as graph form [39], since the variable \((x, y)\) is constrained to lie in the graph \(\mathscr {G} = \{(x,y) \in \mathbf{R}^{n + m}~|~y = Ax\}\) of A. We denote by \(p^\star \) the optimal value of (4.1), which we assume is finite.

The graph form includes a large range of convex problems, including linear and quadratic programming, general conic programming [8, Sect. 11.6], and many more specific applications such as logistic regression with various regularizers, support vector machine fitting [29], portfolio optimization [8, Sect. 4.4.1] [25] [4], signal recovery [16], and radiation treatment planning [38], to name just a few.
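As a concrete illustration, the lasso problem, minimize \((1/2)\Vert Ax-b\Vert _2^2 + \lambda \Vert x\Vert _1\), is a graph form problem with \(f(y) = (1/2)\Vert y-b\Vert _2^2\) and \(g(x) = \lambda \Vert x\Vert _1\), and both functions have simple proximal operators. The following sketch is ours (the function names are illustrative, not from the chapter):

```python
import numpy as np

# Graph form split of the lasso: f(y) = (1/2)||y - b||^2, g(x) = lam*||x||_1.
def prox_f(v, b, rho):
    # argmin_y (1/2)||y - b||^2 + (rho/2)||y - v||^2 has a closed form
    return (b + rho * v) / (1.0 + rho)

def prox_g(v, lam, rho):
    # argmin_x lam*||x||_1 + (rho/2)||x - v||^2 is soft-thresholding at lam/rho
    return np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
```

Both operators cost \(O(m)\) and \(O(n)\) respectively, which is what makes the graph form splitting attractive for such problems.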

In [39], Parikh and Boyd described an operator splitting method for solving (a generalization of) the graph form problem (4.1), based on the alternating direction method of multipliers (ADMM) [5]. Each iteration of this method requires a projection (either exactly or approximately via an iterative method) onto the graph \(\mathscr {G}\), and evaluation of the proximal operators of f and g. Theoretical convergence was established in those papers, and basic implementations were demonstrated. However, it has been observed that practical convergence of the algorithm depends very much on the choice of algorithm parameters (such as the proximal parameter \(\rho \)), and scaling of the variables (i.e., preconditioning).

The purpose of this chapter is to explore these issues, and to add some critical variations on the algorithm that make it a relatively robust general purpose solver, at least for modest accuracy levels. The algorithm we propose, which is the same as the basic method described in [39] but with modified parameter selection, diagonal preconditioning, and a modified stopping criterion, has been implemented in an open-source software project called POGS (for Proximal Graph Solver), and tested on a wide variety of problems. Our CUDA implementation reliably solves (to modest accuracy) problems \(10^3\times \) larger than those that can be handled by interior-point methods; for problems that interior-point methods can handle, it is typically \(100\times \) faster.

4.1.1 Outline

In Sect. 4.1.2 we describe related work. In Sect. 4.2 we derive the graph form dual problem, and the primal-dual optimality conditions, which we use to motivate the stopping criterion and to interpret the iterates of the algorithm. In Sect. 4.3 we describe the ADMM-based graph form algorithm, and analyze the properties of its iterates, giving some results that did not appear in [39]. In Sect. 4.4 we address the topic of preconditioning, and suggest novel preconditioning and parameter selection techniques. In Sect. 4.5 we describe our implementation POGS, and in Sect. 4.6 we report performance results on various problem families.

4.1.2 Related Work

Many generic methods can be used to solve the graph form problem (4.1), including projected gradient descent [12], projected subgradient methods [42, Chap. 5] [47], operator splitting methods [32] [20], interior-point methods [35, Chap. 19] [7, Chap. 6], and many more. (Of course many of these methods can only be used when additional assumptions are made on f and g, e.g., differentiability or strong convexity.) For example, if f and g are separable and smooth, the problem (4.1) can be solved by Newton’s method, with each iteration requiring the solution of a set of linear equations that requires \(O(\max \{m,n\}\min \{m,n\}^2)\) floating point operations (flops) when A is dense. If f and g are separable and have smooth barrier functions for their epigraphs, the problem (4.1) can be solved by an interior-point method, which in practice always takes no more than a few tens of iterations, with each iteration involving the solution of a system of linear equations that requires \(O(\max \{m,n\}\min \{m,n\}^2)\) flops when A is dense [8, Chap. 11] [35, Chap. 19].

We now turn to first-order methods for the graph form problem (4.1). In [1] Briceño-Arias and Combettes describe methods for solving a generalized version of (4.1), including a forward–backward–forward algorithm and one based on Douglas–Rachford splitting [17]. Their methods are especially interesting in the case when A represents an abstract operator, and one only has access to A through Ax and \(A^Ty\). In [37] O’Connor and Vandenberghe propose a primal-dual method for the graph form problem where A is the sum of two structured matrices. They contrast it with methods such as Spingarn’s method of partial inverses [49], Douglas–Rachford splitting, and the Chambolle–Pock method [14].

Davis and Yin [18] analyze convergence rates for different operator splitting methods, and in [24] Giselsson proves the tightness of linear convergence for the operator splitting problems considered in [22]. Goldstein et al. [26] derive Nesterov-type acceleration, and show \(O(1/k^2)\) convergence for problems where f and g are both strongly convex.

Nishihara et al. [34] introduce a parameter selection framework for ADMM with over-relaxation [19]. The framework is based on solving a fixed-size semidefinite program (SDP). They also make the assumption that f is strongly convex. Ghadimi et al. [27] derive optimal parameter choices for the case when f and g are both quadratic. In [22], Giselsson and Boyd show how to choose metrics to optimize the convergence bound, and in [21] Giselsson and Boyd suggest a diagonal preconditioning scheme for graph form problems based on semidefinite programming. This scheme is primarily relevant in small to medium scale problems, or situations where many different graph form problems, with the same matrix A, are to be solved. It is clear from these papers (and indeed, a general rule) that the practical convergence of first-order methods depends heavily on algorithm parameter choices.

GPUs are used extensively for stochastic gradient descent-based optimization when training neural networks [11, 31, 33], and they are slowly gaining popularity in convex optimization as well [13, 41, 52].

4.2 Optimality Conditions and Duality

4.2.1 Dual Graph Form Problem

The Lagrange dual function of (4.1) is given by

$$ \inf _{x,y} f(y) + g(x) + \nu ^T(Ax - y) = - f^*(\nu ) -g^*(-A^T\nu ), $$

where \(\nu \in \mathbf {R}^m\) is the dual variable associated with the equality constraint, and \(f^*\) and \(g^*\) are the conjugate functions of f and g, respectively [8, Chap. 4]. Introducing the variable \(\mu = -A^T\nu \), we can write the dual problem as

$$\begin{aligned} \begin{aligned}&\text {maximize}&-f^*(\nu ) - g^*(\mu ) \\&\text {subject to}&\mu = -A^T \nu . \end{aligned} \end{aligned}$$
(4.2)

The dual problem can be written as a graph form problem, if we negate the objective and minimize rather than maximize. The dual graph form problem (4.2) is related to the primal graph form problem (4.1) by switching the roles of the variables, replacing the objective function terms with their conjugates, and replacing A with \(-A^T\).

The primal and dual objectives are \(p(x,y) = f(y) + g(x)\) and \(d(\mu ,\nu ) = -f^*(\nu ) - g^*(\mu )\), respectively, giving us the duality gap

$$\begin{aligned} \eta = p(x,y) - d(\mu ,\nu ) = f(y) + f^*(\nu ) + g(x) + g^*(\mu ). \end{aligned}$$
(4.3)

We have \(\eta \ge 0\), for any primal and dual feasible tuple \((x, y, \mu , \nu )\). The duality gap \(\eta \) gives a bound on the suboptimality of (xy) (for the primal problem) and also \((\mu ,\nu )\) for the dual problem:

$$ f(y)+g(x) \le p^\star + \eta , \qquad -f^*(\nu )-g^*(\mu ) \ge p^\star - \eta . $$

4.2.2 Optimality Conditions

The optimality conditions for (4.1) are readily derived from the dual problem. The tuple \((x, y, \mu , \nu )\) satisfies the following three conditions if and only if it is optimal:

Primal feasibility:

$$\begin{aligned} y = Ax. \end{aligned}$$
(4.4)

Dual feasibility:

$$\begin{aligned} \mu = -A^T\nu . \end{aligned}$$
(4.5)

Zero gap:

$$\begin{aligned} f(y) + f^*(\nu ) + g(x) + g^*(\mu ) = 0. \end{aligned}$$
(4.6)

If both (4.4) and (4.5) hold, then the zero gap condition (4.6) can be replaced by the Fenchel equalities

$$\begin{aligned} f(y) + f^*(\nu ) = \nu ^Ty, \quad g(x) + g^*(\mu ) = \mu ^Tx. \end{aligned}$$
(4.7)

We refer to a tuple \((x,y,\mu ,\nu )\) that satisfies (4.7) as Fenchel feasible. To verify the statement, we add the two equations in (4.7), which yields

$$\begin{aligned} f(y) + f^*(\nu ) + g(x) + g^*(\mu ) = y^T\nu + x^T\mu = (Ax)^T\nu -x^T A^T\nu = 0. \end{aligned}$$

The Fenchel equalities (4.7) are also equivalent to

$$\begin{aligned} \nu \in \partial f(y), \quad \mu \in \partial g(x), \end{aligned}$$
(4.8)

where \(\partial \) denotes the subdifferential, which follows because

$$ \nu \in \partial f(y) \Leftrightarrow \sup _z\left( z^T\nu - f(z) \right) = \nu ^Ty - f(y) \Leftrightarrow f(y) + f^*(\nu ) = \nu ^Ty. $$

In the sequel we will assume that strong duality holds, meaning that there exists a tuple \((x^\star , y^\star , \mu ^\star , \nu ^\star )\) which satisfies all three optimality conditions.

4.3 Algorithm

4.3.1 Graph Projection Splitting

In [39] Parikh and Boyd apply ADMM [5, Sect. 5] to the problem of minimizing \(f(y)+g(x)\), subject to the constraint \((x,y)\in \mathscr {G}\). This yields the graph projection splitting Algorithm 1.

Algorithm 1 Graph projection splitting

1: Initialize \((x^0, y^0, {\tilde{x}}^0, {\tilde{y}}^0) = 0\)

2: repeat

3: \(\quad (x^{k+1/2}, y^{k+1/2}) := \big (\mathbf {prox}_g(x^{k} - {\tilde{x}}^{k}),\; \mathbf {prox}_f(y^{k} - {\tilde{y}}^{k})\big )\)

4: \(\quad (x^{k+1}, y^{k+1}) := \varPi (x^{k+1/2} + {\tilde{x}}^{k},\; y^{k+1/2} + {\tilde{y}}^{k})\)

5: \(\quad ({\tilde{x}}^{k+1}, {\tilde{y}}^{k+1}) := ({\tilde{x}}^{k} + x^{k+1/2} - x^{k+1},\; {\tilde{y}}^{k} + y^{k+1/2} - y^{k+1})\)

6: until converged

The variable k is the iteration counter, \(x^{k+1}, x^{k+1/2} \in \mathbf {R}^{n}\) and \(y^{k+1}, y^{k+1/2} \in \mathbf {R}^{m}\) are primal variables, \({\tilde{x}}^{k+1} \in \mathbf {R}^{n}\) and \({\tilde{y}}^{k+1} \in \mathbf {R}^{m}\) are scaled dual variables, \(\varPi \) denotes the (Euclidean) projection onto the graph \(\mathscr {G}\),

$$ \mathbf {prox}_f(v) = \mathop {\text {argmin}}\limits _{y} \left( f(y) + (\rho /2)\Vert y - v\Vert _2^2 \right) $$

is the proximal operator of f (and similarly for g), and \(\rho >0\) is the proximal parameter. The projection \(\varPi \) can be explicitly expressed as the linear operator

$$\begin{aligned} \varPi (c,d)= K^{-1}\begin{bmatrix} c + A^Td \\ 0 \end{bmatrix}, \qquad K = \begin{bmatrix} I&A^T \\ A&-I \end{bmatrix}. \end{aligned}$$
(4.9)

Roughly speaking, in steps 3 and 5, the x (and \({\tilde{x}}\)) and y (and \({\tilde{y}}\)) variables do not mix; the computations can be carried out in parallel. The projection step 4 mixes the \(x,{\tilde{x}}\) and \(y,{\tilde{y}}\) variables.
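To make the iterates concrete, here is a minimal NumPy sketch of Algorithm 1 for dense A with \(m \ge n\), using the reduced projection update of Sect. 4.3.5 with factorization caching. The nonnegative least squares instance, the function names, and all parameter values below are illustrative, not part of the chapter:

```python
import numpy as np
from scipy.optimize import nnls

def graph_projection_splitting(A, prox_f, prox_g, rho=1.0, iters=2000):
    # A sketch of Algorithm 1 for dense A with m >= n.
    m, n = A.shape
    x = np.zeros(n); y = np.zeros(m)
    xt = np.zeros(n); yt = np.zeros(m)        # scaled dual variables
    L = np.linalg.cholesky(A.T @ A + np.eye(n))  # factorization caching: done once
    for _ in range(iters):
        # proximal step: the x and y parts do not mix and can run in parallel
        x_half = prox_g(x - xt, rho)
        y_half = prox_f(y - yt, rho)
        # projection onto {(x, y) | y = Ax} via the reduced (block-eliminated) system
        rhs = (x_half + xt) + A.T @ (y_half + yt)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
        y = A @ x
        # scaled dual update
        xt += x_half - x
        yt += y_half - y
    return x_half, y_half

# Example (hypothetical data): nonnegative least squares,
# f(y) = (1/2)||y - b||_2^2, g = indicator of {x >= 0}.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
prox_f = lambda v, rho: (b + rho * v) / (1.0 + rho)  # prox of (1/2)||. - b||^2
prox_g = lambda v, rho: np.maximum(v, 0.0)           # projection onto x >= 0
x, y = graph_projection_splitting(A, prox_f, prox_g)
x_ref, _ = nnls(A, b)                                # reference solution
```

On this small instance the half-iterates agree with SciPy's active-set NNLS solver to modest accuracy, consistent with the convergence guarantees in (4.10).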

General convergence theory for ADMM [5, Sect. 3.2] guarantees that (with our assumption on the existence of a solution)

$$\begin{aligned} (x^{k+1},y^{k+1}) - (x^{k+1/2}, y^{k+1/2}) \rightarrow 0, \quad f(y^k) +g(x^k) \rightarrow p^\star , \quad ({\tilde{x}}^k,{\tilde{y}}^k) \rightarrow ({\tilde{x}}^\star , {\tilde{y}}^\star ), \end{aligned}$$
(4.10)

as \(k \rightarrow \infty \).

4.3.2 Extensions

We discuss three common extensions that can be used to speed up convergence in practice: over-relaxation, approximate projection, and varying penalty.

Over-relaxation. Replacing \(x^{k+1/2}\) by \(\alpha x^{k+1/2}+(1-\alpha )x^{k}\) in the projection and dual update steps is known as over-relaxation if \(\alpha > 1\) or under-relaxation if \(\alpha < 1\). The algorithm is guaranteed to converge [19] for any \(\alpha \in (0,2)\); it is observed in practice [36] that using an over-relaxation parameter in the range [1.5, 1.8] can improve practical convergence.

Approximate projection. Instead of computing the projection \(\varPi \) exactly one can use an approximation \({\tilde{\varPi }}\), with the only restriction that

$$ \textstyle {\sum }_{k = 0}^\infty \Vert \varPi (x^{k+1/2}, y^{k+1/2}) - {\tilde{\varPi }}(x^{k+1/2}, y^{k+1/2})\Vert _2 < \infty $$

must hold. This is known as approximate projection [36], and is guaranteed to converge [1]. This extension is particularly useful if the approximate projection is computed using an indirect or iterative method.

Varying penalty. Large values of \(\rho \) tend to encourage primal feasibility, while small values tend to encourage dual feasibility [5, Sect. 3.4.1]. A common approach is to adjust or vary \(\rho \) in each iteration, so that the primal and dual residuals are (roughly) balanced in magnitude. When doing so, it is important to re-scale \(({\tilde{x}}^{k+1}, {\tilde{y}}^{k+1})\) by a factor \(\rho ^{k}/\rho ^{k+1}\).
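A common concrete instance of such a residual-balancing scheme can be sketched as follows; the thresholds tau and kappa are illustrative choices, not values prescribed by the text:

```python
import numpy as np

def update_rho(rho, r_pri, r_dual, xt, yt, tau=2.0, kappa=10.0):
    # Hypothetical residual-balancing rule: increase rho when the primal
    # residual dominates, decrease it when the dual residual dominates.
    if r_pri > kappa * r_dual:
        rho_new = tau * rho
    elif r_dual > kappa * r_pri:
        rho_new = rho / tau
    else:
        rho_new = rho
    # re-scale the scaled dual variables by rho / rho_new, as required
    scale = rho / rho_new
    return rho_new, xt * scale, yt * scale
```

The re-scaling step keeps the unscaled dual variables \(\mu = -\rho {\tilde{x}}\), \(\nu = -\rho {\tilde{y}}\) unchanged across the parameter update.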

4.3.3 Feasible Iterates

In each iteration, Algorithm 1 produces sets of points that are either primal, dual, or Fenchel feasible. Define

$$\begin{aligned} \mu ^{k} = -\rho {\tilde{x}}^k, \quad \nu ^{k} = -\rho {\tilde{y}}^k, \quad \mu ^{k+1/2} = -\rho (x^{k+1/2} - x^k + {\tilde{x}}^k), \quad \nu ^{k+1/2} = -\rho (y^{k+1/2} - y^k + {\tilde{y}}^k). \end{aligned}$$

The following statements hold.

  1. The pair \((x^{k+1}, y^{k+1})\) is primal feasible, since it is the projection onto the graph \(\mathscr {G}\).

  2. The pair \((\mu ^{k+1}, \nu ^{k+1})\) is dual feasible, as long as \((\mu ^{0}, \nu ^{0})\) is dual feasible and \((x^0, y^0)\) is primal feasible. Dual feasibility means \(\mu ^{k+1} +A^T \nu ^{k+1}=0\), which we show using the update equations in Algorithm 1:

    $$\begin{aligned} \mu ^{k+1} +A^T \nu ^{k+1}&= -\rho ({\tilde{x}}^{k} + x^{k+1/2} - x^{k+1} +A^T ( {\tilde{y}}^{k} + y^{k+1/2} - y^{k+1} )) \\&= -\rho ({\tilde{x}}^{k} + A^T{\tilde{y}}^k + x^{k+1/2} +A^Ty^{k+1/2} - (I+A^TA)x^{k+1}), \end{aligned}$$

    where we substituted \(y^{k+1} = Ax^{k+1}\). From the projection operator in (4.9) it follows that \((I+A^TA)x^{k+1} = x^{k+1/2} + A^Ty^{k+1/2}\), therefore

    $$ \mu ^{k+1} +A^T \nu ^{k+1} = -\rho ({\tilde{x}}^{k} + A^T{\tilde{y}}^k) = \mu ^{k} +A^T \nu ^{k} = \mu ^0 + A^T\nu ^0, $$

    where the last equality follows from an inductive argument. Since we made the assumption that \( (\mu ^{0},\nu ^{0})\) is dual feasible, we can conclude that \((\mu ^{k+1}, \nu ^{k+1})\) is also dual feasible.

  3. The tuple \((x^{k+1/2}, y^{k+1/2}, \mu ^{k+1/2}, \nu ^{k+1/2})\) is Fenchel feasible. From the definition of the proximal operator,

    $$ x^{k+1/2} = \mathop {\text {argmin}}\limits _{x} \left( g(x) + (\rho /2)\Vert x - x^{k} + {\tilde{x}}^{k}\Vert _2^2 \right) \;\Leftrightarrow \; 0 \in \partial g(x^{k+1/2}) + \rho (x^{k+1/2} - x^{k} + {\tilde{x}}^{k}) \;\Leftrightarrow \; \mu ^{k+1/2} \in \partial g(x^{k+1/2}). $$

    By the same argument \( \nu ^{k+1/2} \in \partial f(y^{k+1/2})\).

Applying the results in (4.10) to the dual variables, we find \(\nu ^{k+1/2} \rightarrow \nu ^\star \) and \(\mu ^{k+1/2} \rightarrow \mu ^\star \), from which we conclude that \((x^{k+1/2}, y^{k+1/2}, \mu ^{k+1/2}, \nu ^{k+1/2})\) is primal and dual feasible in the limit.

4.3.4 Stopping Criteria

In Sect. 4.2.2 we noted that either (4.4, 4.5, 4.6) or (4.4, 4.5, 4.7) is sufficient for optimality. We present two different stopping criteria based on these conditions.

Residual-based stopping. The tuple \((x^{k+1/2}, y^{k+1/2}, \mu ^{k+1/2}, \nu ^{k+1/2})\) is Fenchel feasible in each iteration, but only primal and dual feasible in the limit. Accordingly, we propose the residual-based stopping criterion

$$\begin{aligned} \Vert Ax^{k+1/2} - y^{k+1/2}\Vert _2 \le \varepsilon ^{\text {pri}}, \quad \Vert A^T\nu ^{k+1/2} + \mu ^{k+1/2}\Vert _2 \le \varepsilon ^{\text {dual}}, \end{aligned}$$
(4.11)

where \(\varepsilon ^\text {pri}\) and \(\varepsilon ^\text {dual}\) are positive tolerances. These should be chosen as a mixture of absolute and relative tolerances, such as

$$ \varepsilon ^\text {pri} = \varepsilon ^{\text {abs}} + \varepsilon ^{\text {rel}} \Vert y^{k+1/2}\Vert _2, \quad \varepsilon ^\text {dual} = \varepsilon ^{\text {abs}} + \varepsilon ^{\text {rel}} \Vert \mu ^{k+1/2}\Vert _2. $$

Reasonable values for \(\varepsilon ^{\text {abs}}\) and \(\varepsilon ^{\text {rel}}\) are in the range \([10^{-4}, 10^{-2}]\).
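In code, the tolerances and the residual-based criterion (4.11) might be computed as follows (a sketch; the function names are ours):

```python
import numpy as np

def tolerances(y_half, mu_half, eps_abs=1e-4, eps_rel=1e-4):
    # mixed absolute/relative tolerances for the residual-based criterion (4.11)
    eps_pri = eps_abs + eps_rel * np.linalg.norm(y_half)
    eps_dual = eps_abs + eps_rel * np.linalg.norm(mu_half)
    return eps_pri, eps_dual

def converged(A, x_half, y_half, mu_half, nu_half, eps_pri, eps_dual):
    # primal residual ||A x - y|| and dual residual ||A^T nu + mu||
    r_pri = np.linalg.norm(A @ x_half - y_half)
    r_dual = np.linalg.norm(A.T @ nu_half + mu_half)
    return r_pri <= eps_pri and r_dual <= eps_dual
```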

Gap-based stopping. The tuple \((x^k, y^k, \mu ^k, \nu ^k)\) is primal and dual feasible, but only Fenchel feasible in the limit. We propose the gap-based stopping criterion

$$ \eta ^k = f(y^k) + g(x^k) + f^*(\nu ^k) + g^*(\mu ^k) \le \varepsilon ^\text {gap}, $$

where \(\varepsilon ^\text {gap}\) should be chosen relative to the current objective value, i.e.,

$$ \varepsilon ^\text {gap} = \varepsilon ^{\text {abs}} + \varepsilon ^{\text {rel}}|f(y^k) + g(x^k)|. $$

Here too, reasonable values for \(\varepsilon ^{\text {abs}}\) and \(\varepsilon ^{\text {rel}}\) are in the range \([10^{-4}, 10^{-2}]\).

Although the gap-based stopping criterion is very informative, since it directly bounds the suboptimality of the current iterate, it suffers from the drawback that \(f, g, f^*\), and \(g^*\) must all have full domain, since otherwise the gap \(\eta ^k\) can be infinite. Indeed, the gap \(\eta ^k\) is almost always infinite when f or g encodes constraints.

4.3.5 Implementation

Projection. There are different ways to evaluate the projection operator \(\varPi \), depending on the structure and size of A.

One simple method that can be used if A is sparse and not too large is a direct sparse factorization. The matrix K is quasi-definite, and therefore the \(LDL^T\) decomposition is well defined [51]. Since K does not change from iteration to iteration, the factors L and D (and the permutation or elimination ordering) can be computed in the first iteration (e.g., using CHOLMOD [9]) and reused in subsequent iterations. This is known as factorization caching [5, Sect. 4.2.3] [39, Sect. A.1]. With factorization caching, we get a (potentially) large speedup in iterations, after the first one.

If A is dense, and \(\min (m,n)\) is not too large, then block elimination [8, Appendix C] can be applied to K [39, Appendix A], yielding the reduced update

$$\begin{aligned} x^{k+1}&:= (A^TA + I)^{-1}(c + A^Td) \\ y^{k+1}&:= Ax^{k+1} \end{aligned}$$

if \(m \ge n\), or

$$\begin{aligned} y^{k+1}&:= d + (AA^T + I)^{-1}(Ac -d) \\ x^{k+1}&:= c - A^T(d-y^{k+1}) \end{aligned}$$

if \(m < n\). Both formulations involve forming and solving a system of \(\min (m,n)\) equations with \(\min (m,n)\) unknowns. Since the coefficient matrix is symmetric positive definite, we can use the Cholesky decomposition. Forming the coefficient matrix \(A^TA+I\) or \(AA^T+I\) dominates the computation. Here too, we can take advantage of factorization caching.

The regular structure of dense matrices allows us to analyze the computational complexity of each step. We define \(q =\min (m,n)\) and \(p = \max (m,n)\). The first iteration involves the factorization and the solve step; subsequent iterations only require the solve step. The computational cost of the factorization is the combined cost of computing \(A^TA\) (or \(AA^T\), whichever is smaller), at a cost of \(pq^2\) flops, in addition to the Cholesky decomposition, at a cost of \((1/3)q^3\) flops. The solve step consists of two matrix-vector multiplications at a cost of 4pq flops and solving a triangular system of equations at a cost of \(q^2\) flops. The total cost of the first iteration is \(O(pq^2)\) flops, while each subsequent iteration only costs O(pq) flops, showing that we obtain savings by a factor of q, after the first iteration, by using factorization caching.

For very large problems direct methods are no longer practical, at which point indirect (iterative) methods can be used. Fortunately, as the primal and dual variables converge, we are guaranteed that \((x^{k+1/2}, y^{k+1/2}) \rightarrow (x^{k+1}, y^{k+1})\), meaning that we will have a good initial guess we can use to initialize the iterative method to (approximately) evaluate the projection. One can either apply CGLS (conjugate gradient least-squares) [28] or LSQR [45] to the reduced update, or apply MINRES (minimum residual) [44] to K directly. It can be shown that the latter requires twice as many iterations as the former, and it is therefore not recommended.
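The warm start can be sketched as follows, using SciPy's conjugate gradient on the reduced normal-equations system; CGLS or LSQR would avoid forming this operator explicitly, and CG on \(A^TA + I\) is used here only for brevity (the setup is illustrative):

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

rng = np.random.default_rng(2)
m, n = 200, 50
A = rng.standard_normal((m, n))

# matrix-free operator v -> (A^T A + I) v; only products with A and A^T are needed
op = LinearOperator((n, n), matvec=lambda v: A.T @ (A @ v) + v, dtype=np.float64)

def approx_project(c, d, x0=None):
    # approximate projection: CG on (A^T A + I) x = c + A^T d, warm-started at x0
    x, info = cg(op, c + A.T @ d, x0=x0)
    return x, A @ x

c = rng.standard_normal(n); d = rng.standard_normal(m)
x1, y1 = approx_project(c, d)         # cold start
x2, y2 = approx_project(c, d, x0=x1)  # warm start: terminates almost immediately
```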

Proximal operators. Since the \(x,{\tilde{x}}\) and \(y,{\tilde{y}}\) components are decoupled in the proximal step and dual variable update step, both of these can be done separately, and in parallel for x and y. If either f or g is separable, then the proximal step can be parallelized further. Combettes and Pesquet [15, Sect. 10.2] contains a table of proximal operators for a wide range of functions, and the monograph [40] details how proximal operators can be computed efficiently, in particular for the case where there is no analytical solution. Typically, the cost of computing the proximal operator will be negligible compared to the cost of the projection. In particular, if f and g are separable, then the cost will be \(O(m + n)\), and completely parallelizable.

4.4 Preconditioning and Parameter Selection

The practical convergence of the algorithm (i.e., the number of iterations required before it terminates) can depend greatly on the choice of the proximal parameter \(\rho \), and the scaling of the variables. In this section we analyze these, and suggest a method for choosing \(\rho \) and for scaling the variables that (empirically) speeds up practical convergence.

4.4.1 Preconditioning

Consider scaling the variables x and y in (4.1), by \(E^{-1}\) and D respectively, where \(D \in \mathbf {R}^{m\times m}\) and \(E \in \mathbf {R}^{n \times n}\) are non-singular matrices. We define the scaled variables

$$ {\hat{y}} = Dy, \quad {\hat{x}} = E^{-1}x, $$

which transforms (4.1) into

$$\begin{aligned} \begin{aligned}&\text {minimize}&f(D^{-1}{\hat{y}}) + g(E {\hat{x}}) \\&\text {subject to}&{\hat{y}}= DAE {\hat{x}}. \end{aligned} \end{aligned}$$
(4.12)

This is also a graph form problem, and for notational convenience, we define

$$ \quad {\hat{A}} = DAE, \quad {\hat{f}}({\hat{y}}) = f(D^{-1}{\hat{y}}), \quad {\hat{g}}({\hat{x}}) = g(E {\hat{x}}), $$

so that the problem can be written as

$$\begin{aligned} \begin{aligned}&\text {minimize}&{\hat{f}}({\hat{y}}) + {\hat{g}}({\hat{x}}) \\&\text {subject to}&{\hat{y}} = {\hat{A}}{\hat{x}}. \end{aligned} \end{aligned}$$

We refer to this problem as the preconditioned version of (4.1). Our goal is to choose D and E so that (a) the algorithm applied to the preconditioned problem converges in fewer steps in practice, and (b) the additional computational cost due to the preconditioning is minimal.

Graph projection splitting applied to the preconditioned problem (4.12) can be interpreted in terms of the original iterates. The proximal step iterates are redefined as

$$\begin{aligned} x^{k+1/2}&= \mathop {\text {argmin}}\limits _{x} \left( g(x) + (\rho /2)\Vert x - x^k + \tilde{x}^k\Vert _{(EE^T)^{-1}}^2 \right) , \\ y^{k+1/2}&= \mathop {\text {argmin}}\limits _{y} \left( f(y) + (\rho /2)\Vert y - y^k + {\tilde{y}}^k\Vert _{(D^TD)}^2 \right) , \end{aligned}$$

and the projected iterates are the result of the weighted projection

$$\begin{aligned} \begin{array}{l@{\qquad }l} \text {minimize}&{} (1/2)\Vert x - x^{k+1/2}\Vert _{(EE^T)^{-1}}^2 + (1/2)\Vert y - y^{k+1/2}\Vert _{(D^TD)}^2 \\ \text {subject to} &{} y = A x, \end{array} \end{aligned}$$

where \(\Vert x\Vert _P = \sqrt{x^TPx}\) for a symmetric positive-definite matrix P. This projection can be expressed as

$$ \varPi (c,d)= {\hat{K}}^{-1}\begin{bmatrix} (EE^T)^{-1}c + A^TD^TDd \\ 0 \end{bmatrix}, \qquad {\hat{K}} = \begin{bmatrix} (EE^T)^{-1}&A^TD^TD \\ D^TDA&-D^TD \end{bmatrix}. $$

Notice that graph projection splitting is invariant to orthogonal transformations of the variables x and y, since the preconditioners only appear in terms of \(D^TD\) and \(EE^T\). In particular, if we let \(D = U^T\) and \(E = V\), where \(A=U\varSigma V^T\), then the preconditioned constraint matrix \({\hat{A}} = DAE = \varSigma \) is diagonal. We conclude that any graph form problem can be preconditioned to one with a diagonal nonnegative constraint matrix \(\varSigma \). For analysis purposes, we are therefore free to assume that A is diagonal. We also note that for orthogonal preconditioners, there exists an analytical relationship between the original proximal operator and the preconditioned proximal operator. With \(\phi (x) = \varphi (Qx)\), where Q is any orthogonal matrix (\(Q^TQ = QQ^T= I\)), we have

$$ \mathbf {prox}_{\phi }(v) = Q^T\mathbf {prox}_{\varphi }(Qv). $$

While the proximal operator of \(\phi \) is readily computed, orthogonal preconditioners destroy separability of the objective. As a result, we cannot easily combine them with other preconditioners.
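The identity can be checked numerically. Here \(\varphi \) is taken to be a diagonal quadratic, whose proximal operator has a simple closed form; the setup is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal Q
w = rng.uniform(0.5, 2.0, n)                      # diagonal of W (positive)

def prox_varphi(v):
    # varphi(z) = (1/2) z^T W z; argmin_z varphi(z) + (1/2)||z - v||^2
    # decouples coordinatewise: z_i = v_i / (1 + w_i)
    return v / (1.0 + w)

v = rng.standard_normal(n)
# prox of phi(x) = varphi(Qx) = (1/2) x^T Q^T W Q x, computed directly:
lhs = np.linalg.solve(np.eye(n) + Q.T @ np.diag(w) @ Q, v)
# ... and via the orthogonal-preconditioner identity:
rhs = Q.T @ prox_varphi(Q @ v)
```

The two computations agree, since \((I + Q^TWQ)^{-1} = Q^T(I+W)^{-1}Q\) for orthogonal Q.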

Multiplying D by a scalar \(\alpha \) and dividing E by the same scalar has the effect of scaling \(\rho \) by a factor of \(\alpha ^2\). It however has no effect on the projection step, showing that \(\rho \) can be thought of as the relative scaling of D and E.

In the case where f and g are separable and both D and E are diagonal, the proximal step takes the simplified form

$$ x_j^{k+1/2} = \mathbf {prox}_{g_j,\, \rho ^E_{j}}(x_j^k - {\tilde{x}}_j^k), \quad j = 1, \ldots , n, \qquad y_i^{k+1/2} = \mathbf {prox}_{f_i,\, \rho ^D_{i}}(y_i^k - {\tilde{y}}_i^k), \quad i = 1, \ldots , m, $$

where \(\rho ^E_{j} = \rho /E_{jj}^2\) and \(\rho ^D_{i} = \rho D_{ii}^2\). Since only \(\rho \) is modified, any routine capable of computing \(\mathbf {prox}_f\) and \(\mathbf {prox}_g\) can also be used to compute the preconditioned proximal update.

4.4.1.1 Effect of Preconditioning on Projection

For the purpose of analysis, we will assume that \(A = \varSigma \), where \(\varSigma \) is a nonnegative diagonal matrix. The projection operator simplifies to

$$ \varPi (c,d) = \begin{bmatrix} (I + \varSigma ^T\varSigma )^{-1}&(I + \varSigma ^T\varSigma )^{-1}\varSigma ^T \\ (I + \varSigma \varSigma ^T)^{-1}\varSigma&(I + \varSigma \varSigma ^T)^{-1}\varSigma \varSigma ^T \end{bmatrix}\begin{bmatrix} c \\ d\end{bmatrix}, $$

which means the projection step can be written explicitly as

$$\begin{aligned} x^{k+1}_i&= \frac{1}{1+\sigma _i^2}(x_i^{k+1/2} + {\tilde{x}}_i^{k} + \sigma _i(y_i^{k+1/2} + {\tilde{y}}_i^{k}))&i=1, \ldots , \min (m,n) \\ x^{k+1}_i&= x_i^{k+1/2} + {\tilde{x}}_i^{k}&i=\min (m,n)+1, \ldots , n \\ y^{k+1}_i&= \frac{\sigma _i}{1+\sigma _i^2}(x_i^{k+1/2} + {\tilde{x}}_i^{k} + \sigma _i (y_i^{k+1/2} + {\tilde{y}}_i^{k}))&i=1, \ldots , \min (m,n) \\ y^{k+1}_i&= 0&i=\min (m,n)+1, \ldots , m, \end{aligned}$$

where \(\sigma _i\) is the ith diagonal entry of \(\varSigma \) and subscripted indices of x and y denote the ith entry of the respective vector. Notice that the projected variables \(x_i^{k+1}\) and \(y_i^{k+1}\) are equally dependent on \((x_i^{k+1/2} + \tilde{x}_i^k)\) and \(\sigma _i(y_i^{k+1/2} + {\tilde{y}}_i^{k})\). If \(\sigma _i\) is either significantly smaller or larger than 1, then the terms \(x_i^{k+1}\) and \(y_i^{k+1}\) will be dominated by either \((x_i^{k+1/2} + {\tilde{x}}_i^k)\) or \((y_i^{k+1/2} + {\tilde{y}}_i^{k})\). However if \(\sigma _i = 1\), then the projection step exactly averages the two quantities

$$\begin{aligned} x_i^{k+1} = y_i^{k+1}&= \frac{1}{2}(x_i^{k+1/2} + {\tilde{x}}_i^{k} + y_i^{k+1/2} + {\tilde{y}}_i^k)&i=1, \ldots , \min (m,n). \end{aligned}$$

As pointed out in Sect. 4.3, the projection step mixes the variables x and y. For this to approximately reduce to averaging, we need \(\sigma _i \approx 1\).
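A quick numerical check that the explicit formulas agree with solving the system (4.9) for a diagonal constraint matrix (here with \(m < n\); the data are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 4, 6
sigma = rng.uniform(0.5, 2.0, min(m, n))
Sigma = np.zeros((m, n)); Sigma[:m, :m] = np.diag(sigma)  # diagonal A, m < n
c = rng.standard_normal(n); d = rng.standard_normal(m)    # c, d play x+xt, y+yt

# explicit projection formulas from the text
x = c.copy()
x[:m] = (c[:m] + sigma * d) / (1.0 + sigma**2)  # i = 1, ..., min(m, n)
y = sigma * x[:m]                               # y = Sigma x

# reference: solve K [x; y] = [c + Sigma^T d; 0] directly
K = np.block([[np.eye(n), Sigma.T], [Sigma, -np.eye(m)]])
sol = np.linalg.solve(K, np.concatenate([c + Sigma.T @ d, np.zeros(m)]))
```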

4.4.1.2 Choosing D and E

The analysis suggests that the algorithm will converge quickly when the singular values of DAE are all near one, i.e.,

$$\begin{aligned} \mathbf {cond}\big (DAE\big ) \approx 1, \quad \Vert DAE\Vert _2 \approx 1. \end{aligned}$$
(4.13)

(This claim is also supported in [23], and is consistent with our computational experience.) Preconditioners that exactly satisfy these conditions can be found using the singular value decomposition of A. They will, however, be of little use, since such preconditioners generally destroy our ability to evaluate the proximal operators of \({\hat{f}}\) and \({\hat{g}}\) efficiently.

So we seek choices of D and E for which (4.13) holds (very) approximately, and for which the proximal operators of \({\hat{f}}\) and \({\hat{g}}\) can still be efficiently computed. We now specialize to the case in which f and g are separable. In this case, diagonal D and E are candidates for which the proximal operators are still easily computed. (The same ideas apply to block separable f and g, where we impose the further constraint that the diagonal entries within a block are the same.) So we now limit ourselves to the case of diagonal preconditioners.

Diagonal matrices that minimize the condition number of DAE, and therefore approximately satisfy the first condition in (4.13), can be found exactly, using semidefinite programming [3, Sect. 3.1]. But this computation is quite involved, and may not be worth the computational effort since the conditions (4.13) are just a heuristic for faster convergence. (For control problems, where the problem is solved many times with the same matrix A, this approach makes sense; see [21].)

A heuristic that tends to minimize the condition number is to equilibrate the matrix, i.e., choose D and E so that all rows of DAE have the same p-norm and all columns of DAE have the same p-norm. (Such a matrix is said to be equilibrated.) This corresponds to finding D and E so that

$$ |DAE|^p\mathbf {1} = \alpha \mathbf {1}, \qquad \mathbf {1}^T|DAE|^p = \beta \mathbf {1}^T, $$

where \(\alpha , \beta > 0\). Here the notation \(|\cdot |^p\) should be understood in the elementwise sense. Various authors [6, 13, 36] suggest that equilibration can decrease the number of iterations needed for operator splitting and other first-order methods. One issue that we need to address is that not every matrix can be equilibrated. Given that equilibration is only a heuristic for achieving \(\sigma _i(DAE)\approx 1\), which is in turn a heuristic for fast convergence of the algorithm, partial equilibration should serve the same purpose just as well.

Sinkhorn and Knopp [48] suggest a method for matrix equilibration for \(p<\infty \), when A is square and has full support. In the case \(p = \infty \), the Ruiz algorithm [46] can be used. Both of these methods fail (as they must) when the matrix A cannot be equilibrated. Below we give a simple modification of the Sinkhorn–Knopp algorithm that handles the case when A is non-square or cannot be exactly equilibrated.

Choosing preconditioners that satisfy \(\Vert DAE\Vert _2=1\) can be achieved by scaling D and E by \(\sigma _{\max }(DAE)^{-q}\) and \(\sigma _{\max }(DAE)^{q-1}\) respectively for \(q \in \mathbf {R}\). The quantity \(\sigma _{\max }(DAE)\) can be approximated using power iteration, but we have found it is unnecessary to exactly enforce \(\Vert DAE\Vert _2=1\). A more computationally efficient alternative is to replace \(\sigma _{\max }(DAE)\) by \(\Vert DAE\Vert _F/\sqrt{\min (m,n)}\). This quantity coincides with \(\sigma _{\max }(DAE)\) when \(\mathbf {cond}(DAE) = 1\). If DAE is equilibrated and \(p=2\), this scaling corresponds to \((DAE)^T(DAE)\) (or \((DAE)(DAE)^T\) when \(m < n\)) having unit diagonal.

4.4.2 Regularized Equilibration

In this section, we present a self-contained derivation of our matrix-equilibration method. It is similar to the Sinkhorn–Knopp algorithm, but also works when the matrix is non-square or cannot be exactly equilibrated.

Consider the convex optimization problem with variables u and v,

$$\begin{aligned} \begin{aligned}&\text {minimize}&\sum _{i=1}^m\sum _{j=1}^n|A_{ij}|^pe^{u_i+v_j} - n \mathbf {1}^Tu - m \mathbf {1}^Tv +\gamma \left[ n \sum _{i=1}^me^{u_i} + m \sum _{j=1}^ne^{v_j}\right] , \end{aligned} \end{aligned}$$
(4.14)

where \(\gamma \ge 0\) is a regularization parameter. The objective is bounded below for any \(\gamma > 0\). The optimality conditions are

$$\begin{aligned} \sum _{j=1}^n|A_{ij}|^pe^{u_i+v_j} - n + n \gamma e^{u_i}= 0,\quad i = 1,\ldots ,m,\\ \sum _{i=1}^m|A_{ij}|^pe^{u_i+v_j} - m + m \gamma e^{v_j} = 0,\quad j = 1,\ldots ,n. \end{aligned}$$

By defining \(D_{ii} = e^{u_i/p}\) and \(E_{jj} = e^{v_j/p}\), these conditions are equivalent to

$$ |DAE|^p \mathbf {1} + n \gamma D^p\mathbf {1} =n \mathbf {1}, \quad \mathbf {1}^T |DAE|^p + m \gamma \mathbf {1}^TE^p=m \mathbf {1}^T, $$

where \(\mathbf {1}\) is the vector with all entries one. When \(\gamma = 0\), these are exactly the conditions for the matrix to be equilibrated. The objective may be unbounded below when \(\gamma = 0\), which corresponds precisely to the case when the matrix cannot be equilibrated. As \(\gamma \rightarrow \infty \), both D and E converge to the scaled identity matrix \(\gamma ^{-1/p}I\), showing that \(\gamma \) can be thought of as a regularizer on the elements of D and E. If D and E are optimal, then the two equalities

$$ \mathbf {1}^T|DAE|^p\mathbf {1} + n \gamma \mathbf {1}^TD^p\mathbf {1} = mn, \qquad \mathbf {1}^T|DAE|^p\mathbf {1} + m \gamma \mathbf {1}^TE^p\mathbf {1} = mn $$

must hold. Subtracting one from the other and dividing by \(\gamma \), we find the relationship

$$ n \mathbf {1}^TD^p\mathbf {1} = m \mathbf {1}^TE^p\mathbf {1}, $$

implying that the average diagonal entries of \(D^p\) and \(E^p\) are the same.

There are various ways to solve the optimization problem (4.14), one of which is to apply coordinate descent. Minimizing the objective in (4.14) with respect to \(u_i\) yields

$$ \sum _{j = 1}^ne^{u_i^{k}+v^{k-1}_j}|A_{ij}|^p + n \gamma e^{u_i^{k}}= n \Leftrightarrow e^{u_i^{k}} = \frac{n}{\sum _{j = 1}^ne^{v^{k-1}_j}|A_{ij}|^p + n \gamma } $$

and similarly for \(v_j\)

$$ e^{v_j^{k}} = \frac{m}{\sum _{i = 1}^me^{u^{k}_i}|A_{ij}|^p + m \gamma }. $$

Since the minimization with respect to \(u_i^k\) is independent of the other entries of \(u^{k}\), the update can be carried out in parallel for each element of u, and similarly for v. Repeated minimization over u and v will eventually yield values that satisfy the optimality conditions.

Algorithm 2 summarizes the equilibration routine. The inverse operation in steps 4 and 5 should be understood in the element-wise sense.

figure b
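The coordinate-descent updates above translate into a short routine. The following NumPy sketch of Algorithm 2 (variable names and the iteration count are our choices, not the POGS implementation) stores \(e^{u}\) and \(e^{v}\) directly:

```python
import numpy as np

def regularized_equilibrate(A, p=2, gamma=1e-6, iters=100):
    # Regularized Sinkhorn-Knopp equilibration: alternately apply the
    # closed-form updates for e^{u} and e^{v} derived above.
    m, n = A.shape
    M = np.abs(A) ** p
    eu = np.ones(m)                      # e^{u}
    ev = np.ones(n)                      # e^{v}
    for _ in range(iters):
        eu = n / (M @ ev + n * gamma)    # all u_i updated in parallel
        ev = m / (M.T @ eu + m * gamma)  # all v_j, using the fresh e^{u}
    # D_ii = e^{u_i/p}, E_jj = e^{v_j/p}, returned as vectors
    return eu ** (1.0 / p), ev ** (1.0 / p)
```

For small \(\gamma \) and an equilibratable A, the output approximately satisfies the equilibration conditions: the rows of \(|DAE|^p\) sum to about n and its columns to about m.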

4.4.3 Adaptive Penalty Update

The projection operator \(\varPi \) does not depend on the choice of \(\rho \), so we are free to update \(\rho \) in each iteration, at no extra cost. While the convergence theory only holds for fixed \(\rho \), it still applies if one assumes that \(\rho \) becomes fixed after a finite number of iterations [5].

As a rule, increasing \(\rho \) will decrease the primal residual, while decreasing \(\rho \) will decrease the dual residual. The authors of [5, 30] suggest adapting \(\rho \) to balance the primal and dual residuals. We have found that substantially better practical convergence can be obtained with a variation on this idea: rather than balancing the primal and dual residuals, we allow either the primal or the dual residual to approximately converge, and only then start adjusting \(\rho \). Based on this observation, we propose the following adaptive update scheme.

figure c

Once either the primal or dual residual converges, the algorithm begins to steer \(\rho \) in a direction that makes the other residual converge as well. By making small adjustments to \(\rho \), we tend to remain approximately primal (respectively dual) feasible once primal (dual) feasibility has been attained. Additionally, by requiring a certain number of iterations between an increase in \(\rho \) and a decrease (and vice versa), we ensure that changes to \(\rho \) do not flip-flop between the two directions. The parameter \(\tau \) determines the relative number of iterations between changes in direction.
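Since Algorithm 3 appears above only as a figure, the following Python fragment is a hypothetical sketch of the scheme just described; the function name, the `state` bookkeeping, and the exact guard conditions are our assumptions, not the POGS implementation:

```python
def update_rho(rho, k, state, r_pri, eps_pri, r_dua, eps_dua,
               delta=1.05, tau=0.8):
    # One adaptive penalty step: once one residual has (approximately)
    # converged, nudge rho so that the other converges too. The state
    # dict records the iterations of the last increase/decrease of rho,
    # and tau spaces out changes of direction.
    if r_dua <= eps_dua and r_pri > eps_pri and k > state['last_dec'] / tau:
        rho *= delta                 # larger rho drives the primal residual down
        state['last_inc'] = k
    elif r_pri <= eps_pri and r_dua > eps_dua and k > state['last_inc'] / tau:
        rho /= delta                 # smaller rho drives the dual residual down
        state['last_dec'] = k
    return rho
```

With \(\tau < 1\), a change of direction is only permitted once the iteration count exceeds \(1/\tau \) times the iteration of the last opposite-direction change, which suppresses oscillation in \(\rho \).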

4.5 Implementation

Proximal Graph Solver (POGS) is an open-source (BSD-3 license) implementation of graph projection splitting, written in C++. It supports both GPU and CPU platforms and includes wrappers for C, MATLAB, and R. POGS handles all combinations of sparse/dense matrices, single/double precision arithmetic, and direct/indirect solvers, with the exception (for now) of sparse direct solvers. The only dependency is a tuned BLAS library on the respective platform (e.g., cuBLAS or the Apple Accelerate Framework). The source code is available at

figure d

In lieu of having the user specify the proximal operators of f and g, POGS contains a library of proximal operators for a variety of different functions. It is currently assumed that the objective is separable, in the form

$$ f(y) + g(x) = \sum _{i = 1}^mf_i(y_i) + \sum _{j = 1}^ng_j(x_j), $$

where \(f_i, g_j : \mathbf {R} \rightarrow \mathbf {R} \cup \{\infty \}\). The library contains a set of base functions, and by applying various transformations, the range of representable functions can be greatly extended. In particular, we use the parametric representation

$$ f_i(y_i) = c_i h_i(a_iy_i - b_i) + d_iy_i + (1/2)e_iy_i^2, $$

where \(a_i,b_i,d_i \in \mathbf {R}\), \(c_i, e_i \in \mathbf {R}_+\), and \(h_i : \mathbf {R} \rightarrow \mathbf {R} \cup \{\infty \}\). The same representation is also used for \(g_j\). It is straightforward to express the proximal operators of \(f_i\) in terms of the proximal operator of \(h_i\) using the formula

$$\begin{aligned} {\mathbf {prox}}_{f}(v) = \frac{1}{a}\bigg ({\mathbf {prox}}_{h, (e+\rho )/(c a^2)}\Big (a \left( v\rho -d\right) /(e+\rho ) - b\Big ) + b\bigg ), \end{aligned}$$

where for notational simplicity we have dropped the index i in the constants and functions. It is possible for a user to add their own proximal operator function, if it is not in the current library. We note that the separability assumption on f and g is a simplification, rather than a limitation of the algorithm. It allows us to apply the proximal operator in parallel using either CUDA or OpenMP (depending on the platform).
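The transformation formula is easy to check numerically. Below is a standalone sketch (in Python rather than the C++ of POGS), where the hypothetical `prox_h(w, lam)` minimizes \(h(z) + (\lambda /2)(z-w)^2\):

```python
def prox_f(v, prox_h, a, b, c, d, e, rho):
    # Proximal operator of f(y) = c*h(a*y - b) + d*y + (e/2)*y^2,
    # expressed via the proximal operator of h as in the formula above.
    lam = (e + rho) / (c * a ** 2)
    w = a * (v * rho - d) / (e + rho) - b
    return (prox_h(w, lam) + b) / a
```

For example, with \(h(z) = z^2/2\) one has \(\mathbf {prox}_{h,\lambda }(w) = \lambda w/(1+\lambda )\), and the result agrees with minimizing the resulting quadratic directly.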

The constraint matrix is equilibrated using Algorithm 2, with \(p=2\) and \(\gamma = \frac{m + n}{mn}\sqrt{\varepsilon ^{\text {cmp}}}\), where \(\varepsilon ^{\text {cmp}}\) is machine epsilon. Both D and E are then rescaled evenly, so that they satisfy \(\Vert DAE\Vert _F/\sqrt{\min (m,n)} = 1\). The projection \(\varPi \) is computed as outlined in Sect. 4.3.5; we work with the reduced update equations in all versions of POGS, and in the indirect case we chose to use CGLS.

The parameter \(\rho \) is updated according to Algorithm 3. Empirically, we found that \((\delta , \,\tau ) = (1.05, \,0.8)\) works well, and we use over-relaxation with \(\alpha = 1.7\). POGS supports warm starting, whereby an initial guess for \(x^0\) and/or \(\nu ^0\) may be supplied by the user; if only \(x^0\) is provided, then \(\nu ^0\) is estimated, and vice versa. The warm-start feature allows any cached matrices to be reused when solving additional problems involving the same matrix A.

POGS returns the tuple \((x^{k+1/2}, y^{k+1/2}, \mu ^{k+1/2}, \nu ^{k+1/2})\), since these iterates have finite primal and dual objective values. The primal and dual residuals will be nonzero, at a level determined by the specified tolerances. Future plans for POGS include an extension to block-separable f and g (including general cone solvers), additional wrappers for Julia and Python, support for a sparse direct solver, and a multi-GPU extension.

4.6 Numerical Results

To highlight the robustness and general-purpose nature of POGS, we tested it on nine problem classes using random but realistic data: basis pursuit, entropy maximization, Huber fitting, lasso, logistic regression, linear programming, nonnegative least-squares, portfolio optimization, and support vector machine fitting. For each problem class, the number of nonzeros in A was varied on a logarithmic scale from 100 to 1 billion. The aspect ratio of A also varied from 1:1.25 to 1:10, with the orientation (wide or tall) chosen depending on what was reasonable for each problem. We report running times averaged over all aspect ratios. These problems and the data-generation methods are described in detail in a longer version of this chapter [10]. All experiments were performed in single-precision arithmetic on a machine equipped with an Intel Core i7-870, 16 GB of RAM, and a Titan X GPU. Timing results include the data copy from CPU to GPU.

Fig. 4.1: POGS (GPU version) versus SDPT3 for dense matrices (color represents problem class)

We compare POGS to SDPT3 [50], an open-source solver that handles linear, second-order, and positive semidefinite cone programs. Since SDPT3 uses an interior-point algorithm, the solution returned will be of high precision, allowing us to verify the accuracy of the solution computed by POGS. Problems that took SDPT3 more than 200 seconds (of which there were many) were aborted.

The maximum number of iterations was set to \(10^{4}\), but all problems converged in fewer iterations, with most problems taking a couple of hundred iterations. The relative tolerance was set to \(10^{-3}\), and where solutions from SDPT3 were available, we verified that the solutions produced by both solvers matched to 3 decimal places. We omit SDPT3 running times for problems involving exponential cones, since SDPT3 does not support them.

Figure 4.1 compares the running time of POGS versus SDPT3, for problems where the constraint matrix A is dense. We can make several general observations.

  • POGS solves problems that are 3 orders of magnitude larger than SDPT3 in the same amount of time.

  • Problems that take 200 s in SDPT3 take 0.5 s in POGS.

  • POGS can solve problems with 1 billion nonzeros in 10–40 s.

  • The variation in solve time across different problem classes was similar for POGS and SDPT3, around one order of magnitude.

In summary, POGS is able to solve much larger problems, much faster (to moderate precision).