Abstract
The proximal point algorithm (PPA) is a fundamental method for convex programming. When applying the PPA to solve linearly constrained convex problems, we may prefer to choose an appropriate metric matrix to define the proximal regularization, so that the computational burden of the resulting PPA can be reduced and the subproblems sometimes even admit closed-form or efficient solutions. This idea leads to the so-called customized PPA (also known as the preconditioned PPA), which covers the linearized ALM, the primal-dual hybrid gradient algorithm, and the ADMM as special cases. Since each customized PPA possesses a special structure and has popular applications, it is interesting to ask whether we can design a simple relaxation strategy for these algorithms. In this paper, we treat these customized PPA algorithms uniformly via a mixed variational inequality approach and propose a new relaxation strategy for them. Our idea is based on correcting the dual variables individually and does not rely on relaxing the primal variables, which is very different from previous works. From the variational inequality perspective, we prove global convergence and establish a worst-case convergence rate for these relaxed PPA algorithms. Finally, we demonstrate the performance improvements through some numerical results.
1 Introduction
The problem concerned in this paper is the following convex minimization model
where \(A\in \mathfrak {R}^{m\times n}\), \(b\in \mathfrak {R}^m\), \({{\mathcal {X}}}\subset \mathfrak {R}^{n}\) is a closed convex set, \(\theta : \mathfrak {R}^{n} \rightarrow \mathfrak {R}\) is a generic convex but not necessarily smooth function. Throughout, the solution set of (1.1) is assumed to be nonempty. Problems of this form arise in many applications such as compressed sensing, statistical estimation, machine learning, and image processing, see e.g., [2, 8, 17, 33] for some examples.
We first reformulate (1.1) in a variational inequality form, which allows us to take a brief look at the algorithms for solving this problem. The Lagrangian function of the optimization problem (1.1) is
which is defined on \({{\mathcal {X}}}\times \mathfrak {R}^m\). Let \((x^*,\lambda ^*)\) be a saddle point of the Lagrange function. Then we have
By rearranging the above saddle point inequalities, we have the following optimal conditions:
More compactly, the inequalities (1.2) can be characterized as a monotone variational inequality (VI):
where
It is easily verified that \( (w_1-w_2)^T\big (F(w_1)-F(w_2)\big )\ge 0\) for any \(w_1, w_2\in \Omega \), thus the operator \(F(\cdot )\) defined in (1.3b) is monotone. Since the solution set of (1.1) is assumed to be nonempty, the solution set of VI (1.3), denoted by \(\Omega ^*\), is also nonempty.
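Indeed, with \(F(w)\) of the saddle-point form in (1.3b), i.e., \(F(w) = \big (-A^T\lambda ;\ Ax-b\big )\), the verification is a one-line computation:

```latex
(w_1-w_2)^T\big(F(w_1)-F(w_2)\big)
  = -(x_1-x_2)^T A^T(\lambda_1-\lambda_2)
    + (\lambda_1-\lambda_2)^T A(x_1-x_2)
  = 0 \;\ge\; 0,
```

so the linear part of \(F(\cdot )\) is skew-symmetric and \(F(\cdot )\) is monotone.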
As we have mentioned, our analysis will be conducted in the variational inequality context, so the VI reformulation (1.3) serves as the starting point for further analysis. The algorithms used to solve the model (1.1) amount to finding a solution point of VI (1.3). We now briefly revisit some relevant algorithms in VI form.
The classical proximal point algorithm (PPA) [29, 31] is a popular method to solve the VI. Starting from an initial point \(w^0\), the PPA iterates as follows
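For concreteness, the scheme (1.4) can be stated in the mixed-VI form used throughout this paper (a standard formulation, consistent with the surrounding discussion): find \(w^{k+1}\in \Omega \) such that

```latex
\theta(x) - \theta(x^{k+1})
+ (w - w^{k+1})^T\big(F(w^{k+1}) + r\,(w^{k+1}-w^k)\big) \;\ge\; 0,
\qquad \forall\, w\in\Omega,
```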
where \(r> 0\) is a regularization parameter. The PPA is fundamental both theoretically and algorithmically in the area of optimization. Indeed, the augmented Lagrangian method (ALM) [27, 30] can be interpreted as applying the PPA to the dual of (1.1) [31]. We now verify this interpretation by revisiting the ALM (1.5) in the VI context. For (1.1), the ALM reads as
where \(\lambda \) is the Lagrange multiplier and \(\beta >0\) is a penalty parameter for the linear constraints. Note that the first-order optimality condition of the x-subproblem (1.5a) is
Substituting (1.5b) into the above equality yields
Next, (1.5b) can be written as
Combining (1.6) and (1.7), we have
By using the notations in (1.3), the compact form is
with
Compared with the PPA scheme (1.4), it is easy to see that only the dual term is proximally regularized by the coefficient \(\frac{1}{\beta }\). Thus, applying the PPA to the dual of (1.1), we get the ALM scheme (1.5).
ALM is a benchmark method that has been successfully used in many important areas (e.g., linear inverse problems in image processing [1, 10]). At each iteration, the computational effort of ALM is dominated by evaluating the operator \((A^TA+\frac{1}{\beta }\partial \theta )^{-1}(\cdot )\) in \({\mathcal {X}}\). When \(A^TA\) is not an identity matrix, or even large and dense, such evaluation usually does not have easy or closed form solutions.
On the other hand, many concrete applications arising in sparse or low-rank optimization are modeled by (1.1), and the convex functions involved often have special structures to exploit. An important structure frequently encountered in practice is that the proximal mapping related to the objective \(\theta (\cdot )\), defined by
with parameter \(\tau >0\), can be computed easily. Here “easy” means a closed-form solution exists or efficient solvers are available. Examples of such structure exist in linear and quadratic programming, basis pursuit, nuclear norm minimization, and model fitting problems, etc. Let us briefly review two typical algorithms developed for employing this structure.
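As a concrete illustration, take \(\theta (\cdot )=\Vert \cdot \Vert _1\) (a hypothetical choice used only for this sketch); its proximal mapping in the sense defined above is the componentwise soft-thresholding operator:

```python
import numpy as np

def prox_l1(a, tau):
    """Proximal mapping of theta(x) = ||x||_1 with parameter tau > 0:
    argmin_x ||x||_1 + (1/(2*tau)) * ||x - a||^2  (soft-thresholding)."""
    return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

# Each entry is shrunk toward zero by tau; entries smaller than tau vanish.
x = prox_l1(np.array([3.0, -0.5, 1.2]), 1.0)
```

Analogous closed forms exist for the nuclear norm (singular value thresholding) and for indicator functions of simple sets (projections).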
The first is the primal-dual hybrid gradient (PDHG) algorithm, which is used for total variation image restoration problems in [39], and subsequently improved by Chambolle and Pock in [5, 6]. For solving (1.1), the PDHG iterates as
where r and s are required to satisfy \(r>0,s>0, rs>\Vert A^TA\Vert \).
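To make the iteration concrete, the following minimal sketch runs PDHG on a toy instance with \(\theta (x)=\frac{1}{2}\Vert x\Vert ^2\) and \({\mathcal {X}}=\mathfrak {R}^n\) (hypothetical choices so that the x-subproblem has a closed form); the step ordering used here (primal step first, then a dual step at the extrapolated point \(2x^{k+1}-x^k\)) is one common variant and may differ in minor details from the displayed scheme (1.11):

```python
import numpy as np

# Toy instance: min 0.5*||x||^2  s.t.  Ax = b  (assumed choices for this sketch)
A = np.array([[1.0, 2.0]])
b = np.array([3.0])
r, s = 3.0, 2.0            # rs = 6 > ||A^T A|| = 5, as required

x = np.zeros(2)
lam = np.zeros(1)
for _ in range(300):
    # x-step: argmin 0.5*||x||^2 - lam^T A x + (r/2)*||x - x^k||^2
    x_new = (A.T @ lam + r * x) / (1.0 + r)
    # dual step at the extrapolated point 2*x_new - x
    lam = lam - (A @ (2 * x_new - x) - b) / s
    x = x_new
```

After a few hundred iterations, x approaches the projection of the origin onto \(\{x: Ax=b\}\), here \((0.6, 1.2)\).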
Similar to the derivation of ALM, by using the notations in (1.3), the VI form of PDHG is
where
Since \(rs>\Vert A^TA\Vert \), Q is symmetric positive definite, and can be used to define the Q-norm by \(\Vert w\Vert _Q:=\sqrt{w^TQw}\). With the VI characterization (1.12), we can now easily derive the contraction property of PDHG, which we briefly discuss below.
Setting \(w=w^*\) in (1.12), we have
Recalling that \(F(\cdot )\) is monotone (see (1.3b)), we have
Then, we obtain
Using the above inequality, we get
and thus the sequence generated by PDHG (1.12) obeys
The above inequality implies that the sequence \(\{ w^k\}\) is strictly contractive to the solution set in the Q-norm. Recall that the proximal parameter in the original PPA is a scalar r, which means that all the coordinates of w are proximally regularized with equal weights; in the VI form of PDHG (1.12), by contrast, the proximal parameter is a positive definite matrix Q, and the coordinates of w are proximally regularized with different weights. Using this judicious choice of Q, the resulting subproblems are simple under the assumption (1.10), and the convergence is also guaranteed by noting that the distance to the solution point decreases at each step (see (1.13)). Further, we observe that the algorithm (1.12) shares a similar structure with the PPA (1.4). In this sense, it is essentially a customized application of the PPA. The VI revisit (1.12) and the PPA interpretation of PDHG were first shown in [24], and used in [7] to simplify the convergence analysis.
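For reference, the contraction property (1.13) discussed above can be summarized compactly in the Q-norm (our restatement of the derived inequality):

```latex
\|w^{k+1}-w^{*}\|_Q^2 \;\le\; \|w^{k}-w^{*}\|_Q^2 \;-\; \|w^{k}-w^{k+1}\|_Q^2,
\qquad \forall\, w^{*}\in \Omega^{*}.
```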
Since the PDHG (1.11) takes an algorithmic structure analogous to that of the generic PPA, the existing relaxation strategies for the PPA in the literature can be naturally used for these customized PPA algorithms. One popular relaxation scheme is to combine a computationally trivial relaxation step with the PPA [19]. Specifically, let \(\tilde{w}^k = (\tilde{x}^k, \tilde{\lambda }^k)\) be the output of the customized PPA. The new iterate \(w^{k+1}\) generated by the relaxation step in [19, 20] is given by
As demonstrated in many cases, the relaxation step (1.14b) with \(\gamma \in (1 ,2)\) can speed up the convergence of the customized PPA, see, e.g., [23, 24]. However, this relaxation strategy has some issues that are easily overlooked: the point \(w^{k+1}\) is a combination of \(w^k\) and \(\tilde{w}^k\), so it may fail to lie in the set \(\Omega \) or to possess the structure that \(\tilde{w}^k\) has. To address these limitations, one can first use the relaxation step (1.14b) in the iteration loops; when the specified stopping criterion is achieved, one performs an extra iteration without the relaxation step.
On the other hand, observing that each customized PPA possesses a special structure and has popular applications, it is interesting to ask whether there exists another simple relaxation strategy that is based on these structures and can also yield good improvements. In this paper we treat these customized PPA algorithms uniformly by a mixed variational inequality approach, and propose a new relaxation strategy for them. Now, we specify our relaxed customized PPA as follows
The paper is organized as follows. In Sect. 2, we introduce our relaxation steps for some customized PPA algorithms. In Sect. 3, we discuss the relaxation strategy for the ADMM. We then analyze the convergence of these methods in Sect. 4, and further establish their convergence rate in both the non-ergodic and ergodic sense in Sect. 5. After that, we present some numerical results to demonstrate the improved performance of the relaxation strategy in Sect. 6. Finally, some conclusions are made in Sect. 7.
2 Relaxed customized PPA
In this section, we show how to relax the customized PPA via correcting the dual variable. To make our illustration clearer, the output variable \(w^{k+1}\) of the customized PPA is relabeled as \(\tilde{w}^k\) below.
Note that the relaxation falls into the category of prediction-correction methods. We use the prototype algorithm framework proposed in [22] to unify our analysis. This algorithm framework is also used in [26] for analyzing the PDHG.
Then, the convergence of the prototype algorithm (2.1) can be guaranteed if the following conditions are fulfilled.
In the following, we show that each PPA with the correction step falls into the prototype algorithm (2.1). Then we demonstrate that the conditions specified in (2.2) are fulfilled.
2.1 Relaxed PDHG
Recall that the PDHG in variational form is (1.15a) with metric
Then we judiciously correct the dual variable via
where the relaxation factor \(\gamma \in (0,2)\). Since we always have \(x^{k+1}=\tilde{x}^k\), using the notations in (1.3), we get the correction step as
where
Now, we check the positive definiteness of H and G to verify whether the conditions in (2.2) are fulfilled. First, we need to deduce the expression of H and G. Recall M (see (2.5b)), we have
Then
According to the Schur complement, we have
Since \(rs>\Vert A^TA\Vert \), H is positive definite.
The matrix G is symmetric by \(Q^T=Q\), we next deduce the expression of G.
According to the Schur complement,
Since \(rI\succ \frac{1}{s}A^TA\) and \(0< \gamma <2\), G is positive definite. The relaxed PDHG scheme is thus a special case of the prototype algorithm (2.1).
The relaxation step (2.4) looks somewhat complicated. However, combining the PDHG and (2.4), the update steps of the relaxed PDHG become simple. By straightforward manipulations, the PDHG can be rewritten as a prediction step, which gives
Together with (2.4) and noting \(x^{k+1}=\tilde{x}^k\), we obtain:
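As a simplified numerical illustration of the dual-only correction idea, we reuse the toy instance with \(\theta (x)=\frac{1}{2}\Vert x\Vert ^2\) (a hypothetical choice); here the correction is taken to be \(\lambda ^{k+1} = \lambda ^k + \gamma (\tilde{\lambda }^k - \lambda ^k)\), a simplification that is not necessarily identical to the exact step (2.4):

```python
import numpy as np

A = np.array([[1.0, 2.0]])
b = np.array([3.0])
r, s = 3.0, 2.0            # rs > ||A^T A||
gamma = 1.5                # relaxation factor in (0, 2)

x = np.zeros(2)
lam = np.zeros(1)
for _ in range(400):
    # prediction: one PDHG pass producing (x_tilde, lam_tilde)
    x_tilde = (A.T @ lam + r * x) / (1.0 + r)
    lam_tilde = lam - (A @ (2 * x_tilde - x) - b) / s
    # correction: keep the primal variable, relax only the dual variable
    x = x_tilde
    lam = lam + gamma * (lam_tilde - lam)
```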
2.2 Relaxed linearized ALM
In this section, we show that the linearized ALM (L-ALM) [35, 38] is also a special case of customized PPA, and it can be relaxed by (1.15b). The main idea of L-ALM is to linearize the augmented Lagrangian function (1.5a) in the ALM via adding a quadratic term \(\frac{1}{2} \Vert x-x^k\Vert _{(rI-\beta A^TA)}^2\). Ignoring some constant terms in the x-subproblem, the iterate scheme of L-ALM is
Note that the quadratic term should be positive definite, which requires \(r>\beta \Vert A^TA\Vert \). Now, we revisit the L-ALM (2.11) in VI form.
The first-order optimality condition of the x-subproblem of (2.11a) is
and it can be written as
Next, we rewrite \( \tilde{\lambda }^k = \lambda ^k-\beta (A\tilde{x}^k-b)\) (see (2.11b)) as
Combining (2.12) and (2.13), we have
Recall the notations in (1.3), the L-ALM (2.11) can thus be interpreted as a customized PPA (1.15a) with Q given by
To accelerate the L-ALM, we suggest relaxing the dual variable by
where the relaxation factor \(\gamma \in (0,2)\). Note that \(x^{k+1}\) always equals \({{\tilde{x}}}^k\); we thus have
where
Next, we show that the condition (2.2) is fulfilled by the relaxed L-ALM scheme. We have
and
Under the condition that \(r>0, \beta>0, r>\beta \Vert A^TA\Vert \) and the relaxation factor \(\gamma \in (0,2)\), the matrices H and G are both symmetric positive definite. Thus, the relaxed L-ALM scheme is a special case of the prototype algorithm (2.1) and the conditions in (2.2) are fulfilled. We refer to [32] for the relaxed L-ALM with (1.14b).
Now, the relaxed L-ALM scheme is:
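A minimal numerical sketch of the relaxed L-ALM on the same kind of toy instance, with \(\theta (x)=\frac{1}{2}\Vert x\Vert ^2\) and \({\mathcal {X}}=\mathfrak {R}^n\) (hypothetical choices), relaxing only the dual variable by the factor \(\gamma \):

```python
import numpy as np

A = np.array([[1.0, 2.0]])
b = np.array([3.0])
beta, r = 0.2, 1.1         # r > beta * ||A^T A|| = 1.0
gamma = 1.5                # relaxation factor in (0, 2)

x = np.zeros(2)
lam = np.zeros(1)
for _ in range(400):
    # linearized x-step: argmin 0.5*||x||^2 + x^T g + (r/2)*||x - x^k||^2,
    # where g collects the linearized augmented-Lagrangian terms
    g = A.T @ (beta * (A @ x - b) - lam)
    x = (r * x - g) / (1.0 + r)
    # predicted dual step, then relaxation of the dual variable only
    lam_tilde = lam - beta * (A @ x - b)
    lam = lam + gamma * (lam_tilde - lam)
```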
3 Relaxed ADMM for separable case
Motivated by recent popular applications, we also discuss a separable case of (1.1), in which the objective function \(\theta (x)\) can be decomposed as a sum of two individual convex functions without coupled variables:
where \(\theta _1: \mathfrak {R}^{n_1} \rightarrow \mathfrak {R}\), \(\theta _2: \mathfrak {R}^{n_2} \rightarrow \mathfrak {R}\) are both closed convex (not necessarily smooth) functions; \(A_1\in \mathfrak {R}^{m\times n_1}\), \(A_2\in \mathfrak {R}^{m\times n_2}\); \({{\mathcal {X}}}_1\subset \mathfrak {R}^{n_1}\) , \({{\mathcal {X}}}_2\subset \mathfrak {R}^{n_2}\) are closed convex sets.
The optimality condition of the problem (3.1) can be analogously written as the following variational inequality:
where
Note that the mapping F(w) defined in (3.2b) also satisfies \( (w_1-w_2)^T\big (F(w_1)-F(w_2)\big )\ge 0\) for any \(w_1, w_2\in \Omega \); thus it is monotone.
The classical alternating direction method of multipliers (ADMM), which originated in [9, 16], is suitable for solving (3.1). Starting with an initial iterate \((x_2^0, \lambda ^0) \in {{\mathcal {X}}}_2 \times \mathfrak {R}^m\), the ADMM iterates via the scheme
In [13], it was shown that the ADMM is actually equivalent to the Douglas-Rachford splitting method (DRSM) applied to the dual of (3.1). We also refer to [21] for the primal application of the DRSM. In this section, to make our analysis more general, we consider the proximal ADMM, in which each subproblem is regularized with an additional quadratic proximal term. Note that the original ADMM produces the new iterate in the order \(x_1 \rightarrow x_2\rightarrow \lambda \). In a cyclic sense, it can also be written in the \(x_1 \rightarrow \lambda \rightarrow x_2\) order, which gives
where \(D_1\) and \(D_2\) are two positive definite matrices. Obviously, the proximal ADMM contains the ADMM as a special case. When \(D_1=rI-\beta A_1^TA_1\) (\(r>\beta \Vert A_1^TA_1\Vert \)) or \(D_2=sI-\beta A_2^TA_2\) (\(s>\beta \Vert A_2^TA_2\Vert \)), the so-called split inexact Uzawa method [37, 38] is recovered. In this case, the resulting subproblem is in the form of (1.10) (when \(\mathcal{X}_i=\mathfrak {R}^{n_i}, i =1,2\)), which is simple under the assumption (1.10). We refer the reader to [11] for more choices of \(D_1\) and \(D_2\).
In our relaxation, the output of (3.4) serves as a predictor, so we relabel it as \( \tilde{w}^k= (\tilde{x}_1^k,\tilde{x}_2^k,{\tilde{\lambda }^k})\) as before. Now, we first follow [4] to revisit the proximal ADMM (3.4) in VI form.
Lemma 3.1
For given \(w^k=(x_1^k,x_2^k,\lambda ^k)\), let \({{\tilde{w}}}^k\) be generated by (3.4). Then, we have
where
Proof
The optimality condition of the \(x_1\)-subproblem in (3.4) is
and it can be written as (using \(\tilde{\lambda }^k\) generated by (3.4b))
The optimality condition of the \(x_2\)-subproblem of (3.4) is
Now, we treat the \(\{ \cdot \}\) term in the last inequality. Using \( \beta (A_1\tilde{x}_1^k + A_2x_2^k-b)= - (\tilde{\lambda }^k -\lambda ^k)\) (see (3.4b)), we obtain
Then the optimality condition of the \(x_2\)-subproblem can be written as
From (3.4b), we have
and it can be written as
Combining (3.6a), (3.6b) and (3.6c), and using the notations of (3.2b), the assertion of this lemma is proved. \(\square \)
Remark 3.2
The matrix Q (3.5b) is symmetric positive definite. Thus, the proximal ADMM (3.4) is a customized PPA.
To accelerate the ADMM (3.4), we suggest relaxing the dual variable by
where the relaxation factor \(\gamma \in (0,2)\). Note that \(x_1^{k+1}\) always equals \({{\tilde{x}}}_1^k\) and \(x_2^{k+1}\) always equals \({{\tilde{x}}}_2^k\); we thus have
where
Then, we check the positive definiteness of H and G to verify whether the condition (2.2) is fulfilled for the relaxed ADMM scheme. First, recall M (see (3.8b)), we have
and
Thus H is positive definite. Next, we check the positive definiteness of G. We first deduce the expression of G:
From (3.10), G is positive definite with \(0< \gamma <2\). The relaxed ADMM scheme is thus a special case of the prototype algorithm (2.1). Note that
Writing (3.4) and (3.7) together in a compact form and recalling \(x_1^{k+1} = \tilde{x}_1^k\) and \(x_2^{k+1} = \tilde{x}_2^k\), the relaxed ADMM reads
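A minimal numerical sketch of the relaxed ADMM in the \(x_1 \rightarrow \lambda \rightarrow x_2\) order of (3.4), with \(\theta _i(\cdot )=\frac{1}{2}\Vert \cdot \Vert ^2\), \(D_1=D_2=I\), and a tiny decoupled instance (all hypothetical choices; the correction here is simplified to relaxing the dual variable only, which may differ in details from (3.7)):

```python
import numpy as np

# Toy separable instance: min 0.5*||x1||^2 + 0.5*||x2||^2  s.t.  x1 + x2 = b
n = 2
A1, A2 = np.eye(n), np.eye(n)
b = np.ones(n)
beta, gamma = 1.0, 1.2     # penalty parameter; relaxation factor in (0, 2)
D = np.eye(n)              # proximal weighting D1 = D2 = I

x1, x2, lam = np.zeros(n), np.zeros(n), np.zeros(n)
for _ in range(200):
    # x1-step: quadratic subproblem, solved in closed form
    rhs = A1.T @ lam - beta * A1.T @ (A2 @ x2 - b) + D @ x1
    x1 = np.linalg.solve(np.eye(n) + beta * A1.T @ A1 + D, rhs)
    # dual prediction, cf. (3.4b)
    lam_tilde = lam - beta * (A1 @ x1 + A2 @ x2 - b)
    # x2-step using the predicted multiplier
    rhs = A2.T @ lam_tilde - beta * A2.T @ (A1 @ x1 - b) + D @ x2
    x2 = np.linalg.solve(np.eye(n) + beta * A2.T @ A2 + D, rhs)
    # correction: relax the dual variable only
    lam = lam + gamma * (lam_tilde - lam)

# both blocks approach 0.5 * ones(n), the minimum-norm feasible split
```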
4 Global convergence
We have characterized the relaxed PDHG, the relaxed L-ALM, and the relaxed ADMM as special cases of the prototype framework (2.1). Using this framework, we can easily analyze their convergence properties in a unified way.
Under the condition that H and G are positive definite, we now characterize the right hand side of (2.1a) in terms of \(\Vert w-w^k\Vert _H\) and \(\Vert w-w^{k+1}\Vert _H\). This is given in the following lemma.
Lemma 4.1
Let \({{\tilde{w}}}^k\) be generated by the step (2.1a) and \(\{w^{k}\}\) be generated by (2.1b). We have
where H and G are defined in (2.2a) and (2.2b), respectively.
Proof
Observe that \(M(w^k-\tilde{w}^k)=(w^k-w^{k+1})\) (see (2.1b)) and \(Q=HM\), (2.1a) can be written as
Applying the identity
with \(a=w, b=\tilde{w}^k, c=w^k, d=w^{k+1}\) to the right hand side of (4.2), we have
For the last term of (4.3), we have
Combining (4.2), (4.3) and (4.4), the lemma is proved. \(\square \)
Theorem 4.2
Let \({{\tilde{w}}}^k\) be generated by the step (2.1a) and \(\{w^{k}\}\) be generated by (2.1b). Then we have
Proof
Setting \(w = w^*\) in (4.1), we get
By using the optimality of \(w^*\) and the monotonicity of F(w), we have
and thus
The assertion (4.5) follows directly. \(\square \)
Since both H and G are positive definite, Theorem 4.2 implies that the sequence \(\{w^k\}\) generated by (2.1) is strictly contractive with respect to the solution set \(\Omega ^*\). Thus the global convergence of these relaxed PPA algorithms can be easily established. This is given by the following theorem.
Theorem 4.3
Let \(\{w^{k}\}\) be a sequence generated by (2.1). Then it globally converges to a point \(w^\infty \) which belongs to \(\Omega ^*\).
Proof
From (4.5), we can see that \(\{w^k\}\) is bounded, and
Thus, the sequence \(\{\tilde{w}^k\}\) is also bounded. Let \(w^\infty \) be a cluster point of the sequence \(\{\tilde{w}^k\}\). Since the sequence \(\{\tilde{w}^k\}\) is bounded, it has a subsequence \(\{{\tilde{w}^{k_j}}\}\) that converges to \(w^\infty \). From (2.1a), we have
Using the continuity of \(\theta (x)\) and F(w), we have
The above variational inequality implies that \(w^\infty \) is a solution point of VI\((\Omega ; F; \theta )\). By using (4.6) and \(\lim _{{j\rightarrow \infty }}{\tilde{w}^{k_j}} =w^\infty \), we get \(\lim _{{j\rightarrow \infty }} {w^{k_j}}=w^\infty \). Recall (4.5), we have
Thus the sequence \(\{w^k\}\) converges to \(w^\infty \). The proof is completed. \(\square \)
5 Convergence rate
In this section, we analyze the iteration complexity in both the non-ergodic and ergodic sense for the sequence generated by these relaxed customized PPA algorithms. Note that by letting the matrix \(M=I\), our analysis is also applicable for the original customized PPA.
5.1 Worst-case convergence rate in the non-ergodic sense
In this subsection, we establish the \(\mathcal {O}(1/\sqrt{t})\) non-ergodic convergence rate of these relaxed PPA algorithms.
Lemma 5.1
Let \({{\tilde{w}}}^k\) be generated by the step (2.1a) and \(\{w^{k}\}\) be generated by (2.1b). Then we have
Proof
Setting \(w=\tilde{w}^{k+1}\) in (2.1a), we get
Notice that (2.1a) is also suitable for \(k :=k+1\), thus
Let \(w=\tilde{w}^{k}\) in the above inequality, we get
Summing (5.2) and (5.4) together and using the monotonicity of F, the assertion (5.1) is proved. \(\square \)
Lemma 5.2
Let \({{\tilde{w}}}^k\) be generated by the step (2.1a) and \(\{w^{k}\}\) be generated by (2.1b). Then we have
Proof
Note that Q is positive definite. By adding the equation
to the both sides of (5.1), we get
Substituting the terms
into (5.6), we obtain
This completes the proof. \(\square \)
Theorem 5.3
Let \(\{w^{k}\}\) be a sequence generated by (2.1). Then we have
where H is defined in (2.2a).
Proof
Let \(a=M(w^k-\tilde{w}^{k})\) and \(b=M(w^{k+1}-\tilde{w}^{k+1})\) be in the following identity
we get
Substituting (5.5) into the right-hand side of the above equality, we get
Note that
Combining with \(G=Q^T+Q-M^THM\succ 0\) (see (2.2b)), the assertion is proved. \(\square \)
Theorem 5.3 implies that the sequence \(\{\Vert w^{k}-{{w}^{k+1}}\Vert _H^2\}\) is monotonically nonincreasing, which is crucial for our proof. Next, we need to establish the connection between \(\Vert w^k-\tilde{w}^k\Vert _G^2\) and \(\Vert w^k-w^{k+1}\Vert _H^2\).
Lemma 5.4
For each relaxed customized PPA, there exists a constant \(c: = \frac{2-\gamma }{4}\), such that
Proof
From \(w^k-w^{k+1}=M(w^k-\tilde{w}^k)\) (see (2.1b)), we have
The remaining task is to find a constant c, such that
for each G, H pair arising in these algorithms. Note that from (2.2b), we have \(M^THM\prec 2Q\). From (2.3) and (2.7), we have \(G\succcurlyeq \frac{2-\gamma }{2}Q\) for the relaxed PDHG. From (2.19) and (2.15), \(G\succcurlyeq \frac{2-\gamma }{2}Q\) for the relaxed L-ALM. From (3.5b) and (3.10), \(G\succcurlyeq \frac{2-\gamma }{2}Q\) for the relaxed ADMM. The assertion is thus proved. \(\square \)
Theorem 5.5
Let \(\{w^{k}\}\) be a sequence generated by (2.1). Then we have
where the matrix H is defined as (2.2a).
Proof
Inserting (5.9) into (4.5), we get
Summing this inequality over \(k=0,1,\ldots , \infty \), we have
Now, since the sequence \(\{\Vert w^{k}-w^{k+1}\Vert _H^2\}\) is monotonically nonincreasing by Theorem 5.3, we have
Using the above two inequalities (5.11) and (5.12), we obtain (5.10). \(\square \)
From (1.15a), we know that if \(\Vert w^k-w^{k+1}\Vert _H=0\), then \(w^k\) is a solution point of VI\((\Omega ; F; \theta )\). This allows us to use \(\Vert w^k-w^{k+1}\Vert _H\) as an error measure, in terms of the distance to the solution set of VI\((\Omega ; F; \theta )\), for the t-th iteration of the algorithm. Let \(d :=\inf \{\Vert w^0-w^*\Vert _H \ |\ w^*\in \Omega ^*\}\). Then, for any given \(\varepsilon >0\), Theorem 5.5 implies that these relaxed PPA algorithms need at most \({\lceil \frac{d^2}{c\varepsilon ^2}\rceil }\) iterations to ensure that \(\Vert w^k-w^{k+1}\Vert _H\le \varepsilon \). Therefore, a worst-case \(\mathcal {O}(1/\sqrt{t})\) convergence rate in the non-ergodic sense is established for these relaxed PPA algorithms.
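The iteration bound just stated follows from (5.10) by simple arithmetic (our restatement, with \(d\) as defined above):

```latex
\|w^{t}-w^{t+1}\|_H^2 \;\le\; \frac{d^2}{c\,(t+1)},
\qquad\text{so}\qquad
t+1 \;\ge\; \frac{d^2}{c\,\varepsilon^2}
\;\Longrightarrow\;
\|w^{t}-w^{t+1}\|_H \;\le\; \varepsilon.
```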
5.2 Worst-case convergence rate in the ergodic sense
In this section, we establish the \(\mathcal {O}(1/t)\) worst-case convergence rate in the ergodic sense for these relaxed customized PPA algorithms. First, we introduce the following theorem, which is convenient for analyzing the ergodic iteration complexity. Its proof can be found in [24] or in Theorem 2.3.5 of [15].
Theorem 5.6
The solution set of \(\text{ VI }(\Omega ; F; \theta )\) ((1.3), (3.2)) is closed and convex, and can be characterized via
Computing a primal-dual solution \(w^*\) exactly is impractical, but we can find an approximate solution \(\tilde{w}\in \Omega \). With Theorem 5.6, we define an \(\varepsilon \)-approximation solution of \(\text{ VI }(\Omega ; F; \theta )\) as follows: given an accuracy \(\varepsilon >0\), \(\tilde{w}\) is said to be an \(\varepsilon \)-approximation solution of \(\text{ VI }(\Omega ; F; \theta )\) if it satisfies
where \(\mathcal {D}\subset \Omega \) is a given compact set, e.g., \(\mathcal {D}=\{w\ |\ \Vert w-\tilde{w}\Vert \le 1\}\). Next, our purpose is to prove that we can find an \(\varepsilon \)-approximation solution after t iterations.
Theorem 5.7
Let \({{\tilde{w}}}^k\) be generated by the step (2.1a) and \(\{w^{k}\}\) be generated by (2.1b). For any integer number \(t>0\), define
then we have \(\tilde{w}_t\in \Omega \) and
where H is defined in (2.2a).
Proof
First, because of \(\tilde{w}^k\in \Omega \), it holds that \(\tilde{w}_t\in \Omega \) for all integer t.
Next, by using the definitions of F (see (1.3b) and (3.2b)), we obtain,
Also, using the positive definiteness of G and (4.1), we get
Summing inequality (5.15) from \(k=0\) to \(k=t\), we derive
From the definition of \(\tilde{w}_t\) (5.13), we can write above inequality as
Noticing the fact that \(\theta (x)\) is convex, for \(\tilde{w}_t\), we have
Substituting (5.17) into (5.16), we obtain (5.14). \(\square \)
Let \(d=\sup \ \{\Vert w-w_0\Vert _H, w\in \mathcal {D}\}.\) Theorem 5.7 implies that, after t iterations of (2.1), \(\tilde{w}_t\) defined in (5.13) satisfies
Thus, we have obtained a computational complexity estimate for the average of the \(\tilde{w}^k\): it is of order \( \mathcal {O}(1/t)\) to reach an approximate solution of \(\text{ VI }(\Omega , F, \theta )\).
6 Numerical results
In this section, we report some numerical results for the proposed algorithms. Note that numerical comparisons of the customized PPA with related methods have already been conducted in the literature; see, e.g., [6, 35, 38]. Thus they are omitted in our tests, and we focus on showing the advantages of the relaxed methods over the original customized PPA algorithms and on verifying the improvements in the computational sense. All codes were written in Matlab 2015b, and all experiments were conducted on a laptop with an Intel Core (TM) 2.40GHz CPU and 8GB of memory.
6.1 Matrix completion problem
The matrix completion problem consists of reconstructing an unknown low rank matrix from a given subset of observed entries [3]. Specifically, let \(M\in \mathfrak {R}^{m\times n}\) be a low rank matrix, \(\Omega \) be a subset of the indices of entries \(\{1,\cdots , {m}\}\times \{1,\cdots , n\}\). The convex model is
where \(\Vert \cdot \Vert _*\) denotes the nuclear norm of a matrix. Note that the constraints can also be viewed as a projection equation \(P_\Omega (X)=P_\Omega (M)\), thus we have \(\Vert A^TA\Vert =1\).
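The two building blocks of the iterations for (6.1) are the proximal mapping of the nuclear norm (singular value soft-thresholding) and the projection \(P_\Omega \); both admit simple implementations (a sketch; the function names are ours):

```python
import numpy as np

def svt(Y, tau):
    """Proximal mapping of tau * ||.||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def proj_omega(X, mask):
    """P_Omega: keep the observed entries (mask == True), zero out the rest."""
    return np.where(mask, X, 0.0)
```

Since \(P_\Omega \) is an orthogonal projection, the corresponding linear operator indeed satisfies \(\Vert A^TA\Vert =1\), as noted above.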
In our implementations, for the SVD decomposition, we resort to the widely recommended PROPACK package [28]. We generate the rank-\(r_a\) matrix M as a product \(M_LM_R^T\), where \(M_L\) and \(M_R\) are independent \(n\times r_a\) matrices with i.i.d. Gaussian entries. The set of observations \(\Omega \) is sampled uniformly at random among all sets of cardinality \(|\Omega |\). The number of degrees of freedom of the matrix is \(d_{r_a}=r_a(2{n}-r_a)\), and the oversampling factor \(|\Omega |/d_{r_a}\) is the ratio of the number of samples to the degrees of freedom. The primal variable \(X^0\) and the dual variable \(\Lambda ^0\) were both initialized as zeros(n). The stopping criterion is set to be
For each instance, we randomly generated ten cases, so the results reported were averaged over ten runs.
Now we show how to choose the parameters. Note that the PDHG requires multiple stepsize parameters, and it seems impossible to give a universally best choice of step sizes. We already know from (1.11) that 1/r and 1/s essentially play the role of proximal step sizes, and it may be preferable to choose r and s such that \(r\cdot s\) is close to the lower bound \(\Vert A^TA\Vert \), in view of the convergence condition \(rs>\Vert A^TA\Vert \). Thus, in our experiments, we always set \(s=1.01/r\). However, the choice of r still lacks theoretical guidance and, as far as we know, is usually made by numerical experimentation. Since the best choice varies with the problem size, and even with the iteration step, adjusting the PDHG parameters adaptively at each iteration seems a better way to achieve faster convergence; see, e.g., [18] for a simple adaptive strategy. In this paper, our main goal is to illustrate the advantages of the relaxation steps, so the parameters are kept fixed in all our experiments. To choose a suitable value of r, we first run PDHG and the relaxed PDHG (\(\gamma = 1.7\)) for several different values of r and record the resulting iteration numbers. In particular, r varies from 0.002 to 1.024. The results are reported in Fig. 1. We can see that the relaxed PDHG with \(\gamma = 1.7\) took fewer iterations than the PDHG in almost all cases. We also observe that both algorithms are sensitive to the choice of r. As a result, the proper choice of r differs with the dimension n, and there do not exist fixed, universally best parameters for PDHG throughout the implementation.
We further compare the two algorithms with specified values of the parameters. From Fig. 1, we see that when r is chosen in the interval (0.008, 0.032), the performance is efficient and robust. Here, we set the parameters to \(r=0.008\) and \(s=1.01/r\), respectively. We point out that these values of r and s are determined from the experiments and may be far from optimal. For each setting, we report the number of iterations (“No. it”), the relative error (“RelErr”), and the computing time in Table 1. As shown in the table, the relaxed PDHG took much fewer iterations and is much faster than the PDHG for the cases \(n<300\). When n becomes larger, the relaxed PDHG still needs fewer iterations than PDHG, although both perform well. We remark that the performance of the relaxed PDHG can be further improved by judiciously choosing the value of \(\gamma \) in the interval (1, 2), but this tuning is beyond the scope of this paper.
Next, we test the performance of the L-ALM. Similarly to PDHG, the parameters r and \(\beta \) should be chosen so that \(r/\beta \) is close to the lower bound \(\Vert A^TA\Vert \). Thus, in our experiments, we always set \(\beta =r/1.01\). Since the condition is similar to that of PDHG, we illustrate the performance with different values of r in Fig. 2. We can see that the relaxed L-ALM with \(\gamma = 1.7\) took fewer iterations than the L-ALM in almost all cases. When \(n<500\) and \(r\in (0.008, 0.032)\), the two algorithms require fewer iterations and perform robustly and efficiently. When \(r>0.032\), the two algorithms require more iterations to satisfy the same stopping criterion, and the relaxation step seems less effective in reducing the iteration numbers. The behavior is remarkably similar to that of the PDHG. This phenomenon is not difficult to understand: the L-ALM can be viewed as a linearized ADMM algorithm and is closely related to the PDHG [14]. Here, we also set the parameters to \(r=0.008\) and \({\beta }=r/1.01\), respectively. In Table 2, we report the performance of the relaxed L-ALM algorithm for different settings of the dimensions. We can see that the relaxed L-ALM is more effective than the L-ALM when \(n<300\). For large problem dimensions, the results also demonstrate that the relaxation takes fewer iterations, reducing the iteration numbers by about \(10\%\).
6.2 Robust principal component analysis
In this subsection, we test the robust principal component analysis (RPCA) which has the form
where M is a known matrix. Note that problem (6.3) is a standard instance of model (3.1) and thus can be solved by the ADMM. We refer to [36] for strategies to solve the subproblems. We now demonstrate the efficiency of the relaxed ADMM for solving (6.3). In the experiments, we generated the data similarly to [36]. More specifically, we generate the rank-\(r_a\) matrix \(L^*\) as a product \(L_LL_R^T\), where \(L_L\) and \(L_R\) are independent \(n\times r_a\) matrices with i.i.d. Gaussian entries. For \(S^*\), we first generate an \(n\times n\) zero matrix S and set the number of nonzero entries to the nearest integer to \(sr\times n\times n\), where sr denotes the sparsity ratio. We then choose the locations of the nonzero entries of the underlying sparse matrix \(S^*\) randomly, with the nonzero elements drawn from the standard Gaussian distribution. We use \(L^*+S^*\) to generate the matrix M. The model parameter \(\tau \) is always set to \(1/\sqrt{n}\), and the stopping criterion is set to be
We also randomly generated ten cases for each instance, and the reported results are averages over the ten runs.
It is well known that the numerical performance of ADMM depends heavily on the value of \(\beta \), and determining the optimal value of \(\beta \) remains a theoretically open question. To select a reasonably good value of \(\beta \), we first conduct some experiments to evaluate the performance of ADMM with different values of \(\beta \). In the experiments, we test a set of values \(\beta = k *\tau \) with \(k=0.02, 0.08,\ldots ,5.12\) on dimensions from \(n=100\) to \(n=500\). The initial point is set to zero. For simplicity, we always set \(r_a=0.1*n\) and the sparsity ratio \(sr=0.1\). For the relaxed ADMM, the relaxation factor is set to \(\gamma =1.3\). The numbers of iterations of ADMM and the relaxed ADMM for different values of \(\beta \) ( \(\beta = k *\tau \) ) are plotted in Fig. 3. According to the plots, both ADMM variants are very sensitive to the value of \(\beta \). The optimal value of \(\beta \) can differ across problem dimensions, so no single data-independent choice of \(\beta \) is optimal. We can also see that the relaxed ADMM performs better than ADMM in most cases. In Fig. 3, we observe that when \(k\in ({0.08},0.32)\), i.e., \(\beta \in ({0.08}\tau ,0.32\tau )\), ADMM performs more robustly and efficiently, and both algorithms require fewer iterations than with other choices. When \(\beta >0.32\tau \), both algorithms need more iterations for the same problem, and their performances are almost the same.
Next, we report experimental results of ADMM and the relaxed ADMM for more cases. For comparison, we set \(\beta =0.3*\tau \) for both methods, since this choice performs well and is robust across dimensions. The results are reported in Table 3. From these tests, we can see that the relaxed ADMM requires fewer iterations and less computing time than ADMM to produce solutions of the same accuracy. For these instances, we could further improve the performance of the relaxed ADMM by choosing the best relaxation factor; however, to simplify the demonstration, we used a fixed factor for all cases.
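For concreteness, a minimal sketch of an ADMM iteration for model (6.3) with a relaxed dual update is given below, using singular-value thresholding for the nuclear-norm step and entrywise soft-thresholding for the \(\ell _1\) step. This is an illustrative simplification under our own conventions (scaled dual variable, relaxation applied only to the dual update, as in the classical larger-step variant), not the exact scheme analyzed in this paper.

```python
import numpy as np

def svt(X, t):
    # singular-value thresholding: prox of t*||.||_* (nuclear norm)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

def soft(X, t):
    # entrywise soft-thresholding: prox of t*||.||_1
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def relaxed_admm_rpca(M, tau, beta, gamma, iters=1000):
    """ADMM for min ||L||_* + tau*||S||_1 s.t. L + S = M,
    in scaled-dual form, with the dual update relaxed by gamma."""
    L = np.zeros_like(M); S = np.zeros_like(M); U = np.zeros_like(M)
    for _ in range(iters):
        L = svt(M - S - U, 1.0 / beta)    # nuclear-norm subproblem
        S = soft(M - L - U, tau / beta)   # l1 subproblem
        U = U + gamma * (L + S - M)       # relaxed dual ascent
    return L, S
```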
7 Conclusions
In the literature, it is well known that the proximal point algorithm (PPA) can be accelerated by the relaxation strategy proposed in [19]. In many cases, the relaxed PPA with \(\gamma \in [1.5, 1.8]\) achieves a guaranteed improvement in convergence. In this paper, in contrast to most previous works, we propose relaxing the PPA by correcting the dual variables individually. For the PDHG proposed in [5, 6], the linearized augmented Lagrangian method (L-ALM), and the alternating direction method of multipliers (ADMM), all special cases of the customized PPA, we studied the corresponding relaxed schemes and observed the resulting speedup. Guided by the variational inequality framework, we established the global convergence of these relaxed algorithms together with worst-case convergence rates in both the ergodic and nonergodic senses. In conclusion, our results provide a theoretically grounded new way of relaxing PPA algorithms.
It should be mentioned that this paper only discussed customized PPA algorithms for convex problems with at most two blocks. In many real applications, the problems may take the form of (1.1) with an objective function consisting of multiple block components, which are often more challenging to handle directly. For the multi-block case, we can also formulate the problems as variational inequalities and then design customized PPA algorithms to solve them. In fact, the multi-block Jacobian ADMM [12, 34] and its linearized version, as well as the partial parallel ADMM [25], are all designed for this purpose and can be viewed as PPA algorithms. Consequently, the relaxation strategy proposed in this paper can be extended to these customized PPA algorithms, yielding their relaxed versions. On the other hand, the step-size parameters of the PPA algorithms discussed in this paper can all be tuned at each iteration by self-adaptive rules, which would further improve performance, so it would be interesting to investigate self-adaptive rules for the relaxation factor when these algorithms are dynamically adjusted. We leave these as directions for future work.
References
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2, 183–202 (2009)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3, 1–122 (2010)
Cai, J.F., Candès, E.J., Shen, Z.W.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20, 1956–1982 (2010)
Cai, X.J., Gu, G.Y., He, B.S., Yuan, X.M.: A proximal point algorithm revisit on the alternating direction method of multipliers. Sci. China Math. 56, 2179–2186 (2013)
Chambolle, A., Pock, T.: Diagonal preconditioning for first order primal-dual algorithms in convex optimization. In: IEEE International Conference on Computer Vision (ICCV), pp. 1762–1769 (2011)
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40, 120–145 (2011)
Chambolle, A., Pock, T.: On the ergodic convergence rates of a first-order primal-dual algorithm. Math. Program. 159, 253–287 (2016)
Chambolle, A., Pock, T.: An introduction to continuous optimization for imaging. Acta Numer. 25, 161–319 (2016)
Chan, T.F., Glowinski, R.: Finite Element Approximation and Iterative Solution of a Class of Mildly Nonlinear Elliptic Equations, Technical Report STAN-CS-78-674. Stanford University, Computer Science Department (1978)
Daubechies, I., Defrise, M., Mol, C.D.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. 57, 1413–1457 (2004)
Deng, W., Yin, W.T.: On the global and linear convergence of the generalized alternating direction method of multipliers. J. Sci. Comput. 66, 889–916 (2016)
Deng, W., Lai, M.J., Peng, Z.M., Yin, W.T.: Parallel multi-block ADMM with o(1/k) convergence. J. Sci. Comput. 71, 712–736 (2017)
Eckstein, J., Bertsekas, D.P.: On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program. 55, 293–318 (1992)
Esser, E.: Primal Dual Algorithms for Convex Models and Applications to Image Restoration, Registration and Nonlocal Inpainting. University of California, Los Angeles (2010)
Facchinei, F., Pang, J.S.: Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer, New York (2003)
Glowinski, R., Marrocco, A.: Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires. Revue Fr. Autom. Inform. Rech. Opér. Anal. Numér. 2, 41–76 (1975)
Glowinski, R., Osher, S.J., Yin, W. (eds.): Splitting Methods in Communication, Imaging, Science, and Engineering. Springer, New York (2016)
Goldstein, T., Li, M., Yuan, X.M.: Adaptive primal-dual splitting methods for statistical learning and image processing. In: Advances in Neural Information Processing Systems, pp. 2089–2097 (2015)
Gol’shtein, E.G., Tret’yakov, N.V.: Modified Lagrangians in convex programming and their generalizations. Math. Program. Stud. 10, 86–97 (1979)
Gu, G.Y., He, B.S., Yuan, X.M.: Customized proximal point algorithms for linearly constrained convex minimization and saddle-point problems: a unified approach. Comput. Optim. Appl. 59, 135–161 (2014)
Han, D.R., He, H.J., Yang, H., Yuan, X.M.: A customized Douglas–Rachford splitting algorithm for separable convex minimization with linear constraints. Numer. Math. 127, 167–200 (2014)
He, B.S., Yuan, X.M.: A class of ADMM-based algorithms for three-block separable convex programming. Comput. Optim. Appl. 70, 791–826 (2018)
He, B.S.: PPA-Like contraction methods for convex optimization: a framework using variational inequality approach. J. Oper. Res. Soc. China 3, 391–420 (2015)
He, B.S., Yuan, X.M.: Convergence analysis of primal-dual algorithms for a saddle-point problem: from contraction perspective. SIAM J. Imaging Sci. 5, 119–149 (2012)
He, B.S., Yuan, X.M.: Block-wise alternating direction method of multipliers for multiple-block convex programming and beyond. SMAI J Comput. Math. 1, 145–174 (2015)
He, B.S., Ma, F., Yuan, X.M.: An algorithmic framework of generalized primal-dual hybrid gradient methods for saddle point problems. J. Math. Imaging Vis. 58, 279–293 (2017)
Hestenes, M.R.: Multiplier and gradient methods. J. Optim. Theory Appl. 4, 303–320 (1969)
Larsen, R.M.: PROPACK: software for large and sparse SVD calculations. Stanford University. http://sun.stanford.edu/~rmunk/PROPACK/
Martinet, B.: Régularisation d'inéquations variationnelles par approximations successives. Rev. Fr. d'Inform. Rech. Oper. 4, 154–159 (1970)
Powell, M.J.D.: A method for nonlinear constraints in minimization problems. In: Fletcher, R. (ed.) Optimization, pp. 283–298. Academic Press, New York (1969)
Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14, 877–898 (1976)
Shen, Y., Wang, H.Y.: New augmented Lagrangian-based proximal point algorithm for convex optimization with equality constraints. J. Optim. Theory Appl. 171, 251–261 (2016)
Sra, S., Nowozin, S., Wright, S.J.: Optimization for Machine Learning. MIT Press, Cambridge (2012)
Wang, X.F., Hong, M.Y., Ma, S.Q., Luo, Z.Q.: Solving multiple-block separable convex minimization problems using two-block alternating direction method of multipliers. Pac. J. Optim. 11, 645–667 (2015)
Yang, J.F., Yuan, X.M.: Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization. Math. Comput. 82, 301–329 (2013)
Yuan, X.M., Yang, J.F.: Sparse and low-rank matrix decomposition via alternating direction methods. Pac. J. Optim. 9, 167–180 (2013)
Zhang, X.Q., Burger, M., Bresson, X., Osher, S.: Bregmanized nonlocal regularization for deconvolution and sparse reconstruction. SIAM J. Imaging Sci. 3, 253–276 (2010)
Zhang, X.Q., Burger, M., Osher, S.: A unified primal-dual algorithm framework based on Bregman iteration. J. Sci. Comput. 46, 20–46 (2010)
Zhu, M., Chan, T.F.: An efficient primal-dual hybrid gradient algorithm for total variation image restoration, CAM Report 08-34. UCLA, USA (2008)
Acknowledgements
The author is grateful to the associate editor and two anonymous reviewers for their valuable comments and suggestions that have helped improve the presentation of this paper greatly. This work was supported by the NSFC Grants 11701564 and 11871029.
Cite this article
Ma, F. On relaxation of some customized proximal point algorithms for convex minimization: from variational inequality perspective. Comput Optim Appl 73, 871–901 (2019). https://doi.org/10.1007/s10589-019-00091-z