Abstract
In this paper, we develop a composite version of the Mirror Prox algorithm for solving convex–concave saddle point problems and monotone variational inequalities of special structure. The new version covers the saddle point/variational analogues of what is usually called “composite minimization” (minimizing the sum of an easy-to-handle nonsmooth function and a general-type smooth convex function “as if” there were no nonsmooth component at all). We demonstrate that the composite Mirror Prox inherits the favourable efficiency estimate of its prototype, an estimate which is unimprovable already in the large-scale bilinear saddle point case. We demonstrate that the proposed approach can be successfully applied to Lasso-type problems with several penalizing terms (e.g., \(\ell _1\) and nuclear norm regularization acting together) and to problems with the semi-separable structure considered in alternating directions methods, yielding in both cases methods with \(O(1/t)\) complexity bounds.
1 Introduction
1.1 Motivation
Our work is inspired by the recent trend of seeking efficient methods for solving problems with hybrid regularizations or mixed penalty functions in fields such as machine learning, image restoration, signal processing and many others. We begin with two instructive examples (for motivation, see, e.g., [2, 6, 7]).
Example 1
(Matrix completion) Our first motivating example is the matrix completion problem, in which we want to reconstruct the original matrix \(y\in {\mathbf {R}}^{n\times n}\), known to be both sparse and low-rank, given noisy observations of part of its entries. Specifically, our observation is \(b=P_\varOmega y+\xi \), where \(\varOmega \) is a given set of cells in an \(n\times n\) matrix, \(P_\varOmega y\) is the restriction of \(y\in {\mathbf {R}}^{n\times n}\) onto \(\varOmega \), and \(\xi \) is random noise. A natural way to recover \(y\) from \(b\) is to solve the optimization problem
$$\begin{aligned} \min _{y\in {\mathbf {R}}^{n\times n}}\left\{ {1\over 2}\Vert P_\varOmega y-b\Vert _2^2+\mu \Vert y\Vert _1+\lambda \Vert y\Vert _{\mathrm{nuc}}\right\} , \end{aligned}$$(1)
where \(\mu ,\lambda >0\) are regularization parameters. Here \(\Vert y\Vert _2= \sqrt{{\hbox {Tr}}(y^Ty)}\) is the Frobenius norm, \(\Vert y\Vert _1=\sum _{i,j=1}^n|y_{ij}|\) is the \(\ell _1\)-norm, and \(\Vert y\Vert _{\mathrm{nuc}}=\sum _{i=1}^n\sigma _i(y)\) (\(\sigma _i(y)\) are the singular values of \(y\)) is the nuclear norm of a matrix \(y\in {\mathbf {R}}^{n\times n}\).
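Both regularizers in this example have cheap proximal operators, which is what makes such penalties “simple” in the sense used later in the paper: entrywise soft-thresholding for the \(\ell _1\) norm and singular value thresholding for the nuclear norm. A minimal numpy sketch (the function names are ours, for illustration only):

```python
import numpy as np

def prox_l1(y, t):
    """Soft-thresholding: prox of t*||.||_1, applied entrywise."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def prox_nuclear(y, t):
    """Singular value thresholding: prox of t*||.||_nuc,
    i.e., soft-thresholding of the singular values."""
    U, s, Vt = np.linalg.svd(y, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt
```

For a diagonal matrix, singular value thresholding reduces to soft-thresholding of the diagonal, which gives a quick sanity check of both routines.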
Example 2
(Image recovery) Our second motivating example is the image recovery problem, where we want to recover an image \(y\in {\mathbf {R}}^{n\times n}\) from its noisy observation \(b=Ay+\xi \), where \(y\mapsto Ay\) is a given affine mapping (e.g., the restriction operator \(P_\varOmega \) defined above, or some blur operator), and \(\xi \) is random noise. Assume that the image can be decomposed as \(y=y_\mathrm{L}+y_\mathrm{S}+y_\mathrm{sm}\), where \(y_\mathrm{L}\) is of low rank, \(y_\mathrm{sm}\) is the matrix of contamination by a “smooth background signal”, and \(y_\mathrm{S}\) is a sparse matrix of “singular corruption.” Under this assumption, in order to recover \(y\) from \(b\), it is natural to solve the optimization problem
where \(\mu _1,\mu _2,\mu _3>0\) are regularization parameters. Here \(\Vert y\Vert _\mathrm{TV}\) is the total variation of an image \(y\):
$$\begin{aligned} \Vert y\Vert _\mathrm{TV}=\sum _{i=1}^{n-1}\sum _{j=1}^{n}|y_{i+1,j}-y_{i,j}|+\sum _{i=1}^{n}\sum _{j=1}^{n-1}|y_{i,j+1}-y_{i,j}|. \end{aligned}$$
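For concreteness, the (anisotropic) total variation sums the absolute first-order finite differences of the image in both directions. A numpy sketch, assuming this standard anisotropic definition (the function name is ours; the paper's exact formula may differ in boundary handling):

```python
import numpy as np

def tv_aniso(y):
    """Anisotropic total variation of a 2D image:
    sum of absolute horizontal and vertical finite differences."""
    dh = np.abs(np.diff(y, axis=1)).sum()  # horizontal differences
    dv = np.abs(np.diff(y, axis=0)).sum()  # vertical differences
    return dh + dv
```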
These and other examples motivate addressing the following multi-term composite minimization problem
and, more generally, the semi-separable problem
Here for \(1\le k\le K\) the domains \(Y_k\) are closed and convex, \(\psi _k(\cdot )\) are convex Lipschitz-continuous functions, and \(\varPsi _k(\cdot )\) are convex functions which are “simple and fit \(Y_k\)”.
The problem of multi-term composite minimization (3) has been considered (in a somewhat different setting) in [22] for \(K=2\). When \(K=1\), problem (3) becomes the usual composite minimization problem:
which is well studied in the case where \(\psi (\cdot )\) is a smooth convex function and \(\varPsi (\cdot )\) is a simple nonsmooth function. For instance, it was shown that the composite versions of the Fast Gradient Method, originating in Nesterov’s seminal work [21] and further developed by many authors (see, e.g., [3, 4, 8, 25, 27] and references therein), as applied to (5), work as if there were no nonsmooth term at all and exhibit the \(O(1/t^2)\) convergence rate, which is the optimal rate attainable by first order algorithms of large-scale smooth convex optimization. Note that these algorithms cannot be directly applied to problems (3) with \(K>1\).
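To fix ideas, here is a minimal numpy sketch of a composite fast gradient (FISTA-style) iteration for (5) with \(\psi (x)={1\over 2}\Vert Ax-b\Vert _2^2\) and \(\varPsi (x)=\mu \Vert x\Vert _1\). This is only an illustration of the cited family of schemes, not the exact method of [21]; all names are ours:

```python
import numpy as np

def fista_lasso(A, b, mu, iters=200):
    """Composite fast gradient sketch for
    min_x 0.5*||A x - b||_2^2 + mu*||x||_1:
    gradient step on the smooth part, prox step on the l1 part,
    plus the usual momentum extrapolation."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = z = np.zeros(A.shape[1])
    t = 1.0
    for _ in range(iters):
        g = A.T @ (A @ z - b)                # gradient of the smooth term at z
        w = z - g / L
        x_new = np.sign(w) * np.maximum(np.abs(w) - mu / L, 0.0)  # prox of (mu/L)*||.||_1
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = x_new + (t - 1.0) / t_new * (x_new - x)
        x, t = x_new, t_new
    return x
```

With \(A\) the identity, the minimizer is soft-thresholding of \(b\), which the iteration reaches exactly.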
The problem with semi-separable structure (4) for \(K=2\) has also been extensively studied using the augmented Lagrangian approach (see, e.g., [5, 11, 12, 16, 23, 24, 26, 28] and references therein). In particular, much work was carried out on the alternating directions method of multipliers (ADMM, see [5] for an overview), which optimizes the augmented Lagrangian in an alternating fashion and exhibits an overall \(O(1/t)\) convergence rate. Note that the available accuracy bounds for those algorithms involve optimal values of Lagrange multipliers of the equality constraints (cf. [23]). Several variants of this method have been developed recently to handle the case \(K>2\) (see, e.g., [10]); however, most of these algorithms require solving iteratively subproblems of type (5), especially in the presence of non-smooth terms in the objective.
1.2 Our contribution
In this paper, we do not assume smoothness of functions \(\psi _k\), but instead, we suppose that \(\psi _k\) are saddle point representable:
where \(\phi _k(\cdot ,\cdot )\) are smooth functions which are convex–concave (i.e., convex in their first and concave in the second argument), \(Z_k\) are convex and compact, and \({\overline{\varPsi }}_k(\cdot )\) are simple convex functions on \(Z_k\). Let us consider, for instance, the multi-term composite minimization problem (3). Under (6), the primal problem (3) allows for the saddle point reformulation:
Note that when there are no \(\varPsi _k,{\overline{\varPsi }}_k\)’s, problem (7) becomes a convex–concave saddle point problem with smooth cost function, studied in [14]. In particular, it was shown in [14] that the Mirror Prox (MP) algorithm originating from [17], when applied to the saddle point problem (7), exhibits the “theoretically optimal” convergence rate \(O(1/t)\). Our goal in this paper is to develop novel \(O(1/t)\)-converging first order algorithms for problem (7) (and also for the related saddle point reformulation of the problem in (4)); this appears to be the best rate known, under the circumstances, from the literature (and established there in an essentially less general setting than the one considered below).
Our key observation is that composite problem (3), (6) can be reformulated as a smooth linearly constrained saddle point problem by simply moving the nonsmooth terms into the problem domain. Namely, problem (3) , (6) can be written as
We can further approximate the resulting problem by penalizing the equality constraints, thus passing to
where \(\rho _k>0\) are penalty parameters and \(W_k=\{w^k:\Vert w^k\Vert _2\le 1\}, k=1,\ldots ,K\).
We solve the convex–concave saddle point problem (8), which has a smooth cost function, by the \(O(1/t)\)-converging Mirror Prox algorithm. It is worth mentioning that if the functions \(\phi _k\), \(\varPsi _k\) are Lipschitz continuous on the domains \(A_kY+b_k\), and the \(\rho _k\) are selected properly, the saddle point problem is exactly equivalent to the problem of interest.
The monotone operator \(F\) associated with the saddle point problem in (8) has a special structure: the variables can be split into two blocks \(u\) (all \(y\)-, \(z\)- and \(w\)-variables) and \(v\) (all \(\tau \)- and \(\sigma \)-variables) in such a way that the induced partition of \(F\) is \(F=[F_u(u);F_v]\), with the \(u\)-component \(F_u\) depending solely on \(u\) and the \(v\)-component \(F_v\) constant. We demonstrate below that in this case the basic MP algorithm admits a “composite” version which works essentially “as if” there were no \(v\)-component at all. This composite version of MP will be the workhorse of all subsequent developments.
The main body of this paper is organized as follows. In Sect. 2 we present required background on variational inequalities with monotone operators and convex–concave saddle points. In Sect. 3 we present and justify the composite MP algorithm. In Sects. 4 and 5, we apply our approach to problems (3), (6) and (4), (6). In Sect. 4.4, we illustrate our approach (including numerical results) as applied to the motivating examples. All proofs missing in the main body of the paper are relegated to the Appendix.
2 Preliminaries: variational inequalities and accuracy certificates
Execution protocols and accuracy certificates. Let \(X\) be a nonempty closed convex set in a Euclidean space \(E\) and \(F(x):X\rightarrow E\) be a vector field.
Suppose that we process \((X,F)\) by an algorithm which generates a sequence of search points \(x_t\in X\), \(t=1,2,\ldots \), and computes the vectors \(F(x_t)\), so that after \(t\) steps we have at our disposal \(t\)-step execution protocol \(\mathcal{{I}}_t=\{x_\tau ,F(x_\tau )\}_{\tau =1}^t\). By definition, an accuracy certificate for this protocol is simply a collection \(\lambda ^t=\{\lambda ^t_\tau \}_{\tau =1}^t\) of nonnegative reals summing up to 1. We associate with the protocol \(\mathcal{{I}}_t\) and accuracy certificate \(\lambda ^t\) two quantities as follows:
-
Approximate solution \(x^t(\mathcal{{I}}_t,\lambda ^t):=\sum _{\tau =1}^t \lambda ^t_\tau x_\tau \), which is a point of \(X\);
-
Resolution \({\hbox {Res}}(X'\big |\mathcal{{I}}_t,\lambda ^t)\) on a subset \(X'\ne \emptyset \) of \(X\) given by
$$\begin{aligned} {\hbox {Res}}(X'\big |\mathcal{{I}}_t,\lambda ^t) = \sup \limits _{x\in X'}\sum _{\tau =1}^t\lambda ^t_\tau \langle F(x_\tau ),x_\tau -x\rangle . \end{aligned}$$(9)
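For illustration, both the approximate solution and the resolution (9) are computed directly from an execution protocol. The numpy sketch below replaces the sup over \(X'\), for simplicity, by a maximum over a finite candidate set; all names are ours:

```python
import numpy as np

def approx_solution(xs, lam):
    """x^t = sum_tau lam_tau * x_tau, a convex combination of the search points."""
    return np.tensordot(lam, xs, axes=1)

def resolution(xs, Fs, lam, X_prime):
    """Res(X'|I_t, lam) = sup_{x in X'} sum_tau lam_tau <F(x_tau), x_tau - x>,
    with the sup taken over a finite candidate set X' for illustration."""
    vals = [sum(l * np.dot(Ftau, x_tau - x)
                for l, x_tau, Ftau in zip(lam, xs, Fs))
            for x in X_prime]
    return max(vals)
```

For instance, with \(F(x)=x\) (the gradient field of \({1\over 2}x^2\)), two search points and uniform certificate weights, the formulas can be evaluated by hand and checked against the code.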
The role of these notions in the optimization context is explained next.

Variational inequalities. Assume that \(F\) is monotone, i.e.,
and let our goal be to approximate a weak solution to the variational inequality (v.i.) \(\hbox {vi}(X,F)\) associated with \((X,F)\); weak solution is defined as a point \(x_*\in X\) such that
A natural (in)accuracy measure of a candidate weak solution \(x\in X\) to \(\hbox {vi}(X,F)\) is the dual gap function
This inaccuracy is a convex nonnegative function which vanishes exactly on the set of weak solutions to the vi\((X,F)\).
Proposition 1
For every \(t\), every execution protocol \(\mathcal{{I}}_t=\{x_\tau \in X,F(x_\tau )\}_{\tau =1}^t\) and every accuracy certificate \(\lambda ^t\) one has \(x^t:=x^t(\mathcal{{I}}_t,\lambda ^t)\in X\). Besides this, assuming \(F\) monotone, for every closed convex set \(X'\subset X\) such that \(x^t\in X'\) one has
Proof
Indeed, \(x^t\) is a convex combination of the points \(x_\tau \in X\) with coefficients \(\lambda ^t_\tau \), whence \(x^t\in X\). With \(X'\) as in the premise of the Proposition, we have
where the first \(\le \) is due to monotonicity of \(F\). \(\square \)
Convex–concave saddle point problems. Now let \(X=X_1\times X_2\), where \(X_i\) is a closed convex subset of a Euclidean space \(E_i\), \(i=1,2\), and \(E=E_1\times E_2\), and let \(\varPhi (x^1,x^2):X_1\times X_2\rightarrow {\mathbf {R}}\) be a locally Lipschitz continuous function which is convex in \(x^1\in X_1\) and concave in \(x^2\in X_2\). \(X_1,X_2,\varPhi \) give rise to the saddle point problem
two induced convex optimization problems
and a vector field \(F(x^1,x^2)=[F_1(x^1,x^2);F_2(x^1,x^2)]\) specified (in general, non-uniquely) by the relations
It is well known that \(F\) is monotone on \(X\), and that weak solutions to the vi\((X,F)\) are exactly the saddle points of \(\varPhi \) on \(X_1\times X_2\). These saddle points exist if and only if \((P)\) and \((D)\) are solvable with equal optimal values, in which case the saddle points are exactly the pairs \((x^1_*,x^2_*)\) comprised by optimal solutions to \((P)\) and \((D)\). In general, \({\hbox {Opt}}(P)\ge {\hbox {Opt}}(D)\), with equality definitely taking place when at least one of the sets \(X_1,X_2\) is bounded; if both are bounded, saddle points do exist. To avoid unnecessary complications, from now on, when speaking about a convex–concave saddle point problem, we assume that the problem is proper, meaning that \({\hbox {Opt}}(P)\) and \({\hbox {Opt}}(D)\) are reals; this definitely is the case when \(X\) is bounded.
A natural (in)accuracy measure for a candidate \(x=[x^1;x^2]\in X_1\times X_2\) to the role of a saddle point of \(\varPhi \) is the quantity
This inaccuracy is nonnegative and is the sum of the duality gap \({\hbox {Opt}}(P)-{\hbox {Opt}}(D)\) (always nonnegative and vanishing when one of the sets \(X_1,X_2\) is bounded) and the inaccuracies, in terms of respective objectives, of \(x^1\) as a candidate solution to \((P)\) and \(x^2\) as a candidate solution to \((D)\).
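As a concrete instance, for the bilinear matrix game \(\varPhi (x^1,x^2)=\langle x^1,Ax^2\rangle \) over probability simplices, both inner optimizations defining the saddle point inaccuracy are available in closed form, since a linear form attains its extrema over a simplex at vertices. A numpy sketch (the function name is ours):

```python
import numpy as np

def eps_sad_bilinear(A, x1, x2):
    """Saddle point inaccuracy for Phi(x1, x2) = <x1, A x2>
    with X1, X2 the probability simplices:
    [max_{x2'} Phi(x1, x2')] - [min_{x1'} Phi(x1', x2)]
    = max_j (A^T x1)_j - min_i (A x2)_i."""
    return np.max(A.T @ x1) - np.min(A @ x2)
```

At a saddle point the inaccuracy vanishes; away from it, it is positive.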
The role of accuracy certificates in convex–concave saddle point problems stems from the following observation:
Proposition 2
Let \(X_1,X_2\) be nonempty closed convex sets, \(\varPhi :X:=X_1\times X_2\rightarrow {\mathbf {R}}\) be a locally Lipschitz continuous convex–concave function, and \(F\) be the associated monotone vector field on \(X\).
Let \(\mathcal{{I}}_t=\{x_\tau =[x^1_\tau ;x^2_\tau ]\in X,{F}(x_\tau )\}_{\tau =1}^t\) be a \(t\)-step execution protocol associated with \((X,{F})\) and \(\lambda ^t=\{\lambda ^t_\tau \}_{\tau =1}^t\) be an associated accuracy certificate. Then \(x^t:=x^t(\mathcal{{I}}_t,\lambda ^t)=[x^{1,t};x^{2,t}] \in X\).
Assume, further, that \(X^\prime _1\subset X_1\) and \(X^\prime _2\subset X_2\) are closed convex sets such that
Then
In addition, setting \(\widetilde{\varPhi }(x^1)=\sup _{x^2\in X_2^\prime }\varPhi \left( x^1,x^2\right) \), for every \(\widehat{x}^1\in X^\prime _1\) we have
In particular, when the problem \({\hbox {Opt}}=\min _{x^1\in X^\prime _1}\widetilde{\varPhi }(x^1)\) is solvable with an optimal solution \(x^1_*\), we have
Proof
The inclusion \(x^t\in X\) is evident. For every set \(Y\subset X\) we have
Thus, for every \(Y\subset X\) we have
Now assume that (17) takes place. Setting \(Y=X':=X^\prime _1\times X^\prime _2\) and recalling what \({\epsilon _{{\tiny \mathrm Sad}}}\) is, (21) yields (18). With \(Y=\{\widehat{x}^1\}\times X^\prime _2\), (21) yields the second inequality in (19); the first inequality in (19) is evident due to \(x^{2,t}\in X^\prime _2\).\(\square \)
3 Composite Mirror Prox algorithm
3.1 The situation
Let \(U\) be a nonempty closed convex domain in a Euclidean space \(E_u\), \(E_v\) be a Euclidean space, and \(X\) be a nonempty closed convex domain in \(E=E_u\times E_v\). We denote vectors from \(E\) by \(x=[u;v]\) with blocks \(u,v\) belonging to \(E_u\) and \(E_v\), respectively.
We assume that
- A1::
-
\(E_u\) is equipped with a norm \(\Vert \cdot \Vert \), the conjugate norm being \(\Vert \cdot \Vert _*\), and \(U\) is equipped with a distance-generating function (d.g.f.) \(\omega (\cdot )\) (that is, with a continuously differentiable convex function \(\omega (\cdot ):U\rightarrow {\mathbf {R}}\)) which is compatible with \(\Vert \cdot \Vert \), meaning that \(\omega \) is strongly convex, modulus 1, w.r.t. \(\Vert \cdot \Vert \). Note that d.g.f. \(\omega \) defines the Bregman distance
$$\begin{aligned} V_u(w):=\omega (w)-\omega (u)-\langle \omega '(u),w-u\rangle \ge {1\over 2}\Vert w-u\Vert ^2,\,u,w\in U, \end{aligned}$$(22)where the concluding inequality follows from strong convexity, modulus 1, of the d.g.f. w.r.t. \(\Vert \cdot \Vert \). In the sequel, we refer to the pair \(\Vert \cdot \Vert ,\,\omega (\cdot )\) as to proximal setup for \(U\).
- A2::
-
the image \(PX\) of \(X\) under the projection \(x=[u;v]\mapsto Px:=u\) is contained in \(U\).
- A3::
-
we are given a vector field \(F(u,v):X\rightarrow E\) on \(X\) of the special structure as follows:
$$\begin{aligned} F(u,v)=[F_u(u);F_v], \end{aligned}$$with \(F_u(u)\in E_u\) and \(F_v\in E_v\). Note that \(F\) is independent of \(v\). We assume also that
$$\begin{aligned} \forall u,u'\in U: \Vert F_u(u)-F_u(u')\Vert _*\le L\Vert u-u'\Vert +M \end{aligned}$$(23)with some \(L<\infty \), \(M<\infty \).
- A4::
-
the linear form \(\langle F_v,v\rangle \) of \([u;v]\in E\) is bounded from below on \(X\) and is coercive on \(X\) w.r.t. \(v\): whenever \([u_t;v_t]\in X\), \(t=1,2, \ldots \) is a sequence such that \(\{u_t\}_{t=1}^\infty \) is bounded and \(\Vert v_t\Vert _2\rightarrow \infty \) as \(t\rightarrow \infty \), we have \(\langle F_v,v_t\rangle \rightarrow \infty \), \(t\rightarrow \infty \).
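A standard instance of the proximal setup in A1 is the entropy d.g.f. on the probability simplex, which is compatible (in the sense of A1, via Pinsker’s inequality) with the \(\ell _1\) norm. A minimal numpy sketch of the Bregman distance (22) under this setup; the function names are ours:

```python
import numpy as np

def breg(omega, grad_omega, u, w):
    """Bregman distance V_u(w) = omega(w) - omega(u) - <omega'(u), w - u>."""
    return omega(w) - omega(u) - np.dot(grad_omega(u), w - u)

# Entropy d.g.f. on the (relative interior of the) probability simplex,
# compatible with ||.||_1; its Bregman distance is the KL divergence.
omega = lambda u: np.sum(u * np.log(u))
grad_omega = lambda u: np.log(u) + 1.0
```

The lower bound in (22), \(V_u(w)\ge {1\over 2}\Vert w-u\Vert ^2\) with \(\Vert \cdot \Vert =\Vert \cdot \Vert _1\), can be checked numerically on simplex points.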
Our goal in this section is to show that in the situation in question, proximal type processing of \(F\) (say, when \(F\) is monotone on \(X\) and we want to solve the variational inequality given by \(F\) and \(X\)) can be implemented “as if” there were no \(v\)-components in the domain and in \(F\).
A generic application we are aiming at is as follows. We want to solve a “composite” saddle point problem
where
-
\(U_1\subset E_1\) and \(U_2\subset E_2\) are nonempty closed convex sets in Euclidean spaces \(E_1,E_2\)
-
\(\phi \) is a smooth (with Lipschitz continuous gradient) convex–concave function on \(U_1\times U_2\)
-
\(\varPsi _1:U_1\rightarrow {\mathbf {R}}\) and \(\varPsi _2:U_2\rightarrow {\mathbf {R}}\) are convex functions, perhaps nonsmooth, but “fitting” the domains \(U_1\), \(U_2\) in the following sense: for \(i=1,2\), we can equip \(E_i\) with a norm \(\Vert \cdot \Vert _{(i)}\), and \(U_i\) with a d.g.f. \(\omega _i(\cdot )\) compatible with this norm, in such a way that optimization problems of the form
are easy to solve.
Our ultimate goal is to solve (24) “as if” there were no (perhaps) nonsmooth terms \(\varPsi _i\). With our approach, we intend to “get rid” of the nonsmooth terms by “moving” them into the description of problem’s domains. To this end, we act as follows:
-
For \(i=1,2\), we set \(X_i=\{x_i=[u_i;v_i]\in E_i\times {\mathbf {R}}: u_i\in U_i,v_i\ge \varPsi _i(u_i)\}\) and set
$$\begin{aligned} U&:= U_1\times U_2\subset E_u:=E_1\times E_2, E_v={\mathbf {R}}^2,\\ X&= \Big \{x=[u=[u_1;u_2];v=[v_1;v_2]]:u_i\in U_i,v_i\ge \varPsi _i(u_i),i=1,2\Big \}\\&\subset E_u\times E_v, \end{aligned}$$thus ensuring that \(PX\subset U\), where \(P[u;v]=u\);
-
We rewrite the problem of interest equivalently as
$$\begin{aligned} \hbox {SadVal}=\min _{x^1=[u_1;v_1]\in X_1}\max _{x^2=[u_2;v_2]\in X_2}\left[ \varPhi (u_1,v_1;u_2,v_2)=\phi (u_1,u_2)+v_1-v_2\right] \end{aligned}$$(26)Note that \(\varPhi \) is convex–concave and smooth. The associated monotone operator is
$$\begin{aligned} F(u&= [u_1;u_2],v=[v_1;v_2])\\&= \left[ F_u(u)= [\nabla _{u_1}\phi (u_1,u_2);-\nabla _{u_2}\phi (u_1,u_2)];F_v=[1;1]\right] \end{aligned}$$and is of the structure required in A3. Note that \(F\) is Lipschitz continuous, so that (23) is satisfied with properly selected \(L\) and with \(M=0\).
We intend to process the reformulated saddle point problem (26) with a properly modified state-of-the-art MP saddle point algorithm [17]. In its basic version, as applied to a variational inequality with a Lipschitz continuous monotone operator (in particular, to a convex–concave saddle point problem with smooth cost function), this algorithm exhibits an \(O(1/t)\) rate of convergence, which is the best rate achievable with First Order saddle point algorithms as applied to large-scale saddle point problems (even those with bilinear cost function). The basic MP would require equipping the domain \(X=X_1\times X_2\) of (26) with a d.g.f. \(\omega (x_1,x_2)\) resulting in easy-to-solve auxiliary problems of the form
which would require accounting in \(\omega \), in a nonlinear fashion, for the \(v\)-variables (since \(\omega \) should be strongly convex in both the \(u\)- and \(v\)-variables). While it is easy to construct \(\omega \) from our postulated “building blocks” \(\omega _1\), \(\omega _2\) leading to easy-to-solve problems (25), this construction results in auxiliary problems (27) somewhat more complicated than problems (25). To overcome this difficulty, below we develop a “composite” MP algorithm which takes advantage of the special structure of \(F\), as expressed in A3, and preserves the favorable efficiency estimates of the prototype. The modified MP operates with auxiliary problems of the form
that is, with pairs of uncoupled problems
recalling that \(X_i=\{[u_i;v_i]:u_i\in U_i,v_i\ge \varPsi _i(u_i)\}\), these problems are nothing but the easy-to-solve problems (25).
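The reduction behind these uncoupled problems can be checked numerically in the simplest case: Euclidean \(\omega \), scalar \(u\), and \(\varPsi (u)=|u|\). Minimizing \({1\over 2}u^2+\eta u+\gamma v\) over the epigraph \(\{(u,v):v\ge |u|\}\) forces \(v=|u|\) at the optimum (since \(\gamma >0\)), so the problem coincides with the composite prox, which here is soft-thresholding. A brute-force sketch under these assumptions (all names are ours):

```python
import numpy as np

def prox_epigraph_bruteforce(eta, gamma, ugrid, vgrid):
    """Brute-force min of 0.5*u^2 + eta*u + gamma*v over the
    epigraph {(u, v): v >= |u|}, discretized on a grid; returns
    the minimizing u."""
    U, V = np.meshgrid(ugrid, vgrid)
    vals = 0.5 * U**2 + eta * U + gamma * V
    vals[V < np.abs(U)] = np.inf            # enforce v >= |u|
    iu = np.unravel_index(np.argmin(vals), vals.shape)[1]
    return ugrid[iu]

def prox_composite(eta, gamma):
    """Closed form of min_u 0.5*u^2 + eta*u + gamma*|u| (soft-thresholding)."""
    w = -eta
    return np.sign(w) * max(abs(w) - gamma, 0.0)
```

Both routines return (up to the grid resolution) the same minimizer, illustrating that the epigraph problem is “as easy” as the composite prox (25).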
3.2 Composite Mirror Prox algorithm
Given the situation described in Sect. 3.1, we define the associated prox-mapping: for \(\xi =[\eta ;\zeta ]\in E\) and \(x=[u;v]\in X\),
Observe that \(P_x([\eta ;\gamma F_v])\) is well defined whenever \(\gamma >0\)—the required \(\mathop {\hbox {Argmin}}\) is nonempty due to the strong convexity of \(\omega \) on \(U\) and assumption A4 (for verification, see item \(0^{\circ }\) in Appendix 1). Now consider the process as follows:
where \(\gamma _\tau >0\); the latter relation, due to the above, implies that the recurrence (29) is well defined.
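In the simplest Euclidean setup (\(\omega (u)={1\over 2}\Vert u\Vert _2^2\), \(X=U\), no \(v\)-block), the recurrence (29) reduces to the extragradient scheme, with the approximate solution given by the stepsize-weighted average of the points \(y_\tau \). A numpy sketch on a toy bilinear saddle point with \(L=1\), \(M=0\) and constant stepsize \(\gamma =1/L\) (all names are ours):

```python
import numpy as np

def mirror_prox_euclid(F, proj, x0, gamma, iters):
    """Euclidean Mirror Prox (extragradient) sketch of recurrence (29):
    y   = proj(x - gamma * F(x))   (extrapolation step)
    x+  = proj(x - gamma * F(y))   (update step)
    The returned approximate solution is the gamma-weighted average of the y's."""
    x = x0.astype(float).copy()
    avg = np.zeros_like(x)
    s = 0.0
    for _ in range(iters):
        y = proj(x - gamma * F(x))
        x = proj(x - gamma * F(y))
        avg += gamma * y
        s += gamma
    return avg / s

# Toy bilinear saddle point min_u max_w u*w on [-1,1]^2:
# the associated monotone field is F(u, w) = [w; -u], with L = 1, M = 0.
F = lambda z: np.array([z[1], -z[0]])
proj = lambda z: np.clip(z, -1.0, 1.0)
```

The unique saddle point is the origin, and the averaged iterate approaches it at the \(O(1/t)\) rate of the theorem.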
Theorem 1
In the setting of Sect. 3.1, assuming that A1–A4 hold, consider the Composite Mirror Prox recurrence (29) (CoMP) with stepsizes \(\gamma _\tau >0\), \(\tau =1,2, \ldots \) satisfying the relation:
Then the corresponding execution protocol \(\mathcal{{I}}_t=\{y_\tau ,F(y_\tau )\}_{\tau =1}^t\) admits accuracy certificate \(\lambda ^t=\{\lambda ^t_\tau =\gamma _\tau /\sum _{i=1}^t\gamma _i\}\) such that for every \(X'\subset X\) it holds
Relation (30) is definitely satisfied when \(0<\gamma _\tau \le ({\sqrt{2}L})^{-1}\), or, in the case of \(M=0\), when \(\gamma _\tau \le L^{-1}\).
Invoking Propositions 1, 2, we arrive at the following
Corollary 1
Under the premise of Theorem 1, for every \(t=1,2, \ldots \), setting
we ensure that \(x^t\in X\) and that
(i) In the case when \(F\) is monotone on \(X\), we have
(ii) Let \(X=X_1\times X_2\), and let \(F\) be the monotone vector field associated with the saddle point problem (14) with convex–concave locally Lipschitz continuous cost function \(\varPhi \). Then
In addition, assuming that problem \((P)\) in (15) is solvable with optimal solution \(x^1_*\) and denoting by \(x^{1,t}\) the projection of \(x^t\in X=X_1\times X_2\) onto \(X_1\), we have
Remark
When \(F\) is Lipschitz continuous (that is, (23) holds true with some \(L>0\) and \(M=0\)), the requirements on the stepsizes imposed in the premise of Theorem 1 reduce to \(\delta _\tau \le 0\) for all \(\tau \) and are definitely satisfied with the constant stepsizes \(\gamma _\tau =1/L\). Thus, in the case under consideration we can assume w.l.o.g. that \(\gamma _\tau \ge 1/L\), thus ensuring that the upper bound on \({\hbox {Res}}(X'\big |\mathcal{{I}}_t,\lambda ^t)\) in (31) is \(\le \varTheta [X']Lt^{-1}\). As a result, (34) becomes
3.3 Modifications
In this section, we demonstrate that our algorithm in fact admits some freedom in building approximate solutions, freedom which can be used to improve, to some extent, the quality of the solutions. The modifications to be presented originate from [19]. We assume that we are in the situation described in Sect. 3.1, and that assumptions A1–A4 are in force. In addition, we assume that
- A5::
-
The vector field \(F\) described in A3 is monotone, and the variational inequality given by \((X,F)\) has a weak solution:
$$\begin{aligned} \exists x_*=[u_*;v_*]\in X: \langle F(y),y-x_*\rangle \ge 0\,\,\forall y\in X \end{aligned}$$(36)
Lemma 1
In the situation from Sect. 3.1 and under assumptions A1–A5, for \(R\ge 0\) let us set
(this quantity is finite since \(\omega \) is continuously differentiable on \(U\)), and let
be the trajectory of the \(N\)-step MP algorithm (29) with stepsizes \(\gamma _\tau >0\) which ensure (30) for \(\tau \le N\). Then for all \(u\in U\) and \(t\le N+1\),
with \(u_*\) defined in (36).
Proposition 3
In the situation of Sect. 3.1 and under assumptions A1–A5, let \(N\) be a positive integer, and let \(\mathcal{{I}}_N=\{y_\tau ,F(y_\tau )\}_{\tau =1}^N\) be the execution protocol generated by \(N\)-step CoMP (29) with stepsizes \(\gamma _\tau \) ensuring (30). Let also \(\lambda ^N=\{\lambda _1, \ldots ,\lambda _N\}\) be a collection of positive reals summing up to 1 and such that
Then for every \(R\ge 0\), with \(X_R=\{x=[u;v]\in X: \Vert u-u_1\Vert \le R\}\) one has
with \(\widehat{\varTheta }(\cdot )\) and \(R_N\) defined by (37) and (38).
Invoking Propositions 1, 2, we arrive at the following modification of Corollary 1.
Corollary 2
Under the premise and in the notation of Proposition 3, setting
we ensure that \(x^N\in X\). Besides this,
(i) Let \(X'\) be a closed convex subset of \(X\) such that \(x^N\in X'\) and the projection of \(X'\) on the \(u\)-space is contained in \(\Vert \cdot \Vert \)-ball of radius \(R\) centered at \(u_1\). Then
(ii) Let \(X=X_1\times X_2\) and \(F\) be the monotone vector field associated with saddle point problem (14) with convex–concave locally Lipschitz continuous cost function \(\varPhi \). Let, further, \(X^\prime _i\) be closed convex subsets of \(X_i\), \(i=1,2\), such that \(x^N\in X^\prime _1\times X^\prime _2\) and the projection of \(X^\prime _1\times X^\prime _2\) onto the \(u\)-space is contained in \(\Vert \cdot \Vert \)-ball of radius \(R\) centered at \(u_1\). Then
4 Multi-term composite minimization
In this section, we focus on the problem (3), (6) of multi-term composite minimization.
4.1 Problem setting
We intend to consider problem (3), (6) in the situation as follows. For a nonnegative integer \(K\) and \(0\le k\le K\) we are given
-
1.
Euclidean spaces \(E_k\) and \({\overline{E}}_k\) along with their nonempty closed convex subsets \(Y_k\) and \(Z_k\), respectively;
-
2.
Proximal setups for \((E_k,Y_k)\) and \(({\overline{E}}_k, Z_k)\), that is, norms \(p_k(\cdot )\) on \(E_k\), norms \(q_k(\cdot )\) on \({\overline{E}}_k\), and d.g.f.’s \(\omega _k(\cdot ):Y_k\rightarrow {\mathbf {R}}\), \({\overline{\omega }}_k(\cdot ):Z_k\rightarrow {\mathbf {R}}\) compatible with \(p_k(\cdot )\) and \(q_k(\cdot )\), respectively;
-
3.
Affine mappings \(y^0\mapsto A_k y^0+b_k:E_0\rightarrow E_k\), where \(y^0\mapsto A_0y^0+b_0\) is the identity mapping on \(E_0\);
-
4.
Lipschitz continuous convex functions \(\psi _k(y^k):Y_k\rightarrow {\mathbf {R}}\) along with their saddle point representations
$$\begin{aligned} \psi _k(y^k)=\sup _{z^k\in Z_k}{[}\phi _k(y^k,z^k)-{\overline{\varPsi }}_k(z^k){]},\;\;0\le k\le K, \end{aligned}$$(43)where \(\phi _k(y^k,z^k):Y_k\times Z_k\rightarrow {\mathbf {R}}\) are smooth (with Lipschitz continuous gradients) functions convex in \(y^k\in Y_k\) and concave in \(z^k\in Z_k\), and \({\overline{\varPsi }}_k(z^k):Z_k\rightarrow {\mathbf {R}}\) are Lipschitz continuous convex functions such that the problems of the form
$$\begin{aligned} \min \limits _{z^k\in Z_k}\left[ {\overline{\omega }}_k(z^k)+\langle \xi ^k,z^k\rangle +\alpha {\overline{\varPsi }}_k(z^k)\right] \quad [\alpha >0] \end{aligned}$$(44)are easy to solve;
-
5.
Lipschitz continuous convex functions \(\varPsi _k(y^k):Y_k\rightarrow {\mathbf {R}}\) such that the problems of the form
$$\begin{aligned} \min \limits _{y^k\in Y_k}\left[ \omega _k(y^k)+\langle \xi ^k,y^k\rangle +\alpha \varPsi _k(y^k)\right] \quad [\alpha >0] \end{aligned}$$(45)are easy to solve;
-
6.
For \(1\le k\le K\), the norms \(\pi _k^*(\cdot )\) on \(E_k\) are given, with conjugate norms \(\pi _k(\cdot )\), along with d.g.f.’s \(\widehat{\omega }_k(\cdot ):\;W_k:=\{w^k\in E_k:\pi _k(w^k)\le 1\}\rightarrow {\mathbf {R}}\) which are strongly convex, modulus 1, w.r.t. \(\pi _k(\cdot )\) such that the problems
$$\begin{aligned} \min _{w^k\in W_k}\left[ \widehat{\omega }_k(w^k)+\langle \xi ^k,w^k\rangle \right] \end{aligned}$$(46)are easy to solve.
The outlined data define the sets
The problem of interest (3), (6) along with its saddle point reformulation in the just defined situation read
which we rewrite equivalently as
From now on we make the following assumptions
B1: We have \(A_kY_0+b_k\subset Y_k\), \(1\le k\le K\);
B2: For \(0\le k\le K\), the sets \(Z_k\) are bounded. Further, the functions \(\varPsi _k\) are below bounded on \(Y_k\), and the functions \(f_k=\psi _k+\varPsi _k\) are coercive on \(Y_k\): whenever \(y^k_t\in Y_k\), \(t=1,2, \ldots ,\) are such that \(p_k(y^k_t)\rightarrow \infty \) as \(t\rightarrow \infty \), we have \({f_k}(y^k_t)\rightarrow \infty \).
Note that B1 and B2 imply that the saddle point problem (47c) is solvable; let \(\left\{ [y^k_*;\tau ^k_*]\right\} _{0\le k\le K};\;\left\{ [z^k_*;\sigma ^k_*]\right\} _{0\le k\le K} \) be the corresponding saddle point.
4.2 Course of actions
Given \(\rho _k>0\), \(1\le k\le K\), we approximate (47c) by the problem
where
Observe that the monotone operator \(F(x^1,x^2)=\left[ F_1(x^1,x^2);F_2(x^1,x^2)\right] \) associated with the saddle point problem in (48b) is given by
Now let us set
-
\(U=\left\{ \begin{array}{rl}&{}u=\left[ y^0;\ldots ;y^K;z^0;\ldots ;z^K;w^1;\ldots ;w^K\right] :y^k\in Y_k, \,z^k\in Z_k,\\ &{}\qquad \qquad \qquad \qquad \qquad \qquad \qquad 0\le k\le K,\pi _k(w^k)\le 1,\,1\le k\le K\end{array}\right\} ,\)
-
\(X=\left\{ \begin{array}{rl} &{} x=\left[ u=\left[ y^0;\ldots ;y^K;z^1;\ldots ;z^K;w^1;\ldots ;w^K \right] ; \right. \\ &{} \quad \quad \left. \;v=[\tau ^0;\ldots ;\tau ^K;\sigma ^0;\ldots ;\sigma ^K]\right] :\\ &{} u\in U, \,\tau ^k\ge \varPsi _k(y^k),\,\sigma ^k\ge {\overline{\varPsi }}_k(z^k),\,0\le \\ &{}\quad k\le K\end{array}\right\} \), so that \(PX\subset U\), cf. assumption A2 in Sect. 3.1.
The variational inequality associated with the saddle point problem in (48b) can be treated as the variational inequality on the domain \(X\) with the monotone operator
where
This operator meets the structural assumptions A3 and A4 from Sect. 3.1 (A4 is guaranteed by B2). We can equip \(U\) and its embedding space \(E_u\) with the proximal setup \(\Vert \cdot \Vert ,\;\omega (\cdot )\) given by
where \(\alpha _k,\beta _k\), \(0\le k\le K\), and \(\gamma _k\), \(1\le k\le K\), are positive aggregation parameters. Observe that carrying out a step of the CoMP algorithm presented in Sect. 3.2 requires computing \(F\) at \(O(1)\) points of \(X\) and solving \(O(1)\) auxiliary problems of the form
with positive \(a_k,...,e_k\), and we have assumed that these problems are easy to solve.
4.3 “Exact penalty”
Let us make one more assumption:
C: For \(1\le k\le K\),
-
\(\psi _{k}\) are Lipschitz continuous on \(Y_k\) with constants \(G_k\) w.r.t. \(\pi _k^*(\cdot )\),
-
\(\varPsi _{k}\) are Lipschitz continuous on \(Y_k\) with constants \(H_k\) w.r.t. \(\pi _k^*(\cdot )\).
Given a feasible solution \(\overline{x}=[\overline{x}^1; \overline{x}^2]\), \(\overline{x}^1:=\{[\overline{y}^k;\overline{\tau }^k]\in Y^+_k\}_{k=0}^K\) to the saddle point problem (48b), let us set
thus getting another feasible (by assumption B1) solution \(\widehat{x}{=}\big [\widehat{x}^1 {=}\{[\widehat{y}^k;\widehat{\tau }^k]\}_{k=0}^K;\,\overline{x}^2\big ]\) to (48b). We call \(\widehat{x}^1\) the correction of \(\overline{x}^1\). For \(1\le k\le K\) we clearly have
and \(\widehat{\tau }^0=\varPsi _0(\overline{y}^0)\le \overline{\tau }^0\). Hence for \( \overline{\varPhi }(x^1)=\max \limits _{x^2\in X_2} \varPhi (x^1,x^2)\) we have
We see that under the condition
the correction does not increase the value of the primal objective of (48b), whence the saddle point value \(\widehat{{\hbox {Opt}}}\) of (48b) is \(\ge \) the optimal value \({\hbox {Opt}}\) of the problem of interest (47a). Since the opposite inequality is evident, we arrive at the following
Proposition 4
In the situation of Sect. 4.1, let assumptions B1, B2, C and (52) hold true. Then
(i) the optimal value \(\widehat{{\hbox {Opt}}}\) in (48a) coincides with the optimal value \({\hbox {Opt}}\) in the problem of interest (47a);
(ii) consequently, if \(\overline{x}=[\overline{x}^1;\overline{x}^2]\) is a feasible solution of the saddle point problem in (48b), then the correction \(\widehat{x}^1=\{[\widehat{y}^k;\widehat{\tau }^k]\}_{k=0}^K\) of \(\overline{x}^1\) is a feasible solution to the problem of interest (47c), and
$$\begin{aligned} f(\widehat{y}^0)-{\hbox {Opt}}\le {\epsilon _{{\tiny \mathrm Sad}}}(\overline{x}\big |X_1,X_2,\varPhi ), \end{aligned}$$(53)
where \(\widehat{y}^0(=y^0(\widehat{x}^1))\) is the “\(y^0\)-component” of \(\widehat{x}^1\);
As a corollary, under the premise of Proposition 4, when we apply the CoMP algorithm induced by the above setup to the saddle point problem (48b) and pass “at no cost” from the approximate solutions \(x^t=[x^{1,t};x^{2,t}]\) generated by CoMP to the corrections \(\widehat{x}^{1,t}\) of the \(x^{1,t}\)’s, we get feasible solutions to the problem of interest (47a) satisfying the error bound
where \(L\) is the Lipschitz constant of \(F_u(\cdot )\) induced by the norm \(\Vert \cdot \Vert \) given by (51), and \(\varTheta [\cdot ]\) is induced by the d.g.f. given by the same (51) and the \(u=[y^0;\ldots ;y^K;z^0;\ldots ;z^K;w^1;\ldots ;w^K]\)-component of the starting point. Note that \(W_k\) and \(Z_k\) are compact, whence \(\varTheta [x_*^1\times X_2]\) is finite.
Remark
In principle, we can use the result of Proposition 4 “as is,” that is, work from the very beginning with values of \(\rho _k\) satisfying (52); this option is feasible provided that we know in advance the corresponding Lipschitz constants and they are not too large (which indeed is the case in some applications). This being said, when our objective is to ensure the validity of the bound (53), selecting the \(\rho _k\)’s according to (52) could be very conservative. In our experience, it is usually better to adjust the penalization coefficients \(\rho _k\) online. Specifically, let \(\overline{\varPhi }(\overline{x}^1)=\sup _{x^2\in X_2}\varPhi (\overline{x}^1,x^2)\) (cf. (15)). We always have \(\widehat{{\hbox {Opt}}}\le {\hbox {Opt}}\). It follows that, independently of how the \(\rho _k\) are selected, we have
for every feasible solution \(\overline{x}^1=\{[\overline{y}^k;\overline{\tau }^k]\}_{k=0}^K\) to (48b), and the same inequality holds for its correction \(\widehat{x}^1=\left\{ [\widehat{y}^k;\widehat{\tau }^k]\right\} _{k=0}^K\). When \(\overline{x}^1\) is a component of a good (with small \({\epsilon _{{\tiny \mathrm Sad}}}\)) approximate solution to the saddle point problem (48b), \(\epsilon _2\) is small. If \(\epsilon _1\) is also small, we are done; otherwise we can either increase in a fixed ratio the current values of all the \(\rho _k\), or only those \(\rho _k\) for which passing from \([\overline{y}^k;\overline{\tau }^k]\) to \([\widehat{y}^k;\widehat{\tau }^k]\) results in “significant” quantities
and solve the updated saddle point problem (48b).
4.4 Numerical illustrations
4.4.1 Matrix completion
Problem of interest In the experiments to be reported, we applied the just outlined approach to Example 1, that is, to the problem
where \(\varOmega \) is a given set of cells in an \(n\times n\) matrix, and \(P_\varOmega y\) is the restriction of \(y\in {\mathbf {R}}^{n\times n}\) onto \(\varOmega \); this restriction is treated as a vector from \({\mathbf {R}}^M\), \(M={\hbox {Card}}(\varOmega )\). Thus, (56) is a kind of matrix completion problem where we want to recover a sparse and low rank \(n\times n\) matrix given noisy observations \(b\) of its entries in cells from \(\varOmega \). Note that (56) is a special case of (47b) with \(K=1\), \(Y_0=Y_1=E_0=E_1={\mathbf {R}}^{n\times n}\), the identity mapping \(y^0\mapsto A_1y^0\), and \(\phi _0(y^0,z^0)\equiv \psi _0(y^0)\), \(\phi _1\equiv 0\) (so that \(Z_k\) can be defined as singletons, and \({\overline{\varPsi }}_k(\cdot )\) set to 0, \(k=0,1\)).
Implementing the CoMP algorithm When implementing the CoMP algorithm, we used the Frobenius norm \(\Vert \cdot \Vert _F\) on \({\mathbf {R}}^{n\times n}\) in the role of \(p_0(\cdot )\), \(p_1(\cdot )\) and \(\pi _1(\cdot )\), and the function \({1\over 2}\Vert \cdot \Vert _F^2\) in the role of d.g.f.’s \(\omega _0(\cdot )\), \(\omega _1(\cdot )\), \(\widehat{\omega }_1(\cdot )\).
The aggregation weights in (51) were chosen as \(\alpha _0=\alpha _1=1/D\) and \(\gamma _1=1\), where \(D\) is a guess of the quantity \(D_*:=\Vert y^0_*\Vert _F\), \(y^0_*\) being the optimal solution of (56). With \(D=D_*\), our aggregation would roughly optimize the right hand side of (54), provided the starting point is the origin.
The coefficient \(\rho _1\) in (48b) was adjusted dynamically, as explained at the end of Sect. 4.3. Specifically, we start with a small (0.001) value of \(\rho _1\) and restart the solution process, increasing the previous value of \(\rho _1\) by a factor of 3, each time the \(x^1\)-component \(\overline{x}\) of the current approximate solution and its correction \(\widehat{x}\) violate the inequality \(\upsilon (y^0(\widehat{x}))\le (1+\kappa )\overline{\varPhi }(\overline{x})\) for some small tolerance \(\kappa \) (we used \(\kappa =10^{-4}\)), cf. (55).
The stepsizes \(\gamma _t\) in the CoMP algorithm were adjusted dynamically, as follows. At step \(\tau \), given a current guess \(\gamma \) for the stepsize, we set \(\gamma _\tau =\gamma \), perform the step, and check whether \(\delta _\tau \le 0\). If this is the case, we pass to step \(\tau +1\), the new guess for the stepsize being 1.2 times the old one. If \(\delta _\tau \) is positive, we decrease \(\gamma _\tau \) in a fixed proportion (in our implementation, by a factor of 0.8), repeat the step, and proceed in this fashion until the resulting value of \(\delta _\tau \) becomes nonpositive. When this happens, we pass to step \(\tau +1\) and use the value of \(\gamma _\tau \) we have ended up with as our new guess for the stepsize.
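For illustration, this stepsize policy can be sketched as follows (a minimal Python sketch; `do_step` is a hypothetical callback that performs one CoMP step with a given stepsize and returns the resulting \(\delta _\tau \)):

```python
def adaptive_step(do_step, gamma_guess, inc=1.2, dec=0.8):
    """One step of the stepsize policy: try the current guess, shrink it
    by `dec` until delta <= 0, and return the accepted stepsize together
    with the guess for the next step."""
    gamma = gamma_guess
    delta = do_step(gamma)
    if delta <= 0:
        # step accepted at once: be more aggressive next time
        return gamma, inc * gamma
    while delta > 0:
        gamma *= dec            # shrink in a fixed proportion
        delta = do_step(gamma)  # repeat the step with the smaller stepsize
    return gamma, gamma         # the accepted stepsize becomes the new guess
```

A call `adaptive_step(do_step, gamma)` thus both executes the (possibly repeated) step \(\tau \) and produces the stepsize guess for step \(\tau +1\).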
In all our experiments, the starting point was given by the matrix \(\widehat{y}:=P_\varOmega ^*b\) (“observations of entries in cells from \(\varOmega \) and zeros in all other cells”) according to \(y^0=y^1=\widehat{y}\), \(\tau ^0=\lambda \Vert \widehat{y}\Vert _1\), \(\tau ^1=\mu \Vert \widehat{y}\Vert _{\mathrm{nuc}}\), \(w^1=0\).
Lower bounding the optimal value When running the CoMP algorithm, at every step \(t\) we have at our disposal an approximate solution \(y^{0,t}\) to the problem of interest (56); \(y^{0,t}\) is nothing but the \(y^0\)-component of the approximate solution \(x^t\) generated by CoMP as applied to the saddle point approximation of (56) corresponding to the current value of \(\rho _1\), see (49). We also have at our disposal the value \(\upsilon (y^{0,t})\) of the objective of (56) at \(y^{0,t}\); this quantity is a byproduct of checking whether we should update the current value of \(\rho _1\).Footnote 4 As a result, we have at our disposal the best value found so far, \(\upsilon ^t=\min _{1\le \tau \le t}\upsilon (y^{0,\tau })\), along with the corresponding value \(y^{0,t}_*\) of \(y^0\): \(\upsilon (y^{0,t}_*)=\upsilon ^t\). In order to understand how good the best approximate solution \(y^{0,t}_*\) generated so far is, we need to upper bound the quantity \(\upsilon ^t-{\hbox {Opt}}\), or, which is the same, to lower bound \({\hbox {Opt}}\). This is a nontrivial task, since the domain of the problem of interest is unbounded, while the usual techniques for online lower bounding of the optimal value in a convex minimization problem require the domain to be bounded. We are about to describe a technique for lower bounding \({\hbox {Opt}}\) that utilizes the structure of (56).
Let \(y^0_*\) be an optimal solution to (56) (it clearly exists since \(\psi _0\ge 0\) and \(\lambda ,\mu >0\)). Assume that at a step \(t\) we have at our disposal an upper bound \(R=R_t\) on \(\Vert y^0_*\Vert _1\), and let
Let us look at the saddle point approximation of the problem of interest
associated with current value of \(\rho _1\), and let
Observe that the point \(x^{1,*}=[y^0_*;{\lambda }\Vert y^0_*\Vert _1;y^0_*;\mu \Vert y^0_*\Vert _{\mathrm{nuc}}]\) belongs to \(\widehat{X}_1\) (recall that \(\Vert \cdot \Vert _{\mathrm{nuc}}\le \Vert \cdot \Vert _1\)) and that
It follows that
Further, by Proposition 2 as applied to \(X^\prime _1=\widehat{X}_1\) and \(X^\prime _2=X_2\) we haveFootnote 5
where \(\mathcal{{I}}_t\) is the execution protocol generated by CoMP as applied to the saddle point problem (57) (i.e., from the last restart preceding step \(t\) until this step), and \(\lambda ^t\) is the associated accuracy certificate. We conclude that
and \(\ell _t\) is easy to compute (since the resolution is just the maximum, over \(\widehat{X}_1\times X_2\), of an affine function readily given by \(\mathcal{{I}}_t,\lambda ^t\)). Setting \(\upsilon _t=\max _{\tau \le t} \ell _\tau \), we get lower bounds on \({\hbox {Opt}}\) that are nondecreasing with \(t\). Note that this component of our lower bounding is independent of the particular structure of \(\psi _0\).
It remains to explain how to get an upper bound \(R\) on \(\Vert y^0_*\Vert _1\), and this is where the special structure of \(\psi _0(y)={1\over 2}\Vert P_\varOmega y-b\Vert _2^2\) is used. Recalling that \(b\in {\mathbf {R}}^M\), let us set
It is immediately seen that \(\vartheta (\cdot )\) remains intact when the entries of \(b\) are replaced by their magnitudes, and that for \(b\ge 0\) we have
so that \(\vartheta (\cdot )\) is an easy-to-compute, nonnegative, and nonincreasing convex function of \(r\ge 0\). Now, by the definition of \(P_\varOmega \), the function \(\vartheta ^+(\Vert y^0\Vert _1)\), where
is a lower bound on \(\upsilon (y^0)\). As a result, given an upper bound \(\upsilon ^t\) on \({\hbox {Opt}}=\upsilon (y_*)\), the easy-to-compute quantity
is an upper bound on \(\Vert y^0_*\Vert _1\). Since \(\upsilon ^t\) is nonincreasing in \(t\), \(R_t\) is nonincreasing in \(t\) as well.
Generating the data In the experiments to be reported, the data of (56) were generated as follows. Given \(n\), we build the “true” \(n\times n\) matrix \(y_{\#}=\sum _{i=1}^ke_if_i^T\), with \(k=\lfloor n/4\rfloor \) and vectors \(e_i,f_i\in {\mathbf {R}}^n\) sampled independently of each other as follows: we draw a vector from the standard Gaussian distribution \(\mathcal{N}(0,I_n)\) and then zero out part of the entries, with the probability of replacing a particular entry with zero selected in such a way that the sparsity of \(y_{\#}\) is about a desired level (in our experiments, we wanted \(y_{\#}\) to have about 10% of its entries nonzero). The set \(\varOmega \) of “observed cells” was built at random, with probability 0.25 for a particular cell to be in \(\varOmega \). Finally, \(b\) was generated as \(P_\varOmega (y_{\#}+\sigma \xi )\), where the entries of \(\xi \in {\mathbf {R}}^{n\times n}\) were drawn independently of each other from the standard Gaussian distribution, and
We used \({\lambda }=\mu =10\sigma \).Footnote 6 Finally, our guess for the Frobenius norm of the optimal solution to (56) is defined as follows. Note that the quantity \(\Vert b\Vert _2^2-M\sigma ^2\) is an estimate of \(\Vert P_\varOmega y_{\#}\Vert _2^2\). We define the estimate \(D\) of \(D_*:=\Vert y_*\Vert _F\) “as if” the optimal solution were \(y_{\#}\), and all entries of \(y_{\#}\) were of the same order of magnitude
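The data generation just described can be sketched as follows (an illustrative sketch assuming numpy; the keep-probability formula for the factor entries and the noise level `sigma_rel` are our own choices, the former picked so that the sum of the \(k\) rank-one terms has roughly the target sparsity):

```python
import numpy as np

def make_data(n, target_nnz=0.10, p_obs=0.25, sigma_rel=0.01, seed=0):
    """Synthetic data for the matrix completion problem (56): a sparse,
    low-rank "true" matrix, a random observation set Omega, and noisy
    observations b of the entries in Omega."""
    rng = np.random.default_rng(seed)
    k = n // 4
    # Keep each factor entry with probability p, chosen so that the sum of
    # k rank-one terms e_i f_i^T has about `target_nnz` nonzero entries:
    # an entry is nonzero unless all k products e_i(row)*f_i(col) vanish.
    p = np.sqrt(1.0 - (1.0 - target_nnz) ** (1.0 / k))
    E = rng.standard_normal((n, k)) * (rng.random((n, k)) < p)
    F = rng.standard_normal((n, k)) * (rng.random((n, k)) < p)
    y_true = E @ F.T                    # sparse and of rank <= k
    omega = rng.random((n, n)) < p_obs  # observed cells, each w.p. p_obs
    sigma = sigma_rel                   # noise level (our assumption)
    # b = P_Omega(y_# + sigma*xi): adding noise on the observed entries
    b = y_true[omega] + sigma * rng.standard_normal(int(omega.sum()))
    lam = mu = 10.0 * sigma             # lambda = mu = 10*sigma, as in the text
    return y_true, omega, b, sigma, lam, mu
```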
Numerical results The results of the first series of experiments are presented in Table 1. The comments are as follows.
In the “small” experiment \((n=128\), the largest \(n\) where we were able to solve (56) in a reasonable time by CVX [13] using the state-of-the-art mosek [1] Interior-Point solver and thus knew the “exact” optimal value), CoMP exhibited fast convergence: relative accuracies 1.1e\(-\)3 and 6.2e\(-\)6 are achieved in 64 and 4,096 steps (1.2 and 74.9 s, respectively, as compared to 4,756.7 s taken by CVX).
In larger experiments (\(n=512\) and \(n=1,024\), meaning design dimensions 262,144 and 1,048,576, respectively), the running times look moderate, and the convergence pattern of CoMP still looks promising.Footnote 7 Note that our lower bounding, while working to some extent, is very conservative: it overestimates the “optimality gap” \(\upsilon ^t-\upsilon _t\) by 2–3 orders of magnitude for moderate and large values of \(t\) in the \(128\times 128\) experiment. More accurate performance evaluation would require less conservative lower bounding of the optimal value (as of now, we are not aware of any alternative).
In the second series of experiments, the data of (56) were generated in such a way that the true optimal solution and optimal value of the problem were known from the very beginning. To this end we take as \(\varOmega \) the collection of all cells of an \(n\times n\) matrix, which, via the optimality conditions, allows us to select \(b\) making our “true” matrix \(y_{\#}\) the optimal solution to (56). The results are presented in Table 2.
In the third series of experiments, we compared our algorithm with the basic version of ADMM as presented in [5]; this version is capable of handling straightforwardly the matrix completion problem with noisy observations of part of the entries.Footnote 8 The data in these experiments were generated in the same way as in the aforementioned experiments with known optimal solutions. The results are presented in Table 3. We see that ADMM is significantly faster than our algorithm, suggesting that ADMM, when applicable in its basic form, typically outperforms CoMP. However, this is not the case when ADMM is not directly applicable; we consider one example of this sort in the next section.
It should be mentioned that in these experiments the value of \(\rho _1\) resulting in values of \(\epsilon _1\) negligibly small as compared to \(\epsilon _2\) in (55) was found within the first 10–30 steps of the algorithm, with no restarts afterwards.
Remark
For the sake of simplicity, so far we have been considering problem (56), where the minimization is carried out over \(y^0\) running through the entire space \({\mathbf {R}}^{n\times n}\) of \(n\times n\) matrices. What happens if we restrict \(y^0\) to reside in a given closed convex domain \(Y_0\)?
It is immediately seen that the construction we have presented can be straightforwardly modified for the cases when \(Y_0\) is a Frobenius- or \(\Vert \cdot \Vert _1\)-norm ball centered at the origin, or the intersection of such a set with the space of symmetric \(n\times n\) matrices. We could also handle the case when \(Y_0\) is the nuclear norm ball centered at the origin (or the intersection of this ball with the space of symmetric matrices, or with the cone of positive semidefinite symmetric matrices), but to this end one needs to “swap the penalties,” that is, to write the representation (47c) of problem (56) as
where \(Y_1\supset Y_0\) “fits” \(\Vert \cdot \Vert _1\) (meaning that we can point out a d.g.f. \(\omega _1(\cdot )\) for \(Y_1\) which, taken along with \(\varPsi _1(y^1)={\lambda }\Vert y^1\Vert _1\), results in easy-to-solve auxiliary problems (45)). We can take, e.g., \(\omega _1(y^1)={1\over 2}\Vert y^1\Vert _F^2\) and define \(Y_1\) as the entire space, or as a Frobenius/\(\Vert \cdot \Vert _1\) norm ball centered at the origin and large enough to contain \(Y_0\).
4.4.2 Image decomposition
Problem of interest In the experiments to be reported, we applied the just outlined approach to Example 2, that is, to the problem
where \(A(y):\,{\mathbf {R}}^{n\times n}\rightarrow {\mathbf {R}}^{M}\) is a given linear mapping.
Problem reformulation We first rewrite (58) as a saddle point optimization problem
where \(T:\;{\mathbf {R}}^{n\times n}\rightarrow {\mathbf {R}}^{2n(n-1)}\) is the mapping \(y\mapsto T y=\) \(\left[ \begin{array}{c}\left\{ (\nabla _i y)_{n(j-1)+i}\right\} _{i=1, \ldots ,n-1,\,j=1, \ldots ,n}\\ \left\{ (\nabla _j y)_{n(i-1)+j}\right\} _{i=1, \ldots ,n,\,j=1, \ldots ,n-1}\end{array}\right] \).
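The mapping \(T\) simply stacks the two arrays of forward differences of \(y\); a sketch (the exact ordering of the coordinates in the paper is reproduced only up to a permutation):

```python
import numpy as np

def T(y):
    """Discrete gradient mapping R^{n x n} -> R^{2n(n-1)}: stacked
    forward differences of y along the two coordinate directions."""
    dy_rows = np.diff(y, axis=0)   # (n-1) x n differences down columns
    dy_cols = np.diff(y, axis=1)   # n x (n-1) differences along rows
    return np.concatenate([dy_rows.ravel(order='F'),
                           dy_cols.ravel(order='C')])
```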
Next we rewrite (59) as a linearly constrained saddle-point problem with “simple” penalties:
where
and further approximate the resulting problem with its penalized version:
with
Note that the function \(\psi (y^1,y^2,y^3):=\Vert A(y^1+y^2+y^3)-b\Vert _2=\max _{\Vert z\Vert _2\le 1}\) \( \langle z,\,A(y^1+y^2+y^3)-b\rangle \) is Lipschitz continuous in \(y^3\) with respect to the Euclidean norm on \({\mathbf {R}}^{n\times n}\) with corresponding Lipschitz constant \(G=\Vert A\Vert _{2,2}\), which is the spectral norm (the principal singular value) of \(A\). Further, \(\varPsi (y^0)=\mu _3\Vert y^0\Vert _1\) is Lipschitz-continuous in \(y^0\) with respect to the Euclidean norm on \({\mathbf {R}}^{2n(n-1)}\) with the Lipschitz constant \(H\le \mu _3\sqrt{2n(n-1)}\). With the help of the result of Proposition 4 we conclude that to ensure the “exact penalty” property it suffices to choose \(\rho \ge \Vert A\Vert _{2,2}+\mu _3\sqrt{2n(n-1)}\). Let us denote
We equip the embedding space \(E_u\) of \(U\) with the norm
and \(U\) with the proximal setup \((\Vert \cdot \Vert ,\,\omega (\cdot ))\) with
Implementing the CoMP algorithm When implementing the CoMP algorithm, we use the above proximal setup with adaptive aggregation parameters \(\alpha _0=\cdots =\alpha _4= 1/D^2\), where \(D\) is our guess for the upper bound on \(\Vert y_*\Vert _2\); that is, whenever the norm of the current solution exceeds 20% of the guess value, we increase \(D\) by a factor of 2 and update the scales accordingly. The penalty \(\rho \) and the stepsizes \(\gamma _t\) are adjusted dynamically in the same way as explained for the previous experiment.
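For reference, the “exact penalty” threshold \(\rho \ge \Vert A\Vert _{2,2}+\mu _3\sqrt{2n(n-1)}\) derived above is straightforward to evaluate whenever \(A\) is available as an explicit matrix (a sketch; in the experiments below \(A\) is the identity, so its spectral norm is 1):

```python
import numpy as np

def exact_penalty_rho(A, mu3, n):
    """Smallest rho guaranteed by the bound rho >= ||A||_{2,2}
    + mu3*sqrt(2n(n-1)); assumes A is given as an explicit matrix."""
    spectral = np.linalg.norm(A, 2)  # principal singular value of A
    return spectral + mu3 * np.sqrt(2 * n * (n - 1))
```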
Numerical results In the first series of experiments, we build the \(n\times n\) observation matrix \(b\) by first generating a random matrix with rank \(r=\lfloor \sqrt{n}\rfloor \) and another random matrix with sparsity \(p=0.01\), so that the observation matrix is the sum of these two matrices and of random noise of level \(\sigma =0.01\); we take \(y\mapsto Ay\) to be the identity mapping. We use \(\mu _1=10\sigma ,\ \mu _2=\sigma ,\ \mu _3=\sigma \). The very preliminary results of this series of experiments are presented in Table 4. Note that unlike for the matrix completion problem discussed in Sect. 4.4.1, here we are not able to generate problems with known optimal solutions. Better performance evaluation would require good lower bounding of the true optimal value, which is, however, problematic due to the unbounded problem domain.
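The synthetic data generation for this series can be sketched as follows (an illustrative sketch; the exact scalings of the low-rank and sparse components are not specified in the text, so the unit-variance Gaussian choices below are our assumptions):

```python
import numpy as np

def make_decomposition_data(n, p=0.01, sigma=0.01, seed=0):
    """Observation matrix b for the image decomposition experiments:
    low-rank + sparse + noise, with A the identity mapping."""
    rng = np.random.default_rng(seed)
    r = int(np.sqrt(n))                                   # rank floor(sqrt(n))
    low_rank = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
    sparse = rng.standard_normal((n, n)) * (rng.random((n, n)) < p)
    noise = sigma * rng.standard_normal((n, n))           # noise of level sigma
    b = low_rank + sparse + noise
    return b, low_rank, sparse
```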
In the second series of experiments, we implement the CoMP algorithm to decompose real images and extract the underlying low rank/sparse singular distortion/smooth background components. The purpose of these experiments is to illustrate how the algorithm performs with the choice of small regularization parameters, which is meaningful from the point of view of applications to image recovery. Image decomposition results for two images are provided in Figs. 1 and 2. In Fig. 1, we present the decomposition of the observed image of size \(256\times 256\). We apply the model (59) with regularization parameters \(\mu _1=0.03,\mu _2=0.001,\mu _3=0.005\). We run 2,000 iterations of CoMP (393.5 s total, in MATLAB on an Intel i5-2400S@2.5GHz CPU). The first component \(y_1\) has rank \(\approx 1\); the relative reconstruction error is \(\Vert y_1+y_2+y_3-b\Vert _2/\Vert b\Vert _2\approx 2.8\times 10^{-4}\). Figure 2 shows the decomposition of the observed image of size \(480\times 640\) after 1,000 iterations of CoMP (873.6 s total). The regularization parameters of the problem (58) were set to \(\mu _1=0.06,\mu _2=0.002,\mu _3=0.005\). The relative reconstruction error is \(\Vert y_1+y_2+y_3-b\Vert _2/\Vert b\Vert _2\approx 8.4\times 10^{-3}\).
In the third series of experiments, we compare the CoMP algorithm with some other first-order methods. To the best of our knowledge, only a quite limited set of known methods is readily applicable to problems of the form (58), where the “observation-fitting” component of the objective is nonsmooth and the penalty terms involve different components of the observed image. As a result, we compared CoMP to just two alternatives. The first, referred to below as smoothing-APG, applies Nesterov’s smoothing technique to both the first \(\Vert \cdot \Vert _2\) term and the total variation term in the objective of (58) and then uses the Accelerated Proximal Gradient method (see [20, 21] for details) to solve the resulting problem, which takes the form
with
where \(\rho _1>0,\rho _2>0\). In the experiment, we specified the smoothing parameters as \(\rho _1=\epsilon , \rho _2=\frac{\epsilon }{2(n-1)n},\epsilon =10^{-3}\).
The second alternative, referred to as smoothing-ADMM, applies the smoothing technique to the first term in the objective of (58) and uses the ADMM algorithm to solve the resulting problem
the associated augmented Lagrangian being
where \(\nu >0\) is a parameter. The basic version of ADMM would require performing alternating \(x=(y^1,y^2,y^3)\)-updates and \(z\)-updates. Since minimizing \(L_{\nu }\) in \(x\) in closed analytic form is impossible, we are forced to perform the \(x\)-update iteratively and hence inexactly. In our experiment, we used for this purpose the Accelerated Proximal Gradient method, with three implementations differing in the allowed number of inner iterations (5, 20, and 50, respectively).
In the experiment, we generated synthetic data in the same fashion as in the first series of experiments and compared the performances of the three algorithms (CoMP and the two just described alternatives) by computing the accuracies, in terms of the objective, achieved within a prescribed time budget. The results are presented in Fig. 3. One can see that the performance of ADMM depends heavily on the allowed number of inner iterations and is no better than the performance of the Accelerated Proximal Gradient algorithm as applied to the smooth approximation of the problem of interest. Our algorithm, although not consistently outperforming the smoothing-APG approach, can still be very competitive, especially when only low accuracy is required.
5 Semi-separable convex problems
5.1 Preliminaries
Our problem of interest in this section is problem (4), (6), namely,
where \(\pi (\cdot )\) is some norm and \(\pi ^*(\cdot )\) is the conjugate norm. A straightforward approach to (63) would be to rewrite it as a saddle point problem
and solve it by the Mirror Prox algorithm from Sect. 3.2, adjusted to work with an unbounded domain \(U\); alternatively, we could replace \(\max _w\) with \(\max _{w:\;\pi (w)\le R}\) for a “large enough” \(R\) and use the above algorithm “as is.” The potential problem with this approach is that if the \(w\)-component \(w^*\) of the saddle point of (64) has large \(\pi \)-norm (or the “large enough” \(R\) is indeed large), the (theoretical) efficiency estimate would be bad, since it is proportional to the magnitude of \(w^*\) (resp., to \(R\)). To circumvent this difficulty, we apply to (63) the sophisticated policy originating from [15]. This policy requires the set \(Y=Y_1\times \cdots \times Y_K\) to be bounded, which we assume below.
Course of actions Note that our problem of interest is of the generic form
where \(Y\) is a convex compact set in a Euclidean space \(E\), \(f\) and \(g:\;Y\rightarrow {\mathbf {R}}\) are convex and Lipschitz continuous functions. For the time being, we focus on (65) and assume that the problem is feasible and thus solvable.
We intend to solve (65) by the generic algorithm presented in [15]; for our present purposes, the following description of the algorithm will do:
1. The algorithm works in stages. Stage \(s=1,2,...\) is associated with working parameter \(\alpha _s\in (0,1)\). We set \(\alpha _1=\frac{1}{2}\).
2. At stage \(s\), we apply a first order method \(\mathcal{{B}}\) to the problem
$$\begin{aligned} {(P_s)} \quad {\hbox {Opt}}_s=\min _{y\in Y} \left\{ f_s(y)=\alpha _s f(y)+(1-\alpha _s) g(y)\right\} \end{aligned}$$(66)
The only property of the algorithm \(\mathcal{{B}}\) that matters here is its ability, when run on \((P_s)\), to produce in the course of \(t=1,2, \ldots \) steps iterates \(y_{s,t}\), upper bounds \(\overline{f}_s^t\) on \({\hbox {Opt}}_s\), and lower bounds \(\underline{f}_{s,t}\) on \({\hbox {Opt}}_s\) in such a way that
(a) for every \(t=1,2, \ldots \), the \(t\)-th iterate \(y_{s,t}\) of \(\mathcal{{B}}\) as applied to \((P_s)\) belongs to \(Y\);
(b) the upper bounds \(\overline{f}_s^t\) are nonincreasing in \(t\) (this is “for free”) and “are achievable,” that is, they are of the form
$$\begin{aligned} \overline{f}_s^t=f_s(y^{s,t}), \end{aligned}$$
where \(y^{s,t}\in Y\) is a vector which we have at our disposal at step \(t\) of stage \(s\);
(c) the lower bounds \(\underline{f}_{s,t}\) should be nondecreasing in \(t\) (this again is “for free”);
(d) for some nonincreasing sequence \(\epsilon _t\rightarrow +0\), \(t\rightarrow \infty \), we should have
$$\begin{aligned} \overline{f}_s^t-\underline{f}_{s,t}\le \epsilon _t \end{aligned}$$
for all \(t\) and \(s\).
Note that since (65) is solvable, we clearly have \({\hbox {Opt}}_s\le \alpha _s{\hbox {Opt}}\), implying that the quantity \(\underline{f}_{s,t}/\alpha _s\) is a lower bound on \({\hbox {Opt}}\). Thus, at step \(t\) of stage \(s\) we have at our disposal a number of valid lower bounds on \({\hbox {Opt}}\); we denote the best (the largest) of these bounds by \(\underline{{\hbox {Opt}}}_{s,t}\), so that
for all \(s,t\), and \(\underline{{\hbox {Opt}}}_{s,t}\) is nondecreasing in time.Footnote 9
3. When the First Order oracle is invoked at step \(t\) of stage \(s\), we get at our disposal a triple \((y_{s,t}\in Y,f(y_{s,t}),g(y_{s,t}))\). We assume that all these triples are somehow memorized. Thus, after calling the First Order oracle at step \(t\) of stage \(s\), we have at our disposal a finite set \(Q_{s,t}\) on the 2D plane such that for every point \((p,q)\in Q_{s,t}\) we have at our disposal a vector \(y_{pq}\in Y\) such that \(f(y_{pq})\le p\) and \(g(y_{pq})\le q\); the set \(Q_{s,t}\) (in today's terminology, a filter) is comprised of all pairs \((f(y_{s',t'}),g(y_{s',t'}))\) generated so far. We set
$$\begin{aligned} h_{s,t}(\alpha )&= \min _{(p,q)\in Q_{s,t}} \left[ \alpha (p-\underline{{\hbox {Opt}}}_{s,t}) + (1-\alpha ) q\right] : [0,1]\rightarrow {\mathbf {R}},\nonumber \\ {\hbox {Gap}}(s,t)&= \max \limits _{0\le \alpha \le 1} h_{s,t}(\alpha ). \end{aligned}$$(68)
4. Let \(\varDelta _{s,t}=\{\alpha \in [0,1]: h_{s,t}(\alpha )\ge 0\}\), so that \(\varDelta _{s,t}\) is a segment in \([0,1]\). Unless we have arrived at \({\hbox {Gap}}(s,t)=0\) (i.e., got an optimal solution to (65), see (69)), \(\varDelta _{s,t}\) is not a singleton (since otherwise \({\hbox {Gap}}(s,t)\) would be 0). Observe also that the \(\varDelta _{s,t}\) are nested: \(\varDelta _{s',t'}\subset \varDelta _{s,t}\) whenever \(s'\ge s\), same as whenever \(s'=s\) and \(t'\ge t\).
We continue iterations of stage \(s\) while \(\alpha _s\) is “well-centered” in \(\varDelta _{s,t}\), e.g., belongs to the mid-third of the segment. When this condition is violated, we start stage \(s+1\), specifying \(\alpha _{s+1}\) as the midpoint of \(\varDelta _{s,t}\).
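Since \(h_{s,t}(\cdot )\) is the minimum of finitely many affine functions of \(\alpha \), computing \({\hbox {Gap}}(s,t)\) in (68) is elementary: a concave piecewise linear function attains its maximum over \([0,1]\) either at an endpoint or at an intersection of two of the lines, so checking these candidates suffices. A sketch (`Q` is the filter of pairs \((p,q)\), `opt_lb` the current lower bound \(\underline{{\hbox {Opt}}}_{s,t}\); names are ours):

```python
def gap(Q, opt_lb):
    """Gap(s,t) = max over alpha in [0,1] of
    h(alpha) = min over (p,q) in Q of alpha*(p - opt_lb) + (1-alpha)*q."""
    # each pair (p,q) gives the line alpha -> slope*alpha + intercept
    lines = [(p - opt_lb - q, q) for (p, q) in Q]
    cands = [0.0, 1.0]                       # endpoints of [0,1]
    for i, (s1, c1) in enumerate(lines):
        for s2, c2 in lines[i + 1:]:
            if s1 != s2:
                a = (c2 - c1) / (s1 - s2)    # intersection of the two lines
                if 0.0 < a < 1.0:
                    cands.append(a)

    def h(alpha):
        return min(s * alpha + c for s, c in lines)

    return max(h(a) for a in cands)
```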
The properties of the aforementioned routine are summarized in the following statement (cf. [15]).
Proposition 5
(i) \({\hbox {Gap}}(s,t)\) is nonincreasing in time. Furthermore, at step \(t\) of stage \(s\), we have at our disposal a solution \(\widehat{y}^{s,t}\in Y\) to (65) such that
so that \(\widehat{y}^{s,t}\) belongs to the domain \(Y\) of problem (65) and is both \({\hbox {Gap}}(s,t)\)-feasible and \({\hbox {Gap}}(s,t)\)-optimal.
(ii) For every \(\epsilon >0\), the number \(s(\epsilon )\) of stages until a pair \((s,t)\) with \({\hbox {Gap}}(s,t)\le \epsilon \) is found obeys the bound
where \(L<\infty \) is an a priori upper bound on \(\max _{y\in Y}\max [|f(y)|,|g(y)|]\). Besides this, the number of steps at each stage does not exceed
5.2 Composite Mirror Prox algorithm for semi-separable optimization
We are about to apply the approach above to the semi-separable problem (63), (6). The problem setup we consider now is as follows (cf. Sect. 4.1). For every \(k\), \(1\le k\le K\), we are given
1. Euclidean spaces \(E_k\) and \({\overline{E}}_k\) along with their nonempty closed and bounded convex subsets \(Y_k\) and \(Z_k\), respectively;
2. proximal setups for \((E_k,Y_k)\) and \(({\overline{E}}_k, Z_k)\), that is, norms \(p_k(\cdot )\) on \(E_k\), norms \(q_k\) on \({\overline{E}}_k\), and d.g.f.’s \(\omega _k(\cdot ):Y_k\rightarrow {\mathbf {R}}\), \({\overline{\omega }}_k(\cdot ):Z_k\rightarrow {\mathbf {R}}\), which are compatible with \(p_k(\cdot )\) and \(q_k(\cdot )\), respectively;
3. a linear mapping \(y^k\mapsto A_k y^k:E_k\rightarrow E\), where \(E\) is a Euclidean space;
4. Lipschitz continuous convex functions \(\psi _k(y^k):Y_k\rightarrow {\mathbf {R}}\) along with their saddle point representations
$$\begin{aligned} \psi _k(y^k)=\sup _{z^k\in Z_k}{[}\phi _k(y^k,z^k)-{\overline{\varPsi }}_k(z^k){]},\;\;1\le k\le K, \end{aligned}$$(72)
where \(\phi _k(y^k,z^k):Y_k\times Z_k\rightarrow {\mathbf {R}}\) are smooth (with Lipschitz continuous gradients) functions convex in \(y^k\in Y_k\) and concave in \(z^k\in Z_k\), and \({\overline{\varPsi }}_k(z^k):Z_k\rightarrow {\mathbf {R}}\) are Lipschitz continuous convex functions such that the problems of the form
$$\begin{aligned} \min \limits _{z^k\in Z_k} \left[ {\overline{\omega }}_k(z^k)+\langle \xi ^k,z^k\rangle +\alpha {\overline{\varPsi }}_k(z^k)\right] \quad [\alpha >0] \end{aligned}$$(73)
are easy to solve;
5. Lipschitz continuous convex functions \(\varPsi _k(y^k):Y_k\rightarrow {\mathbf {R}}\) such that the problems of the form
$$\begin{aligned} \min \limits _{y^k\in Y_k}\left[ \omega _k(y^k)+\langle \xi ^k,y^k\rangle +\alpha \varPsi _k(y^k)\right] \quad [\alpha >0] \end{aligned}$$
are easy to solve;
6. a norm \(\pi ^*(\cdot )\) on \(E\), with conjugate norm \(\pi (\cdot )\), along with a d.g.f. \(\widehat{\omega }(\cdot ):W:=\{w\in E:\pi (w)\le 1\}\rightarrow {\mathbf {R}}\) compatible with \(\pi (\cdot )\) and such that problems of the form
$$\begin{aligned} \min _{w\in W}\left[ \widehat{\omega }(w)+\langle \xi ,w\rangle \right] \end{aligned}$$
are easy to solve.
The outlined data define the sets
The problem of interest here is problem (63), (72):
Solving (74) using the approach in the previous section amounts to resolving a sequence of problems \((P_s)\) as in (66) where, with a slight abuse of notation,
Here \(C_k\ge \max _{y^k\in Y_k}\varPsi _k(y^k)\) are finite constants introduced to make \(Y\) compact, as required in the premise of Proposition 5; it is immediately seen that the magnitudes of these constants (same as their very presence) do not affect the algorithm \(\mathcal{{B}}\) we are about to describe.
The algorithm \(\mathcal{{B}}\) we intend to use will solve \((P_s)\) by reducing the problem to the saddle point problem
where \(\alpha =\alpha _s\).
Setting
\(X\) can be thought of as the domain of the variational inequality associated with (75), the monotone operator in question being
For exactly the same reasons as in Sect. 4, with a properly assembled norm on the embedding space of \(U\) and d.g.f., (75) can be solved by the CoMP algorithm from Sect. 3.2. Let us denote
the approximate solution obtained in the course of \(t=1,2, \ldots \) steps of CoMP when solving \((P_s)\), and let
be the corresponding value of the objective of \((P_s)\). It holds
where \({\mathcal{{L}}}<\infty \) is explicitly given by the proximal setup we use and by the related Lipschitz constant of \(F_u(\cdot )\) (note that this constant can be chosen to be independent of \(\alpha \in [0,1]\)). We assume that computing the corresponding objective value is a part of step \(t\) (these computations increase the complexity of a step by factor at most \(O(1)\)), and thus that \(\overline{f}_s^t\le \widehat{f}_s^t\). By (76), the quantity \(\widehat{f}_s^t-\epsilon _t\) is a valid lower bound on the optimal value of \((P_s)\), and thus we can ensure that \(\underline{f}_{s,t}\ge \widehat{f}_s^t-\epsilon _t\). The bottom line is that with the outlined implementation, we have
for all \(s,t\), with \(\epsilon _t\) given by (76). Consequently, by Proposition 5, the total number of CoMP steps needed to find an \(\epsilon \)-feasible and \(\epsilon \)-optimal solution to the problem of interest (63), belonging to the domain of this problem, can be upper-bounded by
where \(L\) and \({\mathcal{{L}}}\) are readily given by the smoothness parameters of \(\phi _k\) and by the proximal setup we use.
5.3 Numerical illustration: \(\ell _1\)-minimization
Problem of interest We consider the simple \(\ell _1\) minimization problem
where \(x\in {\mathbf {R}}^n\), \(A\in {\mathbf {R}}^{m\times n}\) and \(m<n\). Note that this problem can also be written in the semi-separable form
if the data is partitioned into \(K\) blocks: \(x=[x_1;x_2;\ldots ;x_K]\) and \(A=[A_1,A_2,\) \(\ldots ,A_K]\).
Our main purpose here is to test the approach described in Sect. 5.1 and compare it to the simplest approach, where we directly apply CoMP to the (saddle point reformulation of the) problem \(\min _{x\in X}[\Vert x\Vert _1+R\Vert Ax-b\Vert _2]\) with a large enough value of \(R\). For the sake of simplicity, we work with the case when \(K=1\) and \(X=\{x\in {\mathbf {R}}^n:\Vert x\Vert _2\le 1\}\).
Generating the data In the experiments to be reported, the data of (77) were generated as follows. Given \(m,n\), we first build a sparse solution \(x^*\) by drawing a random vector from the standard Gaussian distribution \(\mathcal{N}(0,I_n)\), zeroing out part of the entries, and scaling the resulting vector to enforce \(x^*\in X\). We also build a dual solution \(\lambda ^*\) by scaling a random vector from the distribution \(\mathcal{N}(0,I_m)\) to satisfy \(\Vert \lambda ^*\Vert _2=R_*\) for a prescribed \(R_*\). Next we generate \(A\) and \(b\) such that \(x^*\) and \(\lambda ^*\) are indeed the optimal primal and dual solutions to the \(\ell _1\) minimization problem (77), i.e. \(A^T\lambda ^*\in \partial \big |_{x=x^*}\Vert x\Vert _1\) and \(Ax^*=b\). To achieve this, we set
where \( p=\frac{\lambda ^*}{\Vert \lambda ^*\Vert _2^2}, q\in \partial \big |_{x=x^*} \Vert x\Vert _1-\frac{1}{\sqrt{n}}\widehat{F}_n \lambda ^*\), and \(\widehat{F}_n\) is an \(m\times n\) submatrix randomly selected from the DFT matrix \(F_n\). We expect that the larger the \(\Vert \cdot \Vert _2\)-norm \(R_*\) of the dual solution, the harder problem (77) is.
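The primal/dual part of this construction can be sketched as follows (a minimal illustration; the function name is ours, and the subsequent assembly of \(A\) and \(b\) from \(p\), \(q\) and the partial DFT matrix is not reproduced):

```python
import math
import random

def generate_primal_dual(n, m, sparsity, R_star, seed=0):
    """Draw a sparse primal x* in the unit Euclidean ball X and a dual
    lambda* with ||lambda*||_2 = R_star, as in the text (illustrative)."""
    rnd = random.Random(seed)
    x = [rnd.gauss(0.0, 1.0) for _ in range(n)]
    # zero out part of the entries to make x* sparse
    for i in rnd.sample(range(n), int(n * (1 - sparsity))):
        x[i] = 0.0
    nx = math.sqrt(sum(t * t for t in x))
    x_star = [t / max(1.0, nx) for t in x]             # enforce ||x*||_2 <= 1
    lam = [rnd.gauss(0.0, 1.0) for _ in range(m)]
    nl = math.sqrt(sum(t * t for t in lam))
    lam_star = [t * (R_star / nl) for t in lam]        # enforce ||lambda*||_2 = R_star
    return x_star, lam_star
```

The pair \((x^*,\lambda ^*)\) produced this way is then fed into the construction of \(A\) and \(b\) above so that the optimality conditions \(A^T\lambda ^*\in \partial \Vert x\Vert _1\big |_{x=x^*}\) and \(Ax^*=b\) hold.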
Implementing the algorithm When implementing the algorithm from Sect. 5.2, we apply at each stage \(s=1,2, \ldots \) CoMP to the saddle point problem
The proximal setup for CoMP is given by equipping the embedding space of \(U=\{u=[x;w]:x\in X,\Vert w\Vert _2\le 1\}\) with the norm \(\Vert u\Vert _2=\sqrt{\frac{1}{2}\Vert x\Vert _2^2+\frac{1}{2}\Vert w\Vert _2^2}\) and equipping \(U\) with the d.g.f. \(\omega (u)=\frac{1}{2}\Vert x\Vert _2^2+\frac{1}{2}\Vert w\Vert _2^2\). In the sequel we refer to the resulting algorithm as sequential CoMP. For comparison, we solve the same problem by applying CoMP to the saddle point problem
with \(R=R_*\); the resulting algorithm is referred to as simple CoMP. Both the sequential CoMP and simple CoMP algorithms are terminated when the relative nonoptimality and the constraint violation are both less than \(\epsilon =10^{-5}\).
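A stopping test of this kind can be sketched as below; the precise nonoptimality and violation measures used in the experiments are not reproduced here, so the formulas inside are illustrative assumptions.

```python
def should_terminate(f_best, f_lb, violation, scale, eps=1e-5):
    """Illustrative stopping test: relative nonoptimality and (relative)
    constraint violation both below eps.  The exact measures used in the
    paper's experiments are assumptions here."""
    rel_gap = (f_best - f_lb) / max(abs(f_lb), 1.0)   # relative nonoptimality
    rel_violation = violation / max(scale, 1.0)       # relative constraint violation
    return rel_gap <= eps and rel_violation <= eps
```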
Numerical results are presented in Table 5. One can immediately see that to achieve the desired accuracy, the simple CoMP with \(R\) set to \(R_*\), i.e., to the exact magnitude of the true Lagrange multiplier, requires almost twice as many steps as the sequential CoMP. In more realistic examples, the simple CoMP will additionally suffer from the fact that the magnitude of the optimal Lagrange multiplier is not known in advance, and the penalty \(R\) in \((P_R)\) should be somehow tuned “online.”
Notes
The precise meaning of simplicity and fitting will be specified later. As of now, it suffices to give a couple of examples. When \(\varPsi _k\) is the \(\ell _1\) norm, \(Y_k\) can be the entire space or an \(\ell _p\)-ball centered at the origin, \(1\le p\le 2\); when \(\varPsi _k\) is the nuclear norm, \(Y_k\) can be the entire space or a Frobenius/nuclear norm ball centered at the origin.
Our exposition follows.
With our implementation, we run this test for both search points and approximate solutions generated by the algorithm.
Note that the latter relation implies that what was denoted by \(\widetilde{\varPhi }\) in Proposition 2 is nothing but \(\overline{\varPhi }\).
If the goal of solving (56) were to recover \(y_{\#}\), our \({\lambda }\) and \(\mu \) would, perhaps, be too large. Our goal, however, was solving (56) as an “optimization beast,” and we were interested in “meaningful” contribution of \(\varPsi _0\) and \(\varPsi _1\) to the objective of the problem, and thus in not too small \({\lambda }\) and \(\mu \).
Recall that we do not expect linear convergence, just \(O(1/t)\) one.
Note that in a more complicated matrix recovery problem, where noisy linear combinations of the matrix entries, rather than just some of these entries, are observed, applying ADMM becomes somewhat problematic, while the proposed algorithm still is applicable “as is.”
In what follows, we call a collection \(a_{s,t}\) of reals nonincreasing in time, if \(a_{s',t'}\le a_{s,t}\) whenever \(s'\ge s\), same as whenever \(s=s'\) and \(t'\ge t\). “Nondecreasing in time” is defined similarly.
We assume w.l.o.g. that \(|\underline{{\hbox {Opt}}}_{s,t}|\le L\).
References
Andersen, E. D., Andersen, K. D.: The MOSEK optimization tools manual. http://www.mosek.com/fileadmin/products/6_0/tools/doc/pdf/tools.pdf
Aujol, J.-F., Chambolle, A.: Dual norms and image decomposition models. Int. J. Comput. Vis. 63(1), 85–104 (2005)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Becker, S., Bobin, J., Candès, E.J.: NESTA: a fast and accurate first-order method for sparse recovery. SIAM J. Imaging Sci. 4(1), 1–39 (2011)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
Buades, A., Coll, B., Morel, J.-M.: A review of image denoising algorithms, with a new one. Multiscale Model. Simul. 4(2), 490–530 (2005)
Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM 58(3), 11 (2011)
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)
Chen, G., Teboulle, M.: Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM J. Optim. 3(3), 538–543 (1993)
Deng, W., Lai, M.-J., Peng, Z., Yin, W.: Parallel multi-block ADMM with O(1/k) convergence. http://www.optimization-online.org/DB_HTML/2014/03/4282.html (2013)
Goldfarb, D., Ma, S.: Fast multiple-splitting algorithms for convex optimization. SIAM J. Optim. 22(2), 533–556 (2012)
Goldfarb, D., Ma, S., Scheinberg, K.: Fast alternating linearization methods for minimizing the sum of two convex functions. Math. Program. 141(1–2), 349–382 (2013)
Grant, M., Boyd, S.: CVX: Matlab software for disciplined convex programming, version 2.0 beta. http://cvxr.com/cvx (2013)
Juditsky, A., Nemirovski, A.: First-order methods for nonsmooth large-scale convex minimization: I. General purpose methods; II. Utilizing problem's structure. In: Sra, S., Nowozin, S., Wright, S. (eds.) Optimization for Machine Learning, pp. 121–183. The MIT Press (2011)
Lemaréchal, C., Nemirovskii, A., Nesterov, Y.: New variants of bundle methods. Math. Program. 69(1–3), 111–147 (1995)
Monteiro, R.D., Svaiter, B.F.: Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM J. Optim. 23(1), 475–507 (2013)
Nemirovski, A.: Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex–concave saddle point problems. SIAM J. Optim. 15(1), 229–251 (2004)
Nemirovski, A., Onn, S., Rothblum, U.G.: Accuracy certificates for computational problems with convex structure. Math. Oper. Res. 35(1), 52–78 (2010)
Nemirovski, A., Rubinstein, R.: An efficient stochastic approximation algorithm for stochastic saddle point problems. In: Dror, M., L’Ecuyer, P., Szidarovszky, F. (eds.) Modeling Uncertainty and Examination of Stochastic Theory, Methods, and Applications, pp. 155–184. Kluwer Academic Publishers, Boston (2002)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
Orabona, F., Argyriou, A., Srebro, N.: PRISMA: proximal iterative smoothing algorithm. arXiv preprint arXiv:1206.2372 (2012)
Ouyang, Y., Chen, Y., Lan, G., Pasiliao, E. Jr.: An accelerated linearized alternating direction method of multipliers. arXiv preprint arXiv:1401.6607 (2014)
Qin, Z., Goldfarb, D.: Structured sparsity via alternating direction methods. J. Mach. Learn. Res. 13, 1373–1406 (2012)
Scheinberg, K., Goldfarb, D., Bai, X.: Fast first-order methods for composite convex optimization with backtracking. http://www.optimization-online.org/DB_FILE/2011/04/3004.pdf (2011)
Tseng, P.: Alternating projection-proximal methods for convex programming and variational inequalities. SIAM J. Optim. 7(4), 951–965 (1997)
Tseng, P.: On accelerated proximal gradient methods for convex–concave optimization. SIAM J. Optim. (2008, submitted)
Wen, Z., Goldfarb, D., Yin, W.: Alternating direction augmented Lagrangian methods for semidefinite programming. Math. Program. Comput. 2(3–4), 203–230 (2010)
Acknowledgments
Research of the first and the third authors was supported by the NSF Grant CMMI-1232623. Research of the second author was supported by the CNRS-Mastodons Project GARGANTUA, and the LabEx PERSYVAL-Lab (ANR-11-LABX-0025).
Appendices
Appendix 1: Proof of Theorem 1
\(0^{\circ }\). Let us verify that the prox-mapping (28) indeed is well defined whenever \(\zeta =\gamma F_v\) with \(\gamma >0\). All we need is to show that whenever \(u\in U\), \(\eta \in E_u\), \(\gamma >0\), and \([w_t;s_t]\in X\), \(t=1,2, \ldots \), are such that \(\Vert w_t\Vert _2+\Vert s_t\Vert _2\rightarrow \infty \) as \(t\rightarrow \infty \), we have
Indeed, assuming the opposite and passing to a subsequence, we make the sequence \(r_t\) bounded. Since \(\omega (\cdot )\) is strongly convex, modulus 1, w.r.t. \(\Vert \cdot \Vert \), and the linear function \(\langle F_v,s\rangle \) of \([w;s]\) is below bounded on \(X\) by A4, boundedness of the sequence \(\{r_t\}\) implies boundedness of the sequence \(\{w_t\}\), and since \(\Vert [w_t;s_t]\Vert _2\rightarrow \infty \) as \(t\rightarrow \infty \), we get \(\Vert s_t\Vert _2\rightarrow \infty \) as \(t\rightarrow \infty \). Since \(\langle F_v,s\rangle \) is coercive in \(s\) on \(X\) by A4, and \(\gamma >0\), we conclude that \(b_t\rightarrow \infty \), \(t\rightarrow \infty \), while the sequence \(\{a_t\}\) is bounded since the sequence \(\{w_t\in U\}\) is so and \(\omega \) is continuously differentiable. Thus, \(\{a_t\}\) is bounded and \(b_t\rightarrow \infty \), \(t\rightarrow \infty \), implying that \(r_t\rightarrow \infty \), \(t\rightarrow \infty \), which is the desired contradiction.
\(1^{\circ }\). Recall the well-known identity [9]: for all \(u,u',w\in U\) one has
Indeed, the right hand side is
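For reference, with \(V_u(w)=\omega (w)-\omega (u)-\langle \omega '(u),w-u\rangle \), the identity in question (presumably the standard three-point identity of [9], i.e. (78)) can be written as:

```latex
% Three-point identity for the Bregman distance V_u(.)
\langle \omega'(u') - \omega'(u),\, w - u' \rangle
  \;=\; V_u(w) \;-\; V_{u'}(w) \;-\; V_u(u'),
\qquad \forall\, u, u', w \in U .
```

Expanding the three Bregman terms on the right-hand side, all values of \(\omega \) cancel and only \(\langle \omega '(u')-\omega '(u), w-u'\rangle \) remains, which matches the left-hand side.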
For \(x=[u;v]\in X,\;\xi =[\eta ;\zeta ]\), let \(P_x(\xi )=[u';v']\in X\). By the optimality condition for the problem (28), for all \([s;w]\in X\),
which by (78) implies that
\(2^{\circ }\). When applying (79) with \([u;v]=[u_\tau ;v_\tau ]=x_\tau \), \(\xi =\gamma _\tau F(x_\tau )=[\gamma _\tau F_u(u_\tau );\gamma _\tau F_v]\), \([u';v']=[u'_\tau ;v'_\tau ]=y_\tau \), and \([s;w]=[u_{\tau +1};v_{\tau +1}]=x_{\tau +1}\) we obtain:
and applying (79) with \([u;v]=x_\tau \), \(\xi =\gamma _\tau F(y_\tau )\), \([u';v']=x_{\tau +1}\), and \([s;w]=z\in X\) we get:
Adding (81) to (80) we obtain for every \(z=[s;w]\in X\)
Due to the strong convexity, modulus 1, of \(V_u(\cdot )\) w.r.t. \(\Vert \cdot \Vert \), \(V_u(u')\ge {1\over 2}\Vert u-u'\Vert ^2\) for all \(u,u'\). Therefore,
where the last inequality is due to (23). Note that \(\gamma _\tau L<1\) implies that
Let us assume that the stepsizes \(\gamma _\tau >0\) ensure that (30) holds, meaning that \( \delta _\tau \le \gamma _\tau ^2M^2\) (which, by the above analysis, is definitely the case when \(0<\gamma _\tau \le {1\over \sqrt{2}L}\); when \(M=0\), we can also take \(\gamma _\tau \le {1\over L}\)). Summing up inequalities (82) over \(\tau =1,2, \ldots ,t\) and taking into account that \(V_{u_{t+1}}(s)\ge 0\), we conclude that for all \(z=[s;w]\in X\),
\(\square \)
Appendix 2: Proof of Lemma 1
Proof
All we need to verify is the second inequality in (38). To this end note that when \(t=1\), the inequality in (38) holds true by definition of \(\widehat{\varTheta }(\cdot )\). Now let \(1<t\le N+1\). Summing up the inequalities (82) over \(\tau =1, \ldots ,t-1\), we get for every \(x=[u;v]\in X\):
(we have used (30)). When \([u;v]\) is \(z_*\), the left hand side in the resulting inequality is \(\ge 0\), and we arrive at
hence
hence also
and therefore
and (38) follows. \(\square \)
Appendix 3: Proof of Proposition 3
Proof
From (82) and (30) it follows that
Summing up these inequalities over \(\tau =1, \ldots ,N\), we get \(\forall (x=[u;v]\in X)\):
where the concluding inequality is due to (38), and (40) follows.\(\square \)
Appendix 4: Proof of Proposition 5
\(1^{\circ }\). \(h_{s,t}(\alpha )\) are concave piecewise linear functions on \([0,1]\) which clearly are pointwise nonincreasing in time. As a result, \({\hbox {Gap}}(s,t)\) is nonincreasing in time. Further, we have
where the \(\lambda ^*_{pq}\ge 0\) sum up to 1. Recalling that for every \((p,q)\in Q_{s,t}\) we have at our disposal \(y_{pq}\in Y\) such that \(p\ge f(y_{pq})\) and \(q\ge g(y_{pq})\), setting \(\widehat{y}^{s,t}=\sum \limits _{(p,q)\in Q_{s,t}}\lambda ^*_{pq}y_{pq}\) and invoking convexity of \(f,g\), we get
and (69) follows, due to \(\underline{{\hbox {Opt}}}_{s,t}\le {\hbox {Opt}}\).
\(2^{\circ }\). We have \(\overline{f}_s^t=\alpha _sf(y^{s,t}) +(1-\alpha _s)g(y^{s,t})\) for some \(y^{s,t}\in Y\) which we have at our disposal at step \(t\), implying that \((\widehat{p}=f(y^{s,t}),\widehat{q}=g(y^{s,t}))\in Q_{s,t}\). Hence by definition of \(h_{s,t}(\cdot )\) it holds
where the concluding inequality is given by (67). Thus, \(h_{s,t}(\alpha _s)\le \overline{f}_s^t-\underline{f}_{s,t} \le \epsilon _t\). On the other hand, if stage \(s\) does not terminate in the course of the first \(t\) steps, \(\alpha _s\) is well-centered in the segment \(\varDelta _{s,t}\) where the concave function \(h_{s,t}(\alpha )\) is nonnegative. We conclude that \(0\le {\hbox {Gap}}(s,t)=\max _{0\le \alpha \le 1}h_{s,t}(\alpha )=\max _{\alpha \in \varDelta _{s,t}}h_{s,t}(\alpha )\le 3h_{s,t}(\alpha _s)\). Thus, if a stage \(s\) does not terminate in the course of the first \(t\) steps, we have \({\hbox {Gap}}(s,t)\le 3\epsilon _t\), which implies (71). Further, \(\alpha _s\) is the midpoint of the segment \(\varDelta ^{s-1}=\varDelta _{s-1,t_{s-1}}\), where \(t_r\) is the last step of stage \(r\) (when \(s=1\), we define \(\varDelta ^0\) as \([0,1]\)), and \(\alpha _s\) is not well-centered in the segment \(\varDelta ^s=\varDelta _{s,t_s}\subset \varDelta _{s-1,t_{s-1}}\), which clearly implies that \(|\varDelta ^s|\le \tfrac{3}{4}|\varDelta ^{s-1}|\). Thus, \(|\varDelta ^s|\le \left( \tfrac{3}{4}\right) ^s\) for all \(s\). On the other hand, when \(|\varDelta _{s,t}|<1\), we have \({\hbox {Gap}}(s,t)=\max _{\alpha \in \varDelta _{s,t}}h_{s,t}(\alpha )\le 3L |\varDelta _{s,t}|\) (since \(h_{s,t}(\cdot )\) is Lipschitz continuous with constant \(3L\) and \(h_{s,t}(\cdot )\) vanishes at (at least) one endpoint of \(\varDelta _{s,t}\)). Thus, the number of stages before \({\hbox {Gap}}(s,t)\le \epsilon \) is reached indeed obeys the bound (70).\(\square \)
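For intuition, the stage bound can be traced from the two relations just established (assuming, as the proof suggests, that \(\varDelta _{s,t}\subseteq \varDelta ^{s-1}\) within stage \(s\)):

```latex
|\varDelta^{s}| \le \Big(\tfrac{3}{4}\Big)^{s},
\qquad
\mathrm{Gap}(s,t) \le 3L\,|\varDelta_{s,t}|
  \le 3L\,|\varDelta^{\,s-1}|
  \le 3L \Big(\tfrac{3}{4}\Big)^{s-1},
```

so that \(\mathrm{Gap}(s,t)\le \epsilon \) as soon as \(s\ge 1+\ln (3L/\epsilon )/\ln (4/3)\), i.e. after \(O(1)\ln (3L/\epsilon )\) stages.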
He, N., Juditsky, A. & Nemirovski, A. Mirror Prox algorithm for multi-term composite minimization and semi-separable problems. Comput Optim Appl 61, 275–319 (2015). https://doi.org/10.1007/s10589-014-9723-3
Keywords
- Numerical algorithms for variational problems
- Composite optimization
- Minimization problems with multi-term penalty
- Proximal methods