Adaptive smoothing algorithms for nonsmooth composite convex minimization

Tran-Dinh, Quoc

doi:10.1007/s10589-016-9873-6

Adaptive smoothing algorithms for nonsmooth composite convex minimization

Published: 15 September 2016

Volume 66, pages 425–451, (2017)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Computational Optimization and Applications Aims and scope Submit manuscript

Adaptive smoothing algorithms for nonsmooth composite convex minimization

Download PDF

Quoc Tran-Dinh¹

955 Accesses
15 Citations
Explore all metrics

Abstract

We propose an adaptive smoothing algorithm based on Nesterov’s smoothing technique in Nesterov (Math Prog 103(1):127–152, 2005) for solving “fully” nonsmooth composite convex optimization problems. Our method combines both Nesterov’s accelerated proximal gradient scheme and a new homotopy strategy for smoothness parameter. By an appropriate choice of smoothing functions, we develop a new algorithm that has the $\mathcal {O}\left( \frac{1}{\varepsilon }\right) $-worst-case iteration-complexity while preserves the same complexity-per-iteration as in Nesterov’s method and allows one to automatically update the smoothness parameter at each iteration. Then, we customize our algorithm to solve four special cases that cover various applications. We also specify our algorithm to solve constrained convex optimization problems and show its convergence guarantee on a primal sequence of iterates. We demonstrate our algorithm through three numerical examples and compare it with other related algorithms.

A class of exact penalty functions and penalty algorithms for nonsmooth constrained optimization problems

Article 19 October 2019

Solving Minimax Problems: Local Smoothing Versus Global Smoothing

Subgradient Smoothing Method for Nonsmooth Nonconvex Optimization

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

This paper develops new smoothing optimization methods for solving the following “fully” nonsmooth composite convex minimization problem:

$$\begin{aligned} F^{\star } := \min _{{x}\in \mathbb {R}^p}\Big \{ F({x}) := f({x}) + g({x}) \Big \}, \end{aligned}$$

(1)

where $g:\mathbb {R}^p\rightarrow \mathbb {R}\cup \left\{ +\infty \right\} $ is a proper, closed and convex function, and $f : \mathbb {R}^p\rightarrow \mathbb {R}\cup \left\{ +\infty \right\} $ is a convex function defined by the following max-structure:

$$\begin{aligned} f({x}) := \max _{{u}\in \mathbb {R}^n}\Big \{ \langle {x}, {A}{u}\rangle - \varphi ({u}) : {u}\in \mathcal {U}\Big \}. \end{aligned}$$

(2)

Here, $\varphi : \mathbb {R}^n\rightarrow \mathbb {R}\cup \left\{ +\infty \right\} $ is a proper, closed and convex function, and $\mathcal {U}$ is a nonempty, closed, and convex set in $\mathbb {R}^n$, and ${A}\in \mathbb {R}^{p\times n}$ is given.

Clearly, any proper, closed and convex function f can be written as (2) using its Fenchel conjugate $f^{*}$, i.e., $f({x}) := \sup \left\{ \langle {x}, {u}\rangle - f^{*}({u}) : {u}\in \mathrm {dom}(f^{*})\right\} $. Hence, the max-structure (2) does not restrict the applicability of the template (1). Moreover, (1) also directly models many practical applications in signal and image processing, machine learning, statistics and data sciences, see, e.g., [4, 9, 13, 15, 26, 28, 32] and the references quoted therein.

While the first term f is nonsmooth, the second term g remains unspecified. On the one hand, we can assume that g is smooth and its gradient is Lipschitz continuous. On the other hand, g can be nonsmooth, but it is equipped with a “tractable” proximity operator defined as follows: g is said to be tractably proximal if its proximal operator

$$\begin{aligned} \mathrm {prox}_{g}({x}) := \mathrm {arg}\min _{{y}}\left\{ g({y}) + (1/2)\Vert {y}- {x}\Vert ^2 : {y}\in \mathrm {dom}(g)\right\} , \end{aligned}$$

(3)

can be computed “efficiently” (e.g., by a closed form or by polynomial time algorithms). In general, computing $\mathrm {prox}_g$ requires to solve the strongly convex problem (3), but in many cases, this operator can be obtained in a closed form or by a low-cost polynomial algorithm. Examples of such convex functions can be found in the literature including [3, 15, 28].

Solving nonsmooth convex optimization problems remains challenging, especially when none of the two nonsmooth terms f and g is equipped with a tractable proximity operator. Existing nonsmooth convex optimization approaches such as subgradient-type descent algorithms, dual averaging strategies, bundle-level techniques or derivative-free methods are often used to solve general nonsmooth convex problems. However, these methods suffer a slow convergence rate (resp., $\mathcal {O}\left( \frac{1}{\epsilon ^2}\right) $ - worst-case iteration-complexity). In addition, they are sensitive to the algorithmic parameters such as stepsizes [22].

In his pioneering work [24], Nesterov shown that one can solve the nonsmooth structured convex minimization problem (1) within $\mathcal {O}\left( \frac{1}{\epsilon }\right) $ iterations. This method combines a proximity smoothing technique and Nesterov’s accelerated gradient scheme [21] to achieve the optimal worst-case iteration-complexity, which is much better than the $\mathcal {O}\left( \frac{1}{\epsilon ^2}\right) $-worst-case iteration complexity in nonsmooth optimization methods.

Motivated by [24], Nesterov and many other researchers have proposed different algorithms using such a proximity smoothing method to solve other problems, to improve Nesterov’s original algorithm, or to customize his algorithm for specific applications, see, e.g., [2, 6, 7, 14, 17, 19, 20, 23, 25, 33]. In [5], Beck and Teboulle generalized Nesterov’s smoothing technique to a generic framework, where they discussed the advantages and disadvantages of smoothing techniques. In addition, they also illustrated numerical efficiency between smoothing techniques and proximal-type methods. In [1, 27], the authors studied smoothing techniques for the sum of three convex functions, where one term is Lipschitz gradient, while the others are nonsmooth. In [11], a variable smoothing method was proposed, which possesses the $\mathcal {O}\left( \frac{\ln (k)}{k}\right) $-convergence rate. This convergence rate is worse than the one in [24]. However, as a compensation, the smoothness parameter is updated at each iteration. In addition, their method uses special quadratic proximity functions, while smooths both f and g under a Lipschitz continuity assumption imposed on f and g.

In [23], Nesterov introduced an excessive gap technique, which requires both primal and dual schemes using two smoothness parameters. It symmetrically updates one parameter at each iteration. Nevertheless, this method uses different assumptions than our method. Other primal-dual methods studied in, e.g., [12, 16] use double smoothing techniques to solve (1), but only achieve $\mathcal {O}\left( \frac{1}{\varepsilon }\log \left( \frac{1}{\varepsilon }\right) \right) $-worst-case iteration-complexity.

Our approach in this paper is also based on Nesterov’s smoothing technique in [24]. To clarify the differences between our method and [23, 24], let us first briefly present Nesterov’s smoothing technique in [24] applying to (1).

Recall that a convex function $b_{\mathcal {U}} : \mathbb {R}^n\rightarrow \mathbb {R}$ is called a proximity function of $\mathcal {U}$ if it is continuous, and strongly convex with the convexity parameter $\mu _b > 0$ and $\mathcal {U}\subseteq \mathrm {dom}(b_{\mathcal {U}})$. Given $b_{\mathcal {U}}$, we define

$$\begin{aligned} \bar{{u}}^c := \mathrm {arg}\min _{{u}}\left\{ b_{\mathcal {U}}({u}) : {u}\in \mathcal {U}\right\} ~~~\text {and}~~D_{\mathcal {U}} := \sup _{{u}}\left\{ b_{\mathcal {U}}({u}) : {u}\in \mathcal {U}\right\} \in [0, +\infty ). \end{aligned}$$

Here, $\bar{{u}}^c$ and $D_{\mathcal {U}}$ are called the prox-center and prox-diameter of $\mathcal {U}$ w.r.t. $b_{\mathcal {U}}$, respectively. Without loss of generality, we can assume that $b_{\mathcal {U}}(\bar{{u}}^c) = 0$ and $\mu _b = 1$. Otherwise, we just rescale and shift it.

As shown in [24], given $\gamma > 0$ and $b_{\mathcal {U}}$, we can approximate f by $f_{\gamma }$ as

$$\begin{aligned} f_{\gamma }({x}) := \max _{{u}}\left\{ \langle {x}, {A}{u}\rangle - \varphi ({u}) - \gamma b_{\mathcal {U}}({u}) : {u}\in \mathcal {U}\right\} , \end{aligned}$$

(4)

where $\gamma $ is called a smoothness parameter. Since $f_{\gamma }$ is smooth and has Lipschitz gradient, one can apply accelerated proximal gradient methods [4, 26] to minimize the sum $f_{\gamma }(\cdot ) + g(\cdot )$. Using such methods, we can eventually guarantee

$$\begin{aligned} F({x}^k) - F^{\star }\le \min _{\gamma > 0}\left\{ \frac{2\Vert {A}\Vert ^2R_0^2}{\gamma (k+1)^2} + \gamma D_{\mathcal {U}}\right\} = \frac{2\sqrt{2}\Vert {A}\Vert R_0\sqrt{D_{\mathcal {U}}}}{(k+1)}, \end{aligned}$$

(5)

where $\left\{ {x}^k\right\} $ is the underlying sequence generated by the accelerated proximal-gradient method, see [24], and $R_0 := \Vert {x}^0 - {x}^{\star }\Vert $. To achieve an $\varepsilon $-solution ${x}^k$ such that $F({x}^k) - F^{\star }\le \varepsilon $, we set $\gamma \equiv \gamma ^{*} := \frac{\varepsilon }{2D_{\mathcal {U}}}$ at its optimal value. Hence, the algorithm requires at most $k_{\max } := \left\lfloor 2\sqrt{2}\Vert {A}\Vert R_0\sqrt{D_{\mathcal {U}}}\varepsilon ^{-1}\right\rfloor $ iterations.

Our approach The original smoothing algorithm in [24] has three computational disadvantages even with the optimal choice $\gamma ^{*} := \frac{\varepsilon }{2D_{\mathcal {U}}}$ of $\gamma $.

(a)
It requires the prox-diameter $D_{\mathcal {U}}$ of $\mathcal {U}$ to determine $\gamma ^{*}$, which may be expensive to estimate when $\mathcal {U}$ is complicated.
(b)
If $\varepsilon $ is small and $D_{\mathcal {U}}$ is large, then $\gamma ^{*}$ is small, and hence, the strong convexity parameter of (4) is small. Algorithms for solving (4) have slow convergence speed.
(c)
The Lipschitz constant of $\nabla {f_{\gamma }}$ is $\gamma ^{-1}\Vert A\Vert ^2 = 2\Vert A\Vert ^2D_{\mathcal {U}}\varepsilon ^{-1}$, which is large. This leads to a small step-size of $\varepsilon /(2\Vert A\Vert ^2D_{\mathcal {U}})$ in the accelerated proximal-gradient algorithm and hence, can have a slow convergence.

Our approach is briefly presented as follows. We first choose a smooth proximity function $b_{\mathcal {U}}$ instead of a general one. We assume that $\nabla {b}_{\mathcal {U}}$ is $L_b$-Lipschitz continuous with the Lipschitz constant $L_b \ge \mu _b = 1$. Then, we define $f_{\gamma }({x})$ as in (4), which is a smoothed approximation to f as above.

We design a smoothing accelerated proximal-gradient algorithm that can updates $\gamma $ from $\gamma _k$ to $\gamma _{k+1}$ at each iteration so that $\gamma _{k+1} < \gamma _k$ by performing only one accelerated proximal-gradient step [4, 26] to minimize the sum $F_{\gamma _{k+1}} := f_{\gamma _{k+1}} + g$ for each value $\gamma _{k+1}$ of $\gamma $. We prove that the sequence of the objective residuals, $\left\{ F({x}^k) - F^{\star }\right\} $, converges to zero up to the $\mathcal {O}\left( \frac{1}{k}\right) $-rate.

Our contributions Our main contributions can be summarized as follows:

(a)
We propose using a smooth proximity function to smooth the max-structure objective function f in (2), and develop a new smoothing algorithm, Algorithm 1, based on the accelerated proximal-gradient method to adaptively update the smoothness parameter in a heuristic-free fashion.
(b)
We prove up to the $\mathcal {O}\left( \frac{1}{\varepsilon }\right) $-worst-case iteration-complexity for our algorithm as in [24] to achieve an $\varepsilon $-solution, i.e., $F({x}^k) - F^{\star }\le \varepsilon $. Especially, with the quadratic proximity function $b_{\mathcal {U}}(\cdot ) := (1/2)\Vert \cdot - \bar{{u}}^c\Vert ^2$, our algorithm achieve exactly the $\mathcal {O}\left( \frac{1}{\varepsilon }\right) $-worst-case iteration-complexity as in [24].
(c)
We customize our algorithm to handle four important special cases that have a great practical impact in many applications.
(d)
We specify our algorithm to solve constrained convex minimization problems, and propose an averaging scheme to recover an approximate primal solution with a rigorous convergence guarantee.

From a practical point of view, we believe that the proposed algorithm can overcome three disadvantages mentioned previously in the original smoothing algorithm in [24]. However, our condition $L_b = 1$ on the choice of proximity functions may lead to some limitation of the proposed algorithm for exploiting further the structures of the constraint set $\mathcal {U}$. Fortunately, we can identify several important settings in Sect. 4, where we can eliminate this disadvantage. Such classes of problems cover several applications in image processing, compressive sensing, and monotropic programming [3, 15, 28, 34].

Paper organization The rest of this paper is organized as follows. Section 2 briefly discusses our smoothing technique. Section 3 presents our main algorithm, Algorithm 1, and proves its convergence guarantee. Section 4 handles four special but important cases of (1). Section 5 specializes our algorithm to solve constrained convex minimization problems. Preliminarily numerical examples are given in Sect. 6. For clarity of presentation, we move the long and technical proofs to the appendix.

Notation and terminology We work on the real spaces $\mathbb {R}^p$ and $\mathbb {R}^n$, equipped with the standard inner product $\langle \cdot , \cdot \rangle $ and the Euclidean $\ell _2$-norm $\Vert \cdot \Vert $. Given a proper, closed, and convex function g, we use $\mathrm {dom}(g)$ and $\partial {g}({x})$ to denote its domain and its subdifferential at ${x}$, respectively. If g is differentiable, then $\nabla {g}({x})$ stands for its gradient at ${x}$.

We denote $f^{*}({s}) := \sup \left\{ \left\langle {s}, {x}\right\rangle - f({x}) : {x}\in \mathrm {dom}(f)\right\} $, the Fenchel conjugate of f. For a given set $\mathcal {X}$, $\delta _{\mathcal {X}}({x}) := 0$ if ${x}\in \mathcal {X}$ and $\delta _{\mathcal {X}}({x}) := +\infty $, otherwise, defines the indicator function of $\mathcal {X}$. For a smooth function f, we say that f is $L_f$-smooth if for any ${x}, \tilde{{x}}\in \mathrm {dom}(f)$, we have $\Vert \nabla {f}({x}) - \nabla {f}(\tilde{{x}})\Vert \le L_f\Vert {x}- \tilde{{x}}\Vert $, where $L(f) := L_f \in [0, \infty )$. We denote by $\mathcal {F}_{L}^{1,1}$ the class of all $L_f$-smooth and convex functions f. We also use $\mu _f \equiv \mu (f)$ for the strong convexity parameter of a convex function f. Given a nonempty, closed and convex set $\mathcal {X}$, $\mathrm {dist}\left( {x},\mathcal {X}\right) $ denotes the distance from ${x}$ to $\mathcal {X}$.

2 Smoothing techniques via smooth proximity functions

Let $b_{\mathcal {U}}$ be a prox-function of the nonempty, closed and convex set $\mathcal {U}$ with the strong convexity parameter $\mu _b = 1$. In addition, $b_{\mathcal {U}}$ is smooth on $\mathcal {U}$, and its gradient $\nabla {b_{\mathcal {U}}}$ is Lipschitz continuous with the Lipschitz constant $L_b \ge \mu _b = 1$. In this case, $b_{\mathcal {U}}$ is said to be $L_b$-smooth. As a default example, $b_{\mathcal {U}}(\cdot ) := (1/2)\Vert \cdot - \bar{{u}}^c\Vert ^2$ for fixed $\bar{{u}}^c\in \mathcal {U}$ satisfies our assumptions with $L_b = \mu _b = 1$. Let $\bar{{u}}^c$ be the b-prox-center point of $\mathcal {U}$, i.e., $\bar{{u}}^c := \mathrm {arg}\min _{{u}}\left\{ b_{\mathcal {U}}({u}) : {u}\in \mathcal {U}\right\} $. Without loss of generality, we can assume that $b_{\mathcal {U}}(\bar{{u}}^c) = 0$. Otherwise, we consider $\bar{b}_{\mathcal {U}}({u}) := b_{\mathcal {U}}({u}) - b_{\mathcal {U}}(\bar{{u}}^c)$.

Given a convex function $\varphi ^{*}({z}) := \max _{{u}}\left\{ \langle {z}, {u}\rangle - \varphi ({u}) : {u}\in \mathcal {U}\right\} $, we define a smoothed approximation of $\varphi ^{*}$ as

$$\begin{aligned} \varphi ^{*}_{\gamma }({z}) := \max _{{u}\in \mathcal {U}}\left\{ \langle {z}, {u}\rangle - \varphi ({u}) - \gamma b_{\mathcal {U}}({u})\right\} , \end{aligned}$$

(6)

where $\gamma > 0$ is a smoothness parameter. We note that $\varphi ^{*}$ is not a Fenchel conjugate of $\varphi $ unless $\mathcal {U}= \mathrm {dom}(\varphi )$. We denote by ${u}^{*}_{\gamma }({x})$ the unique optimal solution of the strongly concave maximization problem (6), i.e.:

$$\begin{aligned} {u}^{*}_{\gamma }({z}) = \mathrm {arg}\max _{{u}}\left\{ \langle {z}, {u}\rangle - \varphi ({u}) - \gamma b_{\mathcal {U}}({u}) : {u}\in \mathcal {U}\right\} . \end{aligned}$$

(7)

We also define $D_{\mathcal {U}} := \sup _{{u}}\left\{ b_{\mathcal {U}}({u}) : {u}\in \mathcal {U}\cap \mathrm {dom}(\varphi )\right\} $ the b-prox diameter of $\mathcal {U}$. If $\mathcal {U}$ or $\mathrm {dom}(\varphi )$ is bounded, then $D_{\mathcal {U}} \in [0, +\infty )$.

Associated with $\varphi ^{*}_{\gamma }$, we consider a smoothed function for f in (2) as

$$\begin{aligned} f_{\gamma }({x}) := \varphi ^{*}_{\gamma }({A}^{\top }{x}) = \max _{{u}}\left\{ \langle {A}^{\top }{x}, {u}\rangle - \varphi ({u}) - \gamma b_{\mathcal {U}}({u}) : {u}\in \mathcal {U}\right\} . \end{aligned}$$

(8)

Then, the following lemma summaries the properties of the smoothed function $\varphi ^{*}_{\gamma }$ defined by (6) and $f_{\gamma }$ defined by (8), whose proof can be found in [31].

Lemma 1

The function $\varphi ^{*}_{\gamma }$ defined by (6) is convex and smooth. Its gradient is given by $\nabla {\varphi ^{*}_{\gamma }}({z}) := {u}^{*}_{\gamma }({z})$ which is Lipschitz continuous with the Lipschitz constant $L_{\varphi _{\gamma }^{*}} := \gamma ^{-1}$. Consequently, for any ${z},\bar{{z}}\in \mathbb {R}^n$, we have

$$\begin{aligned} \frac{\gamma }{2}\Vert {u}^{*}_{\gamma }({z}) - {u}^{*}_{\gamma }(\bar{{z}})\Vert ^2 \le \varphi _{\gamma }^{*}({z}) - \varphi _{\gamma }^{*}(\bar{{z}}) - \langle \nabla {\varphi _{\gamma }^{*}}({z}), {z}- \bar{{z}}\rangle \le \frac{1}{2\gamma }\Vert {z}- \bar{{z}}\Vert ^2. \end{aligned}$$

(9)

For fixed ${z}\in \mathbb {R}^n$, $\varphi _{\gamma }^{*}({z})$ is convex w.r.t. $\gamma \in \mathbb {R}_{++}$, and

$$\begin{aligned} \varphi ^{*}_{\gamma }({z}) - (\hat{\gamma } - \gamma )b_{\mathcal {U}}({u}^{*}_{\gamma }({z})) \le \varphi ^{*}_{\hat{\gamma }}({z}), ~~\forall \gamma ,\hat{\gamma }\in \mathbb {R}_{++}. \end{aligned}$$

(10)

As a consequence, $f_{\gamma }$ defined by (8) is convex and smooth. Its gradient is given by $\nabla {f_{\gamma }}({x}) = {A}{u}^{*}_{\gamma }({A}^{\top }{x})$, which is Lipschitz continuous with the Lipschitz constant $L_{f_{\gamma }} := \gamma ^{-1}\Vert {A}\Vert ^2$. In addition, we also have

$$\begin{aligned} f_{\gamma }({x}) \le f({x}) \le f_{\gamma }({x}) + \gamma D_{\mathcal {U}}, ~~\forall {x}\in \mathbb {R}^p. \end{aligned}$$

(11)

We emphasize that Lemma 1 provides key properties to analyze the complexity of our algorithm in the next setions.

3 The adaptive smoothing algorithm and its convergence

Associated with (1), we consider its smoothed composite convex problem as

$$\begin{aligned} F_{\gamma }^{\star } := \min _{{x}\in \mathbb {R}^p}\left\{ F_{\gamma }({x}) := f_{\gamma }({x}) + g({x})\right\} . \end{aligned}$$

(12)

Similar to [24], the main step of Nesterov’s accelerated proximal-gradient scheme [4, 26] applied to the smoothed problem (12) is expressed as follows:

$$\begin{aligned} \begin{array}{ll} {x}^{k+1} &{}:= \mathrm {prox}_{\beta g}\left( \hat{{x}}^k - \beta \nabla {f}_{\gamma }(\hat{{x}}^k)\right) \\ &{}\equiv \displaystyle \mathrm {arg}\min _{{x}\in \mathbb {R}^p}\Big \{ g({x}) + \frac{1}{2\beta }\big \Vert {x}- \big (\hat{{x}}^k - \beta {A}{u}^{*}_{\gamma }({A}^{\top }\hat{{x}}^k)\big )\big \Vert ^2\Big \}, \end{array} \end{aligned}$$

(13)

where $\hat{{x}}^k$ is given, and $\beta > 0$ is a given step size, which will be chosen later.

The following lemma provides a descent property of the proximal-gradient step (13), whose proof can be found in Appendix 1.

Lemma 2

Let ${x}^{k+1}$ be generated by (13). Then, for any ${x}\in \mathbb {R}^p$, we have

$$\begin{aligned} \begin{array}{ll} F_{\gamma }({x}^{k+1})&\le \hat{\ell }^k_{\gamma }({x}) + \frac{1}{\beta }\langle {x}^{k+1} - \hat{{x}}^k, {x}-\hat{{x}}^k\rangle - \frac{1}{2}\left( \frac{2}{\beta } - \frac{\Vert {A}\Vert ^2}{\gamma }\right) \Vert \hat{{x}}^k - {x}^{k+1}\Vert ^2, \end{array} \end{aligned}$$

(14)

where

$$\begin{aligned} \begin{array}{ll} \hat{\ell }^k_{\gamma }({x}) &{}:= f_{\gamma }(\hat{{x}}^k) + \langle \nabla {f_{\gamma }}(\hat{{x}}^k), {x}- \hat{{x}}^k\rangle + g({x}) \\ &{} \le F_{\gamma }({x}) - \frac{\gamma }{2}\Vert {u}^{*}_{\gamma }({A}^{\top }{x}) - {u}^{*}_{\gamma }({A}^{\top }\hat{{x}}^k)\Vert ^2. \end{array} \end{aligned}$$

(15)

We now adopt the accelerated proximal-gradient scheme (FISTA) in [4] to solve (12) using an adaptive step-size $\beta _{k+1} := \frac{\gamma _{k+1}}{\Vert {A}\Vert ^2}$, which becomes

$$\begin{aligned} \left\{ \begin{array}{ll} \hat{{x}}^k &{}:= (1-\tau _k){x}^k + \tau _k\tilde{{x}}^k\\ {x}^{k+1} &{}:= \mathrm {prox}_{\beta _{k+1}g}\left( \hat{{x}}^k - \beta _{k+1}\nabla {f}_{\gamma _{k+1}}(\hat{{x}}^k)\right) \\ \tilde{{x}}^{k+1} &{}:= \tilde{{x}}^k -\frac{1}{\tau _k}(\hat{{x}}^k - {x}^{k+1}), \end{array}\right. \end{aligned}$$

(16)

where $\gamma _{k+1} > 0$ is the smoothness parameter, and $\tau _k \in (0, 1]$.

By letting $t_k := \frac{1}{\tau _k}$, we can eliminate $\tilde{{x}}^k$ in (16) to obtain a compact version

$$\begin{aligned} \left\{ \begin{array}{ll} {x}^{k+1} &{}:= \mathrm {prox}_{\beta _{k+1}g}\left( \hat{{x}}^k - \beta _{k+1}\nabla {f}_{\gamma _{k+1}}(\hat{{x}}^k)\right) \\ \hat{x}^{k+1}&{} := x^{k+1} + \frac{t_k-1}{t_{k+1}}(x^{k+1}-x^k) \end{array}\right. \end{aligned}$$

(17)

The following lemma provides a key estimate to prove the convergence of the scheme (16) (or (17)), whose proof can be found in Appendix 1.

Lemma 3

Let $\left\{ ({x}^k, \tilde{{x}}^k, \tau _k,\gamma _k)\right\} $ be the sequence generated by (16). Then

$$\begin{aligned} F_{\gamma _{k+1}}({x}^{k+1}) \le (1 - \tau _k)F_{\gamma _{k}}({x}^k) + \tau _kF({x}) + \frac{\Vert {A}\Vert ^2\tau _k^2}{2\gamma _{k+1}}\left[ \Vert \tilde{{x}}^k - {x}\Vert ^2 - \Vert \tilde{{x}}^{k+1} \!-\! {x}\Vert ^2\right] - R_k, \end{aligned}$$

(18)

for any $x\in \mathbb {R}^p$ and $R_k$ is defined by

$$\begin{aligned} \begin{array}{ll} R_k &{} := \tau _k\gamma _{k+1}b_{\mathcal {U}}({u}^{*}_{\gamma _{k+1}}({A}^{\top }\hat{{x}}^k)) - (1-\tau _k)(\gamma _k - \gamma _{k+1})b_{\mathcal {U}}({u}^{*}_{\gamma _{k+1}}({A}^{\top }{x}^k))\\ &{}\qquad \quad + \, \frac{(1 - \tau _k)\gamma _{k+1}}{2}\Vert {u}^{*}_{\gamma _{k+1}}({A}^{\top }{x}^k) - {u}^{*}_{\gamma _{k+1}}({A}^{\top }\hat{{x}}^k)\Vert ^2. \end{array} \end{aligned}$$

(19)

Moreover, the quantity $R_k$ is bounded from below by

$$\begin{aligned} \begin{array}{ll} R_k&\ge \frac{1}{2}(1-\tau _k)\left[ \gamma _{k+1}\tau _k - L_b(\gamma _k - \gamma _{k+1})\right] b_{\mathcal {U}}({u}^{*}_{\gamma _{k+1}}({A}^{\top }{x}^k)). \end{array} \end{aligned}$$

(20)

Next, we show one possibility for updating $\tau _k$ and $\gamma _k$, and provide an upper bound for $F_{\gamma _k}({x}^k) - F^{\star }$. The proof of this lemma is moved to Appendix 1.

Lemma 4

Let us choose $\tilde{{x}}^0 := {x}^0 \in \mathrm {dom}(F)$, $\gamma _1 > 0$, and an arbitrary constant $\bar{c} \ge 1$. If the parameters $\tau _k$ and $\gamma _k$ are updated by

$$\begin{aligned} \tau _k := \frac{1}{k+\bar{c}} ~~~\text {and}~~~\gamma _{k+1} := \frac{\gamma _1\bar{c}}{k+\bar{c}}, \end{aligned}$$

(21)

then the quantity $R_k$ defined by (19) and the parameters $\{ (\tau _k, \gamma _k)\}$ satisfy

$$\begin{aligned} \frac{\gamma _{k+1}}{\tau _k^2}R_k \ge -\frac{\gamma _1^2\bar{c}^2\left[ (L_b-1)(k + \bar{c}) + 1\right] }{(k + \bar{c})^2}D_{\mathcal {U}}~~~~\text {and}~~~~\frac{(1-\tau _k)\gamma _{k+1}}{\tau _k^2} = \frac{\gamma _k}{\tau _{k-1}^2}. \end{aligned}$$

(22)

Moreover, the following estimate holds

$$\begin{aligned} F_{\gamma _{k+1}}({x}^{k+1}) - F^{\star }\le \frac{\tau _{k}^2}{\gamma _{k+1}}\left[ \frac{(1-\tau _0)\gamma _1}{\tau _0^2}(F_{\gamma _0}({x}^0) - F^{\star }) + \frac{\Vert {A}\Vert ^2}{2}\Vert {x}^0 - {x}^{\star }\Vert ^2 + S_kD_{\mathcal {U}}\right] ,\nonumber \\ \end{aligned}$$

(23)

where

$$\begin{aligned} \begin{array}{ll} {}S_k := \gamma _1^2\bar{c}^2\sum _{i=0}^k\left[ \frac{(L_b -1)}{(i + \bar{c})} + \frac{1}{(i + \bar{c})^2}\right] \le \gamma _1^2\bar{c}^2(L_b-1)\left( \ln (k+\bar{c}) + 1\right) + \gamma _1^2(\bar{c}+1). \end{array} \end{aligned}$$

(24)

In particular, if we choose $b_{\mathcal {U}}$ such that $L_b = 1$, then $S_k \le \gamma _1^2(\bar{c}+1)$.

By (21), the second line of (17) reduces to $\hat{{x}}^{k+1} := {x}^{k+1} + \left( \frac{k+\bar{c}-1}{k+\bar{c}+1}\right) ({x}^{k+1} - {x}^k)$. Using this step into (17) and combining the result with the update rule (21), we can present our algorithm for solving (1) as in Algorithm 1.

The following theorem proves the convergence of Algorithm 1 and estimates its worst-case iteration-complexity.

Theorem 1

Let $\{{x}^k\}$ be the sequence generated by Algorithm 1 using $\bar{c} = 1$. Then, for $k\ge 1$, we have

$$\begin{aligned} F({x}^{k}) - F^{\star }\le \frac{\Vert {A}\Vert ^2\Vert {x}^0 - {x}^{\star }\Vert ^2}{2\gamma _1k} + \frac{3\gamma _1D_{\mathcal {U}}}{k} + \frac{\gamma _1(L_b-1)\left( \ln (k) + 1\right) D_{\mathcal {U}}}{k}. \end{aligned}$$

(25)

If $b_{\mathcal {U}}$ is chosen so that $L_b = 1$ (e.g., $b_{\mathcal {U}}(\cdot ) := \frac{1}{2}\Vert \cdot -\bar{{u}}^c\Vert ^2)$, then (25) reduces to

$$\begin{aligned} F({x}^{k}) - F^{\star }&\le \frac{\Vert {A}\Vert ^2\Vert {x}^0 - {x}^{\star }\Vert ^2}{2\gamma _1k} + \frac{3\gamma _1D_{\mathcal {U}}}{k}, ~~(\forall k\ge 1). \end{aligned}$$

(26)

Consequently, if we set $\gamma _1 := \frac{R_0\Vert {A}\Vert }{\sqrt{6D_{\mathcal {U}}}}$, which is independent of k, then

$$\begin{aligned} F({x}^{k}) - F^{\star }\le \frac{R_0\Vert {A}\Vert \sqrt{6D_{\mathcal {U}}}}{k}~~~(\forall k\ge 1), \end{aligned}$$

(27)

where $R_0 := \Vert {x}^0 - {x}^{\star }\Vert $.

In this case, the worst-case iteration-complexity of Algorithm 1 to achieve an $\varepsilon $-solution ${x}^{k}$ to (1) such that $F({x}^{k})-F^{\star }\le \varepsilon $ is $k_{\max } := \mathcal {O}\Big (\frac{R_0\Vert {A}\Vert \sqrt{D_{\mathcal {U}}}}{\varepsilon }\Big )$.

Proof

From (21), $\bar{c} = 1$ we have $\frac{\tau _{k-1}^2}{\gamma _k} = \frac{(k+\bar{c}-1)}{\bar{c}\gamma _1(k+\bar{c}-1)^2} = \frac{1}{\gamma _1k}$. Using this bound and $S_{k-1} \le \gamma _1^2(L_b-1)\left[ \ln (k) + 1\right] + 2\gamma _1^2$ into (23) we get

$$\begin{aligned} F_{\gamma _k}({x}^k) - F^{\star }\le & {} \frac{1}{\gamma _1k}\left[ \frac{\Vert {A}\Vert ^2}{2}\Vert {x}^0 - {x}^{\star }\Vert ^2 +\frac{\gamma _1(1-\tau _0)}{\tau _0^2}[F_{\gamma _0}({x}^0) - F^{\star }]\right] \\&+\, \frac{\left( \gamma _1(L_b-1)\left[ \ln (k) + 1\right] + 2\gamma _1\right) D_{\mathcal {U}}}{k}. \end{aligned}$$

Since $F({x}^k) - F_{\gamma _k}({x}^k) \le \gamma _kD_{\mathcal {U}}$ due to (11), and $\gamma _k = \frac{\gamma _1\bar{c}}{k+\bar{c}-1} = \frac{\gamma _1}{k}$. Substituting this inequality into the last estimate, and using $\tau _0 = \frac{1}{\bar{c}} = 1$, we obtain (25).

If we choose $b_{\mathcal {U}}$ such that $L_b = 1$, e.g., $b_{\mathcal {U}}(\cdot ) := (1/2)\Vert \cdot -\bar{{u}}^c\Vert ^2$, then $S_k \le 2\gamma _1^2$ as shown in (24). Using this, it follows from (25) that $F({x}^k) - F^{\star }\le \frac{\Vert {A}\Vert }{2\gamma _1k}R_0^2 + \frac{3\gamma _1}{k+1}D_{\mathcal {U}}$. By minimizing the right hand side of this estimate w.r.t $\gamma _1 > 0$, we have $\gamma _1 := \frac{R_0\Vert {A}\Vert }{\sqrt{6D_{\mathcal {U}}}}$ and hence, $F({x}^{k}) - F^{\star }\le \frac{R_0\Vert {A}\Vert \sqrt{6D_{\mathcal {U}}}}{k}$, which is exactly (27). The last statement is a direct consequence of (27). $\square $

For general prox-function $b_{\mathcal {U}}$ with $L_b > 1$, Theorem 1 shows that the convergence rate of Algorithm 1 is $\mathcal {O}\left( \frac{\ln (k)}{k}\right) $, which is similar to [11]. However, when $L_b$ is close to 1, the last term in (25) is better than [11, Theorem 1].

Remark 1

Let $b_{\mathcal {U}}(\cdot ) := (1/2)\Vert \cdot -\bar{{u}}^c\Vert ^2$. Then, (27) shows that the number of maximum iterations in Algorithm 1 is $k_{\max } := \left\lfloor \frac{R_0\Vert A\Vert \sqrt{6D_{\mathcal {U}}}}{\varepsilon }\right\rfloor $, which is the same, $k_{\max } := \left\lfloor \frac{2\sqrt{2}\Vert {A}\Vert R_0\sqrt{D_{\mathcal {U}}}}{\varepsilon }\right\rfloor $, as in (5) (with different factors, $\sqrt{6}$ and $2\sqrt{2}$).

4 Exploiting structures for special cases

For general smooth proximity function $b_{\mathcal {U}}$ with $L_b > 1$, we can achieve the $\mathcal {O}\left( \frac{(L_b - 1)\ln (k)}{k}\right) $ convergence rate. When $L_b = 1$, we obtain exactly the $\mathcal {O}\left( \frac{1}{k}\right) $ rate as in [24]. In this section, we consider three special cases of (1) where we use the quadratic proximity function $b_{\mathcal {U}}(\cdot ) := (1/2)\Vert \cdot - \bar{{u}}^c\Vert ^2$. Then, we specify Algorithm 1 for the $L_g$-smooth objective function g in (1).

4.1 Fenchel conjugate

Let $f^{*}$ be the Fenchel conjugate of f. We can write f in the form of (2) as

$$\begin{aligned} f({x}) = \max _{{u}}\left\{ \langle {x},{u}\rangle - f^{*}({u}) : {u}\in \mathrm {dom}(f^{*}) \right\} \!. \end{aligned}$$

We can smooth f by using $b_{\mathcal {U}}({u}) := (1/2)\Vert {u}\Vert _2^2$ as

$$\begin{aligned} f_{\gamma }({x}) := \max _{ {u}\in \mathrm {dom}(f^{*}) }\left\{ \langle {x},{u}\rangle - f^{*}({u}) - (\gamma /2)\Vert {u}\Vert _2^2 \right\} = \frac{\Vert {x}\Vert ^2}{2\gamma } -{}^{\gamma ^{-1}}f^{*}(\gamma ^{-1}{x}), \end{aligned}$$

where ${}^\beta {h}$ is the Moreau envelope of a convex function h with a parameter $\beta $ [3]. In this case, ${u}^{*}_{\gamma }({x}) = \mathrm {prox}_{\gamma ^{-1}f^{*}}( \gamma ^{-1}{x}) = \gamma ^{-1}({x}- \mathrm {prox}_{\gamma f}({x}))$. Hence, $\nabla {f_{\gamma }}({x}) = \gamma ^{-1}({x}- \mathrm {prox}_{\gamma f}({x}))$. The main step, Step 3, of Algorithm 1 becomes

$$\begin{aligned} {x}^{k+1} = \mathrm {prox}_{\gamma _{k+1}g}\big ( \mathrm {prox}_{\gamma _{k+1}f}(\hat{{x}}^k) \big ). \end{aligned}$$

Hence, Algorithm 1 can be applied to solve (1) using the proximal operator of f and g. The worst-case complexity bound in Theorem 1 becomes $\mathcal {O}\left( \frac{D_{\mathrm {dom}(f^{*})}R_0}{\varepsilon }\right) $, where $D_{\mathrm {dom}(f^{*})} := \displaystyle \max _{{u}\in \mathrm {dom}(f^{*})} \Vert {u}\Vert $ is the diameter of $\mathrm {dom}(f^{*})$.

4.2 Composite convex minimization with linear operator

We consider the following composite convex problem with a linear operator that covers many important applications in practice, see, e.g., [1, 3, 15]:

$$\begin{aligned} F^{\star } := \min _{{x}\in \mathbb {R}^p}\left\{ F({x}) := f({A}{x}) + g({x})\right\} , \end{aligned}$$

(28)

where f and g are two proper, closed and convex functions, and ${A}$ is a linear operator from $\mathbb {R}^p$ to $\mathbb {R}^n$.

We first write $f({A}{x}) := \max _{{u}}\left\{ \langle {A}{x}, {u}\rangle - f^{*}({u}) : {u}\in \mathrm {dom}(f^{*})\right\} $. Next, we choose a quadratic smoothing proximity function $b_{\mathcal {U}}({u}) := (1/2)\Vert {u}- \bar{{u}}^c\Vert ^2$ for fixed $\bar{{u}}^c\in \mathrm {dom}(f^{*})$, and define $\mathcal {U}:= \mathrm {dom}(f^{*})$. Using this smoothing prox-function, we obtain a smoothed approximation of $f({A}{x})$ as follows:

$$\begin{aligned} f_{\gamma }({A}{x}) := \max _{{u}}\left\{ \langle {A}{x}, {u}\rangle - f^{*}({u}) - (\gamma /2)\Vert {u}- \bar{{u}}^c\Vert ^2 : {u}\in \mathrm {dom}(f^{*})\right\} . \end{aligned}$$

In this case, we can compute ${u}^{*}_{\gamma }({A}{x}) = \mathrm {prox}_{\gamma ^{-1}f^{*}}\left( \bar{{u}}^c + \gamma ^{-1}{A}{x}\right) $ by using the proximal operator of $f^{*}$. By Fenchel–Moreau’s decomposition $\mathrm {prox}_{\gamma ^{-1}f^{*}}(\gamma ^{-1}{v}) = \gamma ^{-1}({v}- \mathrm {prox}_{\gamma f}(\gamma {v}))$ as above, we can compute $\mathrm {prox}_{\gamma ^{-1}f^{*}}$ using the proximal operator of f. In this case, we can specify the proximal-gradient step (13) as

$$\begin{aligned} \left\{ \begin{array}{ll} \hat{{u}}^{*}_k &{}:= \mathrm {prox}_{\gamma _{k+1}^{-1}f^{*}}\left( \bar{{u}}^c + \gamma _{k+1}^{-1}{A}\hat{{x}}^k\right) \\ &{} = \bar{{u}}^c + \gamma _{k+1}^{-1}\left( {A}\hat{{x}}^k - \mathrm {prox}_{\gamma _{k+1}f}\big (\gamma _{k+1}\bar{{u}}^c + {A}\hat{{x}}^k\big )\right) \\ {x}^{k+1} &{}:= \mathrm {prox}_{\beta _{k+1} g}\left( \hat{{x}}^k - \beta _{k+1}{A}^{\top }\hat{{u}}^{*}_k\right) , \end{array}\right. \end{aligned}$$

where $\beta _{k+1} := \gamma _{k+1}\Vert {A}\Vert ^{-2}$. Using this proximal gradient step in Algorithm 1, we still obtain the complexity as in Theorem 1, which is $\mathcal {O}\Big (\frac{\Vert {x}^0 - {x}^{\star }\Vert \Vert {A}\Vert \sqrt{D_{\mathcal {U}}}}{\varepsilon }\Big )$, where the domain $\mathcal {U}:= \mathrm {dom}(f^{*})$ of $f^{*}$ is assumed to be bounded.

4.3 The decomposable structure

The function $\varphi $ and the set $\mathcal {U}$ in (2) are said to be decomposable if they can be represented as follows:

$$\begin{aligned} \varphi ({u}) := \sum _{i=1}^m\varphi _i({u}_i), ~~~\text {and}~~~ \mathcal {U}:= \mathcal {U}_1\times \cdots \times \mathcal {U}_m, \end{aligned}$$

(29)

where $m\ge 2$, ${u}_i \in \mathbb {R}^{n_i}$, $\mathcal {U}_i\subseteq \mathbb {R}^{n_i}$ and $\sum _{i=1}^mn_i = n$. In this case, we also say that problem (1) is decomposable.

The structure (29) naturally arises in linear programming and monotropic programming. In addition, many nondecomposable problems such as consensus optimization, empirical risk minimization, conic programming and geometric programming can also be reformulated into (1) with the structure (29). The decomposable structure (29) immediately supports parallel and distributed computation. Exploiting this structure, one can design new parallel and distributed optimization algorithms using the same approach as in Algorithm 1 for solving (1), see, e.g., [10, 13, 29, 30].

Under the structure (29), we choose a decomposable smoothing function $b_{\mathcal {U}}({u}) := \sum _{i=1}^mb_{\mathcal {U}_i}({u}_i)$, where $b_{\mathcal {U}_i}$ is the prox-function of $\mathcal {U}_i$ for $i=1,\ldots ,m$. The smoothed function $f_{\gamma }$ for f is decomposable, and is represented as follows:

$$\begin{aligned} f_{\gamma }({x}) := \sum _{i=1}^m\left\{ f_{\gamma }^i({x}) := \displaystyle \max _{{u}_i\in \mathcal {U}_i}\left\{ \langle {x}, {A}_i{u}_i\rangle - \varphi _i({u}_i) - \gamma b_{\mathcal {U}_i}({u}_i)\right\} \right\} . \end{aligned}$$

(30)

Let us denote by ${u}^{*}_{\gamma ,i}({A}_i^{\top }{x})$ the unique solution of the subproblem i in (30) for $i=1,\ldots , m$. Then, under the decomposable structure, the evaluation of $f_{\gamma }$ and ${u}^{*}_{\gamma }({A}^{\top }{x}) := [{u}^{*}_{\gamma ,1}({A}_1^{\top }{x}),\ldots , {u}^{*}_{\gamma ,m}({A}_m^{\top }{x})]$ can be computed in parallel.

If we apply Algorithm 1 to solve (1) with the structure (29), then we have the following guarantee on the objective residual:

$$\begin{aligned} \begin{array}{ll} F({x}^{k}) - F^{\star }&\le \frac{L_{{A}}\Vert {x}^0 - {x}^{\star }\Vert ^2}{2\gamma _1k} + \frac{\gamma _1D_{\mathcal {U}}}{k}\left( 3 + (L_b -1)\left( \ln (k+1) + 1\right) \right) , \end{array} \end{aligned}$$

where $L_{{A}} := \sum _{i=1}^m\Vert {A}_i\Vert ^2$, $L_b := \max \left\{ L_{b_i} : 1\le i\le m\right\} $ and $D_{\mathcal {U}} := \sum _{i=1}^mD_{\mathcal {U}_i}$. Hence, the convergence rate of Algorithm 1 stated in Theorem 1 is $\mathcal {O}\left( \frac{\ln (k)}{k}\right) $. If we choose $b_{\mathcal {U}_i}(\cdot ) := (1/2)\Vert \cdot -\bar{{u}}^c_i\Vert ^2$ for all $i=1,\ldots , m$, then $L_b = 1$. Consequently, we obtain the $\mathcal {O}\left( \frac{L_{{A}}R_0\sqrt{D_{\mathcal {U}}}}{\varepsilon }\right) $-worst-case iteration-complexity.

4.4 The Lipschitz gradient structure

If g is smooth and its gradient $\nabla {g}$ is Lipschitz continuous with the Lipschitz constant $L_g > 0$, then $F_{\gamma } := f_{\gamma } + g\in \mathcal {F}_{L}^{1,1}$, i.e., $\nabla {F_{\gamma }}$ is Lipschitz continuous with the Lipschitz constant $L_{F_{\gamma }} := L_g + \gamma ^{-1}\Vert {A}\Vert ^2$.

We replace the proximal-gradient step (13) using in Algorithm 1 by the following “full” gradient step

$$\begin{aligned} {x}^{k+1} := \hat{{x}}^k -\beta _{k+1}\big (\nabla {g}(\hat{{x}}^k) + {A}{u}^{*}_{\gamma _{k+1}}({A}^{\top }\hat{{x}}^k)\big ), \end{aligned}$$

(31)

where ${u}^{*}_{\gamma _{k+1}}({A}^{\top }\hat{{x}}^k)$ is computed by (7) and $\beta _{k+1} := \frac{1}{L_g + \gamma _{k+1}^{-1}\Vert {A}\Vert ^2}$ is a given step-size. Unlike (21), we update the parameters $\tau _k$ and $\gamma _k$ as

$$\begin{aligned} \tau _k := \frac{1}{k+1}~~~\text {and}~~~~\gamma _{k+1} := \frac{k\gamma _k\Vert {A}\Vert ^2}{L_g\gamma _k + \Vert {A}\Vert ^2(k+1)}, \end{aligned}$$

where $\gamma _1 := \frac{\Vert {A}\Vert ^2}{L_g}$ is fixed. We name this variant as Algorithm 1(b).

The following corollary summarizes the convergence properties of this variant, whose proof can be found in Appendix 1.

Corollary 1

Assume that $g\in \mathcal {F}_L^{1,1}$ with the Lipschitz constant $L_g \ge 0$. Let $\{ {x}^k\}$ be the sequence generated by Algorithm 1(b). Then, for $k\ge 1$, one has

$$\begin{aligned} {}\begin{array}{ll} {}F({x}^k) - F^{\star } \le \frac{3L_g}{2k}\Vert {x}^0 - {x}^{\star }\Vert ^2 + \frac{\Vert {A}\Vert ^2}{L_gk}\left( \frac{2L_b}{L_g} + 1\right) D_{\mathcal {U}} + \frac{(L_b - 1)\Vert {A}\Vert ^2}{L_g^2k}\left( \ln (k) + 1\right) D_{\mathcal {U}}. \end{array}{} \end{aligned}$$

(32)

If we choose $b_{\mathcal {U}}$ such that $L_b = 1$, then (32) reduces to

$$\begin{aligned} F({x}^k) - F^{\star } \le \frac{3L_g}{2k}\Vert {x}^0 - {x}^{\star }\Vert ^2 + \frac{\Vert {A}\Vert ^2}{L_g^2k}(L_g + 2)D_{\mathcal {U}}. \end{aligned}$$

Consequently, the worst-case iteration-complexity of Algorithm 1(b) is $\mathcal {O}\Big (\frac{1}{\varepsilon }\Big )$.

5 Application to general constrained convex optimization

In this section, we customize Algorithm 1 to solve the following general constrained convex optimization problem:

$$\begin{aligned} \varphi ^{\star } := \min _{{u}\in \mathbb {R}^n }\Big \{ \varphi ({u}) : {A}{u}- {b}\in -\mathcal {K}, ~{u}\in \mathcal {U}\Big \}, \end{aligned}$$

(33)

where $\varphi $ is a proper, closed and convex function from $\mathbb {R}^n\rightarrow \mathbb {R}\cup \left\{ +\infty \right\} $, ${A}\in \mathbb {R}^{p\times n}$, ${b}\in \mathbb {R}^p$, $\mathcal {U}$ and $\mathcal {K}$ are two nonempty, closed and convex set in $\mathbb {R}^n$ and $\mathbb {R}^p$, respectively. Without loss of generality, we can assume that $\varphi $ and $\mathcal {U}$ are decomposable as in (29) with $m\ge 1$.

Associated with the primal setting (33), we consider its dual problem

$$\begin{aligned} F^{\star } := \min _{{x}\in \mathbb {R}^p}\Big \{ F({x}) := \max _{{u}\in \mathcal {U}}\left\{ \langle {x}, {A}{u}\rangle - \varphi ({u})\right\} - \langle {b},{x}\rangle + \max _{{r}\in \mathcal {K}}\langle {x},{r}\rangle \Big \}. \end{aligned}$$

(34)

Clearly, (34) has the same form as (1) with $f({x}) := \displaystyle \max _{{u}}\left\{ \langle {x},{A}{u}\rangle - \varphi ({u}): {u}\in \mathcal {U}\right\} $ and $g({x}) := s_{\mathcal {K}}({x}) - \langle {b},{x}\rangle $, where $s_{\mathcal {K}}$ is the support function of $\mathcal {K}$.

We now specify Algorithm 1 to solve this dual problem. Computing ${u}_{\gamma }^{*}({x})$ requires to solve the following subproblem:

$$\begin{aligned} {u}_{\gamma }^{*}({x}) := \mathrm {arg}\min _{{u}}\left\{ \langle {x}, {A}{u}\rangle - \varphi ({u}) - \gamma b_{\mathcal {U}}({u}) \right\} . \end{aligned}$$

The proximal-step of g becomes $\mathrm {prox}_g({x}) := \mathrm {prox}_{s_{\mathcal {K}}}({x}+{b}) = ({x}+{b}) - \mathrm {proj}_{\mathcal {K}}({x}+{b})$, where $\mathrm {proj}_{\mathcal {K}}(\cdot )$ is the projection onto $\mathcal {K}$. Together with the dual steps, we use an adaptive weighted averaging scheme

$$\begin{aligned} \bar{{u}}^k := \varGamma _k^{-1}\sum _{i=0}^k\tau _i^{-1}\gamma _{i+1}{u}^{*}_{\gamma _{i+1}}(\hat{{x}}^i),~~~\text {and}~~~\varGamma _k := \sum _{i=0}^k\tau _i^{-1}\gamma _{i+1}, \end{aligned}$$

(35)

to construct an approximate primal solution $\bar{{u}}^k$ to an optimal solution ${u}^{\star }$ of (33). Clearly, we can compute $\bar{{u}}^k$ recursively starting from $\bar{{u}}^0 := \varvec{0}^n$ as

$$\begin{aligned} \bar{{u}}^k := (1-\nu _k)\bar{{u}}^{k-1} + \nu _k{u}^{*}_{\gamma _{k+1}}(\hat{{x}}^k), ~~\text {where}~~\nu _k := (\varGamma _k\tau _k)^{-1}\gamma _{k+1} \in (0, 1]. \end{aligned}$$

(36)

We incorporate this scheme into Algorithm 1 to solve (33). While Algorithm 1 constructs an approximate solution to the dual problem (34), (36) allows us to recover an approximate solution $\bar{{u}}^k$ of the primal problem (33). We name this algorithmic variant as Algorithm 1(c).

We specify the convergence guarantee of Algorithm 1(c) in the following theorem. The proof of this theorem is given in Appendix 1.

Theorem 2

Assume that $b_{\mathcal {U}}$ is chosen such that $L_b = 1$, and $\bar{c} = 1$ in (21). Let $\{({x}^k,\bar{{u}}^k)\}$ be generated by Algorithm 1(c). Then $\{\bar{{u}}^k\}\subset \mathcal {U}$ and

$$\begin{aligned} {}\left\{ \begin{array}{l} -\Vert {x}^{\star }\Vert \mathrm {dist}\left( {b}- {A}\bar{{u}}^k, \mathcal {K}\right) \le \varphi (\bar{{u}}^k) - \varphi ^{\star } \le \frac{ \Vert {A}\Vert ^2\Vert {x}^0\Vert ^2 +2 (\gamma _1 + 2\gamma _1^2)D_{\mathcal {U}} }{\gamma _1(k+1)},\\ \mathrm {dist}\left( {b}- {A}\bar{{u}}^k, \mathcal {K}\right) \le \frac{\Vert {A}\Vert ^2\Big (\Vert {x}^0 - {x}^{\star }\Vert + \sqrt{\Vert {x}^0 - {x}^{\star }\Vert ^2 + 2\Vert {A}\Vert ^{-2}(2\gamma _1^2 + \gamma _1)D_{\mathcal {U}}}\Big )}{\gamma _1(k+1)}. \end{array}\right. {} \end{aligned}$$

(37)

Consequently, the worst-case iteration-complexity of Algorithm 1(c) to achieve an $\varepsilon $-solution $\bar{{u}}^k$ such that $\vert \varphi (\bar{{u}}^k) - \varphi ^{\star }\vert \le \varepsilon $ and $\mathrm {dist}\left( {b}-{A}\bar{{u}}^k, \mathcal {K}\right) \le \varepsilon $ is $\mathcal {O}\left( \frac{1}{\varepsilon }\right) $.

Theorem 2 shows that Algorithm 1(c) has the $\mathcal {O}\left( \frac{1}{\varepsilon }\right) $ worst-case iteration-complexity on the primal objective residual and feasibility violation for (33).

6 Preliminarily numerical experiments

We demonstrate the performance of Algorithm 1 for solving the three well-known convex optimization problems. The first example is a LASSO problem with $\ell _1$-loss [34], the second one is a square-root LASSO studied in [8], and the last example is an image deblurring problem with a non-smooth data fidelity function (e.g., the $\ell _1$-norm or the $\ell _2$-norm function).

6.1 The $\ell _1\text {-}\ell _1$-regularized LASSO

We consider the $\ell _1\text {-}\ell _1$-regularized LASSO problem studied in [34] as follows:

$$\begin{aligned} F^{\star }:= \min \left\{ F({x}) := \Vert {B}{x}- {b}\Vert _1 + \lambda \Vert {x}\Vert _1 : {x}\in \mathbb {R}^p\right\} , \end{aligned}$$

(38)

where ${B}$ and ${b}$ are defined as in (39), and $\lambda > 0$ is a regularization parameter.

The function $f({x}) := \Vert {B}{x}- {b}\Vert _1 = \max \left\{ \langle {B}^{\top }{u}, {x}\rangle - \langle {b},{u}\rangle : \Vert {u}\Vert _{\infty } \le 1\right\} $ falls into the decomposable case considered in Sect. 4.3. Hence, we can smooth f using the quadratic prox-function to obtain

$$\begin{aligned} f_{\gamma }({x}) := \max _{{u}}\Big \{ \langle {x}, {B}^{\top }{u}\rangle - \langle {b},{u}\rangle - (\gamma /2)\Vert {u}\Vert ^2 : {u}\in \mathcal {B}_{\infty } \Big \}. \end{aligned}$$

Clearly, we can show that ${u}^{*}_{\gamma }({B}{x}) := \mathrm {proj}_{\mathcal {B}_{\infty }}\left( \gamma ^{-1}({B}{x}- {b})\right) $. In this case, we also have $D_{\mathcal {B}_{\infty }} := \frac{1}{2}n$ and $\mathcal {U}:= \mathcal {B}_{\infty }$.

Now, we apply Algorithm 1 to solve problem (39). To verify the theoretical bound in Theorem 1, we use CVX [18] to solve (39) and obtain a high accuracy approximate solution ${x}^{\star }$. Then, we can compute $R_0 := \Vert {x}^0 - {x}^{\star }\Vert _2$, and choose $\gamma _1 \equiv \gamma _1^{*} := \frac{\Vert {B}\Vert R_0}{\sqrt{6D_{\mathcal {B}_{\infty }}}}$. From Theorem 1, we have $F({x}^k) - F^{\star }\le \frac{R_0\Vert {B}\Vert \sqrt{6D_{\mathcal {B}_{\infty }}}}{k}$, which is the worst-case bound of Algorithm 1, where k is the iteration number.

For our comparison, we also implement the smoothing algorithm in [24] using the quadratic prox-function. As indicated in (5), we set $\gamma \equiv \gamma ^{*} := \frac{\sqrt{2}\Vert B\Vert R_0}{\sqrt{D_{\mathcal {U}}}(k+1)}$. Hence, we also obtain the theoretical upper bound $F({x}^k) - F^{\star }\le \frac{2\sqrt{2}\Vert B\Vert R_0\sqrt{D_{\mathcal {U}}}}{(k+1)}$. We name this algorithm as Non-adapt. Alg. (non-adaptive algorithm).

The test data is generated as follows: Matrix ${B}\in \mathbb {R}^{n\times p}$ is generated randomly using the standard Gaussian distribution $\mathcal {N}(0, 1)$. We consider two cases. In the first case, we use non-correlated data, while in the second case, we generate ${B}$ with $50\%$ correlated columns as ${B}(:,j+1) = 0.5{B}(:,j) + \text {randn}(:)$. The observed measurement vector ${b}$ is generated as ${b}:= {B}{x}^{\natural } + \mathcal {N}(0, 0.05)$, where ${x}^{\natural }$ is a given s-sparse vector generated randomly using $\mathcal {N}(0, 1)$.

We test both algorithms: Algorithm 1 and Non-adapt. Alg. on two problem instances of the size $(p, n, s) = (1000, 350, 100)$ (with and without correlated data, respectively). We sweep along the values of $\lambda $ to find an optimal value for $\lambda $ which are $\lambda = 6.2105$ for non-correlated data, and $\lambda = 5.7368$ for correlated data, respectively. For comparison, we first select the optimal value for $\gamma _1 := \gamma _1^{*}$ and $\gamma := \gamma ^{*}$ in both algorithms. Then, we consider two cases: (i) $\gamma _1 := 10\gamma _1^{*}$ and $\gamma := 10\gamma ^{*}$, and (ii) $\gamma _1 := 0.1\gamma _1^{*}$ and $\gamma := 0.1\gamma ^{*}$.

Figure 1 plots the empirical bounds of the 6 variants versus the theoretical bounds from 200 to 10, 000 iterations. Obviously, both algorithms show their empirical rate which is much better than their theoretical bound. But if we change the smoothness parameters, the guarantee is no longer preserved. Algorithm 1 shows a better performance than Non-adapt. Alg. in both cases.

6.2 Square-root LASSO

We consider the following well-known square-root LASSO problem:

$$\begin{aligned} \min _{{x}\in \mathbb {R}^p}\Big \{ F({x}) := \Vert {B}{x}- {b}\Vert _2 + \lambda \Vert {x}\Vert _1 \Big \}. \end{aligned}$$

(39)

As proved in [8], if matrix ${B}$ is Gaussian, then we can select the regularization parameter $\lambda $ such that we can obtain exact recovery to the true solution ${x}^{\natural }$.

The function f defined by $f({x}) := \Vert {B}{x}- {b}\Vert _2$ can be written as

$$\begin{aligned} f({x}) = \max _{{u}}\left\{ \langle {B}^{\top }{u}, {x}\rangle - \langle {b},{u}\rangle : \Vert {u}\Vert _2 \le 1\right\} . \end{aligned}$$

Let $\mathcal {B}_2 := \left\{ {u}\in \mathbb {R}^n : \Vert {u}\Vert _2 \le 1\right\} $ be the $\ell _2$-norm ball. We choose $b({u}) := \frac{1}{2}\Vert {u}\Vert _2^2$ as a prox-function for $\mathcal {B}_2$. Then, we can smooth f using $b(\cdot ) := \frac{1}{2}\Vert \cdot \Vert _2^2$ as

$$\begin{aligned} f_{\gamma }({x}) := \max _{{u}}\left\{ \langle {x}, {B}^{\top }{u}\rangle - \langle {b},{u}\rangle - (\gamma /2)\Vert {u}\Vert _2^2 : {u}\in \mathcal {B}_2\right\} . \end{aligned}$$

Clearly, ${u}^{*}_{\gamma }({x}) := \mathrm {proj}_{\mathcal {B}_2}\left( \gamma ^{-1}({B}{x}- {b})\right) $ is the solution of the maximization problem, where $\mathrm {proj}_{\mathcal {B}_2}$ is the projection onto $\mathcal {B}_2$. Moreover, we have $D_{\mathcal {U}} = \frac{1}{2}$.

Now, we apply Algorithm 1 to solve problem (39). We choose $\bar{c} := 1$ and set $\gamma _1 \equiv \gamma _1^{*} := \frac{\Vert {A}\Vert R_0}{\sqrt{6D_{\mathcal {U}}}}$, where $R_0 := \Vert {x}^0 - {x}^{\star }\Vert _2$. We also estimate the theoretical upper bound indicated in Theorem 1 for $F({x}^k) - F^{\star }$ using (25), which is $\frac{\Vert {A}\Vert R_0\sqrt{6D_{\mathcal {U}}}}{k}$. We implement the smoothing algorithm in [24] for our comparison by using the same prox-function. The parameter of this algorithm is set as in the previous example.

The data test is generated as in Sect. 6.1. We also perform the test on two problem instances of size $(p, n, s) = (1000, 350, 100)$: non-correlated data and correlated data. We choose the regularization parameter $\lambda $ as suggested in [8]. We use the same setting for the smoothness parameter $\gamma $ in both algorithms as in Sect. 6.1. In this case, the theoretical upper bound of Algorithm 1 depends on the $\log $ term which is scaled by the condition number of ${B}{B}^{\top }$, and is worse than in Non-adapt. Alg. variant.

Figure 2 plots the empirical bounds of the 6 variants versus the theoretical bounds from 200 to 10, 000 iterations. Obviously, both algorithms show their empirical rate which is much better than their theoretical bound. Algorithm 1 gives a better performance than the nonadaptive method in this example. We note that the theoretical bound in Algorithm 1 remains non-optimal, while it is optimal in the nonadaptive one.

Finally, we compare Algorithm 1 with the variable smoothing algorithm in [11] (Boţ&Hendrich Alg.). Whlile the first term $f({x}) := \Vert {A}{x}- {b}\Vert _2$ is smoothened as in Algorithm 1, we smooth the second term $g({x}) := \lambda \Vert {x}\Vert _1$ as

$$\begin{aligned} g_{\beta }({x}) := \max _{{v}}\left\{ \langle {x},{v}\rangle - (\beta /2)\Vert {v}\Vert _2^2 : \Vert {v}\Vert _{\infty } \le 1\right\} . \end{aligned}$$

Then, we update $\gamma _k$ and $\beta _k$ as $\gamma _k = \frac{1}{c_a(k+1)}$ and $ \beta _k = \frac{1}{c_b(k+1)}$, respectively as suggested in [11], where $c_a$ and $c_b$ are two appropriate constants.

We compare Algorithm 1 and Boţ&Hendrich Alg. on a problem instance of size $(p, n, s) = (1000, 350, 100)$, where the data is generated as in the previous tests. To find an appropriate value of $c_a$ and $c_b$, we sweep $c_a\in [10, 5000]$ simultaneously with $c_b \in [0.001, 500]$. We obtain $c_a = 51$ and $c_b = 49$. For Algorithm 1, we consider two cases. In the first case, we set $\gamma _1 = \gamma _1^{*} = 129.5505$ computed from the worst-case bound, while in the second case, we also sweep $\gamma _1\in [10, 1000]$ to find an appropriate value $\gamma _1 = 51$. The results of both algorithms are plotted in Fig. 3 for 5000 iterations.

Figure 3 (left) shows that the objective value produced by Algorithm 1 does not vary much when $\gamma _1\in [10, 1000]$, while, in Boţ&Hendrich Alg., the objective value changes rapidly when we sweep on $c_a$ and $c_b$ simultaneously. Hence, it is unclear how to choose an appropriate value for $c_a$ and $c_b$ without sweeping. Figure 3 (right) shows the convergence behavior of both algorithms. Without sweeping, Algorithm 1 has a good empirical convergence rate in the early iterations. With sweeping, both algorithms perform better in the later iterations. Algorithm 1 has a better performance than Boţ&Hendrich Alg.

6.3 Image deblurring with the $\ell _1$ or $\ell _2$-data fidelity function

We consider an image deblurring problem using the $\ell _{\alpha }$-norm fidelity term as

$$\begin{aligned} \min _{{X}} \left\{ F({X}) := \Vert \mathcal {A}({X}) - {b}\Vert _{\alpha } + \lambda \Vert W{X}\Vert _1 : {X}\in \mathbb {R}^{m\times q} \right\} , \end{aligned}$$

(40)

where $\alpha \in \left\{ 1, 2\right\} $, ${A}: \mathbb {R}^p\rightarrow \mathbb {R}^p$ ($p =m\times q$) is a blurring kernel, ${b}$ is an observed noisy image, $W : \mathbb {R}^p\rightarrow \mathbb {R}^p$ is the orthogonal Haar wavelet transform with four levels, $\lambda > 0$ is the regularizer parameter.

Table 1 The PSNR values reported by the 8 algorithmic variants on the 5 test images

Full size table

We now apply Algorithm 1 (Alg. 1) to solve problem (40) and compare it with the nonadaptive variant (Nes. Alg.) and Boţ & Hendrich’s algorithm (BH Alg.) in [11]. Since ${A}$ is orthogonal, we can use the quadratic smoothing function as $b_{\mathcal {U}}({X}) := (1/2)\Vert {X}\Vert _F^2$. With this choice, we can compute the gradient of ${u}^{*}_{\gamma }({X})$ defined by (7) as ${u}^{*}_{\gamma }({X}) = \mathrm {proj}_{\mathcal {B}^{*}_{\alpha }}(\gamma ^{-1}(\mathcal {A}({X}) - {b}))$, where $\mathrm {proj}_{\mathcal {B}_{\alpha }}$ is the projection onto the dual norm ball $\mathcal {B}^{*}_{\alpha }$ of the $\ell _{\alpha }$-norm.

We test three algorithms on the five images: cameraman, barbara, lena, boat and house widely used in the literature. The noisy images are generated as in [4]. Although we use the non-smooth $\ell _{\alpha }$-norm function with $\alpha =1$ or $\alpha = 2$, the regularization parameter $\lambda $ is set to $\lambda := 10^{-4}$ as suggested in [4], but it still provides the best recovery compared to other values in all 5 images.

While we fix $\gamma _1 = 62$ in Algorithm 1 which is roughly computed from the worst-case bound, we sweep $\gamma $ and $c_a$ (see Sect. 6.2) in [0.0001, 1000] to choose the best possible value for Nes. Alg. and BH Alg. in each image (with 300 iterations). We also set $c_b = c_a$ as suggested in [11]. For Nes. Alg., we have $\gamma = 1$ in the boat image, while in the other 4 images, $\gamma = 2.5$ is the best value. For BH Alg., we have $c_a = 0.005$ in the cameraman, barbara and boat images, and $c_a = 0.0025$ in the lena and house images. The PSNR (Peak Signal to Noise Ratio [4]) of the 8 algorithms are reported in Table 1.

It shows that the nonsmooth $\ell _1$-norm objective produces slightly better recovery images in terms of PSNR than the $\ell _2$-norm objective in many cases for Algorithm 1, but it is not the case in Nes. Alg. and BH Alg. In addition, Algorithm 1 is superior to Nes. Alg. in all cases, and is also better than BH Alg. in the majority of the test. We note that the complexity-per-iteration of the four algorithms are essentially the same, while our new adaptive strategy produces better solutions in terms of PSNR than the other two methods. In addition, our algorithm significantly improves the PSNR if we run it further, while the nonadaptive variant does not make any clear progress on the PSNR value if we continue running it. If we sweep the values of $\gamma _1$ in Algorithm 1 ($\gamma _1$-sweeping), we can also improve the results of this algorithm.

References

Andreas, A., Marco, S., Suykens, J.: Hybrid conditional gradient-smoothing algorithms with applications to sparse and low rank regularization. In: Andreas, A., Marco, S., Suykens, J. (eds.) Regularization, Optimization, Kernels, and Support Vector Machines, pp. 53–82. CRC Press, Boca Raton (2014)
Google Scholar
Baes, M., and Bürgisser, M.: Smoothing techniques for solving semi-definite programs with many constraints. Optimization Online (2009)
Bauschke, H.H., Combettes, P.: Convex Analysis and Monotone Operators Theory in Hilbert Spaces. Springer, New York (2011)
Book MATH Google Scholar
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding agorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Article MathSciNet MATH Google Scholar
Beck, A., Teboulle, M.: Smoothing and first order methods: a unified framework. SIAM J. Optim. 22(2), 557–580 (2012)
Article MathSciNet MATH Google Scholar
Becker, S., Bobin, J., Candès, E.J.: NESTA: a fast and accurate first-order method for sparse recovery. SIAM J. Imaging Sci. 4(1), 1–39 (2011)
Article MathSciNet MATH Google Scholar
Becker, S., Candès, E.J., Grant, M.: Templates for convex cone problems with applications to sparse signal recovery. Math. Prog. Comput. 3(3), 165–218 (2011)
Article MathSciNet MATH Google Scholar
Belloni, A., Chernozhukov, V., Wang, L.: Square-root LASSO: pivotal recovery of sparse signals via conic programming. Biometrika 94(4), 791–806 (2011)
Article MathSciNet MATH Google Scholar
Ben-Tal, A., and Nemirovski, A.: Lectures on modern convex optimization: Analysis, algorithms, and engineering applications, volume 3 of MPS/SIAM Series on Optimization. SIAM (2001)
Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation: Numerical methods. Prentice Hall, Englewood Cliffs (1989)
MATH Google Scholar
Boţ, R., Hendrich, C.: A variable smoothing algorithm for solving convex optimization problems. TOP 23(1), 124–150 (2012)
MathSciNet MATH Google Scholar
Bot, R.I., Hendrich, C.: A double smoothing technique for solving unconstrained nondifferentiable convex optimization problems. Comput. Optim. Appl. 54(2), 239–262 (2013)
Article MathSciNet MATH Google Scholar
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
Article MATH Google Scholar
Chen, J., Burer, S.: A first-order smoothing technique for a class of large-scale linear programs. SIAM J. Optim. 24(2), 598–620 (2014)
Article MathSciNet MATH Google Scholar
Combettes, P., Pesquet, J.-C.: Fixed-Point Algorithms for Inverse Problems in Science and Engineering, Chapter Proximal Splitting Methods in Signal Processing, pp. 185–212. Springer, New York (2011)
Book MATH Google Scholar
Devolder, O., Glineur, F., Nesterov, Y.: Double smoothing technique for large-scale linearly constrained convex optimization. SIAM J. Optim. 22(2), 702–727 (2012)
Article MathSciNet MATH Google Scholar
Goldfarb, D., Ma, S.: Fast alternating linearization methods of minimization of the sum of two convex functions. Math. Prog. A 141(1), 349–382 (2012)
MATH Google Scholar
Grant, M., Boyd, S., Ye, Y.: Disciplined convex programming. In: Liberti, L., Maculan, N. (eds.) Global Optimization: From Theory to Implementation, Nonconvex Optimization and Its Applications, pp. 155–210. Springer, New York (2006)
Chapter Google Scholar
Necoara, I., Suykens, J.A.K.: Applications of a smoothing technique to decomposition in convex optimization. IEEE Trans. Autom. Control 53(11), 2674–2679 (2008)
Article MathSciNet Google Scholar
Nedelcu, V., Necoara, I., Tran-Dinh, Q.: Computational complexity of inexact gradient augmented lagrangian methods: application to constrained MPC. SIAM J. Optim. Control 52(5), 3109–3134 (2014)
Article MathSciNet MATH Google Scholar
Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence $o(1/k^2)$. Doklady AN SSSR 269 (Soviet Math. Dokl.), 543–547 (1983)
Nesterov, Y.: Introductory lectures on convex optimization: a basic course. Applied Optimization, vol. 87. Kluwer Academic Publishers, Dordrecht (2004)
MATH Google Scholar
Nesterov, Y.: Excessive gap technique in nonsmooth convex minimization. SIAM J. Optim. 16(1), 235–249 (2005)
Article MathSciNet MATH Google Scholar
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Prog. 103(1), 127–152 (2005)
Article MathSciNet MATH Google Scholar
Nesterov, Y.: Smoothing technique and its applications in semidefinite optimization. Math. Prog. 110(2), 245–259 (2007)
Article MathSciNet MATH Google Scholar
Nesterov, Y.: Gradient methods for minimizing composite objective function. Math. Prog. 140(1), 125–161 (2013)
Article MathSciNet MATH Google Scholar
Orabona, F., Argyriou, A., and Srebro, N.: PRISMA: proximal iterative smoothing algorithm. Tech. Report., pp. 1–21 (2012). http://arxiv.org/abs/1206.2372
Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 123–231 (2013)
Google Scholar
Rockafellar, R.T.: Monotropic programming: a generalization of linear programming and network programming. In: Rockafellar, R.T. (ed.) Convexity and Duality in Optimization, pp. 10–36. Springer, New York (1985)
Chapter Google Scholar
Tran-Dinh, Q.: Sequential Convex Programming and Decomposition Approaches for Nonlinear Optimization. PhD Thesis, Arenberg Doctoral School, KU Leuven, Department of Electrical Engineering (ESAT/SCD) and Optimization in Engineering Center, Kasteelpark Arenberg 10, 3001-Heverlee, Belgium (2012)
Tran-Dinh, Q., Cevher, V.: A primal-dual algorithmic framework for constrained convex minimization, pp. 1–54. Tech. Report, LIONS (2014)
Tran-Dinh, Q., Kyrillidis, A., Cevher, V.: Composite self-concordant minimization. J. Mach. Learn. Res. 15, 374–416 (2015)
MathSciNet MATH Google Scholar
Tran-Dinh, Q., Savorgnan, C., Diehl, M.: Combining Lagrangian decomposition and excessive gap smoothing technique for solving large-scale separable convex optimization problems. Compt. Optim. Appl. 55(1), 75–111 (2013)
Article MathSciNet MATH Google Scholar
Yang, J., Zhang, Y.: Alternating direction algorithms for $\ell _1$ -problems in compressive sensing. SIAM J. Sci. Comput. 33(1–2), 250–278 (2011)
Article MathSciNet Google Scholar

Download references

Acknowledgments

This research was partially supported by NSF, Grant No. 161-9884, and NAFOSTED, Grant No. 101.01-2014-24.

Author information

Authors and Affiliations

Department of Statistics and Operations Research, University of North Carolina at Chapel Hill (UNC), Chapel Hill, USA
Quoc Tran-Dinh

Authors

Quoc Tran-Dinh
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Quoc Tran-Dinh.

Appendix: the proof of technical results

This appendix provides the full proof of the technical results presented in the main text.

1.1 The proof of Lemma 2: descent property of the proximal gradient step

By using (9) with $f_{\gamma }({x}) := \varphi ^{*}_{\gamma }({A}^{\top }{x})$, $\nabla {f_{\gamma }}(\bar{{x}}) = {A}\nabla {\varphi _{\gamma }^{*}}({A}^{\top }\bar{{x}})$, ${z}:= {A}^{\top }{x}$, $\bar{{z}} := {A}^{\top }\bar{{x}}$, and $\Vert {A}^{\top }({x}- \bar{{x}})\Vert \le \Vert {A}\Vert \Vert {x}- \bar{{x}}\Vert $ we can show that

$$\begin{aligned} \frac{\gamma }{2}\Vert {u}^{*}_{\gamma }({A}^{\top }{x}) - {u}^{*}_{\gamma }({A}^{\top }\bar{{x}})\Vert ^2 \le f_{\gamma }({x}) - f_{\gamma }(\bar{{x}}) - \langle \nabla {f_{\gamma }}({x}), {x}- \bar{{x}}\rangle \le \frac{\Vert {A}\Vert ^2}{2\gamma }\Vert {x}- \hat{{x}}\Vert ^2. \end{aligned}$$

Using this estimate, we can show that the proof of (14) can be done similarly as in [31]. $\square $

1.2 The proof of Lemma 3: key estimate

We first substitute $\beta = \frac{\gamma _{k+1}}{\Vert {A}\Vert ^2}$ into (14) and using (15) to obtain

$$\begin{aligned} F_{\gamma _{k+1}}({x}^{k+1})\le & {} F_{\gamma _{k+1}}({x}^k) - \frac{\gamma _{k+1}}{2}\Vert {u}^{*}_{\gamma _{k+1}}({A}^{\top }{x}^k) - {u}^{*}_{\gamma _{k+1}}({A}^{\top }\hat{{x}}^k) \Vert ^2 \nonumber \\&+ \, \frac{\Vert {A}\Vert ^2}{\gamma _{k+1}}\left\langle {x}^{k+1} - \hat{{x}}^k, {x}- \bar{{x}}^k\right\rangle -\frac{\Vert {A}\Vert ^2}{2\gamma _{k+1}}\Vert \hat{{x}}^k - {x}^{k+1}\Vert ^2. \end{aligned}$$

Multiplying this inequality by $(1-\tau _k)$ and (14) by $\tau _k$, and summing up the results we obtain

$$\begin{aligned} F_{\gamma _{k+1}}({x}^{k+1})\le & {} (1 - \tau _k)F_{\gamma _{k+1}}({x}^k) + \tau _k\hat{\ell }^k_{\gamma _{k+1}}({x}) \nonumber \\&-\frac{(1-\tau _k)\gamma _{k+1}}{2}\Vert {u}^{*}_{\gamma _{k+1}}({A}^{\top }{x}^k) - {u}^{*}_{\gamma _{k+1}}({A}^{\top }\hat{{x}}^k) \Vert ^2\nonumber \\&+ \, \frac{\Vert {A}\Vert ^2}{\gamma _{k+1}}\left\langle \hat{{x}}^k - {x}^{k+1}, \hat{{x}}^k - (1-\tau _k){x}^k - \tau _k{x}\right\rangle -\frac{\Vert {A}\Vert ^2}{2\gamma _{k+1}}\Vert \hat{{x}}^k - {x}^{k+1}\Vert ^2, \end{aligned}$$

where $\hat{\ell }_{\gamma }^k({x}) := f_{\gamma }(\hat{{x}}^k) + \langle \nabla {f_{\gamma }}(\hat{{x}}^k), {x}- \hat{{x}}^k\rangle + g({x})$.

From (16), we have $\tau _k\tilde{{x}}^k = \hat{{x}}^k - (1-\tau _k){x}^k$, we can write this inequality as

$$\begin{aligned} F_{\gamma _{k+1}}({x}^{k+1})&\le (1 - \tau _k)F_{\gamma _{k+1}}({x}^k) + \tau _k\hat{\ell }^k_{\gamma _{k+1}}({x})\nonumber \\&\quad -\frac{(1-\tau _k)\gamma _{k+1}}{2}\Vert {u}^{*}_{\gamma _{k+1}}({A}^{\top }{x}^k) - {u}^{*}_{\gamma _{k+1}}({A}^{\top }\hat{{x}}^k) \Vert ^2\nonumber \\&\quad + \, \frac{\Vert {A}\Vert ^2\tau _k^2}{2\gamma _{k+1}}\Big [ \Vert \tilde{{x}}^k - {x}\Vert ^2 - \Vert \tilde{{x}}^k - \tau _k^{-1}(\hat{{x}}^k - {x}^{k+1}) - {x}\Vert ^2 \Big ]. \end{aligned}$$

(41)

Using (10) with $\gamma := \gamma _{k+1}$, $\hat{\gamma } := \gamma _k$ and ${z}:= {A}^{\top }{x}^k$, we get

$$\begin{aligned} f_{\gamma _{k+1}}({x}^k) \le f_{\gamma _k}({x}^k) + (\gamma _k - \gamma _{k+1})b_{\mathcal {U}}({u}^{*}_{\gamma _{k+1}}({A}^{\top }{x}^k)), \end{aligned}$$

which leads to (cf: $F_{\gamma } = f_{\gamma } + g$):

$$\begin{aligned} F_{\gamma _{k+1}}({x}^k) \le F_{\gamma _k}({x}^k) + (\gamma _k - \gamma _{k+1})b_{\mathcal {U}}({u}^{*}_{\gamma _{k+1}}({A}^{\top }{x}^k)). \end{aligned}$$

(42)

Next, we estimate $\hat{\ell }_{\gamma _{k+1}}^k$. Using the definition of $f_{\gamma }$ and $\nabla {f}_{\gamma }$, we can deduce

$$\begin{aligned} {}\hat{\ell }_{\gamma _{k+1}}^k({x}):= & {} f_{\gamma _{k+1}}(\hat{{x}}^k) + \left\langle \nabla {f_{\gamma _{k+1}}}(\hat{{x}}^k), {x}- \hat{{x}}^k\right\rangle + g({x}) \nonumber \\= & {} \big \langle \hat{{x}}^k, {A}^{\top }{u}^{*}_{\gamma _{k+1}}({A}^{\top }\hat{{x}}^k)\big \rangle - \varphi ({u}^{*}_{\gamma _{k+1}}({A}^{\top }\hat{{x}}^k)) - \gamma _{k+1}b_{\mathcal {U}}({u}^{*}_{\gamma _{k+1}}({A}^{\top }\hat{{x}}^k)) \nonumber \\&+ \, \big \langle {x}- \hat{{x}}^k, {A}{u}^{*}_{\gamma _{k+1}}({A}^{\top }\hat{{x}}^k)\big \rangle + g({x}) \nonumber \\= & {} \big \langle {x}, {A}{u}^{*}_{\gamma _{k+1}}({A}^{\top }\hat{{x}}^k)\big \rangle - \varphi ({u}^{*}_{\gamma _{k+1}}({A}^{\top }\hat{{x}}^k)) - \gamma _{k+1}b_{\mathcal {U}}({u}^{*}_{\gamma _{k+1}}({A}^{\top }\hat{{x}}^k)) + g({x}) \nonumber \\\le & {} \displaystyle \max _{{u}}\left\{ \big \langle {x}, {A}{u}\big \rangle - \varphi ({u}) : {u}\in \mathcal {U}\right\} - \gamma _{k+1}b_{\mathcal {U}}({u}^{*}_{\gamma _{k+1}}({A}^{\top }\hat{{x}}^k)) + g({x}) \nonumber \\= & {} F({x}) - \gamma _{k+1}b_{\mathcal {U}}({u}^{*}_{\gamma _{k+1}}({A}^{\top }\hat{{x}}^k)). \end{aligned}$$

(43)

Substituting $\tilde{{x}}^{k+1} := \tilde{{x}}^k - \frac{1}{\tau _k}(\hat{{x}}^k - {x}^{k+1})$ from the third line of (16) together with (42), and (43) into (41), we can derive

$$\begin{aligned} F_{\gamma _{k+1}}({x}^{k+1})\le & {} (1- \tau _k)F_{\gamma _{k}}({x}^k) + \tau _kF({x}) \\&+ \frac{\Vert {A}\Vert ^2\tau _k^2}{2\gamma _{k+1}}\left[ \Vert \tilde{{x}}^k - {x}\Vert ^2 - \Vert \tilde{{x}}^{k+1} - {x}\Vert ^2\right] - R_k, \end{aligned}$$

which is indeed (18), where $R_k$ is given by (19).

Finally, we prove (20). Indeed, using the strong convexity and the $L_b$-smoothness of $b_{\mathcal {U}}$, we can lower bound

$$\begin{aligned} R_k\ge & {} \frac{\tau _k\gamma _{k+1}}{2}\Vert {u}^{*}_{\gamma _{k+1}}({A}^{\top }\hat{{x}}^k) - \bar{{u}}^c\Vert ^2 + \frac{(1-\tau _k)\gamma _{k+1}}{2}\Vert {u}^{*}_{\gamma _{k+1}}({A}^{\top }{x}^k) - {u}^{*}_{\gamma _{k+1}}({A}^{\top }\hat{{x}}^k)\Vert ^2\\&-\frac{L_b}{2}(1-\tau _k)(\gamma _k - \gamma _{k+1})\Vert {u}^{*}_{\gamma _{k+1}}({A}^{\top }{x}^k) - \bar{{u}}^c\Vert ^2. \end{aligned}$$

Letting $\hat{{v}}_k := {u}^{*}_{\gamma _{k+1}}({A}^{\top }\hat{{x}}^k) - \bar{{u}}^c$ and $ {v}_k := {u}^{*}_{\gamma _{k+1}}({A}^{\top }{x}^k) - \bar{{u}}^c$, we write $R_k$ as

$$\begin{aligned} 2\gamma _{k+1}^{-1}R_k\ge & {} \tau _k\Vert \hat{{v}}_k \Vert ^2 + (1-\tau _k)\Vert \hat{{v}}_k - {v}_k\Vert ^2 - (1-\tau _k)(\gamma _{k+1}^{-1}\gamma _k - 1)L_b\Vert {v}_k\Vert ^2 \\= & {} \Vert \hat{{v}}_k\Vert ^2 - 2(1-\tau _k)\langle \hat{{v}}_k ,{v}_k \rangle + (1-\tau _k)\big [1 - (\gamma _{k+1}^{-1}\gamma _k - 1)L_b\big ]\Vert {v}_k\Vert ^2 \\= & {} \Vert \hat{{v}}_k - (1-\tau _k){v}_k\Vert ^2 + (1-\tau _k)\left[ \tau _k - (\gamma _{k+1}^{-1}\gamma _k - 1)L_b\right] \Vert {v}_k\Vert ^2 \\\ge & {} (1-\tau _k)\left[ \tau _k - (\gamma _{k+1}^{-1}\gamma _k - 1)L_b\right] \Vert {v}_k\Vert ^2, \end{aligned}$$

which obviously implies (20). $\square $

1.3 The proof of Lemma 4: the choice of parameters

First, using the update rules of $\tau _k$ and $\gamma _k$ in (21), we can express the quantity $m_k$ as

$$\begin{aligned} m_k&:= \frac{\gamma _{k+1}(1-\tau _k)\left[ \gamma _{k+1}\tau _k - L_b(\gamma _k - \gamma _{k+1})\right] }{\tau _k^2} = -\frac{\gamma _1^2\bar{c}^2\left[ (L_b-1)(k + \bar{c}) + 1\right] }{(k+\bar{c})^2}. \end{aligned}$$

Moreover, it follows from the properties of $b_{\mathcal {U}}$ that

$$\begin{aligned} \frac{1}{2}\Vert {u}^{*}_{\gamma _{k+1}}({A}^{\top }{x}^k) -\bar{{u}}^c\Vert ^2 \le b_{\mathcal {U}}({u}^{*}_{\gamma _{k+1}}({A}^{\top }{x}^k)) \le D_{\mathcal {U}}. \end{aligned}$$

Multiplying the lower bound (20) by $\frac{\gamma _{k+1}}{\tau _k^2}$ and combining the result with the last inequality and the estimate of $m_k$, we obtain the first lower bound in (22).

Next, using the update rules (21) of $\tau _k$ and $\gamma _k$, we have $\frac{(1-\tau _k)\gamma _{k+1}}{\tau _k^2} = \frac{(k+\bar{c}-1)(k+\bar{c})^2\gamma _1\bar{c}}{(k+\bar{c})(k+\bar{c})} = \frac{\gamma _1\bar{c}(k+\bar{c}-1)^2}{(k+\bar{c}-1)} = \frac{\gamma _k}{\tau _{k-1}^2}$, which is the second equality in (22).

Using (18) with the lower bound of $R_k$ from (22), we have

$$\begin{aligned} \frac{\gamma _{k+1}}{\tau _k^2}\Delta {F}_{k+1} + \frac{\Vert {A}\Vert ^2}{2}\Vert \tilde{{x}}^{k+1} - {x}^{\star }\Vert ^2 \le \frac{(1 - \tau _k)\gamma _{k+1}}{\tau _k^2}\Delta {F}_k + \frac{\Vert {A}\Vert ^2}{2}\Vert \tilde{{x}}^k - {x}^{\star }\Vert ^2 + s_kD_{\mathcal {U}}, \end{aligned}$$

(44)

where $\Delta {F}_k := F_{\gamma _k}({x}^k) - F^{\star }$ and $s_k := \frac{\gamma _1^2\bar{c}^2\left[ (L_b-1)(k + \bar{c}) + 1\right] }{(k + \bar{c})^2}$. Using this inequality and the relation $\frac{(1-\tau _k)\gamma _{k+1}}{\tau _k^2} = \frac{\gamma _k}{\tau _{k-1}^2}$ in (22), we can easily show that

$$\begin{aligned} \frac{\gamma _{k+1}}{\tau _k^2}\Delta {F}_{k+1} + \frac{\Vert {A}\Vert ^2}{2}\Vert \tilde{{x}}^{k+1} - {x}^{\star }\Vert ^2 \le \frac{\gamma _{k}}{\tau _{k-1}^2}\Delta {F}_k + \frac{\Vert {A}\Vert ^2}{2}\Vert \tilde{{x}}^k - {x}^{\star }\Vert ^2 + s_kD_{\mathcal {U}}. \end{aligned}$$

By induction, we obtain from the last inequality that

$$\begin{aligned} \frac{\gamma _{k+1}}{\tau _k^2}\Delta {F}_{k+1} + \frac{\Vert {A}\Vert ^2}{2}\Vert \tilde{{x}}^{k+1} - {x}^{\star }\Vert ^2 \le \frac{(1-\tau _0)\gamma _1}{\tau _0^2}\Delta {F}_0 + \frac{\Vert {A}\Vert ^2}{2}\Vert \tilde{{x}}^0 - {x}^{\star }\Vert ^2 + S_kD_{\mathcal {U}}, \end{aligned}$$

(45)

which implies (23), where $S_k := \sum _{i=0}^ks_k = \gamma _1^2\bar{c}^2\sum _{i=0}^k\frac{\left[ (L_b-1)(i + \bar{c}) + 1\right] }{(i + \bar{c})^2}$.

Finally, to prove (24), we use two elementary inequalities $\sum _{i=1}^{k+\bar{c}}\frac{1}{i} < 1 + \ln (k+\bar{c})$ and $\sum _{i=0}^k\frac{1}{(i+\bar{c})^2} \le \frac{1}{\bar{c}^2} + \sum _{i=1}^k\frac{1}{(i+\bar{c}-1)(i+\bar{c})} < \frac{1}{\bar{c}^2} + \frac{1}{\bar{c}}$. $\square $

1.4 The proof of Corollary 1: the smooth accelerated gradient method

First, it is similar to the proof of (44), we can derive

$$\begin{aligned}&\frac{\gamma _{k+1}}{(L_g\gamma _{k+1} + \Vert {A}\Vert ^2)\tau _k^2}\Delta {F}_{k+1} + \frac{1}{2}\Vert \tilde{{x}}^{k+1} - {x}^{\star }\Vert ^2 \nonumber \\&\quad \le \frac{(1 - \tau _k)\gamma _{k+1}}{(L_g\gamma _{k+1} + \Vert {A}\Vert ^2)\tau _k^2}\Delta {F}_k + \frac{1}{2}\Vert \tilde{{x}}^k - {x}^{\star }\Vert ^2 + \hat{s}_kD_{\mathcal {U}}, \end{aligned}$$

where $\Delta {F}_k := F_{\gamma _{k}}({x}^k) - F^{\star }$, and $\hat{s}_k := \frac{\gamma _{k+1}(1-\tau _k)\left[ L_b(\gamma _k - \gamma _{k+1}) - \gamma _{k+1}\tau _k\right] }{\tau _k^2(L_g\gamma _{k+1}+\Vert {A}\Vert ^2)}$.

Next, we impose condition $\frac{(1-\tau _k)\gamma _{k+1}}{\tau _k^2(L_g\gamma _{k+1} + \Vert {A}\Vert ^2)} = \frac{\gamma _k}{\tau _{k-1}^2(L_g\gamma _k + \Vert {A}\Vert ^2)}$ and choose $\tau _k = \frac{1}{k+1}$. Then, we can show from the last condition that $\gamma _{k+1} = \frac{k\gamma _k\Vert {A}\Vert ^2}{L_g\gamma _k + \Vert {A}\Vert ^2(k+1)}$. Now, we show that $\gamma _k \le \frac{\gamma _1}{k+1}$. Indeed, we have $\frac{1}{\gamma _{k+1}} = \left( \frac{k+1}{k}\right) \frac{1}{\gamma _k} + \frac{L_g}{\Vert {A}\Vert ^2k} \ge \left( \frac{k+1}{k}\right) \frac{1}{\gamma _k}$, which implies that $\gamma _{k+1} \le \frac{k}{k+1}\gamma _k$. By induction, we get $\gamma _{k+1} \le \frac{\gamma _1}{k+1}$. On the other hand, assume that $\frac{1}{\gamma _{k+1}} = \left( \frac{k+1}{k}\right) \frac{1}{\gamma _k} + \frac{L_g}{\Vert {A}\Vert ^2k} \le \frac{1}{\gamma _k} \left( \frac{k}{k-1}\right) $ for $k\ge 2$. This condition leads to $\gamma _k \le \frac{\Vert {A}\Vert ^2}{L_g(k-1)}$. Using $\gamma _k \le \frac{\gamma _1}{k} = \frac{\Vert {A}\Vert ^2}{L_gk}$ due to the choice of $\gamma _1$, we can show that $\gamma _k \le \frac{\Vert {A}\Vert ^2}{L_g(k-1)}$. Hence, with the choice $\gamma _1 := \frac{\Vert {A}\Vert ^2}{L_g}$, the estimate $\frac{1}{\gamma _{k+1}} \le \frac{1}{\gamma _k} \left( \frac{k}{k-1}\right) $ and the update rule of $\gamma _k$ eventually imply

$$\begin{aligned} \frac{\gamma _1\Vert {A}\Vert ^2}{(L_g\gamma _1 + 2\Vert {A}\Vert ^2)k} = \frac{\gamma _2}{k} \le \gamma _{k+1} \le \frac{\gamma _1}{k+1},~~\forall k\ge 1. \end{aligned}$$

This condition leads to $\frac{\tau _{k-1}^2(L_g\gamma _k + \Vert {A}\Vert ^2)}{\gamma _k} = L_g\tau _{k-1}^2 + \frac{\tau _{k-1}^2}{\gamma _k}\Vert {A}\Vert ^2 \le \frac{L_g}{k^2} + \frac{3L_g(k-1)}{k^2} = \frac{3L_g}{k}$. Using the estimates of $\tau _k$ and $\gamma _k$, we can easily show that $\hat{s}_k \le \frac{L_b\Vert {A}\Vert ^2}{L_g^2k(k+2)} + \frac{L_b\Vert {A}\Vert ^2}{L_g^2(k+1)(k+2)} + \frac{(L_b-1)\Vert {A}\Vert ^2k}{L_g(k+1)(k+2)}$. Hence, we can prove that

$$\begin{aligned} \hat{S}_k:= & {} \sum _{i=0}^k\hat{s}_i \le \frac{2L_b\Vert {A}\Vert ^2}{L_g^2} + \frac{(L_b-1)\Vert {A}\Vert ^2}{L_g^2}\sum _{i=0}^k\frac{1}{i+1} \nonumber \\= & {} \frac{2L_b\Vert {A}\Vert ^2}{L_g^2} + \frac{(L_b-1)\Vert {A}\Vert ^2}{L_g^2}\left( \ln (k) + 1\right) . \end{aligned}$$

Using this estimate, we can show that

$$\begin{aligned} F_{\gamma _k}({x}^k) - F^{\star }\le & {} \frac{3L_g}{2k}\Vert {x}^0 - {x}^{\star }\Vert ^2 + \frac{2L_b\Vert {A}\Vert ^2}{L_g^2k}D_{\mathcal {U}} \nonumber \\&+ \frac{(L_b-1)\Vert {A}\Vert ^2}{L_g^2k}\left( \ln (k) + 1\right) D_{\mathcal {U}}. \end{aligned}$$

Finally, using the bound (11) and $\gamma _k \le \frac{\Vert {A}\Vert ^2}{L_g(k+1)} < \frac{\Vert {A}\Vert ^2}{L_gk}$, we obtain (32). $\square $

1.5 The proof of Theorem 2: primal solution recovery

Let $\Delta {F}_k := F_{\gamma _{k}}({x}^k) - F^{\star }$. Then, by (11), we have $\Delta {F}_k \ge F({x}^k) - F^{\star }- \gamma _kD_{\mathcal {U}} \ge -\gamma _kD_{\mathcal {U}}$. Similar to the proof of Lemma 4, we can prove that

$$\begin{aligned} \frac{\gamma _{i+1}}{\tau _i^2}\Delta {F}_{i+1}\le & {} \frac{\gamma _i}{\tau _{i-1}^2}\Delta {F}_i + \frac{\gamma _{i+1}}{\tau _i}\Delta {\hat{\ell }}_{\gamma _{i+1}}({x})\nonumber \\&+ \frac{\Vert {A}\Vert ^2}{2}\left( \Vert \tilde{{x}}^i - {x}\Vert ^2 - \Vert \tilde{{x}}^{i+1} - {x}\Vert ^2\right) + s_iD_{\mathcal {U}}, \end{aligned}$$

(46)

where $s_i := \frac{\left[ (L_b-1)(i + \bar{c}) + 1\right] }{(i + \bar{c})^2}$ as in the proof of Lemma 4, and $\Delta {\hat{\ell }}_{\gamma _{k+1}}^k({x}) = \big \langle {x}, {A}{u}^{*}_{\gamma _{k+1}}(\hat{{x}}^k)\big \rangle - \varphi ({u}^{*}_{\gamma _{k+1}}(\hat{{x}}^k)) + g({x}) - F^{\star }= \big \langle {x}, {A}{u}^{*}_{\gamma _{k+1}}(\hat{{x}}^k) - {b}\big \rangle - \varphi ({u}^{*}_{\gamma _{k+1}}(\hat{{x}}^k)) + s_{\mathcal {K}}({x}) - F^{\star }$. Summing up this inequality from $i=1$ to $i=k$ and using $\tau _0 = 1$ and $\tilde{{x}}^0 ={x}^0$, we obtain

$$\begin{aligned} {}\frac{\gamma _{k+1}}{\tau _k^2}\Delta {F}_{k+1}\le & {} \sum _{i=1}^k\frac{\gamma _{i+1}}{\tau _i}\Delta {\hat{\ell }}_{\gamma _{i+1}}({x}) + \frac{\Vert {A}\Vert ^2}{2}\left( \Vert \tilde{{x}}^1 - {x}\Vert ^2 - \Vert \tilde{{x}}^{k+1} - {x}\Vert ^2\right) \nonumber \\&+\,\gamma _1\Delta {F}_1 + S_kD_{\mathcal {U}},{} \end{aligned}$$

(47)

where $S_k := \sum _{i=1}^ks_i$. Now, using again (46) with $k=1$, ${x}^0 = \tilde{{x}}^0$ and $\tau _0 = 1$, we get $\gamma _1\Delta {F}_1 \le \gamma _1\tau _0\Delta {\hat{\ell }}_{\gamma _1}(x) + \frac{\Vert {A}\Vert ^2}{2}\left( \Vert {x}^0 - {x}^{\star }\Vert ^2 - \Vert \tilde{{x}}^1 - {x}^{\star }\Vert ^2\right) $. Using this into (47), one yields

$$\begin{aligned} \frac{\gamma _{k+1}}{\tau _k^2}\Delta {F}_{k+1}\le & {} \sum _{i=0}^k\frac{\gamma _{i+1}}{\tau _i}\left( \big \langle {x}, {A}{u}^{*}_{\gamma _{i+1}}(\hat{{x}}^i) - {b}\big \rangle - \varphi ({u}^{*}_{\gamma _{i+1}}(\hat{{x}}^i)) + s_{\mathcal {K}}({x})- F^{\star }\right) \nonumber \\&+ \frac{\Vert {A}\Vert ^2}{2}\Vert {x}^0 - {x}\Vert ^2 + S_kD_{\mathcal {U}}. \end{aligned}$$

(48)

Combining $\bar{{u}}^k$ defined by (35) with $w_i :=\frac{\gamma _{i+1}}{\tau _i}$, and the convexity of $\varphi $, we have

$$\begin{aligned} \sum _{i=0}^k\frac{\gamma _{i+1}}{\tau _i}\left( \big \langle {x}, {A}{u}^{*}_{\gamma _{i+1}}(\hat{{x}}^i) - {b}\big \rangle - \varphi ({u}^{*}_{\gamma _{i+1}}(\hat{{x}}^i))\right) \le \varGamma _k\left( \big \langle {x}, {A}\bar{{u}}^k - {b}\big \rangle - \varphi (\bar{{u}}^k)\right) . \end{aligned}$$

Substituting this into (48) and then using $\Delta {F}_{k+1} \ge -\gamma _{k+1}D_{\mathcal {U}}$ we get

$$\begin{aligned} -\frac{\gamma _{k+1}^2}{\tau _k^2}D_{\mathcal {U}} \le \varGamma _k\left( \big \langle {x}, {A}\bar{{u}}^k - {b}\big \rangle - \varphi (\bar{{u}}^k) + s_{\mathcal {K}}({x}) - F^{\star }\right) + \frac{\Vert {A}\Vert ^2}{2}\Vert {x}^0 - {x}\Vert ^2 \!+\! S_kD_{\mathcal {U}}, \end{aligned}$$

which implies

$$\begin{aligned} F^{\star }\le \big \langle {x}, {A}\bar{{u}}^k - {b}\big \rangle - \varphi (\bar{{u}}^k) + s_{\mathcal {K}}({x}) + \frac{\Vert {A}\Vert ^2}{2\varGamma _k}\Vert {x}^0 - {x}\Vert ^2 + \frac{D_{\mathcal {U}}}{\varGamma _k}\left( \frac{\gamma _{k+1}^2}{\tau _k^2} + S_k\right) . \end{aligned}$$

By arranging this inequality, we get

$$\begin{aligned} \inf _{{r}\in \mathcal {K}}\big \langle {x}, {b}- {A}\bar{{u}}^k - {r}\big \rangle + \varphi (\bar{{u}}^k) \le -F^{\star }+ \frac{\Vert {A}\Vert ^2}{2\varGamma _k}\Vert {x}^0 - {x}\Vert ^2 + \frac{D_{\mathcal {U}}}{\varGamma _k}\left( \frac{\gamma _{k+1}^2}{\tau _k^2} + S_k\right) , \end{aligned}$$

(49)

where we use the relation $-s_{\mathcal {K}}({x}) = -\sup _{{r}\in \mathcal {K}}\langle {x}, {r}\rangle = \inf _{{r}\in \mathcal {K}}\langle {x}, -{r}\rangle $. On the other hand, by the saddle point theory for the primal and dual problems (33) and (34), for any optimal solution ${x}^{\star }$, we can show that

$$\begin{aligned} -F^{\star } = \varphi ^{\star } \le \varphi ({u}) - \langle {x}^{\star }, {A}{u}- {b}+ {r}\rangle , ~\forall {u}\in \mathcal {U},~{r}\in \mathcal {K}. \end{aligned}$$

Since this inequality holds for any ${r}\in \mathcal {K}$ and ${u}\in \mathcal {U}$, by using ${u}= \bar{{u}}^k$, it leads to

$$\begin{aligned} \inf _{{r}\in \mathcal {K}}\big \langle {x}^{\star }, {A}\bar{{u}}^k - {b}+ {r}\big \rangle - \varphi (\bar{{u}}^k) \le F^{\star }. \end{aligned}$$

(50)

Combining (49) and (50) yields

$$\begin{aligned}&\min _{{r}\in \mathcal {K}}\Big \{\big \langle {x}^{\star }- {x}, {r}+ {A}\bar{{u}}^k - {b}\big \rangle - \frac{\Vert {A}\Vert ^2}{2\varGamma _k}\Vert {x}^0 - {x}\Vert ^2\Big \} \le \frac{D_{\mathcal {U}}}{\varGamma _k}\left( \frac{\gamma _{k+1}^2}{\tau _k^2} + S_k\right) , ~~\forall {x}\in \mathbb {R}^p.\nonumber \\ \end{aligned}$$

(51)

Taking ${x}:= {x}^0 - \Vert {A}\Vert ^{-2}\varGamma _k({A}\bar{{u}}^k - {b}+ {r})$ for any ${r}\in \mathcal {K}$, we obtain from (51) that

$$\begin{aligned} \min _{{r}\in \mathcal {K}}\left\{ \frac{\varGamma _k}{\Vert {A}\Vert ^2}\Vert {A}\bar{{u}}^k + {r}- {b}\Vert ^2 + 2\big \langle {A}\bar{{u}}^k - {b}+ {r}, {x}^{\star }- {x}^0\big \rangle \right\} \le \frac{2D_{\mathcal {U}}}{\varGamma _k}\left( \frac{ \gamma _{k+1}^2}{\tau _k^2} + S_k\right) , \end{aligned}$$

which implies (by the Cauchy–Schwarz inequality)

$$\begin{aligned}&\min _{{r}\in \mathcal {K}}\left\{ \varGamma _k\Vert {A}\bar{{u}}^k - {b}+ {r}\Vert ^2 - 2\Vert {A}\Vert ^2\Vert {A}\bar{{u}}^k - {b}+ {r}\Vert \Vert {x}^{\star }- {x}^0\Vert \right\} \\&\quad \le \frac{2\Vert {A}\Vert ^2D_{\mathcal {U}}}{\varGamma _k}\left( \frac{\gamma _{k+1}^2}{\tau _k^2} + S_k\right) . \end{aligned}$$

By elementary calculations and $\mathrm {dist}\left( {b}- {A}\bar{{u}}^k, \mathcal {K}\right) = \inf \left\{ \Vert {A}\bar{{u}}^k-{b}+ {r}\Vert : {r}\in \mathcal {K}\right\} $, we can show from the last inequality that

$$\begin{aligned} \mathrm {dist}\left( {b}- {A}\bar{{u}}^k, \mathcal {K}\right) \le \frac{\Vert {A}\Vert ^2}{\varGamma _k}\Big [\Vert {x}^0 - {x}^{\star }\Vert + \sqrt{\Vert {x}^0 - {x}^{\star }\Vert ^2 + \frac{2}{\Vert {A}\Vert ^2}\Big ( S_k+\frac{\gamma _{k+1}^2}{\tau _k^2}\Big )D_{\mathcal {U}}}\Big ]. \end{aligned}$$

(52)

To prove the first estimate of (37), we use (49) with ${x}= \varvec{0}^p$ and $F^{\star } = -\varphi ^{\star }$ to get

$$\begin{aligned} \varphi (\bar{{u}}^k) - \varphi ^{\star } \le \frac{1}{\varGamma _k}\Big [\frac{\Vert {A}\Vert ^2}{2}\Vert {x}^0\Vert ^2 + \Big (\frac{\gamma _{k+1}^2}{\tau _k^2} + S_k\Big )D_{\mathcal {U}}\Big ]. \end{aligned}$$

(53)

Since we apply Algorithm 1(c) to solve the dual problem (34) using $b_{\mathcal {U}}$ such that $L_b = 1$, we have $S^k \le 2\gamma _1^2$. Then, by using $\gamma _{k+1} = \frac{\bar{c}\gamma _1}{k+\bar{c}}$, and $\tau _k := \frac{1}{k+\bar{c}}$, we can show that $\frac{\gamma _{k+1}^2}{\tau _k^2} = \gamma _1\bar{c}$. Moreover, we also have $\varGamma _k := \sum _{i=0}^k\frac{\gamma _{i+1}}{\tau _i} = \gamma _1\bar{c}(k+1)$. Using these estimates, and $S_k \le 2\gamma _1^2$ from (4) into (52) and (53) we obtain (37). For the left-hand side inequality in the first estimate of (37), we use a simple bound $-\Vert {x}^{\star }\Vert \mathrm {dist}\left( {b}- {A}{u},\mathcal {K}\right) \le \varphi ({u}) - \varphi ^{\star }$ for ${u}= \bar{{u}}^k \in \mathcal {U}$ from the saddle point theory as in (50). $\square $

Rights and permissions

Reprints and permissions

About this article

Cite this article

Tran-Dinh, Q. Adaptive smoothing algorithms for nonsmooth composite convex minimization. Comput Optim Appl 66, 425–451 (2017). https://doi.org/10.1007/s10589-016-9873-6

Download citation

Received: 24 August 2015
Published: 15 September 2016
Issue Date: April 2017
DOI: https://doi.org/10.1007/s10589-016-9873-6

Keywords

Mathematics Subject Classification

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Adaptive smoothing algorithms for nonsmooth composite convex minimization

Abstract

Similar content being viewed by others

A class of exact penalty functions and penalty algorithms for nonsmooth constrained optimization problems

Solving Minimax Problems: Local Smoothing Versus Global Smoothing

Subgradient Smoothing Method for Nonsmooth Nonconvex Optimization

1 Introduction

2 Smoothing techniques via smooth proximity functions

Lemma 1

3 The adaptive smoothing algorithm and its convergence

Lemma 2

Lemma 3

Lemma 4

Theorem 1

Proof

Remark 1

4 Exploiting structures for special cases

4.1 Fenchel conjugate

4.2 Composite convex minimization with linear operator

4.3 The decomposable structure

4.4 The Lipschitz gradient structure

Corollary 1

5 Application to general constrained convex optimization

Theorem 2

6 Preliminarily numerical experiments

6.1 The \(\ell _1\text {-}\ell _1\)-regularized LASSO

6.2 Square-root LASSO

6.3 Image deblurring with the \(\ell _1\) or \(\ell _2\)-data fidelity function

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Appendix: the proof of technical results

Appendix: the proof of technical results

1.1 The proof of Lemma 2: descent property of the proximal gradient step

1.2 The proof of Lemma 3: key estimate

1.3 The proof of Lemma 4: the choice of parameters

1.4 The proof of Corollary 1: the smooth accelerated gradient method

1.5 The proof of Theorem 2: primal solution recovery

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification

Search

Navigation