1 Introduction

First-order methods for convex optimization have become a focus of active research during the past several years. Their low per-iteration complexity and improved convergence rates have made them a viable alternative to the higher-order, but expensive, interior-point methods. The rapid growth of convex optimization applications in signal processing and machine learning has further increased the popularity of first-order methods.

In this paper we are interested in the following so-called composite unconstrained convex programming problem:

$$\begin{aligned} (P) \qquad \min \{F(x)\equiv f(x)+g(x): x\in \varvec{R}^n\}, \end{aligned}$$
(1.1)

where \(f:\varvec{R}^n\rightarrow \varvec{R}\) and \(g:\varvec{R}^n\rightarrow \varvec{R}\) are both convex functions and only \(f\) is required to be smooth. Many problems that arise in, for example, compressed sensing and statistical learning are of this form. For instance, in compressed sensing \(g(x)=\Vert x\Vert _1\). This form also accommodates constrained problems with a simple constraint set \(X\) by setting \(g(x)\) to be an indicator function of \(x\in X\).

A class of first-order methods, pioneered by Nesterov [10] and referred to as accelerated gradient methods, enjoys an improved convergence rate compared to the classical gradient method. This class of methods has recently become exceptionally popular, and several variants have been developed (e.g., [1, 11, 12, 14]). In particular, we note that the so-called fast iterative shrinkage/thresholding algorithm (FISTA) of Beck and Teboulle [1], which enjoys the same improved convergence rate as Nesterov’s optimal gradient method for convex composite objective functions [9], is designed to solve problems of the form (1.1).

Like the first-order algorithms proposed in [9], FISTA computes an \(\epsilon \)-optimal solution in \(O(\sqrt{L(f)/\epsilon })\) steps, where \(L(f)\) is a bound on the Lipschitz constant for \(\nabla f(x)\). Hence, it is an optimal gradient method since this is the best complexity bound that one can obtain using only first-order information [8, 11]. In these accelerated gradient methods, a combination of previous points is used to compute the new point at each iteration. In [12, 14], these techniques together with smoothing techniques were applied to nonsmooth problems yielding optimal complexity results for such problems.

FISTA requires one gradient computation per iteration, unlike the method in [9]. Although the method in [9] employs a more general framework than FISTA, FISTA’s convergence rate does not follow from the analysis in [9]. On the other hand, FISTA and its analysis in [1] are very simple and intuitive, which has led to its being used in many applications. The principal drawback of FISTA compared to the method in [9] is the requirement that the estimates of the Lipschitz constant must be nondecreasing (equivalently, the prox parameters must be nonincreasing) from one iteration to the next. This can substantially limit the performance of FISTA when a large Lipschitz constant estimate is encountered early in the algorithm, since it forces the steps taken at that point, and at all subsequent iterates, to be very small.

Another very popular class of first-order methods for problems of the form (1.1) is the class of alternating direction methods (ADMs or ADMMs) [3]. This class of methods has been shown to be very effective in practice for many statistical learning applications, but for most of its variants the convergence rates are either unknown or worse than those of the accelerated gradient methods such as FISTA. A recently proposed fast alternating linearization method (FALM) [7] is an accelerated version of an ADM and was shown to have the same complexity estimates as FISTA under similar assumptions. It is applied to problems for which both \(f\) and \(g\) have a special structure that allows the proximity operators \(p_\mu ^f(y)\) and \(p_\mu ^g(y)\) (defined subsequently) to be easily computed. Here we introduce a backtracking (BKTR) variant of FALM, which allows for large steps and hence has better practical behavior. Moreover, considering backtracking within the FALM framework allows us to generalize the convergence rate results and substantially simplify the proofs in [7].

In addition, we discuss the concept of a local composite Lipschitz constant and show that with the use of a backtracking strategy the constants in complexity bounds for the methods in [1, 7], and [9] can be improved to depend on the average local composite Lipschitz constant for \(\nabla f(x)\) rather than the worst-case Lipschitz constant \(L(f)\), as in all previously derived bounds for accelerated first-order schemes. This result is potentially important in cases where the Lipschitz constant is very large at (or near) the solution but is moderate almost everywhere else. Important examples of such functions include smooth approximations of nonsmooth convex functions.

The paper is organized as follows. In the next section we introduce some preliminary results and definitions. In Sect. 3 we introduce and analyze the FISTA algorithm with full line search. In Sect. 3.1 we consider the more restricted line search scheme proposed in [9] and show that it can be applied to FISTA. We then discuss the complexity implications of this scheme. In Sect. 4 we extend our line search approach and our complexity analysis to the FALM method. We conclude in Sect. 5 with some computational results.

2 Preliminary Results

The following assumption is made throughout the paper:

Assumption 2.1

\(f:\varvec{R}^n\rightarrow \varvec{R}\) is a smooth convex function of the type \(C^{1,1}\), i.e., it is continuously differentiable and its gradient is Lipschitz continuous with Lipschitz constant \(L(f)\):

$$\begin{aligned} \Vert \nabla f(x)-\nabla f(y)\Vert \le L(f)\Vert x-y\Vert , \forall x,y\in \varvec{R}^n, \end{aligned}$$
(2.1)

where \(||\cdot ||\) stands for the standard Euclidean norm, unless specified otherwise.

We define the following notation.

Definition 2.1

$$\begin{aligned} Q_\mu (u,v)&:= f(v) + \langle u-v, \nabla f(v) \rangle + \frac{1}{2\mu } \Vert u-v\Vert ^2 + g(u),\end{aligned}$$
(2.2)
$$\begin{aligned} p_\mu (v)&:= \arg \min _u Q_\mu (u,v). \end{aligned}$$
(2.3)
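For a concrete instance (a minimal sketch, assuming the common choice \(g(x)=\rho \Vert x\Vert _1\) as in the compressed sensing problems considered later), \(p_\mu (v)\) reduces to a gradient step on \(f\) followed by soft thresholding:

```python
import numpy as np

def p_mu(v, grad_f_v, mu, rho):
    """Minimizer of Q_mu(., v) in (2.2)-(2.3) when g(x) = rho * ||x||_1:
    a gradient step on the smooth part f followed by soft thresholding
    (shrinkage) at level mu * rho."""
    step = v - mu * grad_f_v
    return np.sign(step) * np.maximum(np.abs(step) - mu * rho, 0.0)
```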

The following lemma from [1] plays a key role in our analysis.

Lemma 2.1

For any \(y,x\in \varvec{R}^n\) and \(\mu >0\), if

$$\begin{aligned} F(p_\mu (y))\le Q_\mu (p_\mu (y), y), \end{aligned}$$
(2.4)

where \(p_\mu (y)\) is given in Definition 2.1, then

$$\begin{aligned} 2\mu (F(x)\!-\!F(p_\mu (y))) \ge \Vert p_\mu (y)\!-\!y\Vert ^2 \!+\! 2\langle y\!-\!x, p_\mu (y)\!-\!y\rangle \!=\! \Vert p_\mu (y)\!-\!x\Vert ^2\!-\!\Vert y\!-\!x\Vert ^2. \end{aligned}$$
(2.5)

Proof

The proof can be found in [1]. \(\square \)

The following fact is well known and will be used in our theoretical analysis: if \(\mu \le 1/L(f)\), then (2.4) holds for any \(y\), and hence (2.5) holds for any \(x\) and \(y\).
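For completeness, this follows from the descent lemma applied to the smooth part \(f\): for any \(u\) and \(y\),

$$\begin{aligned} F(u)=f(u)+g(u)\le f(y)+\langle u-y,\nabla f(y)\rangle +\frac{L(f)}{2}\Vert u-y\Vert ^2+g(u)\le Q_\mu (u,y) \end{aligned}$$

whenever \(L(f)\le 1/\mu \); taking \(u=p_\mu (y)\) gives (2.4).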

2.1 Local Composite Lipschitz Constants

The lower bound on the value of the prox parameter \(\mu \) is a key ingredient in the complexity bounds for the algorithms that we consider in this paper. As we will see, the actual value of \(\mu \) at each iteration is determined via a backtracking line search so that condition (2.4) is satisfied. A lower bound on \(\mu \) follows from the facts that \(\mu \) is reduced by a constant factor at each line search step and that, as soon as \(\mu \le 1/L(f)\), condition (2.4) is satisfied and the line search terminates. On the other hand, condition (2.4) may be satisfied with a larger value of \(\mu \). As we will see from our results, increasing the lower bound on \(\mu \) at any of the iterations improves the complexity bound for the algorithms. In an effort to improve this lower bound, we now define a local composite Lipschitz constant \(L(f,g,y)\) whose value never exceeds that of \(L(f)\) but is often smaller.

As a first step consider any two vectors \(u,v\in \varvec{R}^n\), and let \([u,v]\) denote the set of points on a segment between \(u\) and \(v\), in other words, \([u,v]=\{x:\, x=\lambda u+(1-\lambda ) v, 0\le \lambda \le 1\}\). Let \(L_{[u,v]}(f)\) be the Lipschitz constant of \(\nabla f(x)\) restricted to \([u,v]\), i.e.,

$$\begin{aligned} \Vert \nabla f(x)-\nabla f(y)\Vert \le L_{[u,v]}(f)\Vert x-y\Vert , \forall x,y\in [u,v]. \end{aligned}$$

From simple calculus it follows that

$$\begin{aligned} f(u)\le f(v)+\nabla f(v)^\top (u-v)+\frac{L_{[u,v]}(f)}{2}\Vert u-v\Vert ^2. \end{aligned}$$
(2.6)

Note that the roles of \(u\) and \(v\) are interchangeable.

Definition 2.2

We call \(L(f,g,y)\) a local composite Lipschitz constant for \(\nabla f(x)\) at \(y\) if

$$\begin{aligned} \Vert \nabla f(x_1)\!-\!\nabla f(x_2)\Vert&\le L(f,g,y)\Vert x_1\!-\!x_2\Vert , \forall x_1, x_2\in [p_\mu (y), y], \ \forall \mu \le 1/L(f,g,y). \end{aligned}$$

In other words, \(L(f,g,y)\) is a local composite Lipschitz constant if it is a Lipschitz constant for \(\nabla f(x)\) restricted to the interval \([p_\mu (y), y]\) for any \( \mu \le 1/L(f,g,y)\).

The dependence of \(L(f,g,y)\) on \(g\) arises from the dependence of \(p_\mu (y)\) on \(g\); hence we use the term composite to emphasize this dependence. If \(g(x)\equiv 0\), then \(p_\mu (y)=y-\mu \nabla f(y)\) for any \(\mu \) and, hence, \(L(f,g,y)\) is a Lipschitz constant of \(\nabla f(x)\) restricted to the interval \([y, y-\frac{1}{L(f,g,y)} \nabla f(y)]\).

For the purposes of this paper we point to two key observations:

  • \(L(f,g,y)\le L(f)\) for all \(y\) and

  • From Definition 2.2 it follows that for any \(\mu \) and \(y\), such that \(\mu \le 1/L(f, g, y)\), (2.4) holds [by (2.6)], and hence (2.5) holds for the given \(y\) and any \(x\).

Let us now illustrate why \(1/L(f,g,y)\) may be a better estimate for the prox parameter \(\mu \) than \(1/L(f)\).

Example 2.1

Consider a compressed sensing or Lasso setting:

$$\begin{aligned} (P) \qquad \min \{F(x)\equiv \Vert Ax-b\Vert ^2+\rho \Vert x\Vert _1: x\in \varvec{R}^n\}. \end{aligned}$$
(2.7)

In this case \(f(x)\!=\!\Vert Ax-b\Vert ^2\) and \(L(f)\!=\!\Vert AA^\top \Vert _2\). Now consider a sparse vector \(\bar{y}\); without loss of generality assume that \(\bar{y}\!=\!(\bar{y}_1, \bar{y}_2)\) with \(\bar{y}_2\!=\!0\). Also, consider the gradient vector \(z\!=\!A^\top (A\bar{y}-b)\), and assume that \(\Vert z_2\Vert _\infty \le \rho \) (\(z_2\) is the subvector of \(z\) that corresponds to the subvector \(\bar{y}_2\)). In this case it is easy to see from the properties of the shrinkage operator that \(p_\mu (\bar{y})_2\!=\!0\) for all \(\mu >0\). This implies that \(x_2\!=\!0\) for any \(x\in [p_\mu (\bar{y}), \bar{y}]\), and hence \(L(f,g,\bar{y})\!=\!\Vert A_1A_1^\top \Vert _2\), where \(A_1\) is the subset of columns of \(A\) that correspond to the subvector \(\bar{y}_1\); this is clearly smaller than \(L(f)\!=\!\Vert AA^\top \Vert _2\).
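A small numerical illustration of this property (a hypothetical instance; we use the scaling \(f(x)=\frac{1}{2}\Vert Ax-b\Vert ^2\) so that \(\nabla f(\bar{y})\) is exactly the vector \(z\) above):

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

# Hypothetical small instance: y_bar is sparse and the residual is small, so the
# gradient entries on the zero block of y_bar stay below rho, and the shrinkage
# step p_mu(y_bar) keeps that block at zero for every mu > 0.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 60))
y_bar = np.zeros(60)
y_bar[:5] = rng.standard_normal(5)            # y_bar_2 = 0 on indices 5, ..., 59
b = A @ y_bar + 1e-3 * rng.standard_normal(30)
rho = 1.0
z = A.T @ (A @ y_bar - b)                     # the gradient vector z of the example
assert np.max(np.abs(z[5:])) <= rho           # the assumption ||z_2||_inf <= rho
for mu in (0.01, 0.1, 1.0, 10.0):
    p = soft_threshold(y_bar - mu * z, mu * rho)
    assert np.all(p[5:] == 0.0)               # p_mu(y_bar)_2 = 0 for all tested mu
```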

Since it may be difficult, in general, to compute the local composite Lipschitz constant accurately, we may consider estimating it via an upper bound that is still lower than \(L(f)\). For instance, assume that for a given \(f\), \(g\), and \(y\),

$$\begin{aligned} \Vert \nabla f(x)-\nabla f(y)\Vert \le L(f,g,y)\Vert x-y\Vert , \forall x:\, \Vert x-y\Vert \le \Vert p_{1/L(f,g,y)}(y)-y\Vert , \end{aligned}$$
(2.8)

in other words, \(L(f,g,y)\) is a Lipschitz constant of \(\nabla f(x)\) restricted to the ball around \(y\) of radius \(\Vert p_{1/L(f,g,y)}(y)-y\Vert \). Then \(L(f,g,y)\) is a local composite Lipschitz constant for these \(f\), \(g\), and \(y\), and hence for any \(\mu <1/L(f,g,y)\), (2.4) holds.

To prove this, it is sufficient to verify that \(\Vert p_{\mu }(y)-y\Vert \le \Vert p_{1/L(f,g,y)}(y)-y\Vert \) for any \(\mu <1/L(f,g,y)\); this is established in the following lemma.

Lemma 2.2

Suppose that \(p_\mu (y):=\arg \min _x Q_\mu (x,y)\) for a given \(y\), where \(Q_\mu (x,y)\) is defined as in (2.2). If \(\mu _1\le \mu _2\), then \(\Vert p_{\mu _1}(y)-y\Vert \le \Vert p_{\mu _2}(y)-y\Vert \).

Proof

Assume \(\mu _1 < \mu _2\) (the case \(\mu _1=\mu _2\) is trivial); then \(\alpha \equiv \frac{1}{2\mu _1}-\frac{1}{2\mu _2}> 0\). Suppose, for contradiction, that \(\Vert p_{\mu _1}(y)-y\Vert >\Vert p_{\mu _2}(y)-y\Vert \). Then

$$\begin{aligned} Q_{\mu _1}(p_{\mu _1}(y),y)- Q_{\mu _2}(p_{\mu _1}(y),y)=\alpha \Vert p_{\mu _1}(y)-y\Vert ^2> \alpha \Vert p_{\mu _2}(y)-y\Vert ^2\nonumber \\ = Q_{\mu _1}(p_{\mu _2}(y),y)- Q_{\mu _2}(p_{\mu _2}(y),y). \end{aligned}$$

Hence, by applying the fact that \(p_\mu (y)\) is a minimizer of \(Q_\mu (x, y)\) in \(x\) for \(\mu =\mu _1\) and \(\mu =\mu _2\) we obtain

$$\begin{aligned} 0\ge Q_{\mu _1}(p_{\mu _1}(y),y)- Q_{\mu _1}(p_{\mu _2}(y),y)> Q_{\mu _2}(p_{\mu _1}(y),y)- Q_{\mu _2}(p_{\mu _2}(y),y)\ge 0, \end{aligned}$$

which implies a contradiction. \(\square \)

Example 2.2

To illustrate a case where an upper bound on the local composite Lipschitz constant can be estimated that is lower than the global Lipschitz constant, consider the following simple setting: let \(f(x)=H_\nu (\Vert Ax-b\Vert _2)\) be the smooth Huber approximation of the nonsmooth function \(\Vert Ax-b\Vert _2\), and let \(g(x)\equiv 0\) for simplicity of the derivations. In this case,

$$\begin{aligned} \nabla f(x)=\left\{ \begin{array}{ll} \frac{A^\top (Ax-b)}{\nu }&{} \mathrm{if\ }\Vert Ax-b\Vert \le \nu ,\\ \frac{ A^\top (Ax-b)}{\Vert Ax-b\Vert _2}&{} \mathrm{otherwise}. \end{array} \right. \end{aligned}$$

\(L(f)\) is then equal to \(\Vert A^\top A\Vert _2/\nu \), which can be very large if \(\nu \) is small.

Assume now that a particular vector \(\bar{y}\) is given for which

$$\begin{aligned} \Vert A\bar{y}-b\Vert \ge 2\lambda \end{aligned}$$

for some \(\lambda \gg \nu \). Since the function \(\Vert Ax-b\Vert \) is Lipschitz continuous with constant \(\Vert A\Vert _2\), for any \(x\) such that \(\Vert x-\bar{y}\Vert \le \lambda /\Vert A\Vert _2\) we have

$$\begin{aligned} \Vert A\bar{y}-b\Vert -\Vert Ax-b\Vert \le \Vert A\Vert _2\Vert x-\bar{y}\Vert \le \lambda , \end{aligned}$$

and therefore

$$\begin{aligned} \Vert Ax-b\Vert \ge \lambda . \end{aligned}$$

From the properties of the Huber function, \( \Vert \nabla ^2 f(x)\Vert \le \Vert A^\top A\Vert _2/\lambda \) for all such \(x\). Now consider \(\mu \le \lambda /\Vert A^\top A\Vert _2\); since \(p_\mu (\bar{y})=\bar{y} -\mu \nabla f(\bar{y})\), we have

$$\begin{aligned} \Vert p_\mu (\bar{y})-\bar{y}\Vert \le \mu \Vert A\Vert _2\le \lambda /\Vert A\Vert _2. \end{aligned}$$

This implies that (2.8) holds for \(L(f,g,\bar{y})= \Vert A^\top A\Vert _2/\lambda \); in other words, the local composite Lipschitz constant for this case is bounded from above by \(\Vert A^\top A\Vert _2/\lambda \), which is significantly smaller than the worst-case Lipschitz constant \(\Vert A^\top A\Vert _2/\nu \). We can conclude that

$$\begin{aligned} L(f,g,y)\le \frac{\Vert A^\top A\Vert _2}{2\Vert Ay-b\Vert },\ \forall y:\, \Vert Ay-b\Vert \ge 2\nu . \end{aligned}$$
(2.9)

The preceding definition of local composite Lipschitz constants is provided here to motivate the results that follow. In the following sections we will show that the complexity bounds of line search variants of FISTA and FALM depend on the average value of the prox parameter rather than the worst-case lower bound. Hence, if a sequence of prox parameters can be bounded from below by a sequence of local estimates, for instance by inverses of local composite Lipschitz constants, then this can potentially provide significantly better complexity bounds for FISTA and FALM for certain problem classes.

3 Fast Iterative Shrinkage/Thresholding Method with Variable Step Sizes

In this section we develop an extension of the fast iterative shrinkage/thresholding algorithm (FISTA) of [1] that computes the step size via backtracking starting from any value. We first consider the following generalization (Algorithm 1) of FISTA. In FISTA, \(\theta _k=1\) for all \(k\); we will explain why this choice is imposed and how it can be generalized.

[Algorithm 1 (pseudocode figure)]
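The following is a minimal sketch (in Python, our own rendering; specialized to the least-squares instance (5.1)) of Algorithm 1 with \(\theta _k=1\), i.e., FISTA with the standard backtracking on \(\mu _k\):

```python
import numpy as np

def fista(A, b, rho, x0, mu0, beta=0.5, max_iter=500):
    """Minimal sketch of Algorithm 1 for the instance (5.1), i.e.
    f(x) = 0.5*||Ax - b||^2 and g(x) = rho*||x||_1, with theta_k = 1 (original FISTA)."""
    f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
    g = lambda x: rho * np.linalg.norm(x, 1)
    grad_f = lambda x: A.T @ (A @ x - b)
    shrink = lambda z, tau: np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

    x = x0.copy()
    y, t, mu = x0.copy(), 1.0, mu0
    for _ in range(max_iter):
        fy, gy = f(y), grad_f(y)
        # Step 2: backtracking -- reduce mu by the factor beta until (2.4) holds.
        while True:
            x_new = shrink(y - mu * gy, mu * rho)          # p_mu(y)
            d = x_new - y
            Q = fy + gy @ d + d @ d / (2 * mu) + g(x_new)  # Q_mu(p_mu(y), y)
            if f(x_new) + g(x_new) <= Q:
                break
            mu *= beta
        theta = 1.0                                        # FISTA's choice; Sect. 3 relaxes it
        t_new = (1 + np.sqrt(1 + 4 * theta * t * t)) / 2
        y = x_new + (t - 1) / t_new * (x_new - x)
        x, t = x_new, t_new
        # bar{mu}_{k+1} is taken to be the accepted mu, so the steps are nonincreasing.
    return x
```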

According to Step 2 of this algorithm, if \(\mu _0\) and \(\bar{\mu }_k\) are chosen so that \(\mu _0 > 1/L(f)\) and \(\bar{\mu }_k\ge \mu _{k-1}\), then \(\mu _k\ge \beta /L(f)\), \(\forall k\), since, as already noted, \(F(p_{\mu _k}(y^k))\le Q_{\mu _k}(p_{\mu _k}(y^k),y^k)\) holds for any \(y^k\) if \(\mu _k\le 1/L(f)\). The following lemma is an immediate consequence of this.

Lemma 3.1

Let \(\mu _0>1/L(f)\); then at each iteration of Algorithm 1 there are at most \(\log _\frac{1}{\beta }(\mu _0L(f))+1\) line search backtracking steps and, hence, computations of \(p_{\mu _k}(y_k)\).

If for a given \(y^k\) we define a local composite Lipschitz constant \(L_k\) (\(=L(f, g, y^k)\)), as discussed in Sect. 2.1, then \(\mu _k\ge \beta /L_k\), and the number of line search steps at iteration \(k\) is at most \(\log _\frac{1}{\beta }(\mu _0L_k)+1\).

The complexity of Algorithm 1 for \(\theta _k\equiv 1\) is analyzed in [1]. The proof in [1] relies on the fact that \(\bar{\mu }_{k+1}\) is chosen so that \(\mu _{k+1}\le \mu _k\) for all \(k\ge 1\), i.e., the step size \(\mu _k\) is monotonically nonincreasing. Here we will use an analysis similar to that in [1] but that allows for increases in the step size in Algorithm 1 while preserving the algorithm’s \(O(\sqrt{L(f)/\epsilon })\) complexity. As we will show, we can accomplish this by choosing appropriate values of \(\theta _k\).

We will first need an auxiliary lemma, which is analogous to a part of a theorem in [1].

Lemma 3.2

Assume that on each step of Algorithm 1, \(\mu _k t_k^2\ge \mu _{k+1}t_{k+1}(t_{k+1}-1)\). Then for \(v_k=F(x^k)-F(x^*)\) and \( \mathbf {u}_k=t_kx_k-(t_k-1)x_{k-1}-x^*\) we have for all \(k\ge 1\)

$$\begin{aligned} 2\mu _kt_k^2v_k+\Vert \mathbf {u}_k\Vert ^2 \ge 2\mu _{k+1}t_{k+1}^2v_{k+1}+\Vert \mathbf {u}_{k+1}\Vert ^2. \end{aligned}$$
(3.1)

Proof

Let \(y=y^{k+1}\), \(\mu =\mu _{k+1}\), and \(p_{\mu }(y)=x^{k+1}\) in Lemma 2.1. Then from this lemma it follows that for \(x=x^k\) we have

$$\begin{aligned} 2\mu _{k+1}(v_k-v_{k+1})\ge \Vert y^{k+1}-x^{k+1}\Vert ^2+2\langle x^{k+1}-y^{k+1}, y^{k+1}-x^k\rangle , \end{aligned}$$
(3.2)

and for \(x=x^*\) we have

$$\begin{aligned} -2\mu _{k+1} v_{k+1} \ge \Vert y^{k+1}-x^{k+1}\Vert ^2+2\langle x^{k+1}-y^{k+1}, y^{k+1}-x^*\rangle . \end{aligned}$$
(3.3)

Multiplying the first inequality by \(t_{k+1}-1\) and adding it to the second inequality we have

$$\begin{aligned} 2\mu _{k+1}((t_{k+1}-1)v_k-t_{k+1}v_{k+1})&\ge t_{k+1}\Vert y^{k+1}-x^{k+1}\Vert ^2\\&+2\langle x^{k+1}-y^{k+1}, t_{k+1}y^{k+1}\!-\!(t_{k+1}\!-\!1)x_k\!-\!x^*\rangle . \end{aligned}$$

Then multiplying this inequality by \(t_{k+1}\) and rearranging the right-hand-side terms yields

$$\begin{aligned} 2\mu _{k+1}((t_{k+1}-1)t_{k+1}v_k-t_{k+1}^2v_{k+1})&\ge \Vert t_{k+1}x^{k+1}-(t_{k+1}-1)x_k-x^*\Vert ^2\\&-\Vert t_{k+1}y^{k+1}-(t_{k+1}-1)x_k-x^*\Vert ^2. \end{aligned}$$

Now from the definition of \(y^{k+1}\) in Algorithm 1 and \(\mathbf {u}_k\),

$$\begin{aligned} t_{k+1}y^{k+1}=t_{k+1}x_k+(t_k-1)(x_k-x_{k-1}), \end{aligned}$$

and it follows that

$$\begin{aligned} t_{k+1}y^{k+1}-(t_{k+1}-1)x_k-x^*=t_kx_k-(t_k-1)x_{k-1}-x^*= \mathbf {u}_k. \end{aligned}$$

Hence,

$$\begin{aligned} 2\mu _{k+1}((t_{k+1}-1)t_{k+1}v_k-t_{k+1}^2v_{k+1})\ge \Vert \mathbf {u}_{k+1}\Vert ^2-\Vert \mathbf {u}_{k}\Vert ^2. \end{aligned}$$
(3.4)

Since \(\mu _k t_k^2\ge \mu _{k+1}t_{k+1}(t_{k+1}-1)\), we then have

$$\begin{aligned} 2(\mu _{k}t_k^2v_k-\mu _{k+1}t_{k+1}^2v_{k+1})\ge \Vert \mathbf {u}_{k+1}\Vert ^2-\Vert \mathbf {u}_{k}\Vert ^2, \end{aligned}$$
(3.5)

which is equivalent to (3.1). \(\square \)

Theorem 3.1

Assume that on each step of Algorithm 1, \(\mu _k t_k^2\ge \mu _{k+1}t_{k+1}(t_{k+1}-1)\). The sequence \(\{x^k\}\) generated by Algorithm 1 satisfies

$$\begin{aligned} F(x^k)-F(x^*)\le \frac{\Vert x^0-x^*\Vert ^2}{2\mu _kt_k^2}. \end{aligned}$$
(3.6)

Proof

Since \(\Vert \mathbf {u}_k\Vert ^2\ge 0\), \(t_1=1\), and \(\mathbf {u}_1=x_1-x^*\), it follows from Lemma 3.2 that

$$\begin{aligned} 2\mu _k t_k^2 v_k\le 2\mu _k t_k^2v_k + \Vert \mathbf {u}_k\Vert ^2\le 2\mu _1t_1^2v_1+\Vert x_1-x^*\Vert ^2. \end{aligned}$$
(3.7)

But from Lemma 2.1 with \(x=x^*, y=y_1=x_0\), \(t_1=1\), and \(\mu =\mu _1\), we have that

$$\begin{aligned} -2\mu _1v_1\ge \Vert x^1-x^*\Vert ^2-\Vert x^0-x^*\Vert ^2. \end{aligned}$$
(3.8)

Therefore, combining (3.7) and (3.8) we obtain that

$$\begin{aligned} 2\mu _k t_k^2v^k\le \Vert x^0-x^*\Vert ^2, \end{aligned}$$

which is equivalent to (3.6). \(\square \)

The result we want to obtain is

$$\begin{aligned} F(x^k)-F(x^*)=v_k\le \frac{\Vert x^0-x^*\Vert ^2}{\eta k^2} \end{aligned}$$
(3.9)

for some fixed \(\eta >0\). For this to be the case, it is sufficient to show that our algorithm satisfies

$$\begin{aligned} 2\mu _kt_k^2\ge \eta k^2 \end{aligned}$$
(3.10)

while maintaining

$$\begin{aligned} \mu _k t_k^2\ge \mu _{k+1}t_{k+1}(t_{k+1}-1). \end{aligned}$$
(3.11)

From the update of \(t_k\) in Algorithm 1,

$$\begin{aligned} t_{k+1} = (1+\sqrt{1+4\theta _kt_k^2})/2, \end{aligned}$$
(3.12)

it follows that

$$\begin{aligned} \theta _k t_{k}^2 =t_{k+1}(t_{k+1}-1). \end{aligned}$$
(3.13)

Hence, as long as \(\theta _k\le \mu _k/\mu _{k+1}\), property (3.11) holds. We know that if we implement the constraint \(\mu _{k+1}\le \mu _k\) and use \(\theta _k=1\), as is done in the FISTA algorithm in [1], then both conditions (3.10) and (3.11) are satisfied with \(\eta = \beta /(2L(f))\). However, if we want to allow \(\mu _{k+1}> \mu _k \), say \(\mu _{k+1}\ge \mu _k/\beta \), then we need to have \(\theta _k<1\), e.g., \(\theta _k\le \beta \). On the other hand, condition (3.10) requires \(t_k\ge k\sqrt{\frac{\eta }{2\mu _k}}\ge k\sqrt{\frac{\eta }{2\mu _0}}\); hence \(t_k\) needs to grow at the same rate as \(k\). From (3.13) and the fact that \(t_k>1\) for all \(k>1\),

$$\begin{aligned} (t_{k+1}-1)^2 <t_{k+1}(t_{k+1}-1)=\theta _k t_k^2, \end{aligned}$$

and hence, \(t_{k+1}<\theta _k^{1/2}t_k+1\). Assume that for all \(k\), we allow \(\theta _k\le \beta <1\); then if \(t_k> 1/(1-\beta ^{1/2})\), we have \(t_{k+1}<t_k\). This means that the sequence \(t_k\) will not grow at the required rate if we simply allow \(\theta _k<1\) for all iterations.

To maintain FISTA’s rate of convergence while allowing \(\theta _k<1\) on some iterations, we need to ensure that \(\theta _k>1\) on some of the other iterations.

This can be accomplished on iterations on which \(\mu _k/\mu _{k+1}>1\). The immediate difficulty is that we do not know the value of \(\mu _{k+1}\) on iteration \(k\) when \(\theta _k\) and \(t_{k+1}\) are computed. Simply setting \(\theta _k\le \mu _k/\mu _0\), where \(\mu _0\) is an upper bound on the step size, would imply that \(\theta _k<1\) for all \(k\). Hence, it is necessary to update \(\theta _k\) and \(t_{k+1}\) along with \(\mu _{k+1}\), thereby expanding the backtracking part of the algorithm.

We now present Algorithm 2, an extension of Algorithm 1 that allows for full backtracking and an arbitrary initial value of \(\mu _k\). It satisfies conditions (3.10)–(3.11) while setting the step size \(\mu _k\) initially to some value \(\mu ^0_k\) at the beginning of each iteration. As we will show, Algorithm 2 is designed to maintain \(\theta _k=\mu _k/\mu _{k+1}\) for all \(k\ge 1\).

Recall that the parameter \(\theta _k\) is used to compute \(t_{k+1}\) and \(y^{k+1}\) in Algorithm 1 using the updates

$$\begin{aligned} t_{k+1}&:= (1+\sqrt{1+4\theta _kt_k^2})/2, \\ y^{k+1}&:= x^k+ \frac{t_{k}-1}{t_{k+1}}[x^k-x^{k-1}]. \end{aligned}$$

Let us denote this computation by

$$\begin{aligned} (t_{k+1},y^{k+1})=FistaStep(x^k,x^{k-1},t_k, \theta _k). \end{aligned}$$

At the end of iteration \(k-1\) of Algorithm 2, \(\theta _{k-1}=\mu _{k-1}/\mu ^0_k\), where \(\mu ^0_k\) is an initial “guess” for the value of \(\mu _{k}\). Hence, \(t_{k}\) and \(y^{k}\) are computed using this guess. Once the backtracking is called at iteration \(k\), this guess may turn out to be correct, or it may turn out to be an overestimate of \(\mu _{k}\). If the guess is correct, then no correction to \(\theta _{k-1}\) is needed. If \(\mu _{k}\) is reduced during the backtracking, then \(\theta _{k-1}\) is recomputed to account for the new value of \(\mu _{k}\) so that \(\theta _{k-1}=\mu _{k-1}/\mu _{k}\) is satisfied. Another call to \(FistaStep(x^{k-1},x^{k-2},t_{k-1}, \theta _{k-1})\) is then necessary. After such a call is made, since the iterate \(y^{k}\) has changed, \(p_{\mu _{k}}(y^{k})\) must be recomputed and a new backtracking step performed. The backtracking starts with the value of \(\mu _{k}\) with which the previous backtracking stopped and, hence, that was used to compute the most recent value of \(\theta _{k-1}\). If the value of \(\mu _k\) is not reduced any further, then \(\theta _{k-1}\), \(y^{k}\) and \(t_k\) have the correct value and the backtracking part of the iteration is complete. If the value of \(\mu _{k}\) is reduced further, then \(\theta _{k-1}\) is recomputed again and backtracking continues.

[Algorithm 2: FISTA-BKTR (pseudocode figure)]
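The following minimal sketch (our own names; `prox_g` is a generic proximity operator for \(g\)) illustrates the expanded backtracking of a single iteration as just described:

```python
import numpy as np

def fista_bktr_iteration(x_prev, x_prev2, t_prev, mu_prev, mu_init,
                         f, g, grad_f, prox_g, beta=0.5):
    """One iteration of the full backtracking described above (a sketch).
    theta_{k-1}, t_k and y^k are recomputed whenever the trial value of mu_k is
    reduced, so that theta_{k-1} = mu_{k-1}/mu_k holds on exit.  prox_g(z, mu)
    returns argmin_u (1/(2*mu))*||u - z||^2 + g(u)."""
    mu = mu_init                                   # initial guess mu_k^0
    while True:
        theta = mu_prev / mu                       # theta_{k-1} = mu_{k-1}/mu_k
        t = (1 + np.sqrt(1 + 4 * theta * t_prev ** 2)) / 2        # FistaStep
        y = x_prev + (t_prev - 1) / t * (x_prev - x_prev2)
        fy, gy = f(y), grad_f(y)
        x = prox_g(y - mu * gy, mu)                # p_mu(y)
        d = x - y
        if f(x) + g(x) <= fy + gy @ d + d @ d / (2 * mu) + g(x):  # condition (2.4)
            return x, y, t, mu                     # accepted values x^k, y^k, t_k, mu_k
        mu *= beta                                 # otherwise shrink mu_k and redo the step
```

In an efficient implementation (Sect. 3.1), the recomputation of \(\nabla f(y^k)\) inside this loop is replaced by an \(O(n)\) recombination of cached gradients.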

We have the following lemma, whose proof follows from the preceding algorithm and discussion.

Lemma 3.3

At each iteration, upon completion of the backtracking, the iterate \(x_k\) is computed as \(p_{\mu _k}(y^k)\), where

$$\begin{aligned} y^{k} = x^{k-1}+\frac{t_{k-1}-1}{t_{k}}[x^{k-1}-x^{k-2}] \end{aligned}$$

and

$$\begin{aligned} t_{k} = (1+\sqrt{1+4(\mu _{k-1}/\mu _k)t_{k-1}^2})/2, \end{aligned}$$
(3.14)

which is equivalent to (3.12) since \(\theta _{k-1}=\mu _{k-1}/\mu _k\).

From the expression for \(t_k\) in this lemma, (3.11) is satisfied for each \(k\). We now need to show that condition (3.14) implies (3.10), in which case the complexity result (3.9) follows.

Lemma 3.4

Let \(\{\mu _k,t_k\}\) be a sequence that satisfies (3.14); then we have \(\mu _kt_k^2\ge (\sum _{i=1}^k \sqrt{\mu _i}/2)^2\).

Proof

We will prove the lemma using induction. Since \(t_1=1\), the statement trivially holds for \(k=1\). Let us assume that \(\mu _kt_k^2\ge (\sum _{i=1}^{k} \sqrt{\mu _i}/2)^2\) for a given \(k\ge 1\). From the fact that (3.14) holds for each \(k\) it follows that

$$\begin{aligned} t_{k+1}\ge 1/2+\sqrt{\frac{\mu _k}{\mu _{k+1}}}t_k, \end{aligned}$$

and hence,

$$\begin{aligned} \sqrt{\mu _{k+1}}t_{k+1}\ge \sqrt{\mu _{k+1}}/2+\sqrt{\mu _k}t_k. \end{aligned}$$

By the induction assumption, \(\sqrt{\mu _k}t_k\ge \sum _{i=1}^{k} \sqrt{\mu _i}/2\); hence,

$$\begin{aligned} \sqrt{\mu _{k+1}}t_{k+1}\ge \sqrt{\mu _{k+1}}/2+ \sum _{i=1}^{k} \sqrt{\mu _i}/2= \sum _{i=1}^{k+1} \sqrt{\mu _i}/2, \end{aligned}$$

which concludes the induction argument.

Moreover, since \(\mu _i\ge \beta /L(f)\),

$$\begin{aligned} \left( \sum _{i=1}^k \sqrt{\mu _i}/2\right) ^2\ge \frac{k^2\beta }{4L(f)}. \end{aligned}$$

\(\square \)

Since Algorithm 2 maintains \(\theta _k=\mu _k/\mu _{k+1}\), (3.11) holds automatically, and we have shown that (3.10) holds with \(\eta = \beta /(2L(f))\) by Lemma 3.4. Thus the following two complexity results follow immediately from Theorem 3.1.

Theorem 3.2

At the \(k\)th iteration of Algorithm 2 we have

$$\begin{aligned} F(x^k)-F(x^*)=v_k\le \frac{2L(f)\Vert x^0-x^*\Vert ^2}{\beta k^2}. \end{aligned}$$
(3.15)

Theorem 3.3

Let \(\sqrt{L_k}=(\sum _{i=1}^k \sqrt{ L(f,g, y^i)})/k\) be the average estimate of the local composite Lipschitz constants encountered during the first \(k\) iterations of Algorithm 2. Assume that \(\mu _i^0\ge 1/L(f,g, y^i)\) for all \(1\le i \le k\). Then

$$\begin{aligned} F(x^k)-F(x^*)=v_k\le \frac{2 L_k\Vert x^0-x^*\Vert ^2}{\beta k^2}. \end{aligned}$$
(3.16)

Proof

The proof follows by observing that \(\mu _i\ge \beta /L(f,g, y^i)\) for each \(i=1, \ldots , k\), applying Lemma 3.4, and using the Cauchy–Schwarz inequality \(\left( \sum _{i=1}^k 1/\sqrt{L(f,g,y^i)}\right) \left( \sum _{i=1}^k \sqrt{L(f,g,y^i)}\right) \ge k^2\). \(\square \)

In [9], Nesterov gives an accelerated method for solving problem (1.1) that involves a backtracking process and has an iteration complexity of \(O(\sqrt{L(f)/\epsilon })\). Like Algorithm 2, Nesterov’s method allows repeated reduction of the prox parameter \(\mu _k\) by a given factor on the \(k\)th iteration until a backtracking condition is met.

While Nesterov’s method minimizes the same function \(Q_{\mu _k}\) on each inner step, as do FISTA and Algorithm 2, it utilizes a different update of the parameters that are used to compute the new extrapolated point \(y^k\) and relies on a criterion for terminating each backtracking step other than \(F(p_{\mu _k}(y^k))> Q_{\mu _k}(p_{\mu _k}(y^k),y^k)\). In particular, the criterion in [9] involves \(\nabla f(p_{\mu _k}(y_k))\) at each backtracking step, thereby increasing the per-iteration cost compared to that of Algorithm 2 [assuming that the cost of computing \(\nabla f(y)\) is higher than that of computing \(F(p_{\mu _k}(y^k))\), as in compressed sensing, for instance, where the former requires two matrix vector multiplications, while the latter requires only one]. Whether either of the criteria yields larger step sizes and, hence, possibly fewer iterations remains to be investigated. For the sake of theoretical comparison, one can easily modify the results in [9] to obtain a lemma similar to Lemma 3.4. However, Theorem 3.3 does not apply directly to the method in [9] because in that case \(\mu _k\) increases by a constant factor between two consecutive iterations. We discuss this strategy and its advantages next.

Each iteration of the algorithm can start with any value of the prox parameter \(\mu ^0_k\), and the condition \(\mu _kt_{k}^2=\mu _{k+1}(t_{k+1}^2-t_{k+1})\) is maintained at each iteration. If \(\mu _k^0\) is chosen to be a large fixed value \(\mu _0\) at each iteration, this implies a potentially large number of calls to the FistaStep procedure to update \(y^k\) and \(t_k\) and the consequent recomputation of \(\nabla f(y^k)\), \(p_{\mu _k}(y^k)\), and \(F(p_{\mu _k}(y^k))\). However, we know that the number of such internal steps does not exceed \(\log _\frac{1}{\beta }(\mu _0L(f))+1\); hence, the complexity of Algorithm 2 is at most a factor logarithmic in \(\mu _0L(f)\) worse than that of FISTA, i.e., Algorithm 1. The advantage of choosing a large \(\mu ^0_k\) is the potentially rapid increase of the prox parameter if such an increase is possible and desirable due to a rapid change in the local Lipschitz behavior of \(\nabla f(x)\).

On the other hand, \(\mu ^0_k\) can be selected to have a more conservative value (as long as the value of \(\theta _k\) is chosen accordingly). For instance, one can choose \(\mu ^0_k\) to equal \(\mu _{k-1}/\beta \) at the beginning of the \(k\)th iteration. This will significantly reduce the number of backtracking steps. This strategy of increasing \(\mu _k\) incrementally is what is used in Nesterov’s method in [9], where \(\mu ^0_k\) is set to \(\mu _{k-1}/\alpha \) for some fixed \(\alpha \ge \beta \). It is then possible to bound the total number of line search steps used by the algorithm. The following lemma is proved in [9] and applies readily to Algorithm 2 if \( \mu ^0_k\) is chosen to equal \(\mu _{k-1}/\alpha \) instead of some fixed large value \(\mu _0\).

Lemma 3.5

For any \(k\ge 1\) the number of calls to FistaStep by the end of the \(k\)th iteration is bounded by

$$\begin{aligned} \left[ 1+\frac{\ln \alpha }{\ln \beta }\right] (k+1)+\frac{1}{\ln \beta }\left[ \ln \frac{\alpha \mu _0}{\beta (1/L(f))}\right] _+ \,. \end{aligned}$$

If we choose \(\alpha =\beta \), then we see that the average number of calls of FistaStep per iteration is \(2\). Hence, the logarithmic factor is removed from the overall complexity of Algorithm 2.

It is easy to see that (3.14) still holds in this case, and hence Lemma 3.4 applies. However, the drawback of this approach is that we can no longer guarantee that \(\mu _k\ge \beta / L_k\) for each \(k\), and Theorem 3.3 may not hold for \(L_k\) as the estimate of the local composite Lipschitz constant defined in Sect. 2.1. The bound on \(\mu _k\) is now \(\mu _k\ge \min \{\mu _{k-1}/\beta , \beta /L_k\}\). Ideally, one should select the increase of the prox term based on the estimated increase of the local Lipschitz constant. The applicability of such an approach is a subject for future investigation.

Let us consider the extra work required by each backtracking step. The call \(FistaStep(x^k,x^{k-1},t_k, \theta _k)\) requires only an additional \(O(n)\) operations; hence, the main additional work comes from the need to recompute \(F(p_{\mu _k}(y^k))\) and \(Q_{\mu _k}(p_{\mu _k}(y^k),y^k)\). In the case of the compressed sensing and Lasso problems, which we consider for our computational results, the dominating cost of all these computations can be reduced to one additional matrix-vector multiplication per backtracking step.

3.1 A Practical Backtracking FISTA Algorithm for Compressed Sensing and Lasso Problems

The additional cost of each backtracking step of Algorithm 2 compared to that of Algorithm 1 lies in a call to the FistaStep updates and the recomputation of \(\nabla f(y^k)\), which is needed to construct \(Q_\mu (\cdot , y^k)\). All remaining computational cost is the same for both algorithms. The number of backtracking steps is determined solely by the choice of \(\mu _k^0\) at each iteration, as discussed in the previous section. The choice of a practical approach is likely to depend on a comparison of the costs of computing \(\nabla f(y^k)\), \(p_{\mu _k}(y^k)\), and \(F(p_{\mu _k}(y^k))\). Here we consider a specific application of our backtracking algorithm to problems of the form

$$\begin{aligned} (P) \qquad \min \{F(x)\equiv \Vert Ax-b\Vert ^2+g(x): x\in \varvec{R}^n\}. \end{aligned}$$
(3.17)

We assume here that \(g(x)\) is a simple function, such as \(\Vert x\Vert _1\), as in the case of CS or Lasso [4, 13] or \(\sum _i \Vert x_i\Vert _2\) as in the case of group Lasso [15]. In this case, \(\nabla f(x)=A^\top (Ax-b)\). Recall the expression for \(y^k\):

$$\begin{aligned} y^{k} = x^{k-1}+\frac{t_{k-1}-1}{t_{k}}[x^{k-1}-x^{k-2}], \end{aligned}$$

which implies that

$$\begin{aligned} \nabla f(y^{k} ) = \nabla f(x^{k-1})+\frac{t_{k-1}-1}{t_{k}}[\nabla f(x^{k-1})-\nabla f(x^{k-2})]. \end{aligned}$$

Hence, if \(\nabla f (x^{k-1})\) and \(\nabla f (x^{k-2})\) are available, then \(\nabla f(y^{k} )\) can be computed in \(O(n)\) operations for any value of \(t_k\). Since the backtracking step changes only the value of \(t_k\) but not \(x^{k-1}\) or \(x^{k-2}\), this means that the extra cost of each backtracking step of Algorithm 2 compared to that of Algorithm 1 is only \(O(n)\), which is negligible.
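As a small sketch (assuming the cached gradients are available as arrays):

```python
def grad_f_at_y(grad_x_prev, grad_x_prev2, t_prev, t):
    """O(n) update of nabla f(y^k) from cached gradients at x^{k-1} and x^{k-2},
    valid because nabla f(x) = A^T (Ax - b) is affine in x."""
    return grad_x_prev + (t_prev - 1) / t * (grad_x_prev - grad_x_prev2)
```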

However, as discussed earlier, using larger values of \(\mu _k^0\) may result in a higher number of backtracking steps; hence, we should analyze the cost of a backtracking step itself. For simple forms of \(g(x)\), computing \(p_{\mu _k}(y^k)\) given \(\nabla f(y^{k} )\) takes \(O(n)\) operations. Finally, computing \(F(p_{\mu _k}(y^k))\) requires one matrix-vector product for computing \(Ax-b\). Once \(x^k\) is determined via backtracking, an additional matrix-vector product is employed to compute \(\nabla f(x^k)\); however, this last computation is not a part of the backtracking procedure. Assuming that matrix-vector products comprise the dominant cost of each iteration, the total cost of an iteration without backtracking equals two matrix-vector products, while the cost of an iteration with backtracking contains an additional matrix-vector product per backtracking step. For instance, if \(\mu _k^0=\mu _{k-1}/\beta \), then, by Lemma 3.5, the average cost of an iteration of Algorithm 2 is \(3/2\) times that of Algorithm 1. Such a cost increase is beneficial if the number of iterations of Algorithm 2 is proportionately smaller.

In the examples we consider in our computational experiments in Sect. 5, increasing \(\mu _k\) at each iteration appears to be wasteful. Hence, we choose to allow an increase in the value of the prox parameter only every \(l\) iterations, where \(l\) is chosen heuristically (see the sketch below). In particular, we try several different values of \(l\) during the first 100 iterations and then settle on the value of \(l\) that gives the smallest number of failed backtracking steps. The key assumption in using such a heuristic is that the behavior of the algorithm at later iterations is similar to that during earlier iterations. While the behavior of the algorithm is problem-dependent, in our experiments this heuristic produced good results. We believe that this is due to the fact that the local Lipschitz constants do not vary dramatically for the problems in our experiments. If significant changes occur in the Lipschitz constants, we believe that any backtracking heuristic will produce significant improvements over the pure FISTA algorithm because it will allow the prox parameter to increase sooner or later.
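A sketch of such a schedule (the helper and its parameters are our own; the trial phase that selects \(l\) is omitted):

```python
def initial_mu(k, mu_prev, l, beta=0.5):
    """Sketch of the schedule above: attempt to increase the prox parameter only
    every l-th iteration; otherwise reuse the last accepted value."""
    return mu_prev / beta if k % l == 0 else mu_prev
```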

4 Fast Alternating Linearization Methods

In this section, we propose a backtracking version of the fast alternating linearization method (FALM) with step skipping [7], which is given below as Algorithm 3. This algorithm can be viewed as an extension of FISTA, where the linearization is applied to each of the two parts of \(F(x)\) in turn. This similarity to FISTA allows us to extend our backtracking approach and convergence rate results to FALM.

The main difference between FISTA and FALM is that FISTA approximates only the first part of the composite objective function \(F(x)\), while FALM applies alternating approximations to each of the two parts of \(F(x)\). Hence, we consider the following two approximations of \(F(x)\) and their minimizers:

Definition 4.1

$$\begin{aligned} Q^f_\mu (u,v)&:= f(v) + \langle u-v, \nabla f(v) \rangle + \frac{1}{2\mu } \Vert u-v\Vert ^2 + g(u);\\ p^f_\mu (v)&:= \arg \min _u Q^f_\mu (u,v);\\ Q^g_\mu (u,v)&:= f(v) + \langle v-u, \lambda \rangle + \frac{1}{2\mu } \Vert u-v\Vert ^2 + g(u), \end{aligned}$$

where \(\lambda \) is some element of the subdifferential \(\partial g(u)\);

$$\begin{aligned} p^g_\mu (u):=\arg \min _v Q^g_\mu (u,v). \end{aligned}$$

Algorithm 3 (FALM-S) given below was proposed in [7] for cases where computing \(p^f_\mu (v)\) and \(p^g_\mu (u)\), defined earlier, is relatively easy, i.e., comparable to a gradient computation of \(g\) and \(f\), respectively. We call the \(k\)th iteration of this algorithm a skipping step if \(x^k=z^k\) and a regular step if \(x^k\ne z^k\). Skipping steps are designed to accommodate those cases where the condition \(F(p^g_{\mu }(u)) \le Q_\mu ^g(u,p^g_{\mu }(u))\) fails. Since \(g\) is not a smooth function, it may not be possible to satisfy the preceding condition by choosing a sufficiently small value of \(\mu \). Hence, when the minimization of \(Q_\mu ^g(v,u)\) in \(v\) does not produce the desired function reduction, the corresponding step is simply skipped.

[Algorithm 3: FALM-S (pseudocode figure)]

For the efficient computation of \(\lambda ^{k+1}\in -\partial g(z^{k+1})\) we refer the reader to [7].

As one can observe, Algorithm 3 does not include a mechanism for selecting \(\mu \) at each iteration – the prox parameter is assumed to be fixed. In that case, the \(\theta _k\) parameter in the FistaStep computation equals \((\mu +\mu )/(\mu +\mu )=1\) when no step is skipped on either iteration \(k\) or iteration \(k+1\). Alternatively, \(\theta _k= \mu /(\mu +\mu )=1/2\) when a step is skipped on iteration \(k\) but not on iteration \(k+1\), while \(\theta _k= (\mu +\mu )/\mu =2\) when a step is skipped on iteration \(k+1\) but not on iteration \(k\). The convergence rate results in [7] can be easily extended to the case where \(\mu _k\) is chosen via backtracking, as is done in FISTA, as long as \(\mu _k\le \mu _{k-1}\), with the same values for \(\theta _k\). Here we propose a backtracking variant of FALM-S, which is given below as Algorithm 4. We note that we now have two linearization and minimization steps per iteration; hence, we allow the prox parameter to be set separately, to \(\mu _k^x\) and \(\mu _k^y\), respectively, for each of the two minimization steps. The average \( \mu _k=(\mu _k^x+\mu _k^y)/2\) is then used to compute \(\theta _k\) and, hence, to update the parameter \(t_{k+1}\). The new algorithm is a generalization of Algorithm 3 and includes skipping steps as a special case when \(skip=\mathrm{true}\) and \(\mu _k^x=0\).

Let us first observe that the algorithm generates a sequence \(\{\mu _k, t_k\}\) that satisfies a condition analogous to condition (3.11), which we required for FISTA-BKTR. Indeed, it is evident from the computation of \(\theta _k\) that \(\theta _k=\mu _k/\mu _{k+1}\), with \(\mu _{k}=\frac{\mu ^x_k+\mu ^y_k}{2}\), and hence \(\mu _k t_k^2 = \mu _{k+1}t_{k+1}(t_{k+1}-1)\) for all \(k\).
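As a small sketch of this bookkeeping (names are ours):

```python
def theta_falm(mu_x_k, mu_y_k, mu_x_next, mu_y_next):
    """theta_k = mu_k / mu_{k+1} with mu_k = (mu_k^x + mu_k^y)/2, as maintained by
    Algorithm 4.  With a fixed mu and a skipped step encoded as mu^x = 0, this
    reproduces the values 1, 1/2 and 2 discussed for FALM-S above."""
    return (mu_x_k + mu_y_k) / (mu_x_next + mu_y_next)
```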

To obtain a bound on the number of iterations required by Algorithm 4 to reach an \(\epsilon \)-optimal solution, we need the following lemma.

Lemma 4.1

The sequence \(\{x^k,y^k\}\) generated by Algorithm 4 satisfies

$$\begin{aligned} 2(\mu _kt_k^2v_k - \mu _{k+1}t_{k+1}^2v_{k+1}) \ge \Vert u^{k+1}\Vert ^2 - \Vert u^k\Vert ^2, \end{aligned}$$
(4.1)

where \(u^k:=t_ky^k-(t_k-1)y^{k-1}-x^*\), \(v_k:=2F(y^k)-2F(x^*)\) and \(\mu _k=(\mu ^x_k+\mu ^y_k)/2\).

[Algorithm 4: FALM-BKTR (pseudocode figure)]

Proof

We will prove that the following inequality holds:

$$\begin{aligned} 2(\mu _k t_k^2v_k-\mu _{k+1} t_{k+1}^2 v_{k+1})&\ge t_{k\!+\!1}(t_{k\!+\!1}\!-\!1)\left( \Vert y^{k\!+\!1}\!-\!y^k\Vert ^2\!-\!\Vert z^{k\!+\!1}\!-\!y^k\Vert ^2\right) \\&\quad +t_{k\!+\!1}\left( \Vert y^{k\!+\!1}\!-\!x^*\Vert ^2\!-\!\Vert z^{k\!+\!1}\!-\!x^*\Vert ^2\right) . \end{aligned}$$

The proof of (4.1), and hence the lemma, then follows from the fact that the right-hand side of the preceding inequality equals

$$\begin{aligned} \Vert t_{k\!+\!1}y^{k\!+\!1}-(t_{k\!+\!1}\!-\!1)y^k\!-\!x^*\Vert ^2\!-\!\Vert t_{k\!+\!1}z^{k\!+\!1}\!-\!(t_{k\!+\!1}\!-\!1)y^k\!-\!x^*\Vert ^2 = \Vert u^{k\!+\!1}\Vert ^2\!-\!\Vert u^k\Vert ^2, \end{aligned}$$

where we have used the fact that \(t_{k+1}z^{k+1}:=t_{k+1}y^k+t_k(y^k-y^{k-1})-(y^k-y^{k-1}).\)

First applying Lemma 2.1 to the function \(Q^f_{\mu }(\cdot , \cdot )\), with \(\mu =\mu _{k+1}^y\), \(x=y^k\), and \(y=x^{k+1}\), we have \(p^f_{\mu _{k+1}^y}(y) =y^{k+1}\) and

$$\begin{aligned} 2\mu ^y_{k+1}(F(y^k) - F(y^{k+1}))\ge \Vert y^{k+1}-y^k\Vert ^2-\Vert x^{k+1}-y^k\Vert ^2. \end{aligned}$$
(4.2)

In the case of a regular iteration, that is, when \(\mu ^x_{k+1}>0\), applying Lemma 2.1 to the function \(Q^g_{\mu }(\cdot , \cdot )\), with \(\mu =\mu _{k+1}^x\), \(x=y^k\), and \(y=z^{k+1}\), we obtain \(p^g_{\mu _{k+1}^x}(y)=x^{k+1}\) and

$$\begin{aligned} 2\mu ^x_{k+1}(F(y^k) - F(x^{k+1}))\ge \Vert x^{k+1}-y^k\Vert ^2-\Vert z^{k+1}-y^k\Vert ^2. \end{aligned}$$
(4.3)

In the case where \(\mu ^x_{k+1}=0\), since \(x^{k+1}:=z^{k+1}\), (4.3) still holds.

Summing (4.2) and (4.3), and using the fact that \(F(y^{k+1})\le F(x^{k+1})\), we obtain

$$\begin{aligned} 2\mu _{k+1}(v_k-v_{k+1}) = 2\mu _{k+1}(2F(y^k)-2F(y^{k+1})) \ge \Vert y^{k+1}- y^k\Vert ^2-\Vert z^{k+1}- y^k\Vert ^2. \end{aligned}$$
(4.4)

Again, applying Lemma 2.1 in the case where \(\mu _{k+1}^x>0\) to the function \(Q^g_{\mu _{k+1}^x}(\cdot ,\cdot )\) with \(x=x^*\) and \(y=z^{k+1}\), we obtain \(p^g_{\mu ^x_{k+1}}(y)=x^{k+1}\) and

$$\begin{aligned} 2\mu ^x_{k+1}(F(x^*)-F(x^{k+1}))\ge \Vert x^{k+1}-x^*\Vert ^2-\Vert z^{k+1}-x^*\Vert ^2. \end{aligned}$$
(4.5)

On the other hand, if \(\mu ^x_{k+1}=0\), then \(x^{k+1}:=z^{k+1}\), and (4.5) still holds. Lemma 2.1 applied to the function \(Q^f_{\mu _{k+1}^y}(\cdot , \cdot )\), with \(x=x^*\), \(y=x^{k+1}\), and \(p^f_{\mu _{k+1}^y}(x^{k+1})=y^{k+1}\), gives

$$\begin{aligned} 2\mu ^y_{k+1}(F(x^*)-F(y^{k+1}))\ge \Vert y^{k+1}-x^*\Vert ^2-\Vert x^{k+1}-x^*\Vert ^2. \end{aligned}$$
(4.6)

Summing (4.5) and (4.6), and again using the fact that \(F(y^{k+1})\le F(x^{k+1})\) and \(\mu _{k+1}=(\mu _{k+1}^x+\mu _{k+1}^y)/2\), we obtain

$$\begin{aligned} -2\mu _{k+1} v_{k+1} = 2\mu _{k+1} (2F(x^*)-2F(y^{k+1})) \ge \Vert y^{k+1}-x^*\Vert ^2-\Vert z^{k+1}-x^*\Vert ^2.\quad \quad \end{aligned}$$
(4.7)

If we multiply (4.4) by \(t_{k+1}(t_{k+1}-1)\) and (4.7) by \(t_{k+1}\), and take the sum of the two resulting inequalities, we obtain the inequality stated at the beginning of the proof by using the fact that \(\mu _kt_k^2\ge \mu _{k+1}t_{k+1}(t_{k+1}-1)\). \(\square \)

Since Algorithm 4 maintains \(\theta _k=\mu _k/\mu _{k+1}\) on each iteration, it follows from Lemma 3.4 that for the sequence \(\{\mu _k,t_k\}\) generated by Algorithm 4,

$$\begin{aligned} \mu _kt_k^2\ge (\sum _{i=1}^k \sqrt{\mu _i}/2)^2. \end{aligned}$$
(4.8)

Now we are ready to give the complexity of Algorithm 4. Applying a proof equivalent to that of Theorem 3.1 we obtain

$$\begin{aligned} F(x^k)-F(x^*)\le \frac{\Vert x^0-x^*\Vert ^2}{4\mu _kt_k^2}. \end{aligned}$$
(4.9)

In the worst case we have \(\mu _k^x=0\) for all \(k\) and \(\mu _k\ge \frac{\beta }{2L(f)}\), and the algorithm reduces to FISTA. Hence, just as for FISTA, from (4.8) we have the following theorem.

Theorem 4.1

At each iteration \(k\) of Algorithm 4 we have

$$\begin{aligned} F(x^k)-F(x^*)=v_k\le \frac{2L(f)\Vert x^0-x^*\Vert ^2}{\beta k^2}. \end{aligned}$$
(4.10)

More generally, similarly to the case of FISTA-BKTR, we have the following complexity result.

Theorem 4.2

Let \(\sqrt{\mu (k)}=(\sum _{i=1}^k \sqrt{ \mu _i})/k \) be the average of the square roots of the average prox parameters \(\mu _i\) used during the first \(k\) iterations of Algorithm 4. Then

$$\begin{aligned} F(x^k)-F(x^*)=v_k\le \frac{2 \Vert x^0-x^*\Vert ^2}{\mu (k) k^2}. \end{aligned}$$
(4.11)

Note that the prox parameters \(\mu _i\) now depend on the local composite Lipschitz constants of both \(f\) and \(g\), i.e., \(L(f, g, y^k)\) and \(L(g, f, z^k)\). For simplicity we state the theorem in terms of \(\mu (k)\) instead of \(L(f, g, y^k)\) and \(L(g, f, z^k)\), since the latter may not be well defined in some cases and the skipped steps need to be accounted for.

4.1 A Practical Backtracking FALM Algorithm for Compressed Sensing and Lasso Problems

We now discuss the additional cost of each backtracking step of Algorithm 4 compared to that of Algorithm 3 and a general efficient implementation of the algorithm targeted at the problems of the form (3.17).

We again assume here that \(g(x)\) is a simple function, hence computing \(p^g_{\mu ^x}(y)\) is a relatively cheap operation. Computing \(p^f_{\mu ^y}(x)\), however, involves solving a system of linear equations with the matrix \(A^\top A+\frac{1}{2\mu ^y}I\). In some compressed sensing problems, \(A\) has a special structure, such that this system can be solved in \(O(n \log n)\) operations – the same work as is required to multiply \(A\) or \(A^\top \) by a vector and, hence, to compute the gradient \(\nabla f(x)=A^\top (Ax-b)\) [7]. In cases where such a special structure is not present, work that involves factorization of \(A^\top A+\frac{1}{2\mu ^y}I\) may be the dominant cost of the iteration because it generally requires \(O(m^3)\) operations.

If \(\mu ^y\) is fixed beforehand, then the matrix \(A^\top A+\frac{1}{2\mu ^y}I\) can be factorized once, at the initialization stage of the algorithm, and hence the per-iteration cost only involves additional matrix-vector products. If \(\mu ^y\) is updated in an arbitrary way on each iteration, then the factorization must be repeated each time. Recall that ideally we want \(\mu ^y\) to have the largest possible value that satisfies the line search conditions in Step 2 of Algorithm 4 and that keeping \(\mu ^y\) constant may result in very slow progress. Hence, again, there exists a tradeoff between choosing a suitable value of the prox parameter and the per-iteration cost. For practical efficiency we strive to achieve a reasonable balance. Assume that we choose some value \(\bar{\mu }_1^y\) at the beginning of the algorithm and we choose \(\bar{\mu }^y_{k+1}=\beta ^i \mu ^y_k\) for some \(i\in Z\) at each iteration \(k\). Note that in this case, for all \(k\), \(\mu _k^y\) can only take values \(\beta ^j \mu _1^y\) for \(j\in Z\). Let us consider \(\beta =0.5\). We also impose the following restrictions on \(\mu _k^y\): if \(\mu _k^y<\mu _1^y/1000\), then the expected improvement achieved by Step 2 is too small and the step is automatically skipped; if \(\mu _k^y>1000\mu _1^y\), the prox parameter is large enough and no additional increase of \(\mu _k^y\) is necessary. Hence the only values allowed for \(\mu _k^y\) throughout the algorithm are \(\{2^{-10}\mu _1^y,2^{-9}\mu _1^y,\ldots ,\mu _1^y,2\mu _1^y,4\mu _1^y,\ldots ,2^{10}\mu _1^y\}\), overall \(21\) possible values. As soon as one of these values occurs in the algorithm, the corresponding matrix factorization can be computed and stored for future use.
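A minimal sketch of this caching scheme (class and method names are ours; dense matrices are assumed, whereas structured or FFT-based solves would be used when available):

```python
import numpy as np

class FactorizationCache:
    """Sketch of the caching strategy above: systems with the matrix
    A^T A + (1/(2*mu)) I are solved for mu restricted to the grid
    {2^j * mu_1 : j = -10, ..., 10}, and each matrix is factorized only the
    first time its value of mu is encountered."""
    def __init__(self, A, mu_1):
        self.AtA = A.T @ A
        self.mu_1 = mu_1
        self.chol = {}                          # j -> Cholesky factor

    def solve(self, j, rhs):
        if j not in self.chol:
            mu = (2.0 ** j) * self.mu_1
            M = self.AtA + np.eye(self.AtA.shape[0]) / (2.0 * mu)
            self.chol[j] = np.linalg.cholesky(M)
        L = self.chol[j]
        return np.linalg.solve(L.T, np.linalg.solve(L, rhs))
```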

If the skipping of Step 2 occurs for a few consecutive iterations, we may choose to automatically skip this step in subsequent iterations and thus avoid the additional cost of computing \(p^f_{\mu ^y}(x)\). In this case, the FALM-BKTR algorithm reduces to FISTA-BKTR. We found it beneficial to attempt Step 2 from time to time, even after it has been skipped consistently on prior iterations.

The management of the \(\mu _k^x\) parameter and the additional per-iteration cost of Step 3 can be executed similarly to what is described in Sect. 3.1 for the FISTA-BKTR implementation.

5 Computational Results

We now present numerical results for several sparse optimization problems of the standard compressed sensing or Lasso form:

$$\begin{aligned} \bar{x}: =\arg \min _x \quad \frac{1}{2}\Vert Ax-b\Vert _2^2 + \rho \Vert x\Vert _1, \end{aligned}$$
(5.1)

with \(f(x):=\frac{1}{2}\Vert Ax-b\Vert _2^2\) and \(g(x):=\rho \Vert x\Vert _1\).

We compare the following algorithms:

  • FISTA: the original FISTA, [1], as described in Algorithm 1 with \(\theta _k=1\).

  • FISTA-BKTR: an efficient implementation of Algorithm 2, as discussed in Sect. 3.1.

  • FALM: an implementation of Algorithm 3, as discussed in Sect. 4.

  • FALM-BKTR: an efficient implementation of Algorithm 4, as discussed in Sect. 4.1.

  • SpaRSA: a gradient-based algorithm, with the use of shrinking, described in [5].

  • Yall1: a solver based on alternating-direction methods, described in [16].

We compared the performance of the algorithms, benchmarking them against FISTA. In particular, we ran FISTA for \(j\) iterations and recorded the best objective function value \(FISTA(j)\) achieved by FISTA thus far. Then for all other algorithms we recorded the number of iterations it took to reach a function value smaller than \(FISTA(j)\) (see the sketch below). We report the number of iterations as well as the number of matrix-vector multiplications. Throughout our tests, the maximum number of iterations is set to 100,000, and the tolerance is set to \(10^{-3}\), which means that an algorithm terminates when the objective function value is within \(10^{-3}\) of the (precomputed) optimal value. We report the final objective function value when each algorithm terminates.
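A sketch of this counting protocol (helper name is ours):

```python
def iterations_to_beat(objective_history, fista_best_at_j):
    """Given the per-iteration objective values of a competing method, return the
    number of iterations it needs to fall below FISTA(j)."""
    for it, val in enumerate(objective_history, start=1):
        if val < fista_best_at_j:
            return it
    return None   # not reached within the recorded iterations
```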

The main goal of our experiments is to demonstrate that our full backtracking strategy provides not only a theoretical but also a practical advantage when applied to the FISTA and FALM methods. The comparison to the Yall1 and SpaRSA methods is presented here only to gauge the difficulty of the problems in our experiments and to demonstrate that the behavior of our methods is reasonable. Our implementations were written in MATLAB and run in MATLAB R2010b on a laptop with an Intel Core Duo 1.8 GHz CPU and 2 GB RAM. We used the default settings for both Yall1 and SpaRSA, which likely accounts for the poor performance of these algorithms on some of the problems.

5.1 Spear Examples

This set of instances is obtained from the Sparse Exact and Approximate Recovery Project.

Spear10 \((1024\times 512)\). The dynamic range is 3.02e+4; the sparsity is 18 (i.e., there are 18 nonzero elements in the true solution). This problem provides a relatively easy instance when \(\rho =1\), but the difficulty increases substantially as \(\rho \) decreases.

From Table 1, we see that the algorithms with backtracking (FISTA-BKTR and FALM-BKTR) were generally faster than their basic counterparts (FISTA, FALM) in terms of the number of matrix-vector multiplications. For example, it takes only 627 iterations and 1,343 matrix-vector multiplications for FISTA-BKTR to reduce the objective function value below \(\mathrm{FISTA}(1000) = 1.0035\mathrm{e}+5\). Comparing FALM-S1 and FALM-BKTR, we observe that the latter is faster for the same initial choice of \(\mu _g\) (\(\mu _g = 1\)). The initial performance of FALM-S2 and, especially, FALM-S3 is good due to the larger starting values for \(\mu _g\); however, their performance falls behind that of FALM-BKTR as the iterations progress. This indicates that the full backtracking strategy can help accelerate the original algorithms at any stage.

Table 1 Comparison of algorithms for solving (5.1) with \(\rho =1\) on Spear10

In Fig. 1 we plot how \(\mu \) changes over the iterations taken by FISTA and FISTA-BKTR when solving (5.1) with \(\rho =1\) on Spear10. We see that FISTA-BKTR achieves larger values of \(\mu \) by allowing backtracking and thus takes larger steps on some of the iterations, which accounts for the smaller number of iterations it requires.

Fig. 1 Comparison of \(\mu \) values while solving (5.1) with \(\rho =1\) on Spear10

Table 2 shows the results on a Spear10 problem with \(\rho =0.01\). This problem provides a difficult instance, on which our backtracking method appears to provide a clear advantage. SpaRSA did not converge to a neighborhood of the solution, while Yall1 achieved an accuracy of only \(10^{-1}\), not \(10^{-3}\). Here we observe that FISTA-BKTR retains its advantage, while FALM-BKTR seems to slow down compared to FALM as it gets closer to the solution. The reasons for this behavior will be investigated in the future (Table 3).

Table 2 Comparison of algorithms for solving (5.1) with \(\rho =0.01\) on Spear10
Table 3 Comparison of algorithms for solving (5.1) with \(\rho =0.1\) on Spear3

Spear3 \((1024\times 512)\) with \(\rho =0.1\). The dynamic range is 2.7535e+4; the sparsity is 6. We observe that the behavior of FALM-BKTR converges to that of FISTA-BKTR in later iterations due to persistent skipping of Step 2.

5.2 Bdata Problems

The Bdata test set was originally created by A. Nemirovski with the aim of imitating examples with worst-case complexity for first-order methods. This problem, however, provides a relatively easy instance, probably due to the presence of the \({\ell }_1\) term in the objective. Here we present the results for Bdata1 \((1036\times 1036)\), with \(\rho =0.0001\); the dynamic range is 5.9915 and the sparsity is 16 (Table 4).

Table 4 Comparison of algorithms for solving (5.1) with \(\rho =0.0001\) on Bdata1

5.3 Sparco Problems

This category of instances is obtained from [2]. Because the Sparco instances use function handles for matrix computation, which our FALM implementation is not equipped to handle, we do not include FALM in this comparison. We present the results for Sparco3 \((2048\times 1024)\), with \(\rho =0.01\), dynamic range \(2\), and sparsity \(2\). We observe that, for this relatively easy instance, FISTA-BKTR has a minor advantage over FISTA in terms of the number of matrix-vector multiplications. For this example, FISTA outperforms the alternating-direction-based method Yall1, while SpaRSA appears to be the best method for this instance (Table 5).

Table 5 Comparison of algorithms for solving (5.1) with \(\rho =0.01\) on Sparco3

5.4 Smoothed \(\ell _2\) Norm Minimization

As an alternative to problem (5.1) one may wish to solve the following problem with an exact \({\ell }_2\) penalty term:

$$\begin{aligned} \bar{x}: =\arg \min _x \quad \Vert Ax-b\Vert _2 + \rho \Vert x\Vert _1. \end{aligned}$$
(5.2)

To apply the FISTA and FALM family of methods, we can smooth the \({\ell }_2\) term with the well-known Huber penalty function and obtain the following minimization problem:

$$\begin{aligned} \bar{x}: =\arg \min _x \quad H_{\nu }(||Ax-b||_{2}) + \rho \Vert x\Vert _1, \end{aligned}$$
(5.3)

where

$$\begin{aligned} H_{\nu }(y)= \left\{ \begin{array}{l@{\quad }l} \frac{y^{2}}{2\nu }, &{} 0\le |y|\le \nu \\ |y|-\frac{\nu }{2}, &{} |y|\ge \nu , \end{array} \right. \end{aligned}$$

for \(\nu >0\). If \(\nu <\epsilon \), then the solution of (5.3) is an \(\epsilon \)-solution to (5.2). We define \(f(x):=H_{\nu }(||Ax-b||_{2})\) and \(g(x):=\rho \Vert x\Vert _1\). It is well known that the global Lipschitz constant of \(\nabla f(x)\) is \(O(1/\nu )\). Recall Example 2.2, where we analyzed the local composite Lipschitz constant for the case where \(g(x)\equiv 0\) and showed that away from the optimal solution the local composite Lipschitz constant of \(\nabla f(x)\) is much smaller than \(O(1/\nu )\). The analysis of the case where \(g(x):=\rho \Vert x\Vert _1\) is much more complex, but the essential expectation for our first-order schemes remains the same: the prox parameter \(\mu \) will be relatively large away from the solution and will decrease as the algorithm converges. In fact, FISTA and FALM in their original form will exhibit the same behavior of the prox parameter because they allow the prox parameter to decrease but not increase. Hence, we do not expect significant savings from backtracking in this setting; however, we present experiments for illustration and to confirm our expectation of the prox-parameter behavior.
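For reference, evaluating the smoothed term and its gradient is straightforward (a sketch with our own names; dense arrays assumed, mirroring the gradient formula of Example 2.2):

```python
import numpy as np

def huber_f_and_grad(A, b, nu, x):
    """f(x) = H_nu(||Ax - b||_2) and its gradient for the smoothed problem (5.3)."""
    r = A @ x - b
    nr = np.linalg.norm(r)
    if nr <= nu:
        return nr ** 2 / (2 * nu), A.T @ r / nu
    return nr - nu / 2, A.T @ r / nr
```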

In Tables 6 and 7 we present a comparison of the first-order methods on the Spear10 data and formulation (5.3). We observe that FISTA-BKTR is much faster than FISTA and SpaRSA when the initial prox parameter value is not very large. Compared with FISTA, FISTA-BKTR allows for a huge increase in \(\mu \). If the initial \(\mu \) is set to 1, then FISTA's performance is very slow, as shown in Table 6. But if \(\mu =1000\), then FISTA's performance improves to the level of FISTA-BKTR, as seen in Table 7. This is illustrated by the change of \(\mu \) shown in Fig. 2.

Fig. 2 Comparison of \(\mu \) values while solving (5.3) with \(\rho =0.1\) on Spear10

Table 6 Comparison of algorithms for solving (5.3) with \(\rho =0.1\) on Spear10
Table 7 Comparison of the algorithms for solving (5.3) with \(\rho =0.1\) on Spear10

In Table 8 and Fig. 3 we show the outcome of the experiments on the Bdata1 test set. In this case, FISTA and FISTA-BKTR perform as expected, with FISTA-BKTR retaining a small advantage. In fact, after 500 iterations, the \(\mu \) value for both algorithms becomes small, which indicates a large Lipschitz constant for solving (5.3) on Bdata1.

Fig. 3 Comparison of \(\mu \) values while solving (5.3) with \(\rho =0.001\) on Bdata

Table 8 Comparison of algorithms for solving (5.3) with \(\rho =0.01\) on Bdata1

5.5 \({\ell }_2\) Regularized Logistic Regression

Finally, we illustrate the behavior of FISTA vs. FISTA-BKTR on an example of \({\ell }_2\)-regularized logistic regression applied to an optical character recognition data set, Optdigits, from the UCI repository [6]. We present this example purely for illustration, to show that in settings other than compressed sensing the backtracking strategy can produce significant improvements. As we discussed, to obtain optimal performance from backtracking, a careful implementation is needed that takes the problem structure into account. Such an implementation for logistic regression and other problems will be the subject of future research. Here we present the results of the basic approach where \(\mu \) is increased by a factor of \(2\) at each iteration. FISTA required \(5705\) iterations and \(11417\) matrix-vector multiplications to obtain a solution with a gradient norm less than \(10^{-2}\), while FISTA-BKTR required \(1315\) iterations and \(3512\) matrix-vector multiplications. In Fig. 4 we plot the behavior of the \(\mu \) parameter for both algorithms, which clearly shows that FISTA-BKTR benefits from much larger steps. The analysis of local Lipschitz constants for logistic regression will also be the subject of future research.

Fig. 4 Comparison of \(\mu \) values while solving the logistic regression problem on the Optdigits data set

6 Conclusion

We present a generalized version of accelerated first-order schemes capable of estimating the prox parameter via backtracking, thereby allowing the value of this parameter to increase as well as decrease. We show that the value of the parameter depends on the so-called local composite Lipschitz constant of the gradient, rather than on the global Lipschitz constant, and that the complexity results can then be derived in terms of the average of the local composite Lipschitz constants along the iterates of the algorithm. We show via examples that the local constants can be much smaller than the global ones; hence, one could potentially obtain better convergence bounds. To produce such bounds, one would need to combine the analysis in this paper with an analysis of the iterates of a first-order algorithm, which will be the subject of a future study. Our computational experiments and the discussion of a practical implementation show that in practice our proposed backtracking schemes offer improvements over the standard accelerated first-order algorithms.