1 Introduction

Accelerated gradient schemes were first proposed by Yurii Nesterov in 1983 [18]. He demonstrated a simple modification to gradient descent that could obtain provably optimal performance for the complexity class of first-order algorithms applied to minimize smooth convex functions. The method, and its successors, are often referred to as ‘fast’, ‘accelerated’, or ‘optimal’ methods. In recent years there has been a resurgence of interest in first-order optimization methods [1, 3, 14, 20, 24], driven primarily by the need to solve very large problem instances unsuited to second-order methods.

Accelerated gradient schemes can be thought of as momentum methods, in that the step taken at each iteration depends on the previous iterations, where the momentum grows from one iteration to the next. When we refer to restarting the algorithm we mean starting the algorithm again, taking the current iteration as the new starting point. This erases the memory of previous iterations and resets the momentum back to zero.

Unlike gradient descent, accelerated methods are not guaranteed to be monotone in the objective function value. A common observation when running an accelerated method is the appearance of ripples or bumps in the trace of the objective value: seemingly regular increases in the objective; see Fig. 1 for an example. In this paper we demonstrate that this behavior can occur when the momentum has exceeded a critical value (the optimal momentum value derived by Nesterov in [19]) and that the period of these ripples is proportional to the square root of the (local) condition number of the function. Separately, we re-derive the previously known result that the optimal restart interval is also proportional to the square root of the condition number. Combining these observations provides a justification for a restart technique whereby we restart whenever we observe this rippling behavior. The analysis also suggests that if the function is locally well-conditioned we may be able to use restarting to obtain a linear convergence rate inside the well-conditioned region.

Fig. 1 Convergence of Algorithm 1 with different estimates of q

Smooth Unconstrained Optimization

We wish to minimize a smooth convex function [4] of a variable $x \in \mathbf{R}^n$, i.e.,

$$ \mbox{minimize}\quad f(x) $$
(1)

where $f:\mathbf{R}^n \to \mathbf{R}$ has a Lipschitz continuous gradient with constant L, i.e.,

$$\bigl\|\nabla f(x) -\nabla f(y)\bigr\|_2 \leq L \|x-y\|_2, \quad \forall x,y \in \mathbf {R}^n. $$

We shall denote by $f^\star$ the optimal value of the above optimization problem; if a minimizer exists, we shall write it as $x^\star$. Further, a continuously differentiable function f is said to be strongly convex with strong convexity parameter μ>0 if

$$f(x) \geq f(y) + \nabla f(y)^{\mathrm{T}}(x-y) + (\mu/2) \|x-y \|_2^2, \quad \forall x,y \in \mathbf {R}^n. $$

The condition number of a smooth, strongly convex function is given by L/μ.

2 Accelerated Methods

Accelerated first-order methods to solve (1) were first developed by Nesterov [18]; the scheme presented here as Algorithm 1 is from [19]. There are many variants of this scheme, see, e.g., [1, 2, 14, 20, 24]. Note that setting q=1 in Algorithm 1 recovers gradient descent. For a smooth convex function the scheme converges for any $t_k \leq 1/L$; setting $t_k = 1/L$ and q=0 obtains a guaranteed convergence rate of

$$ f\bigl(x^{k}\bigr) - f^{\star} \leq \frac{4 L \|x^0 - x^{\star}\|^2}{(k+2)^2}, $$
(2)

assuming a minimizer exists. If the function is also strongly convex with strong convexity parameter μ, then a choice of q=μ/L (the reciprocal of the condition number) will achieve

$$ f\bigl(x^{k}\bigr) - f^{\star} \leq L \biggl(1-\sqrt{\frac{\mu}{L}}\,\biggr)^k \bigl\|x^0 - x^{\star}\bigr\|^2. $$
(3)

This is often referred to as linear convergence. With this convergence rate we can achieve an accuracy of ϵ in

$$ \mathcal{O} \biggl(\sqrt{\frac{L}{\mu}}\log \frac{1}{\epsilon} \biggr) $$
(4)

iterations.

In the case of a strongly convex function, the simpler scheme given in Algorithm 2 obtains the same guaranteed rate of convergence [19], where we set

$$ \beta^{\star}=\frac{1- \sqrt{\mu/L}}{1+\sqrt{\mu/L}}. $$
(5)

Note that in Algorithm 1, using the optimal choice $q = \mu/L$, we have $\beta_k \to \beta^\star$. If $\beta_k$ is interpreted as a momentum term then $\beta^\star$ is the maximum amount of momentum we should apply; when the value of β is higher than $\beta^\star$ we are in the ‘high momentum’ regime. We shall return to this point later.

Algorithm 1 Accelerated scheme I
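As a concrete reference, Algorithm 1 can be sketched in a few lines of NumPy; the function name and gradient-oracle interface below are our own, not from the original papers, and the θ-recursion is the standard one generating the momentum $\beta_k$:

```python
import numpy as np

def accel_scheme_1(grad_f, x0, t, q=0.0, iters=100):
    """Sketch of accelerated scheme I: a gradient step from the
    extrapolated point y^k, followed by a momentum step whose
    coefficient beta_k is built from the theta sequence."""
    x = np.asarray(x0, dtype=float)
    y = x.copy()
    theta = 1.0
    for _ in range(iters):
        x_next = y - t * grad_f(y)            # x^{k+1} = y^k - t grad f(y^k)
        # theta_{k+1} is the positive root of
        #   theta^2 = (1 - theta) theta_k^2 + q theta
        b = theta**2 - q
        theta_next = 0.5 * (-b + np.sqrt(b**2 + 4.0 * theta**2))
        beta = theta * (1.0 - theta) / (theta**2 + theta_next)
        y = x_next + beta * (x_next - x)      # momentum (extrapolation) step
        x, theta = x_next, theta_next
    return x
```

With q=1 the recursion gives $\theta_k \equiv 1$ and $\beta_k \equiv 0$, i.e., plain gradient descent; with $0 < q < 1$ the $\beta_k$ tend to the constant $(1-\sqrt{q})/(1+\sqrt{q})$.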

The convergence rates of Algorithms 1 and 2 are optimal in the sense of the lower complexity bounds derived by Nemirovski and Yudin in [17]. However, this convergence is only guaranteed when the function parameters μ and L are known in advance.

Algorithm 2 Accelerated scheme II
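Algorithm 2 can be sketched similarly, assuming μ and L are known; the constant momentum is the $\beta^\star$ of (5), and the naming is again our own:

```python
import numpy as np

def accel_scheme_2(grad_f, x0, t, mu, L, iters=100):
    """Sketch of accelerated scheme II: constant momentum beta*."""
    beta = (1.0 - np.sqrt(mu / L)) / (1.0 + np.sqrt(mu / L))
    x = np.asarray(x0, dtype=float)
    y = x.copy()
    for _ in range(iters):
        x_next = y - t * grad_f(y)            # gradient step from y^k
        y = x_next + beta * (x_next - x)      # constant momentum beta*
        x = x_next
    return x
```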

2.1 Robustness

A natural question to ask is how robust accelerated methods are to errors in the estimates of the Lipschitz constant L and the strong convexity parameter μ. For the case of an unknown Lipschitz constant we can estimate the optimal step-size by backtracking; see, e.g., [2, 3, 24]. Estimating the strong convexity parameter is much more challenging.

Estimating the Strong Convexity Parameter

In [20] Nesterov demonstrated a method to bound μ, similar to the backtracking schemes used to estimate L. His scheme achieves a convergence rate somewhat slower than that of Algorithm 1 when μ is known. In practice, we often assume (or guess) that μ is zero, which corresponds to setting q=0 in Algorithm 1. Indeed, many discussions of accelerated algorithms do not even include a q term, e.g., the original algorithm in [18]. However, this choice can dramatically slow down the convergence of the iterates. Figure 1 shows Algorithm 1 applied to minimize a positive definite quadratic function in n=200 dimensions, with the optimal choice of q being $q^\star = \mu/L = 4.1\times 10^{-5}$ (a condition number of about $2.4\times 10^4$), and step-size t=1/L. Each trace is the progress of the algorithm with a different choice of q (hence a different estimate of μ).

We observe that slightly over- or underestimating the optimal value of q can have a severe detrimental effect on the rate of convergence of the algorithm. We also note the clear difference in behavior between the cases where we underestimate and where we overestimate $q^\star$: in the latter we observe monotone convergence, but in the former we notice the appearance of regular ripples or bumps in the traces.

Interpretation

The optimal momentum depends on the condition number of the function; specifically, higher momentum is required when the function has a higher condition number. Underestimating the amount of momentum required leads to slower convergence. However, we are more often in the other regime, that of overestimated momentum, because generally q=0, in which case $\beta_k \uparrow 1$; this corresponds to high momentum and rippling behavior, as we see in Fig. 1. This can be visually understood in Fig. 2, which shows the trajectories of sequences generated by Algorithm 1 minimizing a positive definite quadratic in two dimensions, under $q=q^\star$, the optimal choice of q, and under q=0. The high momentum causes the trajectory to overshoot the minimum and oscillate around it, producing a rippling in the objective function values along the trajectory. In the sequel we shall demonstrate that the period of these ripples is proportional to the square root of the (local) condition number of the function.

Fig. 2 Sequence trajectories under Algorithm 1 and with adaptive restart

Algorithm 3 Fixed restarting

Lastly we mention that the condition number is a global parameter; the sequence generated by an accelerated scheme may enter regions that are locally better conditioned, say, near the optimum. In these cases the choice of $q=q^\star$ is appropriate outside of this region, but once we enter it we expect the rippling behavior associated with high momentum to emerge, despite the (globally) optimal choice of q.

3 Restarting

3.1 Fixed Restart

For strongly convex functions an alternative to choosing the optimal value of q in Algorithm 1 is to use restarting [3, 11, 15, 16, 20]. One example of a fixed restart scheme is Algorithm 3: we restart the algorithm every k iterations, taking as our starting point the last point produced by the algorithm, and reset the momentum back to zero.
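A self-contained sketch of fixed restarting (our own naming; the inner loop is Algorithm 1 with q=0, so the momentum is erased at every restart):

```python
import numpy as np

def accel_fixed_restart(grad_f, x0, t, k, outer):
    """Sketch of fixed restarting: run the accelerated scheme (q = 0)
    for k iterations, then restart from the last point produced,
    with the momentum reset to zero."""
    x = np.asarray(x0, dtype=float)
    for _ in range(outer):
        y, x_prev, theta = x.copy(), x.copy(), 1.0
        for _ in range(k):
            x_new = y - t * grad_f(y)
            # theta update with q = 0:
            #   theta_{k+1}^2 = (1 - theta_{k+1}) theta_k^2
            theta_new = 0.5 * (np.sqrt(theta**4 + 4.0 * theta**2) - theta**2)
            beta = theta * (1.0 - theta) / (theta**2 + theta_new)
            y = x_new + beta * (x_new - x_prev)
            x_prev, theta = x_new, theta_new
        x = x_prev            # restart point; momentum memory erased
    return x
```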

Optimal Fixed Restart Interval

Fixed restart intervals have been examined and upper bounds on the optimal restart interval have been derived by several authors; see, e.g., [16, §11.4], [11, 13, 20]. We re-derive an upper bound here.

If we restart every k iterations we have, at outer iteration j, inner loop iteration k (just before a restart),

$$f\bigl(x^{(j,k)}\bigr) - f^\star \leq \frac{4 L \|x^{(j,0)} - x^\star\|^2}{(k+2)^2} \leq \frac{8L}{\mu k^2} \bigl(f\bigl(x^{(j,0)}\bigr) - f^\star\bigr), $$

where the first inequality is the convergence guarantee (2) of Algorithm 1, and the second comes from the strong convexity of f (together with $(k+2)^2 \geq k^2$). So after jk steps we have

$$f\bigl(x^{(j,0)}\bigr) - f^\star \leq \bigl( 8 L / \mu k^2 \bigr)^j \bigl(f\bigl(x^{(0,0)}\bigr) - f^\star\bigr). $$

We wish to minimize this quantity over j and k jointly, subject to having jk=c total iterations. A simple calculation yields

$$ k^\star= e \sqrt{8L/\mu}. $$
(6)

Using this as our restart interval we obtain an accuracy of ϵ in less than \(\mathcal{O}(\sqrt{L/\mu} \log(1/\epsilon))\) iterations, i.e., the optimal linear convergence rate as in (4).
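The calculation behind (6) can be made explicit: with jk=c fixed, minimizing the bound $(8L/\mu k^2)^{c/k}$ over k is equivalent to maximizing $g(k)=(1/k)\log(\mu k^2/8L)$, and

```latex
g'(k) = \frac{1}{k^2} \biggl( 2 - \log\frac{\mu k^2}{8L} \biggr) = 0
\quad\Longleftrightarrow\quad
\frac{\mu k^2}{8L} = e^2
\quad\Longleftrightarrow\quad
k = e\sqrt{8L/\mu}.
```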

The drawbacks of fixed restarts are twofold: first, the restart interval depends on the unknown parameters L and μ; second, it is a conservative estimate based on global parameters, which may be inappropriate in better-conditioned regions.

3.2 Adaptive Restart

The above analysis suggests that an adaptive restart technique may be useful when using Algorithm 1. In particular we want a scheme that makes some computationally cheap observation and decides whether or not to restart based on that observation. In this paper we suggest two schemes that perform well in practice and provide a heuristic analysis that suggests improved convergence when these schemes are used.

  • Function scheme: restart whenever

    $$f\bigl(x^k\bigr) > f\bigl(x^{k-1}\bigr). $$
  • Gradient scheme: restart whenever

    $$\nabla f\bigl(y^{k-1}\bigr)^{\mathrm{T}}\bigl(x^k - x^{k-1}\bigr) >0. $$

Empirically we observe that these two schemes perform similarly well. The gradient scheme has two advantages over the function scheme. First, all quantities involved in the gradient scheme are already calculated in accelerated schemes, so no extra computation is required. Second, near the optimum the gradient scheme may be more numerically stable: $\nabla f(y^{k-1})^{\mathrm{T}}(x^k - x^{k-1})$ tends to zero as we approach the optimum, whereas $f(x^k)$ tends to $f^\star$, leading to possible cancellation errors when evaluating $f(x^k)-f(x^{k-1})$.

We can give rough justifications for each scheme. The function scheme restarts at the bottom of the troughs as in Fig. 1, thereby avoiding the wasted iterations where we are moving away from the optimum. The gradient scheme restarts whenever the momentum term and the negative gradient are making an obtuse angle. In other words we restart when the momentum seems to be taking us in a bad direction, as measured by the negative gradient at that point.
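Either restart test is one extra line inside Algorithm 1; below is a self-contained sketch with q=0 (the function names are ours):

```python
import numpy as np

def accel_adaptive_restart(f, grad_f, x0, t, iters=500, scheme="gradient"):
    """Sketch of Algorithm 1 (q = 0) with adaptive restart: when the
    restart test fires, reset the theta/momentum sequence to zero."""
    x = np.asarray(x0, dtype=float)
    y, theta = x.copy(), 1.0
    for _ in range(iters):
        g = grad_f(y)
        x_new = y - t * g
        if scheme == "function":
            restart = f(x_new) > f(x)        # f(x^k) > f(x^{k-1})
        else:
            restart = g @ (x_new - x) > 0    # momentum vs. negative gradient
        if restart:
            theta, y = 1.0, x_new.copy()     # erase memory, zero momentum
        else:
            theta_new = 0.5 * (np.sqrt(theta**4 + 4.0 * theta**2) - theta**2)
            beta = theta * (1.0 - theta) / (theta**2 + theta_new)
            y = x_new + beta * (x_new - x)
            theta = theta_new
        x = x_new
    return x
```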

Figure 3 shows the effect of different restart intervals on minimizing a positive definite quadratic function in n=500 dimensions. In this particular case the upper bound (6) on the optimal restart interval is 700 iterations. We note that when this interval is used the convergence is better than with no restart, but not as good as with the optimal choice of q. We also note that restarting every 400 iterations performs about as well as restarting every 700 iterations, suggesting that the optimal restart interval is somewhat lower than 700. We have also plotted the performance of the two adaptive restart schemes; their performance is on the same order as the algorithm with the optimal q and much better than with a fixed restart interval. Figure 2 demonstrates the function restart scheme trajectories for a two-dimensional example: restarting resets the momentum and prevents the characteristic spiraling behavior.

Fig. 3 Comparison of fixed and adaptive restart intervals

It should be noted that the conjugate gradient method [12, 21] outperforms fast gradient schemes when minimizing a quadratic, both in theory and practice. See [21, eq. 5.36] and compare with the convergence rate in (3). We use quadratics here simply to illustrate the technique.

4 Analysis

In this section we consider applying an accelerated scheme to minimizing a positive definite quadratic function. We shall see that once the momentum is larger than a critical value we observe periodicity in the iterates. We use this periodicity to recover linear convergence when using adaptive restarting. The analysis presented in this section is similar in spirit to the analysis of the heavy ball method in [22, §3.2].

4.1 Minimizing a Quadratic

Consider minimizing a strongly convex quadratic. Without loss of generality we can assume that f has the following form:

$$f(x) = (1/2)x^{\mathrm{T}}Ax $$

where $A \in \mathbf{R}^{n\times n}$ is positive definite and symmetric. In this case $x^\star = 0$ and $f^\star = 0$. We have strong convexity parameter $\mu = \lambda_{\min} > 0$ and $L = \lambda_{\max}$, where $\lambda_{\min}$ and $\lambda_{\max}$ are the minimum and maximum eigenvalues of A, respectively.

4.2 The Algorithm as a Linear Dynamical System

We apply an accelerated scheme to minimize f with a fixed step-size t=1/L. Given quantities $x^0$ and $y^0 = x^0$, Algorithm 1 is carried out as follows:

$$\begin{aligned} x^{k+1} &= y^{k} - (1/L) A y^{k}, \\ y^{k+1} &= x^{k+1} + \beta_k \bigl(x^{k+1} - x^{k}\bigr). \end{aligned} $$

For the rest of the analysis we shall take $\beta_k$ to be constant and equal to some β for all k. By making this approximation we can show that there are two regimes of behavior for the system, depending on the value of β. Consider the eigenvector decomposition $A = V\varLambda V^{\mathrm{T}}$. Denote by $w^k = V^{\mathrm{T}}x^k$ and $v^k = V^{\mathrm{T}}y^k$. In this basis the update equations can be written

$$\begin{aligned} w^{k+1} &= v^{k} - (1/L) \varLambda v^{k}, \\ v^{k+1} &= w^{k+1} + \beta\bigl(w^{k+1} - w^{k}\bigr). \end{aligned} $$

These are n independently evolving dynamical systems. The ith system evolves according to

$$\begin{aligned} w_i^{k+1} &= v_i^{k} - (\lambda_i/L) v_i^{k}, \\ v_i^{k+1} &= w_i^{k+1} + \beta \bigl(w_i^{k+1} - w_i^{k}\bigr), \end{aligned} $$

where $\lambda_i$ is the ith eigenvalue of A. Eliminating the sequence $v_i^k$ from the above we obtain the following recurrence relation for the evolution of $w_i$:

$$w_i^{k+2} = (1+\beta) (1-\lambda_i/L) w_i^{k+1} - \beta(1 - \lambda_i /L) w_i^{k},\quad k=0,1,\ldots, $$

where \(w_{i}^{0}\) is known and \(w_{i}^{1} = w_{i}^{0} (1-\lambda_{i} /L)\), i.e., a gradient step from \(w_{i}^{0}\).

The update equation for v i is identical, differing only in the initial conditions,

$$v_i^{k+2} = (1+\beta) (1-\lambda_i /L) v_i^{k+1} - \beta(1 - \lambda_i /L) v_i^{k},\quad k=0,1,\ldots, $$

where \(v_{i}^{0} = w_{i}^{0}\) and \(v_{i}^{1} = ((1+\beta)(1-\lambda _{i}/L)-\beta)v_{i}^{0}\).

4.3 Convergence Properties

The behavior of this system is determined by the characteristic polynomial of the recurrence relation,

$$ r^2 - (1+\beta) (1-\lambda_i /L) r + \beta(1-\lambda_i /L). $$
(7)

Let \(\beta^{\star}_{i}\) be the critical value of β for which this polynomial has repeated roots, i.e.,

$$\beta^\star_i := \frac{1-\sqrt{\lambda_i /L}}{1+\sqrt{\lambda_i /L}}. $$

If \(\beta\leq\beta^{\star}_{i}\) then the polynomial (7) has two real roots, $r_1$ and $r_2$, and the system evolves according to [8]

$$ w_i^k = c_1 r_1^k + c_2 r_2^k. $$
(8)

When \(\beta= \beta^{\star}_{i}\) the roots coincide at the point \(r^{\star}= (1+\beta)(1-\lambda_{i} /L)/2 = 1-\sqrt{\lambda_{i} /L}\); this corresponds to critical damping, and we have the fastest monotone convergence, at rate \(\propto (1-\sqrt{\lambda_{i} /L})^{k}\). Note that if $\lambda_i = \mu$ then \(\beta^{\star}_{i}\) is the optimal choice of β as given by (5) and the convergence rate is the optimal rate, as given by (3). This is because typically the mode corresponding to the smallest eigenvalue dominates the convergence of the entire system.

If \(\beta< \beta^{\star}_{i}\) we are in the low momentum regime, and we say the system is over-damped. The convergence rate is dominated by the larger root, which is greater than $r^\star$; i.e., the system exhibits slow monotone convergence.

If \(\beta> \beta^{\star}_{i}\) then the roots of the polynomial (7) are complex and we are in the high momentum regime. The system is under-damped and exhibits periodicity. In that case the characteristic solution is given by [8]

$$w_i^{k} = c_i \bigl(\beta(1- \lambda_i /L) \bigr)^{k/2} \bigl(\cos(k \psi_i - \delta_i) \bigr) $$

where

$$\psi_i = \cos^{-1} \biggl(\frac{(1-\lambda_i /L) (1+\beta)}{2\sqrt{\beta (1-\lambda_i /L)}} \biggr) $$

and $\delta_i$ and $c_i$ are constants that depend on the initial conditions; in particular for β≈1 we have $\delta_i \approx 0$ and we will ignore it. Similarly,

$$v_i^{k} = \hat{c}_i \bigl(\beta(1- \lambda_i /L) \bigr)^{k/2} \bigl(\cos(k \psi_i - \hat{\delta}_i) \bigr) $$

where \(\hat{\delta}_{i}\) and \(\hat{c}_{i}\) are constants, and again \(\hat{\delta}_{i} \approx0\). For small θ we know that \(\cos^{-1}(\sqrt{1-\theta}) \approx\sqrt {\theta}\), and therefore if $\lambda_i \ll L$, then

$$\psi_i \approx\sqrt{\lambda_i /L}. $$

In particular the frequency of oscillation for the mode corresponding to the smallest eigenvalue μ is approximately given by \(\psi_{\mu}\approx \sqrt{\mu/L}\).
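The approximation $\psi_i \approx \sqrt{\lambda_i/L}$ is easy to check numerically; the values below are an arbitrary illustration of ours, not taken from the paper:

```python
import numpy as np

# psi_i = arccos( (1 - lam/L)(1 + beta) / (2 sqrt(beta (1 - lam/L))) )
# should be close to sqrt(lam/L) when beta is near 1 and lam << L.
beta, ratio = 0.99, 1e-3                       # ratio = lam_i / L
arg = (1 - ratio) * (1 + beta) / (2.0 * np.sqrt(beta * (1 - ratio)))
psi = np.arccos(arg)
print(psi, np.sqrt(ratio))                     # agree to within a few percent
```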

To summarize, based on the value of β we observe the following behavior:

  • \(\beta> \beta_{i}^{\star}\): high momentum, under-damped.

  • \(\beta< \beta_{i}^{\star}\): low momentum, over-damped.

  • \(\beta= \beta_{i}^{\star}\): optimal momentum, critically damped.

4.4 Observable Quantities

We do not observe the evolution of the modes directly, but we can observe the evolution of the function value, which is given by

$$f\bigl(x^{k}\bigr) = \sum_{i=1}^n \bigl(w_i^{k}\bigr)^2 \lambda_i $$

and if \(\beta> \beta^{\star}= (1-\sqrt{\mu/L})/(1+\sqrt{\mu/L})\) we are in the high momentum regime for all modes and thus

$$f\bigl(x^{k}\bigr) = \sum_{i=1}^n \bigl(w_i^{k}\bigr)^2 \lambda_i \approx\sum_{i=1}^n \bigl(w_i^{0} \bigr)^2\lambda_i \beta^k (1- \lambda_i /L)^k \cos^2(k \psi_i). $$

The function value will quickly be dominated by the smallest eigenvalue and we have

$$ f\bigl(x^{k}\bigr) \approx\bigl(w_\mu^{0} \bigr)^2 \mu\beta^k (1-\mu/L)^k \cos^2 \bigl(k \sqrt{\mu/L}\,\bigr), $$
(9)

where we have replaced ψ μ with \(\sqrt{\mu/L}\), and we are using the subscript μ to denote those quantities corresponding to the smallest eigenvalue.

A similar analysis for the gradient restart scheme yields

$$ \nabla f\bigl(y^k\bigr)^{\mathrm{T}} \bigl(x^{k+1} - x^k\bigr) \approx\mu v^k_\mu \bigl(w^{k+1}_\mu- w^k_\mu\bigr) \propto\beta^k(1-\mu/L)^k \sin\bigl(2k\sqrt{\mu/L}\,\bigr). $$
(10)

In other words observing the quantities in (9) or (10) we expect to see oscillations at a frequency proportional to \(\sqrt{\mu/L}\), i.e., the frequency of oscillation is telling us something about the condition number of the function.

4.5 Convergence with Adaptive Restart

Applying Algorithm 1 with q=0 to minimize a quadratic starts with $\beta_0 = 0$, i.e., the system starts in the low momentum, monotone regime. Eventually $\beta_k$ becomes larger than $\beta^\star$ and we enter the high momentum, oscillatory regime; it takes about \((3/2) \sqrt{L/\mu}\) iterations for $\beta_k$ to exceed $\beta^\star$. After that the system is under-damped and the iterates obey (9) and (10). Under either adaptive restart scheme, (9) and (10) indicate that we shall observe the restart condition after a further \((\pi/2)\sqrt{L/\mu}\) iterations. We restart and the process begins again, with $\beta_k$ reset to zero. Thus under either scheme we restart approximately every

$$k^\star= \frac{\pi+3}{2} \sqrt{\frac{L}{\mu}} $$

iterations (cf. the upper bound on optimal fixed restart interval (6)). Following a similar calculation to Sect. 3.1, this restart interval guarantees us an accuracy of ϵ within \(\mathcal{O}(\sqrt{L/\mu} \log(1/\epsilon))\) iterations, i.e., we have recovered the optimal linear convergence rate of (4) via adaptive restarting, with no prior knowledge of μ.

4.6 Extension to Smooth Convex Minimization

If the function we are minimizing has a positive definite Hessian at the optimum, then by Taylor’s theorem there is a region inside of which

$$f(x) \approx f\bigl(x^\star\bigr) + \frac{1}{2}\bigl(x- x^\star \bigr)^{\mathrm{T}} \nabla^2 f\bigl(x^\star\bigr) \bigl(x-x^\star\bigr), $$

and loosely speaking we are minimizing a quadratic. Once we are inside this region we will observe behavior consistent with the analysis above, and we can exploit this behavior to achieve fast convergence by using restarts. Note that the Hessian at the optimum may have smallest eigenvalue $\lambda_{\min} > \mu$, the global strong convexity parameter, so we may be able to achieve faster local convergence than (3) would suggest. This result is similar in spirit to the restart method applied to the non-linear conjugate gradient method, where it is desirable to restart the algorithm once it reaches a region in which the function is well approximated by a quadratic [21, §5.2].

The effect of these restart schemes outside of the quadratic region is unclear. In practice we observe that restarting based on one of the criteria described above is almost always helpful, even far away from the optimum. However, we have observed cases where restarting far from the optimum can slow down the early convergence slightly, until the quadratic region is reached and the algorithm enters the rapid linear convergence phase.

5 Numerical Examples

In this section we describe three further numerical examples that demonstrate the improvement of accelerated algorithms under an adaptive restarting technique.

5.1 Log-Sum-Exp

Here we minimize a smooth convex function that is not strongly convex. Consider the following optimization problem:

$$\mbox{minimize} \quad \rho\log \Biggl(\sum _{i=1}^m \exp \bigl(\bigl(a_i^{\mathrm{T}} x - b_i\bigr)/\rho \bigr) \Biggr) $$

where $x \in \mathbf{R}^n$. The objective function is smooth but not strongly convex; it grows linearly asymptotically. Thus, the optimal value of q in Algorithm 1 is zero. The quantity ρ controls the smoothness of the function: as ρ→0, \(f(x) \rightarrow\max_{i=1,\ldots,m} (a_{i}^{\mathrm{T}}x-b_{i})\). Since f is smooth, we expect the region around the optimum to be well approximated by a quadratic (assuming the optimum exists), and thus we expect to eventually enter a region where our restart method obtains linear convergence, without any knowledge of where this region is, its size, or the local function parameters within it. For smaller values of ρ the objective function is less smooth, and thus we expect to take more iterations before entering the region of linear convergence.
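For reference, the gradient needed by any of the schemes here is $\nabla f(x) = A^{\mathrm{T}}p$, where p is the softmax of $(Ax-b)/\rho$; below is a numerically stable sketch (the helper name is ours):

```python
import numpy as np

def grad_logsumexp(A, b, rho, x):
    """Gradient of f(x) = rho * log(sum_i exp((a_i^T x - b_i)/rho)),
    namely A^T p, where p is the softmax of (Ax - b)/rho."""
    z = (A @ x - b) / rho
    z = z - z.max()            # shift before exponentiating, for stability
    p = np.exp(z)
    p /= p.sum()
    return A.T @ p
```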

As a particular example we took n=20 and m=100; we generated the $a_i$ and $b_i$ randomly. Figure 4 demonstrates the performance of four different schemes for four different values of ρ. We selected the step-size for each case using the backtracking scheme described in [3, §5.3]. We note that both restart schemes perform well, eventually beating both gradient descent and the plain accelerated scheme; both the function and the gradient schemes eventually enter a region of fast linear convergence. For large ρ we see that even gradient descent performs well: similar to the adaptive restart schemes, it is able to automatically exploit the local strong convexity of the quadratic region around the optimum; see [19, §1.2.3]. Notice also the appearance of the periodic behavior in the trace of Algorithm 1.

Fig. 4 Minimizing a smooth but not strongly convex function

Algorithm 4 ISTA

Algorithm 5 FISTA

5.2 Sparse Linear Regression

Consider the following optimization problem:

$$ \mbox{minimize} \quad (1/2)\|Ax-b \|_2^2 + \rho\|x\|_1, $$
(11)

over $x \in \mathbf{R}^n$, where $A \in \mathbf{R}^{m\times n}$ and typically $m \ll n$. This is a widely studied problem in the field of compressed sensing; see, e.g., [5, 6, 10, 23]. Loosely speaking, problem (11) seeks a sparse vector with a small measurement error, and the quantity ρ trades off these two competing objectives. The iterative soft-thresholding algorithm (ISTA) can be used to solve (11) [7, 9]. ISTA relies on the soft-thresholding operator:

$$\mathcal{T}_\alpha(x) = \mathrm{sign}(x) \max\bigl(|x|-\alpha,0\bigr), $$

where all the operations are applied element-wise. The ISTA algorithm, with constant step-size t, is given in Algorithm 4. The convergence rate of ISTA is guaranteed to be at least \(\mathcal{O}(1/k)\), making it analogous to gradient descent. The fast iterative soft-thresholding algorithm (FISTA), given in Algorithm 5, was developed in [2]; a similar algorithm was also developed by Nesterov in [20]. FISTA essentially applies acceleration to the ISTA algorithm. For any choice of $t \leq 1/\lambda_{\max}(A^{\mathrm{T}}A)$ FISTA obtains a convergence rate of at least \(\mathcal{O}(1/k^2)\). The objective in problem (11) is non-smooth, so it does not fit the class of problems we are considering in this paper. However, we are seeking a sparse solution vector $x^\star$, and we note that once the non-zero basis of the solution has been identified we are essentially minimizing a quadratic. Thus we expect that after a certain number of iterations adaptive restarting may provide linear convergence.
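A compact sketch of FISTA with the optional gradient-mapping restart test $(y^k - x^{k+1})^{\mathrm{T}}(x^{k+1}-x^k) > 0$ follows; the function names are ours, and the θ-update is the standard FISTA recursion:

```python
import numpy as np

def soft_threshold(x, alpha):
    """Element-wise soft-thresholding operator T_alpha."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def fista(A, b, rho, iters=500, restart=True):
    """Sketch of FISTA for (1/2)||Ax - b||^2 + rho ||x||_1, with an
    optional gradient-mapping restart of the momentum sequence."""
    t = 1.0 / np.linalg.norm(A.T @ A, 2)       # step-size 1/lambda_max(A^T A)
    x = np.zeros(A.shape[1])
    y, theta = x.copy(), 1.0
    for _ in range(iters):
        x_new = soft_threshold(y - t * (A.T @ (A @ y - b)), rho * t)
        if restart and (y - x_new) @ (x_new - x) > 0:
            theta, y = 1.0, x_new              # reset momentum to zero
        else:
            theta_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * theta**2))
            y = x_new + ((theta - 1.0) / theta_new) * (x_new - x)
            theta = theta_new
        x = x_new
    return x
```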

In this setting the function restart scheme can be applied unchanged, and it does not require an extra application of the matrix A, which is the costly operation in the algorithm. However, in performing FISTA we do not evaluate a gradient so we use the composite gradient mapping [20] for the gradient restart scheme, in which we take

$$x^{k+1} = \mathcal{T}_{\rho t} \bigl(y^k - tA^{\mathrm{T}}\bigl(Ay^k - b\bigr)\bigr) := y^k - tG \bigl(y^k\bigr) $$

to be a generalized gradient step, where G(y k) is a generalized gradient at y k. In this case the gradient restart scheme amounts to restarting whenever

$$ G\bigl(y^k\bigr)^{\mathrm{T}} \bigl(x^{k+1}-x^k\bigr) >0, $$
(12)

or equivalently

$$ \bigl(y^k - x^{k+1}\bigr)^{\mathrm{T}} \bigl(x^{k+1} - x^k\bigr)>0. $$
(13)

We generated data for the numerical instances as follows. First, the entries of A were sampled from a standard normal distribution. We then randomly generated a sparse vector y with n entries, only s of which were non-zero, and set b=Ay+w, where the entries of w were sampled IID from \(\mathcal{N}(0,0.1)\). This ensures that the solution vector $x^\star$ is approximately s-sparse. We chose ρ=1 and the step-size $t = 1/\lambda_{\max}(A^{\mathrm{T}}A)$. Figure 5 shows the dramatic speedup that adaptive restarting can provide, for two different examples.

Fig. 5 Adaptive restarting applied to the FISTA algorithm

Algorithm 6 Accelerated projected gradient

5.3 Quadratic Programming

Consider the following quadratic program:

$$ \begin{aligned} \mbox{minimize} \quad & (1/2) x^{\mathrm{T}} Q x + q^{\mathrm{T}} x \\ \mbox{subject to}\quad & a \leq x \leq b, \end{aligned} $$
(14)

over $x \in \mathbf{R}^n$, where $Q \in \mathbf{R}^{n\times n}$ is positive definite and $a, b \in \mathbf{R}^n$ are fixed vectors. The constraint inequalities are to be interpreted element-wise, and we assume that a<b. We denote by \(\varPi_{\mathcal{C}}(z)\) the projection of a point z onto the constraint set, which amounts to thresholding the entries in z.

Projected gradient descent [19] can solve (14); it is carried out as follows:

$$x^{k+1} = \varPi_\mathcal{C} \bigl(x^{k} - t \bigl(Qx^k +q\bigr) \bigr). $$

Projected gradient descent obtains a guaranteed convergence rate of \(\mathcal{O}(1/k)\). Acceleration has been successfully applied to the projected gradient method [2, 20]; the result is Algorithm 6. For any choice of $t \leq 1/\lambda_{\max}(Q)$ accelerated projected gradient schemes obtain a convergence rate of at least \(\mathcal{O}(1/k^{2})\).

The presence of constraints makes this a non-smooth optimization problem; however, once the active constraints have been identified the problem reduces to minimizing a quadratic over a subset of the variables, and we expect adaptive restarting to increase the rate of convergence. As in the sparse regression example of Sect. 5.2, the function restart scheme remains unchanged. For the gradient scheme we use the gradient mapping [19, §2.2.3] as a generalized gradient, in which we take

$$x^{k+1} = \varPi_\mathcal{C} \bigl(y^{k} - t \bigl(Qy^k +q\bigr) \bigr) = y^k - tG\bigl(y^k \bigr) $$

to be a generalized gradient step, where $G(y^k)$ is a generalized gradient at $y^k$. This amounts to restarting based on condition (12) or, equivalently, (13).
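A sketch of the accelerated projected gradient method with the restart test (13); the function names and FISTA-style θ-update are our own choices:

```python
import numpy as np

def accel_projected_gradient(Q, q, a, b, iters=500, restart=True):
    """Sketch of accelerated projected gradient for the box-constrained
    QP (14), with the gradient-mapping restart test (13)."""
    t = 1.0 / np.linalg.norm(Q, 2)             # step-size 1/lambda_max(Q)
    proj = lambda z: np.clip(z, a, b)          # projection onto [a, b]
    x = proj(np.zeros_like(q))
    y, theta = x.copy(), 1.0
    for _ in range(iters):
        x_new = proj(y - t * (Q @ y + q))
        if restart and (y - x_new) @ (x_new - x) > 0:
            theta, y = 1.0, x_new              # reset momentum to zero
        else:
            theta_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * theta**2))
            y = x_new + ((theta - 1.0) / theta_new) * (x_new - x)
            theta = theta_new
        x = x_new
    return x
```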

For a numerical instance, we took n=500 and generated Q and q randomly; Q had a condition number of $10^7$. We took b to be the vector of all ones, and a to be the vector of all negative ones. The step-size was set to $t = 1/\lambda_{\max}(Q)$. The solution to this problem had 70 active constraints. Figure 6 shows the performance of projected gradient descent, accelerated projected gradient descent, and the two restart techniques.

Fig. 6 Adaptive restarting applied to the accelerated projected gradient algorithm

6 Summary

In this paper we introduced a simple heuristic adaptive restart technique that can improve the convergence performance of accelerated gradient schemes for smooth convex optimization. We restart the algorithm whenever we observe a certain condition on the objective function value or gradient. We provided a heuristic analysis to show that we can recover the optimal linear rate of convergence in many cases, and near the optimum of a smooth function we can potentially dramatically accelerate the rate of convergence, even if the function is not globally strongly convex. We demonstrated the performance of the scheme on some numerical examples.