1 Introduction

We consider an unconstrained optimization problem of the form

$$\begin{aligned} \min _{\mathbf {w}\in \mathbb {R}^d} f(\mathbf {w}), \end{aligned}$$
(1)

where f is a generic smooth and convex function. A natural and fundamental question is how efficiently such functions can be optimized.

We study this question through the well-known framework of oracle complexity [13], which focuses on iterative methods relying on local information. Specifically, it is assumed that the algorithm’s access to the function f is limited to an oracle, which given a point \(\mathbf {w}\), returns the values and derivatives of the function f at \(\mathbf {w}\). This naturally models standard optimization approaches to unstructured problems such as (1), and allows one to study their efficiency, by bounding the number of oracle calls required to reach a given optimization error. Different classes of methods can be distinguished by the type of oracle they use. For example, gradient-based methods (such as gradient descent or accelerated gradient descent) rely on a first-order oracle, which returns gradients, whereas methods such as the Newton method rely on a second-order oracle, which returns gradients as well as Hessians.

The theory of first-order oracle complexity is quite well developed [12, 13, 15]. For example, if the dimension is unrestricted, f in (1) has \(\mu _1\)-Lipschitz gradients, and the algorithm makes its first oracle query at a point \(\mathbf {w}_1\), then the worst-case number of queries T required to attain a point \(\mathbf {w}_T\) satisfying \(f(\mathbf {w}_T)-\min _{\mathbf {w}}f(\mathbf {w})\le \epsilon \) is

$$\begin{aligned} \varTheta \left( \sqrt{\frac{\mu _1 D^2}{\epsilon }}\right) , \end{aligned}$$
(2)

where D is an upper bound on the distance between \(\mathbf {w}_1\) and the nearest minimizer of f. Moreover, if the function f is also \(\lambda \)-strongly convex for some \(\lambda >0\) (Footnote 1), then the oracle complexity bound is

$$\begin{aligned} \varTheta \left( \sqrt{\frac{\mu _1}{\lambda }}\cdot \log \left( \frac{\mu _1 D^2}{\epsilon }\right) \right) . \end{aligned}$$
(3)

Both bounds are achievable using accelerated gradient descent [14].
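For concreteness, the following is a minimal sketch of accelerated gradient descent in the smooth convex case, the method attaining the rate in (2); the step size, momentum schedule and the toy quadratic below are illustrative choices, not part of the formal setup.

```python
import numpy as np

def accelerated_gradient_descent(grad, w1, mu1, T):
    """Nesterov-style accelerated gradient descent for a convex function
    with mu1-Lipschitz gradients (attains the rate in (2))."""
    w = y = np.asarray(w1, dtype=float)
    t = 1.0
    for _ in range(T):
        w_next = y - grad(y) / mu1                        # gradient step at the extrapolated point
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        y = w_next + ((t - 1.0) / t_next) * (w_next - w)  # momentum / extrapolation
        w, t = w_next, t_next
    return w

# Toy example: minimize f(w) = 0.5 * ||A w - b||^2.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])
grad = lambda w: A.T @ (A @ w - b)
mu1 = np.linalg.norm(A.T @ A, 2)                          # Lipschitz constant of the gradient
w_hat = accelerated_gradient_descent(grad, np.zeros(2), mu1, T=200)
```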

However, these bounds do not capture the attainable performance of second-order methods, which rely on gradient as well as Hessian information. This is a central class of optimization methods, including the well-known Newton method and its many variants. Clearly, since these methods rely on Hessians as well as gradients, their oracle complexity can only be better than that of first-order methods. On the flip side, the per-iteration computational complexity is generally higher, in order to process the additional Hessian information (especially in high-dimensional problems where the Hessian matrix may be very large). Thus, it is natural to ask how much this added per-iteration complexity pays off in terms of oracle complexity.

To answer this question, one needs good oracle complexity lower bounds for second-order methods, which establish the limits of attainable performance using any such algorithm. Perhaps surprisingly, such results do not seem to currently exist in the literature, and clarifying the oracle complexity of such methods was posed as an important open question (see for example [16]). The goal of this paper is to address this gap.

Specifically, we prove that when the dimension is sufficiently large, for the class of convex functions with \(\mu _1\)-Lipschitz gradients and \(\mu _2\)-Lipschitz Hessians, the worst-case oracle complexity of any deterministic algorithm is

$$\begin{aligned} \varOmega \left( \min \left\{ \sqrt{\frac{\mu _1 D^2}{\epsilon }},~\left( \frac{\mu _2D^3}{\epsilon }\right) ^{2/7}\right\} \right) . \end{aligned}$$
(4)

This bound is tight up to constants, as it is matched by a combination of existing methods in the literature (see discussion below). Moreover, if we restrict ourselves to functions which are \(\lambda \)-strongly convex, we prove an oracle complexity lower bound of

$$\begin{aligned} \varOmega \left( \left( \min \left\{ \sqrt{\frac{\mu _1}{\lambda }},~ \left( \frac{\mu _2}{\lambda }D\right) ^{2/7}\right\} + \log \log _{18}\left( \frac{\lambda ^3/\mu _2^2}{\epsilon }\right) \right) \right) . \end{aligned}$$
(5)

Moreover, we establish that this bound is tight up to logarithmic factors (independent of \(\epsilon \)), utilizing a novel adaptation of the A-NPE algorithm proposed in [10] (see “Appendix A”). These new lower bounds have several implications:

  • Perhaps unexpectedly, (5) establishes that one cannot avoid in general a polynomial dependence on geometry-dependent “condition numbers” of the form \(\mu _1/\lambda \) or \(\mu _2D/\lambda \), even with second-order methods. This is despite the ability of such methods to favorably alter the geometry of the problem (for example, the Newton method is well-known to be affine invariant).

  • To improve on the oracle complexity of first-order methods for strongly-convex problems (see (3)) by more than logarithmic factors, one cannot avoid a polynomial dependence on the initial distance D to the optimum. This is despite the fact that the dependence on D with first-order methods is only logarithmic. In fact, when D is sufficiently large (of order \(\frac{\mu _1^{7/4}}{\mu _2\lambda ^{3/4}}\) or larger), second-order methods cannot improve on the oracle complexity of first-order methods by more than logarithmic factors.

  • In the convex case, second-order methods are again no better than first-order methods in certain parameter regimes (i.e., when \(\mu _2\ge \mu _1^{7/4}\sqrt{D}/\epsilon ^{3/4}\)), despite the availability of more information.
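As a small numerical illustration of the last bullet (with arbitrarily chosen parameter values; this is only a sanity check of where the two terms in (4) cross over):

```python
def convex_lower_bound_terms(mu1, mu2, D, eps):
    """The two terms inside the min of the lower bound (4)."""
    first_order_term = (mu1 * D ** 2 / eps) ** 0.5
    second_order_term = (mu2 * D ** 3 / eps) ** (2.0 / 7.0)
    return first_order_term, second_order_term

# The crossover occurs around mu2 = mu1^{7/4} * sqrt(D) / eps^{3/4}; above it the
# first-order term is the smaller one, i.e. Hessian information stops helping.
mu1, D, eps = 1.0, 1.0, 1e-6
mu2_crossover = mu1 ** 1.75 * D ** 0.5 / eps ** 0.75
for mu2 in (0.01 * mu2_crossover, mu2_crossover, 100 * mu2_crossover):
    print(mu2, convex_lower_bound_terms(mu1, mu2, D, eps))
```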

Finally, we show how our proof techniques can be generalized, to establish lower bounds for methods employing higher-order derivatives. In particular, for methods using all derivatives up to order k, we show that for convex functions with \(\mu _k\)-Lipschitz k-th order derivatives, the oracle complexity is

$$\begin{aligned} \varOmega \left( \left( \frac{\mu _{k} D^{k+1}}{(k+1)!k\epsilon }\right) ^{2/(3k+1)}\right) . \end{aligned}$$

This directly generalizes (2) for \(k=1\), and (4) when \(k=2\) and \(\mu _1\) is unrestricted.

We note that in this paper, we focus on deterministic algorithms, in view of the fact that all state-of-the-art algorithms for our problem are deterministic, and following most existing literature on oracle complexity lower bounds. However, we believe our results can be readily extended to randomized algorithms as well (see Remark 1 for more details).

1.1 Related work

Below, we review some pertinent results in the context of second-order methods. Related results in the context of k-th order methods are discussed in Sect. 2.2.

Perhaps the most well-known and fundamental second-order method is the Newton method, which relies on iterations of the form \(\mathbf {w}_{t+1}=\mathbf {w}_{t}- (\nabla ^2 f(\mathbf {w}_t))^{-1}\nabla f(\mathbf {w}_t)\) (see e.g., [6]). It is well-known that this method exhibits local quadratic convergence, in the sense that if f is strictly convex, and the method is initialized close enough to the optimum \(\mathbf {w}^*=\arg \min _{\mathbf {w}} f(\mathbf {w})\), then \(\mathcal {O}(\log \log (1/\epsilon ))\) iterations suffice to reach a solution \(\mathbf {w}\) such that \(f(\mathbf {w})-f(\mathbf {w}^*)\le \epsilon \). However, in order to get global convergence (starting from an arbitrary point not necessarily close to the optimum), one needs to make some algorithmic modifications. For the Newton method with a line search, the number of iterations can be upper bounded by \( \mathcal {O}\left( \frac{\mu _1^2 \mu _2^2}{\lambda ^5}(f(\mathbf {w}_1)-f(\mathbf {w}^*))+\log \log _2\left( \frac{\lambda ^3/\mu _2^2}{\epsilon }\right) \right) , \) where \(\mu _1,\mu _2\) are the Lipschitz parameters of the gradients and Hessians respectively, and assuming the function is \(\lambda \)-strongly convex [9]. Note that the first term captures the initial phase required to get sufficiently close to \(\mathbf {w}^*\), and is easily the dominant term in the bound (unless \(\epsilon \) is exceedingly small). If f is self-concordant (Footnote 2), this can be improved to \( \mathcal {O}\left( (f(\mathbf {w}_1)-f(\mathbf {w}^*))+\log \log _2\left( \frac{1}{\epsilon }\right) \right) , \) independent of the strong convexity and Lipschitz parameters [17]. Unfortunately, not all practically relevant objective functions are self-concordant (or at least cannot be made self-concordant without strongly affecting other problem parameters).
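For concreteness, a minimal sketch of a globalized Newton iteration of the kind discussed above, using a backtracking (Armijo) line search; the callables f, grad, hess and the line-search constants are illustrative assumptions, not the exact scheme analyzed in [9].

```python
import numpy as np

def newton_with_backtracking(f, grad, hess, w1, T, alpha=0.25, beta=0.5):
    """Newton's method with a backtracking (Armijo) line search:
    w_{t+1} = w_t - eta_t * (hess(w_t))^{-1} grad(w_t)."""
    w = np.asarray(w1, dtype=float)
    for _ in range(T):
        g = grad(w)
        step = np.linalg.solve(hess(w), g)   # Newton direction (hess(w) is PD for strongly convex f)
        eta = 1.0
        # Backtrack until a sufficient-decrease condition holds; near the optimum
        # eta = 1 is accepted and the local quadratic convergence kicks in.
        while f(w - eta * step) > f(w) - alpha * eta * (g @ step):
            eta *= beta
        w = w - eta * step
    return w
```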

Returning to our setting of generic convex and smooth functions, and focusing on strongly convex functions for now, the best existing upper bounds (we are aware of) were obtained for cubic-regularized variants of the Newton method [8, 16, 18]. The existing analysis (see section 6 of [16]) implies an oracle complexity bound of at most \( \mathcal {O}\left( \left( \frac{\mu _2 }{\lambda }D\right) ^{1/3}+\log \log _2\left( \frac{\lambda ^3/\mu _2^2}{\epsilon }\right) \right) \), where \(D=\Vert \mathbf {w}_1-\mathbf {w}^*\Vert \) is the distance from the initialization point \(\mathbf {w}_1\) to the optimum \(\mathbf {w}^*\). However, as we show in “Appendix A”, a better oracle complexity bound can be obtained, by adapting the A-NPE method proposed in [10] and analyzed for convex functions, to the strongly convex case. The resulting complexity upper bound is

$$\begin{aligned} \mathcal {O}\left( \left( \frac{\mu _2}{\lambda }D\right) ^{2/7}\log \left( \frac{\mu _1\mu _2^2D^2}{\lambda ^3}\right) +\log \log _2\left( \frac{\lambda ^3/\mu _2^2}{\epsilon }\right) \right) . \end{aligned}$$
(6)

An alternative to the above is to use a hybrid scheme, starting with accelerated gradient descent (which is an optimal first-order method for strongly convex functions with Lipschitz gradients) and, when close enough to the optimal solution, switching to a Newton-based method. The required number of iterations can be shown to be at most

$$\begin{aligned} \mathcal {O}\left( \sqrt{\frac{\mu _1}{\lambda }}\cdot \log \left( \frac{\mu _1\mu _2^2 D^2}{\lambda ^3}\right) +\log \log _2\left( \frac{\lambda ^3/\mu _2^2}{\epsilon }\right) \right) , \end{aligned}$$
(7)

where \(D=\Vert \mathbf {w}_1-\mathbf {w}^*\Vert \) (see [15, 16]). Clearly, by taking the best of (6) and (7) (depending on the parameters), one can theoretically attain an oracle complexity which is the minimum of (6) and (7). This minimum matches (up to logarithmic factors) the lower bound in (5), which we establish in this paper.

It is interesting to note that the bounds in (6) and (7) are not directly comparable: The first bound has a polynomial dependence on \(\mu _2/\lambda \) and \(\Vert \mathbf {w}_1-\mathbf {w}^*\Vert \), and a logarithmic dependence on \(\mu _1\), whereas the second bound has a polynomial dependence on \(\mu _1/\lambda \), logarithmic dependence on \(\Vert \mathbf {w}_1-\mathbf {w}^*\Vert \), and a logarithmic dependence on \(\mu _2\). In a rather wide parameter regime (e.g. when D is reasonably large, as often occurs in practice), the bound of the hybrid scheme can be better than that of pure second-order methods. In light of this, [16] raised the question of whether second-order schemes are indeed useful at the initial stage of the optimization process, for these types of problems. Our results indicate that indeed, in certain parameter regimes, this is not the case.
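The following small sketch (with arbitrary illustrative parameter values, and ignoring universal constants) evaluates the leading terms of the two upper bounds (6) and (7), illustrating that neither dominates the other:

```python
import math

def anpe_bound(mu1, mu2, lam, D, eps):
    """Leading terms of the second-order bound (6), ignoring universal constants."""
    return ((mu2 * D / lam) ** (2.0 / 7.0) * math.log(mu1 * mu2 ** 2 * D ** 2 / lam ** 3)
            + math.log2(math.log2(lam ** 3 / (mu2 ** 2 * eps))))

def hybrid_bound(mu1, mu2, lam, D, eps):
    """Leading terms of the hybrid first/second-order bound (7), ignoring constants."""
    return (math.sqrt(mu1 / lam) * math.log(mu1 * mu2 ** 2 * D ** 2 / lam ** 3)
            + math.log2(math.log2(lam ** 3 / (mu2 ** 2 * eps))))

# Large initial distance D: the hybrid bound (7) is smaller.
print(anpe_bound(1e2, 1.0, 1e-2, 1e6, 1e-12), hybrid_bound(1e2, 1.0, 1e-2, 1e6, 1e-12))
# Large condition number mu1/lam: the A-NPE-based bound (6) is smaller.
print(anpe_bound(1e8, 1.0, 1e-2, 1e2, 1e-12), hybrid_bound(1e8, 1.0, 1e-2, 1e2, 1e-12))
```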

Analogous results can be obtained for convex (not necessarily strongly convex) smooth functions. The best upper bounds we are aware of are for the (second-order) A-NPE method of [10], and for the (first-order) accelerated gradient descent method, which are

$$\begin{aligned} \mathcal {O}\left( \left( \frac{\mu _2D^3}{\epsilon }\right) ^{2/7}\right) ~~~\text {and}~~~ \mathcal {O}\left( \sqrt{\frac{\mu _1 D^2}{\epsilon }}\right) \end{aligned}$$

respectively. Clearly, by taking the best of these two methods (depending on the problem parameters), one can attain an oracle complexity equal to the minimum of the two bounds. This is matched (up to constants) by the lower bound in (4), which we establish in this paper.

Finally, we discuss the few existing lower bounds known for second-order methods. If \(\mu _2\) is not bounded (i.e., the Hessians are not Lipschitz), it is easy to show that Hessian information is not useful. Specifically, the lower bound of (2) for first-order methods will then also apply to second-order methods, and in fact, to any method based on local information (see [13, section 7.2.6] and [4]). Of course, this lower bound does not apply to second-order methods when \(\mu _2\) is bounded. In our setting, it is also possible to prove an \(\varOmega (\log \log (1/\epsilon ))\) lower bound, even in one dimension [13, section 8.1.1], but this does not capture the dependence on the strong convexity and Lipschitz parameters. Some algorithm-specific lower bounds in the context of non-convex optimization are provided in [7]. Lastly, we were recently informed of a new work ([1], yet unpublished at the time of writing), which uses a clean and elegant smoothing approach to derive second- and higher-order oracle lower bounds directly from known first-order oracle lower bounds, as well as extensions to randomized algorithms. However, the resulting bounds are not as tight as ours.

2 Main results

In this section, we formally present our main results, starting with second-order oracle complexity bounds (Sect. 2.1), and then discussing extensions to higher-order oracles (Sect. 2.2).

2.1 Second-order oracle

We consider a second-order oracle, which given a point \(\mathbf {w}\) returns the function’s value \(f(\mathbf {w})\), its gradient \(\nabla f(\mathbf {w})\) and its Hessian \(\nabla ^2 f(\mathbf {w})\), and algorithms, which produce a sequence of points \(\mathbf {w}_1,\mathbf {w}_2,\ldots ,\mathbf {w}_T\), with each \(\mathbf {w}_t\) being some deterministic function of the oracle’s responses at \(\mathbf {w}_1,\ldots ,\mathbf {w}_{t-1}\). Our main results (for strongly convex and convex functions f respectively) are provided below.
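First, a minimal sketch of this interaction model; the class and function names are ours, purely for illustration, and `next_point` stands for an arbitrary deterministic update rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple
import numpy as np

@dataclass
class SecondOrderOracle:
    """Given a query point w, return (f(w), grad f(w), Hessian of f at w)."""
    f: Callable[[np.ndarray], float]
    grad: Callable[[np.ndarray], np.ndarray]
    hess: Callable[[np.ndarray], np.ndarray]

    def query(self, w: np.ndarray) -> Tuple[float, np.ndarray, np.ndarray]:
        return self.f(w), self.grad(w), self.hess(w)

def run_algorithm(oracle: SecondOrderOracle, w1: np.ndarray, T: int, next_point) -> List[np.ndarray]:
    """A deterministic algorithm: each w_t is a deterministic function
    (here the callable `next_point`) of the oracle's responses at w_1, ..., w_{t-1}."""
    points, responses = [np.asarray(w1, dtype=float)], []
    for _ in range(T - 1):
        responses.append(oracle.query(points[-1]))
        points.append(next_point(points, responses))
    return points
```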

Theorem 1

For any positive \(\mu _1,\mu _2,\lambda ,D,\epsilon \) such that \( \frac{\mu _1}{\lambda }\ge c_1, \frac{\mu _2}{\lambda }D\ge c_2\) and \( \epsilon < \frac{c_3\lambda ^3}{\mu _2^2} \) (for some positive universal constants \(c_1,c_2,c_3\)), and any algorithm as above with initialization point \(\mathbf {w}_1\), there exists a function \(f:\mathbb {R}^d\rightarrow \mathbb {R}\) (for some finite d) such that

  • f is \(\lambda \)-strongly convex, twice-differentiable, has \(\mu _1\)-Lipschitz gradients and \(\mu _2\)-Lipschitz Hessians, and has a global minimum \(\mathbf {w}^*\) satisfying \(\Vert \mathbf {w}_1-\mathbf {w}^*\Vert \le D\).

  • For some universal constant \(c>0\), the index T required to ensure \(f(\mathbf {w}_T)-f(\mathbf {w}^*) ~\le ~ \epsilon \) is at least

    $$\begin{aligned} c\cdot \left( \min \left\{ \sqrt{\frac{\mu _1}{\lambda }}~,~ \left( \frac{\mu _2}{\lambda }D\right) ^{2/7}\right\} + \log \log _{18}\left( \frac{\lambda ^3/\mu _2^2}{\epsilon }\right) \right) . \end{aligned}$$

Theorem 2

For any positive \(\mu _1,\mu _2,D,\epsilon \) and any algorithm as above with initialization point \(\mathbf {w}_1\), there exists a function \(f:\mathbb {R}^d \rightarrow \mathbb {R}\) (for some finite d) such that

  • f is convex, twice-differentiable, has \(\mu _1\)-Lipschitz gradients and \(\mu _2\)-Lipschitz Hessians, and has a global minimum \(\mathbf {w}^*\) satisfying \(\Vert \mathbf {w}_1-\mathbf {w}^*\Vert \le D\).

  • For some universal constant \(c>0\), the index T required to ensure \(f(\mathbf {w}_T)-f(\mathbf {w}^*)\le \epsilon \) is at least

    $$\begin{aligned} c\cdot \min \left\{ \sqrt{\frac{\mu _1 D^2}{\epsilon }}~,~\left( \frac{\mu _2D^3}{\epsilon }\right) ^{2/7} \right\} . \end{aligned}$$

We emphasize that the theorems focus on the high-dimensional setting, where the dimension d is not necessarily fixed and may depend on other problem parameters (more specifically, we require the dimension to be at least 2T). Also, we note that the parameter constraints in Theorem 1 are purely for technical reasons (they imply that the different terms in the bound are at least some positive constant), and can probably be relaxed somewhat.

Let us compare these theorems to the upper bounds discussed in the related work section, which are

$$\begin{aligned} \mathcal {O}\left( \min \left\{ \sqrt{\frac{\mu _1}{\lambda }} ,\left( \frac{\mu _2 D}{\lambda }\right) ^{2/7}\right\} \cdot \log \left( \frac{\mu _1\mu _2^2 D^2}{\lambda ^3}\right) +\log \log _2\left( \frac{\lambda ^3/\mu _2^2}{\epsilon }\right) \right) \end{aligned}$$

in the strongly convex case, and

$$\begin{aligned} \mathcal {O}\left( \min \left\{ \sqrt{\frac{\mu _1 D^2}{\epsilon }}~,~\left( \frac{\mu _2D^3}{\epsilon }\right) ^{2/7}\right\} \right) \end{aligned}$$

in the convex case. Our bound in the convex case is tight up to constants, and in the strongly convex case, up to a \(\log (\mu _1\mu _2^2D^2/\lambda ^3)\) factor. We conjecture that some such logarithmic factor (possibly a smaller one) is indeed necessary, in order to get a tight interpolation to the \(\varOmega (\sqrt{\mu _1/\lambda }\cdot \log (\mu _1 D^2/\epsilon ))\) lower bound of first-order methods as \(\mu _2\rightarrow \infty \) (see [13, section 7.2.6] and [4]), and that it can be recovered with a more careful analysis of our construction. However, this involves some non-trivial technical challenges, which we leave to future work.

Remark 1

(Randomized Algorithms) In this paper, we focus on deterministic algorithms, where each point \(\mathbf {w}_t\) produced is a deterministic function of the oracle’s responses at \(\mathbf {w}_1,\ldots ,\mathbf {w}_{t-1}\). This follows most existing works on oracle complexity lower bounds, and is without much loss of generality, since all existing state-of-the-art algorithms for this problem are deterministic. In any case, we believe our bounds can be readily extended to randomized algorithms, using the same techniques that extend first-order oracle complexity lower bounds to randomized algorithms (see [13] and the explicit construction in [20]). In a nutshell, for deterministic algorithms we can tailor the “hard” function to the algorithm, whereas for randomized algorithms, we construct some fixed distribution over “hard” functions, so that any algorithm (random or not) will fail to achieve a certain optimization error with high probability. Since the formal proof is considerably more involved than in the deterministic case, we leave its formal derivation to future work.

2.2 Higher order oracles

In addition to first-order and second-order oracles, it is of interest to understand what can be achieved with methods employing higher order derivatives. It turns out that the techniques we use to establish our second-order lower bounds can be easily generalized to such higher-order methods.

More explicitly, we consider methods which can be modelled as interacting with a k-th order oracle, which given a point \(\mathbf {w}\) returns the function’s value and all of its derivatives up to order k, namely, \(f(\mathbf {w}),\nabla f(\mathbf {w}),\nabla ^2 f(\mathbf {w}),\ldots ,\nabla ^k f(\mathbf {w})\). Given access to such an oracle, the method produces a sequence of points \(\mathbf {w}_1,\mathbf {w}_2,\ldots ,\mathbf {w}_T\) as before (where each \(\mathbf {w}_t\) is a deterministic function of the previous oracle responses). For simplicity, we will focus here on the case of convex functions (not necessarily strongly convex), where the k-th order derivative is Lipschitz continuous.

Theorem 3

For any positive integer k, positive \(\mu _k,D,\epsilon \), and algorithm based on a k-th order oracle as above, there exists a function \(f:\mathbb {R}^d \rightarrow \mathbb {R}\) (for some finite d) such that

  • f is convex, k times differentiable, k-th order smooth (i.e., \(\Vert \nabla ^k f(\mathbf {u})-\nabla ^k f(\mathbf {v})\Vert \le \mu _k \Vert \mathbf {u}-\mathbf {v}\Vert \)), and has a global minimum \(\mathbf {w}^*\) satisfying \(\Vert \mathbf {w}_1-\mathbf {w}^*\Vert \le D\).

  • For some universal constant \(c>0\), the index T required to ensure \(f(\mathbf {w}_T)-f(\mathbf {w}^*)\le \epsilon \) is at least

    $$\begin{aligned} c\left( \frac{\mu _{k} D^{k+1}}{(k+1)!k\epsilon }\right) ^{2/(3k+1)}. \end{aligned}$$

Note that this result directly generalizes existing results for first-order oracles (\(k=1\)), as well as our results for second-order oracles (\(k=2\), when \(\mu _1\) is unrestricted).

Finally, we compare our lower bound to the best upper bound we are aware of, established by [5] using a high-order method with oracle complexity of

$$\begin{aligned} \mathcal {O}\left( \sqrt{k}\left( \frac{f(\mathbf {w}_1)-f(\mathbf {w}^*)}{\epsilon }+\frac{\mu _kD^{k+1}}{(k+1)!\epsilon }\right) ^{1/(k+1)}\right) . \end{aligned}$$

Note that the upper bound contains an additional \((f(\mathbf {w}_1)-f(\mathbf {w}^*))/\epsilon \) term, and moreover, the exponent (as a function of k) is larger than ours (\(1/(k+1)\) vs. \(2/(3k+1)\)). Based on our results, we know that this upper bound is loose in the \(k=2\) case, so we conjecture that it is indeed loose for all k, and can be improved.
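To make the exponent comparison concrete, a quick illustrative check (not part of the formal argument):

```python
# Exponents of the 1/eps dependence: 2/(3k+1) in the lower bound of Theorem 3
# versus 1/(k+1) in the upper bound of [5].
for k in range(1, 6):
    print(k, 2.0 / (3 * k + 1), 1.0 / (k + 1))
# For k = 1 the two exponents coincide (both 1/2); for k >= 2 the upper bound's
# exponent is strictly larger, which is the gap discussed above.
```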

3 Proof ideas

The proofs of our theorems are based on a careful modification of a known lower bound construction for first-order methods (see [15]). That construction uses quadratic functions, which in the convex case and ignoring various parameters, have a basic structure of the form

$$\begin{aligned} f_T(\mathbf {w}) = f_T(w_1,w_2,\ldots ) = w_1^2+\sum _{j=1}^{T-1}(w_j-w_{j+1})^2+w_T^2-w_1 \end{aligned}$$

(more precisely, one considers \(f_T(V\mathbf {w})\) for a certain orthogonal matrix V, and uses additional parameters depending on the smoothness). A crucial ingredient of the proof is that the function \(x\mapsto x^2\) has a value and derivative of zero at the origin, which allows us to construct a function which “hides” information from an algorithm relying solely on values and gradients. This can be shown to lead to an optimization error lower bound of the form \(\min _{\mathbf {w}} f_{T}(\mathbf {w})-\min _{\mathbf {w}} f_{2T}(\mathbf {w})\) after T oracle queries, which for first-order methods leads to an \(\varOmega (\mu _1 D^2/T^2)\) lower bound on the error, translating to an \(\varOmega (\sqrt{\mu _1 D^2/\epsilon })\) lower bound on T. However, this construction leads to trivial bounds for second-order methods, since given the Hessian and gradient of a quadratic function at just a single point, one can already compute the exact minimizer.
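The “hiding” mechanism is easy to see numerically: at a query point supported on the first few coordinates, the gradient of the chain function above reveals at most one new coordinate. A small sketch (the specific point and threshold are arbitrary):

```python
import numpy as np

def chain_quadratic(w):
    """f_T(w) = w_1^2 + sum_j (w_j - w_{j+1})^2 + w_T^2 - w_1 (the quadratic chain)."""
    return w[0] ** 2 + np.sum(np.diff(w) ** 2) + w[-1] ** 2 - w[0]

def numerical_gradient(f, w, h=1e-6):
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w); e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

T = 10
w = np.zeros(T)
w[:3] = [0.5, 0.3, 0.2]                  # a query point supported on the first 3 coordinates
g = numerical_gradient(chain_quadratic, w)
print(np.nonzero(np.abs(g) > 1e-4)[0])   # support of the gradient lies within coordinates 0..3:
                                         # each oracle call reveals at most one new coordinate
```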

Our approach to handle second-order (and more generally, k-order) methods is quite simple: Instead of \(x\mapsto x^2\), we rely on mappings of the form \(x\mapsto |x|^{k+1}\), and use functions with the basic structure

$$\begin{aligned} f_T(\mathbf {w}) = |w_1|^{k+1}+\sum _{j=1}^{T-1}|w_j-w_{j+1}|^{k+1}+|w_T|^{k+1}-w_1. \end{aligned}$$

The intuition is that \(x\mapsto |x|^{k+1}\) has a value and first k derivatives of zero at the origin, and therefore variants of the function above can be used to “hide” information from the algorithm, even if it can receive Hessians or higher-order derivatives of the function. Another motivation for choosing such functions is that they are generally not self-concordant, and therefore the upper bounds relevant to self-concordant functions do not apply. We rely on this construction and arguments similar to those of first-order oracle lower bounds, to get our results.

In the derivation of our results for second-order methods, there are two technical challenges that need to be overcome: The first is that \(f_T\), as defined above (for \(k=2\)), can be shown to have globally Lipschitz Hessians, but not globally Lipschitz gradients as required by our theorems. To tackle this, we replace the mapping \(x\mapsto |x|^3\) by a more complicated mapping, which is cubic close to the origin and quadratic further away. This necessarily complicates the proof. The second challenge is that due to the cubic terms, computing the minimizer of \(f_T\) and its minimal value is more challenging than in first-order lower bounds, especially in the strongly convex case (where we are unable to even find a closed-form expression for the minimizer, and resort to bounds instead). Again, this makes the analysis more complicated.

We conclude this section by sketching how our bounds can be derived in case of second-order methods, and in the simplest possible setting, where we wish to obtain an \(\varOmega ((D^3/\epsilon )^{2/7})\) lower bound for the class of convex functions with Lipschitz Hessians (and no assumptions on the Lipschitz parameter of the gradients), assuming the algorithm makes its first query at the origin:

Proposition 1

For any positive \(\mu _2,D,\epsilon \) and any algorithm with an initialization point \(\mathbf {w}_1\) in the origin, there exists a function \(f:\mathbb {R}^d \rightarrow \mathbb {R}\) (for some finite d) such that

  • f is convex, twice-differentiable, has \(\mu _2\)-Lipschitz Hessians, and has a global minimum \(\mathbf {w}^*\) satisfying \(\Vert \mathbf {w}_1-\mathbf {w}^*\Vert \le D\).

  • The index T required to ensure \(f(\mathbf {w}_T)-f(\mathbf {w}^*)\le \epsilon \) is \(\varOmega ((D^3/\epsilon )^{2/7})\).

Proof Sketch

Consider the function \(f_T\) in this class of the form

$$\begin{aligned} f_T(\mathbf {w}) = |w_1|^3+\sum _{j=1}^{T-1}|w_j-w_{j+1}|^3+|w_T|^3-3\gamma \cdot w_1, \end{aligned}$$

where \(\gamma \) is a parameter to be chosen later. Computing the derivatives and setting to zero, and arguing that the minimizer must have non-negative coordinates, we get that the optimum satisfies

$$\begin{aligned} w_1^2+(w_1-w_2)^2 = \gamma ,~~w_T^2=(w_{T-1}-w_T)^2 \end{aligned}$$

and

$$\begin{aligned} \forall j=2,3,\ldots ,T-1,~~~(w_{j-1}-w_j)^2=(w_j-w_{j+1})^2. \end{aligned}$$

It can be verified that this is satisfied by \(w_j = (T+1-j)\sqrt{\frac{\gamma }{T^2+1}}\) for all \(j=1,2,\ldots ,T\), and that this is the unique minimizer of \(f_T\) as a function on \(\mathbb {R}^T\). Moreover, assuming \(\gamma \le D^2/T\), the norm of this minimizer (and hence the initial distance to it from the algorithm’s first query point, by assumption) is at most D as required. Plugging this \(\mathbf {w}\) into \(f_T\), we get that

$$\begin{aligned} \min _{\mathbf {w}} f_T(\mathbf {w}) = -\frac{2\gamma ^{3/2}T}{\sqrt{T^2+1}}. \end{aligned}$$

Now, using arguments very similar to those in the first-order oracle complexity lower bounds in [15], it is possible to construct a function for which the optimization error of the algorithm is lower bounded by \(\min _{\mathbf {w}}f_{T}(\mathbf {w})-\min _{\mathbf {w}}f_{2T}(\mathbf {w})\). By the calculations above, this in turn equals

$$\begin{aligned} 2\gamma ^{3/2}\left( \frac{2T}{\sqrt{4T^2+1}} -\frac{T}{\sqrt{T^2+1}}\right) ~=~2\gamma ^{3/2}\left( \frac{1}{\sqrt{1+\frac{1}{4T^2}}}-\frac{1}{\sqrt{1+\frac{1}{T^2}}}\right) . \end{aligned}$$

Using the fact that \(\frac{1}{\sqrt{1+x}}\approx 1-\frac{1}{2}x\) for small x, this equals \(\varOmega (\gamma ^{3/2}/T^2)\). Choosing \(\gamma \) on the order of \(D^2/T\) (as required earlier to satisfy the norm constraint on the minimizer), we get a lower bound of \(\varOmega (D^3/T^{7/2})\) on the optimization error \(\epsilon \), or equivalently, a lower bound of \(\varOmega ((D^3/\epsilon )^{2/7})\) on T. \(\square \)
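The calculation in the proof sketch can also be checked numerically. The following sketch (with arbitrary values of T and \(\gamma \), relying on a generic numerical optimizer) verifies the claimed minimizer, the minimal value, and the resulting \(\varOmega (\gamma ^{3/2}/T^2)\) gap:

```python
import numpy as np
from scipy.optimize import minimize

def f_chain_cubic(w, gamma):
    """f_T(w) = |w_1|^3 + sum_j |w_j - w_{j+1}|^3 + |w_T|^3 - 3*gamma*w_1."""
    return (np.abs(w[0]) ** 3 + np.sum(np.abs(np.diff(w)) ** 3)
            + np.abs(w[-1]) ** 3 - 3 * gamma * w[0])

T, gamma = 8, 0.5
# Claimed minimizer: w_j = (T + 1 - j) * sqrt(gamma / (T^2 + 1)), j = 1, ..., T.
w_closed = (T - np.arange(T)) * np.sqrt(gamma / (T ** 2 + 1))
w_numeric = minimize(f_chain_cubic, np.zeros(T), args=(gamma,),
                     method="BFGS", options={"gtol": 1e-10}).x
print(np.max(np.abs(w_closed - w_numeric)))            # small: the closed form is the minimizer
print(f_chain_cubic(w_closed, gamma),                   # equals -2*gamma^{3/2}*T/sqrt(T^2+1)
      -2 * gamma ** 1.5 * T / np.sqrt(T ** 2 + 1))

# The gap min_w f_T(w) - min_w f_{2T}(w) scales like gamma^{3/2} / T^2:
gap = (2 * gamma ** 1.5 * 2 * T / np.sqrt(4 * T ** 2 + 1)
       - 2 * gamma ** 1.5 * T / np.sqrt(T ** 2 + 1))
print(gap, 0.75 * gamma ** 1.5 / T ** 2)                # same order of magnitude
```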

4 Proof of Theorem 1

We will assume without loss of generality that the algorithm initializes at \(\mathbf {w}_1=\mathbf {0}\) (if that is not the case, one can simply replace the “hard” function \(f(\mathbf {w})\) below by \(f(\mathbf {w}-\mathbf {w}_1)\), and the same proof holds verbatim). Thus, the theorem requires that our function has a minimizer \(\mathbf {w}^*\) satisfying \(\Vert \mathbf {w}^*\Vert \le D\).

Let \(\Delta ,\gamma \) be parameters to be chosen later (Footnote 3). Define \(g:\mathbb {R}\rightarrow \mathbb {R}\) as

$$\begin{aligned} g(x) = {\left\{ \begin{array}{ll} \frac{1}{3}|x|^3 &{} |x|\le \Delta \\ \Delta x^2-\Delta ^2|x|+\frac{1}{3}\Delta ^3&{}|x|>\Delta , \end{array}\right. } \end{aligned}$$

which is easily verified to be convex and twice continuously differentiable, and let \(\mathbf {v}_1,\mathbf {v}_2,\ldots ,\mathbf {v}_{{\tilde{T}}}\) be orthogonal unit vectors in \(\mathbb {R}^d\) which will be specified later. Letting the number of iterations T be fixed, we consider the function

$$\begin{aligned} f(\mathbf {w}) = \frac{\mu _2}{12}\left( \sum _{i=1}^{{\tilde{T}}-1}g(\langle \mathbf {v}_i,\mathbf {w}\rangle -\langle \mathbf {v}_{i+1},\mathbf {w}\rangle )-\gamma \langle \mathbf {v}_1,\mathbf {w}\rangle \right) +\frac{\lambda }{2}\Vert \mathbf {w}\Vert ^2, \end{aligned}$$

where \({\tilde{T}}\ge \max \left\{ 4\gamma \left( \frac{\mu _2}{6\lambda }\right) ^2+1, 2T,\frac{\gamma \mu _2}{6\lambda }+1\right\} \) is some sufficiently large number, and the dimension d is at least \(2{\tilde{T}}\).
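To make the construction concrete, a short sketch of g and of f in the special case where the \(\mathbf {v}_i\)’s are the standard basis vectors (a simplification for illustration only; in the proof the \(\mathbf {v}_i\)’s are chosen adversarially, and the chain is truncated at \({\tilde{T}}\)):

```python
import numpy as np

def g(x, Delta):
    """The piecewise smooth link function: cubic for |x| <= Delta, quadratic beyond."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= Delta,
                    np.abs(x) ** 3 / 3.0,
                    Delta * x ** 2 - Delta ** 2 * np.abs(x) + Delta ** 3 / 3.0)

def hard_function(w, mu2, lam, gamma, Delta):
    """f(w) = (mu2/12) * (sum_i g(w_i - w_{i+1}) - gamma * w_1) + (lam/2) ||w||^2,
    i.e. the construction above with v_i = e_i, the chain running over all
    adjacent coordinates of w."""
    chain = np.sum(g(np.diff(w), Delta))   # g is even, so the sign of the difference is immaterial
    return mu2 / 12.0 * (chain - gamma * w[0]) + lam / 2.0 * np.dot(w, w)
```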

The proof consists of several parts: First, we analyze properties of the global minimum of f (Sect. 4.1). Then, we prove the oracle complexity lower bound in Sect. 4.2 (depending on \(\Delta ,\gamma \)), and finally, in Sect. 4.3, we choose the parameters so that f indeed has the various geometric properties specified in the theorem.

4.1 Minimizer of f

The goal of this subsection is to prove the following proposition, which characterizes key properties of the global minimum of f:

Proposition 2

Suppose that \( \gamma \ge 10^4\left( \frac{\lambda }{\mu _2}\right) ^2\) and \(\Delta \ge \sqrt{\gamma }\). Then for any orthogonal unit vectors \(\mathbf {v}_1,\mathbf {v}_2,\ldots \), the function f has a unique minimizer \(\mathbf {w}^*\), which satisfies the following:

  1.

    For any \(t\in \{1,2,\ldots ,{\tilde{T}}\}\), it holds that \(\langle \mathbf {v}_t,\mathbf {w}^*\rangle \ge \max \left\{ 0~,~\frac{\gamma ^{3/4}}{7\sqrt{12\lambda /\mu _2}} +\sqrt{\gamma }\left( \frac{1}{2}-t\right) \right\} \).

  2.

    There exists some \(t_0 \le {\tilde{T}}/2\) such that for all indices \(k\in \{0,1,\ldots ,{\tilde{T}}-t_0\}\), it holds that \(\langle \mathbf {v}_{t_0+k},\mathbf {w}^*\rangle ~\ge ~ \frac{108\lambda }{\mu _2}\cdot (18)^{-2^k}\).

  3.

    \(\Vert \mathbf {w}^*\Vert ^2 \le \frac{2\gamma ^{7/4}}{(12\lambda /\mu _2)^{3/2}}\).

Since f is strongly convex, its global minimizer is unique and well-defined. To prove the proposition, we will consider the simpler strongly-convex function

$$\begin{aligned} {\tilde{f}}(\mathbf {w}) = \frac{1}{3}\sum _{i=1}^{{\tilde{T}}-1}|w_i-w_{i+1}|^3 +\frac{{\tilde{\lambda }}}{2}\Vert \mathbf {w}\Vert ^2-\gamma \cdot w_1, \end{aligned}$$
(8)

where \({\tilde{\lambda }}:=\frac{12\lambda }{\mu _2}\), and prove that its minimizer \({\tilde{\mathbf {w}}}^*\) satisfies the following:

  1.

    For any \(t\in \{1,2,\ldots ,{\tilde{T}}\}\), it holds that \(\tilde{w}^*_t \ge \max \left\{ 0,~\frac{\gamma ^{3/4}}{7\sqrt{{\tilde{\lambda }}}} +\sqrt{\gamma }\left( \frac{1}{2}-t\right) \right\} \) (Lemma 2).

  2.

    There exists some \(t_0 \le {\tilde{T}}/2\) such that for all \(k\in \{0,1,\ldots ,{\tilde{T}}-t_0\}\), it holds that \(\tilde{w}^*_{t_0+k} ~\ge ~ 9{\tilde{\lambda }}\cdot (18)^{-2^k}\) (Lemma 3).

  3.

    \(\sum _{i=1}^{{\tilde{T}}}\tilde{w}^{*2}_{i} \le \frac{2\gamma ^{7/4}}{{\tilde{\lambda }}^{3/2}}\) (Lemma 4)

We then argue that the minimizer \(\mathbf {w}^*\) of f satisfies \(\langle \mathbf {v}_i,\mathbf {w}^*\rangle =\tilde{w}^*_{i}\) for all \(i=1,2,\ldots ,{\tilde{T}}\) (Lemma 5), and that \(\Vert \mathbf {w}^*\Vert ^2 = \sum _{i=1}^{{\tilde{T}}}\langle \mathbf {v}_i,\mathbf {w}^*\rangle ^2\) (Lemma 6), from which Proposition 2 follows.

We begin with the following key technical result:

Lemma 1

It holds that \(\tilde{w}^*_1\ge \tilde{w}^*_2\ge \cdots \ge \tilde{w}^*_{{\tilde{T}}} \ge 0\), and

$$\begin{aligned} \tilde{w}^*_{t+1} = \tilde{w}^*_{t}-\sqrt{\gamma -{\tilde{\lambda }} \sum _{j=1}^{t}\tilde{w}^*_j}~ \quad \forall t\in \{1,2,\ldots ,{\tilde{T}}-1\}. \end{aligned}$$

Moreover, \(\sum _{j=1}^{{\tilde{T}}}\tilde{w}^*_j = \frac{\gamma }{{\tilde{\lambda }}}\).

Proof

We begin by showing that \(\tilde{w}^*_j\ge 0\) for all j, first for \(j=1\) and then for \(j>1\). To do so, note that \({\tilde{f}}(\mathbf {0})=0\) yet \(\nabla {\tilde{f}}(\mathbf {0})=-\gamma \cdot \mathbf {e}_1\ne \mathbf {0}\), and therefore \(\mathbf {0}\) is a sub-optimal point. Thus, we must have \({\tilde{f}}({\tilde{\mathbf {w}}}^*)<0\). The only negative term in the definition of \({\tilde{f}}(\cdot )\) is \(-\gamma \cdot w_1\), so we must have \(\tilde{w}^*_1>0\). We now argue that \(\tilde{w}^*_j\ge 0\) for all \(j>1\): Otherwise, let \(\mathbf {w}\) be the vector which equals \(w_j=|\tilde{w}^*_j|\) for all j, and note that \(w_1=\tilde{w}^*_1\) since we just showed \(\tilde{w}^*_1> 0\). Based on this, it is easily verified that

$$\begin{aligned} {\tilde{f}}(\mathbf {w})-{\tilde{f}}({\tilde{\mathbf {w}}}^*) = \frac{1}{3}\sum _{i=1}^{{\tilde{T}}-1}\left( \big ||\tilde{w}^*_i|-|\tilde{w}^*_{i+1}|\big |^3-|\tilde{w}^*_i-\tilde{w}^*_{i+1}|^3 \right) ~\le ~0, \end{aligned}$$

which means that \(\mathbf {w}\) is the (unique) minimum of \({\tilde{f}}\), hence \(\mathbf {w}={\tilde{\mathbf {w}}}^*\). By definition of \(\mathbf {w}\), this implies \(\tilde{w}^*_j=|\tilde{w}^*_j|\) for all j, hence \(\tilde{w}^*_j\ge 0\) for all j.

We now turn to prove that \(\tilde{w}^*_j\) is monotonically decreasing in j. Suppose on the contrary that this is not the case, and let \(j_0\) be the smallest index for which \(\tilde{w}^*_{j_0}<\tilde{w}^*_{j_0+1}\), and let \(\delta := \tilde{w}^*_{j_0+1}-\tilde{w}^*_{j_0}>0\). Define the vector \(\mathbf {w}\) to be

$$\begin{aligned} w_i = {\left\{ \begin{array}{ll} \tilde{w}^*_i &{} i\le j_0 \\ \max \{0,\tilde{w}^*_i-\delta \} &{} d \ge i>j_0 \end{array}\right. }. \end{aligned}$$

Note that this vector must be different from \({\tilde{\mathbf {w}}}^*\), as \(w_{j_0+1}=\max \{0,\tilde{w}^*_{j_0+1}-\delta \}=\max \{0,\tilde{w}^*_{j_0}\}=\tilde{w}^*_{j_0}=w_{j_0}\), hence \(w_{j_0+1}=w_{j_0}\) yet \(\tilde{w}^*_{j_0+1}>\tilde{w}^*_{j_0}\) by assumption. On the other hand, it is easily verified that \(|w_i-w_{i+1}|^3\le |\tilde{w}^*_i-\tilde{w}^*_{i+1}|^3\) and \(w_i^2 \le (\tilde{w}^*_i)^2\) for all i (Footnote 4), and therefore \({\tilde{f}}(\mathbf {w})\le {\tilde{f}}({\tilde{\mathbf {w}}}^*)\). But since \({\tilde{\mathbf {w}}}^*\) is the unique global minimizer and \(\mathbf {w}\ne {\tilde{\mathbf {w}}}^*\), we get a contradiction, so we must have \(\tilde{w}^*_j\) monotonically decreasing for all j.

We now turn to prove the recursive relation \(\tilde{w}^*_{t+1} = w_{t}^*-\sqrt{\gamma -{\tilde{\lambda }} \sum _{j=1}^{t}\tilde{w}^*_j}\). By differentiating \({\tilde{f}}\) and setting to zero (and using the fact that \(\tilde{w}^*_j\) is monotonically decreasing in j), we get that

$$\begin{aligned} (\tilde{w}^*_1 - \tilde{w}^*_2)^2 = \gamma -{\tilde{\lambda }} \tilde{w}^*_1,~~~~(\tilde{w}^*_{{\tilde{T}}-1} - \tilde{w}^*_{{\tilde{T}}})^2 = {\tilde{\lambda }} \tilde{w}^*_{{\tilde{T}}} \end{aligned}$$
(9)

and

$$\begin{aligned} (\tilde{w}^*_{t}-\tilde{w}^*_{t+1})^2 = (\tilde{w}^*_{t-1}-\tilde{w}^*_{t})^2-{\tilde{\lambda }} \tilde{w}^*_{t} \quad \forall t\in \{2,3,\ldots ,{\tilde{T}}-1\}. \end{aligned}$$
(10)

By unrolling this recursive form, we get

$$\begin{aligned} (\tilde{w}^*_{t}-\tilde{w}^*_{t+1})^2 = \gamma -{\tilde{\lambda }}\sum _{j=1}^{t}\tilde{w}^*_j~ \quad \forall t\in \{1,2,\ldots ,{\tilde{T}}-1\}, \end{aligned}$$

from which the equation

$$\begin{aligned} \tilde{w}^*_{t+1} = \tilde{w}^*_{t}-\sqrt{\gamma -{\tilde{\lambda }} \sum _{j=1}^{t}\tilde{w}^*_j}~ \quad \forall t\in \{1,2,\ldots ,{\tilde{T}}-1\} \end{aligned}$$
(11)

follows, again using the monotonicity of \(\tilde{w}^*_t\) in t.

It remains to prove that \(\sum _{j=1}^{{\tilde{T}}}\tilde{w}^*_j = \frac{\gamma }{{\tilde{\lambda }}}\). By summing both sides of (10) from \(t=2\) to \(t={\tilde{T}}-1\) we have that:

$$\begin{aligned} (\tilde{w}^*_{{\tilde{T}}-1}-\tilde{w}^*_{{\tilde{T}}})^2 = (\tilde{w}^*_{1}-\tilde{w}^*_{2})^2-{\tilde{\lambda }}\sum _{t=2}^{{\tilde{T}}-1}\tilde{w}^*_{t} \end{aligned}$$

So by using (9) we get the desired equality. \(\square \)
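The claims of Lemma 1 are easy to verify numerically. A small sketch (with arbitrary \({\tilde{\lambda }},\gamma ,{\tilde{T}}\) values, minimizing \({\tilde{f}}\) with a generic solver) that checks monotonicity, the recursion, and the identity \(\sum _{j}\tilde{w}^*_j=\gamma /{\tilde{\lambda }}\):

```python
import numpy as np
from scipy.optimize import minimize

def f_tilde(w, lam_t, gamma):
    """f_tilde(w) = (1/3) sum_i |w_i - w_{i+1}|^3 + (lam_t/2) ||w||^2 - gamma * w_1."""
    return (np.sum(np.abs(np.diff(w)) ** 3) / 3.0
            + lam_t / 2.0 * np.dot(w, w) - gamma * w[0])

lam_t, gamma, T_t = 0.05, 1.0, 40
w_star = minimize(f_tilde, np.zeros(T_t), args=(lam_t, gamma),
                  method="L-BFGS-B", options={"ftol": 1e-14, "gtol": 1e-10}).x

print(np.all(np.diff(w_star) <= 1e-6))    # coordinates are (numerically) non-increasing
print(np.sum(w_star), gamma / lam_t)       # approximately equal: sum of coordinates = gamma / lam_t
t = 5                                      # the recursion w*_{t+1} = w*_t - sqrt(gamma - lam_t * sum_{j<=t} w*_j)
print(w_star[t], w_star[t - 1] - np.sqrt(gamma - lam_t * np.sum(w_star[:t])))
```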

Lemma 2

For any \(t\in \{1,2,\ldots ,{\tilde{T}}\}\),

$$\begin{aligned} \tilde{w}^*_t \ge \max \left\{ 0~,~\frac{\gamma ^{3/4}}{7\sqrt{{\tilde{\lambda }}}}+\sqrt{\gamma }\left( \frac{1}{2}-t\right) \right\} . \end{aligned}$$

Proof

By the displayed equation in Lemma 1, we clearly have \(\tilde{w}^*_{t+1}\ge \tilde{w}^*_t-\sqrt{\gamma }\) for all \(t\le {\tilde{T}}-1\), and therefore

$$\begin{aligned} \tilde{w}^*_t\ge \tilde{w}^*_1-(t-1)\sqrt{\gamma } \quad \forall t\in \{1,2,\ldots ,{\tilde{T}}\}. \end{aligned}$$
(12)

Using the facts that \(\tilde{w}^*_t\) is also always non-negative, that \({\tilde{T}}\ge \frac{\gamma \mu _2}{6\lambda }+1\ge \frac{\gamma }{{\tilde{\lambda }}}+1 \ge \tilde{w}^*_1 + 1\), and by Lemma 1,

$$\begin{aligned}&\frac{\gamma }{{\tilde{\lambda }}}=\sum _{t=1}^{{\tilde{T}}}\tilde{w}^*_t \ge \sum _{t=1}^{{\tilde{T}}}\max \{0,\tilde{w}^*_1-(t-1)\sqrt{\gamma }\}= \sum _{t=1}^{\lfloor \tilde{w}^*_1/\sqrt{\gamma }+1\rfloor }\left( \tilde{w}^*_1-(t-1)\sqrt{\gamma }\right) \\&= \left\lfloor \frac{\tilde{w}^*_1}{\sqrt{\gamma }}+1\right\rfloor \tilde{w}_1^*-\sqrt{\gamma }\frac{\left( \left\lfloor \frac{\tilde{w}^*_1}{\sqrt{\gamma }}+1\right\rfloor -1\right) \left\lfloor \frac{\tilde{w}^*_1}{\sqrt{\gamma }}+1\right\rfloor }{2} \ge \frac{(\tilde{w}^*_1)^2}{\sqrt{\gamma }}-\sqrt{\gamma } \frac{\frac{\tilde{w}^*_1}{\sqrt{\gamma }}\left( \frac{w_1^*}{\sqrt{\gamma }}+1\right) }{2}, \end{aligned}$$

which implies that \((\tilde{w}^*_1)^2-\sqrt{\gamma }\cdot \tilde{w}^*_1-\frac{2\gamma ^{3/2}}{{\tilde{\lambda }}}\le 0\), which in turn implies

$$\begin{aligned} \tilde{w}^*_1\le \frac{\sqrt{\gamma }+\sqrt{\gamma +8\gamma ^{3/2}/{\tilde{\lambda }}}}{2}\le \frac{\sqrt{\gamma }+\sqrt{\gamma }+\sqrt{8\gamma ^{3/2}/{\tilde{\lambda }}}}{2} = \sqrt{\gamma }+\sqrt{\frac{2\gamma ^{3/2}}{{\tilde{\lambda }}}}. \end{aligned}$$
(13)

On the other hand, again by Lemma 1, we know that

$$\begin{aligned} \tilde{w}^*_{t+1}\le \tilde{w}^*_{t}-\frac{\sqrt{\gamma }}{2}~, \quad \forall t\in \{1,2,\ldots ,{\tilde{T}}-1\}:\sum _{j=1}^{t}\tilde{w}^*_j\le \frac{3\gamma }{4{\tilde{\lambda }}}, \end{aligned}$$

and hence

$$\begin{aligned} \tilde{w}^*_{t+1}\le \tilde{w}^*_1-\frac{t\sqrt{\gamma }}{2},~~~~~\forall t\in \{1,2,\ldots ,{\tilde{T}}-1\}:\sum _{j=1}^{t}\tilde{w}^*_j\le \frac{3\gamma }{4{\tilde{\lambda }}}. \end{aligned}$$
(14)

Let \(t_0\in \{1,2,\ldots ,{\tilde{T}}\}\) be the smallest index such that \(\sum _{j=1}^{t_0}\tilde{w}^*_j > \frac{3\gamma }{4{\tilde{\lambda }}}\) (such an index must exist since \(\sum _{j=1}^{{\tilde{T}}}\tilde{w}^*_j=\frac{\gamma }{{\tilde{\lambda }}}\)). Since \(\frac{3\gamma }{4{\tilde{\lambda }}}<\sum _{j=1}^{t_0}\tilde{w}^*_j\le t_0 \tilde{w}^*_1\le t_0\left( \sqrt{\gamma }+\sqrt{2\gamma ^{3/2}/{\tilde{\lambda }}}\right) \) by (13), it follows that

$$\begin{aligned} t_0 ~\ge ~ \frac{3\gamma }{4{\tilde{\lambda }}\left( \sqrt{\gamma }+\sqrt{2\gamma ^{3/2}/{\tilde{\lambda }}}\right) }~=~ \frac{3\sqrt{\gamma }}{4\left( {\tilde{\lambda }}+\sqrt{2\gamma ^{1/2}{\tilde{\lambda }}}\right) }. \end{aligned}$$

According to (14) and the fact that \(\tilde{w}^*_{t_0}\ge 0\), it follows that

$$\begin{aligned} 0\le \tilde{w}^*_{t_0}\le \tilde{w}^*_1-\frac{(t_0-1)\sqrt{\gamma }}{2}, \end{aligned}$$

and hence

$$\begin{aligned} \tilde{w}^*_1 ~\ge ~ \frac{(t_0-1)\sqrt{\gamma }}{2}~\ge ~ \frac{3\gamma }{8({\tilde{\lambda }}+\sqrt{2\gamma ^{1/2}{\tilde{\lambda }}})}-\frac{\sqrt{\gamma }}{2}. \end{aligned}$$

Using this and (12), it follows that for all \(t\le {\tilde{T}}\),

$$\begin{aligned} \tilde{w}^*_t \ge \frac{3\gamma }{8({\tilde{\lambda }}+\sqrt{2\gamma ^{1/2}{\tilde{\lambda }}})}+\sqrt{\gamma }\left( \frac{1}{2}-t\right) . \end{aligned}$$

Since we assumed \(\gamma \ge 10^4(\lambda /\mu _2)^2 > (12\lambda /\mu _2)^2 = {\tilde{\lambda }}^2\), we have \({\tilde{\lambda }}< \sqrt{\gamma ^{1/2}{\tilde{\lambda }}}\), so the above can be lower bounded by the simpler expression \(\gamma ^{3/4}/7\sqrt{{\tilde{\lambda }}}+\sqrt{\gamma }(1/2-t)\). Since we also know that \(\tilde{w}^*_t\) is non-negative, the result follows. \(\square \)

Lemma 3

There exists an index \(t_0 \le {\tilde{T}}/2\) such that

$$\begin{aligned} \tilde{w}^*_{t_0+k} ~\ge ~ 9{\tilde{\lambda }}\cdot (18)^{-2^k} \quad \forall k\in \{0,1,\ldots ,{\tilde{T}}-t_0\} \end{aligned}$$

Proof

By Lemma 1, it holds for any \(t\in \{1,2,\ldots ,{\tilde{T}}-1\}\) that

$$\begin{aligned} \tilde{w}^*_{t}-\tilde{w}^*_{t+1}~=~\sqrt{\gamma -{\tilde{\lambda }}\sum _{j=1}^{t}\tilde{w}^*_j} ~=~\sqrt{\gamma -{\tilde{\lambda }}\left( \frac{\gamma }{{\tilde{\lambda }}}-\sum _{j=t+1}^{{\tilde{T}}}\tilde{w}^*_j\right) } ~=~\sqrt{{\tilde{\lambda }}\sum _{j=t+1}^{{\tilde{T}}}\tilde{w}^*_j}. \end{aligned}$$
(15)

In particular, since \(\tilde{w}^*_{j}\ge 0\) for all \(j\le {\tilde{T}}\), it follows that \(\tilde{w}^*_t\ge \sqrt{{\tilde{\lambda }}\sum _{j=t+1}^{{\tilde{T}}}\tilde{w}^*_j}\ge \sqrt{{\tilde{\lambda }} \tilde{w}^*_{t+1}}\), and therefore

$$\begin{aligned} \tilde{w}^*_{t+1}~\le ~ \frac{1}{{\tilde{\lambda }}}(\tilde{w}^*_t)^2 ~ \quad \forall t\in \{1,2,\ldots ,{\tilde{T}}-1\}. \end{aligned}$$
(16)

Let \(t\le {\tilde{T}}-1\) be any index such that \(\tilde{w}^*_{t+1} \le \frac{{\tilde{\lambda }}}{2}\) (Footnote 5). By (16), this implies that

$$\begin{aligned} \sum _{j=t+1}^{{\tilde{T}}}\tilde{w}^*_j&~\le ~ \sum _{k=0}^{{\tilde{T}}-t-1}{\tilde{\lambda }}\left( \frac{\tilde{w}^*_{t+1}}{{\tilde{\lambda }}}\right) ^{2^{k}}~=~ \sum _{k=0}^{{\tilde{T}}-t-1}\tilde{w}^*_{t+1}\left( \frac{\tilde{w}^*_{t+1}}{{\tilde{\lambda }}}\right) ^{2^{k}-1}\\&~\le ~ \sum _{k=0}^{{\tilde{T}}-t-1}\tilde{w}^*_{t+1} \left( \frac{1}{2}\right) ^{2^k-1}~<~ 2\cdot \tilde{w}^*_{t+1}. \end{aligned}$$

Using the inequality above together with (15) and the monotonicity of \(\tilde{w}^*_t\), we get that for all \(t\le {\tilde{T}}-1\) such that \(\tilde{w}^*_{t+1}\le \frac{{\tilde{\lambda }}}{2}\),

$$\begin{aligned} \tilde{w}^*_{t}&~=~\tilde{w}^*_{t+1}+\sqrt{{\tilde{\lambda }}\sum _{j=t+1}^{{\tilde{T}}}\tilde{w}^*_j} ~\le ~\tilde{w}^*_{t+1}+\sqrt{2{\tilde{\lambda }} \tilde{w}^*_{t+1}} ~=~\sqrt{\tilde{w}^*_{t+1}}\left( \sqrt{\tilde{w}^*_{t+1}}+\sqrt{2{\tilde{\lambda }}} \right) \\&~\le ~\sqrt{\tilde{w}^*_{t+1}}\left( \sqrt{\frac{{\tilde{\lambda }}}{2}}+\sqrt{2{\tilde{\lambda }}}\right) ~\le ~ 3\sqrt{{\tilde{\lambda }} \tilde{w}^*_{t+1}}. \end{aligned}$$

This chain of inequalities implies that

$$\begin{aligned} \tilde{w}^*_{t+1}\ge \frac{(\tilde{w}^*_t)^2}{9{\tilde{\lambda }}}~ \quad \forall t\in \{1,2,\ldots ,{\tilde{T}}-1\}: \tilde{w}^*_{t+1}\le \frac{{\tilde{\lambda }}}{2}. \end{aligned}$$

Let \(t_0\le {\tilde{T}}/2\) denote the unique index that satisfies \(\tilde{w}^*_{t_0}>\frac{{\tilde{\lambda }}}{2}\), as well as \(\tilde{w}^*_{t+1}\le \frac{{\tilde{\lambda }}}{2}\) for all \(t\) between \(t_0\) and \({\tilde{T}}-1\) (Footnote 6). Using the displayed inequality above, we get that for any \(k\le {\tilde{T}}-t_0\),

$$\begin{aligned} \tilde{w}^*_{t_0+k} ~\ge ~ \frac{(\tilde{w}^*_{t_0+k-1})^2}{9{\tilde{\lambda }}}~\ge ~ \frac{(\tilde{w}^*_{t_0+k-2})^4}{(9{\tilde{\lambda }})^3}~\ge ~\cdots ~\ge ~ 9{\tilde{\lambda }}\left( \frac{\tilde{w}^*_{t_0}}{9{\tilde{\lambda }}}\right) ^{2^k} ~>~ 9{\tilde{\lambda }}\left( \frac{{\tilde{\lambda }}/2}{9{\tilde{\lambda }}}\right) ^{2^k}, \end{aligned}$$

so we get \(\tilde{w}^*_{t_0+k} ~\ge ~ 9{\tilde{\lambda }}\cdot (18)^{-2^k}\) as required. \(\square \)

Lemma 4

\(\sum _{i=1}^{{\tilde{T}}}(\tilde{w}^*_i)^2 \le 2\gamma ^{7/4}/{\tilde{\lambda }}^{3/2}\)

Proof

We need to upper bound the squared Euclidean norm of \((\tilde{w}^*_1,\ldots ,\tilde{w}^*_{{\tilde{T}}})\). Note that for any vector \(\mathbf {w}\), we have \(\Vert \mathbf {w}\Vert ^2=\sum _i w_i^2 \le (\max _i |w_i|)\sum _i |w_i| = \Vert \mathbf {w}\Vert _{\infty }\Vert \mathbf {w}\Vert _1\). Thus, by Lemma 1, (13), and the assumption that \(\gamma \ge 10^4(\lambda /\mu _2)^2\ge 69{\tilde{\lambda }}^2\), the squared norm is at most

$$\begin{aligned} \left( \sqrt{\gamma }+\sqrt{\frac{2\gamma ^{3/2}}{{\tilde{\lambda }}}}\right) \frac{\gamma }{{\tilde{\lambda }}} ~=~\left( 1+\sqrt{\frac{2\gamma ^{1/2}}{{\tilde{\lambda }}}}\right) \frac{\gamma ^{3/2}}{{\tilde{\lambda }}} ~\le ~\left( \sqrt{\frac{\gamma ^{1/2}}{\sqrt{69}{\tilde{\lambda }}}}+\sqrt{\frac{2\gamma ^{1/2}}{{\tilde{\lambda }}}}\right) \frac{\gamma ^{3/2}}{{\tilde{\lambda }}}, \end{aligned}$$

which is at most \(2\sqrt{\gamma ^{1/2}/{\tilde{\lambda }}}\cdot \gamma ^{3/2}/{\tilde{\lambda }} = 2\gamma ^{7/4}/{\tilde{\lambda }}^{3/2}\). \(\square \)

Lemma 5

\(\mathbf {w}^*=\arg \min _{\mathbf {w}} f(\mathbf {w})\) satisfies \(\langle \mathbf {v}_i,\mathbf {w}^*\rangle =\tilde{w}_i^*\) for all \(i=1,\ldots ,{\tilde{T}}\), where \({\tilde{\mathbf {w}}}^*=\arg \min _{\mathbf {w}} {\tilde{f}}(\mathbf {w})\).

Proof

First, we argue that \({\tilde{\mathbf {w}}}^*\), which minimizes

$$\begin{aligned} {\tilde{f}}(\mathbf {w}) = \frac{1}{3}\sum _{i=1}^{{\tilde{T}}-1}|w_i-w_{i+1}|^3+\frac{{\tilde{\lambda }}}{2}\Vert \mathbf {w}\Vert ^2-\gamma \cdot w_1~, \end{aligned}$$

also minimizes

$$\begin{aligned} {\hat{f}}(\mathbf {w}) = \sum _{i=1}^{{\tilde{T}}-1}g(w_i-w_{i+1})+\frac{{\tilde{\lambda }}}{2}\Vert \mathbf {w}\Vert ^2-\gamma \cdot w_1. \end{aligned}$$

To see this, note that \({\tilde{f}}\) and \({\hat{f}}\) differ only in that g(x) is replaced by \(\frac{1}{3}|x|^3\). By definition of g, we have that g(x) and \(\frac{1}{3}|x|^3\) coincide for any \(|x|\le \Delta \), from which it is easily verified that \({\hat{f}}\) and \({\tilde{f}}\) have the same values and gradients at any \(\mathbf {w}\) for which \(|w_i-w_{i+1}|\le \Delta \) for all \(i\le {\tilde{T}}-1\). By Lemma 1 and the assumption \(\Delta \ge \sqrt{\gamma }\), the global minimizer \({\tilde{\mathbf {w}}}^*\) of \({\tilde{f}}\) belongs to this set, and therefore \(\nabla {\tilde{f}}({\tilde{\mathbf {w}}}^*)=\nabla {\hat{f}}({\tilde{\mathbf {w}}}^*)=\mathbf {0}\). But \({\hat{f}}\) is strongly convex, hence has a unique point (the global minimizer) at which the gradient of \({\hat{f}}\) is zero, hence \({\tilde{\mathbf {w}}}^*\) is indeed the global minimizer of \({\hat{f}}\).

Next, since the global minimizer is invariant to multiplying the function by a fixed positive factor, we get that \({\tilde{\mathbf {w}}}^*\) is also the global minimizer of

$$\begin{aligned} \frac{\mu _2}{12}{\hat{f}}(\mathbf {w})&= \frac{\mu _2}{12}\left( \sum _{i=1}^{{\tilde{T}}-1}g(w_i-w_{i+1})+ \frac{{\tilde{\lambda }}}{2}\Vert \mathbf {w}\Vert ^2-\gamma \cdot w_1\right) \\&= \frac{\mu _2}{12}\left( \sum _{i=1}^{{\tilde{T}}-1}g(w_i-w_{i+1})-\gamma \cdot w_1\right) +\frac{\lambda }{2}\Vert \mathbf {w}\Vert ^2, \end{aligned}$$

where in the last step we used the fact that \({\tilde{\lambda }}=12\lambda /\mu _2\). Recalling that

$$\begin{aligned} f(\mathbf {w}) = \frac{\mu _2}{12}\left( \sum _{i=1}^{{\tilde{T}}-1}g(\langle \mathbf {v}_i,\mathbf {w}\rangle -\langle \mathbf {v}_{i+1},\mathbf {w}\rangle ) -\gamma \langle \mathbf {v}_1,\mathbf {w}\rangle \right) +\frac{\lambda }{2}\Vert \mathbf {w}\Vert ^2, \end{aligned}$$

and that \(\mathbf {v}_1,\mathbf {v}_2,\ldots \) are orthogonal, we can write \(f(\mathbf {w})\) as \(\frac{\mu _2}{12}\cdot {\hat{f}}(V\mathbf {w})\), where V is any orthogonal matrix whose first \({\tilde{T}}\) rows are \(\mathbf {v}_1^\top ,\ldots ,\mathbf {v}_{{\tilde{T}}}^\top \). Therefore, the minimizer \(\mathbf {w}^*\) of f satisfies \(V\mathbf {w}^* =(\langle \mathbf {v}_1,\mathbf {w}^*\rangle ,\langle \mathbf {v}_2,\mathbf {w}^*\rangle ,\ldots )= {\tilde{\mathbf {w}}}^*\) (with \({\tilde{\mathbf {w}}}^*\) padded by zeros). \(\square \)

Lemma 6

\(\Vert \mathbf {w}^*\Vert ^2 = \sum _{i=1}^{{\tilde{T}}}\langle \mathbf {v}_i,\mathbf {w}^*\rangle ^2\)

Proof

\(f(\mathbf {w})\) is a function which can be written in the form \(h(\langle \mathbf {v}_1,\mathbf {w}\rangle ,\langle \mathbf {v}_2,\mathbf {w}\rangle ,\ldots ,\langle \mathbf {v}_{{\tilde{T}}},\mathbf {w}\rangle )+\frac{\lambda }{2}\Vert \mathbf {w}\Vert ^2\), so by the representer theorem, its minimizer \(\mathbf {w}^*\) must lie in the span of \(\mathbf {v}_1,\mathbf {v}_2,\ldots ,\mathbf {v}_{{\tilde{T}}}\). Moreover, since these vectors are orthogonal and of unit norm, we have \(\mathbf {w}^* = \sum _{i=1}^{{\tilde{T}}}\langle \mathbf {v}_i,\mathbf {w}^*\rangle \mathbf {v}_i\), and thus \(\Vert \mathbf {w}^*\Vert ^2 =\sum _{i=1}^{{\tilde{T}}}\langle \mathbf {v}_i,\mathbf {w}^*\rangle ^2\). \(\square \)

4.2 Oracle complexity lower bound

In this subsection, we prove the following oracle complexity lower bound, depending on the free parameter \(\gamma \):

Proposition 3

Assume that \(\epsilon < \min \left\{ \frac{108^2\cdot \lambda ^3}{\mu _2^2},\frac{\gamma \lambda }{8}\right\} \). Under the conditions of Proposition 2, it is possible to choose the vectors \(\mathbf {v}_1,\mathbf {v}_2,\ldots ,\mathbf {v}_{{\tilde{T}}}\) in the function f, such that the number of iterations T required to have \(f(\mathbf {w}_T)-f(\mathbf {w}^*)\le \epsilon \) is at least

$$\begin{aligned} \max \left\{ \frac{\gamma ^{1/4}}{7\sqrt{12\lambda /\mu _2}}~,~\log _2 \log _{18}\left( \frac{108^2\cdot \lambda ^3 }{\mu _2^2\epsilon }\right) -1\right\} \end{aligned}$$

To prove the theorem, we will need the following key lemma, which establishes that oracle information at certain points \(\mathbf {w}\) do not leak any information on some of the \(\mathbf {v}_1,\mathbf {v}_2,\ldots \) vectors.

Lemma 7

For any \(\mathbf {w}\in \mathbb {R}^d\) orthogonal to \(\mathbf {v}_t,\mathbf {v}_{t+1},\ldots ,\mathbf {v}_{{\tilde{T}}}\), it holds that \(f(\mathbf {w}), \nabla f(\mathbf {w}),\nabla ^2 f(\mathbf {w})\) do not depend on \(\mathbf {v}_{t+1},\mathbf {v}_{t+2},\ldots ,\mathbf {v}_{{\tilde{T}}}\).

Proof

Since the regularization term \(\frac{\lambda }{2}\Vert \mathbf {w}\Vert ^2\) does not depend on \(\mathbf {v}_{t+1},\mathbf {v}_{t+2},\ldots ,\mathbf {v}_{{\tilde{T}}}\), we can define \(h(\mathbf {w}) := f(\mathbf {w})-\frac{\lambda }{2}\Vert \mathbf {w}\Vert ^2\) and prove the result for \(h(\mathbf {w})\). Using the definition of h and differentiating, we have that

$$\begin{aligned} h(\mathbf {w})&= \frac{\mu _2}{12} \left( \sum _{i=1}^{{\tilde{T}}-1}g(\langle \mathbf {v}_i-\mathbf {v}_{i+1},\mathbf {w}\rangle ) - \gamma \langle \mathbf {v}_1,\mathbf {w}\rangle \right) \\ \nabla h(\mathbf {w})&= \frac{\mu _2}{12} \left( \sum _{i=1}^{{\tilde{T}}-1}g'(\langle \mathbf {v}_i-\mathbf {v}_{i+1},\mathbf {w}\rangle )(\mathbf {v}_{i}-\mathbf {v}_{i+1}) - \gamma \mathbf {v}_1 \right) \\ \nabla ^2 h(\mathbf {w})&= \frac{\mu _2}{12} \left( \sum _{i=1}^{{\tilde{T}}-1}g''(\langle \mathbf {v}_i-\mathbf {v}_{i+1},\mathbf {w}\rangle )(\mathbf {v}_{i}-\mathbf {v}_{i+1})(\mathbf {v}_{i}-\mathbf {v}_{i+1})^T \right) \end{aligned}$$

By the assumption \(\langle \mathbf {v}_{t},\mathbf {w}\rangle =\langle \mathbf {v}_{t+1},\mathbf {w}\rangle =\ldots =0\), and the fact that \(g(0)=g'(0)=g''(0)=0\), we have that \(g(\langle \mathbf {v}_i-\mathbf {v}_{i+1},\mathbf {w}\rangle )=g'(\langle \mathbf {v}_i-\mathbf {v}_{i+1},\mathbf {w}\rangle )=g''(\langle \mathbf {v}_i-\mathbf {v}_{i+1},\mathbf {w}\rangle )=0\) for all \(i\in \{t,t+1,\ldots ,{\tilde{T}}-1\}\). Therefore, it is easily verified that the expressions above indeed do not depend on \(\mathbf {v}_{t+1},\mathbf {v}_{t+2},\ldots ,\mathbf {v}_{{\tilde{T}}}\). \(\square \)
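Lemma 7 can also be observed numerically: re-sampling \(\mathbf {v}_{t+1},\ldots ,\mathbf {v}_{{\tilde{T}}}\) leaves the oracle’s answer at such a point \(\mathbf {w}\) unchanged. A small sketch comparing function values only (the same holds for gradients and Hessians); the dimensions and parameter values are arbitrary:

```python
import numpy as np

def g(x, Delta):
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= Delta, np.abs(x) ** 3 / 3.0,
                    Delta * x ** 2 - Delta ** 2 * np.abs(x) + Delta ** 3 / 3.0)

def f_hard(w, V, mu2, lam, gamma, Delta):
    """f(w) = (mu2/12)(sum_i g(<v_i,w> - <v_{i+1},w>) - gamma <v_1,w>) + (lam/2)||w||^2,
    where the v_i's are the rows of V."""
    u = V @ w
    return mu2 / 12.0 * (np.sum(g(np.diff(u), Delta)) - gamma * u[0]) + lam / 2.0 * (w @ w)

rng = np.random.default_rng(0)
d, T_t, t = 12, 6, 3
Q = np.linalg.qr(rng.normal(size=(d, d)))[0]   # a random orthonormal basis of R^d
V1 = Q[:T_t].copy()                            # v_1, ..., v_{T~}
V2 = V1.copy()
V2[t:] = Q[T_t:T_t + (T_t - t)]                # replace v_{t+1}, ..., v_{T~} by fresh orthonormal vectors
w = rng.normal(size=t - 1) @ V1[:t - 1]        # a point in span(v_1, ..., v_{t-1}), orthogonal to v_t, ..., v_{T~}

params = dict(mu2=1.0, lam=0.1, gamma=2.0, Delta=5.0)
print(f_hard(w, V1, **params), f_hard(w, V2, **params))   # identical (up to floating point)
```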

Let us now fix any number of iterations \(T\le {\tilde{T}}\). Using the previous results, we now argue that given any deterministic algorithm, we can choose \(\mathbf {v}_1,\ldots ,\mathbf {v}_{{\tilde{T}}}\) so that one of them will necessarily be orthogonal to \(\mathbf {w}_T\), the algorithm’s output after T iterations:

Lemma 8

For any deterministic algorithm, it is possible to choose \(\mathbf {v}_1,\ldots ,\mathbf {v}_{{\tilde{T}}}\), so that \(\mathbf {v}_T,\mathbf {v}_{T+1},\ldots ,\mathbf {v}_{{\tilde{T}}}\) are orthogonal to both \(\mathbf {w}_1,\ldots ,\mathbf {w}_T\) and \(\mathbf {v}_1,\ldots ,\mathbf {v}_{T-1}\).

Proof

The proof is constructive, and uses an argument similar to other oracle complexity lower bounds in the literature. It proceeds as follows:

  • First, we compute \(\mathbf {w}_1\) (which is possible since the algorithm is deterministic and \(\mathbf {w}_1\) is chosen before any oracle calls are made).

  • We pick \(\mathbf {v}_1\) to be some unit vector orthogonal to \(\mathbf {w}_1\). Assuming \(\mathbf {v}_2,\ldots ,\mathbf {v}_{{\tilde{T}}}\) will also be orthogonal to \(\mathbf {w}_1\) (which will be ensured by the construction which follows), we have by Lemma 7 that the information \(f(\mathbf {w}_1),\nabla f(\mathbf {w}_1),\nabla ^2 f(\mathbf {w}_1)\) provided by the oracle to the algorithm does not depend on \(\{\mathbf {v}_2,\ldots ,\mathbf {v}_{{\tilde{T}}}\}\), and thus depends only on \(\mathbf {v}_1\), which was already fixed. Since the algorithm is deterministic, this fixes the next query point \(\mathbf {w}_2\).

  • For \(t=2,3,\ldots ,T-1\), we repeat the process above: We compute \(\mathbf {w}_t\), and pick \(\mathbf {v}_{t}\) to be some unit vector orthogonal to \(\mathbf {w}_1,\mathbf {w}_2,\ldots ,\mathbf {w}_t\), as well as all previously constructed \(\mathbf {v}\)’s (this is always possible since the dimension is sufficiently large). By Lemma 7, as long as all vectors thus constructed are orthogonal to \(\mathbf {w}_t\), the information \(\{f(\mathbf {w}_t),\nabla f(\mathbf {w}_t),\nabla ^2 f(\mathbf {w}_t)\}\) provided to the algorithm does not depend on \(\mathbf {v}_{t+1},\ldots ,\mathbf {v}_{{\tilde{T}}}\), and only depends on \(\mathbf {v}_1,\ldots ,\mathbf {v}_t\), which were already determined. Therefore, the next query point \(\mathbf {w}_{t+1}\) is fixed.

  • At the end of the process, we pick \(\mathbf {v}_{T},\mathbf {v}_{T+1},\ldots ,\mathbf {v}_{{\tilde{T}}}\) to be some unit vectors orthogonal to all previously chosen \(\mathbf {v}\)’s as well as \(\mathbf {w}_1,\ldots ,\mathbf {w}_T\) (this is possible since the dimension is large enough). \(\square \)
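The construction above can be summarized by the following sketch; here `next_query` stands for the given deterministic algorithm (together with a simulation of the oracle’s answers, which by Lemma 7 are determined by the v’s chosen so far), and is an assumed callable rather than part of the proof:

```python
import numpy as np

def pick_orthogonal_unit_vector(vectors, d, rng):
    """Return a unit vector orthogonal to the span of `vectors` (requires d > len(vectors))."""
    x = rng.normal(size=d)
    if vectors:
        Q, _ = np.linalg.qr(np.array(vectors).T)   # orthonormal basis of the span
        x -= Q @ (Q.T @ x)                          # project the span out
    return x / np.linalg.norm(x)

def build_hard_instance(next_query, w1, T, d, seed=0):
    """Adversarially choose v_1, ..., v_{T-1}: each v_t is orthogonal to all query
    points seen so far and to the previous v's.  By Lemma 7 the oracle answers,
    and hence the next query of the deterministic algorithm, are then determined.
    The remaining v_T, ..., v_{T~} are chosen orthogonal to everything at the end."""
    rng = np.random.default_rng(seed)
    queries, vs = [np.asarray(w1, dtype=float)], []
    for _ in range(1, T):
        vs.append(pick_orthogonal_unit_vector(queries + vs, d, rng))
        queries.append(next_query(queries, vs))
    return queries, vs
```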

Using the facts that \(\mathbf {v}_1,\mathbf {v}_2,\ldots ,\mathbf {v}_{{\tilde{T}}}\) are orthogonal (and thus act as an orthonormal basis to a subspace of \(\mathbb {R}^d\)), that \(\mathbf {w}_T\) is orthogonal to \(\mathbf {v}_T,\mathbf {v}_{T+1},\ldots \mathbf {v}_{{\tilde{T}}}\), and that \(t_0+T\le \frac{{\tilde{T}}}{2}+\frac{{\tilde{T}}}{2}={\tilde{T}}\) (where \(t_0\) is as defined in Proposition 2), we have \( \Vert \mathbf {w}_T-\mathbf {w}^*\Vert ^2 \ge \sum _{i=1}^{{\tilde{T}}} \langle \mathbf {v}_i,\mathbf {w}_T-\mathbf {w}^*\rangle ^2 \ge \langle \mathbf {v}_{t_0+T},\mathbf {w}_T-\mathbf {w}^*\rangle ^2 = \langle \mathbf {v}_{t_0+T},\mathbf {w}^*\rangle ^2\). By Proposition 2, we can lower bound the above by \( \frac{108^2\lambda ^2}{\mu _2^2}\cdot (18)^{-2^{(T+1)}} \). Using the strong convexity of f, we therefore get

$$\begin{aligned} f(\mathbf {w}_T)-f(\mathbf {w}^*) \ge \frac{\lambda }{2}\cdot \Vert \mathbf {w}_T-\mathbf {w}^*\Vert ^2 \ge \frac{108^2\cdot \lambda ^3}{\mu _2^2} (18)^{-2^{(T+1)}}. \end{aligned}$$

To make the right-hand side smaller than \(\epsilon \), T must satisfy \( (18)^{-2^{(T+1)}} \le \frac{\mu _2^2\epsilon }{108^2\cdot \lambda ^3 } \), which is equivalent to \( 2^{(T+1)} \ge \log _{18}\left( \frac{108^2\cdot \lambda ^3 }{\mu _2^2\epsilon }\right) \). Assuming \(\epsilon < \frac{108^2\cdot \lambda ^3}{\mu _2^2}\), this implies

$$\begin{aligned} T ~\ge ~ \log _2 \log _{18}\left( \frac{108^2\cdot \lambda ^3 }{\mu _2^2\epsilon }\right) -1. \end{aligned}$$

We now turn to argue that we can also lower bound T by \(\frac{\gamma ^{1/4}}{7\sqrt{12\lambda /\mu _2}}\). To see this, suppose by contradiction that we can have \(f(\mathbf {w}_T)-f(\mathbf {w}^*)\le \epsilon \) for some \(T< \frac{\gamma ^{1/4}}{7\sqrt{12\lambda /\mu _2}}\). From Proposition 2 we know that \( \langle \mathbf {v}_T,\mathbf {w}^*\rangle ~\ge ~ \frac{\gamma ^{3/4}}{7\sqrt{12\lambda /\mu _2}} +\sqrt{\gamma }\left( \frac{1}{2}-T\right) \), so as before, we have that \( \Vert \mathbf {w}_T-\mathbf {w}^*\Vert ^2 \ge \sum _{i=1}^{{\tilde{T}}} \langle \mathbf {v}_i,\mathbf {w}_T-\mathbf {w}^*\rangle ^2 \ge \langle \mathbf {v}_{T},\mathbf {w}_T-\mathbf {w}^*\rangle ^2 = \langle \mathbf {v}_{T},\mathbf {w}^*\rangle ^2, \) and thus

$$\begin{aligned} f(\mathbf {w}_T)-f(\mathbf {w}^*)&~\ge ~ \frac{\lambda }{2}\cdot \Vert \mathbf {w}_T-\mathbf {w}^*\Vert ^2 ~\ge ~ \frac{\lambda }{2}\cdot \langle \mathbf {v}_T,\mathbf {w}^*\rangle ^2 \\&~\ge ~\frac{\lambda }{2} \left( \frac{\gamma ^{3/4}}{7\sqrt{12\lambda /\mu _2}} +\sqrt{\gamma }\left( \frac{1}{2}-T\right) \right) ^2. \end{aligned}$$

To make the right-hand side smaller than \(\epsilon \), T must satisfy

\( \left( \frac{\gamma ^{3/4}}{7\sqrt{12\lambda /\mu _2}} +\sqrt{\gamma }\left( \frac{1}{2}-T\right) \right) ^2 \le \frac{2\epsilon }{\lambda } \), or equivalently \( T ~\ge ~ \frac{\gamma ^{1/4}}{7\sqrt{12\lambda /\mu _2}}+\frac{1}{2} - \sqrt{\frac{2\epsilon }{\gamma \lambda }} \). But since we assume \(\epsilon < \frac{\gamma \lambda }{8}\), this is at least \(\frac{\gamma ^{1/4}}{7\sqrt{12\lambda /\mu _2}}\), contradicting our earlier assumption.
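To spell out the algebra behind the last equivalence: the squared inequality implies \( \frac{\gamma ^{3/4}}{7\sqrt{12\lambda /\mu _2}} +\sqrt{\gamma }\left( \frac{1}{2}-T\right) \le \sqrt{\frac{2\epsilon }{\lambda }} \) (as any real number is at most the square root of its square), which rearranges to

$$\begin{aligned} \sqrt{\gamma }\,T ~\ge ~ \frac{\gamma ^{3/4}}{7\sqrt{12\lambda /\mu _2}}+\frac{\sqrt{\gamma }}{2}-\sqrt{\frac{2\epsilon }{\lambda }}, \end{aligned}$$

and dividing both sides by \(\sqrt{\gamma }\) (noting that \(\gamma ^{3/4}/\sqrt{\gamma }=\gamma ^{1/4}\)) yields the stated condition on T.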

Overall, we showed that T is lower bounded by both \(\frac{\gamma ^{1/4}}{7\sqrt{12\lambda /\mu _2}}\), as well as \(\log _2 \log _{18}\left( \frac{108^2\cdot \lambda ^3}{\mu _2^2 \epsilon }\right) -1\), hence proving Proposition 3.

4.3 Setting the \(\gamma ,\Delta \) Parameters

In the following lemma, we establish the strong convexity and smoothness parameters of f (depending on the parameter \(\Delta \) which is still free at this point).

Lemma 9

f is \(\lambda \)-strongly convex and twice-differentiable, with \(\mu _2\)-Lipschitz Hessians and \(\left( \frac{2\mu _2\Delta }{3}+\lambda \right) \)-Lipschitz gradients.

Proof

Since f is a sum of convex, twice-differentiable functions and the \(\lambda \)-strongly convex function \(\frac{\lambda }{2}\Vert \mathbf {w}\Vert ^2\), it is clearly \(\lambda \)-strongly convex and twice-differentiable. Thus, it only remains to calculate the Lipschitz parameter of the gradients and Hessians.

To simplify the proof, we note that Lipschitz smoothness is a property invariant to the coordinate system used, so we can assume without loss of generality that \(\mathbf {v}_1,\mathbf {v}_2,\ldots ,\mathbf {v}_{{\tilde{T}}}\) correspond to the standard basis \(\mathbf {e}_1,\mathbf {e}_2,\ldots ,\mathbf {e}_{{\tilde{T}}}\), and consider the Lipschitz properties of the function

$$\begin{aligned} {\hat{f}}(\mathbf {w}) = \frac{\mu _2}{12}\left( \sum _{i=1}^{{\tilde{T}}-1}g(w_i-w_{i+1})-\gamma \cdot w_1\right) +\frac{\lambda }{2}\Vert \mathbf {w}\Vert ^2 \end{aligned}$$

By definition of g, it is easily verified that \( g''(x) = 2\cdot \min \{\Delta ,|x|\} \), which is a 2-Lipschitz function bounded in \([0,2\Delta ]\). This implies that \(g'(x)\) is \(2\Delta \)-Lipschitz. Letting \(\mathbf {r}_i:=\mathbf {e}_i-\mathbf {e}_{i+1}\), we can write \({\hat{f}}\) as \( \frac{\mu _2}{12}\left( \sum _{i=1}^{{\tilde{T}}-1}g(\langle \mathbf {r}_i,\mathbf {w}\rangle )-\gamma \cdot w_1\right) \)\(+\frac{\lambda }{2}\Vert \mathbf {w}\Vert ^2\). Differentiating twice, we get \( \nabla ^2 {\hat{f}}(\mathbf {w}) = \frac{\mu _2}{12} \sum _{i=1}^{{\tilde{T}}-1}g''(\langle \mathbf {r}_i,\mathbf {w}\rangle )\cdot \mathbf {r}_i \mathbf {r}_i^\top +\lambda I \). Since this is a sum of positive-semidefinite matrices with non-negative coefficients (as we showed that \(g''(x)\in [0,2\Delta ]\) for all x), it follows that its spectral norm is at most \( \frac{\mu _2\Delta }{6}\cdot \left\| \sum _{i=1}^{{\tilde{T}}-1}\mathbf {r}_i\mathbf {r}_i^\top \right\| +\lambda \), and the first term equals

$$\begin{aligned} \frac{\mu _2\Delta }{6}\cdot \max _{\mathbf {x}}\frac{\sum _{i=1}^{{\tilde{T}}-1}\langle \mathbf {r}_i,\mathbf {x}\rangle ^2}{\Vert \mathbf {x}\Vert ^2}&~=~ \frac{\mu _2\Delta }{6}\cdot \max _{\mathbf {x}}\frac{\sum _{i=1}^{{\tilde{T}}-1}(x_i-x_{i+1})^2}{\Vert \mathbf {x}\Vert ^2}\\&~\le ~ \frac{\mu _2\Delta }{6}\cdot \max _{\mathbf {x}}\frac{\sum _{i=1}^{{\tilde{T}}-1}(2x_i^2+2x_{i+1}^2)}{\sum _{i=1}^{d}x_i^2}~\le ~ \frac{2\mu _2\Delta }{3}. \end{aligned}$$

Overall, we showed that \(\Vert \nabla ^2 {\hat{f}}(\mathbf {w})\Vert \le \frac{2\mu _2\Delta }{3}+\lambda \), so the gradients of f are \(\left( \frac{2\mu _2\Delta }{3}+\lambda \right) \)-Lipschitz.

It remains to show that \(\nabla ^2 {\hat{f}}(\mathbf {w})\) is \(\mu _2\)-Lipschitz. Using the formula for \(\nabla ^2 {\hat{f}}(\mathbf {w})\), and recalling that \(g''(x)\) is 2-Lipschitz, and \(\Vert \mathbf {r}_i\Vert =\sqrt{2}\) by definition, we have that for any \(\mathbf {w},{\tilde{\mathbf {w}}}\),

$$\begin{aligned}&\Vert \nabla ^2 {\hat{f}}(\mathbf {w})-\nabla ^2 {\hat{f}}({\tilde{\mathbf {w}}})\Vert ~=~\frac{\mu _2}{12}\cdot \left\| {\sum _{i=1}^{{\tilde{T}}-1}(g''(\langle \mathbf {r}_i,\mathbf {w}\rangle )-g''(\langle \mathbf {r}_i,{\tilde{\mathbf {w}}}\rangle ))\cdot \mathbf {r}_i\mathbf {r}_i^\top }\right\| \\&~\le ~\frac{\mu _2}{12}\cdot \left\| {\sum _{i=1}^{{\tilde{T}}-1}|g''(\langle \mathbf {r}_i,\mathbf {w}\rangle )-g''(\langle \mathbf {r}_i,{\tilde{\mathbf {w}}}\rangle )|\cdot \mathbf {r}_i\mathbf {r}_i^\top }\right\| \\&~\le ~ \frac{\mu _2}{12}\cdot \left\| {\sum _{i=1}^{{\tilde{T}}-1}2|\langle \mathbf {r}_i,\mathbf {w}-{\tilde{\mathbf {w}}}\rangle |\cdot \mathbf {r}_i\mathbf {r}_i^\top }\right\| ~\le ~ \frac{\mu _2}{12}\cdot 2\sqrt{2}\cdot \Vert \mathbf {w}-{\tilde{\mathbf {w}}}\Vert \cdot \left\| \sum _{i=1}^{{\tilde{T}}-1}\mathbf {r}_i\mathbf {r}_i^\top \right\| . \end{aligned}$$

Using the same calculations as earlier, we have \(\left\| \sum _{i=1}^{{\tilde{T}}-1}\mathbf {r}_i\mathbf {r}_i^\top \right\| \le 4\), and therefore we showed overall that \( \Vert \nabla ^2 {\hat{f}}(\mathbf {w})-\nabla ^2 {\hat{f}}({\tilde{\mathbf {w}}})\Vert ~\le ~ \frac{\mu _2\cdot 8\sqrt{2}}{12}\cdot \Vert \mathbf {w}-{\tilde{\mathbf {w}}}\Vert < \mu _2\cdot \Vert \mathbf {w}-{\tilde{\mathbf {w}}}\Vert \), hence \(\nabla ^2 {\hat{f}}(\mathbf {w})\) is \(\mu _2\)-Lipschitz. \(\square \)
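As a quick numerical sanity check of the two facts used in the proof above (a sketch only; the values of \({\tilde{T}}\) and \(\Delta \) below are arbitrary illustrative choices, not from the construction), one can verify that \(g''(x)=2\min \{\Delta ,|x|\}\) is bounded in \([0,2\Delta ]\) and 2-Lipschitz, and that the spectral norm of \(\sum _{i=1}^{{\tilde{T}}-1}\mathbf {r}_i\mathbf {r}_i^\top \) (the Laplacian matrix of a path graph) is indeed below 4:

```python
import numpy as np

# Illustrative values (not from the paper)
T_tilde, Delta = 50, 1.3

# g''(x) = 2*min(Delta, |x|): bounded in [0, 2*Delta] and 2-Lipschitz
xs = np.linspace(-3 * Delta, 3 * Delta, 10001)
g2 = 2 * np.minimum(Delta, np.abs(xs))
assert g2.min() >= 0 and g2.max() <= 2 * Delta
assert np.max(np.abs(np.diff(g2) / np.diff(xs))) <= 2 + 1e-9

# ||sum_i r_i r_i^T|| with r_i = e_i - e_{i+1}: the path-graph Laplacian
L = np.zeros((T_tilde, T_tilde))
for i in range(T_tilde - 1):
    r = np.zeros(T_tilde)
    r[i], r[i + 1] = 1.0, -1.0
    L += np.outer(r, r)
print(np.linalg.norm(L, 2))  # spectral norm; always strictly below 4
```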

We now collect the ingredients necessary to fix \(\gamma ,\Delta \) and hence prove our theorem. Combining the previous lemma, Proposition 2 and Proposition 3, and recalling that we want f to have \(\mu _1\)-Lipschitz gradients and \(\mu _2\)-Lipschitz Hessians, with an optimizer \(\mathbf {w}^*\) satisfying \(\Vert \mathbf {w}^*\Vert \le D\), we have an oracle complexity lower bound of the form

$$\begin{aligned} T~\ge ~ \max \left\{ \frac{\gamma ^{1/4}}{7\sqrt{12\lambda /\mu _2}}~,~\log _2 \log _{18}\left( \frac{108^2\cdot \lambda ^3 }{\mu _2^2\epsilon }\right) -1\right\} , \end{aligned}$$
(17)

assuming the following conditions: \( \gamma \ge 10^4\left( \frac{\lambda }{\mu _2}\right) ^2\),\(\Delta \ge \sqrt{\gamma }\), \(\epsilon < \min \left\{ \frac{108^2\cdot \lambda ^3}{\mu _2^2},\frac{\gamma \lambda }{8}\right\} \), \(\frac{2\mu _2\Delta }{3}+\lambda \le \mu _1\), \(\sqrt{\frac{2\gamma ^{7/4}}{(12\lambda /\mu _2)^{3/2}}}\le D \). Picking \(\Delta =\sqrt{\gamma }\), using the fact that \(\mu _1\ge \lambda \) (as any \(\lambda \)-strongly convex function must have gradients with Lipschitz parameter at least \(\lambda \)), and rewriting the last two conditions, this is equivalent to

$$\begin{aligned} \gamma \ge 10^4\left( \frac{\lambda }{\mu _2}\right) ^2, \epsilon < \min \left\{ \frac{108^2\cdot \lambda ^3}{\mu _2^2},\frac{\gamma \lambda }{8}\right\} , \gamma \le \left( \frac{3(\mu _1-\lambda )}{2\mu _2}\right) ^2, \gamma \le \root 7 \of {\frac{D^8(12\lambda )^6}{2^4\mu _2^6}}. \end{aligned}$$
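In more detail, with \(\Delta =\sqrt{\gamma }\), the last two of the original conditions rearrange as

$$\begin{aligned} \frac{2\mu _2\sqrt{\gamma }}{3}+\lambda \le \mu _1 ~\Longleftrightarrow ~ \gamma \le \left( \frac{3(\mu _1-\lambda )}{2\mu _2}\right) ^2, \qquad \frac{2\gamma ^{7/4}}{(12\lambda /\mu _2)^{3/2}}\le D^2 ~\Longleftrightarrow ~ \gamma ^{7}\le \frac{D^{8}(12\lambda /\mu _2)^{6}}{2^{4}} ~\Longleftrightarrow ~ \gamma \le \root 7 \of {\frac{D^8(12\lambda )^6}{2^4\mu _2^6}}. \end{aligned}$$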

Since the first condition needs to hold anyway, we can allow ourselves to make the second condition stronger, by substituting \(10^4(\lambda /\mu _2)^2\) in lieu of \(\gamma \) in the second condition. Doing this, simplifying, and merging the last two conditions, the set of conditions above is implied by requiring

$$\begin{aligned} \gamma \ge 10^4\left( \frac{\lambda }{\mu _2}\right) ^2, \epsilon < \frac{10^4 \lambda ^3}{8\mu _2^2} , \gamma \le \min \left\{ \left( \frac{3(\mu _1-\lambda )}{2\mu _2}\right) ^2, \root 7 \of {\frac{D^8(12\lambda )^6}{2^4\mu _2^6}}\right\} . \end{aligned}$$

Clearly, to make the lower bound in (17) as large as possible, we should pick the largest possible \(\gamma \), namely \( \gamma =\min \left\{ \left( \frac{3(\mu _1-\lambda )}{2\mu _2}\right) ^2, \root 7 \of {\frac{D^8(12\lambda )^6}{2^4\mu _2^6}}\right\} \), and to ensure that the other conditions hold, require that

$$\begin{aligned} \min \left\{ \left( \frac{3(\mu _1-\lambda )}{2\mu _2}\right) ^2, \root 7 \of {\frac{D^8(12\lambda )^6}{2^4\mu _2^6}}\right\} ~\ge ~ 10^4\left( \frac{\lambda }{\mu _2}\right) ^2,~~ \epsilon < \frac{10^4 \lambda ^3}{8\mu _2^2}. \end{aligned}$$

Simplifying a bit, these two conditions are implied by requiring

$$\begin{aligned} \frac{\mu _1}{\lambda }\ge 68,~~\frac{\mu _2}{\lambda }D\ge 694,~~ \epsilon < \frac{10^4 \lambda ^3}{8\mu _2^2}, \end{aligned}$$
(18)
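As a quick numerical check (a sketch, not part of the formal argument), the constants 68 and 694 in (18) can be recovered from the previous display as follows:

```python
# First condition: (3(mu1 - lam)/(2 mu2))^2 >= 1e4 (lam/mu2)^2
#   <=>  mu1 - lam >= (200/3) lam   <=>  mu1/lam >= 1 + 200/3
print(1 + 200 / 3)                          # ~ 67.67, so mu1/lam >= 68 suffices
# Second condition: (D^8 (12 lam)^6 / (2^4 mu2^6))^(1/7) >= 1e4 (lam/mu2)^2
#   <=>  (mu2 D / lam)^8 >= 1e28 * 2^4 / 12^6
print((1e28 * 2**4 / 12**6) ** (1 / 8))     # ~ 693.4, so (mu2/lam)*D >= 694 suffices
```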

Finally, let us plug our choice of \(\gamma =\min \left\{ \left( \frac{3(\mu _1-\lambda )}{2\mu _2}\right) ^2, \root 7 \of {\frac{D^8(12\lambda )^6}{2^4\mu _2^6}}\right\} \) into the lower bound in (17). We thus get an oracle complexity lower bound of

$$\begin{aligned}&\max \left\{ \frac{\sqrt{\mu _2}}{7\sqrt{12\lambda }}\min \left\{ \sqrt{\frac{3(\mu _1-\lambda )}{2\mu _2}},\frac{D^{2/7}(12\lambda )^{3/14}}{2^{1/7}\mu _2^{3/14}}\right\} , \log _2 \log _{18}\left( \frac{108^2\lambda ^3 }{\mu _2^2\epsilon }\right) -1 \right\} \\&= \max \left\{ \min \left\{ \frac{1}{14}\sqrt{\frac{\mu _1-\lambda }{2\lambda }}, \frac{(D\mu _2/12\lambda )^{2/7}}{7\cdot 2^{1/7}}\right\} ~, \log _2 \log _{18}\left( \frac{108^2\cdot \lambda ^3 }{\mu _2^2\epsilon }\right) -1 \right\} , \end{aligned}$$

under the conditions of (18).

To simplify the bound a bit, we note that we can lower bound \(\mu _1-\lambda \) by \(\frac{67}{68}\mu _1\) (possible by (18)), and lower bound \(\log _2 \log _{18}\left( \frac{108^2\cdot \lambda ^3 }{\mu _2^2\epsilon }\right) -1\) by \(\frac{1}{2}\log \log _{18}\left( \frac{\lambda ^3}{\mu _2^2\epsilon }\right) \), by assuming that \(\epsilon \le c\lambda ^3/\mu _2^2\) for some small enough c (in other words, increasing the constant in the third condition in (18)). Finally, using the fact that \(\max \{a,b\}\ge (a+b)/2\), the result in the theorem follows.

5 Proof of Theorem 2

The proof follows the lines of Proposition 1 and its proof sketch, but with a more complicated construction (as we need to capture the dependence on the Lipschitz parameters of both the gradients and the Hessians).

Similarly to the strongly convex case, we will assume without loss of generality that the algorithm initializes at \(\mathbf {w}_1=\mathbf {0}\), since otherwise one can simply replace the “hard” function \(f(\mathbf {w})\) below by \(f(\mathbf {w}-\mathbf {w}_1)\), and the same proof holds verbatim. Thus, the theorem requires that our function has a minimizer \(\mathbf {w}^*\) satisfying \(\Vert \mathbf {w}^*\Vert \le D\).

Define \(g:\mathbb {R}\mapsto \mathbb {R}\) as

$$\begin{aligned} g(x) = {\left\{ \begin{array}{ll} \frac{1}{3}|x|^3 &{} |x|\le \Delta \\ \Delta x^2-\Delta ^2|x|+\frac{1}{3}\Delta ^3&{}|x|>\Delta , \end{array}\right. } \end{aligned}$$

where \(\Delta := \frac{3 \mu _1}{2 \mu _2}\). The function g can be easily verified to be twice continuously differentiable. Assume that \(d\ge 2^{\bar{T}}\), and let \(\mathbf {v}_1,\mathbf {v}_2,\ldots ,\mathbf {v}_{\bar{T}}\) be orthogonal unit vectors in \(\mathbb {R}^d\) which will be specified later. Given \(\bar{T}\), and letting \(\gamma >0\) be a parameter to be specified later, define the function \(f_{\bar{T}}\) as

$$\begin{aligned}&f_{\bar{T}}(\mathbf {w}) = \frac{\mu _2}{12}\left( g(\langle \mathbf {v}_1,\mathbf {w}\rangle )+g(\langle \mathbf {v}_{\bar{T}},\mathbf {w}\rangle )+\sum _{i=1}^{\bar{T}-1}g(\langle \mathbf {v}_i,\mathbf {w}\rangle -\langle \mathbf {v}_{i+1},\mathbf {w}\rangle )-\gamma \langle \mathbf {v}_1,\mathbf {w}\rangle \right) \\&= \frac{\mu _2}{12}\left( g(\langle \mathbf {v}_1,\mathbf {w}\rangle )+g(\langle \mathbf {v}_{\bar{T}},\mathbf {w}\rangle )+\sum _{i=1}^{\bar{T}-1}g(\langle \mathbf {v}_i-\mathbf {v}_{i+1},\mathbf {w}\rangle )-\gamma \langle \mathbf {v}_1,\mathbf {w}\rangle \right) \end{aligned}$$

This function is easily shown to be convex and twice-differentiable, with \(\mu _1\)-Lipschitz gradients and \(\mu _2\)-Lipschitz Hessians (the proof is identical to the proof of Lemma 9). Our goal will be to show a lower bound on the optimization error using this type of function.
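For concreteness, the twice continuous differentiability of g claimed above can be checked by matching the two branches at \(x=\Delta \) (and symmetrically at \(x=-\Delta \)):

$$\begin{aligned} g(\Delta )=\frac{1}{3}\Delta ^3=\Delta \cdot \Delta ^2-\Delta ^2\cdot \Delta +\frac{1}{3}\Delta ^3,\qquad g'(\Delta )=\Delta ^2=2\Delta \cdot \Delta -\Delta ^2,\qquad g''(\Delta )=2\Delta , \end{aligned}$$

so that, exactly as in the proof of Lemma 9, \(g''(x)=2\min \{\Delta ,|x|\}\) for all x.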

5.1 Minimizer of \(f_{\bar{T}}\)

In this subsection, we analyze the properties of a minimizer of \(f_{\bar{T}}\). To that end, we introduce the following function in \(\mathbb {R}^{\bar{T}}\):

$$\begin{aligned} {\hat{f}}_{\bar{T}}(\mathbf {w})=g(w_1)+g(w_{\bar{T}}) + \sum _{i=1}^{\bar{T}-1}g(w_i-w_{i+1})-\gamma w_1. \end{aligned}$$

It is easily verified that the minimal values of \(\frac{\mu _2}{12}{\hat{f}}_{\bar{T}}\) and \(f_{\bar{T}}\) are the same, and moreover, if \({\hat{\mathbf {w}}}^*\in \mathbb {R}^{\bar{T}}\) is a minimizer of \({\hat{f}}_{\bar{T}}\), then \(\mathbf {w}^*=\sum _{j=1}^{\bar{T}}{\hat{w}}^*_j\cdot \mathbf {v}_j\in \mathbb {R}^d\) is a minimizer of \(f_{\bar{T}}\), with the same Euclidean norm as \({\hat{\mathbf {w}}}^*\).

We begin with the following technical lemma:

Lemma 10

\({\hat{f}}_{\bar{T}}\) has a unique minimizer \({\hat{\mathbf {w}}}^* \in \mathbb {R}^{\bar{T}}\), which satisfies

$$\begin{aligned} {\hat{w}}^*_t = \delta \cdot (\bar{T}+1)\cdot \left( 1-\frac{t}{\bar{T}+1}\right) , \end{aligned}$$

for all \(t=1,2,\ldots ,\bar{T}\), where \(\delta \) is non-negative and independent of t. Moreover,

$$\begin{aligned} g'({\hat{w}}^*_1) + g'({\hat{w}}^*_{\bar{T}}) = \gamma . \end{aligned}$$

Proof

Setting the gradient of \({\hat{f}}_{\bar{T}}\) to zero, we get that

$$\begin{aligned} g'({\hat{w}}^*_1)+g'({\hat{w}}^*_1-{\hat{w}}^*_2)=\gamma ,~~g'({\hat{w}}^*_{\bar{T}-1}-{\hat{w}}^*_{\bar{T}})=g'({\hat{w}}^*_{\bar{T}}) \end{aligned}$$

as well as

$$\begin{aligned} g'({\hat{w}}^*_{j-1}-{\hat{w}}^*_j)=g'({\hat{w}}^*_{j}-{\hat{w}}^*_{j+1}) \end{aligned}$$

for all \(j\in \{2,3,\ldots ,\bar{T}-1\}\). By definition of g, it is easily verified that \(g'\) is a strictly monotonic (hence invertible) function, so the above implies \({\hat{w}}^*_{j-1}-{\hat{w}}^*_j={\hat{w}}^*_{j}-{\hat{w}}^*_{j+1}\) for all \(j\in \{2,3,\ldots ,\bar{T}-1\}\), as well as \({\hat{w}}^*_{\bar{T}-1}-{\hat{w}}^*_{\bar{T}}={\hat{w}}^*_{\bar{T}}\). From this, it follows by straightforward induction that \({\hat{w}}^*_{\bar{T}+1-t}=t\cdot {\hat{w}}^*_{\bar{T}}\), from which the first displayed equation in the lemma follows. This also implies \(g'(\bar{T}{\hat{w}}^*_{\bar{T}})+g'({\hat{w}}^*_{\bar{T}})=\gamma \), and since \(g'\) is strictly monotonic, we have that \({\hat{w}}^*_{\bar{T}}\) is uniquely defined, and since the other coordinates of \({\hat{\mathbf {w}}}^*\) are also uniquely defined given \({\hat{w}}^*_{\bar{T}}\), we get that \({\hat{\mathbf {w}}}^*\) is unique. Finally, \(\delta \) (and hence \({\hat{w}}_t^*\) for all t) is necessarily non-negative, since otherwise \({\hat{w}}^*_1\) is negative, which would imply \({\hat{f}}_{\bar{T}}({\hat{\mathbf {w}}}^*)>0\), even though \({\hat{f}}_{\bar{T}}(\mathbf {0})=0\), violating the fact that \({\hat{\mathbf {w}}}^*\) minimizes \({\hat{f}}_{\bar{T}}\). \(\square \)
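As a numerical illustration of Lemma 10 (a sketch only; the values of \({\bar{T}},\Delta ,\gamma \) below are arbitrary), one can minimize \({\hat{f}}_{\bar{T}}\) directly and observe the linear decay of the coordinates of the minimizer:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative parameters (not from the paper)
Tbar, Delta, gamma = 8, 1.0, 0.5

def g(x):
    ax = np.abs(x)
    return np.where(ax <= Delta, ax**3 / 3.0,
                    Delta * x**2 - Delta**2 * ax + Delta**3 / 3.0)

def f_hat(w):
    # f_hat_Tbar(w) = g(w_1) + g(w_Tbar) + sum_i g(w_i - w_{i+1}) - gamma * w_1
    return g(w[0]) + g(w[-1]) + np.sum(g(w[:-1] - w[1:])) - gamma * w[0]

w_star = minimize(f_hat, np.zeros(Tbar), method="BFGS").x
t = np.arange(1, Tbar + 1)
print(w_star / (Tbar + 1 - t))  # approximately constant (= delta), as in Lemma 10
```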

Table 1 Properties of \({\hat{f}}_{\bar{T}}({\hat{\mathbf {w}}}^*)\) and \(\Vert {\hat{\mathbf {w}}}^*\Vert ^2\) for different \(\gamma \) regimes

The main technical result in this subsection is the following proposition, which characterizes \(\Vert {\hat{\mathbf {w}}}^*\Vert \) and \({\hat{f}}_{\bar{T}}({\hat{\mathbf {w}}}^*)\) under various parameter regimes. By the discussion above and definition of \(f_{\bar{T}}\), we have

$$\begin{aligned} \Vert \mathbf {w}^*\Vert =\Vert {\hat{\mathbf {w}}}^*\Vert ~~~\text {and}~~~f_{\bar{T}}(\mathbf {w}^*)=\frac{\mu _2}{12}\cdot {\hat{f}}_{\bar{T}}({\hat{\mathbf {w}}}^*), \end{aligned}$$
(19)

which will be used in the remainder of the proof of our theorem.

Proposition 4

The values of the minimizer of \(\hat{f}_{\bar{T}}\) and the corresponding minimum \(\hat{f}_{\bar{T}}(\hat{\mathbf {w}}^*)\) for different \(\gamma \) regimes (which depend on \(\Delta \) and \(\bar{T}\)) are summarized in Table 1.

Proof

To prove the proposition, we will consider three regimes, depending on \(\bar{T},\delta ,\Delta \) (Table 1): Namely, \(\bar{T}\delta \le \Delta \), \(\frac{\Delta }{\bar{T}}<\delta \le \Delta \) and \(\delta >\Delta \). We will show that each regime corresponds to one of the three regimes specified in the proposition, and prove the relevant bounds.

\({\underline{\mathbf{Case 1: } {\bar{T}}\delta \le \Delta }}\). In that case, \({\hat{w}}^*_1,{\hat{w}}^*_{\bar{T}}\) as well as \({\hat{w}}^*_i-{\hat{w}}^*_{i+1}\) for all \(i=1,\ldots ,\bar{T}-1\) in the definition of \({\hat{f}}_{\bar{T}}\) all lie in the interval where g is a cubic function. Using Lemma 10, \( g'(w^*_1) + g'(w^*_{\bar{T}}) = w^{*2}_1 + w^{*2}_{\bar{T}} = \gamma \), hence \( \delta ^2\bar{T}^2+\delta ^2 = \gamma \) and \( \delta = \sqrt{\frac{\gamma }{1+\bar{T}^2}} \). Therefore, our condition \(\bar{T}\delta \le \Delta \) is exactly equivalent to \(\gamma \le \frac{\Delta ^2 \left( 1+\bar{T}^2\right) }{\bar{T}^2}\), namely the first regime discussed in the proposition. We now establish the relevant bounds. By plugging the optimal solution \({\hat{\mathbf {w}}}^*\) into \({\hat{f}}_{\bar{T}}(\mathbf {w})\), we have that

$$\begin{aligned} {\hat{f}}_{\bar{T}}({\hat{\mathbf {w}}}^*) ~=~ -\frac{2\gamma ^{3/2}\bar{T}}{3\sqrt{\left( 1+\bar{T}^2\right) }} \end{aligned}$$

and

$$\begin{aligned} \Vert {\hat{\mathbf {w}}}^*\Vert ^2_2 = \sum _{t=1}^{\bar{T}}{\hat{w}}^{*2}_t =\frac{\gamma (1+\bar{T})^2}{\left( 1+\bar{T}^2\right) }\sum _{t=1}^{\bar{T}} \left( 1-\frac{t}{1+\bar{T}} \right) ^2 ~\le ~ \frac{\gamma (1+\bar{T})^3}{3\left( 1+\bar{T}^2\right) }~, \end{aligned}$$

where in the calculation above we used the fact that \(\sum _{t=1}^{\bar{T}} t^2 \le \int _1^{\bar{T}+1}t^2dt < \frac{(\bar{T}+1)^3}{3}\).
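In a little more detail, the expression for \({\hat{f}}_{\bar{T}}({\hat{\mathbf {w}}}^*)\) above follows since \({\hat{w}}^*_1=\bar{T}\delta \) while the remaining \(\bar{T}\) arguments of g all equal \(\delta \):

$$\begin{aligned} {\hat{f}}_{\bar{T}}({\hat{\mathbf {w}}}^*) = \frac{(\bar{T}\delta )^3}{3}+\bar{T}\cdot \frac{\delta ^3}{3}-\gamma \bar{T}\delta = \frac{\bar{T}\delta }{3}\left( \bar{T}^2+1\right) \delta ^2-\gamma \bar{T}\delta = \frac{\gamma \bar{T}\delta }{3}-\gamma \bar{T}\delta = -\frac{2\gamma \bar{T}\delta }{3} = -\frac{2\gamma ^{3/2}\bar{T}}{3\sqrt{1+\bar{T}^2}}, \end{aligned}$$

using \(\delta ^2(1+\bar{T}^2)=\gamma \) and \(\delta =\sqrt{\gamma /(1+\bar{T}^2)}\).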

\({\underline{\mathbf{Case 2: } \frac{\Delta }{{\bar{T}}}<\delta \le \Delta }}\). In this case, by Lemma 10, \({\hat{w}}^*_{\bar{T}} \le \Delta \) but \( {\hat{w}}^*_1 > \Delta \). Therefore, in the definition of \({\hat{f}}_{\bar{T}}({\hat{\mathbf {w}}}^*)\), \(g({\hat{w}}^*_1)\) lies in the quadratic region of g, whereas \(g({\hat{w}}^*_{\bar{T}})\) and \(g({\hat{w}}^*_{i}-{\hat{w}}^*_{i+1})\) for all i lie in the cubic region of g. As a result, \( g'(w^*_1) + g'(w^*_{\bar{T}}) = 2 \Delta w^*_1 -\Delta ^2 + w^{*2}_{\bar{T}} = \gamma \). Plugging in \(w^*_{\bar{T}}=\delta \) and \(w^*_1=\bar{T}\cdot \delta \), we get \( \delta ^2 + 2\Delta \delta \cdot \bar{T} -\left( \gamma + \Delta ^2 \right) = 0 \), and therefore (using the fact that \(\delta \ge 0\), see Lemma 10), \( \delta = -\Delta \bar{T} + \Delta \bar{T}\sqrt{1+\frac{\gamma + \Delta ^2 }{\Delta ^2 \bar{T}^2}} \). This, plus the assumption \(\frac{\Delta }{\bar{T}}<\delta \le \Delta \), is equivalent to \(\frac{\Delta ^2 \left( 1+\bar{T}^2\right) }{\bar{T}^2} <\gamma \le 2 \Delta ^2\bar{T}\), hence showing that we are indeed in the second regime as specified in our proposition.

Turning to calculate the relevant bounds, we have \({\hat{f}}_{\bar{T}}({\hat{\mathbf {w}}}^*) = \frac{\bar{T}}{3}\delta ^3+ \Delta \bar{T}^2 \delta ^2 -\bar{T}\left( \Delta ^2+\gamma \right) \delta + \frac{\Delta ^3}{3}\). Moreover, \( \Vert {\hat{\mathbf {w}}}^*\Vert ^2 = \delta ^2(\bar{T}+1)^2\sum _{t=1}^{\bar{T}} \left( 1-\frac{t}{1+\bar{T}} \right) ^2 \), which by definition of \(\delta \) above and the inequality \(\sqrt{1+x}\le 1+\frac{1}{2}x\) for all \(x\ge 0\), is at most \(\frac{\left( \gamma + \Delta ^2\right) ^2(\bar{T}+1)^3}{12\Delta ^2 \bar{T}^2}\).
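Here, the expression for \({\hat{f}}_{\bar{T}}({\hat{\mathbf {w}}}^*)\) is obtained in the same way as in Case 1, except that \(g({\hat{w}}^*_1)\) is now evaluated on the quadratic branch of g:

$$\begin{aligned} {\hat{f}}_{\bar{T}}({\hat{\mathbf {w}}}^*) = \left( \Delta \bar{T}^2\delta ^2-\Delta ^2\bar{T}\delta +\frac{\Delta ^3}{3}\right) +\bar{T}\cdot \frac{\delta ^3}{3}-\gamma \bar{T}\delta = \frac{\bar{T}}{3}\delta ^3+ \Delta \bar{T}^2 \delta ^2 -\bar{T}\left( \Delta ^2+\gamma \right) \delta + \frac{\Delta ^3}{3}. \end{aligned}$$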

\({\underline{\mathbf{Case 3: } \delta > \Delta }}\). In this case, by Lemma 10, we have \( {\hat{w}}^*_1>{\hat{w}}^*_{\bar{T}} = {\hat{w}}^*_i-{\hat{w}}^*_{i+1} > \Delta \), which implies that in the definition of \({\hat{f}}_{\bar{T}}({\hat{\mathbf {w}}}^*)\), these terms all lie in the quadratic region of g. Therefore, \( g'(w^*_1) + g'(w^*_{\bar{T}}) = 2 \Delta w^*_1 -\Delta ^2 + 2 \Delta w^*_{\bar{T}} -\Delta ^2 = \gamma \), and thus \( 2 \Delta (\bar{T}+1)\delta = \gamma + 2 \Delta ^2, \) or equivalently \( \delta = \frac{\gamma + 2\Delta ^2}{2\Delta (\bar{T}+1)} \). Note that this, plus our assumption \(\delta >\Delta \), is equivalent to \(\gamma > 2\Delta ^2\bar{T} \), which shows that we are indeed in the third regime as specified in our proposition. Turning to calculate \(\Vert {\hat{\mathbf {w}}}^*\Vert \) and \({\hat{f}}_{\bar{T}}({\hat{\mathbf {w}}}^*)\), we have

$$\begin{aligned} {\hat{f}}_{\bar{T}}({\hat{\mathbf {w}}}^*) = - \frac{\bar{T}\left( \gamma + 2\Delta ^2\right) ^2}{4\Delta (\bar{T}+1)} +\frac{(\bar{T}+1)\Delta ^3}{3}, \end{aligned}$$

and \( \Vert {\hat{\mathbf {w}}}^*\Vert ^2 ~=~ \frac{\left( \gamma + 2\Delta ^2\right) ^2}{4\Delta ^2} \sum _{t=1}^{\bar{T}} \left( 1-\frac{t}{1+\bar{T}} \right) ^2~\le ~ \frac{(\bar{T}+1)\left( \gamma + 2\Delta ^2\right) ^2}{12\Delta ^2} \). \(\square \)

5.2 Oracle complexity lower bound

Given the expressions for the optimal value of \({\hat{f}}_{\bar{T}}\) derived in the previous subsection, we now explain how the oracle complexity lower bound is obtained. The argument is very similar to the strongly convex case (proof of Theorem 1, Sect. 4.2). Specifically, consider the function obtained by taking \(\bar{T} = 2T\), namely

$$\begin{aligned} f_{2T}(\mathbf {w}) = \frac{\mu _2}{12}\left( g(\langle \mathbf {v}_1,\mathbf {w}\rangle )+g(\langle \mathbf {v}_{2T},\mathbf {w}\rangle )+\sum _{i=1}^{2T-1}g(\langle \mathbf {v}_i-\mathbf {v}_{i+1},\mathbf {w}\rangle )-\gamma \langle \mathbf {v}_1,\mathbf {w}\rangle \right) . \end{aligned}$$

Given an algorithm, we choose \(\mathbf {v}_1,\mathbf {v}_2,\ldots ,\mathbf {v}_{2T}\) to be orthogonal unit vectors, so that each \(\mathbf {v}_t\) is orthogonal to the first t points \(\mathbf {w}_{1},\mathbf {w}_{2},\ldots ,\mathbf {w}_{t}\) computed by the algorithm (this is possible by an argument identical to Lemma 8).

With this choice, it is easily verified that \(f_{2T}(\mathbf {w}_T)\) equals

$$\begin{aligned} \frac{\mu _2}{12}\left( g(\langle \mathbf {v}_1,\mathbf {w}_T\rangle )+g(\langle \mathbf {v}_{T},\mathbf {w}_T\rangle ) +\sum _{i=1}^{T-1}g(\langle \mathbf {v}_i-\mathbf {v}_{i+1},\mathbf {w}_T\rangle )-\gamma \langle \mathbf {v}_1,\mathbf {w}_T\rangle \right) , \end{aligned}$$

which is clearly greater than \(\min _{\mathbf {w}} f_T(\mathbf {w})\), where \(f_T\) is defined with the same \(\mathbf {v}_1,\ldots ,\mathbf {v}_T\). Therefore, we can lower bound the optimization error \(f_{2T}(\mathbf {w}_T)-\min _{\mathbf {w}}f_{2T}(\mathbf {w})\) by \(\min _{\mathbf {w}}f_{T}(\mathbf {w})-\min _{\mathbf {w}}f_{2T}(\mathbf {w})\). Moreover, by (19), this equals \( \frac{\mu _2}{12}\left( \min _{\mathbf {w}}{\hat{f}}_T(\mathbf {w})-\min _{\mathbf {w}}{\hat{f}}_{2T}(\mathbf {w})\right) \). Using Proposition 4, we can now plug in these minimal values, depending on the various parameter regimes, and get an oracle complexity lower bound. Computing these bounds and parameter regimes (while picking the free parameter \(\gamma \) appropriately) is performed in the next subsection.

5.3 Setting the \(\gamma \) Parameter

To simplify notation, we let \({\hat{f}}^*_{T}\) and \({\hat{f}}^*_{2T}\) be shorthand for \(\min _{\mathbf {w}}{\hat{f}}_T(\mathbf {w})\) and \(\min _{\mathbf {w}}{\hat{f}}_{2T}(\mathbf {w})\) respectively, with minimizers \({\hat{\mathbf {w}}}^*_{T}\) and \({\hat{\mathbf {w}}}^*_{2T}\). We will consider three regimes, depending on the relationships between \(D,\Delta ,T\).

5.3.1 Case 1: \(\frac{D^2}{48\Delta ^2 T^3}\le \frac{1}{T^2}\)

In this setting, we choose \( \gamma = \frac{D^2}{48T} \). Using this and the assumption on the parameters, we get that \(\gamma \le \Delta ^2< \frac{\Delta ^2 \left( 1+4T^2\right) }{4T^2} < \frac{\Delta ^2 \left( 1+T^2\right) }{T^2}\), and therefore, we are in the first regime for both \(f_T\) and \(f_{2T}\) as specified in Proposition 4. Plugging in the bound on \(\Vert {\hat{\mathbf {w}}}^*\Vert ^2\) in that regime, we have \( \Vert {\hat{\mathbf {w}}}_{2T}^*\Vert ^2_2 \le \frac{\gamma (1+2T)^3}{3\left( 1+4T^2\right) } \le \frac{27D^2T^3}{576T^3} \le D^2 \) as required.

Using the results from Proposition 4 for the first regime we can compute the optimization error bound

$$\begin{aligned}&{\hat{f}}_{T}^* - {\hat{f}}_{2T}^* = \frac{4 \gamma ^{3/2}T}{3\sqrt{(1+4T^2)}}-\frac{2 \gamma ^{3/2}T}{3\sqrt{(1+T^2)}} \\&= \frac{ 2\gamma ^{3/2}}{3} \left( \frac{1}{\sqrt{\left( 1 + \frac{1}{4T^2}\right) }}- \frac{1}{\sqrt{\left( 1+\frac{1}{T^2}\right) }} \right) \\&\ge \frac{2\gamma ^{3/2}}{3} \left( 1 - \frac{1}{8T^2} - \left( 1 - \frac{1}{2T^2} + \frac{3}{8T^4} \right) \right) \\&= \frac{ 2\gamma ^{3/2}}{3} \left( \frac{3}{8T^2} - \frac{3}{8T^4} \right) = \frac{ \gamma ^{3/2}\left( T^2-1 \right) }{4T^4} \ge \frac{ D^3}{1331T^{7/2}} \\ \end{aligned}$$

where in the first inequality we used the fact that \( 1 - \frac{1}{2}x \le \frac{1}{\sqrt{1+x}} \le 1 - \frac{1}{2}x + \frac{3}{8}x^2 \) for all \(x\ge 0\), and for the last inequality we assumed that \(T \ge 2\). In the case that \(T=1\), the final result still holds. Hence, the suboptimality is at least \(\frac{\mu _2D^3}{16,000 T^{7/2}}\).

5.3.2 Case 2: \(\frac{1}{T^2} < \frac{D^2}{48\Delta ^2 T^3} \le 1\)

In this setting, we choose \( \gamma = {\frac{D\Delta }{ \sqrt{12T}}} \). Using this and the assumption on the parameters, we get that \(\frac{\Delta ^2 \left( 1+T^2\right) }{T^2}<\gamma < 2 \Delta ^2T\), and therefore, we are in the second regime for both \(f_T\) and \(f_{2T}\) as specified in Proposition 4. Plugging in the bound on \(\Vert \hat{\mathbf {w}^*}\Vert ^2\) in that regime, and using the fact that \(\Delta ^2<\gamma \) by the assumption above, we have \( \Vert {\hat{\mathbf {w}}}^*_{2T}\Vert ^2 ~\le ~ \frac{\left( \gamma + \Delta ^2\right) ^2(2T+1)^3 }{48\Delta ^2T^2} ~\le ~ \frac{\gamma ^2(2T+1)^3 }{12\Delta ^2T^2} ~=~ \frac{D^2(2T+1)^3 }{144T^3}\), which is at most \(D^2\) as required.

Turning to compute the optimization error bound, and letting \(\delta _T,\delta _{2T}\) denote the quantity \(\delta \) in Proposition 4 for \({\hat{f}}_T\) and \({\hat{f}}_{2T}\) respectively, we have

$$\begin{aligned} f^*_T-f^*_{2T} ~=~ \left( 2\delta _{2T}-\delta _T \right) \left( T\left( \Delta ^2+\gamma \right) -\Delta T^2 \left( 2\delta _{2T}+\delta _T \right) \right) +\frac{1}{3}T \left( \delta _T^3-2\delta _{2T}^3 \right) . \end{aligned}$$
(20)

To continue, we use the following auxiliary lemma:

Lemma 11

\(\left( 2\delta _{2T}-\delta _T \right) \left( T\left( \Delta ^2+\gamma \right) -\Delta T^2 \left( 2\delta _{2T}+\delta _T \right) \right) \ge 0\)

Proof

First we will prove that \(T\left( \Delta ^2+\gamma \right) -\Delta T^2 \left( 2\delta _{2T}+\delta _T \right) \ge 0\): Since \(\delta _T = -\Delta T + \Delta T\sqrt{1+\frac{\gamma + \Delta ^2 }{\Delta ^2 T^2}}\) and using \(\sqrt{1+x} \le 1+ \frac{1}{2}x\) for \(x\ge 0\), we have that \(\delta _T \le \frac{\left( \gamma + \Delta ^2\right) }{2\Delta T}\) and, similarly, \(\delta _{2T} \le \frac{\left( \gamma + \Delta ^2\right) }{4\Delta T}\), so \( T\left( \Delta ^2+\gamma \right) -\Delta T^2 \left( 2\delta _{2T}+\delta _T \right) \ge T\left( \Delta ^2+\gamma \right) -\Delta T^2 \frac{\left( \gamma + \Delta ^2\right) }{\Delta T} = 0 \). To complete the proof, it remains to show that \(2\delta _{2T} -\delta _T \ge 0\). We have

$$\begin{aligned} 2\delta _{2T} -\delta _T ~=~ -4\Delta T + 4\Delta T\sqrt{1+\frac{\gamma + \Delta ^2 }{4\Delta ^2 T^2}} +\Delta T - \Delta T\sqrt{1+\frac{\gamma + \Delta ^2 }{\Delta ^2 T^2}} \end{aligned}$$

Define \(\alpha := \frac{\gamma + \Delta ^2 }{\Delta ^2 T^2} \ge 0\). Hence, we need to prove \(-4+4\sqrt{1+\frac{1}{4}\alpha }+1-\sqrt{1+\alpha } \ge 0\). Rearranging this to \(4\sqrt{1+\frac{1}{4}\alpha }\ge 3+\sqrt{1+\alpha }\) and squaring both (non-negative) sides, it is equivalent to \(6 + 3\alpha \ge 6\sqrt{1+\alpha }\), which is true since \(\sqrt{1+\alpha } \le 1+\frac{1}{2}\alpha \). \(\square \)

With this lemma, we can lower bound the optimization error in (20) by

$$\begin{aligned} \frac{1}{3}T \left( \delta _T^3-2\delta _{2T}^3 \right) . \end{aligned}$$
(21)

To continue, we note that by definition of \(\delta _T,\delta _{2T}\) and the fact that \(1+ \frac{1}{2}x -\frac{1}{8}x^2 \le \sqrt{1+x} \le 1+ \frac{1}{2}x\), we have \( \frac{\left( \gamma + \Delta ^2\right) }{2\Delta T} - \frac{\left( \gamma + \Delta ^2\right) ^2}{8\Delta ^3 T^3} \le \delta _T \le \frac{\left( \gamma + \Delta ^2\right) }{2\Delta T} \). Therefore,

$$\begin{aligned}&\delta _T-\root 3 \of {2}\delta _{2T} ~\ge ~\frac{\left( \gamma + \Delta ^2\right) }{2\Delta T} - \frac{\left( \gamma + \Delta ^2\right) ^2}{8\Delta ^3 T^3} - \frac{\root 3 \of {2}\left( \gamma +\Delta ^2\right) }{4\Delta T} \\&~~\ge ~ \frac{\left( \gamma + \Delta ^2\right) }{20\Delta T} + \frac{\left( \gamma + \Delta ^2\right) }{8\Delta T} \left( 1 - \frac{\gamma + \Delta ^2}{\Delta ^2 T^2} \right) ~\ge ~ \frac{\left( \gamma + \Delta ^2\right) }{20\Delta T}. \end{aligned}$$

Using this inequality, and the fact \((a-b)^3 \le a^3-b^3\) for \(a\ge b \ge 0\), we can lower bound (21) by \( \frac{1}{3}T \left( \delta _T-\root 3 \of {2}\delta _{2T}\right) ^3 \ge \frac{\left( \gamma +\Delta ^2\right) ^3}{60\Delta ^3 T^2} \ge \frac{D^3}{2500 T^{7/2}} \). Hence, the suboptimality is at least \(\frac{\mu _2D^3}{30000 T^{7/2}}\).

5.3.3 Case 3: \(\frac{D^2}{48\Delta ^2 T^3} > 1\)

In this setting, we choose \( \gamma = {\frac{D\Delta }{ \sqrt{3T}}}. \) Using this and the assumption on the parameters, we get that \(\gamma > 4 \Delta ^2T\), and therefore, we are in the third regime for both \(f_T\) and \(f_{2T}\) as specified in Proposition 4. Plugging in the bound on \(\Vert {\hat{\mathbf {w}}}^*_{2T}\Vert ^2\) in that regime, and using the fact that \(2\Delta ^2<\gamma \) by the assumption above, we have \( \Vert {\hat{\mathbf {w}}}^*_{2T}\Vert ^2 ~\le ~ \frac{\left( \gamma + 2\Delta ^2\right) ^2(2T+1)}{12\Delta ^2} ~\le ~ \frac{4\gamma ^2(2T+1) }{12\Delta ^2} = \frac{D^2(2T+1)}{9T} \le D^2 \). Now, by the assumption that \(T\Delta ^3<\frac{\Delta D^2}{48T^2}\) and using the fact that

\( 1 - x \le \frac{1}{1+x} \le 1 - x + x^2\) for all \(x\ge 0\), the optimization error bound is

$$\begin{aligned} {\hat{f}}_T^*-{\hat{f}}_{2T}^*&\ge \frac{2T\left( \gamma + 2\Delta ^2\right) ^2}{4\Delta (2T+1)} - \frac{T\left( \gamma + 2\Delta ^2\right) ^2}{4\Delta (T+1)} +\frac{(T+1)\Delta ^3}{3} -\frac{(2T+1)\Delta ^3}{3} \\&=\frac{\left( \gamma + 2\Delta ^2\right) ^2}{4\Delta } \left( \frac{1}{1+\frac{1}{2T}} -\frac{1}{1+\frac{1}{T}}\right) -\frac{T\Delta ^3}{3}\\&\ge \frac{D^2\Delta }{12T} \left( \frac{1}{2T} - \frac{1}{T^2}\right) -\frac{T\Delta ^3}{3} \ge \frac{D^2\Delta }{72T^2} - \frac{ D^2\Delta }{144T^2} ~=~\frac{ D^2\Delta }{144T^2}\\ \end{aligned}$$

In the last inequality we assumed that \(T \ge 3\). For \(T=1,2\) it can be easily verified that the inequality \({\hat{f}}_T^*-{\hat{f}}_{2T}^* \ge \frac{ D^2\Delta }{144T^2}\) holds. Hence, using \(\Delta = \frac{3\mu _1}{2\mu _2}\) the suboptimality is at least \(\frac{\mu _1D^2}{576T^{2}}\).

5.4 Wrapping up

Combining the three cases from the previous subsection, we obtain the following lower bound:

$$\begin{aligned} f_{2T}(\mathbf {w}_T)-\min _{\mathbf {w}}f_{2T}(\mathbf {w}) \ge {\left\{ \begin{array}{ll} \frac{\mu _2D^3}{30000 T^{7/2}} &{} \frac{D^2}{48\Delta ^2 T^3} \le 1\\ \frac{\mu _1D^{2}}{576T^{2}} &{}\frac{D^2}{48\Delta ^2 T^3} > 1 \end{array}\right. }. \end{aligned}$$

Thus, we get that \( f_{2T}(\mathbf {w}_T)-\min _{\mathbf {w}}f_{2T}(\mathbf {w}) \ge \min \left\{ \frac{\mu _2D^3}{30,000 T^{7/2}}, \frac{\mu _1D^2}{576T^{2}} \right\} \). Equating these bounds to \(\epsilon \), and solving for T, the theorem follows.
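Explicitly, for \(f_{2T}(\mathbf {w}_T)-\min _{\mathbf {w}}f_{2T}(\mathbf {w})\le \epsilon \) to be possible, at least one of the two expressions in the minimum must be at most \(\epsilon \), which rearranges to

$$\begin{aligned} T ~\ge ~ \min \left\{ \left( \frac{\mu _2D^3}{30,000\, \epsilon }\right) ^{2/7},~ \left( \frac{\mu _1D^2}{576\, \epsilon }\right) ^{1/2}\right\} , \end{aligned}$$

which, up to the values of the numerical constants, is the bound stated in Theorem 2.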

6 Proof of Theorem 3

The proof of Theorem 3 follows the same outline as the proof of Theorem 2. We are again going to assume without loss of generality that \(\mathbf {w}_1=\mathbf {0}\), and we will thus require that \(\Vert \mathbf {w}^*\Vert \le D\) (see the discussion in the proof of Theorem 2). We define \(g:\mathbb {R}\mapsto \mathbb {R}\) as \( g(x) = \frac{1}{k+1}|x|^{k+1} \) and

$$\begin{aligned} f_T(\mathbf {w}) = \frac{\mu _{k}}{k!2^{\frac{k+3}{2}}}\left( g(\langle \mathbf {v}_1,\mathbf {w}\rangle )+g(\langle \mathbf {v}_T,\mathbf {w}\rangle )+\sum _{i=1}^{T-1}g(\langle \mathbf {v}_i-\mathbf {v}_{i+1},\mathbf {w}\rangle )-\gamma \langle \mathbf {v}_1,\mathbf {w}\rangle \right) . \end{aligned}$$

Lemma 12

\(f_T(\mathbf {w})\) is k-times differentiable, with \(\mu _k\)-Lipschitz \(k\text {-}th\) order derivative tensor.

Proof

Similarly to Lemma 9, we can assume without loss of generality that the vectors \(\mathbf {v}_1,\mathbf {v}_2,\ldots ,\mathbf {v}_T\) correspond to the standard basis vectors \(\mathbf {e}_1,\mathbf {e}_2,\ldots ,\mathbf {e}_T\), so we can examine the Lipschitz properties of

$$\begin{aligned} {\hat{f}}(\mathbf {w}) = \frac{\mu _{k}}{k!2^{\frac{k+3}{2}}}\left( g(w_1)+g(w_T)+\sum _{i=1}^{T-1}g(w_i-w_{i+1})-\gamma w_1\right) . \end{aligned}$$

Let \(\mathbf {r}_0=\mathbf {e}_1\), \(\mathbf {r}_T=\mathbf {e}_T\), and \(\mathbf {r}_i=\mathbf {e}_i-\mathbf {e}_{i+1}\) for all \(1\le i\le T-1\). Differentiating k times, we have that

$$\begin{aligned} \nabla ^{(k)}{\hat{f}}(\mathbf {w})&= \frac{\mu _{k}}{k!2^{\frac{k+3}{2}}}\left( g^{(k)}(w_1)\mathbf {r}_0^{\otimes k} + g^{(k)}(w_T)\mathbf {r}_T^{\otimes k} + \sum _{i=1}^{T-1} g^{(k)}(\langle \mathbf {r}_i,\mathbf {w}\rangle )\mathbf {r}_i^{\otimes k}\right) \\&= \frac{\mu _{k}}{k!2^{\frac{k+3}{2}}}\left( \sum _{i=0}^{T} g^{(k)}(\langle \mathbf {r}_i,\mathbf {w}\rangle )\mathbf {r}_i^{\otimes k}\right) , \\ \end{aligned}$$

where \(\mathbf {v}^{\otimes p} = \underbrace{\mathbf {v}\otimes \mathbf {v}\otimes \cdots \otimes \mathbf {v}}_{p\text { times}}\). Since \(g^{(k)}(x) = k!\,x\) for odd \(k\) (for even \(k\) we have \(g^{(k)}(x)=k!\,|x|\) instead, and the computation below goes through verbatim with each coefficient \(\langle \mathbf {r}_i,\mathbf {w}-{\tilde{\mathbf {w}}}\rangle \) replaced by \(|\langle \mathbf {r}_i,\mathbf {w}\rangle |-|\langle \mathbf {r}_i,{\tilde{\mathbf {w}}}\rangle |\), which is bounded in absolute value by \(|\langle \mathbf {r}_i,\mathbf {w}-{\tilde{\mathbf {w}}}\rangle |\)), we get that \(\Vert \nabla ^{(k)}{\hat{f}}(\mathbf {w}) - \nabla ^{(k)}{\hat{f}}({\tilde{\mathbf {w}}})\Vert \) equals

$$\begin{aligned} \frac{\mu _{k}}{k!2^{\frac{k+3}{2}}} \left\| \sum _{i=0}^{T} \left( g^{(k)}(\langle \mathbf {r}_i,\mathbf {w}\rangle )-g^{(k)}(\langle \mathbf {r}_i,{\tilde{\mathbf {w}}}\rangle )\right) \mathbf {r}_i^{\otimes k}\right\| ~=~ \frac{\mu _{k}}{2^{\frac{k+3}{2}}} \left\| \sum _{i=0}^{T} \langle \mathbf {r}_i,\mathbf {w}-{\tilde{\mathbf {w}}}\rangle \mathbf {r}_i^{\otimes k}\right\| . \end{aligned}$$
(22)

Note that for a k-th order symmetric tensor \(T\), the operator norm equals \( \Vert T\Vert = \max _{\Vert \mathbf {x}\Vert =1} \left| \sum _{i_1,i_2, \ldots ,i_k}T_{i_1,i_2, \ldots ,i_k}x_{i_1}x_{i_2} \ldots x_{i_k}\right| \) (see for example [11]). So,

$$\begin{aligned}&\left\| \sum _{i=0}^{T} \langle \mathbf {r}_i,\mathbf {w}-{\tilde{\mathbf {w}}}\rangle \mathbf {r}_i^{\otimes k}\right\| = \max _{\Vert \mathbf {x}\Vert =1} \left| \sum _{i_1,i_2, \ldots ,i_k}\sum _{i=0}^{T}\langle \mathbf {r}_i,\mathbf {w}-{\tilde{\mathbf {w}}}\rangle r_{i,i_1}r_{i,i_2} \ldots r_{i,i_k}x_{i_1}x_{i_2} \ldots x_{i_k}\right| \\&= \max _{\Vert \mathbf {x}\Vert =1}\left| \sum _{i=0}^{T}\langle \mathbf {r}_i,\mathbf {w}-{\tilde{\mathbf {w}}}\rangle \sum _{i_1}r_{i,i_1}x_{i_1} \ldots \sum _{i_k}r_{i,i_k}x_{i_k} \right| = \max _{\Vert \mathbf {x}\Vert =1}\left| \sum _{i=0}^{T}\langle \mathbf {r}_i,\mathbf {w}-{\tilde{\mathbf {w}}}\rangle \langle \mathbf {r}_i,\mathbf {x}\rangle ^k \right| \\&\le 2^{\frac{k}{2}}\max _{\Vert \mathbf {x}\Vert =1}\sum _{i=0}^{T}\left| \langle \mathbf {r}_i,\mathbf {w}-{\tilde{\mathbf {w}}}\rangle \right| \left| \left\langle \frac{\mathbf {r}_i}{\sqrt{2}},\mathbf {x}\right\rangle \right| ^k \le 2^{\frac{k+1}{2}}\Vert \mathbf {w}-{\tilde{\mathbf {w}}}\Vert \max _{\Vert \mathbf {x}\Vert =1}\sum _{i=0}^{T} \left\langle \frac{\mathbf {r}_i}{\sqrt{2}},\mathbf {x}\right\rangle ^2 \\&= 2^{\frac{k-1}{2}} \Vert \mathbf {w}-{\tilde{\mathbf {w}}}\Vert \max _{\Vert \mathbf {x}\Vert =1} x_1^2 + x_T^2 +\sum _{i=1}^{T-1} \left( x_i - x_{i+1}\right) ^2 \\&\le 2^{\frac{k-1}{2}} \Vert \mathbf {w}-{\tilde{\mathbf {w}}}\Vert \max _{\Vert \mathbf {x}\Vert =1} x_1^2 + x_T^2 +2\sum _{i=1}^{T-1} x_i^2 + 2\sum _{i=2}^{T} x_i^2 \le 2^{\frac{k+3}{2}} \Vert \mathbf {w}-{\tilde{\mathbf {w}}}\Vert \\ \end{aligned}$$

where in the first inequality we used \(\Vert \mathbf {r}_i\Vert \le \sqrt{2}\) for all i. Plugging this into (22) we have \( \left\| \nabla ^{(k)}{\hat{f}}(\mathbf {w}) - \nabla ^{(k)}{\hat{f}}({\tilde{\mathbf {w}}})\right\| \le \mu _{k} \Vert \mathbf {w}-{\tilde{\mathbf {w}}}\Vert \) as required. \(\square \)

6.1 Minimizer of \(f_T\)

In order to derive the complexity bound, we will first analyze \({\hat{f}}_T\), which is a simplified version of \(f_T\), defined as in Sect. 5.1. It is easily verified that \(\min _{\mathbf {w}}f_T(\mathbf {w})=\frac{\mu _{k}}{k!2^{\frac{k+3}{2}}}\cdot \min _{{\hat{\mathbf {w}}}}{\hat{f}}_T({\hat{\mathbf {w}}})\), and moreover, if \({\hat{\mathbf {w}}}^*\in \mathbb {R}^T\) is a minimizer of \({\hat{f}}_T\), then \(\mathbf {w}^*=\sum _{j=1}^{T}{\hat{w}}^*_j\cdot \mathbf {v}_j\in \mathbb {R}^d\) is a minimizer of \(f_T\), with the same Euclidean norm as \({\hat{\mathbf {w}}}^*\).

Using a proof identical to that of Lemma 10, we get that \({\hat{f}}_T\) has a unique minimizer \({\hat{\mathbf {w}}}^* \in \mathbb {R}^T\), which satisfies \( {\hat{w}}^*_t = \delta \cdot (T+1)\cdot \left( 1-\frac{t}{T+1}\right) \) for some \(\delta > 0 \) and all \(t=1,2,\ldots ,T\), and \( g'(w^*_1) + g'(w^*_T) = (\delta T)^{k} + \delta ^{k}= \delta ^{k}(T^{k} + 1) = \gamma \). Hence, \(\delta = \left( \frac{\gamma }{T^{k} + 1} \right) ^\frac{1}{k}\). Plugging this in and performing some algebraic manipulations, we get

$$\begin{aligned}&{\hat{f}}_T({\hat{\mathbf {w}}}^*) = \frac{1}{k+1}(\delta T)^{k+1} + \frac{T}{k+1}\delta ^{k+1} -\gamma \delta T \nonumber \\&= \frac{T}{k+1}\left( T^{k}+1 \right) \left( \frac{\gamma }{T^{k} + 1} \right) ^\frac{k+1}{k} - \gamma T \left( \frac{\gamma }{T^{k} + 1} \right) ^\frac{1}{k} = - \frac{kT\gamma ^\frac{k+1}{k}}{(k+1)\left( T^{k} + 1\right) ^\frac{1}{k}} \end{aligned}$$
(23)
$$\begin{aligned} \Vert {\hat{\mathbf {w}}}^*\Vert ^2_2 = \sum _{t=1}^T{\hat{w}}^{*2}_t =\delta ^2 \left( 1+T\right) ^2\sum _{t=1}^T \left( 1-\frac{t}{1+T} \right) ^2 \le \frac{1}{3}\left( \frac{\gamma }{T^{k} + 1} \right) ^\frac{2}{k} \left( 1+T\right) ^3, \end{aligned}$$
(24)

where we used \(\sum _{t=1}^T (1-\frac{t}{1+T})^2 \le \frac{1}{3}(1+T)\) as in Proposition 4.

6.2 Oracle complexity lower bound

The derivation of the lower complexity bound will be exactly the same as in Sect. 5.2. In that subsection, we showed that we can lower bound the optimization error \(f_{2T}(\mathbf {w}_T)-\min _{\mathbf {w}}f_{2T}(\mathbf {w})\) by \(\min _{\mathbf {w}}f_{T}(\mathbf {w})-\min _{\mathbf {w}}f_{2T}(\mathbf {w})\). Using the fact that \( f_T(\mathbf {w}^*)=\frac{\mu _{k}}{k!2^{\frac{k+3}{2}}}\cdot {\hat{f}}_T({\hat{\mathbf {w}}}^*)\), this equals \( \frac{\mu _{k}}{k!2^{\frac{k+3}{2}}}(\min _{\mathbf {w}}{\hat{f}}_T(\mathbf {w})-\min _{\mathbf {w}}{\hat{f}}_{2T} (\mathbf {w})) \). Letting \(f_T^*\) and \({\hat{f}}_T^*\) be the minimal values of \(f_T\) and \({\hat{f}}_T\) respectively, and using equation (23),

$$\begin{aligned} {\hat{f}}_{T}^* - {\hat{f}}_{2T}^*&= \frac{k\gamma ^\frac{k+1}{k}}{(k+1)\left( 1 + \frac{1}{(2T)^{k}}\right) ^\frac{1}{k}}- \frac{k\gamma ^\frac{k+1}{k}}{(k+1)\left( 1 + \frac{1}{T^{k}}\right) ^\frac{1}{k}} \\&\ge \frac{k\gamma ^\frac{k+1}{k}}{k+1}\left( 1 - \frac{1}{k(2T)^{k}} - \left( 1 - \frac{1}{kT^{k}} + \frac{k+1}{2k^2 T^{2k}} \right) \right) \\&\ge \frac{\gamma ^\frac{k+1}{k}}{(k+1)kT^{k}}\left( 1 - \frac{1}{2^{k}} - \frac{k+1}{2kT^{k}}\right) \ge \frac{\gamma ^\frac{k+1}{k}}{6(k+1)kT^{k}} \\ \end{aligned}$$

The last inequality holds for \(k=1, T \ge 3\), \(k = 2, T\ge 2\) or \( k \ge 3, T \ge 1\). It can be verified that for the other cases, the inequality \({\hat{f}}_{T}^* - {\hat{f}}_{2T}^*\ge \frac{\gamma ^\frac{k+1}{k}}{6(k+1)kT^{k}}\) holds.

Since we want \(f_{T}^* - f_{2T}^*\) to be as large as possible, we will set \(\gamma \) to be as large as possible, under the constraint that \(\Vert \mathbf {w}^*_{2T}\Vert \le D\). By (24) we can choose \( \gamma = \frac{3^{\frac{k}{2}}D^{k}\left( 1+ (2T)^{k} \right) }{(1+2T)^{\frac{3k}{2}}}, \) so

$$\begin{aligned} {\hat{f}}_{T}^* - {\hat{f}}_{2T}^* ~\ge ~ \frac{3^{\frac{k+1}{2}}D^{k+1}\left( 1+ (2T)^{k} \right) ^{\frac{k+1}{k}}}{6(k+1)kT^{k}(1+2T)^{\frac{3(k+1)}{2}}} ~\ge ~ \frac{2^{k+1} D^{k+1}}{6\cdot 3^{k+1}(k+1)kT^{\frac{3k+1}{2}}}. \end{aligned}$$

Thus, according to the discussion in Sect. 6.2, the final bound is \( f_{T}^* - f_{2T}^* \ge \frac{\mu _{k}\sqrt{2}^{k+1} D^{k+1}}{12\cdot 3^{k+1}(k+1)!kT^{\frac{3k+1}{2}}} \), and the number of iterations \(T_\epsilon \) required to have \(\min _{\mathbf {w}}f_{T}(\mathbf {w})-\min _{\mathbf {w}}f_{2T}(\mathbf {w}) < \epsilon \) must satisfy

$$\begin{aligned} T_\epsilon \ge \left( \frac{\mu _{k}\sqrt{2}^{k+1} D^{k+1}}{12\cdot 3^{k+1}(k+1)!k\epsilon }\right) ^\frac{2}{3k+1} \ge c\left( \frac{\mu _{k} D^{k+1}}{(k+1)!k\epsilon }\right) ^\frac{2}{3k+1} \end{aligned}$$

for an appropriate numerical constant c.
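For instance (one concrete choice of the constant, not made explicit above): since \(\frac{2}{3k+1}\le \frac{1}{2}\) and \(\frac{2(k+1)}{3k+1}\le 1\) for all \(k\ge 1\), we have

$$\begin{aligned} \left( \frac{\sqrt{2}^{\,k+1}}{12\cdot 3^{k+1}}\right) ^{\frac{2}{3k+1}} ~=~ 12^{-\frac{2}{3k+1}}\cdot \left( \frac{\sqrt{2}}{3}\right) ^{\frac{2(k+1)}{3k+1}} ~\ge ~ \frac{1}{\sqrt{12}}\cdot \frac{\sqrt{2}}{3} ~>~ \frac{1}{8}, \end{aligned}$$

so the second inequality above holds with, for example, \(c=1/8\).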