Abstract
An Adagrad-inspired class of algorithms for smooth unconstrained optimization is presented in which the objective function is never evaluated and yet the gradient norms decrease at least as fast as \(\mathcal{O}(1/\sqrt{k+1})\) while second-order optimality measures converge to zero at least as fast as \(\mathcal{O}(1/(k+1)^{1/3})\). This latter rate of convergence is shown to be essentially sharp and is identical to that known for more standard algorithms (like trust-region or adaptive-regularization methods) using both function and derivatives’ evaluations. A related “divergent stepsize” method is also described, whose essentially sharp rate of convergence is slightly inferior. It is finally discussed how to obtain weaker second-order optimality guarantees at a (much) reduced computational cost.
1 Introduction
This paper considers an a priori unexpected but fundamental and challenging question: is evaluating the value of the objective function necessary for obtaining (complexity-wise) efficient minimization algorithms which find second-order approximate minimizers? This question arose as a natural consequence of the somewhat surprising results of [14], where it was shown that OFFO (i.e. Objective-Function Free Optimization) algorithms exist which converge to first-order points at a global rate whose order is identical to that of well-known methods using both gradient and objective function evaluations. That these algorithms include the deterministic version of Adagrad [10], a very popular method for deep learning applications, was an added bonus and a good motivation.
We show here that, from the point of view of evaluation complexity alone, evaluating the value of the objective function during optimization is also unnecessary for finding approximate second-order minimizers at a (worst-case) cost entirely comparable to that incurred by familiar and reliable techniques such as second-order trust-region or adaptive regularization methods. This conclusion is coherent with that of [14] for first-order points and is obtained by exhibiting an OFFO algorithm whose global rate of convergence is proved to be \(\mathcal{O}(1/\sqrt{k+1})\) for the gradients’ norm and \(\mathcal{O}(1/(k+1)^{1/3})\) for second-order measures. The new ASTR2 algorithm is of the adaptively scaled trust-region type, as those studied in [14]. The key difference is that it now hinges on a scaling technique which depends on second-order information, when relevant.
A further motivation for our analysis is the folklore observation that algorithms which use function values (often in linesearches or other acceptance tests for a new iterate) are significantly less robust than OFFO counterparts, essentially because the accuracy necessary for the former methods to work well is significantly higher than that requested on derivatives’ values. Thus OFFO algorithms, like the one discussed in this paper, merit, in our view, a sound theoretical consideration.
The paper is organized as follows. Section 2 presents the new ASTR2 class of algorithms and discusses some of its scaling-independent properties. The complexity analysis of a first, Adagrad-like, subclass of ASTR2 is then presented in Sect. 3. Another subclass of interest is also considered and analyzed in Sect. 4. Sect. 5 discusses how weaker optimality conditions may be guaranteed by the ASTR2 algorithms at significantly reduced computational cost. Conclusions and perspectives are finally presented in Sect. 6.
2 The ASTR2 class of minimization methods
2.1 Approximate first- and second-order optimality
We consider the nonlinear unconstrained optimization problem
where f is a function from \(\mathbb{R}^n\) to \(\mathbb{R}\). More precisely, we assume that
-
AS.1:
the objective function f(x) is twice continuously differentiable;
-
AS.2:
its gradient \(g(x){\mathop {=}\limits ^\textrm{def}}\nabla _x^1f(x)\) and Hessian \(H(x) {\mathop {=}\limits ^\textrm{def}}\nabla _x^2f(x)\) are Lipschitz continuous with Lipschitz constant \(L_1\) and \(L_2\), respectively, that is
$$\begin{aligned} \Vert g(x)-g(y)\Vert \le L_1 \Vert x-y\Vert \;\; \text{ and } \;\; \Vert H(x)-H(y)\Vert \le L_2 \Vert x-y\Vert \end{aligned}$$for all \(x,y\in \mathbb{R}^n\);
-
AS.3:
there exists a constant \(f_{\textrm{low}}\) such that \(f(x)\ge f_{\textrm{low}}\) for all \(x\in \mathbb{R}^n\).
As our purpose is to find approximate first- and second-order minimizers, we need to clarify these concepts. In this paper we choose to follow the “strong \(\phi\)” concept of optimality discussed in [4, 6] or [5, Chapters 12–14]. It is based on the quantity
where \(T_{f,2}(x,d)\) is the second-order Taylor expansion of f at x, that is
Observe that \(\phi _{f,j}^\delta (x)\) is interpreted as the maximum decrease of the local j-th order Taylor model of the objective function f at x, within a ball of radius \(\delta\). Importantly for our present purposes, the evaluation of \(\phi _{f,2}^\delta (x)\) does not require the evaluation of f(x), as it can be rewritten as
Moreover, computing \(\phi _{f,2}^\delta (x)\) is a standard trust-region step calculation, for which many efficient methods exist (see [7, Chapter 7], for instance).
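Since \(f(x)\) cancels in the difference, one has \(\phi _{f,2}^\delta (x) = \max _{\Vert d\Vert \le \delta }-\big (\langle g(x),d \rangle + {\scriptstyle \frac{1}{2}}\langle d,H(x)d \rangle \big )\), so any trust-region subproblem solver yields the measure. As an illustration only (the paper prescribes no implementation), the following NumPy sketch computes it for small dense problems via an eigendecomposition-based search for the boundary multiplier, with a rudimentary safeguard for the hard case:

```python
import numpy as np

def phi2(g, H, delta, tol=1e-10):
    """phi_{f,2}^delta(x): maximal decrease of the quadratic model
    q(d) = <g,d> + 0.5*<d,H d> over the ball ||d|| <= delta."""
    w, V = np.linalg.eigh(H)              # H = V diag(w) V^T, w ascending
    gt = V.T @ g                          # gradient in the eigenbasis

    def step(lam):                        # d(lam) = -(H + lam*I)^{-1} g
        return -gt / (w + lam)

    if w[0] > 0 and np.linalg.norm(step(0.0)) <= delta:
        d = V @ step(0.0)                 # interior (Newton) solution
    else:
        lo = max(0.0, -w[0]) + 1e-14      # smallest admissible multiplier
        if np.linalg.norm(step(lo)) < delta:
            # rudimentary "hard case": pad with a minimal eigenvector
            d0 = step(lo)
            d = V @ d0 + np.sqrt(delta**2 - d0 @ d0) * V[:, 0]
        else:
            hi = lo + 1.0                 # bracket the multiplier giving ||d|| = delta
            while np.linalg.norm(step(hi)) > delta:
                hi *= 2.0
            while hi - lo > tol * max(1.0, hi):
                mid = 0.5 * (lo + hi)
                lo, hi = (mid, hi) if np.linalg.norm(step(mid)) > delta else (lo, mid)
            d = V @ step(hi)
    return -(g @ d + 0.5 * (d @ H @ d))
```

For instance, with \(H=I\), \(g=(1,0)^T\) and \(\delta =2\), the interior Newton point \(d=-g\) gives the value 0.5.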
The next result recalls the link between the \(\phi\) optimality measure and the more standard ones.
Lemma 2.1
[5, Theorems 12.1.4 and 12.1.6] Suppose that f is twice continuously differentiable. Then
-
(i)
for any \(\delta >0\) and any \(x\in \mathbb{R}^n\), we have that
$$\begin{aligned} \Vert g(x)\Vert =\frac{\phi _{f,1}^{\delta }(x)}{\delta }, \end{aligned}$$(2.4)and so \(\phi _{f,1}^{\delta }(x) = 0\) if and only if \(g(x)=0\);
-
(ii)
if
$$\begin{aligned} \phi _{f,2}^{\delta }(x) = 0 \;\; \text{ for } \text{ some } \delta >0, \;\; \text{ then } \;\; g(x)=0 \;\; \text{ and } \;\; \lambda _{\min }[H(x)] \ge 0, \end{aligned}$$and so any such x is a first- and second-order minimizer;
-
(iii)
if \(\phi _{f,1}^{\delta _1}(x) \le \epsilon _1 \, \delta _1\) (and so (2.5) holds with \(j=1\)), then \(\Vert g(x)\Vert \le \epsilon _1\);
-
(iv)
if \(\phi _{f,2}^\delta (x) \le {\scriptstyle \frac{1}{2}}\epsilon _2\delta ^2\), then \(\lambda _{\min }[H(x)] \ge -\epsilon _2\) (and so (2.5) holds for \(j=2\)) and \(\Vert g(x)\Vert \le \delta \kappa (x)\sqrt{\epsilon _2}\), where \(\kappa (x)\) depends on (the eigenvalues of) H(x).
Note also that computing \(\phi _{f,1}^\delta (x)\) simply results from (2.4) and that, in particular, \(\phi _{f,1}^1(x)=\Vert g(x)\Vert\). Computing \(\phi _{f,2}^\delta (x)\) is a standard Euclidean trust-region step calculation (see [7, Chapter 7], for instance).
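As a quick sanity check of (2.4) (an illustration of ours, not code from the paper): the maximal decrease of the first-order model over the ball of radius \(\delta\) is attained at \(d=-\delta g/\Vert g\Vert\), so \(\phi _{f,1}^\delta (x)=\delta \Vert g(x)\Vert\), and no sampled direction in the ball can do better:

```python
import numpy as np

rng = np.random.default_rng(1)
g = rng.standard_normal(5)            # a generic nonzero gradient
delta = 0.7

# phi_{f,1}^delta = max_{||d|| <= delta} -<g,d>, attained at d = -delta*g/||g||
d_star = -delta * g / np.linalg.norm(g)
phi1 = -(g @ d_star)                  # equals delta*||g||, as in (2.4)

# sampled directions on the boundary of the ball never beat d_star
samples = rng.standard_normal((1000, 5))
samples *= delta / np.linalg.norm(samples, axis=1, keepdims=True)
best_sampled = np.max(-samples @ g)
```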
For \(j\in \{1,2\}\), we then say that an iterate \(x_k\) is an \(\epsilon\)-approximate minimizer if
where \(\epsilon = (\epsilon _1, \ldots , \epsilon _j)\). There are two ways to express how fast an algorithm tends to such points in the worst case. The first (the “\(\epsilon\)-orders”) is to assume \(\epsilon\) is given and then give a bound on the maximum number of iterations and evaluations that are needed to satisfy (2.5). In this paper we focus on the second (the “k-orders”), where one instead gives an upper bound on \(\phi _{f,j}^\delta (x_k)\) as a function of k (for specified j and \(\delta\)).
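For illustration (with \(\kappa\) a generic constant, our notation), a k-order bound converts into an \(\epsilon\)-order by a direct computation: if
$$\begin{aligned} \min _{0\le j\le k}\phi _{f,j}^\delta (x_j) \le \frac{\kappa }{(k+1)^{1/3}}, \end{aligned}$$
then the right-hand side falls below a prescribed accuracy \(\epsilon\) as soon as \(k+1 \ge (\kappa /\epsilon )^3\), so that at most \(\mathcal{O}(\epsilon ^{-3})\) iterations are needed.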
2.2 The ASTR2 class
After these preliminaries, we now introduce the new ASTR2 class of algorithms. Methods in this class are of “adaptively scaled trust-region” type, a term we now briefly explain. Classical trust-region algorithms (see [7] for an in-depth coverage or [21] for a more recent survey) are iterative. At each iteration, they define a local model of the objective function which is deemed trustable within the “trust region”, a ball of given radius centered at the current iterate. A step and corresponding trial point are then computed by (possibly approximately) minimizing this model in the trust region. The objective function value is then computed at the trial point, and this point is accepted as the new iterate if the ratio of the achieved reduction in the objective function to that predicted by the model is sufficiently large. The radius of the trust region is then updated using the value of this ratio. As is clear from this description, these methods are intrinsically dependent on the evaluation of the objective function, and therefore not suited to our Objective-Function Free Optimization (OFFO) context. Here we follow [14] in interpreting the mechanism designed for the Adagrad methods [10] as an alternative trust-region design not using function evaluations. In this interpretation, the trial point is always accepted and the trust-region radius is determined by the gradient sizes, in a manner reminiscent also of [11]. In this approach, one uses scaling factors to determine the radius (hence the name of Adaptively Scaled Trust Region) at each iteration. Given these factors, which we will denote, at iteration k, by \(w_k^L\) and \(w_k^Q\), we may then state the ASTR2 class of algorithms as Algorithm 2.1 below. This algorithm involves requirements on the step which are standard (and practical) for trust-region methods.
A few additional comments on this algorithm are now in order.
-
1.
The algorithms in the ASTR2 class belong to the OFFO framework: the objective function is never evaluated (remember that \(\phi _{f,j}^1(x)\) can be computed without any such evaluation, the same being obviously true for \(\Delta q_k\), \(\Delta q_k^C\) and \(\Delta q_k^E\)).
-
2.
Given our focus on k-orders of convergence, the algorithm does not include a termination criterion. It is however easy, should one be interested in \(\epsilon\)-orders instead, to test (2.5) for \(\delta =1\) and the considered \(\epsilon _1\) and \(\epsilon _2\) at the end of Step 1, and then terminate if this condition holds.
-
3.
Despite their somewhat daunting statements, conditions (2.9)–(2.12) are relatively mild and have been extensively used for standard trust-region algorithms, both in theory and practice. Condition (2.10) defines the so-called “Cauchy decrease”, which is the decrease achievable on the quadratic model \(T_{f,2}(x_k,s)\) in the steepest descent direction [7, Section 6.3.2]. Conditions (2.11) and (2.12) define the “eigen-point decrease”, which is that achievable along \(u_k\), a (\(\chi\)-approximate) eigenvector associated with the smallest Hessian eigenvalue [7, Section 6.6] when this eigenvalue is negative. We set \(\Delta q_k^E= 0\) for consistency when \(H_k\) is positive semi-definite. We discuss in Sect. 5 how they can be ensured in practice, possibly approximately, for instance by the GLTR algorithm [13].
-
4.
The computation of \(\phi _k\) can be reused to compute \(s^Q_k\), should it be necessary. If \(\Delta ^Q_k>1\), the model minimization may be pursued beyond the boundary of the unit ball. If \(\Delta ^Q_k<1\), backtracking is also possible [7, Section 10.3.2].
Algorithm 2.1
ASTR2
- Step 0:
-
Initialization. A starting point \(x_0\) is given. The constants \(\tau ,\chi \in (0,1]\) and \(\xi \ge 1\) are also given. Set \(k=0\).
- Step 1:
-
Compute derivatives. Compute \(g_k = g(x_k)\) and \(H_k = H(x_k)\), as well as \(\phi _k {\mathop {=}\limits ^\textrm{def}}\phi _{f,2}^1(x_k)\) and \(\widehat{\phi }_k {\mathop {=}\limits ^\textrm{def}}\min [\phi _k,\xi ]\).
- Step 2:
-
Define the trust-region radii. Set
$$\begin{aligned} \Delta ^L_k = \frac{\Vert g_k\Vert }{w^L_k} \;\; \text{ and } \;\; \Delta ^Q_k = \frac{\widehat{\phi }_k}{w^Q_k} \end{aligned}$$(2.6)where \(w^L_k=w^L(x_0,\ldots ,x_k)\) and \(w^Q_k=w^Q(x_0,\ldots ,x_k)\).
- Step 3:
-
Step computation. If
$$\begin{aligned} \Vert g_k\Vert ^2 \ge \widehat{\phi }_k^3, \end{aligned}$$(2.7)then set
$$\begin{aligned} s_k = s^L_k = -\frac{g_k}{w^L_k}. \end{aligned}$$(2.8)Otherwise, set \(s_k= s^Q_k\), where \(s^Q_k\) is such that
$$\begin{aligned} \Vert s^Q_k\Vert \le \Delta ^Q_k \;\; \text{ and } \;\; \Delta q_k = f(x_k)-T_{f,2}(x_k,s_k^Q ) \ge \tau \max \big [\Delta q_k^C,\Delta q_k^E\big ] \end{aligned}$$(2.9)where
$$\begin{aligned} \Delta q_k^C = \max _{_{{\mathop {\scriptstyle \alpha \Vert g_k\Vert \le \Delta _k^Q}\limits ^{\scriptstyle \alpha \ge 0}}}} \Big [f(x_k)-T_{f,2}(x_k,-\alpha g_k )\Big ] \end{aligned}$$(2.10)and \(\Delta q_k^E= 0\) if \(\lambda _{\min }[H_k]\ge 0\), or
$$\begin{aligned} \Delta q_k^E = \max _{_{{\mathop {\scriptstyle \alpha \le \Delta _k^Q}\limits ^{\scriptstyle \alpha \ge 0}}}} \Big [f(x_k)-T_{f,2}(x_k, \alpha u_k )\Big ] \end{aligned}$$(2.11)with \(u_k\) satisfying
$$\begin{aligned} u_k^TH_ku_k \le \chi \, \lambda _{\min }[H_k], \;\;\;\;u_k^Tg_k \le 0 \;\; \text{ and } \;\; \Vert u_k\Vert =1, \end{aligned}$$(2.12)if \(\lambda _{\min }[H_k] < 0\).
- Step 4:
-
New iterate. Define
$$\begin{aligned} x_{k+1} = x_k + s_k, \end{aligned}$$(2.13)increment k by one and return to Step 1.
-
5.
Note that two scaling factors are updated from iteration to iteration: one for first-order models and one for second-order ones. It does indeed make sense to trust these two types of models in regions of different sizes, as Taylor’s theory suggests second-order models may be reasonably accurate in larger neighbourhoods.
-
6.
A “componentwise” version where the trust region is defined in the \(\Vert \cdot \Vert _\infty\) norm is possible with
$$\begin{aligned} \phi _{i,k} = \max \left[ \phi _k, - \min _{|\alpha | \le 1}\left( \alpha g_{i,k} +{\scriptstyle \frac{1}{2}}\alpha ^2 [H_k]_{i,i}\right) \right] \end{aligned}$$and
$$\begin{aligned} \Delta ^L_{i,k} = \frac{|g_{i,k}|}{w^L_{i,k}} \;\; \text{ and } \;\; \Delta ^Q_{i,k} = \frac{\min [\xi , \phi _{i,k}]}{w^Q_{i,k}}, \end{aligned}$$where \(g_{i,k}\) denotes the ith component of \(g_k\), and where \(w_{i,k}^L\), \(w_{i,k}^Q\) and \(\Delta _{i,k}\) are the ith components of \(w_k^L\), \(w_k^Q\) and \(\Delta _k\) (now vectors in \(\mathbb{R}^n\)). We will not explicitly consider this variant to keep our notations reasonably simple.
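To make the mechanism concrete, here is a minimal NumPy sketch of one instance of ASTR2 (our illustration, not the paper’s code). It assumes the Adagrad-like scaling of Sect. 3 with \(\vartheta _L=\vartheta _Q=1\), taken here as \(w^L_k=(\varsigma +\sum _{j\le k}\Vert g_j\Vert ^2)^\mu\) and \(w^Q_k=(\varsigma +\sum _{j\le k}\widehat{\phi }_j^3)^\nu\) (summing over all iterations for simplicity), and it computes the quadratic step as the exact model minimizer in the trust region, which satisfies (2.9)–(2.12) with \(\tau =\chi =1\):

```python
import numpy as np

def tr_step(g, H, delta, tol=1e-10):
    """Exact minimizer of <g,d> + 0.5*<d,H d> over ||d|| <= delta
    (eigendecomposition-based; adequate for small dense problems)."""
    w, V = np.linalg.eigh(H)
    gt = V.T @ g
    step = lambda lam: -gt / (w + lam)
    if w[0] > 0 and np.linalg.norm(step(0.0)) <= delta:
        return V @ step(0.0)                  # interior Newton point
    lo = max(0.0, -w[0]) + 1e-14
    if np.linalg.norm(step(lo)) < delta:      # rudimentary hard case
        d0 = step(lo)
        return V @ d0 + np.sqrt(delta**2 - d0 @ d0) * V[:, 0]
    hi = lo + 1.0
    while np.linalg.norm(step(hi)) > delta:   # bracket the boundary multiplier
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.linalg.norm(step(mid)) > delta else (lo, mid)
    return V @ step(hi)

def astr2(grad, hess, x0, iters=100, sigma=0.01, mu=0.5, nu=1.0 / 3.0, xi=1.0):
    """One OFFO instance of the ASTR2 iteration: no f-values are ever used."""
    x = np.asarray(x0, float)
    sum_g2, sum_phi3 = sigma, sigma           # running Adagrad-like sums
    for _ in range(iters):
        g, H = grad(x), hess(x)
        d = tr_step(g, H, 1.0)                # phi_k = phi_{f,2}^1(x_k)
        phi = -(g @ d + 0.5 * (d @ H @ d))
        phi_hat = min(phi, xi)
        sum_g2 += g @ g
        sum_phi3 += phi_hat**3
        if g @ g >= phi_hat**3:               # linear step, as in (2.7)-(2.8)
            s = -g / sum_g2**mu
        else:                                 # quadratic step in radius Delta_k^Q
            s = tr_step(g, H, phi_hat / sum_phi3**nu)
        x = x + s                             # trial point always accepted
    return x
```

On a convex quadratic the linear branch is always active and the iterates contract geometrically; on a nonconvex function the eigenvector component of the quadratic step can move the iterates away from saddle points or maximizers.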
Our assumption that the gradient and Hessian are Lipschitz continuous (AS.2) ensures the following standard result.
Lemma 2.2
[1] or [5, Theorem A.8.3] Suppose that AS.1 and AS.2 hold. Then
and
The first step in analyzing the convergence of the ASTR2 algorithm is to derive bounds on the objective function’s change from iteration to iteration, depending on which step (linear with \(s_k=s_k^L\), or quadratic with \(s_k=s_k^Q\)) is chosen. We start by a few auxiliary results on the relations between first- and second-order optimality measures.
Lemma 2.3
Suppose that H is an \(n\times n\) symmetric positive semi-definite matrix and \(g \in \mathbb{R}^n\), and consider the (convex) quadratic \(q(d) = \langle g,d \rangle + {\scriptstyle \frac{1}{2}}\langle d,Hd \rangle\). Then
Proof
From the definition of the gradient, we have that
But \(\langle g,d \rangle\) defines the supporting hyperplane of q(d) at \(d=0\) and thus the convexity of q implies that \(q(d) \ge \langle g,d \rangle\) for all d. Hence
and (2.16) follows. \(\square\)
Lemma 2.4
Suppose that
where
Then
Proof
Observe first that (2.17) implies that \(\lambda _{\min }[H_k] < 0\) and \(\eta _k = \left| \lambda _{\min }[H_k]\right|\). Let \(d_k\) be a solution of the optimization problem defining \(\phi _k\), i.e.,
so that \(\phi _k = f_k-T_{f,2}(x_k,d_k)\). Since \(\lambda _{\min }[H_k] < 0\), it is known from trust-region theory [7, Corollary 7.2.2] that \(d_k\) may be chosen such that \(\Vert d_k\Vert = 1\). Now define
and note that \(q_0(d)\) is convex by construction. Then, at \(d_k\),
and (2.17) implies that \(q_0(d_k) < 0\). Moreover,
where we used the convexity of \(q_0\) to deduce the first inequality, and the Cauchy–Schwarz inequality with \(\Vert d_k\Vert \le 1\) to derive the second. This proves (2.19). \(\square\)
Using these results, we may now prove a crucial property on the objective function change. For this purpose, we partition the iterations into two sets, depending on which type of step is chosen, that is
Lemma 2.5
Suppose that AS.1 and AS.2 hold. Then
and
Proof
Suppose first that \(s_k=s^L_k\). Then (2.14), (2.8) and (2.6) ensure that
giving (2.20).
Suppose now that \(s_k= s^Q_k\), i.e. \(k\in \mathcal{K}^Q\). Then, because of (2.9)–(2.12), the decrease \(\Delta q_k\) in the quadratic model \(T_{f,2}(x_k,s)\) at \(s_k\) is at least a fraction \(\tau\) of the maximum of the Cauchy and eigen-point decreases given by \(\Delta q_k^C\) and \(\Delta q_k^E\). Standard trust-region theory (see [7, Lemmas 6.3.2 and 6.6.1] for instance) then ensures that, for possibly non-convex \(T_{f,2}(x_k,s)\),
where we used the bound \(\Vert H_k\Vert \le L_1\) and (2.6) to derive the last inequality. If \(\eta _k\le {\scriptstyle \frac{1}{2}}\phi _k\), then, using Lemma 2.4 and the inequality \(\phi _k\ge \widehat{\phi }_k\),
Now \(\widehat{\phi }_k^3 \le \xi \widehat{\phi }_k^2\) and thus
If instead \(\eta _k > {\scriptstyle \frac{1}{2}}\phi _k\ge {\scriptstyle \frac{1}{2}}\widehat{\phi }_k\), then
Given that, if \(k\in \mathcal{K}^Q\), \(\Vert s_k\Vert \le \Delta _k^Q= \widehat{\phi }_k/w_k^Q\), we deduce (2.21) from (2.15), (2.23) and (2.24). \(\square\)
Observe that neither (2.20) nor (2.21) guarantees that the objective function values are monotonically decreasing.
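The two model decreases entering (2.9)–(2.12) have simple closed forms along their respective directions. The sketch below (an illustration of ours, using an exact minimal eigenvector, i.e. \(\chi =1\)) evaluates both for a given radius \(\Delta\):

```python
import numpy as np

def cauchy_decrease(g, H, Delta):
    """Max decrease of T(s) = <g,s> + 0.5*<s,Hs> along -g with ||s|| <= Delta,
    i.e. the Cauchy decrease of (2.10)."""
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return 0.0
    c = g @ H @ g                       # curvature along steepest descent
    a_max = Delta / gnorm               # largest feasible stepsize
    a = a_max if c <= 0 else min(gnorm**2 / c, a_max)
    return a * gnorm**2 - 0.5 * a**2 * c

def eigen_decrease(g, H, Delta):
    """Max decrease along a minimal eigenvector u with u^T g <= 0, ||u|| = 1,
    i.e. the eigen-point decrease of (2.11)-(2.12) (zero if H is PSD)."""
    w, V = np.linalg.eigh(H)
    if w[0] >= 0.0:
        return 0.0                      # convention when H is positive semi-definite
    u = V[:, 0]
    if u @ g > 0:
        u = -u                          # enforce u^T g <= 0
    # the decrease is increasing in alpha, so alpha = Delta is optimal
    return -(Delta * (u @ g) + 0.5 * Delta**2 * w[0])
```

A step satisfying (2.9) must then decrease the quadratic model by at least \(\tau\) times the larger of these two quantities.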
3 An Adagrad-like algorithm for second-order optimality
We first consider a choice of scaling factors directly inspired by the Adagrad algorithm [10] and assume that, for some \(\varsigma > 0\), \(\mu ,\nu \in (0,1)\), \(\vartheta _L,\vartheta _Q\in (0,1]\) and all \(k\ge 0\),
and
Note that selecting the parameters \(\vartheta _L\) and \(\vartheta _Q\) strictly less than one allows the scaling factors \(w_k^L\) and \(w_k^Q\) to be chosen in an interval at each iteration without any monotonicity.
We now present two technical lemmas which will be necessary in our analysis. The first states useful results for a specific class of inequalities.
Lemma 3.1
Let \(a\ge {\scriptstyle \frac{1}{2}}\varsigma\) and \(b\ge {\scriptstyle \frac{1}{2}}\varsigma\). Suppose that, for some \(\theta _a\ge 1\), \(\theta _b\ge 1\), \(\theta \ge 0\), \(\mu \in (0,1)\), and \(\nu \in (0,1)\)
where A(a) and B(b) are given, as a function of \(\mu\) and \(\nu\), by
| | \(\mu <{\scriptstyle \frac{1}{2}}\) | \(\mu = {\scriptstyle \frac{1}{2}}\) | \(\mu >{\scriptstyle \frac{1}{2}}\) |
| --- | --- | --- | --- |
| A(a) | \(a^{1-2\mu }\) | \(\log (2a)\) | 0 |
and
| | \(\nu <{\scriptstyle \frac{1}{3}}\) | \(\nu = {\scriptstyle \frac{1}{3}}\) | \(\nu >{\scriptstyle \frac{1}{3}}\) |
| --- | --- | --- | --- |
| B(b) | \(b^{1-3\nu }\) | \(\log (2b)\) | 0 |
Then there exist positive constants \(\kappa _a\) and \(\kappa _b\) only depending on \(\theta _a\), \(\theta _b\), \(\theta\), \(\mu\) and \(\nu\) such that
Proof
This result is proved by comparing the values of the left- and right-hand sides for possibly large a and b. The details are given in Lemmas 7.2–7.8 in the appendix, whose results are then combined as shown in Table 1. The details of the constants \(\kappa _a\) and \(\kappa _b\) for the various cases are explicitly given in the statements of the relevant lemmas. \(\square\)
The second auxiliary result is a bound extracted from [14] (see also [9, 20] for the case \(\alpha =1\)).
Lemma 3.2
Let \(\{c_k\}\) be a non-negative sequence, \(\varsigma >0\), \(\alpha > 0\), \(\nu \ge 0\) and define, for each \(k \ge 0\), \(d_k = \sum _{j=0}^k c_j\). If \(\alpha \ne 1\), then
Otherwise,
Note that, if \(\alpha > 1\), then the bound (3.5) can be rewritten as
whose right-hand side is positive.
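Bounds of this type follow from an integral-comparison (telescoping) argument standard in Adagrad-style analyses. As a representative instance (our own derivation, for illustration; the constants in (3.4)–(3.5) may differ): for \(\alpha \in (0,1)\) one has \(\sum _{j=0}^k c_j/(\varsigma +d_j)^\alpha \le \big [(\varsigma +d_k)^{1-\alpha }-\varsigma ^{1-\alpha }\big ]/(1-\alpha )\), while for \(\alpha =1\) the right-hand side becomes \(\log ((\varsigma +d_k)/\varsigma )\), since each term is at most \(\int _{\varsigma +d_{j-1}}^{\varsigma +d_j}t^{-\alpha }\,dt\):

```python
import numpy as np

def adagrad_sum_bound(c, sigma, alpha):
    """Return (lhs, rhs) for the telescoping bound
       sum_j c_j/(sigma+d_j)^alpha <= ((sigma+d_k)^(1-a) - sigma^(1-a))/(1-a)  (a < 1)
       or                          <= log((sigma+d_k)/sigma)                   (a = 1),
    where d_j = c_0 + ... + c_j and the c_j are non-negative."""
    d = np.cumsum(c)
    lhs = np.sum(c / (sigma + d) ** alpha)
    if alpha == 1.0:
        rhs = np.log((sigma + d[-1]) / sigma)
    else:
        rhs = ((sigma + d[-1]) ** (1 - alpha) - sigma ** (1 - alpha)) / (1 - alpha)
    return lhs, rhs
```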
Armed with the above results, we are now in a position to specify particular choices of the scaling factors \(w_k\) and derive the convergence properties of the resulting variants of ASTR2.
Theorem 3.3
Suppose that AS.1–AS.3 hold and that the ASTR2 algorithm is applied to problem (2.1), where \(w^L_k\) and \(w^Q_k\) are given by (3.1) and (3.2), respectively. Then there exists a positive constant \(\kappa _{ \text{ ASTR2 }}\) only depending on the problem-related quantities \(x_0\), \(f_{\textrm{low}}\), \(L_1\) and \(L_2\) and on the algorithmic parameters \(\varsigma\), \(\tau\), \(\xi\), \(\mu\) and \(\nu\) such that
and therefore that
Proof
To simplify notations in the proof, define
Consider first an iteration index \(j\in \mathcal{K}^L\). Then (2.20) (expressed for \(j\ge 0\)), (3.1) and the inequality \(\tau \le 1\) give that
Suppose now that \(j\in \mathcal{K}^Q\). Then (2.21) and (3.2) imply that
Suppose now that
which implies that
Then combining (3.10) and (3.11), the inequality \(\xi \ge 1\) and AS.3, we deduce that, for all \(k\ge 0\),
But, by definition, \(a_j\le a_k\) and \(b_j\le b_k\) for \(j\le k\), and thus, for all \(k\ge 0\),
We now have to bound the last two terms on the right-hand side of (3.13). Using (3.1) and Lemma 3.2 with \(\{c_k\} = \{\Vert g_k\Vert ^2\}_{k\in \mathcal{K}^L}\) and \(\alpha = 2\mu\), gives that
if \(\mu < {\scriptstyle \frac{1}{2}}\), and
if \(\mu = {\scriptstyle \frac{1}{2}}\) and
if \(\mu > {\scriptstyle \frac{1}{2}}\). Similarly, using (3.2) and Lemma 3.2 with \(\{c_k\} = \{\widehat{\phi }_k^3\}_{k\in \mathcal{K}^Q}\) and \(\alpha = 3\nu\) yields that
if \(\nu < {\scriptstyle \frac{1}{3}}\),
if \(\nu = {\scriptstyle \frac{1}{3}}\), and
if \(\nu > {\scriptstyle \frac{1}{3}}\). Moreover, unless \(a_k< 1\), the argument of the logarithm in the right-hand side of (3.15) satisfies
Similarly, unless \(b_k< 1\), the argument of the logarithm in the right-hand side of (3.18) satisfies
Moreover, we may assume, without loss of generality, that \(L_1\) and \(L_2\) are large enough to ensure that
Because of these observations and since (3.13) together with one of (3.14)–(3.16) and one of (3.17)–(3.19) has the form of condition (3.3), we may then apply Lemma 3.1 for each \(k\ge 0\) with \(a=a_k\), \(b=b_k\) and the following associations:
\(\bullet\) for \(\mu \in (0,{\scriptstyle \frac{1}{2}})\), \(\nu \in (0,{\scriptstyle \frac{1}{3}})\):
\(\bullet\) for \(\mu ={\scriptstyle \frac{1}{2}}\), \(\nu \in (0,{\scriptstyle \frac{1}{3}})\):
\(\bullet\) for \(\mu \in ({\scriptstyle \frac{1}{2}},1)\), \(\nu \in (0,{\scriptstyle \frac{1}{3}})\):
\(\bullet\) for \(\mu \in (0,{\scriptstyle \frac{1}{2}})\), \(\nu ={\scriptstyle \frac{1}{3}}\):
\(\bullet\) for \(\mu ={\scriptstyle \frac{1}{2}}\), \(\nu ={\scriptstyle \frac{1}{3}}\):
\(\bullet\) for \(\mu \in ({\scriptstyle \frac{1}{2}},1)\), \(\nu ={\scriptstyle \frac{1}{3}}\):
\(\bullet\) for \(\mu \in (0,{\scriptstyle \frac{1}{2}})\), \(\nu \in ({\scriptstyle \frac{1}{3}},1)\):
\(\bullet\) for \(\mu ={\scriptstyle \frac{1}{2}}\), \(\nu \in ({\scriptstyle \frac{1}{3}}, 1)\):
\(\bullet\) for \(\mu \in ({\scriptstyle \frac{1}{2}},1)\), \(\nu \in ({\scriptstyle \frac{1}{3}},1)\):
As a consequence of applying Lemma 3.1, we obtain that there exist positive constants \(\kappa _{ \text{1rst }}\ge 1\) and \(\kappa _{ \text{2nd }}\ge 1\) only depending on problem-related quantities and on \(\varsigma\), \(\xi\), \(\mu\) and \(\nu\) such that, for all \(k\ge 0\),
We also have, from the mechanism of Step 3 of the algorithm [see (2.7)] and (3.9), that
and
These two inequalities in turn imply that, for all \(k\ge 0\),
and the desired results follow with \(\kappa _{ \text{ ASTR2 }} = {\scriptstyle \frac{1}{2}}(\kappa _{ \text{1rst }} + \kappa _{ \text{2nd }})\). \(\square\)
Comments:
-
1.
Note that \(\widehat{\phi }_k < \phi _k\) only when \(\phi _k > \xi\). Thus, if \(\phi _k\) is bounded, one can choose \(\xi\) large enough to ensure that \(\phi _k=\widehat{\phi }_k\) for all k, and therefore that \(\min _{j\in \{ 0, \ldots , k \}}\phi _j \le \kappa _{ \text{ ASTR2 }}/(k+1)^{\scriptstyle \frac{1}{3}}\). In practice, \(\xi\) can be used to tune the algorithm’s sensitivity to second-order information.
-
2.
If the k-orders of convergence specified by (3.8) are translated into \(\epsilon\)-orders, that is numbers of iterations/evaluations to achieve \(\Vert g(x_k)\Vert \le \epsilon _1\) and \(\phi _k = \widehat{\phi }_k \le \epsilon _2\), where \(\epsilon _1\) and \(\epsilon _2\) are prescribed accuracies, we verify that at most \(\mathcal{O}(\epsilon _1^{-2})\) of them are needed to achieve the first of these conditions, while at most \(\mathcal{O}(\epsilon _2^{-3})\) are needed to achieve the second. As a consequence, at most \(\mathcal{O}(\max [\epsilon _1^{-2},\epsilon _2^{-3}])\) iterations/evaluations are needed to satisfy both conditions. These orders are identical to the sharp bounds known for the familiar trust-region methods (see [2, 16] or [5, Theorems 2.3.7 and 3.2.6]), or, for second-order optimality, for the Adaptive Regularization method (see [18] or [5, Theorem 3.3.2]). This is quite remarkable because function values are essential in these two latter classes of algorithms to enforce descent, itself a crucial ingredient of existing convergence proofs.
-
3.
While (3.8) is adequate to allow a meaningful comparison of the global convergence rates with standard algorithms, as we just discussed, we note that (3.7) is at least as strong, because the average is of course a majorant of the minimum. As it turns out, it provides (order)-equivalent bounds. To see this, we first note that, for \(k >0\) and \(\alpha \in (0,1)\),
$$\begin{aligned} \begin{array}{lcl} \displaystyle \sum _{i=1}^k \frac{1}{i^\alpha } &{} = &{} \frac{\displaystyle k^{1-\alpha }}{\displaystyle 1-\alpha } + \zeta (\alpha ) + \alpha \displaystyle \int _k^\infty \frac{\displaystyle x-\lfloor x \rfloor }{\displaystyle x^{1+\alpha }}\, dx\\ &{}\le &{}\frac{\displaystyle k^{1-\alpha }}{\displaystyle 1-\alpha } + \zeta (\alpha ) + \alpha \displaystyle \int _k^\infty \frac{\displaystyle 1}{\displaystyle x^{1+\alpha }}\, dx\\ &{}= &{}\frac{\displaystyle k^{1-\alpha }}{\displaystyle 1-\alpha } + \zeta (\alpha ) + \frac{\displaystyle 1}{\displaystyle k^\alpha } \end{array} \end{aligned}$$where \(\zeta (\cdot )\) is the Riemann zeta function (see [19, (25.2.8)] for the first equality), so that
$$\begin{aligned} \frac{\frac{1}{k}\displaystyle \sum _{i=1}^k\frac{1}{i^\alpha }}{\frac{1}{k^\alpha }} = k^{\alpha -1}\displaystyle \sum _{i=1}^k\frac{1}{i^\alpha } \le \frac{1}{1-\alpha } + \frac{\zeta (\alpha )}{k^{1-\alpha }} + \frac{1}{k} \end{aligned}$$and therefore
$$\begin{aligned} \frac{1}{k}\displaystyle \sum _{i=1}^k\frac{1}{i^\alpha } = \mathcal{O}\left( \frac{1}{k^\alpha }\right) \end{aligned}$$for \(\alpha \in \{{\scriptstyle \frac{1}{2}}, {\scriptstyle \frac{1}{3}}\}\) (the two cases of interest here).
-
4.
The expression of the constants in Theorem 3.3 is very intricate. However it is remarkable that they do not explicitly depend on the problem dimension. Although a good sign, this does not tell the whole story and caution remains advisable, because the Lipschitz constants \(L_1\) and \(L_2\) may themselves hide this (potentially severe) dependence.
-
5.
It is also remarkable that the bounds (3.7) and (3.8) specify the same order of global convergence irrespective of the values of \(\mu\) and \(\nu\) in (0, 1), although these values do affect the constants involved.
-
6.
The condition (2.7) determining the choice of a linear (in \(\mathcal{K}^L\)) or quadratic (in \(\mathcal{K}^Q\)) step is only used at the very end of the theorem’s proof, after (3.22) has already been obtained. This means that other choice mechanisms are possible without affecting this last conclusion, which is enough to derive bounds on \(\Vert g_j\Vert ^2\) and \(\widehat{\phi }_j^3\) averaged on iterations in \(\mathcal{K}^L\) and \(\mathcal{K}^Q\), respectively (rather than on all iterations).
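As a numerical complement to comment 3 above (illustrative only): since \(i^{-\alpha }\le \int _{i-1}^i x^{-\alpha }\,dx\), one even has the cruder bound \(k^{\alpha -1}\sum _{i=1}^k i^{-\alpha }\le 1/(1-\alpha )\) for \(\alpha \in (0,1)\), which the following snippet checks for the two exponents of interest:

```python
import numpy as np

def averaged_power_sum_ratio(k, alpha):
    """Compute k^(alpha-1) * sum_{i=1}^k i^(-alpha).  This ratio stays
    bounded by 1/(1-alpha) (integral comparison on [i-1, i]), confirming
    that the average of 1/i^alpha is O(1/k^alpha)."""
    i = np.arange(1, k + 1, dtype=float)
    return k ** (alpha - 1) * np.sum(i ** (-alpha))
```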
We now show that the bound (3.8) is essentially sharp (in the sense of [3], meaning that a lower bound on evaluation complexity exists which is arbitrarily close to its upper bound) by following ideas of [5, Theorem 2.2.3] in an argument parallel to that used in [14] for the first-order bound.
Theorem 3.4
The bound (3.8) is essentially sharp in that, for each \(\mu ,\nu \in (0,1)\), \(\vartheta _L=\vartheta _Q=1\) and each \(\varepsilon \in (0,{\scriptstyle \frac{2}{3}})\), there exists a univariate function \(f_{\mu ,\nu ,\varepsilon }\) satisfying AS.1–AS.3 such that, when applied to minimize \(f_{\mu ,\nu ,\varepsilon }\) from the origin, the ASTR2 algorithm with (3.1)–(3.2) produces second-order optimality measures given by
Proof
We start by constructing \(\{x_k\}\) for which \(f_{\mu ,\nu ,\varepsilon }(x_k) = f_k\), \(\nabla _x^1f_{\mu ,\nu ,\varepsilon }(x_k) = g_k\) and \(\nabla _x^2 f_{\mu ,\nu ,\varepsilon }(x_k) = H_k\) for associated sequences of function, gradient and Hessian values \(\{f_k\}\), \(\{g_k\}\) and \(\{H_k\}\), and then apply Hermite interpolation to exhibit the function \(f_{\mu ,\nu ,\varepsilon }\) itself. We select an arbitrary \(\varsigma > 0\) and define, for \(k\ge 0\),
from which we deduce, using (2.2), that, for \(k> 0\),
Since \(\phi _k^3 > 0 = \Vert g_k\Vert ^2\), we set
which is the exact minimizer of the quadratic model within the trust region, yielding that, for \(k\ge 0\),
where we used the fact that \(\varsigma +\sum _{j=0}^k \widehat{\phi }_j^3> \varsigma +\widehat{\phi }_0>1\) to deduce the last inequality. We then define, for all \(k\ge 0\),
and
Observe that the sequence \(\{f_k\}\) is decreasing and that, for all \(k\ge 0\),
where we used (3.28) and (3.26). Hence (3.28) implies that
Also note that, using (3.28),
while, using (3.24),
Moreover, using the fact that \(1/x^{{\scriptstyle \frac{1}{3}}+\nu }\) is a convex function of x over \([1,+\infty )\), and that from (3.25) \(s_k \ge \frac{1}{(k+1)^{{\scriptstyle \frac{1}{3}}+\nu } \left( \varsigma + k+1\right) ^\nu }\), we derive that, for \(k\ge 0\),
These last bounds with (3.30), (3.31) and (3.32) allow us to use standard Hermite interpolation on the data given by \(\{f_k\}\), \(\{g_k\}\) and \(\{H_k\}\): see, for instance, Theorem A.9.1 in [5] with \(p=2\) and
(the second term in the max bounding \(|f_k|\) because of (3.30) and the third bounding \(|g_k|\) and \(|H_k|\) because of (3.24)). We then deduce that there exists a twice continuously differentiable function \(f_{\mu ,\nu ,\varepsilon }\) from \(\mathbb{R}\) to \(\mathbb{R}\) with Lipschitz continuous gradient and Hessian (i.e. satisfying AS.1 and AS.2) such that, for \(k\ge 0\),
Moreover, the range of \(f_{\mu ,\nu ,\varepsilon }\) is bounded by a constant independent of \(\varepsilon\), hence guaranteeing AS.3. The definitions (3.24), (3.25), (3.27) and (3.28) imply that the sequences \(\{x_k\}\), \(\{f_k\}\), \(\{g_k\}\) and \(\{H_k\}\) can be seen as generated by the ASTR2 algorithm applied to \(f_{\mu ,\nu ,\varepsilon }\), starting from \(x_0=0\). \(\square\)
Figure 1 shows the behaviour of \(f_{\mu ,\nu ,\varepsilon }(x)\) for \(\mu = {\scriptstyle \frac{1}{2}}\), \(\nu = {\scriptstyle \frac{1}{3}}\), \(\vartheta _L=\vartheta _Q=1\) and \(\varepsilon = \varsigma ={\scriptstyle \frac{1}{100}}\), its gradient and Hessian, as resulting from the first 10 iterations of the ASTR2 algorithm with (3.1)–(3.2). (We have chosen to shift \(f_0\) to 100 in order to avoid large numbers on the vertical axis of the left panel.) Due to the slow convergence of the series \(\sum _j 1/j^{\frac{1}{1+3/100}}\), illustrating the boundedness of \(f_0-f_{k+1}\) would require many more iterations. One also notes that the gradient is not monotonically increasing, which implies that \(f_{\mu ,\nu ,\varepsilon }(x)\) is nonconvex, as can be verified in the left panel. Note that the unidimensional nature of the example is not restrictive, since it is always possible to make the value of its objective function and gradient independent of all dimensions but one. Also note that, as was the case in [14], the argument of Theorem 3.4 fails for \(\varepsilon =0\) since then the sums in (3.29) diverge when k tends to infinity.
Note that, because
one deduces that
which, when compared to (3.8), reflects the (slight) difference in strength between (3.7) and (3.8).
4 A “divergent stepsize” ASTR2 subclass
A “divergent stepsize” first-order method was analyzed in [14], motivated by its good practical behaviour in the stochastic context [15]. For coherence, we now present and analyze a similar variant, this time for second-order optimality. This requires the following additional assumption.
AS.4: there exists a constant \(\kappa _g >0\) such that, for all x, \(\Vert g(x)\Vert _\infty \le \kappa _g\).
Theorem 4.1
Suppose that AS.1–AS.3 and AS.4 hold and that the ASTR2 algorithm is applied to problem (2.1), where the scaling factors \(w_{i,k}\) are chosen such that, for some power parameters \(0< \nu _1 \le \mu _1 < 1\) and \(0<\nu _2\le \mu _2<{\scriptstyle \frac{1}{2}}\), some constants \(\varsigma \in (0,1]\) and \(\kappa _w\ge \max [1,\varsigma ]\), all \(i\in \{ 1, \ldots , n \}\) and all \(k\ge 0\),
Let \(\psi _k{\mathop {=}\limits ^\textrm{def}}\min [1,\max [\Vert g_k\Vert ^2,\phi _k^3]]\). Then, for any \(\theta \in (0,{\scriptstyle \frac{1}{4}}\tau )\) and \(k >j_\theta\),
where
and
Proof
Consider an arbitrary \(\theta \in (0,{\scriptstyle \frac{1}{4}}\tau )\) and note that AS.4, (4.1) and the definition of \(\widehat{\phi }_k\) imply that
If we define \(j_\theta\) by (4.3), we immediately obtain from AS.4 and Lemma 2.5 (where we neglect the first term in the right-hand sides of (2.20) and (2.21)) that
If we choose \(j > j_\theta\), one then verifies that the definition of \(j_\theta\) in (4.3), the bounds (2.20) and (2.21) and the definition (4.1) together ensure that
Using now the mechanism of Step 3, the definition of \(\psi _k\), (4.1) and the inequality \(\kappa _w\ge 1\), we obtain that, for \(j > j_\theta\)
As a consequence, we obtain from (4.5) and the summation of (4.6) for \(j\in \{ j_\theta +1, \ldots , k \}\) that, for \(k > j_\theta\),
We therefore deduce, using AS.3, that
and (4.2) follows. \(\square\)
This theorem gives a bound on the rate at which the combined optimality measure \(\psi _k\) tends to zero. This bound is slightly worse than that obtained in the previous section, but approaches it whenever \(\max [\mu _1,2\mu _2]\) approaches zero.
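For concreteness, the combined measure \(\psi _k\) is easy to compute once \(\phi _k\) is available. The sketch below (ours, not the authors' code) does so for a univariate model, taking \(\phi _k\) to be the largest decrease of the quadratic Taylor model over the unit ball, which is our reading of definition (2.3), not reproduced in this excerpt.

```python
def phi(g: float, h: float, samples: int = 2001) -> float:
    """Largest decrease -(g*d + 0.5*h*d**2) of the quadratic model
    over the unit interval |d| <= 1, by a simple grid scan."""
    grid = [-1.0 + 2.0 * k / (samples - 1) for k in range(samples)]
    return max(-(g * d + 0.5 * h * d * d) for d in grid)

def psi(g: float, h: float) -> float:
    """Combined measure psi_k = min[1, max[||g_k||^2, phi_k^3]] of Theorem 4.1."""
    return min(1.0, max(g * g, phi(g, h) ** 3))
```

For a locally convex model (\(h>0\)) with a small gradient, \(\psi\) is driven by \(\Vert g_k\Vert ^2\); significant negative curvature makes the \(\phi _k^3\) term dominate instead.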
Using the methodology of Theorem 3.4, we now show that the bound (4.2) is also essentially sharp.
Theorem 4.2
The bound (4.2) is essentially sharp in that, for each \(\mu = (\mu _1,\mu _2)\), each \(\nu = (\nu _1,\nu _2)\) with \(0<\nu _1\le \mu _1<1\) and \(0<\nu _2\le \mu _2<{\scriptstyle \frac{1}{2}}\) and each \(\varepsilon \in (0,1-{\scriptstyle \frac{1}{3}}(1-2\mu _2))\), there exists a univariate function \(h_{\mu ,\nu ,\varepsilon }\) satisfying AS.1–AS.4 such that, when applied to minimize \(h_{\mu ,\nu ,\varepsilon }\) from the origin, the ASTR2 algorithm with (4.1) produces second-order optimality measures given by
Proof
As above, we start by defining \(\gamma = {\scriptstyle \frac{1}{3}}(1-2\mu _2)+\varepsilon\), \(w_k= \kappa _w(k+1)^{\mu _2}\) and, for \(k\ge 0\),
which then implies, using (2.2) that, for \(k> 0\),
Given these definitions and because \(\widehat{\phi }^3_k>0 = \Vert g_k\Vert ^2\), we set
yielding that, for \(k>0\),
where we used the fact that \(\kappa _w \ge 1\) to deduce the last inequality. We then define, for all \(k\ge 0\),
and
where \(\zeta (\cdot )\) is the Riemann zeta function. Note that, since \(\gamma >1-2\mu _2\), the argument \(3\gamma +2\mu _2\) of \(\zeta\) is strictly larger than one and \(\zeta (3\gamma +2\mu _2)\) is finite. Observe also that the sequence \(\{h_k\}\) is decreasing and that, for all \(k\ge 0\),
where we used (3.28) and (3.26). Hence (3.28) implies that
Also note that, using (3.28),
while, using (3.24),
Moreover, using the fact that \(1/x^\gamma\) is a convex function of x over \([1,+\infty )\) and (4.10), we derive that, for \(k\ge 0\),
This bound, together with (4.15), (4.16) and (4.17), once more allows us to use standard Hermite interpolation on the data given by \(\{h_k\}\), \(\{g_k\}\) and \(\{H_k\}\), as stated in [5, Theorem A.9.1] with \(p=2\) and
(the second term in the max bounds \(|h_k|\) because of (4.15) and the third bounds both \(|g_k|\) and \(|H_k|\) because of (4.8)). As a consequence, there exists a twice continuously differentiable function \(h_{\mu ,\nu ,\varepsilon }\) from \(\mathbb{R}\) to \(\mathbb{R}\) with Lipschitz continuous gradient and Hessian (i.e. satisfying AS.1 and AS.2) such that, for \(k\ge 0\),
Moreover, the ranges of \(h_{\mu ,\nu ,\varepsilon }\) and its derivatives are bounded by constants independent of \(\gamma\), hence guaranteeing AS.3 and AS.4. Thus (4.8), (4.10), (4.12) and (4.13) imply that the sequences \(\{x_k\}\), \(\{h_k\}\), \(\{g_k\}\) and \(\{H_k\}\) can be seen as generated by the ASTR2 algorithm applied to \(h_{\mu ,\nu ,\varepsilon }\), starting from \(x_0=0\). The first bound of (4.7) then results from (4.9) and the definition of \(\gamma\).
The behaviour of \(h_{\mu ,\nu ,\varepsilon }\) is illustrated in Figure 2. It is qualitatively similar to that of \(f_{\mu ,\nu ,\varepsilon }\) shown in Figure 1, although the decrease in objective value is somewhat slower, as expected. As in Sect. 3, note that the inequality
implies that
which has the same flavour as the second bound of (3.23).
5 Second-order optimality in a subspace
While the ASTR2 algorithms guarantee second-order optimality conditions, this guarantee comes at a computational price. The key to this guarantee is of course that significant negative curvature in any direction of \(\mathbb{R}^n\) must be exploited, which requires evaluating the Hessian. In addition, the optimality measure \(\phi _k\) and the step \(s_k\) must also be computed. These computational costs may be judged excessive, so the question arises whether a potentially cheaper algorithm can ensure a “degraded” or weaker form of second-order optimality. Fortunately, the answer is positive: one can guarantee second-order optimality in subspaces of \(\mathbb{R}^n\) at lower cost.
The first step is to assume that a subspace \(\mathcal{S}_k\) is of interest at iteration k. Then, instead of computing \(\phi _k\) from (2.3), one can choose to calculate
Because the dimension of \(\mathcal{S}_k\) may be much smaller than n, the cost of this computation may be significantly smaller than that of computing \(\phi _k\). The measure \(\phi _k^{\mathcal{S}_k}\) may for instance be obtained using a Krylov-based method, such as conjugate gradients [17], GLRT [13] or variants thereof, where the minimum of the model \(T_{f,2}(x,d)\) within the trust region is computed iteratively in a sequence of nested Krylov subspaces of increasing dimension, which tend to contain vectors along which curvature is extreme [12, Chapter 9], thereby improving the quality of the second-order guarantee compared to random subspaces. This process may then be terminated before the subspaces fill \(\mathbb{R}^n\), should the calculation become too expensive or a desired accuracy be reached. In addition, there is no need for \(n_k\), the dimension of the final Krylov space at iteration k, to be constant: it is often kept very small when far from optimality. This technique has the added benefit that the full Hessian is not evaluated: only \(n_k\) Hessian-times-vector products are needed, again significantly reducing the computational burden. Once \(\phi _k^{\mathcal{S}_k}\) is known, calculating the step \(s_k^Q\) for \(k\in \mathcal{K}^Q\) is also cheaper in a space of dimension \(n_k\) much less than n, especially since only a \(\tau\)-approximation is needed (see the comments after the algorithm).
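To make the Krylov-based computation concrete, here is a minimal sketch (ours, not the authors' implementation) of how the most negative curvature of the Hessian in a Krylov subspace can be estimated using only Hessian-times-vector products: Lanczos builds an orthonormal basis of \(\mathcal{K}_m(H,g)\) together with the projected tridiagonal matrix, whose leftmost eigenvalue is then found by Sturm-sequence bisection. All names and the termination tolerance are our own choices.

```python
import math

def lanczos_tridiag(hvp, v0, m):
    """m Lanczos steps; hvp(v) returns the product H @ v.
    Returns the diagonal (alpha) and off-diagonal (beta) of the
    tridiagonal projection of H onto the Krylov space K_m(H, v0)."""
    n = len(v0)
    nrm = math.sqrt(sum(x * x for x in v0))
    q = [x / nrm for x in v0]           # current Lanczos vector
    q_prev = [0.0] * n
    alpha, beta, b = [], [], 0.0
    for _ in range(m):
        w = hvp(q)                      # only Hessian-vector products are used
        a = sum(wi * qi for wi, qi in zip(w, q))
        w = [wi - a * qi - b * pi for wi, qi, pi in zip(w, q, q_prev)]
        alpha.append(a)
        b = math.sqrt(sum(x * x for x in w))
        if b < 1e-12:                   # invariant subspace found: stop early
            break
        beta.append(b)
        q_prev, q = q, [x / b for x in w]
    return alpha, beta[: len(alpha) - 1]

def count_eigs_below(alpha, beta, x):
    """Sturm-sequence count of eigenvalues of the tridiagonal matrix below x."""
    count, d = 0, 1.0
    for i, a in enumerate(alpha):
        d = a - x - (beta[i - 1] ** 2 / d if i > 0 else 0.0)
        if d == 0.0:
            d = -1e-300                 # treat an exact zero pivot as negative
        if d < 0.0:
            count += 1
    return count

def min_eig_tridiag(alpha, beta, tol=1e-10):
    """Smallest eigenvalue of the tridiagonal matrix, by bisection."""
    bound = max(abs(a) for a in alpha) + 2.0 * max([abs(b) for b in beta] or [0.0])
    lo, hi = -bound, bound
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_eigs_below(alpha, beta, mid) >= 1:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

For instance, for the diagonal Hessian \(\textrm{diag}(3,-2,1)\) and a starting vector with nonzero components along all eigenvectors, three Lanczos steps recover the leftmost curvature \(-2\); a sufficiently negative value would then trigger the negative-curvature branch of the algorithm.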
Importantly, the theory developed in the previous sections is not affected by the transition from \(\mathbb{R}^n\) to \(\mathcal{S}_k\), except that now the complexity bounds (3.7)–(3.8) and (4.2) are no longer expressed in terms of \(\widehat{\phi }_k\) but involve \(\widehat{\phi }_k^{\mathcal{S}_k} = \min [ 1, \phi _k^{\mathcal{S}_k}]\) instead. While clearly not as powerful as the complete second-order guarantee in \(\mathbb{R}^n\), weaker guarantees based on (Krylov) subspaces are often sufficient in practice and make the ASTR2 algorithm more affordable. Note that, in the limit, one can even choose \(\mathcal{S}_k= \{0\}\) for all k, in which case we can set \(H_k=0\) for all k and we do not obtain any second-order guarantee (but the first-order complexity bounds remain valid, recovering results of [14]).
6 Conclusions
We have introduced an OFFO algorithm whose global rate of convergence to first-order minimizers is \(\mathcal{O}((k+1)^{-{\scriptstyle \frac{1}{2}}})\) while it converges to second-order ones as \(\mathcal{O}((k+1)^{-{\scriptstyle \frac{1}{3}}})\). These bounds are equivalent to the best known bounds for second-order optimality for algorithms using objective-function evaluations, despite the latter exploiting significantly more information. Thus we conclude that, from the point of view of evaluation complexity at least, evaluating values of the objective function is an unnecessary effort for efficiently finding second-order minimizers. We have also discussed another closely related algorithm, whose global rates of convergence can be nearly as good. We have finally considered how weaker second-order guarantees may be obtained at a much reduced computational cost.
We expect that extending our proposal to convexly constrained cases (for instance to problems involving bounds on the variables) should be possible. As in [7, Chapter 12], the idea would be to restrict the model minimization at each iteration to the intersection of the trust region with the feasible domain, but this should of course be verified.
It is clearly too early to assess whether the new algorithms will turn out to be of practical interest. In the form with \(\vartheta = 1\), they are undoubtedly quite conservative because of the monotonic nature of the scaling factors (and hence of the trust-region radius) and because, for locally convex functions (for which a second-order guarantee is not needed), \(\Vert g_k\Vert \ge \phi _k\), yielding a linear step. Whether less conservative variants with similar or better complexity can be designed is the object of ongoing research.
Data availability
No dataset was generated or analyzed because our approach is theoretical.
Notes
For which the only source of information on the problem at hand is the value of the gradient.
The authors are well aware that this is a theoretical statement, as it may be impractical to evaluate derivatives without first evaluating the function itself.
Converging to zero.
Which is the case if \(\Vert g_k\Vert \le \kappa _g\) (as we will require in Sect. 4) since then \(\phi _k \le \Vert g_k\Vert + {\scriptstyle \frac{1}{2}}\Vert H_k\Vert \le \kappa _g + {\scriptstyle \frac{1}{2}}L_1.\)
The second of these theorems quotes an \(\mathcal{O}(\max [\epsilon _1^{-2}\epsilon _2^{-1},\epsilon _2^{-3}])\) order bound known for standard trust-region methods using first and second derivatives.
References
Birgin, E.G., Gardenghi, J.L., Martínez, J.M., Santos, S.A., Toint, Ph.L.: Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Math. Program. Ser. A 163(1), 359–368 (2017)
Cartis, C., Gould, N.I.M., Toint, Ph.L.: Complexity bounds for second-order optimality in unconstrained optimization. J. Complex. 28, 93–108 (2012)
Cartis, C., Gould, N.I.M., Toint, Ph.L.: Worst-case evaluation complexity and optimality of second-order methods for nonconvex smooth optimization. In Sirakov, B., de Souza, P., Viana, M. (eds.) Invited Lectures, Proceedings of the 2018 International Conference of Mathematicians (ICM 2018), Vol. 4, Rio de Janeiro, pp. 3729–3768. World Scientific Publishing Co Pte Ltd (2018)
Cartis, C., Gould, N.I.M., Toint, Ph.L.: Strong evaluation complexity of an inexact trust-region algorithm for arbitrary-order unconstrained nonconvex optimization. arXiv:2011.00854 (2020)
Cartis, C., Gould, N.I.M., Toint, Ph.L.: Evaluation complexity of algorithms for nonconvex optimization. Number 30 in MOS-SIAM Series on Optimization. SIAM, Philadelphia, USA (2022)
Cartis, C., Gould, N.I.M., Toint, Ph.L.: Strong evaluation complexity bounds for arbitrary-order optimization of nonconvex nonsmooth composite functions. In: Invited Lectures, Proceedings of the 2022 International Conference of Mathematicians (ICM 2022), St. Petersburg. European Mathematical Society (EMS), 2022. arXiv:2001.10802
Conn, A.R., Gould, N.I.M., Toint, Ph.L.: Trust-Region Methods. Number 1 in MOS-SIAM Optimization Series. SIAM, Philadelphia, USA (2000)
Corless, R.M., Gonnet, G.H., Hare, D.E., Jeffrey, D.J., Knuth, D.E.: On the Lambert W function. Adv. Comput. Math. 5, 329–359 (1996)
Défossez, A., Bottou, L., Bach, F., Usunier, N.: A simple convergence proof for Adam and Adagrad. arXiv:2003.02395v2 (2020)
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
Fan, J., Yuan, Y.: A new trust region algorithm with trust region radius converging to zero. In: Li, D, (ed) Proceedings of the 5th International Conference on Optimization: Techniques and Applications (ICOTA 2001, Hong Kong), pp. 786–794 (2001)
Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore (1996)
Gould, N.I.M., Lucidi, S., Roma, M., Toint, Ph.L.: Solving the trust-region subproblem using the Lanczos method. SIAM J. Optim. 9(2), 504–525 (1999)
Gratton, S., Jerad, S., Toint, Ph. L.: First-order objective-function-free optimization algorithms and their complexity. arXiv:2203.01757 (2022)
Gratton, S., Jerad, S., Toint, Ph. L.: Parametric complexity analysis for a class of first-order Adagrad-like algorithms. arXiv:2203.01647 (2022)
Gratton, S., Sartenaer, A., Toint, Ph.L.: Recursive trust-region methods for multiscale nonlinear optimization. SIAM J. Optim. 19(1), 414–444 (2008)
Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear systems. J. Natl. Bureau Stand. 49, 409–436 (1952)
Nesterov, Yu., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Program. Ser. A 108(1), 177–205 (2006)
National Institute of Standards and Technology: Digital Library of Mathematical Functions. https://dlmf.nist.gov/25.2#E8 (2022)
Ward, R., Wu, X., Bottou, L.: Adagrad stepsizes: sharp convergence over nonconvex landscapes. In: Proceedings in the International Conference on Machine Learning (ICML2019) (2019)
Yuan, Y.: Recent advances in trust region algorithms. Math. Program. Ser. A 151(1), 249–281 (2015)
Acknowledgements
S. Gratton: Work partially supported by 3IA Artificial and Natural Intelligence Toulouse Institute (ANITI), French “Investing for the Future – PIA3” program, under Grant agreement ANR-19-PI3A-0004. Ph. L. Toint: Work partially supported by ANITI.
Ethics declarations
Conflicts of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix: technical lemmas
Lemma 7.1
Let \(w >0\) and suppose that
for some \(\alpha \in (0,1)\) and \(\beta\) such that
Then
where \(W_{-1}(\cdot )\) is the second real branch of the Lambert W function [8].
Proof
First note that (7.1) is equivalent to
Setting now \(u = (2 w)^\alpha\), one obtains that
But \(\omega (u)\) is convex for \(u>0\) and tends to infinity as u tends to zero or to infinity. Moreover, it achieves its minimum at \(u_{\min }= \beta 2^\alpha /\alpha\), at which it takes the value
where the inequality results from (7.2). Hence \(\omega (u)\) has two real roots \(u_1\le u_2\) and the set of u for which (7.4) holds is bounded above by \(u_2\). By definition,
which is
Defining now \(z = -\frac{\alpha }{\beta \,2^\alpha }\,u_2\), we obtain that
By definition of the Lambert function, this gives that
which is well-defined because (7.2) implies that \(-\frac{\alpha }{\beta \,2^\alpha }\in [-\frac{1}{e},0)\). Since \(w = u^{\scriptstyle \frac{1}{\alpha }}/2\), this implies (7.3).
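For the reader's convenience, the chain of substitutions in this proof can be written out explicitly. This is our reconstruction of the elided displays, assuming \(\omega (u)=\frac{\alpha }{\beta \,2^{\alpha }}\,u-\log u\), which is the form consistent with the stated convexity, limits and minimiser \(u_{\min }=\beta 2^\alpha /\alpha\):

```latex
% Reconstruction (ours) of the elided displays in the proof of Lemma 7.1.
\[
\omega(u_2) = 0
\;\Longleftrightarrow\;
\log u_2 = \frac{\alpha}{\beta\,2^{\alpha}}\, u_2 .
\]
% Setting z = -(alpha/(beta 2^alpha)) u_2 gives u_2 = e^{-z} and
% u_2 = -(beta 2^alpha/alpha) z, whence
\[
z e^{z} = -\frac{\alpha}{\beta\,2^{\alpha}}
\quad\Longrightarrow\quad
z = W_{-1}\!\Big(-\frac{\alpha}{\beta\,2^{\alpha}}\Big),
\]
% and w = u^{1/alpha}/2 then yields the bound (7.3):
\[
w \;\le\; \frac{1}{2}\left[-\frac{\beta\,2^{\alpha}}{\alpha}\,
W_{-1}\!\Big(-\frac{\alpha}{\beta\,2^{\alpha}}\Big)\right]^{1/\alpha}
\;=:\; \sigma(\alpha,\beta).
\]
```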
Lemma 7.2
Let \(a\ge 0\) and \(b\ge 0\). Suppose that, for some \(\mu \in (0,{\scriptstyle \frac{1}{2}})\), \(\nu \in (0,{\scriptstyle \frac{1}{3}})\) and some \(\theta _{a,0}, \theta _{b,0}\) and \(\theta _0 \ge 0\),
Then
Proof
Obvious from the inequalities \(\theta _{a,0}a^{1-\mu } \le \theta _{a,0}a^{1-\mu }+ \theta _{b,0}b^{1-2\nu }\) and \(\theta _{b,0}b^{1-2\nu }\le \theta _{a,0}a^{1-\mu } + \theta _{b,0}b^{1-2\nu }\).
Lemma 7.3
Let \(a\ge 0\) and \(b\ge 0\). Suppose that, for some \(\mu \in (0,{\scriptstyle \frac{1}{2}})\), some \(\theta _{a,1} > 0\) and some \(\theta _1 \ge 0\),
Then
Symmetrically, if \(\nu \in (0,{\scriptstyle \frac{1}{3}})\), \(\theta _{b,1} > 0\) and
then
Proof
Suppose first that \(\theta _{a,1} a^{1-2\mu }\le \theta _1\). Then \(a^{1-\mu }\le 2\theta _1\) and thus \(a\le (2\theta _1)^{\scriptstyle \frac{1}{1-\mu }}\). Suppose now that \(\theta _{a,1} a^{1-2\mu }> \theta _1\). Then \(a^{1-\mu }\le 2\theta _{a,1} a^{1-2\mu }\), that is \(a\le (2\theta _{a,1})^{\scriptstyle \frac{1}{\mu }}\). The proof of the second part is similar.
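The displayed hypothesis of Lemma 7.3 is not reproduced in this excerpt; the two cases of the proof suggest it has the form \(a^{1-\mu }\le \theta _{a,1}a^{1-2\mu }+\theta _1\). Under that assumption, the conclusion \(a\le \max [(2\theta _1)^{1/(1-\mu )},(2\theta _{a,1})^{1/\mu }]\) can be sanity-checked numerically; the sketch below is ours, not the authors' code.

```python
import random

def lemma73_bound(theta_a1: float, theta_1: float, mu: float) -> float:
    """The bound max[(2*theta_1)^(1/(1-mu)), (2*theta_a1)^(1/mu)] of Lemma 7.3."""
    return max((2.0 * theta_1) ** (1.0 / (1.0 - mu)),
               (2.0 * theta_a1) ** (1.0 / mu))

def check_lemma73(trials: int = 5000, seed: int = 0) -> bool:
    """Randomized search for a counterexample to the (assumed) implication
    a^(1-mu) <= theta_a1*a^(1-2mu) + theta_1  =>  a <= lemma73_bound(...).
    Returns True if no counterexample is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        mu = rng.uniform(0.05, 0.45)          # mu in (0, 1/2)
        theta_a1 = rng.uniform(0.1, 5.0)      # theta_{a,1} > 0
        theta_1 = rng.uniform(0.0, 5.0)       # theta_1 >= 0
        a = rng.uniform(0.0, 50.0)
        if a ** (1.0 - mu) <= theta_a1 * a ** (1.0 - 2.0 * mu) + theta_1:
            if a > lemma73_bound(theta_a1, theta_1, mu) + 1e-9:
                return False
    return True
```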
Lemma 7.4
Let \(a\ge 0\) and \(b\ge 0\). Suppose that, for some \(\mu \in (0,{\scriptstyle \frac{1}{2}})\), \(\nu \in (0,{\scriptstyle \frac{1}{3}})\) and some \(\theta _a, \theta _b > 0\) and \(\theta _2 \ge 0\),
Then
and
Proof
Suppose first that
Then, from Lemma 7.2,
Suppose now that (7.9) fails, and thus that
Assume also that
Then,
and so, using (7.8) and (7.11),
which is impossible. Hence (7.12) cannot hold, and at least one of its inequalities must fail. Suppose that it is the first, that is
Then (7.8) and (7.11) give that
and we may apply Lemma 7.3 with \(\theta _{b,1} = 2\theta _{b,2}\) and \(\theta _1 = 2\theta _{a,2}\kappa _1^{1-2\mu }\) to deduce that
Symmetrically, we deduce that if the second inequality of (7.12) fails, that is if
then, applying Lemma 7.3 with \(\theta _{a,1}= 2 \theta _{a,2}\) and \(\theta _1 = 2\theta _{b,2}\kappa _2^{1-3\nu }\),
Combining the two cases yields the desired result.
Lemma 7.5
Let \(a\ge 0\) and \(b \ge 0\). Suppose that, for some \(\mu \in (0,{\scriptstyle \frac{1}{2}})\), \(\nu \in ({\scriptstyle \frac{1}{3}},1)\) and some \(\theta _{a,3}>0\), \(\theta _3 \ge 0\),
Then
Symmetrically, if \(\theta _{b,3}>0\) and
then
Proof
From (7.14), we have that
and we may apply Lemma 7.3 with \(\theta _{a,1}=\theta _{a,3}\) and \(\theta _1=\theta _3\) to deduce that
From the inequality \(b^{1-2\nu } \le a^{1-\mu }+b^{1-2\nu }\) and (7.14), we also obtain that
Lemma 7.6
Let \(a>0\) and \(b>0\). Suppose that, for some \(\nu \in (0,{\scriptstyle \frac{1}{3}}]\), some \(\theta _{a,4} \ge 1\) and some \(\theta _4\ge 0\),
Then
Symmetrically, if \(\theta _{b,4}\ge 1\), \(\mu \in (0,{\scriptstyle \frac{1}{2}}]\) and
then
Proof
Suppose first that \(\theta _{a,4} \log (2a)\le \theta _4\). Then
Otherwise, (7.15) gives that
from which one deduces using Lemma 7.1 with \(\alpha = {\scriptstyle \frac{1}{2}}\) and \(\beta =2\theta _{a,4}\) (which is allowed since \(2\theta _{a,4}\ge 2> 3/2^{\scriptstyle \frac{5}{2}}\) implies (7.2)) that
where \(\sigma (\cdot ,\cdot )\) is defined in (7.3). This inequality and (7.16) give the desired bound on a. Substituting this in (7.15) gives the bound on b. The proof of the symmetric statement is similar, in which the use of Lemma 7.1 is now allowed because \(\theta _{b,4} \ge 1 > 1/2^{\scriptstyle \frac{4}{3}}\) again implies (7.2).
Lemma 7.7
Let \(a>0\) and \(b\ge 0\). Suppose that, for some \(\nu \in (0,{\scriptstyle \frac{1}{3}})\), some \(\theta _{a,5} \ge 1\), \(\theta _{b,5}>0\) and some \(\theta _5\ge 0\),
Then
and
with \(\sigma _a= \sigma \big ({\scriptstyle \frac{1}{2}},2\theta _{a,5}\big )\). Symmetrically, if \(\theta _{a,5} \ge 1\), \(\mu \in (0,{\scriptstyle \frac{1}{2}})\), \(b>0\) and
then
and
with \(\sigma _b= \sigma \big ({\scriptstyle \frac{1}{2}},2\theta _{b,5}\big )\).
Proof
Suppose first that
Then
Suppose now that (7.18) fails. Then, from (7.17),
If
we obtain, using Lemma 7.1 (which we may apply because \(\theta _{a,5} \ge 1 > 3/2^{\scriptstyle \frac{5}{2}}\)), (7.20) and (7.17) that
which is impossible. Hence one of the inequalities of (7.21) must be violated. Suppose that
Using Lemma 7.1 again and (7.20), this implies that
and we deduce from Lemma 7.3 with \(\theta _{b,1}=2\theta _{b,5}\) and \(\theta _1= 2\theta _{a,5}\log (2\sigma _a)\) that
If we now suppose that \(b \le (2\theta _{b,5})^{\scriptstyle \frac{1}{\nu }}\), then (7.20) ensures that
and we now obtain from Lemma 7.6 with \(\theta _{a,4} = 2\theta _{a,5}\) and \(\theta _4=2\theta _{b,5} (2\theta _{b,5})^{\scriptstyle \frac{1-3\nu }{\nu }}\) that
\(\square\)
Lemma 7.8
Let \(a>0\) and \(b>0\). Suppose that, for some \(\theta _{a,6} \ge 1\), \(\theta _{b,6} \ge 1\) and some \(\theta _6\ge 0\),
where \(2a\ge \varsigma\) and \(2b\ge \varsigma\). Then
and
Proof
Suppose first that
Then
and hence
Suppose now that (7.23) fails, and thus (7.22) implies that
Assume also that
Then, using (7.22) and (7.25),
which is impossible. Hence one of the inequalities of (7.26) must fail. If \(a \le \sigma \big ({\scriptstyle \frac{1}{2}},2\theta _{a,6}\big ),\) then (7.22) gives that
and Lemma 7.6 with \(\theta _{b,4}= 2\theta _{b,6}\) and \(\theta _4 = \theta _{a,6}\log \Big (2\sigma \big ({\scriptstyle \frac{1}{2}},2\theta _{a,6}\big )\Big )\) then implies that
Symmetrically, if \(b < \sigma \big ({\scriptstyle \frac{1}{3}},2\theta _{b,6}\big ),\) then
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Gratton, S., Toint, P.L. OFFO minimization algorithms for second-order optimality and their complexity. Comput Optim Appl 84, 573–607 (2023). https://doi.org/10.1007/s10589-022-00435-2