1 Introduction

This paper considers an a priori unexpected but fundamental and challenging question: is evaluating the value of the objective function necessary for obtaining (complexity-wise) efficient minimization algorithms which find second-order approximate minimizers? This question arose as a natural consequence of the somewhat surprising results of [14], where it was shown that OFFO (i.e. Objective-Function Free Optimization) algorithms (Footnote 1) exist which converge to first-order points at a global rate which is identical in order to that of well-known methods using both gradient and objective-function evaluations. That these algorithms include the deterministic version of Adagrad [10], a very popular method for deep-learning applications, was an added bonus and a good motivation.

We show here that, from the point of view of evaluation complexity alone, evaluating the value of the objective function during optimization is also unnecessary (Footnote 2) for finding approximate second-order minimizers at a (worst-case) cost entirely comparable to that incurred by familiar and reliable techniques such as second-order trust-region or adaptive regularization methods. This conclusion is coherent with that of [14] for first-order points and is obtained by exhibiting an OFFO algorithm whose global rate of convergence is proved to be \(\mathcal{O}(1/\sqrt{k+1})\) for the gradients’ norm and \(\mathcal{O}(1/(k+1)^{1/3})\) for second-order measures. The new ASTR2 algorithm is of the adaptively scaled trust-region type, like those studied in [14]. The key difference is that it now hinges on a scaling technique which depends on second-order information, when relevant.

A further motivation for our analysis is the folklore observation that algorithms which use function values (often in linesearches or other acceptance tests for a new iterate) are significantly less robust than their OFFO counterparts, essentially because the accuracy necessary for the former methods to work well is significantly higher than that required of derivatives’ values. Thus OFFO algorithms, like the one discussed in this paper, merit, in our view, a sound theoretical consideration.

The paper is organized as follows. Section 2 presents the new ASTR2 class of algorithms and discusses some of its scaling-independent properties. The complexity analysis of a first, Adagrad-like, subclass of ASTR2 is then presented in Sect. 3. Another subclass of interest is also considered and analyzed in Sect. 4. Section 5 discusses how weaker optimality conditions may be guaranteed by the ASTR2 algorithms at significantly reduced computational cost. Conclusions and perspectives are finally presented in Sect. 6.

2 The ASTR2 class of minimization methods

2.1 Approximate first- and second-order optimality

We consider the nonlinear unconstrained optimization problem

$$\begin{aligned} \min _{x\in \mathbb{R}^n} f(x) \end{aligned}$$
(2.1)

where f is a function from \(\mathbb{R}^n\) to \(\mathbb{R}\). More precisely, we assume that

  1. AS.1:

    the objective function f(x) is twice continuously differentiable;

  2. AS.2:

    its gradient \(g(x){\mathop {=}\limits ^\textrm{def}}\nabla _x^1f(x)\) and Hessian \(H(x) {\mathop {=}\limits ^\textrm{def}}\nabla _x^2f(x)\) are Lipschitz continuous with Lipschitz constants \(L_1\) and \(L_2\), respectively, that is

    $$\begin{aligned} \Vert g(x)-g(y)\Vert \le L_1 \Vert x-y\Vert \;\; \text{ and } \;\; \Vert H(x)-H(y)\Vert \le L_2 \Vert x-y\Vert \end{aligned}$$

    for all \(x,y\in \mathbb{R}^n\);

  3. AS.3:

    there exists a constant \(f_{\textrm{low}}\) such that \(f(x)\ge f_{\textrm{low}}\) for all \(x\in \mathbb{R}^n\).

As our purpose is to find approximate first- and second-order minimizers, we need to clarify these concepts. In this paper we choose to follow the “strong \(\phi\)” concept of optimality discussed in [4, 6] or [5, Chapters 12–14]. It is based on the quantity

$$\begin{aligned} \phi _{f,2}^\delta (x) = f(x) - \min _{\Vert d\Vert \le \delta } T_{f,2}(x,d), \end{aligned}$$
(2.2)

where \(T_{f,j}(x,d)\) is the j-th order Taylor expansion of f at x (\(j\in \{1,2\}\)), that is

$$\begin{aligned} T_{f,1}(x,d) = f(x) + g(x)^Td \;\; \text{ and } \;\; T_{f,2}(x,d) = f(x) + g(x)^Td + {\scriptstyle \frac{1}{2}}d^TH(x)d. \end{aligned}$$

Observe that \(\phi _{f,j}^\delta (x)\) is interpreted as the maximum decrease of the local j-th order Taylor model of the objective function f at x, within a ball of radius \(\delta\). Importantly for our present purposes, the evaluation of \(\phi _{f,2}^\delta (x)\) does not require the evaluation of f(x), as it can be rewritten as

$$\begin{aligned} \phi _{f,2}^\delta (x) = \max _{\Vert d\Vert \le \delta } -\Big ( g(x)^Td + {\scriptstyle \frac{1}{2}}d^TH(x)d \Big ). \end{aligned}$$
(2.3)

Moreover, computing \(\phi _{f,2}^\delta (x)\) is a standard trust-region step calculation, for which many efficient methods exist (see [7, Chapter 7], for instance).
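To make this concrete, the short Python sketch below (our own illustration; the helper name `trs` and the eigendecomposition-based exact solve are assumptions of the sketch, not part of the paper) computes \(\phi _{f,2}^\delta (x)\) as in (2.3) by solving the Euclidean trust-region subproblem for given g(x) and H(x); it also returns the minimizing step d, which is reused in later sketches.

```python
import numpy as np

def trs(g, H, delta=1.0, tol=1e-12, maxit=200):
    """Solve min_{||d|| <= delta} g^T d + 0.5 d^T H d exactly (small n) via an
    eigendecomposition of H; return (phi, d), where phi = -(optimal value),
    i.e. phi_{f,2}^delta(x) as in (2.3)."""
    g, H = np.asarray(g, float), np.asarray(H, float)
    w, V = np.linalg.eigh(H)                      # H = V diag(w) V^T, w[0] = lambda_min
    gt = V.T @ g                                  # gradient in the eigenbasis
    model = lambda d: g @ d + 0.5 * d @ (H @ d)   # T_{f,2}(x,d) - f(x)
    d_of = lambda sigma: V @ (-gt / (w + sigma))  # d(sigma) = -(H + sigma I)^{-1} g

    # Interior (Newton) solution when H is positive definite and the step fits.
    if w[0] > 0:
        d = d_of(0.0)
        if np.linalg.norm(d) <= delta:
            return -model(d), d

    sigma0 = max(0.0, -w[0])
    shifted = w + sigma0
    null = shifted <= 1e-12 * max(1.0, abs(w[0])) # (near-)null directions of H + sigma0 I
    d0 = V[:, ~null] @ (-gt[~null] / shifted[~null])
    # "Hard case": g has no component in the smallest eigenspace and d0 fits in the
    # ball; complete d0 with an eigenvector component to reach the boundary.
    if np.all(np.abs(gt[null]) <= 1e-12 * max(1.0, np.linalg.norm(g))) \
            and np.linalg.norm(d0) <= delta:
        alpha = np.sqrt(max(0.0, delta**2 - np.linalg.norm(d0)**2))
        d = d0 + alpha * V[:, 0]                  # V[:, 0] is orthogonal to d0
        return -model(d), d

    # Otherwise ||d(sigma)|| decreases from +infinity to 0 on (sigma0, +infinity):
    # bisect on sigma for ||d(sigma)|| = delta.
    lo, hi = sigma0, sigma0 + np.linalg.norm(g) / delta + 1.0
    while np.linalg.norm(d_of(hi)) > delta:
        hi *= 2.0
    for _ in range(maxit):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(d_of(mid)) > delta:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * max(1.0, hi):
            break
    d = d_of(hi)
    return -model(d), d
```

For instance, \(g=0\) and \(H=\textrm{diag}(-2,1)\) with \(\delta =1\) give \(\phi =1\), attained along a unit eigenvector of the negative eigenvalue.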

The next result recalls the link between the \(\phi\) optimality measure and the more standard ones.

Lemma 2.1

[5, Theorems 12.1.4 and 12.1.6] Suppose that f is twice continuously differentiable. Then

  1. (i)

    for any \(\delta >0\) and any \(x\in \mathbb{R}^n\), we have that

    $$\begin{aligned} \Vert g(x)\Vert =\frac{\phi _{f,1}^{\delta }(x)}{\delta }, \end{aligned}$$
    (2.4)

    and so \(\phi _{f,1}^{\delta }(x) = 0\) if and only if \(g(x)=0\);

  2. (ii)

    if \(\phi _{f,2}^{\delta }(x) = 0\) for some \(\delta >0\), then

    $$\begin{aligned} g(x)=0 \;\; \text{ and } \;\; \lambda _{\min }[H(x)] \ge 0, \end{aligned}$$

    and so any such x is a first- and second-order minimizer;

  3. (iii)

    if \(\phi _{f,1}^{\delta _1}(x) \le \epsilon _1 \, \delta _1\) (and so (2.5) holds with \(j=1\)), then \(\Vert g(x)\Vert \le \epsilon _1\);

  4. (iv)

    if \(\phi _{f,2}^\delta (x) \le {\scriptstyle \frac{1}{2}}\epsilon _2\delta ^2\), then \(\lambda _{\min }[H(x)] \ge -\epsilon _2\) (and so (2.5) holds for \(j=2\)) and \(\Vert g(x)\Vert \le \delta \kappa (x)\sqrt{\epsilon _2}\), where \(\kappa (x)\) depends on (the eigenvalues of) H(x).

Note also that computing \(\phi _{f,1}^\delta (x)\) simply results from (2.4) and that, in particular, \(\phi _{f,1}^1(x)=\Vert g(x)\Vert\). Computing \(\phi _{f,2}^\delta (x)\) is a standard Euclidean trust-region step calculation (see [7, Chapter 7], for instance).

For \(j\in \{1,2\}\), we then say that an iterate \(x_k\) is an \(\epsilon\)-approximate j-th order minimizer if

$$\begin{aligned} \phi _{f,i}^\delta (x_k) \le \epsilon _i \frac{\delta ^i}{i} \;\;\;\;\;\; \text{ for } \text{ some } \;\; \delta \in (0,1] \;\; \text{ and } \text{ all } \;\; 1\le i \le j, \end{aligned}$$
(2.5)

where \(\epsilon = (\epsilon _1, \ldots , \epsilon _j)\). There are two ways to express how fast an algorithm tends to such points in the worst case. The first (the “\(\epsilon\)-orders”) is to assume \(\epsilon\) is given and then give a bound on the maximum number of iterations and evaluations that are needed to satisfy (2.5). In this paper we focus on the second (the “k-orders”), where one instead gives an upper bound (Footnote 3) on \(\phi _{f,j}^\delta (x_k)\) as a function of k (for specified j and \(\delta\)).

2.2 The ASTR2 class

After these preliminaries, we now introduce the new ASTR2 class of algorithms. Methods in this class are of “adaptively scaled trust-region” type, a term we now briefly explain. Classical trust-region algorithms (see [7] for an in-depth coverage or [21] for a more recent survey) are iterative. At each iteration, they define a local model of the objective function which is deemed trustable within the “trust region”, a ball of given radius centered at the current iterate. A step and corresponding trial point are then computed by (possibly approximately) minimizing this model in the trust region. The objective function value is then computed at the trial point, and this point is accepted as the new iterate if the ratio of the achieved reduction in the objective function to that predicted by the model is sufficiently large. The radius of the trust region is then updated using the value of this ratio. As is clear from this description, these methods are intrinsically dependent on the evaluation of the objective function, and therefore not suited to our Objective-Function Free Optimization (OFFO) context. Here we follow [14] in interpreting the mechanism designed for the Adagrad methods [10] as an alternative trust-region design not using function evaluations. In this interpretation, the trial point is always accepted and the trust-region radius is determined by the gradient sizes, in a manner reminiscent also of [11]. In this approach, one uses scaling factors to determine the radius (hence the name of Adaptively Scaled Trust Region) at each iteration. Given these factors, which we will denote, at iteration k, by \(w_k^L\) and \(w_k^Q\), we may then state the ASTR2 class of algorithms as Algorithm 2.1 (ASTR2) below. This algorithm involves requirements on the step which are standard (and practical) for trust-region methods.

A few additional comments on this algorithm are now in order.

  1. 1.

    The algorithms in the ASTR2 class belong to the OFFO framework: the objective function is never evaluated (remember that \(\phi _{f,j}^1(x)\) can be computed without any such evaluation, the same being obviously true for \(\Delta q_k\), \(\Delta q_k^C\) and \(\Delta q_k^E\)).

  2. 2.

    Given our focus on k-orders of convergence, the algorithm does not include a termination criterion. It is however easy, should one be interested in \(\epsilon\)-orders instead, to test (2.5) for \(\delta =1\) and the considered \(\epsilon _1\) and \(\epsilon _2\) at the end of Step 1, and then terminate if this condition holds.

  3. 3.

    Despite their somewhat daunting statements, conditions (2.9)–(2.12) are relatively mild and have been extensively used for standard trust-region algorithms, both in theory and in practice. Condition (2.10) defines the so-called “Cauchy decrease”, which is the decrease achievable on the quadratic model \(T_{f,2}(x_k,s)\) in the steepest descent direction [7, Section 6.3.2]. Conditions (2.11) and (2.12) define the “eigen-point decrease”, which is that achievable along \(u_k\), a (\(\chi\)-approximate) eigenvector associated with the smallest Hessian eigenvalue [7, Section 6.6], when this eigenvalue is negative. We set \(\Delta q_k^E= 0\) for consistency when \(H_k\) is positive semi-definite. We discuss in Sect. 5 how these conditions can be ensured in practice, possibly approximately, for instance by the GLTR algorithm [13]. A small illustrative sketch of the computation of these two decreases is given after these comments.

  4. 4.

    The computation of \(\phi _k\) can be reused to compute \(s^Q_k\), should it be necessary. If \(\Delta ^Q_k>1\), the model minimization may be pursued beyond the boundary of the unit ball. If \(\Delta ^Q_k<1\), backtracking is also possible [7, Section 10.3.2].

Algorithm 2.1

ASTR2

Step 0: :

Initialization. A starting point \(x_0\) is given. The constants \(\tau ,\chi \in (0,1]\) and \(\xi \ge 1\) are also given. Set \(k=0\).

Step 1: :

Compute derivatives. Compute \(g_k = g(x_k)\) and \(H_k = H(x_k)\), as well as \(\phi _k {\mathop {=}\limits ^\textrm{def}}\phi _{f,2}^1(x_k)\) and \(\widehat{\phi }_k {\mathop {=}\limits ^\textrm{def}}\min [\phi _k,\xi ]\).

Step 2: :

Define the trust-region radii. Set

$$\begin{aligned} \Delta ^L_k = \frac{\Vert g_k\Vert }{w^L_k} \;\; \text{ and } \;\; \Delta ^Q_k = \frac{\widehat{\phi }_k}{w^Q_k} \end{aligned}$$
(2.6)

where \(w^L_k=w^L(x_0,\ldots ,x_k)\) and \(w^Q_k=w^Q(x_0,\ldots ,x_k)\).

Step 3: :

Step computation. If

$$\begin{aligned} \Vert g_k\Vert ^2 \ge \widehat{\phi }_k^3, \end{aligned}$$
(2.7)

then set

$$\begin{aligned} s_k = s^L_k = -\frac{g_k}{w^L_k}. \end{aligned}$$
(2.8)

Otherwise, set \(s_k= s^Q_k\), where \(s^Q_k\) is such that

$$\begin{aligned} \Vert s^Q_k\Vert \le \Delta ^Q_k \;\; \text{ and } \;\; \Delta q_k = f(x_k)-T_{f,2}(x_k,s_k^Q ) \ge \tau \max \big [\Delta q_k^C,\Delta q_k^E\big ] \end{aligned}$$
(2.9)

where

$$\begin{aligned} \Delta q_k^C = \max _{_{{\mathop {\scriptstyle \alpha \Vert g_k\Vert \le \Delta _k^Q}\limits ^{\scriptstyle \alpha \ge 0}}}} \Big [f(x_k)-T_{f,2}(x_k,-\alpha g_k )\Big ] \end{aligned}$$
(2.10)

and \(\Delta q_k^E= 0\) if \(\lambda _{\min }[H_k]\ge 0\), or

$$\begin{aligned} \Delta q_k^E = \max _{_{{\mathop {\scriptstyle \alpha \le \Delta _k^Q}\limits ^{\scriptstyle \alpha \ge 0}}}} \Big [f(x_k)-T_{f,2}(x_k, \alpha u_k )\Big ] \end{aligned}$$
(2.11)

with \(u_k\) satisfying

$$\begin{aligned} u_k^TH_ku_k \le \chi \, \lambda _{\min }[H_k], \;\;\;\;u_k^Tg_k \le 0 \;\; \text{ and } \;\; \Vert u_k\Vert =1, \end{aligned}$$
(2.12)

if \(\lambda _{\min }[H_k] < 0\).

Step 4: :

New iterate. Define

$$\begin{aligned} x_{k+1} = x_k + s_k, \end{aligned}$$
(2.13)

increment k by one and return to Step 1.

  1. 5.

    Note that two scaling factors are updated from iteration to iteration: one for first-order models and one for second-order ones. It does indeed make sense to trust these two types of models in regions of different sizes, as Taylor’s theory suggests that second-order models may be reasonably accurate in larger neighbourhoods.

  2. 6.

    A “componentwise” version where the trust region is defined in the \(\Vert \cdot \Vert _\infty\) norm is possible with

    $$\begin{aligned} \phi _{i,k} = \max \left[ \phi _k, - \min _{|\alpha | \le 1}\left( \alpha g_{i,k} +{\scriptstyle \frac{1}{2}}\alpha ^2 [H_k]_{i,i}\right) \right] \end{aligned}$$

    and

    $$\begin{aligned} \Delta ^L_{i,k} = \frac{|g_{i,k}|}{w^L_{i,k}} \;\; \text{ and } \;\; \Delta ^Q_{i,k} = \frac{\min [\xi , \phi _{i,k}]}{w^Q_{i,k}}, \end{aligned}$$

    where \(g_{i,k}\) denotes the ith component of \(g_k\), and where \(w_{i,k}^L\), \(w_{i,k}^Q\) and \(\Delta _{i,k}\) are the ith components of \(w_k^L\), \(w_k^Q\) and \(\Delta _k\) (now vectors in \(\mathbb{R}^n\)). We will not explicitly consider this variant to keep our notations reasonably simple.
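As mentioned in the comments above, the Cauchy and eigen-point decreases have simple closed forms. The sketch below (illustrative helper names; the exact smallest eigenvector is used for \(u_k\), i.e. \(\chi =1\)) computes \(\Delta q_k^C\) and \(\Delta q_k^E\) of (2.10)–(2.12); any step \(s^Q_k\) within the radius whose model decrease reaches a fraction \(\tau\) of the larger of the two then satisfies (2.9).

```python
import numpy as np

def cauchy_decrease(g, H, deltaQ):
    """Delta q_k^C in (2.10): largest decrease of T_{f,2} along -g within radius deltaQ."""
    gg = g @ g
    if gg == 0.0:
        return 0.0
    gHg = g @ (H @ g)
    alpha_max = deltaQ / np.sqrt(gg)               # largest alpha with ||alpha g|| <= deltaQ
    alpha = alpha_max if gHg <= 0 else min(gg / gHg, alpha_max)
    return alpha * gg - 0.5 * alpha**2 * gHg       # f(x_k) - T_{f,2}(x_k, -alpha g_k)

def eigen_decrease(g, H, deltaQ):
    """Delta q_k^E in (2.11)-(2.12), here with chi = 1 (exact smallest eigenvector);
    zero when H is positive semi-definite."""
    w, V = np.linalg.eigh(H)
    if w[0] >= 0:
        return 0.0
    u = V[:, 0] if g @ V[:, 0] <= 0 else -V[:, 0]  # enforce u^T g <= 0, ||u|| = 1
    # -(alpha u^T g + 0.5 alpha^2 u^T H u) is nondecreasing on [0, deltaQ] because
    # u^T g <= 0 and u^T H u = lambda_min < 0, so the maximum is at alpha = deltaQ.
    return -(deltaQ * (g @ u) + 0.5 * deltaQ**2 * w[0])
```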

Our assumption that the gradient and Hessian are Lipschitz continuous (AS.2) ensures the following standard result.

Lemma 2.2

[1] or [5, Theorem A.8.3] Suppose that AS.1 and AS.2 hold. Then

$$\begin{aligned} f(x_k+s^L_k)- f(x_k) \le \langle g_k,s^L_k \rangle + \frac{L_1}{2} \Vert s^L_k\Vert ^2 \end{aligned}$$
(2.14)

and

$$\begin{aligned} f(x_k+s^Q_k) - f(x_k) \le -\Delta q_k + \frac{L_2}{6}\Vert s^Q_k\Vert ^3. \end{aligned}$$
(2.15)

The first step in analyzing the convergence of the ASTR2 algorithm is to derive bounds on the objective function’s change from iteration to iteration, depending on which step (linear with \(s_k=s_k^L\), or quadratic with \(s_k=s_k^Q\)) is chosen. We start with a few auxiliary results on the relations between first- and second-order optimality measures.

Lemma 2.3

Suppose that H is an \(n\times n\) symmetric positive semi-definite matrix and \(g \in \mathbb{R}^n\), and consider the (convex) quadratic \(q(d) = \langle g,d \rangle + {\scriptstyle \frac{1}{2}}\langle d,Hd \rangle\). Then

$$\begin{aligned} \phi _{q,2}^1(0) = \left| \min _{\Vert d\Vert \le 1}q(d)\right| \le \Vert g\Vert . \end{aligned}$$
(2.16)

Proof

From the definition of the gradient, we have that

$$\begin{aligned} \Vert g\Vert = \left| \min _{\Vert d\Vert \le 1}\langle g,d \rangle \right| . \end{aligned}$$

But \(\langle g,d \rangle\) defines the supporting hyperplane of q(d) at \(d=0\) and thus the convexity of q implies that \(q(d) \ge \langle g,d \rangle\) for all d. Hence

$$\begin{aligned} \left| \min _{\Vert d\Vert \le 1}q(d)\right| \le \left| \min _{\Vert d\Vert \le 1}\langle g,d \rangle \right| \end{aligned}$$

and (2.16) follows. \(\square\)

Lemma 2.4

Suppose that

$$\begin{aligned} 0 < \eta _k\le {\scriptstyle \frac{1}{2}}\phi _k \end{aligned}$$
(2.17)

where

$$\begin{aligned} \eta _k {\mathop {=}\limits ^\textrm{def}}\max \bigg (0,-\lambda _{\min }[H_k]\bigg ). \end{aligned}$$
(2.18)

Then

$$\begin{aligned} {\scriptstyle \frac{1}{2}}\phi _k \le \Vert g_k\Vert . \end{aligned}$$
(2.19)

Proof

Observe first that (2.17) implies that \(\lambda _{\min }[H_k] < 0\) and \(\eta _k = \left| \lambda _{\min }[H_k]\right|\). Let \(d_k\) be a solution of the optimization problem defining \(\phi _k\), i.e.,

$$\begin{aligned} d_k = {{\,\mathrm{arg\,min}\,}}_{\Vert d\Vert \le 1}T_{f,2}(x_k,d), \end{aligned}$$

so that \(\phi _k = f_k-T_{f,2}(x_k,d_k)\). Since \(\lambda _{\min }[H_k] < 0\), it is known from trust-region theory [7, Corollary.2.2] that \(d_k\) may be chosen such that \(\Vert d_k\Vert = 1\). Now define

$$\begin{aligned} q_0(d) {\mathop {=}\limits ^\textrm{def}}\langle g_k,d \rangle + {\scriptstyle \frac{1}{2}}\langle d,(H_k-2\lambda _{\min }[H_k] I)d \rangle = T_{f,2}(x_k,d) - f_k + \eta _k \Vert d\Vert ^2 \end{aligned}$$

and note that \(q_0(d)\) is convex by construction. Then, at \(d_k\),

$$\begin{aligned} q_0(d_k) = -\phi _k + \eta _k \end{aligned}$$

and (2.17) implies that \(q_0(d_k) < 0\). Moreover,

$$\begin{aligned} {\scriptstyle \frac{1}{2}}\phi _k \le - q_0(d_k) \le -\langle g_k,d_k \rangle \le \Vert g_k\Vert , \end{aligned}$$

where the first inequality follows from (2.17) and the value of \(q_0(d_k)\), the second from the convexity of \(q_0\) (which implies \(q_0(d) \ge \langle g_k,d \rangle\) for all d), and the third from the Cauchy-Schwarz inequality together with \(\Vert d_k\Vert \le 1\). This proves (2.19). \(\square\)

Using these results, we may now prove a crucial property of the objective-function change. For this purpose, we partition the iterations into two sets, depending on which type of step is chosen, that is

$$\begin{aligned} \mathcal{K}^L = \{ k \ge 0 \mid s_k = s_k^L \} \;\; \text{ and } \;\; \mathcal{K}^Q = \{ k \ge 0 \mid s_k = s_k^Q \}. \end{aligned}$$

Lemma 2.5

Suppose that AS.1 and AS.2 hold. Then

$$\begin{aligned} f_{k+1}-f_k \le - \frac{\Vert g_k\Vert ^2}{w^L_k} + \frac{L_1}{2}\frac{\Vert g_k\Vert ^2}{(w^L_k)^2} \;\; \text{ for } \;\; k \in \mathcal{K}^L \end{aligned}$$
(2.20)

and

$$\begin{aligned} f_{k+1}-f_k \le - \frac{\tau }{4\xi } \min \left[ \frac{1}{2(1+L_1)},\frac{1}{w^Q_k},\frac{1}{(w^Q_k)^2}\right] \,\widehat{\phi }_k^3 + \frac{L_2}{6} \frac{\widehat{\phi }_k^3}{(w^Q_k)^3} \;\; \text{ for } \;\; k \in \mathcal{K}^Q. \end{aligned}$$
(2.21)

Proof

Suppose first that \(s_k=s^L_k\). Then (2.14), (2.8) and (2.6) ensure that

$$\begin{aligned} f_{k+1}-f_k \le -\frac{\Vert g_k\Vert ^2}{w^L_k} + \frac{L_1}{2}(\Delta ^L_k)^2 = -\frac{\Vert g_k\Vert ^2}{w^L_k} + \frac{L_1}{2}\frac{\Vert g_k\Vert ^2}{(w^L_k)^2}, \end{aligned}$$
(2.22)

giving (2.20).

Suppose now that \(s_k= s^Q_k\), i.e. \(k\in \mathcal{K}^Q\). Then, because of (2.9)–(2.12), the decrease \(\Delta q_k\) in the quadratic model \(T_{f,2}(x_k,s)\) at \(s_k\) is at least a fraction \(\tau\) of the maximum of the Cauchy and eigen-point decreases given by \(\Delta q_k^C\) and \(\Delta q_k^E\). Standard trust-region theory (see [7, Lemmas 6.3.2 and 6.6.1] for instance) then ensures that, for possibly non-convex \(T_{f,2}(x_k,s)\),

$$\begin{aligned} \begin{array}{lcl} \Delta q_k &{} \ge &{} \tau \max \left[ \frac{\displaystyle 1}{\displaystyle 2}\min \left( \frac{\displaystyle \Vert g_k\Vert ^2}{\displaystyle 1+\Vert H_k\Vert },\Vert g_k\Vert \Delta ^Q_k\right) , \frac{\displaystyle \eta _k}{\displaystyle 2} (\Delta ^Q_k)^2\right] \\&{} \ge &{} \frac{\displaystyle \tau }{\displaystyle 2}\max \left[ \min \left( \frac{\displaystyle \Vert g_k\Vert ^2}{\displaystyle 1+L_1},\frac{\displaystyle \Vert g_k\Vert \widehat{\phi }_k}{\displaystyle w^Q_k}\right) , \frac{\displaystyle \eta _k\widehat{\phi }_k^2}{\displaystyle (w^Q_k)^2}\right] \end{array} \end{aligned}$$

where we used the bound \(\Vert H_k\Vert \le L_1\) and (2.6) to derive the last inequality. If \(\eta _k\le {\scriptstyle \frac{1}{2}}\phi _k\), then, using Lemma 2.4 and the inequality \(\phi _k\ge \widehat{\phi }_k\),

$$\begin{aligned} \Delta q_k \ge \frac{\displaystyle \tau }{\displaystyle 2}\min \left( \frac{\displaystyle \Vert g_k\Vert ^2}{\displaystyle 1+L_1},\frac{\displaystyle \Vert g_k\Vert \widehat{\phi }_k}{\displaystyle w^Q_k}\right) \ge \frac{\displaystyle \tau }{\displaystyle 2}\min \left( \frac{\displaystyle ({\scriptstyle \frac{1}{2}}\widehat{\phi }_k)^2}{\displaystyle 1+L_1},\frac{\displaystyle ({\scriptstyle \frac{1}{2}}\widehat{\phi }_k)\widehat{\phi }_k}{\displaystyle w^Q_k}\right) . \end{aligned}$$

Now \(\widehat{\phi }_k^3 \le \xi \widehat{\phi }_k^2\) and thus

$$\begin{aligned} \Delta q_k \ge \frac{\displaystyle \tau }{\displaystyle 2}\min \left( \frac{\displaystyle \widehat{\phi }_k^3}{\displaystyle 4\xi (1+L_1)}, \frac{\displaystyle \widehat{\phi }_k^3}{\displaystyle 2\xi w^Q_k}\right) . \end{aligned}$$
(2.23)

If instead \(\eta _k > {\scriptstyle \frac{1}{2}}\phi _k\ge {\scriptstyle \frac{1}{2}}\widehat{\phi }_k\), then

$$\begin{aligned} \Delta q_k \ge \frac{\displaystyle \tau }{\displaystyle 2}\frac{\displaystyle \eta _k\widehat{\phi }_k^2}{\displaystyle (w^Q_k)^2} \ge \frac{\displaystyle \tau }{\displaystyle 2}\frac{\displaystyle ({\scriptstyle \frac{1}{2}}\widehat{\phi }_k)\widehat{\phi }_k^2}{\displaystyle (w^Q_k)^2}. \end{aligned}$$
(2.24)

Given that, if \(k\in \mathcal{K}^Q\), \(\Vert s_k\Vert \le \Delta _k^Q= \widehat{\phi }_k/w_k^Q\), we deduce (2.21) from (2.15), (2.23) and (2.24). \(\square\)

Observe that neither (2.20) nor (2.21) guarantees that the objective function values are monotonically decreasing.

3 An Adagrad-like algorithm for second-order optimality

We first consider a choice of scaling factors directly inspired by the Adagrad algorithm [10] and assume that, for some \(\varsigma > 0\), \(\mu ,\nu \in (0,1)\), \(\vartheta _L,\vartheta _Q\in (0,1]\) and all \(k\ge 0\),

$$\begin{aligned} w_k^L \in [\vartheta _L \hat{w}_k^L, \hat{w}_k^L] \;\; \text{ where } \;\; \hat{w}^L_k = \left( \varsigma + \sum _{_{{\mathop {\scriptstyle \ell \in \mathcal{K}^L}\limits ^{\scriptstyle \ell =0}}}}^k \Vert g_\ell \Vert ^2\right) ^\mu \end{aligned}$$
(3.1)

and

$$\begin{aligned} w_k^Q \in [\vartheta _Q \hat{w}_k^Q,\hat{w}_k^Q] \;\; \text{ where } \;\; \hat{w}^Q_k = \left( \varsigma + \sum _{_{{\mathop {\scriptstyle \ell \in \mathcal{K}^Q}\limits ^{\scriptstyle \ell =0}}}}^k \widehat{\phi }_\ell ^3\right) ^\nu . \end{aligned}$$
(3.2)

Note that selecting the parameters \(\vartheta _L\) and \(\vartheta _Q\) strictly less than one allows the scaling factors \(w_k^L\) and \(w_k^Q\) to be chosen in an interval at each iteration without any monotonicity.
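For illustration only, the following sketch assembles one possible realization of this Adagrad-like ASTR2 variant with \(\vartheta _L=\vartheta _Q=1\) (so that \(w_k=\hat{w}_k\) exactly); it reuses the `trs` helper sketched after (2.3), whose exact trust-region step automatically satisfies (2.9) for any \(\tau \le 1\). The callables `g` and `H`, returning the gradient and Hessian, are assumptions of the sketch, not part of the algorithm statement.

```python
import numpy as np

def astr2_adagrad(g, H, x0, niter=100, varsigma=1e-2, mu=0.5, nu=1.0/3, xi=1.0):
    """Minimal sketch of the ASTR2 loop with the Adagrad-like scalings (3.1)-(3.2)
    (theta_L = theta_Q = 1).  `g` and `H` are callables returning gradient and
    Hessian; `trs` is the exact trust-region solver sketched earlier."""
    x = np.asarray(x0, dtype=float)
    sum_g2 = sum_phi3 = 0.0                        # running sums in (3.1) and (3.2)
    for _ in range(niter):
        gk, Hk = g(x), H(x)
        phi_k, _ = trs(gk, Hk, delta=1.0)          # phi_k = phi_{f,2}^1(x_k), Step 1
        phih_k = min(phi_k, xi)                    # hat phi_k
        if gk @ gk >= phih_k**3:                   # test (2.7): take the linear step
            sum_g2 += gk @ gk
            wL = (varsigma + sum_g2) ** mu         # hat w_k^L, see (3.1)
            s = -gk / wL                           # step (2.8)
        else:                                      # take the quadratic step
            sum_phi3 += phih_k**3
            wQ = (varsigma + sum_phi3) ** nu       # hat w_k^Q, see (3.2)
            _, s = trs(gk, Hk, delta=phih_k / wQ)  # exact step in the radius (2.6)
        x = x + s                                  # the trial point is always accepted
    return x

# Example (quadratic objective ||x||^2, never evaluated): gradient decreases to zero.
x_final = astr2_adagrad(lambda x: 2 * x, lambda x: 2 * np.eye(x.size), np.ones(3))
```

Taking the exact subproblem minimizer is only one convenient way to fulfil (2.9); truncated solvers such as GLTR (see Sect. 5) would serve equally well.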

We now present two technical lemmas which will be necessary in our analysis. The first states useful results for a specific class of inequalities.

Lemma 3.1

Let \(a\ge {\scriptstyle \frac{1}{2}}\varsigma\) and \(b\ge {\scriptstyle \frac{1}{2}}\varsigma\). Suppose that, for some \(\theta _a\ge 1\), \(\theta _b\ge 1\), \(\theta \ge 0\), \(\mu \in (0,1)\), and \(\nu \in (0,1)\)

$$\begin{aligned} a^{1-\mu } + b^{1-2\nu } \le \theta _a A(a) + \theta _b B(b) + \theta \end{aligned}$$
(3.3)

where A(a) and B(b) are given, as a function of \(\mu\) and \(\nu\), by

 

$$\begin{aligned} \begin{array}{c|ccc} & \mu <{\scriptstyle \frac{1}{2}} & \mu ={\scriptstyle \frac{1}{2}} & \mu >{\scriptstyle \frac{1}{2}} \\ \hline A(a) & a^{1-2\mu } & \log (2a) & 0 \end{array} \;\;\;\; \text{ and } \;\;\;\; \begin{array}{c|ccc} & \nu <{\scriptstyle \frac{1}{3}} & \nu ={\scriptstyle \frac{1}{3}} & \nu >{\scriptstyle \frac{1}{3}} \\ \hline B(b) & b^{1-3\nu } & \log (2b) & 0 \end{array} \end{aligned}$$

Then there exist positive constants \(\kappa _a\) and \(\kappa _b\) only depending on \(\theta _a\), \(\theta _b\), \(\theta\), \(\mu\) and \(\nu\) such that

$$\begin{aligned} a\le \kappa _a \;\; \text{ and } \;\; b \le \kappa _b. \end{aligned}$$
(3.4)

Proof

This result is proved by comparing the value of the left- and right-hand sides for possibly large a and b. The details are given in Lemmas 7.2–7.8 in the appendix, whose results are then combined as shown in Table 1. The details of the constants \(\kappa _a\) and \(\kappa _b\) for the various cases are explicitly given in the statements of the relevant lemmas.

Table 1 Lemmas for combinations of \(\mu\) and \(\nu\)

The second auxiliary result is a bound extracted from [14] (see also [9, 20] for the case \(\alpha =1\)).

Lemma 3.2

Let \(\{c_k\}\) be a non-negative sequence, \(\varsigma >0\), \(\alpha > 0\), \(\nu \ge 0\) and define, for each \(k \ge 0\), \(d_k = \sum _{j=0}^k c_j\). If \(\alpha \ne 1\), then

$$\begin{aligned} \sum _{j=0}^k \frac{c_j}{(\varsigma +d_j)^{\alpha }} \le \frac{1}{(1-\alpha )} ( (\varsigma + d_k)^{1- \alpha } - \varsigma ^{1- \alpha } ). \end{aligned}$$
(3.5)

Otherwise,

$$\begin{aligned} \sum _{j=0}^k\frac{c_j}{(\varsigma +d_j)} \le \log \left( \frac{\varsigma + d_k}{\varsigma } \right) . \end{aligned}$$
(3.6)

Note that, if \(\alpha > 1\), then the bound (3.5) can be rewritten as

$$\begin{aligned} \sum _{j=0}^k \frac{ c_j}{(\varsigma +d_j)^{\alpha }} \le \frac{1}{\alpha -1} \Big ( \varsigma ^{1-\alpha } -(\varsigma + d_k)^{1-\alpha }\Big ), \end{aligned}$$

whose right-hand side is positive.
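As a quick, purely illustrative numerical sanity check of (3.5) and (3.6), one may verify the bounds on an arbitrary non-negative sequence (the values chosen below are assumptions of the check, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
c = rng.uniform(0.0, 2.0, size=1000)        # arbitrary non-negative sequence {c_k}
d = np.cumsum(c)                            # d_k = sum_{j=0}^k c_j
varsigma = 0.5

for alpha in (0.5, 1.0, 2.0):
    lhs = np.sum(c / (varsigma + d) ** alpha)
    if alpha == 1.0:
        rhs = np.log((varsigma + d[-1]) / varsigma)                    # bound (3.6)
    else:
        rhs = ((varsigma + d[-1]) ** (1 - alpha)
               - varsigma ** (1 - alpha)) / (1 - alpha)                # bound (3.5)
    print(f"alpha = {alpha}: {lhs:.4f} <= {rhs:.4f} : {lhs <= rhs}")
```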

Armed with the above results, we are now in a position to specify particular choices of the scaling factors \(w_k^L\) and \(w_k^Q\) and derive the convergence properties of the resulting variants of ASTR2.

Theorem 3.3

Suppose that AS.1–AS.3 hold and that the ASTR2 algorithm is applied to problem (2.1), where \(w^L_k\) and \(w^Q_k\) are given by (3.1) and (3.2), respectively. Then there exists a positive constant \(\kappa _{ \text{ ASTR2 }}\) only depending on the problem-related quantities \(x_0\), \(f_{\textrm{low}}\), \(L_1\) and \(L_2\) and on the algorithmic parameters \(\varsigma\), \(\tau\), \(\xi\), \(\mu\) and \(\nu\) such that

$$\begin{aligned} {{\,\textrm{average}\,}}_{j\in \{ 0, \ldots , k \}} \Vert g_j\Vert ^2 \le \frac{\kappa _{ \text{ ASTR2 }}}{k+1} \;\; \text{ and } \;\; {{\,\textrm{average}\,}}_{j\in \{ 0, \ldots , k \}} \widehat{\phi }_j^3 \le \frac{\kappa _{ \text{ ASTR2 }}}{k+1}, \end{aligned}$$
(3.7)

and therefore that

$$\begin{aligned} \min _{j\in \{ 0, \ldots , k \}} \Vert g_j\Vert \le \frac{\kappa _{ \text{ ASTR2 }}}{(k+1)^{\scriptstyle \frac{1}{2}}} \;\; \text{ and } \;\; \min _{j\in \{ 0, \ldots , k \}} \widehat{\phi }_j \le \frac{\kappa _{ \text{ ASTR2 }}}{(k+1)^{\scriptstyle \frac{1}{3}}}. \end{aligned}$$
(3.8)

Proof

To simplify notations in the proof, define

$$\begin{aligned} a_k = 2\sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^L}\limits ^{\scriptstyle j=0}}}}^k \Vert g_j\Vert ^2 \;\; \text{ and } \;\; b_k = 2\sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^Q}\limits ^{\scriptstyle j=0}}}}^k \widehat{\phi }_j^3. \end{aligned}$$
(3.9)

Consider first an iteration index \(j\in \mathcal{K}^L\). Then (2.20) (expressed for \(j\ge 0\)), (3.1) and the inequality \(\tau \le 1\) give that

$$\begin{aligned} f(x_{j+1})-f(x_j) \le - \frac{\displaystyle \tau }{\displaystyle 2} \frac{\displaystyle \Vert g_j\Vert ^2}{\displaystyle w^L_j} + \frac{\displaystyle L_1}{\displaystyle 2\vartheta _L^2} \frac{\displaystyle \Vert g_j\Vert ^2}{\displaystyle (w^L_j)^2} \le - \frac{\displaystyle \tau }{\displaystyle 2} \frac{\displaystyle \Vert g_j\Vert ^2}{\displaystyle (\varsigma +{\scriptstyle \frac{1}{2}}a_j)^\mu } + \frac{\displaystyle L_1}{\displaystyle 2\vartheta _L^2} \frac{\displaystyle \Vert g_j\Vert ^2}{\displaystyle (\varsigma +{\scriptstyle \frac{1}{2}}a_j)^{2\mu }}. \end{aligned}$$
(3.10)

Suppose now that \(j\in \mathcal{K}^Q\). Then (2.21) and (3.2) imply that

$$\begin{aligned} f_{j+1}-f_j \le - \frac{\tau }{4\xi }\min \left[ \frac{\widehat{\phi }_j^3}{2(1+L_1)}, \frac{\widehat{\phi }_j^3}{(\varsigma +{\scriptstyle \frac{1}{2}}b_j)^\nu }, \frac{\widehat{\phi }_j^3}{(\varsigma +{\scriptstyle \frac{1}{2}}b_j)^{2\nu }} \right] + \frac{L_2}{6\vartheta _Q^3} \frac{\widehat{\phi }_j^3}{(\varsigma +{\scriptstyle \frac{1}{2}}b_j)^{3\nu }}. \end{aligned}$$
(3.11)

Suppose now that

$$\begin{aligned} a_j> 2\varsigma \;\; \text{ and } \;\; b_j > \max \left[ 1, 2\varsigma ,\Big (2(1+L_1)\Big )^{\scriptstyle \frac{1}{\nu }}\right] , \end{aligned}$$
(3.12)

which implies that

$$\begin{aligned} w_j^L \le a_j^\mu , \;\;\;\;w_j^Q \le b_j^\nu \;\; \text{ and } \;\; 2(1+L_1) \le b_j^\nu . \end{aligned}$$

Then combining (3.10) and (3.11), the inequality \(\xi \ge 1\) and AS.3, we deduce that, for all \(k\ge 0\),

$$\begin{aligned} f(x_0)-f_{\textrm{low}}\ge \frac{\tau }{4\xi } \left[ \displaystyle \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^L}\limits ^{\scriptstyle j=0}}}}^k \frac{\displaystyle \Vert g_j\Vert ^2}{\displaystyle a_j^\mu } + \displaystyle \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^Q}\limits ^{\scriptstyle j=0}}}}^k \frac{\displaystyle \widehat{\phi }_j^3}{\displaystyle b_j^{2\nu }}\right] - \frac{\displaystyle L_1}{\displaystyle 2\vartheta _L^2} \displaystyle \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^L}\limits ^{\scriptstyle j=0}}}}^k \frac{\displaystyle \Vert g_j\Vert ^2}{\displaystyle (w_j^L)^2} - \frac{\displaystyle L_2}{\displaystyle 6\vartheta _Q^3} \displaystyle \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^Q}\limits ^{\scriptstyle j=0}}}}^k \frac{\displaystyle \widehat{\phi }_j^3}{\displaystyle (w_j^Q)^3}. \end{aligned}$$

But, by definition, \(a_j\le a_k\) and \(b_j\le b_k\) for \(j\le k\), and thus, for all \(k\ge 0\),

$$\begin{aligned} a_k^{1-\mu } + b_k^{1-2\nu } \le \frac{\displaystyle 4\xi (f(x_0)-f_{\textrm{low}})}{\displaystyle \tau } + \frac{\displaystyle 2\xi L_1}{\displaystyle \tau \vartheta _L^2} \displaystyle \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^L}\limits ^{\scriptstyle j=0}}}}^k \frac{\displaystyle \Vert g_j\Vert ^2}{\displaystyle (w_j^L)^2} + \frac{\displaystyle 2\xi L_2}{\displaystyle 3\tau \vartheta _Q^3} \displaystyle \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^Q}\limits ^{\scriptstyle j=0}}}}^k \frac{\displaystyle \widehat{\phi }_j^3}{\displaystyle (w_j^Q)^3}. \end{aligned}$$
(3.13)

We now have to bound the last two terms on the right-hand side of (3.13). Using (3.1) and Lemma 3.2 with \(\{c_k\} = \{\Vert g_k\Vert ^2\}_{k\in \mathcal{K}^L}\) and \(\alpha = 2\mu\) gives that

$$\begin{aligned} \displaystyle \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^L}\limits ^{\scriptstyle j=0}}}}^k \frac{\displaystyle \Vert g_j\Vert ^2}{\displaystyle (w^L_j)^2} \le \frac{1}{\vartheta _L^2(1-2\mu )} \left( \Big (\varsigma + \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^L}\limits ^{\scriptstyle j=0}}}}^k\Vert g_j\Vert ^2\Big )^{1-2\mu } - \varsigma ^{1-2\mu } \right) \le \frac{a_k^{1-2\mu }}{\vartheta _L^2(1-2\mu )} \end{aligned}$$
(3.14)

if \(\mu < {\scriptstyle \frac{1}{2}}\), and

$$\begin{aligned} \displaystyle \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^L}\limits ^{\scriptstyle j=0}}}}^k \frac{\displaystyle \Vert g_j\Vert ^2}{\displaystyle (w^L_j)^2} \le \frac{1}{\vartheta _L^2}\log \left( \frac{\varsigma + \sum _{j=0,j\in \mathcal{K}^L}^k\Vert g_j\Vert ^2}{\varsigma }\right) \le \frac{1}{\vartheta _L^2}\log \left( \frac{\varsigma +a_k}{\varsigma }\right) \end{aligned}$$
(3.15)

if \(\mu = {\scriptstyle \frac{1}{2}}\) and

$$\begin{aligned} \displaystyle \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^L}\limits ^{\scriptstyle j=0}}}}^k \frac{\displaystyle \Vert g_j\Vert ^2}{\displaystyle (w^L_j)^2} \le \frac{1}{\vartheta _L^2(2\mu -1)} \left( \varsigma ^{1-2\mu } - \big (\varsigma + \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^L}\limits ^{\scriptstyle j=0}}}}^k\Vert g_j\Vert ^2\big )^{1-2\mu }\right) \le \frac{\varsigma ^{1-2\mu }}{\vartheta _L^2(2\mu -1)} \end{aligned}$$
(3.16)

if \(\mu > {\scriptstyle \frac{1}{2}}\). Similarly, using (3.2) and Lemma 3.2 with \(\{c_k\} = \{\widehat{\phi }_k^3\}_{k\in \mathcal{K}^Q}\) and \(\alpha = 3\nu\) yields that

$$\begin{aligned} \displaystyle \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^Q}\limits ^{\scriptstyle j=0}}}}^k \frac{\displaystyle \widehat{\phi }_j^3}{\displaystyle (w^Q_j)^3} \le \frac{1}{\vartheta _Q^3(1-3\nu )} \left( \big (\varsigma + \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^Q}\limits ^{\scriptstyle j=0}}}}^k\widehat{\phi }_j^3\big )^{1-3\nu } - \varsigma ^{1-3\nu } \right) \le \frac{b_k^{1-3\nu }}{\vartheta _Q^3(1-3\nu )} \end{aligned}$$
(3.17)

if \(\nu < {\scriptstyle \frac{1}{3}}\),

$$\begin{aligned} \displaystyle \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^Q}\limits ^{\scriptstyle j=0}}}}^k \frac{\displaystyle \widehat{\phi }_j^3}{\displaystyle (w^Q_j)^3} \le \frac{1}{\vartheta _Q^3}\log \left( \frac{\varsigma + \sum _{j=0,j\in \mathcal{K}^Q}^k\widehat{\phi }_j^3}{\varsigma }\right) \le \frac{1}{\vartheta _Q^3}\log \left( \frac{\varsigma +b_k}{\varsigma }\right) \end{aligned}$$
(3.18)

if \(\nu = {\scriptstyle \frac{1}{3}}\), and

$$\begin{aligned} \displaystyle \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^Q}\limits ^{\scriptstyle j=0}}}}^k \frac{\displaystyle \widehat{\phi }_j^3}{\displaystyle (w^Q_j)^3} \le \frac{1}{\vartheta _Q^3(3\nu -1)} \left( \varsigma ^{1-3\nu } - \big (\varsigma + \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^Q}\limits ^{\scriptstyle j=0}}}}^k\widehat{\phi }_j^3\big )^{1-3\nu }\right) \le \frac{\varsigma ^{1-3\nu }}{\vartheta _Q^3(3\nu -1)} \end{aligned}$$
(3.19)

if \(\nu > {\scriptstyle \frac{1}{3}}\). Moreover, unless \(a_k< 1\), the argument of the logarithm in the right-hand side of (3.15) satisfies

$$\begin{aligned} 1 \le \frac{\varsigma +a_k}{\varsigma } \le 1+a_k \le 2a_k. \end{aligned}$$
(3.20)

Similarly, unless \(b_k< 1\), the argument of the logarithm in the right-hand side of (3.18) satisfies

$$\begin{aligned} 1 \le \frac{\varsigma +b_k}{\varsigma } \le 1+b_k \le 2b_k. \end{aligned}$$
(3.21)

Moreover, we may assume, without loss of generality, that \(L_1\) and \(L_2\) are large enough to ensure that

$$\begin{aligned} 2\xi L_1 \ge \tau \vartheta _L^2 \;\; \text{ and } \;\; 2\xi L_2 \ge 3 \tau \vartheta _Q^3. \end{aligned}$$

Because of these observations and since (3.13) together with one of (3.14)–(3.16) and one of (3.17)–(3.19) has the form of condition (3.3), we may then apply Lemma 3.1 for each \(k\ge 0\) with \(a=a_k\), \(b=b_k\) and the following associations:

\(\bullet\) for \(\mu \in (0,{\scriptstyle \frac{1}{2}})\), \(\nu \in (0,{\scriptstyle \frac{1}{3}})\):

$$\begin{aligned} \theta _a = \frac{\displaystyle 2\xi L_1}{\displaystyle \tau \vartheta _L^2(1-2\mu )}, \;\;\;\;\theta _b = \frac{\displaystyle 2\xi L_2}{\displaystyle 3\tau \vartheta _Q^3(1-3\nu )}, \;\;\;\;\theta = \frac{\displaystyle 4\xi (f(x_0)-f_{\textrm{low}})}{\displaystyle \tau }; \end{aligned}$$

\(\bullet\) for \(\mu ={\scriptstyle \frac{1}{2}}\), \(\nu \in (0,{\scriptstyle \frac{1}{3}})\):

$$\begin{aligned} \theta _a = \frac{\displaystyle 2\xi L_1}{\displaystyle \tau \vartheta _L^2}, \;\;\;\;\theta _b = \frac{\displaystyle 2\xi L_2}{\displaystyle 3\vartheta _Q^3\tau (1-3\nu )}, \;\;\;\;\theta = \frac{\displaystyle 4\xi (f(x_0)-f_{\textrm{low}})}{\displaystyle \tau }; \end{aligned}$$

\(\bullet\) for \(\mu \in ({\scriptstyle \frac{1}{2}},1)\), \(\nu \in (0,{\scriptstyle \frac{1}{3}})\):

$$\begin{aligned} \theta _a = 1, \;\;\;\;\theta _b = \frac{\displaystyle 2\xi L_2}{\displaystyle 3\tau \vartheta _Q^3(1-3\nu )}, \;\;\;\;\theta = \frac{\displaystyle 4\xi (f(x_0)-f_{\textrm{low}})}{\displaystyle \tau } + \frac{\displaystyle 2\xi L_1}{\displaystyle \tau \vartheta _L^2}\cdot \frac{\varsigma ^{1-2\mu }}{2\mu -1}; \end{aligned}$$

\(\bullet\) for \(\mu \in (0,{\scriptstyle \frac{1}{2}})\), \(\nu ={\scriptstyle \frac{1}{3}}\):

$$\begin{aligned} \theta _a = \frac{\displaystyle 2\xi L_1}{\displaystyle \tau \vartheta _L^2(1-2\mu )}, \;\;\;\;\theta _b = \frac{\displaystyle 2\xi L_2}{\displaystyle 3\tau \vartheta _Q^3}, \;\;\;\;\theta = \frac{\displaystyle 4\xi (f(x_0)-f_{\textrm{low}})}{\displaystyle \tau }; \end{aligned}$$

\(\bullet\) for \(\mu ={\scriptstyle \frac{1}{2}}\), \(\nu ={\scriptstyle \frac{1}{3}}\):

$$\begin{aligned} \theta _a = \frac{\displaystyle 2\xi L_1}{\displaystyle \tau \vartheta _L^2}, \;\;\;\;\theta _b = \frac{\displaystyle 2\xi L_2}{\displaystyle 3\tau \vartheta _Q^3}, \;\;\;\;\theta = \frac{\displaystyle 4\xi (f(x_0)-f_{\textrm{low}})}{\displaystyle \tau }; \end{aligned}$$

\(\bullet\) for \(\mu \in ({\scriptstyle \frac{1}{2}},1)\), \(\nu ={\scriptstyle \frac{1}{3}}\):

$$\begin{aligned} \theta _a = \frac{\displaystyle 2\xi L_1}{\displaystyle \tau \vartheta _L^2}, \;\;\;\;\theta _b = \frac{\displaystyle 2\xi L_2}{\displaystyle 3\vartheta _Q^3\tau }, \;\;\;\;\theta = \frac{\displaystyle 4\xi (f(x_0)-f_{\textrm{low}})}{\displaystyle \tau } +\frac{\displaystyle 2\xi L_1}{\displaystyle \tau \vartheta _L^2}\cdot \frac{\varsigma ^{1-2\mu }}{2\mu -1}; \end{aligned}$$

\(\bullet\) for \(\mu \in (0,{\scriptstyle \frac{1}{2}})\), \(\nu \in ({\scriptstyle \frac{1}{3}},1)\):

$$\begin{aligned} \theta _a = \frac{\displaystyle 2\xi L_1}{\displaystyle \tau \vartheta _L^2(1-2\mu )}, \;\;\;\;\theta _b = 1, \;\;\;\;\theta = \frac{\displaystyle 4\xi (f(x_0)-f_{\textrm{low}})}{\displaystyle \tau } + \frac{\displaystyle 2\xi L_2}{\displaystyle 3\tau \vartheta _Q^3}\cdot \frac{\varsigma ^{1-3\nu }}{3\nu -1}; \end{aligned}$$

\(\bullet\) for \(\mu ={\scriptstyle \frac{1}{2}}\), \(\nu \in ({\scriptstyle \frac{1}{3}}, 1)\):

$$\begin{aligned} \theta _a = \frac{\displaystyle 2\xi L_1}{\displaystyle \tau \vartheta _L^2}, \;\;\;\;\theta _b = 1, \;\;\;\;\theta = \frac{\displaystyle 4\xi (f(x_0)-f_{\textrm{low}})}{\displaystyle \tau } + \frac{\displaystyle 2\xi L_2}{\displaystyle 3\tau \vartheta _Q^3}\cdot \frac{\varsigma ^{1-3\nu }}{3\nu -1}; \end{aligned}$$

\(\bullet\) for \(\mu \in ({\scriptstyle \frac{1}{2}},1)\), \(\nu \in ({\scriptstyle \frac{1}{3}},1)\):

$$\begin{aligned} \theta _a = 1, \;\;\;\;\theta _b = 1, \;\;\;\;\theta = \frac{\displaystyle 4\xi (f(x_0)-f_{\textrm{low}})}{\displaystyle \tau } + \frac{\displaystyle 2\xi L_1}{\displaystyle \tau \vartheta _L^2}\cdot \frac{\varsigma ^{1-2\mu }}{2\mu -1} + \frac{\displaystyle 2\xi L_2}{\displaystyle 3\tau \vartheta _Q^3}\cdot \frac{\varsigma ^{1-3\nu }}{3\nu -1}. \end{aligned}$$

As a consequence of applying Lemma 3.1, we obtain that there exist positive constants (Footnote 4) \(\kappa _{ \text{1rst }}\ge 1\) and \(\kappa _{ \text{2nd }}\ge 1\) only depending on problem-related quantities and on \(\varsigma\), \(\xi\), \(\mu\) and \(\nu\) such that, for all \(k\ge 0\),

$$\begin{aligned} a_k \le \kappa _{ \text{1rst }} \;\; \text{ and } \;\; b_k \le \kappa _{ \text{2nd }}. \end{aligned}$$
(3.22)

We also have, from the mechanism of Step 3 of the algorithm [see (2.7)] and (3.9), that

$$\begin{aligned} \sum _{j=0}^k \Vert g_j\Vert ^2= & {} \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^L}\limits ^{\scriptstyle j=0}}}}^k\Vert g_j\Vert ^2 + \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^Q}\limits ^{\scriptstyle j=0}}}}^k\Vert g_j\Vert ^2 \le \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^L}\limits ^{\scriptstyle j=0}}}}^k\Vert g_j\Vert ^2 \quad +\sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^Q}\limits ^{\scriptstyle j=0}}}}^k\widehat{\phi }_j^3 \le {\scriptstyle \frac{1}{2}}(a_k+b_k) \le {\scriptstyle \frac{1}{2}}(\kappa _{ \text{1rst }} + \kappa _{ \text{2nd }}) \end{aligned}$$

and

$$\begin{aligned} \sum _{j=0}^k \widehat{\phi }_j^3 = \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^L}\limits ^{\scriptstyle j=0}}}}^k\widehat{\phi }_j^3 + \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^Q}\limits ^{\scriptstyle j=0}}}}^k\widehat{\phi }_j^3 \le \sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^L}\limits ^{\scriptstyle j=0}}}}^k\Vert g_j\Vert ^2+\sum _{_{{\mathop {\scriptstyle j\in \mathcal{K}^Q}\limits ^{\scriptstyle j=0}}}}^k\widehat{\phi }_j^3 \le {\scriptstyle \frac{1}{2}}(a_k+b_k) \le {\scriptstyle \frac{1}{2}}(\kappa _{ \text{1rst }} + \kappa _{ \text{2nd }}). \end{aligned}$$

These two inequalities in turn imply that, for all \(k\ge 0\),

$$\begin{aligned}{} & {} (k+1) {{\,\textrm{average}\,}}_{j\in \{ 0, \ldots , k \}} \Vert g_j\Vert ^2 \le {\scriptstyle \frac{1}{2}}(\kappa _{ \text{1rst }} + \kappa _{ \text{2nd }}) \;\;\; \text{ and } \; \quad (k+1) {{\,\textrm{average}\,}}_{j\in \{ 0, \ldots , k \}} \widehat{\phi }_j^3 \le {\scriptstyle \frac{1}{2}}(\kappa _{ \text{1rst }} + \kappa _{ \text{2nd }}), \end{aligned}$$

and the desired results follow with \(\kappa _{ \text{ ASTR2 }} = {\scriptstyle \frac{1}{2}}(\kappa _{ \text{1rst }} + \kappa _{ \text{2nd }})\). \(\square\)

Comments:

  1. 1.

    Note that \(\widehat{\phi }_k < \phi _k\) only when \(\phi _k > \xi\). Thus, if \(\phi _k\) is bounded (Footnote 5), one can choose \(\xi\) large enough to ensure that \(\phi _k=\widehat{\phi }_k\) for all k, and therefore that \(\min _{j\in \{ 0, \ldots , k \}}\phi _j \le \kappa _{ \text{ ASTR2 }}/(k+1)^{\scriptstyle \frac{1}{3}}\). In practice, \(\xi\) can be used to tune the algorithm’s sensitivity to second-order information.

  2. 2.

    If the k-orders of convergence specified by (3.8) are translated into \(\epsilon\)-orders, that is numbers of iterations/evaluations needed to achieve \(\Vert g(x_k)\Vert \le \epsilon _1\) and \(\phi _k = \widehat{\phi }_k \le \epsilon _2\), where \(\epsilon _1\) and \(\epsilon _2\) are prescribed accuracies, we verify that at most \(\mathcal{O}(\epsilon _1^{-2})\) of them are needed to achieve the first of these conditions (since, by (3.8), \(\kappa _{ \text{ ASTR2 }}/(k+1)^{1/2}\le \epsilon _1\) as soon as \(k+1\ge (\kappa _{ \text{ ASTR2 }}/\epsilon _1)^2\)), while at most \(\mathcal{O}(\epsilon _2^{-3})\) are needed to achieve the second. As a consequence, at most \(\mathcal{O}(\max [\epsilon _1^{-2},\epsilon _2^{-3}])\) iterations/evaluations are needed to satisfy both conditions. These orders are identical to the sharp bounds known for the familiar trust-region methods (see [2, 16] or [5, Theorems 2.3.7 and 3.2.6] and Footnote 6), or, for second-order optimality (Footnote 7), for the Adaptive Regularization method (see [18] or [5, Theorem 3.3.2]). This is quite remarkable because function values are essential in these two latter classes of algorithms to enforce descent, itself a crucial ingredient of existing convergence proofs.

  3. 3.

    While (3.8) is adequate to allow a meaningful comparison of the global convergence rates with standard algorithms, as we just discussed, we note that (3.7) is at least as strong, because the average is of course a majorant of the minimum. As it turns out, it provides (order)-equivalent bounds. To see this, we first note that, for \(k >0\) and \(\alpha \in (0,1)\),

    $$\begin{aligned} \begin{array}{lcl} \displaystyle \sum _{i=1}^k \frac{1}{i^\alpha } &{} = &{} \frac{\displaystyle k^{1-\alpha }}{\displaystyle 1-\alpha } + \zeta (\alpha ) + \alpha \displaystyle \int _k^\infty \frac{\displaystyle x-\lfloor x \rfloor }{\displaystyle x^{1+\alpha }}\, dx\\ &{}\le &{}\frac{\displaystyle k^{1-\alpha }}{\displaystyle 1-\alpha } + \zeta (\alpha ) + \alpha \displaystyle \int _k^\infty \frac{\displaystyle 1}{\displaystyle x^{1+\alpha }}\, dx\\ &{}= &{}\frac{\displaystyle k^{1-\alpha }}{\displaystyle 1-\alpha } + \zeta (\alpha ) + \frac{\displaystyle 1}{\displaystyle k^\alpha } \end{array} \end{aligned}$$

    where \(\zeta (\cdot )\) is the Riemann zeta function (see [19, (25.2.8)] for the first equality), so that

    $$\begin{aligned} \frac{\frac{1}{k}\displaystyle \sum _{i=1}^k\frac{1}{i^\alpha }}{\frac{1}{k^\alpha }} = k^{\alpha -1}\displaystyle \sum _{i=1}^k\frac{1}{i^\alpha } \le \frac{1}{1-\alpha } + \frac{\zeta (\alpha )}{k^{1-\alpha }} + \frac{1}{k} \end{aligned}$$

    and therefore

    $$\begin{aligned} \frac{1}{k}\displaystyle \sum _{i=1}^k\frac{1}{i^\alpha } = \mathcal{O}\left( \frac{1}{k^\alpha }\right) \end{aligned}$$

    for \(\alpha \in \{{\scriptstyle \frac{1}{2}}, {\scriptstyle \frac{1}{3}}\}\) (the two cases of interest here).

  4. 4.

    The expression of the constants in Theorem 3.3 is very intricate. However it is remarkable that they do not explicitly depend on the problem dimension. Although a good sign, this does not tell the whole story and caution remains advisable, because the Lipschitz constants \(L_1\) and \(L_2\) may themselves hide this (potentially severe) dependence.

  5. 5.

    It is also remarkable that the bounds (3.7) and (3.8) specify the same order of global convergence irrespective of the values of \(\mu\) and \(\nu\) in (0, 1), although these values do affect the constants involved.

  6. 6.

    The condition (2.7) determining the choice of a linear (in \(\mathcal{K}^L\)) or quadratic (in \(\mathcal{K}^Q\)) step is only used at the very end of the theorem’s proof, after (3.22) has already been obtained. This means that other choice mechanisms are possible without affecting this last conclusion, which is enough to derive bounds on \(\Vert g_j\Vert ^2\) and \(\widehat{\phi }_j^3\) averaged on iterations in \(\mathcal{K}^L\) and \(\mathcal{K}^Q\), respectively (rather than on all iterations).

We now show that the bound (3.8) is essentially sharp (in the sense of [3], meaning that a lower bound on evaluation complexity exists which is arbitrarily close to its upper bound) by following ideas of [5, Theorem 2.2.3] in an argument parallel to that used in [14] for the first-order bound.

Theorem 3.4

The bound (3.8) is essentially sharp in that, for each \(\mu ,\nu \in (0,1)\), \(\vartheta _L=\vartheta _Q=1\) and each \(\varepsilon \in (0,{\scriptstyle \frac{2}{3}})\), there exists a univariate function \(f_{\mu ,\nu ,\varepsilon }\) satisfying AS.1–AS.3 such that, when applied to minimize \(f_{\mu ,\nu ,\varepsilon }\) from the origin, the ASTR2 algorithm with (3.1)–(3.2) produces second-order optimality measures given by

$$\begin{aligned} \phi _k = \widehat{\phi }_k = \!\!\min _{j\in \{ 0, \ldots , k \}}\widehat{\phi }_j = \frac{1}{(k+1)^{{\scriptstyle \frac{1}{3}}+\varepsilon }}. \end{aligned}$$
(3.23)

Proof

We start by constructing \(\{x_k\}\) for which \(f_{\mu ,\nu ,\varepsilon }(x_k) = f_k\), \(\nabla _x^1f_{\mu ,\nu ,\varepsilon }(x_k) = g_k\) and \(\nabla _x^2 f_{\mu ,\nu ,\varepsilon }(x_k) = H_k\) for associated sequences of function, gradient and Hessian values \(\{f_k\}\), \(\{g_k\}\) and \(\{H_k\}\), and then apply Hermite interpolation to exhibit the function \(f_{\mu ,\nu ,\varepsilon }\) itself. We select an arbitrary \(\varsigma > 0\) and define, for \(k\ge 0\),

$$\begin{aligned} g_k {\mathop {=}\limits ^\textrm{def}}0, \;\; \text{ and } \;\; H_k = -\frac{2}{(k+1)^{{\scriptstyle \frac{1}{3}}+\varepsilon }}, \end{aligned}$$
(3.24)

from which we deduce, using (2.2), that, for \(k> 0\),

$$\begin{aligned} \phi _k=\widehat{\phi }_k= \frac{1}{(k+1)^{{\scriptstyle \frac{1}{3}}+\varepsilon }}. \end{aligned}$$

Since \(\phi _k^3 > 0 = \Vert g_k\Vert ^2\), we set

$$\begin{aligned} s_k = s_k^Q {\mathop {=}\limits ^\textrm{def}}\frac{1}{(k+1)^{{\scriptstyle \frac{1}{3}}+\varepsilon }[\varsigma +\sum _{j=0}^k \widehat{\phi }_j^3]^\nu }, \end{aligned}$$
(3.25)

which is the exact minimizer of the quadratic model within the trust region, yielding that, for \(k\ge 0\),

$$\begin{aligned} \Delta q_k {\mathop {=}\limits ^\textrm{def}}\left| g_ks_k + {\scriptstyle \frac{1}{2}}H_ks_k^2 \right| = \frac{1}{(k+1)^{1+3\varepsilon }\big [\varsigma +\sum _{j=0}^k \widehat{\phi }_j^3)\big ]^{2\nu }} \le \frac{1}{(k+1)^{1+3\varepsilon }}, \end{aligned}$$
(3.26)

where we used the fact that \(\varsigma +\sum _{j=0}^k \widehat{\phi }_j^3> \varsigma +\widehat{\phi }_0>1\) to deduce the last inequality. We then define, for all \(k\ge 0\),

$$\begin{aligned} x_0 = 0, \;\;\;\;x_{k+1} = x_k+s_k \;\;\;\;(k\ge 0) \end{aligned}$$
(3.27)

and

$$\begin{aligned} f_0 = \zeta (1+3\varepsilon ), \;\;\;\;f_{k+1} = f_k - \Delta q_k \;\;\;\;(k\ge 0). \end{aligned}$$
(3.28)

Observe that the sequence \(\{f_k\}\) is decreasing and that, for all \(k\ge 0\),

$$\begin{aligned} f_{k+1} = f_0 - \displaystyle \sum _{j=0}^k\Delta q_j \ge f_0 - \displaystyle \sum _{j=0}^k\frac{1}{(j+1)^{1+3\varepsilon }} \ge f_0 - \zeta (1+3\varepsilon ), \end{aligned}$$
(3.29)

where we used (3.28) and (3.26). Hence (3.28) implies that

$$\begin{aligned} f_k \in [0, f_0] \;\; \text{ for } \text{ all } \;\; k\ge 0. \end{aligned}$$
(3.30)

Also note that, using (3.28),

$$\begin{aligned} |f_{k+1} - f_k + \Delta q_k| = 0, \end{aligned}$$
(3.31)

while, using (3.24),

$$\begin{aligned} |g_{k+1}-g_k| = 0 \;\;\;\;(k \ge 0). \end{aligned}$$
(3.32)

Moreover, using the fact that \(1/x^{{\scriptstyle \frac{1}{3}}+\varepsilon }\) is a convex function of x over \([1,+\infty )\), and that from (3.25) \(s_k \ge \frac{1}{(k+1)^{{\scriptstyle \frac{1}{3}}+\varepsilon } \left( \varsigma + k+1\right) ^\nu }\), we derive that, for \(k\ge 0\),

$$\begin{aligned} |H_{k+1} - H_k|&= 2\left| \frac{1}{(k+2)^{{\scriptstyle \frac{1}{3}}+\varepsilon }} - \frac{1}{(k+1)^{{\scriptstyle \frac{1}{3}}+\varepsilon }} \right| \\&\le 2\left( \frac{1}{3}+\varepsilon \right) \frac{1}{(k+1)^{{\scriptstyle \frac{4}{3}} + \varepsilon }} \\&\le \frac{8}{3} \, \frac{(\varsigma + k+1 )^\nu }{ (k+1) (k+1)^{{\scriptstyle \frac{1}{3}}+ \varepsilon } (\varsigma + k+1)^\nu } \\&\le \frac{8}{3} \, \frac{(\varsigma + k+1 )^\nu }{k+1 } s_k \\&\le \frac{8}{3} \, \left( \varsigma + 2 \right) ^\nu s_k . \end{aligned}$$

These last bounds with (3.30), (3.31) and (3.32) allow us to use standard Hermite interpolation on the data given by \(\{f_k\}\), \(\{g_k\}\) and \(\{H_k\}\): see, for instance, Theorem A.9.1 in [5] with \(p=2\) and

$$\begin{aligned} \kappa _f = \max \left[ \frac{8}{3}(\varsigma +2)^\nu , f_0,2\right] \end{aligned}$$

(the second term in the max bounding \(|f_k|\) because of (3.30) and the third bounding \(|g_k|\) and \(|H_k|\) because of (3.24)). We then deduce that there exists a twice continuously differentiable function \(f_{\mu ,\nu ,\varepsilon }\) from \(\mathbb{R}\) to \(\mathbb{R}\) with Lipschitz continuous gradient and Hessian (i.e. satisfying AS.1 and AS.2) such that, for \(k\ge 0\),

$$\begin{aligned} f_{\mu ,\nu ,\varepsilon }(x_k) = f_k,\;\;\;\;\nabla _x^1f_{\mu ,\nu ,\varepsilon }(x_k) = g_k \;\; \text{ and } \;\; \nabla _x^2f_{\mu ,\nu ,\varepsilon }(x_k) = H_k. \end{aligned}$$

Moreover, \(f_{\mu ,\nu ,\varepsilon }\) is bounded below, hence guaranteeing AS.3. The definitions (3.24), (3.25), (3.27) and (3.28) imply that the sequences \(\{x_k\}\), \(\{f_k\}\), \(\{g_k\}\) and \(\{H_k\}\) can be seen as generated by the ASTR2 algorithm applied to \(f_{\mu ,\nu ,\varepsilon }\), starting from \(x_0=0\). \(\square\)

Fig. 1 The function \(f_{\mu ,\nu ,\varepsilon }(x)\) (left), its gradient \(\nabla _x^1f_{\mu ,\nu ,\varepsilon }(x)\) (middle) and its Hessian \(\nabla _x^2f_{\mu ,\nu ,\varepsilon }(x)\) (right) plotted as a function of x, for the first 10 iterations of the ASTR2 algorithm with (3.1)–(3.2) (\(\mu = {\scriptstyle \frac{1}{2}}\), \(\nu ={\scriptstyle \frac{1}{3}}\), \(\varepsilon =\varsigma = {\scriptstyle \frac{1}{100}}\), \(\vartheta _L=\vartheta _Q=1\))

Figure 1 shows the behaviour of \(f_{\mu ,\nu ,\varepsilon }(x)\) for \(\mu = {\scriptstyle \frac{1}{2}}\), \(\nu = {\scriptstyle \frac{1}{3}}\), \(\vartheta _L=\vartheta _Q=1\) and \(\varepsilon = \varsigma ={\scriptstyle \frac{1}{100}}\), its gradient and Hessian, as resulting from the first 10 iterations of the ASTR2 algorithm with (3.1)–(3.2). (We have chosen to shift \(f_0\) to 100 in order to avoid large numbers on the vertical axis of the left panel.) Due to the slow convergence of the series \(\sum _j 1/j^{1+3/100}\), illustrating the boundedness of \(f_0-f_{k+1}\) would require many more iterations. One also notes that the gradient is not monotonically increasing, which implies that \(f_{\mu ,\nu ,\varepsilon }(x)\) is nonconvex, as can be verified in the left panel. Note that the unidimensional nature of the example is not restrictive, since it is always possible to make the value of its objective function and gradient independent of all dimensions but one. Also note that, as was the case in [14], the argument of Theorem 3.4 fails for \(\varepsilon =0\) since then the sums in (3.29) diverge when k tends to infinity.

Note that, because

$$\begin{aligned} \sum _{j=0}^k \frac{1}{(j+1)^{{\scriptstyle \frac{1}{3}}+\varepsilon }}&\ge \int _0^k \frac{dj}{(j+2)^{{\scriptstyle \frac{1}{3}}+\varepsilon }}\\&=\frac{3}{2+3\varepsilon }\left[ \frac{k+2}{(k+2)^{{\scriptstyle \frac{1}{3}}+\varepsilon }}-2\right] \\&\ge \frac{3}{2(2+3\varepsilon )}\left[ \frac{k+1}{(k+1)^{{\scriptstyle \frac{1}{3}}+\varepsilon }}-2\right] , \end{aligned}$$

one deduces that

$$\begin{aligned} {{\,\textrm{average}\,}}_{j\in \{ 0, \ldots , k \}}\widehat{\phi }_j \ge \frac{3}{2(2+3\varepsilon )}\left[ \frac{1}{(k+1)^{{\scriptstyle \frac{1}{3}}+\varepsilon }}-\frac{2}{k+1}\right] , \end{aligned}$$

which, when compared to (3.8), reflects the (slight) difference in strength between (3.7) and (3.8).
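The slow-convergence construction of Theorem 3.4, which underlies Fig. 1, is easy to reproduce. The sketch below (the helper name `slow_example` is ours) generates the data \(\{x_k\}\), \(\{f_k\}\) and \(\{H_k\}\) of (3.24)–(3.28) for the univariate example with \(\vartheta _L=\vartheta _Q=1\); the output can then be passed to a Hermite interpolation routine or plotted directly.

```python
import numpy as np
from scipy.special import zeta

def slow_example(niter=10, eps=0.01, varsigma=0.01, nu=1.0/3):
    """Generate the sequences (3.24)-(3.28) of the univariate lower-bound example
    of Theorem 3.4; g_k = 0 at every iterate."""
    xs, fs, Hs = [0.0], [zeta(1.0 + 3.0 * eps)], []   # x_0 = 0, f_0 = zeta(1+3*eps)
    sum_phi3 = 0.0
    for k in range(niter):
        phi_k = 1.0 / (k + 1) ** (1.0 / 3 + eps)      # phi_k = hat phi_k
        Hs.append(-2.0 * phi_k)                       # Hessian value (3.24)
        sum_phi3 += phi_k ** 3
        s_k = phi_k / (varsigma + sum_phi3) ** nu     # exact TR step (3.25)
        dq_k = phi_k * s_k ** 2                       # model decrease (3.26), g_k = 0
        xs.append(xs[-1] + s_k)                       # (3.27)
        fs.append(fs[-1] - dq_k)                      # (3.28)
    return np.array(xs), np.array(fs), np.array(Hs)

xs, fs, Hs = slow_example()   # data interpolated by f_{mu,nu,eps} in the proof
```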

4 A “divergent stepsize” ASTR2 subclass

A “divergent stepsize” first-order method was analyzed in [14], motivated by its good practical behaviour in the stochastic context [15]. For coherence, we now present and analyze a similar variant, this time for second-order optimality. This requires the following additional assumption.

  1. AS.4:

    there exists a constant \(\kappa _g >0\) such that, for all x, \(\Vert g(x)\Vert _\infty \le \kappa _g\).

Theorem 4.1

Suppose that AS.1–AS.3 and AS.4 hold and that the ASTR2 algorithm is applied to problem (2.1), where the scaling factors \(w^L_k\) and \(w^Q_k\) are chosen such that, for some power parameters \(0< \nu _1 \le \mu _1 < 1\) and \(0<\nu _2\le \mu _ 2<{\scriptstyle \frac{1}{2}}\), some constants \(\varsigma \in (0,1]\) and \(\kappa _w\ge \max [1,\varsigma ]\) and all \(k\ge 0\),

$$\begin{aligned} 0< \varsigma \, (k+1)^{\nu _1} \le w^L_{k} \le \kappa _w \,(k+1)^{\mu _1} \;\; \text{ and } \;\; 0 < \varsigma \, (k+1)^{\nu _2} \le w^Q_{k} \le \kappa _w \,(k+1)^{\mu _2}. \end{aligned}$$
(4.1)

Let \(\psi _k{\mathop {=}\limits ^\textrm{def}}\min [1,\max [\Vert g_k\Vert ^2,\phi _k^3]]\). Then, for any \(\theta \in (0,{\scriptstyle \frac{1}{4}}\tau )\) and \(k >j_\theta\),

$$\begin{aligned} \min _{j\in \{ j_\theta , \ldots , k \}}\psi _j \le \kappa _\diamond (\theta ) \frac{\displaystyle (k+1)^{\max [\mu _1,2\mu _2]}}{\displaystyle k-j_\theta } \le \frac{\displaystyle \kappa _\diamond (\theta )(j_\theta +1)}{\displaystyle (k+1)^{1-\max [\mu _1,2\mu _2]}}, \end{aligned}$$
(4.2)

where

$$\begin{aligned} j_\theta {\mathop {=}\limits ^\textrm{def}}\max \left[ \left( \frac{L_1}{2\varsigma (1-\theta )}\right) ^{{\scriptstyle \frac{1}{\nu _1}}}, \left( \frac{2(1+L_1)}{\varsigma }\right) ^{{\scriptstyle \frac{1}{\nu _2}}}, \left( \frac{L_2}{3\varsigma ({\scriptstyle \frac{1}{4}}\tau -\theta )}\right) ^{{\scriptstyle \frac{1}{\nu _2}}}, \left( \frac{L_2\xi }{3\varsigma ^2({\scriptstyle \frac{1}{4}}\tau -\theta )}\right) ^{{\scriptstyle \frac{1}{2\nu _2}}} \right] \end{aligned}$$
(4.3)

and

$$\begin{aligned} \kappa _\diamond (\theta ) {\mathop {=}\limits ^\textrm{def}}\left\{ \frac{\kappa _w^2}{\theta }\left( f(x_0)-f_{\textrm{low}}+ (j_\theta +1)\max \left[ \frac{L_1\kappa _g^2}{2\varsigma ^2},\frac{L_2\xi ^3}{3\varsigma ^3}\right] \right) \right\} ^{\scriptstyle \frac{1}{3}}. \end{aligned}$$

Proof

Consider an arbitrary \(\theta \in (0,{\scriptstyle \frac{1}{4}}\tau )\) and note that AS.4, (4.1) and the definition of \(\widehat{\phi }_k\) imply that

$$\begin{aligned} w^L_k \in [\varsigma , \kappa _w(k+1)^{\mu _1}] \;\; \text{ and } \;\; w^Q_k \in [\varsigma , \kappa _w(k+1)^{\mu _2} ]. \end{aligned}$$
(4.4)

If we define \(j_\theta\) by (4.3), we immediately obtain from AS.4 and Lemma 2.5 (where we neglect the first term in the right-hand sides of (2.20) and (2.21)) that

$$\begin{aligned} f(x_{j_\theta +1}) \le f(x_0) + (j_\theta +1)\kappa _{ \text{ over }} \;\; \text{ where } \;\; \kappa _{ \text{ over }} = \max \left[ \frac{L_1\kappa _g^2}{2\varsigma ^2},\frac{L_2\xi ^3}{3\varsigma ^3}\right] . \end{aligned}$$
(4.5)

If we choose \(j > j_\theta\), one then verifies that the definition of \(j_\theta\) in (4.3), the bounds (2.20) and (2.21) and the definition (4.1) together ensure that

$$\begin{aligned} f(x_{j+1}) -f(x_j) \le \left\{ \begin{array}{ll} -\theta \frac{\displaystyle \Vert g_j\Vert ^2}{\displaystyle w^L_j} &{} \;\; \text{ if } \;\; j\in \mathcal{K}^L,\\ -\theta \frac{\displaystyle \phi _j^3}{\displaystyle (w^Q_j)^2} &{} \;\; \text{ if } \;\; j\in \mathcal{K}^Q. \end{array}\right. \end{aligned}$$

Using now the mechanism of Step 3, the definition of \(\psi _k\), (4.1) and the inequality \(\kappa _w\ge 1\), we obtain that, for \(j > j_\theta\)

$$\begin{aligned} f(x_j)-f(x_{j+1}) \ge \theta \psi _j\min \left[ \frac{1}{w^L_j},\frac{1}{(w^Q_j)^2}\right] \ge \frac{\theta \psi _j}{\kappa _w^2(j+1)^{\max [\mu _1,2\mu _2]}}. \end{aligned}$$
(4.6)

As a consequence, we obtain from (4.5) and the summation of (4.6) for \(j\in \{ j_\theta +1, \ldots , k \}\) that, for \(k > j_\theta\),

$$\begin{aligned} f(x_0)-f(x_{k+1}) \ge -(j_\theta +1)\kappa _{ \text{ over }} +\sum _{j=j_\theta +1}^k\frac{\theta \psi _j}{\kappa _w^2(j+1)^{\max [\mu _1,2\mu _2]}}. \end{aligned}$$

We therefore deduce, using AS.3, that

$$\begin{aligned} (k-j_\theta )\min _{j\in \{ j_\theta +1,\ldots ,k \}}\psi _j \le \sum _{j=j_\theta +1}^k\psi _j \le \frac{\kappa _w^2(k+1)^{\max [\mu _1,2\mu _2]}}{\theta }\Big [ f(x_0)-f_{\textrm{low}}+ (j_\theta +1)\kappa _{ \text{ over }} \Big ], \end{aligned}$$

and (4.2) follows. \(\square\)

This theorem gives a bound on the rate at which the combined optimality measure \(\psi _k\) tends to zero. This bound is slightly worse than the one obtained in the previous section, but approaches it whenever \(\max [\mu _1,2\mu _2]\) approaches zero.
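To illustrate how the ingredients of Theorem 4.1 fit together, the following small Python sketch evaluates the right-hand side of (4.2), using (4.3) and the definition of \(\kappa _\diamond (\theta )\). All numerical constants (\(L_1\), \(L_2\), \(\xi\), \(\kappa _w\), \(\kappa _g\), \(\varsigma\), \(\tau\), \(f(x_0)-f_{\textrm{low}}\) and the exponents \(\mu\), \(\nu\)) are purely hypothetical assumptions chosen for illustration; they are not values used elsewhere in the paper.

```python
# Hypothetical constants, chosen only to illustrate how (4.2)-(4.3) fit together;
# they are not taken from the paper.
L1, L2, xi, kappa_w, kappa_g = 1.0, 1.0, 1.0, 2.0, 1.0
varsigma, tau, f0_minus_flow = 1.0, 1.0, 1.0
mu1, mu2, nu1, nu2 = 0.5, 0.25, 0.5, 0.25   # 0 < nu_i <= mu_i, mu1 < 1, mu2 < 1/2
theta = tau / 8.0                           # any theta in (0, tau/4)

# j_theta from (4.3)
j_theta = max((L1 / (2 * varsigma * (1 - theta))) ** (1 / nu1),
              (2 * (1 + L1) / varsigma) ** (1 / nu2),
              (L2 / (3 * varsigma * (tau / 4 - theta))) ** (1 / nu2),
              (L2 * xi / (3 * varsigma ** 2 * (tau / 4 - theta))) ** (1 / (2 * nu2)))

# kappa_over from (4.5) and kappa_diamond(theta) as defined after (4.3)
kappa_over = max(L1 * kappa_g ** 2 / (2 * varsigma ** 2),
                 L2 * xi ** 3 / (3 * varsigma ** 3))
kappa_diamond = ((kappa_w ** 2 / theta)
                 * (f0_minus_flow + (j_theta + 1) * kappa_over)) ** (1 / 3)

# Right-hand side of (4.2), valid for k > j_theta:
# it decays like (k+1)^(-(1 - max[mu1, 2*mu2])).
for k in (10 ** 6, 10 ** 8, 10 ** 10):
    bound = kappa_diamond * (j_theta + 1) / (k + 1) ** (1 - max(mu1, 2 * mu2))
    print(f"k = {k:>12d}:  right-hand side of (4.2) = {bound:.3e}")
```

With these particular (illustrative) choices, \(j_\theta\) is determined by the second term of (4.3) and the bound decays like \((k+1)^{-1/2}\), in line with the discussion above.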

Using the methodology of Theorem 3.4, we now show that the bound (4.2) is also essentially sharp.

Theorem 4.2

The bound (4.2) is essentially sharp in that, for each \(\mu = (\mu _1,\mu _2)\), each \(\nu = (\nu _1,\nu _2)\) with \(0<\nu _1\le \mu _1<1\) and \(0<\nu _2\le \mu _2<{\scriptstyle \frac{1}{2}}\) and each \(\varepsilon \in (0,1-{\scriptstyle \frac{1}{3}}(1-2\mu _2))\), there exists a univariate function \(h_{\mu ,\nu ,\varepsilon }\) satisfying AS.1–AS.4 such that, when applied to minimize \(h_{\mu ,\nu ,\varepsilon }\) from the origin, the ASTR2 algorithm with (4.1) produces second-order optimality measures given by

$$\begin{aligned} \phi _k = \widehat{\phi }_k = \psi _k^{1/3} = \min _{j\in \{ 0, \ldots , k \}}\psi _j^{1/3} = \frac{1}{(k+1)^{{\scriptstyle \frac{1}{3}}(1-2\mu _2)+\varepsilon }}. \end{aligned}$$
(4.7)

Proof

As above, we start by defining \(\gamma = {\scriptstyle \frac{1}{3}}(1-2\mu _2)+\varepsilon\), \(w_k= \kappa _w(k+1)^{\mu _2}\) and, for \(k\ge 0\),

$$\begin{aligned} g_k {\mathop {=}\limits ^\textrm{def}}0 \;\; \text{ and } \;\; H_k = -\frac{2}{(k+1)^\gamma }, \end{aligned}$$
(4.8)

which then implies, using (2.2) that, for \(k> 0\),

$$\begin{aligned} \phi _k=\widehat{\phi }_k= \frac{1}{(k+1)^\gamma }. \end{aligned}$$
(4.9)

Given these definitions and because \(\widehat{\phi }_k^3>\Vert g_k\Vert ^2 = 0\), we set

$$\begin{aligned} s_k = s_k^Q {\mathop {=}\limits ^\textrm{def}}\frac{1}{(k+1)^\gamma [\kappa _w(k+1)^{\mu _2}]} = \frac{1}{\kappa _w(k+1)^{\gamma +\mu _2}}, \end{aligned}$$
(4.10)

yielding that, for \(k>0\),

$$\begin{aligned} \Delta q_0 {\mathop {=}\limits ^\textrm{def}}\frac{1}{(\varsigma +1)^{2\nu _2}} \;\; \text{ and } \;\; \Delta q_k {\mathop {=}\limits ^\textrm{def}}\left| g_ks_k + {\scriptstyle \frac{1}{2}}H_ks_k^2 \right| = \frac{1}{\kappa _w^2(k+1)^{3\gamma +2\mu _2}} \le \frac{1}{(k+1)^{3\gamma +2\mu _2}}, \end{aligned}$$
(4.11)

where we used the fact that \(\kappa _w \ge 1\) to deduce the last inequality. We then define, for all \(k\ge 0\),

$$\begin{aligned} x_0 = 0, \;\;\;\;x_{k+1} = x_k+s_k \;\;\;\;(k\ge 0) \end{aligned}$$
(4.12)

and

$$\begin{aligned} h_0 = \zeta (3\gamma +2\mu _2) \;\; \text{ and } \;\; h_{k+1} = h_k - \Delta q_k \;\;\;\;(k \ge 0), \end{aligned}$$
(4.13)

where \(\zeta (\cdot )\) is the Riemann zeta function. Note that, since \(3\gamma >1-2\mu _2\), the argument \(3\gamma +2\mu _2\) of \(\zeta\) is strictly larger than one and \(\zeta (3\gamma +2\mu _2)\) is finite. Observe also that the sequence \(\{h_k\}\) is decreasing and that, for all \(k\ge 0\),

$$\begin{aligned} h_{k+1} = h_0 - \displaystyle \sum _{j=0}^k\Delta q_j \ge h_0 - \displaystyle \sum _{j=0}^k\frac{1}{(j+1)^{3\gamma +2\mu _2}} \ge h_0 - \zeta (3\gamma +2\mu _2), \end{aligned}$$
(4.14)

where we used (4.13) and (4.11). Hence (4.13) and (4.14) imply that

$$\begin{aligned} h_k \in [0, h_0] \;\; \text{ for } \text{ all } \;\; k\ge 0. \end{aligned}$$
(4.15)

Also note that, using (4.13),

$$\begin{aligned} |h_{k+1} - h_k + \Delta q_k| = 0, \end{aligned}$$
(4.16)

while, using (4.8),

$$\begin{aligned} |g_{k+1}-g_k| = 0 \;\;\;\;(k \ge 0). \end{aligned}$$
(4.17)

Moreover, using the fact that \(1/x^\gamma\) is a convex function of x over \([1,+\infty )\) and (4.10), we derive that, for \(k\ge 0\),

$$\begin{aligned} |H_{k+1} - H_k| = 2\left| \frac{1}{(k+2)^\gamma } - \frac{1}{(k+1)^\gamma } \right| \le \frac{2\gamma }{(k+1)^{1+\gamma }} \le \frac{2\gamma \kappa _w(k+1)^{\mu _2}}{k+1} \,s_k \le 2 \gamma \kappa _w s_k. \end{aligned}$$

This bound, together with (4.15), (4.16) and (4.17), once more allows us to use standard Hermite interpolation on the data given by \(\{h_k\}\), \(\{g_k\}\) and \(\{H_k\}\), as stated in [5, Theorem A.9.1] with \(p=2\) and

$$\begin{aligned} \kappa _f = \max \left[ 2\gamma \kappa _w, h_0,2\right] \end{aligned}$$

(the second term in the max bounds \(|h_k|\) because of (4.15) and the third bounds both \(|g_k|\) and \(|H_k|\) because of (4.8)). As a consequence, there exists a twice continuously differentiable function \(h_{\mu ,\nu ,\varepsilon }\) from \(\mathbb{R}\) to \(\mathbb{R}\) with Lipschitz continuous gradient and Hessian (i.e. satisfying AS.1 and AS.2) such that, for \(k\ge 0\),

$$\begin{aligned} h_{\mu ,\nu ,\varepsilon }(x_k) = h_k,\;\;\;\;\nabla _x^1h_{\mu ,\nu ,\varepsilon }(x_k) = g_k \;\; \text{ and } \;\; \nabla _x^2h_{\mu ,\nu ,\varepsilon }(x_k) = H_k. \end{aligned}$$

Moreover, the ranges of \(h_{\mu ,\nu ,\varepsilon }\) and its derivatives is constant independent of \(\gamma\), hence guaranteeing AS.3 and AS.4. Thus (4.8), (4.10), (4.12) and (4.13) imply that the sequences \(\{x_k\}\), \(\{h_k\}\), \(\{g_k\}\) and \(\{H_k\}\) can be seen as generated by the ASTR2 algorithm applied to \(h_{\mu ,\nu ,\varepsilon }\), starting from \(x_0=0\). The first bound of (4.7) then results from (4.9) and the definition of \(\gamma\).

Fig. 2

The function \(h_{\mu ,\nu ,\varepsilon }(x)\) (left), its gradient \(\nabla _x^1h_{\mu ,\nu ,\varepsilon }(x)\) (middle) and its Hessian \(\nabla _x^2h_{\mu ,\nu ,\varepsilon }(x)\) (right), plotted as functions of x for the first 10 iterations of the ASTR2 algorithm with (4.1) (\(\mu = \nu = ({\scriptstyle \frac{1}{2}},{\scriptstyle \frac{1}{3}})\))

The behaviour of \(h_{\mu ,\nu ,\varepsilon }\) is illustrated in Figure 2. It is qualitatively similar to that of \(f_{\mu ,\nu ,\varepsilon }\) shown in Figure 1, although the decrease in objective value is somewhat slower, as expected. As in Sect. 3, note that the inequality

$$\begin{aligned} \sum _{j=0}^k\frac{1}{(j+1)^\gamma } \ge \int _0^{k+1}\frac{dj}{(j+1)^\gamma } = \frac{1}{1-\gamma }\left[ \frac{k+2}{(k+2)^\gamma }-1\right] \ge \frac{1}{2(1-\gamma )}\left[ \frac{k+1}{(k+1)^\gamma }-2\right] \end{aligned}$$

implies that

$$\begin{aligned} {{\,\textrm{average}\,}}_{j\in \{ 0, \ldots , k \}}\psi _j^{1/3} \ge \frac{1}{2(1-\gamma )}\left[ \frac{1}{(k+1)^{{\scriptstyle \frac{1}{3}}(1-2\mu _2)+\varepsilon }}-\frac{2}{k+1}\right] , \end{aligned}$$

which has the same flavour as the second bound of (3.23).
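For readers who prefer to see the counterexample construction numerically, the following Python sketch reproduces the sequences (4.8)–(4.13) for hypothetical parameter choices (\(\mu _2={\scriptstyle \frac{1}{3}}\), \(\varepsilon =0.01\), \(\kappa _w=1\) and \(K=10^5\) iterations are illustrative assumptions, and the general formula of (4.11) is used for all k, ignoring the special value of \(\Delta q_0\)). It checks that the "objective values" \(h_k\) remain in \([0,h_0]\) and decrease monotonically while \(\phi _k\) decays exactly like \((k+1)^{-\gamma }\).

```python
import numpy as np
from scipy.special import zeta          # Riemann zeta function, as in (4.13)

# Hypothetical parameter values, for illustration only.
mu2, eps, kappa_w, K = 1.0 / 3.0, 0.01, 1.0, 100000
gamma = (1.0 - 2.0 * mu2) / 3.0 + eps   # gamma as in the proof of Theorem 4.2

k = np.arange(K)
phi = 1.0 / (k + 1) ** gamma            # (4.9): second-order measure phi_k
Hk = -2.0 / (k + 1) ** gamma            # (4.8): curvature (the gradient g_k is zero)
wk = kappa_w * (k + 1) ** mu2           # quadratic scaling factor w_k
sk = phi / wk                           # (4.10): step s_k = s_k^Q
dq = 1.0 / (kappa_w ** 2 * (k + 1) ** (3 * gamma + 2 * mu2))  # (4.11), general term

h0 = zeta(3 * gamma + 2 * mu2)          # finite because 3*gamma + 2*mu2 > 1
h = h0 - np.concatenate(([0.0], np.cumsum(dq)))   # (4.13): objective values h_k
x = np.concatenate(([0.0], np.cumsum(sk)))        # (4.12): iterates x_k

# (4.15): the h_k remain in [0, h0] and decrease monotonically,
# while phi_k decays exactly like (k+1)^(-gamma).
assert np.all(h >= 0.0) and np.all(np.diff(h) < 0.0)
print(f"after {K} iterations: phi = {phi[-1]:.3e}, h in [{h.min():.3e}, {h0:.3e}]")
```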

5 Second-order optimality in a subspace

While the ASTR2 algorithms guarantee second-order optimality conditions, this guarantee comes at a computational price. The key to it is of course that significant negative curvature in any direction of \(\mathbb{R}^n\) must be exploited, which requires evaluating the Hessian. In addition, the optimality measure \(\phi _k\) and the step \(s_k\) must also be computed. Since these computational costs may be judged excessive, the question arises whether a potentially cheaper algorithm can ensure a "degraded" or weaker form of second-order optimality. Fortunately, the answer is positive: one can guarantee second-order optimality in subspaces of \(\mathbb{R}^n\) at lower cost.

The first step is to assume that a subspace \(\mathcal{S}_k\) is of interest at iteration k. Then, instead of computing \(\phi _k\) from (2.3), one can choose to calculate

$$\begin{aligned} \phi _k^{\mathcal{S}_k} = \max _{\Vert d\Vert \le 1,\; d\in \mathcal{S}_k} -\Big ( g(x_k)^Td + {\scriptstyle \frac{1}{2}}d^TH(x_k)d \Big ). \end{aligned}$$

Because the dimension of \(\mathcal{S}_k\) may be much smaller than n, the cost of this computation may be significantly smaller than that of computing \(\phi _k\). The measure \(\phi _k^{\mathcal{S}_k}\) may for instance be obtained using a Krylov-based method, such as conjugate gradients [17], GLRT [13] or variants thereof, in which the minimum of the model \(T_{f,2}(x,d)\) within the trust region is computed iteratively over a sequence of nested Krylov subspaces of increasing dimension. These subspaces tend to contain the vectors along which the curvature is extreme [12, Chapter 9], thereby improving the quality of the second-order guarantee compared with random subspaces. The process may be terminated before the subspaces fill \(\mathbb{R}^n\), should the calculation become too expensive or a desired accuracy be reached. In addition, there is no need for \(n_k\), the dimension of the final Krylov subspace at iteration k, to be constant: it is often kept very small when far from optimality. This technique has the added benefit that the full Hessian need not be evaluated: only \(n_k\) Hessian-times-vector products are required, again significantly reducing the computational burden. Once \(\phi _k^{\mathcal{S}_k}\) is known, calculating the step \(s_k^Q\) for \(k\in \mathcal{K}^Q\) is also cheaper in a space of dimension \(n_k\) much smaller than n, especially since only a \(\tau\)-approximation is needed (see the comments after the algorithm).
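To make this discussion concrete, here is a minimal Python sketch (under stated assumptions, and not the GLRT implementation referenced above) of how \(\phi _k^{\mathcal{S}_k}\) could be computed from Hessian-vector products only: a Lanczos process builds an orthonormal basis of \(\mathcal{S}_k=\text{span}\{g_k,H_kg_k,\ldots \}\), and the resulting small trust-region subproblem on the unit ball is then solved by eigendecomposition and bisection. The names `subspace_second_order_measure`, `hess_vec` and `m` are hypothetical, the gradient is assumed nonzero, and numerical subtleties (stopping rules, the hard case) are handled only crudely.

```python
import numpy as np

def subspace_second_order_measure(g, hess_vec, m=10, tol=1e-12):
    """Sketch: phi_k^{S_k} = max_{||d||<=1, d in S_k} -(g^T d + 0.5 d^T H d),
    with S_k an m-dimensional Krylov subspace built from Hessian-vector products."""
    n = g.shape[0]
    V = np.zeros((n, m))                     # orthonormal Lanczos basis of S_k
    alpha, beta = np.zeros(m), np.zeros(m)
    V[:, 0] = g / np.linalg.norm(g)          # assumes g != 0
    q_prev, b_prev, dim = np.zeros(n), 0.0, m
    for j in range(m):
        w = hess_vec(V[:, j]) - b_prev * q_prev
        alpha[j] = V[:, j] @ w
        w -= alpha[j] * V[:, j]
        w -= V[:, :j + 1] @ (V[:, :j + 1].T @ w)      # full reorthogonalization
        if j + 1 == m:
            break
        b = np.linalg.norm(w)
        if b < tol:                          # Krylov subspace exhausted early
            dim = j + 1
            break
        beta[j + 1], q_prev, b_prev = b, V[:, j], b
        V[:, j + 1] = w / b
    V = V[:, :dim]
    T = np.diag(alpha[:dim]) + np.diag(beta[1:dim], 1) + np.diag(beta[1:dim], -1)
    c = V.T @ g                              # reduced gradient (= ||g|| e_1 in exact arithmetic)
    y = _unit_ball_trs(c, T)                 # minimize c^T y + 0.5 y^T T y over ||y|| <= 1
    return -(c @ y + 0.5 * y @ (T @ y)), V @ y   # phi_k^{S_k} and the associated direction

def _unit_ball_trs(c, T, tol=1e-12):
    """Exact trust-region subproblem on the unit ball for a small dense (tridiagonal) T."""
    lam, Q = np.linalg.eigh(T)
    d = Q.T @ c
    if lam[0] > tol:                         # interior Newton step, if it fits
        y = -d / lam
        if np.linalg.norm(y) <= 1.0:
            return Q @ y
    lo = max(0.0, -lam[0])
    far = lam + lo > tol                     # eigenvalues away from the smallest one
    if np.all(np.abs(d[~far]) <= tol):       # potential hard case
        y = np.zeros_like(d)
        y[far] = -d[far] / (lam[far] + lo)
        nb = np.linalg.norm(y)
        if nb < 1.0:                         # hard case: move along the extreme eigenvector
            y[np.argmin(lam)] = np.sqrt(1.0 - nb ** 2)
            return Q @ y
    hi = lo + np.linalg.norm(d) + 1.0        # ||y(sigma)|| < 1 for sigma >= hi
    for _ in range(200):                     # bisection on the secular equation ||y(sigma)|| = 1
        sigma = 0.5 * (lo + hi)
        lo, hi = (sigma, hi) if np.linalg.norm(d / (lam + sigma)) > 1.0 else (lo, sigma)
    sigma = 0.5 * (lo + hi)
    return Q @ (-d / (lam + sigma))
```

For an explicit symmetric matrix H one would call the sketch with `hess_vec=lambda v: H @ v`. Because the Krylov subspaces are nested, the returned value is nondecreasing in m and bounded above by \(\phi _k\), which is the precise sense in which the subspace guarantee is "degraded".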

Importantly, the theory developed in the previous sections is not affected by the transition from \(\mathbb{R}^n\) to \(\mathcal{S}_k\), except that the complexity bounds (3.7)–(3.8) and (4.2) are now expressed in terms of \(\widehat{\phi }_k^{\mathcal{S}_k} = \min [ 1, \phi _k^{\mathcal{S}_k}]\) instead of \(\widehat{\phi }_k\). While clearly not as powerful as the complete second-order guarantee in \(\mathbb{R}^n\), weaker guarantees based on (Krylov) subspaces are often sufficient in practice and make the ASTR2 algorithm more affordable. Note that, in the limit, one can even choose \(\mathcal{S}_k= \{0\}\) for all k, in which case we can set \(H_k=0\) for all k and no second-order guarantee is obtained (but the first-order complexity bounds remain valid, recovering the results of [14]).

6 Conclusions

We have introduced an OFFO algorithm whose global rate of convergence to first-order minimizers is \(\mathcal{O}((k+1)^{-{\scriptstyle \frac{1}{2}}})\), while it converges to second-order ones as \(\mathcal{O}((k+1)^{-{\scriptstyle \frac{1}{3}}})\). These bounds match the best known bounds for second-order optimality for algorithms using objective-function evaluations, despite the latter exploiting significantly more information. We therefore conclude that, from the point of view of evaluation complexity at least, evaluating values of the objective function is an unnecessary effort for efficiently finding second-order minimizers. We have also discussed another closely related algorithm, whose global rates of convergence can be nearly as good. Finally, we have considered how weaker second-order guarantees may be obtained at a much reduced computational cost.

We expect that extending our proposal to convexly constrained cases (for instance to problems involving bounds on the variables) should be possible. As in [7, Chapter 12], the idea would be to restrict the model minimization at each iteration to the intersection of the trust region with the feasible domain, but this should of course be verified.

It is clearly too early to assess whether the new algorithms will turn out to be of practical interest. In the form with \(\vartheta = 1\), they are undoubtedly quite conservative because of the monotonic nature of the scaling factors (and hence of the trust-region radius) and because, for locally convex functions (for which a second-order guarantee is not needed), \(\Vert g_k\Vert \ge \phi _k\), yielding a linear step. Whether less conservative variants with similar or better complexity can be designed is the object of ongoing research.