1 Introduction

Consider a given, smooth, vector-valued residual function \(r:\mathbb {R}^n\longrightarrow \mathbb {R}^m\), and let \(\Vert \cdot \Vert \) be the Euclidean norm. Our goal is to design effective methods for finding values of \(x \in \mathbb {R}^n\) for which \(\Vert r(x)\Vert \) is (locally) as small as possible. Since \(\Vert r(x)\Vert \) is generally nonsmooth, it is common to consider the equivalent problem of minimizing

$$\begin{aligned} \Phi (x) {:=}{\scriptstyle \frac{1}{2}}\Vert r(x)\Vert ^2, \end{aligned}$$
(1.1)

and to tackle the resulting problem using a generic method for unconstrained optimization, or one that exploits the special structure of \(\Phi \).

To put our proposal into context, arguably the most widely used method for solving nonlinear least-squares problems is the Gauss–Newton method and its variants. These iterative methods all build locally-linear (Taylor) approximations to \(r(x_k+s)\) about \(x_k\), and then minimize the approximation as a function of s in the least-squares sense to derive the next iterate \(x_{k+1} = x_k + s_k\) [21, 22, 24]. The iteration is usually stabilized either by imposing a trust-region constraint on the permitted s, or by including a quadratic regularization term [3, 23]. While these methods are undoubtedly popular in practice, they often suffer when the optimal value of the norm of the residual is large. To counter this, regularized Newton methods for minimizing (1.1) have also been proposed [7, 16, 17]. Although this usually provides a cure for the slow convergence of Gauss–Newton-like methods on non-zero-residual problems, the global behaviour is sometimes less attractive; we attribute this to the Newton model not fully reflecting the sum-of-squares nature of the original problem.

With this in mind, we consider instead the obvious nonlinear generalization of Gauss–Newton in which a locally quadratic (Taylor) “tensor-Newton” approximation to the residuals is used instead of a locally linear one. Of course, the resulting least-squares model is now quartic rather than quadratic (and thus in principle is harder to solve), but our experiments [19] have indicated that this results in more robust global behaviour than Newton-type methods and better performance on non-zero-residual problems than is seen for Gauss–Newton variants. Our intention here is to explore the convergence behaviour of a tensor-Newton approach.

We mention in passing that we are not the first authors to consider higher-order models for least-squares problems. The earliest approach we are aware of [4, 5] uses a quadratic model of \(r(x_k+s)\) in which the Hessian of each residual is approximated by a low-rank matrix that is intended to compensate for any small singular values of the Jacobian. Another approach, known as geodesic acceleration [29, 30], aims to modify Gauss–Newton-like steps with a correction that allows for higher-order derivatives. More recently, derivative-free methods that aim to build quadratic models of \(r(x_k+s)\) by interpolation/regression of past residual values have been proposed [31, 32], although these ultimately resemble Gauss–Newton variants more closely. While each of these methods has been shown to improve performance relative to Gauss–Newton-like approaches, none makes full use of the residual Hessians. Our intention is thus to investigate the convergence properties of methods based on the tensor-Newton model.

There has been a long-standing interest in establishing the global convergence of general smooth unconstrained optimization methods, that is, in ensuring that a method for minimizing a function f(x) starting from an arbitrary initial guess ultimately delivers an iterate for which a measure of optimality is small. A more recent concern has focused on how many evaluations of f(x) and its derivatives are necessary to reduce the optimality measure below a specified (small) \(\epsilon > 0\) from the initial guess. If the measure is \(\Vert g(x)\Vert \), where \(g(x) {:=}\nabla _x f(x)\), it is known that some well-known schemes (including steepest descent and generic second-order trust-region methods) may require \(\Theta (\epsilon ^{-2})\) evaluations under standard assumptions [6], while this may be improved to \(\Theta (\epsilon ^{-3/2})\) evaluations for second-order methods with cubic regularization or using specialised trust-region tools [8, 15, 26]. Here and hereafter \(O(\cdot )\) indicates a term that is of at worst a multiple of its argument, while \(\Theta (\cdot )\) indicates additionally that there are problem instances for which a bound of this order is attained.

For the problem we consider here, an obvious approach is to apply any of the aforementioned algorithms to minimize (1.1), and to terminate as soon as

$$\begin{aligned} \Vert \nabla _x \Phi (x) \Vert \le \epsilon , \;\; \text{ where } \;\; \nabla _x \Phi (x) = J^T(x) r(x) \;\; \text{ and } \;\; J(x) {:=}\nabla _x r(x). \end{aligned}$$
(1.2)

However, it has been argued [9] that this ignores the possibility that it may suffice to stop instead when r(x) is small, and that a more sensible criterion is to terminate when

$$\begin{aligned} \Vert r(x)\Vert \le \epsilon _p \quad \mathrm{or}\quad \Vert g_r(x)\Vert \le \epsilon _d, \end{aligned}$$
(1.3)

where \(\epsilon _p>0\) and \(\epsilon _d>0\) are required accuracy tolerances and \(g_r(x)\) is the scaled gradient given by

$$\begin{aligned} g_r(x){:=}\left\{ \begin{array}{ll} \displaystyle \frac{J^T(x) r(x)}{\Vert r(x)\Vert }, &{}\quad \;\; \text{ whenever } \;\; r(x)\ne 0;\\ 0, &{}\quad \;\; \text{ otherwise. } \;\; \end{array} \right. \end{aligned}$$
(1.4)

We note that \(g_r(x)\) in (1.4) is precisely the gradient of \(\Vert r(x)\Vert \) whenever \(r(x)\ne 0\), while if \(r(x)=0\), we are at the global minimum of \(\Vert r(x)\Vert \) and so \(g_r(x)=0\in \partial (\Vert r(x)\Vert )\), the sub-differential of \(\Vert r(x)\Vert \). Furthermore \(\Vert g_r(x)\Vert \) is less sensitive to scaling than \(\Vert J^T(x) r(x) \Vert \). It has been shown that a second-order method based on cubic regularization will satisfy (1.3) after \(O\left( \max (\epsilon _d^{-3/2},\epsilon _p^{-1/2})\right) \) evaluations [9, Theorem 3.2]. One of our aims here is to show similar bounds for the tensor-Newton method we are advocating. We propose a regularized tensor-Newton method in Sect. 2, and analyse both its global convergence and its evaluation complexity in Sect. 3. The regularization order, r, permitted by the algorithm proposed in Sect. 2 is restricted to be no larger than 3, and so in Sect. 4 we introduce a modified algorithm for which \(r>3\) is possible. Numerical experiments are reported in Sect. 5, and we make further comments and draw general conclusions in Sect. 6.
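
For illustration only, the following short NumPy sketch (our addition; `r_val` and `J_val` are hypothetical arrays holding \(r(x)\) and \(J(x)\)) evaluates the scaled gradient (1.4) and the stopping rule (1.3).

```python
import numpy as np

def scaled_gradient(r_val, J_val):
    """Scaled gradient g_r(x) of ||r(x)|| as in (1.4)."""
    nrm = np.linalg.norm(r_val)
    if nrm == 0.0:
        return np.zeros(J_val.shape[1])
    return J_val.T @ r_val / nrm

def terminate(r_val, J_val, eps_p, eps_d):
    """Stopping rule (1.3): small residual or small scaled gradient."""
    return (np.linalg.norm(r_val) <= eps_p
            or np.linalg.norm(scaled_gradient(r_val, J_val)) <= eps_d)
```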

2 The tensor-Newton method

Suppose that \(r(x) \in C^2\) has components \(r_i(x)\) for \(i=1,\ldots ,m\). Let \(t(x,s)\) be the vector whose components are

$$\begin{aligned} t_i(x,s) \,{:=}\,r_i(x) + s^T \nabla _x r_i(x) + {\scriptstyle \frac{1}{2}}s^T \nabla _{xx} r_i(x) s \end{aligned}$$
(2.1)

for \(i=1,\ldots ,m\). We build the tensor-Newton approximation

$$\begin{aligned} m(x,s) \,{:=}\, {\scriptstyle \frac{1}{2}}\Vert t(x,s)\Vert ^2 \end{aligned}$$
(2.2)

of \(\Phi (x+s)\), and define the regularized model

$$\begin{aligned} m^{\mathrm{R}}(x,s,\sigma ) \,{:=}\,\, m(x,s) + \frac{1}{r} \sigma \Vert s\Vert ^r, \end{aligned}$$
(2.3)

where \(r \ge 2\) is given. Note that

$$\begin{aligned} \nabla _s m^{\mathrm{R}}(x,s,\sigma ) = \nabla _s m(x,s) + \sigma \Vert s\Vert ^{r-2} s. \end{aligned}$$
(2.4)
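
To make (2.1)–(2.4) concrete, here is a small NumPy sketch (our addition, not part of the analysis) that evaluates \(t(x,s)\), \(m(x,s)\), \(m^{\mathrm{R}}(x,s,\sigma )\) and \(\nabla _s m^{\mathrm{R}}\) from a residual value `r_val`, Jacobian `J_val` and list of residual Hessians `H_list`; these names are ours, and `reg_order` plays the role of the regularization order \(r\) (renamed to avoid a clash with the residual \(r\)).

```python
import numpy as np

def t_model(r_val, J_val, H_list, s):
    """Vector t(x,s) of (2.1): second-order Taylor model of each residual."""
    quad = np.array([s @ (H @ s) for H in H_list])   # q_i(x,s) = s^T (nabla_xx r_i) s
    return r_val + J_val @ s + 0.5 * quad

def m_model(r_val, J_val, H_list, s):
    """Tensor-Newton model m(x,s) = 0.5*||t(x,s)||^2 of (2.2)."""
    t = t_model(r_val, J_val, H_list, s)
    return 0.5 * t @ t

def m_reg(r_val, J_val, H_list, s, sigma, reg_order):
    """Regularized model m^R(x,s,sigma) of (2.3)."""
    return (m_model(r_val, J_val, H_list, s)
            + sigma / reg_order * np.linalg.norm(s) ** reg_order)

def grad_m_reg(r_val, J_val, H_list, s, sigma, reg_order):
    """Gradient (2.4): nabla_s m(x,s) + sigma*||s||^(r-2)*s."""
    t = t_model(r_val, J_val, H_list, s)
    Jt = J_val + np.array([H @ s for H in H_list])   # Jacobian of t with respect to s
    return Jt.T @ t + sigma * np.linalg.norm(s) ** (reg_order - 2) * s
```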

We consider the following algorithm (Algorithm 2.1) to find a critical point of \(\Phi (x)\).

Algorithm 2.1

Adaptive Tensor-Newton Regularization.

A starting point \(x_0\), an initial and a minimal regularization parameter \(\sigma _0\ge \sigma _{\min }>0\) and algorithmic parameters \(\theta > 0\), \(\gamma _3\ge \gamma _2> 1> \gamma _1 >0\) and \(1>\eta _2\ge \eta _1>0\), are given. Evaluate \(\Phi (x_0)\). For \(k=0, 1, \ldots \), until termination, do:

  1.

    If the termination test has not been satisfied, compute derivatives of r(x) at \(x_k\).

  2.

    Compute a step \(s_k\) by approximately minimizing \(m^{\mathrm{R}}(x_k,s,\sigma _k)\) so that

    $$\begin{aligned} m^{\mathrm{R}}(x_k,s_k,\sigma _k)<m^{\mathrm{R}}(x_k,0,\sigma _k) \end{aligned}$$
    (2.5)

    and

    $$\begin{aligned} \left\| \nabla _s m^{\mathrm{R}}(x_k,s_k,\sigma _k)\right\| \le \theta \Vert s_k\Vert ^{r-1}. \end{aligned}$$
    (2.6)
  3.

    Set \(\widehat{x}_{k} = x_k + s_k\) and compute \(\Phi (\widehat{x}_k)\) and

    $$\begin{aligned} \rho _k = \frac{\Phi (x_k)-\Phi (\widehat{x}_k)}{m(x_k,0)-m(x_k,s_k)}. \end{aligned}$$
    (2.7)
  4.

    Set

    $$\begin{aligned} \sigma _{k+1} \in \left\{ \begin{array}{lll} [\max (\sigma _{\min }, \gamma _1\sigma _k),\sigma _k ] &{} \;\; \text{ if } \;\; \rho _k \ge \eta _2 &{}\;\; \text{[very } \text{ successful } \text{ iteration] } \;\; \\ \;[ \sigma _k, \gamma _2 \sigma _k ] &{} \;\; \text{ if } \;\; \eta _1 \le \rho _k < \eta _2&{} \;\; \text{[successful } \text{ iteration] } \;\;\\ \; [ \gamma _2 \sigma _k, \gamma _3 \sigma _k ] &{} \;\; \text{ otherwise } \;\; &{} \;\; \text{[unsuccessful } \text{ iteration], } \;\; \end{array}\right. \nonumber \\ \end{aligned}$$
    (2.8)
  5.

    If \(\rho _k \ge \eta _1\), set \(x_{k+1} = \widehat{x}_k\). Otherwise go to Step 2.

At the very least, we insist that (trivial) termination should occur in Step 1 of Algorithm 2.1 if \(\Vert \nabla _x \Phi (x_k)\Vert = 0\), but in practice a rule such as (1.2) or (1.3) at \(x = x_k\) will be preferred.

At the heart of Algorithm 2.1 is the need (Step 2) to find a vector \(s_k\) that both reduces \(m^{\mathrm{R}}(x_k,s,\sigma _k)\) and satisfies \(\Vert \nabla _s m^{\mathrm{R}}(x_k,s_k,\sigma _k)\Vert \le \theta \Vert s_k\Vert ^{r-1}\) (see, e.g., [1]). Since \(m^{\mathrm{R}}(x_k,s,\sigma _k)\) is bounded from below (and grows as s approaches infinity), we may apply any descent-based local optimization method that is designed to find a critical point of \(m^{\mathrm{R}}(x_k,s,\sigma _k)\), starting from \(s = 0\), as this will generate an \(s_k\) that is guaranteed to satisfy both Step 2 stopping requirements. Crucially, such a minimization is on the model \(m^{\mathrm{R}}(x_k,s,\sigma _k)\), not the true objective, and thus involves no true objective evaluations. We do not claim that this calculation is trivial, but it might, for example, be achieved by applying a safeguarded Gauss–Newton method to the least-squares problem involving the extended residuals \((t(x_k,s),\sqrt{\sigma _k\Vert s\Vert ^{r-2}}s)\).
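
As a hedged illustration of Step 2 (not the safeguarded Gauss–Newton inner iteration suggested above, and not the solver used in [19]), the following self-contained sketch minimizes \(m^{\mathrm{R}}(x_k,s,\sigma _k)\) from \(s=0\) with a generic quasi-Newton routine and then checks (2.5) and (2.6); all data are made-up toy values.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data at the current iterate x_k (toy values for illustration).
rk = np.array([1.0, -0.5])                       # r(x_k)
Jk = np.array([[1.0, 2.0], [0.0, 1.0]])          # J(x_k)
Hk = [np.array([[2.0, 0.0], [0.0, 0.0]]),        # residual Hessians at x_k
      np.zeros((2, 2))]
sigma_k, reg_order, theta = 1.0, 3.0, 0.1

def t_of(s):                                     # t(x_k, s), cf. (2.1)
    return rk + Jk @ s + 0.5 * np.array([s @ (H @ s) for H in Hk])

def mR(s):                                       # regularized model (2.3)
    t = t_of(s)
    return 0.5 * t @ t + sigma_k / reg_order * np.linalg.norm(s) ** reg_order

def grad_mR(s):                                  # its gradient, cf. (2.4)
    t = t_of(s)
    Jt = Jk + np.array([H @ s for H in Hk])      # Jacobian of t with respect to s
    return Jt.T @ t + sigma_k * np.linalg.norm(s) ** (reg_order - 2) * s

# Descent from s = 0 with a generic quasi-Newton method; only the model is evaluated.
res = minimize(mR, np.zeros(2), jac=grad_mR, method="L-BFGS-B",
               options={"gtol": 1e-10, "ftol": 1e-14})
s_k = res.x
print("decrease (2.5):", mR(s_k) < mR(np.zeros(2)))
print("stationarity (2.6):",
      np.linalg.norm(grad_mR(s_k)) <= theta * np.linalg.norm(s_k) ** (reg_order - 1))
```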

We define the index set of successful iterations, in the sense of (2.8), up to iteration k to be \(\mathcal{S}_k {:=}\{0 \le l \le k \mid \rho _l \ge \eta _1\}\) and let \(\mathcal{S}{:=}\{ k \ge 0 \mid \rho _k \ge \eta _1 \}\) be the set of all successful iterations.
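
Putting the pieces together, the outer loop of Algorithm 2.1 might be sketched as follows (our schematic reading of Steps 1–5; `phi`, `grad_phi`, `model_decrease` and `step_solver` are hypothetical callables, for instance built from the sketches above, and the default values for the algorithmic constants are ours, not the paper's).

```python
import numpy as np

def adaptive_tensor_newton(x0, phi, grad_phi, model_decrease, step_solver,
                           sigma0=1.0, sigma_min=1e-8, eps=1e-6,
                           eta1=0.1, eta2=0.9,
                           gamma1=0.5, gamma2=2.0, gamma3=10.0, max_iter=200):
    """Schematic outer loop of Algorithm 2.1 (illustration only).

    phi(x)                -> Phi(x) = 0.5*||r(x)||^2
    grad_phi(x)           -> J(x)^T r(x)
    model_decrease(x, s)  -> m(x,0) - m(x,s), positive by (2.5)
    step_solver(x, sigma) -> a step s satisfying (2.5) and (2.6)
    """
    x, sigma = np.asarray(x0, dtype=float), sigma0
    for _ in range(max_iter):
        if np.linalg.norm(grad_phi(x)) <= eps:               # termination test, cf. (1.2)
            break
        s = step_solver(x, sigma)                             # Step 2
        rho = (phi(x) - phi(x + s)) / model_decrease(x, s)    # Step 3, (2.7)
        if rho >= eta2:                                       # very successful, (2.8)
            sigma = max(sigma_min, gamma1 * sigma)
        elif rho >= eta1:                                     # successful
            sigma = gamma2 * sigma
        else:                                                 # unsuccessful: keep x, enlarge sigma
            sigma = gamma3 * sigma
        if rho >= eta1:                                       # Step 5
            x = x + s
    return x
```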

3 Convergence analysis

We make the following blanket assumption:

  • AS.1 each component \(r_i(x)\) and its first two derivatives are Lipschitz continuous on an open set containing the intervals \([x_k, x_k+s_k]\) generated by Algorithm 2.1 (or its successor).

It has been shown [10, Lemma 3.1] that AS.1 implies that \(\Phi (x)\) and its first two derivatives are Lipschitz on \([x_k, x_k+s_k]\).

We define

$$\begin{aligned} H(x,y) \, {:=}\sum _{i=1}^m y_i \nabla _{xx} r_i(x) \end{aligned}$$

and let \(q(x,s)\) be the vector whose \(i\)th component is

$$\begin{aligned} q_i(x,s) \, {:=}\,\, s^T \nabla _{xx} r_i(x) s \end{aligned}$$

for \(i=1,\ldots ,m\). In this case

$$\begin{aligned} t(x,s) = r(x) + J(x) s + {\scriptstyle \frac{1}{2}}q(x,s). \end{aligned}$$

Since \(m(x_k,s)\) is a second-order accurate model of \(\Phi (x_k+s)\), we expect bounds of the form

$$\begin{aligned} |\Phi (x_k+s_k) - m(x_k,s_k)| \le L_{\mathrm{f}} \Vert s_k\Vert ^3 \end{aligned}$$
(3.1)

and

$$\begin{aligned} \Vert \nabla _x \Phi (x_k+s_k) - \nabla _s m(x_k,s_k)\Vert \le L_{\mathrm{g}} \Vert s_k\Vert ^2 \end{aligned}$$
(3.2)

for some \(L_{\mathrm{f}}, L_{\mathrm{g}} > 0\) and all \(k \ge 0\) for which \(\Vert s_k\Vert \le 1\) (see “Appendix A”).

Also, since \(\Vert r(x_k)\Vert \) decreases monotonically,

$$\begin{aligned} \Vert J^T(x_k) r(x_k)\Vert \le \Vert J^T(x_k)\Vert \Vert r(x_k)\Vert \le L_{\mathrm{J}} \Vert r(x_0)\Vert \end{aligned}$$
(3.3)

and

$$\begin{aligned} \Vert H(x_k,r(x_k))\Vert \le L_{\mathrm{H}} \Vert r(x_k)\Vert \le L_{\mathrm{H}} \Vert r(x_0)\Vert \end{aligned}$$
(3.4)

for some \(L_{\mathrm{J}}, L_{\mathrm{H}} > 0\) and all \(k \ge 0\) (again, see “Appendix A”).

Our first result derives simple conclusions from the basic requirement that the step \(s_k\) in our algorithm is chosen to reduce the regularized model.

Lemma 3.1

Algorithm 2.1 ensures that

$$\begin{aligned} m(x_k,0) - m(x_k,s_k) > {\scriptstyle \frac{1}{r}}\sigma _k \Vert s_k\Vert ^r \end{aligned}$$
(3.5)

In addition, if \(r=2\), at least one of

$$\begin{aligned} \sigma _k < 2 \Vert H(x_k,r(x_k))\Vert \end{aligned}$$
(3.6)

or

$$\begin{aligned} \sigma _k \Vert s_k\Vert < 4 \Vert J^T(x_k) r(x_k)\Vert \end{aligned}$$
(3.7)

holds, while if \(r>2\),

$$\begin{aligned} \Vert s_k\Vert < \max \left( \left( \frac{ r \Vert H(x_k,r(x_k))\Vert }{\sigma _k}\right) ^{1/(r-1)} , \;\;\;\;\;\;\;\;\; \left( \frac{2 r \Vert J^T(x_k) r(x_k)\Vert }{\sigma _k}\right) ^{1/(r-2)} \right) .\nonumber \\ \end{aligned}$$
(3.8)

Proof

It follows from (2.5), (2.3) and (2.2) that

$$\begin{aligned} 0> & {} 2\left( m(x_k,s_k) + {\scriptstyle \frac{1}{r}}\sigma _k \Vert s_k\Vert ^r - m(x_k,0) \right) \nonumber \\= & {} \Vert r(x_k) + J(x_k) s_k + {\scriptstyle \frac{1}{2}}q(x_k,s_k) \Vert ^2 + {\scriptstyle \frac{2}{r}}\sigma _k \Vert s_k\Vert ^r - \Vert r(x_k)\Vert ^2 \nonumber \\= & {} \Vert J(x_k) s_k + {\scriptstyle \frac{1}{2}}q(x_k,s_k) \Vert ^2 + 2 r^T(x_k)\left( J(x_k) s_k + {\scriptstyle \frac{1}{2}}q(x_k,s_k) \right) + {\scriptstyle \frac{2}{r}}\sigma _k \Vert s_k\Vert ^r \nonumber \\= & {} \Vert J(x_k) s_k + {\scriptstyle \frac{1}{2}}q(x_k,s_k) \Vert ^2 + 2 s_k^T J^T(x_k) r(x_k) + s_k^T H(x_k,r(x_k)) s_k + {\scriptstyle \frac{2}{r}}\sigma _k \Vert s_k\Vert ^r \nonumber \\\ge & {} \Vert J(x_k) s_k + {\scriptstyle \frac{1}{2}}q(x_k,s_k) \Vert ^2 - 2 \Vert J^T(x_k) r(x_k) \Vert \Vert s_k\Vert \nonumber \\&- \Vert H(x_k,r(x_k))\Vert \Vert s_k\Vert ^2 + {\scriptstyle \frac{2}{r}}\sigma _k \Vert s_k\Vert ^r. \end{aligned}$$
(3.9)

Inequality (3.5) follows immediately from the first inequality in (3.9). When \(r=2\), inequality (3.9) becomes

$$\begin{aligned} 0> & {} \Vert J(x_k) s_k + {\scriptstyle \frac{1}{2}}q(x_k,s_k) \Vert ^2 \\&+ \left( {\scriptstyle \frac{1}{2}}\sigma _k \Vert s_k\Vert - 2 \Vert J^T(x_k) r(x_k) \Vert \right) \Vert s_k\Vert + \left( {\scriptstyle \frac{1}{2}}\sigma _k - \Vert H(x_k,r(x_k)) \Vert \right) \Vert s_k\Vert ^2. \end{aligned}$$

In order for this to be true, it must be that at least one of the last two terms is negative, and this provides the alternatives (3.6) and (3.7). By contrast, when \(r>2\), inequality (3.9) becomes

$$\begin{aligned} 0> & {} \Vert J(x_k) s_k + {\scriptstyle \frac{1}{2}}q(x_k,s_k) \Vert ^2 + \left( {\scriptstyle \frac{1}{r}}\sigma _k \Vert s_k\Vert ^{r-1} - 2 \Vert J^T(x_k) r(x_k) \Vert \right) \Vert s_k\Vert \\&+ \left( {\scriptstyle \frac{1}{r}}\sigma _k \Vert s_k\Vert ^{r-2} - \Vert H(x_k,r(x_k)) \Vert \right) \Vert s_k\Vert ^2, \end{aligned}$$

and this implies that

$$\begin{aligned} {\scriptstyle \frac{1}{r}}\sigma _k \Vert s_k\Vert ^{r-1}< 2 \Vert J^T(x_k) r(x_k)\Vert \;\; \text{ or } \;\; {\scriptstyle \frac{1}{r}}\sigma _k \Vert s_k\Vert ^{r-2} < \Vert H(x_k,r(x_k)) \Vert \end{aligned}$$

(or both), which gives (3.8). \(\square \)

Our next task is to show that \(\sigma _k\) is bounded from above. Let

$$\begin{aligned} \mathcal{B}_\gamma {:=}\left\{ j \ge 0 \mid \sigma _j \ge \gamma r \max \left( \Vert H\left( x_j,r(x_j)\right) \Vert , 2 \Vert J^T(x_j) r(x_j)\Vert \right) \right\} \end{aligned}$$

and

$$\begin{aligned} \mathcal{B}\,\, {:=}\, \, \mathcal{B}_1, \end{aligned}$$

and note that Lemma 3.1 implies that

$$\begin{aligned} \Vert s_k\Vert \le 1 \;\; \text{ if } \;\; k \in \mathcal{B}_\gamma \;\; \text{ when } \;\; \gamma \ge 1, \end{aligned}$$

and in particular

$$\begin{aligned} \Vert s_k\Vert \le 1 \;\; \text{ for } \text{ all } \;\; k \in \mathcal{B}. \end{aligned}$$
(3.10)

We consider first the special case for which \(r=2\).

Lemma 3.2

Suppose that AS.1 holds, \(r=2\), \(k \in \mathcal{B}\) and

$$\begin{aligned} \sigma _k \ge \sqrt{\frac{8 L_{\mathrm{f}} L_{\mathrm{J}} \Vert r(x_0)\Vert }{1 - \eta _2}}. \end{aligned}$$
(3.11)

Then iteration k of Algorithm 2.1 is very successful.

Proof

Since \(k \in \mathcal{B}\), Lemma 3.1 implies that (3.7) and (3.10) hold. Then (2.7), (3.1) and (3.5) give that

$$\begin{aligned} |\ \rho _k - 1 | = \frac{|\Phi (x_k+s_k) - m(x_k,s_k)|}{m(x_k,0)-m(x_k,s_k)} \le \frac{2 L_{\mathrm{f}} \Vert s_k\Vert }{\sigma _k} \end{aligned}$$

and hence

$$\begin{aligned} |\ \rho _k - 1 | \le \frac{8 L_{\mathrm{f}} \Vert J^T(x_k) r(x_k)\Vert }{\sigma ^2_k} \le \frac{8 L_{\mathrm{f}} L_{\mathrm{J}} \Vert r(x_0)\Vert }{\sigma ^2_k} \le 1 - \eta _2 \end{aligned}$$

from (3.3), (3.7) and (3.11). Thus it follows from (2.8) that the iteration is very successful. \(\square \)

Lemma 3.3

Suppose that AS.1 holds and \(r=2\). Then Algorithm 2.1 ensures that

$$\begin{aligned} \sigma _k \le \sigma _{\max } {:=}\gamma _3 \max \left( \sqrt{\frac{8 L_{\mathrm{f}} L_{\mathrm{J}} \Vert r(x_0)\Vert }{1 - \eta _2}}, \sigma _0, 2\max ( L_{\mathrm{H}},2 L_{\mathrm{J}}) \Vert r(x_0)\Vert \right) \end{aligned}$$
(3.12)

for all \(k \ge 0\).

Proof

Let

$$\begin{aligned} \sigma ^{\mathrm{B}}_{\max } = \gamma _3 \max \left( \sqrt{\frac{8 L_{\mathrm{f}} L_{\mathrm{J}} \Vert r(x_0)\Vert }{1 - \eta _2}}, \sigma _0 \right) . \end{aligned}$$

Suppose that \(k + 1 \in \mathcal{B}_{\gamma _3}\) is the first iteration for which \(\sigma _{k+1} \ge \sigma ^{\mathrm{B}}_{\max }\). Then, since \(\sigma _k < \sigma _{k+1}\), iteration k must have been unsuccessful, \(x_k=x_{k+1}\) and (2.8) gives that \(\sigma _{k+1} \le \gamma _3 \sigma _k\). Thus

$$\begin{aligned} \gamma _3 \sigma _k \ge \sigma _{k+1} \ge 2\gamma _3 \max \left( \Vert H(x_{k+1},r(x_{k+1}))\Vert , 2 \Vert J^T(x_{k+1}) r(x_{k+1}) \Vert \right) = 2\gamma _3\max \left( \Vert H(x_k,r(x_k))\Vert ,2 \Vert J^T(x_{k}) r(x_{k})\Vert \right) \end{aligned}$$

since \(k + 1 \in \mathcal{B}_{\gamma _3}\), which implies that \(k \in \mathcal{B}\). Furthermore,

$$\begin{aligned} \gamma _3 \sigma _k \ge \sigma _{k+1} \ge \sigma ^{\mathrm{B}}_{\max } \ge \gamma _3 \sqrt{\frac{8 L_{\mathrm{f}} L_{\mathrm{J}} \Vert r(x_0)\Vert }{1 - \eta _2}}, \end{aligned}$$

which implies that (3.11) holds. But then Lemma 3.2 implies that iteration k must be very successful. This contradiction ensures that

$$\begin{aligned} \sigma _k < \sigma ^{\mathrm{B}}_{\max } \end{aligned}$$
(3.13)

for all \(k \in \mathcal{B}_{\gamma _3}\). For all other iterations, we have that \(k \notin \mathcal{B}_{\gamma _3}\), and for these the definition of \(\mathcal{B}_{\gamma _3}\), and the bounds (3.3) and (3.4) give

$$\begin{aligned} \sigma _k < 2 \gamma _3 \max ( \Vert H(x_k,r(x_k))\Vert , 2\Vert J^T(x_k)r(x_k)\Vert ) \le 2 \gamma _3 \max (L_{\mathrm{H}}, 2L_{\mathrm{J}})\Vert r(x_0)\Vert . \end{aligned}$$
(3.14)

Combining (3.13) and (3.14) gives (3.12).\(\square \)

We now turn to the general case for which \(2<r\le 3\).

Lemma 3.4

Suppose that AS.1 holds, \(2<r\le 3\), \(k \in \mathcal{B}\) and

$$\begin{aligned} \sigma _k \ge \max \left( \left( \frac{ r L_{\mathrm{f}} \left( r L_{\mathrm{H}} \Vert r(x_0)\Vert \right) ^{\frac{3-r}{r-1}}}{1 - \eta _2} \right) ^{\frac{r-1}{2}} , \;\;\; \left( \frac{ r L_{\mathrm{f}} \left( 2 r L_{\mathrm{J}} \Vert r(x_0)\Vert \right) ^{\frac{3-r}{r-2}}}{1 - \eta _2} \right) ^{r-2} \right) \nonumber \\ \end{aligned}$$
(3.15)

Then iteration k of Algorithm 2.1 is very successful.

Proof

Since \(k \in \mathcal{B}\), it follows from (2.7), (3.10), (3.1), (3.5), (3.8), (3.3), (3.4) and (3.15) that

$$\begin{aligned} | \rho _k - 1 |= & {} \displaystyle \frac{|\Phi (x_k+s_k) - m(x_k,s_k)|}{m(x_k,0)-m(x_k,s_k)} \le \frac{r L_{\mathrm{f}} \Vert s_k\Vert ^{3-r}}{\sigma _k} \nonumber \\< & {} r L_{\mathrm{f}} \max \left( \left( r \Vert H(x_k,r(x_k))\Vert \right) ^{(3-r)/(r-1)} \sigma _k^{-2/(r-1)} ,\right. \nonumber \\&\qquad \qquad \left. \left( 2 r \Vert J^T(x_k) r(x_k)\Vert \right) ^{(3-r)/(r-2)} \sigma _k^{-1/(r-2)} \right) \nonumber \\\le & {} r L_{\mathrm{f}} \max \left( \left( r L_{\mathrm{H}} \Vert r(x_0)\Vert \right) ^{(3-r)/(r-1)} \sigma _k^{-2/(r-1)} , \right. \nonumber \\&\qquad \qquad \,\, \left. \left( 2 r L_{\mathrm{J}} \Vert r(x_0)\Vert \right) ^{(3-r)/(r-2)} \sigma _k^{-1/(r-2)} \right) \nonumber \\\le & {} 1 - \eta _2. \end{aligned}$$
(3.16)

As before, (2.8) then ensures that the iteration is very successful. \(\square \)

Lemma 3.5

Suppose that AS.1 holds and \(2<r\le 3\). Then Algorithm 2.1 ensures that

$$\begin{aligned} \sigma _k \le \sigma _{\max } {:=}\gamma _3 \max \left( \left( \frac{ r L_{\mathrm{f}} \left( r L_{\mathrm{H}} \Vert r(x_0)\Vert \right) ^{\frac{3-r}{r-1}}}{1 - \eta _2} \right) ^{\frac{r-1}{2}} , \;\; \left( \frac{ r L_{\mathrm{f}} \left( 2 r L_{\mathrm{J}} \Vert r(x_0)\Vert \right) ^{\frac{3-r}{r-2}}}{1 - \eta _2} \right) ^{r-2} , \;\; \sigma _0 , \;\; r \max ( L_{\mathrm{H}}, 2 L_{\mathrm{J}}) \Vert r(x_0)\Vert \right) \end{aligned}$$
(3.17)

for all \(k \ge 0\).

Proof

The proof mimics that of Lemma 3.3. First, suppose that \(k \in \mathcal{B}_{\gamma _3}\) and that iteration \(k+1\) is the first for which

$$\begin{aligned} \sigma _{k+1} \ge \sigma ^{\mathrm{B}}_{\max } {:=} \gamma _3 \max \left( \left( \frac{ r L_{\mathrm{f}} \left( r L_{\mathrm{H}} \Vert r(x_0)\Vert \right) ^{\frac{3-r}{r-1}}}{1 - \eta _2} \right) ^{\frac{r-1}{2}} , \;\;\; \left( \frac{ r L_{\mathrm{f}} \left( 2 r L_{\mathrm{J}} \Vert r(x_0)\Vert \right) ^{\frac{3-r}{r-2}}}{1 - \eta _2} \right) ^{r-2} , \;\; \sigma _0 \right) . \end{aligned}$$

Then, since \(\sigma _k < \sigma _{k+1}\), iteration k must have been unsuccessful and (2.8) gives that

$$\begin{aligned} \gamma _3 \sigma _k \ge \sigma _{k+1} \ge \sigma ^{\mathrm{B}}_{\max }, \end{aligned}$$

which implies that \(k \in \mathcal{B}\) and (3.15) holds. But then Lemma 3.4 implies that iteration k must be very successful. This contradiction provides the first three terms in the bound (3.17), while the others arise as for the proof of Lemma 3.3 when \(k \notin \mathcal{B}_{\gamma _3}\). \(\square \)

Next, we bound the number of iterations in terms of the number of successful ones.

Lemma 3.6

[8, Theorem 2.1]. The adjustment (2.8) in Algorithm 2.1 ensures that

$$\begin{aligned} k \le \kappa _u |\mathcal{S}_k| + \kappa _s, \;\; \text{ where } \;\; \kappa _u {:=}\left( 1 - \frac{\log \gamma _1}{\log \gamma _2} \right) , \kappa _s {:=} \frac{1}{\log \gamma _2} \log \left( \frac{\sigma _{\max }}{\sigma _0}\right) ,\qquad \end{aligned}$$
(3.18)

and \(\sigma _{\max }\) is any known upper bound on \(\sigma _k\).
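
For intuition, here is a brief sketch of the counting argument behind (3.18) (our paraphrase of [8]): every successful iteration decreases \(\sigma \) by at most a factor \(\gamma _1\), while every unsuccessful one increases it by at least a factor \(\gamma _2\), so that, up to the indexing of the final iteration,

$$\begin{aligned} \sigma _{\max } \ge \sigma _k \ge \sigma _0 \, \gamma _1^{|\mathcal{S}_k|} \, \gamma _2^{\,k - |\mathcal{S}_k|} \;\; \text{ and } \text{ hence } \;\; |\mathcal{S}_k| \log \gamma _1 + ( k - |\mathcal{S}_k| ) \log \gamma _2 \le \log \left( \frac{\sigma _{\max }}{\sigma _0}\right) , \end{aligned}$$

which, since \(\log \gamma _1< 0 < \log \gamma _2\), rearranges to (3.18).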

Our final ingredient is to find a useful bound on the smallest model decrease as the algorithm proceeds. Let \(\mathcal{L}{:=}\{ k \mid \Vert s_k\Vert \le 1 \}\), and let \(\mathcal{G}{:=}\{ k \mid \Vert s_k\Vert > 1 \}\) be its complement. We then have the following crucial bounds.

Lemma 3.7

Suppose that AS.1 holds and \(2\le r\le 3\). Then Algorithm 2.1 ensures that

$$\begin{aligned} m(x_k,0) - m(x_k,s_k) \ge \left\{ \begin{array}{ll} \frac{1}{r}\sigma _{\min } \left( \displaystyle \frac{\Vert \nabla _x \Phi (x_k+s_k) \Vert }{L_{\mathrm{g}} + \theta + \sigma _{\max }} \right) ^{\frac{r}{r-1}} &{} \;\; \text{ if } \;\; k \in \mathcal{L}\\ \frac{1}{r} \sigma _{\min } &{} \;\; \text{ if } \;\; k \in \mathcal{G}. \end{array} \right. \end{aligned}$$
(3.19)

Proof

Consider \(k \in \mathcal{L}\). The triangle inequality and (2.4) reveal that

$$\begin{aligned} \Vert \nabla _x \Phi (x_k+s_k) \Vert= & {} \left\| \left( \nabla _x \Phi (x_k+s_k) - \nabla _s m(x_k,s_k) \right) \right. \nonumber \\&\left. + \left( \nabla _s m(x_k,s_k) + \sigma _k \Vert s_k\Vert ^{r-2} s_k \right) - \sigma _k \Vert s_k\Vert ^{r-2} s_k \right\| \nonumber \\\le & {} \Vert \nabla _x \Phi (x_k+s_k) - \nabla _s m(x_k,s_k) \Vert \nonumber \\&+ \Vert \nabla _s m^{\mathrm{R}}(x_k,s_k,\sigma _k) \Vert + \sigma _k\Vert s_k \Vert ^{r-1}. \end{aligned}$$
(3.20)

Combining (3.20) with (3.2), (2.6), (3.12), (3.17) and \(\Vert s_k\Vert \le 1\) we have

$$\begin{aligned} \Vert \nabla _x \Phi (x_k+s_k) \Vert \le L_{\mathrm{g}} \Vert s_k\Vert ^2 + \theta \Vert s_k\Vert ^{r-1} + \sigma _{\max } \Vert s_k\Vert ^{r-1} \le (L_{\mathrm{g}} + \theta + \sigma _{\max } ) \Vert s_k\Vert ^{r-1} \end{aligned}$$

and thus that

$$\begin{aligned} \Vert s_k\Vert \ge \left( \displaystyle \frac{\Vert \nabla _x \Phi (x_k+s_k) \Vert }{L_{\mathrm{g}} + \theta + \sigma _{\max }} \right) ^{\frac{1}{r-1}}. \end{aligned}$$

But then, combining this with (3.5) and the lower bound

$$\begin{aligned} \sigma _k \ge \sigma _{\min } \end{aligned}$$
(3.21)

imposed by Algorithm 2.1 provides the first possibility in (3.19).

By contrast, if \(k \in \mathcal{G}\), (3.5), \(\Vert s_k\Vert > 1\) and (3.21) ensure the second possibility in (3.19). \(\square \)

Corollary 3.8

Suppose that AS.1 holds and \(2\le r\le 3\). Then Algorithm 2.1 ensures that

$$\begin{aligned} \Phi (x_k) - \Phi (x_{k+1}) \ge \left\{ \begin{array}{ll} \frac{1}{r}\eta _1 \sigma _{\min } \left( \displaystyle \frac{\Vert \nabla _x \Phi (x_k+s_k) \Vert }{L_{\mathrm{g}} + \theta + \sigma _{\max }} \right) ^{\frac{r}{r-1}} &{}\quad \;\; \text{ if } \;\; k \in \mathcal{L}\cap \mathcal{S}\\ \frac{1}{r} \eta _1 \sigma _{\min } &{}\quad \;\; \text{ if } \;\; k \in \mathcal{G}\cap \mathcal{S}. \end{array} \right. \qquad \end{aligned}$$
(3.22)

Proof

The result follows directly from (2.7), the acceptance test \(\rho _k \ge \eta _1\) for a successful iteration, and (3.19). \(\square \)

We now provide our three main convergence results. Firstly, we establish the global convergence of our algorithm to first-order critical points of \(\Phi (x)\).

Theorem 3.9

Suppose that AS.1 holds and \(2\le r\le 3\). Then the iterates \(\{x_k\}\) generated by Algorithm 2.1 satisfy

$$\begin{aligned} \lim _{k \rightarrow \infty } \Vert \nabla _x \Phi (x_k)\Vert = 0 \end{aligned}$$
(3.23)

if no non-trivial termination test is provided.

Proof

Suppose that \(\epsilon > 0\), and consider any successful iteration for which

$$\begin{aligned} \Vert \nabla _x \Phi (x_k)\Vert \ge \epsilon >0. \end{aligned}$$
(3.24)

Then it follows from (3.22) that

$$\begin{aligned} \Phi (x_k)-\Phi (x_{k+1}) \ge \delta {:=} \frac{\eta _1 \sigma _{\min }}{r} \min \left( \left( \displaystyle \frac{\epsilon }{L_{\mathrm{g}} + \theta + \sigma _{\max }}\right) ^{\frac{r}{r-1}} ,\;\;\; 1 \right) > 0.\qquad \end{aligned}$$
(3.25)

Consider the set \(\mathcal{U}_{\epsilon } = \{ k \in \mathcal{S}\mid \Vert \nabla _x \Phi (x_k)\Vert \ge \epsilon \}\), suppose that \(\mathcal{U}_{\epsilon }\) is infinite, and let \(k_i\) be the i-th entry of \(\mathcal{U}_{\epsilon }\). Now consider

$$\begin{aligned} i_{\epsilon } = \lceil {\scriptstyle \frac{1}{2}}\Vert r(x_0)\Vert ^2 / \delta \rceil + 1. \end{aligned}$$

Then, summing (3.25) over the iterations in \(\mathcal{U}_{\epsilon }\) with \(k \le k_{i_{\epsilon }}\), and recalling that \(\Phi (x_0) = {\scriptstyle \frac{1}{2}}\Vert r(x_0)\Vert ^2\), that \(\Phi (x_k) \ge 0\) and that \(\Phi \) decreases monotonically, we have that

$$\begin{aligned} {\scriptstyle \frac{1}{2}}\Vert r(x_0)\Vert ^2 \ge \Phi (x_0)-\Phi (x_{k_{i_{\epsilon }}+1}) \ge \sum _{k \in \mathcal{U}_{\epsilon }, k \le k_{i_{\epsilon }}} \Phi (x_k)-\Phi (x_{k+1}) \ge i_{\epsilon } \delta > {\scriptstyle \frac{1}{2}}\Vert r(x_0)\Vert ^2.\nonumber \\ \end{aligned}$$
(3.26)

This contradiction shows that \(\mathcal{U}_{\epsilon }\) is finite for any \(\epsilon > 0\), and therefore (3.23) holds. \(\square \)

Secondly, we provide an evaluation complexity result based on the stopping criterion (1.2).

Theorem 3.10

Suppose that AS.1 holds and \(2\le r\le 3\). Then Algorithm 2.1 requires at most

$$\begin{aligned} \left\lceil \frac{\kappa _u\Vert r(x_0)\Vert ^2 r \left( L_{\mathrm{g}} + \theta + \sigma _{\max }\right) ^{\frac{r}{r-1}}}{2 \eta _1 \sigma _{\min }} \epsilon ^{-\frac{r}{r-1}} \right\rceil + \kappa _s + 1 \end{aligned}$$
(3.27)

evaluations of r(x) and its derivatives to find an iterate \(x_{k}\) for which the termination test

$$\begin{aligned} \Vert \nabla _x \Phi (x_k)\Vert \le \epsilon \end{aligned}$$

is satisfied for given \(0< \epsilon < 1\), where \(\kappa _u\) and \(\kappa _s\) are defined in (3.18).

Proof

If the algorithm has not terminated, (3.24) holds, so summing (3.25) as before

$$\begin{aligned} {\scriptstyle \frac{1}{2}}\Vert r(x_0)\Vert ^2 \ge \Phi (x_0)-\Phi (x_{k+1}) \ge |\mathcal{S}_k| \delta = |\mathcal{S}_k| \frac{\eta _1 \sigma _{\min }}{r} \left( \displaystyle \frac{\epsilon }{L_{\mathrm{g}} + \theta + \sigma _{\max }}\right) ^{\frac{r}{r-1}}\nonumber \\ \end{aligned}$$
(3.28)

since \(\epsilon < 1\), and thus that

$$\begin{aligned} |\mathcal{S}_k| \le \frac{\Vert r(x_0)\Vert ^2 \, r \left( L_{\mathrm{g}} + \theta + \sigma _{\max }\right) ^{\frac{r}{r-1}}}{2 \eta _1 \sigma _{\min }} \epsilon ^{-\frac{r}{r-1}}. \end{aligned}$$

Combining this with (3.18) and remembering that we need to evaluate the function and gradient at the final \(x_{k+1}\) yields the bound (3.27). \(\square \)

Notice how the evaluation complexity improves from \(O(\epsilon ^{-2})\) evaluations with quadratic \((r=2)\) regularization to \(O(\epsilon ^{-3/2})\) evaluations with cubic \((r=3)\) regularization. It is not clear if these bounds are sharp.

Finally, we refine this analysis to provide an alternative complexity result based on the stopping rule (1.3). The proof follows arguments similar to those in [9, §3.2] and [11, §3], and depends crucially upon the following elementary result.

Lemma 3.11

Suppose that \(a > b \ge 0\). Then

$$\begin{aligned} a^2 - b^2 \ge c \;\; \text{ implies } \text{ that } \;\; a^{1/2^i} - b^{1/2^i} \ge \frac{c}{2^{i+1}a^{\frac{2^{i+1}-1}{2^i}}} \end{aligned}$$

for all integers \(i \ge -1\).

Proof

The result follows directly by induction using the identity \(A^2 - B^2 = (A-B)(A+B)\) with \(A=a^{1/2^j} > B=b^{1/2^j}\) for increasing \(j\le i\). \(\square \)
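
For clarity, the inductive step may be written out as follows (our addition, in the lemma's notation): if \(a^{1/2^{j}} - b^{1/2^{j}} \ge c/(2^{j+1} a^{(2^{j+1}-1)/2^{j}})\) then, since \(a^{1/2^{j+1}} + b^{1/2^{j+1}} \le 2 a^{1/2^{j+1}}\),

$$\begin{aligned} a^{1/2^{j+1}} - b^{1/2^{j+1}} = \frac{a^{1/2^{j}} - b^{1/2^{j}}}{a^{1/2^{j+1}} + b^{1/2^{j+1}}} \ge \frac{c}{2^{j+1} a^{(2^{j+1}-1)/2^{j}} \cdot 2\, a^{1/2^{j+1}}} = \frac{c}{2^{j+2} a^{(2^{j+2}-1)/2^{j+1}}}, \end{aligned}$$

which is the stated bound with \(i=j+1\); the case \(i=-1\) is the hypothesis itself.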

Theorem 3.12

Suppose that AS.1 holds, \(2 < r\le 3\) and that the integer

$$\begin{aligned} i\ge i_{{\scriptscriptstyle 0}}{:=} \left\lceil \log _2 \left( \frac{r-1}{r-2} \right) \right\rceil \end{aligned}$$
(3.29)

is given. Then Algorithm 2.1 requires at most

$$\begin{aligned} \left\lceil \kappa _u \Vert r(x_0)\Vert ^{1/2^{i}} \max \left( \kappa _c^{-1}, \kappa _g^{-1} \epsilon _d^{-r/(r-1)}, \kappa _r^{-1} \epsilon _p^{- 1/2^{i}} \right) \right\rceil + \kappa _s + 1 \end{aligned}$$
(3.30)

evaluations of r(x) and its derivatives to find an iterate \(x_{k}\) for which the termination test

$$\begin{aligned} \Vert r(x_k)\Vert \le \epsilon _p \quad \mathrm{or}\quad \Vert g_r(x_k)\Vert \le \epsilon _d, \end{aligned}$$
(3.31)

is satisfied for given \(\epsilon _p>0\) and \(\epsilon _d>0\), where \(\kappa _u\) and \(\kappa _s\) are defined in (3.18), \(\kappa _c\), \(\kappa _g\) and \(\kappa _r\) are given by

$$\begin{aligned} \kappa _c \, {:=}\,&{\scriptstyle \frac{1}{2}}^{i+1} \frac{\eta _1 \sigma _{\min }}{r} \Vert r(x_0)\Vert ^{-(2^{i+1}-1)/2^{i}}, \nonumber \\ \kappa _g \, {:=}\,&\displaystyle \frac{{\scriptstyle \frac{1}{2}}^{i} \eta _1 \sigma _{\min } \beta ^{r/(r-1)}}{r(L_{\mathrm{g}} + \theta + \sigma _{\max })^{r/(r-1)}} \Vert r(x_0) \Vert ^{\left( r/(r-1) - (2^{i+1}-1)/2^{i}\right) } \nonumber \\ \;\; \text{ and } \;\; \kappa _r \,{:=}\,&\displaystyle \frac{1 - \beta ^{1/2^{i}} }{\beta ^{1/2^{i}}}, \end{aligned}$$
(3.32)

and \(\beta \in (0,1)\) is a fixed problem-independent constant.

Proof

Consider \(\mathcal{S}_{\beta } {:=}\{ l \in \mathcal{S}\mid \Vert r(x_{l+1})\Vert > \beta \Vert r(x_l)\Vert \}\), and note that (3.29) ensures that

$$\begin{aligned} \frac{2^{i+1} - 1}{2^{i}} \ge \frac{r}{r-1}, \end{aligned}$$
(3.33)

holds.

First, consider \(l \in \mathcal{G}\cap \mathcal{S}\). Then (3.22) gives that

$$\begin{aligned} \Vert r(x_l)\Vert ^2 - \Vert r(x_{l+1})\Vert ^2 \ge \frac{\eta _1 \sigma _{\min }}{r} \end{aligned}$$

and, since

$$\begin{aligned} \Vert r(x_{l+1})\Vert < \Vert r(x_l)\Vert \le \Vert r(x_0)\Vert \end{aligned}$$
(3.34)

for all \(l \in \mathcal{S}\), Lemma 3.11 implies that

$$\begin{aligned} \Vert r(x_l)\Vert ^{1/2^{i}} - \Vert r(x_{l+1})\Vert ^{1/2^{i}}\ge & {} {\scriptstyle \frac{1}{2}}^{i+1} \frac{\eta _1 \sigma _{\min }}{r} \Vert r(x_l)\Vert ^{-(2^{i+1}-1)/2^{i}} \nonumber \\\ge & {} {\scriptstyle \frac{1}{2}}^{i+1} \frac{\eta _1 \sigma _{\min }}{r} \Vert r(x_0)\Vert ^{-(2^{i+1}-1)/2^{i}}. \end{aligned}$$
(3.35)

By contrast, for \(l \in \mathcal{L}\cap \mathcal{S}\), (3.22) gives that

$$\begin{aligned} \Vert r(x_l)\Vert ^2 - \Vert r(x_{l+1})\Vert ^2\ge & {} \kappa \Vert J^T(x_{l+1}) r(x_{l+1}) \Vert ^{r/(r-1)}, \;\; \text{ where } \;\;\nonumber \\&\quad \kappa = \frac{2 \eta _1 \sigma _{\min }}{r(L_{\mathrm{g}} + \theta + \sigma _{\max })^{r/(r-1)}}. \end{aligned}$$
(3.36)

If additionally \(l \in \mathcal{S}_{\beta }\), (3.36) may be refined as

$$\begin{aligned} \Vert r(x_l)\Vert ^2 - \Vert r(x_{l+1})\Vert ^2&\ge \kappa \left( \frac{\Vert J^T(x_{l+1}) r(x_{l+1}) \Vert }{\Vert r(x_{l+1}) \Vert }\right) ^{r/(r-1)} \Vert r(x_{l+1}) \Vert ^{r/(r-1)} \nonumber \\&\ge \kappa \beta ^{r/(r-1)} \Vert g_r(x_{l+1}) \Vert ^{r/(r-1)} \Vert r(x_l) \Vert ^{r/(r-1)} \end{aligned}$$
(3.37)

from (1.4) and the requirement that \(\Vert r(x_{l+1})\Vert > \beta \Vert r(x_l)\Vert \). Using (3.37), (3.34), Lemma 3.11 and (3.33), we then obtain the bound

$$\begin{aligned} \Vert r(x_l)\Vert ^{1/2^{i}} - \Vert r(x_{l+1})\Vert ^{1/2^{i}}\ge & {} {\scriptstyle \frac{1}{2}}^{i+1} \kappa \beta ^{r/(r-1)} \Vert g_r(x_{l+1}) \Vert ^{r/(r-1)}\nonumber \\&\Vert r(x_l) \Vert ^{\left( r/(r-1) - (2^{i+1}-1)/2^{i}\right) } \nonumber \\\ge & {} {\scriptstyle \frac{1}{2}}^{i+1} \kappa \beta ^{r/(r-1)} \Vert r(x_0) \Vert ^{\left( r/(r-1) - (2^{i+1}-1)/2^{i}\right) }\nonumber \\&\Vert g_r(x_{l+1}) \Vert ^{r/(r-1)} \end{aligned}$$
(3.38)

for all \(l \in \mathcal{L}\cap \mathcal{S}_{\beta }\). Finally, consider \(l \in \mathcal{S}\setminus \mathcal{S}_{\beta }\), for which \(\Vert r(x_{l+1})\Vert \le \beta \Vert r(x_l)\Vert \) and hence \(\Vert r(x_{l+1})\Vert ^{1/2^{i}} \le \beta ^{1/2^{i}} \Vert r(x_l)\Vert ^{1/2^{i}}\). Thus we have that

$$\begin{aligned} \Vert r(x_l)\Vert ^{1/2^{i}} - \Vert r(x_{l+1})\Vert ^{1/2^{i}}&\ge (1 - \beta ^{1/2^{i}} ) \Vert r(x_l)\Vert ^{1/2^{i}} \nonumber \\&\ge \frac{1 - \beta ^{1/2^{i}} }{\beta ^{1/2^{i}}} \Vert r(x_{l+1})\Vert ^{1/2^{i}} \end{aligned}$$
(3.39)

for all \(l \in \mathcal{S}\setminus \mathcal{S}_{\beta }\). Thus, combining (3.35), (3.38) and (3.39), we have that

$$\begin{aligned} \Vert r(x_l)\Vert ^{1/2^{i}} - \Vert r(x_{l+1})\Vert ^{1/2^{i}} \ge \min \left( \kappa _c, \kappa _g \Vert g_r(x_{l+1}) \Vert ^{r/(r-1)}, \kappa _r \Vert r(x_{l+1})\Vert ^{1/2^{i}} \right) , \nonumber \\ \end{aligned}$$
(3.40)

for \(\kappa _c\), \(\kappa _g\) and \(\kappa _r\) given by (3.32), for all \(l \in \mathcal{S}\).

Now suppose that the stopping rule (3.31) has not been satisfied up until the start of iteration \(k+1\), and thus that

$$\begin{aligned} \Vert r(x_{l+1})\Vert> \epsilon _p \quad \mathrm{and}\quad \Vert g_r(x_{l+1})\Vert > \epsilon _d \end{aligned}$$
(3.41)

for all \(l \in \mathcal{S}_k\). Combining this with (3.40), we have that

$$\begin{aligned} \Vert r(x_l)\Vert ^{1/2^{i}} - \Vert r(x_{l+1})\Vert ^{1/2^{i}} \ge \min \left( \kappa _c, \kappa _g \epsilon _d^{r/(r-1)}, \kappa _r \epsilon _p^{1/2^{i}} \right) , \end{aligned}$$

and thus, summing over \(l \in \mathcal{S}_k\) and using (3.34),

$$\begin{aligned} \Vert r(x_0)\Vert ^{1/2^{i}} \ge \Vert r(x_0)\Vert ^{1/2^{i}} - \Vert r(x_{k+1})\Vert ^{1/2^{i}} \ge |\mathcal{S}_k| \min \left( \kappa _c, \kappa _g \epsilon _d^{r/(r-1)}, \kappa _r \epsilon _p^{1/2^{i}} \right) . \end{aligned}$$

As before, combining this with (3.18) and remembering that we need to evaluate the function and gradient at the final \(x_{k+1}\) yields the bound (3.30).\(\square \)

If \(i< i_{{\scriptscriptstyle 0}}\), a weaker bound that includes \(r = 2\) is possible. The key is to note that the purpose of (3.33) is to guarantee the second inequality in (3.38). Without this, we have instead

$$\begin{aligned}&\Vert r(x_l)\Vert ^{1/2^{i}} - \Vert r(x_{l+1})\Vert ^{1/2^{i}} \ge {\scriptstyle \frac{1}{2}}^{i+1} \kappa \beta ^{r/(r-1)} \Vert g_r(x_{l+1}) \Vert ^{r/(r-1)} \Vert r(x_{l+1}) \Vert ^{\left( r/(r-1) - (2^{i+1}-1)/2^{i}\right) }\nonumber \\ \end{aligned}$$
(3.42)

for all \(l \in \mathcal{L}\cap \mathcal{S}_{\beta }\), and this leads to

$$\begin{aligned} \Vert r(x_l)\Vert ^{1/2^{i}} - \Vert r(x_{l+1})\Vert ^{1/2^{i}} \ge \min \left( \kappa _c, \kappa _g' \epsilon _d^{r/(r-1)} \epsilon _p^{\left( r/(r-1) - (2^{i+1}-1)/2^{i}\right) } , \kappa _r \epsilon _p^{1/2^{i}} \right) , \end{aligned}$$

where

$$\begin{aligned} \kappa _g' {:=}\displaystyle \frac{{\scriptstyle \frac{1}{2}}^{i} \eta _1 \sigma _{\min } \beta ^{r/(r-1)}}{r( L_{\mathrm{g}} + \theta + \sigma _{\max })^{r/(r-1)}}. \end{aligned}$$

if (3.41) holds. This results in a bound of \(O\Big (\max ( 1,\epsilon _d^{-r/(r-1)} \epsilon _p^{-\left( r/(r-1) - (2^{i+1}-1)/2^{i}\right) }, \epsilon _p^{-1/2^{i}} )\Big )\) function evaluations, which approaches that in (3.30) as \(i\) increases to infinity when \(r=2\).

4 A modified algorithm for cubic-and-higher regularization

For the case where \(r > 3\), the proof of Lemma 3.4 breaks down as there is no obvious bound on the quantity \(\Vert s_k\Vert ^{3-r}/\sigma _k\). One way around this defect is to modify Algorithm 2.1 so that such a bound automatically occurs. We consider the following variant; our development follows very closely that in [12], itself inspired by [20]. For completeness, we allow \(r = 3\) in this new framework since it is trivial to do so.

Algorithm 4.1

Adaptive Tensor-Newton Regularization when \(r \ge 3\).

A starting point \(x_0\), an initial regularization parameter \(\sigma _0>0\) and algorithmic parameters \(\theta > 0\), \(\alpha \in (0,{\scriptstyle \frac{1}{3}}]\), \(\gamma _3\ge \gamma _2> 1> \gamma _1 >0\) and \(1>\eta _2\ge \eta _1>0\), are given. Evaluate \(\Phi (x_0)\), and test for termination at \(x_0\).

For \(k=0, 1, \ldots \), until termination, do:

  1.

    Compute derivatives of r(x) at \(x_k\).

  2.

    Compute a step \(s_k\) by approximately minimizing \(m^{\mathrm{R}}(x_k,s,\sigma _k)\) so that

    $$\begin{aligned} m^{\mathrm{R}}(x_k,s_k,\sigma _k)<m^{\mathrm{R}}(x_k,0,\sigma _k) \end{aligned}$$

    and

    $$\begin{aligned} \Vert \nabla _s m^{\mathrm{R}}(x_k,s_k,\sigma _k)\Vert \le \theta \Vert s_k\Vert ^{2} \end{aligned}$$
    (4.1)

    hold.

  3.

    Set \(\widehat{x}_k = x_k+s_k\), and test for termination at \(\widehat{x}_k\).

  4.

    Compute \(\Phi (\widehat{x}_k)\) and

    $$\begin{aligned} \rho _k = \frac{\Phi (x_k)-\Phi (\widehat{x}_k)}{m(x_k,0)-m(x_k,s_k)}. \end{aligned}$$

    If \(\rho _k \ge \eta _1\) and

    $$\begin{aligned} \sigma _k \Vert s_k\Vert ^{r-1} \ge \alpha \Vert \nabla _x \Phi (\widehat{x}_k)\Vert , \end{aligned}$$
    (4.2)

    set \(x_{k+1} = \widehat{x}_k\).

  5.

    Set

    $$\begin{aligned} \sigma _{k+1} \in \left\{ \begin{array}{cl} [\gamma _1\sigma _k,\sigma _k ] &{} \;\; \text{ if } \;\; \rho _k \ge \eta _2 \;\; \text{ and } \;\; (4.2) \;\; \text{ holds } \;\; \\ \;[ \sigma _k, \gamma _2 \sigma _k ] &{} \;\; \text{ if } \;\; \eta _1 \le \rho _k< \eta _2 \;\; \text{ and } \;\; (4.2) \;\; \text{ holds } \;\; \\ \; [ \gamma _2 \sigma _k, \gamma _3 \sigma _k ] &{} \;\; \text{ if } \;\; \rho _k < \eta _1 \;\; \text{ or } \;\; (4.2) \;\; \text{ fails, } \;\; \end{array}\right. \end{aligned}$$
    (4.3)

    and go to Step 2 if \(\rho _k < \eta _1\) or (4.2) fails.

It is important that termination is tested at Step 3 as deductions from computations in subsequent steps rely on this. We modify our definition of a successful step accordingly so that now \(\mathcal{S}_k =\{0 \le l \le k \mid \rho _l \ge \eta _1 \;\; \text{ and } \;\; (4.2) \;\; \text{ holds }\}\) and \(\mathcal{S}=\{ k \ge 0 \mid \rho _k \ge \eta _1 \;\; \text{ and } \;\; (4.2) \;\; \text{ holds }\}\), and note in particular that Lemma 3.6 continues to hold in this case since it only depends on the adjustments in (4.3). Likewise, a very successful iteration is now one for which \(\rho _k \ge \eta _2\) and (4.2) holds. Note that (4.3), unlike (2.8) in Algorithm 2.1, does not impose a nonzero lower bound on the generated regularization weight; this will be reflected in our derived complexity bound (cf Theorems 3.12 and 4.7).
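
In code, the only change to the acceptance logic relative to Algorithm 2.1 is the extra test (4.2); a minimal sketch (our naming, Python) is as follows.

```python
import numpy as np

def accept_step(phi_x, phi_xhat, model_dec, sigma, s, grad_phi_xhat,
                eta1, alpha, reg_order):
    """Algorithm 4.1, Steps 4-5: accept x_hat only if rho >= eta1 and (4.2) holds."""
    rho = (phi_x - phi_xhat) / model_dec
    step_long_enough = (sigma * np.linalg.norm(s) ** (reg_order - 1)
                        >= alpha * np.linalg.norm(grad_phi_xhat))   # condition (4.2)
    return (rho >= eta1) and step_long_enough, rho
```

A failure of (4.2) is then handled by the third branch of (4.3), exactly as if \(\rho _k < \eta _1\).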

As is now standard, our first task is to establish an upper bound on \(\sigma _k\).

Lemma 4.1

Suppose that AS.1 holds, \(r \ge 3\) and

$$\begin{aligned} \sigma _k \Vert s_k\Vert ^{r-3} \ge \kappa _2, \;\; \text{ where } \;\; \kappa _2 {:=}\frac{r L}{1-\eta _2} \;\; \text{ and } \;\; L = \max (L_{\mathrm{f}},L_{\mathrm{g}},\theta ). \end{aligned}$$
(4.4)

Then iteration k of Algorithm 4.1 is very successful.

Proof

It follows immediately from (2.7), (3.1), (3.5) and (4.4) that

$$\begin{aligned} | \rho _k - 1 | = \displaystyle \frac{|\Phi (x_k+s_k) - m(x_k,s_k)|}{m(x_k,0)-m(x_k,s_k)} \le \frac{r L_{\mathrm{f}} \Vert s_k\Vert ^{3-r}}{\sigma _k} \le \frac{r L \Vert s_k\Vert ^{3-r}}{\sigma _k} \le 1 - \eta _2, \end{aligned}$$

and thus \(\rho _k \ge \eta _2\). Observe that

$$\begin{aligned} \kappa _2 \ge L \end{aligned}$$
(4.5)

since \(1 - \eta _2 \le 1\) and \(r \ge 1\). We also have from (3.20), (3.2) and (4.1) that

$$\begin{aligned} \Vert \nabla _x \Phi (x_k+s_k) \Vert \le L_{\mathrm{g}} \Vert s_k\Vert ^2 + \theta \Vert s_k\Vert ^2 + \sigma _k \Vert s_k\Vert ^{r-1} = \left( L_{\mathrm{g}} + \theta + \sigma _k \Vert s_k\Vert ^{r-3}\right) \Vert s_k\Vert ^2\nonumber \\ \end{aligned}$$
(4.6)

and thus from (4.4), (4.5) and the algorithmic restriction \(3 \le 1/\alpha \) that

$$\begin{aligned} \Vert \nabla _x \Phi (x_k+s_k) \Vert\le & {} \left( 2 L + \sigma _k \Vert s_k\Vert ^{r-3}\right) \Vert s_k\Vert ^2 \le \left( 3 \sigma _k \Vert s_k\Vert ^{r-3}\right) \Vert s_k\Vert ^2\\= & {} 3 \sigma _k \Vert s_k\Vert ^{r-1} \le \frac{\sigma _k}{\alpha } \Vert s_k\Vert ^{r-1}. \end{aligned}$$

Thus (4.2) is also satisfied, and hence iteration k is very successful. \(\square \)

Lemma 4.2

Suppose that AS.1 holds, \(r \ge 3\) and

$$\begin{aligned} \sigma _k \ge \kappa _1 \Vert \nabla _x \Phi (x_k+s_k)\Vert ^{(3-r)/2}, \;\; \text{ where } \;\; \kappa _1 {:=}\, \kappa _2 (3 \kappa _2)^{(r-3)/2} \end{aligned}$$
(4.7)

and \(\kappa _2\) is defined in the statement of Lemma 4.1. Then iteration k of Algorithm 4.1 is very successful.

Proof

It follows from Lemma 4.1 that it suffices to show that (4.7) implies (4.4). The result is immediate when \(r=3\) since then (4.4) is (4.7). Suppose therefore that \(r > 3 \) and that (4.4) is not true, that is

$$\begin{aligned} \sigma _k \Vert s_k\Vert ^{r-3} < \kappa _2. \end{aligned}$$
(4.8)

Then (4.6), (4.8) and (4.5) imply that

$$\begin{aligned} \Vert \nabla _x \Phi (x_k+s_k) \Vert \le ( 2 L + \kappa _2 ) \Vert s_k\Vert ^2< 3 \kappa _2 \Vert s_k\Vert ^2 < 3 \kappa _2 \left( \frac{\kappa _2}{\sigma _k}\right) ^{2/(r-3)} \end{aligned}$$

which contradicts (4.7). Thus (4.4) holds. \(\square \)

Unlike in our previous analysis of Algorithm 2.1 when \(r \le 3\), we are unable to deduce an upper bound on \(\sigma _k\) without further consideration. With this in mind, we now suppose that all the iterates \(x_k+s_k\) generated by Algorithm 4.1 satisfy

$$\begin{aligned} \Vert \nabla _x \Phi (x_k+s_k)\Vert \ge \epsilon \end{aligned}$$
(4.9)

for some \(\epsilon > 0\) and all \(0 \le k \le l\), and thus, from (4.2), that

$$\begin{aligned} \sigma _k \Vert s_k\Vert ^{r-1} \ge \alpha \epsilon \end{aligned}$$
(4.10)

for \(k \in \mathcal{S}_l\). In this case, we can show that \(\sigma _k\) is bounded from above.

Lemma 4.3

Suppose that AS.1 holds and \(r \ge 3\). Then provided that (4.9) holds for all \(0 \le k \le l\), Algorithm 4.1 ensures that

$$\begin{aligned} \sigma _k \le \sigma _{\max } {:=} \gamma _3 \max \left( \kappa _1 \epsilon ^{(3-r)/2}, \sigma _0 \right) \end{aligned}$$
(4.11)

where \(\kappa _1\) is defined in the statement of Lemma 4.2.

Proof

The proof is similar to the first part of that of Lemma 3.5. Suppose that iteration \(k+1\) (with \(k \le l\)) is the first for which \(\sigma _{k+1} \ge \sigma _{\max }\). Then, since \(\sigma _k < \sigma _{k+1}\), iteration k must have been unsuccessful and (4.3) gives that

$$\begin{aligned} \gamma _3 \sigma _k \ge \sigma _{k+1} \ge \sigma _{\max }, \end{aligned}$$

i.e., that

$$\begin{aligned} \sigma _k \ge \max \left( \kappa _1 \epsilon ^{(3-r)/2}, \sigma _0 \right) \ge \kappa _1 \epsilon ^{(3-r)/2} \ge \kappa _1 \Vert \nabla _x \Phi (x_k+s_k)\Vert ^{(3-r)/2} \end{aligned}$$

because of (4.9). But then Lemma 4.2 implies that iteration k must be very successful. This contradiction establishes (4.11). \(\square \)

We may also show that a successful step ensures a non-trivial reduction in \(\Phi (x)\).

Lemma 4.4

Suppose that AS.1 holds, \(r \ge 3\), and that (4.9) holds for all \(0 \le k \le l\) and some \(0 < \epsilon \le 1\). Then Algorithm 4.1 guarantees that

$$\begin{aligned} \Phi (x_k) - \Phi (x_{k+1}) \ge \kappa _4 \epsilon ^{3/2} > 0 \end{aligned}$$
(4.12)

for all \(k \in \mathcal{S}\), where

$$\begin{aligned} \kappa _4 {:=} \displaystyle \frac{\eta _1 \alpha ^{r/(r-1)} }{r \kappa _3^{1/(r-1)}}, \;\; \kappa _3 {:=} \gamma _3 \max (\kappa _1,\sigma _0), \end{aligned}$$
(4.13)

and \(\kappa _1\) is defined in the statement of Lemma 4.2.

Proof

Since \(0 < \epsilon \le 1\), (4.11) ensures that \(\sigma _{\max } \le \kappa _3 \epsilon ^{(3-r)/2}\) and thus if \(k \in \mathcal{S}\), it follows from (3.5) and (4.10) that

$$\begin{aligned} \Phi (x_k) - \Phi (x_{k+1})\ge & {} \eta _1 ( m(x_k,0) - m(x_k,s_k) )> \displaystyle \frac{\eta _1}{r} \sigma _k \Vert s_k\Vert ^r \\= & {} \displaystyle \frac{\eta _1}{r} ( \sigma _k \Vert s_k\Vert ^{r-1} ) \Vert s_k\Vert \ge \displaystyle \frac{\eta _1}{r} \alpha \epsilon \frac{(\alpha \epsilon )^{1/(r-1)}}{\sigma _k^{1/(r-1)}} \ge \displaystyle \frac{\eta _1 (\alpha \epsilon )^{r/(r-1)}}{r \sigma _{\max }^{1/(r-1)}} \\\ge & {} \displaystyle \frac{\eta _1 \alpha ^{r/(r-1)} }{r \kappa _3^{1/(r-1)}} \displaystyle \frac{\epsilon ^{r/(r-1)}}{( \epsilon ^{(3-r)/2} )^{1/(r-1)}} = \kappa _4 \epsilon ^{3/2} > 0, \end{aligned}$$

as required. \(\square \)
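
The final simplification in this proof rests only on the exponent identity (our verification)

$$\begin{aligned} \frac{r}{r-1} - \frac{1}{r-1}\cdot \frac{3-r}{2} = \frac{2r - (3-r)}{2(r-1)} = \frac{3(r-1)}{2(r-1)} = \frac{3}{2}, \end{aligned}$$

so the powers of \(\epsilon \) combine to \(\epsilon ^{3/2}\) for every \(r \ge 3\).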

These introductory lemmas now lead to our main convergence results. First we establish global convergence to a critical point of \(\Phi (x)\).

Theorem 4.5

Suppose that AS.1 holds and \(r \ge 3\). Then the iterates \(\{x_k\}\) generated by Algorithm 4.1 satisfy

$$\begin{aligned} \liminf _{k \rightarrow \infty } \Vert \nabla _x \Phi (x_k)\Vert = 0 \end{aligned}$$
(4.14)

if no non-trivial termination test is provided.

Proof

Suppose that (4.14) does not hold, in which case (4.9) holds for some \(0 < \epsilon \le 1\) and all \(k \ge 0\). We then deduce by summing the reduction in \(\Phi (x)\) guaranteed by Lemma 4.4 over successful iterations that

$$\begin{aligned} {\scriptstyle \frac{1}{2}}\Vert r(x_0)\Vert ^2 \ge \Phi (x_0)-\Phi (x_{k+1}) \ge |\mathcal{S}_k| \kappa _4 \epsilon ^{3/2}. \end{aligned}$$

Just as in the proof of Theorem 3.10, this ensures that there are only a finite number of successful iterations. If iteration k is the last of these, all subsequent iterations are unsuccessful, and thus \(\sigma _k\) grows without bound. But as this contradicts Lemma 4.3, (4.9) cannot be true, and thus (4.14) holds. \(\square \)

Next, we give an evaluation complexity result based on the stopping criterion (1.2).

Theorem 4.6

Suppose that AS.1 holds and \(r \ge 3\). Then Algorithm 4.1 requires at most

$$\begin{aligned} \begin{array}{ll} \left\lceil \kappa _u \displaystyle \frac{\Vert r(x_0)\Vert ^2}{2 \kappa _4} \epsilon ^{-3/2} + \kappa _b \right\rceil + 1 &{}\quad \;\; \text{ if } \;\; r = 3, \\ \left\lceil \kappa _u \displaystyle \frac{\Vert r(x_0)\Vert ^2}{2 \kappa _4} \epsilon ^{-3/2} + \kappa _i + \kappa _e \log \epsilon ^{-1} \right\rceil + 1 &{}\quad \;\; \text{ if } \;\; r > 3 \;\; \text{ and } \;\; \epsilon < \left( \displaystyle \frac{\kappa _1}{\sigma _0}\right) ^{2/(r-3)},\\ \left\lceil \kappa _u \displaystyle \frac{\Vert r(x_0)\Vert ^2}{2 \kappa _4} \epsilon ^{-3/2} + \kappa _a \right\rceil + 1 &{}\quad \;\; \text{ otherwise } \;\; \end{array}\nonumber \\ \end{aligned}$$
(4.15)

evaluations of r(x) and its derivatives to find an iterate \(x_{k}\) for which the termination test

$$\begin{aligned} \Vert \nabla _x \Phi (x_k)\Vert \le \epsilon \end{aligned}$$

is satisfied for given \(0< \epsilon < 1\), where

$$\begin{aligned} \kappa _b \,{:=}\, \frac{\log ( \kappa _3 / \sigma _0) }{\log \gamma _2}, \;\; \kappa _i \,{:=}\, \frac{\log (\gamma _3 \kappa _1 / \sigma _0)}{\log \gamma _2}, \;\; \kappa _e \,{:=}\, \frac{r-3}{2 \log \gamma _2} \;\; \text{ and } \;\; \kappa _a \,{:=}\, \frac{\log \gamma _3 }{\log \gamma _2},\qquad \quad \end{aligned}$$
(4.16)

\(\kappa _u\) is defined in (3.18), \(\kappa _1\) in (4.7) and \(\kappa _3\) in (4.13).

Proof

If the algorithm has not terminated on or before iteration k, (4.9) holds, and so summing (4.12) over successful iterations and recalling that \(\Phi (x_0) = {\scriptstyle \frac{1}{2}}\Vert r(x_0)\Vert ^2\) and \(\Phi (x_k) \ge 0\), we have that

$$\begin{aligned} {\scriptstyle \frac{1}{2}}\Vert r(x_0)\Vert ^2 \ge \Phi (x_0)-\Phi (x_{k+1}) \ge |\mathcal{S}_k| \kappa _4 \epsilon ^{3/2}. \end{aligned}$$

Thus there are at most

$$\begin{aligned} |\mathcal{S}_k| \le \frac{\Vert r(x_0)\Vert ^2}{2 \kappa _4} \epsilon ^{-3/2} \end{aligned}$$

successful iterations. Combining this with Lemma 3.6, accounting for the \(\max \) in (4.11) and remembering that we need to evaluate the function and gradient at the final \(x_{k+1}\) yields the bound (4.15). \(\square \)

We note in passing that in order to derive Theorem 4.6, we could have replaced the test (4.2) in Algorithm 4.1 by the normally significantly-weaker requirement (4.10).

Our final result examines the evaluation complexity under the stopping rule (3.31).

Theorem 4.7

Suppose that AS.1 holds, \(r \ge 3\) and an \(i \ge 1\) is given. Then Algorithm 4.1 requires at most

$$\begin{aligned} \left\lceil \kappa _u \Vert r(x_0)\Vert ^{1/2^{i}} \max \left( \kappa _g^{-1} \epsilon _d^{-3/2}, \kappa _r^{-1} \epsilon _p^{-1/2^i} \right) + \kappa _b \right\rceil + 1 \end{aligned}$$

when \(r = 3\)

$$\begin{aligned} \left\lceil \kappa _u \Vert r(x_0)\Vert ^{1/2^{i}} \max \left( \kappa _g^{-1} \epsilon _d^{-3/2}, \kappa _r^{-1} \epsilon _p^{-1/2^i} \right) + \kappa _i + \kappa _e ( \log \epsilon _d^{-1} + \log \epsilon _p^{-1} ) \right\rceil + 1\nonumber \\ \end{aligned}$$
(4.17)

when \(r > 3\) and \(\epsilon _p \epsilon _d < \left( \displaystyle \frac{\kappa _1}{\sigma _0}\right) ^{2/(r-3)}\), or otherwise

$$\begin{aligned} \left\lceil \kappa _u \Vert r(x_0)\Vert ^{1/2^{i}} \max \left( \kappa _g^{-1} \epsilon _d^{-3/2}, \kappa _r^{-1} \epsilon _p^{-1/2^i} \right) + \kappa _a \right\rceil + 1, \end{aligned}$$

evaluations of r(x) and its derivatives to find an iterate \(x_{k}\) for which the termination test

$$\begin{aligned} \Vert r(x_k)\Vert \le \epsilon _p \quad \mathrm{or}\quad \Vert g_r(x_k)\Vert \le \epsilon _d, \end{aligned}$$

is satisfied for given \(0 < \epsilon _p, \epsilon _d \le 1\), where \(\kappa _u\) is defined in (3.18),

$$\begin{aligned} \kappa _g {:=}\displaystyle \frac{\eta _1\alpha ^{r/(r-1)} \beta ^{r/(r-1)}}{2^i r \gamma _3^{1/(r-1)}} \min \left( \frac{1}{\kappa _1}, \frac{1}{\sigma _0} \right) ^{1/(r-1)} \Vert r(x_0)\Vert ^{(3/2- (2^{i+1}-1)/2^i)},\;\; \kappa _r {:=}\, (1 - \beta ^{1/2^i} ), \nonumber \\ \end{aligned}$$
(4.18)

\(\kappa _1\) is defined in (4.7), \(\kappa _b\), \(\kappa _i\), \(\kappa _e\) and \(\kappa _a\) in (4.16), and \(\beta \in (0,1)\) is a fixed problem-independent constant.

Proof

As in the proof of Theorem 3.12, let \(\mathcal{S}_{\beta } {:=}\{ l \in \mathcal{S}\mid \Vert r(x_{l+1})\Vert > \beta \Vert r(x_l)\Vert \}\) for a given \(\beta \in (0,1)\). We suppose that Algorithm 4.1 has not terminated prior to iteration \(l + 1\), and thus that

$$\begin{aligned} \Vert r(x_k)\Vert> \epsilon _p \quad \mathrm{and}\quad \Vert g_r(x_k)\Vert > \epsilon _d \end{aligned}$$
(4.19)

for all \(k \le l+1\). If \(l \in \mathcal{S}_{\beta }\), it follows from (3.5), (4.2) and the definition (1.4) that

$$\begin{aligned} \Vert r(x_l)\Vert ^2 - \Vert r(x_{l+1}) \Vert ^2\ge & {} 2 \eta _1 ( m(x_l,0) - m(x_l,s_l) ) > \displaystyle \frac{2\eta _1}{r} \sigma _l \Vert s_l\Vert ^r \\= & {} \displaystyle \frac{2\eta _1}{r} \left( \sigma _l \Vert s_l\Vert ^{r-1}\right) \Vert s_l\Vert \ge \displaystyle \frac{2\eta _1}{r} \alpha ^{r/(r-1)} \sigma _l^{-1/(r-1)} \Vert \nabla _x \Phi (x_{l+1})\Vert ^{r/(r-1)} \\\ge & {} \displaystyle \frac{2\eta _1}{r} \alpha ^{r/(r-1)} \sigma _l^{-1/(r-1)} \Vert g_r(x_{l+1})\Vert ^{r/(r-1)} \Vert r(x_{l+1})\Vert ^{r/(r-1)} \\\ge & {} \displaystyle \frac{2\eta _1}{r} \alpha ^{r/(r-1)} \sigma _l^{-1/(r-1)} \Vert g_r(x_{l+1})\Vert ^{r/(r-1)} \Vert r(x_l)\Vert ^{r/(r-1)} \beta ^{r/(r-1)} \end{aligned}$$

and thus applying Lemma 3.11 with \(i \ge 1\),

$$\begin{aligned}&\Vert r(x_l)\Vert ^{1/2^i} - \Vert r(x_{l+1}) \Vert ^{1/2^i} \nonumber \\&\quad \ge \displaystyle \frac{\eta _1\alpha ^{r/(r-1)}}{2^ir} \beta ^{r/(r-1)} \sigma _l^{-1/(r-1)} \Vert g_r(x_{l+1})\Vert ^{r/(r-1)} \Vert r(x_l)\Vert ^{(r/(r-1)-(2^{i+1}-1)/2^i)} \nonumber \\&\quad = \displaystyle \frac{\eta _1 \alpha ^{r/(r-1)}}{2^ir} \beta ^{r/(r-1)} \sigma _l^{-1/(r-1)} \Vert g_r(x_{l+1})\Vert ^{r/(r-1)} \Vert r(x_l)\Vert ^{(r/(r-1)-3/2)}\nonumber \\&\qquad \Vert r(x_l)\Vert ^{(3/2- (2^{i+1}-1)/2^i)} \nonumber \\&\quad \ge \kappa _d \sigma _l^{-1/(r-1)} \Vert g_r(x_{l+1})\Vert ^{r/(r-1)} \Vert r(x_l)\Vert ^{(r/(r-1)-3/2)}, \end{aligned}$$
(4.20)

where \(\kappa _d {:=}\displaystyle \frac{\eta _1 \alpha ^{r/(r-1)}}{2^ir} \beta ^{r/(r-1)} \Vert r(x_0)\Vert ^{(3/2- (2^{i+1}-1)/2^i)}\), as \(3/2 \le (2^{i+1}-1)/2^i\) and (3.34) holds.

In particular (4.20) becomes

$$\begin{aligned} \Vert r(x_l)\Vert ^{1/2^i} - \Vert r(x_{l+1}) \Vert ^{1/2^i} \ge \kappa _d \sigma _l^{-1/(r-1)} \epsilon _p^{(3-r)/2(r-1)} \epsilon _d^{r/(r-1)} \end{aligned}$$
(4.21)

and (4.10) holds with \(\epsilon = \epsilon _p \epsilon _d\), and so

$$\begin{aligned} \sigma _l \le \sigma _{\max } {:=} \gamma _3 \max \left( \kappa _1 \epsilon _p^{(3-r)/2}\epsilon _d^{(3-r)/2}, \sigma _0 \right) \end{aligned}$$
(4.22)

from Lemma 4.3. Consider the possibility

$$\begin{aligned} \kappa _1 \epsilon _p^{(3-r)/2}\epsilon _d^{(3-r)/2} \ge \sigma _0. \end{aligned}$$
(4.23)

In this case, (4.22) implies that

$$\begin{aligned} \sigma _l^{-1/(r-1)} \ge \frac{1}{(\gamma _3 \kappa _1)^{1/(r-1)}} \epsilon _p^{(r-3)/2(r-1)}\epsilon _d^{(r-3)/2(r-1)} \end{aligned}$$

and hence combining with (4.21), we find that

$$\begin{aligned} \Vert r(x_l)\Vert ^{1/2^i} - \Vert r(x_{l+1}) \Vert ^{1/2^i} \ge \displaystyle \frac{\kappa _d}{(\gamma _3 \kappa _1)^{1/(r-1)}} \epsilon _d^{3/2} \end{aligned}$$
(4.24)

If (4.23) does not hold,

$$\begin{aligned} \sigma _l^{-1/(r-1)} \ge \frac{1}{(\gamma _3 \sigma _0)^{1/(r-1)}} \end{aligned}$$

and thus (4.21) implies that

$$\begin{aligned} \Vert r(x_l)\Vert ^{1/2^i} - \Vert r(x_{l+1}) \Vert ^{1/2^i}\ge & {} \displaystyle \frac{\kappa _d}{(\gamma _3 \sigma _0)^{1/(r-1)}} \epsilon _p^{(3-r)/2(r-1)} \epsilon _d^{r/(r-1)} \nonumber \\\ge & {} \displaystyle \frac{\kappa _d}{(\gamma _3 \sigma _0)^{1/(r-1)}} \epsilon _d^{3/2} \end{aligned}$$
(4.25)

since \(\epsilon _p, \epsilon _d \le 1\) and \(r \ge 3\). Hence either (4.24) or (4.25) holds whenever \(l \in \mathcal{S}_{\beta }\).

For \(l \in \mathcal{S}\setminus \mathcal{S}_{\beta }\), we have \(\Vert r(x_{l+1})\Vert \le \beta \Vert r(x_l)\Vert \) and hence \(\Vert r(x_{l+1})\Vert ^{1/2^i} \le \beta ^{1/2^i} \Vert r(x_l)\Vert ^{1/2^i}\). Thus, in view of (4.19), we have that

$$\begin{aligned} \Vert r(x_l)\Vert ^{1/2^i} - \Vert r(x_{l+1})\Vert ^{1/2^i} \ge (1 - \beta ^{1/2^i} ) \Vert r(x_l)\Vert ^{1/2^i} \ge (1 - \beta ^{1/2^i} ) \epsilon _p^{1/2^i}\qquad \end{aligned}$$
(4.26)

for all \(l \in \mathcal{S}\setminus \mathcal{S}_{\beta }\). Thus, combining (4.24), (4.25) and (4.26), we have that

$$\begin{aligned} \Vert r(x_l)\Vert ^{1/2^i} - \Vert r(x_{l+1}) \Vert ^{1/2^i} \ge \min \left( \kappa _g \epsilon _d^{3/2}, \kappa _r \epsilon _p^{1/2^i} \right) \end{aligned}$$

for all \(l \in \mathcal{S}\), where \(\kappa _g\) and \(\kappa _r\) are given by (4.18). Summing over \(l \in \mathcal{S}_k\) and using (3.34),

$$\begin{aligned} \Vert r(x_0)\Vert ^{1/2^{i}} \ge \Vert r(x_0)\Vert ^{1/2^{i}} - \Vert r(x_{k+1})\Vert ^{1/2^{i}} \ge |\mathcal{S}_k | \min \left( \kappa _g \epsilon _d^{3/2}, \kappa _r \epsilon _p^{1/2^i} \right) \end{aligned}$$

and thus that there are at most

$$\begin{aligned} |\mathcal{S}_k | \le \Vert r(x_0)\Vert ^{1/2^{i}} \max \left( \kappa _g^{-1} \epsilon _d^{-3/2}, \kappa _r^{-1} \epsilon _p^{-1/2^i} \right) \end{aligned}$$

successful iterations. As before, combining this with Lemma 3.6 for \(\epsilon = \epsilon _p \epsilon _d\), accounting for the \(\max \) in (4.11), and remembering that we need to evaluate the function and gradient at the final \(x_{k+1}\), yields the bound (4.17). \(\square \)

Comparing (3.30) with (4.17), there seems little theoretical advantage (aside from constants) in using regularization of order more than three. We note, however, that the constants in the complexity bounds in Sect. 3 depend (inversely) on \(\sigma _{\min }\), while those in Sect. 4 do not; whether this is important in practice for small chosen \(\sigma _{\min }\) depends on quite how tight our bounds actually are when \(r=3\).
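As a purely illustrative aside (a hypothetical snippet, not part of the analysis), tabulating the exponent \(\max (r/(r-1),3/2)\) that governs the evaluation bounds makes this concrete:

    # Illustrative only: the complexity exponent of 1/epsilon as a function of the
    # regularization order r; nothing is gained (in order terms) by taking r > 3.
    for r in (2.0, 2.5, 3.0, 4.0, 10.0):
        exponent = max(r / (r - 1.0), 1.5)
        print(f"r = {r:4.1f}:  O(eps^(-{exponent:.3f})) evaluations")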

5 Numerical experiments

We compare the performance of the newly proposed algorithm with that of a Gauss–Newton method (with regularization of order two) and a Newton method (with regularization of order three). We use implementations of these algorithms found in our RALFit software [28], which is an open-source Fortran package for solving nonlinear least-squares problems. We apply tensor-Newton methods with regularization powers \(r = 2\) and 3, and we solve the subproblem (Step 2 of Algorithm 2.1) by calling the RALFit code recursively; see [19] for details.
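To make the subproblem concrete, the following is a minimal sketch in Python (purely for illustration; it is not the RALFit implementation, and the function and variable names are ours) of how a single regularized tensor-Newton step could be computed by minimizing the quartic model numerically:

    import numpy as np
    from scipy.optimize import minimize

    def tensor_newton_step(r, J, H, sigma, reg_order):
        """One illustrative step: r is the (m,) residual vector, J the (m,n)
        Jacobian and H the (m,n,n) array of residual Hessians at x_k."""
        n = J.shape[1]

        def model(s):
            # quadratic (tensor-Newton) Taylor approximation of r(x_k + s)
            t = r + J @ s + 0.5 * np.einsum("ijk,j,k->i", H, s, s)
            # least-squares model plus regularization of order reg_order
            return 0.5 * t @ t + (sigma / reg_order) * np.linalg.norm(s) ** reg_order

        return minimize(model, np.zeros(n), method="BFGS").x

    # toy usage on a two-residual, two-variable problem
    r = np.array([1.0, -0.5])
    J = np.array([[1.0, 2.0], [0.5, -1.0]])
    H = np.zeros((2, 2, 2)); H[0] = np.eye(2)
    print(tensor_newton_step(r, J, H, sigma=1.0, reg_order=3))

In RALFit itself the subproblem is, of course, handled by the recursive call described above.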

Table 1 reports the number of iterations, function evaluations, and Jacobian evaluations needed to solve the 26 problems in the NIST nonlinear regression test set [27]. We also include the median numbers over all tests.

Table 1 Results for the NIST test set

Fig. 1 Convergence curves for examples from the NIST test set

Table 1 shows that, for most problems in the test set, the tensor-Newton methods required fewer iterations, function evaluations, and Jacobian evaluations. We can learn more about the performance on individual problems by looking at convergence curves that plot the gradient, \(\Vert J^Tr\Vert \), at each iteration; we give these for a number of the problems, chosen to represent different behaviours, in Fig. 1. As should be expected, the asymptotic convergence rate of the Newton approximation is better than that of Gauss–Newton. We also see that, despite its inferior asymptotic convergence rate, Gauss–Newton often converges in fewer iterations than Newton because Newton takes longer to enter this asymptotic regime (see, e.g., [13]). This is the case in Fig. 1a–c (see also Table 1). Our newly proposed tensor-Newton algorithm appears to converge at the same asymptotic rate as Newton, but it enters this regime much earlier, as is typical of Gauss–Newton. We credit this behaviour to the fact that, unlike the Newton model, the Gauss–Newton and tensor-Newton models are themselves sums-of-squares. We note that, although we observe something close to quadratic convergence in practice, whether this is always the asymptotic convergence rate is an open question (but see “Appendix B”).

Figure 1d shows convergence curves for one of the few tests where the performance of tensor-Newton is worse than that of the alternatives. All four methods struggle with this problem initially, but Gauss–Newton and Newton fall into the asymptotic regime first. Figure 1c, by contrast, shows an example where both variants of tensor-Newton perform much better than Gauss–Newton/Newton, which both suffer from a long period of stagnation.
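For readers who wish to reproduce such curves from their own solver output, a minimal sketch (using hypothetical iterate histories, not the scripts used here) is:

    import numpy as np

    def gradient_norms(residuals, jacobians):
        """Given per-iteration residual vectors r and Jacobians J, return the
        gradient norms ||J^T r|| plotted in the convergence curves."""
        return [np.linalg.norm(J.T @ r) for r, J in zip(residuals, jacobians)]

    # e.g. with matplotlib: plt.semilogy(gradient_norms(res_history, jac_history))
    # for each solver, with the iteration number on the horizontal axis.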

The NIST examples are generally too small to make useful time comparisons. In Table 2 we report timings for those where at least one of the solvers took over 0.5 s. These computations were performed on a desktop machine running Linux Mint 18.2, with an Intel Core i7-7700 and 16 GB RAM, and we used the gfortran compiler.

We see that the cost of carrying out an iteration of the tensor-Newton method is significantly higher than that of Gauss–Newton/Newton, but there are examples (e.g., BENNETT5, MGH17) where it is the fastest.

Table 2 Wallclock timings (seconds), with the number of iterations in brackets, for NIST problems where at least one solver took over 0.5 s

In order to demonstrate the behaviour of the algorithms with an expensive function evaluation, we performed an experiment where we read in the data at each function/derivative evaluation from a directory stored on a remote computer. We performed this test for the example closest to the median behaviour in Table 1: MISRA1B. Here, Gauss–Newton took 0.108 s, Newton 0.148 s, and tensor-Newton 0.004 s. This highlights that, while more work needs to be done per iteration in the tensor-Newton method, once the function has been evaluated and the derivatives calculated, it makes greater use of the information, which can lead to a faster solution time.

6 Conclusions

We have proposed and analysed a related pair of tensor-Newton algorithms for solving nonlinear least-squares problems. Under reasonable assumptions, the algorithms have been shown to converge globally to a first-order critical point. Moreover, their function-evaluation complexity is as good as that of the best-known algorithms for such problems. In particular, convergence to an \(\epsilon \)-first-order critical point of the sum-of-squares objective (1.1) requires at most \(O\left( \epsilon ^{-\max (r/(r-1),3/2)}\right) \) function evaluations under \(r\)-th-order regularization with \(r \ge 2\). Moreover, convergence to a point that satisfies the more natural convergence criteria (1.3) takes at most \(O\left( \max \left( \epsilon _d^{-\max (r/(r-1),3/2)}, \epsilon _p^{- 1/2^{i}}\right) \right) \) evaluations for any chosen \(i\ge \left\lceil \log _2 \left( (r-1)/(r-2) \right) \right\rceil \). Whether such bounds may be achieved is an open question.

Although quadratic (\(r=2\)) regularization produces the poorest theoretical worst-case bound in the above, in practice it often performs well. Moreover, although quadratic regularization is rarely mentioned for general optimization in the literature (but see [2] for a recent example), it is perhaps more natural in the least-squares setting, since the Gauss–Newton and tensor-Newton approximations (2.2) are naturally bounded from below and thus it might be argued that the regularization need not be so severe. The rather weak dependence of the second bound above on \(\epsilon _p\) is worth noting. Indeed, increasing \(i\) reduces this influence, but of course the constant hidden by the \(O(\cdot )\) notation grows with \(i\). A similar improvement on the related bound in [9, Theorem 3.2] is possible using the same arguments.
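To illustrate this weak dependence (an illustrative calculation only), the smallest admissible \(i\) and the resulting \(\epsilon _p\) exponent \(1/2^i\) can be tabulated for a few values of \(r>2\):

    import math

    # Illustrative only: smallest i satisfying i >= ceil(log2((r-1)/(r-2))) and
    # the corresponding epsilon_p exponent 1/2^i in the bound above (r > 2).
    for r in (2.5, 3.0, 4.0):
        i_min = math.ceil(math.log2((r - 1.0) / (r - 2.0)))
        print(f"r = {r}:  i >= {i_min},  eps_p exponent 1/2^i = {1.0 / 2 ** i_min}")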

It is also possible to imagine generalizations of the methods here in which the quadratic tensor-Newton model in (2.1) is replaced by a \(p\)-th-order Taylor approximation (\(p>2\)). One might then anticipate evaluation-complexity bounds in which the exponents \(\max (r/(r-1),3/2)\) mentioned above are replaced by \(\max (r/(r-1),(p+1)/p)\), along the lines considered elsewhere [11, 12]. The factor limiting applicability will likely be the cost of computing the higher-order derivative tensors.
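For concreteness, such a generalization might take the following form (a sketch only; the model notation below mirrors, but is not taken from, (2.1) and (2.2)):

$$\begin{aligned} t_k(s) {:=}r(x_k) + J(x_k)\,s + \sum _{j=2}^{p} \frac{1}{j!}\, \nabla _x^j r(x_k)[s]^j, \qquad m_k(s) {:=}{\scriptstyle \frac{1}{2}}\Vert t_k(s)\Vert ^2 + \frac{\sigma _k}{r}\Vert s\Vert ^r, \end{aligned}$$

where \(\nabla _x^j r(x_k)[s]^j\) denotes the \(j\)-th derivative tensor of \(r\) at \(x_k\) applied \(j\) times to the step \(s\), and \(\sigma _k>0\) is the regularization weight.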

An open question relates to the asymptotic rates of convergence of our methods. It is well known that Gauss–Newton methods converge quadratically for zero-residual problems under reasonable assumptions, but that a Newton-like method is needed to achieve this rate when the optimal residuals are nonzero. It is not clear what the rate is for our tensor-Newton method. The main obstacle to a convincing analysis is that, unlike its quadratic counterpart, a quartic model such as that used by the tensor-Newton method may have multiple minimizers. Our inner-iteration stopping criteria make no attempt to distinguish between them; indeed, to do so would require global optimality conditions. In practice, however, we generally observe at least quadratic convergence, sometimes even faster when the optimal residuals are zero. In “Appendix B”, we indicate that a reasonable choice of the step \(s_k\) in Algorithm 2.1 does indeed converge with an asymptotic Q-rate of \(r-1\) for \(2< r < 3\) under standard assumptions. Extending this to Algorithm 4.1 is less obvious, since it is unclear whether the additional acceptance test (4.2) might reject an otherwise rapidly-converging natural choice of the step.

Our interest in these algorithms has been prompted by observed good behaviour when applied to practical problems [19]. The resulting software is available as part of the RALFit [28] and GALAHAD [18] software libraries.