Abstract
Given a twice-continuously differentiable vector-valued function r(x), a local minimizer of \(\Vert r(x)\Vert _2\) is sought. We propose and analyse tensor-Newton methods, in which r(x) is replaced locally by its second-order Taylor approximation. Convergence is controlled by regularization of various orders. We establish global convergence to a first-order critical point of \(\Vert r(x)\Vert _2\), and provide function evaluation bounds that agree with the best-known bounds for methods using second derivatives. Numerical experiments comparing tensor-Newton methods with regularized Gauss–Newton and Newton methods demonstrate the practical performance of the newly proposed method.
1 Introduction
Consider a given, smooth, vector-valued residual function \(r:\mathbb {R}^n\longrightarrow \mathbb {R}^m\), and let \(\Vert \cdot \Vert \) be the Euclidean norm. Our goal is to design effective methods for finding values of \(x \in \mathbb {R}^n\) for which \(\Vert r(x)\Vert \) is (locally) as small as possible. Since \(\Vert r(x)\Vert \) is generally nonsmooth, it is common to consider the equivalent problem of minimizing
$$\begin{aligned} \Phi (x) {:=}{\scriptstyle \frac{1}{2}}\Vert r(x)\Vert ^2 \end{aligned}$$(1.1)
and to tackle the resulting problem using a generic method for unconstrained optimization, or one that exploits the special structure of \(\Phi \).
To put our proposal into context, arguably the most widely used method for solving nonlinear least-squares problems is the Gauss–Newton method and its variants. These iterative methods all build locally-linear (Taylor) approximations to \(r(x_k+s)\) about \(x_k\), and then minimize the approximation as a function of s in the least-squares sense to derive the next iterate \(x_{k+1} = x_k + s_k\) [21, 22, 24]. The iteration is usually stabilized either by imposing a trust-region constraint on the permitted s, or by including a quadratic regularization term [3, 23]. While these methods are undoubtedly popular in practice, they often suffer when the optimal value of the norm of the residual is large. To counter this, regularized Newton methods for minimizing (1.1) have also been proposed [7, 16, 17]. Although this usually provides a cure for the slow convergence of Gauss–Newton-like methods on non-zero-residual problems, the global behaviour is sometimes less attractive; we attribute this to the Newton model not fully reflecting the sum-of-squares nature of the original problem.
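For orientation, the quadratically regularized Gauss–Newton step just described minimizes \({\scriptstyle \frac{1}{2}}\Vert r + Js\Vert ^2 + {\scriptstyle \frac{1}{2}}\sigma \Vert s\Vert ^2\) and reduces to a single linear solve; the following Python sketch (with purely illustrative data, not taken from the paper) records this:

```python
import numpy as np

def regularized_gauss_newton_step(r, J, sigma):
    """Minimize 0.5*||r + J s||^2 + 0.5*sigma*||s||^2 over s,
    i.e. solve the normal equations (J^T J + sigma I) s = -J^T r."""
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + sigma * np.eye(n), -J.T @ r)

# Hypothetical residual and Jacobian (m = 2 residuals, n = 2 variables).
r = np.array([1.0, -0.5])
J = np.array([[2.0, 0.0], [0.0, 1.0]])
s = regularized_gauss_newton_step(r, J, 0.1)
```

The step necessarily reduces the regularized linear least-squares model, but, as noted above, such methods can stagnate when the optimal residual is large.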
With this in mind, we consider instead the obvious nonlinear generalization of Gauss–Newton in which a locally quadratic (Taylor) “tensor-Newton” approximation to the residuals is used instead of a locally linear one. Of course, the resulting least-squares model is now quartic rather than quadratic (and thus in principle harder to minimize), but our experiments [19] have indicated that this results in more robust global behaviour than Newton-type methods, and better performance on non-zero-residual problems than is seen for Gauss–Newton variants. Our intention here is to explore the convergence behaviour of a tensor-Newton approach.
We mention in passing that we are not the first authors to consider higher-order models for least-squares problems. The earliest approach we are aware of [4, 5] uses a quadratic model of \(r(x_k+s)\) in which the Hessian of each residual is approximated by a low-rank matrix that is intended to compensate for any small singular values of the Jacobian. Another approach, known as geodesic acceleration [29, 30], aims to modify Gauss–Newton-like steps with a correction that allows for higher-order derivatives. More recently, derivative-free methods that aim to build quadratic models of \(r(x_k+s)\) by interpolation/regression of past residual values have been proposed [31, 32], although these ultimately more closely resemble Gauss–Newton variants. While each of these methods has been shown to improve performance relative to Gauss–Newton-like approaches, none makes full use of the residual Hessians. Our intention is thus to investigate the convergence properties of methods based on the tensor-Newton model.
There has been a long-standing interest in establishing the global convergence of general smooth unconstrained optimization methods, that is, in ensuring that a method for minimizing a function f(x) starting from an arbitrary initial guess ultimately delivers an iterate for which a measure of optimality is small. A more recent concern has focused on how many evaluations of f(x) and its derivatives are necessary to reduce the optimality measure below a specified (small) \(\epsilon > 0\) from the initial guess. If the measure is \(\Vert g(x)\Vert \), where \(g(x) {:=}\nabla _x f(x)\), it is known that some well-known schemes (including steepest descent and generic second-order trust-region methods) may require \(\Theta (\epsilon ^{-2})\) evaluations under standard assumptions [6], while this may be improved to \(\Theta (\epsilon ^{-3/2})\) evaluations for second-order methods with cubic regularization or using specialised trust-region tools [8, 15, 26]. Here and hereafter \(O(\cdot )\) indicates a term that is at worst a multiple of its argument, while \(\Theta (\cdot )\) indicates additionally that there are instances for which the bound is attained to within a constant multiple.
For the problem we consider here, an obvious approach is to apply any of the aforementioned algorithms to minimize (1.1), and to terminate as soon as
However, it has been argued [9] that this ignores the possibility that it may suffice to stop instead when r(x) is small, and that a more sensible criterion is to terminate when
where \(\epsilon _p>0\) and \(\epsilon _d>0\) are required accuracy tolerances and \(g_r(x)\) is the scaled gradient given by
$$\begin{aligned} g_r(x) {:=}\frac{J^T(x)\, r(x)}{\Vert r(x)\Vert }. \end{aligned}$$(1.4)
We note that \(g_r(x)\) in (1.4) is precisely the gradient of \(\Vert r(x)\Vert \) whenever \(r(x)\ne 0\), while if \(r(x)=0\), we are at the global minimum of \(\Vert r(x)\Vert \) and so \(g_r(x)=0\in \partial (\Vert r(x)\Vert )\), the sub-differential of \(\Vert r(x)\Vert \). Furthermore \(\Vert g_r(x)\Vert \) is less sensitive to scaling than \(\Vert J^T(x) r(x) \Vert \). It has been shown that a second-order method based on cubic regularization will satisfy (1.3) after \(O\left( \max (\epsilon _d^{-3/2},\epsilon _p^{-1/2})\right) \) evaluations [9, Theorem 3.2]. One of our aims here is to show similar bounds for the tensor-Newton method we are advocating. We propose a regularized tensor-Newton method in Sect. 2, and analyse both its global convergence and its evaluation complexity in Sect. 3. The regularization order, r, permitted by the algorithm proposed in Sect. 2 is restricted to be no larger than 3, and so in Sect. 4 we introduce a modified algorithm for which \(r>3\) is possible. We report numerical experiments in Sect. 5, and make further comments and draw general conclusions in Sect. 6.
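To fix ideas, the scaled gradient (1.4) and the stopping rule (1.3) might be computed as in the following Python sketch (the data and tolerances below are hypothetical):

```python
import numpy as np

def scaled_gradient(r, J):
    """g_r(x) = J^T(x) r(x) / ||r(x)||, cf. (1.4); taken to be zero
    at a root of r, where it lies in the subdifferential of ||r(x)||."""
    nrm = np.linalg.norm(r)
    return np.zeros(J.shape[1]) if nrm == 0.0 else (J.T @ r) / nrm

def terminate(r, J, eps_p, eps_d):
    """Stopping rule (1.3): small residual OR small scaled gradient."""
    return (np.linalg.norm(r) <= eps_p
            or np.linalg.norm(scaled_gradient(r, J)) <= eps_d)
```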
2 The tensor-Newton method
Suppose that \(r(x) \in C^2\) has components \(r_i(x)\) for \(i=1,\ldots ,m\). Let t(x, s) be the vector whose components are
$$\begin{aligned} t_i(x,s) {:=}r_i(x) + s^T \nabla _x r_i(x) + {\scriptstyle \frac{1}{2}}s^T \nabla _{xx} r_i(x)\, s \end{aligned}$$(2.1)
for \(i=1,\ldots ,m\). We build the tensor-Newton approximation
$$\begin{aligned} m(x,s) {:=}{\scriptstyle \frac{1}{2}}\Vert t(x,s)\Vert ^2 \end{aligned}$$(2.2)
of \(\Phi (x+s)\), and define the regularized model
$$\begin{aligned} m^{\mathrm{R}}(x,s,\sigma ) {:=}m(x,s) + \frac{\sigma }{r} \Vert s\Vert ^r, \end{aligned}$$(2.3)
where \(r \ge 2\) is given. Note that
We consider the following algorithm (Algorithm 2.1) to find a critical point of \(\Phi (x)\).
Algorithm 2.1
Adaptive Tensor-Newton Regularization.
A starting point \(x_0\), an initial and a minimal regularization parameter \(\sigma _0\ge \sigma _{\min }>0\) and algorithmic parameters \(\theta > 0\), \(\gamma _3\ge \gamma _2> 1> \gamma _1 >0\) and \(1>\eta _2\ge \eta _1>0\), are given. Evaluate \(\Phi (x_0)\). For \(k=0, 1, \ldots \), until termination, do:
1. If the termination test has not been satisfied, compute derivatives of r(x) at \(x_k\).
2. Compute a step \(s_k\) by approximately minimizing \(m^{\mathrm{R}}(x_k,s,\sigma _k)\) so that
$$\begin{aligned} m^{\mathrm{R}}(x_k,s_k,\sigma _k)<m^{\mathrm{R}}(x_k,0,\sigma _k) \end{aligned}$$(2.5)
and
$$\begin{aligned} \left\| \nabla _s m^{\mathrm{R}}(x_k,s_k,\sigma _k)\right\| \le \theta \Vert s_k\Vert ^{r-1}. \end{aligned}$$(2.6)
3. Set \(\widehat{x}_{k} = x_k + s_k\) and compute \(\Phi (\widehat{x}_k)\) and
$$\begin{aligned} \rho _k = \frac{\Phi (x_k)-\Phi (\widehat{x}_k)}{m(x_k,0)-m(x_k,s_k)}. \end{aligned}$$(2.7)
4. Set
$$\begin{aligned} \sigma _{k+1} \in \left\{ \begin{array}{lll} [\max (\sigma _{\min }, \gamma _1\sigma _k),\sigma _k ] &{} \;\; \text{ if } \;\; \rho _k \ge \eta _2 &{}\;\; \text{[very } \text{ successful } \text{ iteration] } \;\; \\ \;[ \sigma _k, \gamma _2 \sigma _k ] &{} \;\; \text{ if } \;\; \eta _1 \le \rho _k < \eta _2&{} \;\; \text{[successful } \text{ iteration] } \;\;\\ \; [ \gamma _2 \sigma _k, \gamma _3 \sigma _k ] &{} \;\; \text{ otherwise } \;\; &{} \;\; \text{[unsuccessful } \text{ iteration], } \;\; \end{array}\right. \end{aligned}$$(2.8)
5. If \(\rho _k \ge \eta _1\), set \(x_{k+1} = \widehat{x}_k\). Otherwise go to Step 2.
At the very least, we insist that (trivial) termination should occur in Step 1 of Algorithm 2.1 if \(\Vert \nabla _x \Phi (x_k)\Vert = 0\), but in practice a rule such as (1.2) or (1.3) at \(x = x_k\) will be preferred.
At the heart of Algorithm 2.1 is the need (Step 2) to find a vector \(s_k\) that both reduces \(m^{\mathrm{R}}(x_k,s,\sigma _k)\) and satisfies \(\Vert \nabla _s m^{\mathrm{R}}(x_k,s_k,\sigma _k)\Vert \le \theta \Vert s_k\Vert ^{r-1}\) (see, e.g., [1]). Since \(m^{\mathrm{R}}(x_k,s,\sigma _k)\) is bounded from below (and grows as s approaches infinity), we may apply any descent-based local optimization method that is designed to find a critical point of \(m^{\mathrm{R}}(x_k,s,\sigma _k)\), starting from \(s = 0\), as this will generate an \(s_k\) that is guaranteed to satisfy both Step 2 stopping requirements. Crucially, such a minimization is on the model \(m^{\mathrm{R}}(x_k,s,\sigma _k)\), not the true objective, and thus involves no true objective evaluations. We do not claim that this calculation is trivial, but it might, for example, be achieved by applying a safeguarded Gauss–Newton method to the least-squares problem involving the extended residuals \((t(x_k,s),\sqrt{\sigma _k\Vert s\Vert ^{r-2}}s)\).
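For illustration, the following Python sketch evaluates a tensor-Newton model of the above form, with regularization term \((\sigma /r)\Vert s\Vert ^r\), and minimizes it by a crude fixed-step gradient descent started from \(s=0\); this is a toy stand-in for the safeguarded solver discussed above, and all data are hypothetical:

```python
import numpy as np

def tensor_model(s, r, J, H, sigma, p):
    """Regularized tensor-Newton model: t_i = r_i + J_i s + 0.5 s'H_i s,
    value = 0.5*||t||^2 + (sigma/p)*||s||^p, where p > 2 is the
    regularization order (the paper's r) and H holds the m residual
    Hessians as an (m, n, n) array."""
    t = r + J @ s + 0.5 * np.einsum('i,mij,j->m', s, H, s)
    return 0.5 * (t @ t) + (sigma / p) * np.linalg.norm(s) ** p

def model_grad(s, r, J, H, sigma, p):
    """Gradient of the regularized model with respect to s."""
    t = r + J @ s + 0.5 * np.einsum('i,mij,j->m', s, H, s)
    Jt = J + np.einsum('mij,j->mi', H, s)   # Jacobian of t w.r.t. s
    return Jt.T @ t + sigma * np.linalg.norm(s) ** (p - 2) * s

def solve_subproblem(r, J, H, sigma, p, iters=200, lr=0.05):
    """Crude descent from s = 0: any descent method started there
    produces a step that reduces the regularized model, cf. (2.5)."""
    s = np.zeros(J.shape[1])
    for _ in range(iters):
        s = s - lr * model_grad(s, r, J, H, sigma, p)
    return s

# Hypothetical data: m = 2 residuals in n = 2 variables, cubic (p = 3).
rk = np.array([1.0, -2.0])
Jk = np.array([[1.0, 2.0], [0.0, 1.0]])
Hk = np.array([np.eye(2), np.zeros((2, 2))])
sk = solve_subproblem(rk, Jk, Hk, sigma=1.0, p=3.0)
```

A practical implementation would of course use a safeguarded solver with a proper termination rule enforcing (2.6), rather than a fixed step count.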
We define the index set of successful iterations, in the sense of (2.8), up to iteration k to be \(\mathcal{S}_k {:=}\{0 \le l \le k \mid \rho _l \ge \eta _1\}\) and let \(\mathcal{S}{:=}\{ k \ge 0 \mid \rho _k \ge \eta _1 \}\) be the set of all successful iterations.
3 Convergence analysis
We make the following blanket assumption:
- AS.1 each component \(r_i(x)\) and its first two derivatives are Lipschitz continuous on an open set containing the intervals \([x_k, x_k+s_k]\) generated by Algorithm 2.1 (or its successor).
It has been shown [10, Lemma 3.1] that AS.1 implies that \(\Phi (x)\) and its first two derivatives are Lipschitz on \([x_k, x_k+s_k]\).
We define
and let q(x, s) be the vector whose ith component is
for \(i=1,\ldots ,m\). In this case
Since \(m(x_k,s)\) is a second-order accurate model of \(\Phi (x_k+s)\), we expect bounds of the form
and
for some \(L_{\mathrm{f}}, L_{\mathrm{g}} > 0\) and all \(k \ge 0\) for which \(\Vert s_k\Vert \le 1\) (see “Appendix A”).
Also, since \(\Vert r(x)\Vert \) decreases monotonically,
and
for some \(L_{\mathrm{J}}, L_{\mathrm{H}} > 0\) and all \(k \ge 0\) (again, see “Appendix A”).
Our first result derives simple conclusions from the basic requirement that the step \(s_k\) in our algorithm is chosen to reduce the regularized model.
Lemma 3.1
Algorithm 2.1 ensures that
In addition, if \(r=2\), at least one of
or
holds, while if \(r>2\),
Proof
It follows from (2.5), (2.3) and (2.2) that
Inequality (3.5) follows immediately from the first inequality in (3.9). When \(r=2\), inequality (3.9) becomes
In order for this to be true, it must be that at least one of the last two terms is negative, and this provides the alternatives (3.6) and (3.7). By contrast, when \(r>2\), inequality (3.9) becomes
and this implies that
(or both), which gives (3.8). \(\square \)
Our next task is to show that \(\sigma _k\) is bounded from above. Let
and
and note that Lemma 3.1 implies that
and in particular
We consider first the special case for which \(r=2\).
Lemma 3.2
Suppose that AS.1 holds, \(r=2\), \(k \in \mathcal{B}\) and
Then iteration k of Algorithm 2.1 is very successful.
Proof
Since \(k \in \mathcal{B}\), Lemma 3.1 implies that (3.7) and (3.10) hold. Then (2.7), (3.1) and (3.5) give that
and hence
from (3.3), (3.7) and (3.11). Thus it follows from (2.8) that the iteration is very successful. \(\square \)
Lemma 3.3
Suppose that AS.1 holds and \(r=2\). Then Algorithm 2.1 ensures that
for all \(k \ge 0\).
Proof
Let
Suppose that \(k + 1 \in \mathcal{B}_{\gamma _3}\) is the first iteration for which \(\sigma _{k+1} \ge \sigma ^{\mathrm{B}}_{\max }\). Then, since \(\sigma _k < \sigma _{k+1}\), iteration k must have been unsuccessful, \(x_k=x_{k+1}\) and (2.8) gives that \(\sigma _{k+1} \le \gamma _3 \sigma _k\). Thus
since \(k + 1 \in \mathcal{B}_{\gamma _3}\), which implies that \(k \in \mathcal{B}\). Furthermore,
which implies that (3.11) holds. But then Lemma 3.2 implies that iteration k must be very successful. This contradiction ensures that
for all \(k \in \mathcal{B}_{\gamma _3}\). For all other iterations, we have that \(k \notin \mathcal{B}_{\gamma _3}\), and for these the definition of \(\mathcal{B}_{\gamma _3}\), and the bounds (3.3) and (3.4) give
Combining (3.13) and (3.14) gives (3.12).\(\square \)
We now turn to the general case for which \(2<r\le 3\).
Lemma 3.4
Suppose that AS.1 holds, \(2<r\le 3\), \(k \in \mathcal{B}\) and
Then iteration k of Algorithm 2.1 is very successful.
Proof
Since \(k \in \mathcal{B}\), it follows from (2.7), (3.10), (3.1), (3.5), (3.8), (3.3), (3.4) and (3.15) that
As before, (2.8) then ensures that the iteration is very successful. \(\square \)
Lemma 3.5
Suppose that AS.1 holds and \(2<r\le 3\). Then Algorithm 2.1 ensures that
for all \(k \ge 0\).
Proof
The proof mimics that of Lemma 3.3. First, suppose that \(k \in \mathcal{B}_{\gamma _3}\) and that iteration \(k+1\) is the first for which
Then, since \(\sigma _k < \sigma _{k+1}\), iteration k must have been unsuccessful and (2.8) gives that
which implies that \(k \in \mathcal{B}\) and (3.15) holds. But then Lemma 3.4 implies that iteration k must be very successful. This contradiction provides the first three terms in the bound (3.17), while the others arise as for the proof of Lemma 3.3 when \(k \notin \mathcal{B}_{\gamma _3}\). \(\square \)
Next, we bound the number of iterations in terms of the number of successful ones.
Lemma 3.6
[8, Theorem 2.1]. The adjustment (2.8) in Algorithm 2.1 ensures that
and \(\sigma _{\max }\) is any known upper bound on \(\sigma _k\).
Our final ingredient is to find a useful bound on the smallest model decrease as the algorithm proceeds. Let \(\mathcal{L}{:=}\{ k \mid \Vert s_k\Vert \le 1 \}\), and let \(\mathcal{G}{:=}\{ k \mid \Vert s_k\Vert > 1 \}\) be its complement. We then have the following crucial bounds.
Lemma 3.7
Suppose that AS.1 holds and \(2\le r\le 3\). Then Algorithm 2.1 ensures that
Proof
Consider \(k \in \mathcal{L}\). The Cauchy-Schwarz inequality and (2.4) reveal that
Combining (3.20) with (3.2), (2.6), (3.12), (3.17) and \(\Vert s_k\Vert \le 1\) we have
and thus that
But then, combining this with (3.5), the lower bound
imposed by Algorithm 2.1 and (3.5) provides the first possibility in (3.19).
By contrast, if \(k \in \mathcal{G}\), (3.5), \(\Vert s_k\Vert > 1\) and (3.21) ensure the second possibility in (3.19). \(\square \)
Corollary 3.8
Suppose that AS.1 holds and \(2\le r\le 3\). Then Algorithm 2.1 ensures that
Proof
The result follows directly from (2.7) and (3.19). \(\square \)
We now provide our three main convergence results. Firstly, we establish the global convergence of our algorithm to first-order critical points of \(\Phi (x)\).
Theorem 3.9
Suppose that AS.1 holds and \(2\le r\le 3\). Then the iterates \(\{x_k\}\) generated by Algorithm 2.1 satisfy
if no non-trivial termination test is provided.
Proof
Suppose that \(\epsilon > 0\), and consider any successful iteration for which
Then it follows from (3.22) that
Consider the set \(\mathcal{U}_{\epsilon } = \{ k \in \mathcal{S}\mid \Vert \nabla _x \Phi (x_k)\Vert \ge \epsilon \}\), suppose that \(\mathcal{U}_{\epsilon }\) is infinite, and let \(k_i\) be the i-th entry of \(\mathcal{U}_{\epsilon }\). Now consider
Thus, summing (3.25) over successful iterations, recalling that \(\Phi (x_0) = {\scriptstyle \frac{1}{2}}\Vert r(x_0)\Vert ^2\) and \(\Phi (x_k) \ge 0\), and that \(\Phi \) decreases monotonically, we have that
This contradiction shows that \(\mathcal{U}_{\epsilon }\) is finite for any \(\epsilon > 0\), and therefore (3.23) holds. \(\square \)
Secondly, we provide an evaluation complexity result based on the stopping criterion (1.2).
Theorem 3.10
Suppose that AS.1 holds and \(2\le r\le 3\). Then Algorithm 2.1 requires at most
evaluations of r(x) and its derivatives to find an iterate \(x_{k}\) for which the termination test
is satisfied for given \(0< \epsilon < 1\), where \(\kappa _u\) and \(\kappa _s\) are defined in (3.18).
Proof
If the algorithm has not terminated, (3.24) holds, so summing (3.25) as before
since \(\epsilon < 1\), and thus that
Combining this with (3.18) and remembering that we need to evaluate the function and gradient at the final \(x_{k+1}\) yields the bound (3.27). \(\square \)
Notice how the evaluation complexity improves from \(O(\epsilon ^{-2})\) evaluations with quadratic \((r=2)\) regularization to \(O(\epsilon ^{-3/2})\) evaluations with cubic \((r=3)\) regularization. It is not clear if these bounds are sharp.
Finally, we refine this analysis to provide an alternative complexity result based on the stopping rule (1.3). The proof follows arguments similar to those in [9, §3.2] and [11, §3], and crucially depends upon the following elementary result.
Lemma 3.11
Suppose that \(a > b \ge 0\). Then
for all integers \(i \ge -1\).
Proof
The result follows directly by induction using the identity \(A^2 - B^2 = (A-B)(A+B)\) with \(A=a^{1/2^j} > B=b^{1/2^j}\) for increasing \(j\le i\). \(\square \)
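The telescoping that drives this induction can be checked numerically; in the following Python snippet the values of a, b and i are arbitrary:

```python
# Numerical check of the telescoping behind Lemma 3.11's induction:
# repeatedly applying A^2 - B^2 = (A - B)(A + B) with A = a^(1/2^j),
# B = b^(1/2^j) for j = 1, ..., i+1 gives
#   a - b = (a^(1/2^(i+1)) - b^(1/2^(i+1))) * prod_{j=1}^{i+1} (a^(1/2^j) + b^(1/2^j)).
a, b, i = 7.0, 2.0, 4          # arbitrary values with a > b >= 0
prod = 1.0
for j in range(1, i + 2):
    prod *= a ** (1.0 / 2 ** j) + b ** (1.0 / 2 ** j)
lhs = a - b
rhs = (a ** (1.0 / 2 ** (i + 1)) - b ** (1.0 / 2 ** (i + 1))) * prod
```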
Theorem 3.12
Suppose that AS.1 holds, \(2 < r\le 3\) and that the integer
is given. Then Algorithm 2.1 requires at most
evaluations of r(x) and its derivatives to find an iterate \(x_{k}\) for which the termination test
is satisfied for given \(\epsilon _p>0\) and \(\epsilon _d>0\), where \(\kappa _u\) and \(\kappa _s\) are defined in (3.18), \(\kappa _c\), \(\kappa _g\) and \(\kappa _r\) are given by
and \(\beta \in (0,1)\) is a fixed problem-independent constant.
Proof
Consider \(\mathcal{S}_{\beta } {:=}\{ l \in \mathcal{S}\mid \Vert r(x_{l+1})\Vert > \beta \Vert r(x_l)\Vert \}\), and let \(i\) be the smallest integer for which
that is, \(i\) satisfies (3.29).
First, consider \(l \in \mathcal{G}\cap \mathcal{S}\). Then (3.22) gives that
and, since
for all \(l \in \mathcal{S}\), Lemma 3.11 implies that
By contrast, for \(l \in \mathcal{L}\cap \mathcal{S}\), (3.22) gives that
If additionally \(l \in \mathcal{S}_{\beta }\), (3.36) may be refined as
from (1.4) and the requirement that \(\Vert r(x_{l+1})\Vert > \beta \Vert r(x_l)\Vert \). Using (3.37), (3.34), Lemma 3.11 and (3.33), we then obtain the bound
for all \(l \in \mathcal{L}\cap \mathcal{S}_{\beta }\). Finally, consider \(l \in \mathcal{S}\setminus \mathcal{S}_{\beta }\), for which \(\Vert r(x_{l+1})\Vert \le \beta \Vert r(x_l)\Vert \) and hence \(\Vert r(x_{l+1})\Vert ^{1/2^{i}} \le \beta ^{1/2^{i}} \Vert r(x_l)\Vert ^{1/2^{i}}\). Thus we have that
for all \(l \in \mathcal{L}\cap ( \mathcal{S}\setminus \mathcal{S}_{\beta } )\). Thus, combining (3.35), (3.38) and (3.39), we have that
for \(\kappa _c\), \(\kappa _g\) and \(\kappa _r\) given by (3.32), for all \(l \in \mathcal{S}\).
Now suppose that the stopping rule (3.31) has not been satisfied up until the start of iteration \(k+1\), and thus that
for all \(l \in \mathcal{S}_k\). Combining this with (3.40), we have that
and thus, summing over \(l \in \mathcal{S}_k\) and using (3.34),
As before, combining this with (3.18) and remembering that we need to evaluate the function and gradient at the final \(x_{k+1}\) yields the bound (3.30).\(\square \)
If \(i< i_{{\scriptscriptstyle 0}}\), a weaker bound that includes \(r = 2\) is possible. The key is to note that the purpose of (3.33) is to guarantee the second inequality in (3.38). Without this, we have instead
for all \(l \in \mathcal{L}\cap \mathcal{S}_{\beta }\), and this leads to
where
if (3.41) holds. This results in a bound of \(O\Big (\max ( 1,\epsilon _d^{r/(r-1)} \cdot \epsilon _p^{\left( r/(r-1) - (2^{i+1}-1)/2^{i}\right) }, \epsilon _p^{1/2^{i}} )\Big )\) function evaluations, which approaches that in (3.30) as i increases to infinity when \(r=2\).
4 A modified algorithm for cubic-and-higher regularization
For the case where \(r > 3\), the proof of Lemma 3.4 breaks down as there is no obvious bound on the quantity \(\Vert s_k\Vert ^{3-r}/\sigma _k\). One way around this defect is to modify Algorithm 2.1 so that such a bound automatically occurs. We consider the following variant; our development follows very closely that in [12], itself inspired by [20]. For completeness, we allow \(r = 3\) in this new framework since it is trivial to do so.
Algorithm 4.1
Adaptive Tensor-Newton Regularization when\(r \ge 3\).
A starting point \(x_0\), an initial regularization parameter \(\sigma _0>0\) and algorithmic parameters \(\theta > 0\), \(\alpha \in (0,{\scriptstyle \frac{1}{3}}]\), \(\gamma _3\ge \gamma _2> 1> \gamma _1 >0\) and \(1>\eta _2\ge \eta _1>0\), are given. Evaluate \(\Phi (x_0)\), and test for termination at \(x_0\).
For \(k=0, 1, \ldots \), until termination, do:
1. Compute derivatives of r(x) at \(x_k\).
2. Compute a step \(s_k\) by approximately minimizing \(m^{\mathrm{R}}(x_k,s,\sigma _k)\) so that
$$\begin{aligned} m^{\mathrm{R}}(x_k,s_k,\sigma _k)<m^{\mathrm{R}}(x_k,0,\sigma _k) \end{aligned}$$
and
$$\begin{aligned} \Vert \nabla _s m^{\mathrm{R}}(x_k,s_k,\sigma _k)\Vert \le \theta \Vert s_k\Vert ^{2} \end{aligned}$$(4.1)
hold.
3. Set \(\widehat{x}_k = x_k+s_k\), and test for termination at \(\widehat{x}_k\).
4. Compute \(\Phi (\widehat{x}_k)\) and
$$\begin{aligned} \rho _k = \frac{\Phi (x_k)-\Phi (\widehat{x}_k)}{m(x_k,0)-m(x_k,s_k)}. \end{aligned}$$
If \(\rho _k \ge \eta _1\) and
$$\begin{aligned} \sigma _k \Vert s_k\Vert ^{r-1} \ge \alpha \Vert \nabla _x \Phi (\widehat{x}_k)\Vert , \end{aligned}$$(4.2)
set \(x_{k+1} = \widehat{x}_k\).
5. Set
$$\begin{aligned} \sigma _{k+1} \in \left\{ \begin{array}{cl} [\gamma _1\sigma _k,\sigma _k ] &{} \;\; \text{ if } \;\; \rho _k \ge \eta _2 \;\; \text{ and } \;\; (4.2) \;\; \text{ holds } \;\; \\ \;[ \sigma _k, \gamma _2 \sigma _k ] &{} \;\; \text{ if } \;\; \eta _1 \le \rho _k< \eta _2 \;\; \text{ and } \;\; (4.2) \;\; \text{ holds } \;\; \\ \; [ \gamma _2 \sigma _k, \gamma _3 \sigma _k ] &{} \;\; \text{ if } \;\; \rho _k < \eta _1 \;\; \text{ or } \;\; (4.2) \;\; \text{ fails, } \;\; \end{array}\right. \end{aligned}$$(4.3)
and go to Step 2 if \(\rho _k < \eta _1\) or (4.2) fails.
It is important that termination is tested at Step 3, as deductions from computations in subsequent steps rely on this. We modify our definition of a successful step accordingly so that now \(\mathcal{S}_k =\{0 \le l \le k \mid \rho _l \ge \eta _1 \;\; \text{ and } \;\; (4.2) \;\; \text{ holds }\}\) and \(\mathcal{S}=\{ k \ge 0 \mid \rho _k \ge \eta _1 \;\; \text{ and } \;\; (4.2) \;\; \text{ holds }\}\), and note in particular that Lemma 3.6 continues to hold in this case since it only depends on the adjustments in (4.3). Likewise, a very successful iteration is now one for which \(\rho _k \ge \eta _2\) and (4.2) holds. Note that (4.3), unlike (2.8) in Algorithm 2.1, does not impose a nonzero lower bound on the generated regularization weight; this will be reflected in our derived complexity bounds (cf. Theorems 3.12 and 4.7).
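As a sketch of the modified acceptance logic, the following Python fragment combines the ratio test with the extra condition (4.2); the parameter values are illustrative, and g_hat stands for the true gradient \(\nabla _x \Phi (\widehat{x}_k) = J^T(\widehat{x}_k) r(\widehat{x}_k)\):

```python
import numpy as np

def accept_and_update(sigma, rho, s, g_hat, r_order,
                      alpha=1.0 / 3, eta1=0.1, eta2=0.9,
                      gamma1=0.5, gamma2=2.0, gamma3=4.0):
    """One pass of Steps 4-5 of Algorithm 4.1: the step is accepted only
    if rho >= eta1 AND the extra test (4.2),
    sigma*||s||^(r-1) >= alpha*||g_hat||, holds; sigma is then adjusted
    as in (4.3) (here we simply pick endpoints of the intervals).
    Returns (new_sigma, accepted); parameter values are hypothetical."""
    test_ok = sigma * np.linalg.norm(s) ** (r_order - 1) \
        >= alpha * np.linalg.norm(g_hat)
    if rho >= eta2 and test_ok:        # very successful
        return gamma1 * sigma, True
    if rho >= eta1 and test_ok:        # successful
        return gamma2 * sigma, True
    return gamma3 * sigma, False       # unsuccessful: re-solve Step 2
```

Note that a step with a good ratio \(\rho _k\) is still rejected if \(\sigma _k \Vert s_k\Vert ^{r-1}\) fails to dominate the new gradient, in which case \(\sigma \) is increased and the subproblem re-solved.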
As is now standard, our first task is to establish an upper bound on \(\sigma _k\).
Lemma 4.1
Suppose that AS.1 holds, \(r \ge 3\) and
Then iteration k of Algorithm 4.1 is very successful.
Proof
It follows immediately from (2.7), (3.1), (3.5) and (4.4) that
and thus \(\rho _k \ge \eta _2\). Observe that
since \(1 - \eta _2 \le 1\) and \(r \ge 1\). We also have from (3.20), (3.2) and (4.1) that
and thus from (4.4), (4.5) and the algorithmic restriction \(3 \le 1/\alpha \) that
Thus (4.2) is also satisfied, and hence iteration k is very successful. \(\square \)
Lemma 4.2
Suppose that AS.1 holds, \(r \ge 3\) and
and \(\kappa _2\) is defined in the statement of Lemma 4.1. Then iteration k of Algorithm 4.1 is very successful.
Proof
It follows from Lemma 4.1 that it suffices to show that (4.7) implies (4.4). The result is immediate when \(r=3\) since then (4.4) is (4.7). Suppose therefore that \(r > 3 \) and that (4.4) is not true, that is
Then (4.6), (4.8) and (4.5) imply that
which contradicts (4.7). Thus (4.4) holds. \(\square \)
Unlike in our previous analysis of Algorithm 2.1 when \(r \le 3\), we are unable to deduce an upper bound on \(\sigma _k\) without further consideration. With this in mind, we now suppose that all the iterates \(x_k+s_k\) generated by Algorithm 4.1 satisfy
for some \(\epsilon > 0\) and all \(0 \le k \le l\), and thus, from (4.2), that
for \(k \in \mathcal{S}_l\). In this case, we can show that \(\sigma _k\) is bounded from above.
Lemma 4.3
Suppose that AS.1 holds and \(r \ge 3\). Then provided that (4.9) holds for all \(0 \le k \le l\), Algorithm 4.1 ensures that
and \(\kappa _1\) is defined in the statement of Lemma 4.2.
Proof
The proof is similar to the first part of that of Lemma 3.5. Suppose that iteration \(k+1\) (with \(k \le l\)) is the first for which \(\sigma _{k+1} \ge \sigma _{\max }\). Then, since \(\sigma _k < \sigma _{k+1}\), iteration k must have been unsuccessful and (4.3) gives that
i.e., that
because of (4.9). But then Lemma 4.2 implies that iteration k must be very successful. This contradiction establishes (4.11). \(\square \)
We may also show that a successful step ensures a non-trivial reduction in \(\Phi (x)\).
Lemma 4.4
Suppose that AS.1 holds and \(r \ge 3\). Then provided that (4.9) holds for all \(0 \le k \le l\) and some \(0 < \epsilon \le 1\), Algorithm 4.1 guarantees that
for all \(k \in \mathcal{S}\), where
and \(\kappa _1\) is defined in the statement of Lemma 4.2.
Proof
Since \(0 < \epsilon \le 1\), (4.11) ensures that \(\sigma _{\max } \le \kappa _3 \epsilon ^{(3-r)/2}\) and thus if \(k \in \mathcal{S}\), it follows from (3.5) and (4.10) that
as required. \(\square \)
These introductory lemmas now lead to our main convergence results. First we establish global convergence to a critical point of \(\Phi (x)\).
Theorem 4.5
Suppose that AS.1 holds and \(r \ge 3\). Then the iterates \(\{x_k\}\) generated by Algorithm 4.1 satisfy
if no non-trivial termination test is provided.
Proof
Suppose that (4.14) does not hold, in which case (4.9) holds for some \(0 < \epsilon \le 1\) and all \(k \ge 0\). We then deduce by summing the reduction in \(\Phi (x)\) guaranteed by Lemma 4.4 over successful iterations that
Just as in the proof of Theorem 3.10, this ensures that there are only a finite number of successful iterations. If iteration k is the last of these, all subsequent iterations are unsuccessful, and thus \(\sigma _k\) grows without bound. But as this contradicts Lemma 4.3, (4.9) cannot be true, and thus (4.14) holds. \(\square \)
Next, we give an evaluation complexity result based on the stopping criterion (1.2).
Theorem 4.6
Suppose that AS.1 holds and \(r \ge 3\). Then Algorithm 4.1 requires at most
evaluations of r(x) and its derivatives to find an iterate \(x_{k}\) for which the termination test
is satisfied for given \(0< \epsilon < 1\), where
\(\kappa _u\) is defined in (3.18), \(\kappa _1\) in (4.7) and \(\kappa _3\) in (4.13).
Proof
If the algorithm has not terminated on or before iteration k, (4.9) holds, and so summing (4.12) over successful iterations and recalling that \(\Phi (x_0) = {\scriptstyle \frac{1}{2}}\Vert r(x_0)\Vert ^2\) and \(\Phi (x_k) \ge 0\), we have that
Thus there are at most
successful iterations. Combining this with Lemma 3.6, accounting for the \(\max \) in (4.11) and remembering that we need to evaluate the function and gradient at the final \(x_{k+1}\) yields the bound (4.15). \(\square \)
We note in passing that in order to derive Theorem 4.6, we could have replaced the test (4.2) in Algorithm 4.1 by the normally significantly weaker requirement (4.10).
Our final result examines the evaluation complexity under the stopping rule (3.31).
Theorem 4.7
Suppose that AS.1 holds, \(r \ge 3\) and an \(i \ge 1\) is given. Then Algorithm 4.1 requires at most
when \(r = 3\)
when \(r > 3\) and \(\epsilon _p \epsilon _d < \left( \displaystyle \frac{\kappa _1}{\sigma _0}\right) ^{2/(r-3)} , \text{ or } \text{ otherwise }\)
evaluations of r(x) and its derivatives to find an iterate \(x_{k}\) for which the termination test
is satisfied for given \(0 < \epsilon _p, \epsilon _d \le 1\), where \(\kappa _u\) is defined in (3.18),
\(\kappa _1\) is defined in (4.7), \(\kappa _b\), \(\kappa _i\), \(\kappa _e\) and \(\kappa _a\) in (4.16), and \(\beta \in (0,1)\) is a fixed problem-independent constant.
Proof
As in the proof of Theorem 3.12, let \(\mathcal{S}_{\beta } {:=}\{ l \in \mathcal{S}\mid \Vert r(x_{l+1})\Vert > \beta \Vert r(x_l)\Vert \}\) for a given \(\beta \in (0,1)\). We suppose that Algorithm 4.1 has not terminated prior to iteration \(l + 1\), and thus that
for all \(k \le l+1\). If \(l \in \mathcal{S}_{\beta }\), it follows from (3.5), (4.2) and the definition (1.4) that
and thus applying Lemma 3.11 with \(i \ge 1\),
where \(\kappa _d {:=}\displaystyle \frac{\eta _1 \alpha ^{r/(r-1)}}{2^ir} \beta ^{r/(r-1)} \Vert r(x_0)\Vert ^{(3/2- (2^{i+1}-1)/2^i)}\), as \(3/2 \le (2^{i+1}-1)/2^i\) and (3.34) holds.
In particular (4.20) becomes
and (4.10) holds with \(\epsilon = \epsilon _p \epsilon _d\), and so
from Lemma 4.3. Consider the possibility
In this case, (4.22) implies that
and hence combining with (4.21), we find that
If (4.23) does not hold,
and thus (4.21) implies that
since \(\epsilon _p\) and \(\epsilon _d \le 1\) and \(r \ge 3\). Hence (4.24) and (4.25) hold when \(l \in \mathcal{S}_{\beta }\),
For \(l \in \mathcal{S}\setminus \mathcal{S}_{\beta }\), for which \(\Vert r(x_{l+1})\Vert \le \beta \Vert r(x_l)\Vert \) and hence \(\Vert r(x_{l+1})\Vert ^{1/2^i} \le \beta ^{1/2^i} \Vert r(x_l)\Vert ^{1/2^i}\). Thus in view of (4.19), we have that
for all \(l \in \mathcal{S}\setminus \mathcal{S}_{\beta }\). Thus, combining (4.24), (4.25) and (4.26), we have that
for all \(l \in \mathcal{S}\), where \(\kappa _g\) and \(\kappa _r\) are given by (4.18). Summing over \(l \in \mathcal{S}_k\) and using (3.34),
and thus that there are at most
successful iterations. As before, combining this with Lemma 3.6 for \(\epsilon = \epsilon _p \epsilon _d\), accounting for the \(\max \) in (4.11) and remembering that we need to evaluate the function and gradient at the final \(x_{k+1}\) yields the bound (4.17). \(\square \)
Comparing (3.30) with (4.17), there seems little theoretical advantage (aside from constants) in using regularization of order more than three. We note, however, that the constants in the complexity bounds in Sect. 3 depend (inversely) on \(\sigma _{\min }\), while those in Sect. 4 do not; whether this is important in practice for small chosen \(\sigma _{\min }\) depends on quite how tight our bounds actually are when \(r=3\).
5 Numerical experiments
We compare the performance of the newly proposed algorithm with a Gauss–Newton method, with regularization of order two, and a Newton method, with regularization of order three. We use implementations of these algorithms found in our RALFit software [28], which is an open-source Fortran package for solving nonlinear least-squares problems. We apply tensor-Newton methods with regularization powers \(r = 2\) and 3, and we solve the subproblem (Step 2 of Algorithm 2.1) by calling the RALFit code recursively; see [19] for details.
Table 1 reports the number of iterations, function evaluations, and Jacobian evaluations needed to solve the 26 problems in the NIST nonlinear regression test set [27]. We also include the median numbers over all tests.
Table 1 shows that, for most problems in the test set, the tensor-Newton methods required fewer iterations, function evaluations, and Jacobian evaluations. We can learn more about the performance on individual problems by looking at convergence curves that plot the gradient, \(\Vert J^Tr\Vert \), at each iteration; we give these for a number of the problems, chosen to represent different behaviours, in Fig. 1. As should be expected, the asymptotic convergence rate of the Newton approximation is better than that of Gauss–Newton. We also see that, despite the inferior asymptotic convergence rate of Gauss–Newton, it often converges in fewer iterations than Newton because it takes longer for Newton to enter this asymptotic regime (see, e.g., [13]). This is the case in Fig. 1a–c (see also Table 1). Our newly proposed tensor-Newton algorithm seems to converge at the same asymptotic rate as Newton, but enters this regime much earlier, as is typical of Gauss–Newton. We credit this behaviour to the fact that, unlike Newton, the Gauss–Newton and tensor-Newton models are themselves sums of squares. We note that, although we observe something close to quadratic convergence in practice, whether this is always the asymptotic convergence rate is an open question (but see “Appendix B”).
Figure 1d shows convergence curves for one of the few tests where the performance of tensor-Newton is worse than that of the alternatives. All four methods struggle with this problem initially, but Gauss–Newton and Newton fall into the asymptotic regime first. Figure 1c, by contrast, shows an example where both variants of tensor-Newton perform much better than Gauss–Newton/Newton, which both suffer from a long period of stagnation.
The NIST examples are generally too small to make useful time comparisons. In Table 2 we report timings for those where at least one of the solvers took over 0.5s. These computations were performed on a desktop machine running Linux Mint 18.2, with an Intel Core i7-7700 and 16GB RAM, and we used the gfortran compiler.
We see that the cost of carrying out an iteration of the tensor-Newton method is significantly higher than that of Gauss–Newton/Newton, but there are examples (e.g., BENNETT5, MGH17) where it is the fastest.
In order to demonstrate the behaviour of the algorithms with an expensive function evaluation, we performed an experiment where we read in the data at each function/derivative evaluation from a directory stored on a remote computer. We performed this test for the example closest to the median behaviour in Table 1: MISRA1B. Here, Gauss–Newton took 0.108 s, Newton 0.148 s, and tensor-Newton 0.004 s. This highlights that, while more work needs to be done per iteration in the tensor-Newton method, once the function has been evaluated and the derivatives calculated, it makes greater use of the information, which can lead to a faster solution time.
6 Conclusions
We have proposed and analysed a related pair of tensor-Newton algorithms for solving nonlinear least-squares problems. Under reasonable assumptions, the algorithms have been shown to converge globally to a first-order critical point. Moreover, their function-evaluation complexity is as good as that of the best-known algorithms for such problems. In particular, convergence to an \(\epsilon \)-first-order critical point of the sum-of-squares objective (1.1) requires at most \(O\left( \epsilon ^{-\max (r/(r-1),3/2)}\right) \) function evaluations with \(r\)-th-order regularization for \(r \ge 2\). Moreover, convergence to a point that satisfies the more natural convergence criteria (1.3) takes at most \(O\left( \max \left( \epsilon _d^{-\max (r/(r-1),3/2)}, \epsilon _p^{- 1/2^{i}}\right) \right) \) evaluations for any chosen \(i\ge \left\lceil \log _2 \left( (r-1)/(r-2) \right) \right\rceil \). Whether such bounds can be achieved (i.e., whether they are sharp) is an open question.
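To illustrate the constraint on \(i\), the smallest admissible value grows as \(r\) approaches 2 from above; a few representative values, computed directly from the ceiling formula, are:

```latex
i \;\ge\; \left\lceil \log_2\!\left(\frac{r-1}{r-2}\right) \right\rceil,
\qquad\text{e.g.}\qquad
\begin{array}{c|ccc}
r & 9/4 & 5/2 & 3 \\ \hline
\left\lceil \log_2\!\left((r-1)/(r-2)\right) \right\rceil & 3 & 2 & 1
\end{array}
```

so the weaker the regularization order (\(r\) near 2), the larger \(i\) must be taken.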
Although quadratic (\(r=2\)) regularization produces the poorest theoretical worst-case bound above, in practice it often performs well. Moreover, although quadratic regularization is rarely mentioned in the literature on general optimization (but see [2] for a recent example), it is perhaps more natural in the least-squares setting, since the Gauss- and tensor-Newton approximations (2.2) are naturally bounded from below, and thus it might be argued that the regularization need not be so severe. The rather weak dependence of the second bound above on \(\epsilon _p\) is worth noting. Indeed, increasing \(i\) reduces this dependence further, but of course the constant hidden by the \(O(\cdot )\) notation grows with \(i\). A similar improvement on the related bound in [9, Theorem 3.2] is possible using the same arguments.
It is also possible to imagine generalizations of the methods here in which the quadratic tensor-Newton model in (2.1) is replaced by a \(p\)-th-order Taylor approximation (\(p>2\)). One might then anticipate evaluation-complexity bounds in which the exponents \(\max (r/(r-1),3/2)\) mentioned above are replaced by \(\max (r/(r-1),(p+1)/p)\), along the lines considered elsewhere [11, 12]. The limiting factor in practice will likely be the cost of computing the higher-order derivative tensors.
An open question relates to the asymptotic rates of convergence of our methods. It is well known that Gauss–Newton methods converge quadratically for zero-residual problems under reasonable assumptions, but that a Newton-like method is needed to achieve this rate when the optimal residuals are nonzero. It is not clear what the rate is for our tensor-Newton method. The main obstacle to a convincing analysis is that, unlike its quadratic counterpart, a quartic model such as that used by the tensor-Newton method may have multiple minimizers. Our inner-iteration stopping criteria make no attempt to distinguish between them; indeed, to do so would require global optimality conditions. In practice, however, we generally observe at least quadratic convergence, sometimes even faster when the optimal residuals are zero. In “Appendix B”, we indicate that a reasonable choice of the step \(s_k\) in Algorithm 2.1 does indeed converge with an asymptotic Q-rate of \(r-1\) for \(2< r < 3\) under standard assumptions. Extending this to Algorithm 4.1 is less obvious, as it is unclear whether the additional acceptance test (4.2) might deny an otherwise rapidly-converging natural choice of the step.
Our interest in these algorithms has been prompted by observed good behaviour when applied to practical problems [19]. The resulting software is available as part of the RALFit [28] and GALAHAD [18] software libraries.
Notes
1. Our proof avoids the traditional route via a \(\lim \inf \) result, and is indebted to [14].
2. It is unclear what happens when \(r=2\) or 3.
References
Birgin, E.G., Gardenghi, J.L., Martínez, J.M., Santos, S.A., Toint, Ph.L.: Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Math. Program. 163(1), 359–368 (2017)
Birgin, E.G., Martínez, J.M.: Quadratic regularization with cubic descent for unconstrained optimization. Technical Report MCDO271016, State University of Campinas, Brazil (2016)
Björck, Å.: Numerical Methods for Least Squares Problems. SIAM, Philadelphia (1996)
Bouaricha, A., Schnabel, R.B.: Algorithm 768: TENSOLVE: a software package for solving systems of nonlinear equations and nonlinear least-squares problems. ACM Trans. Math. Softw. 23(2), 174–195 (1997)
Bouaricha, A., Schnabel, R.B.: Tensor methods for large, sparse nonlinear least squares problems. SIAM J. Sci. Stat. Comput. 21(4), 1199–1221 (1999)
Cartis, C., Gould, N.I.M., Toint, Ph.L.: On the complexity of steepest descent, Newton’s method and regularized Newton’s methods for nonconvex unconstrained optimization problems. SIAM J. Optim. 20(6), 2833–2852 (2010)
Cartis, C., Gould, N.I.M., Toint, Ph.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math. Program. Ser. A 127(2), 245–295 (2011)
Cartis, C., Gould, N.I.M., Toint, Ph.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function and derivative-evaluation complexity. Math. Program. Ser. A 130(2), 295–319 (2011)
Cartis, C., Gould, N.I.M., Toint, Ph.L.: On the evaluation complexity of cubic regularization methods for potentially rank-deficient nonlinear least-squares problems and its relevance to constrained nonlinear optimization. SIAM J. Optim. 23(3), 1553–1574 (2013)
Cartis, C., Gould, N.I.M., Toint, Ph.L.: Evaluation complexity bounds for smooth constrained nonlinear optimization using scaled KKT conditions and high-order models. Report naXys-11-2015(R1), University of Namur, Belgium (2015)
Cartis, C., Gould, N.I.M., Toint, Ph.L.: Improved worst-case evaluation complexity for potentially rank-deficient nonlinear least-Euclidean-norm problems using higher-order regularized models. Technical Report RAL-TR-2015-011, Rutherford Appleton Laboratory, Chilton, Oxfordshire (2015)
Cartis, C., Gould, N.I.M., Toint, Ph.L.: Universal regularization methods: varying the power, the smoothness and the accuracy. Preprint RAL-P-2016-010, Rutherford Appleton Laboratory, Chilton, Oxfordshire (2016)
Chen, P.: Hessian matrix vs. Gauss–Newton Hessian matrix. SIAM J. Numer. Anal. 49(4), 1417–1435 (2011)
Curtis, F.E., Lubberts, Z., Robinson, D.P.: Concise complexity analyses for trust-region methods. Technical Report 01-2018, Johns Hopkins University, Baltimore (2018)
Curtis, F.E., Robinson, D.P., Samadi, M.: A trust region algorithm with a worst-case iteration complexity of \(O(\epsilon ^{-3/2})\) for nonconvex optimization. Math. Program. 162(1–2), 1–32 (2017)
Dennis, J.E., Gay, D.M., Welsch, R.E.: An adaptive nonlinear least-squares algorithm. ACM Trans. Math. Softw. 7(3), 348–368 (1981)
Gill, P.E., Murray, W.: Algorithms for the solution of the nonlinear least squares problem. SIAM J. Numer. Anal. 15(5), 977–992 (1978)
Gould, N.I.M., Orban, D., Toint, Ph.L.: GALAHAD—a library of thread-safe Fortran 90 packages for large-scale nonlinear optimization. ACM Trans. Math. Softw. 29(4), 353–372 (2003)
Gould, N.I.M., Rees, T., Scott, J.A.: A higher order method for solving nonlinear least-squares problems. Technical Report RAL-P-2017-010, STFC Rutherford Appleton Laboratory (2017)
Grapiglia, G.N., Nesterov, Y.: Regularized Newton methods for minimizing functions with Hölder continuous Hessians. SIAM J. Optim. 27(1), 478–506 (2017)
Levenberg, K.: A method for the solution of certain problems in least squares. Q. Appl. Math. 2(2), 164–168 (1944)
Marquardt, D.: An algorithm for least-squares estimation of nonlinear parameters. SIAM J. Appl. Math. 11(2), 431–441 (1963)
Moré, J.J.: The Levenberg-Marquardt algorithm: implementation and theory. In: Watson, G.A. (ed.) Numerical Analysis, Dundee 1977, Number 630 in Lecture Notes in Mathematics, pp. 105–116. Springer, Berlin (1978)
Morrison, D.D.: Methods for nonlinear least squares problems and convergence proofs. In: Lorell, J., Yagi, F. (eds.) Proceedings of the Seminar on Tracking Programs and Orbit Determination, pp. 1–9. Jet Propulsion Laboratory, Pasadena (1960)
Nesterov, Yu.: Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, Dordrecht (2004)
Nesterov, Yu., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Program. 108(1), 177–205 (2006)
NIST Nonlinear Regression Datasets. http://www.itl.nist.gov/div898/strd/nls/nls_main.shtml. Accessed June 2018
RALFit. https://github.com/ralna/RALFit. Accessed 20 July 2018
Transtrum, M.K., Machta, B.B., Sethna, J.P.: Why are nonlinear fits to data so challenging? Phys. Rev. Lett. 104(6), 060201 (2010)
Transtrum, M.K., Sethna, J.P.: Geodesic acceleration and the small-curvature approximation for nonlinear least squares (2012). arXiv:1207.4999
Zhang, H., Conn, A.R.: On the local convergence of a derivative-free algorithm for least-squares minimization. Comput. Optim. Appl. 51(2), 481–507 (2012)
Zhang, H., Conn, A.R., Scheinberg, K.: A derivative-free algorithm for least-squares minimization. SIAM J. Optim. 20(6), 3555–3576 (2010)
Acknowledgements
The authors are grateful to two referees and the editor for their very helpful comments on the original draft of this paper.
This work was supported by EPSRC Grant EP/M025179/1.
Appendices
Appendix A: Proofs of function bounds (3.1)–(3.4)
We assume that each \(r_i(x)\), \(i=1,\ldots ,m\), is twice-continuously differentiable, and that these functions and their first two derivatives are Lipschitz continuous on the segments \(\mathcal{F}_k = \{ x: x = x_k+ \alpha s_k \;\; \text{ for } \text{ some } \;\; \alpha \in [0,1]\}\). Therefore
for \(x, y \in \mathcal{F}_k\). Moreover, these Lipschitz bounds imply that
for \(x \in \mathcal{F}_k\) [25, Lemma 1.2.2]. It follows from Taylor’s theorem and (A.1) that
and from the definition (2.1) of \(t_i(x,s)\), the Cauchy-Schwarz inequality, (A.2) and the monotonicity bound
that
Thus if \(\Vert s_k\Vert \le 1\), it follows from the triangle inequality that
which provides the bound (3.1) with \(L_{\mathrm{f}} {:=}{\scriptstyle \frac{1}{12}}m L_{\mathrm{h}} ( 2\Vert r(x_0)\Vert + 2L_{\mathrm{r}} + L_{\mathrm{j}} + L_{\mathrm{h}})\).
Taylor’s theorem once again gives that
But then the triangle inequality together with (A.3), (A.5) and (A.6) give
Hence, if \(\Vert s_k\Vert \le 1\), we have that
which is (3.2) with \(L_{\mathrm{g}} {:=}m ( {\scriptstyle \frac{1}{6}}L_{\mathrm{h}} L_{\mathrm{j}} + {\scriptstyle \frac{1}{2}}L_{\mathrm{j}} (\Vert r(x_0)\Vert + L_{\mathrm{r}} + {\scriptstyle \frac{1}{2}}L_{\mathrm{j}} ))\).
The bound (3.3) follows immediately from Cauchy-Schwarz and (A.2) with \(L_{\mathrm{J}} {:=}L_{\mathrm{r}}\). Finally (A.2), (A.4) and the well-known relationship \(\Vert \cdot \Vert _1^{} \le \sqrt{m} \Vert \cdot \Vert \) between the \(\ell _1\) and Euclidean norms give
which is (3.4) with \(L_{\mathrm{H}}\, {:=}\,\sqrt{m} L_{\mathrm{j}}\).
Appendix B: Superlinear convergence
We focus on Algorithm 2.1 and the case \(2< r < 3\) (see Note 2). Denote the leftmost eigenvalue of a generic real symmetric matrix H by \(\lambda _{\min }[H]\). Consider the gradient \(\nabla _s m^{\mathrm{R}}(x,s,\sigma )\) of the regularized model given by (2.4). It follows from (2.1) and (2.2) that
where for brevity we have written
Ideally one might hope to choose s in (B.1) to make \(\nabla _s m^{\mathrm{R}}(x_k,s,\sigma _k) =0\), but this is generally unrealistic as \(\nabla _s m^{\mathrm{R}}(x,s,\sigma )\) is a combination of a cubic function and the derivative of the regularization term. A tractable compromise is to pick \(s =s^{\mathrm{N}}_k\), so that
where
since this provides a zero of the lower-order terms in (B.1).
We will try \(s_k = s^{\mathrm{N}}_k\) if \(H_k\) is positive definite, with leftmost eigenvalue \(\lambda _{\min ,k} {:=}\lambda _{\min }[H_k] > 0\), and three essential properties hold, namely that
If so, \(s^{\mathrm{N}}_k\) provides a successful step in Algorithm 2.1, since (B.4)–(B.6) then imply that (2.5)–(2.6) and \(\rho _k \ge \eta _1\) hold. We are not specific about how \(s_k\) is chosen when \(H_k\) is not positive definite, nor how \(s_k\) might be chosen if \(s^{\mathrm{N}}_k\) does not provide a successful step.
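The positive-definiteness test on \(H_k\) can be realized in practice via an attempted Cholesky factorization, which also furnishes the step \(s^{\mathrm{N}}_k\) itself. The fragment below is an illustrative sketch under this assumption, not part of the algorithm's specification:

```python
import numpy as np

def try_newton_step(H, g):
    """Attempt the Newton-like step s = -H^{-1} g, but only when H is
    positive definite (detected here via Cholesky); return None otherwise
    so the caller can fall back to some alternative choice of step."""
    try:
        L = np.linalg.cholesky(H)   # raises LinAlgError if H is not PD
    except np.linalg.LinAlgError:
        return None
    # Solve H s = -g using the factor: L y = -g, then L^T s = y.
    y = np.linalg.solve(L, -g)
    return np.linalg.solve(L.T, y)
```

Using the factorization both as the definiteness test and to compute the step avoids a separate eigenvalue computation.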
Consider the sub-sequence of iterates \(\{x_k\}\), \(k \in \mathcal{K}\), whose limit is \(x_*\) (and thus for which \(g_* {:=}\nabla _x \Phi (x_*) = 0\) because of Theorem 3.9). Suppose that \(\nabla _x \Phi (x)\) is Lipschitz continuous in an open neighbourhood of \(x_*\), and that \(\lambda _{\min ,*} {:=}\lambda _{\min }[ \nabla _{xx} \Phi (x_*)] > 0\). Then, for all \(k \in \mathcal{K}\) sufficiently large, \(\lambda _{\min ,k} \ge {\scriptstyle \frac{1}{2}}\lambda _{\min ,*}\). This ensures that
and hence (B.2) and (B.7) provide the bound
But Lipschitz continuity and Taylor’s theorem applied to \(\nabla _{x} \Phi (x)\) yield
and
for some constants \(L_1, L_2 > 0\), and thus
because of (B.8).
Define
where \(\sigma _{\max }\) is given by (3.17), and suppose that \(x_k \in \mathcal{X}\), where
and \(\mathcal{B}_{\delta } = \{ x \mid \Vert x-x_*\Vert \le \delta \}\) is any ball around \(x_*\) of fixed radius \(\delta > 0\) for which \(\lambda _{\min }[\nabla _{xx} \Phi (x)] \ge {\scriptstyle \frac{1}{2}}\lambda _{\min ,*}\) for all \(x \in \mathcal{B}_{\delta }\). In this case (B.10) guarantees that
and hence, trivially,
We now establish the required bounds (B.4)–(B.6). Firstly, expanding the definition (2.2) of m(x, s) gives
and it follows directly from the Cauchy-Schwarz inequality, (A.2) and (B.14) that
Substituting (B.15) into the definition (2.3) of the regularized model \(m^{\mathrm{R}}(x,s,\sigma )\) gives
because of the positive semi-definiteness of \( H_k + \lambda _k I\), the requirement that \(\sigma _k \ge \sigma _{\min }>0\), and the bounds (A.2) and (B.16) and the second term in (B.13). This provides the required bound (B.4).
It also follows immediately from (B.1) and (B.2) that
using the triangle inequality, (A.2) and the third term in (B.13), which establishes (B.5).
Finally, it follows precisely as in (3.16) that
since \(\sigma _k \ge \sigma _{\min }>0\). Combining this with the fourth term in (B.13) immediately gives that \(| \rho _k - 1 | \le 1 - \eta _2\) and hence that (B.6) holds. Thus we have shown that \(s^{\mathrm{N}}_k\) is allowed by Step 2 of Algorithm 2.1, and leads to a successful iteration for which \(x_{k+1} = x_k + s^{\mathrm{N}}_k\).
Our intention is to show that
for some \(\kappa > 0\), and hence the resulting iteration ultimately converges at a (Q-order \(r-1\)) superlinear rate. The iterate \(x_{k+1} = x_k + s^{\mathrm{N}}_k\) satisfies
Taking norms and combining this with (B.9) gives
using (B.18), (B.7), (B.3), (B.10) and (B.11) and the appropriate bound \(\sigma _k \le \sigma _{\max }\) from (3.17). Thus (B.17) holds. Moreover, it also follows from (B.19) and the first term in (B.12) that
in which case \(x_{k+1} \in \mathcal{X}\), and thus (B.12) continues to hold at iteration \(k+1\). Hence, once an iterate enters \(\mathcal{X}\), it remains there, and the subsequent iterates converge superlinearly to \(x_*\).
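The Q-order claimed in (B.17) can also be checked numerically: if the errors satisfied \(e_{k+1}=\kappa \, e_k^{q}\) exactly, then ratios of successive log-error differences recover \(q\). A toy illustration on a synthetic sequence (not one produced by Algorithm 2.1), with \(q = r-1 = 3/2\) corresponding to \(r = 5/2\):

```python
import math

# Synthetic error sequence with exact Q-order q: e_{k+1} = kappa * e_k**q.
# For regularization order r = 5/2, the analysis suggests q = r - 1 = 3/2.
q, kappa = 1.5, 0.8
e = [1e-1]
for _ in range(5):
    e.append(kappa * e[-1] ** q)

# Since log e_{k+2} - log e_{k+1} = q * (log e_{k+1} - log e_k), the ratio
# of successive log-error differences recovers the Q-order.
estimate = math.log(e[5] / e[4]) / math.log(e[4] / e[3])
```

The same diagnostic applied to the iterates of Algorithm 2.1 is how we observe the "close to quadratic" behaviour reported in Sect. 5.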
Cite this article
Gould, N.I.M., Rees, T. & Scott, J.A. Convergence and evaluation-complexity analysis of a regularized tensor-Newton method for solving nonlinear least-squares problems. Comput Optim Appl 73, 1–35 (2019). https://doi.org/10.1007/s10589-019-00064-2