1 Introduction

We consider the following general unconstrained optimization problem

$$\begin{aligned} \min _{x\in R^n} f(x), \end{aligned}$$
(1)

where f is assumed to be at least continuously differentiable. Line-search and trust-region methods [14, 35] represent two competing approaches to solving (1). Which of them is more successful is problem dependent.

At each iteration of a trust-region method, a trial step is generated by minimizing a quadratic model of f(x) within a trust-region. The trust-region subproblem is formulated, for the k-th iteration, as follows:

$$\begin{aligned} \min _{s\in {\varOmega }_k} g_k^Ts + \frac{1}{2}s^T B_k s \equiv q_k(s), \end{aligned}$$
(2)

where \(g_k=\nabla f(x_k)\), and \(B_k\) is either the true Hessian at \(x_k\) or its approximation. The trust-region is a ball of radius \({\varDelta }_k\)

$$\begin{aligned} {\varOmega }_k = \{s\in R^n: \Vert s\Vert \le {\varDelta }_k\}. \end{aligned}$$

It is usually defined by a fixed vector norm, typically a scaled \(l_2\) or \(l_{\infty }\) norm. If the trial step provides a sufficient decrease in f, it is accepted; otherwise, the trust-region radius is decreased while keeping the same model function.

Worth mentioning are the following attractive features of trust-region methods. First, they can benefit from using negative curvature information contained in \(B_k\). Second, an important feature is exhibited when the full quasi-Newton step \(-B_k^{-1}g_k\) does not produce a sufficient decrease in f, and the radius \({\varDelta } _k < \Vert -B_k^{-1}g_k\Vert \) is such that the quadratic model provides a relatively good prediction of f within the trust-region. In this case, since the trial step obtained by minimizing \(q_k(s)\) over all of \({\varOmega }_k\) provides a better predicted decrease in f than a minimizer of \(q_k(s)\) restricted to the quasi-Newton direction, it is natural to expect that the actual reduction in f produced by a trust-region method is also better. Furthermore, if \(B_k \succ 0\) is ill-conditioned, the quasi-Newton search direction may be almost orthogonal to \(g_k\), which can adversely affect the efficiency of a line-search method. In contrast, the direction of the vector \(s({\varDelta }_k)\) that solves (2) approaches the direction of \(-g_k\) as \({\varDelta }_k\) decreases.

There exists a variety of approaches [14, 35, 40, 44] to approximately solving the trust-region subproblem defined by the Euclidean norm. Depending on how accurately the trust-region subproblem is solved, the methods are categorized as nearly-exact or inexact.

The class of inexact trust-region methods includes, e.g., the dogleg method [37, 38], double-dogleg method [15], truncated conjugate gradient (CG) method [39, 41], Newton-Lanczos method [26], subspace CG method [43] and two-dimensional subspace minimization method [12].

Faster convergence, in terms of the number of iterations, is generally expected when the trust-region subproblem is solved more accurately. Nearly-exact methods are usually based on the optimality conditions presented by Moré and Sorensen [32] for the Euclidean norm used in (2). These conditions state that there exists a pair \((s^*,\sigma ^*)\) such that \(\sigma ^*\ge 0\) and

$$\begin{aligned} \begin{array}{rcl} (B_k+\sigma ^* I)s^* &{} = &{} -g_k,\\ \sigma ^*(\Vert s^*\Vert _2-{\varDelta }_k) &{}= &{} 0,\\ B_k+\sigma ^* I&{} \succeq &{} 0. \end{array} \end{aligned}$$
(3)

In these methods, a nearly-exact solution is obtained by iteratively improving \(\sigma \) and solving in s the linear system

$$\begin{aligned} (B_k+\sigma I) s = - g_k. \end{aligned}$$
(4)

The class of limited-memory quasi-Newton methods [3, 23, 24, 30, 34] is one of the most effective tools for solving large-scale problems, especially when maintaining and operating with a dense Hessian approximation is costly. In these methods, a few pairs of vectors

$$\begin{aligned} s_i = x_{i+1} - x_i \quad \hbox {and} \quad y_i = \nabla f(x_{i+1}) - \nabla f(x_i) \end{aligned}$$
(5)

are stored for implicitly building an approximation of the Hessian, or its inverse, by using a low rank update of a diagonal matrix. The number of such pairs is limited by \(m \ll n\). This allows for arranging efficient matrix-vector multiplications involving \(B_k\) and \(B_k^{-1}\).
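For illustration, a minimal sketch of how such a pair buffer might be maintained is given below (Python; the class name and interface are ours, not taken from any cited implementation):

```python
from collections import deque
import numpy as np

class PairBuffer:
    """Stores at most m correction pairs (s_i, y_i); a sketch, not the paper's code."""

    def __init__(self, m):
        self.pairs = deque(maxlen=m)   # the oldest pair is dropped automatically

    def update(self, s, y):
        self.pairs.append((s, y))

    def SY(self):
        # S = [..., s_i, ...] and Y = [..., y_i, ...] with columns in increasing
        # iteration order, as used in the compact representation (6).
        S = np.column_stack([s for s, _ in self.pairs])
        Y = np.column_stack([y for _, y in self.pairs])
        return S, Y
```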

For most of the quasi-Newton updates, the Hessian approximation admits a compact representation

$$\begin{aligned} B_k=\delta _kI+V_kW_kV_k^T, \end{aligned}$$
(6)

where \(\delta _k\) is a scalar, \({W_k\in R^{{\bar{m}}\times {\bar{m}}}}\) is a symmetric matrix and \({V_k\in R^{n\times {\bar{m}}}}\). This is the main property that will be exploited in this paper. The value of \({\bar{m}}\) depends on the number of stored pairs (5) and may vary from iteration to iteration. Its maximal value depends on the updating formula and typically equals m or 2m. To simplify the presentation and our analysis, especially when specific updating formulas are discussed, we shall assume that the number of stored pairs and \({\bar{m}}\) are equal to their maximal values.

So far, the most successful implementations of limited-memory methods have been associated with line search. Nowadays, the most popular limited-memory line-search methods are based on the BFGS update [35], named after Broyden, Fletcher, Goldfarb and Shanno. The complexity of computing a search direction in the best implementations of these methods is 4mn.

Line-search methods often employ the strong Wolfe conditions [42], which require additional function and gradient evaluations. These methods impose a strong requirement of positive definiteness on the Hessian approximation, while trust-region methods, as mentioned above, can even gain from exploiting information about possible negative curvature. Moreover, the latter methods do not require gradient computation at unacceptable trial points. Unfortunately, any straightforward embedding of limited-memory quasi-Newton techniques in the trust-region framework deteriorates the efficiency of the limited-memory approach.

The existing refined limited-memory trust-region methods [10, 19, 20, 29] typically use the limited-memory BFGS updates (L-BFGS) for approximating the Hessian and the Euclidean norm for defining the trust-region. In the double-dogleg approach by Kaufman [29], the Hessian and its inverse are simultaneously approximated using the L-BFGS in a compact representation [11]. The cost of one iteration of this inexact approach varies from \(4mn + O(m^2)\) to O(n) operations, depending on whether the trial step was accepted at the previous iteration or not. Using the same compact representation, Burke et al. [10] proposed two versions of implementing the Moré–Sorensen approach [32] for finding a nearly-exact solution to the trust-region subproblem. The cost of one iteration varies from \({2}mn + O(m^{3})\) to either \(2m^{2}n+2mn+O(m^3)\) or \(6mn + O(m^2)\) operations, depending on how the updating of \(B_k\) is implemented. Recently, Erway and Marcia [18–20] proposed a new technique for solving (4), based on the unrolling formula of L-BFGS [11]. In this case, the cost of one iteration of their implementation [20] is \(O(m^2n)\) operations. In the next sections, we describe the aforementioned limited-memory trust-region methods in more detail and compare them with those we propose here.

The aim of this paper is to develop new approaches that allow for effectively combining the limited-memory and trust-region techniques. They should break a widespread belief that such combinations are less efficient than line-search-based methods.

We focus here on the quasi-Newton updates that admit a compact representation (6). It should be underlined that a compact representation is available for most of the quasi-Newton updates, such as BFGS [11], symmetric rank-one (SR1) [11] and multipoint symmetric secant approximations [8], which contain the Powell-symmetric-Broyden (PSB) update [38] as a special case. Most recently, Erway and Marcia [21] provided a compact representation for the entire Broyden convex class of updates.

We begin in Sect. 2 by showing how to efficiently compute, at a cost of \(O({{{\bar{m}}}}^3)\) operations, the eigenvalues of \(B_k\) with implicitly defined eigenvectors. This way of computing and using the eigenvalue decomposition of general limited-memory updates (6) was originally introduced in [7, 9], and then successfully exploited in [2, 21, 22]. An alternative way was earlier presented in the unpublished doctoral dissertation [31] for a special updating formula, namely, the limited-memory SR1. The difference between these two ways is discussed in Sect. 2. For the case when the trust-region is defined by the Euclidean norm and the implicit eigenvalue decomposition is available, we show in Sect. 3 how to find a nearly-exact solution to the trust-region subproblem at a cost of \(2{{\bar{m}}}n+O({{\bar{m}}})\) operations. In Sect. 4, we introduce two new norms that lean upon the eigenvalue decomposition of \(B_k\). The shape of the trust-region defined by these norms changes from iteration to iteration. The new norms allow for decomposing the corresponding trust-region subproblem into a set of easy-to-solve quadratic programming problems. For one of the new norms, the exact solution to the trust-region subproblem is obtained in closed form. For the other one, the solution is reduced to a small \({\bar{m}}\)-dimensional trust-region subproblem in the Euclidean norm. In Sect. 5, a generic trust-region algorithm is presented, which is used in the implementation of our algorithms. In Sect. 6, global convergence is proved for eigenvalue-based limited-memory methods. In Sects. 2–6, \(B_k\) is not required to be positive definite, except in Lemma 3, where the L-BFGS updating formula is considered. The rest of the paper is focused on a specific positive definite quasi-Newton update, namely, L-BFGS. For this case, we develop in Sect. 7 an algorithm in which the computational cost of one iteration varies from \(4mn+O(m^3)\) to \(2mn+O(m^2)\) operations, depending on whether the trial step was accepted at the previous iteration or not. This means that the highest-order term in the computational cost is the same as for computing the search direction in line-search L-BFGS algorithms. In Sect. 8, we propose improved versions of the limited-memory trust-region algorithms [10, 29]. The results of numerical experiments are presented in Sect. 9. They demonstrate the efficiency of our limited-memory trust-region algorithms. We conclude our work and discuss future directions in Sect. 10.

2 Spectrum of limited-memory Hessian approximation

Consider the trust-region subproblem (2), in which we simplify notation by dropping the subscripts, i.e., we consider

$$\begin{aligned} \min _{s\in {\varOmega }} g^Ts + \frac{1}{2}s^T B s \equiv q(s). \end{aligned}$$
(7)

It is assumed in this paper that the Hessian approximation admits the compact representation (6), that is,

$$\begin{aligned} B=\delta I+VWV^T, \end{aligned}$$
(8)

where \(W\in R^{{\bar{m}} \times {\bar{m}}}\) is a symmetric matrix and \(V\in R^{n\times {\bar{m}}}\). The main assumption here is that \({\bar{m}} \ll n\). It is natural to assume that the scalar \(\delta \) is positive because there are, in general, no special reasons to base the model q(s) on the hypothesis that the curvature of f(x) is zero or negative along all directions orthogonal to the columns of V.

Below, we demonstrate how to exploit compact representation (8) and efficiently compute eigenvalues of B. For trust-regions of a certain type, this will permit us to easily solve the trust-region subproblem (7).

Suppose that the Cholesky factorization \(V^T V = R^T R\) is available, where \(R \in R^{{\bar{m}} \times {\bar{m}} }\) is upper triangular. The rank of V, denoted here by r, is equal to the number of nonzero diagonal elements of R. Let \(R_{\dagger } \in R^{r \times {\bar{m}}}\) be obtained from R by deleting the rows that contain zero diagonal elements, and let \(R_{\ddagger } \in R^{r \times r}\) be obtained by additionally deleting the columns of R of the same property. Similarly, we obtain \(V_{\dagger } \in R^{n \times r}\) by deleting the corresponding columns of V. Consider the \(n \times r\) matrix

$$\begin{aligned} Q = V_{\dagger } R_{\ddagger }^{-1}. \end{aligned}$$
(9)

Its columns form an orthonormal basis for the column space of both \(V_{\dagger }\) and V. The equality

$$\begin{aligned} V = Q R_{\dagger } \end{aligned}$$
(10)

can be viewed as the rank revealing QR (RRQR) decomposition of V [1, Theorem 1.3.4].

By decomposition (10), we have

$$\begin{aligned} B=\delta I+Q R_{\dagger } W R_{\dagger }^T Q^T, \end{aligned}$$

where the matrix \(R_{\dagger } W R_{\dagger }^T\in R^{r\times r}\) is symmetric. Consider its eigenvalue decomposition \(R_{\dagger } W R_{\dagger }^T = U D U^T\), where \(U\in R^{r\times r}\) is orthogonal and \(D\in R^{r\times r}\) is a diagonal matrix composed of the eigenvalues \((d_1,d_2,\ldots ,d_{r})\). Denote \(P_{\parallel }=QU\in R^{n\times r}\). The columns of \(P_{\parallel }\) form an orthonormal basis for the column space of V. This yields the following representation of the quasi-Newton matrix:

$$\begin{aligned} B = \delta I + P_{\parallel } D P_{\parallel }^T. \end{aligned}$$

Let \(P_{\perp }\in R^{n\times (n-r)}\) define the orthogonal complement to \(P_{\parallel }\). Then \(P=[P_{\parallel }\; P_{\perp }]\in R^{n\times n}\) is an orthogonal matrix. This leads to the eigenvalue decomposition:

$$\begin{aligned} B=P\left( \begin{array}{cc} {\varLambda } &{} \quad 0 \\ 0 &{} \quad \delta I_{{n-r}} \end{array} \right) P^T, \end{aligned}$$
(11)

where \({\varLambda }=\text{ diag }(\lambda _1,\lambda _2,\ldots ,\lambda _{r})\) and \(\lambda _i=\delta +d_i\), \(i=1,\ldots ,r\). From (11), we conclude that the spectrum of B consists of:

  • r eigenvalues \(\lambda _1,\lambda _2,\ldots ,\lambda _{r}\) with the eigenspace defined by \(P_{\parallel }\);

  • \((n-r)\) identical eigenvalues \(\delta \) with the eigenspace defined by \(P_{\perp }\).

Thus, B has at most \(r+1\) distinct eigenvalues, which can be computed at a cost of \(O(r^3)\) operations once W and the RRQR decomposition of V are available.
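The computation just described can be sketched as follows (a Python illustration under the simplifying assumption that V has full column rank, so that \(R_{\dagger }=R_{\ddagger }=R\); rank-deficient V requires the column selection discussed in Sect. 7.4):

```python
import numpy as np

def implicit_spectrum(V, W, delta):
    """Eigenvalues of B = delta*I + V @ W @ V.T with implicit eigenvectors."""
    R = np.linalg.cholesky(V.T @ V).T   # V^T V = R^T R, R upper triangular
    M = R @ W @ R.T                     # small symmetric matrix R W R^T
    d, U = np.linalg.eigh(M)            # R W R^T = U diag(d) U^T
    lam = delta + d                     # lambda_i = delta + d_i, cf. (11)
    # P_par = V @ inv(R) @ U (formula (12)) is never formed explicitly;
    # only R and U are kept for later matrix-vector products.
    return lam, R, U
```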

In our implementation, we do not explicitly construct the matrix Q, but only the triangular matrix R, which is obtained from the aforementioned Cholesky factorization of the \({\bar{m}} \times {\bar{m}}\) Gram matrix \(V^TV\) at a cost of \(O({\bar{m}}^3)\) operations [25]. The complexity can be decreased to \(O({\bar{m}}^2)\) if the Cholesky factorization is updated after each iteration by taking into account that the current matrix V differs from the previous one by at most two columns. We show in Sect. 7.3 how to update \(V^TV\) at \({\bar{m}}n + O({\bar{m}}^2)\) operations. Although this will be shown for the L-BFGS update, the same technique works for the other limited-memory quasi-Newton updates that admit the compact representation (8). Any matrix-vector multiplications involving Q are implemented using the representation (9). For similar purposes, we make use of the representation

$$\begin{aligned} P_{\parallel }=V_{\dagger } R_{\ddagger }^{-1} U. \end{aligned}$$
(12)

In contrast to the eigenvalues that are explicitly computed, the eigenvectors are not computed explicitly. Therefore, we can say that the eigenvalue decomposition of B (11) is defined implicitly. The matrices P, \(P_{\parallel }\) and \(P_{\perp }\) will be involved in presenting our approach, but they are not used in any of our algorithms.

As was mentioned above, an alternative way of computing the eigenvalue decomposition of B was presented in the unpublished work [31] for the limited-memory SR1 update. For this purpose, the author made use of the eigenvalue decomposition of the matrix \(V^T V = (Y - \delta S)^T(Y - \delta S)\), whereas the Cholesky factorization of \(V^T V\) is used in the present paper. In [31], the limited-memory SR1 was combined with a trust-region technique.

In the next section, we describe how to solve the trust-region subproblem (7) in the Euclidean norm by exploiting the implicit eigenvalue decomposition of B.

3 Trust-region subproblem in the Euclidean norm

It is assumed here that \({\varOmega }=\{s\in R^n:\Vert s\Vert _2\le {\varDelta }\}\). To simplify notation, \(\Vert \cdot \Vert \) denotes, in what follows, the Euclidean vector norm and the induced matrix norm.

The Moré–Sorensen approach [32] seeks an optimal pair \((s^*,\sigma ^*)\) that satisfies conditions (3). If \(B\succ 0\) and the quasi-Newton step \(s_N=-B^{-1}g\in {\varOmega }\), then \(s_N\) solves the trust-region subproblem. Otherwise, its solution is related to solving the equation

$$\begin{aligned} \phi (\sigma ) = 0, \end{aligned}$$
(13)

where \(\phi (\sigma )=1/{\varDelta }-1 / \Vert s\Vert \) and \(s=s(\sigma )\) is the solution to the linear system

$$\begin{aligned} (B+\sigma I)s=-g. \end{aligned}$$
(14)

In the standard Moré–Sorensen approach, the Cholesky factorization of \(B+\sigma I\) is typically used for solving (14). To avoid this computationally demanding factorization, we take advantage of the implicitly available eigenvalue decomposition (11) of B, which yields:

$$\begin{aligned} B +\sigma I= P\left( \begin{array}{cc} {\varLambda } + \sigma I_{ r} &{} \quad 0\\ 0 &{} \quad (\delta +\sigma )I_{n-r}\\ \end{array} \right) P^T. \end{aligned}$$

Consider a new n-dimensional variable defined by the orthogonal matrix P as

$$\begin{aligned} v = P^Ts {= \left( \begin{array}{c} v_{\parallel } \\ v_{\perp } \end{array} \right) } \in R^n, \end{aligned}$$
(15)

where \(v_{\parallel }=P_{\parallel }^Ts\in R^{ r}\) and \(v_{\perp }=P_{\perp }^Ts\in R^{n-r}\). Then Eq. (14) is reduced to

$$\begin{aligned} \left\{ \begin{array}{l} ({\varLambda } + \sigma I_{ r}) v_{\parallel } = -g_{\parallel }\\ (\delta +\sigma ) v_{\perp } = -g_{\perp } \end{array} \right. , \end{aligned}$$
(16)

where \(g_{\parallel } = P_{\parallel }^Tg \in R^{ r}\) and \(g_{\perp }=P_{\perp }^Tg \in R^{n-r}\). For the values of \(\sigma \) that make this system nonsingular, we denote its solution by \(v(\sigma )\). Since \(\Vert s\Vert = \Vert v\Vert \), the function \(\phi (\sigma )\) in (13) can now be defined as \(\phi (\sigma )=1/{\varDelta }-1 / \Vert v(\sigma )\Vert \).

Let \(\lambda _{\min }\) stand for the smallest eigenvalue of B. Let \(P_{\min }\) be formed by the columns of P that span the eigenspace corresponding to \(\lambda _{\min }\). We denote \(v^*=P^Ts^*\) and now seek a pair \((v^*,\sigma ^*)\) that solves (16). Conditions (3) also require that

$$\begin{aligned} \sigma ^*(\Vert v^*\Vert - {\varDelta }) = 0 \quad \hbox {and} \quad \sigma ^* \ge \max \left( 0, -\lambda _{\min }\right) . \end{aligned}$$
(17)

We shall show how to find a pair with the required properties separately for each of the following two cases.

Case I: \(\lambda _{\min } > 0\) or \(\Vert P_{\min }^Tg\Vert \ne 0\).

Here if \(\lambda _{\min } > 0\) and \(\Vert v(0)\Vert \le {\varDelta }\), we have \(v^*=v(0)\) and \(\sigma ^*=0\). Otherwise,

$$\begin{aligned} \sigma ^* > \max \left( 0, -\lambda _{\min }\right) . \end{aligned}$$
(18)

Then Eq. (13) is solved by Newton’s root-finding algorithm [32], where each iteration takes the form

$$\begin{aligned} \sigma \leftarrow \sigma - \frac{\phi (\sigma )}{\phi ^\prime (\sigma )} = \sigma - {\frac{(\Vert v(\sigma )\Vert -{\varDelta }) \cdot \Vert v(\sigma )\Vert ^2}{{\varDelta } \cdot v^T(\sigma )v^\prime (\sigma )}}. \end{aligned}$$
(19)

For this formula, Eq. (16) yields

$$\begin{aligned} \Vert v(\sigma )\Vert ^2 = g_{\parallel }^T ({\varLambda } + \sigma I_{ r})^{-2} g_{\parallel } + (\delta +\sigma )^{-2} \Vert g_{\perp }\Vert ^2 \end{aligned}$$
(20)

and

$$\begin{aligned} v^T(\sigma ) v^\prime (\sigma )= & {} - v^T(\sigma )\left( \begin{array}{cc} {\varLambda } + \sigma I_{ r} &{} 0\\ 0 &{}(\delta +\sigma )I_{n-r}\\ \end{array} \right) ^{-1} v(\sigma )\nonumber \\= & {} - g_{\parallel }^T ({\varLambda } + \sigma I_{ r})^{-3} g_{\parallel } - (\delta +\sigma )^{-3} \Vert g_{\perp }\Vert ^2. \end{aligned}$$
(21)

The iterates are easily safeguarded from below by making use of property (18), which guarantees that the diagonal matrices \({\varLambda } + \sigma I_{ r}\) and \((\delta +\sigma )I_{n-r}\) are nonsingular. In practice, just a couple of iterations (19) is often sufficient for solving (13) to an appropriate accuracy [35]. For the resulting approximate value \(\sigma ^*\), the two blocks that compose \(v^*\) are defined by the formulas

$$\begin{aligned} v_{\parallel }^*= & {} - ({\varLambda } + \sigma ^* I_r)^{-1} g_{\parallel }, \end{aligned}$$
(22)
$$\begin{aligned} v_{\perp }^*= & {} - (\delta +\sigma ^*)^{-1} g_{\perp }. \end{aligned}$$
(23)

Case II: \(\lambda _{\min } \le 0\) and \(\Vert P_{\min }^Tg\Vert =0\).

Here \(\lambda _{\min } < 0\) may lead to the so-called hard case [32]. Since it was assumed that \(\delta > 0\), we have \(\lambda _{\min } \ne \delta \). Let \({\bar{r}}\) be the algebraic multiplicity of \(\lambda _{\min }\). Suppose that the eigenvalues are sorted so that \(\lambda _{\min }=\lambda _{1}=\cdots =\lambda _{{\bar{r}}} < \lambda _i\) for all \(i>{\bar{r}}\). Denote \({\bar{v}}=(v_{{\bar{r}} + 1}, v_{{\bar{r}} + 2}, \dots , v_n)^T\). The process of finding an optimal pair \((\sigma ^*,v^*)\) is based on a simple analysis of the alternatives in (16), which require that, for all \(1\le i \le {\bar{r}}\), either \(\lambda _i + \sigma = 0\) or \((v_{\parallel })_i = 0\). It is associated with finding the unique solution of the following auxiliary trust-region subproblem:

$$\begin{aligned} \min _{{\bar{v}}\in {\bar{{\varOmega }}}} \sum _{i={\bar{r}}+1}^{r} \left( (g_{\Vert })_{i} (v_{\Vert })_{i} + \frac{\lambda _{i} - \lambda _{\min }}{2}(v_{\Vert })_{i}^2 \right) + g_{\perp }^Tv_{\perp }+\frac{\delta - \lambda _{\min }}{2}\Vert v_{\perp }\Vert ^2, \end{aligned}$$

where \({\bar{{\varOmega }}}=\{{\bar{v}}\in R^{n-{\bar{r}}}: \Vert {\bar{v}}\Vert \le {\varDelta }\}\). This subproblem corresponds to the already considered Case I because its objective function is strictly convex. Let \({\bar{\sigma }}^*\) and \({\bar{v}}^*=(v_{{\bar{r}} + 1}^*, v_{{\bar{r}} + 2}^*, \dots , v_n^*)^T\) be the optimal pair for the auxiliary subproblem. Denote

$$\begin{aligned} \mu = \left\{ \begin{array}{ll} 0, &{} \quad \hbox {if }\quad \lambda _{\min } = 0,\\ \sqrt{{\varDelta } ^2 - \Vert {\bar{v}}^*\Vert ^2}, &{} \quad \hbox {if }\quad \lambda _{\min } < 0. \end{array} \right. \end{aligned}$$

It can be easily verified that the pair

$$\begin{aligned} \sigma ^* = -\lambda _{\min } + {\bar{\sigma }}^*, \quad v^* = (\underbrace{\mu , 0, \ldots ,0,}_{{\bar{r}}} v_{{\bar{r}} + 1}^*, \dots , v_n^*)^T \end{aligned}$$

satisfies the optimality conditions (16) and (17). The vector \(v_{\perp }^*\) is defined by formula (23) but, as one can see below, it is not necessary to compute this vector. The same applies to \(v_{\perp }^*\) in Case I.

In each of the two cases, we first compute

$$\begin{aligned} g_{\parallel } = U^T R_{\ddagger }^{-T} V_{\dagger }^T g. \end{aligned}$$
(24)

It is then used for finding \(\Vert g_{\perp }\Vert \) from the relation

$$\begin{aligned} \Vert g_{\perp }\Vert ^2=\Vert g\Vert ^2-\Vert g_{\Vert }\Vert ^2, \end{aligned}$$
(25)

which follows from the orthogonality of P represented as

$$\begin{aligned} P_{\perp }P_{\perp }^T=I-P_{\parallel } P_{\parallel }^T. \end{aligned}$$
(26)

The described procedure of finding \(\sigma ^*\) and \(v_{\parallel }^*\) produces an exact or nearly-exact solution to the trust-region subproblem (7). This solution is computed using (15), (23) and (26) as

$$\begin{aligned} s^*= & {} {P_{\parallel } v_{\parallel }^* + P_{\perp } v_{\perp }^*} = {P_{\parallel } v_{\parallel }^*} -(\delta + \sigma ^*)^{-1} P_{\perp } P_{\perp }^Tg \nonumber \\= & {} -(\delta + \sigma ^*)^{-1}g + P_{\parallel } {\left( v_{\parallel }^* + (\delta + \sigma ^*)^{-1} g_{\parallel }\right) }. \end{aligned}$$
(27)

The presented eigenvalue-based approach to solving the trust-region subproblem has the following attractive feature. Once the eigenvalues of B are computed, in Case I, formula (24) requires \({\bar{m}}n+O({\bar{m}}^2)\) operations, and formulas (20) and (21) require \(O( {{\bar{m}}})\) operations per iteration to approximately solve (13). The computation of \(v_{\parallel }^*\) by formula (22) requires \(O({\bar{m}})\) operations. In Case II, the computation of \(\sigma ^*\) and \(v_{\parallel }^*\) has the same order of complexity. The computation of \(s^*\) by formula (27) requires a few additional matrix-vector multiplications with \(P_{\parallel }\) defined by (12). The associated cost is \({\bar{m}}n+O({\bar{m}}^2)\).
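For concreteness, the Case I computations can be sketched as follows (Python; full column rank of V is assumed, and both the safeguarding of the Newton iterates and the hard case of Case II are omitted):

```python
import numpy as np

def trsub_euclidean(g, V, R, U, lam, delta, Delta, max_newton=8, tol=1e-10):
    """Nearly-exact solution of (7) over the Euclidean ball, Case I only."""
    g_par = U.T @ np.linalg.solve(R.T, V.T @ g)     # formula (24)
    g_perp2 = max(g @ g - g_par @ g_par, 0.0)       # formula (25)

    def vnorm2(sigma):                              # formula (20)
        return np.sum(g_par**2 / (lam + sigma)**2) + g_perp2 / (delta + sigma)**2

    sigma = 0.0
    if not (lam.min() > 0 and vnorm2(0.0) <= Delta**2):
        sigma = max(0.0, -lam.min()) + 1e-12        # stay above the bound (18)
        for _ in range(max_newton):                 # Newton iteration (19)
            nv = np.sqrt(vnorm2(sigma))
            if abs(nv - Delta) <= tol * Delta:
                break
            vtvp = (-np.sum(g_par**2 / (lam + sigma)**3)
                    - g_perp2 / (delta + sigma)**3)  # formula (21)
            sigma -= (nv - Delta) * nv**2 / (Delta * vtvp)
    v_par = -g_par / (lam + sigma)                  # formula (22)
    # Assemble s* by (27), with P_par = V R^{-1} U from (12):
    w = v_par + g_par / (delta + sigma)
    return -g / (delta + sigma) + V @ np.linalg.solve(R, U @ w), sigma
```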

In the next section, we introduce an alternative eigenvalue-based approach to solving the trust-region subproblem.

4 Trust-region subproblem in eigenvalue-based norms

We consider here the trust-region subproblem (7) defined by the norms introduced below. All constituent parts of the compact representation (8) are assumed to be available.

4.1 Eigenvalue-based decomposition of the model function

Observe that the new variable defined by (15) allows us to decompose the objective function in (7) as

$$\begin{aligned} q_P(v)\equiv q(Pv)=q(P_{\Vert }v_{\Vert }+P_{\perp }v_{\perp })=q_{\Vert }(v_{\Vert })+q_{\perp }(v_{\perp }), \end{aligned}$$
(28)

where

$$\begin{aligned} q_{\Vert }(v_{\Vert })= & {} g_{\Vert }^Tv_{\Vert }+\frac{1}{2}v_{\Vert }^T{\varLambda } v_{\Vert } = \sum _{i=1}^{ r} \left( (g_{\Vert })_{i} (v_{\Vert })_{i} + \frac{\lambda _{i}}{2}(v_{\Vert })_{i}^2 \right) , \end{aligned}$$
(29)
$$\begin{aligned} q_{\perp }(v_{\perp })= & {} g_{\perp }^Tv_{\perp }+\frac{\delta }{2}\Vert v_{\perp }\Vert ^2 . \end{aligned}$$
(30)

It should be noted that when the trust-region is defined by the standard norms like \(l_2\) or \(l_{\infty }\), this decomposition does not give any advantage, in contrast to the case of the new norms proposed below.

4.2 New norms and related subproblem properties

In this subsection, we introduce two nonstandard norms to define the trust-region. The new norms enable us to decompose the original trust-region subproblem into a set of smaller subproblems, which can be easily solved. For one of the new norms, the solution can be written in closed form.

4.2.1 Shape changing norms

To exploit separability of the objective function, we introduce the following norms:

$$\begin{aligned} \Vert s\Vert _{P,\infty }\equiv & {} \max \left( \Vert P_{\parallel }^Ts\Vert _{\infty },\Vert P_{\perp }^Ts\Vert \right) , \end{aligned}$$
(31)
$$\begin{aligned} \Vert s\Vert _{P,2}\equiv & {} \max \left( \Vert P_{\parallel }^Ts\Vert ,\Vert P_{\perp }^Ts\Vert \right) . \end{aligned}$$
(32)

Recall that \(\Vert \cdot \Vert \) stands for the Euclidean norm. It can be easily verified that (31) and (32) satisfy the vector norm axioms. Since P changes from iteration to iteration, we refer to them as shape-changing norms. The following result establishes the equivalence between the new norms and the Euclidean norm, with equivalence factors that do not depend on P.

Lemma 1

For any vector \(x\in R^n\) and orthogonal matrix \(P=[P_{\parallel }\; P_{\perp }]\in R^{n\times n}\), where \(P_{\parallel }\in R^{n\times r}\) and \(P_{\perp }\in R^{n\times (n-r)}\), the following inequalities hold:

$$\begin{aligned} \frac{\Vert x\Vert }{\sqrt{r+1}} \ \le \ \Vert x\Vert _{P,\infty } \ \le \ \Vert x\Vert \end{aligned}$$
(33)

and

$$\begin{aligned} \frac{1}{\sqrt{2}}\Vert x\Vert \le \Vert x\Vert _{P,2} \le \Vert x\Vert . \end{aligned}$$
(34)

Here, the lower and upper bounds are attainable.

Proof

We start by justifying the lower bound in (34). The definition (32) gives \(\Vert P_{\parallel }^T x\Vert ^2 \le \Vert x\Vert _{P,2}^2\) and \(\Vert P_{\perp }^T x\Vert ^2 \le \Vert x\Vert _{P,2}^2\). Then we have

$$\begin{aligned} \Vert x\Vert ^2=\Vert P^Tx\Vert ^2=\Vert P_{\parallel }^T x\Vert ^2 + \Vert P_{\perp }^T x\Vert ^2\le 2\Vert x\Vert _{P,2}^2, \end{aligned}$$
(35)

which establishes the first of the bounds (34). Furthermore, the inequality above becomes an equality for every x that satisfies \(\Vert P_{\parallel }^Tx\Vert =\Vert P_{\perp }^Tx\Vert \), which shows that this bound is attainable.

Due to (35), the second inequality in (34) obviously holds. Notice that it holds with equality for any x that satisfies \(P_{\parallel }^T x = 0\).

Consider now the norm (31). Since \(\Vert P_{\parallel }^Tx\Vert _{\infty } \le \Vert P_{\parallel }^Tx\Vert \), we have \( \Vert x\Vert _{P,\infty } \le \Vert x\Vert _{P,2}\). Then the upper bound in (33) follows from (34). This bound is attainable for the same choice of x as above.

It remains to justify the lower bound in (33). Using the norm definition (31) and the relations between \(l_2\) and \(l_{\infty }\) norms, we get

$$\begin{aligned} \Vert x\Vert _{P,\infty }^2\ge & {} \Vert P_{\parallel }^Tx\Vert _{\infty }^2 \ge \frac{1}{r}\Vert P_{\parallel }^Tx\Vert ^2,\\ \Vert x\Vert _{P,\infty }^2\ge & {} \Vert P_{\perp }^Tx\Vert ^2. \end{aligned}$$

Due to (35), these inequalities imply

$$\begin{aligned} ({ r}+1)\Vert x\Vert _{P,\infty }^2 \ge \Vert P_{\parallel }^Tx\Vert ^2+\Vert P_{\perp }^Tx\Vert ^2=\Vert x\Vert ^2. \end{aligned}$$

This proves the first inequality in (33). It holds with equality for every x that satisfies \(\Vert P_{\parallel }^Tx\Vert _{\infty }=\Vert P_{\perp }^Tx\Vert \). This completes the proof of the lemma. \(\square \)

It should be emphasized that the bounds in (33) and (34) do not depend on n. Moreover, according to Lemma 1, the norm (32), in contrast to the \(l_{\infty }\) norm, does not differ too much from the \(l_2\) norm in the sense of their ratio. The same holds for the other shape-changing norm when r is sufficiently small. For \(r=10\), which is a typical value in our numerical experiments, the norm (31) is never smaller than approximately one third of the \(l_2\) norm.
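For illustration, both norms can be evaluated without ever forming P, using only the factors kept from Sect. 2 (a Python sketch under the same full-rank assumption on V):

```python
import numpy as np

def shape_changing_norms(x, V, R, U):
    """||x||_{P,inf} and ||x||_{P,2} of (31)-(32) without forming P."""
    x_par = U.T @ np.linalg.solve(R.T, V.T @ x)          # P_par^T x, cf. (24)
    x_perp = np.sqrt(max(x @ x - x_par @ x_par, 0.0))    # ||P_perp^T x|| via (26)
    return max(np.abs(x_par).max(), x_perp), max(np.linalg.norm(x_par), x_perp)
```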

4.2.2 Subproblem separability for the new norms

For the norm (31), the trust-region \({\varOmega }\) is defined by the inequalities

$$\begin{aligned} \Vert s\Vert _{P,\infty }\le {\varDelta } \iff \left\{ \begin{array}{l} |(v_{\Vert })_{i}|\le {\varDelta }, \quad i=1,\ldots ,r,\\ \Vert v_{\perp }\Vert \le {\varDelta } . \end{array}\right. \end{aligned}$$

By combining this with the separability of the model function (28), (29), (30), we get the following separability of the trust-region subproblem:

$$\begin{aligned} \min _{\Vert s\Vert _{P,\infty }\le {\varDelta }} q(s)&= \sum _{i=1}^{ r} \min _{|(v_{\parallel })_{i}|\le {\varDelta }} \left( (g_{\Vert })_{i} (v_{\Vert })_{i} + \frac{\lambda _{i}}{2}(v_{\Vert })_{i}^2 \right) \nonumber \\&\quad + \min _{\Vert v_{\perp }\Vert \le {\varDelta }} \left( g_{\perp }^Tv_{\perp }+\frac{\delta }{2}\Vert v_{\perp }\Vert ^2\right) . \end{aligned}$$
(36)

We can write the solution to each of these subproblems in closed form as

$$\begin{aligned} (v_{\parallel }^*)_i= & {} \left\{ \begin{array}{ll} -\frac{1}{\lambda _{i}}(g_{\Vert })_{i}, &{} \quad \text{ if } |(g_{\Vert })_{i}|\le \lambda _{i}{\varDelta },\; { \lambda _i>0;}\\ {\zeta }, &{} \quad \text{ if } (g_{\Vert })_i=0, \quad { \lambda _{i}\le 0}; \\ -\frac{{\varDelta }}{|(g_{\Vert })_{i}|}(g_{\parallel })_{i}, &{} \quad \text{ otherwise, } \end{array} \right. \quad i=1,\ldots ,{ r};\end{aligned}$$
(37)
$$\begin{aligned} v_{\perp }^*= & {} -{ t}g_{\perp }, \end{aligned}$$
(38)

where \(\zeta =\pm {\varDelta }\) for \(\lambda _i<0\), \(\zeta \in [-{\varDelta },{\varDelta }]\) for \(\lambda _i=0\) and

$$\begin{aligned} t = \left\{ \begin{array}{ll} \frac{1}{\delta }, &{} \quad \text{ if } \Vert g_{\perp }\Vert \le \delta {\varDelta },\\ \frac{{\varDelta }}{\Vert g_{\perp }\Vert }, &{} \quad \text{ otherwise }. \end{array} \right. \end{aligned}$$
(39)

In the original space, the corresponding optimal solution \(s^*\) is calculated as

$$\begin{aligned} s^*=Pv^* = P_{\parallel }v^*_{\Vert } + P_{\perp }v^*_{\perp }, \end{aligned}$$

where \(P_{\perp }v^*_{\perp }=-{ t} P_{\perp }P_{\perp }^Tg\). Recalling (26), we finally obtain

$$\begin{aligned} s^*= -{ t} g + P_{\parallel }(v^*_{\parallel } + t {g_{\parallel }}). \end{aligned}$$
(40)

Here the cost of computing \(v_{\parallel }^*\) by (37) is \(O({\bar{m}})\). The formulas for \(P_{\parallel }\) (12) and \(g_{\parallel }\) (24) suggest that the dominant cost in (40) is determined by two matrix-vector multiplications involving \(V_{\dagger }\). This requires \(2{{\bar{m}}}n\) operations. Hence, the overall cost of solving the trust-region subproblem defined by norm (31) is essentially the same as for the Euclidean norm (see Sect. 3). The advantage of the new norm (31) over the Euclidean norm is a decomposition of the trust-region subproblem that yields the closed-form solution (40) without invoking any iterative procedure.
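A sketch of this closed-form computation is given below (Python; full-rank V is assumed, and the sign ambiguity of \(\zeta \) in (37) is resolved by taking \(+{\varDelta }\)):

```python
import numpy as np

def trsub_p_inf(g, V, R, U, lam, delta, Delta):
    """Closed-form solution (37)-(40) of the subproblem in the norm (31)."""
    g_par = U.T @ np.linalg.solve(R.T, V.T @ g)     # formula (24)
    v_par = np.empty_like(g_par)
    for i, (gi, li) in enumerate(zip(g_par, lam)):  # formula (37)
        if li > 0 and abs(gi) <= li * Delta:
            v_par[i] = -gi / li
        elif gi == 0:
            v_par[i] = Delta        # zeta: here li <= 0; any |zeta| <= Delta works for li == 0
        else:
            v_par[i] = -Delta * np.sign(gi)
    g_perp = np.sqrt(max(g @ g - g_par @ g_par, 0.0))                # formula (25)
    t = 1.0 / delta if g_perp <= delta * Delta else Delta / g_perp   # formula (39)
    return -t * g + V @ np.linalg.solve(R, U @ (v_par + t * g_par))  # formula (40)
```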

Consider now the trust-region defined by the norm (32). In this case, the trust-region subproblem is decomposed into the two subproblems:

$$\begin{aligned} \min _{\Vert s\Vert _{P,2}\le {\varDelta }} q(s) \!=\!\min _{\Vert v_{\parallel }\Vert \le {\varDelta }} \left( g_{\Vert }^T v_{\Vert } + \frac{1}{2}v_{\Vert }^T{\varLambda } v_{\parallel } \right) + \min _{\Vert v_{\perp }\Vert \le {\varDelta }} \left( g_{\perp }^Tv_{\perp }+\frac{\delta }{2}\Vert v_{\perp }\Vert ^2\right) .\qquad \end{aligned}$$
(41)

Here, the first subproblem is a low-dimensional case of problem (7). It can be easily solved by any standard trust-region method [14], especially because \({\varLambda }\) is diagonal. In the case of the truncated CG method, it requires only a few simple operations with r-dimensional vectors per CG iteration. For the dogleg method, it is required to compute the quasi-Newton step \(-{\varLambda }^{-1}g_{\parallel }\) and the steepest descent step \(-\mu _{\parallel } g_{\parallel }\), where \(\mu _{\parallel }= g_{\parallel }^T g_{\parallel }/g_{\parallel }^T {\varLambda } g_{\parallel }\). These operations require \(O({{\bar{m}}})\) multiplications. Moreover, the procedure described in Sect. 3 can be easily adapted for the purpose of finding a nearly-exact solution \(v_{\parallel }^*\) to the first subproblem.

The second subproblem in (41) is the same as in (36) with the optimal solution \(v^*_{\perp }\) defined by formulas (38) and (39). Then one can show, as above, that the solution to (41) is of the form (40). The same formula is applied to finding an approximate solution to the trust-region subproblem (41) when \(v_{\parallel }^*\) represents an approximate solution to the first subproblem.

5 Algorithm

In Algorithm 1, we present a generic trust-region framework [14] in the form close to our implementation (see Sect. 9 for details). In this algorithm, the trust-region subproblem (2) is assumed to be defined by a vector norm \(\Vert \cdot \Vert _k\). This norm may differ from the Euclidean norm, and moreover, it may change from iteration to iteration, like the norms (31) and (32).

We say that the trust-region subproblem is solved with sufficient accuracy, if there exists a scalar \(0<c<1\) such that

$$\begin{aligned} q_k(s_k) \le -c\Vert g_k\Vert ^2 \min \left( \frac{1}{\Vert B_k\Vert }, \frac{{\varDelta }_k}{\Vert g_k\Vert _k} \right) , \quad \forall k\ge 0. \end{aligned}$$
(42)

In other words, the model decrease is at least a fixed fraction of that attained by the Cauchy point [14]. The sufficient accuracy property plays an important role in proving global convergence of inexact trust-region methods.

[Algorithm 1 (generic trust-region method) is presented as a figure in the original and is not reproduced here.]
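In its place, the following Python sketch conveys the structure of a generic trust-region method; the acceptance threshold and the radius-update constants are conventional choices [14], not necessarily those of our implementation:

```python
import numpy as np

def trust_region(f, grad, solve_subproblem, x0, Delta0=1.0, tol=1e-5,
                 eta=1e-4, max_iter=1000):
    """Generic trust-region loop in the spirit of Algorithm 1.

    solve_subproblem(g, Delta) returns a trial step s and the model value
    q(s) < 0; the quasi-Newton updating of the model is left to the caller
    (e.g. hidden in the closure of solve_subproblem).
    """
    x, Delta = np.asarray(x0, dtype=float).copy(), Delta0
    fx, g = f(x), grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        s, qs = solve_subproblem(g, Delta)
        f_trial = f(x + s)
        rho = (f_trial - fx) / qs             # actual over predicted reduction
        if rho > eta:                         # accept the trial point
            x, fx = x + s, f_trial
            g = grad(x)                       # gradient only at accepted points
            if rho > 0.75:
                Delta = max(Delta, 2.0 * np.linalg.norm(s))
        else:
            Delta *= 0.5                      # reject: shrink the trust-region
    return x
```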

6 Convergence analysis

In Algorithm 1, we assume that if the norm is defined by (31), then the exact solution is found as described in Sect. 4.2.2. In the case of the norm (32), we assume that the first subproblem in (41) is solved with sufficient accuracy and the second subproblem is solved exactly. This, according to the following result, guarantees that the whole trust-region subproblem is solved with sufficient accuracy.

Lemma 2

Let \(v=(v_{\parallel }, v_{\perp })^T\) be a solution to the trust-region subproblem (41), such that

$$\begin{aligned} q_{\parallel } (v_{\parallel }) \le -{c_0}\Vert g_{\parallel }\Vert \min \left( \frac{\Vert g_{\parallel }\Vert }{\Vert {\varLambda }\Vert },{\varDelta }\right) \end{aligned}$$
(43)

for some \(0<{c_0}<1\), and \(v_{\perp }\) is the exact solution to the second subproblem, defined by (38) and (39). Suppose that \(g\ne 0\). Then

$$\begin{aligned} q_P(v)\le -c\Vert g\Vert ^2\min \left( \frac{1}{\Vert B\Vert },\frac{{\varDelta }}{\Vert g\Vert _{P,2}}\right) , \end{aligned}$$
(44)

where \(c=\min ({c_0},\frac{1}{2})\).

Proof

Since \(v_{\perp }\) is the Cauchy point for the second subproblem, the following inequality holds (see, e.g., [35, Lemma 4.3]):

$$\begin{aligned} q_{\perp } (v_{\perp }) \le -\frac{1}{2}\Vert g_{\perp }\Vert \min \left( \frac{\Vert g_{\perp }\Vert }{|\delta |},{\varDelta }\right) . \end{aligned}$$
(45)

Since P is orthogonal, the eigenvalue decomposition of B (11) implies

$$\begin{aligned} \Vert B\Vert =\left\| \begin{array}{cc} {\varLambda } &{} 0 \\ 0 &{} \delta I_{{n-{ r}}} \end{array} \right\| = \max \left( \Vert {\varLambda }\Vert ,|\delta | \right) . \end{aligned}$$
(46)

By the norm definition (32), we have \(\Vert g\Vert _{P,2}=\max (\Vert g_{\parallel }\Vert , \Vert g_{\perp }\Vert )\). This formula, along with (43), (45) and (46), yields

$$\begin{aligned} q_{\parallel } (v_{\parallel })\le & {} -c\Vert g_{\parallel }\Vert \min \left( \frac{\Vert g_{\parallel }\Vert }{\Vert B\Vert },{\varDelta } \frac{\Vert g_{\parallel }\Vert }{\Vert g\Vert _{P,2}}\right) = -c\Vert g_{\parallel }\Vert ^2\min \left( \frac{1}{\Vert B\Vert },\frac{{\varDelta }}{\Vert g\Vert _{P,2}}\right) , \\ q_{\perp } (v_{\perp })\le & {} -c\Vert g_{\perp }\Vert \min \left( \frac{\Vert g_{\perp }\Vert }{\Vert B\Vert },{\varDelta } \frac{\Vert g_{\perp }\Vert }{\Vert g\Vert _{P,2}}\right) = -c\Vert g_{\perp }\Vert ^2\min \left( \frac{1}{\Vert B\Vert },\frac{{\varDelta }}{\Vert g\Vert _{P,2}}\right) . \end{aligned}$$

Combining these inequalities with (25), we finally obtain the inequality

$$\begin{aligned} q_{\parallel }(v_{\parallel })+ q_{\perp }(v_{\perp }) \le -c\Vert g\Vert ^2\min \left( \frac{1}{\Vert B\Vert },\frac{{\varDelta }}{\Vert g\Vert _{P,2}}\right) . \end{aligned}$$

Then the decomposition (28) of the model function implies (44). This completes the proof. \(\square \)

Corollary 1

If inequality (43) holds for all \(k\ge 0\), where \({c_0}\) does not depend on k, then the trust-region subproblem (41) is solved with sufficient accuracy.

Although the shape of the trust-region defined by the new norms changes from iteration to iteration, it turns out that Algorithm 1, where the trust-region subproblem is solved as proposed in Sect. 4.2.2, converges to a stationary point. This fact is justified by the following result.

Theorem 1

Let \(f:R^n\rightarrow R\) be twice-continuously differentiable and bounded from below on \(R^n\). Suppose that there exists a scalar \(c_1>0\) such that

$$\begin{aligned} \Vert \nabla ^2f(x)\Vert \le c_1 \end{aligned}$$

for all \(x\in R^n\). Consider the infinite sequence \(\{x_k\}\) generated by Algorithm 1, in which the norm \(\Vert \cdot \Vert _k\) is defined by any of the two formulas (31) or (32), and the stopping criterion is suppressed. Suppose also that there exists a scalar \(c_2>0\) such that

$$\begin{aligned} \Vert B_k\Vert \le c_2, \quad \forall k \ge 0. \end{aligned}$$
(47)

Then

$$\begin{aligned} \lim _{k\rightarrow \infty } \Vert \nabla f(x_k)\Vert = 0. \end{aligned}$$
(48)

Proof

By Lemma 1, the norms \(\Vert \cdot \Vert _k\) are equivalent to the Euclidean norm, with the coefficients in the lower and upper bounds independent of k. Moreover, Algorithm 1 explicitly requires that the trust-region subproblem be solved with sufficient accuracy. All this, together with the assumptions of the theorem, allows us to apply Theorem 6.4.6 in [14], which proves the convergence (48). \(\square \)

In the case of convex f(x), Theorem 1 holds for the L-BFGS updates due to the boundedness of \(B_k\) established, e.g., in [34].

Suppose now that f(x) is not necessarily convex. Consider the boundedness of \(B_k\) for the limited-memory versions of BFGS, SR1 and stable multipoint symmetric secant updates [4–6, 8]. Let \(K_k\) denote the sequence of the iteration indexes of those pairs \(\{s_i, y_i\}\) that are involved in generating \(B_k\) starting from an initial Hessian approximation \(B^0_k\). The number of such pairs is assumed to be limited by m, i.e.

$$\begin{aligned} |K_k|\le m. \end{aligned}$$
(49)

In L-BFGS, the positive definiteness of \(B_k\) can be enforced by composing \(K_k\) of only those indexes of the recently generated pairs \(\{s_i, y_i\}\) that satisfy the inequality

$$\begin{aligned} s^T_iy_i > c_3 \Vert s_i\Vert \Vert y_i\Vert \end{aligned}$$
(50)

for a positive constant \(c_3\) (see [35]). This requirement permits us to show in the following lemma that the boundedness of \(B_k\), and hence Theorem 1, hold in the nonconvex case.

Lemma 3

Suppose that the assumptions of Theorem 1 concerning f(x) are satisfied. Let all \(B_k\) be generated by the L-BFGS updating formula. Let the updating start at each iteration from \(B^0_k\) and involve the pairs \(\{s_i, y_i\}_{i \in K_k}\), whose number is limited in accordance with (49). Suppose that there exists a constant \(c_3 \in (0,1)\) such that, for all k, (50) is satisfied. Suppose also that there exists a constant \(c_4>0\) such that the inequality

$$\begin{aligned} \Vert B^0_k\Vert \le c_4, \quad \forall k\ge 0, \end{aligned}$$
(51)

holds with the additional assumption that \(B^0_k\) is positive semi-definite. Then there exists a constant \(c_2>0\) such that (47) holds.

Proof

For each \(i\in K_k\), the process of updating the current Hessian approximation B (initialized with \(B^0_k\)) by the L-BFGS formula can be presented as follows

$$\begin{aligned} B^{new} = B^{\frac{1}{2}} \left( I - \frac{B^{\frac{1}{2}}s_i s_i^TB^{\frac{1}{2}}}{\Vert B^{\frac{1}{2}}s_i\Vert ^2} \right) B^{\frac{1}{2}} + \frac{y_i y_i^T}{s_i^T y_i} . \end{aligned}$$

This equation, along with inequality (50), gives

$$\begin{aligned} \Vert B^{new}\Vert \le \Vert B\Vert + \frac{\Vert y_i\Vert }{c_3 \Vert s_i\Vert }, \end{aligned}$$

where, in accordance with the boundedness of \(\nabla ^2 f(x)\), we have \(\Vert y_i\Vert \le c_1 \Vert s_i\Vert \). After summing these inequalities over all \(i\in K_k\), we obtain the inequality

$$\begin{aligned} \Vert B_k\Vert \le \Vert B^0_k\Vert + mc_1 / c_3, \end{aligned}$$

which, due to (51), finally proves inequality (47) for \(c_2 = c_4 + mc_1 / c_3\). \(\square \)

The boundedness of \(B_k\) generated by SR1 and stable multipoint symmetric secant updates will be proved in a separate paper, which will be focused on the case of nonconvex f(x).

To guarantee the boundedness of \(B_k\) generated by the limited-memory version of SR1, we require that the pairs \(\{s_i, y_i\}_{i \in K_k}\) satisfy, for a positive constant \(c_3\), the inequality

$$\begin{aligned} |s^T_i(y_i - Bs_i)| > c_3 \Vert s_i\Vert \Vert y_i - Bs_i\Vert , \end{aligned}$$

where B is the intermediate matrix to be updated based on the pair \(\{s_i, y_i\}\) in the process of generating \(B_k\). This makes the SR1 updates well defined (see [13, 35]).

The stable multipoint symmetric secant updating process is organized so that uniform linear independence of the vectors \(\{s_i\}_{i \in K_k}\) is maintained. The Hessian approximations \(B_k\) are uniquely defined by the equations

$$\begin{aligned} s_i^T B_k s_j = s_i^T y_j, \quad p^T B_k s_l = p^T y_l, \quad p^T B_k p = p^T B_k^0 p, \end{aligned}$$
(52)

which hold for all \(i,j,l \in K_k\), \(i<j\), and also for all \(p \in R^n\), such that \(p^T s_t = 0\) for all \(t \in K_k\). The boundedness of the generated approximations \(B_k\) in the case of nonconvex f(x) follows from the mentioned uniform linear independence and Eq. (52).

7 Implementation details for L-BFGS

In this section, we consider the Hessian approximation B in (7) defined by the L-BFGS update [11]. It requires storing at most m pairs of vectors \(\{s_i,y_i\}\), obtained at the most recent iterations for which (50) holds. As mentioned above, the number of stored pairs is assumed, for simplicity, to be equal to m. The compact representation (8) of the L-BFGS update has the form

$$\begin{aligned} B = \delta I - [S\; Y] \left[ \begin{array}{cc} S^T S/\delta &{} \quad L/\delta \\ L^T/\delta &{} \quad -E\\ \end{array} \right] ^{-1} \left[ \begin{array}{c} S^T\\ Y^T\\ \end{array} \right] \;, \end{aligned}$$
(53)

in which case \({\bar{m}}=2m\). In terms of (8), the matrix \(V=[S\; Y]\) is composed of the stored pairs (5) so that the columns of \({S=[\ldots ,s_i,\ldots ]}\) and \({Y=[\ldots ,y_i,\ldots ]}\) are sorted in increasing order of the iteration index i. The sequence of these indexes may have gaps, corresponding to iterations at which (50) is violated. The matrix W is the inverse of a \(2m\times 2m\) matrix built from the strictly lower triangular part of \(S^T Y\), denoted in (53) by L, and the main diagonal of \(S^T Y\), denoted by E.
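The assembly of these blocks from the stored pairs can be sketched as follows (Python; the function name is ours):

```python
import numpy as np

def lbfgs_compact_blocks(S, Y, delta):
    """V and the middle 2m x 2m matrix of the compact representation (53).

    B = delta*I - V @ inv(middle) @ V.T, i.e. W = -inv(middle) in terms of (8).
    """
    SY = S.T @ Y
    L = np.tril(SY, k=-1)            # strictly lower triangular part of S^T Y
    E = np.diag(np.diag(SY))         # main diagonal of S^T Y
    middle = np.block([[S.T @ S / delta, L / delta],
                       [L.T / delta,     -E       ]])
    return np.hstack([S, Y]), middle
```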

At iteration k of L-BFGS, the Hessian approximation \(B_k\) is determined by the stored pairs \(\{s_i,y_i\}\) and the initial Hessian approximation \(\delta _k I\). The most popular choice of the parameter \(\delta _k\) is defined, like in [35], by the formula

$$\begin{aligned} \delta _k=\frac{y_{k}^Ty_{k}}{s_{k}^Ty_{k}}, \end{aligned}$$
(54)

which represents the most recent curvature information about the function.

7.1 Uniform representation of eigenvalue-based solutions

Recall that the approaches presented above rely on the implicitly defined RRQR decomposition of V and eigenvalue decomposition of B. In this section, we show that each of the eigenvalue-based solutions of the considered trust-region subproblems (7) can be presented as

$$\begin{aligned} s^*=-\alpha g {+} V_{\dagger } p, \end{aligned}$$
(55)

where \(\alpha \) is a scalar and

$$\begin{aligned} p = R_{\ddagger }^{-1} U ( {v_{\parallel }^* + g_{\parallel }}). \end{aligned}$$

The specific values of \(\alpha \) and \({v_{\parallel }^*}\) are determined by the norm defining the trust-region and the solution to the trust-region subproblem.

Let us first consider the trust-region subproblem defined by the Euclidean norm. Due to (12), we can rewrite formula (27) for a nearly-exact solution \(s^*\) in the form (55), where \(\alpha = (\delta +\sigma ^*)^{-1}\).

Consider now the trust-region subproblem (36) defined by the norm (31). Its solution can be represented in the form (55), where \(\alpha \) stands for t defined by (39). Note that since the Hessian approximations generated by the L-BFGS update are positive definite, the case of \(\lambda _{i} {\le } 0\) in (37) is excluded. Therefore, the optimal solution to the first subproblem in (36) is computed as \(v_{\parallel }^* = -A g_{\parallel }\), where \(A\in R^{r \times r}\) is a diagonal matrix defined as

$$\begin{aligned} A_{ii} = \left\{ \begin{array}{ll} \frac{1}{\lambda _i}, &{} \quad \text{ if } \quad |(g_{\Vert })_i| \le \lambda _i {\varDelta },\\ \frac{{\varDelta }}{|(g_{\Vert })_{i}|}, &{}\quad \text{ otherwise. } \end{array} \right. \end{aligned}$$

When the trust-region subproblem (41) is defined by the norm (32) and \(v_{\parallel }^*\) is an approximate solution to the first subproblem in (41), formula (55) holds for the same \(\alpha =t\).

In each of the three cases considered, the most expensive operations, at 4mn multiplications in total, are the two matrix-vector products \(V_{\dagger }^Tg\) and \(V_{\dagger } p\). The linear systems involving the triangular matrix R can be solved at a cost of \(O(m^2)\) operations.

7.2 Model function evaluation

In Algorithm 1, the model function value is used to decide whether to accept the trial step. Let \(s^*\) denote a nearly-exact or exact solution to the trust-region subproblem. In this subsection, we show how to reduce the evaluation of \(q(s^*)\) to cheap manipulations with the available low-dimensional matrix \(V^TV\) and vector \(V^Tg\). It is assumed that \(\Vert g\Vert ^2\) has also been calculated before the model function evaluation.

Consider, first, the trust-region subproblem defined by the Euclidean norm. Suppose that \(s^*\) is of the form (55) and satisfies (14) for \(\sigma ^*\ge 0\). Then

$$\begin{aligned} q(s^*)= & {} g^Ts^*-\frac{1}{2}(g+\sigma ^* s^*)^Ts^* = \frac{1}{2} \left( g^Ts^* {-} \sigma ^*\Vert s^*\Vert ^2\right) \nonumber \\= & {} \frac{1}{2}\left( {-{}\alpha } \Vert g\Vert ^2+ p^T(V_{\dagger }^Tg) {-} \sigma ^*\Vert s^*\Vert ^2\right) , \end{aligned}$$
(56)

where \(\Vert s^*\Vert ^2\) is calculated by the formula

$$\begin{aligned} \Vert s^*\Vert ^2=\alpha ^2\Vert g\Vert ^2 {-2}\alpha p^T(V_{\dagger }^Tg) +p^T(V_{\dagger }^T V_{\dagger })p. \end{aligned}$$

Thus, the most expensive operation in calculating \(q(s^*)\) is the multiplication of the matrix \(V_{\dagger }^T V_{\dagger }\) by the vector p at a cost of \({O(m^2)}\) operations. Note that this does not depend on whether the eigenvalue decomposition is used for computing \(s^*\).

Consider now the trust-region subproblem defined by any of our shape-changing norms. Let \(v_{\parallel }^*\) be the available solution to the first of the subproblems in (36) or (41), depending on which norm, (31) or (32), is used. The separability of the model function (28) and formulas (25), (38) give

$$\begin{aligned} q(s^*)=(g_{\parallel })^Tv^*_{\parallel }+\frac{1}{2}(v_{\parallel }^*)^T{\varLambda } v_{\parallel }^*+\left( { {t}^2\delta /2-{t} }\right) \left( \Vert g\Vert ^2-\Vert g_{\parallel }\Vert ^2\right) . \end{aligned}$$

One can see that only cheap operations with r-dimensional vectors are required for computing \(q(s^*)\).
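For the Euclidean case, a sketch of the evaluation (56) reads (Python; \(V_{\dagger }\) is identified with V for simplicity):

```python
import numpy as np

def model_value(alpha, p, sigma, g_norm2, Vtg, VtV):
    """q(s*) by (56) for s* = -alpha*g + V p satisfying (14); O(m^2) cost."""
    gTs = -alpha * g_norm2 + p @ Vtg                                  # g^T s*
    s_norm2 = alpha**2 * g_norm2 - 2.0 * alpha * (p @ Vtg) + p @ (VtV @ p)
    return 0.5 * (gTs - sigma * s_norm2)                              # formula (56)
```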

In the next subsection, we show how to exploit the uniform representation of the trust-region solution (55) for efficiently implementing the L-BFGS update once the trial step is accepted.

7.3 Updating Hessian approximation

The updating of the Hessian approximation B is based on updating the matrices \(S^TS\), L and E in (53), which, in turn, is based on updating the matrix \(V^T V\). Restoring the omitted subscript k, we note that the matrix \(V_{k+1}\) is obtained from \(V_k\) by adding the new pair \(\{s_k,y_k\}\), provided that (50) holds, and possibly removing the oldest one in order to store at most m pairs. Hence, the updating procedure for \(V_k^TV_k\) requires computing \(V_k^T s_k\) and \(V_k^Ty_k\) after computing \(s_k\). The straightforward implementation would require 4mn operations. It is shown below how to implement these matrix-vector products more efficiently.

Assuming that \(V^Tg\), \(V^TV\) and p have already been computed, we conclude from formula (55) that the major computational burden in

$$\begin{aligned} V_k^T s_k=V^T s=-\alpha V^Tg {+ } V^T V_{\dagger } p \end{aligned}$$
(57)

is associated with computing \((V^TV_{\dagger })\cdot p\) at a cost of \(4m^2\) multiplications. Recalling that \(y_k=g_{k+1}-g_k\), we observe that

$$\begin{aligned} V_k^T y_k = V_k^T g_{k+1} - V_k^T g_k \end{aligned}$$
(58)

is a difference between two 2m-dimensional vectors, of which \(V_k^T g_k(=V^T g)\) is available and \(V_k^Tg_{k+1}\) is calculated at a cost of 2mn operations. Then at the next iteration, the vector \(V_{k+1}^Tg_{k+1}\) can be obtained from \(V_k^Tg_{k+1}\) at a low cost, because these two vectors differ only in two components.

Thus, \(V_k^T s_k\) and \(V_k^Ty_k\) can be computed by formulas (57) and (58) at a cost in which 2mn is a dominating term. This cost is associated with computing \(V_k^Tg_{k+1}\) and allows for saving on the next iteration the same 2mn operations on computing \(V_{k+1}^Tg_{k+1}\).
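A sketch of these two products (Python; the cached quantities are passed in explicitly):

```python
import numpy as np

def step_products(alpha, p, Vtg, VtVdag, Vtg_next):
    """V_k^T s_k and V_k^T y_k by (57)-(58) from cached quantities.

    Vtg = V^T g_k and VtVdag = V^T V_dagger are already available; the only
    2mn-cost product is Vtg_next = V_k^T g_{k+1}, computed by the caller.
    """
    Vts = -alpha * Vtg + VtVdag @ p   # formula (57): O(m^2) operations
    Vty = Vtg_next - Vtg              # formula (58): a 2m-vector difference
    return Vts, Vty
```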

In the next subsection, we discuss how to make the implementation of our approaches more numerically stable.

7.4 Numerical stability

Firstly, in our numerical experiments we observed that the Gram matrix \(V^T V\) updated according to (57) and (58) was significantly more accurate if we used normalized vectors \(s/\Vert s\Vert \) and \(y/\Vert y\Vert \) instead of s and y, respectively. More importantly, the columns of R produced by the Cholesky factorization are, in this case, also of unit length. This is crucial for establishing rank deficiency of V in the way described below. It can be easily seen that the compact representation of B (53) takes the same form for V composed of the normalized vectors. To avoid 2n operations, the normalized vectors are actually never formed, but the matrix-vector multiplications involving V are preceded by multiplying the vector by a \({2m\times 2m}\) diagonal matrix whose diagonal elements are of the form \(1/\Vert s\Vert \), \(1/\Vert y\Vert \).

Secondly, during the first 2m iterations, the matrix V is rank-deficient, i.e., \({r=} \hbox {rank}(V) < 2m\). The same may hold at subsequent iterations. To detect linear dependence of the columns of V, we used the diagonal elements of R. In the case of normalized columns, \(R_{ii}\) equals \(\sin \psi _i\), where \(\psi _i\) is the angle between the i-th column of V and the linear subspace spanned by columns \(1,2,\ldots ,i-1\). We introduced a threshold parameter \(\nu \in (0,1)\). It remains fixed for all k, and only those parts of V and R, namely, \(V_{\dagger }\), \(R_{\dagger }\) and \(R_{\ddagger }\), that correspond to \(|R_{ii}|>\nu \) were used for computing \(s^*\).
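A sketch of this column selection (Python; the value of \(\nu \) is a placeholder, not the setting used in our experiments):

```python
import numpy as np

def select_columns(V, R, nu=1e-7):
    """V_dagger, R_dagger, R_ddagger from the diagonal of the Cholesky factor R."""
    keep = np.abs(np.diag(R)) > nu    # R[i,i] = sin(psi_i) for unit-length columns
    V_dag = V[:, keep]                # V_dagger: retained columns of V
    R_dag = R[keep, :]                # R_dagger: rows with large enough diagonal
    return V_dag, R_dag, R_dag[:, keep]   # the last block is R_ddagger
```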

7.5 Computational complexity

The cost of one iteration is estimated as follows. The Cholesky factorization of \(V^T V\in R^{{2}m \times {2}m} \) requires \(O(m^3)\) operations, or only \(O(m^2)\) if it is taken into account that the new V differs from the old one in a few columns. Computing \(R_{\dagger } W R_{\dagger }^T\in R^{2m \times 2m}\), where W is the inverse of a \({2m\times 2m}\) matrix, takes \(O(m^3)\) operations. The eigenvalue decomposition for \(R_{\dagger } W R_{\dagger }^T \) costs \(O(m^3)\). Note that \(O(m^3)=\left( \frac{m^2}{n}\right) O(mn)\) is only a small fraction of mn operations when \(m\ll n\). Since \(V^Tg\) is available from the updating of B at the previous iteration, the main cost in (55) for calculating \(s^*\) is associated with the matrix-vector product \(V_{\dagger } p\) at a cost of 2mn operations. The Gram matrix \(V^TV\) is updated by formulas (57) and (58) at a cost of \({2}mn+O(m^2)\) operations. Thus, the dominating term in the overall cost is 4mn, which is the same as for the line-search versions of the L-BFGS.

7.6 Computing the quasi-Newton step

In our numerical experiments, we observed that, for the majority of the test problems, the quasi-Newton step \(s_N=-B^{-1}g\) was accepted at more than 75 % of iterations. If the trust-region is defined by the Euclidean norm, we can easily check if this step belongs to the trust-region without calculating \(s_N\) or the eigenvalues of B. Indeed, consider the following compact representation of the inverse Hessian approximation [11]:

$$\begin{aligned} B^{-1} = \gamma I + [S \,Y] M \left[ \begin{array}{c} S^T\\ Y^T\\ \end{array} \right] . \end{aligned}$$
(59)

Here \(\gamma =\delta ^{-1}\) and the symmetric matrix \(M\in R^{2m\times 2m}\) is defined as

$$\begin{aligned} M=\left[ \begin{array}{ll} T^{-T}(E+\gamma Y^TY) T^{-1} &{} \quad -\gamma T^{-T}\\ -\gamma T^{-1} &{}\quad 0\\ \end{array} \right] , \end{aligned}$$

where the matrix \(T=S^TY-L\) is upper-triangular. Then, since \(Y^T Y\) and T are parts of \(V^TV\), the norm of the quasi-Newton step can be computed as

$$\begin{aligned} \Vert s_N\Vert ^2 = \gamma ^2 \Vert g\Vert ^2 + 2\gamma (V^Tg)^T M V^Tg+\Vert {VM}V^Tg\Vert ^2. \end{aligned}$$
(60)

The operations that involve the matrix M can be efficiently implemented as described in [11, 29]. Formula (60) requires only \(O(m^2)\) operations because \(V^Tg\) and \(\Vert g\Vert \) have already been computed. If \(\Vert s_N\Vert \le {\varDelta }\), the representation (59) of \(B^{-1}\) allows for directly computing \(s_N\) without any extra matrix factorizations. The dominating term in the cost of this operation is 4mn. The factorizations considered above are used only when the quasi-Newton step is rejected and it is then required to solve the trust-region subproblem.
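A sketch of this test (Python; M denotes the middle matrix of (59), and \(V^Tg\), \(V^TV\) and \(\Vert g\Vert ^2\) are assumed cached):

```python
import numpy as np

def quasi_newton_step_if_inside(g, V, M, gamma, Delta, g_norm2, Vtg, VtV):
    """Form s_N by (59) only if the O(m^2) norm test (60) admits it."""
    MVtg = M @ Vtg
    sN_norm2 = (gamma**2 * g_norm2 + 2.0 * gamma * (Vtg @ MVtg)
                + MVtg @ (VtV @ MVtg))                    # formula (60)
    if np.sqrt(sN_norm2) <= Delta:
        return -gamma * g - V @ MVtg   # s_N from (59); dominating cost 2mn here
    return None                        # otherwise solve the trust-region subproblem
```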

When \(s_N\) is used as a trial step, the saving techniques discussed in Sects. 7.2 and 7.3 can be applied as follows. The uniform representation (55) holds for \(s_N\) with \(V_{\dagger }\) replaced by V, and

$$\begin{aligned} \alpha =\gamma \quad {\hbox {and}} \quad p={-}M V^Tg. \end{aligned}$$

This allows for evaluating the model function in a cheap way by formula (56), where \(\sigma ^*=0\). Moreover, \(V^Ts_N\) can be computed by formula (57), which saves 2mn operations.

Note that if \(\Vert s_N\Vert \le {\varDelta }\) in the Euclidean norm, then, by Lemma 1, \(s_N\) belongs to the trust-region defined by any of our new norms. This permits us to use formula (60) for checking if \(s_N\) is guaranteed to belong to the trust-region in the corresponding norm.

8 Alternative limited-memory trust-region approaches, improved versions

In this section, we describe in more detail some of the approaches mentioned in Sect. 1 that combine limited-memory and trust-region techniques, namely, the algorithms proposed in [10, 29]. Both use the L-BFGS approximation and the Euclidean norm. Below we propose improved versions of these algorithms that do not require the eigenvalue decomposition. The purpose of developing the improved versions was to compare them with our eigenvalue-based algorithms; a comparative study is performed in Sect. 9.

8.1 Nearly-exact trust-region algorithm

A nearly-exact trust-region algorithm was proposed by Burke et al. [10]. It does not formally fall within the conventional trust-region scheme because, at each iteration, the full quasi-Newton step is always computed, at a cost of 4mn operations as in [11], and used as a trial step regardless of its length. If it is rejected, the authors exploit the Sherman–Morrison–Woodbury formula and obtain the following representation:

$$\begin{aligned} (B+\sigma I)^{-1} = (\delta +\sigma )^{-1} \left( I - V \left( (\delta +\sigma )W^{-1} + V^T V \right) ^{-1} V^T \right) . \end{aligned}$$

Furthermore, by exploiting the special structure of W for the BFGS update (53), a triangular factorization of the \(2m \times 2m \) matrix \((\delta +\sigma )W^{-1} + V^T V\) is computed using Cholesky factorizations of two \(m\times m\) matrices. This allows for efficiently implementing Newton's iterations for solving (13), which, in turn, requires solving in \(u,w\in R^{2m}\) the following system of linear equations:

$$\begin{aligned} \left\{ \begin{array}{l} \left( (\delta +\sigma )W^{-1} {+} V^TV\right) u= V^Tg\\ \left( (\delta +\sigma )W^{-1} {+} V^TV\right) w= W^{-1}u\\ \end{array} \right. . \end{aligned}$$
(61)

Then, a new trial step

$$\begin{aligned} s^*=- {(\delta +\sigma ^*)^{-1}}\left( {g} {-} V\left( (\delta +\sigma ^*)W^{-1} {+}V^TV\right) ^{-1}V^Tg \right) \end{aligned}$$
(62)

is computed at an additional cost of \(2mn+O(m^3)\) operations. The authors also proposed either to update \(S^TS\) and L after each successful iteration at a cost of 2mn operations or, alternatively, to compute these matrices, when necessary, at a cost of \(2m^2n\) operations. Thus, the worst-case complexity of one iteration is \(6mn+O(m^{2})\) or \(2m^2n+2mn+O(m^3)\).

We improve the outlined algorithm as follows. The full quasi-Newton step \(s_N\) is not computed first; instead, only \(\Vert s_N\Vert \) is computed by formula (60), at a cost of \(2mn+O(m^2)\) operations, which allows us to check whether \(s_N\in {\varOmega }\). Then only one matrix-vector multiplication, at an additional cost of 2mn operations, is required to compute the trial step, regardless of whether it is the full quasi-Newton step. The matrices \(S^TS\) and L are updated after each successful iteration at a cost of \(O(m^2)\) operations, as proposed in Sect. 7.3, because formula (62) can be presented in the form (55) with \(V_{\dagger }\) replaced by V, and

$$\begin{aligned} \alpha =(\delta +\sigma ^*)^{-1} \quad {\hbox {and}} \quad p= {(\delta +\sigma ^*)^{-1}}\left( (\delta +\sigma ^*)W^{-1} {+}V^TV\right) ^{-1}V^Tg. \end{aligned}$$

The cost of the model function evaluation is estimated in Sect. 7.2 as \(O(m^2)\). The proposed modifications reduce the cost of one iteration to \(4mn+O(m^3)\) operations.
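A minimal dense sketch of the trial step (62), assuming \(W^{-1}\) is available as a matrix; it does not reproduce the two-Cholesky factorization of [10], but it is enough to exhibit the \(2mn+O(m^3)\) cost structure. The function name is ours.

```python
import numpy as np

def bwx_trial_step(g, V, VTV, Winv, delta, sigma):
    """Trial step (62) for a given sigma: a generic dense sketch.

    Winv stands for W^{-1}. Forming and solving the 2m x 2m system costs
    O(m^3); in practice V^T g is already available (Sect. 7.5), so the only
    O(n) work is the product V @ p, at about 2mn operations.
    """
    VTg = V.T @ g
    A = (delta + sigma) * Winv + VTV
    p = np.linalg.solve(A, VTg)
    return -(g - V @ p) / (delta + sigma)
```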

8.2 Inexact trust-region algorithm

The approaches to solving the trust-region subproblem considered so far in this paper were either exact or nearly-exact. In this subsection, we consider the inexact algorithms proposed by Kaufman in [29], where the trust-region subproblem is approximately solved with the help of the double-dogleg approach [15]. At each iteration of these algorithms, the Hessian and its inverse are simultaneously approximated using the L-BFGS updates. Techniques similar to those presented in Sect. 7.3 are applied. The main feature of these algorithms is that the parameter \(\delta \) is fixed, either for a series of iterations followed by a restart, or for all iterations. Here a restart means removing all stored pairs \(\{s_i,y_i\}\). The reason for fixing \(\delta \) is the author's intention to avoid computational complexity above \(O(m^2)\) in manipulations with small matrices. As mentioned in [29], the performance of the algorithms is very sensitive to the choice of \(\delta \). In the line-search L-BFGS algorithms, the parameter \(\delta \) is adjusted after each iteration with the aim of estimating the size of the true Hessian along the most recent search direction. This explains why a good approximation of the Hessian matrix in [29] requires larger values of m than in the case of the line-search L-BFGS.

We propose here the following algorithm based on the double-dogleg approach, with \(\delta \) changing at each iteration. It combines our saving techniques with some of those used in [29]. The Gram matrix \(V^T V\) is updated as in Sect. 7.3. In accordance with the double-dogleg approach, an approximate solution to the trust-region subproblem is found by minimizing the model function along a piecewise-linear path that begins at \(s=0\), ends at the quasi-Newton step and has two knots. One of them is the Cauchy point

$$\begin{aligned} s_C= - \min \left( \frac{\Vert g\Vert ^2}{g^TBg},\frac{{\varDelta }}{\Vert g\Vert }\right) g \equiv -\mu g. \end{aligned}$$

The other knot is the point \(\tau s_N\) on the quasi-Newton direction, where \(\tau \in (0,1)\) is such that \(q(\tau s_N)<q(s_C)\) and \(\Vert \tau s_N\Vert > \Vert s_C\Vert \). Since q(s) and \(\Vert s\Vert \) are monotonically decreasing and increasing, respectively, along the double-dogleg path, the minimizer of q(s) on the feasible segment of the path is the endpoint of the segment. Thus,

$$\begin{aligned} s= \left\{ \begin{array}{ll} s_N, &{} \quad \text{ if } \quad \Vert s_N\Vert \le {\varDelta }, \\ ({\varDelta } / \Vert s_N\Vert )s_N, &{} \quad \text{ if } \quad \Vert \tau s_N\Vert \le {\varDelta } < \Vert s_N\Vert , \\ s_C+\theta (\tau s_N - s_C), &{} \quad \text{ otherwise },\\ \end{array} \right. \end{aligned}$$
(63)

where \(\theta \in [0,1)\) is such that \(\Vert s\Vert ={\varDelta }\). In our implementation, we used

$$\begin{aligned} \tau =0.2+0.8\Vert g\Vert ^4 / \left( (g^T B^{-1} g) (g^TBg)\right) {,} \end{aligned}$$

as suggested in [15].

At each iteration, we first compute the full quasi-Newton step using (59) and its norm by (60). If this step belongs to the trust-region, it is used as the trial step. Otherwise, the Cauchy point and \(\tau \) are computed, which requires O(m) operations for calculating

$$\begin{aligned} g^TB^{-1}g=\gamma \Vert g\Vert ^2+(V^Tg)^T(MV^Tg), \end{aligned}$$

where the 2m-dimensional vectors \(V^Tg\) and \(MV^Tg\) have already been computed for \(s_N\). An additional \(O(m^3)\) operations are required for calculating

$$\begin{aligned} g^TBg=\delta \Vert g\Vert ^2+(V^Tg)^TW(V^Tg), \end{aligned}$$

where W is the inverse of a \(2m\times 2m\) matrix. Note that in our implementation the multiplication of the matrix W by the vector \(V^Tg\) is done as in [11]. It cannot be implemented at a cost of \(O(m^2)\) operations as in [29], because \(\delta \) is updated by formula (54) after each successful iteration. Note that \(\theta \) in (63) can be computed at a cost of O(1) operations. To show this, denote \({{\hat{s}}}=\tau s_N-s_C\). Then

$$\begin{aligned} \theta =\frac{\beta }{\psi +\sqrt{\psi ^2+\Vert {{\hat{s}}}\Vert ^2\beta }}, \end{aligned}$$

where \(\beta ={\varDelta }^2-\Vert s_C\Vert ^2\) and \(\psi ={{\hat{s}}}^Ts_C\). Observing that

$$\begin{aligned} \Vert s_C\Vert&= \mu \Vert g\Vert =\min \left( \frac{\Vert g\Vert ^3}{g^TBg},{\varDelta }\right) ,\\ \psi&= \tau s_N^Ts_C-\Vert s_C\Vert ^2=\tau \mu g^TB^{-1}g-\Vert s_C\Vert ^2,\\ \Vert {{\hat{s}}}\Vert ^2&= \tau ^2\Vert s_N\Vert ^2-2\tau s_C^Ts_N+\Vert s_C\Vert ^2=\tau ^2\Vert s_N\Vert ^2-2\psi -\Vert s_C\Vert ^2, \end{aligned}$$

one can see that the computation of \(\theta \) involves just a few scalar operations.
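The following NumPy sketch assembles the double-dogleg step (63) from the scalars discussed above; the function name is ours. For clarity it forms the n-vectors directly, whereas the text shows how \(\Vert s_C\Vert \), \(\psi \) and \(\Vert {{\hat{s}}}\Vert ^2\) follow from scalars in O(1).

```python
import numpy as np

def double_dogleg_step(g, s_N, sN_norm, gBg, gBinvg, Delta):
    """Double-dogleg trial step (63); the scalars gBg = g^T B g and
    gBinvg = g^T B^{-1} g are assumed precomputed as described above."""
    if sN_norm <= Delta:
        return s_N                                       # full quasi-Newton step
    g_norm_sq = np.dot(g, g)
    tau = 0.2 + 0.8 * g_norm_sq**2 / (gBinvg * gBg)      # as suggested in [15]
    if tau * sN_norm <= Delta:
        return (Delta / sN_norm) * s_N                   # scaled quasi-Newton step
    mu = min(g_norm_sq / gBg, Delta / np.sqrt(g_norm_sq))
    s_C = -mu * g                                        # Cauchy point
    s_hat = tau * s_N - s_C
    beta = Delta**2 - np.dot(s_C, s_C)
    psi = np.dot(s_hat, s_C)
    theta = beta / (psi + np.sqrt(psi**2 + np.dot(s_hat, s_hat) * beta))
    return s_C + theta * s_hat                           # ||s|| = Delta
```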

To estimate the cost of computing the double-dogleg solution, consider separately two cases, depending on whether the trial step was accepted at the previous iteration. In the former case, the major computational burden in finding s by formula (63) is computing the quasi-Newton step \(s_N\), at a cost of 4mn operations. Otherwise, the new trial step requires only O(n) operations, because \(s_N\) is available and \(s_C\) is updated for the new \({\varDelta }\) at this cost.

According to (63), the double-dogleg solution is a linear combination of the gradient and the quasi-Newton step, i.e., \(s=\alpha _1 g+\alpha _2s_N\), where \(\alpha _1,\alpha _2\) are scalars. Then the model function at the trial point is computed by the formula

$$\begin{aligned} q(\alpha _1 g + \alpha _2 s_N)&= \alpha _1\Vert g\Vert ^2-\alpha _2g^T B^{-1}g+(\alpha _1 g + \alpha _2 s_N)^T B(\alpha _1 g + \alpha _2 s_N)/2\\&= \alpha _1\Vert g\Vert ^2 - \alpha _2g^T B^{-1}g + (\alpha _1^2g^TBg - 2\alpha _1\alpha _2\Vert g\Vert ^2 + \alpha _2^2g^TB^{-1}g)/2\\&= (\alpha _1-\alpha _1\alpha _2)\Vert g\Vert ^2-(\alpha _2-\alpha _2^2/2)g^TB^{-1}g + (\alpha _1^2/2)g^TBg \end{aligned}$$

at a cost of O(1) operations.
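In code, the final identity amounts to a few scalar operations (a sketch with our own names, reusing the scalars \(\Vert g\Vert ^2\), \(g^TBg\) and \(g^TB^{-1}g\)):

```python
def model_value(alpha1, alpha2, g_norm_sq, gBg, gBinvg):
    """q(alpha1*g + alpha2*s_N) via the identity above; O(1) given the scalars."""
    return ((alpha1 - alpha1 * alpha2) * g_norm_sq
            - (alpha2 - 0.5 * alpha2**2) * gBinvg
            + 0.5 * alpha1**2 * gBg)
```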

As shown in Sect. 7.6, the representation (55) holds for the quasi-Newton step \(s_N\). The same obviously applies to the second alternative in formula (63). In the case of the third alternative in (63), representation (55) holds with \(V_{\dagger }\) replaced by V, and

$$\begin{aligned} \alpha =(1-{\theta })\mu + \theta \tau \gamma \quad \hbox {and} \quad p={-}\theta \tau M V^Tg. \end{aligned}$$

This allows for applying the saving techniques presented in Sects. 7.2 and 7.3. Therefore, the worst-case complexity of one iteration of our inexact algorithm is \(4mn+O(m^3)\), the same as for the exact and nearly-exact algorithms proposed above. However, the actual computational burden hidden in the \(O(m^3)\) term, required for implementing the product \(W \cdot (V^Tg)\) in accordance with [11], is lower for our double-dogleg algorithm, because it comes from one Cholesky factorization of a smaller \(m\times m\) matrix. Moreover, if the trial step is rejected in our algorithm, the calculation of the next trial step requires, as mentioned earlier, only O(n) operations, whereas at least 2mn operations are required in the other algorithms proposed above.

9 Numerical tests

The limited-memory trust-region algorithms developed here were implemented in MATLAB R2011b. The numerical experiments were performed on a Linux machine (HP Compaq 8100 Elite, 4 GB RAM, quad-core Intel Core i5-650 processor at 3.20 GHz).

All our implementations of the trust-region approach were based on Algorithm 1, whose parameters were chosen as \(\delta _0=1\), \(\tau _1=0\), \(\tau _2=0.25\), \(\tau _3=0.75\), \(\eta _1=0.25\), \(\eta _2=0.5\), \(\eta _3=0.8\), \(\eta _4=2\). The very first trial step was obtained by a backtracking line search along the steepest descent direction, where the trial step-size was increased or decreased by a factor of two. For a fairer comparison with line-search algorithms, the number of accepted trial steps was counted as the number of iterations; in this case, each iteration requires at most two gradient evaluations (see below for details). To take into account the numerical errors in computing \(\rho _k\), we adopted the techniques discussed in [14, 28] by setting \(\rho _k=1\) whenever

$$\begin{aligned} |f(x_k+s_k)-f(x_k)|\le 10^{-11}\cdot |f(x_k)|. \end{aligned}$$
(64)

This precaution may result in a small deterioration of the objective function value, but it helps prevent stopping because of a too small trust-region. The most recent pair \(\{s_k,y_k\}\) was not stored if

$$\begin{aligned} s_k^Ty_k\le 10^{-8}\cdot \Vert s_k\Vert \cdot \Vert y_k\Vert . \end{aligned}$$
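A sketch of these two safeguards, with hypothetical names; the tolerances are those given above.

```python
import numpy as np

def safeguarded_ratio(f_old, f_new, rho):
    """Safeguard (64): if the relative change in f is at rounding-error
    level, treat the step as fully successful by setting rho = 1."""
    return 1.0 if abs(f_new - f_old) <= 1e-11 * abs(f_old) else rho

def pair_is_stored(s, y):
    """Curvature test: the pair {s, y} is stored only if s^T y is safely positive."""
    return s @ y > 1e-8 * np.linalg.norm(s) * np.linalg.norm(y)
```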

The stopping criterion in Algorithm 1 was

$$\begin{aligned} \Vert g_k\Vert \le 10^{-5}\cdot \max \left( 1,\Vert x_k\Vert \right) . \end{aligned}$$

We also terminated an algorithm and considered it to have failed when the trust-region radius fell below \(10^{-15}\) or the number of iterations exceeded 100,000.

Algorithms were evaluated on 62 large-scale (\(1000\le n \le 10{,}000\)) CUTEr test problems [27] with their default parameters. The version of CUTEr is dated January 8, 2013. The average run time of each algorithm on each problem was calculated on the basis of ten runs. The following problems were excluded from the mentioned set: PENALTY2, SBRYBND, SCOSINE, SCURLY10, SCURLY20, SCURLY30, because all tested algorithms failed; CHAINWOO, because the algorithms converged to local minima with different objective function values; FLETCBV2, because it satisfied the stopping criterion at the starting point; FMINSURF, because we failed to decode it.

One of the features of CUTEr is that it is computationally cheaper to evaluate the function and the gradient simultaneously in one call than in two separate calls. In Algorithm 1, the function is evaluated at all trial points, while the gradient is evaluated at accepted trial points only. We observed that, for the majority of the test problems, the quasi-Newton step was accepted at more than 75 % of the iterations. We therefore decided to evaluate f(x) and \(\nabla f(x)\) simultaneously in one call whenever the quasi-Newton step belongs to the trust-region, regardless of whether the corresponding trial point is subsequently accepted.

We used performance profiles [16] to compare algorithms. This is done as follows. For each problem p and solver s, denote

$$\begin{aligned} t_{p,s} = \text {the result gained in solving problem } p \text { by solver } s, \end{aligned}$$

which can be, e.g., the CPU time, the number of iterations, or the number of function evaluations. The performance ratio is defined as

$$\begin{aligned} \pi _{p,s} = \frac{t_{p,s}}{\min \limits _{l} t_{p,l}}. \end{aligned}$$

For each solver, we plot the distribution function of a performance metric

$$\begin{aligned} \rho _s(\tau ) = \frac{1}{n_p}\,\text {card}\,\{p: \pi _{p,s}\le \tau \}, \end{aligned}$$

where \(n_p\) is the total number of test problems. For a given \(\tau >1\), the function \(\rho _s(\tau )\) returns the proportion of problems that solver s could solve within a factor \(\tau \) of the best result.
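A compact sketch of this computation, assuming failures are recorded as infinite values of the metric; names are ours.

```python
import numpy as np

def performance_profiles(T, taus):
    """Performance profiles [16]. T[p, s] is the metric for problem p and
    solver s, with np.inf recorded for failures; returns rho[s, i] for each
    tau in taus."""
    with np.errstate(invalid="ignore"):            # all-failure rows give nan
        ratios = T / T.min(axis=1, keepdims=True)  # pi_{p,s}
    return np.array([(ratios <= t).mean(axis=0) for t in taus]).T
```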

We shall refer to the trust-region algorithm based on the shape-changing norm (31) as EIG\((\infty ,2)\). We used it as the reference algorithm for comparison with the other algorithms, because EIG\((\infty ,2)\) was one of the most successful. We studied its performance for the parameter values \(m=5\), 10 and 15, which means storing at most 5, 10 and 15 pairs \(\{s_k,y_k\}\), respectively. We performed numerical experiments for various values of the threshold parameter \(\nu \), introduced in Sect. 7.4 for establishing rank-deficiency of V. Since the best results were obtained for \(\nu =10^{-7}\), we used this value in our algorithms. In some test problems, we observed that computing \(V_k^T s_k\) according to (57) could lead to numerical errors in \(V_k^TV_k\). To easily identify such cases, we computed the relative error in \(s_{k-1}^Ts_k\), and if the error was larger than \(10^{-4}\), a restart was applied, meaning that only the latest pair \(\{s_k,y_k\}\) was stored. This test was implemented in all our eigenvalue-based trust-region algorithms. An alternative would be to recompute \(V_k^TV_k\), but this is computationally more expensive, and in our experiments, it did not sufficiently decrease the number of iterations.

Fig. 1 Performance profile for EIG\((\infty ,2)\) for \(m=5,10,15\). a Number of iterations. b CPU time

In Fig. 1a, one can see that the memory size \(m=15\) corresponds to the fewest iterations. This is typical behavior for limited-memory algorithms: a larger memory allows for using more complete information about the Hessian matrix, carried by the pairs \(\{s_k,y_k\}\), which tends to decrease the number of iterations. On the other hand, each iteration becomes computationally more expensive. Each test problem has its own best value of m, reflecting this tradeoff. Figure 1b shows that \(m=5\) was the fastest for most of the test problems. Similar behavior was demonstrated by the trust-region algorithms based on the Euclidean norm and the shape-changing norm (32). This motivates the use of \(m=5\) in our numerical experiments.

We implemented in MATLAB three versions of the line-search L-BFGS. Two of them use the Moré–Thuente line search [33], implemented in MATLAB by Dianne O'Leary [36], with the same line-search parameter values as in [30]. The difference is in computing the search direction, which is based either on the two-loop recursion [30, 34] or on the compact representation (59) of the inverse Hessian approximation [11]. These two line-search versions have the same theoretical properties; namely, they generate identical iterates and have the same computational complexity, 2mn. Nevertheless, owing to the efficient matrix operations in MATLAB, the former version was faster.

In the third version, the search direction is computed with the use of the compact representation. Here we adapted Algorithm 1 to make a fairer comparison with our trust-region algorithms under the same choice of parameter values. The trial step in this version is obtained by minimizing the same model function along the quasi-Newton direction bounded by the trust-region. As in our trust-region algorithms, it is accepted whenever (64) holds, which makes the line search non-monotone. This version of L-BFGS was superior to the other two line-search versions in every respect. Its success can be explained as follows. In comparison with the Wolfe conditions, it required fewer function and gradient evaluations to satisfy the acceptance conditions of Algorithm 1. Furthermore, the aforementioned possible non-monotonicity due to (64) made the third version more robust. We shall refer to it as L-BFGS. Only this version is involved in our comparative study of the implemented algorithms.

Fig. 2 Performance profile for EIG\((\infty ,2)\) and L-BFGS. a Number of iterations. b CPU time

As one can see in Fig. 2a, algorithm EIG\((\infty ,2)\) performed well in terms of the number of iterations. In contrast to L-BFGS, it was able to solve all the test problems, which is indicative of its robustness and better numerical stability. The performance profiles for the number of gradient evaluations were almost identical to those for the number of iterations. Note that, owing to automatic differentiation, the gradients are efficiently calculated in CUTEr. This means that if the calculation of gradients were much more time consuming, the performance profiles for the CPU time in Fig. 2b would more closely resemble those in Fig. 2a.

We observed that, for each CUTEr test problem, the step-size one was used for at least 60 % of the iterations of L-BFGS. To demonstrate the advantage of the trust-region framework over the line search, we selected all those test problems (10 in total) where the step-size one was rejected by L-BFGS in at least 30 % of the iterations. At each of these iterations, the line-search procedure computed function values more than once. The corresponding performance profiles are given in Fig. 3 (the only figure where the profiles are presented for the reduced set of problems). Algorithm L-BFGS failed on one of these problems, and it was clearly less effective than EIG\((\infty ,2)\) on the rest of them, both in terms of the number of iterations and the CPU time.

Fig. 3 Performance profile for EIG\((\infty ,2)\) and L-BFGS on problems with the step-size one rejected in at least 30 % of iterations. a Number of iterations. b CPU time

Our numerical experiments indicate that algorithm EIG\((\infty ,2)\) is at least competitive with L-BFGS. It is natural to expect that EIG\((\infty ,2)\) will dominate on problems originating from simulation-based applications and industry, where function and gradient evaluations are far more expensive than computing a trial step.

We also compared EIG\((\infty ,2)\) with some other eigenvalue-based limited-memory trust-region algorithms. In one of them, the trust-region is defined by the Euclidean norm; the other algorithm uses the shape-changing norm (32). We refer to them as EIG-MS and EIG-MS(2,2), respectively. In EIG-MS, the trust-region subproblem is solved by the Moré–Sorensen approach. We used the same approach in EIG-MS(2,2) for solving the first subproblem in (41), defined by the Euclidean norm in a lower-dimensional space. Notice that since BFGS updates generate positive definite Hessian approximations, the hard case is impossible. In all our experiments, the tolerance for solving (13) was defined by the inequality

$$\begin{aligned} \Big |\Vert s\Vert -{\varDelta }\Big |\le {\varDelta }\cdot 10^{-1}, \end{aligned}$$

which almost always required one to three Newton iterations (19). We also observed that higher accuracy increased the total computational time without any noticeable improvement in the number of iterations.

Fig. 4 Performance profile for EIG\((\infty ,2)\), EIG-MS and EIG-MS(2,2). a Number of iterations. b CPU time

Figure 4 shows that EIG\((\infty ,2)\) and EIG-MS(2,2) were able to solve all the test problems, whereas EIG-MS failed on one of them. Algorithm EIG\((\infty ,2)\) often required the same number of iterations as the other two algorithms, or even fewer. The behavior of EIG-MS(2,2) was very similar to that of EIG-MS, which can be explained as follows.

In our numerical experiments with L-BFGS updates, we observed that

$$\begin{aligned} \Vert g_{\perp }\Vert \ll \Vert g_{\parallel }\Vert \approx \Vert g\Vert . \end{aligned}$$
(65)

Our intuition about this property is presented in the next paragraph. For s that solves the trust-region subproblem, (65) results in \( \Vert P_{\perp }^Ts\Vert \ll \Vert P_{\parallel }^Ts\Vert \), i.e., the component of s that belongs to the subspace defined by \(P_{\perp }\) often nearly vanishes, and therefore, the shape-changing norm (32) of s is approximately the same as its Euclidean norm. For the double-dogleg approach, this is expected to result in approximately the same number of iterations for the two norms, because the Cauchy vectors are approximately the same for these norms. However, it is unclear how to make the computational cost of one iteration for the norm (32) lower than for the Euclidean norm in \(R^n\). This is the reason why our combination of the double-dogleg approach with the norm (32) was not successful.

Fig. 5 Performance profile for EIG\((\infty ,2)\) and BWX-MS. a Number of iterations. b CPU time

One possible explanation of why (65) is typical for L-BFGS originates from its relationship with CG. Recall that, in the case of quadratic f(x), the first m iterates generated by L-BFGS with exact line search are identical to those generated by CG. Furthermore, CG has the property that \(g_k\) belongs to the subspace spanned by the columns of \(S_k\) and \(Y_k\), i.e., \(g_{\perp }=0\). Our numerical experiments show that L-BFGS inherits this property in an approximate form when f(x) is not quadratic.

We also implemented our own version of the limited-memory trust-region algorithm by Burke et al. [10]. This version was presented in Sect. 8.1 and will be referred to as BWX-MS. It performs much better than the original version. We compare it with EIG\((\infty ,2)\). Note that BWX-MS requires two Cholesky factorizations of \(m\times m\) matrices for solving (61) at each Newton iteration (19) (see [10]). Algorithm EIG\((\infty ,2)\) requires one Cholesky factorization of a \(2m\times 2m\) matrix and one eigenvalue decomposition of a matrix of the same size; but, in contrast to BWX-MS, this needs to be done only once, when \(x_{k+1}\ne x_k\), and is not repeated for a decreased trust-region radius when a trial point is rejected. This explains why the performance profiles in Fig. 5 are clearly better for our eigenvalue-based approach than for our improved version of the algorithm proposed in [10]. In our numerical experiments, the advantage in performance became more significant when a higher accuracy of solving the trust-region subproblem was set. Algorithm EIG\((\infty ,2)\) is particularly more efficient than BWX-MS on problems where the trial step is often rejected.

We should mention here an alternative approach, developed by Erway and Marcia [18–20], for solving the trust-region subproblem for the L-BFGS updates. The available version of their implementation [20], called MSS, was far less efficient in our numerical experiments than EIG\((\infty ,2)\).

We implemented the original version of the double-dogleg algorithm proposed in [29] and outlined in Sect. 8.2. This inexact limited-memory trust-region algorithm failed on 12 of the CUTEr problems and was far less efficient than EIG\((\infty ,2)\) on the rest of the problems.

We also implemented our own version of the double-dogleg approach presented in Sect. 8.2. We refer to it as D-DOGL. The performance profiles for D-DOGL and EIG\((\infty ,2)\) are presented in Fig. 6. They show that D-DOGL generally required more iterations than EIG\((\infty ,2)\), because the trust-region subproblem was solved to a low accuracy. However, it was often faster, unless it took significantly more iterations to converge. This algorithm does not require an eigenvalue decomposition of B, and when a trial step is rejected, it computes a new one at a cost of only O(n) operations. We should note that, as observed earlier, e.g., in [17], the CUTEr collection of large-scale test problems is better suited for applying inexact trust-region algorithms like D-DOGL. Such algorithms are, however, not well suited for problems where higher accuracy in solving the trust-region subproblem is required for better total computational time and robustness.

Fig. 6 Performance profile for EIG\((\infty ,2)\) and D-DOGL. a Number of iterations. b CPU time

An alternative approach to approximately solving the trust-region subproblem is the truncated conjugate gradient method [39]. We applied it to solving the first subproblem in (41). Since this CG-based algorithm produces approximate solutions, the number of outer iterations was, in general, larger than for the algorithms producing exact or nearly-exact solutions to the trust-region subproblem. The cost of one iteration was not low enough to compete in computational time with the fastest implementations considered in this section.

10 Conclusions

We have developed efficient combinations of limited-memory and trust-region techniques. The numerical experiments indicate that our limited-memory trust-region algorithms are competitive with the line-search versions of the L-BFGS method. Our eigenvalue-based approach, originally presented in [9] and further developed in the earlier version of this paper [7], has already been successfully used in [2, 21, 22].

Our future aim is to extend these approaches to limited-memory SR1 and multipoint symmetric secant approximations. In the case of an indefinite matrix, we intend to exploit the useful information about negative-curvature directions, along which the objective function is expected to decrease most rapidly.

Furthermore, the computationally efficient techniques proposed here, including the implicit eigenvalue decomposition, could be considered for improving the performance of limited-memory algorithms used, e.g., for solving constrained and bound-constrained optimization problems.