1 Introduction

Quasi-Newton and truncated-Newton methods have been widely used since their origin. Arguments for their use are that, thanks to curvature information, they converge faster than the classical gradient method, that they are simple to implement, and that, in limited-memory or truncated-Newton form, they are suitable for large-scale optimization. These features have made the methods a textbook must-have [1–3]. In this paper, we will analyze some of the more common quasi-Newton methods, specifically quasi-Newton methods of the Broyden [4] and Huang [5] families, which include the well-known variants Broyden–Fletcher–Goldfarb–Shanno (BFGS) [6–9], symmetric rank 1 (SR1) [4, 10–13] and Davidon–Fletcher–Powell (DFP) [14, 15] (see also [16]). We will also consider the limited-memory BFGS (L-BFGS) method [17, 18]. Local convergence of these algorithms is well studied; see [2, 3] for an overview.

Many connections exist between quasi-Newton methods and first-order methods. The DFP method with exact line search generates the same iterates as the conjugate gradient method for quadratic functions [19] (see also [20, p. 57] and [21, pp. 200–222]). A more general result is that the nonlinear Fletcher–Reeves conjugate gradient method is identical to the quasi-Newton Broyden method (with exact line search and initialized with the identity matrix) when applied to quadratic functions [2, The. 3.4.2]. These connections only apply to quadratic functions, but stronger properties and connections are also known. A memoryless BFGS method with exact line search is equivalent to the nonlinear Polak–Ribière conjugate gradient method with exact line search [22] (note that the Hestenes–Stiefel and Polak–Ribière nonlinear conjugate gradient methods are equivalent when utilizing exact line search [3, §7.2]). The DFP method with exact line search and initialized with the identity matrix is a nonlinear conjugate gradient method with exact line search [21, pp. 216–222]. Further, all quasi-Newton methods of the Broyden family are equivalent when equipped with exact line search [23]. Consequently, all the above statements extend to all methods of the Broyden family when exact line search is utilized (see also [2, p. 64]).

Even though it seems clear that quasi-Newton methods and first-order methods are similar, it is not uncommon to encounter thoughts on the subject along the lines of: “Two of the most popular methods for smooth large-scale optimization, the inexact- (or truncated-) Newton method and limited-memory BFGS method, are typically implemented so that the rate of convergence is only linear. They are good examples of algorithms that fall between first- and second-order methods” [24]. This statement raises two questions: (i) Why is a limited-memory BFGS method not considered a first-order method? (ii) Can we give a more informative classification of truncated-Newton methods?

We can already partly answer the first question, since it is addressed in the convex programming complexity theory of Nemirovsky and Yudin [25]. Since the limited-memory BFGS method utilizes information from a first-order oracle, it is a first-order method, which implies certain worst-case convergence rates. The first-order definition and related worst-case convergence rates can be simplified for instructional purposes, as done by Nesterov [26]. However, the inclusion of quasi-Newton methods in Nesterov's simplified and more accessible analysis of first-order methods has not been addressed.

The second question, on the classification of truncated-Newton methods, is open for discussion. The truncated-Newton method implicitly utilizes second-order information. Customary classification would then label the truncated-Newton method a second-order method, since it requires a second-order oracle. However, the standard Newton method is also a second-order method. The worst-case convergence rates of the Newton and truncated-Newton methods are not the same, and grouping these two substantially different methods into the same class is unsatisfactory.

Contributions This paper elaborates on previous results and offers a more straightforward and accessible proof that a range of quasi-Newton methods, including the limited-memory BFGS method, are first-order methods in the sense of Nesterov's definition [26]. Further, by defining a so-called generalized first-order method, we extend the analysis to include truncated-Newton methods as well. For the sake of completeness, we also consider complexity analysis of a class of smooth and strongly convex problems in finite dimensions. In a worst-case scenario, quasi-Newton and truncated-Newton methods applied to this class of problems converge at best linearly for all iterations up to half the problem dimension, \(k\le {\textstyle \frac{1}{2}}n\). Hence, problems exist for which local superlinear or faster convergence of quasi-Newton and truncated-Newton methods cannot take effect unless \(k > {\textstyle \frac{1}{2}}n\).

The rest of the paper is organized as follows. Section 2 states the definition of first-order methods used in this paper. Section 3 describes quasi-Newton methods and shows that a range of quasi-Newton methods are first-order methods in the sense of Nesterov's definition. Section 4 describes a generalized first-order method and shows that the truncated-Newton method belongs to this class and that its worst-case convergence rate is the same as that of first-order methods. Section 5 contains the conclusions.

2 First-Order Methods

We denote by \({\mathbb {N}} := \{0,1,2,\ldots \}\) the natural numbers including 0. A vector is denoted \(x = [x^1,\ldots ,x^n]^T \in {\mathbb {R}}^n\). The vector \(e_i \in {\mathbb {R}}^n\) is the ith standard basis vector for \({\mathbb {R}}^n\). We define the span of a set \(X \subseteq {\mathbb {R}}^n\) as \(\mathrm {span}\, X := \{ \sum _{i=1}^{|X|} c_i x_i \, : \, x_1,\ldots ,x_{|X|} \in X; \; c_1,\ldots ,c_{|X|} \in {\mathbb {R}}\}\). Note that this means that \(\mathrm {span}\, \{x_1,\ldots , x_m\} = \{ \sum _{i=1}^m c_i x_i \, : \, c_1,\ldots ,c_m \in {\mathbb {R}}\}\). Let X and Y be two sets with \(X \subseteq {\mathbb {R}}^n\) and \(Y \subseteq {\mathbb {R}}^n\); then the (Minkowski) sum of the sets is

$$\begin{aligned} X + Y := \{ x+y \, : \, x \in X, \, y\in Y \}. \end{aligned}$$
(1)

We consider the convex, unconstrained optimization problem

$$\begin{aligned} \displaystyle \mathop {\text{ minimize }}\limits _{x\in {\mathbb {R}}^n} \;\; f(x) \end{aligned}$$
(2)

with optimal objective \(f(x^\star ) = f^\star \). We assume that f is convex and continuously differentiable, and that its gradient is Lipschitz continuous with constant L:

$$\begin{aligned} f(x) \le f(y) + \nabla f(y)^T(x-y) + {\textstyle \frac{1}{2}}L \Vert x-y\Vert _2^2 , \qquad \forall \, x,y \in {\mathbb {R}}^n . \end{aligned}$$
(3)

From time to time, we will strengthen our assumption on f and assume that f is also strongly convex with strong convexity parameter \(\mu > 0\), such that

$$\begin{aligned} f(x) \ge f(y) + \nabla f(y)^T (x-y) + {\textstyle \frac{1}{2}}\mu \Vert x-y\Vert _2^2 , \qquad \forall \, x,y \in {\mathbb {R}}^n . \end{aligned}$$
(4)

For twice differentiable functions, the requirements on \(\mu \) and L are equivalent to the matrix inequality \(\mu I \preceq \nabla ^2 f(x) \preceq L I\) for all \(x \in {\mathbb {R}}^n\). The condition number of the function is given by \(Q = \frac{L}{\mu }\). Following Nesterov [26], we will define a first-order (black-box) method for the problem (2) as follows.
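For instance (an illustrative example, not used in the analysis below), for a quadratic \(f(x) = {\textstyle \frac{1}{2}} x^T A x - b^T x\) with symmetric positive definite A, one may take \(\mu = \lambda _{\min }(A)\) and \(L = \lambda _{\max }(A)\), so that \(Q = \lambda _{\max }(A)/\lambda _{\min }(A)\) coincides with the usual condition number of the matrix A.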

Definition 2.1

A first-order method for differentiable objectives is any iterative method that, from an initial point \(x_0\), generates \((x_i)_{i=1,\ldots ,k+1}\) such that

$$\begin{aligned} x_{k+1}&\in x_0 + F_{k+1}, \text {where}\\ F_{k+1}&= F_k + \mathrm {span}\left\{ \nabla f(x_k) \right\} \end{aligned}$$

with \(F_0 = \emptyset \) and \(k \in {\mathbb {N}}\).
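As a simple illustration (a minimal numerical sketch of ours, not part of the formal development), the classical gradient method \(x_{k+1} = x_k - t_k \nabla f(x_k)\) satisfies Definition 2.1, since each iterate lies in \(x_0\) plus the span of the gradients observed so far. The following snippet checks this numerically for a convex quadratic; the problem data, step size, and iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
A = rng.standard_normal((n, n))
A = A.T @ A + np.eye(n)               # symmetric positive definite Hessian
b = rng.standard_normal(n)
grad = lambda x: A @ x - b            # gradient of f(x) = 0.5 x^T A x - b^T x
L = np.linalg.eigvalsh(A).max()       # Lipschitz constant of the gradient

x0 = rng.standard_normal(n)
x, gradients = x0.copy(), []
for k in range(5):
    g = grad(x)
    gradients.append(g)
    x = x - (1.0 / L) * g             # gradient step with t_k = 1/L

# x_{k+1} - x_0 should lie in span{grad f(x_0), ..., grad f(x_k)}
G = np.stack(gradients, axis=1)
coeffs, *_ = np.linalg.lstsq(G, x - x0, rcond=None)
print(np.linalg.norm(G @ coeffs - (x - x0)))   # ~0 up to round-off
```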

Strong results exist for such first-order methods [26]. These results date back to [25], but methods for achieving these bounds were first given in [27]. In Theorem 2.1, we reproduce an important result related to the first-order methods in Definition 2.1 as provided in [26].

Theorem 2.1

[26, Th 2.1.4 and Th 2.1.13] For any \(k \in {\mathbb {N}}\), \(1 \le k \le {\textstyle \frac{1}{2}}(n-1)\) and \(x_0\), there exists a (quadratic) function \(f: {\mathbb {R}}^n \mapsto {\mathbb {R}}\), which has a Lipschitz continuous gradient with constant L, such that any first-order method satisfies

$$\begin{aligned} \frac{f(x_k)-f^\star }{\Vert x_0 -x^\star \Vert _2^2} \ge \frac{3L}{32(k+1)^2}. \end{aligned}$$

There exists a strongly convex function \(f: {\mathbb {R}}^\infty \mapsto {\mathbb {R}}\) with a Lipschitz continuous gradient and condition number Q, such that any first-order method satisfies

$$\begin{aligned} \frac{\Vert x_k -x^\star \Vert _2^2}{\Vert x_0-x^\star \Vert _2^2} \ge \left( \frac{\sqrt{Q}-1}{\sqrt{Q}+1} \right) ^{2k}. \end{aligned}$$

We extend the above result for smooth and strongly convex problems from infinite-dimensional to finite-dimensional problems.

Theorem 2.2

For any \(k \in {\mathbb {N}}\), \(1\le k\le {\textstyle \frac{1}{2}}n\) and \(x_0\in {\mathbb {R}}^n\), there exists a strongly convex function \(f: {\mathbb {R}}^n \mapsto {\mathbb {R}}\) with a Lipschitz continuous gradient and condition number \(Q\ge 8\), such that any first-order method satisfies

$$\begin{aligned} \frac{\Vert x_k -x^\star \Vert _2^2}{\Vert x_0-x^\star \Vert _2^2} \ge \frac{1}{2} \left( \frac{\sqrt{Q}/\beta - 1}{\sqrt{Q}/\beta +1} \right) ^{2k} \end{aligned}$$

with constant \(\beta = 1.1\).

Proof

See “Appendix 1”. \(\square \)

This means that for certain problems, the convergence rate of any first-order method cannot be faster than linear. Since some methods achieve these bounds (up to a constant), the bounds are known to be tight [26]. Several other variants have been published [28–32]; see also the overview [33].
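As a concrete illustration of Theorem 2.2 (with the value \(Q = 100\) chosen purely for illustration), we have \(\sqrt{Q}/\beta \approx 9.09\), so the bound reads \(\Vert x_k -x^\star \Vert _2^2 / \Vert x_0-x^\star \Vert _2^2 \ge {\textstyle \frac{1}{2}} (8.09/10.09)^{2k} \approx {\textstyle \frac{1}{2}}\, 0.64^k\): in the worst case, no first-order method can contract the squared distance to the solution by more than a factor of roughly 0.64 per iteration on such a problem, as long as \(k \le {\textstyle \frac{1}{2}}n\).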

3 Quasi-Newton Methods

For the (line search) quasi-Newton methods, we select the iterates as follows

$$\begin{aligned} x_{k+1} = x_{k} - t_{k} H_{k}\nabla f(x_{k}), \qquad k = 0,1,\ldots \end{aligned}$$
(5)

where \(x_0\) and \(H_0\) are provided initializations. Several methods exist for selecting the step sizes \(t_k\) and the approximations \(H_k\) of the inverse Hessian. Often, \(H_{k+1}\) is built as a function of \(H_k\) and the differences

$$\begin{aligned} y_{k} = \nabla f(x_{k+1}) - \nabla f(x_{k}), \quad s_{k} = x_{k+1}-x_{k} . \end{aligned}$$
(6)
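For concreteness, a minimal sketch of the generic line search quasi-Newton iteration (5)–(6) is given below. The backtracking parameters and the update_H callback are illustrative placeholders of ours, not a recommended implementation.

```python
import numpy as np

def quasi_newton(f, grad, x0, update_H, t0=1.0, beta=0.5, c=1e-4, iters=50):
    """Generic line search quasi-Newton method, cf. (5)-(6).

    update_H(H, s, y) returns the next inverse Hessian approximation;
    plugging in different updates gives BFGS, DFP, SR1, etc.
    """
    x = np.asarray(x0, dtype=float)
    H = np.eye(x.size)                        # H_0 = alpha*I with alpha = 1
    for _ in range(iters):
        g = grad(x)
        d = -H @ g                            # quasi-Newton direction
        t = t0
        while f(x + t * d) > f(x) + c * t * (g @ d):  # backtracking (Armijo)
            t *= beta
        x_new = x + t * d
        s, y = x_new - x, grad(x_new) - g     # differences (6)
        H = update_H(H, s, y)
        x = x_new
    return x
```

A concrete choice of update_H is sketched after (12) below.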

The following lemma connects the line search quasi-Newton methods to first-order methods in a straightforward manner.

Lemma 3.1

Any method of the form

$$\begin{aligned} x_{k+1} = x_{k} - t_{k} H_{k}\nabla f(x_{k}), \qquad k = 0,1,\ldots \end{aligned}$$

where \(t_k \in {\mathbb {R}}\), \(H_0= \alpha I\), \(\alpha \in {\mathbb {R}}, \alpha \ne 0\) and

$$\begin{aligned} H_{k+1} z \in \mathrm {span}\{ H_{k} z, x_{k+1}-x_k, H_k ( \nabla f(x_{k+1}) - \nabla f(x_k) ) \}\, \forall z \in {\mathbb {R}}^n \end{aligned}$$

is a first-order method.

Proof

The first iteration \(k=0\) is a special case; this is verified by

$$\begin{aligned} x_1 = x_0 - t_0 H_0 \nabla f(x_0) = x_0 - t_0 \alpha \nabla f(x_0) \in x_0 + \mathrm {span}\{\nabla f(x_0) \}. \end{aligned}$$

The iterations \(k=1,2,\ldots \) will be handled by induction on the statement:

$$\begin{aligned} \forall \, k=1,2,\ldots , X(k) : \left\{ \begin{array}{l} H_{k} z \in \mathrm {span}\left\{ z, \nabla f(x_0), \nabla f(x_1), \ldots , \nabla f(x_{k}) \right\} \forall z \in {\mathbb {R}}^n \\ x_{k+1} \in x_0 + \mathrm {span}\left\{ \nabla f(x_0), \nabla f(x_1), \ldots , \nabla f(x_{k}) \right\} \\ \end{array}\right. \end{aligned}$$

Induction start: for \(k=1\), we have

$$\begin{aligned} H_1 z&\in \mathrm {span}\{ H_{0} z, x_{1}-x_0, H_0 ( \nabla f(x_{1}) - \nabla f(x_0) ) \}\\&\in \mathrm {span}\{ \alpha z, -t_0 \alpha \nabla f(x_0), \alpha ( \nabla f(x_{1}) - \nabla f(x_0) ) \}\\&\in \mathrm {span}\{ z, \nabla f(x_0), \nabla f(x_1) \}\, \forall z \in {\mathbb {R}}^n \end{aligned}$$

and

$$\begin{aligned} x_2 = x_1 - t_1 H_1 \nabla f(x_1)&= x_0 - t_0 H_0 \nabla f(x_0) - t_1 H_1 \nabla f(x_1) \\&\in x_0 + \mathrm {span}\{ \nabla f(x_0), \nabla f(x_1) \} \end{aligned}$$

so X(1) holds. Induction step: assume X(k) holds. We have

$$\begin{aligned} H_{k+1} z&\in \mathrm {span}\left\{ H_{k}z,x_{k+1}-x_k,H_k ( \nabla f(x_{k+1}) - \nabla f(x_k) ) \right\} \nonumber \\&\in \mathrm {span}\left\{ H_{k}z,H_{k}\nabla f(x_{k}), H_k \nabla f(x_{k+1}) - H_k \nabla f(x_k) \right\} \nonumber \\&\in \mathrm {span}\left\{ H_{k}z,H_{k}\nabla f(x_{k}),H_{k}\nabla f(x_{k+1})\right\} \nonumber \\&\in \mathrm {span}\left\{ z,\nabla f(x_0),\ldots ,\nabla f(x_{k}),\nabla f(x_{k+1})\right\} \, \forall z \in {\mathbb {R}}^n , \end{aligned}$$
(7)

where we have used that X(k) holds. Equation (7) is the first part of \(X(k+1)\). For the iterates, we then have

$$\begin{aligned} x_{k+2}&= x_{k+1} -t_{k+1} H_{k+1} \nabla f(x_{k+1}) \nonumber \\&\in x_0 + \mathrm {span}\left\{ \nabla f(x_0),\ldots ,\nabla f(x_{k}) \right\} + \mathrm {span}\left\{ \nabla f(x_0),\ldots ,\nabla f(x_{k+1})\right\} \nonumber \\&\in x_0 + \mathrm {span}\left\{ \nabla f(x_0),\ldots , \nabla f(x_{k+1}) \right\} , \end{aligned}$$
(8)

where we have used X(k) and (7). Equation (8) is the second part of \(X(k+1)\); consequently, \(X(k+1)\) holds. The iterates then satisfy \(x_{k+1} \in x_0 + F_{k+1}\), and the method is a first-order method. \(\square \)

We will now show that the quasi-Newton methods of the Broyden and Huang families, as well as the L-BFGS method, are first-order methods. This implies that they share the worst-case complexity bounds of any first-order method given in Theorems 2.1 and 2.2. Note that these methods are not necessarily optimal, in the sense of achieving (up to a constant) the bounds in Theorems 2.1 and 2.2. As described in the introduction, these are known results [25]. Our reason for conducting this analysis is twofold: we believe that (i) the provided analysis is more intuitive and insightful, and (ii) it provides a logical background for analyzing truncated-Newton methods using a similar approach.

3.1 The Broyden Family

The (one-parameter) Broyden family includes updates of the form

$$\begin{aligned} H_{k+1} = H_k + \rho _k s_k s_k^T - \varrho _k H_k y_k y_k^T H_k + \frac{\eta _k}{\varrho _k} v_k v_k^T \end{aligned}$$
(9)

where \(\eta _k\) is the Broyden parameter and

$$\begin{aligned} \rho _k&= \frac{1}{y_k^T s_k},&\varrho _k&= \frac{1}{y_k^T H_k y_k},&v_k&= \rho _k s_k - \varrho _k H_k y_k . \end{aligned}$$

The Broyden family includes the well-known BFGS, DFP, and SR1 quasi-Newton methods as special cases with certain settings of \(\eta _k\), specifically:

$$\begin{aligned} \text {BFGS}:&\qquad \eta _k = 1, \end{aligned}$$
(10)
$$\begin{aligned} \text {DFP}:&\qquad \eta _k = 0 \end{aligned}$$
(11)
$$\begin{aligned} \text {SR1}:&\qquad \eta _k = \frac{1}{\rho _k y_k^T(s_k - H_k y_k)}. \end{aligned}$$
(12)
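A minimal sketch of the Broyden family update (9), usable as the update_H callback in the sketch after (6), is given below; the special cases (10)–(12) correspond to different choices of \(\eta _k\). This is an illustration of ours without the safeguards (e.g., curvature checks) a practical implementation would require.

```python
import numpy as np

def broyden_update(H, s, y, eta):
    """Broyden family inverse Hessian update, cf. (9)."""
    rho = 1.0 / (y @ s)
    Hy = H @ y
    varrho = 1.0 / (y @ Hy)
    v = rho * s - varrho * Hy
    return (H + rho * np.outer(s, s)
              - varrho * np.outer(Hy, Hy)
              + (eta / varrho) * np.outer(v, v))

bfgs_update = lambda H, s, y: broyden_update(H, s, y, eta=1.0)   # (10)
dfp_update  = lambda H, s, y: broyden_update(H, s, y, eta=0.0)   # (11)

def sr1_update(H, s, y):                                         # (12)
    rho = 1.0 / (y @ s)
    eta = 1.0 / (rho * (y @ (s - H @ y)))
    return broyden_update(H, s, y, eta)
```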

Corollary 3.1

Any quasi-Newton method of the Broyden family with \(\eta _k \in {\mathbb {R}}\), any step size rule and \(H_0 = \alpha I\), \(\alpha \in {\mathbb {R}}\), \(\alpha \ne 0\) is a first-order method.

Proof

Using Eq. (9), we have:

$$\begin{aligned} H_{k+1} z&= H_kz + \rho _k s_k s_k^Tz -\varrho _k H_k y_k y_k^T H_kz + \frac{\eta _k}{\varrho _k} v_k v_k^Tz \\&\in \mathrm {span}\{H_k z, s_k, H_k y_k, v_k \} \forall z \in {\mathbb {R}}^n \\&\in \mathrm {span}\{H_k z, s_k, H_k y_k \} \forall z \in {\mathbb {R}}^n \\&\in \mathrm {span}\{H_k z, x_{k+1} - x_k, H_k (\nabla f(x_{k+1}) - \nabla f(x_k)) \} \forall z \in {\mathbb {R}}^n \end{aligned}$$

and the results then follow directly from Lemma 3.1. \(\square \)

3.2 The Huang Family

The Huang family includes updates of the form [5]:

$$\begin{aligned} H_{k+1} = H_k + \psi _k \sigma _k s_k (\phi _k s_k + \varphi _k H_k^T y_k )^T - \varsigma _k H_k y_k (\theta _k s_k + \vartheta _k H_k^T y_k)^T \end{aligned}$$
(13)

where \(\psi _k,\, \phi _k,\,\varphi _k,\,\theta _k,\,\vartheta _k \in {\mathbb {R}}\) and

$$\begin{aligned} \sigma _k&= \frac{1}{(\phi _k s_k + \varphi _k H_k^T y_k)^T y_k},&\varsigma _k&= \frac{1}{(\theta _k s_k + \vartheta _k H_k^T y_k)^T y_k} \end{aligned}$$

Note that the Huang family includes the DFP method with the selection \(\psi _k = 1, \phi _k= 1, \varphi _k=0, \theta _k = 0, \vartheta _k = 1\), but also a range of other methods, including non-symmetric forms; see, e.g., [5].
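Analogously, a minimal sketch of the Huang family update (13) is given below (an illustration of ours; note that, in general, the resulting \(H_{k+1}\) need not be symmetric).

```python
import numpy as np

def huang_update(H, s, y, psi, phi, varphi, theta, vartheta):
    """Huang family update, cf. (13); DFP corresponds to
    psi = phi = vartheta = 1 and varphi = theta = 0."""
    u = phi * s + varphi * (H.T @ y)          # first weight vector
    w = theta * s + vartheta * (H.T @ y)      # second weight vector
    sigma = 1.0 / (u @ y)
    varsigma = 1.0 / (w @ y)
    return H + psi * sigma * np.outer(s, u) - varsigma * np.outer(H @ y, w)
```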

Corollary 3.2

Any quasi-Newton method of the Huang family with any step size rule and \(H_0 = \alpha I\), \(\alpha \in {\mathbb {R}}\), \(\alpha \ne 0\), is a first-order method.

Proof

Using Eq. (13), we have:

$$\begin{aligned} H_{k+1} z&= H_kz + \psi _k \sigma _k s_k (\phi _k s_k + \varphi _k H_k^T y_k )^Tz - \varsigma _k H_k y_k (\theta _k s_k + \vartheta _k H_k^T y_k)^Tz \\&\in \mathrm {span}\{H_k z, s_k, H_k y_k \} \forall z \in {\mathbb {R}}^n \\&\in \mathrm {span}\{H_k z, x_{k+1} - x_k, H_k (\nabla f(x_{k+1}) - \nabla f(x_k)) \} \forall z \in {\mathbb {R}}^n \end{aligned}$$

and the results then follow directly from Lemma 3.1. \(\square \)

3.3 Limited-Memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS)

For L-BFGS, we build an inverse Hessian approximation \({\bar{H}}_k\) only from the m most recent vector pairs \(s_i,\, y_i\) [3]

$$\begin{aligned} {\bar{H}}_k&= (V^T_{k-1} \ldots V^T_{k-m}) {\bar{H}}_k^0 (V_{k-m} \ldots V_{k-1}) \\&\quad + \,\rho _{k-m}(V_{k-1}^T\ldots V^T_{k-m+1}) s_{k-m} s^T_{k-m} (V_{k-m+1} \ldots V_{k-1}) \\&\quad +\, \qquad \vdots \\&\quad +\, \rho _{k-1} s_{k-1} s_{k-1}^T \, \end{aligned}$$

where \(V_k = I - \rho _k y_k s_k^T\). From this, it is clear that for \(k \le m\) and \({\bar{H}}_{ k}^0 = H_0\), the inverse Hessian approximations of the BFGS and L-BFGS methods are identical, \({\bar{H}}_{k} = H_{k}\). This suggests that L-BFGS is also a first-order method.
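For completeness, the product \({\bar{H}}_k \nabla f(x_k)\) is usually evaluated without ever forming \({\bar{H}}_k\), via the standard two-loop recursion [3]. A minimal sketch (with illustrative variable names of ours) is given below.

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list, gamma):
    """Two-loop recursion: returns bar{H}_k g for bar{H}_k^0 = gamma*I,
    using only the m most recent pairs (s_i, y_i) [3]."""
    q = g.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest to oldest
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append((rho, a))
        q = q - a * y
    r = gamma * q                                          # apply bar{H}_k^0
    for (rho, a), s, y in zip(reversed(alphas), s_list, y_list):  # oldest to newest
        b = rho * (y @ r)
        r = r + (a - b) * s
    return r                                               # = bar{H}_k g
```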

Corollary 3.3

The L-BFGS method with any step size rule and \({\bar{H}}_k^0 = \gamma _k I\), \(\gamma _k \in {\mathbb {R}}\), \(\gamma _k \ne 0\), is a first-order method.

Proof

See “Appendix 2”. \(\square \)

4 Generalized First-Order Methods and the Truncated-Newton Method

In this section, we will show that the truncated-Newton method behaves very similarly to a first-order method. However, the truncated-Newton method is not a first-order method in the sense of Definition 2.1. Instead, when comparing a first-order method with a truncated-Newton method, we must assume some properties of the underlying iterative solver of the Newton equation

$$\begin{aligned} \nabla f(x_k) + \nabla ^2 f(x_k) (x -x_k) = 0 \end{aligned}$$

which is usually solved approximately by an underlying lower-level algorithm that is often a first-order method itself, e.g., the conjugate gradient method. For a fixed higher-level index k, these iterations start with \(x_{k,0}:=x_k\) (warm-start) and \(F_{k,0}^\prime =\emptyset ,\) and subsequently iterate for \(i \in {\mathbb {N}}\)

$$\begin{aligned} x_{k,i+1}&\in x_{k} + F_{k,i+1}^\prime , \quad \text {with} \\ F_{k,i+1}^\prime&= F_{k,i}^\prime + \mathrm {span}\{ \nabla f(x_k) + \nabla ^2 f(x_k) (x_{k,i} - x_k)\} \, . \end{aligned}$$

After the lower-level iterations have been completed, e.g., at iterate \(i^\star +1\), the new higher-level iterate is set to

$$\begin{aligned} x_{k+1} = x_k + t_k ( x_{k,i^\star +1} - x_k) \in x_k + F_{k,i^\star +1}^\prime \end{aligned}$$

where \(t_k\) is obtained by a line search. By renumbering all iterates so that the lower- and higher-level iterations become part of the same sequence, we can express the last equation as

$$\begin{aligned} x_{k+i^\star +1} \in x_k + F_{k+i^\star +1}^\prime \end{aligned}$$

where the iterates and subspaces after \(x_k\) and \(F_k^\prime =\emptyset \) are generated following

$$\begin{aligned} x_{k+i+1}&\in x_{k} + F_{k+i+1}^\prime , \quad \text {with} \\ F_{k+i+1}^\prime&= F_{k+i}^\prime + \mathrm {span}\{\nabla f(x_k) + \nabla ^2 f(x_k) (x_{k+i} - x_k) \}. \end{aligned}$$

We see that the main differences between a truncated-Newton method and a standard first-order method are (i) that the subspace is reset at the beginning of the lower-level iterations (here, at index k) and (ii) that the exact gradient evaluations are replaced by a first-order Taylor expansion of the gradient (here evaluated at \(x_k\)). This motivates the question of how to define a generalization of first-order methods that also covers the class of truncated-Newton methods. In this definition, we will introduce additional degrees of freedom, while making sure that the truncated-Newton methods are contained. One way to formulate the generalized first-order subspaces could be

$$\begin{aligned} {\bar{F}}_{k+1}' = {\bar{F}}_k' + \mathrm {span}\{\nabla f(x_i) + \nabla ^2f(x_i) \, (x_k - x_i) \, : \, 0 \le i \le k \}. \end{aligned}$$

This definition would include all truncated-Newton methods that use first-order methods in their lower-level iterations, together with standard first-order methods, and many more methods as well. Note that the dimension of the subspace does not increase by only one in each iteration, as for a standard first-order method, but possibly by \(k+1\). Here, we can disregard the fact that, in practice, we will not be able to generate such high-dimensional subspaces. In this paper, we generalize the subspace generation further, as follows.
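Before stating the generalized definition, a minimal sketch of a truncated-Newton iteration of the kind described above is given below: the inner conjugate gradient loop is warm-started at \(x_{k,0} = x_k\) (i.e., with a zero trial step) and touches the Hessian only through matrix-vector products. The tolerances, iteration limits, and unit step size are illustrative choices of ours; a practical code would add a line search and negative-curvature safeguards.

```python
import numpy as np

def truncated_newton(grad, hess_vec, x0, outer_iters=20, inner_iters=10, tol=1e-8):
    """Truncated-Newton sketch: the Newton equation
    grad f(x_k) + hess f(x_k) d = 0 is solved approximately by CG,
    using only Hessian-vector products hess_vec(x, v)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(outer_iters):
        g = grad(x)
        d = np.zeros_like(x)            # warm start x_{k,0} = x_k, i.e. d = 0
        r = -g                          # CG residual for hess*d = -g at d = 0
        p = r.copy()
        for _ in range(inner_iters):    # lower-level (truncated) iterations
            if np.linalg.norm(r) <= tol:
                break
            Ap = hess_vec(x, p)
            alpha = (r @ r) / (p @ Ap)
            d = d + alpha * p
            r_new = r - alpha * Ap
            beta = (r_new @ r_new) / (r @ r)
            p = r_new + beta * p
            r = r_new
        x = x + d                       # higher-level update with t_k = 1
    return x
```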

Definition 4.1

A generalized first-order method for a problem with a twice continuously differentiable objective is any iterative algorithm that selects

$$\begin{aligned} x_{k+1}&\in x_0 + S_{k+1}, \quad \text {where} \nonumber \\ S_{k+1}&= S_k + \mathrm {span}\{\nabla f(p) + \nabla ^2f(p) \, d \, : \, p \in x_0 + S_k, d \in S_k\} \end{aligned}$$
(14)

with \(S_0 = \{ 0 \}\) and \(k \in {\mathbb {N}}\).

Remark 4.1

Note that here we use \(S_0 = \{0\}\) and not \(S_0 = \emptyset \) to ensure that the set-builder notation in (14) is well defined for the first iteration.

At this point, a short discussion of the term “generalized first-order methods” seems appropriate. The model in Definition 4.1 requires second-order information, and following the standard terminology for such methods, see for instance [25], it would be denoted a second-order method. However, one motivation for using truncated-Newton methods is the implicit usage of second-order information, in which the Hessian matrix \(\nabla ^2f(p)\in {\mathbb {R}}^{n \times n}\) is never formed but only accessed via the matrix-vector product \(\nabla ^2f(p) d \in {\mathbb {R}}^{n}\). Further, whether the Hessian matrix is used explicitly or implicitly, a truncated-Newton method is, from a global convergence perspective, more similar to first-order methods than to second-order methods, as we will show in the following subsection. Hence, we believe that the term “generalized first-order methods” provides a meaningful description of these methods.

This also motivates the following definition of a generalized first-order oracle (deterministic, black box):

$$\begin{aligned} \text {input:}\, (x,z) \quad \text {return:}\, (f(x), \nabla f(x), \nabla ^2 f(x) z) \end{aligned}$$

This means that with a two-tuple input (x, z), the oracle returns the three-tuple output \((f(x), \nabla f(x), \nabla ^2 f(x) z)\). In this case, the second-order information is not directly available, as it would be from a second-order oracle; only the product \(\nabla ^2 f(x) z\) is returned. This results in a stronger and more precise description of truncated-Newton methods than saying that they fall between first- and second-order methods, as suggested in [24] and discussed in the introduction.
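As an illustration of this oracle model (a sketch of ours; the forward-difference Hessian-vector product shown is one common way such an oracle can be realized when only gradient evaluations are available, and an exact product, e.g., via automatic differentiation, could be used instead):

```python
import numpy as np

def generalized_first_order_oracle(f, grad, x, z, eps=1e-6):
    """Return (f(x), grad f(x), hess f(x) z) without forming the Hessian."""
    g = grad(x)
    hz = (grad(x + eps * z) - g) / eps    # forward-difference approximation
    return f(x), g, hz
```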

4.1 Equivalence of Generalized and Standard First-Order Methods for Quadratic Functions

For a quadratic function f(x), the subspaces of the standard first-order method and the generalized first-order method coincide, \(F_k=S_k\).

Lemma 4.1

For quadratic functions f, \(S_k = F_k\), \(k\ge 1\).

Proof

As f is quadratic, \(\nabla f\) is linear and thus coincides with its first-order Taylor expansion, i.e., \(\nabla f(p) + \nabla ^2f(p) \, d = \nabla f(p + d)\). To simplify the notation, denote the constant Hessian \(\nabla ^2 f(x) = P\) for all \(x \in {\mathbb {R}}^n\). We have \(F_1 = \mathrm {span}\{\nabla f(x_0) \}\), and for quadratic problems \(S_1 = \{0\} \,+\, \mathrm {span}\{\nabla f(x_0) + P \cdot 0 \}= \mathrm {span}\{ \nabla f(x_0) \}\). For the standard first-order method and \(k\ge 1\), the following holds

$$\begin{aligned} F_{k+1}&= F_k + \mathrm {span}\{\nabla f(x_k) \} \nonumber \\&= F_k + \mathrm {span}\{\nabla f(x_0) + P (x_k-x_0) \} \nonumber \\&= F_k + \mathrm {span}\{ P (x_k-x_0) \} \nonumber \\&= F_k + \mathrm {span}\{ P x \, : \, x \in F_k \}. \end{aligned}$$
(15)

Similarly, for the generalized first-order method applied to a quadratic function and \(k\ge 1\), it holds that

$$\begin{aligned} S_{k+1}&= S_k + \mathrm {span}\{\nabla f(p) + \nabla ^2f(p) \, d \, : \, p \in x_0 + S_k, d \in S_k \} \nonumber \\&= S_k + \mathrm {span}\{\nabla f(p+d) \, : \, p \in x_0 + S_k, d \in S_k \} \nonumber \\&= S_k + \mathrm {span}\{\nabla f(x_0+d) \, : \, d \in S_k \} \nonumber \\&= S_k + \mathrm {span}\{\nabla f(x_0) + P d \, : \, d \in S_k \} \nonumber \\&= S_k + \mathrm {span}\{P d \, : \, d \in S_k \} \, . \end{aligned}$$
(16)

Since \(S_1 = F_1\) and (15)–(16) are the same recursions, \(S_k = F_k,\; \forall k\ge 1\). \(\square \)

With Lemma 4.1 at hand, we can now present an important result which motivates the name "generalized first-order methods".

Theorem 4.1

Theorems 2.1 and 2.2 hold for generalized first-order methods.

Proof

Generalized first-order methods and first-order methods are equivalent for quadratic problems following Lemma 4.1. Hence, the results in Theorems 2.1 and 2.2 follow directly (see also [26, The. 2.1.7 and The. 2.1.13]). \(\square \)

This means that the worst-case convergence of generalized first-order methods is the same as that of first-order methods, i.e., sub-linear for smooth problems and linear for smooth and strongly convex problems, including the finite-dimensional case. This implies that, from a global convergence perspective, a generalized first-order method has more in common with first-order methods than with second-order methods.

5 Conclusions

In this paper, we have tried to answer two important questions. First, why are limited-memory BFGS methods not considered first-order methods? We believe that this is connected to the accessibility of the analysis provided in [25]. To this end, we have given a more straightforward analysis, using the definition of first-order methods provided in [26]. The second question is whether it is possible to give a better classification of truncated-Newton methods. We have described a class of methods named generalized first-order methods, to which the truncated-Newton methods belong, and have shown that the worst-case global convergence rate for generalized first-order methods is the same as that for first-order methods. Thus, in a worst-case scenario, quasi-Newton and truncated-Newton methods are lower-bounded by a linear rate of convergence for \(k\le {\textstyle \frac{1}{2}}n\) according to Theorem 2.2 (see also the comment in [26, p. 41]). In a worst-case scenario, faster convergence, such as superlinear convergence, can only take effect for \(k>{\textstyle \frac{1}{2}}n\). Since the number of iterations k covered by this bound grows with the problem dimension n, the bound has the strongest implications for large-scale optimization.