1 Introduction

We consider the problem

$$\begin{aligned} \min _{x\in \mathcal {M}} f(x) \end{aligned}$$
(1)

of minimizing a smooth real-valued function \(f\) defined on a Riemannian manifold \(\mathcal {M}\). Recently investigated application areas include image segmentation [30] and recognition [37], electrostatics and electronic structure calculation [39], finance and chemistry [7], multilinear algebra [21, 32], low-rank learning [8, 26], and blind source separation [25, 34].

The wealth of applications has stimulated the development of general-purpose methods for (1)—see, e.g., [2, 30, 31] and references therein—including the trust-region approach upon which we focus in this work. A well-known technique in optimization [16], the trust-region method was extended to Riemannian manifolds in [1] (or see [2, Ch. 7]), and found applications, e.g., in [8, 21, 23, 26, 38]. Trust-region methods construct a quadratic model \(m_k\) of the objective function \(f\) around the current iterate \(x_k\) and produce a candidate new iterate by (approximately) minimizing the model \(m_k\) within a region where it is “trusted”. Depending on the discrepancy between \(f\) and \(m_k\) at the candidate new iterate, the size of the trust region is updated and the candidate new iterate is accepted or rejected.

For lack of efficient techniques to produce a second-order term in \(m_k\) that is inexact but nevertheless guarantees superlinear convergence, the Riemannian trust-region (RTR) framework loses some of its appeal when the exact second-order term—the Hessian of \(f\)—is not available. This is in contrast with the Euclidean case, where several strategies exist to build an inexact second-order term that preserves superlinear convergence of the trust-region method. Among these strategies, the symmetric rank-one (SR1) update is favored in view of its simplicity and because it preserves symmetry without unnecessarily enforcing positive definiteness; see, e.g., [27, §6.2] for a more detailed discussion. The \(n+1\) step q-superlinear rate of convergence of the SR1 trust-region method was shown by Byrd et al. [11] using a sophisticated analysis that builds on [15, 24].

The classical (Euclidean) SR1 trust-region method can also be viewed as a quasi-Newton method, enhanced with a trust-region globalization strategy. The idea of quasi-Newton methods on manifolds is not new [18, §4.5]; however, most of the literature of which we are aware restricts consideration to generalizations of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) quasi-Newton method combined with a line-search strategy. Early work such as [10, 32] used Riemannian BFGS methods for a specific application without an analysis of convergence properties, but more recently, systematic analyses of Riemannian BFGS methods based on the framework of retraction and vector transport developed in [2, 4] have been made. Qi [29] analyzed a version of the Riemannian BFGS method in which the retraction and vector transport are restricted to the exponential mapping and parallel translation, and showed superlinear convergence using a Riemannian Dennis–Moré condition. Ring and Wirth [30] proposed and analyzed a Riemannian BFGS method that avoids the restrictions on retraction and vector transport assumed by Qi but that needs to resort to the derivative of the retraction. Seibert et al. [33] discussed the freedom available when generalizing BFGS to Riemannian manifolds and analyzed one generalization of the BFGS method on Riemannian manifolds that are isometric to \(\mathbb {R}^n\). Most recently, Huang [20] developed a complete convergence theory that avoids the restrictions of Qi and of Ring and Wirth, guarantees superlinear convergence for the Riemannian Broyden family of quasi-Newton methods (including a version of SR1), and facilitates efficient implementation.

In this paper, motivated by the situation described above, we introduce a generalization of the classical (i.e., Euclidean) SR1 trust-region method to the Riemannian setting (1). Besides making use of basic Riemannian geometric concepts (tangent space, Riemannian metric, gradient), the new method, called RTR-SR1, relies on the notions of retraction and vector transport introduced in [2, 4]. A detailed global and local convergence analysis is given. A limited-memory version of RTR-SR1, referred to as LRTR-SR1, is also introduced. Numerical experiments show that the RTR-SR1 method displays the expected convergence properties. When the Hessian of \(f\) is not available, RTR-SR1 thus offers an attractive way of tackling (1) by a trust-region approach. Moreover, even when the Hessian of \(f\) is available, making use of it can be expensive computationally, and the numerical experiments show that ignoring the Hessian information and resorting instead to the RTR-SR1 approach can be beneficial.

Another contribution of this paper with respect to [11] is an extension of the analysis to allow for inexact solutions of the trust-region subproblem—compare (10) with [11, (2.4)]. This extension makes it possible to resort to inner iterations such as the Steihaug–Toint truncated CG method (see [2, §7.3.2] for its Riemannian extension) while staying within the assumptions of the convergence analysis.

The paper is organized as follows. The RTR-SR1 method is stated and discussed in Sect. 2. The convergence analysis is carried out in Sect. 3. The limited-memory version is introduced in Sect. 4. Numerical experiments are reported in Sect. 5. Conclusions are drawn in Sect. 6.

2 The Riemannian SR1 trust-region method

The proposed Riemannian SR1 trust-region (RTR-SR1) method is described in Algorithm 1. The algorithm statement is commented in Sect. 2.1 and the important questions of representing tangent vectors and choosing the vector transport are discussed in Sects. 2.2 and 2.3.

[Algorithm 1 (RTR-SR1: Riemannian SR1 trust-region method) is displayed as a figure in the original.]

2.1 A guide to Algorithm 1

Algorithm 1 can be viewed as a Riemannian version of the classical (Euclidean) SR1 trust-region method (see, e.g., [27, Algorithm 6.2]). It can also be viewed as an SR1 version of the Riemannian trust-region framework [2, Algorithm 10 p. 142]. Therefore, several pieces of information given in [2, Ch. 7] remain relevant for Algorithm 1.

In particular, the algorithm statement makes use of standard Riemannian concepts that are described, e.g., in [2, 28], such as the tangent space \({{\mathrm{\mathrm {T}}}}_x\mathcal {M}\) to the manifold \(\mathcal {M}\) at a point \(x\), a Riemannian metric \(g\), and the gradient \({{\mathrm{\mathrm {grad}}}}f\) of a real-valued function \(f\) on \(\mathcal {M}\). The algorithm statement also relies on the notion of retraction, introduced in [4] (or see [2, §4.1]). A retraction \(R\) on \(\mathcal {M}\) is a smooth map from the tangent bundle \({{\mathrm{\mathrm {T}}}}\mathcal {M}\) (i.e., the set of all tangent vectors to \(\mathcal {M}\)) onto \(\mathcal {M}\) such that, for all \(x\in \mathcal {M}\) and all \(\xi _x\in {{\mathrm{\mathrm {T}}}}_x\mathcal {M}\), the curve \(t\mapsto R(t\xi _x)\) is tangent to \(\xi _x\) at \(t=0\). We let \(R_x\) denote the restriction of \(R\) to \({{\mathrm{\mathrm {T}}}}_x\mathcal {M}\). The domain of \(R\) need not be the entire tangent bundle, but this is usually the case in practice, and in this work we assume throughout that \(R\) is defined wherever needed. Specific ways of constructing retractions are proposed in [2, 3, 4]; see also [22, 39] for the important case of the Stiefel manifold.

Within the Riemannian trust-region framework, the characterizing aspect of Algorithm 1 lies in the update mechanism for the Hessian approximation \(\mathcal {B}_k\). The proposed update mechanism, based on formula (3) and on Step 6 of Algorithm 1, is a rather straightforward Riemannian generalization of the classical SR1 update

$$\begin{aligned} B_{k+1} = B_k + \frac{(y_k-B_ks_k)(y_k-B_ks_k)^T}{(y_k-B_ks_k)^Ts_k}. \end{aligned}$$
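In matrix form this is a one-line computation; the following minimal NumPy transcription is included only for concreteness (it is an illustrative sketch, written without the customary safeguard on the denominator, which is related to Assumption 5 in Sect. 3).

```python
import numpy as np

def sr1_update(B, s, y):
    """Classical SR1 update: B_{k+1} = B_k + r r^T / (r^T s), with r = y_k - B_k s_k."""
    r = y - B @ s                        # residual of the secant equation
    return B + np.outer(r, r) / (r @ s)  # symmetric rank-one correction
```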

Significantly less straightforward is the Riemannian generalization of the superlinear convergence result, as we will see in Sect. 3.4. (Observe that the local convergence result [2, Theorem 7.4.11] does not apply here because the Hessian approximation condition [2, (7.36)] is not guaranteed to hold.)

Instrumental in the Riemannian SR1 update is the notion of vector transport, introduced in [2, §8.1] as a generalization of the classical Riemannian concept of parallel translation. A vector transport on a manifold \(\mathcal {M}\) on top of a retraction \(R\) is a smooth mapping

$$\begin{aligned} {T\mathcal {M}}\oplus {T\mathcal {M}}\rightarrow T\mathcal {M} : (\eta _x,\xi _x) \mapsto \mathcal {T}_{\eta _x}(\xi _x)\in T\mathcal {M} \end{aligned}$$

satisfying the following properties for all \(x\in \mathcal {M}\):

  1. (Associated retraction) \(\mathcal {T}_{\eta _x}(\xi _x) \in {{\mathrm{\mathrm {T}}}}_{R_x(\eta _x)}\mathcal {M}\) for all \(\xi _x\in {{\mathrm{\mathrm {T}}}}_x\mathcal {M}\);

  2. (Consistency) \(\mathcal {T}_{0_x}(\xi _x) = \xi _x\) for all \(\xi _x\in {{\mathrm{\mathrm {T}}}}_x\mathcal {M}\);

  3. (Linearity) \(\mathcal {T}_{\eta _x}(a\xi _x+b\zeta _x) = a\mathcal {T}_{\eta _x}(\xi _x) + b\mathcal {T}_{\eta _x}(\zeta _x)\) for all \(\xi _x,\zeta _x\in {{\mathrm{\mathrm {T}}}}_x\mathcal {M}\) and all \(a,b\in \mathbb {R}\).

The Riemannian SR1 update uses tangent vectors at the current iterate to produce a new Hessian approximation at the next iterate, hence the need to perform a vector transport (see Step 6) from the current iterate to the next.
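To fix ideas, the following sketch combines the update (3) with the transport of Step 6, working with coordinate representations in orthonormal tangent bases so that the metric is the standard inner product and an isometric transport is represented by an orthogonal matrix. The names and the safeguard threshold nu are illustrative; they are not prescribed by Algorithm 1.

```python
import numpy as np

def rtr_sr1_hessian_update(B, s, y, T, nu=1e-8):
    """Riemannian SR1 step on the Hessian approximation, in coordinates.

    B : (d, d) symmetric representation of B_k on T_{x_k}M
    s : (d,)   accepted step s_k in T_{x_k}M
    y : (d,)   y_k = T_{S_{s_k}}^{-1} grad f(x_{k+1}) - grad f(x_k), in T_{x_k}M
    T : (d, d) orthogonal matrix representing the isometric transport T_{S_{s_k}}
    Returns the representation of B_{k+1} on T_{x_{k+1}}M.
    """
    r = y - B @ s
    denom = r @ s
    if abs(denom) >= nu * np.linalg.norm(s) * np.linalg.norm(r):
        B = B + np.outer(r, r) / denom   # update (3); otherwise the update is skipped
    return T @ B @ T.T                   # transport the operator to the next iterate
```

Since \(T\) is orthogonal, \(T^T = T^{-1}\), so the last line is the coordinate expression of \(\mathcal {T}_{S_{s_k}} \circ \tilde{\mathcal {B}}_{k+1} \circ \mathcal {T}_{S_{s_k}}^{-1}\).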

In the Input step of Algorithm 1, the requirement that the vector transport \(\mathcal {T}_S\) is isometric means that, for all \(x\in \mathcal {M}\) and all \(\xi _x,\zeta _x,\eta _x\in {{\mathrm{\mathrm {T}}}}_x\mathcal {M}\), the equation

$$\begin{aligned} g(\mathcal {T}_{S_{\eta _x}}\xi _x,\mathcal {T}_{S_{\eta _x}}\zeta _x) = g(\xi _x,\zeta _x) \end{aligned}$$
(4)

holds. Techniques for constructing an isometric vector transport on submanifolds of Euclidean spaces are described in Sect. 2.3.

The symmetry requirement on \(\mathcal {B}_0\) with respect to the Riemannian metric \(g\) means that \(g(\mathcal {B}_0\xi _{x_0},\eta _{x_0}) = g(\xi _{x_0},\mathcal {B}_0\eta _{x_0})\) for all \(\xi _{x_0},\eta _{x_0}\in {{\mathrm{\mathrm {T}}}}_{x_0}\mathcal {M}\). It is readily seen from (3) and Step 6 of Algorithm 1 that \(\mathcal {B}_k\) is symmetric for all \(k\). Note however that \(\mathcal {B}_k\) is, in general, not positive definite.

A possible stopping criterion for Algorithm 1 is \(\Vert {{\mathrm{\mathrm {grad}}}}f(x_{k})\Vert < \epsilon \) for some specified \(\epsilon >0\), where \(\Vert \cdot \Vert \), which also appears in the statement of Algorithm 1, denotes the norm induced by the Riemannian metric \(g\), i.e.,

$$\begin{aligned} \Vert \xi \Vert = \sqrt{g(\xi ,\xi )}. \end{aligned}$$
(5)

In the spirit of [30, Remark 4], we point out that it is possible to formulate the SR1 update (3) in the new tangent space \({{\mathrm{\mathrm {T}}}}_{x_{k+1}}\mathcal {M}\); in the present case of SR1, the algorithm remains equivalent since the vector transport is isometric.

Otherwise, Algorithm 1 does not call for comments other than those made in [2, Ch. 7]. In particular, we point out that the meaning of “approximately” in Step 2 of Algorithm 1 depends on the desired convergence results. We will see in the convergence analysis (Sect. 3) that enforcing the Cauchy decrease (9) is enough to ensure global convergence to stationary points, but another condition such as (10) is needed to guarantee superlinear convergence. The truncated CG method, discussed in [2, §7.3.2] in the Riemannian context, is an inner iteration for Step 2 that returns an \(s_k\) satisfying conditions (9) and (10).

2.2 Representation of tangent vectors

Let us now consider the frequently encountered situation where the manifold \(\mathcal {M}\) is described as a \(d\)-dimensional submanifold of an \(m\)-dimensional Euclidean space \(\mathcal {E}\). In particular, this is the case of the sphere and the Stiefel manifold involved in the numerical experiments in Sect. 5.

A tangent vector in \({{\mathrm{\mathrm {T}}}}_x\mathcal {M}\) can be represented either by its \(d\)-dimensional vector of coordinates in a given basis \(B_x\) of \({{\mathrm{\mathrm {T}}}}_x\mathcal {M}\), or else as an \(m\)-dimensional vector in \(\mathcal {E}\) since \({{\mathrm{\mathrm {T}}}}_x\mathcal {M}\subset {{\mathrm{\mathrm {T}}}}_x\mathcal {E}\simeq \mathcal {E}\). The latter option may be preferable when the codimension \(m - d\) is small (e.g., the sphere) because building, storing and manipulating the basis \(B_x\) of \({{\mathrm{\mathrm {T}}}}_x\mathcal {M}\) may be inconvenient.

Likewise, since \(\mathcal {B}_k\) is a linear transformation of \({{\mathrm{\mathrm {T}}}}_{x_k}\mathcal {M}\), it can be represented in the basis \(B_x\) as a \(d\times d\) matrix, or as an \(m\times m\) matrix restricted to act on \({{\mathrm{\mathrm {T}}}}_{x_k}\mathcal {M}\). Here again, the latter approach may be computationally more efficient when the codimension \(m-d\) is small.
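As a small illustration on the unit sphere \(S^{m-1}\subset \mathbb {R}^m\), where \(d = m-1\) and the codimension is \(1\): a tangent vector at \(x\) can be stored either as its \(d\)-vector of coordinates in an orthonormal basis \(B_x\) of \({{\mathrm{\mathrm {T}}}}_xS^{m-1}\), or directly as the \(m\)-vector obtained by projecting an ambient vector onto the tangent space. The helper functions below are ad hoc illustrations, not part of Algorithm 1.

```python
import numpy as np

def tangent_basis(x):
    """Orthonormal basis of T_x S^{m-1} = {v : x^T v = 0}, as an m x (m-1) matrix."""
    Q, _ = np.linalg.qr(x.reshape(-1, 1), mode='complete')
    return Q[:, 1:]          # first column of Q is +-x; the rest span its complement

def project(x, v):
    """Orthogonal projection of an ambient vector v onto T_x S^{m-1}."""
    return v - (x @ v) * x

x = np.array([0.0, 0.0, 1.0])                 # a point on S^2 (m = 3, d = 2)
v = project(x, np.array([1.0, 2.0, 3.0]))     # ambient (m-vector) representation
Bx = tangent_basis(x)
c = Bx.T @ v                                  # d-vector of coordinates in the basis B_x
assert np.allclose(Bx @ c, v)                 # the two representations agree
```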

A related choice has to be made for the representation of the vector transport, since \(\mathcal {T}_{\eta _x}\) is a linear map from \({{\mathrm{\mathrm {T}}}}_x\mathcal {M}\) to \({{\mathrm{\mathrm {T}}}}_{R_x(\eta _x)}\mathcal {M}\). This question is addressed in Sect. 2.3.

2.3 Isometric vector transport

We present two ways of constructing an isometric vector transport on a \(d\)-dimensional submanifold \(\mathcal {M}\) of an \(m\)-dimensional Euclidean space \(\mathcal {E}\).

2.3.1 Vector transport by parallelization

An open subset \(\mathcal {U}\) of \(\mathcal {M}\) is termed parallelizable if it admits a smooth field of tangent bases, i.e., a smooth function \(B: \mathcal {U} \rightarrow \mathbb {R}^{m \times d} : z \mapsto B_z\) where \(B_z\) is a basis of \({{\mathrm{\mathrm {T}}}}_z\mathcal {M}\). The whole manifold \(\mathcal {M}\) itself may not be parallelizable; in particular, every manifold of nonzero Euler characteristic is not parallelizable [35, §1.6], and it is also known that the only parallelizable spheres are those of dimension 1, 3, and 7 [6, p. 116]. However, for the global convergence analysis carried out in Sect. 3.2, the vector transport is not required to be smooth or even continuous, and for the local convergence analysis in Sect. 3.4, we only need a parallelizable neighborhood \(\mathcal {U}\) of the limit point \(x^*\). Such a neighborhood always exists (take for example a coordinate neighborhood [2, p. 37]).

Given an orthonormal smooth field of tangent bases \(B\), i.e., such that \(B_x^\flat B_x=I\) for all \(x\) (where \(I\) stands for the identity matrix of adequate size), the proposed isometric vector transport from \({{\mathrm{\mathrm {T}}}}_x \mathcal {M}\) to \({{\mathrm{\mathrm {T}}}}_y \mathcal {M}\) is given by

$$\begin{aligned} \mathcal {T} = B_y B_x^\flat . \end{aligned}$$
(6)

The \(d\times d\) matrix representation of this vector transport in the pair of bases \((B_x,B_y)\) is simply the identity. This considerably simplifies the implementation of Algorithm 1.
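As an illustration, the following sketch builds the ambient \(m\times m\) representation of (6) and checks the isometry property (4) on \(S^2\subset \mathbb {R}^3\); the function name and the explicit bases are ours, chosen only for this example.

```python
import numpy as np

def transport_by_parallelization(Bx, By):
    """Isometric vector transport (6): T = B_y B_x^flat, as an m x m matrix.

    Bx, By : m x d matrices with orthonormal columns spanning T_x M and T_y M
             (Riemannian submanifold, so the flat reduces to the transpose).
    In the pair of bases (B_x, B_y) this map is represented by the d x d
    identity matrix, which is what makes the implementation so simple.
    """
    return By @ Bx.T

# Check of the isometry property (4) on S^2 embedded in R^3:
Bx = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # basis of T_x S^2 at x = e_3
By = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # basis of T_y S^2 at y = e_1
T = transport_by_parallelization(Bx, By)
xi = Bx @ np.array([1.0, 2.0])
zeta = Bx @ np.array([-3.0, 0.5])
assert np.isclose((T @ xi) @ (T @ zeta), xi @ zeta)    # g(T xi, T zeta) = g(xi, zeta)
```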

2.3.2 Vector transport by rigging

If \(\mathcal {M}\) is described as a \(d\)-dimensional submanifold of an \(m\)-dimensional Euclidean space \(\mathcal {E}\) and the codimension \((m-d)\) is much smaller than the dimension \(d\), then the vector transport by rigging, introduced next, may be preferable. For generality, we do not assume that \(\mathcal {M}\) is a Riemannian submanifold of \(\mathcal {E}\); in other words, the Riemannian metric \(g\) on \(\mathcal {M}\) may not be the one induced by the metric of \(\mathcal {E}\). A motivation for this generality is to be able to handle the canonical metric of the Stiefel manifold [17, (2.22)]. For simplicity of the exposition, we work in an orthonormal basis of \(\mathcal {E}\) and, for \(x\in \mathcal {M}\), we let \(G_x\) denote a matrix expression of \(g_x\), i.e., \(g_x(\xi _x,\eta _x) = \xi _x^TG_x\eta _x\) for all \(\xi _x,\eta _x\in {{\mathrm{\mathrm {T}}}}_x\mathcal {M}\).

An open subset \(\mathcal {U}\) of \(\mathcal {M}\) is termed rigged if it admits a smooth field of normal bases, i.e., a smooth function \(N: \mathcal {U} \rightarrow \mathbb {R}^{m\times (m-d)} : z \mapsto N_z\) where \(N_z\) is a basis of the normal space \(\mathrm{N}_z\mathcal {M}\). The whole manifold \(\mathcal {M}\) itself may not be rigged, but it is always locally rigged.

Given a smooth field of normal bases \(N\), the proposed isometric vector transport \(\mathcal {T}\) from \({{\mathrm{\mathrm {T}}}}_x \mathcal {M}\) to \({{\mathrm{\mathrm {T}}}}_y \mathcal {M}\) is defined as follows. Compute \((I-N_x(N_x^TN_x)^{-1}N_x^T)N_y\) (i.e., the orthogonal projection of \(N_y\) onto \({{\mathrm{\mathrm {T}}}}_x\mathcal {M}\)) and observe that its column space is \({{\mathrm{\mathrm {T}}}}_x\mathcal {M} \ominus ({{\mathrm{\mathrm {T}}}}_x\mathcal {M} \cap {{\mathrm{\mathrm {T}}}}_y\mathcal {M})\). Obtain an orthonormal matrix \(Q_x\) by Gram-Schmidt orthonormalizing \((I-N_x(N_x^TN_x)^{-1}N_x^T)N_y\). Proceed likewise with \(x\) and \(y\) interchanged to get \(Q_y\). Finally, let

$$\begin{aligned} \mathcal {T} = G_y^{-\frac{1}{2}} (I - Q_x Q_x^T - Q_y Q_x^T) G_x^{\frac{1}{2}}. \end{aligned}$$
(7)

While it is clear that \(\mathcal {T}\) satisfies the three properties of vector transport listed in Sect. 2.1, proving that \(\mathcal {T}\) is (locally) smooth remains an open question. Moreover, the column space of \((I-N_x(N_x^TN_x)^{-1}N_x^T)N_y\) becomes more sensitive to numerical errors as the distance between \(x\) and \(y\) decreases. Nevertheless, there is evidence that \(\mathcal {T}\) is indeed smooth, and we have observed that using vector transport by rigging in Algorithm 1 is a worthy alternative in large-scale, low-codimension problems.
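The recipe above can be summarized in a few lines of code. The sketch below specializes to a Riemannian submanifold of a Euclidean space, so \(G_x = G_y = I\) and the square-root factors in (7) disappear; a thin QR factorization stands in for the Gram-Schmidt step. The function and variable names are illustrative, and, as noted above, the construction becomes ill-conditioned when \(x\) and \(y\) are close.

```python
import numpy as np

def transport_by_rigging(Nx, Ny):
    """Vector transport by rigging, formula (7), with G_x = G_y = I.

    Nx, Ny : m x (m-d) matrices whose columns span the normal spaces at x and y.
    Returns an m x m matrix that maps T_x M isometrically onto T_y M.
    """
    def project_to_tangent(N, V):
        # Orthogonal projection of the columns of V onto the tangent space
        # at the point whose normal space is spanned by the columns of N.
        return V - N @ np.linalg.solve(N.T @ N, N.T @ V)

    Qx, _ = np.linalg.qr(project_to_tangent(Nx, Ny))   # orthonormalize (I - P_Nx) N_y
    Qy, _ = np.linalg.qr(project_to_tangent(Ny, Nx))   # orthonormalize (I - P_Ny) N_x
    m = Nx.shape[0]
    return np.eye(m) - Qx @ Qx.T - Qy @ Qx.T           # formula (7) with G = I

# Example on S^2: the normal space at a point is spanned by the point itself.
x = np.array([[0.0], [0.0], [1.0]])
y = np.array([[0.0], [1.0], [0.0]])
T = transport_by_rigging(x, y)
xi = np.array([1.0, 2.0, 0.0])          # xi in T_x S^2
assert np.isclose(np.linalg.norm(T @ xi), np.linalg.norm(xi))   # isometric
assert np.isclose((T @ xi) @ y.ravel(), 0.0)                    # lands in T_y S^2
```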

3 Convergence analysis of RTR-SR1

3.1 Notation and standing assumptions

Throughout the convergence analysis, unless otherwise specified, we let \(\{x_k\},\;\{\mathcal {B}_k\},\; \{\tilde{\mathcal {B}}_k\},\; \{s_k\},\; \{y_k\}\), and \(\{\varDelta _k\}\) be infinite sequences generated by Algorithm 1, and we make use of the notation introduced in that algorithm. We let \(\varOmega \) denote the sublevel set of \(x_0\), i.e.,

$$\begin{aligned} \varOmega = \{x\in \mathcal {M} : f(x) \le f(x_0)\}. \end{aligned}$$

The global and local convergence analyses each make standing assumptions at the beginning of their respective sections. The numbered assumptions introduced below are not standing assumptions and will be invoked specifically whenever needed. Note that, apart from Assumption 6, all the numbered assumptions are Riemannian generalizations of assumptions made in [11] for the analysis of the Euclidean SR1 trust-region method.

3.2 Global convergence analysis

In some results, we will assume for the retraction \(R\) that there exist \(\mu >0\) and \(\delta _\mu >0\) such that

$$\begin{aligned} \Vert \xi \Vert \ge \mu {{\mathrm{\mathrm {dist}}}}(x,R_x(\xi )) \quad \text {for all } x\in \varOmega \text { and all } \xi \in {{\mathrm{\mathrm {T}}}}_x\mathcal {M} \text { with } \Vert \xi \Vert \le \delta _\mu . \end{aligned}$$
(8)

This corresponds to [2, (7.25)] restricted to the sublevel set \(\varOmega \). Such a condition is instrumental in the global convergence analysis of Riemannian trust-region schemes. Note that, in view of [30, Lemma 6], condition (8) can be shown to hold globally under the condition that \(R\) has equicontinuous derivatives.

The next assumption corresponds to [11, (A3)].

Assumption 1

The sequence of linear operators \(\{\mathcal {B}_k\}\) is bounded: there exists a constant \(M\) such that \(\Vert \mathcal {B}_k\Vert \le M\) for all \(k\).

We will often require that the trust-region subproblem (2) is solved accurately enough that, for some positive constants \(\sigma _1\) and \(\sigma _2\),

$$\begin{aligned} m_k(0) - m_k(s_k) \ge \sigma _1 \Vert {{\mathrm{\mathrm {grad}}}}f(x_k)\Vert \min \left\{ \varDelta _k, \sigma _2 \frac{\Vert {{\mathrm{\mathrm {grad}}}}f(x_k)\Vert }{\Vert \mathcal {B}_k\Vert }\right\} , \end{aligned}$$
(9)

and that

$$\begin{aligned} \mathcal {B}_k s_k = - {{\mathrm{\mathrm {grad}}}}f(x_k) + \delta _k&\text { with } \Vert \delta _k\Vert \le \Vert {{\mathrm{\mathrm {grad}}}}f(x_k)\Vert ^{1 + \theta }, \nonumber \\&\hbox { whenever } \Vert s_k\Vert \le 0.8 \varDelta _k, \end{aligned}$$
(10)

where \(\theta >0\) is a constant. These conditions are generalizations of [11, (2.3–4)]. Observe that, even if we restrict to the Euclidean case, condition (10) remains weaker than condition [11, (2.4)]. The purpose of introducing \(\delta _k\) in (10) is to encompass stopping criteria such as [2, (7.10)] that do not require the computation of an exact solution of the trust-region subproblem. We point out in particular that (9) and (10) hold if the approximate solution of the trust-region subproblem (2) is obtained from the truncated CG method, described in [2, §7.3.2] in the Riemannian context.
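For concreteness, the following is a minimal sketch (ours, not the implementation used in the experiments of Sect. 5) of a Steihaug–Toint truncated CG inner iteration, written in the coordinates of an orthonormal basis of \({{\mathrm{\mathrm {T}}}}_{x_k}\mathcal {M}\) so that the metric is the standard inner product. The residual test is chosen so that an interior termination returns \(s_k\) with \(\mathcal {B}_ks_k = -{{\mathrm{\mathrm {grad}}}}f(x_k) + \delta _k\) and \(\Vert \delta _k\Vert \le \Vert {{\mathrm{\mathrm {grad}}}}f(x_k)\Vert ^{1+\theta }\), as in (10), whereas a boundary or negative-curvature exit returns \(\Vert s_k\Vert = \varDelta _k\), for which (10) imposes nothing; the first iteration already achieves at least the Cauchy decrease (9). The parameters kappa and theta below are illustrative.

```python
import numpy as np

def truncated_cg(grad, B, Delta, theta=1.0, kappa=0.1, max_iter=None):
    """Steihaug-Toint truncated CG for min_s g^T s + 0.5 s^T B s, ||s|| <= Delta."""
    n = grad.size
    max_iter = n if max_iter is None else max_iter
    s, r = np.zeros(n), grad.copy()            # r is the gradient of the model at s
    d = -r
    gnorm = np.linalg.norm(grad)
    if gnorm == 0.0:
        return s
    tol = gnorm * min(kappa, gnorm ** theta)   # inexact residual test, cf. (10)
    for _ in range(max_iter):
        Bd = B @ d
        curv = d @ Bd
        if curv <= 0:                                    # negative curvature found
            return s + _to_boundary(s, d, Delta) * d     # follow d to the boundary
        alpha = (r @ r) / curv
        if np.linalg.norm(s + alpha * d) >= Delta:       # step would leave the region
            return s + _to_boundary(s, d, Delta) * d
        s = s + alpha * d
        r_new = r + alpha * Bd
        if np.linalg.norm(r_new) <= tol:                 # interior termination
            return s
        d = -r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return s

def _to_boundary(s, d, Delta):
    """Positive tau such that ||s + tau d|| = Delta (s strictly inside, d nonzero)."""
    a, b, c = d @ d, 2 * (s @ d), s @ s - Delta ** 2
    return (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
```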

We can now state and prove the main global convergence results. Point (iii) generalizes [11, Theorem 2.1] while points (i) and (ii) are based on [2, §7.4.1].

Theorem 1

(convergence) (i) If \(f\in C^2\) is bounded below on the sublevel set \(\varOmega \), Assumption 1 holds, condition (9) holds, and (8) is satisfied then \(\lim _{k\rightarrow \infty } {{\mathrm{\mathrm {grad}}}}f(x_k)=0\). (ii) If \(f\in C^2,\; \mathcal {M}\) is compact, Assumption 1 holds, and (9) holds then \(\lim _{k\rightarrow \infty }{{\mathrm{\mathrm {grad}}}}f(x_k)=0,\; \{x_k\}\) has at least one limit point, and every limit point of \(\{x_k\}\) is a stationary point of \(f\). (iii) If \(f\in C^2\), the sublevel set \(\varOmega \) is compact, \(f\) has a unique stationary point \(x^*\) in \(\varOmega \), Assumption 1 holds, condition (9) holds, and (8) is satisfied then \(\{x_k\}\) converges to \(x^*\).

Proof

(i) Observe that the proof of [2, Theorem 7.4.4] still holds when condition [2, (7.25)] is weakened to its restriction (8) to \(\varOmega \). Indeed, since the trust-region method is a descent iteration, it follows that all iterates are in \(\varOmega \). The assumptions thus allow us to conclude, by [2, Theorem 7.4.4], that \(\lim _{k\rightarrow \infty }{{\mathrm{\mathrm {grad}}}}f(x_k)=0\). (ii) It follows from [2, Proposition 7.4.5] and [2, Corollary 7.4.6] that all the assumptions of [2, Theorem 7.4.4] hold. Hence \(\lim _{k\rightarrow \infty }{{\mathrm{\mathrm {grad}}}}f(x_k)=0\), and every limit point is thus a stationary point of \(f\). Since \(\mathcal {M}\) is compact, \(\{x_k\}\) is guaranteed to have at least one limit point. (iii) Again by [2, Theorem 7.4.4], we get that \(\lim _{k\rightarrow \infty }{{\mathrm{\mathrm {grad}}}}f(x_k)=0\). Since \(\{x_k\}\) belongs to the compact set \(\varOmega \) and cannot have limit points other than \(x^*\), it follows that \(\{x_k\}\) converges to \(x^*\). \(\square \)

3.3 More notation and standing assumptions

For the purpose of conducting a local convergence analysis, we now assume that \(\{x_k\}\) converges to a point \(x^*\). Moreover, we assume throughout that \(f\in C^2\).

We let \(\mathcal {U}_{\mathrm {trn}}\) be a totally retractive neighborhood of \(x^*\), a concept inspired from the notion of totally normal neighborhood (see [13, §3.3]). By this, we mean that there is \(\delta _{\mathrm{trn}}>0\) such that, for each \(y\in \mathcal {U}_{\mathrm {trn}}\), we have that \(R_y(\mathbb {B}(0_y,\delta _{\mathrm {trn}}))\supseteq \mathcal {U}_{\mathrm {trn}}\) and \(R_y(\cdot )\) is a diffeomorphism on \(\mathbb {B}(0_y,\delta _{\mathrm {trn}})\), where \(\mathbb {B}(0_y,\delta _{\mathrm {trn}})\) denotes the ball of radius \(\delta _{\mathrm {trn}}\) in \({{\mathrm{\mathrm {T}}}}_y\mathcal {M}\) centered at the origin \(0_y\). The existence of a totally retractive neighborhood can be shown along the lines of [13, Theorem 3.3.7]. We assume without loss of generality that \(\{x_k\}\subset \mathcal {U}_{\mathrm {trn}}\). Whenever we consider an inverse retraction \(R_x^{-1}(y)\), we implicitly assume that \(x,y\in \mathcal {U}_{\mathrm {trn}}\).

3.4 Local convergence analysis

The purpose of this section is to obtain a superlinear convergence result for Algorithm 1, stated in Theorem 2. The analysis can be viewed as a Riemannian generalization of the local analysis in [11, §2]. As we proceed, we will point out the main hurdles that had to be overcome in the generalization. The analysis makes use of several preparation lemmas, independent of Algorithm 1, that are of potential interest in the broader context of Riemannian optimization. These preparation lemmas become trivial or well known in the Euclidean context.

The next assumption corresponds to a part of [11, (A1)].

Assumption 2

The point \(x^*\) is a nondegenerate local minimizer of \(f\). In other words, \({{\mathrm{\mathrm {grad}}}}f(x^*)=0\) and \({{\mathrm{\mathrm {Hess}}}}f(x^*)\) is positive definite.

The next assumption generalizes the assumption, contained in [11, (A1)], that the Hessian of \(f\) is Lipschitz continuous near \(x^*\). (Recall that \(\mathcal {T}_S\) is the vector transport invoked in Algorithm 1.) Note that the assumption holds if \(f\in C^3\); see Lemma 4.

Assumption 3

There exists a constant \(c\) such that for all \(x, y \in \mathcal {U}_{\mathrm {trn}}\),

$$\begin{aligned} \Vert {{\mathrm{\mathrm {Hess}}}}f(y) - \mathcal {T}_{S_{\eta }} {{\mathrm{\mathrm {Hess}}}}f(x) \mathcal {T}_{S_{\eta }}^{-1}\Vert \le c {{\mathrm{\mathrm {dist}}}}(x, y), \end{aligned}$$

where \({{\mathrm{\mathrm {Hess}}}}f(x)\) is the Riemannian Hessian of \(f\) at \(x\) (see, e.g., [2, §5.5]), \(\eta = R_x^{-1}(y)\), and \(\Vert \cdot \Vert \) is also used to denote the operator norm induced by the Riemannian norm (5).

The next assumption is introduced to handle the Riemannian case; in the classical Euclidean setting, Assumption 4 follows from Assumption 3. Assumption 4 is mild since it holds if \(f\in C^3\), as shown in Lemma 4.

Assumption 4

There exists a constant \(c_0\) such that for all \(x, y \in \mathcal {U}_{\mathrm {trn}}\), all \(\xi _x \in {{\mathrm{\mathrm {T}}}}_x \mathcal {M}\) with \(R_x(\xi _x) \in \mathcal {U}_{\mathrm {trn}}\), and all \(\xi _y \in {{\mathrm{\mathrm {T}}}}_y \mathcal {M}\) with \(R_y(\xi _y) \in \mathcal {U}_{\mathrm {trn}}\), the inequality

$$\begin{aligned} \Vert {{\mathrm{\mathrm {Hess}}}}\hat{f}_y(\xi _y) - \mathcal {T}_{S_{\eta }} {{\mathrm{\mathrm {Hess}}}}\hat{f}_x(\xi _x) \mathcal {T}_{S_{\eta }}^{-1}\Vert \le c_0 (\Vert \xi _y\Vert + \Vert \xi _x\Vert + \Vert \eta \Vert ) \end{aligned}$$

holds, where \(\eta = R_x^{-1}(y),\; \hat{f}_x = f \circ R_x\), and \(\hat{f}_y = f \circ R_y\).

The next assumption corresponds to [11, (A2)]. It implies that no updates of \(\mathcal {B}_k\) are skipped. In the Euclidean case, Khalfan et al. [24] show that this is usually the case in practice.

Assumption 5

The inequality

$$\begin{aligned} |g(s_k, y_k - \mathcal {B}_k s_k)| \ge \nu \Vert s_k\Vert \Vert y_k - \mathcal {B}_k s_k\Vert \end{aligned}$$

holds for all \(k\).

The next assumption is introduced to handle the Riemannian case. It states that the iterates eventually continuously stay in the totally retractive neighborhood \(\mathcal {U}_{\mathrm {trn}}\) (the terminology is borrowed from [5, Definition 2.8]). The assumption is needed, in particular, for Lemma 5. Note that, whereas in the Euclidean setting the assumption follows from the standing assumption that \(\{x_k\}\) converges to \(x^*\), this is no longer the case on some Riemannian manifolds, where \(\{x_k\}\) may converge to \(x^*\) while the connecting curves \(\{R_{x_k}(ts_k): t\in [0,1]\}\) do not eventually remain in \(\mathcal {U}_{\mathrm {trn}}\). Assumption 6 is thus invoked to ensure that we are in a position to carry out a local convergence analysis.

Assumption 6

There exists \(N\) such that, for all \(k\ge N\) and all \(t\in [0,1]\), \(R_{x_k}(t s_k) \in \mathcal {U}_{\mathrm {trn}}\).

The next lemma is proved in [19, Lemma 14.1].

Lemma 1

Let \(\mathcal {M}\) be a Riemannian manifold, let \(\mathcal {U}\) be a compact coordinate neighborhood in \(\mathcal {M}\), and let the hat denote coordinate expressions. Then there are \(c_2 > c_1 > 0\) such that, for all \(x, y \in \mathcal {U}\), we have

$$\begin{aligned} c_1 \Vert \hat{x} - \hat{y}\Vert _2 \le {{\mathrm{\mathrm {dist}}}}(x, y) \le c_2 \Vert \hat{x} - \hat{y}\Vert _2, \end{aligned}$$

where \(\Vert \cdot \Vert _2\) denotes the Euclidean norm, i.e., \(\Vert \hat{x}\Vert _2 = \sqrt{\hat{x}^T\hat{x}}\).

Lemma 2

Let \(\mathcal {M}\) be a Riemannian manifold endowed with a retraction \(R\) and let \(\bar{x} \in \mathcal {M}\). Then there exist \(a_0 > 0\), \(a_1 > 0\), and \(\delta _{a_0, a_1} > 0\) such that for all \(x\) in a sufficiently small neighborhood of \(\bar{x}\) and all \(\xi , \eta \in {{\mathrm{\mathrm {T}}}}_x \mathcal {M}\) with \(\Vert \xi \Vert \le \delta _{a_0, a_1}\) and \(\Vert \eta \Vert \le \delta _{a_0, a_1}\), the inequalities

$$\begin{aligned} a_0 \Vert \xi - \eta \Vert \le {{\mathrm{\mathrm {dist}}}}(R_x(\eta ), R_x (\xi )) \le a_1 \Vert \xi - \eta \Vert \end{aligned}$$

hold.

Proof

Since \(R\) is smooth, we can choose a neighborhood small enough such that \(R\) satisfies the condition of [30, Lemma 6], and the result follows from that lemma.\(\square \)

The following lemma follows from Lemma 2 by setting \(\eta =0\). We state it separately for convenience as we will frequently invoke it in the analysis.

Lemma 3

Let \(\mathcal {M}\) be a Riemannian manifold endowed with retraction \(R\) and let \(\bar{x} \in \mathcal {M}\). Then there exist \(a_0 > 0\), \(a_1 > 0\), and \(\delta _{a_0, a_1} > 0\) such that for all \(x\) in a sufficiently small neighborhood of \(\bar{x}\) and all \(\xi \in {{\mathrm{\mathrm {T}}}}_x \mathcal {M}\) with \(\Vert \xi \Vert \le \delta _{a_0, a_1}\), the inequalities

$$\begin{aligned} a_0 \Vert \xi \Vert \le {{\mathrm{\mathrm {dist}}}}(x, R_x (\xi )) \le a_1 \Vert \xi \Vert \end{aligned}$$

hold.

Lemma 4

If \(f\in C^3\), then Assumptions 3 and 4 hold.

Proof

First, we prove that Assumption 3 holds. Define a function \(h: \mathcal {M} \times \mathcal {M} \times {{\mathrm{\mathrm {T}}}}\mathcal {M} \rightarrow {{\mathrm{\mathrm {T}}}}\mathcal {M}, (x, y, \xi _y) \rightarrow \mathcal {T}_{S_{\eta }} {{\mathrm{\mathrm {Hess}}}}f(x) \mathcal {T}_{S_{\eta }}^{-1} \xi _y\), where \(\eta = R_x^{-1}(y)\). Since \(f \in C^3\), we know that \(h(x, y, \xi _y)\) is \(C^1\). Then there exists \(b_0\) such that for all \(x, y \in \mathcal {U}_{\mathrm {trn}},\) \(\xi _y \in {{\mathrm{\mathrm {T}}}}_y \mathcal {M}, \Vert \xi _y\Vert = 1\),

$$\begin{aligned} \Vert h(y, y, \xi _y) - h(x, y, \xi _y)\Vert&\le b_0 {{\mathrm{\mathrm {dist}}}}(\{y, y, \xi _y\}, \{x, y, \xi _y\}) \\&\le b_1 \Vert \{\hat{y}, \hat{y}, \hat{\xi }_y\} - \{\hat{x}, \hat{y}, \hat{\xi }_y\}\Vert _{2}\quad \hbox {(by Lemma 1)} \\&= b_1 \Vert \hat{y} - \hat{x}\Vert _{2} \\&\le b_2 {{\mathrm{\mathrm {dist}}}}(y, x),\quad \hbox {(by Lemma 1)} \end{aligned}$$

where \(b_0\), \(b_1\) and \(b_2\) are some constants. So we have

$$\begin{aligned} b_2 {{\mathrm{\mathrm {dist}}}}(y, x)&\ge \Vert h(y, y, \xi _y) - h(x, y, \xi _y)\Vert \\&= \Vert ({{\mathrm{\mathrm {Hess}}}}f(y) - \mathcal {T}_{S_{\eta }} {{\mathrm{\mathrm {Hess}}}}f(x) \mathcal {T}_{S_{\eta }}^{-1}) \xi _y\Vert \end{aligned}$$

Given any linear operator \(A\) on \({{\mathrm{\mathrm {T}}}}_y \mathcal {M}\), we have by definition \(\Vert A\Vert = \sup _{\Vert \xi \Vert =1} \Vert A \xi \Vert \). Since the unit sphere \(\{\xi \in {{\mathrm{\mathrm {T}}}}_y \mathcal {M} : \Vert \xi \Vert =1\}\) is compact, there exists \(\xi ^*\) with \(\Vert \xi ^*\Vert = 1\) such that \(\Vert A\Vert = \Vert A \xi ^*\Vert \). Therefore, we can choose \(\xi _y\) with \(\Vert \xi _y\Vert = 1\) such that

$$\begin{aligned} \Vert ({{\mathrm{\mathrm {Hess}}}}f(y) - \mathcal {T}_{S_{\eta }} {{\mathrm{\mathrm {Hess}}}}f(x) \mathcal {T}_{S_{\eta }}^{-1}) \xi _y\Vert = \Vert ({{\mathrm{\mathrm {Hess}}}}f(y) - \mathcal {T}_{S_{\eta }} {{\mathrm{\mathrm {Hess}}}}f(x) \mathcal {T}_{S_{\eta }}^{-1})\Vert . \end{aligned}$$

We obtain

$$\begin{aligned} \Vert {{\mathrm{\mathrm {Hess}}}}f(y) - \mathcal {T}_{S_{\eta }} {{\mathrm{\mathrm {Hess}}}}f(x) \mathcal {T}_{S_{\eta }}^{-1}\Vert \le b_2 {{\mathrm{\mathrm {dist}}}}(y, x). \end{aligned}$$

To prove Assumption 4, we redefine \(h\) as \(h(y, x, \xi _x) = \mathcal {T}_{S_{\eta }} {{\mathrm{\mathrm {Hess}}}}\hat{f}_x(\xi _x) \mathcal {T}_{S_{\eta }}^{-1}\). If we use orthonormal vector fields to obtain the coordinate expression of \(h\), denoted by \(\hat{h}\), then the manifold norm and the Euclidean norm of coordinate expressions are the same and we have

$$\begin{aligned} \Vert {{\mathrm{\mathrm {Hess}}}}\hat{f}_y(\xi _y) - \mathcal {T}_{S_{\eta }} {{\mathrm{\mathrm {Hess}}}}\hat{f}_x(\xi _x) \mathcal {T}_{S_{\eta }}^{-1}\Vert = \Vert {{\mathrm{\mathrm {Hess}}}}\hat{f}_y(\hat{\xi }_y) - \hat{\mathcal {T}}_{S_{\eta }} {{\mathrm{\mathrm {Hess}}}}\hat{f}_x(\hat{\xi }_x) \hat{\mathcal {T}}_{S_{\eta }}^{-1}\Vert _2.\nonumber \\ \end{aligned}$$
(11)

Since \(f \in C^3\), we know that \(\hat{h}\) is also in \(C^1\). Hence there exists a constant \(b_3\) such that

$$\begin{aligned} \Vert \hat{h}(\hat{y}, \hat{y}, \hat{\xi }_y) - \hat{h}(\hat{y}, \hat{x}, \hat{\xi }_x)\Vert _2 \le b_3 \Vert \{\hat{y}, \hat{y}, \hat{\xi }_y\} - \{\hat{y}, \hat{x}, \hat{\xi }_x\}\Vert _2. \end{aligned}$$

Therefore,

$$\begin{aligned}&\Vert {{\mathrm{\mathrm {Hess}}}}\hat{f}_y(\hat{\xi }_y) - \hat{\mathcal {T}}_{S_{\eta }} {{\mathrm{\mathrm {Hess}}}}\hat{f}_x(\hat{\xi }_x) \hat{\mathcal {T}}_{S_{\eta }}^{-1}\Vert _2 \\&\quad = \Vert \hat{h}(\hat{y}, \hat{y}, \hat{\xi }_y) - \hat{h}(\hat{y}, \hat{x}, \hat{\xi }_x)\Vert _2 \\&\quad \le b_3 \Vert \{\hat{y}, \hat{y}, \hat{\xi }_y\} - \{\hat{y}, \hat{x}, \hat{\xi }_x\}\Vert _2 \\&\quad \le b_4 (\Vert \hat{y} - \hat{x}\Vert _2 + \Vert \hat{\xi }_y\Vert _{2} + \Vert \hat{\xi }_x\Vert _2) \\&\quad \le b_5 ({{\mathrm{\mathrm {dist}}}}(x, y) + \Vert \hat{\xi }_y\Vert _2 + \Vert \hat{\xi }_x\Vert _2) \hbox { (by Lemma 1)} \\&\quad \le b_6 (\Vert \eta \Vert + \Vert {\xi }_y\Vert + \Vert {\xi }_x\Vert ). \hbox { (by Lemma 3)} \end{aligned}$$

This and (11) give us Assumption 4.\(\square \)

The next lemma generalizes [11, Lemma 2.2]. The key difference with the Euclidean case is the following: in the Euclidean case, when \(s_k\) is accepted, we simply have \(\Vert s_k\Vert =\Vert x_{k+1}-x_{k}\Vert \), while in the Riemannian generalization, we invoke Assumption 6 and Lemma 3 to deduce that \(\Vert s_k\Vert \le \frac{1}{a_0} {{\mathrm{\mathrm {dist}}}}(x_{k + 1}, x_k)\). Note that Assumption 6 cannot be removed. To see this, consider for example the unit sphere with the exponential retraction, where we can have \(x_k=x_{k+1}\) with \(\Vert s_k\Vert =2\pi \).

Lemma 5

Suppose Assumption 6 holds. Then either

$$\begin{aligned} \varDelta _k \rightarrow 0 \end{aligned}$$
(12)

or there exist \(K > 0\) and \(\varDelta > 0\) such that for all \(k > K\)

$$\begin{aligned} \varDelta _k = \varDelta . \end{aligned}$$
(13)

In either case \(s_k \rightarrow 0\).

Proof

Let \(\varDelta = \liminf \varDelta _k\) and suppose first that \(\varDelta > 0\). From line 1 of Algorithm 1, if \(\varDelta _k\) is increased, then \(\Vert s_k\Vert \ge 0.8 \varDelta _k\) and \(x_{k+1} = R_{x_k}(s_k)\), which implies by Lemma 3 and Assumption 6 that \({{\mathrm{\mathrm {dist}}}}(x_k,x_{k+1})\ge 0.8\, a_0 \varDelta _k\). The latter inequality cannot hold for infinitely many values of \(k\) since \(x_k\rightarrow x^*\) and \(\liminf \varDelta _k>0\). Hence, there exists \(K \ge 0\) such that \(\varDelta _k\) is not increased for any \(k \ge K\). Since \(\varDelta > 0\), this implies that \(\varDelta _k \ge \varDelta \) for all \(k \ge K\). In view of the trust-region update mechanism in Algorithm 1 and since \(\varDelta = \liminf \varDelta _k\), we also know that, for some \(K_1 > K,\; \varDelta _{K_1} < \frac{1}{\tau _1} \varDelta \). Since no increase occurs for \(k \ge K_1\), we have \(\varDelta _k \le \varDelta _{K_1} < \frac{1}{\tau _1} \varDelta \) for all \(k \ge K_1\); if the trust-region radius were decreased at some such \(k\), we would obtain \(\varDelta _{k+1} \le \tau _1 \varDelta _k < \varDelta \), which we have ruled out. Since neither increase nor decrease can occur, we must have \(\varDelta _k = \varDelta \) for all \(k \ge K_1\).

Suppose now that \(\varDelta = 0\). Since \(x_k\rightarrow x^*\), for every \(\epsilon > 0\) there exists \(K_{\epsilon } \ge 0\) such that \({{\mathrm{\mathrm {dist}}}}(x_{k+1},x_k) < \epsilon \) for all \(k \ge K_{\epsilon }\). Since \(\liminf \varDelta _k = 0\), there exists \(j \ge K_{\epsilon }\) such that \(\varDelta _j < \epsilon \). But since \(\varDelta _k\) is increased only if \(\varDelta _k \le \frac{1}{0.8} \Vert s_k\Vert \le \frac{1}{0.8 a_0}{{\mathrm{\mathrm {dist}}}}(x_{k+1},x_k) < \frac{\epsilon }{0.8 a_0}\), and the increase factor is \(\tau _2\), we have that \(\varDelta _k < \frac{\tau _2 \epsilon }{0.8 a_0}\) for all \(k \ge j\). Therefore (12) follows.

To show that \(\Vert s_k\Vert \rightarrow 0\), note that if (12) is true, then clearly \(\Vert s_k\Vert \rightarrow 0\). If (13) is true, then for all \(k > K\), the step \(s_k\) is accepted and \(\Vert s_k\Vert \le \frac{1}{a_0} {{\mathrm{\mathrm {dist}}}}(x_{k + 1}, x_k)\) (by Lemma 3), hence \(\Vert s_k\Vert \rightarrow 0\) since \(\{x_k\}\) converges.\(\square \)

Lemma 6

Let \(\mathcal {M}\) be a Riemannian manifold endowed with two vector transports \(\mathcal {T}_1\) and \(\mathcal {T}_2\), and let \(\bar{x} \in \mathcal {M}\). Then there exist a constant \(a_4\) and a neighborhood \(\mathcal {U}\) of \(\bar{x}\) such that for all \(x, y \in \mathcal {U}\) and all \(\xi \in {{\mathrm{\mathrm {T}}}}_y\mathcal {M}\),

$$\begin{aligned} \Vert \mathcal {T}_{1_{\eta }}^{-1} \xi - \mathcal {T}_{2_{\eta }}^{-1} \xi \Vert \le a_4 \Vert \xi \Vert \Vert \eta \Vert , \end{aligned}$$

where \(\eta = R_{x}^{-1}(y)\).

Proof

We use the hat to denote coordinate expressions. Let \(T_1(\hat{x}, \hat{\eta })\) and \(T_2(\hat{x}, \hat{\eta })\) denote the coordinate expression of \(\mathcal {T}_{1_{\eta }}^{-1}\) and \(\mathcal {T}_{2_{\eta }}^{-1}\), respectively. Then

$$\begin{aligned}&\Vert \mathcal {T}_{1_{\eta }}^{-1} \xi - \mathcal {T}_{2_{\eta }}^{-1} \xi \Vert \\&\quad \le b_0 \Vert (T_1(\hat{x}, \hat{\eta }) - T_2(\hat{x}, \hat{\eta })) \hat{\xi }\Vert _2 \\&\quad \le b_0 \Vert \hat{\xi }\Vert _2 \Vert T_1(\hat{x}, \hat{\eta }) - T_2(\hat{x}, \hat{\eta })\Vert _2 \\&\quad \le b_1 \Vert \hat{\xi }\Vert _2 \Vert \hat{\eta }\Vert _2 (\hbox {since}\; T_1(\hat{x}, 0) = T_2(\hat{x}, 0)\; \hbox {and both}\; T_{1}\; \hbox {and}\; T_{2} \; \hbox {are smooth}) \\&\quad \le b_2 \Vert \xi \Vert \Vert \eta \Vert \end{aligned}$$

for some constants \(b_0,\; b_1\), and \(b_2\).\(\square \)

The next lemma is proved in [19, Lemma 14.5].

Lemma 7

Let \(F\) be a \(C^1\) vector field on a Riemannian manifold \(\mathcal {M}\) and let \(\bar{x} \in \mathcal {M}\) be a nondegenerate zero of \(F\). Then there exist a neighborhood \(\mathcal {U}\) of \(\bar{x}\) and \(a_5, a_6 > 0\) such that for all \(x \in \mathcal {U}\),

$$\begin{aligned} a_5 {{\mathrm{\mathrm {dist}}}}(x, \bar{x}) \le \Vert F(x)\Vert \le a_6 {{\mathrm{\mathrm {dist}}}}(x, \bar{x}). \end{aligned}$$

In the Euclidean case, the next lemma holds with \(\tilde{a}_7=0\) and reduces to the Fundamental Theorem of Calculus.

Lemma 8

Let \(F\) be a \(C^1\) vector field on a Riemannian manifold \(\mathcal {M}\), let \(R\) be a retraction on \(\mathcal {M}\), and let \(\bar{x} \in \mathcal {M}\). Then there exist a neighborhood \(\mathcal {U}\) of \(\bar{x}\) and a constant \(\tilde{a}_7\) such that for all \(x, y \in \mathcal {U}\),

$$\begin{aligned} \Vert P_{\gamma }^{0 \leftarrow 1} F(y) - F(x) - \left[ \int \limits _0^1 P_{\gamma }^{0 \leftarrow t} \mathbb {D} F(\gamma (t)) P_{\gamma }^{t \leftarrow 0}d t\right] \eta \Vert \le \tilde{a}_7 \Vert \eta \Vert ^2, \end{aligned}$$

where \(\eta = R_x^{-1} (y)\) and \(P_\gamma \) is the parallel translation along the curve \(\gamma \) given by \(\gamma (t) = R_x(t \eta )\).

Proof

Define \(G: [0, 1] \rightarrow {{\mathrm{\mathrm {T}}}}_x \mathcal {M}: t \mapsto G(t) = P_{\gamma }^{0 \leftarrow t} F(\gamma (t))\). Observe that \(G(0) = F(x)\) and \(G(1) = P_{\gamma }^{0 \leftarrow 1} F(y)\). We have

$$\begin{aligned} G'(t)&= \frac{d}{d \epsilon } G(t + \epsilon )|_{\epsilon = 0} \\&= P_{\gamma }^{0 \leftarrow t} \frac{d}{d \epsilon } P_{\gamma }^{t \leftarrow t + \epsilon } F(\gamma (t + \epsilon ))|_{\epsilon = 0} \\&= P_{\gamma }^{0 \leftarrow t} \mathbb {D} F(\gamma (t)) \left[ \frac{d}{d \epsilon } \gamma (t + \epsilon )\right] |_{\epsilon = 0} \\&= P_{\gamma }^{0 \leftarrow t} \mathbb {D} F(\gamma (t)) \left[ \mathcal {T}_{R(t \eta )} \eta \right] , \end{aligned}$$

where we have used an expression of the covariant derivative \(\mathbb {D}\) in terms of the parallel translation \(P\) (see, e.g., [14, theorem I.2.1]), and where \(\mathcal {T}_{R(t \eta )} \eta = \frac{d}{dt}(R(t\eta ))\). Since \(G(1) - G(0) = \int _0^1 G'(t) d t\), we obtain

$$\begin{aligned}&\Vert P_{\gamma }^{0 \leftarrow 1} F(y) - F(x) -\int \limits _0^1 P_{\gamma }^{0 \leftarrow t} \mathbb {D} F(\gamma (t)) P_{\gamma }^{t \leftarrow 0} \eta d t\Vert \\&\quad = \Vert \int \limits _0^1 P_{\gamma }^{0 \leftarrow t} \mathbb {D} F(\gamma (t)) (\mathcal {T}_{R(t \eta )} \eta - P_{\gamma }^{t \leftarrow 0} \eta ) d t\Vert \\&\quad \le \int \limits _0^1 \Vert P_{\gamma }^{0 \leftarrow t} \mathbb {D} F(\gamma (t)) P_{\gamma }^{t \leftarrow 0} \Vert \Vert (P_{\gamma }^{0 \leftarrow t} \mathcal {T}_{R(t \eta )} \eta - \eta )\Vert d t \\&\quad \le \int \limits _0^1 \Vert P_{\gamma }^{0 \leftarrow t} \mathbb {D} F(\gamma (t)) P_{\gamma }^{t \leftarrow 0} \Vert \Vert (P_{\gamma }^{0 \leftarrow t} \mathcal {T}_{R(t \eta )} \eta - \mathcal {T}_{R(t \eta )}^{-1} \mathcal {T}_{R(t \eta )}\eta )\Vert d t \\&\quad \le b_0 \Vert \eta \Vert ^2 \hbox { (by Lemma 6)} \end{aligned}$$

where \(b_0\) is some constant.\(\square \)

Lemma 9

Suppose Assumptions 2 and 3 hold. Then there exist a neighborhood \(\mathcal {U}\) and a constant \(a_7\) such that for all \(x_1,\; \tilde{x}_1,\; x_2\), and \(\tilde{x}_2 \in \mathcal {U}\), we have

$$\begin{aligned}&|g(\mathcal {T}_{S_{\zeta }} \xi _1, y_2) - g(\mathcal {T}_{S_{\zeta }} y_1, \xi _2)| \\&\quad \le a_7 \max \{{{\mathrm{\mathrm {dist}}}}(x_1, x^*), {{\mathrm{\mathrm {dist}}}}(x_2, x^*), {{\mathrm{\mathrm {dist}}}}(\tilde{x}_1, x^*), {{\mathrm{\mathrm {dist}}}}(\tilde{x}_2, x^*)\} \Vert \xi _1\Vert \Vert \xi _2\Vert , \end{aligned}$$

where \(\zeta = R_{x_1}^{-1} (x_2)\), \(\xi _1 = R_{x_1}^{-1} (\tilde{x}_1)\), \(\xi _2 = R_{x_2}^{-1} (\tilde{x}_2)\), \(y_1 = \mathcal {T}_{S_{\xi _1}}^{-1} {{\mathrm{\mathrm {grad}}}}f(\tilde{x}_1) - {{\mathrm{\mathrm {grad}}}}f(x_1)\), and \(y_2 = \mathcal {T}_{S_{\xi _2}}^{-1} {{\mathrm{\mathrm {grad}}}}f(\tilde{x}_2) - {{\mathrm{\mathrm {grad}}}}f(x_2)\).

Proof

Define \(\bar{y}_1 = P_{\gamma _1}^{0 \leftarrow 1} {{\mathrm{\mathrm {grad}}}}f(\tilde{x}_1) - {{\mathrm{\mathrm {grad}}}}f(x_1)\) and \(\bar{y}_2 = P_{\gamma _2}^{0 \leftarrow 1} {{\mathrm{\mathrm {grad}}}}f(\tilde{x}_2) - {{\mathrm{\mathrm {grad}}}}f(x_2)\), where \(P\) is the parallel transport, \(\gamma _1(t) = R_{x_1}(t \xi _1)\), and \(\gamma _2(t) = R_{x_2}(t \xi _2)\). From Lemma 8, we have

$$\begin{aligned} \Vert \bar{y}_1 - \bar{H}_1(x_1, \tilde{x}_1) \xi _1\Vert \le b_0 \Vert \xi _1\Vert ^2 \quad \text {and} \quad \Vert \bar{y}_2 - \bar{H}_2(x_2, \tilde{x}_2) \xi _2\Vert \le b_0 \Vert \xi _2\Vert ^2, \end{aligned}$$
(14)

where \(b_0\) is a positive constant, \(\bar{H}_1(x_1, \tilde{x}_1) = \int _0^1 P_{\gamma _1}^{0 \leftarrow t} {{\mathrm{\mathrm {Hess}}}}f(\gamma _1(t)) P_{\gamma _1}^{t \leftarrow 0} d t\) and \(\bar{H}_2(x_2, \tilde{x}_2) = \int _0^1 P_{\gamma _2}^{0 \leftarrow t} {{\mathrm{\mathrm {Hess}}}}f(\gamma _2(t)) P_{\gamma _2}^{t \leftarrow 0} d t\). It follows that

$$\begin{aligned}&|g(\mathcal {T}_{S_{\zeta }} \xi _1, y_2) - g(\mathcal {T}_{S_{\zeta }} y_1, \xi _2)| \nonumber \\&\quad \le |g(\mathcal {T}_{S_{\zeta }} \xi _1, \bar{y}_2) - g(\mathcal {T}_{S_{\zeta }} \bar{y}_1, \xi _2)| + |g(\mathcal {T}_{S_{\zeta }} \xi _1, y_2 - \bar{y}_2) - g(\mathcal {T}_{S_{\zeta }} (y_1 - \bar{y}_1), \xi _2)| \nonumber \\&\quad \le |g(\mathcal {T}_{S_{\zeta }} \xi _1, \bar{H}_2(x_2, \tilde{x}_2) \xi _2) - g(\mathcal {T}_{S_{\zeta }} \bar{H}_1(x_1, \tilde{x}_1) \xi _1, \xi _2)| \nonumber \\&\quad \quad + b_1 (\Vert \xi _1\Vert + \Vert \xi _2\Vert ) \Vert \xi _1\Vert \Vert \xi _2\Vert \hbox { (by (14))} \nonumber \\&\quad \quad + |g(\mathcal {T}_{S_{\zeta }} \xi _1, \mathcal {T}_{S_{\xi _2}}^{-1} {{\mathrm{\mathrm {grad}}}}f(\tilde{x}_2) - P_{\gamma _2}^{0 \leftarrow 1} {{\mathrm{\mathrm {grad}}}}f(\tilde{x}_2))| \nonumber \\&\quad \quad + |g(\mathcal {T}_{S_{\zeta }} (\mathcal {T}_{S_{\xi _1}}^{-1} {{\mathrm{\mathrm {grad}}}}f(\tilde{x}_1) - P_{\gamma _1}^{0 \leftarrow 1} {{\mathrm{\mathrm {grad}}}}f(\tilde{x}_1)), \xi _2)| \nonumber \\&\quad \le |g(\mathcal {T}_{S_{\zeta }} \xi _1, \bar{H}_2(x_2, \tilde{x}_2) \xi _2) - g(\mathcal {T}_{S_{\zeta }} \bar{H}_1(x_1, \tilde{x}_1) \xi _1, \xi _2)| \nonumber \\&\quad \quad + b_1 (\Vert \xi _1\Vert + \Vert \xi _2\Vert ) \Vert \xi _1\Vert \Vert \xi _2\Vert \nonumber \\&\quad \quad + b_2 \Vert \xi _1\Vert \Vert \xi _2\Vert \Vert {{\mathrm{\mathrm {grad}}}}f(\tilde{x}_2)\Vert + b_3 \Vert \xi _1\Vert \Vert \xi _2\Vert \Vert {{\mathrm{\mathrm {grad}}}}f(\tilde{x}_1)\Vert \hbox { (by Lemma 6)} \nonumber \\&\quad \le |g(\bar{H}_2(x_2, \tilde{x}_2) \mathcal {T}_{S_{\zeta }} \xi _1, \xi _2) - g(\mathcal {T}_{S_{\zeta }} \bar{H}_1(x_1, \tilde{x}_1) \xi _1, \xi _2)| \nonumber \\&\quad \qquad \hbox { (average Hessian is also self-adjoint)} \nonumber \\&\quad \quad + b_4 \Vert \xi _1\Vert \Vert \xi _2\Vert ({{\mathrm{\mathrm {dist}}}}(x_1, \tilde{x}_1) + {{\mathrm{\mathrm {dist}}}}(x_2, \tilde{x}_2) + {{\mathrm{\mathrm {dist}}}}(\tilde{x}_2, x^*) + {{\mathrm{\mathrm {dist}}}}(\tilde{x}_1, x^*)) \nonumber \\&\quad \qquad \hbox { (by Lemmas 3 and 7)} \nonumber \\&\quad \le b_5 \Vert \xi _1\Vert \Vert \xi _2\Vert \max \{{{\mathrm{\mathrm {dist}}}}(x_1, x^*), {{\mathrm{\mathrm {dist}}}}(x_2, x^*), {{\mathrm{\mathrm {dist}}}}(\tilde{x}_1, x^*), {{\mathrm{\mathrm {dist}}}}(\tilde{x}_2, x^*)\} \nonumber \\&\quad \qquad \hbox { (by triangle inequality of distance)} \nonumber \\&\quad \quad + |g(\bar{H}_2(x_2, \tilde{x}_2) \mathcal {T}_{S_{\zeta }} \xi _1, \xi _2) - g(\mathcal {T}_{S_{\zeta }} \bar{H}_1(x_1, \tilde{x}_1) \xi _1, \xi _2)| \end{aligned}$$
(15)

where \(b_1,\; b_2,\; b_3,\; b_4\), and \(b_5\) are some constants. Using the hat to denote coordinate expressions, \(T(\hat{x}_1, \hat{x}_2)\) to denote the matrix expression of \(\mathcal {T}_{S_{\zeta }}\), and \(\hat{G}(\hat{x}_2)\) to denote the matrix expression of the Riemannian metric at \(x_2\), we have

$$\begin{aligned}&|g(\bar{H}_2(x_2, \tilde{x}_2) \mathcal {T}_{S_{\zeta }} \xi _1, \xi _2) - g(\mathcal {T}_{S_{\zeta }} \bar{H}_1(x_1, \tilde{x}_1) \xi _1, \xi _2)| \nonumber \\&\quad = |\hat{\xi }_1^T T(\hat{x}_1, \hat{x}_2)^T \hat{\bar{H}}_2(\hat{x}_2, \hat{\tilde{x}}_2)^T \hat{G}(\hat{x}_2) \hat{\xi }_2 - \hat{\xi }_1^T \hat{\bar{H}}_1(\hat{x}_1, \hat{\tilde{x}}_1)^T T(\hat{x}_1, \hat{x}_2)^T \hat{G}(\hat{x}_2) \hat{\xi }_2| \nonumber \\&\quad \le \Vert \hat{\xi }_1\Vert _2 \Vert T(\hat{x}_1, \hat{x}_2)^T \hat{\bar{H}}_2(\hat{x}_2, \hat{\tilde{x}}_2)^T - \hat{\bar{H}}_1(\hat{x}_1, \hat{\tilde{x}}_1)^T T(\hat{x}_1, \hat{x}_2)^T\Vert _2 \Vert \hat{G}(\hat{x}_2)\Vert _2 \Vert \hat{\xi }_2\Vert _2 \end{aligned}$$
(16)

where \(\Vert \cdot \Vert _2\) denotes the Euclidean norm. Define a function

$$\begin{aligned} J(\hat{x}_1, \hat{\tilde{x}}_1, \hat{x}_2, \hat{\tilde{x}}_2) = T(\hat{x}_1, \hat{x}_2)^T \hat{\bar{H}}_2(\hat{x}_2, \hat{\tilde{x}}_2)^T - \hat{\bar{H}}_1(\hat{x}_1, \hat{\tilde{x}}_1)^T T(\hat{x}_1, \hat{x}_2)^T. \end{aligned}$$

We can see that when \((\hat{x}_1^T, \hat{\tilde{x}}_1^T) = (\hat{x}_2^T, \hat{\tilde{x}}_2^T)\), \(J = 0\). Since, in view of Assumption 3, \(J\) is Lipschitz continuous, it follows that (16) becomes

$$\begin{aligned}&|g(\bar{H}_2 \mathcal {T}_{S_{\zeta }} \xi _1, \xi _2) - g(\mathcal {T}_{S_{\zeta }} \bar{H}_1 \xi _1, \xi _2)| \\&\quad \le b_6 \Vert (\hat{x}_1^T, \hat{\tilde{x}}_1^T) - (\hat{x}_2^T, \hat{\tilde{x}}_2^T)\Vert _2 \Vert \hat{\xi }_1\Vert _2 \Vert \hat{\xi }_2\Vert _2 \\&\quad \le b_7 \Vert \xi _1\Vert \Vert \xi _2\Vert \max \{{{\mathrm{\mathrm {dist}}}}(x_1, x_2), {{\mathrm{\mathrm {dist}}}}(\tilde{x}_1, \tilde{x}_2)\}, \end{aligned}$$

where \(b_6, b_7\) are some constants. Combining this equation with (15), we obtain

$$\begin{aligned}&|g(\mathcal {T}_{S_{\zeta }} \xi _1, y_2) - g(\mathcal {T}_{S_{\zeta }} y_1, \xi _2)| \\&\quad \le b_8 \Vert \xi _1\Vert \Vert \xi _2\Vert \max \{{{\mathrm{\mathrm {dist}}}}(x_1, x^*), {{\mathrm{\mathrm {dist}}}}(x_2, x^*), {{\mathrm{\mathrm {dist}}}}(\tilde{x}_1, x^*), {{\mathrm{\mathrm {dist}}}}(\tilde{x}_2, x^*)\}, \end{aligned}$$

where \(b_8\) is a constant.\(\square \)

Lemma 10

Let \(\mathcal {M}\) be a Riemannian manifold endowed with a vector transport \(\mathcal {T}\) with associated retraction \(R\), and let \(\bar{x} \in \mathcal {M}\). Then there is a neighborhood \(\mathcal {U}\) of \(\bar{x}\) and \(a_8\) such that for all \(x, y \in \mathcal {U}\),

$$\begin{aligned} \Vert {{\mathrm{\mathrm {id}}}}- \mathcal {T}_{{\xi }}^{-1} \mathcal {T}_{{\eta }}^{-1} \mathcal {T}_{{\zeta }} \Vert&\le a_8 \max ({{\mathrm{\mathrm {dist}}}}(x, \bar{x}), {{\mathrm{\mathrm {dist}}}}(y, \bar{x})), \\ \Vert {{\mathrm{\mathrm {id}}}}- \mathcal {T}_{{\zeta }}^{-1} \mathcal {T}_{{\eta }} \mathcal {T}_{{\xi }}\Vert&\le a_8 \max ({{\mathrm{\mathrm {dist}}}}(x, \bar{x}), {{\mathrm{\mathrm {dist}}}}(y, \bar{x})), \end{aligned}$$

where \(\xi = R_{\bar{x}}^{-1}(x),\; \eta = R_{x}^{-1}(y),\; \zeta = R_{\bar{x}}^{-1}(y),\; {{\mathrm{\mathrm {id}}}}\) is the identity operator, and \(\Vert \cdot \Vert \) is the operator norm induced by the Riemannian metric.

Proof

Let the hat denote coordinate expressions, chosen such that the matrix expression of the Riemannian metric at \(\bar{x}\) is the identity. Let \(L(x, y)\) denote \(\mathcal {T}_{{R_x^{-1}(y)}}\). We have

$$\begin{aligned} \Vert {{\mathrm{\mathrm {id}}}}- \mathcal {T}_{{\xi }}^{-1} \mathcal {T}_{{\eta }}^{-1} \mathcal {T}_{{\zeta }} \Vert = \Vert I - L(\bar{x}, x)^{-1} L(x, y)^{-1} L(\bar{x}, y)\Vert . \end{aligned}$$

Define a function \(J(\bar{x}, \xi , \zeta ) = I - L(\bar{x}, R_{\bar{x}}(\xi ))^{-1} L(R_{\bar{x}}(\xi ), R_{\bar{x}}(\zeta ))^{-1} L(\bar{x}, R_{\bar{x}}(\zeta ))\). Notice that \(J\) is a smooth function and \(J(\bar{x}, 0_{\bar{x}}, 0_{\bar{x}}) = 0\). So

$$\begin{aligned} \Vert J(\bar{x}, \xi , \zeta )\Vert&= \Vert J(\bar{x}, \xi , \zeta ) - J(\bar{x}, 0_{\bar{x}}, 0_{\bar{x}})\Vert \\&= \Vert \hat{J}(\hat{\bar{x}}, \hat{\xi }, \hat{\zeta }) - \hat{J}(\hat{\bar{x}}, \hat{0}_{\bar{x}}, \hat{0}_{\bar{x}})\Vert _2 \\&\le b_0 (\Vert \hat{\xi }\Vert _2 + \Vert \hat{\zeta }\Vert _2) \hbox { (smoothness of J)} \\&\le b_1 ({{\mathrm{\mathrm {dist}}}}(x, \bar{x}) + {{\mathrm{\mathrm {dist}}}}(y, \bar{x})) \hbox { (by Lemma~3)} \\&\le b_2 \max ({{\mathrm{\mathrm {dist}}}}(x, \bar{x}), {{\mathrm{\mathrm {dist}}}}(y, \bar{x})), \end{aligned}$$

where \(b_0, b_1\) and \(b_2\) are some constants and \(\Vert \cdot \Vert _2\) denotes the Euclidean norm. So

$$\begin{aligned} \Vert {{\mathrm{\mathrm {id}}}}- \mathcal {T}_{{\xi }}^{-1} \mathcal {T}_{{\eta }}^{-1} \mathcal {T}_{{\zeta }} \Vert \le b_2 \max ({{\mathrm{\mathrm {dist}}}}(x, \bar{x}), {{\mathrm{\mathrm {dist}}}}(y, \bar{x})). \end{aligned}$$

This concludes the first part of the proof. The second part of the result follows from a similar argument.\(\square \)

The next lemma generalizes [15, Lemma 1]. It is instrumental in the proof of Lemma 13 below. In the Euclidean setting, it is possible to give an expression for \(a_9\) and \(a_{10}\) in terms of \(c\) of Assumption 3 and \(\nu \) of Assumption 5. In the Riemannian setting, we could not obtain such an expression, in part because the constant \(b_2\) that appears in the proof below is no longer zero. However, the existence of \(a_9\) and \(a_{10}\) can still be shown, under the assumption that \(\{x_k\}\) converges to \(x^*\), and this is all we need in order to carry on with Lemma 13.

Lemma 11

Suppose Assumptions 1, 2, 3, and 5 hold. Then

$$\begin{aligned} y_j - \tilde{\mathcal {B}}_{j + 1} s_j = 0 \end{aligned}$$
(17)

for all \(j\). Moreover, there exist constants \(a_9\) and \(a_{10}\) such that

$$\begin{aligned} \Vert y_j - (\mathcal {B}_i)_j s_j\Vert \le a_9 a_{10}^{i - j - 2} \epsilon _{i, j} \Vert s_j\Vert \end{aligned}$$
(18)

for all \(j,\; i \ge j+1\), where \(\epsilon _{i, j} = \max _{j \le k \le i} {{\mathrm{\mathrm {dist}}}}(x_k, x^*)\) and

$$\begin{aligned} (\mathcal {B}_i)_j = \mathcal {T}_{S_{\zeta _{j, i}}}^{-1} \mathcal {B}_i \mathcal {T}_{S_{\zeta _{j, i}}} \end{aligned}$$

with \(\zeta _{j, i} = R_{x_j}^{-1} (x_i)\).

Proof

From (3), we have

$$\begin{aligned} \tilde{\mathcal {B}}_{j + 1} s_j = \left( \mathcal {B}_j + \frac{(y_j - \mathcal {B}_j s_j) (y_j - \mathcal {B}_j s_j)^{\flat }}{g(s_j, y_j - \mathcal {B}_j s_j)}\right) s_j = y_j. \end{aligned}$$

This yields (17), as well as (18) with \(i = j + 1\). The proof of (18) for \(i>j+1\) is by induction. We choose \(k \ge j + 1\) and assume that (18) holds for all \(i = j + 1, \ldots , k\). Let \(r_k = y_k - \mathcal {B}_k s_k\). We have

$$\begin{aligned}&|g(r_k, \mathcal {T}_{S_{\zeta _{j, k}}} s_j)| = |g(y_k - \mathcal {B}_k s_k, \mathcal {T}_{S_{\zeta _{j, k}}} s_j)| \\&\quad \le |g(y_k, \mathcal {T}_{S_{\zeta _{j, k}}} s_j) - g(s_k, \mathcal {T}_{S_{\zeta _{j, k}}} y_j)| + |g(s_k, \mathcal {T}_{S_{\zeta _{j, k}}} (y_j - (\mathcal {B}_k)_j s_j))| \\&\quad \quad + |g(s_k, \mathcal {T}_{S_{\zeta _{j, k}}} ((\mathcal {B}_k)_j s_j)) - g(\mathcal {B}_k s_k, \mathcal {T}_{S_{\zeta _{j, k}}} s_j)| \\&\quad \le |g(y_k, \mathcal {T}_{S_{\zeta _{j, k}}} s_j) - g(s_k, \mathcal {T}_{S_{\zeta _{j, k}}} y_j)| + \Vert \mathcal {T}_{S_{\zeta _{j, k}}} (y_j - (\mathcal {B}_k)_j s_j)\Vert \Vert s_k\Vert \nonumber \\&\quad \quad + |g(s_k, \mathcal {B}_k \mathcal {T}_{S_{\zeta _{j, k}}} s_j) - g(\mathcal {B}_k s_k, \mathcal {T}_{S_{\zeta _{j, k}}} s_j)| \\&\quad \le |g(y_k, \mathcal {T}_{S_{\zeta _{j, k}}} s_j) - g(s_k, \mathcal {T}_{S_{\zeta _{j, k}}} y_j)| + b_0 a_9 a_{10}^{k - j - 2} \epsilon _{k, j} \Vert s_j\Vert \Vert s_k\Vert \nonumber \\&\quad \qquad (\mathcal {B}_k\; \hbox {self-adjoint and induction assumption}) \\&\quad \le b_0 a_9 a_{10}^{k - j - 2} \epsilon _{k, j} \Vert s_j\Vert \Vert s_k\Vert + b_1 \epsilon _{k + 1, j} \Vert s_k\Vert \Vert s_j\Vert , \hbox { (by Lemma~9)} \end{aligned}$$

where \(b_0\) and \(b_1\) are some constants. It follows that

$$\begin{aligned}&\Vert y_j - (\mathcal {B}_{k + 1})_j s_j\Vert = \Vert y_j - \mathcal {T}_{S_{\zeta _{j, k + 1}}}^{-1} \mathcal {B}_{k + 1} \mathcal {T}_{S_{\zeta _{j, k + 1}}} s_j\Vert \\&\quad = \Vert y_j - \mathcal {T}_{S_{\zeta _{j, k + 1}}}^{-1} \mathcal {T}_{S_{s_k}} \tilde{\mathcal {B}}_{k + 1} \mathcal {T}_{S_{s_k}}^{-1} \mathcal {T}_{S_{\zeta _{j, k + 1}}} s_j\Vert \\&\quad \le \Vert y_j - \mathcal {T}_{S_{\zeta _{j, k}}}^{-1} \tilde{\mathcal {B}}_{k + 1} \mathcal {T}_{S_{\zeta _{j, k}}} s_j\Vert \\&\quad \quad + \Vert \mathcal {T}_{S_{\zeta _{j, k}}}^{-1} \tilde{\mathcal {B}}_{k + 1} \mathcal {T}_{S_{\zeta _{j, k}}} s_j - \mathcal {T}_{S_{\zeta _{j, k + 1}}}^{-1} \mathcal {T}_{S_{s_k}} \tilde{\mathcal {B}}_{k + 1} \mathcal {T}_{S_{s_k}}^{-1} \mathcal {T}_{S_{\zeta _{j, k + 1}}} s_j\Vert \\&\quad \le \Vert y_j - ((\mathcal {B}_k)_j + \mathcal {T}_{S_{\zeta _{j, k}}}^{-1} \frac{(r_k) (r_k)^{\flat }}{g(s_k, r_k)} \mathcal {T}_{S_{\zeta _{j, k}}}) s_j\Vert + b_2 \epsilon _{k + 1, j} \Vert s_j\Vert \\&\quad \qquad \hbox { (by Lemma 10, Assumption 1, and (3))} \\&\quad \le \Vert y_j - (\mathcal {B}_k)_j s_j\Vert + b_3 \frac{|g(r_k, \mathcal {T}_{S_{\zeta _{j, k}}} s_j)|}{\Vert s_k\Vert } + b_2 \epsilon _{k + 1, j} \Vert s_j\Vert \hbox { (by Assumption 5)} \\&\quad \le a_9 a_{10}^{k - j - 2} \epsilon _{k, j} \Vert s_j\Vert + b_3 b_0 a_9 a_{10}^{k - j - 2} \epsilon _{k, j} \Vert s_j\Vert + b_3 b_1 \epsilon _{k, j} \Vert s_j\Vert + b_2 \epsilon _{k + 1, j} \Vert s_j\Vert \\&\quad \le (a_9 a_{10}^{k - j - 2} + b_3 b_0 a_9 a_{10}^{k - j - 2} + b_3 b_1 + b_2) \epsilon _{k + 1, j} \Vert s_j\Vert , (\hbox {since} \; \epsilon _{k, j} \le \epsilon _{k + 1, j}) \end{aligned}$$

where \(b_2\) and \(b_3\) are some constants. Because \(b_0,\; b_1,\; b_2\), and \(b_3\) are independent of \(a_9\) and \(a_{10}\), we can choose \(a_9\) and \(a_{10}\) large enough such that

$$\begin{aligned} (a_9 a_{10}^{k - j - 2} + b_3 b_0 a_9 a_{10}^{k - j - 2} + b_3 b_1 + b_2) \le a_9 a_{10}^{k + 1 - j - 2}. \end{aligned}$$

for all \(j\) and all \(k \ge j + 1\). Take, for example, \(a_9 > 1\) and \(a_{10} > 1 + b_3 b_0 + b_3 b_1 + b_2\). Therefore

$$\begin{aligned} \Vert y_j - (\mathcal {B}_{k + 1})_j s_j\Vert \le a_9 a_{10}^{k + 1 - j - 2} \epsilon _{k + 1, j} \Vert s_j\Vert . \end{aligned}$$

This concludes the argument by induction.\(\square \)

Lemma 12

If Assumption 3 holds then there exist a neighborhood \(\mathcal {U}\) of \(x^*\) and a constant \(a_{11}\) such that for all \(x_1, x_2 \in \mathcal {U}\), the inequality

$$\begin{aligned} \Vert y - \mathcal {T}_{S_{\zeta _1}} {{\mathrm{\mathrm {Hess}}}}f(x^*) \mathcal {T}_{S_{\zeta _1}}^{-1} s\Vert \le a_{11} \Vert s\Vert \max \{{{\mathrm{\mathrm {dist}}}}(x_1, x^*), {{\mathrm{\mathrm {dist}}}}(x_2, x^*)\} \end{aligned}$$

holds, where \(\zeta _1 = R_{x^*}^{-1} (x_1),\; s = R_{x_1}^{-1} (x_2),\; y = \mathcal {T}_{S_s}^{-1} {{\mathrm{\mathrm {grad}}}}f(x_2) - {{\mathrm{\mathrm {grad}}}}f(x_1)\).

Proof

Define \(\bar{y} = P_{\gamma }^{0 \leftarrow 1} {{\mathrm{\mathrm {grad}}}}f(x_2) - {{\mathrm{\mathrm {grad}}}}f(x_1)\), where \(P\) is the parallel transport along the curve \(\gamma \) defined by \(\gamma (t) = R_{x_1} (t s)\). From Lemma 8, we have

$$\begin{aligned} \Vert \bar{y} - \bar{H} s\Vert \le b_0 \Vert s\Vert ^2, \end{aligned}$$
(19)

where \(\bar{H} = \int _0^1 P_{\gamma }^{0 \leftarrow t} {{\mathrm{\mathrm {Hess}}}}f(\gamma (t)) P_{\gamma }^{t \leftarrow 0} d t\) and \(b_0\) is a constant. We then have

$$\begin{aligned}&\Vert y - \mathcal {T}_{S_{\zeta _1}} {{\mathrm{\mathrm {Hess}}}}f(x^*) \mathcal {T}_{S_{\zeta _1}}^{-1} s\Vert \\&\quad \le \Vert y - \bar{y}\Vert + \Vert \bar{y} - \bar{H} s\Vert + \Vert \bar{H} s - \mathcal {T}_{S_{\zeta _1}} {{\mathrm{\mathrm {Hess}}}}f(x^*) \mathcal {T}_{S_{\zeta _1}}^{-1} s\Vert \\&\quad = \Vert \mathcal {T}_{S_{\zeta }}^{-1} {{\mathrm{\mathrm {grad}}}}f(x_2) - P_{\gamma }^{0 \leftarrow 1} {{\mathrm{\mathrm {grad}}}}f(x_2)\Vert + b_0 \Vert s\Vert ^2 \nonumber \\&\quad \quad + \Vert \bar{H} - \mathcal {T}_{S_{\zeta _1}} {{\mathrm{\mathrm {Hess}}}}f(x^*) \mathcal {T}_{S_{\zeta _1}}^{-1}\Vert \Vert s\Vert \\&\quad \le b_1 \Vert s\Vert \max \{{{\mathrm{\mathrm {dist}}}}(x_1, x^*), {{\mathrm{\mathrm {dist}}}}(x_2, x^*)\} + b_0 \Vert s\Vert ^2 \hbox { (by Lemma 6)} \\&\quad \quad + \left( \Vert \int \limits _0^1 P_{\gamma }^{0 \leftarrow t} {{\mathrm{\mathrm {Hess}}}}f(\gamma (t)) P_{\gamma }^{t \leftarrow 0} d t - {{\mathrm{\mathrm {Hess}}}}f(x_1)\Vert \right. \\&\left. \quad \quad + \Vert {{\mathrm{\mathrm {Hess}}}}f(x_1) - \mathcal {T}_{S_{\zeta _1}} {{\mathrm{\mathrm {Hess}}}}f(x^*) \mathcal {T}_{S_{\zeta _1}}^{-1}\Vert \right) \Vert s\Vert \\&\quad \le b_2 \Vert s\Vert \max \{{{\mathrm{\mathrm {dist}}}}(x_1, x^*), {{\mathrm{\mathrm {dist}}}}(x_2, x^*)\}, \hbox { (by Assumption 3)} \end{aligned}$$

where \(b_1\) and \(b_2\) are some constants.\(\square \)

With these technical lemmas in place, we now start the Riemannian generalization of the sequence of lemmas in [11] that leads to the main result [11, Theorem 2.7], generalized here as Theorem 2. For an easier comparison with [11], in the rest of the convergence analysis, we let \(n\) (instead of \(d\)) denote the dimension of the manifold \(\mathcal {M}\).

The next lemma generalizes [11, Lemma 2.3], itself a slight variation of [24, Lemma 3.2]. The proof of [11, Lemma 2.3] involves considering the span of a few \(s_j\)’s. In the Riemannian setting, a difficulty arises from the fact that the \(s_j\)’s are not in the same tangent space. We overcome this difficulty by transporting the \(s_j\)’s to \({{\mathrm{\mathrm {T}}}}_{x^*} \mathcal {M}\).

Lemma 13

Let \(s_k\) be such that \(R_{x_k} (s_k) \rightarrow x^*\). If Assumptions 1, 2, 3, and 5 hold then there exists \(K \ge 0\) such that for any set of \(n + 1\) steps \(S = \{s_{k_j} : K \le k_1 < \ldots < k_{n + 1}\}\), there exists an index \(k_m\) with \(m \in \{2, 3, \ldots , n + 1\}\) such that

$$\begin{aligned} \frac{\Vert (\mathcal {B}_{k_m} - H_{k_m}) s_{k_m}\Vert }{\Vert s_{k_m}\Vert } < (a_{12} a_{10}^{k_{n + 1} - k_1 - 2} + \bar{a}_{12}) \epsilon _S^{\frac{1}{n}}, \end{aligned}$$

where \(a_{12}, \bar{a}_{12}\) are some constants, \(n\) is the dimension of the manifold, \(H_{k_m} = \mathcal {T}_{S_{\zeta _{k_m}}} {{\mathrm{\mathrm {Hess}}}}f(x^*) \mathcal {T}_{S_{\zeta _{k_m}}}^{-1},\; \epsilon _S = \max _{1 \le j \le n + 1} \{{{\mathrm{\mathrm {dist}}}}(x_{k_j}, x^*), {{\mathrm{\mathrm {dist}}}}(R_{x_{k_j}} (s_{k_j}), x^*)\}\), and \(\zeta _{k_m} = R_{x^*}^{-1}(x_{k_m})\).

Proof

Given \(S\), for \(j = 1, 2, \ldots , n + 1\), define

$$\begin{aligned} S_j = \left[ \frac{\bar{s}_{k_1}}{\Vert \bar{s}_{k_1}\Vert }, \frac{\bar{s}_{k_2}}{\Vert \bar{s}_{k_2}\Vert }, \ldots , \frac{\bar{s}_{k_j}}{\Vert \bar{s}_{k_j}\Vert }\right] , \end{aligned}$$

where \(\bar{s}_{k_i} = \mathcal {T}_{S_{\zeta _{k_i}}}^{-1} s_{k_i}, i = 1, 2, \ldots , j\). The proof is organized as follows. We will first obtain in (28) that there exist \(m \in [2, n + 1]\), \(u\in \mathbb {R}^{m-1}\), and \(w \in {{\mathrm{\mathrm {T}}}}_{x^*} \mathcal {M}\) such that \(\bar{s}_{k_m} / \Vert \bar{s}_{k_m}\Vert = S_{m - 1} u - w\), \(S_{m - 1}\) has full column rank and is well conditioned, and \(\Vert w\Vert \) is small. We will also obtain in (30) that \((\mathcal {T}_{S_{\zeta _{k_m}}}^{-1} \mathcal {B}_{k_m} \mathcal {T}_{S_{\zeta _{k_m}}} - {{\mathrm{\mathrm {Hess}}}}f(x^*)) S_{m - 1}\) is small due to the Hessian-approximating properties of the SR1 update given in Lemma 12 above. The conclusion follows from these two results.

Let \(G_*\) denote the matrix expression of the inner product on \({{\mathrm{\mathrm {T}}}}_{x^*} \mathcal {M}\) and \(\hat{S}_j\) denote the coordinate expression of \(S_j\), for \(j \in \{1, \ldots , n\}\). Let \(\kappa _j\) be the smallest singular value of \(G_*^{1/2} \hat{S}_j\) and define \(\kappa _{n + 1} = 0\). We have

$$\begin{aligned} 1 = \kappa _1 \ge \kappa _2 \ge \cdots \ge \kappa _{n + 1} = 0. \end{aligned}$$

Let \(m\) be the smallest integer for which

$$\begin{aligned} \frac{\kappa _m}{\kappa _{m - 1}} < \epsilon _S^{\frac{1}{n}}. \end{aligned}$$
(20)

Since \(m \le n + 1\) and \(\kappa _1 = 1\), we have

$$\begin{aligned} \kappa _{m - 1} = \kappa _1 \left( \frac{\kappa _2}{\kappa _1}\right) \ldots \left( \frac{\kappa _{m - 1}}{\kappa _{m - 2}}\right) > \epsilon _S^{(m - 2) / n} > \epsilon _S^{(n - 1) / n}. \end{aligned}$$
(21)

Since \(x_k \rightarrow x^*\) and \(R_{x_k} (s_k) \rightarrow x^*\), we can assume that \(\epsilon _S \in (0, (\frac{1}{4})^n)\) for all \(k\). Now, we choose \(z \in \mathbb {R}^m\) such that

$$\begin{aligned} \Vert G_*^{1/2} \hat{S}_m z\Vert _2 = \kappa _m \Vert z\Vert _2 \end{aligned}$$
(22)

and

$$\begin{aligned} z = \left( \begin{array}{c} u \\ -1 \\ \end{array} \right) , \end{aligned}$$

where \(u \in \mathbb {R}^{m - 1}\). (The last component of \(z\) is nonzero because \(m\) is the smallest integer for which (20) holds.) Let \(w = S_m z\), with coordinate expression \(\hat{w} = \hat{S}_m z\). From the definition of \(G_*^{1/2} \hat{S}_m\) and \(z\), we have

$$\begin{aligned} G_*^{1/2} \hat{S}_{m - 1} u - G_*^{1/2} \hat{w} = \frac{G_*^{1/2} \hat{\bar{s}}_{k_m}}{\Vert G_*^{1/2} \hat{\bar{s}}_{k_m}\Vert _2}, \end{aligned}$$
(23)

where \(\hat{\bar{s}}_{k_m}\) is the coordinate expression of \(\bar{s}_{k_m}\). Since \(\kappa _{m - 1}\) is the smallest singular value of \(G_*^{1/2} \hat{S}_{m - 1}\), we have that

$$\begin{aligned} \Vert u\Vert _2&\le \frac{1}{\kappa _{m - 1}} \Vert G_*^{1/2} \hat{S}_{m - 1} u\Vert _2 = \frac{1}{\kappa _{m - 1}} \Vert G_*^{1/2} \hat{w} + \frac{G_*^{1/2} \hat{\bar{s}}_{k_m}}{\Vert G_*^{1/2} \hat{\bar{s}}_{k_m}\Vert _2}\Vert _2 \nonumber \\&\le \frac{\Vert G_*^{1/2} \hat{w}\Vert _2 + 1}{\kappa _{m - 1}} = \frac{\Vert w\Vert + 1}{\kappa _{m - 1}} \end{aligned}$$
(24)
$$\begin{aligned}&< \frac{\Vert G_*^{1/2} \hat{w}\Vert _2 + 1}{\epsilon _S^{(n - 1) / n}} = \frac{\Vert w\Vert + 1}{\epsilon _S^{(n - 1) / n}}. \hbox { (by (21))} \end{aligned}$$
(25)

Using (22) and (24), we have that

$$\begin{aligned} \Vert w\Vert ^2&= \Vert G_*^{1/2} \hat{w}\Vert _2^2 = \Vert G_*^{1/2} \hat{S}_m z\Vert _2^2 = \kappa _m^2 \Vert z\Vert _2^2 = \kappa _m^2 (1 + \Vert u\Vert _2^2) \\&\le \kappa _m^2 + \left( \frac{\kappa _m}{\kappa _{m - 1}}\right) ^2 (\Vert G_*^{1/2} \hat{w}\Vert _2 + 1)^2 = \kappa _m^2 + \left( \frac{\kappa _m}{\kappa _{m - 1}}\right) ^2 (\Vert w\Vert + 1)^2. \end{aligned}$$

Therefore, since (20) and \(\kappa _{m - 1} \le 1\) imply that \(\kappa _m < \epsilon _S^{1 / n}\), using (20) once more we obtain

$$\begin{aligned} \Vert w\Vert ^2 < \epsilon _S^{2 / n} + \epsilon _S^{2 / n} (\Vert w\Vert + 1)^2 < 4 \epsilon _S^{2 / n} (\Vert w\Vert + 1)^2. \end{aligned}$$
(26)

This implies

$$\begin{aligned} \Vert w\Vert (1 - 2 \epsilon _S^{1 / n}) < 2 \epsilon _S^{1 / n}, \end{aligned}$$

and hence \(\Vert w\Vert < 1\), since \(\epsilon _S < (\frac{1}{4})^n\). Therefore, (25) and (26) imply that

$$\begin{aligned} \Vert u\Vert _2&< \frac{2}{\epsilon _S^{(n - 1) / n}}, \end{aligned}$$
(27)
$$\begin{aligned} \Vert w\Vert&< 4 \epsilon _S^{1 / n}. \end{aligned}$$
(28)

Equation (28) is the announced result that \(w\) is small. The bound (27) will also be invoked below.

Now we show that \(\Vert (\mathcal {T}_{S_{\zeta _{k_j}}}^{-1} \mathcal {B}_{k_j} \mathcal {T}_{S_{\zeta _{k_j}}} - {{\mathrm{\mathrm {Hess}}}}f(x^*)) S_{j - 1}\Vert \) is small for all \(j \in [2, n + 1]\) (and thus in particular for \(j=m\)). By Lemma 11, we have

$$\begin{aligned} \Vert y_i - (\mathcal {B}_{k_j})_i s_i\Vert \le a_9 a_{10}^{k_j - i - 2} \epsilon _{k_j, i} \Vert s_i\Vert \le a_9 a_{10}^{k_{n + 1} - k_1 - 2} \epsilon _S \Vert s_i\Vert , \end{aligned}$$
(29)

for all \(i \in \{k_1, k_2, \ldots , k_{j - 1}\}\). Therefore,

$$\begin{aligned}&\Vert (\mathcal {T}_{S_{\zeta _{k_j}}}^{-1} \mathcal {B}_{k_j} \mathcal {T}_{S_{\zeta _{k_j}}} - {{\mathrm{\mathrm {Hess}}}}f(x^*)) \frac{\bar{s}_i}{\Vert \bar{s}_i\Vert }\Vert \\&\quad \le \Vert \frac{\mathcal {T}_{S_{\zeta _i}}^{-1} y_i - \mathcal {T}_{S_{\zeta _{k_j}}}^{-1} \mathcal {B}_{k_j} \mathcal {T}_{S_{\zeta _{k_j}}} \bar{s}_i}{\Vert \bar{s}_i\Vert }\Vert + \Vert \frac{\mathcal {T}_{S_{\zeta _i}}^{-1} y_i - {{\mathrm{\mathrm {Hess}}}}f(x^*) \bar{s}_i}{\Vert \bar{s}_i\Vert }\Vert \\&\quad \le \Vert \frac{\mathcal {T}_{S_{\zeta _i}}^{-1} y_i - \mathcal {T}_{S_{\zeta _{k_j}}}^{-1} \mathcal {B}_{k_j} \mathcal {T}_{S_{\zeta _{k_j}}} \bar{s}_i}{\Vert \bar{s}_i\Vert }\Vert + b_1 \epsilon _S \hbox { (by Lemma 12)} \\&\quad = \Vert \frac{\mathcal {T}_{S_{\zeta _i}}^{-1} (y_i - \mathcal {T}_{S_{\zeta _i}} \mathcal {T}_{S_{\zeta _{k_j}}}^{-1} \mathcal {B}_{k_j} \mathcal {T}_{S_{\zeta _{k_j}}} \mathcal {T}_{S_{\zeta _i}}^{-1} s_i)}{\Vert \bar{s}_i\Vert }\Vert + b_1 \epsilon _S \\&\quad \le b_2 \frac{\Vert (y_i - (\mathcal {B}_{k_j})_i s_i)\Vert }{\Vert s_i\Vert } + b_3 \epsilon _S \hbox { (by Lemma 10 and Assumption 1)} \\&\quad \le (b_4 a_{10}^{k_{n + 1} - k_1 - 2} + b_3) \epsilon _S \hbox { (by (29))} \end{aligned}$$

where \(b_2, b_3\) and \(b_4\) are some constants. Therefore, we have that for any \(j \in [2, n + 1]\),

$$\begin{aligned} \Vert (\mathcal {T}_{S_{\zeta _{k_j}}}^{-1} \mathcal {B}_{k_j} \mathcal {T}_{S_{\zeta _{k_j}}} - {{\mathrm{\mathrm {Hess}}}}f(x^*)) S_{j - 1}\Vert _{g, 2} \le b_5 \epsilon _S, \end{aligned}$$
(30)

where \(b_5 \!=\! \sqrt{n} (b_4 a_{10}^{k_{n + 1} - k_1 - 2} \!+\! b_3)\) and \(\Vert \cdot \Vert _{g, 2}\) is the norm induced by the Riemannian metric \(g\) and the Euclidean norm, i.e., \(\Vert A\Vert _{g, 2} \!=\! \sup \Vert A v\Vert / \Vert v\Vert _2\) with \(\Vert \cdot \Vert \) defined in (5).

We can now conclude the proof as follows. Using (23) and (30) with \(j = m\), (27) and (28), we have

$$\begin{aligned}&\frac{\Vert (\mathcal {T}_{S_{\zeta _{k_m}}}^{-1} \mathcal {B}_{k_m} \mathcal {T}_{S_{\zeta _{k_m}}} - {{\mathrm{\mathrm {Hess}}}}f(x^*)) \bar{s}_{k_m}\Vert }{\Vert \bar{s}_{k_m}\Vert } \\&\quad = \Vert (\mathcal {T}_{S_{\zeta _{k_m}}}^{-1} \mathcal {B}_{k_m} \mathcal {T}_{S_{\zeta _{k_m}}} - {{\mathrm{\mathrm {Hess}}}}f(x^*)) (S_{m - 1} u - w)\Vert \\&\quad \le \Vert (\mathcal {T}_{S_{\zeta _{k_m}}}^{-1} \mathcal {B}_{k_m} \mathcal {T}_{S_{\zeta _{k_m}}} - {{\mathrm{\mathrm {Hess}}}}f(x^*)) S_{m - 1}\Vert _{g, 2} \Vert u\Vert _2 \\&\quad \quad + \Vert (\mathcal {T}_{S_{\zeta _{k_m}}}^{-1} \mathcal {B}_{k_m} \mathcal {T}_{S_{\zeta _{k_m}}} - {{\mathrm{\mathrm {Hess}}}}f(x^*))\Vert \Vert w\Vert \\&\quad \le b_5 \epsilon _S \frac{2}{\epsilon _S^{(n - 1) / n}} + (M + \Vert {{\mathrm{\mathrm {Hess}}}}f(x^*)\Vert ) 4 \epsilon _S^{1 / n} \text { (by Assumption 1)} \\&\quad \le (2 b_5 + b_6) \epsilon _S^{1 / n} \end{aligned}$$

where \(b_6\) is some constant. Finally,

$$\begin{aligned}&\frac{\Vert (\mathcal {B}_{k_m} - H_{k_m}) s_{k_m}\Vert }{\Vert s_{k_m}\Vert } = \frac{\Vert (\mathcal {B}_{k_m} - \mathcal {T}_{S_{\zeta _{k_m}}} {{\mathrm{\mathrm {Hess}}}}f(x^*) \mathcal {T}_{S_{\zeta _{k_m}}}^{-1}) s_{k_m}\Vert }{\Vert s_{k_m}\Vert } \\&\quad = \frac{\Vert (\mathcal {T}_{S_{\zeta _{k_m}}}^{-1} \mathcal {B}_{k_m} \mathcal {T}_{S_{\zeta _{k_m}}} - {{\mathrm{\mathrm {Hess}}}}f(x^*)) \bar{s}_{k_m}\Vert }{\Vert \bar{s}_{k_m}\Vert } \le (2 b_5 + b_6) \epsilon _S^{1 / n}. \end{aligned}$$

\(\square \)

The next lemma generalizes [11, Lemma 2.4]. Its proof is a translation of the proof of [11, Lemma 2.4], where we invoke two manifold-specific results: the equality of \({{\mathrm{\mathrm {Hess}}}}f(x^*)\) and \({{\mathrm{\mathrm {Hess}}}}(f\circ R_{x^*})(0_{x^*})\) (which holds in view of [2, Proposition 5.5.6] since \(x^*\) is a critical point of \(f\)), and the bound in Lemma 3 on the retraction \(R\).

Lemma 14

Suppose that Assumptions 1, 2, 3, 4, 5 and 6 hold and the trust-region subproblem (2) is solved accurately enough for (9) to hold. Then there exists \(N\) such that for any set of \(p > n\) consecutive steps \(s_{k + 1}, s_{k + 2}, \ldots , s_{k + p}\) with \(k \ge N\), there exists a set, \(\mathcal {G}_k\), of at least \(p - n\) indices contained in the set \(\{i : k + 1 \le i \le k + p\}\) such that for all \(j \in \mathcal {G}_k\),

$$\begin{aligned} \frac{\Vert (\mathcal {B}_{j} - H_{j}) s_j\Vert }{\Vert s_j\Vert } < a_{13} \epsilon _k^{\frac{1}{n}}, \end{aligned}$$

where \(a_{13} = a_{12} a_{10}^{p - 2} + \bar{a}_{12},\; H_j = \mathcal {T}_{S_{\zeta _j}} {{\mathrm{\mathrm {Hess}}}}f(x^*) \mathcal {T}_{S_{\zeta _j}}^{-1},\; \zeta _j = R_{x^*}^{-1}(x_j)\), and

$$\begin{aligned} \epsilon _k = \max _{k + 1 \le j \le k + p} \{{{\mathrm{\mathrm {dist}}}}(x_j, x^*), {{\mathrm{\mathrm {dist}}}}(R_{x_j}(s_j), x^*)\}. \end{aligned}$$

Furthermore, for \(k\) sufficiently large, if \(j \in \mathcal {G}_k\), then

$$\begin{aligned} \Vert s_j\Vert < a_{14} {{\mathrm{\mathrm {dist}}}}(x_j, x^*), \end{aligned}$$
(31)

where \(a_{14}\) is a constant, and

$$\begin{aligned} \rho _j \ge 0.75. \end{aligned}$$
(32)

Proof

By Lemma 5, \(s_k \rightarrow 0\). Therefore, by Lemma 13, applied to the set

$$\begin{aligned} \{s_k, s_{k + 1}, \ldots , s_{k + p}\}, \end{aligned}$$
(33)

there exists \(N\) such that for any \(k \ge N\) there exists an index \(l_1\), with \(k + 1 \le l_1 \le k + p\) satisfying

$$\begin{aligned} \frac{\Vert (\mathcal {B}_{l_1} - H_{l_1}) s_{l_1}\Vert }{\Vert s_{l_1}\Vert } < a_{13} \epsilon _k^{\frac{1}{n}}, \end{aligned}$$

where \(a_{13} = a_{12} a_{10}^{p - 2} + \bar{a}_{12}\). We apply Lemma 13 to the set \(\{s_k, s_{k + 1}, \ldots , s_{k + p}\} \setminus \{s_{l_1}\}\) to get \(l_2\). Repeating this \(p - n\) times, we get a set of \(p - n\) indices \(\mathcal {G}_k = \{l_1, l_2, \ldots , l_{p - n}\}\) such that if \(j \in \mathcal {G}_k\), then

$$\begin{aligned} \frac{\Vert (\mathcal {B}_j - H_j) s_j\Vert }{\Vert s_j\Vert } < a_{13} \epsilon _k^{\frac{1}{n}}. \end{aligned}$$
(34)

We show (31) next. Consider \(j \in \mathcal {G}_k\). By (34), we have

$$\begin{aligned} g(s_j, (H_j - \mathcal {B}_j) s_j) \le \Vert s_j\Vert \Vert (H_j - \mathcal {B}_j) s_j\Vert \le a_{13} \epsilon _k^{\frac{1}{n}} \Vert s_j\Vert ^2. \end{aligned}$$

Therefore, \(g(s_j, \mathcal {B}_j s_j) \ge g(s_j, H_j s_j) - a_{13} \epsilon _k^{\frac{1}{n}} \Vert s_j\Vert ^2 > b_0 \Vert s_j\Vert ^2\) for sufficiently large \(k\), where \(b_0\) is a constant, and we have

$$\begin{aligned} 0 \le m_j(0) - m_j(s_j)&= - g({{\mathrm{\mathrm {grad}}}}f(x_j), s_j) - \frac{1}{2} g(s_j, \mathcal {B}_j s_j) \\&\le \Vert {{\mathrm{\mathrm {grad}}}}f(x_j)\Vert \Vert s_j\Vert - \frac{1}{2} b_0 \Vert s_j\Vert ^2 \\&\le b_1 {{\mathrm{\mathrm {dist}}}}(x_j, x^*) \Vert s_j\Vert - \frac{1}{2} b_0 \Vert s_j\Vert ^2, \hbox { (by Lemma 7)} \end{aligned}$$

where \(b_1\) is some constant. Dividing by \(\Vert s_j\Vert \) and rearranging gives \(\Vert s_j\Vert \le (2 b_1 / b_0) {{\mathrm{\mathrm {dist}}}}(x_j, x^*)\), which yields (31).

Finally, we show (32). Let \(j \in \mathcal {G}_k\) and define \(\hat{f}_x(\eta ) = f(R_x(\eta ))\). It follows that

$$\begin{aligned}&|f(x_j) - f(R_{x_j}(s_j)) - (m_j(0) - m_j(s_j))| \\&\quad = |f(x_j) - f(R_{x_j}(s_j)) + g({{\mathrm{\mathrm {grad}}}}f(x_j), s_j) + \frac{1}{2} g(s_j, \mathcal {B}_j s_j)| \\&\quad = |\hat{f}_{x_j}(0_{x_j}) - \hat{f}_{x_j}(s_j) + g({{\mathrm{\mathrm {grad}}}}f(x_j), s_j) + \frac{1}{2} g(s_j, \mathcal {B}_j s_j)| \\&\quad = |\frac{1}{2} g(s_j, \mathcal {B}_j s_j) - \int \limits _0^1 g({{\mathrm{\mathrm {Hess}}}}\hat{f}_{x_j}(\tau s_j) [s_j], s_j) (1 - \tau ) d \tau | \hbox {(by Taylor's theorem)} \\&\quad \le |\frac{1}{2} g(s_j, \mathcal {B}_j s_j) - \frac{1}{2} g(s_j, H_j s_j)| \\&\quad \quad + |\frac{1}{2} g(s_j, H_j s_j) - \int \limits _0^1 g({{\mathrm{\mathrm {Hess}}}}\hat{f}_{x_j}(\tau s_j) [s_j], s_j) (1 - \tau ) d \tau | \\&\quad = |\frac{1}{2} g(s_j, (\mathcal {B}_j - H_j) s_j)| \\&\quad \quad + |\int \limits _0^1 (g(s_j, \mathcal {T}_{S_{\zeta _j}} {{\mathrm{\mathrm {Hess}}}}f(x^*) \mathcal {T}_{S_{\zeta _j}}^{-1} s_j) - g({{\mathrm{\mathrm {Hess}}}}\hat{f}_{x_j}(\tau s_j) [s_j], s_j)) (1 - \tau ) d \tau | \\&\quad \le \frac{1}{2} \Vert s_j\Vert \Vert (\mathcal {B}_j - H_j) s_j\Vert \\&\quad \quad + \Vert s_j\Vert ^2 \int \limits _0^1 \Vert (\mathcal {T}_{S_{\zeta _j}} {{\mathrm{\mathrm {Hess}}}}\hat{f}_{x^*}(0_{x^*}) \mathcal {T}_{S_{\zeta _j}}^{-1} - {{\mathrm{\mathrm {Hess}}}}\hat{f}_{x_j}(\tau s_j))\Vert (1 - \tau ) d \tau \\&\quad \qquad \hbox { (by [2, Proposition 5.5.6])} \\&\quad \le b_2 \Vert s_j\Vert ^2 \epsilon _k^{\frac{1}{n}} + b_3 \Vert s_j\Vert ^2 ({{\mathrm{\mathrm {dist}}}}(x_j, x^*) + \Vert s_j\Vert ) \\&\quad \qquad \hbox { (by (34), Lemma 3 and Assumption 4)} \\&\quad \le b_4 \Vert s_j\Vert ^2 \epsilon _k^{\frac{1}{n}}, (\hbox { by (31) and}\; {{\mathrm{\mathrm {dist}}}}(x_j, x^*) \hbox {is smaller than}\; \epsilon _k^{\frac{1}{n}}\;\hbox {eventually}) \end{aligned}$$

where \(b_2, b_3\) and \(b_4\) are some constants. In view of (31) and Lemma 7, we have

$$\begin{aligned} \Vert s_j\Vert < b_5 \Vert {{\mathrm{\mathrm {grad}}}}f(x_j)\Vert , \end{aligned}$$

where \(b_5\) is some constant. Combining with \(\Vert s_j\Vert \le \varDelta _j\), we obtain

$$\begin{aligned} \Vert s_j\Vert ^2 \le b_5 \Vert {{\mathrm{\mathrm {grad}}}}f(x_j)\Vert \min \{\varDelta _j, b_5 \Vert {{\mathrm{\mathrm {grad}}}}f(x_j)\Vert \}. \end{aligned}$$

Noticing (9), we have

$$\begin{aligned} |f(x_j) - f(R_{x_j}(s_j)) - (m_j(0) - m_j(s_j))| \le b_6 \epsilon _k^{\frac{1}{n}} (m_j(0) - m_j(s_j)), \end{aligned}$$

where \(b_6\) is a constant. This implies (32).\(\square \)

The next result generalizes [11, Lemma 2.5] in two ways: the Euclidean setting is extended to the Riemannian setting, and inexact solves are allowed by the presence of \(\delta _k\). The main hurdle that we had to overcome in the Riemannian generalization is that the equality \({{\mathrm{\mathrm {dist}}}}(x_k + s_k, x^*) = \Vert s_k - \xi _k\Vert \) does not necessarily hold. As we will see, Lemma 2 comes to our rescue.

Lemma 15

Suppose Assumptions 2 and 3 hold. If the quantities

$$\begin{aligned} e_k = {{\mathrm{\mathrm {dist}}}}(x_k, x^*) \hbox { and } \frac{\Vert (\mathcal {B}_{k} - H_{k}) s_{k}\Vert }{\Vert s_{k}\Vert } \end{aligned}$$

are sufficiently small and if

$$\begin{aligned} \mathcal {B}_k s_k = - {{\mathrm{\mathrm {grad}}}}f(x_k) + \delta _k \quad \hbox {with}\quad \Vert \delta _k\Vert \le \Vert {{\mathrm{\mathrm {grad}}}}f(x_k)\Vert ^{1 + \theta }, \end{aligned}$$

then

$$\begin{aligned} {{\mathrm{\mathrm {dist}}}}(R_{x_k}(s_k), x^*)&\le a_{15} \frac{\Vert (\mathcal {B}_{k} - H_{k}) s_{k}\Vert }{\Vert s_{k}\Vert } e_k + a_{16} e_k^{1 + \min \{\theta ,1\}},\end{aligned}$$
(35)
$$\begin{aligned} h(R_{x_k}(s_k))&\le a_{17} \frac{\Vert (\mathcal {B}_{k} - H_{k}) s_{k}\Vert }{\Vert s_{k}\Vert } h(x_k) + a_{18} h^{1 + \min \{\theta ,1\}}(x_k), \end{aligned}$$
(36)

and

$$\begin{aligned} a_{19} h(x_k) \le e_k \le a_{20} h(x_k) \end{aligned}$$
(37)

where \(a_{15}, \ldots , a_{20}\) are some constants and \(h(x) = (f(x) - f(x^*))^{\frac{1}{2}}\).

Proof

By definition of \(s_k\), we have

$$\begin{aligned} s_k = H_k^{-1} [(H_k - \mathcal {B}_k) s_k - {{\mathrm{\mathrm {grad}}}}f(x_k) + \delta _k]. \end{aligned}$$
(38)

Define \(\xi _k = R^{-1}_{x_k} x^*\). Therefore, letting \(\gamma \) be the curve defined by \(\gamma (t) = R_{x_k}(t \xi _k)\), we have

$$\begin{aligned}&\Vert s_k - \xi _k\Vert = \Vert H_k^{-1} \left[ (H_k - \mathcal {B}_k) s_k - {{\mathrm{\mathrm {grad}}}}f(x_k) + \delta _k - H_k \xi _k\right] \Vert \nonumber \\&\quad \le b_0 (\Vert (H_k - \mathcal {B}_k) s_k\Vert + \Vert \delta _k\Vert \nonumber \\&\quad \quad + \Vert P_{\gamma }^{0 \leftarrow 1} {{\mathrm{\mathrm {grad}}}}f(x^*) - {{\mathrm{\mathrm {grad}}}}f(x_k) - \left( \int \limits _0^1 P_{\gamma }^{0 \leftarrow t} {{\mathrm{\mathrm {Hess}}}}f(\gamma (t)) P_{\gamma }^{t \leftarrow 0} d t\right) \xi _k\Vert \nonumber \\&\quad \quad + \Vert \left( \int \limits _0^1 P_{\gamma }^{0 \leftarrow t} {{\mathrm{\mathrm {Hess}}}}f(\gamma (t)) P_{\gamma }^{t \leftarrow 0} d t\right) \xi _k - {{\mathrm{\mathrm {Hess}}}}f(x_k) \xi _k\Vert \nonumber \\&\quad \quad + \Vert {{\mathrm{\mathrm {Hess}}}}f(x_k) \xi _k - H_k \xi _k\Vert ) \nonumber \\&\quad \le b_0 (\Vert (H_k - \mathcal {B}_k) s_k\Vert + b_1 \Vert \xi _k\Vert ^{1 + \min \{\theta ,1\}} \hbox { (by Lemmas 7 and 8)} \nonumber \\&\quad \quad + \Vert \left( \int \limits _0^1 P_{\gamma }^{0 \leftarrow t} {{\mathrm{\mathrm {Hess}}}}f(\gamma (t)) P_{\gamma }^{t \leftarrow 0} d t\right) \xi _k - {{\mathrm{\mathrm {Hess}}}}f(x_k) \xi _k\Vert \nonumber \\&\quad \quad + \Vert {{\mathrm{\mathrm {Hess}}}}f(x_k) - H_k\Vert \Vert \xi _k\Vert ) \nonumber \\&\quad \le b_0 \Vert (H_k - \mathcal {B}_k) s_k\Vert + b_0 b_1 \Vert \xi _k\Vert ^{1 + \min \{\theta ,1\}} + b_0 b_3 \Vert \xi _k\Vert ^2 \hbox { (by Assumption 3)} \nonumber \\&\quad \le b_0 \Vert (H_k - \mathcal {B}_k) s_k\Vert + b_4 \Vert \xi _k\Vert ^{1 + \min \{\theta ,1\}} \end{aligned}$$
(39)

where \(b_0, \ldots , b_4\) are some constants. From Lemma 2, we have

$$\begin{aligned} {{\mathrm{\mathrm {dist}}}}(R_{x_k} (s_k), x^*) = {{\mathrm{\mathrm {dist}}}}(R_{x_k} (s_k), R_{x_k}(\xi _k)) \le b_5 \Vert s_k - \xi _k\Vert , \end{aligned}$$
(40)

where \(b_5\) is a constant. Combining (39) and (40) and using Lemma 3, we obtain

$$\begin{aligned} {{\mathrm{\mathrm {dist}}}}(R_{x_k} (s_k), x^*) \le b_0 b_5 \Vert (H_k - \mathcal {B}_k) s_k\Vert + \bar{b}_4 b_5 e_k^{1 + \min \{\theta ,1\}}. \end{aligned}$$
(41)

From (38), for \(k\) large enough such that \(\Vert H_k^{-1}\Vert \Vert (H_k - \mathcal {B}_k) s_k\Vert \le \frac{1}{2} \Vert s_k\Vert \), we have

$$\begin{aligned} \Vert s_k\Vert \le \frac{1}{2} \Vert s_k\Vert + \Vert H_k^{-1}\Vert (\Vert {{\mathrm{\mathrm {grad}}}}f(x_k)\Vert + \Vert {{\mathrm{\mathrm {grad}}}}f(x_k)\Vert ^{1 + \theta }). \end{aligned}$$

Using Lemma 7, this yields

$$\begin{aligned} \Vert s_k\Vert \le b_6 {{\mathrm{\mathrm {dist}}}}(x_k, x^*), \end{aligned}$$

where \(b_6\) is a constant. Using the latter in (41) yields

$$\begin{aligned} {{\mathrm{\mathrm {dist}}}}(R_{x_k} (s_k), x^*) \le b_0 b_5 b_6 \frac{\Vert (H_k - \mathcal {B}_k) s_k\Vert }{\Vert s_k\Vert } {{\mathrm{\mathrm {dist}}}}(x_k, x^*) + \bar{b}_4 b_5 e_k^{1 + \min \{\theta ,1\}}, \end{aligned}$$

which shows (35).

We show (37) next. Define \(\hat{f}_x(\eta ) = f(R_x(\eta ))\) and let \(\zeta _k = R_{x^*}^{-1}(x_k)\). We have, for some \(t\in (0,1)\),

$$\begin{aligned} \hat{f}_{x^*}(\zeta _k) - \hat{f}_{x^*}(0_{x^*})&= g({{\mathrm{\mathrm {grad}}}}f(x^*), \zeta _k) + \frac{1}{2} g({{\mathrm{\mathrm {Hess}}}}\hat{f}_{x^*}(t\zeta _k)[\zeta _k],\zeta _k)\\&= \frac{1}{2} g({{\mathrm{\mathrm {Hess}}}}\hat{f}_{x^*}(t\zeta _k)[\zeta _k],\zeta _k), \end{aligned}$$

where we have used (Euclidean) Taylor’s theorem to get the first equality and the fact that \(x^*\) is a critical point of \(f\) (Assumption 2) for the second one. Therefore, since \({{\mathrm{\mathrm {Hess}}}}\hat{f}_{x^*} = {{\mathrm{\mathrm {Hess}}}}f(x^*)\) is positive definite (in view of [2, Proposition 5.5.6] and Assumption 2), there exist \(b_7\) and \(b_8\) such that

$$\begin{aligned} b_7 (\hat{f}_{x^*}(\zeta _k) - \hat{f}_{x^*}(0_{x^*})) \le \Vert \hat{\zeta }_k\Vert ^2 \le b_8 (\hat{f}_{x^*}(\zeta _k) - \hat{f}_{x^*}(0_{x^*})). \end{aligned}$$

Then, using Lemma 3, we obtain that there exist \(b_9\) and \(b_{10}\) such that

$$\begin{aligned} b_9 (f(x_k) - f(x^*)) \le {{\mathrm{\mathrm {dist}}}}(x_k, x^*)^2 \le b_{10} (f(x_k) - f(x^*)). \end{aligned}$$

In other words,

$$\begin{aligned} b_9 h^2(x_k) \le e_k^2 \le b_{10} h^2(x_k), \end{aligned}$$

and we have shown (37). Combining it with (35), we get (36).\(\square \)

With Lemmas 14 and 15 in place, the rest of the local convergence analysis is essentially a translation of the analysis in [11]. The next lemma generalizes [11, Lemma 2.6].

Lemma 16

If Assumptions 1, 2, 3, 4, 5, and 6 hold and the subproblem is solved accurately enough for (9) and (10) to hold, then

$$\begin{aligned} \lim _{k \rightarrow \infty } \frac{h_k}{\varDelta _k} = 0, \end{aligned}$$

where \(h_k = h(x_k)\).

Proof

Let \(p\) be the smallest integer greater than \(2 n + n (- \ln \tau _1 / \ln \tau _2)\), where \(\tau _1\) and \(\tau _2\) are defined in Algorithm 1. Then

$$\begin{aligned} \tau _1^n \tau _2^{p - 2 n} \ge 1. \end{aligned}$$
(42)

Applying Lemma 14 with this value of \(p\), there exists \(N\) such that if \(k \ge N\), then there exists a set of at least \(p - n\) indices, \(\mathcal {G}_k \subset \{j : k + 1 \le j \le k + p\}\), such that if \(j \in \mathcal {G}_k\), then

$$\begin{aligned} \frac{\Vert (\mathcal {B}_{j} - H_{j}) s_j\Vert }{\Vert s_j\Vert }&< a_{13} \epsilon _k^{\frac{1}{n}}, \nonumber \\ \rho _j&\ge 0.75. \end{aligned}$$
(43)

We now show that for such steps,

$$\begin{aligned} \frac{h_{j + 1}}{\varDelta _{j + 1}} \le \frac{1}{\tau _2} \frac{h_j}{\varDelta _j}. \end{aligned}$$
(44)

If \(\Vert s_j\Vert \ge 0.8 \varDelta _j\), then, by Step 12 of Algorithm 1, \(\varDelta _{j + 1} = \tau _2 \varDelta _j\), and since \(\{h_i\}\) is decreasing, (44) follows. If, on the other hand, \(\Vert s_j\Vert < 0.8 \varDelta _j\), then from Step 14 of Algorithm 1 we have \(\varDelta _{j + 1} = \varDelta _j\). Also, since the trust region is inactive, condition (10) gives \(\mathcal {B}_j s_j = - {{\mathrm{\mathrm {grad}}}}f(x_j) + \delta _j\) with \(\Vert \delta _j\Vert \le \Vert {{\mathrm{\mathrm {grad}}}}f(x_j)\Vert ^{1 + \theta }\). Therefore, in view of (36) in Lemma 15 and of (43), if \(N\) is large enough, we have that

$$\begin{aligned} h_{j + 1} \le \frac{1}{\tau _2} h_j. \end{aligned}$$

This implies that (44) is true for all \(j \in \mathcal {G}_k\), where \(k \ge N\).

In addition, note that for any \(j,\; h_{j + 1} \le h_j\) and \(\varDelta _{j + 1} \ge \tau _1 \varDelta _j\) and so

$$\begin{aligned} \frac{h_{j + 1}}{\varDelta _{j + 1}} \le \frac{1}{\tau _1} \frac{h_j}{\varDelta _j}. \end{aligned}$$
(45)

Since (44) is true for \(p - n\) values of \(j \in \mathcal {G}_k\) and (45) holds for all \(j\), we have that for all \(k \ge N\),

$$\begin{aligned} \frac{h_{k + p}}{\varDelta _{k + p}} \le \left( \frac{1}{\tau _1}\right) ^n \left( \frac{1}{\tau _2}\right) ^{p - n} \frac{h_k}{\varDelta _k} \le \left( \frac{1}{\tau _2}\right) ^n \frac{h_k}{\varDelta _k}, \end{aligned}$$

where the second inequality follows from (42). Therefore, starting at \(k = N\), it follows that

$$\begin{aligned} \frac{h_{N + l p}}{\varDelta _{N + l p}} \rightarrow 0 \end{aligned}$$

as \(l \rightarrow \infty \). Using (45) again, we complete the proof.\(\square \)

The next result generalizes [11, Theorem 2.7].

Theorem 2

If Assumptions 1, 2, 3, 4, 5, and 6 hold and the subproblem is solved accurately enough for (9) and (10) to hold, then the sequence \(\{x_k\}\) generated by Algorithm 1 is \(n + 1\)-step q-superlinear (where \(n\) denotes the dimension of \(\mathcal {M}\)); i.e.,

$$\begin{aligned} \frac{{{\mathrm{\mathrm {dist}}}}(x_{k+n+1},x^*)}{{{\mathrm{\mathrm {dist}}}}(x_k,x^*)} \rightarrow 0. \end{aligned}$$

Proof

By Lemma 14, there exists \(N\) such that if \(k \ge N\), then the set of steps \(\{s_{k + 1}, \ldots , s_{k + n + 1}\}\) contains at least one step \(s_{k + j},\; 1 \le j \le n + 1\), for which

$$\begin{aligned} \frac{\Vert (\mathcal {B}_{k + j} - H_{k + j}) s_{k + j}\Vert }{\Vert s_{k + j}\Vert } < a_{13} \epsilon _k^{\frac{1}{n}}. \end{aligned}$$

By (31) in Lemma 14 and (37) in Lemma 15 (when checking the assumptions, recall the standing assumption made in Sect. 3.3 that \(e_k:={{\mathrm{\mathrm {dist}}}}(x_k,x^*) \rightarrow 0\)), there exists a constant \(b_0\) such that

$$\begin{aligned} \Vert s_{k + j}\Vert < b_0 h_{k + j}. \end{aligned}$$

Therefore, by Lemma 16, if \(N\) is large enough and \(k \ge N\), then we have \(\Vert s_{k + j}\Vert < 0.8 \varDelta _{k + j}\). By (10), this implies \(\mathcal {B}_{k + j} s_{k + j} = - {{\mathrm{\mathrm {grad}}}}f(x_{k + j}) + \delta _{k + j}\), with \(\Vert \delta _{k + j}\Vert \le \Vert {{\mathrm{\mathrm {grad}}}}f(x_{k + j})\Vert ^{1 + \theta }\). Thus by inequality (36) of Lemma 15, if \(N\) is large enough and \(k \ge N\), then

$$\begin{aligned} h_{k + j + 1} = h(R_{x_{k + j}}(s_{k + j})) \le (a_{17} a_{13} \epsilon _k^{\frac{1}{n}} + a_{18} h_{k + j}^{\min \{\theta ,1\}}) h_{k + j}. \end{aligned}$$

The first equality holds because (32) implies that the step is accepted. Since the sequence \(\{h_i\}\) is decreasing, this implies that

$$\begin{aligned} h_{k + n + 1} \le (a_{17} a_{13} \epsilon _k^{\frac{1}{n}} + a_{18} h_{k + j}^{\min \{\theta ,1\}}) h_k. \end{aligned}$$

By (37),

$$\begin{aligned} e_{k + n + 1}&\le a_{20} h_{k + n + 1} \\&\le a_{20} (a_{17} a_{13} \epsilon _k^{\frac{1}{n}} + a_{18} h_{k + j}^{\min \{\theta ,1\}}) h_k \\&\le a_{20} (a_{17} a_{13} \epsilon _k^{\frac{1}{n}} + a_{18} \left( \frac{e_k}{a_{19}}\right) ^{\min \{\theta ,1\}}) \frac{e_k}{a_{19}}. \end{aligned}$$

This implies \(n + 1\)-step q-superlinear convergence.\(\square \)

It is also possible to extend to the Riemannian setting the result [11, Theorem 2.8] that the percentage of \(\mathcal {B}_k\) being positive semidefinite approaches 1 provided that \(\mathcal {B}_k\) is positive semidefinite whenever \(\Vert s_k\Vert \le 0.8 \varDelta _k\). In the proof of [11, Theorem 2.8], replace Lemma 2.6 by Lemma 16, Lemma 2.4 by Lemma 14, (2.14) by (31), and (2.9) by (37).

4 Limited memory version of RTR-SR1

In RTR-SR1 (Algorithm 1), storing \(\mathcal {B}_{k + 1} = \mathcal {T}_{\eta _k} \circ \tilde{\mathcal {B}}_{k + 1} \circ \mathcal {T}_{\eta _k}^{-1}\) in matrix form may be inefficient for two reasons. The first reason, which is also present in the Euclidean case, is that \(\tilde{\mathcal {B}}_{k+1}= \mathcal {B}_k + \frac{(y_k - \mathcal {B}_k s_k) (y_k - \mathcal {B}_k s_k)^\flat }{g(s_k,y_k - \mathcal {B}_k s_k)}\) is a rank-one modification of \(\mathcal {B}_k\). The second reason, specific to the Riemannian setting, is that when \(\mathcal {M}\) is a low-codimension submanifold of a Euclidean space \(\mathcal {E}\), it may be beneficial to express \(\mathcal {T}_{\eta _k}\) as the restriction to \({{\mathrm{\mathrm {T}}}}_{x_k}\mathcal {M}\) of a low-rank modification of the identity (7). Instead of storing full dense matrices, one can then store a few vectors that implicitly represent them. This is the purpose of the limited memory version of RTR-SR1 presented in this section.

The proposed limited memory RTR-SR1, called LRTR-SR1, is described in Algorithm 2. It relies on a Riemannian generalization of the compact representation of the classical (Euclidean) SR1 matrices presented in [12, §5]. We set \(\mathcal {B}_0={{\mathrm{\mathrm {id}}}}\). At step \(k>0\), we first choose a basic Hessian approximation \(\mathcal {B}_0^k\), which in the Riemannian setting becomes a linear transformation of \({{\mathrm{\mathrm {T}}}}_{x_k}\mathcal {M}\). We advocate the choice

$$\begin{aligned} \mathcal {B}_0^k = \gamma _{k} {{\mathrm{\mathrm {id}}}}, \end{aligned}$$

where

$$\begin{aligned} \gamma _{k} = \frac{g(y_{k - 1}, y_{k - 1})}{g(s_{k - 1}, y_{k - 1})}, \end{aligned}$$

which generalizes a choice usually made in the Euclidean case [27, (7.20)]. As in the Euclidean case, we let \(S_{k,m}\) and \(Y_{k,m}\) contain the (at most) \(m\) most recent corrections, which in the Riemannian setting must be transported to \({{\mathrm{\mathrm {T}}}}_{x_k}\mathcal {M}\), yielding \(S_{k, m} = \{s_{k - \ell }^{(k)}, s_{k - \ell + 1}^{(k)}, \ldots , s_{k - 1}^{(k)}\}\) and \(Y_{k, m} = \{y_{k - \ell }^{(k)}, y_{k - \ell + 1}^{(k)}, \ldots , y_{k - 1}^{(k)}\}\), where \(\ell = \min \{m, k\}\) and where \(s^{(k)}\) denotes \(s\) transported to \({{\mathrm{\mathrm {T}}}}_{x_{k}} \mathcal {M}\). We then have the following Riemannian generalization of the limited-memory update based on [12, (5.2)]:

$$\begin{aligned} \mathcal {B}_k = \mathcal {B}_0^k + (Y_{k, m} - \mathcal {B}_0^k S_{k, m}) (P_{k, m} - S_{k, m}^{\flat } \mathcal {B}_0^k S_{k, m})^{-1} (Y_{k, m} - \mathcal {B}_0^k S_{k, m})^{\flat }, \quad k>0, \end{aligned}$$

where \(P_{k, m} = D_{k, m} + L_{k, m} + L_{k, m}^T\),

$$\begin{aligned} D_{k, m} = {{\mathrm{\mathrm {diag}}}}\{g(s_{k - \ell }, y_{k - \ell }), g(s_{k - \ell + 1}, y_{k - \ell + 1}), \ldots , g(s_{k - 1}, y_{k - 1})\}, \end{aligned}$$

and

$$\begin{aligned} (L_{k, m})_{i, j} = \left\{ \begin{array}{l@{\quad }l} g(s_{k - \ell + i - 1}, y_{k - \ell + j - 1}), &{} \hbox {if } i > j, \\ 0, &{} \hbox {otherwise.} \end{array} \right. \end{aligned}$$

Moreover, letting \(Q_{k, m}\) denote the matrix \(S_{k, m}^{\flat } S_{k, m}\), we obtain

$$\begin{aligned} \mathcal {B}_k = \gamma _{k} {{\mathrm{\mathrm {id}}}}+ (Y_{k, m} - \gamma _{k} S_{k, m}) (P_{k, m} - \gamma _{k} Q_{k, m})^{-1} (Y_{k, m} - \gamma _{k} S_{k, m})^{\flat }, \quad k>0.\qquad \end{aligned}$$
(46)

For all \(\eta \in {{\mathrm{\mathrm {T}}}}_{x_k} \mathcal {M}\), \(\mathcal {B}_k \eta \) can thus be obtained from (46) using \(Y_{k, m},\; S_{k, m},\; P_{k, m}\), and \(Q_{k, m}\). This is how \(\mathcal {B}_k\) is defined in Algorithm 2, except that Algorithm 2 also accounts for the technicality that the \(\mathcal {B}\) update may be skipped.
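For illustration, the following minimal sketch (Python with NumPy; not part of the original article) applies the operator (46) to a tangent vector. It assumes that the transported corrections \(s_i^{(k)}\) and \(y_i^{(k)}\) are stored as columns of matrices S and Y expressed in an orthonormal basis of \({{\mathrm{\mathrm {T}}}}_{x_k}\mathcal {M}\), so that the Riemannian metric reduces to the Euclidean inner product; the function name apply_lsr1 is ours, and the rule for skipping near-singular updates is omitted.

```python
import numpy as np

def apply_lsr1(S, Y, v):
    """Apply the limited-memory SR1 operator (46) to a coordinate vector v.

    Columns of S and Y hold the (at most m) most recent corrections,
    already transported to T_{x_k} M and expressed in an orthonormal
    basis, so that g(u, w) = u^T w.
    """
    if S.shape[1] == 0:
        return v.copy()                               # B_0 = id
    s_last, y_last = S[:, -1], Y[:, -1]
    gamma = (y_last @ y_last) / (s_last @ y_last)     # gamma_k

    SY = S.T @ Y                                      # entries g(s_i, y_j)
    D = np.diag(np.diag(SY))                          # D_{k,m}
    L = np.tril(SY, -1)                               # (L_{k,m})_{ij} = g(s_i, y_j), i > j
    P = D + L + L.T                                   # P_{k,m}
    Q = S.T @ S                                       # Q_{k,m}

    W = Y - gamma * S                                 # Y_{k,m} - gamma_k S_{k,m}
    # B_k v = gamma_k v + W (P_{k,m} - gamma_k Q_{k,m})^{-1} W^T v
    return gamma * v + W @ np.linalg.solve(P - gamma * Q, W.T @ v)
```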

[Algorithm 2 (LRTR-SR1): pseudocode displayed as a figure in the original article]

5 Numerical experiments

As an illustration, we investigate the performance of RTR-SR1 (Algorithm 1) and LRTR-SR1 (Algorithm 2) on a Rayleigh quotient minimization problem on the sphere and on a joint diagonalization (JD) problem on the Stiefel manifold.

For the Rayleigh quotient problem, the manifold \(\mathcal {M}\) is the sphere

$$\begin{aligned} \mathbb {S}^{n - 1} = \{x \in \mathbb {R}^n: x^T x = 1\} \end{aligned}$$

and the objective function \(f\) is defined by

$$\begin{aligned} f(x) = x^T A x, \end{aligned}$$
(47)

where \(A\) is a given \(n\)-by-\(n\) symmetric matrix. Minimizing the Rayleigh quotient of \(A\) is equivalent to computing its leftmost eigenvector (see, e.g., [2, §2.1.1]). The Rayleigh quotient problem provides convenient benchmarking experiments since, except for pathological cases, there is essentially one local minimizer (specifically, there is one pair of antipodal local minimizers), which can be readily computed by standard eigenvalue software for verification. Moreover, the Hessian of \(f\) is readily available, and this allows for a comparison with RTR-Newton [2, Ch. 7], which corresponds to Algorithm 1 with the exception that \(\mathcal {B}_k\) is replaced by the Hessian of \(f\) at \(x_k\) (and thus the vector transport is no longer needed).
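To fix ideas, here is a minimal sketch (Python with NumPy; not from the original article) of the cost (47) together with its Riemannian gradient and the action of its Riemannian Hessian under the embedded-sphere metric used in these experiments; the formulas are the standard tangent-space projections, and the function names are ours.

```python
import numpy as np

def rayleigh_cost(A, x):
    """f(x) = x^T A x on the unit sphere, cf. (47)."""
    return x @ (A @ x)

def sphere_proj(x, v):
    """Orthogonal projection of v onto T_x S^{n-1} (embedded metric)."""
    return v - (x @ v) * x

def rayleigh_grad(A, x):
    """Riemannian gradient: projection of the Euclidean gradient 2 A x."""
    return 2.0 * sphere_proj(x, A @ x)

def rayleigh_hess(A, x, eta):
    """Action of the Riemannian Hessian on a tangent vector eta."""
    return 2.0 * (sphere_proj(x, A @ eta) - (x @ (A @ x)) * eta)
```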

In the JD problem considered, the manifold \(\mathcal {M}\) is the (compact) Stiefel manifold,

$$\begin{aligned} \mathrm {St}(p,n) = \{X \in \mathbb {R}^{n \times p}: X^T X = I_p\}, \end{aligned}$$

and the objective function \(f\) is defined by

$$\begin{aligned} f(X) = - \sum _{i = 1}^N \Vert {{\mathrm{\mathrm {diag}}}}(X^T C_i X)\Vert ^2, \end{aligned}$$
(48)

where \(C_1,\ldots ,C_N\) are given symmetric matrices, \({{\mathrm{\mathrm {diag}}}}(M)\) denotes the vector formed by the diagonal entries of \(M\), and \(\Vert {{\mathrm{\mathrm {diag}}}}(M)\Vert ^2\) thus denotes the sum of the squared diagonal elements of \(M\). This problem has applications in independent component analysis for blind source separation [36].
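As an illustration (not from the original article; the gradient expression in [36, §2.3] is the authoritative one), the following minimal NumPy sketch evaluates the cost (48) and the tangent-space projection of its Euclidean gradient under the embedded Stiefel metric; the function names are ours.

```python
import numpy as np

def jd_cost(Cs, X):
    """f(X) = -sum_i ||diag(X^T C_i X)||^2, cf. (48)."""
    return -sum(np.sum(np.diag(X.T @ C @ X) ** 2) for C in Cs)

def stiefel_proj(X, Z):
    """Projection onto T_X St(p,n) for the embedded metric: Z - X sym(X^T Z)."""
    XtZ = X.T @ Z
    return Z - X @ (XtZ + XtZ.T) / 2.0

def jd_grad(Cs, X):
    """Riemannian gradient: projection of the Euclidean gradient of (48)."""
    G = np.zeros_like(X)
    for C in Cs:
        D = np.diag(np.diag(X.T @ C @ X))     # Diag(diag(X^T C_i X))
        G -= 4.0 * (C @ X @ D)
    return stiefel_proj(X, G)
```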

The comparisons are performed in Matlab 7.0.0 on a 32-bit Windows platform with a 2.4 GHz CPU (T8300).

The chosen Riemannian metric \(g\) on \(\mathbb {S}^{n - 1}\) is obtained by making \(\mathbb {S}^{n-1}\) a Riemannian submanifold of the Euclidean space \(\mathbb {R}^n\). The gradient and the Hessian of \(f\) (47) with respect to the metric are given in [2, §6.4]. The chosen Riemannian metric \(g\) on \(\mathrm {St}(p,n)\) is the one obtained by viewing \(\mathrm {St}(p,n)\) as a Riemannian submanifold of the Euclidean space \(\mathbb {R}^{n\times p}\), as in [17, §2.2]. With respect to this Riemannian metric, the gradient of the objective function (48), required in the three methods, is given in [36, §2.3]. The Riemannian Hessian of (48), required for RTR-Newton, is also given therein.

The initial Hessian approximation \(\mathcal {B}_0\) is set to the identity in RTR-SR1. The \(\theta ,\; \kappa \) parameters in the inner iteration stopping criterion [2, (7.10)] of the truncated CG inner iteration [2, §7.3.2] are set to 0.1, 0.9 for RTR-SR1 and LRTR-SR1 and to 1, 0.1 for RTR-Newton. The initial radius \(\varDelta _0\) is set to 1, \(\nu \) is the square root of machine epsilon, \(c\) is set to 0.1, \(\tau _1\) to 0.25, and \(\tau _2\) to 2.

For the retraction \(R\) on \(\mathbb {S}^{n - 1}\), following [2, Example 4.1.1], we choose \(R_x(\eta ) = (x + \eta ) / \Vert x + \eta \Vert _2\). For the retraction \(R\) on \(\mathrm {St}(p,n)\), following [2, (4.8)], we choose \(R_X(\eta ) = \mathrm {qf}(X+\eta )\) where \(\mathrm {qf}\) denotes the \(Q\) factor of the QR decomposition with nonnegative elements on the diagonal of \(R\).
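A minimal sketch (Python with NumPy; ours, not from the original article) of these two retractions follows; qf enforces a nonnegative diagonal of the triangular factor, as assumed above.

```python
import numpy as np

def retract_sphere(x, eta):
    """R_x(eta) = (x + eta) / ||x + eta||_2 on S^{n-1}."""
    v = x + eta
    return v / np.linalg.norm(v)

def qf(A):
    """Q factor of the QR decomposition of A with nonnegative diagonal of R."""
    Q, R = np.linalg.qr(A)
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs                  # flip columns so that diag(R) >= 0

def retract_stiefel(X, eta):
    """R_X(eta) = qf(X + eta) on St(p,n)."""
    return qf(X + eta)
```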

The chosen isometric vector transport \(\mathcal {T}\) on \(\mathbb {S}^{n-1}\) is the vector transport by rigging (7), which is (locally) uniquely defined in case of submanifolds of co-dimension 1. In the case of \(\mathbb {S}^{n-1}\), it turns out to be equivalent to the parallel translation along the shortest geodesic that joins the origin point \(x\) and the target point \(y\), i.e.,

$$\begin{aligned} \mathcal {T}_{\eta _x}\xi _x = \xi _x - \frac{2 y^T\xi _x}{\Vert x + y\Vert ^2} (x + y), \end{aligned}$$
(49)

where \(y = R_x(\eta _x)\). This operation is well defined whenever \(x\) and \(y\) are not antipodal points. On \(\mathrm {St}(p,n)\), since we will conduct experiments on problems of small dimension, we find it preferable to select a vector transport by parallelization (6), which amounts to selecting a smooth field of tangent bases \(B\) on \(\mathrm {St}(p,n)\). To this end, we note that \({{\mathrm{\mathrm {T}}}}_X \mathrm {St}(p, n) = \{X \varOmega + X_{\perp } K : \varOmega ^T = - \varOmega ,\ K \in \mathbb {R}^{(n - p) \times p}\}\), where the columns of \(X_\perp \in \mathbb {R}^{n\times (n-p)}\) form an orthonormal basis of the orthogonal complement of the column space of \(X\) (see, e.g., [2, Example 3.5.2]). Hence, an orthonormal basis of \({{\mathrm{\mathrm {T}}}}_X \mathrm {St}(p, n)\) is given by \(\{\frac{1}{\sqrt{2}} X (e_i e_j^T - e_j e_i^T): i=1,\ldots ,p, j=i+1,\ldots ,p\} \cup \{X_{\perp } \tilde{e}_i e_j^T, i = 1, \ldots , n-p, j = 1, \ldots , p\}\), where \((e_1,\ldots ,e_p)\) is the canonical basis of \(\mathbb {R}^p\) and \((\tilde{e}_1,\ldots ,\tilde{e}_{n - p})\) is the canonical basis of \(\mathbb {R}^{n - p}\). To ensure that the obtained field of tangent bases is smooth, we need to choose \(X_\perp \) as a smooth function of \(X\). This can be done locally by extracting the \(n-p\) last columns of the Gram-Schmidt orthonormalization of \(\begin{bmatrix} X&C \end{bmatrix}\) where \(C\) is a given \(n\times (n-p)\) matrix. In practice, we have observed that Matlab’s null function applied to \(X\) works adequately to produce an \(X_\perp \).
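The sphere transport (49) is straightforward to implement; the following sketch (Python with NumPy; ours) assumes, as in the text, that \(x\) and \(y = R_x(\eta _x)\) are not antipodal.

```python
import numpy as np

def transport_sphere(x, eta, xi):
    """Vector transport (49): move xi in T_x S^{n-1} along eta to T_y, y = R_x(eta)."""
    y = (x + eta) / np.linalg.norm(x + eta)
    w = x + y
    return xi - (2.0 * (y @ xi) / (w @ w)) * w
```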

The experiments on the Rayleigh quotient (47) are reported in Table 1 and Fig. 1. Since the Hessian is readily available, RTR-Newton is the a priori method of choice over the proposed RTR-SR1 and LRTR-SR1 methods. Nevertheless, Table 1 illustrates that it is possible to exhibit a matrix \(A\) for which LRTR-SR1 is faster than RTR-Newton. Specifically, matrix \(A\) in the Rayleigh quotient (47) is set to \(U D U^T\), where \(U\) is an orthonormal matrix obtained by orthonormalizing a random matrix whose elements are drawn from the standard normal distribution and \(D = {{\mathrm{\mathrm {diag}}}}(0, 0.01, \ldots , 0.01, 2, \ldots , 2)\) with 0.01 and 2 occurring \(n/2 - 1\) and \(n/2\) times, respectively. The initial iterate is generated randomly. The number of function evaluations is equal to the number of iterations \((iter)\). \(ng\) denotes the number of gradient evaluations. The differences between \(iter\) and \(ng\) that may be observed in RTR-Newton are due to occasional rejections of the candidate new iterate as prescribed in the trust-region framework [2, §7.2]; for RTR-SR1 and LRTR-SR1, \(iter\) and \(ng\) are identical because one evaluation of the gradient is required at each iterate to update \(\mathcal {B}_k\) or store \(y_k\) even if the candidate is rejected. \(nH\) denotes the number of operations of the form \({{\mathrm{\mathrm {Hess}}}}f(x) \eta \) or \(\mathcal {B} \eta \). \(t\) denotes the run time (seconds). To obtain sufficiently stable timing values, an average is taken over several identical runs for a total run time of at least one minute. \(gf_0\) and \(gf_f\) denote the initial and final norm of the gradient. Two stopping criteria are tested: \(gf_f/gf_0 < \)1e-3 and \(gf_f/gf_0 <\) 1e-6.
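For reference, a minimal sketch (Python with NumPy; ours, with an arbitrary seed) of the data generation just described for the Rayleigh quotient experiments.

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed
n = 1024

# A = U D U^T with D = diag(0, 0.01, ..., 0.01, 2, ..., 2),
# where 0.01 and 2 occur n/2 - 1 and n/2 times, respectively.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
d = np.concatenate(([0.0], np.full(n // 2 - 1, 0.01), np.full(n // 2, 2.0)))
A = U @ np.diag(d) @ U.T

# Random initial iterate on the sphere.
x0 = rng.standard_normal(n)
x0 /= np.linalg.norm(x0)
```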

Fig. 1 Comparison of RTR-Newton and the new methods RTR-SR1 and LRTR-SR1 for the Rayleigh quotient problem (47) with \(n=1024\)

Table 1 Rayleigh quotient experiments

Unsurprisingly, Table 1 shows that RTR-Newton, which exploits the Hessian of \(f\), requires fewer iterations than the SR1 methods, which do not use this information. However, when \(n\) gets large, the time per iteration of LRTR-SR1 (with moderate memory size \(m\)) becomes sufficiently small for the method to be faster than RTR-Newton.

Table 2 and Fig. 2 present the experimental results obtained for the JD problem (48). The \(C_i\) matrices are selected as \(C_i = {{\mathrm{\mathrm {diag}}}}(n, n-1, \ldots , 1) + \epsilon _{\mathrm {C}} (R_i + R_i^T)\), where the elements of \(R_i\in \mathbb {R}^{n\times n}\) are independently drawn from the standard normal distribution. Table 2 and Fig. 2 correspond to \(\epsilon _{\mathrm {C}}=0.1\), but we have observed similar results for a wide range of values of \(\epsilon _{\mathrm {C}}\). Table 2 indicates that RTR-Newton requires fewer iterations than RTR-SR1, which requires fewer iterations than LRTR-SR1. This was expected since RTR-Newton uses the Hessian of \(f\) while RTR-SR1 uses an inexact Hessian and LRTR-SR1 is further constrained by the limited memory. However, the iterations of RTR-Newton tend to be more time-consuming than those of the SR1 methods, all the more so as \(N\) gets large, since the number of terms in the Hessian of \(f\) is linear in \(N\). The experiments reported in Table 2 show that the trade-off between the number of iterations and the time per iteration is in favor of RTR-SR1 for \(N\) sufficiently large. Note also that, even though it is slower than the two other methods in the experiments presented in Table 2, LRTR-SR1 may be the method of choice in certain circumstances as it does not require the Hessian of \(f\) (unlike RTR-Newton) and it has a reduced memory usage in comparison with RTR-SR1.
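Similarly, a minimal sketch (Python with NumPy; ours, with an arbitrary seed) of how the \(C_i\) matrices described above can be generated.

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed
n, N, eps_C = 12, 1024, 0.1

base = np.diag(np.arange(n, 0, -1, dtype=float))    # diag(n, n-1, ..., 1)
Cs = [base + eps_C * (R + R.T)
      for R in (rng.standard_normal((n, n)) for _ in range(N))]
```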

Table 2 Joint diagonalization (JD) experiments: \(n = 12, p =4,\;\epsilon _{\mathrm {C}} = 1\hbox {e}-1\)
Fig. 2 Comparison of RTR-Newton and the new methods RTR-SR1 and LRTR-SR1 for the joint diagonalization problem (48) with \(n=12,\;p=4,\; N=1024\)

6 Conclusion

We have introduced a Riemannian SR1 trust-region method, where the second-order term of the model is generated using a Riemannian generalization of the classical SR1 update. Global convergence to stationary points and \(d+1\)-step superlinear convergence are guaranteed, and the experiments reported here show promise. A limited-memory version of the algorithm has also been presented. The new algorithms will be made available in the Manopt toolbox [9].