Abstract
The well-known symmetric rank-one trust-region method—where the Hessian approximation is generated by the symmetric rank-one update—is generalized to the problem of minimizing a real-valued function over a \(d\)-dimensional Riemannian manifold. The generalization relies on basic differential-geometric concepts, such as tangent spaces, Riemannian metrics, and the Riemannian gradient, as well as on the more recent notions of (first-order) retraction and vector transport. The new method, called RTR-SR1, is shown to converge globally and \(d+1\)-step q-superlinearly to stationary points of the objective function. A limited-memory version, referred to as LRTR-SR1, is also introduced. In this context, novel efficient strategies are presented to construct a vector transport on a submanifold of a Euclidean space. Numerical experiments—Rayleigh quotient minimization on the sphere and a joint diagonalization problem on the Stiefel manifold—illustrate the value of the new methods.
1 Introduction
We consider the problem
\(\min _{x\in \mathcal {M}} f(x) \qquad (1)\)
of minimizing a smooth real-valued function \(f\) defined on a Riemannian manifold \(\mathcal {M}\). Recently investigated application areas include image segmentation [30] and recognition [37], electrostatics and electronic structure calculation [39], finance and chemistry [7], multilinear algebra [21, 32], low-rank learning [8, 26], and blind source separation [25, 34].
The wealth of applications has stimulated the development of general-purpose methods for (1)—see, e.g., [2, 30, 31] and references therein—including the trust-region approach upon which we focus in this work. A well-known technique in optimization [16], the trust-region method was extended to Riemannian manifolds in [1] (or see [2, Ch. 7]), and found applications, e.g., in [8, 21, 23, 26, 38]. Trust-region methods construct a quadratic model \(m_k\) of the objective function \(f\) around the current iterate \(x_k\) and produce a candidate new iterate by (approximately) minimizing the model \(m_k\) within a region where it is “trusted”. Depending on the discrepancy between \(f\) and \(m_k\) at the candidate new iterate, the size of the trust region is updated and the candidate new iterate is accepted or rejected.
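To make the accept/reject and radius-update mechanism concrete, here is a schematic Euclidean toy version of one trust-region iteration. The inner solver is the Cauchy point, and all thresholds and factors are illustrative, not the constants of Algorithm 1:

```python
import numpy as np

def trust_region_step(f, grad, B, x, delta,
                      rho_accept=0.1, shrink=0.25, grow=2.0):
    """One schematic trust-region iteration (Euclidean toy version).

    B is the current Hessian approximation; the inner solver is the
    Cauchy point (steepest descent clipped to the trust region).
    Assumes grad(x) is nonzero.  All constants are illustrative.
    """
    g = grad(x)
    t = delta / np.linalg.norm(g)       # step to the boundary along -g
    gBg = g @ B @ g
    if gBg > 0:
        t = min(t, (g @ g) / gBg)       # unconstrained minimizer along -g
    s = -t * g
    pred = -(g @ s + 0.5 * s @ B @ s)   # predicted decrease m(0) - m(s)
    ared = f(x) - f(x + s)              # actual decrease
    rho = ared / pred                   # agreement between f and the model
    x_new = x + s if rho > rho_accept else x
    if rho > 0.75:
        delta_new = grow * delta        # model is trusted: enlarge the region
    elif rho < 0.1:
        delta_new = shrink * delta      # poor agreement: shrink the region
    else:
        delta_new = delta
    return x_new, delta_new
```

In the Riemannian setting of Algorithm 1, `x + s` is replaced by the retraction \(R_{x_k}(s_k)\) and the model lives on the tangent space \({{\mathrm{\mathrm {T}}}}_{x_k}\mathcal {M}\).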
For lack of efficient techniques to produce a second-order term in \(m_k\) that is inexact but nevertheless guarantees superlinear convergence, the Riemannian trust-region (RTR) framework loses some of its appeal when the exact second-order term—the Hessian of \(f\)—is not available. This is in contrast with the Euclidean case, where several strategies exist to build an inexact second-order term that preserves superlinear convergence of the trust-region method. Among these strategies, the symmetric rank-one (SR1) update is favored in view of its simplicity and because it preserves symmetry without unnecessarily enforcing positive definiteness; see, e.g., [27, §6.2] for a more detailed discussion. The \(n+1\)-step q-superlinear rate of convergence of the SR1 trust-region method was shown by Byrd et al. [11] using a sophisticated analysis that builds on [15, 24].
The classical (Euclidean) SR1 trust-region method can also be viewed as a quasi-Newton method enhanced with a trust-region globalization strategy. The idea of quasi-Newton methods on manifolds is not new [18, §4.5]; however, most of the literature of which we are aware restricts consideration to generalizations of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) quasi-Newton method combined with a line-search strategy. Early work such as [10, 32] used Riemannian BFGS methods for a specific application without analyzing their convergence properties, but more recently, systematic analyses of Riemannian BFGS methods based on the framework of retraction and vector transport developed in [2, 4] have appeared. Qi [29] analyzed a version of Riemannian BFGS with the retraction and vector transport restricted to the exponential mapping and parallel translation, and showed superlinear convergence using a Riemannian Dennis–Moré condition. Ring and Wirth [30] proposed and analyzed a Riemannian BFGS method that avoids the restrictions on retraction and vector transport assumed by Qi but that needs to resort to the derivative of the retraction. Seibert et al. [33] discussed the freedom available when generalizing BFGS to Riemannian manifolds and analyzed one generalization of the BFGS method on Riemannian manifolds that are isometric to \(\mathbb {R}^n\). Most recently, Huang [20] developed a complete convergence theory that avoids the restrictions of Qi and of Ring and Wirth, guarantees superlinear convergence for the Riemannian Broyden family of quasi-Newton methods (including a version of SR1), and facilitates efficient implementation.
In this paper, motivated by the situation described above, we introduce a generalization of the classical (i.e., Euclidean) SR1 trust-region method to the Riemannian setting (1). Besides making use of basic Riemannian geometric concepts (tangent space, Riemannian metric, gradient), the new method, called RTR-SR1, relies on the notions of retraction and vector transport introduced in [2, 4]. A detailed global and local convergence analysis is given. A limited-memory version of RTR-SR1, referred to as LRTR-SR1, is also introduced. Numerical experiments show that the RTR-SR1 method displays the expected convergence properties. When the Hessian of \(f\) is not available, RTR-SR1 thus offers an attractive way of tackling (1) by a trust-region approach. Moreover, even when the Hessian of \(f\) is available, making use of it can be expensive computationally, and the numerical experiments show that ignoring the Hessian information and resorting instead to the RTR-SR1 approach can be beneficial.
Another contribution of this paper with respect to [11] is an extension of the analysis to allow for inexact solutions of the trust-region subproblem—compare (10) with [11, (2.4)]. This extension makes it possible to resort to inner iterations such as the Steihaug–Toint truncated CG method (see [2, §7.3.2] for its Riemannian extension) while staying within the assumptions of the convergence analysis.
The paper is organized as follows. The RTR-SR1 method is stated and discussed in Sect. 2. The convergence analysis is carried out in Sect. 3. The limited-memory version is introduced in Sect. 4. Numerical experiments are reported in Sect. 5. Conclusions are drawn in Sect. 6.
2 The Riemannian SR1 trust-region method
The proposed Riemannian SR1 trust-region (RTR-SR1) method is described in Algorithm 1. The algorithm statement is commented on in Sect. 2.1, and the important questions of representing tangent vectors and choosing the vector transport are discussed in Sects. 2.2 and 2.3.
2.1 A guide to Algorithm 1
Algorithm 1 can be viewed as a Riemannian version of the classical (Euclidean) SR1 trust-region method (see, e.g., [27, Algorithm 6.2]). It can also be viewed as an SR1 version of the Riemannian trust-region framework [2, Algorithm 10 p. 142]. Therefore, several pieces of information given in [2, Ch. 7] remain relevant for Algorithm 1.
In particular, the algorithm statement makes use of standard Riemannian concepts that are described, e.g., in [2, 28], such as the tangent space \({{\mathrm{\mathrm {T}}}}_x\mathcal {M}\) to the manifold \(\mathcal {M}\) at a point \(x\), a Riemannian metric \(g\), and the gradient \({{\mathrm{\mathrm {grad}}}}f\) of a real-valued function \(f\) on \(\mathcal {M}\). The algorithm statement also relies on the notion of retraction, introduced in [4] (or see [2, §4.1]). A retraction \(R\) on \(\mathcal {M}\) is a smooth map from the tangent bundle \({{\mathrm{\mathrm {T}}}}\mathcal {M}\) (i.e., the set of all tangent vectors to \(\mathcal {M}\)) onto \(\mathcal {M}\) such that, for all \(x\in \mathcal {M}\) and all \(\xi _x\in {{\mathrm{\mathrm {T}}}}_x\mathcal {M}\), the curve \(t\mapsto R(t\xi _x)\) is tangent to \(\xi _x\) at \(t=0\). We let \(R_x\) denote the restriction of \(R\) to \({{\mathrm{\mathrm {T}}}}_x\mathcal {M}\). The domain of \(R\) need not be the entire tangent bundle, but this is usually the case in practice, and in this work we assume throughout that \(R\) is defined wherever needed. Specific ways of constructing retractions are proposed in [2, 3, 4]; see also [22, 39] for the important case of the Stiefel manifold.
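For concreteness, on the unit sphere \(S^{m-1}\subset \mathbb {R}^m\) the map \(R_x(\xi _x) = (x+\xi _x)/\Vert x+\xi _x\Vert \) is a standard retraction. A minimal numerical sketch (the function name is ours):

```python
import numpy as np

def retract_sphere(x, xi):
    """Projective retraction on the unit sphere S^{m-1}:
    R_x(xi) = (x + xi) / ||x + xi||, for x on the sphere and x^T xi = 0."""
    v = x + xi
    return v / np.linalg.norm(v)
```

The defining property can be checked numerically: \((R_x(t\xi _x) - x)/t \rightarrow \xi _x\) as \(t\rightarrow 0\), i.e., the curve \(t\mapsto R_x(t\xi _x)\) is tangent to \(\xi _x\) at \(t=0\).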
Within the Riemannian trust-region framework, the characterizing aspect of Algorithm 1 lies in the update mechanism for the Hessian approximation \(\mathcal {B}_k\). The proposed update mechanism, based on formula (3) and on Step 6 of Algorithm 1, is a rather straightforward Riemannian generalization of the classical SR1 update \(B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}\).
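As a concrete illustration, here is a minimal Euclidean sketch of the SR1 update with the usual safeguard against a vanishing denominator (the constant `nu` is illustrative):

```python
import numpy as np

def sr1_update(B, s, y, nu=1e-8):
    """Classical (Euclidean) SR1 update of the Hessian approximation B.

    Skips the update when the denominator (y - Bs)^T s is tiny relative
    to ||s|| * ||y - Bs||; the threshold nu is illustrative.
    """
    r = y - B @ s
    denom = r @ s
    if abs(denom) <= nu * np.linalg.norm(s) * np.linalg.norm(r):
        return B                        # skip: update would be unstable
    return B + np.outer(r, r) / denom   # symmetric rank-one correction
```

The update enforces the secant condition \(B_{k+1}s_k = y_k\) and preserves symmetry, but, unlike BFGS, it does not force positive definiteness.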
Significantly less straightforward is the Riemannian generalization of the superlinear convergence result, as we will see in Sect. 3.4. (Observe that the local convergence result [2, Theorem 7.4.11] does not apply here because the Hessian approximation condition [2, (7.36)] is not guaranteed to hold.)
Instrumental in the Riemannian SR1 update is the notion of vector transport, introduced in [2, §8.1] as a generalization of the classical Riemannian concept of parallel translation. A vector transport on a manifold \(\mathcal {M}\) on top of a retraction \(R\) is a smooth mapping \(\mathcal {T}: {{\mathrm{\mathrm {T}}}}\mathcal {M}\oplus {{\mathrm{\mathrm {T}}}}\mathcal {M}\rightarrow {{\mathrm{\mathrm {T}}}}\mathcal {M}: (\eta _x,\xi _x)\mapsto \mathcal {T}_{\eta _x}(\xi _x)\)
satisfying the following properties for all \(x\in \mathcal {M}\):
1. (Associated retraction) \(\mathcal {T}_{\eta _x}(\xi _x) \in {{\mathrm{\mathrm {T}}}}_{R_x(\eta _x)}\mathcal {M}\) for all \(\xi _x\in {{\mathrm{\mathrm {T}}}}_x\mathcal {M}\);
2. (Consistency) \(\mathcal {T}_{0_x}(\xi _x) = \xi _x\) for all \(\xi _x\in {{\mathrm{\mathrm {T}}}}_x\mathcal {M}\);
3. (Linearity) \(\mathcal {T}_{\eta _x}(a\xi _x+b\zeta _x) = a\mathcal {T}_{\eta _x}(\xi _x) + b\mathcal {T}_{\eta _x}(\zeta _x)\).
The Riemannian SR1 update uses tangent vectors at the current iterate to produce a new Hessian approximation at the next iterate, hence the need to perform a vector transport (see Step 6) from the current iterate to the next.
In the Input step of Algorithm 1, the requirement that the vector transport \(\mathcal {T}_S\) is isometric means that, for all \(x\in \mathcal {M}\) and all \(\xi _x,\zeta _x,\eta _x\in {{\mathrm{\mathrm {T}}}}_x\mathcal {M}\), the equation \(g_{R_x(\eta _x)}\big (\mathcal {T}_{S_{\eta _x}}\xi _x, \mathcal {T}_{S_{\eta _x}}\zeta _x\big ) = g_x(\xi _x,\zeta _x)\)
holds. Techniques for constructing an isometric vector transport on submanifolds of Euclidean spaces are described in Sect. 2.3.
The symmetry requirement on \(\mathcal {B}_0\) with respect to the Riemannian metric \(g\) means that \(g(\mathcal {B}_0\xi _{x_0},\eta _{x_0}) = g(\xi _{x_0},\mathcal {B}_0\eta _{x_0})\) for all \(\xi _{x_0},\eta _{x_0}\in {{\mathrm{\mathrm {T}}}}_{x_0}\mathcal {M}\). It is readily seen from (3) and Step 6 of Algorithm 1 that \(\mathcal {B}_k\) is symmetric for all \(k\). Note however that \(\mathcal {B}_k\) is, in general, not positive definite.
A possible stopping criterion for Algorithm 1 is \(\Vert {{\mathrm{\mathrm {grad}}}}f(x_{k})\Vert < \epsilon \) for some specified \(\epsilon >0\), where \(\Vert \cdot \Vert \), which also appears in the statement of Algorithm 1, denotes the norm induced by the Riemannian metric \(g\), i.e., \(\Vert \xi _x\Vert = \sqrt{g_x(\xi _x,\xi _x)}\) for all \(\xi _x\in {{\mathrm{\mathrm {T}}}}_x\mathcal {M}\). (5)
In the spirit of [30, Remark 4], we point out that it is possible to formulate the SR1 update (3) in the new tangent space \({{\mathrm{\mathrm {T}}}}_{x_{k+1}}\mathcal {M}\); in the present case of SR1, the algorithm remains equivalent since the vector transport is isometric.
For the rest, Algorithm 1 calls for no comments beyond those made in [2, Ch. 7]. In particular, we point out that the meaning of "approximately" in Step 2 of Algorithm 1 depends on the desired convergence results. We will see in the convergence analysis (Sect. 3) that enforcing the Cauchy decrease (9) is enough to ensure global convergence to stationary points, but another condition such as (10) is needed to guarantee superlinear convergence. The truncated CG method, discussed in [2, §7.3.2] in the Riemannian context, is an inner iteration for Step 2 that returns an \(s_k\) satisfying conditions (9) and (10).
2.2 Representation of tangent vectors
Let us now consider the frequently encountered situation where the manifold \(\mathcal {M}\) is described as a \(d\)-dimensional submanifold of an \(m\)-dimensional Euclidean space \(\mathcal {E}\). In particular, this is the case of the sphere and the Stiefel manifold involved in the numerical experiments in Sect. 5.
A tangent vector in \({{\mathrm{\mathrm {T}}}}_x\mathcal {M}\) can be represented either by its \(d\)-dimensional vector of coordinates in a given basis \(B_x\) of \({{\mathrm{\mathrm {T}}}}_x\mathcal {M}\), or else as an \(m\)-dimensional vector in \(\mathcal {E}\) since \({{\mathrm{\mathrm {T}}}}_x\mathcal {M}\subset {{\mathrm{\mathrm {T}}}}_x\mathcal {E}\simeq \mathcal {E}\). The latter option may be preferable when the codimension \(m - d\) is small (e.g., the sphere) because building, storing and manipulating the basis \(B_x\) of \({{\mathrm{\mathrm {T}}}}_x\mathcal {M}\) may be inconvenient.
Likewise, since \(\mathcal {B}_k\) is a linear transformation of \({{\mathrm{\mathrm {T}}}}_{x_k}\mathcal {M}\), it can be represented in the basis \(B_x\) as a \(d\times d\) matrix, or as an \(m\times m\) matrix restricted to act on \({{\mathrm{\mathrm {T}}}}_{x_k}\mathcal {M}\). Here again, the latter approach may be computationally more efficient when the codimension \(m-d\) is small.
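A small sketch on the sphere illustrates that the ambient \(m\times m\) and the \(d\times d\) coordinate representations encode the same operator. The basis construction below is an illustrative pointwise QR factorization (all function names are ours):

```python
import numpy as np

def tangent_projector(x):
    """Orthogonal projector onto T_x S^{m-1} = {v in R^m : x^T v = 0}."""
    return np.eye(x.size) - np.outer(x, x)

def tangent_basis(x, rng):
    """Orthonormal basis (columns) of T_x S^{m-1}, by QR-factorizing
    projected random vectors.  A pointwise construction, for illustration."""
    G = rng.standard_normal((x.size, x.size - 1))
    Q, _ = np.linalg.qr(tangent_projector(x) @ G)
    return Q

def coordinate_matrix(B_amb, B_x):
    """(m-1) x (m-1) coordinate representation, in the basis B_x, of an
    ambient m x m operator B_amb that maps T_x M into T_x M."""
    return B_x.T @ B_amb @ B_x
```

For any tangent vector \(\xi \), the coordinates of \(\mathcal {B}\xi \) computed via the ambient representation agree with those computed via the coordinate matrix, since \(B_xB_x^T\) equals the tangent-space projector.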
A related choice has to be made for the representation of the vector transport, since \(\mathcal {T}_{\eta _x}\) is a linear map from \({{\mathrm{\mathrm {T}}}}_x\mathcal {M}\) to \({{\mathrm{\mathrm {T}}}}_{R_x(\eta _x)}\mathcal {M}\). This question is addressed in Sect. 2.3.
2.3 Isometric vector transport
We present two ways of constructing an isometric vector transport on a \(d\)-dimensional submanifold \(\mathcal {M}\) of an \(m\)-dimensional Euclidean space \(\mathcal {E}\).
2.3.1 Vector transport by parallelization
An open subset \(\mathcal {U}\) of \(\mathcal {M}\) is termed parallelizable if it admits a smooth field of tangent bases, i.e., a smooth function \(B: \mathcal {U} \rightarrow \mathbb {R}^{m \times d} : z \mapsto B_z\) where \(B_z\) is a basis of \({{\mathrm{\mathrm {T}}}}_z\mathcal {M}\). The whole manifold \(\mathcal {M}\) itself may not be parallelizable; in particular, no manifold of nonzero Euler characteristic is parallelizable [35, §1.6], and it is also known that the only parallelizable spheres are those of dimension 1, 3, and 7 [6, p. 116]. However, for the global convergence analysis carried out in Sect. 3.2, the vector transport is not required to be smooth or even continuous, and for the local convergence analysis in Sect. 3.4, we only need a parallelizable neighborhood \(\mathcal {U}\) of the limit point \(x^*\). Such a neighborhood always exists (take for example a coordinate neighborhood [2, p. 37]).
Given an orthonormal smooth field of tangent bases \(B\), i.e., such that \(B_x^\flat B_x=I\) for all \(x\) (where \(I\) stands for the identity matrix of adequate size), the proposed isometric vector transport from \({{\mathrm{\mathrm {T}}}}_x \mathcal {M}\) to \({{\mathrm{\mathrm {T}}}}_y \mathcal {M}\) is given by \(\mathcal {T}_{\eta _x}(\xi _x) = B_y B_x^\flat \xi _x\), where \(y = R_x(\eta _x)\).
The \(d\times d\) matrix representation of this vector transport in the pair of bases \((B_x,B_y)\) is simply the identity. This considerably simplifies the implementation of Algorithm 1.
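A sketch of this transport on the sphere with the induced metric, where \(B_x^\flat = B_x^T\) for orthonormal bases (the pointwise QR basis construction is illustrative and does not form a smooth field):

```python
import numpy as np

def tangent_basis(z, rng):
    """Orthonormal basis (columns) of T_z S^{m-1} via QR of projected
    random vectors.  Pointwise construction; a smooth field of bases
    exists only on a parallelizable neighborhood."""
    P = np.eye(z.size) - np.outer(z, z)
    Q, _ = np.linalg.qr(P @ rng.standard_normal((z.size, z.size - 1)))
    return Q

def transport_by_parallelization(B_x, B_y, xi):
    """Send xi in T_x M to B_y (B_x^T xi) in T_y M: read off the
    coordinates of xi in the basis B_x and reassemble them in B_y."""
    return B_y @ (B_x.T @ xi)
```

Because both bases have orthonormal columns, the transport preserves norms, and its matrix in the pair of bases \((B_x, B_y)\) is the identity.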
2.3.2 Vector transport by rigging
If \(\mathcal {M}\) is described as a \(d\)-dimensional submanifold of an \(m\)-dimensional Euclidean space \(\mathcal {E}\) and the codimension \((m-d)\) is much smaller than the dimension \(d\), then the vector transport by rigging, introduced next, may be preferable. For generality, we do not assume that \(\mathcal {M}\) is a Riemannian submanifold of \(\mathcal {E}\); in other words, the Riemannian metric \(g\) on \(\mathcal {M}\) may not be the one induced by the metric of \(\mathcal {E}\). A motivation for this generality is to be able to handle the canonical metric of the Stiefel manifold [17, (2.22)]. For simplicity of the exposition, we work in an orthonormal basis of \(\mathcal {E}\) and, for \(x\in \mathcal {M}\), we let \(G_x\) denote a matrix expression of \(g_x\), i.e., \(g_x(\xi _x,\eta _x) = \xi _x^TG_x\eta _x\) for all \(\xi _x,\eta _x\in {{\mathrm{\mathrm {T}}}}_x\mathcal {M}\).
An open subset \(\mathcal {U}\) of \(\mathcal {M}\) is termed rigged if it admits a smooth field of normal bases, i.e., a smooth function \(N: \mathcal {U} \rightarrow \mathbb {R}^{m\times (m-d)} : z \mapsto N_z\) where \(N_z\) is a basis of the normal space \(\mathrm{N}_z\mathcal {M}\). The whole manifold \(\mathcal {M}\) itself may not be rigged, but it is always locally rigged.
Given a smooth field of normal bases \(N\), the proposed isometric vector transport \(\mathcal {T}\) from \({{\mathrm{\mathrm {T}}}}_x \mathcal {M}\) to \({{\mathrm{\mathrm {T}}}}_y \mathcal {M}\) is defined as follows. Compute \((I-N_x(N_x^TN_x)^{-1}N_x^T)N_y\) (i.e., the orthogonal projection of \(N_y\) onto \({{\mathrm{\mathrm {T}}}}_x\mathcal {M}\)) and observe that its column space is \({{\mathrm{\mathrm {T}}}}_x\mathcal {M} \ominus ({{\mathrm{\mathrm {T}}}}_x\mathcal {M} \cap {{\mathrm{\mathrm {T}}}}_y\mathcal {M})\). Obtain a matrix \(Q_x\) with orthonormal columns by Gram-Schmidt orthonormalizing \((I-N_x(N_x^TN_x)^{-1}N_x^T)N_y\). Proceed likewise with \(x\) and \(y\) interchanged to get \(Q_y\). Finally, let
While it is clear that \(\mathcal {T}\) satisfies the three properties of vector transport mentioned in Sect. 2.1, proving that \(\mathcal {T}\) is (locally) smooth remains an open question. Moreover, the column space of \((I-N_x(N_x^TN_x)^{-1}N_x^T)N_y\) becomes more sensitive to numerical errors as the distance between \(x\) and \(y\) decreases. Nevertheless, there is numerical evidence that \(\mathcal {T}\) is indeed smooth, and we have observed that using vector transport by rigging in Algorithm 1 is a worthy alternative in large-scale, low-codimension problems.
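On the sphere (codimension 1, with normal basis \(N_z = z\)) the construction simplifies considerably: the Gram-Schmidt step degenerates to a normalization. The sketch below implements one natural isometric candidate consistent with the description above, fixing the common subspace of \({{\mathrm{\mathrm {T}}}}_x\mathcal {M}\) and \({{\mathrm{\mathrm {T}}}}_y\mathcal {M}\) and sending \(q_x\) to \(-q_y\); it is an illustration, not necessarily the exact transport defined by the formula above, and the sign choice is a convention:

```python
import numpy as np

def rigging_transport_sphere(x, y):
    """Isometric candidate map T_x S^{m-1} -> T_y S^{m-1} from the rigging
    N_z = z: keep the common tangent subspace fixed, rotate q_x to -q_y."""
    m = x.size
    q_x = (np.eye(m) - np.outer(x, x)) @ y   # project N_y = y onto T_x M
    q_x /= np.linalg.norm(q_x)               # Gram-Schmidt reduces to a normalization
    q_y = (np.eye(m) - np.outer(y, y)) @ x   # likewise with x and y interchanged
    q_y /= np.linalg.norm(q_y)
    def transport(xi):
        a = q_x @ xi                         # component of xi along q_x
        return (xi - a * q_x) - a * q_y      # fixed part plus rotated part
    return transport
```

One checks that \(\xi - (q_x^T\xi )q_x\) lies in \({{\mathrm{\mathrm {T}}}}_x\mathcal {M}\cap {{\mathrm{\mathrm {T}}}}_y\mathcal {M}\) and is orthogonal to \(q_y\), so the map is isometric and lands in \({{\mathrm{\mathrm {T}}}}_y\mathcal {M}\).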
3 Convergence analysis of RTR-SR1
3.1 Notation and standing assumptions
Throughout the convergence analysis, unless otherwise specified, we let \(\{x_k\},\;\{\mathcal {B}_k\},\; \{\tilde{\mathcal {B}}_k\},\; \{s_k\},\; \{y_k\}\), and \(\{\varDelta _k\}\) be infinite sequences generated by Algorithm 1, and we make use of the notation introduced in that algorithm. We let \(\varOmega \) denote the sublevel set of \(x_0\), i.e., \(\varOmega = \{x\in \mathcal {M}: f(x)\le f(x_0)\}\).
The global and local convergence analyses each make standing assumptions at the beginning of their respective sections. The numbered assumptions introduced below are not standing assumptions and will be invoked specifically whenever needed. Note that, apart from Assumption 6, all the numbered assumptions are Riemannian generalizations of assumptions made in [11] for the analysis of the Euclidean SR1 trust-region method.
3.2 Global convergence analysis
In some results, we will assume for the retraction \(R\) that there exist \(\mu >0\) and \(\delta _\mu >0\) such that
This corresponds to [2, (7.25)] restricted to the sublevel set \(\varOmega \). Such a condition is instrumental in the global convergence analysis of Riemannian trust-region schemes. Note that, in view of [30, Lemma 6], condition (8) can be shown to hold globally under the condition that \(R\) has equicontinuous derivatives.
The next assumption corresponds to [11, (A3)].
Assumption 1
The sequence of linear operators \(\{\mathcal {B}_k\}\) is bounded: there exists a constant \(M\) such that \(\Vert \mathcal {B}_k\Vert \le M\) for all \(k\).
We will often require that the trust-region subproblem (2) is solved accurately enough that, for some positive constants \(\sigma _1\) and \(\sigma _2\),
and that
where \(\theta >0\) is a constant. These conditions are generalizations of [11, (2.3–4)]. Observe that, even if we restrict to the Euclidean case, condition (10) remains weaker than condition [11, (2.4)]. The purpose of introducing \(\delta _k\) in (10) is to encompass stopping criteria such as [2, (7.10)] that do not require the computation of an exact solution of the trust-region subproblem. We point out in particular that (9) and (10) hold if the approximate solution of the trust-region subproblem (2) is obtained from the truncated CG method, described in [2, §7.3.2] in the Riemannian context.
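For reference, here is a schematic Euclidean version of the Steihaug-Toint truncated CG inner iteration, with a residual-based stopping test in the spirit of [2, (7.10)]; the parameters `theta` and `kappa` are illustrative:

```python
import numpy as np

def boundary_step(s, d, delta):
    """Positive root tau of ||s + tau d|| = delta."""
    a, b, c = d @ d, 2 * s @ d, s @ s - delta ** 2
    return (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)

def truncated_cg(g, B, delta, theta=1.0, kappa=0.1, max_it=100):
    """Steihaug-Toint truncated CG for min g^T s + 0.5 s^T B s, ||s|| <= delta.
    Stops on the trust-region boundary, on negative curvature, or when the
    model-gradient residual is sufficiently reduced (schematic version)."""
    s = np.zeros_like(g)
    r = g.copy()                  # residual = gradient of the model at s
    d = -r
    r0 = np.linalg.norm(r)
    for _ in range(max_it):
        Bd = B @ d
        curv = d @ Bd
        if curv <= 0:
            # Negative curvature: follow d to the trust-region boundary.
            return s + boundary_step(s, d, delta) * d
        alpha = (r @ r) / curv
        if np.linalg.norm(s + alpha * d) >= delta:
            # The CG step leaves the region: stop on the boundary.
            return s + boundary_step(s, d, delta) * d
        s = s + alpha * d
        r_new = r + alpha * Bd
        if np.linalg.norm(r_new) <= r0 * min(kappa, r0 ** theta):
            return s              # residual sufficiently reduced
        beta = (r_new @ r_new) / (r @ r)
        d = -r_new + beta * d
        r = r_new
    return s
```

Each iterate produces at least the Cauchy decrease, and the residual test yields the superlinear-convergence behavior that conditions of type (10) are designed to capture.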
We can now state and prove the main global convergence results. Point (iii) generalizes [11, Theorem 2.1] while points (i) and (ii) are based on [2, §7.4.1].
Theorem 1
(convergence) (i) If \(f\in C^2\) is bounded below on the sublevel set \(\varOmega \), Assumption 1 holds, condition (9) holds, and (8) is satisfied, then \(\lim _{k\rightarrow \infty } {{\mathrm{\mathrm {grad}}}}f(x_k)=0\). (ii) If \(f\in C^2\), \(\mathcal {M}\) is compact, Assumption 1 holds, and (9) holds, then \(\lim _{k\rightarrow \infty }{{\mathrm{\mathrm {grad}}}}f(x_k)=0\), \(\{x_k\}\) has at least one limit point, and every limit point of \(\{x_k\}\) is a stationary point of \(f\). (iii) If \(f\in C^2\), the sublevel set \(\varOmega \) is compact, \(f\) has a unique stationary point \(x^*\) in \(\varOmega \), Assumption 1 holds, condition (9) holds, and (8) is satisfied, then \(\{x_k\}\) converges to \(x^*\).
Proof
(i) Observe that the proof of [2, Theorem 7.4.4] still holds when condition [2, (7.25)] is weakened to its restriction (8) to \(\varOmega \). Indeed, since the trust-region method is a descent iteration, it follows that all iterates are in \(\varOmega \). The assumptions thus allow us to conclude, by [2, Theorem 7.4.4], that \(\lim _{k\rightarrow \infty }{{\mathrm{\mathrm {grad}}}}f(x_k)=0\). (ii) It follows from [2, Proposition 7.4.5] and [2, Corollary 7.4.6] that all the assumptions of [2, Theorem 7.4.4] hold. Hence \(\lim _{k\rightarrow \infty }{{\mathrm{\mathrm {grad}}}}f(x_k)=0\), and every limit point is thus a stationary point of \(f\). Since \(\mathcal {M}\) is compact, \(\{x_k\}\) is guaranteed to have at least one limit point. (iii) Again by [2, Theorem 7.4.4], we get that \(\lim _{k\rightarrow \infty }{{\mathrm{\mathrm {grad}}}}f(x_k)=0\). Since \(\{x_k\}\) belongs to the compact set \(\varOmega \) and cannot have limit points other than \(x^*\), it follows that \(\{x_k\}\) converges to \(x^*\). \(\square \)
3.3 More notation and standing assumptions
For the purpose of conducting a local convergence analysis, we now assume that \(\{x_k\}\) converges to a point \(x^*\). Moreover, we assume throughout that \(f\in C^2\).
We let \(\mathcal {U}_{\mathrm {trn}}\) be a totally retractive neighborhood of \(x^*\), a concept inspired by the notion of totally normal neighborhood (see [13, §3.3]). By this, we mean that there is \(\delta _{\mathrm{trn}}>0\) such that, for each \(y\in \mathcal {U}_{\mathrm {trn}}\), we have that \(R_y(\mathbb {B}(0_y,\delta _{\mathrm {trn}}))\supseteq \mathcal {U}_{\mathrm {trn}}\) and \(R_y(\cdot )\) is a diffeomorphism on \(\mathbb {B}(0_y,\delta _{\mathrm {trn}})\), where \(\mathbb {B}(0_y,\delta _{\mathrm {trn}})\) denotes the ball of radius \(\delta _{\mathrm {trn}}\) in \({{\mathrm{\mathrm {T}}}}_y\mathcal {M}\) centered at the origin \(0_y\). The existence of a totally retractive neighborhood can be shown along the lines of [13, Theorem 3.3.7]. We assume without loss of generality that \(\{x_k\}\subset \mathcal {U}_{\mathrm {trn}}\). Whenever we consider an inverse retraction \(R_x^{-1}(y)\), we implicitly assume that \(x,y\in \mathcal {U}_{\mathrm {trn}}\).
3.4 Local convergence analysis
The purpose of this section is to obtain a superlinear convergence result for Algorithm 1, stated in Theorem 2. The analysis can be viewed as a Riemannian generalization of the local analysis in [11, §2]. As we proceed, we will point out the main hurdles that had to be overcome in the generalization. The analysis makes use of several preparation lemmas, independent of Algorithm 1, that are of potential interest in the broader context of Riemannian optimization. These preparation lemmas become trivial or well known in the Euclidean context.
The next assumption corresponds to a part of [11, (A1)].
Assumption 2
The point \(x^*\) is a nondegenerate local minimizer of \(f\). In other words, \({{\mathrm{\mathrm {grad}}}}f(x^*)=0\) and \({{\mathrm{\mathrm {Hess}}}}f(x^*)\) is positive definite.
The next assumption generalizes the assumption, contained in [11, (A1)], that the Hessian of \(f\) is Lipschitz continuous near \(x^*\). (Recall that \(\mathcal {T}_S\) is the vector transport invoked in Algorithm 1.) Note that the assumption holds if \(f\in C^3\); see Lemma 4.
Assumption 3
There exists a constant \(c_0\) such that for all \(x, y \in \mathcal {U}_{\mathrm {trn}}\),
where \({{\mathrm{\mathrm {Hess}}}}f(x)\) is the Riemannian Hessian of \(f\) at \(x\) (see, e.g., [2, §5.5]), \(\eta = R_x^{-1}(y)\), and \(\Vert \cdot \Vert \) is also used to denote the operator norm induced by the Riemannian norm (5).
The next assumption is introduced to handle the Riemannian case; in the classical Euclidean setting, Assumption 4 follows from Assumption 3. Assumption 4 is mild since it holds if \(f\in C^3\), as shown in Lemma 4.
Assumption 4
There exists a constant \(c_0\) such that for all \(x, y \in \mathcal {U}_{\mathrm {trn}}\), all \(\xi _x \in {{\mathrm{\mathrm {T}}}}_x \mathcal {M}\) with \(R_x(\xi _x) \in \mathcal {U}_{\mathrm {trn}}\), and all \(\xi _y \in {{\mathrm{\mathrm {T}}}}_y \mathcal {M}\) with \(R_y(\xi _y) \in \mathcal {U}_{\mathrm {trn}}\), the inequality
holds, where \(\eta = R_x^{-1}(y),\; \hat{f}_x = f \circ R_x\), and \(\hat{f}_y = f \circ R_y\).
The next assumption corresponds to [11, (A2)]. It implies that no updates of \(\mathcal {B}_k\) are skipped. In the Euclidean case, Khalfan et al. [24] show that this is usually the case in practice.
Assumption 5
The inequality
holds.
The next assumption is introduced to handle the Riemannian case. It states that the iterates eventually remain continuously in the totally retractive neighborhood \(\mathcal {U}_{\mathrm {trn}}\) (the terminology is borrowed from [5, Definition 2.8]). The assumption is needed, in particular, for Lemma 5. Note that, whereas in the Euclidean setting the assumption follows from the standing assumption that \(\{x_k\}\) converges to \(x^*\), this is no longer the case on some Riemannian manifolds, where \(\{x_k\}\) may converge to \(x^*\) while the connecting segments \(\{R_{x_k}(ts_k): t\in [0,1]\}\) do not remain close to \(x^*\). Assumption 6 is thus invoked to ensure that we are in a position to carry out a local convergence analysis.
Assumption 6
There exists \(N\) such that, for all \(k\ge N\) and all \(t\in [0,1]\), \(R_{x_k}(t s_k) \in \mathcal {U}_{\mathrm {trn}}\).
The next lemma is proved in [19, Lemma 14.1].
Lemma 1
Let \(\mathcal {M}\) be a Riemannian manifold, let \(\mathcal {U}\) be a compact coordinate neighborhood in \(\mathcal {M}\), and let the hat denote coordinate expressions. Then there are \(c_2 > c_1 > 0\) such that, for all \(x, y \in \mathcal {U}\), we have \(c_1 \Vert \hat{x} - \hat{y}\Vert _2 \le {{\mathrm{\mathrm {dist}}}}(x, y) \le c_2 \Vert \hat{x} - \hat{y}\Vert _2\),
where \(\Vert \cdot \Vert _2\) denotes the Euclidean norm, i.e., \(\Vert \hat{x}\Vert _2 = \sqrt{\hat{x}^T\hat{x}}\).
Lemma 2
Let \(\mathcal {M}\) be a Riemannian manifold endowed with a retraction \(R\) and let \(\bar{x} \in \mathcal {M}\). Then there exist \(a_0 > 0\), \(a_1 > 0\), and \(\delta _{a_0, a_1} > 0\) such that for all \(x\) in a sufficiently small neighborhood of \(\bar{x}\) and all \(\xi , \eta \in {{\mathrm{\mathrm {T}}}}_x \mathcal {M}\) with \(\Vert \xi \Vert \le \delta _{a_0, a_1}\) and \(\Vert \eta \Vert \le \delta _{a_0, a_1}\), the inequalities \(a_0 \Vert \xi - \eta \Vert \le {{\mathrm{\mathrm {dist}}}}(R_x(\xi ), R_x(\eta )) \le a_1 \Vert \xi - \eta \Vert \)
hold.
Proof
Since \(R\) is smooth, we can choose a neighborhood small enough such that \(R\) satisfies the condition of [30, Lemma 6], and the result follows from that lemma.\(\square \)
The following lemma follows from Lemma 2 by setting \(\eta =0\). We state it separately for convenience as we will frequently invoke it in the analysis.
Lemma 3
Let \(\mathcal {M}\) be a Riemannian manifold endowed with retraction \(R\) and let \(\bar{x} \in \mathcal {M}\). Then there exist \(a_0 > 0\), \(a_1 > 0\), and \(\delta _{a_0, a_1} > 0\) such that for all \(x\) in a sufficiently small neighborhood of \(\bar{x}\) and all \(\xi \in {{\mathrm{\mathrm {T}}}}_x \mathcal {M}\) with \(\Vert \xi \Vert \le \delta _{a_0, a_1}\), the inequalities \(a_0 \Vert \xi \Vert \le {{\mathrm{\mathrm {dist}}}}(x, R_x(\xi )) \le a_1 \Vert \xi \Vert \)
hold.
Lemma 4
If \(f\in C^3\), then Assumptions 3 and 4 hold.
Proof
First, we prove that Assumption 3 holds. Define a function \(h: \mathcal {M} \times \mathcal {M} \times {{\mathrm{\mathrm {T}}}}\mathcal {M} \rightarrow {{\mathrm{\mathrm {T}}}}\mathcal {M}, (x, y, \xi _y) \mapsto \mathcal {T}_{S_{\eta }} {{\mathrm{\mathrm {Hess}}}}f(x) \mathcal {T}_{S_{\eta }}^{-1} \xi _y\), where \(\eta = R_x^{-1}(y)\). Since \(f \in C^3\), we know that \(h(x, y, \xi _y)\) is \(C^1\). Then there exists \(b_0\) such that for all \(x, y \in \mathcal {U}_{\mathrm {trn}},\) \(\xi _y \in {{\mathrm{\mathrm {T}}}}_y \mathcal {M}, \Vert \xi _y\Vert = 1\),
where \(b_0\), \(b_1\) and \(b_2\) are some constants. So we have
Given any linear operator \(A\) on \({{\mathrm{\mathrm {T}}}}_y \mathcal {M}\), the operator norm \(\Vert A\Vert \) is by definition \(\sup _{\Vert \xi \Vert =1} \Vert A \xi \Vert \). Note that the unit sphere \(\{\xi \in {{\mathrm{\mathrm {T}}}}_y\mathcal {M}: \Vert \xi \Vert =1\}\) is compact. Hence, there exists \(\xi ^*\) with \(\Vert \xi ^*\Vert = 1\) such that \(\Vert A\Vert = \Vert A \xi ^*\Vert \). Therefore, we can choose \(\xi _y\) with \(\Vert \xi _y\Vert = 1\) such that
We obtain
To prove Assumption 4, we redefine \(h\) as \(h(y, x, \xi _x) = \mathcal {T}_{S_{\eta }} {{\mathrm{\mathrm {Hess}}}}\hat{f}_x(\xi _x) \mathcal {T}_{S_{\eta }}^{-1}\). If we use orthonormal vector fields to obtain the coordinate expression of \(h\), denoted by \(\hat{h}\), then the manifold norm and the Euclidean norm of coordinate expressions are the same and we have
Since \(f \in C^3\), we know that \(\hat{h}\) is also in \(C^1\). Hence there exists a constant \(b_3\) such that
Therefore,
This and (11) give us Assumption 4.\(\square \)
The next lemma generalizes [11, Lemma 2.2]. The key difference with the Euclidean case is the following: in the Euclidean case, when \(s_k\) is accepted, we simply have \(\Vert s_k\Vert =\Vert x_{k+1}-x_{k}\Vert \), while in the Riemannian generalization, we invoke Assumption 6 and Lemma 3 to deduce that \(\Vert s_k\Vert \le \frac{1}{a_0} {{\mathrm{\mathrm {dist}}}}(x_{k + 1}, x_k)\). Note that Assumption 6 cannot be removed. To see this, consider for example the unit sphere with the exponential retraction, where we can have \(x_k=x_{k+1}\) with \(\Vert s_k\Vert =2\pi \).
Lemma 5
Suppose Assumption 6 holds. Then either
or there exist \(K > 0\) and \(\varDelta > 0\) such that for all \(k > K\)
In either case \(s_k \rightarrow 0\).
Proof
Let \(\varDelta = \liminf \varDelta _k\) and suppose first that \(\varDelta > 0\). From line 1 of Algorithm 1, if \(\varDelta _k\) is increased, then \(\Vert s_k\Vert \ge 0.8 \varDelta _k\) and \(x_{k+1} = R_{x_k}(s_k)\), which implies by Lemma 3 and Assumption 6 that \({{\mathrm{\mathrm {dist}}}}(x_k,x_{k+1})\ge 0.8\, a_0 \varDelta _k\). The latter inequality cannot hold for infinitely many values of \(k\) since \(x_k\rightarrow x^*\) and \(\liminf \varDelta _k>0\). Hence, there exists \(K \ge 0\) such that \(\varDelta _k\) is not increased for any \(k \ge K\). Since \(\varDelta > 0\), this implies that \(\varDelta _k \ge \varDelta \) for all \(k \ge K\). In view of the trust-region update mechanism in Algorithm 1 and since \(\varDelta = \liminf \varDelta _k\), we also know that, for some \(K_1 > K\), \(\varDelta _{K_1} < \frac{1}{\tau _1} \varDelta \). If the trust-region radius were to be decreased, we would have \(\varDelta _{K_1 + 1} < \varDelta \), which we have ruled out. Since neither an increase nor a decrease can occur, we must have \(\varDelta _k = \varDelta \) for all \(k \ge K_1\).
Suppose now that \(\varDelta = 0\). Since \(x_k\rightarrow x^*\), for every \(\epsilon > 0\) there exists \(K_{\epsilon } \ge 0\) such that \({{\mathrm{\mathrm {dist}}}}(x_{k+1},x_k) < \epsilon \) for all \(k \ge K_{\epsilon }\). Since \(\liminf \varDelta _k = 0\), there exists \(j \ge K_{\epsilon }\) such that \(\varDelta _j < \epsilon \). But since \(\varDelta _k\) is increased only if \(\varDelta _k \le \frac{1}{0.8} \Vert s_k\Vert \le \frac{1}{0.8 a_0}{{\mathrm{\mathrm {dist}}}}(x_{k+1},x_k) < \frac{\epsilon }{0.8 a_0}\), and the increase factor is \(\tau _2\), we have that \(\varDelta _k < \frac{\tau _2 \epsilon }{0.8 a_0}\) for all \(k \ge j\). Therefore (12) follows.
To show that \(\Vert s_k\Vert \rightarrow 0\), note that if (12) is true, then clearly \(\Vert s_k\Vert \rightarrow 0\). If (13) is true, then for all \(k > K\), the step \(s_k\) is accepted and \(\Vert s_k\Vert \le \frac{1}{a_0} {{\mathrm{\mathrm {dist}}}}(x_{k + 1}, x_k)\) (by Lemma 3), hence \(\Vert s_k\Vert \rightarrow 0\) since \(\{x_k\}\) converges.\(\square \)
Lemma 6
Let \(\mathcal {M}\) be a Riemannian manifold endowed with two vector transports \(\mathcal {T}_1\) and \(\mathcal {T}_2\), and let \(\bar{x} \in \mathcal {M}\). Then there exist a constant \(a_4\) and a neighborhood \(\mathcal {U}\) of \(\bar{x}\) such that for all \(x, y \in \mathcal {U}\) and all \(\xi \in {{\mathrm{\mathrm {T}}}}_y\mathcal {M}\),
where \(\eta = R_{x}^{-1}(y)\).
Proof
We use the hat to denote coordinate expressions. Let \(T_1(\hat{x}, \hat{\eta })\) and \(T_2(\hat{x}, \hat{\eta })\) denote the coordinate expressions of \(\mathcal {T}_{1_{\eta }}^{-1}\) and \(\mathcal {T}_{2_{\eta }}^{-1}\), respectively. Then
for some constants \(b_0,\; b_1\), and \(b_2\).\(\square \)
The next lemma is proved in [19, Lemma 14.5].
Lemma 7
Let \(F\) be a \(C^1\) vector field on a Riemannian manifold \(\mathcal {M}\) and let \(\bar{x} \in \mathcal {M}\) be a nondegenerate zero of \(F\). Then there exist a neighborhood \(\mathcal {U}\) of \(\bar{x}\) and \(a_5, a_6 > 0\) such that for all \(x \in \mathcal {U}\),
In the Euclidean case, the next lemma holds with \(\tilde{a}_7=0\) and reduces to the Fundamental Theorem of Calculus.
Lemma 8
Let \(F\) be a \(C^1\) vector field on a Riemannian manifold \(\mathcal {M}\), let \(R\) be a retraction on \(\mathcal {M}\), and let \(\bar{x} \in \mathcal {M}\). Then there exist a neighborhood \(\mathcal {U}\) of \(\bar{x}\) and a constant \(\tilde{a}_7\) such that for all \(x, y \in \mathcal {U}\),
where \(\eta = R_x^{-1} (y)\) and \(P_\gamma \) is the parallel translation along the curve \(\gamma \) given by \(\gamma (t) = R_x(t \eta )\).
Proof
Define \(G: [0, 1] \rightarrow {{\mathrm{\mathrm {T}}}}_x \mathcal {M}: t \mapsto G(t) = P_{\gamma }^{0 \leftarrow t} F(\gamma (t))\). Observe that \(G(0) = F(x)\) and \(G(1) = P_{\gamma }^{0 \leftarrow 1} F(y)\). We have
where we have used an expression of the covariant derivative \(\mathbb {D}\) in terms of the parallel translation \(P\) (see, e.g., [14, theorem I.2.1]), and where \(\mathcal {T}_{R(t \eta )} \eta = \frac{d}{dt}(R(t\eta ))\). Since \(G(1) - G(0) = \int _0^1 G'(t) d t\), we obtain
where \(b_0\) is some constant.\(\square \)
Lemma 9
Suppose Assumptions 2 and 3 hold. Then there exist a neighborhood \(\mathcal {U}\) and a constant \(a_7\) such that for all \(x_1,\; \tilde{x}_1,\; x_2\), and \(\tilde{x}_2 \in \mathcal {U}\), we have
where \(\zeta = R_{x_1}^{-1} (x_2)\), \(\xi _1 = R_{x_1}^{-1} (\tilde{x}_1)\), \(\xi _2 = R_{x_2}^{-1} (\tilde{x}_2)\), \(y_1 = \mathcal {T}_{S_{\xi _1}}^{-1} {{\mathrm{\mathrm {grad}}}}f(\tilde{x}_1) - {{\mathrm{\mathrm {grad}}}}f(x_1)\), and \(y_2 = \mathcal {T}_{S_{\xi _2}}^{-1} {{\mathrm{\mathrm {grad}}}}f(\tilde{x}_2) - {{\mathrm{\mathrm {grad}}}}f(x_2)\).
Proof
Define \(\bar{y}_1 = P_{\gamma _1}^{0 \leftarrow 1} {{\mathrm{\mathrm {grad}}}}f(\tilde{x}_1) - {{\mathrm{\mathrm {grad}}}}f(x_1)\) and \(\bar{y}_2 = P_{\gamma _2}^{0 \leftarrow 1} {{\mathrm{\mathrm {grad}}}}f(\tilde{x}_2) - {{\mathrm{\mathrm {grad}}}}f(x_2)\), where \(P\) is the parallel transport, \(\gamma _1(t) = R_{x_1}(t \xi _1)\), and \(\gamma _2(t) = R_{x_2}(t \xi _2)\). From Lemma 8, we have
where \(b_0\) is a positive constant, \(\bar{H}_1(x_1, \tilde{x}_1) = \int _0^1 P_{\gamma _1}^{0 \leftarrow t} {{\mathrm{\mathrm {Hess}}}}f(\gamma _1(t)) P_{\gamma _1}^{t \leftarrow 0} d t\) and \(\bar{H}_2(x_2, \tilde{x}_2) = \int _0^1 P_{\gamma _2}^{0 \leftarrow t} {{\mathrm{\mathrm {Hess}}}}f(\gamma _2(t)) P_{\gamma _2}^{t \leftarrow 0} d t\). It follows that
where \(b_1, b_2,\; b_3,\; b_4\) and \(b_5\) are some constants. Using hat to denote coordinate expressions, \(T(\hat{x}_1, \hat{x}_2)\) to denote \(\mathcal {T}_{\zeta }\) and \(\hat{G}(\hat{x}_2)\) to denote the matrix expression of the Riemannian metric at \(x_2\), we have
where \(\Vert \cdot \Vert _2\) denotes the Euclidean norm. Define a function
We can see that when \((\hat{x}_1^T, \hat{\tilde{x}}_1^T) = (\hat{x}_2^T, \hat{\tilde{x}}_2^T)\), \(J = 0\). Since, in view of Assumption 3, \(J\) is Lipschitz continuous, it follows that (16) becomes
where \(b_6, b_7\) are some constants. Combining this equation with (15), we obtain
where \(b_8\) is a constant.\(\square \)
Lemma 10
Let \(\mathcal {M}\) be a Riemannian manifold endowed with a vector transport \(\mathcal {T}\) with associated retraction \(R\), and let \(\bar{x} \in \mathcal {M}\). Then there is a neighborhood \(\mathcal {U}\) of \(\bar{x}\) and \(a_8\) such that for all \(x, y \in \mathcal {U}\),
where \(\xi = R_{\bar{x}}^{-1}(x),\; \eta = R_{x}^{-1}(y),\; \zeta = R_{\bar{x}}^{-1}(y),\; {{\mathrm{\mathrm {id}}}}\) is the identity operator, and \(\Vert \cdot \Vert \) is the operator norm induced by the Riemannian metric.
Proof
Let the hat denote coordinate expressions, chosen such that the matrix expression of the Riemannian metric at \(\bar{x}\) is the identity. Let \(L(x, y)\) denote \(\mathcal {T}_{{R_x^{-1}(y)}}\). We have
Define a function \(J(\bar{x}, \xi , \zeta ) = I - L(\bar{x}, R_{\bar{x}}(\xi ))^{-1} L(R_{\bar{x}}(\xi ), R_{\bar{x}}(\zeta ))^{-1} L(\bar{x}, R_{\bar{x}}(\zeta ))\). Notice that \(J\) is a smooth function and \(J(\bar{x}, 0_{\bar{x}}, 0_{\bar{x}}) = 0\). So
where \(b_0, b_1\) and \(b_2\) are some constants and \(\Vert \cdot \Vert _2\) denotes the Euclidean norm. So
This concludes the first part of the proof. The second part of the result follows from a similar argument.\(\square \)
The next lemma generalizes [15, Lemma 1]. It is instrumental in the proof of Lemma 13 below. In the Euclidean setting, it is possible to give an expression for \(a_9\) and \(a_{10}\) in terms of \(c\) of Assumption 3 and \(\nu \) of Assumption 5. In the Riemannian setting, we could not obtain such an expression, in part because the constant \(b_2\) that appears in the proof below is no longer zero. However, the existence of \(a_9\) and \(a_{10}\) can still be shown, under the assumption that \(\{x_k\}\) converges to \(x^*\), and this is all we need in order to carry on with Lemma 13.
Lemma 11
Suppose Assumptions 1, 2, 3, and 5 hold. Then
for all \(j\). Moreover, there exist constants \(a_9\) and \(a_{10}\) such that
for all \(j,\; i \ge j+1\), where \(\epsilon _{i, j} = \max _{j \le k \le i} {{\mathrm{\mathrm {dist}}}}(x_k, x^*)\) and
with \(\zeta _{j, i} = R_{x_j}^{-1} (x_i)\).
Proof
From (3), we have
This yields (17), as well as (18) with \(i = j + 1\). The proof of (18) for \(i>j+1\) is by induction. We choose \(k \ge j + 1\) and assume that (18) holds for all \(i = j + 1, \ldots , k\). Let \(r_k = y_k - \mathcal {B}_k s_k\). We have
where \(b_0\) and \(b_1\) are some constants. It follows that
where \(b_2, b_3\) are some constants. Because \(b_0, b_1, b_2\) and \(b_3\) are independent of \(a_9\) and \(a_{10}\), we can choose \(a_9\) and \(a_{10}\) large enough such that
for all \(j,\; k \ge j + 1\). Take, for example, \(a_9 > 1\) and \(a_{10} > 1 + b_3 b_0 + b_3 b_1 + b_2\). Therefore
This concludes the argument by induction.\(\square \)
Lemma 12
If Assumption 3 holds then there exist a neighborhood \(\mathcal {U}\) of \(x^*\) and a constant \(a_{11}\) such that for all \(x_1, x_2 \in \mathcal {U}\), the inequality
holds, where \(\zeta _1 = R_{x^*}^{-1} (x_1),\; s = R_{x_1}^{-1} (x_2),\; y = \mathcal {T}_{S_s}^{-1} {{\mathrm{\mathrm {grad}}}}f(x_2) - {{\mathrm{\mathrm {grad}}}}f(x_1)\).
Proof
Define \(\bar{y} = P_{\gamma }^{0 \leftarrow 1} {{\mathrm{\mathrm {grad}}}}f(x_2) - {{\mathrm{\mathrm {grad}}}}f(x_1)\), where \(P\) is the parallel transport along the curve \(\gamma \) defined by \(\gamma (t) = R_{x_1} (t s)\). From Lemma 8, we have
where \(\bar{H} = \int _0^1 P_{\gamma }^{0 \leftarrow t} {{\mathrm{\mathrm {Hess}}}}f(\gamma (t)) P_{\gamma }^{t \leftarrow 0} d t\) and \(b_0\) is a constant. We then have
where \(b_1\) and \(b_2\) are some constants.\(\square \)
With these technical lemmas in place, we now start the Riemannian generalization of the sequence of lemmas in [11] that leads to the main result [11, Theorem 2.7], generalized here as Theorem 2. For an easier comparison with [11], in the rest of the convergence analysis, we let \(n\) (instead of \(d\)) denote the dimension of the manifold \(\mathcal {M}\).
The next lemma generalizes [11, Lemma 2.3], itself a slight variation of [24, Lemma 3.2]. The proof of [11, Lemma 2.3] involves considering the span of a few \(s_j\)’s. In the Riemannian setting, a difficulty arises from the fact that the \(s_j\)’s are not in the same tangent space. We overcome this difficulty by transporting the \(s_j\)’s to \({{\mathrm{\mathrm {T}}}}_{x^*} \mathcal {M}\).
Lemma 13
Let \(s_k\) be such that \(R_{x_k} (s_k) \rightarrow x^*\). If Assumptions 1, 2, 3, and 5 hold then there exists \(K \ge 0\) such that for any set of \(n + 1\) steps \(S = \{s_{k_j} : K \le k_1 < \ldots < k_{n + 1}\}\), there exists an index \(k_m\) with \(m \in \{2, 3, \ldots , n + 1\}\) such that
where \(a_{12}, \bar{a}_{12}\) are some constants, \(n\) is the dimension of the manifold, \(H_{k_m} = \mathcal {T}_{S_{\zeta _{k_m}}} {{\mathrm{\mathrm {Hess}}}}f(x^*) \mathcal {T}_{S_{\zeta _{k_m}}}^{-1},\; \epsilon _S = \max _{1 \le j \le n + 1} \{{{\mathrm{\mathrm {dist}}}}(x_{k_j}, x^*), {{\mathrm{\mathrm {dist}}}}(R_{x_{k_j}} (s_{k_j}), x^*)\}\), and \(\zeta _{k_m} = R_{x^*}^{-1}(x_{k_m})\).
Proof
Given \(S\), for \(j = 1, 2, \ldots , n + 1\), define
where \(\bar{s}_{k_i} = \mathcal {T}_{S_{\zeta _{k_i}}}^{-1} s_{k_i}, i = 1, 2, \ldots , j\). The proof is organized as follows. We will first obtain in (28) that there exist \(m \in [2, n + 1]\), \(u\in \mathbb {R}^{m-1}\), and \(w \in {{\mathrm{\mathrm {T}}}}_{x^*} \mathcal {M}\) such that \(\bar{s}_{k_m} / \Vert \bar{s}_{k_m}\Vert = S_{m - 1} u - w\), \(S_{m - 1}\) has full column rank and is well conditioned, and \(\Vert w\Vert \) is small. We will also obtain in (30) that \((\mathcal {T}_{S_{\zeta _{k_m}}}^{-1} \mathcal {B}_{k_m} \mathcal {T}_{S_{\zeta _{k_m}}} - {{\mathrm{\mathrm {Hess}}}}f(x^*)) S_{m - 1}\) is small due to the Hessian-approximating properties of the SR1 update given in Lemma 12 above. The conclusion follows from these two results.
Let \(G_*\) denote the matrix expression of inner product of \({{\mathrm{\mathrm {T}}}}_{x^*} \mathcal {M}\) and \(\hat{S}_j\) denote the coordinate expression of \(S_j\), for \(j \in \{1, \ldots , n\}\). Let \(\kappa _j\) be the smallest singular value of \(G_*^{1/2} \hat{S}_j\) and define \(\kappa _{n + 1} = 0\). We have
Let \(m\) be the smallest integer for which
Since \(m \le n + 1\) and \(\kappa _1 = 1\), we have
Since \(x_k \rightarrow x^*\) and \(R_{x_k} (s_k) \rightarrow x^*\), we can assume that \(\epsilon _S \in (0, (\frac{1}{4})^n)\) for all \(k\). Now, we choose \(z \in \mathbb {R}^m\) such that
and
where \(u \in \mathbb {R}^{m - 1}\). (The last component of \(z\) is nonzero because \(m\) is the smallest integer for which (20) holds.) Let \(w = S_m z\), with coordinate expression \(\hat{w} = \hat{S}_m z\). From the definition of \(G_*^{1/2} \hat{S}_m\) and \(z\), we have
where \(\hat{\bar{s}}_{k_m}\) is the coordinate expression of \(\bar{s}_{k_m}\). Since \(\kappa _{m - 1}\) is the smallest singular value of \(G_*^{1/2} \hat{S}_{m - 1}\), we have that
Using (22) and (24), we have that
Therefore, since (20) implies that \(\kappa _m < \epsilon _S^{1 / n}\), using (20),
This implies
and hence \(\Vert w\Vert < 1\), since \(\epsilon _S < (\frac{1}{4})^n\). Therefore, (25) and (26) imply that
Equation (28) is the announced result that \(w\) is small. The bound (27) will also be invoked below.
Now we show that \(\Vert (\mathcal {T}_{S_{\zeta _{k_j}}}^{-1} \mathcal {B}_{k_j} \mathcal {T}_{S_{\zeta _{k_j}}} - {{\mathrm{\mathrm {Hess}}}}f(x^*)) S_{j - 1}\Vert \) is small for all \(j \in [2, n + 1]\) (and thus in particular for \(j=m\)). By Lemma 11, we have
for all \(i \in \{k_1, k_2, \ldots , k_{j - 1}\}\). Therefore,
where \(b_2, b_3\) and \(b_4\) are some constants. Therefore, we have that for any \(j \in [2, n + 1]\),
where \(b_5 \!=\! \sqrt{n} (b_4 a_{10}^{k_{n + 1} - k_1 - 2} \!+\! b_3)\) and \(\Vert \cdot \Vert _{g, 2}\) is the norm induced by the Riemannian metric \(g\) and the Euclidean norm, i.e., \(\Vert A\Vert _{g, 2} \!=\! \sup \Vert A v\Vert / \Vert v\Vert _2\) with \(\Vert \cdot \Vert \) defined in (5).
We can now conclude the proof as follows. Using (23) and (30) with \(j = m\), (27) and (28), we have
where \(b_6\) is some constant. Finally,
\(\square \)
The next lemma generalizes [11, Lemma 2.4]. Its proof is a translation of the proof of [11, Lemma 2.4], where we invoke two manifold-specific results: the equality of \({{\mathrm{\mathrm {Hess}}}}f(x^*)\) and \({{\mathrm{\mathrm {Hess}}}}(f\circ R_{x^*})(0_{x^*})\) (which holds in view of [2, Proposition 5.5.6] since \(x^*\) is a critical point of \(f\)), and the bound in Lemma 3 on the retraction \(R\).
Lemma 14
Suppose that Assumptions 1, 2, 3, 4, 5 and 6 hold and the trust-region subproblem (2) is solved accurately enough for (9) to hold. Then there exists \(N\) such that for any set of \(p > n\) consecutive steps \(s_{k + 1}, s_{k + 2}, \ldots , s_{k + p}\) with \(k \ge N\), there exists a set, \(\mathcal {G}_k\), of at least \(p - n\) indices contained in the set \(\{i : k + 1 \le i \le k + p\}\) such that for all \(j \in \mathcal {G}_k\),
where \(a_{13} = a_{12} a_{10}^{p - 2} + \bar{a}_{12},\; H_j = \mathcal {T}_{S_{\zeta _j}} {{\mathrm{\mathrm {Hess}}}}f(x^*) \mathcal {T}_{S_{\zeta _j}}^{-1},\; \zeta _j = R_{x^*}^{-1}(x_j)\), and
Furthermore, for k sufficiently large, if \(j \in \mathcal {G}_k\), then
where \(a_{14}\) is a constant, and
Proof
By Lemma 5, \(s_k \rightarrow 0\). Therefore, by Lemma 13, applied to the set
there exists \(N\) such that for any \(k \ge N\) there exists an index \(l_1\), with \(k + 1 \le l_1 \le k + p\) satisfying
where \(a_{13} = a_{12} a_{10}^{p - 2}+ \bar{a}_{12}\). We apply Lemma 13 to the set \(\{s_{k + 1}, \ldots , s_{k + p}\} \setminus \{s_{l_1}\}\) to get \(l_2\). Repeating this \(p - n\) times, we get a set of \(p - n\) indices \(\mathcal {G}_k = \{l_1, l_2, \ldots , l_{p - n}\}\) such that if \(j \in \mathcal {G}_k\), then
We show (31) next. Consider \(j \in \mathcal {G}_k\). By (34), we have
Therefore, \(g(s_j, \mathcal {B}_j s_j) \ge g(s_j, H_j s_j) - a_{13} \epsilon _k^{\frac{1}{n}} \Vert s_j\Vert ^2 > b_0 \Vert s_j\Vert ^2\) for sufficiently large \(k\), where \(b_0\) is a constant, and we have
where \(b_1\) is some constant. This yields (31).
Finally, we show (32). Let \(j \in \mathcal {G}_k\) and define \(\hat{f}_x(\eta ) = f(R_x(\eta ))\). It follows that
where \(b_2, b_3\) and \(b_4\) are some constants. In view of (31) and Lemma 7, we have
where \(b_5\) is some constant. Combining with \(\Vert s_j\Vert \le \varDelta _j\), we obtain
Noticing (9), we have
where \(b_6\) is a constant. This implies (32).\(\square \)
The next result generalizes [11, Lemma 2.5] in two ways: the Euclidean setting is extended to the Riemannian setting, and inexact solves are allowed by the presence of \(\delta _k\). The main hurdle that we had to overcome in the Riemannian generalization is that the equality \({{\mathrm{\mathrm {dist}}}}(x_k + s_k, x^*) = \Vert s_k - \xi _k\Vert \) does not necessarily hold. As we will see, Lemma 2 comes to our rescue.
Lemma 15
Suppose Assumptions 2 and 3 hold. If the quantities
are sufficiently small and if
then
and
where \(a_{15}, a_{16}, a_{17}\) and \(a_{18}\) are some constants and \(h(x) = (f(x) - f(x^*))^{\frac{1}{2}}\).
Proof
By definition of \(s_k\), we have
Define \(\xi _k = R^{-1}_{x_k} x^*\). Therefore, letting \(\gamma \) be the curve defined by \(\gamma (t) = R_{x_k}(t \xi _k)\), we have
where \(b_1,\; b_2,\; b_3\) and \(b_4\) are some constants. From Lemma 2, we have
where \(b_5\) is a constant. Combining (39) and (40) and using Lemma 3, we obtain
From (38), for \(k\) large enough such that \(\Vert H_k^{-1}\Vert \Vert (H_k - \mathcal {B}_k) s_k\Vert \le \frac{1}{2} \Vert s_k\Vert \), we have
Using Lemma 7, this yields
where \(b_6\) is a constant. Using the latter in (41) yields
which shows (35).
We show (37) next. Define \(\hat{f}_x(\eta ) = f(R_x(\eta ))\) and let \(\zeta _k = R_{x^*}^{-1}(x_k)\). We have, for some \(t\in (0,1)\),
where we have used (Euclidean) Taylor’s theorem to get the first equality and the fact that \(x^*\) is a critical point of \(f\) (Assumption 2) for the second one. Therefore, since \({{\mathrm{\mathrm {Hess}}}}\hat{f}_{x^*} = {{\mathrm{\mathrm {Hess}}}}f(x^*)\) is positive definite (in view of [2, Proposition 5.5.6] and Assumption 2), there exist \(b_7\) and \(b_8\) such that
Then, using Lemma 3, we obtain that there exist \(b_9\) and \(b_{10}\) such that
In other words,
and we have shown (37). Combining it with (35), we get (36).\(\square \)
With Lemmas 14 and 15 in place, the rest of the local convergence analysis is essentially a translation of the analysis in [11]. The next lemma generalizes [11, Lemma 2.6].
Lemma 16
If Assumptions 1, 2, 3, 4, 5, and 6 hold and the subproblem is solved accurately enough for (9) and (10) to hold, then
where \(h_k = h(x_k)\).
Proof
Let \(p\) be the smallest integer greater than \(2 n + n (- \ln \tau _1 / \ln \tau _2)\), where \(\tau _1\) and \(\tau _2\) are defined in Algorithm 1. Then
Applying Lemma 14 with this value of \(p\), there exists \(N\) such that if \(k \ge N\), then there exists a set of at least \(p - n\) indices, \(\mathcal {G}_k \subset \{j : k + 1 \le j \le k + p\}\), such that if \(j \in \mathcal {G}_k\), then
We now show that for such steps,
If \(\Vert s_j\Vert \ge 0.8 \varDelta _j\), then, from Step 12 of Algorithm 1, \(\varDelta _{j + 1} = \tau _2 \varDelta _j\); since \(\{h_i\}\) is decreasing, (44) follows. If on the other hand \(\Vert s_j\Vert < 0.8 \varDelta _j\), then from Step 14 of Algorithm 1, we have that \(\varDelta _{j + 1} = \varDelta _j\). Also, since the trust region is inactive, condition (10) gives \(\mathcal {B}_j s_j = - {{\mathrm{\mathrm {grad}}}}f(x_j) + \delta _j\) with \(\Vert \delta _j\Vert \le \Vert {{\mathrm{\mathrm {grad}}}}f(x_j)\Vert ^{1 + \theta }\). Therefore, in view of (36) in Lemma 15 and of (43), if \(N\) is large enough, we have that
This implies that (44) is true for all \(j \in \mathcal {G}_k\) with \(k \ge N\).
In addition, note that for any \(j,\; h_{j + 1} \le h_j\) and \(\varDelta _{j + 1} \ge \tau _1 \varDelta _j\) and so
Since (44) is true for \(p - n\) values of \(j \in \mathcal {G}_k\) and (45) holds for all \(j\), we have that for all \(k \ge N\),
where the second inequality follows from (42). Therefore, starting at \(k = N\), it follows that
as \(l \rightarrow \infty \). Using (45) again, we complete the proof.\(\square \)
The next result generalizes [11, Theorem 2.7].
Theorem 2
If Assumptions 1, 2, 3, 4, 5, and 6 hold and the subproblem is solved accurately enough for (9) and (10) to hold, then the sequence \(\{x_k\}\) generated by Algorithm 1 converges \(n + 1\)-step q-superlinearly (where \(n\) denotes the dimension of \(\mathcal {M}\)); i.e.,
Proof
By Lemma 14, there exists \(N\) such that if \(k \ge N\), then the set of steps \(\{s_{k + 1}, \ldots , s_{k + n + 1}\}\) contains at least one step \(s_{k + j},\; 1 \le j \le n + 1\), for which
By (31) in Lemma 14 and (37) in Lemma 15 (when checking the assumptions, recall the standing assumption made in Sect. 3.3 that \(e_k:={{\mathrm{\mathrm {dist}}}}(x_k,x^*) \rightarrow 0\)), there exists a constant \(b_0\) such that
Therefore, by Lemma 16, if \(N\) is large enough and \(k \ge N\), then we have \(\Vert s_{k + j}\Vert < 0.8 \varDelta _{k + j}\). By (10), this implies \(\mathcal {B}_{k + j} s_{k + j} = - {{\mathrm{\mathrm {grad}}}}f(x_{k + j}) + \delta _{k + j}\), with \(\Vert \delta _{k + j}\Vert \le \Vert {{\mathrm{\mathrm {grad}}}}f(x_{k + j})\Vert ^{1 + \theta }\). Thus by inequality (36) of Lemma 15, if \(N\) is large enough and \(k \ge N\), then
The first equality holds because (32) implies that the step is accepted. Since the sequence \(\{h_i\}\) is decreasing, this implies that
By (37),
This implies \(n + 1\)-step q-superlinear convergence.\(\square \)
It is also possible to extend to the Riemannian setting the result [11, Theorem 2.8] that the percentage of \(\mathcal {B}_k\) being positive semidefinite approaches 1 provided that \(\mathcal {B}_k\) is positive semidefinite whenever \(\Vert s_k\Vert \le 0.8 \varDelta _k\). In the proof of [11, Theorem 2.8], replace Lemma 2.6 by Lemma 16, Lemma 2.4 by Lemma 14, (2.14) by (31), and (2.9) by (37).
4 Limited memory version of RTR-SR1
In RTR-SR1 (Algorithm 1), storing \(\mathcal {B}_{k + 1} = \mathcal {T}_{\eta _k} \circ \tilde{\mathcal {B}}_{k + 1} \circ \mathcal {T}_{\eta _k}^{-1}\) in matrix form may be inefficient for two reasons. The first reason, which is also present in the Euclidean case, is that \(\tilde{\mathcal {B}}_{k+1}= \mathcal {B}_k + \frac{(y_k - \mathcal {B}_k s_k) (y_k - \mathcal {B}_k s_k)^\flat }{g(s_k,y_k - \mathcal {B}_k s_k)}\) is a rank-one modification of \(\mathcal {B}_k\). The second reason, specific to the Riemannian setting, is that when \(\mathcal {M}\) is a low-codimension submanifold of a Euclidean space \(\mathcal {E}\), it may be beneficial to express \(\mathcal {T}_{\eta _k}\) as the restriction to \({{\mathrm{\mathrm {T}}}}_{x_k}\mathcal {M}\) of a low-rank modification of the identity (7). Instead of storing full dense matrices, it may then be beneficial to store a few vectors that implicitly represent them. This is the purpose of the limited memory version of RTR-SR1 presented in this section.
The proposed limited memory RTR-SR1, called LRTR-SR1, is described in Algorithm 2. It relies on a Riemannian generalization of the compact representation of the classical (Euclidean) SR1 matrices presented in [12, §5]. We set \(\mathcal {B}_0={{\mathrm{\mathrm {id}}}}\). At step \(k>0\), we first choose a basic Hessian approximation \(\mathcal {B}_0^k\), which in the Riemannian setting becomes a linear transformation of \({{\mathrm{\mathrm {T}}}}_{x_k}\mathcal {M}\). We advocate the choice
where
which generalizes a choice usually made in the Euclidean case [27, (7.20)]. As in the Euclidean case, we let \(S_{k,m}\) and \(Y_{k,m}\) contain the (at most) \(m\) most recent corrections, which in the Riemannian setting must be transported to \({{\mathrm{\mathrm {T}}}}_{x_k}\mathcal {M}\), yielding \(S_{k, m} = \{s_{k - \ell }^{(k)}, s_{k - \ell + 1}^{(k)}, \ldots , s_{k - 1}^{(k)}\}\) and \(Y_{k, m} = \{y_{k - \ell }^{(k)}, y_{k - \ell + 1}^{(k)}, \ldots , y_{k - 1}^{(k)}\}\), where \(\ell = \min \{m, k\}\) and where \(s^{(k)}\) denotes \(s\) transported to \({{\mathrm{\mathrm {T}}}}_{x_{k}} \mathcal {M}\). We then have the following Riemannian generalization of the limited-memory update based on [12, (5.2)]:
where \(P_{k, m} = D_{k, m} + L_{k, m} + L_{k, m}^T\),
and
Moreover, letting \(Q_{k, m}\) denote the matrix \(S_{k, m}^{\flat } S_{k, m}\), we obtain
For all \(\eta \in {{\mathrm{\mathrm {T}}}}_{x_k} \mathcal {M},\; \mathcal {B}_k \eta \) can thus be obtained from (46) using \(Y_{k, m},\; S_{k, m},\; P_{k, m}\) and \(Q_{k, m}\). This is how \(\mathcal {B}_k\) is defined in Algorithm 2, except that Algorithm 2 also takes into account the technicality that the \(\mathcal {B}\) update may be skipped.
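The compact representation can be sketched in Euclidean coordinates as follows. This is an illustration of [12, (5.2)] with basic approximation \(\gamma\,{{\mathrm{\mathrm {id}}}}\), not the Riemannian implementation of Algorithm 2 (no vector transports appear); the function name is ours.

```python
import numpy as np

def lsr1_apply(gamma, S, Y, eta):
    """Apply the compact limited-memory SR1 operator to eta.

    Euclidean sketch with B0 = gamma*I, following the compact
    representation of [12, (5.2)]:
        B = gamma*I + (Y - gamma*S) M^{-1} (Y - gamma*S)^T,
        M = P - gamma*Q,  P = D + L + L^T (from S^T Y),  Q = S^T S,
    where D is the diagonal and L the strictly lower-triangular part
    of S^T Y.  S, Y are n-by-m matrices of stored corrections.
    """
    SY = S.T @ Y
    L = np.tril(SY, -1)
    P = np.diag(np.diag(SY)) + L + L.T
    Q = S.T @ S
    W = Y - gamma * S
    M = P - gamma * Q
    return gamma * eta + W @ np.linalg.solve(M, W.T @ eta)
```

When every update is well defined, this matches the operator obtained by applying \(m\) successive full SR1 updates starting from \(\gamma I\), while storing only the \(2m\) correction vectors and two small \(m \times m\) matrices.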
5 Numerical experiments
As an illustration, we investigate the performance of RTR-SR1 (Algorithm 1) and LRTR-SR1 (Algorithm 2) on a Rayleigh quotient minimization problem on the sphere and on a joint diagonalization (JD) problem on the Stiefel manifold.
For the Rayleigh quotient problem, the manifold \(\mathcal {M}\) is the sphere
and the objective function \(f\) is defined by
where \(A\) is a given \(n\)-by-\(n\) symmetric matrix. Minimizing the Rayleigh quotient of \(A\) is equivalent to computing its leftmost eigenvector (see, e.g., [2, §2.1.1]). The Rayleigh quotient problem provides convenient benchmarking experiments since, except for pathological cases, there is essentially one local minimizer (specifically, there is one pair of antipodal local minimizers), which can be readily computed by standard eigenvalue software for verification. Moreover, the Hessian of \(f\) is readily available, and this allows for a comparison with RTR-Newton [2, Ch. 7], which corresponds to Algorithm 1 with the exception that \(\mathcal {B}_k\) is replaced by the Hessian of \(f\) at \(x_k\) (and thus the vector transport is no longer needed).
In the JD problem considered, the manifold \(\mathcal {M}\) is the (compact) Stiefel manifold,
and the objective function \(f\) is defined by
where \(C_1,\ldots ,C_N\) are given symmetric matrices, \({{\mathrm{\mathrm {diag}}}}(M)\) denotes the vector formed by the diagonal entries of \(M\), and \(\Vert {{\mathrm{\mathrm {diag}}}}(M)\Vert ^2\) thus denotes the sum of the squared diagonal elements of \(M\). This problem has applications in independent component analysis for blind source separation [36].
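The displayed definition of the JD cost is not reproduced above; assuming the form used in [36], \(f(X) = -\sum _{i=1}^N \Vert {{\mathrm{\mathrm {diag}}}}(X^T C_i X)\Vert ^2\), so that minimizing \(f\) drives the matrices \(X^T C_i X\) toward diagonality, the cost can be sketched as follows (the function name is ours).

```python
import numpy as np

def jd_cost(X, Cs):
    """Joint-diagonalization cost on St(p, n); assumed form
    f(X) = -sum_i ||diag(X^T C_i X)||^2, where ||diag(M)||^2 is the
    sum of the squared diagonal entries of M."""
    return -sum(np.sum(np.diag(X.T @ C @ X) ** 2) for C in Cs)
```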
The comparisons are performed in Matlab 7.0.0 on a 32-bit Windows platform with a 2.4 GHz CPU (T8300).
The chosen Riemannian metric \(g\) on \(\mathbb {S}^{n - 1}\) is obtained by making \(\mathbb {S}^{n-1}\) a Riemannian submanifold of the Euclidean space \(\mathbb {R}^n\). The gradient and the Hessian of \(f\) (47) with respect to the metric are given in [2, §6.4]. The chosen Riemannian metric \(g\) on \(\mathrm {St}(p,n)\) is the one obtained by viewing \(\mathrm {St}(p,n)\) as a Riemannian submanifold of the Euclidean space \(\mathbb {R}^{n\times p}\), as in [17, §2.2]. With respect to this Riemannian metric, the gradient of the objective function (48), required in the three methods, is given in [36, §2.3]. The Riemannian Hessian of (48), required for RTR-Newton, is also given therein.
The initial Hessian approximation \(\mathcal {B}_0\) is set to the identity in RTR-SR1. The \(\theta ,\; \kappa \) parameters in the inner iteration stopping criterion [2, (7.10)] of the truncated CG inner iteration [2, §7.3.2] are set to 0.1, 0.9 for RTR-SR1 and LRTR-SR1 and to 1, 0.1 for RTR-Newton. The initial radius \(\varDelta _0\) is set to 1, \(\nu \) is the square root of machine epsilon, \(c\) is set to 0.1, \(\tau _1\) to 0.25, and \(\tau _2\) to 2.
For the retraction \(R\) on \(\mathbb {S}^{n - 1}\), following [2, Example 4.1.1], we choose \(R_x(\eta ) = (x + \eta ) / \Vert x + \eta \Vert _2\). For the retraction \(R\) on \(\mathrm {St}(p,n)\), following [2, (4.8)], we choose \(R_X(\eta ) = \mathrm {qf}(X+\eta )\), where \(\mathrm {qf}\) denotes the Q factor of the QR decomposition whose R factor has nonnegative diagonal elements.
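The two retractions can be sketched as follows (function names are ours); the sign correction in the Stiefel case enforces the nonnegative-diagonal convention that makes \(\mathrm {qf}\) single-valued.

```python
import numpy as np

def retract_sphere(x, eta):
    """R_x(eta) = (x + eta) / ||x + eta||_2 on the sphere S^{n-1}."""
    v = x + eta
    return v / np.linalg.norm(v)

def retract_stiefel(X, eta):
    """R_X(eta) = qf(X + eta) on St(p, n): the Q factor of the QR
    decomposition, with column signs flipped so that the R factor has
    nonnegative diagonal entries."""
    Q, R = np.linalg.qr(X + eta)
    d = np.sign(np.diag(R))
    d[d == 0] = 1.0  # guard: keep the sign convention well defined
    return Q * d     # flips the columns of Q where diag(R) was negative
```

Both maps send the zero tangent vector to the point itself and return a point on the manifold for any tangent perturbation, as a retraction must.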
The chosen isometric vector transport \(\mathcal {T}\) on \(\mathbb {S}^{n-1}\) is the vector transport by rigging (7), which is (locally) uniquely defined in case of submanifolds of co-dimension 1. In the case of \(\mathbb {S}^{n-1}\), it turns out to be equivalent to the parallel translation along the shortest geodesic that joins the origin point \(x\) and the target point \(y\), i.e.,
where \(y = R_x(\eta _x)\). This operation is well defined whenever \(x\) and \(y\) are not antipodal points. On \(\mathrm {St}(p,n)\), since we will conduct experiments on problems of small dimension, we find it preferable to select a vector transport by parallelization (6), which amounts to selecting a smooth field of tangent bases \(B\) on \(\mathrm {St}(p,n)\). To this end, we note that \({{\mathrm{\mathrm {T}}}}_X \mathrm {St}(p, n) = \{X \varOmega + X_{\perp } K : \varOmega ^T = - \varOmega ,\ K \in \mathbb {R}^{(n - p) \times p}\}\), where the columns of \(X_\perp \in \mathbb {R}^{n\times (n-p)}\) form an orthonormal basis of the orthogonal complement of the column space of \(X\) (see, e.g., [2, Example 3.5.2]). Hence, an orthonormal basis of \({{\mathrm{\mathrm {T}}}}_X \mathrm {St}(p, n)\) is given by \(\{\frac{1}{\sqrt{2}} X (e_i e_j^T - e_j e_i^T): i=1,\ldots ,p, j=i+1,\ldots ,p\} \cup \{X_{\perp } \tilde{e}_i e_j^T : i = 1, \ldots , n-p, j = 1, \ldots , p\}\), where \((e_1,\ldots ,e_p)\) is the canonical basis of \(\mathbb {R}^p\) and \((\tilde{e}_1,\ldots ,\tilde{e}_{n - p})\) is the canonical basis of \(\mathbb {R}^{n - p}\). To ensure that the obtained field of tangent bases is smooth, we need to choose \(X_\perp \) as a smooth function of \(X\). This can be done locally by extracting the last \(n-p\) columns of the Gram-Schmidt orthonormalization of \(\begin{bmatrix} X&C \end{bmatrix}\) where \(C\) is a given \(n\times (n-p)\) matrix. In practice, we have observed that Matlab's null function applied to \(X\) works adequately to produce an \(X_\perp \).
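Both constructions above can be sketched numerically. For the sphere we use the standard closed form of parallel translation along the minimal geodesic (our own expression; the paper's displayed formula is not reproduced above), and for the Stiefel tangent basis the fixed matrix \(C\) is an arbitrary generic choice; function names are ours.

```python
import numpy as np

def sphere_transport(x, y, xi):
    """Parallel translation of xi in T_x S^{n-1} to T_y S^{n-1} along
    the minimal geodesic joining x and y (standard closed form).
    Well defined whenever x and y are not antipodal."""
    return xi - (y @ xi) / (1.0 + x @ y) * (x + y)

def stiefel_tangent_basis(X, C=None):
    """Orthonormal basis of T_X St(p, n), following the construction in
    the text: X_perp is read off the last n-p columns of a QR
    factorization of [X, C]; the default C is a fixed generic matrix."""
    n, p = X.shape
    if C is None:
        C = np.random.default_rng(0).standard_normal((n, n - p))
    Q, _ = np.linalg.qr(np.hstack([X, C]))
    X_perp = Q[:, p:]
    basis = []
    for i in range(p):              # (1/sqrt(2)) X (e_i e_j^T - e_j e_i^T)
        for j in range(i + 1, p):
            E = np.zeros((p, p)); E[i, j], E[j, i] = 1.0, -1.0
            basis.append((X @ E).ravel() / np.sqrt(2))
    for i in range(n - p):          # X_perp e~_i e_j^T
        for j in range(p):
            E = np.zeros((n - p, p)); E[i, j] = 1.0
            basis.append((X_perp @ E).ravel())
    return np.column_stack(basis)   # n*p by p(p-1)/2 + (n-p)p
```

The sphere transport preserves the norm of \(\xi\) and lands in \({{\mathrm{\mathrm {T}}}}_y \mathbb {S}^{n-1}\); the Stiefel basis has exactly \(p(p-1)/2 + (n-p)p = \dim \mathrm {St}(p,n)\) orthonormal columns.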
The experiments on the Rayleigh quotient (47) are reported in Table 1 and Fig. 1. Since the Hessian is readily available, RTR-Newton is the a priori method of choice over the proposed RTR-SR1 and LRTR-SR1 methods. Nevertheless, Table 1 illustrates that it is possible to exhibit a matrix \(A\) for which LRTR-SR1 is faster than RTR-Newton. Specifically, matrix \(A\) in the Rayleigh quotient (47) is set to \(U D U^T\), where \(U\) is an orthonormal matrix obtained by orthonormalizing a random matrix whose elements are drawn from the standard normal distribution and \(D = {{\mathrm{\mathrm {diag}}}}(0, 0.01, \ldots , 0.01, 2, \ldots , 2)\) with 0.01 and 2 occurring \(n/2 - 1\) and \(n/2\) times, respectively. The initial iterate is generated randomly. The number of function evaluations is equal to the number of iterations \((iter)\). \(ng\) denotes the number of gradient evaluations. The differences between \(iter\) and \(ng\) that may be observed in RTR-Newton are due to occasional rejections of the candidate new iterate as prescribed in the trust-region framework [2, §7.2]; for RTR-SR1 and LRTR-SR1, \(iter\) and \(ng\) are identical because one evaluation of the gradient is required at each iterate to update \(\mathcal {B}_k\) or store \(y_k\) even if the candidate is rejected. \(nH\) denotes the number of operations of the form \({{\mathrm{\mathrm {Hess}}}}f(x) \eta \) or \(\mathcal {B} \eta \). \(t\) denotes the run time (seconds). To obtain sufficiently stable timing values, an average is taken over several identical runs for a total run time of at least one minute. \(gf_0\) and \(gf_f\) denote the initial and final norm of the gradient. Two stopping criteria are tested: \(gf_f/gf_0 < \)1e-3 and \(gf_f/gf_0 <\) 1e-6.
Unsurprisingly, Table 1 shows that RTR-Newton, which exploits the Hessian of \(f\), requires fewer iterations than the SR1 methods, which do not use this information. However, when \(n\) gets large, the time per iteration in LRTR-SR1 (with moderate memory size \(m\)) becomes small enough for the method to be faster than RTR-Newton.
Table 2 and Fig. 2 present the experimental results obtained for the JD problem (48). The \(C_i\) matrices are selected as \(C_i = {{\mathrm{\mathrm {diag}}}}(n, n-1, \ldots , 1) + \epsilon _{\mathrm {C}} (R_i + R_i^T)\), where the elements of \(R_i\in \mathbb {R}^{n\times n}\) are independently drawn from the standard normal distribution. Table 2 and Fig. 2 correspond to \(\epsilon _{\mathrm {C}}=0.1\), but we have observed similar results for a wide range of values of \(\epsilon _{\mathrm {C}}\). Table 2 indicates that RTR-Newton requires fewer iterations than RTR-SR1, which requires fewer iterations than LRTR-SR1. This was expected since RTR-Newton uses the Hessian of \(f\), while RTR-SR1 uses an inexact Hessian and LRTR-SR1 is further constrained by the limited memory. However, the iterations of RTR-Newton tend to be more time-consuming than those of the SR1 methods, all the more so as \(N\) grows, since the number of terms in the Hessian of \(f\) is linear in \(N\). The experiments reported in Table 2 show that the trade-off between the number of iterations and the time per iteration is in favor of RTR-SR1 for \(N\) sufficiently large. Note also that, even though it is slower than the two other methods in the experiments presented in Table 2, LRTR-SR1 may be the method of choice in certain circumstances as it does not require the Hessian of \(f\) (unlike RTR-Newton) and it has a reduced memory usage in comparison with RTR-SR1.
6 Conclusion
We have introduced a Riemannian SR1 trust-region method, where the second-order term of the model is generated using a Riemannian generalization of the classical SR1 update. Global convergence to stationary points and \(d+1\)-step superlinear convergence are guaranteed, and the experiments reported here show promise. A limited-memory version of the algorithm has also been presented. The new algorithms will be made available in the Manopt toolbox [9].
References
Absil, P.A., Baker, C.G., Gallivan, K.A.: Trust-region methods on Riemannian manifolds. Found. Comput. Math. 7(3), 303–330 (2007)
Absil, P.A., Mahony, R., Sepulchre, R.: Optimization algorithms on matrix manifolds. Princeton University Press, Princeton, NJ (2008)
Absil, P.A., Malick, J.: Projection-like retractions on matrix manifolds. SIAM J. Optim. 22(1), 135–158 (2012)
Adler, R.L., Dedieu, J.P., Margulies, J.Y.: Newton’s method on Riemannian manifolds and a geometric model for the human spine. IMA J. Numer. Anal. 22(3), 359–390 (2002)
Afsari, B., Tron, R., Vidal, R.: On the convergence of gradient descent for finding the Riemannian center of mass. SIAM J. Control Optim. 51(3), 2230–2260 (2013). ArXiv:1201.0925v1
Boothby, W.M.: An Introduction to Differentiable Manifolds and Riemannian Geometry, 2nd edn. Academic Press, London (2003)
Borsdorf, R.: Structured Matrix Nearness Problems: Theory and Algorithms. Ph.D. thesis, The University of Manchester (2012)
Boumal, N., Absil, P.A.: RTRMC: A Riemannian trust-region method for low-rank matrix completion. Adv. Neural Inf. Process. Syst. 24(NIPS), 406–414 (2011)
Boumal, N., Mishra, B.: The Manopt toolbox. http://www.manopt.org
Brace, I., Manton, J.H.: An improved BFGS-on-manifold algorithm for computing weighted low rank approximations. In: Proceedings of the 17th International Symposium on Mathematical Theory of Networks and Systems, pp. 1735–1738 (2006)
Byrd, R.H., Khalfan, H.F., Schnabel, R.B.: Analysis of a symmetric rank-one trust region method. SIAM J. Optim. 6(4), 1025–1039 (1996)
Byrd, R.H., Nocedal, J., Schnabel, R.B.: Representations of quasi-Newton matrices and their use in limited memory methods. Math. Program. 63(1–3), 129–156 (1994)
do Carmo, M.P.: Riemannian Geometry. Mathematics: Theory & Applications. Birkhäuser, Boston (1992)
Chavel, I.: Riemannian Geometry: A Modern Introduction, 2nd edn. Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge (2006)
Conn, A.R., Gould, N.I.M., Toint, P.L.: Convergence of quasi-Newton matrices generated by the symmetric rank one update. Math. Program. 50(1–3), 177–195 (1991). doi:10.1007/BF01594934
Conn, A.R., Gould, N.I.M., Toint, P.L.: Trust-region methods. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, MPS/SIAM Series on Optimization (2000)
Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 20(2), 303–353 (1998). doi:10.1137/S0895479895290954
Gabay, D.: Minimizing a differentiable function over a differential manifold. J. Optim. Theory Appl. 37(2), 177–219 (1982)
Gallivan, K.A., Qi, C., Absil, P.A.: A Riemannian Dennis-Moré condition. In: Berry, M.W., Gallivan, K.A., Gallopoulos, E., Grama, A., Philippe, B., Saad, Y., Saied, F. (eds.) High-Performance Scientific Computing, pp. 281–293. Springer, London (2012)
Huang, W.: Optimization algorithms on Riemannian manifolds with applications. Ph.D. thesis, Department of Mathematics, Florida State University (2013)
Ishteva, M., Absil, P.A., Van Huffel, S., De Lathauwer, L.: Best low multilinear rank approximation of higher-order tensors, based on the Riemannian trust-region scheme. SIAM J. Matrix Anal. Appl. 32(1), 115–135 (2011)
Jiang, B., Dai, Y.H.: A framework of constraint preserving update schemes for optimization on the Stiefel manifold (2013). ArXiv:1301.0172
Journée, M., Bach, F., Absil, P.A., Sepulchre, R.: Low-rank optimization on the cone of positive semidefinite matrices. SIAM J. Optim. 20(5), 2327–2351 (2010)
Khalfan, H.F., Byrd, R.H., Schnabel, R.B.: A theoretical and experimental study of the symmetric rank-one update. SIAM J. Optim. 3(1), 1–24 (1993). doi:10.1137/0803001
Kleinsteuber, M., Shen, H.: Blind source separation with compressively sensed linear mixtures. IEEE Signal Process. Lett. 19(2), 107–110 (2012). ArXiv:1110.2593v1
Mishra, B., Meyer, G., Bach, F., Sepulchre, R.: Low-rank optimization with trace norm penalty (2011). ArXiv:1112.2318v2
Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, Berlin (2006)
O’Neill, B.: Semi-Riemannian Geometry. Academic Press, New York (1983)
Qi, C.: Numerical Optimization Methods on Riemannian Manifolds. Ph.D. thesis, Department of Mathematics, Florida State University (2011)
Ring, W., Wirth, B.: Optimization methods on Riemannian manifolds and their application to shape space. SIAM J. Optim. 22(2), 596–627 (2012). doi:10.1137/11082885X
Sato, H., Iwai, T.: Convergence analysis for the Riemannian conjugate gradient method (2013). ArXiv:1302.0125v1
Savas, B., Lim, L.H.: Quasi-Newton methods on Grassmannians and multilinear approximations of tensors. SIAM J. Sci. Comput. 32(6), 3352–3393 (2010)
Seibert, M., Kleinsteuber, M., Hüper, K.: Properties of the BFGS method on Riemannian manifolds. In: Mathematical System Theory: Festschrift in Honor of Uwe Helmke on the Occasion of his Sixtieth Birthday, pp. 395–412 (2013)
Selvan, S.E., Amato, U., Gallivan, K.A., Qi, C.: Descent algorithms on oblique manifold for source-adaptive ICA contrast. IEEE Trans. Neural Netw. Learn. Syst. 23(12), 1930–1947 (2012)
Stiefel, E.: Richtungsfelder und Fernparallelismus in n-dimensionalen Mannigfaltigkeiten. Commentarii Mathematici Helvetici 8(1), 305–353 (1935). doi:10.1007/BF01199559
Theis, F.J., Cason, T.P., Absil, P.A.: Soft dimension reduction for ICA by joint diagonalization on the Stiefel manifold. In: Proceedings of the 8th International Conference on Independent Component Analysis and Signal Separation, vol. 5441, pp. 354–361 (2009)
Turaga, P., Veeraraghavan, A., Srivastava, A., Chellappa, R.: Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition. IEEE Trans. Pattern Anal. Mach. Intell. 33(11), 2273–2286 (2011). doi:10.1109/TPAMI.2011.52
Vandereycken, B., Vandewalle, S.: A Riemannian optimization approach for computing low-rank solutions of Lyapunov equations. SIAM J. Matrix Anal. Appl. 31(5), 2553–2579 (2010). doi:10.1137/090764566
Wen, Z., Yin, W.: A feasible method for optimization with orthogonality constraints. Math. Program., published online (2012). doi:10.1007/s10107-012-0584-1
This paper presents research results of the Belgian Network DYSCO (Dynamical Systems, Control, and Optimization), funded by the Interuniversity Attraction Poles Programme initiated by the Belgian Science Policy Office. This work was financially supported by the Belgian FRFC (Fonds de la Recherche Fondamentale Collective). This work was performed in part while the third author was a Visiting Professor at the Institut de mathématiques pures et appliquées (MAPA) at Université catholique de Louvain.
Huang, W., Absil, PA. & Gallivan, K.A. A Riemannian symmetric rank-one trust-region method. Math. Program. 150, 179–216 (2015). https://doi.org/10.1007/s10107-014-0765-1
Keywords
- Riemannian optimization
- Optimization on manifolds
- Symmetric rank-one update
- Rayleigh quotient
- Joint diagonalization
- Stiefel manifold